Site Reliability Engineering Lead

vsol vn solutions

Ho Chi Minh, Vietnam

6-8 Years

Job Description

VSOL is a digital enabler with a mission to help public and private organizations evolve their businesses through data and technology. We provide an end-to-end service from consulting to execution that drives the growth and innovation of our clients. As VSOL is in a phase of rapid expansion, we offer a dynamic, creative environment that accelerates your personal and professional development. We are looking for talented individuals eager to develop in international markets while contributing to the company's future in a constructive and supportive manner.

Responsibilities

Serve as the primary technical point of contact between the Managed Services team, clients, and the Development team, owning service reliability, client communication, and cross-team collaboration end to end.
Act as Incident Manager and Lead SRE: command and coordinate the response to major production incidents, lead technical triage, drive root cause analysis (RCA), and facilitate blameless post-incident reviews.
Define, implement, and continuously improve SLIs, SLOs, and SLAs in alignment with client contracts and operational capabilities; lead Production Readiness Reviews (PRR) prior to system go-live.
Design and build the observability stack — monitoring, alerting, logging, and distributed tracing (e.g., Prometheus, Grafana, ELK Stack, Datadog) — ensuring full visibility across hybrid cloud and on-premises environments.
Pioneer AI-driven automation within the Managed Services team: evaluate and implement AI agents, AIOps tooling, and LLM-based automation workflows for anomaly detection, predictive alerting, and self-healing operations.
Drive systematic toil reduction: identify repetitive operational tasks and automate them using scripting (Python, Bash), Infrastructure as Code (Terraform, Ansible), and intelligent automation frameworks.
Review and approve operational runbooks, escalation paths, and recovery procedures; provide architectural guidance on observability design and troubleshooting patterns for complex distributed systems.
Lead and mentor the Managed Services team (SREs, system admins, support engineers): define on-call rotation schedules, set team goals aligned to SLAs, and build a culture of ownership, reliability, and continuous improvement.
Collaborate with Development, DevOps, infrastructure, and security teams during incident response, capacity planning, and service onboarding; embed reliability practices (error budgets, CI/CD quality gates) into the development lifecycle.
Apply ITIL v4 practices across incident management, problem management, change management, and continual service improvement to ensure structured, consistent, and auditable service delivery.
Lead periodic service reviews with clients: present SLA performance reports, observability dashboards, and upcoming reliability improvement roadmaps in a clear, professional manner.
Participate in on-call rotations for critical managed services, ensuring 24/7 operational coverage and rapid response to high-severity incidents.
Evaluate and integrate new technologies, including AI/ML tooling and cloud-native solutions, to enhance service reliability and operational maturity.
Create and maintain technical documentation for system architecture, operational procedures, and client-facing service reports.

Note: This position may require international travel or onsite engagement in the UAE (United Arab Emirates) and KSA (Kingdom of Saudi Arabia) for continuous periods of 3 to 6 months. Candidates must accept this requirement as a condition of the position.

Requirements

6+ years of experience in SRE, DevOps, or technical operations roles, with at least 2 years in a lead or senior individual contributor capacity managing 24/7 production environments.
Proven experience as an Incident Manager or incident command lead for major production incidents — including RCA facilitation, stakeholder communication, and post-incident review processes.
Strong hands-on experience with Linux/Unix system administration, networking fundamentals (TCP/IP, DNS, firewalls, routing, load balancing), and hybrid cloud/on-premises environments.
Observability and monitoring: deep experience building and operating stacks using Prometheus, Grafana, ELK Stack, Datadog, or equivalent tools.
Scripting and automation expertise in Python and Bash (essential); Go is a plus.
Infrastructure as Code: proficiency with Terraform and Ansible or equivalent tools.
Container orchestration: strong knowledge of Kubernetes (CKA certification preferred) and Docker.
Cloud platforms: GCP (preferred), AWS, or Azure; experience with hybrid on-premises and cloud environments.
Good knowledge of GitOps tools (e.g., Argo CD, FluxCD).
Basic understanding of AI agents, LLM-based automation (e.g., LangChain, AutoGen, or equivalent frameworks), and AIOps tooling for anomaly detection and intelligent alerting; hands-on experience is a strong differentiator.
Solid understanding of ITIL v4 practices: incident management, problem management, change management, and continual service improvement; ITIL v4 Foundation certification is required; Managing Professional (MP) level is a strong plus.
Client-facing communication: able to present service health reports, SLA performance, and technical updates clearly to stakeholders; English proficiency at CEFR B2 or above.
Ability to guide and mentor team members technically; comfortable making operational decisions and providing direction in high-pressure incident situations.
Knowledge of security frameworks and compliance standards relevant to managed services environments.

Preferred Qualifications

Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field (or equivalent experience).
ITIL v4 Foundation certification — required; Managing Professional (MP) or Strategic Leader (SL) track is a strong plus.
Certified Kubernetes Administrator (CKA) — preferred.
AWS / GCP / Azure professional certification — a plus.
Cisco Certified Network Associate (CCNA) — a plus.
Written and spoken English communication skills at CEFR B2 level or above.

Why you'll love working here:

An English-speaking start-up environment, with the opportunity to join an innovation team and work on global projects
On-site opportunities in the UAE (United Arab Emirates) and the KSA (Kingdom of Saudi Arabia)
13th-month salary bonus
Premium health insurance for employees and family members (depending on level), annual health check, and government insurance during probation
14+ days of annual leave and 5 days of outing leave
Lunch allowance and free parking
Taxi & phone allowance (depending on level)