vsol vn solutions

Site Reliability Engineering Lead

  • Posted 6 hours ago

Job Description

VSOL is a digital enabler with a mission to help public and private organizations evolve their businesses through data and technology. We provide an end-to-end service from consulting to execution that drives the growth and innovation of our clients. As VSOL is in a phase of rapid expansion, we offer a dynamic, creative environment that accelerates your personal and professional development. We are looking for talented individuals eager to develop in international markets while contributing to the company's future in a constructive and supportive manner.

Responsibilities

  • Serve as the primary technical point of contact between the Managed Services team, clients, and the Development team, owning service reliability, client communication, and cross-team collaboration end-to-end.
  • Act as Incident Manager and Lead SRE: command and coordinate the response to major production incidents, lead technical triage, drive root cause analysis (RCA), and facilitate blameless post-incident reviews.
  • Define, implement, and continuously improve SLIs, SLOs, and SLAs in alignment with client contracts and operational capabilities; lead Production Readiness Reviews (PRR) prior to system go-live.
  • Design and build the observability stack — monitoring, alerting, logging, and distributed tracing (e.g., Prometheus, Grafana, ELK Stack, Datadog) — ensuring full visibility across hybrid cloud and on-premises environments.
  • Pioneer AI-driven automation within the Managed Services team: evaluate and implement AI agents, AIOps tooling, and LLM-based automation workflows for anomaly detection, predictive alerting, and self-healing operations.
  • Drive systematic toil reduction: identify repetitive operational tasks and automate them using scripting (Python, Bash), Infrastructure as Code (Terraform, Ansible), and intelligent automation frameworks.
  • Review and approve operational runbooks, escalation paths, and recovery procedures; provide architectural guidance on observability design and troubleshooting patterns for complex distributed systems.
  • Lead and mentor the Managed Services team (SREs, system admins, support engineers): define on-call rotation schedules, set team goals aligned to SLAs, and build a culture of ownership, reliability, and continuous improvement.
  • Collaborate with Development, DevOps, infrastructure, and security teams during incident response, capacity planning, and service onboarding; embed reliability practices (error budgets, CI/CD quality gates) into the development lifecycle.
  • Apply ITIL v4 practices across incident management, problem management, change management, and continual service improvement to ensure structured, consistent, and auditable service delivery.
  • Lead periodic service reviews with clients: present SLA performance reports, observability dashboards, and upcoming reliability improvement roadmaps in a clear, professional manner.
  • Participate in on-call rotations for critical managed services, ensuring 24/7 operational coverage and rapid response to high-severity incidents.
  • Evaluate and integrate new technologies, including AI/ML tooling and cloud-native solutions, to enhance service reliability and operational maturity.
  • Create and maintain technical documentation for system architecture, operational procedures, and client-facing service reports.

Note: This position may require international travel or onsite engagement in the UAE (United Arab Emirates) and KSA (Kingdom of Saudi Arabia) for continuous periods of 3 to 6 months. Candidates must accept this requirement as a condition of the position.

Requirements

  • 6+ years of experience in SRE, DevOps, or technical operations roles, with at least 2 years in a lead or senior individual contributor capacity managing 24/7 production environments.
  • Proven experience as an Incident Manager or incident command lead for major production incidents — including RCA facilitation, stakeholder communication, and post-incident review processes.
  • Strong hands-on experience with Linux/Unix system administration, networking fundamentals (TCP/IP, DNS, firewalls, routing, load balancing), and hybrid cloud/on-premises environments.
  • Observability and monitoring: deep experience building and operating stacks using Prometheus, Grafana, ELK Stack, Datadog, or equivalent tools.
  • Scripting and automation expertise in Python and Bash (essential); Go is a plus.
  • Infrastructure as Code: proficiency with Terraform and Ansible or equivalent tools.
  • Container orchestration: strong knowledge of Kubernetes (CKA certification preferred) and Docker.
  • Cloud platforms: GCP (preferred), AWS, or Azure; experience with hybrid on-premises and cloud environments.
  • Good knowledge of GitOps tools (e.g., Argo CD, FluxCD).
  • Basic understanding of AI agents, LLM-based automation (e.g., LangChain, AutoGen, or equivalent frameworks), and AIOps tooling for anomaly detection and intelligent alerting; hands-on experience is a strong differentiator.
  • Solid understanding of ITIL v4 practices: incident management, problem management, change management, and continual service improvement. ITIL v4 Foundation certification is required; Managing Professional (MP) level is a strong plus.
  • Client-facing communication: able to present service health reports, SLA performance, and technical updates clearly to stakeholders; English proficiency at CEFR B2 or above.
  • Ability to guide and mentor team members technically; comfortable making operational decisions and providing direction in high-pressure incident situations.
  • Knowledge of security frameworks and compliance standards relevant to managed services environments.

Preferred Qualifications

  • Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field (or equivalent experience).
  • ITIL v4 Foundation certification — required; Managing Professional (MP) or Strategic Leader (SL) track is a strong plus.
  • Certified Kubernetes Administrator (CKA) — preferred.
  • AWS / GCP / Azure professional certification — a plus.
  • Cisco Certified Network Associate (CCNA) — a plus.
  • Written and spoken English communication skills at CEFR B2 level or above.

Why you'll love working here:

  • An English-speaking start-up environment, with the opportunity to join an innovation team and work on global projects
  • On-site opportunities in the UAE (United Arab Emirates) and the KSA (Kingdom of Saudi Arabia)
  • 13th-month salary bonus
  • Premium health insurance for employees and family members (depending on level), annual health check, and government insurance during probation
  • 14+ days of annual leave and 5 days of outing leave
  • Lunch allowance and free parking
  • Taxi & phone allowance (depending on level)

Job ID: 146732013
