VSOL is a digital enabler with a mission to help public and private organizations evolve their businesses through data and technology. We provide an end-to-end service from consulting to execution that drives the growth and innovation of our clients. As VSOL is in a phase of rapid expansion, we offer a dynamic, creative environment that accelerates your personal and professional development. We are looking for talented individuals eager to develop in international markets while contributing to the company's future in a constructive and supportive manner.
Responsibilities
- Serve as the primary technical point of contact between the Managed Services team, clients, and the Development team, owning service reliability, client communication, and cross-team collaboration end to end.
- Act as Incident Manager and Lead SRE: command and coordinate the response to major production incidents, lead technical triage, drive root cause analysis (RCA), and facilitate blameless post-incident reviews.
- Define, implement, and continuously improve SLIs, SLOs, and SLAs in alignment with client contracts and operational capabilities; lead Production Readiness Reviews (PRR) prior to system go-live.
- Design and build the observability stack — monitoring, alerting, logging, and distributed tracing (e.g., Prometheus, Grafana, ELK Stack, Datadog) — ensuring full visibility across hybrid cloud and on-premises environments.
- Pioneer AI-driven automation within the Managed Services team: evaluate and implement AI agents, AIOps tooling, and LLM-based automation workflows for anomaly detection, predictive alerting, and self-healing operations.
- Drive systematic toil reduction: identify repetitive operational tasks and automate them using scripting (Python, Bash), Infrastructure as Code (Terraform, Ansible), and intelligent automation frameworks.
- Review and approve operational runbooks, escalation paths, and recovery procedures; provide architectural guidance on observability design and troubleshooting patterns for complex distributed systems.
- Lead and mentor the Managed Services team (SREs, system admins, support engineers): define on-call rotation schedules, set team goals aligned to SLAs, and build a culture of ownership, reliability, and continuous improvement.
- Collaborate with Development, DevOps, infrastructure, and security teams during incident response, capacity planning, and service onboarding; embed reliability practices (error budgets, CI/CD quality gates) into the development lifecycle.
- Apply ITIL v4 practices across incident management, problem management, change management, and continual service improvement to ensure structured, consistent, and auditable service delivery.
- Lead periodic service reviews with clients: present SLA performance reports, observability dashboards, and upcoming reliability improvement roadmaps in a clear, professional manner.
- Participate in on-call rotations for critical managed services, ensuring 24/7 operational coverage and rapid response to high-severity incidents.
- Evaluate and integrate new technologies, including AI/ML tooling and cloud-native solutions, to enhance service reliability and operational maturity.
- Create and maintain technical documentation for system architecture, operational procedures, and client-facing service reports.
Note: This position may require international travel or onsite engagement in the UAE (United Arab Emirates) and KSA (Kingdom of Saudi Arabia) for continuous periods of 3 to 6 months. Candidates must accept this requirement as a condition of the position.
Requirements
- 6+ years of experience in SRE, DevOps, or technical operations roles, with at least 2 years in a lead or senior individual contributor capacity managing 24/7 production environments.
- Proven experience as an Incident Manager or incident command lead for major production incidents — including RCA facilitation, stakeholder communication, and post-incident review processes.
- Strong hands-on experience with Linux/Unix system administration, networking fundamentals (TCP/IP, DNS, firewalls, routing, load balancing), and hybrid cloud/on-premises environments.
- Observability and monitoring: deep experience building and operating stacks using Prometheus, Grafana, ELK Stack, Datadog, or equivalent tools.
- Scripting and automation expertise in Python and Bash (essential); Go is a plus.
- Infrastructure as Code: proficiency with Terraform and Ansible or equivalent tools.
- Container orchestration: strong knowledge of Kubernetes (CKA certification preferred) and Docker.
- Cloud platforms: GCP (preferred), AWS, or Azure; experience with hybrid on-premises and cloud environments.
- Good knowledge of GitOps tools (e.g., Argo CD, FluxCD).
- Basic understanding of AI agents, LLM-based automation (e.g., LangChain, AutoGen, or equivalent frameworks), and AIOps tooling for anomaly detection and intelligent alerting; hands-on experience is a strong differentiator.
- Solid understanding of ITIL v4 practices: incident management, problem management, change management, and continual service improvement; ITIL v4 Foundation certification is required; Managing Professional (MP) level is a strong plus.
- Client-facing communication: able to present service health reports, SLA performance, and technical updates clearly to stakeholders; English proficiency at CEFR B2 or above.
- Ability to guide and mentor team members technically; comfortable making operational decisions and providing direction in high-pressure incident situations.
- Knowledge of security frameworks and compliance standards relevant to managed services environments.
Preferred Qualifications
- Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field (or equivalent experience).
- ITIL v4 Managing Professional (MP) or Strategic Leader (SL) track is a strong plus (ITIL v4 Foundation certification is required, as listed under Requirements).
- Certified Kubernetes Administrator (CKA) — preferred.
- AWS / GCP / Azure professional certification — a plus.
- Cisco Certified Network Associate (CCNA) — a plus.
- Written and spoken English communication skills at CEFR B2 level or above.
Why you'll love working here:
- An English-speaking start-up environment, with the opportunity to be part of an innovation team and work on global projects
- On-site opportunities in the UAE (United Arab Emirates) and the KSA (Kingdom of Saudi Arabia)
- 13th-month salary bonus
- Premium health insurance for employees and family members (depending on level), an annual health check, and government insurance during probation
- 14+ days of annual leave and 5 days of outing leave
- Lunch allowance and free parking
- Taxi & phone allowance (depending on level)