Senior Site Reliability Engineer

VNG

Ho Chi Minh, Vietnam

3-5 Years

Posted a day ago

Job Description

We are looking for a Senior Site Reliability Engineer (SRE) with deep expertise in deploying, operating, and optimizing database systems on Kubernetes (K8s). In this role, you will play a critical part in ensuring that the data infrastructure is reliable, performant, scalable, and proactively monitored through modern observability systems.

Key Responsibilities

Research, deploy, administer, and optimize database systems (PostgreSQL, Kafka, OpenSearch, Redis, etc.) on Kubernetes.
Operate, optimize, and scale Kubernetes clusters.
Set up and manage monitoring & alerting systems such as Prometheus, Alertmanager, Grafana, ELK, etc.
Define and fine-tune metrics, alert thresholds, SLOs/SLAs, and error budgets for database services and key infrastructure components.
Participate in incident response, conduct root cause analysis, and perform post-mortem reviews to improve system reliability.
Automate operational workflows (backup, failover, scaling, recovery, patching, CI/CD, etc.).
Develop and standardize runbooks, playbooks, and documentation to support efficient and consistent incident handling.
Collaborate with development teams to enhance database and big data platform capabilities.

Requirements

Minimum 3 years of experience in SRE, DevOps, Database Engineering, or System Engineering.
Strong hands-on experience in deploying, operating, and optimizing database systems (MySQL, PostgreSQL, MongoDB, Redis, Kafka, etc.) in on-premises or cloud environments.
Experience deploying and operating Kubernetes clusters on-premises or on cloud platforms (EKS, GKE, AKS).
Experience in defining metrics, alert thresholds, and building dashboards for database systems and infrastructure.
Ability to participate in on-call rotations, monitor alerts, and handle or escalate system incidents in a timely manner.
Proficient with monitoring and logging tools such as Prometheus, Alertmanager, Grafana, Loki, ELK Stack, etc.
Ability to write automation scripts using Python / Bash / Go.
Solid understanding of networking, storage, performance tuning, backup & recovery.
Strong systems-thinking mindset with a proactive approach to identifying and resolving issues.

Nice to Have

Experience operating distributed databases or high-availability clusters (Patroni, Galera, Sentinel, etc.).
Experience with big data systems (Kafka, ClickHouse, Elasticsearch, etc.).
Relevant certifications such as CKA/CKAD, AWS/GCP Cloud Certifications, or Database Administrator Certifications.