We are looking for a Senior Site Reliability Engineer (SRE) with deep expertise in deploying, operating, and optimizing database systems on Kubernetes (K8s). In this role, you will play a critical part in keeping the data infrastructure reliable, performant, and scalable, and in monitoring it proactively with modern observability systems.
Key Responsibilities
- Research, deploy, administer, and optimize database systems (PostgreSQL, Kafka, OpenSearch, Redis, etc.) on Kubernetes.
- Operate, optimize, and scale Kubernetes clusters.
- Set up and manage monitoring & alerting systems such as Prometheus, Alertmanager, Grafana, ELK, etc.
- Define and fine-tune metrics, alert thresholds, SLOs/SLAs, and error budgets for database services and key infrastructure components.
- Participate in incident response, conduct root cause analysis, and perform post-mortem reviews to improve system reliability.
- Automate operational workflows (backup, failover, scaling, recovery, patching, CI/CD, etc.).
- Develop and standardize runbooks, playbooks, and documentation to support efficient and consistent incident handling.
- Collaborate with development teams to enhance database and big data platform capabilities.
Requirements
- Minimum 3 years of experience in SRE, DevOps, Database Engineering, or Systems Engineering.
- Strong hands-on experience deploying, operating, and optimizing database systems (MySQL, PostgreSQL, MongoDB, Redis, Kafka, etc.) in on-premises or cloud environments.
- Experience deploying and operating Kubernetes clusters on-premises or on cloud platforms (EKS, GKE, AKS).
- Experience defining metrics, setting alert thresholds, and building dashboards for database systems and infrastructure.
- Ability to participate in on-call rotations, monitor alerts, and handle or escalate system incidents in a timely manner.
- Proficiency with monitoring and logging tools such as Prometheus, Alertmanager, Grafana, Loki, ELK Stack, etc.
- Ability to write automation scripts using Python / Bash / Go.
- Solid understanding of networking, storage, performance tuning, and backup & recovery.
- Strong systems-thinking mindset with a proactive approach to identifying and resolving issues.
Nice to Have
- Experience operating distributed databases or high-availability clusters (Patroni, Galera, Redis Sentinel, etc.).
- Experience with big data systems (Kafka, ClickHouse, Elasticsearch, etc.).
- Relevant certifications such as CKA/CKAD, AWS/GCP cloud certifications, or database administrator certifications.