We are looking for a Senior Site Reliability Engineer (SRE) with deep experience in deploying, operating, and optimizing enterprise-grade database clusters and Kubernetes (K8s).
You will play a key role in ensuring the data infrastructure is stable, high-performing, scalable, and proactively monitored through modern observability systems.
Key Responsibilities
Research, deploy, manage, and optimize database systems (SQL Server, Oracle, MongoDB, MySQL, PostgreSQL, Redis, etc.).
Operate, optimize, and scale Kubernetes clusters.
Set up and manage monitoring and alerting systems such as Prometheus, Alertmanager, Grafana, and the ELK Stack.
Define and fine-tune metrics, alert thresholds, SLOs/SLAs, and error budgets for database services and critical infrastructure.
Participate in incident response, conduct root cause analysis, and perform post-mortems to improve system reliability.
Build and standardize runbooks, playbooks, and documentation to enable fast, effective response during emergencies.
Collaborate with development teams to improve database and big data products.
Requirements
At least 3 years of experience as an SRE, DBA, or System Engineer.
Proficient in deploying, operating, and optimizing database systems (SQL Server, Oracle, MongoDB, MySQL, PostgreSQL, Redis, etc.) in on-premises or cloud environments.
Experience deploying and operating Kubernetes in on-premises or cloud environments (EKS, GKE, AKS).
Experience in setting up metrics, alert thresholds, and dashboards for database systems and infrastructure.
Able to participate in an on-call rotation, monitor alerts, and handle or escalate system incidents promptly.
Proficient with monitoring and logging tools such as Prometheus, Alertmanager, Grafana, Loki, and the ELK Stack.
Able to write automation scripts in Python, Bash, or Go.
Good knowledge of networking, storage, performance tuning, and backup and recovery.
Strong systems thinking and a proactive approach to identifying and resolving issues.
Preferred Qualifications
Experience operating enterprise-grade database clusters such as SQL Server, Oracle, or MongoDB Enterprise is a plus.
Experience operating distributed databases or high-availability clusters (e.g., Patroni, Galera Cluster, Redis Sentinel) is a plus.
Experience with big data systems (Kafka, ClickHouse, Elasticsearch, etc.) is a plus.
Relevant certifications such as MCSA/MCSE, Azure Database Administrator, Oracle Database (OCA/OCP), MongoDB Certified DBA/Developer, CKA/CKAD, AWS or GCP certifications, or other database administration certifications are a plus.