Senior Site Reliability Engineer

  • Posted a day ago
  • Be among the first 10 applicants

Job Description

We are looking for a Senior Site Reliability Engineer (SRE) with deep expertise in deploying, operating, and optimizing database systems on Kubernetes (K8s). In this role, you will play a critical part in ensuring that the data infrastructure is highly reliable, performant, and scalable, and that it is proactively monitored through modern observability systems.

Key Responsibilities

  • Research, deploy, administer, and optimize database systems (PostgreSQL, Kafka, OpenSearch, Redis, etc.) on Kubernetes.
  • Operate, optimize, and scale Kubernetes clusters.
  • Set up and manage monitoring & alerting systems such as Prometheus, Alertmanager, Grafana, ELK, etc.
  • Define and fine-tune metrics, alert thresholds, SLOs/SLAs, and error budgets for database services and key infrastructure components.
  • Participate in incident response, conduct root cause analysis, and perform post-mortem reviews to improve system reliability.
  • Automate operational workflows (backup, failover, scaling, recovery, patching, CI/CD, etc.).
  • Develop and standardize runbooks, playbooks, and documentation to support efficient and consistent incident handling.
  • Collaborate with development teams to enhance database and big data platform capabilities.

Requirements

  • Minimum 3 years of experience in SRE, DevOps, Database Engineering, or System Engineering.
  • Strong hands-on experience in deploying, operating, and optimizing database systems (MySQL, PostgreSQL, MongoDB, Redis, Kafka, etc.) in on-premises or cloud environments.
  • Experience deploying and operating Kubernetes clusters on-premises or on cloud platforms (EKS, GKE, AKS).
  • Experience in defining metrics, alert thresholds, and building dashboards for database systems and infrastructure.
  • Ability to participate in on-call rotations, monitor alerts, and handle or escalate system incidents in a timely manner.
  • Proficient with monitoring and logging tools such as Prometheus, Alertmanager, Grafana, Loki, ELK Stack, etc.
  • Ability to write automation scripts using Python / Bash / Go.
  • Solid understanding of networking, storage, performance tuning, backup & recovery.
  • Strong systems-thinking mindset with a proactive approach to identifying and resolving issues.

Nice to Have

  • Experience operating distributed databases or high-availability clusters (Patroni, Galera, Redis Sentinel, etc.).
  • Experience with big data systems (Kafka, ClickHouse, Elasticsearch, etc.).
  • Relevant certifications such as CKA/CKAD, AWS/GCP Cloud Certifications, or Database Administrator Certifications.

Job ID: 136694463