Techcombank (TCB)

Senior Site Reliability Engineer

  • Posted 5 hours ago

Job Description

About the Role

We are seeking a highly skilled Site Reliability Engineer with experience applying Generative AI (GenAI) to automate and enhance the reliability of complex data platforms. You will be responsible for building self-healing infrastructure and AI-powered observability, and for automating incident response across data pipelines (e.g., Databricks, Glue, Kafka, Flink).

This is a high-impact role where you will shape the future of data reliability at Techcombank, mentor engineers, and lead initiatives that span multiple teams and domains.

Key Responsibilities

Platform Reliability & Automation

  • Design, implement, and operate reliable, scalable, and observable data platforms.
  • Automate incident triage, remediation, and postmortems using GenAI-powered tools.
  • Develop intelligent runbooks and self-healing workflows using LLMs.

GenAI-Enabled SRE Practices

  • Build and integrate GenAI copilots for on-call support, anomaly detection, and RCA (root cause analysis).
  • Fine-tune or prompt-engineer LLMs for specific use cases such as summarizing logs, interpreting metrics, or generating remediation steps.
  • Leverage vector databases (e.g., FAISS, Weaviate) to retrieve telemetry and incident history for GenAI prompts.

Observability & Anomaly Detection

  • Integrate GenAI with observability tools (e.g., Datadog, Prometheus, Grafana, OpenTelemetry).
  • Build systems for natural language querying of platform health and pipeline performance.
  • Collaborate with data engineers to monitor SLIs/SLOs across ingestion, transformation, and delivery layers.

CI/CD & Risk Management

  • Integrate GenAI into CI/CD pipelines to generate blast radius analyses and deployment guardrails.
  • Use LLMs to assess the risk of configuration or schema changes before production rollout.
  • Automate validation and rollback strategies based on historical outcomes.

Qualifications

Required:

  • 5+ years in SRE, DevOps, or Data Engineering roles with a strong focus on automation and observability.
  • Solid experience in cloud-native data platforms (e.g., Databricks, Glue, Kafka, Flink, S3, Lambda).
  • Proven experience using or integrating GenAI tools (OpenAI, Claude, HuggingFace Transformers).
  • Proficiency in Python or Scala; experience with Spark and Airflow is a plus.
  • Familiarity with LLM techniques: prompt engineering, embeddings, retrieval-augmented generation (RAG).
  • Hands-on experience with monitoring and alerting tools (e.g., Prometheus, Grafana, Datadog).
  • Experience with Infrastructure as Code (e.g., Terraform, CloudFormation).

Preferred:

  • Experience fine-tuning LLMs or integrating GenAI agents into production systems.
  • Familiarity with vector databases (e.g., Pinecone, Qdrant, FAISS).
  • Knowledge of data quality frameworks and lineage tools (e.g., Deequ, Great Expectations, Amundsen, Unity Catalog).
  • Understanding of ITIL/incident management frameworks.
  • Strong communication and documentation skills, especially in on-call and postmortem environments.

Job ID: 145708877