About the Role
We are seeking a highly skilled Site Reliability Engineer with experience applying Generative AI (GenAI) to automate and enhance the reliability of complex data platforms. You will be responsible for building self-healing infrastructure and AI-powered observability, and for automating incident response across data pipelines (e.g., Databricks, Glue, Kafka, Flink).
This is a high-impact role where you will shape the future of data reliability at Techcombank, mentor engineers, and lead initiatives that span multiple teams and domains.
Key Responsibilities
Platform Reliability & Automation
- Design, implement, and operate reliable, scalable, and observable data platforms.
- Automate incident triage, remediation, and postmortems using GenAI-powered tools.
- Develop intelligent runbooks and self-healing workflows using LLMs.
GenAI-Enabled SRE Practices
- Build and integrate GenAI copilots for on-call support, anomaly detection, and root cause analysis (RCA).
- Fine-tune or prompt-engineer LLMs for specific use cases such as summarizing logs, interpreting metrics, or generating remediation steps.
- Leverage vector databases (e.g., FAISS, Weaviate) to retrieve telemetry and incident history for GenAI prompts.
Observability & Anomaly Detection
- Integrate GenAI with observability tools (e.g., Datadog, Prometheus, Grafana, OpenTelemetry).
- Build systems for natural language querying of platform health and pipeline performance.
- Collaborate with data engineers to monitor SLIs/SLOs across ingestion, transformation, and delivery layers.
CI/CD & Risk Management
- Integrate GenAI into CI/CD pipelines to generate blast radius analyses and deployment guardrails.
- Use LLMs to assess the risk of configuration or schema changes before production rollout.
- Automate validation and rollback strategies based on historical outcomes.
Qualifications
Required:
- 5+ years in SRE, DevOps, or Data Engineering roles with a strong focus on automation and observability.
- Solid experience in cloud-native data platforms (e.g., Databricks, Glue, Kafka, Flink, S3, Lambda).
- Proven experience using or integrating GenAI tools (e.g., OpenAI, Claude, Hugging Face Transformers).
- Proficiency in Python or Scala; experience with Spark and Airflow is a plus.
- Familiarity with LLM techniques: prompt engineering, embeddings, retrieval-augmented generation (RAG).
- Hands-on experience with monitoring and alerting tools (e.g., Prometheus, Grafana, Datadog).
- Experience with Infrastructure as Code (e.g., Terraform, CloudFormation).
Preferred:
- Experience fine-tuning LLMs or integrating GenAI agents into production systems.
- Familiarity with vector databases (e.g., Pinecone, Qdrant, FAISS).
- Knowledge of data quality frameworks and lineage tools (e.g., Deequ, Great Expectations, Amundsen, Unity Catalog).
- Understanding of ITIL/incident management frameworks.
- Strong communication and documentation skills, especially in on-call and postmortem environments.