Officer, Data Site Reliability Engineer

Techcombank (TCB)

Hanoi, Vietnam

3-5 Years

Posted 8 hours ago
Job Description

About the Role

We are seeking a highly skilled Site Reliability Engineer with experience applying Generative AI (GenAI) to automate and enhance the reliability of complex data platforms. You will be responsible for building self-healing infrastructure, AI-powered observability, and automating incident response across data pipelines (e.g., Databricks, Glue, Kafka, Flink).

This is a high-impact role where you will shape the future of data reliability at Techcombank, mentor engineers, and lead initiatives that span multiple teams and domains.

Key Responsibilities

Platform Reliability & Automation

Design, implement, and operate reliable, scalable, and observable data platforms.

Automate incident triage, remediation, and postmortems using GenAI-powered tools.

Develop intelligent runbooks and self-healing workflows using LLMs.

GenAI-Enabled SRE Practices

Build and integrate GenAI copilots for on-call support, anomaly detection, and RCA (root cause analysis).

Fine-tune or prompt-engineer LLMs for specific use cases such as summarizing logs, interpreting metrics, or generating remediation steps.

Leverage vector databases (e.g., FAISS, Weaviate) to retrieve telemetry and incident history for GenAI prompts.

Observability & Anomaly Detection

Integrate GenAI with observability tools (e.g., Datadog, Prometheus, Grafana, OpenTelemetry).

Build systems for natural language querying of platform health and pipeline performance.

Collaborate with data engineers to monitor SLIs/SLOs across ingestion, transformation, and delivery layers.

CI/CD & Risk Management

Integrate GenAI into CI/CD pipelines to generate blast radius analyses and deployment guardrails.

Use LLMs to assess the risk of configuration or schema changes before production rollout.

Automate validation and rollback strategies based on historical outcomes.

Qualifications

3+ years in SRE, DevOps, or Data Engineering roles with a strong focus on automation and observability.

Solid experience in cloud-native data platforms (e.g., Databricks, Glue, Kafka, Flink, S3, Lambda).

Proven experience using or integrating GenAI tools (OpenAI, Claude, HuggingFace Transformers).

Proficiency in Python or Scala; experience with Spark and Airflow is a plus.

Familiarity with LLM techniques: prompt engineering, embeddings, retrieval-augmented generation (RAG).

Hands-on experience with monitoring and alerting tools (e.g., Prometheus, Grafana, Datadog).

Experience with Infrastructure as Code (e.g., Terraform, CloudFormation).

Good English communication skills.

Preferred:

Experience fine-tuning LLMs or integrating GenAI agents into production systems.

Familiarity with vector databases (e.g., Pinecone, Qdrant, FAISS).

Knowledge of data quality frameworks and lineage tools (e.g., Deequ, Great Expectations, Amundsen, Unity Catalog).

Understanding of ITIL/incident management frameworks.

Strong communication and documentation skills, especially in on-call and postmortem environments.

More Info

Job Type: Permanent Job

Industry: Other

Function: Site Reliability Engineering

Employment Type: Full time

About Company

Techcombank (TCB)

Job Source: www.linkedin.com

Job ID: 145212091

Last Updated: 31-03-2026 04:19:44 PM
