Director, Site Reliability Engineering

Techcombank (TCB)

Hanoi, Vietnam

10-12 Years

Job Description

Job Purpose

We're looking for someone who loves making things run smoothly — and keeping them that way. As our Director of Site Reliability Engineering, you'll lead the teams that keep our systems reliable, fast, and always available. You'll oversee our 24/7 operations and our observability platform team, making sure we're not just fixing problems but preventing them.

This isn't just about uptime. It's about building a culture of reliability, accountability, and smart engineering. You'll set the strategy, guide the teams, and make sure we're hitting our SLOs while constantly improving how we work.

If you're passionate about reliability, believe in observability, and want to help shape the future of SRE in Vietnam, this is your chance.

What You'll Do

Lead and grow our SRE organization — including a 24/7 operations team (split into squads) and an observability platform team.
Define and deliver a roadmap for reliability across all our systems.
Own our SLOs, SLIs, and SLAs — track them, report them, and make sure we meet them.
Drive incident management and postmortems that actually lead to change.
Oversee our observability stack (Dynatrace, SolarWinds, Splunk, Prometheus/Grafana, OpenSearch) and make sure it serves everyone: engineers, operations, and QA/QE teams.
Work closely with engineering and product teams to bake reliability into everything we do.

How We'll Measure Success

Meeting (and beating) our SLOs and SLAs.
Reducing MTTD and MTTR.
Improving uptime and reliability across the board.
Building a strong, engaged SRE team.
Making a mark in the SRE community.

What We're Looking For

10+ years in SRE or related fields, with at least 6 years leading teams.
Deep experience with observability tools (Dynatrace, SolarWinds, Splunk, Prometheus/Grafana, OpenSearch).
Strong knowledge of AWS and on-prem infrastructure.
Great leadership skills — you know how to build teams and help people grow.
Comfortable with incident management.
Analytical, data-driven, and always looking for ways to improve.

Nice to Have