Product Reliability Engineer

Trusting Social

Ho Chi Minh, Vietnam

3-5 Years

Posted 15 hours ago

Job Description

Trusting Social is an AI Fintech pioneer that's revolutionizing credit access in emerging markets. Our mission is Advancing AI to Meet the Financial Needs of Everyday Consumers with Empathy. We've assessed over 1 billion consumers across four countries, and we're on a mission to provide 100 million credit lines using the power of AI and Big Data.

How You'll Make An Impact

Keep Sophia Voice Bot — our customer-facing AI voice product — reliable, observable, and recoverable under real production load. You own the SLOs, the on-call response, and the release-safety guardrails for Sophia and its supporting AI pipeline (ASR, LLM, TTS, RAG, telephony). Reliability is the deliverable; AI is the workload you make reliable.

What Makes This Role Different

This is core SRE work, with two distinct accents: AI workloads as the system under reliability, and AI tooling applied to the SRE workflow itself.

SRE discipline first. You set SLOs, run error-budget conversations, lead incident response, write postmortems, and harden systems against the next failure. The fundamentals are non-negotiable.
AI workloads as the system. You debug failures specific to Sophia: ASR mishearing, TTS stalling, a RAG step dropping citations, a model regression after a prompt change, a vendor outage propagating into call latency.
Reliability metrics that match caller experience. Time-to-first-audio, ASR accuracy proxies, retrieval quality signals, call completion rate — alongside p99 latency, error budget burn, and saturation.
AI applied to SRE work itself. AI-assisted alert triage, log summarization, postmortem drafting, runbook generation. Saving SRE toil with AI is part of the role, not a stretch goal.
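
For readers less familiar with the term, "error budget burn" can be illustrated with a minimal sketch. It assumes a simple request-based SLO; the `SloWindow` type, its field names, and the example numbers are hypothetical and not taken from Trusting Social's systems.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Call counts observed over one evaluation window (hypothetical shape)."""
    total_calls: int
    failed_calls: int


def burn_rate(window: SloWindow, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    1.0 means the budget is spent exactly at the sustainable rate;
    values above 1.0 mean it runs out before the SLO period ends.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if window.total_calls == 0:
        return 0.0  # no traffic, so no budget consumed
    error_rate = window.failed_calls / window.total_calls
    allowed_error_rate = 1.0 - slo_target  # the error budget as a fraction
    return error_rate / allowed_error_rate


# Example: 50 failed calls out of 10,000 against an assumed 99.9% target
# burns the budget roughly five times faster than is sustainable.
rate = burn_rate(SloWindow(total_calls=10_000, failed_calls=50), slo_target=0.999)
```

A burn rate like this is typically what an error-budget conversation starts from: a sustained value well above 1.0 is a signal to slow releases and prioritize reliability work.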

What You'll Do

SLOs and SLIs for Sophia Voice Bot: availability, p95/p99 latency, time-to-first-audio, ASR/TTS quality proxies, retrieval quality signals, call completion rate, error budget tracking.
Incident response: on-call rotation, paging hygiene, runbook authoring, postmortem facilitation, follow-up action tracking, reliability reviews.
Observability for Sophia: metrics, logs, traces across ASR / LLM / TTS / RAG / telephony; prompt-response logging, token and cost dashboards, model-version-aware views, end-to-end call trace visibility.
Release safety for Sophia: canary releases, progressive rollouts, regression detection between model and prompt versions, automated rollback, load testing voice endpoints before launch.
Production hygiene: alerting tuned to user impact (not noise), capacity and quota monitoring for model providers, dependency health checks, SLO-aligned dashboards.
AI-assisted SRE tooling: building or improving internal AI copilots for alert triage, log analysis, postmortem drafting, and runbook generation.
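
As a rough illustration of the release-safety work listed above, the sketch below compares a tail-latency percentile (for example p99 of time-to-first-audio) between a baseline and a canary. The function names, the nearest-rank percentile method, and the 10% tolerance are assumptions chosen for the example, not a description of Sophia's actual pipeline.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of data at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def canary_regressed(
    baseline_ms: Sequence[float],
    canary_ms: Sequence[float],
    pct: float = 99.0,
    tolerance: float = 0.10,
) -> bool:
    """Return True if the canary's percentile latency exceeds the baseline's by more than `tolerance`."""
    base = percentile(baseline_ms, pct)
    cand = percentile(canary_ms, pct)
    return cand > base * (1.0 + tolerance)


# Example with synthetic latencies in milliseconds.
baseline = [float(i) for i in range(1, 101)]
slow_canary = [x * 1.2 for x in baseline]
should_roll_back = canary_regressed(baseline, slow_canary)
```

In practice a gate like this would likely sit alongside error-rate and quality-proxy checks, and a production version would also need to account for sample size before triggering an automated rollback.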

What We're Looking For

3-5 years operating production systems as an SRE or strong DevOps practitioner. You can debug a flaky service end-to-end on your own.
Strong incident response habits: you have carried a pager, run incidents, written postmortems, and driven follow-up actions through to closure.
Hands-on practice with SLOs and SLIs — you have defined them, monitored them, and had error-budget conversations with engineering teams.
Observability fluency: Prometheus, Grafana, OpenTelemetry, Datadog, or equivalent. You write queries, build dashboards, tune alerts for signal over noise.
Working Kubernetes experience — deploying workloads, troubleshooting, reading logs, understanding pod lifecycle and basic networking.
CI/CD pipeline experience (GitHub Actions, GitLab CI, ArgoCD, Jenkins, or similar) — you have built or significantly improved a release pipeline before, with rollback and progressive delivery in mind.
Infrastructure-as-code competence (Terraform preferred) — you can read, modify, and review IaC changes safely.
Cloud experience (AWS, GCP, or Azure) — you understand the reliability and cost tradeoffs of common services.
Comfort with at least one scripting / programming language used in operations (Python, Go, Bash). You write tooling, not just configure it.
Demonstrated use of AI/LLM tools (ChatGPT, Claude, Cursor, Copilot, internal copilots) in your daily engineering workflow. You can describe specific cases where AI saved you time or improved your output.

Nice to Have

Hands-on experience with AI/ML feature delivery: serving LLM endpoints, integrating with model providers (OpenAI, Anthropic, self-hosted), RAG pipelines, vector databases (Pinecone, Weaviate, pgvector, Qdrant).
Familiarity with inference servers or LLM gateways (vLLM, TGI, LiteLLM, Portkey).
Experience building internal tooling using LLMs — slackbots, AI copilots, automated triagers, log summarizers.
Exposure to feature flag systems (LaunchDarkly, Eppo, Statsig) and progressive delivery patterns.
DevSecOps awareness: container scanning, secret management, dependency hygiene, SBOMs.
Vietnamese-English bilingual working ability.

What You'll Learn Here

This matters because mid-level engineers choose roles by what they will learn next.

How to run SRE for a real-time AI voice workload at production quality — failure modes, observability, and recovery patterns that go beyond standard web services.
How to apply AI to the SRE workflow itself — the meta-skill that compounds across your career.
SLO design and error-budget practice on a real, high-traffic AI voice product.
Mentorship from senior/staff SREs who own the platform layer; clear growth path to senior IC.

What We Offer

Join our vibrant team and enjoy:

Opportunity to work with and learn from one of the best and brightest technology teams in Vietnam
Be part of a winning team that is growing exponentially across the region and recruiting world-class talent
Competitive compensation package, including 13th-month salary and performance bonuses
Comprehensive health care coverage for you and your dependents
Generous leave policies (annual leave and sick leave) and flexible work hours
Convenient central District 1 office location, next to a future metro station
Onsite lunch with multiple options, including vegetarian
Grab for work allowance and fully equipped workstations
Fun and engaging team building activities, sponsored sports clubs, and happy hour every Thursday
Unlimited free coffee, tea, snacks, and fruit to keep you energized
An opportunity to make a social impact by helping to democratize credit access in emerging markets

At Trusting Social, we live by ownership, integrity, and agility in execution. We believe in doing what's right, what's best, and what's innovative. If you're smart, driven, and want to make a difference in the world with the most advanced and fascinating technology, come join our team. We offer the runway to truly make an impact.

Learn more about us:

https://trustingsocial.com

https://www.youtube.com/watch?v=inAEDGvOcL8&t=29s

Job Source: www.linkedin.com

Job ID: 148241635

Last Updated: 23-05-2026 07:18:21 PM

Skills:

Prometheus, Cursor, Grafana, Datadog, Terraform, Python, AWS, Bash, Jenkins, GCP, Azure, Kubernetes, SLIs, SRE, Go, Claude, Infrastructure-as-code, AI/LLM tools, AI workloads, GitHub Actions, ChatGPT, CI/CD pipeline, OpenTelemetry, Observability, Copilot, GitLab CI, SLOs, ArgoCD, Cloud experience
