Trusting Social is an AI Fintech pioneer that's revolutionizing credit access in emerging markets. Our mission is Advancing AI to Meet the Financial Needs of Everyday Consumers with Empathy. We've assessed over 1 billion consumers across four countries, and we're on a mission to provide 100 million credit lines using the power of AI and Big Data.
How You'll Make An Impact
Keep Sophia Voice Bot — our customer-facing AI voice product — reliable, observable, and recoverable under real production load. You own the SLOs, the on-call response, and the release-safety guardrails for Sophia and its supporting AI pipeline (ASR, LLM, TTS, RAG, telephony). Reliability is the deliverable; AI is the workload you make reliable.
What Makes This Role Different
This is core SRE work, with two distinct accents: AI workloads as the system under reliability, and AI tooling applied to the SRE workflow itself.
- SRE discipline first. You set SLOs, run error-budget conversations, lead incident response, write postmortems, and harden systems against the next failure. The fundamentals are non-negotiable.
- AI workloads as the system. You debug failures specific to Sophia: ASR mishearing, TTS stalling, a RAG step dropping citations, a model regression after a prompt change, a vendor outage propagating into call latency.
- Reliability metrics that match caller experience. Time-to-first-audio, ASR accuracy proxies, retrieval quality signals, call completion rate — alongside p99 latency, error budget burn, and saturation.
- AI applied to SRE work itself. AI-assisted alert triage, log summarization, postmortem drafting, runbook generation. Saving SRE toil with AI is part of the role, not a stretch goal.
What You'll Do
- SLOs and SLIs for Sophia Voice Bot: availability, p95/p99 latency, time-to-first-audio, ASR/TTS quality proxies, retrieval quality signals, call completion rate, error budget tracking.
- Incident response: on-call rotation, paging hygiene, runbook authoring, postmortem facilitation, follow-up action tracking, reliability reviews.
- Observability for Sophia: metrics, logs, traces across ASR / LLM / TTS / RAG / telephony; prompt-response logging, token and cost dashboards, model-version-aware views, end-to-end call trace visibility.
- Release safety for Sophia: canary releases, progressive rollouts, regression detection between model and prompt versions, automated rollback, load testing voice endpoints before launch.
- Production hygiene: alerting tuned to user impact (not noise), capacity and quota monitoring for model providers, dependency health checks, SLO-aligned dashboards.
- AI-assisted SRE tooling: building or improving internal AI copilots for alert triage, log analysis, postmortem drafting, and runbook generation.
What We're Looking For
- 3-5 years operating production systems as an SRE or strong DevOps practitioner. You can debug a flaky service end-to-end on your own.
- Strong incident response habits: you have carried a pager, run incidents, written postmortems, and driven follow-up actions through to closure.
- Hands-on practice with SLOs and SLIs — you have defined them, monitored them, and had error-budget conversations with engineering teams.
- Observability fluency: Prometheus, Grafana, OpenTelemetry, Datadog, or equivalent. You write queries, build dashboards, tune alerts for signal over noise.
- Working Kubernetes experience — deploying workloads, troubleshooting, reading logs, understanding pod lifecycle and basic networking.
- CI/CD pipeline experience (GitHub Actions, GitLab CI, ArgoCD, Jenkins, or similar) — you have built or significantly improved a release pipeline before, with rollback and progressive delivery in mind.
- Infrastructure-as-code competence (Terraform preferred) — you can read, modify, and review IaC changes safely.
- Cloud experience (AWS, GCP, or Azure) — you understand the reliability and cost tradeoffs of common services.
- Comfort with at least one scripting / programming language used in operations (Python, Go, Bash). You write tooling, not just configure it.
- Demonstrated use of AI/LLM tools (ChatGPT, Claude, Cursor, Copilot, internal copilots) in your daily engineering workflow. You can describe specific cases where AI saved you time or improved your output.
Nice to Have
- Hands-on experience with AI/ML feature delivery: serving LLM endpoints, integrating with model providers (OpenAI, Anthropic, self-hosted), RAG pipelines, vector databases (Pinecone, Weaviate, pgvector, Qdrant).
- Familiarity with inference servers or LLM gateways (vLLM, TGI, LiteLLM, Portkey).
- Experience building internal tooling using LLMs — slackbots, AI copilots, automated triagers, log summarizers.
- Exposure to feature flag systems (LaunchDarkly, Eppo, Statsig) and progressive delivery patterns.
- DevSecOps awareness: container scanning, secret management, dependency hygiene, SBOMs.
- Vietnamese-English bilingual working ability.
What You'll Learn Here
We know mid-level engineers choose roles by what they will learn next. Here is what you will learn with us:
- How to run SRE for a real-time AI voice workload at production quality — failure modes, observability, and recovery patterns that go beyond standard web services.
- How to apply AI to the SRE workflow itself — the meta-skill that compounds across your career.
- SLO design and error-budget practice on a real, high-traffic AI voice product.
- Mentorship from senior and staff SREs who own the platform layer; clear growth path to senior IC.
What We Offer
Join our vibrant team and enjoy:
- Opportunity to work with and learn from one of the best and brightest technology teams in Vietnam
- Be part of a winning team that is growing exponentially across the region and recruiting world-class talent
- Competitive compensation package, including 13th-month salary and performance bonuses
- Comprehensive health care coverage for you and your dependents
- Generous leave policies, including annual leave and sick leave, plus flexible work hours
- Convenient central District 1 office location, next to a future metro station
- Onsite lunch with multiple options, including vegetarian
- Grab for work allowance and fully equipped workstations
- Fun and engaging team building activities, sponsored sports clubs, and happy hour every Thursday
- Unlimited free coffee, tea, snacks, and fruit to keep you energized
- An opportunity to make a social impact by helping to democratize credit access in emerging markets
At Trusting Social, we live by ownership, integrity, and agility in execution. We believe in doing what's right, what's best, and what's innovative. If you're smart, driven, and want to make a difference in the world with the most advanced and fascinating technology, come join our team. We offer the runway to truly make an impact.
Learn more about us:
https://trustingsocial.com
https://www.youtube.com/watch?v=inAEDGvOcL8&t=29s