Senior AI Research Engineer, Speech AI (Vietnamese ASR & TTS)

VinSmart Future

Ho Chi Minh, Vietnam

Fresher

Job Description

Company Description

VinSmart Future (VSF) is the leading technology company within the Vingroup Corporation, formed by the merger of the group's entire technology ecosystem, including VinApp, VinIT, VinBigdata, and other tech units. As a core driver of Vingroup's future growth, VSF is at the forefront of technological development, with artificial intelligence (AI) as its foundation. With a talented team of nearly 4,000 local and international technology experts, VSF focuses on creating high-utility technologies that enhance lives and connect data, models, and infrastructure to unlock new possibilities.

Qualifications

About the Role

We are seeking a Senior AI Research Engineer to lead the development of state-of-the-art Vietnamese Speech AI technologies, including Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and Speech-to-Speech Conversational AI.

The ideal candidate has strong expertise in foundation model adaptation, pretraining, supervised fine-tuning (SFT), reinforcement learning, and knowledge distillation. You will be responsible for building SOTA Vietnamese speech models with high accuracy, high naturalness, and low latency.

Responsibilities

Speech Foundation Models

Research, develop, and optimize state-of-the-art Vietnamese ASR and TTS models.
Adapt and improve large speech foundation models for Vietnamese language and accents.
Work with open-source and commercial speech models, including:
Qwen3-ASR
Qwen3-TTS
Whisper
CosyVoice
Orpheus
Sesame
Fish Speech
XTTS
Other emerging speech foundation models

Model Training & Fine-Tuning

Design and implement scalable pipelines for:
Self-supervised pretraining
Continued pretraining
Supervised Fine-Tuning (SFT)
Instruction tuning
Domain adaptation
Build and curate large-scale Vietnamese speech datasets.
Develop data cleaning, alignment, and augmentation pipelines for speech training.

Reinforcement Learning & Alignment

Research and implement advanced optimization techniques:
Reinforcement Learning from Human Feedback (RLHF)
Direct Preference Optimization (DPO)
GRPO / PPO-based optimization
Preference learning for speech quality improvement
Improve:
ASR accuracy
TTS naturalness
Speaker similarity
Pronunciation quality
Dialogue experience

Knowledge Distillation & Model Compression

Distill large speech foundation models into efficient Vietnamese ASR/TTS models.
Develop:
Teacher-student training frameworks
Representation distillation
Logit distillation
Feature matching approaches
Optimize models using:
Quantization
Pruning
Distillation
Low-rank adaptation techniques

Conversational Speech AI

Build speech pipelines for:
Voice assistants
Conversational AI
In-car voice systems
Real-time voice interaction
Improve:
End-to-end latency
Turn-taking
Barge-in handling
Streaming speech generation

Evaluation & Research

Design evaluation frameworks for:
WER (Word Error Rate)
CER (Character Error Rate)
MOS (Mean Opinion Score)
Speaker similarity
Latency
Robustness
Conduct research experiments and benchmark against state-of-the-art systems.
Stay up to date with the latest speech AI research and contribute novel ideas to the team.

Requirements

Education

Bachelor's, Master's, or PhD in:
Computer Science
Artificial Intelligence
Machine Learning
Speech Processing
Computational Linguistics
Related fields

Technical Skills

Strong understanding of:
Deep Learning
Speech Processing
NLP
Generative AI
Transformer architectures
Experience training and fine-tuning large speech models.
Experience with:
Self-supervised learning
Foundation models
Multimodal learning
Sequence-to-sequence architectures

Speech AI Expertise

Hands-on experience in at least one of:
Automatic Speech Recognition (ASR)
Text-to-Speech (TTS)
Voice Conversion
Speech Translation
Speech-to-Speech systems
Strong understanding of:
Acoustic modeling
Language modeling
Vocoders
Speaker embeddings
Alignment methods

Reinforcement Learning & Distillation

Practical experience with:
RLHF
DPO
PPO / GRPO
Preference learning
Experience with:
Knowledge distillation
Model compression
Efficient speech model deployment

Preferred Qualifications

Experience building Vietnamese ASR systems with low Word Error Rate across multiple regional accents.
Experience building natural Vietnamese TTS systems with expressive and emotional speech generation.
Familiarity with:
Streaming ASR
Streaming TTS
Real-time voice assistants
Speech-to-Speech AI
Publications at speech, machine learning, or NLP conferences such as:
ICASSP
Interspeech
NeurIPS
ICML
ICLR
ACL
EMNLP

Benefits

Flexible working hours and attendance policy (Work from Home on working Saturdays).
Attractive compensation and bonus packages, highly competitive in the market.
Exclusive employee benefits across the Group's ecosystem in accordance with company policies.
Opportunity to work on large-scale and strategic technology projects.
Professional technology environment with leading scientists, experts, and engineers from top technology companies in Vietnam and around the world.
Free access to learning platforms such as Udemy, Coursera, and O'Reilly; internal workshops; sponsorship for professional certifications; and exclusive mentoring programs from the Group and Company leadership team.
Full statutory insurance coverage in accordance with Vietnamese Labor Law (Social Insurance, Health Insurance, Unemployment Insurance), along with private healthcare insurance based on job grade and annual health check-ups at reputable hospitals and healthcare centers nationwide.
Participation in internal activities, team-building programs, and annual company events.

Contact: Ms. Như

Zalo/Call: 0342298113

Mail: [Confidential Information]