Company Description
VinSmart Future (VSF) is the leading technology company within the Vingroup Corporation, formed by the merger of the group's entire technology ecosystem, including VinApp, VinIT, VinBigdata, and other tech units. As a core driver of Vingroup's future growth, VSF is at the forefront of technological development, with artificial intelligence (AI) as its foundation. With a talented team of nearly 4,000 local and international technology experts, VSF focuses on creating high-utility technologies that enhance lives and connect data, models, and infrastructure to unlock new possibilities.
About the Role
We are seeking a Senior AI Research Engineer to lead the development of state-of-the-art Vietnamese Speech AI technologies, including Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and Speech-to-Speech Conversational AI.
The ideal candidate has strong expertise in foundation model adaptation, pretraining, supervised fine-tuning (SFT), reinforcement learning, and knowledge distillation. You will be responsible for building state-of-the-art Vietnamese speech models that combine high accuracy, naturalness, and low latency.
Responsibilities
Speech Foundation Models
- Research, develop, and optimize state-of-the-art Vietnamese ASR and TTS models.
- Adapt and improve large speech foundation models for Vietnamese language and accents.
- Work with open-source and commercial speech models, including:
  - Qwen3-ASR
  - Qwen3-TTS
  - Whisper
  - CosyVoice
  - Orpheus
  - Sesame
  - Fish Speech
  - XTTS
  - Other emerging speech foundation models
Model Training & Fine-Tuning
- Design and implement scalable pipelines for:
  - Self-supervised pretraining
  - Continued pretraining
  - Supervised Fine-Tuning (SFT)
  - Instruction tuning
  - Domain adaptation
- Build and curate large-scale Vietnamese speech datasets.
- Develop data cleaning, alignment, and augmentation pipelines for speech training.
Reinforcement Learning & Alignment
- Research and implement advanced optimization techniques:
  - Reinforcement Learning from Human Feedback (RLHF)
  - Direct Preference Optimization (DPO)
  - GRPO / PPO-based optimization
  - Preference learning for speech quality improvement
- Improve:
  - ASR accuracy
  - TTS naturalness
  - Speaker similarity
  - Pronunciation quality
  - Dialogue experience
Knowledge Distillation & Model Compression
- Distill large speech foundation models into efficient Vietnamese ASR/TTS models.
- Develop:
  - Teacher-student training frameworks
  - Representation distillation
  - Logit distillation
  - Feature matching approaches
- Optimize models using:
  - Quantization
  - Pruning
  - Distillation
  - Low-rank adaptation techniques
Conversational Speech AI
- Build speech pipelines for:
  - Voice assistants
  - Conversational AI
  - In-car voice systems
  - Real-time voice interaction
- Improve:
  - End-to-end latency
  - Turn-taking
  - Barge-in handling
  - Streaming speech generation
Evaluation & Research
- Design evaluation frameworks for:
  - WER (Word Error Rate)
  - CER (Character Error Rate)
  - MOS (Mean Opinion Score)
  - Speaker similarity
  - Latency
  - Robustness
- Conduct research experiments and benchmark against state-of-the-art systems.
- Stay up to date with the latest speech AI research and contribute novel ideas to the team.
Requirements
Education
- Bachelor's, Master's, or PhD in:
  - Computer Science
  - Artificial Intelligence
  - Machine Learning
  - Speech Processing
  - Computational Linguistics
  - Related fields
Technical Skills
- Strong understanding of:
  - Deep Learning
  - Speech Processing
  - NLP
  - Generative AI
  - Transformer architectures
- Experience training and fine-tuning large speech models.
- Experience with:
  - Self-supervised learning
  - Foundation models
  - Multimodal learning
  - Sequence-to-sequence architectures
Speech AI Expertise
- Hands-on experience in at least one of:
  - Automatic Speech Recognition (ASR)
  - Text-to-Speech (TTS)
  - Voice Conversion
  - Speech Translation
  - Speech-to-Speech systems
- Strong understanding of:
  - Acoustic modeling
  - Language modeling
  - Vocoders
  - Speaker embeddings
  - Alignment methods
Reinforcement Learning & Distillation
- Practical experience with:
  - RLHF
  - DPO
  - PPO / GRPO
  - Preference learning
- Experience with:
  - Knowledge distillation
  - Model compression
  - Efficient speech model deployment
Preferred Qualifications
- Experience building Vietnamese ASR systems with low Word Error Rate across multiple regional accents.
- Experience building natural Vietnamese TTS systems with expressive and emotional speech generation.
- Familiarity with:
  - Streaming ASR
  - Streaming TTS
  - Real-time voice assistants
  - Speech-to-Speech AI
- Publications at speech, machine learning, or NLP conferences such as:
  - ICASSP
  - Interspeech
  - NeurIPS
  - ICML
  - ICLR
  - ACL
  - EMNLP
Benefits
- Flexible working hours and attendance policy (Work from Home on working Saturdays).
- Attractive compensation and bonus packages, highly competitive in the market.
- Exclusive employee benefits across the Group's ecosystem in accordance with company policies.
- Opportunity to work on large-scale and strategic technology projects.
- Professional technology environment with leading scientists, experts, and engineers from top technology companies in Vietnam and around the world.
- Free access to learning platforms such as Udemy, Coursera, and O'Reilly; internal workshops; sponsorship for professional certifications; and exclusive mentoring programs from the Group and Company leadership team.
- Full statutory insurance coverage in accordance with Vietnamese Labor Law (Social Insurance, Health Insurance, Unemployment Insurance), along with private healthcare insurance based on job grade and annual health check-ups at reputable hospitals and healthcare centers nationwide.
- Participation in internal activities, team-building programs, and annual company events.
Contact: Ms. Như
Zalo/Call: 0342298113
Mail: [Confidential Information]