Overview
You will design and train next-generation Vision-Language-Action (VLA) models that let humanoid robots understand instructions, perceive complex scenes, and act safely in real industrial environments.
Your focus is learning from limited real-world teleoperation data, and closing the distribution shift between low-data real demos and rich synthetic worlds. You'll explore new model architectures, training schemes, and loss functions, and combine them with randomized, high-fidelity simulation and world-model-based data generation (e.g., Isaac/Omniverse) to build generalizable VLA policies for humanoids in factories and logistics.
You'll work closely with our Teleoperation, RL & Controls, Simulation, and Platform teams to bring these models from research into production robots.
Key Responsibilities
Design and implement VLA architectures for humanoids
- Build multi-modal policies that ingest RGB/Depth, language, robot state, and task history to generate actions (pose targets, motion primitives, or low-level controls).
- Explore transformers, diffusion-style policies, hierarchical VLA, recurrent memory, and world-model-augmented controllers.
Learn effectively from scarce, noisy teleoperation data
- Work with the teleop team to define data schemas, logging, and dataset curation from real humanoid operators.
- Develop training strategies for low-data regimes: strong augmentations, self-/semi-supervised pretraining, contrastive objectives, multi-task learning, and behavior cloning / offline RL hybrids.
- Propose loss designs and regularizers (e.g., action smoothness, safety margins, temporal consistency, language-grounding consistency) to mitigate overfitting and distribution shift.
Tackle distribution shift between real-world demos and simulation / synthetic data
- Design domain randomization and sim parameter sampling (lighting, materials, sensor noise, robot dynamics, task layouts, human styles) to cover real-world variation.
- Set up pipelines where VLA policies are trained jointly on real teleop demos and large synthetic datasets.
- Analyze failure modes (out-of-distribution visual scenes, unseen language instructions, contact edge cases) and iteratively refine data, models, and objectives.
Build synthetic and simulated data pipelines (Isaac / Omniverse / Cosmos)
- Configure high-fidelity humanoid simulation environments (manipulation cells, factory workcells, shared spaces with humans).
- Integrate or prototype workflows that use world foundation models (e.g., NVIDIA Cosmos Predict/Transfer/Reason) to generate diverse video and interaction data for downstream VLA training and evaluation.
- Automate large-scale curriculum & scenario generation (edge cases, rare events, long-horizon tasks).
Evaluation, benchmarking, and deployment support
- Define metrics and test suites: task success, safety violations, instruction following, sim-to-real gap, robustness to visual/language perturbations.
- Run structured ablations (architecture, data mix, losses) and communicate findings with clear plots, reports, and logs.
- Collaborate with RL/Controls and Platform teams to integrate VLA policies into the humanoid stack and run on real robots under safety constraints.
Required Qualifications
Core skills
- Strong background in deep learning for sequence/multimodal modeling (e.g., transformers, diffusion models, recurrent architectures, latent world models).
- Hands-on experience building and training vision-language or VLA-style models (e.g., VLMs, embodied LLMs, policy networks conditioned on language).
- Solid understanding of at least one of:
- Imitation learning / behavior cloning
- Offline / batch RL
- Inverse RL or preference-based learning
- Proven ability to work in low-data regimes: data augmentation, self-supervised representation learning, regularization, careful validation design.
- Experience with robot learning from demonstration or teleoperation data (any platform; humanoids a plus).
- Strong engineering skills in Python and modern ML frameworks (PyTorch preferred; JAX/TF a plus), including:
- Writing clean training loops and data pipelines
- Profiling and debugging training/inference
- Managing experiments at scale (config systems, logging, basic MLOps)
General
- Bachelor's/Master's/Ph.D. in Computer Science, Robotics, EE, or related field; or equivalent industry experience.
- Ability to work cross-functionally with controls, hardware, and teleoperation teams.
Preferred Qualifications
- Experience with NVIDIA physical-AI stacks: Isaac (Sim/Lab), Omniverse, or NVIDIA Cosmos world foundation models for synthetic data generation and sim-to-real workflows. Comfortable designing synthetic datasets: specifying scenario distributions, parameter ranges, and validation protocols.
- Prior work on humanoid robots (control, perception, or policy learning) or other complex articulated robots in industrial settings.
- Contributions to embodied AI / robot learning research: publications, open-source projects, or widely used codebases.
- Familiarity with safety-critical robotics (safe action constraints, human-in-the-loop supervision, fallbacks).
- Experience deploying models on GPU clusters and edge devices (profiling latency, memory usage, batching, mixed precision).