NVIDIA

Senior Deep Learning Algorithms Engineer - BioNeMo

Job Description

Join NVIDIA as a Senior Deep Learning Algorithms Engineer to optimize cutting-edge biology and structural biology models, including LLMs and VLMs, for maximum performance and efficiency on NVIDIA GPUs. The role focuses on world-class inference for workloads such as protein structure prediction and design.

As part of BioNeMo, you will collaborate across teams to move next-gen AI models (e.g., Boltz1/2, OpenFold2/3) from research to production serving via TensorRT-LLM and related stacks, ensuring industry-leading, scalable performance for scientists and developers.

What You Will Be Doing

  • Integrate TensorRT-LLM for BioNeMo models (Boltz1–2, OpenFold2–3) and upcoming structural biology models (RFDiffusion, DiffDock, ProteinMPNN, Evo2, ESM3).
  • Optimize models for low-latency, high-throughput inference using parallelism, quantization (FP8/INT8), and sparsity/pruning.
  • Profile and debug deep learning workloads on GPUs, resolving kernel/graph bottlenecks in training/inference, including custom operators.
  • Develop and validate custom GPU kernels (CUDA, Triton) for hot paths, memory-bound ops, and non-standard blocks in structural biology models.
  • Collaborate with research to align model architecture and training with deployment constraints for smooth production transition.
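The FP8/INT8 quantization work above can be illustrated with a minimal sketch of symmetric per-tensor INT8 quantization in pure Python. This is a toy illustration of the underlying math only; real deployments would use TensorRT-LLM's calibration and quantization tooling, and the helper names here are hypothetical:

```python
def int8_symmetric_quantize(values, qmax=127):
    """Symmetric per-tensor INT8 quantization: one scale maps the
    largest absolute value onto the integer range [-qmax, qmax]."""
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map quantized integers back to approximate real values."""
    return [x * scale for x in q]

weights = [0.02, -1.27, 0.64, 0.005]          # toy weight tensor
q, scale = int8_symmetric_quantize(weights)
restored = dequantize(q, scale)
# Per-element round-trip error is bounded by scale / 2.
```

The accuracy-vs-speed tradeoff mentioned later in the posting comes from exactly this rounding error: a smaller dynamic range (smaller `scale`) means finer resolution, which is why per-channel or block-wise scales are often preferred over a single per-tensor scale.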

What We Want To See

  • MS/PhD in CS, EE, Comp. Eng., or equivalent practical experience.
  • 5+ years professional experience in deep learning/applied ML, with a track record of deploying optimized models/inference paths in production (not research prototypes).
  • Strong foundation in transformer/diffusion architectures; direct experience with LLMs, VLMs, or large biology models (e.g., structure prediction).
  • Proficient in PyTorch (and/or TensorFlow) for production-grade model building, debugging, and deployment.
  • Strong Python/C++; ability to read/modify performance-critical C++/CUDA code for inference stacks and custom ops.
  • Practical experience with TensorRT/TensorRT-LLM: model conversion, optimization, deployment, and performance measurement (latency/throughput) under realistic conditions.
  • Familiarity with GPU performance engineering: profiling (Nsight), roofline analysis, and optimization of kernels/memory access; experience writing/extending custom GPU kernels for model hot paths is required.
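The roofline analysis named in the last bullet reduces to comparing a kernel's arithmetic intensity (FLOPs per byte of memory traffic) against the machine balance. A toy calculation follows; the peak-throughput and bandwidth numbers are illustrative placeholders, not any specific GPU's specs:

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of DRAM traffic."""
    return flops / bytes_moved

def attainable_tflops(intensity, peak_tflops, mem_bw_tbs):
    """Roofline model: attainable performance is capped either by the
    compute peak or by memory bandwidth * arithmetic intensity,
    whichever is lower."""
    return min(peak_tflops, mem_bw_tbs * intensity)

# Illustrative machine: 100 TFLOP/s peak, 2 TB/s DRAM bandwidth.
# A large GEMM at ~200 FLOP/byte sits on the compute roof;
# an elementwise op at 0.25 FLOP/byte is memory-bound.
gemm_perf = attainable_tflops(200, peak_tflops=100, mem_bw_tbs=2)        # 100.0
eltwise_perf = attainable_tflops(0.25, peak_tflops=100, mem_bw_tbs=2)    # 0.5
```

This is why the posting pairs kernel work with memory-bound ops: for operations far below the machine balance point, fusing kernels to cut DRAM traffic raises intensity and is the main lever, not more FLOPs.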

Ways To Stand Out From The Crowd

  • Led or significantly contributed to large-scale LLM/VLM/biology model serving (strict SLOs, high QPS, multi-GPU/node inference, cost/perf ownership).
  • Deep customization of, or substantial contributions to, TensorRT-LLM, vLLM, SGLang, or comparable stacks, including debugging and extending for novel architectures.
  • End-to-end ownership of FP8/INT8 (or other formats), including calibration, regression testing, and documenting accuracy vs. speed tradeoffs on biology workloads.
  • Strong familiarity with protein structure, docking, or diffusion-based design and model families (e.g., OpenFold, Boltz, ESM, RFDiffusion, DiffDock)—demonstrated by benchmarks, publications, or open-source work.
  • Repeated success taking non-text architectures (geometric, multimodal, structure-centric) from research/checkpoint to optimized, production-ready inference with clear metrics.
  • Examples of writing, maintaining, or upstreaming custom kernels or fused ops that produced measurable gains on real models or hardware.
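The "strict SLOs, high QPS" criterion above is usually verified with tail-latency percentiles rather than averages. A minimal sketch of that measurement, using the nearest-rank percentile definition and synthetic sample data (real harnesses would record timestamps under production-like load):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples."""
    s = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(s)))
    return s[rank - 1]

def throughput_qps(num_requests, wall_seconds):
    """Completed requests per second over the measurement window."""
    return num_requests / wall_seconds

latencies_ms = [12, 15, 11, 90, 14, 13, 16, 12, 14, 13]  # synthetic samples
p50 = percentile(latencies_ms, 50)   # 13
p99 = percentile(latencies_ms, 99)   # 90
meets_slo = p99 <= 50                # one straggler blows a 50 ms tail SLO
```

The example shows why tail metrics matter: the median looks healthy while a single slow request violates the SLO, which is the failure mode batching and scheduling work in serving stacks is meant to control.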

JR2016601

Job ID: 146351023