Job Description:
We are looking for a Senior Inference Engineer with a strong foundation in software engineering, distributed systems, and performance optimization to build and optimize inference engines for large-scale LLM serving systems. You will work across both research and production environments, ensuring our LLM serving systems are fast, scalable, and efficient. The role spans the entire inference stack, from kernel and runtime to scheduling, memory management, and distributed execution.
Key Responsibilities:
- Profile, benchmark, and analyze bottlenecks for LLM inference workloads across multiple layers: kernel, memory, networking, and scheduler
- Optimize inference engines (vLLM, SGLang, TensorRT-LLM) for throughput, latency, memory efficiency, GPU utilization, and cost
- Implement and fine-tune inference optimization techniques including batching, KV-cache management, quantization, speculative decoding, parallelism strategies, and disaggregated serving
- Build instrumentation and profiling tools to identify bottlenecks
- Ensure the reliability of the inference pipeline through A/B launches, rollback, model versioning, and fault tolerance
- Collaborate with the Platform Engineering team to improve serving architecture based on performance findings
- Document and share knowledge, contributing to internal best practices and AI open-source projects whenever possible
Job Requirements:
- Minimum 5 years of experience as a Software Engineer, Performance Engineer, or in an equivalent role
- Strong foundation in software engineering, distributed systems, and performance optimization
- Proficient in Python; experience with C/C++, Go, or other low-level programming languages is a plus
- Experience with high-performance systems such as high-throughput backends, distributed services, large-scale serving systems, or equivalent
- Knowledge of or hands-on experience with AI/ML serving systems, LLM inference, or GPU workloads
- Understanding of GPU architecture and the CUDA programming model
- Hands-on experience with at least one inference engine (vLLM, SGLang, TensorRT-LLM, Triton Inference Server) or equivalent systems
- Understanding of inference optimization techniques including batching, KV-cache optimization, quantization, speculative decoding, tensor/pipeline parallelism, or disaggregated serving
- Experience with profiling, bottleneck analysis, and performance tuning for production systems
- Strong systems thinking, a high sense of ownership, and willingness to dive deep into complex technical challenges
Nice to Have:
- Hands-on production experience with CUDA programming or NVIDIA libraries such as cuBLAS, cuDNN, and NCCL
- Open-source contributions to inference-related projects (vLLM, SGLang, TensorRT-LLM) or AI infrastructure projects
- Deep experience with the NVIDIA inference stack including Triton, CUTLASS, TensorRT, or CUDA profiling tools (Nsight Systems / Nsight Compute)
- Experience with distributed inference, request routing, or inference orchestration
- Experience with LLMOps, RAG systems, or AI agents
- Experience with observability stacks (Prometheus, Grafana, OpenTelemetry) for ML or distributed systems
- Published research papers or open-source contributions in the field of ML systems or inference optimization