Senior Software Engineer (AI Inference & Performance)

GreenNode

Ho Chi Minh, Vietnam

5-7 Years

Posted a day ago

Job Description:

We are looking for a Senior Inference Engineer with a strong foundation in software engineering, distributed systems, and performance optimization to build and optimize inference engines for large-scale LLM serving systems. You will work across both research and production environments, ensuring our LLM serving systems are fast, scalable, and efficient. The role spans the entire inference stack, from kernel and runtime to scheduling, memory management, and distributed execution.

Key Responsibilities:

Profile, benchmark, and analyze bottlenecks for LLM inference workloads across multiple layers: kernel, memory, networking, and scheduler
Optimize inference engines (vLLM, SGLang, TensorRT-LLM) for throughput, latency, memory efficiency, GPU utilization, and cost
Implement and fine-tune inference optimization techniques including batching, KV-cache management, quantization, speculative decoding, parallelism strategies, and disaggregated serving
Build instrumentation and profiling tools to identify bottlenecks
Ensure the reliability of the inference pipeline through A/B launches, rollback, model versioning, and fault tolerance
Collaborate with the Platform Engineering team to improve serving architecture based on performance findings
Document and share knowledge, contributing to internal best practices and AI open-source projects whenever possible

Job Requirements:

Minimum 5 years of experience as a Software Engineer, Performance Engineer, or in an equivalent role
Strong foundation in software engineering, distributed systems, and performance optimization
Proficient in Python; experience with C/C++, Go, or other low-level programming languages is a plus
Experience with high-performance systems such as high-throughput backends, distributed services, large-scale serving systems, or equivalent
Knowledge of or hands-on experience with AI/ML serving systems, LLM inference, or GPU workloads
Understanding of GPU architecture and the CUDA programming model
Hands-on experience with at least one inference engine (vLLM, SGLang, TensorRT-LLM, Triton Inference Server) or equivalent systems
Understanding of inference optimization techniques including batching, KV-cache optimization, quantization, speculative decoding, tensor/pipeline parallelism, or disaggregated serving
Experience with profiling, bottleneck analysis, and performance tuning for production systems
Strong systems thinking, a high sense of ownership, and willingness to dive deep into complex technical challenges

Nice to Have:

Hands-on production experience with CUDA programming or NVIDIA libraries such as cuBLAS, cuDNN, and NCCL
Open-source contributions to inference-related projects (vLLM, SGLang, TensorRT-LLM) or AI infrastructure projects
Deep experience with the NVIDIA inference stack including Triton, CUTLASS, TensorRT, or CUDA profiling tools (Nsight Systems / Nsight Compute)
Experience with distributed inference, request routing, or inference orchestration
Experience with LLMOps, RAG systems, or AI agents
Experience with observability stacks (Prometheus, Grafana, OpenTelemetry) for ML or distributed systems
Published research papers or open-source contributions in the field of ML systems or inference optimization