greennode

Senior Software Engineer (AI Inference & Performance)

  • Posted a day ago

Job Description

We are looking for a Senior Inference Engineer with a strong foundation in software engineering, distributed systems, and performance optimization to build and optimize inference engines for large-scale LLM serving systems. You will work across both research and production environments to keep these systems fast, scalable, and efficient. The role spans the entire inference stack, from kernel and runtime to scheduling, memory management, and distributed execution.

Key Responsibilities:

  • Profile, benchmark, and analyze bottlenecks for LLM inference workloads across multiple layers: kernel, memory, networking, and scheduler
  • Optimize inference engines (vLLM, SGLang, TensorRT-LLM) for throughput, latency, memory efficiency, GPU utilization, and cost
  • Implement and fine-tune inference optimization techniques including batching, KV-cache management, quantization, speculative decoding, parallelism strategies, and disaggregated serving
  • Build instrumentation and profiling tools to identify bottlenecks
  • Ensure the reliability of the inference pipeline through A/B launches, rollback, model versioning, and fault tolerance
  • Collaborate with the Platform Engineering team to improve serving architecture based on performance findings
  • Document and share knowledge, contributing to internal best practices and AI open-source projects whenever possible

Job Requirements:

  • Minimum 5 years of experience as a Software Engineer, Performance Engineer, or in an equivalent role
  • Strong foundation in software engineering, distributed systems, and performance optimization
  • Proficient in Python; experience with C/C++, Go, or other low-level programming languages is a plus
  • Experience with high-performance systems such as high-throughput backends, distributed services, large-scale serving systems, or equivalent
  • Knowledge of or hands-on experience with AI/ML serving systems, LLM inference, or GPU workloads
  • Understanding of GPU architecture and the CUDA programming model
  • Hands-on experience with at least one inference engine (vLLM, SGLang, TensorRT-LLM, Triton Inference Server) or equivalent systems
  • Understanding of inference optimization techniques including batching, KV-cache optimization, quantization, speculative decoding, tensor/pipeline parallelism, or disaggregated serving
  • Experience with profiling, bottleneck analysis, and performance tuning for production systems
  • Strong systems thinking, a high sense of ownership, and willingness to dive deep into complex technical challenges

Nice to Have:

  • Hands-on production experience with CUDA programming or NVIDIA libraries such as cuBLAS, cuDNN, and NCCL
  • Open-source contributions to inference-related projects (vLLM, SGLang, TensorRT-LLM) or AI infrastructure projects
  • Deep experience with the NVIDIA inference stack including Triton, CUTLASS, TensorRT, or CUDA profiling tools (Nsight Systems / Nsight Compute)
  • Experience with distributed inference, request routing, or inference orchestration
  • Experience with LLMOps, RAG systems, or AI agents
  • Experience with observability stacks (Prometheus, Grafana, OpenTelemetry) for ML or distributed systems
  • Published research papers or open-source contributions in the field of ML systems or inference optimization

Job ID: 148630493