greennode

Senior Platform Engineer (AI Inference & Agent Platform)

  • Posted 14 hours ago

Job Description

We are looking for a Senior Platform Engineer with deep expertise in deploying, operating, and optimizing Kubernetes-based infrastructure, LLM Inference Platforms, and Agent Platforms at scale. In this role, you will be a key contributor to building and running AI-native platforms centered on large-scale LLM inference, GPU acceleration, and agent workloads — with a relentless focus on stability, performance, and scalability.

Key Responsibilities

  • Deploy, operate, and continuously optimize Kubernetes clusters across cloud and on-premise environments.
  • Build and maintain a robust LLM Inference Platform and Agent Platform to serve GenAI applications, AI agents, and large-scale AI workloads.
  • Deploy and tune inference engines and serving frameworks, including vLLM, SGLang, Triton, TensorRT-LLM, llama.cpp, KServe, Ray Serve, and equivalents.
  • Drive inference performance improvements for LLM workloads through batching, quantization, KV-cache optimization, parallelism strategies, and runtime tuning.
  • Maximize GPU utilization and optimize autoscaling, scheduling, latency, and throughput for large-scale inference systems.
  • Architect and operate scalable serving infrastructures for multi-tenant AI workloads, balancing high availability with cost efficiency.
  • Establish and maintain comprehensive monitoring and observability systems covering AI platform health and inference workload performance.
  • Define and refine key metrics, alerting thresholds, SLOs/SLAs, and error budgets for inference services.
  • Build and manage deployment pipelines, rollout strategies, and automation workflows for AI systems.
  • Lead and contribute to incident response, root cause analysis, and ongoing reliability improvements.
  • Partner closely with AI Engineers and Product Teams to continuously elevate the AI platform and developer experience.

Requirements

  • 5+ years of experience as a Platform Engineer, Site Reliability Engineer (SRE), DevOps Engineer, or equivalent role.
  • Proven track record deploying and operating Kubernetes in production environments.
  • Strong command of the Kubernetes ecosystem: networking, ingress, storage, autoscaling, observability, and security.
  • Hands-on experience with AI/ML infrastructure, GPU workloads, and LLM inference systems, including engines such as vLLM, SGLang, Triton, TensorRT-LLM, llama.cpp, or equivalent.
  • Solid understanding of LLM inference optimization techniques — quantization, batching, tensor/pipeline parallelism, and KV-cache optimization.
  • Experience with monitoring and observability tooling: Prometheus, Grafana, Loki, ELK/OpenSearch, and OpenTelemetry.
  • Proficiency with CI/CD and GitOps practices, using tools such as Helm, Terraform, Argo CD, or comparable toolchains.
  • Ability to write reliable automation scripts in Python, Bash, or Go.
  • Strong foundational knowledge of Linux systems, networking, distributed systems, and performance tuning.
  • Self-driven, systems-minded, and capable of managing production incidents with composure and rigor.
  • An AI-native mindset — you actively leverage AI tools and automation to sharpen operational efficiency and elevate engineering workflows.

Nice to Have

  • Experience with LLMOps, RAG systems, AI agents, or agent orchestration frameworks.
  • Familiarity with inference orchestration, request routing, or disaggregated serving architectures.
  • Hands-on experience with distributed systems such as Kafka, ClickHouse, Elasticsearch/OpenSearch, or vector databases.
  • Prior experience deploying AI platforms in on-premise or private cloud environments.
  • Relevant certifications: CKA, CKAD, CKS, AWS/GCP/Azure, or other cloud and platform credentials.

Job ID: 148945549