Role Overview
Own the design, fine-tuning, optimization, and production deployment of large language models (LLMs) for domain-specific use cases. You will build high-performance RAG systems, optimize prompts/agents, operate inference at scale, and champion engineering best practices while driving research and innovation.
Key Responsibilities
- LLM Engineering: Design, fine-tune, and optimize models such as GPT, Claude, Gemini, LLaMA, and Falcon for domain-specific applications.
- RAG Systems: Build and operate retrieval-augmented generation pipelines (ingestion, chunking, embedding, indexing, retrieval, re-ranking) using vector databases (FAISS, Pinecone, Weaviate, etc.).
- Prompt/Agent Optimization: Develop prompt templates, chains, and agents with LangChain/LlamaIndex; implement guardrails, tool-use, and memory.
- Model Deployment (LLMOps): Implement, monitor, and scale inference endpoints with MLflow, Docker, and Kubernetes; manage versioning/registry and safe rollouts (blue-green/canary).
- Performance Optimization: Evaluate and continuously improve accuracy, latency, and cost (batching, caching/KV-cache, quantization, speculative decoding).
- Collaboration & Mentoring: Review code, set best practices for AI software engineering, and mentor junior engineers.
- Research & Innovation: Track advances in LLMs, multimodal AI, and open source; lead PoCs, benchmarking, and knowledge sharing.
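To make the RAG responsibility above concrete, the core pipeline stages (chunking, embedding, retrieval) can be sketched in pure Python. This is a toy sketch only: the hash-based trigram embedding and brute-force cosine search stand in for a real embedding model and a vector database such as FAISS or Pinecone, and the fixed-size character chunker is a simplifying assumption.

```python
import hashlib
import math

def chunk(text: str, size: int = 40) -> list[str]:
    # Naive fixed-size character chunking; production pipelines split on
    # tokens or sentences, usually with overlap between chunks.
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text: str, dim: int = 64) -> list[float]:
    # Toy deterministic embedding: hash character trigrams into a fixed-size
    # vector, then L2-normalize. A real system calls an embedding model.
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Cosine similarity over unit vectors reduces to a dot product; a vector
    # DB (FAISS/Pinecone/Weaviate) does this at scale with ANN indexes.
    q = embed(query)
    scored = sorted(chunks,
                    key=lambda c: -sum(a * b for a, b in zip(q, embed(c))))
    return scored[:k]

corpus = ("LLMs generate text. RAG grounds answers in retrieved documents. "
          "Kubernetes schedules containers.")
docs = chunk(corpus)
top = retrieve("retrieval augmented generation", docs)
```

A production version would add re-ranking (e.g. a cross-encoder over the top-k candidates) before passing retrieved context to the generator.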
Required Qualifications
- Education: Bachelor's or Master's in Computer Science, Artificial Intelligence, or a related field (PhD preferred).
- Experience:
  - 5+ years in machine learning/NLP.
  - 2+ years working directly with LLMs or GenAI applications.
- Technical Skills:
  - Proficiency in Python, ML frameworks (PyTorch/TensorFlow), and Hugging Face Transformers.
  - Hands-on experience with LangChain, LlamaIndex, or SDKs for OpenAI/Anthropic/Cohere/Gemini.
  - Strong understanding of embeddings, tokenization, and vector search/retrieval.
  - Familiarity with MLOps, CI/CD, and cloud platforms (AWS/Azure/GCP); containerization with Docker/Kubernetes.
  - Experience integrating AI APIs (OpenAI, Anthropic, Cohere, Google Gemini).
- Soft Skills: Excellent problem-solving and communication; comfortable leading projects and mentoring teammates.
Preferred/Bonus
- Experience with model distillation and fine-tuning open-source LLMs (LoRA/QLoRA, PEFT).
- Exposure to multimodal AI (text + image + audio/voice), including speech (TTS/ASR) and vision-language models (VLMs).
- Familiarity with AI safety, bias/fairness, privacy, and governance/compliance frameworks.
- Cost/performance tuning: quantization (INT8/INT4), speculative decoding, throughput optimization.
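The quantization item above can be illustrated with a minimal symmetric INT8 round-trip in pure Python. This is a sketch of the idea only: real deployments use library kernels (e.g. in TensorRT-LLM or bitsandbytes), often with per-channel scales and calibration, and the per-tensor scale here is a simplifying assumption.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    # Symmetric per-tensor quantization: map [-max|w|, +max|w|] to [-127, 127].
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    # Recover approximate float weights from the int codes.
    return [x * scale for x in q]

w = [0.82, -1.27, 0.031, 0.5]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# Round-trip error is bounded by half a quantization step (scale / 2).
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

The memory saving is the point: INT8 stores each weight in 1 byte instead of 4 (FP32) or 2 (FP16), at the cost of the bounded rounding error computed above.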
Success Metrics (KPIs)
- Model quality (task-specific metrics: accuracy/recall, hallucination rate, BLEU/ROUGE/WER as applicable).
- System performance & cost (P95 latency, throughput, cost per request).
- Reliability (SLO/SLA, error rates) and delivery velocity (lead time, deployment frequency).
- Knowledge impact (PoC production conversions, docs/best practices, mentoring outcomes).
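As a concrete example of the latency KPI above, P95 can be computed from a sample of request latencies with the nearest-rank method. The method choice is an assumption for illustration; monitoring stacks such as Prometheus typically estimate quantiles from histogram buckets instead of raw samples.

```python
import math

def percentile(samples: list[float], p: float) -> float:
    # Nearest-rank percentile: the smallest observed value such that at
    # least p% of the samples are at or below it.
    ordered = sorted(samples)
    rank = math.ceil(p / 100.0 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

latencies_ms = [120, 95, 130, 110, 500, 105, 98, 115, 102, 480,
                101, 99, 108, 112, 125, 104, 97, 103, 109, 111]
p95 = percentile(latencies_ms, 95)  # dominated by the tail (480, 500)
```

Note how two slow requests out of twenty pull P95 far above the median, which is why tail latency, not the average, is the usual SLO target.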
Tools & Environment
- Model/Serving: HF Transformers, vLLM/TensorRT-LLM, Triton, Ray/Modal (as applicable).
- Vector/RAG: FAISS, Pinecone, Weaviate, Milvus; re-ranking (e.g., Cross-Encoder/ColBERT).
- Ops/Observability: MLflow, Prometheus/Grafana, OpenTelemetry, Weights & Biases.
- Data: Airflow/Prefect, dbt, Spark (as needed).
Benefits (customizable)
- Competitive compensation with performance/PoC success bonuses.
- Learning budget/certifications and conference attendance.
- Dedicated GPU credits/resources for R&D; open-source-friendly environment.
- Comprehensive insurance and flexible work arrangements.