About the Role:
A dedicated startup is being formed to industrialize and scale a secure, AI-enabled, multi-source decision-support software offering. The platform is a multi-sensor fusion and agentic AI solution connecting to diverse data sources (for example geospatial layers, imagery, video, and other operational signals). This role will support the delivery of a scalable product and contribute to establishing the processes, standards, and collaboration practices required for sustainable growth.
Own the reliability and scalability of ML and LLM-enabled services by building robust pipelines, deployments, monitoring, and operational controls in a fast-moving startup environment.
Responsibilities
- Design and operate end-to-end ML/LLM delivery pipelines, from data through training/fine-tuning, evaluation, and packaging to deployment.
- Build CI/CD for models and services, including automated testing, validation gates, and rollback strategies.
- Standardize experiment tracking, model/version lineage, and artifact management (datasets, prompts, checkpoints, embeddings).
- Implement monitoring and observability: latency, cost, drift, quality signals, and safety/guardrail metrics.
- Optimize inference performance and cost (batching, caching, quantization, hardware choices).
- Define and enforce environment and dependency management across dev/stage/prod.
- Work with engineering on scalable serving patterns (APIs, streaming, event-driven), and with security on access controls and secrets.
- Support release readiness: runbooks, incident response, SLOs/SLAs, and post-release stability tracking.
- Coordinate with procurement and legal where needed for tooling, cloud services, and vendor onboarding.
- Work in startup mode: stay hands-on and flexible, be comfortable pivoting, and unblock teams quickly.
Interfaces / Stakeholders
- Software engineering (platform, backend, DevOps).
- ML/LLM engineers and applied scientists.
- Product and delivery teams (PM/PO/BA).
- Security, IT, procurement, and finance (as applicable).
Qualifications
- Typically 5+ years in MLOps, DevOps, or data platform roles, including production deployments of ML and/or LLM-powered systems. Experience in fast-paced product environments is preferred.
Tools (examples)
- ML lifecycle: MLflow / Weights & Biases / equivalent.
- Serving: FastAPI, Triton (plus), Ray Serve (plus).
- Orchestration: Airflow/Dagster (plus).
- Observability: Prometheus/Grafana, OpenTelemetry, ELK.
- Cloud: AWS/Azure/GCP (or private cloud).
KPIs
- Deployment frequency and lead time for model releases
- Production stability: incident rate, MTTR, SLO compliance
- Model quality health: drift detection coverage, evaluation gate pass rate
- Inference cost and latency improvements
- Reproducibility and traceability coverage (lineage completeness)
Income/Benefits:
- Competitive salary package (negotiable based on experience).
- Opportunity for long-term growth in a leadership role.
Contact Information:
If you are interested in this position, please send your CV to: [Confidential Information]