Machine Learning Operations (MLOps) - AI/ML Platform

Madison Technologies

Da Nang, Vietnam

5-7 Years

Posted 10 hours ago

Job Description

About the Role:

A dedicated startup is being formed to industrialize and scale a secure, AI-enabled, multi-source decision-support software offering. The platform is a multi-sensor fusion and agentic AI solution connecting to diverse data sources (for example geospatial layers, imagery, video, and other operational signals). This role will support the delivery of a scalable product and contribute to establishing the processes, standards, and collaboration practices required for sustainable growth.

Own the reliability and scalability of ML and LLM-enabled services by building robust pipelines, deployments, monitoring, and operational controls in a fast-moving startup environment.

Responsibilities

Design and operate end-to-end ML/LLM delivery pipelines, from data through training/fine-tuning, evaluation, and packaging to deployment.
Build CI/CD for models and services, including automated testing, validation gates, and rollback strategies.
Standardize experiment tracking, model/version lineage, and artifact management (datasets, prompts, checkpoints, embeddings).
Implement monitoring and observability: latency, cost, drift, quality signals, and safety/guardrail metrics.
Optimize inference performance and cost (batching, caching, quantization, hardware choices).
Define and enforce environment and dependency management across dev/stage/prod.
Work with engineering on scalable serving patterns (APIs, streaming, event-driven), and with security on access controls and secrets.
Support release readiness: runbooks, incident response, SLOs/SLAs, and post-release stability tracking.
Coordinate with procurement and legal where needed for tooling, cloud services, and vendor onboarding.
Startup mode: hands-on, flexible, comfortable pivoting, and able to unblock teams quickly.
Interfaces / stakeholders:
Software engineering (platform, backend, DevOps).
ML/LLM engineers and applied scientists.
Product and delivery teams (PM/PO/BA).
Security, IT, procurement, and finance (as applicable).

Qualifications

Typically 5+ years in MLOps/DevOps/Data Platform roles, including production deployments of ML and/or LLM-powered systems. Experience in fast-paced product environments is preferred.
Tools (examples):
ML lifecycle: MLflow / Weights & Biases / equivalent.
Serving: FastAPI, Triton (plus), Ray Serve (plus).
Orchestration: Airflow/Dagster (plus).
Observability: Prometheus/Grafana, OpenTelemetry, ELK.
Cloud: AWS/Azure/GCP (or private cloud).
KPIs
Deployment frequency and lead time for model releases
Production stability: incident rate, MTTR, SLO compliance
Model quality health: drift detection coverage, evaluation gate pass rate
Inference cost and latency improvements
Reproducibility and traceability coverage (lineage completeness)