About Us
Vast.ai's cloud powers AI projects and businesses all over the world. We are democratizing and decentralizing AI computing—reshaping our future for the benefit of humanity. Our mission is to organize, optimize, and orient the world's computation.
We value elegance, ownership, integrity, and continuous learning. You'll have the opportunity to dive into state-of-the-art AI systems while collaborating with a globally distributed team.
About the Role
This is an L2 technical support role focused on escalated infrastructure issues that go beyond frontline triage. You'll be the engineering resource our L1 support team leans on when tickets get complex, diagnosing and resolving issues across the full stack: hardware/BIOS/firmware, networking, Ubuntu, Docker, NVIDIA CUDA/GPU, and virtualization (KVM).
You'll handle higher-complexity issues, own escalation resolution end-to-end, contribute to internal documentation and runbooks, and collaborate directly with the engineering team and host support team on recurring or systemic issues.
Strong technical depth is the primary requirement. You should be comfortable working autonomously across Ubuntu environments, diagnosing container and GPU issues, and communicating findings clearly to both technical and non-technical audiences.
Prior experience as a Vast.ai user or host is strongly preferred.
Key Responsibilities
- Handle escalated L2 support tickets from the L1 team — GPU workload failures, container issues, networking problems, account infrastructure, and host-side configuration
- Diagnose and resolve issues across Ubuntu, Docker, NVIDIA CUDA/GPU drivers, and virtualization environments (KVM)
- Troubleshoot network-layer issues: VLAN, DNS, DHCP, VPN, NAT, firewall rules, and connectivity failures on host machines
- Investigate performance issues — GPU utilization, container resource constraints, thermal throttling, driver conflicts, disk I/O bottlenecks
- Write and maintain internal runbooks, escalation guides, and knowledge base articles to reduce repeat escalations
- Collaborate with the engineering team and host support team to flag and document systemic or recurring platform issues
- Assist clients and hosts working with AI frameworks (TensorFlow, PyTorch) and GPU-accelerated workloads
- Build and maintain internal diagnostic and automation tools using AI-assisted development (Claude Code or similar)
- Provide occasional coverage for L1 overflow during peak periods or incidents
You Are
- Fluent in Linux — you navigate systems, read logs, and solve problems from the command line without hesitation
- Methodical and thorough. You gather data, dig into root causes, and don't settle for surface-level fixes.
- A self-starter who can manage a queue of complex tickets with minimal supervision
- Adaptable to flexible hours including occasional weekend coverage
- A clear written communicator — able to explain technical findings to clients and write useful internal documentation
- Genuinely curious about AI infrastructure, GPU computing, and distributed systems
Must-haves
- Solid Linux SysOps experience — Ubuntu Server, RHEL/CentOS, Debian; comfortable with systems, networking, storage, and permissions
- Proficiency with Docker — container debugging, Docker Compose, image management, cgroup resource limits, Docker storage/filesystem management
- Experience with virtualization — Proxmox VE, VMware, or similar hypervisors; provisioning and troubleshooting VMs
- Networking fundamentals — VLAN, DNS, DHCP, NAT, VPN, firewall rules, and general L2/L3 troubleshooting
- Scripting in Python and Bash for automation and diagnostic tooling
- Proficiency with AI-assisted development tools (Claude Code, Cursor, Copilot, or similar) — you actively use AI to write scripts, build diagnostic tooling, and work faster
- Prior experience in a technical support, sysadmin, or infrastructure operations role
- Strong English written communication — clear, professional, and technically precise
Nice-to-haves
- Hands-on experience with NVIDIA GPU drivers, CUDA, and GPU workload troubleshooting
- Familiarity with AI/ML frameworks (TensorFlow, PyTorch) and running GPU-accelerated containers
- Monitoring and observability experience (Prometheus, Grafana)
- Relevant certifications: RHCSA, CompTIA Linux+, or similar
- Knowledge of the Vast.ai platform as a client or host
Benefits
- Competitive salary · 100% salary during probation
- Performance Bonus
- Annual leave as per Vietnamese regulations
- Health insurance
- Reimbursement for approved work expenses
- Fast-growing, ambitious startup environment with growth into senior or leadership tracks