About Us
Vast.ai's cloud powers AI projects and businesses all over the world. We are democratizing and decentralizing AI computing—reshaping our future for the benefit of humanity. Our mission is to organize, optimize, and orient the world's computation.
We value elegance, ownership, integrity, and continuous learning. You'll have the opportunity to dive into state-of-the-art AI systems while collaborating with a globally distributed team.
About the Role
This is an L2 technical support role focused on escalated infrastructure issues that go beyond frontline triage. You'll be the engineering resource our L1 support team leans on when tickets get complex, diagnosing and resolving issues across the full stack: hardware/BIOS/firmware, networking, Ubuntu, Docker, NVIDIA CUDA/GPU, and virtualization (KVM).
You'll handle higher-complexity issues, own escalation resolution end-to-end, contribute to internal documentation and runbooks, and collaborate directly with the engineering team and host support team on recurring or systemic issues.
Strong technical depth is the primary requirement. You should be comfortable working autonomously across Ubuntu environments, diagnosing container and GPU issues, and communicating findings clearly to both technical and non-technical audiences.
Prior experience as a Vast.ai user or host is strongly preferred.
Key Responsibilities
- Handle escalated L2 support tickets from the L1 team — GPU workload failures, container issues, networking problems, account infrastructure, and host-side configuration
- Diagnose and resolve issues across Ubuntu, Docker, NVIDIA CUDA/GPU drivers, and virtualization environments (KVM)
- Troubleshoot network-layer issues: VLAN, DNS, DHCP, VPN, NAT, firewall rules, and connectivity failures on host machines
- Investigate performance issues — GPU utilization, container resource constraints, thermal throttling, driver conflicts, disk I/O bottlenecks
- Write and maintain internal runbooks, escalation guides, and knowledge base articles to reduce repeat escalations
- Collaborate with the engineering team and host support team to flag and document systemic or recurring platform issues
- Assist clients and hosts working with AI frameworks (TensorFlow, PyTorch) and GPU-accelerated workloads
- Build and maintain internal diagnostic and automation tools using AI-assisted development (Claude Code or similar)
- Provide occasional coverage for L1 overflow during peak periods or incidents
You Are
- Fluent in Linux — you navigate systems, read logs, and solve problems from the command line without hesitation
- Methodical and thorough. You gather data, dig into root causes, and don't settle for surface-level fixes.
- A self-starter who can manage a queue of complex tickets with minimal supervision
- Adaptable to flexible hours including occasional weekend coverage
- A clear written communicator — able to explain technical findings to clients and write useful internal documentation
- Genuinely curious about AI infrastructure, GPU computing, and distributed systems
Must-haves
- Solid Linux SysOps experience — Ubuntu Server, RHEL/CentOS, Debian; comfortable with systems, networking, storage, and permissions
- Proficiency with Docker — container debugging, Docker Compose, image management, cgroup resource limits, Docker storage/filesystem management
- Experience with virtualization — Proxmox VE, VMware, or similar hypervisors; provisioning and troubleshooting VMs
- Networking fundamentals — VLAN, DNS, DHCP, NAT, VPN, firewall rules, and general L2/L3 troubleshooting
- Scripting in Python and Bash for automation and diagnostic tooling
- Proficiency with AI-assisted development tools (Claude Code, Cursor, Copilot, or similar) — you actively use AI to write scripts, build diagnostic tooling, and work faster
- Prior experience in a technical support, sysadmin, or infrastructure operations role
- Strong English written communication — clear, professional, and technically precise
Nice-to-haves
- Hands-on experience with NVIDIA GPU drivers, CUDA, and GPU workload troubleshooting
- Familiarity with AI/ML frameworks (TensorFlow, PyTorch) and running GPU-accelerated containers
- Monitoring and observability experience (Prometheus, Grafana)
- Relevant certifications: RHCSA, CompTIA Linux+, or similar
- Knowledge of the Vast.ai platform as a client or host
Benefits
- Competitive salary · 100% salary during probation
- Performance Bonus
- Annual leave as per Vietnamese regulations
- Health insurance
- Reimbursement for approved work expenses
- Fast-growing, ambitious startup environment with growth into senior or leadership tracks