We are looking for a highly specialized AI Infrastructure Network Engineer to design, implement, and optimize the high-speed data fabric that powers our supercomputing and AI clusters. You will be responsible for the low-latency, high-throughput interconnects that allow thousands of GPUs to work as a single unit. Your expertise in InfiniBand (IB), RDMA, and advanced network topologies will be critical in scaling our AI training and inference capabilities.
KEY RESPONSIBILITIES
- Fabric Design & Architecture: Design and scale high-performance InfiniBand (IB) fabrics using advanced topologies such as Fat-Tree, Dragonfly, and Torus to support massive AI workloads.
- Interconnect Optimization: Manage and optimize NVLink (NVL) domains and multi-GPU communication across nodes to ensure maximum throughput and minimal collective communication overhead.
- High-Speed Data Transmission: Implement and fine-tune RDMA (Remote Direct Memory Access), including RoCE and InfiniBand Verbs, to reduce CPU overhead and latency in data transfers.
- Supercomputer Networking: Configure and maintain the backend Compute Fabric specifically tailored for distributed deep learning and large-scale parallel processing.
- Performance Tuning: Monitor and troubleshoot congestion, adaptive routing, and quality of service (QoS) within the IB fabric to prevent bottlenecks during large-scale model training.
- Collaboration: Work closely with AI Systems Engineers to align network performance with the requirements of frameworks like PyTorch and distributed training libraries.
REQUIRED QUALIFICATIONS
- Expertise in HPC Networking: Deep understanding of data transmission mechanics within supercomputers and AI clusters.
- Network Topologies: Practical experience or strong theoretical knowledge of Fat-Tree, Dragonfly, and SlimFly architectures.
- Protocol Mastery: Advanced knowledge of the InfiniBand stack, RDMA, and Ethernet-based high-speed networking.
- Hardware Knowledge: Familiarity with NVIDIA/Mellanox Quantum switches, ConnectX NICs, and NVLink/NVSwitch technologies.
- Systems Proficiency: Strong Linux networking skills, including experience with OFED (OpenFabrics Enterprise Distribution) and subnet managers.
- Education: No degree required, provided you can demonstrate your knowledge and value; relevant experience in AI infrastructure or honors programs is highly valued.
PREFERRED SKILLS
- Experience in Fintech or large-scale AI production environments.
- Knowledge of GPU-aware MPI and collective communication libraries such as NCCL.
- Experience managing networking for NVIDIA Jetson or GPU clusters.
BENEFITS AND PERKS
Salary & Allowances
- 13th-month salary with annual performance bonus, project incentives, and sales incentives (depending on position)
- Lunch allowance: 730,000 VND/month
- Special occasion bonus: 3,000,000 - 5,000,000 VND/year
- Annual leave: up to 20 days/year (based on level)
- Health: Social insurance, premium health insurance, yearly health check
- Laptop, monitor, and other facilities, accounts, and tools needed for work
Career Growth
- Yearly salary review and promotion
- Diverse career paths: management or expert track, with function rotation opportunities
- Free learning resources on the Udemy, Coursera, and O'Reilly platforms; internal workshops, certification sponsorship, and exclusive mentoring from C-level leaders
- Recognition and awards at team and organizational levels.
Working Environment
- Open, collaborative workspace that fosters both individual focus and teamwork
- Young and dynamic working atmosphere
- Unwind zones: gaming, table tennis, yoga, gyms, bathrooms, sleep corner.
- Quarterly and yearly team-building and engaging internal events.