We are looking for a highly specialized AI Infrastructure Network Engineer to design, implement, and optimize the high-speed data fabric that powers our supercomputing and AI clusters. You will be responsible for the low-latency, high-throughput interconnects that allow thousands of GPUs to work as a single unit. Your expertise in InfiniBand (IB), RDMA, and advanced network topologies will be critical in scaling our AI training and inference capabilities.
KEY RESPONSIBILITIES
- Fabric Design & Architecture: Design and scale high-performance InfiniBand (IB) fabrics using advanced topologies such as Fat-Tree, Dragonfly, and Torus to support massive AI workloads.
- Interconnect Optimization: Manage and optimize NVLink (NVL) domains and multi-GPU communication across nodes to ensure maximum throughput and minimal collective communication overhead.
- High-Speed Data Transmission: Implement and fine-tune RDMA (Remote Direct Memory Access), including RoCE and InfiniBand Verbs, to reduce CPU overhead and latency in data transfers.
- Supercomputer Networking: Configure and maintain the backend Compute Fabric specifically tailored for distributed deep learning and large-scale parallel processing.
- Performance Tuning: Monitor and troubleshoot congestion, adaptive routing, and quality of service (QoS) within the IB fabric to prevent bottlenecks during large-scale model training.
- Collaboration: Work closely with AI Systems Engineers to align network performance with the requirements of frameworks like PyTorch and distributed training libraries.
REQUIRED QUALIFICATIONS
- Expertise in HPC Networking: Deep understanding of data transmission mechanics within supercomputers and AI clusters.
- Network Topologies: Practical experience or strong theoretical knowledge of Fat-Tree, Dragonfly, and SlimFly architectures.
- Protocol Mastery: Advanced knowledge of the InfiniBand stack, RDMA, and Ethernet-based high-speed networking.
- Hardware Knowledge: Familiarity with NVIDIA/Mellanox Quantum switches, ConnectX NICs, and NVLink/NVSwitch technologies.
- Systems Proficiency: Strong Linux networking skills, including experience with OFED (OpenFabrics Enterprise Distribution) and subnet managers.
- Education: No degree required, provided you can demonstrate your knowledge and value; relevant experience in AI infrastructure or honors programs is highly valued.
PREFERRED SKILLS
- Experience in Fintech or large-scale AI production environments.
- Knowledge of GPU-aware MPI and collective communication libraries such as NCCL.
- Experience managing networking for NVIDIA Jetson or GPU clusters.
BENEFITS AND PERKS
Salary & Allowances
- 13th-month salary with annual performance bonus, project incentives, and sales incentives (depending on position)
- Lunch allowance: 730,000 VND/month
- Special occasion bonus: 3,000,000 - 5,000,000 VND/year
- Annual leave: up to 20 days/year (based on level)
- Health: Social insurance, premium health insurance, yearly health check
- Laptop, monitor, and other facilities, accounts, and tools needed for work
Career Growth
- Yearly salary review and promotion
- Diverse career paths: management or expert track, with function rotation opportunities
- Free learning resources on the Udemy, Coursera, and O'Reilly platforms; internal workshops, certification sponsorship, and exclusive mentoring from C-level leaders
- Recognition and awards at team and organizational levels.
Working Environment
- Open, collaborative workspace that fosters both individual focus and teamwork
- Young and dynamic working atmosphere
- Unwind zones: gaming, table tennis, yoga, gyms, bathrooms, sleep corner.
- Quarterly and yearly team-building and engaging internal events.