AI Infrastructure Network Engineer (HN/HCM)

  • Posted 8 hours ago

Job Description

We are looking for a highly specialized AI Infrastructure Network Engineer to design, implement, and optimize the high-speed data fabric that powers our supercomputing and AI clusters. You will be responsible for the low-latency, high-throughput interconnects that allow thousands of GPUs to work as a single unit. Your expertise in InfiniBand (IB), RDMA, and advanced network topologies will be critical in scaling our AI training and inference capabilities.

KEY RESPONSIBILITIES

  • Fabric Design & Architecture: Design and scale high-performance InfiniBand (IB) fabrics using advanced topologies such as Fat-Tree, Dragonfly, and Torus to support massive AI workloads.
  • Interconnect Optimization: Manage and optimize NVLink (NVL) domains and multi-GPU communication across nodes to ensure maximum throughput and minimal collective communication overhead.
  • High-Speed Data Transmission: Implement and fine-tune RDMA (Remote Direct Memory Access), including RoCE and InfiniBand Verbs, to reduce CPU overhead and latency in data transfers.
  • Supercomputer Networking: Configure and maintain the backend Compute Fabric specifically tailored for distributed deep learning and large-scale parallel processing.
  • Performance Tuning: Monitor and troubleshoot congestion, adaptive routing, and quality of service (QoS) within the IB fabric to prevent bottlenecks during large-scale model training.
  • Collaboration: Work closely with AI Systems Engineers to align network performance with the requirements of frameworks like PyTorch and distributed training libraries.

REQUIRED QUALIFICATIONS

  • Expertise in HPC Networking: Deep understanding of data transmission mechanics within supercomputers and AI clusters.
  • Network Topologies: Practical experience or strong theoretical knowledge of Fat-Tree, Dragonfly, and Slim Fly architectures.
  • Protocol Mastery: Advanced knowledge of the InfiniBand stack, RDMA, and Ethernet-based high-speed networking.
  • Hardware Knowledge: Familiarity with NVIDIA/Mellanox Quantum switches, ConnectX NICs, and NVLink/NVSwitch technologies.
  • Systems Proficiency: Strong Linux networking skills, including experience with OFED (OpenFabrics Enterprise Distribution) and subnet managers.
  • Education: Relevant experience in AI infrastructure or honors programs is highly valued. No degree is required, provided you can demonstrate your knowledge and value.

PREFERRED SKILLS

  • Experience in Fintech or large-scale AI production environments.
  • Knowledge of GPU-aware MPI and collective communication libraries (NCCL).
  • Experience managing networking for NVIDIA Jetson or GPU clusters.

BENEFITS AND PERKS

Salary & Allowances

  • 13-month salary with annual performance bonus, project incentives, sales incentives (based on position)
  • Lunch allowance: 730.000 VND/month
  • Special occasion bonus: 3.000.000 - 5.000.000 VND/year
  • Annual leave: Up to 20 days/year (based on level)
  • Health: Social insurance, premium health insurance, yearly health check
  • Laptop, monitor, and any other equipment, accounts, and tools needed for work

Career Growth

  • Yearly salary review and promotion
  • Diverse career paths: management or expert tracks, with opportunities to rotate between functions
  • Free learning resources on Udemy, Coursera, and O'Reilly; internal workshops, certification sponsorship, and exclusive mentoring from C-level executives
  • Recognition and awards at team and organizational levels.

Working Environment

  • Open, collaborative workspace that fosters both individual focus and teamwork
  • Young, dynamic, and collaborative working atmosphere
  • Unwind zones: gaming, table tennis, yoga, gyms, bathrooms, sleep corner.
  • Quarterly and yearly team-building and engaging internal events.

More Info

Job ID: 145208905