Overview
We are seeking a Senior HPC Network Engineer to support advanced AI, research, and Kubernetes-based GPU infrastructure for a major global technology client.
The role focuses on architecting, operating, and optimizing high-performance network fabrics for large-scale LLM and distributed AI workloads, including InfiniBand/RDMA, high-speed Ethernet, Kubernetes networking, host-side GPU networking, SmartNIC/DPU technologies, and deep network observability.
The ideal candidate has strong hands-on experience with InfiniBand NDR/HDR and next-generation fabrics, RDMA/RoCE, NVIDIA/Mellanox networking, NCCL/MSCCL communication patterns, Linux host networking, PCIe/GPU/NIC topology, and Kubernetes networking for GPU clusters.
Responsibilities
- Architect, operate, and troubleshoot high-performance InfiniBand/RDMA and Ethernet fabrics for large-scale GPU clusters and distributed AI/LLM workloads
- Design and evaluate cluster network topologies, including fat-tree, Clos, rail-optimized, and dragonfly, based on workload scale and performance needs
- Optimize host-side networking, including NIC configuration, drivers, firmware, IRQ affinity, NUMA placement, PCIe topology, and GPU-to-NIC communication paths
- Tune and troubleshoot RDMA/RoCE, NCCL/MSCCL, and collective communication performance for multi-node GPU training workloads
- Design and maintain Kubernetes networking for GPU clusters, including CNI plugins, network policies, multi-NIC pods, RDMA/GPU device plugins, and workload orchestration integration
- Support SmartNIC/DPU technologies such as NVIDIA BlueField where applicable, including SR-IOV, offload, isolation, and security use cases
- Build and improve network observability, including metrics, dashboards, alerts, congestion detection, latency tracing, SLO reporting, and capacity/performance analysis
- Collaborate with Kubernetes, storage, GPU infrastructure, observability, and AI research teams to resolve network and I/O bottlenecks and improve workload reliability
Requirements
- 5+ years of experience in network, infrastructure, HPC, SRE, or similar engineering roles, with 2+ years focused on HPC, AI/ML, or GPU cluster networking
- Proven hands-on experience with InfiniBand/RDMA fabrics, high-speed Ethernet, and Linux networking in performance-critical distributed compute environments
- Understanding of host-side networking, including NICs, drivers, and firmware, along with PCIe topology, NUMA awareness, and GPU-to-NIC affinity
- Knowledge of distributed AI training communication patterns, including NCCL-based workloads and collective operations such as all-reduce and all-gather
- Expertise in Kubernetes and container networking for GPU or distributed workloads, including CNI concepts, network policies, multi-NIC patterns, and RDMA/GPU device integration
- Proficiency in RDMA networking concepts, including InfiniBand, RoCE/RoCEv2, GPUDirect-related patterns, congestion behavior, and performance tuning
- Skills in Linux networking and host-side troubleshooting, including IRQ affinity, MTU, offloads, and performance diagnostics
- Background in network observability and performance management: telemetry, traffic monitoring, congestion detection, latency analysis, SLOs, capacity planning, and alerting, plus troubleshooting across L1-L4, fabric, and RDMA layers
- Strong troubleshooting, root-cause analysis, documentation, and communication skills for working with client engineering teams, researchers, and platform stakeholders
- English proficiency at B2 (Upper-Intermediate) level or higher
Nice to have
- Familiarity with Azure Networking, Ethernet, and GPGPU/GPU technologies
- Competency in Grafana, Prometheus, and network administration
- Capability to develop and maintain Infrastructure as Code
- Ability to use Python and UNIX shell scripting for automation and tooling
Benefits
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn