Overview

We are seeking a Senior HPC Network Engineer to support advanced AI, research, and Kubernetes-based GPU infrastructure for a major global technology client.

The role focuses on architecting, operating, and optimizing high-performance network fabrics for large-scale LLM and distributed AI workloads, including InfiniBand/RDMA, high-speed Ethernet, Kubernetes networking, host-side GPU networking, SmartNIC/DPU technologies, and deep network observability.

The ideal candidate has strong hands-on experience with InfiniBand NDR/HDR and next-generation fabrics, RDMA/RoCE, NVIDIA/Mellanox networking, NCCL/MSCCL communication patterns, Linux host networking, PCIe/GPU/NIC topology, and Kubernetes networking for GPU clusters.

Responsibilities

  • Architect, operate, and troubleshoot high-performance InfiniBand/RDMA and Ethernet fabrics for large-scale GPU clusters and distributed AI/LLM workloads
  • Design and evaluate cluster network topologies, including fat-tree, Clos, rail-optimized, and dragonfly, based on workload scale and performance needs
  • Optimize host-side networking, including NIC configuration, drivers, firmware, IRQ affinity, NUMA placement, PCIe topology, and GPU-to-NIC communication paths
  • Tune and troubleshoot RDMA/RoCE, NCCL/MSCCL, and collective communication performance for multi-node GPU training workloads
  • Design and maintain Kubernetes networking for GPU clusters, including CNI plugins, network policies, multi-NIC pods, RDMA/GPU device plugins, and workload orchestration integration
  • Support SmartNIC/DPU technologies such as NVIDIA BlueField where applicable, including SR-IOV, offload, isolation, and security use cases
  • Build and improve network observability, including metrics, dashboards, alerts, congestion detection, latency tracing, SLO reporting, and capacity/performance analysis
  • Collaborate with Kubernetes, storage, GPU infrastructure, observability, and AI research teams to resolve network and I/O bottlenecks and improve workload reliability

Requirements

  • 5+ years of experience in network, infrastructure, HPC, SRE, or similar engineering roles, with 2+ years focused on HPC, AI/ML, or GPU cluster networking
  • Proven hands-on experience with InfiniBand/RDMA fabrics, high-speed Ethernet, and Linux networking in performance-critical distributed compute environments
  • Understanding of host-side networking, including NICs, drivers, and firmware, along with PCIe topology, NUMA awareness, and GPU-to-NIC affinity
  • Knowledge of distributed AI training communication patterns, including NCCL-based workloads and collective operations such as all-reduce and all-gather
  • Expertise in Kubernetes and container networking for GPU or distributed workloads, including CNI concepts, network policies, multi-NIC patterns, and RDMA/GPU device integration
  • Proficiency in RDMA networking concepts, including InfiniBand, RoCE/RoCEv2, GPUDirect-related patterns, congestion behavior, and performance tuning
  • Skills in Linux networking and host-side troubleshooting, including IRQ affinity, MTU, offloads, and performance diagnostics
  • Background in network observability and performance management: telemetry, traffic monitoring, congestion detection, latency analysis, SLOs, capacity planning, and alerting, plus troubleshooting across L1-L4, fabric, and RDMA layers
  • Strong troubleshooting, root-cause analysis, documentation, and communication skills for working with client engineering teams, researchers, and platform stakeholders
  • English proficiency at B2 (Upper-Intermediate) level or higher for effective communication

Nice to have

  • Familiarity with Azure Networking, Ethernet, and GPGPU/GPU technologies
  • Competency in Grafana, Prometheus, and network administration
  • Capability to develop and maintain Infrastructure as Code
  • Ability to use Python and UNIX shell scripting for automation and tooling

Benefits

  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Healthcare benefits
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling and certification courses
  • Unlimited access to the LinkedIn Learning library and 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn