Overview
We are looking for a Lead HPC Network Engineer to drive the strategy, architecture, and engineering excellence behind advanced AI, research, and Kubernetes-based GPU infrastructure for a major global technology client.
The role focuses on defining the technical vision, leading architecture decisions, and setting engineering standards for high-performance network fabrics supporting large-scale LLM and distributed AI workloads, including InfiniBand/RDMA, high-speed Ethernet, Kubernetes networking, host-side GPU networking, SmartNIC/DPU technologies, and deep network observability. As a technical leader, you will mentor senior engineers, influence client roadmaps, and own end-to-end delivery of mission-critical network platforms.
The ideal candidate combines deep expertise across InfiniBand NDR/HDR and next-generation fabrics, RDMA/RoCE, NVIDIA/Mellanox networking, NCCL/MSCCL communication patterns, Linux host networking, PCIe/GPU/NIC topology, and Kubernetes networking for GPU clusters, with a proven track record of leading engineering teams and shaping large-scale HPC/AI network platforms.
Responsibilities
- Own the architectural vision and long-term roadmap for high-performance InfiniBand/RDMA and Ethernet fabrics supporting large-scale GPU clusters and distributed AI/LLM workloads
- Lead the design, evaluation, and selection of cluster network topologies, including Fat-tree, Clos, Rail-optimized, and Dragonfly, and define decision frameworks aligned with workload scale, performance, and cost constraints
- Establish engineering standards and best practices for host-side networking, including NIC configuration, drivers, firmware, IRQ affinity, NUMA placement, PCIe topology, and GPU-to-NIC communication paths
- Drive performance engineering initiatives for RDMA/RoCE, NCCL/MSCCL, and collective communication across multi-node GPU training workloads, and lead complex root-cause investigations
- Define the reference architecture for Kubernetes networking on GPU clusters, including CNI plugins, network policies, multi-NIC pods, RDMA/GPU device plugins, and workload orchestration integration
- Lead the adoption and integration strategy for SmartNIC/DPU technologies such as NVIDIA BlueField, including SR-IOV, offload, isolation, and security use cases
- Shape the network observability strategy, defining metrics, dashboards, alerts, congestion detection, latency tracing, SLO frameworks, and capacity/performance analysis methodologies
- Mentor and technically lead engineers across network, Kubernetes, storage, GPU infrastructure, observability, and AI research teams, driving cross-functional alignment and resolution of complex bottlenecks
- Represent the engineering team in client and stakeholder forums, influencing technical direction, communicating trade-offs, and ensuring delivery of reliable, scalable network platforms
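Since the role involves defining decision frameworks for topologies such as Fat-tree and Clos against workload scale and cost, here is a minimal illustrative sketch of one such input: sizing a k-ary fat-tree built from k-port switches. The formulas are the standard fat-tree capacity properties (k^3/4 hosts, 5k^2/4 switches for an even radix k); the function names are hypothetical, not part of any named tool.

```python
# Sketch: capacity of a k-ary fat-tree built from k-port switches,
# one input to a topology decision framework. Radix k must be even.

def fat_tree_capacity(k: int) -> dict:
    """Return host and switch counts for a k-ary fat-tree (k even)."""
    if k % 2 != 0:
        raise ValueError("fat-tree radix k must be even")
    hosts = k ** 3 // 4        # k pods x (k/2 edge switches) x (k/2 hosts each)
    edge = agg = k * (k // 2)  # per tier: k pods x k/2 switches per pod
    core = (k // 2) ** 2
    return {"hosts": hosts, "edge": edge, "agg": agg, "core": core,
            "switches": edge + agg + core}

def min_radix_for(hosts_needed: int) -> int:
    """Smallest even switch radix whose fat-tree fits hosts_needed hosts."""
    k = 2
    while fat_tree_capacity(k)["hosts"] < hosts_needed:
        k += 2
    return k
```

In practice the radix is constrained to available switch SKUs (e.g. 32- or 64-port), and rail-optimized designs change the math, but a calculation of this shape is typically where a cost/scale trade-off analysis starts.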
Requirements
- 6+ years of experience in network, infrastructure, HPC, SRE, or similar engineering roles, with 3+ years focused on HPC, AI/ML, or GPU cluster networking, including 1+ years of demonstrated technical leadership on large-scale initiatives
- Proven experience leading the architecture and delivery of InfiniBand/RDMA fabrics, high-speed Ethernet, and Linux networking in performance-critical distributed compute environments
- Deep expertise in host-side networking, including NICs, drivers, and firmware, along with PCIe topology, NUMA awareness, and GPU-to-NIC affinity, with the ability to set standards and guide other engineers
- Strong understanding of distributed AI training communication patterns, including NCCL-based workloads and collective operations such as all-reduce and all-gather, and ability to advise on workload-network co-design
- Expert-level knowledge of Kubernetes and container networking for GPU or distributed workloads, including CNI concepts, network policies, multi-NIC patterns, and RDMA/GPU device integration
- Advanced proficiency in RDMA networking concepts, including InfiniBand, RoCE/RoCEv2, GPUDirect-related patterns, congestion behavior, and performance tuning at scale
- Mastery of Linux networking and host-side troubleshooting, including IRQ affinity, MTU, offloads, and performance diagnostics
- Demonstrated ownership of network observability and performance management strategy, including telemetry, traffic monitoring, congestion detection, latency analysis, SLOs, capacity planning, and alerting/troubleshooting across L1-L4, fabric, and RDMA layers
- Excellent leadership, mentoring, stakeholder management, and communication skills, with experience guiding engineering teams, influencing client architecture decisions, and driving consensus across researchers and platform stakeholders
- Excellent written and verbal communication skills in English (B2+ level)
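As a rough illustration of the workload-network co-design reasoning called for above, the sketch below estimates the per-rank traffic of a bandwidth-optimal ring all-reduce. It uses the standard result that each of N ranks sends (and receives) 2·(N−1)/N·S bytes for an S-byte buffer; the function names and the latency-free lower bound are illustrative assumptions, not an NCCL API.

```python
# Sketch: per-rank traffic and a bandwidth-only time lower bound for a
# ring all-reduce, useful when reasoning about fabric bandwidth needs.

def ring_allreduce_bytes_per_rank(payload_bytes: int, ranks: int) -> float:
    """Bytes each rank transmits (and receives) in a ring all-reduce."""
    if ranks < 2:
        return 0.0
    # Standard result: 2 * (N - 1) / N * S bytes per rank for N ranks.
    return 2 * (ranks - 1) / ranks * payload_bytes

def allreduce_time_lower_bound(payload_bytes: int, ranks: int,
                               link_gbps: float) -> float:
    """Bandwidth-only lower bound in seconds (ignores per-hop latency)."""
    bytes_on_wire = ring_allreduce_bytes_per_rank(payload_bytes, ranks)
    return bytes_on_wire * 8 / (link_gbps * 1e9)

# Example: syncing 1 GiB of gradients across 8 ranks over a 400 Gb/s NIC.
t = allreduce_time_lower_bound(1 << 30, 8, 400.0)
```

Real collectives add latency terms, chunking, and algorithm selection (ring vs. tree), so measured NCCL performance sits above this bound; the value of the sketch is as a sanity check when a measured number is far from it.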
Nice to have
- Hands-on experience with Azure networking, high-speed Ethernet, and GPU/GPGPU technologies at an architectural level
- Strong command of Grafana, Prometheus, and general network administration, with experience defining observability standards
- Proven ability to design, develop, and govern Infrastructure as Code at scale
- Proficiency in Python and UNIX shell scripting for automation, tooling, and enabling team productivity
Benefits
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn