Overview

We are looking for a Lead HPC Network Engineer to drive the strategy, architecture, and engineering excellence behind advanced AI, research, and Kubernetes-based GPU infrastructure for a major global technology client.

The role focuses on defining the technical vision, leading architecture decisions, and setting engineering standards for high-performance network fabrics supporting large-scale LLM and distributed AI workloads, including InfiniBand/RDMA, high-speed Ethernet, Kubernetes networking, host-side GPU networking, SmartNIC/DPU technologies, and deep network observability. As a technical leader, you will mentor senior engineers, influence client roadmaps, and own end-to-end delivery of mission-critical network platforms.

The ideal candidate combines deep expertise across InfiniBand NDR/HDR and next-generation fabrics, RDMA/RoCE, NVIDIA/Mellanox networking, NCCL/MSCCL communication patterns, Linux host networking, PCIe/GPU/NIC topology, and Kubernetes networking for GPU clusters, with a proven track record of leading engineering teams and shaping large-scale HPC/AI network platforms.

Responsibilities

  • Own the architectural vision and long-term roadmap for high-performance InfiniBand/RDMA and Ethernet fabrics supporting large-scale GPU clusters and distributed AI/LLM workloads
  • Lead the design, evaluation, and selection of cluster network topologies, including Fat-tree, Clos, Rail-optimized, and Dragonfly, and define decision frameworks aligned with workload scale, performance, and cost constraints
  • Establish engineering standards and best practices for host-side networking, including NIC configuration, drivers, firmware, IRQ affinity, NUMA placement, PCIe topology, and GPU-to-NIC communication paths
  • Drive performance engineering initiatives for RDMA/RoCE, NCCL/MSCCL, and collective communication across multi-node GPU training workloads, and lead complex root-cause investigations
  • Define the reference architecture for Kubernetes networking on GPU clusters, including CNI plugins, network policies, multi-NIC pods, RDMA/GPU device plugins, and workload orchestration integration
  • Lead the adoption and integration strategy for SmartNIC/DPU technologies such as NVIDIA BlueField, including SR-IOV, offload, isolation, and security use cases
  • Shape the network observability strategy, defining metrics, dashboards, alerts, congestion detection, latency tracing, SLO frameworks, and capacity/performance analysis methodologies
  • Mentor and technically lead engineers across network, Kubernetes, storage, GPU infrastructure, observability, and AI research teams, driving cross-functional alignment and resolution of complex bottlenecks
  • Represent the engineering team in client and stakeholder forums, influencing technical direction, communicating trade-offs, and ensuring delivery of reliable, scalable network platforms

Requirements

  • 6+ years of experience in network, infrastructure, HPC, SRE, or similar engineering roles, including 3+ years focused on HPC, AI/ML, or GPU cluster networking and 1+ years of demonstrated technical leadership on large-scale initiatives
  • Proven experience leading the architecture and delivery of InfiniBand/RDMA fabrics, high-speed Ethernet, and Linux networking in performance-critical distributed compute environments
  • Deep expertise in host-side networking, including NICs, drivers, and firmware, along with PCIe topology, NUMA awareness, and GPU-to-NIC affinity, with the ability to set standards and guide other engineers
  • Strong understanding of distributed AI training communication patterns, including NCCL-based workloads and collective operations such as all-reduce and all-gather, and the ability to advise on workload-network co-design
  • Expert-level knowledge of Kubernetes and container networking for GPU or distributed workloads, including CNI concepts, network policies, multi-NIC patterns, and RDMA/GPU device integration
  • Advanced proficiency in RDMA networking concepts, including InfiniBand, RoCE/RoCEv2, GPUDirect-related patterns, congestion behavior, and performance tuning at scale
  • Mastery of Linux networking and host-side troubleshooting, including IRQ affinity, MTU, offloads, and performance diagnostics
  • Demonstrated ownership of network observability and performance management strategy, including telemetry, traffic monitoring, congestion detection, latency analysis, SLOs, capacity planning, and alerting/troubleshooting across L1-L4, fabric, and RDMA layers
  • Excellent leadership, mentoring, stakeholder management, and communication skills, with experience guiding engineering teams, influencing client architecture decisions, and driving consensus across researchers and platform stakeholders
  • Excellent written and verbal communication skills in English (B2+ level)

Nice to have

  • Hands-on experience with Azure networking, Ethernet, and GPGPU technologies at an architectural level
  • Strong command of Grafana, Prometheus, and network administration, with experience defining observability standards
  • Proven ability to design, develop, and govern Infrastructure as Code at scale
  • Proficiency in Python and UNIX shell scripting for automation, tooling, and enabling team productivity

Benefits

  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Healthcare benefits
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling and certification courses
  • Unlimited access to the LinkedIn Learning library and 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn