Overview

We are looking for a Chief HPC Network Engineer to define the global technical strategy, reference architecture, and engineering vision behind advanced AI, research, and Kubernetes-based GPU infrastructure for a major global technology client.

The role focuses on establishing the long-term technical direction, governing architecture decisions across multiple programs, and setting organization-wide engineering standards for high-performance network fabrics that support massive-scale LLM and distributed AI workloads. This scope spans InfiniBand/RDMA, high-speed Ethernet, Kubernetes networking, host-side GPU networking, SmartNIC/DPU technologies, and deep network observability. As a principal technical authority, you will shape engineering culture, mentor lead and principal engineers, influence executive client roadmaps, and own end-to-end governance of mission-critical network platforms across the portfolio.

The ideal candidate combines authoritative expertise across InfiniBand NDR/HDR and next-generation fabrics, RDMA/RoCE, NVIDIA/Mellanox networking, NCCL/MSCCL communication patterns, Linux host networking, PCIe/GPU/NIC topology, and Kubernetes networking for GPU clusters, with a proven track record of leading multiple engineering teams, defining technical strategy at the program level, and shaping industry-leading HPC/AI network platforms.

Responsibilities

  • Define and own the multi-year strategic vision and architectural roadmap for high-performance InfiniBand/RDMA and Ethernet fabrics powering massive-scale GPU clusters and distributed AI/LLM workloads across the client portfolio
  • Govern the design, evaluation, and standardization of cluster network topologies, including Fat-tree, Clos, rail-optimized, and Dragonfly, and establish enterprise-wide decision frameworks aligned with workload scale, performance, and cost constraints
  • Establish and enforce organization-wide engineering standards and best practices for host-side networking, including NIC configuration, drivers, firmware, IRQ affinity, NUMA placement, PCIe topology, and GPU-to-NIC communication paths
  • Set the strategic direction for performance engineering across RDMA/RoCE, NCCL/MSCCL, and collective communication for multi-node GPU training workloads, and oversee resolution of the most complex systemic performance issues
  • Define the canonical reference architecture for Kubernetes networking on GPU clusters, including CNI plugins, network policies, multi-NIC pods, RDMA/GPU device plugins, and workload orchestration integration, and drive its adoption across programs
  • Own the strategy and governance for SmartNIC/DPU technologies such as NVIDIA BlueField, including SR-IOV, offload, isolation, and security use cases, and align adoption with the broader infrastructure roadmap
  • Define the enterprise observability strategy for network platforms, governing metrics, dashboards, alerts, congestion detection, latency tracing, SLO frameworks, and capacity/performance analysis methodologies
  • Provide technical leadership and mentorship to lead and principal engineers across network, Kubernetes, storage, GPU infrastructure, observability, and AI research teams, building the talent pipeline and driving cross-functional alignment at scale
  • Act as the principal technical authority in executive client and stakeholder forums, shaping strategic technical direction, negotiating trade-offs at the program level, and ensuring delivery of reliable, scalable network platforms across multiple engagements
  • Contribute to the broader engineering community through thought leadership, internal practice development, and representation of the company at industry events

Requirements

  • 8+ years of experience in network, infrastructure, HPC, SRE, or similar engineering roles, with 4+ years focused on HPC, AI/ML, or GPU cluster networking, including 2+ years of demonstrated technical leadership at the program or portfolio level
  • Proven experience defining the architecture and governing delivery of InfiniBand/RDMA fabrics, high-speed Ethernet, and Linux networking in large-scale, performance-critical distributed compute environments
  • Authoritative expertise in host-side networking, including NICs, drivers, and firmware, along with PCIe topology, NUMA awareness, and GPU-to-NIC affinity, with proven ability to set enterprise-wide standards and uplift engineering organizations
  • Deep understanding of distributed AI training communication patterns, including NCCL-based workloads and collective operations such as all-reduce and all-gather, with the ability to drive workload-network co-design strategy at scale
  • Authoritative knowledge of Kubernetes and container networking for GPU or distributed workloads, including CNI concepts, network policies, multi-NIC patterns, and RDMA/GPU device integration, with experience defining reference architectures
  • Expert-level command of RDMA networking concepts, including InfiniBand, RoCE/RoCEv2, GPUDirect-related patterns, congestion behavior, and performance tuning at very large scale
  • Mastery of Linux networking and host-side troubleshooting, including IRQ affinity, MTU, offloads, and performance diagnostics, with the ability to define diagnostic methodologies for the broader engineering organization
  • Demonstrated ownership of enterprise network observability and performance management strategy, including telemetry, traffic monitoring, congestion detection, latency analysis, SLOs, capacity planning, and alerting/troubleshooting across L1-L4, fabric, and RDMA layers
  • Outstanding leadership, mentoring, stakeholder management, and executive communication skills, with proven experience leading multiple engineering teams, influencing C-level client architecture decisions, and driving consensus across researchers, platform stakeholders, and executive sponsors
  • English language proficiency at an Advanced level (C1)

Nice to have

  • Hands-on architectural and strategic experience with Azure Networking, Ethernet, and GPGPU/GPU technologies
  • Authoritative command of Grafana, Prometheus, and network administration, with experience defining observability standards across an engineering organization
  • Proven ability to define, govern, and scale Infrastructure as Code strategy and practices across multiple teams and programs
  • Proficiency in Python and UNIX shell scripting for automation, tooling, and enabling organization-wide engineering productivity
  • Track record of thought leadership through conference talks, publications, patents, or open-source contributions in the HPC/AI networking domain

Benefits

  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Healthcare benefits
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling and certification courses
  • Unlimited access to the LinkedIn Learning library and 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn