Overview

We are seeking a highly skilled Network Specialist to drive the simulation, evaluation, and optimization of advanced network infrastructures, including routers, switches, cellular networks, and virtualized environments. This role focuses on developing innovative solutions for distributed systems and networking technologies while leveraging expertise in C/C++ programming. The successful candidate will integrate hardware and software components, stay at the forefront of technological advancements, and ensure the seamless operation of network infrastructure simulations. This position also emphasizes contributions to system design, optimization, and detailed technical documentation to support AI-driven applications and large-scale distributed systems.

Responsibilities

  • Design and implement simulation models for networking hardware and infrastructure using Discrete Event Simulation (DES) tools (e.g., OMNeT++, ns-3, QualNet, DONS). Develop and improve the DES software solution for data center networks, focusing on high-speed networks with very high throughput and very low latency
  • Identify congestion bottlenecks and optimal link placements for multi-GPU clusters
  • Integrate hardware and software components to enable seamless operation of distributed networking systems and AI-based applications
  • Stay informed about cutting-edge advancements in networking (e.g., RDMA, Ultra Ethernet, Software-Defined Networking, congestion control), AI technologies, distributed systems, and datacenter hardware. Apply these advancements to drive innovation in simulations
  • Implement datacenter switch behaviors in ns-3 (buffer management, packet scheduling, priority queues) to accurately model vendor-specific hardware
  • Model LLM training (e.g., GPT-scale models) in ASTRA-sim across hundreds to thousands of GPUs, predicting training iteration time based on network topology and collective communication patterns
  • Develop and evaluate network topologies, collective operations (MPI/NCCL/xCCL libraries), and distributed networking approaches to optimize performance and scalability
  • Research and implement emerging technologies like Ultra Ethernet, in-network computing, or congestion control innovations in the simulator
  • Create and maintain comprehensive technical documentation, including design specifications, test plans, user manuals, and process workflows. Collaborate with other engineers and stakeholders to ensure project success

Requirements

  • At least 3 years of experience in C/C++ programming with a focus on performance-oriented system design for networking environments
  • Strong background in distributed networking, distributed systems, and underlying hardware components
  • Hands-on experience with Discrete Event Simulation tools such as OMNeT++, ns-3, QualNet, or DONS
  • Comprehensive knowledge of general networking concepts, including congestion control, software-defined networking (SDN), remote direct memory access (RDMA), and Ultra Ethernet
  • Familiarity with collective operations and libraries like MPI, NCCL, or other CCL frameworks
  • Understanding of network topologies and their applications in large-scale distributed systems
  • Competence in additional programming languages (e.g., Python, Java, or scripting languages) for broader versatility
  • English proficiency at a B2 level to ensure effective communication and documentation

Nice to have

  • Prior experience developing AI software solutions and integrating them into high-performance networking environments
  • Knowledge of DevOps/MLOps tools and methodologies for automating deployment pipelines and scaling solutions
  • Familiarity with GPU programming, CUDA kernels, or other parallel computing frameworks
  • Excellent problem-solving capabilities in performance optimization at the intersection of hardware, software, and networking

Czech Republic

  • Opportunity to work in a fast-paced, agile, software engineering culture
  • Comfortable modern office in Prague 7, with hybrid or fully remote work supported
  • Benefit program (5 weeks of vacation, paid sick days, paid days off for special occasions, meal vouchers, flexi pass, Prague city public transport annual coupon, multisport cards, optional contribution to pension fund, health insurance for family member)
  • EPAM Employee Stock Purchase Plan (ESPP) (subject to certain eligibility requirements)
  • English language courses
  • Czech language courses upon request
  • Referral bonuses for recommended candidates
  • Mobile phone tariff program for managerial-level candidates
  • Great learning and development opportunities, including in-house professional training, career advisory and coaching, sponsored professional certifications, well-being programs, LinkedIn Learning Solutions and much more

Slovakia

  • Opportunity to work in a fast-paced, agile, software engineering culture
  • Benefit program (5 weeks of vacation, 5 paid sick days, meal vouchers, cafeteria and recreation bonuses, reimbursement of glasses, contribution to pension fund)
  • Referral bonuses for recommended candidates
  • English language courses
  • Great learning and development opportunities, including in-house professional training, career advisory and coaching, sponsored professional certifications, well-being programs, LinkedIn Learning Solutions and much more

Czech Republic (Remote)

The remote work option is available to candidates residing and working within the Czech Republic.

Czech Republic (Benefits Eligibility)

Certain benefits and perks may be subject to eligibility requirements and may be available only after you have passed your probationary period.

Slovakia (Benefits Eligibility)

Certain benefits and perks may be subject to eligibility requirements and may be available only after you have passed your probationary period.