Overview
We are seeking a highly skilled DevOps Engineer to join EPAM’s delivery team.
In this client-facing, delivery-focused role, you will be responsible for the hands-on implementation, automation, and optimization of Kubernetes-based orchestration platforms—including Volcano for GPU-enabled workloads—and the Linux infrastructure supporting advanced AI and research initiatives. You will leverage deep expertise in Kubernetes administration, workload scheduling, quota management, and automation using Python and Shell scripting to deliver efficient, reliable, and scalable compute environments. You will work closely with other engineers and researchers to ensure a seamless, high-quality infrastructure experience.
Responsibilities
- Set up, configure, and support GPU-enabled Kubernetes clusters and independent Linux compute systems to optimize workload scheduling and maximize system efficiency
- Oversee Volcano job scheduling, handling queue creation, Pod management, GPU resource assignment, and namespace quota controls
- Manage all aspects of Kubernetes environments, including namespaces, RBAC, resource quotas, and strategies for workload isolation
- Write and maintain automation scripts in Python and Shell to simplify job submissions, resource allocation, and system monitoring
- Work alongside orchestration, optimization, and observability teams to boost scheduling performance, resource usage, and researcher productivity
- Track infrastructure status and resource consumption, sharing insights and data to drive optimization and reporting
- Propose and implement enhancements to infrastructure, tools, and automation processes to improve scalability, performance, and user experience
- Support operational workflows that provide researchers with a smooth and effective environment for AI and computational projects
Requirements
- Minimum of 2 years of experience in DevOps or infrastructure engineering roles managing complex, large-scale systems
- Deep knowledge of Kubernetes administration and orchestration, covering namespaces, Pod scheduling and balancing, PersistentVolumeClaims (PVCs), Network File System (NFS) storage, and resource quota controls
- Practical experience with the Volcano scheduler for GPU job management, including queue setup, workload prioritization, and Kubernetes integration
- Demonstrated ability to operate GPU cluster environments in both Kubernetes and standalone Linux setups for high-performance computing
- Advanced skills in Python scripting for automating infrastructure operations, job handling, and system monitoring
- Proficiency in UNIX Shell scripting (such as Bash) for automating system tasks and improving operational workflows
- Strong background in Linux system administration, including troubleshooting, optimizing performance, and managing configurations
- Thorough understanding of automation and orchestration tools and concepts to support scalable and dependable infrastructure
- Excellent English communication skills, both spoken and written, for direct client engagement and teamwork with cross-functional groups
Nice to have
- Experience using Helm for packaging and managing Kubernetes applications
- Knowledge of monitoring and observability tools like Prometheus, Grafana, and Loki for tracking infrastructure health and performance
- Familiarity with Infrastructure as Code solutions such as Terraform for automating cloud resource provisioning and management
- Background working with managed Kubernetes platforms across multiple clouds, including Amazon EKS and Google Kubernetes Engine (GKE), for expanded orchestration capabilities
- Skills in Azure networking, including VPN setup, ExpressRoute configuration, and network security for robust cloud deployments
- Experience with AI-powered coding assistants (e.g., GitHub Copilot, ChatGPT, Claude) to improve development efficiency and code quality
- Understanding of hybrid scheduling and resource optimization across cloud and on-premises environments for flexible compute solutions
Benefits
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn