Overview
We are supporting client delivery by running GPU-enabled Kubernetes and Linux compute infrastructure optimized for AI initiatives and Volcano-driven scheduling. You will implement automation in Python and UNIX Shell, administer Kubernetes resources such as persistent volume claims (PVCs), NFS storage, and quotas, and work with researchers to streamline their workflows. Apply now.
Responsibilities
- Set up and maintain GPU-enabled Kubernetes clusters alongside standalone Linux compute environments with stable scheduling and high performance
- Manage Volcano scheduling workflows, including queue setup, Pod execution, GPU allocation, and enforcement of namespace quotas
- Handle Kubernetes administration across namespaces, RBAC, resource quotas, and workload isolation strategies
- Build and support Python and Shell scripts that automate job submission, resource provisioning, and system reporting
- Work with orchestration, optimization, and observability teams to improve scheduling efficiency, utilization, and researcher workflows
- Monitor infrastructure health and resource utilization, and supply data for optimization and reporting
- Recommend improvements to infrastructure, tooling, and automation workflows that enhance performance, scalability, and usability
- Support operational processes that keep researcher experiences smooth across AI and computational workloads
Requirements
- At least 3 years of experience in DevOps or infrastructure engineering roles supporting complex, large-scale environments
- Expert proficiency in Kubernetes administration and orchestration, including management of namespaces, Pod scheduling and distribution, persistent volume claims (PVCs), Network File System (NFS) storage, and resource quotas
- Hands-on experience with Volcano scheduler for GPU job execution, including queue configuration, workload prioritization, and integration with Kubernetes
- Proven experience managing GPU cluster environments, both within Kubernetes and on standalone Linux compute nodes, to support high-performance computing workloads
- Advanced Python scripting skills for automating infrastructure tasks, job submissions, and system reporting
- Proficiency in UNIX Shell scripting (e.g., Bash) for system automation and operational efficiency
- Strong Linux system administration skills, including troubleshooting, performance tuning, and configuration management for compute environments
- Solid understanding of infrastructure automation and orchestration concepts and tooling to enable scalable and reliable operations
- Fluent English communication skills (spoken and written) for direct client interaction and collaboration with cross-functional teams
Nice to have
- Experience with Helm package management for deploying and managing Kubernetes applications
- Familiarity with monitoring and observability solutions, particularly Prometheus, Grafana, and Loki, for infrastructure health and performance tracking
- Hands-on experience with Infrastructure as Code tools, such as Terraform, for automated provisioning and management of cloud resources
- Exposure to multi-cloud Kubernetes environments, including Amazon EKS and Google GKE, for broader orchestration experience
- Azure networking skills, including VPN configuration, ExpressRoute setup, and network security management, to support secure and scalable cloud deployments
- Familiarity with AI-assisted coding tools (e.g., GitHub Copilot, ChatGPT, Claude) to enhance development productivity and code quality
- Experience with hybrid (cloud and on-premises) scheduling and resource optimization to support flexible and efficient compute environments
Benefits
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn