Overview

We support client delivery by running GPU-enabled Kubernetes and Linux compute infrastructure optimized for AI initiatives and Volcano-driven scheduling. You will implement automation in Python and UNIX Shell, administer Kubernetes resources such as PVCs, NFS-backed storage, and quotas, and work with researchers to streamline their workflows. Apply now.

Responsibilities

  • Set up and maintain GPU-enabled Kubernetes clusters alongside standalone Linux compute environments with stable scheduling and high performance
  • Manage Volcano scheduling workflows, including queue setup, Pod execution, GPU allocation, and enforcement of namespace quotas
  • Administer Kubernetes namespaces, RBAC, resource quotas, and workload isolation strategies
  • Build and support Python and Shell scripts that automate job submission, resource provisioning, and system reporting
  • Work with orchestration, optimization, and observability teams to improve scheduling efficiency, utilization, and researcher workflows
  • Assess infrastructure health and resource utilization and contribute data for optimization and reporting requirements
  • Drive recommendations for improving infrastructure, tooling, and automation workflows to enhance performance, scalability, and usability
  • Support operational processes that keep the researcher experience smooth across AI and computational workloads

Requirements

  • At least 3 years of experience in DevOps or infrastructure engineering roles supporting complex, large-scale environments
  • Expert proficiency in Kubernetes administration and orchestration, including management of namespaces, Pod scheduling and distribution, persistent volume claims (PVC), Network File System (NFS) storage, and resource quotas
  • Hands-on experience with Volcano scheduler for GPU job execution, including queue configuration, workload prioritization, and integration with Kubernetes
  • Proven experience managing GPU cluster environments, both within Kubernetes and on standalone Linux compute nodes, to support high-performance computing workloads
  • Advanced Python scripting skills for automating infrastructure tasks, job submissions, and system reporting
  • Proficiency in UNIX Shell scripting (e.g., Bash) for system automation and operational efficiency
  • Strong Linux system administration skills, including troubleshooting, performance tuning, and configuration management for compute environments
  • Solid understanding of infrastructure automation and orchestration concepts and tooling to enable scalable and reliable operations
  • Fluent English communication skills (spoken and written) for direct client interaction and collaboration with cross-functional teams

Nice to have

  • Experience with Helm package management for deploying and managing Kubernetes applications
  • Familiarity with monitoring and observability solutions, particularly Prometheus, Grafana, and Loki, for infrastructure health and performance tracking
  • Hands-on experience with Infrastructure as Code tools, such as Terraform, for automated provisioning and management of cloud resources
  • Exposure to multi-cloud Kubernetes environments, including Amazon EKS and Google GKE, for broader orchestration experience
  • Azure networking skills, including VPN configuration, ExpressRoute setup, and network security management, to support secure and scalable cloud deployments
  • Familiarity with AI-assisted coding tools (e.g., GitHub Copilot, ChatGPT, Claude) to enhance development productivity and code quality
  • Experience with hybrid (cloud and on-premises) scheduling and resource optimization to support flexible and efficient compute environments

Benefits

  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Healthcare benefits
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling and certification courses
  • Unlimited access to the LinkedIn Learning library and 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn