Overview

We are hiring a Site Reliability Engineer to join the LatAm portion of our globally distributed DPoD SRE team. Our team operates on a 24x5 follow-the-sun model, with the LatAm region covering business hours for the Americas and contributing to weekend and holiday on-call rotations. This role is ideal for an engineer who thrives in production environments, enjoys solving complex reliability challenges on cloud-native infrastructure, and wants to help shape the operational excellence of a platform used by both internal teams and external customers.

Responsibilities

  • Operate, monitor and troubleshoot production workloads running on Azure, including AKS clusters, virtual machines, networking and storage components
  • Respond to incidents during shift hours and on-call rotations, drive resolution, lead post-incident reviews and implement preventive measures
  • Build and maintain CI/CD pipelines in Azure DevOps to support reliable, repeatable deployments
  • Design, implement and maintain observability solutions including dashboards, alerts, log pipelines and SLI/SLO metrics that improve service reliability and operational visibility
  • Automate repetitive operational tasks ("toil") using scripting languages such as Python and Bash
  • Collaborate with engineering, product and support teams across regions to improve system reliability, scalability and performance
  • Contribute to runbooks, knowledge base articles and operational documentation
  • Participate actively in the continuous improvement of the team's processes, tooling and incident management practices

Requirements

  • 2+ years of experience in DevOps or Site Reliability Engineering
  • Hands-on experience operating workloads in Microsoft Azure (compute, networking, identity, storage)
  • Practical experience with Azure DevOps for CI/CD pipelines and repository management
  • Strong Linux administration and troubleshooting skills in production environments
  • Proficiency in at least one scripting language for automation purposes
  • Demonstrated experience applying Site Reliability Engineering principles (SLIs/SLOs, error budgets, toil reduction, automation, blameless postmortems)
  • Strong systematic troubleshooting skills across application, infrastructure and network layers
  • English proficiency at B2 level or higher

Nice to have

  • Skills in Bash scripting for systems automation and operational tooling
  • Hands-on experience with Azure Kubernetes Service (AKS) in production
  • Background in the Elastic Stack (Elasticsearch, Kibana) for logging and observability
  • Familiarity with formal Incident Management practices and ITSM frameworks (e.g., ITIL)
  • Expertise in configuring and managing NFS or other network-attached storage solutions
  • Proficiency in Python development for automation, observability tooling or platform integrations
  • Ability to define and manage Service Level Indicators and Service Level Objectives for a customer-facing service

Benefits

  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Healthcare benefits
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling and certification courses
  • Unlimited access to the LinkedIn Learning library and 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn