Overview
We are hiring a Site Reliability Engineer to join the LatAm portion of our globally distributed DPoD SRE team. Our team operates on a 24x5 follow-the-sun model, with the LatAm region covering business hours for the Americas and contributing to weekend and holiday on-call rotations. This role is ideal for an engineer who thrives in production environments, enjoys solving complex reliability challenges on cloud-native infrastructure, and wants to help shape the operational excellence of a platform used by both internal teams and external customers.
Responsibilities
- Operate, monitor and troubleshoot production workloads running on Azure, including AKS clusters, virtual machines, networking and storage components
- Respond to incidents during shift hours and on-call rotations, drive resolution, lead post-incident reviews and implement preventive measures
- Build and maintain CI/CD pipelines in Azure DevOps to support reliable, repeatable deployments
- Design, implement and maintain observability solutions including dashboards, alerts, log pipelines and SLI/SLO metrics that improve service reliability and operational visibility
- Automate repetitive operational tasks ("toil") using scripting languages such as Python and Bash
- Collaborate with engineering, product and support teams across regions to improve system reliability, scalability and performance
- Contribute to runbooks, knowledge base articles and operational documentation
- Participate actively in the continuous improvement of the team's processes, tooling and incident management practices
Requirements
- 2+ years of experience in DevOps or Site Reliability Engineering
- Hands-on experience operating workloads in Microsoft Azure (compute, networking, identity, storage)
- Practical experience with Azure DevOps for CI/CD pipelines and repository management
- Strong Linux administration and troubleshooting skills in production environments
- Proficiency in at least one scripting language for automation purposes
- Demonstrated experience applying Site Reliability Engineering principles (SLIs/SLOs, error budgets, toil reduction, automation, blameless postmortems)
- Strong systematic troubleshooting skills across application, infrastructure and network layers
- English proficiency at B2 level or higher
Nice to have
- Skills in Bash scripting for systems automation and operational tooling
- Hands-on experience with Azure Kubernetes Service (AKS) in production
- Background in the Elastic Stack (Elasticsearch, Kibana) for logging and observability
- Familiarity with formal Incident Management practices and ITSM frameworks (e.g., ITIL)
- Expertise in configuring and managing NFS or other network-attached storage solutions
- Proficiency in Python development for automation, observability tooling or platform integrations
- Ability to define and manage Service Level Indicators and Service Level Objectives for a customer-facing service
Benefits
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn