Site Reliability Engineer

Remote

Middle

Argentina; Mexico

DevOps, incident management, Kubernetes, Python, CI/CD, Linux Administration, Tooling, Site Reliability Engineering, Site Reliability Engineer, observability, Linux, AKS, Azure DevOps, Cursor, Elastic Stack

Job description

Overview

We are hiring a Site Reliability Engineer to join the LatAm portion of our globally distributed DPoD SRE team. Our team operates on a 24x5 follow-the-sun model, with the LatAm region covering business hours for the Americas and contributing to weekend and holiday on-call rotations. This role is ideal for an engineer who thrives in production environments, enjoys solving complex reliability challenges on cloud-native infrastructure, and wants to help shape the operational excellence of a platform used by both internal teams and external customers.

Responsibilities

Operate, monitor and troubleshoot production workloads running on Azure, including AKS clusters, virtual machines, networking and storage components
Respond to incidents during shift hours and on-call rotations, drive resolution, lead post-incident reviews and implement preventive measures
Build and maintain CI/CD pipelines in Azure DevOps to support reliable, repeatable deployments
Design, implement and maintain observability solutions including dashboards, alerts, log pipelines and SLI/SLO metrics that improve service reliability and operational visibility
Automate repetitive operational tasks ("toil") using scripting languages such as Python and Bash
Collaborate with engineering, product and support teams across regions to improve system reliability, scalability and performance
Contribute to runbooks, knowledge base articles and operational documentation
Participate actively in continuous improvement of the team's processes, tooling and incident management practices

Requirements

2+ years of experience in DevOps or Site Reliability Engineering
Hands-on experience operating workloads in Microsoft Azure (compute, networking, identity, storage)
Practical experience with Azure DevOps for CI/CD pipelines and repository management
Strong Linux administration and troubleshooting skills in production environments
Proficiency in at least one scripting language for automation purposes
Demonstrated experience applying Site Reliability Engineering principles (SLIs/SLOs, error budgets, toil reduction, automation, blameless postmortems)
Strong systematic troubleshooting skills across application, infrastructure and network layers
English proficiency at B2 level or higher

Nice to have

Skills in Bash scripting for systems automation and operational tooling
Hands-on experience with Azure Kubernetes Service (AKS) in production
Background in the Elastic Stack (Elasticsearch, Kibana) for logging and observability
Familiarity with formal Incident Management practices and ITSM frameworks (e.g., ITIL)
Expertise in configuring and managing NFS or other network-attached storage solutions
Proficiency in Python development for automation, observability tooling or platform integrations
Ability to define and manage Service Level Indicators and Service Level Objectives for a customer-facing service

Benefits

International projects with top brands
Work with global teams of highly skilled, diverse peers
Healthcare benefits
Employee financial programs
Paid time off and sick leave
Upskilling, reskilling and certification courses
Unlimited access to the LinkedIn Learning library and 22,000+ courses
Global career opportunities
Volunteer and community involvement opportunities
EPAM Employee Groups
Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
