Senior Site Reliability Engineer

Remote

Senior

Colombia; Mexico

PostgreSQL, DevOps, Kubernetes, Terraform, Python, Incident Response, Observability, Grafana, Prometheus, Azure DevOps, Azure Monitor, LinkedIn, OpenSearch, OpenTelemetry, reliability

Job description

Overview

We are looking for a skilled Senior Site Reliability Engineer to deliver advanced support and reliability engineering for critical cloud-based systems. The role focuses on ensuring reliability, performance and observability across AWS environments, with strong emphasis on Kubernetes, advanced monitoring, database expertise and distributed systems such as Kafka. The position involves incident response, proactive reliability improvements, automation and collaboration with engineering teams to strengthen system resilience.

Responsibilities

Design, implement and maintain observability for AWS Cloud and Kubernetes workloads using Prometheus, Grafana, OpenTelemetry, Fluent Bit, OpenSearch, CloudWatch, CloudTrail, Athena and other modern tooling
Monitor and troubleshoot EKS, Aurora RDS (PostgreSQL) and other AWS infrastructure at an advanced level
Implement automated remediations and self-healing mechanisms
Participate in incident response, root-cause analysis and postmortems
Implement security measures impacting cluster reliability (IAM, network policies, AWS Config)
Support and maintain current AWS infrastructure
Collaborate with L3 teams to escalate, troubleshoot and resolve operational issues

Requirements

3+ years of experience in site reliability engineering or advanced support roles
Expert-level proficiency in Grafana, Prometheus and OpenSearch
Expertise in OpenTelemetry, Fluent Bit, CloudWatch and CloudTrail
Strong understanding of distributed tracing, metrics pipelines and log aggregation
Advanced troubleshooting and operational experience with EKS, RDS (PostgreSQL) and MSK (Kafka)
Knowledge of AWS networking (VPC, security groups, route tables) and IAM (roles and policies)
Strong understanding of AWS networking, security, scaling and reliability patterns
Advanced Kubernetes knowledge in operations, debugging, networking and scaling
Strong background in incident response, RCA, postmortems and SLA management
Scripting skills in Bash or Python for automating cloud operations, observability integrations and incident recovery
Excellent structured problem-solving skills, strong communication across technical and non-technical teams, and comfort working in a fast-paced Agile environment

Nice to have

Familiarity with AKS (Kubernetes), Azure Monitor, Application Insights and Log Analytics
Knowledge of Cosmos DB and PostgreSQL on Azure
Expertise in Azure DevOps
Proficiency in Terraform and ArgoCD

Benefits

International projects with top brands
Work with global teams of highly skilled, diverse peers
Healthcare benefits
Employee financial programs
Paid time off and sick leave
Upskilling, reskilling and certification courses
Unlimited access to the LinkedIn Learning library and 22,000+ courses
Global career opportunities
Volunteer and community involvement opportunities
EPAM Employee Groups
Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
