Overview
We are looking for a skilled Senior Site Reliability Engineer to deliver advanced support and reliability engineering for critical cloud-based systems. The role focuses on ensuring reliability, performance and observability across AWS environments, with strong emphasis on Kubernetes, advanced monitoring, database expertise and distributed systems such as Kafka. The position involves incident response, proactive reliability improvements, automation and collaboration with engineering teams to strengthen system resilience.
Responsibilities
- Design, implement and maintain observability for AWS Cloud and Kubernetes workloads using Prometheus, Grafana, OpenTelemetry, Fluent Bit, OpenSearch, CloudWatch, CloudTrail, Athena and other modern tooling
- Monitor and troubleshoot EKS, Aurora RDS (PostgreSQL) and other AWS infrastructure at an advanced level
- Implement automated remediations and self-healing mechanisms
- Participate in incident response, root-cause analysis and postmortems
- Implement security measures that affect cluster reliability (IAM, network policies, AWS Config)
- Support and maintain current AWS infrastructure
- Collaborate with L3 teams to escalate, troubleshoot and resolve operational issues
Requirements
- 3+ years of experience in site reliability engineering or advanced support roles
- Expert-level proficiency in Grafana, Prometheus and OpenSearch
- Expertise in OpenTelemetry, Fluent Bit, CloudWatch and CloudTrail
- Strong understanding of distributed tracing, metrics pipelines and log aggregation
- Advanced troubleshooting and operational experience with EKS, RDS (PostgreSQL) and MSK (Kafka)
- Knowledge of AWS networking (VPC, security groups, route tables) and IAM (roles and policies)
- Strong understanding of AWS networking, security, scaling and reliability patterns
- Advanced Kubernetes knowledge in operations, debugging, networking and scaling
- Strong background in incident response, RCA, postmortems and SLA management
- Scripting skills in Bash or Python, applied to automating cloud operations, observability integrations and incident recovery
- Excellent structured problem-solving skills, strong communication across technical and non-technical teams, and comfort working in a fast-paced Agile environment
Nice to have
- Familiarity with Azure Kubernetes Service (AKS), Azure Monitor, Application Insights and Log Analytics
- Knowledge of Cosmos DB and PostgreSQL on Azure
- Expertise in Azure DevOps
- Proficiency in Terraform and Argo CD
Benefits
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn