Overview

We just launched services for our client in Azure, and service health is our top priority. As we build our brand through reliable, high-performing services, we are seeking a Senior SRE who can immediately contribute to incident response, troubleshooting, and the ongoing improvement of our cloud reliability. This is a hands-on role for someone who thrives in high-stakes environments, can operate with minimal SRE process maturity, and is passionate about both firefighting and building for the future.

Responsibilities

Develop and automate operational processes to improve system reliability, scalability, and performance
Collaborate with development and operations teams to embed reliability best practices into the SDLC
Rapidly respond to and resolve service incidents in our Azure environment, minimizing downtime and customer impact
Lead root cause analysis and post-incident reviews, driving actionable improvements
Design, implement, and maintain robust monitoring, alerting, and observability solutions for all critical services
Proactively identify and address reliability risks before they impact customers
Help establish and mature SRE practices, including incident management, blameless postmortems, and SLO/SLI definition
Mentor and upskill team members in SRE principles and Azure best practices
Analyze trends in incidents and outages to drive long-term improvements
Champion a culture of reliability, accountability, and continuous learning

Requirements

3+ years in SRE, DevOps, or related roles, with a strong track record in cloud environments (Azure experience required)
Deep expertise in troubleshooting distributed systems, networking, and cloud-native architectures
Hands-on experience with Azure monitoring, logging, and automation tools (Azure Monitor, Log Analytics, Application Insights, ARM, Bicep, Terraform)
Proficiency in at least one scripting or programming language (Python, PowerShell, Bash)
Strong understanding of incident management, on-call operations, and post-incident analysis
Experience implementing observability solutions and defining SLOs/SLIs
Excellent communication skills and the ability to work cross-functionally in high-pressure situations
English proficiency at B2 level or higher

Nice to have

Proficiency in Python
Azure certifications (Azure Solutions Architect, Azure DevOps Engineer)
Experience in environments with low SRE process maturity, building practices from the ground up
Familiarity with CI/CD pipelines and infrastructure as code
Experience mentoring or leading SRE/DevOps teams

[GTS] Benefits (generic, except India)

International projects with top brands
Work with global teams of highly skilled, diverse peers
Healthcare benefits
Employee financial programs
Paid time off and sick leave
Upskilling, reskilling and certification courses
Unlimited access to the LinkedIn Learning library and 22,000+ courses
Global career opportunities
Volunteer and community involvement opportunities
EPAM Employee Groups
Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn

Senior Site Reliability Engineer (SRE) - Azure

Описание вакансии