Overview

We have recently launched services for our client in Azure, and ensuring service health is our highest priority. As we establish our reputation through dependable, high-performing cloud solutions, we are looking for a Lead Site Reliability Engineer (SRE) who can make an immediate impact on incident response, troubleshooting, and the ongoing improvement of our cloud reliability. This is a hands-on opportunity for someone who excels in high-pressure situations, can operate effectively in an environment with minimal SRE process maturity, and is passionate about both rapid incident response and building resilient systems for the future.

Responsibilities

Develop and automate operational workflows to enhance system reliability, scalability, and performance
Work closely with development and operations teams to integrate reliability best practices throughout the software development lifecycle
Respond to and resolve service incidents in the Azure environment quickly, minimizing downtime and customer disruption
Lead root cause investigations and post-incident reviews, implementing actionable improvements
Design, deploy, and maintain comprehensive monitoring, alerting, and observability solutions for all critical services
Proactively identify and mitigate reliability risks before they affect customers
Help define and mature SRE practices, including incident management, blameless postmortems, and SLO/SLI development
Mentor and train team members in SRE methodologies and Azure best practices
Analyze incident and outage trends to drive long-term reliability improvements
Foster a culture of reliability, accountability, and continuous learning within the team

Requirements

At least 5 years of experience in SRE, DevOps, or related roles, with a proven track record in cloud environments (Azure experience required)
Minimum of one year in a leadership or team management role
Advanced troubleshooting skills in distributed systems, networking, and cloud-native architectures
Hands-on experience with monitoring, logging, and automation tools for Azure, such as Azure Monitor, Log Analytics, Application Insights, ARM templates, Bicep, and Terraform
Proficiency in at least one scripting or programming language, such as Python, PowerShell, or Bash
Strong understanding of incident management, on-call operations, and post-incident analysis
Experience implementing observability solutions and defining service level objectives (SLOs) and service level indicators (SLIs)
Excellent communication skills and the ability to collaborate effectively in high-pressure, cross-functional environments
English proficiency at B2 level or higher

Nice to have

Advanced proficiency in Python
Azure certifications such as Azure Solutions Architect Expert or DevOps Engineer Expert
Experience building SRE practices from the ground up in environments with low process maturity
Familiarity with CI/CD pipelines and infrastructure as code
Experience mentoring or leading SRE/DevOps teams

Benefits

International projects with top brands
Work with global teams of highly skilled, diverse peers
Healthcare benefits
Employee financial programs
Paid time off and sick leave
Upskilling, reskilling and certification courses
Unlimited access to the LinkedIn Learning library and 22,000+ courses
Global career opportunities
Volunteer and community involvement opportunities
EPAM Employee Groups
Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn

Lead Site Reliability Engineer (SRE) - Azure

Job description