Overview
We have recently launched services for our client in Azure, and ensuring service health is our highest priority. As we establish our reputation through dependable, high-performing cloud solutions, we are looking for a Lead Site Reliability Engineer (SRE) who can make an immediate impact on incident response, troubleshooting, and the ongoing improvement of our cloud reliability. This is a hands-on opportunity for someone who excels in high-pressure situations, can operate effectively where SRE processes are still immature, and is passionate about both rapid incident response and building resilient systems for the future.
Responsibilities
- Develop and automate operational workflows to enhance system reliability, scalability, and performance
- Work closely with development and operations teams to integrate reliability best practices throughout the software development lifecycle
- Respond quickly to service incidents in the Azure environment and resolve them, minimizing downtime and customer disruption
- Lead root cause investigations and post-incident reviews, implementing actionable improvements
- Design, deploy, and maintain comprehensive monitoring, alerting, and observability solutions for all critical services
- Proactively identify and mitigate reliability risks before they affect customers
- Help define and mature SRE practices, including incident management, blameless postmortems, and SLO/SLI development
- Mentor and train team members in SRE methodologies and Azure best practices
- Analyze incident and outage trends to drive long-term reliability improvements
- Foster a culture of reliability, accountability, and continuous learning within the team
Requirements
- At least 5 years of experience in SRE, DevOps, or related roles, with a proven track record in cloud environments (Azure experience required)
- Minimum of one year in a leadership or team management role
- Advanced troubleshooting skills in distributed systems, networking, and cloud-native architectures
- Hands-on experience with Azure monitoring and logging tools such as Azure Monitor, Log Analytics, and Application Insights, and with automation and infrastructure-as-code tools such as ARM templates, Bicep, and Terraform
- Proficiency in at least one scripting or programming language, such as Python, PowerShell, or Bash
- Strong understanding of incident management, on-call operations, and post-incident analysis
- Experience implementing observability solutions and defining service level objectives (SLOs) and service level indicators (SLIs)
- Excellent communication skills and the ability to collaborate effectively in high-pressure, cross-functional environments
- English proficiency at B2 level or higher
Nice to have
- Advanced proficiency in Python
- Azure certifications such as Azure Solutions Architect Expert or DevOps Engineer Expert
- Experience building SRE practices from the ground up in environments with low process maturity
- Familiarity with CI/CD pipelines and infrastructure as code
- Experience mentoring or leading SRE/DevOps teams
Benefits
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn