Overview

We have recently launched services for our client in Azure, and ensuring service health is our highest priority. As we establish our reputation through dependable, high-performing cloud solutions, we are looking for a Lead Site Reliability Engineer (SRE) who can make an immediate impact on incident response, troubleshooting, and the ongoing improvement of our cloud reliability. This is a hands-on opportunity for someone who excels in high-pressure situations, can operate effectively in an environment with minimal SRE process maturity, and is passionate about both rapid incident response and building resilient systems for the future.

Responsibilities

  • Develop and automate operational workflows to enhance system reliability, scalability, and performance
  • Work closely with development and operations teams to integrate reliability best practices throughout the software development lifecycle
  • Respond quickly to service incidents in the Azure environment and resolve them, minimizing downtime and customer disruption
  • Lead root cause investigations and post-incident reviews, implementing actionable improvements
  • Design, deploy, and maintain comprehensive monitoring, alerting, and observability solutions for all critical services
  • Proactively identify and mitigate reliability risks before they affect customers
  • Help define and mature SRE practices, including incident management, blameless postmortems, and SLO/SLI development
  • Mentor and train team members in SRE methodologies and Azure best practices
  • Analyze incident and outage trends to drive long-term reliability improvements
  • Foster a culture of reliability, accountability, and continuous learning within the team

Requirements

  • At least 5 years of experience in SRE, DevOps, or related roles, with a proven track record in cloud environments (Azure experience required)
  • Minimum of one year in a leadership or team management role
  • Advanced troubleshooting skills in distributed systems, networking, and cloud-native architectures
  • Hands-on experience with Azure monitoring and logging tools (Azure Monitor, Log Analytics, Application Insights) and infrastructure automation tools (ARM templates, Bicep, Terraform)
  • Proficiency in at least one scripting or programming language, such as Python, PowerShell, or Bash
  • Strong understanding of incident management, on-call operations, and post-incident analysis
  • Experience implementing observability solutions and defining service level objectives (SLOs) and service level indicators (SLIs)
  • Excellent communication skills and the ability to collaborate effectively in high-pressure, cross-functional environments
  • English proficiency at B2 level or higher

Nice to have

  • Advanced proficiency in Python
  • Azure certifications such as Azure Solutions Architect Expert or DevOps Engineer Expert
  • Experience building SRE practices from the ground up in environments with low process maturity
  • Familiarity with CI/CD pipelines and infrastructure as code
  • Experience mentoring or leading SRE/DevOps teams

Benefits

  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Healthcare benefits
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling and certification courses
  • Unlimited access to the LinkedIn Learning library and 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn