Overview

We have just launched services for our client in Azure, and service health is our top priority. As we build our brand through reliable, high-performing services, we are seeking a Senior SRE who can immediately contribute to incident response, troubleshooting, and the ongoing improvement of our cloud reliability. This is a hands-on role for someone who thrives in high-stakes environments, can work effectively where SRE processes are still immature, and is passionate about both firefighting and building for the future.

Responsibilities

  • Develop and automate operational processes to improve system reliability, scalability, and performance
  • Collaborate with development and operations teams to embed reliability best practices into the SDLC
  • Rapidly respond to and resolve service incidents in our Azure environment, minimizing downtime and customer impact
  • Lead root cause analysis and post-incident reviews, driving actionable improvements
  • Design, implement, and maintain robust monitoring, alerting, and observability solutions for all critical services
  • Proactively identify and address reliability risks before they impact customers
  • Help establish and mature SRE practices, including incident management, blameless postmortems, and SLO/SLI definition
  • Mentor and upskill team members in SRE principles and Azure best practices
  • Analyze trends in incidents and outages to drive long-term improvements
  • Champion a culture of reliability, accountability, and continuous learning

Requirements

  • 3+ years in SRE, DevOps, or related roles, with a strong track record in cloud environments (Azure experience required)
  • Deep expertise in troubleshooting distributed systems, networking, and cloud-native architectures
  • Hands-on experience with Azure monitoring and logging tools (Azure Monitor, Log Analytics, Application Insights) and with automation and infrastructure-as-code tools (ARM templates, Bicep, Terraform)
  • Proficiency in at least one scripting or programming language (Python, PowerShell, Bash)
  • Strong understanding of incident management, on-call operations, and post-incident analysis
  • Experience implementing observability solutions and defining SLOs/SLIs
  • Excellent communication skills and the ability to work cross-functionally in high-pressure situations
  • English proficiency at B2 level or higher

Nice to have

  • Proficiency in Python
  • Azure certifications (e.g., Azure Solutions Architect Expert, DevOps Engineer Expert)
  • Experience in environments with low SRE process maturity, building practices from the ground up
  • Familiarity with CI/CD pipelines and infrastructure as code
  • Experience mentoring or leading SRE/DevOps teams

Benefits

  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Healthcare benefits
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling and certification courses
  • Unlimited access to the LinkedIn Learning library and 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn