Overview
We just launched services for our client in Azure, and service health is our top priority. As we build our brand through reliable, high-performing services, we are seeking a Senior SRE who can immediately contribute to incident response, troubleshooting, and the ongoing improvement of our cloud reliability. This is a hands-on role for someone who thrives in high-stakes environments, can work effectively where SRE processes are still immature, and is passionate about both firefighting and building for the future.
Responsibilities
- Develop and automate operational processes to improve system reliability, scalability, and performance
- Collaborate with development and operations teams to embed reliability best practices into the SDLC
- Rapidly respond to and resolve service incidents in our Azure environment, minimizing downtime and customer impact
- Lead root cause analysis and post-incident reviews, driving actionable improvements
- Design, implement, and maintain robust monitoring, alerting, and observability solutions for all critical services
- Proactively identify and address reliability risks before they impact customers
- Help establish and mature SRE practices, including incident management, blameless postmortems, and SLO/SLI definition
- Mentor and upskill team members in SRE principles and Azure best practices
- Analyze trends in incidents and outages to drive long-term improvements
- Champion a culture of reliability, accountability, and continuous learning
Requirements
- 3+ years in SRE, DevOps, or related roles, with a strong track record in cloud environments (Azure experience required)
- Deep expertise in troubleshooting distributed systems, networking, and cloud-native architectures
- Hands-on experience with Azure monitoring, logging, and automation tools (Azure Monitor, Log Analytics, Application Insights, ARM templates, Bicep, Terraform)
- Proficiency in at least one scripting or programming language (Python, PowerShell, Bash)
- Strong understanding of incident management, on-call operations, and post-incident analysis
- Experience implementing observability solutions and defining SLOs/SLIs
- Excellent communication skills and the ability to work cross-functionally in high-pressure situations
- English proficiency at B2 level or higher
Nice to have
- Proficiency in Python
- Azure certifications (such as Azure Solutions Architect Expert or DevOps Engineer Expert)
- Experience in environments with low SRE process maturity, building practices from the ground up
- Familiarity with CI/CD pipelines and infrastructure as code
- Experience mentoring or leading SRE/DevOps teams
Benefits
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn