Core Responsibilities

Team Leadership & Operational Management

  • Run the daily operations of the SRE practice: team planning, shift assignments, escalation routing, and workload balancing.

    - Maintain a healthy on-call program: define rotation rules, track fatigue, ensure coverage, and continuously improve response maturity.

    - Oversee incident management processes, ensuring consistent triage, high-quality postmortems, and follow-through on remediation work.

    - Establish operational KPIs for the team (MTTA, MTTR, on-call load, ticket aging, toil reduction) and drive accountability.

    - Coach and develop SREs at all levels through 1:1s, technical guidance, and structured growth plans.

    - Ensure the team’s processes, documentation, and runbooks stay current and are audited.

Technical Oversight

  • Provide architecture-level guidance on resilience, observability, and reliability patterns; step in directly when the team is blocked or customer-impacting work demands senior technical judgment.

    - Validate SLIs/SLOs and error budgets across services; ensure consistent implementation and reporting.

    - Review and approve reliability design work: monitoring strategies, automation initiatives, CI/CD changes, deployment safety controls, and cloud cost/performance optimizations.

    - Participate in high-severity incidents as the escalation point and technical lead when needed.

    - Ensure engineering quality for IaC, CI/CD, observability instrumentation, and Kubernetes platform operations.

Cross-Functional Leadership

  • Act as the primary point of contact for internal stakeholders (Dev, Product, Architecture, Cloud) regarding reliability strategy and prioritization.

    - Translate business priorities into reliability roadmaps, staffing plans, and operational improvements.

    - Align teams around shared reliability objectives, ensuring that corrective actions, automation priorities, and capacity planning are actually executed.

    - Support customer-facing conversations when reliability posture, operational processes, or technical improvements require leadership representation.

Required Qualifications

  • 6–10 years in SRE/Operations/Platform roles, with at least 2 years leading or managing engineers.

    - Hands-on technical background across cloud platforms (AWS/Azure/GCP) and Kubernetes.

    - Experience defining and operating SLIs/SLOs, incident response, and postmortem programs.

    - Strong grounding in Terraform or similar IaC, CI/CD systems, and observability technologies (Prometheus, Grafana, OpenTelemetry, ELK).

    - Ability to assess technical work, coach engineers through complex problems, and make informed trade-offs under pressure.

    - Excellent operational judgment: triage, prioritization, team load balancing, and process design.

    - Professional-level cloud provider certification: AWS (Solutions Architect), Azure (Solutions Architect Expert), GCP (Professional Cloud Architect), or Oracle Cloud (Architect Professional).

Nice-to-Have

  • Prior experience running a distributed or follow-the-sun SRE practice.

    - Exposure to chaos engineering, fault injection, or reliability stress testing.

    - Familiarity with cloud cost governance and rightsizing strategies.

    - Experience improving or scaling on-call systems.