Overview

We are looking for a Lead Data Platform Operations Engineer to join our team.

In this position, you will oversee the reliability, security, performance, and cost management of our global enterprise data platform. You will play a crucial role in delivering 8/5 operational support as part of a follow-the-sun 24x5 coverage model, ensuring our platform consistently meets the needs of business operations worldwide.

Responsibilities

  • Oversee the operation and support of a secure, stable, and high-performing enterprise data platform, including Snowflake, the AWS data stack, dbt, orchestration tools, and BI/analytics platforms
  • Provide operational support in an 8/5 model and participate in a 24/7 on-call rotation for urgent incidents
  • Set up and manage monitoring, alerting, and observability solutions to enable proactive detection and resolution of issues
  • Carry out platform upgrades, patching, and configuration management in accordance with security and compliance requirements
  • Continuously tune platform performance as business needs evolve
  • Apply observability frameworks to monitor infrastructure, data pipelines, and platform services
  • Deliver operational insights through dashboards and reporting to inform decision-making
  • Identify opportunities for automation to streamline processes and reduce manual work
  • Drive ongoing improvements to boost platform resilience, scalability, and cost-effectiveness
  • Support infrastructure-as-code and configuration-as-code practices for reliable and repeatable operations

Requirements

  • Minimum of 5 years’ experience in professional software engineering roles
  • At least one year of experience leading and managing technical teams
  • Direct experience managing cloud-native data platforms, with a strong background in Snowflake
  • Proficiency in AWS cloud infrastructure, with a focus on operations, automation, and cost control
  • Hands-on experience with monitoring and observability tools such as Datadog, Prometheus, Grafana, ELK, or CloudWatch
  • Knowledge of Infrastructure as Code tools like Terraform, Pulumi, or Ansible, and best practices in configuration management
  • Solid understanding of networking, security, and compliance in cloud-based environments
  • Strong analytical and problem-solving skills with a proactive, service-driven mindset
  • Ability to operate effectively in a global support environment with on-call duties
  • Excellent communication and collaboration skills for working with engineering, data, and business stakeholders
  • Commitment to driving continuous improvement and operational excellence
  • English language proficiency (written and spoken) at B2+ level

Nice to have

  • Experience with FinOps frameworks and cloud cost optimization practices
  • Background working in regulated industries such as pharmaceuticals, healthcare, or finance, with a focus on compliance
  • Familiarity with modern data stack tools like dbt, Dagster, Airflow, ThoughtSpot, Tableau, or Power BI
  • Understanding of Site Reliability Engineering (SRE) concepts and methodologies
  • Experience with Databricks, BigQuery, or comparable data platforms

Benefits

  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Healthcare benefits
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling and certification courses
  • Unlimited access to the LinkedIn Learning library of 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn