Overview
We are seeking a highly skilled Senior Data Platform Operations Engineer to join our team.
In this role, you will be responsible for maintaining the stability, security, performance, and cost-effectiveness of our global enterprise data platform. You will play a key part in providing 8/5 operational coverage as part of a follow-the-sun 24/5 support model, ensuring our platform reliably supports business operations around the world.
Responsibilities
- Maintain and support a stable, secure, and high-performing enterprise data platform, including Snowflake, AWS data stack, dbt, orchestration tools, and BI/analytics solutions
- Deliver operational coverage in an 8/5 support model and participate in a 24/7 on-call rotation for critical incidents
- Implement and manage robust monitoring, alerting, and observability systems for proactive incident detection and resolution
- Perform platform upgrades, patching, and configuration management in alignment with security and compliance standards
- Continuously optimize system performance to meet changing business requirements
- Utilize comprehensive observability frameworks to monitor infrastructure, data pipelines, and platform services
- Provide actionable operational insights through dashboards and reporting tools
- Identify and automate processes to improve efficiency and minimize manual intervention
- Recommend and execute ongoing improvements to enhance platform resilience, scalability, and cost efficiency
- Contribute to infrastructure-as-code and configuration-as-code initiatives for consistent and repeatable operations
Requirements
- At least 3 years of experience in professional software engineering roles
- Hands-on expertise managing cloud-native data platforms, with Snowflake experience required
- Proficiency in AWS cloud infrastructure, focusing on operations, automation, and cost management
- Experience with monitoring and observability tools such as Datadog, Prometheus, Grafana, ELK, or CloudWatch
- Knowledge of Infrastructure as Code tools like Terraform, Pulumi, or Ansible, and configuration management best practices
- Strong understanding of networking, security, and compliance in cloud environments
- Excellent problem-solving abilities with a proactive, service-oriented approach
- Ability to work in a global operations setting with on-call responsibilities
- Effective communication and collaboration skills for working with engineering, data, and business teams
- Dedication to continuous improvement and operational excellence
- Fluent written and spoken English at a B2+ level or higher
Nice to have
- Experience implementing FinOps frameworks and cost optimization strategies for cloud environments
- Background in regulated industries such as pharma, healthcare, or finance, with experience in compliance-driven operations
- Familiarity with modern data stack tools including dbt, Dagster, Airflow, ThoughtSpot, Tableau, or Power BI
- Exposure to Site Reliability Engineering (SRE) principles and practices
- Experience with Databricks, BigQuery, or similar data platforms
Benefits
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn