Overview
We are seeking an experienced Lead Data Engineer with advanced expertise in PySpark and hands-on experience building ETL pipelines, designing data lake architectures, and integrating data feeds on AWS.
You will handle both structured and unstructured data, ingesting it into AWS from a variety of on-premises and enterprise sources such as SAP, Intelex, SQL databases, and OSI PI. This position provides the chance to work on large-scale data projects and collaborate with diverse teams in a fast-paced setting.
Responsibilities
- Create, refine, and manage ETL pipelines using PySpark and AWS Glue Jobs to process extensive structured and unstructured datasets
- Coordinate data workflows with Apache Airflow, ensuring dependable scheduling, dependency management, and effective error handling
- Develop and sustain data feeds from on-premises and enterprise systems into AWS data lake environments
- Integrate with enterprise sources including SAP for ERP and operational data, Intelex for environmental, health, safety, and quality data, SQL databases for relational data, and OSI PI for real-time industrial and process historian data
- Build and maintain API integrations that retrieve data from on-premises services into AWS
- Manage data extraction, transformation, and loading across multiple formats and protocols
- Assist in designing and maintaining AWS data lake architectures using Amazon S3, AWS Glue, and Lake Formation
- Ensure data is properly cataloged, partitioned, and optimized for analytics and reporting
- Apply data quality checks, validation, and lineage tracking throughout all pipelines
Requirements
- At least 5 years of experience in data engineering positions
- At least 1 year of experience leading and managing development teams
- High-level proficiency in Python and PySpark for data processing and pipeline creation
- Strong foundation in ETL processes for data integration
- Experience coordinating workflows with Apache Airflow
- Demonstrated success building production-grade data pipelines on AWS
- Hands-on experience with AWS Glue Jobs for ETL operations
- Familiarity with Amazon S3, data lake methodologies, and data cataloging practices
- Experience with AWS-native monitoring and operational tools
- Experience integrating enterprise systems, including SAP, Intelex, SQL databases, and OSI PI, via APIs, JDBC, or native connectors
- Ability to work with both structured and unstructured data formats
- Excellent documentation, communication, and collaboration skills
- English proficiency at B2+ level or higher, both written and spoken
Nice to have
- Experience working with energy, oil & gas, or industrial data environments
- Knowledge of Drilling and Completions data flows and terminology
Benefits
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn