Overview

We are seeking an experienced Lead Data Engineer with advanced expertise in PySpark and hands-on experience building ETL pipelines, designing data lake architectures, and integrating data feeds on AWS.

You will handle both structured and unstructured data, ingesting it into AWS from a variety of on-premises and enterprise sources such as SAP, Intelex, SQL databases, and OSI PI. This position offers the chance to work on large-scale data projects and collaborate with diverse teams in a fast-paced setting.

Responsibilities

Create, refine, and manage ETL pipelines using PySpark and AWS Glue Jobs to process extensive structured and unstructured datasets
Coordinate data workflows with Apache Airflow, ensuring dependable scheduling, dependency management, and effective error handling
Develop and sustain data feeds from on-premises and enterprise systems into AWS data lake environments
Integrate with enterprise sources including SAP for ERP and operational data, Intelex for environmental, health, safety, and quality data, SQL databases for relational data, and OSI PI for real-time industrial and process historian data
Build and maintain API integrations that retrieve data from on-premises services into AWS
Manage data extraction, transformation, and loading across multiple formats and protocols
Assist in designing and maintaining AWS data lake architectures using Amazon S3, AWS Glue, and Lake Formation
Ensure data is properly cataloged, partitioned, and optimized for analytics and reporting
Apply data quality checks, validation, and lineage tracking throughout all pipelines

Requirements

At least 5 years of experience in data engineering positions
At least one year of experience leading and managing development teams
High-level proficiency in Python and PySpark for data processing and pipeline creation
Strong foundation in ETL processes for data integration
Experience coordinating workflows with Apache Airflow
Demonstrated success building production-grade data pipelines on AWS
Hands-on experience with AWS Glue Jobs for ETL operations
Familiarity with Amazon S3, data lake methodologies, and data cataloging practices
Experience with AWS-native monitoring and operational tools
Skilled in integrating enterprise systems via APIs, JDBC, or native connectors, including SAP, Intelex, SQL databases, and OSI PI
Ability to work with both structured and unstructured data formats
Excellent skills in documentation, communication, and collaboration
English proficiency at B2+ level or higher, both written and spoken

Nice to have

Experience working with energy, oil & gas, or industrial data environments
Knowledge of Drilling and Completions data flows and terminology

Benefits

International projects with top brands
Work with global teams of highly skilled, diverse peers
Healthcare benefits
Employee financial programs
Paid time off and sick leave
Upskilling, reskilling and certification courses
Unlimited access to the LinkedIn Learning library and 22,000+ courses
Global career opportunities
Volunteer and community involvement opportunities
EPAM Employee Groups
Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn

Lead Data Engineer (Python/AWS)