Overview
We are seeking a Senior DevOps Engineer to enhance our high-performance computing services and collaborate closely with the scientific community to optimize research computing.
Join our team to build and operate cutting-edge HPC capabilities using automation and infrastructure-as-code. Apply now to contribute to innovative computational solutions in a dynamic environment.
Responsibilities
- Design, implement, and maintain robust platform infrastructure using Infrastructure as Code tools such as Terraform
- Develop, deliver, and operate research computing services and applications
- Apply Site Reliability Engineering principles to manage HPC service deployment, monitoring, and incident response
- Solve complex technical problems related to HPC services and user applications
- Manage large-scale HPC, HTC, or BC environments for optimal performance
- Collaborate with scientific users to tailor HPC resources to research needs
- Automate deployment processes to ensure consistency across HPC infrastructure
- Maintain and administer large-scale cluster and server computing software such as Slurm, LSF, or Grid Engine
- Develop and maintain monitoring dashboards using tools like Grafana and Prometheus
- Work within a DevOps team environment following agile methodologies
- Operate and utilize virtualized private cloud resources such as OpenStack
- Administer large-scale parallel filesystems including Weka, GPFS, or Lustre
- Use configuration management tools like Ansible, Salt, or Puppet to manage IT operations
- Develop scripts and tools for HPC and DevOps platform operations using Bash and Python
Requirements
- 3+ years of experience with DevOps processes and automation using Infrastructure as Code tools such as Terraform
- Hands-on experience operating or engineering large-scale HPC or similar computing environments
- Proven expertise in Linux system administration including TCP/IP networking and storage subsystems
- Experience administering large-scale cluster management software such as Slurm, LSF, or Grid Engine
- Knowledge of configuration management tools like Ansible, Salt, or Puppet
- Experience working in agile DevOps teams
- Ability to develop and maintain monitoring tools such as Grafana and Prometheus
- Experience with scripting languages such as Bash and Python for automation and tool development
- Strong experience managing virtualized private cloud environments like OpenStack
- Scientific degree or equivalent experience in computationally intensive scientific data analysis
- Proven ability to manage relationships with third-party suppliers
- Upper-intermediate proficiency in English (B2+)
Nice to have
- Experience with container technologies such as LXD, Singularity, Docker, or Kubernetes
- Operation and configuration experience with public cloud platforms like AWS, Azure, or GCP
- Experience with HashiCorp tools such as Vault, Consul, and Nomad
- Development experience with programming languages such as Java, C++, Python, Ruby, or Perl
- Experience with parallel filesystems like Weka, GPFS, or Lustre
We gather like-minded people:
- Engineering community of industry professionals
- Friendly team and enjoyable working environment
- Flexible schedule and opportunity to work remotely within Poland
- Chance to work abroad for up to 60 days annually
- Business-driven relocation opportunities
We provide growth opportunities:
- Outstanding career roadmap
- Leadership development, career advising, soft skills, and well-being programs
- Certification (GCP, Azure, AWS)
- Unlimited access to LinkedIn Learning, getAbstract, and A Cloud Guru
- English classes
We cover it all:
- Stable income (Employment Contract or B2B)
- Participation in the Employee Stock Purchase Plan
- Benefits package (health insurance, multisport, shopping vouchers)
- Strategically located offices featuring entertainment and relaxation zones, table tennis and football, free snacks, fantastic coffee, and more
- Referral bonuses
- Corporate, social and well-being events
Please note:
- The set of bonuses might vary based on the role you apply for – specifics will be discussed with our recruiter during the general interview.
- We will contact only selected candidates.
About EPAM
EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.