Overview

We’re seeking a skilled DevOps/SRE with extensive expertise in designing, implementing, and maintaining observability platforms to ensure system reliability, performance, and scalability. As a vital member of our SRE team, you will promote the adoption of observability best practices, fostering proactive monitoring, swift incident resolution, and continuous enhancements to our software products and infrastructure.

This role emphasizes creating and refining observability solutions—including metrics, logs, and traces—to provide actionable insights into system health and performance. You'll also advance automation for deployment pipelines, oversee applications across various environments, and ensure our systems meet rigorous reliability and availability expectations. Collaboration will be essential as you engage closely with development teams to integrate observability into the software lifecycle, equipping them with the tools and practices for efficient debugging and iteration.

Responsibilities

  • Architect and implement observability platforms using tools like Prometheus, Grafana, and OpenTelemetry to support our Next.js frontend and accompanying systems
  • Design and maintain automated deployment pipelines focused on reliability, observability, and zero-downtime updates across multiple environments
  • Collaborate with development teams to integrate observability into local workflows for accelerated debugging and iteration
  • Optimize infrastructure and tools for scalability, fault tolerance, and performance to reduce mean time to detection (MTTD) and mean time to resolution (MTTR)
  • Mentor team members in SRE practices, including observability-driven development, incident management, and post-mortem analyses

Requirements

  • Proficiency in scripting languages like Python for automation and observability tools
  • Expertise in observability frameworks (e.g., Prometheus, Grafana, Loki, Jaeger) and logging solutions (e.g., ELK stack, Fluentd)
  • Background in containerization technologies (e.g., Docker) and orchestration platforms (e.g., Kubernetes, AWS ECS)
  • Knowledge of infrastructure as code tools (e.g., Terraform, Ansible) to provision and manage observable systems
  • Familiarity with version control systems, especially Git, and integrating observability into CI/CD pipelines (e.g., Jenkins, GitHub Actions)
  • Capability to define and measure service-level indicators (SLIs), service-level objectives (SLOs), and error budgets to ensure system reliability
  • Competency in fostering collaboration and communication, with a strong commitment to nurturing a blameless culture of improvement

Nice to have

  • Proficiency in Polish language
  • Proficiency in programming languages as applied to SRE, DevOps, or observability contexts
  • Familiarity with cloud platforms, such as AWS, with a focus on observability services (e.g., CloudWatch, X-Ray)
  • Understanding of distributed systems, chaos engineering, or security practices in observable environments

With us you can:

  • Work on a flexible schedule remotely or from any of our comfortable offices or coworking spaces in Ukraine
  • Receive the necessary equipment to perform your work tasks
  • Change projects and technology stacks within EPAM
  • Gain experience in various business domains (Insurance, E-commerce, Healthcare, Finance, Travelling, Media, Artificial Intelligence, and more)
  • Explore relocation opportunities, which may be available for eligible candidates depending on the role and openings at other EPAM locations
  • Participate in volunteer and charity programs and in communities (both technical and interest-based)

We focus on your professional growth:

  • Plan your individual career path together with your manager
  • Receive regular feedback from colleagues
  • Improve your English for free with certified teachers (Speaking Clubs, client interview preparation courses, etc.)
  • Get the opportunity to undergo free training and certification in AWS, GCP, or Azure Clouds
  • Use the internal E-learn training program (18,200+ specialized training and mentoring programs)
  • Access corporate accounts on LinkedIn Learning, getAbstract, and other partner resources
  • Study at EPAM Solution Architecture School with the instructors who are practicing architects
  • Develop as a leader by joining Delivery Management, Resource Management, Leadership Essentials school and more
  • Participate in internal communities (500+ meetups, technical discussions, brainstorming sessions, online events and conferences annually)

What we offer:

  • Vacation and sick leave (including sick leave without a medical certificate)
  • A wide range of Voluntary Medical Insurance programs providing both medical treatment and various preventive options (including sports activities)
  • Medical insurance for family members at corporate rates
  • Company support during significant life events (childbirth or adoption, marriage, etc.)
  • Support for psychological comfort: discounts on services from mental health specialists or coaches, thematic training
  • E-kids program: free programming language training for EPAMers' children

Please note that this role supports remote work, but only from within Ukraine.

Please note that the set of benefits, including learning, certification, and other opportunities, may vary depending on the role you apply for. Our recruiter will share more details about the specific opportunity during your general interview.

About EPAM

EPAM strives to provide its global team of over 62,350 professionals in more than 55 countries with opportunities for professional growth from day one of collaboration. Our colleagues are the source of EPAM's success, so we value cooperation, strive to always understand our clients' business and aim for the highest quality standards. No matter where you are, you will join a dedicated, diverse community that will help you realize your potential to the fullest.