Overview:

SOFTSWISS continues to expand the team and is looking for a Monitoring System Engineer.

If you're passionate about delivering top-notch service and consider yourself a proactive, positive thinker, we'd love to hear from you! We're eager for you to contribute to our team's success. If you're looking for a challenging and rewarding career opportunity, this could be the perfect fit.

Key responsibilities:

The two main pillars of our workflow are:

Responding to Events/Monitoring Alerts (L1/L2 tasks for certain system parts):

  • Offering on-duty service coverage, encompassing day and night shifts.

  • Addressing incidents by troubleshooting and resolving issues, seeking assistance from third-party or vendor support when necessary.

  • Directing issues or queries to the relevant department as needed.

  • Keeping detailed records and documentation of current infrastructure challenges and Root Cause Analyses (RCAs).

  • Contributing to safe and effective internal practices for AI usage in monitoring and incident response workflows.

Maintaining and Enhancing the Monitoring Systems:

  • Collaborating with other teams to understand and define their monitoring needs, then implementing the right solutions.

  • Setting up and adjusting the monitoring/observability systems for various teams.

  • Designing and tweaking alerts and dashboards to suit specific needs.

  • Refining alerts to reduce irrelevant notifications and increase the significance of those that remain.

  • Enhancing dashboards for better clarity, understanding, and a more comprehensive view.

  • Building and sustaining connections between the monitoring systems and other platforms like Jira, Opsgenie, etc. when required.

  • Establishing and updating a Knowledge Base, covering system configurations, alert processes, troubleshooting guidelines, and user manuals.

  • Staying up to date with the newest trends and best practices to continuously improve our organization's monitoring capabilities.

  • Identifying opportunities to automate repetitive monitoring and support tasks, including with AI-assisted approaches where suitable.

Required Experience:

  • Minimum of 3 years of experience as a Systems Engineer, SRE, DevOps Engineer, or Monitoring Support Engineer (L2+).

  • Good understanding of Linux-like operating systems (Debian-based).

  • Experience with containerization, virtualization, and orchestration (LXC/LXD, Docker, Kubernetes).

  • Development experience in any scripting language (Bash, Python, Go, etc.) and familiarity with REST APIs.

  • Knowledge of basic database concepts (experience with PostgreSQL is preferable), including transactions and WAL.

  • English proficiency at an Intermediate (B1) level or higher. It's crucial to understand technical terminology related to our specific tech stack and to be able to interpret technical documentation.

  • Practical interest in using AI-assisted tools for troubleshooting, automation, documentation, and operational efficiency:
    - Ability to critically evaluate AI-generated output and validate it before using it in production environments.
    - Understanding of the risks and limitations of AI usage in infrastructure and production operations.

Skills & Experience:

Monitoring/observability tools (experience with at least two of the following):

  • Zabbix (familiarity with concepts such as LLD, prototypes, dependencies, and preprocessing)

  • Grafana (knowledge of data sources, dashboard creation, and query usage)

  • Prometheus/VictoriaMetrics/etc. (understanding of metrics collection and alerting)

  • ELK/Splunk/etc. (ability to use queries and filters for log analysis)

  • Site24x7/Pingdom/etc. (experience with web monitoring and performance metrics)

Linux-like operating systems:

  • Strong understanding of key concepts, including:
    - File systems
    - Process management
    - Built-in monitoring tools
    - Networks
    - Scripting
    - Troubleshooting

Familiarity with:

  • Kafka

  • RabbitMQ

  • GitLab

  • Nginx/Puma

  • ClickHouse

  • PostgreSQL

  • MongoDB

  • HashiCorp Vault

  • Microservices and orchestration (Kubernetes)

  • Any IaC / infrastructure automation:
    - Provisioning tools (Terraform)
    - Configuration management (Ansible, Salt, Puppet)

  • Any AI-assisted/AIOps tools

Our Benefits:

  • Full-time remote work opportunities and flexible working hours

  • Private insurance

  • One additional day off per calendar year

  • Sports program compensation

  • Comprehensive Mental Health Programme

  • Free online English lessons with a native speaker

  • Generous referral program

  • Training, internal workshops, and participation in international professional conferences and corporate events