The NMCI Service Management Integration and Transport (SMIT) group at Leidos has an opening for a Site Reliability Engineer (SRE) to focus on the reliability, performance, and scalability of complex distributed systems. The SRE will also develop and execute tests focused on system resilience, performance under load, and failure scenarios. They will work in tandem with other SREs and development teams to create automated testing frameworks that simulate real-world conditions and validate system behavior under normal and stress conditions, ensuring our services are resilient and meet established service level objectives (SLOs).
Requirements
- Maintaining complex computer systems by writing code to automate software releases, monitor systems, and detect and fix problems before users even know there is an issue.
- Managing incidents proactively, using tools like Aternity to monitor end-user performance and identify potential issues early.
- Developing strategies to address recurring incidents and improve system reliability.
- Leading the planning, coordination, and execution of software deployments across end-user devices.
- Ensuring deployments are completed on time, with minimal disruption to end users.
- Analyzing service performance metrics to identify areas for improvement.
- Developing and implementing initiatives to enhance the quality of end-user services.
- Defining and maintaining a product vision and roadmap for End User/Seats Services, aligned with organizational objectives.
- Managing the product backlog, prioritizing user stories, and ensuring alignment with strategic goals.
- Serving as the primary point of contact between the End User/Seats Services team and business stakeholders.
- Creating user stories and acceptance criteria that clearly communicate stakeholder needs to the development team.
- Ensuring clear documentation of product requirements, progress, and updates for stakeholders.
Benefits
- 401(k) Matching
- Generous Paid Time Off
