Infrastructure Operations Engineer
The challenge
As an Infrastructure Operations Engineer, you'll play a critical role in ensuring the stability, availability, and performance of our cloud platform in production. Your mission will be to maintain service continuity, quickly restore operations when incidents occur, and act as a technical reference in high-pressure and complex scenarios. This role is designed for a senior profile with a strong operational mindset, comfortable working in real production environments, investigating issues, taking ownership during incidents, and ensuring the platform runs reliably at all times.
You'll be part of the operations team, working in a hands-on environment focused on real-time infrastructure management. Together, you'll handle incidents, improve operational procedures, and strengthen the reliability of the platform. This includes participating in on-call rotations, responding to critical incidents outside regular hours, and continuously refining runbooks, observability mechanisms, and incident response processes. You'll also contribute to capacity management and execute system scaling and expansions in production environments following engineering guidelines.
Collaboration is key in this role. You'll work closely with engineering, systems, network, storage, and product teams to resolve cross-functional issues and ensure smooth platform operations. You'll contribute your operational perspective to infrastructure changes, maintenance, deployments, and platform evolution, ensuring operability and resilience are always considered. Additionally, you'll act as a coordination point during complex incidents, leading technical diagnosis and recovery efforts and ensuring alignment across teams.
Requirements that are important for us
We are looking for a senior Infrastructure Operations Engineer with strong experience in managing critical production environments and a proven ability to troubleshoot and resolve complex incidents. The ideal candidate combines deep technical knowledge with a hands-on, problem-solving mindset to ensure platform reliability and continuity.
Relevant experience and expected outcomes:
Operating critical production infrastructures, ensuring availability, stability, and performance.
Leading incident response processes, including technical diagnosis and service restoration.
Performing root cause analysis and implementing corrective and preventive actions to reduce recurrence and impact.
Managing and improving observability, identifying risks, degradations, and bottlenecks proactively.
Contributing to capacity planning and executing infrastructure scaling in production environments.
Working with systems that support web applications, including a solid understanding of the HTTP protocol.
Administering technologies such as load balancers, NGINX/Apache, databases, and related systems.
Experience with Linux systems (primarily) and Windows environments (to a lesser extent), as well as strong knowledge of networking, virtualization, and storage.
Familiarity with Kubernetes is a plus.
Key skills and expected impact:
Strong troubleshooting capabilities, with the ability to analyze metrics, logs, and traces to identify and isolate issues.
Broad technical knowledge across different systems and domains, with the ability to understand interactions without needing to be a deep specialist in a single area.
Ability to remain calm under pressure while acting decisively and proactively in critical situations.
Ownership mindset, taking responsibility for restoring service and maintaining operational stability.
Motivation to act as a key technical reference during complex and high-impact scenarios.
Strong coordination skills to manage incident resolution, technical escalations, and cross-team collaboration.
Tools
Observability tools (metrics, logs, traces), enabling effective monitoring and troubleshooting.
Infrastructure and system administration tools across Linux and Windows environments.
Networking, virtualization, and storage management technologies.
Web infrastructure tools (load balancers, NGINX/Apache, databases).
Container orchestration platforms such as Kubernetes (valuable).
Operational documentation tools (runbooks, incident procedures), improving response consistency.
Capacity planning and infrastructure scaling tools, supporting platform growth and performance.
Application managed by Jotelulu