
Runware

Staff DevOps Engineer

Posted Yesterday
Remote
Hiring Remotely in United Kingdom
Senior level

Runware is building the API layer for the next generation of AI products. Our platform gives teams fast, reliable access to real-time inference across thousands of models through a single flexible API. We help customers build and scale media generation products with better performance, lower cost, and less operational complexity.

Behind this is an infrastructure platform built for speed, reliability, and GPU scale. New models launch constantly. Customer traffic can grow quickly. Performance matters at every layer.

We are looking for a Staff/Senior DevOps Engineer to help build, operate, and scale the infrastructure behind Runware’s global AI inference platform. You’ll play a critical role in making our systems faster, more resilient, easier to operate, and ready for the next stage of growth.

About the role

Runware’s infrastructure is the engine behind some of the fastest-growing AI products in the world. As a Staff/Senior DevOps Engineer, you’ll help design, build, and operate the systems that power real-time AI inference across large-scale GPU fleets and a global production platform.

This is not a traditional DevOps role. You’ll be working at the intersection of bare-metal infrastructure, GPUs, networking, automation, observability, and high-performance distributed systems. Your work will directly shape how quickly we can launch new models, scale customer traffic, recover from failures, and deliver low-latency AI experiences to millions of users.

You’ll turn complex, hardware-driven infrastructure into reliable, automated, developer-friendly platforms. From provisioning and orchestration to deployment pipelines, monitoring, incident response, and capacity scaling, you’ll help remove friction so engineering teams can move faster without compromising reliability.

You’ll build the foundations that let Runware scale with confidence: infrastructure that is fast, resilient, observable, secure, and built for the demands of real-time AI.

What you’ll do
  • Build and scale the infrastructure that powers real-time AI inference across GPU fleets, bare-metal servers, and serverless and containerised production systems
  • Help evolve Runware’s platform toward more elastic, on-demand infrastructure that can scale quickly with customer traffic and model demand
  • Make Runware faster, more reliable and more resilient by improving the critical paths behind our request entrypoints, inference services, queues, storage, load balancers and networking layer
  • Automate the hard parts of infrastructure operations, from provisioning and configuration through to CI/CD, deployment safety, progressive rollouts and rapid rollback
  • Build the observability backbone for a high-performance AI platform, with the signals needed to spot issues early, understand capacity and fix problems before customers feel them
  • Play a leading role in production operations, incident response, debugging and post-incident improvements, helping us turn operational challenges into a stronger platform
  • Strengthen the security and compliance foundations of our infrastructure through patching, secrets management, access controls, hardening, auditability, documentation and repeatable operational processes

Requirements
  • Strong experience as a DevOps Engineer, SRE, Infrastructure Engineer, Platform Engineer or similar, with a track record of running production systems at scale
  • Deep Linux knowledge and confidence debugging real production issues across networking, storage, performance, services and system behaviour
  • Hands-on experience building automation, Infrastructure-as-Code, CI/CD pipelines and deployment workflows that make infrastructure safer and easier to operate
  • Experience operating high-availability, low-latency or high-throughput platforms where reliability and performance directly affect customers
  • Strong networking fundamentals across TCP/IP, DNS, load balancing, routing, firewalls, proxies, TLS and HTTP
  • A calm and pragmatic approach under pressure, with strong communication, good judgement and a bias toward automation over manual toil

Bonus
  • Experience operating GPU infrastructure for AI/ML inference, including NVIDIA drivers, CUDA, container runtimes, GPU monitoring, capacity planning and workload isolation
  • Familiarity with inference serving and optimisation frameworks such as vLLM, TensorRT, Triton or similar

Benefits

We’re a remote-first collective, meeting in person twice a year to plan, brainstorm, celebrate wins, and enjoy some face-to-face time. We have core hours for cooperative working and calls, but outside of that your calendar is yours. Work the hours that let you perform at your peak while also building a healthy life.

Our release cycles are fast and intense, but they’re followed by real downtime. After big pushes we expect the team to unplug, recharge, and come back stronger than ever for the next leap.

  • Generous paid time off – vacation, sick days, public holidays
  • Meaningful stock options – share in the upside you create
  • Remote-first setup – work from home, anywhere we can employ you
  • Flexible hours – own your schedule outside core collaboration blocks
  • Family leave – paid maternity, paternity, and caregiver time
  • Company retreats – twice-yearly gatherings in inspiring locations


