
[Remote] Site Reliability Engineer

Remote role | Full-time | Open position

Note: This is a remote position open to candidates in the USA. Runpod is a rapidly growing company that provides a foundational platform for developers to build and run custom AI systems. As a Site Reliability Engineer, you will ensure the stability and resilience of Runpod’s distributed platform by partnering with engineering teams, improving system design, and enhancing observability to prevent incidents.

Responsibilities

  • Define and implement SLIs/SLOs for critical services
  • Lead incident response and coordinate cross-team mitigation efforts
  • Conduct blameless postmortems and ensure corrective actions are completed
  • Perform production readiness reviews for new services and features
  • Identify systemic risks and drive preventative improvements
  • Design and improve monitoring, alerting, and dashboards (Prometheus, Grafana, etc.)
  • Improve signal-to-noise ratio in alerts and reduce alert fatigue
  • Build internal tooling for reliability tracking and reporting
  • Improve visibility into GPU performance and distributed systems health
  • Automate recurring operational workflows
  • Build tools and scripts (Python, Go, Bash) to eliminate manual processes
  • Improve deployment safety through automation and guardrails
  • Strengthen CI/CD reliability and release processes
  • Partner with engineering teams to improve system resilience
  • Provide guidance on fault tolerance, scalability, and failure handling
  • Contribute to architectural discussions with a reliability-first mindset

Skills

  • 5+ years of experience in SRE, Reliability Engineering, or Production Engineering
  • Strong Linux systems and networking expertise
  • Experience managing containerized production systems
  • Strong understanding of distributed systems and failure modes
  • Experience defining and managing SLIs/SLOs
  • Proven incident response and postmortem leadership experience
  • Strong scripting or programming skills
  • Experience with monitoring and alerting systems
  • Excellent written communication skills
  • Successful completion of a background check
  • Experience with GPU infrastructure or AI/ML platforms
  • Experience improving reliability in high-growth or large-scale environments
  • Familiarity with GPU observability tooling
  • Experience with Infrastructure as Code
  • Experience working in startup environments
  • Experience building internal reliability platforms or frameworks

Benefits

  • Meaningful equity in a fast-growing company: everyone on the team receives stock options, so your impact drives our growth and you share in the upside.
  • Generous medical, dental & vision plans
  • Flexible PTO: take the time you need to recharge
  • Most roles are remote-first, with inclusive, collaborative teams that use Slack as the main form of internal communication
  • Join a passionate team on the cutting edge of AI infrastructure, where culture, learning, and ownership are at the heart of how we scale.

Company Overview

  • Runpod is a cloud platform designed for GPUs, enabling developers to deploy customized full-stack AI applications. It was founded in 2022 and is headquartered in Mount Laurel, New Jersey, USA, with a workforce of 51-200 employees. Its website is https://www.runpod.io.

Company H1B Sponsorship

  • Runpod has a track record of offering H1B sponsorships, with 4 in 2025 and 3 in 2024. Please note that this does not guarantee sponsorship for this specific role.