
[Remote] Senior Site Reliability Engineer

Remote role | Full-time | Open position

Note: This is a remote role open to candidates in the USA. CertifyOS is building the data infrastructure that powers modern healthcare. The company is seeking a Senior Site Reliability Engineer who will design for reliability, manage the operational lifecycle, and influence platform architecture and deployment workflows across its systems.

Responsibilities

  • Design for reliability, ship the automation, and stand behind it in production
  • Own the operational lifecycle end-to-end and influence platform architecture, reliability standards, and deployment workflows
  • Own the full lifecycle of what you support, from infrastructure design and deployment automation through observability, incident response, and postmortems
  • Improve autoscaling behavior, resource utilization, and workload efficiency across cloud-native distributed systems
  • Own incident response processes, root cause analysis, escalation workflows, and runbooks
  • Build and maintain Infrastructure as Code, CI/CD pipelines, and operational tooling that reduce manual work and improve engineering productivity
  • Instrument data freshness and infrastructure health, not just service uptime

Skills

  • 5+ years in SRE, DevOps, Platform Engineering, or Infrastructure Engineering — operating production systems at scale where your infrastructure is someone else's dependency and failures have real downstream consequences
  • Track record of improving reliability end-to-end: you've debugged hard production problems, made them not happen again, and built the alerting to prove it
  • Strong Linux systems administration, incident response, and root cause analysis skills
  • Comfort influencing operational standards and mentoring teams on reliability practices
  • Deep hands-on experience with GCP — GKE, Cloud Run, and containerized workloads at scale
  • Experience building and maintaining Infrastructure as Code with Terraform and/or Pulumi
  • Fluency across deployment patterns and the judgment to know when each fits: rolling deployments, blue/green, canary — and the rollback story for each
  • Experience with autoscaling, resource optimization, and infrastructure efficiency for distributed systems
  • Experience managing infrastructure security, secrets, and access controls in regulated or security-conscious environments
  • Strong understanding of Golden Signals monitoring — latency, traffic, errors, saturation — and how to make them actionable rather than noisy
  • Experience designing SLIs, SLOs, error budgets, alerting strategies, dashboards, and escalation workflows
  • Hands-on experience with observability platforms: Google Cloud Monitoring, Datadog, Grafana, Prometheus, or similar
  • Strong sense of data platform health: lineage, freshness, and correctness matter as much to you as throughput
  • Experience building and maintaining CI/CD pipelines using GitHub Actions or similar
  • Scripting or programming fluency in Python, Bash, Go, or similar — you reduce toil through code, not process
  • Experience working with Git workflows and modern software delivery practices
  • Strong written and verbal communication — you can explain an operational risk to an engineer and a product manager in the same conversation
  • Experience operating systems handling sensitive data or PII in regulated or compliance-adjacent environments
  • Experience operating large-scale distributed systems or microservices architectures
  • Familiarity with healthcare, credentialing, or health-tech environments
  • Experience leveraging AI-assisted observability or incident response tooling
  • Familiarity with NodeJS, TypeScript, Java, or React application stacks

Benefits

  • We provide 100% coverage of health, dental, and vision insurance premiums for employees.
  • Our US-based team benefits from unlimited PTO, with at least two weeks off each year to recharge.
  • In India, employees are supported with health insurance, statutory leave benefits, and additional wellness (menstrual) leave for women.

Company Overview

  • CertifyOS is building the future of provider data infrastructure. About five years ago, we set out with a bold aspiration: One API. One provider ID. CertifyOS was founded in 2021 and is headquartered in New York, New York, USA, with a workforce of 201-500 employees. Its website is https://www.certifyos.com.

Company H1B Sponsorship

  • CertifyOS has a track record of offering H1B sponsorships, with 2 in 2022 and 1 in 2021. Please note that this does not guarantee sponsorship for this specific role.