[Remote] Cloud Operations Engineer
Note: This is a remote position open to candidates in the USA. O'Reilly Media is dedicated to sharing the knowledge of innovators and helping professionals develop expertise. As a Cloud Operations Engineer, you will work on the systems and tooling that power the learning platform, focusing on infrastructure-as-code and Kubernetes maintenance while collaborating with product engineering teams.
Responsibilities
- Maintain and update our Kubernetes cluster to ensure steady-state operations
- Write or extend Terraform modules to provision and manage cloud infrastructure
- Contribute features to the Python CLI tooling we use to manage infrastructure workflows
- Design, build, and maintain cloud infrastructure using infrastructure-as-code (Terraform) on GCP
- Manage and evolve our Kubernetes platform, including cluster operations, workload configuration, and service mesh (Istio)
- Develop and improve internal tooling that abstracts cloud complexity and improves the developer experience
- Collaborate with product engineering teams to understand service deployment needs and deliver infrastructure solutions
- Monitor platform health using Datadog; proactively identify and resolve performance, availability, and security issues
- Participate in on-call rotation and incident response; drive blameless post-mortems and eliminate recurring issues at their root cause
- Define and track service-level indicators and objectives (SLIs/SLOs) for critical platform components
- Implement and refine alerting, dashboards, and runbooks that reduce mean time to resolution
- Embed security best practices into infrastructure workflows (DevSecOps) — not as an afterthought, but as a design principle
- Help maintain cloud security posture, IAM hygiene, and policy guardrails across our cloud environment
- Stay current with cloud security developments and proactively surface risks to the team
- Execute and maintain our automated disaster recovery processes
- Work closely with product engineering teams to understand their needs and remove infrastructure friction
- Document systems, processes, and architectural decisions clearly so knowledge is shared, not siloed
- Recommend improvements to tooling, architecture, and processes — and help drive them to completion
- Keep current with the evolving cloud-native ecosystem and bring relevant knowledge back to the team
Skills
- Bachelor's degree in Computer Science or a related field
- In lieu of a degree, equivalent education and/or experience may be considered
- 5+ years of experience working in cloud infrastructure, platform engineering, or a related discipline
- Hands-on experience with Kubernetes in production environments (cluster management, workloads, networking)
- Proficiency with infrastructure-as-code tools, particularly Terraform
- Experience with at least one major cloud provider (GCP, AWS, or Azure)
- Solid scripting and automation skills in Python, Bash, or a comparable language
- Experience with modern observability platforms (Datadog, Grafana, or similar)
- Strong understanding of Linux systems administration
- Working knowledge of CI/CD concepts and tools (GitHub Actions, ArgoCD, Jenkins, or similar)
- Excellent communication skills — you write clearly, ask good questions, and explain complex systems accessibly
- AI-augmented development: demonstrated ability to use AI-enabled development tools (e.g., Claude Code, Cursor) to streamline coding, debugging, and infrastructure-as-code authoring
- Experience with service mesh technologies such as Istio or Linkerd
- Familiarity with GitOps workflows and tools (ArgoCD, Flux)
- Experience with DevSecOps practices and tooling (Snyk, Trivy, OPA, or similar)
- Working knowledge of SQL databases (PostgreSQL or MySQL)
- Familiarity with FinOps practices and cloud cost optimization
- Experience building or consuming internal developer platforms (IDPs)
- Configuration management experience (Ansible, Chef, or similar)
- Relevant certifications (CKA, CKAD, AWS/GCP Professional, or similar)
Company Overview