[Remote] Platform Engineer
Note: This is a remote position open to candidates in the USA. Hyrhub is seeking a Senior Infrastructure Architect / Platform Engineer to provide technical leadership for the cloud platforms behind its AI/ML platform, which supports enterprise-scale generative AI applications. The role involves defining infrastructure architecture, leading platform standards, and collaborating with engineering teams to raise operational maturity across AI platforms.
Responsibilities
- Define and drive the technical strategy for AI/ML platform infrastructure supporting generative AI applications, LLM integrations, model routing, and enterprise AI services
- Architect, build, and operate scalable cloud platforms using AWS services such as EKS, ECS Fargate, Lambda, DynamoDB, S3, OpenSearch, Secrets Manager, CloudWatch, ALB, and MWAA
- Establish reusable infrastructure patterns using CloudFormation, Helm, and Terraform to support reliable multi-environment and multi-region deployments
- Lead CI/CD architecture using GitHub Actions, reusable workflows, OIDC-based AWS authentication, automated quality gates, deployment promotion, and environment approvals
- Design and improve observability across AI platforms, including CloudWatch dashboards, logs, alarms, Prometheus/Grafana, OpenSearch, Langfuse, and LLM-specific operational metrics
- Build platform capabilities for GenAI workloads, including model availability monitoring
- Partner with software engineering teams to improve deployment reliability, rollback strategies, health checks, autoscaling, load testing, and runtime performance
- Define and enforce security and compliance practices for infrastructure, including IAM permission boundaries, Secrets Manager usage, secret scanning, audit logging, tagging standards, and change-management controls
- Provide technical leadership for cost optimization, capacity planning, environment standardization, and operational resilience across development, test, production, and sandbox environments
- Mentor engineers, review architecture and infrastructure designs, and influence platform engineering practices across teams
- Troubleshoot complex production issues across cloud infrastructure, networking, containers, serverless workloads, CI/CD systems, and observability platforms
- Translate enterprise requirements for security, compliance, reliability, and governance into pragmatic engineering standards and automation
Skills
- Bachelor's degree in Computer Science, Engineering, Information Technology, or a related technical field, or equivalent practical experience
- 7+ years of experience in DevOps, platform engineering, cloud infrastructure, site reliability engineering, or software engineering roles
- Strong hands-on experience with AWS/Azure/GCP infrastructure and services, including container, serverless, networking, storage, observability, and security services
- Experience designing and operating production systems on Kubernetes, ECS/Fargate, or comparable container orchestration platforms
- Proficiency with infrastructure-as-code, especially CloudFormation, Terraform, Helm, or similar tooling
- Strong CI/CD experience with GitHub Actions or similar platforms, including reusable workflows, automated testing, deployment gates, and cloud authentication
- Experience building and operating observability solutions using CloudWatch, Prometheus/Grafana, OpenSearch, or similar tools
- Strong understanding of cloud security practices, IAM, secrets management, least-privilege access, audit logging, and compliance requirements
- Experience supporting distributed systems, microservices, APIs, asynchronous workloads, and multi-environment deployments
- Demonstrated ability to lead technical design, mentor engineers, and influence engineering practices across teams
- Experience supporting AI/ML or generative AI platforms, including LLM gateways, model routing, prompt observability, token metering, or model failover
- Experience operating platforms in regulated enterprise environments, ideally healthcare, pharmaceutical, finance, or life sciences
- Experience with multi-account, multi-region AWS architectures and enterprise governance patterns
- Experience with cost optimization, autoscaling strategies, capacity planning, and cloud budget monitoring
- Experience with load testing and performance validation using tools such as Locust or comparable frameworks
- Strong Python or scripting skills for platform automation, operational tooling, and CI/CD extensions
- Ability to communicate complex technical decisions clearly to engineering, security, operations, and leadership audiences
Company Overview