[Remote] Principal GPU Infrastructure Engineer – AI/HPC Systems

Remote role | Full-time | Open position

Note: This is a remote position open to candidates in the USA. Axiom Recruit is partnering with a rapidly scaling technology business that is building advanced compute infrastructure for next-generation AI systems. They are seeking a Principal GPU Infrastructure Engineer to design and operate large-scale GPU environments supporting demanding, enterprise-grade workloads across high-performance compute platforms.

Responsibilities

  • Own the lifecycle management of large-scale GPU infrastructure, from provisioning and firmware validation through to operational reliability
  • Lead operations across high-density, liquid-cooled compute environments supporting next-generation AI workloads
  • Build automated observability and remediation systems using Prometheus, Grafana, NVIDIA DCGM, and infrastructure automation tooling
  • Drive NetBox DCIM integration, asset management, IPAM, and infrastructure compliance across complex compute environments
  • Act as a senior technical lead for infrastructure operations, incident response, vendor management, and enterprise-level infrastructure support

Skills

  • Strong experience managing large-scale GPU, HPC, or high-performance compute infrastructure
  • Deep hands-on expertise with NVIDIA GPU systems, including H200, B200, or B300 environments
  • Advanced knowledge of InfiniBand, NVLink, NVSwitch, and high-throughput networking architectures
  • Strong Linux systems engineering background with infrastructure automation using Python or Go
  • Experience with observability and monitoring tooling including Prometheus, Grafana, NVIDIA DCGM, and SNMP
  • Proven experience across bare-metal provisioning, infrastructure lifecycle management, and automated/self-healing systems
  • Experience with liquid-cooled or high-density compute environments
  • Familiarity with NVIDIA Mission Control and GPU cluster management
  • Exposure to confidential compute technologies and attestation workflows
  • Experience building infrastructure standards in fast-scaling environments

Benefits

  • Competitive salary and benefits package
  • Opportunity to build next-generation AI infrastructure
  • Exposure to cutting-edge GPU and HPC environments
  • Strong ownership across infrastructure and automation
  • Engineering-led culture working on mission-critical systems

Company Overview

  • Axiom Recruit is a Web3/Blockchain/AI recruitment firm. It was founded in 2019 and is headquartered in Dubai, UAE, with a workforce of 11-50 employees. Its website is https://www.axiomrecruit.com/.