[Remote] Senior Full Stack Software Engineer - DGX Cloud
Note: This is a remote position open to candidates in the USA. NVIDIA is hiring experienced software engineers to help scale up its AI infrastructure. The role involves designing and developing a massively distributed, scalable platform for GPU clusters used in AI workloads, while ensuring production AI clusters run reliably and consistently with maximum performance.
Responsibilities
- You will be part of a DGX Cloud team responsible for production systems that enable large, scalable GPU clusters to be used for a variety of AI workloads
- Designing and developing a massively distributed, scalable platform used to identify, diagnose, and remediate non-performant GPU assets
- Working with teams across NVIDIA to ensure production AI clusters run reliably and consistently with maximum performance. Evaluating system failures and improving services based on a well-defined incident management process
- Working across our entire product stack: React, Web Components, TypeScript, Golang, PostgreSQL, Temporal, Bazel, Kubernetes
Skills
- Direct experience in a software engineering role within a highly technical organization with demonstrable impact from your work
- Highly motivated with strong communication skills. You can work successfully with multi-functional teams, principals, and architects, and coordinate effectively across organizational boundaries and geographies
- 12+ years in a similar role, with experience on large-scale production systems. Experience with common software engineering principles, tools, and techniques
- You possess a BS in Computer Science or Engineering or equivalent experience
- 6+ years of experience doing full-stack engineering
- 3+ years building and shipping consumer-facing products
- Proficiency in React, TypeScript/JavaScript, and Golang
- Proficiency with a SQL database
- Technical competency in managing and automating large-scale distributed systems independent of cloud providers. Advanced hands-on experience and deep understanding of cluster management systems (Kubernetes, Slurm, Base Command Manager)
- Empathy for users, attention to detail, and a passion for creating world-class user experiences
- Prior experience with asynchronous workflows and/or event-driven architecture
- Proven operational excellence in maintaining reliable and performant infrastructure
- A good understanding of how to use LLMs responsibly and the perils of blindly consuming their output
Benefits
- You will also be eligible for equity and [benefits](https://www.nvidia.com/en-us/benefits/).