[Remote] Network Engineer - Network Resiliency and High Availability
Note: The job is a remote job and is open to candidates in USA. Dice is seeking a Senior Network Engineer specializing in Network Resiliency and High Availability to ensure their global network infrastructure remains fault-tolerant and capable of seamless disaster recovery. The role involves designing, validating, and optimizing redundant paths and high-availability clusters while ensuring zero packet loss for business-critical applications during unforeseen failures.
Responsibilities
- Design, implement, and maintain high-availability network topologies using physical and logical redundancy patterns (e.g., Multi-Chassis EtherChannel/MCLAG, VPC, and VSS)
- Architect redundant Wide Area Network (WAN) transport paths utilizing dual-homed ISP connections, SD-WAN dynamic path selection, and automated failover technologies
- Conduct controlled Network Chaos Engineering exercises (e.g., simulating fiber cuts, device power failures, and split-brain scenarios) to validate failover timers and resilience assumptions
- Optimize enterprise routing protocols (BGP, OSPF, EIGRP) for ultra-fast convergence, tuning features like Bidirectional Forwarding Detection (BFD), Fast Reroute (FRR), and Graceful Restart
- Implement First Hop Redundancy Protocols (HSRP, VRRP, GLBP) to guarantee default gateway redundancy for end-user and server segments
- Manage complex traffic engineering strategies (e.g., BGP local preference, AS-path prepending) to ensure predictable asymmetric/symmetric routing during failure states
- Lead the network engineering track for Corporate Disaster Recovery planning, including active-active and active-passive data center strategies
- Design, configure, and maintain automated DNS-based failover (GSLB) and Anycast routing strategies to reroute user traffic away from degraded data centers or cloud regions
- Keep comprehensive, up-to-date documentation on failover runbooks and infrastructure dependency maps
- Deploy advanced monitoring tools to track metrics like Mean Time to Detect (MTTD) and Mean Time to Repair (MTTR)
- Set up telemetry-based alerting (SNMP, gRPC/Streaming Telemetry) to identify gray failures (e.g., high interface error rates causing intermittent drops) before they cause total outages
Skills
- 5+ years in a dedicated network engineering or operations role, with a proven track record of designing 99.99% or 99.999% (Four-to-Five Nines) uptime environments
- Bachelor's degree in Computer Science, Computer Engineering, or equivalent practical experience
- Design, implement, and maintain high-availability network topologies using physical and logical redundancy patterns (e.g., Multi-Chassis EtherChannel/MCLAG, VPC, and VSS)
- Architect redundant Wide Area Network (WAN) transport paths utilizing dual-homed ISP connections, SD-WAN dynamic path selection, and automated failover technologies
- Conduct controlled Network Chaos Engineering exercises (e.g., simulating fiber cuts, device power failures, and split-brain scenarios) to validate failover timers and resilience assumptions
- Optimize enterprise routing protocols (BGP, OSPF, EIGRP) for ultra-fast convergence, tuning features like Bidirectional Forwarding Detection (BFD), Fast Reroute (FRR), and Graceful Restart
- Implement First Hop Redundancy Protocols (HSRP, VRRP, GLBP) to guarantee default gateway redundancy for end-user and server segments
- Manage complex traffic engineering strategies (e.g., BGP local preference, AS-path prepending) to ensure predictable asymmetric/symmetric routing during failure states
- Lead the network engineering track for Corporate Disaster Recovery planning, including active-active and active-passive data center strategies
- Design, configure, and maintain automated DNS-based failover (GSLB) and Anycast routing strategies to reroute user traffic away from degraded data centers or cloud regions
- Keep comprehensive, up-to-date documentation on failover runbooks and infrastructure dependency maps
- Deploy advanced monitoring tools to track metrics like Mean Time to Detect (MTTD) and Mean Time to Repair (MTTR)
- Set up telemetry-based alerting (SNMP, gRPC/Streaming Telemetry) to identify gray failures (e.g., high interface error rates causing intermittent drops) before they cause total outages
- Cisco Certified Internetwork Expert (CCIE - Enterprise Infrastructure or Data Center) or strong CCNP with equivalent experience
- Juniper Networks Certified Internetworking Specialist/Expert (JNCIS/JNCIE)
- Certified Business Continuity Professional (CBCP) or equivalent familiarity with DR frameworks is a plus
Company Overview
Company H1B Sponsorship