
Senior Manager, Platform, Lifecycle, Troubleshooting
Posted 3 hours ago

This is a fully remote position, open to applicants in the United States.
• Oversee the Platform, Lifecycle & Troubleshooting team to address intricate incidents and platform challenges.
• Take charge of server repurposing, migrations (such as OS/distribution upgrades), and extensive lifecycle management.
• Conduct and direct advanced troubleshooting for RDMA links, GPU, storage, and server-side networking.
• Evaluate firmware choices and manage complex, ongoing firmware updates.
• Deliver 24/7 on-call leadership and enhance incident response processes.
• Create runbooks, automation, and self-healing protocols to minimize toil and reduce MTTR.
• Work closely with Hardware and Onboarding teams regarding handoffs and mixed tickets.
• Collaborate with Engineering, Networking, and Solutions teams on technical escalations and enhancements.
• Guide senior engineers and cultivate a high-performing team dedicated to root-cause analysis.
• Monitor key metrics (uptime, incident trends, migration success) and advance operational maturity.
• A minimum of 8 years of experience in Linux systems administration, platform engineering, or SRE-style operations within cloud or large-scale infrastructure settings.
• Deep expertise in troubleshooting GPU, storage, RDMA, and high-performance networking issues.
• Established history of leading technical teams, encompassing on-call rotations and complex migrations.
• Strong skills in scripting/automation (Python, Bash, Ansible, etc.) and familiarity with monitoring tools.
• Exceptional problem-solving, documentation, and cross-team communication skills.
• Bachelor’s degree in Computer Science or Engineering, or equivalent experience.
• Company covers 100% of insurance premiums for employee medical, dental, and vision plans.
• 401(k) plan with a 100% match up to 4%, featuring immediate vesting.
• Annual Professional Development Reimbursement of $2,500.
• 11 Holidays plus Paid Time Off Accrual and Rollover Plan.
• Vultr values commitment! Increased PTO at the 3-year and 10-year milestones, alongside a 1-month paid sabbatical every 5 years and an Anniversary Bonus each year.
• $500 stipend for remote office setup in the first year, followed by $400 each subsequent year.
• Internet reimbursement of up to $75 per month.
• Gym membership reimbursement up to $50 per month.
• Company-paid Wellable subscription.