This is a fully remote position, open to applicants in the United States.

📋 Description

• Enhance platform availability while minimizing the frequency and duration of incidents.

• Develop and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) across various services.

• Optimize Mean Time to Recovery (MTTR) through improved tools, automation, and comprehensive runbooks.

• Reinforce production readiness standards.

• Drive long-term, systemic reliability improvements.

• Establish and execute SLIs/SLOs for essential services.

• Oversee incident response efforts and coordinate mitigation strategies across teams.

• Conduct blameless post-incident reviews and ensure that corrective measures are executed.

• Perform assessments of production readiness for new services and features.

• Identify systemic vulnerabilities and promote preventative enhancements.

• Design and refine monitoring, alerting, and dashboard solutions (using tools such as Prometheus and Grafana).

• Improve the signal-to-noise ratio of alerts to reduce alert fatigue.

• Develop internal tools for tracking and reporting on reliability.

• Increase visibility into GPU performance and the health of distributed systems.

• Automate recurring operational tasks.

• Create tools and scripts (in Python, Go, Bash) to eliminate manual work.

• Enhance deployment safety through automation and protective measures.

• Strengthen CI/CD processes and release reliability.

• Collaborate with engineering teams to bolster system resilience.

• Offer insights on fault tolerance, scalability, and failure management.

• Participate in architectural discussions with a focus on reliability.

⛳️ Requirements

• Over 5 years of experience in Site Reliability Engineering (SRE), Reliability Engineering, or Production Engineering.

• Strong expertise in Linux systems and networking.

• Experience in managing containerized production environments.

• In-depth understanding of distributed systems and their failure modes.

• Proven experience in defining and managing SLIs/SLOs.

• Demonstrated leadership in incident response and post-incident reviews.

• Strong skills in scripting or programming.

• Familiarity with monitoring and alerting systems.

• Excellent written communication abilities.

• Successful completion of a background check.

🏝️ Benefits

• Significant equity in a rapidly growing company—every team member receives stock options, allowing you to share in our success as we grow.

• Comprehensive medical, dental, and vision plans.

• Flexible Paid Time Off (PTO)—take the time you need to rejuvenate.

• Most positions are remote-first, fostering an inclusive and collaborative environment, with Slack as our primary mode of internal communication.

• Join a dedicated team at the forefront of AI infrastructure, where culture, learning, and ownership are central to our scaling efforts.

Site Reliability Engineer

