
Site Reliability Engineer
Posted 2 days ago

This is a fully remote position, open to applicants in the United States.
• Enhance platform availability while minimizing the frequency and duration of incidents.
• Develop and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) across various services.
• Optimize Mean Time to Recovery (MTTR) through improved tools, automation, and comprehensive runbooks.
• Reinforce production readiness standards.
• Drive long-term, systemic reliability improvements.
• Establish and execute SLIs/SLOs for essential services.
• Oversee incident response efforts and coordinate mitigation strategies across teams.
• Conduct blameless post-incident reviews and ensure that corrective measures are executed.
• Perform assessments of production readiness for new services and features.
• Identify systemic vulnerabilities and promote preventative enhancements.
• Design and refine monitoring, alerting, and dashboard solutions (using tools like Prometheus, Grafana, etc.).
• Enhance the signal-to-noise ratio in alerts to alleviate alert fatigue.
• Develop internal tools for tracking and reporting on reliability.
• Increase visibility into GPU performance and the health of distributed systems.
• Automate recurring operational tasks.
• Create tools and scripts (in Python, Go, Bash) to eliminate manual work.
• Enhance deployment safety through automation and protective measures.
• Strengthen CI/CD processes and release reliability.
• Collaborate with engineering teams to bolster system resilience.
• Offer insights on fault tolerance, scalability, and failure management.
• Participate in architectural discussions with a focus on reliability.
• Over 5 years of experience in Site Reliability Engineering (SRE), Reliability Engineering, or Production Engineering.
• Strong expertise in Linux systems and networking.
• Experience in managing containerized production environments.
• In-depth understanding of distributed systems and their failure modes.
• Proven experience in defining and managing SLIs/SLOs.
• Demonstrated leadership in incident response and post-incident reviews.
• Strong skills in scripting or programming.
• Familiarity with monitoring and alerting systems.
• Excellent written communication abilities.
• Successful completion of a background check.
• Significant equity in a rapidly growing company—every team member receives stock options, allowing you to share in our success as we grow.
• Comprehensive medical, dental, and vision plans.
• Flexible Paid Time Off (PTO)—take the time you need to rejuvenate.
• Most positions are remote-first, fostering an inclusive and collaborative environment, with Slack as our primary mode of internal communication.
• Join a dedicated team at the forefront of AI infrastructure, where culture, learning, and ownership are central to our scaling efforts.