
Staff Site Reliability Engineer
Posted 1 hour ago

This is a fully remote position, open to applicants in California.
• Spearhead the creation of Domino's internal AI-driven reliability tools, encompassing systems that evaluate tickets, logs, traces, and documentation to assist teams in resolving outages more swiftly and with reduced recurring effort.
• Enhance the observability coverage and signal quality for our most vital customer-facing systems, providing engineers with more resources throughout the development and support lifecycle.
• Take ownership of incident response from start to finish, from detection through remediation, ensuring that each problem area is better documented, understood, and less prone to recurrence.
• Direct the development of customer- and user-facing observability tools integrated within our products.
• Establish and refine SLO/SLI frameworks for priority services, transforming abstract reliability objectives into quantifiable, actionable standards.
• Scale cloud operations practices for Domino’s single-tenant SaaS solution and collaborate with engineering teams to enhance the reliability and consistency of customer deployments and upgrades.
• Mentor fellow engineers and influence the practice of SRE at Domino, including incident response procedures, operational readiness standards, and a culture of post-incident learning.
• Extensive experience in Site Reliability Engineering, platform engineering, or a software engineering position with authentic, hands-on operational responsibility.
• Proficiency with Kubernetes, Linux, cloud platforms, and observability tools, along with the capability to utilize them for investigating complex, real-world production issues.
• A strong aptitude for identifying and addressing reliability gaps in technical products, tools, and processes.
• Solid software development skills in Python or Go, with a proven history of creating internal tools or services that are genuinely relied upon.
• Comfort in leading technically ambiguous projects and influencing direction across teams without requiring direct authority to accomplish tasks.
• A background of enhancing reliability through engineering and automation, rather than solely relying on manual fire-fighting.
• Excellent communication skills and substantial experience mentoring engineers or influencing technical decision-making within your team.
• Sound judgment regarding AI/LLM tools: understanding where they truly assist in operational workflows and where they introduce more noise than clarity.
• Bonus: Familiarity with LLM-based systems, retrieval workflows, SaaS platform operations, or developing tools for support or developer teams.
• Equity
• Company bonus or sales commissions/bonuses
• 401(k) plan
• Medical, dental, and vision benefits
• Wellness stipends