
Staff Site Reliability Engineer
Posted May 11

This is a fully remote position, open to applicants in Argentina.
• Spearhead the creation of Domino's internal AI-driven reliability tools, including systems that analyze tickets, logs, traces, and documentation to help teams resolve outages faster and with less repetitive effort.
• Enhance the observability scope and signal integrity for our most essential customer-facing systems, providing engineers with better resources throughout the development and support lifecycle.
• Manage incident response comprehensively, from detection to resolution, ensuring each problem area is better documented, understood, and less prone to recurrence.
• Lead the creation of customer- and user-facing observability tools integrated within our products.
• Establish and refine SLO/SLI frameworks for priority services, transforming abstract reliability objectives into measurable, actionable standards.
• Scale cloud operations practices for Domino’s single-tenant SaaS solution, collaborating with engineering teams to enhance the reliability and consistency of customer deployments and upgrades.
• Mentor fellow engineers and influence the practice of SRE at Domino, including incident response processes, operational readiness expectations, and a culture of learning from incidents.
• Extensive experience in Site Reliability Engineering, platform engineering, or a software engineering role with significant, hands-on operational responsibility.
• Proficiency in Kubernetes, Linux, cloud platforms, and observability tools, and the ability to use them to diagnose complex, real-world production issues.
• A strong aptitude for identifying and bridging reliability gaps in technical products, tools, and processes.
• Solid software engineering capabilities in Python or Go, with a proven history of developing internal tools or services that are genuinely relied upon.
• Comfort in leading technically ambiguous projects and influencing direction across teams without requiring direct authority to accomplish tasks.
• A background in enhancing reliability through engineering and automation, rather than solely addressing issues manually.
• Excellent communication skills and genuine experience mentoring engineers or influencing technical decision-making within your team.
• Sound judgment regarding AI/LLM tools: you understand where they truly benefit operational workflows and where they may create noise rather than signal.
• Bonus: Familiarity with LLM-based systems, retrieval workflows, SaaS platform operations, or developing tools for support or developer teams.
• We strongly believe in the importance of cultivating a diverse team and welcome candidates from all backgrounds, genders, ethnicities, abilities, and sexual orientations to apply.
• We value a growth mindset, encouraging high-performing creative individuals who tackle challenges and identify opportunities for success.
• We appreciate individuals who pursue truth and speak honestly, and we want them to be their authentic selves at work.
• We recognize and support those who believe in the possibility of continuous improvement. At Domino, everything is a work in progress, and we can always enhance our efforts.
• We promote an environment of teaching and learning, equipping employees with the resources necessary for success in their roles and within the company.