
Staff Software Engineer – Grafana Cloud
Posted 19 hours ago

• Cultivate and enhance a robust culture of operational excellence by establishing standards and mentoring teams to take ownership of reliability and availability.
• Promote advanced DevOps/SRE methodologies, encompassing incident response and post-incident reviews (PIRs), on-call readiness, runbooks, alerting, observability, and change/release management.
• Develop reliability frameworks such as Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets, utilizing them to steer prioritization and engineering trade-offs.
• Offer insights into system performance through transparent operational metrics and reliability reporting.
• Assist teams in the design, development, evolution, and management of extensive, distributed cloud systems.
• Shape product and system direction by participating in design reviews, architectural discussions, and fostering cross-team collaboration.
• Share knowledge through clear, high-quality documentation and technical communication (internally and, when appropriate, externally) to help teams build and maintain systems more efficiently.
• As the reliability foundation matures, expand into broader leadership roles in application and product development, providing architectural and technical insight beyond operational matters.
• Extensive experience with DevOps/SRE methodologies, including the operation and evolution of production systems at scale.
• Solid programming background in a contemporary language (Python and Go are preferred, but prior experience with them is not mandatory).
• Proficiency in designing, building, and managing large-scale distributed systems.
• Strong grasp of reliability engineering principles (e.g., incident management, observability, and failure modes).
• Experience with test automation, covering both performance and functional testing.
• Capability to influence engineering practices through effective technical communication, reviews, and collaboration.
• Excellent interpersonal skills and ability to collaborate effectively across teams.
• Familiarity with modern software engineering methodologies and delivery practices.
• Self-motivated and comfortable working autonomously in ambiguous situations.
• Equity
• Bonus (if applicable)
• 30 days annual leave
• In-person onboarding
• Company-funded AI tools budget
• Grafana Shutdown Days