
Lead Site Reliability Engineer
Posted 12 hours ago

This is a fully remote position, open to applicants in the United States.
• Act as a technical leader in reliability engineering, operational excellence, and platform modernization within the Civic Platform.
• Spearhead platform modernization projects, transitioning from VM-based architectures to containerized and cloud-native services, collaborating with DevOps Engineering, Database Engineering, Security, and Development teams.
• Oversee initiatives that enhance and maintain the availability, performance, scalability, security, and cost-effectiveness of Accela's SaaS solutions.
• Establish, implement, and manage service level objectives (SLOs), service level agreements (SLAs), and error budgets for essential platform services, utilizing data to guide prioritization and risk-based decision-making.
• Direct observability initiatives encompassing metrics, distributed tracing, logging, and monitoring platforms to boost system visibility and expedite issue identification and resolution.
• Lead Root Cause Analysis (RCA) for complex production incidents, facilitate blameless postmortems, and ensure that corrective actions are carried out and tracked to completion.
• Design, develop, and sustain automation, tools, and software solutions that enhance reliability, operational efficiency, scalability, and developer productivity.
• Act as a senior technical escalation point during production incidents and for platform changes affecting availability, performance, security, or compliance.
• Collaborate with Security and Compliance teams to ensure that platform operations adhere to regulatory and compliance standards, including SOC 2, HIPAA, FedRAMP, StateRAMP, and PCI-DSS.
• Convert operational metrics, reliability trends, and platform health data into actionable insights for engineering leadership and executive stakeholders.
• Mentor engineers within the Cloud Engineering organization and advocate for engineering best practices through technical leadership and collaboration.
• Over 8 years of experience in Site Reliability Engineering, Software Engineering, Cloud Infrastructure, or related fields within a SaaS environment, including experience leading complex technical projects.
• Proven technical leadership in driving platform modernization in containerized and orchestrated settings, including Kubernetes or similar technologies.
• Practical experience in operating and supporting large-scale SaaS platforms on Microsoft Azure.
• Experience in developing automation and operational tools using Python, PowerShell, Bash, or similar scripting languages.
• Deep expertise in designing, operating, analyzing, and troubleshooting complex distributed systems across the application, infrastructure, networking, and operating system layers.
• Strong familiarity with modern observability platforms, including monitoring, logging, metrics, and distributed tracing.
• Proven success in leading incident response, Root Cause Analysis, and continuous improvement initiatives.
• Experience in establishing and enhancing Incident, Problem, and Change Management practices.
• Excellent written and verbal communication skills, with the ability to effectively convey technical concepts to engineering leadership and executive stakeholders.
• Experience with Git and GitHub-based development workflows.
• Flexible time off
• Comprehensive medical, dental, and vision plans
• Family planning benefits
• 401(k) retirement savings plan with company match
• Health savings account with company contributions
• Flexible spending account
• Life, accident, and disability coverage
• Business travel insurance
• Employee assistance programs
• Other well-being benefits