This is a fully remote position, open to applicants in the United States.

📋 Description

• Act as a technical leader in reliability engineering, operational excellence, and platform modernization within the Civic Platform.

• Spearhead platform modernization projects, transitioning from VM-based architectures to containerized and cloud-native services, collaborating with DevOps Engineering, Database Engineering, Security, and Development teams.

• Oversee initiatives that enhance and maintain the availability, performance, scalability, security, and cost-effectiveness of Accela's SaaS solutions.

• Establish, implement, and manage service level objectives (SLOs), service level agreements (SLAs), and error budgets for essential platform services, utilizing data to guide prioritization and risk-based decision-making.

• Direct observability initiatives encompassing metrics, distributed tracing, logging, and monitoring platforms to boost system visibility and expedite issue identification and resolution.

• Lead Root Cause Analysis (RCA) for intricate production incidents, facilitate blameless postmortems, and ensure that corrective actions are carried out and monitored to completion.

• Design, develop, and sustain automation, tools, and software solutions that enhance reliability, operational efficiency, scalability, and developer productivity.

• Act as a senior technical escalation point during production incidents and for platform changes affecting availability, performance, security, or compliance.

• Collaborate with Security and Compliance teams to ensure that platform operations adhere to regulatory and compliance standards, including SOC 2, HIPAA, FedRAMP, StateRAMP, and PCI-DSS.

• Convert operational metrics, reliability trends, and platform health data into actionable insights for engineering leadership and executive stakeholders.

• Mentor engineers within the Cloud Engineering organization and advocate for engineering best practices through technical leadership and collaboration.

⛳️ Requirements

• Over 8 years of experience in Site Reliability Engineering, Software Engineering, Cloud Infrastructure, or related fields within a SaaS environment, including experience leading complex technical projects.

• Proven technical leadership in driving platform modernization in containerized and orchestrated environments, including Kubernetes or similar technologies.

• Practical experience in operating and supporting large-scale SaaS platforms on Microsoft Azure.

• Experience in developing automation and operational tools using Python, PowerShell, Bash, or similar scripting languages.

• Deep expertise in designing, operating, analyzing, and troubleshooting complex distributed systems across the application, infrastructure, networking, and operating system layers.

• Strong familiarity with modern observability platforms, including monitoring, logging, metrics, and distributed tracing.

• Proven success in leading incident response, Root Cause Analysis, and continuous improvement initiatives.

• Experience in establishing and enhancing Incident, Problem, and Change Management practices.

• Excellent written and verbal communication skills, with the ability to effectively convey technical concepts to engineering leadership and executive stakeholders.

• Experience with Git and GitHub-based development workflows.

🏝️ Benefits

• Flexible time off

• Comprehensive medical, dental, and vision plans

• Family planning benefits

• 401(k) retirement savings plan with company match

• Health savings account with company contributions

• Flexible spending account

• Life, accident, and disability coverage

• Business travel insurance

• Employee assistance programs

• Other well-being benefits

Lead Site Reliability Engineer
