This is a fully remote position, open to applicants in Virginia.

• Establish the strategy for Service Level Objectives (SLOs) and Error Budgets.

• Build telemetry pipelines that provide full-stack observability.

• Design and oversee the enterprise standards for Infrastructure as Code (IaC).

• Build custom tools to automate complex recovery processes and system scaling.

• Serve as the Incident Commander during major system outages, leading the technical response and managing the Root Cause Analysis (RCA) process.

• Lead the integration of security-as-code into DevSecOps pipelines, ensuring compliance with RMF and NIST SP 800-53 controls.

• Offer technical guidance and mentorship to Mid-Level SREs and developers, promoting a culture of reliability throughout the organization.

• Over 7 years of experience in SRE or DevOps, with a strong focus on distributed systems.

• Proficiency in Go, Python, or Java, along with advanced knowledge of Linux internals.

• Significant experience managing production Kubernetes environments and complex cloud architectures.

• Demonstrated ability to define and achieve SLOs for high-availability systems.

• Familiarity with government Risk Management Framework (RMF) processes.

• Education: Bachelor’s or Master’s degree in Computer Science or Engineering.

• Certifications: CKA (Certified Kubernetes Administrator) and an industry observability certification preferred.

• Competitive salary and performance-based bonuses.

• Comprehensive health, dental, and vision insurance.

• Flexible working hours and remote work options.

• Opportunities for professional development and continuous learning.

Senior Site Reliability Engineer
