
Senior SRE
Posted Jun 20

This is a fully remote position, open to applicants in Kentucky.
• Design, implement, and sustain automation for infrastructure provisioning, configuration management, and application deployments across different environments (both on-premise and cloud).
• Actively monitor system health, performance, and availability using a variety of observability tools while defining key performance indicators (KPIs) and service level objectives (SLOs).
• Lead the analysis and resolution of intricate production incidents, conduct root cause analysis, and establish preventative measures to reduce future occurrences.
• Collaborate with development teams to ensure that software is engineered for reliability, scalability, and operational efficiency, participating in architectural reviews and providing expert advice.
• Create and maintain comprehensive incident response protocols, runbooks, and disaster recovery strategies.
• Contribute to the advancement of our SRE practices, tools, and standards, fostering continuous improvement and knowledge sharing within the team.
• Engage in an on-call rotation to provide 24/7 support for critical production systems.
• Mentor junior SREs and assist in the growth and development of the team.
• Assess and implement new technologies and solutions to improve system reliability and operational efficiency.
• Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
• Over 5 years of experience in a Site Reliability Engineering, DevOps, or closely related infrastructure engineering position.
• Strong proficiency in at least one scripting/programming language (such as Python, Go, Java, Ruby, or Bash).
• Extensive experience with cloud platforms (AWS, Azure, GCP) including services related to compute, networking, storage, and databases.
• In-depth understanding of Linux operating systems and networking fundamentals.
• Proven experience with infrastructure as code tools (like Terraform, CloudFormation, or Ansible).
• Solid background in CI/CD pipelines and related tools (such as Jenkins, GitLab CI, or GitHub Actions).
• Demonstrated expertise in monitoring and alerting systems (for example, Prometheus, Grafana, Datadog, or Splunk).
• Strong problem-solving abilities with a structured approach to diagnosing complex distributed systems.
• Excellent communication and teamwork skills, with the capability to work effectively across cross-functional teams.
• Experience with containerization technologies (Docker, Kubernetes) is highly desirable.
• Familiarity with database technologies (both relational and NoSQL) and their operational challenges.
• Competitive total rewards (base salary + bonus, if applicable).
• Customizable benefits package (3 medical plans with Health Savings Account company match).
• Generous paid time off starting with 3 weeks + 13 paid holidays, including 2 personal floating holidays.
• Flexible time off for exempt team members + 13 paid holidays.
• Paid parental leave (including maternity + paternity leave).
• Education assistance opportunities and free LinkedIn Learning access.
• Free mental health and family planning programs, including adoption assistance and fertility support.
• 401(k) program with company match.
• Pet insurance.
• Employee resource groups.