
Site Reliability Engineer
Posted May 10

This is a fully remote position, open to applicants in California.
Responsibilities
• Develop, configure, and deploy code in Go and JavaScript to improve service reliability for both existing and new systems; set the standard for code quality.
• Operate within the Google Cloud Platform (GCP) infrastructure, optimizing both performance and costs while scaling resources to accommodate demand.
• Review code and production changes, giving constructive, actionable feedback.
• Lead the repair and optimization of intricate systems, taking into account a broad range of contributing factors.
• Direct the debugging, troubleshooting, and analysis of service architecture and design.
• Engage in an on-call rotation.
• Create documentation: design specifications, system analysis, runbooks, and playbooks. Provide design feedback and enhance the design skills of colleagues.
• Implement and manage SRE monitoring application backends using Go, Postgres, and OpenTelemetry. Develop tooling with Terraform and other Infrastructure as Code (IaC) tools to ensure visibility and proactive issue detection across platforms.
• Collaborate with development teams to improve system reliability and performance, applying a platform engineering mindset to system administration responsibilities.
• Develop and sustain automated solutions for operational tasks such as on-call monitoring, performance optimization, and disaster recovery.
• Diagnose and resolve issues in our development, testing, and production environments.
• Engage in postmortem analyses and formulate preventive measures for future incidents.
• Enforce and maintain security best practices across our infrastructure, ensuring adherence to industry standards and internal policies. Participate in security audits and vulnerability assessments.
• Take part in capacity planning and forecasting initiatives to ensure our systems are prepared for future growth and demand. Analyze trends and provide recommendations for resource allocation.
• Identify and rectify performance bottlenecks through code profiling, system analysis, and configuration tuning. Implement and monitor performance metrics to proactively detect and resolve issues.
• Create, maintain, and test disaster recovery plans and procedures to guarantee business continuity in the event of major outages or disasters. Participate in routine disaster recovery exercises.
• Contribute to internal knowledge bases and documentation.
Qualifications
• Bachelor's degree in Computer Science, Engineering, Mathematics, or equivalent professional experience.
• Over 3 years of experience as an SRE, Software Engineer, DevOps Engineer, or in a similar capacity.
• Strong programming skills in Go and scripting languages, along with a solid understanding of software development best practices.
• Proficient with monitoring and observability tools such as OpenTelemetry or Dynatrace.
• Skilled with cloud services, particularly Kubernetes; Google Cloud Platform (GCP) experience is highly preferred.
• Experience working with relational and document databases.
• Ability to debug and optimize code and to automate routine tasks.
• Strong analytical abilities and the capacity to perform under pressure in a dynamic environment.
• Exceptional verbal and written communication skills.
Benefits
• Immediate access to medical, dental, vision, and prescription drug coverage.
• Flexible family care days, paid parental leave, new parent ramp-up programs, subsidized backup childcare, and more.
• Family-building benefits, including reimbursement for adoption and surrogacy expenses, fertility treatments, and additional support.
• Vehicle discount program available for employees and their family members, along with management leases.
• Tuition assistance.
• Established and active employee resource groups.
• Paid time off for both individual and team community service initiatives.
• A generous schedule of paid holidays, including the week between Christmas and New Year's Day.
• Paid time off with the option to purchase additional vacation days.