
Principal Site Reliability Engineer, SRE
Posted Jun 20

This is a fully remote position, open to applicants in the United States.
• Act as the main technical authority for ensuring production reliability across customer environments in the U.S.
• Analyze and resolve intricate issues involving web applications, APIs, backend services, data pipelines, cloud infrastructure, and customer integrations.
• Lead incident response for production issues, coordinating with cross-functional teams to restore services while minimizing the impact on customers.
• Conduct root cause analyses and implement corrective actions to enhance long-term system stability and resilience.
• Collaborate with software engineering and platform teams to identify recurring reliability challenges and develop sustainable solutions.
• Design, configure, and validate secure connectivity solutions for customers, including Site-to-Site VPNs, Transit Gateway integrations, routing configurations, and secure network paths.
• Assist in customer onboarding by troubleshooting connectivity issues and ensuring consistent implementation procedures.
• Improve platform observability through enhancements in monitoring, logging, alerting, tracing, and operational dashboards.
• Contribute to CI/CD, infrastructure automation, and deployment processes that enhance release safety and operational consistency.
• Create operational tools that aid in incident response, troubleshooting, onboarding, and system monitoring activities.
• Work closely with engineering leadership to enhance cloud architecture, scalability, security, and operational readiness.
• Collaborate with customer-facing teams to communicate technical challenges, remediation strategies, and reliability enhancements clearly and effectively.
• Support initiatives related to compliance, security, and risk management within highly regulated healthcare environments.
• Over 6 years of hands-on experience in supporting and managing AWS-based production environments.
• At least 4 years of experience in supporting web applications and backend services (experience with Python/Django is strongly preferred).
• Proficient in AWS networking technologies such as VPCs, Site-to-Site VPNs, Transit Gateways, routing, NAT gateways, and security groups.
• Strong expertise in Terraform and infrastructure-as-code deployment methodologies.
• Experience with containerized environments, including ECS, Fargate, Kubernetes, or similar technologies.
• Proven experience in building and maintaining CI/CD pipelines and automation for release processes.
• Familiar with monitoring and observability tools like Datadog, CloudWatch, Sentry, Grafana, or similar platforms.
• Experienced in leading production incidents, managing outages, and conducting root cause analysis.
• Familiarity with Windows Server environments, Active Directory, Kerberos, and enterprise infrastructure concepts is preferred.
• Experience in healthcare technology, healthcare SaaS, clinical software, or other regulated industries is preferred.
• Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related technical field is preferred.
• Health Care Plan (Medical, Dental & Vision)
• Retirement Plan (401(k), IRA)
• Paid Time Off (Vacation, Sick & Public Holidays)