This is a fully remote position, open to applicants in the United States.

📋 Description

• Act as the main technical authority for ensuring production reliability across customer environments in the U.S.

• Analyze and resolve intricate issues involving web applications, APIs, backend services, data pipelines, cloud infrastructure, and customer integrations.

• Lead incident response for production issues, coordinating with cross-functional teams to restore services while minimizing the impact on customers.

• Conduct root cause analyses and implement corrective actions to enhance long-term system stability and resilience.

• Collaborate with software engineering and platform teams to identify recurring reliability challenges and develop sustainable solutions.

• Design, configure, and validate secure connectivity solutions for customers, including Site-to-Site VPNs, Transit Gateway integrations, routing configurations, and secure network paths.

• Assist in customer onboarding by troubleshooting connectivity issues and ensuring consistent implementation procedures.

• Improve platform observability through enhancements in monitoring, logging, alerting, tracing, and operational dashboards.

• Contribute to CI/CD, infrastructure automation, and deployment processes that enhance release safety and operational consistency.

• Create operational tools that aid in incident response, troubleshooting, onboarding, and system monitoring activities.

• Work closely with engineering leadership to enhance cloud architecture, scalability, security, and operational readiness.

• Collaborate with customer-facing teams to communicate technical challenges, remediation strategies, and reliability enhancements clearly and effectively.

• Support initiatives related to compliance, security, and risk management within highly regulated healthcare environments.

⛳️ Requirements

• More than 6 years of hands-on experience supporting and managing AWS-based production environments.

• At least 4 years of experience supporting web applications and backend services (experience with Python/Django is strongly preferred).

• Proficient in AWS networking technologies such as VPCs, Site-to-Site VPNs, Transit Gateways, routing, NAT gateways, and security groups.

• Strong expertise in Terraform and infrastructure-as-code deployment methodologies.

• Experience with containerized environments, including ECS, Fargate, Kubernetes, or similar technologies.

• Proven experience in building and maintaining CI/CD pipelines and automation for release processes.

• Familiar with monitoring and observability tools like Datadog, CloudWatch, Sentry, Grafana, or similar platforms.

• Experienced in leading production incidents, managing outages, and conducting root cause analysis.

• Familiarity with Windows Server environments, Active Directory, Kerberos, and enterprise infrastructure concepts is preferred.

• Experience in healthcare technology, healthcare SaaS, clinical software, or other regulated industries is preferred.

• Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related technical field is preferred.

🏝️ Benefits

• Health Care Plan (Medical, Dental & Vision)

• Retirement Plan (401k, IRA)

• Paid Time Off (Vacation, Sick & Public Holidays)

Principal Site Reliability Engineer, SRE
