
Site Reliability Engineer
Posted Jun 21

Posted Jun 21
This is a fully remote position, open to applicants in United States.
• Take charge of the migration from Heroku to Google Cloud Platform, ensuring architecture, execution, and a seamless cutover that meets expectations.
• Develop and maintain the Postgres core, Fivetran pipeline, BigQuery data layer, and Hex reporting infrastructure.
• Enhance the critical paths that are most impactful: optimize key backend code and the most demanding third-party syncs to maintain performance as volume increases.
• Manage monitoring, alerting, cost reduction, and proactive scaling: identify issues early, control expenditures, and anticipate growth instead of merely responding to it.
• Lead incident responses and create post-mortems that transform outages into lasting solutions and a more knowledgeable team.
• Establish high operational standards across engineering and elevate others to meet them.
• Ownership of production reliability: Proven history of personally ensuring production reliability at a significant scale with concrete examples of incidents you managed, resolved, and prevented from reoccurring, rather than just participating. This is a core responsibility, not an ancillary one.
• Experience with infrastructure migrations: Genuine experience managing a cloud migration from start to finish, not just contributing to one. Proficient in GCP (or a similar cloud platform), infrastructure-as-code, and the failure modes associated with distributed systems.
• Expertise in observability and proactive operations: You create monitoring and alerting systems that detect issues before users are affected. You know what to measure, what to alert on, and what is just background noise.
• High agency: You identify the most impactful reliability challenges and take the initiative to address them without waiting for assignment. You don’t wait for outages to validate your efforts.
• Integration of AI in your workflows: Specific instances demonstrating how AI has enhanced your debugging, automation, or operational processes for increased speed and reliability.
• Lead the migration from Heroku to Google Cloud Platform.
• Create and sustain the Postgres core, Fivetran pipeline, BigQuery data layer, and Hex reporting infrastructure.
• Optimize the most critical paths: enhance key backend code and our most substantial third-party syncs to ensure performance as volume increases.
• Manage monitoring, alerting, cost reduction, and proactive scaling: identify issues early, manage costs effectively, and stay ahead of growth rather than simply reacting.
• Lead incident response and develop post-mortems that convert outages into lasting solutions and a more adept team.
• Set high operational standards across the engineering team and assist others in achieving them.
Innovative Solutions
Caspar Health
IVIX
Investigo
Get handpicked remote jobs straight to your inbox weekly.