This is a fully remote position, open to applicants in the United States.

📋 Description

• Serve as a senior technical escalation resource during production incidents.

• Oversee real-time incident assessment, mitigation, and recovery initiatives.

• Lead root cause analysis (RCA) efforts, emphasizing systemic and long-term solutions.

• Recognize recurring failure patterns and advocate for architectural or operational enhancements.

• Collaborate with Customer Success and Engineering to manage customer impact during incidents.

• Take responsibility for post-launch reliability, stability, and operational integrity of core systems.

• Investigate and address complex field issues and production defects.

• Ensure that fixes developed during incidents or customer escalations are integrated into the core product.

• Enhance operational readiness of services through improved runbooks, monitoring, and alerting.

• Minimize operational overhead by automating repetitive manual tasks.

• Engage directly with strategic customers to tackle real-world, production-level technical challenges.

• Support intricate deployments, integrations, and escalations within customer environments.

• Act as a trusted technical advisor to customers during high-impact situations.

• Translate insights from customers into tangible product, platform, and operational enhancements.

• Develop production-quality code to:

  • Automate operational workflows.

  • Enhance reliability and observability.

  • Reduce manual intervention and lower incident frequency.

• Primarily contribute in Python, with occasional exposure to JVM-based systems as necessary.

• Review code with a strong focus on operability, resilience, and maintainability.

• Advocate for engineering practices that prioritize operational functionality.

• Provide technical leadership without formal authority, shaping design and operational decisions.

• Mentor engineers through pairing, code reviews, and incident leadership.

• Work closely with Product, Engineering, Data, and Customer teams.

• Perform effectively in high-pressure, ambiguous situations, particularly during customer-impacting incidents.

⛳️ Requirements

• Over 10 years of experience in software engineering, Site Reliability Engineering (SRE), sustaining engineering, or production operations.

• Extensive hands-on experience managing production systems in AWS.

• Strong background in troubleshooting Databricks and large-scale data platforms.

• Proficient in Python with experience in developing production services or tooling.

• Solid understanding of:

  • Distributed systems.

  • Incident management and RCA methodologies.

  • Monitoring, alerting, and observability practices.

  • CI/CD pipelines using Infrastructure as Code.

• Demonstrated ability to take ownership of problems from detection to permanent resolution.

• Exceptional communication skills, particularly during incidents and customer escalations.

• Ability to trace issues from customer impact to root cause across systems and codebases, delivering solutions in environments with limited documentation.

• Strong awareness of operational risks, with a capacity to identify failure modes and strengthen systems proactively before they affect customers.

🏝️ Benefits

• Unlimited paid time off – recharge when you need it.

• Work from anywhere – flexibility to fit your lifestyle.

• Comprehensive health coverage – multiple plan options available.

• Equity for every employee – share in our success.

• Growth-focused environment – your development is important here.

• Home office setup allowance – one-time support to get you started.

• Monthly cell phone allowance – stay connected with ease.

Senior Sustaining, Forward Deployed Engineer
