
Senior Sustaining, Forward Deployed Engineer
Posted 3 hours ago

This is a fully remote position open to applicants in the United States.
• Serve as a senior technical escalation resource during production incidents.
• Oversee real-time incident assessment, mitigation, and recovery initiatives.
• Lead root cause analysis (RCA) efforts, emphasizing systemic and long-term solutions.
• Recognize recurring failure patterns and advocate for architectural or operational enhancements.
• Collaborate with Customer Success and Engineering to manage customer impact during incidents.
• Take responsibility for post-launch reliability, stability, and operational integrity of core systems.
• Investigate and address complex field issues and production defects.
• Ensure that fixes developed during incidents or customer escalations are integrated into the core product.
• Enhance operational readiness of services through improved runbooks, monitoring, and alerting.
• Minimize operational overhead by automating repetitive manual tasks.
• Engage directly with strategic customers to tackle real-world, production-level technical challenges.
• Support intricate deployments, integrations, and escalations within customer environments.
• Act as a trusted technical advisor to customers during high-impact situations.
• Translate insights from customers into tangible product, platform, and operational enhancements.
• Develop production-quality code to:
  • Automate operational workflows.
  • Enhance reliability and observability.
  • Reduce manual intervention and lower incident frequency.
• Primarily contribute in Python, with occasional exposure to JVM-based systems as necessary.
• Review code with a strong focus on operability, resilience, and maintainability.
• Advocate for engineering practices that prioritize operational functionality.
• Provide technical leadership without formal authority, shaping design and operational decisions.
• Mentor engineers through pairing, code reviews, and incident leadership.
• Work closely with Product, Engineering, Data, and Customer teams.
• Perform effectively in high-pressure, ambiguous situations, particularly during customer-impacting incidents.
• Over 10 years of experience in software engineering, Site Reliability Engineering (SRE), sustaining engineering, or production operations.
• Extensive hands-on experience managing production systems in AWS.
• Strong background in troubleshooting Databricks and large-scale data platforms.
• Proficient in Python with experience in developing production services or tooling.
• Solid understanding of:
  • Distributed systems.
  • Incident management and RCA methodologies.
  • Monitoring, alerting, and observability practices.
  • CI/CD pipelines built with Infrastructure as Code.
• Demonstrated ability to take ownership of problems from detection to permanent resolution.
• Exceptional communication skills, particularly during incidents and customer escalations.
• Ability to trace issues from customer impact to root cause across systems and codebases, delivering solutions in environments with limited documentation.
• Strong awareness of operational risks, with a capacity to identify failure modes and strengthen systems proactively before they affect customers.
• Unlimited paid time off – recharge when you need it.
• Work from anywhere – flexibility to fit your lifestyle.
• Comprehensive health coverage – multiple plan options available.
• Equity for every employee – share in our success.
• Growth-focused environment – your development is important here.
• Home office setup allowance – one-time support to get you started.
• Monthly cell phone allowance – stay connected with ease.