This is a fully remote position, open to applicants in California.

📋 Description

• Establish and implement the long-term strategic vision for Infrastructure as Code (IaC), the evolution of CI/CD, and cloud-native architecture to accommodate TrueML's scaling requirements.

• Spearhead the design and deployment of self-service internal platforms aimed at alleviating developer cognitive load, enabling feature teams to deploy and manage services with minimal friction and enhanced velocity.

• Serve as the primary decision-maker for cloud expenditure (AWS); drive cost-optimization initiatives and lead negotiations with third-party vendors, including those supplying the DevOps toolstack.

• Ensure that the infrastructure architecture meets stringent High Availability (HA) standards and follows robust Disaster Recovery (DR) protocols, maintaining system integrity across regions.

• Supervise the implementation and advancement of extensive monitoring, logging, and distributed tracing systems, utilizing AIOps to transition from reactive to predictive system maintenance.

• Advocate for security by design by incorporating automated vulnerability scanning, secret management, and compliance checks directly into the automated build pipelines.

• Act as the primary escalation point for significant production outages, facilitating blameless post-mortem reviews focused on systemic enhancements rather than personal mistakes.

• Maintain up-to-date technical expertise in container orchestration (Kubernetes), serverless patterns, and contemporary automation frameworks to offer valuable mentorship and architectural guidance to senior engineering personnel.

⛳️ Requirements

• Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.

• More than 10 years of experience in DevOps, Site Reliability Engineering (SRE), or Software Engineering, with over 5 years in a managerial role overseeing engineers.

• Expert-level proficiency with AWS and experience managing multi-region, high-availability deployments.

• Advanced knowledge of Kubernetes (K8s) and Docker, including cluster management, networking, and scaling within a production environment.

• Proficiency in Terraform to ensure consistency and automation across all infrastructure layers; experience with Atlantis is a plus.

• Extensive experience in designing and maintaining complex CI/CD pipelines (GitHub Actions, GitLab CI, or Jenkins) and expertise in scripting languages such as Python, Go, or Bash.

• Practical experience with modern monitoring, observability, and tracing stacks (Datadog, Observe), along with a solid understanding of SRE principles (SLIs/SLOs/Error Budgets).

• Experience serving as an Incident Commander during high-severity outages and promoting a "blameless" post-mortem culture.

• Proven ability to influence executive leadership and collaborate effectively across Product, Engineering, and Security teams.

• Experience in integrating AI-assisted productivity tools (Cline, GitHub Copilot) into the engineering workflow to expedite delivery.

🏝️ Benefits

• Competitive salary and performance-based bonuses.

• Comprehensive health, dental, and vision insurance.

• Flexible work environment with options for remote work.

• Opportunities for professional development and continuous learning.

• Generous vacation and paid time off policies.

Senior Manager, DevOps

