This is a fully remote position, open to applicants in Massachusetts.

📋 Description

• Develop and manage scalable observability and telemetry platforms that handle logs, metrics, traces, and events across production environments.

• Enhance monitoring, alerting, and instrumentation strategies to boost service visibility and operational insights.

• Collaborate with engineering teams to improve telemetry collection and overall observability.

• Create resilient, automated infrastructure and platform services that enhance reliability, scalability, and efficiency.

• Implement Infrastructure as Code and automation solutions to minimize toil and enhance consistency.

• Lead technical projects from architectural design to implementation, focusing on performance, reliability, security, and maintainability.

• Diagnose complex production issues related to distributed systems, Linux infrastructure, networking, cloud services, and telemetry pipelines.

• Engage in incident response and on-call duties.

• Promote operational excellence, conduct root cause analysis, and foster continuous improvement.

• Mentor engineers in SRE best practices, observability strategies, and the design of scalable systems.

• Contribute to the long-term strategy for platform reliability and to its ongoing improvement.

⛳️ Requirements

• More than 7 years of experience operating and engineering large-scale production infrastructure and distributed systems.

• In-depth expertise in Linux systems engineering, cloud infrastructure, and SRE methodologies.

• Demonstrated experience in designing and managing observability and telemetry platforms.

• Practical experience with tools such as OpenSearch/Elasticsearch, Kafka, Prometheus, Grafana, Vector, Fluentd, OpenTelemetry, ClickHouse, or similar technologies.

• Proficiency in building Infrastructure as Code solutions using Terraform, CloudFormation, or equivalent tools.

• Strong automation and software engineering skills in Python, Golang, or Bash.

• Experience in troubleshooting large-scale distributed systems in production, focusing on availability, performance, scalability, and resilience.

• Experience in managing services within cloud-native environments, including AWS and containerized platforms.

• Comprehensive understanding of monitoring strategies, telemetry pipelines, incident response, root cause analysis, and operational excellence.

• Ability to communicate effectively across engineering teams and influence technical decision-making.

🏝️ Benefits

• Health and financial benefits.

• Tuition assistance.

• Employee resource groups.

• Collaborative workspaces.

• Flexibility that supports work-life balance.

Lead SRE – Observability
