
Lead SRE – Observability
Posted Jun 19

This is a fully remote position, open to applicants in Massachusetts.
• Develop and manage scalable observability and telemetry platforms that handle logs, metrics, traces, and events across production environments.
• Enhance monitoring, alerting, and instrumentation strategies to boost service visibility and operational insights.
• Collaborate with engineering teams to improve telemetry collection and overall observability.
• Create resilient, automated infrastructure and platform services that enhance reliability, scalability, and efficiency.
• Implement Infrastructure as Code and automation solutions to minimize toil and enhance consistency.
• Lead technical projects from architectural design to implementation, focusing on performance, reliability, security, and maintainability.
• Diagnose complex production issues related to distributed systems, Linux infrastructure, networking, cloud services, and telemetry pipelines.
• Engage in incident response and on-call duties.
• Promote operational excellence, conduct root cause analysis, and foster continuous improvement.
• Mentor engineers in SRE best practices, observability strategies, and the design of scalable systems.
• Contribute to the long-term strategy for platform reliability and to its ongoing improvement.
• Over 7 years of experience operating and engineering large-scale production infrastructure and distributed systems.
• In-depth expertise in Linux systems engineering, cloud infrastructure, and SRE methodologies.
• Demonstrated experience in designing and managing observability and telemetry platforms.
• Practical experience with tools such as OpenSearch/Elasticsearch, Kafka, Prometheus, Grafana, Vector, Fluentd, OpenTelemetry, ClickHouse, or similar technologies.
• Proficient in building Infrastructure as Code solutions using Terraform, CloudFormation, or equivalent tools.
• Strong automation and software engineering capabilities using Python, Golang, or Bash.
• Experience in troubleshooting large-scale distributed systems in production, focusing on availability, performance, scalability, and resilience.
• Experience in managing services within cloud-native environments, including AWS and containerized platforms.
• Comprehensive understanding of monitoring strategies, telemetry pipelines, incident response, root cause analysis, and operational excellence.
• Ability to communicate effectively across engineering teams and influence technical decision-making.
• Health and financial benefits.
• Tuition assistance.
• Employee resource groups.
• Collaborative workspaces.
• Flexible work-life balance.
LexisNexis