Remotery

Lead SRE – Observability

at athenahealth · US (Massachusetts) · Full-time · Senior · $143k – $243k/year

Posted Jun 19

This is a fully remote position, open to applicants in Massachusetts.

📋 Description

• Develop and manage scalable observability and telemetry platforms that handle logs, metrics, traces, and events across production environments.

• Enhance monitoring, alerting, and instrumentation strategies to boost service visibility and operational insights.

• Collaborate with engineering teams to improve telemetry collection and overall observability.

• Create resilient, automated infrastructure and platform services that enhance reliability, scalability, and efficiency.

• Implement Infrastructure as Code and automation solutions to minimize toil and enhance consistency.

• Lead technical projects from architectural design to implementation, focusing on performance, reliability, security, and maintainability.

• Diagnose complex production issues related to distributed systems, Linux infrastructure, networking, cloud services, and telemetry pipelines.

• Engage in incident response and on-call duties.

• Promote operational excellence, conduct root cause analysis, and foster continuous improvement.

• Mentor engineers in SRE best practices, observability strategies, and the design of scalable systems.

• Contribute to the long-term strategy for platform reliability and to its ongoing improvement.


⛳️ Requirements

• Over 7 years of experience in operating and engineering large-scale production infrastructure and distributed systems.

• In-depth expertise in Linux systems engineering, cloud infrastructure, and SRE methodologies.

• Demonstrated experience in designing and managing observability and telemetry platforms.

• Practical experience with tools such as OpenSearch/Elasticsearch, Kafka, Prometheus, Grafana, Vector, Fluentd, OpenTelemetry, ClickHouse, or similar technologies.

• Proficient in building Infrastructure as Code solutions using Terraform, CloudFormation, or equivalent tools.

• Strong automation and software engineering capabilities using Python, Golang, or Bash.

• Experience in troubleshooting large-scale distributed systems in production, focusing on availability, performance, scalability, and resilience.

• Experience in managing services within cloud-native environments, including AWS and containerized platforms.

• Comprehensive understanding of monitoring strategies, telemetry pipelines, incident response, root cause analysis, and operational excellence.

• Ability to communicate effectively across engineering teams and influence technical decision-making.


🏝️ Benefits

• Health and financial benefits.

• Tuition assistance.

• Employee resource groups.

• Collaborative workspaces.

• Flexible work-life balance.

