Remotery

Site Reliability Engineer

at RunPod · United States · Full-time · Mid-level, Senior · $150k – $200k/year

Posted 2 days ago

This is a fully remote position, open to applicants in the United States.

📋 Description

• Enhance platform availability while minimizing the frequency and duration of incidents.

• Develop and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) across various services.

• Optimize Mean Time to Recovery (MTTR) through improved tools, automation, and comprehensive runbooks.

• Reinforce production readiness standards.

• Drive long-term, systemic reliability improvements.

• Establish and execute SLIs/SLOs for essential services.

• Oversee incident response efforts and coordinate mitigation strategies across teams.

• Conduct blameless post-incident reviews and ensure that corrective measures are executed.

• Perform assessments of production readiness for new services and features.

• Identify systemic vulnerabilities and promote preventative enhancements.

• Design and refine monitoring, alerting, and dashboard solutions (using tools like Prometheus, Grafana, etc.).

• Enhance the signal-to-noise ratio in alerts to alleviate alert fatigue.

• Develop internal tools for tracking and reporting on reliability.

• Increase visibility into GPU performance and the health of distributed systems.

• Automate recurring operational tasks.

• Create tools and scripts (in Python, Go, Bash) to eliminate manual work.

• Enhance deployment safety through automation and protective measures.

• Strengthen CI/CD processes and release reliability.

• Collaborate with engineering teams to bolster system resilience.

• Offer insights on fault tolerance, scalability, and failure management.

• Participate in architectural discussions with a focus on reliability.


⛳️ Requirements

• Over 5 years of experience in Site Reliability Engineering (SRE), Reliability Engineering, or Production Engineering.

• Strong expertise in Linux systems and networking.

• Experience in managing containerized production environments.

• In-depth understanding of distributed systems and their failure modes.

• Proven experience in defining and managing SLIs/SLOs.

• Demonstrated leadership in incident response and post-incident reviews.

• Strong skills in scripting or programming.

• Familiarity with monitoring and alerting systems.

• Excellent written communication abilities.

• Successful completion of a background check.


🏝️ Benefits

• Significant equity in a rapidly growing company: every team member receives stock options, so you share in our success as we grow.

• Comprehensive medical, dental, and vision plans.

• Flexible Paid Time Off (PTO)—take the time you need to rejuvenate.

• Most positions are remote-first, fostering an inclusive and collaborative environment, with Slack as our primary mode of internal communication.

• Join a dedicated team at the forefront of AI infrastructure, where culture, learning, and ownership are central to our scaling efforts.
