
Site Reliability Engineer
Posted 7 hours ago

This is a fully remote position, open to applicants in Pennsylvania.
• Schedule and execute large-scale batch workloads across Kubernetes clusters.
• Diagnose and troubleshoot job failures for clients.
• Collaborate with teams across the organization to understand workload requirements and improve platform capabilities.
• Enhance the reliability and speed of our systems and processes by increasing automation.
• Document procedures to create a detailed library of runbooks, serving as a knowledge base and foundation for automation.
• Participate in an on-call rotation to maintain the SLOs and SLAs of production services.
• Contribute to platform tooling, automation, and CI/CD workflows.
• A solid understanding of Linux operating system internals, TCP/IP networking, and storage subsystems.
• Extensive experience with Kubernetes and container orchestration in production-grade environments.
• An understanding of engineering design constraints and the ability to advise teams on scaling their services to meet performance goals within budget.
• Strong experience in implementing and troubleshooting cloud-native and open-source tools like Kubernetes, etcd, Prometheus, and OpenTelemetry.
• Excellent communication skills and the ability to work effectively in a diverse, distributed team.
• We are proud to be an equal opportunity workplace.
• We believe that diverse teams produce the best ideas and outcomes.
• We are committed to fostering a culture of inclusion, entrepreneurship, and innovation across gender, race, age, sexual orientation, religion, disability, and identity.