
Site Reliability Engineer – Level 3
Posted Jun 21

This is a fully remote position, open to applicants in the United States.
• Deliver production support during assigned shifts as per the team on-call schedule.
• Address tickets raised by customers and the internal engineering/implementation team when not on-call for production support.
• System Monitoring and Maintenance: Consistently oversee the health and performance of our services, systems, and infrastructure.
• Process Automation: Create and sustain automation scripts and tools to enhance operations and minimize manual interventions.
• Incident Management: Help troubleshoot incidents, conduct root cause analysis, and implement long-term fixes to prevent recurrence.
• System Enhancements: Engage in the design and implementation of system enhancements to boost reliability, scalability, and performance.
• Team Collaboration: Collaborate closely with software engineers to grasp application requirements, provide design and architecture feedback, and facilitate deployment and release processes.
• Documentation: Develop and maintain documentation for processes, procedures, and troubleshooting guides to promote knowledge sharing within the team.
• Capacity Planning: Assist with capacity planning initiatives to predict future demands and ensure our infrastructure can accommodate growth.
• Security Compliance: Enforce and adhere to security best practices to safeguard our systems and data.
• Over 5 years of experience in site reliability engineering, system administration, or a comparable role.
• Strong understanding of Linux/Unix systems, networking, and cloud platforms (AWS, Azure, or Google Cloud).
• Proficient in scripting languages such as Python, Bash, or Ruby.
• Bachelor's or postgraduate degree in Computer Science, Information Technology, or a related discipline, or equivalent practical experience.
• Knowledge of AI/ML operations, including model lifecycle management, vector databases, and inference performance tuning.
• Expertise in Linux/Unix systems, networking, and cloud services (AWS, Azure, or Google Cloud).
• Skilled in scripting languages (Python, Bash, Ruby) and programming languages (Go, Java, C++).
• Advanced understanding of monitoring and logging tools (Elastic, Prometheus, Grafana, Splunk), configuration management tools (Ansible, Chef, Puppet), and CI/CD pipelines.
• Strong analytical and problem-solving abilities with a proven track record of diagnosing and resolving complex issues efficiently.
• Exceptional verbal and written communication skills, capable of conveying intricate technical concepts to non-technical stakeholders.
• Proven ability to lead and mentor a team, drive projects to fruition, and manage cross-functional initiatives.
• Relevant certifications such as AWS Certified DevOps Engineer, AWS Certified Machine Learning – Specialty, Google Cloud Professional DevOps Engineer, or similar are advantageous.
• Health insurance
• 401(k) matching
• Flexible work hours
• Paid time off
• Remote work options
Innovative Solutions