This is a fully remote position, open to applicants in the United States.

📋 Description

• Deliver production support during assigned shifts according to the team's on-call schedule.

• Address tickets raised by customers and the internal engineering/implementation team when not on-call for production support.

• System Monitoring and Maintenance: Consistently oversee the health and performance of our services, systems, and infrastructure.

• Process Automation: Create and sustain automation scripts and tools to enhance operations and minimize manual interventions.

• Incident Management: Assist in troubleshooting incidents, conducting root cause analysis, and implementing long-term solutions to prevent recurrence.

• System Enhancements: Engage in the design and implementation of system enhancements to boost reliability, scalability, and performance.

• Team Collaboration: Collaborate closely with software engineers to grasp application requirements, provide design and architecture feedback, and facilitate deployment and release processes.

• Documentation: Develop and maintain documentation for processes, procedures, and troubleshooting guides to promote knowledge sharing within the team.

• Capacity Planning: Assist with capacity planning initiatives to predict future demands and ensure our infrastructure can accommodate growth.

• Security Compliance: Enforce and adhere to security best practices to safeguard our systems and data.

⛳️ Requirements

• Over 5 years of experience in site reliability engineering, system administration, or a comparable role.

• Strong understanding of Linux/Unix systems, networking, and cloud platforms (AWS, Azure, or Google Cloud).

• Proficient in scripting languages such as Python, Bash, or Ruby.

• Bachelor's or postgraduate degree in Computer Science, Information Technology, or a related discipline, or equivalent practical experience.

• Knowledge of AI/ML operations, including model lifecycle management, vector databases, and inference performance tuning.

• Expertise in Linux/Unix systems, networking, and cloud services (AWS, Azure, or Google Cloud).

• Skilled in scripting languages (Python, Bash, Ruby) and programming languages (Go, Java, C++).

• Advanced understanding of monitoring and logging tools (Elastic, Prometheus, Grafana, Splunk), configuration management tools (Ansible, Chef, Puppet), and CI/CD pipelines.

• Strong analytical and problem-solving abilities with a proven track record of diagnosing and resolving complex issues efficiently.

• Exceptional verbal and written communication skills, capable of conveying intricate technical concepts to non-technical stakeholders.

• Proven ability to lead and mentor a team, drive projects to fruition, and manage cross-functional initiatives.

• Relevant certifications such as AWS Certified DevOps Engineer, AWS Certified Machine Learning – Specialty, Google Cloud Professional DevOps Engineer, or similar are advantageous.

🏝️ Benefits

• Health insurance

• 401(k) matching

• Flexible work hours

• Paid time off

• Remote work options

Site Reliability Engineer – Level 3
