
Site Reliability Engineer – AI
Posted May 30

This is a fully remote position, open to applicants in Poland.
• Develop and sustain a centralized monitoring and alerting framework for AI applications and their pipelines.
• Define and implement Service Level Indicators (SLIs), alerts, and operational dashboards.
• Oversee incident management, which includes triage, coordination, root cause analysis, and implementing preventive measures.
• Standardize telemetry across various systems, focusing on latency, throughput, and failure metrics.
• Enhance Continuous Integration and Continuous Deployment (CI/CD) pipelines by introducing quality gates to ensure reliability.
• Collaborate closely with engineering teams to minimize recurring issues and enhance overall system stability.
• At least 5 years of experience in Site Reliability Engineering (SRE), Platform Engineering, or Production Engineering.
• Extensive hands-on experience with Kubernetes in production settings.
• Proficiency with Azure and Azure DevOps.
• Familiarity with monitoring tools such as Datadog.
• Strong knowledge of incident management processes and root cause analysis.
• Ability to design and build effective monitoring and alerting systems.
• Nice to have: Experience with AI or large language model (LLM) pipelines.
• Nice to have: Experience building monitoring platforms that span multiple systems.
• Nice to have: Familiarity with Grafana.
• Nice to have: Experience in large-scale or distributed environments.
• Competitive salary.
• Opportunity to work in a multinational setting on international projects.
• Comprehensive healthcare coverage.
• Long-term B2B contract with a stable pipeline of projects.
• Fully remote work model.