This is a fully remote position, open to applicants in Poland.

• Build and maintain a centralized monitoring and alerting framework for AI applications and their pipelines.

• Define and implement Service Level Indicators (SLIs), alerts, and operational dashboards.

• Lead incident management, including triage, coordination, root cause analysis, and preventive measures.

• Standardize telemetry across various systems, focusing on latency, throughput, and failure metrics.

• Enhance Continuous Integration and Continuous Deployment (CI/CD) pipelines by introducing quality gates to ensure reliability.

• Collaborate closely with engineering teams to minimize recurring issues and enhance overall system stability.

• At least 5 years of experience in Site Reliability Engineering (SRE), Platform Engineering, or Production Engineering.

• Extensive hands-on experience with Kubernetes in production settings.

• Proficiency with Azure and Azure DevOps.

• Familiarity with monitoring tools such as Datadog.

• Strong knowledge of incident management processes and root cause analysis.

• Ability to design and build effective monitoring and alerting systems.

• Nice to have: Experience with AI or large language model (LLM) pipelines.

• Nice to have: Experience building monitoring platforms that span multiple systems.

• Nice to have: Familiarity with Grafana.

• Nice to have: Experience in large-scale or distributed environments.

• Competitive salary.

• Opportunity to work in a multinational setting on international projects.

• Comprehensive healthcare coverage.

• Long-term B2B contract with a stable pipeline of projects.

• Fully remote work model.

Site Reliability Engineer – AI
