
T3 Operations & Support Specialist – Compute & OS
Posted 2 days ago

This is a fully remote position, open to applicants in Germany.
• Delivering T3 operational leadership for Compute & OS services: managing complex incidents, troubleshooting, conducting root cause analysis (RCA), and driving sustainable solutions and preventive actions.
• Ensuring readiness of compute/OS for releases and modifications: overseeing monitoring/alerting frameworks, establishing performance baselines, enhancing security, developing patch strategies, rollback and recovery processes, and preparing runbooks.
• Executing and refining standard operational practices through automation to minimize manual effort and enhance Mean Time to Recovery (MTTR) and system stability.
• Collaborating with Kubernetes, Data, Network, and Storage Subject Matter Experts (SMEs) to address cross-domain production challenges.
• Assessing deployment artifacts from an operational standpoint and enforcing quality assurance protocols.
• Monitoring system health, performance indicators, and service availability across multi-tenant environments.
• Identifying, analyzing, and resolving incidents to reduce service interruptions while initiating RCA and corrective measures.
• Establishing monitoring and logging frameworks to meet audit and compliance prerequisites.
• Conducting regular security assessments and addressing identified vulnerabilities.
• 5 to 10+ years of experience in IT operations, service delivery, or platform operations.
• Demonstrated expertise in implementing and leading Incident, Problem, Change, and Release governance in a production environment.
• Practical experience with VMware vSphere 8 virtualization.
• Proficiency in Operating Systems: Red Hat Enterprise Linux and Ubuntu.
• Familiarity with OS tools: Satellite, IPA, Certificate Server.
• Experience with ITSM and collaboration tools: Jira Service Management, Jira, Confluence.
• Solid understanding of core operational processes (Incident, Change, Problem management, ITSM) and Site Reliability Engineering (SRE) principles.
• Experience in extracting operational insights from monitoring/observability, including management of SLI/SLA/SLO and performance tracking.
• Practical experience in documenting procedures and enforcing clear runbooks and playbooks.
• Hands-on experience with monitoring and logging tools (e.g., Prometheus, Grafana, Datadog, Mimir, Loki).
• Understanding of modern platform operations (Kubernetes/containers, automation, observability) sufficient to oversee specialists.
• Proficient in English and German (C1 level minimum in both languages).
• Flexible working hours.
• Autonomy in selecting projects.
• Opportunity to engage in exciting projects across various industries.
• Support for career advancement.
• Competitive compensation.
• Dedicated team available for assistance.
pathway solutions