
Principal AIOps Engineer
Posted Jun 21

This is a fully remote position, open to applicants in Pennsylvania.
• Spearhead the AIOps strategy, roadmap, and operational model to significantly reduce MTTR and improve alert quality and overall operational efficiency.
• Take ownership of the observability-to-AIOps pipeline and promote the standardization of telemetry, service health models, and actionable alerting.
• Design and execute event intelligence initiatives, including correlation, deduplication, suppression, anomaly detection, incident clustering, and probable-cause analysis.
• Provide guidance to operations, service owners, and leadership stakeholders; facilitate change enablement, adoption, and value assessment for AIOps.
• Develop AIOps integrations centered around ServiceNow, including event ingestion, alert-to-incident policies, enrichment, and assignment/routing.
• Establish governance for operational AI in collaboration with security, compliance, and operations teams.
• Construct and operationalize agentic AI workflows to assist with incident triage and resolution.
• Enable closed-loop automation and self-healing by linking AIOps detections to orchestrated actions.
• Collaborate with NOC/SOC, infrastructure, and application owners to facilitate the onboarding of services into AIOps.
• Produce enablement materials and mentor teams on AIOps methodologies, agentic AI application, and responsible automation practices.
• Over 10 years of experience in SRE and production operations supporting highly available services.
• Demonstrated technical leadership: ability to set direction, lead cross-team initiatives, and guide stakeholders through architecture assessments.
• Strong programming/scripting skills (Python preferred) and experience building automation, integrations, and APIs.
• Experience in integrating observability platforms and event sources within hybrid environments (cloud/on-prem) while managing production-grade monitoring/event management at scale.
• Strong familiarity with ServiceNow as an ITSM system of record.
• Ability to build and manage integrations at scale (REST, webhooks, event management) to facilitate automation and ensure auditability.
• Expertise in Automation & Integration Engineering: Python (preferred) for automation and data/ML pipelines.
• Experience in developing integrations, services, and operational tools.
• Knowledge of AIOps, ITSM/ITOM (ServiceNow), and the agentic AI ecosystem, along with observability tools such as Prometheus/Grafana, OpenTelemetry, and ELK/Splunk/Datadog (or equivalents).
• Strong fundamentals in Linux and networking (TCP/IP, DNS, TLS, load balancing) with the ability to troubleshoot distributed systems end to end.
• Excellent communication skills.
• Medical, dental, and vision coverage.
• Paid time off.
• Retirement savings options.
• Wellness programs.
• Additional resources based on eligibility.