
Principal MLOps Engineer
Posted 13 hours ago

This is a fully remote position, open to applicants in Pennsylvania.
Responsibilities:
• Lead the design of operational architecture, deployment strategies, and reliability engineering for integrating AI into high-stakes Healthcare Information Systems (HIS).
• Establish enterprise operational standards, oversee release processes, and develop the robust infrastructure necessary for maintaining models in critical clinical settings.
• Architect and manage the entire release process, creating enterprise checklists, automated approval gates, release notes, and standards for deployment readiness.
• Set the deployment execution standards for AI promotion across all environments and ensure that customer deployments comply with stringent internal production protocols.
• Design and supervise the enterprise model registry, guaranteeing seamless integration with CI/CD pipelines and comprehensive version control traceability.
• Define and implement monitoring standards, establishing essential SLAs/SLOs, service health metrics, and detailed dashboards throughout the AI ecosystem.
• Create automated checks for input/output data quality and model drift, ensuring early detection of system performance issues.
• Lead the production incident process, including rigorous triage workflows, severity-based escalation paths, postmortems, rollback mechanisms, and recovery infrastructure.
• Collaborate with Platform teams to provide critical ATO (Authority to Operate) and compliance support, ensuring complete traceability of deployments and stringent operational controls.
• Manage comprehensive operational reporting, delivering status updates to leadership regarding production systems, pre-production testing, customer rollouts, and incident metrics.
• Cultivate a culture of production discipline, mentoring junior engineers in maintaining operational runbooks and reliable deployment pipelines.
Qualifications:
• Bachelor's Degree or higher in Computer Science, Software Engineering, or a related technical discipline.
• More than 10 years of experience in software engineering, including at least 6 years deploying and maintaining large-scale ML systems in production environments.
• Expert-level proficiency with cloud providers (AWS/GCP/Azure) and orchestration tools (Kubernetes, Kubeflow, or Airflow).
• Advanced expertise in Python and Java/Go (or similar languages).
• Extensive knowledge of backend frameworks, microservices, and system design patterns.
• In-depth understanding of monitoring stacks (Prometheus, Grafana, Datadog) and experience establishing enterprise SLAs/SLOs for AI services.
• Demonstrated success in designing automated deployment pipelines, managing complex rollback processes, and enforcing model registry governance at scale.
Benefits:
• Medical
• Dental & Vision
• Health Savings Accounts
• Health Care & Dependent Care Flexible Spending Accounts
• Disability Benefits
• Life Insurance
• Voluntary Benefits
• Paid Absences
• Retirement Benefits