This is a fully remote position, open to applicants in Pennsylvania.

📋 Description

• Lead the design of operational architecture, deployment strategies, and reliability engineering for the integration of AI within high-stakes Healthcare Information Systems (HIS).

• Establish enterprise operational standards, oversee release processes, and develop the robust infrastructure necessary for maintaining models in critical clinical settings.

• Architect and manage the entire release process, creating enterprise checklists, automated approval gates, release notes, and standards for deployment readiness.

• Set the deployment execution standards for AI promotion across all environments and ensure that customer deployments comply with stringent internal production protocols.

• Design and supervise the enterprise model registry, guaranteeing seamless integration with CI/CD pipelines and comprehensive version control traceability.

• Define and implement monitoring standards, establishing essential SLAs/SLOs, service health metrics, and detailed dashboards throughout the AI ecosystem.

• Create automated checks for input/output data quality and model drift, ensuring early detection of system performance issues.

• Lead the production incident process, including rigorous triage workflows, severity-based escalation paths, postmortems, rollback mechanisms, and recovery infrastructure.

• Collaborate with Platform teams to provide critical ATO (Authority to Operate) and compliance support, ensuring complete traceability of deployments and stringent operational controls.

• Manage comprehensive operational reporting, delivering status updates to leadership regarding production systems, pre-production testing, customer rollouts, and incident metrics.

• Cultivate a culture of production discipline, mentoring junior engineers in maintaining operational runbooks and reliable deployment pipelines.

⛳️ Requirements

• Bachelor's Degree or higher in Computer Science, Software Engineering, or a related technical discipline.

• Over 10 years of experience in software engineering, with a minimum of 6 years focused on deploying and maintaining large-scale ML systems in production environments.

• Expert-level proficiency with cloud providers (AWS/GCP/Azure) and orchestration tools (Kubernetes, Kubeflow, or Airflow).

• Advanced expertise in Python and Java/Go (or similar languages).

• Extensive knowledge of backend frameworks, microservices, and system design patterns.

• In-depth understanding of monitoring stacks (Prometheus, Grafana, Datadog) and establishing enterprise SLAs/SLOs for AI services.

• Demonstrated success in designing automated deployment pipelines, managing complex rollback processes, and enforcing model registry governance at scale.

🏝️ Benefits

• Medical

• Dental & Vision

• Health Savings Accounts

• Health Care & Dependent Care Flexible Spending Accounts

• Disability Benefits

• Life Insurance

• Voluntary Benefits

• Paid Absences

• Retirement Benefits

Principal MLOps Engineer
