This is a fully remote position, open to applicants in the United States.

📋 Description

• Design and oversee comprehensive machine learning training pipelines on AWS (SageMaker, EKS, Step Functions) to guarantee consistent and reproducible model development and deployment.

• Develop and sustain infrastructure for production agentic applications utilizing Amazon Bedrock and Bedrock AgentCore — encompassing agent runtimes, memory, secure gateways, and large-scale observability.

• Participate in the architectural advancement of our ML platform, including assessing MLOps tools and engaging in buy vs. build evaluations.

• Apply AI/ML governance best practices for model versioning, testing, validation, maintenance, and security.

• Align MLOps best practices with Expel's SDLC, security, and infrastructure standards, collaborating with SRE, Platform Engineering, and Security teams.

• Enhance quality, reliability, and scalability through strategic engineering and monitoring.

• Collaborate with data scientists, software engineers, and stakeholders to ensure the reliable and scalable operationalization of ML models.

• Guide and assist junior engineers; promote a culture of engineering excellence.

• Develop and maintain documentation, internal tools, and enablement resources to empower practitioners across Expel in working effectively with ML systems.

• Keep abreast of the MLOps landscape and bring relevant innovations back to the team.

⛳️ Requirements

• A minimum of 5 years of relevant software engineering experience with a significant emphasis on ML operations and infrastructure.

• A degree in Computer Science, Mathematics, Statistics, Engineering, or a related technical field is preferred (or a compelling narrative).

• Proficiency in Python; familiarity with additional languages (Go, JS) is advantageous.

• Extensive experience with CI/CD pipelines, infrastructure-as-code, and containerization tailored for ML workloads.

• Practical experience with cloud-based ML platforms — AWS (SageMaker, Bedrock, Bedrock AgentCore) is strongly preferred; experience with GCP (Vertex AI) is also appreciated.

• Demonstrated experience in operationalizing LLMs and constructing infrastructure for intricate agentic applications — including agent orchestration, memory, tool calling, and RAG architectures.

• Familiarity with ML frameworks such as Scikit-Learn, PyTorch, Spark, and TensorFlow.

• Knowledge of continuous retraining, concept drift monitoring, and data drift detection in production environments.

🏝️ Benefits

• Unlimited PTO (which leadership actively models and encourages).

• Up to 24 weeks of parental leave.

• Excellent health benefits.

• Monthly stipends for fitness and cell phone expenses — no receipts needed.

• Professional development support, including conference benefits and ongoing learning opportunities.

• Full remote flexibility — work from wherever you perform best.

Senior AI Platform Engineer
