
Senior AI Platform Engineer
Posted 10 hours ago

This is a fully remote position, open to applicants in the United States.
• Design and oversee comprehensive machine learning training pipelines on AWS (SageMaker, EKS, Step Functions) to guarantee consistent and reproducible model development and deployment.
• Develop and sustain infrastructure for production agentic applications utilizing Amazon Bedrock and Bedrock AgentCore — encompassing agent runtimes, memory, secure gateways, and large-scale observability.
• Participate in the architectural advancement of our ML platform, including assessing MLOps tools and engaging in buy vs. build evaluations.
• Apply AI/ML governance best practices for model versioning, testing, validation, maintenance, and security.
• Align MLOps best practices with Expel's SDLC, security, and infrastructure standards, collaborating with SRE, Platform Engineering, and Security teams.
• Enhance quality, reliability, and scalability through strategic engineering and monitoring.
• Collaborate with data scientists, software engineers, and stakeholders to ensure the reliable and scalable operationalization of ML models.
• Guide and assist junior engineers; promote a culture of engineering excellence.
• Develop and maintain documentation, internal tools, and enablement resources to empower practitioners across Expel in working effectively with ML systems.
• Keep abreast of the MLOps landscape and bring relevant innovations back to the team.
• A minimum of 5 years of relevant software engineering experience with a significant emphasis on ML operations and infrastructure.
• A degree in Computer Science, Mathematics, Statistics, Engineering, or a related technical field is preferred (or a compelling narrative).
• Proficiency in Python; familiarity with additional languages (Go, JS) is advantageous.
• Extensive experience with CI/CD pipelines, infrastructure-as-code, and containerization tailored for ML workloads.
• Practical experience with cloud-based ML platforms — AWS (SageMaker, Bedrock, Bedrock AgentCore) is strongly preferred; experience with GCP (Vertex AI) is also appreciated.
• Demonstrated experience in operationalizing LLMs and constructing infrastructure for intricate agentic applications — including agent orchestration, memory, tool calling, and RAG architectures.
• Familiarity with ML frameworks such as Scikit-Learn, PyTorch, Spark, and TensorFlow.
• Knowledge of continuous retraining, concept drift monitoring, and data drift detection in production environments.
• Unlimited PTO (which leadership actively models and encourages).
• Up to 24 weeks of parental leave.
• Excellent health benefits.
• Monthly stipends for fitness and cell phone expenses — no receipts needed.
• Professional development support, including conference benefits and ongoing learning opportunities.
• Full remote flexibility — work from wherever you perform best.