
AI Operations Engineer
Posted May 21

This is a fully remote position, open to applicants in Latin America.
• Design and oversee CI/CD pipelines for the training, packaging, and deployment of ML models throughout our microservices architecture.
• Manage containerized applications on AWS ECS, focusing on optimizing cost, latency, and availability.
• Automate the provisioning of infrastructure and configuration of services using Terraform.
• Maintain and scale services that rely on third-party LLM providers.
• Construct and enhance data pipelines that deliver data from BigQuery, S3, and DynamoDB into training and inference workflows.
• Integrate services with observability tools (Datadog, OpenTelemetry, Langfuse) and establish Service Level Objectives (SLOs) for model-serving endpoints.
• Collaborate with ML engineers to deploy new models to production using BentoML, FastAPI, and containerized serving solutions.
• 2-3 years of MLOps experience supporting ML/AI features, systems, and workflows, plus 3-4 years of prior experience in DevOps, CloudOps, or Site Reliability Engineering (SRE).
• Strong expertise in Python programming.
• Practical experience with Docker containerization and container orchestration.
• Comprehensive understanding of CI/CD processes for ML workflows within an enterprise production context.
• Familiarity with Infrastructure as Code, with a preference for Terraform.
• Knowledge of cloud platforms, particularly AWS (ECS, ECR, S3, DynamoDB, CloudWatch) and GCP (BigQuery, Vertex AI).
• Experience with LLM integration and observability tools (OpenAI API, Google GenAI, Langfuse tracing).
• Proven track record in building and maintaining data pipelines for ML training and feature engineering.
• Understanding of ML modeling workflows, including training, evaluation, experiment tracking (e.g., MLflow, Weights & Biases), and model versioning.
• Experience monitoring for and detecting model drift over time.
• Exposure to NLP/NLU models and frameworks like Hugging Face Transformers, spaCy, or sentence-transformers.
• Familiarity with vector databases (LanceDB, FAISS) and systems for embedding-based retrieval.
• Experience in scaling and maintaining deep learning frameworks (TensorFlow, PyTorch) in production environments.
• Knowledge of classical ML libraries (scikit-learn, XGBoost, LightGBM) and tools for model explainability (SHAP).
• Working knowledge of ML serving frameworks such as BentoML or similar solutions.
• Comfortable using FastAPI or comparable asynchronous Python web frameworks.
• Competitive salary
• Professional development opportunities