This is a fully remote position, open to applicants in Latin America.

📋 Description

• Design and oversee CI/CD pipelines for the training, packaging, and deployment of ML models throughout our microservices architecture.

• Manage containerized applications on AWS ECS, focusing on optimizing cost, latency, and availability.

• Automate the provisioning of infrastructure and configuration of services using Terraform.

• Maintain and scale services that rely on third-party LLM providers.

• Construct and enhance data pipelines that deliver data from BigQuery, S3, and DynamoDB into training and inference workflows.

• Integrate services with observability tools (Datadog, OpenTelemetry, Langfuse) and establish Service Level Objectives (SLOs) for model-serving endpoints.

• Collaborate with ML engineers to put new models into production using BentoML, FastAPI, and containerized serving solutions.

⛳️ Requirements

• 2-3 years of MLOps experience supporting ML/AI features, systems, and workflows, plus 3-4 years of prior experience in DevOps, CloudOps, or Site Reliability Engineering (SRE).

• Strong expertise in Python programming.

• Practical experience with Docker containerization and container orchestration.

• Comprehensive understanding of CI/CD processes for ML workflows within an enterprise production context.

• Familiarity with Infrastructure as Code, with a preference for Terraform.

• Knowledge of cloud platforms, particularly AWS (ECS, ECR, S3, DynamoDB, CloudWatch) and GCP (BigQuery, Vertex AI).

• Experience with LLM integration and observability tools (OpenAI API, Google GenAI, Langfuse tracing).

• Proven track record in building and maintaining data pipelines for ML training and feature engineering.

• Understanding of ML modeling workflows, including training, evaluation, experiment tracking (e.g., MLflow, Weights & Biases), and model versioning.

• Experience monitoring for and detecting model drift over time.

• Exposure to NLP/NLU models and frameworks like Hugging Face Transformers, spaCy, or sentence-transformers.

• Familiarity with vector databases (LanceDB, FAISS) and systems for embedding-based retrieval.

• Experience in scaling and maintaining deep learning frameworks (TensorFlow, PyTorch) in production environments.

• Knowledge of classical ML libraries (scikit-learn, XGBoost, LightGBM) and tools for model explainability (SHAP).

• Working knowledge of ML serving frameworks such as BentoML or similar solutions.

• Comfortable working with FastAPI or comparable asynchronous Python web frameworks.

🏝️ Benefits

• Competitive salary

• Professional development opportunities

AI Operations Engineer
