
Senior MLOps Engineer
Posted May 24

This is a fully remote position, open to applicants in Brazil.
• Implement and adhere to established Standard Operating Procedures (SOPs) for GenAI and agent-based solutions in production environments.
• Oversee platform health, model performance, and inference pipelines.
• Maintain the stability and accessibility of AI services across all environments.
• Investigate and resolve incidents by analyzing logs, traces, and metrics.
• Perform root cause analysis (RCA) and document the findings.
• Use observability tools (logs, metrics, tracing) to identify anomalies and performance issues.
• Contribute to improving SOPs and runbooks.
• Assist in the runtime operations of LLM-based applications and agent-driven workflows.
• Monitor inference performance, including latency, throughput, and costs.
• Proven experience with MLOps, ML systems, or AI platform operations.
• Strong troubleshooting skills using logs and observability tools.
• Familiarity with cloud environments such as Azure, AWS, or GCP.
• Understanding of ML pipelines, APIs, and distributed systems.
• Experience with monitoring tools such as Datadog, Prometheus, Grafana, or Azure Monitor.
• Health insurance.
• Flexible working hours.
• Professional development opportunities.