This is a fully remote position, open to applicants in Mexico.

📋 Description

• Create foundational patterns and guidelines for how EarnIn develops, evaluates, monitors, and deploys AI agents in production environments.

• Manage agent governance, encompassing model selection, evaluation frameworks, safety protocols, and production observability.

• Implement infrastructure-as-code best practices for agentic systems, ensuring that prompts, tools, and evaluation criteria are version-controlled, reviewed, and tested with the same rigor as any other essential component.

• Act as the architect for agentic cloud infrastructure, establishing best practices for production AI agents.

• Mentor senior engineers on advanced agentic patterns, LLM integration, and production prompt engineering.

• Lead cross-functional projects with engineering, product, security, and business teams to ensure alignment of agentic AI adoption with company goals.

• Oversee large-scale, high-availability distributed systems on AWS, addressing critical performance, scalability, and stability challenges.

• Use AI-driven observability and anomaly detection to predict potential failures.

• Spearhead the advancement of infrastructure-as-code and automation standards, integrating agentic pattern recognition and automated remediation into operations.

• Influence the development of our developer control plane (Cortex) as an AI-enhanced self-service platform where engineers engage with intelligent assistants.

• Champion AI-driven golden paths that encapsulate platform standards, security policies, and best practices.

• Serve as a liaison between cloud operations, AI infrastructure, and business stakeholders.

• Produce documentation on agentic architecture, best practices, and operational procedures.

• Participate in and lead on-call rotations, using post-mortem analyses as feedback loops for improving system reliability and agentic automation.

⛳️ Requirements

• Bachelor's or Master's degree in Computer Science, Engineering, or a related discipline.

• More than 7 years of experience in cloud infrastructure, including overseeing large-scale, high-availability, customer-facing distributed systems.

• Demonstrated experience in mentoring senior engineers and spearheading company-wide platform initiatives across various teams and functions.

• Proven track record in architecting and scaling AI-driven systems in production, designing multi-step agentic workflows that autonomously carry out complex operational tasks.

• Track record of streamlining high-friction operational workflows through agentic AI, with measurable reductions in toil and increased platform leverage (e.g., LLM-powered incident diagnosis, intelligent CI/CD with test selection and deployment risk scoring, self-service assistants).

• Advanced proficiency in AWS (EKS, Lambda, Bedrock, etc.) with extensive knowledge of containerized and serverless architectures.

• Strong expertise in Kubernetes at scale, capable of guiding the implementation of intricate, resilient solutions.

• In-depth understanding of infrastructure-as-code tools (Terraform, Ansible) and the ability to lead initiatives that merge traditional IaC with agentic automation.

• Mastery of Datadog and advanced observability practices, facilitating metrics-driven decision-making and agentic automation. Experience in developing AI-driven alerting and root-cause analysis systems is a plus.

• Strong commitment to security, privacy, and compliance best practices, with the ability to oversee governance for production AI systems (model safety, prompt injection prevention, data isolation).

• Familiarity with LLM orchestration frameworks (LangChain, LlamaIndex, CrewAI, or custom agentic architectures) and large-scale production prompt engineering.

• Strong coding skills in Python and/or Go, with the ability to guide teams in treating infrastructure and agentic systems as software.

• Proven capability to drive cross-functional initiatives across engineering, product, security, and business, translating between technical intricacies and business outcomes.

• Experience using AI-assisted development tools (e.g., GitHub Copilot, Cursor, ChatGPT) as part of your software development workflow.

• Experience with service mesh (Linkerd, Istio) and large-scale traffic management is a plus.

• Proficiency in GitOps (Argo CD, Flux CD) and CI/CD orchestration (GitHub Actions, Argo Workflows) is a plus.

• Familiarity with MLOps or LLMOps concepts (model versioning, evaluation frameworks, production monitoring for AI systems) is a plus.

• Knowledge of security frameworks relevant to AI systems (e.g., guardrails, audit logging, and data governance for LLMs) is a plus.

🏝️ Benefits

• Healthcare

• Internet and cell phone reimbursement

• Learning and development stipend

• Potential opportunities to travel to our Mountain View headquarters

Staff Platform Engineer
