
Staff Platform Engineer
Posted 6 days ago

This is a fully remote position, open to applicants in Mexico.
• Create foundational patterns and guidelines for EarnIn's development, assessment, monitoring, and deployment of AI agents in production environments.
• Manage agent governance, encompassing model selection, evaluation frameworks, safety protocols, and production observability.
• Implement infrastructure-as-code best practices for agentic systems, ensuring that prompts, tools, and evaluation criteria are version-controlled, reviewed, and tested with the same rigor as any other essential component.
• Act as the architect for agentic cloud infrastructure, establishing best practices for production AI agents.
• Mentor senior engineers on advanced agentic patterns, LLM integration, and production prompt engineering.
• Lead cross-functional projects with engineering, product, security, and business teams to ensure alignment of agentic AI adoption with company goals.
• Oversee large-scale, high-availability distributed systems on AWS, addressing critical performance, scalability, and stability challenges.
• Utilize AI-driven observability and anomaly detection to forecast potential failures.
• Spearhead the advancement of infrastructure-as-code and automation standards, integrating agentic pattern recognition and automated remediation into operations.
• Influence the development of our developer control plane (Cortex) as an AI-enhanced self-service platform where engineers engage with intelligent assistants.
• Advance AI-driven golden paths that encode platform standards, security policies, and best practices.
• Serve as a liaison between cloud operations, AI infrastructure, and business stakeholders.
• Produce documentation on agentic architecture, best practices, and operational procedures.
• Participate in and lead on-call rotations, using post-mortem analyses as feedback loops to improve system reliability and agentic automation.
• Bachelor's or Master's degree in Computer Science, Engineering, or a related discipline.
• Over 7 years of experience in cloud infrastructure, overseeing large-scale, high-availability customer-facing distributed systems.
• Demonstrated experience in mentoring senior engineers and spearheading company-wide platform initiatives across various teams and functions.
• Proven track record in architecting and scaling AI-driven systems in production, designing multi-step agentic workflows that autonomously carry out complex operational tasks.
• History of reducing high-friction operational workflows through agentic AI, achieving measurable decreases in toil and enhanced platform leverage (e.g., LLM-powered incident diagnosis, intelligent CI/CD with test selection and deployment risk scoring, self-service assistants).
• Advanced proficiency in AWS (EKS, Lambda, Bedrock, etc.) with extensive knowledge of containerized and serverless architectures.
• Strong expertise in Kubernetes at scale, capable of guiding the implementation of intricate, resilient solutions.
• In-depth understanding of infrastructure-as-code tools (Terraform, Ansible) and ability to lead initiatives that merge traditional IaC and agentic automation.
• Mastery of Datadog and advanced observability practices, facilitating metrics-driven decision-making and agentic automation. Experience in developing AI-driven alerting and root-cause analysis systems is a plus.
• Strong commitment to security, privacy, and compliance best practices, with the ability to oversee governance for production AI systems (model safety, prompt injection prevention, data isolation).
• Familiarity with LLM orchestration frameworks (LangChain, LlamaIndex, CrewAI, or custom agentic architectures) and large-scale production prompt engineering.
• Strong coding skills in Python and/or Go, with the ability to guide teams in treating infrastructure and agentic systems as software.
• Proven capability to drive cross-functional initiatives across engineering, product, security, and business, translating between technical intricacies and business outcomes.
• Experience utilizing AI-assisted development tools (e.g., GitHub Copilot, Cursor, ChatGPT, or similar tools) as part of your software development workflow.
• Experience with service mesh (Linkerd, Istio) and large-scale traffic management is a plus.
• Proficiency in GitOps (Argo CD, Flux CD) and CI/CD orchestration (GitHub Actions, Argo Workflows) is a plus.
• Familiarity with MLOps or LLMOps concepts (model versioning, evaluation frameworks, production monitoring for AI systems) is a plus.
• Knowledge of security frameworks relevant to AI systems (e.g., guardrails, audit logging, and data governance for LLMs) is a plus.
• Healthcare
• Internet and cell phone reimbursement
• Learning and development stipend
• Potential opportunities to travel to our Mountain View headquarters