
Platform Engineer
Posted May 20

This is a fully remote position, open to applicants in Mexico.
• Design, implement, and maintain infrastructure on AWS and Azure using Terraform or Pulumi.
• Define Creai's multi-cloud strategy, ensuring that all infrastructure is reproducible, secure, and versioned.
• Create and operate robust, reusable continuous integration and delivery pipelines for all engineering teams, supporting application deployments and ML/AI models with automated testing, quality gates, and rollback strategies.
• Design, deploy, and manage production Kubernetes clusters (EKS/AKS).
• Manage namespaces, RBAC, network policies, Helm/Kustomize, and auto-scaling strategies for AI workloads.
• Build and maintain Creai's MLOps platform: training pipelines, model registration and versioning, deployment as scalable endpoints, and performance monitoring in production.
• Implement specialized infrastructure for generative AI workloads, including GPU resource management and RAG architectures.
• Be the primary driver of developer experience: create tools, templates, and abstractions that allow engineering and data science teams to focus on creating value without operational friction.
• Integrate security at all levels of the platform: secret management, IAM, encryption, and compliance with the principle of least privilege.
• Define and monitor SLAs/SLOs. Lead incident response and post-mortems.
• Design for high availability and disaster recovery.
• Implement comprehensive observability stacks (metrics, logs, and traces) with tools such as Prometheus, Grafana, Datadog, or OpenTelemetry, ensuring visibility into the status of all services and models in production.
• As the first member of the Platform team, build not only the infrastructure but also the culture, processes, and standards of the team.
• Actively influence architectural decisions across the organization and mentor future platform engineers.
• Occasionally participate in technical discussions with clients to define infrastructure requirements, present architectures, and ensure that platform solutions meet project expectations.
• Continuously evaluate and improve the platform stack, tools, processes, and operational practices, optimizing the efficiency and reliability of solutions.
• Communicate clearly and in a structured way with both technical and non-technical stakeholders, presenting architectural and infrastructure decisions in an accessible manner.
• More than 4 years of experience in Platform Engineering, DevOps, SRE, or Infrastructure Engineering roles, with direct responsibility for production infrastructure at scale.
• Solid, proven experience with AWS and Azure, including compute, networking, storage, identity (IAM/Entra ID), and managed Kubernetes (EKS/AKS).
• Proficiency in Terraform. Experience with remote state management, reusable modules, and IaC pipelines in CI/CD.
• Advanced experience designing and operating production Kubernetes clusters: RBAC, network policies, Helm, Kustomize, operators, and scaling strategies (HPA, VPA, Cluster Autoscaler).
• Experience designing complex CI/CD pipelines on platforms like GitHub Actions, GitLab CI, Azure DevOps, or Jenkins.
• Proficiency in Docker: building optimized images, multi-stage builds, and managing registries (ECR, ACR). Experience with vulnerability scanning (Trivy, Snyk).
• Experience implementing observability stacks with Prometheus, Grafana, Datadog, OpenTelemetry, or ELK/Loki.
• Strong scripting skills in Python and Bash for operational task automation and internal tool development.
• Proven ability to work independently, make complex technical decisions, and take ownership of end-to-end results in contexts of high ambiguity.
• Ability to explain infrastructure decisions to both technical and business audiences.
• Fluent communication in both Spanish and English, written and verbal.
• Experience with tools like MLflow, Kubeflow, Seldon Core, KServe, SageMaker Pipelines, or Azure ML Pipelines for ML model lifecycle management (Preferred).
• Experience managing GPU infrastructure (spot instances, scheduling) and deploying LLMs or embeddings in production (Preferred).
• Certifications in AWS (Solutions Architect, DevOps Engineer) or Azure (AZ-104, AZ-400) (Preferred).
• Experience with Istio, Linkerd, or Consul for traffic management, mTLS, and network observability (Preferred).
• Experience operating vector databases like Pinecone, Weaviate, or pgvector in production (Preferred).
• 100% remote work with hours aligned to CST.
• Unlimited PTO: We trust you to manage your time effectively.
• Annual development budget: Access to courses, certifications, and conferences.
• Equipment budget: Set up your ideal remote workspace.
• Health benefits: Access to private medical coverage or medical insurance subsidies.
• Growth opportunities: Career planning and mentorship with experts in AI and technology.
• Dynamic and flexible startup environment: Autonomy to make decisions and propose ideas, with a focus on results rather than hours worked.
• Work-life balance: A culture that prioritizes flexibility and well-being, allowing you to manage your time without sacrificing your personal life.