This is a fully remote position, open to applicants in Mexico.

📋 Description

• Design, implement, and maintain infrastructure on AWS and Azure using Terraform or Pulumi.

• Define Creai's multi-cloud strategy, ensuring that all infrastructure is reproducible, secure, and versioned.

• Create and operate robust, reusable continuous integration and delivery pipelines for all engineering teams, supporting application deployments and ML/AI models with automated testing, quality gates, and rollback strategies.

• Design, deploy, and manage production Kubernetes clusters (EKS/AKS).

• Manage namespaces, RBAC, network policies, Helm/Kustomize, and auto-scaling strategies for AI workloads.

• Build and maintain Creai's MLOps platform: training pipelines, model registration and versioning, deployment as scalable endpoints, and performance monitoring in production.

• Implement specialized infrastructure for generative AI workloads, including GPU resource management and RAG architectures.

• Be the primary driver of developer experience: create tools, templates, and abstractions that allow engineering and data science teams to focus on creating value without operational friction.

• Integrate security at all levels of the platform: secret management, IAM, encryption, and compliance with the principle of least privilege.

• Define and monitor SLAs/SLOs. Lead incident response and post-mortems.

• Design for high availability and disaster recovery.

• Implement comprehensive observability stacks (metrics, logs, and traces) with tools such as Prometheus, Grafana, Datadog, or OpenTelemetry, ensuring visibility into the status of all services and models in production.

• As the first member of the Platform team, build not only the infrastructure but also the culture, processes, and standards of the team.

• Actively influence architectural decisions across the organization and mentor future platform engineers.

• Occasionally participate in technical discussions with clients to define infrastructure requirements, present architectures, and ensure that platform solutions meet project expectations.

• Continuously evaluate and improve the platform stack, tools, processes, and operational practices, optimizing the efficiency and reliability of solutions.

• Communicate clearly and in a structured way with both technical and non-technical stakeholders, presenting architectural and infrastructure decisions in an accessible manner.

⛳️ Requirements

• Over 4 years of experience in Platform Engineering, DevOps, SRE, or Infrastructure Engineering roles, with direct responsibility for production infrastructure at scale.

• Solid, proven experience with AWS and Azure, including compute, networking, storage, identity (IAM/Entra ID), and managed Kubernetes (EKS/AKS).

• Proficiency in Terraform. Experience with remote state management, reusable modules, and IaC pipelines in CI/CD.

• Advanced experience designing and operating production Kubernetes clusters: RBAC, network policies, Helm, Kustomize, operators, and scaling strategies (HPA, VPA, Cluster Autoscaler).

• Experience designing complex CI/CD pipelines on platforms like GitHub Actions, GitLab CI, Azure DevOps, or Jenkins.

• Proficiency in Docker: building optimized images, multi-stage builds, and managing registries (ECR, ACR). Experience with vulnerability scanning (Trivy, Snyk).

• Experience implementing observability stacks with Prometheus, Grafana, Datadog, OpenTelemetry, or ELK/Loki.

• Strong scripting skills in Python and Bash for operational task automation and internal tool development.

• Proven ability to work independently, make complex technical decisions, and take ownership of end-to-end results in contexts of high ambiguity.

• Ability to explain infrastructure decisions to both technical and business audiences.

• Fluent communication in both Spanish and English, written and verbal.

• Experience with tools like MLflow, Kubeflow, Seldon Core, KServe, SageMaker Pipelines, or Azure ML Pipelines for ML model lifecycle management (Preferred).

• Experience managing GPU infrastructure (spot instances, scheduling) and deploying LLMs or embeddings in production (Preferred).

• Certifications in AWS (Solutions Architect, DevOps Engineer) or Azure (AZ-104, AZ-400) (Preferred).

• Experience with Istio, Linkerd, or Consul for traffic management, mTLS, and network observability (Preferred).

• Experience operating vector databases like Pinecone, Weaviate, or pgvector in production (Preferred).

🏝️ Benefits

• 100% remote work with hours aligned to CST.

• Unlimited PTO: We trust you to manage your time effectively.

• Annual development budget: Access to courses, certifications, and conferences.

• Equipment budget: Set up your ideal remote workspace.

• Health benefit: Access to private medical coverage or medical insurance subsidies.

• Growth opportunities: Career planning and mentorship with experts in AI and technology.

• Dynamic and flexible startup environment: Autonomy to make decisions and propose ideas, with a focus on results rather than hours worked.

• Work-life balance: A culture that prioritizes flexibility and well-being, allowing you to manage your time without sacrificing your personal life.

Platform Engineer
