
Senior Platform Engineer – AI Agent Infrastructure
Posted May 19

This is a fully remote position, open to applicants in Argentina.
• Creating event-driven communication strategies.
• Enhancing the reliability of streaming services.
• Developing observability tools for the platform.
• Leading architectural decision-making processes.
• Managing cloud infrastructure and automation using Infrastructure as Code (IaC).
• Establishing systems for monitoring, tracing, and alerting.
• Expertise in event-driven architecture and messaging systems: you've designed solutions using message queues (Kafka, NATS, RabbitMQ, or similar). You have a strong understanding of at-least-once delivery, consumer groups, dead-letter queues, and backpressure, and ideally you have migrated a system from synchronous to asynchronous messaging.
• Proficient in AWS: extensive experience with EC2, VPC, IAM, S3, and RDS. You have a solid grasp of networking principles, since inter-service communication runs over the internal VPC.
• Database skills — strong knowledge of both SQL (PostgreSQL) and NoSQL (MongoDB, Redis). You know when to apply each type, as well as indexing strategies, replication, and performance optimization techniques.
• Docker: familiarity with container lifecycle management, resource limits, health checks, bind mounts, and multi-stage builds.
• Experience in debugging distributed systems — you’ve troubleshot asynchronous flows and cascading failures across production services, and can articulate what went wrong and how you resolved it.
• Infrastructure as Code — proficiency with Terraform or Pulumi. You advocate for infrastructure changes to be reviewed in pull requests rather than through console clicks.
• Observability expertise — fluency in Datadog or equivalent tools (dashboards, monitors, APM, log pipelines, distributed tracing).
• Familiarity with a tech stack including Go, AWS (EC2, S3, VPC, RDS PostgreSQL), Docker, PostgreSQL, MongoDB, Redis, and Datadog.
• Experience with AI / MLOps infrastructure — managing AI workloads in production (model serving, LLM inference, GPU/resource management, agent evaluation, and tools like LangFuse, LangSmith, Braintrust, MLflow).
• Knowledge of multi-tenant container platforms — experience with services that run customer/user workloads in containers (Replit, Railway, Fly.io, or internal PaaS systems).
• Kubernetes — you have successfully migrated from "Docker on bare EC2" to Kubernetes at least once and are aware of potential issues that can arise during this transition.
• Experience with data pipelines and orchestration — familiarity with tools such as Airflow, Prefect, or similar. Knowledge of data warehouses (Databricks, Snowflake, BigQuery) is a plus.
• Competitive Compensation.
• Remote Work – You can work from anywhere!
• Home Office Bonus – A one-time allowance to assist you in creating your perfect home office setup.
• Provision of Work Equipment.
• Stock Options.
• Comprehensive Health Plan available wherever you are.
• Flexible Days Off.
• Opportunities for Language, Professional, and Personal Growth courses.