This is a fully remote position, open to applicants in the United States.

📋 Description

• Take ownership of availability, latency, and performance objectives for AI platform services and data infrastructure hosted on AWS.

• Design and establish monitoring, alerting, and observability frameworks throughout the platform stack.

• Lead incident response, conduct root cause analysis, and facilitate post-mortem activities for platform-level outages or performance degradations.

• Define and monitor SLOs/SLAs for essential platform components including RAG pipelines, agent orchestration services, and model access layers.

• Proactively detect reliability risks and spearhead engineering enhancements before they escalate into production issues.

• Create and maintain runbooks, disaster recovery plans, and operational documentation.

• Design, build, and sustain CI/CD pipelines for AI platform components, data pipelines, and internal applications.

• Oversee infrastructure-as-code (IaC) practices within the team utilizing tools like Terraform or AWS CDK.

• Manage and optimize AWS environments encompassing ECS, Lambda, S3, RDS, Redshift, API Gateway, and associated services.

• Implement and uphold security, compliance, and cost optimization best practices across AWS infrastructure.

• Automate deployment, scaling, and configuration management to minimize manual operational overhead.

• Collaborate with AI Platform Engineers to containerize and operationalize AI services and agent frameworks.

• Assist Data & AI Engineers with environment management, access controls, and deployment tools for Polaris and data pipeline infrastructure.

• Act as the operational backbone for the AI platform team, ensuring that engineers can deploy and iterate swiftly without being hindered by infrastructure challenges.

• Contribute to our "factory model" vision by transforming deployment and reliability into a repeatable, scalable capability instead of an ad hoc process.

⛳️ Requirements

• 3+ years of professional experience in a DevOps, SRE, or platform engineering position.

• Hands-on experience with AWS is essential – including AgentCore, Bedrock, ECS, Lambda, S3, RDS, Redshift, CloudWatch, IAM, VPC, and related services.

• Proficiency with infrastructure-as-code tools such as Terraform or AWS CDK.

• Strong CI/CD background with tools like GitHub Actions.

• Familiarity with containerization and orchestration technologies (Docker, ECS, or Kubernetes).

• Knowledge of AI/ML infrastructure patterns – model serving, vector databases, pipeline orchestration (strongly preferred).

• Experience with observability and monitoring tools (Datadog, CloudWatch).

• Previous experience in a SaaS environment.

• Excellent verbal and written communication skills, with the ability to collaborate with both technical and non-technical stakeholders.

• Self-motivated, with a proactive approach to identifying and addressing infrastructure risks before they affect delivery.

• Open to exploring and responsibly adopting AI tools to boost productivity and innovation in your role.

🏝️ Benefits

• Competitive health plans.

• Paid time off.

• Company-paid holidays.

• 401(k) retirement program with company-matched contributions.

• Additional company-sponsored programs.

Platform Engineer

