
Platform Engineer
Posted 8 hours ago
This is a fully remote position, open to applicants in the United States.
Responsibilities
• Take ownership of availability, latency, and performance objectives for AI platform services and data infrastructure hosted on AWS.
• Design and establish monitoring, alerting, and observability frameworks throughout the platform stack.
• Lead incident response, conduct root cause analysis, and facilitate post-mortem activities for platform-level outages or performance degradations.
• Define and monitor SLOs/SLAs for essential platform components including RAG pipelines, agent orchestration services, and model access layers.
• Proactively identify reliability risks and lead engineering improvements that address them before they escalate into production issues.
• Create and maintain runbooks, disaster recovery plans, and operational documentation.
• Design, build, and sustain CI/CD pipelines for AI platform components, data pipelines, and internal applications.
• Own the team's infrastructure-as-code (IaC) practices, using tools such as Terraform or AWS CDK.
• Manage and optimize AWS environments encompassing ECS, Lambda, S3, RDS, Redshift, API Gateway, and associated services.
• Implement and uphold security, compliance, and cost optimization best practices across AWS infrastructure.
• Automate deployment, scaling, and configuration management to minimize manual operational overhead.
• Collaborate with AI Platform Engineers to containerize and operationalize AI services and agent frameworks.
• Assist Data & AI Engineers with environment management, access controls, and deployment tools for Polaris and data pipeline infrastructure.
• Act as the operational backbone for the AI platform team, ensuring that engineers can deploy and iterate swiftly without being hindered by infrastructure challenges.
• Contribute to our "factory model" vision by transforming deployment and reliability into a repeatable, scalable capability instead of an ad hoc process.
Qualifications
• 3+ years of professional experience in a DevOps, SRE, or platform engineering position.
• Hands-on AWS experience is essential, including AgentCore, Bedrock, ECS, Lambda, S3, RDS, Redshift, CloudWatch, IAM, VPC, and related services.
• Proficiency with infrastructure-as-code tools such as Terraform or AWS CDK.
• Strong CI/CD background with tools like GitHub Actions.
• Familiarity with containerization and orchestration technologies (Docker, ECS, or Kubernetes).
• Knowledge of AI/ML infrastructure patterns such as model serving, vector databases, and pipeline orchestration (strongly preferred).
• Experience with observability and monitoring tools (Datadog, CloudWatch).
• Previous experience in a SaaS environment.
• Excellent verbal and written communication skills, with the ability to collaborate with both technical and non-technical stakeholders.
• Self-motivated with a proactive mindset to identify and address infrastructure risks before they affect delivery.
• Open to exploring and responsibly adopting AI tools to boost productivity and innovation in your role.
Benefits
• Competitive health plans.
• Paid time off.
• Company-paid holidays.
• 401(k) retirement program with a company-matched contribution.
• Additional company-sponsored programs.