This is a fully remote position, open to applicants in the United States.

📋 Description

• Ensure the stability and reliability of Epic's GCP infrastructure by establishing and monitoring SLOs/SLIs, minimizing toil, and eliminating recurring instability sources.

• Design and manage Epic's GCP infrastructure to achieve high availability, scalability, and cost-effectiveness.

• Oversee and enhance our Docker and GKE container platform, focusing on workload scheduling, autoscaling, networking, and seamless failure management.

• Sustain and optimize CI/CD pipelines that facilitate rapid, secure, and low-risk delivery across engineering teams.

• Take ownership of the observability stack—metrics, logs, traces, dashboards, and alerts—ensuring that signals are actionable, noise is minimized, and on-call personnel have the necessary context to resolve issues swiftly.

• Write and maintain Terraform configurations to codify infrastructure across the organization, prioritizing consistency, change safety, and reproducibility.

• Engage in capacity planning, cost optimization, and architectural reviews with a strong emphasis on reliability.

• Advocate for platform security best practices, encompassing secrets management, IAM policies, and network segmentation.

• Assist in compliance-oriented infrastructure practices—vulnerability management, access reviews, audit-evidence flows, and incident-response readiness—as we advance our SOC 2 and student-data compliance initiatives.

• Collaborate with data engineering to oversee the orchestration platform and its supporting infrastructure—deployment, scaling, reliability, and observability.

• Work closely with backend and data engineers to diagnose service and platform issues.

• Set an example by participating in a regular on-call rotation; lead incident response, conduct blameless post-mortems, and ensure follow-through that transforms one-time outages into lasting reliability enhancements.

• Offer guidance to developers on infrastructure-related concerns and best practices.

⛳️ Requirements

• A Bachelor's degree or higher in Computer Science, Software Engineering, or a related discipline.

• Over 5 years of experience in infrastructure, platform, DevOps, or a similar engineering role.

• Practical experience with GCP (GCE, GCS, VPC, IAM, Cloud Monitoring, and associated services).

• Familiarity with Docker and Kubernetes (GKE)—including containerizing workloads, deploying to GKE, Helm, and cluster fundamentals.

• Experience with CI/CD pipelines (GitHub Actions, ArgoCD, Jenkins, or equivalent).

• Proficiency with an observability platform such as New Relic (metrics, logging, alerting, dashboards).

• Expertise in Terraform for managing infrastructure as code.

• Scripting/programming capabilities in Python, Bash, or similar languages.

• Willingness to participate in a regular production on-call rotation.

• Proven track record of significantly enhancing the reliability of production systems—e.g., establishing SLOs, decreasing incident frequency or MTTR, and eliminating recurrent failure modes.

• Strong problem-solving abilities, a sense of ownership, and the capacity to operate effectively within dynamic systems.

• Proficiency in English for daily collaboration and technical documentation.

• Proficiency in Mandarin Chinese for effective collaboration with global engineering and business teams.

🏝️ Benefits

• Competitive salary and performance-based bonuses.

• Comprehensive health, dental, and vision insurance.

• Generous paid time off and flexible work arrangements.

• Opportunities for professional development and continuous learning.

• Supportive work environment fostering collaboration and innovation.

Senior Software Engineer, Infrastructure

