
Senior Software Engineer, Infrastructure
Posted 8 hours ago

This is a fully remote position, open to applicants in the United States.
• Ensure the stability and reliability of Epic's GCP infrastructure by establishing and monitoring SLOs/SLIs, minimizing toil, and eliminating recurring instability sources.
• Design and manage Epic's GCP infrastructure to achieve high availability, scalability, and cost-effectiveness.
• Oversee and enhance our Docker and GKE container platform, focusing on workload scheduling, autoscaling, networking, and seamless failure management.
• Sustain and optimize CI/CD pipelines that facilitate rapid, secure, and low-risk delivery across engineering teams.
• Take ownership of the observability stack—metrics, logs, traces, dashboards, and alerts—ensuring that signals are actionable, noise is minimized, and on-call personnel have the necessary context to resolve issues swiftly.
• Write and maintain Terraform configurations to codify infrastructure throughout the organization, prioritizing consistency, change safety, and reproducibility.
• Engage in capacity planning, cost optimization, and architectural reviews with a strong emphasis on reliability.
• Advocate for platform security best practices, encompassing secrets management, IAM policies, and network segmentation.
• Assist in compliance-oriented infrastructure practices—vulnerability management, access reviews, audit-evidence flows, and incident-response readiness—as we advance our SOC 2 and student-data compliance initiatives.
• Collaborate with data engineering to oversee the orchestration platform and its supporting infrastructure—deployment, scaling, reliability, and observability.
• Work closely with backend and data engineers to diagnose service and platform issues.
• Set an example by participating in a regular on-call rotation; lead incident response, conduct blameless post-mortems, and ensure follow-through that transforms one-time outages into lasting reliability enhancements.
• Offer guidance to developers on infrastructure-related concerns and best practices.
• A Bachelor's degree or higher in Computer Science, Software Engineering, or a related discipline.
• Over 5 years of experience in infrastructure, platform, DevOps, or a similar engineering role.
• Practical experience with GCP (GCE, GCS, VPC, IAM, Cloud Monitoring, and associated services).
• Familiarity with Docker and Kubernetes (GKE)—including containerizing workloads, deploying to GKE, Helm, and cluster fundamentals.
• Experience with CI/CD pipelines (GitHub Actions, ArgoCD, Jenkins, or equivalent).
• Proficiency with an observability platform such as New Relic (metrics, logging, alerting, dashboards).
• Expertise in Terraform for managing infrastructure as code.
• Scripting/programming capabilities in Python, Bash, or similar languages.
• Willingness to participate in a regular production on-call rotation.
• Proven track record of significantly enhancing the reliability of production systems—e.g., establishing SLOs, decreasing incident frequency or MTTR, and eliminating recurrent failure modes.
• Strong problem-solving abilities, a sense of ownership, and the capacity to operate effectively within dynamic systems.
• Proficiency in English for daily collaboration and technical documentation.
• Proficiency in Mandarin Chinese to facilitate effective collaboration with global engineering and business teams.
• Competitive salary and performance-based bonuses.
• Comprehensive health, dental, and vision insurance.
• Generous paid time off and flexible work arrangements.
• Opportunities for professional development and continuous learning.
• Supportive work environment fostering collaboration and innovation.