This is a fully remote position, open to applicants in California and four other states.

📋 Description

• Design, implement, and manage solutions on Kubernetes for extensive storage and data platforms, including the manifests, Helm charts, and operators necessary for their operation.

• Create tools, services, and automation processes that enhance the lifecycle of storage and data systems—from provisioning and configuration to deployment, scaling, and ongoing operations.

• Develop and manage telemetry and observability for production systems, including metrics, logging, tracing, dashboards, and alerting, ensuring that system health, availability, and latency are both measurable and actionable.

• Utilize strong analytical troubleshooting abilities to identify and resolve complex issues within distributed, containerized infrastructures.

• Collaborate closely with colleagues and partner teams to enhance the lifecycle of services, from initial design through deployment, operation, and continuous improvement.

• Sustainably scale systems through automation, infrastructure-as-code, and CI/CD practices, while advocating for changes that enhance reliability and speed.

• Support services before they go live through deployment automation, capacity planning, and launch readiness reviews.

• Engage in sustainable incident response practices and postmortem reviews, while participating in an on-call rotation to support production systems.

⛳️ Requirements

• Bachelor's degree (or equivalent experience) in Computer Science or a related technical field involving programming.

• Over 12 years of hands-on experience.

• Practical experience with Kubernetes, including the deployment, configuration, and operation of workloads and solutions in a production environment.

• Experience in developing tools and services for storage, data, or platform infrastructure, with a solid grasp of software design principles (algorithms, data structures, complexity analysis) on large-scale Linux-based systems.

• Familiarity with building and managing telemetry and observability using tools such as Prometheus, InfluxDB, Grafana, and the Elastic stack.

• Strong analytical troubleshooting skills, employing a systematic, root-cause-driven methodology to identify and resolve complex issues.

• Proficiency in one or more programming languages, such as Python, Go, or Java.

• Solid understanding of infrastructure configuration management and infrastructure-as-code tools like Ansible, Chef, Puppet, ArgoCD, Git Pipelines, and Terraform.

🏝️ Benefits

• Equity

• Health insurance

• Retirement plans

• Paid time off

• Professional development opportunities

Senior Systems Engineer, Storage – DGX Cloud
