
Senior Systems Engineer, Storage – DGX Cloud
Posted 6 days ago

This is a fully remote position, open to applicants in California, +4 more states.
• Design, implement, and manage solutions on Kubernetes for extensive storage and data platforms, including the manifests, Helm charts, and operators necessary for their operation.
• Create tools, services, and automation processes that enhance the lifecycle of storage and data systems—from provisioning and configuration to deployment, scaling, and ongoing operations.
• Develop and manage telemetry and observability for production systems, including metrics, logging, tracing, dashboards, and alerting, ensuring that system health, availability, and latency are both measurable and actionable.
• Utilize strong analytical troubleshooting abilities to identify and resolve complex issues within distributed, containerized infrastructures.
• Collaborate closely with colleagues and partner teams to enhance the lifecycle of services, from initial design through deployment, operation, and continuous improvement.
• Sustainably scale systems through automation, infrastructure-as-code, and CI/CD practices, while advocating for changes that enhance reliability and speed.
• Support services before they go live through activities such as deployment automation, capacity planning, and launch readiness reviews.
• Engage in sustainable incident response practices and postmortem reviews, while participating in an on-call rotation to support production systems.
• Bachelor's degree (or equivalent experience) in Computer Science or a related technical field involving programming.
• Over 12 years of hands-on experience.
• Practical experience with Kubernetes, including the deployment, configuration, and operation of workloads and solutions in a production environment.
• Experience in developing tools and services for storage, data, or platform infrastructure, with a solid grasp of software design principles (algorithms, data structures, complexity analysis) on large-scale Linux-based systems.
• Familiarity with building and managing telemetry and observability using tools such as Prometheus, InfluxDB, Grafana, and the Elastic stack.
• Strong analytical troubleshooting skills, employing a systematic, root-cause-driven methodology to identify and resolve complex issues.
• Proficiency in one or more programming languages, including Python, Go, or Java.
• Solid understanding of infrastructure configuration management and infrastructure-as-code tools like Ansible, Chef, Puppet, ArgoCD, Git Pipelines, and Terraform.
• Equity
• Health insurance
• Retirement plans
• Paid time off
• Professional development opportunities