Remotery

Senior Systems Engineer, Storage – DGX Cloud

Posted 6 days ago

This is a fully remote position, open to applicants in California, +4 more states.

📋 Description

• Design, implement, and manage solutions on Kubernetes for extensive storage and data platforms, including the manifests, Helm charts, and operators necessary for their operation.

• Create tools, services, and automation that improve the lifecycle of storage and data systems, from provisioning and configuration to deployment, scaling, and ongoing operations.

• Develop and manage telemetry and observability for production systems, including metrics, logging, tracing, dashboards, and alerting, ensuring that system health, availability, and latency are both measurable and actionable.

• Apply strong analytical troubleshooting skills to identify and resolve complex issues in distributed, containerized infrastructure.

• Collaborate closely with colleagues and partner teams to enhance the lifecycle of services, from initial design through deployment, operation, and continuous improvement.

• Sustainably scale systems through automation, infrastructure-as-code, and CI/CD practices, while advocating for changes that enhance reliability and speed.

• Support services before they go live through deployment automation, capacity planning, and launch readiness reviews.

• Engage in sustainable incident response practices and postmortem reviews, while participating in an on-call rotation to support production systems.


⛳️ Requirements

• Bachelor's degree (or equivalent experience) in Computer Science or a related technical field involving programming.

• Over 12 years of hands-on experience.

• Practical experience with Kubernetes, including the deployment, configuration, and operation of workloads and solutions in a production environment.

• Experience in developing tools and services for storage, data, or platform infrastructure, with a solid grasp of software design principles (algorithms, data structures, complexity analysis) on large-scale Linux-based systems.

• Familiarity with building and managing telemetry and observability using tools such as Prometheus, InfluxDB, Grafana, and the Elastic stack.

• Strong analytical troubleshooting skills, employing a systematic, root-cause-driven methodology to identify and resolve complex issues.

• Proficiency in one or more programming languages, including Python, Go, or Java.

• Solid understanding of infrastructure configuration management and infrastructure-as-code tools like Ansible, Chef, Puppet, ArgoCD, Git Pipelines, and Terraform.


🏝️ Benefits

• Equity

• Health insurance

• Retirement plans

• Paid time off

• Professional development opportunities

