
Senior Storage Production Engineer - DGX Cloud
Posted 1 hour ago

This is a fully remote position, open to applicants in California.
• Design, implement, and provide support for large-scale storage clusters, ensuring they are scalable, highly available, and maintain data integrity.
• Create and uphold storage monitoring, logging, and alerting systems to facilitate proactive detection and resolution of performance issues.
• Collaborate with AI/ML workloads to enhance storage architectures for low-latency access, efficient caching, and high-throughput performance.
• Enhance the lifecycle of storage services, from conception and design to deployment, operation, and ongoing optimization.
• Assist in the preparation of storage services prior to their availability by engaging in system build consulting, developing automation frameworks, managing capacity, and conducting launch reviews.
• Oversee the production storage infrastructure by monitoring availability, latency, and system health, utilizing predictive analytics and AI-driven automation.
• Maximize storage efficiency through compression, deduplication, tiering strategies, and intelligent workload allocation.
• Sustainably scale storage systems using AI/ML-driven automation, policy-based tiering, and dynamic data migration techniques.
• Guarantee data security and compliance by implementing encryption, access controls, and auditing mechanisms for storage systems.
• Engage in sustainable incident response and blameless root cause analysis.
• Participate in an on-call rotation to support storage and production systems.
• Bachelor's degree or equivalent experience in Computer Science, Storage Systems, or a related technical field, with a minimum of 8 years of practical experience.
• Proficient in distributed and high-performance storage solutions, including clustered and parallel file systems, distributed object storage, and enterprise-grade storage systems.
• Strong understanding of block, file, and object storage technologies, including their scalability, reliability, performance characteristics, and standard processes.
• Familiarity with storage networking protocols such as NFS, SMB, iSCSI, S3, Fibre Channel, RDMA, and NVMe over Fabrics.
• Expertise in algorithms, data structures, complexity analysis, software design, and automating the maintenance of large-scale Linux-based storage systems.
• Experience in one or more programming languages such as C/C++, Java, Python, Go, NodeJS, and Bash for storage automation, monitoring, and performance tuning.
• Practical experience with infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform for automating storage deployments.
• Proficient in using observability and tracing tools like InfluxDB, Prometheus, Grafana, and the Elastic stack for monitoring the health of storage systems.
• Equity
• Benefits