
Senior Engineer, Network Observability
Posted 22 hours ago

This is a fully remote position, open to applicants in the United Kingdom.
• We are looking for a skilled and experienced Senior Engineer to join our Network Observability team. In this position, you will play a crucial role in designing, developing, and maintaining the monitoring, telemetry, and observability systems that ensure the reliable and scalable operation of CoreWeave’s GPU cloud network.
• Your primary focus will be on creating solutions that deliver real-time insights into network performance, proactively identifying issues and facilitating swift resolutions.
• You will develop, optimize, and maintain network observability platforms, utilizing your expertise in Python and Golang to create and automate collectors, exporters, and dashboards that offer in-depth visibility into network health and performance.
• Collaborate with Network Engineering and Platform teams to aggregate and standardize logs, metrics, and events from various platforms (such as Arista EOS, NVIDIA Cumulus Linux, Nokia SR OS, and SR Linux) into a cohesive observability pipeline.
• Design and implement scalable telemetry solutions using protocols like gNMI, SNMP, and streaming analytics, ensuring advanced alerting and anomaly detection through tools like Prometheus, Grafana, and Alertmanager.
• Work closely with network developers, site reliability engineers, and security teams to integrate observability solutions throughout the broader infrastructure.
• Participate in design discussions, requests for comments (RFCs), and architectural decision-making.
• Join a rotating on-call schedule to troubleshoot and resolve observability-related issues, providing timely support to operations teams and swiftly isolating and addressing problems as they arise.
• Mentor junior team members, share best practices, and cultivate a culture of continuous learning and improvement within the observability domain.
• Extensive knowledge of Prometheus, Grafana, Alertmanager, gNMI, and SNMP; experience writing or extending custom metric collectors/exporters is advantageous.
• Background as a Network Engineer, Site Reliability Engineer (SRE), Software Developer, or Systems Administrator in large-scale environments; a proven history of building and operating resilient telemetry and monitoring solutions is a plus.
• A strong passion for automating tasks and processes, finding fulfillment in creating workflows that manage repetitive tasks and minimize human error.
• Proficiency in containerizing solutions for Kubernetes, as well as designing, building, and deploying container-based workloads efficiently.
• Strong programming skills in Python, Go, and Bash, along with familiarity with configuration management and templating tools (e.g., Ansible, Jinja2).
• In-depth understanding of Linux systems and IP networking concepts, complemented by hands-on experience in routing, switching, and network troubleshooting.
• Practical experience with a variety of platforms, including Arista EOS, NVIDIA Cumulus Linux, Nokia SR OS, and SR Linux.
• A collaborative and humble attitude, always willing to assist others while remaining open to learning from more experienced colleagues.
• Family-level Medical Insurance
• Family-level Dental Insurance
• Generous Pension Contribution
• Life Assurance at 4x Salary
• Critical Illness Cover
• Employee Assistance Programme
• Tuition Reimbursement
• Work culture focused on innovative disruption