
Cloud NOC Engineer
Posted May 23

This is a fully remote position, open to applicants in Chile.
• Proactive Monitoring: Continuous surveillance of dashboards and alerts (physical infrastructure, virtual infrastructure, and services) to ensure 99.999% availability.
• Incident Management (Triage): Reception, categorization, and prioritization of alerts.
• Ticket Management: Rigorous opening and follow-up of tickets following ITIL methodologies.
• Initial Technical Resolution: Diagnosis and resolution of low to medium complexity issues (e.g., service restarts, log cleaning, quota adjustments, basic connectivity checks).
• Structured Escalation: When complexity exceeds the initial level, escalate to L1/L2 by providing a comprehensive technical report (logs, network traces, reproduction steps, and client context).
• Case Documentation: Keep the event log and knowledge base (KB) updated regarding recurring incidents.
• External Communication: Notify clients about health statuses, maintenance windows, and ongoing incidents clearly and promptly.
• Health Checks: Conduct periodic health validation routines on production platforms.
• SLA Compliance: Ensure compliance with SLAs for incident handling and for network and service availability.
• Availability Reporting: Generate and analyze platform availability reports.
• At least 1-2 years of experience in a monitoring center (NOC), first-level technical support, or systems administration.
• Experience in ticket management and support processes (Jira, ServiceNow, or others), including clear documentation of diagnosis, evidence, and communication.
• Proficiency in monitoring/observability tools such as Prometheus, Grafana, Elasticsearch, OpenSearch, and OpenNMS.
• Ability to read and interpret metrics, events, logs, and alarms.
• Experience with production-critical systems, including incident management, coordination of production actions, escalation, and effective communication.
• Degree in Computer Engineering, Systems Engineering, Electronic Engineering, or a related field.
• Experience with Linux in production environments: troubleshooting services and the operating system (systemd, journalctl), permissions/users, processes, filesystem, and networks.
• Networking in Linux: configuration and diagnosis of interfaces, VLANs, routes, bonding, and MTU; troubleshooting with tools like tcpdump (sniffing), ip, ss, ethtool, ping/traceroute.
• Kubernetes: operation/administration and troubleshooting in production (Pods, Deployments/DaemonSets, Services, events/logs, readiness/liveness; basic knowledge of storage PV/PVC).
• Virtualization: experience operating and supporting virtualized environments (KVM/VMware/Hyper-V or others), including diagnosis of common computing, network, and storage failures.
• Automation: ability to resolve repetitive tasks using Bash and Ansible and/or Python (information gathering, operational checks, basic remediation, secure production scripts).
• Intermediate English skills for reading/writing technical documentation, updating stakeholders, and interacting with vendors/manufacturers for support cases.
• Private medical insurance for you and your family.
• Language courses to ensure your growth knows no boundaries.
• Access to courses, books, materials, and reimbursement for certifications.
• A minimum of 15 vacation days, one day off for your birthday, and extra breaks before National Holidays, Christmas, and New Year.
• Performance bonuses and project success incentives.
• Budget for recreational activities and team-building.
• Cutting-edge technology: We renew your equipment every 3 years... and it’s yours at the end of the period!