
Azure CloudOps Engineer
Posted Jun 19

This is a fully remote position, open to applicants in the United States.
• Oversee and provide support for Azure infrastructure across development, QA, staging, and production environments.
• Ensure the operational health of Static Web Apps, Container Apps, PostgreSQL, Storage Accounts, SignalR, Service Bus, Azure AI Foundry, Azure Arc, and associated services.
• Guarantee that resources are provisioned, configured, monitored, maintained, and decommissioned according to company standards.
• Assist in setting up environments for new products, clients, and integrations.
• Identify and rectify infrastructure issues that impact performance, reliability, availability, or security.
• Develop and manage Terraform modules and environmental configurations.
• Ensure that infrastructure changes are version-controlled, peer-reviewed, tested, and authorized.
• Handle Terraform state, workspaces, variables, secrets, and deployment workflows.
• Identify and resolve discrepancies between Terraform and deployed Azure resources.
• Standardize naming conventions, tagging, resource group structures, environment isolation, and module patterns.
• Create, maintain, and troubleshoot GitHub Actions workflows for application and infrastructure deployments.
• Facilitate CI/CD pipelines across various SaaS products and environments.
• Implement promotion processes from development to QA to staging to production.
• Introduce deployment safeguards such as environment protection rules, approvals, rollback procedures, validation checks, release gates, and audit trails.
• Oversee pipeline secrets, service principals, managed identities, and deployment credentials.
• Enhance build and deployment reliability, speed, and traceability.
• Operate and monitor Azure AI services, including Azure AI Foundry and Speech-to-Text workloads.
• Support production operations for LLM integrations and AI-enabled product features.
• Monitor AI service availability, latency, quota usage, token consumption, API failures, throttling, and costs.
• Assist in defining operational standards for AI workloads: access control, logging, alerting, failover, usage governance, and provider disruption management.
• Collaborate with engineering teams to troubleshoot AI service issues, integration failures, degraded model responses, or provider-side disruptions.
• Ensure secure handling of AI secrets, endpoints, keys, managed identities, and private network access.
• Implement and maintain monitoring systems with Azure Monitor, Log Analytics, Application Insights, and related tools.
• Develop dashboards for infrastructure, application, database, messaging, storage, AI service, and deployment health.
• Configure alerts for availability, latency, errors, resource saturation, queue depth, job failures, deployment failures, database health, quota exhaustion, and cost anomalies.
• Enhance signal quality by minimizing noise and ensuring alerts are actionable.
• Participate in production incident response for infrastructure, deployments, integrations, and platform services.
• Triage and resolve issues across Azure services, CI/CD, Terraform, networking, databases, messaging, and AI integrations.
• Develop and maintain runbooks for common operational challenges.
• Assist in root cause analyses and post-incident evaluations.
• Implement preventive measures after incidents to improve reliability.
• Help define severity levels, escalation paths, response expectations, on-call processes, and production support protocols.
• Enforce cloud security best practices across Azure environments.
• Manage Azure RBAC, managed identities, service principals, Key Vault access, and least-privilege permissions.
• Secure GitHub Actions workflows, deployment credentials, environment secrets, and production access.
• Aid in secret rotation, certificate management, and secure configuration management.
• Enforce network security through private endpoints, firewalls, IP restrictions, and environment-specific access rules.
• Support audit and compliance readiness for SOC 2, ISO 27001, or similar frameworks.
• Assist with Azure PostgreSQL operations: backups, restores, performance monitoring, connection limits, high availability, and capacity planning.
• Monitor and maintain Azure Storage Accounts, lifecycle policies, access controls, backup strategies, and usage trends.
• Support Azure Service Bus operations: queue/topic monitoring, dead-letter handling, retry behavior, and throughput.
• Ensure the operational health of SignalR, including connection metrics, scaling behavior, and related production issues.
• Monitor Azure expenditures across products, environments, services, and, where applicable, customers.
• Implement tagging standards to facilitate cost allocation by product, environment, customer, or business unit.
• Create cost dashboards, budget alerts, and anomaly detection, and conduct recurring cost reviews.
• Identify underutilized resources and recommend right-sizing opportunities.
• Review AI service costs, LLM and token usage, Speech-to-Text usage, storage growth, database sizing, and environment costs.
• Suggest savings plans, reservations, scaling rules, lifecycle policies, or shutdown schedules.
• Define and maintain backup and recovery procedures for critical cloud services.
• Test database restores and validate backup reliability.
• Help define recovery time objectives (RTOs) and recovery point objectives (RPOs) for production systems.
• Support disaster recovery planning for SaaS products and customer-facing services.
• Enhance resilience through scaling rules, failover patterns, health checks, synthetic monitoring, and production readiness assessments.
• Create and maintain CloudOps documentation, runbooks, deployment guides, and environment standards.
• Establish standards for naming, tagging, logging, alerting, access control, Terraform structure, GitHub Actions patterns, and production modifications.
• Document procedures for cloud services, CI/CD workflows, AI services, and incident response.
• Empower engineering teams with reusable patterns, templates, and self-service guidance.
• Over 7 years of hands-on experience managing production workloads in Microsoft Azure.
• Extensive experience with Terraform and infrastructure as code.
• Proven experience in building and maintaining CI/CD pipelines using GitHub Actions.
• Familiarity with containerized workloads, preferably Azure Container Apps or similar technologies.
• Experience with Azure Monitor, Log Analytics, and Application Insights.
• Proficiency in Azure PostgreSQL or similar managed relational database services.
• Strong grasp of Azure networking, DNS, identity management, RBAC, managed identities, Key Vault, and security best practices.
• Experience in troubleshooting production incidents across infrastructure, deployments, networking, and cloud services.
• Proficient in scripting with Bash, PowerShell, Python, or similar languages.
• Excellent documentation, communication, and cross-functional collaboration skills.
• Competitive salary based on experience.
• Opportunities for career growth and professional development.
• Opportunity to work with a diverse, global team in a remote work environment.