Remotery

Senior Site Reliability Engineer

Posted 5 days ago

This is a fully remote position, open to applicants in Florida.

📋 Description

• Ensure the availability, reliability, and performance of high-traffic Java-based applications within a distributed environment.

• Troubleshoot and resolve intricate issues across both production and non-production settings.

• Conduct pre- and post-deployment performance testing and monitoring to continuously improve application performance.

• Optimize Java application performance, focusing on JVM tuning, efficient resource use, and horizontal scaling.

• Deploy and manage the Grafana stack (Grafana, Prometheus, Loki, Mimir, Alloy) to provide real-time monitoring, logging, and alerting.

• Implement and refine observability strategies to improve visibility into application and infrastructure health.

• Create and maintain dashboards, alerts, and log queries for thorough system health monitoring.

• Integrate AI/ML models into the observability pipeline for anomaly detection, predictive alerting, and intelligent alert correlation and noise reduction.

• Design, build, and operate agentic AI workflows to automate operational tasks such as alert triage, root cause analysis, runbook execution, and incident summarization.

• Develop tool-calling LLM agents that interact with infrastructure APIs (Kubernetes, Grafana, Jira, Slack, PagerDuty) to perform diagnostic and remediation actions, either autonomously or with human-in-the-loop approval.

• Build and maintain MCP (Model Context Protocol) servers and integrations that expose internal systems as tool surfaces for AI agents.

• Evaluate, select, and operationalize LLM frameworks and orchestration platforms (e.g., LangChain, LangGraph, CrewAI, n8n, or custom solutions) for production-grade agentic systems.

• Implement guardrails, evaluation harnesses, and feedback loops to ensure AI agent outputs are accurate, safe, and continuously improving.

• Advocate for the adoption of AI-assisted development and operations practices across the SRE and broader engineering organization.

• Support the operations team’s incident response efforts, conduct post-mortems, and identify root causes to prevent recurrence.

• Leverage AI tools to accelerate incident timelines, auto-generate post-mortem drafts, and identify patterns across historical incidents.

• Document and share lessons learned, contributing to a culture of continuous improvement.

• Identify repetitive operational workflows and design AI-augmented or fully automated replacements.

• Create self-service tools and chatbot interfaces that allow engineering teams to query system status, retrieve logs, and execute standard operating procedures using natural language.

• Measure and report on toil reduction metrics to quantify the impact of automation initiatives.

• Collaborate closely with developers, architects, and data/ML engineers to design solutions that enhance reliability and utilize AI capabilities.

• Work alongside DevOps and NOC teams to support the application platform.

• Communicate SRE practices, AI/automation capabilities, and operational insights to both technical and non-technical stakeholders.

• Provide feedback on application performance, potential improvements, and observability metrics.


⛳️ Requirements

• Bachelor's degree in Computer Science or a related field, or equivalent professional experience.

• Over 5 years of experience in SRE, DevOps, or similar infrastructure roles with a background in managing large-scale, high-availability production systems.

• More than 3 years of hands-on experience managing production Kubernetes clusters, with a deep understanding of architecture, networking, storage, and security.

• Experience with cluster autoscaling (Karpenter), upgrades, and multi-cluster management.

• Proficient in kubectl, Helm, Kubernetes operators, and troubleshooting container orchestration.

• Advanced expertise in the Grafana observability stack: dashboards, alerting, visualization, and Grafana Alloy for telemetry collection.

• Proficient in PromQL and experienced with Loki for log aggregation and analysis.

• Hands-on experience managing Java-based applications in distributed environments, including JVM tuning and optimization.

• Expertise in cloud platforms (AWS preferred; GCP or Azure also valued).

• Familiarity with Infrastructure as Code tools like Terraform/Terragrunt or Ansible.

• Proficiency with ArgoCD for GitOps workflows and continuous deployment.

• Strong scripting skills in Python, Bash, or Go, with experience in building CI/CD pipelines and deployment automation.

• Proven track record in on-call rotations, incident response, and root cause analysis.

• At least 1 year of practical experience building or operating AI/LLM-powered tools, agents, or workflows in a production or production-adjacent context.

• Demonstrated capability to design agentic systems that employ tool calling, retrieval-augmented generation (RAG), or multi-step reasoning for operational tasks.

• Experience integrating LLM APIs (e.g., Anthropic Claude, OpenAI, or open-source models) into backend services or automation pipelines.

• Familiarity with at least one agentic orchestration framework or workflow engine (LangChain, LangGraph, CrewAI, n8n, Temporal, or equivalent).

• Understanding of best practices in prompt engineering, including structured outputs, system prompts, and few-shot examples.

• Familiarity with AI-assisted coding tools (Claude Code, Codex, Cursor) and their integration into engineering workflows.

• Experience in building or utilizing MCP (Model Context Protocol) servers to expose internal tools to AI agents.

• Awareness of AI safety, hallucination mitigation, and human-in-the-loop design patterns for autonomous systems.


🏝️ Benefits

• Competitive pay and benefits

• Flexible vacation allowance

• A hybrid / remote working environment

• Startup culture backed by a secure, global brand
