
Site Reliability Engineer – SRE
Posted 6 days ago

This is a fully remote position, open to applicants in Brazil.
• Define and monitor reliability metrics (SLI, SLO, SLA) and operate within the error budget.
• Develop strategies for high availability, resilience, and disaster recovery (RTO/RPO).
• Perform capacity planning and analyze service performance.
• Enhance the reliability and performance of applications operating on Kubernetes.
• Design and improve system observability (logs, metrics, traces, and alerts).
• Create actionable dashboards and alerts aimed at increasing visibility while minimizing noise and false positives.
• Identify issues proactively through service instrumentation.
• Establish and manage the incident response process (classification, severity, on-call duties).
• Lead or assist in troubleshooting applications and distributed systems.
• Conduct root cause analyses (RCA) and post-mortems, and recommend preventative measures.
• Develop and maintain operational runbooks.
• Automate operational tasks and incident responses (self-healing) to eliminate repetitive manual processes.
• Utilize AI for log analysis, anomaly detection, troubleshooting, and optimization (AIOps).
• Consistently apply the principle “automate before repeating” to enhance operational maturity.
• Collaborate with development and platform teams to foster continuous improvement in reliability.
• Cultivate a culture of reliability and best practices across teams.
• Implement security best practices in production environments (secrets, access control, segregation).
• Ensure traceability through logs, auditing, and events.
• Assist in compliance with standards such as ISO 27001 and DevSecOps practices.
• Integrate reliability and security (Security by Design).
• Experience or familiarity with observability tools (Grafana, Prometheus, Elastic, Dynatrace, or similar).
• Experience or familiarity with Kubernetes and container technologies (Docker).
• Knowledge of Linux and networking concepts (HTTP, DNS, TLS/SSL).
• Proficiency in scripting and automation (Shell, Python, or similar).
• Strong analytical skills with a focus on problem-solving.
• Regular use of AI in daily work, with an automation-oriented mindset ("automate before repeating").
• Organized and autonomous, with strong technical communication skills; seniority comparable to a mid-level or senior Full Stack Developer working on production projects.
• Quick learner with a continuous desire to acquire new knowledge.
• Empathetic towards customer needs.
• Commitment to delivering an exceptional customer experience.
• Team-oriented mindset; willing to offer and seek assistance.
• Excellent communication skills for interaction across various teams.
• Proactive and well-organized.
• Alignment with our core values: Honesty and Ethics; Excellence and Care in Deliverables; Recognition; Respect and Courtesy.
• Experience with SLI, SLO, and Error Budget frameworks.
• Expertise in troubleshooting distributed systems.
• Background in critical, high-availability environments.
• Familiarity with APM tools (Dynatrace, Datadog).
• Knowledge of OpenTelemetry and service instrumentation.
• Understanding of Kafka, Elasticsearch, or Redis.
• Experience with incident automation (self-healing) and Infrastructure as Code (IaC) using tools like Terraform and Ansible.
• Knowledge of Chaos Engineering and service mesh technologies.
• Experience applying AI to operational processes (AIOps, technical copilots).
• Background in regulated environments (government, legal, or financial sectors).
• Health plan: Comprehensive coverage for your health needs.
• Life insurance: Providing security and peace of mind for you and your family.
• Partner discounts: Access to services from pharmacies, nutritionists, and psychologists at special rates.
• Well-being app (Clude): Encouragement for physical activities and overall well-being.
• Total Pass: Access to a wide network of local gyms.
• Workplace exercise: Active breaks to maintain your physical health during work hours.
• Meal allowance: Available for CLT employment contracts.
• Caju Card: A special gift to celebrate your birthday month.
• Home office allowance: Financial support for establishing a comfortable and efficient workspace.
• Education assistance: Support for your academic and professional growth.
• Book allowance: Encouragement to enhance your knowledge through reading.
• Continuous development: Programs and initiatives designed to advance your career.
• Innovation program: A platform for you to share ideas and make a meaningful impact.
• Dual screen: Providing the right tools to boost your productivity.
• 100% remote position: Freedom to work from your preferred location.
• FreeDay: Enjoy a day off to recharge.
• Moment Off: We promote breaks for relaxation and disconnection.
• Time off for your graduation: We celebrate your educational achievements with you.
• Gift for new children of employees: A gesture to commemorate the arrival of a new family member.
• Welcome-back gift after paternity leave: Support for your transition back to work.
• Supportive and collaborative environment: A team that thrives together.
• Eco-friendly welcome kit: Start your journey with us in an environmentally conscious way.
• Sustainable culture: Engage in practical initiatives, such as promoting composting.
• Virtual social gatherings: Opportunities to celebrate and connect with the team.
• Ongoing engagement campaigns throughout the year.