This is a fully remote position, open to applicants in Greece.

📋 Description

• **A Glimpse Into Your Daily Tasks:**

• - **Infrastructure as Code for the Cloud**: Take charge of and enhance our Terraform setup across various GCP environments (base, core, obs, dev, test, prod), encompassing GKE clusters, Cloud SQL (Postgres/MySQL), networking, buckets, and IAM. Spearhead the ongoing "Neo" platform rollout and the transition from legacy infrastructure.

• - **Kubernetes and Container Management**: Oversee workloads on GKE, maintain Dockerfiles and Helm-style application configurations for approximately 10 backend services, and optimize autoscaling, resource limits, and pod disruption budgets.

• - **Enhancing Our GitHub Actions Pipelines**: Manage PR checks (Python/JS lint, type-check, tests), Terraform prechecks, image builds and pushes, auto-deploy, and DB-migration labeling/gating. Aim to decrease build times and flakiness while enabling self-service deployments for product teams.

• - **Data and Messaging Infrastructure Management**: Operate Postgres, Redis, and Celery-based asynchronous workers; oversee Alembic migrations, queue health, and backpressure for lengthy simulation tasks.

• - **Monitoring and Observability**: Take ownership of our monitoring stack — including Grafana dashboards, ClickHouse, Langfuse (LLM tracing), and Celery queue metrics. Develop alerting and SLOs to identify issues prior to customer impact.

• - **Security and Secret Management**: Oversee secret distribution, implement least-privilege IAM, and track remediation efforts. Collaborate with engineering on insights from our security assessment process.

• - **Cost Management and Reliability**: Monitor cloud and LLM-proxy (LiteLLM) expenses, optimize resource allocation, and enhance the resilience of simulation and evaluation pipelines.

• **Technologies You'll Work With:**

• - Cloud: Google Cloud Platform (GKE, Cloud SQL, GCS, IAM); some AWS / IBM involvement

• - Infrastructure as Code: Terraform (>= 1.14), multi-environment root modules

• - Containers/Orchestration: Docker, docker-compose (local), Kubernetes / GKE

• - CI/CD: GitHub Actions

• - Backend Technologies: Python 3.13+ (managed with uv), Celery, FastAPI-style HTTP APIs; Node/Express services

• - Data Management: PostgreSQL, MySQL, Redis, ClickHouse

• - Observability Tools: Grafana, Langfuse, custom Celery metrics

• - LLM Infrastructure: LiteLLM proxy

⛳️ Requirements

• **Your Key Skills 🚀**

• - Over 3 years of experience in DevOps / SRE / Platform Engineering, or significant backend expertise with substantial infrastructure ownership.

• - Strong hands-on experience with Terraform (modules, state management, multi-environment) and cloud platforms (GCP preferred; AWS/Azure experience is transferable).

• - Practical experience with Kubernetes in production: including deployments, services, autoscaling, pod debugging, and rollouts/rollbacks.

• - Solid understanding of Docker fundamentals and proficiency in writing/optimizing Dockerfiles.

• - Experience in designing and maintaining CI/CD pipelines (GitHub Actions, or equivalent such as GitLab CI / CircleCI).

• - Proficient in scripting and reading code in Python and/or Bash; capable of navigating a polyglot monorepo.

• - Operational expertise with relational databases and managed database services (migrations, backups, performance optimization).

• - A reliability-oriented approach: monitoring, alerting, incident response, and creating runbooks.

• **Additional Desirable Qualifications:**

• - Experience managing Celery / distributed task queues and Redis at scale.

• - Familiarity with LLM/AI infrastructure (model proxies, GPU scheduling, token/cost management).

• - Proficiency in observability tools (Grafana, Prometheus, ClickHouse, OpenTelemetry, Langfuse, or similar tracing technologies).

• - Background in security/compliance (IAM hardening, secret management, vulnerability remediation).

• - Experience in cost-optimization for cloud and third-party API expenditures.

• - Experience supporting a monorepo that encompasses multiple language ecosystems and editable/internal package dependencies.

🏝️ Benefits

• **Perks and Advantages**

• - Competitive salary.

• - Training budget for skill enhancement through partnerships with leading tech companies such as Microsoft, AWS, Salesforce, and Databricks – whether it’s certifications or courses, we’ve got you covered.

• - Private insurance, top-tier tech equipment, and the opportunity to collaborate with an exceptional team.

DevOps Engineer, GCP

