
DevOps Engineer, GCP
Posted 6 days ago

This is a fully remote position, open to applicants in Greece.

**A Glimpse Into Your Daily Tasks:**
- **Infrastructure as Code for the Cloud**: Take charge of and enhance our Terraform setup across various GCP environments (base, core, obs, dev, test, prod), encompassing GKE clusters, Cloud SQL (Postgres/MySQL), networking, buckets, and IAM. Spearhead the ongoing "Neo" platform rollout and the transition from legacy infrastructure.
- **Kubernetes and Container Management**: Oversee workloads on GKE, maintain Dockerfiles and Helm-style application configurations for approximately 10 backend services, and optimize autoscaling, resource limits, and pod disruption budgets.
- **Enhancing Our GitHub Actions Pipelines**: Manage PR checks (Python/JS lint, type-check, tests), Terraform prechecks, image builds and pushes, auto-deploy, and DB-migration labeling/gating. Aim to decrease build times and flakiness while enabling self-service deployments for product teams.
- **Data and Messaging Infrastructure Management**: Operate Postgres, Redis, and Celery-based asynchronous workers; oversee Alembic migrations, queue health, and backpressure for lengthy simulation tasks.
- **Monitoring and Observability**: Take ownership of our monitoring stack, including Grafana dashboards, ClickHouse, Langfuse (LLM tracing), and Celery queue metrics. Develop alerting and SLOs to identify issues prior to customer impact.
- **Security and Secret Management**: Oversee secret distribution, implement least-privilege IAM, and track remediation efforts. Collaborate with engineering on insights from our security assessment process.
- **Cost Management and Reliability**: Monitor cloud and LLM-proxy (LiteLLM) expenses, optimize resource allocation, and enhance the resilience of simulation and evaluation pipelines.

**Tech Stack You'll Work With:**
- Cloud: Google Cloud Platform (GKE, Cloud SQL, GCS, IAM); some AWS / IBM involvement
- Infrastructure as Code: Terraform (>= 1.14), multi-environment root modules
- Containers/Orchestration: Docker, docker-compose (local), Kubernetes / GKE
- CI/CD: GitHub Actions
- Backend Technologies: Python 3.13+ (managed with uv), Celery, FastAPI-style HTTP APIs; Node/Express services
- Data Management: PostgreSQL, MySQL, Redis, ClickHouse
- Observability Tools: Grafana, Langfuse, custom Celery metrics
- LLM Infrastructure: LiteLLM proxy

**Your Key Skills 🚀**
- Over 3 years of experience in DevOps / SRE / Platform Engineering, or significant backend expertise with substantial infrastructure ownership.
- Strong hands-on experience with Terraform (modules, state management, multi-environment) and cloud platforms (GCP preferred; AWS/Azure experience is transferable).
- Practical experience with Kubernetes in production, including deployments, services, autoscaling, pod debugging, and rollouts/rollbacks.
- Solid understanding of Docker fundamentals and proficiency in writing and optimizing Dockerfiles.
- Experience in designing and maintaining CI/CD pipelines (GitHub Actions, or an equivalent such as GitLab CI / CircleCI).
- Proficient in scripting and reading code in Python and/or Bash; capable of navigating a polyglot monorepo.
- Operational expertise with relational databases and managed database services (migrations, backups, performance optimization).
- A reliability-oriented approach: monitoring, alerting, incident response, and creating runbooks.

**Additional Desirable Qualifications:**
- Experience managing Celery / distributed task queues and Redis at scale.
- Familiarity with LLM/AI infrastructure (model proxies, GPU scheduling, token/cost management).
- Proficiency in observability tools (Grafana, Prometheus, ClickHouse, OpenTelemetry, Langfuse, or similar tracing technologies).
- Background in security/compliance (IAM hardening, secret management, vulnerability remediation).
- Experience in cost-optimization for cloud and third-party API expenditures.
- Experience supporting a monorepo that encompasses multiple language ecosystems and editable/internal package dependencies.

**Perks and Advantages**
- Competitive salary.
- Training budget for skill enhancement through partnerships with leading tech companies such as Microsoft, AWS, Salesforce, and Databricks – whether it’s certifications or courses, we’ve got you covered.
- Private insurance, top-tier tech equipment, and the opportunity to collaborate with an exceptional team.