Available · Q3 2026 · Dubai, UAE (remote) · 16+ yrs in production

Guilherme Jaccoud

Staff Site Reliability Engineer

I design platforms that stand.
Distributed systems, Kubernetes at scale, multi-cloud infrastructure, and the operational discipline that makes them quietly boring.

Systems engineer and platform architect. I build the reliability substrate that other engineers build on top of.

Sixteen years across cloud hosting, fintech, hyperscale e-commerce and crypto. I lead reliability programs end-to-end: capacity and risk modelling, SLO design, incident command, and the platform engineering that makes safety the default path.

I write Rust for the parts that matter, Python for the parts that don't, and shell for the parts I'd rather not admit. I strive for simplicity and minimalism, and I treat a platform like a product.

2021 – 2026

Staff Site Reliability Engineer

Own platform reliability for a multi-region streaming network serving 90M MAUs. Lead a 22-person SRE org across edge, control plane, and data infra.

  • Designed a cell-based architecture rollout that contained 11 control-plane incidents to single cells — zero customer-facing outages in 18 months.
  • Rewrote the autoscaling controller in Rust; reduced p99 scale-up latency from 38s to 4.2s and cut compute spend by $11M/yr.
  • Established the org's first error-budget policy and quarterly reliability review; adopted by 9 product teams.
  • Mentored 6 senior engineers to staff and 2 staff engineers to principal.
Kubernetes · Rust · AWS · GCP · OpenTelemetry · Istio

2019 – 2021

Staff SRE / Platform Engineer

Led the 6-person SRE team through Talabat's post-acquisition migration from an on-prem monolith to a cloud-native distributed system on AWS; owned all infrastructure from networking to secrets.

  • Built greenfield AWS infrastructure from scratch — VPC, networking, and EKS — using Terraform modules co-authored with the Security team; migrated SCM from Azure DevOps to GitHub and deployed Atlantis for self-service infrastructure changes.
  • Implemented AWS Direct Connect between the on-prem DC and AWS to enable gradual, zero-disruption service migration; retained the primary database on-prem post-migration where DC latency outperformed RDS.
  • Partnered with development teams on the migration from legacy .NET Framework to .NET Core on Kubernetes; introduced the Serverless Framework for workloads where dedicated Kubernetes deployments were disproportionate.
  • Built a GitOps CLI orchestrating deployments from service repositories to an infra monorepo, integrated with ArgoCD; delivered multi-region active-active EKS clusters with 50/50 traffic split via Cloudflare Load Balancer and automatic failover.
  • Owned the full platform stack: secret management with HashiCorp Vault, policy enforcement with OPA Gatekeeper, and regular Chaos Monkey exercises to validate disaster recovery scenarios.
  • Designed a daily load-testing pipeline — Terraform + K6 + K3s — triggered by a scheduled GitHub Actions workflow at 4 AM; spun up ephemeral infra, ran developer-authored test suites against production, and streamed all metrics to New Relic, providing daily visibility into scalability headroom.
  • Wrote a Rust API + React UI for engineer onboarding that automatically provisioned credentials across internal systems and third-party vendors, eliminating manual setup steps.
Kubernetes · Terraform · AWS · ArgoCD · Cloudflare · Vault · OPA · Rust · K6 · New Relic · Atlantis · Serverless

2016 – 2019

Senior SRE / Infrastructure Engineer

Infrastructure hire following Symphony's acquisition by Google; led the platform re-architecture from a monolith to microservices on GCP, with per-tenant deployments across dedicated cloud accounts for each enterprise client.

  • Drove the monolith-to-microservices decomposition, migrating all services onto Kubernetes on GCP as part of the post-acquisition re-platform.
  • Designed a per-tenant deployment model where each enterprise client ran their own fully isolated Symphony stack across dedicated GCP and AWS accounts — provisioning and maintaining dozens of independent environments in parallel.
  • Led IaC migration from ad-hoc Python scripts + CloudFormation to multi-cloud Terraform (GCP/AWS); authored reusable modules covering networking, persistence, and orchestration layers.
  • Built a GitOps CLI in Go that resolves module dependency order (networking → persistence → orchestration), applies Terraform, and commits the resulting state to Git — making every environment change auditable and reproducible.
  • Integrated the CLI into Jenkins, giving development teams a self-service UI to spin up, update, and tear down development and production environments without SRE intervention.
  • Operated a heterogeneous persistence tier — HBase, Hadoop, MongoDB, Solr, Elasticsearch, Hazelcast — on auto-scaled GCP instances across all tenant environments.
  • Trained directly by Google Cloud engineers and reported to them for all infrastructure architectural decisions throughout the engagement.
Kubernetes · Terraform · GCP · AWS · Go · Jenkins · MongoDB · Elasticsearch · HBase · Hadoop

2010 – 2016

Founder & CEO

Founded Latin America's first managed WordPress hosting company; grew it to thousands of customers and a team of 12, then sold the business in 2016.

  • Designed per-tenant isolation on AWS — each customer site provisioned with its own Auto Scaling Group, ALB, and Cloudflare WAF + CDN — guaranteeing resource isolation across thousands of hosted WordPress properties.
  • Rebuilt the entire platform on Kubernetes in 2014 — among the earliest production deployments in Latin America — reducing environment provisioning from hours to minutes and enabling zero-downtime rolling deploys.
  • Architected the application stack: NGINX / PHP-FPM / MariaDB / Varnish, with static assets on S3; later migrated to managed RDS and Redis full-page cache, improving hit rate and eliminating a class of database failover incidents.
  • Sole on-call operator for the first two years; grew the engineering and operations team from 0 to 12, owning all hiring, incident response, and capacity planning through to the 2016 exit.
Kubernetes · AWS · Cloudflare · NGINX · PHP-FPM · MariaDB · RDS · Redis · Varnish · S3

Tools are commodities. The discipline of how you operate them is the real artifact.

Orchestration & Platform

Infrastructure & Cloud

Observability & Incident

Languages & Data

01

Distributed systems reliability

Failure modelling, consensus, partition behaviour, dependency contracts. The boring discipline that keeps a fleet calm under load.

02

Kubernetes platform engineering

Multi-tenant clusters, operator design, golden paths, and the platform abstractions that turn raw infrastructure into a product.

03

Multi-cloud infrastructure

AWS, GCP, Azure — and the architectural discipline to stay portable where it matters and proprietary where it pays.

04

High-availability architecture

Cell-based designs, active-active topologies, traffic shifting, and the failover drills that prove they actually work.

05

Observability engineering

SLI/SLO design, OpenTelemetry pipelines, sampling economics, and the dashboards engineers actually use at 3am.

06

Resilience & disaster recovery

RPO/RTO architecture, chaos engineering programs, regional failover rehearsals, and the runbook discipline behind them.

07

Infrastructure automation

IaC at scale, drift detection, policy-as-code, and the platform tooling that makes change cheap and reversible.

08

Platform scalability

Capacity modelling, performance work in Go and Rust, autoscaling control loops, and the cost mechanics behind them.

05 · Get in touch

For staff & principal-level reliability work, write to hello@guigo2k.com.