Multi‑cluster, multi‑tenant secret management

Márk Sági-Kazár

2023-05-25 @ ESO meetup


Márk Sági-Kazár

Open Source Tech Lead @ Cisco

CNCF Ambassador


Cisco ET&I

  • Incubating org for emerging technology
  • Engineers generally don’t have production operational experience
  • Infrastructure needs are served by a central platform engineering (SRE) team


  • Managing configuration across clusters, envs, tenants
  • Access control and resource management
  • Reducing the blast radius
  • Secret management and rotation (!)

Deployment model

  • GitOps
  • ArgoCD

Secret management (initial approach)

  • Central HashiCorp Vault (each team gets its own namespace)
  • ESO to sync secrets into clusters
  • Manual secret rotation :(

Challenges of secret rotation

  • Complexity
  • Time-consuming, error-prone process
  • Disruption of service availability

Secret rotation flow

    sequenceDiagram
    actor Operator
    participant Provider as Secret provider
    participant Store as Secret store
    participant Deploy as ???
    participant Production

    Deploy->>Store: Watch for changes
    activate Deploy
    Operator->>Provider: Generate new secret
    Provider-->>Operator: Return new secret
    Operator->>Store: Rotate secret in store
    Store-->>Deploy: Notice secret change
    deactivate Deploy
    Deploy->>Production: Deploy new secret
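
The flow above can be sketched as a small in-memory simulation. The names used here (`SecretStore`, `Watcher`, `rotate`) are hypothetical and do not belong to any real tool. The sketch only illustrates that the component marked ??? has to notice a new version in the store and push it to production without the operator being involved.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class SecretStore:
    """In-memory stand-in for a secret store (hypothetical, for illustration)."""
    value: str = ""
    version: int = 0

    def put(self, value: str) -> None:
        self.value = value
        self.version += 1


@dataclass
class Watcher:
    """Plays the role of the '???' participant: watches the store, deploys changes."""
    store: SecretStore
    seen_version: int = 0
    production: dict = field(default_factory=dict)

    def poll(self) -> bool:
        """Deploy the secret if the store changed since the last poll."""
        if self.store.version == self.seen_version:
            return False
        self.production["secret"] = self.store.value
        self.seen_version = self.store.version
        return True


def rotate(store: SecretStore) -> str:
    """Operator step: generate a new secret and write it to the store."""
    new_value = secrets.token_urlsafe(16)  # stands in for the secret provider
    store.put(new_value)
    return new_value


store = SecretStore()
watcher = Watcher(store)
first = rotate(store)
changed = watcher.poll()    # a new version is present, so it gets deployed
unchanged = watcher.poll()  # nothing changed since the previous poll
```

In this model the operator only touches the provider and the store. Everything after that is the watcher's job, which is the gap the rest of the talk fills in.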

Secret management in Kubernetes

⚠️ Plug the holes first! ⚠️

  • Turn on encryption at rest
  • Configure least-privilege access to Secrets

Official guide: Good practices for Kubernetes Secrets

Deploying secrets to Kubernetes

  • External Secrets Operator (ESO): synchronizes secrets from an external store to Kubernetes
  • Mount secrets as usual (env var, file)
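
As a rough illustration of the sync step, the snippet below builds an `ExternalSecret` manifest as a Python dict and prints it as JSON, which `kubectl apply -f -` accepts as well as YAML. The store name `vault-backend`, the remote key `team-a/database`, the target name and the one-hour refresh interval are all placeholders. The `external-secrets.io/v1beta1` API version is an assumption and should be checked against the installed ESO release.

```python
import json


def external_secret(name, namespace, store, remote_key, prop, refresh="1h"):
    """Build an ExternalSecret manifest; all argument values are illustrative."""
    return {
        "apiVersion": "external-secrets.io/v1beta1",  # assumption: verify for your ESO version
        "kind": "ExternalSecret",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # How often ESO re-reads the external store.
            "refreshInterval": refresh,
            "secretStoreRef": {"name": store, "kind": "ClusterSecretStore"},
            # Name of the Kubernetes Secret that ESO creates and keeps in sync.
            "target": {"name": name},
            "data": [
                {"secretKey": prop, "remoteRef": {"key": remote_key, "property": prop}}
            ],
        },
    }


manifest = external_secret(
    "database", "team-a", "vault-backend", "team-a/database", "password"
)
print(json.dumps(manifest, indent=2))
```

The resulting Secret is then mounted like any other Secret, so workloads do not need to know that ESO is involved.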

Triggering workload rollout

    sequenceDiagram
    participant Store as Secret store
    participant ExternalSecrets as External secrets
    participant Kubernetes
    participant Reloader

    ExternalSecrets->>Store: Watch for changes
    Reloader->>Kubernetes: Watch for changes
    Store-->>ExternalSecrets: Notice secret change
    ExternalSecrets->>Kubernetes: Deploy new secret
    Kubernetes-->>Reloader: Notice secret change
    Reloader->>Kubernetes: Trigger workload rollout
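
Reloader has to be told which workloads to watch. A minimal sketch, assuming the Stakater Reloader and its `reloader.stakater.com/auto` annotation: the helper below (`enable_reloader`, a hypothetical name) adds that annotation to a Deployment manifest held as a Python dict. The Deployment shown is a placeholder.

```python
import copy

# Assumption: Stakater Reloader, which restarts annotated workloads when a
# Secret or ConfigMap they reference changes.
RELOADER_AUTO = "reloader.stakater.com/auto"


def enable_reloader(deployment: dict) -> dict:
    """Return a copy of the manifest with the Reloader auto annotation set."""
    patched = copy.deepcopy(deployment)
    annotations = patched.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[RELOADER_AUTO] = "true"
    return patched


deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "api", "namespace": "team-a"},  # placeholder workload
}
patched = enable_reloader(deployment)
```

The helper returns a copy rather than mutating its input, so the same base manifest can be reused for workloads that should not be restarted automatically.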

What could possibly go wrong?

Who knows, so monitor everything


Watch out for potentially high-cardinality labels (drop the metrics and labels you don’t need)
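
To show what dropping a label does to cardinality, here is a small sketch that removes a label from a set of samples and sums the values that collapse into the same series. The label names (`namespace`, `name`) and the sample values are made up for illustration. In practice this would normally be done in the monitoring system's own relabeling or aggregation rules rather than in application code.

```python
from collections import defaultdict


def drop_labels(samples, drop):
    """Sum sample values after removing the labels named in `drop`."""
    totals = defaultdict(float)
    for labels, value in samples:
        # Series identity is the sorted set of remaining label pairs.
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        totals[kept] += value
    return dict(totals)


samples = [
    ({"namespace": "team-a", "name": "db-1"}, 1.0),
    ({"namespace": "team-a", "name": "db-2"}, 2.0),
    ({"namespace": "team-b", "name": "api"}, 4.0),
]
reduced = drop_labels(samples, drop={"name"})
```

Three series become two once the per-object `name` label is gone. With one series per secret across many clusters and tenants, the saving grows accordingly.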

Changes take effect with a delay

  1. Change some configuration ✏️
  2. Wait until the next secret sync period 🤞
  3. Hope nothing breaks 🙏

Solution: create (and modify) test secrets at the same time.
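
One way to use a test secret is to change it together with the configuration and then wait until the new value shows up on the cluster side. The sketch below polls a caller-supplied reader until it returns the expected value or a timeout expires. `wait_for_sync`, `read_synced` and `FakeClock` are hypothetical names. The clock and sleep functions are injectable so the example can run without actually waiting.

```python
import time
from typing import Callable


def wait_for_sync(
    read_synced: Callable[[], str],
    expected: str,
    timeout: float,
    poll_every: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll until the synced test secret equals `expected`; return seconds waited.

    Raises TimeoutError if the value has not arrived within `timeout` seconds.
    """
    start = clock()
    while True:
        if read_synced() == expected:
            return clock() - start
        if clock() - start >= timeout:
            raise TimeoutError(f"test secret not synced after {timeout}s")
        sleep(poll_every)


class FakeClock:
    """Manually advanced clock so the demo does not block."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


fake = FakeClock()
# Pretend the next sync period lands 60 seconds after the change.
waited = wait_for_sync(
    lambda: "new" if fake.now >= 60 else "old",
    expected="new",
    timeout=300,
    clock=fake.time,
    sleep=fake.sleep,
)
```

A timeout here means the configuration change broke synchronization, and it surfaces on a test secret instead of on a production one.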

Cascading effect of an outage

Precondition: store validation is in use.

  1. Provider goes down for a long time (e.g. hours) ❌
  2. Store validation reaches a backoff of hours ⏳
  3. Secret synchronization essentially stops 😱

Solution: Bump every (Cluster)SecretStore after an outage.
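
To see why a long outage leaves synchronization stalled even after the provider recovers, consider a generic capped exponential backoff. The numbers below (5 second base delay, doubling, 4 hour cap) are assumptions chosen for illustration and are not taken from ESO. The point is only how quickly repeated failures push the next retry hours into the future.

```python
from typing import List


def backoff_delays(base: float, factor: float, cap: float, failures: int) -> List[float]:
    """Delay in seconds before each retry, for `failures` consecutive failures."""
    return [min(base * factor ** n, cap) for n in range(failures)]


# Assumed numbers for illustration only: 5 s base, doubling, capped at 4 h.
delays = backoff_delays(base=5.0, factor=2.0, cap=4 * 3600.0, failures=14)
next_retry_hours = delays[-1] / 3600
```

With these assumed parameters the delay hits the cap after about a dozen failures. If the provider comes back right after a failed attempt, nothing is retried until that delay runs out, which is why bumping the stores by hand after an outage gets synchronization going again sooner.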

To sum up ESO

  • Understand how (and when) changes will take effect
  • Monitor and alert for failures


Thank you

Any questions?