Phase 9: Operations & Maintenance

This phase covers backup procedures, secret rotation, disaster recovery, and ongoing maintenance tasks.

Backup Strategy

Backup Components

Component	Method	Frequency	Retention
etcd	etcd snapshot	Daily	30 days
PostgreSQL	pg_dumpall	Daily	30 days
Redis	RDB snapshot	Daily	7 days
Longhorn Volumes	Longhorn backup	Daily	14 days
Vault	Raft snapshot	Daily	30 days
Gitea Repositories	Git bundle	Weekly	90 days
Configuration	ArgoCD Git repo	Continuous	Git history

Automated Backup Jobs

etcd Backup

bash

cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 1 * * *"  # Daily at 1 AM
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          containers:
          - name: etcd-backup
            image: rancher/k3s:v1.30.0-k3s1
            command:
            - /bin/sh
            - -c
            - |
              TIMESTAMP=\$(date +%Y%m%d_%H%M%S)
              mkdir -p /backup/etcd

              # Create etcd snapshot
              k3s etcd-snapshot save --name etcd_\${TIMESTAMP}

              # Copy to backup volume (k3s creates .zip snapshots)
              cp /var/lib/rancher/k3s/server/db/snapshots/etcd_\${TIMESTAMP}* /backup/etcd/

              # Keep only last 30 days
              find /backup/etcd -name "etcd_*" -mtime +30 -delete

              echo "Backup completed: etcd_\${TIMESTAMP}"
            volumeMounts:
            - name: backup
              mountPath: /backup
            - name: k3s-data
              mountPath: /var/lib/rancher/k3s
              readOnly: true
            securityContext:
              privileged: true
          restartPolicy: OnFailure
          nodeSelector:
            node-role.kubernetes.io/control-plane: "true"
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
            effect: NoSchedule
          volumes:
          - name: backup
            persistentVolumeClaim:
              claimName: etcd-backup-pvc
          - name: k3s-data
            hostPath:
              path: /var/lib/rancher/k3s
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: etcd-backup-pvc
  namespace: kube-system
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: longhorn-encrypted
EOF

Vault Backup

bash

cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: CronJob
metadata:
  name: vault-backup
  namespace: vault
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: vault
          containers:
          - name: vault-backup
            image: hashicorp/vault:1.15
            command:
            - /bin/sh
            - -c
            - |
              TIMESTAMP=\$(date +%Y%m%d_%H%M%S)

              # Login to Vault (requires configured auth)
              export VAULT_ADDR=http://vault:8200

              # Create Raft snapshot
              vault operator raft snapshot save /backup/vault_\${TIMESTAMP}.snap

              # Keep only last 30 days
              find /backup -name "vault_*.snap" -mtime +30 -delete

              echo "Vault backup completed"
            env:
            - name: VAULT_TOKEN
              valueFrom:
                secretKeyRef:
                  name: vault-backup-token
                  key: token
            volumeMounts:
            - name: backup
              mountPath: /backup
          restartPolicy: OnFailure
          volumes:
          - name: backup
            persistentVolumeClaim:
              claimName: vault-backup-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vault-backup-pvc
  namespace: vault
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: longhorn-encrypted
EOF

Longhorn Recurring Backup

bash

cat <<EOF | kubectl apply -f -
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: daily-backup
  namespace: longhorn-system
spec:
  cron: "0 3 * * *"
  task: backup
  groups:
    - default
  retain: 14
  concurrency: 2
  labels:
    backup-type: daily
EOF

Backup Verification

Since PVCs are namespace-scoped, verify backups using temporary pods in each namespace:

bash

# Verify etcd backups (in kube-system)
kubectl run etcd-backup-check --rm -it --restart=Never \
  --namespace kube-system \
  --image=busybox \
  --overrides='{"spec":{"volumes":[{"name":"backup","persistentVolumeClaim":{"claimName":"etcd-backup-pvc"}}],"containers":[{"name":"check","image":"busybox","command":["sh","-c","echo \"=== etcd backups ===\"; ls -lh /backup/etcd/etcd_* 2>/dev/null | tail -5"],"volumeMounts":[{"name":"backup","mountPath":"/backup"}]}]}}'

# Verify PostgreSQL backups (in databases namespace)
kubectl run pg-backup-check --rm -it --restart=Never \
  --namespace databases \
  --image=bitnami/postgresql:16 \
  --overrides='{"spec":{"volumes":[{"name":"backup","persistentVolumeClaim":{"claimName":"postgresql-backup-pvc"}}],"containers":[{"name":"check","image":"bitnami/postgresql:16","command":["bash","-c","echo \"=== PostgreSQL backups ===\"; ls -lh /backup/*.sql.gz 2>/dev/null | tail -5; echo \"\"; LATEST=$(ls -t /backup/*.sql.gz 2>/dev/null | head -1); if [ -n \"$LATEST\" ]; then gunzip -t \"$LATEST\" && echo \"Backup integrity: OK\" || echo \"Backup integrity: FAILED\"; fi"],"volumeMounts":[{"name":"backup","mountPath":"/backup"}]}]}}'

# Verify Vault backups (in vault namespace)
kubectl run vault-backup-check --rm -it --restart=Never \
  --namespace vault \
  --image=busybox \
  --overrides='{"spec":{"volumes":[{"name":"backup","persistentVolumeClaim":{"claimName":"vault-backup-pvc"}}],"containers":[{"name":"check","image":"busybox","command":["sh","-c","echo \"=== Vault backups ===\"; ls -lh /backup/*.snap 2>/dev/null | tail -5"],"volumeMounts":[{"name":"backup","mountPath":"/backup"}]}]}}'

Secret Rotation

Automatic Rotation via Vault

Vault can automatically rotate secrets. Configure rotation policies:

bash

# Connect to Vault
kubectl exec -n vault vault-0 -- vault login ${VAULT_ROOT_TOKEN}

# Enable database secrets engine with rotation
kubectl exec -n vault vault-0 -- vault write database/config/postgresql \
  plugin_name=postgresql-database-plugin \
  connection_url="postgresql://{{username}}:{{password}}@postgresql-postgresql-ha-pgpool.databases.svc:5432/postgres" \
  allowed_roles="*" \
  username="postgres" \
  password="${POSTGRES_PASSWORD}"

# Create role with automatic rotation
kubectl exec -n vault vault-0 -- vault write database/roles/mattermost-dynamic \
  db_name=postgresql \
  creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; GRANT ALL PRIVILEGES ON DATABASE mattermost TO \"{{name}}\";" \
  revocation_statements="DROP ROLE IF EXISTS \"{{name}}\";" \
  default_ttl="1h" \
  max_ttl="24h"

Manual Secret Rotation Procedure

Rotate PostgreSQL Passwords

bash

#!/bin/bash
# rotate-postgresql.sh

# Generate new password
NEW_PASSWORD=$(openssl rand -base64 32)
SERVICE=$1  # mattermost, gitea, etc.

# Update Vault
kubectl exec -n vault vault-0 -- vault kv put secret/services/${SERVICE} \
  db-password="${NEW_PASSWORD}"

# Update PostgreSQL user
kubectl exec -n databases postgresql-postgresql-ha-0 -- psql -U postgres -c \
  "ALTER USER ${SERVICE}_user WITH PASSWORD '${NEW_PASSWORD}';"

# Wait for External Secrets to sync (default 1h, can be faster)
kubectl annotate externalsecret ${SERVICE}-db -n ${SERVICE} \
  force-sync=$(date +%s)

# Restart the service to pick up new credentials
kubectl rollout restart deployment/${SERVICE} -n ${SERVICE}

echo "Password rotated for ${SERVICE}"

Rotate Redis Password

bash

#!/bin/bash
# rotate-redis.sh

# Generate new password
NEW_PASSWORD=$(openssl rand -base64 32)

# Update Vault
kubectl exec -n vault vault-0 -- vault kv put secret/infrastructure/redis \
  password="${NEW_PASSWORD}"

# Update Redis (requires restart for Sentinel)
kubectl exec -n databases redis-master-0 -- redis-cli CONFIG SET requirepass "${NEW_PASSWORD}"
kubectl exec -n databases redis-replicas-0 -- redis-cli CONFIG SET requirepass "${NEW_PASSWORD}"
kubectl exec -n databases redis-replicas-1 -- redis-cli CONFIG SET requirepass "${NEW_PASSWORD}"

# Force External Secrets sync
kubectl annotate externalsecret redis-credentials -n databases \
  force-sync=$(date +%s)

# Restart services that use Redis
kubectl rollout restart deployment/authentik -n authentik
kubectl rollout restart deployment/mattermost-team-edition -n mattermost
kubectl rollout restart statefulset/gitea -n gitea

echo "Redis password rotated"

Rotate Vault Unseal Keys

Critical Operation

Rotating Vault unseal keys requires careful planning. All current unseal keys will be invalidated.

bash

# Generate new unseal keys
kubectl exec -n vault vault-0 -- vault operator rekey -init \
  -key-shares=5 \
  -key-threshold=3

# Provide 3 of the current unseal keys to authorize the rekey
kubectl exec -n vault vault-0 -- vault operator rekey \
  -nonce=<nonce-from-init>

# Save new keys securely offline
# Update auto-unseal configuration if used

Rotation Schedule

Secret Type	Rotation Frequency	Method
Database passwords	90 days	Manual + Vault
Redis password	90 days	Manual
JWT/Session secrets	180 days	Vault + restart
TLS certificates	Auto (cert-manager)	Automatic
OIDC client secrets	365 days	Manual
Vault unseal keys	As needed	Manual
LUKS encryption keys	Never (re-encrypt volumes)	N/A

Disaster Recovery

Recovery Procedures

Recover from etcd Backup

bash

# Stop K3s
systemctl stop k3s

# List available snapshots to find the correct filename
ls -la /var/lib/rancher/k3s/server/db/snapshots/

# Restore etcd snapshot (use actual filename from backup, typically etcd_YYYYMMDD_HHMMSS-*)
k3s server --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/etcd_YYYYMMDD_HHMMSS

# Start K3s
systemctl start k3s

# Verify cluster
kubectl get nodes
kubectl get pods -A

Snapshot Format

K3s etcd snapshots are stored as compressed archives (not .db files). Check the actual filename in the snapshots directory before restoring.

Recover PostgreSQL

bash

# Scale down services using PostgreSQL
kubectl scale deployment --all -n mattermost --replicas=0
kubectl scale deployment --all -n gitea --replicas=0
kubectl scale deployment --all -n authentik --replicas=0

# Get admin password
PGPASSWORD=$(kubectl get secret postgresql-credentials -n databases \
  -o jsonpath='{.data.postgres-password}' | base64 -d)

# Restore from backup using a temporary pod with access to backup PVC
kubectl run pg-restore --rm -it --restart=Never \
  --namespace databases \
  --image=bitnami/postgresql:16 \
  --env="PGPASSWORD=${PGPASSWORD}" \
  --overrides='
{
  "spec": {
    "containers": [{
      "name": "pg-restore",
      "image": "bitnami/postgresql:16",
      "command": ["bash", "-c", "gunzip -c /backup/all_databases_YYYYMMDD_HHMMSS.sql.gz | psql -h postgresql-postgresql-ha-pgpool -U postgres"],
      "env": [{"name": "PGPASSWORD", "valueFrom": {"secretKeyRef": {"name": "postgresql-credentials", "key": "postgres-password"}}}],
      "volumeMounts": [{"name": "backup", "mountPath": "/backup"}]
    }],
    "volumes": [{"name": "backup", "persistentVolumeClaim": {"claimName": "postgresql-backup-pvc"}}]
  }
}'

# Scale services back up
kubectl scale deployment --all -n mattermost --replicas=1
kubectl scale deployment --all -n gitea --replicas=1
kubectl scale deployment --all -n authentik --replicas=1

Recover Vault

bash

# Unseal Vault if needed
kubectl exec -n vault vault-0 -- vault operator unseal ${UNSEAL_KEY_1}
kubectl exec -n vault vault-0 -- vault operator unseal ${UNSEAL_KEY_2}
kubectl exec -n vault vault-0 -- vault operator unseal ${UNSEAL_KEY_3}

# Restore from Raft snapshot
kubectl exec -n vault vault-0 -- vault operator raft snapshot restore \
  /backup/vault_YYYYMMDD_HHMMSS.snap

# Verify
kubectl exec -n vault vault-0 -- vault status

Recover Longhorn Volumes

bash

# Via Longhorn UI or CLI
# 1. Navigate to Longhorn UI > Backup
# 2. Select the backup to restore
# 3. Create a new volume from backup
# 4. Update the PVC to use the new volume

# Or via kubectl
kubectl apply -f - <<EOF
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  name: restored-volume
  namespace: longhorn-system
spec:
  fromBackup: "s3://backup-bucket@region/backups/backup-xxx"
  numberOfReplicas: 1
EOF

Recovery Time Objectives

Component	RTO	RPO
Kubernetes Control Plane	30 min	24h
PostgreSQL	1h	24h
Vault	30 min	24h
Application Services	15 min	N/A (stateless)
Longhorn Volumes	2h	24h

Maintenance Tasks

Regular Maintenance Schedule

Task	Frequency	Command
Check node health	Daily	`kubectl get nodes`
Check pod status	Daily	`kubectl get pods -A \| grep -v Running`
Review alerts	Daily	Grafana dashboard
Check certificate expiry	Weekly	`kubectl get certificate -A`
Check Longhorn health	Weekly	Longhorn UI
Review Vault audit logs	Weekly	Vault UI or API
Update system packages	Monthly	`apt update && apt upgrade`
K3s minor updates	Monthly	See below
Helm chart updates	Monthly	See below
Review resource usage	Monthly	Grafana
Test backup restore	Quarterly	Restore to test env
Rotate secrets	As scheduled	See rotation section

K3s Updates

bash

# Check current version
k3s --version

# Check available versions
curl -s https://update.k3s.io/v1-release/channels | jq

# Update K3s (single-node)
curl -sfL https://get.k3s.io | INSTALL_K3S_CHANNEL=stable sh -

# Verify update
k3s --version
kubectl get nodes

Helm Chart Updates

bash

# Update all Helm repos
helm repo update

# Check for updates
helm list -A -o json | jq -r '.[] | "\(.name) \(.namespace) \(.chart)"' | while read name ns chart; do
  current=$(echo $chart | sed 's/.*-//')
  latest=$(helm search repo $(echo $chart | sed 's/-[0-9].*//' ) -o json | jq -r '.[0].version')
  if [ "$current" != "$latest" ]; then
    echo "Update available: $name in $ns: $current -> $latest"
  fi
done

# Update a specific chart
helm upgrade <release> <chart> -n <namespace> --reuse-values

Cleanup Tasks

bash

# Remove completed/failed jobs
kubectl delete jobs --field-selector status.successful=1 -A
kubectl delete jobs --field-selector status.failed=1 -A

# Clean up old ReplicaSets
kubectl get rs -A -o json | jq -r '.items[] | select(.spec.replicas == 0) | "\(.metadata.namespace) \(.metadata.name)"' | \
  xargs -n2 kubectl delete rs -n

# Prune unused images (on node)
crictl rmi --prune

# Clean old Longhorn snapshots
kubectl get snapshot -n longhorn-system -o json | jq -r '.items[] | select(.status.readyToUse == true) | .metadata.name' | \
  xargs -I {} kubectl delete snapshot {} -n longhorn-system

Monitoring Maintenance Health

Create Maintenance Dashboard

bash

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-maintenance
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  maintenance.json: |
    {
      "title": "ATLAS Maintenance Status",
      "uid": "atlas-maintenance",
      "panels": [
        {
          "title": "Backup Age (hours)",
          "type": "stat",
          "gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
          "targets": [{
            "expr": "(time() - max(kube_job_status_completion_time{job_name=~\".*backup.*\"})) / 3600"
          }]
        },
        {
          "title": "Certificate Expiry (days)",
          "type": "stat",
          "gridPos": {"h": 4, "w": 6, "x": 6, "y": 0},
          "targets": [{
            "expr": "min(certmanager_certificate_expiration_timestamp_seconds - time()) / 86400"
          }]
        },
        {
          "title": "Vault Sealed Status",
          "type": "stat",
          "gridPos": {"h": 4, "w": 6, "x": 12, "y": 0},
          "targets": [{
            "expr": "vault_core_unsealed"
          }]
        },
        {
          "title": "Failed Pods",
          "type": "stat",
          "gridPos": {"h": 4, "w": 6, "x": 18, "y": 0},
          "targets": [{
            "expr": "count(kube_pod_status_phase{phase=~\"Failed|Unknown\"})"
          }]
        }
      ]
    }
EOF

Validation Tests

bash

# Check all CronJobs
kubectl get cronjobs -A
# Expected: etcd-backup, postgresql-backup, redis-backup, vault-backup Running

# Check recent backup jobs
kubectl get jobs -A -l app=backup --sort-by=.status.completionTime | tail -10

# Verify backup files exist (run a temporary pod to check)
kubectl run backup-check --rm -it --restart=Never \
  --namespace kube-system \
  --image=busybox \
  --overrides='{"spec":{"volumes":[{"name":"backup","persistentVolumeClaim":{"claimName":"etcd-backup-pvc"}}],"containers":[{"name":"backup-check","image":"busybox","command":["ls","-la","/backup/etcd"],"volumeMounts":[{"name":"backup","mountPath":"/backup"}]}]}}'

# Check Longhorn backup status
kubectl get backups -n longhorn-system

# Test External Secrets sync
kubectl annotate externalsecret mattermost-secrets -n mattermost \
  force-sync=$(date +%s)
kubectl get secret mattermost-secrets -n mattermost -o yaml | grep -c "db-password"

# Check maintenance alerts
kubectl get prometheusrule -n monitoring -o yaml | grep -A5 "BackupFailed"

Expected Results

Task	Status	Verification
etcd backup	Daily at 1 AM	Check job history
PostgreSQL backup	Daily at 2 AM	Check job history
Vault backup	Daily at 2 AM	Check job history
Longhorn backup	Daily at 3 AM	Longhorn UI
Backup verification	Weekly Sunday	Check job logs
Certificate renewal	Automatic	cert-manager status

Operations Runbook

Daily Checks

bash

#!/bin/bash
# daily-check.sh

echo "=== ATLAS Daily Health Check ==="
echo "Date: $(date)"
echo ""

echo "1. Node Status:"
kubectl get nodes
echo ""

echo "2. Pod Issues:"
kubectl get pods -A | grep -v Running | grep -v Completed
echo ""

echo "3. Recent Events:"
kubectl get events -A --sort-by='.lastTimestamp' | tail -20
echo ""

echo "4. Certificate Status:"
kubectl get certificate -A
echo ""

echo "5. Vault Status:"
kubectl exec -n vault vault-0 -- vault status 2>/dev/null | grep -E "Sealed|Version"
echo ""

echo "6. Backup Jobs (last 24h):"
kubectl get jobs -A --sort-by='.status.completionTime' | tail -10
echo ""

echo "=== End of Daily Check ==="

Summary

This operations guide covers:

Backups: Automated daily backups for all critical components
Secret Rotation: Procedures and schedules for rotating credentials
Disaster Recovery: Step-by-step recovery procedures
Maintenance: Regular tasks and update procedures

Automation

Consider automating the daily checks script and sending results to Mattermost or email.

Conclusion

You have now completed the full ATLAS cluster installation:

✅ System Preparation
✅ K3s Core Infrastructure
✅ HashiCorp Vault
✅ Shared Databases
✅ Core Services
✅ DevOps Tools
✅ Monitoring Stack
✅ Security Hardening
✅ Operations & Maintenance

The cluster is now ready for production use with:

Full encryption at rest and in transit
Centralized secrets management
Comprehensive monitoring and alerting
Automated backups
GitOps deployment workflow

Phase 9: Operations & Maintenance ​

Backup Strategy ​

Backup Components ​

Automated Backup Jobs ​

etcd Backup ​

Vault Backup ​

Longhorn Recurring Backup ​

Backup Verification ​

Secret Rotation ​

Automatic Rotation via Vault ​

Manual Secret Rotation Procedure ​

Rotate PostgreSQL Passwords ​

Rotate Redis Password ​

Rotate Vault Unseal Keys ​

Rotation Schedule ​

Disaster Recovery ​

Recovery Procedures ​

Recover from etcd Backup ​

Recover PostgreSQL ​

Recover Vault ​

Recover Longhorn Volumes ​

Recovery Time Objectives ​

Maintenance Tasks ​

Regular Maintenance Schedule ​

K3s Updates ​

Helm Chart Updates ​

Cleanup Tasks ​

Monitoring Maintenance Health ​

Create Maintenance Dashboard ​

Validation Tests ​

Expected Results ​

Operations Runbook ​

Daily Checks ​

Summary ​

Conclusion ​

Phase 9: Operations & Maintenance

Backup Strategy

Backup Components

Automated Backup Jobs

etcd Backup

Vault Backup

Longhorn Recurring Backup

Backup Verification

Secret Rotation

Automatic Rotation via Vault

Manual Secret Rotation Procedure

Rotate PostgreSQL Passwords

Rotate Redis Password

Rotate Vault Unseal Keys

Rotation Schedule

Disaster Recovery

Recovery Procedures

Recover from etcd Backup

Recover PostgreSQL

Recover Vault

Recover Longhorn Volumes

Recovery Time Objectives

Maintenance Tasks

Regular Maintenance Schedule

K3s Updates

Helm Chart Updates

Cleanup Tasks

Monitoring Maintenance Health

Create Maintenance Dashboard

Validation Tests

Expected Results

Operations Runbook

Daily Checks

Summary

Conclusion