Architecture

System Architecture

graph TB
    subgraph "Control Plane"
        CP1[control0<br/>10.0.50.1]
        CP2[control1<br/>10.0.50.2]
        VIP[Virtual IP<br/>10.0.50.50]
    end

    subgraph "Worker Nodes"
        W1[worker0<br/>10.0.50.10]
        W2[worker1<br/>10.0.50.11]
        W3[worker2<br/>10.0.50.12]
    end

    subgraph "Load Balancer IPs"
        DNS[k8s-gateway<br/>10.0.50.100]
        INT[Internal Gateway<br/>10.0.50.101]
        EXT[External Gateway<br/>10.0.50.102]
    end

    subgraph "GitOps"
        GIT[GitHub Repository]
        FLUX[Flux Controllers]
    end

    CP1 --> VIP
    CP2 --> VIP
    VIP --> W1
    VIP --> W2
    VIP --> W3

    GIT --> FLUX
    FLUX --> W1
    FLUX --> W2
    FLUX --> W3

Configuration Management

Template System

The cluster uses a template-driven approach with makejinja:

cluster.yaml + nodes.yaml
    ↓ makejinja
├── talos/         (Talos configs)
├── kubernetes/    (K8s manifests)
└── bootstrap/     (Helmfile configs)

Important: Never edit generated files directly. Always edit source YAML files and run task configure.
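
To illustrate the template-driven idea, the sketch below renders one output snippet per node from a small in-memory node list. It uses Python's string.Template as a stand-in for Jinja2 so that it runs without extra packages. The NODES structure, the key names, and the template text are hypothetical and do not reflect the actual nodes.yaml schema or the templates in this repository.

```python
from string import Template

# Hypothetical stand-in for data that would be loaded from nodes.yaml.
NODES = [
    {"name": "control0", "address": "10.0.50.1", "controller": True},
    {"name": "worker0", "address": "10.0.50.10", "controller": False},
]

# Hypothetical template text; the real templates use Jinja2 syntax.
NODE_TEMPLATE = Template(
    "- hostname: $name\n"
    "  ipAddress: $address\n"
    "  controlPlane: $controller\n"
)


def render_nodes(nodes):
    """Render one snippet per node and join them into a single document."""
    parts = []
    for node in nodes:
        parts.append(
            NODE_TEMPLATE.substitute(
                name=node["name"],
                address=node["address"],
                controller=str(node["controller"]).lower(),
            )
        )
    return "".join(parts)


if __name__ == "__main__":
    print(render_nodes(NODES), end="")
```

The point of the pattern is that the node list is the only place a value is typed by hand; every generated file is a function of it, which is why edits belong in the source YAML.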

Directory Structure

home-ops/
├── .mise.toml              # Tool versions
├── Taskfile.yaml           # Task definitions
├── cluster.yaml            # Cluster configuration (SOURCE)
├── nodes.yaml              # Node configuration (SOURCE)
├── age.key                 # SOPS encryption key
├── .sops.yaml              # SOPS configuration
├── talos/                  # Talos Linux configs
│   ├── talconfig.yaml      # Generated by makejinja
│   ├── talenv.yaml         # Talos/K8s versions
│   ├── patches/            # Configuration patches
│   │   ├── global/         # Applied to all nodes
│   │   ├── controller/     # Control plane only
│   │   └── worker/         # Worker nodes only
│   └── clusterconfig/      # Generated per-node configs
├── kubernetes/             # Kubernetes manifests
│   ├── apps/               # Application deployments
│   │   ├── cert-manager/
│   │   ├── databases/
│   │   ├── kube-system/
│   │   ├── network/
│   │   └── storage/
│   └── flux/               # Flux configuration
│       ├── cluster/        # Cluster-wide Kustomizations
│       └── meta/           # Flux repositories
├── bootstrap/              # Initial bootstrap
│   └── helmfile.d/         # Helmfile configs
├── templates/              # Jinja2 templates
│   ├── config/             # Main templates
│   └── scripts/            # Template plugins
└── .taskfiles/             # Task implementations
    ├── bootstrap/
    ├── talos/
    └── template/

GitOps Workflow

Flux Architecture

graph LR
    A[Git Push] --> B[GitHub]
    B --> C[Flux Source Controller]
    C --> D[Flux Kustomization Controller]
    D --> E[cluster-meta]
    D --> F[cluster-apps]
    E --> G[OCI Repositories]
    F --> H[Application Kustomizations]
    H --> I[HelmReleases]

Reconciliation Flow

  1. Developer pushes to Git repository
  2. Flux polls every hour (or a webhook triggers reconciliation immediately)
  3. Source Controller pulls latest changes
  4. Kustomization Controller applies manifests:
       • cluster-meta first (Flux repos, dependencies)
       • cluster-apps second (all applications)
  5. Helm Controller installs/upgrades HelmReleases
  6. Notification on success/failure (if configured)
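
The ordering in the flow above (cluster-meta before cluster-apps, with Helm releases last) is a dependency ordering. A minimal sketch of that idea, using Python's graphlib, is shown here. The DEPENDS_ON mapping is illustrative: only the cluster-meta and cluster-apps names come from this document, and "helm-releases" is a hypothetical label for the final step, not a Flux object name.

```python
from graphlib import TopologicalSorter

# Each key lists the items that must be applied before it.
# "helm-releases" is a hypothetical label for the Helm Controller step.
DEPENDS_ON = {
    "cluster-meta": set(),
    "cluster-apps": {"cluster-meta"},
    "helm-releases": {"cluster-apps"},
}


def apply_order(depends_on):
    """Return an order in which every item follows its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for step in apply_order(DEPENDS_ON):
        print(step)
```

Flux itself expresses this with dependency declarations on its own resources; the sketch only shows why a change to shared repositories lands before the applications that use them.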

Networking

Pod Network (Cilium)

  • CNI: Cilium (native routing mode)
  • Pod CIDR: 10.42.0.0/16
  • Service CIDR: 10.43.0.0/16
  • Gateway API: Enabled
  • Hubble: Available for network observability
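
With native routing, the pod and service ranges must not collide with each other or with the node network. The sketch below checks that using the standard ipaddress module. The pod and service CIDRs come from the list above; the node subnet 10.0.50.0/24 is an assumption inferred from the node addresses, since the prefix length is not stated in this document.

```python
import ipaddress
from itertools import combinations

NETWORKS = {
    # Assumed prefix length; only individual 10.0.50.x addresses are documented.
    "nodes": ipaddress.ip_network("10.0.50.0/24"),
    "pods": ipaddress.ip_network("10.42.0.0/16"),
    "services": ipaddress.ip_network("10.43.0.0/16"),
}


def overlapping(networks):
    """Return pairs of network names whose address ranges overlap."""
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(sorted(networks.items()), 2)
        if net_a.overlaps(net_b)
    ]


if __name__ == "__main__":
    clashes = overlapping(NETWORKS)
    print("overlaps:", clashes if clashes else "none")
```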

Ingress Architecture

graph TB
    Internet[Internet] --> CF[Cloudflare Tunnel<br/>10.0.50.102]
    LAN[Home Network] --> INT[Internal Gateway<br/>10.0.50.101]

    CF --> EXT_GW[External Gateway]
    INT --> INT_GW[Internal Gateway]

    EXT_GW --> SVC1[Service A]
    EXT_GW --> SVC2[Service B]
    INT_GW --> SVC3[Service C]
    INT_GW --> SVC4[Service D]

Gateway Selection:

  • Use the external gateway for public internet access (via Cloudflare Tunnel)
  • Use the internal gateway for access from the home network only

DNS Flow

graph LR
    A[Client Query] --> B{DNS Server}
    B -->|*.tosih.org| C[k8s-gateway<br/>10.0.50.100]
    B -->|Other| D[Upstream DNS]
    C --> E[Gateway API Resources]
    E --> F[Service IPs]

Split DNS Setup Required:

Configure your home DNS server to forward queries for your cluster domain (*.tosih.org in this deployment) to the k8s-gateway IP (10.0.50.100).
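
The forwarding decision in the DNS flow can be sketched as a suffix match. This is only an illustration of the rule a home DNS server would apply, not a configuration for any particular server. The domain and the k8s-gateway address are taken from the diagram above; UPSTREAM is a placeholder label rather than a real resolver address.

```python
CLUSTER_DOMAIN = "tosih.org"
K8S_GATEWAY = "10.0.50.100"
UPSTREAM = "upstream"  # placeholder for whatever upstream resolver is in use


def resolver_for(query_name):
    """Pick the resolver for a query: k8s-gateway for *.tosih.org, else upstream."""
    name = query_name.rstrip(".").lower()
    # Match subdomains only; requiring the leading dot avoids matching
    # unrelated names that merely end in the same characters.
    if name.endswith("." + CLUSTER_DOMAIN):
        return K8S_GATEWAY
    return UPSTREAM


if __name__ == "__main__":
    for name in ("immich.tosih.org", "example.com"):
        print(name, "->", resolver_for(name))
```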

Storage

Storage Providers

| Provider | Type | StorageClass | Use Case |
|---|---|---|---|
| Rook-Ceph Block | RBD | ceph-block (default) | High-performance block storage for databases |
| Rook-Ceph Filesystem | CephFS | ceph-filesystem | Shared filesystem storage (ReadWriteMany) |
| Rook-Ceph Object | S3 | ceph-bucket | Object storage buckets |
| ZFS NFS | NFS | zfs-nfs | NFS-backed persistent volumes |
| Local HostPath | Node-local | local-hostpath | Node-specific persistent data |
| emptyDir | Ephemeral | N/A | Temporary pod storage |

Database Services

| Service | Type | Purpose |
|---|---|---|
| CloudNativePG | PostgreSQL Operator | Manages PostgreSQL clusters for applications |
| Dragonfly Operator | Redis Alternative | High-performance in-memory datastore |
| External Postgres Operator | External DB Management | Manages external PostgreSQL instances |
| VerneMQ | MQTT Broker | Message broker for IoT devices (Home Assistant) |

Security

Secret Management

graph LR
    A[Secret in Git] --> B[SOPS Encrypted]
    B --> C[age Key]
    C --> D[Flux Decrypts]
    D --> E[K8s Secret]
    E --> F[Pod]

Encryption Flow:

  1. Developer creates secret YAML
  2. task configure encrypts with SOPS + age
  3. Encrypted secret pushed to Git
  4. Flux reads age key from cluster
  5. Flux decrypts and creates Kubernetes Secret
  6. Pods consume secret as env vars or files
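
Because step 3 pushes the file to Git, it can be worth confirming that a secret is actually encrypted before committing it. The sketch below is a text-level heuristic, not a validation of the SOPS format: it looks for the top-level sops metadata key and at least one ENC[AES256_GCM,...] value, which is how SOPS marks encrypted fields. The sample documents and their values are made up for illustration.

```python
import re

# SOPS replaces each encrypted value with a marker of this general shape.
ENC_VALUE = re.compile(r"ENC\[AES256_GCM,data:[^\]]*\]")


def looks_sops_encrypted(text):
    """Heuristic: the file has SOPS metadata and at least one encrypted value."""
    has_metadata = any(line.startswith("sops:") for line in text.splitlines())
    return has_metadata and ENC_VALUE.search(text) is not None


# Made-up sample documents; the values are placeholders, not real ciphertext.
PLAINTEXT = "apiVersion: v1\nkind: Secret\nstringData:\n  password: example\n"
ENCRYPTED = (
    "apiVersion: v1\nkind: Secret\nstringData:\n"
    "  password: ENC[AES256_GCM,data:abc=,iv:def=,tag:ghi=,type:str]\n"
    "sops:\n  version: placeholder\n"
)

if __name__ == "__main__":
    print("plaintext sample encrypted?", looks_sops_encrypted(PLAINTEXT))
    print("encrypted sample encrypted?", looks_sops_encrypted(ENCRYPTED))
```

A check like this could run from a pre-commit hook over secret files; the file naming convention to match on is left out because it is not stated here.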

Access Control

  • Talos: API-only access (no SSH)
  • Kubernetes: RBAC enabled
  • Secrets: SOPS encrypted in Git + External Secrets from 1Password
  • Authentication: Pocket ID OIDC provider for SSO
  • Gateways:
      • External (10.0.50.102) - Public apps with authentication
      • Internal (10.0.50.101) - Home network only (network-protected)

Update Strategy

Component Updates

| Component | Update Method | Automation |
|---|---|---|
| Helm Charts | Renovate PR → Merge → Flux | Automated |
| Container Images | Renovate PR → Merge → Flux | Automated |
| Kubernetes | Manual: edit talenv.yaml, then run task talos:upgrade-k8s | Manual |
| Talos | Manual: edit talenv.yaml, then run task talos:upgrade-node | Manual |

Renovate Workflow

graph LR
    A[Renovate Detects Update] --> B[Creates PR]
    B --> C[CI Validates]
    C --> D[Developer Reviews]
    D --> E[Merge]
    E --> F[Flux Applies]

High Availability

Control Plane HA

  • 2 control plane nodes with shared VIP (10.0.50.50)
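  • etcd quorum: 2 of 2 members required (floor(n/2) + 1), so etcd accepts writes only while both control plane nodes are up; tolerating a node failure requires 3 members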
  • API server: Load balanced via VIP
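
The quorum figure above follows from the majority rule etcd uses. A short sketch of the arithmetic, with hypothetical function names, shows how many member failures a given control plane size can absorb:

```python
def etcd_quorum(members):
    """Smallest majority of an etcd cluster with the given member count."""
    if members < 1:
        raise ValueError("an etcd cluster needs at least one member")
    return members // 2 + 1


def failures_tolerated(members):
    """Members that can be lost while a majority still remains."""
    return members - etcd_quorum(members)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"members={n} quorum={etcd_quorum(n)} tolerated={failures_tolerated(n)}")
```

For two members the quorum is two, so no failure is tolerated; three members keep a quorum of two and tolerate one failure.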

Workload HA

  • Multi-replica deployments across 3 workers
  • Pod anti-affinity for critical apps
  • PodDisruptionBudgets for graceful updates
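
As a rough illustration of how a PodDisruptionBudget limits voluntary evictions during node drains, the sketch below computes the evictions allowed at a given moment. It assumes the budget is expressed as an integer minAvailable; percentage values and maxUnavailable are not modeled. The replica counts are examples, not values from this cluster.

```python
def allowed_disruptions(healthy_replicas, min_available):
    """Voluntary evictions permitted right now for an integer minAvailable."""
    return max(0, healthy_replicas - min_available)


if __name__ == "__main__":
    # Example: 3 replicas spread across the 3 workers, budget of 2.
    print(allowed_disruptions(3, 2))
```

With three healthy replicas and a budget of two, one pod may be evicted at a time, which lets a node upgrade proceed worker by worker.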

Deployed Applications

Production Workloads

| Application | Namespace | Gateway | Purpose |
|---|---|---|---|
| Cloud Services | | | |
| Immich | cloud | external | Photo & video backup (OIDC-enabled) |
| ImmichFrame | cloud | internal | Digital photo frame for Immich |
| Memos | cloud | internal | Note-taking service |
| Romm | cloud | internal | ROM manager for retro gaming |
| Syncthing | cloud | internal | File synchronization |
| Media Automation | | | |
| Plex | media | internal | Media streaming server |
| Jellyseerr | media | internal | Media request management |
| Sonarr | media | internal | TV show automation |
| Radarr | media | internal | Movie automation |
| Lidarr | media | internal | Music automation |
| Readarr | media | internal | eBook & audiobook automation |
| Prowlarr | media | internal | Indexer management |
| Recyclarr | media | internal | TRaSH guide automation |
| qBittorrent | media | internal | Torrent download client |
| NZBGet | media | internal | Usenet download client |
| Audiobookshelf | media | internal | Audiobook & podcast server |
| Beets | media | internal | Music library manager |
| Home Automation | | | |
| Home Assistant | home | internal | Home automation platform |
| Homebridge | home | internal | HomeKit bridge |
| AirConnect | home | internal | AirPlay to UPnP/Sonos bridge |
| Eufy Security WS | home | internal | Eufy camera integration |
| Infrastructure | | | |
| Homepage | default | internal | Application dashboard |
| Uptime Kuma | default | internal | Uptime monitoring |
| Echo | default | internal | HTTP echo server for testing |
| Network Services | | | |
| AdGuard Home | network | internal | DNS server & ad blocking |
| k8s-gateway | network | internal | Internal DNS for *.tosih.org |
| Cloudflare Tunnel | network | external | Secure external access via Cloudflare |
| Cloudflare DNS | network | external | DNS record automation |
| Security & Authentication | | | |
| Pocket ID | security | external | OIDC identity provider (SSO) |
| External Secrets | security | internal | 1Password secret integration |
| OnePassword Connect | security | internal | 1Password API server |
| Storage & Databases | | | |
| Rook-Ceph Dashboard | rook-ceph | internal | Storage cluster management |
| ZFS Provisioner | kubernetes-zfs-provisioner | internal | Local ZFS storage provisioning |
| CloudNativePG | databases | internal | PostgreSQL operator (3 databases) |
| Dragonfly | databases | internal | Redis-compatible in-memory datastore |
| External Postgres Operator | databases | internal | External PostgreSQL management |
| VerneMQ | databases | internal | MQTT message broker for IoT |

Total: 40+ applications across 10 namespaces

Infrastructure Services

  • Flux - GitOps continuous delivery (flux-operator, flux-instance)
  • Cilium - CNI and Gateway API
  • Rook-Ceph - Distributed storage (operator + cluster)
  • CloudNativePG - PostgreSQL operator (3 databases)
  • Dragonfly Operator - Redis-compatible in-memory datastore
  • External Postgres Operator - External PostgreSQL management
  • VerneMQ - MQTT message broker for IoT
  • External Secrets - 1Password integration
  • Cert-Manager - TLS certificate management
  • k8s-gateway - Internal DNS resolution
  • Cloudflare Tunnel - Secure external access via Cloudflare
  • Cloudflare DNS - DNS record management
  • CoreDNS - Cluster DNS service
  • Reloader - Auto-reload on ConfigMap/Secret changes
  • Spegel - Distributed container image cache
  • Metrics Server - Resource metrics API
  • Descheduler - Pod rescheduling optimization
  • Goldilocks - Resource recommendation engine
  • VolSync - Persistent volume replication and backup
  • ZFS Provisioner - Local ZFS storage provisioning

Monitoring Points

Recommended monitoring (not included by default):

  • Node metrics: CPU, memory, disk, network
  • Cilium: Network flows via Hubble
  • Flux: Reconciliation status
  • Application: Custom metrics via ServiceMonitor
  • Ceph: Storage health and capacity

Disaster Recovery

Backup Strategy

Critical items to back up:

  • age.key - Cannot decrypt secrets without this
  • cluster.yaml and nodes.yaml - Source configuration
  • Git repository - Everything else can be recovered from here

Data backups:

  • Persistent volumes - Use Velero or application-specific backup tools
  • Application data - Application-specific backup tools

Recovery Procedure

  1. Reinstall Talos on nodes
  2. Restore age.key
  3. task bootstrap:talos
  4. task bootstrap:apps
  5. Flux restores everything from Git
  6. Restore persistent data from backups
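
The procedure above depends on the source files being back in place before any bootstrap task runs. A minimal preflight sketch, assuming the files sit at the repository root as shown in the directory structure, is given here. The function names are hypothetical, and the sketch only returns the task commands from steps 3 and 4 as strings; it does not execute them.

```python
from pathlib import Path

# Files listed under "Critical items" that must exist before bootstrapping.
REQUIRED_FILES = ["age.key", "cluster.yaml", "nodes.yaml"]

# Commands from steps 3 and 4 of the recovery procedure, in order.
RECOVERY_COMMANDS = [["task", "bootstrap:talos"], ["task", "bootstrap:apps"]]


def missing_files(repo_root):
    """Return the required files that are absent from the repository root."""
    root = Path(repo_root)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


def recovery_plan(repo_root):
    """Return the bootstrap commands, or fail if a required file is missing."""
    missing = missing_files(repo_root)
    if missing:
        raise FileNotFoundError("restore before bootstrapping: " + ", ".join(missing))
    return [" ".join(command) for command in RECOVERY_COMMANDS]


if __name__ == "__main__":
    try:
        for command in recovery_plan("."):
            print(command)
    except FileNotFoundError as error:
        print(error)
```

Failing early on a missing age.key matters most, since Flux cannot decrypt any secret in the repository without it.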