
Architecture

System Architecture

graph TB
    subgraph "Control Plane"
        CP1[control0<br/>10.0.50.1]
        CP2[control1<br/>10.0.50.2]
        VIP[Virtual IP<br/>10.0.50.50]
    end

    subgraph "Worker Nodes"
        W1[worker0<br/>10.0.50.10]
        W2[worker1<br/>10.0.50.11]
        W3[worker2<br/>10.0.50.12]
    end

    subgraph "Load Balancer IPs"
        DNS[k8s-gateway<br/>10.0.50.100]
        INT[Internal Gateway<br/>10.0.50.101]
        EXT[External Gateway<br/>10.0.50.102]
    end

    subgraph "GitOps"
        GIT[GitHub Repository]
        FLUX[Flux Controllers]
    end

    CP1 --> VIP
    CP2 --> VIP
    VIP --> W1
    VIP --> W2
    VIP --> W3

    GIT --> FLUX
    FLUX --> W1
    FLUX --> W2
    FLUX --> W3

Configuration Management

Template System

The cluster uses a template-driven approach with makejinja:

cluster.yaml + nodes.yaml
          ↓
      makejinja
          ↓
├── talos/         (Talos configs)
├── kubernetes/    (K8s manifests)
└── bootstrap/     (Helmfile configs)

Important: Never edit generated files directly. Always edit source YAML files and run task configure.
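For orientation, a trimmed cluster.yaml might look roughly like the sketch below. The field names are illustrative only; the authoritative list is defined by the template schema, and the values simply echo the IPs and CIDRs used throughout this page.

# cluster.yaml (illustrative sketch -- field names follow the template's schema,
# not this document; re-run `task configure` after any change)
node_cidr: "10.0.50.0/24"                # assumption: LAN subnet the nodes live on
cluster_api_addr: "10.0.50.50"           # control plane VIP from the diagram above
cluster_pod_cidr: "10.42.0.0/16"         # pod CIDR (see Networking below)
cluster_svc_cidr: "10.43.0.0/16"         # service CIDR (see Networking below)
cluster_dns_gateway_addr: "10.0.50.100"  # k8s-gateway load balancer IP
cluster_ingress_addr: "10.0.50.101"      # internal gateway load balancer IP
cluster_gateway_addr: "10.0.50.102"      # external gateway load balancer IP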

Directory Structure

home-ops/
├── .mise.toml              # Tool versions
├── Taskfile.yaml           # Task definitions
├── cluster.yaml            # Cluster configuration (SOURCE)
├── nodes.yaml              # Node configuration (SOURCE)
├── age.key                 # SOPS encryption key
├── .sops.yaml              # SOPS configuration
├── talos/                  # Talos Linux configs
│   ├── talconfig.yaml      # Generated by makejinja
│   ├── talenv.yaml         # Talos/K8s versions
│   ├── patches/            # Configuration patches
│   │   ├── global/         # Applied to all nodes
│   │   ├── controller/     # Control plane only
│   │   └── worker/         # Worker nodes only
│   └── clusterconfig/      # Generated per-node configs
├── kubernetes/             # Kubernetes manifests
│   ├── apps/               # Application deployments
│   │   ├── cert-manager/
│   │   ├── databases/
│   │   ├── kube-system/
│   │   ├── network/
│   │   └── storage/
│   └── flux/               # Flux configuration
│       ├── cluster/        # Cluster-wide Kustomizations
│       └── meta/           # Flux repositories
├── bootstrap/              # Initial bootstrap
│   └── helmfile.d/         # Helmfile configs
├── templates/              # Jinja2 templates
│   ├── config/             # Main templates
│   └── scripts/            # Template plugins
└── .taskfiles/             # Task implementations
    ├── bootstrap/
    ├── talos/
    └── template/

GitOps Workflow

Flux Architecture

graph LR
    A[Git Push] --> B[GitHub]
    B --> C[Flux Source Controller]
    C --> D[Flux Kustomization Controller]
    D --> E[cluster-meta]
    D --> F[cluster-apps]
    E --> G[OCI Repositories]
    F --> H[Application Kustomizations]
    H --> I[HelmReleases]

Reconciliation Flow

  1. Developer pushes to the Git repository
  2. Flux polls the repository every hour (or a webhook triggers reconciliation immediately)
  3. Source Controller pulls the latest changes
  4. Kustomization Controller applies manifests in dependency order (see the sketch below):
     • cluster-meta first (Flux repositories, dependencies)
     • cluster-apps second (all applications)
  5. Helm Controller installs/upgrades HelmReleases
  6. Notification on success/failure (if configured)
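The ordering in step 4 is expressed with Flux's dependsOn field. A minimal sketch of the two top-level Kustomizations, assuming the repository paths shown in the directory structure above and a default flux-system GitRepository source, might look like this:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cluster-meta
  namespace: flux-system
spec:
  interval: 1h
  path: ./kubernetes/flux/meta
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cluster-apps
  namespace: flux-system
spec:
  interval: 1h
  path: ./kubernetes/apps
  prune: true
  dependsOn:
    - name: cluster-meta        # apps only reconcile after cluster-meta is ready
  sourceRef:
    kind: GitRepository
    name: flux-system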

Networking

Pod Network (Cilium)

  • CNI: Cilium (native routing mode)
  • Pod CIDR: 10.42.0.0/16
  • Service CIDR: 10.43.0.0/16
  • Gateway API: Enabled
  • Hubble: Available for network observability
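A minimal sketch of Cilium Helm values matching the settings above (the value names come from the upstream Cilium chart, but exact keys can shift between chart releases, so treat this as illustrative):

ipam:
  mode: kubernetes
routingMode: native                   # native routing instead of tunnel encapsulation
ipv4NativeRoutingCIDR: 10.42.0.0/16   # pod CIDR listed above
kubeProxyReplacement: true
gatewayAPI:
  enabled: true                       # Cilium implements the Gateway API
hubble:
  enabled: true
  relay:
    enabled: true
  ui:
    enabled: true                     # optional Hubble UI for network observability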

Ingress Architecture

graph TB
    Internet[Internet] --> CF[Cloudflare Tunnel<br/>10.0.50.102]
    LAN[Home Network] --> INT[Internal Gateway<br/>10.0.50.101]

    CF --> EXT_GW[External Gateway]
    INT --> INT_GW[Internal Gateway]

    EXT_GW --> SVC1[Service A]
    EXT_GW --> SVC2[Service B]
    INT_GW --> SVC3[Service C]
    INT_GW --> SVC4[Service D]

Gateway Selection:

  • Use the external gateway for public internet access (via Cloudflare Tunnel)
  • Use the internal gateway for access from the home network only (see the HTTPRoute sketch below for how a route selects its gateway)
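An application chooses its gateway through the parentRefs of its HTTPRoute. The route below is a hypothetical example; the gateway names (internal/external) and their namespace are assumptions used only to illustrate the selection:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-app                 # hypothetical application
  namespace: default
spec:
  parentRefs:
    - name: internal           # switch to "external" to publish via Cloudflare Tunnel
      namespace: network       # assumption: namespace holding the Gateway objects
  hostnames:
    - my-app.tosih.org
  rules:
    - backendRefs:
        - name: my-app         # the application's Service
          port: 80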

DNS Flow

graph LR
    A[Client Query] --> B{DNS Server}
    B -->|*.tosih.org| C[k8s-gateway<br/>10.0.50.100]
    B -->|Other| D[Upstream DNS]
    C --> E[Gateway API Resources]
    E --> F[Service IPs]

Split DNS Setup Required:

Configure your home DNS server to forward *.yourdomain.com to the k8s-gateway IP (10.0.50.100).
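The exact configuration depends on the DNS server you run at home. As one illustration, if that server were AdGuard Home, a per-domain upstream entry would cover it (shown here for the tosih.org zone used in the diagram above); dnsmasq, Unbound, and Pi-hole have equivalent conditional-forwarding options:

# AdGuardHome.yaml (excerpt) -- illustrative only; adapt to your own DNS server
dns:
  upstream_dns:
    - '[/tosih.org/]10.0.50.100'   # send *.tosih.org queries to k8s-gateway
    - 1.1.1.1                      # everything else goes to the normal upstream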

Storage

Storage Providers

Provider          Type          Use Case
Rook-Ceph         Distributed   Persistent volumes with replication
ZFS Provisioner   Local         High-performance local storage
emptyDir          Ephemeral     Temporary pod storage
hostPath          Node-local    Node-specific persistent data

Ceph Architecture (when configured)

graph TB
    subgraph "Worker Nodes"
        W1[worker0<br/>nvme0n1p5: 100GB]
        W2[worker1<br/>nvme0n1p5: 100GB]
        W3[worker2<br/>nvme0n1p5: 100GB]
    end

    subgraph "Ceph Cluster"
        MON1[Monitor]
        MON2[Monitor]
        MON3[Monitor]
        OSD1[OSD]
        OSD2[OSD]
        OSD3[OSD]
        MGR[Manager]
    end

    W1 --> OSD1
    W2 --> OSD2
    W3 --> OSD3

    OSD1 --> POOL[ceph-blockpool<br/>3x replication]
    OSD2 --> POOL
    OSD3 --> POOL

    POOL --> SC1[ceph-block<br/>StorageClass]
    POOL --> SC2[ceph-filesystem<br/>StorageClass]
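The ceph-blockpool and ceph-block objects in the diagram correspond roughly to a Rook CephBlockPool and its StorageClass. A trimmed sketch follows; the rook-ceph namespace/clusterID are assumptions, and a real StorageClass also needs the CSI secret parameters that Rook's examples include:

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: ceph-blockpool
  namespace: rook-ceph
spec:
  failureDomain: host          # place each replica on a different worker
  replicated:
    size: 3                    # 3x replication, one copy per worker
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com   # <operator namespace>.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph                    # assumption: default Rook namespace
  pool: ceph-blockpool
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete
allowVolumeExpansion: true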

Security

Secret Management

graph LR
    A[Secret in Git] --> B[SOPS Encrypted]
    B --> C[age Key]
    C --> D[Flux Decrypts]
    D --> E[K8s Secret]
    E --> F[Pod]

Encryption Flow:

  1. Developer creates secret YAML
  2. task configure encrypts with SOPS + age
  3. Encrypted secret pushed to Git
  4. Flux reads age key from cluster
  5. Flux decrypts and creates Kubernetes Secret
  6. Pods consume secret as env vars or files
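Two pieces of configuration drive this flow: the SOPS creation rules in .sops.yaml and the decryption block on the Flux Kustomizations. A minimal sketch, with the age public key, path regex, and secret name as placeholders:

# .sops.yaml -- tells sops which files to encrypt and with which age key
creation_rules:
  - path_regex: kubernetes/.*\.sops\.ya?ml
    encrypted_regex: "^(data|stringData)$"   # only encrypt Secret payload fields
    age: age1examplepublickeyplaceholder     # placeholder: your age public key

# Added to each Flux Kustomization spec so Flux can decrypt at apply time
spec:
  decryption:
    provider: sops
    secretRef:
      name: sops-age        # assumption: Secret in flux-system created from age.key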

Access Control

  • Talos: API-only access (no SSH)
  • Kubernetes: RBAC enabled
  • Secrets: SOPS encrypted in Git
  • External Access: Cloudflare Tunnel with authentication

Update Strategy

Component Updates

Component          Update Method                                     Automation
Helm Charts        Renovate PR → Merge → Flux                        Automated
Container Images   Renovate PR → Merge → Flux                        Automated
Kubernetes         Edit talenv.yaml, then task talos:upgrade-k8s     Manual
Talos              Edit talenv.yaml, then task talos:upgrade-node    Manual
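For the two manual rows, the flow is to bump the version pins in talos/talenv.yaml and then run the matching task. The keys below follow the common talenv.yaml convention; check the file in this repository for the exact names, and treat the version numbers as placeholders:

# talos/talenv.yaml -- version pins (values illustrative)
talosVersion: v1.9.0          # bump, then run: task talos:upgrade-node
kubernetesVersion: v1.32.0    # bump, then run: task talos:upgrade-k8s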

Renovate Workflow

graph LR
    A[Renovate Detects Update] --> B[Creates PR]
    B --> C[CI Validates]
    C --> D[Developer Reviews]
    D --> E[Merge]
    E --> F[Flux Applies]

High Availability

Control Plane HA

  • 2 control plane nodes sharing a VIP (10.0.50.50)
  • etcd quorum: 2 of 2 members required (a majority of two is two), so etcd cannot tolerate the loss of either control plane node
  • API server: load balanced via the VIP

Workload HA

  • Multi-replica deployments across 3 workers
  • Pod anti-affinity for critical apps
  • PodDisruptionBudgets for graceful updates
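A minimal sketch of what this looks like for a hypothetical critical-app Deployment (all names, labels, and the image are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: critical-app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: critical-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app.kubernetes.io/name: critical-app
              topologyKey: kubernetes.io/hostname   # no two replicas on one node
      containers:
        - name: app
          image: ghcr.io/example/critical-app:1.0.0   # placeholder image
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-app
spec:
  minAvailable: 1             # keep at least one replica up during drains and updates
  selector:
    matchLabels:
      app.kubernetes.io/name: critical-app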

Monitoring Points

Recommended monitoring (not included by default):

  • Node metrics: CPU, memory, disk, network
  • Cilium: Network flows via Hubble
  • Ceph: Cluster health, OSD status
  • Flux: Reconciliation status
  • Application: Custom metrics via ServiceMonitor
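If you do deploy a Prometheus stack, application metrics are typically scraped via ServiceMonitor objects along these lines (the app name and port are placeholders):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                    # hypothetical app exposing Prometheus metrics
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app
  endpoints:
    - port: metrics               # named port on the app's Service
      interval: 30s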

Disaster Recovery

Backup Strategy

Critical items to backup:

  • age.key - Cannot decrypt secrets without this
  • cluster.yaml and nodes.yaml - Source configuration
  • Git repository - Everything else can be recovered from here

Data backups:

  • Persistent volumes (Ceph) - Use Velero or similar
  • Application data - Application-specific backup tools

Recovery Procedure

  1. Reinstall Talos on nodes
  2. Restore age.key
  3. task bootstrap:talos
  4. task bootstrap:apps
  5. Flux restores everything from Git
  6. Restore persistent data from backups