# Architecture

## System Architecture
```mermaid
graph TB
    subgraph "Control Plane"
        CP1[control0<br/>10.0.50.1]
        CP2[control1<br/>10.0.50.2]
        VIP[Virtual IP<br/>10.0.50.50]
    end
    subgraph "Worker Nodes"
        W1[worker0<br/>10.0.50.10]
        W2[worker1<br/>10.0.50.11]
        W3[worker2<br/>10.0.50.12]
    end
    subgraph "Load Balancer IPs"
        DNS[k8s-gateway<br/>10.0.50.100]
        INT[Internal Gateway<br/>10.0.50.101]
        EXT[External Gateway<br/>10.0.50.102]
    end
    subgraph "GitOps"
        GIT[GitHub Repository]
        FLUX[Flux Controllers]
    end
    CP1 --> VIP
    CP2 --> VIP
    VIP --> W1
    VIP --> W2
    VIP --> W3
    GIT --> FLUX
    FLUX --> W1
    FLUX --> W2
    FLUX --> W3
```
## Configuration Management

### Template System

The cluster uses a template-driven approach with makejinja:

```
cluster.yaml + nodes.yaml
          ↓
      makejinja
          ↓
├── talos/       (Talos configs)
├── kubernetes/  (K8s manifests)
└── bootstrap/   (Helmfile configs)
```
**Important:** Never edit generated files directly. Always edit the source YAML files and run `task configure`.
### Directory Structure

```
home-ops/
├── .mise.toml           # Tool versions
├── Taskfile.yaml        # Task definitions
├── cluster.yaml         # Cluster configuration (SOURCE)
├── nodes.yaml           # Node configuration (SOURCE)
├── age.key              # SOPS encryption key
├── .sops.yaml           # SOPS configuration
│
├── talos/               # Talos Linux configs
│   ├── talconfig.yaml   # Generated by makejinja
│   ├── talenv.yaml      # Talos/K8s versions
│   ├── patches/         # Configuration patches
│   │   ├── global/      # Applied to all nodes
│   │   ├── controller/  # Control plane only
│   │   └── worker/      # Worker nodes only
│   └── clusterconfig/   # Generated per-node configs
│
├── kubernetes/          # Kubernetes manifests
│   ├── apps/            # Application deployments
│   │   ├── cert-manager/
│   │   ├── databases/
│   │   ├── kube-system/
│   │   ├── network/
│   │   └── storage/
│   └── flux/            # Flux configuration
│       ├── cluster/     # Cluster-wide Kustomizations
│       └── meta/        # Flux repositories
│
├── bootstrap/           # Initial bootstrap
│   └── helmfile.d/      # Helmfile configs
│
├── templates/           # Jinja2 templates
│   ├── config/          # Main templates
│   └── scripts/         # Template plugins
│
└── .taskfiles/          # Task implementations
    ├── bootstrap/
    ├── talos/
    └── template/
```
## GitOps Workflow

### Flux Architecture

```mermaid
graph LR
    A[Git Push] --> B[GitHub]
    B --> C[Flux Source Controller]
    C --> D[Flux Kustomization Controller]
    D --> E[cluster-meta]
    D --> F[cluster-apps]
    E --> G[OCI Repositories]
    F --> H[Application Kustomizations]
    H --> I[HelmReleases]
```
### Reconciliation Flow

- Developer pushes to the Git repository
- Flux polls every hour (or a webhook triggers reconciliation immediately)
- The Source Controller pulls the latest changes
- The Kustomization Controller applies manifests:
    - `cluster-meta` first (Flux repos, dependencies)
    - `cluster-apps` second (all applications)
- The Helm Controller installs/upgrades HelmReleases
- Notification on success/failure (if configured)
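To trigger this flow without waiting for the poll interval, the stock Flux CLI can be run against the cluster; the Kustomization names `cluster-meta` and `cluster-apps` are the ones described above:

```shell
# Pull the latest commit and re-apply the top-level Kustomizations
flux reconcile kustomization cluster-meta --with-source
flux reconcile kustomization cluster-apps

# Inspect reconciliation status
flux get kustomizations
flux get helmreleases --all-namespaces
```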
## Networking

### Pod Network (Cilium)
- CNI: Cilium (native routing mode)
- Pod CIDR: 10.42.0.0/16
- Service CIDR: 10.43.0.0/16
- Gateway API: Enabled
- Hubble: Available for network observability
### Ingress Architecture

```mermaid
graph TB
    Internet[Internet] --> CF[Cloudflare Tunnel<br/>10.0.50.102]
    LAN[Home Network] --> INT[Internal Gateway<br/>10.0.50.101]
    CF --> EXT_GW[External Gateway]
    INT --> INT_GW[Internal Gateway]
    EXT_GW --> SVC1[Service A]
    EXT_GW --> SVC2[Service B]
    INT_GW --> SVC3[Service C]
    INT_GW --> SVC4[Service D]
```
**Gateway Selection:**

- Use the `external` gateway for public internet access (via Cloudflare Tunnel)
- Use the `internal` gateway for home-network-only access
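Since Gateway API is enabled, which gateway an app uses comes down to its route's `parentRefs`. A minimal sketch, assuming the gateways live in the `network` namespace; the route, hostname, and Service names are hypothetical:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-app            # hypothetical app
  namespace: default
spec:
  parentRefs:
    - name: internal      # or "external" for internet-facing apps
      namespace: network  # assumed gateway namespace
  hostnames:
    - my-app.tosih.org
  rules:
    - backendRefs:
        - name: my-app    # hypothetical Service
          port: 80
```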
### DNS Flow

```mermaid
graph LR
    A[Client Query] --> B{DNS Server}
    B -->|*.tosih.org| C[k8s-gateway<br/>10.0.50.100]
    B -->|Other| D[Upstream DNS]
    C --> E[Gateway API Resources]
    E --> F[Service IPs]
```
**Split DNS Setup Required:** Configure your home DNS server to forward `*.yourdomain.com` queries to the k8s-gateway IP (10.0.50.100).
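With AdGuard Home (deployed in this cluster) or any dnsmasq-style resolver, this is a per-domain upstream rule; the domain below assumes `tosih.org` as used elsewhere in this doc:

```
# AdGuard Home: Settings → DNS settings → Upstream DNS servers
[/tosih.org/]10.0.50.100

# dnsmasq equivalent
server=/tosih.org/10.0.50.100
```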
## Storage

### Storage Providers

| Provider | Type | StorageClass | Use Case |
|---|---|---|---|
| Rook-Ceph Block | RBD | `ceph-block` (default) | High-performance block storage for databases |
| Rook-Ceph Filesystem | CephFS | `ceph-filesystem` | Shared filesystem storage (ReadWriteMany) |
| Rook-Ceph Object | S3 | `ceph-bucket` | Object storage buckets |
| ZFS NFS | NFS | `zfs-nfs` | NFS-backed persistent volumes |
| Local HostPath | Node-local | `local-hostpath` | Node-specific persistent data |
| emptyDir | Ephemeral | N/A | Temporary pod storage |
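Selecting a provider is just a matter of `storageClassName` on the claim. A minimal sketch; the claim name and size are illustrative:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data        # hypothetical claim
spec:
  accessModes:
    - ReadWriteOnce        # use ReadWriteMany with ceph-filesystem
  storageClassName: ceph-block
  resources:
    requests:
      storage: 10Gi
```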
### Database Services
| Service | Type | Purpose |
|---|---|---|
| CloudNativePG | PostgreSQL Operator | Manages PostgreSQL clusters for applications |
| Dragonfly Operator | Redis Alternative | High-performance in-memory datastore |
| External Postgres Operator | External DB Management | Manages external PostgreSQL instances |
| VerneMQ | MQTT Broker | Message broker for IoT devices (Home Assistant) |
## Security

### Secret Management
```mermaid
graph LR
    A[Secret in Git] --> B[SOPS Encrypted]
    B --> C[age Key]
    C --> D[Flux Decrypts]
    D --> E[K8s Secret]
    E --> F[Pod]
```
**Encryption Flow:**

- Developer creates a secret YAML file
- `task configure` encrypts it with SOPS + age
- The encrypted secret is pushed to Git
- Flux reads the age key from the cluster
- Flux decrypts and creates a Kubernetes Secret
- Pods consume the secret as env vars or files
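A typical `.sops.yaml` driving this flow looks roughly like the following; the path regex and the truncated age recipient are placeholders, not values from this repo:

```yaml
creation_rules:
  - path_regex: kubernetes/.*\.sops\.ya?ml
    encrypted_regex: ^(data|stringData)$   # encrypt only secret payload fields
    age: age1examplepublickey...           # placeholder recipient
```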
### Access Control
- Talos: API-only access (no SSH)
- Kubernetes: RBAC enabled
- Secrets: SOPS encrypted in Git + External Secrets from 1Password
- Authentication: Pocket ID OIDC provider for SSO
- Gateways:
- External (10.0.50.102) - Public apps with authentication
- Internal (10.0.50.101) - Home network only (network-protected)
## Update Strategy

### Component Updates
| Component | Update Method | Automation |
|---|---|---|
| Helm Charts | Renovate PR → Merge → Flux | Automated |
| Container Images | Renovate PR → Merge → Flux | Automated |
| Kubernetes | Manually edit `talenv.yaml` → `task talos:upgrade-k8s` | Manual |
| Talos | Manually edit `talenv.yaml` → `task talos:upgrade-node` | Manual |
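The manual upgrades pivot on `talos/talenv.yaml`, which in talhelper-style setups holds just the two version pins; the version numbers below are placeholders:

```yaml
# talos/talenv.yaml -- version pins consumed by talconfig.yaml
talosVersion: v1.x.y        # placeholder version
kubernetesVersion: v1.xx.y  # placeholder version
```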
### Renovate Workflow

```mermaid
graph LR
    A[Renovate Detects Update] --> B[Creates PR]
    B --> C[CI Validates]
    C --> D[Developer Reviews]
    D --> E[Merge]
    E --> F[Flux Applies]
```
## High Availability

### Control Plane HA

- 2 control plane nodes sharing a VIP (10.0.50.50)
- etcd quorum: 2/2 required (a strict majority of 2 members is 2, so losing either node loses quorum)
- API server: load balanced via the VIP
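The quorum arithmetic explains why two control plane nodes give API availability via the VIP but no fault tolerance for etcd itself; a third node would change that. A small sketch of the standard majority rule:

```python
def etcd_quorum(members: int) -> int:
    """Quorum is a strict majority: floor(n/2) + 1."""
    return members // 2 + 1

def fault_tolerance(members: int) -> int:
    """How many members can fail while quorum is preserved."""
    return members - etcd_quorum(members)

# A 2-member cluster needs both members (tolerates 0 failures);
# 3 members tolerate 1 failure, 5 tolerate 2.
for n in (2, 3, 5):
    print(f"{n} members: quorum={etcd_quorum(n)}, tolerates={fault_tolerance(n)} failures")
```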
### Workload HA
- Multi-replica deployments across 3 workers
- Pod anti-affinity for critical apps
- PodDisruptionBudgets for graceful updates
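The PodDisruptionBudgets mentioned above can be sketched as follows; the name and app label are hypothetical. With 3 workers and multi-replica deployments, `minAvailable: 1` keeps at least one pod running during node drains:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb                       # hypothetical name
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app     # hypothetical label
```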
## Deployed Applications

### Production Workloads
| Application | Namespace | Gateway | Purpose |
|---|---|---|---|
| Cloud Services | |||
| Immich | cloud | external | Photo & video backup (OIDC-enabled) |
| ImmichFrame | cloud | internal | Digital photo frame for Immich |
| Memos | cloud | internal | Note-taking service |
| Romm | cloud | internal | ROM manager for retro gaming |
| Syncthing | cloud | internal | File synchronization |
| Media Automation | |||
| Plex | media | internal | Media streaming server |
| Jellyseerr | media | internal | Media request management |
| Sonarr | media | internal | TV show automation |
| Radarr | media | internal | Movie automation |
| Lidarr | media | internal | Music automation |
| Readarr | media | internal | eBook & audiobook automation |
| Prowlarr | media | internal | Indexer management |
| Recyclarr | media | internal | TRaSH guide automation |
| qBittorrent | media | internal | Torrent download client |
| NZBGet | media | internal | Usenet download client |
| Audiobookshelf | media | internal | Audiobook & podcast server |
| Beets | media | internal | Music library manager |
| Home Automation | |||
| Home Assistant | home | internal | Home automation platform |
| Homebridge | home | internal | HomeKit bridge |
| AirConnect | home | internal | AirPlay to UPnP/Sonos bridge |
| Eufy Security WS | home | internal | Eufy camera integration |
| Infrastructure | |||
| Homepage | default | internal | Application dashboard |
| Uptime Kuma | default | internal | Uptime monitoring |
| Echo | default | internal | HTTP echo server for testing |
| Network Services | |||
| AdGuard Home | network | internal | DNS server & ad blocking |
| k8s-gateway | network | internal | Internal DNS for *.tosih.org |
| Cloudflare Tunnel | network | external | Secure external access via Cloudflare |
| Cloudflare DNS | network | external | DNS record automation |
| Security & Authentication | |||
| Pocket ID | security | external | OIDC identity provider (SSO) |
| External Secrets | security | internal | 1Password secret integration |
| OnePassword Connect | security | internal | 1Password API server |
| Storage & Databases | |||
| Rook-Ceph Dashboard | rook-ceph | internal | Storage cluster management |
| ZFS Provisioner | kubernetes-zfs-provisioner | internal | Local ZFS storage provisioning |
| CloudNativePG | databases | internal | PostgreSQL operator (3 databases) |
| Dragonfly | databases | internal | Redis-compatible in-memory datastore |
| External Postgres Operator | databases | internal | External PostgreSQL management |
| VerneMQ | databases | internal | MQTT message broker for IoT |
**Total:** 40+ applications across 10 namespaces
### Infrastructure Services
- Flux - GitOps continuous delivery (flux-operator, flux-instance)
- Cilium - CNI and Gateway API
- Rook-Ceph - Distributed storage (operator + cluster)
- CloudNativePG - PostgreSQL operator (3 databases)
- Dragonfly Operator - Redis-compatible in-memory datastore
- External Postgres Operator - External PostgreSQL management
- VerneMQ - MQTT message broker for IoT
- External Secrets - 1Password integration
- Cert-Manager - TLS certificate management
- k8s-gateway - Internal DNS resolution
- Cloudflare Tunnel - Secure external access via Cloudflare
- Cloudflare DNS - DNS record management
- CoreDNS - Cluster DNS service
- Reloader - Auto-reload on ConfigMap/Secret changes
- Spegel - Distributed container image cache
- Metrics Server - Resource metrics API
- Descheduler - Pod rescheduling optimization
- Goldilocks - Resource recommendation engine
- VolSync - Persistent volume replication and backup
- ZFS Provisioner - Local ZFS storage provisioning
## Monitoring Points
Recommended monitoring (not included by default):
- Node metrics: CPU, memory, disk, network
- Cilium: Network flows via Hubble
- Flux: Reconciliation status
- Application: Custom metrics via ServiceMonitor
- Ceph: Storage health and capacity
## Disaster Recovery

### Backup Strategy
Critical items to back up:

- `age.key` - secrets cannot be decrypted without it
- `cluster.yaml` and `nodes.yaml` - source configuration
- Git repository - everything else can be recovered from it
Data backups:
- Persistent volumes - Use Velero or application-specific backup tools
- Application data - Application-specific backup tools
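Since VolSync is deployed in this cluster, per-PVC backups are commonly expressed as a ReplicationSource; a rough sketch using a restic repository, with the names and schedule purely illustrative:

```yaml
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: my-app-backup                  # hypothetical name
spec:
  sourcePVC: my-app-data               # hypothetical PVC
  trigger:
    schedule: "0 2 * * *"              # nightly at 02:00
  restic:
    repository: my-app-restic-secret   # Secret holding restic repo URL + credentials
    copyMethod: Snapshot
    retain:
      daily: 7
      weekly: 4
```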
### Recovery Procedure

- Reinstall Talos on the nodes
- Restore `age.key`
- Run `task bootstrap:talos`
- Run `task bootstrap:apps`
- Flux restores everything from Git
- Restore persistent data from backups