Architecture¶
System Architecture¶
```mermaid
graph TB
    subgraph "Control Plane"
        CP1[control0<br/>10.0.50.1]
        CP2[control1<br/>10.0.50.2]
        VIP[Virtual IP<br/>10.0.50.50]
    end
    subgraph "Worker Nodes"
        W1[worker0<br/>10.0.50.10]
        W2[worker1<br/>10.0.50.11]
        W3[worker2<br/>10.0.50.12]
    end
    subgraph "Load Balancer IPs"
        DNS[k8s-gateway<br/>10.0.50.100]
        INT[Internal Gateway<br/>10.0.50.101]
        EXT[External Gateway<br/>10.0.50.102]
    end
    subgraph "GitOps"
        GIT[GitHub Repository]
        FLUX[Flux Controllers]
    end
    CP1 --> VIP
    CP2 --> VIP
    VIP --> W1
    VIP --> W2
    VIP --> W3
    GIT --> FLUX
    FLUX --> W1
    FLUX --> W2
    FLUX --> W3
```
Configuration Management¶
Template System¶
The cluster uses a template-driven approach with makejinja:
```text
cluster.yaml + nodes.yaml
          ↓
      makejinja
          ↓
  ├── talos/       (Talos configs)
  ├── kubernetes/  (K8s manifests)
  └── bootstrap/   (Helmfile configs)
```
**Important:** Never edit generated files directly. Always edit the source YAML files and run `task configure`.
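For orientation, a `cluster.yaml` fragment might look like the sketch below. The key names are illustrative assumptions (check the template schema for the real ones); the addresses are the ones documented on this page.

```yaml
# cluster.yaml (sketch; key names are assumptions, values are from this page)
cluster_api_addr: 10.0.50.50          # control plane VIP
cluster_pod_cidr: 10.42.0.0/16
cluster_svc_cidr: 10.43.0.0/16
cluster_dns_gateway_addr: 10.0.50.100 # k8s-gateway
cluster_gateway_addr: 10.0.50.101     # internal gateway
cloudflare_gateway_addr: 10.0.50.102  # external gateway
```

Running `task configure` re-renders `talos/`, `kubernetes/`, and `bootstrap/` from these values.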
Directory Structure¶
```text
home-ops/
├── .mise.toml          # Tool versions
├── Taskfile.yaml       # Task definitions
├── cluster.yaml        # Cluster configuration (SOURCE)
├── nodes.yaml          # Node configuration (SOURCE)
├── age.key             # SOPS encryption key
├── .sops.yaml          # SOPS configuration
│
├── talos/              # Talos Linux configs
│   ├── talconfig.yaml  # Generated by makejinja
│   ├── talenv.yaml     # Talos/K8s versions
│   ├── patches/        # Configuration patches
│   │   ├── global/     # Applied to all nodes
│   │   ├── controller/ # Control plane only
│   │   └── worker/     # Worker nodes only
│   └── clusterconfig/  # Generated per-node configs
│
├── kubernetes/         # Kubernetes manifests
│   ├── apps/           # Application deployments
│   │   ├── cert-manager/
│   │   ├── databases/
│   │   ├── kube-system/
│   │   ├── network/
│   │   └── storage/
│   └── flux/           # Flux configuration
│       ├── cluster/    # Cluster-wide Kustomizations
│       └── meta/       # Flux repositories
│
├── bootstrap/          # Initial bootstrap
│   └── helmfile.d/     # Helmfile configs
│
├── templates/          # Jinja2 templates
│   ├── config/         # Main templates
│   └── scripts/        # Template plugins
│
└── .taskfiles/         # Task implementations
    ├── bootstrap/
    ├── talos/
    └── template/
```
GitOps Workflow¶
Flux Architecture¶
```mermaid
graph LR
    A[Git Push] --> B[GitHub]
    B --> C[Flux Source Controller]
    C --> D[Flux Kustomization Controller]
    D --> E[cluster-meta]
    D --> F[cluster-apps]
    E --> G[OCI Repositories]
    F --> H[Application Kustomizations]
    H --> I[HelmReleases]
```
Reconciliation Flow¶
- Developer pushes to the Git repository
- Flux polls every hour (or a webhook triggers reconciliation immediately)
- Source Controller pulls the latest changes
- Kustomization Controller applies manifests:
    - `cluster-meta` first (Flux repos, dependencies)
    - `cluster-apps` second (all applications)
- Helm Controller installs/upgrades HelmReleases
- Notification on success/failure (if configured)
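As a concrete sketch, the `cluster-apps` Kustomization could be wired up as below. The path and secret name are assumptions based on the layout above; the `dependsOn` and `decryption` fields are standard Flux API.

```yaml
# Illustrative Flux Kustomization for cluster-apps
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cluster-apps
  namespace: flux-system
spec:
  interval: 1h                  # matches the hourly poll described above
  path: ./kubernetes/apps       # assumed path in this repo layout
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  dependsOn:
    - name: cluster-meta        # repos/dependencies reconcile first
  decryption:
    provider: sops
    secretRef:
      name: sops-age            # assumed name of the in-cluster age key secret
```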
Networking¶
Pod Network (Cilium)¶
- CNI: Cilium (native routing mode)
- Pod CIDR: 10.42.0.0/16
- Service CIDR: 10.43.0.0/16
- Gateway API: Enabled
- Hubble: Available for network observability
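A minimal sketch of the Helm values behind this setup (the repository's actual HelmRelease may differ):

```yaml
# Cilium Helm values (sketch; compare against the repo's HelmRelease)
ipam:
  mode: kubernetes
routingMode: native                 # native routing, no tunneling/encapsulation
ipv4NativeRoutingCIDR: 10.42.0.0/16 # the pod CIDR above
gatewayAPI:
  enabled: true                     # backs the internal/external Gateways
hubble:
  relay:
    enabled: true                   # network observability
  ui:
    enabled: true
```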
Ingress Architecture¶
```mermaid
graph TB
    Internet[Internet] --> CF[Cloudflare Tunnel<br/>10.0.50.102]
    LAN[Home Network] --> INT[Internal Gateway<br/>10.0.50.101]
    CF --> EXT_GW[External Gateway]
    INT --> INT_GW[Internal Gateway]
    EXT_GW --> SVC1[Service A]
    EXT_GW --> SVC2[Service B]
    INT_GW --> SVC3[Service C]
    INT_GW --> SVC4[Service D]
```
Gateway Selection:

- Use the `external` gateway for public internet access (via Cloudflare Tunnel)
- Use the `internal` gateway for home-network-only access
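An application opts into one of the gateways through its HTTPRoute `parentRefs`. A hypothetical example (the gateway namespace and hostname are assumptions):

```yaml
# Illustrative HTTPRoute attaching an app to the internal gateway
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-app
spec:
  parentRefs:
    - name: internal        # swap for "external" to publish via Cloudflare Tunnel
      namespace: network    # assumed namespace holding the Gateways
  hostnames:
    - my-app.tosih.org
  rules:
    - backendRefs:
        - name: my-app      # the app's Service
          port: 80
```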
DNS Flow¶
```mermaid
graph LR
    A[Client Query] --> B{DNS Server}
    B -->|*.tosih.org| C[k8s-gateway<br/>10.0.50.100]
    B -->|Other| D[Upstream DNS]
    C --> E[Gateway API Resources]
    E --> F[Service IPs]
```
**Split DNS Setup Required:** Configure your home DNS server to forward `*.yourdomain.com` to the k8s-gateway IP (`10.0.50.100`).
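For example, if your home DNS server happens to be AdGuard Home, the conditional forward is one upstream entry in its YAML config (other DNS servers have equivalent syntax):

```yaml
# AdGuard Home sketch: send only *.tosih.org to k8s-gateway
dns:
  upstream_dns:
    - '[/tosih.org/]10.0.50.100'
    - https://1.1.1.1/dns-query   # everything else goes upstream
```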
Storage¶
Storage Providers¶
| Provider | Type | Use Case |
|---|---|---|
| Rook-Ceph | Distributed | Persistent volumes with replication |
| ZFS Provisioner | Local | High-performance local storage |
| emptyDir | Ephemeral | Temporary pod storage |
| hostPath | Node-local | Node-specific persistent data |
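Requesting Ceph-backed storage is a plain PVC against the `ceph-block` StorageClass; the claim name and size below are illustrative:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
    - ReadWriteOnce            # block volumes attach to one node at a time
  storageClassName: ceph-block
  resources:
    requests:
      storage: 10Gi
```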
Ceph Architecture (when configured)¶
```mermaid
graph TB
    subgraph "Worker Nodes"
        W1[worker0<br/>nvme0n1p5: 100GB]
        W2[worker1<br/>nvme0n1p5: 100GB]
        W3[worker2<br/>nvme0n1p5: 100GB]
    end
    subgraph "Ceph Cluster"
        MON1[Monitor]
        MON2[Monitor]
        MON3[Monitor]
        OSD1[OSD]
        OSD2[OSD]
        OSD3[OSD]
        MGR[Manager]
    end
    W1 --> OSD1
    W2 --> OSD2
    W3 --> OSD3
    OSD1 --> POOL[ceph-blockpool<br/>3x replication]
    OSD2 --> POOL
    OSD3 --> POOL
    POOL --> SC1[ceph-block<br/>StorageClass]
    POOL --> SC2[ceph-filesystem<br/>StorageClass]
```
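The 3x replicated pool in the diagram corresponds to a Rook `CephBlockPool` custom resource; a sketch of what that looks like:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: ceph-blockpool
  namespace: rook-ceph
spec:
  failureDomain: host    # spread replicas across the three workers
  replicated:
    size: 3              # 3x replication, as shown above
```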
Security¶
Secret Management¶
```mermaid
graph LR
    A[Secret in Git] --> B[SOPS Encrypted]
    B --> C[age Key]
    C --> D[Flux Decrypts]
    D --> E[K8s Secret]
    E --> F[Pod]
```
Encryption Flow:

- Developer creates a secret YAML file
- `task configure` encrypts it with SOPS + age
- The encrypted secret is pushed to Git
- Flux reads the age key from the cluster
- Flux decrypts and creates a Kubernetes Secret
- Pods consume the secret as env vars or files
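The encryption rules live in `.sops.yaml`. A typical creation rule looks like this (the age recipient is a placeholder; never commit `age.key` itself):

```yaml
# .sops.yaml sketch; the age public key below is a placeholder
creation_rules:
  - path_regex: kubernetes/.*\.sops\.ya?ml
    encrypted_regex: ^(data|stringData)$  # encrypt only the secret payload
    age: age1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```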
Access Control¶
- Talos: API-only access (no SSH)
- Kubernetes: RBAC enabled
- Secrets: SOPS encrypted in Git
- External Access: Cloudflare Tunnel with authentication
Update Strategy¶
Component Updates¶
| Component | Update Method | Automation |
|---|---|---|
| Helm Charts | Renovate PR → Merge → Flux | Automated |
| Container Images | Renovate PR → Merge → Flux | Automated |
| Kubernetes | Edit `talenv.yaml` → `task talos:upgrade-k8s` | Manual |
| Talos | Edit `talenv.yaml` → `task talos:upgrade-node` | Manual |
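Both manual paths flow through `talos/talenv.yaml`: bump a version there and run the matching task. A sketch (the versions shown are examples; pin what you actually run):

```yaml
# talos/talenv.yaml (sketch; use your real versions)
talosVersion: v1.9.5
kubernetesVersion: v1.32.3
```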
Renovate Workflow¶
```mermaid
graph LR
    A[Renovate Detects Update] --> B[Creates PR]
    B --> C[CI Validates]
    C --> D[Developer Reviews]
    D --> E[Merge]
    E --> F[Flux Applies]
```
High Availability¶
Control Plane HA¶
- 2 control plane nodes sharing a virtual IP (10.0.50.50)
- etcd quorum: 2/2 required (quorum is ⌊n/2⌋ + 1, so a two-member cluster tolerates zero node failures; a third node would add real fault tolerance)
- API server: load balanced via the VIP
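On Talos, the shared VIP comes from a controller patch like the following; the interface name is an assumption that must match the node's NIC:

```yaml
# talos/patches/controller VIP sketch
machine:
  network:
    interfaces:
      - interface: eth0    # assumed NIC name; match your hardware
        dhcp: true
        vip:
          ip: 10.0.50.50   # floats between control0 and control1
```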
Workload HA¶
- Multi-replica deployments across 3 workers
- Pod anti-affinity for critical apps
- PodDisruptionBudgets for graceful updates
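A PodDisruptionBudget for a 3-replica app might look like this (names are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app
spec:
  minAvailable: 2          # keep 2 of 3 replicas up during node drains
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app
```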
Monitoring Points¶
Recommended monitoring (not included by default):
- Node metrics: CPU, memory, disk, network
- Cilium: Network flows via Hubble
- Ceph: Cluster health, OSD status
- Flux: Reconciliation status
- Application: Custom metrics via ServiceMonitor
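If you add kube-prometheus-stack (not installed by default), application metrics are scraped via a ServiceMonitor such as this sketch:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app
  endpoints:
    - port: metrics        # name of the Service port serving /metrics
      interval: 30s
```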
Disaster Recovery¶
Backup Strategy¶
Critical items to back up:

- `age.key` - Cannot decrypt secrets without this
- `cluster.yaml` and `nodes.yaml` - Source configuration
- Git repository - Everything else can be recovered from here
Data backups:
- Persistent volumes (Ceph) - Use Velero or similar
- Application data - Application-specific backup tools
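If you go the Velero route, a nightly Schedule is a small custom resource; the namespace and retention below are illustrative:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly
  namespace: velero
spec:
  schedule: "0 3 * * *"    # 03:00 every night
  template:
    includedNamespaces:
      - databases          # assumed namespace worth protecting
    ttl: 720h              # retain backups for 30 days
```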
Recovery Procedure¶
- Reinstall Talos on the nodes
- Restore `age.key`
- Run `task bootstrap:talos`
- Run `task bootstrap:apps`
- Flux restores everything from Git
- Restore persistent data from backups