In the beginning
7 years ago, at 2am (probably), I stumbled upon r/homelab and its beautifully cable-managed server racks. I was astonished, and I wanted to understand what it took to take agency over my own infrastructure and data.
In celebration of my home lab hitting v1, I have decided to create this post to document my lab and the things I've learned and struggled with along the way.
After all these years, my lab has served me well in teaching me what it meant to run bare metal at an enterprise level and has helped me to become a better engineer. It taught me about high availability architectures, containerization, low level OS concepts (I love ZFS), hardware failure modes and how to optimize software for hardware.
This is by no means a guide, and my v1 structure is likely overkill for 99% of home lab users (heck, it's in excess for me!), but as I like to say, a hobby without excess is just a chore.
Some services I run today
- Media
  - Jellyfin
    - Moved from Plex to have more sovereignty over my data
  - Immich
- Tools
  - AI
    - OpenWebUI
    - LiteLLM (Poe + NanoGPT)
- Infrastructure
  - OS
    - Hypervisor: Proxmox on Debian
    - File System: ZFS
      - Running performant and redundant Ceph is expensive!
  - Auth
    - PocketID
  - Networking (routed by Unifi Network)
    - DNS
      - Zoraxy
        - I did not enjoy using NGINX Proxy Manager. Zoraxy "just works"
      - AdGuard
      - Unbound
    - ProtonVPN (external VPN provider)
    - Wireguard + WGDashboard
  - Operations
    - Uptime Kuma
    - XyOps + XySatellite for running operations automations
V0 - Getting my feet wet
To understand the v1 state, a bit of knowledge on how things started is useful.
My priority in the first iteration was to be able to get everything up and running for as cheap and fast as possible and learn as much as possible. All bare metal in this iteration are consumer/prosumer grade for this reason.
Bare Metal
- Networking
  - Router: Unifi Dream Machine Special Edition (UDM SE)
  - Wifi Access points: Unifi U6-LR
- Control Plane
  - Intel NUC 11 (32 GB DDR4 RAM)
    - Hypervisor: VMWare
    - Ubuntu VMs
- Data Plane
  - Synology DS1621+ (2 GB ECC RAM)
A few of these items are no longer easily accessible at the time of writing (the NUC and Synology NAS). This hardware carried me for almost 5 years and ran pretty much everything I needed.
Though the deployment was quick, it was PAINFUL to maintain. So painful in fact that it almost killed my entire passion for home labs. This was for a variety of reasons:
Learnings
I hate VMWare
Administering VMs directly was a nightmare. At one point I tracked how often the VMs failed (disconnecting from the network, services dying, VMs crashing) and found myself intervening 3 to 4 times a week.
Every little thing was manual: updating, mounting network drives, etc. This sucked.
This was all while I was starting out in my career too, so it resulted in a lot of wasted energy. I got fed up with VMWare really quickly.
IO limits are a thing
Spinning rust (also known as hard disks lol) is physically limited in how many operations it can perform per second, as each random read/write forces the disk to physically seek to the block's location. This is exacerbated when results of operations aren't stored in cache or memory, forcing the OS to hit the disk more often; the resulting condition, called disk thrashing, heavily degrades performance (and the life of your drives).
Building systems in the cloud makes this invisible. You upload to S3 and defer thinking about IO to S3's engineers. Need more IO for a managed OpenSearch instance? Change a config to increase IO or change your storage type, and AWS will migrate it for you.
But when you're running your own bare metal, you literally have to plan ahead for the types of workloads you expect to run: how much RAM you need to prevent disk thrashing, how much IO your drives are expected to handle, etc. I did not do this and thought 2 GB of ECC RAM would be enough (surprise! It wasn't).
I only realized this when the Synology WebUI started taking 2 minutes to load every time, as the system was running at ~95% of its IOPS capacity at p75.
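A rough back-of-the-envelope sketch shows why random IO on spinning disks becomes the bottleneck. The IOPS figures below are assumed, illustrative ballpark numbers for a 7200 RPM disk and a SATA SSD, not benchmarks of my hardware:

```python
# Back-of-the-envelope IOPS math (illustrative figures, not measurements).
# A 7200 RPM HDD pays seek + rotational latency on every random access,
# so it sustains only on the order of ~100-200 random IOPS.
HDD_RANDOM_IOPS = 150      # assumed mid-range figure for a 7200 RPM disk
SSD_RANDOM_IOPS = 50_000   # assumed conservative figure for a SATA SSD

def seconds_to_serve(random_ops: int, iops: int) -> float:
    """Time for one device to serve a burst of random operations."""
    return random_ops / iops

# A metadata-heavy task issuing 30,000 random reads:
ops = 30_000
print(f"HDD: {seconds_to_serve(ops, HDD_RANDOM_IOPS):.0f} s")  # 200 s
print(f"SSD: {seconds_to_serve(ops, SSD_RANDOM_IOPS):.1f} s")  # 0.6 s
```

With the whole budget consumed by background services, any extra burst (like loading a WebUI) queues behind hundreds of seconds of seeks, which matches the multi-minute load times I was seeing.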
Underestimating operations
I did not automate ANY of my operations and naively believed that a hypervisor, Ubuntu, and apt update were enough...
I was the monitoring. I was the maintenance.
No SLA means low standards means poor service
The implications of failure modes on bare metal are much more significant than those of software. You can always run more software images on the same machine with a few extra commands, but if that machine goes down, everything on it goes down.
I never put an expectation of how "online" my services should be (heck I didn't even understand that when I started).
As a result, when my NUC went down, everything was down, even my internet. That is okay if you're the only consumer, but not so much if people depend on your infrastructure...
No 3-2-1
3 copies of your data, on 2 different types of media, with 1 copy offsite
I had a SINGLE Data Plane node. One bad incident or one wrong command and goodbye, all my data. I lived like this for almost 7 years... The fear of losing everything lived in the back of my mind this entire time.
(Pro/Con)sumer is usable but inflexible
Over time I learnt very quickly that prosumer gear is very constrained in its native capabilities. Many features are either not implemented or intentionally unsupported to keep users locked into these ecosystems. Some examples:
- Synology
  - Limiting rsync folder cloning to only Synology-branded NAS'
    - Workarounds exist (i.e. Syncthing, the rsync CLI) but aren't easy to perform or optimal
  - Most consumer hardware will be flagged by the Synology system as "incompatible" even if it is perfectly fine to use
  - DS1621+ only takes SODIMM memory
  - Requires using the BTRFS filesystem (ZFS generally not supported)
  - Official spare parts are exorbitantly expensive
- Ubiquiti
  - Site-to-site VPNs via Wireguard are not supported by Unifi Network Software
  - Policy routing was not supported until very recently
  - Effectively dead if parts fail
- VMWare
  - Just sucked lol
All of these features are supported and accessible with Linux + open source software, but are sacrificed for the usability of these ecosystems.
Deriving priorities for V1
As a result of YEARS of pain, I resolved to "fix this shit". The following requirements were created to allow me (and others) to enjoy my home lab:
- SLAs
- 99.9% uptime
- 3-2-1. Kill the anxiety.
- Get off VMWare and onto Proxmox
- No more consumer hardware (prosumer allowed, but scrutinized)
- Fully automated operations
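An uptime SLA translates directly into a downtime budget, which is what makes it actionable. A minimal sketch of the arithmetic:

```python
# Convert an uptime SLA into an allowed-downtime budget.
def downtime_budget_hours(sla: float, period_hours: float) -> float:
    """Hours of downtime permitted over a period at a given uptime SLA."""
    return (1 - sla) * period_hours

YEAR_HOURS = 365 * 24  # 8760 hours in a non-leap year

# 99.9% ("three nines") allows roughly 8.76 hours of downtime per year,
# i.e. about 43 minutes per month:
print(f"99.9%  -> {downtime_budget_hours(0.999, YEAR_HOURS):.2f} h/year")
print(f"99.99% -> {downtime_budget_hours(0.9999, YEAR_HOURS):.3f} h/year")
```

At 99.9%, a single botched weekend maintenance window can eat most of the year's budget, which is why the automation and redundancy requirements above follow from the SLA.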
V1 - Architecture of today
The current iteration running today.
Bare Metal
- UPS: Cyberpower OLS2000ERT2UA 9A, close to the most load that a typical AU circuit can handle (10A)
- Networking
- Router: Unifi Dream Machine Special Edition (UDM SE)
- Switches
- Wifi Access points: Unifi U6-LR
- Hyper-converged Compute (Control + Data Plane)
  - Bare Metal Nodes
    - Node 1: Dell PowerEdge T630 ~ 2x 2690 V4, 96 GB ECC DDR4 RAM, Enterprise Toshiba Dell SAS Drives
    - Node 2: SilverStone RM400 ~ Ryzen 5 3600, 48 GB Non-ECC DDR4 RAM, WD Red Pro
    - Node 3 (offsite): Dell PowerEdge R530 ~ 1x 2620 V4, 16 GB ECC DDR4 RAM, Mixed Drives (Ironwolf, Exos, WD Red Pro)
  - All running Proxmox and ZFS volumes (encrypted at rest), replicating via syncoid
Software Mechanisms
- Operations
- All scripts are run by XyOps and XySat
- Replication and data
- Syncoid
- Cockpit with 45 Drives' plugins
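As a sketch of what scheduled one-way syncoid replication can look like in practice (the pool names, hostnames, and times below are hypothetical placeholders, not my actual layout):

```shell
# Hypothetical crontab fragment: nightly one-way ZFS replication via syncoid.
# "tank", "node2", and "offsite" are illustrative names.

# Replicate the local pool (and child datasets) to the onsite second node:
0 2 * * * /usr/sbin/syncoid --recursive tank root@node2:tank-backup

# Then push to the offsite node over SSH an hour later:
0 3 * * * /usr/sbin/syncoid --recursive tank root@offsite:tank-backup
```

syncoid rides on ZFS snapshots and incremental `zfs send`/`zfs receive`, so after the first full sync each nightly run only transfers changed blocks.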
Design Decisions
Why Hyper-converged architecture?
Hyper-converged infrastructure (HCI) is a software-defined IT infrastructure that virtualizes... virtualized computing (a hypervisor), software-defined storage, and virtualized networking (software-defined networking).
- Wikipedia
VMWare and Synology DSM taught me that relying on separate systems for compute and storage (especially at homelab scale) can be painful. Having to manage two systems and ensure that both are properly secured and in sync with each other can lead to drift between configurations, as the contract between them is not solidified. This includes migrations, authentication, replication, etc.
The options I considered were:
- Option 1: Proxmox
  - Pros
    - Hyper-converged infrastructure that can combine compute and data plane management
    - Can run any filesystem required on drives (inc. ZFS)
  - Cons
    - Requires upfront learning investment
    - No strong GUI to manage the data plane
      - Cockpit does exist to improve the UX of data plane management, but is not as rich as data-plane-first software, i.e. Option 2
- Option 2: TrueNAS or other Data Plane first software
  - Pros
    - Data Plane first software. Isolated failure between control and data plane
  - Cons
    - Not virtualization first. Primarily focused on operating as a NAS rather than hyper-converged. May result in similar experiences as Option 3
- Option 3: Stick with the separate architecture
  - Pros
    - Isolated failures between control and data plane
  - Cons
    - Requires separate hardware for each TrueNAS instance (doubles the nodes required)
      - Running TrueNAS in a VM increases complexity and failure modes, so dedicated nodes are preferred
    - Does not solve the failure modes of having separate control and data planes
I ended up picking Proxmox VE with Cockpit + 45 Drives' plugins to minimize the impact of trading off NAS first UX.
Why ZFS and not Ceph?
Ceph is a... software-defined storage platform that provides object storage, block storage, and file storage built on a common distributed cluster foundation.
- Wikipedia
ZFS is old, boring, performant with single node architectures and is widely used everywhere.
This is really good, as long as you do not need live replication. Unlike Ceph, ZFS only supports push/pull-based replication, relying on middleware to sync two nodes one way.
Ceph is pooled storage, meaning data is striped across multiple Ceph nodes. Nodes in quorum coordinate and write over the network. This allows data to be shared across multiple compute nodes, unlike traditional RAID or ZFS: an entire data center can work together to provide storage while accessing that same storage.
This means:
- You need a lot of network capacity (at least 10-gig networking) to allow writes to happen quickly over the network
- Your data plane needs to be quick and handle a lot of IO, as reads/writes happen across multiple nodes very frequently to keep them all in sync
- You need multiple nodes to benefit from pooled storage, and they all need to maintain quorum (>50% active)
- Because data is written across multiple nodes in Ceph, you get redundancy built in and can survive certain nodes going down, as long as you remain in quorum
- You can also read the same data from multiple nodes, which means you can move an exact workload from one node to another instantaneously; with ZFS you might have some difference between nodes, since replication is triggered manually
I am not a data center and do not need pooled storage. If a node goes down, I can migrate a container from one node to another.
Building 3 Ceph-capable compute nodes means a lot of money for little gain. ZFS is enough, and it keeps things simple and reliable.
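The quorum requirement is just majority arithmetic. A minimal sketch (illustrative, not Ceph's actual monitor election logic) of why a 3-node cluster tolerates exactly one node loss while a 2-node cluster tolerates none:

```python
# Majority-quorum math, as used by Ceph monitors (and Raft, etcd, etc.).
def has_quorum(alive: int, total: int) -> bool:
    """A cluster keeps quorum while a strict majority of nodes is alive."""
    return alive > total / 2

def tolerated_failures(total: int) -> int:
    """Nodes you can lose while a strict majority remains alive."""
    return total - (total // 2 + 1)

assert has_quorum(2, 3)        # lose 1 of 3 -> still a majority
assert not has_quorum(1, 3)    # lose 2 of 3 -> quorum lost
assert not has_quorum(1, 2)    # lose 1 of 2 -> quorum lost (1 is not > 1)

print(tolerated_failures(3))   # 1
print(tolerated_failures(5))   # 2
```

This is why "just add a second node" doesn't buy you Ceph-style availability: an even-sized cluster tolerates no more failures than the next smaller odd one, so 3 nodes is the realistic entry point.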
Changelog
- 28/04/2026 - Initial Version