In the beginning
7 years ago, at 2am (probably), I stumbled upon r/homelab and its beautifully cable-managed server racks. I was astonished and wanted to understand what it took to take agency over my own infrastructure and data.
In celebration of my home lab hitting v1, I have decided to create this post to document my lab and the things I've learned and struggled with along the way.
After all these years, my lab has served me well in teaching me what it meant to run bare metal in production and has helped me to become a better engineer. It taught me about high availability architectures, containerization, low level OS concepts (I love ZFS), hardware failure modes and how to optimize software for hardware.
This is by no means a guide, and my v1 structure is probably overkill for 90% of all home lab users. Heck, it's in excess for me! But in the end, a hobby without some excess is just doing chores.
Some services I run today
I try my best to run purely free and open source software. This is not everything that I run but includes all the vitals.
- Infrastructure
- Hypervisor: Proxmox
- File System: ZFS (see later for why!)
- Auth: PocketID
- Networking (routed by Unifi Network)
- Media
- Tools
- Operations
- Uptime Kuma
- XyOps + XySatelite for operations automation
V0 - Getting my feet wet
To understand v1, a bit of knowledge on how things started is useful.
The priority in the first iteration was to get everything up quickly, learn as much as possible and fail fast. All bare metal in this iteration is consumer/prosumer grade for this reason.
Bare Metal
- Networking
- Router: Unifi Dream Machine Special Edition (UDM SE)
- Wifi Access points: Unifi U6-LR
- Control Plane
- Intel NUC 11 (32 gb DDR4 RAM)
- Hypervisor: VMware
- Data Plane
- Synology DS1621+ (2gb ECC RAM)
A few of these items are no longer readily available at the time of writing (the NUC and Synology NAS). This hardware carried me for almost 5 years and ran pretty much everything I needed.
Though the deployment was quick, it was PAINFUL to maintain. So painful in fact that it almost killed my entire passion for home labs. This was for a variety of reasons:
Learnings
I hate VMware
Administering VMs directly was a nightmare. At one point I tracked how often the VMs failed (disconnecting from the network, services dying, VMs crashing) and I was intervening 3 to 4 times a week.
Every little thing was manual: updating, mounting network drives, etc. This sucked.
This was while I was starting my career so it resulted in a lot of wasted energy. I got fed up with VMware really quickly.
IO limits are a thing
Spinning rust (also known as hard disks) is physically limited in how many operations it can do per second: each time a random read/write occurs, the drive has to move its head and wait for the platter to spin round to the location of the block.
This cost is paid whenever the data isn't already in cache or memory. When the OS is starved of RAM it gets much worse and leads to an issue called disk thrashing, which heavily degrades performance (and the life of your drives).
Building systems on the cloud makes this invisible. You upload to S3 and you defer thinking about IO to S3 engineers. Need more IO for a managed OpenSearch instance? Change a config to increase IO or change your storage type and AWS will migrate it for you.
But when you're running bare metal, you have to plan ahead and accommodate the types of workloads you're expecting to run. Things like how much RAM/cache you need to prevent disk thrashing, how much IO your drives are expected to handle, etc.
I did not do this and thought 2gb of ECC RAM would be enough (surprise: it wasn't). I only realized the implications when the Synology WebUI started taking 2 minutes to load, with my system running at a p75 of ~95% of the HDDs' IOPS capacity.
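To get a feel for how low the ceiling is, here is a back-of-the-envelope sketch of random IOPS for a single spinning disk. It uses the common approximation that each random operation costs one average seek plus half a platter revolution. The 7200 RPM and 8.5 ms seek figures are assumptions for illustration, not measurements from the drives in this lab.

```python
def hdd_random_iops(rpm: int, avg_seek_ms: float) -> float:
    """Rough upper bound on random IOPS for one spinning disk."""
    # On average the platter has to turn half a revolution
    # before the wanted block passes under the head.
    rotational_latency_ms = 0.5 * 60_000 / rpm
    # Each random operation pays the seek plus the rotational wait.
    service_time_ms = avg_seek_ms + rotational_latency_ms
    return 1000 / service_time_ms


if __name__ == "__main__":
    # Assumed figures: a 7200 RPM drive with an 8.5 ms average seek.
    per_disk = hdd_random_iops(rpm=7200, avg_seek_ms=8.5)
    print(f"one disk: roughly {per_disk:.0f} random IOPS")
```

Under those assumptions a single disk tops out well below 100 random operations per second, which is why RAM and cache matter so much once a workload stops being sequential.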
Underestimating operations
I did not automate ANY operations and naively believed that a hypervisor, Ubuntu and apt update were enough...
I was the monitoring and the maintenance, and I never thought to invest time to make administration easier.
No SLA = low standards = poor service
The implications of failure modes with bare metal are more significant than the software that runs on it. You can always run more software images with extra commands on the same machine, but if that machine goes down, then everything goes down with it.
I never set an expectation of how "online" my services should be (heck, I didn't even understand the concept when I started).
So, when my NUC went down, everything was down, even my internet. That is okay if you're the only consumer, but not so much if family and friends depend on your infrastructure...
No 3-2-1
3 copies, on 2 different types of media, with 1 offsite
For years, I had a SINGLE Data Plane node running SHR1 (1 drive redundancy). One incident or one wrong command and goodbye all data. I lived like this for almost 7 years... The fear of losing everything lived in the back of my mind this entire time.
(Pro/Con)sumer is usable but inflexible
Over time I learnt that prosumer gear is very constrained in its native capabilities. Many features are either not implemented or intentionally unsupported to keep users locked into these ecosystems. Some examples are:
- Synology
- Limiting rsync folder cloning to only Synology-branded NAS units
- Workarounds exist (e.g. Syncthing or the rsync CLI) but aren't easy to set up or optimal
- Most consumer hardware will be flagged by the Synology system as "incompatible" even if it is perfectly fine to use
- DS1621+ only uses SODIMM memory
- Requires using the BTRFS filesystem (ZFS is generally not supported)
- Official spare parts are exorbitantly expensive
- Ubiquiti
- Site to Site VPNs via Wireguard are not supported by Unifi Network Software
- Entire systems are dead if parts fail
- VMware
- Just sucked lol
All of these features are supported and accessible on Linux + Open Source Software, but are sacrificed for the usability of the ecosystems.
Deriving priorities for V1
As a result of YEARS of pain, I resolved to "fix this shit". The following requirements were created to allow me and others to enjoy my home lab.
- Aspire for 99.9% uptime
- Automated failover of stateless containers
- Consistent data replication across all nodes for stateful containers
- Manual failover for stateful containers
- Automated failovers for VMs
- Get off VMware and onto Proxmox
- No more consumer (prosumer decisions scrutinized)
- Disaster protection
- 3-2-1. Kill the anxiety
- Fully automated operations including backups, health checking, data scrubbing, etc...
- Local disaster protection (power surges, fires)
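To make the 99.9% target concrete, the arithmetic below turns an availability percentage into a downtime budget. It is plain arithmetic rather than anything specific to this lab, and the function name is made up for the example.

```python
def downtime_budget_minutes(availability_pct: float, period_hours: float) -> float:
    """Minutes of downtime allowed by an availability target over a period."""
    return (1 - availability_pct / 100) * period_hours * 60


HOURS_PER_WEEK = 7 * 24
HOURS_PER_YEAR = 365 * 24  # ignoring leap years

weekly = downtime_budget_minutes(99.9, HOURS_PER_WEEK)
yearly = downtime_budget_minutes(99.9, HOURS_PER_YEAR)

print(f"99.9% uptime allows about {weekly:.2f} minutes of downtime per week")
print(f"99.9% uptime allows about {yearly / 60:.2f} hours of downtime per year")
```

That works out to roughly 10 minutes a week, or a bit under 9 hours a year. A single slow manual recovery can eat most of that, which is why the failover and automation items sit on the same list.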
V1 - Architecture of today
What is driving up my electricity bill nowadays.
Bare Metal
- UPS: Cyberpower OLS2000ERT2UA 9A, close to the practical limit a single AU circuit can handle (10A)
- Networking
- Router: Unifi Dream Machine Special Edition (UDM SE)
- Switches
- Wifi Access points: Unifi U6-LR
- Hyper-converged Compute (Control + Data Plane)
- Bare Metal Nodes
- Node 1: Dell PowerEdge T630 ~ 2x Xeon 2690 V4, 96gb ECC DDR4 RAM, Enterprise Toshiba Dell SAS Drives
- Node 2: SilverStone RM400 ~ Ryzen 5 3600, 48gb Non-ECC DDR4 RAM, WD Red Pro
- Node 3 (offsite): Dell PowerEdge R530 ~ Xeon 2620 V4, 16gb ECC DDR4 RAM, Mixed Drives (Ironwolf, Exos, WD Red Pro)
- All running Proxmox and ZFS volumes (encrypted at rest) replicating via syncoid
Software Mechanisms
- Operations
- All scripts are run by XyOps and XySatelite
- Replication and data
- Syncoid
- Cockpit with 45 Drives' plugins
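As an illustration of what a scripted replication job could look like, here is a minimal sketch that builds a syncoid command line for pushing a dataset to another node. The dataset name `tank/appdata` and the host `root@node2` are hypothetical placeholders, not the actual layout of this lab, and the sketch only assembles and prints the command rather than running it.

```python
import shlex


def syncoid_command(source: str, target_host: str, target: str,
                    recursive: bool = True) -> list[str]:
    """Build the argv for a one-way syncoid push of a ZFS dataset."""
    argv = ["syncoid"]
    if recursive:
        # Also replicate child datasets under the source.
        argv.append("--recursive")
    # syncoid takes SOURCE then TARGET; a remote target is written as host:dataset.
    argv += [source, f"{target_host}:{target}"]
    return argv


# Hypothetical dataset and host names, for illustration only.
cmd = syncoid_command("tank/appdata", "root@node2", "tank/appdata")
print(shlex.join(cmd))
# To actually run it from a scheduled job: subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than a shell string avoids quoting problems when a scheduler runs it unattended.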
Design Decisions
Why Hyper-converged architecture?
Hyper-converged infrastructure (HCI) is a software-defined IT infrastructure that virtualizes... virtualized computing (a hypervisor), software-defined storage, and virtualized networking (software-defined networking).
- Wikipedia
VMware and using Synology DSM taught me that relying on separate systems for compute and storage (especially at the homelab scale) can be painful.
Having to manage 2 systems and ensure that both are secure and in sync can lead to drift between configurations, as the contract between them is brittle. This includes migrations, authentication, replication, etc.
The options I considered to remove this friction were:
- Proxmox
- Pros
- Hyper-converged infrastructure that can combine compute and data plane management
- Can run any filesystem required on drives (inc. ZFS)
- Cons
- Requires upfront learning investment
- No strong GUI to manage data plane
- Cockpit does exist to improve the UX of data plane management but is not as rich as data plane first software, i.e. option 2
- TrueNAS or other Data Plane first software
- Pros
- Data Plane first software. Isolated failure between control and data plane
- Cons
- Not virtualization first. Primarily focused on operating as a NAS and not as a hyper-converged platform. May result in a similar experience to option 3
- Stick with separate architecture
- Pros
- Isolated failures between control and data plane
- Cons
- Requires separate hardware for each TrueNAS instance (doubles nodes required)
- Running TrueNAS in a VM increases complexity and failure modes so dedicated nodes are preferred
- Does not solve the failure modes of having separate control and data planes
I ended up picking Proxmox VE with Cockpit + 45 Drives' plugins to minimize the impact of trading off a NAS-first UX.
Why ZFS and not Ceph?
Ceph is a... software-defined storage platform that provides object storage, block storage, and file storage built on a common distributed cluster foundation.
- Wikipedia
ZFS is old, boring, widely used and performant on single-node architectures.
This is really good if you do not value live replication. Unlike Ceph, ZFS only supports push/pull-based replication, relying on middleware to sync two nodes one way.
Ceph is pooled storage, meaning data is striped across multiple Ceph nodes. Ceph nodes in quorum (>50% of nodes online) coordinate and write to disk over the network. This allows data to be shared across multiple compute nodes, unlike single-node filesystems like ZFS.
With a similar concept to RAID, an entire data center can all work together to provide storage with multiple copies of data written to the cluster protecting against failures.
This means:
- You need significant network capacity (at least 10G networking for a home lab) to allow writes to happen quickly over the network
- Your data plane needs to be quick and handle significant IO as writes/reads happen across multiple nodes frequently to keep data in sync
- You need multiple nodes to benefit from pooled storage and all need to maintain quorum (>50% active)
- Because data can be written across multiple nodes in Ceph, you get redundancy built in and can survive some nodes going down as long as you keep quorum
- You can also read the same data from multiple nodes, which means you can move an exact workload from one node to another instantaneously, unlike ZFS, where nodes may differ slightly because replication is triggered manually.
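The >50% rule in the list above can be sketched in a few lines. This is a simplified model of majority quorum only; a real Ceph cluster tracks quorum among its monitor daemons rather than simply counting storage nodes, and the function names are made up for the example.

```python
def has_quorum(online: int, total: int) -> bool:
    """True when strictly more than half of the nodes are online."""
    return online * 2 > total


def tolerable_failures(total: int) -> int:
    """How many nodes can go down while the rest still hold a majority."""
    return (total - 1) // 2


for nodes in (2, 3, 5):
    print(f"{nodes} nodes can lose {tolerable_failures(nodes)} and keep quorum")
```

Two nodes cannot lose either one, since one out of two is not a majority. Three is the smallest cluster that survives a single failure, which is where the hardware cost starts to add up.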
Though I would like to build one, I do not have the funds to build a full data center and do not need the benefits of pooled storage. If a node goes down, I can already migrate a stateless container from one node to another automatically with Proxmox and manually move stateful ones if required.
Building 3 compute nodes means a major investment for little gain in my situation. ZFS is enough and keeps things simple and reliable.
7 year reflections
It's easy to overlook what your code runs on as a Software Engineer. There are pipelines that run, code that compiles and IDEs that guide you through the process, which makes it easy to both develop and forget about the metal that runs all these tools.
Sure, my initial drive was an interest in how hardware like this is managed and a wish to have sovereignty over my data. But as things went on, it grew into wanting to gain perspective on the physical side of Software Engineering.
Understanding why my script failed and is pinging me every 5 minutes at 3am is uncomfortable, sure. But in the end, understanding the whys in that moment is what allows me to empathize with these experiences and build software with a more complete picture in mind.
Changelog
- 29/04/2026 - Grammar fixes, additional links, grammar cleaning up, fixing statements, added conclusion
- 28/04/2026 - Initial Version