A few months ago I rebuilt my router on an espressobin and got the itch to overhaul the rest of my homelab.
While I could pick up some post-market AmaFaceGooSoft equipment for a typical high-power x86 lab, I decided to put the devops mantra of a distributed, fault-tolerant architecture to work and see how far I could get with lots of small, cheap, low-power machines instead.
@Perry I just tack an extra configuration option onto my Nomad servers’ systemd service files to inject a Vault token into their environment, and as long as the cluster is running, token renewals happen indefinitely (via a periodic, orphaned token):
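Something along these lines, with the drop-in path and token value standing in as placeholders for the real periodic, orphaned token:

    # hypothetical drop-in at /etc/systemd/system/nomad.service.d/vault.conf
    [Service]
    Environment=VAULT_TOKEN=s.XXXXXXXXXXXXXXXX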
For anyone else coming here and considering a similar setup: HC2 is 32-bit and GlusterFS has officially dropped support for that, so it might explain at least some stability issues. Other users have reported recent versions not running at all on 32-bit arm.
@tylerjl Maybe you’d consider editing your blog post to include that, as I can imagine many will end up there and make a purchase thinking it’ll still work fine (I know I almost did!)
Ceph still seems to be working fine, though.
I’m still on the hunt for a low-profile arm64 board with SATA and GbE to use for a gluster cluster.
BTW, didn’t know that go templates were supported in hcl configs, that makes life a lot easier!
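(If that refers to the go-sockaddr style templates that the Nomad and Consul agent configs accept for address fields, a small illustration might look like the following; the interface name "eth0" is just a placeholder:)

    # e.g. in a Nomad (or Consul) agent config, resolving the bind address at run time
    bind_addr = "{{ GetInterfaceIP \"eth0\" }}"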
A great starting point for setting up consul and nomad is brianshumate’s Ansible roles for consul/nomad/vault - make any changes directly in your playbooks as suggested and you won’t have to scratch your head later trying to remember how you configured everything.
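A minimal playbook wiring those roles together might look something like this (assuming the Galaxy role names; the host group "cluster" is a placeholder and role variables are omitted):

    # site.yml - hypothetical sketch
    - hosts: cluster
      become: true
      roles:
        - brianshumate.consul
        - brianshumate.nomad
        - brianshumate.vault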
A really interesting overview of distributed computing. I’ve been thinking about something similar for my own homelab for a while, it especially appeals to me for storage availability.
How have you found the performance of the compute portion when performing heavier tasks/larger services? I run a few services on my current home server that are pretty heavy, especially when it comes to RAM usage. How well would you expect it to perform?
Low-power ARM boards are just going to be overall slower in many cases, but I’ve typically found that CPU or memory bottlenecks aren’t too problematic. Overall, the downsides I’ve found are:
CPU-intensive periods are exacerbated, particularly for things like JVM startup times, or when lots of I/O occurs that hits the SD card. When Nomad schedules a new workload, that can cause a flurry of activity. Day-to-day, when things are in a steady state, this isn’t really a problem.
When using a scheduler like nomad (or k3s or k8s), each workload needs some sort of resource allocation, and carefully carving up RAM and CPU is tricky because those resources are so scarce on these boards. Workloads will be happier with more RAM, but there’s so little to go around that it’s often easier to just err on the side of more rather than less RAM and let the scheduler spread workloads around (a rough sketch of such an allocation follows below).
Storage delays/latency probably need tuning. Using a gluster or NFS mount can be really tricky when mounting it into a container, so doing a little testing and tuning mount flags is worth it.
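As mentioned above, here’s roughly what carving out one of those allocations looks like in a Nomad job file; the task name and numbers are purely illustrative, not recommendations:

    task "app" {
      driver = "docker"

      resources {
        cpu    = 200  # MHz
        memory = 256  # MB
      }
    }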
Hey! I actually read this post a while ago, but haven’t yet posted a comment (or have I? I don’t remember). Anyway, I’m building my own Raspberry Pi-based cluster inspired by your setup. I’m using Unbound + Nomad + Consul + Fabio (+ wesher) + Vault (though I haven’t finished the setup there yet) as the backbone of my cluster.
I’m getting to the point in the setup where I want to start running services on my cluster. To this end, I’m concerned about keeping containers up to date. How do you keep your containers updated, and do you build any of your own containers? If so, how do you rebuild those to keep them updated too?
@sbrl wesher is a really cool project! Thanks for sharing that, I might use that in the future.
In answer to your question, all my updates are manual. I typically pull and then push images to my locally-hosted docker registry (to ensure I can run applications even when my network isn’t connected to the internet). Some images I just pull and then host locally (like bitwarden_rs) and others I need to build from scratch (such as my self-hosted instance of stringer for RSS reading). Ultimately it’s a tradeoff between having to do upgrade maintenance manually versus applications breaking unexpectedly - if I put some system in place to automatically roll out upgrades, there’s a potential for breaking changes to unexpectedly cause downtime. The way things are set up now, they’re stable until I do maintenance.
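As a rough sketch of that mirroring flow (the registry address and image tag here are placeholders, not necessarily what I run):

    docker pull bitwardenrs/server:latest
    docker tag bitwardenrs/server:latest registry.service.consul:5000/bitwardenrs/server:latest
    docker push registry.service.consul:5000/bitwardenrs/server:latest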
It’s certainly a choice that makes a significant impact on how a cluster might operate, and there are merits to different approaches.
Hi! Nice write-up! I have a few questions remaining.
With consul operational, I took one additional step to make checking in on services that register with consul from non-cluster machines (like my laptop or phone) a little easier. The following line in my router’s /etc/dnsmasq.conf instructs it to forward requests for consul domains to the local consul listener:
server=/consul/127.0.0.1#8600
How can the router forward the *.consul requests to the local listener? What do you mean by local listener? Is it a consul node? I cannot understand how forwarding consul to 127.0.0.1:8600 helps.
That’s correct - consul is running on the router as well, with the daemon bound to just the LAN-side interfaces to avoid leaking anything to the WAN/internet side. I like to run Consul on every machine in my network since it’s a helpful aid to auto-register services or easily hook into services on my network that need consensus (like Vault or Nomad).
Ultimately that configuration line just sends lookups like foo.service.consul to the instance of consul running on the router.
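For illustration, pinning the router’s consul agent to just the LAN side can be done with something like the following in its agent config, with 192.168.1.1 standing in for the router’s LAN address:

    bind_addr   = "192.168.1.1"
    client_addr = "192.168.1.1"

client_addr is what the DNS listener on port 8600 binds to, which is what the dnsmasq line above points at.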
How are you securing the nomad and consul clusters? The official guides recommend mTLS (say, with vault). But vault needs consul (circular dependency). I am also thinking about running vault as a nomad job!
@blmhemu you’re right that there are some circular dependencies here, and I’ve fought a few of them now and then. At least for certs, my strategy has been to provision very long-lived certificates when I use them as part of my core infrastructure to avoid getting into a situation where I may need a new one but my infrastructure isn’t able to provision new ones. For example, a Nomad node may have a cert with a 100-year expiry, as do my GlusterFS cluster servers.
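Issuing one of those long-lived certs from a Vault PKI mount looks roughly like this (the mount path, role name, and common name are placeholders, and the mount’s max_lease_ttl has to be raised to allow a TTL that long):

    vault write pki/issue/nomad-server \
        common_name=server.global.nomad \
        ttl=876000h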
That has worked pretty well for certs, but I’ve avoided running Vault itself as a Nomad workload because that circular dependency is much harder to break out of. For example, running Vault under Nomad can be really tricky if the cluster needs to be started up “cold” - the Nomad servers need Vault available, and without Nomad servers, containers can’t be allocated. This is as opposed to certs provisioned through Vault, which can keep functioning just fine if Vault itself is down.