Background
I am looking at creating a compute cluster using "old" (8th Gen and above) Dell XPS laptops for compute-intensive workflows. In the end, I want to be able to run AI (LLM and tensor-based), ML, and other compute-heavy workloads on this cluster, though I'm not kidding myself about how good it will be at those.
Disclaimer: This is not meant to be the best or most efficient way to run large compute workloads, but is rather a hobby/side-project activity, so expect quite a lot of "sketchy" stuff, especially around the hardware setup. If you want a cost-effective way to run compute workloads, for CPU look at your standard PaaS or server providers (I love Hetzner) and for GPUs I use RunPod, which is hard to beat on per-second GPU pricing and usability.
The following areas will need to be considered:
- Preparing the hardware
- Provisioning the OS
- Installing all software to support monitoring and clustering
- Integration into compute workflows
1. Preparing the hardware
Laptops are far from ideal for server/clustering setups, primarily due to:
- A lid switch that suspends or powers off the OS when the lid is shut (fixable via systemd; see the snippet below)
- Limited connectivity
- Consumer (and often lower-end) components
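The lid switch, at least, is straightforward to deal with: systemd's logind can be told to ignore lid events. A minimal sketch, assuming a systemd-based distro:

```ini
# /etc/systemd/logind.conf - keep running as a headless node when the lid is shut
HandleLidSwitch=ignore
HandleLidSwitchExternalPower=ignore
HandleLidSwitchDocked=ignore
```

A `systemctl restart systemd-logind` (or a reboot) applies the change.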
The Dell XPS (13) product line compounds these issues, especially as:
- A maximum of 2 USB-C ports, which are also their ONLY data ports: no dedicated power connector, no LAN port, no USB-A
- Cooling (from experience) on these is quite poor
- Only integrated Intel graphics, so everything will likely run on the CPU
On the bright side, a built-in keyboard and screen allow for easier access if (or when) needed, and a battery serves as a built-in UPS (though it may be a good idea to have these running off mains power only, to avoid battery-related fires).
For now, I will try a minimal setup of 3 laptops (nodes), each powered via USB-C on one port and carrying a USB stick for the OS/data on the other (more on that below), with connectivity over Wi-Fi. Down the line, Wi-Fi will surely become the bottleneck, so I'll likely need to invest in some USB-C hubs with Ethernet (ideally 2.5 Gb, but we shall see).
Finally, for longer-term storage and persistence, I will likely make use of a NAS for now (Raspberry Pi with a connected SSD or something like that).
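As a sketch of what that could look like, assuming the Pi shares its SSD over NFS (paths and addresses are placeholders):

```bash
# On the Pi (NAS side): export a directory on the attached SSD
echo "/mnt/ssd/cluster 192.168.1.0/24(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -ra

# On each laptop node: mount the shared directory
sudo mount -t nfs 192.168.1.10:/mnt/ssd/cluster /mnt/cluster
```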
2. Provisioning the OS
Since these are borrowed from my workplace (thanks Uniun) and there are likely plans to recycle/donate these in the future, I do not want to use the built-in SSD, but ideally run the OS and the limited storage it will need off a USB stick/RAM.
The OS will, of course, be Linux, but I still need to figure out the distro/base and variant. I want a solid, well-supported, and well-documented base, have no need for a window manager or graphical interface, and ideally don't want to learn a whole new approach to Linux systems. Some early contenders are:
- Ubuntu Server (minimal) - strong, well-supported and documented base, good cloud-init support (see below)
- openSUSE MicroOS - container-focused, immutable, very small footprint
Ideally, I want a minimum of manual/interactive work when provisioning each laptop (node), so I'll need to look at something like cloud-init (specifically its NoCloud data source) or alternatives for initial user setup, SSH keys, network connectivity, etc.
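With NoCloud, the gist is that cloud-init reads a `user-data` and a `meta-data` file from a volume labelled `CIDATA` on first boot. A minimal sketch of a `user-data` file (hostname, username, and key are placeholders):

```yaml
#cloud-config
hostname: xps-node-01
users:
  - name: cluster
    groups: sudo
    shell: /bin/bash
    sudo: ALL=(ALL) NOPASSWD:ALL
    ssh_authorized_keys:
      - ssh-ed25519 AAAA... cluster-admin
package_update: true
packages:
  - openssh-server
```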
3. Installing software
From a general software management perspective, I will likely use Ansible (relying on some of Jeff Geerling's existing work here and here) to manage the software base on all nodes and to run some workflows on them too (such as benchmarks). The software itself will likely run on either a Docker Swarm or Kubernetes base, which will allow for easier deployment of different workflow types and distributed compute frameworks; more on that below. I have significant experience with Docker, Compose, and to a lesser extent Swarm, but little-to-no Kubernetes experience, though this would be a great project to learn it.
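As a sketch of the starting point, the Ansible side could be as simple as an inventory of the three nodes plus a playbook applying Jeff Geerling's Docker role (hostnames and addresses are placeholders):

```yaml
# inventory.yml
all:
  hosts:
    xps-node-01: { ansible_host: 192.168.1.21 }
    xps-node-02: { ansible_host: 192.168.1.22 }
    xps-node-03: { ansible_host: 192.168.1.23 }
  vars:
    ansible_user: cluster
```

```yaml
# site.yml - install Docker on every node via the geerlingguy.docker role
- hosts: all
  become: true
  roles:
    - geerlingguy.docker
```

Running it would then be a single `ansible-playbook -i inventory.yml site.yml`.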
4. Compute Frameworks
To actually make use of the compute cluster, specialized distributed software will be needed. I am currently eyeing up the following:
- exo - software that can run LLMs on a heterogeneous set of devices; seems promising, but it is still very early in development and has not seen any updates in the last few months
- distributed-llama - another solution for running LLMs across multiple machines
- llama.cpp (rpc-server) - llama.cpp configured to run in RPC server mode (see the sketch after this list)
- dask - Python library for parallel and distributed compute, with good integrations for many Python libraries (pandas, sklearn, etc.); also sketched after this list
- ray - similar to dask in that it provides an open-source library for running distributed and parallel Python workloads
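For the llama.cpp route, my understanding is that each worker runs the `rpc-server` binary, and a single `llama-cli` on the coordinating node offloads work to them over the network. A sketch, assuming binaries built with RPC support (`-DGGML_RPC=ON`) and placeholder addresses:

```bash
# On each worker node: expose its CPU/RAM to the cluster
./rpc-server --host 0.0.0.0 --port 50052

# On the coordinating node: run inference across all workers
./llama-cli -m model.gguf -p "Hello from the cluster" \
  --rpc 192.168.1.21:50052,192.168.1.22:50052
```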
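Likewise, for dask, the distributed setup is one scheduler plus a worker per node, with any client connecting over TCP. A minimal sketch with a placeholder scheduler address:

```python
# Start once (e.g. on one node):  dask scheduler
# Start on every node:            dask worker tcp://192.168.1.21:8786
from dask.distributed import Client
import dask.array as da

client = Client("tcp://192.168.1.21:8786")  # placeholder scheduler address

# Toy computation that dask splits into chunks across all workers
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
print(x.mean().compute())
```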
Across these, I want to run a number of LLM, AI, ML, and standard CPU benchmarks and compare the results with single-node and GPU-based setups.
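On the plain CPU side, one hedged starting point would be something like sysbench's CPU test, run per node and then across the cluster:

```bash
# Single-node CPU baseline: reports events per second across all cores
sysbench cpu --threads=$(nproc) --time=30 run
```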
Next Steps
The above are the main areas of focus for me right now and likely the order in which I will approach them, though there will certainly be refinement loops as I learn more about all of this. I will post all of my code and the overall setup on GitHub as I go and will be posting on this blog using the cluster tag, so follow along with me and let me know if you have any tips and tricks.