Running KISScluster - the design

Scope

We're talking about configuring/managing the VMs that are running on the cluster.

Goals

The cluster should be self-contained. That is, no external management station.

It should also be, well, clustered: when one node fails, management should still be possible from the other node.

"Managing VMs" means creating "VM skeletons"; all the configuration files and logical volumes that permit a VM do be started, but excluding operating system installation.

Design

Use git, with two (bare) repositories, one on each node, that are kept in sync.

On each node, there's also a working copy of the repo. However, in normal operation, it should only be used on the primary node: configs are edited there, applied (via ansible), and the changes are pushed to both nodes.

This means that in normal operation we can wrap it all into a simple, sequential script that just pushes local changes to the repos, and we don't have to handle possible conflicts that would arise when multiple working copies exist.

A simple check in that script aborts when it's not running on the primary node, and a --force option can be used to override it. This is the moment when the human has used the big hammer, so they're responsible for handling whatever git stuff is then necessary.
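
Roughly, the check amounts to something like this (the variable names and option handling are made up for illustration; the real script may differ):

  # sketch only: refuse to run on anything but the primary node, unless forced
  # ($primary_node and $force are assumed to be set by earlier option parsing)
  if [ "$(hostname -s)" != "$primary_node" ] && [ "$force" != "yes" ]; then
      echo "not on primary node '$primary_node' - use --force if you know what you're doing" >&2
      exit 1
  fi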

Stuff will be kept in

  • /srv/kiss-configure.git for the git repositories
  • /srv/kiss-configure for the working directory

The working directory will be writable by the adm group. The repositories will belong to the ansible user that will also be used to manage the VMs. The ssh keys used to authenticate the ansible user will have to be configured manually by you (I wanted to use a per-cluster key with group read permissions, but ssh does not allow that).
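
As a rough sketch (the exact ownership and modes are an assumption, not a prescription), setting that up by hand would look like:

  # hypothetical manual setup of the directories described above
  chown -R ansible: /srv/kiss-configure.git     # bare repo belongs to the ansible user
  chgrp -R adm /srv/kiss-configure              # working copy usable by the adm group
  chmod -R g+w /srv/kiss-configure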

git is configured with two push URLs for the "origin" remote, so a "git push" automatically pushes to both. Pulling happens from the first URL.
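
For reference, that remote setup boils down to something like this (the node names are placeholders):

  # fetch/pull from the first URL, push to both nodes
  git remote add origin node1:/srv/kiss-configure.git
  git remote set-url --add --push origin node1:/srv/kiss-configure.git
  git remote set-url --add --push origin node2:/srv/kiss-configure.git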

How VMs are configured

VMs are configured by placing a YAML file per VM in vm-config/cluster_name. See vm-config/EXAMPLE.YML for their content.
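
In practice, adding a VM boils down to something like this (cluster and VM names are placeholders, and the EXAMPLE.YML contents are what actually matter):

  # copy the example, adjust it, then apply
  cp vm-config/EXAMPLE.YML vm-config/mycluster/webserver.yml
  $EDITOR vm-config/mycluster/webserver.yml
  kiss-configure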

The big issue here is that for DRBD, we have to assign minor device numbers in the resource definitions. They can be changed, but it's a big hassle, so we have to keep them constant over the lifetime of a VM. Of course they also have to be unique. Since you may want to have VMs with more than one virtual disk, we also have to provide for multiple minors per VM.

So we have to manage them somehow, automatically. I've decided to reserve four minors per VM (which means you can have up to four virtual disks); the first minor is then also used as the VNC display number for that VM, and as the last octet in the MAC address.

So: we read the VM YAML configs, and we keep a cache of previously-known VMs (in vm-config/cluster_name.yml). From the cache, we know the maximum assigned minor. When we find a new VM definition, we assign the next minor, in steps of four. Then, we write those new assignments out to the cache.
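
To illustrate the arithmetic (numbers and variable names are just for illustration, not taken from the real helper):

  # sketch: a new VM gets the next block of four minors after the cache maximum
  max_minor=8                                  # highest first minor known from the cache
  first_minor=$(( max_minor + 4 ))             # e.g. 12 for the new VM; minors 12..15 are reserved
  vnc_display=$first_minor                     # VNC display number for this VM
  mac_octet=$(printf '%02x' "$first_minor")    # last octet of the MAC address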

You can disable a VM by renaming its YAML file to something not ending in .yml (or you could delete it); then it will not be processed anymore. Note that this does not stop, deconfigure or remove the VM, nor reassign its minor numbers; it just stops generating configs for it. The idea is that you can turn it off like you would a physical server, keeping it around for a bit because you know someone will come along and need something from that VM.
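
For example (file names are placeholders):

  # stop generating configs for this VM, but keep its definition around
  mv vm-config/mycluster/webserver.yml vm-config/mycluster/webserver.yml.disabled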

To really get rid of a VM (logical volumes, DRBD resource configs, systemd unit file), run kiss-nuke-vm vm-name on both cluster nodes.

What kiss-configure does

kiss-configure does a bit of git synching, and then goes off to run ansible on kiss-configure.yml, which in turn depends on the kisscluster-vms role.

That role runs the kisscluster-vm-inventory helper script, which takes the YAML files and the cache and generates a fresh cache.

That cache is then read by ansible, and its active_vm list is iterated over to:

  • create root logical volume

  • generate DRBD resource config

  • create additional logical volumes, if any

  • initialize the LVs for DRBD

  • start DRBD initial sync

All this is only done when the root logical volume didn't exist before.
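
Roughly, and only as a hedged sketch (names, sizes and the exact tasks in the role will differ), this amounts to something like:

  # hypothetical example for a VM named "webserver"
  lvcreate -L 20G -n webserver-root vg0     # root logical volume (additional LVs work the same way)
  # (write /etc/drbd.d/webserver.res from the template, using the reserved minors)
  drbdadm create-md webserver               # initialize DRBD metadata on the backing LV
  drbdadm up webserver
  drbdadm primary --force webserver         # on one node only: kick off the initial sync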

Disabled VMs (cache key disabled_vm) are disabled in systemd (but not stopped), which basically means you can manually start/stop them, but the cluster will ignore them.

Active VMs have their service files re-generated at every invocation, so you can change the service templates with all the qemu options and they will be updated.

Active VMs are also enabled in systemd, because that's the meaning of "active".
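
Assuming the generated units are simply named after the VM (the unit name here is a guess; the real one comes from the service template), this corresponds to:

  # what the role effectively does per VM
  systemctl enable kiss-vm-webserver.service      # active VM: managed by the cluster
  systemctl disable kiss-vm-webserver.service     # disabled VM: manual start/stop still possible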