Running Incus with ZFS in a VM and moving containers to a new storage pool

For many years, I have been relying on the virtualization and container management tool “Incus” (formerly “LXD”) to host my services. Incus runs in a virtual machine and helps me to create separation at the application level. For example, there is an Incus container for trashserver.net, another for metalhead.club, etc. The root file systems of the individual containers are located in a ZFS file system. This allows me to create space-saving snapshots of my containers before critical maintenance actions, e.g., before updates or operating system upgrades.

Since I recently upgraded the underlying storage, I would like to briefly introduce my storage setup and document for myself (but also for you ;-) ) what I paid attention to and how I moved my containers to the new storage.

[Diagram: storage setup]

At the hypervisor level (i.e., on the physical server), storage for my containers is provided in the form of LVM volumes. This allows me to offer the large total storage as many individual volumes to other VMs as well, and to size each one according to its resource requirements.
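
For illustration, creating such a volume on the hypervisor could look like this (the volume group name “vg0” and the size are assumptions, not my actual values):

# Hypothetical example: carve a dedicated LV for the Incus VM out of the volume group "vg0"
lvcreate -L 2T -n megastor vg0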

An LVM volume (block device) is then passed to the VM via libvirt/QEMU and the virtio-blk driver. I use the following parameters for storage in libvirt:

  • Driver: virtio-blk
  • Cache mode: none (caching is already handled by ZFS)
  • io = native (usually performs better than io=threads)
  • discard = unmap (for TRIM support)
  • serial = megastor
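
Put together, the disk definition in the libvirt domain XML might look roughly like the following sketch (the LV path /dev/vg0/megastor is an assumption):

<disk type='block' device='disk'>
  <!-- cache/io/discard correspond to the parameters listed above -->
  <driver name='qemu' type='raw' cache='none' io='native' discard='unmap'/>
  <!-- hypothetical LV path on the hypervisor -->
  <source dev='/dev/vg0/megastor'/>
  <!-- bus=virtio selects the virtio-blk driver -->
  <target dev='vdb' bus='virtio'/>
  <serial>megastor</serial>
</disk>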

I would like to highlight the “serial” setting, which lets you assign a serial number/ID/unique name to the passed-through block device. If you attach multiple storage devices to your VM, each one can be identified unambiguously. For example, a device with the serial “megastor” will appear in the VM as “/dev/disk/by-id/virtio-megastor”. This is particularly helpful if…

  • … you are no longer sure which storage device is actually behind /dev/vdb and /dev/vdd.
  • … or you want to remove /dev/vdc but are afraid that after the next reboot /dev/vdd will move up and take its place. Then the chaos would be complete.

This is because device names such as /dev/vdX are not “stable” under Linux! Only partition UUIDs (or, alternatively, disk IDs) are stable.
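
If you ever need to know which /dev/vdX node is currently behind a stable name, you can simply resolve the symlink:

# Resolve the stable by-id name to the current kernel device node
# (the letter may change between reboots; the by-id name does not)
readlink -f /dev/disk/by-id/virtio-megastor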

In the VM, the block device is used to create a simple ZFS pool, e.g. via:

zpool create -o ashift=12 megastor /dev/disk/by-id/virtio-megastor

The ZFS pool is also registered in Incus so that it can be used with the containers:

incus storage create megastor zfs source=megastor --description="megastor storage for containers"
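
After that, the pool can be selected per container. As a quick check (the image alias is an assumption about which remotes/aliases you have available):

# Show the pool as Incus sees it
incus storage show megastor

# Launch a test container directly on the new pool
incus launch images:debian/12 testct --storage megastor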

TRIM support

For a long time, only the virtio-scsi driver supported TRIM for marking and releasing free storage space on the SSD. Regular trimming leads to significantly better performance. For a few QEMU versions now, however, the leaner virtio-blk driver has supported trimming as well.

Incidentally, the zfsutils-linux package ships a cron job (/etc/cron.d/zfsutils-linux) that regularly trims ZFS pools, so it is not necessary to run fstrim yourself.
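
If you do not want to wait for the cron job, a trim can also be triggered and monitored by hand:

# Kick off a manual TRIM of the pool
zpool trim megastor

# Show TRIM progress per vdev
zpool status -t megastor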

However, it is important that, as mentioned above, discard=unmap is set on the QEMU storage device. In my case, a “discard” option must also be added to /etc/crypttab at the hypervisor level because of the LUKS encryption. Otherwise, trim commands will not be passed down through the lower storage layers. Example:

luks-d429b69b-1c1a-234a-8bcf-64232e83f156 UUID=d429b69b-1c1a-234a-8bcf-64232e83f156 /var/lib/megastor.key discard

In addition, LVM should also be configured appropriately (vim /etc/lvm/lvm.conf):

issue_discards = 1
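
Whether discards actually make it through all layers can be checked with lsblk: non-zero values in the DISC-GRAN and DISC-MAX columns mean that the device accepts discard requests:

# List discard capabilities per block device (run in the VM and on the hypervisor)
lsblk --discard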

Moving existing Incus containers to new storage

As mentioned at the beginning, I installed new SSDs in the system and set up new ZFS pools accordingly, so now the Incus containers should also be moved to the new “megastor” pool. There are several methods for doing this…

The easy way with a few minutes of downtime

This is how it works with Incus:

incus stop mycontainer
incus move mycontainer --storage megastor

The container is stopped before the transfer, and the move takes a certain amount of time depending on the size of the container and the performance of the storage, so the container is down for the duration. However, this method is less complex. It is worthwhile for smaller containers of around 50–100 GB.

Incremental transfer with only brief downtime

If you want to avoid downtime as much as possible, you can proceed differently:

  1. Create a snapshot of the source container during operation.
  2. Copy the snapshot to the destination in the background.
  3. Stop the source container (start of downtime).
  4. Transfer the difference since the last transfer (usually only a few MB to GB).
  5. Remove the old container and rename the new container.
  6. Start the new container (end of downtime).

Downtime is reduced to a minimum. I have already described the basic procedure in my article “Move a Mastodon instance with less than 3 minutes downtime (LXD/ZFS-based)”, where I bypassed LXD, so to speak. However, the same works with Incus’s built-in tools alone and can also be applied to moving a container from one storage pool to another (not just to transfers between two hosts!).

# 1. Create a snapshot during operation
incus snapshot create mycontainer

# 2. Transfer the snapshot to the new storage “megastor”
incus copy mycontainer new-mycontainer --storage megastor

# 3. Stop the container and create another snapshot
incus stop mycontainer
incus snapshot create mycontainer

# 4. Transfer changes since the start of the first transfer
incus copy mycontainer new-mycontainer --storage megastor --refresh

# 5. Delete old container, rename new container
incus delete mycontainer
incus move new-mycontainer mycontainer

# 6. Start new container
incus config unset mycontainer volatile.apply_template
incus start mycontainer

The incus config unset... ensures that the new container retains the old MAC addresses, IPs, etc., and is not assigned new addresses.

If you want, you can also insert additional “snapshot - copy” sequences between steps 2 and 3:

incus snapshot create mycontainer
incus copy mycontainer new-mycontainer --storage megastor --refresh

This is particularly useful if a lot of time has passed since the first transfer. If a lot has changed in the container during this time, the final sync will also take significantly longer and result in longer downtime. This can be counteracted by approaching it in several “snapshot - copy” steps. It is important not to forget the --refresh flag after the first copy command.
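
If you want to script these extra rounds, it could look something like this sketch (the fixed number of rounds is an arbitrary assumption; you could just as well loop until a refresh completes quickly enough):

# Hypothetical helper: a few extra snapshot/refresh rounds to shrink the final delta
for i in 1 2 3; do
    incus snapshot create mycontainer
    incus copy mycontainer new-mycontainer --storage megastor --refresh
done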