(qemu-microvm)=
# QEMU microvm

QEMU microvm is a special case of virtual machine (VM), designed to be optimised for initialisation speed and minimal resource use. The underlying concept of a microvm is to give up some capabilities of standard QEMU in order to reduce complexity and gain speed.

Maybe - for your use case - you do not need the hypervisor to be able to pretend to have a network card from the 90s, to emulate a CPU of a foreign architecture, or to live migrate with external I/O going on. In such cases a lot of what QEMU provides is not needed, and a less complex approach like microvm might be interesting to you.

All of that is a balance that needs to be decided by the needs of your use case. There will surely be arguments and examples going both ways - so be careful. Giving up unnecessary features to gain speed is great, but it is not so great if - some time after deploying your project - you realise that you now need a feature only available in the more complete solution.

QEMU provides additional components that were added to support this special use case:

1. The [`microvm` machine type](https://www.qemu.org/docs/master/system/i386/microvm.html)
1. Alternative simple firmware (FW) that can boot Linux [called `qboot`](https://github.com/bonzini/qboot)
1. Ubuntu has a QEMU build with reduced features matching these use cases called `qemu-system-x86-microvm`

## Why a special workload?

One has to understand that minimising the QEMU initialisation time only yields a small gain, by shaving off parts of a task that usually does not take long. That is only worth it if the workload you run does not take much longer anyway - for example, by booting a fully generic operating system, followed by more time to completely initialise your workload.

There are a few common ways to adapt a workload to match this:

- Use faster bootloaders and virtual firmware (see `qboot` below) with a reduced feature set, not as generally capable but sufficient for a particular use case.
- Even the fastest bootloader is slower than no bootloader, so often the kernel is passed directly from the host filesystem. A drawback of this solution is that the guest system no longer has control over the kernel, which restricts what can be done inside the guest system.
- Sometimes a simpler user space like [busybox](https://www.busybox.net/) or a container-like environment is used.
- In a similar fashion, a customised kernel build with a reduced feature set, containing only what is needed for a given use case.

A common compromise of the above options is aligning virtualization with container paradigms. While behaving mostly like a container, those tools use virtualization instead of namespaces for the isolation. Examples of that are:

- container-like as in [Kata Containers](https://katacontainers.io/),
- function-based services as in [Firecracker](https://firecracker-microvm.github.io/),
- system containers as in {ref}`LXD `. In particular, {ref}`LXD ` added a VM mode to allow the very same UX with namespaces and virtualization.

Other related tools are more about creating VMs from containers, like:

- [slim from Dockerfiles](https://github.com/ottomatica/slim) or
- [krunvm from OCI images](https://github.com/containers/krunvm).

There are more of these out there, but the point is that one can mix and match to suit their needs. At the end of the day, many of the above use the same underlying technology of namespaces or QEMU/KVM.

This page tries to stick to the basics and does not pick any of the higher-level systems mentioned above. Instead it sticks to just QEMU, to show how its ideas of reduced firmware and microvms play into all of this.
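To follow the examples on this page you need both the standard and the feature-reduced QEMU builds. As described further below, Ubuntu ships the reduced build alongside the standard one in the `qemu-system-x86` package; the quick check below assumes a current Ubuntu release, where the firmware images referenced later (such as `qboot.rom`) end up under `/usr/share/qemu/`:

```bash
# Install the x86 system emulation build of QEMU; on current Ubuntu releases
# this also provides the feature-reduced qemu-system-x86_64-microvm binary
$ sudo apt install qemu-system-x86

# Confirm the reduced binary is present and list the machine types it offers
$ qemu-system-x86_64-microvm -machine help
```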
## Create the example workload artifact

To create an example of such a small workload, we follow the tutorial on how to build a [sliced rock](https://documentation.ubuntu.com/rockcraft/en/stable/tutorials/chisel/). From this tutorial one gets an [OCI-compatible](https://github.com/opencontainers/image-spec/blob/main/spec.md) artifact called `chiselled-hello_latest_amd64.rock`. That is now converted to a disk image for use as a virtual disk in our later example.

```bash
# Convert the artifact of the ROCK tutorial into OCI format
$ sudo rockcraft.skopeo --insecure-policy copy oci-archive:chiselled-hello_latest_amd64.rock oci:chiselled-hello.oci:latest
# Unpack that to a local directory
$ sudo apt install oci-image-tool
$ oci-image-tool unpack --ref name=latest chiselled-hello.oci /tmp/chiselled-hello
# Create some paths the kernel would be unhappy about if they were missing
$ mkdir /tmp/chiselled-hello/{dev,proc,sys,run,var}
# Convert the directory to a qcow2 image
$ sudo apt install guestfs-tools
$ sudo virt-make-fs --format=qcow2 --size=50M /tmp/chiselled-hello chiselled-hello.qcow2
```

## Run the stripped workload in QEMU

Now that we have a stripped-down workload as an example, we can run it in standard QEMU and see that this is much quicker than booting a full operating system.

```bash
$ sudo qemu-system-x86_64 -m 128M -machine accel=kvm \
    -kernel /boot/vmlinuz-$(uname -r) \
    -append 'console=ttyS0 root=/dev/vda fsck.mode=skip init=/usr/bin/hello' \
    -nodefaults -no-user-config \
    -display none -serial mon:stdio \
    -drive file=chiselled-hello.qcow2,index=0,format=qcow2,media=disk,if=none,id=virtio1 \
    -device virtio-blk-pci,drive=virtio1
...
[    2.116207] Run /usr/bin/hello as init process
Hello, world!
```

Breaking down the command-line elements and their purpose in this context:

| **Command-line element** | **Explanation** |
| ------------------------ | --------------- |
| `sudo` | `sudo` is a simple way for this example to work, but is not recommended. Scenarios outside of an example should use separate kernel images and a user that is a member of the `kvm` group to access `/dev/kvm`. |
| `qemu-system-x86_64` | Call the usual QEMU binary used for system virtualization. |
| `-m 128M` | Allocate 128 megabytes of RAM to the guest. |
| `-machine accel=kvm` | Enable KVM. |
| `-kernel /boot/vmlinuz-$(uname -r)` | Load the currently running kernel for the guest. |
| `-append '...'` | Pass four arguments to the kernel, explained one by one in the following rows. |
| `console=ttyS0` | Tell the kernel which serial console it should send its output to. |
| `root=/dev/vda` | Inform the kernel where to expect the root device, matching the `virtio-block` device we provide. |
| `fsck.mode=skip` | Instruct the kernel to skip filesystem checks, which saves time. |
| `init=/usr/bin/hello` | Tell the kernel to directly start our test workload. |
| `-nodefaults` | Do not create the default set of devices. |
| `-no-user-config` | Do not load any user-provided config files. |
| `-display none` | Disable video output (due to `-nodefaults` and `-display none` we do not also need `-nographic`). |
| `-serial mon:stdio` | Map the virtual serial port and the monitor (for debugging) to stdio. |
| `-drive ... -device ...` | Provide our test image as a virtio-based block device. |
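When comparing the variants that follow, it helps to keep the serial output of each run. The following is a simple sketch of how that could be done, reusing the invocation from above and taking the kernel's own bracketed timestamps as the measure (the `boot.log` file name is just an example):

```bash
# Same invocation as above, but keeping a copy of the serial output in boot.log
$ sudo qemu-system-x86_64 -m 128M -machine accel=kvm \
    -kernel /boot/vmlinuz-$(uname -r) \
    -append 'console=ttyS0 root=/dev/vda fsck.mode=skip init=/usr/bin/hello' \
    -nodefaults -no-user-config \
    -display none -serial mon:stdio \
    -drive file=chiselled-hello.qcow2,index=0,format=qcow2,media=disk,if=none,id=virtio1 \
    -device virtio-blk-pci,drive=virtio1 | tee boot.log

# The bracketed kernel timestamp of the last message is a rough measure of
# how long the guest needed from kernel start to running the workload
$ grep 'as init process' boot.log
[    2.116207] Run /usr/bin/hello as init process
```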
After running this example we notice that, by changing the workload to something small and streamlined, the execution time went down from about 1 minute (when booting a bootloader into a full OS into a workload) to about 2 seconds (from when the kernel started accounting time), as expected.

For the purpose of what this page wants to explain, it is not important to be perfectly accurate and stable. We are now in the right order of magnitude (seconds instead of a minute) with regard to the _overall time spent_, so we can begin focusing on the time that the initialisation of firmware and QEMU consumes.

## Using `qboot` and `microvm`

In the same way as `qemu-system-x86-microvm` is a reduced QEMU, [qboot](https://github.com/bonzini/qboot) is a simpler alternative to the extended feature sets of [SeaBIOS](https://www.seabios.org/SeaBIOS) or [UEFI](https://github.com/tianocore/edk2): it can do less, but is therefore faster at what it can do. If your system does not need the extended feature sets, try `qboot` and see whether it gives you an improvement for your use case. To do so, add `-bios /usr/share/qemu/qboot.rom` to the QEMU command line.

[QEMU microvm](https://github.com/qemu/qemu/blob/master/docs/system/i386/microvm.rst) is a machine type inspired by [Firecracker](https://firecracker-microvm.github.io/) and constructed after its machine model. In Ubuntu we provide this on x86 as `qemu-system-x86_64-microvm` alongside the _standard_ QEMU in the package `qemu-system-x86`.

Microvm aims for maximum compatibility by default; this means that you will probably want to switch off some more legacy devices that are not shown in this example (a sketch of such a further-reduced invocation follows the table below). But for what is shown here, we want to keep it comparable to the non-microvm invocation. For more details on what else could be disabled, see [microvm](https://github.com/qemu/qemu/blob/master/docs/system/i386/microvm.rst#running-a-microvm-based-vm).

Run the guest in `qemu-system-x86_64-microvm`:

```bash
$ sudo qemu-system-x86_64-microvm -m 128M -machine accel=kvm \
    -bios /usr/share/qemu/qboot.rom \
    -kernel /boot/vmlinuz-$(uname -r) \
    -append 'console=ttyS0 root=/dev/vda fsck.mode=skip init=/usr/bin/hello' \
    -nodefaults -no-user-config \
    -display none -serial mon:stdio \
    -drive file=chiselled-hello.qcow2,index=0,format=qcow2,media=disk,if=none,id=virtio1 \
    -device virtio-blk-device,drive=virtio1
```

Breaking down the changes to the command-line elements and their purpose:

| **Command-line element** | **Explanation** |
| ------------------------ | --------------- |
| `qemu-system-x86_64-microvm` | Call the lighter, feature-reduced QEMU binary. |
| `-bios /usr/share/qemu/qboot.rom` | Running QEMU as `qemu-system-x86_64-microvm` auto-selects `/usr/share/seabios/bios-microvm.bin`, which is a simplified SeaBIOS for this purpose. But for the example shown here we want the even simpler `qboot`, so in addition we set `-bios /usr/share/qemu/qboot.rom`. |
| _info_ | QEMU auto-selects the microvm machine type, equivalent to `-M microvm`, which therefore does not need to be explicitly included here. |
| `... virtio-blk-device ...` | This feature-reduced QEMU only supports `virtio-bus`, so we need to switch the device type from `virtio-blk-pci` to `virtio-blk-device`. |
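As mentioned above, the `microvm` machine type accepts options to switch off more of the remaining legacy devices. The following is only a sketch of such a further-reduced invocation: the machine flags and the switch from the ISA serial port to a virtio console (`hvc0`) are modelled on the upstream microvm documentation linked above, have not been benchmarked here, and are not part of the comparison measured on this page:

```bash
$ sudo qemu-system-x86_64-microvm -m 128M \
    -M microvm,accel=kvm,x-option-roms=off,pit=off,pic=off,isa-serial=off,rtc=off \
    -bios /usr/share/qemu/qboot.rom \
    -kernel /boot/vmlinuz-$(uname -r) \
    -append 'console=hvc0 root=/dev/vda fsck.mode=skip init=/usr/bin/hello' \
    -nodefaults -no-user-config -display none \
    -chardev stdio,id=virtiocon0 \
    -device virtio-serial-device \
    -device virtconsole,chardev=virtiocon0 \
    -drive file=chiselled-hello.qcow2,index=0,format=qcow2,media=disk,if=none,id=virtio1 \
    -device virtio-blk-device,drive=virtio1
```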
> Sadly, polluting this nice showcase, there is currently an issue with the
> RTC initialisation not working in this mode - which makes the guest
> kernel wait ~1.3 + ~1.4 seconds. See this
> [QEMU bug](https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/2074073)
> if you are curious about that.
>
> But these changes were not about making the guest faster once it runs;
> they are mostly about the initialisation time (and kernel init time, by
> having less virtual hardware). And that we can check despite this issue.

On average across a few runs (albeit not in a very performance-controlled environment) we can see the kernel start time to be 282ms faster comparing _normal QEMU_ to `microvm`, and another 526ms faster comparing `microvm` to `microvm`+`qboot`.

As mentioned, one could go further from here by disabling more legacy devices, using an `hvc` console (as in the sketch above), customising the guest CPU, switching off more subsystems like ACPI, or customising the kernel that is used. But this was meant to be an example of how `microvm` can be used in general, so we won't make it more complex for now.

## Alternative - using virtiofs

Another common path, not fully explored in the example above, is sharing the content with the guest via `virtiofsd`. Doing so for our example could start with a conversion of the container artifact above to a shareable directory:

```bash
# Copy out the example the tutorial had in OCI format
$ sudo rockcraft.skopeo --insecure-policy copy oci-archive:chiselled-hello_latest_amd64.rock oci:chiselled-hello.oci:latest
# Unpack that to a directory
$ sudo apt install oci-image-tool
$ oci-image-tool unpack --ref name=latest chiselled-hello.oci /tmp/chiselled-hello
```

Exposing that directory to a guest via `virtiofsd`:

```bash
$ sudo apt install virtiofsd
$ /usr/libexec/virtiofsd --socket-path=/tmp/vfsd.sock --shared-dir /tmp/chiselled-hello
...
[INFO virtiofsd] Waiting for vhost-user socket connection...
```

To the QEMU command line one would then add the following options:

```bash
... -object memory-backend-memfd,id=mem,share=on,size=128M \
    -numa node,memdev=mem -chardev socket,id=char0,path=/tmp/vfsd.sock \
    -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=myfs ...
```

This allows the user to mount the share from inside the guest via `$ mount -t virtiofs myfs /mnt`, or, if you want to use it as the root filesystem, you can pass the kernel parameters `rootfstype=virtiofs root=myfs rw`.
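Putting those pieces together, a complete invocation that uses the shared directory as the root filesystem could look like the sketch below. It simply combines the standard-QEMU command from earlier with the virtiofs options above (assuming `virtiofsd` is already listening on `/tmp/vfsd.sock`); note that the memory backend size has to match the `-m` value:

```bash
$ sudo qemu-system-x86_64 -m 128M -machine accel=kvm \
    -kernel /boot/vmlinuz-$(uname -r) \
    -append 'console=ttyS0 rootfstype=virtiofs root=myfs rw fsck.mode=skip init=/usr/bin/hello' \
    -nodefaults -no-user-config \
    -display none -serial mon:stdio \
    -object memory-backend-memfd,id=mem,share=on,size=128M \
    -numa node,memdev=mem \
    -chardev socket,id=char0,path=/tmp/vfsd.sock \
    -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=myfs
```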