Christine Dodrill 6b234323a0
tstest/integration/vms: fix flake when testing (#2145)
Occasionally the test framework would fail with a timeout due to a
virtual machine not phoning home in time. This seems to happen
whenever qemu can't bind the VNC or SSH ports for a virtual machine.
This was fixed by taking the following actions:

1. Don't listen on VNC unless the `-use-vnc` flag is passed. This
   removes the need to listen on VNC at all in most cases. The option to
   use VNC is still left in for debugging virtual machines, but removing
   this makes it easier to deal with (VNC uses this odd system of
   "displays" that are mapped to ports above 5900, and qemu doesn't
   offer a decent way to use a normal port number, so we just disable
   VNC by default as a compromise).
2. Use a (hopefully) inactive port for SSH. In an ideal world I'd just
   have the VM's SSH port be exposed via a Unix socket, however the QEMU
   documentation doesn't really say if you can do this or not. While I
   do more research, this stopgap will have to make do.
3. Strictly tie more VM resource lifetimes to the tests themselves.
   Previously the disk image layers for virtual machines were only
   cleaned up at the end of the test and existed in the parent
   test-scoped temporary folder. This can make your tmpfs run out of
   space, which is not ideal. This should minimize the use of temporary
   storage as much as I know how to.
4. Strictly tie the qemu process lifetime to the lifetime of the test
   using testing.T#Cleanup. Previously it used a defer statement to
   clean up the qemu process, however if the tests timed out this defer
   was not run. This left around an orphaned qemu process that had to be
   killed manually. This change ensures that all qemu processes exit
   when their relevant tests finish.
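
As a rough illustration of item 2, one common way to find a (hopefully) inactive port is to listen on port 0 and let the kernel pick one. This is only a sketch under that assumption and not necessarily what this change does: `freePort` is a hypothetical helper, and the `hostfwd` string merely shows how such a port might be passed to qemu's user-mode networking.

```go
package main

import (
	"fmt"
	"net"
)

// freePort asks the kernel for an unused TCP port on the loopback interface
// by listening on port 0, then closes the listener and returns the port.
// Another process could claim the port after the listener is closed, so this
// makes a collision unlikely rather than impossible.
func freePort() (int, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer ln.Close()
	return ln.Addr().(*net.TCPAddr).Port, nil
}

func main() {
	port, err := freePort()
	if err != nil {
		fmt.Println("could not find a free port:", err)
		return
	}
	// Illustrative only: forward the chosen host port to the guest's SSH port.
	fmt.Printf("hostfwd=tcp:127.0.0.1:%d-:22\n", port)
}
```

The window between closing the listener and qemu binding the port is why this remains a stopgap rather than a guarantee.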

Signed-off-by: Christine Dodrill <xe@tailscale.com>
2021-06-25 14:45:12 -04:00

End-to-End VM-based Integration Testing

This test spins up a number of common Linux distributions and then tries to get them to connect to a testcontrol server.

Running

This test currently only runs on Linux.

This test depends on the following command line tools:

This test also requires the following:

  • about 10 GB of temporary storage
  • about 10 GB of cached VM images
  • at least 4 GB of RAM for virtual machines
  • hardware virtualization support (KVM) enabled in the BIOS
  • the kvm module to be loaded (modprobe kvm)
  • the user running these tests must have access to /dev/kvm (being in the kvm group should suffice)

This test optionally requires an AWS profile to be configured at the default path. The S3 bucket is set so that the requester pays, so please keep this in mind when running these tests on your machine. If you are uncomfortable with the cost of downloading from S3, pass the -no-s3 flag to disable downloads from S3. However, keep in mind that some distributions do not use stable URLs for each individual image artifact, so there may be spurious test failures as a result.

If you are using Nix, you can run all of the tests with the correct command line tools using this command:

$ nix-shell -p openssh -p go -p qemu -p cdrkit --run "go test . --run-vm-tests --v --timeout 30m"

Keep the timeout high for the first run, especially if you are not downloading VM images from S3. The mirrors we pull images from have download rate limits, so the images will take a while to download.

Because of its hardware requirements, this test will not run unless the --run-vm-tests flag is set.

Other Fun Flags

This test's behavior is customized with command line flags.

Don't Download Images From S3

If you pass the -no-s3 flag to go test, the S3 step will be skipped in favor of downloading the images directly from upstream sources, which may cause the test to fail in odd places.

Distribution Picking

This test runs on a large number of distributions. By default it tries to run everything, which may or may not be ideal for you. If you only want to test a subset of distributions, you can use the --distro-regex flag to select them with a regular expression, like this:

$ go test -run-vm-tests -distro-regex centos

This would run all tests on all versions of CentOS.

$ go test -run-vm-tests -distro-regex '(debian|ubuntu)'

This would run all tests on all versions of Debian and Ubuntu.
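
To make the matching behavior concrete, here is a minimal sketch of how a regular expression could select distributions. It assumes unanchored matching against each distribution's name, which is consistent with the examples above but is not confirmed by this document; `filterDistros` and the distribution names are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// filterDistros returns the names that match pattern, in input order.
// Matching is unanchored, so "centos" selects any name containing "centos".
func filterDistros(pattern string, names []string) ([]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, name := range names {
		if re.MatchString(name) {
			out = append(out, name)
		}
	}
	return out, nil
}

func main() {
	// Hypothetical distribution names, for illustration only.
	names := []string{"centos-7", "centos-8", "debian-10", "ubuntu-20-04"}

	selected, err := filterDistros("(debian|ubuntu)", names)
	if err != nil {
		fmt.Println("bad pattern:", err)
		return
	}
	fmt.Println(selected)
}
```

Quote the pattern in your shell, as in the second example above, so that characters such as ( and | are not interpreted before go test sees them.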

RAM Limiting

This test uses a lot of memory. To avoid making machines run out of memory while running this test, a semaphore is used to limit how many megabytes of RAM are in use at once. By default this semaphore is set to 4096 MB of RAM (about 4 gigabytes). You can customize this with the --ram-limit flag:

$ go test --run-vm-tests --ram-limit 2048
$ go test --run-vm-tests --ram-limit 65536

The first example will set the limit to 2048 MB of RAM (about 2 gigabytes). The second example will set the limit to 65536 MB of RAM (about 64 gigabytes). Please be careful with this flag: improper usage of it is known to cause the Linux out-of-memory killer to engage. Try to keep it within 50-75% of your machine's available RAM (there is some overhead involved with the virtualization) to be on the safe side.
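
The idea of a semaphore that counts megabytes rather than slots can be sketched as follows. This is a minimal standard-library illustration, not the test's actual implementation: the `ramSemaphore` type and the per-VM memory sizes are hypothetical. The golang.org/x/sync/semaphore package offers a ready-made weighted semaphore; the sketch avoids it only to stay self-contained.

```go
package main

import (
	"fmt"
	"sync"
)

// ramSemaphore limits the total megabytes of RAM handed out at once.
type ramSemaphore struct {
	mu    sync.Mutex
	cond  *sync.Cond
	limit int // total budget in MB, fixed at construction
	used  int // MB currently held by callers
}

func newRAMSemaphore(limitMB int) *ramSemaphore {
	s := &ramSemaphore{limit: limitMB}
	s.cond = sync.NewCond(&s.mu)
	return s
}

// Acquire blocks until mb megabytes fit in the budget. A request larger
// than the whole budget could never succeed, so it is rejected up front.
func (s *ramSemaphore) Acquire(mb int) error {
	if mb > s.limit {
		return fmt.Errorf("request for %d MB exceeds limit of %d MB", mb, s.limit)
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	for s.used+mb > s.limit {
		s.cond.Wait()
	}
	s.used += mb
	return nil
}

// Release returns mb megabytes to the budget and wakes any waiting callers.
func (s *ramSemaphore) Release(mb int) {
	s.mu.Lock()
	s.used -= mb
	s.mu.Unlock()
	s.cond.Broadcast()
}

func main() {
	sem := newRAMSemaphore(4096)             // mirrors the default limit
	vmRAM := []int{512, 1024, 2048, 2048, 512} // hypothetical per-VM sizes in MB

	var wg sync.WaitGroup
	for id, mb := range vmRAM {
		wg.Add(1)
		go func(id, mb int) {
			defer wg.Done()
			if err := sem.Acquire(mb); err != nil {
				fmt.Println("vm", id, "skipped:", err)
				return
			}
			defer sem.Release(mb)
			fmt.Printf("vm %d running with %d MB\n", id, mb)
		}(id, mb)
	}
	wg.Wait()
}
```

With a 4096 MB budget, the five hypothetical VMs above (6144 MB in total) cannot all hold memory at once, so some wait until others release theirs. Lowering the limit trades parallelism for a smaller peak memory footprint.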