Troubleshooting

Copyright 2022,2023 Nvidia Corporation. All rights reserved.

...for when things don't go as expected

Installing DUE

Symptom: Docker (or Podman) isn't installed

Installing Docker without a DUE .deb

If you've downloaded DUE as source, install its dependencies by running:
sudo apt update ; sudo apt install docker.io git rsync binfmt-support qemu qemu-user-static

Note 1: docker.io can be replaced with docker-ce, or podman.
Note 2: Newer host system distributions may provide the systemd-binfmt
package to support the execution of non-native binaries, so that the
binfmt-support package is not explicitly needed.

The last three packages there are optional, but necessary if you want to
run alternate architectures.
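
To see which of these dependencies are still missing before or after the install, a small check like the following may help. It is a minimal sketch, not part of DUE: the missing_cmds helper is a hypothetical name, and it only looks for commands on your PATH rather than querying the package manager.

```shell
# Print each command name from the arguments that is not found on PATH.
# (missing_cmds is a hypothetical helper, not something DUE provides.)
missing_cmds() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# DUE needs one container runtime; either docker or podman will do.
if [ -n "$(missing_cmds docker)" ] && [ -n "$(missing_cmds podman)" ]; then
    echo "missing: a container runtime (docker or podman)"
fi

# git and rsync are provided by the packages of the same name.
for cmd in $(missing_cmds git rsync); do
    echo "missing: $cmd"
done
```

No output means a container runtime, git, and rsync were all found.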

TIP: If you are on the master Git branch of DUE, you can run make install
to install DUE without going through package management.

Installing Docker through the DUE .deb

The lack of Docker will be obvious on the initial install of the DUE
.deb, as you'll see the error:
due depends on docker.io | docker-ce | podman; however:   Package docker.io is not installed. Package docker.ce is not installed. Package podman is not installed.

To resolve this, try:
sudo apt update
sudo apt install --fix-broken
If that fails (and it might, depending on how old the version of your
operating system is), try:
sudo apt install docker.io
...and if that fails, try downloading and installing docker-ce from
https://hub.docker.com

Running DUE

Symptom: Docker containers don't run (or only run as root).

You'll see:
Got permission denied while trying to connect to the Docker daemon socket
You are probably not a member of the Docker group, so you'll need to:

Add yourself to the Docker group:

sudo usermod -a -G docker $(whoami)

You may have to log out and back in again for the group change to take
effect. Running groups should show docker along with your other groups.
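
To confirm whether the group change is active in your current session, something like the following can be used. It is only a sketch: has_group is a hypothetical helper name, and the newgrp suggestion is a general Linux alternative to logging out, not something DUE requires.

```shell
# Succeed if the group named in $1 appears in the space-separated
# group list given in $2. (has_group is a hypothetical helper.)
has_group() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# id -nG prints the names of the groups active in the current session.
if has_group docker "$(id -nG)"; then
    echo "docker group is active in this session"
else
    echo "docker group is not active; log out and back in, or run: newgrp docker"
fi
```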

Symptom: Strange failures and permission errors in the container.

Check that the host directory the container is using is a LOCAL file
system. I've seen strange permission related errors when Docker is
mounting a file system that is network mounted. If your home directory
is NFS mounted on your build system, consider creating a work directory
on the host system and using either /etc/due/due.conf or
~/.config/due/due.conf (generate this with ./due --manage --copy-config)
to specify this local work directory as your "home" directory. You'll
probably want to copy config files, etc. to the new "home" directory.
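
One way to check whether a directory lives on a network file system is to ask stat for the file system type. The sketch below assumes GNU coreutils stat (the -f and -c %T options). The is_network_fs helper and its list of type names are illustrative, not exhaustive, and not part of DUE.

```shell
# Succeed if the file system type name in $1 looks like a network
# file system. The list is illustrative, not exhaustive.
is_network_fs() {
    case "$1" in
        nfs*|cifs|smb*|afs|ceph|lustre) return 0 ;;
        *) return 1 ;;
    esac
}

# Directory to check; defaults to your home directory.
dir="${HOME:-/}"

# GNU stat: -f queries the file system, %T prints its type name.
fstype="$(stat -f -c %T "$dir" 2>/dev/null || echo unknown)"

if is_network_fs "$fstype"; then
    echo "$dir is on a network file system ($fstype); consider a local work directory"
else
    echo "$dir file system type: $fstype"
fi
```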

Symptom: Can't mount file systems or missing dev entries in container.

Certain operations (like loopback mounting files) are restricted within
the container because they would require root level access to the host
file system. Docker containers can be run with the --privileged option,
which allows this access, but being inside a container then provides a
false sense of security: actions taken within a privileged container can
still trash the host system. Bottom line: this can be done, but it
carries risks.

Symptom: DNS failures

In general, Docker will use the host system's network configuration
within the container, so the contents of /etc/hosts and /etc/resolv.conf
will be set at run time, which may not be what you want.

Overriding /etc/hosts

If a templates//filesystem/etc/hosts file is present for image creation,
the container-create-user.sh script (being the first process run) will
append its contents to the /etc/hosts file that is generated by Docker.
This is useful if you have static addresses to add.

Using a VPN with Docker

For image creation

If your image needs to access resources over the host's VPN during
image creation and this is failing, make sure the host VPN is up, then
restart Docker so that Docker becomes aware of the VPN, and then retry
image creation. Example:
sudo systemctl stop docker
sudo systemctl start docker

In the container

VPN software (like Openconnect) can work in a container, regardless of
whether the host system is connected to a VPN.

Check the /etc/resolv.conf file

Double check that the container's /etc/resolv.conf file has been updated
properly by any VPN software (like openconnect) running in the container
or on the host. It may be prioritizing the host's primary network
connection rather than the VPN. If in doubt, make sure the VPN's domain
is the first one listed, so that it is searched first.
Example from /etc/resolv.conf:
search myVPNDomain myISPDomain
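
To see which domain is searched first, the search list can be read with a short script such as the one below, run inside the container. This is a sketch: first_search_domain is a hypothetical helper. It uses the last "search" line because, per resolv.conf(5), only the last one is used when several are present.

```shell
# Print the first domain of the effective search list in the file $1.
# When several "search" lines exist, the last one is the one used.
first_search_domain() {
    awk '$1 == "search" { d = $2 } END { if (d != "") print d }' "$1"
}

conf=/etc/resolv.conf
if [ -r "$conf" ]; then
    echo "first search domain: $(first_search_domain "$conf")"
fi
```

If the domain printed is not the VPN's domain, the search line needs reordering.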

Symptom: Running emulated containers fails.

If QEMU is properly and fully installed, DUE should be able to run
containers of other architectures seamlessly. If you're reading this,
then you've found a seam and should file a bug at:
https://github.com/CumulusNetworks/DUE/issues
Note: Be aware that newer host distributions may use systemd-binfmt to
handle the running of non-native binaries, and as a result, the
following binfmt-support suggestions may or may not apply in your
environment.

Fails with: standard_init_linux.go:211: exec user process caused "exec format error"

So far this has been the only time I've seen this die, and I tracked it
down to my system's binfmt-support not being configured to handle ARM
binaries. Ideally, qemu should register the architectures it can run
with binfmt-support, so that when non-native code is encountered, it can
be passed off to qemu.

Other emulation related failures to check:

Are there qemu-* entries under /proc/sys/fs/binfmt_misc/?

If ls -l /proc/sys/fs/binfmt_misc doesn't show them, then a few required
packages may not be installed. Try:

sudo apt update ; sudo apt install qemu qemu-user-static binfmt-support

This should create the entries. If this fails, try reconfiguring
qemu-user-static, with:
sudo dpkg-reconfigure qemu-user-static
which should have configured binfmt-support to have the entries. I had
to do this on one system, for reasons that aren't completely clear to
me.
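
To check the registered qemu-* entries and whether each one is enabled, a loop like the following can be used. It is a sketch with a hypothetical helper name (qemu_binfmt_status). It relies on the first line of each binfmt_misc entry file reading "enabled" or "disabled".

```shell
# List the qemu-* entries in the binfmt_misc directory $1, one per line,
# followed by the first line of the entry (its enabled/disabled status).
qemu_binfmt_status() {
    for entry in "$1"/qemu-*; do
        [ -e "$entry" ] || continue   # the glob matched nothing
        printf '%s %s\n' "${entry##*/}" "$(head -n 1 "$entry")"
    done
}

status="$(qemu_binfmt_status /proc/sys/fs/binfmt_misc)"
if [ -n "$status" ]; then
    echo "$status"
else
    echo "no qemu-* entries registered"
fi
```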

Is the binfmt service running?

List binfmt unit files:
systemctl list-unit-files | grep binfmt
Restart binfmt-support:
sudo systemctl restart binfmt-support.service

Debugging a failed image creation

If image creation does not complete, a partial image will have been
created with the name <none>. Running due --manage --list-images will
list all images on the system, with the most recently created ones
listed first.

To get inside the failed container and debug it, run:

due --run --debug
and select the image.
Then: cd /due_configuration

Here you'll find all the configuration scripts that were run to create
the container, so you can run them in the container as needed to track
down the failure.

Your home directory will be mounted under /home/root, so any file
changes you make can be persisted by copying them there.

Cleaning up failed images

Run due --manage --delete-matched none
This gets the IDs of all images that have 'none' in their name and
generates a script named delete_these_docker_images.sh that can be run
to delete all those images.

--delete-matched filters images with *term-supplied* (the term can match
anywhere in the name), so you should check that the images listed in the
script are, indeed, the ones you want to get rid of.
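
The reason for checking is that a match anywhere in the name can catch more than intended. The sketch below shows substring matching in general, not DUE's actual implementation. The match_term helper and the image names are hypothetical.

```shell
# Print the lines from stdin that contain the term $1 anywhere in them.
# (match_term is a hypothetical helper, not DUE's code.)
match_term() {
    while IFS= read -r name; do
        case "$name" in
            *"$1"*) echo "$name" ;;
        esac
    done
}

# Hypothetical image names, for illustration only.
printf '%s\n' '<none>' 'nonesuch-build' 'debian-package' | match_term none
```

Here both <none> and nonesuch-build are selected, even though only the first is a failed image.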
