autopkgtest_qemu doesn't use accel=kvm on ppc64le, being fully unusable on that arch

Bug #1988527 reported by Paride Legovini
This bug affects 1 person
Affects               Status        Importance  Assigned to      Milestone
autopkgtest (Debian)  Fix Released  Unknown
autopkgtest (Ubuntu)  Fix Released  High        Paride Legovini
  Jammy               Fix Released  High        Paride Legovini

Bug Description

[ Impact ]

On Power9 the qemu based autopkgtest commands create VMs that are extremely slow and fail with obscure errors (partially discussed in LP: #1973628, comment 8). This can be reproduced for example by running:

  autopkgtest-buildvm-ubuntu-cloud -v -r jammy --ram-size 1024

but autopkgtest-virt-qemu is also affected.

This happens because autopkgtest fails to detect the system architecture as KVM capable due to a typo in the architecture name (ppc64el instead of ppc64le). This upload fixes the typo.

Fixing this bug in Jammy will allow users and developers to manually run autopkgtests on ppc64el. This is useful for example in +1 maintenance.

[ Test Plan ]

Run:

  autopkgtest-buildvm-ubuntu-cloud -v -r jammy --ram-size 1024

on an affected system (= a KVM-capable POWER machine running Jammy).

Buggy package => the command takes hours to complete and prints lots of obscure errors.

Fixed package => the command completes in minutes.

[ Where problems could occur ]

Without this fix, qemu based autopkgtest could in principle complete even when KVM is available (/dev/kvm exists) but broken, as it may be in some nested virtualization scenarios. That said, without KVM, qemu based autopkgtest appears to be very broken due to timeouts caused by its extreme slowness, so I think the risk of causing a regression is marginal.

[ Other Info ]

The very same fix has been submitted and merged upstream, and released to Kinetic via a clean cherry-pick.

[ Original Description ]

On Power9 the qemu based autopkgtest commands create VMs that are extremely slow and fail with obscure errors (partially discussed in LP: #1973628, comment 8). This can be reproduced for example by running:

  autopkgtest-buildvm-ubuntu-cloud -v -r jammy --ram-size 1024

but autopkgtest-virt-qemu is also affected. The extreme slowness of the VMs made me think that something was off with the virtualization settings. I modified autopkgtest_qemu.py so that qemu-system-ppc64le is called with '-machine accel=kvm' (which I think is the same as '-machine pseries,accel=kvm' with pseries being the default machine type).

With this change everything is very fast and reliable. These warnings also went away:

qemu-system-ppc64le: warning: TCG doesn't support requested feature, cap-cfpc=workaround
qemu-system-ppc64le: warning: TCG doesn't support requested feature, cap-sbbc=workaround
qemu-system-ppc64le: warning: TCG doesn't support requested feature, cap-ibs=workaround
qemu-system-ppc64le: warning: TCG doesn't support requested feature, cap-ccf-assist=on

indicating that we were using TCG emulation before.
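The warnings quoted above are a handy marker for spotting an unintended TCG fallback in captured QEMU output. A minimal sketch, assuming the output has already been collected as text; `tcg_warnings` is a hypothetical helper and not part of autopkgtest:

```python
import re

# Matches warnings of the form quoted in this report, e.g.
# "...: warning: TCG doesn't support requested feature, cap-ibs=workaround"
TCG_WARNING = re.compile(
    r"warning: TCG doesn't support requested feature, (?P<cap>\S+)"
)


def tcg_warnings(qemu_output: str) -> list[str]:
    """Return the capabilities QEMU reported as unsupported under TCG.

    Hypothetical helper for illustration; a non-empty result suggests
    the guest is running under TCG rather than KVM.
    """
    return [m.group("cap") for m in TCG_WARNING.finditer(qemu_output)]


sample = (
    "qemu-system-ppc64le: warning: TCG doesn't support requested feature, cap-cfpc=workaround\n"
    "qemu-system-ppc64le: warning: TCG doesn't support requested feature, cap-ccf-assist=on\n"
)
print(tcg_warnings(sample))
```

An empty list does not prove that KVM is in use; it only means these particular warnings were not printed.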

I imagine that Qemu has good reasons not to default to accel=kvm or accel=kvm:tcg on ppc64, but think it's reasonable to assume it's available and enable it in autopkgtest.

We can fix this in autopkgtest upstream, but it would be nice to verify if this is an issue with Debian too before submitting a salsa MR.

[1] https://wiki.qemu.org/Documentation/TCG


Frank Heimes (fheimes)
tags: added: ppc64el
Paride Legovini (paride) wrote :

We need Christian's Qemu insight on this one.

For the record this is the change I applied locally to force kvm usage:

--- /tmp/autopkgtest_qemu.py.orig
+++ /usr/share/autopkgtest/lib/autopkgtest_qemu.py
@@ -329,6 +329,8 @@
                 boot == 'efi'
             ):
                 argv.extend(['-machine', 'q35'])
+            elif self.qemu_architecture == 'ppc64le':
+                argv.extend(['-machine', 'accel=kvm'])

         # Some architectures can only be run with certain CPUs
         if '-cpu' not in qemu_options:

Christian Ehrhardt  (paelzer) wrote :

TL;DR:
- either use "$ kvm" instead of "$ qemu-system-*"
- or (better) use accel=... according to your preferences
  - I assume you want kvm:tcg to try the fast but not break if falling back
- doing neither of the above most likely will give you only TCG

Details follow ...

Christian Ehrhardt  (paelzer) wrote :

There are multiple levels at which this could be going wrong, or it could be a misunderstanding of expectations or behavior.

I assume you didn't hit the obvious, like a permission issue which would appear as:
Could not access KVM kernel module: Permission denied
qemu-system-ppc64: failed to initialize kvm: Permission denied
qemu-system-ppc64: falling back to tcg
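The permission case can be ruled out before launching QEMU by inspecting the device node. A minimal sketch, assuming the conventional `/dev/kvm` path mentioned in this report; `kvm_device_status` is a hypothetical helper, and a "usable" result only means the node can be opened, not that KVM actually works (for example in nested setups):

```python
import os


def kvm_device_status(path: str = "/dev/kvm") -> str:
    """Classify the KVM device node as 'missing', 'no-permission' or 'usable'.

    Hypothetical helper for illustration; it checks file access only and
    cannot tell whether the kernel KVM module is functional.
    """
    if not os.path.exists(path):
        return "missing"
    if not os.access(path, os.R_OK | os.W_OK):
        return "no-permission"
    return "usable"


print(kvm_device_status())
```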

In addition, please be aware that second level virtualization is known to be somewhat dysfunctional and therefore won't work. I assume we have level 1 = openstack providing a VM for autopkgtest, and then level 2 (without options to force it) will default to TCG.
For second level (using kernel kvm_pr instead of kvm_hv) you might always need to force accel=kvm which is just what you have done in your workaround (and maybe need to load kvm_pr manually).

But let us ignore second level for now and resolve the simple case first - running on bare metal.
With the `info jit` command of the qemu monitor you can check whether tcg was enabled (no need to infer it from speed) and therefore easily check which options lead to tcg being used and which lead to KVM accel being used instead.

Easy cases first - things running as specified:

KVM Mode
$ sudo qemu-system-ppc64 -S -nographic -monitor telnet:127.0.0.1:1234,server,nowait -machine pseries-jammy,accel=kvm:tcg
$ sudo qemu-system-ppc64 -S -nographic -monitor telnet:127.0.0.1:1234,server,nowait -machine pseries-jammy,accel=kvm

TCG Mode
$ sudo qemu-system-ppc64 -S -nographic -monitor telnet:127.0.0.1:1234,server,nowait -machine pseries-jammy,accel=tcg
$ sudo qemu-system-ppc64 -S -nographic -monitor telnet:127.0.0.1:1234,server,nowait -machine pseries-jammy,accel=kvm:tcg:kvm

Christian Ehrhardt  (paelzer) wrote :

But most importantly (as that is your case) without specifying it at all
$ sudo qemu-system-ppc64 -S -nographic -monitor telnet:127.0.0.1:1234,server,nowait -machine pseries-jammy

It runs in TCG mode indeed.
That is, by the way, no different from x86:
$ sudo qemu-system-x86_64 -S -nographic -monitor telnet:127.0.0.1:1234,server,nowait -machine ubuntu

You might now think, wait a minute - on x86 this always worked for me.
Well, you might have used "kvm" instead of "qemu-system-*".
The code does this:

        if (accelerators == NULL) {
            /* Select the default accelerator */
            bool have_tcg = accel_find("tcg");
            bool have_kvm = accel_find("kvm");

            if (have_tcg && have_kvm) {
                if (g_str_has_suffix(progname, "kvm")) {
                    /* If the program name ends with "kvm", we prefer KVM */
                    accelerators = "kvm:tcg";
                } else {
                    accelerators = "tcg:kvm";
                }
            } else if (have_kvm) {
                accelerators = "kvm";
            } else if (have_tcg) {
                accelerators = "tcg";

You can read that as:
1. if anything is specified explicitly - use that
2. if both kvm & tcg are available then
2.1 if the binary ends in kvm - prefer kvm
2.2 else - prefer tcg
3. if only one of kvm or tcg is available - use that

Which is exactly what happens here.
If you run the very same as kvm (which is just a symlink nowadays):
$ sudo kvm -S -nographic -monitor telnet:127.0.0.1:1234,server,nowait -machine pseries-jammy
You get the kvm mode by default.
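The selection rules quoted above can be restated as a small Python function. This is only a sketch that mirrors the C excerpt; it is not QEMU code, and `default_accelerators` and its parameters are names made up for illustration. The excerpt is cut off before the case where neither accelerator is available, so the sketch returns `None` there as a placeholder:

```python
from typing import Optional


def default_accelerators(
    progname: str,
    have_tcg: bool,
    have_kvm: bool,
    explicit: Optional[str] = None,
) -> Optional[str]:
    """Mirror the accelerator selection order described in the comment above.

    Hypothetical restatement for illustration, not the QEMU implementation.
    """
    if explicit is not None:
        # 1. Anything specified explicitly wins.
        return explicit
    if have_tcg and have_kvm:
        # 2. Both available: the program name decides the preference.
        return "kvm:tcg" if progname.endswith("kvm") else "tcg:kvm"
    if have_kvm:
        return "kvm"
    if have_tcg:
        return "tcg"
    # The quoted excerpt ends before this case; placeholder only.
    return None


for name in ("qemu-system-ppc64", "kvm"):
    print(name, "->", default_accelerators(name, have_tcg=True, have_kvm=True))
```

With both accelerators available, only the program name changes the outcome, which is why `kvm` and `qemu-system-*` behave differently when given the same arguments.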

Christian Ehrhardt  (paelzer) wrote :

There is sadly no shortage of options to set the accelerator and they keep changing which might cause subtle unexpected changes on upgrades.

If you had this working before it might have been by some now deprecated and incompatible options. For example some args used to set accelerators have changed in recent years:
-machine accel= + -accel at the same time are now incompatible
-accel tcg -accel kvm => used to only do kvm, now will do tcg:kvm

That has been the case since qemu v5.0.0, which corresponds to Ubuntu 20.10 and later.

There is also "-enable-kvm" which overrules any of the former.
As I said, there are too many options controlling the same thing.

"kvm" used to be a wrapper script which added -enable-kvm, but now it is just a symlink. Since qemu now internally defaults to kvm:tcg for that program name, it should behave the same, with one twist.
In an environment that can only do TCG but not KVM, the behavior differs:
$ kvm ...
- old - used -enable-kvm: abort, can't initialize KVM
- new - sets kvm:tcg: try KVM, fall back to TCG

That was considered an improvement as it works in more cases, but it might have led in your case to a silent downgrade to tcg?
In any case, specifying explicitly what you want should be best.

I see autopkgtest uses qemu-system-*, so I'd expect that explicitly setting accel to kvm:tcg should be helpful.
But I'd recommend not adding it as a further "-machine" option, and instead extending the existing one (if any).
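Extending an existing `-machine` option instead of appending a second one could look roughly like this. A minimal sketch, assuming the command line is held as a list of strings as in the local patch earlier in this report; `with_accel` is a hypothetical helper and not part of autopkgtest:

```python
def with_accel(argv: list[str], accel: str = "kvm:tcg") -> list[str]:
    """Return a copy of argv whose -machine option carries accel=<accel>.

    Hypothetical helper for illustration. An existing -machine (or -M)
    value is extended; a new -machine option is appended only when none
    is present. An accel= setting that is already there is left alone.
    """
    result = list(argv)
    for i, arg in enumerate(result[:-1]):
        if arg in ("-machine", "-M"):
            value = result[i + 1]
            if "accel=" not in value:
                result[i + 1] = f"{value},accel={accel}"
            return result
    result.extend(["-machine", f"accel={accel}"])
    return result


print(with_accel(["qemu-system-ppc64", "-machine", "pseries"]))
print(with_accel(["qemu-system-ppc64", "-nographic"]))
```

The sketch deliberately does not handle `-accel` or `-enable-kvm`; as noted above, mixing those with `-machine accel=` can be incompatible, so a real change would need to pick one mechanism.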

Paride Legovini (paride) wrote :

Thanks, this last comment on "no shortage of options to set the accelerator" made me discover that autopkgtest_qemu.py has code to enable kvm on ppc64 via -enable-kvm if /dev/kvm is present, but for some reason it is not working. My impression at the moment is that the machine fails to be detected as kvm capable due to a mismatch between the dpkg arch name and the `uname -m` arch name (ppc64el / ppc64le). Investigating in this direction.

Paride Legovini (paride) wrote :

The ppc64 architecture name introduced in this commit is wrong:

https://salsa.debian.org/ci-team/autopkgtest/-/commit/d2350929d7d570aa71d40f15c297c56bc489a014

We need the "uname" arch name there (ppc64le), while ppc64el is the dpkg arch name. According to the git tags the bug was first introduced in autopkgtest 5.17. Ubuntu >= Jammy is affected.
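The mismatch is easy to reproduce in isolation. The sketch below is not the actual `kvm_compatible()` code from autopkgtest; the mapping table, function names and parameters are assumptions made for illustration. It shows why comparing against the dpkg name can never match the host's `uname -m` value on POWER:

```python
import os
import platform

# dpkg architecture name -> `uname -m` machine name (illustrative subset).
DPKG_TO_UNAME = {
    "amd64": "x86_64",
    "arm64": "aarch64",
    "ppc64el": "ppc64le",
    "s390x": "s390x",
}


def uname_machine_for(dpkg_arch: str) -> str:
    """Translate a dpkg arch name to the matching uname machine name."""
    return DPKG_TO_UNAME.get(dpkg_arch, dpkg_arch)


def looks_kvm_capable(guest_arch, host_machine=None, kvm_path="/dev/kvm"):
    """Hypothetical check: guest arch equals host machine and the KVM node exists."""
    host = host_machine or platform.machine()
    return guest_arch == host and os.path.exists(kvm_path)


print(uname_machine_for("ppc64el"))
```

On a POWER host, `platform.machine()` reports `ppc64le`, so a check written against `ppc64el` always fails and the code silently skips KVM.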

Paride Legovini (paride) wrote :
Changed in autopkgtest (Debian):
status: Unknown → Fix Committed
Changed in autopkgtest (Ubuntu):
importance: Undecided → High
Changed in autopkgtest (Ubuntu Jammy):
importance: Undecided → High
Changed in autopkgtest (Ubuntu):
status: New → Triaged
Changed in autopkgtest (Ubuntu Jammy):
status: New → Triaged
Paride Legovini (paride)
Changed in autopkgtest (Ubuntu):
assignee: nobody → Paride Legovini (paride)
Changed in autopkgtest (Ubuntu Jammy):
assignee: nobody → Paride Legovini (paride)
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package autopkgtest - 5.25ubuntu1

---------------
autopkgtest (5.25ubuntu1) kinetic; urgency=medium

  * qemu: fix the ppc64le arch name in kvm_compatible().
    Upstream cherry-pick (93fd5ea9). (LP: #1988527)

 -- Paride Legovini <email address hidden> Wed, 07 Sep 2022 17:28:05 +0200

Changed in autopkgtest (Ubuntu):
status: Triaged → Fix Released
Paride Legovini (paride)
Changed in autopkgtest (Ubuntu Jammy):
status: Triaged → In Progress
Paride Legovini (paride)
description: updated
description: updated
Paride Legovini (paride) wrote :

Note for the SRU team: there is some noise in the Jammy SRU debdiff [1] caused by the fact that by default `dpkg-source -b` excludes .gitignore from the tarball, but the Debian upload has been done using dgit, which doesn't have any special exclude rules.

[1] https://launchpadlibrarian.net/623539153/autopkgtest_5.20_5.20ubuntu1.diff.gz

description: updated
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Paride, or anyone else affected,

Accepted autopkgtest into jammy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/autopkgtest/5.20ubuntu1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-jammy to verification-done-jammy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-jammy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in autopkgtest (Ubuntu Jammy):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-jammy
Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (autopkgtest/5.20ubuntu1)

All autopkgtests for the newly accepted autopkgtest (5.20ubuntu1) for jammy have finished running.
The following regressions have been reported in tests triggered by the package:

systemd/249.11-0ubuntu3.4 (ppc64el)
gscan2pdf/2.12.6-1 (amd64)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/jammy/update_excuses.html#autopkgtest

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Frank Heimes (fheimes) wrote :

From my point of view the new autopkgtest version fixes this issue, including LP#1973628 and LP#1987393.
(For my casync autopkgtest case I needed more resources than the defaults to succeed, but that is beyond the image creation issue itself.)
I would consider this successfully verified on jammy.

tags: added: verification-done-jammy
removed: verification-needed-jammy
Changed in autopkgtest (Debian):
status: Fix Committed → Fix Released
Chris Halse Rogers (raof) wrote : Update Released

The verification of the Stable Release Update for autopkgtest has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package autopkgtest - 5.20ubuntu1

---------------
autopkgtest (5.20ubuntu1) jammy; urgency=medium

  * qemu: fix the ppc64le arch name in kvm_compatible().
    Upstream cherry-pick (93fd5ea9). (LP: #1988527)

 -- Paride Legovini <email address hidden> Thu, 15 Sep 2022 15:27:52 +0200

Changed in autopkgtest (Ubuntu Jammy):
status: Fix Committed → Fix Released