QEMU/KVM crashes with GPU Passthrough at random times while playing games on Windows 11

Asked by Sibidharan

I have built a machine with an i9-14900K and 128 GB of RAM, currently running Ubuntu 23.04 Server. I use it as a server for work and for entertainment, with a Windows VM that has an Nvidia A4000 GPU passed through for gaming. The games all run fine, but at random times the Windows VM crashes with a machine-gun sound, and the event takes the entire host down along with all the other VMs, because the kernel panics. I had no idea what was happening, since nothing was recorded in dmesg or kern.log, so I configured kdump to capture the log, and this is what I got.

Can someone help me understand what is wrong here and how to fix it?

[57089.489077] WARNING: CPU: 9 PID: 27910 at include/linux/srcu.h:227 kvm_vcpu_check_block+0xa8/0xb0 [kvm]
[57089.489133] Modules linked in: vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd tls snd_seq_dummy snd_hrtimer rfcomm vhost_net tap zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) xt_CHECKSUM xt_MASQUERADE nf_conntrack_netlink xt_conntrack xfrm_user xfrm_algo xt_addrtype br_netfilter nft_masq vmw_vsock_vmci_transport vmw_vmci vhost_vsock vmw_vsock_virtio_transport_common vhost vhost_iotlb vsock nft_chain_nat ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_multiport xt_cgroup xt_mark xt_owner xt_tcpudp nft_compat nf_tables nfnetlink overlay cmac algif_hash algif_skcipher af_alg bnep bridge stp llc openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 binfmt_misc nls_iso8859_1 snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp intel_rapl_msr intel_rapl_common snd_sof intel_tcc_cooling x86_pkg_temp_thermal
[57089.489172] snd_sof_utils intel_powerclamp snd_soc_hdac_hda snd_hda_ext_core coretemp snd_soc_acpi_intel_match snd_soc_acpi snd_intel_dspcfg kvm_intel snd_intel_sdw_acpi kvm snd_hda_codec snd_hda_core snd_hwdep irqbypass soundwire_bus rapl iwlmvm intel_cstate snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine snd_pcm mac80211 snd_seq_midi snd_seq_midi_event snd_rawmidi libarc4 btusb snd_seq btrtl btbcm snd_seq_device btintel btmtk snd_timer iwlwifi bluetooth snd cmdlinepart pmt_telemetry pmt_class mei_hdcp mei_pxp asus_nb_wmi eeepc_wmi spi_nor wmi_bmof joydev mtd ecdh_generic soundcore input_leds ecc cfg80211 intel_vsec acpi_pad acpi_tad mac_hid mei_me mei dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr parport_pc ppdev lp parport efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid i915 drm_buddy i2c_algo_bit ttm
[57089.489226] drm_display_helper cec rc_core mfd_aaeon drm_kms_helper crct10dif_pclmul crc32_pclmul syscopyarea asus_wmi sysfillrect polyval_clmulni sysimgblt ledtrig_audio polyval_generic ghash_clmulni_intel sha512_ssse3 aesni_intel sparse_keymap r8169 nvme crypto_simd platform_profile spi_intel_pci cryptd drm intel_lpss_pci i2c_i801 xhci_pci nvme_core spi_intel realtek ahci i2c_smbus intel_lpss xhci_pci_renesas nvme_common libahci idma64 vmd video wmi pinctrl_alderlake
[57089.489248] CPU: 9 PID: 27910 Comm: CPU 14/KVM Kdump: loaded Tainted: P O 6.2.0-37-generic #38-Ubuntu
[57089.489250] Hardware name: ASUS System Product Name/PRIME Z790-P WIFI, BIOS 1402 09/08/2023
[57089.489251] RIP: 0010:kvm_vcpu_check_block+0xa8/0xb0 [kvm]
[57089.489279] Code: 5c 41 5d 5d 31 d2 31 f6 31 ff c3 cc cc cc cc f0 80 63 38 fb 48 8b 3b 41 bc fc ff ff ff 48 81 c7 f0 9a 00 00 41 83 fd 01 76 c9 <0f> 0b eb c5 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90
[57089.489280] RSP: 0018:ffffbefb8612fd60 EFLAGS: 00010296
[57089.489282] RAX: 0000000000000000 RBX: ffff9e0f752f0000 RCX: 0000000000000000
[57089.489283] RDX: 0000000000000800 RSI: 0000000000000000 RDI: ffffbefb83e42af0
[57089.489283] RBP: ffffbefb8612fd80 R08: 0000000000000000 R09: 0000000000000000
[57089.489284] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[57089.489285] R13: 0000000083e39000 R14: 0000000000000001 R15: ffff9e070c5419c0
[57089.489286] FS: 00007fc8d37fe6c0(0000) GS:ffff9e1afec40000(0000) knlGS:0000003d6e646000
[57089.489287] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[57089.489287] CR2: 000001b294813220 CR3: 0000001439a44000 CR4: 0000000000752ee0
[57089.489288] PKRU: 55555554
[57089.489289] Call Trace:
[57089.489290] <TASK>
[57089.489293] ? show_regs+0x6d/0x80
[57089.489297] ? __warn+0x89/0x160
[57089.489300] ? kvm_vcpu_check_block+0xa8/0xb0 [kvm]
[57089.489331] ? report_bug+0x17e/0x1b0
[57089.489335] ? handle_bug+0x46/0x90
[57089.489337] ? exc_invalid_op+0x18/0x80
[57089.489338] ? asm_exc_invalid_op+0x1b/0x20
[57089.489341] ? kvm_vcpu_check_block+0xa8/0xb0 [kvm]
[57089.489402] ? kvm_vcpu_check_block+0x20/0xb0 [kvm]
[57089.489449] kvm_vcpu_halt+0x13a/0x470 [kvm]
[57089.489484] vcpu_run+0x1fc/0x290 [kvm]
[57089.489559] kvm_arch_vcpu_ioctl_run+0x1d5/0x540 [kvm]
[57089.489609] kvm_vcpu_ioctl+0x297/0x800 [kvm]
[57089.489636] ? __fget_light+0xa5/0x120
[57089.489639] __x64_sys_ioctl+0x9d/0xe0
[57089.489642] do_syscall_64+0x58/0x90
[57089.489645] ? do_syscall_64+0x67/0x90
[57089.489647] ? do_syscall_64+0x67/0x90
[57089.489649] entry_SYSCALL_64_after_hwframe+0x73/0xdd
[57089.489651] RIP: 0033:0x7fcd21511eef
[57089.489653] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[57089.489654] RSP: 002b:00007fc8d37fd490 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[57089.489655] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007fcd21511eef
[57089.489656] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000027
[57089.489657] RBP: 0000560c373b9f70 R08: 0000560c35746ef8 R09: 00000000000000ff
[57089.489658] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
[57089.489658] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000001
[57089.489660] </TASK>
[57089.489661] ---[ end trace 0000000000000000 ]---
[57089.489663] ================================================================================
[57089.489665] UBSAN: array-index-out-of-bounds in /build/linux-xCEaXy/linux-6.2.0/kernel/rcu/srcutree.c:681:2
[57089.489667] index -2082238464 is out of range for type 'atomic_long_t [2]'
[57089.489668] CPU: 9 PID: 27910 Comm: CPU 14/KVM Kdump: loaded Tainted: P W O 6.2.0-37-generic #38-Ubuntu
[57089.489670] Hardware name: ASUS System Product Name/PRIME Z790-P WIFI, BIOS 1402 09/08/2023
[57089.489670] Call Trace:
[57089.489670] <TASK>
[57089.489671] dump_stack_lvl+0x48/0x70
[57089.489673] dump_stack+0x10/0x20
[57089.489674] __ubsan_handle_out_of_bounds+0xc6/0x110
[57089.489678] __srcu_read_unlock+0x48/0x50
[57089.489680] kvm_vcpu_check_block+0x79/0xb0 [kvm]
[57089.489710] ? kvm_vcpu_check_block+0x20/0xb0 [kvm]
[57089.489740] kvm_vcpu_halt+0x13a/0x470 [kvm]
[57089.489772] vcpu_run+0x1fc/0x290 [kvm]
[57089.489811] kvm_arch_vcpu_ioctl_run+0x1d5/0x540 [kvm]
[57089.489852] kvm_vcpu_ioctl+0x297/0x800 [kvm]
[57089.489885] ? __fget_light+0xa5/0x120
[57089.489887] __x64_sys_ioctl+0x9d/0xe0
[57089.489889] do_syscall_64+0x58/0x90
[57089.489891] ? do_syscall_64+0x67/0x90
[57089.489892] ? do_syscall_64+0x67/0x90
[57089.489894] entry_SYSCALL_64_after_hwframe+0x73/0xdd
[57089.489896] RIP: 0033:0x7fcd21511eef
[57089.489897] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[57089.489898] RSP: 002b:00007fc8d37fd490 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[57089.489899] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007fcd21511eef
[57089.489900] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000027
[57089.489900] RBP: 0000560c373b9f70 R08: 0000560c35746ef8 R09: 00000000000000ff
[57089.489901] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
[57089.489902] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000001
[57089.489903] </TASK>
[57089.489904] ================================================================================
[57089.489909] BUG: unable to handle page fault for address: ffffdef79ca0e450
[57089.489910] #PF: supervisor write access in kernel mode
[57089.489912] #PF: error_code(0x0002) - not-present page
[57089.489913] PGD 100040067 P4D 100040067 PUD 0
[57089.489915] Oops: 0002 [#1] PREEMPT SMP NOPTI
[57089.489916] CPU: 9 PID: 27910 Comm: CPU 14/KVM Kdump: loaded Tainted: P W O 6.2.0-37-generic #38-Ubuntu
[57089.489918] Hardware name: ASUS System Product Name/PRIME Z790-P WIFI, BIOS 1402 09/08/2023
[57089.489918] RIP: 0010:__srcu_read_unlock+0x24/0x50
[57089.489920] Code: 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 53 48 83 ec 08 f0 83 44 24 fc 00 48 63 f6 48 8b 9f c0 00 00 00 48 83 fe 01 77 14 <65> 48 ff 44 f3 10 48 8b 5d f8 c9 31 f6 31 ff c3 cc cc cc cc 48 c7
[57089.489921] RSP: 0018:ffffbefb8612fd40 EFLAGS: 00010246
[57089.489922] RAX: 0000000000000000 RBX: 000040e07ec06440 RCX: 0000000000000000
[57089.489923] RDX: 0000000000000000 RSI: ffffffff83e39000 RDI: 0000000000000000
[57089.489924] RBP: ffffbefb8612fd50 R08: 0000000000000000 R09: 0000000000000000
[57089.489925] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[57089.489926] R13: 0000000083e39000 R14: 0000000000000001 R15: ffff9e070c5419c0
[57089.489927] FS: 00007fc8d37fe6c0(0000) GS:ffff9e1afec40000(0000) knlGS:0000003d6e646000
[57089.489928] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[57089.489929] CR2: ffffdef79ca0e450 CR3: 0000001439a44000 CR4: 0000000000752ee0
[57089.489930] PKRU: 55555554
[57089.489931] Call Trace:
[57089.489932] <TASK>
[57089.489932] ? show_regs+0x6d/0x80
[57089.489935] ? __die+0x24/0x80
[57089.489937] ? page_fault_oops+0x99/0x1b0
[57089.489939] ? kernelmode_fixup_or_oops+0xb2/0x140
[57089.489941] ? __bad_area_nosemaphore+0x1a5/0x2c0
[57089.489943] ? bad_area_nosemaphore+0x16/0x30
[57089.489944] ? do_kern_addr_fault+0x7b/0xa0
[57089.489945] ? exc_page_fault+0x10a/0x1b0
[57089.489948] ? asm_exc_page_fault+0x27/0x30
[57089.489950] ? __srcu_read_unlock+0x24/0x50
[57089.489952] kvm_vcpu_check_block+0x79/0xb0 [kvm]
[57089.489986] ? kvm_vcpu_check_block+0x20/0xb0 [kvm]
[57089.490016] kvm_vcpu_halt+0x13a/0x470 [kvm]
[57089.490048] vcpu_run+0x1fc/0x290 [kvm]
[57089.490088] kvm_arch_vcpu_ioctl_run+0x1d5/0x540 [kvm]
[57089.490125] kvm_vcpu_ioctl+0x297/0x800 [kvm]
[57089.490156] ? __fget_light+0xa5/0x120
[57089.490158] __x64_sys_ioctl+0x9d/0xe0
[57089.490159] do_syscall_64+0x58/0x90
[57089.490161] ? do_syscall_64+0x67/0x90
[57089.490163] ? do_syscall_64+0x67/0x90
[57089.490165] entry_SYSCALL_64_after_hwframe+0x73/0xdd
[57089.490167] RIP: 0033:0x7fcd21511eef
[57089.490168] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[57089.490169] RSP: 002b:00007fc8d37fd490 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[57089.490170] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007fcd21511eef
[57089.490171] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000027
[57089.490172] RBP: 0000560c373b9f70 R08: 0000560c35746ef8 R09: 00000000000000ff
[57089.490173] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
[57089.490174] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000001
[57089.490175] </TASK>
[57089.490176] Modules linked in: vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd tls snd_seq_dummy snd_hrtimer rfcomm vhost_net tap zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) xt_CHECKSUM xt_MASQUERADE nf_conntrack_netlink xt_conntrack xfrm_user xfrm_algo xt_addrtype br_netfilter nft_masq vmw_vsock_vmci_transport vmw_vmci vhost_vsock vmw_vsock_virtio_transport_common vhost vhost_iotlb vsock nft_chain_nat ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_multiport xt_cgroup xt_mark xt_owner xt_tcpudp nft_compat nf_tables nfnetlink overlay cmac algif_hash algif_skcipher af_alg bnep bridge stp llc openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 binfmt_misc nls_iso8859_1 snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp intel_rapl_msr intel_rapl_common snd_sof intel_tcc_cooling x86_pkg_temp_thermal
[57089.490217] snd_sof_utils intel_powerclamp snd_soc_hdac_hda snd_hda_ext_core coretemp snd_soc_acpi_intel_match snd_soc_acpi snd_intel_dspcfg kvm_intel snd_intel_sdw_acpi kvm snd_hda_codec snd_hda_core snd_hwdep irqbypass soundwire_bus rapl iwlmvm intel_cstate snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine snd_pcm mac80211 snd_seq_midi snd_seq_midi_event snd_rawmidi libarc4 btusb snd_seq btrtl btbcm snd_seq_device btintel btmtk snd_timer iwlwifi bluetooth snd cmdlinepart pmt_telemetry pmt_class mei_hdcp mei_pxp asus_nb_wmi eeepc_wmi spi_nor wmi_bmof joydev mtd ecdh_generic soundcore input_leds ecc cfg80211 intel_vsec acpi_pad acpi_tad mac_hid mei_me mei dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr parport_pc ppdev lp parport efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid i915 drm_buddy i2c_algo_bit ttm
[57089.490264] drm_display_helper cec rc_core mfd_aaeon drm_kms_helper crct10dif_pclmul crc32_pclmul syscopyarea asus_wmi sysfillrect polyval_clmulni sysimgblt ledtrig_audio polyval_generic ghash_clmulni_intel sha512_ssse3 aesni_intel sparse_keymap r8169 nvme crypto_simd platform_profile spi_intel_pci cryptd drm intel_lpss_pci i2c_i801 xhci_pci nvme_core spi_intel realtek ahci i2c_smbus intel_lpss xhci_pci_renesas nvme_common libahci idma64 vmd video wmi pinctrl_alderlake
[57089.490285] CR2: ffffdef79ca0e450
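
One detail that can be read straight from the log: the UBSAN line reports index -2082238464 for an array of type 'atomic_long_t [2]', so only 0 or 1 would be valid. The sketch below reinterprets that signed 32-bit number as hex so it can be compared with the register dumps. The helper name to_u32_hex is made up for illustration.

```shell
#!/bin/sh
# Reinterpret a signed 32-bit integer as unsigned and print it in hex.
# to_u32_hex is a hypothetical helper name, not an existing tool.
to_u32_hex() {
    printf '0x%08x\n' "$(( $1 & 0xffffffff ))"
}

# The index from the UBSAN line in the log above.
to_u32_hex -2082238464   # prints 0x83e39000
```

The result matches R13 (0000000083e39000) in the first register dump and, sign-extended, RSI (ffffffff83e39000) in the page-fault dump. That suggests the index handed to __srcu_read_unlock was a garbage value rather than 0 or 1. It does not by itself say where the value came from.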

Question information

Language: English
Status: Expired
For: Ubuntu
Assignee: No assignee
Sibidharan (sibi1995) said :
#3

Is there any way to keep the kernel from panicking and taking everything down, so that it just reports the issue and keeps running? ChatGPT suggested adding ubsan.panic=0 to avoid panics. I wonder whether ChatGPT is hallucinating, since I could not find any reference to that setting anywhere on the internet.

Sibidharan (sibi1995) said :
#4

This issue is similar to mine: https://bugs.launchpad.net/ubuntu/+source/linux-signed-hwe-5.19/+bug/2030894

It looks like UBSAN is causing the panic. Is there a way to disable it without recompiling the kernel?
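
As a starting point for that question, here is a minimal sketch that only inspects the current state and changes nothing. It reads the two standard sysctls that decide whether a kernel warning or an oops escalates to a panic (kernel.panic_on_warn and kernel.panic_on_oops) and lists the UBSAN options the running kernel was built with. It assumes the build config is available at /boot/config-$(uname -r), which is the usual Ubuntu location. The helper name show_setting is made up for illustration. In the log above, the UBSAN report is followed by a separate page fault ("Oops: 0002 [#1]"), so the oops setting may matter as much as anything UBSAN-specific. Whether the host stays usable after such an oops is untested here.

```shell
#!/bin/sh
# Print one kernel setting from /proc/sys, or "unavailable" if it cannot be read.
# show_setting is a hypothetical helper name, not an existing tool.
show_setting() {
    if [ -r "$1" ]; then
        printf '%s=%s\n' "$1" "$(cat "$1")"
    else
        printf '%s=unavailable\n' "$1"
    fi
}

# 1 means the kernel panics on a WARN / on an oops; 0 means it tries to continue.
show_setting /proc/sys/kernel/panic_on_warn
show_setting /proc/sys/kernel/panic_on_oops

# List the UBSAN build options of the running kernel, if the config file exists.
grep -s '^CONFIG_UBSAN' "/boot/config-$(uname -r)" || echo "no CONFIG_UBSAN lines found"
```

If either sysctl reports 1, it can be changed at runtime with sysctl -w, but that only alters what happens after the fault. It does not address the fault itself.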

Bernard Stafford (bernard010) said :
#6

Please open a terminal [Ctrl+Alt+T] and run:
lsb_release -a; uname -a; dpkg -l | grep ' linux-i'
Paste the output here for diagnostic purposes.

Sibidharan (sibi1995) said :
#7

Here is the output you asked for.

No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 23.04
Release: 23.04
Codename: lunar
Linux selfmadeninja 6.2.0-37-generic #38-Ubuntu SMP PREEMPT_DYNAMIC Mon Oct 30 21:04:52 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
rc linux-image-6.2.0-20-generic 6.2.0-20.20 amd64 Signed kernel image generic
rc linux-image-6.2.0-23-generic 6.2.0-23.23 amd64 Signed kernel image generic
rc linux-image-6.2.0-24-generic 6.2.0-24.24 amd64 Signed kernel image generic
rc linux-image-6.2.0-25-generic 6.2.0-25.25 amd64 Signed kernel image generic
rc linux-image-6.2.0-26-generic 6.2.0-26.26 amd64 Signed kernel image generic
rc linux-image-6.2.0-27-generic 6.2.0-27.28 amd64 Signed kernel image generic
rc linux-image-6.2.0-31-generic 6.2.0-31.31 amd64 Signed kernel image generic
rc linux-image-6.2.0-32-generic 6.2.0-32.32 amd64 Signed kernel image generic
rc linux-image-6.2.0-33-generic 6.2.0-33.33+1 amd64 Signed kernel image generic
rc linux-image-6.2.0-34-generic 6.2.0-34.34 amd64 Signed kernel image generic
rc linux-image-6.2.0-35-generic 6.2.0-35.35 amd64 Signed kernel image generic
rc linux-image-6.2.0-36-generic 6.2.0-36.37 amd64 Signed kernel image generic
ii linux-image-6.2.0-37-generic 6.2.0-37.38 amd64 Signed kernel image generic
ii linux-image-6.2.0-39-generic 6.2.0-39.40 amd64 Signed kernel image generic
ii linux-image-generic 6.2.0.39.39 amd64 Generic Linux kernel image
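
For anyone reading the listing above: in dpkg -l output, "rc" in the first column means the package was removed with only its config files left behind, and "ii" means it is installed. The sketch below filters the list down to installed kernel images and prints the running kernel next to it. The function name installed_kernels is made up for illustration.

```shell
#!/bin/sh
# Keep only fully installed ("ii") kernel image packages from `dpkg -l` output.
# installed_kernels is a hypothetical helper name, not an existing tool.
installed_kernels() {
    awk '$1 == "ii" && $2 ~ /^linux-image-/ { print $2, $3 }'
}

# Installed kernel images, followed by the kernel that is actually running.
dpkg -l 2>/dev/null | installed_kernels
echo "running: $(uname -r)"
```

Applied to the output above, this would show that 6.2.0-39 is installed while uname reports 6.2.0-37 as the running kernel.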

Launchpad Janitor (janitor) said :
#8

This question was expired because it remained in the 'Open' state without activity for the last 15 days.