QEMU/KVM crashes with GPU Passthrough at random times while playing games on Windows 11
I have built a machine with an i9-14900K and 128GB of RAM, currently running Ubuntu 23.04 Server. I use it as a server for work and for entertainment, with a Windows VM that has an Nvidia A4000 GPU passed through for gaming. The games themselves run fine, but at random times the Windows VM crashes with a "machine gun" sound, the host kernel panics, and the entire host goes down along with all the other VMs. I had no idea what was happening because nothing was recorded in dmesg or kern.log, so I configured kdump to capture the log, and this is what I got.
Can someone help me understand what is wrong here and how to fix it?
[57089.489077] WARNING: CPU: 9 PID: 27910 at include/
[57089.489133] Modules linked in: vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd tls snd_seq_dummy snd_hrtimer rfcomm vhost_net tap zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) xt_CHECKSUM xt_MASQUERADE nf_conntrack_
[57089.489172] snd_sof_utils intel_powerclamp snd_soc_hdac_hda snd_hda_ext_core coretemp snd_soc_
[57089.489226] drm_display_helper cec rc_core mfd_aaeon drm_kms_helper crct10dif_pclmul crc32_pclmul syscopyarea asus_wmi sysfillrect polyval_clmulni sysimgblt ledtrig_audio polyval_generic ghash_clmulni_intel sha512_ssse3 aesni_intel sparse_keymap r8169 nvme crypto_simd platform_profile spi_intel_pci cryptd drm intel_lpss_pci i2c_i801 xhci_pci nvme_core spi_intel realtek ahci i2c_smbus intel_lpss xhci_pci_renesas nvme_common libahci idma64 vmd video wmi pinctrl_alderlake
[57089.489248] CPU: 9 PID: 27910 Comm: CPU 14/KVM Kdump: loaded Tainted: P O 6.2.0-37-generic #38-Ubuntu
[57089.489250] Hardware name: ASUS System Product Name/PRIME Z790-P WIFI, BIOS 1402 09/08/2023
[57089.489251] RIP: 0010:kvm_
[57089.489279] Code: 5c 41 5d 5d 31 d2 31 f6 31 ff c3 cc cc cc cc f0 80 63 38 fb 48 8b 3b 41 bc fc ff ff ff 48 81 c7 f0 9a 00 00 41 83 fd 01 76 c9 <0f> 0b eb c5 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90
[57089.489280] RSP: 0018:ffffbefb86
[57089.489282] RAX: 0000000000000000 RBX: ffff9e0f752f0000 RCX: 0000000000000000
[57089.489283] RDX: 0000000000000800 RSI: 0000000000000000 RDI: ffffbefb83e42af0
[57089.489283] RBP: ffffbefb8612fd80 R08: 0000000000000000 R09: 0000000000000000
[57089.489284] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[57089.489285] R13: 0000000083e39000 R14: 0000000000000001 R15: ffff9e070c5419c0
[57089.489286] FS: 00007fc8d37fe6c
[57089.489287] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[57089.489287] CR2: 000001b294813220 CR3: 0000001439a44000 CR4: 0000000000752ee0
[57089.489288] PKRU: 55555554
[57089.489289] Call Trace:
[57089.489290] <TASK>
[57089.489293] ? show_regs+0x6d/0x80
[57089.489297] ? __warn+0x89/0x160
[57089.489300] ? kvm_vcpu_
[57089.489331] ? report_
[57089.489335] ? handle_
[57089.489337] ? exc_invalid_
[57089.489338] ? asm_exc_
[57089.489341] ? kvm_vcpu_
[57089.489402] ? kvm_vcpu_
[57089.489449] kvm_vcpu_
[57089.489484] vcpu_run+
[57089.489559] kvm_arch_
[57089.489609] kvm_vcpu_
[57089.489636] ? __fget_
[57089.489639] __x64_sys_
[57089.489642] do_syscall_
[57089.489645] ? do_syscall_
[57089.489647] ? do_syscall_
[57089.489649] entry_SYSCALL_
[57089.489651] RIP: 0033:0x7fcd21511eef
[57089.489653] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[57089.489654] RSP: 002b:00007fc8d3
[57089.489655] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007fcd21511eef
[57089.489656] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000027
[57089.489657] RBP: 0000560c373b9f70 R08: 0000560c35746ef8 R09: 00000000000000ff
[57089.489658] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
[57089.489658] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000001
[57089.489660] </TASK>
[57089.489661] ---[ end trace 0000000000000000 ]---
[57089.489663] =======
[57089.489665] UBSAN: array-index-
[57089.489667] index -2082238464 is out of range for type 'atomic_long_t [2]'
[57089.489668] CPU: 9 PID: 27910 Comm: CPU 14/KVM Kdump: loaded Tainted: P W O 6.2.0-37-generic #38-Ubuntu
[57089.489670] Hardware name: ASUS System Product Name/PRIME Z790-P WIFI, BIOS 1402 09/08/2023
[57089.489670] Call Trace:
[57089.489670] <TASK>
[57089.489671] dump_stack_
[57089.489673] dump_stack+
[57089.489674] __ubsan_
[57089.489678] __srcu_
[57089.489680] kvm_vcpu_
[57089.489710] ? kvm_vcpu_
[57089.489740] kvm_vcpu_
[57089.489772] vcpu_run+
[57089.489811] kvm_arch_
[57089.489852] kvm_vcpu_
[57089.489885] ? __fget_
[57089.489887] __x64_sys_
[57089.489889] do_syscall_
[57089.489891] ? do_syscall_
[57089.489892] ? do_syscall_
[57089.489894] entry_SYSCALL_
[57089.489896] RIP: 0033:0x7fcd21511eef
[57089.489897] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[57089.489898] RSP: 002b:00007fc8d3
[57089.489899] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007fcd21511eef
[57089.489900] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000027
[57089.489900] RBP: 0000560c373b9f70 R08: 0000560c35746ef8 R09: 00000000000000ff
[57089.489901] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
[57089.489902] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000001
[57089.489903] </TASK>
[57089.489904] =======
[57089.489909] BUG: unable to handle page fault for address: ffffdef79ca0e450
[57089.489910] #PF: supervisor write access in kernel mode
[57089.489912] #PF: error_code(0x0002) - not-present page
[57089.489913] PGD 100040067 P4D 100040067 PUD 0
[57089.489915] Oops: 0002 [#1] PREEMPT SMP NOPTI
[57089.489916] CPU: 9 PID: 27910 Comm: CPU 14/KVM Kdump: loaded Tainted: P W O 6.2.0-37-generic #38-Ubuntu
[57089.489918] Hardware name: ASUS System Product Name/PRIME Z790-P WIFI, BIOS 1402 09/08/2023
[57089.489918] RIP: 0010:__
[57089.489920] Code: 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 53 48 83 ec 08 f0 83 44 24 fc 00 48 63 f6 48 8b 9f c0 00 00 00 48 83 fe 01 77 14 <65> 48 ff 44 f3 10 48 8b 5d f8 c9 31 f6 31 ff c3 cc cc cc cc 48 c7
[57089.489921] RSP: 0018:ffffbefb86
[57089.489922] RAX: 0000000000000000 RBX: 000040e07ec06440 RCX: 0000000000000000
[57089.489923] RDX: 0000000000000000 RSI: ffffffff83e39000 RDI: 0000000000000000
[57089.489924] RBP: ffffbefb8612fd50 R08: 0000000000000000 R09: 0000000000000000
[57089.489925] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[57089.489926] R13: 0000000083e39000 R14: 0000000000000001 R15: ffff9e070c5419c0
[57089.489927] FS: 00007fc8d37fe6c
[57089.489928] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[57089.489929] CR2: ffffdef79ca0e450 CR3: 0000001439a44000 CR4: 0000000000752ee0
[57089.489930] PKRU: 55555554
[57089.489931] Call Trace:
[57089.489932] <TASK>
[57089.489932] ? show_regs+0x6d/0x80
[57089.489935] ? __die+0x24/0x80
[57089.489937] ? page_fault_
[57089.489939] ? kernelmode_
[57089.489941] ? __bad_area_
[57089.489943] ? bad_area_
[57089.489944] ? do_kern_
[57089.489945] ? exc_page_
[57089.489948] ? asm_exc_
[57089.489950] ? __srcu_
[57089.489952] kvm_vcpu_
[57089.489986] ? kvm_vcpu_
[57089.490016] kvm_vcpu_
[57089.490048] vcpu_run+
[57089.490088] kvm_arch_
[57089.490125] kvm_vcpu_
[57089.490156] ? __fget_
[57089.490158] __x64_sys_
[57089.490159] do_syscall_
[57089.490161] ? do_syscall_
[57089.490163] ? do_syscall_
[57089.490165] entry_SYSCALL_
[57089.490167] RIP: 0033:0x7fcd21511eef
[57089.490168] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[57089.490169] RSP: 002b:00007fc8d3
[57089.490170] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007fcd21511eef
[57089.490171] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000027
[57089.490172] RBP: 0000560c373b9f70 R08: 0000560c35746ef8 R09: 00000000000000ff
[57089.490173] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
[57089.490174] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000001
[57089.490175] </TASK>
[57089.490176] Modules linked in: vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd tls snd_seq_dummy snd_hrtimer rfcomm vhost_net tap zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) xt_CHECKSUM xt_MASQUERADE nf_conntrack_
[57089.490217] snd_sof_utils intel_powerclamp snd_soc_hdac_hda snd_hda_ext_core coretemp snd_soc_
[57089.490264] drm_display_helper cec rc_core mfd_aaeon drm_kms_helper crct10dif_pclmul crc32_pclmul syscopyarea asus_wmi sysfillrect polyval_clmulni sysimgblt ledtrig_audio polyval_generic ghash_clmulni_intel sha512_ssse3 aesni_intel sparse_keymap r8169 nvme crypto_simd platform_profile spi_intel_pci cryptd drm intel_lpss_pci i2c_i801 xhci_pci nvme_core spi_intel realtek ahci i2c_smbus intel_lpss xhci_pci_renesas nvme_common libahci idma64 vmd video wmi pinctrl_alderlake
[57089.490285] CR2: ffffdef79ca0e450
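One detail I noticed while reading the dump: the UBSAN index (-2082238464) looks like the R13 value (0000000083e39000) read as a signed 32-bit integer, and the same value shows up sign-extended in RSI (ffffffff83e39000) in the final oops. This does not explain the cause; it only suggests that the index used on the 'atomic_long_t [2]' array was a garbage value instead of 0 or 1. Below is a minimal sketch of that check, assuming ordinary two's-complement conversion. The helper name `to_signed32` is my own and not anything from the kernel.

```python
def to_signed32(value: int) -> int:
    """Interpret the low 32 bits of value as a signed two's-complement int."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


# Values copied from the log above.
r13 = 0x0000000083E39000          # R13 in the WARNING and the oops
ubsan_index = -2082238464         # "index ... is out of range" from UBSAN
rsi_in_oops = 0xFFFFFFFF83E39000  # RSI in the final page-fault oops

signed = to_signed32(r13)
print(signed)

# Does the signed reading of R13 match the UBSAN index?
print(signed == ubsan_index)

# Sign-extending that value to 64 bits: does it match RSI in the oops?
print((signed & 0xFFFFFFFFFFFFFFFF) == rsi_in_oops)
```

If both comparisons hold, the WARNING, the UBSAN report, and the page fault are most likely one event seen three times rather than three separate problems.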
Question information
- Language: English
- Status: Expired
- For: Ubuntu
- Assignee: No assignee