Ubuntu 20.04 crashes with the following messages

Asked by ngoel

My Ubuntu 20.04 machine crashes: it becomes unresponsive, although the hardware appears to remain powered on. The following messages appear in kern.log; note the entries at Jan 8 01:34:00.
The machine is headless, and we are not plugging in or removing any devices.
It does run compute jobs.

Jan 6 19:18:03 rahim kernel: [12661.683622] wlo1: Limiting TX power to 30 (30 - 0) dBm as advertised by d8:07:b6:71:a7:fe
Jan 8 01:34:00 rahim kernel: [121617.817184] BUG: kernel NULL pointer dereference, address: 00000000000000b1
Jan 8 01:34:00 rahim kernel: [121617.817189] #PF: supervisor read access in kernel mode
Jan 8 01:34:00 rahim kernel: [121617.817190] #PF: error_code(0x0000) - not-present page
Jan 8 01:34:00 rahim kernel: [121617.817192] PGD 0 P4D 0
Jan 8 01:34:00 rahim kernel: [121617.817194] Oops: 0000 [#1] SMP NOPTI
Jan 8 01:34:00 rahim kernel: [121617.817196] CPU: 12 PID: 280272 Comm: nnet3-chain-tra Tainted: P OE 5.4.0-92-generic #103-Ubuntu
Jan 8 01:34:00 rahim kernel: [121617.817197] Hardware name: Micro-Star International Co., Ltd. MS-7C35/MEG X570 UNIFY (MS-7C35), BIOS A.A0 06/22/2021
Jan 8 01:34:00 rahim kernel: [121617.817357] RIP: 0010:_nv031290rm+0x79/0x890 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.817359] Code: 06 00 00 41 bf 01 00 00 00 4c 8d 65 48 31 db 44 89
Jan 8 17:17:05 rahim kernel: [ 0.000000] Linux version 5.4.0-92-generic (buildd@lgw01-amd64-016) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #103-Ubuntu SMP Fri Nov 26 16:13:00 UTC 2021 (Ubuntu 5.4.0-92.103-generic 5.4.157)
Jan 8 17:17:05 rahim kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.4.0-92-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro
Jan 8 17:17:05 rahim kernel: [ 0.000000] KERNEL supported cpus:

Question information

Language: English
Status: Answered
For: Ubuntu
Assignee: No assignee
ngoel (nagendra-goel) said :
#1

Here is a syslog excerpt with more detailed information. However, I can't figure out how to fix the issue. Please help.

Jan 8 00:00:13 rahim systemd[1]: logrotate.service: Succeeded.
Jan 8 00:00:13 rahim systemd[1]: Finished Rotate log files.
Jan 8 00:00:13 rahim systemd[1]: man-db.service: Succeeded.
Jan 8 00:00:13 rahim systemd[1]: Finished Daily man-db regeneration.
Jan 8 00:17:01 rahim CRON[257152]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Jan 8 01:17:01 rahim CRON[275144]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Jan 8 01:33:18 rahim systemd[1]: Starting Message of the Day...
Jan 8 01:33:50 rahim 50-motd-news[280147]: * Super-optimized for small spaces - read how we shrank the memory
Jan 8 01:33:50 rahim 50-motd-news[280147]: footprint of MicroK8s to make it the smallest full K8s around.
Jan 8 01:33:50 rahim 50-motd-news[280147]: https://ubuntu.com/blog/microk8s-memory-optimisation
Jan 8 01:33:50 rahim systemd[1]: motd-news.service: Succeeded.
Jan 8 01:33:50 rahim systemd[1]: Finished Message of the Day.
Jan 8 01:34:00 rahim kernel: [121617.817184] BUG: kernel NULL pointer dereference, address: 00000000000000b1
Jan 8 01:34:00 rahim kernel: [121617.817189] #PF: supervisor read access in kernel mode
Jan 8 01:34:00 rahim kernel: [121617.817190] #PF: error_code(0x0000) - not-present page
Jan 8 01:34:00 rahim kernel: [121617.817192] PGD 0 P4D 0
Jan 8 01:34:00 rahim kernel: [121617.817194] Oops: 0000 [#1] SMP NOPTI
Jan 8 01:34:00 rahim kernel: [121617.817196] CPU: 12 PID: 280272 Comm: nnet3-chain-tra Tainted: P OE 5.4.0-92-generic #103-Ubuntu
Jan 8 01:34:00 rahim kernel: [121617.817197] Hardware name: Micro-Star International Co., Ltd. MS-7C35/MEG X570 UNIFY (MS-7C35), BIOS A.A0 06/22/2021
Jan 8 01:34:00 rahim kernel: [121617.817357] RIP: 0010:_nv031290rm+0x79/0x890 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.817359] Code: 06 00 00 41 bf 01 00 00 00 4c 8d 65 48 31 db 44 89 7d 10 66 0f 1f 44 00 00 41 f6 c5 01 0f 84 90 00 00 00 49 8b 86 f0 19 00 00 <80> b8 b1 00 00 00 00 74 12 b8 01 00 00 00 89 d9 d3 e0 41 85 86 54
Jan 8 01:34:00 rahim kernel: [121617.817362] RSP: 0018:ffffbd610138f900 EFLAGS: 00010202
Jan 8 01:34:00 rahim kernel: [121617.817363] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002
Jan 8 01:34:00 rahim kernel: [121617.817365] RDX: ffff9fbbd3fb0008 RSI: ffff9fb5efb0a008 RDI: ffff9fabe3f5c008
Jan 8 01:34:00 rahim kernel: [121617.817366] RBP: ffff9fb6f3a45d70 R08: ffff9fbbeeb301c0 R09: ffff9fbbe8406680
Jan 8 01:34:00 rahim kernel: [121617.817367] R10: ffffffffc0b48040 R11: 0000000000000000 R12: ffff9fb6f3a45db8
Jan 8 01:34:00 rahim kernel: [121617.817368] R13: 000000000000000f R14: ffff9fb5efb0a008 R15: 0000000000000001
Jan 8 01:34:00 rahim kernel: [121617.817369] FS: 00007fc678227000(0000) GS:ffff9fbbeeb00000(0000) knlGS:0000000000000000
Jan 8 01:34:00 rahim kernel: [121617.817371] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 8 01:34:00 rahim kernel: [121617.817372] CR2: 00000000000000b1 CR3: 0000001b08370000 CR4: 0000000000740ee0
Jan 8 01:34:00 rahim kernel: [121617.817374] PKRU: 55555554
Jan 8 01:34:00 rahim kernel: [121617.817374] Call Trace:
Jan 8 01:34:00 rahim kernel: [121617.817520] ? _nv031406rm+0x82/0x270 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.817651] ? _nv031435rm+0x17/0x30 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.817768] ? _nv022268rm+0xc0/0x1b0 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.817887] ? _nv022273rm+0x11b/0x230 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.818004] ? _nv022273rm+0x211/0x230 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.818120] ? _nv022275rm+0x310/0x310 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.818200] ? _nv022948rm+0x32d/0x470 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.818280] ? _nv022948rm+0x304/0x470 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.818361] ? _nv000711rm+0x32a/0x680 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.818445] ? _nv000704rm+0x1772/0x22b0 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.818530] ? rm_init_adapter+0xc5/0xe0 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.818595] ? nv_open_device+0x125/0x8d0 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.818655] ? nvidia_open+0x279/0x4c0 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.818696] ? nvidia_frontend_open+0x58/0xa0 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.818699] ? chrdev_open+0xd3/0x1c0
Jan 8 01:34:00 rahim kernel: [121617.818701] ? cdev_default_release+0x20/0x20
Jan 8 01:34:00 rahim kernel: [121617.818702] ? do_dentry_open+0x143/0x3a0
Jan 8 01:34:00 rahim kernel: [121617.818704] ? vfs_open+0x2d/0x30
Jan 8 01:34:00 rahim kernel: [121617.818706] ? do_last+0x194/0x900
Jan 8 01:34:00 rahim kernel: [121617.818707] ? path_openat+0x8d/0x290
Jan 8 01:34:00 rahim kernel: [121617.818708] ? do_filp_open+0x91/0x100
Jan 8 01:34:00 rahim kernel: [121617.818710] ? __alloc_fd+0x46/0x150
Jan 8 01:34:00 rahim kernel: [121617.818711] ? do_sys_open+0x17e/0x290
Jan 8 01:34:00 rahim kernel: [121617.818712] ? __x64_sys_openat+0x20/0x30
Jan 8 01:34:00 rahim kernel: [121617.818714] ? do_syscall_64+0x57/0x190
Jan 8 01:34:00 rahim kernel: [121617.818716] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jan 8 01:34:00 rahim kernel: [121617.818717] Modules linked in: nvidia_uvm(OE) nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) drm_kms_helper drm fb_sys_fops syscopyarea sysfillrect sysimgblt nfsv3 nfs_acl nfs lockd grace fscache bnep nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ccm snd_hda_codec_hdmi edac_mce_amd btusb btrtl btbcm btintel joydev ccp bluetooth kvm snd_hda_codec_realtek iwlmvm snd_hda_codec_generic ledtrig_audio ecdh_generic ecc mac80211 mxm_wmi libarc4 wmi_bmof snd_hda_intel snd_intel_dspcfg snd_hda_codec iwlwifi k10temp snd_hda_core snd_hwdep snd_pcm cfg80211 snd_timer snd soundcore mac_hid sch_fq_codel msr sunrpc ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper ahci i2c_piix4 nvme r8169 libahci realtek nvme_core wmi [last unloaded: ipmi_msghandler]
Jan 8 01:34:00 rahim kernel: [121617.818749] CR2: 00000000000000b1
Jan 8 01:34:00 rahim kernel: [121617.818751] ---[ end trace 2e1dabff9b68b589 ]---
Jan 8 01:34:00 rahim kernel: [121617.818860] RIP: 0010:_nv031290rm+0x79/0x890 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.818861] Code: 06 00 00 41 bf 01 00 00 00 4c 8d 65 48 31 db 44 89 7d 10 66 0f 1f 44 00 00 41 f6 c5 01 0f 84 90 00 00 00 49 8b 86 f0 19 00 00 <80> b8 b1 00 00 00 00 74 12 b8 01 00 00 00 89 d9 d3 e0 41 85 86 54
Jan 8 01:34:00 rahim kernel: [121617.818863] RSP: 0018:ffffbd610138f900 EFLAGS: 00010202
Jan 8 01:34:00 rahim kernel: [121617.818863] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002
Jan 8 01:34:00 rahim kernel: [121617.818864] RDX: ffff9fbbd3fb0008 RSI: ffff9fb5efb0a008 RDI: ffff9fabe3f5c008
Jan 8 01:34:00 rahim kernel: [121617.818865] RBP: ffff9fb6f3a45d70 R08: ffff9fbbeeb301c0 R09: ffff9fbbe8406680
Jan 8 01:34:00 rahim kernel: [121617.818866] R10: ffffffffc0b48040 R11: 0000000000000000 R12: ffff9fb6f3a45db8
Jan 8 01:34:00 rahim kernel: [121617.818867] R13: 000000000000000f R14: ffff9fb5efb0a008 R15: 0000000000000001
Jan 8 01:34:00 rahim kernel: [121617.818868] FS: 00007fc678227000(0000) GS:ffff9fbbeeb00000(0000) knlGS:0000000000000000
Jan 8 01:34:00 rahim kernel: [121617.818869] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 8 01:34:00 rahim kernel: [121617.818870] CR2: 00000000000000b1 CR3: 0000001b08370000 CR4: 0000000000740ee0
Jan 8 01:34:00 rahim kernel: [121617.818871] PKRU: 55555554
Jan 8 01:34:00 rahim kernel: [121617.819547] general protection fault: 0000 [#2] SMP NOPTI
Jan 8 01:34:00 rahim kernel: [121617.819549] CPU: 12 PID: 280272 Comm: nnet3-chain-tra Tainted: P D OE 5.4.0-92-generic #103-Ubuntu
Jan 8 01:34:00 rahim kernel: [121617.819550] Hardware name: Micro-Star International Co., Ltd. MS-7C35/MEG X570 UNIFY (MS-7C35), BIOS A.A0 06/22/2021
Jan 8 01:34:00 rahim kernel: [121617.819630] RIP: 0010:_nv009366rm+0x89/0x340 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.819631] Code: 48 c7 42 18 00 00 00 00 c6 42 20 01 48 39 30 0f 87 84 00 00 00 48 89 50 18 48 8b 0f eb 35 0f 1f 00 48 85 f6 0f 84 8f 00 00 00 <80> 7e 20 00 0f 84 85 00 00 00 c6 40 20 00 c6 46 20 00 48 8b 42 08
Jan 8 01:34:00 rahim kernel: [121617.819633] RSP: 0018:ffffbd610138fce8 EFLAGS: 00010002
Jan 8 01:34:00 rahim kernel: [121617.819634] RAX: ffffbd610138faf0 RBX: ffffbd610138fd30 RCX: ffffffffa96001b0
Jan 8 01:34:00 rahim kernel: [121617.819635] RDX: ffffbd610138fd80 RSI: 415e415f41000001 RDI: ffffffffc2643db8
Jan 8 01:34:00 rahim kernel: [121617.819635] RBP: ffff9fb290bd2ff0 R08: ffffffffa96001b0 R09: 0000000000000000
Jan 8 01:34:00 rahim kernel: [121617.819636] R10: 0000000000000008 R11: ffff9fbbcbba8070 R12: 0000000000000000
Jan 8 01:34:00 rahim kernel: [121617.819637] R13: ffffffffc2644540 R14: ffffbd610138fe18 R15: ffffffffc26411c0
Jan 8 01:34:00 rahim kernel: [121617.819638] FS: 00007fc678227000(0000) GS:ffff9fbbeeb00000(0000) knlGS:0000000000000000
Jan 8 01:34:00 rahim kernel: [121617.819639] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 8 01:34:00 rahim kernel: [121617.819640] CR2: 00000000000000b1 CR3: 0000001f5ba0a000 CR4: 0000000000740ee0
Jan 8 01:34:00 rahim kernel: [121617.819641] PKRU: 55555554
Jan 8 01:34:00 rahim kernel: [121617.819641] Call Trace:
Jan 8 01:34:00 rahim kernel: [121617.819683] ? _nv039600rm+0xdf/0x1e0 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.819752] ? rm_cleanup_file_private+0x42/0x140 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.819794] ? nvidia_close+0x149/0x2d0 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.819835] ? nvidia_frontend_close+0x2f/0x50 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.819836] ? __fput+0xcc/0x260
Jan 8 01:34:00 rahim kernel: [121617.819837] ? ____fput+0xe/0x10
Jan 8 01:34:00 rahim kernel: [121617.819839] ? task_work_run+0x8f/0xb0
Jan 8 01:34:00 rahim kernel: [121617.819841] ? do_exit+0x36e/0xaf0
Jan 8 01:34:00 rahim kernel: [121617.819843] ? rewind_stack_do_exit+0x17/0x20
Jan 8 01:34:00 rahim kernel: [121617.819844] Modules linked in: nvidia_uvm(OE) nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) drm_kms_helper drm fb_sys_fops syscopyarea sysfillrect sysimgblt nfsv3 nfs_acl nfs lockd grace fscache bnep nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ccm snd_hda_codec_hdmi edac_mce_amd btusb btrtl btbcm btintel joydev ccp bluetooth kvm snd_hda_codec_realtek iwlmvm snd_hda_codec_generic ledtrig_audio ecdh_generic ecc mac80211 mxm_wmi libarc4 wmi_bmof snd_hda_intel snd_intel_dspcfg snd_hda_codec iwlwifi k10temp snd_hda_core snd_hwdep snd_pcm cfg80211 snd_timer snd soundcore mac_hid sch_fq_codel msr sunrpc ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper ahci i2c_piix4 nvme r8169 libahci realtek nvme_core wmi [last unloaded: ipmi_msghandler]
Jan 8 01:34:00 rahim kernel: [121617.819860] ---[ end trace 2e1dabff9b68b58a ]---
Jan 8 01:34:00 rahim kernel: [121617.819967] RIP: 0010:_nv031290rm+0x79/0x890 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.819968] Code: 06 00 00 41 bf 01 00 00 00 4c 8d 65 48 31 db 44 89 7d 10 66 0f 1f 44 00 00 41 f6 c5 01 0f 84 90 00 00 00 49 8b 86 f0 19 00 00 <80> b8 b1 00 00 00 00 74 12 b8 01 00 00 00 89 d9 d3 e0 41 85 86 54
Jan 8 01:34:00 rahim kernel: [121617.819969] RSP: 0018:ffffbd610138f900 EFLAGS: 00010202
Jan 8 01:34:00 rahim kernel: [121617.819970] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002
Jan 8 01:34:00 rahim kernel: [121617.819971] RDX: ffff9fbbd3fb0008 RSI: ffff9fb5efb0a008 RDI: ffff9fabe3f5c008
Jan 8 01:34:00 rahim kernel: [121617.819972] RBP: ffff9fb6f3a45d70 R08: ffff9fbbeeb301c0 R09: ffff9fbbe8406680
Jan 8 01:34:00 rahim kernel: [121617.819973] R10: ffffffffc0b48040 R11: 0000000000000000 R12: ffff9fb6f3a45db8
Jan 8 01:34:00 rahim kernel: [121617.819974] R13: 000000000000000f R14: ffff9fb5efb0a008 R15: 0000000000000001
Jan 8 01:34:00 rahim kernel: [121617.819975] FS: 00007fc678227000(0000) GS:ffff9fbbeeb00000(0000) knlGS:0000000000000000
Jan 8 01:34:00 rahim kernel: [121617.819976] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 8 01:34:00 rahim kernel: [121617.819976] CR2: 00000000000000b1 CR3: 0000001f5ba0a000 CR4: 0000000000740ee0
Jan 8 01:34:00 rahim kernel: [121617.819977] PKRU: 55555554
Jan 8 01:34:00 rahim kernel: [121617.819978] Fixing recursive fault but reboot is needed!
Jan 8 17:17:05 rahim systemd-modules-load[698]: Inserted module 'msr'
Jan 8 17:17:05 rahim systemd-sysctl[719]: Not setting net/ipv4/conf/all/promote_secondaries (explicit setting exists).
Jan 8 17:17:05 rahim systemd-sysctl[719]: Not setting net/ipv4/conf/default/promote_secondaries (explicit setting exists).
Jan 8 17:17:05 rahim lvm[693]: 1 logical volume(s) in volume group "ubuntu-vg" monitored
Jan 8 17:17:05 rahim systemd[1]: Starting Flush Journal to Persistent Storage...
Jan 8 17:17:05 rahim kernel: [ 0.000000] Linux version 5.4.0-92-generic (buildd@lgw01-amd64-016) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #103-Ubuntu SMP Fri Nov 26 16:13:00 UTC 2021 (Ubuntu 5.4.0-92.103-generic 5.4.157)
Jan 8 17:17:05 rahim systemd[1]: Started udev Kernel Device Manager.
Jan 8 17:17:05 rahim kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.4.0-92-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro
Jan 8 17:17:05 rahim kernel: [ 0.000000] KERNEL supported cpus:
Jan 8 17:17:05 rahim kernel: [ 0.000000] Intel GenuineIntel
Jan 8 17:17:05 rahim kernel: [ 0.000000] AMD AuthenticAMD
Jan 8 17:17:05 rahim kernel: [ 0.000000] Hygon HygonGenuine
Jan 8 17:17:05 rahim kernel: [ 0.000000] Centaur CentaurHauls
Jan 8 17:17:05 rahim systemd[1]: Finished Flush Journal to Persistent Storage.
Jan 8 17:17:05 rahim kernel: [ 0.000000] zhaoxin Shanghai
Jan 8 17:17:05 rahim kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Jan 8 17:17:05 rahim kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Jan 8 17:17:05 rahim kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Jan 8 17:17:05 rahim kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
Jan 8 17:17:05 rahim kernel: [ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
Jan 8 17:17:05 rahim kernel: [ 0.000000] x86/fpu: xstate_offset[9]: 832, xstate_sizes[9]: 8
Jan 8 17:17:05 rahim kernel: [ 0.000000] x86/fpu: Enabled xstate features 0x207, context size is 840 bytes, using 'compacted' format.
Jan 8 17:17:05 rahim kernel: [ 0.000000] BIOS-provided physical RAM map:
Jan 8 17:17:05 rahim kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
Jan 8 17:17:05 rahim kernel: [ 0.000000] BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
Jan 8 17:17:05 rahim systemd[1]: Activated swap /swap.img.
Jan 8 17:17:05 rahim kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000009d81fff] usable
Jan 8 17:17:05 rahim kernel: [ 0.000000] BIOS-e820: [mem 0x0000000009d82000-0x0000000009ffffff] reserved

Manfred Hampl (m-hampl) said :
#2

The crash seems to be in the NVIDIA display driver.
If this is a headless system, you should consider uninstalling that driver in favor of a generic display driver (or the nouveau driver).

ngoel (nagendra-goel) said :
#3

The NVIDIA driver is needed because we use the GPUs (CUDA) for computations; we have several GPU cards. Which specific lines in the log gave you this hint? I can try changing the driver version.

Manfred Hampl (m-hampl) said :
#4

Re: "Which specific lines in the log gave you this hint?"

For example, these lines, all of which are tagged with the [nvidia] module:
Jan 8 01:34:00 rahim kernel: [121617.817357] RIP: 0010:_nv031290rm+0x79/0x890 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.817520] ? _nv031406rm+0x82/0x270 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.817651] ? _nv031435rm+0x17/0x30 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.817768] ? _nv022268rm+0xc0/0x1b0 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.817887] ? _nv022273rm+0x11b/0x230 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.818004] ? _nv022273rm+0x211/0x230 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.818120] ? _nv022275rm+0x310/0x310 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.818200] ? _nv022948rm+0x32d/0x470 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.818280] ? _nv022948rm+0x304/0x470 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.818361] ? _nv000711rm+0x32a/0x680 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.818445] ? _nv000704rm+0x1772/0x22b0 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.818530] ? rm_init_adapter+0xc5/0xe0 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.818595] ? nv_open_device+0x125/0x8d0 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.818655] ? nvidia_open+0x279/0x4c0 [nvidia]
Jan 8 01:34:00 rahim kernel: [121617.818696] ? nvidia_frontend_open+0x58/0xa0 [nvidia]
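
The same attribution can be done mechanically. Below is a minimal sketch, assuming Python 3 is available, that scans oops lines for the `symbol+offset/size [module]` pattern seen above and reports which module owns the faulting RIP and how many call-trace frames belong to each module. The function name `summarize_oops` and the label `kernel` for frames without a bracketed module are illustrative choices of mine, not part of any kernel tooling.

```python
import re
from collections import Counter

# Matches "symbol+0xOFF/0xSIZE" optionally followed by " [module]",
# e.g. "_nv031290rm+0x79/0x890 [nvidia]". Frames in the core kernel
# (such as "chrdev_open+0xd3/0x1c0") have no bracketed module name.
FRAME_RE = re.compile(
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(?P<module>\w+)\])?"
)


def summarize_oops(lines):
    """Return (module owning the first RIP line or None, Counter of modules per trace frame)."""
    rip_module = None
    frames = Counter()
    for line in lines:
        match = FRAME_RE.search(line)
        if match is None:
            continue  # not a RIP or call-trace line
        # "kernel" is a label assumed here for frames with no module tag.
        module = match.group("module") or "kernel"
        if "RIP:" in line:
            if rip_module is None:
                rip_module = module
        else:
            frames[module] += 1
    return rip_module, frames


# A few lines copied from the log in this question.
sample = [
    "Jan 8 01:34:00 rahim kernel: [121617.817357] RIP: 0010:_nv031290rm+0x79/0x890 [nvidia]",
    "Jan 8 01:34:00 rahim kernel: [121617.818530] ? rm_init_adapter+0xc5/0xe0 [nvidia]",
    "Jan 8 01:34:00 rahim kernel: [121617.818696] ? nvidia_frontend_open+0x58/0xa0 [nvidia]",
    "Jan 8 01:34:00 rahim kernel: [121617.818699] ? chrdev_open+0xd3/0x1c0",
]

rip_module, frames = summarize_oops(sample)
print("faulting module:", rip_module)
print("frames per module:", dict(frames))
```

To run it against a real log, pass the lines of the file instead of `sample`, for example `summarize_oops(open("/var/log/kern.log"))`. If the faulting module and most of the frames point at the same out-of-tree driver, that driver is the first thing to look at.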
