nvidia 346: boot hang (optimus off) or missing display and CUDA failure (optimus on)

Asked by EEngineer on 2015-02-25

I have a Dell M6800 laptop with an NVIDIA Quadro K4100M. I've been dealing with three different failure modes, depending on the NVIDIA Optimus setting in the BIOS and on which driver packages are installed.

#1 NVIDIA Optimus Off:
- 10% chance of a boot hang when it tries to initialize the display
- unable to switch to terminals via ctrl+alt+f1
- unable to use the low-power integrated GPU

Here are the logs from a boot that hangs.
dmesg stack trace:
[ 3.964123] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 3.964127] IP: [<ffffffff81727b5b>] __down_common+0x4c/0x144
[ 3.964129] PGD 7ff1b4067 PUD 7ff1b5067 PMD 0
[ 3.964130] Oops: 0002 [#1] SMP
[ 3.964154] CPU: 2 PID: 719 Comm: nvidia-persiste Tainted: P OX 3.13.0-46-generic #75-Ubuntu

http://paste.ubuntu.com/10438694/
Xorg:
http://paste.ubuntu.com/10416863/
lightdm:
http://paste.ubuntu.com/10416900/
http://paste.ubuntu.com/10416912/
http://paste.ubuntu.com/10416916/
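
For anyone comparing their own hang against this one, here is a minimal sketch of pulling the identifying fields out of an oops like the dmesg excerpt above: the faulting symbol, the process name, and the kernel release. The function name `summarize_oops` and the regular expressions are my own assumptions about this log format, not part of any NVIDIA or Ubuntu tool.

```python
import re

# Sample lines copied from the dmesg excerpt in this question.
SAMPLE = """\
[ 3.964123] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 3.964127] IP: [<ffffffff81727b5b>] __down_common+0x4c/0x144
[ 3.964130] Oops: 0002 [#1] SMP
[ 3.964154] CPU: 2 PID: 719 Comm: nvidia-persiste Tainted: P OX 3.13.0-46-generic #75-Ubuntu
"""


def summarize_oops(log_text):
    """Return the faulting symbol, process name and kernel release, where found.

    Hypothetical helper: the patterns only cover the line shapes shown above.
    """
    summary = {}
    ip = re.search(r"IP: \[<[0-9a-f]+>\] (\S+?)\+0x", log_text)
    if ip:
        summary["symbol"] = ip.group(1)
    comm = re.search(r"Comm: (\S+)", log_text)
    if comm:
        summary["comm"] = comm.group(1)
    # The kernel release follows the taint flags on the "CPU:" line.
    kernel = re.search(r"Tainted: .*?(\d+\.\d+\.\d+-\d+-\S+)", log_text)
    if kernel:
        summary["kernel"] = kernel.group(1)
    return summary


if __name__ == "__main__":
    print(summarize_oops(SAMPLE))
```

Two reports with the same symbol and process name but different kernel releases are probably worth treating as the same problem.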

#2 NVIDIA Optimus On, using NVIDIA GPU:
- without the xedgers drivers: stretched displays, as shown here:
http://askubuntu.com/questions/573465/14-04-nvidia-dual-display-externallaptop-stretched-display-offset-desktop

#3 NVIDIA Optimus On, using NVIDIA GPU, with xedgers drivers:
- No CUDA device detected
- when I dock the laptop and connect two external displays, the NVIDIA driver does not let me use the laptop display. The Ubuntu Displays window shows it, but the screen stays blank. The panel is powered on, but I can't drag anything onto it and the launcher is not drawn, although I can still launch applications by clicking where the launcher should be (the mouse pointer is visible). If I overlap it with one of the two external displays, it mirrors what I see on that external monitor.

These can be seen here:
http://imgur.com/a/BGdKd
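
To make the "No CUDA device detected" symptom reproducible without a full CUDA toolkit, one option is to ask the driver library directly. This is a rough sketch, assuming the driver installs `libcuda.so.1` on the library path; it uses the CUDA driver API calls `cuInit` and `cuDeviceGetCount` through `ctypes`. The function name `cuda_device_count` is made up for this example, and this is not necessarily how the failure was originally observed.

```python
import ctypes


def cuda_device_count(lib_name="libcuda.so.1"):
    """Return the number of CUDA devices the driver reports, or None.

    None means the driver library could not be loaded or a driver API
    call returned an error (anything other than CUDA_SUCCESS, which is 0).
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError:
        return None  # library missing or not loadable
    if libcuda.cuInit(0) != 0:
        return None  # driver failed to initialize
    count = ctypes.c_int(0)
    if libcuda.cuDeviceGetCount(ctypes.byref(count)) != 0:
        return None
    return count.value


if __name__ == "__main__":
    print("CUDA devices:", cuda_device_count())
```

A result of `None` or `0` with Optimus on, against a positive count with Optimus off, would match what is described in #3.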

I've been using option #1 since it lets me use CUDA and two monitors plus the laptop display, but the boot hangs have become more and more frequent. Now I'm forced to use option #3, which is suboptimal.

I'm using nvidia-prime, not bumblebee. With Optimus off, having nvidia-prime installed or removed seemed to make no difference. I've had problems #1 and #2 with the 331 and 346 drivers from the normal repository. Problems #1 and #3 have occurred since I switched to the xedgers PPA.

rc nvidia-331 331.113-0ubuntu0.0.4 amd64 NVIDIA binary driver - version 331.113
ii nvidia-346 346.35-0ubuntu1~xedgers14.04.1 amd64 NVIDIA binary driver - version 346.35
ii nvidia-346-dev 346.35-0ubuntu1~xedgers14.04.1 amd64 NVIDIA binary Xorg driver development files
ii nvidia-346-uvm 346.35-0ubuntu1~xedgers14.04.1 amd64 NVIDIA Unified Memory kernel module
ii nvidia-modprobe 346.29-0ubuntu1 amd64 Load the NVIDIA kernel driver and create device files
ii nvidia-opencl-icd-346 346.35-0ubuntu1~xedgers14.04.1 amd64 NVIDIA OpenCL ICD
ii nvidia-prime 0.6.2 amd64 Tools to enable NVIDIA's Prime
ii nvidia-settings 346.35-0ubuntu1~xedgers14.04.1 amd64 Tool for configuring the NVIDIA graphics driver
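
As a quick sanity check on a listing like the one above, the installed packages can be grouped by upstream driver version. This is only a sketch: the helper names are hypothetical, the sample is a subset of the lines above, and a differing version is not necessarily a problem (nvidia-modprobe, for instance, is packaged separately from the driver).

```python
# A subset of the package listing above, in `dpkg -l` column order:
# status, name, version, architecture, description.
DPKG_LINES = """\
rc nvidia-331 331.113-0ubuntu0.0.4 amd64 NVIDIA binary driver - version 331.113
ii nvidia-346 346.35-0ubuntu1~xedgers14.04.1 amd64 NVIDIA binary driver - version 346.35
ii nvidia-346-uvm 346.35-0ubuntu1~xedgers14.04.1 amd64 NVIDIA Unified Memory kernel module
ii nvidia-modprobe 346.29-0ubuntu1 amd64 Load the NVIDIA kernel driver and create device files
ii nvidia-settings 346.35-0ubuntu1~xedgers14.04.1 amd64 Tool for configuring the NVIDIA graphics driver
"""


def installed_upstream_versions(listing):
    """Map package name to upstream version for fully installed ('ii') packages."""
    versions = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0] != "ii":
            continue  # skip removed ('rc') packages and malformed lines
        name, version = fields[1], fields[2]
        # Upstream version is the part before the first '-' (Debian revision).
        versions[name] = version.split("-", 1)[0]
    return versions


def group_by_version(versions):
    """Group package names by upstream version, keeping listing order."""
    groups = {}
    for name, upstream in versions.items():
        groups.setdefault(upstream, []).append(name)
    return groups


if __name__ == "__main__":
    for upstream, names in group_by_version(installed_upstream_versions(DPKG_LINES)).items():
        print(upstream, ", ".join(names))
```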

nouveau and bumblebee are blacklisted. Upgrading Dell's BIOS from A09 to A13 had no effect.

Question information

Language:
English
Status:
Expired
For:
Ubuntu nvidia-graphics-drivers-346
Assignee:
No assignee
Last query:
2015-02-25
Last reply:
2015-03-13

Launchpad Janitor (janitor) said : #1

This question was expired because it remained in the 'Open' state without activity for the last 15 days.

Alistair Muldal (alimuldal) said : #2

I'm experiencing the same kernel oops in 14.04 after upgrading from 3.13.0-52 to 3.13.0-53:

Jun 3 13:56:16 cjagpu2 kernel: [ 4.538913] BUG: unable to handle kernel NULL pointer dereference at (null)
Jun 3 13:56:16 cjagpu2 kernel: [ 4.538939] IP: [<ffffffff817296ab>] __down_common+0x4c/0x144
Jun 3 13:56:16 cjagpu2 kernel: [ 4.538957] PGD 7fefcc067 PUD 7fddcd067 PMD 0
Jun 3 13:56:16 cjagpu2 kernel: [ 4.538974] Oops: 0002 [#1] SMP
Jun 3 13:56:16 cjagpu2 kernel: [ 4.538986] Modules linked in: nvidia(POX+) wl(POX+) x86_pkg_temp_thermal eeepc_wmi intel_powerclamp asus_wmi mxm_wmi sparse_keymap coretemp kvm btusb crct10dif_pclmul bluetooth crc32_pclmul snd_hda_codec_realtek snd_seq_midi ghash_clmulni_intel dm_multipath joydev snd_hda_intel(+) aesni_intel snd_seq_midi_event snd_hda_codec aes_x86_64 scsi_dh snd_rawmidi lrw snd_seq gf128mul snd_hwdep glue_helper snd_pcm ablk_helper cfg80211 cryptd serio_raw drm snd_page_alloc snd_seq_device snd_timer snd mei_me mei soundcore wmi shpchp acpi_pad video mac_hid parport_pc ppdev lp parport btrfs hid_generic usbhid hid xor raid6_pq libcrc32c igb e1000e psmouse i2c_algo_bit ahci dca libahci ptp dm_mirror dm_region_hash pps_core dm_log
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539241] CPU: 7 PID: 680 Comm: nvidia-persiste Tainted: P OX 3.13.0-53-generic #89-Ubuntu
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539263] Hardware name: ASUS All Series/Z97-DELUXE, BIOS 2012 09/30/2014
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539279] task: ffff880801028000 ti: ffff8807ff406000 task.ti: ffff8807ff406000
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539297] RIP: 0010:[<ffffffff817296ab>] [<ffffffff817296ab>] __down_common+0x4c/0x144
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539317] RSP: 0018:ffff8807ff407b48 EFLAGS: 00010096
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539330] RAX: 0000000000000000 RBX: ffffffffa173f2b0 RCX: 0000000000000000
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539347] RDX: ffffffffa173f2b8 RSI: ffff8807ff407b50 RDI: ffffffffa173f2b0
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539363] RBP: ffff8807ff407b98 R08: 0000000000000296 R09: ffffffffa14a651b
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539380] R10: 0000000000000022 R11: 00000000000000ff R12: 7fffffffffffffff
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539396] R13: ffff880801028000 R14: 0000000000000002 R15: 0000000000000000
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539413] FS: 00007fac25624740(0000) GS:ffff88082edc0000(0000) knlGS:0000000000000000
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539431] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539445] CR2: 0000000000000000 CR3: 00000007fefcd000 CR4: 00000000001407e0
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539462] Stack:
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539467] 0000000000000000 ffffffffa173f2b8 0000000000000000 ffff8807ff19d700
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539490] 0000000000000000 ffffffffa173f2b0 ffff8807ff070000 0000000000000003
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539514] ffff880800bcecf8 0000000000000002 ffff8807ff407ba8 ffffffff817297c0
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539537] Call Trace:
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539545] [<ffffffff817297c0>] __down+0x1d/0x1f
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539558] [<ffffffff810b0e01>] down+0x41/0x50
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539609] [<ffffffffa14a6817>] nvidia_open+0x387/0x8b0 [nvidia]
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539654] [<ffffffffa14b3bd9>] nvidia_frontend_open+0x49/0xa0 [nvidia]
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539672] [<ffffffff811c2a6f>] chrdev_open+0x9f/0x1d0
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539686] [<ffffffff811bb5a3>] do_dentry_open+0x233/0x2e0
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539700] [<ffffffff811c29d0>] ? cdev_put+0x30/0x30
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539713] [<ffffffff811bb8d9>] vfs_open+0x49/0x50
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539727] [<ffffffff811ccc24>] do_last+0x564/0x1230
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539740] [<ffffffff811cacc1>] ? link_path_walk+0x71/0x870
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539755] [<ffffffff81315aab>] ? apparmor_file_alloc_security+0x5b/0x180
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539772] [<ffffffff811cd9ab>] path_openat+0xbb/0x650
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539787] [<ffffffff811a23d5>] ? kmem_cache_free+0x1b5/0x1e0
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539802] [<ffffffff810c919d>] ? call_rcu_sched+0x1d/0x20
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539817] [<ffffffff811cedaa>] do_filp_open+0x3a/0x90
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539830] [<ffffffff811dbc47>] ? __alloc_fd+0xa7/0x130
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539844] [<ffffffff811bd3f9>] do_sys_open+0x129/0x280
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539858] [<ffffffff811bd56e>] SyS_open+0x1e/0x20
Jun 3 13:56:16 cjagpu2 kernel: [ 4.539871] [<ffffffff8173391d>] system_call_fastpath+0x1a/0x1f
Jun 3 13:56:16 cjagpu2 kernel: [ 4.540436] Code: 54 49 89 d4 48 8d 57 08 53 48 89 fb 48 83 e4 f0 48 83 ec 28 48 8b 47 10 48 8d 74 24 08 48 89 54 24 08 48 89 44 24 10 48 89 77 10 <48> 89 30 4c 89 f0 4c 89 6c 24 18 83 e0 01 c6 44 24 20 00 48 89
Jun 3 13:56:16 cjagpu2 kernel: [ 4.542211] RIP [<ffffffff817296ab>] __down_common+0x4c/0x144
Jun 3 13:56:16 cjagpu2 kernel: [ 4.542933] RSP <ffff8807ff407b48>
Jun 3 13:56:16 cjagpu2 kernel: [ 4.543497] CR2: 0000000000000000
Jun 3 13:56:16 cjagpu2 kernel: [ 4.544088] ---[ end trace df205f9f94a0e59d ]---

I'm using nvidia-346.46 installed from the NVIDIA CUDA developers repository (http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/).
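
In case it helps with triage, here is a small sketch that picks out which call-trace frames belong to a loadable module, using a few lines from the trace above. The regular expression and the name `parse_frames` are my own assumptions about this trace format. In this excerpt the frames directly below `down` are `nvidia_open` and `nvidia_frontend_open`, both in the `nvidia` module.

```python
import re

# A few frames copied from the call trace above (syslog prefix removed).
TRACE = """\
[ 4.539545] [<ffffffff817297c0>] __down+0x1d/0x1f
[ 4.539558] [<ffffffff810b0e01>] down+0x41/0x50
[ 4.539609] [<ffffffffa14a6817>] nvidia_open+0x387/0x8b0 [nvidia]
[ 4.539654] [<ffffffffa14b3bd9>] nvidia_frontend_open+0x49/0xa0 [nvidia]
[ 4.539672] [<ffffffff811c2a6f>] chrdev_open+0x9f/0x1d0
[ 4.539700] [<ffffffff811c29d0>] ? cdev_put+0x30/0x30
"""

# address, optional "? " marker (unreliable frame), symbol+offset/size, optional [module]
FRAME = re.compile(
    r"\[<[0-9a-f]+>\] (\? )?(\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?"
)


def parse_frames(text):
    """Return (symbol, module or None, reliable) tuples in trace order."""
    return [
        (m.group(2), m.group(3), m.group(1) is None)
        for m in FRAME.finditer(text)
    ]


if __name__ == "__main__":
    for symbol, module, reliable in parse_frames(TRACE):
        if module is not None:
            print(f"{symbol} is in module {module}")
```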