MCE Kernel Panic on kernel 2.6.38

Bug #801840 reported by Brendan McLearie
This bug affects 5 people
Affects: linux (Ubuntu)
Status: Confirmed
Importance: High
Assigned to: Unassigned

Bug Description

Repeatable kernel panics after upgrading to 11.04 on kernel 2.6.38-8. Tried the pre-release kernel 2.6.38-10; same result.

Reverted to 2.6.35-28 no problems.

I usually boot my system with clock=acpi_pm. This resolved some VMware stability problems some time back. I have tried with and without various ACPI options, to no avail. I have also enabled and disabled ACPI in the BIOS.
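When comparing boots with and without an option like this, it can help to confirm which value actually reached the kernel. The following is a minimal sketch and not part of the original report: `kernel_param` is a hypothetical helper that extracts one `name=value` option from a command line string, defaulting to the contents of /proc/cmdline.

```shell
#!/bin/bash
# kernel_param NAME [CMDLINE]
# Print the value of NAME=value from a kernel command line.
# CMDLINE defaults to the running kernel's /proc/cmdline.
# (Hypothetical helper name, for illustration only.)
kernel_param() {
    local name cmdline word
    name=$1
    cmdline=${2:-$(cat /proc/cmdline)}
    # Intentionally unquoted: split the command line on whitespace.
    for word in $cmdline; do
        case $word in
            "$name"=*)
                printf '%s\n' "${word#*=}"
                return 0
                ;;
        esac
    done
    return 1
}

# Example, using the command line recorded in this report:
#   kernel_param clock "root=UUID=... ro clock=acpi_pm"
# The clocksource the kernel ended up using can be read from
#   /sys/devices/system/clocksource/clocksource0/current_clocksource
```

The helper only reports what was passed on the command line; the sysfs file mentioned in the comment shows what the kernel actually selected.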

System is a dual xeon server, 8GB, ASUS Z7S.

Absolutely rock solid months on end.

The panic does not always happen at the same time, but it always happens within a few minutes (at best) of booting.

Thinking of ACPI and possible thermal issues, I forced all fans to max in the BIOS (by setting the threshold temperatures down). No resolution.

Bug report package generated while running 2.6.35-28 kernel. System otherwise unchanged.

WORKAROUND: BIOS -> disable 1394

ProblemType: Bug
DistroRelease: Ubuntu 11.04
Package: linux-image-2.6.38-8-server 2.6.38-8.42
ProcVersionSignature: Ubuntu 2.6.35-28.50-server 2.6.35.11
Uname: Linux 2.6.35-28-server x86_64
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.23.
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: brendan 2746 F.... pulseaudio
CRDA: Error: [Errno 2] No such file or directory
Card0.Amixer.info:
 Card hw:0 'Intel'/'HDA Intel at 0xf9af4000 irq 97'
   Mixer name : 'Realtek ALC888'
   Components : 'HDA:10ec0888,104382cb,00100001'
   Controls : 33
   Simple ctrls : 19
Date: Sat Jun 25 14:21:12 2011
HibernationDevice: RESUME=UUID=abdf656c-e2c8-49fd-8cac-3cea7fac4cc8
IwConfig:
 lo no wireless extensions.

 eth0 no wireless extensions.

 eth1 no wireless extensions.
MachineType: System manufacturer System Product Name
ProcEnviron:
 LANGUAGE=en_AU:en
 LANG=en_AU.UTF-8
 SHELL=/bin/bash
ProcKernelCmdLine: root=UUID=dc5cb730-74c6-4caf-854f-5b5a1ee508af ro clock=acpi_pm
RelatedPackageVersions:
 linux-restricted-modules-2.6.35-28-server N/A
 linux-backports-modules-2.6.35-28-server N/A
 linux-firmware 1.52
RfKill:

SourcePackage: linux
UpgradeStatus: Upgraded to natty on 2011-06-23 (1 days ago)
WpaSupplicantLog:

dmi.bios.date: 07/22/2008
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 0401
dmi.board.asset.tag: To Be Filled By O.E.M.
dmi.board.name: Z7S WS
dmi.board.vendor: ASUSTeK Computer INC.
dmi.board.version: Rev 1.xx
dmi.chassis.asset.tag: Asset-1234567890
dmi.chassis.type: 3
dmi.chassis.vendor: Chassis Manufacture
dmi.chassis.version: Chassis Version
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr0401:bd07/22/2008:svnSystemmanufacturer:pnSystemProductName:pvrSystemVersion:rvnASUSTeKComputerINC.:rnZ7SWS:rvrRev1.xx:cvnChassisManufacture:ct3:cvrChassisVersion:
dmi.product.name: System Product Name
dmi.product.version: System Version
dmi.sys.vendor: System manufacturer

Brendan McLearie (bren-internode) wrote :
Brad Figg (brad-figg)
Changed in linux (Ubuntu):
status: New → Confirmed
Brendan McLearie (bren-internode) wrote :
Brendan McLearie (bren-internode) wrote :

Hi Brad. Thanks for confirming. Where to from here?

Brendan McLearie (bren-internode) wrote :

**** Bump ****

Gabriel Blanchard (gabe-b) wrote :

I'm having the same issue, the common point is the motherboard

I also use an ASUS Z7S WS, and it appears to be kernel related, as I'm able to boot Ubuntu 11.04 with an older kernel.

I've tried numerous distros as well (Gentoo, Fedora, etc.) and all crash in the same manner somewhere after version 2.6.35.

Also, another thread related to this same issue found here

https://answers.launchpad.net/ubuntu/+source/linux/+question/155035

Brendan McLearie (bren-internode) wrote :

Thanks gabe-b. Will post a note on that question referring back here.

Brendan McLearie (bren-internode) wrote :

2.6.38-11 Same problem. Screen shot included.

Brendan McLearie (bren-internode) wrote :

2.6.38-11 panic screen shot.

Brendan McLearie (bren-internode) wrote :

In all cases (2.6.38-8, 2.6.38-10, 2.6.38-11) this is an mce_panic, i.e. an MCE (machine check exception).

tags: added: 2.6.38 mce panic
summary: - Kernel Panic 2.6.38-xx
+ MCE Kernel Panic on kernel 2.6.38
Brett Alton (brett-alton-deactivatedaccount) wrote :

I have the exact same problem and it seems to occur for me with 2.6.38-11 but not 2.6.38-10 as often. It does not occur with 2.6.35.

I'm attaching a screenshot from my low-end Android, so I apologize for the quality. Let me know what other info you need and I will provide it.

I can't seem to pinpoint when it occurs; I can have Firefox open, Thunderbird open or tens of other applications going and I can't seem to pinpoint if a certain application is the source.

Frank Bucher (twoflowers) wrote :

I confirm this bug too, Z7S with 24GB Ram.

<rant>
This issue broke a setup that had been running stably for more than 8 months, and it is extremely disappointing to see that it does not seem worthy of any attention nearly 3 months after the initial report, and something like 4 to 5 months after this bug began to knock out once rock-stable machines. I love to come back to this page every second day to assure myself that the importance of this issue is still undecided.

Writing from a Windows Installation while I take the learning curve to move my work environment to FreeBSD. Guess Why.
</rant>

MTM (mellon-matthew) wrote :

I can confirm the same errors on a Supermicro board, with the only "change" being the extremely regrettable click on the "upgrade" button.
http://www.supermicro.com/products/motherboard/Xeon1333/5100/X7DCL-i.cfm

Two Xeon 5400's, 4 gb ram, disconnected several bits of hardware that worked great before that, found threads leading here and began to sigh and despair.

With so much progress over so many years, it's things like this that have kept me from converting even a single other person to linux. They'd call me, and I'm stumped -- even if there is a successful kernel patch, how could those of us crashing quickly ever even install it?

Many thanks.

--matt

MTM (mellon-matthew) wrote :

It later occurred to me that every such error has usually been resolved through instructions provided by some kind person or other. I'm guessing something involving a live cd and a chroot will be required. (This sort of thing usually has my systems down for a few weeks, as it did take me a while to think of that.)

Seanyoung247 (seanyoung247) wrote :

Just to confirm beyond any doubt that it's the motherboard, the latest kernel release used in Oneiric still has this issue, crashing after about ten minutes (confirmed with a LiveCD). I have now swapped out the motherboard with my backup, an Asus DSGC-DW, and left it running with the same liveCD; the machine has been up over 9 hours now with no issues. All components save the motherboard are unchanged.

Cheers

Sean

Frank Bucher (twoflowers) wrote :

I want to note that I too used another Asus board of a different type as a test reference, and it is unaffected by this bug: in my case an Asus DSEB-DG. Otherwise I'm tired of coming back here in the vain hope of a sign that the kernel developers could ever be bothered to fix what they trashed with whatever they changed in 2.6.36. They probably think we are too few to be worthy of any effort from their side, and socket 771 approaches EOL in the not too distant future ... Soon socket 2011 lands. All they need to do is sit and wait. I am extremely disappointed. Just waiting for FreeBSD 9 to go stable. FreeBSD 9 betas run fine so far on the Asus Z7S WS, as did FreeBSD 8.2. Soon it will be time to say bye bye Penguins ...

Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the release candidate kernel versus the daily build. Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

tags: added: needs-upstream-testing
Frank Bucher (twoflowers) wrote :

In case you were referring to me, I am afraid I can't do that at the moment. I would have to rely on the hardware that I use for my production machine, and building a test setup for a newer Linux kernel would require moving huge amounts of data to make space for a Linux partition. I will not test my luck with partition shrinking and the like, so I need a full backup. I simply do not have time to do this at the moment. I might give Linux a last try in January before my planned transition to FreeBSD 9, but that will of course be something like a further 3 months from now, and I think you don't want to wait as long as that.

Eliah Kagan (degeneracypressure) wrote :

That's OK--perhaps one of the other 3+ people who are affected by this bug can perform the testing instead. Any takers?

Brendan McLearie (bren-internode) wrote :

People, we may have a solution. It could be the infrared driver. A workaround has been proposed.

This is an associated bug. It's a long post, but look right at the bottom, where Daniel proposes a solution that may work. Others have reported success.

I haven't had a chance to try it myself yet, but will do so by the weekend. In the meantime, if you could report your results here and also in the other bug, I'm sure everyone will appreciate it.

the other bug report is at https://bugs.launchpad.net/ubuntu/oneiric/+source/linux/+bug/784484

Cheers
Brendan

Frank Bucher (twoflowers) wrote :

The bug still exists as far as I can tell:

- Installed Kubuntu 10.10
- added blacklist ene_ir to /etc/modprobe.d/blacklist.conf
- updated to Kubuntu 11.04
- a few minutes after login the kernel panic occurred again, so the workaround seems to have no effect on my hardware
- booted from partedmagic 5.9 and checked blacklist.conf in case the upgrade process might have erased the blacklisting of ene_ir - but it still was at the end of the file, untouched
- booted a second time into Kubuntu 11.04 and again, after a few minutes, the kernel panic occurred
- booted a third time in Kubuntu 11.04 but this time with the old kernel option in the boot menu and the system runs stable
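A quick way to double-check the blacklisting step is to search the modprobe configuration for the entry and then confirm the module is absent from `lsmod`. The sketch below is an illustration only: `is_blacklisted` is a hypothetical helper, and the file paths in the usage comment are the ones mentioned in this thread.

```shell
#!/bin/bash
# is_blacklisted MODULE FILE...
# Succeed if any FILE contains an active "blacklist MODULE" line.
# (Hypothetical helper name, for illustration only.)
is_blacklisted() {
    local module
    module=$1
    shift
    # Match "blacklist <module>" with optional surrounding whitespace and
    # an optional trailing comment; commented-out lines do not match.
    grep -Eqs "^[[:space:]]*blacklist[[:space:]]+${module}[[:space:]]*(#.*)?$" "$@"
}

# Typical use on the affected machine:
#   is_blacklisted ene_ir /etc/modprobe.d/*.conf && echo "entry present"
#   lsmod | grep ene_ir      # no output means the module is not loaded
# A "blacklist" line only stops automatic, alias-based loading. If the module
# is loaded from the initramfs, regenerating it ("sudo update-initramfs -u")
# after editing the file may also be needed; that is an assumption to verify
# on the system in question.
```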

Tomorrow I will have a look at what happens if I update to oneiric and post the outcome of this attempt.

Frank Bucher (twoflowers) wrote :

Using the old kernel I have now updated to oneiric, but the kernel panics still occur with the 3.00.14 (or was it 3.00.12?) kernel that came with this update. On top of that, networking no longer seems to work when you boot oneiric with an old kernel; it no longer detects the connected network cable. So there is no convenient way to apply further updates when I use oneiric with an old pre-2.6.38 kernel. The entry in blacklist.conf suggested as a workaround is still there, but does not help at all on my machine. I doubt that the ene_ir module is connected to the problems Asus Z7S WS users face, since it helped in neither natty nor oneiric. Of course it might be helpful for other hardware, but I doubt that it is at the heart of the problem, since blacklisting it makes no difference on my hardware.

That's it as far as I am concerned. Bye bye penguins, welcome beastie!

Brendan McLearie (bren-internode) wrote :

@ Frank, really sorry to hear that beastie is on the scene :)

I also tried the black list recently and it failed to solve the problem. However I've had very little time recently to do anything.

I recall from ages ago, when trying to blacklist other drivers, that for some reason it's sometimes not sufficient to just add them to the blacklist. I can't recall the details anymore.

My workaround was to move the driver files away (e.g. to /root). I then got errors at boot complaining, but at least I knew for sure it wasn't loading.

If you haven't already reformatted, I was wondering if you could try that as a last-ditch attempt.

I'd be tempted to move all IR drivers out that can be found. Perhaps the Z7S loads a different one albeit for the same purpose and including the same issues.

You should be able to find the drivers somewhere under /lib
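To locate the driver files before moving them, something along these lines may help. It is a sketch under stated assumptions: `find_module_files` is a hypothetical helper, and it assumes modules live under /lib/modules/<kernel version>/ with a .ko suffix.

```shell
#!/bin/bash
# find_module_files FRAGMENT [ROOT]
# List kernel module files whose name contains FRAGMENT.
# ROOT defaults to the running kernel's module tree.
# (Hypothetical helper name, for illustration only.)
find_module_files() {
    local fragment root
    fragment=$1
    root=${2:-/lib/modules/$(uname -r)}
    find "$root" -type f -name "*${fragment}*.ko*" | sort
}

# Typical use:
#   find_module_files ene_ir
#   modinfo -n ene_ir        # prints the path modprobe would load
# After moving a file out of the tree, "sudo depmod -a" refreshes the
# module dependency index.
```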

Also, I'll be heading in the same direction as you over Christmas if this doesn't work.

Brendan McLearie (bren-internode) wrote :

Update: don't bother. I tried and failed too. I've posted my results back to the Daniel / Jakub thread.

Brendan McLearie (bren-internode) wrote :

@Daniel Manrique and anyone else interested.

Bug update:

I've tried booting from USB with 11.10. Idea was to log a new bug report. However, the system won't run long enough to be able to do much of anything.

2.6.35-28-server: Stable
2.6.38-8,10,11-server,generic: All Crash.
2.6.38-02063808-generic: Crash

It's been a while since I tried that mainline kernel, so I'll need reminding where to source them and how to install them.

Let me know if you need any other info, and many many thanks for your help.

Cheers
Brendan

penalvch (penalvch) wrote :

Brendan McLearie, thank you for reporting this bug and helping make Ubuntu better. If you could also please test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
tags: added: testing
removed: needs-upstream-testing
Brendan McLearie (bren-internode) wrote :

OK have tested with mainline kernel 3.3.0-999-generic today.

Looks like the same problem. I didn't see the first panic (but I know it happened because the machine restarted).

In the next test I booted into single user mode without any kernel options and it locked up without even reporting a panic, however it did first print on the console the usual prelude message "Disabling lock debugging due to kernel taint".

In the final test I booted with clock_source=acpi_pm into GNOME and it also died, again with the same taint message. This time it managed to reboot, though I didn't get a panic message on the console for some reason.

Thanks for your help.

Where to next?

Daniel Manrique (roadmr) wrote :

OK, we will try to narrow down the mainline kernel release on which this started occurring, or if it even happens on mainline kernels; if we can't reproduce the problem with the mainline kernels, then it's probably something related to an Ubuntu kernel patch.

If possible please test the following kernels, which you can find here:

http://kernel.ubuntu.com/~kernel-ppa/mainline/

First, test this to see if it works (it should):
http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.35-maverick/

Then, test this to see if it works (it shouldn't!):
http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.38-natty/

Finally, but *only* if these two kernels behave as predicted, test this (this is the chronological middle point, so we will be doing a sort of "binary" search or "bisection" of mainline kernels):

http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.36-maverick/

Depending on whether the "midpoint" kernel works or fails, you'd then test the kernels on the "good" side or the "bad" side. You can do this yourself, or you can report your results back and I can tell you which one to test. If you want to do it on your own, just remember that -rc kernels go *before* the actual final releases:

v2.6.36-rc6-maverick
v2.6.36-maverick
v2.6.36.1-natty (whatever)

For this testing, please don't give the kernel any additional parameters; we want to test out-of-the-box behavior.
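The "binary search" over released kernels described above can be sketched as follows. This is only an illustration: `bisect_midpoint` is a hypothetical helper that, given an ordered list of kernel names on standard input plus the last known good and first known bad entries, prints the entry halfway between them.

```shell
#!/bin/bash
# bisect_midpoint GOOD BAD < ordered-list
# Print the list entry halfway between GOOD and BAD.
# Fails (exit 1) when either name is missing or nothing lies between them.
# (Hypothetical helper name, for illustration only.)
bisect_midpoint() {
    awk -v good="$1" -v bad="$2" '
        { name[NR] = $0 }
        $0 == good { g = NR }
        $0 == bad  { b = NR }
        END {
            if (!g || !b || b - g < 2) exit 1
            print name[int((g + b) / 2)]
        }'
}

# Example with a shortened list:
#   printf "%s\n" v2.6.35 v2.6.36-rc1 v2.6.36 v2.6.37 v2.6.38 |
#       bisect_midpoint v2.6.35 v2.6.38
# After testing the printed kernel, it becomes the new GOOD (if stable)
# or the new BAD (if it panics), and the search repeats.
```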

Thanks and please let me know the results from the first three kernels!

Brendan McLearie (bren-internode) wrote :

Thanks Daniel and Christopher.

Underway. I expect the first one to take a while since I expect it to be stable. I'll come back to it in a few hours.

I understand the approach and will continue bisecting. Is there any value to keeping to the release versions rather than the release candidates?

I'm tracking as follows.

*****
v2.6.35-maverick/ 02-Aug-2010 11:23 - Bisect 1: Testing Now Expect: Pass Actual:
 v2.6.35-rc1-lucid/ 01-Jun-2010 19:47 -
 v2.6.35-rc2-maverick/ 10-Jun-2010 11:37 -
 v2.6.35-rc3-maverick/ 12-Jun-2010 11:38 -
 v2.6.35-rc4-maverick/ 05-Jul-2010 11:23 -
 v2.6.35-rc5-maverick/ 13-Jul-2010 12:24 -
 v2.6.35-rc6-maverick/ 23-Jul-2010 12:34 -
 v2.6.35.1-maverick/ 16-Aug-2010 11:29 -
 v2.6.35.2-maverick/ 16-Aug-2010 13:46 -
 v2.6.35.3-maverick/ 21-Aug-2010 15:35 -
 v2.6.35.4-maverick/ 27-Aug-2010 21:30 -
 v2.6.35.5-maverick/ 21-Sep-2010 13:27 -
 v2.6.35.6-maverick/ 27-Sep-2010 13:17 -
 v2.6.35.7-maverick/ 29-Sep-2010 11:26 -
 v2.6.35.8-maverick/ 29-Oct-2010 13:28 -
 v2.6.35.9-maverick/ 23-Nov-2010 13:29 -
 v2.6.35.10-maverick/ 24-Nov-2011 01:49 -
 v2.6.35.11-maverick/ 24-Nov-2011 02:04 -
 v2.6.35.12-maverick/ 24-Nov-2011 02:18 -
 v2.6.35.13-maverick/ 24-Nov-2011 02:32 -
 v2.6.35.13-original-maverick/ 26-Jul-2011 13:17 -
 v2.6.35.14-maverick/ 24-Nov-2011 02:42 -
 v2.6.36-maverick/ 21-Oct-2010 11:26 - Bisect 1: Test Final Expect: Unknown Actual:
 v2.6.36-rc1-maverick/ 17-Aug-2010 15:23 -
 v2.6.36-rc2-maverick/ 23-Aug-2010 11:26 -
 v2.6.36-rc3-maverick/ 30-Aug-2010 11:19 -
 v2.6.36-rc4-maverick/ 13-Sep-2010 11:19 -
 v2.6.36-rc5-maverick/ 21-Sep-2010 15:48 -
 v2.6.36-rc6-maverick/ 29-Sep-2010 13:47 -
 v2.6.36-rc7-maverick/ 07-Oct-2010 11:28 -
 v2.6.36-rc8-maverick/ 15-Oct-2010 11:18 -
 v2.6.36.1-natty/ 23-Nov-2010 15:56 -
 v2.6.36.2-natty/ 10-Dec-2010 13:42 -
 v2.6.36.3-natty/ 08-Jan-2011 14:29 -
 v2.6.36.4-natty/ 18-Feb-2011 11:34 -
 v2.6.37-natty/ 05-Jan-2011 11:23 -
 v2.6.37-rc1-maverick/ 02-Nov-2010 11:21 -
 v2.6.37-rc2-maverick/ 16-Nov-2010 11:23 -
 v2.6.37-rc3-natty/ 22-Nov-2010 11:31 -
 v2.6.37-rc4-natty/ 30-Nov-2010 10:54 -
 v2.6.37-rc5-natty/ 07-Dec-2010 11:31 -
 v2.6.37-rc6-natty/ 16-Dec-2010 11:32 -
 v2.6.37-rc7-natty/ 22-Dec-2010 16:05 -
 v2.6.37-rc8-natty/ 29-Dec-2010 11:21 -
 v2.6.37.1-natty/ 18-Feb-2011 13:54 -
 v2.6.37.2-natty/ 25-Feb-2011 11:58 -
 v2.6.37.3-natty/ 08-Mar-2011 11:51 -
 v2.6.37.4-natty/ 15-Mar-2011 13:02 -
 v2.6.37.5-natty/ 24-Mar-2011 12:43 -
 v2.6.37.6-natty/ 28-Mar-2011 12:46 -
 v2.6.38-natty/ 15-Mar-2011 16:03 - Bisect 1: Test Next Expect: Fail Actual:
********

Cheers
Brendan

Daniel Manrique (roadmr) wrote :

Hi,

I'd advise to treat rc kernels as you would any other, since the change may have been introduced in one of the rc (release candidate) versions.

Also, please note that the rc kernels go *before* any others in that series; that is, 2.6.37-natty comes after 2.6.37-rc8-natty.

- RC kernels are "release candidates".
- The "release" kernel for a series is e.g. 2.6.37
- Maintenance, "point" releases come after the "release", e.g. 2.6.37.1, 2.6.37.2, and so on.

This is important since, in order for the bisection to be "sane", we need to maintain strict chronological ordering of the kernels.

When in doubt, look at the dates, they need to be consecutive for each kernel version number.
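The three ordering rules above (release candidates first, then the release, then point releases) can be applied mechanically. The following is a rough sketch: `order_kernels` is a hypothetical helper that sorts mainline directory names such as `v2.6.36-rc6-maverick` by version. It orders by version number only, so the dates in the directory listing remain the final check.

```shell
#!/bin/bash
# order_kernels < names
# Sort names like v2.6.36-rc6-maverick so that, within a series,
# -rc kernels come first, then the release, then point releases.
# (Hypothetical helper name, for illustration only.)
order_kernels() {
    awk '{
        tag = $0
        sub(/^v/, "", tag)
        n = split(tag, part, "-")
        stage = 1; rc = 0                  # stage 0 = release candidate
        for (i = 2; i <= n; i++)
            if (part[i] ~ /^rc[0-9]+$/) { stage = 0; rc = substr(part[i], 3) + 0 }
        m = split(part[1], num, ".")
        point = (m >= 4) ? num[4] + 0 : 0  # fourth number = point release
        print num[1] + 0, num[2] + 0, num[3] + 0, stage, rc, point, $0
    }' |
    sort -k1,1n -k2,2n -k3,3n -k4,4n -k5,5n -k6,6n |
    cut -d' ' -f7
}

# Example:
#   printf "%s\n" v2.6.36.1-natty v2.6.36-maverick v2.6.36-rc6-maverick |
#       order_kernels
```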

Thanks!

Brendan McLearie (bren-internode) wrote :

OK, have been madly testing mainline kernels. My log below.

Questions:
1. Is the order I have sequenced correct?
2. v2.6.35-12,12,13 was confusing. Not sure of the history, but the ".13-original-maverick" has a date much earlier than the ".13-maverick", yet the latter, although in a .13 folder, actually has a filename containing 02063512 to confuse matters... have I got this correct?
3. There are clearly concurrent branches with similar kernel versions. I've confined my testing to the maverick and natty versions, which appeared in the folder listing between your two bracketing suggestions.
4. My conclusion: The problem is with something changed or new between 2.6.35 and 2.6.36. Interestingly, the bisect is indicating that all 2.6.35 work, and all 2.6.36 fail. So it must be something fundamental to that version change.

Where to from here?

I'm happy to go down to code level with you, if you have the patience. The last time I compiled a kernel was in the mid 90s using FreeBSD! So I'll need some hand-holding once beyond basic CLI work, though I wouldn't mind learning more, and thus becoming more able to contribute to the Ubuntu community.

Many thanks
Brendan

Kernel, Date, Time, Seq., Seq., Bisect, Status, K OPTS, Expect, Actual, Notes
v2.6.35-maverick, 2-Aug-10, 11:23, , , 1.1, Done, , Pass, Pass, Uptime 7hrs:25
v2.6.35.1-maverick, 16-Aug-10, 11:29, , , , , , ,  ,
v2.6.35.2-maverick, 16-Aug-10, 13:46, , , , , , ,  ,
v2.6.35.3-maverick, 21-Aug-10, 15:35, , , , , , ,  ,
v2.6.35.4-maverick, 27-Aug-10, 21:30, , , , , , ,  ,
v2.6.35.5-maverick, 21-Sep-10, 13:27, , , , , , ,  ,
v2.6.35.6-maverick, 27-Sep-10, 13:17, , , , , , ,  ,
v2.6.35.7-maverick, 29-Sep-10, 11:26, , , , , , ,  ,
v2.6.35.8-maverick, 29-Oct-10, 13:28, , , , , , ,  ,
v2.6.35.9-maverick, 23-Nov-10, 13:29, 02 06 35 09, 2010 11 23 1112, 2, Done, Check, Unknown, Pass, Uptime 3hrs:14
, , , , , , , , , ,
v2.6.35.13-original-maverick, 26-Jul-11, 13:17, 02 06 35 13, 2011 07 26 1012, 5, Done, Check, Pass, Pass,
, , , , , , , , , ,
v2.6.35.10-maverick, 24-Nov-11, 1:49, 02 06 35 10, 2011 11 23 2035, , , , , ,
v2.6.35.11-maverick, 24-Nov-11, 2:04, , , , , , , ,
v2.6.35.12-maverick, 24-Nov-11, 2:18, 02 06 35 12, 2011 11 23 2104, , , , , ,
v2.6.35.13-maverick, 24-Nov-11, 2:32, 02 06 35 12, 2011 11 23 2118, 3, Done, Check, Unknown, Pass, Uptime 38mins
v2.6.35.14-maverick, 24-Nov-11, 2:42, , , -1, NA - Headers Only. Try .9 to rule out up to point of ".13-original" confusion.
, , , , , , , , , ,
v2.6.35-28-server-ubuntu, , , , , 0, NA, NA, NA, NA, Ubuntu Version Order Assumed. ".13-original-maverick" confusing check .14 mainline.
, , , , , , , , , ,
v2.6.36-rc1-maverick, 17-Aug-10, 15:23, 02 06 36 rc1, 2010 08 17 1306, 4, Done, Check, Unknown, Fail, Photo Available
v2.6.36-rc2-maverick, 23-Aug-10, 11:26, , , , , , ,  ,
v2.6.36-rc3-maverick, 30-Aug-10, 11:19, , , , , , ,  ,
v2.6.36-rc4-maverick, 13-Sep-10, 11:19, , , , , , ,  ,
v2.6.36-rc5-maverick, 21-Sep-10, 15:48, , , , , , ,  ,
v2.6.36-rc6-maverick, 29-Sep-10, 13:47, , , , , , ,  ,
v2.6.36-rc7-maverick, 7-Oct-10, 11:28, , , , , , ,  ,
v2.6.36-rc8-maverick, 15-Oct-10, 11:18, , , , , , ,  ,
v2.6.36-maverick, 21-Oct-10, 11:26, , , 1.3, Done, Check, Unknown, Fail, Photo...


Daniel Manrique (roadmr) wrote :

Awesome! So per your results, the problem was introduced with the 2.6.36 series, in 2.6.36-rc1, which chronologically is the first one that fails.

Since 2.6.35 works fine (all the 2.6.35.x are maintenance releases for that series, done *after* work started on the 2.6.36 series) I think the problem must lie in one of the changes introduced in 2.6.36-rc1.

So the problem lies in one of the code changes done between those two releases.

Without any more knowledge about which file has the problem, we'd have to look at the changes one by one (and there are about 8000 of them). Looks like a bisection process would be the way to go.

I can point you to this document on how to do it:
https://wiki.ubuntu.com/Kernel/KernelBisection

One problem with that doc is that it assumes you're using the Ubuntu kernel source tree, whereas we'd prefer to use the mainline tree. Or you could use the Ubuntu tree (the one for Natty, for instance) and set v2.6.35 as your good point and v2.6.36-rc1 as the bad point. Compile and test each produced kernel and tell the bisect process whether they're good or bad to get the next kernel.

You'll probably have to test about 15 kernels for this, so it'll be a rather lengthy process :(
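As a rough check on that estimate: each bisection step halves the remaining range, so the number of kernels to build grows with the base-2 logarithm of the number of changes. The sketch below (with `bisect_steps` as a hypothetical helper name) counts the halvings for a given number of commits.

```shell
#!/bin/bash
# bisect_steps COMMITS
# Print how many halvings are needed to narrow COMMITS candidates
# down to one, i.e. the ceiling of log2(COMMITS).
# (Hypothetical helper name, for illustration only.)
bisect_steps() {
    local commits steps span
    commits=$1
    steps=0
    span=1
    while [ "$span" -lt "$commits" ]; do
        span=$((span * 2))
        steps=$((steps + 1))
    done
    printf '%s\n' "$steps"
}

# Example: bisect_steps 8000
# The bisection itself is driven with (tags as used in this thread):
#   git bisect start
#   git bisect good v2.6.35
#   git bisect bad v2.6.36-rc1
```

For the roughly 8000 changes mentioned above this comes out in the same ballpark as the "about 15" estimate; git prints its own step estimate after each `git bisect good` or `git bisect bad`.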

If you want to try this yourself (and I think you have the skills to do it) it's OK, but if you have any questions or would prefer me to compile the kernels and just upload them for you to test, please let me know; we're here to help.

Thanks!

Brendan McLearie (bren-internode) wrote :

Thanks Daniel. Sounds like fun.

I've had a quick read of the wiki. Before I get underway, a few questions:

1. It sounds like the ubuntu versions are going to be easier to work with because there already looks to be a fair bit of guiding material. Is there any problem with using the ubuntu versions rather than mainline? Would it be worthwhile / good practice to confirm the results of the kernel version bisect using the corresponding ubuntu versions before getting underway?

2. Tool dependencies eg gcc, libraries etc - is it worth installing a clean copy of ubuntu in a virtual with specific tools? Particular versions?

3. What version of Ubuntu should I use as a platform to actually do the bisect work and compile etc?

4. I have some minimal / rusty experience with eclipse as an IDE. Would it be worthwhile trying for some sort of git plugin?

5. Once compiled is there an easy way to package the result as a .deb for installation on the test machine? Is this involved? I have never looked at deb packaging.

Probably heaps of other questions to come as I get underway......

Daniel Manrique (roadmr) wrote : Re: [Bug 801840] Re: MCE Kernel Panic on kernel 2.6.38

On 12-03-28 09:16 AM, Brendan McLearie wrote:
> Thanks Daniel. Sounds like fun.
>
> I've had a quick read of the wiki. Before I get underway, a few
> questions:
>

It's OK, I'm glad to be able to help with any questions.

> 1. It sounds like the ubuntu versions are going to be easier to work
> with because there already looks to be a fair bit of guiding material.
> Is there any problem with using the ubuntu versions rather than
> mainline? Would it be worthwhile / good practice to confirm the results
> of the kernel version bisect using the corresponding ubuntu versions
> before getting underway?

I'd say it's OK to use the Ubuntu kernel, since it already contains the
packaging setup needed to produce .debs to install. But using mainline is also
straightforward enough, see here for how to build from a mainline kernel (sorry
for dumping more documentation on you!)

http://newbiedoc.sourceforge.net/system/kernel-pkg.html

I'll also share my procedure with you, at the end of this comment.

>
> 2. Tool dependencies eg gcc, libraries etc - is it worth installing a
> clean copy of ubuntu in a virtual with specific tools? Particular
> versions?

I don't think it's needed, the dependencies are relatively simple, basically I
just install build-essential and kernel-package and then, if the build process
complains of missing tools, I apt-cache search for them and install any extra
needed stuff.
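A small pre-flight check along the lines described above might look like this. It is a sketch only: `missing_tools` is a hypothetical helper, and the tool list in the usage comment is an assumption about what a kernel-package build typically needs, not a list taken from this thread.

```shell
#!/bin/bash
# missing_tools TOOL...
# Print each TOOL that cannot be found on PATH.
# (Hypothetical helper name, for illustration only.)
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Typical use before starting a build:
#   missing_tools gcc make git fakeroot make-kpkg
# Anything printed can then be installed, for example with:
#   sudo apt-get install build-essential kernel-package
```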

>
> 3. What version of Ubuntu should I use as a platform to actually do the bisect work
> and compile etc?

I'd suggest either Maverick or Natty, but really the compiled kernels should be
version-independent; for instance, when I compiled Jakub's kernels I was running
(I think) either Natty or Precise (and that was kernel 2.6.35 from Maverick). So
you can potentially create the kernels on your current box and then copy them to
the server to be tested.

>
> 4. I have some minimal / rusty experience with eclipse as an IDE. Would
> it be worthwhile trying for some sort of git plugin?

Since you won't really be touching the kernel source code proper, it's probably
not worth it. If anything, you can look for existing GUI tools for git like gitk
or gitg and use that, but it's really not needed I think.

>
> 5. Once compiled is there an easy way to package the result as a .deb
> for installation on the test machine? Is this involved? I have never
> looked at deb packaging.

Yes! The Ubuntu kernel contains everything needed to produce a .deb (look in the
ubuntu/ directory).

Then you can follow instructions here (more documentation, yay):
https://help.ubuntu.com/community/Kernel/Compile

>
> Probably heaps of other questions to come as I get underway......
>

What I usually do is start with a mainline kernel checkout (from linux
upstream), then do each bisect step and run this script to prepare and compile
the kernel, updating only the revision number so it's unique for each step of
the bisection; this assumes kernel-package is installed and:

- create $HOME/linux-source
- In there, you need a directory "linux" with the mainline kernel git tree from
kernel.org.
- Also, you need a directory ubuntu-natty with the git tree from ubuntu for that
release.

#!/bin/bash
#Where th...

Read more...

Brendan McLearie (bren-internode) wrote :

OK..... have been buried in this and am starting to get my head around it, but have now hit a problem (apart from my wife starting to call me a git!)

I'll try to reassemble my path here in case I've done something stupid in the earlier steps.

1. new virtual machine version 11.10 upgraded to 3.0.0-17 x86_64
2. grabbed every package via apt-get that seemed relevant
3. basic structure as per your script. git clone from kernel.org under linux. git clone of ubuntu-natty
4. modified your script to grab from ubuntu-natty rather than oneiric
5. tried compiling with your script prior to bisect and all worked, producing me .deb packages for 3.0.0-17
6. reconfigured virtual and threw 8 cores at it (forgot how long it takes to compile a kernel).
7. changed to $HOME/linux-source/linux
8. git checkout mcebisect
9. git bisect start
10. git bisect good v2.6.35
11. git bisect bad v2.6.36-rc1
- complained about modified files not committed (presumably from the previous build)
12. git stash
-bisect completed with over 4000 / 12 to go.
13. up one level and ran your modified script.
-Compile failed.
14. back into ./linux
15. git bisect bad ##toss a coin - well almost. I saw lots of early commits in the log for acpi and temperature, which I've always suspected re this bug. Figured that a bad guess would bisect next with the early part included, thereby keeping it in the mix. It is also easier to see a test fail, as the panic happens more quickly than waiting to be sure that it hasn't.
- got same error regarding commits.
16. git stash
-bisect proceeded and worked.
17. Updated script with revision and comment and re-ran.
-Compile Fail again.
18. git clean, git reset, git checkout, git bisect start, git bisect good v2.6.35, git bisect bad v2.6.36-rc1, recompiled again (just to make sure it wasn't something stupid I did in the setup).
- Compile failed as expected.
19. This time git bisect good # guess the other way
-Compile failed again.
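One note on the guessing in steps 15 and 19: git has a third verdict, `git bisect skip`, meant for revisions that cannot be tested at all (for example because they do not compile), so a coin toss is not required. The sketch below uses a hypothetical helper, `bisect_verdict`, to map a build and boot outcome to the matching subcommand.

```shell
#!/bin/bash
# bisect_verdict BUILD [RUN]
# BUILD is "built" or "broken"; RUN is "stable" or "panic".
# Prints the git bisect subcommand to use: skip, good or bad.
# (Hypothetical helper name, for illustration only.)
bisect_verdict() {
    case "$1:$2" in
        broken:*)     echo skip ;;   # cannot be tested, let git pick another commit
        built:stable) echo good ;;
        built:panic)  echo bad ;;
        *)
            echo "usage: bisect_verdict built|broken [stable|panic]" >&2
            return 2
            ;;
    esac
}

# Typical use inside the kernel tree:
#   git bisect "$(bisect_verdict broken)"          # runs: git bisect skip
#   git bisect "$(bisect_verdict built panic)"     # runs: git bisect bad
```

Skipping keeps the bisection honest: a commit marked good or bad by guess can send the search down the wrong half of the history.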

Where to from here?

A couple of other aside questions:
1. Am I right that the ubuntu git version is only there to grab the control scripts?
2. Are there any issues / version problems grabbing the ubuntu-package on my 3.0.0-7 system?

Thanks for your continued help.

Your modified script:
#!/bin/bash
#Where the linux directory resides
BASE=$HOME/linux-source
#REVISION=1step1: START OK 4072, Compile broke.
#REVISION=1step2: GUESSED BAD, Compile Broke. Dead End.
#REVISION=2step1: START OK 4072, Compile Broke.
#REVISION=2step2: GUESSED GOOD, Compile Broke
REVISION=2step2
BUG=lp801840

cd $BASE/linux
git clean -f -d -x
git clean -f -d -X
cp /boot/config-`uname -r` .config
yes '' | make oldconfig
sed -i 's/CONFIG_DEBUG_INFO=y/CONFIG_DEBUG_INFO=n/' .config
sed -i 's/CONFIG_SYS_HYPERVISOR=y/CONFIG_SYS_HYPERVISOR=n/' .config
sed -i 's/CONFIG_XEN_SYS_HYPERVISOR=y/CONFIG_XEN_SYS_HYPERVISOR=n/' .config
sed -i 's/CONFIG_XEN=y/CONFIG_XEN=n/' .config

# -r and -i are passed separately: "sed -rie" treats the trailing "e" as a backup suffix for -i
sed -ri -e 's/echo "\+"/#echo "\+"/' scripts/setlocalversion
cd $BASE
cp -a /usr/share/kernel-package ubuntu-package
#The following should be one line each cp command
cp ubuntu-natty/debian/control-scripts/{postinst,postrm,preinst,prerm} ubuntu-package/pkg/image/
cp ubuntu-natty/debian/control-scripts/headers-postinst...

Daniel Manrique (roadmr) wrote :

Hi,

Um, I never saw any compile errors when building my kernels :(

There's one extra step I do to clean up temporary files. Since the build works for you
*sometimes* (so the basic process seems to be working), this may help.

BASE=/wherever/you/keep/linux-source
cd $BASE/linux
git clean -f -d -x
git clean -f -d -X #notice uppercase X

git checkout scripts/setlocalversion

Maybe this will help?
I also don't do the git stash step.

One thing you could do to validate is note the revision picked when you do the
git bisect (it always tells you which revision it picked by giving you the SHA
checksum). Then revert changes and clean (don't trust me on this, my git is a
bit rusty):

git checkout . #to revert all files to unmodified status
git clean -f -d -x
git clean -f -d -X #remove all extraneous files
git checkout <revision_SHA> #to check out the revision given by git bisect

then try to compile from that; since you're starting from a completely clean git
tree and a checkout of a particular revision, it should work (I assume commits
that break building don't make it into the kernel!). If this works, then the
cleaning step I outline above should be done after ea...

Brendan McLearie (bren-internode) wrote :

Thanks Daniel, and thanks for your email reply too. Glad to see that you ended up in the second best country in the world :)

I've tried your suggestion. Still failing to compile.

I think the git issue was simply from local files I had hanging around after doing things out of order, so git stash simply put the local changes aside from the current checkout. The local files are just the copy from the ubuntu tree, I suspect.

Before raising the white flag, I'm going to try it on a version of natty (installing now). The compile failure doesn't give me much to go on - it could be the ubuntu package from 3.0.0 that's the problem. Will let you know.

Daniel Manrique (roadmr) wrote :

Hey!

So again, to confirm:

- Which "linux" kernel tree are you using, the one from kernel.org?
- Are you taking the ubuntu-natty files from a copy of the ubuntu-natty git tree?
- Are you taking the ubuntu-package from /usr/share/kernel-package? if so, could
you let me know which version of kernel-package you have installed? (apt-cache
policy kernel-package).
- Are you using good=v2.6.35 and bad=v2.6.36-rc1?

I'm asking all this because I'll try running the process on my end and see what
happens. If I get the same errors, maybe I can find a way to solve them.

Cheers!

- Daniel

Brendan McLearie (bren-internode) wrote :

Woo hoo..... progress being made.

Compile worked fine on the natty install.

Working version on natty:
linux-stable: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
kernel-package: 12.036+nmu1

Broken version on 3.0.0
linux: git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel-package: 12.036+nmu1

Both have been using good=v2.6.35 and bad=v2.6.36-rc1

So perhaps some dependencies in the compile environment. Or it could be a problem with the config grabbed from the 3.0.0 version.
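
To check the config theory, the two configs can be compared directly. The sketch below is only an illustration: the helper name `config_diff` is made up, and it assumes both config files have been copied to the same machine.

```shell
#!/bin/bash
# Hypothetical helper: show CONFIG_ options that differ between two
# kernel config files. Lines such as "# CONFIG_FOO is not set" are
# ignored, so a disabled option shows up as missing on that side.
config_diff() {
    local a_sorted b_sorted status
    a_sorted=$(mktemp) || return 2
    b_sorted=$(mktemp) || return 2
    grep '^CONFIG_' "$1" | sort > "$a_sorted"
    grep '^CONFIG_' "$2" | sort > "$b_sorted"
    # diff exits 0 when the files are identical, 1 when they differ.
    diff "$a_sorted" "$b_sorted"
    status=$?
    rm -f "$a_sorted" "$b_sorted"
    return "$status"
}
```

For example, `config_diff "/boot/config-$(uname -r)" natty.config`, where `natty.config` stands for a copy of the config from the natty install. Lines starting with `<` exist only in the first file and lines starting with `>` only in the second.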

First kernel under test now. Will let you know how it goes.

Brendan McLearie (bren-internode) wrote :

Dead end again grrr. Starting to second guess myself that it must be something stupid I am doing with git.

@Daniel Would be great if you could give it a try.

The first bisect is good.

I can't compile the second.

It gets up to here:

 CC [M] drivers/net/wireless/rt2x00/rt2500pci.o
  CC [M] drivers/net/wireless/ath/ath9k/eeprom_4k.o
  CC [M] drivers/net/wireless/ath/ath9k/eeprom_9287.o
  CC [M] drivers/net/wireless/rt2x00/rt61pci.o
  CC [M] drivers/net/wireless/wl12xx/wl1251_spi.o
  LD [M] drivers/net/wireless/zd1211rw/zd1211rw.o
  CC [M] drivers/net/wireless/rt2x00/rt2800pci.o
  CC [M] drivers/net/wireless/rt2x00/rt2500usb.o
  CC [M] drivers/net/wireless/rt2x00/rt73usb.o
  CC [M] drivers/net/wireless/ath/ath9k/ani.o
  CC [M] drivers/net/wireless/rt2x00/rt2800usb.o
  CC [M] drivers/net/wireless/ath/ath9k/btcoex.o
  CC [M] drivers/net/wireless/wl12xx/wl1251_sdio.o
  LD [M] drivers/net/wireless/iwlwifi/iwlcore.o
  LD [M] drivers/net/wireless/iwlwifi/iwlagn.o
  LD [M] drivers/net/wireless/iwlwifi/iwl3945.o
  CC [M] drivers/net/wireless/ath/ath9k/mac.o
  LD [M] drivers/net/wireless/wl12xx/wl1251.o
  CC [M] drivers/net/wireless/ath/ath9k/ar9002_mac.o
  LD [M] drivers/net/wireless/rt2x00/rt2x00lib.o
  CC [M] drivers/net/wireless/ath/ath9k/ar9003_mac.o
  CC [M] drivers/net/wireless/ath/ath9k/ar9003_eeprom.o
  CC [M] drivers/net/wireless/ath/ath9k/ar9003_paprd.o
  LD [M] drivers/net/wireless/ath/ath9k/ath9k.o
  LD [M] drivers/net/wireless/ath/ath9k/ath9k_hw.o
  LD [M] drivers/net/wireless/ath/ath9k/ath9k_common.o
  LD [M] drivers/net/wireless/ath/ath9k/ath9k_htc.o
make[1]: *** [drivers] Error 2
make[1]: Leaving directory `/home/setup/linux-src/linux-stable'
make: *** [debian/stamp/build/kernel] Error 2
setup@ubwud-9:~/linux-src$
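
The tail of the log above only shows make's summary ("Error 2"). With several jobs building in parallel, the message from the step that actually failed is usually further up, mixed in with output from the other jobs. A small sketch for finding it, assuming the build output was saved to a file first (for instance by piping the build through `tee build.log`); the helper name `first_build_errors` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the first few lines of a saved build log
# that look like a real failure, prefixed with their line numbers.
#   error:        -> a compiler diagnostic
#   *** [target]  -> make reporting which target failed
first_build_errors() {
    local log=$1
    grep -n -E -m 5 'error:|\*\*\* \[' "$log"
}
```

The earliest match is normally the one worth reading; the later `*** [...] Error` lines are each parent make reporting the same failure on the way out.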

Tomasz Kusmierz (wally-tm) wrote :

Boys / Girls

1.
To clear things up: AFAIK, when version x.y.z of the kernel is released, it is taken as the base for the next release, x.y.(z+1) -dev or -rc (whatever). So if version 2.6.35 was stable, there had to be some pre-release of 2.6.36 that was stable too. Here is the catch: when pushing for a new kernel version all tricks are allowed (completely changing queuing style, scheduler sequencing, anything), so at this point there will be a vast number of changes to the general design of the kernel! Releases 2.6.35 and 2.6.36 might be like black and white as far as I'm concerned. So IMHO it's better to narrow down where in the pre-releases of 2.6.36 the f***up was introduced.

2.
Bisect might be good, but in my experience commits to svn / cvs / git are made on a "best practice" basis, i.e. there is no guarantee that a given commit will work or even compile. So having problems compiling certain revisions of the repository tree does not surprise me.

3.
Since it crashes with the same error on every machine, search for the specific error string to see where it is generated, and pick out some of the functions that might lead to it. Also, in all of the ASUS Z7S MCE errors I've seen, the call stack looks exactly the same, so I would suggest searching in:
machine_check()
do_machine_check()
mce_reign()
for any changes between offending and non offending version of kernel.

4.
My bet is that this is something extremely silly / stupid like:
- checking cache coherency while in / straight after E0 state
- accessing RAM while in low power mode
- or actually giving a damn about errors reported by the machine (ECC machines DO REPORT ERRORS, but they also correct them and bail out properly without harming running operations - a good example is FB - DDR)
etc.
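
Point 3 above can be approached with plain git by listing the commits between the good and bad revisions that touched the machine-check code. This is a rough sketch only. The helper name `commits_touching` is made up, and the path used in the example (`arch/x86/kernel/cpu/mcheck/mce.c`) is an assumption about where `do_machine_check()` and `mce_reign()` live in these kernel versions.

```shell
#!/bin/bash
# Hypothetical helper: list commits between two revisions that touched
# one path in a git repository.
#   $1 = repository directory
#   $2 = known-good revision, $3 = known-bad revision
#   $4 = path inside the repository
commits_touching() {
    local repo=$1 good=$2 bad=$3 path=$4
    git -C "$repo" log --oneline "$good..$bad" -- "$path"
}
```

For example, `commits_touching "$BASE/linux" v2.6.35 v2.6.36-rc1 arch/x86/kernel/cpu/mcheck/mce.c`. An empty result would suggest the regression came in through some other part of the tree.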

Brendan McLearie (bren-internode) wrote :

@Tomasz thanks for your comments. re each of your points:

1. So we are unlikely to get a bisect between 2.6.35 and 2.6.36 to compile. Prior to doing git bisects I tested mainline kernels and determined the error to have entered somewhere between 2.6.35-14 and 2.6.36-rc1, so these have been the focus of the git bisect. Since rc1 is the first .36, we're sort of doing what you suggest. Either my git skills aren't good enough or it doesn't exist: can you determine (and if so, how) whether there was a -dev version prior to -rc1?

2. As above.

3. Thanks for suggestions - not sure if my skills are up to that level of diagnosis - happy to give it a go but will need help.

4. Sounds plausible - will be great if it's something simple.

@Daniel - are you still on the journey with this one? I've been busy over Easter and after so just getting time again to keep working on this.

Many thanks
Brendan

Daniel Manrique (roadmr) wrote :

@brendan:

Tip #3 from Tomasz is good. If you can take a picture of the crashed screen, that may give a clue as to where the problem is. If you do, please attach the picture here and we can help diagnose it.

Also, from some stuff I happened to read this week, commits to kernel mainline are *supposed* to always be buildable (i.e. nothing that makes the kernel fail to compile should make it into the kernel source). If you're still having problems compiling, let me know and I'll try to duplicate the process on my end; maybe I can compile the kernels and upload them for you to try. Failing that, there'll be two of us looking at possible compile failures, which doubles our chances of success :)

Brendan McLearie (bren-internode) wrote :

Yes please Daniel. I think I'm at the limit of my experience.... But still keen to solve the problem and learn what I can on the way.

There is at least one crash photo attached already. I've got a couple more from recent tests which I'll attach if the existing one is insufficient. Let me know.

As per previous posts I can only get a couple of steps into the bisect before it fails to compile.

All help appreciated.

Tomasz Kusmierz (wally-tm) wrote :

@ ALL

BUAHAHAHAHAHAHAHHAHAHAHAHAHAHAHHA I GOT IT !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

IT WORKS !!!!! 4 HOURS AND RUNNING !!!! UPDATING TO 12.04 IN BACKGROUND !!!!!

Anyway, long story short -> been updating my home server -> by accident plugged in this stick to my machine -> booted up by accident -> it crashed -> few swear words $%^#$#$%^ -> quick look at the screen -> WTF ? there is a firewire_OHCI function in MCE trace -> reboot -> BIOS -> disable 1394 (or however it is called there) -> reboot into "old ubuntu install" -> AND IT WORKS FOR 4H NOW !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

for folks that are not bothered with story:

! DISABLE FIREWIRE AND LINUX WORKS !
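
The clue here was a firewire_ohci function showing up in the MCE trace. In kernel call traces, a name in square brackets after a function is the module that function belongs to, so a trace saved as text (for example typed out from a crash photo) can be scanned for the modules involved. A minimal sketch; the helper name `trace_modules` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: list the module names that appear in square
# brackets in a saved call trace, most frequent first.
# Address fields such as [<ffffffff81001000>] do not match because
# they contain "<" and ">".
trace_modules() {
    local trace=$1
    grep -o -E '\[[a-z0-9_]+\]' "$trace" \
        | sed -e 's/^\[//' -e 's/\]$//' \
        | sort | uniq -c | sort -rn
}
```

A module that appears in the trace is a candidate worth ruling out, which is what disabling 1394 in the BIOS did here.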

Brendan McLearie (bren-internode) wrote :

@Tomasz you are a legend! Hopefully I'll get a chance to try it myself in a couple of hours...... Was just looking at that big bit of tin yesterday thinking I might have to install Windoze or bsd on it......

Brendan McLearie (bren-internode) wrote :

Success Tomasz! Many thanks - now happily running on 12.04.

I have logged it also as a new bug https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1016556 under the 12.04 / 3.x kernel.

Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Tomasz Kusmierz (wally-tm) wrote :

YEAH !!!!

Recently I've switched jobs, and to be able to work from home I've had to negotiate with my missus to use her laptop as a dev rig (development is done 101% in ubuntu), so now I've just made myself a fantastic gift as well, because I can compile the whole project in under 10 minutes vs. 2h ;))))

Anyway, thank you Brendan for your input, for starting a whole new bug and for wrapping it up with nice explanations etc. Also big thanks to all the people who wanted to help (especially to those crazy enough to do a bisect - people pay me extortionate amounts of money just to look in the direction of a task of that magnitude, and you guys went there for free without any preparation - R1sPeCT !). Not so much thanks to the ubuntu guys for wishing that this bug would die on its own - but hey, life is brutal.

Many thanks to the ASUS hardware engineers for making such a fantastic motherboard that after ~5 years it still takes half the time to compile 300MB of source code compared with a brand new i7 (6 core) - many thanks and big respect, you truly can point a finger and proudly say "I done that!" - keep up the good work lads.

#ifdef sarcazm
Also many thanks to the ASUS software engineers for doing such a fantastic job with the BIOS - without them, such ridiculous issues as the firewire linux bug, the XEON E0 states or even the complete windoze (s!c) acpi f*** up would haunt us day and night. Also many thanks for the absolutely brilliant fan control mechanism - it makes life a lot easier to control all 7 fans as one - not that I would like to run the CPU fan quicker than the HDD fan when the CPU is under load and the HDD is nice and cool - WHY WOULD I DO THAT! Also thank you for providing such fantastic drivers for the realtek pcie x1 sound card that they completely f*** up most games, so I had to go directly to realtek and get their drivers for this chip to make it work properly.
#endif

penalvch (penalvch)
tags: added: needs-upstream-testing regression-release
removed: 2.6.38 mce panic testing
tags: added: kernel-bug-exists-upstream performing-bisect
removed: needs-upstream-testing
description: updated