Bogus MCE errors with 11.04 install

Asked by Tomasz Kusmierz

Hi,

I'm getting bogus MCE errors with the new 11.04. Yesterday I decided to go for 11.04 and upgraded my Ubuntu. When running it with the new kernel (2.6.38), it crashes after 5 to 15 minutes. When I boot with the old kernel (2.6.35-28), everything runs smoothly. To rule out a bad upgrade, I did a fresh install, and during the install I got exactly the same MCE errors.

Why I think these errors are bogus:
win XP - runs OK
win Vista - runs OK
win 7 - runs OK
ubuntu 10.10 - runs OK
fedora 14 - runs OK

This occurs on my dev/game box ... I've already swapped CPUs and RAM with some servers ... so:

mobo:
asus Z7S WS / unfortunately got no access to second one ;(

ram tested:
2 x 4GB ecc fbdimm hynix / 2x1GB ecc fbdimm micron

cpu tested:
2 x 2.8Ghz xeon 5400 / 1 x 3.6Ghz xeon 5400

graphic card:
my one: ati 4780x2 / (I will pick similar one from my mate today, but this just feels like a wasted effort)

There seems to be nothing in any log files about this crash, but I took a few photos ;) so here is a copy from a photo:

[ blablabla] [Hardware Error]: CPU 3: Machine Check Exception: 4 Bank 0: b200000410000800
[ blablabla] [Hardware Error]: TSC 5c8758437c
[ blablabla] [Hardware Error]: PROCESSOR 0:10676 TIME 1304085547 SOCKET 0 APIC 3
[ blablabla] [Hardware Error]: No human readable MCE decoding support on this CPU type.
[ blablabla] [Hardware Error]: Run The ... bla bla
[ blablabla] [Hardware Error]: CPU 3: Machine Check Exception: 4 Bank 5: b200000040100e0f
[ blablabla] [Hardware Error]: blabla
[ blablabla] [Hardware Error]: CPU 2: Machine Check Exception: 4 Bank 0: b200000410000800
[ blablabla] [Hardware Error]: blabla
[ blablabla] [Hardware Error]: CPU 2: Machine Check Exception: 4 Bank 5: b200000044100e0f
[ blablabla] [Hardware Error]: blabla
[ blablabla] [Hardware Error]: CPU 1: Machine Check Exception: 5 Bank 5: b200001080200e0f
[ blablabla] [Hardware Error]: RIP !INEXACT! 10:<ffffffff81014a4d> {mwait_idle+0x8d/0x120}
[ blablabla] [Hardware Error]: blabla
[ blablabla] [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 5: b200001084200e0f
[ blablabla] [Hardware Error]: RIP !INEXACT! 10:<ffffffff812edf4f> {zlib_inflate+0xf3f/0x15f0}
[ blablabla] [Hardware Error]: blabla
[ blablabla] [Hardware Error]: Machine check: Processor context corrupt
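
The status words in the lines above can be read against the layout of the IA32_MCi_STATUS register described in Intel's Software Developer's Manual. What follows is only a minimal sketch, not a substitute for a real decoder such as mcelog: the helper name decode_mci_status is made up for illustration, only the architectural flag bits are named, and the two 16-bit code fields are returned raw without interpretation.

```python
# Sketch: decode the architectural flag bits of an IA32_MCi_STATUS value.
# Bit positions follow the Intel SDM; decode_mci_status is a hypothetical name.

FLAGS = {
    63: "VAL",    # register contains valid error information
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MCi_MISC holds additional information
    58: "ADDRV",  # MCi_ADDR holds the error address
    57: "PCC",    # processor context corrupt
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit status word into named flags and raw code fields."""
    return {
        "flags": [name for bit, name in FLAGS.items() if (status >> bit) & 1],
        "mca_error_code": status & 0xFFFF,               # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


if __name__ == "__main__":
    # Two of the status words copied from the photo transcript above.
    for raw in ("b200000410000800", "b200001084200e0f"):
        info = decode_mci_status(int(raw, 16))
        print(raw, info["flags"], hex(info["mca_error_code"]))
```

Every status word in the transcript starts with b2, so each decodes to the flags VAL, UC, EN and PCC, which agrees with the final "Processor context corrupt" line.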

kernel is 2.6.28-8-generic #42-Ubuntu

Question information

Language:
English
Status:
Solved
For:
Ubuntu linux
Assignee:
No assignee
Solved by:
Tomasz Kusmierz
Revision history for this message
Tomasz Kusmierz (wally-tm) said :
#1

BTW,
pretty much the same stuff happens on the server kernel ...

Seanyoung247 (seanyoung247) said :
#2

I get the exact same problem with similar hardware: ASUS Z7S WS, 2.33GHz quad Xeon, 8GB FB-DIMMs (4x2GB) and ATI 5850. This also occurs when running the Fedora 15 Beta Live CD, which uses the same kernel as Ubuntu 11.04. I've tried booting with the nomce and nomodeset parameters, with no change.

Tomasz Kusmierz (wally-tm) said :
#3

Damn, so shall we take this to the kernel mailing list? Also, what kernel is Fedora 15 using on the beta live CD? (I'm mainly concerned about the 2.6.38-xx bit.)

BTW 2,
I've tested BIOS 0401 and 0501 - both with exactly the same problem!

So, is there a chance to get a new ubuntu flavored kernel from the 2.6.38 range in the near future?

Eliah Kagan (degeneracypressure) said :
#4

@Tomasz Kusmierz
I recommend you report this as a bug in linux in Ubuntu. To do that, first read https://help.ubuntu.com/community/ReportingBugs carefully. Then search to see if the bug has already been reported (among many other things, that page explains how). Then report the bug by running "ubuntu-bug linux". In your bug report, include all the information about the bug that you've posted in this question, especially your explanation of why there is very good reason to believe that these machine check exceptions are spurious.

For your bug report, please find the [Hardware Error] lines in the file /var/log/dmesg and post them. You should also attach the file /var/log/dmesg, if Apport does not attach it automatically. If you are unable to find those lines in /var/log/dmesg, then either they weren't written, or the last crash occurred long enough ago that they would be in the rotated dmesg logs (/var/log/dmesg.0, /var/log/dmesg.1.gz, /var/log/dmesg.2.gz, and so forth). You only have to check the rotated logs if you cannot find the lines in /var/log/dmesg and the modification times on the rotated files are recent enough that they could cover the most recent crash. (It's a reasonably good bet that if the crash occurred once and wasn't logged, then on your system it is never logged when it occurs.)

If this was not recorded in any log files, then you should go ahead and include the copy of the error message lines from your photo, and in addition, you should attach the photo as an image file (unless you took it with an analog camera and have no way of scanning or otherwise digitizing it) to the bug report.
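
The search through /var/log/dmesg and its rotated copies could be scripted roughly as follows. This is only a sketch: find_hardware_errors is a made-up helper name, and it assumes that the rotated logs ending in .gz are ordinary gzip files and that the rest are plain text.

```python
# Sketch: search the current and rotated dmesg logs for "[Hardware Error]" lines.
# find_hardware_errors is a hypothetical helper; the /var/log/dmesg* paths are
# the ones named in the post and may not exist on every system.
import glob
import gzip
import os


def find_hardware_errors(paths):
    """Return (path, line) pairs for every line containing '[Hardware Error]'."""
    hits = []
    for path in paths:
        if not os.path.isfile(path):
            continue  # skip logs that are not present
        # Assumption: rotated logs ending in .gz are gzip-compressed text.
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", errors="replace") as handle:
            for line in handle:
                if "[Hardware Error]" in line:
                    hits.append((path, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    candidates = sorted(glob.glob("/var/log/dmesg*"))
    for path, line in find_hardware_errors(candidates):
        print(f"{path}: {line}")
```

An empty result means the panic was not logged to any of these files, in which case the transcribed text and the photo are the only record.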

After you file your bug report, you can link this question and the bug to each other using the "Link existing bug" link on this question page. While it is a good idea to do this, your bug report should be complete and self-contained--it should not be necessary for the Ubuntu developers to refer to this question in order to completely understand and appreciate your bug report.

Even though this problem is unlikely to be specific to Ubuntu (in part because Seanyoung247 has also experienced it with Fedora), it is still very appropriate to file a Launchpad bug about this against linux in Ubuntu. At some point soon, it would also be good to file a bug at https://bugzilla.kernel.org (whoever does that may want to read through https://bugzilla.kernel.org/docs/en/html/using.html first and should certainly at least take a look at it). If you're reporting this against linux in Ubuntu, then by way of division of labor, it would perhaps be reasonable for Seanyoung247, who has experienced it on multiple distros, to file the upstream bug. But that's just a suggestion. Similarly, either of you could file the downstream bug here on Launchpad, too. (Or you both could, and then one could get marked a duplicate of the other.) Once the upstream bug is filed (or found already to have been filed, by a search of the bugs at https://bugzilla.kernel.org), the downstream bug should be linked to it using the "Also affects project" link on the downstream (Launchpad) bug page.

@Seanyoung247
Do you also have good reason to believe that the machine check exceptions aren't real, on your system? For example, do you run another operating system for a significant number of hours each day and *not* get machine check exceptions, like Tomasz Kusmierz does? If you run a Linux-based OS on the machine, what kernel does it use?

Eliah Kagan (degeneracypressure) said :
#5

"So, is there a chance to get a new ubuntu flavored kernel from the 2.6.38 range in the near future?"

I don't know which 2.6.38 kernel you're running, but there is a new version in natty-proposed. The version in main is 2.6.38-8.42. The version in proposed is 2.6.38-9.43. The changelog (you can read it at https://launchpad.net/ubuntu/+source/linux) doesn't say anything suggesting that a bug like this would have been (intentionally) fixed, and I doubt that will fix this problem, but you could try it and find out (see https://wiki.ubuntu.com/Testing/EnableProposed). If you try the version in proposed, that could be either before or after you report your bug, but if you're going to try it and you haven't reported the bug yet, I suppose it may as well be before. If you're running the version from proposed when you report the bug, you should make sure to mention in your bug report that it also occurs in the version in main (and give the full version number of the -main kernel for clarity, which will probably be 2.6.38-8.42), so as to avoid creating the impression that this bug might have originated in kernel 2.6.38-9.43.

Seanyoung247 (seanyoung247) said :
#6

@Tomasz Kusmierz

The Kernel in use by Fedora 15 Beta is 2.6.38.2-9. It may be worth mentioning that I've tried the 32 bit version of the Natty liveCD as well, and it also crashes, though before completing boot and without giving any Kernel messages before rebooting.

@Eliah Kagan
I believe these messages are spurious also, yes. I have been running Windows Vista as a dual boot for over a year on this system with no problems. I have been, and I am currently back using, Ubuntu 10.10 with the 2.6.35-28-generic Kernel and it's never had any problems.

Tomasz Kusmierz (wally-tm) said :
#7

Dear
@Eliah Kagan
- cheers for advice on https://help.ubuntu.com/community/ReportingBugs
- ubuntu-bug linux tells me that I can NOT report this since this is "not original ubuntu package" (when running on 2.6.35) and when running on 2.6.38 it does not manage to get to sending process because kernel crashes.
- /var/log/dmesg does not show any messages [Hardware Error], and neither does any other log file or rotated log file. Anyway, it would be VERY suspicious if the kernel logged anything after biting the bullet of 'processor context corrupt' (writing corrupted data to the fs?). Also, did you read the line stating "There seems to be nothing in any log files about this crash..." in the first post?
- it's fairly hard to provide dmesg from crashed installation process.
- rather than throwing some lame photos from my mobile, I've preferred to put some effort and at least copy vital text in so anyone searching for this type of bug will find what he/she looks for.

Kind regards for guidance & support.

Anyway, @Seanyoung247
cheers for the info. I've tried a 32-bit install as well because I was inclined to believe that the 64-bit build might have got contaminated. I've also tried 2.6.38-9 from the proposed repo (as proposed by Eliah Kagan) with no luck. For the time being I'll run 11.04 with the 2.6.35 kernel ... when I get some time I'll post this bug on the kernel mailing list or Bugzilla (whatever). If so, are you OK to provide your setup as an additional reference?

Btw, I'm not trying to blame any kernel devs for having this bug in the mainstream for that long - after all, this mobo is not really common hardware.

Cheers, Tom.

Seanyoung247 (seanyoung247) said :
#8

@Tomasz Kusmierz
Yeah that's perfectly fine, let me know if you need any more information from me.

Tomasz Kusmierz (wally-tm) said :
#9

Btw, in first post, last line:

kernel is 2.6.28-8-generic #42-Ubuntu

should be:

kernel is 2.6.38-8-generic #42-Ubuntu

my bad, sorry.

Eliah Kagan (degeneracypressure) said :
#10

@Tomasz Kusmierz
"- ubuntu-bug linux tells me that I can NOT report this since this is "not original ubuntu package" (when running on 2.6.35) and when running on 2.6.38 it does not manage to get to sending process because kernel crashes."

If you run a custom kernel--i.e., one you build yourself, with any changes of any kind in the configuration--then you cannot report a bug in it against linux in Ubuntu. (Actually, you can, but you shouldn't, and Apport won't facilitate it easily, and if you do, your bug will be summarily marked Invalid.) To be reported downstream, bugs in a custom kernel must be reproduced in a vanilla (i.e., non-customized) kernel. You can still report bugs affecting a custom kernel upstream at https://bugzilla.kernel.org; when doing so, you should make sure to explain the configuration changes and/or unofficial patches you used when you compiled it.

However, from your description, it seems that the bug you're experiencing is not present in 2.6.35. Therefore, even if "ubuntu-bug linux" worked fine when running that kernel, that would not be a good way to report this bug, as it would attach information about the wrong kernel. Instead, to report the bug against a 2.6.38 kernel when running the 2.6.35 kernel, you should run:

ubuntu-bug linux-image-2.6.38-9-generic

Or, if you've already removed that version (which is the one from -proposed):

ubuntu-bug linux-image-2.6.38-8-generic

As an alternative technique--and this is arguably better, though only marginally--you could boot into the kernel where the bug occurs and run:

apport-cli linux

That will collect information, upload it, and provide a URL. Tell it not to open a web browser, but copy the URL. This should only take a couple of minutes at most, so with one or two tries, you should be able to accomplish this on the system running the bad kernel. Then you can boot with the good kernel (or use another machine) and go to the URL provided. You can do this by running "ubuntu-bug linux" too, but the added time and resources associated with actually spawning a browser window and going to the URL might make it harder to complete the operation and get the URL in time.

"- /var/log/dmesg does not shows any messages [Hardware Error] ...."

That's not surprising, though it was worth checking. Since you had not specified which log files you'd checked, I wasn't sure you'd checked the right ones--I see now that you have. Since prior to Natty, Ubuntu put the dmesg in /var/log/messages instead of /var/log/dmesg, there have been some questions posted expressing confusion about where to find the kernel's logs on disk.

I recommend attaching the dmesg anyway, even though it doesn't contain *direct* documentation of the panic. It may still contain information that might shed light on the problem. I do recommend pointing out in the report that it doesn't contain a log of the problem, and attaching the text you wrote based on your photo, as well as the image itself, to your bug report, as described above. (When you run ubuntu-bug or apport-cli or the like, it might automatically attach the dmesg, so then you needn't do it.)

"it's fairly hard to provide dmesg from crashed installation process."

And also fairly hard to run a different kernel in the installation process, yes. But since you said you were booting into the 2.6.35 kernel, presumably you've either overcome that (via live CD customization?) or you've got an installed system on which the problem occurs. Is that the case?

"rather than throwing some lame photos from my mobile, I've preferred to put some effort and at least copy vital text in so anyone searching for this type of bug will find what he/she looks for."

Yes, if you can copy all the text precisely and without error, you're quite right that this is preferable. If you get all the text from the screen exactly, then I agree that you should not bother attaching the image.

Gabriel Blanchard (gabriel-blanchard) said :
#11

I can confirm that I have the exact same problem and it appears that the point in common is the motherboard

I use a Z7S WS as well.

Ubuntu 10.x runs just fine but I can't even get past the installation with 11.04

This appears to be related to a change made in the kernel somewhere past 2.6.35. I'm running into the exact same issue with Fedora 15, for example, but not with 14.

Unfortunately I'm unable to attach anything either as the crash happens during the installation and doesn't allow me to save anything.

Eliah Kagan (degeneracypressure) said :
#12

@Tomasz Kusmierz
Are you continuing to experience problems reporting this as a bug?

sam tygier (samtygier) said :
#13

i have a crash with a similar "mwait_idle+0x8d/0x120" part. maybe it is related. i filed Bug #789890

i have a TYAN Tempest i5000XT (S2696) motherboard

Eliah Kagan (degeneracypressure) said :
#14

@sam tygier
Please let me know if you think I'm missing something...but your crash does not seem similar to the crash that Tomasz Kusmierz, Seanyoung247, and Gabriel Blanchard are experiencing. They are getting notifications that there has been a machine check exception. It seems that the only similarity between your bug and the bug they're experiencing is that they both cause kernel panics (http://en.wikipedia.org/wiki/Kernel_panic). There are many possible different, unrelated bugs that could cause the kernel to panic.

In any case, thank you for reporting your bug.

sam tygier (samtygier) said :
#15

@Eliah Kagan
this page came up when i googled for parts of the trace from my panic.

Tomasz Kusmierz (wally-tm) said :
#16

Hi boys and girls, sorry for the delay, but I was on holiday (and away from the PC at all costs).

I've uploaded the previously mentioned "photos" of a crash. Please be aware that these were taken during the install process. During normal operation (after an upgrade, for example) these messages do NOT come up on the screen; the only way to replicate it is to boot to a console rather than to X, and then the messages get spat out on whatever console you're in and the PC reboots after 20 seconds.

http://imageshack.us/photo/my-images/27/037ysh.jpg/
http://imageshack.us/photo/my-images/21/038gj.jpg/
http://imageshack.us/photo/my-images/851/041xv.jpg/

Eliah Kagan (degeneracypressure) said :
#18

@Tomasz Kusmierz
Are you able to report this as a bug, using the technique I described in post #10?

Brendan McLearie (bren-internode) said :
#19

I think this bug (reported and confirmed) is the same issue:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/801840

Only a suspicion but I think the ACPI is somehow involved. I infer this from having previously needed to use various kernel boot options on this board to stabilize vmware under ubuntu. Further, these options when used variously cause variations in the kernel panic.

This is unfortunate because if the bug is truly limited to the Z7S (and, from reading the forums, a handful of other boards) then a fix is unlikely. The board is rock solid otherwise, and with the older kernels there is no problem at all.

If anyone has any insights into further diagnosis or even a custom kernel build to work around it I'd be happy to participate / test etc.

Brendan McLearie (bren-internode) said :
#20

Just tried 2.6.38-11 pre-release kernel. Same problem.

Tomasz Kusmierz (wally-tm) said :
#21

Well, if you need some diagnosis, just boot your PC to a console and watch ;) It always pops up with MCE errors and reboots (BTW, nice MCE error you've got on your screen). It seems like the same point of origin for the error - mwait() -> cpu_idle() -> cpu_start_secondary(). I shall try playing around with ACPI to confirm whether this is a valid assumption or not.

Anyway, I've had the spookiest failure in my PC experience - a spider entered the PC case and got onto the power supply rail for cpu0 (one of the caps just next to it) and somehow caused a short circuit on the cap :O This mobo can supply up to 150A of current to the CPU, so needless to say the spider got welded into the case wall. Anyway, after a successful warranty claim (which took 6 weeks anyway) I've got a new mobo ... and guess what - BLOODY MCE ERROR

Tomasz Kusmierz (wally-tm) said :
#22

BTW, dear moderator / admin - please stop closing this bug report because it's NOT SOLVED and my question is NOT ANSWERED!

Eliah Kagan (degeneracypressure) said :
#23

@Tomasz Kusmierz
At the risk of distracting from the actual technical issues here (which fortunately, due to recent progress, seem like they may now be addressable in bug 801840), I want to assuage your concerns about the way your question's status has been changing.

First of all, this is not a bug report. This is a question. (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/801840 is a bug report, it appears perhaps to be the same as the bug being discussed here, and it has not ever been closed.)

The way a question on Launchpad works is that it is initially in the Open state, and then when someone asks you for more information it goes into the Needs Information state, or if someone proposes an answer it goes into the Answered state. When you provide the requested information (by submitting your post with the "I'm Providing More Information" button), or indicate that the proposed answer was not sufficient and you still need help (by submitting your post with the "I Still Need an Answer" button), the question's status goes back to Open. (When you are a question's creator, as in this case, replying by email to a question in the Needs Information or Answered states also has the effect of changing it back to Open.) The vast majority of question status changes are not done by moderators or admins, and none of the changes here have been.

The Answered state should not be confused with the Solved state. Only you (or maybe an administrator) can mark your question as Solved. Answered just means that an answer has been proposed, and might potentially turn out to be the solution.
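
The status changes described above can be summarized as a small transition table. This is only an illustrative sketch: the status names come from this post, but the event names and the next_status helper are made up and are not Launchpad's API. The post does not spell out which states Solved can be reached from, so the sketch assumes it can be reached from any state.

```python
# Sketch of the question status transitions described in the post.
# Status names come from the post; event names and next_status are hypothetical.

TRANSITIONS = {
    ("Open", "information_requested"): "Needs Information",
    ("Open", "answer_proposed"): "Answered",
    ("Needs Information", "asker_provides_information"): "Open",
    ("Answered", "asker_still_needs_answer"): "Open",
    # Replying by email as the question's creator also reopens it.
    ("Needs Information", "asker_replies_by_email"): "Open",
    ("Answered", "asker_replies_by_email"): "Open",
}


def next_status(status: str, event: str) -> str:
    """Return the status after an event; events with no listed transition change nothing."""
    if event == "asker_marks_solved":
        # Assumption: the asker can mark the question Solved from any state.
        return "Solved"
    return TRANSITIONS.get((status, event), status)


if __name__ == "__main__":
    status = "Open"
    for event in ("answer_proposed", "asker_still_needs_answer", "asker_marks_solved"):
        status = next_status(status, event)
        print(event, "->", status)
```

Note that nothing in the table moves a question to Solved on someone else's action, which is the point being made about Answered versus Solved.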

For more information about how questions work on Launchpad, see https://help.launchpad.net/Answers/AskingForHelp.

Since this question thread is about a bug report, rather than being a bug report, it stands to reason that it is solved not once the bug is fixed, but rather once the necessary information for reporting the bug (and working around it, as best as is possible) has been provided, or once no more information is being asked for in the question. While this issue is moving toward that point, it doesn't seem quite there yet. On the other hand, for this question to be Open (as it now is) means that you are requesting that someone provide an answer to you here about something. If that is the case, you may want to be more clear about what it is that you are asking.

It seems to me like you might be asking Brendan McLearie to verify that in bug 801840, he is getting MCE (machine check exception) messages similar to the ones discussed here. I would like to know that as well. As you have pointed out, if those messages are appearing in bug 801840, then along with the other similarities, it seems likely that the bug being discussed here is the same as bug 801840.

Eliah Kagan (degeneracypressure) said :
#24

@Brendan McLearie
Do you get MCE (machine check exception) messages with your kernel panics? If not, then you may well not have the same bug as this (and then the bug here should still be separately reported). If so, please let us know...and you may also want to add that information to your bug report.

Brendan McLearie (bren-internode) said :
#25

@Eliah Kagan (degeneracypressure)

Thx for your explanation of the Answers status - most informative.
I can confirm that I get an mce_panic in all cases of kernel versions 2.6.38-8,10,11.
I've added a further photo of the console screen after the panic to bug 801840 at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/801840
I also suggest that this is a confirmed bug relating to this question. I have therefore proposed this as an answer.
Many thanks.

Brendan McLearie (bren-internode) said :
#26

@Tomasz Kusmierz
Could you add the outcomes of your ACPI investigations to the bug please. Most interested in what you try and your findings.

Tomasz Kusmierz (wally-tm) said :
#27

@ ALL

BUAHAHAHAHAHAHAHHAHAHAHAHAHAHAHHA I GOT IT !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

IT WORKS !!!!! 4 HOURS AND RUNNING !!!! UPDATING TO 12.04 IN BACKGROUND !!!!!

Anyway, long story short -> been updating my home server -> by accident plugged this stick into my machine -> booted it up by accident -> it crashed -> few swear words $%^#$#$%^ -> quick look at the screen -> WTF ? there is a firewire_OHCI function in MCE trace -> reboot -> BIOS -> disable 1394 (or however it is called there) -> reboot into "old ubuntu install" -> AND IT WORKS FOR 4H NOW !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

for folks that are not bothered with story:

! DISABLE FIREWIRE AND LINUX WORKS !
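
One way to double-check the workaround after disabling the 1394 controller in the BIOS might be to confirm that no FireWire driver is loaded. The sketch below parses text in the /proc/modules format, where the first field of each line is the module name. The helper name loaded_firewire_modules is made up, and the prefix list is an assumption: firewire_ohci is the name seen in the trace, while ohci1394 and ieee1394 are guesses at related older driver names.

```python
# Sketch: list FireWire-related modules from /proc/modules-style text.
# loaded_firewire_modules is a hypothetical helper; the prefixes other than
# "firewire" are assumptions about related driver names.

def loaded_firewire_modules(proc_modules_text,
                            prefixes=("firewire", "ohci1394", "ieee1394")):
    """Return module names (first field of each line) that match a prefix."""
    names = (line.split()[0]
             for line in proc_modules_text.splitlines() if line.strip())
    return [name for name in names if name.startswith(prefixes)]


if __name__ == "__main__":
    with open("/proc/modules") as handle:
        found = loaded_firewire_modules(handle.read())
    print("FireWire modules loaded:", ", ".join(found) or "none")
```

If the list comes back empty while running a 2.6.38 kernel and the machine stays up, that would be consistent with the FireWire controller being the trigger, though it does not prove it.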

Brendan McLearie (bren-internode) said :
#28

I have logged it also as a new bug https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1016556 under the 12.04 / 3.x kernel.