transparent hugepages and thrashing on amd64

Asked by garyr on 2012-06-14

I seem to have found a solution to a severe thrashing/swapping/freezing problem that I've been having for months now. I guess the real question is: should I turn it into a bug report, and if so, what data would be useful to include?

This is a quad-core AMD Phenom system with 4 GB of RAM, dual monitors and a single 1 TB WD Caviar Black hard drive. It had been behaving normally until something broke sometime late in the 11.x release cycle, and the problem continues in the current 12.04 LTS. With a moderate load of apps (for example Firefox with ~8 tabs, a terminal or two, and AisleRiot Solitaire), the system freezes: the entire UI becomes totally unresponsive for anywhere from 20 seconds to 5 minutes, with solid disk activity. When I try to figure out what is going on, iotop and top show jbd2 and kswapd accounting for the largest load, but since iotop freezes like everything else, I can't tell what's happening during the worst storms. Googling turns up a fair number of other people with similar problems, most of them on multi-core amd64 systems.
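For a bug report, counters from /proc/vmstat may say more than a frozen iotop can: comparing snapshots taken before and after a stall shows how much swap-in/swap-out and huge-page activity happened in between. The following is a minimal, hypothetical Python sketch (parse_vmstat, deltas and sample are illustrative names, not an existing tool). It assumes the listed counter names exist on the running kernel; the thp_* and compact_* counters only appear on kernels built with transparent hugepage and compaction support, and missing ones are simply skipped.

```python
from pathlib import Path
import time

# Counters worth comparing across a freeze (assumed names; absent ones are skipped).
INTERESTING = ("pswpin", "pswpout", "thp_fault_alloc", "thp_collapse_alloc",
               "compact_stall", "compact_fail")


def parse_vmstat(text):
    """Turn the 'name value' lines of /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def deltas(before, after, keys=INTERESTING):
    """Increase of each counter between two snapshots, skipping missing counters."""
    return {k: after[k] - before[k] for k in keys if k in before and k in after}


def sample(path="/proc/vmstat", interval=5.0):
    """Take two snapshots `interval` seconds apart and return the deltas."""
    before = parse_vmstat(Path(path).read_text())
    time.sleep(interval)
    after = parse_vmstat(Path(path).read_text())
    return deltas(before, after)
```

Calling sample() in a loop and appending each result with a timestamp to a log file would leave a record that can be read after the system recovers, although the logger itself may stall during the worst storms.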

The other day I spotted this report on openSUSE that looked similar but not identical:

http://lists.opensuse.org/opensuse/2012-03/msg00657.html

Yesterday I booted with the grub parameter transparent_hugepage=never, and the problem went away and hasn't come back. I've stressed the system by running a bunch of Flash/Java tabs in Firefox, a large Java-based stock app (ThinkorSwim) in another workspace, and a 1080p 60fps movie in a third workspace. This certainly causes swapping, but not freezing or stumbling. It actually did a bit of swapping a minute ago while I was typing, and it made Pandora radio stumble for a moment, but that's orders of magnitude better than it has been.
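One way to confirm that the boot parameter actually took effect is to read the kernel's sysfs setting, which lists the available choices with the active one in brackets (for example "always madvise [never]"). The sketch below is a minimal, hypothetical Python helper; it assumes the mainline path /sys/kernel/mm/transparent_hugepage/enabled, which some distribution kernels of this era may not use. On kernels that do expose this file, writing never to it as root should also switch the mode for new allocations without a reboot, which makes A/B testing easier.

```python
from pathlib import Path

# Mainline location of the THP setting (assumption: other kernels may differ).
THP_ENABLED = "/sys/kernel/mm/transparent_hugepage/enabled"


def active_thp_mode(text):
    """Return the bracketed choice, e.g. 'never' from 'always madvise [never]'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def current_thp_mode(path=THP_ENABLED):
    """Read the running kernel's THP mode; return None if the file is absent."""
    p = Path(path)
    if not p.exists():
        return None
    return active_thp_mode(p.read_text())
```

After booting with transparent_hugepage=never, current_thp_mode() would be expected to return "never".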

I think there may be a fundamental problem with how transparent hugepages are handled on some AMD CPUs, and I suspect it started when this feature was implemented and enabled by default.

Here's a partial list of things that haven't worked well in the past:

Playing with the swappiness value: setting swappiness to very low values makes the problem take longer to surface, but (unsurprisingly) makes it even worse once it does.

swapoff -a; swapon -a: this makes it go away for a while. A potentially interesting detail is that as soon as I can get the system to act on the swapoff -a, it becomes responsive again. It pegs one CPU core at 100% and the hard drive grinds like crazy, but the freezing stops right away.

Moving swap from the hard drive to a USB thumb drive: obviously I didn't expect that to be faster, but I wanted to see whether putting swap on a different device on a different bus would make it swap more smoothly. It didn't.

Playing with nice and ionice priorities for jbd2 and kswapd: running these processes at a lower priority than anything else on the system makes no difference, which leads me to think they are just symptoms and not the root of the problem.
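The swapoff -a observation in the list above has a plausible mechanical explanation: swapoff has to read every swapped-out page back into RAM before it can release the swap device, which would account for the pegged core and the grinding disk, and it cannot succeed if RAM cannot hold those pages. The sketch below is a hypothetical Python check to run before trying it. The helper names are illustrative, and treating MemFree + Cached as the memory that can absorb swapped pages is a crude assumption, not the kernel's own calculation.

```python
def parse_meminfo(text):
    """Map /proc/meminfo field names to their numeric values (mostly kB)."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def swapoff_fits(info):
    """Rough check: could RAM absorb everything currently in swap?

    Returns (kB in swap, estimated kB available, fits?). The "available"
    estimate of MemFree + Cached is an assumption and may be optimistic.
    """
    swapped = info["SwapTotal"] - info["SwapFree"]
    available = info["MemFree"] + info.get("Cached", 0)
    return swapped, available, swapped < available
```

Usage would be swapoff_fits(parse_meminfo(open("/proc/meminfo").read())). On kernels with transparent hugepages, the same file also reports AnonHugePages, which shows how much anonymous memory is currently backed by huge pages.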

I think this may be the tip of the iceberg, and there may be a lot of other people having this problem. Looking around, I see a fair number of reports, most of them unsolved. Some may have been fixed simply by adding enough RAM that dirty hugepages never accumulate. Some may have been fixed by changing filesystems: ext4 seems to be something many people with this problem have in common.

Question information

Language: English
Status: Answered
For: Ubuntu linux
Assignee: No assignee
Last query: 2012-06-14
Last reply: 2012-06-15

Sure, report a bug, then suggest your solution.
