RTL8139 NIC: Ubuntu 12.04 network stops responding, nothing in the logs

Asked by Sven Neuhaus

I have an old root server located in a data center. It has a Shuttle MK35 mainboard with onboard RTL8139 ethernet.

I recently installed Ubuntu 12.04 on that box (using a rescue system and debootstrap).

When there is load on the machine, it completely stops responding to the network. There is nothing in the logs that indicates why.

When I restart the server by soft reset (which emulates sending ctrl+alt+del), the network starts responding again right away (during shutdown!) and the server reboots cleanly.

The ethernet port uses the 8139too driver.

I have upgraded the kernel to the proposed kernel linux-image-3.2.0-24-generic 3.2.0-24.38 and it has improved the situation a lot. Previously I would get these network disconnects every few minutes (under load); now I get them only a few times per day. That is still too many: the same machine worked flawlessly under Debian lenny.

I have tried running a shell script on the server that detects whether the outside network is unreachable and restarts the network (ifdown eth0;ifup eth0), but that didn't work. It looks as if the ping still went through.
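For anyone wanting to reproduce that kind of watchdog, here is a minimal sketch in Python rather than shell. It is not the script I actually ran; the function names, the ping targets and the 60-second interval are assumptions made for illustration. It pings a list of outside hosts and only runs ifdown/ifup when none of them answer, logging a timestamp so you can see afterwards whether it ever fired.

```python
import os
import subprocess
import time


def is_reachable(host, call=subprocess.call):
    """Return True if a single ping to `host` is answered within 5 seconds."""
    # Linux iputils ping: -c 1 sends one echo request, -W 5 waits up to 5 s.
    with open(os.devnull, "w") as devnull:
        return call(["ping", "-c", "1", "-W", "5", host],
                    stdout=devnull, stderr=devnull) == 0


def restart_interface(iface, call=subprocess.call):
    """Equivalent of `ifdown eth0; ifup eth0` for the given interface."""
    call(["ifdown", iface])
    call(["ifup", iface])


def check_once(hosts, iface, call=subprocess.call, log=print):
    """Restart `iface` only if every host in `hosts` is unreachable.

    Returns True if a restart was triggered, False otherwise.
    """
    if any(is_reachable(host, call) for host in hosts):
        return False
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    log(stamp + " all targets unreachable, restarting " + iface)
    restart_interface(iface, call)
    return True


def watch(hosts, iface="eth0", interval=60):
    """Run check_once forever, sleeping `interval` seconds between checks."""
    while True:
        check_once(hosts, iface)
        time.sleep(interval)
```

It would be started as root with something like `watch(["192.0.2.1", "198.51.100.1"])`, where those addresses are placeholders for real hosts outside the data center. Checking several targets avoids restarting the interface because one remote host happens to be down. If the ping still succeeds during the hang, as it seemed to here, a watchdog like this will never trigger.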

Perhaps it is some kind of firewall issue? I have not configured anything in that regard and the iptables are empty.

Ideas, pointers are welcome. Thanks.

Question information

Language: English
Status: Solved
For: Ubuntu linux
Solved by: Sven Neuhaus

actionparsnip (andrew-woodhead666) said :
#1

As a good first step, I suggest testing the RAM with Memtest from the GRUB menu.

Sven Neuhaus (sven0) said :
#2

Do you really think this looks like a memory problem? In the past when there were memory problems I saw processes crashing etc.

Anyway, is there a way to run memtest on a remote server located in a data center?

Sven Neuhaus (sven0) said :
#3

The server's network connection was stuck again this morning.

Sven Neuhaus (sven0) said :
#4

I don't have gnome installed.

actionparsnip (andrew-woodhead666) said :
#5

The memory holds the apps and drivers as well as the OS, so if a bad part of the RAM is used, it will cause issues.
You could try a smart hands request to have the server looked at locally.

marcobra (Marco Braida) (marcobra) said :
#6

I know you don't have GNOME installed, since you are on a server. But Launchpad has no smart grouping for questions about network issues, so we usually put all network-layer questions into that group.

Sven Neuhaus (sven0) said :
#7

I noticed that the server's clock also stops running, and sometimes ntp terminates.
So it's not just the network that stops responding.
Meanwhile I have updated the machine to the 3.4 mainline kernel. This didn't help either.
I have now disabled ACPI (added acpi=off to the kernel boot parameters). Hopefully that will do the trick.
The clocksource is now tsc, not acpi_pm.
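To confirm which clocksource is active and whether acpi=off actually reached the kernel, the standard sysfs and procfs files can be read directly. The snippet below is only a sketch; the function names are made up for illustration, and the paths are passed as parameters so it can be tried against test files.

```python
import os

# Standard sysfs location of the kernel's clocksource information.
CLOCKSOURCE_DIR = "/sys/devices/system/clocksource/clocksource0"


def read_text(path):
    """Return the stripped contents of a small text file."""
    with open(path) as f:
        return f.read().strip()


def clock_report(clocksource_dir=CLOCKSOURCE_DIR, cmdline_path="/proc/cmdline"):
    """Collect the active clocksource, the available ones, and the acpi=off flag."""
    current = read_text(os.path.join(clocksource_dir, "current_clocksource"))
    available = read_text(
        os.path.join(clocksource_dir, "available_clocksource")).split()
    # /proc/cmdline holds the parameters the running kernel was booted with.
    params = read_text(cmdline_path).split()
    return {
        "current": current,
        "available": available,
        "acpi_off": "acpi=off" in params,
    }
```

Calling `clock_report()` with no arguments on the server reads the real files. On this machine I would expect `current` to be `tsc` and `acpi_off` to be true after the change.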

If it runs stable for a couple of days I'll update this post.

Sven Neuhaus (sven0) said :
#8

Looks like the kernel parameter acpi=off fixed it!

PapiMigas (papimigas) said :
#9

Thank you, Sven!!!