System hangs when copying to NFS mounts

Bug #71212 reported by Jose Troncoso
This bug affects 1 person
Affects                        Status         Importance   Assigned to   Milestone
linux (Ubuntu)                 Fix Released   Undecided    Unassigned
linux-source-2.6.17 (Ubuntu)   Won't Fix      High         Unassigned

Bug Description

Binary package hint: linux-source-2.6.17

After I upgraded from dapper to edgy, I noticed that the system froze partially or completely when copying large files (more than 50-100 MB) to my NFS mounts.

I had this problem on two different computers accessing the same NFS server on a third computer which still runs on dapper. I checked the server and client settings time and again and upgraded to NFSv4, but the problem persisted. I also checked that the network was OK, uploading large files to the computer with the NFS server by using scp with no problem at all.

I thought that the problem had to be the NFS driver in the kernel, so I reinstalled dapper kernel 2.6.15-27-k7 with the corresponding restricted modules, and now the problem has ceased: my NFS mounts work as they did before the upgrade to edgy.

Tags: cft-2.6.27
Changed in linux-source-2.6.17:
importance: Undecided → High
status: Unconfirmed → Confirmed
Matt Thrailkill (matt-modestolan) wrote :

This happens to me also. Nothing shows up in any logs. It's not a full freeze; I can ssh in and do things, but something is messed up. A lot of things don't work. This needs to be fixed.

SixDays (oscar-dix) wrote :

Same here.
I have tried ruling out the hardware by changing NICs, checking the hard drives in every which way possible.
Tried copying using a dapper liveCD and it works perfectly, which rules out hardware malfunctioning.

Tried switching between nfs-kernel-server and nfs-user-server, same result. This bug really pisses me off, mostly since it renders the computer unusable when I copy files over NFS. Moving 10 gigs of data takes quite a while, so I think this is a rather urgent bug to fix.

I have not yet discovered if this is only related to my P4 machine, with an Intel motherboard or if it is cross-hardware "compliant". Will test, and if same error occurs on my other machines I will post info on it after this message.

Hardware: P4@2.40, 768 mb ddr 2700, xubuntu 6.10

SixDays (oscar-dix) wrote :

Tried to use another machine running xubuntu 6.10 but on an amd xp 3000 barton cpu, 512 mb ram.

It seems like the same thing happens, and the receiving box stated in my previous comment goes to 100% system load.

Using Midnight Commander, I tried to copy a DVD ISO image from the AMD machine to the Intel box.

After transferring 53% I got an error message stating: "File size exceeded." I accidentally lost the original message, so that is what I recall from memory.

Surfraz Ahmed (surfraz) wrote :

After upgrading one of my workstations to edgy I had major problems with NFS (/home drive mounted over NFS). Dapper workstations have not had any problems.

I was only able to resolve the problem by compiling the 2.6.18.2 kernel from kernel.org with the config from the edgy kernel.

If you have a terminal window open when the problem occurs you may see some messages as mentioned in bug #65827 (or type dmesg when the problem occurs). If so can you mark this bug as a duplicate of #65827?

Thanks

Vanessa Dannenberg (vanessadannenberg) wrote :

I am running Edgy as well and can confirm that this bug still exists while using the 2.6.19 kernel (compiled from kernel.org using a .config that has worked for me for months). I tried 2.6.18.3 and 2.6.18.1 also, but did not get the chance to confirm for certain if the bug exists while using those kernels. However, NFS did seem to perform more consistently there.

I can also confirm that scp'ing to the same server that's NFS-mounted does not induce hangs of any sort, and copies at a faster and more consistent speed as well. Using NFS: 4.5 MB/sec with a sawtooth-like variance in transfer speed (watching my network meter). Using SCP: 6.6+ MB/sec with a fairly steady transfer rate.
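
The NFS-versus-scp comparison above relies on watching a network meter. As a rough sketch of how the write rate could be measured directly, the following Python snippet times a sequential write to a given path and reports MB/s. The function name `measure_write_throughput`, its parameters, and the example path are illustrative assumptions, not something used in this report.

```python
import os
import time


def measure_write_throughput(path, total_mb=64, chunk_mb=1):
    """Write total_mb of zero bytes to path and return the rate in MB/s."""
    chunk = b"\0" * (chunk_mb * 1024 * 1024)
    start = time.monotonic()
    with open(path, "wb") as f:
        for _ in range(total_mb // chunk_mb):
            f.write(chunk)
        f.flush()
        # Force the data out of the page cache so the timing includes
        # the time needed to actually send it to the target.
        os.fsync(f.fileno())
    elapsed = time.monotonic() - start
    return total_mb / elapsed if elapsed > 0 else float("inf")


# Hypothetical usage: "/mnt/nfs/testfile.bin" is an assumed path on an NFS mount.
# rate = measure_write_throughput("/mnt/nfs/testfile.bin", total_mb=256)
# print(f"{rate:.1f} MB/s")
```

Running it once against a file on the NFS mount and once against local disk gives a like-for-like figure. On an affected client the write itself may trigger the hang described in this report, so a small `total_mb` is a sensible first try.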

Please someone fix this bug soon!

Vanessa Dannenberg (vanessadannenberg) wrote :

Following up on my last post ... I tried out 2.6.19-git15, which has a patch that seems to have been intended to fix this bug, but that was no good. In fact, my machine locked up solid when the bug struck - not even Alt-SysRq would work, just the "magic" reset button.

I tried rolling back to 2.6.18.2 as per Surfraz's suggestion, but that was also a no-go. The bug exists for me in that version also.

I've attached my current .config for 2.6.18.2 in case there's some obscure driver in there that's interfering with NFS and causing the bug.

Vanessa Dannenberg (vanessadannenberg) wrote :

More tests on my end: 2.6.18 (first release) on my box also exhibits this bug. 2.6.19 on my husband's box (nearly identical hardware and same distro/version) also has the bug. Going on something I found on the web, I tried mounting the filesystem via TCP (my server is configured to support it), but that does not help.

I have not been lucky enough to see the kernel messages mentioned in bug #65827 either in dmesg, my logs, the console, or the terminal doing the copy, and no other messages are generated either.

Vanessa Dannenberg (vanessadannenberg) wrote :

Figured I'd continue testing and now it's getting interesting!

As with the last tests, I'm using fairly large (over 2GB) files and copying to normal NFS mounts.

There are three machines involved here. Rainbird (my box, Edgy, 2.6.19), Swan (husband, Edgy, 2.6.19), and Stork (our server, Breezy, 2.6.17).

As already mentioned, if I tell Rainbird to push a file to Stork, it hangs and the transfer rate is slow. Cancelling the copy is very difficult, taking several seconds to a minute.

The same thing happens if I tell Rainbird to push a file to Swan, or if I tell Swan to push to Stork. Something interesting here however - watching Swan's network meter (wmnet) during those periods when Rainbird is hung, I can see that there is still data transfer - more in fact than when Rainbird is responding normally.

HOWEVER, if I log into Stork and instruct it to pull from Rainbird, it works fine - no hangs. The transfer averages 8.7 MB/sec and stays fairly smooth. The same thing happens if I instruct Swan to pull a file from Rainbird - no problems whatsoever.

So basically, it doesn't seem to matter which two machines are involved: the copy works without hangs as long as the data is being *pulled* from the source machine by the destination machine, rather than pushed from the source as would be the norm.

Vorik (launchpad-gerapeldoorn) wrote :

Same issues here. Really annoying.

Surfraz Ahmed (surfraz) wrote :

If it helps, when I was playing around with this, I found that switching to nfs-user-server on the server side stopped gnome from locking up. To do this you need to make sure exportfs does not contain any options that nfs-user-server does not support, then run 'aptitude install nfs-user-server ' and reboot. This is not a fix just a workaround, that solved the problem for me. All this makes me want to switch to samba/cifs.... if only I could get it to automount on client logon...

Vanessa Dannenberg (vanessadannenberg) wrote :

My two client machines, Swan and Rainbird, have since been updated to the full 2.6.19 release, while Stork still sits at 2.6.17. The problem still exists; however, there is something notable:

If I also update Stork (server) to 2.6.19, something new happens - the hangs seem to go away but the overall transfer rate is maybe 1.3MB/sec, no matter which way the copy goes or which machine does it (thus invalidating my previous "push" and "pull" tests).

Someone PLEASE FIX THIS BUG!

Paul Natsuo Kishimoto (khaeru) wrote :

I'm also experiencing this bug, between a server running edgy ubuntu-server with nfs-kernel-server installed, and a desktop running edgy ubuntu-desktop with nfs-common installed.

Whether I transfer the files using Nautilus or a terminal, transfer TO the NFS share hangs my desktop (completely but temporarily; Ctrl-Alt-Backspace will not work, but the desktop is entirely usable when the transfer finishes); transfer FROM the share proceeds at nearly the same speed, but doesn't result in any noticeable slowdown of GNOME.

I've attached the result of some tests I found in the official NFS HOW-TO: http://nfs.sourceforge.net/nfs-howto/ar01s05.html. Again, the former freezes GNOME; the latter does not. I can't find any useful information in dmesg. The speeds are comparable to using SCP.

Vanessa Dannenberg (vanessadannenberg) wrote :

I've updated my server and one of my client machines to 2.6.20-rc5 just as a test, and there are some changes here. First of all, most of the hangs are gone, but not entirely. Don't be surprised to see the entire machine suddenly stop responding for a while; switching to a text console and back to X may cause the screen to go black for a while (in my case, about two minutes).

Here's the beginning of a "push" transfer (rainbird copying a file to NFS), where a nice large spike in network activity can be seen:

11:42:26.603523 IP (tos 0x0, ttl 64, id 1492, offset 0, flags [DF], proto: TCP (6), length: 192) 10.1.1.3.921034261 > 10.1.1.1.2049: 140 getattr [|nfs]
11:42:26.607668 IP (tos 0x0, ttl 64, id 23326, offset 0, flags [DF], proto: TCP (6), length: 168) 10.1.1.1.2049 > 10.1.1.3.921034261: reply ok 116 getattr [|nfs]
11:42:26.607694 IP (tos 0x0, ttl 64, id 1493, offset 0, flags [DF], proto: TCP (6), length: 52) 10.1.1.3.705 > 10.1.1.1.2049: ., cksum 0xd74c (correct), ack 12804 win 11470 <nop,nop,timestamp 9139730 757630>
11:42:26.608222 IP (tos 0x0, ttl 64, id 1494, offset 0, flags [DF], proto: TCP (6), length: 196) 10.1.1.3.937811477 > 10.1.1.1.2049: 144 access [|nfs]
11:42:26.613817 IP (tos 0x0, ttl 64, id 23327, offset 0, flags [DF], proto: TCP (6), length: 176) 10.1.1.1.2049 > 10.1.1.3.937811477: reply ok 124 access [|nfs]
11:42:26.614237 IP (tos 0x0, ttl 64, id 1495, offset 0, flags [DF], proto: TCP (6), length: 208) 10.1.1.3.954588693 > 10.1.1.1.2049: 156 getattr [|nfs]
11:42:26.615780 IP (tos 0x0, ttl 64, id 23328, offset 0, flags [DF], proto: TCP (6), length: 168) 10.1.1.1.2049 > 10.1.1.3.954588693: reply ok 116 getattr [|nfs]
11:42:26.616106 IP (tos 0x0, ttl 64, id 1496, offset 0, flags [DF], proto: TCP (6), length: 208) 10.1.1.3.971365909 > 10.1.1.1.2049: 156 getattr [|nfs]
11:42:26.619674 IP (tos 0x0, ttl 64, id 23329, offset 0, flags [DF], proto: TCP (6), length: 168) 10.1.1.1.2049 > 10.1.1.3.971365909: reply ok 116 getattr [|nfs]
11:42:26.619963 IP (tos 0x0, ttl 64, id 1497, offset 0, flags [DF], proto: TCP (6), length: 212) 10.1.1.3.988143125 > 10.1.1.1.2049: 160 access [|nfs]
11:42:26.623855 IP (tos 0x0, ttl 64, id 23330, offset 0, flags [DF], proto: TCP (6), length: 176) 10.1.1.1.2049 > 10.1.1.3.988143125: reply ok 124 access [|nfs]
11:42:26.624145 IP (tos 0x0, ttl 64, id 1498, offset 0, flags [DF], proto: TCP (6), length: 244) 10.1.1.3.1004920341 > 10.1.1.1.2049: 192 setattr [|nfs]
11:42:26.670359 IP (tos 0x0, ttl 64, id 23331, offset 0, flags [DF], proto: TCP (6), length: 52) 10.1.1.1.2049 > 10.1.1.3.705: ., cksum 0xc069 (correct), ack 892761 win 16022 <nop,nop,timestamp 757645 9139734>
11:42:26.809860 IP (tos 0x0, ttl 64, id 23332, offset 0, flags [DF], proto: TCP (6), length: 200) 10.1.1.1.2049 > 10.1.1.3.1004920341: reply ok 148 setattr [|nfs]
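
For anyone wanting to summarize a capture like the one above rather than read it packet by packet, here is a minimal Python sketch that counts NFS calls and replies per operation. It assumes each packet sits on a single line in the format shown in this trace; the regular expression and the function name `count_nfs_ops` are illustrative assumptions, and other tcpdump versions or options may print a different layout.

```python
import re
from collections import Counter

# Matches lines ending in "<len> <op> [|nfs]", optionally preceded by
# "reply ok", as seen in the trace in this report. Plain TCP ACK lines
# do not end in "[|nfs]" and are therefore skipped.
NFS_LINE = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d+) .*: "
    r"(?P<reply>reply ok )?\d+ (?P<op>\w+) \[\|nfs\]$"
)


def count_nfs_ops(lines):
    """Return (calls, replies) Counters keyed by NFS operation name."""
    calls, replies = Counter(), Counter()
    for line in lines:
        m = NFS_LINE.match(line.strip())
        if m is None:
            continue
        target = replies if m.group("reply") else calls
        target[m.group("op")] += 1
    return calls, replies


# Three lines taken from the capture above: a call, its reply, and an ACK.
sample = [
    "11:42:26.603523 IP (tos 0x0, ttl 64, id 1492, offset 0, flags [DF], proto: TCP (6), length: 192) 10.1.1.3.921034261 > 10.1.1.1.2049: 140 getattr [|nfs]",
    "11:42:26.607668 IP (tos 0x0, ttl 64, id 23326, offset 0, flags [DF], proto: TCP (6), length: 168) 10.1.1.1.2049 > 10.1.1.3.921034261: reply ok 116 getattr [|nfs]",
    "11:42:26.607694 IP (tos 0x0, ttl 64, id 1493, offset 0, flags [DF], proto: TCP (6), length: 52) 10.1.1.3.705 > 10.1.1.1.2049: ., cksum 0xd74c (correct), ack 12804 win 11470 <nop,nop,timestamp 9139730 757630>",
]
calls, replies = count_nfs_ops(sample)
print(dict(calls), dict(replies))
```

A call count that runs well ahead of the reply count for the same operation would point at the server or network falling behind, which is the kind of pattern worth looking for during a hang.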

After a couple of minutes, here's something odd that comes up - normally during these tests I was getting occasional large spikes to 10+ MB/sec among an otherwise constant ~700 kB/sec stream, but this time around I got a fairly constant 2.5MB/sec or so:

11:43:33.378615 IP (tos 0x0, ttl 64, id 45505, offset 0, flags [DF], proto: TCP (6), length:...


DavidM (dmccullo) wrote :

Confirming the bug. As a workaround, I installed Feisty kernel:

2.6.20-6-generic #2 SMP Wed Jan 31 20:53:39 UTC 2007 i686 GNU/Linux

With the caveat that I haven't done much testing, this kernel seems to fix the problem. The machine is a MythTV backend/frontend, so had to rebuild nvidia modules; ivtv still works; can't get lirc rebuilt...

BTW, I encountered the bug because of an NFS-mounted share that stored a music collection. When ripping CDs from an NFS client, the mount would hang after copying approximately 4GB. The NFS server would then lose its network connection.

Many thanks to all who contributed to this bug report. This was/is very frustrating and I could not have determined the problem without you folks.

sammiam (sammh) wrote :

I'm experiencing the same thing, using the Feisty Fawn Live CD. I NFS-mount from my backup machine and start doing a "copy -dpR" over to my main system. Sometimes it works just fine, other times it freezes the machine. No symptoms or messages in the log. I'm using the kernel: Linux version 2.6.20-8-generic (root@vernadsky) (gcc version 4.1.2 20070129 (prerelease) (Ubuntu 4.1.1-31ubuntu2)) #2 SMP Tue Feb 13 05:18:42 UTC 2007

sammiam (sammh) wrote :

As a follow-up, it looks like my problem is solved... I was using a DFE-530TX+ adapter, which uses the 8139 driver. I switched, and am now using the internal Intel ethernet adapter that comes on my motherboard, and all looks well. What tipped me off was that I retried the operations I was doing, but instead of using NFS, I sftp'd from my source machine over to my machine. The machine froze in the middle of a huge file. In searching for 8139, I found that the machine locking up is a known problem.
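
Since the culprit here turned out to be the network driver rather than NFS, it can help to check which kernel driver each interface is bound to. The following Python sketch reads that from sysfs. It assumes the kernel exposes a `/sys/class/net/<iface>/device/driver` symlink, which may not hold on every kernel version; the function names `nic_driver` and `all_nic_drivers` are made up for illustration.

```python
import os


def nic_driver(iface, sysfs_root="/sys/class/net"):
    """Return the kernel driver name bound to iface, or None if unknown."""
    link = os.path.join(sysfs_root, iface, "device", "driver")
    if not os.path.exists(link):
        # Virtual interfaces such as "lo" have no backing device/driver.
        return None
    # The symlink points at the driver's directory; its basename is the name.
    return os.path.basename(os.path.realpath(link))


def all_nic_drivers(sysfs_root="/sys/class/net"):
    """Map every interface under sysfs_root to its driver name (or None)."""
    if not os.path.isdir(sysfs_root):
        return {}
    return {
        name: nic_driver(name, sysfs_root)
        for name in sorted(os.listdir(sysfs_root))
    }


if __name__ == "__main__":
    for iface, driver in all_nic_drivers().items():
        print(f"{iface}: {driver or 'no driver (virtual interface?)'}")
```

If the interface carrying the NFS traffic reports an 8139-family driver, repeating the copy over a different adapter, as described above, is a quick way to tell a driver problem from an NFS one.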

Brian Murray (brian-murray) wrote :

I am assigning this bug to the 'ubuntu-kernel-team' per their bug policy. For future reference you can learn more about their bug policy at https://wiki.ubuntu.com/KernelTeamBugPolicies .

Changed in linux-source-2.6.17:
assignee: nobody → ubuntu-kernel-team
milestone: edgy-updates → none
Nick Fishman (bsdlogical) wrote :

I encountered the exact same problem, and just like sammiam wrote, it was the DFE-530TX+ adapter that caused the problem. When I switched to using an onboard network interface on a server, NFS worked like a charm.

I was using Gutsy with the 2.6.22-14-generic kernel, by the way, so the problem with the 8139 driver is very much still alive.

John Nilsson (john-milsson) wrote :

I'm also experiencing this problem.

Server: A "Popcorn Hour A-100" using firmware 01-15-080123-14-POP-402 with apps 00-15-080116-14-POP-402

Client:
Ubuntu Gutsy Gibbon
With following mount option
192.168.0.11:/share /net/pha100 nfs rw,rsize=4096,wsize=4096,hard,intr,user,noauto 0 0
(changed from negotiated 32768 to 4096 to see if that would improve multitasking somewhat)
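
To make the mount entry above easier to compare against other setups, here is a small Python sketch that splits an fstab-style line into its fields and turns the option string into a dictionary. The function name `parse_fstab_entry` is an illustrative assumption; the input line is the one quoted in this comment.

```python
def parse_fstab_entry(line):
    """Split one fstab-style line into device, mount point, type and options."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError("expected at least device, mount point, type and options")
    device, mount_point, fs_type, raw_opts = fields[:4]
    options = {}
    for opt in raw_opts.split(","):
        # "rsize=4096" becomes {"rsize": "4096"}; a bare flag such as
        # "hard" becomes {"hard": True}.
        key, sep, value = opt.partition("=")
        options[key] = value if sep else True
    return {
        "device": device,
        "mount_point": mount_point,
        "type": fs_type,
        "options": options,
    }


entry = parse_fstab_entry(
    "192.168.0.11:/share /net/pha100 nfs "
    "rw,rsize=4096,wsize=4096,hard,intr,user,noauto 0 0"
)
print(entry["options"]["rsize"], entry["options"]["wsize"])
```

This only shows what was requested in the entry; the sizes the client and server actually settle on can differ, as the note about the negotiated 32768 value suggests.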

Client Nic:
04:06.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] (rev 74)
with 3c59x driver from 2.6.22-14-generic

(My system is configured to use this NIC for both public IP (eth0) and local IP (eth0:1), such that all traffic from LAN to Internet is both coming in and going out the same NIC, with this machine acting as firewall/NAT.)

Leann Ogasawara (leannogasawara) wrote :

The Hardy Heron Alpha series was recently released which contains an updated version of the kernel. You can download and try the new Hardy Heron Alpha release from http://cdimage.ubuntu.com/releases/hardy/ . You should be able to then test the new kernel via the LiveCD. If you can, please verify if this bug still exists or not and report back your results. General information regarding the release can also be found here: http://www.ubuntu.com/testing/ .

Also note we'll keep this report open against the actively developed kernel but against 2.6.17 this will be closed. Thanks.

Changed in linux:
status: New → Incomplete
Changed in linux-source-2.6.17:
status: Confirmed → Won't Fix
John Nilsson (john-milsson) wrote :

I am now running Hardy Heron. Since my last post I've also bought a router and thus no longer have my Ubuntu box acting as router/firewall; it's purely a client with one internal IP now.

The symptom is no longer that the entire system gets unresponsive; now it's only Nautilus that stops responding. It doesn't redraw the desktop icons and it's not possible to open new Nautilus windows.

Leann Ogasawara (leannogasawara) wrote :

The Ubuntu Kernel Team is planning to move to the 2.6.27 kernel for the upcoming Intrepid Ibex 8.10 release. As a result, the kernel team would appreciate it if you could please test this newer 2.6.27 Ubuntu kernel. There are two ways you should be able to test:

1) If you are comfortable installing packages on your own, the linux-image-2.6.27-* package is currently available for you to install and test.

--or--

2) The upcoming Alpha5 for Intrepid Ibex 8.10 will contain this newer 2.6.27 Ubuntu kernel. Alpha5 is set to be released Thursday Sept 4. Please watch http://www.ubuntu.com/testing for Alpha5 to be announced. You should then be able to test via a LiveCD.

Please let us know immediately if this newer 2.6.27 kernel resolves the bug reported here or if the issue remains. More importantly, please open a new bug report for each new bug/regression introduced by the 2.6.27 kernel and tag the bug report with 'linux-2.6.27'. Also, please specifically note if the issue does or does not appear in the 2.6.26 kernel. Thanks again, we really appreciate your help and feedback.

Vanessa Dannenberg (vanessadannenberg) wrote :

This seems to work fine for me under Hardy with the 2.6.24-19 kernel - no hangs of any kind.

Markus Korn (thekorn) wrote :

Marking as 'Fix Released' based on the last comment.
If this is still an issue with this most recent release please feel free to reopen this report. To reopen the bug report you can click on the current status, under the Status column, and change the Status back to "New".

Thanks,
Markus

Changed in linux:
status: Incomplete → Fix Released
Richp (rich-parkin-home) wrote :

I have just done a default install of Ibex and have had the same issue. When copying a 4.1 GB file to an NFS drive the desktop froze. I couldn't open Nautilus, but apps like System Monitor worked fine. I have to wait until the copy completes and then the desktop is fine. Here is a section of my logs while the copy was taking place:

Oct 31 11:54:35 rich-desktop kernel: [14220.292919] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x2)
Oct 31 11:54:35 rich-desktop kernel: [14220.292952] ata1: soft resetting link
Oct 31 11:54:35 rich-desktop kernel: [14220.502299] ata1.00: configured for UDMA/100
Oct 31 11:54:35 rich-desktop kernel: [14220.502320] ata1: EH complete
Oct 31 11:54:35 rich-desktop kernel: [14220.516137] sd 0:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)
Oct 31 11:54:35 rich-desktop kernel: [14220.516155] sd 0:0:0:0: [sda] Write Protect is off
Oct 31 11:54:35 rich-desktop kernel: [14220.516184] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Regards
Richard

Launchpad Janitor (janitor) wrote : Kernel team bugs

Per a decision made by the Ubuntu Kernel Team, bugs will no longer be assigned to the ubuntu-kernel-team in Launchpad as part of the bug triage process. The ubuntu-kernel-team is being unassigned from this bug report. Refer to https://wiki.ubuntu.com/KernelTeamBugPolicies for more information. Thanks.

bananenkasper (bananenkasper) wrote :

Ages later, still the same problem.

DISTRIB_ID=LinuxMint
DISTRIB_RELEASE=17.2
DISTRIB_CODENAME=rafaela
DISTRIB_DESCRIPTION="Linux Mint 17.2 Rafaela"

Linux 3.16.0-38-generic #52~14.04.1-Ubuntu SMP Fri May 8 09:43:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
