Comment 13 for bug 71212

Vanessa Dannenberg (vanessadannenberg) wrote :

I've updated my server and one of my client machines to 2.6.20-rc5 just as a test, and there are some changes here. First of all, most of the hangs are gone, but not entirely: the whole machine will still occasionally stop responding for a while, and switching to a text console and back to X may leave the screen black for a while (in my case, about two minutes).

Here's the beginning of a "push" transfer (rainbird copying a file to NFS), where a nice large spike in network activity can be seen:

11:42:26.603523 IP (tos 0x0, ttl 64, id 1492, offset 0, flags [DF], proto: TCP (6), length: 192) 10.1.1.3.921034261 > 10.1.1.1.2049: 140 getattr [|nfs]
11:42:26.607668 IP (tos 0x0, ttl 64, id 23326, offset 0, flags [DF], proto: TCP (6), length: 168) 10.1.1.1.2049 > 10.1.1.3.921034261: reply ok 116 getattr [|nfs]
11:42:26.607694 IP (tos 0x0, ttl 64, id 1493, offset 0, flags [DF], proto: TCP (6), length: 52) 10.1.1.3.705 > 10.1.1.1.2049: ., cksum 0xd74c (correct), ack 12804 win 11470 <nop,nop,timestamp 9139730 757630>
11:42:26.608222 IP (tos 0x0, ttl 64, id 1494, offset 0, flags [DF], proto: TCP (6), length: 196) 10.1.1.3.937811477 > 10.1.1.1.2049: 144 access [|nfs]
11:42:26.613817 IP (tos 0x0, ttl 64, id 23327, offset 0, flags [DF], proto: TCP (6), length: 176) 10.1.1.1.2049 > 10.1.1.3.937811477: reply ok 124 access [|nfs]
11:42:26.614237 IP (tos 0x0, ttl 64, id 1495, offset 0, flags [DF], proto: TCP (6), length: 208) 10.1.1.3.954588693 > 10.1.1.1.2049: 156 getattr [|nfs]
11:42:26.615780 IP (tos 0x0, ttl 64, id 23328, offset 0, flags [DF], proto: TCP (6), length: 168) 10.1.1.1.2049 > 10.1.1.3.954588693: reply ok 116 getattr [|nfs]
11:42:26.616106 IP (tos 0x0, ttl 64, id 1496, offset 0, flags [DF], proto: TCP (6), length: 208) 10.1.1.3.971365909 > 10.1.1.1.2049: 156 getattr [|nfs]
11:42:26.619674 IP (tos 0x0, ttl 64, id 23329, offset 0, flags [DF], proto: TCP (6), length: 168) 10.1.1.1.2049 > 10.1.1.3.971365909: reply ok 116 getattr [|nfs]
11:42:26.619963 IP (tos 0x0, ttl 64, id 1497, offset 0, flags [DF], proto: TCP (6), length: 212) 10.1.1.3.988143125 > 10.1.1.1.2049: 160 access [|nfs]
11:42:26.623855 IP (tos 0x0, ttl 64, id 23330, offset 0, flags [DF], proto: TCP (6), length: 176) 10.1.1.1.2049 > 10.1.1.3.988143125: reply ok 124 access [|nfs]
11:42:26.624145 IP (tos 0x0, ttl 64, id 1498, offset 0, flags [DF], proto: TCP (6), length: 244) 10.1.1.3.1004920341 > 10.1.1.1.2049: 192 setattr [|nfs]
11:42:26.670359 IP (tos 0x0, ttl 64, id 23331, offset 0, flags [DF], proto: TCP (6), length: 52) 10.1.1.1.2049 > 10.1.1.3.705: ., cksum 0xc069 (correct), ack 892761 win 16022 <nop,nop,timestamp 757645 9139734>
11:42:26.809860 IP (tos 0x0, ttl 64, id 23332, offset 0, flags [DF], proto: TCP (6), length: 200) 10.1.1.1.2049 > 10.1.1.3.1004920341: reply ok 148 setattr [|nfs]
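(For anyone wanting to reproduce a capture like this: the trace came from tcpdump's text output; my exact invocation isn't shown above, so the interface name and filter in the comment below are guesses. Given that text, the mix of NFS operations is easy to tally; a minimal sketch, using a few lines from the trace as sample input:)

```shell
#!/bin/sh
# A capture in this format comes from something like (interface/filter assumed):
#   tcpdump -n -v -i eth0 'host 10.1.1.1 and port 2049'
# Save a few lines of the trace above as sample input:
cat <<'EOF' > /tmp/nfs_trace.txt
11:42:26.603523 IP (tos 0x0, ttl 64, id 1492, offset 0, flags [DF], proto: TCP (6), length: 192) 10.1.1.3.921034261 > 10.1.1.1.2049: 140 getattr [|nfs]
11:42:26.607668 IP (tos 0x0, ttl 64, id 23326, offset 0, flags [DF], proto: TCP (6), length: 168) 10.1.1.1.2049 > 10.1.1.3.921034261: reply ok 116 getattr [|nfs]
11:42:26.608222 IP (tos 0x0, ttl 64, id 1494, offset 0, flags [DF], proto: TCP (6), length: 196) 10.1.1.3.937811477 > 10.1.1.1.2049: 144 access [|nfs]
EOF
# Count each NFS operation name (calls and replies both match):
grep -oE '(getattr|access|setattr|write|commit|null)' /tmp/nfs_trace.txt | sort | uniq -c
```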

After a couple of minutes, something odd comes up. Normally during these tests I was getting occasional large spikes to 10+ MB/sec amid an otherwise constant ~700 kB/sec stream, but this time around I got a fairly constant 2.5 MB/sec or so:

11:43:33.378615 IP (tos 0x0, ttl 64, id 45505, offset 0, flags [DF], proto: TCP (6), length: 1500) 10.1.1.3.0 > 10.1.1.1.2049: 1448 null
11:43:33.378621 IP (tos 0x0, ttl 64, id 45506, offset 0, flags [DF], proto: TCP (6), length: 1500) 10.1.1.3.0 > 10.1.1.1.2049: 1448 null
11:43:33.378631 IP (tos 0x0, ttl 64, id 20927, offset 0, flags [DF], proto: TCP (6), length: 52) 10.1.1.1.2049 > 10.1.1.3.705: ., cksum 0x2e4a (correct), ack 158604625 win 16022 <nop,nop,timestamp 774315 9156416>
11:43:33.378634 IP (tos 0x0, ttl 64, id 45507, offset 0, flags [DF], proto: TCP (6), length: 1500) 10.1.1.3.0 > 10.1.1.1.2049: 1448 null
11:43:33.378640 IP (tos 0x0, ttl 64, id 45508, offset 0, flags [DF], proto: TCP (6), length: 1500) 10.1.1.3.0 > 10.1.1.1.2049: 1448 null
11:43:33.378648 IP (tos 0x0, ttl 64, id 20928, offset 0, flags [DF], proto: TCP (6), length: 52) 10.1.1.1.2049 > 10.1.1.3.705: ., cksum 0x22fa (correct), ack 158607521 win 16022 <nop,nop,timestamp 774315 9156416>
11:43:33.378657 IP (tos 0x0, ttl 64, id 20929, offset 0, flags [DF], proto: TCP (6), length: 52) 10.1.1.1.2049 > 10.1.1.3.705: ., cksum 0x17aa (correct), ack 158610417 win 16022 <nop,nop,timestamp 774315 9156416>
11:43:33.378755 IP (tos 0x0, ttl 64, id 45509, offset 0, flags [DF], proto: TCP (6), length: 1500) 10.1.1.3.0 > 10.1.1.1.2049: 1448 null

(what's with all the NULL responses?)

The tcpdump output stayed mostly like that for the rest of this test. Here's how the copy ended (it took a good couple of minutes for my attempt to cancel it to actually take effect):

11:45:14.015252 IP (tos 0x0, ttl 64, id 19770, offset 0, flags [DF], proto: TCP (6), length: 52) 10.1.1.3.705 > 10.1.1.1.2049: ., cksum 0xe68f (correct), ack 869296 win 12918 <nop,nop,timestamp 9181580 799472>
11:45:14.019288 IP (tos 0x0, ttl 64, id 54539, offset 0, flags [DF], proto: TCP (6), length: 184) 10.1.1.1.2049 > 10.1.1.3.2219564821: reply ok 132 commit [|nfs]
11:45:14.019309 IP (tos 0x0, ttl 64, id 54540, offset 0, flags [DF], proto: TCP (6), length: 184) 10.1.1.1.2049 > 10.1.1.3.2236342037: reply ok 132 commit [|nfs]
11:45:14.019316 IP (tos 0x0, ttl 64, id 19771, offset 0, flags [DF], proto: TCP (6), length: 52) 10.1.1.3.705 > 10.1.1.1.2049: ., cksum 0xe585 (correct), ack 869560 win 12918 <nop,nop,timestamp 9181581 799473>
11:45:14.099364 IP (tos 0x0, ttl 64, id 54541, offset 0, flags [DF], proto: TCP (6), length: 192) 10.1.1.1.2049 > 10.1.1.3.776724245: reply ok 140
11:45:14.099465 IP (tos 0x0, ttl 64, id 54542, offset 0, flags [DF], proto: TCP (6), length: 192) 10.1.1.1.2049 > 10.1.1.3.2102124309: reply ok 140 write [|nfs]
11:45:14.099474 IP (tos 0x0, ttl 64, id 19772, offset 0, flags [DF], proto: TCP (6), length: 52) 10.1.1.3.705 > 10.1.1.1.2049: ., cksum 0xe442 (correct), ack 869840 win 12918 <nop,nop,timestamp 9181601 799496>
11:45:14.100076 IP (tos 0x0, ttl 64, id 19773, offset 0, flags [DF], proto: TCP (6), length: 220) 10.1.1.3.2253119253 > 10.1.1.1.2049: 168 commit [|nfs]
11:45:14.103363 IP (tos 0x0, ttl 64, id 54543, offset 0, flags [DF], proto: TCP (6), length: 184) 10.1.1.1.2049 > 10.1.1.3.2253119253: reply ok 132 commit [|nfs]
11:45:14.104357 IP (tos 0x0, ttl 64, id 19774, offset 0, flags [DF], proto: TCP (6), length: 208) 10.1.1.3.2269896469 > 10.1.1.1.2049: 156 getattr [|nfs]
11:45:14.107253 IP (tos 0x0, ttl 64, id 54544, offset 0, flags [DF], proto: TCP (6), length: 168) 10.1.1.1.2049 > 10.1.1.3.2269896469: reply ok 116 getattr [|nfs]
11:45:14.149906 IP (tos 0x0, ttl 64, id 19775, offset 0, flags [DF], proto: TCP (6), length: 52) 10.1.1.3.705 > 10.1.1.1.2049: ., cksum 0xe1f8 (correct), ack 870088 win 12918 <nop,nop,timestamp 9181613 799498>

A "pull" request on my server (pulling the same file from NFS to a local disk) showed a continuous stream of messages like this (tcpdump running on the client, as before):

11:48:48.310652 IP (tos 0x0, ttl 64, id 18314, offset 0, flags [DF], proto: TCP (6), length: 1500) 10.1.1.3.2049 > 10.1.1.1.0: reply ok 1448
11:48:48.310774 IP (tos 0x0, ttl 64, id 18315, offset 0, flags [DF], proto: TCP (6), length: 1500) 10.1.1.3.2049 > 10.1.1.1.0: reply ok 1448
11:48:48.310897 IP (tos 0x0, ttl 64, id 18316, offset 0, flags [DF], proto: TCP (6), length: 1500) 10.1.1.3.2049 > 10.1.1.1.0: reply ok 1448
11:48:48.311020 IP (tos 0x0, ttl 64, id 18317, offset 0, flags [DF], proto: TCP (6), length: 1500) 10.1.1.3.2049 > 10.1.1.1.0: reply ok 1448
11:48:48.314499 IP (tos 0x0, ttl 64, id 60418, offset 0, flags [DF], proto: TCP (6), length: 52) 10.1.1.1.705 > 10.1.1.3.2049: ., cksum 0x1d76 (correct), ack 30907177 win 24576 <nop,nop,timestamp 853033 9235142>
11:48:48.314507 IP (tos 0x0, ttl 64, id 18318, offset 0, flags [DF], proto: TCP (6), length: 1500) 10.1.1.3.2049 > 10.1.1.1.0: reply ok 1448
11:48:48.314512 IP (tos 0x0, ttl 64, id 18319, offset 0, flags [DF], proto: TCP (6), length: 1500) 10.1.1.3.2049 > 10.1.1.1.0: reply ok 1448
11:48:48.314525 IP (tos 0x0, ttl 64, id 60419, offset 0, flags [DF], proto: TCP (6), length: 52) 10.1.1.1.705 > 10.1.1.3.2049: ., cksum 0x1226 (correct), ack 30910073 win 24576 <nop,nop,timestamp 853033 9235142>
11:48:48.314529 IP (tos 0x0, ttl 64, id 18320, offset 0, flags [DF], proto: TCP (6), length: 1500) 10.1.1.3.2049 > 10.1.1.1.0: reply ok 1448
11:48:48.314535 IP (tos 0x0, ttl 64, id 18321, offset 0, flags [DF], proto: TCP (6), length: 1500) 10.1.1.3.2049 > 10.1.1.1.0: reply ok 1448
11:48:48.314544 IP (tos 0x0, ttl 64, id 60420, offset 0, flags [DF], proto: TCP (6), length: 52) 10.1.1.1.705 > 10.1.1.3.2049: ., cksum 0x06d4 (correct), ack 30912969 win 24576 <nop,nop,timestamp 853034 9235143>
11:48:48.314552 IP (tos 0x0, ttl 64, id 60421, offset 0, flags [DF], proto: TCP (6), length: 52) 10.1.1.1.705 > 10.1.1.3.2049: ., cksum 0xfb83 (correct), ack 30915865 win 24576 <nop,nop,timestamp 853034 9235143>
11:48:48.314648 IP (tos 0x0, ttl 64, id 18322, offset 0, flags [DF], proto: TCP (6), length: 1500) 10.1.1.3.2049 > 10.1.1.1.0: reply ok 1448

I might point out that this time the transfer was moving along at a fairly slow pace, about 2.8 MB/sec, which is odd: earlier in my tests with this kernel I was able to max out my server's disk bandwidth (about 10 MB/sec) doing "pull" requests. Anyway, cancelling the copy worked immediately like it should, and only a few seconds went by before tcpdump stopped spewing (I guess it was flushing a cache?).
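(Incidentally, throughput figures like the 2.8 MB/sec above can be cross-checked straight from the tcpdump text by summing the IP "length:" fields over a time window and dividing by the window size. A minimal sketch, fed a few lines in the same format as the traces above:)

```shell
#!/bin/sh
# Sum the IP "length:" fields from tcpdump text output to count bytes on the
# wire; dividing by the elapsed time between first and last timestamp gives
# an approximate throughput. Sample input reuses lines from the trace above:
cat <<'EOF' > /tmp/pull_trace.txt
11:48:48.310652 IP (tos 0x0, ttl 64, id 18314, offset 0, flags [DF], proto: TCP (6), length: 1500) 10.1.1.3.2049 > 10.1.1.1.0: reply ok 1448
11:48:48.310774 IP (tos 0x0, ttl 64, id 18315, offset 0, flags [DF], proto: TCP (6), length: 1500) 10.1.1.3.2049 > 10.1.1.1.0: reply ok 1448
11:48:48.314499 IP (tos 0x0, ttl 64, id 60418, offset 0, flags [DF], proto: TCP (6), length: 52) 10.1.1.1.705 > 10.1.1.3.2049: ., cksum 0x1d76 (correct), ack 30907177 win 24576 <nop,nop,timestamp 853033 9235142>
EOF
# Split each line on "length: "; the number right after it is the IP datagram size.
awk -F'length: ' '{ sum += $2 + 0 } END { print sum, "bytes" }' /tmp/pull_trace.txt
```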