bzr branch from launchpad hangs after one hour

Asked by Pat Tressel

I've been trying to branch web2py from launchpad -- I've tried with a previous bzr version and now with 2.1.0. It starts off fine, but at the one hour mark, it stops doing anything. Here's where it had gotten to when it stopped:

$ bzr branch -Dhpss -Dhpssdetail lp:~mdipierro/web2py/devel web2py
- 106420KB 29KB/s | Fetching revisions:Inserting stream

(The rate is steady at about 30KB/s up to the point where it stops.)

The bzr branch command waits for some while after activity stops, then exits with this message:

FATAL ERROR: Server unexpectedly closed network connection
bzr: ERROR: Connection closed: Unexpected end of message. Please check connectivity and permissions, and report a bug if problems persist.
HPSS calls: 7 (0 vfs) SmartSSHClientMedium(bzr+ssh://<email address hidden>/)

I have no trouble branching sahanapy, where the branch completes in under an hour.

I did bzr lp-login, so this isn't the same as question #84616. Can't tell if question #81139 is related -- I'm not using the same ssh, and the server in my case is launchpad.

I'm wondering if either bzr has a one hour timeout (I'm expecting it doesn't) or if launchpad has a process killer that nukes some types of processes after an hour (e.g. those servicing ports), on the expectation that if they hang around for an hour, they must be dead (seems more likely than a timeout in bzr).

If it is launchpad that's doing it, and if the launchpad admins don't want to up the timeout, does bzr have a checkpoint and restart mechanism for branches / checkouts / other potentially long commands?

Client is running on WinXP, in a cygwin bash terminal, with putty (pageant) ssh. Version of bzr is 2.1.0. Network connection is DSL, not fast but seems quite reliable -- I haven't seen it drop the connection (except once when my phone went dead as well -- the external phone line was out).

Below is the log, with the long middle section of "NNN byte part read" lines omitted. (In case it's relevant, the values of NNN stayed at around 400-600 for a while, then started creeping up, then started having bursts with numbers reaching just over 2000000.)

Thanks!

-- Pat

Wed 2010-03-10 14:05:20 -0800
0.140 bazaar version: 2.1.0
0.140 bzr arguments: [u'branch', u'-Dhpss', u'-Dhpssdetail', u'lp:~mdipierro/web2py/devel', u'web2py']
0.156 looking for plugins in C:/Documents and Settings/Patricia Tressel/Application Data/bazaar/2.0/plugins
0.156 looking for plugins in C:/Program Files/Bazaar/plugins
0.343 encoding stdout as sys.stdout encoding 'cp437'
2.671 hpss: Built a new medium: SmartSSHClientMedium
2.765 bzr-svn: using Subversion 1.6.6 ()
2.859 hpss call: 'BzrDir.open_2.1', '~mdipierro/web2py/devel/'
2.859 (to bzr+ssh://bazaar.launchpad.net/~mdipierro/web2py/devel/)
8.437 result: ('yes', 'no')
8.437 hpss call: 'BzrDir.open_branchV3', '~mdipierro/web2py/devel/'
8.437 (to bzr+ssh://bazaar.launchpad.net/~mdipierro/web2py/devel/)
8.656 result: ('branch', 'Bazaar Branch Format 6 (bzr 0.15)\n')
8.671 hpss call: 'BzrDir.find_repositoryV3', '~mdipierro/web2py/devel/'
8.671 (to bzr+ssh://bazaar.launchpad.net/~mdipierro/web2py/devel/)
8.875 result: ('ok', '', 'no', 'no', 'no', 'Bazaar pack repository format 1 (needs bzr 0.92)\n')
8.890 hpss call: 'Branch.get_stacked_on_url', '~mdipierro/web2py/devel/'
8.890 (to bzr+ssh://bazaar.launchpad.net/~mdipierro/web2py/devel/)
9.343 result: ('UnstackableBranchFormat', 'Branch format 6', 'lp-66552656:///~mdipierro/web2py/devel')
9.343 hpss call: 'Branch.last_revision_info', '~mdipierro/web2py/devel/'
9.343 (to bzr+ssh://bazaar.launchpad.net/~mdipierro/web2py/devel/)
9.796 result: ('ok', '1738', '<email address hidden>')
9.812 hpss call: 'BzrDir.cloning_metadir', '~mdipierro/web2py/devel/', 'False'
9.812 (to bzr+ssh://bazaar.launchpad.net/~mdipierro/web2py/devel/)
10.312 result: ('Bazaar-NG meta directory, format 1\n', 'Bazaar pack repository format 1 (needs bzr 0.92)\n', ('branch', 'Bazaar Branch Format 6 (bzr 0.15)\n'))
10.375 creating repository in file:///C:/Documents%20and%20Settings/Patricia%20Tressel/My%20Documents/sahanapy/web2py/.bzr/.
10.453 Using fetch logic to copy between RemoteRepository(bzr+ssh://bazaar.launchpad.net/~mdipierro/web2py/devel/.bzr/)(<RemoteRepositoryFormat>) and KnitPackRepository('file:///C:/Documents%20and%20Settings/Patricia%20Tressel/My%20Documents/sahanapy/web2py/.bzr/repository/')(<RepositoryFormatKnitPack1>)
10.453 hpss call w/body: 'Repository.get_stream_1.19', '~mdipierro/web2py/devel/', 'Bazaar pack repository format 1 (needs bzr 0.92)\n' ('ancestry-of\nmdipierr'...)
10.453 86 bytes
12.156 result: ('ok',)
12.156 42 byte part read
12.156 54 byte part read
12.156 548 byte part read
12.171 574 byte part read
...
3516.000 431 byte part read
3516.000 417 byte part read
3526.343 322989 byte part read
3586.671 1862204 byte part read
3589.156 61915 byte part read
3589.156 434 byte part read
3589.171 424 byte part read
3603.171 391488 byte part read
7204.296 decoder state: buf[:10]='\x00\x1e\x8a\x01B20013', state_accept=_state_accept_expecting_bytes
7204.343 Traceback (most recent call last):
  File "bzrlib\commands.pyo", line 853, in exception_to_return_code
  File "bzrlib\commands.pyo", line 1055, in run_bzr
  File "bzrlib\commands.pyo", line 661, in run_argv_aliases
  File "bzrlib\commands.pyo", line 665, in run_direct
  File "bzrlib\cleanup.pyo", line 122, in run_simple
  File "bzrlib\cleanup.pyo", line 156, in _do_with_cleanups
  File "bzrlib\builtins.pyo", line 1232, in run
  File "bzrlib\bzrdir.pyo", line 1184, in sprout
  File "bzrlib\repository.pyo", line 1704, in fetch
  File "bzrlib\decorators.pyo", line 194, in write_locked
  File "bzrlib\repository.pyo", line 3432, in fetch
  File "bzrlib\fetch.pyo", line 83, in __init__
  File "bzrlib\fetch.pyo", line 109, in __fetch
  File "bzrlib\fetch.pyo", line 137, in _fetch_everything_for_search
  File "bzrlib\repository.pyo", line 4274, in insert_stream
  File "bzrlib\repository.pyo", line 4307, in _locked_insert_stream
  File "bzrlib\knit.pyo", line 1599, in insert_record_stream
  File "bzrlib\versionedfile.pyo", line 1665, in read
  File "bzrlib\smart\repository.pyo", line 578, in iter_substream_bytes
  File "bzrlib\smart\repository.pyo", line 568, in iter_stream_decoder
  File "bzrlib\smart\message.pyo", line 336, in read_streamed_body
  File "bzrlib\smart\message.pyo", line 286, in _read_more
ConnectionReset: Connection closed: Unexpected end of message. Please check connectivity and permissions, and report a bug if problems persist.

7204.343 Transferred: 106455KiB (14.8K/s r:106454K w:1K)
7204.343 return code 3
[ 4476] 2010-03-10 16:05:24.342 INFO: HPSS calls: 7 (0 vfs) SmartSSHClientMedium(bzr+ssh://<email address hidden>/)

Question information

Language:
English Edit question
Status:
Solved
For:
Bazaar Edit question
Assignee:
No assignee Edit question
Solved by:
Andrew Bennetts
Solved:
Last query:
Last reply:
Revision history for this message
Parth Malwankar (parthm) said :
#1

This looks like bug #526825 to me.

A quick check shows me that webpy repo is in "Bazaar pack repository format 1 (needs bzr 0.92)" which is a very old format. So there is a lot of conversion happening during the pull and there is probably a timeout happening in the process.

One option you could consider for now is asking the branch owner Massimo[1] to upgrade the repo to 2a format using the 'upgrade' link he would see on the page[1]. 2a format is much more efficient (space, speed, memory).

[1] https://launchpad.net/~mdipierro/web2py/devel lists the branch owner

Revision history for this message
Parth Malwankar (parthm) said :
#2

You might also try the approach suggested by John in this comment on the bug before the branch upgrade:
https://bugs.launchpad.net/bzr/+bug/526825/comments/10

Revision history for this message
Andrew Bennetts (spiv) said :
#3

There's no timeout like that in either bzr or Launchpad.

That's a bit mysterious. It appears that the socket has simply failed partway through reading the stream (perhaps due to a transient network outage?). I don't think this is the fault of either bzr or Launchpad. This output for instance looks correct:

> decoder state: buf[:10]='\x00\x1e\x8a\x01B20013', state_accept=_state_accept_expecting_bytes

So it's apparently died while expecting a 2001409 long byte part. I can run that branch command, and it works for me (at about 400kB/s). I even see the same size byte parts as at the point yours failed, here's a snippet from my log:

290.085 391488 byte part read
294.586 2001409 byte part read

(Although I interrupted it after 250MB transferred, and the total branch size appears to be about 500MB).

At no point was the connection idle for any significant length of time.

I'm not sure why the network connection was unreliable for you. I'd suggest trying to workaround the problem by fetching the branch incrementally, i.e.:

bzr branch lp:~mdipierro/web2py/devel web2py -r100
cd web2py
bzr pull -r 200
bzr pull -r 300
[etc]

(The current revno of that branch is 1738)

Also, i see that branch is in pack-0.92 format. It will probably be much smaller if upgraded to 2a (the default format of bzr 2.0), which may help if long-running transfers are failing for you, so perhaps ask the owner of the branch if they can upgrade it?

Revision history for this message
Andrew Bennetts (spiv) said :
#4

Parth: this is not bug 526825. The HPSS log shows that it is requesting the stream in pack-0.92 format, which is the same format that the remote branch is in, so there is no cross-format issue here.

Revision history for this message
Parth Malwankar (parthm) said :
#5

On Thu, Mar 11, 2010 at 7:17 AM, Andrew Bennetts
<email address hidden> wrote:
> Question #103938 on Bazaar changed:
> https://answers.edge.launchpad.net/bzr/+question/103938
>
> Andrew Bennetts posted a new comment:
> Parth: this is not bug 526825.  The HPSS log shows that it is requesting
> the stream in pack-0.92 format, which is the same format that the remote
> branch is in, so there is no cross-format issue here.
>

Ah. You are right of course. I somehow missed the parameter specifying
the format.
Thanks.

Revision history for this message
Pat Tressel (ptressel) said :
#6

Whoa, those were fast replies!

> That's a bit mysterious. It appears that the socket has simply
> failed partway through reading the stream (perhaps due to a
> transient network outage?). I don't think this is the fault of either
> bzr or Launchpad.
> ...
> I'm not sure why the network connection was unreliable for you.

I'm ~99% certain that there is no connectivity problem. I've done longer transfers with no failure, and there's no error in the system logs, or in the bzr log until some significant time *after* activity stops (that's the one that says "server dropped connection"). And it's surely curious that it stops at exactly one hour (within a minute -- I was watching an actual clock).

> I'd suggest trying to workaround the problem by fetching the
> branch incrementally, i.e.:
> bzr branch lp:~mdipierro/web2py/devel web2py -r100
> cd web2py
> bzr pull -r 200
> bzr pull -r 300
> [etc]
> (The current revno of that branch is 1738)

That should work -- I'll try it and report back. (I assume the the result of those pulls will be a usable branch equivalent to the original command..?) But it seems a bit awkward for regular use -- would still be nice to resolve it.

> Also, i see that branch is in pack-0.92 format. It will probably
> be much smaller if upgraded to 2a (the default format of bzr 2.0),
> which may help if long-running transfers are failing for you, so
> perhaps ask the owner of the branch if they can upgrade it?

I can give that a try too. But it would have to get a *lot* smaller -- at 30KB/s I can download about 100MB in an hour. (I haven't introduced myself to Massimo yet, but will need to anyway since I have a change we'd like to put in web2py. But before I can offer the change, I gotta getta branch...)

(If it *is* stopping at an hour, then even if a compressed web2py makes it in under an hour, I might just run into the same issue on a larger project.)

> I can run that branch command, and it works for me (at about 400kB/s).
> ...
> (Although I interrupted it after 250MB transferred, and the total
> branch size appears to be about 500MB).

So that was about 10 minutes? Hmmm -- you'd have to find something big enough to run for over an hour to reproduce the problem -- looks like web2py isn't big enough at your connection speed.

> At no point was the connection idle for any significant length of time.

Nor was mine until it stopped entirely -- I was keeping half an eye on the ascii animation and amount downloaded on the progress line.

> There's no timeout like that in either bzr or Launchpad.

It runs on Linux, right? I'd bet there's a way to have port daemons get killed after some time. (I've run into problems with idle timers that look only at certain kinds of activity and happen to neglect whatever the process is actually doing.)

Ok, let me try the repeated pulls. But I'm not saying "solved" since we haven't shown the one hour timeout *doesn't* occur on another client. ;-) If someone would be willing to try a branch on something big enough to continue past an hour, that would be really helpful...

Thanks!

-- Pat

Revision history for this message
Best Andrew Bennetts (spiv) said :
#7

Pat Tressel wrote:
[...]
> Pat Tressel is still having a problem:
> Whoa, those were fast replies!

:)

> > That's a bit mysterious. It appears that the socket has simply
> > failed
[...]
> > I'm not sure why the network connection was unreliable for you.
>
> I'm ~99% certain that there is no connectivity problem. I've done
> longer transfers with no failure, and there's no error in the system
> logs, or in the bzr log until some significant time *after* activity
> stops (that's the one that says "server dropped connection"). And it's
> surely curious that it stops at exactly one hour (within a minute -- I
> was watching an actual clock).

Exactly an hour is pretty suspicious! But there's no timeout like that
in bzr. I've just grepped the code for Launchpad and there is a 1 hour
timeout — but it's supposed to be 1 hour of idle time, not since the
connection was established.

That does seem like more than a coincidence, so I'll turn this into a
bug report against codehosting for the codehosting team to investigate.

> > I'd suggest trying to workaround the problem by fetching the
> > branch incrementally, i.e.:
> > bzr branch lp:~mdipierro/web2py/devel web2py -r100
> > cd web2py
> > bzr pull -r 200
> > bzr pull -r 300
> > [etc]
> > (The current revno of that branch is 1738)
>
> That should work -- I'll try it and report back. (I assume the the
> result of those pulls will be a usable branch equivalent to the original
> command..?) But it seems a bit awkward for regular use -- would still
> be nice to resolve it.

Yep, that will give you an identical branch.

It would definitely be good to resolve!

[...]
> Ok, let me try the repeated pulls. But I'm not saying "solved" since we
> haven't shown the one hour timeout *doesn't* occur on another client.

Thanks, reopening this was exactly the right action :)

-Andrew.

Revision history for this message
Pat Tressel (ptressel) said :
#8

Thanks, Andrew!

Revision history for this message
Pat Tressel (ptressel) said :
#9

Thanks Andrew Bennetts, that solved my question.