"bzr pull" downloads a few megabytes before noticing branches are diverged, normal?

Asked by GuilhemBichot

Reporting this on behalf of a colleague. He happened to be on a slow link (internet connection through mobile phone), something like a few kilobytes per second.
He had a branch of https://code.launchpad.net/~mysql/mysql-server/mysql-maria. He used the bzr+ssh protocol (he wasn't using the Launchpad branch itself but the branch which Launchpad mirrors, which supports bzr+ssh).
He committed a revision in his copy.
When he wanted to push, he did "bzr pull" to check if something had been pushed to https://code.launchpad.net/~mysql/mysql-server/mysql-maria meanwhile.
The command apparently downloaded 4MB of data (about an hour at his speed) before bailing out with the message that the branches had diverged.
So the questions are: is the divergence check implemented efficiently right now? Does it download the entire revision graph from the remote machine before deciding whether the branches have diverged? Would a quick pre-check make sense (for example: the local host sends its most recent revid, the remote checks whether that revid is in the remote branch; if not, they have diverged, if yes, more checks follow)?
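The pre-check suggested above could look roughly like this. This is a sketch only: the function and the `remote_has_revision` RPC are invented names for illustration, not part of the real bzrlib API.

```python
# Hypothetical sketch of the proposed pre-check. The function and the
# remote_has_revision RPC are invented for illustration; they are not
# part of the real bzrlib API.

def quick_divergence_precheck(local_tip, remote_has_revision):
    """Cheaply decide whether a full divergence check is even needed.

    remote_has_revision: callable(revid) -> bool, imagined as one
    small round trip to the server.
    """
    if remote_has_revision(local_tip):
        # The remote already has our tip: either we are in sync or the
        # remote is strictly ahead, so a plain pull is safe.
        return False
    # The remote does not know our tip: we have local commits it lacks,
    # so the branches *may* have diverged and more checks are needed.
    return True

# Toy example: the remote knows revisions A and B.
remote_revs = {"rev-A", "rev-B"}
print(quick_divergence_precheck("rev-C", remote_revs.__contains__))  # True
print(quick_divergence_precheck("rev-B", remote_revs.__contains__))  # False
```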

Question information

Language: English
Status: Solved
For: Bazaar
Assignee: No assignee
Solved by: GuilhemBichot
Revision history for this message
John A Meinel (jameinel) said :
#1

As a follow up question, now that the user knows there is divergence, what is his next step? Does he pull in the remaining changes (via 'bzr merge', etc)? Would he try to 'rebase' his changes by 'bzr pull --overwrite' and then recommitting the changes? (or bzr uncommit, bzr pull, bzr commit, etc.)

Right now we fetch the actual changes from the remote side before we check for divergence. We don't *have* to do it that way, but in almost all normal use cases we have encountered, the user wants those revisions after doing a pull: whether they need to merge, resolve conflicts and then commit, or take any of a number of other resolution steps, they still need the revisions after the pull has failed.

So the current logic does do a fetch before checking for divergence. It wouldn't have to, but it does simplify things, as you know you have all the revisions available. So a follow up "bzr merge REMOTE" should be much faster.

A different way of checking is to use 'bzr missing'. If it shows revisions for both 'mine' and 'theirs' then you know you have diverged. That, IIRC, doesn't fetch the changes.

So while we could make it error out faster with 'diverged', you would still need to spend the time downloading the changes when you come to do whatever you do to resolve the divergence. The only case that doesn't need the revisions is if your decision is to push the data to a different location. For example:

  bzr pull REMOTE
  # oops, diverged, I'll create a new public branch
  bzr push DIFFERENT_PATH

I believe this is an unlikely scenario, but certainly one we could support better if it was deemed important.

Michael Widenius (monty) said :
#2

A small clarification to the problem:

Yes, you would probably like to have the revisions for the changes if there were any. However, you don't want to pay for bandwidth or wait a long time if there are no changes on the other side.

My major problem, however, was that when I did bzr push, it also fetched several megabytes from the remote side before sending over the changes. It would be nice if bzr push could just ask the remote for the last (common) revision and send over only what comes after it.

John A Meinel (jameinel) said :
#3

Just as a followup to this question. We have a new index format, which seems to improve this a lot. You can test it by doing "bzr upgrade --development2". This format is present in bzr 1.8. For bzr 1.9 we might be releasing it as a non-development format. The patch to do so is here:
http://bundlebuggy.aaronbentley.com/project/bzr/request/%3C490790AE.6080507%40arbash-meinel.com%3E

The patch switches to using a btree structure for the index, rather than bisecting a sorted list. It also is able to compress the index (using gzip/zlib). What that means is:

1) Indexes are generally significantly smaller. I've seen 3:1, though 2:1 may be more common. (Text indexes compress very well, revision indexes not as well, so it depends on your ratio of files and changes-per-file versus length of history.)

So operations that need to read the whole index now only have to download about half as much data.

2) It is much easier to download just a portion of the index when we need to. With a bisect we download a 64kB page, and then get about 2-way fanout (we know we are before/after this point). With the btree index, we read a 4kB page, and then get approximately 100-way fan-out.

So operations that only need a little bit of history can do so much more easily. Not to mention that even if both are O(log(N)) one is log_2(N) and the other is log_100(N).

3) The new indexes are broken into exact pages, with more logical caching. Because entries could fall at any byte offset, the old prefetch code wasted more data. It also didn't know what was already cached, which caused further inefficiencies. The new code expands requests only into uncached regions.
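Even though both lookups in point 2 are O(log N), the difference in base matters a lot in practice. A tiny deterministic model (the entry count is invented for illustration) makes the point:

```python
# Deterministic model of the fan-out comparison in point 2: how many
# pages must be read top-down to locate one entry. The entry count is
# invented for illustration.

def pages_read(num_entries, fan_out):
    """Pages fetched to reach one of num_entries leaf entries."""
    levels, reachable = 1, fan_out
    while reachable < num_entries:
        reachable *= fan_out
        levels += 1
    return levels

# ~2-way fan-out of bisection vs ~100-way fan-out of the btree:
print(pages_read(1_000_000, 2))    # 20 reads of 64kB pages
print(pages_read(1_000_000, 100))  # 3 reads of 4kB pages
```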

GuilhemBichot (guilhem-bichot) said :
#4

Hi John. Thanks for the good news. Should the "bzr upgrade --development2" command be run on the local machine (my home computer, where I type "bzr push" or "bzr pull") or on the remote machine (the push/pull parent)?
I guess on both...
If on the remote machine:
* should the repository be unused by others while the "bzr upgrade" command is running?
* it's unlikely I could use it, because the remote machine is using a shared repository with tens of branches, and is used by tens of developers all around the clock.

John A Meinel (jameinel) said :
#5

Generally you want the good indexes on whatever machine is "remote", because the problem is having to access things remotely.

In 1.9, the format became officially supported, so you would "bzr upgrade --format=1.9" (or just --1.9). Unfortunately, the format string between --development2 and --1.9 is different (because it changed from a 'use at your risk' to a 'fully supported' format.)

You can run 'bzr upgrade' without taking the repo offline as a separate step, but I think it may take it offline automatically.

At the moment, the upgrade logic just initializes a new repo, and then fetches everything across. Which is *correct* but slower than it has to be.

I have a script which does things "in place", preparing the new index and then swapping them not-quite-atomically. I did set it up to take the repository offline. However, in my testing it only takes 30s to upgrade a MySQL repository, so I don't think that would be an undue burden.

Probably the more difficult issue is requiring people to use a newer version of bzr.

I'll convert this question to a bug so that I can attach the script.

GuilhemBichot (guilhem-bichot) said :
#6

I understand the proposed way to fix the problem, and am thankful for the effort. However, for something like our bzr branches, which are critical for our work, I have trouble with the idea of upgrading the central repository to a brand-new format (hit a bug there, and 100 MySQL developers would end up idle waiting for the fix). It did sound logical that, if the top revisions (the heads) of the local and remote branches have the same id, it should be possible for "bzr push" to send the remote side just the changes (which are small), and nothing more.
Or do you mean the proposed fix (the new format) is only for the slowness of "bzr pull" and not the one of "bzr push"?

John A Meinel (jameinel) said :
#7

The new format makes our indexes more efficient in general. This should benefit push, pull and many other actions (missing, log, etc). However, this is only the discovery phase, and the original question didn't have enough information to determine whether the 4MB was wasted reading indexes or spent actually transferring content.

I'm not sure I completely follow your "if the top revisions have the same id"... if the tips had the same revision id, there would be nothing to push.

Going back to the original question...

It is possible for a revision to be absent from the remote without the branches having diverged. Simply committing a new revision on top of the existing tip will do this.

A
|
B <= Remote is here
|
C <= Local is here

C & B are not diverged, but B does not know what C is.
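This distinction can be shown with a toy ancestry check. It is purely illustrative: real bzr revision graphs have merges and multiple parents, while this model follows a single parent per revision, and D is a hypothetical extra commit on top of B added to show an actually diverged case.

```python
# Toy model of the graph above: each revision maps to its single parent.
# D is a hypothetical sibling of C (also committed on top of B).
parents = {"A": None, "B": "A", "C": "B", "D": "B"}

def ancestors(revid):
    """Yield revid and every ancestor, following single parents."""
    while revid is not None:
        yield revid
        revid = parents[revid]

def diverged(local_tip, remote_tip):
    """Branches diverge only when neither tip is an ancestor of the other."""
    return (remote_tip not in ancestors(local_tip)
            and local_tip not in ancestors(remote_tip))

print(diverged("C", "B"))  # False: B is an ancestor of C, so just push
print(diverged("C", "D"))  # True: each side has a commit the other lacks
```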

The specific case of checking for divergence is not optimized into an RPC yet, but in theory it could be. I'll note that you really want both sides to exchange their tip revisions and then search from there. However, we also want to be careful that we don't have to search all of history just to find that a revision isn't there.
I'm also not convinced that *divergence* is the first order that you want to worry about.

Specifically, when you issue "bzr pull" we first fetch the remote branch, and then check for divergence. This can, indeed, make "bzr pull" slower than if we checked for divergence first. *However* what did the user do after they found divergence? Did they

1) 'bzr merge' the remote branch
2) 'bzr pull --overwrite' the remote branch, destroying their local work.
3) 'bzr branch' the remote branch into another directory, and then merge their local work into it.
4) ???

Anyway, in the first 3 cases they wanted the content of the revisions locally, in order to resolve whatever was remaining. So fetching is fine: if we just checked for divergence *first*, the user's next step would need to fetch the revisions anyway. So while that "bzr pull" was slow, I would expect the next command (merge/pull/branch/etc.) not to need to fetch the same content again.

If you can come up with a (4) that would not need to fetch the remote content, then certainly we can consider moving the divergence check earlier in the process.

For the new index layer, I have some quick "back of the envelope" calculations.
1) Looking at:
http://bazaar.launchpad.net/~mysql/mysql-server/mysql-5.0/.bzr/repository/indices/

Your largest index is 20MB, if I go here:
http://bazaar.launchpad.net/~mysql/mysql-server/mysql-6.0/.bzr/repository/indices/
It is 34MB.

The current code reads 64kB pages and does a bisection search on the index. So 34MB/64kB = 531 pages, log2(531) ≈ 9 round trips, and 9 × 64kB ≈ 0.58MB read to find a single node. If your changes affected more than one file, that number goes up (though not strictly linearly, because the bisect pages will have some overlap).

The new indexes are more compact for the same data. On .tix I would expect 2:1 compression. So they would start at 17MB instead of 34MB. Being generous, let's assume 25MB. They use a btree, with a page size of 4kB. So we would have 6.2k pages. We generally get 100:1 fan-out, so I would expect a 3-level tree. To find a single value, we would have to read 4kB * 3 = 12kB of data.
When searching for multiple values, we pad each request up to 64kB, so we would read the 4kB root, approximately 64kB at the next layer, and 64kB at the next: 132kB of data. That read is spread across the actual pages needed, so the expected cost does not go up by much (until you are searching for more than 16 well-spread-out keys).

So if you had committed a change which affected 20 files, with the current index I would not be surprised if it accessed 2-4MB of index data, while the new format would only read, say, 250kB.
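For what it's worth, the arithmetic above can be replayed directly. All figures (34MB index, 64kB pages, 4kB btree pages, a 3-level tree) come from the message itself; nothing here is measured.

```python
import math

# Replaying the back-of-the-envelope numbers from the message above.
MB, KB = 1_000_000, 1_000

# Old format: bisection over 64kB pages of a 34MB index.
old_pages = (34 * MB) // (64 * KB)              # 531 pages
old_round_trips = round(math.log2(old_pages))   # ~9 round trips
old_bytes = old_round_trips * 64 * KB           # ~0.58MB for one key

# New format: 3-level btree of 4kB pages.
new_single_key = 3 * 4 * KB                     # 12kB for one key
# Multi-key: 4kB root plus two layers padded to 64kB requests.
new_multi_key = 4 * KB + 2 * 64 * KB            # 132kB

print(old_pages, old_round_trips, old_bytes)    # 531 9 576000
print(new_single_key, new_multi_key)            # 12000 132000
```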

Either way, there is still room for improvement for fetch, since we could have smarter RPCs that have the remote side do more computation (at the cost of loading the server more) of what to send.

I believe the 1.9 release does have some improved discovery code for bzr+ssh, though I believe 'fetch()' is still done on the raw indexes (rather than as a logical stream based on revision ids).

What I would recommend is to create a copy of the repository and have some developers start using the new format. The formats are completely compatible, so you can fetch data either way. You could automate that with a cron script, though there is then the possibility of "diverged" branches (if one group is committing to the same logical branch at a different physical location).

Also, the new index code is an incremental update which only changes a small portion of the stack. The work started back in June and landed in bzr.dev in late August, so it has indeed had some time to be proven. (I also think Robert is using it in his "everything in Ubuntu" 16GB repository.)

GuilhemBichot (guilhem-bichot) said :
#8

I may not have been clear; I think I'm confusing people by discussing "bzr push" issues in this question, which is about "bzr pull".
I understand your arguments why the new format would speed up "bzr pull" and will think about the feasibility of a repository upgrade.
For the "bzr push" issues, which look more weird to me than the "bzr pull" issues, I'll put my comments in https://bugs.launchpad.net/bzr/+bug/252945

GuilhemBichot (guilhem-bichot) said :
#9

Hi John. For your information, we have just upgraded our central shared repository (the one from/to which we pull/push most of the time) with "bzr upgrade --1.9".

GuilhemBichot (guilhem-bichot) said :
#10

Hi. I have two diverged format-1.9 repos. When I do 'bzr missing' between them, it transfers ~3MB (according to bzr's progress bar). If I immediately repeat the command, it transfers 3MB again, so apparently "bzr missing" does not cache anything.
On the other hand, I was able to verify what John wrote above: I did "bzr pull", transferred 6MB before failing with divergence, and the second "bzr pull" was much faster.

Martin Pool (mbp) said :
#11

bzr missing is strictly read-only, so each run should take the same amount of effort.

It would be possible for it to actually pull all the revisions and then compare them, but that would possibly require more work (as we'd need to fetch the file texts etc. too), and in some cases people won't actually want the data to be stored.

We should probably focus on just reducing the amount of IO to find changes between closely related branches.

GuilhemBichot (guilhem-bichot) said :
#12

Hi Martin. Yes, my remark about "bzr missing" was only an observation, not a wish or complaint. I can imagine cases where one does not want to download all the revisions when doing "bzr missing".