Can Bazaar meet requirements for game-development workflows?

Asked by Jeff M.

I am working on a project, similar to game development, that requires version control for several different types of files. For example: code (c, c++, delphi, lua, xml, ini, etc.), images (low-res gui, high-res textures, etc.), binary (compressed data, proprietary formats, etc.), video (avi, mpg, etc.); you get the idea.

I am looking for a version-controlled environment that allows each of these file types to be collectively managed in a reasonable fashion. The central server will reside on a Linux machine, while contributors will be on Windows. I'm looking for:

1) a centralized server/repository for storing changesets approved for release, and from which contributors will pull.

2) local decentralized repositories that contributors can use for local commits or working offline; namely for coding. These will typically be bug or feature branches that, once complete, will need to be approved and merged/pushed back to the central repository. Preferably users can also push/pull from each other.

3) Performs well with large repositories, e.g. those with over 20k files.

4) support for partial clones, so that different types of contributors (code, design, docs, video, etc.) can pull only the content relevant to them; this lowers bandwidth and local-storage requirements. Not a requirement per se, but still highly valued.

5) support for history-less / shallow / cvs-like control of specified files/types; in particular, large and/or binary files of which only final, approved versions will be stored on the central server. The files' history does not need to be pulled from the central server, nor do local, non-final changes need to be pushed to the central server.

6) Preferably the wisdom to automatically recognize, or at least a manual option for designating, files that do not need to be merged, and consequently do not need to be loaded (or have their parents loaded) into memory.

I have used CVS and Mercurial and have looked at Subversion and Git but none of these support all requisites. All of them fail at #6, which seems like a trivial feature. I do have a couple of ideas I still need to test: if you first remove, and then add a new file of the same name, in which VCS is the diff avoided? Or in which VCS would it be possible to hook into the diff process, check for binary, and add/remove (if it avoids the diff) or perhaps return a simplified replace-all patch without having to load the parent?

CVS fails to support #2. Subversion fails at #2 and #3. Mercurial and Git fail at #4 and #5. I am trying to determine whether this is a workflow that Bazaar can support and, if not, which of these requirements Bazaar fails at. I have been looking at some of Bazaar's features, but I do not fully understand some of the concepts. Is a lightweight checkout equivalent to #5? Are stacked branches equivalent to #4? Can you tell Bazaar not to diff certain files?

I have been contemplating using CVS for the centralized server (fulfilling #1, #3, #4, #5), and then having contributors use Mercurial for the local and/or P2P workflow (#2). However, this is contingent on working around #6 with CVS. Any enlightenment you could provide would be very much appreciated :)

Question information

Language: English
Status: Solved
For: Bazaar
Solved by: Andrew Bennetts
Revision history for this message
Parth Malwankar (parthm) said :
#1

On Sat, Aug 14, 2010 at 3:30 AM, Jeff M.
<email address hidden> wrote:

Hi Jeff,

That's a lot of questions :-) ... I will try to comment on some.

> I am working on a project, similar to game-development, that requires version control for several different types of files. For example: code (c, c++, delphi, lua, xml, ini, etc.), images (low-res gui, high-res textures, etc.), binary (compress data, proprietary, etc.), video (avi, mpg, etc.); you get the idea.
>

Handling of all file types (including binaries) should be ok. The rule
of thumb is that you need 3-4x the file's size in memory for some
operations; e.g. diff requires at least two full copies of a file to be
in memory at a time.
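To make that memory cost concrete, here is a toy illustration in Python (difflib stands in for bzr's internal diff; the 3-4x figure also covers bzr's own bookkeeping, which is not modeled here):

```python
import difflib

def diff_sizes(old: bytes, new: bytes):
    """Diff two versions; both full texts must be in memory at once."""
    old_lines = old.decode("utf-8").splitlines(keepends=True)
    new_lines = new.decode("utf-8").splitlines(keepends=True)
    # Right here the process holds both full copies (plus their line lists),
    # which is where the "at least 2x the file size" cost comes from.
    delta = list(difflib.unified_diff(old_lines, new_lines))
    return len(old) + len(new), delta

total, delta = diff_sizes(b"a\nb\nc\n", b"a\nB\nc\n")  # 12 bytes resident, minimum
```

For a 500MB video file the same shape of operation needs at least a gigabyte of RAM before any output is produced, which is why diffing large media is so painful.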

> I am looking for a version-controlled environment that allows each of these file types to be collectively managed in a reasonable fashion. The central server will reside on a Linux machine, while contributors will be on Windows. I'm looking for:
>
> 1) a centralized server/repository for storing changesets approved for release; and from which contributors will pull from.
>

Yes. You should be able to do this. Bzr also has a patch queue manager
plugin (bzr-pqm) that can help with this. Some docs[1].

> 2) local decentralized repositories that contributors can use for local commits or working offline; namely for coding. These will typically be bug or feature branches that, once complete, will need to be approved and merged/pushed back to the central repository. Preferably users can also push/pull from each other.
>

Yes. This is a standard feature as bzr is a DVCS.

> 3) Performs well with large repositories; eg those over 20k files.
>

IMO it should work fine, though I haven't personally used bzr with more
than 5-6k files. You should probably be using shared repositories[2] if
you have a large history and a large number of files.

> 4) support for partial-clones, so that different types of contributors (code, design, docs, video, etc.) can pull content only relevant to them; lowers bandwidth and local-storage requirements. not a requirement per se, but still highly valued.
>

Bzr doesn't support partial clones. It does, however, support views[3].
It's not exactly the same, but users can work with only a subset of
files.
For lower local-storage needs, shared repositories work very well: you
can have N branches within a shared repo, and the history is common
between them.

> 5) support for history-less / shallow / cvs-like control of specified files/types. In particular large and/or binary files of which only final, approved versions will be stored on the central server. The files history does not need to be pulled from the central server, nor do local, non-final changes need to be pushed to the central server.
>

Bzr supports lightweight checkouts[4], which are much like cvs or svn checkouts.

> 6) Preferably the wisdom to automatically recognize, or at least provide a manual option of, designating files that do not need to be merged; and consequently do not need to be loaded or have parents loaded into memory.
>

I don't think this is possible, but maybe someone else can comment.
Perhaps custom hooks (mentioned below) can be used to handle specific
cases.

> I have used CVS and Mercurial and have looked at Subversion and Git but none of these support all requisites. All of them fail at #6, which seems like a trivial feature. I do have a couple of ideas I still need to test: if you first remove, and then add a new file of the same name, in which VCS is the diff avoided?

I am not sure I fully understand you here. Bzr tracks files and
directories as first-class objects: for each file, bzr creates a
unique id and then tracks it across its lifetime (including renames -
bzr mv). If you remove a file (bzr rm) and add a second file with the
same content, bzr will see it as a separate file with a separate id.
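A toy model of that behavior (illustrative only; bzr's real ids are minted inside bzrlib, not by this code):

```python
import uuid

class Tree:
    """Toy model of per-file ids: identity survives renames, not re-adds."""
    def __init__(self):
        self.ids = {}  # path -> file id

    def add(self, path):
        self.ids[path] = uuid.uuid4().hex  # id minted at add time

    def rm(self, path):
        del self.ids[path]

    def mv(self, old, new):
        self.ids[new] = self.ids.pop(old)  # same id: history follows the file

tree = Tree()
tree.add("logo.png")
first = tree.ids["logo.png"]
tree.mv("logo.png", "art/logo.png")
assert tree.ids["art/logo.png"] == first   # rename keeps identity
tree.rm("art/logo.png")
tree.add("art/logo.png")                   # re-add mints a new id
assert tree.ids["art/logo.png"] != first
```

So an rm-then-add cycle deliberately severs the history link, which matters for the workaround discussed below.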

> Or in which VCS would it be possible to hook into the diff process, check for binary, and add/remove (if it avoids the diff) or perhaps return a simplified replace-all patch without having to load the parent?
>

bzr supports various hooks[5][6], including hooks to extend commands.
Maybe these can be used for the things you mention above, but I haven't
tried anything like that.

> CVS fails to support #2. Subversion fails at #2 and #3. Mercurial and Git fail at #4 and #5. I am trying to determine if this is a workflow that Bazaar can support? And if not, which of these does Bazaar fail at? I have been looking at some of Bazaar's features, but I do not fully understand some of the concepts. Is a lightweight checkout equivalent to #5? Are stacked-branches equivalent to #4?
>

#2 (local repos), and #3 (large number of files) are supported.
#4 (partial checkouts) are not supported exactly, but views can help.
#5 Yes, lightweight checkout is #5.

As I understand it, stacked branches are more of a space optimization,
so that servers running services like Launchpad (with many users
branching and keeping their branches on the server) don't duplicate
history. For example, you can see a large number of bzr branches at
https://code.launchpad.net/bzr . All of them are stacked so the server
doesn't waste space.

> Can you tell bazaar not to diff certain files?

Not yet. That's an open bug, IIRC.

> I have been contemplating using CVS for the centralized server (fulfilling #1, #3, #4, #5), and then having contributors use Mercurial for the local and/or P2P workflow (#2). However, this is contingent on working around #6 with CVS. Any enlightenment you could provide would be very much appreciated :)
>

[1] http://doc.bazaar.canonical.com/bzr.2.2/en/user-guide/bazaar_workflows.html#decentralized-with-automatic-gatekeeper
[2] http://doc.bazaar.canonical.com/bzr.2.2/en/user-guide/shared_repository_layouts.html
[3] http://doc.bazaar.canonical.com/bzr.2.2/en/user-guide/filtered_views.html
[4] http://doc.bazaar.canonical.com/bzr.2.2/en/user-guide/using_checkouts.html#getting-a-lightweight-checkout
[5] http://doc.bazaar.canonical.com/bzr.2.2/en/user-reference/hooks-help.html
[6] http://doc.bazaar.canonical.com/bzr.2.2/en/user-guide/hooks.html

Jeff M. (jamarr) said :
#2

Hi Parth, thank you for taking the time to respond. I just have a few
more questions ;)

> As I understand it, stacked branches are more of a space optimization,
> so that servers that run services like launchpad with many users
> branching and keeping their branches on the server don't duplicate
> history. For e.g. you can see a large number of bzr branches at
> https://code.launchpad.net/bzr . All of them are stacked so the server
> doesn't waste space.

Today I stumbled upon a post from last year giving, imo, a better
description of lightweight checkouts and stacked branches than the
official docs do. You can read the post here:
http://lists.debian.org/debian-python/2009/03/msg00007.html

As you said, it sounds like lightweight checkouts will let you mimic cvs
workflows. However, the interesting part is stacked branches, which
apparently give you lightweight checkouts with local commits; something
I was looking for. If and when the history is needed, the stacked branch
will download it "on demand" from the central server. I think the
important distinction here is whether Bazaar has to pull the "entire"
repository history, or only needs to pull the "related" history on
demand. I think this post implies that Bazaar only pulls the "related"
history; if that is the case, it means Bazaar already supports "partial
checkouts" - it simply does not have a user-friendly interface for them
yet. Can you corroborate this?

>> 4) support for partial-clones, so that different types of contributors (code, design, docs, video, etc.) can pull content only relevant to them; lowers bandwidth and local-storage requirements. not a requirement per se, but still highly valued.
> Bzr doesn't support partial clones. It does however support views[3].
> Its not exactly the same but the users can work with only a subset of
> files.
> For lower local-storage needs shared-repo works very well. You can
> have N branches within a shared repo and the history is common between
> them.

It sounds like views are essentially filters on the client (checkout)
side that intentionally limit what you can see in your sandbox. So it
should be possible to combine this with the lightweight or stacked
features to give me, almost, what I want. If stacked branches do cache
the "on-demand" history, then this combination can easily be perceived
as a partial checkout. Even if the on-demand history is not cached, I
can live with that; it will just be a bit slower when the history is
involved.

>> I have used CVS and Mercurial and have looked at Subversion and Git but none of these support all requisites. All of them fail at #6, which seems like a trivial feature. I do have a couple of ideas I still need to test: if you first remove, and then add a new file of the same name, in which VCS is the diff avoided?
> I am not sure if I fully understand you here. bzr tracks files and
> directories as first class objects. So, For each file bzr creates a
> uniques id and then tracks it across its lifetime (including renames -
> bzr mv). If you remove a file (bzr rm) and add a second file with the
> same content, bzr will see this as a separate file with a separate id.

In CVS, files are tracked by name (afaik), so removing a file and
re-adding a file with the same name automatically ties their histories
together. Because of this, it should in theory be possible to delete a
file in CVS and then re-add a new version of that file to avoid a diff.
I was hoping that Bazaar used a similar mechanism: by removing and
re-adding a file, you are telling the VCS that, while the files are
related, you do not need to keep diffs between them; you want to
maintain their history, but each step in that history is a completely
new version of the file.

How does Bazaar create these unique ids? Are they hashes of the
path/filename, or random uuids? If Bazaar could purposefully generate
the same unique id for a given path/file (via a hook?), and someone
were to remove and re-add the file, would this link their history while
avoiding the diff? I essentially want to tell the VCS that the history
should be linked, but that this point in the history should be treated
as a new file, with no need to diff against the parent.

>> Or in which VCS would it be possible to hook into the diff process, check for binary, and add/remove (if it avoids the diff) or perhaps return a simplified replace-all patch without having to load the parent?
> bzr supports various hooks[5][6] including extend command hooks. Maybe
> this can be used for the things you mention above but I haven't tried
> something like that.
> [1] http://doc.bazaar.canonical.com/bzr.2.2/en/user-guide/bazaar_workflows.html#decentralized-with-automatic-gatekeeper
> [2] http://doc.bazaar.canonical.com/bzr.2.2/en/user-guide/shared_repository_layouts.html
> [3] http://doc.bazaar.canonical.com/bzr.2.2/en/user-guide/filtered_views.html
> [4] http://doc.bazaar.canonical.com/bzr.2.2/en/user-guide/using_checkouts.html#getting-a-lightweight-checkout
> [5] http://doc.bazaar.canonical.com/bzr.2.2/en/user-reference/hooks-help.html
> [6] http://doc.bazaar.canonical.com/bzr.2.2/en/user-guide/hooks.html

Thanks for the references; I will look into the hook system.

>> 6) Preferably the wisdom to automatically recognize, or at least provide a manual option of, designating files that do not need to be merged; and consequently do not need to be loaded or have parents loaded into memory.
>> Can you tell bazaar not to diff certain files?
> Not yet. Thats an open bug IIRC.

I hope that this comes sooner rather than later. If I have the correct
perception of shared repositories, stacked branches, and views, then
the only issue preventing the use of Bazaar out of the box is the diff
issue. This seems to be a problem for every free version control system
I've looked at, and solving it would certainly put Bazaar one notch
above the rest. As far as I am aware, the only VCS that lets you avoid
diffing large files is Perforce, which costs a small fortune. This, in
combination with history-less checkouts, is really a requirement for
large, media-rich projects that are just waiting for a VCS like Bazaar
to meet their needs ;)

Best Andrew Bennetts (spiv) said :
#3

Here are a few quick answers for you.

Jeff M. wrote:
> Question #121049 on Bazaar changed:
[...]
> As you said, it sounds like lightweight-checkouts will let you mimic cvs
> workflows. However, the interesting part is on stacked-branches which
> apparently gives you lightweight-checkouts with local commits; something
> I was looking for. If and when the history is needed, the stacked-branch
> will downloaded it "on demand" from the central server. I think the
> important distinction here is if Bazaar has to pull the "entire"
> repository history, or if it only needs to pull the "related" history on
> demand. I think this post implies that Bazaar only pulls the "related"
> history; if this is the case, it means Bazaar already supports "partial
> checkouts" - it simply does not have a user-friendly interface for it
> yet. Can you corroborate this information?

Sadly, stacked branches are not quite ready for that yet. You cannot commit
directly to a stacked branch: <https://bugs.launchpad.net/bzr/+bug/375013>

[...]
> How does Bazaar create these unique ids? Are they not hashes of the
> path/filename, or are they random uuids? If Bazaar could purposefully

Random UUIDs.

> generate the same unique id for a given path/file (via a hook?), if
> someone where to remove and re-add the file would this link their
> history while avoiding the diff? I essentially want to tell the VCS that
> the history should be linked, but that this instance in history should
> be treated as a new file and there is no need to diff with the parent.

Bazaar has no concept of linked-but-separate files of the sort you
seek. It's either the same file or a totally different one. I'm not
sure why you are concerned about whether Bazaar stores it using diffs
against the parent or not; Bazaar's storage will perform just as well
whether the two versions of that file are tracked as one file or two.
Or am I misunderstanding your concern?

[...]
> >> 6) Preferably the wisdom to automatically recognize, or at least provide a manual option of, designating files that do not need to be merged; and consequently do not need to be loaded or have parents loaded into memory.
> >> Can you tell bazaar not to diff certain files?
> > Not yet. Thats an open bug IIRC.
>
> I hope that this comes sooner rather than later. If I have the correct
> perception of shared-repositories, stacked-branches, and views then the
> only issue preventing the use of Bazaar out of the box is the diff
> issue. This seems to be a problem for every free version control system
> that I've looked at, and would certainly put Bazaar one notch above the
> rest. As far as I am aware, the only VCS that lets you avoid diffing
> large files is Perforce, which costs a small fortune. This in
> combination with history-less checkouts is really a requirement for
> large, media-rich projects that are just waiting for a VCS like Bazaar
> to meet their needs ;)

To an extent this may be addressed by writing a plugin that uses the
merge_file_content hook. E.g. see
<http://doc.bazaar.canonical.com/bzr.2.2/en/user-guide/hooks.html#example-a-merge-plugin>,
which includes a simple example that causes bzr to never try to merge
*.xml files, instead marking them as conflicts for the user to resolve
by hand.
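The decision such a hook makes can be sketched in plain Python. This mimics the shape of the documented example but is not the bzrlib API; the function name and the (status, lines) return convention here are illustrative:

```python
def merge_file_content(path, this_lines, other_lines, base_lines):
    """Per-file merge sketch: refuse to merge *.xml, defer to the default otherwise."""
    if path.endswith(".xml"):
        # Never attempt a textual merge on these files; keep "this" side's
        # content and flag a conflict for the user to resolve by hand.
        return ("conflicted", this_lines)
    # Fall through to the normal merge machinery for everything else.
    return ("not_applicable", None)

status, lines = merge_file_content(
    "data/level1.xml", [b"<a/>\n"], [b"<b/>\n"], [b"<c/>\n"])
```

In the real plugin the pattern check and the conflict result live in a class registered on the merge hook point; the linked user-guide page shows the actual registration.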

-Andrew.

Jeff M. (jamarr) said :
#4

Thanks Andrew Bennetts, that solved my question.

Jeff M. (jamarr) said :
#5

Hi Andrew, thanks for the response. I do not have any more questions,
so I'll mark this as answered shortly.

I understand that Bazaar, Mercurial, and Git are all focused on the
open-source, library-driven market, and it is certainly a large market.
However, there is also a large media-rich market out there waiting for
a single solution to content management, and I believe that whoever
taps into this market first will gain substantial contributions and
support from it. In addition to the source-code management in use
today, these media-rich projects need features suitable for managing
large and/or compressed content elegantly. Each of you has already
delivered on the requirements for source control in your own unique
models. What is needed now is to extend your markets by providing
support for diverse projects; expand your communities. These needs are
not that far out there:

1) Limiting the amount of history needed on the client for standard
functionality, e.g. shallow checkouts. In 99% of my experience, the
entire repository history is completely unnecessary; and in 90% of my
experience, 90% of the history is completely unnecessary. This rings as
true for source-only repositories as it does for media-rich ones, yet
this issue is still a struggle. For media, outside of an explicit
checkout, the history is never needed locally.

2) Limiting the scope of a checkout to a fraction of the repository,
e.g. partial checkouts. This is mostly irrelevant for small projects,
even up to large source-only libraries. However, when you have to
collaborate with several different and distinct types of users, it is
essential. You do not need, or want, graphics designers to check out
code and vice versa. You do not need or want authors to check out
videos and vice versa. It is simply too inefficient for media-rich
projects.

3) Supporting undiffable content is crucial for supporting media-rich
repositories. Failing to develop native support for this results in
limited use in media-rich markets. The majority of media formats are in
some way compressed, because they would otherwise be unwieldy. On top
of that, there is little value in diffing this media even uncompressed.
We want version control of this media, but we do not want to pay, and
in a lot of cases cannot afford to pay, the overhead involved in trying
to diff these files.

Unfortunately, it seems the Mercurial and Git communities are not
interested in escaping their limited source-control shells, but Bazaar
at least gives the impression of being open to such things. I hope that
these features will become a higher priority for this community, so
that Bazaar can meet the needs of all projects: little, big, and
diverse ;)

> Sadly, stacked branches are not quite ready for that yet. You cannot commit
> directly to a stacked branch:<https://bugs.launchpad.net/bzr/+bug/375013>

That is unfortunate. It seems this issue has been sitting for over a
year, and it really is an extremely important feature. We could all
benefit from combining the best aspects of centralized and
decentralized workflows. It is crucial to have a stable, centralized
repository that clients can pull from, while giving day-to-day
developers the ability to work locally and efficiently.

> Bazaar has no concept of linked-but-separate of the sort you seek. It's
> either the same file or a totally different one. I'm not sure why you
> are concerned about whether Bazaar is storing it using diffs to the
> parent or not; Bazaar's storage will perform just as well whether or not
> the two versions of that file are tracked as one file or two. Or am I
> misunderstanding your concern?

This was merely a theoretical workaround to avoid the memory overhead
involved in diffing undiffable content. The workaround itself is not
particularly important.

> To an extent this may be addressed by writing a plugin that uses the
> merge_file_content hook. e.g. see
> <http://doc.bazaar.canonical.com/bzr.2.2/en/user-guide/hooks.html#example-a-merge-plugin>,
> which includes a simple example that cause bzr to never try merge *.xml
> files, and instead mark them as conflicts for the user to resolve by
> hand.

Thanks. I did notice this when Parth linked the hooks page; definitely
something to consider. However, I also noticed another bug report where
even commits require several times the file's size in memory, and that
issue has been sitting for over three years. I was hoping to find a VCS
that could avoid such unnecessary overhead. Hopefully these issues will
garner more attention.

Thanks again, everyone, for all of the insight.

John A Meinel (jameinel) said :
#6


Jeff M. wrote:
> Question #121049 on Bazaar changed:
> https://answers.launchpad.net/bzr/+question/121049
>

...

> 1) Limiting the amount of history needed on the client for standard
> functionality; eg shallow checkouts. In 99% of my experience, the entire
> repository history is completely unnecessary. And in 90% of my
> experience, 90% of the history is completely unnecessary. This rings as
> true for source-only repositories as it does for media-rich ones; yet
> this issue is still struggled with. In terms of media, outside of
> explicit checkout, the history is never needed locally.

The primary counterpoint is that you don't know ahead of time when that
10% kicks in. I agree that most of the time most of the history isn't
necessary. But it is *very* hard to predict in advance when it will be.

There have been some other systems that didn't propagate history (such
as Arch/tla: you had it in your repo, maybe someone mirrored your repo,
maybe they didn't). However, very often the primary copy of the content
then goes offline, and suddenly everyone has a hole in their history.
You go to annotate a file... but you're missing changes, etc.

Many places have a 'central authority', which is one potential option:
the central location has the full history, but each individual does
not. This raises a bunch of other issues, though (you can't do all the
work offline that you could have done while online, your copy really
isn't a peer to another one when you want it to be, etc., etc.).

It is something we're still exploring. Shallow branches may take a
larger role in this. I've personally been tasked with fixing the
commit-to-stacked-branch stuff once it bubbles to the top of my TODO
queue. (Shallow is an extension of stacked, whereby we default to
populating the branch with some amount of history; stacked defaults to
no history except for new commits.)

>
> 2) Limiting the scope of a checkout to a fraction of the repository; eg
> partial checkouts. This is mostly irrelevant for small projects, even up
> to large source-only libraries. However, when you have to collaborate
> with several different and distinct types of users, it is essential. You
> do not need, or want, graphics designers to checkout out code and vice
> verse. You do not need/want authors to checkout videos and vice verse.
> It is simply too inefficient for media-rich projects.

The very basic way to do this is to keep them in separate branches. It
means deciding a priori where the dividing line is, but you clearly
have 'code here' and 'media there'; it doesn't seem hard to find the
dividing line.

The one thing Bazaar is natively missing is a way to bring lots of
separate branches together into a single meta-project. (Our term for it
is Nested Trees.) There are two different plugins that provide some
support for this (scmproj and bzr-externals, IIRC). We're also looking
to focus on implementing it in the next 6-month cycle, since it is
starting to become more of a priority for other work we are trying to
do.

>
> 3) Supporting undiffable content is crucial for supporting media-rich
> repositories. Failing to develop native support for this results in
> limited use in media-rich markets. The majority of media formats is in
> some way compressed, because they would be otherwise unwieldy. To top it
> off, there is little relevance in diffing this media even uncompressed.
> We want version control of this media, but we do not want to pay, and in
> a lot of cases can not afford to pay, the overhead involved with trying
> to diff these files.
>
> Unfortunately it seems like the Mercurial and Git communities are not
> interested in escaping their limited, source-control shells. But Bazaar
> at least gives the impression of such things. I hope that these features
> will become a higher priority for this community so that Bazaar can meet
> the needs of all projects, little, big, and diverse ;)
>

There are quite a few potential tweaks here, but honestly the tradeoffs
are not fully understood. Even with 'pre-compressed' content, it really
depends on how the editing is done.

For example, take a 1GB, 10-minute movie. If someone edits some of the
content (say, even 1 minute of it), I would bet that a significant
fraction of it would remain unchanged. (It is just inefficient to
regenerate the entire 1GB of content.) As such, having 'diff' or
'delta' at the repository level is still efficient: your size increase
grows with the size of the change, rather than the size of the raw
content each time.

The proposal for Bazaar that we liked the most was 'sharding', which is
essentially: any file larger than size X gets treated as N shards of
size X. (So a 1GB file internally becomes, say, 10 100MB shards.) You
potentially lose some diff efficiency, but you could easily cap the
largest internal object size.
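The sharding idea John describes can be sketched in a few lines (the size cap and names here are illustrative, not a real Bazaar implementation):

```python
SHARD_SIZE = 100 * 1024 * 1024  # cap on the largest internal object

def shard(content: bytes, shard_size: int = SHARD_SIZE):
    """Split oversized content into fixed-size shards; small files pass through whole."""
    if len(content) <= shard_size:
        return [content]
    # Each shard can then be stored, compressed, and diffed independently,
    # bounding peak memory at roughly the shard size instead of the file size.
    return [content[i:i + shard_size] for i in range(0, len(content), shard_size)]

pieces = shard(b"x" * 250, shard_size=100)  # 250 bytes at a 100-byte cap
```

A changed minute in the middle of a movie would then only dirty the shards it touches, which is the diff-efficiency trade-off John mentions.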

...

>> Bazaar has no concept of linked-but-separate of the sort you seek. It's
>> either the same file or a totally different one. I'm not sure why you
>> are concerned about whether Bazaar is storing it using diffs to the
>> parent or not; Bazaar's storage will perform just as well whether or not
>> the two versions of that file are tracked as one file or two. Or am I
>> misunderstanding your concern?
>
> This was merely related to a theoretical workaround to avoid the memory
> overhead involved in diffing undiffable content. This workaround is not
> particularly important.

As I mentioned, I think there is still a lot of 'it depends' going on.
Medium-sized files (~1MB) could easily change their whole content on
every change, but large files (>100MB) are unlikely to change
everything every time.

>
>> To an extent this may be addressed by writing a plugin that uses the
>> merge_file_content hook. e.g. see
>> <http://doc.bazaar.canonical.com/bzr.2.2/en/user-guide/hooks.html#example-a-merge-plugin>,
>> which includes a simple example that cause bzr to never try merge *.xml
>> files, and instead mark them as conflicts for the user to resolve by
>> hand.
>
> Thanks. I did notice this when Parth linked the hooks page. Definitely
> something to consider. However, I also noticed another bug report where
> even commits require several times the files size in memory. And that
> issue has been sitting for over three years. I was hoping to find a VCS
> that could avoid such unnecessary overhead. Hopefully these issue will
> garner some more attention.
>
> Thanks again, everyone, for all of the insight.

At the moment, 'bzr commit' requires 1 fulltext plus, I think, 2
zlib-compressed texts. We have pushed on that a few times. If repacking
is triggered, then we require up to 4x the size of the largest content.

There are several quick hacks to prevent some of that; as mentioned,
the most elegant solution is probably something like sharding.

It would be an approximately 3-line change to make the compressor say
"if content > X size, don't delta compress". You could even do it in a
plugin (with a bit more effort). I don't think the tradeoffs are as
straightforward as you imply, though.
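The guard John describes might look like this (the threshold, the function names, and the stand-in delta step are all made up for illustration; bzr's real compressor lives elsewhere):

```python
import zlib

DELTA_CAP = 10 * 1024 * 1024  # above this size, skip delta compression entirely

def compute_delta(parent, content):
    # Stand-in for a real binary delta against the parent text.
    return zlib.compress(content)

def store(content, parent):
    """Pick a storage form: delta against the parent, or a plain fulltext."""
    if parent is None or len(content) > DELTA_CAP:
        # Large or parentless content: store a zlib fulltext, never a delta,
        # so the delta machinery never has to hold both versions in memory.
        return ("fulltext", zlib.compress(content))
    return ("delta", compute_delta(parent, content))

kind, _ = store(b"\x00" * (DELTA_CAP + 1), parent=b"old")  # over the cap
```

The size check is the whole trick; the open question John raises is whether skipping deltas actually wins once you account for large files whose edits touch only a fraction of their bytes.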

John
=:->

Gordon Tyler (doxxx) said :
#7

> At the moment, 'bzr commit' requires 1 fulltext, + I think 2 zlib
> compressed texts. We have pushed on that a few times. If repacking is
> triggered, then we require up to 4x the size of the largest content.

I've always wondered why bzr doesn't use temporary files on disk for potentially memory-consuming operations like these.
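For what it's worth, Python's standard library already has a primitive for exactly that trade-off; a small sketch of the idea Gordon is suggesting (note that `_rolled` is a CPython implementation detail, used here only to show when the spill happens):

```python
import tempfile

# Content stays in RAM up to max_size, then transparently spills to a
# temporary file on disk: a bound like the one Gordon is suggesting.
buf = tempfile.SpooledTemporaryFile(max_size=1024)
buf.write(b"a" * 100)
assert not buf._rolled            # still entirely in memory
buf.write(b"b" * 2000)
assert buf._rolled                # exceeded max_size: now backed by disk
buf.seek(0)
data = buf.read()                 # reads back all 2100 bytes either way
```

Whether bzr could route its text reconstruction through something like this without hurting the common small-file case is exactly the kind of tradeoff John mentions above.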

Jeff M. (jamarr) said :
#8

Hi John, thanks for the insight. It is very encouraging to see how
active and helpful Bazaar's question-and-answer forum is.

If you notice, all of these points are about storage efficiency. This
is the single most important factor in media-heavy projects. The
workflows are centralized for stability, security, and accessibility,
but the day-to-day work of a diverse group of contributors relies on
handling their media efficiently. In my experience, most VCSs are all
about source control; they should really be re-labeled as versioned
source-control systems, as none of them handles all content
efficiently. So now I am simply trying to promote the idea that
supporting media-heavy communities can and will be rewarding to the
existing DVCS communities, in the hope that more DVCS contributors will
contribute to these features ;)

> The primary counterpoint is that you don't know ahead of time when that
> 10% kicks in. I agree that most of the time most of the history isn't
> necessary. But it is *very* hard to predict in advance when it will be.
>
> There have been some other systems that didn't propagate history (such
> as Arch/tla: you had it in your repo, maybe someone mirrored your repo,
> maybe they didn't). However, very often the primary copy of the content
> then goes offline, and suddenly everyone has a hole in their history.
> You go to annotate a file... but you're missing changes, etc.
>
> Many places have a 'central authority' which is one potential option.
> The central location has the full history, but each individual does not.
> This raises a bunch of other issues, though. (you can't do all the work
> offline that you could have done while online, your copy really isn't
> a peer to another one when you want it to be, etc, etc.)
>
> It is something we're still exploring. Shallow branches may take a
> larger role in this. I've been personally tasked to fix the
> commit-to-stacked-branch stuff once it bubbles to the top of my TODO
> queue. (Shallow is an extension of stacked, whereby we default to
> populating it with some amount of history, stacked defaults to no
> history except for new commits.)

Right. I agree with what you are saying. My point was simply that
shallow checkout is important and can be utilized in both centralized
and decentralized systems; I just had not seen such a feature being
taken seriously. That is somewhat understandable when your current
community consists mostly of decentralized workflows. But in large,
media-heavy projects it is inconceivable to work with the entire history
of a project. Full-history checkouts are as much a necessity in a
decentralized project as they are a frustration in a media-rich one.

I know that some people advocate not putting large and/or binary content
in a version control system. What they do not understand is that, one
way or another (generally via manual backup), that content ends up under
some form of version control anyway. We need version control of media
for some of the same reasons it is needed for source code: tracking,
versioning, reverting, branching, etc. What the media-heavy community
lacks is a single version control system that meets the needs of both
source and media.

I am happy to hear that you are still pursuing this, and I am sure that
many of us will be excited to see progress made in this area.

> The very basic way to do this is to have them in separate branches. It
> means deciding a priori what the dividing lines is. But you clearly have
> 'code here' and 'media there'. It doesn't seem hard to find the dividing
> line.
>
> The one thing that Bazaar natively is missing is a way to bring together
> lots of separate branches into a single meta-project. (Our term for it
> is Nested Trees.) There are 2 different plugins that provide some
> support for it (scmproj and bzr-externals, IIRC). We're also looking to
> focus on implementing it in the next 6-month cycle, since it is starting
> to be more of a priority for other work we are trying to do.

Right. Dividing the lines a priori is generally a simple matter for
media-separated hierarchies. Using separate branches for media is an
interesting option. Does Bazaar allow you to checkout/pull from a single
branch, without having to pull unshared history from all branches in a
repository? So in theory you could initialize a repository with a branch
for each type of media, and when pulling one branch avoid the overhead
of pulling history from all other branches? In Mercurial you have to
pull the history for every branch, even if/when those histories are
unrelated.

> There are quite a few potential tweaks here, but honestly the tradeoffs
> are not fully understood. Even on 'pre-compressed' content, it really
> depends on how the editing is done.
>
> For example, if you had a 1GB 10-minute movie. If someone edits some of
> the content (say even 1 minute of it), I would bet that a significant
> fraction of it would remain unchanged. (It is just inefficient to
> regenerate the entire 1GB of content.) As such, having 'diff' or 'delta'
> at the repository level is still efficient. (Your size increase would go
> up by the level of the change, rather than the size of the raw content
> each time.)
>
> The proposal for Bazaar that we liked the most was 'sharding'. Which is
> essentially, any file larger than X size gets treated as N shards of
> size X. (So a 1GB file internally becomes say 10 100MB shards.) You
> potentially lose some diff efficiency, but you could easily cap the
> largest internal object size.

Indeed, this sounds like an attractive and reasonable solution. There
was only one mention of it in the bug report
(https://bugs.launchpad.net/bzr/+bug/109114), and apparently it is/was
not on Canonical's list of priorities. I would very much like to see the
community push for contributions in this area, and perhaps one day I
will be able to contribute myself; at the moment, though, I have no
experience developing any form of VCS, nor do I know Python or Pyrex; I
am mostly familiar with C/C++/PHP.

Revision history for this message
John A Meinel (jameinel) said :
#9

...

> Right. Dividing the lines a priori is generally a simple matter for
> media-separated hierarchies. Using separate branches for media is an
> interesting option. Does Bazaar allow you to checkout/pull from a single
> branch, without having to pull unshared history from all branches in a
> repository? So in theory you could initialize a repository with a branch
> for each type of media, and when pulling one branch avoid the overhead
> of pulling history from all other branches? In Mercurial you have to
> pull the history for every branch, even if/when those histories are
> unrelated.
>

You only pull the history for a single branch.

John
=:->


Revision history for this message
Adrian Wilkins (adrian-wilkins) said :
#10

The wrinkles here are points #5 and #6 - I am presuming the files you wish to ignore for diff/merge purposes are the same content files of which you only want to retain the latest version (both on client and server).

To me these are not requirements you want to put near a version control system, simply because you are not asking for version-controlled files.

What I would do with resources like these is store them on a common server and have your build system grab them as required. If they have to go in the same folder as the sources, put them in a folder and instruct the VCS to ignore it. It matters little that you aren't versioning them, because you've already broken the paradigm of version control by retaining only the latest version of these resources. Even better would be if your build system could cope with them being in an external directory.

The closest analogue I can offer for these features would be using Maven to access files from Archiva. Archiva will retain as many snapshots of your resources as you wish. Maven will mirror them to a repository on your local disk, checking whether you have the latest available snapshot or major version as stipulated. Put your Maven pom.xml files in version control; store your binary resources in Archiva. When you are away from the network, you can instruct Maven to work offline; when you are back in the office, you can have it grab the latest versions.

Revision history for this message
Jeff M. (jamarr) said :
#11

Hi Adrian, thanks for the suggestion. I have heard of Maven and was
planning on looking into it; I was not aware of Archiva, so thanks for
that tip.

As far as version control is concerned, you recommend not using Bazaar
to version media, and instead using a combination of other software to
do so. Doesn't that just sound funny :) This is what I was talking about
in my previous reply. This media needs to be version controlled, in some
fashion, on the central server, not in the local checkouts; perhaps
revisioned in local checkouts, but those revisions would never be
pushed. The reason the media needs to be versioned is that we release
various editions (branches, if you will) of our software. Sometimes the
media needs to be reverted a version or two because unforeseen issues
arise in the media and/or its content. Sometimes we need to merge
branches when experimental media has proven worthwhile. And sometimes we
simply need to know when some media object was introduced and what
releases are/were affected by it.

Yes, we can, and probably will have to, find a combination of software
to fill this void. My question is simply: why? Why does it need to be so
complicated? Is it really impossible for version control systems to
branch out and meet these needs? How much simpler would it be for
media-rich projects if a single VCS could be used to version-control all
of their content? I know there are paid services and software that offer
this. I know that in an open-source market there needs to be demand, and
people willing to meet that demand, to get the features you need. So I
am trying to advocate this: 1) the need does exist; I see threads on all
of these issues in all of the major VCS forums, from several years ago
up to today; 2) these features can benefit the open-source community and
both centralized and distributed systems; 3) broadening your audience
will bring in new contributors and help move the project forward.

The biggest appeal, to me, is a single open-source VCS that can
intelligently version control all forms of content. How nice would it be
to hear this conversation: Hey Paul, what VCS do you recommend for a small
decentralized project? Bazaar. Hey Paul, what VCS do you recommend for
my new graphics-intensive website? Bazaar. Hey Paul, what VCS do you
recommend for centralized control with local commits for authors?
Bazaar. Hey Paul, what VCS do you recommend for a medium-sized
scientific project with mixed source data, test data, and proprietary
formats? Bazaar. Hey Paul, what VCS do you recommend for a
game-development company that needs to control an extremely diverse set
of files? Bazaar.

These features are not that far of a stretch. Efficiently handling large
and/or pre-compressed files is something a VCS can do. Letting clients
use shallow checkouts is something a VCS can do. Letting clients use a
centralized server with local commits is something a VCS can do. All of
these features can be beneficial to a wide array of projects.

I am happy to hear that Bazaar is working on these features and taking
steps forward to broaden its community. And I appreciate you taking the
time to comment. I will certainly continue to experiment with Bazaar as
well as with Maven, Archiva, and other projects. Thanks for the
discussion :)

Revision history for this message
Raynor (memphis007fr) said :
#12

I also work in game development, and I have exactly the same requirements.

Everybody who doesn't work on a rich-media project tells us that media shouldn't go in the VCS because it's not source code (maybe they think media doesn't change often, or that it doesn't cause bugs).
I completely disagree, because:
- Media can change very often, and the source code or scripts are closely bound to that media.
- If you update to an older revision without the media that originally belonged to that revision, the whole VCS becomes useless, even for the code.
- We also need revert, commit, and all those tools on media, and more importantly log, because we need to see the change to the media that caused a bug or an artifact.

Right now we are using svn because:
- It handles 30 GB repositories without slowing down.
- It can work on big files without eating memory.
- It handles partial checkouts.
- It handles externals and subdirectory merges.
- It has a great GUI (TortoiseSVN).

But svn has some problems too:
- The merge is awful.
- Externals are not as automated as Git's (you need to commit in two steps and edit the property to change the external's revision).
- With externals, TortoiseSVN doesn't show changes to the sub-project in the log, just the change to the externals property.
- A full copy of the project is stored in .svn, which doubles the working-copy size and number of files. That badly hurts my hard drive on Windows. But the next version of svn will store it in a compressed database, so it will be better.

I think svn is the only solution in game development. I tried bzr and git; they didn't fit. Other big game studios use Perforce because it handles 100 GB+ projects, but that solution is too expensive for my company.

I think all these requirements are common to media-rich projects, but they are left aside because such projects are rare in the open-source world. Yet those projects account for a large portion of the commercial market (for example, video games are a $104 billion market).

Anyway, I want to thank you, because bzr is a great project and is open to all kinds of use (unlike git and mercurial, which are targeted only at open-source developers).

Revision history for this message
Raynor (memphis007fr) said :
#13

Sorry, video games are a $50 billion market if you count just the software.