multi backend

Bug #423988 reported by Plamen K. Kosseff
This bug affects 5 people
Affects: Duplicity
Status: Fix Released
Importance: Wishlist
Assigned to: Unassigned
Milestone: —

Bug Description

The use case: I have a substantial amount of data to back up, so it is not an option to make a local copy first and then sync it with the remote storage.

The attached backend accepts a URL of the form multi://anything/<ssh://user:pass@host//path><file:///path><s3+http://bucket/path/><cf+http://container/> and executes every action on all of the real backends.

Uploads are done concurrently.
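
To give a rough idea of the shape (this is only a simplified sketch, not the attached code; open_backend() is a made-up factory and the sub-backends are assumed to expose put()/get()):

import re
from concurrent.futures import ThreadPoolExecutor

def split_multi_url(url):
    """Split 'multi://anything/<url1><url2>...' into its component URLs."""
    rest = url.split("multi://", 1)[-1]
    return re.findall(r"<([^>]+)>", rest)

class MultiBackend:
    def __init__(self, url, open_backend):
        # open_backend is a made-up factory that turns a plain URL into a backend
        self.backends = [open_backend(u) for u in split_multi_url(url)]

    def put(self, source_path, remote_filename):
        # run the upload against every real backend concurrently
        with ThreadPoolExecutor(max_workers=len(self.backends)) as pool:
            futures = [pool.submit(b.put, source_path, remote_filename)
                       for b in self.backends]
            for f in futures:
                f.result()  # re-raise the first upload error, if any

    def get(self, remote_filename, local_path):
        # downloads always use the first backend in the URL
        return self.backends[0].get(remote_filename, local_path)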

The code has reached the "works-for-me" level, but if you are interested in including it in duplicity I will continue to work on it.

There are some problems that need to be addressed:
- the URL scheme is ugly; I'm open to suggestions.
- passwords will be leaked via ps
- currently it is possible to have multiple concurrent put operations on the same backend if you have, for example, multiple <ssh://user:pass@host//path> entries.
- downloads will always use the first backend in the URL (I'm not sure if this is a problem).
- there should be a utility to check whether all backends are in sync and synchronize them if needed.

P.S. Python is not my native language, so if you see something that is not very Pythonic or is plain stupid, don't hesitate to tell me :)

Tags: backend patch
Revision history for this message
Plamen K. Kosseff (p-kosseff) wrote :
tags: removed: new
Revision history for this message
Jiri Tyr (jtyr) wrote :

I have been looking for something like this! It would be really nice to have this patch as a part of the stable version. It would save a lot of CPU time for many of my machines.

Here are my comments:
- the URL scheme is ugly; I'm open to suggestions.

I would prefer to split all destinations:
$ duplicity full /dir scp://user@host1/dest scp://user@host2/dest scp://user@host3/dest

- passwords will be leaked via ps

Use scp without a password (use SSH keys instead). The best solution is to install rssh (http://rssh.sourceforge.net/) on the destination machine to chroot the user and allow login only with an SSH key (no password).

- currently it is possible to have multiple concurrent put operations on the same backend if you have, for example, multiple <ssh://user:pass@host//path> entries.

That's OK I think.

- downloads will always use the first backend in the URL (I'm not sure if this is a problem).

It would be good to somehow disable support for multi:// in the case of a download: show an error message and exit, or something like that.

- there should be a utility to check whether all backends are in sync and synchronize them if needed.

This is not really necessary if you are using the multi:// patch.

Revision history for this message
Plamen K. Kosseff (p-kosseff) wrote :

> $ duplicity full /dir scp://user@host1/dest scp://user@host2/dest scp://user@host3/dest

Yes, this is the best way, but it will require changes to duplicity itself, which I think will take far greater effort than just a backend implementation.

> - there should be a utility to check whether all backends are in sync and synchronize them if needed.
>
> This is not really necessary if you are using the multi:// patch.

But it is, because if the upload to the second backend fails, the next time duplicity starts it will see that the file has already been uploaded, because it is available on the first backend.

Revision history for this message
Jiri Tyr (jtyr) wrote : Re: [Bug 423988] Re: multi backend

>> $ duplicity full /dir scp://user@host1/dest scp://user@host2/dest scp://user@host3/dest
>
> Yes, this is the best way, but it will require changes to duplicity
> itself, which I think will take far greater effort than just a backend
> implementation.

I know, but it is worth the work.

>> - there should be a utility to check whether all backends are in sync and synchronize them if needed.
>>
>> This is not really necessary if you are using the multi:// patch.
>
> But it is, because if the upload to the second backend fails, the
> next time duplicity starts it will see that the file has already been
> uploaded, because it is available on the first backend.

I see. But this is only an issue for non-full backups. If you do a full
backup, it doesn't matter.

---

Anyway, this is a very interesting feature which should be implemented
as soon as possible.

Revision history for this message
Plamen K. Kosseff (p-kosseff) wrote :

Hmm, maybe the easiest way is to make duplicity combine the multiple URLs on the command line into one URL that can be handled by the multi: backend; this would minimize the changes to duplicity.
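
Roughly something like this (hypothetical names, just to illustrate the combining step):

def combine_urls(urls):
    # join the plain destination URLs into one multi:// URL for the backend
    return "multi://combined/" + "".join("<%s>" % u for u in urls)

# combine_urls(["ssh://user@host1//path", "file:///mnt/backup"])
# -> "multi://combined/<ssh://user@host1//path><file:///mnt/backup>"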

It is always possible that the backends get out of sync, so there should be a mechanism for detecting and fixing that.

Anyway, the developers need to decide whether this is worth including in the application, and then I'll write the implementation that fits best with the rest of duplicity.

Revision history for this message
Jiri Tyr (jtyr) wrote :

Jiri Tyr wrote:
>>> - there should be a utility to check whether all backends are in sync and synchronize them if needed.
>>>
>>> This is not really necessary if you are using the multi:// patch.
>> But it is, because if the upload to the second backend fails, the
>> next time duplicity starts it will see that the file has already been
>> uploaded, because it is available on the first backend.
>
> I see. But this is only an issue for non-full backups. If you do a full
> backup, it doesn't matter.

I have got an idea for how to do it even if you do an incremental backup:

Before you create a new incremental backup, you check whether the
timestamp of the last backup is the same on all destinations. If so, you do
the new backup. If not, you find the last common backup and you sync the
rest. If there is no common backup, you end up with an error. You should
also consider the situation where one destination holds a full backup and
another holds an incremental backup; then you should also end up with an
error.

What do you think?
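
Just to illustrate the idea, a rough sketch of such a pre-flight check (illustrative names only, nothing from the actual patch):

def check_destinations(latest_backups):
    """latest_backups: one ("full" | "inc", unix_timestamp) pair per destination."""
    kinds = {kind for kind, _ in latest_backups}
    stamps = {stamp for _, stamp in latest_backups}

    if len(kinds) > 1:
        # one destination ends in a full backup, another in an incremental one
        raise RuntimeError("mixed full/incremental backups across destinations")
    if len(stamps) > 1:
        # destinations diverged: find the last common backup and re-upload
        # the missing volumes, or give up if there is no common backup
        return "needs-sync"
    return "in-sync"  # safe to start the new incremental backup

# check_destinations([("inc", 1254300000), ("inc", 1254300000)]) -> "in-sync"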

Revision history for this message
Jiri Tyr (jtyr) wrote :

Plamen K. Kosseff wrote:
> Hmm, maybe the easiest way is to make duplicity combine the multiple
> URLs on the command line into one URL that can be handled by the multi:
> backend; this would minimize the changes to duplicity.

The best solution is to write a patch for duplicity itself and allow it to
accept more than one destination. Then no additional backend is needed.

It should not be so difficult to implement. It's just a matter of
checking whether the last N parameters are destination definitions and
then processing all those destinations in a loop.
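
Something along these lines, purely as an illustration of the trailing-argument scan:

import re

URL_RE = re.compile(r"^[a-z0-9+.-]+://", re.IGNORECASE)

def split_destinations(argv):
    """Return (remaining args, list of trailing destination URLs)."""
    args = list(argv)
    dests = []
    while args and URL_RE.match(args[-1]):
        dests.insert(0, args.pop())
    return args, dests

# split_destinations(["full", "/dir", "scp://user@host1/dest", "scp://user@host2/dest"])
# -> (["full", "/dir"], ["scp://user@host1/dest", "scp://user@host2/dest"])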

> Anyway, the developers need to decide whether this is worth including in
> the application, and then I'll write the implementation that fits best
> with the rest of duplicity.

I think that the developers want it (see the Blueprints/redundancy).

Revision history for this message
Plamen K. Kosseff (p-kosseff) wrote :

>The best solution is to write a patch for duplicity itself and allow it to
>accept more than one destination. Then no additional backend is needed.

>It should not be so difficult to implement. It's just a matter of
>checking whether the last N parameters are destination definitions and
>then processing all those destinations in a loop

Given my experience with the par2 patch, I can safely say it is much harder than it sounds :)

Revision history for this message
Larry Gilbert (l2g) wrote :

For my taste, it would be better to use anything but a pseudo-URL scheme.

If I were to do it, I'd probably implement this feature in two parts: (1) add a new --target option for making a target explicit, and (2) have the use of more than one --target indicate that the backup will go to all given targets together.

Revision history for this message
Kenneth Loafman (kenneth-loafman) wrote :

From a clarity standpoint, the best solution is to go with an option that specifies an additional URL, e.g. "--add-url=foo". Users screw up the command line well enough on their own without complicating it more. :-)

That's the simple part... the hard parts are in keeping the cache and multiple URLs in sync, in doing checkpoint/restart, in additional asynchronism issues, and probably some I have not thought about.

I think such a thing as this would be possible to do with sufficient development and testing time, but for now, it's just a wishlist item. Solutions exist with rsync and others that solve the problem much more elegantly. I know the use case for this says no, but the cheapness of drives nowadays says yes.

Changed in duplicity:
importance: Undecided → Wishlist
Revision history for this message
Jiri Tyr (jtyr) wrote :

Kenneth Loafman wrote:
> That's the simple part... the hard parts are in keeping the cache and
> multiple URLs in sync, in doing checkpoint/restart, in additional
> asynchronism issues, and probably some I have not thought about.

Not in the case of a full backup. I propose implementing it first for
full backups only and later, if somebody has enough time, for
incremental backups. What do you think?

Cheers,
Jiri

Revision history for this message
Kenneth Loafman (kenneth-loafman) wrote :

Jiri Tyr wrote:
> Kenneth Loafman wrote:
>> That's the simple part... the hard parts are in keeping the cache and
>> multiple URLs in sync, in doing checkpoint/restart, in additional
>> asynchronism issues, and probably some I have not thought about.
>
> Not in the case of a full backup. I propose implementing it first for
> full backups only and later, if somebody has enough time, for
> incremental backups. What do you think?

If the implementation can be done with an option, --add-target or
similar, go ahead. I don't think the multi-URL format is maintainable.

...Thanks,
...Ken

Revision history for this message
edso (ed.so) wrote :

here's another idea i've been playing with in my mind for quite a while now.

why not add additional commands to duplicity like 'copy' or 'sync', e.g.

 duplicity sync ftp://srv1/path sftp://srv2/path

this would only upload _new files_ to the secondary location, effectively realizing a horcrux approach. we have all the needed get/put capability in duplicity's backends.

as duplicity's backends are dumb, we would of course need a way to identify _completely_ uploaded files. we cannot ask all of the backends for file sizes or verify them, so i'd suggest having .marker files or another checksummed manifest to keep a record of that.

this is much more flexible than the suggestions above, as it can run in parallel (multiple copy/sync operations to different backends) and can even be used outside the scope of duplicity, for users who just want to copy data from e.g. s3 to ftp or such.
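
purely as an illustration (made-up wrapper interface, not duplicity's real backend api), such a sync could look roughly like:

def sync(source, target):
    # source/target are assumed to be simple wrappers with list(),
    # download(name) -> local path, and upload(local_path, name);
    # this is only the shape of the idea, not duplicity's actual API
    markers = {f for f in source.list() if f.endswith(".done")}
    complete = {f[:-len(".done")] for f in markers}
    missing = complete - set(target.list())

    for name in sorted(missing):
        target.upload(source.download(name), name)
        # copy the marker last, so a half-finished upload is never trusted
        target.upload(source.download(name + ".done"), name + ".done")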

what do you think? ..ede

Changed in duplicity:
status: New → Fix Released