Restoring metadata from Amazon Glacier

Asked by Martin on 2017-08-25

When doing a "collection-status", duplicity does a sync_archive, and if files are missing locally it copies them from the remote.
If the backend is Amazon S3 and the files have been migrated to Glacier, duplicity restores the files one by one and has to wait hours for each file to go from Glacier to S3. So if a restore takes 6 hours and there are 50 files, the sync takes 300 hours. It would be better to trigger the restore for all files first and do the waiting afterwards.

Looking into the code I find the following in duplicity in function sync_archive around line 1185:
            if hasattr(globals.backend, 'pre_process_download'):
                globals.backend.pre_process_download(local_missing)
This would be a good place for the backend to initiate the restore of all files.
But the test "hasattr(globals.backend, 'pre_process_download')" is not true. I think this is because the backend is a duplicity.backend.BackendWrapper so it should read globals.backend.backend.

Also I see that "local_missing" is a list of files, whereas the function "def pre_process_download(self, remote_filename, wait=False)" in _boto_single expects a single filename, not a list.

This is the reason why this piece of code is not working.
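
The two problems described above can be reproduced with stand-in classes. This is only a sketch: FakeBotoBackend and FakeWrapper are hypothetical names, not duplicity's real classes, and the sketch assumes (as the question suspects) that the wrapper keeps the real backend in a .backend attribute without forwarding the method.

```python
class FakeBotoBackend(object):
    """Stand-in for the boto backend; not duplicity's real class."""
    key_prefix = "backup/"

    def pre_process_download(self, remote_filename, wait=False):
        # Expects ONE filename: str + list raises TypeError.
        return self.key_prefix + remote_filename


class FakeWrapper(object):
    """Stand-in for a wrapper that holds the real backend in .backend."""

    def __init__(self, backend):
        self.backend = backend


wrapped = FakeWrapper(FakeBotoBackend())
local_missing = ["a.manifest", "b.sigtar"]

# The hasattr test on the wrapper fails, on the wrapped backend it succeeds.
on_wrapper = hasattr(wrapped, "pre_process_download")
on_backend = hasattr(wrapped.backend, "pre_process_download")

# Passing the whole list to a method written for one filename fails.
try:
    wrapped.backend.pre_process_download(local_missing)
    list_accepted = True
except TypeError:
    list_accepted = False
```

So even with the hasattr test pointed at the right object, the call would still need a loop over local_missing.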

Question information

Language: English
Status: Answered
For: Duplicity
Assignee: No assignee
Last query: 2017-08-25
Last reply: 2017-08-29

Martin (martin3000) said : #1

For testing, I inserted a second loop for restoring files in duplicity.sync_archive, around line 1185:

            for fn in local_missing:
                globals.backend.backend.pre_process_download(fn,wait=False) # restore from S3
            for fn in local_missing:
                copy_to_local(fn)

This seems to do it.

Martin (martin3000) said : #2

This is better:

            if hasattr(globals.backend.backend, 'pre_process_download'):
                for fn in local_missing:
                    globals.backend.backend.pre_process_download(fn,wait=False) # restore from S3

            for fn in local_missing:
                copy_to_local(fn)

edso (ed.so) said : #3

Martin,

there are multiple calls to backend.pre_process_download() in bin/duplicity. so it might make sense to rework BackendWrapper accordingly.

BackendWrapper is a kind of generalizing class for backends providing eg. put(), which is implemented in the backends as private _put() method. so
1. adding a pre_process_download() to the wrapper,
2. making it work with single entries and lists and
3. renaming / running the underlying _pre_process_download()
should be sufficient here. boto seems to be the only instance using it currently.

renaming the method to pre_get() or pre_process_get() would probably streamline the general naming.

can you do the changes and generate a diff patch? or even better setup a bazaar branch, it's not that difficult.

..ede/duply.net
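
The three steps suggested in #3 could look roughly like the sketch below. SketchWrapper, RecordingBackend and PlainBackend are hypothetical stand-ins, not duplicity code, and the private name _pre_process_download is the rename proposed above, not an existing method. The sketch uses Python 3's str; a Python 2 code base would test against basestring instead.

```python
class SketchWrapper(object):
    """Hypothetical stand-in for BackendWrapper; names are illustrative."""

    def __init__(self, backend):
        self.backend = backend

    def pre_process_download(self, remote_filenames, wait=False):
        # Accept a single filename or a list of filenames.
        if isinstance(remote_filenames, str):
            remote_filenames = [remote_filenames]
        # Backends without the private hook need no preparation.
        hook = getattr(self.backend, "_pre_process_download", None)
        if hook is None:
            return
        for filename in remote_filenames:
            hook(filename, wait=wait)


class RecordingBackend(object):
    """Fake backend that records which restores were requested."""

    def __init__(self):
        self.requested = []

    def _pre_process_download(self, remote_filename, wait=False):
        self.requested.append((remote_filename, wait))


class PlainBackend(object):
    """Fake backend without a pre-download hook."""


backend = RecordingBackend()
wrapper = SketchWrapper(backend)
wrapper.pre_process_download("vol1.difftar.gpg")
wrapper.pre_process_download(["vol2.difftar.gpg", "vol3.difftar.gpg"])

# A backend without the hook is silently skipped.
SketchWrapper(PlainBackend()).pre_process_download(["x"])
```

With the check inside the wrapper, callers would no longer need their own hasattr test or their own loop over the list.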

Martin (martin3000) said : #4

It is the first time I look here. I can start an editor and understand python, but I have no idea how to create patches and I don't know what a bazaar is :-)

Martin (martin3000) said : #5

Also I have found the following problem:
_boto_single assumes that objects in class "GLACIER" cannot be downloaded ("if key.storage_class == "GLACIER":").
This is not true.
http://docs.aws.amazon.com/AmazonS3/latest/dev/restoring-objects.html states:
        After you receive a temporary copy of the restored object, the object's storage class remains GLACIER
        (a GET or HEAD request will return GLACIER as the storage class).
So it is not a good idea to force objects back to S3.

The correct way for _boto_single.pre_process_download to do it:

if key.storage_class == "GLACIER":
    if key.ongoing_restore:
        pass  # restore already running: wait or ignore
    elif key.expiry_date:
        pass  # restore finished, temp copy available
    else:
        key.restore(days=2)

 See also http://boto.cloudhackers.com/en/latest/s3_tut.html

edso (ed.so) said : #6

On 25.08.2017 15:18, Martin wrote:
> It is the first time I look here. I can start an editor and understand
> python, but I have no idea how to create patches and I don't know what a
> bazaar is :-)

how about you use the latest sources as a start, patch/test them and send me the result as a compressed archive? i could do the rest and suggest it for merging if the changes look good.

how to run a devl duplicity is described in README-REPO.
  http://bazaar.launchpad.net/~duplicity-team/duplicity/0.8-series/view/head:/README-REPO
it also describes howto download from bazaar. howto upload a bazaar branch is described on launchpad.

..ede/duply.net

edso (ed.so) said : #7

good catch. again, you are welcome to fix those. i, unfortunately do not use s3 nor have i the need to, but would be willing to help you get your changes committed ;)

..ede/duply.net

Martin (martin3000) said : #8

This is my new _boto_single.py.pre_process_download:

    def pre_process_download(self, remote_filename, wait=False):
        # Used primarily to restore files in Glacier
        key_name = self.key_prefix + remote_filename
        if not self._listed_keys.get(key_name, False):
            self._listed_keys[key_name] = list(self.bucket.list(key_name))[0]
        key = self._listed_keys[key_name]
        key2 = self.bucket.get_key(key.key) #why do we need key2?

        if key2.storage_class == "GLACIER":
            if not key2.expiry_date: # no temp copy avail
                if not key2.ongoing_restore:
                    log.Info("File %s is in Glacier storage, restoring" % remote_filename)
                    key.restore(days=2) # Shouldn't need this again after 2 days
                if wait:
                    log.Info("Waiting for file %s to restore in Glacier" % remote_filename)
                    while not key2.expiry_date:
                        time.sleep(60)
                        self.resetConnection()
                        # re-fetch the key, otherwise expiry_date never changes and the loop never ends
                        key2 = self.bucket.get_key(key.key)
                    log.Info("File %s was successfully restored in Glacier" % remote_filename)

Martin (martin3000) said : #9

I think somebody who knows the code should have a look first.

edso (ed.so) said : #10

Martin,

the key2 business seems unnecessary. why don't you leave the method as it is and simply add key.expiry_date as a condition for not restoring? eg.

  # in deep freeze and _no_ temp copy available
  if key.storage_class == "GLACIER" and not self.bucket.get_key(key.key).expiry_date:

..ede/duply.net

Martin (martin3000) said : #11

It does not work with "key", it is necessary to create key2.

If a restore is already ongoing, it is not necessary to trigger the restore again. I think if you trigger it again, you move the object from glacier to s3.

So you can start deja-dup/duplicity, do the collection_status and resume the next morning when all files are there.

edso (ed.so) said : #12

On 25.08.2017 18:33, Martin wrote:
> It does not work with "key", it is necessary to create key2.

Martin, did you try my suggestion?

self.bucket.get_key(key.key).expiry_date

  is essentially the _same_ as

key2 = self.bucket.get_key(key.key) #why do we need key2?
key2.expiry_date

, just without adding a temporary variable!

fetching the new key is probably necessary because the connection is reset in the loop.

..ede/duply.net
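
The point about fetching a fresh key on every round can be sketched with fakes. FakeKey, FakeBucket and wait_for_restore are hypothetical names for illustration only; the one thing taken from the thread is that bucket.get_key(...).expiry_date is looked up again on each pass instead of being read from a key object fetched once before the loop.

```python
class FakeKey(object):
    """Snapshot of a key; its attributes never change after creation."""

    def __init__(self, expiry_date):
        self.expiry_date = expiry_date


class FakeBucket(object):
    """Reports a temporary copy from the `ready_after`-th get_key call on."""

    def __init__(self, ready_after):
        self.calls = 0
        self.ready_after = ready_after

    def get_key(self, name):
        self.calls += 1
        if self.calls >= self.ready_after:
            return FakeKey("Fri, 01 Sep 2017 00:00:00 GMT")
        return FakeKey(None)


def wait_for_restore(bucket, key_name, sleep, poll_seconds=60):
    """Poll until a temporary copy exists, re-fetching the key every round."""
    polls = 0
    while not bucket.get_key(key_name).expiry_date:
        sleep(poll_seconds)
        polls += 1
    return polls


bucket = FakeBucket(ready_after=3)
# A no-op sleep keeps the sketch fast; real code would pass time.sleep.
polls = wait_for_restore(bucket, "backup/vol1.difftar.gpg", sleep=lambda s: None)
```

Whether the fresh key is held in a variable or used inline makes no difference to the loop, as long as the lookup happens inside it.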

Martin (martin3000) said : #13

Of course you can replace every occurrence of key2 with "self.bucket.get_key(key.key)".
And every occurrence of key can be replaced with self._listed_keys[key_name].
So key2 can be replaced by self.bucket.get_key(self._listed_keys[key_name].key).
Or self.bucket.get_key(self._listed_keys[self.key_prefix + remote_filename].key).
But this is not faster and does not save any memory.

edso (ed.so) said : #14

well, where is your objection then? no reason to create a variable that is only used once is all i am saying.

..ede/duply.net
