stuck git import for a few days

Asked by Peter Sabaini

The Git import for the ceph-mon charm has been stuck for a few days (since 2023-10-02):

https://code.launchpad.net/~openstack-charmers/charm-ceph-mon/+git/charm-ceph-mon

The logs seem to indicate a timeout:

2023-10-02 15:33:18 INFO Starting job.
2023-10-02 15:33:18 INFO Getting existing repository from hosting service.
2023-10-02 15:33:36 INFO remote: Counting objects: 100% (3189/3189)
2023-10-02 15:33:36 INFO remote: Counting objects: 100% (3189/3189), done.
2023-10-02 15:33:37 INFO remote: Compressing objects: 100% (3091/3091)
2023-10-02 15:33:37 INFO remote: Compressing objects: 100% (3091/3091), done.
2023-10-02 15:33:37 INFO Receiving objects: 99% (27243/27518), 20.93 MiB | 41.28 MiB/s
2023-10-02 15:33:37 INFO remote: Total 27518 (delta 2323), reused 122 (delta 96)
2023-10-02 15:33:37 INFO Receiving objects: 100% (27518/27518), 20.93 MiB | 41.28 MiB/s
2023-10-02 15:33:37 INFO Receiving objects: 100% (27518/27518), 21.51 MiB | 23.34 MiB/s, done.
2023-10-02 15:33:42 INFO Resolving deltas: 100% (16470/16470)
2023-10-02 15:33:42 INFO Resolving deltas: 100% (16470/16470), done.
2023-10-02 15:33:42 INFO Fetching remote repository.
Traceback (most recent call last):
  File "/srv/lp-codeimport/payloads/dfee8526b29e18a92919b26fe4a9b0587e7691ef-bionic/scripts/code-import-worker.py", line 112, in <module>
    sys.exit(script.main())
  File "/srv/lp-codeimport/payloads/dfee8526b29e18a92919b26fe4a9b0587e7691ef-bionic/scripts/code-import-worker.py", line 107, in main
    return import_worker.run()
  File "/srv/lp-codeimport/payloads/dfee8526b29e18a92919b26fe4a9b0587e7691ef-bionic/lib/lp/codehosting/codeimport/worker.py", line 581, in run
    return self._doImport()
  File "/srv/lp-codeimport/payloads/dfee8526b29e18a92919b26fe4a9b0587e7691ef-bionic/lib/lp/codehosting/codeimport/worker.py", line 1197, in _doImport
    cwd="repository")
  File "/srv/lp-codeimport/payloads/dfee8526b29e18a92919b26fe4a9b0587e7691ef-bionic/lib/lp/codehosting/codeimport/worker.py", line 1080, in _runGit
    for line in self._throttleProgress(git_process.stdout):
  File "/srv/lp-codeimport/payloads/dfee8526b29e18a92919b26fe4a9b0587e7691ef-bionic/lib/lp/codehosting/codeimport/worker.py", line 1037, in _throttleProgress
    buffered, timeout=timeout):
  File "/srv/lp-codeimport/payloads/dfee8526b29e18a92919b26fe4a9b0587e7691ef-bionic/lib/lp/codehosting/codeimport/worker.py", line 1056, in _throttleProgress
    line = next(wrapped_file)
KeyboardInterrupt
Import failed:
Traceback (most recent call last):
Failure: twisted.internet.error.TimeoutError: User timeout caused connection failure.
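
For readers unfamiliar with this failure mode: the KeyboardInterrupt inside _throttleProgress suggests the worker was interrupted while blocked waiting for more output from git. A minimal Python sketch of that kind of inactivity watchdog follows. It is not Launchpad's code; run_with_inactivity_timeout and the timeout values are hypothetical, chosen only for illustration.

```python
import subprocess
import sys
import threading


def run_with_inactivity_timeout(cmd, timeout):
    """Run cmd and return (exit_code, lines, timed_out).

    Hypothetical helper: the child is killed if it produces no output
    for `timeout` seconds, which mimics an inactivity watchdog.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    lines = []
    timed_out = threading.Event()

    def kill():
        # Runs on the timer thread when the child has been silent too long.
        timed_out.set()
        proc.kill()

    timer = threading.Timer(timeout, kill)
    timer.start()
    try:
        for line in proc.stdout:
            # Any output counts as progress, so restart the watchdog.
            timer.cancel()
            lines.append(line.rstrip("\n"))
            timer = threading.Timer(timeout, kill)
            timer.start()
    finally:
        timer.cancel()
    proc.stdout.close()
    return proc.wait(), lines, timed_out.is_set()


if __name__ == "__main__":
    # A child that prints once and then goes silent gets killed.
    silent = [sys.executable, "-c",
              "import time; print('start', flush=True); time.sleep(30)"]
    print(run_with_inactivity_timeout(silent, 2))
```

As a usage example, run_with_inactivity_timeout(["git", "fetch", "--progress", "origin"], 600) would kill a fetch that stays silent for ten minutes; the remote name and the 600-second limit are placeholders.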

However, a manual clone from https://opendev.org/openstack/charm-ceph-mon.git works just fine:

git clone https://opendev.org/openstack/charm-ceph-mon.git
Cloning into 'charm-ceph-mon'...
remote: Enumerating objects: 7608, done.
remote: Counting objects: 100% (3003/3003), done.
remote: Compressing objects: 100% (789/789), done.
remote: Total 7608 (delta 2848), reused 2214 (delta 2214), pack-reused 4605
Receiving objects: 100% (7608/7608), 1.70 MiB | 537.00 KiB/s, done.
Resolving deltas: 100% (4974/4974), done.

I've also triggered an import manually, and it seems to be stuck too.

Would you be able to help out?

Question information

Language: English
Status: Solved
For: Launchpad itself
Assignee: Guruprasad
Solved by: Guruprasad

Ines Almeida (ines-almeida) said :
#3

Hi Peter,

I see that this code import was tried a few times. Each time, the same error appeared on a different code-import worker, and those workers are working fine for other projects. So this seems oddly specific to this particular import. I'm not sure how we can help at the moment, or whether this is a Launchpad issue at all.

It was working on the 2nd of October (this Monday), then it started failing. I'm guessing the Launchpad configuration for the code import has remained the same?

I also noticed that there was exactly 1 commit between things working vs. not working, though I don't see how that could be the root of the issue.

I'll ask whether anyone else in the team has other ideas.

Peter Sabaini (peter-sabaini) said :
#4

Ines,

Indeed, the Launchpad configuration hasn't changed.

FTR, this is the MR where things stopped working: https://review.opendev.org/c/openstack/charm-ceph-mon/+/897011

It's a one-line change to a parameter; I can't see how this could affect things.

Jürgen Gmach (jugmac00) said :
#5

Ines Almeida (ines-almeida) said :
#6

Small update here, I tried doing the same code import in our qastaging environment and it succeeded: https://code.qastaging.launchpad.net/~ines-almeida/test-project-ines/+git/test-project-ines

This is further evidence that the issue is specific to production. Since the configuration didn't change, it shouldn't be due to the production configuration either.

Peter Sabaini (peter-sabaini) said :
#7

@Ines, thanks for the update.

@Juergen, I see the import was retried and failed at 2023-10-09 06:52:34
http://launchpadlibrarian.net/691047798/openstack-charmers-charm-ceph-mon-+git-charm-ceph-mon.log

Colin Watson (cjwatson) said :
#8

The import seems to be running into a deadlock between "git-remote-https" and "git fetch-pack". It looks quite like https://github.com/git/git/commit/b37fd14beb39b9f545bd72e42e1bdbb00bad4b3d, and I wonder if we should try cherry-picking that patch into the git backport that we're running.
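
As a toy illustration of the general shape of such a hang (two processes joined by pipes, each waiting for the other to act), here is a Python sketch. It does not reproduce the git bug itself; CHILD and talk_to_child are hypothetical names, and the timeout exists only so the demonstration terminates.

```python
import subprocess
import sys

# Toy child: it produces nothing until it has read one line from stdin.
CHILD = "import sys; line = sys.stdin.readline(); print('got ' + line.strip())"


def talk_to_child(send_request, timeout=2.0):
    """Return the child's reply, or "stuck" if neither side makes progress."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)
    try:
        if send_request:
            # Healthy case: we write, the child answers and exits.
            out, _ = proc.communicate("hello\n", timeout=timeout)
            return out.strip()
        # Stuck case: we wait for the child while it waits for our input.
        proc.wait(timeout=timeout)
        return "child exited"
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.communicate()  # reap the child and close the pipes
        return "stuck"


if __name__ == "__main__":
    print(talk_to_child(send_request=True, timeout=30))
    print(talk_to_child(send_request=False, timeout=2))
```

Without the timeout, the second call would wait forever, which is why an external watchdog is the only thing that ends a hang of this kind.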

Peter Sabaini (peter-sabaini) said :
#9

Interesting!

The linked commit says this happens "when the server side prematurely throws an error and disconnects". Does this mean we're running into errors when fetching from upstream?

FWIW, I'm of course very much +1 on trying that cherry-pick if it helps us get unstuck.

Thanks!

Colin Watson (cjwatson) said :
#10

I didn't see any such errors, but it's always possible that upstream's analysis of the exact set of situations that can result in this bug is incomplete. (Alternatively, my educated guess could be wrong.)

Jürgen Gmach (jugmac00) said :
#11

Guruprasad (lgp171188) said :
#12

I backported what looked like a relevant upstream fix to the bionic package that we are using in the code import workers. But that didn't solve the problem, and the code import is still timing out. Colin from my team has suggested backporting the jammy version of git to bionic (if that is possible at all) to see whether that resolves this issue. So fixing this is going to take more time.

Peter Sabaini (peter-sabaini) said :
#13

Hey @Guruprasad, thanks for the update (and bummer about that fix).

Just for avoidance of doubt, is the git backport jammy->bionic something you're considering implementing?

Cheers!

Guruprasad (lgp171188) said :
#14

Hi Peter, yes, I am going to try backporting the jammy version of git to bionic, if that is possible at all.

Billy Olsen (billy-olsen) said :
#15

Any thoughts on deleting the repository and re-importing it? Is that a potential option that might get things working again?

Colin Watson (cjwatson) said :
#16

Billy, no, that won't help given that we were able to reproduce the failure in qastaging.

Guruprasad (lgp171188) said :
#17

I backported jammy's git package to bionic, upgraded the qastaging codeimport workers to use it, and retried the import of the same repository (https://code.qastaging.launchpad.net/~ines-almeida/test-project-ines/+git/test-project-ines) on the qastaging instance. After multiple failures similar to what we saw in the production environment before today's upgrade, the import succeeded on the first try. Let us keep it running for a few more iterations to confirm that this is not a random success and then upgrade the production codeimport workers to use this version.
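
When rolling out a backport like this, it can help to confirm which git a given worker actually runs. The Python sketch below parses the output of `git --version` and compares it with a minimum. The function names are hypothetical, and the (2, 34, 1) threshold is an assumption about the version jammy ships, not something stated in this thread; adjust it to the version of the backported package.

```python
import re
import subprocess

# Assumed minimum (believed to be jammy's git version); adjust as needed.
MINIMUM_GIT = (2, 34, 1)


def parse_git_version(text):
    """Turn 'git version 2.34.1' into the tuple (2, 34, 1)."""
    match = re.search(r"git version (\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError("unrecognised git version string: %r" % text)
    # A missing patch component is treated as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def installed_git_version():
    """Ask the git found on PATH for its version."""
    result = subprocess.run(
        ["git", "--version"], capture_output=True, text=True, check=True)
    return parse_git_version(result.stdout)


if __name__ == "__main__":
    version = installed_git_version()
    status = "ok" if version >= MINIMUM_GIT else "older than expected"
    print("git %d.%d.%d: %s" % (version + (status,)))
```

Tuples compare element by element, so (2, 17, 1) sorts below (2, 34, 1) without any string handling.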

Felipe Reyes (freyes) said :
#18

> After multiple failures similar to what we saw in the production environment before today's upgrade, the import succeeded on the first try.

That's great news. Thanks for working on this issue, Guruprasad.

> Let us keep it running for a few more iterations to confirm that this is not a random success and then upgrade the production codeimport workers to use this version.

Do you think we could get this rolled out to production before the end of this week?

Best Guruprasad (lgp171188) said :
#19

I have upgraded the production workers now and retried the import and it worked!

Peter Sabaini (peter-sabaini) said :
#20

Fantastic, thanks a ton Guruprasad!

Peter Sabaini (peter-sabaini) said :
#21

Thanks Guruprasad, that solved my question.