snap build failures on riscv64

Asked by Stéphane Graber

We've recently enabled riscv64 building for the latest/edge snap of LXD, but so far we've yet to see a single build manage to pull all our build artifacts from GitHub.

The snap is at https://code.launchpad.net/~ubuntu-lxc/+snap/lxd-latest-edge

Most recent failures are:
https://launchpadlibrarian.net/569777968/buildlog_snap_ubuntu_focal_riscv64_lxd-latest-edge_BUILDING.txt.gz
https://launchpadlibrarian.net/569637349/buildlog_snap_ubuntu_focal_riscv64_lxd-latest-edge_BUILDING.txt.gz

Both are failing to clone https://github.com/tianocore/edk2 from GitHub. I believe that's the same failure we've seen in every build so far (maybe 10 or so builds were attempted).

Question information

Language: English
Status: Open
For: Launchpad itself
Assignee: Colin Watson
Revision history for this message
Stéphane Graber (stgraber) said :
#2

This is completely unrelated; this is a build infrastructure issue and has nothing to do with code hosting.

The ticket was opened at the direct request of the Launchpad team so that they can track this internally.

Revision history for this message
Max Walters (mdwalters) said :
#3

Can you add some more information on this? Also, did you switch the version control system from Bazaar to Git?

Revision history for this message
Heinrich Schuchardt (xypron) said :
#4

@mdwalters124:

The appended complete buildlog indicates that the build machine was riscv64-qemu-lcy01-014.
A command to clone an external git repository failed because the connection was interrupted.

     git clone https://github.com/tianocore/edk2 . -b edk2-stable202108

As the Git server is easily reachable from a developer machine: why is this happening repeatedly on our build farm?

Revision history for this message
Heinrich Schuchardt (xypron) said (last edit ):
#5

To debug the GnuTLS activity during the clone operation, the following environment variables can be used:

export GIT_TRACE_PACKET=1
export GIT_TRACE=1
export GIT_CURL_VERBOSE=1
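
For example, one could re-run the failing clone with these variables set and capture the trace for inspection (a sketch; the command and branch are taken from the build log above, the log file name is arbitrary):

     GIT_TRACE=1 GIT_TRACE_PACKET=1 GIT_CURL_VERBOSE=1 \
         git clone https://github.com/tianocore/edk2 . -b edk2-stable202108 2> git-trace.log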

Reportedly the same problem was observed outside Canonical with an inappropriate setting of http.postBuffer, which defaults to 1 MiB:

git config --global http.postBuffer 524288000

To reduce traffic, --depth=1 could be added to the git clone statement.
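
Applied to the command above, that would look like, for example:

     git clone --depth=1 https://github.com/tianocore/edk2 . -b edk2-stable202108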

Revision history for this message
Max Walters (mdwalters) said (last edit ):
#6

Hi, I think I have the answer:
1. You may have used the wrong URL. What you were trying to point Git at was the repository's website, which is the link meant for web browsers. On your GitHub repository you may not have looked at the green Code button. It will show you the correct link, which should end with the extension *.git, which Git can read.
2. Maybe your internet connection dropped... Try cloning another repo; if that works, try adding the *.git extension.
And besides, the URL you entered says "Not found". I think it wasn't your internet connection, but rather the URL you tried to point Git at. I'd suggest you use this URL: https://github.com/tianocore/edk2.git.

Revision history for this message
Max Walters (mdwalters) said :
#7

Also, I looked up what riscv64 is, and it appears to be a port of OpenBSD for RISC-V systems. What version of riscv64 are you using?

Revision history for this message
Dimitri John Ledkov (xnox) said :
#8

@mdwalters124 https://launchpad.net/ubuntu/focal/riscv64 the Ubuntu port obviously.

Revision history for this message
Dimitri John Ledkov (xnox) said :
#9

@mdwalters124 this request is for Launchpad Admins to process. Are you part of the Launchpad Admin team?

Revision history for this message
Dimitri John Ledkov (xnox) said (last edit ):
#10

I have attempted to skip cloning edk2 to get the build further

https://launchpad.net/~xnox/+snap/any-riscv64/+build/1591895

and I managed to get it to effectively time out after 3.25 hours of building:

[23/Nov/2021:21:43:10 +0000] "GET http://ftpmaster.internal/ubuntu/pool/main/n/nano/nano_4.8-1ubuntu1_riscv64.deb HTTP/1.1" 407 2146 "-" "Debian APT-HTTP/1.3 (2.0.6)"

Err nano_4.8-1ubuntu1_riscv64.deb

  407 Proxy Authentication Required [IP: 10.10.10.1 8222]

Given how long build times are on riscv64, is it possible to increase the timeout on the proxy authentication tokens for riscv64 builders?

Revision history for this message
Stéphane Graber (stgraber) said :
#11

Ah yeah, the proxy auth token timing out would make sense. I guess if it were to expire halfway through downloading the Git artifacts from GitHub, it could cause what we're seeing too. That would imply pretty damn near identical timing, though, for it to always hit at the same spot ;)

Revision history for this message
Stéphane Graber (stgraber) said :
#12

It actually got worse now: the riscv64 builders can't talk to the snap store.

https://launchpadlibrarian.net/573567430/buildlog_snap_ubuntu_focal_riscv64_lxd-latest-edge_BUILDING.txt.gz

error: unable to contact snap store
Install failed

This isn't a race or something like that; we've seen this on a number of builds in a row now.

Revision history for this message
Dimitri John Ledkov (xnox) said :
#13

Same thing when building subiquity: https://launchpad.net/~canonical-foundations/+snap/subiquity

Setting up the snapd package fails to contact the snap store, so the snapcraft build is never even attempted.

Revision history for this message
Dimitri John Ledkov (xnox) said :
#14

At some point in November / early December we were able to build riscv64 snaps of subiquity, but not any more.

Revision history for this message
Colin Watson (cjwatson) said :
#15

Snap store connectivity has been restored - it was a routing problem.

I'm not sure whether this completely solves the original issue, though.

Revision history for this message
Stéphane Graber (stgraber) said :
#16

Thanks, the snap store connectivity indeed looks good.
Building LXD still won't work, though, as the proxy tokens only last 3 hours and our build on riscv64 appears to take significantly longer than that.

https://code.launchpad.net/~ubuntu-lxc/+snap/lxd-latest-edge/+build/1640358

So to recap, so far it looks like we've sorted out:
 - Odd routing/network issues causing weird failures
 - Recent store access breakage

But we're still left with:
 - Snap builds take so long the proxy token expires before we're done

I don't know if the proxy access tokens can be special-cased per architecture, but if they can, I'd probably suggest doubling the riscv64 ones. 6 hours ought to be enough to get a build ;)

Revision history for this message
Colin Watson (cjwatson) said :
#17

I agree that architecture-dependent proxy timeouts would be sensible, but unfortunately it's currently impossible to do that without quite extensive rearrangements. Gory details:

 * The timeout is not per-token: instead, the builder proxy's database records the timestamp at which the token was issued, and the queries for retrieving valid tokens and deleting invalid tokens simply check for tokens issued earlier than the current time minus the service-wide token timeout. (With hindsight it's easy to see that this was an early design error; we should have explicitly recorded both the issue and expiry times for each token.) A rough sketch of the current scheme is shown after this list.
 * We therefore need to make a database schema change in order to fix this.
 * Unfortunately, the builder proxy has a very basic database setup, and currently has no facility for schema migrations. Therefore we'd need to start by retrofitting such a capability. Of course this would be a good thing to do, but it makes it a somewhat painful job. (Honestly, I'd sort of like to start by moving it from SQLite to PostgreSQL, since that would let us eliminate a single point of failure.)
 * Once we do this, we'd then need to extend the token creation API used by Launchpad to specify a timeout; it would then be possible to devise a way to configure architecture-specific timeouts.
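
As mentioned above, here is a rough sketch of that scheme, for illustration only; the database file, table, and column names are made up for the example rather than taken from the real builder proxy schema. The point is simply that expiry is applied in the queries using a single service-wide timeout (3 hours, i.e. 10800 seconds) rather than a per-token expiry column:

     # Hypothetical sketch only; file/table/column names are assumptions.
     # Tokens only record when they were issued; expired ones are removed by
     # comparing against "now minus the service-wide timeout":
     sqlite3 proxy.db "DELETE FROM token WHERE issued_at < strftime('%s','now') - 10800;"
     # A token counts as valid purely because it was issued recently enough:
     sqlite3 proxy.db "SELECT secret FROM token WHERE issued_at >= strftime('%s','now') - 10800;"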

Revision history for this message
Dimitri John Ledkov (xnox) said :
#18

Separately we have been working on reordering how snapcraft does pull & stage, such that all the pulls are done ahead of time.

This can be done by setting:
_byarch {'riscv64': {'snapcraft': 'latest/stable/6.0.1'}}

This improves things and we get to pull most parts, but it breaks at cloning qemu.

1) Currently qemu is pulled from GitLab with an override-pull that executes a non-recursive git clone (sketched after this list). This doesn't work because the proxy details during `snapcraft pull` with this branch of snapcraft somehow are not there.

2) Switching the qemu pull from override-pull to the normal git source type appears to eventually break because it clones a lot of repositories, which I am guessing are not needed.
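
For reference, the override-pull mentioned in 1) boils down to a shell command along these lines; this is a hypothetical sketch rather than the exact snapcraft.yaml contents, and the URL is illustrative:

     # Hypothetical override-pull body: clone qemu without submodules
     # (git does not fetch submodules unless asked to); assuming override-pull
     # runs in the part's source directory, clone into ".".
     git clone https://gitlab.com/qemu-project/qemu.git .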

I wonder if we can get away without building qemu for riscv64 for lxd, for now.

See my attempts at https://launchpad.net/~xnox/+snap/any-riscv64

1) override-pull failure https://launchpad.net/~xnox/+snap/any-riscv64/+build/1640963
2) source-git failure https://launchpad.net/~xnox/+snap/any-riscv64/+build/1640976

Revision history for this message
Stéphane Graber (stgraber) said :
#19

Probably, as LXD only supports KVM-based virtualization and that's still very early in development for riscv64, with little to no hardware support right now.

Revision history for this message
Dimitri John Ledkov (xnox) said :
#20

Bugs:

https://bugs.launchpad.net/snapcraft/+bug/1957767 => snapcraft syntax doesn't support a non-recursive git clone (which is desired for the qemu part), and when using pull-before-build (6.0.1), override-pull appears to have an incorrect environment lacking the proxy settings.
Skipping the qemu part's build makes the overall build go further.

At around the 4 hour build-time mark, parts using the meson plugin fail because meson is not pulled during the pull stage, only during build: https://bugs.launchpad.net/snapcraft/+bug/1957766

See https://launchpad.net/~xnox/+snap/any-riscv64/+build/1641131

I can try reordering the parts so that a meson part comes first in snapcraft.yaml...

Revision history for this message
Dimitri John Ledkov (xnox) said :
#21

I have pushed a change so that the meson part is pulled earlier; however, things are still not quite right:

https://launchpad.net/~xnox/+snap/any-riscv64/+build/1642588

There are now 14 instances of "Cleaning later steps and re-pulling" for various things, which seems to run counter to the expected behaviour of pulling everything first and then building, without re-pulling.

If we cannot increase the global timeout for network access, it seems like we might need to split individual parts into individual snaps, and then use stage-snap a lot of times to assemble the final LXD snap, which is quite annoying. Or build the LXD riscv64 snap outside of Launchpad.

Revision history for this message
Stéphane Graber (stgraber) said :
#22

Hmm, splitting into multiple snaps and using stage-snap would get annoying very very quickly...
We have a lot of parts and we frequently upgrade specific parts to newer versions in specific channels/tracks before rolling that out to the rest of our channels and tracks.

So we'd pretty much need each of those parts to match our track and channel layout, automate the building, testing and validation of all of them, and make the main LXD snap builds themselves dependent on all the parts being in a consistent state (we don't want differences between architectures).

That would be a LOT of work on our side... As for building outside of Launchpad, that's definitely an option. We do have two reasonably large riscv64 VMs running inside of AMD EPYC VMs that we've set up for some testing and image building. It wouldn't be too hard to run snapcraft on those and manually upload.

But doing so wouldn't be suitable for general consumption as it wouldn't be built on Launchpad and also would be built far less frequently than our normal builds (as it'd likely be pretty manual, at least for the upload side of things). So if we go with this approach, it would be limited to the latest/edge channel.

As I believe our goal here is to have LXD riscv64 as a stable snap in both latest/stable and 5.0/stable for the 22.04 release, we pretty much need it to build properly on Launchpad. So that gets us back to our usual set of options:
 - Get the build to take less than 3 hours
 - Get Launchpad to issue proxy tokens for more than 3 hours
 - Get snapcraft to never hit network after the initial pull stage

What would help would be to know how long this build would actually take (and whether it builds properly at all).
Knowing that, we'd have an idea of how close we are to the proxy token expiry.
