Serious performance degradation of math functions

Bug #1663280 reported by Oleg Strikov
This bug affects 8 people
Affects                Status        Importance  Assigned to
GLibC                  Fix Released  Medium
glibc (Fedora)         Fix Released  Undecided
glibc (Ubuntu)         Fix Released  Medium      Matthias Klose
glibc (Ubuntu Xenial)  Fix Released  High        Daniel Axtens
glibc (Ubuntu Zesty)   Won't Fix     Medium      Unassigned

Bug Description

SRU Justification
=================

[Impact]

 * Severe performance hit on many maths-heavy workloads. For example, a user reports linpack performance of 13 Gflops on Trusty and Bionic and 3.9 Gflops on Xenial.

 * Because the impact is so large (>3x) and Xenial is supported until 2021, the fix should be backported.

 * The fix avoids an AVX-SSE transition penalty. It stops _dl_runtime_resolve() from using AVX-256 instructions that touch the upper halves of the YMM registers, so the processor no longer needs to save and restore them around subsequent SSE code.
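
 * To check whether a given system already has the fix, look at the installed libc6 version; 2.23-0ubuntu11 is the Xenial upload carrying the backport (see the changelog at the end of this report). A quick way to query it (a version check only, not a functional test):

   $ apt-cache policy libc6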

[Test Case]

Firstly, you need a suitable Intel machine. Users report that Sandy
Bridge, Ivy Bridge, Haswell, and Broadwell CPUs are affected, and I
have also been able to reproduce it on a Skylake-based Azure VM.
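
If you are not sure which CPU family a machine or VM exposes, the following standard checks show the CPU model and whether AVX is advertised (output obviously varies from system to system):

$ lscpu | grep 'Model name'
$ grep -q avx /proc/cpuinfo && echo "AVX available" || echo "no AVX"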

Create the following C file, exp.c:

#include <math.h>
#include <stdio.h>

int main () {
  double a, b;
  for (a = b = 0.0; b < 2.0; b += 0.00000005) a += exp(b);
  printf("%f\n", a);
  return 0;
}

$ gcc -O3 -march=x86-64 -o exp exp.c -lm

With the current version of glibc:

$ time ./exp
...
real 0m1.349s
user 0m1.349s

$ time LD_BIND_NOW=1 ./exp
...
real 0m0.625s
user 0m0.621s

Observe that LD_BIND_NOW=1 makes a big difference, because it avoids the lazy calls to _dl_runtime_resolve().
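
To confirm that the test binary really uses lazy binding (and therefore goes through _dl_runtime_resolve() on the first call to exp), you can inspect the dynamic section and watch the binding happen. The exact output format varies with binutils/glibc versions; an empty readelf result simply means the default lazy binding is in effect:

$ readelf -d ./exp | grep -E 'BIND_NOW|FLAGS'
$ LD_DEBUG=bindings ./exp 2>&1 | grep 'symbol `exp'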

With the proposed update:

$ time ./exp
...
real 0m0.625s
user 0m0.621s

$ time LD_BIND_NOW=1 ./exp
...

real 0m0.631s
user 0m0.631s

Observe that the normal case is faster, and LD_BIND_NOW makes a negligible difference.

[Regression Potential]

glibc is the nightmare case for regressions, as it could affect pretty much
anything, and this patch touches a key component (the dynamic loader).

We can be fairly confident in the fix itself: it is already present in the
glibc shipped in Bionic, Debian, and some RPM-based distros. The backport is
based on the patches in the release/2.23/master branch of the upstream glibc
repository, and applying them was straightforward.

Obviously that doesn't remove all risk. There is also a fair bit of
Ubuntu-specific patching in glibc, so other distros are of limited
value for ruling out regressions. So I have done the following testing, and
I'm happy to do more as required. All testing has been done:
 - on an Azure VM (affected by the change), with proposed package
 - on a local VM (not affected by the change), with proposed package

 * Boot with the upgraded libc6.

 * Watch a youtube video in Firefox over VNC.

 * Build some C code (debuild of zlib).

 * Test Java by installing and running Eclipse.

Autopkgtest also passes.

[Original Description]

Bug [0] was introduced in Glibc 2.23 [1] and fixed in Glibc 2.25 [2]. All Ubuntu versions starting from 16.04 are affected because they use either Glibc 2.23 or 2.24. The bug introduces a serious (2x-4x) performance degradation of the math functions (pow, exp/exp2/exp10, log/log2/log10, sin/cos/sincos/tan, asin/acos/atan/atan2, sinh/cosh/tanh, asinh/acosh/atanh) provided by libm. The bug can be reproduced on any AVX-capable x86-64 machine.

@strikov: According to a quite reliable source [5], all AMD CPUs and the latest Intel CPUs (Skylake and Knights Landing) don't suffer from the AVX/SSE transition penalty. This narrows the scope of the bug to the following generations of Intel CPUs: Sandy Bridge, Ivy Bridge, Haswell, and Broadwell. The scope still remains quite large, though.

@strikov: Ubuntu 16.10/17.04, which use Glibc 2.24, may receive the fix from the upstream 2.24 branch (as Marcel pointed out, the fix has been backported to the 2.24 branch, where Fedora picked it up successfully) if such a synchronization takes place. Ubuntu 16.04 (the main target of this bug) uses Glibc 2.23, which hasn't been patched upstream and will suffer from the performance degradation until we fix it manually.

This bug is all about the AVX-SSE transition penalty [3]. The 256-bit YMM registers used by AVX-256 instructions extend the 128-bit XMM registers used by SSE (XMM0 is the low half of YMM0, and so on). Every time the CPU executes an SSE instruction after an AVX-256 instruction, it has to store the upper halves of the YMM registers to an internal buffer and then restore them when execution returns to AVX instructions. The store/restore is required because old-fashioned SSE knows nothing about the upper halves of its registers and may damage them, and it is time consuming (several tens of clock cycles for each operation). To deal with this issue, Intel introduced AVX-128 instructions, which operate on the same 128-bit XMM registers as SSE but take the upper halves of the YMM registers into account, so no store/restore is required. Practically speaking, AVX-128 instructions are a smarter form of the SSE instructions which can be used together with full-size AVX-256 instructions without any penalty, and Intel recommends using AVX-128 instead of SSE wherever possible.

To sum things up: it is okay to mix SSE with AVX-128, and AVX-128 with AVX-256. Mixing AVX-128 with AVX-256 is fine because both types of instructions are aware of the 256-bit YMM registers. Mixing SSE with AVX-128 is fine because the CPU can guarantee that the upper halves of the YMM registers contain no meaningful data (there is no way to put anything there without AVX-256 instructions), so it can skip the store/restore (why care about random trash in the upper halves?). It is not okay to mix SSE with AVX-256, because of the transition penalty. The scalar floating-point instructions used by the routines mentioned above are implemented as a subset of the SSE and AVX-128 instructions: they operate on only a small fraction of a 128-bit register, but they are still considered SSE/AVX-128 instructions and suffer from the SSE/AVX transition penalty as well.
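
As a toy illustration of where the 'dirty' upper-half state comes from and how it is normally cleared (this example is not from the original report; it assumes GCC and an AVX-capable CPU, and it does not by itself reproduce the slowdown):

/* avx_mix.c -- build with: gcc -O2 -mavx -o avx_mix avx_mix.c */
#include <immintrin.h>
#include <stdio.h>

int main () {
  double v[4] = {1.0, 2.0, 3.0, 4.0};
  __m256d x = _mm256_loadu_pd(v);   /* AVX-256: writes the YMM upper halves */
  x = _mm256_add_pd(x, x);
  _mm256_storeu_pd(v, x);
  _mm256_zeroupper();               /* clears the upper halves so later SSE code
                                       pays no transition penalty; GCC normally
                                       emits vzeroupper around calls when compiling
                                       with -mavx, whereas the buggy
                                       _dl_runtime_resolve() restored full YMM
                                       registers and left the state dirty */
  printf("%f\n", v[0] + v[1] + v[2] + v[3]);
  return 0;
}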

Glibc inadvertently triggers a chain of AVX/SSE transition penalties due to inappropriate use of AVX-256 instructions inside the _dl_runtime_resolve() procedure. By using AVX-256 instructions to push/pop the YMM registers [4], Glibc makes the CPU think that the upper halves of the XMM registers contain meaningful data which needs to be preserved during execution of SSE instructions. With such a 'dirty' flag set, every switch between SSE and AVX instructions (AVX-128 or AVX-256) leads to a time-consuming store/restore procedure. This 'dirty' flag never gets cleared during the whole program execution, which leads to a serious overall slowdown. The fixed implementation [2] of _dl_runtime_resolve() tries to avoid using AVX-256 instructions if possible.

The buggy _dl_runtime_resolve() gets called every time the dynamic linker tries to lazily resolve a symbol (any symbol, not just the ones mentioned above). It is enough for _dl_runtime_resolve() to be called just once to touch the upper halves of the YMM registers and provoke the AVX/SSE transition penalty later on. It is safe to say that every dynamically linked application calls _dl_runtime_resolve() at least once, which means that all of them may experience this slowdown. The degradation shows up whenever such an application mixes AVX and SSE instructions (switches from AVX to SSE or back).
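
As a rough, version-dependent check of whether the dynamic loader on a given system still contains AVX-256 register moves in its resolver path (the path below is the usual Xenial/x86-64 location, and a non-zero count is only a coarse hint, not proof of the bug):

$ objdump -d /lib/x86_64-linux-gnu/ld-2.23.so | grep -c '%ymm'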

There are two types of math routines provided by libm:
(a) ones that have an AVX-optimized version (exp, sin/cos, tan, atan, log, and others)
(b) ones that don't have an AVX-optimized version and rely on the general-purpose SSE implementation (pow, exp2/exp10, asin/acos, sinh/cosh/tanh, asinh/acosh/atanh, and others)

For the former group, the slowdown happens when the routines are called from SSE code (i.e. from an application compiled with -mno-avx), because an SSE -> AVX transition takes place. For the latter group, the slowdown happens when the routines are called from AVX code (i.e. from an application compiled with -mavx), because an AVX -> SSE transition takes place. Both situations are realistic: SSE code is what gcc generates when targeting generic x86-64, and AVX-optimized code is what gcc -march=native generates on AVX-capable machines.

============================================================================

Let's take one routine from the group (a) and try to reproduce the slowdown.

#include <math.h>
#include <stdio.h>

int main () {
  double a, b;
  for (a = b = 0.0; b < 2.0; b += 0.00000005) a += exp(b);
  printf("%f\n", a);
  return 0;
}

$ gcc -O3 -march=x86-64 -o exp exp.c -lm

$ time ./exp
<..> 2.801s <..>

$ time LD_BIND_NOW=1 ./exp
<..> 0.660s <..>

You can see that the application demonstrates 4x better performance when _dl_runtime_resolve() doesn't get called. That's how serious the impact of the AVX/SSE transition penalty can be.

============================================================================

Let's take one routine from the group (b) and try to reproduce the slowdown.

#include <math.h>
#include <stdio.h>

int main () {
  double a, b;
  for (a = b = 0.0; b < 2.0; b += 0.00000005) a += pow(M_PI, b);
  printf("%f\n", a);
  return 0;
}

# note that -mavx option has been passed
$ gcc -O3 -march=x86-64 -mavx -o pow pow.c -lm

$ time ./pow
<..> 4.157s <..>

$ time LD_BIND_NOW=1 ./pow
<..> 2.123s <..>

You can see that the application demonstrates 2x better performance when _dl_runtime_resolve() doesn't get called.

============================================================================

[!] It's important to mention that the scope of this bug might be even wider. After a call to the buggy _dl_runtime_resolve(), any transition between AVX-128 and SSE (otherwise legitimate) will suffer from the penalty. Any application which mixes AVX-128 floating-point code with SSE floating-point code (e.g. by using an external SSE-only library) will experience a serious slowdown.
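
A minimal sketch of that wider scenario (the file names and the helper function are made up for illustration): the 'library' side is built without AVX, the caller is built with -mavx, and one early printf call is enough to let the lazy resolver mark the YMM upper halves dirty before the loop starts. On affected CPUs with the unfixed loader, the LD_BIND_NOW=1 run should be noticeably faster; exact numbers will vary.

/* sse_part.c -- stands in for an external SSE-only library */
double scale (double x) {
  return x * 1.000000001;
}

/* avx_part.c -- application code built with -mavx */
#include <stdio.h>

double scale (double x);

int main () {
  double a, b;
  printf("start\n");  /* first lazily-resolved library call: the resolver runs here */
  for (a = b = 0.0; b < 2.0; b += 0.00000005) a += scale(b);
  printf("%f\n", a);
  return 0;
}

$ gcc -O3 -march=x86-64 -c sse_part.c
$ gcc -O3 -march=x86-64 -mavx -c avx_part.c
$ gcc -o mix avx_part.o sse_part.o
$ time ./mix
$ time LD_BIND_NOW=1 ./mix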

[0] https://sourceware.org/bugzilla/show_bug.cgi?id=20495
[1] https://sourceware.org/git/?p=glibc.git;a=commit;h=f3dcae82d54e5097e18e1d6ef4ff55c2ea4e621e
[2] https://sourceware.org/git/?p=glibc.git;a=commit;h=fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604
[3] https://software.intel.com/en-us/articles/intel-avx-state-transitions-migrating-sse-code-to-avx
[4] https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/dl-trampoline.h;h=d6c7f989b5e74442cacd75963efdc6785ac6549d;hb=fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604#l182
[5] http://www.agner.org/optimize/blog/read.php?i=761#761

Related branches

description: updated
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in glibc (Ubuntu):
status: New → Confirmed
Revision history for this message
Marcel Stimberg (marcelstimberg) wrote :

It seems that the fix has been backported to upstreams's 2.24 branch: https://sourceware.org/git/?p=glibc.git;a=commit;h=4b8790c81c1a7b870a43810ec95e08a2e501123d

Changed in glibc:
importance: Unknown → Medium
status: Unknown → Fix Released
Revision history for this message
In , Oleg (oleg-redhat-bugs) wrote :

Bug [0] was introduced in Glibc 2.23 [1] and fixed in Glibc 2.25 [2]. Fedora 24 and Fedora 25 are affected because they use either Glibc 2.23 or 2.24. The bug introduces a serious (2x-4x) performance degradation of the math functions (pow, exp/exp2/exp10, log/log2/log10, sin/cos/sincos/tan, asin/acos/atan/atan2, sinh/cosh/tanh, asinh/acosh/atanh) provided by libm, and can be reproduced on any AVX-capable x86-64 machine.

(The rest of the comment repeats, verbatim, the AVX-SSE transition penalty and _dl_runtime_resolve() analysis already quoted in full in the bug description above.)

Revision history for this message
In , Carlos (carlos-redhat-bugs) wrote :

I don't see anywhere near the performance degradation you're seeing, so it must be heavily dependent on the family and stepping that you're using.

e.g.
[carlos@athas rhbz1421121]$ time LD_BIND_NOW=1 ./pow-test
154964150.331550

real 0m1.831s
user 0m1.820s
sys 0m0.003s

[carlos@athas rhbz1421121]$ time ./pow-test
154964150.331550

real 0m1.830s
user 0m1.820s
sys 0m0.001s

Verified pow-test built without DT_FLAGS BIND_NOW.

I agree that it is less than optimal to have processor state transitions like those you indicate every time the dynamic loader trampoline is called.

We'll look into this.

Fedora 26 will not have this problem since it's based on glibc 2.25 with the fix you indicate already present.

Revision history for this message
In , Oleg (oleg-redhat-bugs) wrote :

Hi Carlos,

Many thanks for looking into this! Could you please confirm that you used the following command to compile pow test with gcc:

$ gcc -O3 -march=x86-64 -mavx -o pow pow.c -lm

Passing -mavx is the key thing for this example to work as expected. You want to compile pow() test WITH -mavx but exp() test WITHOUT -mavx.

I'd also appreciate it if you could tell me which CPU you are testing on. It's impossible for me to run this test on every possible CPU (I've tried Sandy Bridge and Ivy Bridge machines so far), and this information would be really helpful.

Thanks!

Revision history for this message
In , Carlos (carlos-redhat-bugs) wrote :

(In reply to Oleg Strikov from comment #4)
> Hi Carlos,
>
> Many thanks for looking into this! Could you please confirm that you used
> the following command to compile pow test with gcc:
>
> $ gcc -O3 -march=x86-64 -mavx -o pow pow.c -lm

I can confirm that I used these options on an F25 system.

The dynamic loader trampoline is only called once in the loop to resolve the singular math function call, and after that it's the same sequence over and over again without any explicit software save/restore (though the CPU might do something for the transition).

carlos@athas rhbz1421121]$ gcc -O3 -march=x86-64 -mavx -o pow-test pow-test.c -lm
[carlos@athas rhbz1421121]$ time ./pow-test
154964150.331550

real 0m1.829s
user 0m1.819s
sys 0m0.002s
[carlos@athas rhbz1421121]$ time LD_BIND_NOW=1 ./pow-test
154964150.331550

real 0m1.833s
user 0m1.819s
sys 0m0.005s

gcc version 6.3.1 20161221 (Red Hat 6.3.1-1) (GCC)

> Passing -mavx is the key thing for this example to work as expected. You
> want to compile pow() test WITH -mavx but exp() test WITHOUT -mavx.
>
> I'd also appreciate it if you could tell me which CPU you are testing on.
> It's impossible for me to run this test on every possible CPU (I've tried
> Sandy Bridge and Ivy Bridge machines so far), and this information would be
> really helpful.

I ran this on an i5-4690K, so a Haswell series CPU, but without AVX512.

Revision history for this message
In , Florian (florian-redhat-bugs) wrote :

(In reply to Carlos O'Donell from comment #5)
> (In reply to Oleg Strikov from comment #4)
> > Hi Carlos,
> >
> > Many thanks for looking into this! Could you please confirm that you used
> > the following command to compile pow test with gcc:
> >
> > $ gcc -O3 -march=x86-64 -mavx -o pow pow.c -lm
>
> I can confirm that I used these options on an F25 system.
>
> The dynamic loader trampoline is only called once in the loop to resolve the
> singular math function call, and after that it's the same sequence over and
> over again without any explicit software save/restore (though the CPU might
> do something for the transition).

Right, that's why I always found the claim about the substantial performance impact a bit puzzling.

What happens if you use LD_BIND_NOT=1?

Revision history for this message
In , Carlos (carlos-redhat-bugs) wrote :

(In reply to Florian Weimer from comment #6)
> (In reply to Carlos O'Donell from comment #5)
> > (In reply to Oleg Strikov from comment #4)
> > > Hi Carlos,
> > >
> > > Many thanks for looking into this! Could you please confirm that you used
> > > the following command to compile pow test with gcc:
> > >
> > > $ gcc -O3 -march=x86-64 -mavx -o pow pow.c -lm
> >
> > I can confirm that I used these options on an F25 system.
> >
> > The dynamic loader trampoline is only called once in the loop to resolve the
> > singular math function call, and after that it's the same sequence over and
> > over again without any explicit software save/restore (though the CPU might
> > do something for the transition).
>
> Right, that's why I found the claim about the substantial performance impact
> always a bit puzzling.

Agreed.

> What happens if you use LD_BIND_NOT=1?

[carlos@athas rhbz1421121]$ time LD_BIND_NOT=1 ./pow-test
154964150.331550

real 0m4.527s
user 0m4.505s
sys 0m0.003s

Terrible performance as expected, though.

Surprisingly, this is in line with Oleg's numbers.

However, LD_BIND_NOT performance is never the default; you'd have to be running with a preloaded audit library (LD_AUDIT) to trigger that kind of behaviour.

Perhaps something is wrong with Oleg's system configuration?

Revision history for this message
In , Oleg (oleg-redhat-bugs) wrote :

To my understanding, once the trampoline has touched the upper halves of the YMM registers, ALL future switches between AVX and SSE require the time-consuming store/restore operation (i.e. all future calls to pow will suffer). Touching the upper halves sets something like a dirty flag (which forces the CPU to do the store/restore), and this flag never gets cleared during the whole program execution. That's why the impact is so serious.

I was able to reproduce the issue using the F25 live CD, so it looks like a CPU-model-dependent issue. We were able to repro on an E5-1630 (Haswell) though.

Revision history for this message
In , Marcel (marcel-redhat-bugs) wrote :

Hi, I'm the one that Oleg referred to who had this issue on an E5-1630 CPU. It turns out that I actually /cannot/ reproduce it with a Fedora 25 live CD (before and after an update of glibc)!
I don't normally use Fedora on this machine; I originally encountered the problem with Ubuntu 16.04 (which has glibc 2.23, not 2.24 as Fedora 25 does) -- there it is perfectly reproducible with Oleg's code, with timings very similar to the ones that Oleg reported.

This is very confusing, I can try with a Fedora 24 live CD as well, but Oleg seems to be able to reproduce it on Fedora 25, so...

Revision history for this message
In , Marcel (marcel-redhat-bugs) wrote :

Um, sorry for the noise, but it seems that the bug was fixed with Fedora's glibc 2.24-4 release:

* Fri Dec 23 2016 Carlos O'Donell <carlos@...> - 2.24-4
  - Auto-sync with upstream release/2.24/master,
    commit e9e69e468039fcd57276f783a16aa771a8e4214e, fixing:
  - [...]
  - Fix runtime resolver routines in the presence of AVX512 (swbz#20508)
  - [...]

That would explain why Oleg saw it with the Fedora 25 live CD (which still has 2.24-3) while Carlos did not see it on his system. What I don't understand now is why I myself could not reproduce it with the live CD, even though I tried compiling/running the test before updating glibc...

Revision history for this message
In , Oleg (oleg-redhat-bugs) wrote :

I just reran all the tests on F24 and F25. I can confirm that the performance issue disappears on F25 when the glibc package gets updated to version 2.24-4. It is still observable on F24 because the fix has not been propagated there. I'm very sorry for such a stupid mistake (not updating the live CD packages before running the tests). Thanks to Marcel for pointing that out, it saved me a huge amount of time.

We also did some investigation into which specific CPU models suffer from this kind of performance degradation. A quite reliable source [1] says that 'AMD processors and later Intel processors (Skylake and Knights Landing) do not have such a state switch'. This means that only Sandy Bridge, Ivy Bridge, Haswell, and Broadwell CPUs are affected.

Many thanks to Carlos and Florian for such fast and straight to the point response. I really appreciate that.

[1] http://www.agner.org/optimize/blog/read.php?i=761#761

Revision history for this message
Oleg Strikov (strikov-deactivatedaccount) wrote :

Bug description has been updated to include the following information:

@strikov: According to a quite reliable source [5], all AMD CPUs and the latest Intel CPUs (Skylake and Knights Landing) don't suffer from the AVX/SSE transition penalty. This narrows the scope of the bug to the following generations of Intel CPUs: Sandy Bridge, Ivy Bridge, Haswell, and Broadwell. The scope still remains quite large, though.

@strikov: Ubuntu 16.10/17.04, which use Glibc 2.24, may receive the fix from the upstream 2.24 branch (as Marcel pointed out, the fix has been backported to the 2.24 branch, where Fedora picked it up successfully) if such a synchronization takes place. Ubuntu 16.04 (the main target of this bug) uses Glibc 2.23, which hasn't been patched upstream and will suffer from the performance degradation until we fix it manually.

description: updated
Revision history for this message
Marcel Stimberg (marcelstimberg) wrote :

Regarding glibc 2.24: note that the version in use in Debian testing/unstable (i.e. stretch/sid) is 2.24-9 which already incorporates the upstream fix, i.e. Debian is not affected.

dino99 (9d9)
tags: added: upgrade-software-version xenial yakkety zesty
summary: - Serious performance degradation of math functions in 16.04/16.10/17.04
- due to known Glibc bug
+ Serious performance degradation of math functions
Changed in glibc (Ubuntu Zesty):
importance: Undecided → Medium
Changed in glibc (Ubuntu):
assignee: nobody → Matthias Klose (doko)
Revision history for this message
Vinson Lee (vlee) wrote :

Please backport these upstream glibc patches to 16.04 xenial glibc-2.23. These patches are already in upstream glibc-2.24.

https://sourceware.org/git/?p=glibc.git;a=commit;h=f43cb35c9b3c35addc6dc0f1427caf51786ca1d2
https://sourceware.org/git/?p=glibc.git;a=commit;h=fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604
https://sourceware.org/git/?p=glibc.git;a=commit;h=c15f8eb50cea7ad1a4ccece6e0982bf426d52c00

These patches were not backported to upstream glibc-2.23 because the build requirements would have needed to be bumped to binutils-2.24. However, 16.04 xenial is already on binutils-2.26 and would not be restricted by this build requirement change.

Luke Faraone (lfaraone)
Changed in glibc (Ubuntu):
status: Confirmed → Triaged
Changed in glibc (Ubuntu Zesty):
status: Confirmed → Triaged
Changed in glibc (Fedora):
importance: Unknown → Undecided
status: Unknown → Fix Released
Revision history for this message
Bryan Seitz (seitz-a) wrote :

Any idea when a fix for Xenial will be released?

Revision history for this message
Daniel Axtens (daxtens) wrote :

Hi,

I created a package with the following 3 patches:

 - https://sourceware.org/git/gitweb.cgi?p=glibc.git;a=commit;h=83037ea1d9e84b1b44ed307f01cbb5eeac24e22d
 - https://sourceware.org/git/gitweb.cgi?p=glibc.git;a=commit;h=883cadc5543ffd3a4537498b44c782ded8a4a4e8 (backports suggested by Intel in https://bugs.launchpad.net/intel/+bug/1717382), and
 - The 2.23 backport of the final patch - https://sourceware.org/bugzilla/attachment.cgi?id=9375&action=edit - from https://sourceware.org/bugzilla/show_bug.cgi?id=20139 comment 9

It is at https://launchpad.net/~daxtens/+archive/ubuntu/builder/+build/15006375

I installed it on a Xenial system but it didn't seem to help - things still run slowly unless I specify LD_BIND_NOW=1.

Any ideas what I might be missing or might have done wrong? Perhaps the 2.23 backport of the final patch is insufficient?

Regards,
Daniel

Revision history for this message
Florian Weimer (fw) wrote :

Daniel, are the patches in comment 17 compatible with the (non-standard) Intel __regcall calling convention?

https://sourceware.org/bugzilla/show_bug.cgi?id=21265

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

zesty is EOL.
artful+ are fix released.

xenial is the only currently affected supported series.

Changed in glibc (Ubuntu Zesty):
status: Triaged → Won't Fix
Changed in glibc (Ubuntu):
status: Triaged → Fix Released
Revision history for this message
Daniel Axtens (daxtens) wrote :

Florian,

(Apologies for the delay in getting back to you.)

I haven't checked __regcall in detail, but I still see the degradation in the pow and exp examples from the bug description. They're all compiled with GCC and don't use any code compiled with ICC. If my understanding is correct, that should rule out the __regcall convention?

(Apologies if I'm way off base here, I have more of a PPC background...)

Regards,
Daniel

Revision history for this message
Florian Weimer (fw) wrote :

Sorry for being unclear. What I was trying to say is this: The patches you posted do not look like the XSAVE/XRSTOR patches for the dynamic linker trampoline, but they appear to be based on an earlier attempt to fix this. I'm worried that these patches clobber registers which are used by existing binaries to pass arguments (even though these registers are reserved by the ABI).

Revision history for this message
Daniel Axtens (daxtens) wrote :

Hi Florian,

With that pointer I was able to grab the correct patches from the release/2.23/master branch and apply them to the Ubuntu Xenial glibc package. The built package performs correctly and quickly.

Thanks so much - it would have taken me much, much longer to figure out what was going on without your comments.

Regards,
Daniel
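
For reference, a sketch of how one might list the relevant commits on that branch (the clone URL matches the gitweb links above; the trampoline file paths are an assumption about where the changes live):

$ git clone https://sourceware.org/git/glibc.git
$ cd glibc
$ git log --oneline origin/release/2.23/master -- sysdeps/x86_64/dl-trampoline.S sysdeps/x86_64/dl-trampoline.h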

Revision history for this message
Florian Weimer (fw) wrote :

Note that you may also have to take measures to increase available stack size and avoid lazy binding for libgcc_s, for increased backwards compatibility on AVX-512 systems.

Daniel Axtens (daxtens)
Changed in glibc (Ubuntu Xenial):
status: New → Confirmed
assignee: nobody → Daniel Axtens (daxtens)
Daniel Axtens (daxtens)
description: updated
Daniel Axtens (daxtens)
tags: added: sts
Mathew Hodson (mhodson)
Changed in glibc (Ubuntu Xenial):
importance: Undecided → Medium
Mathew Hodson (mhodson)
Changed in glibc (Ubuntu Xenial):
status: Confirmed → In Progress
Changed in glibc (Ubuntu Xenial):
importance: Medium → Low
Dan Streetman (ddstreet)
Changed in glibc (Ubuntu Xenial):
importance: Low → High
Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Oleg, or anyone else affected,

Accepted glibc into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/glibc/2.23-0ubuntu11 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.
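
As a concrete example (mirror URL and pinning details vary between setups; see the EnableProposed wiki page above), enabling xenial-proposed just long enough to install the candidate libc6 might look like:

$ echo "deb http://archive.ubuntu.com/ubuntu/ xenial-proposed main restricted universe multiverse" | sudo tee /etc/apt/sources.list.d/xenial-proposed.list
$ sudo apt-get update
$ sudo apt-get install -t xenial-proposed libc6
$ dpkg -s libc6 | grep '^Version'   # should report 2.23-0ubuntu11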

Changed in glibc (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-xenial
Revision history for this message
Dan Streetman (ddstreet) wrote :

autopkgtest regressions that should be ignored:

node-srs (all archs): test has not been run in 3 years, and fails in local build with existing glibc; failure unrelated to this sru. ignore.

bzr (all archs): test failed for last 2 years, since last bzr pkg update; fails on local system with current glibc; ignore.

systemd/amd64: 2 failures:
  subprocess.CalledProcessError: Command '['modprobe', 'scsi_debug']' returned non-zero exit status 1
  logind FAIL stderr: grep: /sys/power/state: No such file or directory

both unrelated to glibc (likely a change in the kernel api and/or pkging introduced this test regression), ignore.

systemd/s390x: test always failed. ignore.

linux-oracle/amd64: test fails consistently, ignore.

node-ws (all archs): test has not been run in 3 years, and fails in local build with existing glibc; failure unrelated to this sru. ignore.

fpc (i386/armhf): test always failed, ignore.

libreoffice/i386: test always failed, ignore.

apt (all archs): existing apt test bug, ignore:
  https://bugs.launchpad.net/ubuntu/+source/apt/+bug/1815750

nvidia-graphics-drivers-340/armhf: test always failed, ignore.

ruby-xmlparser (all archs): test failed for 3 years, fails locally with current glibc; ignore.

linux (ppc64el/i386): tests flaky, fail more than succeed; ignore.

gearmand/armhf: test failed for last year, ignore.

node-groove (all archs): test has not been run in 3 years, and fails in local build with existing glibc; failure unrelated to this sru. ignore.

iscsitarget/armhf: test always failed, ignore.

libnih/armhf: test blacklisted, ignore.

nplan (all archs): test flaky, almost always fails; ignore.

node-leveldown (all archs): test has not been run in 3 years, and fails in local build with existing glibc; failure unrelated to this sru. ignore.

snapcraft: failed since pkg last updated, unrelated to this sru, ignore.

r-bioc-genomicalignments/s390x: test not run in 3 years, fails locally with current glibc; ignore.

ruby-nokogiri (all archs): test not run for 3 years, fails locally with current glibc; ignore.

ipset (all archs): test not run for 3 years, fails locally with current glibc; ignore.

snapd (all archs): test flaky, almost always fails, ignore.

dadhi-linux/s390x: test always failed, ignore.

still reviewing:

mercurial (all archs)

ruby2.3/s390x

pdns-recursor (s390x/i386)

Revision history for this message
Dan Streetman (ddstreet) wrote :

mercurial (all archs): fails since security update, verified fails in local test run, test failure introduced by security update. ignore.

Revision history for this message
Dan Streetman (ddstreet) wrote :

pdns-recursor (s390x/i386): test fails the same way for only these 2 archs, on yakkety; failure unrelated to this sru, ignore.

Revision history for this message
Dan Streetman (ddstreet) wrote :

ruby2.3/s390x: test fails on all archs, but hinted always fails on other archs. should be hinted always fails on s390x as well. ignore.

Revision history for this message
Dan Streetman (ddstreet) wrote :

all autopkgtest failures should be ignored based on above comments.

Revision history for this message
Dan Streetman (ddstreet) wrote :

$ dpkg -l | grep libc6:amd64
ii libc6:amd64 2.23-0ubuntu10 amd64 GNU C Library: Shared libraries

$ time ./exp
127781126.100057

real 0m3.334s
user 0m3.336s
sys 0m0.000s
$ time LD_BIND_NOW=1 ./exp
127781126.100057

real 0m0.710s
user 0m0.708s
sys 0m0.000s

$ dpkg -l | grep libc6:amd64
ii libc6:amd64 2.23-0ubuntu11 amd64 GNU C Library: Shared libraries

$ time ./exp
127781126.100057

real 0m0.709s
user 0m0.708s
sys 0m0.000s
$ time LD_BIND_NOW=1 ./exp
127781126.100057

real 0m0.714s
user 0m0.712s
sys 0m0.004s

tags: added: verification-done verification-done-xenial
removed: verification-needed verification-needed-xenial yakkety zesty
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for glibc has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package glibc - 2.23-0ubuntu11

---------------
glibc (2.23-0ubuntu11) xenial; urgency=medium

  * debian/patches/ubuntu/xsave-part1.diff and
    debian/patches/ubuntu/xsave-part2.diff: Fix a serious performance
    regression when mixing SSE and AVX code on certain processors.
    The patches are from the upstream 2.23 stable branch. (LP: #1663280)

 -- Daniel Axtens <email address hidden> Thu, 04 Oct 2018 10:29:55 +1000

Changed in glibc (Ubuntu Xenial):
status: Fix Committed → Fix Released