Commits · e875f40f0b2ad71c5381a431e6d71829770c7ab7 · Android-smartphones / realme / realme 5pro / kaderbava / external_arm-optimized-routines

Sep 18, 2018
- Fix the documentation comment of checkint · e875f40f
  Szabolcs Nagy authored Sep 18, 2018
```
checkint in pow is not supposed to be used with 0, inf or nan inputs.
```
  e875f40f
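
For context, the following is a minimal sketch of an integer classifier of this kind. It is an illustration written for this log, not a quote from the repository, and the helper `asuint64` is an assumed name. The comments show why 0, inf and nan must be filtered out by the caller: the exponent tests would misclassify them.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a double as a 64-bit unsigned integer.
   (Helper name assumed for this sketch.)  */
static inline uint64_t
asuint64 (double x)
{
  uint64_t u;
  memcpy (&u, &x, sizeof u);
  return u;
}

/* Classify the bit pattern of a non-zero finite double:
   0 = not an integer, 1 = odd integer, 2 = even integer.
   Zero would be reported as "not an integer" and inf/nan as "even",
   so the caller is assumed to have handled those inputs already.  */
static inline int
checkint (uint64_t iy)
{
  int e = iy >> 52 & 0x7ff;	/* biased exponent, sign dropped */
  if (e < 0x3ff)
    return 0;			/* |y| < 1: cannot be an integer */
  if (e > 0x3ff + 52)
    return 2;			/* |y| >= 2^53: always even */
  if (iy & ((1ULL << (0x3ff + 52 - e)) - 1))
    return 0;			/* fractional bits are set */
  if (iy & (1ULL << (0x3ff + 52 - e)))
    return 1;			/* the units bit is set: odd */
  return 2;
}
```

A caller would typically use the result to decide the sign of pow(x, y) for negative x, for example `checkint (asuint64 (3.0))` yields 1 (odd).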
Sep 05, 2018
- Document the log table generation method · 71d27280
  Szabolcs Nagy authored Sep 04, 2018
```
Add comments with enough detail so the log lookup tables can be recreated.
```
  71d27280
Aug 17, 2018

remez.jl: update to work with Julia 1.0. · 757f906d

Simon Tatham authored Aug 15, 2018

This program was originally developed for a much earlier version of
Julia, and there have been a lot of changes to Julia semantics since
then. But 1.0 is intended to be stable, so updating once to work with
that should mean that no further updates are needed for a long time.

Changes relevant to this script include new API calls for making
arrays, evaluating Julia source-code expressions from strings, and
parsing strings as numbers; scope and semantics changes requiring
extra escaping in the @debug macro and some explicit 'global' to allow
for-loops to update variables outside themselves; a syntax change for
tuple types; and replacing 'beginswith' on strings with 'startswith'.

There are a couple of remaining inconveniences with this version of
the code. Firstly, the standard Julia 1.0 interpreter will consume a
"--" option terminator on its command line even if it appears after
the script name, so commands of the form 'julia remez.jl -- -1 1 ...'
that worked in Julia 0.2 will no longer work. Adding an extra "--"
before the script name works around this ('julia -- remez.jl -- ...')
because the first "--" is seen by the Julia interpreter and the second
goes to the script. But unfortunately that "--" can't be put in the #!
line as well as the 'env', because of #! command line semantics. So
users may have to work around this by explicitly invoking the
interpreter.

Secondly, Julia 1.0 has moved some mathematical functions (e.g. erf
and gamma) out of its core library into the SpecialFunctions package,
so any pre-existing command lines that used those functions will now
need to qualify them with a package name, and be run on a Julia
installation which has that package installed.

Because Julia 0.4.x is still common (Ubuntu 16.04 and 18.04 both
provide 0.4.5), I've included some backwards-compatibility code so
that the script still runs on that version as well.

757f906d

Ensure HAVE_FAST_FMA is set on AArch64 · 5175759c
Wilco Dijkstra authored Aug 17, 2018
```
If math.h doesn't set FP_FAST_FMA correctly, ensure HAVE_FAST_FMA is set on AArch64.
```
5175759c
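
A configuration default of this shape could look roughly like the sketch below. It combines the FP_FAST_FMA check described in the "More portable default setting for HAVE_FAST_FMA" entry with an AArch64 fallback. The exact conditions in the repository may differ, and `uses_fast_fma` is a hypothetical helper added only so the setting can be inspected.

```c
#include <math.h>

/* Let the build override the setting, e.g. with -DHAVE_FAST_FMA=0.  */
#ifndef HAVE_FAST_FMA
/* FP_FAST_FMA comes from math.h when the implementation claims fma is
   fast; __aarch64__ is the compiler's predefined macro for AArch64.  */
# if defined FP_FAST_FMA || defined __aarch64__
#  define HAVE_FAST_FMA 1
# else
#  define HAVE_FAST_FMA 0
# endif
#endif

/* Hypothetical helper: returns 1 when the fma code path is selected.  */
static inline int
uses_fast_fma (void)
{
  return HAVE_FAST_FMA;
}
```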

Aug 08, 2018
- Improve sincosf comments · b2fc9892
  Wilco Dijkstra authored Aug 08, 2018
```
Improve comments. Use TOINT_INTRINSICS rather than HAVE_FAST_ROUND.
```
  b2fc9892
Jul 30, 2018

Don't build tanf rredf and funder by default · 41ed0e60

Szabolcs Nagy authored Jul 27, 2018

These are no longer maintained and are only kept for the WANT_SINGLEPREC build,
which is useful for microcontrollers that only have a single-precision FPU.

Removed all tests that are not testing code in the default build.

41ed0e60

Add separate HOST_CFLAGS, HOST_LDFLAGS and HOST_LDLIBS make variables · a202746b
Szabolcs Nagy authored Jul 27, 2018
```
Don't use target flags when building host tools.  The example config.mk.dist
is updated accordingly.
```
a202746b

Don't build rtest by default · c5a8042e

Szabolcs Nagy authored Jul 11, 2018

rtest is the only binary that is built for the host and has additional
dependencies: mpfr and mpc.  To avoid build issues, only build it if
the randomized tests are run.  A new build target is introduced for
randomized testing, so make check works without rtest.

Tests are no longer copied to the build directory and the runtest.sh
tool is removed as it is not very useful.

c5a8042e

Flush stdout after each benchmark in mathbench · 5f82bffe

Szabolcs Nagy authored Jul 11, 2018

If a lot of benchmarks are run and the output is not a tty, the results
only showed up after the stdio buffer was full.  With fflush, if the
output is redirected and mathbench is killed, the results from before the
kill are still available.

5f82bffe
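
The pattern is simply a flush after every result line. A minimal sketch follows; `report_result` and its output format are hypothetical and not taken from mathbench.

```c
#include <stdio.h>

/* Print one benchmark result and flush it immediately, so the line
   survives even if stdout is a pipe or file and the process is killed
   afterwards.  Returns 0 on success, -1 on an output error.
   (Hypothetical helper, not the mathbench code.)  */
static int
report_result (const char *name, double ns_per_call)
{
  if (printf ("%-10s %8.2f ns/call\n", name, ns_per_call) < 0)
    return -1;
  return fflush (stdout) == 0 ? 0 : -1;
}
```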

Jul 10, 2018

Remove float compare option from sincosf · fce09974

Szabolcs Nagy authored Jul 09, 2018

The PREFER_FLOAT_COMPARISON setting was not correct, as it could raise
spurious exceptions. Fixing it is easy: just use ISLESS(x, y) instead
of abstop12(x) < abstop12(y), with an appropriate non-signaling definition
for ISLESS. However, this setting does not seem very useful (there is
only a minor performance difference on various architectures), so remove
this option for now.

fce09974
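
The integer comparison mentioned above can be sketched as follows. Because it only inspects the bit pattern, it cannot raise floating-point exceptions, even for nan inputs. The helper names `asuint` and `abs_below` are assumptions of this sketch, and the comparison is only exact when the low 20 bits of the bound are zero.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a float as a 32-bit unsigned integer.  */
static inline uint32_t
asuint (float f)
{
  uint32_t u;
  memcpy (&u, &f, sizeof u);
  return u;
}

/* Top 12 bits of the representation with the sign bit masked off:
   the 8 exponent bits followed by the 3 leading mantissa bits.  */
static inline uint32_t
abstop12 (float x)
{
  return (asuint (x) >> 20) & 0x7ff;
}

/* Hypothetical helper: nonzero when |x| is below the bound at 12-bit
   granularity.  Pure integer arithmetic, so no exception is raised.  */
static inline int
abs_below (float x, float bound)
{
  return abstop12 (x) < abstop12 (bound);
}
```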

Fix the documentation comments for log_inline in pow · ae8bc7d0
Szabolcs Nagy authored Jul 09, 2018
```
There was a typo and the arguments were not explained clearly.
```
ae8bc7d0

Jul 04, 2018

Fix namespace issues in sincosf · 3262ef23

Wilco Dijkstra authored Jul 04, 2018

Use const sincos_t for clarity instead of making the typedef const.
Use __inv_pi4 and __sincosf_table to avoid namespace issues with
static linking.

3262ef23

Change the return type of converttoint and document the semantics · dd178df0

Szabolcs Nagy authored Jul 04, 2018

The roundtoint and converttoint internal functions are only called with small
values, so a 32-bit result is enough for converttoint, and since it is a signed
int conversion the natural return type is int32_t.

The original idea was to help the compiler by keeping the result in uint64_t,
so that it is clear that no sign extension is needed and there is no accidental
undefined or implementation-defined signed int arithmetic.

But it turns out gcc does a good job with inlining, so changing the type has
no overhead and the semantics of the conversion are less surprising this way.
Since we want to allow the asuint64 (x + 0x1.8p52) style conversion, the top
bits were never usable and the existing code ensures that only the bottom
32 bits of the conversion result are used.

dd178df0
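
The asuint64 (x + 0x1.8p52) style conversion can be illustrated with the sketch below. It assumes the default round-to-nearest mode, |x| well below 2^31, double arithmetic without excess precision, and a compiler that wraps on the uint64_t to int32_t cast (as gcc and clang do). The function name `converttoint_shift` is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a double as a 64-bit unsigned integer.  */
static inline uint64_t
asuint64 (double x)
{
  uint64_t u;
  memcpy (&u, &x, sizeof u);
  return u;
}

/* Round x to the nearest integer (ties to even) without a convert
   instruction.  Adding 1.5 * 2^52 moves x into a binade where one ulp
   is exactly 1, so the rounded integer ends up in the low mantissa
   bits in two's complement form.  Only the bottom 32 bits are usable.
   (Hypothetical helper, see the assumptions in the text above.)  */
static inline int32_t
converttoint_shift (double x)
{
  const double shift = 0x1.8p52;
  /* volatile keeps the compiler from simplifying the addition away */
  volatile double kd = x + shift;
  return (int32_t) asuint64 (kd);
}
```

With these assumptions, `converttoint_shift (2.5)` gives 2 and `converttoint_shift (-7.6)` gives -8.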

More detailed documentation comments · bc4b9012

Szabolcs Nagy authored Jul 03, 2018

Rewrote some documentation text and fixed a GNU style issue based on
feedback from Joseph Myers on libc-alpha mailing list.

bc4b9012

Jul 03, 2018

Fix large ulp error in pow without fma very near 1.0 · 2105bad5

Szabolcs Nagy authored Jul 03, 2018

The !HAVE_FAST_FMA code path split r = z/c - 1 into r = rhi + rlo such
that when z = 1-tiny and c = 1, rlo and rhi could have much larger
magnitude than r, which later caused large rounding errors.

So do a nearest rounding instead of truncation at the split.

2105bad5
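
The difference between the two kinds of split can be seen on a value just below 1. The sketch below splits a double at 32 bits, once by truncating the bit pattern and once by rounding it. It illustrates the idea only: the helper names are hypothetical and the actual code splits inside the log computation.

```c
#include <stdint.h>
#include <string.h>

static inline uint64_t
asuint64 (double x)
{
  uint64_t u;
  memcpy (&u, &x, sizeof u);
  return u;
}

static inline double
asdouble (uint64_t u)
{
  double x;
  memcpy (&x, &u, sizeof x);
  return x;
}

/* High part by truncation: clear the low 32 bits of the representation.
   For z = 1 - tiny this gives about 1 - 2^-21, so both parts of the
   split end up far larger in magnitude than tiny.  */
static inline double
split_hi_trunc (double z)
{
  return asdouble (asuint64 (z) & (~0ULL << 32));
}

/* High part by rounding: add half of the dropped range before masking.
   A carry may propagate into the exponent, which is the desired result.
   Ties are rounded up in magnitude rather than to even.  */
static inline double
split_hi_nearest (double z)
{
  return asdouble ((asuint64 (z) + (1ULL << 31)) & (~0ULL << 32));
}
```

For z = 0x1.fffffffffffffp-1 the rounded high part is exactly 1.0 and the low part z - 1.0 is -0x1p-53, whereas truncation leaves a low part of roughly 2^-21.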

Jun 29, 2018
- Add documentation comments to internal functions · 58ce45c0
  Szabolcs Nagy authored Jun 29, 2018
```
Explain the semantics of internal functions.
```
  58ce45c0
- Fix GNU style issues · 5e838911
  Szabolcs Nagy authored Jun 29, 2018
```
Whitespace changes only.
```
  5e838911
Jun 27, 2018

Add exp trace · 4aa92161

Szabolcs Nagy authored Jun 20, 2018

Extracted 16000 samples from several exp call traces.
The first half is quite generic; the second half has a large number of zeros.

4aa92161

Jun 25, 2018

Fix pow error bound · 5049bfaf

Szabolcs Nagy authored Jun 25, 2018

The pow error bound was miscalculated: it is slightly below 0.54 ULP
when using fma and slightly above it without fma.

If 2^-400 < |pow(x,y)| < 2^400 then the error is less than 0.52 ULP
with fma.

5049bfaf

Jun 22, 2018

Fix gnu code style in pow.c · 76fd080f
Szabolcs Nagy authored Jun 22, 2018

76fd080f

Improve pow implementation · a6230320

Szabolcs Nagy authored Jun 21, 2018

The log part of pow was rewritten to use a slightly different algorithm.
This improves precision and throughput while keeping the same table size.

Near-1 cases are no longer special-cased, so there is a slight performance
regression in that case.  When the fma instruction is not available,
this algorithm is expected to have slightly worse performance.

Worst-case error improved from 0.67 ULP to 0.57 ULP.

On Cortex-A72 I see:
thruput near 1:  7% worse
latency near 1:  2% worse
thruput general: 8% better
latency general: 2% better

a6230320

Jun 20, 2018
- Fix the type of sign_bias in pow · db6e4e96
  Szabolcs Nagy authored Jun 18, 2018
```
uint64_t works too, but the correct type is uint32_t.
```
  db6e4e96
- Add trace for sinf/cosf/sincosf · 942b2123
  Wilco Dijkstra authored Jun 15, 2018
```
Add trace for sinf/cosf/sincosf with easy, hard and medium cases
extracted from a trace of 1.4 million calls.
```
  942b2123
- Improve trace support · 2127ba77
  Wilco Dijkstra authored Jun 15, 2018
```
Improve support for large traces by reading all of the trace
and splitting it into smaller parts.

Also remove the B arrays which are not required.
```
  2127ba77
- Use int32_t in reduce_fast in sincosf.h · f3af42d9
  Wilco Dijkstra authored Jun 20, 2018
```
Use int32_t rather than int since we require it to be 32 bits here.
```
  f3af42d9
- Fix the sign of 0 in pow even with !WANT_ROUNDING · e00696aa
  Szabolcs Nagy authored Jun 19, 2018
```
The sign needs to be fixed even in nearest rounding mode.
```
  e00696aa
Jun 15, 2018

Fix spurious underflow in exp without fma · 2117b832

Szabolcs Nagy authored Jun 14, 2018

The last multiplication in exp and exp2 could underflow when it was not
contracted into an fma.  Changed the thresholds so the problematic cases
end up in the specialcase code path (which handles underflow correctly).

The initial check now only looks at the exponent bits, which performs slightly
better on aarch64.  The overflow threshold can be tight for
exp2, but was left loose in exp, so the specialcase handling got updated
accordingly.

Added comments about this issue and the assumptions exp_inline is making
in pow.

2117b832
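
An exponent-only range check of this kind might look like the sketch below. The thresholds 2^-54 and 512 are illustrative assumptions of this sketch, as is the helper name `needs_special_case`. The point is that a single unsigned comparison on the exponent bits covers tiny, huge, inf and nan inputs at once.

```c
#include <stdint.h>
#include <string.h>

static inline uint64_t
asuint64 (double x)
{
  uint64_t u;
  memcpy (&u, &x, sizeof u);
  return u;
}

/* Sign bit and 11 exponent bits of a double.  */
static inline uint32_t
top12 (double x)
{
  return asuint64 (x) >> 52;
}

/* Hypothetical check: nonzero when |x| is outside [2^-54, 512) or x is
   inf/nan, i.e. when a slower special-case path would be taken.
   Unsigned wraparound folds "abstop < lo || abstop >= hi" into one
   comparison.  */
static inline int
needs_special_case (double x)
{
  uint32_t abstop = top12 (x) & 0x7ff;	/* exponent bits only */
  uint32_t lo = top12 (0x1p-54);
  uint32_t hi = top12 (512.0);
  return abstop - lo >= hi - lo;
}
```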

Fix the opt_barrier_ function prototypes · f6717402
Szabolcs Nagy authored Jun 14, 2018
```
The portable function prototype was wrong.
```
f6717402

Jun 13, 2018

Add trace support to mathbench · 9159cf25

Szabolcs Nagy authored Jun 12, 2018

Previously only uniform random and linear (equidistant) input generators
were supported.  Now with "-g trace" numbers are read from stdin or with
"-f file" they are read from file, so previously recorded call traces
can be benchmarked.

Only the first 8000 numbers are used; if there are fewer, the
numbers are repeated to fill up the benchmark array.

9159cf25
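
The "use the first N, repeat if fewer" rule can be sketched as a cyclic fill. The function `fill_from_trace` below is a hypothetical illustration, not the mathbench implementation; in mathbench the array size would be the 8000 mentioned above.

```c
#include <stddef.h>

/* Fill out[0..n_out) from a recorded trace in[0..n_in).
   At most the first n_out samples are used; when the trace is shorter,
   its samples are repeated cyclically.  Returns the number of distinct
   trace samples used, or 0 if the trace is empty (out is then left
   untouched).  (Hypothetical helper.)  */
static size_t
fill_from_trace (double *out, size_t n_out, const double *in, size_t n_in)
{
  if (n_in == 0)
    return 0;
  size_t used = n_in < n_out ? n_in : n_out;
  for (size_t i = 0; i < n_out; i++)
    out[i] = in[i % used];
  return used;
}
```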

Fix floating-point exceptions with clang · 5fa69e1d

Szabolcs Nagy authored Jun 12, 2018

Add optimization barriers to prevent constant folding and code-motion
transformations. Use opt_barrier_ and force_eval_ consistently to work
around optimizations that break fenv semantics.

5fa69e1d
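
A portable form of such barriers can be built on volatile objects, as in the sketch below. This is one common way to write them and is an assumption here; the repository may use inline asm or different definitions on some targets. `overflow_at_runtime` is a hypothetical usage example.

```c
/* Pass a value through a volatile object so the compiler has to treat
   the result as unknown and cannot constant-fold expressions using it.  */
static inline double
opt_barrier_double (double x)
{
  volatile double y = x;
  return y;
}

/* Force the argument expression to be evaluated for its side effects
   (floating-point exceptions), even though the value is not used.  */
static inline void
force_eval_double (double x)
{
  volatile double y = x;
  (void) y;
}

/* Hypothetical usage: the multiplication cannot be folded at compile
   time, so the overflow happens (and is flagged) at run time.  */
static inline double
overflow_at_runtime (void)
{
  double huge = opt_barrier_double (0x1p1000);
  return huge * huge;
}
```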

Always compile mathtest with -fmath-errno · 21f63567

Szabolcs Nagy authored Jun 12, 2018

To make the errno tests meaningful, mathtest needs to be compiled with
errno support; otherwise (math_errhandling & MATH_ERRNO) can be 0.

21f63567

Jun 12, 2018

Add _finite symbol aliases for better glibc compatibility · b7d568d7

Szabolcs Nagy authored Jun 06, 2018

The glibc math.h header can redirect math symbols (e.g. with the
-ffinite-math-only cflag) to _finite symbols which have the same
semantics as the original ones except they may omit some error checks.

With these aliases libmathlib can be a drop-in replacement for glibc
libm at link time.  It's not a full replacement, but it can interpose
glibc libm calls using LD_PRELOAD in the case of an already dynamically
linked binary, or using -lmathlib before -lm at link time.  This will
hopefully make it easy to try libmathlib with a glibc-based toolchain.

Unfortunately with static linking the glibc _finite symbol definitions
may conflict with the libmathlib definitions if other glibc internal
symbols are used from the same object file where the _finite symbol is
defined.  To work around this, add glibc internal symbols as hidden
aliases so the conflicting glibc object files don't get linked.
This is a bit fragile since glibc can change its internal symbols and
how they are used, but it only affects static linking, which always
has this problem when symbols are interposed.

b7d568d7
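
Symbol aliases of this kind are usually written with the GCC/Clang `alias` attribute on ELF targets. The sketch below shows a public alias and a hidden one. All names (`my_scale2`, `my_scale2_finite`, `my_scale2_internal`) are hypothetical stand-ins, and the macros the repository actually uses are not reproduced here.

```c
/* Hypothetical function standing in for a libmathlib routine.  */
double
my_scale2 (double x)
{
  return 2.0 * x;
}

/* Public alias: a second exported name for the same code, analogous to
   providing a _finite variant of a math function.  Requires a compiler
   and object format that support the alias attribute (GCC or Clang on
   ELF is assumed).  */
extern __typeof (my_scale2) my_scale2_finite
  __attribute__ ((alias ("my_scale2")));

/* Hidden alias: visible to other objects in the same library or
   executable at link time, but not exported from a shared object.  */
extern __typeof (my_scale2) my_scale2_internal
  __attribute__ ((alias ("my_scale2"), visibility ("hidden")));
```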

Add pow to mathlib.h · 875334a2
Szabolcs Nagy authored Jun 12, 2018
```
Update mathlib.h to use GNU style declarations and add missing pow.
```
875334a2

More portable default setting for HAVE_FAST_FMA · d7494f25

Szabolcs Nagy authored Jun 12, 2018

FP_FAST_FMA is the standard way to decide if fma is fast.  Unfortunately,
in practice it can be defined even if fma is not inlined (e.g. with
-fno-builtin-fma), and it does not really help if the libc has a
single-instruction implementation: the call overhead is too much.  Most of
the time it is the correct check though, and without configure-time checks
this is the closest we can get.

d7494f25

Jun 11, 2018

Add benchmark code · 764b4bfb

Szabolcs Nagy authored Jun 08, 2018

Simple microbenchmark to measure the latency and throughput of math
functions.

764b4bfb

Add new pow implementation · ed0ecfff

Szabolcs Nagy authored Jun 06, 2018

The algorithm is exp(y * log(x)), where log(x) is computed with about
1.8*2^-66 relative error, returning the result in two doubles, and the
exp part uses the same algorithm (and lookup tables) as exp, but takes
the input as two doubles and a sign (to handle negative bases with odd
integer exponent).

There is a separate code path for when fma is not available, but the
worst-case error is about 0.67 ULP in both cases.  The lookup table and
consts for log are 4224 bytes; the code is 1196 bytes.  The non-nearest
rounding error is less than 1 ULP.

Improvements on Cortex-A72 compared to current glibc master:
latency: 1.8x
thruput: 2.5x

ed0ecfff
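
As a rough, first-order estimate (an illustration added here, not part of the commit message), the need for the extra log precision can be seen from how an error in t = y log x propagates through exp. It assumes a finite, normal result, so |t| is at most about 710, and it ignores the other error sources in the exp step.

```latex
\[
  t = y \log x, \qquad
  \hat t = t\,(1 + \varepsilon), \qquad
  |\varepsilon| \lesssim 1.8 \cdot 2^{-66}
\]
\[
  \frac{e^{\hat t} - e^{t}}{e^{t}} = e^{t\varepsilon} - 1 \approx t\,\varepsilon,
  \qquad
  |t\,\varepsilon| \lesssim 710 \cdot 1.8 \cdot 2^{-66} \approx 2^{-55.7}
\]
\[
  \frac{2^{-55.7}}{2^{-53}} \approx 0.16\ \text{ULP}
\]
```

Under these assumptions the log step contributes at most about 0.16 ULP, which together with the 0.5 ULP of the final rounding is of the same order as the worst-case figure quoted above. With only double precision in the log result (relative error near 2^-53), the same estimate would give an error of hundreds of ULP for large |y log x|.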

Jun 06, 2018

Remove sin, cos, tan random testing · f79ee89a

Szabolcs Nagy authored Jun 06, 2018

We don't have implementations for these functions anymore.  Directed
tests are kept; they just test the host libc implementation for now.

f79ee89a

Add new log2 implementation · d69e5045

Szabolcs Nagy authored Jun 05, 2018

A similar algorithm is used as in log, but there are more operations
(and more error) due to the 1/ln2 multiplier.

There is a separate code path, used when the fma instruction is not
available, for computing x/c - 1 precisely (for which the table size is
doubled) and for computing (x/c - 1)/ln2 precisely.

The worst case error is 0.547 ULP (0.55 without fma), the read only
global data size is 1168 bytes (2192 without fma).  The non-nearest
rounding error is less than 1 ULP.

Improvements on Cortex-A72 compared to current glibc master:
log latency: 2.04x
log thruput: 1.87x

d69e5045

Jun 05, 2018
- Add new double precision functions to mathlib.h · a7711a35
  Szabolcs Nagy authored Jun 05, 2018
  
  a7711a35
- Fix the name of log · cdd9a491
  Szabolcs Nagy authored Jun 05, 2018
```
Accidentally published with wrong name.
```
  cdd9a491