Commits · a0ad28dbcb9d52be85dfe622f4c43d9262641300 · Android-smartphones / realme / realme 5pro / kaderbava / external_arm-optimized-routines

Feb 28, 2020

Szabolcs Nagy authored Feb 28, 2020

New functionality
* string: New strrchr and stpcpy routines
* string: New Memory Tagging Extension (MTE) variants of strlen and strchr
* math: New vector version of pow(double)
* networking: Optimized ones' complement checksum for 32-bit and 64-bit Arm

Performance improvements
* string: Improved memcpy and memmove (SIMD and non-SIMD) for 64-bit Arm
* string: Improved memset for 64-bit Arm

a0ad28db

Feb 27, 2020

networking: New subproject. · 6a988f68

Ola Liljedahl authored Feb 27, 2020

Add scalar and NEON ones' complement checksumming implementations for
AArch64 and Armv7-A.
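
As a rough illustration of what the scalar variant computes, here is a minimal portable sketch of a ones' complement sum. The function name `checksum_sketch`, the big-endian word interpretation and the zero padding of an odd trailing byte are assumptions made for this example; they are not the subproject's actual API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine (not the subproject's API): returns the
   16-bit ones' complement sum of nbytes bytes, reading the data as
   big-endian 16-bit words.  An odd trailing byte is padded with zero.
   The result is the plain sum, it is not inverted.  */
uint16_t
checksum_sketch (const void *data, size_t nbytes)
{
  const unsigned char *p = data;
  uint64_t sum = 0;

  assert (p != NULL || nbytes == 0);
  /* Accumulate in a wide integer so that carries are not lost.  */
  while (nbytes >= 2)
    {
      sum += (uint64_t) p[0] << 8 | p[1];
      p += 2;
      nbytes -= 2;
    }
  if (nbytes)
    sum += (uint64_t) p[0] << 8;
  /* Fold the carries back into the low 16 bits (end-around carry).  */
  while (sum >> 16)
    sum = (sum & 0xffff) + (sum >> 16);
  return (uint16_t) sum;
}
```

For the eight bytes 00 01 f2 03 f4 f5 f6 f7 this sketch returns 0xddf2. Optimized versions typically widen the accumulator and process many bytes per iteration, but fold to the same value.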

6a988f68

math: Improve comments, disable errno handling by default · 6f41cff0
Wilco Dijkstra authored Feb 27, 2020
```
Improve comments in math_config.h.

Set WANT_ERRNO to 0 by default.
```
6f41cff0
string: Rename memcpy_simd.S · 4fba27cf
Szabolcs Nagy authored Feb 27, 2020
```
Use consistent file names.
```
4fba27cf

string: Optimize SIMD memcpy · 1f931789

Wilco Dijkstra authored Feb 27, 2020

Further optimize SIMD memcpy. Small cases now include copies up
to 32 bytes. 64-128 byte copies are split into two cases to improve
performance of 64-96 byte copies. Comments have been rewritten.
Performance on the random memcpy benchmark is ~10% faster.

1f931789

string: Fix white space in memcpy · 718e742f
Wilco Dijkstra authored Feb 27, 2020

718e742f

Feb 25, 2020

ARMv8.5 MTE: Add MTE compatible version of strchr. · c8e72e8a

Gabor Kertesz authored Feb 19, 2020

Reading outside the range of the string is only allowed within 16 byte
aligned granules when MTE is enabled.

This implementation is based on string/aarch64/strchr.S

The 64-bit syndrome value is changed to contain only 16 bytes of
data.
The 32 byte loop is unrolled by two 16 byte reads.

c8e72e8a

string: Add stpcpy · 9be4a9b8
Wilco Dijkstra authored Feb 25, 2020
```
Add support for stpcpy on AArch64.
```
9be4a9b8

string: Cleanup memset · b09a519e

Wilco Dijkstra authored Feb 25, 2020

Remove unnecessary code for unused ZVA sizes.
For zero memsets it's faster to use DC ZVA for >= 160 bytes.
Add a define which allows skipping the ZVA size test when the
ZVA size is known to be 64 - reading dczid_el0 may be expensive.
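
The decision described above can be sketched in C as follows. The DCZID_EL0 field layout (bit 4 prohibits DC ZVA, bits 3:0 hold log2 of the block size in 4-byte words) is architectural; the helper names and the choice to consider only a 64-byte block are assumptions for illustration. The register value is passed in as a parameter so the sketch stays portable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a DCZID_EL0 value.  Returns the DC ZVA block size in bytes,
   or 0 if the DZP bit says DC ZVA must not be used.  */
static unsigned
zva_block_bytes (uint64_t dczid)
{
  /* Bits above bit 4 are reserved and expected to read as zero.  */
  assert ((dczid & ~(uint64_t) 0x1f) == 0);
  if (dczid & 0x10)
    return 0;
  return 4u << (dczid & 0xf);
}

/* Hypothetical policy helper: use DC ZVA only for zero fills of at least
   160 bytes.  Handling only a 64-byte block is an assumption made for
   this sketch.  */
static int
should_use_dc_zva (int c, size_t n, uint64_t dczid)
{
  return (unsigned char) c == 0 && n >= 160 && zva_block_bytes (dczid) == 64;
}
```

A build that knows the block size is 64 could skip the `zva_block_bytes` call entirely, which is the effect the define in the commit message aims for.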

b09a519e

Feb 18, 2020

ARMv8.5 MTE: Add MTE compatible version of strlen. · 02cfc9cc

Branislav Rankov authored Feb 06, 2020

Reading outside the range of the string is only allowed within 16 byte
aligned granules when MTE is enabled.

This implementation is based on string/aarch64/strlen.S

Merged the page cross code into the main path and optimized it.
Modified the zeroones mask to ignore the bytes that are loaded but are
not part of the string. Made a special case for when there are 8 bytes
or fewer to check before the alignment boundary.
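
The granule rule can be illustrated with a small C sketch. It is not the assembly implementation: it checks bytes one at a time instead of using a zeroones mask, and the name `strlen_granule_sketch` is hypothetical. What it does show is that every read is a whole 16-byte aligned granule and that bytes loaded before the start of the string are ignored. In strictly portable C, reading past the end of an object is undefined, so the sketch assumes the caller's buffer covers whole granules.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GRANULE 16 /* MTE tag granule size in bytes.  */

size_t
strlen_granule_sketch (const char *s)
{
  const unsigned char *start = (const unsigned char *) s;
  /* Round down to the granule that contains the first byte.  */
  const unsigned char *p
    = (const unsigned char *) ((uintptr_t) s & ~(uintptr_t) (GRANULE - 1));
  /* Bytes of the first granule that precede the string are ignored.  */
  size_t skip = (size_t) (start - p);

  assert (s != NULL);
  for (;;)
    {
      unsigned char g[GRANULE];
      memcpy (g, p, GRANULE); /* One aligned 16-byte read.  */
      for (size_t i = skip; i < GRANULE; i++)
        if (g[i] == 0)
          return (size_t) (p + i - start);
      skip = 0;
      p += GRANULE;
    }
}
```

Because the terminating NUL always lies in the last granule that is read, no load touches a granule that does not contain part of the string.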

02cfc9cc

string: change build system to avoid fragile includes · 1dfd7b85

Szabolcs Nagy authored Feb 12, 2020

Including multiple asm source files into a single top level file
can cause problems. This can be fixed by having one top level
file per target specific source file, but for maintenance and
clarity it's better to use the sub directory structure for selecting
which files to build.

This requires a new ARCH make variable setting in config.mk which
must be consistent with the target of CC.

Note: the __ARM_FEATURE_SVE checks are moved into the SVE asm code.
This is not entirely right: the feature test macro is for ACLE, not
asm support, but this patch is not supposed to change the produced
binaries and some toolchains (e.g. older clang) do not support SVE
instructions.  The intention is to remove these checks eventually
and always build all asm code and only support new toolchains (the
test code will only test the SVE variants if there is target support
for it though).

1dfd7b85

Feb 12, 2020

string: optimize memcpy · 4c175c8b

Wilco Dijkstra authored Feb 12, 2020

Further optimize integer memcpy. Small cases now include copies up
to 32 bytes. 64-128 byte copies are split into two cases to improve
performance of 64-96 byte copies. Comments have been rewritten.

Improves glibc's memcpy-random benchmark by ~10% on Neoverse N1.
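
One common way to handle the small cases without a byte loop is to copy a head chunk and a tail chunk that may overlap in the middle. The C sketch below illustrates that idea for 0 to 32 bytes; the function name and the exact size classes are assumptions for illustration and are not taken from the assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy 0..32 bytes with at most two loads and two
   stores per size class.  The head and tail chunks may overlap.  */
void
copy_small_sketch (void *dst, const void *src, size_t n)
{
  unsigned char *d = dst;
  const unsigned char *s = src;

  assert (n <= 32);
  if (n >= 16)
    {
      unsigned char head[16], tail[16];
      memcpy (head, s, 16);
      memcpy (tail, s + n - 16, 16);
      memcpy (d, head, 16);
      memcpy (d + n - 16, tail, 16);
    }
  else if (n >= 8)
    {
      uint64_t head, tail;
      memcpy (&head, s, 8);
      memcpy (&tail, s + n - 8, 8);
      memcpy (d, &head, 8);
      memcpy (d + n - 8, &tail, 8);
    }
  else if (n >= 4)
    {
      uint32_t head, tail;
      memcpy (&head, s, 4);
      memcpy (&tail, s + n - 4, 4);
      memcpy (d, &head, 4);
      memcpy (d + n - 4, &tail, 4);
    }
  else if (n > 0)
    {
      /* 1..3 bytes: the first, middle and last byte cover every case.  */
      unsigned char a = s[0], b = s[n >> 1], c = s[n - 1];
      d[0] = a;
      d[n >> 1] = b;
      d[n - 1] = c;
    }
}
```

Each size class needs only one comparison to select and no loop, which is why widening the small case to 32 bytes can help short copies.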

4c175c8b

Jan 14, 2020

math: Add more ulp tests · 33ba1908

Szabolcs Nagy authored Jan 14, 2020

Some functions were not tested with the statistical ulp error check
tool, this commit adds tests for the current math symbols.

33ba1908

math: add vector pow · a807c9bb

Szabolcs Nagy authored Jan 10, 2020

This implementation is a wrapper around the scalar pow with the
appropriate call ABI. As such it is not expected to be faster than scalar
calls; the new double-precision vector pow symbols are provided for
completeness.

a807c9bb

string: Remove memcpy_bytewise · 099350af

Wilco Dijkstra authored Jan 14, 2020

This was a placeholder for testing the build system before we added
optimized string code and is thus no longer needed.

099350af

Jan 09, 2020

math: fix spurious overflow in pow with clang · 2771bc7f

Szabolcs Nagy authored Jan 08, 2020

clang does not support c99 fenv_access and may move fp operations out
of conditional blocks causing unconditional fenv side-effects. Here

  if (cond)
    ix = f (x * 0x1p52);

was transformed to

  ix_ = f (x * 0x1p52);
  ix = cond ? ix_ : ix;

where x can be a huge negative value so the mul overflows. The added
barrier should prevent such transformation by significantly increasing
the cost of doing the mul unconditionally.
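
A minimal sketch of such a barrier, assuming a volatile round trip is used to make the operand opaque to the optimizer. The names `barrier_double` and `scaled_bits_sketch` are hypothetical, and `asuint64` stands in for the `f` of the snippet above; the actual fix in pow may differ.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a double as a 64-bit integer.  */
static uint64_t
asuint64 (double x)
{
  uint64_t u;
  assert (sizeof u == sizeof x);
  memcpy (&u, &x, sizeof u);
  return u;
}

/* Hypothetical barrier: the volatile store and load must stay where they
   are written, so the multiplication that depends on the result cannot
   be hoisted out of the conditional.  */
static double
barrier_double (double x)
{
  volatile double y = x;
  return y;
}

uint64_t
scaled_bits_sketch (double x, int cond, uint64_t ix)
{
  if (cond)
    ix = asuint64 (barrier_double (x) * 0x1p52);
  return ix;
}
```

When `cond` is false the multiplication is never evaluated, so a huge `x` cannot raise a spurious overflow.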

Found by enh from Google on Android arm and aarch64 targets.
Fixes GitHub issue #16.

2771bc7f

Jan 07, 2020

string: Fix compilation of AArch64 strrchr with Clang · 0aed5ab8

Jake Weinstein authored Jan 06, 2020

0aed5ab8

Jan 06, 2020

string: Add strrchr · bbd64ec1

Wilco Dijkstra authored Jan 06, 2020

Add strrchr for AArch64. Originally written by Richard Earnshaw. The
same code is present in newlib; this copy has minor edits for inclusion
into the optimized-routines repo.

bbd64ec1

Jan 03, 2020

Add the Assignment Agreement v1.1 document · dbb919dc

Szabolcs Nagy authored Jan 02, 2020

This Assignment Agreement has to be filled in, signed and sent to
optimized-routines-assignment@arm.com by Contributors before their
contributions can be accepted into optimized-routines.

dbb919dc

Jan 02, 2020

string: Use L(name) for labels · 833e8609

Wilco Dijkstra authored Jan 02, 2020

Use L(name) for all assembler labels.

833e8609

string: Use asmdefs.h, ENTRY and END · 31b560bc

Wilco Dijkstra authored Jan 02, 2020

Cleanup string functions to use asmdefs.h, ENTRY and END instead of
defining macros in each file.

31b560bc

Dec 10, 2019

aarch64: Combine memcpy and memmove implementations · 3377796f

Krzysztof Koch authored Dec 09, 2019

Modify integer and SIMD versions of memcpy to handle overlaps correctly.

Make __memmove_aarch64 and __memmove_aarch64_simd alias to
__memcpy_aarch64 and __memcpy_aarch64_simd respectively.

Complete sharing of code between memcpy and memmove implementations is
possible without noticeable performance penalty. This is thanks to
moving the source and destination buffer overlap detection after
the code for handling small and medium copies which are overlap-safe
anyway.

Benchmarking shows that keeping two versions of memcpy is necessary
because newer platforms favor aligning src over destination for large
copies. Using NEON registers also gives a small speedup. However,
aligning dst and using general-purpose registers works best for older
platforms. Consequently, memcpy.S and memcpy_simd.S contain memcpy
code which is identical except for the registers used and src vs dst
alignment.
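
The ordering described above (overlap-safe small and medium copies first, overlap detection only on the large path) can be sketched in C. This is an illustration under assumptions, not the assembly: the function name, the 128-byte threshold used here and the 16-byte chunk size are chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 16

void *
memmove_sketch (void *dst, const void *src, size_t n)
{
  unsigned char *d = dst;
  const unsigned char *s = src;
  unsigned char tmp[128];

  assert (n == 0 || (dst != NULL && src != NULL));
  if (n <= sizeof tmp)
    {
      /* Small and medium copies: load everything, then store everything.
         No store can clobber a byte that is still to be read, so overlap
         needs no special handling.  */
      memcpy (tmp, s, n);
      memcpy (d, tmp, n);
      return dst;
    }
  /* Large copies only: with unsigned wraparound, dst - src < n holds
     exactly when dst lies inside [src, src + n).  */
  if ((uintptr_t) d - (uintptr_t) s < n)
    {
      /* Copy backwards so that stores only hit bytes already read.  */
      size_t i = n;
      while (i >= CHUNK)
        {
          i -= CHUNK;
          memcpy (tmp, s + i, CHUNK);
          memcpy (d + i, tmp, CHUNK);
        }
      memcpy (tmp, s, i);
      memcpy (d, tmp, i);
    }
  else
    {
      /* Copy forwards.  */
      size_t i = 0;
      for (; i + CHUNK <= n; i += CHUNK)
        {
          memcpy (tmp, s + i, CHUNK);
          memcpy (d + i, tmp, CHUNK);
        }
      memcpy (tmp, s + i, n - i);
      memcpy (d + i, tmp, n - i);
    }
  return dst;
}
```

Because the overlap test sits after the small and medium paths, short copies pay nothing for memmove semantics, which is what makes sharing the code with memcpy cheap.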

3377796f

Nov 27, 2019

Update the readme · 709020ed

Szabolcs Nagy authored Nov 27, 2019

Mention releases.

709020ed

Nov 26, 2019

aarch64: Add SIMD version of memcpy · 6d3ae5fc

Krzysztof Koch authored Nov 25, 2019

Create a new memcpy implementation for targets with the NEON extension.

__memcpy_aarch64_simd has been tested on a range of modern
microarchitectures. It turned out to be faster than __memcpy_aarch64 on
all of them, with a performance improvement of 3-11% depending on the
platform.

6d3ae5fc

aarch64: Use common header file in memcpy.S · 015c9519

Krzysztof Koch authored Nov 25, 2019

Include asmdefs.h in memcpy.S to avoid duplicate macro definitions.

Add macro for defining labels in asmdefs.h.

Change the default routine entry point alignment to 64 bytes.

Define a new macro which allows controlling the entry point alignment.

Add include guard to asmdefs.h.

015c9519

Makefile tweak for better subproject handling · dec9ffea

Szabolcs Nagy authored Nov 26, 2019

Don't include the makefile fragments of subprojects that aren't built.

With this the build fails more reasonably when SUBS is set incorrectly.

dec9ffea

Nov 22, 2019

Build system refactoring · 1fd2aaae

Szabolcs Nagy authored Nov 20, 2019

Reorganise the makefiles so subprojects can be used and maintained more
independently. The single toplevel Makefile and config.mk are kept.

Subproject Dir.mk is expected to provide all-X, check-X, clean-X and
install-X targets where X is the subproject name and it may use generic
make variables set in config.mk, like CFLAGS_ALL and CC, or subproject
specific variables like X-cflags.

1fd2aaae

Nov 19, 2019

string: Use .d rather than .2d for element mov instructions · 4f4e530b

George Steed authored Nov 19, 2019

Use .d rather than .2d for element mov instructions in string routines
so the assembly compiles with clang too.

4f4e530b

Nov 06, 2019

math: add WANT_VMATH feature macro · 1f3b1638
Szabolcs Nagy authored Nov 06, 2019
```
When defined as 0 the vector math code is not built and not tested.
```
1f3b1638

math: allow errno setting even if !(math_errhandling&MATH_ERRNO) · 675721a4

Szabolcs Nagy authored Nov 06, 2019

The math_errhandling checks are incorrect in general: math_errhandling
is defined by the libc math.h, which is not appropriate for the
optimized-routines provided functions that we are testing.

However even if we want to test a libc implementation, ISO C allows
the setting of errno even if !(math_errhandling&MATH_ERRNO), so
relax the checks.

675721a4

math: fix unused function warnings in mathbench.c · 80922e8b

Szabolcs Nagy authored Nov 06, 2019

Vector functions are only used on aarch64, so only define them there.

  math/test/mathbench.c:95:1: warning: '__v_dummyf' defined but not used [-Wunused-function]

80922e8b

math: fix missing attributes warnings · 2c6c2405

Szabolcs Nagy authored Nov 06, 2019

gcc-9 started warning if alias symbols have different attributes:

math/expf.c: At top level:
math/expf.c:89:21: warning: '__expf_finite' specifies less restrictive attributes than its target 'expf': 'leaf', 'nothrow', 'pure' [-Wmissing-attributes]

so copy the attributes when creating the aliases.
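
A minimal sketch of copying attributes onto an alias, assuming a GCC-compatible compiler and an ELF target. The macro names and the `twice` functions are hypothetical examples; the `copy` attribute itself is available from GCC 9 onward, so it is guarded by a version check.

```c
#include <assert.h>

#if defined (__GNUC__) && __GNUC__ >= 9
# define attribute_copy(f) __attribute__ ((copy (f)))
#else
# define attribute_copy(f)
#endif

/* Hypothetical alias macro: declare A as another name for F and, where
   supported, copy the attributes of F onto it.  */
#define strong_alias(f, a) \
  extern __typeof (f) a __attribute__ ((alias (#f))) attribute_copy (f);

__attribute__ ((pure, nothrow, leaf)) int twice (int x);

int
twice (int x)
{
  return 2 * x;
}

strong_alias (twice, twice_finite)

/* Usage example: both names resolve to the same code.  */
void
check_alias_sketch (void)
{
  assert (twice_finite (21) == twice (21));
}
```

With the attributes copied, the alias declaration is no less restrictive than its target, which is the condition the gcc-9 warning checks.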

2c6c2405

math: fix unused variable warnings · 17cd8af3

Szabolcs Nagy authored Nov 06, 2019

Compilers (incorrectly) warn about unused volatile variables:

  math/math_config.h: In function 'force_eval_float':
  math/math_config.h:188:18: warning: unused variable 'y' [-Wunused-variable]

silence them.

17cd8af3

math: move definitions in internal header · 9e6e14e2

Szabolcs Nagy authored Nov 06, 2019

Compiler checks and related macros need to be done earlier so they are
usable for the static inline functions.

9e6e14e2

Fix building outside the source directory · 0eb42807
Szabolcs Nagy authored Nov 06, 2019
```
Fix the Makefile so the documented mechanism in the README still works.
```
0eb42807

Nov 05, 2019

Add vector exp2f · 69170e15

Szabolcs Nagy authored Oct 14, 2019

Same design as in expf. Worst-case error of __v_exp2f and __v_exp2f_1u
is 1.96 and 0.88 ulp respectively.

It is not clear whether the round/convert instructions or the +- Shift
trick is better. The latter seems more consistently faster for expf and
the former for exp2f, but both options are kept in the code for now.
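
The "+- Shift" option can be illustrated with a small C sketch: adding and subtracting a large constant rounds to the nearest integer, and the integer itself can be read from the low bits of the intermediate value. The function name is hypothetical, and the sketch assumes the default rounding mode, float arithmetic without excess precision, and |x| < 0x1p22.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a float as a 32-bit integer.  */
static uint32_t
asuint (float x)
{
  uint32_t u;
  memcpy (&u, &x, sizeof u);
  return u;
}

/* Round x to the nearest integer (ties to even) without a round or
   convert instruction.  Stores the rounded value in *r and returns it
   as an integer.  */
int32_t
round_shift_sketch (float x, float *r)
{
  const float shift = 0x1.8p23f;
  float z;

  assert (x > -0x1p22f && x < 0x1p22f);
  /* In [0x1p23, 0x1p24) floats are spaced 1 apart, so the addition
     itself performs the rounding.  */
  z = x + shift;
  *r = z - shift;
  /* z and shift share an exponent, so their bit patterns differ by
     exactly the rounded integer (two's complement wraparound assumed).  */
  return (int32_t) (asuint (z) - asuint (shift));
}
```

The alternative uses a dedicated round-to-nearest instruction followed by a float-to-integer conversion. Which one is faster depends on the surrounding code, as the message notes.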

69170e15

math: fix runulp.sh · 65464ec6

Szabolcs Nagy authored Nov 05, 2019

Use heredoc instead of pipe when iterating over test cases to avoid
creating a subshell that would break the PASS/FAIL accounting.

65464ec6

aarch64: Increase small and medium cases for memcpy · fec28a72

Krzysztof Koch authored Oct 25, 2019

Increase the upper bound on medium cases from 96 to 128 bytes.
Now, up to 128 bytes are copied unrolled.

Increase the upper bound on small cases from 16 to 32 bytes so that
copies of 17-32 bytes are not impacted by the larger medium case.

fec28a72

Oct 17, 2019

Add -Werror=implicit-function-declaration · 433a3b1f

Szabolcs Nagy authored Oct 17, 2019

Implicit function declaration is always a bug, but compilers don't
turn it into an error by default for historical reasons, so add it
to the default config.

433a3b1f

fix the build of s_powf.o on non-aarch64 targets · 3d7ecfe3

Szabolcs Nagy authored Oct 17, 2019

This is a simple fix to the v_powf code, but in general the vector
code may not work on arbitrary targets even when compiled with
scalar types (s_powf.c), so in the long term maybe all s_* should
be disabled for non-aarch64 targets (this requires test system and
header changes too).

3d7ecfe3