- Feb 28, 2020
-
Szabolcs Nagy authored
New functionality
* string: New strrchr and stpcpy routines
* string: New Memory Tagging Extension (MTE) variants of strlen and strchr
* math: New vector version of pow(double)
* networking: Optimized ones' complement checksum for 32-bit and 64-bit Arm

Performance improvements
* string: Improved memcpy and memmove (SIMD and non-SIMD) for 64-bit Arm
* string: Improved memset for 64-bit Arm
-
- Feb 27, 2020
-
Ola Liljedahl authored
Add scalar and NEON ones' complement checksumming implementations for AArch64 and Armv7-A.
-
Wilco Dijkstra authored
Improve comments in math_config.h. Set WANT_ERRNO to 0 by default.
-
Szabolcs Nagy authored
Use consistent file names.
-
Wilco Dijkstra authored
Further optimize SIMD memcpy. Small cases now include copies up to 32 bytes. 64-128 byte copies are split into two cases to improve performance of 64-96 byte copies. Comments have been rewritten. The random memcpy benchmark is ~10% faster.
-
Wilco Dijkstra authored
-
- Feb 25, 2020
-
Gabor Kertesz authored
Reading outside the range of the string is only allowed within 16-byte aligned granules when MTE is enabled. This implementation is based on string/aarch64/strchr.S. The 64-bit syndrome value is changed to contain only 16 bytes of data. The 32-byte loop is unrolled into two 16-byte reads.
-
Wilco Dijkstra authored
Add support for stpcpy on AArch64.
-
Wilco Dijkstra authored
Remove unnecessary code for unused ZVA sizes. For zero memsets it's faster to use DC ZVA for sizes >= 160 bytes. Add a define which allows skipping the ZVA size test when the ZVA size is known to be 64, since reading dczid_el0 may be expensive.
-
- Feb 18, 2020
-
Branislav Rankov authored
Reading outside the range of the string is only allowed within 16-byte aligned granules when MTE is enabled. This implementation is based on string/aarch64/strlen.S. Merged the page-cross code into the main path and optimized it. Modified the zeroones mask to ignore bytes that are loaded but are not part of the string. Made a special case for when there are 8 bytes or fewer to check before the alignment boundary.
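The zero-byte detection that strlen-style routines depend on can be sketched in portable C (the function name is hypothetical; the real routine computes an equivalent syndrome in NEON registers). The result is nonzero exactly when the word contains a zero byte, and masking this syndrome is how bytes that were loaded but are not part of the string can be ignored.

```c
#include <stdint.h>

/* Hypothetical sketch: the result is nonzero iff the 64-bit word x
   contains a zero byte. strlen-style code masks this "syndrome" to
   ignore loaded bytes that lie past the end of the string. */
static uint64_t has_zero_byte (uint64_t x)
{
  return (x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL;
}
```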
-
Szabolcs Nagy authored
Including multiple asm source files into a single top-level file can cause problems. This could be fixed by having one top-level file per target-specific source file, but for maintenance and clarity it's better to use the subdirectory structure to select which files to build. This requires a new ARCH make variable setting in config.mk which must be consistent with the target of CC. Note: the __ARM_FEATURE_SVE checks are moved into the SVE asm code. This is not entirely right: the feature test macro is for ACLE, not asm support, but this patch is not supposed to change the produced binaries and some toolchains (e.g. older clang) do not support SVE instructions. The intention is to eventually remove these checks, always build all asm code and only support new toolchains (the test code will only test the SVE variants if there is target support for them, though).
-
- Feb 12, 2020
-
Wilco Dijkstra authored
Further optimize integer memcpy. Small cases now include copies up to 32 bytes. 64-128 byte copies are split into two cases to improve performance of 64-96 byte copies. Comments have been rewritten. Improves glibc's memcpy-random benchmark by ~10% on Neoverse N1.
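A minimal C sketch of the size classes described above (thresholds taken from the commit message; the function name and structure are hypothetical, and the large-copy path is a simple loop rather than the aligned, unrolled code in memcpy.S):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of the size dispatch: small up to 32 bytes
   (was 16), medium up to 128 (was 96) with 64-128 split into two
   cases, then a simple large-copy loop as a stand-in. */
static void *sketch_memcpy (void *restrict dstp, const void *restrict srcp,
                            size_t n)
{
  unsigned char *d = dstp;
  const unsigned char *s = srcp;

  if (n <= 32)                        /* small: up to 32 bytes */
    {
      if (n >= 16)
        {
          memcpy (d, s, 16);                    /* two possibly */
          memcpy (d + n - 16, s + n - 16, 16);  /* overlapping chunks */
        }
      else if (n >= 8)
        {
          memcpy (d, s, 8);
          memcpy (d + n - 8, s + n - 8, 8);
        }
      else if (n >= 4)
        {
          memcpy (d, s, 4);
          memcpy (d + n - 4, s + n - 4, 4);
        }
      else if (n > 0)
        {
          d[0] = s[0];                          /* 1-3 bytes */
          d[n / 2] = s[n / 2];
          d[n - 1] = s[n - 1];
        }
    }
  else if (n <= 64)                   /* medium: 33-64 bytes */
    {
      memcpy (d, s, 32);
      memcpy (d + n - 32, s + n - 32, 32);
    }
  else if (n <= 96)                   /* the cheaper half of 64-128 */
    {
      memcpy (d, s, 64);
      memcpy (d + n - 32, s + n - 32, 32);
    }
  else if (n <= 128)
    {
      memcpy (d, s, 64);
      memcpy (d + n - 64, s + n - 64, 64);
    }
  else                                /* large: simple loop stand-in */
    {
      size_t i = 0;
      for (; i + 32 <= n; i += 32)
        memcpy (d + i, s + i, 32);
      memcpy (d + n - 32, s + n - 32, 32);      /* overlapping tail */
    }
  return dstp;
}
```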
-
- Jan 14, 2020
-
Szabolcs Nagy authored
Some functions were not tested with the statistical ulp error check tool; this commit adds tests for the current math symbols.
-
Szabolcs Nagy authored
This implementation is a wrapper around scalar pow with the appropriate call ABI. As such it is not expected to be faster than scalar calls; the new double-precision vector pow symbols are provided for completeness.
-
Wilco Dijkstra authored
This was a placeholder for testing the build system before optimized string code was added, and is thus no longer needed.
-
- Jan 09, 2020
-
Szabolcs Nagy authored
clang does not support C99 fenv_access and may move floating-point operations out of conditional blocks, causing unconditional fenv side effects. Here

    if (cond) ix = f (x * 0x1p52);

was transformed to

    ix_ = f (x * 0x1p52); ix = cond ? ix_ : ix;

where x can be a huge negative value, so the mul overflows. The added barrier should prevent such a transformation by significantly increasing the cost of doing the mul unconditionally. Found by enh from Google on Android arm and aarch64 targets. Fixes github issue #16.
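A hedged C sketch of the barrier idea (names are hypothetical; the actual helpers in math_config.h may differ): routing the product through a volatile temporary keeps the multiply under the branch, so the overflow and its fenv side effect only happen when cond is true.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical barrier: the volatile store/load makes the multiply
   expensive enough that the compiler will not hoist it out of the
   conditional block. */
static inline double opt_barrier_double (double x)
{
  volatile double y = x;
  return y;
}

/* Reinterpret the bits of a double as a 64-bit integer. */
static int64_t asint64 (double x)
{
  int64_t i;
  memcpy (&i, &x, sizeof i);
  return i;
}

/* The multiply only runs, and can only overflow, when cond is true. */
static int64_t convert_if (int cond, double x, int64_t ix)
{
  if (cond)
    ix = asint64 (opt_barrier_double (x * 0x1p52));
  return ix;
}
```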
-
- Jan 07, 2020
-
Jake Weinstein authored
-
- Jan 06, 2020
-
Wilco Dijkstra authored
Add strrchr for AArch64. Originally written by Richard Earnshaw; the same code is present in newlib, and this copy has minor edits for inclusion into the optimized-routines repo.
-
- Jan 03, 2020
-
Szabolcs Nagy authored
This Assignment Agreement has to be filled in, signed and sent to optimized-routines-assignment@arm.com by Contributors before their contributions can be accepted into optimized-routines.
-
- Jan 02, 2020
-
Wilco Dijkstra authored
Use L(name) for all assembler labels.
-
Wilco Dijkstra authored
Cleanup string functions to use asmdefs.h, ENTRY and END instead of defining macros in each file.
-
- Dec 10, 2019
-
Krzysztof Koch authored
Modify integer and SIMD versions of memcpy to handle overlaps correctly. Make __memmove_aarch64 and __memmove_aarch64_simd aliases of __memcpy_aarch64 and __memcpy_aarch64_simd respectively. Complete sharing of code between the memcpy and memmove implementations is possible without a noticeable performance penalty, thanks to moving the source/destination buffer overlap detection after the code for handling small and medium copies, which are overlap-safe anyway. Benchmarking shows that keeping two versions of memcpy is necessary: newer platforms favor aligning src over dst for large copies, and using NEON registers also gives a small speedup, whereas aligning dst and using general-purpose registers works best for older platforms. Consequently, memcpy.S and memcpy_simd.S contain memcpy code which is identical except for the registers used and src vs dst alignment.
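Why the small-copy path is overlap-safe can be shown in C (a hypothetical helper; the real code applies the same load-everything-then-store pattern with larger chunks): because every load completes before any store, the same code is valid for both memcpy and memmove without an explicit overlap check.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: copy 4-16 bytes as two possibly overlapping
   chunks, loading both before storing either, so overlapping src/dst
   buffers are handled correctly with no overlap test. */
static void copy_4_to_16 (char *dst, const char *src, size_t n)
{
  if (n >= 8)
    {
      uint64_t a, b;
      memcpy (&a, src, 8);           /* load first 8 bytes */
      memcpy (&b, src + n - 8, 8);   /* load last 8 bytes (may overlap) */
      memcpy (dst, &a, 8);           /* all loads done before any store */
      memcpy (dst + n - 8, &b, 8);
    }
  else
    {
      uint32_t a, b;
      memcpy (&a, src, 4);
      memcpy (&b, src + n - 4, 4);
      memcpy (dst, &a, 4);
      memcpy (dst + n - 4, &b, 4);
    }
}
```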
-
- Nov 27, 2019
-
Szabolcs Nagy authored
Mention releases.
-
- Nov 26, 2019
-
Krzysztof Koch authored
Create a new memcpy implementation for targets with the NEON extension. __memcpy_aarch64_simd has been tested on a range of modern microarchitectures. It turned out to be faster than __memcpy_aarch64 on all of them, with a performance improvement of 3-11% depending on the platform.
-
Krzysztof Koch authored
Include asmdefs.h in memcpy.S to avoid duplicate macro definitions. Add macro for defining labels in asmdefs.h. Change the default routine entry point alignment to 64 bytes. Define a new macro which allows controlling the entry point alignment. Add include guard to asmdefs.h.
-
Szabolcs Nagy authored
Don't include the makefile fragments of subprojects that aren't built. With this the build fails more reasonably when SUBS is set incorrectly.
-
- Nov 22, 2019
-
Szabolcs Nagy authored
Reorganise the makefiles so subprojects can be used and maintained more separately. The single top-level Makefile and config.mk are kept. A subproject's Dir.mk is expected to provide all-X, check-X, clean-X and install-X targets, where X is the subproject name; it may use generic make variables set in config.mk, like CFLAGS_ALL and CC, or subproject-specific variables like X-cflags.
-
- Nov 19, 2019
-
George Steed authored
Use .d rather than .2d for element mov instructions in string routines so the assembly compiles with clang too.
-
- Nov 06, 2019
-
Szabolcs Nagy authored
When defined as 0 the vector math code is not built and not tested.
-
Szabolcs Nagy authored
The math_errhandling checks are incorrect in general: it is defined by the libc math.h which is not appropriate for optimized-routines provided functions that we are testing. However even if we want to test a libc implementation, ISO C allows the setting of errno even if !(math_errhandling&MATH_ERRNO), so relax the checks.
-
Szabolcs Nagy authored
Vector functions are only used on aarch64, so only define them there. This fixes:

    math/test/mathbench.c:95:1: warning: '__v_dummyf' defined but not used [-Wunused-function]
-
Szabolcs Nagy authored
gcc-9 started warning if alias symbols have different attributes:

    math/expf.c: At top level:
    math/expf.c:89:21: warning: '__expf_finite' specifies less restrictive attributes than its target 'expf': 'leaf', 'nothrow', 'pure' [-Wmissing-attributes]

so copy the attributes when creating the aliases.
-
Szabolcs Nagy authored
Compilers (incorrectly) warn about unused volatile variables:

    math/math_config.h: In function 'force_eval_float':
    math/math_config.h:188:18: warning: unused variable 'y' [-Wunused-variable]

Silence them.
-
Szabolcs Nagy authored
Compiler checks and related macros need to be done earlier so they are usable in the static inline functions.
-
Szabolcs Nagy authored
Fix the Makefile so the documented mechanism in the README still works.
-
- Nov 05, 2019
-
Szabolcs Nagy authored
Same design as in expf. Worst-case error of __v_exp2f and __v_exp2f_1u is 1.96 and 0.88 ulp respectively. It is not clear whether round/convert instructions or the +- Shift trick are better. For expf the latter, for exp2f the former seems more consistently faster, but both options are kept in the code for now.
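The "+- Shift" alternative mentioned here can be sketched in scalar C (double constants for illustration; the vector code uses the analogous per-lane operations): adding and then subtracting a large constant rounds to an integer under the current rounding mode, without explicit round/convert instructions.

```c
/* Sketch of the add/subtract Shift trick: for |x| well below 2^51,
   x + SHIFT forces rounding to an integer in the significand, and
   subtracting SHIFT recovers that integer as a double. Requires
   default IEEE semantics (no -ffast-math). */
#define SHIFT 0x1.8p52 /* 2^52 + 2^51 */

static double round_via_shift (double x)
{
  double y = x + SHIFT; /* rounds x to nearest integer (ties to even) */
  return y - SHIFT;
}
```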
-
Szabolcs Nagy authored
Use a heredoc instead of a pipe when iterating over test cases, to avoid creating a subshell that would break the PASS/FAIL accounting.
-
Krzysztof Koch authored
Increase the upper bound on medium cases from 96 to 128 bytes. Now, up to 128 bytes are copied unrolled. Increase the upper bound on small cases from 16 to 32 bytes so that copies of 17-32 bytes are not impacted by the larger medium case.
-
- Oct 17, 2019
-
Szabolcs Nagy authored
Implicit function declaration is always a bug, but compilers don't turn it into an error by default for historical reasons, so add it to the default config.
-
Szabolcs Nagy authored
This is a simple fix to the v_powf code, but in general the vector code may not work on arbitrary targets even when compiled with scalar types (s_powf.c), so in the long term maybe all s_* variants should be disabled for non-aarch64 targets (this requires test system and header changes too).
-