  1. Feb 28, 2020
    • v20.02 release · a0ad28db
      Szabolcs Nagy authored
      New functionality
      * string: New strrchr and stpcpy routines
      * string: New Memory Tagging Extension (MTE) variants of strlen and strchr
      * math: New vector version of pow(double)
      * networking: Optimized ones' complement checksum for 32-bit and 64-bit Arm
      
      Performance improvements
      * string: Improved memcpy and memmove (SIMD and non-SIMD) for 64-bit Arm
      * string: Improved memset for 64-bit Arm
  2. Feb 27, 2020
  3. Feb 25, 2020
    • ARMv8.5 MTE: Add MTE compatible version of strchr. · c8e72e8a
      Gabor Kertesz authored
      Reading outside the range of the string is only allowed within 16 byte
      aligned granules when MTE is enabled.
      
      This implementation is based on string/aarch64/strchr.S
      
      The 64-bit syndrome value is changed to contain only 16 bytes of
      data.
      The 32 byte loop is unrolled by two 16 byte reads.
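
The granule rule in this commit can be illustrated with a minimal C sketch. This is not the assembly from the repository: the name `strchr_granule_sketch` and the syndrome layout (2 bits per byte) are assumptions made for the example. Each iteration inspects one 16-byte aligned chunk, so it may look at bytes after the terminator but never outside the granule that contains it. It assumes the bytes between the granule start and `s` belong to the same object, as they do for a 16-byte aligned buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch only: find c in s while touching memory one aligned 16-byte
   granule at a time. The syndrome encoding is hypothetical:
   bit 2*i   is set when byte i equals c,
   bit 2*i+1 is set when byte i is the NUL terminator. */
const char *strchr_granule_sketch(const char *s, int c)
{
    size_t off = (uintptr_t)s & 15;              /* bytes before s in its granule */
    const unsigned char *p = (const unsigned char *)s - off;
    unsigned char ch = (unsigned char)c;

    for (;; p += 16, off = 0) {
        uint32_t syndrome = 0;

        /* Bytes before s are skipped; bytes after the terminator may be
           read, but only inside the current granule. */
        for (size_t i = off; i < 16; i++) {
            if (p[i] == ch)
                syndrome |= 1u << (2 * i);
            if (p[i] == 0)
                syndrome |= 2u << (2 * i);
        }

        if (syndrome != 0) {
            unsigned bit = 0;
            while (((syndrome >> bit) & 1u) == 0)
                bit++;
            /* Odd bit first: the terminator came before any match. */
            return (bit & 1u) ? NULL : (const char *)p + bit / 2;
        }
    }
}
```

Because the match bit of a byte sits below its NUL bit, searching for `'\0'` returns a pointer to the terminator, as strchr requires.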
    • string: Add stpcpy · 9be4a9b8
      Wilco Dijkstra authored
      Add support for stpcpy on AArch64.
    • string: Cleanup memset · b09a519e
      Wilco Dijkstra authored
      Remove unnecessary code for unused ZVA sizes.
      For zero memsets it's faster to use DC ZVA for >= 160 bytes.
      Add a define which allows skipping the ZVA size test when the
      ZVA size is known to be 64, since reading dczid_el0 may be expensive.
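
The decision described above can be sketched in C. This is an illustration under assumptions, not the code in memset.S: the macro `ASSUME_ZVA_64` and both function names are hypothetical, and the sketch only treats 64-byte blocks as usable. The DCZID_EL0 decoding follows the architectural layout: bits [3:0] hold log2 of the block size in 4-byte words, and bit 4 (DZP) set means DC ZVA is prohibited.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical build-time switch: define it when the target is known
   to use 64-byte DC ZVA blocks, so the register value is never decoded. */
/* #define ASSUME_ZVA_64 1 */

/* Return the DC ZVA block size in bytes for a given DCZID_EL0 value,
   or 0 when DC ZVA must not be used. */
unsigned zva_block_bytes(uint64_t dczid)
{
#ifdef ASSUME_ZVA_64
    (void)dczid;
    return 64;
#else
    if (dczid & 0x10)               /* DZP: zeroing prohibited */
        return 0;
    return 4u << (dczid & 0xf);     /* 4-byte words << log2(words) */
#endif
}

/* Zero fills of at least 160 bytes take the DC ZVA path (threshold from
   the commit message); everything else uses ordinary stores. */
int use_dc_zva(int value, size_t count, uint64_t dczid)
{
    return value == 0 && count >= 160 && zva_block_bytes(dczid) == 64;
}
```

On real hardware the `dczid` argument would come from reading the system register, which is the step the define lets a build skip.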
  4. Feb 18, 2020
    • ARMv8.5 MTE: Add MTE compatible version of strlen. · 02cfc9cc
      Branislav Rankov authored
      Reading outside the range of the string is only allowed within 16 byte
      aligned granules when MTE is enabled.
      
      This implementation is based on string/aarch64/strlen.S
      
      Merged the page cross code into the main path and optimized it.
      Modified the zeroones mask to ignore the bytes that are loaded but are
      not part of the string. Made a special case for when there are 8 bytes
      or fewer to check before the alignment boundary.
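
A minimal C sketch of the masking idea follows. It is not the routine from the repository: `strlen_granule_sketch` and `has_zero_byte` are hypothetical names, it works on 8-byte aligned words (each of which lies inside a single 16-byte granule), and instead of adjusting the zeroones mask it forces the loaded bytes that precede the string to a nonzero value, which has the same effect of ignoring them. It assumes the bytes between the aligned address and `s` belong to the same object.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES  0x0101010101010101ULL
#define HIGHS 0x8080808080808080ULL

/* Nonzero exactly when at least one byte of w is zero. */
static uint64_t has_zero_byte(uint64_t w)
{
    return (w - ONES) & ~w & HIGHS;
}

/* Sketch only: strlen that reads whole aligned 8-byte words, so no
   access ever leaves the 16-byte granule holding string data. */
size_t strlen_granule_sketch(const char *s)
{
    size_t skip = (uintptr_t)s & 7;   /* loaded bytes that precede s */
    const char *p = s - skip;         /* 8-byte aligned start */
    unsigned char bytes[8];
    uint64_t w;

    memcpy(bytes, p, 8);
    memset(bytes, 0xff, skip);        /* ignore bytes not part of the string */
    memcpy(&w, bytes, 8);

    while (!has_zero_byte(w)) {
        p += 8;
        memcpy(bytes, p, 8);
        memcpy(&w, bytes, 8);
    }

    /* The current word holds the terminator; locate it bytewise. */
    size_t i = 0;
    while (bytes[i] != 0)
        i++;
    return (size_t)(p + i - s);
}
```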
    • string: change build system to avoid fragile includes · 1dfd7b85
      Szabolcs Nagy authored
      Including multiple asm source files into a single top level file
      can cause problems. This can be fixed by having one top level
      file per target specific source file, but for maintenance and
      clarity it's better to use the sub directory structure for selecting
      which files to build.
      
      This requires a new ARCH make variable setting in config.mk which
      must be consistent with the target of CC.
      
      Note: the __ARM_FEATURE_SVE checks are moved into the SVE asm code.
      This is not entirely right: the feature test macro is for ACLE, not
      asm support, but this patch is not supposed to change the produced
      binaries and some toolchains (e.g. older clang) do not support SVE
      instructions.  The intention is to remove these checks eventually
      and always build all asm code and only support new toolchains (the
      test code will only test the SVE variants if there is target support
      for it though).
  5. Feb 12, 2020
    • string: optimize memcpy · 4c175c8b
      Wilco Dijkstra authored
      Further optimize integer memcpy. Small cases now include copies up
      to 32 bytes. 64-128 byte copies are split into two cases to improve
      performance of 64-96 byte copies. Comments have been rewritten.
      
      Improves glibc's memcpy-random benchmark by ~10% on Neoverse N1.
  6. Jan 14, 2020
    • math: Add more ulp tests · 33ba1908
      Szabolcs Nagy authored
      Some functions were not tested with the statistical ulp error check
      tool; this commit adds tests for the current math symbols.
    • math: add vector pow · a807c9bb
      Szabolcs Nagy authored
      This implementation is a wrapper around the scalar pow with the appropriate
      call ABI. As such it is not expected to be faster than scalar calls;
      the new double-precision vector pow symbols are provided for completeness.
    • string: Remove memcpy_bytewise · 099350af
      Wilco Dijkstra authored
      This was a placeholder for testing the build system before we added
      optimized string code and is thus no longer needed.
  7. Jan 09, 2020
    • math: fix spurious overflow in pow with clang · 2771bc7f
      Szabolcs Nagy authored
      clang does not support C99 FENV_ACCESS and may move fp operations out
      of conditional blocks, causing unconditional fenv side-effects. Here
      
        if (cond)
          ix = f (x * 0x1p52);
      
      was transformed to
      
        ix_ = f (x * 0x1p52);
        ix = cond ? ix_ : ix;
      
      where x can be a huge negative value so the mul overflows. The added
      barrier should prevent such a transformation by significantly increasing
      the cost of doing the mul unconditionally.
      
      Found by enh from Google on Android arm and aarch64 targets.
      Fixes GitHub issue #16.
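
One way such a barrier can look is sketched below. This is an illustration under assumptions, not necessarily the helper used in the repository: `opt_barrier`, `asuint64` and `scale_if` are names chosen for the example, with `asuint64` standing in for the `f` of the commit message. Routing `x` through a volatile object makes the multiplication depend on a volatile read, which the compiler cannot perform speculatively, so hoisting the mul out of the conditional becomes unattractive.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical barrier: the value passes through a volatile object, so
   the compiler has to treat the read as an observable access. */
static inline double opt_barrier(double x)
{
    volatile double y = x;
    return y;
}

/* Reinterpret the bits of a double as a 64-bit integer. */
static inline uint64_t asuint64(double x)
{
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

/* The pattern from the commit message: the multiplication must only
   happen when cond holds, otherwise a huge x would overflow and raise
   a spurious floating-point exception. */
uint64_t scale_if(int cond, double x, uint64_t ix)
{
    if (cond)
        ix = asuint64(opt_barrier(x) * 0x1p52);
    return ix;
}
```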
  8. Jan 07, 2020
  9. Jan 06, 2020
    • string: Add strrchr · bbd64ec1
      Wilco Dijkstra authored
      Add strrchr for AArch64. Originally written by Richard Earnshaw; the same
      code is present in newlib. This copy has minor edits for inclusion into
      the optimized-routines repo.
  10. Jan 03, 2020
    • Add the Assignment Agreement v1.1 document · dbb919dc
      Szabolcs Nagy authored
      This Assignment Agreement has to be filled in, signed and sent to
      optimized-routines-assignment@arm.com by Contributors before their
      contributions can be accepted into optimized-routines.
  11. Jan 02, 2020
  12. Dec 10, 2019
    • aarch64: Combine memcpy and memmove implementations · 3377796f
      Krzysztof Koch authored
      Modify integer and SIMD versions of memcpy to handle overlaps correctly.
      
      Make __memmove_aarch64 and __memmove_aarch64_simd alias to
      __memcpy_aarch64 and __memcpy_aarch64_simd respectively.
      
      Complete sharing of code between memcpy and memmove implementations is
      possible without noticeable performance penalty. This is thanks to
      moving the source and destination buffer overlap detection after
      the code for handling small and medium copies which are overlap-safe
      anyway.
      
      Benchmarking shows that keeping two versions of memcpy is necessary
      because newer platforms favor aligning src over destination for large
      copies. Using NEON registers also gives a small speedup. However,
      aligning dst and using general-purpose registers works best for older
      platforms. Consequently, memcpy.S and memcpy_simd.S contain memcpy
      code which is identical except for the registers used and src vs dst
      alignment.
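
Why small and medium copies are overlap-safe can be shown with a short C sketch. It is an illustration only: `copy_16_to_32` is a hypothetical name and the 16 to 32 byte range is chosen for the example rather than taken from the assembly. All loads are done before any store, so the result matches memmove even when the buffers overlap, and the head and tail chunks may overlap each other to cover any length in the range without a loop.

```c
#include <stddef.h>
#include <string.h>

/* Sketch only: copy n bytes, 16 <= n <= 32, using one 16-byte chunk
   from the start and one from the end of the source. */
void copy_16_to_32(unsigned char *dst, const unsigned char *src, size_t n)
{
    unsigned char head[16], tail[16];

    /* Load everything first ... */
    memcpy(head, src, 16);
    memcpy(tail, src + n - 16, 16);

    /* ... then store, so overlapping src and dst cannot corrupt data. */
    memcpy(dst, head, 16);
    memcpy(dst + n - 16, tail, 16);
}
```

Only large copies, which loop over the buffer, need to detect overlap and pick a copy direction, which is why the check can sit after these cases.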
  13. Nov 27, 2019
  14. Nov 26, 2019
    • arch64: Add SIMD version of memcpy · 6d3ae5fc
      Krzysztof Koch authored
      Create a new memcpy implementation for targets with the NEON extension.
      
      __memcpy_aarch64_simd has been tested on a range of modern
      microarchitectures. It turned out to be faster than __memcpy_aarch64 on
      all of them, with a performance improvement of 3-11% depending on the
      platform.
    • aarch64: Use common header file in memcpy.S · 015c9519
      Krzysztof Koch authored
      Include asmdefs.h in memcpy.S to avoid duplicate macro definitions.
      
      Add macro for defining labels in asmdefs.h.
      
      Change the default routine entry point alignment to 64 bytes.
      
      Define a new macro which allows controlling the entry point alignment.
      
      Add include guard to asmdefs.h.
    • Makefile tweak for better subproject handling · dec9ffea
      Szabolcs Nagy authored
      Don't include the makefile fragments of subprojects that aren't built.
      
      With this the build fails more reasonably when SUBS is set incorrectly.
  15. Nov 22, 2019
    • Build system refactoring · 1fd2aaae
      Szabolcs Nagy authored
      Reorganise the makefiles so subprojects can be used and maintained
      more independently.  The single toplevel Makefile and config.mk are kept.
      
      Subproject Dir.mk is expected to provide all-X, check-X, clean-X and
      install-X targets where X is the subproject name and it may use generic
      make variables set in config.mk, like CFLAGS_ALL and CC, or subproject
      specific variables like X-cflags.
  16. Nov 19, 2019
  17. Nov 06, 2019
  18. Nov 05, 2019
    • Add vector exp2f · 69170e15
      Szabolcs Nagy authored
      Same design as in expf. Worst-case error of __v_exp2f and __v_exp2f_1u
      is 1.96 and 0.88 ulp respectively.
      
      It is not clear whether round/convert instructions or adding and
      subtracting Shift is better. For expf the latter and for exp2f the former
      seems more consistently faster, but both options are kept in the code for now.
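
The Shift alternative mentioned here can be sketched in scalar C. This is an illustration under assumptions: `round_via_shift` is a hypothetical name and the constant `0x1.8p23f` is assumed rather than quoted from the source. Adding a constant of that magnitude pushes the value into a range where floats have a spacing of 1, so the addition itself rounds to the nearest integer (ties to even), and subtracting the constant recovers the rounded value. It assumes round-to-nearest mode and |x| below 0x1p22.

```c
/* Sketch only: round x to the nearest integer without a dedicated
   round/convert instruction. */
float round_via_shift(float x)
{
    const float Shift = 0x1.8p23f;   /* assumed value, 1.5 * 2^23 */

    /* volatile forces the sum to be rounded to float and keeps the
       compiler from simplifying (x + Shift) - Shift back to x. */
    volatile float z = x + Shift;
    return z - Shift;
}
```

In the vector routines the integer result is also needed to build the exponent bits, which is where the choice between this trick and round/convert instructions affects speed.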
    • math: fix runulp.sh · 65464ec6
      Szabolcs Nagy authored
      Use heredoc instead of pipe when iterating over test cases to avoid
      creating a subshell that would break the PASS/FAIL accounting.
    • aarch64: Increase small and medium cases for memcpy · fec28a72
      Krzysztof Koch authored
      Increase the upper bound on medium cases from 96 to 128 bytes.
      Now, up to 128 bytes are copied unrolled.
      
      Increase the upper bound on small cases from 16 to 32 bytes so that
      copies of 17-32 bytes are not impacted by the larger medium case.
  19. Oct 17, 2019
    • Add -Werror=implicit-function-declaration · 433a3b1f
      Szabolcs Nagy authored
      Implicit function declaration is always a bug, but compilers don't
      turn it into an error by default for historical reasons, so add it
      to the default config.
    • fix the build of s_powf.o on non-aarch64 targets · 3d7ecfe3
      Szabolcs Nagy authored
      This is a simple fix to the v_powf code, but in general the vector
      code may not work on arbitrary targets even when compiled with
      scalar types (s_powf.c), so in the long term maybe all s_* should
      be disabled for non-aarch64 targets (this requires test system and
      header changes too).