  1. Jan 03, 2020
    • Add the Assignment Agreement v1.1 document · dbb919dc
      Szabolcs Nagy authored
      This Assignment Agreement has to be filled in, signed and sent to
      optimized-routines-assignment@arm.com by Contributors before their
      contributions can be accepted into optimized-routines.
      dbb919dc
  2. Jan 02, 2020
  3. Dec 10, 2019
    • aarch64: Combine memcpy and memmove implementations · 3377796f
      Krzysztof Koch authored
      Modify integer and SIMD versions of memcpy to handle overlaps correctly.
      
      Make __memmove_aarch64 and __memmove_aarch64_simd alias to
      __memcpy_aarch64 and __memcpy_aarch64_simd respectively.
      
      Complete sharing of code between memcpy and memmove implementations is
      possible without noticeable performance penalty. This is thanks to
      moving the source and destination buffer overlap detection after
      the code for handling small and medium copies which are overlap-safe
      anyway.
      
      Benchmarking shows that keeping two versions of memcpy is necessary
      because newer platforms favor aligning src over destination for large
      copies. Using NEON registers also gives a small speedup. However,
      aligning dst and using general-purpose registers works best for older
      platforms. Consequently, memcpy.S and memcpy_simd.S contain memcpy
      code which is identical except for the registers used and src vs dst
      alignment.
      3377796f
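The idea of deferring overlap detection until after the small and medium cases can be sketched in C. This is only an illustration under stated assumptions, not the assembly from memcpy.S: `copy_sketch` and `MEDIUM_MAX` are hypothetical names, and the 128-byte bound is borrowed from the "up to 128 bytes are copied unrolled" description in the log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bound for "small and medium" copies (assumption). */
#define MEDIUM_MAX 128

/* Copy n bytes from src to dst; the buffers may overlap. */
void *copy_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);

    if (n <= MEDIUM_MAX) {
        /* Small and medium copies: read everything before writing
           anything, so overlap cannot corrupt the data and no overlap
           check is needed on this path. */
        unsigned char tmp[MEDIUM_MAX];
        memcpy(tmp, s, n);
        memcpy(d, tmp, n);
        return dst;
    }

    /* Large copies: only now pay for the overlap check. Unsigned
       wrap-around makes the test true only when dst lies inside
       [src, src + n). */
    if ((uintptr_t)d - (uintptr_t)s < n) {
        /* A forward copy would overwrite unread source bytes,
           so copy backwards. */
        while (n--)
            d[n] = s[n];
    } else {
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    }
    return dst;
}
```

With this shape a single routine can serve as both memcpy and memmove, because the only path that cares about overlap is the large-copy loop.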
  4. Nov 27, 2019
  5. Nov 26, 2019
    • aarch64: Add SIMD version of memcpy · 6d3ae5fc
      Krzysztof Koch authored
      Create a new memcpy implementation for targets with the NEON extension.
      
      __memcpy_aarch64_simd has been tested on a range of modern
      microarchitectures. It turned out to be faster than __memcpy_aarch64 on
      all of them, with a performance improvement of 3-11% depending on the
      platform.
      6d3ae5fc
    • aarch64: Use common header file in memcpy.S · 015c9519
      Krzysztof Koch authored
      Include asmdefs.h in memcpy.S to avoid duplicate macro definitions.
      
      Add macro for defining labels in asmdefs.h.
      
      Change the default routine entry point alignment to 64 bytes.
      
      Define a new macro which allows controlling the entry point alignment.
      
      Add include guard to asmdefs.h.
      015c9519
    • Makefile tweak for better subproject handling · dec9ffea
      Szabolcs Nagy authored
      Don't include the makefile fragments of subprojects that aren't built.
      
      With this the build fails more reasonably when SUBS is set incorrectly.
      dec9ffea
  6. Nov 22, 2019
    • Build system refactoring · 1fd2aaae
      Szabolcs Nagy authored
      Reorganise the makefiles so subprojects can be more separately used and
      maintained.  Still kept the single toplevel Makefile and config.mk.
      
      Subproject Dir.mk is expected to provide all-X, check-X, clean-X and
      install-X targets where X is the subproject name and it may use generic
      make variables set in config.mk, like CFLAGS_ALL and CC, or subproject
      specific variables like X-cflags.
      1fd2aaae
  7. Nov 19, 2019
  8. Nov 06, 2019
  9. Nov 05, 2019
    • Add vector exp2f · 69170e15
      Szabolcs Nagy authored
      Same design as in expf. Worst-case error of __v_exp2f and __v_exp2f_1u
      is 1.96 and 0.88 ulp respectively.
      
      It is not clear whether round/convert instructions or the +- Shift
      trick is better. For expf the latter, for exp2f the former seems more
      consistently faster, but both options are kept in the code for now.
      69170e15
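The "+- Shift" option mentioned above can be illustrated in scalar C. This is a minimal sketch under assumptions, not the code of __v_exp2f: the function names are hypothetical, the default round-to-nearest mode is assumed, and the trick is only valid for |x| < 2^22. The alternative would use a rounding function such as C99 `nearbyintf`, which a compiler may map to a round instruction.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Round to the nearest integer by adding and subtracting a large
   constant. Adding 0x1.8p23f pushes the fraction bits out of the float
   significand; subtracting it back leaves the rounded value. */
float round_shift(float x)
{
    const float shift = 0x1.8p23f;
    volatile float t = x + shift; /* volatile: stop the compiler folding it */
    return t - shift;
}

/* exp2f-style argument split (hypothetical helper): x = n + r with
   integer n and |r| <= 0.5, so that 2^x = 2^n * 2^r. */
float split_exp2f(float x, int *n)
{
    float kd = round_shift(x);
    *n = (int)kd;
    return x - kd;
}

/* Build 2^n directly from the exponent bits (normal range only). */
float pow2i(int n)
{
    uint32_t bits;
    float f;

    assert(n >= -126 && n <= 127);
    bits = (uint32_t)(n + 127) << 23;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

A polynomial approximation of 2^r on [-0.5, 0.5] would then be scaled by `pow2i(n)`; the polynomial itself is omitted because its coefficients are not given in the log.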
    • math: fix runulp.sh · 65464ec6
      Szabolcs Nagy authored
      Use heredoc instead of pipe when iterating over test cases to avoid
      creating a subshell that would break the PASS/FAIL accounting.
      65464ec6
    • aarch64: Increase small and medium cases for memcpy · fec28a72
      Krzysztof Koch authored
      Increase the upper bound on medium cases from 96 to 128 bytes.
      Now, up to 128 bytes are copied unrolled.
      
      Increase the upper bound on small cases from 16 to 32 bytes so that
      copies of 17-32 bytes are not impacted by the larger medium case.
      fec28a72
  10. Oct 17, 2019
    • Add -Werror=implicit-function-declaration · 433a3b1f
      Szabolcs Nagy authored
      Implicit function declaration is always a bug, but compilers don't
      turn it into an error by default for historical reasons, so add it
      to the default config.
      433a3b1f
    • fix the build of s_powf.o on non-aarch64 targets · 3d7ecfe3
      Szabolcs Nagy authored
      This is a simple fix to the v_powf code, but in general the vector
      code may not work on arbitrary targets even when compiled with
      scalar types (s_powf.c), so in the long term maybe all s_* should
      be disabled for non-aarch64 targets (requires test system and header
      changes too).
      3d7ecfe3
  11. Oct 14, 2019
    • Add vector log · d984098b
      Szabolcs Nagy authored
      Worst-case error is 1.67 ulp, the polynomial was generated by sollya.
      Uses a 128 entry (2KB) lookup table. Special cases fall back to scalar
      log call.
      d984098b
    • Add vector sin and cos · a2f717ef
      Szabolcs Nagy authored
      Worst-case error is 3.5 ulp, the polynomial was generated by sollya.
      For large (>2^23) and special inputs the code falls back to scalar
      sin and cos.
      a2f717ef
    • Add vector powf · ba75d0a0
      Szabolcs Nagy authored
      Essentially the scalar powf algorithm is used for each element in the
      vector, just inlined for better scheduling and simpler special case
      handling. The log polynomial is smaller as less accuracy is enough.
      
      Worst-case error is 2.6 ulp.
      ba75d0a0
    • Add vector sinf and cosf · c5cba852
      Szabolcs Nagy authored
      The polynomials were produced by searching the coefficient space using
      heuristics and ideas from https://arxiv.org/abs/1508.03211
      
      The worst-case error is 1.886 ulp, large inputs (> 2^20) and other
      special cases use scalar sinf and cosf.
      c5cba852
    • Add vector logf · c280e49d
      Szabolcs Nagy authored
      The polynomial was produced by searching the coefficient space using
      heuristics and ideas from https://arxiv.org/abs/1508.03211
      
      The worst-case error is 3.34 ulp, subnormal range inputs and other
      special cases use scalar logf.
      c280e49d
    • Add vector exp, expf and related vector math support code · 7a1f4cfd
      Szabolcs Nagy authored
      Vector math routines are added to the same libmathlib library as scalar
      ones. The difficulty is that they are not always available: the external
      ABI depends on the compiler version used for the build. Currently only
      aarch64 AdvSIMD is supported; there are 4 new sets of symbols:
      
        __s_foo is a scalar function with identical result to the vector one,
        __v_foo is a vector function using the base PCS,
        __vn_foo uses the vector PCS and
        _ZGV*_foo is the vector ABI symbol alias of __vn_foo
      
      for a scalar math function foo.
      
      The test and benchmark code got extended to handle vector functions.
      
      Vector functions aim for < 5 ulp worst case error, only support nearest
      rounding mode and don't support floating-point exceptions. Vector
      functions may call scalar functions to handle special cases, but for a
      single value they should return the same result independently of values
      in other vector lanes or the position of the value in the vector.
      
      The __v_expf and __v_expf_1u polynomials were produced by searching the
      coefficient space with some heuristics and ideas from
      https://arxiv.org/abs/1508.03211
      Their worst case error is 1.95 and 0.866 ulp respectively.
      
      The exp polynomial was produced by sollya, it uses a 128 element (1KB)
      lookup table and has 2.38 ulp worst case error.
      7a1f4cfd
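The requirement that a value gives the same result regardless of its lane or its neighbours, with special cases handed to a scalar routine, can be sketched with plain arrays standing in for vector registers. Everything here is hypothetical: `toy_fast`, `toy_scalar` and `toy_vector` are stand-ins invented for illustration, and the polynomial is a toy, not one of the library's approximations.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 4

static uint32_t asuint(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

/* Toy fast path, only meant for |x| <= 1. */
static float toy_fast(float x)
{
    return x - x * x * x * (1.0f / 6.0f);
}

/* Toy scalar fallback. It handles the special inputs and otherwise
   agrees with the fast path. */
static float toy_scalar(float x)
{
    if (x != x)
        return x;                 /* NaN propagates */
    if (x > 1.0f)
        return 1.0f;              /* saturate large inputs */
    if (x < -1.0f)
        return -1.0f;
    return toy_fast(x);
}

void toy_vector(float out[LANES], const float in[LANES])
{
    unsigned special = 0;

    for (int i = 0; i < LANES; i++) {
        /* Run the fast path on every lane unconditionally. */
        out[i] = toy_fast(in[i]);
        /* |x| > 1, infinity or NaN: one unsigned compare on the bit
           pattern with the sign bit cleared. */
        if ((asuint(in[i]) & 0x7fffffffu) > asuint(1.0f))
            special |= 1u << i;
    }

    /* Fix up only the flagged lanes, so other lanes are unaffected. */
    for (int i = 0; i < LANES; i++)
        if (special & (1u << i))
            out[i] = toy_scalar(in[i]);
}
```

Because each lane is computed only from its own input, moving a value to another position or changing its neighbours does not change its result.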
    • math: more robust mathbench_libc · a88f3f60
      Szabolcs Nagy authored
      Not all symbols referenced by mathbench may be available in libc so
      link to libmathlib too to resolve the missing symbols.
      a88f3f60
    • math: update the plot script · 0a51e645
      Szabolcs Nagy authored
      Fix it to be python3 compatible and plot the exact and approximated
      values too.
      0a51e645
    • Prevent fenv access breaking optimizations of the ulp tool · 60463383
      Szabolcs Nagy authored
      The ulp tool compares output of a math function to a larger precision
      implementation of the same function.
      
      But when the input argument is converted to a larger precision number
      the signaling nan property is lost, so ensure that the conversion
      happens inside the critical region where fenv exceptions are checked
      and then the conversion itself will raise the invalid exception, which
      is the correct behaviour in most cases.
      
      The volatile barrier is not perfect and the snan behaviour is not
      always signaling, but this should give more reliable results in most
      cases than before.
      60463383
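Comparing a result against a larger precision reference, as described above, can be sketched as follows. This is a simplified illustration and not the ulp tool's code: `ulp_of` and `ulp_error` are hypothetical names, the reference is assumed to be a double, and the fenv exception checks and signaling NaN handling that the commit is about are left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Spacing between |y| and the next float away from zero, i.e. one ulp
   at y. Requires y to be finite and below FLT_MAX in magnitude. */
static double ulp_of(float y)
{
    uint32_t bits;
    float lo, hi;

    memcpy(&bits, &y, sizeof bits);
    bits &= 0x7fffffffu;              /* drop the sign bit */
    assert(bits < 0x7f7fffffu);       /* finite and below FLT_MAX */
    memcpy(&lo, &bits, sizeof lo);
    bits += 1;                        /* next representable float */
    memcpy(&hi, &bits, sizeof hi);
    return (double)hi - (double)lo;
}

/* Error of a float result, measured in ulps at the scale of the
   higher precision reference value. */
double ulp_error(float got, double want)
{
    double diff = (double)got - want;

    if (diff < 0)
        diff = -diff;
    return diff / ulp_of((float)want);
}
```

For example, a result that is exactly one float step above a reference of 1.0 gives an error of 1 ulp, and a reference halfway between two floats gives 0.5 ulp for either neighbour.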
  12. Oct 08, 2019
  13. Aug 29, 2019