Results 1 to 6 of 6

Thread: Turbo Base64 : Fastest Base64 SIMD/SSE/AVX/AVX2/Neon/Altivec

  1. #1
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 199 Times in 147 Posts

    Exclamation Turbo Base64 : Fastest Base64 SIMD/SSE/AVX/AVX2/Neon/Altivec

    Turbo Base64 SIMD

    - 100% C (C++ headers), as simple as memcpy.
    - No other base64 library encode or decode faster
    - Scalar can be faster than other SSE or ARM Neon based base64 libraries
    - Turbo Base64 SSE faster than other SSE/AVX/AVX2! base64 library
    - Fastest AVX2 implementation, damn near to memcpy
    - TurboBase64 AVX2 decoding is ~2x faster than other AVX2 libs.
    - Fastest ARM Neon base64
    - Dynamic CPU detection and JIT scalar/sse/avx/avx2 switching
    - Base64 robust error checking
    - Portable library, 32/64 bits, SSE/AVX/AVX2, ARM Neon, Power9 Altivec
    - OS:Linux amd64, arm64, Power9, MacOs, s390x. Windows: Mingw, visual c++
    - Big+Little endian
    - Ready and simple to use library, no armada of files, no hassles dependencies
    Last edited by dnd; 25th December 2019 at 17:57.

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,304 Times in 740 Posts
    I wonder if its possible to do it with bitslice like that - http://www.darkside.com.au/bitslice/
    I mean, for encoding: take 3xN input bytes, compute 32xN output bits with vector logic, write.

  3. #3
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 199 Times in 147 Posts
    In base64 encoding the output is clearly defined (ASCII) and reversible.
    I don't think, it will be possible to get the same output with bitslices and a reasonable instruction count.
    The fast scalar branchless encode function is using only 2 'L1-lookups', 3 'shifts' and 1 'and' for a 3 bytes to 4 conversion.
    The simd functions are converting the 6 bits to a 4 bits index and then use the shuffle instruction.
    The rest is bitpacking.
    Decoding is using an equivalent reverse function.
    Lookups are very costly with bitslicing.

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,304 Times in 740 Posts
    I mean something like this: http://nishi.dreamhosters.com/u/b64_bitslice.txt
    Its a way to work with vectors of 128/256/512 symbols in parallel.
    (I only provided boolean functions for 2 bits; normally it would be better to optimize a function for all output bits at once).
    There're no lookups, that's the point.

  5. #5
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 199 Times in 147 Posts
    Two basic steps in base64 encoding are necessary for each 6 bits (see Base64 encoding with SIMD instructions )
    - unpack 6 binary bits to 8 (3N bytes to 4N bytes)
    this can done with 'and', 'or' and 'shift'
    - map (lookup) each byte to a base64 ASCII char.
    without lookup/shuffle it is possible to map the 6 bits ranges to
    'A-Z','a-z','0-9','+','/' using branchless compare and
    some arithmetic and logical ops.

    In decoding, you must additionally check for bad input.

    You have more faster options using the full set of SIMD instructions.
    Unpacking 12/24 bytes to 16/32 can be done using 'horizontal bitpacking' using shuffle and two multiplications.
    A single simd shuffle can replace a serie of compare+arithmetic+logical ops. in the mapping op.

    In bitslicing you must replace all these instructions with elementary gates ops.
    Last edited by dnd; 17th December 2019 at 21:59.

  6. #6
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 199 Times in 147 Posts
    The Blazing Fast Clickhouse DBMS from Yandex,
    has now replaced the most popular SIMD aklomp/base64with Turbobase64.

    NEW:
    Turbo-Base64 SIMD now 3-4 times faster than the next fastest SIMD base64 for short strings.

Similar Threads

  1. Replies: 0
    Last Post: 27th October 2019, 09:40
  2. AVX-512 and interleaved rANS
    By JamesB in forum Data Compression
    Replies: 41
    Last Post: 6th November 2016, 14:26
  3. Is there AMD CPU that supports AVX2?
    By lz77 in forum The Off-Topic Lounge
    Replies: 3
    Last Post: 3rd June 2016, 11:17
  4. Compression of partial BASE64 encoded files?
    By Sebastian W in forum Data Compression
    Replies: 6
    Last Post: 29th April 2016, 10:27
  5. Good free SIMD library for x86 SSE & ARM NEON?
    By Paul W. in forum The Off-Topic Lounge
    Replies: 2
    Last Post: 17th May 2014, 04:31

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •