[PATCH] Add support for zstd decompression

Sat Sep 11 00:51:51 UTC 2021

On Wed, Aug 18, 2021 at 4:35 AM Jeff Pohlmeyer <yetanothergeek at gmail.com> wrote:
> [EDIT: The original message was rejected because it was too large; I
> am re-sending w/ bzip2'd attachment]
>
> This (rather enormous) patch adds the unzstd, zstdcat, and zstd -d
> applets; as well as zstd support for modprobe, etc. (Arch Linux is now
> using zstd format by default for its pacman packages, initramfs, and
> kernel modules.)
>
> When enabled, this adds around 28kb to the statically linked musl-x86
> 32-bit busybox. But the patch adds some ~14k lines of code and
> comments, mostly in the libarchive/zstd directory. The code was pulled
> unmodified from the current https://github.com/facebook/zstd.git
> repository, only the bare minimum of files needed for decompression
> are included.

I'm getting this:

function                                             old     new   delta
ZSTD_decompressStream                                  -    1914   +1914
XXH64                                                  -    1664   +1664
static.ZSTD_decompressSequences_body                   -    1353   +1353
XXH64_digest                                           -    1272   +1272
XXH64_update                                           -     990    +990
ZSTD_decompressContinue                                -     969    +969
HUF_readDTableX1_wksp_bmi2                             -     888    +888
ZSTD_decompressMultiFrame                              -     878    +878
ZSTD_decodeLiteralsBlock                               -     740    +740
HUF_decompress4X1_usingDTable_internal_body            -     699    +699
FSE_buildDTable_internal                               -     617    +617
FSE_readNCount_body                                    -     597    +597
static.ZSTD_buildFSETable_body                         -     534    +534
ZSTD_getFrameHeader_advanced                           -     528    +528
ML_defaultDTable                                       -     520    +520
LL_defaultDTable                                       -     520    +520
FSE_decompress_usingDTable_generic                     -     477    +477
ZSTD_loadDEntropy                                      -     460    +460
ZSTD_DCtx_refDDict                                     -     445    +445
ZSTD_decodeSeqHeaders                                  -     401    +401
static.HUF_readStats_body                              -     349    +349
static.ZSTD_buildSeqTable                              -     336    +336
unpack_zstd_stream                                     -     319    +319
static.FSE_decompress_wksp_body                        -     313    +313
ZSTD_findFrameSizeInfo                                 -     264    +264
OF_defaultDTable                                       -     264    +264
ZSTD_freeDCtx                                          -     244    +244
ZSTD_execSequenceEnd                                   -     243    +243
ZSTD_initDCtx_internal                                 -     239    +239
ZSTD_decompressBegin                                   -     222    +222
ML_bits                                                -     212    +212
ML_base                                                -     212    +212
ZSTD_decompressBegin_usingDict                         -     201    +201
ZSTD_safecopy                                          -     181    +181
ZSTD_decodeFrameHeader                                 -     180    +180
BIT_initDStream                                        -     178    +178
ZSTD_decompressBlock_internal                          -     176    +176
ZSTD_decompressContinueStream                          -     173    +173
ZSTD_copyDDictParameters                               -     169    +169
ZSTD_DCtx_selectFrameDDict                             -     152    +152
ZSTD_DDictHashSet_emplaceDDict                         -     144    +144
LL_bits                                                -     144    +144
LL_base                                                -     144    +144
HUF_decodeStreamX1                                     -     138    +138
HUF_decompress4X_hufOnly_wksp_bmi2                     -     136    +136
OF_bits                                                -     128    +128
OF_base                                                -     128    +128
BIT_mask                                               -     128    +128
ZSTD_DCtx_reset                                        -     127    +127
XXH64_reset                                            -     125    +125
ZSTD_wildcopy                                          -     110    +110
BIT_reloadDStream                                      -     102    +102
ZSTD_overlapCopy8                                      -      99     +99
HUF_decompress1X1_DCtx_wksp_bmi2                       -      99     +99
.rodata                                           104233  104329     +96
HUF_decompress1X1_usingDTable_internal_body            -      93     +93
ZSTD_createDCtx_advanced                               -      90     +90
ZSTD_decodingBufferSize_min                            -      87     +87
ZSTD_freeDDict                                         -      86     +86
unzstd_main                                            -      82     +82
ZSTD_getcBlockSize                                     -      82     +82
ZSTD_frameHeaderSize_internal                          -      82     +82
ZSTD_decompressBegin_usingDDict                        -      76     +76
ZSTD_decompress_usingDDict                             -      63     +63
ZSTD_checkContinuity                                   -      62     +62
ZSTD_customCalloc                                      -      57     +57
ZSTD_initDStream_usingDDict                            -      52     +52
ZSTD_clearDict                                         -      47     +47
ZSTD_createDStream                                     -      46     +46
FSE_decodeSymbolFast                                   -      42     +42
FSE_decodeSymbol                                       -      42     +42
BIT_reloadDStreamFast                                  -      42     +42
HUF_decodeSymbolX1                                     -      40     +40
ZSTD_getDDict                                          -      39     +39
ZSTD_initFseState                                      -      38     +38
ZSTD_customFree                                        -      36     +36
FSE_readNCount                                         -      33     +33
static.dec64table                                      -      32     +32
static.dec32table                                      -      32     +32
readSkippableFrameSize                                 -      31     +31
ZSTD_customMalloc                                      -      31     +31
HUF_readDTableX1_wksp                                  -      31     +31
BIT_readBits                                           -      30     +30
ZSTD_nextInputType                                     -      28     +28
ZSTD_getDictID_fromDict                                -      25     +25
ZSTD_findFrameCompressedSize                           -      25     +25
ZSTD_getDictID_fromDDict                               -      24     +24
BIT_readBitsFast                                       -      24     +24
ZSTD_getFrameHeader                                    -      23     +23
BIT_endOfDStream                                       -      22     +22
applet_names                                        2751    2771     +20
setup_transformer_on_fd                              150     167     +17
ZSTD_fcs_fieldSize                                     -      16     +16
ZSTD_did_fieldSize                                     -      16     +16
ZSTD_initDStream                                       -      14     +14
repStartValue                                          -      12     +12
applet_main                                         1592    1604     +12
ZSTD_defaultCMem                                       -      12     +12
ZSTD_freeDStream                                       -       5      +5
applet_install_loc                                   199     201      +2
applet_suid                                          100     101      +1
packed_usage                                       33950   33922     -28
static.CSWTCH                                         76       6     -70
------------------------------------------------------------------------------
(add/remove: 96/0 grow/shrink: 6/2 up/down: 24743/-98)      Total: 24645 bytes
   text       data        bss        dec        hex    filename
1043078        559       5052    1048689     100071    busybox_old
1067799        559       5052    1073410     106102    busybox_unstripped

I suspect Facebook et al do not share busybox's zeal about smaller size.
Random example":

        /* close range match, overlap */
        static const U32 dec32table[] = { 0, 1, 2, 1, 4, 4, 4, 4 };
/* added */
        static const int dec64table[] = { 8, 8, 8, 7, 8, 9,10,11 };
/* subtracted */
        int const sub2 = dec64table[offset];

> Among the files is an xxhash.[ch] implementation which might be useful
> to other applets in the future, but I left it in the zstd directory
> for now.

In particular, XXH64, the second-largest added object, has no
config knob to select smaller, slower version:

XXH_PUBLIC_API unsigned long long XXH64 (const void* input, size_t
len, unsigned long long seed)
{
#if 0
    /* Simple version, good for code maintenance, but unfortunately
slow for small inputs */
    XXH64_CREATESTATE_STATIC(state);
    XXH64_reset(state, seed);
    XXH64_update(state, input, len);
    return XXH64_digest(state);
#else
    XXH_endianess endian_detected = (XXH_endianess)XXH_CPU_LITTLE_ENDIAN;

    if (XXH_FORCE_ALIGN_CHECK) {
        if ((((size_t)input) & 7)==0) {  /* Input is aligned, let's
leverage the speed advantage */
            if ((endian_detected==XXH_littleEndian) || XXH_FORCE_NATIVE_FORMAT)
                return XXH64_endian_align(input, len, seed,
XXH_littleEndian, XXH_aligned);
            else
                return XXH64_endian_align(input, len, seed,
XXH_bigEndian, XXH_aligned);
    }   }

    if ((endian_detected==XXH_littleEndian) || XXH_FORCE_NATIVE_FORMAT)
        return XXH64_endian_align(input, len, seed, XXH_littleEndian,
XXH_unaligned);
    else
        return XXH64_endian_align(input, len, seed, XXH_bigEndian,
XXH_unaligned);
#endif
}

The "if 0" code block does not compile if enabled by hand.

Then there are gems like this:

#if defined(__GNUC__) && defined(__x86_64__)
        /* Align the decompression loop to 32 + 16 bytes.
         *
         * zstd compiled with gcc-9 on an Intel i9-9900k shows 10% decompression
         * speed swings based on the alignment of the decompression loop. This
         * performance swing is caused by parts of the decompression
loop falling
         * out of the DSB. The entire decompression loop should fit in the DSB,
         * when it can't we get much worse performance. You can
measure if you've
         * hit the good case or the bad case with this perf command for some
         * compressed file test.zst:
         *
         *   perf stat -e cycles -e instructions -e
idq.all_dsb_cycles_any_uops \
         *             -e idq.all_mite_cycles_any_uops -- ./zstd -tq test.zst
         *
         * If you see most cycles served out of the MITE you've hit
the bad case.
         * If you see most cycles served out of the DSB you've hit the
good case.
         * If it is pretty even then you may be in an okay case.
         *
         * This issue has been reproduced on the following CPUs:
         *   - Kabylake: Macbook Pro (15-inch, 2019) 2.4 GHz Intel Core i9
         *               Use Instruments->Counters to get DSB/MITE cycles.
         *               I never got performance swings, but I was able to
         *               go from the good case of mostly DSB to half of the
         *               cycles served from MITE.
         *   - Coffeelake: Intel i9-9900k
         *   - Coffeelake: Intel i7-9700k
         *
         * I haven't been able to reproduce the instability or DSB misses on any
         * of the following CPUS:
         *   - Haswell
         *   - Broadwell: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GH
         *   - Skylake
         *
         * If you are seeing performance stability this script can help test.
         * It tests on 4 commits in zstd where I saw performance change.
         *
         *   https://gist.github.com/terrelln/9889fc06a423fd5ca6e99351564473f4
         */
        __asm__(".p2align 6");
        __asm__("nop");
        __asm__(".p2align 5");
        __asm__("nop");
#  if __GNUC__ >= 9 && __GNUC__ < 11
        /* better for gcc-9 and gcc-10, worse for clang and gcc-8, gcc-11 */
        __asm__(".p2align 3");
#  else
        __asm__(".p2align 4");
#  endif
#endif

"Let's add a tweak for a combination of uarch and compiler which
will invariably become irrelevant, as it always does, in ~3 years"?

I don't know what to do.

If we take this code in and start tweaking it for small size,
the backporting of future fixes will be a nightmare.

Should we submit patches to zstd to make upstream configurable
to generate smaller code? ... maybe. Anyone knows whether working
with zstd people is doable, or it is more of a "Pottering"
experience?

With b[un]zip2, we just waited until the upstream code has
fossilized enough that backporting pains stopped being a concern.