[PATCH] Add support for zstd decompression
Denys Vlasenko
vda.linux at googlemail.com
Sat Sep 11 00:51:51 UTC 2021
On Wed, Aug 18, 2021 at 4:35 AM Jeff Pohlmeyer <yetanothergeek at gmail.com> wrote:
> [EDIT: The original message was rejected because it was too large; I
> am re-sending w/ bzip2'd attachment]
>
> This (rather enormous) patch adds the unzstd, zstdcat, and zstd -d
> applets; as well as zstd support for modprobe, etc. (Arch Linux is now
> using zstd format by default for its pacman packages, initramfs, and
> kernel modules.)
>
> When enabled, this adds around 28kb to the statically linked musl-x86
> 32-bit busybox. But the patch adds some ~14k lines of code and
> comments, mostly in the libarchive/zstd directory. The code was pulled
> unmodified from the current https://github.com/facebook/zstd.git
> repository, only the bare minimum of files needed for decompression
> are included.
I'm getting this:
function old new delta
ZSTD_decompressStream - 1914 +1914
XXH64 - 1664 +1664
static.ZSTD_decompressSequences_body - 1353 +1353
XXH64_digest - 1272 +1272
XXH64_update - 990 +990
ZSTD_decompressContinue - 969 +969
HUF_readDTableX1_wksp_bmi2 - 888 +888
ZSTD_decompressMultiFrame - 878 +878
ZSTD_decodeLiteralsBlock - 740 +740
HUF_decompress4X1_usingDTable_internal_body - 699 +699
FSE_buildDTable_internal - 617 +617
FSE_readNCount_body - 597 +597
static.ZSTD_buildFSETable_body - 534 +534
ZSTD_getFrameHeader_advanced - 528 +528
ML_defaultDTable - 520 +520
LL_defaultDTable - 520 +520
FSE_decompress_usingDTable_generic - 477 +477
ZSTD_loadDEntropy - 460 +460
ZSTD_DCtx_refDDict - 445 +445
ZSTD_decodeSeqHeaders - 401 +401
static.HUF_readStats_body - 349 +349
static.ZSTD_buildSeqTable - 336 +336
unpack_zstd_stream - 319 +319
static.FSE_decompress_wksp_body - 313 +313
ZSTD_findFrameSizeInfo - 264 +264
OF_defaultDTable - 264 +264
ZSTD_freeDCtx - 244 +244
ZSTD_execSequenceEnd - 243 +243
ZSTD_initDCtx_internal - 239 +239
ZSTD_decompressBegin - 222 +222
ML_bits - 212 +212
ML_base - 212 +212
ZSTD_decompressBegin_usingDict - 201 +201
ZSTD_safecopy - 181 +181
ZSTD_decodeFrameHeader - 180 +180
BIT_initDStream - 178 +178
ZSTD_decompressBlock_internal - 176 +176
ZSTD_decompressContinueStream - 173 +173
ZSTD_copyDDictParameters - 169 +169
ZSTD_DCtx_selectFrameDDict - 152 +152
ZSTD_DDictHashSet_emplaceDDict - 144 +144
LL_bits - 144 +144
LL_base - 144 +144
HUF_decodeStreamX1 - 138 +138
HUF_decompress4X_hufOnly_wksp_bmi2 - 136 +136
OF_bits - 128 +128
OF_base - 128 +128
BIT_mask - 128 +128
ZSTD_DCtx_reset - 127 +127
XXH64_reset - 125 +125
ZSTD_wildcopy - 110 +110
BIT_reloadDStream - 102 +102
ZSTD_overlapCopy8 - 99 +99
HUF_decompress1X1_DCtx_wksp_bmi2 - 99 +99
.rodata 104233 104329 +96
HUF_decompress1X1_usingDTable_internal_body - 93 +93
ZSTD_createDCtx_advanced - 90 +90
ZSTD_decodingBufferSize_min - 87 +87
ZSTD_freeDDict - 86 +86
unzstd_main - 82 +82
ZSTD_getcBlockSize - 82 +82
ZSTD_frameHeaderSize_internal - 82 +82
ZSTD_decompressBegin_usingDDict - 76 +76
ZSTD_decompress_usingDDict - 63 +63
ZSTD_checkContinuity - 62 +62
ZSTD_customCalloc - 57 +57
ZSTD_initDStream_usingDDict - 52 +52
ZSTD_clearDict - 47 +47
ZSTD_createDStream - 46 +46
FSE_decodeSymbolFast - 42 +42
FSE_decodeSymbol - 42 +42
BIT_reloadDStreamFast - 42 +42
HUF_decodeSymbolX1 - 40 +40
ZSTD_getDDict - 39 +39
ZSTD_initFseState - 38 +38
ZSTD_customFree - 36 +36
FSE_readNCount - 33 +33
static.dec64table - 32 +32
static.dec32table - 32 +32
readSkippableFrameSize - 31 +31
ZSTD_customMalloc - 31 +31
HUF_readDTableX1_wksp - 31 +31
BIT_readBits - 30 +30
ZSTD_nextInputType - 28 +28
ZSTD_getDictID_fromDict - 25 +25
ZSTD_findFrameCompressedSize - 25 +25
ZSTD_getDictID_fromDDict - 24 +24
BIT_readBitsFast - 24 +24
ZSTD_getFrameHeader - 23 +23
BIT_endOfDStream - 22 +22
applet_names 2751 2771 +20
setup_transformer_on_fd 150 167 +17
ZSTD_fcs_fieldSize - 16 +16
ZSTD_did_fieldSize - 16 +16
ZSTD_initDStream - 14 +14
repStartValue - 12 +12
applet_main 1592 1604 +12
ZSTD_defaultCMem - 12 +12
ZSTD_freeDStream - 5 +5
applet_install_loc 199 201 +2
applet_suid 100 101 +1
packed_usage 33950 33922 -28
static.CSWTCH 76 6 -70
------------------------------------------------------------------------------
(add/remove: 96/0 grow/shrink: 6/2 up/down: 24743/-98) Total: 24645 bytes
text data bss dec hex filename
1043078 559 5052 1048689 100071 busybox_old
1067799 559 5052 1073410 106102 busybox_unstripped
I suspect Facebook et al do not share busybox's zeal about smaller size.
Random example":
/* close range match, overlap */
static const U32 dec32table[] = { 0, 1, 2, 1, 4, 4, 4, 4 };
/* added */
static const int dec64table[] = { 8, 8, 8, 7, 8, 9,10,11 };
/* subtracted */
int const sub2 = dec64table[offset];
> Among the files is an xxhash.[ch] implementation which might be useful
> to other applets in the future, but I left it in the zstd directory
> for now.
In particular, XXH64, the second-largest added object, has no
config knob to select smaller, slower version:
XXH_PUBLIC_API unsigned long long XXH64 (const void* input, size_t
len, unsigned long long seed)
{
#if 0
/* Simple version, good for code maintenance, but unfortunately
slow for small inputs */
XXH64_CREATESTATE_STATIC(state);
XXH64_reset(state, seed);
XXH64_update(state, input, len);
return XXH64_digest(state);
#else
XXH_endianess endian_detected = (XXH_endianess)XXH_CPU_LITTLE_ENDIAN;
if (XXH_FORCE_ALIGN_CHECK) {
if ((((size_t)input) & 7)==0) { /* Input is aligned, let's
leverage the speed advantage */
if ((endian_detected==XXH_littleEndian) || XXH_FORCE_NATIVE_FORMAT)
return XXH64_endian_align(input, len, seed,
XXH_littleEndian, XXH_aligned);
else
return XXH64_endian_align(input, len, seed,
XXH_bigEndian, XXH_aligned);
} }
if ((endian_detected==XXH_littleEndian) || XXH_FORCE_NATIVE_FORMAT)
return XXH64_endian_align(input, len, seed, XXH_littleEndian,
XXH_unaligned);
else
return XXH64_endian_align(input, len, seed, XXH_bigEndian,
XXH_unaligned);
#endif
}
The "if 0" code block does not compile if enabled by hand.
Then there are gems like this:
#if defined(__GNUC__) && defined(__x86_64__)
/* Align the decompression loop to 32 + 16 bytes.
*
* zstd compiled with gcc-9 on an Intel i9-9900k shows 10% decompression
* speed swings based on the alignment of the decompression loop. This
* performance swing is caused by parts of the decompression
loop falling
* out of the DSB. The entire decompression loop should fit in the DSB,
* when it can't we get much worse performance. You can
measure if you've
* hit the good case or the bad case with this perf command for some
* compressed file test.zst:
*
* perf stat -e cycles -e instructions -e
idq.all_dsb_cycles_any_uops \
* -e idq.all_mite_cycles_any_uops -- ./zstd -tq test.zst
*
* If you see most cycles served out of the MITE you've hit
the bad case.
* If you see most cycles served out of the DSB you've hit the
good case.
* If it is pretty even then you may be in an okay case.
*
* This issue has been reproduced on the following CPUs:
* - Kabylake: Macbook Pro (15-inch, 2019) 2.4 GHz Intel Core i9
* Use Instruments->Counters to get DSB/MITE cycles.
* I never got performance swings, but I was able to
* go from the good case of mostly DSB to half of the
* cycles served from MITE.
* - Coffeelake: Intel i9-9900k
* - Coffeelake: Intel i7-9700k
*
* I haven't been able to reproduce the instability or DSB misses on any
* of the following CPUS:
* - Haswell
* - Broadwell: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GH
* - Skylake
*
* If you are seeing performance stability this script can help test.
* It tests on 4 commits in zstd where I saw performance change.
*
* https://gist.github.com/terrelln/9889fc06a423fd5ca6e99351564473f4
*/
__asm__(".p2align 6");
__asm__("nop");
__asm__(".p2align 5");
__asm__("nop");
# if __GNUC__ >= 9 && __GNUC__ < 11
/* better for gcc-9 and gcc-10, worse for clang and gcc-8, gcc-11 */
__asm__(".p2align 3");
# else
__asm__(".p2align 4");
# endif
#endif
"Let's add a tweak for a combination of uarch and compiler which
will invariably become irrelevant, as it always does, in ~3 years"?
I don't know what to do.
If we take this code in and start tweaking it for small size,
the backporting of future fixes will be a nightmare.
Should we submit patches to zstd to make upstream configurable
to generate smaller code? ... maybe. Anyone knows whether working
with zstd people is doable, or it is more of a "Pottering"
experience?
With b[un]zip2, we just waited until the upstream code has
fossilized enough that backporting pains stopped being a concern.
More information about the busybox
mailing list