Decided to put some of my RAM-saving experiments on website

Sun Apr 24 12:23:32 UTC 2016

https://busybox.net/use_less_ram.html

How to build a Busybox binary with reduced memory usage

Busybox is designed to be frugal with memory usage. However, it is
still written in C. Compiler and linker are usually not written with a
focus to strongly minimize memory usage.
Overview of RAM usage by Linux binaries

Executables have the following parts: read-only executable code and
constants, also known as "text", read-write initialized data, and
read-write non-initialized (zeroed on demand) data, also known as
"bss".

At runtime, all text pages are mapped RO and executable. A RO mapping,
like any other, can only cover whole pages, not parts of them. What
happens to the last page of text? The entire page is mapped RO,
including the part beyond the last text byte.

Data pages are mapped RW and they are file-backed (this includes a
small portion of bss which may live in the last partial page of data).
Pages which are fully in bss are mapped to anonymous memory: when a
page fault requires kernel to allocate a real page, a free page is
found and zero-filled. No reads from storage are needed.

File-backed RO pages are shared among all copies of running process.
RW pages, both data and bss, exist as separate copies for each
process. (They may be shared as long as they are not modified.
However, well-designed programs don't put constant data into RW
sections, therefore in practice almost all RW data or bss pages which
were touched by the program are modified, and therefore not shared).

When process starts executing, two more mappings appear. One is the
stack. It is usually located near the top of virtual address space,
and contains program command line, environment and ELF auxiliary
vector. This is an anonymouns mapping.

Another anonymouns mapping is created by malloc subsystem in libc.
This mapping is initially empty, but it is a growable and shrinkable
via the brk(2) system call. Normally it starts right after the last
page of bss, but kernels configured with address space randomization
can place it elsewhere. Moderately sized malloc requests are served
from this area. pmap shows it as "[heap]" mapping.

You can see these mappings with the following command, where busybox
shows its own memory map:

$ sh -c 'exec busybox pmap $$'
1628: busybox pmap 1628
08048000     900K r-xp  /bin/busybox  - text
08129000       4K rw-p  /bin/busybox  - data
0812a000      12K rw-p    [ anon ]    - bss
08854000       8K rw-p  [heap]        - malloc space
f76fe000       8K r--p  [vvar]
f7700000       8K r-xp  [vdso]
ff82a000     132K rw-p  [stack]       - stack

It is important to minimize the number of RW pages your program
touches. Data, bss and stack pages are never freed, therefore for
large, and especially for temporary allocations, it's best to use
malloc() or mmap(). One way to use fewer RW pages is to have fewer RW
pages in your binary. There are several scenarios when a binary may
end up having more data+bss pages than necessary.

It is possible to have a Busybox binary with almost all applets and
only 8 kilobytes of data+bss.
Using libc which is better at using RAM sparingly

[TODO:] Static builds are better than dynamic ones. Glibc is horrible
for static builds, though. Ideally, libs should not initialize
"[heap]" before user program calls malloc(). Libc should aggressively
prune (return to OS) freed malloc space (glibc defaults are bad). Even
though we don't need that in Busybox, there should be a method to
prune used (dirtied) stack space after the use of deep recursion or
large on-stack objects.

Optimizing start of data

A simple solution for RO/RW mappings for text and data would be to pad
text to a full page. However, on some architectures pages are quite
large and this would cause a significant growth of on-disk size of the
binary. GNU linker employs a hack to avoid that: sometimes it starts
data immediately after text, without padding. The last, "mixed" page,
gets mapped twice: once as a RO page, and second time as a RW page
(usually at the position of the very next page in the virtual address
space).

This reduces image size on-disk, yes, but it wastes RAM: now there is
an unused part of first data page which takes up RAM. This may push
the opposite, tail end of the data/bss sections far enough that they
now need one more page.

This can be fixed if we make GNU linker to use our own linker script
instead of a built-in one. We will do this by modifying its default
script. Every busybox build generates a "busybox_unstripped.out" file.
Among other infromation, it contains a part which says "using internal
linker script:", and the script follows.

Cut out the script and put it into a file named "busybox_ldscript".
Now the build will use this script, you should see a message "Custom
linker script 'busybox_ldscript' found, using it" next time.

To unconditionally align data to the next page boundary, find a part
which looks similar to:

/* Adjust the address for the data segment.  We want to adjust up to
   the same address within the page on the next page up.  */
. = ALIGN (0x1000) - ((0x1000 - .) & (0x1000 - 1)); . =
DATA_SEGMENT_ALIGN (0x1000, 0x1000);

and replace it with:

. = ALIGN (0x1000); . = DATA_SEGMENT_ALIGN (0x1000, 0x1000);

Reducing padding between data

Busybox is linked with "--sort-section alignment". This helps, but
unfortunately ld is not doing good enough job. It's possible to help
it.

Output sections in the executable are specified in linker script as follows:

.rodata: { *(.rodata .rodata.*) }

If in such rules bare section wildcards are replaced with
"SORT_BY_ALIGNMENT(wildcard)", the result is often more compact. In
practice, vast majority of sections which benefit from alignment
sorting are .rodata, .data and .bss ones, thus it's enough to only
changle three locations in the linker script:

   *(SORT_BY_ALIGNMENT(.rodata) SORT_BY_ALIGNMENT(.rodata.*) {the rest})
 ...
   *(SORT_BY_ALIGNMENT(.data) SORT_BY_ALIGNMENT(.data.*) {the rest})
 ...
   *(SORT_BY_ALIGNMENT(.bss) SORT_BY_ALIGNMENT(.bss.*) {the rest})

Unfortunately, a more optimal "SORT_BY_ALIGNMENT(.rodata .rodata.*)",
which would sort .rodata and .rodata.foo sections by alignment as one
group, not separately, is not a valid syntax.

GCC, even with -Os, compiles byte and 16-bit arrays into data object
definitions which require word alignment (at least 4 bytes). Some
versions even require 32-byte alignment for arrays of more than 32
bytes long. This includes explicitly declared string arrays.

As a result, these arrays often have padding added at the end when
they are packed by linker into the binary. As of 2016, linker is not
clever enough yet to use that padding for tiny data objects (e.g. bool
variables).

The alignment can be relaxed by explicit ALIGN1 or ALIGN2 attributes
on such arrays:

static const char extn[][5] ALIGN1 = { ".zip", ".ZIP" };

The above prevents 2 bytes of padding in the binary.

Busybox code is evolving, new arrays constantly pop up, coders forget
to add ALIGNn on them. You can run "grep -F -B3 '*fill*'
busybox_unstripped.map" to find all linker-added padding in your
binary, and add forgotten ALIGNn's. Please send a patch if you do.
Converting bss to data

Busybox's data and bss sections are small already, some 4-12
kilobytes. But they still require two mappings (VMAs in kernel-speak)
because they have different attributes: data mapping is file-backed,
while bss is anonymous. We can save a bit of memory on the kernel
side, for every process, by not requiring two VMAs. This can be
achieved by changing linker script to place all formerly-bss variables
to data:

  .data:
  {
    *(.data .data.* .gnu.linkonce.d.*)
    SORT(CONSTRUCTORS)
  }
  ......................
  .bss:
  {
    *(.dynbss)                       ---
    *(.bss .bss.* .gnu.linkonce.b.*) --- move these lines to .data {} block
    *(COMMON)                        ---
    . = ALIGN(. != 0 ? 64 / 8 : 1);
  }

The result of this change is visible in pmap as absense of "[ anon ]" mapping:

08048000     856K r-xp  /bin/busybox
0811e000       8K rw-p  /bin/busybox -- former bss is in here
08bbf000       4K rw-p  [heap]
f77a2000       8K r--p  [vvar]
f77a4000       8K r-xp  [vdso]
ffa18000     132K rw-p  [stack]

This has a drawback that on-disk binary contains a few zeroed pages
and they will need to be read when formerly-bss variables are touched.
IOW: this has a small speed penalty.
Use space at the end of bss: FEATURE_USE_BSS_TAIL

The end of bss is usually not page-aligned. There is an unused space
in the last page. Linker marks that location with the _end symbol.

The FEATURE_USE_BSS_TAIL option attempts to use that space for
bb_common_bufsiz1[] array. If it fits after _end, that space will be
used, and COMMON_BUFSIZE will be enlarged from its guaranteed minimum
size of 1 kbyte. This may require recompilation a second time, since
the position of _end is known only after final link.