[Buildroot] [PATCH] core/instrumentation: filter-out packages that install identical files

Thomas Petazzoni thomas.petazzoni at bootlin.com
Thu Jan 3 21:37:52 UTC 2019


Hello,

On Thu,  3 Jan 2019 17:04:31 +0100, Yann E. MORIN wrote:
> Currently, we check that no two packages write to the same files, as a
> sanity check. We do so by checking which files were touched since the
> end of the build (aka begining of the installation).
> 
> However, when the packages do install the exact same file, i,e, the
> same content, we in fact do not really care what package had provided
> said file.
> 
> In the past, we avoided that situation because we were md5sum-inf every
> files before and after installation. Anything that changed was new or
> modified, and everything that did not change was not modified (but could
> have been reinstalled).
> 
> However, since 7fb6e78254 (core/instrumentation: shave minutes off the
> build time), we're now using mtimes, and we're in the situation that the
> exact same file installed by two-or-more packages is reported.
> 
> In such a situation, it is not very interesting to know what package
> installed the file, because whatever the ordering, or whatever the
> subset of said packages, we'd have ended up with the same file anyway.
> One prominent case where this happens, if the fftw familly of pcackages,
> that all install different libraries, but install the same set of
> identical headers and some common utilities.
> 
> The rationale for 7fb6e78254 was that the impact on the build time was
> horrible, so we can't revert it.
> 
> However, we can partially restore use of md5 to detect if the files were
> actually modified or not, and limiting the md5sum-ing only to those
> actually modified files. The comparison of the md5 is then offloaded to
> later, during the final check-uniq-files calls.
> 
> Since the md5sum is run only on new files, they are cache-hot, and so
> the md5 is not going to storage to fetch them (and if they were not
> cache-hot, it would mean memory pressure for a reason or another, and
> the system would already be slowed for other reasons).
> 
> Using a defconfig with a various set of ~236 packages, the build times
> reported by running "time make >log.file 2>&1", on an otherwise unloaded
> system, were (smallest value of 5 runs each):
> 
>     before:     35:15.92
>     after:      35:22.03
> 
> Which is a slight overhead of just 6.11s, which is less than 26ms per
> package. Also, the system, although pretty idle, was still doing
> background work like fetching mails and such, so the overhead is not
> even consistent across various measurements...

Overall, I find the idea good and useful.

A couple of comments/questions (here and below):

 - I believe we should still fix the fact that the .la file timestamp
   is changed everytime we $(SED) the .la files, even if they are not
   changed. This will also help reduce the number of files to check
   using md5sum: a .la file should be fixed only once, and its
   timestamp never touched again.

 - Could we make a check-uniq-files error a hard build failure, as a
   follow-up patch from yours ? So far, check-uniq-files errors have
   been a bit useless, because they do not cause the build to fail, so
   nobody really cared.

> diff --git a/package/pkg-generic.mk b/package/pkg-generic.mk
> index f5cab2b9c2..7c6a995c19 100644
> --- a/package/pkg-generic.mk
> +++ b/package/pkg-generic.mk
> @@ -64,12 +64,14 @@ GLOBAL_INSTRUMENTATION_HOOKS += step_time
>  # $(3): suffix of file  (optional)
>  define step_pkg_size_inner
>  	@touch $(BUILD_DIR)/packages-file-list$(3).txt
> -	$(SED) '/^$(1),/d' $(BUILD_DIR)/packages-file-list$(3).txt
> +	$(SED) '/^$(1)  /d' $(BUILD_DIR)/packages-file-list$(3).txt

Why is the separator changed from , to space ?

>  	cd $(2); \
> -	find . \( -type f -o -type l \) \
> +	find . -xtype f \

Why are you changing from -type f -o -type l to -xtype f ? Is this
really related to the change ? Is -xtype supported in old version of
find ?

>  		-newer $($(PKG)_DIR)/.stamp_built \
> -		-exec printf '$(1),%s\n' {} + \
> -		>> $(BUILD_DIR)/packages-file-list$(3).txt  
> +		-print0 \
> +	|xargs -0 -r md5sum \

Ah, you're changing the separator to two spaces, because md5sum's output
already uses two spaces as a separator ?

This could break people who have scripts that parse the
package-file-list.txt contents. Maybe we should keep the compatibility,
and keep using , as a separator, and push the md5sum as an additional
third column ?

Like:
	|xargs -0 -r md5sum \
	| sed -r -e 's/([0-9a-f]{32})  (.*)/$(1),\2,\1/'

One drawback is that it won't work with filenames that contain a comma.
But in this case, we're forced to keep the filename field as the last
one, and therefore break backward compatibility of the
package-file-lists.txt format every time we want to add a new
information to it.

> +	|sed -r -e 's/^/$(1)  /' \
> +	>> $(BUILD_DIR)/packages-file-list$(3).txt
>  endef
>  
>  define step_pkg_size
> diff --git a/support/scripts/check-uniq-files b/support/scripts/check-uniq-files
> index fbc6b5d6e7..94deea01db 100755
> --- a/support/scripts/check-uniq-files
> +++ b/support/scripts/check-uniq-files
> @@ -24,24 +24,30 @@ def main():
>          sys.stderr.write('No type was provided\n')
>          return False
>  
> -    file_to_pkg = defaultdict(list)
> +    file_to_pkg = defaultdict(set)

You're changing from list to set, it's a bit unrelated. Should be a
separate commit ?

> +    file_md5 = defaultdict(set)
>      with open(args.packages_file_list[0], 'rb') as pkg_file_list:
>          for line in pkg_file_list.readlines():
> -            pkg, _, file = line.rstrip(b'\n').partition(b',')
> -            file_to_pkg[file].append(pkg)
> +            pkg, md5, file = line.rstrip(b'\n').split(b'  ', 2)

The ", 2" in split() is to make sure that if the filename contains two
consecutive spaces, we don't split it up ?

> +            file_to_pkg[file].add(pkg)
> +            file_md5[file].add(md5)
>  
>      for file in file_to_pkg:
> -        if len(file_to_pkg[file]) > 1:
> -            # If possible, try to decode the binary strings with
> -            # the default user's locale
> -            try:
> -                sys.stderr.write(warn.format(args.type, file.decode(),
> -                                             [p.decode() for p in file_to_pkg[file]]))
> -            except UnicodeDecodeError:
> -                # ... but fallback to just dumping them raw if they
> -                # contain non-representable chars
> -                sys.stderr.write(warn.format(args.type, file,
> -                                             file_to_pkg[file]))
> +        if len(file_to_pkg[file]) == 1 or len(file_md5[file]) == 1:
> +            # If only one package installed that file, or they all
> +            # installed the same content, don't report that file.
> +            continue
> +
> +        # If possible, try to decode the binary strings with
> +        # the default user's locale
> +        try:
> +            sys.stderr.write(warn.format(args.type, file.decode(),
> +                                         [p.decode() for p in file_to_pkg[file]]))
> +        except UnicodeDecodeError:
> +            # ... but fallback to just dumping them raw if they
> +            # contain non-representable chars
> +            sys.stderr.write(warn.format(args.type, file,
> +                                         file_to_pkg[file]))

I'm not sure why the whole test is inverted, it causes a lot of noise
in the diff. If there is a good reason for it, should be a separate
patch.

Best regards,

Thomas
-- 
Thomas Petazzoni, CTO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com


More information about the buildroot mailing list