[Buildroot] [PATCH 2/2] support/script/pkg-stats: run 'get latest version' in a process-pool
Arnout Vandecappelle
arnout at mind.be
Thu Jul 11 13:37:49 UTC 2019
On 11/07/2019 11:54, Victor Huesca wrote:
> The major bottleneck in pkg-stats is the time spent waiting for
> responses from distant servers. There are two parts of pkg-stats that
> involve such communication with remote servers:
> - the function checking that package URLs are responding
> - the function that fetches the latest package version from
> release-monitoring.
> The 1st one already makes use of process-pools for speed-up but the 2nd
> does not.
>
> This patch simply adds process-pool support to the
> `check_package_latest_version` function, in addition to the existing http-pool.
>
> There has already been work on parallelizing this function using
> threads, but it failed on some configurations [1]. This implementation
> relies on a dedicated module that is already in use in this script, so
> it is unlikely to fail in the same way.
>
> In terms of performance, this function now runs in ~3m, compared to
> ~25m for the sequential version.
>
> [1] http://lists.busybox.net/pipermail/buildroot/2018-March/215368.html
>
> Signed-off-by: Victor Huesca <victor.huesca at bootlin.com>
> ---
> support/scripts/pkg-stats | 32 +++++++++++++++++---------------
> 1 file changed, 17 insertions(+), 15 deletions(-)
>
> diff --git a/support/scripts/pkg-stats b/support/scripts/pkg-stats
> index 2ed41dffa9..14033941ba 100755
> --- a/support/scripts/pkg-stats
> +++ b/support/scripts/pkg-stats
> @@ -322,9 +322,9 @@ def check_package_urls(packages):
> pkg.url_status = pkg.url_worker.get(timeout=3600)
>
>
> -def release_monitoring_get_latest_version_by_distro(pool, name):
> +def release_monitoring_get_latest_version_by_distro(name):
Is there a good reason not to pass pool as an argument here? See below.
> try:
> - req = pool.request('GET', "/api/project/Buildroot/%s" % name)
> + req = Package.pool.request('GET', "/api/project/Buildroot/%s" % name)
> except HTTPError:
> return (RM_API_STATUS_ERROR, None, None)
>
> @@ -339,9 +339,9 @@ def release_monitoring_get_latest_version_by_distro(pool, name):
> return (RM_API_STATUS_FOUND_BY_DISTRO, None, data['id'])
>
>
> -def release_monitoring_get_latest_version_by_guess(pool, name):
> +def release_monitoring_get_latest_version_by_guess(name):
> try:
> - req = pool.request('GET', "/api/projects/?pattern=%s" % name)
> + req = Package.pool.request('GET', "/api/projects/?pattern=%s" % name)
> except HTTPError:
> return (RM_API_STATUS_ERROR, None, None)
>
> @@ -375,18 +375,20 @@ def check_package_latest_version(packages):
> - id: string containing the id of the project corresponding to this
> package, as known by release-monitoring.org
> """
> - pool = HTTPSConnectionPool('release-monitoring.org', port=443,
> - cert_reqs='CERT_REQUIRED', ca_certs=certifi.where(),
> - timeout=30)
> - count = 0
> + Package.pool = HTTPSConnectionPool('release-monitoring.org', port=443,
> + cert_reqs='CERT_REQUIRED', ca_certs=certifi.where(),
> + timeout=30)
> + worker_pool = Pool(processes=64)
> + for pkg in packages:
> + pkg.url_worker = worker_pool.apply_async(release_monitoring_get_latest_version_by_distro, (pkg.name, ))
Here, you can just use (Package.pool, pkg.name) as args, no?
With that, you don't need to add pool as a class attribute of Package anymore.
That was anyway a bit ugly.
Also, adding url_worker as an attribute to pkg here is a bit ugly. It would be
less so if you initialize it to None in Package.__init__.
> +
> for pkg in packages:
> - v = release_monitoring_get_latest_version_by_distro(pool, pkg.name)
> - if v[0] == RM_API_STATUS_NOT_FOUND:
> - v = release_monitoring_get_latest_version_by_guess(pool, pkg.name)
> + status, _, _ = pkg.url_worker.get()
Does this work? I mean, can you call get() multiple times?
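For reference, multiprocessing's AsyncResult does cache its result once the worker finishes, so get() can safely be called more than once. A minimal standalone check (nothing here comes from pkg-stats itself):

```python
from multiprocessing import Pool


def square(x):
    # Trivial stand-in for the real worker function.
    return x * x


if __name__ == '__main__':
    with Pool(processes=2) as pool:
        res = pool.apply_async(square, (7,))
        # The first get() blocks until the worker finishes; the second
        # returns the cached result immediately.
        print(res.get(timeout=30))
        print(res.get(timeout=30))
```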
> + if status == RM_API_STATUS_NOT_FOUND:
> + pkg.url_worker = worker_pool.apply_async(release_monitoring_get_latest_version_by_guess, (pkg.name, ))
I believe this approach has the disadvantage that it still kind of serializes
the retries (not exactly serializes, but they're not parallelized as much as
they could be). I think a better approach would be to include the retry in a
single callback function that tries both.
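A sketch of that combined callback, with stubs standing in for the two release_monitoring_* helpers and the RM_API_STATUS_* constants (only the helper names and the (status, version, id) tuple shape come from the patch; everything else is illustrative):

```python
from multiprocessing import Pool

# Illustrative stand-ins for the script's status constants.
RM_API_STATUS_NOT_FOUND = 'not-found'
RM_API_STATUS_FOUND_BY_PATTERN = 'found-by-pattern'


def lookup_by_distro(name):
    # Stub for release_monitoring_get_latest_version_by_distro().
    return (RM_API_STATUS_NOT_FOUND, None, None)


def lookup_by_guess(name):
    # Stub for release_monitoring_get_latest_version_by_guess().
    return (RM_API_STATUS_FOUND_BY_PATTERN, '1.0', 42)


def get_latest_version(name):
    # Single worker callback: the fallback query runs inside the same
    # worker, so retries are parallelized along with the first query
    # instead of being collected and re-submitted by the parent.
    v = lookup_by_distro(name)
    if v[0] == RM_API_STATUS_NOT_FOUND:
        v = lookup_by_guess(name)
    return v


if __name__ == '__main__':
    with Pool(processes=4) as pool:
        workers = [(n, pool.apply_async(get_latest_version, (n,)))
                   for n in ['busybox', 'dropbear']]
        for name, w in workers:
            print(name, w.get(timeout=30))
```

With this, the second result-collection loop in check_package_latest_version() reduces to a single get() per package.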
>
> - pkg.latest_version = v
> - print("[%d/%d] Package %s" % (count, len(packages), pkg.name))
It's a pity that we lose this progress indication, but I don't see a
convenient way to keep it.
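If it ever becomes worth the churn, the callback argument of apply_async could keep a rough counter, since callbacks fire in the parent as workers finish. A hypothetical sketch, not taken from pkg-stats:

```python
from multiprocessing import Pool


def work(name):
    # Stand-in for the real per-package version lookup.
    return name.upper()


if __name__ == '__main__':
    packages = ['busybox', 'dropbear', 'zlib']
    done = []

    def report(_result):
        # Invoked in the parent process each time a worker finishes,
        # so no shared state between processes is needed.
        done.append(None)
        print("[%d/%d] packages done" % (len(done), len(packages)))

    with Pool(processes=4) as pool:
        workers = [pool.apply_async(work, (p,), callback=report)
                   for p in packages]
        results = [w.get(timeout=30) for w in workers]
    print(results)
```

The counter reflects completion order rather than submission order, which is probably acceptable for a progress display.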
> - count += 1
> + for count, pkg in enumerate(packages):
> + pkg.latest_version = pkg.url_worker.get()
You're not actually using count, so no need to enumerate. I guess that's a
leftover from attempting to keep the progress indication.
Regards,
Arnout
>
>
> def calculate_stats(packages):
> @@ -773,7 +775,7 @@ def __main__():
> pkg.set_check_package_warnings()
> pkg.set_current_version()
> pkg.set_url()
> - print("Checking URL status")
> + print("Checking URL status ...")
> check_package_urls(packages)
> print("Getting latest versions ...")
> check_package_latest_version(packages)
>