From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID:
<879f5c1e-14e6-41d4-b3d9-76d6c91bcf15@polymtl.ca>
Date: Tue, 5 Aug 2025 13:01:15 -0400
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH v2 1/4] gdbsupport: use dynamic partitioning in gdb::parallel_for_each
To: gdb-patches@sourceware.org
References: <20250703200130.4095761-1-simon.marchi@polymtl.ca>
From: Simon Marchi
In-Reply-To: <20250703200130.4095761-1-simon.marchi@polymtl.ca>
Content-Type: text/plain; charset=UTF-8
List-Id: Gdb-patches mailing list

Ping.

On 7/3/25 4:01 PM, simon.marchi@polymtl.ca wrote:
> From: Simon Marchi
>
> gdb::parallel_for_each uses static partitioning of the workload, meaning
> that each worker thread receives a similar number of work items.  Change
> it to use dynamic partitioning, where worker threads pull work items
> from a shared work queue when they need to.
>
> Note that gdb::parallel_for_each is currently only used for processing
> minimal symbols in GDB.  I am looking at improving the startup
> performance of GDB, where minimal symbol processing is one step.
>
> With static partitioning, there is a risk of workload imbalance if some
> threads receive "easier" work than others.  Some threads sit idle while
> others finish working on their share of the work.  This is not
> desirable, because the parallel_for_each takes as long as its slowest
> thread.
>
> When loading a file with a lot of minimal symbols (~600k) in GDB, with
> "maint set per-command time on", I observe some imbalance:
>
> Time for "minsyms install worker": wall 0.732, user 0.550, sys 0.041, user+sys 0.591, 80.7 % CPU
> Time for "minsyms install worker": wall 0.881, user 0.722, sys 0.071, user+sys 0.793, 90.0 % CPU
> Time for "minsyms install worker": wall 2.107, user 1.804, sys 0.147, user+sys 1.951, 92.6 % CPU
> Time for "minsyms install worker": wall 2.351, user 2.003, sys 0.151, user+sys 2.154, 91.6 % CPU
> Time for "minsyms install worker": wall 2.611, user 2.322, sys 0.235, user+sys 2.557, 97.9 % CPU
> Time for "minsyms install worker": wall 3.074, user 2.729, sys 0.203, user+sys 2.932, 95.4 % CPU
> Time for "minsyms install worker": wall 3.486, user 3.074, sys 0.260, user+sys 3.334, 95.6 % CPU
> Time for "minsyms install worker": wall 3.927, user 3.475, sys 0.336, user+sys 3.811, 97.0 % CPU
>                                         ^
>                                    -----´
>
> The fastest thread took 0.732 seconds to complete its work (and then sat
> still), while the slowest took 3.927 seconds.  This means the
> parallel_for_each took a bit less than 4 seconds.
>
> Even if the number of minimal symbols assigned to each worker is the
> same, I suppose that some symbols (e.g. those that need demangling) take
> longer to process, which could explain the imbalance.
>
> With this patch, things are much more balanced:
>
> Time for "minsym install worker": wall 2.807, user 2.222, sys 0.144, user+sys 2.366, 84.3 % CPU
> Time for "minsym install worker": wall 2.808, user 2.073, sys 0.131, user+sys 2.204, 78.5 % CPU
> Time for "minsym install worker": wall 2.804, user 1.994, sys 0.151, user+sys 2.145, 76.5 % CPU
> Time for "minsym install worker": wall 2.808, user 1.977, sys 0.135, user+sys 2.112, 75.2 % CPU
> Time for "minsym install worker": wall 2.808, user 2.061, sys 0.142, user+sys 2.203, 78.5 % CPU
> Time for "minsym install worker": wall 2.809, user 2.012, sys 0.146, user+sys 2.158, 76.8 % CPU
> Time for "minsym install worker": wall 2.809, user 2.178, sys 0.137, user+sys 2.315, 82.4 % CPU
> Time for "minsym install worker": wall 2.820, user 2.141, sys 0.170, user+sys 2.311, 82.0 % CPU
>                                        ^
>                                   -----´
>
> In this version, the parallel_for_each took about 2.8 seconds,
> representing a reduction of ~1.2 seconds for this step.  Not
> life-changing, but it's still good I think.
>
> Note that this patch helps when loading big programs.  My go-to test
> program for this is telegram-desktop that I built from source.  For
> small programs (including loading gdb itself), it makes no perceptible
> difference.
>
> Now the technical bits:
>
>  - One impact that this change has on the minimal symbol processing
>    specifically is that not all calls to compute_and_set_names (a
>    critical region guarded by a mutex) are done at the end of each
>    worker thread's task anymore.
>
>    Before this patch, each thread would compute the names and hash
>    values for all the minimal symbols it has been assigned, and then
>    would call compute_and_set_names for all of them, while holding the
>    mutex (thus preventing other threads from doing this same step).
>
>    With the shared work queue approach, each thread grabs a batch of
>    minimal symbols, computes the names and hash values for them, and
>    then calls compute_and_set_names (with the mutex held) for this
>    batch only.  It then repeats that until the work queue is empty.
>
>    There are therefore more small and spread out compute_and_set_names
>    critical sections, instead of just one per worker thread at the end.
>    Given that before this patch the work was not well balanced among
>    worker threads, I guess that threads would enter that critical
>    region at roughly different times, causing little contention.
>
>    In the "with this patch" results, the CPU utilization numbers are
>    not as good, suggesting that there is some contention.  But I don't
>    know if it's contention due to the compute_and_set_names critical
>    section or the shared work queue critical section.  That can be
>    investigated later.  In any case, what ultimately counts is the wall
>    time, which improves.
>
>  - One choice I had to make was to decide how many work items (in this
>    case minimal symbols) each worker should pop when getting work from
>    the shared queue.  The general wisdom is that:
>
>    - popping too few items, and the synchronization overhead becomes
>      significant, and the total processing time increases
>    - popping too many items, and we get some imbalance back, and the
>      total processing time increases again
>
>    I experimented using a dynamic batch size proportional to the number
>    of remaining work items.  It worked well in some cases but not
>    always.  So I decided to keep it simple, with a fixed batch size.
>    That can always be tweaked later.
>
>  - I want to still be able to use scoped_time_it to measure the time
>    that each worker thread spent working on the task.  I find it really
>    handy when measuring the performance impact of changes.
>
>    Unfortunately, the current interface of gdb::parallel_for_each,
>    which receives a simple callback, is not well-suited for that, once
>    I introduce the dynamic partitioning.  The callback would get called
>    once for each work item batch (multiple times for each worker
>    thread), so it's not possible to maintain a per-worker thread object
>    for the duration of the parallel for.
>
>    To allow this, I changed gdb::parallel_for_each to receive a worker
>    type as a template parameter.  Each worker thread creates one local
>    instance of that type, and calls operator() on it for each work item
>    batch.  By having a scoped_time_it object as a field of that worker,
>    we can get the timings per worker thread.
>
>    The drawback of this approach is that we must now define the
>    parallel task in a separate class and manually capture any context
>    we need as fields of that class.
>
> Change-Id: Ibf1fea65c91f76a95b9ed8f706fd6fa5ef52d9cf
> ---
>  gdb/minsyms.c                          | 133 ++++++++++++--------
>  gdb/unittests/parallel-for-selftests.c |  22 +++-
>  gdbsupport/parallel-for.h              | 164 ++++++++++++-------------
>  3 files changed, 178 insertions(+), 141 deletions(-)
>
> diff --git a/gdb/minsyms.c b/gdb/minsyms.c
> index 4a6459a6f2d2..e353a8a9ba00 100644
> --- a/gdb/minsyms.c
> +++ b/gdb/minsyms.c
> @@ -1392,6 +1392,81 @@ build_minimal_symbol_hash_tables
>      }
>  }
>
> +/* gdb::parallel_for_each worker to compute minimal symbol names and hashes.
> +*/
> +
> +class minimal_symbol_install_worker
> +{
> +public:
> +  minimal_symbol_install_worker
> +    (minimal_symbol *msymbols,
> +     gdb::array_view<computed_hash_values> hash_values,
> +     objfile_per_bfd_storage *per_bfd,
> +     std::mutex &demangled_mutex)
> +    : m_time_it ("minsym install worker"),
> +      m_msymbols (msymbols),
> +      m_hash_values (hash_values),
> +      m_per_bfd (per_bfd),
> +      m_demangled_mutex (demangled_mutex)
> +  {}
> +
> +  void operator() (minimal_symbol *start, minimal_symbol *end) noexcept
> +  {
> +    for (minimal_symbol *msym = start; msym < end; ++msym)
> +      {
> +        size_t idx = msym - m_msymbols;
> +        m_hash_values[idx].name_length = strlen (msym->linkage_name ());
> +
> +        if (!msym->name_set)
> +          {
> +            /* This will be freed later, by compute_and_set_names.  */
> +            gdb::unique_xmalloc_ptr<char> demangled_name
> +              = symbol_find_demangled_name (msym, msym->linkage_name ());
> +            msym->set_demangled_name (demangled_name.release (),
> +                                      &m_per_bfd->storage_obstack);
> +            msym->name_set = 1;
> +          }
> +
> +        /* This mangled_name_hash computation has to be outside of
> +           the name_set check, or compute_and_set_names below will
> +           be called with an invalid hash value.  */
> +        m_hash_values[idx].mangled_name_hash
> +          = fast_hash (msym->linkage_name (), m_hash_values[idx].name_length);
> +        m_hash_values[idx].minsym_hash = msymbol_hash (msym->linkage_name ());
> +
> +        /* We only use this hash code if the search name differs
> +           from the linkage name.  See the code in
> +           build_minimal_symbol_hash_tables.  */
> +        if (msym->search_name () != msym->linkage_name ())
> +          m_hash_values[idx].minsym_demangled_hash
> +            = search_name_hash (msym->language (), msym->search_name ());
> +      }
> +
> +    {
> +      /* To limit how long we hold the lock, we only acquire it here
> +         and not while we demangle the names above.
> +         */
> +#if CXX_STD_THREAD
> +      std::lock_guard<std::mutex> guard (m_demangled_mutex);
> +#endif
> +      for (minimal_symbol *msym = start; msym < end; ++msym)
> +        {
> +          size_t idx = msym - m_msymbols;
> +          std::string_view name (msym->linkage_name (),
> +                                 m_hash_values[idx].name_length);
> +          hashval_t hashval = m_hash_values[idx].mangled_name_hash;
> +
> +          msym->compute_and_set_names (name, false, m_per_bfd, hashval);
> +        }
> +    }
> +  }
> +
> +private:
> +  scoped_time_it m_time_it;
> +  minimal_symbol *m_msymbols;
> +  gdb::array_view<computed_hash_values> m_hash_values;
> +  objfile_per_bfd_storage *m_per_bfd;
> +  std::mutex &m_demangled_mutex;
> +};
> +
>  /* Add the minimal symbols in the existing bunches to the objfile's official
>     minimal symbol table.  In most cases there is no minimal symbol table yet
>     for this objfile, and the existing bunches are used to create one.  Once
> @@ -1478,59 +1553,11 @@ minimal_symbol_reader::install ()
>    std::vector<computed_hash_values> hash_values (mcount);
>
>    msymbols = m_objfile->per_bfd->msymbols.get ();
> -  /* Arbitrarily require at least 10 elements in a thread.  */
> -  gdb::parallel_for_each<10> (&msymbols[0], &msymbols[mcount],
> -    [&] (minimal_symbol *start, minimal_symbol *end)
> -  {
> -    scoped_time_it time_it ("minsyms install worker");
> -
> -    for (minimal_symbol *msym = start; msym < end; ++msym)
> -      {
> -        size_t idx = msym - msymbols;
> -        hash_values[idx].name_length = strlen (msym->linkage_name ());
> -        if (!msym->name_set)
> -          {
> -            /* This will be freed later, by compute_and_set_names.  */
> -            gdb::unique_xmalloc_ptr<char> demangled_name
> -              = symbol_find_demangled_name (msym, msym->linkage_name ());
> -            msym->set_demangled_name
> -              (demangled_name.release (),
> -               &m_objfile->per_bfd->storage_obstack);
> -            msym->name_set = 1;
> -          }
> -        /* This mangled_name_hash computation has to be outside of
> -           the name_set check, or compute_and_set_names below will
> -           be called with an invalid hash value.
> -           */
> -        hash_values[idx].mangled_name_hash
> -          = fast_hash (msym->linkage_name (),
> -                       hash_values[idx].name_length);
> -        hash_values[idx].minsym_hash
> -          = msymbol_hash (msym->linkage_name ());
> -        /* We only use this hash code if the search name differs
> -           from the linkage name.  See the code in
> -           build_minimal_symbol_hash_tables.  */
> -        if (msym->search_name () != msym->linkage_name ())
> -          hash_values[idx].minsym_demangled_hash
> -            = search_name_hash (msym->language (), msym->search_name ());
> -      }
> -    {
> -      /* To limit how long we hold the lock, we only acquire it here
> -         and not while we demangle the names above.  */
> -#if CXX_STD_THREAD
> -      std::lock_guard<std::mutex> guard (demangled_mutex);
> -#endif
> -      for (minimal_symbol *msym = start; msym < end; ++msym)
> -        {
> -          size_t idx = msym - msymbols;
> -          msym->compute_and_set_names
> -            (std::string_view (msym->linkage_name (),
> -                               hash_values[idx].name_length),
> -             false,
> -             m_objfile->per_bfd,
> -             hash_values[idx].mangled_name_hash);
> -        }
> -    }
> -  });
> +
> +  gdb::parallel_for_each<1000, minimal_symbol *, minimal_symbol_install_worker>
> +    (&msymbols[0], &msymbols[mcount], msymbols,
> +     gdb::array_view<computed_hash_values> (hash_values),
> +     m_objfile->per_bfd, demangled_mutex);
>
>    build_minimal_symbol_hash_tables (m_objfile, hash_values);
>  }
> diff --git a/gdb/unittests/parallel-for-selftests.c b/gdb/unittests/parallel-for-selftests.c
> index f5456141ff6e..86bf06c073a6 100644
> --- a/gdb/unittests/parallel-for-selftests.c
> +++ b/gdb/unittests/parallel-for-selftests.c
> @@ -91,12 +91,30 @@ test_one (int n_threads, do_foreach_t do_foreach)
>  static void
>  test_parallel_for_each ()
>  {
> +  struct test_worker
> +  {
> +    /* DUMMY is there to test passing multiple arguments to the worker
> +       constructor.
> +       */
> +    test_worker (foreach_callback_t callback, int dummy)
> +      : m_callback (callback)
> +    {
> +    }
> +
> +    void operator() (int first, int last) noexcept
> +    {
> +      return m_callback (first, last);
> +    }
> +
> +  private:
> +    foreach_callback_t m_callback;
> +  };
> +
>    const std::vector<do_foreach_t> for_each_functions
>    {
>      [] (int start, int end, foreach_callback_t callback)
> -    { gdb::parallel_for_each<1> (start, end, callback); },
> +    { gdb::parallel_for_each<1, int, test_worker> (start, end, callback, 0); },
>      [] (int start, int end, foreach_callback_t callback)
> -    { gdb::sequential_for_each (start, end, callback);}
> +    { gdb::sequential_for_each<int, test_worker> (start, end, callback, 0); },
>    };
>
>    for (int n_threads : { 0, 1, 3 })
> diff --git a/gdbsupport/parallel-for.h b/gdbsupport/parallel-for.h
> index b74c8068cf2c..bb41eb8700f0 100644
> --- a/gdbsupport/parallel-for.h
> +++ b/gdbsupport/parallel-for.h
> @@ -21,115 +21,107 @@
>  #define GDBSUPPORT_PARALLEL_FOR_H
>
>  #include <algorithm>
> -#include <type_traits>
> +#include <atomic>
>  #include "gdbsupport/thread-pool.h"
> -#include "gdbsupport/function-view.h"
>
>  namespace gdb
>  {
>
> -/* A very simple "parallel for".  This splits the range of iterators
> -   into subranges, and then passes each subrange to the callback.  The
> -   work may or may not be done in separate threads.
> +/* A "parallel-for" implementation using a shared work queue.  Work items get
> +   popped in batches of size up to BATCH_SIZE from the queue and handed out to
> +   worker threads.
>
> -   This approach was chosen over having the callback work on single
> -   items because it makes it simple for the caller to do
> -   once-per-subrange initialization and destruction.
> +   Each worker thread instantiates an object of type Worker, forwarding ARGS to
> +   its constructor.  The Worker object can be used to keep some per-worker
> +   thread state.
>
> -   The parameter N says how batching ought to be done -- there will be
> -   at least N elements processed per thread.  Setting N to 0 is not
> -   allowed.
> -   */
> +   Worker threads call Worker::operator() repeatedly until the queue is
> +   empty.  */
>
> -template<std::size_t n, class RandomIt, class RangeFunction>
> +template<std::size_t batch_size, class RandomIt, class Worker,
> +         class... WorkerArgs>
>  void
> -parallel_for_each (RandomIt first, RandomIt last, RangeFunction callback)
> +parallel_for_each (const RandomIt first, const RandomIt last,
> +                   WorkerArgs &&...worker_args)
>  {
>    /* If enabled, print debug info about how the work is distributed across
>       the threads.  */
>    const bool parallel_for_each_debug = false;
>
> -  size_t n_worker_threads = thread_pool::g_thread_pool->thread_count ();
> -  size_t n_threads = n_worker_threads;
> -  size_t n_elements = last - first;
> -  size_t elts_per_thread = 0;
> -  size_t elts_left_over = 0;
> +  gdb_assert (first <= last);
>
> -  if (n_threads > 1)
> +  if (parallel_for_each_debug)
>      {
> -      /* Require that there should be at least N elements in a
> -         thread.  */
> -      gdb_assert (n > 0);
> -      if (n_elements / n_threads < n)
> -        n_threads = std::max (n_elements / n, (size_t) 1);
> -      elts_per_thread = n_elements / n_threads;
> -      elts_left_over = n_elements % n_threads;
> -      /* n_elements == n_threads * elts_per_thread + elts_left_over.  */
> +      debug_printf ("Parallel for: n elements: %zu\n",
> +                    static_cast<size_t> (last - first));
> +      debug_printf ("Parallel for: batch size: %zu\n", batch_size);
>      }
>
> -  size_t count = n_threads == 0 ? 0 : n_threads - 1;
> +  const size_t n_worker_threads
> +    = std::max<size_t> (thread_pool::g_thread_pool->thread_count (), 1);
>    std::vector<gdb::future<void>> results;
>
> -  if (parallel_for_each_debug)
> -    {
> -      debug_printf (_("Parallel for: n_elements: %zu\n"), n_elements);
> -      debug_printf (_("Parallel for: minimum elements per thread: %zu\n"), n);
> -      debug_printf (_("Parallel for: elts_per_thread: %zu\n"), elts_per_thread);
> -    }
> +  /* The next item to hand out.
> +     */
> +  std::atomic<RandomIt> next = first;
>
> -  for (int i = 0; i < count; ++i)
> -    {
> -      RandomIt end;
> -      end = first + elts_per_thread;
> -      if (i < elts_left_over)
> -        /* Distribute the leftovers over the worker threads, to avoid having
> -           to handle all of them in a single thread.  */
> -        end++;
> -
> -      /* This case means we don't have enough elements to really
> -         distribute them.  Rather than ever submit a task that does
> -         nothing, we short-circuit here.  */
> -      if (first == end)
> -        end = last;
> -
> -      if (end == last)
> -        {
> -          /* We're about to dispatch the last batch of elements, which
> -             we normally process in the main thread.  So just truncate
> -             the result list here.  This avoids submitting empty tasks
> -             to the thread pool.  */
> -          count = i;
> -          break;
> -        }
> +  /* The worker thread task.
> +
> +     We need to capture args as a tuple, because it's not possible to capture
> +     the parameter pack directly in C++17.  Once we migrate to C++20, the
> +     capture can be simplified to:
>
> -      if (parallel_for_each_debug)
> +       ...args = std::forward<WorkerArgs> (worker_args)
> +
> +     and `args` can be used as-is in the lambda.  */
> +  auto args_tuple
> +    = std::forward_as_tuple (std::forward<WorkerArgs> (worker_args)...);
> +  auto task =
> +    [&next, first, last, n_worker_threads, &args_tuple] () {
> +      /* Instantiate the user-defined worker.  */
> +      auto worker = std::make_from_tuple<Worker> (args_tuple);
> +
> +      for (;;)
>          {
> -          debug_printf (_("Parallel for: elements on worker thread %i\t: %zu"),
> -                        i, (size_t)(end - first));
> -          debug_printf (_("\n"));
> +          /* Grab a snapshot of NEXT.  */
> +          auto local_next = next.load ();
> +          gdb_assert (local_next <= last);
> +
> +          /* Number of remaining items.  */
> +          auto n_remaining = last - local_next;
> +          gdb_assert (n_remaining >= 0);
> +
> +          /* Are we done?  */
> +          if (n_remaining == 0)
> +            break;
> +
> +          const auto this_batch_size
> +            = std::min<size_t> (batch_size, n_remaining);
> +
> +          /* The range to process in this iteration.
> +             */
> +          const auto this_batch_first = local_next;
> +          const auto this_batch_last = local_next + this_batch_size;
> +
> +          /* Update NEXT.  If the current value of NEXT doesn't match
> +             LOCAL_NEXT, it means another thread updated it concurrently,
> +             restart.  */
> +          if (!next.compare_exchange_weak (local_next, this_batch_last))
> +            continue;
> +
> +          if (parallel_for_each_debug)
> +            debug_printf ("Processing %zu items, range [%zu, %zu[\n",
> +                          this_batch_size,
> +                          static_cast<size_t> (this_batch_first - first),
> +                          static_cast<size_t> (this_batch_last - first));
> +
> +          worker (this_batch_first, this_batch_last);
>          }
> -      results.push_back (gdb::thread_pool::g_thread_pool->post_task ([=] ()
> -        {
> -          return callback (first, end);
> -        }));
> -      first = end;
> -    }
> -
> -  for (int i = count; i < n_worker_threads; ++i)
> -    if (parallel_for_each_debug)
> -      {
> -        debug_printf (_("Parallel for: elements on worker thread %i\t: 0"), i);
> -        debug_printf (_("\n"));
> -      }
> +    };
>
> -  /* Process all the remaining elements in the main thread.  */
> -  if (parallel_for_each_debug)
> -    {
> -      debug_printf (_("Parallel for: elements on main thread\t\t: %zu"),
> -                    (size_t)(last - first));
> -      debug_printf (_("\n"));
> -    }
> -  callback (first, last);
> +  /* Start N_WORKER_THREADS tasks.  */
> +  for (int i = 0; i < n_worker_threads; ++i)
> +    results.push_back (gdb::thread_pool::g_thread_pool->post_task (task));
>
> +  /* Wait for all of them to be finished.  */
>    for (auto &fut : results)
>      fut.get ();
>  }
> @@ -138,11 +130,11 @@ parallel_for_each (RandomIt first, RandomIt last, RangeFunction callback)
>     when debugging multi-threading behavior, and you want to limit
>     multi-threading in a fine-grained way.  */
>
> -template<typename RandomIt, typename RangeFunction>
> +template<typename RandomIt, typename Worker, typename... WorkerArgs>
>  void
> -sequential_for_each (RandomIt first, RandomIt last, RangeFunction callback)
> +sequential_for_each (RandomIt first, RandomIt last, WorkerArgs &&...worker_args)
>  {
> -  callback (first, last);
> +  Worker (std::forward<WorkerArgs> (worker_args)...)
> +    (first, last);
>  }
>
>  }
>
> base-commit: b7ff16c68a2c0bacc0416c4b36a44e65888ce72b
> --
> 2.50.0
>