From: Simon Marchi <simon.marchi@polymtl.ca>
To: gdb-patches@sourceware.org
Subject: Re: [PATCH v2 1/4] gdbsupport: use dynamic partitioning in gdb::parallel_for_each
Date: Tue, 5 Aug 2025 13:01:15 -0400 [thread overview]
Message-ID: <879f5c1e-14e6-41d4-b3d9-76d6c91bcf15@polymtl.ca> (raw)
In-Reply-To: <20250703200130.4095761-1-simon.marchi@polymtl.ca>
Ping.
On 7/3/25 4:01 PM, simon.marchi@polymtl.ca wrote:
> From: Simon Marchi <simon.marchi@polymtl.ca>
>
> gdb::parallel_for_each uses static partitioning of the workload, meaning
> that each worker thread receives a similar number of work items. Change
> it to use dynamic partitioning, where worker threads pull work items
> from a shared work queue when they need to.
>
> Note that gdb::parallel_for_each is currently only used for processing
> minimal symbols in GDB. I am looking at improving the startup
> performance of GDB, where the minimal symbol process is one step.
>
> With static partitioning, there is a risk of workload imbalance if some
> threads receive "easier" work than others. Some threads sit still while
> others finish working on their share of the work. This is not
> desirable, because the gdb::parallel_for_each takes as long as the
> slowest thread takes.
>
> When loading a file with a lot of minimal symbols (~600k) in GDB, with
> "maint set per-command time on", I observe some imbalance:
>
> Time for "minsyms install worker": wall 0.732, user 0.550, sys 0.041, user+sys 0.591, 80.7 % CPU
> Time for "minsyms install worker": wall 0.881, user 0.722, sys 0.071, user+sys 0.793, 90.0 % CPU
> Time for "minsyms install worker": wall 2.107, user 1.804, sys 0.147, user+sys 1.951, 92.6 % CPU
> Time for "minsyms install worker": wall 2.351, user 2.003, sys 0.151, user+sys 2.154, 91.6 % CPU
> Time for "minsyms install worker": wall 2.611, user 2.322, sys 0.235, user+sys 2.557, 97.9 % CPU
> Time for "minsyms install worker": wall 3.074, user 2.729, sys 0.203, user+sys 2.932, 95.4 % CPU
> Time for "minsyms install worker": wall 3.486, user 3.074, sys 0.260, user+sys 3.334, 95.6 % CPU
> Time for "minsyms install worker": wall 3.927, user 3.475, sys 0.336, user+sys 3.811, 97.0 % CPU
> ^
> ----´
>
> The fastest thread took 0.732 seconds to complete its work (and then sat
> still), while the slowest took 3.927 seconds. This means the
> parallel_for_each took a bit less than 4 seconds.
>
> Even if the number of minimal symbols assigned to each worker is the
> same, I suppose that some symbols (e.g. those that need demangling) take
> longer to process, which could explain the imbalance.
>
> With this patch, things are much more balanced:
>
> Time for "minsym install worker": wall 2.807, user 2.222, sys 0.144, user+sys 2.366, 84.3 % CPU
> Time for "minsym install worker": wall 2.808, user 2.073, sys 0.131, user+sys 2.204, 78.5 % CPU
> Time for "minsym install worker": wall 2.804, user 1.994, sys 0.151, user+sys 2.145, 76.5 % CPU
> Time for "minsym install worker": wall 2.808, user 1.977, sys 0.135, user+sys 2.112, 75.2 % CPU
> Time for "minsym install worker": wall 2.808, user 2.061, sys 0.142, user+sys 2.203, 78.5 % CPU
> Time for "minsym install worker": wall 2.809, user 2.012, sys 0.146, user+sys 2.158, 76.8 % CPU
> Time for "minsym install worker": wall 2.809, user 2.178, sys 0.137, user+sys 2.315, 82.4 % CPU
> Time for "minsym install worker": wall 2.820, user 2.141, sys 0.170, user+sys 2.311, 82.0 % CPU
> ^
> ----´
>
> In this version, the parallel_for_each took about 2.8 seconds,
> representing a reduction of ~1.2 seconds for this step. Not
> life-changing, but it's still good I think.
>
> Note that this patch helps when loading big programs. My go-to test
> program for this is telegram-desktop that I built from source. For
> small programs (including loading gdb itself), it makes no perceptible
> difference.
>
> Now the technical bits:
>
> - One impact that this change has on the minimal symbol processing
> specifically is that not all calls to compute_and_set_names (a
> critical region guarded by a mutex) are done at the end of each
> worker thread's task anymore.
>
> Before this patch, each thread would compute the names and hash values for
> all the minimal symbols it has been assigned, and then would call
> compute_and_set_names for all of them, while holding the mutex (thus
> preventing other threads from doing this same step).
>
> With the shared work queue approach, each thread grabs a batch of of
> minimal symbols, computes the names and hash values for them, and
> then calls compute_and_set_names (with the mutex held) for this batch
> only. It then repeats that until the work queue is empty.
>
> There are therefore more small and spread out compute_and_set_names
> critical sections, instead of just one per worker thread at the end.
> Given that before this patch the work was not well balanced among worker
> threads, I guess that threads would enter that critical region at
> roughly different times, causing little contention.
>
> In the "with this patch" results, the CPU utilization numbers are not
> as good, suggesting that there is some contention. But I don't know
> if it's contention due to the compute_and_set_names critical section
> or the shared work queue critical section. That can be investigated
> later. In any case, what ultimately counts is the wall time, which
> improves.
>
> - One choice I had to make was to decide how many work items (in this
> case minimal symbols) each worker should pop when getting work from
> the shared queue. The general wisdom is that:
>
> - popping too few items, and the synchronization overhead becomes
> significant, and the total processing time increases
> - popping too many items, and we get some imbalance back, and the
> total processing time increases again
>
> I experimented using a dynamic batch size proportional to the number
> of remaining work items. It worked well in some cases but not
> always. So I decided to keep it simple, with a fixed batch size.
> That can always be tweaked later.
>
> - I want to still be able to use scoped_time_it to measure the time
> that each worker thread spent working on the task. I find it really
> handy when measuring the performance impact of changes.
>
> Unfortunately, the current interface of gdb::parallel_for_each, which
> receives a simple callback, is not well-suited for that, once I
> introduce the dynamic partitioning. The callback would get called
> once for each work item batch (multiple time for each worker thread),
> so it's not possible to maintain a per-worker thread object for the
> duration of the parallel for.
>
> To allow this, I changed gdb::parallel_for_each to receive a worker
> type as a template parameter. Each worker thread creates one local
> instance of that type, and calls operator() on it for each work item
> batch. By having a scoped_time_it object as a field of that worker,
> we can get the timings per worker thread.
>
> The drawbacks of this approach is that we must now define the
> parallel task in a separate class and manually capture any context we
> need as fields of that class.
>
> Change-Id: Ibf1fea65c91f76a95b9ed8f706fd6fa5ef52d9cf
> ---
> gdb/minsyms.c | 133 ++++++++++++--------
> gdb/unittests/parallel-for-selftests.c | 22 +++-
> gdbsupport/parallel-for.h | 164 ++++++++++++-------------
> 3 files changed, 178 insertions(+), 141 deletions(-)
>
> diff --git a/gdb/minsyms.c b/gdb/minsyms.c
> index 4a6459a6f2d2..e353a8a9ba00 100644
> --- a/gdb/minsyms.c
> +++ b/gdb/minsyms.c
> @@ -1392,6 +1392,81 @@ build_minimal_symbol_hash_tables
> }
> }
>
> +/* gdb::parallel_for_each worker to compute minimal symbol names and hashes. */
> +
> +class minimal_symbol_install_worker
> +{
> +public:
> + minimal_symbol_install_worker
> + (minimal_symbol *msymbols,
> + gdb::array_view<computed_hash_values> hash_values,
> + objfile_per_bfd_storage *per_bfd,
> + std::mutex &demangled_mutex)
> + : m_time_it ("minsym install worker"),
> + m_msymbols (msymbols),
> + m_hash_values (hash_values),
> + m_per_bfd (per_bfd),
> + m_demangled_mutex (demangled_mutex)
> + {}
> +
> + void operator() (minimal_symbol *start, minimal_symbol *end) noexcept
> + {
> + for (minimal_symbol *msym = start; msym < end; ++msym)
> + {
> + size_t idx = msym - m_msymbols;
> + m_hash_values[idx].name_length = strlen (msym->linkage_name ());
> +
> + if (!msym->name_set)
> + {
> + /* This will be freed later, by compute_and_set_names. */
> + gdb::unique_xmalloc_ptr<char> demangled_name
> + = symbol_find_demangled_name (msym, msym->linkage_name ());
> + msym->set_demangled_name (demangled_name.release (),
> + &m_per_bfd->storage_obstack);
> + msym->name_set = 1;
> + }
> +
> + /* This mangled_name_hash computation has to be outside of
> + the name_set check, or compute_and_set_names below will
> + be called with an invalid hash value. */
> + m_hash_values[idx].mangled_name_hash
> + = fast_hash (msym->linkage_name (), m_hash_values[idx].name_length);
> + m_hash_values[idx].minsym_hash = msymbol_hash (msym->linkage_name ());
> +
> + /* We only use this hash code if the search name differs
> + from the linkage name. See the code in
> + build_minimal_symbol_hash_tables. */
> + if (msym->search_name () != msym->linkage_name ())
> + m_hash_values[idx].minsym_demangled_hash
> + = search_name_hash (msym->language (), msym->search_name ());
> + }
> +
> + {
> + /* To limit how long we hold the lock, we only acquire it here
> + and not while we demangle the names above. */
> +#if CXX_STD_THREAD
> + std::lock_guard<std::mutex> guard (m_demangled_mutex);
> +#endif
> + for (minimal_symbol *msym = start; msym < end; ++msym)
> + {
> + size_t idx = msym - m_msymbols;
> + std::string_view name (msym->linkage_name (),
> + m_hash_values[idx].name_length);
> + hashval_t hashval = m_hash_values[idx].mangled_name_hash;
> +
> + msym->compute_and_set_names (name, false, m_per_bfd, hashval);
> + }
> + }
> + }
> +
> +private:
> + scoped_time_it m_time_it;
> + minimal_symbol *m_msymbols;
> + gdb::array_view<computed_hash_values> m_hash_values;
> + objfile_per_bfd_storage *m_per_bfd;
> + std::mutex &m_demangled_mutex;
> +};
> +
> /* Add the minimal symbols in the existing bunches to the objfile's official
> minimal symbol table. In most cases there is no minimal symbol table yet
> for this objfile, and the existing bunches are used to create one. Once
> @@ -1478,59 +1553,11 @@ minimal_symbol_reader::install ()
> std::vector<computed_hash_values> hash_values (mcount);
>
> msymbols = m_objfile->per_bfd->msymbols.get ();
> - /* Arbitrarily require at least 10 elements in a thread. */
> - gdb::parallel_for_each<10> (&msymbols[0], &msymbols[mcount],
> - [&] (minimal_symbol *start, minimal_symbol *end)
> - {
> - scoped_time_it time_it ("minsyms install worker");
> -
> - for (minimal_symbol *msym = start; msym < end; ++msym)
> - {
> - size_t idx = msym - msymbols;
> - hash_values[idx].name_length = strlen (msym->linkage_name ());
> - if (!msym->name_set)
> - {
> - /* This will be freed later, by compute_and_set_names. */
> - gdb::unique_xmalloc_ptr<char> demangled_name
> - = symbol_find_demangled_name (msym, msym->linkage_name ());
> - msym->set_demangled_name
> - (demangled_name.release (),
> - &m_objfile->per_bfd->storage_obstack);
> - msym->name_set = 1;
> - }
> - /* This mangled_name_hash computation has to be outside of
> - the name_set check, or compute_and_set_names below will
> - be called with an invalid hash value. */
> - hash_values[idx].mangled_name_hash
> - = fast_hash (msym->linkage_name (),
> - hash_values[idx].name_length);
> - hash_values[idx].minsym_hash
> - = msymbol_hash (msym->linkage_name ());
> - /* We only use this hash code if the search name differs
> - from the linkage name. See the code in
> - build_minimal_symbol_hash_tables. */
> - if (msym->search_name () != msym->linkage_name ())
> - hash_values[idx].minsym_demangled_hash
> - = search_name_hash (msym->language (), msym->search_name ());
> - }
> - {
> - /* To limit how long we hold the lock, we only acquire it here
> - and not while we demangle the names above. */
> -#if CXX_STD_THREAD
> - std::lock_guard<std::mutex> guard (demangled_mutex);
> -#endif
> - for (minimal_symbol *msym = start; msym < end; ++msym)
> - {
> - size_t idx = msym - msymbols;
> - msym->compute_and_set_names
> - (std::string_view (msym->linkage_name (),
> - hash_values[idx].name_length),
> - false,
> - m_objfile->per_bfd,
> - hash_values[idx].mangled_name_hash);
> - }
> - }
> - });
> +
> + gdb::parallel_for_each<1000, minimal_symbol *, minimal_symbol_install_worker>
> + (&msymbols[0], &msymbols[mcount], msymbols,
> + gdb::array_view<computed_hash_values> (hash_values),
> + m_objfile->per_bfd, demangled_mutex);
>
> build_minimal_symbol_hash_tables (m_objfile, hash_values);
> }
> diff --git a/gdb/unittests/parallel-for-selftests.c b/gdb/unittests/parallel-for-selftests.c
> index f5456141ff6e..86bf06c073a6 100644
> --- a/gdb/unittests/parallel-for-selftests.c
> +++ b/gdb/unittests/parallel-for-selftests.c
> @@ -91,12 +91,30 @@ test_one (int n_threads, do_foreach_t do_foreach)
> static void
> test_parallel_for_each ()
> {
> + struct test_worker
> + {
> + /* DUMMY is there to test passing multiple arguments to the worker
> + constructor. */
> + test_worker (foreach_callback_t callback, int dummy)
> + : m_callback (callback)
> + {
> + }
> +
> + void operator() (int first, int last) noexcept
> + {
> + return m_callback (first, last);
> + }
> +
> + private:
> + foreach_callback_t m_callback;
> + };
> +
> const std::vector<do_foreach_t> for_each_functions
> {
> [] (int start, int end, foreach_callback_t callback)
> - { gdb::parallel_for_each<1> (start, end, callback); },
> + { gdb::parallel_for_each<1, int, test_worker> (start, end, callback, 0); },
> [] (int start, int end, foreach_callback_t callback)
> - { gdb::sequential_for_each (start, end, callback);}
> + { gdb::sequential_for_each<int, test_worker> (start, end, callback, 0); },
> };
>
> for (int n_threads : { 0, 1, 3 })
> diff --git a/gdbsupport/parallel-for.h b/gdbsupport/parallel-for.h
> index b74c8068cf2c..bb41eb8700f0 100644
> --- a/gdbsupport/parallel-for.h
> +++ b/gdbsupport/parallel-for.h
> @@ -21,115 +21,107 @@
> #define GDBSUPPORT_PARALLEL_FOR_H
>
> #include <algorithm>
> -#include <type_traits>
> +#include <tuple>
> #include "gdbsupport/thread-pool.h"
> -#include "gdbsupport/function-view.h"
>
> namespace gdb
> {
>
> -/* A very simple "parallel for". This splits the range of iterators
> - into subranges, and then passes each subrange to the callback. The
> - work may or may not be done in separate threads.
> +/* A "parallel-for" implementation using a shared work queue. Work items get
> + popped in batches of size up to BATCH_SIZE from the queue and handed out to
> + worker threads.
>
> - This approach was chosen over having the callback work on single
> - items because it makes it simple for the caller to do
> - once-per-subrange initialization and destruction.
> + Each worker thread instantiates an object of type Worker, forwarding ARGS to
> + its constructor. The Worker object can be used to keep some per-worker
> + thread state.
>
> - The parameter N says how batching ought to be done -- there will be
> - at least N elements processed per thread. Setting N to 0 is not
> - allowed. */
> + Worker threads call Worker::operator() repeatedly until the queue is
> + empty. */
>
> -template<std::size_t n, class RandomIt, class RangeFunction>
> +template<std::size_t batch_size, class RandomIt, class Worker,
> + class... WorkerArgs>
> void
> -parallel_for_each (RandomIt first, RandomIt last, RangeFunction callback)
> +parallel_for_each (const RandomIt first, const RandomIt last,
> + WorkerArgs &&...worker_args)
> {
> /* If enabled, print debug info about how the work is distributed across
> the threads. */
> const bool parallel_for_each_debug = false;
>
> - size_t n_worker_threads = thread_pool::g_thread_pool->thread_count ();
> - size_t n_threads = n_worker_threads;
> - size_t n_elements = last - first;
> - size_t elts_per_thread = 0;
> - size_t elts_left_over = 0;
> + gdb_assert (first <= last);
>
> - if (n_threads > 1)
> + if (parallel_for_each_debug)
> {
> - /* Require that there should be at least N elements in a
> - thread. */
> - gdb_assert (n > 0);
> - if (n_elements / n_threads < n)
> - n_threads = std::max (n_elements / n, (size_t) 1);
> - elts_per_thread = n_elements / n_threads;
> - elts_left_over = n_elements % n_threads;
> - /* n_elements == n_threads * elts_per_thread + elts_left_over. */
> + debug_printf ("Parallel for: n elements: %zu\n",
> + static_cast<std::size_t> (last - first));
> + debug_printf ("Parallel for: batch size: %zu\n", batch_size);
> }
>
> - size_t count = n_threads == 0 ? 0 : n_threads - 1;
> + const size_t n_worker_threads
> + = std::max<size_t> (thread_pool::g_thread_pool->thread_count (), 1);
> std::vector<gdb::future<void>> results;
>
> - if (parallel_for_each_debug)
> - {
> - debug_printf (_("Parallel for: n_elements: %zu\n"), n_elements);
> - debug_printf (_("Parallel for: minimum elements per thread: %zu\n"), n);
> - debug_printf (_("Parallel for: elts_per_thread: %zu\n"), elts_per_thread);
> - }
> + /* The next item to hand out. */
> + std::atomic<RandomIt> next = first;
>
> - for (int i = 0; i < count; ++i)
> - {
> - RandomIt end;
> - end = first + elts_per_thread;
> - if (i < elts_left_over)
> - /* Distribute the leftovers over the worker threads, to avoid having
> - to handle all of them in a single thread. */
> - end++;
> -
> - /* This case means we don't have enough elements to really
> - distribute them. Rather than ever submit a task that does
> - nothing, we short-circuit here. */
> - if (first == end)
> - end = last;
> -
> - if (end == last)
> - {
> - /* We're about to dispatch the last batch of elements, which
> - we normally process in the main thread. So just truncate
> - the result list here. This avoids submitting empty tasks
> - to the thread pool. */
> - count = i;
> - break;
> - }
> + /* The worker thread task.
> +
> + We need to capture args as a tuple, because it's not possible to capture
> + the parameter pack directly in C++17. Once we migrate to C++20, the
> + capture can be simplified to:
>
> - if (parallel_for_each_debug)
> + ... args = std::forward<Args>(args)
> +
> + and `args` can be used as-is in the lambda. */
> + auto args_tuple
> + = std::forward_as_tuple (std::forward<WorkerArgs> (worker_args)...);
> + auto task =
> + [&next, first, last, n_worker_threads, &args_tuple] () {
> + /* Instantiate the user-defined worker. */
> + auto worker = std::make_from_tuple<Worker> (args_tuple);
> +
> + for (;;)
> {
> - debug_printf (_("Parallel for: elements on worker thread %i\t: %zu"),
> - i, (size_t)(end - first));
> - debug_printf (_("\n"));
> + /* Grab a snapshot of NEXT. */
> + auto local_next = next.load ();
> + gdb_assert (local_next <= last);
> +
> + /* Number of remaining items. */
> + auto n_remaining = last - local_next;
> + gdb_assert (n_remaining >= 0);
> +
> + /* Are we done? */
> + if (n_remaining == 0)
> + break;
> +
> + const auto this_batch_size
> + = std::min<std::size_t> (batch_size, n_remaining);
> +
> + /* The range to process in this iteration. */
> + const auto this_batch_first = local_next;
> + const auto this_batch_last = local_next + this_batch_size;
> +
> + /* Update NEXT. If the current value of NEXT doesn't match
> + LOCAL_NEXT, it means another thread updated it concurrently,
> + restart. */
> + if (!next.compare_exchange_weak (local_next, this_batch_last))
> + continue;
> +
> + if (parallel_for_each_debug)
> + debug_printf ("Processing %zu items, range [%zu, %zu[\n",
> + this_batch_size,
> + static_cast<size_t> (this_batch_first - first),
> + static_cast<size_t> (this_batch_last - first));
> +
> + worker (this_batch_first, this_batch_last);
> }
> - results.push_back (gdb::thread_pool::g_thread_pool->post_task ([=] ()
> - {
> - return callback (first, end);
> - }));
> - first = end;
> - }
> -
> - for (int i = count; i < n_worker_threads; ++i)
> - if (parallel_for_each_debug)
> - {
> - debug_printf (_("Parallel for: elements on worker thread %i\t: 0"), i);
> - debug_printf (_("\n"));
> - }
> + };
>
> - /* Process all the remaining elements in the main thread. */
> - if (parallel_for_each_debug)
> - {
> - debug_printf (_("Parallel for: elements on main thread\t\t: %zu"),
> - (size_t)(last - first));
> - debug_printf (_("\n"));
> - }
> - callback (first, last);
> + /* Start N_WORKER_THREADS tasks. */
> + for (int i = 0; i < n_worker_threads; ++i)
> + results.push_back (gdb::thread_pool::g_thread_pool->post_task (task));
>
> + /* Wait for all of them to be finished. */
> for (auto &fut : results)
> fut.get ();
> }
> @@ -138,11 +130,11 @@ parallel_for_each (RandomIt first, RandomIt last, RangeFunction callback)
> when debugging multi-threading behavior, and you want to limit
> multi-threading in a fine-grained way. */
>
> -template<class RandomIt, class RangeFunction>
> +template<class RandomIt, class Worker, class... WorkerArgs>
> void
> -sequential_for_each (RandomIt first, RandomIt last, RangeFunction callback)
> +sequential_for_each (RandomIt first, RandomIt last, WorkerArgs &&...worker_args)
> {
> - callback (first, last);
> + Worker (std::forward<WorkerArgs> (worker_args)...) (first, last);
> }
>
> }
>
> base-commit: b7ff16c68a2c0bacc0416c4b36a44e65888ce72b
> --
> 2.50.0
>
next prev parent reply other threads:[~2025-08-05 19:41 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-07-03 20:01 simon.marchi
2025-07-03 20:01 ` [PATCH v2 2/4] gdbsupport: factor out work queue from parallel-for.h simon.marchi
2025-09-18 19:07 ` Tom Tromey
2025-09-19 18:06 ` Simon Marchi
2025-07-03 20:01 ` [PATCH v2 3/4] gdbsupport: add async parallel_for_each version simon.marchi
2025-09-18 19:16 ` Tom Tromey
2025-09-19 20:12 ` Simon Marchi
2025-07-03 20:01 ` [PATCH v2 4/4] gdb/dwarf: use dynamic partitioning for DWARF CU indexing simon.marchi
2025-09-18 19:26 ` Tom Tromey
2025-09-18 19:46 ` Simon Marchi
2025-09-18 19:53 ` Tom Tromey
2025-09-19 20:15 ` Simon Marchi
2025-08-05 17:01 ` Simon Marchi [this message]
2025-08-28 17:00 ` [PATCH v2 1/4] gdbsupport: use dynamic partitioning in gdb::parallel_for_each Simon Marchi
2025-09-08 17:23 ` Simon Marchi
2025-09-18 18:59 ` Tom Tromey
2025-09-19 17:47 ` Simon Marchi
2025-09-19 18:04 ` Tom Tromey
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=879f5c1e-14e6-41d4-b3d9-76d6c91bcf15@polymtl.ca \
--to=simon.marchi@polymtl.ca \
--cc=gdb-patches@sourceware.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox