* [PATCH v2 1/4] gdbsupport: use dynamic partitioning in gdb::parallel_for_each
@ 2025-07-03 20:01 simon.marchi
From: simon.marchi @ 2025-07-03 20:01 UTC (permalink / raw)
To: gdb-patches; +Cc: Simon Marchi
From: Simon Marchi <simon.marchi@polymtl.ca>
gdb::parallel_for_each uses static partitioning of the workload, meaning
that each worker thread receives a similar number of work items. Change
it to use dynamic partitioning, where worker threads pull work items
from a shared work queue when they need to.
Note that gdb::parallel_for_each is currently only used for processing
minimal symbols in GDB. I am looking at improving GDB's startup
performance, and minimal symbol processing is one of the steps.
With static partitioning, there is a risk of workload imbalance if some
threads receive "easier" work than others: some threads sit idle while
others are still working through their share. This is undesirable,
because the whole gdb::parallel_for_each takes as long as its slowest
thread.
When loading a file with a lot of minimal symbols (~600k) in GDB, with
"maint set per-command time on", I observe some imbalance:
Time for "minsyms install worker": wall 0.732, user 0.550, sys 0.041, user+sys 0.591, 80.7 % CPU
Time for "minsyms install worker": wall 0.881, user 0.722, sys 0.071, user+sys 0.793, 90.0 % CPU
Time for "minsyms install worker": wall 2.107, user 1.804, sys 0.147, user+sys 1.951, 92.6 % CPU
Time for "minsyms install worker": wall 2.351, user 2.003, sys 0.151, user+sys 2.154, 91.6 % CPU
Time for "minsyms install worker": wall 2.611, user 2.322, sys 0.235, user+sys 2.557, 97.9 % CPU
Time for "minsyms install worker": wall 3.074, user 2.729, sys 0.203, user+sys 2.932, 95.4 % CPU
Time for "minsyms install worker": wall 3.486, user 3.074, sys 0.260, user+sys 3.334, 95.6 % CPU
Time for "minsyms install worker": wall 3.927, user 3.475, sys 0.336, user+sys 3.811, 97.0 % CPU
^
----´
The fastest thread took 0.732 seconds to complete its work (and then sat
still), while the slowest took 3.927 seconds. This means the
parallel_for_each took a bit less than 4 seconds.
Even if the number of minimal symbols assigned to each worker is the
same, I suppose that some symbols (e.g. those that need demangling) take
longer to process, which could explain the imbalance.
With this patch, things are much more balanced:
Time for "minsym install worker": wall 2.807, user 2.222, sys 0.144, user+sys 2.366, 84.3 % CPU
Time for "minsym install worker": wall 2.808, user 2.073, sys 0.131, user+sys 2.204, 78.5 % CPU
Time for "minsym install worker": wall 2.804, user 1.994, sys 0.151, user+sys 2.145, 76.5 % CPU
Time for "minsym install worker": wall 2.808, user 1.977, sys 0.135, user+sys 2.112, 75.2 % CPU
Time for "minsym install worker": wall 2.808, user 2.061, sys 0.142, user+sys 2.203, 78.5 % CPU
Time for "minsym install worker": wall 2.809, user 2.012, sys 0.146, user+sys 2.158, 76.8 % CPU
Time for "minsym install worker": wall 2.809, user 2.178, sys 0.137, user+sys 2.315, 82.4 % CPU
Time for "minsym install worker": wall 2.820, user 2.141, sys 0.170, user+sys 2.311, 82.0 % CPU
^
----´
In this version, the parallel_for_each took about 2.8 seconds,
representing a reduction of ~1.2 seconds for this step. Not
life-changing, but it's still good I think.
Note that this patch helps when loading big programs. My go-to test
program for this is telegram-desktop that I built from source. For
small programs (including loading gdb itself), it makes no perceptible
difference.
Now the technical bits:
- One impact that this change has on the minimal symbol processing
specifically is that not all calls to compute_and_set_names (a
critical region guarded by a mutex) are done at the end of each
worker thread's task anymore.
Before this patch, each thread would compute the names and hash values for
all the minimal symbols it has been assigned, and then would call
compute_and_set_names for all of them, while holding the mutex (thus
preventing other threads from doing this same step).
With the shared work queue approach, each thread grabs a batch of
minimal symbols, computes the names and hash values for them, and
then calls compute_and_set_names (with the mutex held) for this batch
only. It then repeats that until the work queue is empty.
There are therefore more small and spread out compute_and_set_names
critical sections, instead of just one per worker thread at the end.
Given that before this patch the work was not well balanced among worker
threads, I guess that threads would enter that critical region at
different times, causing little contention.
In the "with this patch" results, the CPU utilization numbers are not
as good, suggesting that there is some contention. But I don't know
if it's contention due to the compute_and_set_names critical section
or the shared work queue critical section. That can be investigated
later. In any case, what ultimately counts is the wall time, which
improves.
- One choice I had to make was how many work items (in this case
minimal symbols) each worker should pop when getting work from the
shared queue. The general wisdom is that:
- popping too few items, and the synchronization overhead becomes
significant, and the total processing time increases
- popping too many items, and we get some imbalance back, and the
total processing time increases again
I experimented using a dynamic batch size proportional to the number
of remaining work items. It worked well in some cases but not
always. So I decided to keep it simple, with a fixed batch size.
That can always be tweaked later.
- I want to still be able to use scoped_time_it to measure the time
that each worker thread spent working on the task. I find it really
handy when measuring the performance impact of changes.
Unfortunately, the current interface of gdb::parallel_for_each, which
receives a simple callback, is not well-suited for that, once I
introduce the dynamic partitioning. The callback would get called
once for each work item batch (multiple times for each worker thread),
so it's not possible to maintain a per-worker thread object for the
duration of the parallel for.
To allow this, I changed gdb::parallel_for_each to receive a worker
type as a template parameter. Each worker thread creates one local
instance of that type, and calls operator() on it for each work item
batch. By having a scoped_time_it object as a field of that worker,
we can get the timings per worker thread.
The drawback of this approach is that we must now define the
parallel task in a separate class and manually capture any context we
need as fields of that class.
Change-Id: Ibf1fea65c91f76a95b9ed8f706fd6fa5ef52d9cf
---
gdb/minsyms.c | 133 ++++++++++++--------
gdb/unittests/parallel-for-selftests.c | 22 +++-
gdbsupport/parallel-for.h | 164 ++++++++++++-------------
3 files changed, 178 insertions(+), 141 deletions(-)
diff --git a/gdb/minsyms.c b/gdb/minsyms.c
index 4a6459a6f2d2..e353a8a9ba00 100644
--- a/gdb/minsyms.c
+++ b/gdb/minsyms.c
@@ -1392,6 +1392,81 @@ build_minimal_symbol_hash_tables
}
}
+/* gdb::parallel_for_each worker to compute minimal symbol names and hashes. */
+
+class minimal_symbol_install_worker
+{
+public:
+ minimal_symbol_install_worker
+ (minimal_symbol *msymbols,
+ gdb::array_view<computed_hash_values> hash_values,
+ objfile_per_bfd_storage *per_bfd,
+ std::mutex &demangled_mutex)
+ : m_time_it ("minsym install worker"),
+ m_msymbols (msymbols),
+ m_hash_values (hash_values),
+ m_per_bfd (per_bfd),
+ m_demangled_mutex (demangled_mutex)
+ {}
+
+ void operator() (minimal_symbol *start, minimal_symbol *end) noexcept
+ {
+ for (minimal_symbol *msym = start; msym < end; ++msym)
+ {
+ size_t idx = msym - m_msymbols;
+ m_hash_values[idx].name_length = strlen (msym->linkage_name ());
+
+ if (!msym->name_set)
+ {
+ /* This will be freed later, by compute_and_set_names. */
+ gdb::unique_xmalloc_ptr<char> demangled_name
+ = symbol_find_demangled_name (msym, msym->linkage_name ());
+ msym->set_demangled_name (demangled_name.release (),
+ &m_per_bfd->storage_obstack);
+ msym->name_set = 1;
+ }
+
+ /* This mangled_name_hash computation has to be outside of
+ the name_set check, or compute_and_set_names below will
+ be called with an invalid hash value. */
+ m_hash_values[idx].mangled_name_hash
+ = fast_hash (msym->linkage_name (), m_hash_values[idx].name_length);
+ m_hash_values[idx].minsym_hash = msymbol_hash (msym->linkage_name ());
+
+ /* We only use this hash code if the search name differs
+ from the linkage name. See the code in
+ build_minimal_symbol_hash_tables. */
+ if (msym->search_name () != msym->linkage_name ())
+ m_hash_values[idx].minsym_demangled_hash
+ = search_name_hash (msym->language (), msym->search_name ());
+ }
+
+ {
+ /* To limit how long we hold the lock, we only acquire it here
+ and not while we demangle the names above. */
+#if CXX_STD_THREAD
+ std::lock_guard<std::mutex> guard (m_demangled_mutex);
+#endif
+ for (minimal_symbol *msym = start; msym < end; ++msym)
+ {
+ size_t idx = msym - m_msymbols;
+ std::string_view name (msym->linkage_name (),
+ m_hash_values[idx].name_length);
+ hashval_t hashval = m_hash_values[idx].mangled_name_hash;
+
+ msym->compute_and_set_names (name, false, m_per_bfd, hashval);
+ }
+ }
+ }
+
+private:
+ scoped_time_it m_time_it;
+ minimal_symbol *m_msymbols;
+ gdb::array_view<computed_hash_values> m_hash_values;
+ objfile_per_bfd_storage *m_per_bfd;
+ std::mutex &m_demangled_mutex;
+};
+
/* Add the minimal symbols in the existing bunches to the objfile's official
minimal symbol table. In most cases there is no minimal symbol table yet
for this objfile, and the existing bunches are used to create one. Once
@@ -1478,59 +1553,11 @@ minimal_symbol_reader::install ()
std::vector<computed_hash_values> hash_values (mcount);
msymbols = m_objfile->per_bfd->msymbols.get ();
- /* Arbitrarily require at least 10 elements in a thread. */
- gdb::parallel_for_each<10> (&msymbols[0], &msymbols[mcount],
- [&] (minimal_symbol *start, minimal_symbol *end)
- {
- scoped_time_it time_it ("minsyms install worker");
-
- for (minimal_symbol *msym = start; msym < end; ++msym)
- {
- size_t idx = msym - msymbols;
- hash_values[idx].name_length = strlen (msym->linkage_name ());
- if (!msym->name_set)
- {
- /* This will be freed later, by compute_and_set_names. */
- gdb::unique_xmalloc_ptr<char> demangled_name
- = symbol_find_demangled_name (msym, msym->linkage_name ());
- msym->set_demangled_name
- (demangled_name.release (),
- &m_objfile->per_bfd->storage_obstack);
- msym->name_set = 1;
- }
- /* This mangled_name_hash computation has to be outside of
- the name_set check, or compute_and_set_names below will
- be called with an invalid hash value. */
- hash_values[idx].mangled_name_hash
- = fast_hash (msym->linkage_name (),
- hash_values[idx].name_length);
- hash_values[idx].minsym_hash
- = msymbol_hash (msym->linkage_name ());
- /* We only use this hash code if the search name differs
- from the linkage name. See the code in
- build_minimal_symbol_hash_tables. */
- if (msym->search_name () != msym->linkage_name ())
- hash_values[idx].minsym_demangled_hash
- = search_name_hash (msym->language (), msym->search_name ());
- }
- {
- /* To limit how long we hold the lock, we only acquire it here
- and not while we demangle the names above. */
-#if CXX_STD_THREAD
- std::lock_guard<std::mutex> guard (demangled_mutex);
-#endif
- for (minimal_symbol *msym = start; msym < end; ++msym)
- {
- size_t idx = msym - msymbols;
- msym->compute_and_set_names
- (std::string_view (msym->linkage_name (),
- hash_values[idx].name_length),
- false,
- m_objfile->per_bfd,
- hash_values[idx].mangled_name_hash);
- }
- }
- });
+
+ gdb::parallel_for_each<1000, minimal_symbol *, minimal_symbol_install_worker>
+ (&msymbols[0], &msymbols[mcount], msymbols,
+ gdb::array_view<computed_hash_values> (hash_values),
+ m_objfile->per_bfd, demangled_mutex);
build_minimal_symbol_hash_tables (m_objfile, hash_values);
}
diff --git a/gdb/unittests/parallel-for-selftests.c b/gdb/unittests/parallel-for-selftests.c
index f5456141ff6e..86bf06c073a6 100644
--- a/gdb/unittests/parallel-for-selftests.c
+++ b/gdb/unittests/parallel-for-selftests.c
@@ -91,12 +91,30 @@ test_one (int n_threads, do_foreach_t do_foreach)
static void
test_parallel_for_each ()
{
+ struct test_worker
+ {
+ /* DUMMY is there to test passing multiple arguments to the worker
+ constructor. */
+ test_worker (foreach_callback_t callback, int dummy)
+ : m_callback (callback)
+ {
+ }
+
+ void operator() (int first, int last) noexcept
+ {
+ return m_callback (first, last);
+ }
+
+ private:
+ foreach_callback_t m_callback;
+ };
+
const std::vector<do_foreach_t> for_each_functions
{
[] (int start, int end, foreach_callback_t callback)
- { gdb::parallel_for_each<1> (start, end, callback); },
+ { gdb::parallel_for_each<1, int, test_worker> (start, end, callback, 0); },
[] (int start, int end, foreach_callback_t callback)
- { gdb::sequential_for_each (start, end, callback);}
+ { gdb::sequential_for_each<int, test_worker> (start, end, callback, 0); },
};
for (int n_threads : { 0, 1, 3 })
diff --git a/gdbsupport/parallel-for.h b/gdbsupport/parallel-for.h
index b74c8068cf2c..bb41eb8700f0 100644
--- a/gdbsupport/parallel-for.h
+++ b/gdbsupport/parallel-for.h
@@ -21,115 +21,107 @@
#define GDBSUPPORT_PARALLEL_FOR_H
#include <algorithm>
-#include <type_traits>
+#include <tuple>
#include "gdbsupport/thread-pool.h"
-#include "gdbsupport/function-view.h"
namespace gdb
{
-/* A very simple "parallel for". This splits the range of iterators
- into subranges, and then passes each subrange to the callback. The
- work may or may not be done in separate threads.
+/* A "parallel-for" implementation using a shared work queue. Work items get
+ popped in batches of size up to BATCH_SIZE from the queue and handed out to
+ worker threads.
- This approach was chosen over having the callback work on single
- items because it makes it simple for the caller to do
- once-per-subrange initialization and destruction.
+ Each worker thread instantiates an object of type Worker, forwarding ARGS to
+ its constructor. The Worker object can be used to keep some per-worker
+ thread state.
- The parameter N says how batching ought to be done -- there will be
- at least N elements processed per thread. Setting N to 0 is not
- allowed. */
+ Worker threads call Worker::operator() repeatedly until the queue is
+ empty. */
-template<std::size_t n, class RandomIt, class RangeFunction>
+template<std::size_t batch_size, class RandomIt, class Worker,
+ class... WorkerArgs>
void
-parallel_for_each (RandomIt first, RandomIt last, RangeFunction callback)
+parallel_for_each (const RandomIt first, const RandomIt last,
+ WorkerArgs &&...worker_args)
{
/* If enabled, print debug info about how the work is distributed across
the threads. */
const bool parallel_for_each_debug = false;
- size_t n_worker_threads = thread_pool::g_thread_pool->thread_count ();
- size_t n_threads = n_worker_threads;
- size_t n_elements = last - first;
- size_t elts_per_thread = 0;
- size_t elts_left_over = 0;
+ gdb_assert (first <= last);
- if (n_threads > 1)
+ if (parallel_for_each_debug)
{
- /* Require that there should be at least N elements in a
- thread. */
- gdb_assert (n > 0);
- if (n_elements / n_threads < n)
- n_threads = std::max (n_elements / n, (size_t) 1);
- elts_per_thread = n_elements / n_threads;
- elts_left_over = n_elements % n_threads;
- /* n_elements == n_threads * elts_per_thread + elts_left_over. */
+ debug_printf ("Parallel for: n elements: %zu\n",
+ static_cast<std::size_t> (last - first));
+ debug_printf ("Parallel for: batch size: %zu\n", batch_size);
}
- size_t count = n_threads == 0 ? 0 : n_threads - 1;
+ const size_t n_worker_threads
+ = std::max<size_t> (thread_pool::g_thread_pool->thread_count (), 1);
std::vector<gdb::future<void>> results;
- if (parallel_for_each_debug)
- {
- debug_printf (_("Parallel for: n_elements: %zu\n"), n_elements);
- debug_printf (_("Parallel for: minimum elements per thread: %zu\n"), n);
- debug_printf (_("Parallel for: elts_per_thread: %zu\n"), elts_per_thread);
- }
+ /* The next item to hand out. */
+ std::atomic<RandomIt> next = first;
- for (int i = 0; i < count; ++i)
- {
- RandomIt end;
- end = first + elts_per_thread;
- if (i < elts_left_over)
- /* Distribute the leftovers over the worker threads, to avoid having
- to handle all of them in a single thread. */
- end++;
-
- /* This case means we don't have enough elements to really
- distribute them. Rather than ever submit a task that does
- nothing, we short-circuit here. */
- if (first == end)
- end = last;
-
- if (end == last)
- {
- /* We're about to dispatch the last batch of elements, which
- we normally process in the main thread. So just truncate
- the result list here. This avoids submitting empty tasks
- to the thread pool. */
- count = i;
- break;
- }
+ /* The worker thread task.
+
+ We need to capture args as a tuple, because it's not possible to capture
+ the parameter pack directly in C++17. Once we migrate to C++20, the
+ capture can be simplified to:
- if (parallel_for_each_debug)
+ ... args = std::forward<Args>(args)
+
+ and `args` can be used as-is in the lambda. */
+ auto args_tuple
+ = std::forward_as_tuple (std::forward<WorkerArgs> (worker_args)...);
+ auto task =
+ [&next, first, last, n_worker_threads, &args_tuple] () {
+ /* Instantiate the user-defined worker. */
+ auto worker = std::make_from_tuple<Worker> (args_tuple);
+
+ for (;;)
{
- debug_printf (_("Parallel for: elements on worker thread %i\t: %zu"),
- i, (size_t)(end - first));
- debug_printf (_("\n"));
+ /* Grab a snapshot of NEXT. */
+ auto local_next = next.load ();
+ gdb_assert (local_next <= last);
+
+ /* Number of remaining items. */
+ auto n_remaining = last - local_next;
+ gdb_assert (n_remaining >= 0);
+
+ /* Are we done? */
+ if (n_remaining == 0)
+ break;
+
+ const auto this_batch_size
+ = std::min<std::size_t> (batch_size, n_remaining);
+
+ /* The range to process in this iteration. */
+ const auto this_batch_first = local_next;
+ const auto this_batch_last = local_next + this_batch_size;
+
+ /* Update NEXT. If the current value of NEXT doesn't match
+ LOCAL_NEXT, it means another thread updated it concurrently,
+ restart. */
+ if (!next.compare_exchange_weak (local_next, this_batch_last))
+ continue;
+
+ if (parallel_for_each_debug)
+ debug_printf ("Processing %zu items, range [%zu, %zu[\n",
+ this_batch_size,
+ static_cast<size_t> (this_batch_first - first),
+ static_cast<size_t> (this_batch_last - first));
+
+ worker (this_batch_first, this_batch_last);
}
- results.push_back (gdb::thread_pool::g_thread_pool->post_task ([=] ()
- {
- return callback (first, end);
- }));
- first = end;
- }
-
- for (int i = count; i < n_worker_threads; ++i)
- if (parallel_for_each_debug)
- {
- debug_printf (_("Parallel for: elements on worker thread %i\t: 0"), i);
- debug_printf (_("\n"));
- }
+ };
- /* Process all the remaining elements in the main thread. */
- if (parallel_for_each_debug)
- {
- debug_printf (_("Parallel for: elements on main thread\t\t: %zu"),
- (size_t)(last - first));
- debug_printf (_("\n"));
- }
- callback (first, last);
+ /* Start N_WORKER_THREADS tasks. */
+ for (int i = 0; i < n_worker_threads; ++i)
+ results.push_back (gdb::thread_pool::g_thread_pool->post_task (task));
+ /* Wait for all of them to be finished. */
for (auto &fut : results)
fut.get ();
}
@@ -138,11 +130,11 @@ parallel_for_each (RandomIt first, RandomIt last, RangeFunction callback)
when debugging multi-threading behavior, and you want to limit
multi-threading in a fine-grained way. */
-template<class RandomIt, class RangeFunction>
+template<class RandomIt, class Worker, class... WorkerArgs>
void
-sequential_for_each (RandomIt first, RandomIt last, RangeFunction callback)
+sequential_for_each (RandomIt first, RandomIt last, WorkerArgs &&...worker_args)
{
- callback (first, last);
+ Worker (std::forward<WorkerArgs> (worker_args)...) (first, last);
}
}
base-commit: b7ff16c68a2c0bacc0416c4b36a44e65888ce72b
--
2.50.0
* [PATCH v2 2/4] gdbsupport: factor out work queue from parallel-for.h
From: simon.marchi @ 2025-07-03 20:01 UTC (permalink / raw)
To: gdb-patches; +Cc: Simon Marchi
From: Simon Marchi <simon.marchi@efficios.com>
New in v2: add some consts
In preparation for a following patch that will re-use the shared work
queue algorithm, move it to a separate class.
Change-Id: Id05cf8898a5d162048fa8fa056fbf7e0441bfb78
---
gdbsupport/parallel-for.h | 45 +++++--------------
gdbsupport/work-queue.h | 95 +++++++++++++++++++++++++++++++++++++++
2 files changed, 107 insertions(+), 33 deletions(-)
create mode 100644 gdbsupport/work-queue.h
diff --git a/gdbsupport/parallel-for.h b/gdbsupport/parallel-for.h
index bb41eb8700f0..c5a15d3235a7 100644
--- a/gdbsupport/parallel-for.h
+++ b/gdbsupport/parallel-for.h
@@ -23,6 +23,7 @@
#include <algorithm>
#include <tuple>
#include "gdbsupport/thread-pool.h"
+#include "gdbsupport/work-queue.h"
namespace gdb
{
@@ -57,12 +58,8 @@ parallel_for_each (const RandomIt first, const RandomIt last,
debug_printf ("Parallel for: batch size: %zu\n", batch_size);
}
- const size_t n_worker_threads
- = std::max<size_t> (thread_pool::g_thread_pool->thread_count (), 1);
std::vector<gdb::future<void>> results;
-
- /* The next item to hand out. */
- std::atomic<RandomIt> next = first;
+ work_queue<RandomIt, batch_size> queue (first, last);
/* The worker thread task.
@@ -75,49 +72,31 @@ parallel_for_each (const RandomIt first, const RandomIt last,
and `args` can be used as-is in the lambda. */
auto args_tuple
= std::forward_as_tuple (std::forward<WorkerArgs> (worker_args)...);
- auto task =
- [&next, first, last, n_worker_threads, &args_tuple] () {
+ auto task = [&queue, first, &args_tuple] () {
/* Instantiate the user-defined worker. */
auto worker = std::make_from_tuple<Worker> (args_tuple);
for (;;)
{
- /* Grab a snapshot of NEXT. */
- auto local_next = next.load ();
- gdb_assert (local_next <= last);
+ const auto [batch_first, batch_last] = queue.pop_batch ();
- /* Number of remaining items. */
- auto n_remaining = last - local_next;
- gdb_assert (n_remaining >= 0);
-
- /* Are we done? */
- if (n_remaining == 0)
+ if (batch_first == batch_last)
break;
- const auto this_batch_size
- = std::min<std::size_t> (batch_size, n_remaining);
-
- /* The range to process in this iteration. */
- const auto this_batch_first = local_next;
- const auto this_batch_last = local_next + this_batch_size;
-
- /* Update NEXT. If the current value of NEXT doesn't match
- LOCAL_NEXT, it means another thread updated it concurrently,
- restart. */
- if (!next.compare_exchange_weak (local_next, this_batch_last))
- continue;
-
if (parallel_for_each_debug)
debug_printf ("Processing %zu items, range [%zu, %zu[\n",
- this_batch_size,
- static_cast<size_t> (this_batch_first - first),
- static_cast<size_t> (this_batch_last - first));
+ static_cast<std::size_t> (batch_last - batch_first),
+ static_cast<std::size_t> (batch_first - first),
+ static_cast<std::size_t> (batch_last - first));
- worker (this_batch_first, this_batch_last);
+ worker (batch_first, batch_last);
}
};
/* Start N_WORKER_THREADS tasks. */
+ const size_t n_worker_threads
+ = std::max<size_t> (thread_pool::g_thread_pool->thread_count (), 1);
+
for (int i = 0; i < n_worker_threads; ++i)
results.push_back (gdb::thread_pool::g_thread_pool->post_task (task));
diff --git a/gdbsupport/work-queue.h b/gdbsupport/work-queue.h
new file mode 100644
index 000000000000..f84b29a4f523
--- /dev/null
+++ b/gdbsupport/work-queue.h
@@ -0,0 +1,95 @@
+/* Synchronized work queue.
+
+ Copyright (C) 2025 Free Software Foundation, Inc.
+
+ This file is part of GDB.
+
+ This program is free software; you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 3 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program. If not, see <http://www.gnu.org/licenses/>. */
+
+#ifndef GDBSUPPORT_WORK_QUEUE_H
+#define GDBSUPPORT_WORK_QUEUE_H
+
+namespace gdb
+{
+
+/* Implementation of a thread-safe work queue.
+
+ The work items are specified by two iterators of type RandomIt.
+
+ BATCH_SIZE is the number of work items to pop in a batch. */
+
+template<typename RandomIt, std::size_t batch_size>
+class work_queue
+{
+public:
+ /* The work items are specified by the range `[first, last[`. */
+ work_queue (RandomIt first, RandomIt last) noexcept
+ : m_next (first),
+ m_last (last)
+ {
+ gdb_assert (first <= last);
+ }
+
+ DISABLE_COPY_AND_ASSIGN (work_queue);
+
+ /* Pop a batch of work items.
+
+ The return value is a pair containing iterators to the first and last
+ (exclusive) work items. */
+ std::pair<RandomIt, RandomIt> pop_batch () noexcept
+ {
+ for (;;)
+ {
+ /* Grab a snapshot of M_NEXT. */
+ auto next = m_next.load ();
+ gdb_assert (next <= m_last);
+
+ /* The number of items remaining in the queue. */
+ const auto n_remaining = static_cast<std::size_t> (m_last - next);
+
+ /* Are we done? */
+ if (n_remaining == 0)
+ return std::make_pair (m_last, m_last);
+
+ /* The batch size is proportional to the number of items remaining in
+ the queue. We do this to try to strike a balance, avoiding
+ synchronization overhead when there are many items to process at the
+ start, and avoiding workload imbalance when there are few items to
+ process at the end. */
+ const auto this_batch_size = std::min (batch_size, n_remaining);
+
+ /* The range of items in this batch. */
+ const auto this_batch_first = next;
+ const auto this_batch_last = next + this_batch_size;
+
+ /* Update M_NEXT. If the current value of M_NEXT doesn't match NEXT, it
+ means another thread updated it concurrently, restart. */
+ if (!m_next.compare_exchange_weak (next, this_batch_last))
+ continue;
+
+ return std::make_pair (this_batch_first, this_batch_last);
+ }
+ }
+
+private:
+ /* The next work item to hand out. */
+ std::atomic<RandomIt> m_next;
+
+ /* The end of the work item range. */
+ RandomIt m_last;
+};
+
+} /* namespace gdb */
+
+#endif /* GDBSUPPORT_WORK_QUEUE_H */
--
2.50.0
* [PATCH v2 3/4] gdbsupport: add async parallel_for_each version
From: simon.marchi @ 2025-07-03 20:01 UTC (permalink / raw)
To: gdb-patches; +Cc: Simon Marchi
From: Simon Marchi <simon.marchi@polymtl.ca>
New in v2: make the global variable constexpr.
I would like to use gdb::parallel_for_each to implement the parallelism
of the DWARF unit indexing. However, the existing implementation of
gdb::parallel_for_each is blocking, which doesn't work with the model
used by the DWARF indexer, which is asynchronous and callback-based.
Add an asynchronous version of gdb::parallel_for_each that will be
suitable for this task.
This new version accepts a callback that is invoked when the parallel
for each is complete.
This function uses the same strategy as gdb::task_group to invoke the
"done" callback: worker threads have a shared_ptr reference to some
object. The last worker thread to drop its reference causes the object
to be deleted, which invokes the callback.
Unlike for the sync version of gdb::parallel_for_each, it's not possible
to keep any state in the calling thread's stack, because that disappears
immediately after starting the workers. So all the state is kept in
that same shared object.
There is a limitation that the sync version doesn't have, regarding the
arguments that can be passed to the worker objects: it's not possible
to rely on references. There are more details in a comment in the code.
It would be possible to implement the sync version of
gdb::parallel_for_each on top of the async version, but I decided not to
do it to avoid the unnecessary dynamic allocation of the shared object,
and to avoid adding the limitations on passing references I mentioned
just above. But if we judge that it would be an acceptable cost to
avoid the duplication, we could do it.
Add a self test for the new function.
Change-Id: I6173defb1e09856d137c1aa05ad51cbf521ea0b0
---
gdb/unittests/parallel-for-selftests.c | 23 ++++
gdbsupport/parallel-for.h | 141 ++++++++++++++++++++++++-
gdbsupport/work-queue.h | 2 +-
3 files changed, 160 insertions(+), 6 deletions(-)
diff --git a/gdb/unittests/parallel-for-selftests.c b/gdb/unittests/parallel-for-selftests.c
index 86bf06c073a6..883d95299607 100644
--- a/gdb/unittests/parallel-for-selftests.c
+++ b/gdb/unittests/parallel-for-selftests.c
@@ -111,8 +111,31 @@ test_parallel_for_each ()
const std::vector<do_foreach_t> for_each_functions
{
+ /* Test gdb::parallel_for_each. */
[] (int start, int end, foreach_callback_t callback)
{ gdb::parallel_for_each<1, int, test_worker> (start, end, callback, 0); },
+
+ /* Test gdb::parallel_for_each_async. */
+ [] (int start, int end, foreach_callback_t callback)
+ {
+ bool done_flag = false;
+ std::condition_variable cv;
+ std::mutex mtx;
+
+ gdb::parallel_for_each_async<1, int, test_worker> (start, end,
+ [&mtx, &done_flag, &cv] ()
+ {
+ std::lock_guard<std::mutex> lock (mtx);
+ done_flag = true;
+ cv.notify_one();
+ }, callback, 0);
+
+ /* Wait for the async parallel-for to complete. */
+ std::unique_lock<std::mutex> lock (mtx);
+ cv.wait (lock, [&done_flag] () { return done_flag; });
+ },
+
+ /* Test gdb::sequential_for_each. */
[] (int start, int end, foreach_callback_t callback)
{ gdb::sequential_for_each<int, test_worker> (start, end, callback, 0); },
};
diff --git a/gdbsupport/parallel-for.h b/gdbsupport/parallel-for.h
index c5a15d3235a7..4d2c3e76aab4 100644
--- a/gdbsupport/parallel-for.h
+++ b/gdbsupport/parallel-for.h
@@ -28,6 +28,10 @@
namespace gdb
{
+/* If enabled, print debug info about the inner workings of the parallel for
+ each functions. */
+constexpr bool parallel_for_each_debug = false;
+
/* A "parallel-for" implementation using a shared work queue. Work items get
popped in batches of size up to BATCH_SIZE from the queue and handed out to
worker threads.
@@ -37,7 +41,10 @@ namespace gdb
thread state.
Worker threads call Worker::operator() repeatedly until the queue is
- empty. */
+ empty.
+
+ This function is synchronous, meaning that it blocks and returns once the
+ processing is complete. */
template<std::size_t batch_size, class RandomIt, class Worker,
class... WorkerArgs>
@@ -45,10 +52,6 @@ void
parallel_for_each (const RandomIt first, const RandomIt last,
WorkerArgs &&...worker_args)
{
- /* If enabled, print debug info about how the work is distributed across
- the threads. */
- const bool parallel_for_each_debug = false;
-
gdb_assert (first <= last);
if (parallel_for_each_debug)
@@ -116,6 +119,134 @@ sequential_for_each (RandomIt first, RandomIt last, WorkerArgs &&...worker_args)
Worker (std::forward<WorkerArgs> (worker_args)...) (first, last);
}
+namespace detail
+{
+
+/* Type to hold the state shared between threads of
+ gdb::parallel_for_each_async. */
+
+template<std::size_t min_batch_size, typename RandomIt, typename... WorkerArgs>
+struct pfea_state
+{
+ pfea_state (RandomIt first, RandomIt last, std::function<void ()> &&done,
+ WorkerArgs &&...worker_args)
+ : first (first),
+ last (last),
+ worker_args_tuple (std::forward_as_tuple
+ (std::forward<WorkerArgs> (worker_args)...)),
+ queue (first, last),
+ m_done (std::move (done))
+ {}
+
+ DISABLE_COPY_AND_ASSIGN (pfea_state);
+
+ /* This gets called by the last worker thread that drops its reference on
+ the shared state, thus when the processing is complete. */
+ ~pfea_state ()
+ {
+ if (m_done)
+ m_done ();
+ }
+
+ /* The interval to process. */
+ const RandomIt first, last;
+
+ /* Tuple of arguments to pass when constructing the user's worker object.
+
+ Use std::decay_t to avoid storing references to the caller's local
+ variables. If we didn't use it and the caller passed an lvalue `foo *`,
+ we would store it as a reference to `foo *`, thus storing a reference to
+ the caller's local variable.
+
+ The downside is that it's not possible to pass arguments by reference,
+ callers need to pass pointers or std::reference_wrappers. */
+ std::tuple<std::decay_t<WorkerArgs>...> worker_args_tuple;
+
+ /* Work queue that worker threads pull work items from. */
+ work_queue<RandomIt, min_batch_size> queue;
+
+private:
+ /* Callable called when the parallel-for is done. */
+ std::function<void ()> m_done;
+};
+
+} /* namespace detail */
+
+/* A "parallel-for" implementation using a shared work queue. Work items get
+ popped in batches from the queue and handed out to worker threads.
+
+ Batch sizes are proportional to the number of remaining items in the queue,
+ but always greater or equal to MIN_BATCH_SIZE.
+
+ The DONE callback is invoked when processing is done.
+
+ Each worker thread instantiates an object of type Worker, forwarding ARGS to
+ its constructor. The Worker object can be used to keep some per-worker
+ thread state. This version does not support passing references as arguments
+ to the worker. Use std::reference_wrapper or pointers instead.
+
+ Worker threads call Worker::operator() repeatedly until the queue is
+ empty.
+
+ This function is asynchronous. An arbitrary worker thread will call the DONE
+ callback when processing is done. */
+
+template<std::size_t min_batch_size, class RandomIt, class Worker,
+ class... WorkerArgs>
+void
+parallel_for_each_async (const RandomIt first, const RandomIt last,
+ std::function<void ()> &&done,
+ WorkerArgs &&...worker_args)
+{
+ gdb_assert (first <= last);
+
+ if (parallel_for_each_debug)
+ {
+ debug_printf ("Parallel for: n elements: %zu\n",
+ static_cast<std::size_t> (last - first));
+ debug_printf ("Parallel for: min batch size: %zu\n", min_batch_size);
+ }
+
+ const size_t n_worker_threads
+ = std::max<size_t> (thread_pool::g_thread_pool->thread_count (), 1);
+
+ /* The state shared between all worker threads. All worker threads get a
+ reference on the shared pointer through the lambda below. The last worker
+ thread to drop its reference will cause this object to be destroyed, which
+ will call the DONE callback. */
+ using state_t = detail::pfea_state<min_batch_size, RandomIt, WorkerArgs...>;
+ auto state
+ = std::make_shared<state_t> (first, last, std::move (done),
+ std::forward<WorkerArgs> (worker_args)...);
+
+ /* The worker thread task. */
+ auto task = [state] ()
+ {
+ /* Instantiate the user-defined worker. */
+ auto worker = std::make_from_tuple<Worker> (state->worker_args_tuple);
+
+ for (;;)
+ {
+ const auto [batch_first, batch_last] = state->queue.pop_batch ();
+
+ if (batch_first == batch_last)
+ break;
+
+ if (parallel_for_each_debug)
+ debug_printf ("Processing %zu items, range [%zu, %zu[\n",
+ static_cast<std::size_t> (batch_last - batch_first),
+ static_cast<std::size_t> (batch_first - state->first),
+ static_cast<std::size_t> (batch_last - state->first));
+
+ worker (batch_first, batch_last);
+ }
+ };
+
+ /* Start N_WORKER_THREADS tasks. */
+ for (int i = 0; i < n_worker_threads; ++i)
+ gdb::thread_pool::g_thread_pool->post_task (task);
+}
+
}
#endif /* GDBSUPPORT_PARALLEL_FOR_H */
diff --git a/gdbsupport/work-queue.h b/gdbsupport/work-queue.h
index f84b29a4f523..4b69ee83abd7 100644
--- a/gdbsupport/work-queue.h
+++ b/gdbsupport/work-queue.h
@@ -34,7 +34,7 @@ class work_queue
{
public:
/* The work items are specified by the range `[first, last[`. */
- work_queue (RandomIt first, RandomIt last) noexcept
+ work_queue (const RandomIt first, const RandomIt last) noexcept
: m_next (first),
m_last (last)
{
--
2.50.0
^ permalink raw reply [flat|nested] 18+ messages in thread
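The batching policy this patch describes (batch sizes proportional to the
remaining work, with a fixed floor) can be sketched as follows. This is
an illustrative standalone sketch, not the gdbsupport code: the class
name, the proportionality divisor, and the explicit worker count are all
assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

/* Sketch of a shared work queue with dynamic batch sizes: each pop takes
   a number of items proportional to what remains, but never less than
   MIN_BATCH_SIZE (and never more than what is left).  */
template<typename It, std::size_t min_batch_size>
class sketch_work_queue
{
public:
  sketch_work_queue (It first, It last, std::size_t n_workers)
    : m_next (first), m_last (last), m_n_workers (n_workers)
  {}

  /* Pop the next batch; returns an empty range once the queue is empty.  */
  std::pair<It, It> pop_batch ()
  {
    std::lock_guard<std::mutex> lock (m_mutex);
    std::size_t remaining = m_last - m_next;
    /* Aim for several more passes per worker; clamp to the floor, then
       to the number of items actually remaining.  */
    std::size_t size
      = std::max<std::size_t> (min_batch_size,
			       remaining / (m_n_workers * 4));
    size = std::min (size, remaining);
    It first = m_next;
    m_next += size;
    return { first, m_next };
  }

private:
  std::mutex m_mutex;
  It m_next;
  const It m_last;
  const std::size_t m_n_workers;
};

/* Drain a queue of N items single-threaded; return {items seen, batches}.  */
inline std::pair<std::size_t, std::size_t>
drain_demo (std::size_t n)
{
  std::vector<int> items (n);
  sketch_work_queue<int *, 8> queue (items.data (), items.data () + n, 4);

  std::size_t total = 0, batches = 0;
  for (;;)
    {
      auto [first, last] = queue.pop_batch ();
      if (first == last)
	break;
      total += last - first;
      ++batches;
    }
  return { total, batches };
}
```

With 100 items, a floor of 8 and 4 workers, the proportional term stays
below the floor, so the queue hands out twelve batches of 8 and one final
batch of 4; larger inputs would start with proportionally larger batches.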
* [PATCH v2 4/4] gdb/dwarf: use dynamic partitioning for DWARF CU indexing
2025-07-03 20:01 [PATCH v2 1/4] gdbsupport: use dynamic partitioning in gdb::parallel_for_each simon.marchi
2025-07-03 20:01 ` [PATCH v2 2/4] gdbsupport: factor out work queue from parallel-for.h simon.marchi
2025-07-03 20:01 ` [PATCH v2 3/4] gdbsupport: add async parallel_for_each version simon.marchi
@ 2025-07-03 20:01 ` simon.marchi
2025-09-18 19:26 ` Tom Tromey
2025-08-05 17:01 ` [PATCH v2 1/4] gdbsupport: use dynamic partitioning in gdb::parallel_for_each Simon Marchi
2025-09-18 18:59 ` Tom Tromey
4 siblings, 1 reply; 18+ messages in thread
From: simon.marchi @ 2025-07-03 20:01 UTC (permalink / raw)
To: gdb-patches; +Cc: Simon Marchi
From: Simon Marchi <simon.marchi@efficios.com>
The DWARF indexer splits the work statically based on the unit sizes,
attempting to give each worker thread about the same number of bytes to
process. This works relatively well with standard compilation. But
when compiling with DWO files (-gsplit-dwarf), it's not as good. I see
this when loading a relatively big program (telegram-desktop, which
includes a lot of static dependencies) compiled with -gsplit-dwarf:
Time for "DWARF indexing worker": wall 0.000, user 0.000, sys 0.000, user+sys 0.000, -nan % CPU
Time for "DWARF indexing worker": wall 0.001, user 0.000, sys 0.000, user+sys 0.000, 0.0 % CPU
Time for "DWARF indexing worker": wall 0.001, user 0.001, sys 0.000, user+sys 0.001, 100.0 % CPU
Time for "DWARF indexing worker": wall 0.748, user 0.284, sys 0.297, user+sys 0.581, 77.7 % CPU
Time for "DWARF indexing worker": wall 0.818, user 0.408, sys 0.262, user+sys 0.670, 81.9 % CPU
Time for "DWARF indexing worker": wall 1.196, user 0.580, sys 0.402, user+sys 0.982, 82.1 % CPU
Time for "DWARF indexing worker": wall 1.250, user 0.511, sys 0.500, user+sys 1.011, 80.9 % CPU
Time for "DWARF indexing worker": wall 7.730, user 5.891, sys 1.729, user+sys 7.620, 98.6 % CPU
Note how the wall times vary from 0 to 7.7 seconds. This is
undesirable, because the indexing step as a whole takes as long as the
slowest worker thread.
The imbalance in this step also causes imbalance in the following
"finalize" step:
Time for "DWARF finalize worker": wall 0.007, user 0.004, sys 0.002, user+sys 0.006, 85.7 % CPU
Time for "DWARF finalize worker": wall 0.012, user 0.005, sys 0.005, user+sys 0.010, 83.3 % CPU
Time for "DWARF finalize worker": wall 0.015, user 0.010, sys 0.004, user+sys 0.014, 93.3 % CPU
Time for "DWARF finalize worker": wall 0.389, user 0.359, sys 0.029, user+sys 0.388, 99.7 % CPU
Time for "DWARF finalize worker": wall 0.680, user 0.644, sys 0.035, user+sys 0.679, 99.9 % CPU
Time for "DWARF finalize worker": wall 0.929, user 0.907, sys 0.020, user+sys 0.927, 99.8 % CPU
Time for "DWARF finalize worker": wall 1.093, user 1.055, sys 0.037, user+sys 1.092, 99.9 % CPU
Time for "DWARF finalize worker": wall 2.016, user 1.934, sys 0.082, user+sys 2.016, 100.0 % CPU
Time for "DWARF finalize worker": wall 25.882, user 25.471, sys 0.404, user+sys 25.875, 100.0 % CPU
With DWO files, the split of the workload by size doesn't work, because
it is done using the size of the skeleton units in the main file, which
is not representative of how much DWARF is contained in each DWO file.
I haven't tried it, but a similar problem could occur with cross-unit
imports, which can happen with dwz or LTO. You could have a small unit
that imports a lot from other units, in which case the size of the units
is not representative of the work to accomplish.
To try to improve this situation, change the DWARF indexer to use
dynamic partitioning, using gdb::parallel_for_each_async. With this,
each worker thread pops one unit at a time from a shared work queue to
process it, until the queue is empty. That should result in a more
balanced workload split. I chose 1 as the minimum batch size here,
because I judged that indexing one CU was a big enough piece of work
compared to the synchronization overhead of the queue. That can always
be tweaked later if someone wants to do more tests.
As a result, the timings are much more balanced:
Time for "DWARF indexing worker": wall 2.325, user 1.033, sys 0.573, user+sys 1.606, 69.1 % CPU
Time for "DWARF indexing worker": wall 2.326, user 1.028, sys 0.568, user+sys 1.596, 68.6 % CPU
Time for "DWARF indexing worker": wall 2.326, user 1.068, sys 0.513, user+sys 1.581, 68.0 % CPU
Time for "DWARF indexing worker": wall 2.326, user 1.005, sys 0.579, user+sys 1.584, 68.1 % CPU
Time for "DWARF indexing worker": wall 2.326, user 1.070, sys 0.516, user+sys 1.586, 68.2 % CPU
Time for "DWARF indexing worker": wall 2.326, user 1.063, sys 0.584, user+sys 1.647, 70.8 % CPU
Time for "DWARF indexing worker": wall 2.326, user 1.049, sys 0.550, user+sys 1.599, 68.7 % CPU
Time for "DWARF indexing worker": wall 2.328, user 1.058, sys 0.541, user+sys 1.599, 68.7 % CPU
...
Time for "DWARF finalize worker": wall 2.833, user 2.791, sys 0.040, user+sys 2.831, 99.9 % CPU
Time for "DWARF finalize worker": wall 2.939, user 2.896, sys 0.043, user+sys 2.939, 100.0 % CPU
Time for "DWARF finalize worker": wall 3.016, user 2.969, sys 0.046, user+sys 3.015, 100.0 % CPU
Time for "DWARF finalize worker": wall 3.076, user 2.957, sys 0.118, user+sys 3.075, 100.0 % CPU
Time for "DWARF finalize worker": wall 3.159, user 3.054, sys 0.104, user+sys 3.158, 100.0 % CPU
Time for "DWARF finalize worker": wall 3.198, user 3.082, sys 0.114, user+sys 3.196, 99.9 % CPU
Time for "DWARF finalize worker": wall 3.197, user 3.076, sys 0.121, user+sys 3.197, 100.0 % CPU
Time for "DWARF finalize worker": wall 3.268, user 3.136, sys 0.131, user+sys 3.267, 100.0 % CPU
Time for "DWARF finalize worker": wall 1.907, user 1.810, sys 0.096, user+sys 1.906, 99.9 % CPU
In absolute terms, the total time for GDB to load the file and exit goes
down from about 42 seconds to 17 seconds.
Some implementation notes:
- The state previously kept as local variables in
cooked_index_worker_debug_info::process_units becomes fields of the
new parallel worker object.
- The work previously done for each unit in
cooked_index_worker_debug_info::process_units becomes the operator()
of the new parallel worker object.
- The work previously done at the end of
cooked_index_worker_debug_info::process_units (including calling
bfd_thread_cleanup) becomes the destructor of the new parallel worker
object.
- The "done" callback of gdb::task_group becomes the "done" callback of
gdb::parallel_for_each_async.
- I placed the parallel_indexing_worker struct inside
cooked_index_worker_debug_info, so that it has access to
cooked_index_worker_debug_info's private fields (e.g. m_results, to push
the results). It will also be possible to re-use it for skeletonless
type units in a later patch.
Change-Id: I5dc5cf8793abe9ebe2659e78da38ffc94289e5f2
---
gdb/dwarf2/cooked-index-worker.h | 6 ++
gdb/dwarf2/read.c | 147 +++++++++++++------------------
2 files changed, 65 insertions(+), 88 deletions(-)
diff --git a/gdb/dwarf2/cooked-index-worker.h b/gdb/dwarf2/cooked-index-worker.h
index 8b9766cddcbf..5c282c138506 100644
--- a/gdb/dwarf2/cooked-index-worker.h
+++ b/gdb/dwarf2/cooked-index-worker.h
@@ -283,8 +283,14 @@ class cooked_index_worker
/* The per-objfile object. */
dwarf2_per_objfile *m_per_objfile;
+
/* Result of each worker task. */
std::vector<cooked_index_worker_result> m_results;
+
+ /* Mutex to synchronize access to M_RESULTS when workers append their
+ result. */
+ std::mutex m_results_mutex;
+
/* Any warnings emitted. For the time being at least, this only
needed in do_reading, not in every worker. Note that
deferred_warnings uses gdb_stderr in its constructor, and this
diff --git a/gdb/dwarf2/read.c b/gdb/dwarf2/read.c
index 5e18e4520610..55b736776227 100644
--- a/gdb/dwarf2/read.c
+++ b/gdb/dwarf2/read.c
@@ -51,6 +51,7 @@
#include "elf-bfd.h"
#include "event-top.h"
#include "exceptions.h"
+#include "gdbsupport/parallel-for.h"
#include "gdbsupport/task-group.h"
#include "maint.h"
#include "symtab.h"
@@ -3206,6 +3207,59 @@ class cooked_index_worker_debug_info : public cooked_index_worker
}
private:
+ /* The task for parallel workers that index units. */
+ struct parallel_indexing_worker
+ {
+ explicit parallel_indexing_worker (const char *step_name,
+ cooked_index_worker_debug_info *parent)
+ : m_scoped_time_it (step_name, parent->m_per_command_time),
+ m_parent (parent)
+ {
+ }
+
+ DISABLE_COPY_AND_ASSIGN (parallel_indexing_worker);
+
+ ~parallel_indexing_worker ()
+ {
+ bfd_thread_cleanup ();
+
+ m_thread_storage.done_reading (m_complaint_handler.release ());
+
+ /* Append the results of this worker to the parent instance. */
+ std::lock_guard<std::mutex> lock (m_parent->m_results_mutex);
+ m_parent->m_results.emplace_back (std::move (m_thread_storage));
+ }
+
+ void operator() (dwarf2_per_cu_up *first, dwarf2_per_cu_up *last)
+ {
+ for (auto &it : iterator_range (first, last))
+ this->process_one (*it);
+ }
+
+ private:
+ void process_one (dwarf2_per_cu &unit)
+ {
+ m_thread_storage.catch_error ([&] ()
+ {
+ m_parent->process_unit (&unit, m_parent->m_per_objfile,
+ &m_thread_storage);
+ });
+ }
+
+ /* Measures the execution time of this worker. */
+ scoped_time_it m_scoped_time_it;
+
+ /* Delayed complaints and errors recorded while indexing units. */
+ complaint_interceptor m_complaint_handler;
+ std::vector<gdb_exception> m_errors;
+
+ /* Index storage for this worker. */
+ cooked_index_worker_result m_thread_storage;
+
+ /* The instance that spawned this worker. */
+ cooked_index_worker_debug_info *m_parent;
+ };
+
void do_reading () override;
/* Print collected type unit statistics. */
@@ -3247,12 +3301,6 @@ class cooked_index_worker_debug_info : public cooked_index_worker
/* An iterator for the comp units. */
using unit_iterator = std::vector<dwarf2_per_cu_up>::iterator;
- /* Process a batch of CUs. This may be called multiple times in
- separate threads. TASK_NUMBER indicates which task this is --
- the result is stored in that slot of M_RESULTS. */
- void process_units (size_t task_number, unit_iterator first,
- unit_iterator end);
-
/* Process unit THIS_CU. */
void process_unit (dwarf2_per_cu *this_cu, dwarf2_per_objfile *per_objfile,
cooked_index_worker_result *storage);
@@ -3476,32 +3524,6 @@ cooked_index_worker_debug_info::process_skeletonless_type_units
});
}
-void
-cooked_index_worker_debug_info::process_units (size_t task_number,
- unit_iterator first,
- unit_iterator end)
-{
- SCOPE_EXIT { bfd_thread_cleanup (); };
-
- /* Ensure that complaints are handled correctly. */
- complaint_interceptor complaint_handler;
-
- std::vector<gdb_exception> errors;
- cooked_index_worker_result thread_storage;
- for (auto inner = first; inner != end; ++inner)
- {
- dwarf2_per_cu *per_cu = inner->get ();
-
- thread_storage.catch_error ([&] ()
- {
- process_unit (per_cu, m_per_objfile, &thread_storage);
- });
- }
-
- thread_storage.done_reading (complaint_handler.release ());
- m_results[task_number] = std::move (thread_storage);
-}
-
void
cooked_index_worker_debug_info::done_reading ()
{
@@ -3527,62 +3549,11 @@ cooked_index_worker_debug_info::do_reading ()
m_index_storage.get_addrmap (),
&m_warnings);
- /* We want to balance the load between the worker threads. This is
- done by using the size of each CU as a rough estimate of how
- difficult it will be to operate on. This isn't ideal -- for
- example if dwz is used, the early CUs will all tend to be
- "included" and won't be parsed independently. However, this
- heuristic works well for typical compiler output. */
-
- size_t total_size = 0;
- for (const auto &per_cu : per_bfd->all_units)
- total_size += per_cu->length ();
-
- /* How many worker threads we plan to use. We may not actually use
- this many. We use 1 as the minimum to avoid division by zero,
- and anyway in the N==0 case the work will be done
- synchronously. */
- const size_t n_worker_threads
- = std::max (gdb::thread_pool::g_thread_pool->thread_count (), (size_t) 1);
-
- /* How much effort should be put into each worker. */
- const size_t size_per_thread
- = std::max (total_size / n_worker_threads, (size_t) 1);
-
- /* Work is done in a task group. */
- gdb::task_group workers ([this] ()
- {
- this->done_reading ();
- });
-
- auto end = per_bfd->all_units.end ();
- size_t task_count = 0;
- for (auto iter = per_bfd->all_units.begin (); iter != end; )
- {
- auto last = iter;
- /* Put all remaining CUs into the last task. */
- if (task_count == n_worker_threads - 1)
- last = end;
- else
- {
- size_t chunk_size = 0;
- for (; last != end && chunk_size < size_per_thread; ++last)
- chunk_size += (*last)->length ();
- }
-
- gdb_assert (iter != last);
- workers.add_task ([this, task_count, iter, last] ()
- {
- scoped_time_it time_it ("DWARF indexing worker", m_per_command_time);
- process_units (task_count, iter, last);
- });
-
- ++task_count;
- iter = last;
- }
-
- m_results.resize (task_count);
- workers.start ();
+ /* Launch parallel tasks to index units. */
+ gdb::parallel_for_each_async<1, dwarf2_per_cu_up *, parallel_indexing_worker>
+ (per_bfd->all_units.data (),
+ per_bfd->all_units.data () + per_bfd->all_units.size (),
+ [this] () { this->done_reading (); }, "DWARF indexing worker", this);
}
static void
--
2.50.0
^ permalink raw reply [flat|nested] 18+ messages in thread
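The per-worker-object lifecycle used by parallel_indexing_worker above
(construct once per thread, operator() once per batch, destructor
publishes the thread's result under a mutex) can be sketched in
isolation. All names here (demo_results, demo_worker, demo_run) are
hypothetical, and the fixed two-thread split stands in for the thread
pool and work queue:

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

/* Shared destination that workers publish into from their destructors.  */
struct demo_results
{
  std::mutex mutex;
  std::vector<long> per_worker_sums;
};

class demo_worker
{
public:
  explicit demo_worker (demo_results *results) : m_results (results) {}

  /* Publish this worker's accumulated result exactly once, when the
     worker object goes out of scope at the end of the thread's task.  */
  ~demo_worker ()
  {
    std::lock_guard<std::mutex> lock (m_results->mutex);
    m_results->per_worker_sums.push_back (m_sum);
  }

  /* Called once per batch; accumulates into per-worker state.  */
  void operator() (const int *first, const int *last)
  {
    for (const int *it = first; it != last; ++it)
      m_sum += *it;
  }

private:
  demo_results *m_results;
  long m_sum = 0;
};

/* Run one worker per thread over disjoint halves of DATA; return the
   grand total once both workers have published their results.  */
inline long
demo_run (const std::vector<int> &data)
{
  demo_results results;
  std::size_t half = data.size () / 2;

  auto task = [&results] (const int *first, const int *last)
    {
      demo_worker worker (&results);
      worker (first, last);
      /* WORKER's destructor runs here, appending to RESULTS.  */
    };

  std::thread t1 (task, data.data (), data.data () + half);
  std::thread t2 (task, data.data () + half, data.data () + data.size ());
  t1.join ();
  t2.join ();

  long total = 0;
  for (long s : results.per_worker_sums)
    total += s;
  return total;
}
```

The same shape accommodates the patch's scoped_time_it member naturally:
as a field of the worker, it is constructed when the thread starts its
task and destroyed (reporting the elapsed time) when the task ends,
regardless of how many batches the thread processed.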
* Re: [PATCH v2 1/4] gdbsupport: use dynamic partitioning in gdb::parallel_for_each
2025-07-03 20:01 [PATCH v2 1/4] gdbsupport: use dynamic partitioning in gdb::parallel_for_each simon.marchi
` (2 preceding siblings ...)
2025-07-03 20:01 ` [PATCH v2 4/4] gdb/dwarf: use dynamic partitioning for DWARF CU indexing simon.marchi
@ 2025-08-05 17:01 ` Simon Marchi
2025-08-28 17:00 ` Simon Marchi
2025-09-18 18:59 ` Tom Tromey
4 siblings, 1 reply; 18+ messages in thread
From: Simon Marchi @ 2025-08-05 17:01 UTC (permalink / raw)
To: gdb-patches
Ping.
On 7/3/25 4:01 PM, simon.marchi@polymtl.ca wrote:
> From: Simon Marchi <simon.marchi@polymtl.ca>
>
> gdb::parallel_for_each uses static partitioning of the workload, meaning
> that each worker thread receives a similar number of work items. Change
> it to use dynamic partitioning, where worker threads pull work items
> from a shared work queue when they need to.
>
> Note that gdb::parallel_for_each is currently only used for processing
> minimal symbols in GDB. I am looking at improving the startup
> performance of GDB, where minimal symbol processing is one step.
>
> With static partitioning, there is a risk of workload imbalance if some
> threads receive "easier" work than others. Some threads sit still while
> others finish working on their share of the work. This is not
> desirable, because the gdb::parallel_for_each takes as long as the
> slowest thread takes.
>
> When loading a file with a lot of minimal symbols (~600k) in GDB, with
> "maint set per-command time on", I observe some imbalance:
>
> Time for "minsyms install worker": wall 0.732, user 0.550, sys 0.041, user+sys 0.591, 80.7 % CPU
> Time for "minsyms install worker": wall 0.881, user 0.722, sys 0.071, user+sys 0.793, 90.0 % CPU
> Time for "minsyms install worker": wall 2.107, user 1.804, sys 0.147, user+sys 1.951, 92.6 % CPU
> Time for "minsyms install worker": wall 2.351, user 2.003, sys 0.151, user+sys 2.154, 91.6 % CPU
> Time for "minsyms install worker": wall 2.611, user 2.322, sys 0.235, user+sys 2.557, 97.9 % CPU
> Time for "minsyms install worker": wall 3.074, user 2.729, sys 0.203, user+sys 2.932, 95.4 % CPU
> Time for "minsyms install worker": wall 3.486, user 3.074, sys 0.260, user+sys 3.334, 95.6 % CPU
> Time for "minsyms install worker": wall 3.927, user 3.475, sys 0.336, user+sys 3.811, 97.0 % CPU
> ^
> ----´
>
> The fastest thread took 0.732 seconds to complete its work (and then sat
> still), while the slowest took 3.927 seconds. This means the
> parallel_for_each took a bit less than 4 seconds.
>
> Even if the number of minimal symbols assigned to each worker is the
> same, I suppose that some symbols (e.g. those that need demangling) take
> longer to process, which could explain the imbalance.
>
> With this patch, things are much more balanced:
>
> Time for "minsym install worker": wall 2.807, user 2.222, sys 0.144, user+sys 2.366, 84.3 % CPU
> Time for "minsym install worker": wall 2.808, user 2.073, sys 0.131, user+sys 2.204, 78.5 % CPU
> Time for "minsym install worker": wall 2.804, user 1.994, sys 0.151, user+sys 2.145, 76.5 % CPU
> Time for "minsym install worker": wall 2.808, user 1.977, sys 0.135, user+sys 2.112, 75.2 % CPU
> Time for "minsym install worker": wall 2.808, user 2.061, sys 0.142, user+sys 2.203, 78.5 % CPU
> Time for "minsym install worker": wall 2.809, user 2.012, sys 0.146, user+sys 2.158, 76.8 % CPU
> Time for "minsym install worker": wall 2.809, user 2.178, sys 0.137, user+sys 2.315, 82.4 % CPU
> Time for "minsym install worker": wall 2.820, user 2.141, sys 0.170, user+sys 2.311, 82.0 % CPU
> ^
> ----´
>
> In this version, the parallel_for_each took about 2.8 seconds,
> representing a reduction of ~1.2 seconds for this step. Not
> life-changing, but it's still good I think.
>
> Note that this patch helps when loading big programs. My go-to test
> program for this is telegram-desktop that I built from source. For
> small programs (including loading gdb itself), it makes no perceptible
> difference.
>
> Now the technical bits:
>
> - One impact that this change has on the minimal symbol processing
> specifically is that not all calls to compute_and_set_names (a
> critical region guarded by a mutex) are done at the end of each
> worker thread's task anymore.
>
> Before this patch, each thread would compute the names and hash values for
> all the minimal symbols it has been assigned, and then would call
> compute_and_set_names for all of them, while holding the mutex (thus
> preventing other threads from doing this same step).
>
> With the shared work queue approach, each thread grabs a batch of
> minimal symbols, computes the names and hash values for them, and
> then calls compute_and_set_names (with the mutex held) for this batch
> only. It then repeats that until the work queue is empty.
>
> There are therefore more small and spread out compute_and_set_names
> critical sections, instead of just one per worker thread at the end.
> Given that before this patch the work was not well balanced among worker
> threads, I guess that threads would enter that critical region at
> roughly different times, causing little contention.
>
> In the "with this patch" results, the CPU utilization numbers are not
> as good, suggesting that there is some contention. But I don't know
> if it's contention due to the compute_and_set_names critical section
> or the shared work queue critical section. That can be investigated
> later. In any case, what ultimately counts is the wall time, which
> improves.
>
> - One choice I had to make was to decide how many work items (in this
> case minimal symbols) each worker should pop when getting work from
> the shared queue. The general wisdom is that:
>
> - popping too few items, and the synchronization overhead becomes
> significant, and the total processing time increases
> - popping too many items, and we get some imbalance back, and the
> total processing time increases again
>
> I experimented using a dynamic batch size proportional to the number
> of remaining work items. It worked well in some cases but not
> always. So I decided to keep it simple, with a fixed batch size.
> That can always be tweaked later.
>
> - I want to still be able to use scoped_time_it to measure the time
> that each worker thread spent working on the task. I find it really
> handy when measuring the performance impact of changes.
>
> Unfortunately, the current interface of gdb::parallel_for_each, which
> receives a simple callback, is not well-suited for that, once I
> introduce the dynamic partitioning. The callback would get called
> once for each work item batch (multiple time for each worker thread),
> so it's not possible to maintain a per-worker thread object for the
> duration of the parallel for.
>
> To allow this, I changed gdb::parallel_for_each to receive a worker
> type as a template parameter. Each worker thread creates one local
> instance of that type, and calls operator() on it for each work item
> batch. By having a scoped_time_it object as a field of that worker,
> we can get the timings per worker thread.
>
> The drawbacks of this approach is that we must now define the
> parallel task in a separate class and manually capture any context we
> need as fields of that class.
>
> Change-Id: Ibf1fea65c91f76a95b9ed8f706fd6fa5ef52d9cf
> ---
> gdb/minsyms.c | 133 ++++++++++++--------
> gdb/unittests/parallel-for-selftests.c | 22 +++-
> gdbsupport/parallel-for.h | 164 ++++++++++++-------------
> 3 files changed, 178 insertions(+), 141 deletions(-)
>
> diff --git a/gdb/minsyms.c b/gdb/minsyms.c
> index 4a6459a6f2d2..e353a8a9ba00 100644
> --- a/gdb/minsyms.c
> +++ b/gdb/minsyms.c
> @@ -1392,6 +1392,81 @@ build_minimal_symbol_hash_tables
> }
> }
>
> +/* gdb::parallel_for_each worker to compute minimal symbol names and hashes. */
> +
> +class minimal_symbol_install_worker
> +{
> +public:
> + minimal_symbol_install_worker
> + (minimal_symbol *msymbols,
> + gdb::array_view<computed_hash_values> hash_values,
> + objfile_per_bfd_storage *per_bfd,
> + std::mutex &demangled_mutex)
> + : m_time_it ("minsym install worker"),
> + m_msymbols (msymbols),
> + m_hash_values (hash_values),
> + m_per_bfd (per_bfd),
> + m_demangled_mutex (demangled_mutex)
> + {}
> +
> + void operator() (minimal_symbol *start, minimal_symbol *end) noexcept
> + {
> + for (minimal_symbol *msym = start; msym < end; ++msym)
> + {
> + size_t idx = msym - m_msymbols;
> + m_hash_values[idx].name_length = strlen (msym->linkage_name ());
> +
> + if (!msym->name_set)
> + {
> + /* This will be freed later, by compute_and_set_names. */
> + gdb::unique_xmalloc_ptr<char> demangled_name
> + = symbol_find_demangled_name (msym, msym->linkage_name ());
> + msym->set_demangled_name (demangled_name.release (),
> + &m_per_bfd->storage_obstack);
> + msym->name_set = 1;
> + }
> +
> + /* This mangled_name_hash computation has to be outside of
> + the name_set check, or compute_and_set_names below will
> + be called with an invalid hash value. */
> + m_hash_values[idx].mangled_name_hash
> + = fast_hash (msym->linkage_name (), m_hash_values[idx].name_length);
> + m_hash_values[idx].minsym_hash = msymbol_hash (msym->linkage_name ());
> +
> + /* We only use this hash code if the search name differs
> + from the linkage name. See the code in
> + build_minimal_symbol_hash_tables. */
> + if (msym->search_name () != msym->linkage_name ())
> + m_hash_values[idx].minsym_demangled_hash
> + = search_name_hash (msym->language (), msym->search_name ());
> + }
> +
> + {
> + /* To limit how long we hold the lock, we only acquire it here
> + and not while we demangle the names above. */
> +#if CXX_STD_THREAD
> + std::lock_guard<std::mutex> guard (m_demangled_mutex);
> +#endif
> + for (minimal_symbol *msym = start; msym < end; ++msym)
> + {
> + size_t idx = msym - m_msymbols;
> + std::string_view name (msym->linkage_name (),
> + m_hash_values[idx].name_length);
> + hashval_t hashval = m_hash_values[idx].mangled_name_hash;
> +
> + msym->compute_and_set_names (name, false, m_per_bfd, hashval);
> + }
> + }
> + }
> +
> +private:
> + scoped_time_it m_time_it;
> + minimal_symbol *m_msymbols;
> + gdb::array_view<computed_hash_values> m_hash_values;
> + objfile_per_bfd_storage *m_per_bfd;
> + std::mutex &m_demangled_mutex;
> +};
> +
> /* Add the minimal symbols in the existing bunches to the objfile's official
> minimal symbol table. In most cases there is no minimal symbol table yet
> for this objfile, and the existing bunches are used to create one. Once
> @@ -1478,59 +1553,11 @@ minimal_symbol_reader::install ()
> std::vector<computed_hash_values> hash_values (mcount);
>
> msymbols = m_objfile->per_bfd->msymbols.get ();
> - /* Arbitrarily require at least 10 elements in a thread. */
> - gdb::parallel_for_each<10> (&msymbols[0], &msymbols[mcount],
> - [&] (minimal_symbol *start, minimal_symbol *end)
> - {
> - scoped_time_it time_it ("minsyms install worker");
> -
> - for (minimal_symbol *msym = start; msym < end; ++msym)
> - {
> - size_t idx = msym - msymbols;
> - hash_values[idx].name_length = strlen (msym->linkage_name ());
> - if (!msym->name_set)
> - {
> - /* This will be freed later, by compute_and_set_names. */
> - gdb::unique_xmalloc_ptr<char> demangled_name
> - = symbol_find_demangled_name (msym, msym->linkage_name ());
> - msym->set_demangled_name
> - (demangled_name.release (),
> - &m_objfile->per_bfd->storage_obstack);
> - msym->name_set = 1;
> - }
> - /* This mangled_name_hash computation has to be outside of
> - the name_set check, or compute_and_set_names below will
> - be called with an invalid hash value. */
> - hash_values[idx].mangled_name_hash
> - = fast_hash (msym->linkage_name (),
> - hash_values[idx].name_length);
> - hash_values[idx].minsym_hash
> - = msymbol_hash (msym->linkage_name ());
> - /* We only use this hash code if the search name differs
> - from the linkage name. See the code in
> - build_minimal_symbol_hash_tables. */
> - if (msym->search_name () != msym->linkage_name ())
> - hash_values[idx].minsym_demangled_hash
> - = search_name_hash (msym->language (), msym->search_name ());
> - }
> - {
> - /* To limit how long we hold the lock, we only acquire it here
> - and not while we demangle the names above. */
> -#if CXX_STD_THREAD
> - std::lock_guard<std::mutex> guard (demangled_mutex);
> -#endif
> - for (minimal_symbol *msym = start; msym < end; ++msym)
> - {
> - size_t idx = msym - msymbols;
> - msym->compute_and_set_names
> - (std::string_view (msym->linkage_name (),
> - hash_values[idx].name_length),
> - false,
> - m_objfile->per_bfd,
> - hash_values[idx].mangled_name_hash);
> - }
> - }
> - });
> +
> + gdb::parallel_for_each<1000, minimal_symbol *, minimal_symbol_install_worker>
> + (&msymbols[0], &msymbols[mcount], msymbols,
> + gdb::array_view<computed_hash_values> (hash_values),
> + m_objfile->per_bfd, demangled_mutex);
>
> build_minimal_symbol_hash_tables (m_objfile, hash_values);
> }
> diff --git a/gdb/unittests/parallel-for-selftests.c b/gdb/unittests/parallel-for-selftests.c
> index f5456141ff6e..86bf06c073a6 100644
> --- a/gdb/unittests/parallel-for-selftests.c
> +++ b/gdb/unittests/parallel-for-selftests.c
> @@ -91,12 +91,30 @@ test_one (int n_threads, do_foreach_t do_foreach)
> static void
> test_parallel_for_each ()
> {
> + struct test_worker
> + {
> + /* DUMMY is there to test passing multiple arguments to the worker
> + constructor. */
> + test_worker (foreach_callback_t callback, int dummy)
> + : m_callback (callback)
> + {
> + }
> +
> + void operator() (int first, int last) noexcept
> + {
> + return m_callback (first, last);
> + }
> +
> + private:
> + foreach_callback_t m_callback;
> + };
> +
> const std::vector<do_foreach_t> for_each_functions
> {
> [] (int start, int end, foreach_callback_t callback)
> - { gdb::parallel_for_each<1> (start, end, callback); },
> + { gdb::parallel_for_each<1, int, test_worker> (start, end, callback, 0); },
> [] (int start, int end, foreach_callback_t callback)
> - { gdb::sequential_for_each (start, end, callback);}
> + { gdb::sequential_for_each<int, test_worker> (start, end, callback, 0); },
> };
>
> for (int n_threads : { 0, 1, 3 })
> diff --git a/gdbsupport/parallel-for.h b/gdbsupport/parallel-for.h
> index b74c8068cf2c..bb41eb8700f0 100644
> --- a/gdbsupport/parallel-for.h
> +++ b/gdbsupport/parallel-for.h
> @@ -21,115 +21,107 @@
> #define GDBSUPPORT_PARALLEL_FOR_H
>
> #include <algorithm>
> -#include <type_traits>
> +#include <tuple>
> #include "gdbsupport/thread-pool.h"
> -#include "gdbsupport/function-view.h"
>
> namespace gdb
> {
>
> -/* A very simple "parallel for". This splits the range of iterators
> - into subranges, and then passes each subrange to the callback. The
> - work may or may not be done in separate threads.
> +/* A "parallel-for" implementation using a shared work queue. Work items get
> + popped in batches of size up to BATCH_SIZE from the queue and handed out to
> + worker threads.
>
> - This approach was chosen over having the callback work on single
> - items because it makes it simple for the caller to do
> - once-per-subrange initialization and destruction.
> + Each worker thread instantiates an object of type Worker, forwarding ARGS to
> + its constructor. The Worker object can be used to keep some per-worker
> + thread state.
>
> - The parameter N says how batching ought to be done -- there will be
> - at least N elements processed per thread. Setting N to 0 is not
> - allowed. */
> + Worker threads call Worker::operator() repeatedly until the queue is
> + empty. */
>
> -template<std::size_t n, class RandomIt, class RangeFunction>
> +template<std::size_t batch_size, class RandomIt, class Worker,
> + class... WorkerArgs>
> void
> -parallel_for_each (RandomIt first, RandomIt last, RangeFunction callback)
> +parallel_for_each (const RandomIt first, const RandomIt last,
> + WorkerArgs &&...worker_args)
> {
> /* If enabled, print debug info about how the work is distributed across
> the threads. */
> const bool parallel_for_each_debug = false;
>
> - size_t n_worker_threads = thread_pool::g_thread_pool->thread_count ();
> - size_t n_threads = n_worker_threads;
> - size_t n_elements = last - first;
> - size_t elts_per_thread = 0;
> - size_t elts_left_over = 0;
> + gdb_assert (first <= last);
>
> - if (n_threads > 1)
> + if (parallel_for_each_debug)
> {
> - /* Require that there should be at least N elements in a
> - thread. */
> - gdb_assert (n > 0);
> - if (n_elements / n_threads < n)
> - n_threads = std::max (n_elements / n, (size_t) 1);
> - elts_per_thread = n_elements / n_threads;
> - elts_left_over = n_elements % n_threads;
> - /* n_elements == n_threads * elts_per_thread + elts_left_over. */
> + debug_printf ("Parallel for: n elements: %zu\n",
> + static_cast<std::size_t> (last - first));
> + debug_printf ("Parallel for: batch size: %zu\n", batch_size);
> }
>
> - size_t count = n_threads == 0 ? 0 : n_threads - 1;
> + const size_t n_worker_threads
> + = std::max<size_t> (thread_pool::g_thread_pool->thread_count (), 1);
> std::vector<gdb::future<void>> results;
>
> - if (parallel_for_each_debug)
> - {
> - debug_printf (_("Parallel for: n_elements: %zu\n"), n_elements);
> - debug_printf (_("Parallel for: minimum elements per thread: %zu\n"), n);
> - debug_printf (_("Parallel for: elts_per_thread: %zu\n"), elts_per_thread);
> - }
> + /* The next item to hand out. */
> + std::atomic<RandomIt> next = first;
>
> - for (int i = 0; i < count; ++i)
> - {
> - RandomIt end;
> - end = first + elts_per_thread;
> - if (i < elts_left_over)
> - /* Distribute the leftovers over the worker threads, to avoid having
> - to handle all of them in a single thread. */
> - end++;
> -
> - /* This case means we don't have enough elements to really
> - distribute them. Rather than ever submit a task that does
> - nothing, we short-circuit here. */
> - if (first == end)
> - end = last;
> -
> - if (end == last)
> - {
> - /* We're about to dispatch the last batch of elements, which
> - we normally process in the main thread. So just truncate
> - the result list here. This avoids submitting empty tasks
> - to the thread pool. */
> - count = i;
> - break;
> - }
> + /* The worker thread task.
> +
> + We need to capture args as a tuple, because it's not possible to capture
> + the parameter pack directly in C++17. Once we migrate to C++20, the
> + capture can be simplified to:
>
> - if (parallel_for_each_debug)
> + ... args = std::forward<Args>(args)
> +
> + and `args` can be used as-is in the lambda. */
> + auto args_tuple
> + = std::forward_as_tuple (std::forward<WorkerArgs> (worker_args)...);
> + auto task =
> + [&next, first, last, n_worker_threads, &args_tuple] () {
> + /* Instantiate the user-defined worker. */
> + auto worker = std::make_from_tuple<Worker> (args_tuple);
> +
> + for (;;)
> {
> - debug_printf (_("Parallel for: elements on worker thread %i\t: %zu"),
> - i, (size_t)(end - first));
> - debug_printf (_("\n"));
> + /* Grab a snapshot of NEXT. */
> + auto local_next = next.load ();
> + gdb_assert (local_next <= last);
> +
> + /* Number of remaining items. */
> + auto n_remaining = last - local_next;
> + gdb_assert (n_remaining >= 0);
> +
> + /* Are we done? */
> + if (n_remaining == 0)
> + break;
> +
> + const auto this_batch_size
> + = std::min<std::size_t> (batch_size, n_remaining);
> +
> + /* The range to process in this iteration. */
> + const auto this_batch_first = local_next;
> + const auto this_batch_last = local_next + this_batch_size;
> +
> + /* Update NEXT. If the current value of NEXT doesn't match
> + LOCAL_NEXT, it means another thread updated it concurrently,
> + restart. */
> + if (!next.compare_exchange_weak (local_next, this_batch_last))
> + continue;
> +
> + if (parallel_for_each_debug)
> + debug_printf ("Processing %zu items, range [%zu, %zu[\n",
> + this_batch_size,
> + static_cast<size_t> (this_batch_first - first),
> + static_cast<size_t> (this_batch_last - first));
> +
> + worker (this_batch_first, this_batch_last);
> }
> - results.push_back (gdb::thread_pool::g_thread_pool->post_task ([=] ()
> - {
> - return callback (first, end);
> - }));
> - first = end;
> - }
> -
> - for (int i = count; i < n_worker_threads; ++i)
> - if (parallel_for_each_debug)
> - {
> - debug_printf (_("Parallel for: elements on worker thread %i\t: 0"), i);
> - debug_printf (_("\n"));
> - }
> + };
>
> - /* Process all the remaining elements in the main thread. */
> - if (parallel_for_each_debug)
> - {
> - debug_printf (_("Parallel for: elements on main thread\t\t: %zu"),
> - (size_t)(last - first));
> - debug_printf (_("\n"));
> - }
> - callback (first, last);
> + /* Start N_WORKER_THREADS tasks. */
> + for (int i = 0; i < n_worker_threads; ++i)
> + results.push_back (gdb::thread_pool::g_thread_pool->post_task (task));
>
> + /* Wait for all of them to be finished. */
> for (auto &fut : results)
> fut.get ();
> }
> @@ -138,11 +130,11 @@ parallel_for_each (RandomIt first, RandomIt last, RangeFunction callback)
> when debugging multi-threading behavior, and you want to limit
> multi-threading in a fine-grained way. */
>
> -template<class RandomIt, class RangeFunction>
> +template<class RandomIt, class Worker, class... WorkerArgs>
> void
> -sequential_for_each (RandomIt first, RandomIt last, RangeFunction callback)
> +sequential_for_each (RandomIt first, RandomIt last, WorkerArgs &&...worker_args)
> {
> - callback (first, last);
> + Worker (std::forward<WorkerArgs> (worker_args)...) (first, last);
> }
>
> }
>
> base-commit: b7ff16c68a2c0bacc0416c4b36a44e65888ce72b
> --
> 2.50.0
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 1/4] gdbsupport: use dynamic partitioning in gdb::parallel_for_each
2025-08-05 17:01 ` [PATCH v2 1/4] gdbsupport: use dynamic partitioning in gdb::parallel_for_each Simon Marchi
@ 2025-08-28 17:00 ` Simon Marchi
2025-09-08 17:23 ` Simon Marchi
0 siblings, 1 reply; 18+ messages in thread
From: Simon Marchi @ 2025-08-28 17:00 UTC (permalink / raw)
To: gdb-patches
Ping.
On 8/5/25 1:01 PM, Simon Marchi wrote:
> Ping.
>
> On 7/3/25 4:01 PM, simon.marchi@polymtl.ca wrote:
>> From: Simon Marchi <simon.marchi@polymtl.ca>
>>
>> gdb::parallel_for_each uses static partitioning of the workload, meaning
>> that each worker thread receives a similar number of work items. Change
>> it to use dynamic partitioning, where worker threads pull work items
>> from a shared work queue when they need to.
>>
>> Note that gdb::parallel_for_each is currently only used for processing
>> minimal symbols in GDB. I am looking at improving the startup
>> performance of GDB, where minimal symbol processing is one step.
>>
>> With static partitioning, there is a risk of workload imbalance if some
>> threads receive "easier" work than others. Some threads sit still while
>> others finish working on their share of the work. This is not
>> desirable, because the gdb::parallel_for_each takes as long as the
>> slowest thread takes.
>>
>> When loading a file with a lot of minimal symbols (~600k) in GDB, with
>> "maint set per-command time on", I observe some imbalance:
>>
>> Time for "minsyms install worker": wall 0.732, user 0.550, sys 0.041, user+sys 0.591, 80.7 % CPU
>> Time for "minsyms install worker": wall 0.881, user 0.722, sys 0.071, user+sys 0.793, 90.0 % CPU
>> Time for "minsyms install worker": wall 2.107, user 1.804, sys 0.147, user+sys 1.951, 92.6 % CPU
>> Time for "minsyms install worker": wall 2.351, user 2.003, sys 0.151, user+sys 2.154, 91.6 % CPU
>> Time for "minsyms install worker": wall 2.611, user 2.322, sys 0.235, user+sys 2.557, 97.9 % CPU
>> Time for "minsyms install worker": wall 3.074, user 2.729, sys 0.203, user+sys 2.932, 95.4 % CPU
>> Time for "minsyms install worker": wall 3.486, user 3.074, sys 0.260, user+sys 3.334, 95.6 % CPU
>> Time for "minsyms install worker": wall 3.927, user 3.475, sys 0.336, user+sys 3.811, 97.0 % CPU
>> ^
>> ----´
>>
>> The fastest thread took 0.732 seconds to complete its work (and then sat
>> still), while the slowest took 3.927 seconds. This means the
>> parallel_for_each took a bit less than 4 seconds.
>>
>> Even if the number of minimal symbols assigned to each worker is the
>> same, I suppose that some symbols (e.g. those that need demangling) take
>> longer to process, which could explain the imbalance.
>>
>> With this patch, things are much more balanced:
>>
>> Time for "minsym install worker": wall 2.807, user 2.222, sys 0.144, user+sys 2.366, 84.3 % CPU
>> Time for "minsym install worker": wall 2.808, user 2.073, sys 0.131, user+sys 2.204, 78.5 % CPU
>> Time for "minsym install worker": wall 2.804, user 1.994, sys 0.151, user+sys 2.145, 76.5 % CPU
>> Time for "minsym install worker": wall 2.808, user 1.977, sys 0.135, user+sys 2.112, 75.2 % CPU
>> Time for "minsym install worker": wall 2.808, user 2.061, sys 0.142, user+sys 2.203, 78.5 % CPU
>> Time for "minsym install worker": wall 2.809, user 2.012, sys 0.146, user+sys 2.158, 76.8 % CPU
>> Time for "minsym install worker": wall 2.809, user 2.178, sys 0.137, user+sys 2.315, 82.4 % CPU
>> Time for "minsym install worker": wall 2.820, user 2.141, sys 0.170, user+sys 2.311, 82.0 % CPU
>> ^
>> ----´
>>
>> In this version, the parallel_for_each took about 2.8 seconds,
>> representing a reduction of ~1.2 seconds for this step. Not
>> life-changing, but it's still good I think.
>>
>> Note that this patch helps when loading big programs. My go-to test
>> program for this is telegram-desktop that I built from source. For
>> small programs (including loading gdb itself), it makes no perceptible
>> difference.
>>
>> Now the technical bits:
>>
>> - One impact that this change has on the minimal symbol processing
>> specifically is that not all calls to compute_and_set_names (a
>> critical region guarded by a mutex) are done at the end of each
>> worker thread's task anymore.
>>
>> Before this patch, each thread would compute the names and hash values for
>> all the minimal symbols it has been assigned, and then would call
>> compute_and_set_names for all of them, while holding the mutex (thus
>> preventing other threads from doing this same step).
>>
>> With the shared work queue approach, each thread grabs a batch of
>> minimal symbols, computes the names and hash values for them, and
>> then calls compute_and_set_names (with the mutex held) for this batch
>> only. It then repeats that until the work queue is empty.
>>
>> There are therefore more small and spread out compute_and_set_names
>> critical sections, instead of just one per worker thread at the end.
>> Given that before this patch the work was not well balanced among worker
>> threads, I guess that threads would enter that critical region at
>> roughly different times, causing little contention.
>>
>> In the "with this patch" results, the CPU utilization numbers are not
>> as good, suggesting that there is some contention. But I don't know
>> if it's contention due to the compute_and_set_names critical section
>> or the shared work queue critical section. That can be investigated
>> later. In any case, what ultimately counts is the wall time, which
>> improves.
>>
>> - One choice I had to make was to decide how many work items (in this
>> case minimal symbols) each worker should pop when getting work from
>> the shared queue. The general wisdom is that:
>>
>> - popping too few items, and the synchronization overhead becomes
>> significant, and the total processing time increases
>> - popping too many items, and we get some imbalance back, and the
>> total processing time increases again
>>
>> I experimented using a dynamic batch size proportional to the number
>> of remaining work items. It worked well in some cases but not
>> always. So I decided to keep it simple, with a fixed batch size.
>> That can always be tweaked later.
>>
>> - I want to still be able to use scoped_time_it to measure the time
>> that each worker thread spent working on the task. I find it really
>> handy when measuring the performance impact of changes.
>>
>> Unfortunately, the current interface of gdb::parallel_for_each, which
>> receives a simple callback, is not well-suited for that, once I
>> introduce the dynamic partitioning. The callback would get called
>> once for each work item batch (multiple times for each worker thread),
>> so it's not possible to maintain a per-worker thread object for the
>> duration of the parallel for.
>>
>> To allow this, I changed gdb::parallel_for_each to receive a worker
>> type as a template parameter. Each worker thread creates one local
>> instance of that type, and calls operator() on it for each work item
>> batch. By having a scoped_time_it object as a field of that worker,
>> we can get the timings per worker thread.
>>
>> The drawback of this approach is that we must now define the
>> parallel task in a separate class and manually capture any context we
>> need as fields of that class.
>>
>> Change-Id: Ibf1fea65c91f76a95b9ed8f706fd6fa5ef52d9cf
>> ---
>> gdb/minsyms.c | 133 ++++++++++++--------
>> gdb/unittests/parallel-for-selftests.c | 22 +++-
>> gdbsupport/parallel-for.h | 164 ++++++++++++-------------
>> 3 files changed, 178 insertions(+), 141 deletions(-)
>>
>> diff --git a/gdb/minsyms.c b/gdb/minsyms.c
>> index 4a6459a6f2d2..e353a8a9ba00 100644
>> --- a/gdb/minsyms.c
>> +++ b/gdb/minsyms.c
>> @@ -1392,6 +1392,81 @@ build_minimal_symbol_hash_tables
>> }
>> }
>>
>> +/* gdb::parallel_for_each worker to compute minimal symbol names and hashes. */
>> +
>> +class minimal_symbol_install_worker
>> +{
>> +public:
>> + minimal_symbol_install_worker
>> + (minimal_symbol *msymbols,
>> + gdb::array_view<computed_hash_values> hash_values,
>> + objfile_per_bfd_storage *per_bfd,
>> + std::mutex &demangled_mutex)
>> + : m_time_it ("minsym install worker"),
>> + m_msymbols (msymbols),
>> + m_hash_values (hash_values),
>> + m_per_bfd (per_bfd),
>> + m_demangled_mutex (demangled_mutex)
>> + {}
>> +
>> + void operator() (minimal_symbol *start, minimal_symbol *end) noexcept
>> + {
>> + for (minimal_symbol *msym = start; msym < end; ++msym)
>> + {
>> + size_t idx = msym - m_msymbols;
>> + m_hash_values[idx].name_length = strlen (msym->linkage_name ());
>> +
>> + if (!msym->name_set)
>> + {
>> + /* This will be freed later, by compute_and_set_names. */
>> + gdb::unique_xmalloc_ptr<char> demangled_name
>> + = symbol_find_demangled_name (msym, msym->linkage_name ());
>> + msym->set_demangled_name (demangled_name.release (),
>> + &m_per_bfd->storage_obstack);
>> + msym->name_set = 1;
>> + }
>> +
>> + /* This mangled_name_hash computation has to be outside of
>> + the name_set check, or compute_and_set_names below will
>> + be called with an invalid hash value. */
>> + m_hash_values[idx].mangled_name_hash
>> + = fast_hash (msym->linkage_name (), m_hash_values[idx].name_length);
>> + m_hash_values[idx].minsym_hash = msymbol_hash (msym->linkage_name ());
>> +
>> + /* We only use this hash code if the search name differs
>> + from the linkage name. See the code in
>> + build_minimal_symbol_hash_tables. */
>> + if (msym->search_name () != msym->linkage_name ())
>> + m_hash_values[idx].minsym_demangled_hash
>> + = search_name_hash (msym->language (), msym->search_name ());
>> + }
>> +
>> + {
>> + /* To limit how long we hold the lock, we only acquire it here
>> + and not while we demangle the names above. */
>> +#if CXX_STD_THREAD
>> + std::lock_guard<std::mutex> guard (m_demangled_mutex);
>> +#endif
>> + for (minimal_symbol *msym = start; msym < end; ++msym)
>> + {
>> + size_t idx = msym - m_msymbols;
>> + std::string_view name (msym->linkage_name (),
>> + m_hash_values[idx].name_length);
>> + hashval_t hashval = m_hash_values[idx].mangled_name_hash;
>> +
>> + msym->compute_and_set_names (name, false, m_per_bfd, hashval);
>> + }
>> + }
>> + }
>> +
>> +private:
>> + scoped_time_it m_time_it;
>> + minimal_symbol *m_msymbols;
>> + gdb::array_view<computed_hash_values> m_hash_values;
>> + objfile_per_bfd_storage *m_per_bfd;
>> + std::mutex &m_demangled_mutex;
>> +};
>> +
>> /* Add the minimal symbols in the existing bunches to the objfile's official
>> minimal symbol table. In most cases there is no minimal symbol table yet
>> for this objfile, and the existing bunches are used to create one. Once
>> @@ -1478,59 +1553,11 @@ minimal_symbol_reader::install ()
>> std::vector<computed_hash_values> hash_values (mcount);
>>
>> msymbols = m_objfile->per_bfd->msymbols.get ();
>> - /* Arbitrarily require at least 10 elements in a thread. */
>> - gdb::parallel_for_each<10> (&msymbols[0], &msymbols[mcount],
>> - [&] (minimal_symbol *start, minimal_symbol *end)
>> - {
>> - scoped_time_it time_it ("minsyms install worker");
>> -
>> - for (minimal_symbol *msym = start; msym < end; ++msym)
>> - {
>> - size_t idx = msym - msymbols;
>> - hash_values[idx].name_length = strlen (msym->linkage_name ());
>> - if (!msym->name_set)
>> - {
>> - /* This will be freed later, by compute_and_set_names. */
>> - gdb::unique_xmalloc_ptr<char> demangled_name
>> - = symbol_find_demangled_name (msym, msym->linkage_name ());
>> - msym->set_demangled_name
>> - (demangled_name.release (),
>> - &m_objfile->per_bfd->storage_obstack);
>> - msym->name_set = 1;
>> - }
>> - /* This mangled_name_hash computation has to be outside of
>> - the name_set check, or compute_and_set_names below will
>> - be called with an invalid hash value. */
>> - hash_values[idx].mangled_name_hash
>> - = fast_hash (msym->linkage_name (),
>> - hash_values[idx].name_length);
>> - hash_values[idx].minsym_hash
>> - = msymbol_hash (msym->linkage_name ());
>> - /* We only use this hash code if the search name differs
>> - from the linkage name. See the code in
>> - build_minimal_symbol_hash_tables. */
>> - if (msym->search_name () != msym->linkage_name ())
>> - hash_values[idx].minsym_demangled_hash
>> - = search_name_hash (msym->language (), msym->search_name ());
>> - }
>> - {
>> - /* To limit how long we hold the lock, we only acquire it here
>> - and not while we demangle the names above. */
>> -#if CXX_STD_THREAD
>> - std::lock_guard<std::mutex> guard (demangled_mutex);
>> -#endif
>> - for (minimal_symbol *msym = start; msym < end; ++msym)
>> - {
>> - size_t idx = msym - msymbols;
>> - msym->compute_and_set_names
>> - (std::string_view (msym->linkage_name (),
>> - hash_values[idx].name_length),
>> - false,
>> - m_objfile->per_bfd,
>> - hash_values[idx].mangled_name_hash);
>> - }
>> - }
>> - });
>> +
>> + gdb::parallel_for_each<1000, minimal_symbol *, minimal_symbol_install_worker>
>> + (&msymbols[0], &msymbols[mcount], msymbols,
>> + gdb::array_view<computed_hash_values> (hash_values),
>> + m_objfile->per_bfd, demangled_mutex);
>>
>> build_minimal_symbol_hash_tables (m_objfile, hash_values);
>> }
>> diff --git a/gdb/unittests/parallel-for-selftests.c b/gdb/unittests/parallel-for-selftests.c
>> index f5456141ff6e..86bf06c073a6 100644
>> --- a/gdb/unittests/parallel-for-selftests.c
>> +++ b/gdb/unittests/parallel-for-selftests.c
>> @@ -91,12 +91,30 @@ test_one (int n_threads, do_foreach_t do_foreach)
>> static void
>> test_parallel_for_each ()
>> {
>> + struct test_worker
>> + {
>> + /* DUMMY is there to test passing multiple arguments to the worker
>> + constructor. */
>> + test_worker (foreach_callback_t callback, int dummy)
>> + : m_callback (callback)
>> + {
>> + }
>> +
>> + void operator() (int first, int last) noexcept
>> + {
>> + return m_callback (first, last);
>> + }
>> +
>> + private:
>> + foreach_callback_t m_callback;
>> + };
>> +
>> const std::vector<do_foreach_t> for_each_functions
>> {
>> [] (int start, int end, foreach_callback_t callback)
>> - { gdb::parallel_for_each<1> (start, end, callback); },
>> + { gdb::parallel_for_each<1, int, test_worker> (start, end, callback, 0); },
>> [] (int start, int end, foreach_callback_t callback)
>> - { gdb::sequential_for_each (start, end, callback);}
>> + { gdb::sequential_for_each<int, test_worker> (start, end, callback, 0); },
>> };
>>
>> for (int n_threads : { 0, 1, 3 })
>> diff --git a/gdbsupport/parallel-for.h b/gdbsupport/parallel-for.h
>> index b74c8068cf2c..bb41eb8700f0 100644
>> --- a/gdbsupport/parallel-for.h
>> +++ b/gdbsupport/parallel-for.h
>> @@ -21,115 +21,107 @@
>> #define GDBSUPPORT_PARALLEL_FOR_H
>>
>> #include <algorithm>
>> -#include <type_traits>
>> +#include <tuple>
>> #include "gdbsupport/thread-pool.h"
>> -#include "gdbsupport/function-view.h"
>>
>> namespace gdb
>> {
>>
>> -/* A very simple "parallel for". This splits the range of iterators
>> - into subranges, and then passes each subrange to the callback. The
>> - work may or may not be done in separate threads.
>> +/* A "parallel-for" implementation using a shared work queue. Work items get
>> + popped in batches of size up to BATCH_SIZE from the queue and handed out to
>> + worker threads.
>>
>> - This approach was chosen over having the callback work on single
>> - items because it makes it simple for the caller to do
>> - once-per-subrange initialization and destruction.
>> + Each worker thread instantiates an object of type Worker, forwarding ARGS to
>> + its constructor. The Worker object can be used to keep some per-worker
>> + thread state.
>>
>> - The parameter N says how batching ought to be done -- there will be
>> - at least N elements processed per thread. Setting N to 0 is not
>> - allowed. */
>> + Worker threads call Worker::operator() repeatedly until the queue is
>> + empty. */
>>
>> -template<std::size_t n, class RandomIt, class RangeFunction>
>> +template<std::size_t batch_size, class RandomIt, class Worker,
>> + class... WorkerArgs>
>> void
>> -parallel_for_each (RandomIt first, RandomIt last, RangeFunction callback)
>> +parallel_for_each (const RandomIt first, const RandomIt last,
>> + WorkerArgs &&...worker_args)
>> {
>> /* If enabled, print debug info about how the work is distributed across
>> the threads. */
>> const bool parallel_for_each_debug = false;
>>
>> - size_t n_worker_threads = thread_pool::g_thread_pool->thread_count ();
>> - size_t n_threads = n_worker_threads;
>> - size_t n_elements = last - first;
>> - size_t elts_per_thread = 0;
>> - size_t elts_left_over = 0;
>> + gdb_assert (first <= last);
>>
>> - if (n_threads > 1)
>> + if (parallel_for_each_debug)
>> {
>> - /* Require that there should be at least N elements in a
>> - thread. */
>> - gdb_assert (n > 0);
>> - if (n_elements / n_threads < n)
>> - n_threads = std::max (n_elements / n, (size_t) 1);
>> - elts_per_thread = n_elements / n_threads;
>> - elts_left_over = n_elements % n_threads;
>> - /* n_elements == n_threads * elts_per_thread + elts_left_over. */
>> + debug_printf ("Parallel for: n elements: %zu\n",
>> + static_cast<std::size_t> (last - first));
>> + debug_printf ("Parallel for: batch size: %zu\n", batch_size);
>> }
>>
>> - size_t count = n_threads == 0 ? 0 : n_threads - 1;
>> + const size_t n_worker_threads
>> + = std::max<size_t> (thread_pool::g_thread_pool->thread_count (), 1);
>> std::vector<gdb::future<void>> results;
>>
>> - if (parallel_for_each_debug)
>> - {
>> - debug_printf (_("Parallel for: n_elements: %zu\n"), n_elements);
>> - debug_printf (_("Parallel for: minimum elements per thread: %zu\n"), n);
>> - debug_printf (_("Parallel for: elts_per_thread: %zu\n"), elts_per_thread);
>> - }
>> + /* The next item to hand out. */
>> + std::atomic<RandomIt> next = first;
>>
>> - for (int i = 0; i < count; ++i)
>> - {
>> - RandomIt end;
>> - end = first + elts_per_thread;
>> - if (i < elts_left_over)
>> - /* Distribute the leftovers over the worker threads, to avoid having
>> - to handle all of them in a single thread. */
>> - end++;
>> -
>> - /* This case means we don't have enough elements to really
>> - distribute them. Rather than ever submit a task that does
>> - nothing, we short-circuit here. */
>> - if (first == end)
>> - end = last;
>> -
>> - if (end == last)
>> - {
>> - /* We're about to dispatch the last batch of elements, which
>> - we normally process in the main thread. So just truncate
>> - the result list here. This avoids submitting empty tasks
>> - to the thread pool. */
>> - count = i;
>> - break;
>> - }
>> + /* The worker thread task.
>> +
>> + We need to capture args as a tuple, because it's not possible to capture
>> + the parameter pack directly in C++17. Once we migrate to C++20, the
>> + capture can be simplified to:
>>
>> - if (parallel_for_each_debug)
>> + ... args = std::forward<Args>(args)
>> +
>> + and `args` can be used as-is in the lambda. */
>> + auto args_tuple
>> + = std::forward_as_tuple (std::forward<WorkerArgs> (worker_args)...);
>> + auto task =
>> + [&next, first, last, n_worker_threads, &args_tuple] () {
>> + /* Instantiate the user-defined worker. */
>> + auto worker = std::make_from_tuple<Worker> (args_tuple);
>> +
>> + for (;;)
>> {
>> - debug_printf (_("Parallel for: elements on worker thread %i\t: %zu"),
>> - i, (size_t)(end - first));
>> - debug_printf (_("\n"));
>> + /* Grab a snapshot of NEXT. */
>> + auto local_next = next.load ();
>> + gdb_assert (local_next <= last);
>> +
>> + /* Number of remaining items. */
>> + auto n_remaining = last - local_next;
>> + gdb_assert (n_remaining >= 0);
>> +
>> + /* Are we done? */
>> + if (n_remaining == 0)
>> + break;
>> +
>> + const auto this_batch_size
>> + = std::min<std::size_t> (batch_size, n_remaining);
>> +
>> + /* The range to process in this iteration. */
>> + const auto this_batch_first = local_next;
>> + const auto this_batch_last = local_next + this_batch_size;
>> +
>> + /* Update NEXT. If the current value of NEXT doesn't match
>> + LOCAL_NEXT, it means another thread updated it concurrently,
>> + restart. */
>> + if (!next.compare_exchange_weak (local_next, this_batch_last))
>> + continue;
>> +
>> + if (parallel_for_each_debug)
>> + debug_printf ("Processing %zu items, range [%zu, %zu[\n",
>> + this_batch_size,
>> + static_cast<size_t> (this_batch_first - first),
>> + static_cast<size_t> (this_batch_last - first));
>> +
>> + worker (this_batch_first, this_batch_last);
>> }
>> - results.push_back (gdb::thread_pool::g_thread_pool->post_task ([=] ()
>> - {
>> - return callback (first, end);
>> - }));
>> - first = end;
>> - }
>> -
>> - for (int i = count; i < n_worker_threads; ++i)
>> - if (parallel_for_each_debug)
>> - {
>> - debug_printf (_("Parallel for: elements on worker thread %i\t: 0"), i);
>> - debug_printf (_("\n"));
>> - }
>> + };
>>
>> - /* Process all the remaining elements in the main thread. */
>> - if (parallel_for_each_debug)
>> - {
>> - debug_printf (_("Parallel for: elements on main thread\t\t: %zu"),
>> - (size_t)(last - first));
>> - debug_printf (_("\n"));
>> - }
>> - callback (first, last);
>> + /* Start N_WORKER_THREADS tasks. */
>> + for (int i = 0; i < n_worker_threads; ++i)
>> + results.push_back (gdb::thread_pool::g_thread_pool->post_task (task));
>>
>> + /* Wait for all of them to be finished. */
>> for (auto &fut : results)
>> fut.get ();
>> }
>> @@ -138,11 +130,11 @@ parallel_for_each (RandomIt first, RandomIt last, RangeFunction callback)
>> when debugging multi-threading behavior, and you want to limit
>> multi-threading in a fine-grained way. */
>>
>> -template<class RandomIt, class RangeFunction>
>> +template<class RandomIt, class Worker, class... WorkerArgs>
>> void
>> -sequential_for_each (RandomIt first, RandomIt last, RangeFunction callback)
>> +sequential_for_each (RandomIt first, RandomIt last, WorkerArgs &&...worker_args)
>> {
>> - callback (first, last);
>> + Worker (std::forward<WorkerArgs> (worker_args)...) (first, last);
>> }
>>
>> }
>>
>> base-commit: b7ff16c68a2c0bacc0416c4b36a44e65888ce72b
>> --
>> 2.50.0
>>
>
* Re: [PATCH v2 1/4] gdbsupport: use dynamic partitioning in gdb::parallel_for_each
2025-08-28 17:00 ` Simon Marchi
@ 2025-09-08 17:23 ` Simon Marchi
0 siblings, 0 replies; 18+ messages in thread
From: Simon Marchi @ 2025-09-08 17:23 UTC (permalink / raw)
To: gdb-patches
Ping.
On 8/28/25 1:00 PM, Simon Marchi wrote:
> Ping.
>
> On 8/5/25 1:01 PM, Simon Marchi wrote:
>> Ping.
>>
>> On 7/3/25 4:01 PM, simon.marchi@polymtl.ca wrote:
>>> From: Simon Marchi <simon.marchi@polymtl.ca>
>>>
>>> gdb::parallel_for_each uses static partitioning of the workload, meaning
>>> that each worker thread receives a similar number of work items. Change
>>> it to use dynamic partitioning, where worker threads pull work items
>>> from a shared work queue when they need to.
>>>
>>> Note that gdb::parallel_for_each is currently only used for processing
>>> minimal symbols in GDB. I am looking at improving the startup
>>> performance of GDB, where minimal symbol processing is one step.
>>>
>>> With static partitioning, there is a risk of workload imbalance if some
>>> threads receive "easier" work than others. Some threads sit still while
>>> others finish working on their share of the work. This is not
>>> desirable, because the gdb::parallel_for_each takes as long as the
>>> slowest thread takes.
>>>
>>> When loading a file with a lot of minimal symbols (~600k) in GDB, with
>>> "maint set per-command time on", I observe some imbalance:
>>>
>>> Time for "minsyms install worker": wall 0.732, user 0.550, sys 0.041, user+sys 0.591, 80.7 % CPU
>>> Time for "minsyms install worker": wall 0.881, user 0.722, sys 0.071, user+sys 0.793, 90.0 % CPU
>>> Time for "minsyms install worker": wall 2.107, user 1.804, sys 0.147, user+sys 1.951, 92.6 % CPU
>>> Time for "minsyms install worker": wall 2.351, user 2.003, sys 0.151, user+sys 2.154, 91.6 % CPU
>>> Time for "minsyms install worker": wall 2.611, user 2.322, sys 0.235, user+sys 2.557, 97.9 % CPU
>>> Time for "minsyms install worker": wall 3.074, user 2.729, sys 0.203, user+sys 2.932, 95.4 % CPU
>>> Time for "minsyms install worker": wall 3.486, user 3.074, sys 0.260, user+sys 3.334, 95.6 % CPU
>>> Time for "minsyms install worker": wall 3.927, user 3.475, sys 0.336, user+sys 3.811, 97.0 % CPU
>>> ^
>>> ----´
>>>
>>> The fastest thread took 0.732 seconds to complete its work (and then sat
>>> still), while the slowest took 3.927 seconds. This means the
>>> parallel_for_each took a bit less than 4 seconds.
>>>
>>> Even if the number of minimal symbols assigned to each worker is the
>>> same, I suppose that some symbols (e.g. those that need demangling) take
>>> longer to process, which could explain the imbalance.
>>>
>>> With this patch, things are much more balanced:
>>>
>>> Time for "minsym install worker": wall 2.807, user 2.222, sys 0.144, user+sys 2.366, 84.3 % CPU
>>> Time for "minsym install worker": wall 2.808, user 2.073, sys 0.131, user+sys 2.204, 78.5 % CPU
>>> Time for "minsym install worker": wall 2.804, user 1.994, sys 0.151, user+sys 2.145, 76.5 % CPU
>>> Time for "minsym install worker": wall 2.808, user 1.977, sys 0.135, user+sys 2.112, 75.2 % CPU
>>> Time for "minsym install worker": wall 2.808, user 2.061, sys 0.142, user+sys 2.203, 78.5 % CPU
>>> Time for "minsym install worker": wall 2.809, user 2.012, sys 0.146, user+sys 2.158, 76.8 % CPU
>>> Time for "minsym install worker": wall 2.809, user 2.178, sys 0.137, user+sys 2.315, 82.4 % CPU
>>> Time for "minsym install worker": wall 2.820, user 2.141, sys 0.170, user+sys 2.311, 82.0 % CPU
>>> ^
>>> ----´
>>>
>>> In this version, the parallel_for_each took about 2.8 seconds,
>>> representing a reduction of ~1.2 seconds for this step. Not
>>> life-changing, but it's still good I think.
>>>
>>> Note that this patch helps when loading big programs. My go-to test
>>> program for this is telegram-desktop that I built from source. For
>>> small programs (including loading gdb itself), it makes no perceptible
>>> difference.
>>>
>>> Now the technical bits:
>>>
>>> - One impact that this change has on the minimal symbol processing
>>> specifically is that not all calls to compute_and_set_names (a
>>> critical region guarded by a mutex) are done at the end of each
>>> worker thread's task anymore.
>>>
>>> Before this patch, each thread would compute the names and hash values for
>>> all the minimal symbols it has been assigned, and then would call
>>> compute_and_set_names for all of them, while holding the mutex (thus
>>> preventing other threads from doing this same step).
>>>
>>> With the shared work queue approach, each thread grabs a batch of
>>> minimal symbols, computes the names and hash values for them, and
>>> then calls compute_and_set_names (with the mutex held) for this batch
>>> only. It then repeats that until the work queue is empty.
>>>
>>> The compute_and_set_names critical sections are therefore smaller and
>>> more spread out, instead of one large one per worker thread at the end.
>>> Given that before this patch the work was not well balanced among worker
>>> threads, I guess that threads would enter that critical region at
>>> different times, causing little contention.
>>>
>>> In the "with this patch" results, the CPU utilization numbers are not
>>> as good, suggesting that there is some contention. But I don't know
>>> if it's contention due to the compute_and_set_names critical section
>>> or the shared work queue critical section. That can be investigated
>>> later. In any case, what ultimately counts is the wall time, which
>>> improves.
>>>
>>> - One choice I had to make was to decide how many work items (in this
>>> case minimal symbols) each worker should pop when getting work from
>>> the shared queue. The general wisdom is that:
>>>
>>> - popping too few items, and the synchronization overhead becomes
>>> significant, and the total processing time increases
>>> - popping too many items, and we get some imbalance back, and the
>>> total processing time increases again
>>>
>>> I experimented using a dynamic batch size proportional to the number
>>> of remaining work items. It worked well in some cases but not
>>> always. So I decided to keep it simple, with a fixed batch size.
>>> That can always be tweaked later.
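The fixed-batch-size scheme described above can be sketched with a shared atomic cursor and a compare-exchange claim loop. This is a simplified stand-in for the patch's implementation (which uses GDB's thread pool); `process_with_dynamic_batches` and its parameters are names invented for the example.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

/* NEXT is a shared cursor into [first, last); each worker claims up to
   BATCH_SIZE items at a time with a compare-exchange loop, so fast
   workers naturally pull more batches than slow ones.  */
template<std::size_t batch_size, typename It, typename Func>
void
process_with_dynamic_batches (It first, It last, Func func,
			      unsigned n_threads)
{
  std::atomic<It> next = first;

  auto worker = [&] ()
    {
      for (;;)
	{
	  It local_next = next.load ();
	  std::size_t n_remaining = last - local_next;
	  if (n_remaining == 0)
	    break;

	  std::size_t n = std::min (batch_size, n_remaining);

	  /* Claim [local_next, local_next + n); if another thread moved
	     the cursor first, the exchange fails and we retry.  */
	  if (!next.compare_exchange_weak (local_next, local_next + n))
	    continue;

	  func (local_next, local_next + n);
	}
    };

  std::vector<std::thread> threads;
  for (unsigned i = 0; i < n_threads; ++i)
    threads.emplace_back (worker);
  for (std::thread &t : threads)
    t.join ();
}
```

The batch size trades synchronization overhead (too small) against imbalance (too large), exactly as discussed above.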
>>>
>>> - I want to still be able to use scoped_time_it to measure the time
>>> that each worker thread spent working on the task. I find it really
>>> handy when measuring the performance impact of changes.
>>>
>>> Unfortunately, the current interface of gdb::parallel_for_each, which
>>> receives a simple callback, is not well-suited for that, once I
>>> introduce the dynamic partitioning. The callback would get called
>>> once for each work item batch (multiple times for each worker thread),
>>> so it's not possible to maintain a per-worker thread object for the
>>> duration of the parallel for.
>>>
>>> To allow this, I changed gdb::parallel_for_each to receive a worker
>>> type as a template parameter. Each worker thread creates one local
>>> instance of that type, and calls operator() on it for each work item
>>> batch. By having a scoped_time_it object as a field of that worker,
>>> we can get the timings per worker thread.
>>>
>>> The drawback of this approach is that we must now define the
>>> parallel task in a separate class and manually capture any context we
>>> need as fields of that class.
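The worker-as-template-parameter pattern can be sketched as below: one Worker object per (would-be) thread, constructed from forwarded arguments via `std::make_from_tuple`, then invoked once per batch. This is a single-threaded, simplified illustration; `run_worker_over_batches` and `sum_worker` are names invented for the example, not GDB's API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <utility>

/* Build one Worker from the forwarded constructor arguments, then call
   it for each batch.  Per-worker state (e.g. a scoped_time_it) lives as
   a field of Worker for the whole loop.  */
template<typename Worker, typename It, typename... Args>
void
run_worker_over_batches (It first, It last, std::size_t batch,
			 Args &&...args)
{
  /* Capture the argument pack as a tuple (C++17), as the patch does.  */
  auto args_tuple = std::forward_as_tuple (std::forward<Args> (args)...);
  Worker worker = std::make_from_tuple<Worker> (args_tuple);

  while (first != last)
    {
      It end = first + std::min<std::size_t> (batch, last - first);
      worker (first, end);
      first = end;
    }
}

/* A toy worker holding per-worker state (a running total).  */
struct sum_worker
{
  explicit sum_worker (long &total) : m_total (total) {}

  void operator() (const int *first, const int *last)
  {
    for (const int *p = first; p < last; ++p)
      m_total += *p;
  }

private:
  long &m_total;
};
```

The context a lambda would have captured implicitly must instead be passed explicitly to the worker's constructor, which is the drawback noted above.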
>>>
>>> Change-Id: Ibf1fea65c91f76a95b9ed8f706fd6fa5ef52d9cf
>>> ---
>>> gdb/minsyms.c | 133 ++++++++++++--------
>>> gdb/unittests/parallel-for-selftests.c | 22 +++-
>>> gdbsupport/parallel-for.h | 164 ++++++++++++-------------
>>> 3 files changed, 178 insertions(+), 141 deletions(-)
>>>
>>> diff --git a/gdb/minsyms.c b/gdb/minsyms.c
>>> index 4a6459a6f2d2..e353a8a9ba00 100644
>>> --- a/gdb/minsyms.c
>>> +++ b/gdb/minsyms.c
>>> @@ -1392,6 +1392,81 @@ build_minimal_symbol_hash_tables
>>> }
>>> }
>>>
>>> +/* gdb::parallel_for_each worker to compute minimal symbol names and hashes. */
>>> +
>>> +class minimal_symbol_install_worker
>>> +{
>>> +public:
>>> + minimal_symbol_install_worker
>>> + (minimal_symbol *msymbols,
>>> + gdb::array_view<computed_hash_values> hash_values,
>>> + objfile_per_bfd_storage *per_bfd,
>>> + std::mutex &demangled_mutex)
>>> + : m_time_it ("minsym install worker"),
>>> + m_msymbols (msymbols),
>>> + m_hash_values (hash_values),
>>> + m_per_bfd (per_bfd),
>>> + m_demangled_mutex (demangled_mutex)
>>> + {}
>>> +
>>> + void operator() (minimal_symbol *start, minimal_symbol *end) noexcept
>>> + {
>>> + for (minimal_symbol *msym = start; msym < end; ++msym)
>>> + {
>>> + size_t idx = msym - m_msymbols;
>>> + m_hash_values[idx].name_length = strlen (msym->linkage_name ());
>>> +
>>> + if (!msym->name_set)
>>> + {
>>> + /* This will be freed later, by compute_and_set_names. */
>>> + gdb::unique_xmalloc_ptr<char> demangled_name
>>> + = symbol_find_demangled_name (msym, msym->linkage_name ());
>>> + msym->set_demangled_name (demangled_name.release (),
>>> + &m_per_bfd->storage_obstack);
>>> + msym->name_set = 1;
>>> + }
>>> +
>>> + /* This mangled_name_hash computation has to be outside of
>>> + the name_set check, or compute_and_set_names below will
>>> + be called with an invalid hash value. */
>>> + m_hash_values[idx].mangled_name_hash
>>> + = fast_hash (msym->linkage_name (), m_hash_values[idx].name_length);
>>> + m_hash_values[idx].minsym_hash = msymbol_hash (msym->linkage_name ());
>>> +
>>> + /* We only use this hash code if the search name differs
>>> + from the linkage name. See the code in
>>> + build_minimal_symbol_hash_tables. */
>>> + if (msym->search_name () != msym->linkage_name ())
>>> + m_hash_values[idx].minsym_demangled_hash
>>> + = search_name_hash (msym->language (), msym->search_name ());
>>> + }
>>> +
>>> + {
>>> + /* To limit how long we hold the lock, we only acquire it here
>>> + and not while we demangle the names above. */
>>> +#if CXX_STD_THREAD
>>> + std::lock_guard<std::mutex> guard (m_demangled_mutex);
>>> +#endif
>>> + for (minimal_symbol *msym = start; msym < end; ++msym)
>>> + {
>>> + size_t idx = msym - m_msymbols;
>>> + std::string_view name (msym->linkage_name (),
>>> + m_hash_values[idx].name_length);
>>> + hashval_t hashval = m_hash_values[idx].mangled_name_hash;
>>> +
>>> + msym->compute_and_set_names (name, false, m_per_bfd, hashval);
>>> + }
>>> + }
>>> + }
>>> +
>>> +private:
>>> + scoped_time_it m_time_it;
>>> + minimal_symbol *m_msymbols;
>>> + gdb::array_view<computed_hash_values> m_hash_values;
>>> + objfile_per_bfd_storage *m_per_bfd;
>>> + std::mutex &m_demangled_mutex;
>>> +};
>>> +
>>> /* Add the minimal symbols in the existing bunches to the objfile's official
>>> minimal symbol table. In most cases there is no minimal symbol table yet
>>> for this objfile, and the existing bunches are used to create one. Once
>>> @@ -1478,59 +1553,11 @@ minimal_symbol_reader::install ()
>>> std::vector<computed_hash_values> hash_values (mcount);
>>>
>>> msymbols = m_objfile->per_bfd->msymbols.get ();
>>> - /* Arbitrarily require at least 10 elements in a thread. */
>>> - gdb::parallel_for_each<10> (&msymbols[0], &msymbols[mcount],
>>> - [&] (minimal_symbol *start, minimal_symbol *end)
>>> - {
>>> - scoped_time_it time_it ("minsyms install worker");
>>> -
>>> - for (minimal_symbol *msym = start; msym < end; ++msym)
>>> - {
>>> - size_t idx = msym - msymbols;
>>> - hash_values[idx].name_length = strlen (msym->linkage_name ());
>>> - if (!msym->name_set)
>>> - {
>>> - /* This will be freed later, by compute_and_set_names. */
>>> - gdb::unique_xmalloc_ptr<char> demangled_name
>>> - = symbol_find_demangled_name (msym, msym->linkage_name ());
>>> - msym->set_demangled_name
>>> - (demangled_name.release (),
>>> - &m_objfile->per_bfd->storage_obstack);
>>> - msym->name_set = 1;
>>> - }
>>> - /* This mangled_name_hash computation has to be outside of
>>> - the name_set check, or compute_and_set_names below will
>>> - be called with an invalid hash value. */
>>> - hash_values[idx].mangled_name_hash
>>> - = fast_hash (msym->linkage_name (),
>>> - hash_values[idx].name_length);
>>> - hash_values[idx].minsym_hash
>>> - = msymbol_hash (msym->linkage_name ());
>>> - /* We only use this hash code if the search name differs
>>> - from the linkage name. See the code in
>>> - build_minimal_symbol_hash_tables. */
>>> - if (msym->search_name () != msym->linkage_name ())
>>> - hash_values[idx].minsym_demangled_hash
>>> - = search_name_hash (msym->language (), msym->search_name ());
>>> - }
>>> - {
>>> - /* To limit how long we hold the lock, we only acquire it here
>>> - and not while we demangle the names above. */
>>> -#if CXX_STD_THREAD
>>> - std::lock_guard<std::mutex> guard (demangled_mutex);
>>> -#endif
>>> - for (minimal_symbol *msym = start; msym < end; ++msym)
>>> - {
>>> - size_t idx = msym - msymbols;
>>> - msym->compute_and_set_names
>>> - (std::string_view (msym->linkage_name (),
>>> - hash_values[idx].name_length),
>>> - false,
>>> - m_objfile->per_bfd,
>>> - hash_values[idx].mangled_name_hash);
>>> - }
>>> - }
>>> - });
>>> +
>>> + gdb::parallel_for_each<1000, minimal_symbol *, minimal_symbol_install_worker>
>>> + (&msymbols[0], &msymbols[mcount], msymbols,
>>> + gdb::array_view<computed_hash_values> (hash_values),
>>> + m_objfile->per_bfd, demangled_mutex);
>>>
>>> build_minimal_symbol_hash_tables (m_objfile, hash_values);
>>> }
>>> diff --git a/gdb/unittests/parallel-for-selftests.c b/gdb/unittests/parallel-for-selftests.c
>>> index f5456141ff6e..86bf06c073a6 100644
>>> --- a/gdb/unittests/parallel-for-selftests.c
>>> +++ b/gdb/unittests/parallel-for-selftests.c
>>> @@ -91,12 +91,30 @@ test_one (int n_threads, do_foreach_t do_foreach)
>>> static void
>>> test_parallel_for_each ()
>>> {
>>> + struct test_worker
>>> + {
>>> + /* DUMMY is there to test passing multiple arguments to the worker
>>> + constructor. */
>>> + test_worker (foreach_callback_t callback, int dummy)
>>> + : m_callback (callback)
>>> + {
>>> + }
>>> +
>>> + void operator() (int first, int last) noexcept
>>> + {
>>> + return m_callback (first, last);
>>> + }
>>> +
>>> + private:
>>> + foreach_callback_t m_callback;
>>> + };
>>> +
>>> const std::vector<do_foreach_t> for_each_functions
>>> {
>>> [] (int start, int end, foreach_callback_t callback)
>>> - { gdb::parallel_for_each<1> (start, end, callback); },
>>> + { gdb::parallel_for_each<1, int, test_worker> (start, end, callback, 0); },
>>> [] (int start, int end, foreach_callback_t callback)
>>> - { gdb::sequential_for_each (start, end, callback);}
>>> + { gdb::sequential_for_each<int, test_worker> (start, end, callback, 0); },
>>> };
>>>
>>> for (int n_threads : { 0, 1, 3 })
>>> diff --git a/gdbsupport/parallel-for.h b/gdbsupport/parallel-for.h
>>> index b74c8068cf2c..bb41eb8700f0 100644
>>> --- a/gdbsupport/parallel-for.h
>>> +++ b/gdbsupport/parallel-for.h
>>> @@ -21,115 +21,107 @@
>>> #define GDBSUPPORT_PARALLEL_FOR_H
>>>
>>> #include <algorithm>
>>> -#include <type_traits>
>>> +#include <tuple>
>>> #include "gdbsupport/thread-pool.h"
>>> -#include "gdbsupport/function-view.h"
>>>
>>> namespace gdb
>>> {
>>>
>>> -/* A very simple "parallel for". This splits the range of iterators
>>> - into subranges, and then passes each subrange to the callback. The
>>> - work may or may not be done in separate threads.
>>> +/* A "parallel-for" implementation using a shared work queue. Work items get
>>> + popped in batches of size up to BATCH_SIZE from the queue and handed out to
>>> + worker threads.
>>>
>>> - This approach was chosen over having the callback work on single
>>> - items because it makes it simple for the caller to do
>>> - once-per-subrange initialization and destruction.
>>> + Each worker thread instantiates an object of type Worker, forwarding ARGS to
>>> + its constructor. The Worker object can be used to keep some per-worker
>>> + thread state.
>>>
>>> - The parameter N says how batching ought to be done -- there will be
>>> - at least N elements processed per thread. Setting N to 0 is not
>>> - allowed. */
>>> + Worker threads call Worker::operator() repeatedly until the queue is
>>> + empty. */
>>>
>>> -template<std::size_t n, class RandomIt, class RangeFunction>
>>> +template<std::size_t batch_size, class RandomIt, class Worker,
>>> + class... WorkerArgs>
>>> void
>>> -parallel_for_each (RandomIt first, RandomIt last, RangeFunction callback)
>>> +parallel_for_each (const RandomIt first, const RandomIt last,
>>> + WorkerArgs &&...worker_args)
>>> {
>>> /* If enabled, print debug info about how the work is distributed across
>>> the threads. */
>>> const bool parallel_for_each_debug = false;
>>>
>>> - size_t n_worker_threads = thread_pool::g_thread_pool->thread_count ();
>>> - size_t n_threads = n_worker_threads;
>>> - size_t n_elements = last - first;
>>> - size_t elts_per_thread = 0;
>>> - size_t elts_left_over = 0;
>>> + gdb_assert (first <= last);
>>>
>>> - if (n_threads > 1)
>>> + if (parallel_for_each_debug)
>>> {
>>> - /* Require that there should be at least N elements in a
>>> - thread. */
>>> - gdb_assert (n > 0);
>>> - if (n_elements / n_threads < n)
>>> - n_threads = std::max (n_elements / n, (size_t) 1);
>>> - elts_per_thread = n_elements / n_threads;
>>> - elts_left_over = n_elements % n_threads;
>>> - /* n_elements == n_threads * elts_per_thread + elts_left_over. */
>>> + debug_printf ("Parallel for: n elements: %zu\n",
>>> + static_cast<std::size_t> (last - first));
>>> + debug_printf ("Parallel for: batch size: %zu\n", batch_size);
>>> }
>>>
>>> - size_t count = n_threads == 0 ? 0 : n_threads - 1;
>>> + const size_t n_worker_threads
>>> + = std::max<size_t> (thread_pool::g_thread_pool->thread_count (), 1);
>>> std::vector<gdb::future<void>> results;
>>>
>>> - if (parallel_for_each_debug)
>>> - {
>>> - debug_printf (_("Parallel for: n_elements: %zu\n"), n_elements);
>>> - debug_printf (_("Parallel for: minimum elements per thread: %zu\n"), n);
>>> - debug_printf (_("Parallel for: elts_per_thread: %zu\n"), elts_per_thread);
>>> - }
>>> + /* The next item to hand out. */
>>> + std::atomic<RandomIt> next = first;
>>>
>>> - for (int i = 0; i < count; ++i)
>>> - {
>>> - RandomIt end;
>>> - end = first + elts_per_thread;
>>> - if (i < elts_left_over)
>>> - /* Distribute the leftovers over the worker threads, to avoid having
>>> - to handle all of them in a single thread. */
>>> - end++;
>>> -
>>> - /* This case means we don't have enough elements to really
>>> - distribute them. Rather than ever submit a task that does
>>> - nothing, we short-circuit here. */
>>> - if (first == end)
>>> - end = last;
>>> -
>>> - if (end == last)
>>> - {
>>> - /* We're about to dispatch the last batch of elements, which
>>> - we normally process in the main thread. So just truncate
>>> - the result list here. This avoids submitting empty tasks
>>> - to the thread pool. */
>>> - count = i;
>>> - break;
>>> - }
>>> + /* The worker thread task.
>>> +
>>> + We need to capture args as a tuple, because it's not possible to capture
>>> + the parameter pack directly in C++17. Once we migrate to C++20, the
>>> + capture can be simplified to:
>>>
>>> - if (parallel_for_each_debug)
>>> + ... args = std::forward<Args>(args)
>>> +
>>> + and `args` can be used as-is in the lambda. */
>>> + auto args_tuple
>>> + = std::forward_as_tuple (std::forward<WorkerArgs> (worker_args)...);
>>> + auto task =
>>> + [&next, first, last, n_worker_threads, &args_tuple] () {
>>> + /* Instantiate the user-defined worker. */
>>> + auto worker = std::make_from_tuple<Worker> (args_tuple);
>>> +
>>> + for (;;)
>>> {
>>> - debug_printf (_("Parallel for: elements on worker thread %i\t: %zu"),
>>> - i, (size_t)(end - first));
>>> - debug_printf (_("\n"));
>>> + /* Grab a snapshot of NEXT. */
>>> + auto local_next = next.load ();
>>> + gdb_assert (local_next <= last);
>>> +
>>> + /* Number of remaining items. */
>>> + auto n_remaining = last - local_next;
>>> + gdb_assert (n_remaining >= 0);
>>> +
>>> + /* Are we done? */
>>> + if (n_remaining == 0)
>>> + break;
>>> +
>>> + const auto this_batch_size
>>> + = std::min<std::size_t> (batch_size, n_remaining);
>>> +
>>> + /* The range to process in this iteration. */
>>> + const auto this_batch_first = local_next;
>>> + const auto this_batch_last = local_next + this_batch_size;
>>> +
>>> + /* Update NEXT. If the current value of NEXT doesn't match
>>> + LOCAL_NEXT, it means another thread updated it concurrently,
>>> + restart. */
>>> + if (!next.compare_exchange_weak (local_next, this_batch_last))
>>> + continue;
>>> +
>>> + if (parallel_for_each_debug)
>>> + debug_printf ("Processing %zu items, range [%zu, %zu[\n",
>>> + this_batch_size,
>>> + static_cast<size_t> (this_batch_first - first),
>>> + static_cast<size_t> (this_batch_last - first));
>>> +
>>> + worker (this_batch_first, this_batch_last);
>>> }
>>> - results.push_back (gdb::thread_pool::g_thread_pool->post_task ([=] ()
>>> - {
>>> - return callback (first, end);
>>> - }));
>>> - first = end;
>>> - }
>>> -
>>> - for (int i = count; i < n_worker_threads; ++i)
>>> - if (parallel_for_each_debug)
>>> - {
>>> - debug_printf (_("Parallel for: elements on worker thread %i\t: 0"), i);
>>> - debug_printf (_("\n"));
>>> - }
>>> + };
>>>
>>> - /* Process all the remaining elements in the main thread. */
>>> - if (parallel_for_each_debug)
>>> - {
>>> - debug_printf (_("Parallel for: elements on main thread\t\t: %zu"),
>>> - (size_t)(last - first));
>>> - debug_printf (_("\n"));
>>> - }
>>> - callback (first, last);
>>> + /* Start N_WORKER_THREADS tasks. */
>>> + for (int i = 0; i < n_worker_threads; ++i)
>>> + results.push_back (gdb::thread_pool::g_thread_pool->post_task (task));
>>>
>>> + /* Wait for all of them to be finished. */
>>> for (auto &fut : results)
>>> fut.get ();
>>> }
>>> @@ -138,11 +130,11 @@ parallel_for_each (RandomIt first, RandomIt last, RangeFunction callback)
>>> when debugging multi-threading behavior, and you want to limit
>>> multi-threading in a fine-grained way. */
>>>
>>> -template<class RandomIt, class RangeFunction>
>>> +template<class RandomIt, class Worker, class... WorkerArgs>
>>> void
>>> -sequential_for_each (RandomIt first, RandomIt last, RangeFunction callback)
>>> +sequential_for_each (RandomIt first, RandomIt last, WorkerArgs &&...worker_args)
>>> {
>>> - callback (first, last);
>>> + Worker (std::forward<WorkerArgs> (worker_args)...) (first, last);
>>> }
>>>
>>> }
>>>
>>> base-commit: b7ff16c68a2c0bacc0416c4b36a44e65888ce72b
>>> --
>>> 2.50.0
>>>
>>
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 1/4] gdbsupport: use dynamic partitioning in gdb::parallel_for_each
2025-07-03 20:01 [PATCH v2 1/4] gdbsupport: use dynamic partitioning in gdb::parallel_for_each simon.marchi
` (3 preceding siblings ...)
2025-08-05 17:01 ` [PATCH v2 1/4] gdbsupport: use dynamic partitioning in gdb::parallel_for_each Simon Marchi
@ 2025-09-18 18:59 ` Tom Tromey
2025-09-19 17:47 ` Simon Marchi
4 siblings, 1 reply; 18+ messages in thread
From: Tom Tromey @ 2025-09-18 18:59 UTC (permalink / raw)
To: simon.marchi; +Cc: gdb-patches
>>>>> "Simon" == simon marchi <simon.marchi@polymtl.ca> writes:
Simon> From: Simon Marchi <simon.marchi@polymtl.ca>
Simon> gdb::parallel_for_each uses static partitioning of the workload, meaning
Simon> that each worker thread receives a similar number of work items. Change
Simon> it to use dynamic partitioning, where worker threads pull work items
Simon> from a shared work queue when they need to.
Finally getting back to this.
Simon> Note that gdb::parallel_for_each is currently only used for processing
Simon> minimal symbols in GDB. I am looking at improving the startup
Simon> performance of GDB, where the minimal symbol process is one step.
I do wonder if this functionality should just be intrinsic to
thread-pool somehow, like separate the notions of a "task" and a "job
queue".
But anyway I think we should move forward with what exists.
Simon> + auto task =
Simon> + [&next, first, last, n_worker_threads, &args_tuple] () {
I think the "{" should go on the next line.
Approved-By: Tom Tromey <tom@tromey.com>
Tom
* Re: [PATCH v2 2/4] gdbsupport: factor out work queue from parallel-for.h
2025-07-03 20:01 ` [PATCH v2 2/4] gdbsupport: factor out work queue from parallel-for.h simon.marchi
@ 2025-09-18 19:07 ` Tom Tromey
2025-09-19 18:06 ` Simon Marchi
0 siblings, 1 reply; 18+ messages in thread
From: Tom Tromey @ 2025-09-18 19:07 UTC (permalink / raw)
To: simon.marchi; +Cc: gdb-patches, Simon Marchi
>>>>> "Simon" == simon marchi <simon.marchi@polymtl.ca> writes:
Simon> + /* Pop a batch of work items.
Simon> +
Simon> + The return value is a pair containing iterators to the first and last
Simon> + (exclusive) work items. */
Simon> + std::pair<RandomIt, RandomIt> pop_batch () noexcept
How about using an iterator_range<> here instead?
It's basically a pair but a bit more specialized.
Tom
* Re: [PATCH v2 3/4] gdbsupport: add async parallel_for_each version
2025-07-03 20:01 ` [PATCH v2 3/4] gdbsupport: add async parallel_for_each version simon.marchi
@ 2025-09-18 19:16 ` Tom Tromey
2025-09-19 20:12 ` Simon Marchi
0 siblings, 1 reply; 18+ messages in thread
From: Tom Tromey @ 2025-09-18 19:16 UTC (permalink / raw)
To: simon.marchi; +Cc: gdb-patches
>>>>> "Simon" == simon marchi <simon.marchi@polymtl.ca> writes:
Simon> From: Simon Marchi <simon.marchi@polymtl.ca>
Simon> New in v2: make the global variable constexpr.
Simon> I would like to use gdb::parallel_for_each to implement the parallelism
Simon> of the DWARF unit indexing. However, the existing implementation of
Simon> gdb::parallel_for_each is blocking, which doesn't work with the model
Simon> used by the DWARF indexer, which is asynchronous and callback-based.
Simon> Add an asynchronous version of gdb::parallel_for_each that will be
Simon> suitable for this task.
Simon> This new version accepts a callback that is invoked when the parallel
Simon> for each is complete.
This looks good to me.
Simon> + const auto [batch_first, batch_last] = state->queue.pop_batch ();
Simon> +
Simon> + if (batch_first == batch_last)
Simon> + break;
If this was using iterator_range, it'd be easy enough to add an 'empty'
method to that object.
Tom
* Re: [PATCH v2 4/4] gdb/dwarf: use dynamic partitioning for DWARF CU indexing
2025-07-03 20:01 ` [PATCH v2 4/4] gdb/dwarf: use dynamic partitioning for DWARF CU indexing simon.marchi
@ 2025-09-18 19:26 ` Tom Tromey
2025-09-18 19:46 ` Simon Marchi
0 siblings, 1 reply; 18+ messages in thread
From: Tom Tromey @ 2025-09-18 19:26 UTC (permalink / raw)
To: simon.marchi; +Cc: gdb-patches, Simon Marchi
>>>>> "Simon" == simon marchi <simon.marchi@polymtl.ca> writes:
Simon> + /* Launch parallel tasks to index units. */
Simon> + gdb::parallel_for_each_async<1, dwarf2_per_cu_up *, parallel_indexing_worker>
Simon> + (per_bfd->all_units.data (),
Simon> + per_bfd->all_units.data () + per_bfd->all_units.size (),
Simon> + [this] () { this->done_reading (); }, "DWARF indexing worker", this);
Simon> }
I was curious to know why this uses .data() rather than iterators.
Tom
* Re: [PATCH v2 4/4] gdb/dwarf: use dynamic partitioning for DWARF CU indexing
2025-09-18 19:26 ` Tom Tromey
@ 2025-09-18 19:46 ` Simon Marchi
2025-09-18 19:53 ` Tom Tromey
0 siblings, 1 reply; 18+ messages in thread
From: Simon Marchi @ 2025-09-18 19:46 UTC (permalink / raw)
To: Tom Tromey, simon.marchi; +Cc: gdb-patches
On 9/18/25 3:26 PM, Tom Tromey wrote:
>>>>>> "Simon" == simon marchi <simon.marchi@polymtl.ca> writes:
>
> Simon> + /* Launch parallel tasks to index units. */
> Simon> + gdb::parallel_for_each_async<1, dwarf2_per_cu_up *, parallel_indexing_worker>
> Simon> + (per_bfd->all_units.data (),
> Simon> + per_bfd->all_units.data () + per_bfd->all_units.size (),
> Simon> + [this] () { this->done_reading (); }, "DWARF indexing worker", this);
> Simon> }
>
> I was curious to know why this uses .data() rather than iterators.
>
> Tom
Yeah, that's unfortunate. It's because
std::atomic<std::vector<...>::iterator> won't work with
-D_GLIBCXX_DEBUG. If I try to use iterators all the way, I get:
/opt/gcc/15.2.0/include/c++/15.2.0/atomic:214:21: error: static assertion failed: std::atomic requires a trivially copyable type
214 | static_assert(__is_trivially_copyable(_Tp),
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
Without _GLIBCXX_DEBUG, vector iterators boil down to just a pointer.
But with _GLIBCXX_DEBUG, there is additional state.
I still want to be able to use _GLIBCXX_DEBUG, it's pretty useful.
Simon
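The constraint Simon describes can be checked directly: `std::atomic<T>` requires a trivially copyable `T`, which a raw pointer always is, while a `_GLIBCXX_DEBUG` vector iterator (carrying extra bookkeeping state) is not. A short sketch, with `advance_cursor` as a hypothetical helper name:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

/* Raw pointers are always trivially copyable, so std::atomic accepts
   them regardless of build mode -- which is why the patch uses data()
   pointers rather than vector iterators for the shared cursor.  */
static_assert (std::is_trivially_copyable_v<int *>,
	       "raw pointers always satisfy std::atomic's requirement");

/* Atomically claim N elements, returning the old cursor position.  */
inline int *
advance_cursor (std::atomic<int *> &cursor, std::ptrdiff_t n)
{
  return cursor.fetch_add (n);
}
```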
* Re: [PATCH v2 4/4] gdb/dwarf: use dynamic partitioning for DWARF CU indexing
2025-09-18 19:46 ` Simon Marchi
@ 2025-09-18 19:53 ` Tom Tromey
2025-09-19 20:15 ` Simon Marchi
0 siblings, 1 reply; 18+ messages in thread
From: Tom Tromey @ 2025-09-18 19:53 UTC (permalink / raw)
To: Simon Marchi; +Cc: Tom Tromey, simon.marchi, gdb-patches
>>>>> "Simon" == Simon Marchi <simon.marchi@efficios.com> writes:
Simon> Yeah, that's unfortunate. It's because
Simon> std::atomic<std::vector<...>::iterator> won't work with
Simon> -D_GLIBCXX_DEBUG. If I try to use iterators all the way, I get:
Simon> /opt/gcc/15.2.0/include/c++/15.2.0/atomic:214:21: error: static assertion failed: std::atomic requires a trivially copyable type
Simon> 214 | static_assert(__is_trivially_copyable(_Tp),
Simon> | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
Simon> Without _GLIBCXX_DEBUG, vector iterators boil down to just a pointer.
Simon> But with _GLIBCXX_DEBUG, there is additional state.
Simon> I still want to be able to use _GLIBCXX_DEBUG, it's pretty useful.
Maybe a comment to this effect would be nice to have.
thanks,
Tom
* Re: [PATCH v2 1/4] gdbsupport: use dynamic partitioning in gdb::parallel_for_each
2025-09-18 18:59 ` Tom Tromey
@ 2025-09-19 17:47 ` Simon Marchi
2025-09-19 18:04 ` Tom Tromey
0 siblings, 1 reply; 18+ messages in thread
From: Simon Marchi @ 2025-09-19 17:47 UTC (permalink / raw)
To: Tom Tromey; +Cc: gdb-patches
On 9/18/25 2:59 PM, Tom Tromey wrote:
>>>>>> "Simon" == simon marchi <simon.marchi@polymtl.ca> writes:
>
> Simon> From: Simon Marchi <simon.marchi@polymtl.ca>
> Simon> gdb::parallel_for_each uses static partitioning of the workload, meaning
> Simon> that each worker thread receives a similar number of work items. Change
> Simon> it to use dynamic partitioning, where worker threads pull work items
> Simon> from a shared work queue when they need to.
>
> Finally getting back to this.
>
> Simon> Note that gdb::parallel_for_each is currently only used for processing
> Simon> minimal symbols in GDB. I am looking at improving the startup
> Simon> performance of GDB, where the minimal symbol process is one step.
>
> I do wonder if this functionality should just be intrinsic to
> thread-pool somehow, like separate the notions of a "task" and a "job
> queue".
I think you'll have to talk me through it, because I don't immediately
see what you mean (perhaps because it's been a while since I worked on
this).
> But anyway I think we should move forward with what exists.
>
> Simon> + auto task =
> Simon> + [&next, first, last, n_worker_threads, &args_tuple] () {
>
> I think the "{" should go on the next line.
Ok, changed it to:
auto task = [&next, first, last, n_worker_threads, &args_tuple] ()
{
/* Instantiate the user-defined worker. */
auto worker = std::make_from_tuple<Worker> (args_tuple);
> Approved-By: Tom Tromey <tom@tromey.com>
Thanks,
Simon
* Re: [PATCH v2 1/4] gdbsupport: use dynamic partitioning in gdb::parallel_for_each
2025-09-19 17:47 ` Simon Marchi
@ 2025-09-19 18:04 ` Tom Tromey
0 siblings, 0 replies; 18+ messages in thread
From: Tom Tromey @ 2025-09-19 18:04 UTC (permalink / raw)
To: Simon Marchi; +Cc: Tom Tromey, gdb-patches
>> I do wonder if this functionality should just be intrinsic to
>> thread-pool somehow, like separate the notions of a "task" and a "job
>> queue".
Simon> I think you'll have to talk me through it, because I don't immediately
Simon> see what you mean (perhaps because it's been a while since I worked on
Simon> this).
I just mean, instead of having this partitioning being a feature of
parallel_for_each, maybe this is something that should just be directly
in thread-pool, since it seems to be the only way that threads are
actually used.
I wouldn't worry about it though. It's better to just go with what we
have. We can refactor it again later if it seems worthwhile.
Tom
* Re: [PATCH v2 2/4] gdbsupport: factor out work queue from parallel-for.h
2025-09-18 19:07 ` Tom Tromey
@ 2025-09-19 18:06 ` Simon Marchi
0 siblings, 0 replies; 18+ messages in thread
From: Simon Marchi @ 2025-09-19 18:06 UTC (permalink / raw)
To: Tom Tromey, simon.marchi; +Cc: gdb-patches
On 9/18/25 3:07 PM, Tom Tromey wrote:
>>>>>> "Simon" == simon marchi <simon.marchi@polymtl.ca> writes:
>
> Simon> + /* Pop a batch of work items.
> Simon> +
> Simon> + The return value is a pair containing iterators to the first and last
> Simon> + (exclusive) work items. */
> Simon> + std::pair<RandomIt, RandomIt> pop_batch () noexcept
>
> How about using an iterator_range<> here instead?
> It's basically a pair but a bit more specialized.
I think it's a good idea. I'm also experimenting in using
iterator_range in the parallel_for_each interface. That is, make the
worker function receive an iterator_range, rather than separate
begin/end parameters. That makes it even easier to use a ranged for
loop in the worker, which is nice.
Simon
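The interface change discussed here can be sketched with a minimal stand-in for GDB's range type (this is not the actual `gdb::iterator_range`; `simple_iterator_range` and `sum_range` are names invented for the example): the worker receives one range object instead of separate begin/end iterators, so it can use a ranged for loop, and `empty ()` replaces the `first == last` check.

```cpp
#include <cassert>
#include <cstddef>

/* Minimal iterator-range stand-in: just a begin/end pair with the
   convenience methods the discussion mentions.  */
template<typename It>
class simple_iterator_range
{
public:
  simple_iterator_range (It first, It last)
    : m_first (first), m_last (last)
  {}

  It begin () const { return m_first; }
  It end () const { return m_last; }
  bool empty () const { return m_first == m_last; }
  std::size_t size () const { return m_last - m_first; }

private:
  It m_first, m_last;
};

/* A worker written against the range interface.  */
inline long
sum_range (simple_iterator_range<const int *> batch)
{
  long total = 0;
  for (int value : batch)	/* Ranged for, enabled by begin()/end().  */
    total += value;
  return total;
}
```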
* Re: [PATCH v2 3/4] gdbsupport: add async parallel_for_each version
2025-09-18 19:16 ` Tom Tromey
@ 2025-09-19 20:12 ` Simon Marchi
0 siblings, 0 replies; 18+ messages in thread
From: Simon Marchi @ 2025-09-19 20:12 UTC (permalink / raw)
To: Tom Tromey; +Cc: gdb-patches
On 9/18/25 3:16 PM, Tom Tromey wrote:
>>>>>> "Simon" == simon marchi <simon.marchi@polymtl.ca> writes:
>
> Simon> From: Simon Marchi <simon.marchi@polymtl.ca>
> Simon> New in v2: make the global variable constexpr.
>
> Simon> I would like to use gdb::parallel_for_each to implement the parallelism
> Simon> of the DWARF unit indexing. However, the existing implementation of
> Simon> gdb::parallel_for_each is blocking, which doesn't work with the model
> Simon> used by the DWARF indexer, which is asynchronous and callback-based.
> Simon> Add an asynchronous version of gdb::parallel_for_each that will be
> Simon> suitable for this task.
>
> Simon> This new version accepts a callback that is invoked when the parallel
> Simon> for each is complete.
>
> This looks good to me.
>
> Simon> + const auto [batch_first, batch_last] = state->queue.pop_batch ();
> Simon> +
> Simon> + if (batch_first == batch_last)
> Simon> + break;
>
> If this was using iterator_range, it'd be easy enough to add an 'empty'
> method to that object.
I did that. I also changed the interface so that the workers receive an
iterator_range, I'll include that in v3.
Simon
* Re: [PATCH v2 4/4] gdb/dwarf: use dynamic partitioning for DWARF CU indexing
2025-09-18 19:53 ` Tom Tromey
@ 2025-09-19 20:15 ` Simon Marchi
0 siblings, 0 replies; 18+ messages in thread
From: Simon Marchi @ 2025-09-19 20:15 UTC (permalink / raw)
To: Tom Tromey, Simon Marchi; +Cc: gdb-patches
On 9/18/25 3:53 PM, Tom Tromey wrote:
>>>>>> "Simon" == Simon Marchi <simon.marchi@efficios.com> writes:
>
> Simon> Yeah, that's unfortunate. It's because
> Simon> std::atomic<std::vector<...>::iterator> won't work with
> Simon> -D_GLIBCXX_DEBUG. If I try to use iterators all the way, I get:
>
> Simon> /opt/gcc/15.2.0/include/c++/15.2.0/atomic:214:21: error: static assertion failed: std::atomic requires a trivially copyable type
> Simon> 214 | static_assert(__is_trivially_copyable(_Tp),
> Simon> | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Simon> Without _GLIBCXX_DEBUG, vector iterators boil down to just a pointer.
> Simon> But with _GLIBCXX_DEBUG, there is additional state.
>
> Simon> I still want to be able to use _GLIBCXX_DEBUG; it's pretty useful.
>
> Maybe a comment to this effect would be nice to have.
>
> thanks,
> Tom
Ok, added.
Simon