From: Simon Marchi <simon.marchi@efficios.com>
To: gdb-patches@sourceware.org
Cc: Simon Marchi
Subject: [PATCH 6/6] gdb/dwarf: use dynamic partitioning for DWARF CU indexing
Date: Mon, 5 May 2025 16:15:30 -0400
Message-ID: <20250505201548.184917-6-simon.marchi@efficios.com>
X-Mailer: git-send-email 2.49.0
In-Reply-To: <20250505201548.184917-1-simon.marchi@efficios.com>
References: <20250505201548.184917-1-simon.marchi@efficios.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

From: Simon Marchi <simon.marchi@efficios.com>

The DWARF indexer splits the work statically based on the unit sizes,
attempting to give each worker thread about the same number of bytes to
process.  This works relatively well with standard compilation.  For
instance, I see this when loading a big program:

Time for "DWARF indexing worker": wall 0.503, user 0.419, sys 0.077, user+sys 0.496, 98.6 % CPU
Time for "DWARF indexing worker": wall 0.540, user 0.450, sys 0.081, user+sys 0.531, 98.3 % CPU
Time for "DWARF indexing worker": wall 0.573, user 0.480, sys 0.087, user+sys 0.567, 99.0 % CPU
Time for "DWARF indexing worker": wall 0.577, user 0.480, sys 0.090, user+sys 0.570, 98.8 % CPU
Time for "DWARF indexing worker": wall 0.580, user 0.473, sys 0.097, user+sys 0.570, 98.3 % CPU
Time for "DWARF indexing worker": wall 0.587, user 0.487, sys 0.082, user+sys 0.569, 96.9 % CPU
Time for "DWARF indexing worker": wall 0.592, user 0.496, sys 0.081, user+sys 0.577, 97.5 % CPU
Time for "DWARF indexing worker": wall 0.599, user 0.499, sys 0.086, user+sys 0.585, 97.7 % CPU

The "wall" times of all threads are relatively close, meaning the
workload balance is not too bad.  However, when compiling with DWO
files (-gsplit-dwarf), it's not as good:

Time for "DWARF indexing worker": wall 0.201, user 0.125, sys 0.053, user+sys 0.178, 88.6 % CPU
Time for "DWARF indexing worker": wall 0.232, user 0.158, sys 0.056, user+sys 0.214, 92.2 % CPU
Time for "DWARF indexing worker": wall 0.413, user 0.311, sys 0.081, user+sys 0.392, 94.9 % CPU
Time for "DWARF indexing worker": wall 0.523, user 0.400, sys 0.101, user+sys 0.501, 95.8 % CPU
Time for "DWARF indexing worker": wall 0.669, user 0.526, sys 0.125, user+sys 0.651, 97.3 % CPU
Time for "DWARF indexing worker": wall 0.860, user 0.699, sys 0.138, user+sys 0.837, 97.3 % CPU
Time for "DWARF indexing worker": wall 0.968, user 0.782, sys 0.169, user+sys 0.951, 98.2 % CPU
Time for "DWARF indexing worker": wall 1.107, user 0.916, sys 0.166, user+sys 1.082, 97.7 % CPU

Note how the wall times vary from 0.2 to 1.1 seconds.  This is
undesirable, because that indexing step takes as long as the slowest
worker thread.

With DWO files, the split of the workload by size doesn't work, because
it is done using the size of the skeleton units in the main file, which
is not representative of how much DWARF is contained in each DWO file.
I haven't tried it, but a similar problem could occur with cross-unit
imports, which can happen with dwz or LTO.  You could have a small unit
that imports a lot from other units, in which case the size of the
units is not representative of the work to accomplish.

To try to improve this situation, change the DWARF indexer to use
dynamic partitioning, using gdb::parallel_for_each_async.  With this,
each worker thread pops one unit at a time from a shared work queue and
processes it, until the queue is empty.  That should result in a more
balanced workload split.  I chose 1 as the minimum batch size here
because I judged that indexing one CU was a big enough piece of work
compared to the synchronization overhead of the queue.  That can always
be tweaked later if someone wants to do more tests.
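For illustration only, here is a minimal, self-contained sketch of the
dynamic-partitioning idea (this is not the gdb::parallel_for_each_async
implementation; every name in it is made up for the example): each
thread repeatedly claims the next batch from a shared atomic cursor, so
one expensive unit cannot leave the other threads idle.

  #include <algorithm>
  #include <atomic>
  #include <cstddef>
  #include <thread>
  #include <vector>

  /* Hypothetical helper: run WORKER over ITEMS with dynamic
     partitioning, claiming BATCH_SIZE items at a time from a shared
     cursor.  */
  template<typename T, typename Worker>
  void
  parallel_for_each_dynamic (std::vector<T> &items, size_t n_threads,
                             size_t batch_size, Worker worker)
  {
    std::atomic<size_t> next (0);

    auto thread_main = [&] ()
      {
        while (true)
          {
            /* Atomically claim the next batch of items.  */
            size_t first = next.fetch_add (batch_size);
            if (first >= items.size ())
              break;
            size_t last = std::min (first + batch_size, items.size ());
            for (size_t i = first; i < last; ++i)
              worker (items[i]);
          }
      };

    std::vector<std::thread> threads;
    for (size_t i = 0; i < n_threads; ++i)
      threads.emplace_back (thread_main);
    for (auto &t : threads)
      t.join ();
  }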
With the standard compilation case, the balance is a bit better, but
the time is similar to before:

Time for "DWARF indexing worker": wall 0.571, user 0.490, sys 0.071, user+sys 0.561, 98.2 % CPU
Time for "DWARF indexing worker": wall 0.571, user 0.463, sys 0.097, user+sys 0.560, 98.1 % CPU
Time for "DWARF indexing worker": wall 0.571, user 0.498, sys 0.066, user+sys 0.564, 98.8 % CPU
Time for "DWARF indexing worker": wall 0.572, user 0.473, sys 0.087, user+sys 0.560, 97.9 % CPU
Time for "DWARF indexing worker": wall 0.569, user 0.477, sys 0.084, user+sys 0.561, 98.6 % CPU
Time for "DWARF indexing worker": wall 0.572, user 0.476, sys 0.079, user+sys 0.555, 97.0 % CPU
Time for "DWARF indexing worker": wall 0.572, user 0.482, sys 0.079, user+sys 0.561, 98.1 % CPU
Time for "DWARF indexing worker": wall 0.582, user 0.484, sys 0.083, user+sys 0.567, 97.4 % CPU

However, for the DWO case, it's much better:

Time for "DWARF indexing worker": wall 0.686, user 0.496, sys 0.108, user+sys 0.604, 88.0 % CPU
Time for "DWARF indexing worker": wall 0.686, user 0.513, sys 0.094, user+sys 0.607, 88.5 % CPU
Time for "DWARF indexing worker": wall 0.686, user 0.480, sys 0.120, user+sys 0.600, 87.5 % CPU
Time for "DWARF indexing worker": wall 0.686, user 0.500, sys 0.117, user+sys 0.617, 89.9 % CPU
Time for "DWARF indexing worker": wall 0.687, user 0.476, sys 0.133, user+sys 0.609, 88.6 % CPU
Time for "DWARF indexing worker": wall 0.687, user 0.509, sys 0.104, user+sys 0.613, 89.2 % CPU
Time for "DWARF indexing worker": wall 0.687, user 0.507, sys 0.109, user+sys 0.616, 89.7 % CPU
Time for "DWARF indexing worker": wall 0.688, user 0.497, sys 0.104, user+sys 0.601, 87.4 % CPU

The state previously kept as local variables in
cooked_index_worker_debug_info::process_units becomes fields of the new
parallel worker object.  The work previously done for each unit in
cooked_index_worker_debug_info::process_units becomes the operator() of
the new parallel worker object.  The work previously done at the end of
cooked_index_worker_debug_info::process_units (including calling
bfd_thread_cleanup) becomes the destructor of the new parallel worker
object.  The "done" callback of gdb::task_group becomes the "done"
callback of gdb::parallel_for_each_async.  (A generic sketch of this
worker-object pattern is shown after the patch.)

Change-Id: I5dc5cf8793abe9ebe2659e78da38ffc94289e5f2
---
 gdb/dwarf2/cooked-index-worker.h |   6 ++
 gdb/dwarf2/read.c                | 136 +++++++++++--------------------
 2 files changed, 54 insertions(+), 88 deletions(-)

diff --git a/gdb/dwarf2/cooked-index-worker.h b/gdb/dwarf2/cooked-index-worker.h
index df5c31d572df..0d8932503ab9 100644
--- a/gdb/dwarf2/cooked-index-worker.h
+++ b/gdb/dwarf2/cooked-index-worker.h
@@ -268,8 +268,14 @@ class cooked_index_worker
   /* The per-objfile object.  */
   dwarf2_per_objfile *m_per_objfile;
 
+  /* Result of each worker task.  */
   std::vector<cooked_index_worker_result> m_results;
+
+  /* Mutex to synchronize access to M_RESULTS when workers append their
+     result.  */
+  std::mutex m_results_mutex;
+
   /* Any warnings emitted.  For the time being at least, this only
      needed in do_reading, not in every worker.  Note that
      deferred_warnings uses gdb_stderr in its constructor, and this
diff --git a/gdb/dwarf2/read.c b/gdb/dwarf2/read.c
index f90b22781e0d..efba56ff1ec0 100644
--- a/gdb/dwarf2/read.c
+++ b/gdb/dwarf2/read.c
@@ -51,6 +51,7 @@
 #include "elf-bfd.h"
 #include "event-top.h"
 #include "exceptions.h"
+#include "gdbsupport/parallel-for.h"
 #include "gdbsupport/task-group.h"
 #include "maint.h"
 #include "symtab.h"
@@ -3244,12 +3245,6 @@ class cooked_index_worker_debug_info : public cooked_index_worker
   /* An iterator for the comp units.  */
   using unit_iterator = std::vector<dwarf2_per_cu_up>::iterator;
 
-  /* Process a batch of CUs.  This may be called multiple times in
-     separate threads.  TASK_NUMBER indicates which task this is --
-     the result is stored in that slot of M_RESULTS.  */
-  void process_units (size_t task_number, unit_iterator first,
-		      unit_iterator end);
-
   /* Process unit THIS_CU.  */
   void process_unit (dwarf2_per_cu *this_cu, dwarf2_per_objfile *per_objfile,
		      cooked_index_worker_result *storage);
@@ -3467,36 +3462,6 @@ cooked_index_worker_debug_info::process_skeletonless_type_units
     process_skeletonless_type_unit (unit, per_objfile, storage);
 }
 
-void
-cooked_index_worker_debug_info::process_units (size_t task_number,
-					       unit_iterator first,
-					       unit_iterator end)
-{
-  SCOPE_EXIT { bfd_thread_cleanup (); };
-
-  /* Ensure that complaints are handled correctly.  */
-  complaint_interceptor complaint_handler;
-
-  std::vector<gdb_exception> errors;
-  cooked_index_worker_result thread_storage;
-  for (auto inner = first; inner != end; ++inner)
-    {
-      dwarf2_per_cu *per_cu = inner->get ();
-
-      try
-	{
-	  process_unit (per_cu, m_per_objfile, &thread_storage);
-	}
-      catch (gdb_exception &except)
-	{
-	  thread_storage.note_error (std::move (except));
-	}
-    }
-
-  thread_storage.done_reading (complaint_handler.release ());
-  m_results[task_number] = std::move (thread_storage);
-}
-
 void
 cooked_index_worker_debug_info::done_reading ()
 {
@@ -3522,62 +3487,57 @@ cooked_index_worker_debug_info::do_reading ()
				     m_index_storage.get_addrmap (),
				     &m_warnings);
 
-  /* We want to balance the load between the worker threads.  This is
-     done by using the size of each CU as a rough estimate of how
-     difficult it will be to operate on.  This isn't ideal -- for
-     example if dwz is used, the early CUs will all tend to be
-     "included" and won't be parsed independently.  However, this
-     heuristic works well for typical compiler output.  */
-
-  size_t total_size = 0;
-  for (const auto &per_cu : per_bfd->all_units)
-    total_size += per_cu->length ();
-
-  /* How many worker threads we plan to use.  We may not actually use
-     this many.  We use 1 as the minimum to avoid division by zero,
-     and anyway in the N==0 case the work will be done
-     synchronously.  */
-  const size_t n_worker_threads
-    = std::max (gdb::thread_pool::g_thread_pool->thread_count (), (size_t) 1);
-
-  /* How much effort should be put into each worker.  */
-  const size_t size_per_thread
-    = std::max (total_size / n_worker_threads, (size_t) 1);
-
-  /* Work is done in a task group.  */
-  gdb::task_group workers ([this] ()
+  /* The task for parallel workers that index units.  */
+  struct parallel_indexing_worker
   {
-    this->done_reading ();
-  });
-
-  auto end = per_bfd->all_units.end ();
-  size_t task_count = 0;
-  for (auto iter = per_bfd->all_units.begin (); iter != end; )
+    explicit parallel_indexing_worker (cooked_index_worker_debug_info *parent)
+      : m_scoped_time_it ("DWARF indexing worker"),
+	m_parent (parent)
     {
-      auto last = iter;
-      /* Put all remaining CUs into the last task.  */
-      if (task_count == n_worker_threads - 1)
-	last = end;
-      else
-	{
-	  size_t chunk_size = 0;
-	  for (; last != end && chunk_size < size_per_thread; ++last)
-	    chunk_size += (*last)->length ();
-	}
-
-      gdb_assert (iter != last);
-      workers.add_task ([this, task_count, iter, last] ()
-	{
-	  scoped_time_it time_it ("DWARF indexing worker");
-	  process_units (task_count, iter, last);
-	});
-
-      ++task_count;
-      iter = last;
     }
 
-  m_results.resize (task_count);
-  workers.start ();
+    DISABLE_COPY_AND_ASSIGN (parallel_indexing_worker);
+
+    ~parallel_indexing_worker ()
+    {
+      bfd_thread_cleanup ();
+
+      m_thread_storage.done_reading (m_complaint_handler.release ());
+
+      std::lock_guard lock (m_parent->m_results_mutex);
+      m_parent->m_results.emplace_back (std::move (m_thread_storage));
+    }
+
+    void operator() (dwarf2_per_cu_up *first, dwarf2_per_cu_up *last)
+    {
+      for (auto &it : iterator_range (first, last))
+	{
+	  try
+	    {
+	      m_parent->process_unit (it.get (), m_parent->m_per_objfile,
+				      &m_thread_storage);
+	    }
+	  catch (gdb_exception &except)
+	    {
+	      m_thread_storage.note_error (std::move (except));
+	    }
+	}
+    }
+
+  private:
+    scoped_time_it m_scoped_time_it;
+
+    complaint_interceptor m_complaint_handler;
+    std::vector<gdb_exception> m_errors;
+    cooked_index_worker_result m_thread_storage;
+
+    cooked_index_worker_debug_info *m_parent;
+  };
+
+  gdb::parallel_for_each_async<1, dwarf2_per_cu_up *, parallel_indexing_worker>
+    (per_bfd->all_units.data (),
+     per_bfd->all_units.data () + per_bfd->all_units.size (),
+     [this] () { this->done_reading (); }, this);
 }
 
 static void
-- 
2.49.0
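As referenced in the commit message, here is a generic sketch of the
worker-object pattern the patch relies on (hypothetical names, not the
gdb API): per-thread state lives in the object's fields, operator() is
invoked for each range the thread claims, and the destructor publishes
the thread's result once the queue is drained.

  #include <mutex>
  #include <vector>

  /* Hypothetical shared output that worker objects append to.  */
  struct aggregate_results
  {
    std::mutex mutex;
    std::vector<int> per_thread_totals;
  };

  /* Hypothetical per-thread worker: one instance per thread.  */
  struct example_worker
  {
    explicit example_worker (aggregate_results *out) : m_out (out) {}

    /* Called repeatedly with the ranges this thread claims from the
       shared queue.  */
    void operator() (const int *first, const int *last)
    {
      for (const int *it = first; it != last; ++it)
        m_total += *it;
    }

    /* Runs once per thread, after the queue is empty: publish the
       thread's result under a lock.  */
    ~example_worker ()
    {
      std::lock_guard<std::mutex> lock (m_out->mutex);
      m_out->per_thread_totals.push_back (m_total);
    }

  private:
    int m_total = 0;
    aggregate_results *m_out;
  };

In the patch, parallel_indexing_worker plays this role: its fields
replace the locals of the removed process_units, its operator() indexes
the units it is handed, and its destructor moves the per-thread
cooked_index_worker_result into m_results under m_results_mutex.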