From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gdb-patches-bounces~public-inbox=simark.ca@sourceware.org>
Received: from simark.ca
	by simark.ca with LMTP
	id xu4lNUvjZmivgyoAWB0awg
	(envelope-from <gdb-patches-bounces~public-inbox=simark.ca@sourceware.org>)
	for <public-inbox@simark.ca>; Thu, 03 Jul 2025 16:08:43 -0400
Authentication-Results: simark.ca;
	dkim=pass (1024-bit key; unprotected) header.d=polymtl.ca header.i=@polymtl.ca header.a=rsa-sha256 header.s=default header.b=oZVk1C8Q;
	dkim-atps=neutral
Received: by simark.ca (Postfix, from userid 112)
	id CA9801E11E; Thu,  3 Jul 2025 16:08:43 -0400 (EDT)
X-Spam-Checker-Version: SpamAssassin 4.0.1 (2024-03-25) on simark.ca
X-Spam-Level: 
X-Spam-Status: No, score=-9.1 required=5.0 tests=ARC_SIGNED,ARC_VALID,BAYES_00,
	DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,MAILING_LIST_MULTI,
	RCVD_IN_DNSWL_MED,RCVD_IN_VALIDITY_CERTIFIED,RCVD_IN_VALIDITY_RPBL,
	RCVD_IN_VALIDITY_SAFE autolearn=unavailable autolearn_force=no
	version=4.0.1
Received: from server2.sourceware.org (server2.sourceware.org [8.43.85.97])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature ECDSA (prime256v1) server-digest SHA256)
	(No client certificate requested)
	by simark.ca (Postfix) with ESMTPS id D6A1B1E0C2
	for <public-inbox@simark.ca>; Thu,  3 Jul 2025 16:08:42 -0400 (EDT)
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 754913852126
	for <public-inbox@simark.ca>; Thu,  3 Jul 2025 20:08:42 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 754913852126
Authentication-Results: sourceware.org;
	dkim=pass (1024-bit key, unprotected) header.d=polymtl.ca header.i=@polymtl.ca header.a=rsa-sha256 header.s=default header.b=oZVk1C8Q
Received: from smtp.polymtl.ca (smtp.polymtl.ca [132.207.4.11])
 by sourceware.org (Postfix) with ESMTPS id AF303385274D
 for <gdb-patches@sourceware.org>; Thu,  3 Jul 2025 20:07:39 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org AF303385274D
Authentication-Results: sourceware.org;
 dmarc=pass (p=none dis=none) header.from=polymtl.ca
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=polymtl.ca
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org AF303385274D
Authentication-Results: server2.sourceware.org;
 arc=none smtp.remote-ip=132.207.4.11
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1751573259; cv=none;
 b=wEDHrbLO3ucOEtqiTFkoNoAlZZqEwfiosxbLXz9VxAY4Omla30kDEL7MP8yJRb/ecl9sWRsDap78J6JJe4fSH2qBhGDwMND0cmr5F8XUGLO/vMALH8cfSxisDJrP+T8hpw9kUCSPYwBIqA9MXyCCGh9JaO4jnPUqFDCzloZQg4w=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
 t=1751573259; c=relaxed/simple;
 bh=Mxqmg/BPQ9wjZ12FIH+m521ozToI+pok1XGYTckdy5c=;
 h=DKIM-Signature:From:To:Subject:Date:Message-ID:MIME-Version;
 b=xn+k5AciQnk1kVwgIkIFmlCscR11oyI3laPiPlJRnaoMj5wcTJuYvA2bw8ikAJNLDpy+iwsGxVAIFv51Gp0CEy8nHVLcWTYFh8SRVNu2APZ8OHvMMuQGSb1JHdo2SHyoLFarZL03PvCfPQUgVCkwuDVuWRX8DMlyCrCb6+iRvO0=
ARC-Authentication-Results: i=1; server2.sourceware.org
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org AF303385274D
Received: from simark.ca (simark.ca [158.69.221.121]) (authenticated bits=0)
 by smtp.polymtl.ca (8.14.7/8.14.7) with ESMTP id 563K7YuR009509
 (version=TLSv1/SSLv3 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT)
 for <gdb-patches@sourceware.org>; Thu, 3 Jul 2025 16:07:39 -0400
DKIM-Filter: OpenDKIM Filter v2.11.0 smtp.polymtl.ca 563K7YuR009509
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=polymtl.ca;
 s=default; t=1751573259;
 bh=Mp8XqNC/CK/N6IJHeMDU5B/yD7dJe6Qe5ccotGS3iFI=;
 h=From:To:Cc:Subject:Date:In-Reply-To:From;
 b=oZVk1C8QqEH/xVQifM369RCanmzjE/Rin273bphs5A1bPiB6jvTKawHWzc9pZDIVw
 AFErD/sknnYUf4N/ngcxWBCXCLu5VIFmU+QF6Zuj2Jzf68E1siBiB/Zv/2FNsU3FMz
 vENp8C1FynKlRGhnRxXNAHxlobxQGIC9W+9tNyy8=
Received: by simark.ca (Postfix, from userid 112)
 id 1EA0E1E13A; Thu,  3 Jul 2025 16:01:40 -0400 (EDT)
Received: from simark.localdomain (modemcable238.237-201-24.mc.videotron.ca
 [24.201.237.238])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature ECDSA (prime256v1) server-digest SHA256)
 (No client certificate requested)
 by simark.ca (Postfix) with ESMTPSA id 653031E11F;
 Thu,  3 Jul 2025 16:01:32 -0400 (EDT)
From: simon.marchi@polymtl.ca
To: gdb-patches@sourceware.org
Cc: Simon Marchi <simon.marchi@efficios.com>
Subject: [PATCH v2 4/4] gdb/dwarf: use dynamic partitioning for DWARF CU
 indexing
Date: Thu,  3 Jul 2025 16:01:19 -0400
Message-ID: <20250703200130.4095761-4-simon.marchi@polymtl.ca>
X-Mailer: git-send-email 2.50.0
In-Reply-To: <20250703200130.4095761-1-simon.marchi@polymtl.ca>
References: <20250703200130.4095761-1-simon.marchi@polymtl.ca>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Poly-FromMTA: (simark.ca [158.69.221.121]) at Thu, 3 Jul 2025 20:07:34 +0000
X-BeenThere: gdb-patches@sourceware.org
X-Mailman-Version: 2.1.30
Precedence: list
List-Id: Gdb-patches mailing list <gdb-patches.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/gdb-patches>,
 <mailto:gdb-patches-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/gdb-patches/>
List-Post: <mailto:gdb-patches@sourceware.org>
List-Help: <mailto:gdb-patches-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/gdb-patches>,
 <mailto:gdb-patches-request@sourceware.org?subject=subscribe>
Errors-To: gdb-patches-bounces~public-inbox=simark.ca@sourceware.org

From: Simon Marchi <simon.marchi@efficios.com>

The DWARF indexer splits the work statically based on the unit sizes,
attempting to give each worker thread about the same amount of bytes to
process.  This works relatively well with standard compilation.  But
when compiling with DWO files (-gsplit-dwarf), it's not as good.  I see
this when loading a relatively big program (telegram-desktop, which
includes a lot of static dependencies) compiled with -gsplit-dwarf:

    Time for "DWARF indexing worker": wall 0.000, user 0.000, sys 0.000, user+sys 0.000, -nan % CPU
    Time for "DWARF indexing worker": wall 0.001, user 0.000, sys 0.000, user+sys 0.000, 0.0 % CPU
    Time for "DWARF indexing worker": wall 0.001, user 0.001, sys 0.000, user+sys 0.001, 100.0 % CPU
    Time for "DWARF indexing worker": wall 0.748, user 0.284, sys 0.297, user+sys 0.581, 77.7 % CPU
    Time for "DWARF indexing worker": wall 0.818, user 0.408, sys 0.262, user+sys 0.670, 81.9 % CPU
    Time for "DWARF indexing worker": wall 1.196, user 0.580, sys 0.402, user+sys 0.982, 82.1 % CPU
    Time for "DWARF indexing worker": wall 1.250, user 0.511, sys 0.500, user+sys 1.011, 80.9 % CPU
    Time for "DWARF indexing worker": wall 7.730, user 5.891, sys 1.729, user+sys 7.620, 98.6 % CPU

Note how the wall times vary from 0 to 7.7 seconds.  This is
undesirable, because the time to do that indexing step takes as long as
the slowest worker thread takes.

The imbalance in this step also causes imbalance in the following
"finalize" step:

    Time for "DWARF finalize worker": wall 0.007, user 0.004, sys 0.002, user+sys 0.006, 85.7 % CPU
    Time for "DWARF finalize worker": wall 0.012, user 0.005, sys 0.005, user+sys 0.010, 83.3 % CPU
    Time for "DWARF finalize worker": wall 0.015, user 0.010, sys 0.004, user+sys 0.014, 93.3 % CPU
    Time for "DWARF finalize worker": wall 0.389, user 0.359, sys 0.029, user+sys 0.388, 99.7 % CPU
    Time for "DWARF finalize worker": wall 0.680, user 0.644, sys 0.035, user+sys 0.679, 99.9 % CPU
    Time for "DWARF finalize worker": wall 0.929, user 0.907, sys 0.020, user+sys 0.927, 99.8 % CPU
    Time for "DWARF finalize worker": wall 1.093, user 1.055, sys 0.037, user+sys 1.092, 99.9 % CPU
    Time for "DWARF finalize worker": wall 2.016, user 1.934, sys 0.082, user+sys 2.016, 100.0 % CPU
    Time for "DWARF finalize worker": wall 25.882, user 25.471, sys 0.404, user+sys 25.875, 100.0 % CPU

With DWO files, the split of the workload by size doesn't work, because
it is done using the size of the skeleton units in the main file, which
is not representative of how much DWARF is contained in each DWO file.

I haven't tried it, but a similar problem could occur with cross-unit
imports, which can happen with dwz or LTO.  You could have a small unit
that imports a lot from other units, in which case the size of the units
is not representative of the work to accomplish.

To try to improve this situation, change the DWARF indexer to use
dynamic partitioning, using gdb::parallel_for_each_async.  With this,
each worker thread pops one unit at a time from a shared work queue to
process it, until the queue is empty.  That should result in a more
balance workload split.  I chose 1 as the minimum batch size here,
because I judged that indexing one CU was a big enough piece of work
compared to the synchronization overhead of the queue.  That can always
be tweaked later if someone wants to do more tests.

As a result, the timings are much more balanced:

    Time for "DWARF indexing worker": wall 2.325, user 1.033, sys 0.573, user+sys 1.606, 69.1 % CPU
    Time for "DWARF indexing worker": wall 2.326, user 1.028, sys 0.568, user+sys 1.596, 68.6 % CPU
    Time for "DWARF indexing worker": wall 2.326, user 1.068, sys 0.513, user+sys 1.581, 68.0 % CPU
    Time for "DWARF indexing worker": wall 2.326, user 1.005, sys 0.579, user+sys 1.584, 68.1 % CPU
    Time for "DWARF indexing worker": wall 2.326, user 1.070, sys 0.516, user+sys 1.586, 68.2 % CPU
    Time for "DWARF indexing worker": wall 2.326, user 1.063, sys 0.584, user+sys 1.647, 70.8 % CPU
    Time for "DWARF indexing worker": wall 2.326, user 1.049, sys 0.550, user+sys 1.599, 68.7 % CPU
    Time for "DWARF indexing worker": wall 2.328, user 1.058, sys 0.541, user+sys 1.599, 68.7 % CPU
    ...
    Time for "DWARF finalize worker": wall 2.833, user 2.791, sys 0.040, user+sys 2.831, 99.9 % CPU
    Time for "DWARF finalize worker": wall 2.939, user 2.896, sys 0.043, user+sys 2.939, 100.0 % CPU
    Time for "DWARF finalize worker": wall 3.016, user 2.969, sys 0.046, user+sys 3.015, 100.0 % CPU
    Time for "DWARF finalize worker": wall 3.076, user 2.957, sys 0.118, user+sys 3.075, 100.0 % CPU
    Time for "DWARF finalize worker": wall 3.159, user 3.054, sys 0.104, user+sys 3.158, 100.0 % CPU
    Time for "DWARF finalize worker": wall 3.198, user 3.082, sys 0.114, user+sys 3.196, 99.9 % CPU
    Time for "DWARF finalize worker": wall 3.197, user 3.076, sys 0.121, user+sys 3.197, 100.0 % CPU
    Time for "DWARF finalize worker": wall 3.268, user 3.136, sys 0.131, user+sys 3.267, 100.0 % CPU
    Time for "DWARF finalize worker": wall 1.907, user 1.810, sys 0.096, user+sys 1.906, 99.9 % CPU

In absolute terms, the total time for GDB to load the file and exit goes
down from about 42 seconds to 17 seconds.

Some implementation notes:

 - The state previously kept in as local variables in
   cooked_index_worker_debug_info::process_units becomes fields of the
   new parallel worker object.

 - The work previously done for each unit in
   cooked_index_worker_debug_info::process_units becomes the operator()
   of the new parallel worker object.

 - The work previously done at the end of
   cooked_index_worker_debug_info::process_units (including calling
   bfd_thread_cleanup) becomes the destructor of the new parallel worker
   object.

 - The "done" callback of gdb::task_group becomes the "done" callback of
   gdb::parallel_for_each_async.

 - I placed the parallel_indexing_worker struct inside
   cooked_index_worker_debug_info, so that it has access to
   parallel_indexing_worker's private fields (e.g. m_results, to push
   the results).  It will also be possible to re-use it for skeletonless
   type units in a later patch.

Change-Id: I5dc5cf8793abe9ebe2659e78da38ffc94289e5f2
---
 gdb/dwarf2/cooked-index-worker.h |   6 ++
 gdb/dwarf2/read.c                | 147 +++++++++++++------------------
 2 files changed, 65 insertions(+), 88 deletions(-)

diff --git a/gdb/dwarf2/cooked-index-worker.h b/gdb/dwarf2/cooked-index-worker.h
index 8b9766cddcbf..5c282c138506 100644
--- a/gdb/dwarf2/cooked-index-worker.h
+++ b/gdb/dwarf2/cooked-index-worker.h
@@ -283,8 +283,14 @@ class cooked_index_worker
 
   /* The per-objfile object.  */
   dwarf2_per_objfile *m_per_objfile;
+
   /* Result of each worker task.  */
   std::vector<cooked_index_worker_result> m_results;
+
+  /* Mutex to synchronize access to M_RESULTS when workers append their
+     result.  */
+  std::mutex m_results_mutex;
+
   /* Any warnings emitted.  For the time being at least, this only
      needed in do_reading, not in every worker.  Note that
      deferred_warnings uses gdb_stderr in its constructor, and this
diff --git a/gdb/dwarf2/read.c b/gdb/dwarf2/read.c
index 5e18e4520610..55b736776227 100644
--- a/gdb/dwarf2/read.c
+++ b/gdb/dwarf2/read.c
@@ -51,6 +51,7 @@
 #include "elf-bfd.h"
 #include "event-top.h"
 #include "exceptions.h"
+#include "gdbsupport/parallel-for.h"
 #include "gdbsupport/task-group.h"
 #include "maint.h"
 #include "symtab.h"
@@ -3206,6 +3207,59 @@ class cooked_index_worker_debug_info : public cooked_index_worker
   }
 
 private:
+  /* The task for parallel workers that index units.  */
+  struct parallel_indexing_worker
+  {
+    explicit parallel_indexing_worker (const char *step_name,
+				       cooked_index_worker_debug_info *parent)
+      : m_scoped_time_it (step_name, parent->m_per_command_time),
+	m_parent (parent)
+    {
+    }
+
+    DISABLE_COPY_AND_ASSIGN (parallel_indexing_worker);
+
+    ~parallel_indexing_worker ()
+    {
+      bfd_thread_cleanup ();
+
+      m_thread_storage.done_reading (m_complaint_handler.release ());
+
+      /* Append the results of this worker to the parent instance.  */
+      std::lock_guard<std::mutex> lock (m_parent->m_results_mutex);
+      m_parent->m_results.emplace_back (std::move (m_thread_storage));
+    }
+
+    void operator() (dwarf2_per_cu_up *first, dwarf2_per_cu_up *last)
+    {
+      for (auto &it : iterator_range (first, last))
+	this->process_one (*it);
+    }
+
+  private:
+    void process_one (dwarf2_per_cu &unit)
+    {
+      m_thread_storage.catch_error ([&] ()
+	{
+	  m_parent->process_unit (&unit, m_parent->m_per_objfile,
+				  &m_thread_storage);
+	});
+    }
+
+    /* Measures the execution time of this worker.  */
+    scoped_time_it m_scoped_time_it;
+
+    /* Delayed complaints and errors recorded while indexing units.  */
+    complaint_interceptor m_complaint_handler;
+    std::vector<gdb_exception> m_errors;
+
+    /* Index storage for this worker.  */
+    cooked_index_worker_result m_thread_storage;
+
+    /* The instance that spawned this worker.  */
+    cooked_index_worker_debug_info *m_parent;
+  };
+
   void do_reading () override;
 
   /* Print collected type unit statistics.  */
@@ -3247,12 +3301,6 @@ class cooked_index_worker_debug_info : public cooked_index_worker
   /* An iterator for the comp units.  */
   using unit_iterator = std::vector<dwarf2_per_cu_up>::iterator;
 
-  /* Process a batch of CUs.  This may be called multiple times in
-     separate threads.  TASK_NUMBER indicates which task this is --
-     the result is stored in that slot of M_RESULTS.  */
-  void process_units (size_t task_number, unit_iterator first,
-		      unit_iterator end);
-
   /* Process unit THIS_CU.  */
   void process_unit (dwarf2_per_cu *this_cu, dwarf2_per_objfile *per_objfile,
 		     cooked_index_worker_result *storage);
@@ -3476,32 +3524,6 @@ cooked_index_worker_debug_info::process_skeletonless_type_units
 	  });
 }
 
-void
-cooked_index_worker_debug_info::process_units (size_t task_number,
-					       unit_iterator first,
-					       unit_iterator end)
-{
-  SCOPE_EXIT { bfd_thread_cleanup (); };
-
-  /* Ensure that complaints are handled correctly.  */
-  complaint_interceptor complaint_handler;
-
-  std::vector<gdb_exception> errors;
-  cooked_index_worker_result thread_storage;
-  for (auto inner = first; inner != end; ++inner)
-    {
-      dwarf2_per_cu *per_cu = inner->get ();
-
-      thread_storage.catch_error ([&] ()
-	{
-	  process_unit (per_cu, m_per_objfile, &thread_storage);
-	});
-    }
-
-  thread_storage.done_reading (complaint_handler.release ());
-  m_results[task_number] = std::move (thread_storage);
-}
-
 void
 cooked_index_worker_debug_info::done_reading ()
 {
@@ -3527,62 +3549,11 @@ cooked_index_worker_debug_info::do_reading ()
 			       m_index_storage.get_addrmap (),
 			       &m_warnings);
 
-  /* We want to balance the load between the worker threads.  This is
-     done by using the size of each CU as a rough estimate of how
-     difficult it will be to operate on.  This isn't ideal -- for
-     example if dwz is used, the early CUs will all tend to be
-     "included" and won't be parsed independently.  However, this
-     heuristic works well for typical compiler output.  */
-
-  size_t total_size = 0;
-  for (const auto &per_cu : per_bfd->all_units)
-    total_size += per_cu->length ();
-
-  /* How many worker threads we plan to use.  We may not actually use
-     this many.  We use 1 as the minimum to avoid division by zero,
-     and anyway in the N==0 case the work will be done
-     synchronously.  */
-  const size_t n_worker_threads
-    = std::max (gdb::thread_pool::g_thread_pool->thread_count (), (size_t) 1);
-
-  /* How much effort should be put into each worker.  */
-  const size_t size_per_thread
-    = std::max (total_size / n_worker_threads, (size_t) 1);
-
-  /* Work is done in a task group.  */
-  gdb::task_group workers ([this] ()
-  {
-    this->done_reading ();
-  });
-
-  auto end = per_bfd->all_units.end ();
-  size_t task_count = 0;
-  for (auto iter = per_bfd->all_units.begin (); iter != end; )
-    {
-      auto last = iter;
-      /* Put all remaining CUs into the last task.  */
-      if (task_count == n_worker_threads - 1)
-	last = end;
-      else
-	{
-	  size_t chunk_size = 0;
-	  for (; last != end && chunk_size < size_per_thread; ++last)
-	    chunk_size += (*last)->length ();
-	}
-
-      gdb_assert (iter != last);
-      workers.add_task ([this, task_count, iter, last] ()
-	{
-	  scoped_time_it time_it ("DWARF indexing worker", m_per_command_time);
-	  process_units (task_count, iter, last);
-	});
-
-      ++task_count;
-      iter = last;
-    }
-
-  m_results.resize (task_count);
-  workers.start ();
+  /* Launch parallel tasks to index units.  */
+  gdb::parallel_for_each_async<1, dwarf2_per_cu_up *, parallel_indexing_worker>
+    (per_bfd->all_units.data (),
+     per_bfd->all_units.data () + per_bfd->all_units.size (),
+     [this] () { this->done_reading (); }, "DWARF indexing worker", this);
 }
 
 static void
-- 
2.50.0