Mirror of the gdb-patches mailing list
* [PATCH][gdbsupport] Use task size in parallel_for_each
@ 2022-07-18 19:42 Tom de Vries via Gdb-patches
  2022-07-21 17:35 ` Pedro Alves
  0 siblings, 1 reply; 13+ messages in thread
From: Tom de Vries via Gdb-patches @ 2022-07-18 19:42 UTC (permalink / raw)
  To: gdb-patches; +Cc: Tom Tromey

Hi,

Ensuring a fair distribution over the worker threads and main thread in terms
of number of CUs might not be the most efficient way, given that CUs can vary
in size.

Fix this by:
- adding a task_size_ptr parameter to parallel_for_each,
  defaulting to nullptr,
- using per_cu->get_length () as the task size in the parallel_for_each
  in dwarf2_build_psymtabs_hard, and
- using the task size in parallel_for_each to distribute similarly-sized tasks
  to the threads.
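
To make the last point concrete, here is a rough standalone illustration
(toy sizes and made-up code, not part of the patch) of the greedy split:
each chunk keeps taking elements until it has accumulated roughly
total_size / n_threads.
...
#include <cstddef>
#include <cstdio>
#include <vector>

int main ()
{
  /* Toy per-CU sizes; the patch gets the real ones from the task_size
     callback.  */
  std::vector<size_t> sizes = { 40, 10, 5, 5, 5, 5, 5, 25, 20, 30 };
  const size_t n_threads = 3;

  size_t total = 0;
  for (size_t s : sizes)
    total += s;
  size_t size_per_thread = total / n_threads;	/* 150 / 3 == 50 */

  size_t begin = 0;
  for (size_t t = 0; t < n_threads && begin < sizes.size (); ++t)
    {
      size_t acc = 0, end = begin;
      while (end < sizes.size () && acc < size_per_thread)
	acc += sizes[end++];
      printf ("chunk %zu: %zu elements, size %zu\n", t, end - begin, acc);
      begin = end;
    }
  /* Prints chunks of 2, 6 and 2 elements, each of size 50: different
     element counts, similar sizes.  In the patch itself the last chunk
     is simply whatever remains for the main thread.  */
  return 0;
}
...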

I've used this experiment to verify the performance impact:
...
$ for n in $(seq 1 10); do \
    time gdb -q -batch ~/firefox/libxul.so-93.0-1.1.x86_64.debug \
    2>&1 \
    | grep "real:"; \
  done
...
and without the patch got:
...
real: 4.71
real: 4.88
real: 4.29
real: 4.30
real: 4.65
real: 4.27
real: 4.27
real: 4.27
real: 4.75
real: 4.41
...
and with the patch:
...
real: 3.68
real: 3.81
real: 3.80
real: 3.68
real: 3.75
real: 3.69
real: 3.69
real: 3.74
real: 3.67
real: 3.74
...
so that seems a reasonable improvement.

With parallel_for_each_debug set to true, we get some more details about the
difference in behaviour.  Without the patch we have:
...
Parallel for: n_elements: 2818
Parallel for: minimum elements per thread: 1
Parallel for: elts_per_thread: 704
Parallel for: elements on worker thread 0       : 705
Parallel for: elements on worker thread 1       : 705
Parallel for: elements on worker thread 2       : 704
Parallel for: elements on worker thread 3       : 0
Parallel for: elements on main thread           : 704
...
and with the patch:
...
Parallel for: n_elements: 2818
Parallel for: total_size: 1483674865
Parallel for: size_per_thread: 370918716
Parallel for: elements on worker thread 0       : 752   (size: 371811790)
Parallel for: elements on worker thread 1       : 360   (size: 371509370)
Parallel for: elements on worker thread 2       : 1130  (size: 372681710)
Parallel for: elements on worker thread 3       : 0     (size: 0)
Parallel for: elements on main thread           : 576   (size: 367671995)
...

Tested on x86_64-linux.

Any comments?

Thanks,
- Tom

[gdbsupport] Use task size in parallel_for_each

---
 gdb/dwarf2/read.c         |  9 ++++-
 gdbsupport/parallel-for.h | 97 +++++++++++++++++++++++++++++++++++++----------
 2 files changed, 84 insertions(+), 22 deletions(-)

diff --git a/gdb/dwarf2/read.c b/gdb/dwarf2/read.c
index bcd01107377..143bcfb5374 100644
--- a/gdb/dwarf2/read.c
+++ b/gdb/dwarf2/read.c
@@ -7059,6 +7059,13 @@ dwarf2_build_psymtabs_hard (dwarf2_per_objfile *per_objfile)
 
     using iter_type = decltype (per_bfd->all_comp_units.begin ());
 
+    std::function<unsigned int (iter_type)> task_size
+      = [=] (iter_type iter)
+      {
+	dwarf2_per_cu_data *per_cu = iter->get ();
+	return per_cu->length ();
+      };
+
     /* Each thread returns a pair holding a cooked index, and a vector
        of errors that should be printed.  The latter is done because
        GDB's I/O system is not thread-safe.  run_on_main_thread could be
@@ -7087,7 +7094,7 @@ dwarf2_build_psymtabs_hard (dwarf2_per_objfile *per_objfile)
 	      }
 	  }
 	return result_type (thread_storage.release (), std::move (errors));
-      });
+      }, &task_size);
 
     /* Only show a given exception a single time.  */
     std::unordered_set<gdb_exception> seen_exceptions;
diff --git a/gdbsupport/parallel-for.h b/gdbsupport/parallel-for.h
index bf40f125f0f..3c9269574df 100644
--- a/gdbsupport/parallel-for.h
+++ b/gdbsupport/parallel-for.h
@@ -134,7 +134,9 @@ typename gdb::detail::par_for_accumulator<
     typename std::result_of<RangeFunction (RandomIt, RandomIt)>::type
   >::result_type
 parallel_for_each (unsigned n, RandomIt first, RandomIt last,
-		   RangeFunction callback)
+		   RangeFunction callback,
+		   std::function<unsigned int(RandomIt)> *task_size_ptr
+		     = (std::function<unsigned int(RandomIt)> *)nullptr)
 {
   using result_type
     = typename std::result_of<RangeFunction (RandomIt, RandomIt)>::type;
@@ -148,17 +150,33 @@ parallel_for_each (unsigned n, RandomIt first, RandomIt last,
   size_t n_elements = last - first;
   size_t elts_per_thread = 0;
   size_t elts_left_over = 0;
+  size_t total_size = 0;
+  size_t size_per_thread = 0;
 
   if (n_threads > 1)
     {
-      /* Require that there should be at least N elements in a
-	 thread.  */
-      gdb_assert (n > 0);
-      if (n_elements / n_threads < n)
-	n_threads = std::max (n_elements / n, (size_t) 1);
-      elts_per_thread = n_elements / n_threads;
-      elts_left_over = n_elements % n_threads;
-      /* n_elements == n_threads * elts_per_thread + elts_left_over. */
+      if (task_size_ptr != nullptr)
+	{
+	  gdb_assert (n == 1);
+	  for (RandomIt i = first; i != last; ++i)
+	    {
+	      std::function<unsigned int(RandomIt)> f = *task_size_ptr;
+	      size_t s = (size_t)f (i);
+	      total_size += s;
+	    }
+	  size_per_thread = total_size / n_threads;
+	}
+      else
+	{
+	  /* Require that there should be at least N elements in a
+	     thread.  */
+	  gdb_assert (n > 0);
+	  if (n_elements / n_threads < n)
+	    n_threads = std::max (n_elements / n, (size_t) 1);
+	  elts_per_thread = n_elements / n_threads;
+	  elts_left_over = n_elements % n_threads;
+	  /* n_elements == n_threads * elts_per_thread + elts_left_over. */
+	}
     }
 
   size_t count = n_threads == 0 ? 0 : n_threads - 1;
@@ -167,20 +185,47 @@ parallel_for_each (unsigned n, RandomIt first, RandomIt last,
   if (parallel_for_each_debug)
     {
       debug_printf (_("Parallel for: n_elements: %zu\n"), n_elements);
-      debug_printf (_("Parallel for: minimum elements per thread: %u\n"), n);
-      debug_printf (_("Parallel for: elts_per_thread: %zu\n"), elts_per_thread);
+      if (task_size_ptr != nullptr)
+	{
+	  debug_printf (_("Parallel for: total_size: %zu\n"), total_size);
+	  debug_printf (_("Parallel for: size_per_thread: %zu\n"), size_per_thread);
+	}
+      else
+	{
+	  debug_printf (_("Parallel for: minimum elements per thread: %u\n"), n);
+	  debug_printf (_("Parallel for: elts_per_thread: %zu\n"), elts_per_thread);
+	}
     }
 
+  size_t remaining_size = total_size;
   for (int i = 0; i < count; ++i)
     {
-      RandomIt end = first + elts_per_thread;
-      if (i < elts_left_over)
-	/* Distribute the leftovers over the worker threads, to avoid having
-	   to handle all of them in a single thread.  */
-	end++;
+      RandomIt end;
+      size_t s = 0;
+      if (task_size_ptr == nullptr)
+	{
+	  end = first + elts_per_thread;
+	  if (i < elts_left_over)
+	    /* Distribute the leftovers over the worker threads, to avoid having
+	       to handle all of them in a single thread.  */
+	    end++;
+	}
+      else
+	{
+	  RandomIt j;
+	  for (j = first; j < last && s < size_per_thread; ++j)
+	    s += (size_t)(*task_size_ptr) (j);
+	  end = j;
+	  remaining_size -= s;
+	}
       if (parallel_for_each_debug)
-	debug_printf (_("Parallel for: elements on worker thread %i\t: %zu\n"),
-		      i, (size_t)(end - first));
+	{
+	  debug_printf (_("Parallel for: elements on worker thread %i\t: %zu"),
+			i, (size_t)(end - first));
+	  if (task_size_ptr != nullptr)
+	    debug_printf (_("\t(size: %zu)"), s);
+	  debug_printf (_("\n"));
+	}
       results.post (i, [=] ()
         {
 	  return callback (first, end);
@@ -190,12 +235,22 @@ parallel_for_each (unsigned n, RandomIt first, RandomIt last,
 
   for (int i = count; i < n_worker_threads; ++i)
     if (parallel_for_each_debug)
-      debug_printf (_("Parallel for: elements on worker thread %i\t: 0\n"), i);
+      {
+	debug_printf (_("Parallel for: elements on worker thread %i\t: 0"), i);
+	if (task_size_ptr != nullptr)
+	  debug_printf (_("\t(size: 0)"));
+	debug_printf (_("\n"));
+      }
 
   /* Process all the remaining elements in the main thread.  */
   if (parallel_for_each_debug)
-    debug_printf (_("Parallel for: elements on main thread\t\t: %zu\n"),
-		  (size_t)(last - first));
+    {
+      debug_printf (_("Parallel for: elements on main thread\t\t: %zu"),
+		    (size_t)(last - first));
+      if (task_size_ptr != nullptr)
+	debug_printf (_("\t(size: %zu)"), remaining_size);
+      debug_printf (_("\n"));
+    }
   return results.finish ([=] ()
     {
       return callback (first, last);


* Re: [PATCH][gdbsupport] Use task size in parallel_for_each
  2022-07-18 19:42 [PATCH][gdbsupport] Use task size in parallel_for_each Tom de Vries via Gdb-patches
@ 2022-07-21 17:35 ` Pedro Alves
  2022-07-21 20:23   ` Tom de Vries via Gdb-patches
  0 siblings, 1 reply; 13+ messages in thread
From: Pedro Alves @ 2022-07-21 17:35 UTC (permalink / raw)
  To: Tom de Vries, gdb-patches; +Cc: Tom Tromey

On 2022-07-18 8:42 p.m., Tom de Vries via Gdb-patches wrote:
> Hi,
> 
> Ensuring a fair distribution over the worker threads and main thread in terms
> of number of CUs might not be the most efficient way, given that CUs can vary
> in size.
> 
> Fix this by:
> - adding a task_size_ptr parameter to parallel_for_each,
>   defaulting to nullptr,
> - using per_cu->get_length () as the task size in the parallel_for_each
>   in dwarf2_build_psymtabs_hard, and
> - using the task size in parallel_for_each to distribute similarly-sized tasks
>   to the threads.
> 
> I've used this experiment to verify the performance impact:
> ...
> $ for n in $(seq 1 10); do \
>     time gdb -q -batch ~/firefox/libxul.so-93.0-1.1.x86_64.debug \
>     2>&1 \
>     | grep "real:"; \
>   done
> ...
> and without the patch got:
> ...
> real: 4.71
> real: 4.88
> real: 4.29
> real: 4.30
> real: 4.65
> real: 4.27
> real: 4.27
> real: 4.27
> real: 4.75
> real: 4.41
> ...
> and with the patch:
> ...
> real: 3.68
> real: 3.81
> real: 3.80
> real: 3.68
> real: 3.75
> real: 3.69
> real: 3.69
> real: 3.74
> real: 3.67
> real: 3.74
> ...
> so that seems a reasonable improvement.

Nice.

> 
> With parallel_for_each_debug set to true, we get some more details about the
> difference in behaviour.  Without the patch we have:
> ...
> Parallel for: n_elements: 2818
> Parallel for: minimum elements per thread: 1
> Parallel for: elts_per_thread: 704
> Parallel for: elements on worker thread 0       : 705
> Parallel for: elements on worker thread 1       : 705
> Parallel for: elements on worker thread 2       : 704
> Parallel for: elements on worker thread 3       : 0
> Parallel for: elements on main thread           : 704
> ...
> and with the patch:
> ...
> Parallel for: n_elements: 2818
> Parallel for: total_size: 1483674865
> Parallel for: size_per_thread: 370918716
> Parallel for: elements on worker thread 0       : 752   (size: 371811790)
> Parallel for: elements on worker thread 1       : 360   (size: 371509370)
> Parallel for: elements on worker thread 2       : 1130  (size: 372681710)
> Parallel for: elements on worker thread 3       : 0     (size: 0)
> Parallel for: elements on main thread           : 576   (size: 367671995)
> ...
> 
> Tested on x86_64-linux.
> 
> Any comments?

I'll defer to Tromey.  Just a passerby comment, and a comment on the code itself
if we follow this approach, below.

My passerby comment is that I wonder whether we should consider switching to a work
stealing implementation, so that threads that are done with their chunk would grab another
chunk from the work pool.  I think this would spare us from having to worry about these
distribution heuristics.

> @@ -7059,6 +7059,13 @@ dwarf2_build_psymtabs_hard (dwarf2_per_objfile *per_objfile)
>  
>      using iter_type = decltype (per_bfd->all_comp_units.begin ());
>  
> +    std::function<unsigned int (iter_type)> task_size
> +      = [=] (iter_type iter)
> +      {
> +	dwarf2_per_cu_data *per_cu = iter->get ();
> +	return per_cu->length ();
> +      };
> +
>      /* Each thread returns a pair holding a cooked index, and a vector
>         of errors that should be printed.  The latter is done because
>         GDB's I/O system is not thread-safe.  run_on_main_thread could be
> @@ -7087,7 +7094,7 @@ dwarf2_build_psymtabs_hard (dwarf2_per_objfile *per_objfile)
>  	      }
>  	  }
>  	return result_type (thread_storage.release (), std::move (errors));
> -      });
> +      }, &task_size);
>  
>      /* Only show a given exception a single time.  */
>      std::unordered_set<gdb_exception> seen_exceptions;
> diff --git a/gdbsupport/parallel-for.h b/gdbsupport/parallel-for.h
> index bf40f125f0f..3c9269574df 100644
> --- a/gdbsupport/parallel-for.h
> +++ b/gdbsupport/parallel-for.h
> @@ -134,7 +134,9 @@ typename gdb::detail::par_for_accumulator<
>      typename std::result_of<RangeFunction (RandomIt, RandomIt)>::type
>    >::result_type
>  parallel_for_each (unsigned n, RandomIt first, RandomIt last,
> -		   RangeFunction callback)
> +		   RangeFunction callback,
> +		   std::function<unsigned int(RandomIt)> *task_size_ptr
> +		     = (std::function<unsigned int(RandomIt)> *)nullptr)

That use of a std::function pointer looks odd.  AFAICT, TASK_SIZE_PTR is only ever called
as a callback by parallel_for_each, for setup, in the calling thread, and isn't stored
anywhere, right?  If so, gdb::function_view instead should work, is lightweight, and is
nullable, meaning you don't need a pointer.

And then, at the caller, just using a lambda instead of a std::function should work too:

    auto task_size = [=] (iter_type iter)
      {
	dwarf2_per_cu_data *per_cu = iter->get ();
	return per_cu->length ();
      };


* Re: [PATCH][gdbsupport] Use task size in parallel_for_each
  2022-07-21 17:35 ` Pedro Alves
@ 2022-07-21 20:23   ` Tom de Vries via Gdb-patches
  2022-07-22  0:03     ` Pedro Alves
  2022-07-22 19:08     ` Tom Tromey
  0 siblings, 2 replies; 13+ messages in thread
From: Tom de Vries via Gdb-patches @ 2022-07-21 20:23 UTC (permalink / raw)
  To: Pedro Alves, gdb-patches; +Cc: Tom Tromey

[-- Attachment #1: Type: text/plain, Size: 5300 bytes --]

On 7/21/22 19:35, Pedro Alves wrote:
> On 2022-07-18 8:42 p.m., Tom de Vries via Gdb-patches wrote:
>> Hi,
>>
>> Ensuring a fair distribution over the worker threads and main thread in terms
>> of number of CUs might not be the most efficient way, given that CUs can vary
>> in size.
>>
>> Fix this by:
>> - adding a task_size_ptr parameter to parallel_for_each,
>>    defaulting to nullptr,
>> - using per_cu->get_length () as the task size in the parallel_for_each
>>    in dwarf2_build_psymtabs_hard, and
>> - using the task size in parallel_for_each to distribute similarly-sized tasks
>>    to the threads.
>>
>> I've used this experiment to verify the performance impact:
>> ...
>> $ for n in $(seq 1 10); do \
>>      time gdb -q -batch ~/firefox/libxul.so-93.0-1.1.x86_64.debug \
>>      2>&1 \
>>      | grep "real:"; \
>>    done
>> ...
>> and without the patch got:
>> ...
>> real: 4.71
>> real: 4.88
>> real: 4.29
>> real: 4.30
>> real: 4.65
>> real: 4.27
>> real: 4.27
>> real: 4.27
>> real: 4.75
>> real: 4.41
>> ...
>> and with the patch:
>> ...
>> real: 3.68
>> real: 3.81
>> real: 3.80
>> real: 3.68
>> real: 3.75
>> real: 3.69
>> real: 3.69
>> real: 3.74
>> real: 3.67
>> real: 3.74
>> ...
>> so that seems a reasonable improvement.
> 
> Nice.
> 

:)

>>
>> With parallel_for_each_debug set to true, we get some more details about the
>> difference in behaviour.  Without the patch we have:
>> ...
>> Parallel for: n_elements: 2818
>> Parallel for: minimum elements per thread: 1
>> Parallel for: elts_per_thread: 704
>> Parallel for: elements on worker thread 0       : 705
>> Parallel for: elements on worker thread 1       : 705
>> Parallel for: elements on worker thread 2       : 704
>> Parallel for: elements on worker thread 3       : 0
>> Parallel for: elements on main thread           : 704
>> ...
>> and with the patch:
>> ...
>> Parallel for: n_elements: 2818
>> Parallel for: total_size: 1483674865
>> Parallel for: size_per_thread: 370918716
>> Parallel for: elements on worker thread 0       : 752   (size: 371811790)
>> Parallel for: elements on worker thread 1       : 360   (size: 371509370)
>> Parallel for: elements on worker thread 2       : 1130  (size: 372681710)
>> Parallel for: elements on worker thread 3       : 0     (size: 0)
>> Parallel for: elements on main thread           : 576   (size: 367671995)
>> ...
>>
>> Tested on x86_64-linux.
>>
>> Any comments?
> 
> I'll defer to Tromey.  Just a passerby comment, and a comment on the code itself
> if we follow this approach, below.
> 
> My passerby comment is that I wonder whether we should consider switching to a work
> stealing implementation, so that threads that are done with their chunk would grab another
> chunk from the work pool.  I think this would spare us from having to worry about these
> distribution heuristics.
> 

I also thought about a dynamic solution, but decided to try the simplest
solution first.

Anyway, with a dynamic solution you still might want to decide how big a 
chunk is, for which you could still need this type of heuristic.

>> @@ -7059,6 +7059,13 @@ dwarf2_build_psymtabs_hard (dwarf2_per_objfile *per_objfile)
>>   
>>       using iter_type = decltype (per_bfd->all_comp_units.begin ());
>>   
>> +    std::function<unsigned int (iter_type)> task_size
>> +      = [=] (iter_type iter)
>> +      {
>> +	dwarf2_per_cu_data *per_cu = iter->get ();
>> +	return per_cu->length ();
>> +      };
>> +
>>       /* Each thread returns a pair holding a cooked index, and a vector
>>          of errors that should be printed.  The latter is done because
>>          GDB's I/O system is not thread-safe.  run_on_main_thread could be
>> @@ -7087,7 +7094,7 @@ dwarf2_build_psymtabs_hard (dwarf2_per_objfile *per_objfile)
>>   	      }
>>   	  }
>>   	return result_type (thread_storage.release (), std::move (errors));
>> -      });
>> +      }, &task_size);
>>   
>>       /* Only show a given exception a single time.  */
>>       std::unordered_set<gdb_exception> seen_exceptions;
>> diff --git a/gdbsupport/parallel-for.h b/gdbsupport/parallel-for.h
>> index bf40f125f0f..3c9269574df 100644
>> --- a/gdbsupport/parallel-for.h
>> +++ b/gdbsupport/parallel-for.h
>> @@ -134,7 +134,9 @@ typename gdb::detail::par_for_accumulator<
>>       typename std::result_of<RangeFunction (RandomIt, RandomIt)>::type
>>     >::result_type
>>   parallel_for_each (unsigned n, RandomIt first, RandomIt last,
>> -		   RangeFunction callback)
>> +		   RangeFunction callback,
>> +		   std::function<unsigned int(RandomIt)> *task_size_ptr
>> +		     = (std::function<unsigned int(RandomIt)> *)nullptr)
> 
> That use of a std::function pointer looks odd.  AFAICT, TASK_SIZE_PTR is only ever called
> as a callback by parallel_for_each, for setup, in the calling thread, and isn't stored
> anywhere, right?  If so, gdb::function_view instead should work, is lightweight, and is
> nullable, meaning you don't need a pointer.
> 
> And then, at the caller, just using a lambda instead of a std::function should work too:
> 
>      auto task_size = [=] (iter_type iter)
>        {
> 	dwarf2_per_cu_data *per_cu = iter->get ();
> 	return per_cu->length ();
>        };

I've tried that (attached) but ran into the usual template error mess, 
not sure how I could solve that yet.

Thanks,
- Tom

[-- Attachment #2: 0001-gdbsupport-Use-task-size-in-parallel_for_each.patch --]
[-- Type: text/x-patch, Size: 8185 bytes --]

[gdbsupport] Use task size in parallel_for_each

Ensuring a fair distribution over the worker threads and main thread in terms
of number of CUs might not be the most efficient way, given that CUs can vary
in size.

Fix this by:
- adding a task_size_ptr parameter to parallel_for_each,
  defaulting to nullptr,
- using per_cu->get_length () as the task size in the parallel_for_each
  in dwarf2_build_psymtabs_hard, and
- using the task size in parallel_for_each to distribute similarly-sized tasks
  to the threads.

I've used this experiment to verify the performance impact:
...
$ for n in $(seq 1 10); do \
    time gdb -q -batch ~/firefox/libxul.so-93.0-1.1.x86_64.debug \
    2>&1 \
    | grep "real:"; \
  done
...
and without the patch got:
...
real: 4.71
real: 4.88
real: 4.29
real: 4.30
real: 4.65
real: 4.27
real: 4.27
real: 4.27
real: 4.75
real: 4.41
...
and with the patch:
...
real: 3.68
real: 3.81
real: 3.80
real: 3.68
real: 3.75
real: 3.69
real: 3.69
real: 3.74
real: 3.67
real: 3.74
...
so that seems a reasonable improvement.

With parallel_for_each_debug set to true, we get some more detail about
the difference in behaviour.  Without the patch we have:
...
Parallel for: n_elements: 2818
Parallel for: minimum elements per thread: 1
Parallel for: elts_per_thread: 704
Parallel for: elements on worker thread 0       : 705
Parallel for: elements on worker thread 1       : 705
Parallel for: elements on worker thread 2       : 704
Parallel for: elements on worker thread 3       : 0
Parallel for: elements on main thread           : 704
...
and with the patch:
...
Parallel for: n_elements: 2818
Parallel for: total_size: 1483674865
Parallel for: size_per_thread: 370918716
Parallel for: elements on worker thread 0       : 752   (size: 371811790)
Parallel for: elements on worker thread 1       : 360   (size: 371509370)
Parallel for: elements on worker thread 2       : 1130  (size: 372681710)
Parallel for: elements on worker thread 3       : 0     (size: 0)
Parallel for: elements on main thread           : 576   (size: 367671995)
...

Tested on x86_64-linux.

---
 gdb/dwarf2/read.c         |  8 +++-
 gdbsupport/parallel-for.h | 97 +++++++++++++++++++++++++++++++++++++----------
 2 files changed, 83 insertions(+), 22 deletions(-)

diff --git a/gdb/dwarf2/read.c b/gdb/dwarf2/read.c
index 42230607fe0..23c3873cba6 100644
--- a/gdb/dwarf2/read.c
+++ b/gdb/dwarf2/read.c
@@ -7067,6 +7067,12 @@ dwarf2_build_psymtabs_hard (dwarf2_per_objfile *per_objfile)
 
     using iter_type = decltype (per_bfd->all_comp_units.begin ());
 
+    auto task_size = [=] (iter_type iter)
+      {
+	dwarf2_per_cu_data *per_cu = iter->get ();
+	return per_cu->length ();
+      };
+
     /* Each thread returns a pair holding a cooked index, and a vector
        of errors that should be printed.  The latter is done because
        GDB's I/O system is not thread-safe.  run_on_main_thread could be
@@ -7095,7 +7101,7 @@ dwarf2_build_psymtabs_hard (dwarf2_per_objfile *per_objfile)
 	      }
 	  }
 	return result_type (thread_storage.release (), std::move (errors));
-      });
+      }, task_size);
 
     /* Only show a given exception a single time.  */
     std::unordered_set<gdb_exception> seen_exceptions;
diff --git a/gdbsupport/parallel-for.h b/gdbsupport/parallel-for.h
index 0037ee23ff3..e6f784eafd5 100644
--- a/gdbsupport/parallel-for.h
+++ b/gdbsupport/parallel-for.h
@@ -23,6 +23,7 @@
 #include <algorithm>
 #include <type_traits>
 #include "gdbsupport/thread-pool.h"
+#include "gdbsupport/function-view.h"
 
 namespace gdb
 {
@@ -134,7 +135,9 @@ typename gdb::detail::par_for_accumulator<
     typename std::result_of<RangeFunction (RandomIt, RandomIt)>::type
   >::result_type
 parallel_for_each (unsigned n, RandomIt first, RandomIt last,
-		   RangeFunction callback)
+		   RangeFunction callback,
+		   gdb::function_view<unsigned int(RandomIt)> task_size
+		     = nullptr)
 {
   using result_type
     = typename std::result_of<RangeFunction (RandomIt, RandomIt)>::type;
@@ -148,17 +151,32 @@ parallel_for_each (unsigned n, RandomIt first, RandomIt last,
   size_t n_elements = last - first;
   size_t elts_per_thread = 0;
   size_t elts_left_over = 0;
+  size_t total_size = 0;
+  size_t size_per_thread = 0;
 
   if (n_threads > 1)
     {
-      /* Require that there should be at least N elements in a
-	 thread.  */
-      gdb_assert (n > 0);
-      if (n_elements / n_threads < n)
-	n_threads = std::max (n_elements / n, (size_t) 1);
-      elts_per_thread = n_elements / n_threads;
-      elts_left_over = n_elements % n_threads;
-      /* n_elements == n_threads * elts_per_thread + elts_left_over. */
+      if (task_size != nullptr)
+	{
+	  gdb_assert (n == 1);
+	  for (RandomIt i = first; i != last; ++i)
+	    {
+	      size_t s = (size_t)task_size (i);
+	      total_size += s;
+	    }
+	  size_per_thread = total_size / n_threads;
+	}
+      else
+	{
+	  /* Require that there should be at least N elements in a
+	     thread.  */
+	  gdb_assert (n > 0);
+	  if (n_elements / n_threads < n)
+	    n_threads = std::max (n_elements / n, (size_t) 1);
+	  elts_per_thread = n_elements / n_threads;
+	  elts_left_over = n_elements % n_threads;
+	  /* n_elements == n_threads * elts_per_thread + elts_left_over. */
+	}
     }
 
   size_t count = n_threads == 0 ? 0 : n_threads - 1;
@@ -167,20 +185,47 @@ parallel_for_each (unsigned n, RandomIt first, RandomIt last,
   if (parallel_for_each_debug)
     {
       debug_printf (_("Parallel for: n_elements: %zu\n"), n_elements);
-      debug_printf (_("Parallel for: minimum elements per thread: %u\n"), n);
-      debug_printf (_("Parallel for: elts_per_thread: %zu\n"), elts_per_thread);
+      if (task_size != nullptr)
+	{
+	  debug_printf (_("Parallel for: total_size: %zu\n"), total_size);
+	  debug_printf (_("Parallel for: size_per_thread: %zu\n"), size_per_thread);
+	}
+      else
+	{
+	  debug_printf (_("Parallel for: minimum elements per thread: %u\n"), n);
+	  debug_printf (_("Parallel for: elts_per_thread: %zu\n"), elts_per_thread);
+	}
     }
 
+  size_t remaining_size = total_size;
   for (int i = 0; i < count; ++i)
     {
-      RandomIt end = first + elts_per_thread;
-      if (i < elts_left_over)
-	/* Distribute the leftovers over the worker threads, to avoid having
-	   to handle all of them in a single thread.  */
-	end++;
+      RandomIt end;
+      size_t s = 0;
+      if (task_size == nullptr)
+	{
+	  end = first + elts_per_thread;
+	  if (i < elts_left_over)
+	    /* Distribute the leftovers over the worker threads, to avoid having
+	       to handle all of them in a single thread.  */
+	    end++;
+	}
+      else
+	{
+	  RandomIt j;
+	  for (j = first; j < last && s < size_per_thread; ++j)
+	    s += (size_t)(task_size) (j);
+	  end = j;
+	  remaining_size -= s;
+	}
       if (parallel_for_each_debug)
-	debug_printf (_("Parallel for: elements on worker thread %i\t: %zu\n"),
-		      i, (size_t)(end - first));
+	{
+	  debug_printf (_("Parallel for: elements on worker thread %i\t: %zu"),
+			i, (size_t)(end - first));
+	  if (task_size != nullptr)
+	    debug_printf (_("\t(size: %zu)"), s);
+	  debug_printf (_("\n"));
+	}
       results.post (i, [=] ()
         {
 	  return callback (first, end);
@@ -190,12 +235,22 @@ parallel_for_each (unsigned n, RandomIt first, RandomIt last,
 
   for (int i = count; i < n_worker_threads; ++i)
     if (parallel_for_each_debug)
-      debug_printf (_("Parallel for: elements on worker thread %i\t: 0\n"), i);
+      {
+	debug_printf (_("Parallel for: elements on worker thread %i\t: 0"), i);
+	if (task_size != nullptr)
+	  debug_printf (_("\t(size: 0)"));
+	debug_printf (_("\n"));
+      }
 
   /* Process all the remaining elements in the main thread.  */
   if (parallel_for_each_debug)
-    debug_printf (_("Parallel for: elements on main thread\t\t: %zu\n"),
-		  (size_t)(last - first));
+    {
+      debug_printf (_("Parallel for: elements on main thread\t\t: %zu"),
+		    (size_t)(last - first));
+      if (task_size != nullptr)
+	debug_printf (_("\t(size: %zu)"), remaining_size);
+      debug_printf (_("\n"));
+    }
   return results.finish ([=] ()
     {
       return callback (first, last);


* Re: [PATCH][gdbsupport] Use task size in parallel_for_each
  2022-07-21 20:23   ` Tom de Vries via Gdb-patches
@ 2022-07-22  0:03     ` Pedro Alves
  2022-07-22 11:07       ` Pedro Alves
  2022-07-22 17:07       ` Tom de Vries via Gdb-patches
  2022-07-22 19:08     ` Tom Tromey
  1 sibling, 2 replies; 13+ messages in thread
From: Pedro Alves @ 2022-07-22  0:03 UTC (permalink / raw)
  To: Tom de Vries, gdb-patches; +Cc: Tom Tromey

On 2022-07-21 9:23 p.m., Tom de Vries wrote:
> On 7/21/22 19:35, Pedro Alves wrote:
>>> diff --git a/gdbsupport/parallel-for.h b/gdbsupport/parallel-for.h
>>> index bf40f125f0f..3c9269574df 100644
>>> --- a/gdbsupport/parallel-for.h
>>> +++ b/gdbsupport/parallel-for.h
>>> @@ -134,7 +134,9 @@ typename gdb::detail::par_for_accumulator<
>>>       typename std::result_of<RangeFunction (RandomIt, RandomIt)>::type
>>>     >::result_type
>>>   parallel_for_each (unsigned n, RandomIt first, RandomIt last,
>>> -           RangeFunction callback)
>>> +           RangeFunction callback,
>>> +           std::function<unsigned int(RandomIt)> *task_size_ptr
>>> +             = (std::function<unsigned int(RandomIt)> *)nullptr)
>>
>> That use of a std::function pointer looks odd.  AFAICT, TASK_SIZE_PTR is only ever called
>> as a callback by parallel_for_each, for setup, in the calling thread, and isn't stored
>> anywhere, right?  If so, gdb::function_view instead should work, is lightweight, and is
>> nullable, meaning you don't need a pointer.
>>
>> And then, at the caller, just using a lambda instead of a std::function should work too:
>>
>>      auto task_size = [=] (iter_type iter)
>>        {
>>     dwarf2_per_cu_data *per_cu = iter->get ();
>>     return per_cu->length ();
>>        };
> 
> I've tried that (attached) but ran into the usual template error mess, not sure how I could solve that yet.

I see.  The problem is that here:

 @@ -134,7 +135,9 @@ typename gdb::detail::par_for_accumulator<
      typename std::result_of<RangeFunction (RandomIt, RandomIt)>::type
    >::result_type
  parallel_for_each (unsigned n, RandomIt first, RandomIt last,
 -		   RangeFunction callback)
 +		   RangeFunction callback,
 +		   gdb::function_view<unsigned int(RandomIt)> task_size
 +		     = nullptr)
 
parallel_for_each is a template, and the function_view parameter's type depends
on a template parameter (RandomIt), so we can't rely on implicit conversions, such
as when passing a lambda  (lambda -> function_view).  We need to pass a function_view of
the right type already.  That's not special about function_view, it's just how templates
and overload resolution work.
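
Here's a minimal standalone example of the same deduction failure.  It
uses std::function and made-up names rather than gdb::function_view, but
the mechanism is identical:

#include <functional>
#include <vector>

template <typename It>
unsigned
sum_sizes (It first, It last, std::function<unsigned (It)> task_size)
{
  unsigned sum = 0;
  for (It i = first; i != last; ++i)
    sum += task_size (i);
  return sum;
}

int main ()
{
  std::vector<int> v {1, 2, 3};
  using iter = std::vector<int>::iterator;

  /* Error: 'It' cannot be deduced from the lambda argument, because the
     closure type is not std::function<unsigned (It)> for any 'It'.  */
  // sum_sizes (v.begin (), v.end (), [] (iter) { return 1u; });

  /* OK: this argument's type already names the iterator type, so
     deduction from this parameter/argument pair succeeds and agrees with
     the deduction from FIRST and LAST.  */
  std::function<unsigned (iter)> task_size = [] (iter) { return 1u; };
  return sum_sizes (v.begin (), v.end (), task_size) == 3 ? 0 : 1;
}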

So this would fix it:

diff --git c/gdb/dwarf2/read.c w/gdb/dwarf2/read.c
index 23c3873cba6..06df773f1e0 100644
--- c/gdb/dwarf2/read.c
+++ w/gdb/dwarf2/read.c
@@ -7067,11 +7067,12 @@ dwarf2_build_psymtabs_hard (dwarf2_per_objfile *per_objfile)
 
     using iter_type = decltype (per_bfd->all_comp_units.begin ());
 
-    auto task_size = [=] (iter_type iter)
+    auto task_size_ = [=] (iter_type iter)
       {
        dwarf2_per_cu_data *per_cu = iter->get ();
        return per_cu->length ();
       };
+    gdb::function_view<unsigned (iter_type)> task_size = task_size_;

Though it's annoying to have to write the function_view type. 

Note that this instead, is a bad pattern with function_view (similarly
to other view types, like string_view), as it immediately dangles the lambda
temporary:

-    auto task_size = [=] (iter_type iter)
+    gdb::function_view<unsigned (iter_type)> task_size = [=] (iter_type iter)
       {
        dwarf2_per_cu_data *per_cu = iter->get ();
        return per_cu->length ();
       };

I think the best is to introduce a gdb::make_function_view function,
so that you can do this:

diff --git c/gdb/dwarf2/read.c w/gdb/dwarf2/read.c
index 23c3873cba6..255b955a54c 100644
--- c/gdb/dwarf2/read.c
+++ w/gdb/dwarf2/read.c
@@ -7101,7 +7101,7 @@ dwarf2_build_psymtabs_hard (dwarf2_per_objfile *per_objfile)
              }
          }
        return result_type (thread_storage.release (), std::move (errors));
-      }, task_size);
+      }, gdb::make_function_view (task_size));
 
     /* Only show a given exception a single time.  */
     std::unordered_set<gdb_exception> seen_exceptions;

I've got that working here.  I'll post it tomorrow.

Note: the 'task_size' lambda doesn't actually need to capture anything:

@@ -7067,11 +7067,12 @@ dwarf2_build_psymtabs_hard (dwarf2_per_objfile *per_objfile)
 
     using iter_type = decltype (per_bfd->all_comp_units.begin ());
 
-    auto task_size = [=] (iter_type iter)
+    auto task_size = [] (iter_type iter)
       {
        dwarf2_per_cu_data *per_cu = iter->get ();
        return per_cu->length ();
       };



* Re: [PATCH][gdbsupport] Use task size in parallel_for_each
  2022-07-22  0:03     ` Pedro Alves
@ 2022-07-22 11:07       ` Pedro Alves
  2022-07-22 17:07       ` Tom de Vries via Gdb-patches
  1 sibling, 0 replies; 13+ messages in thread
From: Pedro Alves @ 2022-07-22 11:07 UTC (permalink / raw)
  To: Tom de Vries, gdb-patches; +Cc: Tom Tromey

On 2022-07-22 1:03 a.m., Pedro Alves wrote:

> I think the best is to introduce a gdb::make_function_view function,
> so that you can do this:
> 
> diff --git c/gdb/dwarf2/read.c w/gdb/dwarf2/read.c
> index 23c3873cba6..255b955a54c 100644
> --- c/gdb/dwarf2/read.c
> +++ w/gdb/dwarf2/read.c
> @@ -7101,7 +7101,7 @@ dwarf2_build_psymtabs_hard (dwarf2_per_objfile *per_objfile)
>               }
>           }
>         return result_type (thread_storage.release (), std::move (errors));
> -      }, task_size);
> +      }, gdb::make_function_view (task_size));
>  
>      /* Only show a given exception a single time.  */
>      std::unordered_set<gdb_exception> seen_exceptions;
> 
> I've got that working here.  I'll post it tomorrow.

Posted here:

 [PATCH 0/1] Introduce gdb::make_function_view
 https://sourceware.org/pipermail/gdb-patches/2022-July/190991.html


* Re: [PATCH][gdbsupport] Use task size in parallel_for_each
  2022-07-22  0:03     ` Pedro Alves
  2022-07-22 11:07       ` Pedro Alves
@ 2022-07-22 17:07       ` Tom de Vries via Gdb-patches
  1 sibling, 0 replies; 13+ messages in thread
From: Tom de Vries via Gdb-patches @ 2022-07-22 17:07 UTC (permalink / raw)
  To: Pedro Alves, gdb-patches; +Cc: Tom Tromey

On 7/22/22 02:03, Pedro Alves wrote:
> On 2022-07-21 9:23 p.m., Tom de Vries wrote:
>> On 7/21/22 19:35, Pedro Alves wrote:
>>>> diff --git a/gdbsupport/parallel-for.h b/gdbsupport/parallel-for.h
>>>> index bf40f125f0f..3c9269574df 100644
>>>> --- a/gdbsupport/parallel-for.h
>>>> +++ b/gdbsupport/parallel-for.h
>>>> @@ -134,7 +134,9 @@ typename gdb::detail::par_for_accumulator<
>>>>        typename std::result_of<RangeFunction (RandomIt, RandomIt)>::type
>>>>      >::result_type
>>>>    parallel_for_each (unsigned n, RandomIt first, RandomIt last,
>>>> -           RangeFunction callback)
>>>> +           RangeFunction callback,
>>>> +           std::function<unsigned int(RandomIt)> *task_size_ptr
>>>> +             = (std::function<unsigned int(RandomIt)> *)nullptr)
>>>
>>> That use of a std::function pointer looks odd.  AFAICT, TASK_SIZE_PTR is only ever called
>>> as a callback by parallel_for_each, for setup, in the calling thread, and isn't stored
>>> anywhere, right?  If so, gdb::function_view instead should work, is lightweight, and is
>>> nullable, meaning you don't need a pointer.
>>>
>>> And then, at the caller, just using a lambda instead of a std::function should work too:
>>>
>>>       auto task_size = [=] (iter_type iter)
>>>         {
>>>      dwarf2_per_cu_data *per_cu = iter->get ();
>>>      return per_cu->length ();
>>>         };
>>
>> I've tried that (attached) but ran into the usual template error mess, not sure how I could solve that yet.
> 
> I see.  The problem is that here:
> 
>   @@ -134,7 +135,9 @@ typename gdb::detail::par_for_accumulator<
>        typename std::result_of<RangeFunction (RandomIt, RandomIt)>::type
>      >::result_type
>    parallel_for_each (unsigned n, RandomIt first, RandomIt last,
>   -		   RangeFunction callback)
>   +		   RangeFunction callback,
>   +		   gdb::function_view<unsigned int(RandomIt)> task_size
>   +		     = nullptr)
>   
> parallel_for_each is a template, and the function_view parameter's type depends
> on a template parameter (RandomIt), so we can't rely on implicit conversions, such
> as when passing a lambda  (lambda -> function_view).  We need to pass a function_view of
> the right type already.  That's not special about function_view, it's just how templates
> and overload resolution work.
> 	

Ack, I usually manage if there's a specific error message, but it was too
generic in this case.  Anyway, I just lack experience writing templated 
code, something that needs fixing on my side.

> So this would fix it:
> 
> diff --git c/gdb/dwarf2/read.c w/gdb/dwarf2/read.c
> index 23c3873cba6..06df773f1e0 100644
> --- c/gdb/dwarf2/read.c
> +++ w/gdb/dwarf2/read.c
> @@ -7067,11 +7067,12 @@ dwarf2_build_psymtabs_hard (dwarf2_per_objfile *per_objfile)
>   
>       using iter_type = decltype (per_bfd->all_comp_units.begin ());
>   
> -    auto task_size = [=] (iter_type iter)
> +    auto task_size_ = [=] (iter_type iter)
>         {
>          dwarf2_per_cu_data *per_cu = iter->get ();
>          return per_cu->length ();
>         };
> +    gdb::function_view<unsigned (iter_type)> task_size = task_size_;
> 
> Though it's annoying to have to write the function_view type.
> 
> Note that this instead, is a bad pattern with function_view (similarly
> to other view types, like string_view), as it immediately dangles the lambda
> temporary:
> 
> -    auto task_size = [=] (iter_type iter)
> +    gdb::function_view<unsigned (iter_type)> task_size = [=] (iter_type iter)
>         {
>          dwarf2_per_cu_data *per_cu = iter->get ();
>          return per_cu->length ();
>         };
> 
> I think the best is to introduce a gdb::make_function_view function,
> so that you can do this:
> 
> diff --git c/gdb/dwarf2/read.c w/gdb/dwarf2/read.c
> index 23c3873cba6..255b955a54c 100644
> --- c/gdb/dwarf2/read.c
> +++ w/gdb/dwarf2/read.c
> @@ -7101,7 +7101,7 @@ dwarf2_build_psymtabs_hard (dwarf2_per_objfile *per_objfile)
>                }
>            }
>          return result_type (thread_storage.release (), std::move (errors));
> -      }, task_size);
> +      }, gdb::make_function_view (task_size));
>   
>       /* Only show a given exception a single time.  */
>       std::unordered_set<gdb_exception> seen_exceptions;
> 
> I've got that working here.  I'll post it tomorrow.
> 


Ack, used in the updated patch, submitted in this series:
https://sourceware.org/pipermail/gdb-patches/2022-July/191004.html


> Note: the 'task_size' lambda doesn't actually need to capture anything:
> 
> @@ -7067,11 +7067,12 @@ dwarf2_build_psymtabs_hard (dwarf2_per_objfile *per_objfile)
>   
>       using iter_type = decltype (per_bfd->all_comp_units.begin ());
>   
> -    auto task_size = [=] (iter_type iter)
> +    auto task_size = [] (iter_type iter)
>         {
>          dwarf2_per_cu_data *per_cu = iter->get ();
>          return per_cu->length ();
>         };
> 

Fixed.

Thanks for the review and help.

- Tom


* Re: [PATCH][gdbsupport] Use task size in parallel_for_each
  2022-07-21 20:23   ` Tom de Vries via Gdb-patches
  2022-07-22  0:03     ` Pedro Alves
@ 2022-07-22 19:08     ` Tom Tromey
  2022-07-22 19:38       ` Simon Marchi via Gdb-patches
                         ` (2 more replies)
  1 sibling, 3 replies; 13+ messages in thread
From: Tom Tromey @ 2022-07-22 19:08 UTC (permalink / raw)
  To: Tom de Vries via Gdb-patches; +Cc: Tom Tromey, Pedro Alves

Pedro> My passerby comment is that I wonder whether we should consider
Pedro> switching to a work stealing implementation, so that threads that
Pedro> are done with their chunk would grab another chunk from the work
Pedro> pool.  I think this would spare us from having to worry about
Pedro> these distribution heuristics.

Tom> I also thought about a dynamic solution, but decided to try the
Tom> simplest solution first.

Tom> Anyway, with a dynamic solution you still might want to decide how big
Tom> a chunk is, for which you could still need this type of heuristic.

I think the idea of the work-stealing approach is that each worker grabs
work as it can, and so no sizing is needed at all.  If one worker ends
up with a very large CU, it will simply end up working on fewer CUs.

In this situation, work stealing might make the overall implementation
simpler.  Perhaps the batching parameter patch (commit 82d734f7a) and
the vector of results patch (commit f4565e4c9) could be removed.

Then, rather than using parallel_for, the DWARF reader could send N jobs
to the thread pool, and each job would simply take the next available CU
by incrementing an atomic counter.  When the counter reached the number
of CUs, a job would stop.
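
For illustration, a rough standalone sketch of that scheme (generic code
with made-up names, not an actual gdb patch): indices 0 .. N-1 are
pre-assigned to the N jobs, and a shared atomic counter hands out the
rest.

#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

int main ()
{
  std::vector<int> work (2818);		/* stand-in for the CUs */
  const size_t n_jobs = 4;

  /* Indices 0 .. n_jobs-1 are handed out up front; the shared counter
     starts just past them.  */
  std::atomic<size_t> next_item (n_jobs);

  std::vector<std::thread> jobs;
  for (size_t i = 0; i < n_jobs; ++i)
    jobs.emplace_back ([&, i] ()
      {
	size_t idx = i;
	while (idx < work.size ())
	  {
	    work[idx] = 1;		/* "process" item IDX */
	    idx = next_item++;		/* claim the next unprocessed item */
	  }
      });

  for (std::thread &job : jobs)
    job.join ();
  return 0;
}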

Tom


* Re: [PATCH][gdbsupport] Use task size in parallel_for_each
  2022-07-22 19:08     ` Tom Tromey
@ 2022-07-22 19:38       ` Simon Marchi via Gdb-patches
  2022-07-23  3:17         ` Tom Tromey
  2022-07-22 21:21       ` Tom Tromey
  2022-07-23  5:55       ` Tom de Vries via Gdb-patches
  2 siblings, 1 reply; 13+ messages in thread
From: Simon Marchi via Gdb-patches @ 2022-07-22 19:38 UTC (permalink / raw)
  To: Tom Tromey, Tom de Vries via Gdb-patches; +Cc: Pedro Alves



On 2022-07-22 15:08, Tom Tromey wrote:
> Pedro> My passerby comment is that I wonder whether we should consider
> Pedro> switching to a work stealing implementation, so that threads that
> Pedro> are done with their chunk would grab another chunk from the work
> Pedro> pool.  I think this would spare us from having to worry about
> Pedro> these distribution heuristics.
> 
> Tom> I also thought about a dynamic solution, but decided to try the
> Tom> simplest solution first.
> 
> Tom> Anyway, with a dynamic solution you still might want to decide how big
> Tom> a chunk is, for which you could still need this type of heuristic.
> 
> I think the idea of the work-stealing approach is that each worker grabs
> work as it can, and so no sizing is needed at all.  If one worker ends
> up with a very large CU, it will simply end up working on fewer CUs.
> 
> In this situation, work stealing might make the overall implementation
> simpler.  Perhaps the batching parameter patch (commit 82d734f7a) and
> the vector of results patch (commit f4565e4c9) could be removed.
> 
> Then, rather than using parallel_for, the DWARF reader could send N jobs
> to the thread pool, and each job would simply take the next available CU
> by incrementing an atomic counter.  When the counter reached the number
> of CUs, a job would stop.
> 
> Tom

Just a nit, if we are talking about a centralized work pool that threads
pick from to get their next work item, that's not really work stealing.
Work stealing is when threads all have their own work queue, and threads
with empty queues go steal some work from other threads' queues.

I suppose a downside to having one centralized work item queue / pool
where threads pick one work item at a time is contention.  If you have a
ton of tiny work items, threads could spend a lot of time synchronizing
on that work queue, since they need to synchronize between each item.
With work stealing, there's no synchronization required until a thread
is actually out of work.
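
For concreteness, a rough standalone sketch (made-up code, not gdb's) of
work stealing in that sense: each worker owns a queue and only looks at
the other workers' queues once its own runs dry.  Mutex-protected deques
keep the sketch simple; a real scheduler would use lock-free deques so
the common, uncontended pop stays cheap.

#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>
#include <thread>
#include <vector>

struct worker_queue
{
  std::mutex m;
  std::deque<int> items;	/* work items, e.g. CU indices */
};

/* Pop from our own queue if possible, otherwise try to steal from the
   other workers' queues.  Returns nothing once every queue is empty.  */
static std::optional<int>
pop_or_steal (std::vector<worker_queue> &queues, size_t self)
{
  for (size_t k = 0; k < queues.size (); ++k)
    {
      worker_queue &q = queues[(self + k) % queues.size ()];
      std::lock_guard<std::mutex> lock (q.m);
      if (!q.items.empty ())
	{
	  int item = q.items.front ();
	  q.items.pop_front ();
	  return item;
	}
    }
  return {};
}

int main ()
{
  const size_t n_threads = 4;
  std::vector<worker_queue> queues (n_threads);

  /* Pre-distribute the items round-robin over the per-worker queues.  */
  for (int i = 0; i < 2818; ++i)
    queues[i % n_threads].items.push_back (i);

  std::vector<std::thread> workers;
  for (size_t i = 0; i < n_threads; ++i)
    workers.emplace_back ([&queues, i] ()
      {
	while (std::optional<int> item = pop_or_steal (queues, i))
	  (void) *item;		/* process the item here */
      });

  for (std::thread &t : workers)
    t.join ();
  return 0;
}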

Given our work items are fairly large, I'm guessing that having one
centralized queue is fine.  Well, benchmarks will tell if that is faster
than Tom de Vries' balancing approach.

Simon


* Re: [PATCH][gdbsupport] Use task size in parallel_for_each
  2022-07-22 19:08     ` Tom Tromey
  2022-07-22 19:38       ` Simon Marchi via Gdb-patches
@ 2022-07-22 21:21       ` Tom Tromey
  2022-07-23  6:51         ` Tom de Vries via Gdb-patches
  2022-07-23  5:55       ` Tom de Vries via Gdb-patches
  2 siblings, 1 reply; 13+ messages in thread
From: Tom Tromey @ 2022-07-22 21:21 UTC (permalink / raw)
  To: Tom Tromey; +Cc: Pedro Alves, Tom de Vries via Gdb-patches

Tom> Then, rather than using parallel_for, the DWARF reader could send N jobs
Tom> to the thread pool, and each job would simply take the next available CU
Tom> by incrementing an atomic counter.  When the counter reached the number
Tom> of CUs, a job would stop.

Here's a patch.  I didn't test it much, though according to "maint time 1",
it is ~10% faster on gdb itself.  I pushed it as "t/work-stealing" on my
github as well, in case you want to try it out.

Tom

commit 0d530b3a072fd7acd9fc89c755ab4ded2719a17d
Author: Tom Tromey <tom@tromey.com>
Date:   Fri Jul 22 15:16:22 2022 -0600

    patch

diff --git a/gdb/dwarf2/read.c b/gdb/dwarf2/read.c
index 42230607fe0..7768d02b4f3 100644
--- a/gdb/dwarf2/read.c
+++ b/gdb/dwarf2/read.c
@@ -7074,28 +7074,44 @@ dwarf2_build_psymtabs_hard (dwarf2_per_objfile *per_objfile)
        prompt, which looks weird.  */
     using result_type = std::pair<std::unique_ptr<cooked_index>,
 				  std::vector<gdb_exception>>;
-    std::vector<result_type> results
-      = gdb::parallel_for_each (1, per_bfd->all_comp_units.begin (),
-				per_bfd->all_comp_units.end (),
-				[=] (iter_type iter, iter_type end)
+    const size_t n_units = per_bfd->all_comp_units.size ();
+    std::vector<result_type> results (n_units);
+
+    const size_t n_threads
+      = std::min (gdb::thread_pool::g_thread_pool->thread_count (), n_units);
+    gdb::future<void> futures[n_threads];
+
+    std::atomic<size_t> next_cu = n_threads;
+    for (size_t i = 0; i < n_threads; ++i)
       {
-	std::vector<gdb_exception> errors;
-	cooked_index_storage thread_storage;
-	for (; iter != end; ++iter)
+	futures[i] = gdb::thread_pool::g_thread_pool->post_task ([&, i] ()
 	  {
-	    dwarf2_per_cu_data *per_cu = iter->get ();
-	    try
-	      {
-		process_psymtab_comp_unit (per_cu, per_objfile,
-					   &thread_storage);
-	      }
-	    catch (gdb_exception &except)
+	    std::vector<gdb_exception> &errors = results[i].second;
+	    cooked_index_storage thread_storage;
+
+	    size_t this_cu = i;
+	    while (this_cu < n_units)
 	      {
-		errors.push_back (std::move (except));
+		dwarf2_per_cu_data *per_cu = per_bfd->get_cu (this_cu);
+		try
+		  {
+		    process_psymtab_comp_unit (per_cu, per_objfile,
+					       &thread_storage);
+		  }
+		catch (gdb_exception &except)
+		  {
+		    errors.push_back (std::move (except));
+		  }
+
+		this_cu = next_cu++;
 	      }
-	  }
-	return result_type (thread_storage.release (), std::move (errors));
-      });
+
+	    results[i].first = thread_storage.release ();
+	  });
+      }
+
+    for (size_t i = 0; i < n_threads; ++i)
+      futures[i].get ();
 
     /* Only show a given exception a single time.  */
     std::unordered_set<gdb_exception> seen_exceptions;


* Re: [PATCH][gdbsupport] Use task size in parallel_for_each
  2022-07-22 19:38       ` Simon Marchi via Gdb-patches
@ 2022-07-23  3:17         ` Tom Tromey
  0 siblings, 0 replies; 13+ messages in thread
From: Tom Tromey @ 2022-07-23  3:17 UTC (permalink / raw)
  To: Simon Marchi; +Cc: Tom Tromey, Pedro Alves, Tom de Vries via Gdb-patches

Simon> Just a nit, if we are talking about a centralized work pool that threads
Simon> pick from to get their next work item, that's not really work stealing.
Simon> Work stealing is when threads all have their own work queue, and threads
Simon> with empty queues go steal some work from other threads' queues.

Thanks, that's a clear explanation.

Simon> Given our work items are fairly large, I'm guessing that having one
Simon> centralized queue is fine.  Well, benchmarks will tell if that is faster
Simon> than Tom de Vries' balancing approach.

Yeah.  Also I didn't see a straightforward way to implement work
stealing.

Tom


* Re: [PATCH][gdbsupport] Use task size in parallel_for_each
  2022-07-22 19:08     ` Tom Tromey
  2022-07-22 19:38       ` Simon Marchi via Gdb-patches
  2022-07-22 21:21       ` Tom Tromey
@ 2022-07-23  5:55       ` Tom de Vries via Gdb-patches
  2 siblings, 0 replies; 13+ messages in thread
From: Tom de Vries via Gdb-patches @ 2022-07-23  5:55 UTC (permalink / raw)
  To: Tom Tromey, Tom de Vries via Gdb-patches; +Cc: Pedro Alves

On 7/22/22 21:08, Tom Tromey wrote:
> Pedro> My passerby comment is that I wonder whether we should consider
> Pedro> switching to a work stealing implementation, so that threads that
> Pedro> are done with their chunk would grab another chunk from the work
> Pedro> pool.  I think this would spare us from having to worry about
> Pedro> these distribution heuristics.
> 
> Tom> I also thought about a dynamic solution, but decided to try the
> Tom> simplest solution first.
> 
> Tom> Anyway, with a dynamic solution you still might want to decide how big
> Tom> a chunk is, for which you could still need this type of heuristic.
> 
> I think the idea of the work-stealing approach is that each worker grabs
> work as it can, and so no sizing is needed at all.  If one worker ends
> up with a very large CU, it will simply end up working on fewer CUs.
> 

Right, I understand that.

My concern was that the current parallel-for solution potentially 
exploits caching behaviour by accessing consecutive elements, while a 
pure work-stealing approach is more likely to process non-consecutive 
elements, which wouldn't exploit that.  But thinking about it now, 
that's reasoning from the point of view of a single cpu, and the 
work-stealing approach would also allow exploiting caching behaviour 
between cpus.

Anyway, I wondered whether a hybrid approach would be a good idea: 
first schedule a percentage of the workload, say 80%, in slices 
determined using size heuristics, and do the remaining 20% using work 
stealing, to counteract the discrepancy between the size heuristics and 
the actual execution time.
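
Purely as an illustration of what such a hybrid split could look like
(toy code, not from any patch in this thread): plan ~80% of the total
size as static chunks, and leave the tail to be claimed through a shared
counter once a worker finishes its static chunk.
...
#include <atomic>
#include <cstddef>
#include <numeric>
#include <vector>

int main ()
{
  std::vector<size_t> sizes (2818, 1);	/* stand-in for per-CU sizes */
  const size_t n_threads = 4;

  size_t total = std::accumulate (sizes.begin (), sizes.end (), (size_t) 0);
  size_t static_budget = total * 8 / 10;	/* ~80% planned up front */
  size_t per_thread = static_budget / n_threads;

  /* Greedy static chunks, as in the size-heuristics patch.  */
  size_t begin = 0;
  for (size_t t = 0; t < n_threads; ++t)
    {
      size_t acc = 0, end = begin;
      while (end < sizes.size () && acc < per_thread)
	acc += sizes[end++];
      /* worker T first processes [begin, end) ...  */
      begin = end;
    }

  /* ... and the remaining ~20%, [begin, sizes.size ()), is claimed one
     element at a time through a shared counter.  */
  std::atomic<size_t> next_item (begin);
  (void) next_item;
  return 0;
}
...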

It also occurred to me today that if the last CU happens to be a large 
one, with pure work stealing you may well end up with one thread 
processing that CU while the other threads sit idle, whereas with size 
heuristics you may force a thread to start on it early.

Thanks,
- Tom

> In this situation, work stealing might make the overall implementation
> simpler.  Perhaps the batching parameter patch (commit 82d734f7a) and
> the vector of results patch (commit f4565e4c9) could be removed.
> 
> Then, rather than using parallel_for, the DWARF reader could send N jobs
> to the thread pool, and each job would simply take the next available CU
> by incrementing an atomic counter.  When the counter reached the number
> of CUs, a job would stop.
> 
> Tom


* Re: [PATCH][gdbsupport] Use task size in parallel_for_each
  2022-07-22 21:21       ` Tom Tromey
@ 2022-07-23  6:51         ` Tom de Vries via Gdb-patches
  2022-07-31  8:38           ` Tom de Vries via Gdb-patches
  0 siblings, 1 reply; 13+ messages in thread
From: Tom de Vries via Gdb-patches @ 2022-07-23  6:51 UTC (permalink / raw)
  To: Tom Tromey; +Cc: Pedro Alves, Tom de Vries via Gdb-patches

On 7/22/22 23:21, Tom Tromey wrote:
> Tom> Then, rather than using parallel_for, the DWARF reader could send N jobs
> Tom> to the thread pool, and each job would simply take the next available CU
> Tom> by incrementing an atomic counter.  When the counter reached the number
> Tom> of CUs, a job would stop.
> 
> Here's a patch.  I didn't test it much, though according to "maint time 1",
> it is ~10% faster on gdb itself.  I pushed it as "t/work-stealing" on my
> github as well, in case you want to try it out.
> 

I've tried it out (it initially didn't build for me with gcc 7.5.0, but 
it did after switching to gcc 12.1.1).

So, the same libxul experiment as before, gcc-12, O2, base commit 
5ae3df226b1.

---

base commit:

real: 4.64
real: 4.10
real: 4.11
real: 4.65
real: 4.04
real: 4.21
real: 4.03
real: 4.48
real: 4.04
real: 4.65

t/work-stealing:

real: 3.65
real: 3.58
real: 3.58
real: 3.58
real: 3.58
real: 3.57
real: 3.58
real: 3.59
real: 3.59
real: 3.59

size heuristics:

real: 3.44
real: 3.44
real: 3.52
real: 3.43
real: 3.54
real: 3.47
real: 3.60
real: 3.46
real: 3.46
real: 3.52

Just one data point of course.

Thanks,
- Tom



* Re: [PATCH][gdbsupport] Use task size in parallel_for_each
  2022-07-23  6:51         ` Tom de Vries via Gdb-patches
@ 2022-07-31  8:38           ` Tom de Vries via Gdb-patches
  0 siblings, 0 replies; 13+ messages in thread
From: Tom de Vries via Gdb-patches @ 2022-07-31  8:38 UTC (permalink / raw)
  To: Tom Tromey; +Cc: Tom de Vries via Gdb-patches, Pedro Alves

On 7/23/22 08:51, Tom de Vries via Gdb-patches wrote:
> On 7/22/22 23:21, Tom Tromey wrote:
>> Tom> Then, rather than using parallel_for, the DWARF reader could send 
>> N jobs
>> Tom> to the thread pool, and each job would simply take the next 
>> available CU
>> Tom> by incrementing an atomic counter.  When the counter reached the 
>> number
>> Tom> of CUs, a job would stop.
>>
>> Here's a patch.  I didn't test it much, though according to "maint 
>> time 1",
>> it is ~10% faster on gdb itself.  I pushed it as "t/work-stealing" on my
>> github as well, in case you want to try it out.
>>
> 
> I've tried it out (initially didn't build for me with gcc 7.5.0, but it 
> did after using gcc 12.1.1).
> 
> So, the same libxul experiment as before, gcc-12, O2, base commit 
> 5ae3df226b1.
> 
> ---
> 
> base commit:
> 
> real: 4.64
> real: 4.10
> real: 4.11
> real: 4.65
> real: 4.04
> real: 4.21
> real: 4.03
> real: 4.48
> real: 4.04
> real: 4.65
> 
> t/work-stealing:
> 
> real: 3.65
> real: 3.58
> real: 3.58
> real: 3.58
> real: 3.58
> real: 3.57
> real: 3.58
> real: 3.59
> real: 3.59
> real: 3.59
> 
> size heuristics:
> 
> real: 3.44
> real: 3.44
> real: 3.52
> real: 3.43
> real: 3.54
> real: 3.47
> real: 3.60
> real: 3.46
> real: 3.46
> real: 3.52
> 
> Just one data point of course.
> 

I've rebased the patch series on the latest version of the 
gdb::make_function_view patch, and retested using the try-bot.  If there 
are no further comments, I'll commit in a week or so.

Thanks,
- Tom

