On 7/21/22 19:35, Pedro Alves wrote:
> On 2022-07-18 8:42 p.m., Tom de Vries via Gdb-patches wrote:
>> Hi,
>>
>> Ensuring a fair distribution over the worker threads and main thread in terms
>> of number of CUs might not be the most efficient way, given that CUs can vary
>> in size.
>>
>> Fix this by:
>> - adding a task_size_ptr parameter to parallel_for_each,
>>   defaulting to nullptr,
>> - using per_cu->get_length () as the task size in the parallel_for_each
>>   in dwarf2_build_psymtabs_hard, and
>> - using the task size in parallel_for_each to distribute similarly-sized
>>   tasks to the threads.
>>
>> I've used this experiment to verify the performance impact:
>> ...
>> $ for n in $(seq 1 10); do \
>>     time gdb -q -batch ~/firefox/libxul.so-93.0-1.1.x86_64.debug \
>>       2>&1 \
>>       | grep "real:"; \
>>   done
>> ...
>> and without the patch got:
>> ...
>> real: 4.71
>> real: 4.88
>> real: 4.29
>> real: 4.30
>> real: 4.65
>> real: 4.27
>> real: 4.27
>> real: 4.27
>> real: 4.75
>> real: 4.41
>> ...
>> and with the patch:
>> ...
>> real: 3.68
>> real: 3.81
>> real: 3.80
>> real: 3.68
>> real: 3.75
>> real: 3.69
>> real: 3.69
>> real: 3.74
>> real: 3.67
>> real: 3.74
>> ...
>> so that seems a reasonable improvement.
>
> Nice.

:)

>> With parallel_for_each_debug set to true, we get some more details about the
>> difference in behaviour.  Without the patch we have:
>> ...
>> Parallel for: n_elements: 2818
>> Parallel for: minimum elements per thread: 1
>> Parallel for: elts_per_thread: 704
>> Parallel for: elements on worker thread 0: 705
>> Parallel for: elements on worker thread 1: 705
>> Parallel for: elements on worker thread 2: 704
>> Parallel for: elements on worker thread 3: 0
>> Parallel for: elements on main thread: 704
>> ...
>> and with the patch:
>> ...
>> Parallel for: n_elements: 2818
>> Parallel for: total_size: 1483674865
>> Parallel for: size_per_thread: 370918716
>> Parallel for: elements on worker thread 0: 752 (size: 371811790)
>> Parallel for: elements on worker thread 1: 360 (size: 371509370)
>> Parallel for: elements on worker thread 2: 1130 (size: 372681710)
>> Parallel for: elements on worker thread 3: 0 (size: 0)
>> Parallel for: elements on main thread: 576 (size: 367671995)
>> ...
>>
>> Tested on x86_64-linux.
>>
>> Any comments?
>
> I'll defer to Tromey.  Just a passerby comment, and a comment on the code
> itself if we follow this approach, below.
>
> My passerby comment is that I wonder whether we should consider switching to
> a work-stealing implementation, so that threads that are done with their
> chunk would grab another chunk from the work pool.  I think this would spare
> us from having to worry about these distribution heuristics.
>

I also thought about a dynamic solution, but decided to try the simplest
solution first.  Anyway, with a dynamic solution you still might want to
decide how big a chunk is, for which you could still need this type of
heuristic.

>> @@ -7059,6 +7059,13 @@ dwarf2_build_psymtabs_hard (dwarf2_per_objfile *per_objfile)
>>
>>    using iter_type = decltype (per_bfd->all_comp_units.begin ());
>>
>> +  std::function<unsigned int (iter_type)> task_size
>> +    = [=] (iter_type iter)
>> +      {
>> +        dwarf2_per_cu_data *per_cu = iter->get ();
>> +        return per_cu->length ();
>> +      };
>> +
>>    /* Each thread returns a pair holding a cooked index, and a vector
>>       of errors that should be printed.  The latter is done because
>>       GDB's I/O system is not thread-safe.  run_on_main_thread could be
>> @@ -7087,7 +7094,7 @@ dwarf2_build_psymtabs_hard (dwarf2_per_objfile *per_objfile)
>>            }
>>          }
>>        return result_type (thread_storage.release (), std::move (errors));
>> -    });
>> +    }, &task_size);
>>
>>    /* Only show a given exception a single time.  */
>>    std::unordered_set<gdb_exception> seen_exceptions;
>> diff --git a/gdbsupport/parallel-for.h b/gdbsupport/parallel-for.h
>> index bf40f125f0f..3c9269574df 100644
>> --- a/gdbsupport/parallel-for.h
>> +++ b/gdbsupport/parallel-for.h
>> @@ -134,7 +134,9 @@ typename gdb::detail::par_for_accumulator<
>>      typename std::result_of<RangeFunction (RandomIt, RandomIt)>::type
>>    >::result_type
>>  parallel_for_each (unsigned n, RandomIt first, RandomIt last,
>> -                   RangeFunction callback)
>> +                   RangeFunction callback,
>> +                   std::function<unsigned int (RandomIt)> *task_size_ptr
>> +                     = (std::function<unsigned int (RandomIt)> *) nullptr)
>
> That use of a std::function pointer looks odd.  AFAICT, TASK_SIZE_PTR is only
> ever called as a callback by parallel_for_each, for setup, in the calling
> thread, and isn't stored anywhere, right?  If so, gdb::function_view instead
> should work, is lightweight, and is nullable, meaning you don't need a
> pointer.
>
> And then, at the caller, just using a lambda instead of a std::function
> should work too:
>
>   auto task_size = [=] (iter_type iter)
>     {
>       dwarf2_per_cu_data *per_cu = iter->get ();
>       return per_cu->length ();
>     };

I've tried that (attached) but ran into the usual template error mess, not
sure how I could solve that yet.

Thanks,
- Tom
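
[Editor's note: the distribution heuristic the patch describes (close a chunk
once its accumulated size reaches total_size / n_threads, as shown in the
"size_per_thread" debug output above) can be sketched roughly as below.  This
is an illustrative standalone sketch; `chunk_counts` is a hypothetical name,
not gdb code.]

```cpp
#include <cstddef>
#include <vector>

/* Split a sequence of per-element task sizes into contiguous per-thread
   chunks: keep adding elements to the current chunk until its accumulated
   size reaches total / n_threads, then start the next chunk.  Returns the
   element count assigned to each thread.  */
static std::vector<size_t>
chunk_counts (const std::vector<size_t> &sizes, size_t n_threads)
{
  size_t total = 0;
  for (size_t s : sizes)
    total += s;
  size_t size_per_thread = total / n_threads;

  std::vector<size_t> counts;
  size_t count = 0, accumulated = 0;
  for (size_t s : sizes)
    {
      count++;
      accumulated += s;
      /* Close this chunk once it is "full", keeping at least one chunk
	 in reserve for the remaining elements.  */
      if (accumulated >= size_per_thread && counts.size () + 1 < n_threads)
	{
	  counts.push_back (count);
	  count = 0;
	  accumulated = 0;
	}
    }
  counts.push_back (count);  /* Remainder goes to the last thread.  */
  return counts;
}
```

This matches the behaviour visible in the debug output: threads receive very
different element counts (752, 360, 1130, ...) but nearly equal total sizes.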
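
[Editor's note: the work-stealing alternative Pedro floats could, in its
simplest form, be a shared work pool that finished threads pull from.  The
sketch below uses a shared atomic cursor handing out fixed-size batches,
rather than per-thread deques with stealing; all names are illustrative, and
this is not gdb's parallel_for_each.  As Tom notes, the batch size is itself a
heuristic.]

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

/* Run FUNC over [FIRST, LAST) on N_THREADS threads.  Each thread repeatedly
   claims the next BATCH elements via an atomic cursor until the range is
   exhausted, so faster threads naturally process more elements.  */
template<typename RandomIt, typename Func>
void
dynamic_for_each (RandomIt first, RandomIt last, Func func,
		  size_t batch = 8, size_t n_threads = 4)
{
  std::atomic<size_t> next (0);
  size_t n = last - first;

  auto worker = [&] ()
    {
      for (;;)
	{
	  /* Claim the next batch; fetch_add returns the old cursor.  */
	  size_t begin = next.fetch_add (batch);
	  if (begin >= n)
	    break;
	  size_t end = std::min (begin + batch, n);
	  func (first + begin, first + end);
	}
    };

  std::vector<std::thread> threads;
  for (size_t i = 0; i < n_threads; ++i)
    threads.emplace_back (worker);
  for (auto &t : threads)
    t.join ();
}
```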
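
[Editor's note: for readers unfamiliar with gdb::function_view, the sketch
below shows the general idea of the non-owning view type Pedro suggests: cheap
to pass, nullable like a pointer, and bindable to any callable without
allocation.  This is a simplified stand-in, not gdb's actual implementation in
gdbsupport/function-view.h.]

```cpp
#include <type_traits>
#include <utility>

template<typename Signature> class function_view;

/* Non-owning reference to a callable with the given signature.  Stores a
   type-erased pointer to the callable plus a trampoline that casts it back
   and invokes it.  The referenced callable must outlive the view.  */
template<typename Res, typename... Args>
class function_view<Res (Args...)>
{
public:
  function_view () = default;  /* Nullable, like a pointer.  */

  template<typename Callable>
  function_view (Callable &&callable)
    : m_data ((void *) &callable),
      m_invoke ([] (void *data, Args... args) -> Res
		{
		  return (*(std::remove_reference_t<Callable> *) data)
		    (std::forward<Args> (args)...);
		})
  {}

  explicit operator bool () const { return m_invoke != nullptr; }

  Res operator() (Args... args) const
  { return m_invoke (m_data, std::forward<Args> (args)...); }

private:
  void *m_data = nullptr;
  Res (*m_invoke) (void *, Args...) = nullptr;
};
```

Because the view is nullable, a parameter like `function_view<unsigned
(RandomIt)> task_size` could replace the `std::function *` + nullptr-default
pattern the review objects to.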