On 7/21/22 19:35, Pedro Alves wrote:
> On 2022-07-18 8:42 p.m., Tom de Vries via Gdb-patches wrote:
>> Hi,
>>
>> Ensuring a fair distribution over the worker threads and main thread in terms
>> of number of CUs might not be the most efficient way, given that CUs can vary
>> in size.
>>
>> Fix this by:
>> - adding a task_size_ptr parameter to parallel_for_each,
>>   defaulting to nullptr,
>> - using per_cu->get_length () as the task size in the parallel_for_each
>>   in dwarf2_build_psymtabs_hard, and
>> - using the task size in parallel_for_each to distribute similarly-sized
>>   tasks to the threads.
>>
>> I've used this experiment to verify the performance impact:
>> ...
>> $ for n in $(seq 1 10); do \
>>     time gdb -q -batch ~/firefox/libxul.so-93.0-1.1.x86_64.debug \
>>       2>&1 \
>>       | grep "real:"; \
>>   done
>> ...
>> and without the patch got:
>> ...
>> real: 4.71
>> real: 4.88
>> real: 4.29
>> real: 4.30
>> real: 4.65
>> real: 4.27
>> real: 4.27
>> real: 4.27
>> real: 4.75
>> real: 4.41
>> ...
>> and with the patch:
>> ...
>> real: 3.68
>> real: 3.81
>> real: 3.80
>> real: 3.68
>> real: 3.75
>> real: 3.69
>> real: 3.69
>> real: 3.74
>> real: 3.67
>> real: 3.74
>> ...
>> so that seems a reasonable improvement.
>
> Nice.

:)

>> With parallel_for_each_debug set to true, we get some more details about the
>> difference in behaviour.  Without the patch we have:
>> ...
>> Parallel for: n_elements: 2818
>> Parallel for: minimum elements per thread: 1
>> Parallel for: elts_per_thread: 704
>> Parallel for: elements on worker thread 0: 705
>> Parallel for: elements on worker thread 1: 705
>> Parallel for: elements on worker thread 2: 704
>> Parallel for: elements on worker thread 3: 0
>> Parallel for: elements on main thread: 704
>> ...
>> and with the patch:
>> ...
>> Parallel for: n_elements: 2818
>> Parallel for: total_size: 1483674865
>> Parallel for: size_per_thread: 370918716
>> Parallel for: elements on worker thread 0: 752 (size: 371811790)
>> Parallel for: elements on worker thread 1: 360 (size: 371509370)
>> Parallel for: elements on worker thread 2: 1130 (size: 372681710)
>> Parallel for: elements on worker thread 3: 0 (size: 0)
>> Parallel for: elements on main thread: 576 (size: 367671995)
>> ...
>>
>> Tested on x86_64-linux.
>>
>> Any comments?
>
> I'll defer to Tromey.  Just a passerby comment, and a comment on the code
> itself if we follow this approach, below.
>
> My passerby comment is that I wonder whether we should consider switching to
> a work-stealing implementation, so that threads that are done with their
> chunk would grab another chunk from the work pool.  I think this would spare
> us from having to worry about these distribution heuristics.
>

I also thought about a dynamic solution, but decided to try the simplest
solution first.  Anyway, with a dynamic solution you still might want to
decide how big a chunk is, for which you could still need this type of
heuristic.

>> @@ -7059,6 +7059,13 @@ dwarf2_build_psymtabs_hard (dwarf2_per_objfile *per_objfile)
>>
>>    using iter_type = decltype (per_bfd->all_comp_units.begin ());
>>
>> +  std::function<unsigned int (iter_type)> task_size
>> +    = [=] (iter_type iter)
>> +      {
>> +        dwarf2_per_cu_data *per_cu = iter->get ();
>> +        return per_cu->length ();
>> +      };
>> +
>>    /* Each thread returns a pair holding a cooked index, and a vector
>>       of errors that should be printed.  The latter is done because
>>       GDB's I/O system is not thread-safe.  run_on_main_thread could be
>> @@ -7087,7 +7094,7 @@ dwarf2_build_psymtabs_hard (dwarf2_per_objfile *per_objfile)
>>            }
>>          }
>>        return result_type (thread_storage.release (), std::move (errors));
>> -    });
>> +    }, &task_size);
>>
>>    /* Only show a given exception a single time.  */
>>    std::unordered_set<gdb_exception> seen_exceptions;
>> diff --git a/gdbsupport/parallel-for.h b/gdbsupport/parallel-for.h
>> index bf40f125f0f..3c9269574df 100644
>> --- a/gdbsupport/parallel-for.h
>> +++ b/gdbsupport/parallel-for.h
>> @@ -134,7 +134,9 @@ typename gdb::detail::par_for_accumulator<
>>      typename std::result_of<RangeFunction (RandomIt, RandomIt)>::type
>>    >::result_type
>>  parallel_for_each (unsigned n, RandomIt first, RandomIt last,
>> -                   RangeFunction callback)
>> +                   RangeFunction callback,
>> +                   std::function<unsigned int (RandomIt)> *task_size_ptr
>> +                     = (std::function<unsigned int (RandomIt)> *) nullptr)
>
> That use of a std::function pointer looks odd.  AFAICT, TASK_SIZE_PTR is only
> ever called as a callback by parallel_for_each, for setup, in the calling
> thread, and isn't stored anywhere, right?  If so, gdb::function_view instead
> should work, is lightweight, and is nullable, meaning you don't need a
> pointer.
>
> And then, at the caller, just using a lambda instead of a std::function
> should work too:
>
>   auto task_size = [=] (iter_type iter)
>     {
>       dwarf2_per_cu_data *per_cu = iter->get ();
>       return per_cu->length ();
>     };

I've tried that (attached) but ran into the usual template error mess, not
sure how I could solve that yet.

Thanks,
- Tom
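
[Editor's note: the distribution heuristic the patch describes (close a chunk
once its accumulated size reaches total_size / n_threads, as shown in the
"size_per_thread" debug output above) can be sketched roughly as below.  This
is an illustrative standalone sketch; `chunk_counts` is a hypothetical name,
not gdb code.]

```cpp
#include <cstddef>
#include <vector>

/* Split a sequence of per-element task sizes into contiguous per-thread
   chunks: keep adding elements to the current chunk until its accumulated
   size reaches total / n_threads, then start the next chunk.  Returns the
   element count assigned to each thread.  */
static std::vector<size_t>
chunk_counts (const std::vector<size_t> &sizes, size_t n_threads)
{
  size_t total = 0;
  for (size_t s : sizes)
    total += s;
  size_t size_per_thread = total / n_threads;

  std::vector<size_t> counts;
  size_t count = 0, accumulated = 0;
  for (size_t s : sizes)
    {
      count++;
      accumulated += s;
      /* Close this chunk once it is "full", keeping at least one chunk
	 in reserve for the remaining elements.  */
      if (accumulated >= size_per_thread && counts.size () + 1 < n_threads)
	{
	  counts.push_back (count);
	  count = 0;
	  accumulated = 0;
	}
    }
  counts.push_back (count);  /* Remainder goes to the last thread.  */
  return counts;
}
```

This matches the behaviour visible in the debug output: threads receive very
different element counts (752, 360, 1130, ...) but nearly equal total sizes.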
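
[Editor's note: the work-stealing alternative Pedro floats could, in its
simplest form, be a shared work pool that finished threads pull from.  The
sketch below uses a shared atomic cursor handing out fixed-size batches,
rather than per-thread deques with stealing; all names are illustrative, and
this is not gdb's parallel_for_each.  As Tom notes, the batch size is itself a
heuristic.]

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

/* Run FUNC over [FIRST, LAST) on N_THREADS threads.  Each thread repeatedly
   claims the next BATCH elements via an atomic cursor until the range is
   exhausted, so faster threads naturally process more elements.  */
template<typename RandomIt, typename Func>
void
dynamic_for_each (RandomIt first, RandomIt last, Func func,
		  size_t batch = 8, size_t n_threads = 4)
{
  std::atomic<size_t> next (0);
  size_t n = last - first;

  auto worker = [&] ()
    {
      for (;;)
	{
	  /* Claim the next batch; fetch_add returns the old cursor.  */
	  size_t begin = next.fetch_add (batch);
	  if (begin >= n)
	    break;
	  size_t end = std::min (begin + batch, n);
	  func (first + begin, first + end);
	}
    };

  std::vector<std::thread> threads;
  for (size_t i = 0; i < n_threads; ++i)
    threads.emplace_back (worker);
  for (auto &t : threads)
    t.join ();
}
```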
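
[Editor's note: for readers unfamiliar with gdb::function_view, the sketch
below shows the general idea of the non-owning view type Pedro suggests: cheap
to pass, nullable like a pointer, and bindable to any callable without
allocation.  This is a simplified stand-in, not gdb's actual implementation in
gdbsupport/function-view.h.]

```cpp
#include <type_traits>
#include <utility>

template<typename Signature> class function_view;

/* Non-owning reference to a callable with the given signature.  Stores a
   type-erased pointer to the callable plus a trampoline that casts it back
   and invokes it.  The referenced callable must outlive the view.  */
template<typename Res, typename... Args>
class function_view<Res (Args...)>
{
public:
  function_view () = default;  /* Nullable, like a pointer.  */

  template<typename Callable>
  function_view (Callable &&callable)
    : m_data ((void *) &callable),
      m_invoke ([] (void *data, Args... args) -> Res
		{
		  return (*(std::remove_reference_t<Callable> *) data)
		    (std::forward<Args> (args)...);
		})
  {}

  explicit operator bool () const { return m_invoke != nullptr; }

  Res operator() (Args... args) const
  { return m_invoke (m_data, std::forward<Args> (args)...); }

private:
  void *m_data = nullptr;
  Res (*m_invoke) (void *, Args...) = nullptr;
};
```

Because the view is nullable, a parameter like `function_view<unsigned
(RandomIt)> task_size` could replace the `std::function *` + nullptr-default
pattern the review objects to.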