From: Lancelot SIX <lancelot.six@amd.com>
To: Pedro Alves <pedro@palves.net>
Cc: <gdb-patches@sourceware.org>
Subject: Re: [PATCH 6/8] Fix handling of vanishing threads that were stepping/stopping
Date: Fri, 15 Dec 2023 10:51:30 +0000 [thread overview]
Message-ID: <20231215105130.kjig2ad2q6tctkkj@khazad-dum> (raw)
In-Reply-To: <20231214202238.1065676-7-pedro@palves.net>
Hi Pedro,
I have a minor comment below.
On Thu, Dec 14, 2023 at 08:22:36PM +0000, Pedro Alves wrote:
> Downstream, AMD is carrying a testcase
> (gdb.rocm/continue-over-kernel-exit.exp) that exposes a couple issues
> with the amd-dbgapi target's handling of exited threads. The test
> can't be added upstream yet, unfortunately, due to dependency on DWARF
> extensions that can't be upstreamed yet. However, it can be found on
> the mailing list on the same series as this patch.
>
> The test spawns a kernel with a number of waves. The waves do nothing
> but exit. There is a breakpoint on the s_endpgm instruction. Once
> that breakpoint is hit, the test issues a "continue" command. We
> should see one breakpoint hit per wave, and then the whole program
> exiting. We do see that, however we also see this:
>
> [New AMDGPU Wave ?:?:?:1 (?,?,?)/?]
> [AMDGPU Wave ?:?:?:1 (?,?,?)/? exited]
> *repeat for other waves*
> ...
> [Thread 0x7ffff626f640 (LWP 3048491) exited]
> [Thread 0x7fffeb7ff640 (LWP 3048488) exited]
> [Inferior 1 (process 3048475) exited normally]
>
> That "New AMDGPU Wave" output comes from infrun.c itself adding the
> thread to the GDB thread list, because it got an event for a thread
> not on the thread list yet. The output shows "?"s instead of proper
> coordinates, because the event was a TARGET_WAITKIND_THREAD_EXITED,
> i.e., the wave was already gone when infrun.c added the thread to the
> thread list.
>
> That shouldn't ever happen for the amd-dbgapi target, threads should
> only ever be added by the backend.
>
> Note "New AMDGPU Wave ?:?:?:1" is for wave 1. What happened was that
> wave 1 terminated previously, and a previous call to
> amd_dbgapi_target::update_thread_list() noticed the wave had vanished
> and removed it from the GDB thread list. However, because the wave
> was stepping when it terminated (due to the displaced step over the
> s_endpgm) instruction, it is guaranteed that the amd-dbgapi library
> queues a WAVE_COMMAND_TERMINATED event for the exit.
>
> When we process that WAVE_COMMAND_TERMINATED event, in
> amd-dbgapi-target.c:process_one_event, we return it to the core as a
> TARGET_WAITKIND_THREAD_EXITED event:
>
> static void
> process_one_event (amd_dbgapi_event_id_t event_id,
> amd_dbgapi_event_kind_t event_kind)
> {
> ...
> if (status == AMD_DBGAPI_STATUS_ERROR_INVALID_WAVE_ID
> && event_kind == AMD_DBGAPI_EVENT_KIND_WAVE_COMMAND_TERMINATED)
> ws.set_thread_exited (0);
> ...
> }
>
> Recall the wave is already gone from the GDB thread list. So when GDB
> sees that TARGET_WAITKIND_THREAD_EXITED event for a thread it doesn't
> know about, it adds the thread to the thread list, resulting in that:
>
> [New AMDGPU Wave ?:?:?:1 (?,?,?)/?]
>
> and then, because it was a TARGET_WAITKIND_THREAD_EXITED event, GDB
> marks the thread exited right afterwards:
>
> [AMDGPU Wave ?:?:?:1 (?,?,?)/? exited]
>
> The fix is to make amd_dbgapi_target::update_thread_list() _not_
> delete vanishing waves iff they were stepping or in progress of being
> stopped. These two cases are the ones dbgapi guarantees will result
> in a WAVE_COMMAND_TERMINATED event if the wave terminates:
>
> /**
> * A command for a wave was not able to complete because the wave has
> * terminated.
> *
> * Commands that can result in this event are ::amd_dbgapi_wave_stop and
> * ::amd_dbgapi_wave_resume in single step mode. Since the wave terminated
> * before stopping, this event will be reported instead of
> * ::AMD_DBGAPI_EVENT_KIND_WAVE_STOP.
> *
> * The wave that terminated is available by the ::AMD_DBGAPI_EVENT_INFO_WAVE
> * query. However, the wave will be invalid since it has already terminated.
> * It is the client's responsibility to know what command was being performed
> * and was unable to complete due to the wave terminating.
> */
> AMD_DBGAPI_EVENT_KIND_WAVE_COMMAND_TERMINATED = 2,
>
> As the comment says, it's GDB's responsability to know whether the
> wave was stepping or being stopped. Since we now have a wave_info map
> with one entry for each wave, that seems like the place to store that
> information. However, I still decided to put all the coordinate
> information in its own structure. I.e., basically renamed the
> existing wave_info to wave_coordinates, and then added a new wave_info
> structure that holds the new state, plus a wave_coordinates object.
> This seemer cleaner as there are places where we only need to
s/seemer/seemed/ maybe?
> instantiate a wave_coordinates object.
>
> There's an extra twist. The testcase also exercises stopping at a new
> kernel right after the first kernel fully exits. In that scenario, we
> were hitting this assertion after the first kernel fully exits and the
> hit of the breakpoint at the second kernel is handled:
>
> [amd-dbgapi] process_event_queue: Pulled event from dbgapi: event_id.handle = 26, event_kind = WAVE_STOP
> [amd-dbgapi-lib] suspending queue_3, queue_2, queue_1 (refresh wave list)
> ../../src/gdb/amd-dbgapi-target.c:1625: internal-error: amd_dbgapi_thread_deleted: Assertion `it != info->wave_info_map.end ()' failed.
> A problem internal to GDB has been detected,
> further debugging may prove unreliable.
>
> This is the exact same problem as above, just a different
> manifestation. In this scenario, we end up in update_thread_list
> successfully deleting the exited thread (because it was no longer the
> current thread) that was incorrectly added by infrun.c. Because it
> was added by infrun.c and not by amd-dbgapi-target.c:add_gpu_thread,
> it doesn't have an entry in the wave_info map, so
> amd_dbgapi_thread_deleted trips on this assertion:
>
> gdb_assert (it != info->wave_info_map.end ());
>
> here:
>
> ...
> -> stop_all_threads
> -> update_thread_list
> -> target_update_thread_list
> -> amd_dbgapi_target::update_thread_list
> -> thread_db_target::update_thread_list
> -> linux_nat_target::update_thread_list
> -> delete_exited_threads
> -> delete_thread
> -> delete_thread_1
> -> gdb::observers::observable<thread_info*>::notify
> -> amd_dbgapi_thread_deleted
> -> internal_error_loc
>
> The testcase thus tries both running to exit after the first kernel
> exits, and running to a breakpoint in a second kernel after the first
> kernel exits.
>
> Change-Id: I43a66f060c35aad1fe0d9ff022ce2afd0537f028
> ---
> gdb/amd-dbgapi-target.c | 197 ++++++++++++++++++++++++++++++----------
> 1 file changed, 149 insertions(+), 48 deletions(-)
>
> diff --git a/gdb/amd-dbgapi-target.c b/gdb/amd-dbgapi-target.c
> index 86102b7fb03..06f9e8c5f9c 100644
> --- a/gdb/amd-dbgapi-target.c
> +++ b/gdb/amd-dbgapi-target.c
> @@ -109,10 +109,9 @@ get_amd_dbgapi_target_inferior_created_observer_token ()
> return amd_dbgapi_target_inferior_created_observer_token;
> }
>
> -/* A type holding coordinate, etc. info for a given wave. We cache
> - this because we need this information after a wave exits. */
> +/* A type holding coordinates, etc. info for a given wave. */
>
> -struct wave_info
> +struct wave_coordinates
> {
> /* The wave. Set by the ctor. */
> amd_dbgapi_wave_id_t wave_id;
> @@ -125,11 +124,44 @@ struct wave_info
> uint32_t group_ids[3] {UINT32_MAX, UINT32_MAX, UINT32_MAX};
> uint32_t wave_in_group = UINT32_MAX;
>
> - explicit wave_info (amd_dbgapi_wave_id_t wave_id);
> + explicit wave_coordinates (amd_dbgapi_wave_id_t wave_id)
> + : wave_id (wave_id)
> + {}
>
> - /* Return the target ID string for the wave this wave_info is
> + /* Return the target ID string for the wave this wave_coordinates is
> for. */
> std::string to_string () const;
> +
> + /* Pull out coordinates info from the amd-dbgapi library. */
> + void fetch ();
> +};
> +
> +/* A type holding info about a given wave. */
> +
> +struct wave_info
> +{
> + /* We cache the coordinates info because we need it after a wave
> + exits. The wave's ID is here. */
> + wave_coordinates coords;
> +
> + /* The last resume_mode passed to amd_dbgapi_wave_resume for this
> + wave. We track this because we are guaranteed to see a
> + WAVE_COMMAND_TERMINATED event if a stepping wave terminates, and
> + we need to know to not delete such a wave until we process that
> + event. */
> + amd_dbgapi_resume_mode_t last_resume_mode = AMD_DBGAPI_RESUME_MODE_NORMAL;
> +
> + /* Whether we've called amd_dbgapi_wave_stop for this wave and are
> + waiting for its stop event. Similarly, we track this because
> + we're guaranteed to get a WAVE_COMMAND_TERMINATED event if the
> + wave terminates while being stopped. */
> + bool stopping = false;
> +
> + explicit wave_info (amd_dbgapi_wave_id_t wave_id)
> + : coords (wave_id)
> + {
> + coords.fetch ();
> + }
> };
>
> /* Big enough to hold the size of the largest register in bytes. */
> @@ -275,6 +307,19 @@ static struct amd_dbgapi_target the_amd_dbgapi_target;
> static const registry<inferior>::key<amd_dbgapi_inferior_info>
> amd_dbgapi_inferior_data;
>
> +/* Fetch the amd_dbgapi_inferior_info data for the given inferior. */
> +
> +static struct amd_dbgapi_inferior_info *
> +get_amd_dbgapi_inferior_info (struct inferior *inferior)
> +{
> + amd_dbgapi_inferior_info *info = amd_dbgapi_inferior_data.get (inferior);
> +
> + if (info == nullptr)
> + info = amd_dbgapi_inferior_data.emplace (inferior, inferior);
> +
> + return info;
> +}
> +
> /* The async event handler registered with the event loop, indicating that we
> might have events to report to the core and that we'd like our wait method
> to be called.
> @@ -289,7 +334,7 @@ static const registry<inferior>::key<amd_dbgapi_inferior_info>
> static async_event_handler *amd_dbgapi_async_event_handler = nullptr;
>
> std::string
> -wave_info::to_string () const
> +wave_coordinates::to_string () const
> {
> std::string str = "AMDGPU Wave";
>
> @@ -319,37 +364,41 @@ wave_info::to_string () const
> return str;
> }
>
> -wave_info::wave_info (amd_dbgapi_wave_id_t wave_id)
> - : wave_id (wave_id)
> -{
> -}
> -
> -/* Read in wave_info for WAVE_ID. */
> -
> -static wave_info
> -get_wave_info (amd_dbgapi_wave_id_t wave_id)
> +void
> +wave_coordinates::fetch ()
> {
> - wave_info res (wave_id);
> -
> /* Any field that fails to be read is left with its in-class
> initialized value, which is printed as "?". */
>
> amd_dbgapi_wave_get_info (wave_id, AMD_DBGAPI_WAVE_INFO_AGENT,
> - sizeof (res.agent_id), &res.agent_id);
> + sizeof (agent_id), &agent_id);
> amd_dbgapi_wave_get_info (wave_id, AMD_DBGAPI_WAVE_INFO_QUEUE,
> - sizeof (res.queue_id), &res.queue_id);
> + sizeof (queue_id), &queue_id);
> amd_dbgapi_wave_get_info (wave_id, AMD_DBGAPI_WAVE_INFO_DISPATCH,
> - sizeof (res.dispatch_id), &res.dispatch_id);
> + sizeof (dispatch_id), &dispatch_id);
>
> amd_dbgapi_wave_get_info (wave_id,
> AMD_DBGAPI_WAVE_INFO_WORKGROUP_COORD,
> - sizeof (res.group_ids), &res.group_ids);
> + sizeof (group_ids), &group_ids);
>
> amd_dbgapi_wave_get_info (wave_id,
> AMD_DBGAPI_WAVE_INFO_WAVE_NUMBER_IN_WORKGROUP,
> - sizeof (res.wave_in_group), &res.wave_in_group);
> + sizeof (wave_in_group), &wave_in_group);
> +}
> +
> +/* Get the wave_info object for TP, from the wave_info map. It is
> + assumed that the wave is in the map. */
> +
> +static wave_info &
> +get_thread_wave_info (thread_info *tp)
> +{
> + amd_dbgapi_inferior_info *info = get_amd_dbgapi_inferior_info (tp->inf);
> + amd_dbgapi_wave_id_t wave_id = get_amd_dbgapi_wave_id (tp->ptid);
> +
> + auto it = info->wave_info_map.find (wave_id.handle);
> + gdb_assert (it != info->wave_info_map.end ());
>
> - return res;
> + return it->second;
> }
>
> /* Clear our async event handler. */
> @@ -370,19 +419,6 @@ async_event_handler_mark ()
> mark_async_event_handler (amd_dbgapi_async_event_handler);
> }
>
> -/* Fetch the amd_dbgapi_inferior_info data for the given inferior. */
> -
> -static struct amd_dbgapi_inferior_info *
> -get_amd_dbgapi_inferior_info (struct inferior *inferior)
> -{
> - amd_dbgapi_inferior_info *info = amd_dbgapi_inferior_data.get (inferior);
> -
> - if (info == nullptr)
> - info = amd_dbgapi_inferior_data.emplace (inferior, inferior);
> -
> - return info;
> -}
> -
> /* Set forward progress requirement to REQUIRE for all processes of PROC_TARGET
> matching PTID. */
>
> @@ -565,12 +601,12 @@ amd_dbgapi_target::pid_to_str (ptid_t ptid)
>
> auto it = info->wave_info_map.find (wave_id.handle);
> if (it != info->wave_info_map.end ())
> - return it->second.to_string ();
> + return it->second.coords.to_string ();
>
> /* A wave we don't know about. Shouldn't usually happen, but
> asserting and bringing down the session is a bit too harsh. Just
> print all unknown info as "?"s. */
> - return wave_info (wave_id).to_string ();
> + return wave_coordinates (wave_id).to_string ();
> }
>
> const char *
> @@ -694,16 +730,24 @@ amd_dbgapi_target::resume (ptid_t scope_ptid, int step, enum gdb_signal signo)
>
> amd_dbgapi_wave_id_t wave_id = get_amd_dbgapi_wave_id (thread->ptid);
> amd_dbgapi_status_t status;
> +
> + wave_info &wi = get_thread_wave_info (thread);
> + amd_dbgapi_resume_mode_t &resume_mode = wi.last_resume_mode;
> + amd_dbgapi_exceptions_t wave_exception;
> if (thread->ptid == inferior_ptid)
> - status = amd_dbgapi_wave_resume (wave_id,
> - (step
> - ? AMD_DBGAPI_RESUME_MODE_SINGLE_STEP
> - : AMD_DBGAPI_RESUME_MODE_NORMAL),
> - exception);
> + {
> + resume_mode = (step
> + ? AMD_DBGAPI_RESUME_MODE_SINGLE_STEP
> + : AMD_DBGAPI_RESUME_MODE_NORMAL);
> + wave_exception = exception;
> + }
> else
> - status = amd_dbgapi_wave_resume (wave_id, AMD_DBGAPI_RESUME_MODE_NORMAL,
> - AMD_DBGAPI_EXCEPTION_NONE);
> + {
> + resume_mode = AMD_DBGAPI_RESUME_MODE_NORMAL;
> + wave_exception = AMD_DBGAPI_EXCEPTION_NONE;
> + }
>
> + status = amd_dbgapi_wave_resume (wave_id, resume_mode, wave_exception);
> if (status != AMD_DBGAPI_STATUS_SUCCESS
> /* Ignore the error that wave is no longer valid as that could
> indicate that the process has exited. GDB treats resuming a
> @@ -711,6 +755,8 @@ amd_dbgapi_target::resume (ptid_t scope_ptid, int step, enum gdb_signal signo)
> && status != AMD_DBGAPI_STATUS_ERROR_INVALID_WAVE_ID)
> error (_("wave_resume for wave_%ld failed (%s)"), wave_id.handle,
> get_status_string (status));
> +
> + wi.stopping = false;
> }
> }
>
> @@ -725,6 +771,15 @@ amd_dbgapi_target::commit_resumed ()
> require_forward_progress (minus_one_ptid, proc_target, true);
> }
>
> +/* Return a string version of RESUME_MODE, for debug log purposes. */
> +static const char *
> +resume_mode_to_string (amd_dbgapi_resume_mode_t resume_mode)
> +{
> + return (resume_mode == AMD_DBGAPI_RESUME_MODE_SINGLE_STEP
> + ? "step"
> + : "normal");
Pedantically, amd_dbgapi_resume_mode_t is an enum and could in theory
get more enumerators. I don't expect any, but the following construct
would make the compiler complain if that ever happens:
static const char *
resume_mode_to_string (amd_dbgapi_resume_mode_t resume_mode)
{
switch (resume_mode)
{
case AMD_DBGAPI_RESUME_MODE_NORMAL:
return "normal";
case AMD_DBGAPI_RESUME_MODE_SINGLE_STEP:
return "step";
}
gdb_assert_not_reached ("invalid amd_dbgapi_resume_mode_t");
}
> +}
> +
> void
> amd_dbgapi_target::stop (ptid_t ptid)
> {
> @@ -758,7 +813,11 @@ amd_dbgapi_target::stop (ptid_t ptid)
>
> status = amd_dbgapi_wave_stop (wave_id);
> if (status == AMD_DBGAPI_STATUS_SUCCESS)
> - return;
> + {
> + wave_info &wi = get_thread_wave_info (thread);
> + wi.stopping = true;
> + return;
> + }
>
> if (status != AMD_DBGAPI_STATUS_ERROR_INVALID_WAVE_ID)
> error (_("wave_stop for wave_%ld failed (%s)"), wave_id.handle,
> @@ -772,6 +831,23 @@ amd_dbgapi_target::stop (ptid_t ptid)
> could have terminated since the last time the wave list was
> refreshed. */
>
> + wave_info &wi = get_thread_wave_info (thread);
> + wi.stopping = true;
> +
> + amd_dbgapi_debug_printf ("got AMD_DBGAPI_STATUS_ERROR_INVALID_WAVE_ID "
> + "for wave_%ld, last_resume_mode=%s, "
> + "report_thread_events=%d",
> + wave_id.handle,
> + resume_mode_to_string (wi.last_resume_mode),
> + m_report_thread_events);
> +
> + /* If the wave was stepping when it terminated, then it is
> + guaranteed that we will see a WAVE_COMMAND_TERMINATED event
> + for it. Don't report a thread exit event or delete the
> + thread yet, until we see such event. */
> + if (wi.last_resume_mode == AMD_DBGAPI_RESUME_MODE_SINGLE_STEP)
> + return;
> +
> if (m_report_thread_events)
> {
> get_amd_dbgapi_inferior_info (thread->inf)->wave_events.emplace_back
> @@ -1018,7 +1094,7 @@ add_gpu_thread (inferior *inf, ptid_t wave_ptid)
> auto wave_id = get_amd_dbgapi_wave_id (wave_ptid);
>
> if (!info->wave_info_map.try_emplace (wave_id.handle,
> - get_wave_info (wave_id)).second)
> + wave_info (wave_id)).second)
> internal_error ("wave ID %ld already in map", wave_id.handle);
>
> /* Create new GPU threads silently to avoid spamming the terminal
> @@ -1770,7 +1846,32 @@ amd_dbgapi_target::update_thread_list ()
> auto it = threads.find (tp->ptid.tid ());
>
> if (it == threads.end ())
> - delete_thread_silent (tp);
> + {
> + auto wave_id = get_amd_dbgapi_wave_id (tp->ptid);
> + wave_info &wi = get_thread_wave_info (tp);
> +
> + /* Waves that were stepping or in progress of being
> + stopped are guaranteed to report a
> + WAVE_COMMAND_TERMINATED event if they terminate.
> + Don't delete such threads until we see the
> + event. */
> + if (wi.last_resume_mode == AMD_DBGAPI_RESUME_MODE_SINGLE_STEP
> + || wi.stopping)
> + {
> + amd_dbgapi_debug_printf
> + ("wave_%ld disappeared, keeping it"
> + " (last_resume_mode=%s, stopping=%d)",
> + wave_id.handle,
> + resume_mode_to_string (wi.last_resume_mode),
> + wi.stopping);
> + }
> + else
> + {
> + amd_dbgapi_debug_printf ("wave_%ld disappeared, deleting it",
> + wave_id.handle);
> + delete_thread_silent (tp);
> + }
> + }
> else
> threads.erase (it);
> }
>
> --
> 2.43.0
>
I happy with the patch as it is if you prefer to not change
resume_mode_to_string. Either way
Approved-By: Lancelot Six <laneclot.six@amd.com> (amdgpu)
Best,
Lancelot.
next prev parent reply other threads:[~2023-12-15 10:52 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-12-14 20:22 [PATCH 0/8] Step over thread exit improvements/fixes + AMD GPU Pedro Alves
2023-12-14 20:22 ` [PATCH 1/8] gdb.threads/step-over-thread-exit.exp improvements Pedro Alves
2023-12-14 20:22 ` [PATCH 2/8] Ensure selected thread after thread exit stop Pedro Alves
2023-12-14 20:22 ` [PATCH 3/8] displaced_step_finish: Don't fetch the regcache of exited threads Pedro Alves
2023-12-14 20:22 ` [PATCH 4/8] Step over thread exit, always delete the thread non-silently Pedro Alves
2023-12-14 20:22 ` [PATCH 5/8] Fix thread target ID of exited waves Pedro Alves
2023-12-15 10:51 ` Lancelot SIX
2023-12-14 20:22 ` [PATCH 6/8] Fix handling of vanishing threads that were stepping/stopping Pedro Alves
2023-12-15 10:51 ` Lancelot SIX [this message]
2023-12-14 20:22 ` [PATCH 7/8] Add tests for s_endpgm handling Pedro Alves
2023-12-14 20:22 ` [PATCH 8/8] Add tests for handling of vanishing threads that were stepping/stopping Pedro Alves
2023-12-20 21:24 ` [PATCH 0/8] Step over thread exit improvements/fixes + AMD GPU Pedro Alves
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20231215105130.kjig2ad2q6tctkkj@khazad-dum \
--to=lancelot.six@amd.com \
--cc=gdb-patches@sourceware.org \
--cc=pedro@palves.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox