From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH v3 4/5] gdb: generalize commit_resume, avoid commit-resuming when threads have pending statuses
To: gdb-patches@sourceware.org
References: <20210108041734.3873826-1-simon.marchi@polymtl.ca> <20210108041734.3873826-5-simon.marchi@polymtl.ca>
Date: Tue, 12 Jan 2021 12:14:06 -0500
In-Reply-To: <20210108041734.3873826-5-simon.marchi@polymtl.ca>
From: Simon Marchi via Gdb-patches
Reply-To: Simon Marchi

On 2021-01-07 11:17 p.m., Simon Marchi via Gdb-patches wrote:
> From: Simon Marchi
>
> The rationale for this patch comes from the ROCm port [1], the goal
> being to reduce the number of back and forths between GDB and the target
> when doing successive operations.  I'll start with explaining the
> rationale and then go over the implementation.  In the ROCm / GPU world,
> the term "wave" is somewhat equivalent to a "thread" in GDB.  So if you
> read it from a GPU standpoint, just s/thread/wave/.
>
> ROCdbgapi, the library used by GDB [2] to communicate with the GPU
> target, gives the illusion that it's possible for the debugger to
> control (start and stop) individual threads.  But in reality, this is
> not how it works.  Under the hood, all threads of a queue are controlled
> as a group.  To stop one thread in a group of running ones, the state of
> all threads is retrieved from the GPU, all threads are destroyed, and all
> threads but the one we want to stop are re-created from the saved state.
> The net result, from the point of view of GDB, is that the library
> stopped one thread.  The same thing goes if we want to resume one thread
> while others are running: the state of all running threads is retrieved
> from the GPU, they are all destroyed, and they are all re-created,
> including the thread we want to resume.
>
> This leads to some inefficiencies when combined with how GDB works;
> here are two examples:
>
> - Stopping all threads: because the target operates in non-stop mode,
>   when the user interface mode is all-stop, GDB must stop all threads
>   individually when presenting a stop.  Let's suppose we have 1000
>   threads and the user does ^C.  GDB asks the target to stop one
>   thread.  Behind the scenes, the library retrieves 1000 thread states
>   and restores the 999 other still-running ones.  GDB asks the target
>   to stop another one.  The target retrieves 999 thread states and
>   restores the 998 remaining ones.  That means that to stop 1000
>   threads, we did 1000 back and forths with the GPU.  It would have
>   been much better to just retrieve the states once and stop there.
>
> - Resuming with pending events: suppose the 1000 threads hit a
>   breakpoint at the same time.  The breakpoint is conditional and
>   evaluates to true for the first thread, to false for all others.  GDB
>   pulls one event (for the first thread) from the target, decides that
>   it should present a stop, so stops all threads using
>   stop_all_threads.
  All these other threads have a breakpoint event to
>   report, which is saved in `thread_info::suspend::waitstatus` for
>   later.  When the user does "continue", GDB resumes that one thread
>   that did hit the breakpoint.  It then processes the pending events
>   one by one as if they just arrived.  It picks one, evaluates the
>   condition to false, and resumes the thread.  It picks another one,
>   evaluates the condition to false, and resumes the thread.  And so on.
>   In between each resumption, there is a full state retrieval and
>   re-creation.  It would be much nicer if we could wait a little bit
>   before sending those threads to the GPU, until GDB has processed all
>   those pending events.
>
> To address this kind of performance issue, ROCdbgapi has a concept
> called "forward progress required", which is a boolean state that allows
> its user (i.e. GDB) to say "I'm doing a bunch of operations, you can
> hold off putting the threads on the GPU until I'm done" (the "forward
> progress not required" state).  Turning forward progress back on
> indicates to the library that all threads that are supposed to be
> running should now really be running on the GPU.
>
> It turns out that GDB has a similar concept, though not as general:
> commit_resume.  One difference is that commit_resume is not stateful:
> the target can't look up "does the core need me to schedule resumed
> threads for execution right now".  It is also specifically linked to the
> resume method; it is not used in other contexts.  The target accumulates
> resumption requests through target_ops::resume calls, and then commits
> those resumptions when target_ops::commit_resume is called.  The target
> has no way to check whether it's ok to leave resumed threads stopped in
> other target methods.
>
> To bridge the gap, this patch generalizes the commit_resume concept in
> GDB to match the forward progress concept of ROCdbgapi.  The current
> name (commit_resume) can be interpreted as "commit the previous resume
> calls".
I renamed the concept to "commit_resumed", as in "commit the
> threads that are resumed".
>
> In the new version, we have two things in process_stratum_target:
>
> - the commit_resumed_state field: indicates whether GDB requires this
>   target to have resumed threads committed to the execution
>   target/device.  If false, the target is allowed to leave resumed
>   threads un-committed at the end of whatever method it is executing.
>
> - the commit_resumed method: called when commit_resumed_state
>   transitions from false to true.  While commit_resumed_state was
>   false, the target may have left some resumed threads un-committed.
>   This method being called tells it that it should commit them back to
>   the execution device.
>
> Let's take the "Stopping all threads" scenario from above and see how it
> would work with the ROCm target with this change.  Before stopping all
> threads, GDB would set the target's commit_resumed_state field to false.
> It would then ask the target to stop the first thread.  The target would
> retrieve all threads' state from the GPU and mark that one as stopped.
> Since commit_resumed_state is false, it leaves all the other threads
> (still resumed) stopped.  GDB would then proceed to call target_stop for
> all the other threads.  Since resumed threads are not committed, this
> doesn't do any back and forth with the GPU.
>
> To simplify the implementation of targets, I made it so that when
> calling certain target methods, the contract between the core and the
> targets guarantees that commit_resumed_state is false.  This way, the
> target doesn't need two paths, one for commit_resumed_state == true and
> one for commit_resumed_state == false.  It can just assert that
> commit_resumed_state is false and work with that assumption.  This also
> helps catch places where we forgot to disable commit_resumed_state
> before calling the method, which represents a probable optimization
> opportunity.
>
> To have some confidence that this contract between the core and the
> targets is respected, I added assertions in the linux-nat target
> methods, even though the linux-nat target doesn't actually use that
> feature.  Since linux-nat is tested much more than other targets, this
> will help catch these issues more quickly.
>
> To ensure that commit_resumed_state is always turned back on (only if
> necessary, see below) and the commit_resumed method is called when doing
> so, I introduced the scoped_disable_commit_resumed RAII object, which
> replaces make_scoped_defer_process_target_commit_resume.  On
> construction, it clears the commit_resumed_state flag of all process
> targets.  On destruction, it turns it back on (if necessary) and calls
> the commit_resumed method.  The nested case is handled by having a
> "nesting" counter: only when the counter goes back to 0 is
> commit_resumed_state turned back on.
>
> On destruction, commit-resumed is not re-enabled for a given target if:
>
> 1. this target has no threads resumed, or
> 2. this target has at least one thread with a pending status known to
>    the core (saved in thread_info::suspend::waitstatus).
>
> The first point is not technically necessary, because a proper
> commit_resumed implementation would be a no-op if the target has no
> resumed threads.  But since we have a flag to do a quick check, I think
> it doesn't hurt.
>
> The second point is more important: together with the
> scoped_disable_commit_resumed instance added in fetch_inferior_event, it
> makes it so that the "Resuming with pending events" scenario described
> above is handled efficiently.  Here's what happens in that case:
>
> 1. The user types "continue".
> 2. Upon destruction, the scoped_disable_commit_resumed in the `proceed`
>    function does not enable commit-resumed, as it sees other threads
>    have pending statuses.
> 3. fetch_inferior_event is called to handle another event, one thread
>    is resumed.
   Because there are still more threads with pending
>    statuses, the destructor of scoped_disable_commit_resumed in
>    fetch_inferior_event still doesn't enable commit-resumed.
> 4. Rinse and repeat step 3, until the last pending status is handled by
>    fetch_inferior_event.  In that case, scoped_disable_commit_resumed's
>    destructor sees there are no more threads with pending statuses, so
>    it asks the target to commit resumed threads.
>
> This allows us to avoid all unnecessary back and forths: there is a
> single commit_resumed call.
>
> This change required remote_target::remote_stop_ns to learn how to
> handle stopping threads that were resumed but pending vCont.  The
> simplest example where that happens is when using the remote target in
> all-stop, but with "maint set target-non-stop on", to force it to
> operate in non-stop mode under the hood.  If two threads hit a
> breakpoint at the same time, GDB will receive two stop replies.  It
> will present the stop for one thread and save the other one in
> thread_info::suspend::waitstatus.
>
> Before this patch, when doing "continue", GDB first resumes the thread
> without a pending status:
>
>   Sending packet: $vCont;c:p172651.172676#f3
>
> It then consumes the pending status in the next fetch_inferior_event
> call:
>
>   [infrun] do_target_wait_1: Using pending wait status status->kind = stopped, signal = GDB_SIGNAL_TRAP for Thread 1517137.1517137.
>   [infrun] target_wait (-1.0.0, status) =
>   [infrun]   1517137.1517137.0 [Thread 1517137.1517137],
>   [infrun]   status->kind = stopped, signal = GDB_SIGNAL_TRAP
>
> It then realizes it needs to stop all threads to present the stop, so
> stops the thread it just resumed:
>
>   [infrun] stop_all_threads: Thread 1517137.1517137 not executing
>   [infrun] stop_all_threads: Thread 1517137.1517174 executing, need stop
>   remote_stop called
>   Sending packet: $vCont;t:p172651.172676#04
>
> This is an unnecessary resume/stop.
With this patch, we don't commit
> resumed threads after proceeding, because of the pending status:
>
>   [infrun] maybe_commit_resumed_all_process_targets: not requesting commit-resumed for target extended-remote, a thread has a pending waitstatus
>
> When GDB handles the pending status and stop_all_threads runs, we stop
> a resumed but pending vCont thread:
>
>   remote_stop_ns: Enqueueing phony stop reply for thread pending vCont-resume (1520940, 1520976, 0)
>
> That thread was never actually resumed on the remote stub / gdbserver.
> This is why remote_stop_ns needed to learn this new trick of enqueueing
> phony stop replies.
>
> Note that this patch only considers pending statuses known to the core
> of GDB, that is, the events that were pulled out of the target and
> stored in `thread_info::suspend::waitstatus`.  In some cases, we could
> also avoid unnecessary back and forth when the target has events that
> it has not yet reported to the core.  I plan to implement this as a
> subsequent patch, once this series has settled.

I think this patch introduces some regressions.  When running

  $ while make check TESTS="gdb.threads/interrupt-while-step-over.exp" RUNTESTFLAGS="--target_board=native-extended-gdbserver"; do done

I'll sometimes get:

  /home/smarchi/src/binutils-gdb/gdb/inline-frame.c:383: internal-error: void skip_inline_frames(thread_info*, bpstat): Assertion `find_inline_frame_state (thread) == NULL' failed.
  A problem internal to GDB has been detected,
  further debugging may prove unreliable.
  Quit this debugging session? (y or n)

Simon