Subject: Re: [PATCHv2] gdb: Don't drop SIGSTOP during stop_all_threads
From: Pedro Alves
To: Andrew Burgess, gdb-patches@sourceware.org
Cc: Richard Bunt
Date: Fri, 15 Jun 2018 12:12:00 -0000
Message-ID: <2a06039a-26dc-bb85-af2d-a7a77a56a8d4@redhat.com>
In-Reply-To: <20180614194601.13322-1-andrew.burgess@embecosm.com>

On 06/14/2018 08:46 PM, Andrew Burgess wrote:
> This is an update based on Pedro's feedback:
> 
Thank you.

> So, in this version the preload library is a little more complex.  I
> now try to emulate the kernels sending of SIGCHLD after a short period
> of time in order to make it appear like the child thread status really
> has just arrived from the kernel later than it really did.  This works
> better than I had hoped, with this change in place not only does the
> test now pass on gdbserver, but with the preload library I can (in a
> very simple test) use GDB as normal.

Awesome.

I found a few more small nits, but this is OK with those fixed.

> +#
> +# GDB would then trigger a call to stop_all_threads which would
> +# continue to wait for all of the outstanding threads to stop, when
> +# the outstanding stop events finally arrived GDB would then
> +# (incorrectly) discard the stop event, resume the thread, and
> +# continue to wait for the thread to stop.... which it now never
> +# would.
> +#
> +# In order to try and expose this issue reliably, this test preloads a
> +# library that intercepts waitpid calls.  All waitpid calls targeting
> +# pid -1 with the WNOHANG flag are rate limited so that only 1 per
> +# second can complete.  Additional calls are forced to return 0
> +# indicating no event waiting.  This is enough to trigger the bug
> +# during the attach phase, however, it breaks the rest of GDB's
> +# behaviour, so we can't do more than attach with this library in
> +# place.

Does this last sentence ("breaks the rest of GDB") need updating?

> +   If the kernel is slow in either delivering the signal, or making the
> +   result available to the waitpid call then GDB will enter a sigsuspend
> +   call in order to wait for the inferior threads to change state, this is
> +   signalled to GDB with a SIGCHLD.
> +
> +   A bug in GDB meant that in some cases we would deadlock during this
> +   process.  This was rarely seen as the kernel is usually quick at
> +   delivering signals and making the results available to waitpid, so quick
> +   that GDB would gather the statuses from all inferior threads in the
> +   original pass.
> +
> +   The idea in this library is to rate limit calls to waitpid (where pid is
> +   -1 and the WNOHANG option is set) so that only 1 per second can return
> +   an answer.  Any additional calls will report that no threads are
> +   currently ready.  This should match the behaviour we see on a slow
> +   kernel.
> +
> +   However, given that usually when using this library, the kernel does
> +   have the waitpid result ready this means that the kernel will never send
> +   GDB a SIGCHLD.  This means that when GDB enters sigsuspend it will block
> +   forever.  Alternatively, if GDB enters it's polling loop the lack of

s/it's/its/

> +   SIGCHLD means that we will never see an event on the child threads.  To
> +   resolve these problems the library intercepts calls to sigsuspend and
> +   forces the call to exit if there is a pending waitpid result.  Also,
> +   when we know that there's a waitpid result that we've ignored, we create
> +   a new thread which, after a short delay, will send GDB a SIGCHLD.  */
> +
> +
> +/* Cache the result of a waitpid call that has not been reported back to
> +   GDB yet.  We only ever cache a single result.  Once we have a result
> +   cached then later calls to waitpid with the WNOHANG option will return a
> +   result of 0.  */
> +
> +struct
> +{
> +  /* Flag to indicate when we have a result cached.  */
> +  int cached_p;
> +
> +  /* The cached result fields from a waitpid call.  */
> +  pid_t pid;
> +  int wstatus;
> +} cached_wait_status;

"static" to avoid interposition issues.

> +
> +/* Lock to hold when modifying SIGNAL_THREAD_ACTIVE_P.  */
> +
> +pthread_mutex_t thread_creation_lock_obj = PTHREAD_MUTEX_INITIALIZER;

Ditto.
> +#define thread_creation_lock (&thread_creation_lock_obj)
> +
> +/* This flag is only modified while holding the THREAD_CREATION_LOCK mutex.
> +   When this flag is true then there is a signal thread alive that will be
> +   sending a SIGCHLD at some point in the future.  */
> +
> +int signal_thread_active_p;

Ditto.

> +
> +/* When we last allowed a waitpid to complete.  */
> +
> +static struct timeval last_waitpid_time = { 0, 0 };
> +
> +/* The number of seconds that must elapse between calls to waitpid where
> +   the pid is -1 and the WNOHANG option is set.  If calls occur faster than
> +   this then we force a result of 0 to be returned from waitpid.  */
> +
> +#define WAITPID_MIN_TIME (1)
> +
> +/* Return true (non-zero) if we should skip this call to waitpid, or false
> +   (zero) if this waitpid call should be handled with a call to the "real"
> +   waitpid library.  Allows 1 waitpid call per second.  */

s/waitpid library/waitpid function/ ?

> +
> +static int
> +should_skip_waitpid (void)
> +{
> +  struct timeval *tv = &last_waitpid_time;
> +  if (tv->tv_sec == 0)
> +    {
> +      if (gettimeofday (tv, NULL) < 0)
> +        error ("error: gettimeofday failed\n");
> +      return 0; /* Don't skip.  */
> +    }
> +  else
> +    {
> +      struct timeval new_tv;
> +
> +      if (gettimeofday (&new_tv, NULL) < 0)
> +        error ("error: gettimeofday failed\n");
> +
> +      if ((new_tv.tv_sec - tv->tv_sec) < WAITPID_MIN_TIME)
> +        return 1; /* Skip.  */
> +
> +      *tv = new_tv;
> +    }
> +
> +  /* Don't skip.  */
> +  return 0;
> +}
> +
> +/* Perform a real waitpid call.  */
> +
> +static pid_t
> +real_waitpid (pid_t pid, int *wstatus, int options)
> +{
> +  typedef pid_t (*fptr_t) (pid_t, int *, int);
> +  static fptr_t real_func = NULL;
> +
> +  if (real_func == NULL)
> +    {
> +      real_func = dlsym (RTLD_NEXT, "waitpid");
> +      if (real_func == NULL)
> +        error ("error: failed to find real waitpid\n");
> +    }
> +
> +  return (*real_func) (pid, wstatus, options);
> +}
> +
> +/* Thread worked created when we cached a waitpid result.  Delays for

s/worked/worker/
s/cached/cache/

> +   a short period of time and then sends SIGCHLD to the GDB process.  This
> +   should trigger GDB to call waitpid again, at which point we will make
> +   the cached waitpid result available.  */
> +
> +static void*
> +send_sigchld_thread (void *arg)
> +{
> +  /* Delay one second longer than WAITPID_MIN_TIME so that there can be no
> +     chance that a call to SHOULD_SKIP_WAITPID will return true one the

s/one/once/

Thanks,
Pedro Alves