From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 8259 invoked by alias); 9 Aug 2008 14:52:02 -0000
Received: (qmail 8247 invoked by uid 22791); 9 Aug 2008 14:52:01 -0000
X-Spam-Check-By: sourceware.org
Received: from mail.codesourcery.com (HELO mail.codesourcery.com) (65.74.133.4) by sourceware.org (qpsmtpd/0.31) with ESMTP; Sat, 09 Aug 2008 14:51:16 +0000
Received: (qmail 30571 invoked from network); 9 Aug 2008 14:51:14 -0000
Received: from unknown (HELO orlando.local) (pedro@127.0.0.2) by mail.codesourcery.com with ESMTPA; 9 Aug 2008 14:51:14 -0000
From: Pedro Alves
To: "John David Anglin"
Subject: Re: ttrace: Protocal error
Date: Sat, 09 Aug 2008 14:52:00 -0000
User-Agent: KMail/1.9.9
Cc: gdb-patches@sourceware.org
References: <20080808201457.3064B4EBE@hiauly1.hia.nrc.ca>
In-Reply-To: <20080808201457.3064B4EBE@hiauly1.hia.nrc.ca>
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
Message-Id: <200808091551.17013.pedro@codesourcery.com>
X-IsSubscribed: yes
Mailing-List: contact gdb-patches-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id:
List-Subscribe:
List-Archive:
List-Post:
List-Help: ,
Sender: gdb-patches-owner@sourceware.org
X-SW-Source: 2008-08/txt/msg00239.txt.bz2

On Friday 08 August 2008 21:14:56, John David Anglin wrote:
> > Hmmm, the thread seems to have exited but state_ is still 1.
>
> The patch below seems to fix the problem.  I was finally able
> to catch an abort in vla6.f90.
>
> Ok?
>
> 	* inf-ttrace.c (inf_ttrace_resume_callback): Don't resume dying
> 	thread.
>
> Index: inf-ttrace.c
> ===================================================================
> RCS file: /cvs/src/src/gdb/inf-ttrace.c,v
> retrieving revision 1.30
> diff -u -3 -p -r1.30 inf-ttrace.c
> --- inf-ttrace.c	9 Jul 2008 22:23:05 -0000	1.30
> +++ inf-ttrace.c	8 Aug 2008 19:57:13 -0000
> @@ -787,7 +804,9 @@ inf_ttrace_kill (void)
>  static int
>  inf_ttrace_resume_callback (struct thread_info *info, void *arg)
>  {
> -  if (!ptid_equal (info->ptid, inferior_ptid))
> +  if (!ptid_equal (info->ptid, inferior_ptid)
> +      && !((struct inf_ttrace_private_thread_info *)info->private)->dying
> +      && !is_exited (info->ptid))
>      {
>        pid_t pid = ptid_get_pid (info->ptid);
>        lwpid_t lwpid = ptid_get_lwp (info->ptid);

From your original backtrace, I'm still puzzled about how you got here.
When we tag the thread as dying, we return TARGET_WAITKIND_SPURIOUS,
which triggers a resume.  That first resume should already have removed
the thread from the thread list.  If you were really in a
TARGET_WAITKIND_SYSCALL_ENTRY, this would be at least the second resume
after the lwp exit.  Maybe I am reading the backtrace wrong, though.  If
you have the patience, showing what GDB outputs on a "run" after
"set debug infrun 1" would help.

Looking again at the TTEVT_LWP_EXIT event, which is what tags the thread
as dying: when we get it, the lwp is not dead yet:

"TTEVT_LWP_EXIT
This event flag indicates that the debugger wants to be notified when a
thread is exiting via the lwp_exit() system call.  The thread stops upon
entry to the system call."

That's why we resume it immediately here, in inf_ttrace_wait ():
      case TTEVT_LWP_EXIT:
	if (print_thread_events)
	  printf_unfiltered (_("[%s exited]\n"), target_pid_to_str (ptid));
	ti = find_thread_pid (ptid);
	gdb_assert (ti != NULL);
	((struct inf_ttrace_private_thread_info *)ti->private)->dying = 1;
	inf_ttrace_num_lwps--;
(1)	ttrace (TT_LWP_CONTINUE, ptid_get_pid (ptid),
		ptid_get_lwp (ptid), TT_NOPC, 0, 0);
	/* If we don't return -1 here, core GDB will re-add
	   the thread.  */
	ptid = minus_one_ptid;
	break;
...
  /* Make sure all threads within the process are stopped.  */
(2)  if (ttrace (TT_PROC_STOP, tts.tts_pid, 0, 0, 0, 0) == -1)
    perror_with_name (("ttrace"));

  return ptid;
}

It seems to me that, for some reason, in most cases the inferior was
slow enough that by the time we reach (2), the dying thread hadn't
exited yet.  The TT_PROC_STOP call stops all lwps of the process, the
dying one included, I would think.  In that case, you still need the
resume on the dying thread in inf_ttrace_wait.  Otherwise, you *may* get
this bug back, depending on how the OS wakes waiting processes:

http://www.cygwin.com/ml/gdb-patches/2007-09/msg00238.html

"So, if the dying thread is stopped, it should be resumed one last time.
Otherwise, any other thread waiting for its death in pthread_join would
be blocked forever (e.g., in the attachment, a simple program which
freezes when it is run under GDB)."

That would explain why you trip on this bug while Jerome didn't, I
guess.  Maybe your testcase triggers the race more reliably.

The way this is solved currently is also racy.  We resume all threads,
dying threads included, assume they will die shortly, and so immediately
delete dying threads after resuming:

  if (ptid_equal (ptid, minus_one_ptid))
    {
      /* Let all the other threads run too.  */
      iterate_over_threads (inf_ttrace_resume_callback, NULL);
****  iterate_over_threads (inf_ttrace_delete_dying_threads_callback, NULL);
    }

It could happen, under stress, that GDB handles another event and stops
all lwps *before* the dying lwp has had a chance to exit.
The symptom would again be the one Jerome was fixing.  Having no idea
how long the thread still executes after a TTEVT_LWP_EXIT event, I
can't tell how likely this is to happen, say, causing spurious testsuite
failures.

The ttrace API doesn't have an "lwp is really gone from the OS tables"
event (sigh), but it does have a TTEVT_LWP_CREATE event.  This means we
can detect at that point whether the OS is reusing an lwpid.  So, to
minimise the possible race, how about:

- still try to resume a dying lwp, and ignore the errno you were
originally seeing in that case (only).

- on resume failure, delete the thread from GDB's thread table.

- if by any chance the lwp exits, the inferior spawns a new lwp, and the
OS reuses the lwpid of the lwp we knew was dying, we delete the dying
lwp and add the new one.  If the OS is reusing the id, the original lwp
has to be gone.  This is just an add_thread call, as the replacement is
already handled by it internally (*).

- if the thread is still alive but dying, let that show in
"info threads".  The linux pthread support implementation also does
this.

I think the race below still exists, but it happens to be innocuous:

- we knew an lwp was dying, but we didn't detect its exit; after we
resumed it, it really exited.
- another thread hits a breakpoint.
- yet another thread spawns a new thread, which reuses the dying
thread's lwpid.
- GDB is informed of the breakpoint hit.
- since events are handled sequentially, we will only notice the new
thread event after handling the breakpoint hit.
- the user resumes the inferior after inspecting what happened at the
breakpoint.
- GDB tries to resume the dying lwpid --- actually, this id belongs to a
new thread by now.  Since we were handling a breakpoint, we had stopped
the whole process, so this resume succeeds.  We still consider the lwp
to be dying.
- GDB now handles the TTEVT_LWP_CREATE event and adds the new thread to
GDB's thread list, which automatically gets rid of the dying lwp, due to
the PTID collision.
Also, when we detect a TTEVT_LWP_CREATE, TTEVT_LWP_EXIT or
TTEVT_LWP_TERMINATE, the ttrace docs indicate that only one lwp is
stopped.  There's no reason to stop all lwps and return
TARGET_WAITKIND_SPURIOUS, only to resume all lwps again.  This just adds
overhead and messes with the scheduling of the inferior more than
needed.  We could just resume the stopped lwp and return
TARGET_WAITKIND_IGNORE.  But then we wouldn't detect that the thread is
really gone until the next whole-inferior resume.

I guess we could also detect whether dying threads have already exited
after stopping the whole process, by sending them signal 0.  We can also
do that in inf_ttrace_thread_alive (like inf_ptrace_thread_alive does),
so "info threads" also gets rid of dying threads that have really died
by then.

Sorry, that came out longer than I was expecting.  Am I making any
sense?  I can work up a patch for this.

If I'm analysing the race correctly, an alternative, simpler solution
would be to just ignore a failed resume on a dying thread, but still try
to resume it.  Not resuming dying threads may be a problem.

-- 
Pedro Alves