On Saturday 09 August 2008 15:51:16, Pedro Alves wrote:


> inf_ttrace_wait ()
> ...
>       case TTEVT_LWP_EXIT:
>         if (print_thread_events)
>           printf_unfiltered (_("[%s exited]\n"), target_pid_to_str (ptid));
>         ti = find_thread_pid (ptid);
>         gdb_assert (ti != NULL);
>         ((struct inf_ttrace_private_thread_info *)ti->private)->dying = 1;
>         inf_ttrace_num_lwps--;
> (1)     ttrace (TT_LWP_CONTINUE, ptid_get_pid (ptid),
>               ptid_get_lwp (ptid), TT_NOPC, 0, 0);
>         /* If we don't return -1 here, core GDB will re-add the thread.  */
>         ptid = minus_one_ptid;
>         break;
> ...
>
>     /* Make sure all threads within the process are stopped.  */
> (2)  if (ttrace (TT_PROC_STOP, tts.tts_pid, 0, 0, 0, 0) == -1)
>        perror_with_name (("ttrace"));
>
>     return ptid;
>   }
>
>
> It seems to me, that for some reason, in most cases, the inferior was slow
> enough that when you reach (2), the dying thread hadn't exited
> yet.  The TT_PROC_STOP call stops all lwps of the process, the
> dying one included, I would think.  In that case, you still need the
> resume on the dying thread in inf_ttrace_wait.  Otherwise, you *may*
> get this bug back, depending on how the OS is waking waiting processes:


> So, to minimise the possible race, how about:
>
> - still try to resume a dying lwp.  Ignore the errno you
>   were originally seeing in that case (only).
> - on resume failure, delete it from GDB's thread table.
> - if by any chance, the lwp exits, and the inferior spawns a
>   new lwp, and the OS reuses the same lwpid of the lwp we knew
>   was dying, we delete the dying lwp, and add the new one.
>   If the OS is reusing the id, the original lwp has to be gone.
>   This is just an add_thread call, as that is already handled by it
>   internally (*).
> - If the thread is still alive, but is dying, let that show
>   in "info threads".  The linux pthread support implementation
>   also does this.

This is what the attached patch does.  In addition to what is
described above, I'm checking whether any dying thread is now gone
after stopping the whole process.  I'm checking for lwp "aliveness"
by sending signal 0.  I hope it works as expected against
ttrace-stopped threads; otherwise, I'd need another way to detect
whether the lwp is still alive.
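For readers unfamiliar with the signal 0 trick, here is a minimal
sketch of the idea at process granularity, using plain POSIX kill().
The function name `process_exists` is made up for illustration, and
this is an analogue only: it does not show how the patch addresses an
individual lwp, nor whether the probe behaves the same way on a
ttrace-stopped lwp.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/* Return 1 if a process with id PID exists, 0 otherwise.  Signal 0
   runs the usual existence and permission checks but delivers
   nothing to the target.  */
static int
process_exists (pid_t pid)
{
  if (kill (pid, 0) == 0)
    return 1;
  /* EPERM means the process exists but we may not signal it.  */
  return errno == EPERM;
}
```

Note that a process which has exited but has not yet been waited for
still counts as existing for this probe.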

With this change, we no longer unconditionally delete the dying
lwps after the first resume.  This guards against the case where
another event that was already queued is handled, and GDB stops the
whole process before the dying thread has had a chance to die.  In
this case, we'll still need another resume on the dying lwp, until it
really exits.
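The check after stopping the process can be pictured roughly as
follows. This is a sketch under assumptions, not the patch itself:
`struct lwp_entry`, `prune_dead_lwps` and the `lwp_alive_fn` callback
are hypothetical, and `demo_alive` is a stub that stands in for the
real liveness probe.

```c
#include <stddef.h>

/* Hypothetical stand-in for one entry of the debugger's thread table.  */
struct lwp_entry
{
  long lwpid;
  int dying;    /* The exit event for this lwp has been seen.  */
  int present;  /* Cleared when the entry is deleted.  */
};

/* Hypothetical liveness predicate: nonzero if the lwp still exists.  */
typedef int (*lwp_alive_fn) (long lwpid);

/* Call after the whole process has been stopped.  Deletes dying
   entries whose lwp is gone, and returns how many dying lwps are
   still alive and therefore need another resume.  */
static int
prune_dead_lwps (struct lwp_entry *table, size_t n, lwp_alive_fn alive)
{
  int still_dying = 0;
  size_t i;

  for (i = 0; i < n; i++)
    {
      if (!table[i].present || !table[i].dying)
        continue;
      if (alive (table[i].lwpid))
        still_dying++;
      else
        table[i].present = 0;
    }
  return still_dying;
}

/* Demonstration stub: pretend that only lwp 2 has finished exiting.  */
static int
demo_alive (long lwpid)
{
  return lwpid != 2;
}
```

With a table holding a live lwp 1 and dying lwps 2 and 3, calling
prune_dead_lwps with demo_alive deletes the entry for lwp 2 and
leaves lwp 3 listed as dying, so it gets resumed again next time.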

Hope I haven't broken anything badly.  I've never in my life logged in
to an HP-UX system, so wear sunglasses.

-- 
Pedro Alves