From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gdb-patches-return-13315-listarch-gdb-patches=sourceware.cygnus.com@sources.redhat.com>
Received: (qmail 27557 invoked by alias); 11 Mar 2002 23:46:22 -0000
Mailing-List: contact gdb-patches-help@sources.redhat.com; run by ezmlm
Precedence: bulk
List-Subscribe: <mailto:gdb-patches-subscribe@sources.redhat.com>
List-Archive: <http://sources.redhat.com/ml/gdb-patches/>
List-Post: <mailto:gdb-patches@sources.redhat.com>
List-Help: <mailto:gdb-patches-help@sources.redhat.com>, <http://sources.redhat.com/ml/#faqs>
Sender: gdb-patches-owner@sources.redhat.com
Received: (qmail 27097 invoked from network); 11 Mar 2002 23:46:14 -0000
Received: from unknown (HELO cygnus.com) (205.180.230.5)
  by sources.redhat.com with SMTP; 11 Mar 2002 23:46:14 -0000
Received: from cse.cygnus.com (cse.cygnus.com [205.180.230.236])
	by runyon.cygnus.com (8.8.7-cygnus/8.8.7) with ESMTP id PAA05037
	for <gdb-patches@sources.redhat.com>; Mon, 11 Mar 2002 15:46:12 -0800 (PST)
Received: (from kev@localhost)
	by cse.cygnus.com (8.11.6/8.11.6) id g2BNjsh20651
	for gdb-patches@sources.redhat.com; Mon, 11 Mar 2002 16:45:54 -0700
Date: Mon, 11 Mar 2002 15:46:00 -0000
From: Kevin Buettner <kevinb@redhat.com>
Message-Id: <1020311234554.ZM20650@localhost.localdomain>
X-Mailer: Z-Mail (4.0.1 13Jan97 Caldera)
To: gdb-patches@sources.redhat.com
Subject: [PATCH RFA/RFC] Don't use lwp_from_thread() in thread_db_wait()
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-SW-Source: 2002-03/txt/msg00167.txt.bz2

I'm seeing the following failure when I run the gdb testsuite on an
SMP machine (GNU/Linux/x86):

FAIL: gdb.threads/pthreads.exp: continue to bkpt at common_routine in thread 2

Here's the relevant bit from the log file:

break common_routine thread 4
Breakpoint 6 at 0x804864e: file /saguaro1/intelp4-011128-branch/devo/gdb/testsuite/gdb.threads/pthreads.c, line 50.
(gdb) PASS: gdb.threads/pthreads.exp: set break at common_routine in thread 2
continue
Continuing.
Cannot find thread 1024: generic error
[followed by the FAIL message]

Now, consider the following portion of gdb's stack when this happens:

(top-gdb) bt
#0  ps_xfer_memory (ph=0x8392f80, addr=1073962208, buf=0xbfffef40 "", len=16, 
    write=0) at ../../devo/gdb/proc-service.c:81
#1  0x08111a36 in ps_pdread (ph=0x8392f80, addr=1073962208, buf=0xbfffef40, 
    size=16) at ../../devo/gdb/proc-service.c:194
#2  0x402aa3d9 in td_ta_map_id2thr (ta=0x83cde40, pt=1024, th=0xbfffef88)
    at td_ta_map_id2thr.c:41
#3  0x08112086 in lwp_from_thread (ptid={pid = 22395, lwp = 0, tid = 1024})
    at ../../devo/gdb/thread-db.c:261
#4  0x08112e20 in thread_db_wait (ptid={pid = 22395, lwp = 0, tid = 1024}, 
    ourstatus=0xbffff150) at ../../devo/gdb/thread-db.c:720
#5  0x080acbde in wait_for_inferior () at ../../devo/gdb/infrun.c:1246
#6  0x080ac915 in proceed (addr=4294967295, siggnal=TARGET_SIGNAL_DEFAULT, 
    step=0) at ../../devo/gdb/infrun.c:1045
#7  0x080a98b4 in continue_command (proc_count_exp=0x0, from_tty=1)
    at ../../devo/gdb/infcmd.c:536

As I see it, the problem is as follows...

thread_db_wait() wants to learn the lwp id of the thread that it
should wait for so that it can ask the lwp layer to wait on the lwp
corresponding to the thread in question.  In order to do this, it
calls lwp_from_thread().  lwp_from_thread() needs help from
libthread_db.so to figure this out, so it calls td_ta_map_id2thr().
BUT, this libthread_db function must interrogate the inferior
process's memory to look at the thread data structures.  To do this,
it calls back into gdb, using ps_pdread() to fetch the memory in
question.  Eventually, on Linux, ptrace() gets called to actually
fetch the memory.

The Linux/i386 (kernel) ptrace code contains the following check:

	ret = -ESRCH;
	if (!(child->ptrace & PT_PTRACED))
		goto out_tsk;
	if (child->state != TASK_STOPPED) {
		if (request != PTRACE_KILL)
			goto out_tsk;
	}

This says that ESRCH will be returned if the child process is not
being traced.  (Not relevant.)  It ALSO says to return ESRCH if the
process is not stopped, unless the request is PTRACE_KILL.  This is
of critical importance.

In the above trace, we want to wait for the main thread to stop, but
in order to obtain the information needed to do this, the main thread
must already be stopped!

The patch below fixes this problem for me and shows no regressions
in the testsuite.

I considered a number of other less palatable solutions.  One of them
involved implementing a Linux/x86-specific version of
child_xfer_memory() which would (attempt to) explicitly stop the
process if an error occurred in attempting to do a memory read.  The
problem with this is that once stopped, what do we do with it?  Start
it again?  I'm sure that something could be worked out, but I studied
lin-lwp.c which had a fair amount of this kind of hair in it already,
and it scared me enough to opt for a simpler solution.  FWIW, the now
defunct lin-thread.c had #if 0'd out the corresponding bit of code
that I chose to disable in thread-db.c.  I'm guessing that the older
thread implementation had run into the same kind of problem in the
past.

Comments?  Okay to commit?

	* thread-db.c (thread_db_wait): Don't attempt to use
	lwp_from_thread().  Doing so assumes that the main thread
	is already stopped and this might not be the case.  Instead,
	simply wait for any thread.

Index: thread-db.c
===================================================================
RCS file: /cvs/src/src/gdb/thread-db.c,v
retrieving revision 1.21
diff -u -p -r1.21 thread-db.c
--- thread-db.c	2002/02/24 21:53:02	1.21
+++ thread-db.c	2002/03/11 23:25:36
@@ -719,10 +719,31 @@ thread_db_wait (ptid_t ptid, struct targ
 {
   extern ptid_t trap_ptid;
 
-  if (GET_PID (ptid) != -1 && is_thread (ptid))
-    ptid = lwp_from_thread (ptid);
+  /* Note: kevinb/2002-03-11: We used to do the following here:
 
-  ptid = target_beneath->to_wait (ptid, ourstatus);
+	if (GET_PID (ptid) != -1 && is_thread (ptid))
+	  ptid = lwp_from_thread (ptid);
+
+	ptid = target_beneath->to_wait (ptid, ourstatus);
+
+     The problem with calling lwp_from_thread() at this point is that
+     the main thread is not necessarily stopped.  This is a problem
+     because lwp_from_thread() requires help from the thread_db to
+     obtain the thread to lwp mapping.  In order to perform this
+     operation, the thread_db library calls back into GDB to do a
+     memory read of the main thread.  On GNU/Linux, a memory read
+     is performed via ptrace(), which requires that the process be
+     stopped.  (ESRCH is returned otherwise.)  Even if it were
+     permissible to read the memory of a running process, it would
+     probably not be a good idea to rely on such results.
+     
+     So, instead of attempting to fetch the LWP id and invoke a
+     lower layer's target_wait() with a ptid constructed from this
+     LWP, we simply wait for any thread and let infrun.c's thread
+     hopping machinery sort out whether the desired thread has been
+     stopped or not.  */
+
+  ptid = target_beneath->to_wait (pid_to_ptid (-1), ourstatus);
 
   if (proc_handle.pid == 0)
     /* The current child process isn't the actual multi-threaded