From: Kevin Buettner
To: gdb-patches@sources.redhat.com
Date: Mon, 11 Mar 2002 15:46:00 -0000
Message-Id: <1020311234554.ZM20650@localhost.localdomain>
Subject: [PATCH RFA/RFC] Don't use lwp_from_thread() in thread_db_wait()

I'm seeing the following failure when I run the gdb testsuite on an SMP machine (GNU/Linux/x86):

    FAIL: gdb.threads/pthreads.exp: continue to bkpt at common_routine in thread 2

Here's the relevant bit from the log file:

    break common_routine thread 4
    Breakpoint 6 at 0x804864e: file /saguaro1/intelp4-011128-branch/devo/gdb/testsuite/gdb.threads/pthreads.c, line 50.
    (gdb) PASS: gdb.threads/pthreads.exp: set break at common_routine in thread 2
    continue
    Continuing.
    Cannot find thread 1024: generic error
    [followed by the FAIL message]

Now, consider the following portion of gdb's stack when this happens:

    (top-gdb) bt
    #0  ps_xfer_memory (ph=0x8392f80, addr=1073962208, buf=0xbfffef40 "",
        len=16, write=0) at ../../devo/gdb/proc-service.c:81
    #1  0x08111a36 in ps_pdread (ph=0x8392f80, addr=1073962208,
        buf=0xbfffef40, size=16) at ../../devo/gdb/proc-service.c:194
    #2  0x402aa3d9 in td_ta_map_id2thr (ta=0x83cde40, pt=1024, th=0xbfffef88)
        at td_ta_map_id2thr.c:41
    #3  0x08112086 in lwp_from_thread (ptid={pid = 22395, lwp = 0, tid = 1024})
        at ../../devo/gdb/thread-db.c:261
    #4  0x08112e20 in thread_db_wait (ptid={pid = 22395, lwp = 0, tid = 1024},
        ourstatus=0xbffff150) at ../../devo/gdb/thread-db.c:720
    #5  0x080acbde in wait_for_inferior () at ../../devo/gdb/infrun.c:1246
    #6  0x080ac915 in proceed (addr=4294967295, siggnal=TARGET_SIGNAL_DEFAULT,
        step=0) at ../../devo/gdb/infrun.c:1045
    #7  0x080a98b4 in continue_command (proc_count_exp=0x0, from_tty=1)
        at ../../devo/gdb/infcmd.c:536

As I see it, the problem is as follows...

thread_db_wait() wants to learn the lwp id of the thread that it should wait for, so that it can ask the lwp layer to wait on the lwp corresponding to that thread.  In order to do this, it calls lwp_from_thread().  lwp_from_thread() needs help from libthread_db.so to figure this out, so it calls td_ta_map_id2thr().  BUT, this libthread_db function must interrogate the inferior process's memory to look at the thread data structures.  To do this, it calls back into gdb, using ps_pdread() to fetch the memory in question.  Eventually, on Linux, ptrace() gets called to actually fetch the memory.

The Linux/i386 (kernel) ptrace code contains the following check:

    ret = -ESRCH;
    if (!(child->ptrace & PT_PTRACED))
        goto out_tsk;
    if (child->state != TASK_STOPPED) {
        if (request != PTRACE_KILL)
            goto out_tsk;
    }

This says that ESRCH will be returned if the child process is not being traced.  (Not relevant.)
It ALSO says to return ESRCH if the process is not stopped (for any request other than PTRACE_KILL).  This is of critical importance.  In the above trace, we want to wait for the main thread to stop, but in order to find out the information we need to do this, the main thread must already be stopped!

The patch below fixes this problem for me and shows no regressions in the testsuite.

I considered a number of other, less palatable solutions.  One of them involved implementing a linux/x86-specific version of child_xfer_memory() which would (attempt to) explicitly stop the process if an error occurred while attempting a memory read.  The problem with this is: once the process is stopped, what do we do with it?  Start it again?  I'm sure that something could be worked out, but I studied lin-lwp.c, which already had a fair amount of this kind of hair in it, and it scared me enough to opt for a simpler solution.

FWIW, the now-defunct lin-thread.c had #if 0'd out the corresponding bit of code that I chose to disable in thread-db.c.  I'm guessing that the older thread implementation had run into the same kind of problem in the past.

Comments?  Okay to commit?

	* thread-db.c (thread_db_wait): Don't attempt to use
	lwp_from_thread().  Doing so assumes that the main thread is
	already stopped and this might not be the case.  Instead, simply
	wait for any thread.
Index: thread-db.c
===================================================================
RCS file: /cvs/src/src/gdb/thread-db.c,v
retrieving revision 1.21
diff -u -p -r1.21 thread-db.c
--- thread-db.c	2002/02/24 21:53:02	1.21
+++ thread-db.c	2002/03/11 23:25:36
@@ -719,10 +719,31 @@ thread_db_wait (ptid_t ptid, struct targ
 {
   extern ptid_t trap_ptid;
 
-  if (GET_PID (ptid) != -1 && is_thread (ptid))
-    ptid = lwp_from_thread (ptid);
+  /* Note: kevinb/2002-03-11: We used to do the following here:
 
-  ptid = target_beneath->to_wait (ptid, ourstatus);
+       if (GET_PID (ptid) != -1 && is_thread (ptid))
+         ptid = lwp_from_thread (ptid);
+
+       ptid = target_beneath->to_wait (ptid, ourstatus);
+
+     The problem with calling lwp_from_thread() at this point is that
+     the main thread is not necessarily stopped.  This is a problem
+     because lwp_from_thread() requires help from the thread_db to
+     obtain the thread to lwp mapping.  In order to perform this
+     operation, the thread_db library calls back into GDB to do a
+     memory read of the main thread.  On GNU/Linux, a memory read
+     is performed via ptrace(), which requires that the process be
+     stopped.  (ESRCH is returned otherwise.)  Even if it were
+     permissible to read the memory of a running process, it would
+     probably not be a good idea to rely on such results.
+
+     So, instead of attempting to fetch the LWP id and invoke a
+     lower layer's target_wait() with a ptid constructed from this
+     LWP, we simply wait for any thread and let infrun.c's thread
+     hopping machinery sort out whether the desired thread has been
+     stopped or not.  */
+
+  ptid = target_beneath->to_wait (pid_to_ptid (-1), ourstatus);
 
   if (proc_handle.pid == 0)
     /* The current child process isn't the actual multi-threaded