From: Pedro Alves
To: gdb-patches@sourceware.org
Subject: fix "info os processes" race -> crash (ext-run.exp racy FAIL)
Date: Fri, 26 Aug 2011 19:06:00 -0000
Message-Id: <201108262006.21158.pedro@codesourcery.com>

I'm seeing ext-run.exp randomly fail with:

gdb.sum:

Running ../../../src/gdb/testsuite/gdb.server/ext-run.exp ...
FAIL: gdb.server/ext-run.exp: get process list (pattern 1)
FAIL: gdb.server/ext-run.exp: load new file without any gdbserver inferior
FAIL: gdb.server/ext-run.exp: monitor exit

gdb.log:

(gdb) PASS: gdb.server/ext-run.exp: continue to main
gdb_expect_list pattern: /pid +user +command/
info os processes
Remote connection closed
(gdb) FAIL: gdb.server/ext-run.exp: get process list (pattern 1)

This is gdbserver crashing:

$ ./gdb gdbserver/gdbserver ./testsuite/core.27095
...
Program terminated with signal 11, Segmentation fault.
...
(top-gdb) bt
#0  0x00002ae59c23a3f6 in __readdir (dirp=0x0) at ../sysdeps/unix/readdir.c:45
#1  0x000000000042613b in get_cores_used_by_process (pid=27135, cores=0xafe7e0) at ../../../src/gdb/gdbserver/../common/linux-osdata.c:263
#2  0x0000000000426312 in linux_xfer_osdata_processes (readbuf=0xafd7d0 "", offset=0, len=4096) at ../../../src/gdb/gdbserver/../common/linux-osdata.c:338
#3  0x0000000000426b91 in linux_common_xfer_osdata (annex=0xaf5202 "processes", readbuf=0xafd7d0 "", offset=0, len=4096) at ../../../src/gdb/gdbserver/../common/linux-osdata.c:579
#4  0x0000000000424cb7 in linux_qxfer_osdata (annex=0xaf5202 "processes", readbuf=0xafd7d0 "", writebuf=0x0, offset=0, len=4096) at ../../../src/gdb/gdbserver/linux-low.c:4467
#5  0x000000000040812a in handle_qxfer_osdata (annex=0xaf5202 "processes", readbuf=0xafd7d0 "", writebuf=0x0, offset=0, len=4096) at ../../../src/gdb/gdbserver/server.c:981
#6  0x00000000004088ac in handle_qxfer (own_buf=0xaf51f0 "qXfer:osdata", packet_len=33, new_packet_len_p=0x7fff8e2ecdd4) at ../../../src/gdb/gdbserver/server.c:1254
#7  0x0000000000409dce in handle_query (own_buf=0xaf51f0 "qXfer:osdata", packet_len=33, new_packet_len_p=0x7fff8e2ecdd4) at ../../../src/gdb/gdbserver/server.c:1749
#8  0x000000000040bda0 in process_serial_event () at ../../../src/gdb/gdbserver/server.c:2778
#9  0x000000000040ce3f in handle_serial_event (err=0, client_data=0x0) at ../../../src/gdb/gdbserver/server.c:3194
#10 0x000000000041164b in handle_file_event (event_file_desc=6) at ../../../src/gdb/gdbserver/event-loop.c:489
#11 0x0000000000410dfc in process_event () at ../../../src/gdb/gdbserver/event-loop.c:244
#12 0x0000000000411bbd in start_event_loop () at ../../../src/gdb/gdbserver/event-loop.c:607
#13 0x000000000040bc21 in main (argc=4, argv=0x7fff8e2ed008) at ../../../src/gdb/gdbserver/server.c:2689

The problem is that get_cores_used_by_process assumes opening
/proc/PID/task always succeeds.  But since we're listing all the
processes running on the system, the opendir can fail if PID happens
to exit after we've seen it exist (by listing /proc's contents) but
just before we open /proc/PID/task.  This is easier to trip on if you
run the testsuite in parallel mode (make check -jN).

All the other places that open /proc/... files or directories are
careful to handle open failure; this one wasn't.  I've applied the
obvious fix.  (It fixes both native gdb and gdbserver; hurray for code
sharing!)

-- 
Pedro Alves

2011-08-26  Pedro Alves

gdb/
	* common/linux-osdata.c (get_cores_used_by_process): Don't assume
	opening /proc/PID/task always succeeds.
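To see the race outside of gdb, here's a minimal standalone sketch of
the same defensive pattern the patch applies: treat a failed opendir of
/proc/PID/task as "process already gone" instead of handing readdir a
NULL DIR *.  (Illustration only; this little program and its names are
made up here and are not part of gdb or of the patch.)

/* Count the tasks (threads) of PID by listing /proc/PID/task.
   Returns 0 if the directory can't be opened, e.g. because PID
   exited between the time it was seen in /proc and now.  */

#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

static int
count_tasks_of_process (pid_t pid)
{
  char taskdir[sizeof ("/proc/4294967295/task")];
  DIR *dir;
  struct dirent *dp;
  int task_count = 0;

  snprintf (taskdir, sizeof (taskdir), "/proc/%d/task", (int) pid);

  dir = opendir (taskdir);
  if (dir == NULL)
    /* PID is gone already; not an error when scanning the whole system.  */
    return 0;

  while ((dp = readdir (dir)) != NULL)
    {
      /* Skip "." and ".." and anything else that isn't a TID.  */
      if (!isdigit ((unsigned char) dp->d_name[0]))
        continue;
      ++task_count;
    }

  closedir (dir);
  return task_count;
}

int
main (int argc, char **argv)
{
  pid_t pid = argc > 1 ? (pid_t) atoi (argv[1]) : getpid ();

  printf ("pid %d has %d task(s)\n", (int) pid, count_tasks_of_process (pid));
  return 0;
}

With the check in place, a process that vanishes mid-scan simply
contributes zero tasks.  The actual fix is below.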
---
 gdb/common/linux-osdata.c |   36 +++++++++++++++++++-----------------
 1 file changed, 19 insertions(+), 17 deletions(-)

Index: src/gdb/common/linux-osdata.c
===================================================================
--- src.orig/gdb/common/linux-osdata.c	2011-08-26 19:41:37.255883141 +0100
+++ src/gdb/common/linux-osdata.c	2011-08-26 19:45:18.515883179 +0100
@@ -259,27 +259,29 @@ get_cores_used_by_process (pid_t pid, in
 
   sprintf (taskdir, "/proc/%d/task", pid);
   dir = opendir (taskdir);
-
-  while ((dp = readdir (dir)) != NULL)
+  if (dir)
     {
-      pid_t tid;
-      int core;
-
-      if (!isdigit (dp->d_name[0])
-          || NAMELEN (dp) > sizeof ("4294967295") - 1)
-        continue;
-
-      tid = atoi (dp->d_name);
-      core = linux_common_core_of_thread (ptid_build (pid, tid, 0));
-
-      if (core >= 0)
+      while ((dp = readdir (dir)) != NULL)
         {
-          ++cores[core];
-          ++task_count;
+          pid_t tid;
+          int core;
+
+          if (!isdigit (dp->d_name[0])
+              || NAMELEN (dp) > sizeof ("4294967295") - 1)
+            continue;
+
+          tid = atoi (dp->d_name);
+          core = linux_common_core_of_thread (ptid_build (pid, tid, 0));
+
+          if (core >= 0)
+            {
+              ++cores[core];
+              ++task_count;
+            }
         }
-    }
 
-  closedir (dir);
+      closedir (dir);
+    }
 
   return task_count;
 }