From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 22160 invoked by alias); 22 Jan 2014 14:04:07 -0000 Mailing-List: contact gdb-patches-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: gdb-patches-owner@sourceware.org Received: (qmail 22144 invoked by uid 89); 22 Jan 2014 14:04:06 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-2.0 required=5.0 tests=AWL,BAYES_00,RP_MATCHES_RCVD,SPF_HELO_PASS,SPF_PASS autolearn=ham version=3.3.2 X-HELO: mx1.redhat.com Received: from mx1.redhat.com (HELO mx1.redhat.com) (209.132.183.28) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Wed, 22 Jan 2014 14:04:05 +0000 Received: from int-mx09.intmail.prod.int.phx2.redhat.com (int-mx09.intmail.prod.int.phx2.redhat.com [10.5.11.22]) by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id s0ME416s011836 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Wed, 22 Jan 2014 09:04:02 -0500 Received: from [127.0.0.1] (ovpn01.gateway.prod.ext.ams2.redhat.com [10.39.146.11]) by int-mx09.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id s0ME3xKJ015560; Wed, 22 Jan 2014 09:04:00 -0500 Message-ID: <52DFCFCF.4030101@redhat.com> Date: Wed, 22 Jan 2014 14:04:00 -0000 From: Pedro Alves User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130625 Thunderbird/17.0.7 MIME-Version: 1.0 To: Sterling Augustine CC: gdb-patches Subject: Re: [PATCH] RFC: All stepping threads need the same checks and cleanup as the currently stepping thread References: In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-SW-Source: 2014-01/txt/msg00858.txt.bz2 On 01/21/2014 10:17 PM, Sterling Augustine wrote: > But the problem is that the thread out of the step range is not the > currently stepped thread. Then I'm quite confused. Each thread has its own step range. How come the thread that is not stepped has a step range at all then? > Process_event_stop_test calls > switch_back_to_stepped_thread, which, in turn, calls resume, bypassing > the extra logic in process_event_stop_test to fix up the step range, > and leading to the assertion error. The issue seems to me, as previously discussed, not really about missing the "fix up" of the step range, but rather that we overstep the thread by mistake. Running the thread through process_event_stop_test makes us detect that the step finished (before we ever get to fix up the step range). That is, we switch back to the stepping thread, and re-step it, ignoring the possibility that that thread might have already moved past the step range, but not have had a chance to report that trap to the core yet (because events are serialized). The thing missing is a testcase clearly showing that that's indeed the issue in question. I spent a few days trying to write one from scratch a while ago, but failed, because linux-nat.c always gives preference to reporting the stepping/SIGTRAP thread if there are multiple simultaneous events, and it seemed like another signal needs to be involved to trigger this. Perhaps we could confirm all this already in a log produced by your extra debug outputs ran against your big app? > But I don't see any reason to assume that only the current thread > would need the additional cleanup found in process_event_stop_test. In > fact, switch_back_to_stepped_thread has a special case (hidden in > currently_stepping_or_nexting_callback), to _prevent_ it from > restarting the current thread, presumably so it that thread can get > the additional cleanup. > > The enclosed patch does two things: > > 1. Adds a large amount of tracing, which helped me diagnose the problem. > 2. Changes switch_back_to_stepped_thread to still switch back to a > stepped thread, but to avoid restarting it, allowing the additional > checks in process_event_stop_test to work their magic. I'm not convinced the first two branches in switch_back_to_stepped_thread should be changed at all. So without those that reduces to exactly the original patch I had shown you originally: https://github.com/palves/gdb/commit/b6b55ba610f8db5d89ec7405c93013a10d9a1c20 Does that alone fix things for you? In that branch, I then later rewrote that fix differently: https://github.com/palves/gdb/commit/1d56ddf439b6f7e7fa9759cf1f8e02106eea6af5 The idea of that "better fix" was to handle the case mentioned in this comment: + There might be some cases where this loses signal + information, if a signal has arrived at exactly the same + time that the PC changed, but this is the best we can do + with the information available. by setting a breakpoint at the current PC, and re-resuming the thread. That means that if there was indeed some other signal/event pending, we'd collect it first. But that's unfinished, and breaks hardware-step targets again in the process, for it only handles software-step targets. The thing preventing moving this forward is a testcase (or a log showing clearly that the problem is what I say above it is, which should show the steps needed to construct a testcase). -- Pedro Alves