Subject: Re: Checking function calls
From: Fredrik Tolf
To: gdb@sources.redhat.com
In-Reply-To: <200212050451.gB54ppB31800@duracef.shout.net>
Date: Thu, 05 Dec 2002 15:14:00 -0000
Message-Id: <1039130097.2343.52.camel@pc7>

On Thu, 2002-12-05 at 05:51, Michael Elizabeth Chastain wrote:
> Hi Fredrik,
>
> I'm throwing out a bunch of ideas here, take whatever looks useful
> and discard the rest.
>
> > Therefore, the failure has to be that a called
> > function doesn't restore EBX correctly, on rare occasions, right?
>
> I have seen this happen in a mixed programming environment,
> with a Cygwin program that used a Windows DLL.  The Windows DLL
> had subtly different calling conventions where it did not preserve
> %ebx, %esi, and %edi across function calls.  Perhaps you have some
> kind of third party library in your program which has a similar
> compatibility issue?

The only libraries are libc, libpthread, libdl and libpam. In the
affected function, only libc and libpthread are used. Therefore I don't
think it's a calling convention incompatibility, unless they in turn
call functions in third party libraries, which I find very unlikely.

> > My question is thus: Is there any way of debugging this with GDB? Can I
> > make GDB check that EBX is the same before and after every function call
> > from that frame in this thread to isolate the failing function? The
> > frame never exits (until the program exits, that is), if that helps.
>
> You could set a bunch of conditional breakpoints with "break if %ebx !=
> saved_ebx", where you add code to your program to initialize saved_ebx.
> Or you could say "break if %ebx < 0x1000" or some convenient constant.

That would, of course, be a good thing. It's only that I'd have to do
that after every single function call... That would take some time.
Maybe I'll do it anyway. I was actually thinking of doing something like
that, but in code instead, making the thread SIGSTOP itself when EBX is
invalid.

> You could also try forcing your variable to be on the stack instead of a
> register.  Remove the "register" attribute from the declaration of "next"
> if you have one.  Then add a "do_nothing(&next)" call to your function,
> to force "next" to be on the stack instead of in a register.  If the
> symptoms go away then it's more likely to really be a register clobber.

That just doesn't feel like a very elegant solution, though. And this
bug actually surfaces in another function as well, only even more
rarely. There it also affects a variable stored in EBX, but it gets set
to 0 instead. So I would prefer actually solving the problem, so that it
doesn't show up anywhere else. I have noticed no similarities between
the two places where the bug shows itself.

> > At first I was expecting that another thread somehow gets there and
> > modifies the storage memory of next.
>
> I still suspect this.  It's more likely that memory gets clobbered rather
> than a register value.

But next isn't stored in memory anywhere, so it cannot be that.

> Perhaps you need a function that locks the whole list and walks it for
> a sanity check, without deleting anything?

I always check the list with gdb when the program crashes, and it's
always correct. That's why I think it's impossible that next is loaded
while the list is in an unstable state. The only times I actually set
the next element, I always set it to NULL or a pointer returned by
malloc(). If the list were made unstable by a buggy function somewhere,
it would have to be restored again by the same function (since it's
always consistent when I look at it), and I just don't see that
happening.

> Here is another wild lead: if, somehow, a block gets freed and then
> you read it, many implementations of malloc keep housekeeping information
> in the first word or two of a freed block.  That would explain why the
> value is always 0x10 to 0x30 (that could be block size, especially if it is
> rounded up to a multiple of 4 or 8) and why only 1-2 words are clobbered.
> If you manage your blocks with malloc/free, you could try turning on any
> malloc debugging facilities that you have.

I also suspected that something like that might happen, and therefore I
lock one element ahead of the block I'm currently looking at, so that
the current block and the next are always locked. That's why I have:

for(cur = list; cur != NULL; cur = next)
{
    if((next = cur->next) != NULL)
        pthread_mutex_lock(&next->mutex);
    ...
}

That is also a reason why the next variable has to be clobbered at some
later point, since pthread_mutex_lock succeeds on it. The program always
crashes on the line "if((next = cur->next) != NULL)", since it segfaults
when it looks up cur->next, i.e. at that point cur has been set to the
invalid next as directed by the loop. Therefore, when the program
crashes, next and cur are equal, and I cannot see what element it was at
before.

By the way, if you want to look at the code, it's available at
http://sourceforge.net/projects/dcprod/. I don't know if it's the latest
version, though.
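
For what it's worth, the "check in code and SIGSTOP" idea could look
roughly like the sketch below. It is only an illustration and not taken
from the real program: struct node, check_next(), walk_checked(),
clobber_count, the visit callback and the STOP_ON_CLOBBER switch are all
made-up names, and I am assuming that the caller holds the mutex of the
first element on entry and that each element is unlocked once it has
been processed. The trick is to keep a second copy of next in a volatile
variable, which the compiler has to keep on the stack, and to compare
the two copies after every call that might clobber the register.

```c
#include <pthread.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical list element; the real program's struct will differ. */
struct node {
    struct node *next;
    pthread_mutex_t mutex;
};

/* Number of times the working copy disagreed with the stack copy. */
static int clobber_count = 0;

/* Compare the (probably register-allocated) pointer with its shadow. */
static void check_next(struct node *next, struct node *volatile *shadow)
{
    if (next != *shadow) {
        clobber_count++;
#ifdef STOP_ON_CLOBBER
        raise(SIGSTOP);   /* park the process so a debugger can look at it */
#endif
    }
}

/* Walk the list, always holding the current and the next element.
 * Assumption: the caller already holds list->mutex on entry.
 * Returns the number of elements visited. */
static int walk_checked(struct node *list, void (*visit)(struct node *))
{
    struct node *cur, *next;
    struct node *volatile shadow;   /* volatile: must live in memory */
    int visited = 0;

    for (cur = list; cur != NULL; cur = next) {
        if ((next = cur->next) != NULL)
            pthread_mutex_lock(&next->mutex);
        shadow = next;              /* remember what next should be */

        if (visit != NULL)
            visit(cur);             /* any call that might clobber EBX */
        check_next(next, &shadow);  /* repeat after each suspect call */

        pthread_mutex_unlock(&cur->mutex);
        visited++;
    }
    return visited;
}
```

Built with -DSTOP_ON_CLOBBER, the process stops itself at the first
mismatch, which should point at the last call made before the check.
Without it, the sketch only counts mismatches and would still crash on
the next iteration just like the original loop. Note also that the extra
calls may change register allocation, so next is not guaranteed to stay
in EBX.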