From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gdb-return-11439-listarch-gdb=sources.redhat.com@sources.redhat.com>
Received: (qmail 19909 invoked by alias); 5 Dec 2002 23:14:54 -0000
Mailing-List: contact gdb-help@sources.redhat.com; run by ezmlm
Precedence: bulk
List-Subscribe: <mailto:gdb-subscribe@sources.redhat.com>
List-Archive: <http://sources.redhat.com/ml/gdb/>
List-Post: <mailto:gdb@sources.redhat.com>
List-Help: <mailto:gdb-help@sources.redhat.com>, <http://sources.redhat.com/ml/#faqs>
Sender: gdb-owner@sources.redhat.com
Received: (qmail 19894 invoked from network); 5 Dec 2002 23:14:53 -0000
Received: from unknown (HELO pc2.dolda2000.com) (217.215.27.171)
  by sources.redhat.com with SMTP; 5 Dec 2002 23:14:53 -0000
Received: from [192.168.0.154] ([192.168.0.154])
	by pc2.dolda2000.com (8.11.6/8.11.2) with ESMTP id gB5NEpN20370
	for <gdb@sources.redhat.com>; Fri, 6 Dec 2002 00:14:52 +0100
Subject: Re: Checking function calls
From: Fredrik Tolf <fredrik@dolda2000.cjb.net>
To: gdb@sources.redhat.com
In-Reply-To: <200212050451.gB54ppB31800@duracef.shout.net>
References: <200212050451.gB54ppB31800@duracef.shout.net>
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Date: Thu, 05 Dec 2002 15:14:00 -0000
Message-Id: <1039130097.2343.52.camel@pc7>
Mime-Version: 1.0
X-SW-Source: 2002-12/txt/msg00098.txt.bz2

On Thu, 2002-12-05 at 05:51, Michael Elizabeth Chastain wrote:
> Hi Fredrik,
> 
> I'm throwing out a bunch of ideas here, take whatever looks useful
> and discard the rest.
> 
> > Therefore, the failure has to be that a called
> > function doesn't restore EBX correctly, on rare occasions, right?
> 
> I have seen this happen in a mixed programming environment,
> with a Cygwin program that used a Windows DLL.  The Windows DLL
> had subtly different calling conventions where it did not preserve
> %ebx, %esi, and %edi across function calls.  Perhaps you have some
> kind of third party library in your program which has a similar
> compatibility issue?

The only libraries are libc, libpthread, libdl and libpam. In the
affected function, only libc and libpthread are used. Therefore I don't
think it's calling convention incompatibility, unless they in turn call
functions in third party libraries, which I find very unlikely.

> 
> > My question is thus: Is there any way of debugging this with GDB? Can I
> > make GDB check that EBX is the same before and after every function call
> > from that frame in this thread to isolate the failing function? The
> > frame never exits (until the program exits, that is), if that helps.
> 
> You could set a bunch of conditional breakpoints with "break if %ebx !=
> saved_ebx", where you add code to your program to initialize saved_ebx.
> Or you could say "break if %ebx < 0x1000" or some convenient constant.
> 

That would, of course, be a good thing. It's only that I'd have to do
that after every single function call... That would take some time.
Maybe I'll do it, anyway. I was actually thinking of doing something
like that, but with code instead, and making the thread SIGSTOP itself
when EBX is invalid.
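
A minimal sketch of that in-code check, assuming the clobbered value is
always a small number such as 0x10 to 0x30. The names ptr_looks_valid and
CHECK_PTR are hypothetical, and the 0x1000 bound is just the convenient
constant mentioned above:

```c
#include <signal.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if ptr is NULL or at least 0x1000,
   0 if it is a small non-NULL value that cannot be a real pointer.
   The 0x1000 bound is an assumption, not something the program defines. */
static int ptr_looks_valid(const void *ptr)
{
    uintptr_t v = (uintptr_t)ptr;
    return v == 0 || v >= 0x1000;
}

/* Place after each function call in the affected frame. If the pointer
   looks clobbered, report where and stop so gdb can be attached. */
#define CHECK_PTR(p)                                                  \
    do {                                                              \
        if (!ptr_looks_valid(p)) {                                    \
            fprintf(stderr, "%s:%d: %s clobbered: %p\n",              \
                    __FILE__, __LINE__, #p, (void *)(p));             \
            raise(SIGSTOP);                                           \
        }                                                             \
    } while (0)
```

Writing CHECK_PTR(next); after each call in the loop would then narrow
the clobber down to the call just before the first failing check.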

> You could also try forcing your variable to be on the stack instead of a
> register.  Remove the "register" attribute from the declaration of "next"
> if you have one.  Then add a "do_nothing(&next)" call to your function,
> to force "next" to be on the stack instead of in a register.  If the
> symptoms go away then it's more likely to really be a register clobber.

That just doesn't feel like a very elegant solution, though. And this
bug actually surfaces in another function as well, only even more
rarely. There it also affects a variable stored in EBX, but it gets
set to 0 instead. So I would prefer to actually solve the problem, so
that it doesn't show up anywhere else. I have noticed no similarities
between the two places where the bug shows itself.

> > At first I was expecting that another thread somehow gets there and
> > modifies the storage memory of next.
> 
> I still suspect this.  It's more likely that memory gets clobbered rather
> than a register value.
> 

But next isn't stored in memory at any place, so it cannot be that.

> Perhaps you need a function that locks the whole list and walks it for
> a sanity check, without deleting anything?
> 

I always check the list with gdb when the program crashes, and it's
always correct. That's why I think that it's impossible that next is
loaded when the list is in an unstable state. The only times I actually
set the next element, I always set it to NULL or a pointer returned by
malloc(). If the list were made unstable by a buggy function
somewhere, it would have to be restored again by the same function (since
it's always consistent when I look at it), and I just don't see that
happening.
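
For reference, a rough sketch of what such a whole-list sanity check
could look like. The struct layout, the global list_lock, and the
function name list_check are all assumptions for illustration, and the
check only means something if every writer also takes list_lock:

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical element layout; the real struct may differ. */
struct elem {
    struct elem *next;
    pthread_mutex_t mutex;
};

/* Hypothetical global lock that all list writers would have to take. */
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Walk the list under list_lock without modifying it. Returns the
   number of elements, or -1 if a pointer is implausibly small or the
   walk exceeds max_len elements (which suggests a cycle). */
static long list_check(struct elem *list, long max_len)
{
    long n = 0;

    pthread_mutex_lock(&list_lock);
    for (struct elem *cur = list; cur != NULL; cur = cur->next) {
        /* Validate cur before the loop step dereferences it. */
        if ((uintptr_t)cur < 0x1000 || n >= max_len) {
            n = -1;
            break;
        }
        n++;
    }
    pthread_mutex_unlock(&list_lock);
    return n;
}
```

Calling list_check(list, some_generous_bound) from a separate thread at
intervals would show whether the list is ever inconsistent while the
program is still running, not only after the crash.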

> Here is another wild lead: if, somehow, a block gets freed and then
> you read it, many implementations of malloc keep housekeeping information
> in the first word or two of a freed block.  That would explain why the
> value is always 0x10 to 0x30 (that could be block size, especially if it is
> rounded up to a multiple of 4 or 8) and why only 1-2 words are clobbered.
> If you manage your blocks with malloc/free, you could try turning on any
> malloc debugging facilities that you have.

I also suspected that something like that might happen, and therefore I
lock the elements one element ahead of the block I'm currently looking
at, so that the current block and the next are always locked. That's why
I have:

for(cur = list; cur != NULL; cur = next)
{
    if((next = cur->next) != NULL)
        pthread_mutex_lock(&next->mutex);
    ...
}

That is also a reason why the next variable has to be clobbered at some
later point, since pthread_mutex_lock succeeds on it. The program always
crashes on the line "if((next = cur->next) != NULL)", since it segfaults
when it looks up cur->next, i.e. at that point cur has been set to the
invalid next as directed by the loop. Therefore, when the program crashes,
next and cur are equal, and I cannot see what element it was at before.

By the way, if you want to look at the code, it's available at
http://sourceforge.net/projects/dcprod/. I don't know if it's the latest
version, though.