From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gdb-return-11429-listarch-gdb=sources.redhat.com@sources.redhat.com>
Received: (qmail 26520 invoked by alias); 5 Dec 2002 04:51:59 -0000
Mailing-List: contact gdb-help@sources.redhat.com; run by ezmlm
Precedence: bulk
List-Subscribe: <mailto:gdb-subscribe@sources.redhat.com>
List-Archive: <http://sources.redhat.com/ml/gdb/>
List-Post: <mailto:gdb@sources.redhat.com>
List-Help: <mailto:gdb-help@sources.redhat.com>, <http://sources.redhat.com/ml/#faqs>
Sender: gdb-owner@sources.redhat.com
Received: (qmail 26494 invoked from network); 5 Dec 2002 04:51:58 -0000
Received: from unknown (HELO duracef.shout.net) (204.253.184.12)
  by sources.redhat.com with SMTP; 5 Dec 2002 04:51:58 -0000
Received: (from mec@localhost)
	by duracef.shout.net (8.11.6/8.11.6) id gB54ppB31800;
	Wed, 4 Dec 2002 22:51:51 -0600
Date: Wed, 04 Dec 2002 20:51:00 -0000
From: Michael Elizabeth Chastain <mec@shout.net>
Message-Id: <200212050451.gB54ppB31800@duracef.shout.net>
To: fredrik@dolda2000.cjb.net, gdb@sources.redhat.com
Subject: Re: Checking function calls
X-SW-Source: 2002-12/txt/msg00088.txt.bz2

Hi Fredrik,

I'm throwing out a bunch of ideas here; take whatever looks useful
and discard the rest.

> Therefore, the failure has to be that a called
> function doesn't restore EBX correctly, on rare occasions, right?

I have seen this happen in a mixed programming environment,
with a Cygwin program that used a Windows DLL.  The Windows DLL
had subtly different calling conventions where it did not preserve
%ebx, %esi, and %edi across function calls.  Perhaps you have some
kind of third party library in your program which has a similar
compatibility issue?

> My question is thus: Is there any way of debugging this with GDB? Can I
> make GDB check that EBX is the same before and after every function call
> from that frame in this thread to isolate the failing function? The
> frame never exits (until the program exits, that is), if that helps.

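You could set a bunch of conditional breakpoints with "break LOCATION if
$ebx != saved_ebx", where you add code to your program to initialize
saved_ebx.  (GDB spells registers with a "$" prefix, so it is $ebx in a
condition rather than %ebx.)  Or you could say "break LOCATION if
$ebx < 0x1000" or some convenient constant.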
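
For what it's worth, here is a rough sketch of the "add code to your
program to initialize saved_ebx" part.  The helper name save_ebx and the
breakpoint location in the comment are made up for illustration, and the
inline assembly assumes GCC-style syntax on x86.  The compiler is free to
use %ebx itself inside the helper (for example as the GOT pointer in
32-bit PIC code), so check the disassembly before trusting the value.

```c
#include <stdint.h>

/* Global, so that a GDB breakpoint condition can refer to it by name. */
volatile uint32_t saved_ebx;

/*
 * Hypothetical helper: copy the current value of %ebx into saved_ebx.
 * %ebx is callee-saved, so on entry it should still hold the caller's
 * value.  On anything that is not GCC-compatible x86 this just stores 0.
 */
uint32_t save_ebx(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t value;

    __asm__ volatile ("movl %%ebx, %0" : "=r" (value));
    saved_ebx = value;
#else
    saved_ebx = 0;
#endif
    return saved_ebx;
}

/*
 * Then, in GDB, after each suspect call (the location is a placeholder):
 *
 *     break some_file.c:123 if $ebx != saved_ebx
 */
```

You would call save_ebx() once, right after "next" has been loaded, and
then put one conditional breakpoint after each call you suspect.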

You could also try forcing your variable to be on the stack instead of a
register.  Remove the "register" storage-class specifier from the
declaration of "next" if you have one.  Then add a "do_nothing(&next)"
call to your function; taking the address of "next" forces it to live on
the stack instead of in a register.  If the symptoms go away then it's
more likely to really be a register clobber.  If the symptoms remain then
it's more likely to be a memory clobber (or you have a really sick
low-level function that clobbers random words on the stack, but this does
not feel like it).
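
A minimal sketch of the do_nothing trick, assuming a singly linked list
walked with a "next" variable the way you describe.  The node layout, the
visit function and count_nodes are all invented for the example.  In a
real program do_nothing would best live in a separate source file, so the
optimizer cannot see through it and put "next" back in a register.

```c
#include <stddef.h>

/* Hypothetical node type; the real one will look different. */
struct node {
    struct node *next;
    int value;
};

/* Storing the pointer in a volatile makes the address escape. */
static void *volatile escape_sink;

void do_nothing(void *p)
{
    escape_sink = p;
}

/* Stand-in for the real per-node work (the calls that might clobber). */
static int visited;

void visit(struct node *n)
{
    (void) n;
    visited++;
}

int count_nodes(struct node *head)
{
    struct node *cur;
    struct node *next;          /* no "register" keyword here */
    int count = 0;

    do_nothing(&next);          /* address taken: "next" needs a stack slot */
    for (cur = head; cur != NULL; cur = next) {
        next = cur->next;       /* fetch before the call, as in the original */
        visit(cur);
        count++;
    }
    return count;
}
```

If "next" still goes bad with this change, a clobbered register is a
much less likely explanation.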

> At first I was expecting that another thread somehow gets there and
> modifies the storage memory of next.

I still suspect this.  It's more likely that memory gets clobbered rather
than a register value.

Perhaps you need a function that locks the whole list and walks it for
a sanity check, without deleting anything?
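
Something along these lines, as a sketch only.  It assumes POSIX threads
and a list protected by a single mutex; struct list, list_check and the
max_nodes limit are names I made up.  The 0x1000 threshold is the same
arbitrary "too small to be a real pointer" constant as above.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical list layout: one mutex guarding the whole list. */
struct node {
    struct node *next;
};

struct list {
    pthread_mutex_t lock;
    struct node *head;
};

/*
 * Walk the list under the lock without modifying it.
 * Returns the number of nodes, or -1 if a pointer looks bogus or the
 * list is longer than max_nodes (which would suggest a cycle).
 */
long list_check(struct list *l, long max_nodes)
{
    struct node *p;
    long n = 0;

    pthread_mutex_lock(&l->lock);
    for (p = l->head; p != NULL; p = p->next) {
        /* Check the pointer before it is ever dereferenced. */
        if ((uintptr_t) p < 0x1000 || n >= max_nodes) {
            n = -1;
            break;
        }
        n++;
    }
    pthread_mutex_unlock(&l->lock);
    return n;
}
```

Calling this before and after each suspect operation should narrow down
where the list first goes bad.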

Here is another wild lead: perhaps a block somehow gets freed and you
then read it.  Many implementations of malloc keep housekeeping
information in the first word or two of a freed block.  That would explain
why the value is always 0x10 to 0x30 (that could be a block size,
especially if it is rounded up to a multiple of 4 or 8) and why only 1-2
words are clobbered.
If you manage your blocks with malloc/free, you could try turning on any
malloc debugging facilities that you have.
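
If you have no such facility handy, a crude home-made one is to poison
blocks as they are freed, so that a stale read shows an obvious pattern
instead of a plausible small number.  This is only a sketch: debug_malloc,
debug_size, debug_free and the 0xDD fill byte are my own inventions, not
part of any library, and whether the pattern survives after free() depends
on the malloc implementation.

```c
#include <stddef.h>
#include <stdlib.h>

#define POISON_BYTE 0xDD

/* Size header kept in front of each block, padded for alignment. */
union header {
    size_t size;
    max_align_t align;
};

void *debug_malloc(size_t size)
{
    union header *h = malloc(sizeof *h + size);

    if (h == NULL)
        return NULL;
    h->size = size;
    return h + 1;               /* caller's data starts after the header */
}

size_t debug_size(const void *p)
{
    return ((const union header *) p - 1)->size;
}

void debug_free(void *p)
{
    union header *h;
    volatile unsigned char *bytes = p;
    size_t i;

    if (p == NULL)
        return;
    h = (union header *) p - 1;
    /* Volatile stores, so the compiler does not drop them as dead. */
    for (i = 0; i < h->size; i++)
        bytes[i] = POISON_BYTE;
    free(h);
}
```

With this in place, a "next" pointer that reads as 0xdddddddd points
fairly strongly at a use-after-free.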

Hope this helps,

Michael C