Mirror of the gdb mailing list
* Checking function calls
@ 2002-12-04 18:02 Fredrik Tolf
From: Fredrik Tolf @ 2002-12-04 18:02 UTC
  To: gdb

I'm having a strange problem in a program that I'm writing (in C). The
background is essentially as follows:
The program is multithreaded, and in one thread I'm looping through a
linked list, and because elements may be freed inside the loop, I have
an extra variable to hold a pointer to the next element. I only use this
variable three times in total, like this:

for(cur = list; cur != NULL; cur = next)
{
    if((next = cur->next) != NULL)
        pthread_mutex_lock(&next->mutex);
    ... /* next is not mentioned anymore */
}

There is a bug, which happens extremely rarely (the program can run for
days without anything happening), that appears to change the contents of
the next variable, usually to something between 0x10 and 0x30. This, of
course, causes the thread to segfault in the next iteration. At first I
was expecting that another thread somehow gets there and modifies the
storage memory of next. I realized that it was extremely unlikely that
this would happen, especially since it was this variable and nothing
else that was being changed, but I didn't have any other lead. Recently,
I debugged it a little, and found that next is actually being stored in
a register (EBX, to be specific; I'm on an IA32 architecture).
At first I therefore suspected a miscompilation by gcc, but after
checking the assembly output, I ruled out that possibility; EBX was
being used exactly as instructed. It is also impossible that the next
member of the list structure is changed and then loaded into the next
variable. Therefore, the failure has to be that a called
function doesn't restore EBX correctly, on rare occasions, right? If I'm
not completely mistaken, there is no other possibility.

My question is thus: Is there any way of debugging this with GDB? Can I
make GDB check that EBX is the same before and after every function call
from that frame in this thread to isolate the failing function? The
frame never exits (until the program exits, that is), if that helps.

I have been trying to solve this problem for months now, and I would be
eternally grateful if someone helped me do it.

Fredrik Tolf



* Re: Checking function calls
@ 2002-12-04 20:51 Michael Elizabeth Chastain
From: Michael Elizabeth Chastain @ 2002-12-04 20:51 UTC
  To: fredrik, gdb

Hi Fredrik,

I'm throwing out a bunch of ideas here, take whatever looks useful
and discard the rest.

> Therefore, the failure has to be that a called
> function doesn't restore EBX correctly, on rare occasions, right?

I have seen this happen in a mixed programming environment,
with a Cygwin program that used a Windows DLL.  The Windows DLL
had subtly different calling conventions where it did not preserve
%ebx, %esi, and %edi across function calls.  Perhaps you have some
kind of third party library in your program which has a similar
compatibility issue?

> My question is thus: Is there any way of debugging this with GDB? Can I
> make GDB check that EBX is the same before and after every function call
> from that frame in this thread to isolate the failing function? The
> frame never exits (until the program exits, that is), if that helps.

You could set a bunch of conditional breakpoints with "break <location>
if $ebx != saved_ebx", where you add code to your program to initialize
saved_ebx.  Or you could say "break <location> if $ebx < 0x1000" or some
convenient constant.  (GDB spells registers with a "$" prefix.)
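
A minimal sketch of the saved_ebx idea, assuming the loop from the
original post. The node type, the walk() function, and the placement of
the store are hypothetical; only the name saved_ebx comes from the
suggestion above. The program records what "next" is supposed to hold,
so a breakpoint condition (or the in-program assert) can compare it
with the live value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node type standing in for the real list element. */
struct node {
    struct node *next;
};

/* Global that a GDB breakpoint condition can read.  volatile keeps
 * the compiler from optimizing the stores and loads away. */
volatile uintptr_t saved_ebx;

/* Walk the list like the original loop, recording the expected value
 * of "next" on every iteration.  Returns the number of nodes seen. */
size_t walk(struct node *list)
{
    struct node *cur, *next;
    size_t count = 0;

    for (cur = list; cur != NULL; cur = next) {
        next = cur->next;
        saved_ebx = (uintptr_t)next;    /* expected contents of "next" */
        count++;

        /* ... the calls to the suspect functions would go here ... */

        /* In-program check: fires if "next" no longer matches. */
        assert((uintptr_t)next == saved_ebx);
    }
    return count;
}
```

With something like this in place, a condition such as
"break <location> if $ebx != saved_ebx" can be set on lines inside the
loop body. That only makes sense if the compiler really keeps "next" in
EBX there, which has to be confirmed from the assembly output.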

You could also try forcing your variable to be on the stack instead of
in a register.  Remove the "register" keyword from the declaration of
"next" if you have one.  Then add a "do_nothing(&next)" call to your
function, so that the compiler has to give "next" a stack slot.  If the
symptoms go away then it's more likely to really be a register clobber.
If the symptoms remain then it's more likely to be a memory clobber
(or you have a really sick low-level function that clobbers random words
on the stack but this does not feel like it).
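
One way the do_nothing() trick could look, as a sketch under
assumptions: the node type and walk_on_stack() are hypothetical, and
the noinline attribute and the empty asm statement are gcc extensions
used here so that the compiler cannot see through the call. Because the
address of "next" escapes, gcc will typically keep the variable in
memory across calls to functions it cannot analyze.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node type standing in for the real list element. */
struct node {
    struct node *next;
};

/* Does nothing with the pointer, but the compiler cannot prove that:
 * noinline blocks inlining, and the empty asm with a "memory" clobber
 * stops gcc from treating the function as free of side effects. */
__attribute__((noinline)) void do_nothing(void *p)
{
    assert(p != NULL);                  /* catch accidental misuse */
    __asm__ volatile("" : : "r"(p) : "memory");
}

/* Same loop shape as in the original post.  Returns the node count. */
size_t walk_on_stack(struct node *list)
{
    struct node *cur, *next;
    size_t count = 0;

    do_nothing(&next);  /* address escapes: "next" needs a stack slot */

    for (cur = list; cur != NULL; cur = next) {
        next = cur->next;
        count++;
        /* ... rest of the loop body ... */
    }
    return count;
}
```

Whether "next" really ends up on the stack should be checked in the
assembly output, since that is the compiler's decision.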

> At first I was expecting that another thread somehow gets there and
> modifies the storage memory of next.

I still suspect this.  It's more likely that memory gets clobbered rather
than a register value.

Perhaps you need a function that locks the whole list and walks it for
a sanity check, without deleting anything?
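
Such a checker could be sketched as follows. Everything here is an
assumption for illustration: the struct layout, the single list-wide
mutex, the 0x1000 threshold for "implausibly small pointer" (the first
page is normally unmapped on GNU/Linux), and the max_nodes bound used
to stop on a cycle.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical list layout: one mutex guarding the whole list. */
struct node {
    struct node *next;
};

struct checked_list {
    pthread_mutex_t lock;
    struct node *head;
};

/* Walk the list under the lock without modifying it.  Returns the
 * number of nodes, or -1 if a link looks bogus (a tiny non-NULL
 * pointer such as 0x10..0x30) or the walk exceeds max_nodes, which
 * would suggest a cycle. */
long list_sanity_check(struct checked_list *l, long max_nodes)
{
    long count = 0;
    struct node *cur;

    assert(l != NULL);
    pthread_mutex_lock(&l->lock);
    for (cur = l->head; cur != NULL; cur = cur->next) {
        /* Test the pointer before it is ever dereferenced. */
        if ((uintptr_t)cur < 0x1000 || count >= max_nodes) {
            count = -1;
            break;
        }
        count++;
    }
    pthread_mutex_unlock(&l->lock);
    return count;
}
```

Calling it at a few points in the program narrows down when the list
first goes bad. It is only meaningful if every writer takes the same
lock.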

Here is another wild lead: if, somehow, a block gets freed and then
you read it, many implementations of malloc keep housekeeping information
in the first word or two of a freed block.  That would explain why the
value is always 0x10 to 0x30 (that could be block size, especially if it is
rounded up to a multiple of 4 or 8) and why only 1-2 words are clobbered.
If you manage your blocks with malloc/free, you could try turning on any
malloc debugging facilities that you have.

Hope this helps,

Michael C


[parent not found: <200212052240.gB5Mefm16249@duracef.shout.net>]
* Re: Checking function calls
@ 2002-12-06  9:08 Michael Elizabeth Chastain
From: Michael Elizabeth Chastain @ 2002-12-06  9:08 UTC
  To: fredrik; +Cc: gdb

Hi Fredrik,

> It is a GNU/Linux platform, and, yes, I am using gcc.

Well, that wraps up that line of inquiry.

> I know, I didn't plan ahead good enough when I started writing it, and
> now I'm stuck with either this, or a large rewrite.

When I run into this kind of problem, I like to step back -- way back --
get away from computers for a day or two and think about it.

I think there is no easy way out, that you actually are stuck with a
large rewrite.  There are just too many pthread_mutex_lock's flying
around.

For instance:

  client.c:findtransfer() does not have any locks.

  In client.c:freesharecache(), there is code:

    if (cache->parent != NULL)
    {
      pthread_mutex_lock(&cache->parent->mutex);
      ...
    }

  In general, it's unsafe to test a member and then acquire the lock,
  because someone else can delete cache->parent between the "if" statement
  and the acquisition of the lock.

  In client.c:clientmain():

    for(cur = transfers; cur != NULL; cur = next)
    {
      pthread_mutex_lock(&cur->mutex);
      next = cur->next;
      ...
    }

    between the execution of "cur = transfers" and "cur != NULL",
    the first item of the list can be deleted.
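
The test-then-lock problem can be avoided by holding an outer lock
while the pointer is tested and the inner lock is taken. The sketch
below is not the code from the package: the struct fields, tree_lock,
and detach_from_parent() are hypothetical, and it assumes that every
thread that deletes a node or changes a parent pointer holds tree_lock
while doing so.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the share cache node. */
struct cache {
    struct cache *parent;
    pthread_mutex_t mutex;
    int nchildren;
};

/* Outer lock that guards all parent pointers and node lifetimes. */
static pthread_mutex_t tree_lock = PTHREAD_MUTEX_INITIALIZER;

/* Detach a node from its parent.  Returns 1 if there was a parent,
 * 0 otherwise.  The parent cannot vanish between the NULL test and
 * the lock because deleters must hold tree_lock as well. */
int detach_from_parent(struct cache *cache)
{
    int had_parent = 0;

    assert(cache != NULL);
    pthread_mutex_lock(&tree_lock);
    if (cache->parent != NULL) {
        pthread_mutex_lock(&cache->parent->mutex);
        cache->parent->nchildren--;
        pthread_mutex_unlock(&cache->parent->mutex);
        cache->parent = NULL;
        had_parent = 1;
    }
    pthread_mutex_unlock(&tree_lock);
    return had_parent;
}
```

The same reasoning applies to the list walk: "cur = transfers" and the
"cur != NULL" test have to happen under a lock that deleters also take.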

I recommend finding a textbook on multi-threaded programming that covers
"how to write thread-safe lists".  From your package, it looks like
you are in it to learn, so you could step way back from the code and
learn some theory at this point.

Another alternative is to use one big mutex for the whole list.
Then the primitive operations become:

  add item to list
    lock the whole list
    add the item
    unlock the whole list

  delete item from list
    lock the whole list
    delete the item
    unlock the whole list

  iterate over the list
    lock the whole list
    iterate over all the items
    unlock the whole list
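
The three primitives above could be sketched like this. The item type,
the function names, and the integer id used as a key are assumptions
made for illustration, not taken from the package.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct item {
    int id;
    struct item *next;
};

struct list {
    pthread_mutex_t lock;       /* one big mutex for the whole list */
    struct item *head;
};

void list_init(struct list *l)
{
    assert(l != NULL);
    pthread_mutex_init(&l->lock, NULL);
    l->head = NULL;
}

/* Add an item at the head.  Returns 0 on success, -1 if out of memory. */
int list_add(struct list *l, int id)
{
    struct item *it = malloc(sizeof *it);

    if (it == NULL)
        return -1;
    it->id = id;
    pthread_mutex_lock(&l->lock);
    it->next = l->head;
    l->head = it;
    pthread_mutex_unlock(&l->lock);
    return 0;
}

/* Delete the first item with the given id.  Returns 1 if found. */
int list_delete(struct list *l, int id)
{
    struct item **pp, *victim = NULL;
    int found = 0;

    pthread_mutex_lock(&l->lock);
    for (pp = &l->head; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->id == id) {
            victim = *pp;
            *pp = victim->next;     /* unlink while holding the lock */
            found = 1;
            break;
        }
    }
    pthread_mutex_unlock(&l->lock);
    free(victim);                   /* free(NULL) is a no-op */
    return found;
}

/* Call fn on every item with the list locked.  Returns the item count.
 * fn must not call list_add() or list_delete(): the mutex is not
 * recursive, so that would deadlock. */
int list_foreach(struct list *l, void (*fn)(struct item *, void *),
                 void *arg)
{
    struct item *cur;
    int count = 0;

    pthread_mutex_lock(&l->lock);
    for (cur = l->head; cur != NULL; cur = cur->next) {
        fn(cur, arg);
        count++;
    }
    pthread_mutex_unlock(&l->lock);
    return count;
}

/* Example callback: add each id into the int that arg points to. */
void sum_ids(struct item *it, void *arg)
{
    *(int *)arg += it->id;
}
```

No "next" bookkeeping is needed in the walker, because nothing can be
unlinked while list_foreach() holds the lock.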

The drawback is that walking the list locks the whole list against
addition and deletion.  If your list walker is just "print status
information" then that is fine.  If your list walker does some
long-lived network operation at each node then it is not fine.

Michael C




Thread overview: 6+ messages
2002-12-04 18:02 Checking function calls Fredrik Tolf
2002-12-04 20:51 Michael Elizabeth Chastain
2002-12-05 15:14 ` Fredrik Tolf
     [not found] <200212052240.gB5Mefm16249@duracef.shout.net>
2002-12-06  8:24 ` Fredrik Tolf
2002-12-06  9:08 Michael Elizabeth Chastain
2002-12-06 11:31 ` Fredrik Tolf
