Checking function calls

Mirror of the gdb mailing list

* Checking function calls
@ 2002-12-04 18:02 Fredrik Tolf
  0 siblings, 0 replies; 6+ messages in thread
From: Fredrik Tolf @ 2002-12-04 18:02 UTC (permalink / raw)
  To: gdb

I'm having a strange problem in a program that I'm writing (in C). The
background is essentially as follows:
The program is multithreaded, and in one thread I'm looping through a
linked list, and because elements may be freed inside the loop, I have
an extra variable to hold a pointer to the next element. I only use this
variable three times in total, like this:

for(cur = list; cur != NULL; cur = next)
{
    if((next = cur->next) != NULL)
        pthread_mutex_lock(&next->mutex);
    ... /* next is not mentioned anymore */
}

There is a bug, which happens extremely rarely (the program can go for
days without anything happening), that appears to change the content of the
next variable, usually to something between 0x10 and 0x30. This, of
course, causes the thread to segfault in the next iteration. At first I
was expecting that another thread somehow gets there and modifies the
storage memory of next. I realized that it was extremely unlikely that
this would happen, especially since it was this variable and nothing
else that was being changed, but I didn't have any other lead. Recently,
I debugged it a little, and found that next is actually being stored in
a register (EBX, more specifically; I'm using an IA32 architecture).
At first I therefore suspected a miscompilation by gcc, but after
checking the assembly output, I ruled out that possibility; EBX was
being used exactly as instructed. The possibility that the next element
of the list structure is changed and then loaded into the next variable
is also impossible. Therefore, the failure has to be that a called
function doesn't restore EBX correctly, on rare occasions, right? If I'm
not completely mistaken, there is no other possibility.

My question is thus: Is there any way of debugging this with GDB? Can I
make GDB check that EBX is the same before and after every function call
from that frame in this thread to isolate the failing function? The
frame never exits (until the program exits, that is), if that helps.

I have been trying to solve this problem for months now, and I would be
eternally grateful if someone helped me do it.

Fredrik Tolf


* Re: Checking function calls
@ 2002-12-04 20:51 Michael Elizabeth Chastain
  2002-12-05 15:14 ` Fredrik Tolf
  0 siblings, 1 reply; 6+ messages in thread
From: Michael Elizabeth Chastain @ 2002-12-04 20:51 UTC (permalink / raw)
  To: fredrik, gdb

Hi Fredrik,

I'm throwing out a bunch of ideas here, take whatever looks useful
and discard the rest.

> Therefore, the failure has to be that a called
> function doesn't restore EBX correctly, on rare occasions, right?

I have seen this happen in a mixed programming environment,
with a Cygwin program that used a Windows DLL.  The Windows DLL
had subtly different calling conventions where it did not preserve
%ebx, %esi, and %edi across function calls.  Perhaps you have some
kind of third party library in your program which has a similar
compatibility issue?

> My question is thus: Is there any way of debugging this with GDB? Can I
> make GDB check that EBX is the same before and after every function call
> from that frame in this thread to isolate the failing function? The
> frame never exits (until the program exits, that is), if that helps.

You could set a bunch of conditional breakpoints with "break if $ebx !=
saved_ebx", where you add code to your program to initialize saved_ebx.
Or you could say "break if $ebx < 0x1000" or some convenient constant.

You could also try forcing your variable to be on the stack instead of a
register.  Remove the "register" attribute from the declaration of "next"
if you have one.  Then add a "do_nothing(&next)" call to your function,
to force "next" to be on the stack instead of in a register.  If the
symptoms go away then it's more likely to really be a register clobber.
If the symptoms remain then it's more likely to be a memory clobber
(or you have a really sick low-level function that clobbers random words
on the stack but this does not feel like it).
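
A minimal sketch of that diagnostic, assuming gcc. The node layout, the
walk() function and the body of do_nothing() are hypothetical and only
illustrate the idea. Once the address of "next" has been passed to a
function the compiler cannot see through, "next" needs a stack slot, and
the compiler has to store it before and reload it after any opaque call
made inside the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list element, for illustration only. */
struct node {
    struct node *next;
    int value;
};

/* Hypothetical helper. It is not inlined and it contains an empty asm
 * statement with a "memory" clobber, so gcc must assume that the object
 * behind p can be read or written here. */
__attribute__((noinline)) void do_nothing(void *p)
{
    __asm__ __volatile__("" : : "r"(p) : "memory");
}

int walk(struct node *list)
{
    struct node *cur, *next;
    int sum = 0;

    do_nothing(&next);          /* the address of "next" escapes here */
    for (cur = list; cur != NULL; cur = next) {
        next = cur->next;
        sum += cur->value;      /* stands in for the real loop body */
    }
    assert(cur == NULL);        /* the loop only ends at the list end */
    return sum;
}
```

Declaring the variable as "struct node *volatile next" would be another
way to keep it in memory, at the cost of a load on every access.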

> At first I was expecting that another thread somehow gets there and
> modifies the storage memory of next.

I still suspect this.  It's more likely that memory gets clobbered rather
than a register value.

Perhaps you need a function that locks the whole list and walks it for
a sanity check, without deleting anything?

Here is another wild lead: if, somehow, a block gets freed and then
you read it, many implementations of malloc keep housekeeping information
in the first word or two of a freed block.  That would explain why the
value is always 0x10 to 0x30 (that could be block size, especially if it is
rounded up to a multiple of 4 or 8) and why only 1-2 words are clobbered.
If you manage your blocks with malloc/free, you could try turning on any
malloc debugging facilities that you have.

Hope this helps,

Michael C


* Re: Checking function calls
  2002-12-04 20:51 Michael Elizabeth Chastain
@ 2002-12-05 15:14 ` Fredrik Tolf
  0 siblings, 0 replies; 6+ messages in thread
From: Fredrik Tolf @ 2002-12-05 15:14 UTC (permalink / raw)
  To: gdb

On Thu, 2002-12-05 at 05:51, Michael Elizabeth Chastain wrote:
> Hi Fredrik,
> 
> I'm throwing out a bunch of ideas here, take whatever looks useful
> and discard the rest.
> 
> > Therefore, the failure has to be that a called
> > function doesn't restore EBX correctly, on rare occasions, right?
> 
> I have seen this happen in a mixed programming environment,
> with a Cygwin program that used a Windows DLL.  The Windows DLL
> had subtly different calling conventions where it did not preserve
> %ebx, %esi, and %edi across function calls.  Perhaps you have some
> kind of third party library in your program which has a similar
> compatibility issue?

The only libraries are libc, libpthread, libdl and libpam. In the
affected function, only libc and libpthread are used. Therefore I don't
think it's calling convention incompatibility, unless they in turn call
functions in third party libraries, which I find very unlikely.

> 
> > My question is thus: Is there any way of debugging this with GDB? Can I
> > make GDB check that EBX is the same before and after every function call
> > from that frame in this thread to isolate the failing function? The
> > frame never exits (until the program exits, that is), if that helps.
> 
> You could set a bunch of conditional breakpoints with "break if $ebx !=
> saved_ebx", where you add code to your program to initialize saved_ebx.
> Or you could say "break if $ebx < 0x1000" or some convenient constant.
> 

That would, of course, be a good thing. It's only that I'd have to do
that after every single function call... That would take some time.
Maybe I'll do it, anyway. I was actually thinking of doing something
like that, but with code instead, and making the thread SIGSTOP itself
when EBX is invalid.
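
A rough sketch of what such an in-code check could look like. All names
here (next_is_intact, CHECK_NEXT, walk_checked) are hypothetical, and the
saved copy is declared volatile on the assumption that this keeps it in
memory rather than in a second register. SIGSTOP stops the whole process,
which leaves time to attach gdb and look at the state.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>

/* Hypothetical list element, for illustration only. */
struct node {
    struct node *next;
};

/* Hypothetical helper: 1 if "next" still equals the copy kept in memory. */
static int next_is_intact(struct node *next, struct node *volatile *saved)
{
    assert(saved != NULL);
    return next == *saved;
}

/* Hypothetical macro, meant to be placed after each call in the loop. */
#define CHECK_NEXT(next, saved, where)                              \
    do {                                                            \
        if (!next_is_intact((next), (saved))) {                     \
            fprintf(stderr, "next changed after %s\n", (where));    \
            raise(SIGSTOP);   /* stop here so gdb can attach */     \
        }                                                           \
    } while (0)

/* Usage sketch: returns the number of nodes visited. */
int walk_checked(struct node *list)
{
    struct node *cur, *next;
    struct node *volatile saved;   /* always re-read from memory */
    int n = 0;

    for (cur = list; cur != NULL; cur = next) {
        next = saved = cur->next;
        n++;                       /* stands in for a real function call */
        CHECK_NEXT(next, &saved, "loop body");
    }
    return n;
}
```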

> You could also try forcing your variable to be on the stack instead of a
> register.  Remove the "register" attribute from the declaration of "next"
> if you have one.  Then add a "do_nothing(&next)" call to your function,
> to force "next" to be on the stack instead of in a register.  If the
> symptoms go away then it's more likely to really be a register clobber.

That just doesn't feel like a very elegant solution, though. And, this
bug does actually surface in another function as well, only even more
rarely. There it also affects a variable stored in EBX, but it gets
set to 0 instead. So, I would prefer actually solving the problem, so
that it doesn't show up anywhere else. I have noticed no similarities
between the two places where the bug shows itself.

> > At first I was expecting that another thread somehow gets there and
> > modifies the storage memory of next.
> 
> I still suspect this.  It's more likely that memory gets clobbered rather
> than a register value.
> 

But next isn't stored in memory at any place, so it cannot be that.

> Perhaps you need a function that locks the whole list and walks it for
> a sanity check, without deleting anything?
> 

I always check the list with gdb when the program crashes, and it's
always correct. That's why I think that it's impossible that next is
loaded when the list is in an unstable state. The only times I actually
set the next element, I always set it to NULL or a pointer returned by
malloc(). If the list were to be made unstable by a buggy function
somewhere, it would have to be restored again by the same function (since
it's always consistent when I look at it), and I just don't see that
happening.

> Here is another wild lead: if, somehow, a block gets freed and then
> you read it, many implementations of malloc keep housekeeping information
> in the first word or two of a freed block.  That would explain why the
> value is always 0x10 to 0x30 (that could be block size, especially if it is
> rounded up to a multiple of 4 or 8) and why only 1-2 words are clobbered.
> If you manage your blocks with malloc/free, you could try turning on any
> malloc debugging facilities that you have.

I also suspected that something like that might happen, and therefore I
lock the elements one element ahead of the block I'm currently looking
at, so that the current block and the next are always locked. That's why
I have:

for(cur = list; cur != NULL; cur = next)
{
    if((next = cur->next) != NULL)
        pthread_mutex_lock(&next->mutex);
    ...
}

That is also a reason why the next variable has to be clobbered at some
later point, since pthread_mutex_lock succeeds on it. The program always
crashes on the line "if((next = cur->next) != NULL)", since it segfaults when it
looks up cur->next, i.e. at that point cur has been set to the invalid
next as directed by the loop. Therefore, when the program crashes, next
and cur are equal, and I cannot see what element it was at before.
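
For reference, the locking scheme described above can be sketched as
follows. This is a simplified, hypothetical version, not the actual code:
the node layout, the function name, the lock taken on the first element
before the loop, and the unlock at the end of each iteration are all
assumptions made for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical list element with one mutex per node. */
struct node {
    pthread_mutex_t mutex;
    struct node *next;
    int value;
};

/* Walks the list holding the lock on the current node and, while the
 * body runs, also the lock on the next node. Note that the read of
 * "list" itself is not protected by any lock in this sketch. */
int walk_coupled(struct node *list)
{
    struct node *cur, *next;
    int rc, sum = 0;

    if (list != NULL) {
        rc = pthread_mutex_lock(&list->mutex);
        assert(rc == 0);
    }
    for (cur = list; cur != NULL; cur = next) {
        if ((next = cur->next) != NULL) {
            rc = pthread_mutex_lock(&next->mutex);
            assert(rc == 0);
        }
        sum += cur->value;                  /* the real loop body */
        pthread_mutex_unlock(&cur->mutex);  /* done with this node */
    }
    return sum;
}
```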

By the way, if you want to look at the code, it's available at
http://sourceforge.net/projects/dcprod/. I don't know if it's the latest
version, though.


* Re: Checking function calls
       [not found] <200212052240.gB5Mefm16249@duracef.shout.net>
@ 2002-12-06  8:24 ` Fredrik Tolf
  0 siblings, 0 replies; 6+ messages in thread
From: Fredrik Tolf @ 2002-12-06  8:24 UTC (permalink / raw)
  To: Michael Elizabeth Chastain; +Cc: gdb

On Thu, 2002-12-05 at 23:40, Michael Elizabeth Chastain wrote:
> Hi Fredrik,
> 
> > The only libraries are libc, libpthread, libdl and libpam. In the
> > affected function, only libc and libpthread are used.
> 
> What operating environment are you running on?  If it is a Linux
> platform, and gcc is the only compiler anywhere in sight,
> then it's likely not an ABI clash.  If it's non-Linux Unix,
> this becomes slightly likely.  If it is Cygwin/Windows then
> it's a common gotcha.
> 

It is a GNU/Linux platform, and, yes, I am using gcc.

> > That would, of course, be a good thing. It's only that I'd have to do
> > that after every single function call... That would take some time.
> > Maybe I'll do it, anyway.
> 
> Yes.

I have added checks where I compare the current value of next to a saved
buffer after every function call now. I am currently testing with it.

> 
> mec> You could also try forcing your variable to be on the stack instead of a
> mec> register.  Remove the "register" attribute from the declaration of "next"
> mec> if you have one.  Then add a "do_nothing(&next)" call to your function,
> mec> to force "next" to be on the stack instead of in a register.  If the
> mec> symptoms go away then it's more likely to really be a register clobber.
> 
> > That just doesn't feel like a very elegant solution, though.
> 
> Oh, it's not meant to be a solution, it's meant to be a diagnostic tool
> to help figure out the problem.
> 

True, of course. I just don't really understand where it would lead.

> > But next isn't stored in memory at any place, so it cannot be that.
> 
> 'next' is initialized from a memory location though, and you have no
> check that it is valid when it is first initialized.  Actually that
> would be a good check to add.
> 

That's true of course. I have, however, already added such checks
recently.

> > If the list was to be made unstable by a buggy function somewhere, it
> > would have to be restored again by the same function (since it's always
> > consistent when I look at it), and I just don't see that happening.
> 
> Mmmm, that is not true!
> 
> Let us stare at your source code a bit:
> 
>   /* 1 */ for(cur = list; cur != NULL; cur = next)
>   /* 2 */ {
>   /* 3 */      if((next = cur->next) != NULL)
>   /* 4 */          pthread_mutex_lock(&next->mutex);
>   /* 5 */      ... /* next is not mentioned anymore */
>   /* 6 */ }
> 
> Suppose that you have two threads, T1 and T2, and three blocks
> on the list, B0, B1, B2.
> 
>   T1 executes [1], "cur = list", so "cur" holds the address of B0.
>   T1 executes [3], "next = cur->next", so "next" holds the address of B1.
>   T2 is scheduled -- and T1 is holding no mutexes!

Sorry that I didn't mention it, but just above the loop, I actually do
have

if(list != NULL)
    pthread_mutex_lock(&list->mutex);

> > I also suspected that something like that might happen, and therefore I
> > lock the elements one element ahead of the block I'm currently looking
> > at, so that the current block and the next are always locked.
> 
> Err, okay, I see that in the source code.  So in my scenario,
> T1 has a lock on B0, so that T2 cannot delete B0->next.
> 
> Foo.

Exactly. Sorry, again, that I didn't write that.

> 
> But I see so many lock's and unlock's in the code that I suspect it is
> a race condition in your code rather than a code generation bug or a
> pthread library bug.  It could still be a scenario where the list
> pointers are okay, but "next" has become a block which is deleted
> from the list somehow.
> 

I know, I didn't plan ahead well enough when I started writing it, and
now I'm stuck with either this, or a large rewrite.

> That still leaves the question of how to debug it.
> 
> I would actually start with a book on multi-threaded linked lists,
> and then find a library (or code a library) that implements them,
> and use that.  If you have a separate library then you can write some
> stress test code and provoke failures a lot faster.
> 

I would like to do that, and I have been thinking about it for a while,
but see above.

> > Therefore, when the program crashes, next and cur are equal,
> > and I cannot see what element it was at before.
> 
> Mmmm, throw in a "prev" variable, so that you say "prev = cur, cur = next"
> and then "prev" is available for debugging.

I've been thinking about that, too. Maybe I should just do that.
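
Something along these lines, as a sketch only. The node layout and the
walk_with_prev() name are made up for the example, and "prev" is handed
back through a pointer just so that the compiler cannot drop it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list element, for illustration only. */
struct node {
    struct node *next;
    int id;
};

/* Walks the list and returns the number of nodes visited. *last_seen
 * receives the address of the last node visited. If that node was freed
 * inside the loop, the address must not be dereferenced, but it still
 * tells which element the walk had reached. */
int walk_with_prev(struct node *list, struct node **last_seen)
{
    struct node *prev = NULL, *cur, *next;
    int n = 0;

    assert(last_seen != NULL);
    for (cur = list; cur != NULL; prev = cur, cur = next) {
        next = cur->next;
        n++;                /* stands in for the real loop body */
    }
    *last_seen = prev;
    return n;
}
```

With this in place, after a crash on "cur->next" the debugger can print
"prev" to see which element came before the invalid pointer.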




* Re: Checking function calls
@ 2002-12-06  9:08 Michael Elizabeth Chastain
  2002-12-06 11:31 ` Fredrik Tolf
  0 siblings, 1 reply; 6+ messages in thread
From: Michael Elizabeth Chastain @ 2002-12-06  9:08 UTC (permalink / raw)
  To: fredrik; +Cc: gdb

Hi Fredrik,

> It is a GNU/Linux platform, and, yes, I am using gcc.

Well, that wraps up that line of inquiry.

> I know, I didn't plan ahead well enough when I started writing it, and
> now I'm stuck with either this, or a large rewrite.

When I run into this kind of problem, I like to step back -- way back --
get away from computers for a day or two and think about it.

I think there is no easy way out, that you actually are stuck with a
large rewrite.  There are just too many pthread_mutex_lock's flying
around.

For instance:

  client.c:findtransfer() does not have any locks.

  In client.c:freesharecache(), there is code:

    if (cache->parent != NULL)
    {
      pthread_mutex_lock(&cache->parent->mutex);
      ...
    }

  In general, it's unsafe to test a member and then acquire the lock,
  because someone else can delete cache->parent between the "if" statement
  and the acquisition of the lock.

  In client.c:clientmain():

    for(cur = transfers; cur != NULL; cur = next)
    {
      pthread_mutex_lock(&cur->mutex);
      next = cur->next;
      ...
    }

    between the execution of "cur = transfers" and "cur != NULL",
    the first item of the list can be deleted.

I recommend finding a textbook on multi-threaded programming that covers
"how to write thread-safe lists".  From your package, it looks like
you are in it to learn, so you could step way back from the code and
learn some theory at this point.

Another alternative is to use one big mutex for the whole list.
Then the primitive operations become:

  add item to list
    lock the whole list
    add the item
    unlock the whole list

  delete item from list
    lock the whole list
    delete the item
    unlock the whole list

  iterate over the list
    lock the whole list
    iterate over all the items
    unlock the whole list

The drawback is that walking the list locks the whole list against
addition and deletion.  If your list walker is just "print status
information" then that is fine.  If your list walker does some
long-lived network operation at each node then it is not fine.
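
A minimal sketch of the one-big-mutex approach in C with pthreads. The
type and function names (locked_list, list_add, list_delete, list_sum)
are invented for this example and are not taken from your program.
list_sum stands in for whatever the real list walker does.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct item {
    struct item *next;
    int value;
};

/* One mutex protects the head pointer and every item on the list. */
struct locked_list {
    pthread_mutex_t mutex;
    struct item *head;
};

void list_init(struct locked_list *l)
{
    assert(l != NULL);
    pthread_mutex_init(&l->mutex, NULL);
    l->head = NULL;
}

/* Add an item at the head. Returns 0 on success, -1 if malloc fails. */
int list_add(struct locked_list *l, int value)
{
    struct item *it = malloc(sizeof *it);

    if (it == NULL)
        return -1;
    it->value = value;
    pthread_mutex_lock(&l->mutex);
    it->next = l->head;
    l->head = it;
    pthread_mutex_unlock(&l->mutex);
    return 0;
}

/* Delete the first item holding value. Returns 1 if one was found. */
int list_delete(struct locked_list *l, int value)
{
    struct item **pp, *victim = NULL;
    int found;

    pthread_mutex_lock(&l->mutex);
    for (pp = &l->head; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->value == value) {
            victim = *pp;
            *pp = victim->next;     /* unlink while holding the lock */
            break;
        }
    }
    pthread_mutex_unlock(&l->mutex);
    found = (victim != NULL);
    free(victim);                   /* no longer reachable from the list */
    return found;
}

/* Iterate over the whole list with the lock held. */
int list_sum(struct locked_list *l)
{
    struct item *it;
    int sum = 0;

    pthread_mutex_lock(&l->mutex);
    for (it = l->head; it != NULL; it = it->next)
        sum += it->value;
    pthread_mutex_unlock(&l->mutex);
    return sum;
}
```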

Michael C


* Re: Checking function calls
  2002-12-06  9:08 Michael Elizabeth Chastain
@ 2002-12-06 11:31 ` Fredrik Tolf
  0 siblings, 0 replies; 6+ messages in thread
From: Fredrik Tolf @ 2002-12-06 11:31 UTC (permalink / raw)
  To: Michael Elizabeth Chastain; +Cc: gdb

On Fri, 2002-12-06 at 18:08, Michael Elizabeth Chastain wrote:
> > I know, I didn't plan ahead well enough when I started writing it, and
> > now I'm stuck with either this, or a large rewrite.
> 
> When I run into this kind of problem, I like to step back -- way back --
> get away from computers for a day or two and think about it.
> 
> I think there is no easy way out, that you actually are stuck with a
> large rewrite.  There are just too many pthread_mutex_lock's flying
> around.

I'm beginning to believe that, too. Maybe I have just been too
optimistic.

> 
> For instance:
> 
>   client.c:findtransfer() does not have any locks.
> 
>   in client.c:freesharecache(), there is code:
> 
>     if (cache->parent != NULL)
>     {
>       pthread_mutex_lock(&cache->parent->mutex);
>       ...
>     }
> 
>   in general, it's unsafe to test a member and then acquire the lock,
>   because someone else can delete cache->parent between the "if" statement
>   and the acquisition of the lock.
> 

Here, however, that isn't possible, since all deletions from that list
go via the freesharecache function, and a deletion of the parent also
loops through, locks, and deletes all the children, and since one of the
children apparently is locked, it won't go any further. I suspect it
might deadlock it, though.

> I recommend finding a textbook on multi-threaded programming that covers
> "how to write thread-safe lists".  From your package, it looks like
> you are in it to learn, so you could step way back from the code and
> learn some theory at this point.

Yeah, when I began writing this program, I did not have much experience
in multithreading. That's the reason that there are far too few mutexes
in the program.
Still, I don't think that's the reason for this bug. The loop in which
it crashes is quite thread-safe.

> Another alternative is to use one big mutex for the whole list.

That is precisely what I have been wanting to implement for a long time.
It's only that it would require an enormous rewrite to implement
everywhere that it should be used.

> The drawback is that walking the list locks the whole list against
> addition and deletion.  If your list walker is just "print status
> information" then that is fine.  If your list walker does some
> long-lived network operation at each node then it is not fine.

I have, however, made sure that doesn't happen by only using nonblocking
I/O.

Once again, though, I don't think that thread-unsafeness is the reason
for this bug to happen. But I've added checks to that loop now, so I
should discover it sooner or later. Thank you very much for all your
help.


