Function fingerprinting for useful backtraces in absence of debuginfo

* Function fingerprinting for useful backtraces in absence of debuginfo
@ 2011-09-15 12:31 Martin Milata
  2011-09-15 17:49 ` Jan Kratochvil
From: Martin Milata @ 2011-09-15 12:31 UTC
  To: Jan Kratochvil; +Cc: Tom Tromey, gdb, Karel Klic

Hi Jan,

Karel probably told you about this, but since more people are CC'd, I'll
add a brief introduction.

In ABRT [1], we would like to be able to check if two coredumps are from
the same bug in source code without using debuginfo. We have an idea how
to do this which involves computing some kind of fingerprint from
assembly of a function. Now we need someone who has good insight into
compilation and assembly in general to take a look at it and tell us
what he thinks. More detailed description is below.

Thanks for your time,
Martin

[1] https://fedorahosted.org/abrt/wiki

The problem
-----------

How would you check if two coredumps are from the same bug in source
code, but without using debuginfo?

In ABRT, we are working on coredump duplicate detection that is run at
the time of a crash. We want to avoid filling users' harddrives with
unnecessary coredumps from repeated crashes. At crash time, program
binaries are available, but debuginfo packages are not. Duplicate
coredumps should be detected even when the binary or shared library in
use has been updated to a newer version (i.e. patched and recompiled),
and when the package has been rebuilt with a newer gcc.

The approach under consideration is to create a 'canonical backtrace'
from the coredump and its binaries without using the debuginfo. Having
a backtrace is useful as we have good duplicate detection algorithms
for backtraces. So the question is how to generate a solid backtrace
from a coredump. For each stack frame in a given coredump, we can
obtain:

 * The name of the function, if the corresponding binary is compiled
   with function symbols (as is the case with the libraries) together
   with offset into the function.

 * Build ID of the binary together with offset of the instruction
   pointer from the start of the executable segment of the file. This
   should allow us to compare the pointers even if the text segments
   were loaded at different addresses (prelink/aslr).

This means that we can compare two stack frames that belong either to
libraries with function symbols available or to the same build of an
executable (that has a Build ID). We are not able to compare stack frames
from two executables built from slightly different source or with
different compiler options, because the instruction pointer offsets
are different.
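
As an illustration of the second kind of frame identity described above,
here is a minimal sketch of normalizing an instruction pointer to a
(Build ID, offset) pair. The names FrameKey and normalize_frame, the
Build ID string and the addresses are hypothetical; reading the Build ID
and the segment start out of the coredump is assumed to happen elsewhere.

```python
from typing import NamedTuple


class FrameKey(NamedTuple):
    build_id: str  # hex Build ID of the ELF file the frame belongs to
    offset: int    # instruction pointer relative to the executable segment start


def normalize_frame(ip: int, segment_start: int, build_id: str) -> FrameKey:
    """Turn a runtime instruction pointer into a load-address-independent key."""
    if ip < segment_start:
        raise ValueError("instruction pointer lies below the segment start")
    return FrameKey(build_id, ip - segment_start)


# The same build loaded at two different base addresses (e.g. under ASLR)
# yields the same key, so the two frames compare equal.
a = normalize_frame(0x7FFFF7DFC53E, 0x7FFFF7DFC000, "abc123")
b = normalize_frame(0x7F12A4E0153E, 0x7F12A4E01000, "abc123")
assert a == b
```

Two frames from different builds never compare equal under this key,
which is exactly the limitation the paragraph above describes.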

Proposed solution
-----------------

The proposed solution of this problem is to take the instruction
pointer from each stack frame, look at the .eh_frame section of the
corresponding ELF to determine the boundaries of the function it
points to and then compute a fingerprint of this function. Such a
fingerprint should be the same for two sequences of instructions that
were compiled from the same source code (and different for two
different functions).
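
The boundary lookup could be sketched as follows, assuming the address
ranges recorded in .eh_frame have already been parsed into a sorted list
of non-overlapping (start, end) pairs. The parsing itself is not shown,
and find_function is a hypothetical helper name.

```python
import bisect


def find_function(ranges, ip):
    """Return the half-open (start, end) range containing ip, or None.

    ranges must be sorted by start address and must not overlap.
    """
    starts = [start for start, _ in ranges]
    # Index of the last range whose start is <= ip.
    i = bisect.bisect_right(starts, ip) - 1
    if i >= 0 and ranges[i][0] <= ip < ranges[i][1]:
        return ranges[i]
    return None


# Hypothetical function ranges inside one executable segment.
ranges = [(0x400500, 0x400520), (0x400520, 0x400580), (0x400600, 0x400610)]
assert find_function(ranges, 0x40053E) == (0x400520, 0x400580)
assert find_function(ranges, 0x4005F0) is None  # gap between functions
```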

This is obviously not possible in general, but we thought we should be
able to devise something that will work in most of the cases. The
prototype we put together computes the fingerprint as several
properties of the function:

 (Call graph properties)
 * List of the library functions called.
 * Whether the function calls some other functions in the file.
 * Whether the function calls itself.
 (Presence of types of instructions)
 * Conditional jumps based on equality test/signed comparison/unsigned
   comparison.

This way, we are able to get the same fingerprint for something below
90 % of pairs of the same functions from a handful of programs we
tested, with ~3 % probability of two different random functions having
the same fingerprint.
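
To make the listed properties concrete, here is a rough sketch of how
such a fingerprint might be assembled. It is not the ABRT prototype: the
input format (a list of (mnemonic, target) pairs from some disassembler),
the plt_names mapping and the function name fingerprint are assumptions
made for illustration, and only x86 mnemonics are covered.

```python
EQUALITY = {"je", "jne", "jz", "jnz"}
SIGNED = {"jl", "jle", "jg", "jge"}
UNSIGNED = {"ja", "jae", "jb", "jbe"}


def fingerprint(instructions, func_start, plt_names):
    """Compute a tuple of coarse properties for one function.

    instructions: iterable of (mnemonic, target) pairs; target is the
                  branch/call destination address, or None if unknown.
    func_start:   start address of the function being fingerprinted.
    plt_names:    dict mapping .plt stub address -> library function name.
    """
    lib_calls, jump_kinds = set(), set()
    calls_local = calls_self = False
    for mnemonic, target in instructions:
        if mnemonic == "call":
            if target in plt_names:
                lib_calls.add(plt_names[target])
            elif target == func_start:
                calls_self = True
            elif target is not None:
                calls_local = True
        elif mnemonic in EQUALITY:
            jump_kinds.add("equality")
        elif mnemonic in SIGNED:
            jump_kinds.add("signed")
        elif mnemonic in UNSIGNED:
            jump_kinds.add("unsigned")
    return (tuple(sorted(lib_calls)), calls_local, calls_self,
            tuple(sorted(jump_kinds)))


# Hypothetical disassembly of a recursive function that calls malloc.
plt = {0x400400: "malloc"}
insns = [("call", 0x400400), ("jne", 0x400550), ("call", 0x400520),
         ("jl", 0x400560), ("ret", None)]
assert fingerprint(insns, 0x400520, plt) == (
    ("malloc",), False, True, ("equality", "signed"))
```

Sorting the collected sets keeps the result independent of instruction
order, so reshuffled code with the same properties gives the same tuple.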

What we need
------------

Unfortunately, I have pretty much no experience with assembly and have
only a vague knowledge of compiler optimization techniques. The above
fingerprinting scheme is mostly based on trial-and-error and wild
guesses.

So the question is: How to improve this function fingerprinting
scheme? Is there a better approach for coredump duplicate detection?


* Re: Function fingerprinting for useful backtraces in absence of debuginfo
  2011-09-15 12:31 Function fingerprinting for useful backtraces in absence of debuginfo Martin Milata
@ 2011-09-15 17:49 ` Jan Kratochvil
  2011-09-20 13:50   ` Martin Milata
From: Jan Kratochvil @ 2011-09-15 17:49 UTC
  To: Martin Milata; +Cc: Tom Tromey, gdb, Karel Klic

Hi Martin,

I see this was directed more at gcc people, but I hope I can answer some of it.


On Thu, 15 Sep 2011 14:32:31 +0200, Martin Milata wrote:
>  * The name of the function, if the corresponding binary is compiled
>    with function symbols (as is the case with the libraries) together
>    with offset into the function.

This is not true for static functions in the libraries:

==> 26.c <==
extern void f (void (*) (void));
static void i (void) {}
int main (void) { f (i); return 0; }

==> 26l.c <==
static void g (void (*h) (void)) { h (); }
void f (void (*h) (void)) { g (h); }

gcc -o 26l.so 26l.c -Wall -shared -fPIC -s; gcc -o 26 26.c -Wall -g ./26l.so; gdb -nx ./26 -ex 'b i' -ex r -ex bt
#0  i () at 26.c:2
#1  0x00007ffff7dfc53e in ?? () from ./26l.so
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#2  0x00007ffff7dfc558 in f () from ./26l.so
#3  0x00000000004005c8 in main () at 26.c:3

glibc in Fedora packaging is probably the only exception; it has the .symtab
section in the main rpm.  All the other libraries have .symtab only in the
debuginfo rpm.


>  (Call graph properties)
>  * List of the library functions called.

These are the functions called via the .plt section, either from different
libraries or within the same library (unless it uses direct calls, as glibc
does). Hopefully this should not change, I agree.


>  * Whether the function calls some other functions in the file.

Which functions get inlined varies with the optimization level and with
compiler changes; it also changes with gcc -flto.


>  * Whether the function calls itself.
>  (Presence of types of instructions)

Tail call optimizations (call+ret -> jump) change this, so -O0 vs. -O2 code will
definitely be different; but -O2 compilations of slightly different code
should hopefully have the same signature.


>  * Conditional jumps based on equality test/signed comparison/unsigned
>    comparison.

This is the exact target of the gcc -fprofile-* optimizations; AFAIK SuSE uses
them a lot (I had some negative results trying to apply them to gdb packaging).
They invert the jump condition and reshuffle the code around so that in >50% of
cases it does not jump, depending on the "random" benchmark data used during
each build.


> So the question is: How to improve this function fingerprinting
> scheme? Is there a better approach for coredump duplicate detection?

I am a bit skeptical about such function content comparison, but sure, it does
not have to be perfect.


It may soon be cheap enough to run gdbserver on the local core file, once the
recent optimization by Paul Pluzhnikov is finished:
	Re: [patch] Implement qXfer:libraries for Linux/gdbserver
	http://sourceware.org/ml/gdb-patches/2011-08/msg00291.html
But I do not have any benchmark numbers now to support it.


Thanks,
Jan



* Re: Function fingerprinting for useful backtraces in absence of debuginfo
  2011-09-15 17:49 ` Jan Kratochvil
@ 2011-09-20 13:50   ` Martin Milata
  2011-09-20 20:22     ` Jan Kratochvil
From: Martin Milata @ 2011-09-20 13:50 UTC
  To: Jan Kratochvil; +Cc: Tom Tromey, gdb, Karel Klic

On Thu, Sep 15, 2011 at 19:48:31 +0200, Jan Kratochvil wrote:
> Hi Martin,
> 
> I see this was more directed at gcc people but I hope I can reply some.

I guess I can try my luck at a gcc mailing list;)

> On Thu, 15 Sep 2011 14:32:31 +0200, Martin Milata wrote:
> >  * The name of the function, if the corresponding binary is compiled
> >    with function symbols (as is the case with the libraries) together
> >    with offset into the function.
> 
> This is not true for static functions in the libraries:
> 
> ==> 26.c <==
> extern void f (void (*) (void));
> static void i (void) {}
> int main (void) { f (i); return 0; }
> 
> ==> 26l.c <==
> static void g (void (*h) (void)) { h (); }
> void f (void (*h) (void)) { g (h); }
> 
> gcc -o 26l.so 26l.c -Wall -shared -fPIC -s; gcc -o 26 26.c -Wall -g ./26l.so; gdb -nx ./26 -ex 'b i' -ex r -ex bt
> #0  i () at 26.c:2
> #1  0x00007ffff7dfc53e in ?? () from ./26l.so
>     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> #2  0x00007ffff7dfc558 in f () from ./26l.so
> #3  0x00000000004005c8 in main () at 26.c:3
> 
> glibc in Fedora packaging is probably the only exception; it has the .symtab
> section in the main rpm.  All the other libraries have .symtab only in the
> debuginfo rpm.

Good point ... well, at least we can use the names in .dynsym and fall
back to the other method if the function does not have a symbol table
entry.
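
A sketch of that fallback, with hypothetical names and addresses: dynsym
stands for (start, size, name) entries already read from .dynsym, and
fingerprint_of for whatever fingerprinting routine is used. The layout
mirrors Jan's example, where 0x558 resolves to f but 0x53e (in the
static g) does not.

```python
def frame_identity(ip, dynsym, fingerprint_of):
    """Prefer a .dynsym name plus offset; otherwise fall back to a fingerprint.

    dynsym:         iterable of (start, size, name) tuples.
    fingerprint_of: callable taking an address and returning a fingerprint.
    """
    for start, size, name in dynsym:
        if start <= ip < start + size:
            return ("symbol", name, ip - start)
    return ("fingerprint", fingerprint_of(ip))


# Assumed layout: only the exported function f has a .dynsym entry.
dynsym = [(0x540, 0x20, "f")]
assert frame_identity(0x558, dynsym, lambda ip: "fp") == ("symbol", "f", 0x18)
assert frame_identity(0x53E, dynsym, lambda ip: "fp") == ("fingerprint", "fp")
```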

> >  (Call graph properties)
> >  * List of the library functions called.
> 
> That is the functions called via .plt section - either from different libraries
> or within the same library (if it does not use direct calls like glibc does).
> Hopefully this should not change, I agree.

Great, this is so far the most important component in the signature as
each of the rest of the properties only provide one bit of information.

This unfortunately means that a lot of functions that do not call anything
through .plt have the same fingerprint. Can you think of some other
properties that we could use in those functions?

> >  * Whether the function calls some other functions in the file.
> 
> Various functions get inlined during various optimizations levels and compiler
> changes, it also changes with gcc -flto.
>
> >  * Whether the function calls itself.
> >  (Presence of types of instructions)
> 
> Tail call optimizations (call+ret -> jump) change this so -O0 vs. -O2 code will
> be definitely different; But -O2 compilation of slightly different code
> hopefully should have the same signature.

I see.

> >  * Conditional jumps based on equality test/signed comparison/unsigned
> >    comparison.
> 
> This is the exact target of the gcc -fprofile-* optimizations; AFAIK SuSE uses
> it a lot (I had some negative results trying to apply it for gdb packaging).
> That is to invert the jump conditional and reshuffle the code around so that
> in >50% cases it does not jump depending on "random" benchmark data during
> each build.

But we can test if the code contains either of the jX and jnX
instructions, right?
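
One way to read this idea is to treat a conditional jump and its logical
inverse as the same feature. The sketch below does that for a subset of
x86 mnemonics; the table and the name jump_features are illustrative
assumptions, and it says nothing about matching reordered code paths.

```python
# Each conditional jump paired with its logical inverse (x86 mnemonics).
INVERSE = {
    "je": "jne", "jl": "jge", "jle": "jg",
    "jb": "jae", "jbe": "ja", "js": "jns",
}
# Assembler aliases that encode the same conditions.
ALIASES = {
    "jz": "je", "jnz": "jne", "jnge": "jl", "jnl": "jge", "jng": "jle",
    "jnle": "jg", "jnae": "jb", "jnb": "jae", "jna": "jbe", "jnbe": "ja",
}

# Map both members of every pair to one canonical representative.
CANONICAL = {}
for positive, negative in INVERSE.items():
    CANONICAL[positive] = positive
    CANONICAL[negative] = positive


def jump_features(mnemonics):
    """Return the set of conditional-jump kinds, ignoring jump polarity."""
    features = set()
    for m in mnemonics:
        m = ALIASES.get(m, m)
        if m in CANONICAL:
            features.add(CANONICAL[m])
    return frozenset(features)


# Inverting every condition leaves the feature set unchanged.
assert jump_features(["je", "jl"]) == jump_features(["jne", "jge"])
```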

> > So the question is: How to improve this function fingerprinting
> > scheme? Is there a better approach for coredump duplicate detection?
> 
> I am a bit skeptical against such function content comparison but sure it does
> not have to be perfect.

I don't like it very much either, but I wasn't able to come up with
anything else that would work solely on the core dumps and binaries.
Sure it won't work perfectly, but hopefully it could work well enough to
be useful.

> There may be soon cheap enough to run gdbserver on the local core file with
> the recent optimization by Paul Pluzhnikov to be finished:
> 	Re: [patch] Implement qXfer:libraries for Linux/gdbserver
> 	http://sourceware.org/ml/gdb-patches/2011-08/msg00291.html
> But I do not have any benchmark numbers now to support it.

I'll see if we can somehow use the gdbserver, though we'd rather have
something that works without transmitting data over the network, because
doing it remotely would require additional infrastructure. Also, if I
understand correctly, the connection has to be initiated from the host
machine which might be a problem if there are NATs/firewalls on the way.

Anyway, thank you for your response.

Martin



* Re: Function fingerprinting for useful backtraces in absence of debuginfo
  2011-09-20 13:50   ` Martin Milata
@ 2011-09-20 20:22     ` Jan Kratochvil
From: Jan Kratochvil @ 2011-09-20 20:22 UTC
  To: Martin Milata; +Cc: Tom Tromey, gdb, Karel Klic

On Tue, 20 Sep 2011 15:51:17 +0200, Martin Milata wrote:
> On Thu, Sep 15, 2011 at 19:48:31 +0200, Jan Kratochvil wrote:
> > >  (Call graph properties)
> > >  * List of the library functions called.
> > 
> > That is the functions called via .plt section - either from different libraries
> > or within the same library (if it does not use direct calls like glibc does).
> > Hopefully this should not change, I agree.
> 
> Great, this is so far the most important component in the signature

Be aware that you have to disassemble the function: the relocations apply to
the .plt section, while functions call the entries in .plt PC-relatively,
without any further relocations.
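
For x86-64, the PC-relative call mentioned here is the 5-byte `call rel32`
form (opcode 0xE8 followed by a signed 32-bit little-endian displacement
measured from the next instruction). A minimal sketch of resolving its
target and testing it against the .plt range follows; the helper names
and addresses are hypothetical, and finding instruction boundaries is
assumed to be done by a real disassembler.

```python
import struct


def call_rel32_target(insn_addr, insn):
    """Return the target of an x86-64 `call rel32`, or None for anything else."""
    if len(insn) == 5 and insn[0] == 0xE8:
        (disp,) = struct.unpack("<i", insn[1:5])
        # The displacement is relative to the address of the next instruction.
        return (insn_addr + 5 + disp) & 0xFFFFFFFFFFFFFFFF
    return None


def is_plt_call(insn_addr, insn, plt_start, plt_end):
    """True if the instruction is a direct call landing inside [plt_start, plt_end)."""
    target = call_rel32_target(insn_addr, insn)
    return target is not None and plt_start <= target < plt_end


# e8 a8 fe ff ff at 0x400553: displacement -0x158, so the target is 0x400400.
insn = bytes.fromhex("e8a8feffff")
assert call_rel32_target(0x400553, insn) == 0x400400
assert is_plt_call(0x400553, insn, 0x4003F0, 0x400430)
```

Mapping the resolved .plt address back to a function name still requires
the .plt relocations, which is where the relocation data comes in.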

> This unfortunately means that lot of functions that do not call anything
> through .plt have the same fingerprint. Can you think of some other
> properties that we could use in those functions?

As you already have to do disassembly analysis very similar to that for the
.plt function calls, you can also find references to exported variables
through their .got section references.

> > >  * Conditional jumps based on equality test/signed comparison/unsigned
> > >    comparison.
> > 
> > This is the exact target of the gcc -fprofile-* optimizations; AFAIK SuSE uses
> > it a lot (I had some negative results trying to apply it for gdb packaging).
> > That is to invert the jump conditional and reshuffle the code around so that
> > in >50% cases it does not jump depending on "random" benchmark data during
> > each build.
> 
> But we can test if the code contains either of the jX and jnX
> instructions, right?

jX vs. jnX will change depending on the -fprofile-* feedback file, and you
probably cannot find out which of the two code paths matches which of the
former code paths.  Try it yourself: do a -fprofile-generate build, feed it two
different sets of input data that drive a conditional mostly one way or mostly
the other, and see how you could match the two resulting -fprofile-use
executables.

BTW have fun porting the disassembly analysis to all arches.

[gdbserver]
> Also, if I
> understand correctly, the connection has to be initiated from the host
> machine which might be a problem if there are NATs/firewalls on the way.

The direction of the connection is not relevant IMO; it needs to be tunnelled
for encryption anyway, which can change how the connection is initiated.

Thanks,
Jan
