From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 15 Sep 2011 12:31:00 -0000
From: Martin Milata <mmilata@redhat.com>
To: Jan Kratochvil <jan.kratochvil@redhat.com>
Cc: Tom Tromey <tromey@redhat.com>, gdb@sourceware.org,        Karel Klic <kklic@redhat.com>
Subject: Function fingerprinting for useful backtraces in absence of debuginfo
Message-ID: <20110915123230.GA4048@dhcp-25-199.brq.redhat.com>

Hi Jan,

Karel probably told you about this, but since more people are CC'd, I'll
add a brief introduction.

In ABRT [1], we would like to be able to check whether two coredumps
come from the same bug in the source code without using debuginfo. Our
idea is to compute some kind of fingerprint from the assembly of a
function. We now need someone with good insight into compilation and
assembly in general to take a look at it and tell us what they think.
A more detailed description is below.

Thanks for your time,
Martin

[1] https://fedorahosted.org/abrt/wiki


The problem
-----------

How would you check if two coredumps are from the same bug in source
code, but without using debuginfo?

In ABRT, we are working on coredump duplicate detection that runs at
the time of a crash. We want to avoid filling users' hard drives with
unnecessary coredumps from repeated crashes. At crash time, program
binaries are available, but debuginfo packages are not. Duplicate
coredumps should be detected even when the binary or shared library
involved has been updated to a newer version (= patched and
recompiled), and when the package has been rebuilt with a newer gcc.

The approach under consideration is to create a 'canonical backtrace'
from the coredump and its binaries without using debuginfo. Having a
backtrace is useful because we already have good duplicate detection
algorithms for backtraces. So the question is how to generate a solid
backtrace from a coredump. For each stack frame in a given coredump,
we can obtain:

 * The name of the function, if the corresponding binary is compiled
   with function symbols (as is the case with the libraries), together
   with the offset into the function.

 * The Build ID of the binary, together with the offset of the
   instruction pointer from the start of the executable segment of the
   file. This should allow us to compare the pointers even if the text
   segments were loaded at different addresses (prelink/ASLR).
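
As a rough illustration of the second item, here is a minimal Python
sketch of turning a raw instruction pointer into a load-address
independent (Build ID, offset) pair. The Mapping record and the
normalize_ip helper are hypothetical names made up for this example;
a real tool would have to read the mappings and Build IDs from the
coredump and the ELF files.

```python
from typing import NamedTuple

class Mapping(NamedTuple):
    # Hypothetical record describing one loaded executable segment.
    start: int      # address where the segment was loaded
    end: int        # one past the last mapped address
    build_id: str   # Build ID of the file backing the segment

def normalize_ip(ip, mappings):
    """Return (build_id, offset) for ip, or None if no mapping contains it."""
    for m in mappings:
        if m.start <= ip < m.end:
            return (m.build_id, ip - m.start)
    return None

# The same build loaded at two different base addresses.
run_a = [Mapping(0x400000, 0x401000, "abc123")]
run_b = [Mapping(0x7F0000000000, 0x7F0000001000, "abc123")]

frame_a = normalize_ip(0x400123, run_a)
frame_b = normalize_ip(0x7F0000000123, run_b)
# Both frames reduce to the same pair despite the different load addresses.
```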

This means that we can compare two stack frames that belong either to
libraries with function symbols available or to the same build of an
executable (identified by its Build ID). We are not able to compare
stack frames from two executables built from slightly different source
or with different compiler options, because the instruction pointer
offsets are different.
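
The comparison rule described above could be sketched in Python as
follows. The Frame record and frames_match are hypothetical names, and
comparing symbol names while ignoring the offset into the function is
an assumption of this sketch, not something ABRT is known to do.

```python
from typing import NamedTuple, Optional

class Frame(NamedTuple):
    # Hypothetical frame record; any field may be missing.
    symbol: Optional[str]     # function name, if symbols are available
    build_id: Optional[str]   # Build ID of the containing binary
    offset: Optional[int]     # offset from the start of the executable segment

def frames_match(a, b):
    """Return True/False when the frames are comparable, None otherwise."""
    # Case 1: both frames have function symbols (typical for libraries).
    if a.symbol is not None and b.symbol is not None:
        return a.symbol == b.symbol
    # Case 2: both frames come from the very same build.
    if a.build_id is not None and a.build_id == b.build_id:
        return a.offset == b.offset
    # Different builds without symbols: the offsets mean nothing.
    return None
```

The None result marks exactly the gap that the fingerprinting proposal
below is meant to close.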


Proposed solution
-----------------

The proposed solution to this problem is to take the instruction
pointer from each stack frame, look at the .eh_frame section of the
corresponding ELF file to determine the boundaries of the function it
points to, and then compute a fingerprint of this function. Such a
fingerprint should be the same for two sequences of instructions that
were compiled from the same source code (and different for two
different functions).
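
The boundary lookup step might look like the following Python sketch.
It assumes the address ranges have already been extracted from
.eh_frame into a sorted list of half-open (start, end) pairs; parsing
the section itself is not shown, and find_function is a hypothetical
helper name.

```python
import bisect

def find_function(ranges, ip):
    """Return the (start, end) range containing ip, or None.

    ranges must be sorted by start address and must not overlap;
    each range is half-open, i.e. start <= address < end.
    """
    starts = [start for start, _ in ranges]
    # Index of the last range whose start is <= ip.
    i = bisect.bisect_right(starts, ip) - 1
    if i >= 0 and ranges[i][0] <= ip < ranges[i][1]:
        return ranges[i]
    return None

# Made-up ranges for illustration only.
ranges = [(0x1000, 0x1040), (0x1040, 0x10A0), (0x2000, 0x2010)]
hit = find_function(ranges, 0x1050)   # falls inside the second range
miss = find_function(ranges, 0x1800)  # falls in a gap between ranges
```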

This is obviously not possible in general, but we thought we should be
able to devise something that works in most cases. The prototype we
put together computes the fingerprint from several properties of the
function:

 (Call graph properties)
 * List of the library functions called.
 * Whether the function calls some other functions in the file.
 * Whether the function calls itself.
 (Presence of types of instructions)
 * Conditional jumps based on equality test/signed comparison/unsigned
   comparison.
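
To make the list above concrete, here is a minimal Python sketch of
such a fingerprint. It is not the actual prototype. It assumes the
function has already been disassembled into (mnemonic, operand) pairs,
that the instruction set is x86, that calls to library functions carry
the resolved name as a string, and that calls within the file carry the
target address as an integer. All of these are assumptions of the
example.

```python
# x86 conditional jump mnemonics grouped by the kind of test they imply.
EQUALITY = {"je", "jne", "jz", "jnz"}
SIGNED = {"jl", "jle", "jg", "jge"}
UNSIGNED = {"jb", "jbe", "ja", "jae"}

def fingerprint(func_start, instructions):
    """Summarize a function as a tuple of order-independent properties."""
    lib_calls = set()
    calls_local = False
    calls_self = False
    jumps = set()
    for mnemonic, operand in instructions:
        if mnemonic == "call":
            if isinstance(operand, str):
                lib_calls.add(operand)      # library function, by name
            elif isinstance(operand, int):
                if operand == func_start:
                    calls_self = True       # direct recursion
                else:
                    calls_local = True      # another function in the file
            # Indirect calls (operand is None) are ignored here.
        elif mnemonic in EQUALITY:
            jumps.add("eq")
        elif mnemonic in SIGNED:
            jumps.add("signed")
        elif mnemonic in UNSIGNED:
            jumps.add("unsigned")
    return (tuple(sorted(lib_calls)), calls_local, calls_self,
            tuple(sorted(jumps)))

# Two made-up instruction streams that differ in order and addresses.
build_1 = [("call", "malloc"), ("je", None), ("call", 0x2000),
           ("call", "free"), ("jb", None)]
build_2 = [("call", "free"), ("jne", None), ("jae", None),
           ("call", "malloc"), ("call", 0x3000)]
same = fingerprint(0x1000, build_1) == fingerprint(0x1800, build_2)
```

Because every property is a set or a flag, instruction reordering and
changed addresses do not affect the result, which is the point of the
scheme; the price is that unrelated functions with similar shapes can
collide.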

This way, we get the same fingerprint for a little under 90 % of pairs
of the same function taken from the handful of programs we tested,
with a ~3 % probability that two different random functions have the
same fingerprint.
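
For reference, the two figures above could be measured along the
following lines. This is only a sketch of one possible evaluation, not
the procedure we actually used. It assumes fingerprints are available
as dictionaries keyed by function name (so symbols are needed for the
evaluation itself), and match_rate and collision_rate are hypothetical
helper names.

```python
from itertools import combinations

def match_rate(build_a, build_b):
    """Fraction of functions present in both builds with equal fingerprints."""
    common = build_a.keys() & build_b.keys()
    if not common:
        return 0.0
    same = sum(1 for name in common if build_a[name] == build_b[name])
    return same / len(common)

def collision_rate(build):
    """Fraction of pairs of distinct functions that share a fingerprint."""
    pairs = list(combinations(build.values(), 2))
    if not pairs:
        return 0.0
    return sum(1 for x, y in pairs if x == y) / len(pairs)

# Tiny made-up data set; the integers stand in for real fingerprints.
old_build = {"f": 1, "g": 2, "h": 2}
new_build = {"f": 1, "g": 3, "h": 2}
```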


What we need
------------

Unfortunately, I have pretty much no experience with assembly and have
only a vague knowledge of compiler optimization techniques. The above
fingerprinting scheme is mostly based on trial-and-error and wild
guesses.

So the question is: how can this function fingerprinting scheme be
improved? Is there a better approach to coredump duplicate detection?