Date: Thu, 15 Sep 2011 12:31:00 -0000
From: Martin Milata
To: Jan Kratochvil
Cc: Tom Tromey, gdb@sourceware.org, Karel Klic
Subject: Function fingerprinting for useful backtraces in absence of debuginfo
Message-ID: <20110915123230.GA4048@dhcp-25-199.brq.redhat.com>

Hi Jan,

Karel probably told you about this, but since more people are CC'd, I'll
add a brief introduction.

In ABRT [1], we would like to be able to check whether two coredumps come
from the same bug in the source code without using debuginfo. We have an
idea how to do this, which involves computing some kind of fingerprint from
the assembly of a function.
Now we need someone with good insight into compilation and assembly in
general to take a look at it and tell us what they think. A more detailed
description is below.

Thanks for your time,
Martin

[1] https://fedorahosted.org/abrt/wiki

The problem
-----------

How would you check whether two coredumps come from the same bug in the
source code, but without using debuginfo?

In ABRT, we are working on coredump duplicate detection that runs at the
time of a crash. We want to avoid filling users' hard drives with
unnecessary coredumps from repeated crashes. At crash time, the program
binaries are available, but the debuginfo packages are not. Duplicate
coredumps should be detected even when the binary or shared library
involved has been updated to a newer version (i.e. patched and recompiled),
and when the package has been rebuilt with a newer gcc.

The approach under consideration is to create a 'canonical backtrace' from
the coredump and its binaries without using the debuginfo. Having a
backtrace is useful because we already have good duplicate detection
algorithms for backtraces. So the question is how to generate a solid
backtrace from a coredump.

For each stack frame in a given coredump, we can obtain:

* The name of the function, together with the offset into the function, if
  the corresponding binary is compiled with function symbols (as is the
  case with libraries).

* The Build ID of the binary, together with the offset of the instruction
  pointer from the start of the executable segment of the file. This should
  allow us to compare the pointers even if the text segments were loaded at
  different addresses (prelink/ASLR).

This means that we can compare two stack frames only if they belong either
to libraries with function symbols available or to the same build of an
executable (one that has a Build ID). We are not able to compare stack
frames from two executables built from slightly different source or with
different compiler options, because the instruction pointer offsets differ.
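
To make the comparison rules above concrete, here is a minimal sketch in
Python. The Frame type, its field names and the frames_match helper are
hypothetical and not part of ABRT. Comparing symbolised frames by function
name alone (ignoring the offset into the function) is an assumption of this
sketch, made because that offset changes between builds.

```python
from typing import NamedTuple, Optional


class Frame(NamedTuple):
    """One stack frame as it can be recovered without debuginfo (hypothetical type)."""
    build_id: str                 # Build ID of the ELF file the frame belongs to
    offset: int                   # instruction pointer minus start of the executable segment
    symbol: Optional[str] = None  # function name, if the file has function symbols


def frames_match(a: Frame, b: Frame) -> Optional[bool]:
    """Return True/False if the frames are comparable, None if they are not."""
    # Both frames carry a function name: compare names only (assumption of
    # this sketch), since the offset inside the function varies per build.
    if a.symbol is not None and b.symbol is not None:
        return a.symbol == b.symbol
    # Same build of the same file: segment-relative offsets are directly
    # comparable, wherever the text segment was loaded (prelink/ASLR).
    if a.build_id == b.build_id:
        return a.offset == b.offset
    # Different builds and no usable symbols: this is the gap that the
    # fingerprinting idea is meant to close.
    return None
```

The None result marks exactly the case that currently cannot be decided:
two frames from different builds of an executable without function symbols.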

Proposed solution
-----------------

The proposed solution to this problem is to take the instruction pointer
from each stack frame, look at the .eh_frame section of the corresponding
ELF file to determine the boundaries of the function it points into, and
then compute a fingerprint of this function. Such a fingerprint should be
the same for two sequences of instructions that were compiled from the same
source code (and different for two different functions). This is obviously
not possible in general, but we thought we should be able to devise
something that works in most cases.

The prototype we put together computes the fingerprint from several
properties of the function:

(Call graph properties)
* List of the library functions called.
* Whether the function calls some other functions in the file.
* Whether the function calls itself.

(Presence of types of instructions)
* Conditional jumps based on equality test/signed comparison/unsigned
  comparison.

This way, we get the same fingerprint for slightly under 90 % of pairs of
the same function from the handful of programs we tested, with a
probability of about 3 % that two different, randomly chosen functions get
the same fingerprint.

What we need
------------

Unfortunately, I have pretty much no experience with assembly and only a
vague knowledge of compiler optimization techniques. The above
fingerprinting scheme is mostly based on trial and error and wild guesses.

So the questions are: How can this function fingerprinting scheme be
improved? Is there a better approach to coredump duplicate detection?
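
For concreteness, the fingerprint properties listed under "Proposed
solution" could be computed roughly as in the Python sketch below. This is
not the prototype's code. It assumes x86 disassembly that has already been
split into (mnemonic, operands) pairs in objdump-like text, with call
targets written as "400400 <puts@plt>"; that input format, the Fingerprint
type and the fingerprint() function are all hypothetical. The jump mnemonic
sets are not exhaustive.

```python
import re
from typing import FrozenSet, List, NamedTuple, Tuple

# x86 conditional jumps grouped by the kind of comparison they test.
EQUALITY_JUMPS = {"je", "jne", "jz", "jnz"}
SIGNED_JUMPS = {"jl", "jle", "jg", "jge", "jnl", "jnle", "jng", "jnge"}
UNSIGNED_JUMPS = {"ja", "jae", "jb", "jbe", "jna", "jnae", "jnb", "jnbe"}

# Assumed operand format of a direct call: hex address, optional "<label>".
CALL_OPERAND = re.compile(r"^([0-9a-f]+)(?:\s+<([^>]+)>)?$")


class Fingerprint(NamedTuple):
    library_calls: FrozenSet[str]  # names of library functions called via the PLT
    calls_local: bool              # calls some other function in the same file
    calls_self: bool               # calls its own entry point (recursion)
    has_equality_jump: bool
    has_signed_jump: bool
    has_unsigned_jump: bool


def fingerprint(start: int, instructions: List[Tuple[str, str]]) -> Fingerprint:
    """Fingerprint a function that begins at address `start`."""
    library_calls = set()
    calls_local = calls_self = False
    mnemonics = set()
    for mnemonic, operands in instructions:
        mnemonics.add(mnemonic)
        if mnemonic not in ("call", "callq"):
            continue
        match = CALL_OPERAND.match(operands.strip())
        if match is None:
            continue  # indirect call such as "*%rax": target unknown
        target = int(match.group(1), 16)
        label = match.group(2) or ""
        if label.endswith("@plt"):
            library_calls.add(label[: -len("@plt")])
        elif target == start:
            calls_self = True
        else:
            calls_local = True
    return Fingerprint(
        frozenset(library_calls),
        calls_local,
        calls_self,
        bool(mnemonics & EQUALITY_JUMPS),
        bool(mnemonics & SIGNED_JUMPS),
        bool(mnemonics & UNSIGNED_JUMPS),
    )
```

Because none of the fields depends on absolute addresses or instruction
offsets, two builds of the same function that differ only in where the code
is placed yield equal Fingerprint values and can be compared with ==.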