From mboxrd@z Thu Jan  1 00:00:00 1970
From: Daniel Berlin <dan@cgsoftware.com>
To: Jim Blandy <jimb@zwingli.cygnus.com>
Cc: gdb-patches@sources.redhat.com
Subject: Re: [WIP] New dwarf2 reader - updated 07-02-2001
Date: Tue, 31 Jul 2001 13:45:00 -0000
Message-id: <229924053.996597936@[192.168.0.106]>
References: <npg0bc96ou.fsf@zwingli.cygnus.com>
X-SW-Source: 2001-07/msg00763.html

--On Tuesday, July 31, 2001 3:36 PM -0500 Jim Blandy 
<jimb@zwingli.cygnus.com> wrote:

>
> Daniel Berlin <dan@cgsoftware.com> writes:
>> Jim Blandy <jimb@zwingli.cygnus.com> writes:
>> > - MD5 generates a 128-bit checksum.  Any reason you only use the first
>> >   32 bits in your hash?  It looks like checksum_die and
>> >   checksum_partial_die both cut it off at eight digits.
>> I use the first 64 bits, just like GCC does for checksumming DIE's and
>> duplicate elimination.  Though i'll give you gcc also includes the
>> name attached to the die. I'll happily make the two algorithms
>> *exactly* the same if you like.
>
> Eight hex digits * 4 bits per hex digit is 32 bits.  No?
>
>     static inline char *
>     checksum_die (struct die_info *die)
>     {
>       struct md5_ctx ctx;
>       char *retval;
>       int i;
>       unsigned char checksum[16];
>       md5_init_ctx (&ctx);
>       process_full_die (die, &ctx);
>       md5_finish_ctx (&ctx, checksum);
>       retval = xcalloc (1, 9);
>       for (i = 0; i < 4; ++i)
>         sprintf (retval + (i * 2), "%.2x", checksum[i]);
>       retval[8] = 0;
>       return retval;
>     }
>
> The "MD5 never clashes" thesis is okay by me.  But to say that,
> therefore, the first 32 bits of MD5 never clash, is not okay.
>
> Using a function yielding n distinct hash values, the probability is
> 1/n that they clash, thus 1-1/n that they don't clash.  For p
> independent pairs, the chances are (1-1/n)^p that none clash.  For a
> set of d dies, there are d(d-1) pairs.  So in our case, using 32 bits
> of MD5 for 10000 distinct dies, the chance of no clashes are
> (1-1/(2^32))^(10000(10000-1)), or 0.977.  Thus a 2.3% chance of a
> clash.  This sucks.
>
> Using 64 bits, as you intended, is much better, but why not use the
> full 128?  Honestly, why play this kind of game at all?
Because it's faster than trying to keep all the die data.
>  It's not that
> hard to do the job perfectly --- just use the significant die contents
> themselves as the tree key.  No collisions, no hash function, simpler.
>

Um, think about how much memory that would use.
We'd have to keep copies of die contents around.

>> > - Why are you using a splay tree keyed by MD5 hash values, instead of
>> >   a hash table?  The only advantage of a splay over a hash table is
>> >   ordered retrieval, but your keys don't have any meaningful order.
>> I hate the hash table interface for libiberty, it's annoying to use,
>> IMHO.
>> Other than that, no reason. I was meaning to make a nice generic
>> dictionary interface, and talked with DJ about whether i could just
>> use libdict, but since it's not GPL, and i don't have time to
>> duplicate it's nice interface right now, i just went with what was
>> easiest to program.
>
> Okay.
>
>> > - Assuming MD5 is perfect, you don't checksum all attribute forms
>> >   (DW_FORM_flag, for example).  This means that you can get false
>> >   duplicates, if two dies differ only in (say) a flag's value.  How is
>> >   read_comp_unit_dies set up to tolerate that?
>> I only checksummed the  stuff gcc generates (since i based the
>> checksumming code on gcc's), and I forgot that gcc
>> doesn't generate some forms.
>
> GCC does generate DW_FORM_flag:
>
> 	.byte	0x2	; uleb128 0x2; (abbrev code)
> 	.byte	0x2e	; uleb128 0x2e; (TAG: DW_TAG_subprogram)
> 	.byte	0x1	; DW_children_yes
> 	.byte	0x1	; uleb128 0x1; (DW_AT_sibling)
> 	.byte	0x13	; uleb128 0x13; (DW_FORM_ref4)
> 	.byte	0x3f	; uleb128 0x3f; (DW_AT_external)
> 	.byte	0xc	; uleb128 0xc; (DW_FORM_flag)
> 	.byte	0x3	; uleb128 0x3; (DW_AT_name)
> 	.byte	0x8	; uleb128 0x8; (DW_FORM_string)
> 	.byte	0x3a	; uleb128 0x3a; (DW_AT_decl_file)
> 	.byte	0xb	; uleb128 0xb; (DW_FORM_data1)
> 	.byte	0x3b	; uleb128 0x3b; (DW_AT_decl_line)
> 	.byte	0xb	; uleb128 0xb; (DW_FORM_data1)
> 	.byte	0x27	; uleb128 0x27; (DW_AT_prototyped)
> 	.byte	0xc	; uleb128 0xc; (DW_FORM_flag)
> 	.byte	0x49	; uleb128 0x49; (DW_AT_type)
> 	.byte	0x13	; uleb128 0x13; (DW_FORM_ref4)
> 	.byte	0x11	; uleb128 0x11; (DW_AT_low_pc)
> 	.byte	0x1	; uleb128 0x1; (DW_FORM_addr)
> 	.byte	0x12	; uleb128 0x12; (DW_AT_high_pc)
> 	.byte	0x1	; uleb128 0x1; (DW_FORM_addr)
> 	.byte	0x40	; uleb128 0x40; (DW_AT_frame_base)
> 	.byte	0xa	; uleb128 0xa; (DW_FORM_block1)
> 	.byte	0x0
> 	.byte	0x0
>
>> > - In read_comp_unit_dies, when you find a duplicate die, you skip to
>> >   its sibling.  What if the parent die is identical, but the children
>> >   dies differ?
>>
>> I don't believe this is possible in any language.
>
> How about this:
>
> namespace X {
>   namespace A {
>     int m, n;
>   }
> }
>
> namespace Y {
>   namespace A {
>     int o, p;
>   }
> }
>
> The dies for the two 'A' namespaces have the name DW_AT_name
> attribute, but they're clearly different.
>
This won't do it, they'll have different decl line attributes.
And we only eliminate at the top level anyway.

namespace A
{
	int o, p;
}
namespace A
{
	int m, n;
}
won't screw us either, since the decl line number will still differ.
and, if you implemented it according to spec, one would be a different tag 
than the other.

> Just in principle, this seems sloppy.  Dies are just arbitrary tree
> nodes.  There's no reason to assume that if two parent nodes are
> identical, we can skip the second one and all its children.
It's not sloppy.
Elimination at the top level is normal in other debuggers i know of that do 
elimination.