From mboxrd@z Thu Jan 1 00:00:00 1970 From: Daniel Berlin To: Jim Blandy Cc: gdb-patches@sources.redhat.com Subject: Re: [WIP] New dwarf2 reader - updated 07-02-2001 Date: Tue, 31 Jul 2001 13:45:00 -0000 Message-id: <229924053.996597936@[192.168.0.106]> References: X-SW-Source: 2001-07/msg00763.html --On Tuesday, July 31, 2001 3:36 PM -0500 Jim Blandy wrote: > > Daniel Berlin writes: >> Jim Blandy writes: >> > - MD5 generates a 128-bit checksum. Any reason you only use the first >> > 32 bits in your hash? It looks like checksum_die and >> > checksum_partial_die both cut it off at eight digits. >> I use the first 64 bits, just like GCC does for checksumming DIE's and >> duplicate elimination. Though i'll give you gcc also includes the >> name attached to the die. I'll happily make the two algorithms >> *exactly* the same if you like. > > Eight hex digits * 4 bits per hex digit is 32 bits. No? > > static inline char * > checksum_die (struct die_info *die) > { > struct md5_ctx ctx; > char *retval; > int i; > unsigned char checksum[16]; > md5_init_ctx (&ctx); > process_full_die (die, &ctx); > md5_finish_ctx (&ctx, checksum); > retval = xcalloc (1, 9); > for (i = 0; i < 4; ++i) > sprintf (retval + (i * 2), "%.2x", checksum[i]); > retval[8] = 0; > return retval; > } > > The "MD5 never clashes" thesis is okay by me. But to say that, > therefore, the first 32 bits of MD5 never clash, is not okay. > > Using a function yielding n distinct hash values, the probability is > 1/n that they clash, thus 1-1/n that they don't clash. For p > independent pairs, the chances are (1-1/n)^p that none clash. For a > set of d dies, there are d(d-1) pairs. So in our case, using 32 bits > of MD5 for 10000 distinct dies, the chance of no clashes are > (1-1/(2^32))^(10000(10000-1)), or 0.977. Thus a 2.3% chance of a > clash. This sucks. > > Using 64 bits, as you intended, is much better, but why not use the > full 128? Honestly, why play this kind of game at all? Because it's faster than trying to keep all the die data. > It's not that > hard to do the job perfectly --- just use the significant die contents > themselves as the tree key. No collisions, no hash function, simpler. > Um, think about how much memory that would use. We'd have to keep copies of die contents around. >> > - Why are you using a splay tree keyed by MD5 hash values, instead of >> > a hash table? The only advantage of a splay over a hash table is >> > ordered retrieval, but your keys don't have any meaningful order. >> I hate the hash table interface for libiberty, it's annoying to use, >> IMHO. >> Other than that, no reason. I was meaning to make a nice generic >> dictionary interface, and talked with DJ about whether i could just >> use libdict, but since it's not GPL, and i don't have time to >> duplicate it's nice interface right now, i just went with what was >> easiest to program. > > Okay. > >> > - Assuming MD5 is perfect, you don't checksum all attribute forms >> > (DW_FORM_flag, for example). This means that you can get false >> > duplicates, if two dies differ only in (say) a flag's value. How is >> > read_comp_unit_dies set up to tolerate that? >> I only checksummed the stuff gcc generates (since i based the >> checksumming code on gcc's), and I forgot that gcc >> doesn't generate some forms. > > GCC does generate DW_FORM_flag: > > .byte 0x2 ; uleb128 0x2; (abbrev code) > .byte 0x2e ; uleb128 0x2e; (TAG: DW_TAG_subprogram) > .byte 0x1 ; DW_children_yes > .byte 0x1 ; uleb128 0x1; (DW_AT_sibling) > .byte 0x13 ; uleb128 0x13; (DW_FORM_ref4) > .byte 0x3f ; uleb128 0x3f; (DW_AT_external) > .byte 0xc ; uleb128 0xc; (DW_FORM_flag) > .byte 0x3 ; uleb128 0x3; (DW_AT_name) > .byte 0x8 ; uleb128 0x8; (DW_FORM_string) > .byte 0x3a ; uleb128 0x3a; (DW_AT_decl_file) > .byte 0xb ; uleb128 0xb; (DW_FORM_data1) > .byte 0x3b ; uleb128 0x3b; (DW_AT_decl_line) > .byte 0xb ; uleb128 0xb; (DW_FORM_data1) > .byte 0x27 ; uleb128 0x27; (DW_AT_prototyped) > .byte 0xc ; uleb128 0xc; (DW_FORM_flag) > .byte 0x49 ; uleb128 0x49; (DW_AT_type) > .byte 0x13 ; uleb128 0x13; (DW_FORM_ref4) > .byte 0x11 ; uleb128 0x11; (DW_AT_low_pc) > .byte 0x1 ; uleb128 0x1; (DW_FORM_addr) > .byte 0x12 ; uleb128 0x12; (DW_AT_high_pc) > .byte 0x1 ; uleb128 0x1; (DW_FORM_addr) > .byte 0x40 ; uleb128 0x40; (DW_AT_frame_base) > .byte 0xa ; uleb128 0xa; (DW_FORM_block1) > .byte 0x0 > .byte 0x0 > >> > - In read_comp_unit_dies, when you find a duplicate die, you skip to >> > its sibling. What if the parent die is identical, but the children >> > dies differ? >> >> I don't believe this is possible in any language. > > How about this: > > namespace X { > namespace A { > int m, n; > } > } > > namespace Y { > namespace A { > int o, p; > } > } > > The dies for the two 'A' namespaces have the name DW_AT_name > attribute, but they're clearly different. > This won't do it, they'll have different decl line attributes. And we only eliminate at the top level anyway. namespace A { int o, p; } namespace A { int m, n; } won't screw us either, since the decl line number will still differ. and, if you implemented it according to spec, one would be a different tag than the other. > Just in principle, this seems sloppy. Dies are just arbitrary tree > nodes. There's no reason to assume that if two parent nodes are > identical, we can skip the second one and all its children. It's not sloppy. Elimination at the top level is normal in other debuggers i know of that do elimination.