New feature "source-id"

Mirror of the gdb-patches mailing list

* New feature "source-id"
@ 2014-03-15 10:49 Gerhard Gappmeier
  2014-03-15 17:32 ` Doug Evans
  2014-03-18 13:22 ` Mark Wielaard
  0 siblings, 2 replies; 22+ messages in thread
From: Gerhard Gappmeier @ 2014-03-15 10:49 UTC (permalink / raw)
  To: gdb-patches

Hi all,

I've implemented a new feature that I call "source-id", analogous to the 
existing "build-id" functionality. Where build-ids are used to fetch debug 
symbols automatically, this feature fetches the correct source code from a 
version control system when debugging.
The idea is that when you need to debug an executable, or open a core dump of 
an executable, that was built with "source-id" and "build-id" enabled, it just 
works. You don't need to care about how and where to get the debug symbols and 
the correct sources.

How does this work? The technical details are explained in the README.md 
hosted in my test repository: https://github.com/gergap/source-id
(The README.md is shown nicely formatted on that page.)
This repository contains an example that I used to test the new feature.
It also contains example "source-fetch-scripts" that are used by GDB to fetch 
the source code. These should probably be bundled with GDB itself. Just give me 
a hint where the best location would be to add these files in the GDB repo.

You can find the GDB patches here: https://github.com/gergap/binutils-gdb
on the branch gergap/source-id-feature

What has been changed in GDB:
* add support for a new .note section ".note.gnu.source-id" to be able to 
store and retrieve the VCS data from an ELF file.
* add new gdb commands: "set source-lookup <script>", "unset source-lookup", 
"show source-lookup". This way the source-lookup functionality can be enabled.
* change function open_source_file to fetch the source file using the external 
fetch script.
* disable mtime check when source-lookup is enabled to avoid the warning 
"Source file is more recent than executable". It is normal that the timestamp 
is newer when it was just fetched using a VCS like git.
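
To illustrate the first item: an ELF note record is three 32-bit words (name 
size, descriptor size, type) followed by the name and the descriptor, each 
commonly padded to a 4-byte boundary. The Python sketch below packs and 
unpacks one such record. The owner name "GNU", the type value 0x100, the 
little-endian byte order and the NUL-separated payload are assumptions made 
only for this illustration; the real layout is whatever the README.md defines.

```python
import struct

NOTE_HEADER = "<III"  # namesz, descsz, type (little-endian assumed)


def pad4(n):
    """Round n up to the next multiple of 4."""
    return (n + 3) & ~3


def build_note(name, note_type, desc):
    """Pack one note record: header, NUL-terminated name, descriptor."""
    name_z = name + b"\0"
    header = struct.pack(NOTE_HEADER, len(name_z), len(desc), note_type)
    return (header
            + name_z.ljust(pad4(len(name_z)), b"\0")
            + desc.ljust(pad4(len(desc)), b"\0"))


def parse_note(data):
    """Unpack the first note record in data as (name, type, descriptor)."""
    namesz, descsz, note_type = struct.unpack_from(NOTE_HEADER, data, 0)
    offset = struct.calcsize(NOTE_HEADER)
    name = data[offset:offset + namesz].rstrip(b"\0")
    offset += pad4(namesz)
    desc = data[offset:offset + descsz]
    return name, note_type, desc


# Hypothetical payload: VCS type, URL and revision, each NUL-terminated.
payload = (b"git\0"
           b"git@github.com:gergap/source-id.git\0"
           b"c2ec66e6a36451ba47422d186fd97311989ef278\0")
raw = build_note(b"GNU", 0x100, payload)
name, note_type, desc = parse_note(raw)
```

A reader only needs the two sizes from the header to skip a note it does not 
understand, which is why an additional note type can coexist with build-id.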

What I need from you guys:
* Feedback on whether you like the feature or not. I really hope this feature 
can make it into mainline as it is really useful.
* Feedback on the implementation: security, coding style, etc.
* We need to make the new section ".note.gnu.source-id" official. I don't know 
who maintains this, and it needs to be registered somewhere.

Future work:
* adding file hashes (SHA1) for each source file to the debug info. This way 
we can completely remove the mtime check and replace it with a check of the 
SHA1 sum. Then we can replace the existing warning with a message like "The 
source file does not match the executable."
* this hash can also be used to implement reliable caching for the 
"fetch-scripts"
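
The hash check from the first item could look roughly like the following 
Python sketch. It only shows the comparison; where the expected hash comes 
from (the debug info) is not modelled, and the function names are hypothetical.

```python
import hashlib


def sha1_of_file(path, chunk_size=65536):
    """Return the hex SHA1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def source_matches(path, expected_sha1):
    """True if the file content matches the hash recorded at build time."""
    return sha1_of_file(path) == expected_sha1.lower()
```

A fetch script could use the same comparison to decide whether a cached copy 
is still valid before contacting the version control system again.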

-- 
mit freundlichen Grüßen / best regards

Gerhard Gappmeier
ascolab GmbH - automation systems communication laboratory
Tel.: +49 9131 691 123
Fax: +49 9131 691 128
Web: http://www.ascolab.com
GPG-KeyId: 5AAC50C4
GPG-Fingerprint: 967A 15F1 2788 164D CCA3 6C46 07CD 6F82 5AAC 50C4

--
ascolab GmbH
Geschäftsführer: Gerhard Gappmeier, Matthias Damm, Uwe Steinkrauß
Sitz der Gesellschaft: Am Weichselgarten 7 • 91058 Erlangen • Germany
Registernummer: HRB 9360
Registergericht: Amtsgericht Fürth



* Re: New feature "source-id"
  2014-03-15 10:49 New feature "source-id" Gerhard Gappmeier
@ 2014-03-15 17:32 ` Doug Evans
  2014-03-15 20:06   ` Eli Zaretskii
  2014-03-18 13:22 ` Mark Wielaard
  1 sibling, 1 reply; 22+ messages in thread
From: Doug Evans @ 2014-03-15 17:32 UTC (permalink / raw)
  To: Gerhard Gappmeier; +Cc: gdb-patches

On Sat, Mar 15, 2014 at 3:49 AM, Gerhard Gappmeier
<gerhard.gappmeier@ascolab.com> wrote:
> Hi all,
>
> I've implemented a new feature that I call "source-id" analogous to the
> existing "build-id" functionality. Unlike fetching debug symbols automatically
> using build-ids this feature can fetch the correct source code from a version
> control system when debugging.
> The idea is that when you need to debug an executable or opening a coredump of
> an executable that was built with "source-id" and "build-id" enabled it just
> works. You don't need to care about how and where to get debug symbols and the
> correct sources.
>
> How does this works? The technical details are explained in the README.md
> hosted in my test repository: https://github.com/gergap/source-id
> (The content of README.md is shown on this site nicely formatted)
> This repository contains an example that I used to test the new feature.
> It also contains example "source-fetch-scripts" that are used by GDB to fetch
> the source code. This should probably be bundled with GDB itself. Just give me
> a hint where the best location would be to add these files in the GDB repo.
>
> The GDB patches you find here: https://github.com/gergap/binutils-gdb
> on the branch gergap/source-id-feature
>
> What has been changed in GDB:
> * add support for a new .note section ".note.gnu.source-id" to be able to
> store and retrieve the VCS data from an ELF file.
> * add new gdb commands: "set source-lookup <script>", "unset source-lookup",
> "show source-lookup". This way the source-lookup functionality can be enabled.
> * change function open_source_file to fetch the source file using the external
> fetch script.
> * disable mtime check when source-lookup is enabled to avoid the warning
> "Source file is more recent than executable". It is normal that the timestamp
> is newer when it was just fetched using a VCS like git.
>
> What I need from you guys:
> * Feedback if you like the feature or not. I really hope this feature can make
> it into mainline as it is really useful.
> * Feedback on the implementation: Security, CodingStyle, etc.
> * We need to make the new section ".note.gnu.source-id" official. I don't know
> who maintains this and this needs to be registered somewhere.
>
> Future work:
> * adding file hashes (SHA1) for each source file to the debug info. This way
> we can completely remove the mtime check and replace it with a check of the
> SHA1 sum. When we can replace the existing warning with a message like "The
> source file does not match the executable."
> * this hash can also be used to implement reliable caching for the "fetch-
> scripts"

Hi!
I would support including this feature in gdb.
But IMO the fetching of source must go through the Extension Language API.
See gdb/extension*.[ch].
I.e., don't do the popen in gdb.  Just call out to the extension
language via the API (e.g, Python), passing it the necessary
parameters.
[One could elaborate on this, following the
pretty-printer/debug-operator scheme and support having multiple
source fetchers that can be individually enabled/disabled.  Dunno if
it's necessary though, have to think about it some more.]
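
A scheme with several fetchers that can be enabled or disabled individually 
might be sketched in Python as follows. This is plain Python and not the GDB 
extension API; every class and method name here is hypothetical.

```python
class SourceFetcher:
    """One way of obtaining a source file; subclasses override fetch()."""

    def __init__(self, name):
        self.name = name
        self.enabled = True

    def fetch(self, filename, vcs_info):
        """Return a local path for filename, or None if unavailable."""
        raise NotImplementedError


class FetcherRegistry:
    """Tries the enabled fetchers in registration order."""

    def __init__(self):
        self._fetchers = []

    def register(self, fetcher):
        self._fetchers.append(fetcher)

    def set_enabled(self, name, enabled):
        for fetcher in self._fetchers:
            if fetcher.name == name:
                fetcher.enabled = enabled

    def fetch(self, filename, vcs_info):
        for fetcher in self._fetchers:
            if not fetcher.enabled:
                continue
            path = fetcher.fetch(filename, vcs_info)
            if path is not None:
                return path
        return None


class DictFetcher(SourceFetcher):
    """Toy fetcher backed by a dict keyed on (version, filename)."""

    def __init__(self, name, table):
        super().__init__(name)
        self.table = table

    def fetch(self, filename, vcs_info):
        return self.table.get((vcs_info["version"], filename))


registry = FetcherRegistry()
registry.register(DictFetcher("cache", {("c2ec66e", "main.c"): "/tmp/cache/main.c"}))
registry.register(DictFetcher("vcs", {("c2ec66e", "main.c"): "/tmp/vcs/main.c"}))
```

The first enabled fetcher that returns a path wins, so disabling "cache" makes 
the lookup fall through to "vcs".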

I didn't dig deeper into the implementation, but it is an interesting idea.



* Re: New feature "source-id"
  2014-03-15 17:32 ` Doug Evans
@ 2014-03-15 20:06   ` Eli Zaretskii
  2014-03-16  2:34     ` Doug Evans
  0 siblings, 1 reply; 22+ messages in thread
From: Eli Zaretskii @ 2014-03-15 20:06 UTC (permalink / raw)
  To: Doug Evans; +Cc: gerhard.gappmeier, gdb-patches

> Date: Sat, 15 Mar 2014 10:32:12 -0700
> From: Doug Evans <dje@google.com>
> Cc: gdb-patches <gdb-patches@sourceware.org>
> 
> But IMO the fetching of source must go through the Extension Language API.
> See gdb/extension*.[ch].
> I.e., don't do the popen in gdb.  Just call out to the extension
> language via the API (e.g, Python), passing it the necessary
> parameters.

That would mean the feature will be unavailable in a GDB compiled
without Python and Guile.



* Re: New feature "source-id"
  2014-03-15 20:06   ` Eli Zaretskii
@ 2014-03-16  2:34     ` Doug Evans
  2014-03-16  9:43       ` Gerhard Gappmeier
  2014-03-16 16:22       ` Doug Evans
  0 siblings, 2 replies; 22+ messages in thread
From: Doug Evans @ 2014-03-16  2:34 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Gerhard Gappmeier, gdb-patches

On Sat, Mar 15, 2014 at 1:06 PM, Eli Zaretskii <eliz@gnu.org> wrote:
>> Date: Sat, 15 Mar 2014 10:32:12 -0700
>> From: Doug Evans <dje@google.com>
>> Cc: gdb-patches <gdb-patches@sourceware.org>
>>
>> But IMO the fetching of source must go through the Extension Language API.
>> See gdb/extension*.[ch].
>> I.e., don't do the popen in gdb.  Just call out to the extension
>> language via the API (e.g, Python), passing it the necessary
>> parameters.
>
> That would mean the feature will be unavailable in a GDB compiled
> without Python and Guile.

Actually, that could be an orthogonal question: the Extension Language
API also provides an interface to gdb's own scripting language (e.g.,
to support auto-loading .gdb scripts).
I'm not sure I would go that route though.  I have no problem with
requiring Python or Guile in order to support this.

Note that one concern I have is that it may be that some sites will
want to have some of gdb's state updated when source files are
automagically fetched.  E.g., maybe one would want to update the
source search path.  Maybe not, but at any rate I don't want this
feature to preclude doing things like that, and one can't do that if
the feature works by running an external program via popen.

btw, Gerhard, this would require a copyright assignment.
Do you have one?  [If/when (and at this point I'd say it's still a big
"if") there is general consensus on adding the feature you'll need to
complete an assignment before the patch can be added to the FSF
sources.  Let me know if you need the necessary paperwork.  There's no
rush on that, but it's something to keep in mind.


* Re: New feature "source-id"
  2014-03-16  2:34     ` Doug Evans
@ 2014-03-16  9:43       ` Gerhard Gappmeier
  2014-03-16 16:22       ` Doug Evans
  1 sibling, 0 replies; 22+ messages in thread
From: Gerhard Gappmeier @ 2014-03-16  9:43 UTC (permalink / raw)
  To: gdb-patches

On Saturday, March 15, 2014 07:34:08 PM Doug Evans wrote:
> On Sat, Mar 15, 2014 at 1:06 PM, Eli Zaretskii <eliz@gnu.org> wrote:
> >> Date: Sat, 15 Mar 2014 10:32:12 -0700
> >> From: Doug Evans <dje@google.com>
> >> Cc: gdb-patches <gdb-patches@sourceware.org>
> >> 
> >> But IMO the fetching of source must go through the Extension Language
> >> API.
> >> See gdb/extension*.[ch].
> >> I.e., don't do the popen in gdb.  Just call out to the extension
> >> language via the API (e.g, Python), passing it the necessary
> >> parameters.
> > 
> > That would mean the feature will be unavailable in a GDB compiled
> > without Python and Guile.
Actually I agree with Eli. I would prefer it if this worked without Python.
The popen approach is very flexible: you can write fetch scripts in 
whatever language you like. My example scripts are simple Bash scripts that 
call wget.
But maybe I just don't understand what you mean by that Extension Language 
API. Does this require implementing fetch scripts in Python? Do you have 
security concerns with popen, or what is the reason for this?
> 
> Actually, that could be an orthogonal question: the Extension Language
> API also provides an interface to gdb's own scripting language (e.g.,
> to support auto-loading .gdb scripts).
> I'm not sure I would go that route though.  I have no problem with
> requiring Python or Guile in order to support this.
> 
> Note that one concern I have is that it may be that some sites will
> want to have some of gdb's state updated when source files are
> automagically fetched.  E.g., maybe one would want to update the
> source search path. 
Actually a search path is not required. The script outputs the filename to 
stdout, which is then read by GDB and used to open the file. There is no need 
for any search path to make this work.
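
The stdout protocol can be sketched in Python like this. The caller stands in 
for GDB and is not the actual GDB code; the arguments passed to the script are 
left to the caller because the real interface is defined by the patch.

```python
import subprocess


def run_fetch_script(argv):
    """Run a fetch script and return the path it prints, or None on failure.

    argv is the full command line of the script (hypothetical arguments).
    The script is expected to print the local filename on its first
    stdout line and to exit with status 0.
    """
    result = subprocess.run(argv, stdout=subprocess.PIPE,
                            universal_newlines=True, check=False)
    if result.returncode != 0:
        return None
    lines = result.stdout.splitlines()
    if not lines or not lines[0].strip():
        return None
    return lines[0].strip()
```

Because only a filename crosses the pipe, the script can be written in any 
language, but it also cannot change any other state in the caller.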

> Maybe not, but at any rate I don't want this
> feature to preclude doing things like that, and one can't do that if
> the feature works by running an external program via popen.
> 
> btw, Gerhard, this would require a copyright assignment.
> Do you have one?  [If/when (and at this point I'd say it's still a big
> "if") there is general consensus on adding the feature you'll need to
> complete an assignment before the patch can be added to the FSF
> sources.  Let me know if you need the necessary paperwork.  There's no
> rush on that, but it's something to keep in mind.
I have no problem with signing a copyright assignment. Just let me know what I 
need to do for it. It's my first contribution to the FSF.




* Re: New feature "source-id"
  2014-03-16  2:34     ` Doug Evans
  2014-03-16  9:43       ` Gerhard Gappmeier
@ 2014-03-16 16:22       ` Doug Evans
  2014-03-16 16:34         ` Eli Zaretskii
  2014-03-17  8:49         ` Gerhard Gappmeier
  1 sibling, 2 replies; 22+ messages in thread
From: Doug Evans @ 2014-03-16 16:22 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Gerhard Gappmeier, gdb-patches

On Sat, Mar 15, 2014 at 7:34 PM, Doug Evans <dje@google.com> wrote:
>
> Note that one concern I have is that it may be that some sites will
> want to have some of gdb's state updated when source files are
> automagically fetched.  E.g., maybe one would want to update the
> source search path.  Maybe not, but at any rate I don't want this
> feature to preclude doing things like that, and one can't do that if
> the feature works by running an external program via popen.

As a data point,
another way to go is to just have a convention for some global
variables in the binary.
With the debug info gdb can access them, and they could contain
everything that would be in the .note section.

I don't have a preference, per se.
I just mention it as a possibility, and if one went that route then
doing this in Python/Guile would be, while perhaps not required,
certainly easy.



* Re: New feature "source-id"
  2014-03-16 16:22       ` Doug Evans
@ 2014-03-16 16:34         ` Eli Zaretskii
  2014-03-17  8:49         ` Gerhard Gappmeier
  1 sibling, 0 replies; 22+ messages in thread
From: Eli Zaretskii @ 2014-03-16 16:34 UTC (permalink / raw)
  To: Doug Evans; +Cc: gerhard.gappmeier, gdb-patches

> Date: Sun, 16 Mar 2014 09:22:17 -0700
> From: Doug Evans <dje@google.com>
> Cc: Gerhard Gappmeier <gerhard.gappmeier@ascolab.com>, gdb-patches <gdb-patches@sourceware.org>
> 
> another way to go is to just have a convention for some global
> variables in the binary.
> With the debug info gdb can access them, and they could contain
> everything that would be in the .note section.

This would have the advantage of not depending on the ELF format
of the object files.



* Re: New feature "source-id"
  2014-03-16 16:22       ` Doug Evans
  2014-03-16 16:34         ` Eli Zaretskii
@ 2014-03-17  8:49         ` Gerhard Gappmeier
  2014-03-17 12:25           ` Matt Rice
  1 sibling, 1 reply; 22+ messages in thread
From: Gerhard Gappmeier @ 2014-03-17  8:49 UTC (permalink / raw)
  To: gdb-patches

On Sunday, March 16, 2014 09:22:17 AM Doug Evans wrote:
> On Sat, Mar 15, 2014 at 7:34 PM, Doug Evans <dje@google.com> wrote:
> > Note that one concern I have is that it may be that some sites will
> > want to have some of gdb's state updated when source files are
> > automagically fetched.  E.g., maybe one would want to update the
> > source search path.  Maybe not, but at any rate I don't want this
> > feature to preclude doing things like that, and one can't do that if
> > the feature works by running an external program via popen.
> 
> As a data point,
> another way to go is to just have a convention for some global
> variables in the binary.
> With the debug info gdb can access them, and they could contain
> everything that would be in the .note section.
> 
> I don't have a preference, per se.
> I just mention it as a possibility, and if one went that route then
> doing this in Python/Guile would be while perhaps not required
> certainly easy.
That's an interesting idea. When I first read this comment I thought it would 
require code changes, which is not what I want. But indeed we can simply 
generate a separate 'vcsinfo.c' file which gets compiled and linked with the 
executable. I think it's even simpler to add a new C file than requiring GNU as 
to generate a new section.

example vcsinfo.c:
/* this file was generated, bla bla, don't modify */
static const char vcs_type[] = "git";
static const char vcs_url[] = "git@github.com:gergap/source-id.git";
static const char vcs_version[] = "c2ec66e6a36451ba47422d186fd97311989ef278";

I just have to investigate how to access this debug info in GDB, as I'm still 
new to this code. I hope it is as easy as it was to access the ELF info ;-)

Do you see any problem with declaring the variables static? Doing so, we can 
avoid name clashes.

How can we keep these variables from being dropped by the linker if they are 
not referenced by any code? By declaring them volatile?
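
Generating such a file during the build could be done with a small script 
like the Python sketch below. The function names are hypothetical, and the 
sketch only reproduces the three variables from the example; it does not 
address whether unreferenced static variables survive compilation and linking.

```python
def c_string_literal(value):
    """Quote a Python string as a C string literal (backslash and quote only)."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def render_vcsinfo_c(vcs_type, vcs_url, vcs_version):
    """Return the text of a generated vcsinfo.c with the three VCS fields."""
    lines = ["/* this file was generated, do not modify */"]
    fields = (("vcs_type", vcs_type),
              ("vcs_url", vcs_url),
              ("vcs_version", vcs_version))
    for name, value in fields:
        lines.append("static const char %s[] = %s;"
                     % (name, c_string_literal(value)))
    return "\n".join(lines) + "\n"


text = render_vcsinfo_c("git",
                        "git@github.com:gergap/source-id.git",
                        "c2ec66e6a36451ba47422d186fd97311989ef278")
```

The build would write this text to vcsinfo.c before compiling, so no 
hand-written source file has to change.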




* Re: New feature "source-id"
  2014-03-17  8:49         ` Gerhard Gappmeier
@ 2014-03-17 12:25           ` Matt Rice
  2014-03-17 19:01             ` Gerhard Gappmeier
  0 siblings, 1 reply; 22+ messages in thread
From: Matt Rice @ 2014-03-17 12:25 UTC (permalink / raw)
  To: Gerhard Gappmeier; +Cc: gdb-patches

On Mon, Mar 17, 2014 at 1:49 AM, Gerhard Gappmeier
<gerhard.gappmeier@ascolab.com> wrote:
> On Sunday, March 16, 2014 09:22:17 AM Doug Evans wrote:
>> On Sat, Mar 15, 2014 at 7:34 PM, Doug Evans <dje@google.com> wrote:
>> > Note that one concern I have is that it may be that some sites will
>> > want to have some of gdb's state updated when source files are
>> > automagically fetched.  E.g., maybe one would want to update the
>> > source search path.  Maybe not, but at any rate I don't want this
>> > feature to preclude doing things like that, and one can't do that if
>> > the feature works by running an external program via popen.
>>
>> As a data point,
>> another way to go is to just have a convention for some global
>> variables in the binary.
>> With the debug info gdb can access them, and they could contain
>> everything that would be in the .note section.
>>
>> I don't have a preference, per se.
>> I just mention it as a possibility, and if one went that route then
>> doing this in Python/Guile would be while perhaps not required
>> certainly easy.
> That's an interesting idea. When I first read this comment I thought it would
> require code changes what would not be what I want. But indeed we can simply
> generate an own 'vcsinfo.c' file which gets compiled and linked with the
> executable. I think its even simpler to add a new C file than requiring GNU as
> to generate a new section.
>
> example vcsinfo.c:
> /* this file was genarated, bla bla, don't modifiy */
> static const char vcs_type[] = "git";
> static const char vcs_url[] = "git@github.com:gergap/source-id.git"
> static const char vcs_version[] = "c2ec66e6a36451ba47422d186fd97311989ef278"

I think it's weird to store this in .rodata instead of somewhere it can
be easily stripped, especially if you plan on adding the sha1 file
hashes through this same mechanism, since those are less constant in
size, though you did mention adding that to the debug info
specifically.



* Re: New feature "source-id"
  2014-03-17 12:25           ` Matt Rice
@ 2014-03-17 19:01             ` Gerhard Gappmeier
  2014-03-18  0:25               ` Doug Evans
  0 siblings, 1 reply; 22+ messages in thread
From: Gerhard Gappmeier @ 2014-03-17 19:01 UTC (permalink / raw)
  To: gdb-patches

On Monday, March 17, 2014 05:25:45 AM you wrote:
> On Mon, Mar 17, 2014 at 1:49 AM, Gerhard Gappmeier
> 
> <gerhard.gappmeier@ascolab.com> wrote:
> > On Sunday, March 16, 2014 09:22:17 AM Doug Evans wrote:
> >> On Sat, Mar 15, 2014 at 7:34 PM, Doug Evans <dje@google.com> wrote:
> >> > Note that one concern I have is that it may be that some sites will
> >> > want to have some of gdb's state updated when source files are
> >> > automagically fetched.  E.g., maybe one would want to update the
> >> > source search path.  Maybe not, but at any rate I don't want this
> >> > feature to preclude doing things like that, and one can't do that if
> >> > the feature works by running an external program via popen.
> >> 
> >> As a data point,
> >> another way to go is to just have a convention for some global
> >> variables in the binary.
> >> With the debug info gdb can access them, and they could contain
> >> everything that would be in the .note section.
> >> 
> >> I don't have a preference, per se.
> >> I just mention it as a possibility, and if one went that route then
> >> doing this in Python/Guile would be while perhaps not required
> >> certainly easy.
> > 
> > That's an interesting idea. When I first read this comment I thought it
> > would require code changes what would not be what I want. But indeed we
> > can simply generate an own 'vcsinfo.c' file which gets compiled and
> > linked with the executable. I think its even simpler to add a new C file
> > than requiring GNU as to generate a new section.
> > 
> > example vcsinfo.c:
> > /* this file was genarated, bla bla, don't modifiy */
> > static const char vcs_type[] = "git";
> > static const char vcs_url[] = "git@github.com:gergap/source-id.git"
> > static const char vcs_version[] =
> > "c2ec66e6a36451ba47422d186fd97311989ef278"
> I think its weird to store this in .rodata instead of somewhere it can
> be easily stripped, especially if you plan on adding the sha1 file
> hashes through this same mechanism, since that is a less constant
> size, though you did mention adding that to the debug info
> specifically.
I agree. That's a good point. I think we should stay with the original idea of 
having a .note section. It is also more consistent with the build-id feature.

Another argument against adding this to the source might be code size. For 
small programs on embedded devices memory matters, so saving these strings 
would be a benefit. The .note section can be stripped and the feature would 
still work with the "separate-debug-info" approach.



* Re: New feature "source-id"
  2014-03-17 19:01             ` Gerhard Gappmeier
@ 2014-03-18  0:25               ` Doug Evans
  2014-03-18  0:48                 ` Bruce Dawson
  0 siblings, 1 reply; 22+ messages in thread
From: Doug Evans @ 2014-03-18  0:25 UTC (permalink / raw)
  To: Gerhard Gappmeier; +Cc: gdb-patches

On Mon, Mar 17, 2014 at 12:01 PM, Gerhard Gappmeier
<gerhard.gappmeier@ascolab.com> wrote:
>> > example vcsinfo.c:
>> > /* this file was genarated, bla bla, don't modifiy */
>> > static const char vcs_type[] = "git";
>> > static const char vcs_url[] = "git@github.com:gergap/source-id.git"
>> > static const char vcs_version[] =
>> > "c2ec66e6a36451ba47422d186fd97311989ef278"
>> I think its weird to store this in .rodata instead of somewhere it can
>> be easily stripped, especially if you plan on adding the sha1 file
>> hashes through this same mechanism, since that is a less constant
>> size, though you did mention adding that to the debug info
>> specifically.
> I agree. That's a good point. I think we should stay with the original idea of
> having a .note section. It is also more consistent with the build-id feature.

I agree the consistency of .note is nice, but I wouldn't preclude
people wanting something different.
Getting something into a .note section may involve more build changes
than some group may want to take on.

> Another argument against adding this to the source might be code size. For
> small programs on embedded devices memory matters, so saving these strings
> would be a benefit. The .note section can be stripped and the feature would
> still work with the "separate-debug-info" approach.

Technically, even if the info was added to the source (so to speak),
it needn't affect code size.
I can imagine all of these (so called) global variables being put in a
specific section which is put in a non-loadable segment.

The solution in gdb needn't preclude any implementation, that is up to
the script.
So, assuming the community wants this feature, let's separate out how
the source information is obtained from how gdb uses it.

btw, If Python doesn't have a library for reading ELF files, it should.
Thus we needn't hardcode anything about where the data lives into gdb
- leave it to the externally supplied script.



* RE: New feature "source-id"
  2014-03-18  0:25               ` Doug Evans
@ 2014-03-18  0:48                 ` Bruce Dawson
  2014-03-18  1:39                   ` Doug Evans
  0 siblings, 1 reply; 22+ messages in thread
From: Bruce Dawson @ 2014-03-18  0:48 UTC (permalink / raw)
  To: Doug Evans, Gerhard Gappmeier; +Cc: gdb-patches

Hi, I thought I'd chime in since I'm the one who suggested this idea (at Steam Dev Days), and I have a (cruder, not worth sharing) version of this which we have been using for about a year, so I speak from experience.

The size issue is not really a code-size issue but a file-size issue. Traditionally the debug information has been in strippable sections so that non-developer users don't need to pay any price (download bandwidth, storage) for debug information which they don't care about. I think that the source-id information falls into the same category. A simple hello world program could end up with many KB of source-id information and it would be a shame to have that in the data segment, loaded or not. Stripped ELF files don't have debug information, and they shouldn't have source-id information, philosophically and practically.

Getting something into the .note section is done in our build system by using objcopy --add-section. For some reason we do this in our build pipeline *after* we've stripped the symbols, so we actually add the section to our .dbg file rather than the original .so file. We then have to remove and recreate the .gnu_debuglink. Adding the .note section to the original .so file would be cleaner since then the normal stripping steps would just handle it like other debug information. I think I did it wrong because it was simpler, but it might have been through ignorance.

The way our workflow works is that after we have done a build we use objdump -Wl (slightly customized to reduce the volume of information) to get a list of all of the source files used to create a shared object. Then we query our source-control system for the version numbers associated with the files (we use Perforce). We then inject this data (a mapping from each local file system path to a Perforce path/version information) into the custom section. We have a hacky system for getting gdb to find the files, and we are looking forward to Gerhard's superior system.
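
As an illustration of the injected data, a mapping from local paths to depot 
paths and revisions could be serialized as in the Python sketch below. The 
tab-separated text format and the function names are assumptions made for this 
example, not the format we actually use; only the "path#revision" notation is 
standard Perforce syntax.

```python
def encode_mapping(mapping):
    """Serialize {local_path: (depot_path, revision)} to bytes.

    Entries are sorted so the same input always gives the same bytes.
    """
    lines = ["%s\t%s#%d" % (local, depot, rev)
             for local, (depot, rev) in sorted(mapping.items())]
    return ("\n".join(lines) + "\n").encode("utf-8")


def decode_mapping(payload):
    """Inverse of encode_mapping."""
    mapping = {}
    for line in payload.decode("utf-8").splitlines():
        if not line:
            continue
        local, target = line.split("\t", 1)
        depot, rev = target.rsplit("#", 1)
        mapping[local] = (depot, int(rev))
    return mapping


example = {
    "/build/src/main.cpp": ("//depot/game/src/main.cpp", 12),
    "/build/src/util.h": ("//depot/game/src/util.h", 7),
}
payload = encode_mapping(example)
```

The resulting bytes would be written to a file and attached with 
objcopy --add-section <name>=<file>.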

I don't see how this would work with a source file. You can't easily know the set of input files (which includes header files, and source files from archive files) until the build has completed. If you put the mappings in a source file then you would need to compile that after the initial link and then relink. That would be inefficient. Otherwise, you would need a way to find the list of source files (including header files) prior to building. That sounds messy.

BTW, one thing to think about is that this system should produce reproducible results. The gcc way is that if your input files have not changed then you should get an identical output. With Perforce this automatically works -- the file version numbers only change if new versions of the files have been submitted. For VCS systems with a global version number (git, IIRC) some care must be taken. If the global version number is used to specify which versions of the files to retrieve then a check-in of an unrelated file will change the source-id information and make the binaries not match. I believe that would be bad. This doesn't affect Gerhard's code at all, but it does affect how a sample script to create the source-id information should work. 
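
The difference can be shown with a toy model in Python. The history is a 
made-up list of commits, and the function names are hypothetical; with git the 
per-file value would come from something like git log -1 --format=%H -- <path>.

```python
def last_change_per_file(history):
    """Map each path to the id of the last commit that touched it.

    history is a list of (commit_id, changed_paths), oldest first.
    """
    last = {}
    for commit_id, paths in history:
        for path in paths:
            last[path] = commit_id
    return last


def source_id_for(files, history):
    """Per-file source-id: sorted (path, last-changing commit) pairs."""
    last = last_change_per_file(history)
    return sorted((path, last[path]) for path in files)


history = [
    ("c1", ["main.c", "util.h"]),
    ("c2", ["docs/README"]),  # unrelated to the binary's inputs
]
inputs = ["util.h", "main.c"]
before = source_id_for(inputs, history[:1])
after = source_id_for(inputs, history)
```

Here the global version moves from "c1" to "c2", but the per-file source-id of 
the binary's inputs stays the same, so a rebuild would produce identical data.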

Finally, I think that a new command-line option to objdump to dump *just* source paths would be helpful. objdump -Wl is too slow -- something better is needed and I could not find anything. The patch to objdump would be trivial. Or a stand-alone tool could be created, although that is less tempting.

Thanks to Gerhard for pushing this through. I dream of a future where I can step through the code in (almost) any package and have symbols and source code magically show up on-demand.

Bruce Dawson, Valve


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: New feature "source-id"
  2014-03-18  0:48                 ` Bruce Dawson
@ 2014-03-18  1:39                   ` Doug Evans
  2014-03-18 17:44                     ` Bruce Dawson
  0 siblings, 1 reply; 22+ messages in thread
From: Doug Evans @ 2014-03-18  1:39 UTC (permalink / raw)
  To: Bruce Dawson; +Cc: Gerhard Gappmeier, gdb-patches

On Mon, Mar 17, 2014 at 5:48 PM, Bruce Dawson <bruced@valvesoftware.com> wrote:
> Hi, I thought I'd chime in since I'm the one who suggested this idea (at Steam Dev Days), and I have a (cruder, not worth sharing) version of this which we have been using for about a year, so I speak from experience.
>
> The size issue is not really a code-size issue but a file-size issue. Traditionally the debug information has been in strippable sections so that non-developer users don't need to pay any price (download bandwidth, storage) for debug information which they don't care about. I think that the source-id information falls into the same category. A simple hello world program could end up with many KB of source-id information and it would be a shame to have that in the data segment, loaded or not. Stripped ELF files don't have debug information, and they shouldn't have source-id information, philosophically and practically.
>
> Getting something into the .note section is done in our build system by using objcopy --add-section. For some reason we do this in our build pipeline *after* we've stripped the symbols so we actually add the section to our .dbg file rather than the original .so file. We then have to remove and recreate the .gnu_debuglink. Adding the .note section to the original .so file would be cleaner since then the normal stripping steps would just handle it like other debug information. I think I did it wrong because it was simpler, but it might have been through ignorance.

It might be serendipitous that you do it afterwards.
Note that strip doesn't remove .note.gnu.build-id.
I'm guessing there's a way to mark a .note section as strippable, but
I only skimmed binutils.
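
For reference, a minimal sketch of what the contents of such a .note section look like on disk: three 32-bit words (namesz, descsz, type), then the NUL-terminated owner name and the descriptor, each padded to a 4-byte boundary. The note type value and the NUL-separated payload used in the test are assumptions made purely for illustration; nothing in this thread assigns them, and the 4-byte padding is assumed to match what GNU notes such as build-id use.

```python
import struct

# Hypothetical note type, for illustration only; no value was assigned in this thread.
NT_SOURCE_ID_EXAMPLE = 0x100


def _pad4(data: bytes) -> bytes:
    """Pad with NUL bytes up to the next 4-byte boundary."""
    return data + b"\x00" * (-len(data) % 4)


def build_note(name: str, note_type: int, desc: bytes) -> bytes:
    """Encode one little-endian ELF note: namesz, descsz, type, name, desc."""
    name_bytes = name.encode("ascii") + b"\x00"  # namesz counts the terminating NUL
    header = struct.pack("<III", len(name_bytes), len(desc), note_type)
    return header + _pad4(name_bytes) + _pad4(desc)


def parse_note(blob: bytes):
    """Decode a single note as produced by build_note()."""
    namesz, descsz, note_type = struct.unpack_from("<III", blob, 0)
    name_start = 12  # size of the three-word header
    desc_start = name_start + namesz + (-namesz % 4)
    name = blob[name_start:name_start + namesz - 1].decode("ascii")
    desc = blob[desc_start:desc_start + descsz]
    return name, note_type, desc
```

The bytes returned by build_note() could be written to a file and attached with objcopy --add-section, as described in the quoted workflow.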

> In our workflow, after we have done a build we use objdump -Wl (slightly customized to reduce the volume of information) to get a list of all of the source files used to create a shared object. Then we query our source-control system for the version numbers associated with the files (we use Perforce). We then inject this data (a mapping from each local file system path to a Perforce path/version information) into the custom section. We have a hacky system for getting gdb to find the files, and we are looking forward to Gerhard's superior system.
>
> I don't see how this would work with a source file. You cannot easily know the set of input files (which includes header files, and source files from archive files) until the build has completed. If you put the mappings in a source file then you would need to compile that after the initial link and then relink. That would be inefficient. So, you would need a way to find the list of source files (including header files) prior to building. That sounds messy.

I can imagine generating a .c file (say, could even be .S) from the
post-link binary that contains source information, compile it, and
then using objcopy to add the resultant section with the source
information into the resultant binary.  The key here being that source
file information gets put in a specific section so it's easy to do
this.  If one is splitting the debug information into a separate file,
one could objcopy --add-section the source-file-section there instead.
 There's no real "relinking" here.  I don't see the difference with
what you're doing now.
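
A rough sketch of the generation step described above, assuming the source information has already been collected: it renders a small C file whose variables are placed in a dedicated section via GCC's section attribute. The section name ".source_id" and the helper names are hypothetical, chosen only for this illustration.

```python
def c_string_literal(value: str) -> str:
    """Quote a Python string as a C string literal (backslash and quote escaped)."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def render_vcsinfo(vcs_type: str, vcs_url: str, vcs_version: str,
                   section: str = ".source_id") -> str:
    """Return C source that keeps the VCS strings in their own section.

    The section name is an assumption for this sketch. The 'used' attribute
    keeps the compiler from discarding the otherwise unreferenced statics.
    """
    attr = f"__attribute__((section({c_string_literal(section)}), used))"
    lines = ["/* generated file, do not modify */"]
    for ident, value in (("vcs_type", vcs_type),
                         ("vcs_url", vcs_url),
                         ("vcs_version", vcs_version)):
        lines.append(
            f"static const char {ident}[] {attr} = {c_string_literal(value)};")
    return "\n".join(lines) + "\n"
```

After compiling the generated file, the section could be extracted from the object with objcopy -O binary --only-section and then attached to the binary (or to the separate debug file) with objcopy --add-section, so no relink is involved.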

> BTW, one thing to think about is that this system should produce reproducible results. The gcc way is that if your input files have not changed then you should get an identical output. With Perforce this automatically works -- the file version numbers only change if new versions of the files have been submitted. For VCS systems with a global version number (git, IIRC) some care must be taken. If the global version number is used to specify which versions of the files to retrieve then a check-in of an unrelated file will change the source-id information and make the binaries not match. I believe that would be bad. This doesn't affect Gerhard's code at all, but it does affect how a sample script to create the source-id information should work.
>
> Finally, I think that a new command-line option to objdump to dump *just* source paths would be helpful. objdump -Wl is too slow -- something better is needed and I could not find anything. The patch to objdump would be trivial. Or a stand-alone tool could be created, although that is less tempting.
>
> Thanks to Gerhard for pushing this through. I dream of a future where I can step through the code in (almost) any package and have symbols and source code magically show up on-demand.
>
> Bruce Dawson, Valve
>
> -----Original Message-----
> From: gdb-patches-owner@sourceware.org [mailto:gdb-patches-owner@sourceware.org] On Behalf Of Doug Evans
> Sent: Monday, March 17, 2014 5:26 PM
> To: Gerhard Gappmeier
> Cc: gdb-patches
> Subject: Re: New feature "source-id"
>
> On Mon, Mar 17, 2014 at 12:01 PM, Gerhard Gappmeier <gerhard.gappmeier@ascolab.com> wrote:
>>> > example vcsinfo.c:
>>> > /* this file was generated, bla bla, don't modify */
>>> > static const char vcs_type[] = "git";
>>> > static const char vcs_url[] = "git@github.com:gergap/source-id.git";
>>> > static const char vcs_version[] = "c2ec66e6a36451ba47422d186fd97311989ef278";
>>> I think it's weird to store this in .rodata instead of somewhere it
>>> can be easily stripped, especially if you plan on adding the sha1
>>> file hashes through this same mechanism, since that is a less
>>> constant size, though you did mention adding that to the debug info
>>> specifically.
>> I agree. That's a good point. I think we should stay with the original
>> idea of having a .note section. It is also more consistent with the build-id feature.
>
> I agree the consistency of .note is nice, but I wouldn't preclude people wanting something different.
> Getting something into a .note section may involve more build changes than some group may want to take on.
>
>> Another argument against adding this to the source might be code size.
>> For small programs on embedded devices memory matters, so saving these
>> strings would be a benefit. The .note section can be stripped and the
>> feature would still work with the "separate-debug-info" approach.
>
> Technically, even if the info was added to the source (so to speak), it needn't affect code size.
> I can imagine all of these (so called) global variables being put in a specific section which is put in a non-loadable segment.
>
> The solution in gdb needn't preclude any implementation, that is up to the script.
> So, assuming the community wants this feature, let's separate out how the source information is obtained from how gdb uses it.
>
> btw, If Python doesn't have a library for reading ELF files, it should.
> Thus we needn't hardcode anything about where the data lives into gdb
> - leave it to the externally supplied script.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: New feature "source-id"
  2014-03-15 10:49 New feature "source-id" Gerhard Gappmeier
  2014-03-15 17:32 ` Doug Evans
@ 2014-03-18 13:22 ` Mark Wielaard
  2014-03-18 14:00   ` Gerhard Gappmeier
  1 sibling, 1 reply; 22+ messages in thread
From: Mark Wielaard @ 2014-03-18 13:22 UTC (permalink / raw)
  To: Gerhard Gappmeier; +Cc: gdb-patches

Hi Gerhard,

On Sat, 2014-03-15 at 11:49 +0100, Gerhard Gappmeier wrote:
> The idea is that when you need to debug an executable or opening a coredump of 
> an executable that was built with "source-id" and "build-id" enabled it just 
> works. You don't need to care about how and where to get debug symbols and the 
> correct sources.

I was wondering how this would work together with what distros like
Fedora do right now to solve this same issue of finding the
corresponding source files.

Distros, at least those based on rpm, rely on the build-id and DWARF
debug information. For each executable/library they record the build-id
and strip the symbol table and debug information into a separate .debug
file. The debug Compile Unit and DWARF line table reference the source
files used to build the executable file. These files are collected and
put under /usr/src/debug/<package-foo>/.... Then they run debugedit [1]
on the .debug files to replace all file references to the files
under /usr/src/debug/... Both the .debug files (placed
under /usr/lib/debug/<package-foo>) and the source files are then
bundled together in the <package-foo>-debuginfo.rpm (including the
necessary build-id directories).

That way you can use the build-id from the ELF note section to retrieve
both the separate .debug files and the corresponding source files. And
on my distro gdb even helpfully suggests how to do this:
Missing separate debuginfos, use: debuginfo-install at-3.1.13-14.fc20.x86_64
Which will then fetch the debuginfo package and all dependencies so gdb
can find the .debug files and the corresponding source code those .debug
files refer to. I don't know if the debuginfo-install suggestion is
upstream or only in the distro package of gdb.

> * We need to make the new section ".note.gnu.source-id" official. I don't know 
> who maintains this and this needs to be registered somewhere.
> [...]
> * adding file hashes (SHA1) for each source file to the debug info. This way 
> we can completely remove the mtime check and replace it with a check of the 
> SHA1 sum. When we can replace the existing warning with a message like "The 
> source file does not match the executable."

For DWARF5 there is a proposal to add the MD5 digest to debug-line file
table: http://dwarfstd.org/ShowIssue.php?issue=130701.1

Would that be a good alternative location to store the hash of the
source file?

Cheers,

Mark

[1] http://rpm.org/gitweb?p=rpm.git;a=blob;f=tools/debugedit.c;hb=HEAD

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: New feature "source-id"
  2014-03-18 13:22 ` Mark Wielaard
@ 2014-03-18 14:00   ` Gerhard Gappmeier
  2014-03-18 15:03     ` Mark Wielaard
  0 siblings, 1 reply; 22+ messages in thread
From: Gerhard Gappmeier @ 2014-03-18 14:00 UTC (permalink / raw)
  To: gdb-patches

[-- Attachment #1: Type: text/plain, Size: 3762 bytes --]

On Tuesday, March 18, 2014 02:22:04 PM you wrote:
> Hi Gerhard,
> 
> On Sat, 2014-03-15 at 11:49 +0100, Gerhard Gappmeier wrote:
> > The idea is that when you need to debug an executable or opening a
> > coredump of an executable that was built with "source-id" and "build-id"
> > enabled it just works. You don't need to care about how and where to get
> > debug symbols and the correct sources.
> 
> I was wondering how this would work together with what distros like
> Fedora do right now to solve this same issue of finding the
> corresponding source files.
> 
> Distros, at least those based on rpm, rely on the build-id and DWARF
> debug information. For each executable/library they record the build-id
> and strip the symbol table and debug information in a separate .debug
> file. The debug Compile Unit and DWARF line table reference the source
> files used to build the executable file. These files are collected and
> put under /usr/src/debug/<package-foo>/.... Then they run debugedit [1]
> on the .debug files to replace all file references to the files
> under /usr/src/debug/... Both the .debug files (placed
> under /usr/lib/debug/<package-foo>) and the source files are then
> bundled together in the <package-foo>-debuginfo.rpm (including the
> necessary build-id directories).
> 
> That way you can use the build-id from the ELF note section to retrieve
> both the separate .debug files and the corresponding source files. And
> on my distro gdb even helpfully suggests how to do this:
> Missing separate debuginfos, use: debuginfo-install at-3.1.13-14.fc20.x86_64
> Which will then fetch the debuginfo package and all dependencies so gdb can
> find the .debug files and the corresponding source code those .debug files
> refer to. I don't know if the debuginfo-install suggestion is upstream or
> only in the distro package of gdb.

If I understood this right, this means that whenever a piece of software is
built, the sources get archived with the debug symbols in a debuginfo RPM file.
This way the build-id is all you need to get the correct sources and debug
symbols.

However, my idea is somewhat different and a little bit smarter IMO:
* The SHA1 id of a git repo gets stored in the source-id meta info when
building.
* There is no need to archive the source files in RPM, deb, tar.gz or zip
files. We have them already in the version control system and we don't want to
duplicate the data.
* This solution is independent of any package format.
* You can analyze coredumps of executables that you don't have on your system.
There is no need to install any RPM package for that. This way you can analyze
e.g. a crash within an Ubuntu package on a Fedora system.
* The fetch-script fetches only the sources required by GDB, not the complete
project.

> 
> > * We need to make the new section ".note.gnu.source-id" official. I don't
> > know who maintains this and this needs to be registered somewhere.
> > [...]
> > * adding file hashes (SHA1) for each source file to the debug info. This
> > way we can completely remove the mtime check and replace it with a check
> > of the SHA1 sum. When we can replace the existing warning with a message
> > like "The source file does not match the executable."
> 
> For DWARF5 there is a proposal to add the MD5 digest to debug-line file
> table: http://dwarfstd.org/ShowIssue.php?issue=130701.1
> 
> Would that be a good alternative location to store the hash of the
> source file?
That's exactly what I proposed. The only difference is that I proposed SHA1
instead of MD5, but this doesn't matter.
If this is already in the DWARF standard we should use this feature and not
reinvent the wheel.
Thanks for this hint.
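
To illustrate the hash check that would replace the mtime check, here is a minimal sketch assuming the expected digest has already been read from the debug info. How the digest is stored and retrieved is not covered here, and the function names are hypothetical.

```python
import hashlib


def file_digest(path: str, algorithm: str = "md5", chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large sources need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_source(path: str, expected_hex: str, algorithm: str = "md5"):
    """Return a warning string on mismatch, or None when the file matches."""
    if file_digest(path, algorithm) != expected_hex.lower():
        return "The source file does not match the executable."
    return None
```

The algorithm is a parameter, so the same check works for either MD5 or SHA1.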
> 
> Cheers,
> 
> Mark
> 
> [1] http://rpm.org/gitweb?p=rpm.git;a=blob;f=tools/debugedit.c;hb=HEAD


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: New feature "source-id"
  2014-03-18 14:00   ` Gerhard Gappmeier
@ 2014-03-18 15:03     ` Mark Wielaard
  2014-03-18 16:40       ` Gerhard Gappmeier
  0 siblings, 1 reply; 22+ messages in thread
From: Mark Wielaard @ 2014-03-18 15:03 UTC (permalink / raw)
  To: Gerhard Gappmeier; +Cc: gdb-patches

On Tue, 2014-03-18 at 15:00 +0100, Gerhard Gappmeier wrote:
> On Tuesday, March 18, 2014 02:22:04 PM you wrote:
> > On Sat, 2014-03-15 at 11:49 +0100, Gerhard Gappmeier wrote:
> > That way you can use the build-id from the ELF note section to retrieve
> > both the separate .debug files and the corresponding source files. And
> > on my distro gdb even helpfully suggests how to do this:
> > Missing separate debuginfos, use: debuginfo-install at-3.1.13-14.fc20.x86_64
> > Which will then fetch the debuginfo package and all dependencies so gdb can
> > find the .debug files and the corresponding source code those .debug files
> > refer to. I don't know if the debuginfo-install suggestion is upstream or
> > only in the distro package of gdb.
> 
> If I understood this right, this means whenever a software is built the 
> sources get archived with the debug symbols in an debuginfo RPM file.
> This way the build-id is all you need to get the correct sources and debug 
> symbols.

Indeed. Just turn the build-id into the package, either through
something like https://darkserver.fedoraproject.org/ or through
yum install /usr/lib/debug/.build-id/b7/07011ecdbd5bcb1fad73cdc9b4433c791d8328.debug
or just through debuginfo-install, and you get both the .debug files and
all source files that the .debug file refers to.
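
The mapping from a build-id to the .debug path used above can be sketched as follows: the first two hex digits become a directory and the remainder becomes the file name. The function name is hypothetical; the layout follows the example path in this message.

```python
from pathlib import PurePosixPath


def build_id_debug_path(build_id: str, debug_root: str = "/usr/lib/debug") -> str:
    """Map a hex build-id to <debug_root>/.build-id/xx/yyyy....debug."""
    hex_id = build_id.strip().lower()
    if len(hex_id) < 3 or any(ch not in "0123456789abcdef" for ch in hex_id):
        raise ValueError(f"not a usable hex build-id: {build_id!r}")
    # First two hex digits form the directory, the rest the file name.
    return str(PurePosixPath(debug_root) / ".build-id" / hex_id[:2]
               / (hex_id[2:] + ".debug"))
```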

> However my idea is somewhat different and a little bit smarter IMO:
> * The SHA1 id of a git repo gets stored in the source-id meta info when 
> building.
> * There is no need of archiving the source files in RPM, deb, tar.gz or zip 
> files. We have them already in the version control system and we don't want to 
> duplicate the data
> * This solution is independent from any package format.
> * You can analyze coredumps of executables that you don't have on your system. 
> There is no need to install any RPM package for that. This way you can analyze 
> e.g. a crash within a Ubuntu package on a Fedora system.
> * The fetch-script fetches only the sources required by GDB, not the complete 
> project.

Some of those features are already possible with the way distros package
the debuginfo files. But your way might indeed be more flexible. I am
mostly wondering how to take advantage of the way distros do it
currently in your scheme. How do you describe the default distro setup
and how do you make sure not to duplicate the storage of source files?

One difference with your scheme is that the distros package the
post-processed source files. That means they are the actual files,
however generated, that the compiler compiled to object code. Not
necessarily the pristine source files. That is so in a debugger you can
step through the source file as seen by the compiler (e.g. it will
include source files generated by configure or the lex and yacc
generated files that the compiler builds).

> > > * We need to make the new section ".note.gnu.source-id" official. I don't
> > > know who maintains this and this needs to be registered somewhere.
> > > [...]
> > > * adding file hashes (SHA1) for each source file to the debug info. This
> > > way we can completely remove the mtime check and replace it with a check
> > > of the SHA1 sum. When we can replace the existing warning with a message
> > > like "The source file does not match the executable."
> > 
> > For DWARF5 there is a proposal to add the MD5 digest to debug-line file
> > table: http://dwarfstd.org/ShowIssue.php?issue=130701.1
> > 
> > Would that be a good alternative location to store the hash of the
> > source file?
> That's exactly what I proposed. Only that I proposed SHA1 instead of MD5, but 
> this doesn't matter.
> If this is already in the DWARD standard we should use this feature and don't 
> reinvent the wheel.

It is currently just a proposal for DWARF5. The proposal deadline is end
of this month. I just reviewed that proposal and saw that it is not very
extensible, so I suggested some additions. See the discussion here:
http://thread.gmane.org/gmane.comp.standards.dwarf/100

Cheers,

Mark

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: New feature "source-id"
  2014-03-18 15:03     ` Mark Wielaard
@ 2014-03-18 16:40       ` Gerhard Gappmeier
  2014-03-18 17:56         ` Bruce Dawson
  0 siblings, 1 reply; 22+ messages in thread
From: Gerhard Gappmeier @ 2014-03-18 16:40 UTC (permalink / raw)
  To: gdb-patches

[-- Attachment #1: Type: text/plain, Size: 5728 bytes --]

On Tuesday, March 18, 2014 04:03:11 PM you wrote:
> On Tue, 2014-03-18 at 15:00 +0100, Gerhard Gappmeier wrote:
> > On Tuesday, March 18, 2014 02:22:04 PM you wrote:
> > > On Sat, 2014-03-15 at 11:49 +0100, Gerhard Gappmeier wrote:
> > > That way you can use the build-id from the ELF note section to retrieve
> > > both the separate .debug files and the corresponding source files. And
> > > on my distro gdb even helpfully suggests how to do this:
> > > Missing separate debuginfos, use: debuginfo-install
> > > at-3.1.13-14.fc20.x86_64 Which will then fetch the debuginfo package
> > > and all dependencies so gdb can find the .debug files and the
> > > corresponding source code those .debug files refer to. I don't know if
> > > the debuginfo-install suggestion is upstream or only in the distro
> > > package of gdb.
> > 
> > If I understood this right, this means whenever a software is built the
> > sources get archived with the debug symbols in an debuginfo RPM file.
> > This way the build-id is all you need to get the correct sources and debug
> > symbols.
> 
> Indeed. Just turn the build-id into the package, either through
> something like https://darkserver.fedoraproject.org/ or through yum
> install
> /usr/lib/debug/.build-id/b7/07011ecdbd5bcb1fad73cdc9b4433c791d8328.debug or
> just through debuginfo-install and you get both the .debug files and all
> sources files that .debug file refers to.
> > However my idea is somewhat different and a little bit smarter IMO:
> > * The SHA1 id of a git repo gets stored in the source-id meta info when
> > building.
> > * There is no need of archiving the source files in RPM, deb, tar.gz or
> > zip
> > files. We have them already in the version control system and we don't
> > want to duplicate the data
> > * This solution is independent from any package format.
> > * You can analyze coredumps of executables that you don't have on your
> > system. There is no need to install any RPM package for that. This way
> > you can analyze e.g. a crash within a Ubuntu package on a Fedora system.
> > * The fetch-script fetches only the sources required by GDB, not the
> > complete project.
> 
> Some of those features are already possible with the way distros package
> the debuginfo files. But your way might indeed be more flexible. I am
> mostly wondering how to take advantage of the way distros do it
> currently in your scheme. How do you describe the default distro setup
> and how do you make sure not to duplicate the storage of source files?
> 
> One difference with your scheme is that the distros packages the
> post-processed source files. That means they are the actual files,
> however generated, that the compiler compiled to object code. Not
> necessarily the pristine source files. That is so in a debugger you can
> step through the source file as seen by the compiler (e.g. it will
> include source files generated by configure or the lex and yacc
> generated files that the compiler builds).
Having generated files (that are not in a VCS) available is indeed an advantage
of this concept.
However, this is very focused on sources that get packaged by a Linux
distribution.

But there is also the use case of proprietary software that does not get bundled
with your distribution. Vendors create their own installers or simply a
tar.gz file which gets installed in /opt/somewhere.
So there is no debuginfo package available in the package manager.
Companies could recreate this concept of creating debug packages, but I really
prefer to just fetch the sources from git.
That's the way I work today:
* Getting a problem report from a customer
* Hoping that the customer reports the correct version
* Search for a version tag in git which matches the reported version
* Checkout that version using git
The source-id simplifies that process. Just open the crashdump -> fetch 
separate debug info using build-id -> fetch source file that should be 
displayed in GDB using the fetch-script from an internal cgit web interface.

Our build server copies the separate debug info to an NFS share which I have 
mounted via /etc/fstab. So fetching symbols just works.
The sources are already in git and available via cgit web interface.
The missing part is just this "source-id" feature, then everything works out-
of-the-box.
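
As an illustration of the last step, a fetch script could turn the stored VCS data into a URL for a single file. This is a minimal sketch that assumes cgit's "plain" view with an "id" query parameter is enabled on the server; the base URL, repository name and function name are made up for the example and are not taken from the actual fetch scripts.

```python
from urllib.parse import quote, urlencode


def cgit_plain_url(base_url: str, repo: str, path: str, commit: str) -> str:
    """Build a URL for one file at one commit in cgit's plain view.

    Assumes the <base>/<repo>/plain/<path>?id=<commit> layout; adjust this
    if the server is configured differently.
    """
    file_part = quote(path.lstrip("/"))  # keeps '/' but escapes spaces etc.
    query = urlencode({"id": commit})
    return f"{base_url.rstrip('/')}/{repo.strip('/')}/plain/{file_part}?{query}"
```

Only the file that GDB wants to display is requested, which is the point of the approach: no checkout of the whole project is needed.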
> 
> > > > * We need to make the new section ".note.gnu.source-id" official. I
> > > > don't
> > > > know who maintains this and this needs to be registered somewhere.
> > > > [...]
> > > > * adding file hashes (SHA1) for each source file to the debug info.
> > > > This
> > > > way we can completely remove the mtime check and replace it with a
> > > > check
> > > > of the SHA1 sum. When we can replace the existing warning with a
> > > > message
> > > > like "The source file does not match the executable."
> > > 
> > > For DWARF5 there is a proposal to add the MD5 digest to debug-line file
> > > table: http://dwarfstd.org/ShowIssue.php?issue=130701.1
> > > 
> > > Would that be a good alternative location to store the hash of the
> > > source file?
> > 
> > That's exactly what I proposed. Only that I proposed SHA1 instead of MD5,
> > but this doesn't matter.
> > If this is already in the DWARD standard we should use this feature and
> > don't reinvent the wheel.
> 
> It is currently just a proposal for DWARF5. The proposal deadline is end
> of this month. I just reviewed that proposal and saw that it is not very
> extensible, so I suggested some additions. See the discussion here:
> http://thread.gmane.org/gmane.comp.standards.dwarf/100
This feature really makes sense independently of the source-id feature.
I'm really looking forward to seeing it accepted.

Cheers,
Gerhard
> 
> Cheers,
> 
> Mark


^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: New feature "source-id"
  2014-03-18  1:39                   ` Doug Evans
@ 2014-03-18 17:44                     ` Bruce Dawson
  2014-03-18 17:57                       ` Doug Evans
  0 siblings, 1 reply; 22+ messages in thread
From: Bruce Dawson @ 2014-03-18 17:44 UTC (permalink / raw)
  To: 'Doug Evans'; +Cc: Gerhard Gappmeier, gdb-patches

I better understand what you are doing with the .c file now. Previously I had assumed that you would be compiling it and then linking the .o file along with all of the others.

Now I am confused about *why* you are compiling it. If you have a text file that has the source information then why not just add it as a custom section directly? What is the value in compiling it into a .o file first?

> I can imagine generating a .c file (say, could even be .S) from the post-link binary that contains source information

Don't forget that this text file is generated from both the source files listed in the post-link binary *and* from the version control information for these files.

-----Original Message-----
From: Doug Evans [mailto:dje@google.com] 
Sent: Monday, March 17, 2014 6:39 PM
To: Bruce Dawson
Cc: Gerhard Gappmeier; gdb-patches
Subject: Re: New feature "source-id"

...
I can imagine generating a .c file (say, could even be .S) from the post-link binary that contains source information, compile it, and then using objcopy to add the resultant section with the source information into the resultant binary.  The key here being that source file information gets put in a specific section so it's easy to do this.  If one is splitting the debug information into a separate file, one could objcopy --add-section the source-file-section there instead.
 There's no real "relinking" here.  I don't see the difference with what you're doing now.
...

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: New feature "source-id"
  2014-03-18 16:40       ` Gerhard Gappmeier
@ 2014-03-18 17:56         ` Bruce Dawson
  2014-05-21 19:30           ` Tom Tromey
  0 siblings, 1 reply; 22+ messages in thread
From: Bruce Dawson @ 2014-03-18 17:56 UTC (permalink / raw)
  To: 'Gerhard Gappmeier', gdb-patches

I understand that some Linux distributions already make source packages for each package that they distribute, and this technique offers some unique advantages.

However, this is orthogonal to the source-id proposal. Source-ids offer different, complementary value.

Our build system spits out dozens of builds a day. Some of these are run by developers, others by testers, and others by customers. Any one of them might crash. I might end up debugging (live debugging or a core file) any one of these builds, perhaps weeks after it was created. Because we have the source-id system set up I know that I can walk up and down the stack and have the source files automatically show up, with *zero* effort on my part. I don't have to install source packages, and I can have multiple core files from multiple versions loaded simultaneously. Only the source files that I need are downloaded so it is *extremely* efficient. Retrieving the needed source files is essentially instantaneous and requires zero developer effort.

A source package to go with every package has some advantages, such as getting all of the generated files. However it is a very heavyweight solution because it requires retrieving thousands of files from dozens of shared objects in order to look at a crash. In comparison, the source-id solution requires retrieving exactly the set of files that are needed to view the functions on the back-trace so it is extremely lightweight.

Another advantage to the source-id solution is it actually tells you the versions of the files. When I get a crash I can look at the source-id information and see that, for instance, it was built with foo.cpp#17 (version 17 of foo.cpp). That information is literally embedded in the source-id section. I can then look at that file in my VCS client and see if there is a newer version, perhaps with a fix. Doing this with a source package is more cumbersome because (if I understand correctly) you are getting a copy of the source file rather than a reference to a particular version.

A source package is like copying the files, whereas source-id is like having a symbolic link to the files.

I'm not suggesting that the source-id solution is better than a source package, I'm just saying that they are orthogonal. They support different work flows. They can co-exist perfectly. 
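
To make the per-file version idea concrete for a VCS with a global revision, here is a minimal sketch that assumes git rather than Perforce: each file is identified by its git blob id, which depends only on that file's contents, so a check-in of an unrelated file does not change the recorded mapping. The helper names are hypothetical.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the id git assigns to a blob: sha1("blob <size>\\0" + content)."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def build_source_map(files: dict) -> dict:
    """Map each source path to a per-file id, in a stable (sorted) order."""
    return {path: git_blob_id(data) for path, data in sorted(files.items())}
```

A mapping built this way plays the same role as foo.cpp#17 in the Perforce case: it names one exact version of one file, and it stays identical across builds as long as the file itself is unchanged.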

-----Original Message-----
From: gdb-patches-owner@sourceware.org [mailto:gdb-patches-owner@sourceware.org] On Behalf Of Gerhard Gappmeier
Sent: Tuesday, March 18, 2014 9:41 AM
To: gdb-patches@sourceware.org
Subject: Re: New feature "source-id"

On Tuesday, March 18, 2014 04:03:11 PM you wrote:
> On Tue, 2014-03-18 at 15:00 +0100, Gerhard Gappmeier wrote:
> > On Tuesday, March 18, 2014 02:22:04 PM you wrote:
> > > On Sat, 2014-03-15 at 11:49 +0100, Gerhard Gappmeier wrote:
> > > That way you can use the build-id from the ELF note section to 
> > > retrieve both the separate .debug files and the corresponding 
> > > source files. And on my distro gdb even helpfully suggests how to do this:
> > > Missing separate debuginfos, use: debuginfo-install
> > > at-3.1.13-14.fc20.x86_64 Which will then fetch the debuginfo 
> > > package and all dependencies so gdb can find the .debug files and 
> > > the corresponding source code those .debug files refer to. I don't 
> > > know if the debuginfo-install suggestion is upstream or only in 
> > > the distro package of gdb.
> > 
> > If I understood this right, this means whenever a software is built 
> > the sources get archived with the debug symbols in an debuginfo RPM file.
> > This way the build-id is all you need to get the correct sources and 
> > debug symbols.
> 
> Indeed. Just turn the build-id into the package, either through 
> something like https://darkserver.fedoraproject.org/ or through yum 
> install 
> /usr/lib/debug/.build-id/b7/07011ecdbd5bcb1fad73cdc9b4433c791d8328.debug
> or just through debuginfo-install and you get both the .debug files 
> and all sources files that .debug file refers to.
> > However my idea is somewhat different and a little bit smarter IMO:
> > * The SHA1 id of a git repo gets stored in the source-id meta info 
> > when building.
> > * There is no need of archiving the source files in RPM, deb, tar.gz 
> > or zip files. We have them already in the version control system and 
> > we don't want to duplicate the data
> > * This solution is independent from any package format.
> > * You can analyze coredumps of executables that you don't have on 
> > your system. There is no need to install any RPM package for that. 
> > This way you can analyze e.g. a crash within a Ubuntu package on a Fedora system.
> > * The fetch-script fetches only the sources required by GDB, not the 
> > complete project.
> 
> Some of those features are already possible with the way distros 
> package the debuginfo files. But your way might indeed be more 
> flexible. I am mostly wondering how to take advantage of the way 
> distros do it currently in your scheme. How do you describe the 
> default distro setup and how do you make sure not to duplicate the storage of source files?
> 
> One difference with your scheme is that the distros package the
> post-processed source files. That means they are the actual files, 
> however generated, that the compiler compiled to object code. Not 
> necessarily the pristine source files. That is so in a debugger you 
> can step through the source file as seen by the compiler (e.g. it will 
> include source files generated by configure or the lex and yacc 
> generated files that the compiler builds).

Having generated files (that are not in a VCS) available is indeed an advantage of this concept.
However, it is very focused on sources that get packaged by a Linux distribution.

But there is also the use case of proprietary software that is not bundled with your distribution. Vendors create their own installers or simply ship a tar.gz file which gets installed in /opt/somewhere.
So there is no debuginfo package available in the package manager.
Companies could recreate this concept of creating debug packages, but I really prefer to just fetch the sources from git.
This is how I work today:
* Get a problem report from a customer
* Hope that the customer reports the correct version
* Search for a version tag in git which matches the reported version
* Check out that version using git
The source-id simplifies that process: just open the crash dump -> fetch the separate debug info using the build-id -> fetch the source file that should be displayed in GDB, using the fetch-script, from an internal cgit web interface.

Our build server copies the separate debug info to an NFS share which I have mounted via /etc/fstab. So fetching symbols just works.
The sources are already in git and available via the cgit web interface.
The only missing part is this "source-id" feature; with it, everything works out of the box.
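
To make the two lookups above concrete, here is a minimal sketch in Python. It is not the fetch-script from the source-id repository. The NFS mount point, the cgit base URL and the repository name are made-up placeholders. The .build-id/xx/rest.debug layout is the one GDB searches below its debug-file-directory; the plain/<path>?id=<commit> URL form is how cgit usually serves raw files, but that should be checked against the local cgit setup.

```python
from urllib.parse import quote

# Assumptions for illustration only: adjust to the local setup.
DEBUG_ROOT = "/mnt/debuginfo"               # NFS share with separate debug info
CGIT_BASE = "https://git.example.com/cgit"  # internal cgit instance


def debug_file_for_build_id(build_id, root=DEBUG_ROOT):
    """Map a hex build-id to <root>/.build-id/<first 2 digits>/<rest>.debug."""
    build_id = build_id.lower()
    if len(build_id) < 3 or any(c not in "0123456789abcdef" for c in build_id):
        raise ValueError("build-id must be a hex string")
    return "{}/.build-id/{}/{}.debug".format(root, build_id[:2], build_id[2:])


def cgit_plain_url(repo, commit, path, base=CGIT_BASE):
    """Build the URL of a single file at a single commit in cgit's plain view."""
    return "{}/{}/plain/{}?id={}".format(
        base, repo, quote(path.lstrip("/")), commit)
```

A fetch script would download the URL returned by cgit_plain_url() for exactly the file GDB asks for, so only the sources that are actually displayed get transferred.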
> 
> > > > * We need to make the new section ".note.gnu.source-id" 
> > > > official. I don't know who maintains this and this needs to be 
> > > > registered somewhere.
> > > > [...]
> > > > * adding file hashes (SHA1) for each source file to the debug info.
> > > > This way we can completely remove the mtime check and replace it
> > > > with a check of the SHA1 sum. Then we can replace the existing
> > > > warning with a message like "The source file does not match the
> > > > executable."
> > > 
> > > For DWARF5 there is a proposal to add the MD5 digest to the
> > > debug-line file table:
> > > http://dwarfstd.org/ShowIssue.php?issue=130701.1
> > > 
> > > Would that be a good alternative location to store the hash of the 
> > > source file?
> > 
> > That's exactly what I proposed. Only that I proposed SHA1 instead of
> > MD5, but this doesn't matter.
> > If this is already in the DWARF standard we should use this feature
> > and not reinvent the wheel.
> 
> It is currently just a proposal for DWARF5. The proposal deadline is 
> end of this month. I just reviewed that proposal and saw that it is 
> not very extensible, so I suggested some additions. See the discussion here:
> http://thread.gmane.org/gmane.comp.standards.dwarf/100

This feature really makes sense independently of the source-id feature.
I'm really looking forward to seeing it accepted.
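
For illustration, the digest check that would replace the mtime comparison could look roughly like the sketch below. It only assumes that the debug info hands us a hex digest per source file; where that digest is stored (a DWARF line table entry or elsewhere) and which algorithm is used are left open, and the function names are made up.

```python
import hashlib


def file_digest(path, algorithm="sha1"):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def source_matches(path, recorded_digest, algorithm="sha1"):
    """True if the file on disk has the digest recorded at build time."""
    return file_digest(path, algorithm) == recorded_digest.lower()
```

If source_matches() returns False, the debugger could print "The source file does not match the executable." instead of the current mtime-based warning.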

Cheers,
Gerhard
> 
> Cheers,
> 
> Mark

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: New feature "source-id"
  2014-03-18 17:44                     ` Bruce Dawson
@ 2014-03-18 17:57                       ` Doug Evans
  0 siblings, 0 replies; 22+ messages in thread
From: Doug Evans @ 2014-03-18 17:57 UTC (permalink / raw)
  To: Bruce Dawson; +Cc: Gerhard Gappmeier, gdb-patches

On Tue, Mar 18, 2014 at 10:44 AM, Bruce Dawson <bruced@valvesoftware.com> wrote:
> I now better understand what you are doing with the .c file. Previously I had assumed that you would be compiling it and then linking the .o file along with all of the others.
>
> Now I am confused about *why* you are compiling it. If you have a text file that has the source information then why not just add it as a custom section directly? What is the value in compiling it into a .o file first?

More control over how the data can appear in the resulting section.
It's hardly an issue worth quibbling over.
I will leave it to the app developer / whatever to decide how best to
accomplish this.
The high order bit for me for the task at hand is that gdb doesn't need to care.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: New feature "source-id"
  2014-03-18 17:56         ` Bruce Dawson
@ 2014-05-21 19:30           ` Tom Tromey
  2014-05-21 20:42             ` Bruce Dawson
  0 siblings, 1 reply; 22+ messages in thread
From: Tom Tromey @ 2014-05-21 19:30 UTC (permalink / raw)
  To: Bruce Dawson; +Cc: 'Gerhard Gappmeier', gdb-patches

>>>>> "Bruce" == Bruce Dawson <bruced@valvesoftware.com> writes:

Bruce> I understand that some Linux distributions already make source
Bruce> packages for each package that they distribute, and this technique
Bruce> offers some unique advantages.

Bruce> However, this is orthogonal to the source-id proposal. Source-ids
Bruce> offer different value that is complementary.

Bruce> Our build system spits out dozens of builds a day. Some of these are
Bruce> run by developers, others by testers, and others by customers. Any one
Bruce> of them might crash. I might end up debugging (live debugging or a
Bruce> core file) any one of these builds, perhaps weeks after it was
Bruce> created. Because we have the source-id system set up I know that I can
Bruce> walk up and down the stack and have the source files automatically
Bruce> show up, with *zero* effort on my part. I don't have to install
Bruce> source packages, I can have multiple core files from multiple versions
Bruce> loaded simultaneously. Only the source files that I need are
Bruce> downloaded so it is *extremely* efficient. Retrieving the needed
Bruce> source files is essentially instantaneous and requires zero developer
Bruce> effort.

I wonder if you considered an approach based on build-ids.

You'd start with the existing build-id feature.  Then when your build
completes, you would record a build-id -> source-id mapping.  Finally
you would have a small fuse filesystem that looks up the build-id in the
database and fetches the appropriate source tree from git.
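
A rough sketch of the recording and lookup side of this, in Python with sqlite3. The table layout and function names are invented for illustration, and the fuse part is left out entirely; the filesystem would call something like lookup_source() and then fetch from git.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the build-id -> source-id database."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS builds ("
        " build_id TEXT PRIMARY KEY,"
        " repo TEXT NOT NULL,"
        " commit_id TEXT NOT NULL)"
    )
    return db


def record_build(db, build_id, repo, commit_id):
    """Called once when a build completes."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR REPLACE INTO builds VALUES (?, ?, ?)",
            (build_id.lower(), repo, commit_id),
        )


def lookup_source(db, build_id):
    """Return (repo, commit_id) for a build-id, or None if it is unknown."""
    return db.execute(
        "SELECT repo, commit_id FROM builds WHERE build_id = ?",
        (build_id.lower(),),
    ).fetchone()
```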

One benefit of this approach is that it requires nearly no changes in gdb.
This avoids a lot of bikeshedding ;)

I found a few git/fuse projects on github.

If you considered this & rejected it, I'd be curious to know why.
If it doesn't meet your needs then I probably misunderstood what you are
going for.

FWIW the SRPM-based approach we use at Red Hat is pretty good, but not
truly great.  It has a hack in the rewriting step and sometimes the
source tree layout isn't preserved properly somehow.

So something like the above may be more desirable overall.

Tom

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: New feature "source-id"
  2014-05-21 19:30           ` Tom Tromey
@ 2014-05-21 20:42             ` Bruce Dawson
  0 siblings, 0 replies; 22+ messages in thread
From: Bruce Dawson @ 2014-05-21 20:42 UTC (permalink / raw)
  To: 'Tom Tromey'; +Cc: 'Gerhard Gappmeier', gdb-patches

I did consider doing a fuse file system. It could work, and would minimize the changes needed to gdb. I'm not aware of any specific problems but I suspect it would add some extra complexity, and might make it difficult to avoid some problems. But, as you say, if it gets it done then that counts for a lot. Presumably such a system would use the Python hooks (that I currently use) in order to configure the fuse file system.

One potential problem with the fuse file system is with loading source files with matching file names. Let's say we have server.so and client.so and both have a file called foo.c. In an ideal world the debug information would contain a full path to foo.c and we could use "set substitute-path" to remap from the two different build directories to two different fuse file systems, and life would be good. Unfortunately the reality is that many projects (including libc6) put incomplete paths into the debug information. If source-id was a first-class feature of gdb then when the debugger needed "foo.c" for server.so it would look in the source-mapping information in server.so.dbg, find the version control information, retrieve the file, and load it. If source-id is not a first-class feature of gdb then I see no way to set up source search paths such that the right version would be loaded at the right time. We could try to fix every build system in the world to embed full paths, but that will never happen.
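
To illustrate the difference, here is a small Python sketch. The module names match the example above; the repository names, revisions and directory layout are made up. A global search path can only return the first foo.c it finds, while a lookup keyed by the object file that asked for the source can tell the two apart.

```python
import os
import posixpath

# Hypothetical per-objfile source maps, as they might be read from
# server.so.dbg and client.so.dbg: file name -> (repository, revision).
SOURCE_MAPS = {
    "server.so": {"foo.c": ("server.git", "1111111")},
    "client.so": {"foo.c": ("client.git", "2222222")},
}


def lookup_by_search_path(filename, search_dirs, exists=os.path.exists):
    """Global search path: the first directory that has the name wins,
    no matter which module the debugger is currently looking at."""
    for directory in search_dirs:
        candidate = posixpath.join(directory, filename)
        if exists(candidate):
            return candidate
    return None


def lookup_by_objfile(objfile, filename, maps=SOURCE_MAPS):
    """Per-objfile lookup: the module that needs the file selects the map."""
    return maps.get(objfile, {}).get(filename)
```

With search_dirs set to the server and client trees in that order, lookup_by_search_path("foo.c", ...) returns the server copy even when the current frame is in client.so, whereas lookup_by_objfile("client.so", "foo.c") returns the client revision.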

This failure case is real. On the other hand, it is rare enough that a source-id system that ignored it would still be incredibly useful.

> Then when your build completes, you would record a build-id -> source-id mapping.  

Where would the build-id -> source-id mapping be stored? The method we currently use is to have a section in the debug file which contains the mapping from source files to the version control identifiers. This feels like a simple and reliable method to record the mapping. We've been using it for over a year, and similar techniques have been used for many years on other platforms, so it is a proven technique.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2014-05-21 20:42 UTC | newest]

Thread overview: 22+ messages
2014-03-15 10:49 New feature "source-id" Gerhard Gappmeier
2014-03-15 17:32 ` Doug Evans
2014-03-15 20:06   ` Eli Zaretskii
2014-03-16  2:34     ` Doug Evans
2014-03-16  9:43       ` Gerhard Gappmeier
2014-03-16 16:22       ` Doug Evans
2014-03-16 16:34         ` Eli Zaretskii
2014-03-17  8:49         ` Gerhard Gappmeier
2014-03-17 12:25           ` Matt Rice
2014-03-17 19:01             ` Gerhard Gappmeier
2014-03-18  0:25               ` Doug Evans
2014-03-18  0:48                 ` Bruce Dawson
2014-03-18  1:39                   ` Doug Evans
2014-03-18 17:44                     ` Bruce Dawson
2014-03-18 17:57                       ` Doug Evans
2014-03-18 13:22 ` Mark Wielaard
2014-03-18 14:00   ` Gerhard Gappmeier
2014-03-18 15:03     ` Mark Wielaard
2014-03-18 16:40       ` Gerhard Gappmeier
2014-03-18 17:56         ` Bruce Dawson
2014-05-21 19:30           ` Tom Tromey
2014-05-21 20:42             ` Bruce Dawson
