collecting data from a coring process

Mirror of the gdb mailing list

* collecting data from a coring process
@ 2016-08-26  9:06 Paul Marquess
  2016-08-26 12:01 ` vijay nag
  0 siblings, 1 reply; 13+ messages in thread
From: Paul Marquess @ 2016-08-26  9:06 UTC (permalink / raw)
  To: gdb

I have an existing Linux application that uses gdb to collect data from a process if it cores. Currently I've been doing that with gdb after the core is written to disk. No problem there.

The requirements have now changed and it won't be possible to allow the core file to be written to disk. That means I need a way to (somehow) get gdb to collect the data while the process is still in memory.

My first thought was to add a script in /proc/sys/kernel/core_pattern to catch the process as it is coring, and then have gdb attach to the PID of the process that is about to core. Unfortunately, when I tried that, gdb gave me this error:

    Unable to attach: program terminated with signal SIGSEGV, Segmentation fault.
    No stack.

That seems to imply that by the time /proc/sys/kernel/core_pattern kicks in it is too late to use the PID with gdb.

Anyone know of a way to do this? Preferably one that doesn't involve changing the process itself.

cheers
Paul

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: collecting data from a coring process
  2016-08-26  9:06 collecting data from a coring process Paul Marquess
@ 2016-08-26 12:01 ` vijay nag
  2016-08-26 12:33   ` Paul Marquess
  0 siblings, 1 reply; 13+ messages in thread
From: vijay nag @ 2016-08-26 12:01 UTC (permalink / raw)
  To: Paul Marquess; +Cc: gdb

On Fri, Aug 26, 2016 at 2:36 PM, Paul Marquess
<Paul.Marquess@owmobility.com> wrote:
> I have an existing Linux application that uses gdb to collect data from a process if it cores. Currently I've been doing that with gdb after the core is written to disk. No problem there.
>
> The requirements have now changed and it won't be possible to allow the core file to be written to disk. That means I need a way to (somehow) get gdb to collect the data while the process is still in memory.
>
> My first thought was to add a script in /proc/sys/kernel/core_pattern to catch the process as it is coring. Then I get gdb to attach to the PID of the process that is about to core. Unfortunately, when I tried that, gdb gives me this error
>
>     Unable to attach: program terminated with signal SIGSEGV, Segmentation fault.
>     No stack.
>
> That seems to imply that by the time /proc/sys/kernel/core_pattern kicks in it is too late to use the PID with gdb.
>
> Anyone know of a way to do this? Preferably one that doesn't involve changing the process itself.
>
> cheers
> Paul
You can do one of the following:

1) Why not dump the information that you are looking for into a file
from the process's signal handler?
2) RTFM core file piping on Linux (you have probably done this
already?). The idea may seem dangerous, but you can try inserting a
sleep in the script for some time, until you have gathered whatever
information is required.
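
For reference, the piping mentioned in point 2 is described in core(5):
when core_pattern starts with '|', the kernel runs the named program and
feeds the core image to its standard input instead of writing a file. A
minimal sketch of the reading side of such a helper follows. The name
drain_core and the decision to merely count and discard the bytes are
assumptions made for illustration; nothing like this was posted in the
thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <unistd.h>

/* Read everything from fd and discard it. Returns the number of bytes
 * seen, or -1 on a read error. A real helper would inspect the data
 * (or sleep, as suggested above) instead of throwing it away. */
long long drain_core(int fd)
{
    char buf[65536];
    long long total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0)
            total += n;
        else if (n == 0)
            return total;        /* end of the core image */
        else if (errno != EINTR)
            return -1;           /* genuine read error */
    }
}
```

A helper's main() would call drain_core(STDIN_FILENO). It could be
registered with a pattern such as "|/usr/local/bin/core-helper %p", where
the path is hypothetical and %p expands to the PID of the dumping process.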



* RE: collecting data from a coring process
  2016-08-26 12:01 ` vijay nag
@ 2016-08-26 12:33   ` Paul Marquess
  2016-08-28  7:48     ` Dmitry Samersoff
  2016-09-01  4:23     ` David Niklas
  0 siblings, 2 replies; 13+ messages in thread
From: Paul Marquess @ 2016-08-26 12:33 UTC (permalink / raw)
  To: vijay nag, gdb

From: vijay nag [mailto:vijunag@gmail.com] 
> Sent: 26 August 2016 13:02
> To: Paul Marquess <Paul.Marquess@owmobility.com>
> Cc: gdb@sourceware.org
> Subject: Re: collecting data from a coring process
> 
> On Fri, Aug 26, 2016 at 2:36 PM, Paul Marquess <Paul.Marquess@owmobility.com> wrote:
> > I have an existing Linux application that uses gdb to collect data from a process if it cores. Currently I've been doing that with gdb after the core is written to disk. No problem there.
> >
> > The requirements have now changed and it won't be possible to allow the core file to be written to disk. That means I need a way to (somehow) get gdb to collect the data while the process is still in memory.
> >
> > My first thought was to add a script in /proc/sys/kernel/core_pattern 
> > to catch the process as it is coring. Then I get gdb to attach to the 
> > PID of the process that is about to core. Unfortunately, when I tried 
> > that, gdb gives me this error
> >
> >     Unable to attach: program terminated with signal SIGSEGV, Segmentation fault.
> >     No stack.
> >
> > That seems to imply that by the time /proc/sys/kernel/core_pattern kicks in it is too late to use the PID with gdb.
> >
> > Anyone know of a way to do this? Preferably one that doesn't involve changing the process itself.
> >
> > cheers
> > Paul
> You can do one of the following
> 
> 1) Why not dump the information that you are looking for into a file in the process signal handler ?

Would love to, but I have no idea what state the process is in once the SEGV has been triggered. Just taking pointers as an example, once in the signal handler I'm in a situation where I can't trust that any pointer contains a valid value. A quick search suggests I can guard against that by using a signal handler. But I'm already in one! If there is some prior art that shows how to safely dump data from a process once a SEGV has been triggered it would make my day.

For now, using gdb means I'm isolated from the risk of the data collection crashing the process again inside the signal handler.

> 2) RTFM core file piping on linux (Probably you've done this already

Looked, but did not find anything.

> ?) - The idea may seem dangerous, but you can try inserting sleep in the script for sometime till you gather whatever information that is required.

Not sure what you mean by this. Can you elaborate please?

Paul


* Re: collecting data from a coring process
  2016-08-26 12:33   ` Paul Marquess
@ 2016-08-28  7:48     ` Dmitry Samersoff
  2016-09-05 11:10       ` Paul Marquess
  2016-09-01  4:23     ` David Niklas
  1 sibling, 1 reply; 13+ messages in thread
From: Dmitry Samersoff @ 2016-08-28  7:48 UTC (permalink / raw)
  To: Paul Marquess, vijay nag, gdb



Paul,

>> 1) Why not dump the information that you are looking for into a
>> file in the process signal handler ?
>
> Would love to, but I have no idea what state the process is in once
> the SEGV has been triggered.

If you use an alternate signal stack (sigaltstack) and avoid malloc, you
can dump a bunch of information from the signal handler more or less safely.

e.g.

http://hg.openjdk.java.net/jdk9/hs/hotspot/file/tip/src/share/vm/utilities/vmError.cpp

>>> My first thought was to add a script in
>>> /proc/sys/kernel/core_pattern to catch the process as it is
>>> coring. Then I get gdb to attach to the PID of the process that
>>> is about to core. Unfortunately, when I tried that, gdb gives me
>>> this error

One possible solution is:

1. Change /proc/sys/kernel/core_pattern to have all coredumps from your
app in a separate directory, something like /var/dumps/%e/core.%p

2. Have a cron job that looks over this directory and runs
gdb <exe image name> <core_name> < gdb_script > core.%p.out
on demand.
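
The two steps above could be sketched in C roughly as follows. This is
only an illustration: is_core_name and run_gdb_on_core are hypothetical
helpers, and the "core.<pid>" naming is assumed from the pattern in
step 1. The gdb invocation mirrors the command line in step 2, with the
script on standard input and the report on standard output.

```c
#define _POSIX_C_SOURCE 200809L
#include <ctype.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Return 1 if name looks like "core.<pid>": "core." followed by digits only. */
int is_core_name(const char *name)
{
    if (strncmp(name, "core.", 5) != 0 || name[5] == '\0')
        return 0;
    for (const char *p = name + 5; *p; p++)
        if (!isdigit((unsigned char)*p))
            return 0;
    return 1;
}

/* Run "gdb exe core < script > out". Returns gdb's exit status,
 * 127 if the child could not set up or exec, or -1 on other failures. */
int run_gdb_on_core(const char *exe, const char *core,
                    const char *script, const char *out)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        int in = open(script, O_RDONLY);
        int fd = open(out, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (in < 0 || fd < 0 || dup2(in, 0) < 0 || dup2(fd, 1) < 0)
            _exit(127);
        execlp("gdb", "gdb", exe, core, (char *)NULL);
        _exit(127);              /* exec failed */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

The periodic job would walk the dump directory with opendir()/readdir(),
call is_core_name() on each entry, and hand the matches to
run_gdb_on_core().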

-Dmitry


-- 
Dmitry Samersoff
Saint Petersburg, Russia, http://devnull.samersoff.net
* There will come soft rains  ...



* Re: collecting data from a coring process
  2016-08-26 12:33   ` Paul Marquess
  2016-08-28  7:48     ` Dmitry Samersoff
@ 2016-09-01  4:23     ` David Niklas
  2016-09-05 10:59       ` Paul Marquess
  1 sibling, 1 reply; 13+ messages in thread
From: David Niklas @ 2016-09-01  4:23 UTC (permalink / raw)
  To: gdb; +Cc: Paul Marquess

On Fri, 26 Aug 2016 12:33:24 <Paul.Marquess@owmobility.com> wrote:
> > On Fri, Aug 26, 2016 at 2:36 PM, Paul Marquess
> > <Paul.Marquess@owmobility.com> wrote:  
> > > I have an existing Linux application that uses gdb to collect data
> > > from a process if it cores. Currently I've been doing that with gdb
> > > after the core is written to disk. No problem there.
> > >
> > > The requirements have now changed and it won't be possible to allow
> > > the core file to be written to disk. That means I need a way to
> > > (somehow) get gdb to collect the data while the process is still in
> > > memory.
> > >
> > > My first thought was to add a script
> > > in /proc/sys/kernel/core_pattern to catch the process as it is
> > > coring. Then I get gdb to attach to the PID of the process that is
> > > about to core. Unfortunately, when I tried that, gdb gives me this
> > > error
> > >
> > >     Unable to attach: program terminated with signal SIGSEGV,
> > > Segmentation fault. No stack.
> > >
> > > That seems to imply that by the time /proc/sys/kernel/core_pattern
> > > kicks in it is too late to use the PID with gdb.
> > >
> > > Anyone know of a way to do this? Preferably one that doesn't
> > > involve changing the process itself.
> > >
> > > cheers
> > > Paul  
> > You can do one of the following
> > 
> > 1) Why not dump the information that you are looking for into a file
> > in the process signal handler ?  
> 
> Would love to, but I have no idea what state the process is in once the
> SEGV has been triggered. Just taking pointers as an example, once in the
> signal handler I'm in a situation where I can't trust that any pointer
> contains a valid value. A quick search suggests I can guard against
> that by using a signal handler. But I'm already in one! If there is
> some prior art that shows how to safely dump data from a process once a
> SEGV has been triggered it would make my day.
> <snip>

Oh, oh, I know how to do this (finally a chance to be of use on
gdb's mailing list :) OK, assuming that you know which variables you can
access, or that you can stick them in a struct, what you do is enter the
signal handler (hereafter known as SH) and print the values one by one.
When one of the values causes a SEGV, you re-enter SH, which checks an
atomic variable to see if you're already in SH and, if so, another atomic
variable to see if you've begun to print the values.
If so, a new branch is executed and what gets printed is an
"inaccessible" error value.
Control returns to the first SH and it continues iterating.
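
One way to sketch this in C is shown below. Note the substitution: instead
of returning from the nested handler (which would re-run the faulting
instruction), the sketch uses sigsetjmp()/siglongjmp() to get back to the
loop. That choice, and the names probe_int, probe_handler and
install_probe_handler, are assumptions of this example rather than part
of the description above.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf probe_env;
static volatile sig_atomic_t probing;   /* nonzero while a probe is running */

/* A fault during a probe jumps back into probe_int(); any other fault
 * is re-raised with the default action so the process still dies. */
static void probe_handler(int sig)
{
    if (probing)
        siglongjmp(probe_env, 1);
    signal(sig, SIG_DFL);
    raise(sig);
}

/* SA_NODEFER allows a second SIGSEGV to be delivered while an outer
 * SIGSEGV handler is still running. Returns 0 on success. */
int install_probe_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = probe_handler;
    sa.sa_flags = SA_NODEFER;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;
    return sigaction(SIGBUS, &sa, NULL);
}

/* Try to read *p into *out. Returns 1 on success, 0 if the read faulted. */
int probe_int(const volatile int *p, int *out)
{
    if (sigsetjmp(probe_env, 1) != 0) {
        probing = 0;
        return 0;                /* arrived here via siglongjmp */
    }
    probing = 1;
    int v = *p;                  /* may fault */
    probing = 0;
    *out = v;
    return 1;
}
```

The dumping loop would call probe_int() for each value and print an
"inaccessible" marker whenever it returns 0.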

WARNING: There may be better ways to do what you want, like dump core to a
tmpfs (which is in memory anyway so it leaves no trace on disk).

Sincerely,
David



* RE: collecting data from a coring process
  2016-09-01  4:23     ` David Niklas
@ 2016-09-05 10:59       ` Paul Marquess
  0 siblings, 0 replies; 13+ messages in thread
From: Paul Marquess @ 2016-09-05 10:59 UTC (permalink / raw)
  To: David Niklas, gdb

From: David Niklas [mailto:doark@mail.com] 
 
...
> > > 1) Why not dump the information that you are looking for into a file 
> > > in the process signal handler ?
> > 
> > Would love to, but I have no idea what state the process is in once 
> > the SEGV has been triggered. Just taking pointers as an example, once 
> > in the signal handler I'm in a situation where I can't trust that any 
> > pointer contains a valid value. A quick search suggests I can guard 
> > against that by using a signal handler. But I'm already in one! If 
> > there is some prior art that shows how to safely dump data from a 
> > process once a SEGV has been triggered it would make my day.
> > <snip>
> 
> Oh, oh, I know how to do this (finally a chance to be of use on gdb's mailing list :) Ok, assuming that you know what variables you can access or that you can stick them in a struct what you do is enter the signal handler (hereafter known as SH), printing the values one by one. 

Can do that for a very limited sub-set of the data I'm dumping, but in general there is just too much data to remember that way.

> When one of the values causes a SEGV then you reenter SH which checks an atomic type to see if you're in SH and if so then another atomic type to see if you've begun to print the values.
>
> If so then a new branch is executed and what gets printed is an inaccessible error value.
> Control returns to the first SH and it continues iterating.

Interesting technique for dealing with recursive signals.

Paul
P.S. Sorry for the delay in following up. Had no internet access for about 10 days.



* RE: collecting data from a coring process
  2016-08-28  7:48     ` Dmitry Samersoff
@ 2016-09-05 11:10       ` Paul Marquess
  2016-09-05 22:17         ` Samuel Bronson
       [not found]         ` <d327752e-89f6-c5a3-6d72-4789b106e1f6@samersoff.net>
  0 siblings, 2 replies; 13+ messages in thread
From: Paul Marquess @ 2016-09-05 11:10 UTC (permalink / raw)
  To: Dmitry Samersoff, vijay nag, gdb


From: Dmitry Samersoff [mailto:dms@samersoff.net] 

> Paul,
> 
> >> 1) Why not dump the information that you are looking for into a file 
> >> in the process signal handler ?
> >
> > Would love to, but I have no idea what state the process is in once 
> > the SEGV has been triggered.
> 
> If you use altstack and avoid malloc you can dump bunch of information from the signal handler more or less safely.
> 
> e.g.
> 
> http://hg.openjdk.java.net/jdk9/hs/hotspot/file/tip/src/share/vm/utilities/vmError.cpp

Thanks, will take a look at that. When you say "more or less safely", I'm reading that as saying there will be issues with it.  :-)

I know we've had signal handlers cause problems before, hence my preference for having the signal handler code do as little as possible and getting all the data collection handled at arm's length by gdb.

> >>> My first thought was to add a script in 
> >>> /proc/sys/kernel/core_pattern to catch the process as it is coring. 
> >>> Then I get gdb to attach to the PID of the process that is about to 
> >>> core. Unfortunately, when I tried that, gdb gives me this error
> 
> One of possible solution is:
> 
> 1. Change /proc/sys/kernel/core_pattern to have all coredumps from your app in a separate directory, something like /var/dumps/%e/core.%p
> 
> 2. Have a cron job that looks over this directory and run gdb <exe image name> <core_name> < gdb_script > core.%p.out on demand.

That is exactly what I'm doing at the moment. Trouble is that I soon won't be able to allow a core file to be written -- the process is reaching a size where I cannot allow it to be out of action for the amount of time it takes to write that to disk.

Paul

P.S. Sorry for the delay in following up. Had no internet access for about 10 days.


* Re: collecting data from a coring process
  2016-09-05 11:10       ` Paul Marquess
@ 2016-09-05 22:17         ` Samuel Bronson
  2016-09-05 23:19           ` Paul Marquess
       [not found]         ` <d327752e-89f6-c5a3-6d72-4789b106e1f6@samersoff.net>
  1 sibling, 1 reply; 13+ messages in thread
From: Samuel Bronson @ 2016-09-05 22:17 UTC (permalink / raw)
  To: Paul Marquess; +Cc: Dmitry Samersoff, vijay nag, gdb

On Mon, Sep 5, 2016 at 7:09 AM, Paul Marquess
<Paul.Marquess@owmobility.com> wrote:
> From: Dmitry Samersoff [mailto:dms@samersoff.net]
>
>> Paul,
>>
>> >> 1) Why not dump the information that you are looking for into a file
>> >> in the process signal handler ?
>> >
>> > Would love to, but I have no idea what state the process is in once
>> > the SEGV has been triggered.
[...]
> I know we've had problems with signal handlers causing problems, thus my preference to find a way to have the signal handler code do as little as possible and get all the data collection handled at arm's length by gdb.

You could just spawn (and wait for) your GDB-launching script from the
signal handler; then, the process & stack will still be around for
GDB.  I think this is even legal!
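
A sketch of what that could look like, under the assumption that the
helper's path and argument vector are prepared before any crash happens
so the handler itself does no formatting or allocation. spawn_and_wait is
a hypothetical name. It relies only on fork(), execve(), waitpid() and
_exit().

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Start the program at path with argv and wait for it to finish.
 * Returns its exit status, 127 if the exec failed, or -1 on error. */
int spawn_and_wait(const char *path, char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execve(path, argv, environ);
        _exit(127);              /* only reached if execve() failed */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

The crashing process stays alive, stack included, for as long as
waitpid() blocks, so the helper can attach gdb to its parent PID during
that window.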



* RE: collecting data from a coring process
  2016-09-05 22:17         ` Samuel Bronson
@ 2016-09-05 23:19           ` Paul Marquess
       [not found]             ` <CAJYzjmf0a2Dd8XbOQaO3937Bcab1AW9gVp=r3mKSgUq_27G8ow@mail.gmail.com>
  0 siblings, 1 reply; 13+ messages in thread
From: Paul Marquess @ 2016-09-05 23:19 UTC (permalink / raw)
  To: Samuel Bronson; +Cc: Dmitry Samersoff, vijay nag, gdb

From: Samuel Bronson [mailto:naesten@gmail.com] 

> On Mon, Sep 5, 2016 at 7:09 AM, Paul Marquess <Paul.Marquess@owmobility.com> wrote:
> > From: Dmitry Samersoff [mailto:dms@samersoff.net]
> >
> >> Paul,
> >>
> >> >> 1) Why not dump the information that you are looking for into a 
> >> >> file in the process signal handler ?
> >> >
> >> > Would love to, but I have no idea what state the process is in once 
> >> > the SEGV has been triggered.
> [...]
> > I know we've had problems with signal handlers causing problems, thus my preference to find a way to have the signal handler code do as little as possible and get all the data collection handled at arm's length by gdb.
> 
> You could just spawn (and wait for) your GDB-launching script from the signal handler; then, 
> the process & stack will still be around for GDB.  I think this is even legal!

That's one of the approaches I'm thinking of. I need to check if the fork/exec & wait use malloc.

The process I want to get data from is controlled by a parent process. Had thought I could get the parent to spot the SIGABRT and attach to the child, but the stack is gone by the time gdb attaches to the PID of the coring process. Need to play with that a bit more to see if I can find a way for the child to tell the parent to fire up gdb before the stacks are gone.

Paul



* RE: collecting data from a coring process
       [not found]             ` <CAJYzjmf0a2Dd8XbOQaO3937Bcab1AW9gVp=r3mKSgUq_27G8ow@mail.gmail.com>
@ 2016-09-06 16:41               ` Paul Marquess
  2016-09-07 19:22                 ` Samuel Bronson
  0 siblings, 1 reply; 13+ messages in thread
From: Paul Marquess @ 2016-09-06 16:41 UTC (permalink / raw)
  To: Samuel Bronson; +Cc: Dmitry Samersoff, vijay nag, gdb

From: Samuel Bronson [mailto:naesten@gmail.com] 

> On Mon, Sep 5, 2016 at 7:19 PM, Paul Marquess <Paul.Marquess@owmobility.com> wrote:
> > From: Samuel Bronson [mailto:naesten@gmail.com]
> >
> >> On Mon, Sep 5, 2016 at 7:09 AM, Paul Marquess <Paul.Marquess@owmobility.com> wrote:
> >> > From: Dmitry Samersoff [mailto:dms@samersoff.net]
> >> >
> >> >> Paul,
> >> >>
> >> >> >> 1) Why not dump the information that you are looking for into a 
> >> >> >> file in the process signal handler ?
> >> >> >
> >> >> > Would love to, but I have no idea what state the process is in 
> >> >> > once the SEGV has been triggered.
> >> [...]
> >> > I know we've had problems with signal handlers causing problems, thus my preference to find a way to have the signal handler code do as little as possible and get all the data collection handled at arm's length by gdb.
> >>
> >> You could just spawn (and wait for) your GDB-launching script from 
> >> the signal handler; then, the process & stack will still be around for GDB.  I think this is even legal!
> >
> > That's one of the approaches I'm thinking of. I need to check if the fork/exec & wait use malloc.
> 
> I think it should suffice for them to be "async-signal-safe "?  It looks like signal(7) documents which functions several 
> versions of POSIX require to be async-signal-safe, and it looks like there are two versions of exec*() on there as well 
> as fork() and wait().  Which is basically what I meant by "I think this is even legal!" :-).

I agree that "async-signal-safe" is something that needs to be considered, but it isn't the only thing. I've seen plenty of cores where corruption of a data structure inside malloc itself was the trigger for the SEGV. That's why I need to be sure that any code executed in the signal handler isn't going to blow up.

I've had success with a toy setup that checks if the following scenario will work.

I have a Parent process that spawns a Child process. The child process contains a deliberate SEGV error.

In the Child process I get the signal handler to send USR1 to the parent process, then send SIGSTOP to itself. Once the SIGSTOP is released I get the process to exit.

The Parent process has a handler to catch the USR1 signal. I use this to trigger the execution of gdb.  When I get gdb triggered it seems to be working fine -- stack is still present & I can access data structures. Exiting gdb must send a CONT to the process because the child process then exits normally.

Still early days, but I like this approach because it means I only need to add a small amount of code in the signal handler of the coring process.
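
A toy version of the scenario described above might look like this. It is
a sketch under assumptions, not the actual prototype: the child uses
raise(SIGSEGV) as a stand-in for a real fault, the parent sends SIGCONT
itself at the point where gdb would be run, and the exit code 42 and all
function names are invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t child_crashed;  /* set by the parent's USR1 handler */

static void parent_on_usr1(int sig) { (void)sig; child_crashed = 1; }

/* Child's SEGV handler: tell the parent, freeze, exit once continued. */
static void child_on_segv(int sig)
{
    (void)sig;
    kill(getppid(), SIGUSR1);
    kill(getpid(), SIGSTOP);     /* stack stays intact while stopped */
    _exit(42);
}

/* waitpid() that retries when interrupted by a signal. */
static pid_t wait_child(pid_t pid, int *status, int flags)
{
    pid_t r;
    do { r = waitpid(pid, status, flags); } while (r < 0 && errno == EINTR);
    return r;
}

/* Returns the child's exit status (42 here), or -1 if a step failed. */
int run_demo(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = parent_on_usr1;
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        sa.sa_handler = child_on_segv;
        sigaction(SIGSEGV, &sa, NULL);
        raise(SIGSEGV);          /* stand-in for a real fault */
        _exit(1);                /* not reached */
    }

    int status;
    if (wait_child(pid, &status, WUNTRACED) < 0 || !WIFSTOPPED(status))
        return -1;
    if (!child_crashed)
        return -1;
    /* A real parent would run gdb against pid here. */
    kill(pid, SIGCONT);
    if (wait_child(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

The child's handler makes three calls in total, which keeps the code that
runs inside the damaged process very small.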

Paul


* Re: collecting data from a coring process
  2016-09-06 16:41               ` Paul Marquess
@ 2016-09-07 19:22                 ` Samuel Bronson
  2016-09-07 22:10                   ` Paul Marquess
  0 siblings, 1 reply; 13+ messages in thread
From: Samuel Bronson @ 2016-09-07 19:22 UTC (permalink / raw)
  To: Paul Marquess; +Cc: Dmitry Samersoff, vijay nag, gdb

On Tue, Sep 6, 2016 at 12:40 PM, Paul Marquess
<Paul.Marquess@owmobility.com> wrote:
> From: Samuel Bronson [mailto:naesten@gmail.com]
>
>
>> On Mon, Sep 5, 2016 at 7:19 PM, Paul Marquess <Paul.Marquess@owmobility.com> wrote:
>> > From: Samuel Bronson [mailto:naesten@gmail.com]
>> >
>> >> On Mon, Sep 5, 2016 at 7:09 AM, Paul Marquess <Paul.Marquess@owmobility.com> wrote:
>> >> > From: Dmitry Samersoff [mailto:dms@samersoff.net]
>> >> >
>> >> >> Paul,
>> >> >>
>> >> >> >> 1) Why not dump the information that you are looking for into a
>> >> >> >> file in the process signal handler ?
>> >> >> >
>> >> >> > Would love to, but I have no idea what state the process is in
>> >> >> > once the SEGV has been triggered.
>> >> [...]
>> >> > I know we've had problems with signal handlers causing problems, thus my preference to find a way to have the signal handler code do as little as possible and get all the data collection handled at arm's length by gdb.
>> >>
>> >> You could just spawn (and wait for) your GDB-launching script from
>> >> the signal handler; then, the process & stack will still be around for GDB.  I think this is even legal!
>> >
>> > That's one of the approaches I'm thinking of. I need to check if the fork/exec & wait use malloc.
>>
>> I think it should suffice for them to be "async-signal-safe "?  It looks like signal(7) documents which functions several
>> versions of POSIX require to be async-signal-safe, and it looks like there are two versions of exec*() on there as well
>> as fork() and wait().  Which is basically what I meant by "I think this is even legal!" :-).
>
> I agree that "async-signal-safe " is something that needs to be considered, but it isn't the only thing. I've seen plenty of cores where corruption of a data structure inside malloc itself was the trigger for the SEGV. That's why I need to be sure that any code executed in the signal handler isn't going to blow up.

Hmm.  I had not really considered that it might technically be
possible to have an async-signal-safe implementation of malloc(), and
was therefore operating under the assumption that it was impossible
for an async-signal-safe function to rely on malloc().  So, that
leaves a few questions:

  1. Would it actually be a problem for an async-signal-safe
implementation of malloc() to be called in this scenario?

  2. Is such an implementation even possible?

  3. Are you willing to take the chance that anyone would actually
ship one AND dare to use it in any of POSIX's mandated
async-signal-safe functions?

(Also, it has come to my attention that s*printf() are actually
functions which are not on the list -- somehow, the nature of their
task had gotten them past my radar -- so it's presumably simplest to
have the helper script get the parent PID on its own, rather than
passing it on the command line as I had initially imagined.)



* RE: collecting data from a coring process
  2016-09-07 19:22                 ` Samuel Bronson
@ 2016-09-07 22:10                   ` Paul Marquess
  0 siblings, 0 replies; 13+ messages in thread
From: Paul Marquess @ 2016-09-07 22:10 UTC (permalink / raw)
  To: Samuel Bronson; +Cc: Dmitry Samersoff, vijay nag, gdb


From: Samuel Bronson [mailto:naesten@gmail.com] 

> On Tue, Sep 6, 2016 at 12:40 PM, Paul Marquess <Paul.Marquess@owmobility.com> wrote:
> > From: Samuel Bronson [mailto:naesten@gmail.com]
> >
> >
> >> On Mon, Sep 5, 2016 at 7:19 PM, Paul Marquess <Paul.Marquess@owmobility.com> wrote:
> >> > From: Samuel Bronson [mailto:naesten@gmail.com]
> >> >
> >> >> On Mon, Sep 5, 2016 at 7:09 AM, Paul Marquess <Paul.Marquess@owmobility.com> wrote:
> >> >> > From: Dmitry Samersoff [mailto:dms@samersoff.net]
> >> >> >
> >> >> >> Paul,
> >> >> >>
> >> >> >> >> 1) Why not dump the information that you are looking for 
> >> >> >> >> into a file in the process signal handler ?
> >> >> >> >
> >> >> >> > Would love to, but I have no idea what state the process is 
> >> >> >> > in once the SEGV has been triggered.
> >> >> [...]
> >> >> > I know we've had problems with signal handlers causing problems, thus my preference to find a way to have the signal handler code do as little as possible and get all the data collection handled at arm's length by gdb.
> >> >>
> >> >> You could just spawn (and wait for) your GDB-launching script from 
> >> >> the signal handler; then, the process & stack will still be around for GDB.  I think this is even legal!
> >> >
> >> > That's one of the approaches I'm thinking of. I need to check if the fork/exec & wait use malloc.
> >>
> >> I think it should suffice for them to be "async-signal-safe "?  It 
> >> looks like signal(7) documents which functions several versions of 
> >> POSIX require to be async-signal-safe, and it looks like there are two versions of exec*() on there as well as fork() and wait().  Which is basically what I meant by "I think this is even legal!" :-).
> >
> > I agree that "async-signal-safe " is something that needs to be considered, but it isn't the only thing. I've seen plenty of cores where corruption of a data structure inside malloc itself was the trigger for the SEGV. That's why I need to be sure that any code executed in the signal handler isn't going to blow up.
> 
> Hmm.  I had not really considered that it might technically be possible to have an async-signal-safe implementation of malloc(), and was 
> therefore operating under the assumption that it was impossible for an async-signal-safe function to rely on malloc().  So, that leaves a few 
> questions:
> 
>   1. Would it actually be a problem for an async-signal-safe implementation of malloc() to be called in this scenario?
> 
>   2. Is such an implementation even possible?
> 
>   3. Are you willing to take the chance that anyone would actually ship one AND dare to use it in any of POSIX's mandated async-signal-safe functions?

This feels like it is getting into uncharted waters, so, no, I wouldn't want to have the risk of shipping something like this unless it was already mature code that's had all the issues sorted out.

> (Also, it has come to my attention that s*printf() are actually functions which are not on the list -- somehow, the nature of their task had gotten 
> them past my radar -- so it's presumably simplest to have the helper script get the parent PID on its own, rather than passing it on the command 
> line as I had initially imagined.)

Given that I've now got a working prototype where the signal handler for SEGV ultimately just sends a USR1 signal to a parent process, I don't think I'm prepared to take the risk of getting a process that is about to core to do a fork & exec. My current approach is very simple (which is always good) and means that all complexity (and risk) is moved to the parent process.

Just need to check that kill doesn't use malloc :-)

Paul


* RE: collecting data from a coring process
       [not found]         ` <d327752e-89f6-c5a3-6d72-4789b106e1f6@samersoff.net>
@ 2016-09-08 17:12           ` Paul Marquess
  0 siblings, 0 replies; 13+ messages in thread
From: Paul Marquess @ 2016-09-08 17:12 UTC (permalink / raw)
  To: Dmitry Samersoff, gdb


From: Dmitry Samersoff [mailto:dms@samersoff.net] 

> Paul,
> 
> > Thanks, will take a look at that. When you say "more or less safely", 
> > I'm reading that as saying there will be issues with it.  :-)
> 
> I don't know a way to do anything with a crashing process with 100% reliability. Even a core dump.

Agree.

> Custom code in signal handler doesn't make the situation worse.

I'd rephrase that to say that if you are careful and know all the limitations of what is possible in a signal handler, you won't make the situation worse.

> For complicated apps, the crash is quite often the result of something that happened long before the crash point. E.g. when you see memory corruption, you are typically interested in where the memory was corrupted, not where the corrupted memory was hit by the app.

Tell me about it. Memory corruption errors can be impossible to track down.

> So a signal handler that knows the application's data structures and can print meaningful information is quite useful and saves a lot of time in debugging.
> 
> Also, it might be necessary to free some resources before the process starts dumping core, to allow a faster restart.
> 
> > Trouble is I soon will not allow a core file to be written -- the 
> > process is reaching a size where I cannot allow it to be out of action 
> > for the amount of time it takes to write that to disk.
> 
> One possible solution is to add a keep-alive protocol between child and parent (e.g. the child keeps touching a file on disk or sending UDP packets); if the keep-alive doesn't arrive in time, the parent considers the child dead, sends it an abort and starts a new process.
> 
> This solution also covers the situation where a child process hangs or deadlocks.

Luckily I already have a health check probe that does that.

> -Dmitry
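
As an illustration of the keep-alive idea quoted above, here is a minimal
sketch of the parent's side. It assumes the child writes one byte to a
pipe at regular intervals, which is a substitution for the file touching
or UDP packets suggested in the quote. The function name wait_heartbeat
is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Wait up to timeout_ms for one heartbeat byte on fd.
 * Returns 1 if a heartbeat arrived, 0 on timeout or if the writer has
 * gone away, -1 on error. */
int wait_heartbeat(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int r;

    /* Retrying after EINTR restarts the full timeout; fine for a sketch. */
    do { r = poll(&pfd, 1, timeout_ms); } while (r < 0 && errno == EINTR);
    if (r < 0)
        return -1;
    if (r == 0)
        return 0;                /* timed out: treat the child as dead */

    char byte;
    return read(fd, &byte, 1) == 1 ? 1 : 0;   /* read of 0 bytes means EOF */
}
```

When this returns 0 the parent would send SIGABRT to the child and start
a replacement, as described in the quoted text.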

