Subject: Re: Partial cores using Linux "pipe" core_pattern
From: Paul Smith
Reply-To: psmith@gnu.org
To: Andreas Schwab
Cc: Andi Kleen, gdb@sourceware.org
Date: Thu, 21 May 2009 16:32:00 -0000

On Mon, 2009-05-18 at 15:49 +0200, Andreas Schwab wrote:
> Apparently the ELF core dumper cannot handle short writes (see
> dump_write in fs/binfmt_elf.c).  You should probably use a read buffer
> of at least a page, which is the most the kernel tries to write at
> once.

Sorry for the delay; I lost my repro case and it took me a while to find
another one.  And now when I dump cores over NFS, the bonding driver causes
a kernel panic, so there's that *sigh*.  I reconfigured my interfaces to use
a single non-bonded interface so I could avoid that issue and concentrate
on this one... I'll worry about that tomorrow.

I still need to do more investigation, but I have more clarity about when I
see these short cores versus "good" cores.  My system has a single process,
and when a request for work comes in it forks (but does not exec) a number
of helper copies of itself (typically 8).  In my test, all copies run the
same code and so all will segv at around the same time (I just added code
to do an invalid pointer access in different areas of the program when
certain test files exist).

Some areas of the code consider a segv or similar to be unrecoverable.  In
those situations I have a signal handler that stops the other processes in
the process group and dumps a single core; those other processes do NOT
dump core, and the whole thing exits.  The cores I get in this situation
are fine.

Other areas of the code consider a segv or similar to be recoverable.  In
this case each worker is left to dump core (or not) on its own, and the
system overall stays up.  When I force a segv in these areas, I get the
short cores.

Note that I am serializing my core dumping program (the one the cores are
piped to) via an flock()'d file on the local disk, and this serialization
(based on messages to syslog) does seem to be working.  What I see are 6-8
core dump messages from the kernel; then my core saver runs on the first
one and dumps about 50M of the 1G process space (about 188 reads of 256K
buffers plus some change).  Then that exits, the second one starts and
dumps a 64K core (one read), then the next also dumps 64K, etc.
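In case it helps picture the setup, here's a stripped-down sketch of what
the core saver does.  This is NOT the real code: the lock and output paths
are made up, and I'm assuming the usual pipe core_pattern arrangement where
something like '|/usr/local/sbin/core_saver %p' is written to
/proc/sys/kernel/core_pattern so %p shows up as argv[1].  The relevant
parts are the flock() serialization and the read loop, which treats a short
read as normal rather than as end-of-core:

/* Stripped-down sketch only; paths and the argv handling are placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

#define BUFSZ (256 * 1024)              /* well over one page */

int main(int argc, char **argv)
{
    const char *pid = argc > 1 ? argv[1] : "unknown";  /* %p from core_pattern */
    static char buf[BUFSZ];
    char path[256];

    /* Serialize concurrent dumps with an flock()'d file on local disk. */
    int lockfd = open("/var/tmp/core_saver.lock", O_CREAT | O_RDWR, 0644);
    if (lockfd < 0 || flock(lockfd, LOCK_EX) < 0) {
        perror("core_saver: lock");
        return 1;
    }

    snprintf(path, sizeof path, "/var/tmp/core.%s", pid);
    int out = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (out < 0) {
        perror("core_saver: open");
        return 1;
    }

    /* The kernel writes at most a page into the pipe at a time, so short
       reads are normal; only read() == 0 means the core is complete. */
    for (;;) {
        ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
        if (n == 0)
            break;
        if (n < 0) {
            perror("core_saver: read");
            return 1;
        }
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out, buf + off, n - off);
            if (w < 0) {
                perror("core_saver: write");
                return 1;
            }
            off += w;
        }
    }

    close(out);
    flock(lockfd, LOCK_UN);
    return 0;
}

The real program does more (the syslog messages, etc.), but the locking and
the read loop above are the parts that matter for this discussion.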
It _feels_ to me like there's some kind of COW or similar mismanagement of
the VM for these forked processes, such that they interfere with each other
and we can't get a full and complete core dump when all of them are dumping
at the same time.  I'm going to do more investigation, but maybe this rings
some bells with someone.
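For completeness, the "unrecoverable" path is set up more or less like the
sketch below.  Again, this is not the real code: the real handler stops the
sibling processes, whereas here I just have them exit quietly via SIGTERM
as a stand-in, since the exact mechanism isn't important; the point is that
only the faulting worker goes on to dump a core.

#include <signal.h>
#include <string.h>

/* Sketch: the faulting worker tells the rest of the process group to go
   away quietly, then re-raises the signal with the default disposition so
   that it alone dumps a core. */
static void fatal_handler(int sig)
{
    signal(SIGTERM, SIG_IGN);   /* don't take ourselves down with the group */
    kill(0, SIGTERM);           /* siblings exit without dumping core */

    signal(sig, SIG_DFL);       /* restore the default action... */
    raise(sig);                 /* ...and die for real, producing one core */
}

void install_fatal_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = fatal_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);
    sigaction(SIGBUS, &sa, NULL);
}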