Subject: Re: Partial cores using Linux "pipe" core_pattern
From: Paul Smith
Reply-To: psmith@gnu.org
To: Andreas Schwab
Cc: Andi Kleen, gdb@sourceware.org
Date: Thu, 21 May 2009 16:32:00 -0000

On Mon, 2009-05-18 at 15:49 +0200, Andreas Schwab wrote:
> Apparently the ELF core dumper cannot handle short writes (see
> dump_write in fs/binfmt_elf.c).  You should probably use a read buffer
> of at least a page, which is the most the kernel tries to write at
> once.

Sorry for the delay; I lost my repro case and it took me a while to find
another one.  And now when I dump cores over NFS, the bonding driver causes
a kernel panic, so there's that *sigh*.  I reconfigured my interfaces to use
a single non-bonded interface so I could avoid that issue and concentrate
on this one... I'll worry about that tomorrow.

I still need to do more investigation, but I have more clarity about when I
see these short cores versus "good" cores.  My system has a single process,
and when a request for work comes in it forks (but does not exec) a number
of helper copies of itself (typically 8).  In my test, all copies run the
same code and so all will segv at around the same time (I just added code
to do an invalid pointer access in different areas of the program when
certain test files exist).

Some areas of the code consider a segv or similar to be unrecoverable.  In
those situations I have a signal handler that stops the other processes in
the process group and dumps a single core; those other processes do NOT
dump core, and the whole thing exits.  The cores I get in this situation
are fine.

Other areas of the code consider a segv or similar to be recoverable.  In
this case each worker is left to dump core (or not) on its own, and the
system overall stays up.  When I force a segv in these areas, I get the
short cores.

Note that I am serializing my core dumping program (the one the cores are
piped to) via an flock()'d file on the local disk, and this serialization
(based on messages to syslog) does seem to be working.  What I see are 6-8
core dump messages from the kernel; then my core saver runs on the first
one and dumps about 50M of the 1G process space (about 188 reads of 256K
buffers plus some change).  Then that exits, the second one starts and
dumps a 64K core (one read), then the next also dumps 64K, etc.
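In case it helps picture the setup, here's a stripped-down sketch of what
the core saver does.  This is NOT the real code: the lock and output paths
are made up, and I'm assuming the usual pipe core_pattern arrangement where
something like '|/usr/local/sbin/core_saver %p' is written to
/proc/sys/kernel/core_pattern so %p shows up as argv[1].  The relevant
parts are the flock() serialization and the read loop, which treats a short
read as normal rather than as end-of-core:

/* Stripped-down sketch only; paths and the argv handling are placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

#define BUFSZ (256 * 1024)              /* well over one page */

int main(int argc, char **argv)
{
    const char *pid = argc > 1 ? argv[1] : "unknown";  /* %p from core_pattern */
    static char buf[BUFSZ];
    char path[256];

    /* Serialize concurrent dumps with an flock()'d file on local disk. */
    int lockfd = open("/var/tmp/core_saver.lock", O_CREAT | O_RDWR, 0644);
    if (lockfd < 0 || flock(lockfd, LOCK_EX) < 0) {
        perror("core_saver: lock");
        return 1;
    }

    snprintf(path, sizeof path, "/var/tmp/core.%s", pid);
    int out = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (out < 0) {
        perror("core_saver: open");
        return 1;
    }

    /* The kernel writes at most a page into the pipe at a time, so short
       reads are normal; only read() == 0 means the core is complete. */
    for (;;) {
        ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
        if (n == 0)
            break;
        if (n < 0) {
            perror("core_saver: read");
            return 1;
        }
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out, buf + off, n - off);
            if (w < 0) {
                perror("core_saver: write");
                return 1;
            }
            off += w;
        }
    }

    close(out);
    flock(lockfd, LOCK_UN);
    return 0;
}

The real program does more (the syslog messages, etc.), but the locking and
the read loop above are the parts that matter for this discussion.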
It _feels_ to me like there's some kind of COW or similar mismanagement of
the VM for these forked processes, such that they interfere with each other
and we can't get a full and complete core dump when all of them are dumping
at the same time.  I'm going to do more investigation, but maybe this rings
some bells with someone.
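For completeness, the "unrecoverable" path is set up more or less like the
sketch below.  Again, this is not the real code: the real handler stops the
sibling processes, whereas here I just have them exit quietly via SIGTERM
as a stand-in, since the exact mechanism isn't important; the point is that
only the faulting worker goes on to dump a core.

#include <signal.h>
#include <string.h>

/* Sketch: the faulting worker tells the rest of the process group to go
   away quietly, then re-raises the signal with the default disposition so
   that it alone dumps a core. */
static void fatal_handler(int sig)
{
    signal(SIGTERM, SIG_IGN);   /* don't take ourselves down with the group */
    kill(0, SIGTERM);           /* siblings exit without dumping core */

    signal(sig, SIG_DFL);       /* restore the default action... */
    raise(sig);                 /* ...and die for real, producing one core */
}

void install_fatal_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = fatal_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);
    sigaction(SIGBUS, &sa, NULL);
}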