From: Paul Smith
Reply-To: paul@mad-scientist.us
To: Andreas Schwab
Cc: Andi Kleen, gdb@sourceware.org
Subject: Re: Partial cores using Linux "pipe" core_pattern
Date: Tue, 26 May 2009 19:26:00 -0000
Message-Id: <1243365968.29250.357.camel@psmith-ubeta.netezza.com>
In-Reply-To: <1242923544.29250.134.camel@psmith-ubeta.netezza.com>
References: <1242609756.2800.135.camel@homebase.localnet> <87ab5aq3dq.fsf@basil.nowhere.org> <1242653371.2800.163.camel@homebase.localnet> <1242923544.29250.134.camel@psmith-ubeta.netezza.com>

On Thu, 2009-05-21 at 12:32 -0400, Paul Smith wrote:
> It _feels_ to me like there's some kind of COW or similar mismanagement
> of the VM for these forked processes such that they interfere and we
> can't get a full and complete core dump when all of them are dumping at
> the same time.

Well, my feelings were way off. I did more investigation and it turns out that what's happening is that the TIF_SIGPENDING flag is being set during the core dump.
This causes the write to the pipe to stop, and the core dumping code makes no effort to manage errors or partial writes. Here's the function in binfmt_elf.c that does the write:

    static int dump_write(struct file *file, const void *addr, int nr)
    {
            return file->f_op->write(file, addr, nr, &file->f_pos) == nr;
    }

If we get back anything other than exactly the number of bytes we tried to write, we give up and return false (0). This definitely returns false when I see the short cores, and never when I see "normal" cores.

I modified it to see what it's getting back, and file->f_op->write() is returning -ERESTARTSYS. So I annotated fs/pipe.c:pipe_write() and I'm definitely getting it from this code, at line 550 or so:

    if (signal_pending(current)) {
            if (!ret)
                    ret = -ERESTARTSYS;
            break;
    }

I've been posting on the linux-kernel mailing list, so this is really just an FYI for anyone interested in following the progress; you can find the current end of the thread here:

http://marc.info/?l=linux-kernel&m=124336093401443&w=2

So far I've failed to gain any interest from anyone on that list, but hopefully someone who can help me figure out what to do next will respond.

Cheers!
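
P.S. For anyone unfamiliar with the failure mode: the kind of handling that dump_write() lacks can be illustrated with a userspace analogue. This is only a sketch and not a proposed kernel fix. write_all() is a hypothetical helper name, and it assumes the userspace view of things, where an interrupted write(2) that transferred nothing shows up as EINTR rather than the kernel-internal ERESTARTSYS. It keeps the same 1/0 return convention as dump_write(), but retries on interruption and continues after a short write instead of giving up.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * write_all: hypothetical userspace helper (not kernel code).
 * Calls write(2) repeatedly until all nr bytes have been written.
 * Returns 1 on success and 0 on a hard error, mirroring the
 * true/false convention of dump_write().
 */
static int write_all(int fd, const void *addr, size_t nr)
{
        const char *p = addr;

        assert(addr != NULL || nr == 0);

        while (nr > 0) {
                ssize_t n = write(fd, p, nr);

                if (n < 0) {
                        if (errno == EINTR)
                                continue;  /* interrupted before any byte moved: retry */
                        return 0;          /* genuine error: give up */
                }
                p += n;                    /* short write: advance past what was written */
                nr -= (size_t)n;
        }
        return 1;
}
```

Whether a simple retry like this is appropriate inside the kernel's core dump path is a separate question, since there the pending signal does not go away on its own.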