From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 15498 invoked by alias); 24 Jan 2020 17:05:15 -0000 Mailing-List: contact gdb-patches-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: gdb-patches-owner@sourceware.org Received: (qmail 15490 invoked by uid 89); 24 Jan 2020 17:05:14 -0000 Authentication-Results: sourceware.org; auth=none X-Spam-SWARE-Status: No, score=-5.4 required=5.0 tests=AWL,BAYES_00,SPF_HELO_PASS,SPF_PASS autolearn=ham version=3.3.1 spammy= X-HELO: simark.ca Received: from simark.ca (HELO simark.ca) (158.69.221.121) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Fri, 24 Jan 2020 17:05:13 +0000 Received: from [172.16.0.95] (192-222-181-218.qc.cable.ebox.net [192.222.181.218]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by simark.ca (Postfix) with ESMTPSA id 0BDB51E5F7; Fri, 24 Jan 2020 12:05:11 -0500 (EST) Subject: Re: [PATCH] Fix re-runs of a second inferior (PR gdb/25410) To: Pedro Alves , "Aktemur, Tankut Baris" , "gdb-patches@sourceware.org" References: <20200124030222.13854-1-palves@redhat.com> <28a65882-8c59-db61-418c-1dc2139481af@simark.ca> From: Simon Marchi Message-ID: <5db80e9a-2c58-6edc-8a9e-db44ea44a8dc@simark.ca> Date: Fri, 24 Jan 2020 17:15:00 -0000 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.4.1 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-SW-Source: 2020-01/txt/msg00817.txt.bz2 On 2020-01-24 10:11 a.m., Pedro Alves wrote: > On 1/24/20 2:59 PM, Simon Marchi wrote: >> On 2020-01-24 6:22 a.m., Aktemur, Tankut Baris wrote: >>> On Friday, January 24, 2020 4:02 AM, Pedro Alves wrote: >>>> This fixes a latent bug exposed by the multi-target patch (5b6d1e4fa >>>> "Multi-target support). >>> >>> The patch led to the assertion violation below when running >>> gdb.threads/vfork-follow-child-exit.exp. >>> >>> gdb/progspace.c:243: internal-error: void set_current_program_space(program_space*): Assertion `pspace != NULL' failed. >> >> >> Oh, I see that too. > > Me too. No idea how I missed this. Sorry about that... > There's actually yet another internal error in addition > to that one. > > Here's the updated patch, which fixes all. > > From 2880f692d69cf54ef9fb41f84ac00a1d04f28447 Mon Sep 17 00:00:00 2001 > From: Pedro Alves > Date: Fri, 24 Jan 2020 14:55:43 +0000 > Subject: [PATCH] Fix re-runs of a second inferior (PR gdb/25410) > > This fixes a latent bug exposed by the multi-target patch (5b6d1e4fa > "Multi-target support), and then fixes two other latent bugs exposed > by fixing that first latent bug. > > The symptom described in the bug report is that starting a first > inferior, then trying to run a second (multi-threaded) inferior twice, > causes libthread_db to fail to load, along with other erratic > behavior: > > (gdb) run > Starting program: /tmp/foo > warning: td_ta_new failed: generic error > > Going a bit deeply, I found that if the two inferiors have different > symbols, we can see that just after inferior 2 exits, we are left with > inferior 2 selected, which is correct, but the symbols in scope belong > to inferior 1, which is obviously incorrect... > > This problem is that there's a path in > scoped_restore_current_thread::restore() that switches to no thread > selected, and switches the current inferior, but leaves the current > program space as is, resulting in leaving the program space pointing > to the wrong program space (the one of the other inferior). This was > happening after handling TARGET_WAITKIND_NO_RESUMED, which is an event > that triggers after TARGET_WAITKIND_EXITED for the previous inferior > exit. Subsequent symbol lookups find the symbols of the wrong > inferior. > > The fix is to use switch_to_inferior_no_thread in that problem spot. > This function was recently added along with the multi-target work > exactly for these situations. > > As for testing, this patch adds a new testcase that tests symbol > printing just after inferior exit, which exercises the root cause of > the problem more directly. And then, to cover the use case described > in the bug too, it also exercises the lithread_db.so mis-loading, by > using TLS printing as a proxy for being sure that threaded debugging > was activated sucessfully. The testcase fails without the fix like > this, for the "print symbol just after exit" bits: > > ... > [Inferior 1 (process 8719) exited normally] > (gdb) PASS: gdb.multi/multi-re-run.exp: re_run_inf=1: iter=1: continue until exit > print re_run_var_1 > No symbol "re_run_var_1" in current context. > (gdb) FAIL: gdb.multi/multi-re-run.exp: re_run_inf=1: iter=1: print re_run_var_1 > ... > > And like this for the "libthread_db.so loading" bits: > > (gdb) run > Starting program: /home/pedro/gdb/binutils-gdb/build/gdb/testsuite/outputs/gdb.multi/multi-re-run/multi-re-run > warning: td_ta_new failed: generic error > [New LWP 27001] > > Thread 1.1 "multi-re-run" hit Breakpoint 3, all_started () at /home/pedro/gdb/binutils-gdb/build/../src/gdb/testsuite/gdb.multi/multi-re-run.c:44 > 44 } > (gdb) PASS: gdb.multi/multi-re-run.exp: re_run_inf=1: iter=2: running to all_started in runto > print tls_var > Cannot find thread-local storage for LWP 27000, executable file /home/pedro/gdb/binutils-gdb/build/gdb/testsuite/outputs/gdb.multi/multi-re-run/multi-re-run: > Cannot find thread-local variables on this target > (gdb) FAIL: gdb.multi/multi-re-run.exp: re_run_inf=1: iter=2: print tls_var > > > As mentioned, that fix above goes on to expose a couple other latent > bugs. This commit fixes those as well. > > The first latent bug exposed is in > infrun.c:handle_vfork_child_exec_or_exit. The current code is leaving > inf->pspace == NULL while calling clone_program_space. The idea was > to make it so that the breakpoints module doesn't use this inferior's > pspace to set breakpoints. With that, any > scoped_restore_current_thread use from within clone_program_space > tries to restore a NULL program space, which hits an assertion: > > Attaching after Thread 0x7ffff74b8700 (LWP 27276) vfork to child process 27277] > [New inferior 2 (process 27277)] > [Thread debugging using libthread_db enabled] > Using host libthread_db library "/lib64/libthread_db.so.1". > /home/pedro/gdb/binutils-gdb/build/../src/gdb/progspace.c:243: internal-error: void set_current_program_space(program_space*): Assertion `pspace != NULL' faile > d. > A problem internal to GDB has been detected, > further debugging may prove unreliable. > Quit this debugging session? (y or n) FAIL: gdb.threads/vfork-follow-child-exit.exp: detach-on-fork=off: continue (GDB internal error) > > That NULL pspace idea was legitimate, but it's no longer necessary, > since commit b2e586e850db ("Defer breakpoint reset when cloning > progspace for fork child"). So the fix is to just set the inferior's > program space earlier. > > > The other latent bug exposed is in exec.c. When exec_close is called > from the program_space destructor, it is purposedly called with a > current program space that is not current inferior's program space. > The problem is that the multi-target work added some code to > remove_target_sections that loops over all inferiors, and uses > scoped_restore_current_thread to save/restore the previous > thread/inferior/frame state. This makes it so that exec_close returns > with the current program space set to the current inferior's program > space, which is exactly what we did not want. Then the program_space > destructor continues into free_all_objfiles, but it is not running > that method on the wrong program space, resulting in: > > Reading symbols from /home/pedro/gdb/binutils-gdb/build/gdb/testsuite/outputs/gdb.threads/fork-plus-threads/fork-plus-threads... > Reading symbols from /usr/lib/debug/usr/lib64/libpthread-2.26.so.debug... > Reading symbols from /usr/lib/debug/usr/lib64/libm-2.26.so.debug... > Reading symbols from /usr/lib/debug/usr/lib64/libc-2.26.so.debug... > Reading symbols from /usr/lib/debug/usr/lib64/ld-2.26.so.debug... > [Inferior 3 (process 9583) exited normally] > /home/pedro/gdb/binutils-gdb/build/../src/gdb/progspace.c:170: internal-error: void program_space::free_all_objfiles(): Assertion `so->objfile == NULL' failed. > A problem internal to GDB has been detected, > further debugging may prove unreliable. > Quit this debugging session? (y or n) FAIL: gdb.threads/fork-plus-threads.exp: detach-on-fork=off: inferior 1 exited (GDB internal error) > > The fix is to use scoped_restore_current_pspace_and_thread instead of > scoped_restore_current_thread. I did not look at the code in depth, but your explanations make sense to me, so if the testsuite shows no regression, this LGTM. Simon