From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from simark.ca by simark.ca with LMTP id mAeANc/rH2bcoDIAWB0awg (envelope-from ) for ; Wed, 17 Apr 2024 11:33:35 -0400 Received: by simark.ca (Postfix, from userid 112) id D5E7F1E0C0; Wed, 17 Apr 2024 11:33:35 -0400 (EDT) Received: from server2.sourceware.org (server2.sourceware.org [8.43.85.97]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (prime256v1) server-digest SHA256) (No client certificate requested) by simark.ca (Postfix) with ESMTPS id AA82C1E092 for ; Wed, 17 Apr 2024 11:33:33 -0400 (EDT) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 1A19F3858424 for ; Wed, 17 Apr 2024 15:33:33 +0000 (GMT) Received: from mail-lj1-f169.google.com (mail-lj1-f169.google.com [209.85.208.169]) by sourceware.org (Postfix) with ESMTPS id 252363858D20 for ; Wed, 17 Apr 2024 15:33:01 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 252363858D20 Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=palves.net Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 252363858D20 Authentication-Results: server2.sourceware.org; arc=none smtp.remote-ip=209.85.208.169 ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1713367992; cv=none; b=Nv6TjFXFfSwKcf5vnKkmy9KcrYpmuOrteVjxljWUHDGzmqcEy7CVdKSlW6oiHsY9tyHECDXH0bPaFQ7UIw0Euk5tNH5P9lL3oWNKTY9CDhhg0W9E/hJ9X89mncUCCM1zQIxGrNRY8haIFiJ5wL2GF7Ic6Uh46r9hR+Q1aPnAmoI= ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1713367992; c=relaxed/simple; bh=TNPx9IxP+sLDj5VXJYkWagQW3baefphOatnW1MwpF3w=; h=Message-ID:Date:MIME-Version:From:Subject:To; b=fivxhoM7uv3OjHny3YrTcH2USrve7DVWJtBZ/2v1JFuGjpzjGBZAa1K03yHzDQVS2AG5GSnJMU/21/CP5sgMiNvzTinr+EK2uxicgIXzFbnLh17qzEBIYgph6TQdOqlsHR508ocnzWV/t7FcNRxVLkosn7exnAor3fJmurStTQ4= ARC-Authentication-Results: i=1; server2.sourceware.org Received: by mail-lj1-f169.google.com with SMTP id 38308e7fff4ca-2d8a24f8a3cso65020971fa.1 for ; Wed, 17 Apr 2024 08:33:01 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1713367980; x=1713972780; h=content-transfer-encoding:in-reply-to:content-language:references :cc:to:subject:from:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=jYA/sCS5OXZIdJxtMkQVZt8ATBxMgUZZI8ZuD/wUp14=; b=L43n08UUiQ1GGLwbDU536jc4BOWDchNim1C45l+YNn6HVEEi+HYUuYcFw5BDmEqyD5 756WQ7D6t6Mbc9UrCGcDxk1Q2hxp4V//aNv+jyilYnrDcIjNniWKjl4Vc91Lfy9LYK1F Gj1b6xGdC+shSGH6/tc13srnqFBQFkz2vzCxQF5Qvik/YrFQyXa+gYkcsm87/86WLacP AsxxN54MzDtVTzZNLa1BR+NXUVVklSer2ljF6S3EOZp8TQTEz9tptbS1tZq3cmEoyDZI gCSxLODY+hN++0dwfuHON/mHHXksbIgOehtZ4frWycvQuo4xm9tAh8RBR2obX+RP1Ume ubRg== X-Gm-Message-State: AOJu0Yw/wcOuGJIa1b8R3YI8DJPZhMrq8P+jaJJdsRIWgRIRv734T3xa epgr1jH9xF8+i7plBICOnQbdGQxdcpmQ46cs4ZdHV4FiibhdRwHvb270QDiW X-Google-Smtp-Source: AGHT+IH4GXwgwhcoHh+OJHyFrUCCt6NtkIpPJF/3i2C+tmJ0wgzCOuGiiCSO7cBS/Qga4cVC1ZacfA== X-Received: by 2002:a2e:84c4:0:b0:2da:8b96:7470 with SMTP id q4-20020a2e84c4000000b002da8b967470mr5515402ljh.33.1713367979303; Wed, 17 Apr 2024 08:32:59 -0700 (PDT) Received: from ?IPV6:2001:8a0:f93d:b900:e643:3adf:640c:b3ad? ([2001:8a0:f93d:b900:e643:3adf:640c:b3ad]) by smtp.gmail.com with ESMTPSA id d5-20020adfc805000000b00349bd105089sm1928834wrh.47.2024.04.17.08.32.58 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Wed, 17 Apr 2024 08:32:58 -0700 (PDT) Message-ID: <9680e3cf-b8ad-4329-a51c-2aafb98d9476@palves.net> Date: Wed, 17 Apr 2024 16:32:56 +0100 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird From: Pedro Alves Subject: Re: [RFC PATCH 3/3] gdb/nat/linux: Fix attaching to process when it has zombie threads To: Thiago Jung Bauermann Cc: gdb-patches@sourceware.org References: <20240321231149.519549-1-thiago.bauermann@linaro.org> <20240321231149.519549-4-thiago.bauermann@linaro.org> <87msptgbey.fsf@linaro.org> Content-Language: en-US In-Reply-To: <87msptgbey.fsf@linaro.org> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-11.0 required=5.0 tests=BAYES_00, FREEMAIL_FORGED_FROMDOMAIN, FREEMAIL_FROM, GIT_PATCH_0, HEADER_FROM_DIFFERENT_DOMAINS, KAM_DMARC_STATUS, RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H2, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: gdb-patches@sourceware.org X-Mailman-Version: 2.1.30 Precedence: list List-Id: Gdb-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: gdb-patches-bounces+public-inbox=simark.ca@sourceware.org On 2024-04-16 05:48, Thiago Jung Bauermann wrote: > > Hello Pedro, > > [ I just now noticed that I replied to you in private last time. Sorry > about that, I thought I had cc'd the list. ] I hadn't noticed. :-) > > Pedro Alves writes: > >> On 2024-03-21 23:11, Thiago Jung Bauermann wrote: >>> When GDB attaches to a multi-threaded process, it calls >>> linux_proc_attach_tgid_threads () to go through all threads found in >>> /proc/PID/task/ and call attach_proc_task_lwp_callback () on each of >>> them. If it does that twice without the callback reporting that a new >>> thread was found, then it considers that all inferior threads have been >>> found and returns. >>> >>> The problem is that the callback considers any thread that it hasn't >>> attached to yet as new. This causes problems if the process has one or >>> more zombie threads, because GDB can't attach to it and the loop will >>> always "find" a new thread (the zombie one), and get stuck in an >>> infinite loop. >>> >>> This is easy to trigger (at least on aarch64-linux and powerpc64le-linux) >>> with the gdb.threads/attach-many-short-lived-threads.exp testcase, because >>> its test program constantly creates and finishes joinable threads so the >>> chance of having zombie threads is high. >> >> Hmm. How about simply not restarting the loop if attach_lwp tries to attach to >> a zombie lwp (and silently fails)? >> >> I.e., in gdb, make attach_proc_task_lwp_callback return false/0 here: >> >> if (ptrace (PTRACE_ATTACH, lwpid, 0, 0) < 0) >> { >> int err = errno; >> >> /* Be quiet if we simply raced with the thread exiting. >> EPERM is returned if the thread's task still exists, and >> is marked as exited or zombie, as well as other >> conditions, so in that case, confirm the status in >> /proc/PID/status. */ >> if (err == ESRCH >> || (err == EPERM && linux_proc_pid_is_gone (lwpid))) >> { >> linux_nat_debug_printf >> ("Cannot attach to lwp %d: thread is gone (%d: %s)", >> lwpid, err, safe_strerror (err)); >> >> return 0; <<<< NEW RETURN >> } >> >> Similar thing for gdbserver, of course. > > Thank you for the suggestion. I tried doing that, and in fact I attached > a patch with that change in comment #17 of PR 31312 when I was > investigating a fix¹. I called it a workaround because I also had to > increase the number of iterations in linux_proc_attach_tgid_threads from > 2 to 100, otherwise GDB gives up on waiting for new inferior threads too > early and the inferior dies with a SIGTRAP because some new unnoticed > thread tripped into the breakpoint. > > Because of the need to increase the number of iterations, I didn't > consider it a good solution and went with the approach in this patch > series instead. Now I finally understand why I had to increase the > number of iterations (though I still don't see a way around it other > than what this patch series does): > > With the approach in this patch series, even if a new thread becomes > zombie by the time GDB tries to attach to it, it still causes > new_threads_found to be set the first time GDB notices it, and the loop > in linux_proc_attach_tgid_threads starts over. > > With the approach above, a new thread that becomes zombie before GDB has > a chance to attach to it never causes the loop to start over, and so it > exits earlier. > > I think it's a matter of opinion whether one approach or the other can > be considered the better one. > > Even with this patch series, it's not guaranteed that two iterations > without finding new threads is enough to ensure that GDB has found all > threads in the inferior. I left the test running in a loop overnight > with the patch series applied and it failed after about 2500 runs. Thanks for all the investigations. I hadn't seen the bugzilla before, I didn't notice there was one identified in the patch. Let me just start by saying "bleh"... This is all of course bad ptrace/kernel design... The only way to plug this race completely is with kernel help, I believe. The only way that I know of today that we can make the kernel pause all threads for us, is with a group-stop. I.e., pass a SIGSTOP to the inferior, and the kernel stops _all_ threads with SIGSTOP. I prototyped it today, and while far from perfect (would need to handle a corner scenarios like attaching and getting back something != SIGSTOP, and I still see some spurious SIGSTOPs), it kind of works ... ... except for what I think is a deal breaker that I don't know how to work around: If you attach to a process that is running on a shell, you will see that the group-stop makes it go to the background. GDB resumes the process, but it won't go back to the foreground without an explicit "fg" in the shell... Like: $ ./test_program [1]+ Stopped ./test_program $ I'm pasting the incomplete patch below for the archives, but I'm not going to pursue this further. Another idea I had was to look for the thread_db thread creation event breakpoint address, and instead of putting a breakpoint there, put a "jmp $pc" instruction there. That would halt creation of new pthreads. But it wouldn't halt raw clones... Maybe someone has some better idea. Really the kernel should have a nice interface for this. Thankfully this is a contrived test, no real program should be spawning threads like this. One would hope. I will take a new look at your series. >From 9dad08eff6fc534964e457de7f99127354dfae44 Mon Sep 17 00:00:00 2001 From: Pedro Alves Date: Wed, 17 Apr 2024 12:47:25 +0100 Subject: [PATCH] Force group-stop when attaching Change-Id: I95bc82d2347af87c74a11ee9ddfc65b851f3e6c7 --- gdb/linux-nat.c | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c index 2602e1f240d..c95b4370d1b 100644 --- a/gdb/linux-nat.c +++ b/gdb/linux-nat.c @@ -1033,6 +1033,14 @@ linux_nat_post_attach_wait (ptid_t ptid, int *signalled) status_to_str (status).c_str ()); } + /* Force the inferior into group-stop, so all threads are frozen by + the kernel, so we can attach to all of them, race-free. */ + ptrace (PTRACE_CONT, pid, 0, SIGSTOP); + + /* */ + pid_t pid2 = my_waitpid (pid, &status, __WALL); + gdb_assert (pid == pid2); + return status; } @@ -1231,6 +1239,31 @@ linux_nat_target::attach (const char *args, int from_tty) throw; } + /* Consume the group-stop SIGSTOP for every thread, and move them + into ptrace-stopped. */ + for (lwp_info *lwp : all_lwps_safe ()) + { + if (lwp->ptid.pid () == lwp->ptid.lwp ()) + continue; + + int lwp_status; + int lwpid = lwp->ptid.lwp (); + pid_t ret = my_waitpid (lwpid, &lwp_status, __WALL); + gdb_assert (ret == lwpid); + + gdb_assert (WIFSTOPPED (lwp_status)); + gdb_assert (WSTOPSIG (lwp_status) == SIGSTOP); + + /* If PTRACE_GETSIGINFO fails with EINVAL, then it is definitely + a group-stop. */ + siginfo_t siginfo; + gdb_assert (!linux_nat_get_siginfo (lwp->ptid, &siginfo)); + gdb_assert (errno == EINVAL); + + kill_lwp (lwpid, SIGSTOP); + ptrace (PTRACE_CONT, lwpid, 0, 0); + } + /* Add all the LWPs to gdb's thread list. */ iterate_over_lwps (ptid_t (ptid.pid ()), [] (struct lwp_info *lwp) -> int base-commit: 2d4c39a885d4d12325d0a9be9e014e75a295fb25 -- 2.43.2