Subject: Re: [PATCH 12/12] gdb: use two displaced step buffers on amd64/Linux
To: Simon Marchi, gdb-patches@sourceware.org
From: Pedro Alves
Date: Wed, 25 Nov 2020 01:42:20 +0000
Message-ID: <7040e2ee-4d28-a83e-22df-20b2ace082bb@palves.net>
In-Reply-To: <20201110214614.2842615-13-simon.marchi@efficios.com>
References: <20201110214614.2842615-1-simon.marchi@efficios.com>
 <20201110214614.2842615-13-simon.marchi@efficios.com>
List-Id: Gdb-patches mailing list

On 11/10/20 9:46 PM, Simon Marchi via Gdb-patches wrote:
> As observed on a binary compiled on AMD64 Ubuntu 20.04, against glibc
> 2.31 (I think it's the libc that provides this startup code, right?),

Right.  This is crt0.o.
It's sysdeps/x86_64/start.S in this case:

  https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/start.S;h=7477b632f7ad6ffe6cd1975cdc4ec2317f11a6c3;hb=HEAD

> there are enough bytes at the executable's entry point to hold more than
> one displaced step buffer.  gdbarch_max_insn_length is 16, and the
> code at _start looks like:
>
> 0000000000001040 <_start>:
>     1040: f3 0f 1e fa             endbr64
>     1044: 31 ed                   xor    %ebp,%ebp
>     1046: 49 89 d1                mov    %rdx,%r9
>     1049: 5e                      pop    %rsi
>     104a: 48 89 e2                mov    %rsp,%rdx
>     104d: 48 83 e4 f0             and    $0xfffffffffffffff0,%rsp
>     1051: 50                      push   %rax
>     1052: 54                      push   %rsp
>     1053: 4c 8d 05 56 01 00 00    lea    0x156(%rip),%r8    # 11b0 <__libc_csu_fini>
>     105a: 48 8d 0d df 00 00 00    lea    0xdf(%rip),%rcx    # 1140 <__libc_csu_init>
>     1061: 48 8d 3d c1 00 00 00    lea    0xc1(%rip),%rdi    # 1129
>     1068: ff 15 72 2f 00 00       callq  *0x2f72(%rip)      # 3fe0 <__libc_start_main@GLIBC_2.2.5>
>     106e: f4                      hlt
>     106f: 90                      nop
>
> The two buffers would occupy [0x1040, 0x1060).
>
> I checked on Alpine, which uses the musl C library, where the startup
> code looks like:
>
> 0000000000001048 <_start>:
>     1048: 48 31 ed                xor    %rbp,%rbp
>     104b: 48 89 e7                mov    %rsp,%rdi
>     104e: 48 8d 35 e3 2d 00 00    lea    0x2de3(%rip),%rsi  # 3e38 <_DYNAMIC>
>     1055: 48 83 e4 f0             and    $0xfffffffffffffff0,%rsp
>     1059: e8 00 00 00 00          callq  105e <_start_c>
>
> 000000000000105e <_start_c>:
>     105e: 48 8b 37                mov    (%rdi),%rsi
>     1061: 48 8d 57 08             lea    0x8(%rdi),%rdx
>     1065: 45 31 c9                xor    %r9d,%r9d
>     1068: 4c 8d 05 47 01 00 00    lea    0x147(%rip),%r8    # 11b6 <_fini>
>     106f: 48 8d 0d 8a ff ff ff    lea    -0x76(%rip),%rcx   # 1000 <_init>
>     1076: 48 8d 3d 0c 01 00 00    lea    0x10c(%rip),%rdi   # 1189
>     107d: e9 9e ff ff ff          jmpq   1020 <__libc_start_main@plt>
>
> Even though there's a _start_c symbol, it all appears to be code that
> runs once at the very beginning of the program, so it looks fine if the
> two buffers occupy [0x1048, 0x1068).
>
> One important thing I discovered while doing this is that when debugging
> a dynamically-linked executable, breakpoints in the shared library
> loader are hit before executing the _start code, and these breakpoints
> may be displaced-stepped.  So it's very important that the buffer bytes
> are restored properly after doing the displaced steps, otherwise the
> _start code will be corrupted once we try to execute it.
>
> Another thing this made me think about is that library constructors (as
> in `__attribute__((constructor))`) run before _start.  And they are free
> to spawn threads.  What if one of these threads executes a displaced
> step, therefore changing the bytes at _start, while the main thread
> executes _start?  That doesn't sound good and I don't know how we could
> prevent it.  But this is a problem that predates the current patch.

Maybe we should put the buffer/buffers at the beginning of the dynamic
loader instead.

> Even when stress-testing the implementation, by making many threads do
> displaced steps over and over, I didn't see a significant performance
> improvement (I confirmed that the two buffers were used by checking the
> "set debug displaced" logs, though).  However, this patch mostly helps
> make the feature testable by anybody with an AMD64/Linux machine, so I
> think it's useful.

It should speed up the use case of multiple threads hitting a
conditional breakpoint that evals false at the same time.  The more
buffers, the less time threads need to spend waiting for their turn in
the displaced stepping queue.  If that doesn't speed things up, then
probably something isn't quite right.

/me gives it a try.

Here's a testcase that can be used to observe speed and fairness.
It has 10 threads running a single tight loop for 3 seconds.  After the
3 seconds, the threads exit and the number of iterations each thread
managed to do is printed.  The idea is to set a breakpoint on the loop
with a condition that evals false.  The higher the numbers, the better.

I compiled it with:

$ gcc disp-step-buffers-test.c -o disp-step-buffers-test -g3 -O2 -pthread

Here's a run outside gdb:

$ /tmp/disp-step-buffers-test
thread 0, count 12785417966
thread 1, count 12784090476
thread 2, count 12773373753
thread 3, count 12782542707
thread 4, count 12767835404
thread 5, count 12754382637
thread 6, count 12783454775
thread 7, count 12605966064
thread 8, count 12635255271
thread 9, count 12783436261

Here's the test (the #include lines were mangled in transit; the headers
below are the ones the code actually needs):

$ cat /tmp/disp-step-buffers-test.c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NUM_THREADS 10

static pthread_t child_thread[NUM_THREADS];
static unsigned long long counters[NUM_THREADS];
static volatile int done;

static void *
child_function (void *arg)
{
  while (!done)
    counters[(long) arg]++; // set breakpoint here

  return NULL;
}

int
main (void)
{
  long i;

  for (i = 0; i < NUM_THREADS; i++)
    pthread_create (&child_thread[i], NULL, child_function, (void *) i);

  sleep (3);
  done = 1;

  for (i = 0; i < NUM_THREADS; i++)
    pthread_join (child_thread[i], NULL);

  for (i = 0; i < NUM_THREADS; i++)
    printf ("thread %ld, count %llu\n", i, counters[i]);

  return 0;
}

And here are the results under gdb.

Test #1, pristine master (119e99bb7f50):

(gdb) b 16 if 0
Breakpoint 1 at 0x1290: file disp-step-buffers-test.c, line 16.
(gdb) r
Starting program: /tmp/disp-step-buffers-test
thread 0, count 1661
thread 1, count 1663
thread 2, count 1646
thread 3, count 1663
thread 4, count 1622
thread 5, count 1661
thread 6, count 1659
thread 7, count 1662
thread 8, count 1660
thread 9, count 1660
[Inferior 1 (process 18852) exited normally]
(gdb)

Test #2, patches 1 through 11 applied, i.e., 1 buffer:

(gdb) b 16 if 0
Breakpoint 1 at 0x1290: file disp-step-buffers-test.c, line 16.
(gdb) r
Starting program: /tmp/disp-step-buffers-test
...
thread 0, count 966
thread 1, count 950
thread 2, count 951
thread 3, count 950
thread 4, count 946
thread 5, count 948
thread 6, count 946
thread 7, count 979
thread 8, count 951
thread 9, count 966
[Inferior 1 (process 16099) exited normally]
(gdb)

Test #3, patches 1 through 12 applied, i.e., 2 buffers:

(gdb) r
Starting program: /tmp/disp-step-buffers-test
...
thread 0, count 1124
thread 1, count 1128
thread 2, count 1127
thread 3, count 1123
thread 4, count 1121
thread 5, count 1125
thread 6, count 1126
thread 7, count 1126
thread 8, count 1122
thread 9, count 1122
[Inferior 1 (process 14983) exited normally]

Wow, test #2 was surprising: it's twice as slow as current master.
That was unexpected.  Test #3 with the two buffers improves the speed a
bit, but is still quite behind current master.  I think this should be
sorted out, or at least understood.

I checked, and it's patch #9 that introduces the slowdown:

  gdb: move displaced stepping logic to gdbarch, allow starting
  concurrent displaced steps

so I'd suspect some new linear thread walks are causing this, e.g.,
displaced_step_in_progress.  Maybe we need to keep a counter of the
number of threads displaced stepping instead of walking all threads,
for example.  I didn't expect this to be observable with only 10
threads, though; I'd expect syscall/ptrace time, etc. to dominate.
Maybe it's something else?

The above numbers were with an -O0 gdb.  I tried again with a gdb built
at -O2, and the absolute numbers were of course better, but the slowdown
is still observed.

I also tried different numbers of threads.  Here's what I see:

 threads | master | patch #11 | patch #12
---------+--------+-----------+----------
      10 |   3959 |      2847 |      3055
      20 |   1806 |      1055 |      1134
      50 |    690 |       258 |       258