Subject: Re: [PATCH 12/12] gdb: use two displaced step buffers on amd64/Linux
To: Simon Marchi, gdb-patches@sourceware.org
From: Pedro Alves
Date: Wed, 25 Nov 2020 01:42:20 +0000
Message-ID: <7040e2ee-4d28-a83e-22df-20b2ace082bb@palves.net>
In-Reply-To: <20201110214614.2842615-13-simon.marchi@efficios.com>
References: <20201110214614.2842615-1-simon.marchi@efficios.com>
 <20201110214614.2842615-13-simon.marchi@efficios.com>
List-Id: Gdb-patches mailing list

On 11/10/20 9:46 PM, Simon Marchi via Gdb-patches wrote:
> As observed on a binary compiled on AMD64 Ubuntu 20.04, against glibc
> 2.31 (I think it's the libc that provides this startup code, right?),

Right.  This is crt0.o.
It's sysdeps/x86_64/start.S in this case:

  https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/start.S;h=7477b632f7ad6ffe6cd1975cdc4ec2317f11a6c3;hb=HEAD

> there are enough bytes at the executable's entry point to hold more than
> one displaced step buffer.  gdbarch_max_insn_length is 16, and the
> code at _start looks like:
>
> 0000000000001040 <_start>:
>     1040: f3 0f 1e fa             endbr64
>     1044: 31 ed                   xor    %ebp,%ebp
>     1046: 49 89 d1                mov    %rdx,%r9
>     1049: 5e                      pop    %rsi
>     104a: 48 89 e2                mov    %rsp,%rdx
>     104d: 48 83 e4 f0             and    $0xfffffffffffffff0,%rsp
>     1051: 50                      push   %rax
>     1052: 54                      push   %rsp
>     1053: 4c 8d 05 56 01 00 00    lea    0x156(%rip),%r8    # 11b0 <__libc_csu_fini>
>     105a: 48 8d 0d df 00 00 00    lea    0xdf(%rip),%rcx    # 1140 <__libc_csu_init>
>     1061: 48 8d 3d c1 00 00 00    lea    0xc1(%rip),%rdi    # 1129
>     1068: ff 15 72 2f 00 00       callq  *0x2f72(%rip)      # 3fe0 <__libc_start_main@GLIBC_2.2.5>
>     106e: f4                      hlt
>     106f: 90                      nop
>
> The two buffers would occupy [0x1040, 0x1060).
>
> I checked on Alpine, which uses the musl C library, where the startup
> code looks like:
>
> 0000000000001048 <_start>:
>     1048: 48 31 ed                xor    %rbp,%rbp
>     104b: 48 89 e7                mov    %rsp,%rdi
>     104e: 48 8d 35 e3 2d 00 00    lea    0x2de3(%rip),%rsi  # 3e38 <_DYNAMIC>
>     1055: 48 83 e4 f0             and    $0xfffffffffffffff0,%rsp
>     1059: e8 00 00 00 00          callq  105e <_start_c>
>
> 000000000000105e <_start_c>:
>     105e: 48 8b 37                mov    (%rdi),%rsi
>     1061: 48 8d 57 08             lea    0x8(%rdi),%rdx
>     1065: 45 31 c9                xor    %r9d,%r9d
>     1068: 4c 8d 05 47 01 00 00    lea    0x147(%rip),%r8    # 11b6 <_fini>
>     106f: 48 8d 0d 8a ff ff ff    lea    -0x76(%rip),%rcx   # 1000 <_init>
>     1076: 48 8d 3d 0c 01 00 00    lea    0x10c(%rip),%rdi   # 1189
>     107d: e9 9e ff ff ff          jmpq   1020 <__libc_start_main@plt>
>
> Even though there's a _start_c symbol, it all appears to be code that
> runs once at the very beginning of the program, so it looks fine if the
> two buffers occupy [0x1048, 0x1068).
>
> One important thing I discovered while doing this is that when debugging
> a dynamically-linked executable, breakpoints in the shared library
> loader are hit before executing the _start code, and these breakpoints
> may be displaced-stepped.  So it's very important that the buffer bytes
> are restored properly after doing the displaced steps, otherwise the
> _start code will be corrupted once we try to execute it.
>
> Another thing this made me think about is that library constructors (as
> in `__attribute__((constructor))`) run before _start.  And they are free
> to spawn threads.  What if one of these threads executes a displaced
> step, therefore changing the bytes at _start, while the main thread
> executes _start?  That doesn't sound good and I don't know how we could
> prevent it.  But this is a problem that predates the current patch.

Maybe we should put the buffer/buffers at the beginning of the dynamic
loader instead.

> Even when stress-testing the implementation, by making many threads do
> displaced steps over and over, I didn't see a significant performance
> improvement (I confirmed that the two buffers were used by checking the
> "set debug displaced" logs, though).  However, this patch mostly helps
> make the feature testable by anybody with an AMD64/Linux machine, so I
> think it's useful.

It should speed up the use case of multiple threads hitting a
conditional breakpoint that evals false at the same time.  The more
buffers, the less time threads need to spend waiting for their turn in
the displaced stepping queue.  If that doesn't speed things up, then
probably something isn't quite right.

/me gives it a try.

Here's a testcase that can be used to observe speed and fairness.
It has 10 threads running a single tight loop for 3 seconds.  After the
3 seconds, the threads exit and the number of iterations each thread
managed to do is printed.  The idea is to set a breakpoint on the loop
with a condition that evals false.  The higher the numbers, the better.

I compiled it with:

$ gcc disp-step-buffers-test.c -o disp-step-buffers-test -g3 -O2 -pthread

Here's a run outside gdb:

$ /tmp/disp-step-buffers-test
thread 0, count 12785417966
thread 1, count 12784090476
thread 2, count 12773373753
thread 3, count 12782542707
thread 4, count 12767835404
thread 5, count 12754382637
thread 6, count 12783454775
thread 7, count 12605966064
thread 8, count 12635255271
thread 9, count 12783436261

Here's the test (the #include lines were mangled in transit; the headers
below are the ones the code actually needs):

$ cat /tmp/disp-step-buffers-test.c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NUM_THREADS 10

static pthread_t child_thread[NUM_THREADS];
static unsigned long long counters[NUM_THREADS];
static volatile int done;

static void *
child_function (void *arg)
{
  while (!done)
    counters[(long) arg]++; // set breakpoint here

  return NULL;
}

int
main (void)
{
  long i;

  for (i = 0; i < NUM_THREADS; i++)
    pthread_create (&child_thread[i], NULL, child_function, (void *) i);

  sleep (3);
  done = 1;

  for (i = 0; i < NUM_THREADS; i++)
    pthread_join (child_thread[i], NULL);

  for (i = 0; i < NUM_THREADS; i++)
    printf ("thread %ld, count %llu\n", i, counters[i]);

  return 0;
}

And here are the results under gdb.

Test #1, pristine master (119e99bb7f50):

(gdb) b 16 if 0
Breakpoint 1 at 0x1290: file disp-step-buffers-test.c, line 16.
(gdb) r
Starting program: /tmp/disp-step-buffers-test
thread 0, count 1661
thread 1, count 1663
thread 2, count 1646
thread 3, count 1663
thread 4, count 1622
thread 5, count 1661
thread 6, count 1659
thread 7, count 1662
thread 8, count 1660
thread 9, count 1660
[Inferior 1 (process 18852) exited normally]
(gdb)

Test #2, patches 1 through 11 applied, i.e., 1 buffer:

(gdb) b 16 if 0
Breakpoint 1 at 0x1290: file disp-step-buffers-test.c, line 16.
(gdb) r
Starting program: /tmp/disp-step-buffers-test
...
thread 0, count 966
thread 1, count 950
thread 2, count 951
thread 3, count 950
thread 4, count 946
thread 5, count 948
thread 6, count 946
thread 7, count 979
thread 8, count 951
thread 9, count 966
[Inferior 1 (process 16099) exited normally]
(gdb)

Test #3, patches 1 through 12 applied, i.e., 2 buffers:

(gdb) r
Starting program: /tmp/disp-step-buffers-test
...
thread 0, count 1124
thread 1, count 1128
thread 2, count 1127
thread 3, count 1123
thread 4, count 1121
thread 5, count 1125
thread 6, count 1126
thread 7, count 1126
thread 8, count 1122
thread 9, count 1122
[Inferior 1 (process 14983) exited normally]

Wow, test #2 was surprising: it's twice as slow as current master.
That was unexpected.  Test #3 with the two buffers improves the speed a
bit, but is still quite behind current master.  I think this should be
sorted out, or at least understood.

I checked, and it's patch #9 that introduces the slowdown:

  gdb: move displaced stepping logic to gdbarch, allow starting
  concurrent displaced steps

so I'd suspect some new linear thread walks are causing this, e.g.,
displaced_step_in_progress.  Maybe we need to keep a counter of the
number of threads displaced stepping instead of walking all threads,
for example.  I didn't expect this to be observable with only 10
threads, though; I'd expect syscall/ptrace time, etc. to dominate.
Maybe it's something else?

The above numbers were with an -O0 gdb.  I tried again with a gdb built
at -O2, and the absolute numbers were of course better, but the slowdown
is still observed.

I also tried different numbers of threads.  Here's what I see:

 threads | master | patch #11 | patch #12
---------+--------+-----------+----------
      10 |   3959 |      2847 |      3055
      20 |   1806 |      1055 |      1134
      50 |    690 |       258 |       258