From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from simark.ca by simark.ca with LMTP id +N8aCGWqYWe90x4AWB0awg (envelope-from ) for ; Tue, 17 Dec 2024 11:44:21 -0500 Authentication-Results: simark.ca; dkim=pass (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=giV0UDZA; dkim-atps=neutral Received: by simark.ca (Postfix, from userid 112) id 1CC981E097; Tue, 17 Dec 2024 11:44:21 -0500 (EST) X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on simark.ca X-Spam-Level: X-Spam-Status: No, score=-6.4 required=5.0 tests=ARC_SIGNED,ARC_VALID,BAYES_00, DKIMWL_WL_HIGH,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,MAILING_LIST_MULTI, RCVD_IN_DNSWL_MED autolearn=unavailable autolearn_force=no version=4.0.0 Received: from server2.sourceware.org (server2.sourceware.org [8.43.85.97]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (prime256v1) server-digest SHA256) (No client certificate requested) by simark.ca (Postfix) with ESMTPS id 391C41E091 for ; Tue, 17 Dec 2024 11:44:19 -0500 (EST) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id CFB8C3858D20 for ; Tue, 17 Dec 2024 16:44:18 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org CFB8C3858D20 Authentication-Results: sourceware.org; dkim=pass (1024-bit key, unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=giV0UDZA Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by sourceware.org (Postfix) with ESMTP id 87E293858D20 for ; Tue, 17 Dec 2024 16:43:10 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 87E293858D20 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=redhat.com ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 87E293858D20 Authentication-Results: server2.sourceware.org; arc=none smtp.remote-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1734453790; cv=none; b=mz58pYsjaUCLe/ejgYj6FuesV9FffdcOZfh7ERbkrsGf5QPLvuFeFDRc1kq/muJzGeTIgQD//q6AA+MnEvpMYJCgTD2BaWCbfpZ8UsPHucFhRWaKZ/Ng8cV9woOA2HSxpu+toZt5i5cRk//oNNx5rnJY3Z+K3kK4N1FJoF7iEmQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1734453790; c=relaxed/simple; bh=6aXNNVYBHpK/ihUHdh68UMUav9eBHVAq4X+kq/G+O70=; h=DKIM-Signature:Message-ID:Date:MIME-Version:Subject:To:From; b=ikHu+UEjWf/KREbKnQUE2t6C/NsiUMw6ohKPr2qtN8BjR8Aoh+SnKMsgyjGMXvrRv6LY8jgNnU0+h59dCZG7UHgf6fMJT/1svRmE8U2tP0v+iNC3DMAUika5sdTBvJnq8EdIuHLRuIZeYRj0ccZPtzsJZ14E/pL/ZXHbhE5sDNk= ARC-Authentication-Results: i=1; server2.sourceware.org DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 87E293858D20 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1734453790; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=nmNjNlSNM7Gd96IMCojrvwOZwB8AkIuNnb0CyM3aOp4=; b=giV0UDZAd/uqIThfuTI6skPqevJs7lOGZ9b+/J3dUPya64bqnYu8VbNlalkA2L+MlB8I6D JqDMtcjOA22sRjXDm+UXCF2eDR1RGQSZGR1V+DL4otYZLuPzEjK0vTuYbC7PIqEbtejF5F ACscVD4MIJaLY9IsH5fXWV1UeIvW4uQ= Received: from mail-pf1-f200.google.com (mail-pf1-f200.google.com [209.85.210.200]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-155-19q6PwbcPUq8BLh2-4LRGw-1; Tue, 17 Dec 2024 11:43:08 -0500 X-MC-Unique: 19q6PwbcPUq8BLh2-4LRGw-1 X-Mimecast-MFC-AGG-ID: 19q6PwbcPUq8BLh2-4LRGw Received: by mail-pf1-f200.google.com with SMTP id d2e1a72fcca58-725e3c6ad0dso7442322b3a.0 for ; Tue, 17 Dec 2024 08:43:08 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1734453787; x=1735058587; h=content-transfer-encoding:in-reply-to:from:content-language :references:to:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=nmNjNlSNM7Gd96IMCojrvwOZwB8AkIuNnb0CyM3aOp4=; b=FEsHqkweQ9UxiFOT18ngLxdr9Do49lzhMy1dv9RUw3Y2s5p4sctP/hY+gNCoydGpDh murT1hvlBKtu83wUvqND82XmwCrittWX+JHlkQps3JSsph9XPFp+TcN6vIImFM2mLohT dO5j/upq5LNxWEuWh3OQj4zaWz70KWbx13ym+Ot8Zw+1WwnpFoagkrT/vk3fE4Dxn0QM khbNqoYh7fiDs02reohVyb17X2vPNEmror5C3s5VuTJi9w7T4DijnCBOQV8ykSipAmsk vdMbGRpKgqQnLvNpYkrA7AbhqfTgEG9dMbFuNMyBfV9vGbCxKL9ivhc6XfEwif2CTqIX eq2A== X-Forwarded-Encrypted: i=1; AJvYcCWvvMLNfXm6sGKG/h6F2NJTPdhZHpZKrFjETLyY3WQRPcoMSiN/jVt1zlUbmCwMq+ENpNS2jtqc1s8sqA==@sourceware.org X-Gm-Message-State: AOJu0YyJ1W1kmzUlKjkwsdaOuGwbv8PvadvwW84nnND7TeuHiz/YtdPd c1pe/JGnaDzHGkh8ksxp057Q1t7/XFmL3xQmC+u2w02Zc+by9nZFQeCGaKTvNuBbkWHv8MRpRm5 PUlLIHrv4NqJyM1c449FeXmot4dAJ33N8qT9+0bwzHy9PhAIwiJi9FcEihwTGTSNAEI8= X-Gm-Gg: ASbGncss3vda0HSeTIXcIE0s8NBclHjqorKX+BX92Bujl5DI+IOdWthiiTNhIPlXkon m2dEZrCatsghABqPxF1nUeY6s6LBkzZ+jqAhBJzSy/IettAVx32qDZPa0gUuPTwNnC6u6eON86Z eIgbYZW5mbCcgIuTxu4LYen9OtdBpIce7xkH1CzMshoYfM8RUxx2bUhaV2CH9C+9JQCI1drcL/M DpDN4O28KzefUDrwVQ6zZJ2mVWK4kq6ngiMZ3jZr8l8xlpRnAveSjuvzG5BLmJ2M6mx X-Received: by 2002:a05:6a00:f8a:b0:724:f10b:e663 with SMTP id d2e1a72fcca58-7290be74313mr24653144b3a.0.1734453786697; Tue, 17 Dec 2024 08:43:06 -0800 (PST) X-Google-Smtp-Source: AGHT+IGpZgWc4sM7+wAnhiT5cUs6OjOJSr/BkvkcfST49ycrt4PpIfJqJIM+BH5PzLNdTDQtRgA7Nw== X-Received: by 2002:a05:6a00:f8a:b0:724:f10b:e663 with SMTP id d2e1a72fcca58-7290be74313mr24653060b3a.0.1734453785753; Tue, 17 Dec 2024 08:43:05 -0800 (PST) Received: from ?IPV6:2804:14d:8084:92c5::1000? ([2804:14d:8084:92c5::1000]) by smtp.gmail.com with ESMTPSA id d2e1a72fcca58-72918ad5687sm6884264b3a.72.2024.12.17.08.43.03 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Tue, 17 Dec 2024 08:43:05 -0800 (PST) Message-ID: Date: Tue, 17 Dec 2024 13:43:01 -0300 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH] [gdb] Create script to convert old tests into Dwarf::assemble calls. To: William Ferreira , gdb-patches@sourceware.org References: <20241205135550.9320-1-wqferr@gmail.com> From: Guinevere Larsen In-Reply-To: <20241205135550.9320-1-wqferr@gmail.com> X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: bDxYcmetNJCL_p_FZrOCY57KXWxQr5Ob_Cmz5KcFjtQ_1734453787 X-Mimecast-Originator: redhat.com Content-Language: en-US Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-BeenThere: gdb-patches@sourceware.org X-Mailman-Version: 2.1.30 Precedence: list List-Id: Gdb-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: gdb-patches-bounces~public-inbox=simark.ca@sourceware.org Thanks for working on this amazing script! This is really fancy,  and I really like what you got going I have some specific feedback on some lines, but in general I really like where the script is going! On 12/5/24 10:55 AM, William Ferreira wrote: > PR testsuite/32261 requests a script that could convert old .S-based > tests (that were made before dwarf.exp existed) into the new > Dwarf::assemble calls in Tcl. This commit is an initial implementation > of such a script. Python was chosen for convenience, and only relies on > a single external library. > > Usage: the script operates not on .S files, but on ELF files with DWARF > information. To convert an old test, one must run said test via > `make check-gdb TESTS=testname` in their build directory. This will > produce, as a side effect, an ELF file the test used as an inferior, at > gdb/testsuite/outputs/testname/testname. This ELF file is this script's > input. > > Reliability: not counting the limitations listed below, the script seems > functional enough to be worthy of discussion in the mailing list. I have > been testing it with different tests that already use Dwarf::assemble, > to see if the script can produce a similar call to it. Indeed, in the > cases that I've tested (including some more complex ones, marked with an > asterisk below) it is able to produce comparable output to the original > exp file. Of course, it can't reproduce the complex code *before* the > Dwarf::assemble call. Values calculated then are simply inlined. > > The following .exp files have been tried in this way and their outputs > highly resemble the original: > - gdb.dwarf2/dynarr-ptr > - gdb.dwarf2/void-type > - gdb.dwarf2/ada-thick-pointer > - gdb.dwarf2/atomic-type > - gdb.dwarf2/dw2-entry-points (*) > > The following .exp files DO NOT WORK with this script: > - gdb.dwarf2/cu-no-addrs > - aranges not supported. > - Compile unit hi_pc and low_pc hardcoded, prone to user error > due to forgetting to replace with variables. > > The following .exp files work, with caveats addressed in the limitations > section below. > - gdb.dwarf2/cpp-linkage-name > - Works correctly except for one attribute of the form SPECIAL_expr. > - gdb.dwarf2/dw2-unusual-field-names > - Same as above, with two instances of SPECIAL_expr. > - gdb.dwarf2/implptr-optimized-out > - Same as above, with two instances of SPECIAL_expr. > > Currently the script has the following known limitations: > - It does not support Tcl outside a small boilerplate and > Dwarf::assemble calls. I don't think we need to support anything other than creating the dwarf::assemble stuff. Everything else is usual boilerplate from the testsuite, so we're pretty familiar with it (or at least have plenty of things to double-check) while dwarf assembler isn't as common and much more complex. > - It does not support line tables. > - It does not use $srcfile and other variables in the call to > Dwarf::assemble (since it can't know where it is safe to substitute). > - It does not support "fission" type DWARFs (in fact I still have no > clue what those are). > - It does not support cu {label LABEL} {} CUs, mostly because I couldn't > find the information using pyelftools. > - It sometimes outputs empty CUs at the start and end of the call. This > might be a problem with my machine, but I've checked with DWARF dumps > and they are indeed in the input ELF files generated by > `make check-gdb`. > - It does not support attributes with the forms DW_FORM_block* and > DW_FORM_exprloc. This is mostly not a concern of the difficulty of > the implementation, but of it being an incomplete feature and, thus, > more susceptible to users forgetting to correct its mistakes or > unfinished values (please see discussion started by Guinevere at > comment 23 https://sourceware.org/bugzilla/show_bug.cgi?id=32261#c23). > The incompleteness of this feature is easy to demonstrate: any call to > gdb_target_symbol, a common tool used in these attributes, needs a > symbol name that is erased after compilation. There is no way to guess > where that address being referenced in a DW_OP_addr comes from, and it > can't be hard coded since it can change depending on the machine > compiling it. > > Please bring up any further shortcomings this script may have with your > expectations. > > Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=32261 > --- > gdb/testsuite/lib/asm_to_dwarf_assembler.py | 830 ++++++++++++++++++++ > 1 file changed, 830 insertions(+) > create mode 100644 gdb/testsuite/lib/asm_to_dwarf_assembler.py > > diff --git a/gdb/testsuite/lib/asm_to_dwarf_assembler.py b/gdb/testsuite/lib/asm_to_dwarf_assembler.py > new file mode 100644 > index 00000000000..ea86c19c805 > --- /dev/null > +++ b/gdb/testsuite/lib/asm_to_dwarf_assembler.py > @@ -0,0 +1,830 @@ > +# Copyright 2024 Free Software Foundation, Inc. > + > +# This program is free software; you can redistribute it and/or modify > +# it under the terms of the GNU General Public License as published by > +# the Free Software Foundation; either version 3 of the License, or > +# (at your option) any later version. > +# > +# This program is distributed in the hope that it will be useful, > +# but WITHOUT ANY WARRANTY; without even the implied warranty of > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > +# GNU General Public License for more details. > +# > +# You should have received a copy of the GNU General Public License > +# along with this program. If not, see . > + > +# Due to the pyelftools dependency, this script requires Python version > +# 3.10 or greater to run. > + > +"""A utility to convert ELF files with DWARF info to Dwarf::assemble code. > + > +Usage: > + python ./asm_to_dwarf_assembler.py > + > +Dependencies: > + Python >= 3.10 > + pyelftools >= 0.31 > + > +Notes: > +- Line tables are not currently supported. > +- Non-contiguous subprograms are not currently supported. > +- If you want to use $srcfile or similar, you must edit the references to the > + file name manually, including DW_AT_name attributes on compile units. > +- If run with binaries generated by make check-gdb, it may include an > + additional compile_unit before and after the actual compile units. This is > + an artifact of the normal compilation process, as these CUs are indeed in > + the generated DWARF in some cases. > +""" > + > +import sys > +import argparse > +import elftools > +import re > + > +from io import BytesIO, IOBase > +from typing import Any, Optional, Iterable, List, Union, Annotated > +from functools import cache > +from dataclasses import dataclass > +from copy import copy > +from datetime import datetime > +from logging import getLogger > + > +from elftools import dwarf, elf > +from elftools.elf.elffile import ELFFile > +from elftools.dwarf.compileunit import CompileUnit as RawCompileUnit > +from elftools.dwarf.die import DIE as RawDIE, AttributeValue > +from elftools.common.exceptions import DWARFError > + > + > +logger = getLogger(__file__) > + > + > +EXPR_ATTRIBUTE_FORMS = [ > + "DW_FORM_exprloc", > + "DW_FORM_block", > + "DW_FORM_block1", > + "DW_FORM_block2", > + "DW_FORM_block4", > +] Since these aren't supported, does it make sense to keep them here? > + > + > +@dataclass > +class DWARFOperation: > + """Values of the form DW_OP_.""" > + suffix: Annotated[str, 'Part of the name that succeeds "DW_OP_".'] > + code: Annotated[int, "Numeric representation of the OP."] > + num_operands: Annotated[int, "Number of operands this operator expects."] > + > + > +dwarf_operations: Annotated[ > + dict[int, DWARFOperation], > + "Table of all DWARF operations as per table 7.9 of the DWARF5 spec." > +] = {} > +def _register_op(suffix: str, code: int, num_operands: int) -> None: > + assert code not in dwarf_operations, "Duplicate operation code found." > + dwarf_operations[code] = DWARFOperation(suffix, code, num_operands) > + > +def _register_all_ops() -> None: > + # Op codes 0x01 and 0x02 are reserved. > + > + _register_op("addr", 0x03, 1) > + > + # Op codes 0x04 and 0x05 are reserved. > + > + _register_op("deref", 0x06, 0) > + > + # Op code 0x07 is reserved. > + > + _register_op("const1u", 0x08, 1) > + _register_op("const1s", 0x09, 1) > + _register_op("const2u", 0x0a, 1) > + _register_op("const2s", 0x0b, 1) > + _register_op("const4u", 0x0c, 1) > + _register_op("const4s", 0x0d, 1) > + _register_op("const8u", 0x0e, 1) > + _register_op("const8s", 0x0f, 1) > + _register_op("constu", 0x10, 1) > + _register_op("consts", 0x11, 1) > + > + _register_op("dup", 0x12, 0) > + _register_op("drop", 0x13, 0) > + _register_op("over", 0x14, 0) > + _register_op("pick", 0x15, 1) > + _register_op("swap", 0x16, 0) > + _register_op("rot", 0x17, 0) > + _register_op("xderef", 0x18, 0) > + > + _register_op("abs", 0x19, 0) > + _register_op("and", 0x1a, 0) > + _register_op("div", 0x1b, 0) > + _register_op("minus", 0x1c, 0) > + _register_op("mod", 0x1d, 0) > + _register_op("mul", 0x1e, 0) > + _register_op("neg", 0x1f, 0) > + _register_op("not", 0x20, 0) > + _register_op("or", 0x21, 0) > + _register_op("plus", 0x22, 0) > + _register_op("plus_uconst", 0x23, 1) > + _register_op("shl", 0x24, 0) > + _register_op("shr", 0x25, 0) > + _register_op("shra", 0x26, 0) > + _register_op("xor", 0x27, 0) > + _register_op("bra", 0x28, 1) > + _register_op("eq", 0x29, 0) > + _register_op("ge", 0x2a, 0) > + _register_op("gt", 0x2b, 0) > + _register_op("le", 0x2c, 0) > + _register_op("lt", 0x2d, 0) > + _register_op("ne", 0x2e, 0) > + _register_op("skip", 0x2f, 0) > + > + for lit_nr in range(32): > + _register_op(f"lit{lit_nr}", 0x30 + lit_nr, 0) > + > + for reg_nr in range(32): > + _register_op(f"reg{reg_nr}", 0x50 + reg_nr, 0) > + > + for breg_nr in range(32): > + _register_op(f"breg{breg_nr}", 0x70 + breg_nr, 0) > + > + _register_op("regx", 0x90, 1) > + _register_op("fbregx", 0x91, 1) > + _register_op("bregx", 0x92, 2) > + _register_op("piece", 0x93, 1) > + _register_op("deref_size", 0x94, 1) > + _register_op("xderef_size", 0x95, 1) > + _register_op("nop", 0x96, 0) > + _register_op("push_object_address", 0x97, 0) > + > + _register_op("call2", 0x98, 1) > + _register_op("call4", 0x99, 1) > + _register_op("call_ref", 0x9a, 1) > + _register_op("form_tls_address", 0x9b, 0) > + _register_op("call_frame_cfa", 0x9c, 0) > + > + _register_op("bit_piece", 0x9d, 2) > + _register_op("implicit_value", 0x9e, 2) > + _register_op("stack_value", 0x9f, 0) > + _register_op("implicit_pointer", 0xa0, 2) > + _register_op("addrx", 0xa1, 1) > + _register_op("constx", 0xa2, 1) > + > + _register_op("entry_value", 0xa3, 2) > + _register_op("const_type", 0xa4, 3) > + _register_op("regval_type", 0xa5, 2) > + _register_op("deref_type", 0xa6, 2) > + _register_op("xderef_type", 0xa7, 2) > + > + _register_op("convert", 0xa8, 1) > + _register_op("reinterpret", 0xa9, 1) > + > + > +_register_all_ops() Similar question to here, if I understood later code correctly. Do we want to keep this if we're not doing location expressions? > + > + > +copyright_notice_template = """ > +# Copyright 2024-{current_year} Free Software Foundation, Inc. If the script is adding copyright, it will be for a new file, and so copyright will always start on the year when the script is run. So it can just be {current_year} > +# > +# This program is free software; you can redistribute it and/or modify > +# it under the terms of the GNU General Public License as published by > +# the Free Software Foundation; either version 3 of the License, or > +# (at your option) any later version. > +# > +# This program is distributed in the hope that it will be useful, > +# but WITHOUT ANY WARRANTY; without even the implied warranty of > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > +# GNU General Public License for more details. > +# > +# You should have received a copy of the GNU General Public License > +# along with this program. If not, see . > +""".lstrip() # Strip to remove initial \n, but not the last one. > + > +# Workaround for my editor not to freak out over unclosed braces. > +lbrace, rbrace = "{", "}" > + > +@cache > +def get_indent_str(indent_count: int) -> str: > + """Get whitespace string to prepend to another for indenting.""" > + indent = (indent_count // 2) * "\t" > + if indent_count % 2 == 1: > + indent += " " > + return indent > + > + > +def indent(line: str, indent_count: int) -> str: > + """Indent line by indent_count levels.""" > + return get_indent_str(indent_count) + line > + > + > +def labelify_str(s: str) -> str: > + """Make s appropriate for a label name.""" > + # Replace "*" with the literal word "ptr". > + s = s.replace("*", "ptr") > + > + # Replace any non-"word" characters by "_". > + s = re.sub(r"\W", "_", s) > + > + # Remove consecutive "_"s. > + s = re.sub(r"__+", "_", s) > + > + return s > + > + > +class DWARFAttribute: > + """Storage unit for a single DWARF attribute. > + > + All its values are strings that are usually passed on > + directly to format. The exceptions to this are attributes > + with int values with DW_FORM_ref4 or DW_FORM_ref_addr form. > + Their values are interpreted as the global offset of the DIE > + being referenced, which are looked up dynamically to fetch > + their labels. > + """ > + def __init__( > + self, > + die_offset: int, > + name: str, > + value: str | bytes | int | bool, > + form=None, > + ): > + self.die_offset = die_offset > + self.name = name > + self.value = value > + self.form = form > + > + def _format_expr_value(self) -> str: > + # This code is left here in case this is feature is desired in the > + # future. It is partly functional, being able to decode operations > + # but not operands due to their variable length in bytes. > + > + # self.form = "SPECIAL_expr" > + # lines = [lbrace] > + # curr_idx = 0 > + # while curr_idx < len(self.value): > + # curr_op_code = self.value[curr_idx] > + # > + # try: > + # curr_op: DWARFOperation = dwarf_operations[curr_op_code] > + # except KeyError as err: > + # raise KeyError( > + # f"Unknown op: {curr_op_code}, item {curr_idx} of" > + # f" attribute {self.name}, DIE at offset" > + # f" {self.die_offset}." > + # ) from err > + # # Empty arg here so it inserts a space before the first argument, > + # # only if there is indeed a first argument. > + # args = [""] > + # for arg_nr in range(curr_op.num_operands): > + # curr_idx += 1 > + # # TODO decode bytes according to specific operation. > + # # The following line only reads the next BYTE of the sequence, > + # # and needs to be adjusted to read the correct number of > + # # bytes. > + # # args.append(str(self.value[curr_idx])) > + # op_line = "DW_OP_" + curr_op.suffix + " ".join(args) > + # lines.append(op_line) > + # curr_idx += 1 > + # > + # lines.append(rbrace) > + # return "\n".join(lines) > + self.form = "SPECIAL_expr" > + return "{ TODO: Fill expr list }" Maybe, rather than "TODO" you could add "MANUAL" or something, to make it clear that this is a manual process and expected to be so? (naming is hard, but I think todo implies that this will be improved soon, and I don't think that'd be the case?) > + > + def _needs_escaping(self, str_value: str) -> bool: > + charset = set(str_value) > + return bool(charset.intersection({"{", "}", " ", "\t"})) > + > + def _format_str(self, str_value: str) -> str: > + if self._needs_escaping(str_value): > + escaped_str = str(str_value) > + # Replace single escape (which is itself escaped because of regex) > + # with a double escape (which doesn't mean anything to regex so > + # it doesn't need escaping). > + escaped_str = re.sub(r"\\", r"\\", escaped_str) > + escaped_str = re.sub("([{}])", r"\\\1", escaped_str) > + return "{" + escaped_str + "}" > + else: > + return str_value > + > + def _format_value( > + self, > + offset_die_lookup: dict[int, "DWARFDIE"], > + indent_count: int = 0 > + ) -> str: > + if self.form in EXPR_ATTRIBUTE_FORMS: > + return self._format_expr_value() > + elif isinstance(self.value, bool): > + return str(int(self.value)) > + elif isinstance(self.value, int): > + if self.form == "DW_FORM_ref4": > + # ref4-style referencing label. > + die: "DWARFDIE" = offset_die_lookup[self.value] > + return ":$" + die.tcl_label > + elif self.form == "DW_FORM_ref_addr": > + # ref_addr-style referencing label. > + die: "DWARFDIE" = offset_die_lookup[self.value] > + return "%$" + die.tcl_label > + else: > + return str(self.value) > + elif isinstance(self.value, bytes): > + return self._format_str(self.value.decode("ascii")) > + elif isinstance(self.value, str): > + return self._format_str(self.value) > + else: > + raise NotImplementedError( > + "Unknown data type: " + str(type(self.value)) > + ) > + > + def format( > + self, > + offset_die_lookup: dict[int, "DWARFDIE"], > + indent_count: int = 0 > + ) -> str: > + """Format the attribute in the form {name value form}. > + > + If form is DW_FORM_exprloc or DW_FORM_block, see next section on > + DWARFOperations. > + > + If it isn't, value is formatted as follows: > + If bool, use "1" if True, "0" if False. > + If int: > + If form is DW_FORM_ref4, use ":$label" where label is the > + tcl_label of the DWARFDIE at offset "value". > + If form is DW_FORM_ref_addr, use "%$label" where label is > + the tcl_label of the DWARFDIE at offset "value". > + Else, use value directly. > + If bytes, use value.decode("ascii") > + If str, use value directly. > + Any other type results in a NotImplementedError being raised. > + > + Regarding DW_FORM_exprloc and DW_FORM_block: > + The form is replaced with SPECIAL_expr. > + The entries in the value are interpreted and decoded using the > + dwarf_operations dictionary, and replaced with their names where > + applicable. > + """ > + s = lbrace > + s += self.name + " " > + s += self._format_value(offset_die_lookup) > + > + # Only explicitly state form if it's not a reference. > + if self.form not in [None, "DW_FORM_ref4", "DW_FORM_ref_addr"]: > + s += " " + self.form > + > + s += rbrace > + return indent(s, indent_count) > + > + > +class DWARFDIE: > + """This script's parsed version of a RawDIE.""" > + def __init__( > + self, > + offset: int, > + tag: str, > + attrs: dict[str, DWARFAttribute], > + tcl_label: Optional[str] = None > + ): > + self.offset = offset > + self.tag = tag > + self.attrs = copy(attrs) > + self.children = [] > + self.tcl_label = tcl_label I think it would be worth a comment here or in format lines, regarding what tcl_label is actually about. it might not be self-evident what these are, especially for people unfamiliar with the assembler. > + > + def format_lines( > + self, > + offset_die_lookup: dict[int, "DWARFDIE"], > + indent_count: int = 0 > + ) -> list[str]: > + """Get the list of lines that represent this DIE in Dwarf assembler.""" > + die_lines = [] > + > + # Prepend label to first line, if it's set. > + if self.tcl_label: > + first_line_start = self.tcl_label + ": " > + else: > + first_line_start = "" > + > + # First line, including label. > + first_line = indent( > + first_line_start + self.tag + " " + lbrace, > + indent_count > + ) > + die_lines.append(first_line) > + > + # Format attributes, if any. > + if self.attrs: > + for attr_name, attr in self.attrs.items(): > + attr_line = attr.format( > + offset_die_lookup, > + indent_count=indent_count+1 > + ) > + die_lines.append(attr_line) > + die_lines.append(indent(rbrace, indent_count)) > + else: > + # Don't create a new line, just append and immediately close the > + # brace on the last line. > + die_lines[-1] += rbrace > + > + # Format children, if any. > + if self.children: > + # Only open a new brace if there are any children for the > + # current DIE. > + die_lines[-1] += " " + lbrace > + for child in self.children: > + child_lines = child.format_lines( > + offset_die_lookup, > + indent_count=indent_count+1 > + ) > + die_lines.extend(child_lines) > + die_lines.append(indent(rbrace, indent_count)) > + > + return die_lines > + > + def format( > + self, > + offset_die_lookup: dict[int, "DWARFDIE"], > + indent_count: int = 0 > + ) -> str: > + """Join result from format_lines into a single str.""" > + return "\n".join(self.format_lines(offset_die_lookup, indent_count)) > + > + def name(self) -> Optional[str]: > + """Get DW_AT_name (if present) decoded as ASCII.""" > + raw_value = self.attrs.get("DW_AT_name") > + if raw_value is None: > + return None > + else: > + return raw_value.value.decode("ascii") > + > + def type_name(self) -> str: > + """Name of Dwarf tag, with the "DW_TAG_" prefix removed.""" > + return re.sub("DW_TAG_", "", self.tag) > + > +class DWARFCompileUnit(DWARFDIE): > + """Wrapper subclass for CU DIEs. > + > + This is necessary due to the special format CUs take in Dwarf::assemble. > + > + Instead of simply: > + DW_TAG_compile_unit { > + > + } { > + > + } > + > + CUs are formatted as: > + cu { } { > + DW_TAG_compile_unit { > + > + } { > + > + } > + } > + """ > + > + # Default value for parameter dwarf_version defined in dwarf.exp line 1552. > + default_dwarf_version = 4 > + > + # Default value for parameter is_fission defined in dwarf.exp line 1556. > + # Currently not implemented, see comment below. > + # default_is_fission = False > + > + # Tag that signifies a DIE is a compile unit. > + compile_unit_tag = "DW_TAG_compile_unit" > + > + def __init__( > + self, > + raw_die: RawDIE, > + raw_cu: RawCompileUnit, > + attrs: dict[str, DWARFAttribute], > + ): > + """Initialize additional instance variables for CU encoding. > + > + The additional instance variables are: > + - is_64_bit: Optional[bool] > + default None > + Whether this CU is 64 bit or not. > + - dwarf_version: int > + default DWARFCompileUnit.default_dwarf_version > + Version of DWARF this CU is using. > + - addr_size: Optional[int] > + default None > + Size of an address in bytes. > + > + These variables are used to configure the first parameter of the cu > + proc (which contains calls to the compile_unit proc in the body of > + Dwarf::assemble). > + """ > + super().__init__( > + raw_die.offset, > + DWARFCompileUnit.compile_unit_tag, > + attrs > + ) > + self.raw_cu = raw_cu > + self.is_64_bit: Optional[bool] = None > + self.dwarf_version: int = raw_cu.header.get( > + "version", > + DWARFCompileUnit.default_dwarf_version > + ) > + self.addr_size: Optional[int] = raw_cu.header["address_size"] > + > + # Fission is not currently implemented because I don't know where to > + # fetch this information from. > + # self.is_fission: bool = self.default_is_fission > + > + # CU labels are not currently implemented because I haven't found where > + # pyelftools exposes this information. > + # self.cu_label: Optional[str] = None > + > + def format_lines( > + self, > + offset_die_lookup: dict[int, DWARFDIE], > + indent_count: int = 0, > + ) -> list[str]: > + lines = [] > + lines.append(self._get_header(indent_count)) > + inner_lines = super().format_lines( > + offset_die_lookup, > + indent_count + 1 > + ) > + lines += inner_lines > + lines.append(indent(rbrace, indent_count)) > + return lines > + > + def _get_header(self, indent_count: int = 0) -> str: > + """Assemble the first line of the surrounding 'cu {} {}' proc call.""" > + header = indent("cu " + lbrace, indent_count) > + cu_params = [] > + > + if self.is_64_bit is not None: > + # Convert from True/False to 1/0. > + param_value = int(self.is_64_bit) > + cu_params += ["is_64", str(param_value)] > + > + if self.dwarf_version != DWARFCompileUnit.default_dwarf_version: > + cu_params += ["version", str(self.dwarf_version)] > + > + if self.addr_size is not None: > + cu_params += ["addr_size", str(self.addr_size)] > + > + # Fission is not currently implemented, see comment above. > + # if self.is_fission != DWARFCompileUnit.default_is_fission: > + # # Same as is_64_bit conversion, True/False -> 1/0. > + # param_value = int(self.is_fission) > + # cu_params += ["fission", str(param_value)] > + > + # CU labels are not currently implemented, see commend above. > + # if self.cu_label is not None: > + # cu_params += ["label", self.cu_label] > + > + if cu_params: > + header += " ".join(cu_params) > + > + header += rbrace + " " + lbrace > + return header > + > +class DWARFParser: > + """Converter from pyelftools's DWARF representation to this script's.""" > + > + def __init__(self, elf_file: IOBase): > + """Init parser with file opened in binary mode. > + > + File can be closed after this function is called. > + """ > + self.raw_data = BytesIO(elf_file.read()) > + self.elf_data = ELFFile(self.raw_data) > + self.dwarf_info = self.elf_data.get_dwarf_info() > + self.offset_to_die: dict[int, DWARFDIE] = {} > + self.label_to_die: dict[str, DWARFDIE] = {} > + self.referenced_offsets: set[int] = set() > + self.raw_cu_list: list[RawCompileUnit] = [] > + self.top_level_dies: list[DWARFDIE] = [] > + self.subprograms: list[DWARFDIE] = [] > + self.taken_labels: set[str] = set() > + > + self._read_cus() > + self._create_necessary_labels() > + > + def _read_cus(self): I think I'd personally rename this to _read_all_cus or something like that. It is easy to miss the S at the end and typo the function name or get confused. > + """Populate self.raw_cu_list with all CUs in self.dwarf_info.""" > + for cu in self.dwarf_info.iter_CUs(): > + self._read_cu(cu) > + > + def _read_cu(self, raw_cu: RawCompileUnit): > + """Read a compile_unit into self.cu_list.""" > + self.raw_cu_list.append(raw_cu) > + for raw_die in raw_cu.iter_DIEs(): > + if not raw_die.is_null(): > + self._parse_die(raw_die) > + > + def _parse_die(self, raw_die: RawDIE) -> DWARFDIE: > + """Process a single DIE and add it to offset_to_die. > + > + Look for DW_FORM_ref4 and DWD_FORM_ref_addr form attributes and replace > + them with the global offset of the referenced DIE, and marking the > + referenced DIE as "referenced". This will be used later to assign and > + use labels. > + > + In case the DIE is a top-level DIE, add it to self.top_level_dies. > + > + In case the DIE is a subprogram, add it to self.subprograms and call > + self._use_vars_for_low_and_high_pc_attr with it. > + """ > + processed_attrs = {} > + attr_value: AttributeValue > + for attr_name, attr_value in raw_die.attributes.items(): > + actual_value = attr_value.value > + if attr_value.form in ("DW_FORM_ref4", "DW_FORM_ref_addr"): > + referenced_die = raw_die.get_DIE_from_attribute(attr_name) > + actual_value = referenced_die.offset > + self.referenced_offsets.add(referenced_die.offset) > + > + processed_attrs[attr_name] = DWARFAttribute( > + raw_die.offset, > + attr_name, > + actual_value, According to the comment above, this should not send actual_value, but instead the string "referenced". Is this an oversight here or (more likely) in the comment? > + attr_value.form > + ) > + > + if raw_die.tag == DWARFCompileUnit.compile_unit_tag: > + # FIXME: This isn't pretty, as it relies on global parser state. > + # It also only works under the assumption CUs can't be nested. > + die_cu = self.raw_cu_list[-1] We try and not get code into GDB that has a FIXME, so this has to be resolved before going in. Would it make sense to call _parse_die with the CU that the DIE comes from? > + processed_die = DWARFCompileUnit(raw_die, die_cu, processed_attrs) > + else: > + processed_die = DWARFDIE( > + raw_die.offset, > + raw_die.tag, > + processed_attrs, > + None > + ) > + > + if raw_die.get_parent() is None: > + # Top level DIE > + self.top_level_dies.append(processed_die) > + else: > + # Setting the parent here assumes the parent was already processed > + # prior to this DIE being found. > + # As far as I'm aware, this is always true in DWARF. > + processed_parent = self.offset_to_die[raw_die.get_parent().offset] > + processed_parent.children.append(processed_die) > + > + if processed_die.tag == "DW_TAG_subprogram": > + self.subprograms.append(processed_die) > + self._use_vars_for_low_and_high_pc_attr(processed_die) > + > + self.offset_to_die[processed_die.offset] = processed_die > + return processed_die > + > + def _create_necessary_labels(self): > + """Create labels to DIEs that were referenced by others.""" > + for offset in self.referenced_offsets: > + die = self.offset_to_die[offset] > + self._create_label_for_die(die) > + > + def _use_vars_for_low_and_high_pc_attr(self, subprogram: DWARFDIE) -> None: > + """Replace existing PC attributes with Tcl variables. > + > + If DW_AT_low_pc exists for this DIE, replace it with accessing the > + variable whose name is given by self.subprogram_start_var(subprogram). > + > + If DW_AT_high_pc exists for this DIE, replace it with accessing the > + variable whose name is given by self.subprogram_end_var(subprogram). > + """ > + low_pc_attr_name = "DW_AT_low_pc" > + if low_pc_attr_name in subprogram.attrs: > + start = self.subprogram_start_var(subprogram) > + subprogram.attrs[low_pc_attr_name].value = start > + > + high_pc_attr_name = "DW_AT_high_pc" > + if high_pc_attr_name in subprogram.attrs: > + end = self.subprogram_end_var(subprogram) > + subprogram.attrs[high_pc_attr_name].value = end > + > + def _create_label_for_die(self, die: DWARFDIE) -> None: > + """Set tcl_label to a unique string among other DIEs for this parser. > + > + As a first attempt, use labelify(die.name()). If the DIE does not have > + a name, use labelify(die.type_name()). > + > + If the chosen initial label is already taken, try again appending "_2". > + While the attempt is still taken, try again replacing it with "_3", then > + "_4", and so on. > + > + This function also creates an entry on self.label_to_die. > + """ > + if die.tcl_label is not None: > + return > + > + label = labelify_str(die.name() or die.type_name()) > + > + # Deduplicate label in case of collision > + if label in self.taken_labels: > + suffix_nr = 2 > + > + # Walrus operator to prevent writing the assembled label_suffix > + # string literal twice. This could be rewritten by copying the > + # string literal to the line after the end of the while loop, > + # but I deemed it would be too frail in case one of them needs > + # to be changed and the other is forgotten. > + while (new_label := f"{label}_{suffix_nr}") in self.taken_labels: > + suffix_nr += 1 > + label = new_label > + > + die.tcl_label = label > + self.label_to_die[label] = die > + self.taken_labels.add(label) > + > + def subprogram_start_var(self, subprogram: DWARFDIE) -> str: > + """Name of the Tcl variable that holds the low PC for a subprogram.""" > + return f"${subprogram.name()}_start" > + > + def subprogram_end_var(self, subprogram: DWARFDIE) -> str: > + """Name of the Tcl variable that holds the high PC for a subprogram.""" > + return f"${subprogram.name()}_end" > + > + def all_labels(self) -> set[str]: > + """Get a copy of the set of all labels known to the parser so far.""" > + return copy(self.taken_labels) > + > + > +class DWARFAssemblerGenerator: > + """Class that generates Dwarf::assemble code out of a DWARFParser.""" > + > + def __init__(self, dwarf_parser: DWARFParser, output=sys.stdout): > + self.dwarf_parser = dwarf_parser > + self.output = output > + > + def emit(self, line: str, indent_count: int) -> None: > + """Print a single line indented indent_count times to self.output. > + > + If line is empty, it will always print an empty line, even with nonzero > + indent_count. > + """ > + if line: > + line = get_indent_str(indent_count) + line > + print(line, file=self.output) > + > + def generate_die(self, die: DWARFDIE, indent_count: int): > + """Generate the lines that represent a DIE.""" > + die_lines = die.format(self.dwarf_parser.offset_to_die, indent_count) > + self.emit(die_lines, 0) > + > + def generate(self): > + indent_count = 0 > + > + year = datetime.now().year > + copyright_notice = copyright_notice_template.format(current_year=year) > + for copyright_line in copyright_notice.split("\n"): > + self.emit(copyright_line, indent_count) > + > + self.emit("load_lib dwarf.exp", indent_count) > + self.emit("require dwarf2_support", indent_count) > + self.emit("set asm_file [standard_output_file $srcfile2]", indent_count) > + > + self.emit("", 0) > + self.emit("# TODO: call prepare_for_testing", indent_count) > + > + self.emit("", 0) > + self.emit("Dwarf::assemble $asm_file " + lbrace, indent_count) Honestly, you can just have "Dwarf::assemble $srcfile2" I think. no need to set an asm_file > + > + # Begin Dwarf::assemble body. > + indent_count += 1 > + self.emit("global srcdir subdir srcfile", indent_count) > + > + all_labels = self.dwarf_parser.all_labels() > + self.emit("declare_labels " + " ".join(all_labels), indent_count) > + > + self.emit("", 0) > + for subprogram in self.dwarf_parser.subprograms: > + self.emit(f"get_func_info {subprogram.name()}", indent_count) > + > + for die in self.dwarf_parser.top_level_dies: > + self.generate_die(die, indent_count) > + > + # TODO: line table, if it's within scope (it probably isn't). > + > + # End Dwarf::assemble body. > + indent_count -= 1 > + self.emit(rbrace, indent_count) > + > +def main(argv): > + try: > + filename = argv[1] > + except IndexError: > + print("Usage:") > + print("python ./asm_to_dwarf_assembler.py ") > + sys.exit(1) > + > + try: > + with open(filename, "rb") as elf_file: > + parser = DWARFParser(elf_file) > + except Exception as e: > + print("Error parsing ELF file. Does it contain DWARF information?") > + print(str(e)) > + sys.exit(2) > + generator = DWARFAssemblerGenerator(parser) > + generator.generate() > + > +if __name__ == "__main__": > + main(sys.argv) -- Cheers, Guinevere Larsen She/Her/Hers