Mirror of the gdb-patches mailing list
 help / color / mirror / Atom feed
From: William Ferreira <wqferr@gmail.com>
To: Guinevere Larsen <guinevere@redhat.com>
Cc: gdb-patches@sourceware.org
Subject: Re: [PATCH] [gdb] Create script to convert old tests into Dwarf::assemble calls.
Date: Tue, 17 Dec 2024 18:36:26 -0300	[thread overview]
Message-ID: <CACKLx+C0JF7QQx=oUCFqsJLwGqGCK2h=NVddhDyk5JuTziLd1Q@mail.gmail.com> (raw)
In-Reply-To: <c3f6ec7a-2f38-48ff-bc33-295aa8ca1d54@redhat.com>

I've addressed all of your concerns and I will send the next version of
the patch soon. One comment of yours in particular made me rethink a
comment section:

> According to the comment above, this should not send actual_value, but
> instead the string "referenced". Is this an oversight here or (more
> likely) in the comment?

The original meaning I thought the comment was passing along was that it
was keeping track of which DIEs were referenced, not that a string
literal "referenced" had a special meaning. I've since added more
comments in that region about how this is done - an instance variable -
and annotated said instance variable. Annotations are comments that are
visible in some IDEs' hover feature.

On Tue, Dec 17, 2024 at 1:43 PM Guinevere Larsen <guinevere@redhat.com> wrote:
>
> Thanks for working on this amazing script! This is really fancy,  and I
> really like what you got going
>
> I have some specific feedback on some lines, but in general I really
> like where the script is going!
> On 12/5/24 10:55 AM, William Ferreira wrote:
> > PR testsuite/32261 requests a script that could convert old .S-based
> > tests (that were made before dwarf.exp existed) into the new
> > Dwarf::assemble calls in Tcl. This commit is an initial implementation
> > of such a script. Python was chosen for convenience, and only relies on
> > a single external library.
> >
> > Usage: the script operates not on .S files, but on ELF files with DWARF
> > information. To convert an old test, one must run said test via
> > `make check-gdb TESTS=testname` in their build directory. This will
> > produce, as a side effect, an ELF file the test used as an inferior, at
> > gdb/testsuite/outputs/testname/testname. This ELF file is this script's
> > input.
> >
> > Reliability: not counting the limitations listed below, the script seems
> > functional enough to be worthy of discussion in the mailing list. I have
> > been testing it with different tests that already use Dwarf::assemble,
> > to see if the script can produce a similar call to it. Indeed, in the
> > cases that I've tested (including some more complex ones, marked with an
> > asterisk below) it is able to produce comparable output to the original
> > exp file. Of course, it can't reproduce the complex code *before* the
> > Dwarf::assemble call. Values calculated then are simply inlined.
> >
> > The following .exp files have been tried in this way and their outputs
> > highly resemble the original:
> > - gdb.dwarf2/dynarr-ptr
> > - gdb.dwarf2/void-type
> > - gdb.dwarf2/ada-thick-pointer
> > - gdb.dwarf2/atomic-type
> > - gdb.dwarf2/dw2-entry-points (*)
> >
> > The following .exp files DO NOT WORK with this script:
> > - gdb.dwarf2/cu-no-addrs
> >    - aranges not supported.
> >    - Compile unit hi_pc and low_pc hardcoded, prone to user error
> >      due to forgetting to replace with variables.
> >
> > The following .exp files work, with caveats addressed in the limitations
> > section below.
> > - gdb.dwarf2/cpp-linkage-name
> >    - Works correctly except for one attribute of the form SPECIAL_expr.
> > - gdb.dwarf2/dw2-unusual-field-names
> >    - Same as above, with two instances of SPECIAL_expr.
> > - gdb.dwarf2/implptr-optimized-out
> >    - Same as above, with two instances of SPECIAL_expr.
> >
> > Currently the script has the following known limitations:
> > - It does not support Tcl outside a small boilerplate and
> >    Dwarf::assemble calls.
> I don't think we need to support anything other than creating the
> dwarf::assemble stuff. Everything else is usual boilerplate from the
> testsuite, so we're pretty familiar with it (or at least have plenty of
> things to double-check) while dwarf assembler isn't as common and much
> more complex.
> > - It does not support line tables.
> > - It does not use $srcfile and other variables in the call to
> >    Dwarf::assemble (since it can't know where it is safe to substitute).
> > - It does not support "fission" type DWARFs (in fact I still have no
> >    clue what those are).
> > - It does not support cu {label LABEL} {} CUs, mostly because I couldn't
> >    find the information using pyelftools.
> > - It sometimes outputs empty CUs at the start and end of the call. This
> >    might be a problem with my machine, but I've checked with DWARF dumps
> >    and they are indeed in the input ELF files generated by
> >    `make check-gdb`.
> > - It does not support attributes with the forms DW_FORM_block* and
> >    DW_FORM_exprloc. This is mostly not a concern of the difficulty of
> >    the implementation, but of it being an incomplete feature and, thus,
> >    more susceptible to users forgetting to correct its mistakes or
> >    unfinished values (please see discussion started by Guinevere at
> >    comment 23 https://sourceware.org/bugzilla/show_bug.cgi?id=32261#c23).
> >    The incompleteness of this feature is easy to demonstrate: any call to
> >    gdb_target_symbol, a common tool used in these attributes, needs a
> >    symbol name that is erased after compilation. There is no way to guess
> >    where that address being referenced in a DW_OP_addr comes from, and it
> >    can't be hard coded since it can change depending on the machine
> >    compiling it.
> >
> > Please bring up any further shortcomings this script may have with your
> > expectations.
> >
> > Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=32261
> > ---
> >   gdb/testsuite/lib/asm_to_dwarf_assembler.py | 830 ++++++++++++++++++++
> >   1 file changed, 830 insertions(+)
> >   create mode 100644 gdb/testsuite/lib/asm_to_dwarf_assembler.py
> >
> > diff --git a/gdb/testsuite/lib/asm_to_dwarf_assembler.py b/gdb/testsuite/lib/asm_to_dwarf_assembler.py
> > new file mode 100644
> > index 00000000000..ea86c19c805
> > --- /dev/null
> > +++ b/gdb/testsuite/lib/asm_to_dwarf_assembler.py
> > @@ -0,0 +1,830 @@
> > +# Copyright 2024 Free Software Foundation, Inc.
> > +
> > +# This program is free software; you can redistribute it and/or modify
> > +# it under the terms of the GNU General Public License as published by
> > +# the Free Software Foundation; either version 3 of the License, or
> > +# (at your option) any later version.
> > +#
> > +# This program is distributed in the hope that it will be useful,
> > +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > +# GNU General Public License for more details.
> > +#
> > +# You should have received a copy of the GNU General Public License
> > +# along with this program.  If not, see <http://www.gnu.org/licenses/>.
> > +
> > +# Due to the pyelftools dependency, this script requires Python version
> > +# 3.10 or greater to run.
> > +
> > +"""A utility to convert ELF files with DWARF info to Dwarf::assemble code.
> > +
> > +Usage:
> > +    python ./asm_to_dwarf_assembler.py <path/to/elf/file>
> > +
> > +Dependencies:
> > +    Python >= 3.10
> > +    pyelftools >= 0.31
> > +
> > +Notes:
> > +- Line tables are not currently supported.
> > +- Non-contiguous subprograms are not currently supported.
> > +- If you want to use $srcfile or similar, you must edit the references to the
> > +  file name manually, including DW_AT_name attributes on compile units.
> > +- If run with binaries generated by make check-gdb, it may include an
> > +  additional compile_unit before and after the actual compile units. This is
> > +  an artifact of the normal compilation process, as these CUs are indeed in
> > +  the generated DWARF in some cases.
> > +"""
> > +
> > +import sys
> > +import argparse
> > +import elftools
> > +import re
> > +
> > +from io import BytesIO, IOBase
> > +from typing import Any, Optional, Iterable, List, Union, Annotated
> > +from functools import cache
> > +from dataclasses import dataclass
> > +from copy import copy
> > +from datetime import datetime
> > +from logging import getLogger
> > +
> > +from elftools import dwarf, elf
> > +from elftools.elf.elffile import ELFFile
> > +from elftools.dwarf.compileunit import CompileUnit as RawCompileUnit
> > +from elftools.dwarf.die import DIE as RawDIE, AttributeValue
> > +from elftools.common.exceptions import DWARFError
> > +
> > +
> > +logger = getLogger(__file__)
> > +
> > +
> > +EXPR_ATTRIBUTE_FORMS = [
> > +    "DW_FORM_exprloc",
> > +    "DW_FORM_block",
> > +    "DW_FORM_block1",
> > +    "DW_FORM_block2",
> > +    "DW_FORM_block4",
> > +]
> Since these aren't supported, does it make sense to keep them here?
> > +
> > +
> > +@dataclass
> > +class DWARFOperation:
> > +    """Values of the form DW_OP_<name>."""
> > +    suffix: Annotated[str, 'Part of the name that succeeds "DW_OP_".']
> > +    code: Annotated[int, "Numeric representation of the OP."]
> > +    num_operands: Annotated[int, "Number of operands this operator expects."]
> > +
> > +
> > +dwarf_operations: Annotated[
> > +    dict[int, DWARFOperation],
> > +    "Table of all DWARF operations as per table 7.9 of the DWARF5 spec."
> > +] = {}
> > +def _register_op(suffix: str, code: int, num_operands: int) -> None:
> > +    assert code not in dwarf_operations, "Duplicate operation code found."
> > +    dwarf_operations[code] = DWARFOperation(suffix, code, num_operands)
> > +
> > +def _register_all_ops() -> None:
> > +    # Op codes 0x01 and 0x02 are reserved.
> > +
> > +    _register_op("addr", 0x03, 1)
> > +
> > +    # Op codes 0x04 and 0x05 are reserved.
> > +
> > +    _register_op("deref", 0x06, 0)
> > +
> > +    # Op code 0x07 is reserved.
> > +
> > +    _register_op("const1u", 0x08, 1)
> > +    _register_op("const1s", 0x09, 1)
> > +    _register_op("const2u", 0x0a, 1)
> > +    _register_op("const2s", 0x0b, 1)
> > +    _register_op("const4u", 0x0c, 1)
> > +    _register_op("const4s", 0x0d, 1)
> > +    _register_op("const8u", 0x0e, 1)
> > +    _register_op("const8s", 0x0f, 1)
> > +    _register_op("constu", 0x10, 1)
> > +    _register_op("consts", 0x11, 1)
> > +
> > +    _register_op("dup", 0x12, 0)
> > +    _register_op("drop", 0x13, 0)
> > +    _register_op("over", 0x14, 0)
> > +    _register_op("pick", 0x15, 1)
> > +    _register_op("swap", 0x16, 0)
> > +    _register_op("rot", 0x17, 0)
> > +    _register_op("xderef", 0x18, 0)
> > +
> > +    _register_op("abs", 0x19, 0)
> > +    _register_op("and", 0x1a, 0)
> > +    _register_op("div", 0x1b, 0)
> > +    _register_op("minus", 0x1c, 0)
> > +    _register_op("mod", 0x1d, 0)
> > +    _register_op("mul", 0x1e, 0)
> > +    _register_op("neg", 0x1f, 0)
> > +    _register_op("not", 0x20, 0)
> > +    _register_op("or", 0x21, 0)
> > +    _register_op("plus", 0x22, 0)
> > +    _register_op("plus_uconst", 0x23, 1)
> > +    _register_op("shl", 0x24, 0)
> > +    _register_op("shr", 0x25, 0)
> > +    _register_op("shra", 0x26, 0)
> > +    _register_op("xor", 0x27, 0)
> > +    _register_op("bra", 0x28, 1)
> > +    _register_op("eq", 0x29, 0)
> > +    _register_op("ge", 0x2a, 0)
> > +    _register_op("gt", 0x2b, 0)
> > +    _register_op("le", 0x2c, 0)
> > +    _register_op("lt", 0x2d, 0)
> > +    _register_op("ne", 0x2e, 0)
> > +    _register_op("skip", 0x2f, 0)
> > +
> > +    for lit_nr in range(32):
> > +        _register_op(f"lit{lit_nr}", 0x30 + lit_nr, 0)
> > +
> > +    for reg_nr in range(32):
> > +        _register_op(f"reg{reg_nr}", 0x50 + reg_nr, 0)
> > +
> > +    for breg_nr in range(32):
> > +        _register_op(f"breg{breg_nr}", 0x70 + breg_nr, 0)
> > +
> > +    _register_op("regx", 0x90, 1)
> > +    _register_op("fbregx", 0x91, 1)
> > +    _register_op("bregx", 0x92, 2)
> > +    _register_op("piece", 0x93, 1)
> > +    _register_op("deref_size", 0x94, 1)
> > +    _register_op("xderef_size", 0x95, 1)
> > +    _register_op("nop", 0x96, 0)
> > +    _register_op("push_object_address", 0x97, 0)
> > +
> > +    _register_op("call2", 0x98, 1)
> > +    _register_op("call4", 0x99, 1)
> > +    _register_op("call_ref", 0x9a, 1)
> > +    _register_op("form_tls_address", 0x9b, 0)
> > +    _register_op("call_frame_cfa", 0x9c, 0)
> > +
> > +    _register_op("bit_piece", 0x9d, 2)
> > +    _register_op("implicit_value", 0x9e, 2)
> > +    _register_op("stack_value", 0x9f, 0)
> > +    _register_op("implicit_pointer", 0xa0, 2)
> > +    _register_op("addrx", 0xa1, 1)
> > +    _register_op("constx", 0xa2, 1)
> > +
> > +    _register_op("entry_value", 0xa3, 2)
> > +    _register_op("const_type", 0xa4, 3)
> > +    _register_op("regval_type", 0xa5, 2)
> > +    _register_op("deref_type", 0xa6, 2)
> > +    _register_op("xderef_type", 0xa7, 2)
> > +
> > +    _register_op("convert", 0xa8, 1)
> > +    _register_op("reinterpret", 0xa9, 1)
> > +
> > +
> > +_register_all_ops()
> Similar question to here, if I understood later code correctly. Do we
> want to keep this if we're not doing location expressions?
> > +
> > +
> > +copyright_notice_template = """
> > +# Copyright 2024-{current_year} Free Software Foundation, Inc.
> If the script is adding copyright, it will be for a new file, and so
> copyright will always start on the year when the script is run. So it
> can just be {current_year}
> > +#
> > +# This program is free software; you can redistribute it and/or modify
> > +# it under the terms of the GNU General Public License as published by
> > +# the Free Software Foundation; either version 3 of the License, or
> > +# (at your option) any later version.
> > +#
> > +# This program is distributed in the hope that it will be useful,
> > +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > +# GNU General Public License for more details.
> > +#
> > +# You should have received a copy of the GNU General Public License
> > +# along with this program.  If not, see <http://www.gnu.org/licenses/>.
> > +""".lstrip()  # Strip to remove initial \n, but not the last one.
> > +
> > +# Workaround for my editor not to freak out over unclosed braces.
> > +lbrace, rbrace = "{", "}"
> > +
> > +@cache
> > +def get_indent_str(indent_count: int) -> str:
> > +    """Get whitespace string to prepend to another for indenting."""
> > +    indent = (indent_count // 2) * "\t"
> > +    if indent_count % 2 == 1:
> > +        indent += "    "
> > +    return indent
> > +
> > +
> > +def indent(line: str, indent_count: int) -> str:
> > +    """Indent line by indent_count levels."""
> > +    return get_indent_str(indent_count) + line
> > +
> > +
> > +def labelify_str(s: str) -> str:
> > +    """Make s appropriate for a label name."""
> > +    # Replace "*" with the literal word "ptr".
> > +    s = s.replace("*", "ptr")
> > +
> > +    # Replace any non-"word" characters by "_".
> > +    s = re.sub(r"\W", "_", s)
> > +
> > +    # Remove consecutive "_"s.
> > +    s = re.sub(r"__+", "_", s)
> > +
> > +    return s
> > +
> > +
> > +class DWARFAttribute:
> > +    """Storage unit for a single DWARF attribute.
> > +
> > +    All its values are strings that are usually passed on
> > +    directly to format. The exceptions to this are attributes
> > +    with int values with DW_FORM_ref4 or DW_FORM_ref_addr form.
> > +    Their values are interpreted as the global offset of the DIE
> > +    being referenced, which are looked up dynamically to fetch
> > +    their labels.
> > +    """
> > +    def __init__(
> > +        self,
> > +        die_offset: int,
> > +        name: str,
> > +        value: str | bytes | int | bool,
> > +        form=None,
> > +    ):
> > +        self.die_offset = die_offset
> > +        self.name = name
> > +        self.value = value
> > +        self.form = form
> > +
> > +    def _format_expr_value(self) -> str:
> > +        # This code is left here in case this is feature is desired in the
> > +        # future. It is partly functional, being able to decode operations
> > +        # but not operands due to their variable length in bytes.
> > +
> > +        # self.form = "SPECIAL_expr"
> > +        # lines = [lbrace]
> > +        # curr_idx = 0
> > +        # while curr_idx < len(self.value):
> > +        #     curr_op_code = self.value[curr_idx]
> > +        #
> > +        #     try:
> > +        #         curr_op: DWARFOperation = dwarf_operations[curr_op_code]
> > +        #     except KeyError as err:
> > +        #         raise KeyError(
> > +        #             f"Unknown op: {curr_op_code}, item {curr_idx} of"
> > +        #             f" attribute {self.name}, DIE at offset"
> > +        #             f" {self.die_offset}."
> > +        #         ) from err
> > +        #     # Empty arg here so it inserts a space before the first argument,
> > +        #     # only if there is indeed a first argument.
> > +        #     args = [""]
> > +        #     for arg_nr in range(curr_op.num_operands):
> > +        #         curr_idx += 1
> > +        #         # TODO decode bytes according to specific operation.
> > +        #         # The following line only reads the next BYTE of the sequence,
> > +        #         # and needs to be adjusted to read the correct number of
> > +        #         # bytes.
> > +        #         # args.append(str(self.value[curr_idx]))
> > +        #     op_line = "DW_OP_" + curr_op.suffix + " ".join(args)
> > +        #     lines.append(op_line)
> > +        #     curr_idx += 1
> > +        #
> > +        # lines.append(rbrace)
> > +        # return "\n".join(lines)
> > +        self.form = "SPECIAL_expr"
> > +        return "{ TODO: Fill expr list }"
> Maybe, rather than "TODO" you could add "MANUAL" or something, to make
> it clear that this is a manual process and expected to be so? (naming is
> hard, but I think todo implies that this will be improved soon, and I
> don't think that'd be the case?)
> > +
> > +    def _needs_escaping(self, str_value: str) -> bool:
> > +        charset = set(str_value)
> > +        return bool(charset.intersection({"{", "}", " ", "\t"}))
> > +
> > +    def _format_str(self, str_value: str) -> str:
> > +        if self._needs_escaping(str_value):
> > +            escaped_str = str(str_value)
> > +            # Replace single escape (which is itself escaped because of regex)
> > +            # with a double escape (which doesn't mean anything to regex so
> > +            # it doesn't need escaping).
> > +            escaped_str = re.sub(r"\\", r"\\", escaped_str)
> > +            escaped_str = re.sub("([{}])", r"\\\1", escaped_str)
> > +            return "{" + escaped_str + "}"
> > +        else:
> > +            return str_value
> > +
> > +    def _format_value(
> > +        self,
> > +        offset_die_lookup: dict[int, "DWARFDIE"],
> > +        indent_count: int = 0
> > +    ) -> str:
> > +        if self.form in EXPR_ATTRIBUTE_FORMS:
> > +            return self._format_expr_value()
> > +        elif isinstance(self.value, bool):
> > +            return str(int(self.value))
> > +        elif isinstance(self.value, int):
> > +            if self.form == "DW_FORM_ref4":
> > +                # ref4-style referencing label.
> > +                die: "DWARFDIE" = offset_die_lookup[self.value]
> > +                return ":$" + die.tcl_label
> > +            elif self.form == "DW_FORM_ref_addr":
> > +                # ref_addr-style referencing label.
> > +                die: "DWARFDIE" = offset_die_lookup[self.value]
> > +                return "%$" + die.tcl_label
> > +            else:
> > +                return str(self.value)
> > +        elif isinstance(self.value, bytes):
> > +            return self._format_str(self.value.decode("ascii"))
> > +        elif isinstance(self.value, str):
> > +            return self._format_str(self.value)
> > +        else:
> > +            raise NotImplementedError(
> > +                "Unknown data type: " + str(type(self.value))
> > +            )
> > +
> > +    def format(
> > +        self,
> > +        offset_die_lookup: dict[int, "DWARFDIE"],
> > +        indent_count: int = 0
> > +    ) -> str:
> > +        """Format the attribute in the form {name value form}.
> > +
> > +        If form is DW_FORM_exprloc or DW_FORM_block, see next section on
> > +        DWARFOperations.
> > +
> > +        If it isn't, value is formatted as follows:
> > +            If bool, use "1" if True, "0" if False.
> > +            If int:
> > +                If form is DW_FORM_ref4, use ":$label" where label is the
> > +                    tcl_label of the DWARFDIE at offset "value".
> > +                If form is DW_FORM_ref_addr, use "%$label" where label is
> > +                    the tcl_label of the DWARFDIE at offset "value".
> > +                Else, use value directly.
> > +            If bytes, use value.decode("ascii")
> > +            If str, use value directly.
> > +            Any other type results in a NotImplementedError being raised.
> > +
> > +        Regarding DW_FORM_exprloc and DW_FORM_block:
> > +            The form is replaced with SPECIAL_expr.
> > +            The entries in the value are interpreted and decoded using the
> > +            dwarf_operations dictionary, and replaced with their names where
> > +            applicable.
> > +        """
> > +        s = lbrace
> > +        s += self.name + " "
> > +        s += self._format_value(offset_die_lookup)
> > +
> > +        # Only explicitly state form if it's not a reference.
> > +        if self.form not in [None, "DW_FORM_ref4", "DW_FORM_ref_addr"]:
> > +            s += " " + self.form
> > +
> > +        s += rbrace
> > +        return indent(s, indent_count)
> > +
> > +
> > +class DWARFDIE:
> > +    """This script's parsed version of a RawDIE."""
> > +    def __init__(
> > +        self,
> > +        offset: int,
> > +        tag: str,
> > +        attrs: dict[str, DWARFAttribute],
> > +        tcl_label: Optional[str] = None
> > +    ):
> > +        self.offset = offset
> > +        self.tag = tag
> > +        self.attrs = copy(attrs)
> > +        self.children = []
> > +        self.tcl_label = tcl_label
>
> I think it would be worth a comment here or in format lines, regarding
> what tcl_label is actually about.
>
> it might not be self-evident what these are, especially for people
> unfamiliar with the assembler.
>
> > +
> > +    def format_lines(
> > +        self,
> > +        offset_die_lookup: dict[int, "DWARFDIE"],
> > +        indent_count: int = 0
> > +    ) -> list[str]:
> > +        """Get the list of lines that represent this DIE in Dwarf assembler."""
> > +        die_lines = []
> > +
> > +        # Prepend label to first line, if it's set.
> > +        if self.tcl_label:
> > +            first_line_start = self.tcl_label + ": "
> > +        else:
> > +            first_line_start = ""
> > +
> > +        # First line, including label.
> > +        first_line = indent(
> > +            first_line_start + self.tag + " " + lbrace,
> > +            indent_count
> > +        )
> > +        die_lines.append(first_line)
> > +
> > +        # Format attributes, if any.
> > +        if self.attrs:
> > +            for attr_name, attr in self.attrs.items():
> > +                attr_line = attr.format(
> > +                    offset_die_lookup,
> > +                    indent_count=indent_count+1
> > +                )
> > +                die_lines.append(attr_line)
> > +            die_lines.append(indent(rbrace, indent_count))
> > +        else:
> > +            # Don't create a new line, just append and immediately close the
> > +            # brace on the last line.
> > +            die_lines[-1] += rbrace
> > +
> > +        # Format children, if any.
> > +        if self.children:
> > +            # Only open a new brace if there are any children for the
> > +            # current DIE.
> > +            die_lines[-1] += " " + lbrace
> > +            for child in self.children:
> > +                child_lines = child.format_lines(
> > +                    offset_die_lookup,
> > +                    indent_count=indent_count+1
> > +                )
> > +                die_lines.extend(child_lines)
> > +            die_lines.append(indent(rbrace, indent_count))
> > +
> > +        return die_lines
> > +
> > +    def format(
> > +        self,
> > +        offset_die_lookup: dict[int, "DWARFDIE"],
> > +        indent_count: int = 0
> > +    ) -> str:
> > +        """Join result from format_lines into a single str."""
> > +        return "\n".join(self.format_lines(offset_die_lookup, indent_count))
> > +
> > +    def name(self) -> Optional[str]:
> > +        """Get DW_AT_name (if present) decoded as ASCII."""
> > +        raw_value = self.attrs.get("DW_AT_name")
> > +        if raw_value is None:
> > +            return None
> > +        else:
> > +            return raw_value.value.decode("ascii")
> > +
> > +    def type_name(self) -> str:
> > +        """Name of Dwarf tag, with the "DW_TAG_" prefix removed."""
> > +        return re.sub("DW_TAG_", "", self.tag)
> > +
> > +class DWARFCompileUnit(DWARFDIE):
> > +    """Wrapper subclass for CU DIEs.
> > +
> > +    This is necessary due to the special format CUs take in Dwarf::assemble.
> > +
> > +    Instead of simply:
> > +        DW_TAG_compile_unit {
> > +            <attributes>
> > +        } {
> > +            <children>
> > +        }
> > +
> > +    CUs are formatted as:
> > +        cu { <cu_special_vars> } {
> > +            DW_TAG_compile_unit {
> > +                <attributes>
> > +            } {
> > +                <children>
> > +            }
> > +        }
> > +    """
> > +
> > +    # Default value for parameter dwarf_version defined in dwarf.exp line 1552.
> > +    default_dwarf_version = 4
> > +
> > +    # Default value for parameter is_fission defined in dwarf.exp line 1556.
> > +    # Currently not implemented, see comment below.
> > +    # default_is_fission = False
> > +
> > +    # Tag that signifies a DIE is a compile unit.
> > +    compile_unit_tag = "DW_TAG_compile_unit"
> > +
> > +    def __init__(
> > +        self,
> > +        raw_die: RawDIE,
> > +        raw_cu: RawCompileUnit,
> > +        attrs: dict[str, DWARFAttribute],
> > +    ):
> > +        """Initialize additional instance variables for CU encoding.
> > +
> > +        The additional instance variables are:
> > +            - is_64_bit: Optional[bool]
> > +              default None
> > +              Whether this CU is 64 bit or not.
> > +            - dwarf_version: int
> > +              default DWARFCompileUnit.default_dwarf_version
> > +              Version of DWARF this CU is using.
> > +            - addr_size: Optional[int]
> > +              default None
> > +              Size of an address in bytes.
> > +
> > +        These variables are used to configure the first parameter of the cu
> > +        proc (which contains calls to the compile_unit proc in the body of
> > +        Dwarf::assemble).
> > +        """
> > +        super().__init__(
> > +            raw_die.offset,
> > +            DWARFCompileUnit.compile_unit_tag,
> > +            attrs
> > +        )
> > +        self.raw_cu = raw_cu
> > +        self.is_64_bit: Optional[bool] = None
> > +        self.dwarf_version: int = raw_cu.header.get(
> > +            "version",
> > +            DWARFCompileUnit.default_dwarf_version
> > +        )
> > +        self.addr_size: Optional[int] = raw_cu.header["address_size"]
> > +
> > +        # Fission is not currently implemented because I don't know where to
> > +        # fetch this information from.
> > +        # self.is_fission: bool = self.default_is_fission
> > +
> > +        # CU labels are not currently implemented because I haven't found where
> > +        # pyelftools exposes this information.
> > +        # self.cu_label: Optional[str] = None
> > +
> > +    def format_lines(
> > +        self,
> > +        offset_die_lookup: dict[int, DWARFDIE],
> > +        indent_count: int = 0,
> > +    ) -> list[str]:
> > +        lines = []
> > +        lines.append(self._get_header(indent_count))
> > +        inner_lines = super().format_lines(
> > +            offset_die_lookup,
> > +            indent_count + 1
> > +        )
> > +        lines += inner_lines
> > +        lines.append(indent(rbrace, indent_count))
> > +        return lines
> > +
> > +    def _get_header(self, indent_count: int = 0) -> str:
> > +        """Assemble the first line of the surrounding 'cu {} {}' proc call."""
> > +        header = indent("cu " + lbrace, indent_count)
> > +        cu_params = []
> > +
> > +        if self.is_64_bit is not None:
> > +            # Convert from True/False to 1/0.
> > +            param_value = int(self.is_64_bit)
> > +            cu_params += ["is_64", str(param_value)]
> > +
> > +        if self.dwarf_version != DWARFCompileUnit.default_dwarf_version:
> > +            cu_params += ["version", str(self.dwarf_version)]
> > +
> > +        if self.addr_size is not None:
> > +            cu_params += ["addr_size", str(self.addr_size)]
> > +
> > +        # Fission is not currently implemented, see comment above.
> > +        # if self.is_fission != DWARFCompileUnit.default_is_fission:
> > +        #     # Same as is_64_bit conversion, True/False -> 1/0.
> > +        #     param_value = int(self.is_fission)
> > +        #     cu_params += ["fission", str(param_value)]
> > +
> > +        # CU labels are not currently implemented, see commend above.
> > +        # if self.cu_label is not None:
> > +        #     cu_params += ["label", self.cu_label]
> > +
> > +        if cu_params:
> > +            header += " ".join(cu_params)
> > +
> > +        header += rbrace + " " + lbrace
> > +        return header
> > +
> > +class DWARFParser:
> > +    """Converter from pyelftools's DWARF representation to this script's."""
> > +
> > +    def __init__(self, elf_file: IOBase):
> > +        """Init parser with file opened in binary mode.
> > +
> > +        File can be closed after this function is called.
> > +        """
> > +        self.raw_data = BytesIO(elf_file.read())
> > +        self.elf_data = ELFFile(self.raw_data)
> > +        self.dwarf_info = self.elf_data.get_dwarf_info()
> > +        self.offset_to_die: dict[int, DWARFDIE] = {}
> > +        self.label_to_die: dict[str, DWARFDIE] = {}
> > +        self.referenced_offsets: set[int] = set()
> > +        self.raw_cu_list: list[RawCompileUnit] = []
> > +        self.top_level_dies: list[DWARFDIE] = []
> > +        self.subprograms: list[DWARFDIE] = []
> > +        self.taken_labels: set[str] = set()
> > +
> > +        self._read_cus()
> > +        self._create_necessary_labels()
> > +
> > +    def _read_cus(self):
> I think I'd personally rename this to _read_all_cus or something like
> that. It is easy to miss the S at the end and typo the function name or
> get confused.
> > +        """Populate self.raw_cu_list with all CUs in self.dwarf_info."""
> > +        for cu in self.dwarf_info.iter_CUs():
> > +            self._read_cu(cu)
> > +
> > +    def _read_cu(self, raw_cu: RawCompileUnit):
> > +        """Read a compile_unit into self.cu_list."""
> > +        self.raw_cu_list.append(raw_cu)
> > +        for raw_die in raw_cu.iter_DIEs():
> > +            if not raw_die.is_null():
> > +                self._parse_die(raw_die)
> > +
> > +    def _parse_die(self, raw_die: RawDIE) -> DWARFDIE:
> > +        """Process a single DIE and add it to offset_to_die.
> > +
> > +        Look for DW_FORM_ref4 and DWD_FORM_ref_addr form attributes and replace
> > +        them with the global offset of the referenced DIE, and marking the
> > +        referenced DIE as "referenced". This will be used later to assign and
> > +        use labels.
> > +
> > +        In case the DIE is a top-level DIE, add it to self.top_level_dies.
> > +
> > +        In case the DIE is a subprogram, add it to self.subprograms and call
> > +        self._use_vars_for_low_and_high_pc_attr with it.
> > +        """
> > +        processed_attrs = {}
> > +        attr_value: AttributeValue
> > +        for attr_name, attr_value in raw_die.attributes.items():
> > +            actual_value = attr_value.value
> > +            if attr_value.form in ("DW_FORM_ref4", "DW_FORM_ref_addr"):
> > +                referenced_die = raw_die.get_DIE_from_attribute(attr_name)
> > +                actual_value = referenced_die.offset
> > +                self.referenced_offsets.add(referenced_die.offset)
> > +
> > +            processed_attrs[attr_name] = DWARFAttribute(
> > +                raw_die.offset,
> > +                attr_name,
> > +                actual_value,
> According to the comment above, this should not send actual_value, but
> instead the string "referenced". Is this an oversight here or (more
> likely) in the comment?
> > +                attr_value.form
> > +            )
> > +
> > +        if raw_die.tag == DWARFCompileUnit.compile_unit_tag:
> > +            # FIXME: This isn't pretty, as it relies on global parser state.
> > +            # It also only works under the assumption CUs can't be nested.
> > +            die_cu = self.raw_cu_list[-1]
>
> We try and not get code into GDB that has a FIXME, so this has to be
> resolved before going in.
>
> Would it make sense to call _parse_die with the CU that the DIE comes from?
>
> > +            processed_die = DWARFCompileUnit(raw_die, die_cu, processed_attrs)
> > +        else:
> > +            processed_die = DWARFDIE(
> > +                raw_die.offset,
> > +                raw_die.tag,
> > +                processed_attrs,
> > +                None
> > +            )
> > +
> > +        if raw_die.get_parent() is None:
> > +            # Top level DIE
> > +            self.top_level_dies.append(processed_die)
> > +        else:
> > +            # Setting the parent here assumes the parent was already processed
> > +            # prior to this DIE being found.
> > +            # As far as I'm aware, this is always true in DWARF.
> > +            processed_parent = self.offset_to_die[raw_die.get_parent().offset]
> > +            processed_parent.children.append(processed_die)
> > +
> > +        if processed_die.tag == "DW_TAG_subprogram":
> > +            self.subprograms.append(processed_die)
> > +            self._use_vars_for_low_and_high_pc_attr(processed_die)
> > +
> > +        self.offset_to_die[processed_die.offset] = processed_die
> > +        return processed_die
> > +
> > +    def _create_necessary_labels(self):
> > +        """Create labels to DIEs that were referenced by others."""
> > +        for offset in self.referenced_offsets:
> > +            die = self.offset_to_die[offset]
> > +            self._create_label_for_die(die)
> > +
> > +    def _use_vars_for_low_and_high_pc_attr(self, subprogram: DWARFDIE) -> None:
> > +        """Replace existing PC attributes with Tcl variables.
> > +
> > +        If DW_AT_low_pc exists for this DIE, replace it with accessing the
> > +        variable whose name is given by self.subprogram_start_var(subprogram).
> > +
> > +        If DW_AT_high_pc exists for this DIE, replace it with accessing the
> > +        variable whose name is given by self.subprogram_end_var(subprogram).
> > +        """
> > +        low_pc_attr_name = "DW_AT_low_pc"
> > +        if low_pc_attr_name in subprogram.attrs:
> > +            start = self.subprogram_start_var(subprogram)
> > +            subprogram.attrs[low_pc_attr_name].value = start
> > +
> > +        high_pc_attr_name = "DW_AT_high_pc"
> > +        if high_pc_attr_name in subprogram.attrs:
> > +            end = self.subprogram_end_var(subprogram)
> > +            subprogram.attrs[high_pc_attr_name].value = end
> > +
> > +    def _create_label_for_die(self, die: DWARFDIE) -> None:
> > +        """Set tcl_label to a unique string among other DIEs for this parser.
> > +
> > +        As a first attempt, use labelify(die.name()). If the DIE does not have
> > +        a name, use labelify(die.type_name()).
> > +
> > +        If the chosen initial label is already taken, try again appending "_2".
> > +        While the attempt is still taken, try again replacing it with "_3", then
> > +        "_4", and so on.
> > +
> > +        This function also creates an entry on self.label_to_die.
> > +        """
> > +        if die.tcl_label is not None:
> > +            return
> > +
> > +        label = labelify_str(die.name() or die.type_name())
> > +
> > +        # Deduplicate label in case of collision
> > +        if label in self.taken_labels:
> > +            suffix_nr = 2
> > +
> > +            # Walrus operator to prevent writing the assembled label_suffix
> > +            # string literal twice. This could be rewritten by copying the
> > +            # string literal to the line after the end of the while loop,
> > +            # but I deemed it would be too frail in case one of them needs
> > +            # to be changed and the other is forgotten.
> > +            while (new_label := f"{label}_{suffix_nr}") in self.taken_labels:
> > +                suffix_nr += 1
> > +            label = new_label
> > +
> > +        die.tcl_label = label
> > +        self.label_to_die[label] = die
> > +        self.taken_labels.add(label)
> > +
> > +    def subprogram_start_var(self, subprogram: DWARFDIE) -> str:
> > +        """Name of the Tcl variable that holds the low PC for a subprogram."""
> > +        return f"${subprogram.name()}_start"
> > +
> > +    def subprogram_end_var(self, subprogram: DWARFDIE) -> str:
> > +        """Name of the Tcl variable that holds the high PC for a subprogram."""
> > +        return f"${subprogram.name()}_end"
> > +
> > +    def all_labels(self) -> set[str]:
> > +        """Get a copy of the set of all labels known to the parser so far."""
> > +        return copy(self.taken_labels)
> > +
> > +
> > +class DWARFAssemblerGenerator:
> > +    """Class that generates Dwarf::assemble code out of a DWARFParser."""
> > +
> > +    def __init__(self, dwarf_parser: DWARFParser, output=sys.stdout):
> > +        self.dwarf_parser = dwarf_parser
> > +        self.output = output
> > +
> > +    def emit(self, line: str, indent_count: int) -> None:
> > +        """Print a single line indented indent_count times to self.output.
> > +
> > +        If line is empty, it will always print an empty line, even with nonzero
> > +        indent_count.
> > +        """
> > +        if line:
> > +            line = get_indent_str(indent_count) + line
> > +        print(line, file=self.output)
> > +
> > +    def generate_die(self, die: DWARFDIE, indent_count: int):
> > +        """Generate the lines that represent a DIE."""
> > +        die_lines = die.format(self.dwarf_parser.offset_to_die, indent_count)
> > +        self.emit(die_lines, 0)
> > +
> > +    def generate(self):
> > +        indent_count = 0
> > +
> > +        year = datetime.now().year
> > +        copyright_notice = copyright_notice_template.format(current_year=year)
> > +        for copyright_line in copyright_notice.split("\n"):
> > +            self.emit(copyright_line, indent_count)
> > +
> > +        self.emit("load_lib dwarf.exp", indent_count)
> > +        self.emit("require dwarf2_support", indent_count)
> > +        self.emit("set asm_file [standard_output_file $srcfile2]", indent_count)
> > +
> > +        self.emit("", 0)
> > +        self.emit("# TODO: call prepare_for_testing", indent_count)
> > +
> > +        self.emit("", 0)
> > +        self.emit("Dwarf::assemble $asm_file " + lbrace, indent_count)
> Honestly, you can just have "Dwarf::assemble $srcfile2" I think. no need
> to set an asm_file
> > +
> > +        # Begin Dwarf::assemble body.
> > +        indent_count += 1
> > +        self.emit("global srcdir subdir srcfile", indent_count)
> > +
> > +        all_labels = self.dwarf_parser.all_labels()
> > +        self.emit("declare_labels " + " ".join(all_labels), indent_count)
> > +
> > +        self.emit("", 0)
> > +        for subprogram in self.dwarf_parser.subprograms:
> > +            self.emit(f"get_func_info {subprogram.name()}", indent_count)
> > +
> > +        for die in self.dwarf_parser.top_level_dies:
> > +            self.generate_die(die, indent_count)
> > +
> > +        # TODO: line table, if it's within scope (it probably isn't).
> > +
> > +        # End Dwarf::assemble body.
> > +        indent_count -= 1
> > +        self.emit(rbrace, indent_count)
> > +
> > +def main(argv):
> > +    try:
> > +        filename = argv[1]
> > +    except IndexError:
> > +        print("Usage:")
> > +        print("python ./asm_to_dwarf_assembler.py <path/to/elf/file>")
> > +        sys.exit(1)
> > +
> > +    try:
> > +        with open(filename, "rb") as elf_file:
> > +            parser = DWARFParser(elf_file)
> > +    except Exception as e:
> > +        print("Error parsing ELF file. Does it contain DWARF information?")
> > +        print(str(e))
> > +        sys.exit(2)
> > +    generator = DWARFAssemblerGenerator(parser)
> > +    generator.generate()
> > +
> > +if __name__ == "__main__":
> > +    main(sys.argv)
>
>
> --
> Cheers,
> Guinevere Larsen
> She/Her/Hers
>

      reply	other threads:[~2024-12-17 21:36 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-12-05 13:55 William Ferreira
2024-12-05 17:25 ` Simon Marchi
2024-12-05 17:37   ` Luis Machado
2024-12-05 17:46     ` Simon Marchi
2024-12-05 17:48       ` Luis Machado
2024-12-09 20:30       ` Tom Tromey
2024-12-17 16:43 ` Guinevere Larsen
2024-12-17 21:36   ` William Ferreira [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CACKLx+C0JF7QQx=oUCFqsJLwGqGCK2h=NVddhDyk5JuTziLd1Q@mail.gmail.com' \
    --to=wqferr@gmail.com \
    --cc=gdb-patches@sourceware.org \
    --cc=guinevere@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox