From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gdb-patches-return-140871-listarch-gdb-patches=sources.redhat.com@sourceware.org>
Received: (qmail 80635 invoked by alias); 9 Aug 2017 15:48:49 -0000
Mailing-List: contact gdb-patches-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <gdb-patches.sourceware.org>
List-Subscribe: <mailto:gdb-patches-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/gdb-patches/>
List-Post: <mailto:gdb-patches@sourceware.org>
List-Help: <mailto:gdb-patches-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: gdb-patches-owner@sourceware.org
Received: (qmail 80166 invoked by uid 89); 9 Aug 2017 15:48:21 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-26.9 required=5.0 tests=BAYES_00,GIT_PATCH_0,GIT_PATCH_1,GIT_PATCH_2,GIT_PATCH_3,RP_MATCHES_RCVD,SPF_HELO_PASS autolearn=ham version=3.3.2 spammy=casual, prudent
X-HELO: mx1.redhat.com
Received: from mx1.redhat.com (HELO mx1.redhat.com) (209.132.183.28) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Wed, 09 Aug 2017 15:48:11 +0000
Received: from smtp.corp.redhat.com (int-mx04.intmail.prod.int.phx2.redhat.com [10.5.11.14])	(using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))	(No client certificate requested)	by mx1.redhat.com (Postfix) with ESMTPS id 8B597357E44	for <gdb-patches@sourceware.org>; Wed,  9 Aug 2017 15:48:10 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mx1.redhat.com 8B597357E44
Authentication-Results: ext-mx02.extmail.prod.ext.phx2.redhat.com; dmarc=none (p=none dis=none) header.from=redhat.com
Authentication-Results: ext-mx02.extmail.prod.ext.phx2.redhat.com; spf=fail smtp.mailfrom=keiths@redhat.com
Received: from valrhona.uglyboxes.com (ovpn04.gateway.prod.ext.phx2.redhat.com [10.5.9.4])	by smtp.corp.redhat.com (Postfix) with ESMTPS id 449E05D9CB;	Wed,  9 Aug 2017 15:48:10 +0000 (UTC)
Subject: Re: [PATCH 34/40] Make strcmp_iw NOT ignore whitespace in the middle of tokens
To: Pedro Alves <palves@redhat.com>, gdb-patches@sourceware.org
References: <1496406158-12663-1-git-send-email-palves@redhat.com> <1496406158-12663-35-git-send-email-palves@redhat.com>
From: Keith Seitz <keiths@redhat.com>
Message-ID: <50337ab0-512c-92e5-5b37-115f5dd8c427@redhat.com>
Date: Wed, 09 Aug 2017 15:48:00 -0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.2.1
MIME-Version: 1.0
In-Reply-To: <1496406158-12663-35-git-send-email-palves@redhat.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-IsSubscribed: yes
X-SW-Source: 2017-08/txt/msg00187.txt.bz2

On 06/02/2017 05:22 AM, Pedro Alves wrote:
> currently "b func tion" manages to set a breakpoint at "function" !
> 
> All this years I had never noticed this, but now that the linespec
> completer actually works, this easily happens by accident, with:

That makes two of us!

> The operator_stoken changes are necessary due to a latent bug --
> currently "operator char" becomes "operatorchar", and later look ups
> only find it because strcmp_iw ignores the whitespace...

I have a similar fix on the compile branch. :-)

> gdb/ChangeLog:
> yyyy-mm-dd  Pedro Alves  <palves@redhat.com>
> 
> 	* c-exp.y (oper): Add space to operator names.
> 	* cp-support.c (cp_symbol_name_matches_1)
> 	(cp_fq_symbol_name_matches): Pass language to
> 	strncmp_iw_with_mode.
> 	(test_cp_symbol_name_cmp): Add unit tests.
> 	* language.c (default_symbol_name_matcher): Pass language to
> 	strncmp_iw_with_mode.
> 	* utils.c: Include "cp-support.h" and <algorithm>.
> 	(valid_identifier_name_char, cp_skip_operator_token, skip_ws)
> 	(cp_is_operator): New functions.
> 	(strncmp_iw_with_mode): Use them.  Add language parameter.  Don't
> 	skip whitespace in the symbol name when the lookup name doesn't
> 	have spaces, and vice versa.
> 	(strncmp_iw, strcmp_iw): Pass language to strncmp_iw_with_mode.
                                      ^^^^^^^^

Not just language, but language_minimal.

> 	* utils.h (strncmp_iw_with_mode): Add language parameter.

> diff --git a/gdb/c-exp.y b/gdb/c-exp.y
> index 24a2fbd..0a182cc 100644
> --- a/gdb/c-exp.y
> +++ b/gdb/c-exp.y
> @@ -1487,7 +1487,7 @@ oper:	OPERATOR NEW
>  	|	OPERATOR '>'
>  			{ $$ = operator_stoken (">"); }
>  	|	OPERATOR ASSIGN_MODIFY
> -			{ const char *op = "unknown";
> +			{ const char *op = " unknown";

Good catch. I missed that.

>  			  switch ($2)
>  			    {
>  			    case BINOP_RSH:
> @@ -1563,7 +1563,8 @@ oper:	OPERATOR NEW
>  
>  			  c_print_type ($2, NULL, &buf, -1, 0,
>  					&type_print_raw_options);
> -			  $$ = operator_stoken (buf.c_str ());
> +			  std::string name = " " + buf.string ();
> +			  $$ = operator_stoken (name.c_str ());
>  			}
>  	;
>  

The only additional change that I have in my compile branch is
that since this type's name could come from a user, it needs to be canonicalized.
But I can hit that when(ever?!) I start submitting some of the precursor
patches that I have. [I have a test that demonstrates the need for
canonicalization in the c++compile branch.]

> diff --git a/gdb/cp-support.c b/gdb/cp-support.c
> index 84d8a6b..4c353c5 100644
> --- a/gdb/cp-support.c
> +++ b/gdb/cp-support.c
> @@ -1857,6 +1857,67 @@ test_cp_symbol_name_cmp ()
>    CHECK_MATCH_C ("function(int)", "function(int)");

[snip a WHOLE LOTTA TESTS]

AWESOME!

> diff --git a/gdb/utils.c b/gdb/utils.c
> index 9798edc..484c1ef 100644
> --- a/gdb/utils.c
> +++ b/gdb/utils.c
> @@ -65,6 +65,8 @@
>  #include "gdb_usleep.h"
>  #include "interps.h"
>  #include "gdb_regex.h"
> +#include "cp-support.h"
> +#include <algorithm>
>  
>  #if !HAVE_DECL_MALLOC
>  extern PTR malloc ();		/* ARI: PTR */
> @@ -2418,22 +2420,227 @@ fprintf_symbol_filtered (struct ui_file *stream, const char *name,
>      }
>  }
>  
> +/* True if CH is a character that can be part of a symbol name.  I.e.,
> +   either a number, a letter, or a '_'.  */
> +
> +static bool
> +valid_identifier_name_char (int ch)
> +{
> +  return (isalnum (ch) || ch == '_');
> +}

Couldn't this be language-dependent? [Yikes!]
Also note that there are a handful of places where this could be used
[follow-up patch?] in linespec.c, location.c, symtab.c. Maybe more.

We have logic for c++ operators all over the place. Note to self: this needs
to be cleaned up/consolidated.

> +
> +/* Skip to end of token, or to END, whatever comes first.  */
> +

I think a(n explicit) mention that the input is assumed to be an operator name. It's
mentioned in the name, but please consider repeating that in the comment. It's important.

> +static const char *
> +cp_skip_operator_token (const char *token, const char *end)
> +{
> +  const char *p = token;
> +  while (p != end && !isspace (*p) && *p != '(')
> +    {
> +      if (valid_identifier_name_char (*p))
> +	{
> +	  while (p != end && valid_identifier_name_char (*p))
> +	    p++;
> +	  return p;
> +	}
> +      else
> +	{
> +	  /* Note, ordered such that among ops that share a prefix,
> +	     longer comes first.  This is so that the loop below can
> +	     bail on first match.  */
> +	  static const char *ops[] =
> +	    {
> +	      "[",
> +	      "]",
> +	      "~",
> +	      ",",
> +	      "-=", "--", "->", "-",
> +	      "+=", "++", "+",
> +	      "*=", "*",
> +	      "/=", "/",
> +	      "%=", "%",
> +	      "|=", "||", "|",
> +	      "&=", "&&", "&",
> +	      "^=", "^",
> +	      "!=", "!",
> +	      "<<=", "<=", "<<", "<",
> +	      ">>=", ">=", ">>", ">",
> +	      "==", "=",
> +	    };
> +

Man, I'd *swear* that we have this code repeated in numerous places, but
we actually don't. [Not that I looked that hard...]

> +	  for (const char *op : ops)
> +	    {
> +	      size_t oplen = strlen (op);
> +	      size_t lencmp = std::min<size_t> (oplen, end - p);
> +
> +	      if (strncmp (p, op, lencmp) == 0)
> +		return p + lencmp;
> +	    }
> +	  /* Some unidentified character.  Return it.  */
> +	  return p + 1;
> +	}
> +    }
> +
> +  return p;
> +}
> +
> +/* Advance string1/string2 past whitespace.  */
> +
> +static void
> +skip_ws (const char *&string1, const char *&string2, const char *end_str2)
> +{
> +  while (isspace (*string1))
> +    string1++;
> +  while (string2 < end_str2 && isspace (*string2))
> +    string2++;
> +}
> +
> +static bool
> +cp_is_operator (const char *string, const char *start)

Missing comment?

> +{
> +  return ((string == start
> +	   || !valid_identifier_name_char (string[-1]))
> +	  && strncmp (string, CP_OPERATOR_STR, CP_OPERATOR_LEN) == 0
> +	  && !valid_identifier_name_char (string[CP_OPERATOR_LEN]));
> +}
> +
>  /* See utils.h.  */
>  
>  int
>  strncmp_iw_with_mode (const char *string1, const char *string2,
> -		      size_t string2_len, strncmp_iw_mode mode)
> +		      size_t string2_len, strncmp_iw_mode mode,
> +		      enum language language)
>  {
> +  const char *string1_start = string1;
>    const char *end_str2 = string2 + string2_len;
> +  bool skip_spaces = true;
> +  bool have_colon_op = (language == language_cplus
> +			|| language == language_rust
> +			|| language == language_fortran);
>  

Just a passing comment: I'm kinda torn on this. When new languages are added,
this is going to be yet another place that language implementers are going to
have to modify. While a language method would probably be better (for some
definition of "better"), I don't want to see the language vector bloat beyond
control either. So IMO there's no clear better path.

>    while (1)
>      {
> -      while (isspace (*string1))
> -	string1++;
> -      while (string2 < end_str2 && isspace (*string2))
> -	string2++;
> +      if (skip_spaces
> +	  || ((isspace (*string1) && !valid_identifier_name_char (*string2))
> +	      || (isspace (*string2) && !valid_identifier_name_char (*string1))))
> +	{
> +	  skip_ws (string1, string2, end_str2);
> +	  skip_spaces = false;
> +	}
> +
>        if (*string1 == '\0' || string2 == end_str2)
>  	break;
> +
> +      /* Handle the :: operator.  */
> +      if (have_colon_op && string1[0] == ':' && string1[1] == ':')

At least this part is language-agnostic. That's a plus.

> +	{
> +	  if (*string2 != ':')
> +	    return 1;
> +
> +	  string1++;
> +	  string2++;
> +
> +	  if (string2 == end_str2)
> +	    break;
> +
> +	  if (*string2 != ':')
> +	    return 1;
> +
> +	  string1++;
> +	  string2++;
> +
> +	  while (isspace (*string1))
> +	    string1++;
> +	  while (string2 < end_str2 && isspace (*string2))
> +	    string2++;
> +	  continue;
> +	}
> +
> +      /* Handle C++ user-defined operators.  */
> +      else if (language == language_cplus
> +	       && *string1 == 'o')
> +	{
> +	  if (cp_is_operator (string1, string1_start))
> +	    {
> +	      /* An operator name in STRING1.  Check STRING2.  */
> +	      size_t cmplen = std::min<size_t> (CP_OPERATOR_LEN, end_str2 - string2);

line length == 85

> +	      if (strncmp (string1, string2, cmplen) != 0)
> +		return 1;
[snip]
> @@ -2462,7 +2675,7 @@ int
>  strncmp_iw (const char *string1, const char *string2, size_t string2_len)
>  {
>    return strncmp_iw_with_mode (string1, string2, string2_len,
> -			       strncmp_iw_mode::NORMAL);
> +			       strncmp_iw_mode::NORMAL, language_minimal);
>  }
>  
>  /* See utils.h.  */
> @@ -2471,7 +2684,7 @@ int
>  strcmp_iw (const char *string1, const char *string2)
>  {
>    return strncmp_iw_with_mode (string1, string2, strlen (string2),
> -			       strncmp_iw_mode::MATCH_PARAMS);
> +			       strncmp_iw_mode::MATCH_PARAMS, language_minimal);
>  }

I think the comments for both of these functions should be updated, since
they pass language_minimal to strncmp_iw_with_mode. Therefore,

  strncmp_iw_with_mode (string1, string2, len, MATCH_PARAMS, a_language)

may not necessarily equal

  strncmp_iw (string1, string2)

That may not be obvious to the casual user. Some sort of caveat seems prudent.

>  /* This is like strcmp except that it ignores whitespace and treats
> diff --git a/gdb/utils.h b/gdb/utils.h
> index 9e531e0..4ce263e 100644
> --- a/gdb/utils.h
> +++ b/gdb/utils.h
> @@ -56,7 +56,8 @@ enum class strncmp_iw_mode
>  extern int strncmp_iw_with_mode (const char *string1,
>  				 const char *string2,
>  				 size_t string2_len,
> -				 strncmp_iw_mode mode);
> +				 strncmp_iw_mode mode,
> +				 enum language language);

While most of these parameters are rather obvious usage, it is not obvious to
me why a strncmp-like function needs a language definition. [Of course,
I understand why after reading the code, but a brief mention of how LANGUAGE
affects the operation might be useful IMO. YMMV.]

>  
>  /* Do a strncmp() type operation on STRING1 and STRING2, ignoring any
>     differences in whitespace.  STRING2_LEN is STRING2's length.
> 

Keith