Mirror of the gdb-patches mailing list
* [review] [RFC][gdb/contrib] Add words.sh script
@ 2019-10-24 16:39 Tom de Vries (Code Review)
  2019-10-25 17:51 ` Luis Machado (Code Review)
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: Tom de Vries (Code Review) @ 2019-10-24 16:39 UTC
  To: gdb-patches

Change URL: https://gnutoolchain-gerrit.osci.io/r/c/binutils-gdb/+/282
......................................................................

[RFC][gdb/contrib] Add words.sh script

Add a script that takes a list of files as arguments and outputs a list of
words from the C comments with their frequencies.

For:
...
$ ./gdb/contrib/words.sh $(find gdb -type f -name "*.c" -o -name "*.h")
...
it generates a list of ~15000 words prefixed with frequency.
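
Note that in the find invocation above, the implicit AND binds tighter than
-o, so -type f applies only to the "*.c" test, and a directory whose name ends
in ".h" would also be listed.  A minimal sketch with explicit grouping follows;
the list_c_sources name is hypothetical and not part of the patch:

```shell
# Hypothetical helper: list regular files ending in .c or .h under a directory.
list_c_sources() {
    # The escaped parentheses group the two -name tests so that
    # -type f applies to both of them.
    find "$1" -type f \( -name "*.c" -o -name "*.h" \)
}
```

With such a helper, the invocation would read
./gdb/contrib/words.sh $(list_c_sources gdb).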

This could be used to generate a dictionary that is kept as part of the
sources, against which new code can be checked, generating a warning or
error.  The hope is that misspellings would trigger this frequently, and rare
words rarely, otherwise the burden of updating the dictionary would be too
much.
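
As a rough sketch of the dictionary check described above: assume a
hypothetical dictionary file holding one lowercase word per line, sorted in the
C locale.  The unknown_words name and the file layout are illustrative
assumptions, not something this patch provides:

```shell
# Hypothetical helper: read words on stdin (one per line) and print those
# that are missing from the dictionary file given as $1.
# Assumes the dictionary is already sorted with LC_ALL=C.
unknown_words() {
    # comm -23 keeps the lines that appear only in its first input (stdin).
    LC_ALL=C sort -u | LC_ALL=C comm -23 - "$1"
}
```

The output of words.sh carries a frequency prefix, so the counts would have to
be stripped (for example with awk '{ print $2 }') before feeding it to such a
check.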

And for:
...
$ ./gdb/contrib/words.sh -f 1 $(find gdb -type f -name "*.c" -o -name "*.h")
...
it generates a list of ~5000 words with frequency 1.

This can be used to scan for misspellings manually.
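
The frequency filter can be illustrated in isolation.  The sketch below is not
the script's implementation; it only shows the sort | uniq -c counting idiom
that the script relies on, with words_with_freq as a hypothetical name:

```shell
# Hypothetical helper: read one word per line on stdin and print the words
# that occur exactly $1 times.
words_with_freq() {
    # uniq -c prefixes each distinct line with its count; awk then keeps
    # the lines whose count equals the requested frequency.
    LC_ALL=C sort | uniq -c | awk -v n="$1" '$1 == n { print $2 }'
}
```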

Change-Id: I7b119c9a4519cdbf62a3243d1df2927c80813e8b
---
A gdb/contrib/words.sh
1 file changed, 107 insertions(+), 0 deletions(-)



diff --git a/gdb/contrib/words.sh b/gdb/contrib/words.sh
new file mode 100755
index 0000000..ad6ec2b
--- /dev/null
+++ b/gdb/contrib/words.sh
@@ -0,0 +1,107 @@
+#!/bin/sh
+
+# Copyright (C) 2019 Free Software Foundation, Inc.
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+# This script intends to facilitate spell checking of comments in C sources.
+# It:
+# - extracts comments from C files
+# - transforms the comments into a list of lowercase words
+# - prefixes each word with the frequency
+# - filters out words within a frequency range
+# - sorts the words, longest first
+
+dir=$(cd $(dirname $0); pwd -P)
+
+minfreq=
+maxfreq=
+while [ $# -gt 0 ]; do
+    case "$1" in
+	--freq|-f)
+	    minfreq=$2
+	    maxfreq=$2
+	    shift 2
+	    ;;
+	--min)
+	    minfreq=$2
+	    if [ "$maxfreq" = "" ]; then
+		maxfreq=0
+	    fi
+	    shift 2
+	    ;;
+	--max)
+	    maxfreq=$2
+	    if [ "$minfreq" = "" ]; then
+		minfreq=0
+	    fi
+	    shift 2
+	    ;;
+	*)
+	    break;
+	    ;;
+    esac
+done
+
+if [ "$minfreq" = "" ] && [ "$maxfreq" = "" ]; then
+    minfreq=0
+    maxfreq=0
+fi
+
+awkfile=$(mktemp)
+
+cat > $awkfile <<EOF
+BEGIN {
+    in_comment=0
+}
+
+// {
+    line=\$0
+}
+
+/\/\*/ {
+    in_comment=1
+    sub(/.*\/\*/, "", line)
+}
+
+/\*\// {
+    sub(/\*\/.*/, "", line)
+    in_comment=0
+    print line
+    next
+}
+
+// {
+    if (in_comment) {
+	print line
+    }
+}
+EOF
+
+awk \
+    -f $awkfile \
+    "$@" \
+    | sed 's/[%^$~#{}`&=@,. \t\/_-()|<>\+\*]/\n/g' \
+    | sed 's/\[/\n/g' \
+    | sed 's/\]/\n/g' \
+    | sed 's/[0-9][0-9]*/\n/g' \
+    | tr '[:upper:]' '[:lower:]' \
+    | sed 's/[ \t]*//g' \
+    | sort \
+    | uniq -c \
+    | awk "{ if (($minfreq == 0 || $minfreq <= \$1) && ($maxfreq == 0 || \$1 <= $maxfreq)) { print \$0; } }" \
+    | awk '{ print length($0) " " $0; }' \
+    | sort -n -r \
+    | cut -d ' ' -f 2-
+
+rm -f $awkfile

-- 
Gerrit-Project: binutils-gdb
Gerrit-Branch: master
Gerrit-Change-Id: I7b119c9a4519cdbf62a3243d1df2927c80813e8b
Gerrit-Change-Number: 282
Gerrit-PatchSet: 1
Gerrit-Owner: Tom de Vries <tdevries@suse.de>
Gerrit-MessageType: newchange



* [review] [RFC][gdb/contrib] Add words.sh script
  2019-10-24 16:39 [review] [RFC][gdb/contrib] Add words.sh script Tom de Vries (Code Review)
@ 2019-10-25 17:51 ` Luis Machado (Code Review)
  2019-10-25 17:51 ` Luis Machado (Code Review)
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Luis Machado (Code Review) @ 2019-10-25 17:51 UTC
  To: Tom de Vries, gdb-patches

Luis Machado has posted comments on this change.

Change URL: https://gnutoolchain-gerrit.osci.io/r/c/binutils-gdb/+/282
......................................................................


Patch Set 1:

Thanks, this is pretty cool! I ran a few examples and it seems very useful.

I'd put your example of scanning all .[c|h] files into the script itself, so it is immediately obvious how to perform such an action.

Something that I tried out of the box and that gave me a weird result was "./words --help". It listed words from awk, I think, which is a bit odd. But since it is a basic script, it's not a big deal.


-- 
Gerrit-Project: binutils-gdb
Gerrit-Branch: master
Gerrit-Change-Id: I7b119c9a4519cdbf62a3243d1df2927c80813e8b
Gerrit-Change-Number: 282
Gerrit-PatchSet: 1
Gerrit-Owner: Tom de Vries <tdevries@suse.de>
Gerrit-CC: Luis Machado <luis.machado@linaro.org>
Gerrit-Comment-Date: Fri, 25 Oct 2019 17:51:01 +0000
Gerrit-HasComments: No
Gerrit-Has-Labels: No
Gerrit-MessageType: comment



* [review] [RFC][gdb/contrib] Add words.sh script
  2019-10-24 16:39 [review] [RFC][gdb/contrib] Add words.sh script Tom de Vries (Code Review)
  2019-10-25 17:51 ` Luis Machado (Code Review)
@ 2019-10-25 17:51 ` Luis Machado (Code Review)
  2019-11-05 15:40 ` Tom Tromey (Code Review)
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Luis Machado (Code Review) @ 2019-10-25 17:51 UTC
  To: Tom de Vries, gdb-patches

Luis Machado has posted comments on this change.

Change URL: https://gnutoolchain-gerrit.osci.io/r/c/binutils-gdb/+/282
......................................................................


Patch Set 1: Code-Review+1


-- 
Gerrit-Project: binutils-gdb
Gerrit-Branch: master
Gerrit-Change-Id: I7b119c9a4519cdbf62a3243d1df2927c80813e8b
Gerrit-Change-Number: 282
Gerrit-PatchSet: 1
Gerrit-Owner: Tom de Vries <tdevries@suse.de>
Gerrit-Reviewer: Luis Machado <luis.machado@linaro.org>
Gerrit-Comment-Date: Fri, 25 Oct 2019 17:51:35 +0000
Gerrit-HasComments: No
Gerrit-Has-Labels: Yes
Gerrit-MessageType: comment



* [review] [RFC][gdb/contrib] Add words.sh script
  2019-10-24 16:39 [review] [RFC][gdb/contrib] Add words.sh script Tom de Vries (Code Review)
  2019-10-25 17:51 ` Luis Machado (Code Review)
  2019-10-25 17:51 ` Luis Machado (Code Review)
@ 2019-11-05 15:40 ` Tom Tromey (Code Review)
  2019-11-05 16:21 ` Simon Marchi (Code Review)
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Tom Tromey (Code Review) @ 2019-11-05 15:40 UTC
  To: Tom de Vries, gdb-patches; +Cc: Luis Machado

Tom Tromey has posted comments on this change.

Change URL: https://gnutoolchain-gerrit.osci.io/r/c/binutils-gdb/+/282
......................................................................


Patch Set 1: Code-Review+2

Thanks, this seems fine to me!


-- 
Gerrit-Project: binutils-gdb
Gerrit-Branch: master
Gerrit-Change-Id: I7b119c9a4519cdbf62a3243d1df2927c80813e8b
Gerrit-Change-Number: 282
Gerrit-PatchSet: 1
Gerrit-Owner: Tom de Vries <tdevries@suse.de>
Gerrit-Reviewer: Luis Machado <luis.machado@linaro.org>
Gerrit-Reviewer: Tom Tromey <tromey@sourceware.org>
Gerrit-Comment-Date: Tue, 05 Nov 2019 15:40:11 +0000
Gerrit-HasComments: No
Gerrit-Has-Labels: Yes
Gerrit-MessageType: comment



* [review] [RFC][gdb/contrib] Add words.sh script
  2019-10-24 16:39 [review] [RFC][gdb/contrib] Add words.sh script Tom de Vries (Code Review)
                   ` (2 preceding siblings ...)
  2019-11-05 15:40 ` Tom Tromey (Code Review)
@ 2019-11-05 16:21 ` Simon Marchi (Code Review)
  2019-11-07  9:32 ` [review v2] [gdb/contrib] " Tom de Vries (Code Review)
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Simon Marchi (Code Review) @ 2019-11-05 16:21 UTC
  To: Tom de Vries, gdb-patches; +Cc: Tom Tromey, Luis Machado

Simon Marchi has posted comments on this change.

Change URL: https://gnutoolchain-gerrit.osci.io/r/c/binutils-gdb/+/282
......................................................................


Patch Set 1:

I'd suggest running shellcheck (https://www.shellcheck.net/) on it and fixing the few small warnings it gives.
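
For illustration, one class of warning that shellcheck reports on scripts like
this is an unquoted expansion, which matters when a path contains spaces.  A
small sketch, using a hypothetical script_dir function rather than code from
the patch:

```shell
# Hypothetical helper: print the directory part of a path.
script_dir() {
    # Written unquoted, as in $(dirname $1), the path would be split on
    # spaces; the double quotes keep it as a single word.
    dirname "$1"
}
```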


-- 
Gerrit-Project: binutils-gdb
Gerrit-Branch: master
Gerrit-Change-Id: I7b119c9a4519cdbf62a3243d1df2927c80813e8b
Gerrit-Change-Number: 282
Gerrit-PatchSet: 1
Gerrit-Owner: Tom de Vries <tdevries@suse.de>
Gerrit-Reviewer: Luis Machado <luis.machado@linaro.org>
Gerrit-Reviewer: Tom Tromey <tromey@sourceware.org>
Gerrit-CC: Simon Marchi <simon.marchi@polymtl.ca>
Gerrit-Comment-Date: Tue, 05 Nov 2019 16:21:14 +0000
Gerrit-HasComments: No
Gerrit-Has-Labels: No
Gerrit-MessageType: comment



* [review v2] [gdb/contrib] Add words.sh script
  2019-10-24 16:39 [review] [RFC][gdb/contrib] Add words.sh script Tom de Vries (Code Review)
                   ` (3 preceding siblings ...)
  2019-11-05 16:21 ` Simon Marchi (Code Review)
@ 2019-11-07  9:32 ` Tom de Vries (Code Review)
  2019-11-07  9:45 ` [review] " Tom de Vries (Code Review)
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Tom de Vries (Code Review) @ 2019-11-07  9:32 UTC
  To: Luis Machado, Tom Tromey, gdb-patches; +Cc: Simon Marchi

Change URL: https://gnutoolchain-gerrit.osci.io/r/c/binutils-gdb/+/282
......................................................................

[gdb/contrib] Add words.sh script

Add a script that takes a list of files as arguments and outputs a list of
words from the C comments with their frequencies.

For:
...
$ ./gdb/contrib/words.sh $(find gdb -type f -name "*.c" -o -name "*.h")
...
it generates a list of ~15000 words prefixed with frequency.

This could be used to generate a dictionary that is kept as part of the
sources, against which new code can be checked, generating a warning or
error.  The hope is that misspellings would trigger this frequently, and rare
words rarely, otherwise the burden of updating the dictionary would be too
much.

And for:
...
$ ./gdb/contrib/words.sh -f 1 $(find gdb -type f -name "*.c" -o -name "*.h")
...
it generates a list of ~5000 words with frequency 1.

This can be used to scan for misspellings manually.

Change-Id: I7b119c9a4519cdbf62a3243d1df2927c80813e8b
---
A gdb/contrib/words.sh
1 file changed, 126 insertions(+), 0 deletions(-)



diff --git a/gdb/contrib/words.sh b/gdb/contrib/words.sh
new file mode 100755
index 0000000..d0d94d5
--- /dev/null
+++ b/gdb/contrib/words.sh
@@ -0,0 +1,126 @@
+#!/bin/sh
+
+# Copyright (C) 2019 Free Software Foundation, Inc.
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+# This script intends to facilitate spell checking of comments in C sources.
+# It:
+# - extracts comments from C files
+# - transforms the comments into a list of lowercase words
+# - prefixes each word with the frequency
+# - filters out words within a frequency range
+# - sorts the words, longest first
+#
+# For:
+# ...
+# $ ./gdb/contrib/words.sh $(find gdb -type f -name "*.c" -o -name "*.h")
+# ...
+# it generates a list of ~15000 words prefixed with frequency.
+#
+# This could be used to generate a dictionary that is kept as part of the
+# sources, against which new code can be checked, generating a warning or
+# error.  The hope is that misspellings would trigger this frequently, and rare
+# words rarely, otherwise the burden of updating the dictionary would be too
+# much.
+#
+# And for:
+# ...
+# $ ./gdb/contrib/words.sh -f 1 $(find gdb -type f -name "*.c" -o -name "*.h")
+# ...
+# it generates a list of ~5000 words with frequency 1.
+#
+# This can be used to scan for misspellings manually.
+#
+
+minfreq=
+maxfreq=
+while [ $# -gt 0 ]; do
+    case "$1" in
+	--freq|-f)
+	    minfreq=$2
+	    maxfreq=$2
+	    shift 2
+	    ;;
+	--min)
+	    minfreq=$2
+	    if [ "$maxfreq" = "" ]; then
+		maxfreq=0
+	    fi
+	    shift 2
+	    ;;
+	--max)
+	    maxfreq=$2
+	    if [ "$minfreq" = "" ]; then
+		minfreq=0
+	    fi
+	    shift 2
+	    ;;
+	*)
+	    break;
+	    ;;
+    esac
+done
+
+if [ "$minfreq" = "" ] && [ "$maxfreq" = "" ]; then
+    minfreq=0
+    maxfreq=0
+fi
+
+awkfile=$(mktemp)
+trap 'rm -f "$awkfile"' EXIT
+
+cat > "$awkfile" <<EOF
+BEGIN {
+    in_comment=0
+}
+
+// {
+    line=\$0
+}
+
+/\/\*/ {
+    in_comment=1
+    sub(/.*\/\*/, "", line)
+}
+
+/\*\// {
+    sub(/\*\/.*/, "", line)
+    in_comment=0
+    print line
+    next
+}
+
+// {
+    if (in_comment) {
+	print line
+    }
+}
+EOF
+
+awk \
+    -f "$awkfile" \
+    -- "$@" \
+    | sed 's/[%^$~#{}`&=@,. \t\/_-()|<>\+\*]/\n/g' \
+    | sed 's/\[/\n/g' \
+    | sed 's/\]/\n/g' \
+    | sed 's/[0-9][0-9]*/\n/g' \
+    | tr '[:upper:]' '[:lower:]' \
+    | sed 's/[ \t]*//g' \
+    | sort \
+    | uniq -c \
+    | awk "{ if (($minfreq == 0 || $minfreq <= \$1) \
+                 && ($maxfreq == 0 || \$1 <= $maxfreq)) { print \$0; } }" \
+    | awk '{ print length($0) " " $0; }' \
+    | sort -n -r \
+    | cut -d ' ' -f 2-

-- 
Gerrit-Project: binutils-gdb
Gerrit-Branch: master
Gerrit-Change-Id: I7b119c9a4519cdbf62a3243d1df2927c80813e8b
Gerrit-Change-Number: 282
Gerrit-PatchSet: 2
Gerrit-Owner: Tom de Vries <tdevries@suse.de>
Gerrit-Reviewer: Luis Machado <luis.machado@linaro.org>
Gerrit-Reviewer: Tom Tromey <tromey@sourceware.org>
Gerrit-CC: Simon Marchi <simon.marchi@polymtl.ca>
Gerrit-MessageType: newpatchset



* [review] [gdb/contrib] Add words.sh script
  2019-10-24 16:39 [review] [RFC][gdb/contrib] Add words.sh script Tom de Vries (Code Review)
                   ` (4 preceding siblings ...)
  2019-11-07  9:32 ` [review v2] [gdb/contrib] " Tom de Vries (Code Review)
@ 2019-11-07  9:45 ` Tom de Vries (Code Review)
  2019-11-07  9:46 ` [review v2] " Tom de Vries (Code Review)
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Tom de Vries (Code Review) @ 2019-11-07  9:45 UTC
  To: gdb-patches; +Cc: Simon Marchi, Tom Tromey, Luis Machado

Tom de Vries has posted comments on this change.

Change URL: https://gnutoolchain-gerrit.osci.io/r/c/binutils-gdb/+/282
......................................................................


Patch Set 1:

> Patch Set 1:
> 
> Thanks, this is pretty cool! I ran a few examples and it seems very useful.
> 
> I'd put your example of scanning all .[c|h] files into the script itself, so it is immediately obvious how to perform such an action.
> 

Done.

> Something that I tried out of the box and that gave me a weird result was "./words --help". It listed words from awk, I think, which is a bit odd. But since it is a basic script, it's not a big deal.

Fixed by adding -- to the awk command line.

Now prints:
...
$ ./gdb/contrib/words.sh --help
awk: /tmp/tmp.XttLdZXpeI:2: fatal: cannot open file `--help' for reading (No such file or directory)
...
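
The effect of "--" can be reproduced in isolation.  In the sketch below,
print_lines is a hypothetical helper, not code from the patch; it passes its
operands to awk after "--", so that an operand beginning with a dash is treated
as a file name rather than as an awk option:

```shell
# Hypothetical helper: print every line of the given files using awk.
print_lines() {
    prog=$(mktemp) || return 1
    printf '%s\n' '{ print }' > "$prog"
    # "--" ends awk's own option parsing, so an operand such as "--help"
    # is taken as a file name instead of an awk option.
    awk -f "$prog" -- "$@"
    status=$?
    rm -f "$prog"
    return "$status"
}
```

If no file with that name exists, awk reports that it cannot open it, which
matches the error shown above.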


-- 
Gerrit-Project: binutils-gdb
Gerrit-Branch: master
Gerrit-Change-Id: I7b119c9a4519cdbf62a3243d1df2927c80813e8b
Gerrit-Change-Number: 282
Gerrit-PatchSet: 1
Gerrit-Owner: Tom de Vries <tdevries@suse.de>
Gerrit-Reviewer: Luis Machado <luis.machado@linaro.org>
Gerrit-Reviewer: Tom Tromey <tromey@sourceware.org>
Gerrit-Reviewer: Tom de Vries <tdevries@suse.de>
Gerrit-CC: Simon Marchi <simon.marchi@polymtl.ca>
Gerrit-Comment-Date: Thu, 07 Nov 2019 09:45:34 +0000
Gerrit-HasComments: No
Gerrit-Has-Labels: No
Gerrit-MessageType: comment



* [review v2] [gdb/contrib] Add words.sh script
  2019-10-24 16:39 [review] [RFC][gdb/contrib] Add words.sh script Tom de Vries (Code Review)
                   ` (5 preceding siblings ...)
  2019-11-07  9:45 ` [review] " Tom de Vries (Code Review)
@ 2019-11-07  9:46 ` Tom de Vries (Code Review)
  2019-11-07  9:51 ` [pushed] " Sourceware to Gerrit sync (Code Review)
  2019-11-07  9:51 ` Sourceware to Gerrit sync (Code Review)
  8 siblings, 0 replies; 10+ messages in thread
From: Tom de Vries (Code Review) @ 2019-11-07  9:46 UTC
  To: gdb-patches; +Cc: Simon Marchi, Tom Tromey, Luis Machado

Tom de Vries has posted comments on this change.

Change URL: https://gnutoolchain-gerrit.osci.io/r/c/binutils-gdb/+/282
......................................................................


Patch Set 2:

> Patch Set 1:
> 
> I'd suggest running shellcheck (https://www.shellcheck.net/) on it and fixing the few small warnings it gives.

Done.


-- 
Gerrit-Project: binutils-gdb
Gerrit-Branch: master
Gerrit-Change-Id: I7b119c9a4519cdbf62a3243d1df2927c80813e8b
Gerrit-Change-Number: 282
Gerrit-PatchSet: 2
Gerrit-Owner: Tom de Vries <tdevries@suse.de>
Gerrit-Reviewer: Luis Machado <luis.machado@linaro.org>
Gerrit-Reviewer: Tom Tromey <tromey@sourceware.org>
Gerrit-Reviewer: Tom de Vries <tdevries@suse.de>
Gerrit-CC: Simon Marchi <simon.marchi@polymtl.ca>
Gerrit-Comment-Date: Thu, 07 Nov 2019 09:46:02 +0000
Gerrit-HasComments: No
Gerrit-Has-Labels: No
Gerrit-MessageType: comment



* [pushed] [gdb/contrib] Add words.sh script
  2019-10-24 16:39 [review] [RFC][gdb/contrib] Add words.sh script Tom de Vries (Code Review)
                   ` (6 preceding siblings ...)
  2019-11-07  9:46 ` [review v2] " Tom de Vries (Code Review)
@ 2019-11-07  9:51 ` Sourceware to Gerrit sync (Code Review)
  2019-11-07  9:51 ` Sourceware to Gerrit sync (Code Review)
  8 siblings, 0 replies; 10+ messages in thread
From: Sourceware to Gerrit sync (Code Review) @ 2019-11-07  9:51 UTC
  To: Tom de Vries, gdb-patches; +Cc: Simon Marchi, Tom Tromey, Luis Machado

Sourceware to Gerrit sync has submitted this change.

Change URL: https://gnutoolchain-gerrit.osci.io/r/c/binutils-gdb/+/282
......................................................................

[gdb/contrib] Add words.sh script

Add a script that takes a list of files as arguments and outputs a list of
words from the C comments with their frequencies.

For:
...
$ ./gdb/contrib/words.sh $(find gdb -type f -name "*.c" -o -name "*.h")
...
it generates a list of ~15000 words prefixed with frequency.

This could be used to generate a dictionary that is kept as part of the
sources, against which new code can be checked, generating a warning or
error.  The hope is that misspellings would trigger this frequently, and rare
words rarely, otherwise the burden of updating the dictionary would be too
much.

And for:
...
$ ./gdb/contrib/words.sh -f 1 $(find gdb -type f -name "*.c" -o -name "*.h")
...
it generates a list of ~5000 words with frequency 1.

This can be used to scan for misspellings manually.

Change-Id: I7b119c9a4519cdbf62a3243d1df2927c80813e8b
---
A gdb/contrib/words.sh
1 file changed, 129 insertions(+), 0 deletions(-)


diff --git a/gdb/contrib/words.sh b/gdb/contrib/words.sh
new file mode 100755
index 0000000..ae38539
--- /dev/null
+++ b/gdb/contrib/words.sh
@@ -0,0 +1,129 @@
+#!/bin/sh
+
+# Copyright (C) 2019 Free Software Foundation, Inc.
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+# This script intends to facilitate spell checking of comments in C sources.
+# It:
+# - extracts comments from C files
+# - transforms the comments into a list of lowercase words
+# - prefixes each word with the frequency
+# - filters out words within a frequency range
+# - sorts the words, longest first
+#
+# For:
+# ...
+# $ ./gdb/contrib/words.sh $(find gdb -type f -name "*.c" -o -name "*.h")
+# ...
+# it generates a list of ~15000 words prefixed with frequency.
+#
+# This could be used to generate a dictionary that is kept as part of the
+# sources, against which new code can be checked, generating a warning or
+# error.  The hope is that misspellings would trigger this frequently, and rare
+# words rarely, otherwise the burden of updating the dictionary would be too
+# much.
+#
+# And for:
+# ...
+# $ ./gdb/contrib/words.sh -f 1 $(find gdb -type f -name "*.c" -o -name "*.h")
+# ...
+# it generates a list of ~5000 words with frequency 1.
+#
+# This can be used to scan for misspellings manually.
+#
+
+minfreq=
+maxfreq=
+while [ $# -gt 0 ]; do
+    case "$1" in
+	--freq|-f)
+	    minfreq=$2
+	    maxfreq=$2
+	    shift 2
+	    ;;
+	--min)
+	    minfreq=$2
+	    if [ "$maxfreq" = "" ]; then
+		maxfreq=0
+	    fi
+	    shift 2
+	    ;;
+	--max)
+	    maxfreq=$2
+	    if [ "$minfreq" = "" ]; then
+		minfreq=0
+	    fi
+	    shift 2
+	    ;;
+	*)
+	    break;
+	    ;;
+    esac
+done
+
+if [ "$minfreq" = "" ] && [ "$maxfreq" = "" ]; then
+    minfreq=0
+    maxfreq=0
+fi
+
+awkfile=$(mktemp)
+trap 'rm -f "$awkfile"' EXIT
+
+cat > "$awkfile" <<EOF
+BEGIN {
+    in_comment=0
+}
+
+// {
+    line=\$0
+}
+
+/\/\*/ {
+    in_comment=1
+    sub(/.*\/\*/, "", line)
+}
+
+/\*\// {
+    sub(/\*\/.*/, "", line)
+    in_comment=0
+    print line
+    next
+}
+
+// {
+    if (in_comment) {
+	print line
+    }
+}
+EOF
+
+# Stabilize sort.
+export LC_ALL=C
+
+awk \
+    -f "$awkfile" \
+    -- "$@" \
+    | sed 's/[%^$~#{}`&=@,. \t\/_()|<>\+\*-]/\n/g' \
+    | sed 's/\[/\n/g' \
+    | sed 's/\]/\n/g' \
+    | sed 's/[0-9][0-9]*/\n/g' \
+    | tr '[:upper:]' '[:lower:]' \
+    | sed 's/[ \t]*//g' \
+    | sort \
+    | uniq -c \
+    | awk "{ if (($minfreq == 0 || $minfreq <= \$1) \
+                 && ($maxfreq == 0 || \$1 <= $maxfreq)) { print \$0; } }" \
+    | awk '{ print length($0) " " $0; }' \
+    | sort -n -r \
+    | cut -d ' ' -f 2-

-- 
Gerrit-Project: binutils-gdb
Gerrit-Branch: master
Gerrit-Change-Id: I7b119c9a4519cdbf62a3243d1df2927c80813e8b
Gerrit-Change-Number: 282
Gerrit-PatchSet: 3
Gerrit-Owner: Tom de Vries <tdevries@suse.de>
Gerrit-Reviewer: Luis Machado <luis.machado@linaro.org>
Gerrit-Reviewer: Tom Tromey <tromey@sourceware.org>
Gerrit-Reviewer: Tom de Vries <tdevries@suse.de>
Gerrit-CC: Simon Marchi <simon.marchi@polymtl.ca>
Gerrit-MessageType: merged



* [pushed] [gdb/contrib] Add words.sh script
  2019-10-24 16:39 [review] [RFC][gdb/contrib] Add words.sh script Tom de Vries (Code Review)
                   ` (7 preceding siblings ...)
  2019-11-07  9:51 ` [pushed] " Sourceware to Gerrit sync (Code Review)
@ 2019-11-07  9:51 ` Sourceware to Gerrit sync (Code Review)
  8 siblings, 0 replies; 10+ messages in thread
From: Sourceware to Gerrit sync (Code Review) @ 2019-11-07  9:51 UTC
  To: Tom de Vries, Luis Machado, Tom Tromey, gdb-patches; +Cc: Simon Marchi

The original change was created by Tom de Vries.

Change URL: https://gnutoolchain-gerrit.osci.io/r/c/binutils-gdb/+/282
......................................................................

[gdb/contrib] Add words.sh script

Add a script that takes a list of files as arguments and outputs a list of
words from the C comments with their frequencies.

For:
...
$ ./gdb/contrib/words.sh $(find gdb -type f -name "*.c" -o -name "*.h")
...
it generates a list of ~15000 words prefixed with frequency.

This could be used to generate a dictionary that is kept as part of the
sources, against which new code can be checked, generating a warning or
error.  The hope is that misspellings would trigger this frequently, and rare
words rarely, otherwise the burden of updating the dictionary would be too
much.

And for:
...
$ ./gdb/contrib/words.sh -f 1 $(find gdb -type f -name "*.c" -o -name "*.h")
...
it generates a list of ~5000 words with frequency 1.

This can be used to scan for misspellings manually.

Change-Id: I7b119c9a4519cdbf62a3243d1df2927c80813e8b
---
A gdb/contrib/words.sh
1 file changed, 129 insertions(+), 0 deletions(-)



diff --git a/gdb/contrib/words.sh b/gdb/contrib/words.sh
new file mode 100755
index 0000000..ae38539
--- /dev/null
+++ b/gdb/contrib/words.sh
@@ -0,0 +1,129 @@
+#!/bin/sh
+
+# Copyright (C) 2019 Free Software Foundation, Inc.
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+# This script intends to facilitate spell checking of comments in C sources.
+# It:
+# - extracts comments from C files
+# - transforms the comments into a list of lowercase words
+# - prefixes each word with the frequency
+# - filters out words within a frequency range
+# - sorts the words, longest first
+#
+# For:
+# ...
+# $ ./gdb/contrib/words.sh $(find gdb -type f -name "*.c" -o -name "*.h")
+# ...
+# it generates a list of ~15000 words prefixed with frequency.
+#
+# This could be used to generate a dictionary that is kept as part of the
+# sources, against which new code can be checked, generating a warning or
+# error.  The hope is that misspellings would trigger this frequently, and rare
+# words rarely, otherwise the burden of updating the dictionary would be too
+# much.
+#
+# And for:
+# ...
+# $ ./gdb/contrib/words.sh -f 1 $(find gdb -type f -name "*.c" -o -name "*.h")
+# ...
+# it generates a list of ~5000 words with frequency 1.
+#
+# This can be used to scan for misspellings manually.
+#
+
+minfreq=
+maxfreq=
+while [ $# -gt 0 ]; do
+    case "$1" in
+	--freq|-f)
+	    minfreq=$2
+	    maxfreq=$2
+	    shift 2
+	    ;;
+	--min)
+	    minfreq=$2
+	    if [ "$maxfreq" = "" ]; then
+		maxfreq=0
+	    fi
+	    shift 2
+	    ;;
+	--max)
+	    maxfreq=$2
+	    if [ "$minfreq" = "" ]; then
+		minfreq=0
+	    fi
+	    shift 2
+	    ;;
+	*)
+	    break;
+	    ;;
+    esac
+done
+
+if [ "$minfreq" = "" ] && [ "$maxfreq" = "" ]; then
+    minfreq=0
+    maxfreq=0
+fi
+
+awkfile=$(mktemp)
+trap 'rm -f "$awkfile"' EXIT
+
+cat > "$awkfile" <<EOF
+BEGIN {
+    in_comment=0
+}
+
+// {
+    line=\$0
+}
+
+/\/\*/ {
+    in_comment=1
+    sub(/.*\/\*/, "", line)
+}
+
+/\*\// {
+    sub(/\*\/.*/, "", line)
+    in_comment=0
+    print line
+    next
+}
+
+// {
+    if (in_comment) {
+	print line
+    }
+}
+EOF
+
+# Stabilize sort.
+export LC_ALL=C
+
+awk \
+    -f "$awkfile" \
+    -- "$@" \
+    | sed 's/[%^$~#{}`&=@,. \t\/_()|<>\+\*-]/\n/g' \
+    | sed 's/\[/\n/g' \
+    | sed 's/\]/\n/g' \
+    | sed 's/[0-9][0-9]*/\n/g' \
+    | tr '[:upper:]' '[:lower:]' \
+    | sed 's/[ \t]*//g' \
+    | sort \
+    | uniq -c \
+    | awk "{ if (($minfreq == 0 || $minfreq <= \$1) \
+                 && ($maxfreq == 0 || \$1 <= $maxfreq)) { print \$0; } }" \
+    | awk '{ print length($0) " " $0; }' \
+    | sort -n -r \
+    | cut -d ' ' -f 2-

-- 
Gerrit-Project: binutils-gdb
Gerrit-Branch: master
Gerrit-Change-Id: I7b119c9a4519cdbf62a3243d1df2927c80813e8b
Gerrit-Change-Number: 282
Gerrit-PatchSet: 3
Gerrit-Owner: Tom de Vries <tdevries@suse.de>
Gerrit-Reviewer: Luis Machado <luis.machado@linaro.org>
Gerrit-Reviewer: Tom Tromey <tromey@sourceware.org>
Gerrit-Reviewer: Tom de Vries <tdevries@suse.de>
Gerrit-CC: Simon Marchi <simon.marchi@polymtl.ca>
Gerrit-MessageType: newpatchset


