From: "Tom de Vries (Code Review)" <gerrit@gnutoolchain-gerrit.osci.io>
To: gdb-patches@sourceware.org
Subject: [review] [RFC][gdb/contrib] Add words.sh script
Date: Thu, 24 Oct 2019 16:39:00 -0000
Message-ID: <gerrit.1571935190000.I7b119c9a4519cdbf62a3243d1df2927c80813e8b@gnutoolchain-gerrit.osci.io>
In-Reply-To: <gerrit.1571935190000.I7b119c9a4519cdbf62a3243d1df2927c80813e8b@gnutoolchain-gerrit.osci.io>
Change URL: https://gnutoolchain-gerrit.osci.io/r/c/binutils-gdb/+/282
......................................................................
[RFC][gdb/contrib] Add words.sh script
Add a script that takes a list of files as arguments and outputs a list of
the words found in their C comments, each with its frequency.
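As a rough illustration of the "comment text to words" step, here is a minimal
sketch on a single hand-written comment line.  It uses tr instead of the sed
expressions in the script, and split_words is a made-up name used only for
this example.

```sh
#!/bin/sh
# Sketch only: split text into lowercase words, one per line.
# split_words is a hypothetical helper, not part of words.sh.
split_words ()
{
    # Turn every non-letter into a newline, lowercase the rest,
    # and drop the empty lines that result.
    tr -c '[:alpha:]' '\n' | tr '[:upper:]' '[:lower:]' | grep -v '^$'
}

words=$(printf '%s\n' 'Return the frame_info for FRAME, or NULL.' | split_words)
printf '%s\n' "$words"
```

Like the script, this treats "_" as a separator, so identifiers such as
frame_info are counted as separate words.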
For:
...
$ ./gdb/contrib/words.sh $(find gdb -type f \( -name "*.c" -o -name "*.h" \))
...
it generates a list of ~15000 words prefixed with frequency.
This could be used to generate a dictionary that is kept as part of the
sources and against which new code is checked, with unknown words producing
a warning or error.  The hope is that such a check is triggered mostly by
misspellings and only rarely by correctly spelled but rare words; otherwise
the burden of updating the dictionary would be too high.
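The dictionary check described above is not part of this patch.  A minimal
sketch of what it might look like, assuming a dictionary file with one word
per line (check_words and the file format are assumptions made for this
example):

```sh
#!/bin/sh
# Sketch only: print the words on stdin that are missing from a dictionary.
# check_words and the one-word-per-line dictionary format are hypothetical.
check_words ()
{
    dict=$1
    # -F: fixed strings, -x: match whole lines, -v: keep the non-matching ones.
    grep -v -x -F -f "$dict"
}

tmpdict=$(mktemp)
printf '%s\n' breakpoint frame inferior > "$tmpdict"
unknown=$(printf '%s\n' frame inferor breakpoint | check_words "$tmpdict")
rm -f "$tmpdict"
printf '%s\n' "$unknown"
```

To feed it from words.sh, the frequency prefix would have to be stripped
first, for instance with awk '{ print $2 }'.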
And for:
...
$ ./gdb/contrib/words.sh -f 1 $(find gdb -type f \( -name "*.c" -o -name "*.h" \))
...
it generates a list of ~5000 words with frequency 1.
This can be used to scan for misspellings manually.
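To show why frequency 1 is a useful filter, here is a small sketch of the
counting and filtering stage on a hand-made word list.  freq_filter is an
illustrative name, and the function only mirrors what -f N is described as
doing; it is not the script's own code.

```sh
#!/bin/sh
# Sketch only: keep the words that occur exactly N times in the input.
# freq_filter is a hypothetical helper, not part of words.sh.
freq_filter ()
{
    n=$1
    # uniq -c prefixes each distinct word with its count; awk keeps the
    # words whose count equals n and prints them without the count.
    sort | uniq -c | awk -v n="$n" '$1 == n { print $2 }'
}

rare=$(printf '%s\n' the teh the frame frame | freq_filter 1)
printf '%s\n' "$rare"
```

In this input only the misspelling "teh" occurs once, so it is the only word
printed.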
Change-Id: I7b119c9a4519cdbf62a3243d1df2927c80813e8b
---
A gdb/contrib/words.sh
1 file changed, 107 insertions(+), 0 deletions(-)
diff --git a/gdb/contrib/words.sh b/gdb/contrib/words.sh
new file mode 100755
index 0000000..ad6ec2b
--- /dev/null
+++ b/gdb/contrib/words.sh
@@ -0,0 +1,107 @@
+#!/bin/sh
+
+# Copyright (C) 2019 Free Software Foundation, Inc.
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+# This script is intended to facilitate spell checking of comments in C
+# sources.  It:
+# - extracts comments from C files
+# - transforms the comments into a list of lowercase words
+# - prefixes each word with its frequency
+# - keeps only the words within the requested frequency range
+# - sorts the words, longest first
+
+dir=$(cd $(dirname $0); pwd -P)
+
+minfreq=
+maxfreq=
+while [ $# -gt 0 ]; do
+ case "$1" in
+ --freq|-f)
+ minfreq=$2
+ maxfreq=$2
+ shift 2
+ ;;
+ --min)
+ minfreq=$2
+ if [ "$maxfreq" = "" ]; then
+ maxfreq=0
+ fi
+ shift 2
+ ;;
+ --max)
+ maxfreq=$2
+ if [ "$minfreq" = "" ]; then
+ minfreq=0
+ fi
+ shift 2
+ ;;
+ *)
+ break;
+ ;;
+ esac
+done
+
+if [ "$minfreq" = "" ] && [ "$maxfreq" = "" ]; then
+ minfreq=0
+ maxfreq=0
+fi
+
+awkfile=$(mktemp)
+
+cat > "$awkfile" <<EOF
+BEGIN {
+ in_comment=0
+}
+
+// {
+ line=\$0
+}
+
+/\/\*/ {
+ in_comment=1
+ sub(/.*\/\*/, "", line)
+}
+
+/\*\// {
+ sub(/\*\/.*/, "", line)
+ in_comment=0
+ print line
+ next
+}
+
+// {
+ if (in_comment) {
+ print line
+ }
+}
+EOF
+
+awk \
+ -f "$awkfile" \
+ "$@" \
+ | sed 's/[%^$~#{}`&=@,. \t\/_()|<>\+\*-]/\n/g' \
+ | sed 's/\[/\n/g' \
+ | sed 's/\]/\n/g' \
+ | sed 's/[0-9][0-9]*/\n/g' \
+ | tr '[:upper:]' '[:lower:]' \
+ | sed 's/[ \t]*//g' \
+ | sort \
+ | uniq -c \
+ | awk "{ if (($minfreq == 0 || $minfreq <= \$1) && ($maxfreq == 0 || \$1 <= $maxfreq)) { print \$0; } }" \
+ | awk '{ print length($0) " " $0; }' \
+ | sort -n -r \
+ | cut -d ' ' -f 2-
+
+rm -f "$awkfile"
--
Gerrit-Project: binutils-gdb
Gerrit-Branch: master
Gerrit-Change-Id: I7b119c9a4519cdbf62a3243d1df2927c80813e8b
Gerrit-Change-Number: 282
Gerrit-PatchSet: 1
Gerrit-Owner: Tom de Vries <tdevries@suse.de>
Gerrit-MessageType: newchange