Groups > comp.compilers > #2531 > unrolled thread

Spell checking identifiers

Started by	Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid>
First post	2020-06-24 01:38 +0800
Last post	2020-06-24 13:08 -0700
Articles	13 — 7 participants

Back to article view | Back to comp.compilers

  Spell checking identifiers Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid> - 2020-06-24 01:38 +0800
    Re: Spell checking identifiers Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid> - 2020-06-24 03:56 +0800
      Re: Spell checking identifiers gah4@u.washington.edu - 2020-06-23 16:51 -0700
        Re: Spell checking identifiers Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid> - 2020-06-25 22:33 +0800
    Re: Spell checking identifiers "Derek M. Jones" <derek@_NOSPAM_knosof.co.uk.invalid> - 2020-06-24 11:02 +0100
      Re: Spell checking identifiers gah4@u.washington.edu - 2020-06-24 18:28 -0700
        Re: Spell checking identifiers mac <acolvin@efunct.com> - 2020-07-09 16:07 +0000
          Re: Spell checking identifiers Thomas Koenig <tkoenig@netcologne.de> - 2020-07-10 07:12 +0000
            Re: Spell checking identifiers gah4@u.washington.edu - 2020-07-10 13:17 -0700
    Re: Spell checking identifiers Kaz Kylheku <937-053-0959@kylheku.com> - 2020-06-24 18:12 +0000
      Re: Spell checking identifiers Thomas Koenig <tkoenig@netcologne.de> - 2020-06-24 20:08 +0000
        Re: Spell checking identifiers Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid> - 2020-06-25 21:44 +0800
    Re: Spell checking identifiers gautier_niouzes@hotmail.com - 2020-06-24 13:08 -0700

#2531 — Spell checking identifiers

From	Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid>
Date	2020-06-24 01:38 +0800
Subject	Spell checking identifiers
Message-ID	<20-06-010@comp.compilers>

Dear c.compilers,

While experimenting with Rust, I came across this suggestion.

  --> foo.rs:5:9
   |
5 |     return j; // the variable, not the type.
   |            ^ help: a local variable with a similar name exists: `i`

Here it is suggesting i where I typed j.  This is the same problem as
spell checking identifiers with fuzzy matching, so apologies for a po-
tentially misleading subject.

So, without going through the source of rustc to find out, I'm curious
about what general techniques people use to make this work?  In particu-
lar the Damerau–Levenshtein distance algorithm is not appropriate for
dictionary lookups, as far as I know.

I've come across a survey of fuzzy matching algorithms, some of which
work with dictionaries but I have no idea which data structures would
be appropriate in a compiler, nor do I know what criteria I'd use to
choose an appropriate algorithm from such a survey.

As an added bonus, the same technique can of course be used to spell
check identifiers against a natural language dictionary.  But since
such a dictionary is more static than the list of identifiers in the
current source file, a precomputed database will work, and a more
expensive indexing method can be used.  Is there an indexing method
that works for this, but would not be appropriate for fuzzy matching
against identifiers?

[Apologies for not responding to my other topic yet, I should be able
to reply soon.]

--
Johann | email: invalid -> com | www.myrkraverk.com/blog/
I'm not from the Internet, I just work there. | twitter: @myrkraverk
[There's a vast amount of work on edit distance.  My guess is they
use something like Levenshtein, but rather than use a constant
distance of 1 between different letters, the distance varies depending
on how different the letters look. -John]

[toc] | [next] | [standalone]

#2532

From	Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid>
Date	2020-06-24 03:56 +0800
Message-ID	<20-06-011@comp.compilers>
In reply to	#2531

> [There's a vast amount of work on edit distance.  My guess is they
> use something like Levenshtein, but rather than use a constant
> distance of 1 between different letters, the distance varies depending
> on how different the letters look. -John]

This clang blog specifically mentions Levenshtein,


http://blog.llvm.org/2010/04/amazing-feats-of-clang-error-recovery.html#spell_checker

and it looks like what people do is to go through the entire symbol
table and compute it against the individual erroneous identifier.

I thought that'd be a bit on the expensive side, because C++ files
can have 100k+ (or millions?) of lines after preprocessing, so one
translation unit really can go up to million identifiers in practice.
[I don't know if that actually happens but I don't think it's safe
to assume it doesn't.]

In the 10 years since, people may have changed from standard Levenshtein
as you mention.

But then, maybe compilation speed for erroneous input isn't really
important.  rustc is slow for a short input file in both cases [which
could be the startup cost.]

--
Johann | email: invalid -> com | www.myrkraverk.com/blog/
I'm not from the Internet, I just work there. | twitter: @myrkraverk

[toc] | [prev] | [next] | [standalone]

#2533

From	gah4@u.washington.edu
Date	2020-06-23 16:51 -0700
Message-ID	<20-06-012@comp.compilers>
In reply to	#2532

On Tuesday, June 23, 2020 at 12:59:35 PM UTC-7, Johann 'Myrkraverk' Oskarsson wrote:

(snip)

> This clang blog specifically mentions Levenshtein,

> http://blog.llvm.org/2010/04/amazing-feats-of-clang-error-recovery.html#spell_checker

> and it looks like what people do is to go through the entire symbol
> table and compute it against the individual erroneous identifier.

> I thought that'd be a bit on the expensive side,

With either constant weighting or character dependent weighting
it is easy to do with dynamic programming. The time is then O(m n)
where m and n are the two lengths.

It seems most obvious to do only variable that are in the appropriate
scope to be misspelled, but I suspect catching variables used out
of scope is also worth doing.  Well, in the latter case, you could
hope that they at least spell them the same.

I think you should turn it off for one character names, though,
even though I suspect those are more likely. Too many false
positives!

[toc] | [prev] | [next] | [standalone]

#2540

From	Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid>
Date	2020-06-25 22:33 +0800
Message-ID	<20-06-019@comp.compilers>
In reply to	#2533

On 24/06/2020 7:51 am, gah4@u.washington.edu wrote:
> On Tuesday, June 23, 2020 at 12:59:35 PM UTC-7, Johann 'Myrkraverk' Oskarsson wrote:
>
> (snip)
>
>> This clang blog specifically mentions Levenshtein,
>
>> http://blog.llvm.org/2010/04/amazing-feats-of-clang-error-recovery.html#spell_checker
>
>> and it looks like what people do is to go through the entire symbol
>> table and compute it against the individual erroneous identifier.
>
>> I thought that'd be a bit on the expensive side,
>
> With either constant weighting or character dependent weighting
> it is easy to do with dynamic programming. The time is then O(m n)
> where m and n are the two lengths.

Are you talking about doing this one by one through the entire symbol
table?

> It seems most obvious to do only variable that are in the appropriate
> scope to be misspelled, but I suspect catching variables used out
> of scope is also worth doing.  Well, in the latter case, you could
> hope that they at least spell them the same.

Depending on context, one would also want to do this for type names (as
per the blog above).  Depending on the language* and culture**, there
can be thousands of type names in scope.

> I think you should turn it off for one character names, though,
> even though I suspect those are more likely. Too many false
> positives!

rustc obviously does this for one character names, at least in the
case for i and j.  I don't know if it's useful to compare a and k.

* C++ and Java come to mind.

** Programming culture, some of them have a name such as Agile, and
eXtreme Programming; others don't have a name.

--
Johann | email: invalid -> com | www.myrkraverk.com/blog/
I'm not from the Internet, I just work there. | twitter: @myrkraverk

[toc] | [prev] | [next] | [standalone]

#2534

From	"Derek M. Jones" <derek@_NOSPAM_knosof.co.uk.invalid>
Date	2020-06-24 11:02 +0100
Message-ID	<20-06-013@comp.compilers>
In reply to	#2531

Johann,

> So, without going through the source of rustc to find out, I'm curious
> about what general techniques people use to make this work?  In particu-
> lar the Damerau–Levenshtein distance algorithm is not appropriate for
> dictionary lookups, as far as I know.

More than you probably wanted to know:
http://www.coding-guidelines.com/cbook/sent792.pdf

--
Derek M. Jones
blog:shape-of-code.coding-guidelines.com

[toc] | [prev] | [next] | [standalone]

#2538

From	gah4@u.washington.edu
Date	2020-06-24 18:28 -0700
Message-ID	<20-06-017@comp.compilers>
In reply to	#2534

On Wednesday, June 24, 2020 at 11:45:44 AM UTC-7, Derek M. Jones wrote:

(snip)

> More than you probably wanted to know:
> http://www.coding-guidelines.com/cbook/sent792.pdf

Interesting.

It seems that the compiler could also suggest when different names
are confusingly similar, other than spelling, or otherwise possibly
confusing to readers.  Maybe by how they sound when pronounced.

You might not want J1 and J_one in the same program.

Reminds me that the IBM OS/360 Fortran H compiler, instead of using
a hash table, uses six balanced trees.  IBM suggests in one manual
that for faster compilation, one distribute names among the
six lengths.  Back in the time when compilation time was more
important than readability.

[toc] | [prev] | [next] | [standalone]

#2548

From	mac <acolvin@efunct.com>
Date	2020-07-09 16:07 +0000
Message-ID	<20-07-002@comp.compilers>
In reply to	#2538

> You might not want J1 and J_one in the same program.

Similarly, some ancient compiler (Euclid?) had case-insensitive lookup, but
required the same capitalization everywhere

This was much more important in the days of batch systems, where compilers
tried to "correct" your code, because it would be a day before you got
another chance. And because student programs couldn't do any damage.

[toc] | [prev] | [next] | [standalone]

#2549

From	Thomas Koenig <tkoenig@netcologne.de>
Date	2020-07-10 07:12 +0000
Message-ID	<20-07-003@comp.compilers>
In reply to	#2548

mac <acolvin@efunct.com> schrieb:
>> You might not want J1 and J_one in the same program.
>
> Similarly, some ancient compiler (Euclid?) had case-insensitive lookup, but
> required the same capitalization everywhere

Fortran has a a bit of a similar issue with its C interoperability
feature.

Entities with C binding have global identifiers in Fortran.  Fortran
is a case-insensitive laguage, so FooBar and foobar look the same
to Fortran, and you can not have a C binding to both (but either
one would work).

In practice, this should not be a big problem.

[toc] | [prev] | [next] | [standalone]

#2550

From	gah4@u.washington.edu
Date	2020-07-10 13:17 -0700
Message-ID	<20-07-004@comp.compilers>
In reply to	#2549

On Friday, July 10, 2020 at 8:23:37 AM UTC-7, Thomas Koenig wrote:

(snip)

> Fortran has a a bit of a similar issue with its C interoperability
> feature.

> Entities with C binding have global identifiers in Fortran.  Fortran
> is a case-insensitive laguage, so FooBar and foobar look the same
> to Fortran, and you can not have a C binding to both (but either
> one would work).

Much of this was pretty strange before Fortran added the
C interoperability feature.  Mostly, Fortran compilers did what
they did, and the C names had to match. That included adding
underscore characters sometimes.

Now Fortran has BIND(C), and more specifically BIND(C, NAME='cname')
where you specify the name as it will be known to C.  That name is
case sensitive (at least if the underlying linker supports it),
and might be completely unrelated to the Fortran name.

BIND(C) without NAME= generates lower case, which is usual for C.

I believe IBM PL/I compilers have had the ability to call other
(presumably IBM) languages, with specific keywords.  But mostly that
was when the IBM linker was (supposed to be) upper case only.

VAX/VMS, and I suspect other VMS, has a system, defined along
with VAX, for calling routines with other argument passing
conventions, with %VAL(), %REF(), and %DESCR(), or call by value,
reference, and descriptor. I am not sure what they do about case,
but many VMS names have $ in them, so that is allowed.

[Back in the day, IBM PL/I could call Fortran and COBOL but the
procedure names are all eight EBCDIC characters. The extra stuff
involves mapping datatypes.  The current PL/I manuals say it
handles many different character encodings like UTF-16. -John]

[toc] | [prev] | [next] | [standalone]

#2535

From	Kaz Kylheku <937-053-0959@kylheku.com>
Date	2020-06-24 18:12 +0000
Message-ID	<20-06-014@comp.compilers>
In reply to	#2531

On 2020-06-23, Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid> wrote:
> Dear c.compilers,
>
> While experimenting with Rust, I came across this suggestion.
>
>   --> foo.rs:5:9
>    |
> 5 |     return j; // the variable, not the type.
>    |            ^ help: a local variable with a similar name exists: `i`
>
> Here it is suggesting i where I typed j.  This is the same problem as
> spell checking identifiers with fuzzy matching, so apologies for a po-
> tentially misleading subject.

I don't find these kinds of childish diagnostic messages useful at all.

They have started to appear in GCC also.

The good old "undeclared identifier `j`" requires no update, thanks.

[toc] | [prev] | [next] | [standalone]

#2536

From	Thomas Koenig <tkoenig@netcologne.de>
Date	2020-06-24 20:08 +0000
Message-ID	<20-06-015@comp.compilers>
In reply to	#2535

Kaz Kylheku <937-053-0959@kylheku.com> schrieb:
> On 2020-06-23, Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid> wrote:
>> 5 |     return j; // the variable, not the type.
>>    |            ^ help: a local variable with a similar name exists: `i`

> I don't find these kinds of childish diagnostic messages useful at all.
>
> They have started to appear in GCC also.

> The good old "undeclared identifier `j`" requires no update, thanks.

At least gcc doesn't do the suggestion in that particular case:

foo.c: In function 'foo':
foo.c:8:7: error: 'j' undeclared (first use in this function)
    8 |     x[j] = a[i] > b[i];
      |       ^
foo.c:8:7: note: each undeclared identifier is reported only once for each function it appears in

[toc] | [prev] | [next] | [standalone]

#2539

From	Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid>
Date	2020-06-25 21:44 +0800
Message-ID	<20-06-018@comp.compilers>
In reply to	#2536

On 25/06/2020 4:08 am, Thomas Koenig wrote:
> Kaz Kylheku <937-053-0959@kylheku.com> schrieb:
>> On 2020-06-23, Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid> wrote:
>>> 5 |     return j; // the variable, not the type.
>>>     |            ^ help: a local variable with a similar name exists: `i`
>
>> I don't find these kinds of childish diagnostic messages useful at all.
>>
>> They have started to appear in GCC also.
>
>> The good old "undeclared identifier `j`" requires no update, thanks.
>
> At least gcc doesn't do the suggestion in that particular case:
>
> foo.c: In function 'foo':
> foo.c:8:7: error: 'j' undeclared (first use in this function)
>      8 |     x[j] = a[i] > b[i];
>        |       ^
> foo.c:8:7: note: each undeclared identifier is reported only once for each function it appears in
>

A friend showed me the ghc code for this, and it deliberatly (according
to comments) excludes single variable names.  I don't understand enough
Haskell to tell exactly what the code is doing, but they seem to be
using the Damerau-Levenshtein distance algorithm based on identifier
names.

Now, whether this is useful is of course debatable, and I would
personally like to give people a choice to either turn it on, or
off.  We can argue about defaults if/when I have a published
compiler.

--
Johann | email: invalid -> com | www.myrkraverk.com/blog/
I'm not from the Internet, I just work there. | twitter: @myrkraverk

[toc] | [prev] | [next] | [standalone]

#2537

From	gautier_niouzes@hotmail.com
Date	2020-06-24 13:08 -0700
Message-ID	<20-06-016@comp.compilers>
In reply to	#2531

On Tuesday, June 23, 2020 at 8:40:38 PM UTC+2, Johann 'Myrkraverk' Oskarsson

> So, without going through the source of rustc to find out, I'm curious
> about what general techniques people use to make this work?  In particu-
> lar the Damerau–Levenshtein distance algorithm is not appropriate for
> dictionary lookups, as far as I know.

GCC's GNAT front-end for Ada does such checks (search for "GNAT.Spelling_Checker").
The source code (at least versions I've found on the Web) does not refer to a
specific algorithm.

A few examples of detection:

hac-parser-expressions.adb:129:35: "Symst" is undefined
hac-parser-expressions.adb:129:35: possible misspelling of "Symset"

hac-parser-expressions.adb:129:35: "Sym_set" is undefined
hac-parser-expressions.adb:129:35: possible misspelling of "Symset"

hac-parser-expressions.adb:129:35: "Sysmet" is undefined
hac-parser-expressions.adb:129:35: possible misspelling of "Symset"

[toc] | [prev] | [standalone]

csiph-web

Spell checking identifiers

Contents

#2531 — Spell checking identifiers

#2532

#2533

#2540

#2534

#2538

#2548

#2549

#2550

#2535

#2536

#2539

#2537