Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.compilers > #2531 > unrolled thread
| Started by | Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid> |
|---|---|
| First post | 2020-06-24 01:38 +0800 |
| Last post | 2020-06-24 13:08 -0700 |
| Articles | 13 — 7 participants |
Back to article view | Back to comp.compilers
Spell checking identifiers Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid> - 2020-06-24 01:38 +0800
Re: Spell checking identifiers Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid> - 2020-06-24 03:56 +0800
Re: Spell checking identifiers gah4@u.washington.edu - 2020-06-23 16:51 -0700
Re: Spell checking identifiers Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid> - 2020-06-25 22:33 +0800
Re: Spell checking identifiers "Derek M. Jones" <derek@_NOSPAM_knosof.co.uk.invalid> - 2020-06-24 11:02 +0100
Re: Spell checking identifiers gah4@u.washington.edu - 2020-06-24 18:28 -0700
Re: Spell checking identifiers mac <acolvin@efunct.com> - 2020-07-09 16:07 +0000
Re: Spell checking identifiers Thomas Koenig <tkoenig@netcologne.de> - 2020-07-10 07:12 +0000
Re: Spell checking identifiers gah4@u.washington.edu - 2020-07-10 13:17 -0700
Re: Spell checking identifiers Kaz Kylheku <937-053-0959@kylheku.com> - 2020-06-24 18:12 +0000
Re: Spell checking identifiers Thomas Koenig <tkoenig@netcologne.de> - 2020-06-24 20:08 +0000
Re: Spell checking identifiers Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid> - 2020-06-25 21:44 +0800
Re: Spell checking identifiers gautier_niouzes@hotmail.com - 2020-06-24 13:08 -0700
| From | Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid> |
|---|---|
| Date | 2020-06-24 01:38 +0800 |
| Subject | Spell checking identifiers |
| Message-ID | <20-06-010@comp.compilers> |
Dear c.compilers, While experimenting with Rust, I came across this suggestion. --> foo.rs:5:9 | 5 | return j; // the variable, not the type. | ^ help: a local variable with a similar name exists: `i` Here it is suggesting i where I typed j. This is the same problem as spell checking identifiers with fuzzy matching, so apologies for a po- tentially misleading subject. So, without going through the source of rustc to find out, I'm curious about what general techniques people use to make this work? In particu- lar the Damerau–Levenshtein distance algorithm is not appropriate for dictionary lookups, as far as I know. I've come across a survey of fuzzy matching algorithms, some of which work with dictionaries but I have no idea which data structures would be appropriate in a compiler, nor do I know what criteria I'd use to choose an appropriate algorithm from such a survey. As an added bonus, the same technique can of course be used to spell check identifiers against a natural language dictionary. But since such a dictionary is more static than the list of identifiers in the current source file, a precomputed database will work, and a more expensive indexing method can be used. Is there an indexing method that works for this, but would not be appropriate for fuzzy matching against identifiers? [Apologies for not responding to my other topic yet, I should be able to reply soon.] -- Johann | email: invalid -> com | www.myrkraverk.com/blog/ I'm not from the Internet, I just work there. | twitter: @myrkraverk [There's a vast amount of work on edit distance. My guess is they use something like Levenshtein, but rather than use a constant distance of 1 between different letters, the distance varies depending on how different the letters look. -John]
[toc] | [next] | [standalone]
| From | Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid> |
|---|---|
| Date | 2020-06-24 03:56 +0800 |
| Message-ID | <20-06-011@comp.compilers> |
| In reply to | #2531 |
> [There's a vast amount of work on edit distance. My guess is they > use something like Levenshtein, but rather than use a constant > distance of 1 between different letters, the distance varies depending > on how different the letters look. -John] This clang blog specifically mentions Levenshtein, http://blog.llvm.org/2010/04/amazing-feats-of-clang-error-recovery.html#spell_checker and it looks like what people do is to go through the entire symbol table and compute it against the individual erroneous identifier. I thought that'd be a bit on the expensive side, because C++ files can have 100k+ (or millions?) of lines after preprocessing, so one translation unit really can go up to million identifiers in practice. [I don't know if that actually happens but I don't think it's safe to assume it doesn't.] In the 10 years since, people may have changed from standard Levenshtein as you mention. But then, maybe compilation speed for erroneous input isn't really important. rustc is slow for a short input file in both cases [which could be the startup cost.] -- Johann | email: invalid -> com | www.myrkraverk.com/blog/ I'm not from the Internet, I just work there. | twitter: @myrkraverk
[toc] | [prev] | [next] | [standalone]
| From | gah4@u.washington.edu |
|---|---|
| Date | 2020-06-23 16:51 -0700 |
| Message-ID | <20-06-012@comp.compilers> |
| In reply to | #2532 |
On Tuesday, June 23, 2020 at 12:59:35 PM UTC-7, Johann 'Myrkraverk' Oskarsson wrote: (snip) > This clang blog specifically mentions Levenshtein, > http://blog.llvm.org/2010/04/amazing-feats-of-clang-error-recovery.html#spell_checker > and it looks like what people do is to go through the entire symbol > table and compute it against the individual erroneous identifier. > I thought that'd be a bit on the expensive side, With either constant weighting or character dependent weighting it is easy to do with dynamic programming. The time is then O(m n) where m and n are the two lengths. It seems most obvious to do only variable that are in the appropriate scope to be misspelled, but I suspect catching variables used out of scope is also worth doing. Well, in the latter case, you could hope that they at least spell them the same. I think you should turn it off for one character names, though, even though I suspect those are more likely. Too many false positives!
[toc] | [prev] | [next] | [standalone]
| From | Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid> |
|---|---|
| Date | 2020-06-25 22:33 +0800 |
| Message-ID | <20-06-019@comp.compilers> |
| In reply to | #2533 |
On 24/06/2020 7:51 am, gah4@u.washington.edu wrote: > On Tuesday, June 23, 2020 at 12:59:35 PM UTC-7, Johann 'Myrkraverk' Oskarsson wrote: > > (snip) > >> This clang blog specifically mentions Levenshtein, > >> http://blog.llvm.org/2010/04/amazing-feats-of-clang-error-recovery.html#spell_checker > >> and it looks like what people do is to go through the entire symbol >> table and compute it against the individual erroneous identifier. > >> I thought that'd be a bit on the expensive side, > > With either constant weighting or character dependent weighting > it is easy to do with dynamic programming. The time is then O(m n) > where m and n are the two lengths. Are you talking about doing this one by one through the entire symbol table? > It seems most obvious to do only variable that are in the appropriate > scope to be misspelled, but I suspect catching variables used out > of scope is also worth doing. Well, in the latter case, you could > hope that they at least spell them the same. Depending on context, one would also want to do this for type names (as per the blog above). Depending on the language* and culture**, there can be thousands of type names in scope. > I think you should turn it off for one character names, though, > even though I suspect those are more likely. Too many false > positives! rustc obviously does this for one character names, at least in the case for i and j. I don't know if it's useful to compare a and k. * C++ and Java come to mind. ** Programming culture, some of them have a name such as Agile, and eXtreme Programming; others don't have a name. -- Johann | email: invalid -> com | www.myrkraverk.com/blog/ I'm not from the Internet, I just work there. | twitter: @myrkraverk
[toc] | [prev] | [next] | [standalone]
| From | "Derek M. Jones" <derek@_NOSPAM_knosof.co.uk.invalid> |
|---|---|
| Date | 2020-06-24 11:02 +0100 |
| Message-ID | <20-06-013@comp.compilers> |
| In reply to | #2531 |
Johann, > So, without going through the source of rustc to find out, I'm curious > about what general techniques people use to make this work? In particu- > lar the Damerau–Levenshtein distance algorithm is not appropriate for > dictionary lookups, as far as I know. More than you probably wanted to know: http://www.coding-guidelines.com/cbook/sent792.pdf -- Derek M. Jones blog:shape-of-code.coding-guidelines.com
[toc] | [prev] | [next] | [standalone]
| From | gah4@u.washington.edu |
|---|---|
| Date | 2020-06-24 18:28 -0700 |
| Message-ID | <20-06-017@comp.compilers> |
| In reply to | #2534 |
On Wednesday, June 24, 2020 at 11:45:44 AM UTC-7, Derek M. Jones wrote: (snip) > More than you probably wanted to know: > http://www.coding-guidelines.com/cbook/sent792.pdf Interesting. It seems that the compiler could also suggest when different names are confusingly similar, other than spelling, or otherwise possibly confusing to readers. Maybe by how they sound when pronounced. You might not want J1 and J_one in the same program. Reminds me that the IBM OS/360 Fortran H compiler, instead of using a hash table, uses six balanced trees. IBM suggests in one manual that for faster compilation, one distribute names among the six lengths. Back in the time when compilation time was more important than readability.
[toc] | [prev] | [next] | [standalone]
| From | mac <acolvin@efunct.com> |
|---|---|
| Date | 2020-07-09 16:07 +0000 |
| Message-ID | <20-07-002@comp.compilers> |
| In reply to | #2538 |
> You might not want J1 and J_one in the same program. Similarly, some ancient compiler (Euclid?) had case-insensitive lookup, but required the same capitalization everywhere This was much more important in the days of batch systems, where compilers tried to "correct" your code, because it would be a day before you got another chance. And because student programs couldn't do any damage.
[toc] | [prev] | [next] | [standalone]
| From | Thomas Koenig <tkoenig@netcologne.de> |
|---|---|
| Date | 2020-07-10 07:12 +0000 |
| Message-ID | <20-07-003@comp.compilers> |
| In reply to | #2548 |
mac <acolvin@efunct.com> schrieb: >> You might not want J1 and J_one in the same program. > > Similarly, some ancient compiler (Euclid?) had case-insensitive lookup, but > required the same capitalization everywhere Fortran has a a bit of a similar issue with its C interoperability feature. Entities with C binding have global identifiers in Fortran. Fortran is a case-insensitive laguage, so FooBar and foobar look the same to Fortran, and you can not have a C binding to both (but either one would work). In practice, this should not be a big problem.
[toc] | [prev] | [next] | [standalone]
| From | gah4@u.washington.edu |
|---|---|
| Date | 2020-07-10 13:17 -0700 |
| Message-ID | <20-07-004@comp.compilers> |
| In reply to | #2549 |
On Friday, July 10, 2020 at 8:23:37 AM UTC-7, Thomas Koenig wrote: (snip) > Fortran has a a bit of a similar issue with its C interoperability > feature. > Entities with C binding have global identifiers in Fortran. Fortran > is a case-insensitive laguage, so FooBar and foobar look the same > to Fortran, and you can not have a C binding to both (but either > one would work). Much of this was pretty strange before Fortran added the C interoperability feature. Mostly, Fortran compilers did what they did, and the C names had to match. That included adding underscore characters sometimes. Now Fortran has BIND(C), and more specifically BIND(C, NAME='cname') where you specify the name as it will be known to C. That name is case sensitive (at least if the underlying linker supports it), and might be completely unrelated to the Fortran name. BIND(C) without NAME= generates lower case, which is usual for C. I believe IBM PL/I compilers have had the ability to call other (presumably IBM) languages, with specific keywords. But mostly that was when the IBM linker was (supposed to be) upper case only. VAX/VMS, and I suspect other VMS, has a system, defined along with VAX, for calling routines with other argument passing conventions, with %VAL(), %REF(), and %DESCR(), or call by value, reference, and descriptor. I am not sure what they do about case, but many VMS names have $ in them, so that is allowed. [Back in the day, IBM PL/I could call Fortran and COBOL but the procedure names are all eight EBCDIC characters. The extra stuff involves mapping datatypes. The current PL/I manuals say it handles many different character encodings like UTF-16. -John]
[toc] | [prev] | [next] | [standalone]
| From | Kaz Kylheku <937-053-0959@kylheku.com> |
|---|---|
| Date | 2020-06-24 18:12 +0000 |
| Message-ID | <20-06-014@comp.compilers> |
| In reply to | #2531 |
On 2020-06-23, Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid> wrote: > Dear c.compilers, > > While experimenting with Rust, I came across this suggestion. > > --> foo.rs:5:9 > | > 5 | return j; // the variable, not the type. > | ^ help: a local variable with a similar name exists: `i` > > Here it is suggesting i where I typed j. This is the same problem as > spell checking identifiers with fuzzy matching, so apologies for a po- > tentially misleading subject. I don't find these kinds of childish diagnostic messages useful at all. They have started to appear in GCC also. The good old "undeclared identifier `j`" requires no update, thanks.
[toc] | [prev] | [next] | [standalone]
| From | Thomas Koenig <tkoenig@netcologne.de> |
|---|---|
| Date | 2020-06-24 20:08 +0000 |
| Message-ID | <20-06-015@comp.compilers> |
| In reply to | #2535 |
Kaz Kylheku <937-053-0959@kylheku.com> schrieb:
> On 2020-06-23, Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid> wrote:
>> 5 | return j; // the variable, not the type.
>> | ^ help: a local variable with a similar name exists: `i`
> I don't find these kinds of childish diagnostic messages useful at all.
>
> They have started to appear in GCC also.
> The good old "undeclared identifier `j`" requires no update, thanks.
At least gcc doesn't do the suggestion in that particular case:
foo.c: In function 'foo':
foo.c:8:7: error: 'j' undeclared (first use in this function)
8 | x[j] = a[i] > b[i];
| ^
foo.c:8:7: note: each undeclared identifier is reported only once for each function it appears in
[toc] | [prev] | [next] | [standalone]
| From | Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid> |
|---|---|
| Date | 2020-06-25 21:44 +0800 |
| Message-ID | <20-06-018@comp.compilers> |
| In reply to | #2536 |
On 25/06/2020 4:08 am, Thomas Koenig wrote: > Kaz Kylheku <937-053-0959@kylheku.com> schrieb: >> On 2020-06-23, Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid> wrote: >>> 5 | return j; // the variable, not the type. >>> | ^ help: a local variable with a similar name exists: `i` > >> I don't find these kinds of childish diagnostic messages useful at all. >> >> They have started to appear in GCC also. > >> The good old "undeclared identifier `j`" requires no update, thanks. > > At least gcc doesn't do the suggestion in that particular case: > > foo.c: In function 'foo': > foo.c:8:7: error: 'j' undeclared (first use in this function) > 8 | x[j] = a[i] > b[i]; > | ^ > foo.c:8:7: note: each undeclared identifier is reported only once for each function it appears in > A friend showed me the ghc code for this, and it deliberatly (according to comments) excludes single variable names. I don't understand enough Haskell to tell exactly what the code is doing, but they seem to be using the Damerau-Levenshtein distance algorithm based on identifier names. Now, whether this is useful is of course debatable, and I would personally like to give people a choice to either turn it on, or off. We can argue about defaults if/when I have a published compiler. -- Johann | email: invalid -> com | www.myrkraverk.com/blog/ I'm not from the Internet, I just work there. | twitter: @myrkraverk
[toc] | [prev] | [next] | [standalone]
| From | gautier_niouzes@hotmail.com |
|---|---|
| Date | 2020-06-24 13:08 -0700 |
| Message-ID | <20-06-016@comp.compilers> |
| In reply to | #2531 |
On Tuesday, June 23, 2020 at 8:40:38 PM UTC+2, Johann 'Myrkraverk' Oskarsson > So, without going through the source of rustc to find out, I'm curious > about what general techniques people use to make this work? In particu- > lar the Damerau–Levenshtein distance algorithm is not appropriate for > dictionary lookups, as far as I know. GCC's GNAT front-end for Ada does such checks (search for "GNAT.Spelling_Checker"). The source code (at least versions I've found on the Web) does not refer to a specific algorithm. A few examples of detection: hac-parser-expressions.adb:129:35: "Symst" is undefined hac-parser-expressions.adb:129:35: possible misspelling of "Symset" hac-parser-expressions.adb:129:35: "Sym_set" is undefined hac-parser-expressions.adb:129:35: possible misspelling of "Symset" hac-parser-expressions.adb:129:35: "Sysmet" is undefined hac-parser-expressions.adb:129:35: possible misspelling of "Symset"
[toc] | [prev] | [standalone]
Back to top | Article view | comp.compilers
csiph-web