From: Johann 'Myrkraverk' Oskarsson
Newsgroups: comp.compilers
Subject: Re: Spell checking identifiers
Date: Wed, 24 Jun 2020 03:56:56 +0800
Organization: Easynews - www.easynews.com
Message-ID: <20-06-011@comp.compilers>
References: <20-06-010@comp.compilers>
Keywords: lex, errors

> [There's a vast amount of work on edit distance.  My guess is they
> use something like Levenshtein, but rather than use a constant
> distance of 1 between different letters, the distance varies depending
> on how different the letters look. -John]

This clang blog post specifically mentions Levenshtein,

http://blog.llvm.org/2010/04/amazing-feats-of-clang-error-recovery.html#spell_checker

and it looks like what people do is go through the entire symbol table
and compute the distance from each entry to the one erroneous identifier.

I thought that'd be a bit on the expensive side, because C++ files can
have 100k+ (or millions?) of lines after preprocessing, so one
translation unit really could reach a million identifiers in practice.
[I don't know if that actually happens, but I don't think it's safe to
assume it doesn't.]

In the 10 years since, people may have moved away from standard
Levenshtein, as you mention.
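A minimal sketch of the scan described above, assuming the symbol table
is just a list of names. It is not clang's actual implementation. The
names edit_distance, suggest, visual_cost and the LOOKALIKES table are
all invented for illustration, and the max_dist cutoff and the length
check that skips hopeless candidates are assumptions about how one
might keep the cost down. Passing visual_cost instead of unit_cost
gives the weighted variant the moderator's note guesses at.

```python
from typing import Callable, Iterable, Optional


def unit_cost(x: str, y: str) -> float:
    """Standard Levenshtein: any two different letters cost 1."""
    return 0.0 if x == y else 1.0


# Hypothetical look-alike pairs, made up for this sketch only.
LOOKALIKES = {frozenset("l1"), frozenset("O0"), frozenset("mn"), frozenset("uv")}


def visual_cost(x: str, y: str) -> float:
    """Weighted variant: letters that look alike are cheaper to swap."""
    if x == y:
        return 0.0
    return 0.5 if frozenset((x, y)) in LOOKALIKES else 1.0


def edit_distance(a: str, b: str,
                  sub_cost: Callable[[str, str], float] = unit_cost) -> float:
    """Two-row dynamic programme; insertions and deletions cost 1."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        cur = [float(i)]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1.0,                   # delete ca
                           cur[j - 1] + 1.0,                # insert cb
                           prev[j - 1] + sub_cost(ca, cb)))  # substitute
        prev = cur
    return prev[-1]


def suggest(typo: str, symbols: Iterable[str], max_dist: float = 2.0,
            sub_cost: Callable[[str, str], float] = unit_cost) -> Optional[str]:
    """Scan every symbol and return the closest one within max_dist."""
    best: Optional[str] = None
    best_dist = 0.0
    for name in symbols:
        # The distance is never less than the length difference, so
        # such candidates can be skipped without running the DP.
        if abs(len(name) - len(typo)) > max_dist:
            continue
        d = edit_distance(typo, name, sub_cost)
        if d <= max_dist and (best is None or d < best_dist):
            best, best_dist = name, d
    return best
```

For example, suggest("lenth", ["width", "height", "length"]) picks
"length". The scan is still linear in the size of the symbol table, but
it only runs once per unknown identifier, and the length check discards
many candidates before any table is filled in.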

But then, maybe compilation speed for erroneous input isn't really
important.  rustc is slow for a short input file in both cases [which
could be the startup cost.]

-- 
Johann | email: invalid -> com | www.myrkraverk.com/blog/
I'm not from the Internet, I just work there. | twitter: @myrkraverk