From: MRAB
Newsgroups: comp.lang.python
Subject: Re: re.finditer() skips unicode into selection
Date: Wed, 26 Jun 2013 21:24:52 +0100

On 26/06/2013 20:18, akshay.ksth@gmail.com wrote:
> I am using the following Highlighter class for Spell Checking to work on my QTextEdit.
>
> class Highlighter(QSyntaxHighlighter):

In Python 2.7, the re module has a somewhat limited idea of what a "word" character is. It recognises 'DEVANAGARI LETTER NA' as a letter, but treats 'DEVANAGARI VOWEL SIGN E' as a diacritic, which \w does not match. The pattern ur'(?u)\w+' will therefore split "नेपाली" into 3 parts.
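To see where the split comes from, here is a small sketch using only the standard library. It is written for Python 3 rather than the 2.7 in the question (on 2.7 you would need u'' literals and the UNICODE flag), and the variable names are only for illustration. It prints the Unicode category of each code point in the word and then counts the pieces that \w+ finds.

```python
import re
import unicodedata

# The word from the question, spelled out as escapes so that
# each code point is explicit.
word = "\u0928\u0947\u092a\u093e\u0932\u0940"

# Print each code point with its general category and name.
# 'Lo' is a letter; 'Mn' and 'Mc' are combining marks.
for ch in word:
    print("U+%04X %-2s %s" % (ord(ch), unicodedata.category(ch), unicodedata.name(ch)))

# \w matches the letters but not the combining marks, so each
# vowel sign ends the current match.
parts = re.findall(r"\w+", word)
print(len(parts))
```

The vowel signs come out in the mark categories rather than the letter categories, which is why the matches stop at each one.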
>     pattern = ur'\w+'
>     def __init__(self, *args):
>         QSyntaxHighlighter.__init__(self, *args)
>         self.dict = None
>
>     def setDict(self, dict):
>         self.dict = dict
>
>     def highlightBlock(self, text):
>         if not self.dict:
>             return
>         text = unicode(text)
>         format = QTextCharFormat()
>         format.setUnderlineColor(Qt.red)
>         format.setUnderlineStyle(QTextCharFormat.SpellCheckUnderline)

The LOCALE flag is for locale-sensitive 1-byte per character bytestrings. It's rarely useful.

The UNICODE flag is for dealing with Unicode strings, which is what you need here. You shouldn't be using both at the same time!

>         unicode_pattern=re.compile(self.pattern,re.UNICODE|re.LOCALE)
>
>         for word_object in unicode_pattern.finditer(text):
>             if not self.dict.spell(word_object.group()):
>                 print word_object.group()
>                 self.setFormat(word_object.start(), word_object.end() - word_object.start(), format)
>
> But whenever I pass unicode values into my QTextEdit the re.finditer() does not seem to collect it.
>
> When I pass "I am a नेपाली" into the QTextEdit. The output is like this:
>
> I I I a I am I am I am a I am a I am a I am a I am a I am a I am a I am a
>
> It is completely ignoring the unicode. What might be the issue. I am new to PyQt and regex. Im using Python 2.7 and PyQt4.
>

There's an alternative regex implementation at:

http://pypi.python.org/pypi/regex

It's a drop-in replacement for the re module, but with a lot of additions, including better handling of Unicode.
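
If installing a third-party module is not an option, one possible workaround is to find the word boundaries yourself from the Unicode categories. The following is only a rough sketch, written for Python 3 with the standard library; iter_words is a made-up helper name, not part of re or regex. It treats letters, combining marks, digits and the underscore as word characters and yields the same start/end offsets that finditer() would give you for setFormat().

```python
import unicodedata


def iter_words(text):
    """Yield (start, end, word) for runs of letters, marks, numbers and '_'.

    Hypothetical helper for illustration only.
    """
    start = None
    for i, ch in enumerate(text):
        # Category codes starting with L, M or N are letters,
        # combining marks and numbers respectively.
        is_word = ch == "_" or unicodedata.category(ch)[0] in "LMN"
        if is_word and start is None:
            start = i
        elif not is_word and start is not None:
            yield start, i, text[start:i]
            start = None
    # Flush a word that runs to the end of the text.
    if start is not None:
        yield start, len(text), text[start:]


sample = "I am a \u0928\u0947\u092a\u093e\u0932\u0940"
words = list(iter_words(sample))
for start, end, word in words:
    print(start, end, word)
```

In highlightBlock() the loop body would then use start and end - start in place of word_object.start() and the length computed from word_object.end(). The offsets are Python string indices, so check that they line up with what Qt expects for your text.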