Re: re.finditer() skips unicode into selection

Date Wed, 26 Jun 2013 21:24:52 +0100
From MRAB <python@mrabarnett.plus.com>

On 26/06/2013 20:18, akshay.ksth@gmail.com wrote:
> I am using the following Highlighter class for Spell Checking to work on my QTextEdit.
>
> class Highlighter(QSyntaxHighlighter):

In Python 2.7, the re module has a somewhat limited idea of what a
"word" character is. It recognises 'DEVANAGARI LETTER NA' as a letter,
but treats 'DEVANAGARI VOWEL SIGN E' as a diacritic, which \w does not
match. The pattern ur'(?u)\w+' will therefore split "नेपाली" into 3
parts, one per consonant, and leave out the vowel signs.
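
To see where the breaks come from, it can help to list the Unicode
general category of each code point in the word. This is only a
diagnostic sketch using the standard unicodedata module; the names WORD
and describe are made up for illustration and are not part of the
original code.

```python
# -*- coding: utf-8 -*-
import unicodedata

# The word from the original post, written with escapes so the source
# file encoding does not matter.
WORD = u"\u0928\u0947\u092a\u093e\u0932\u0940"


def describe(text):
    """Return (character name, general category) for each code point."""
    return [(unicodedata.name(ch), unicodedata.category(ch)) for ch in text]


for name, category in describe(WORD):
    # Categories starting with 'L' are letters; 'Mn' and 'Mc' are
    # combining marks, which is where the \w+ match gets interrupted.
    print("%-28s %s" % (name, category))
```

The letters come out as 'Lo' and the vowel signs as 'Mn' or 'Mc', so
each vowel sign ends one \w+ match and the next consonant starts another.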

>      pattern = ur'\w+'
>      def __init__(self, *args):
>          QSyntaxHighlighter.__init__(self, *args)
>          self.dict = None
>
>      def setDict(self, dict):
>          self.dict = dict
>
>      def highlightBlock(self, text):
>          if not self.dict:
>              return
>          text = unicode(text)
>          format = QTextCharFormat()
>          format.setUnderlineColor(Qt.red)
>          format.setUnderlineStyle(QTextCharFormat.SpellCheckUnderline)

The LOCALE flag is for locale-sensitive bytestrings with 1 byte per
character. It's rarely useful.

The UNICODE flag is for dealing with Unicode strings, which is what you
need here. You shouldn't be using both at the same time!
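
A minimal sketch of the compile step with only the UNICODE flag might
look like the following. WORD_PATTERN and find_words are hypothetical
names, and the (start, length, word) tuples simply mirror the arguments
the quoted code passes to setFormat(). Note that this fixes the flag
combination only; Devanagari words will still be split as described
above.

```python
# -*- coding: utf-8 -*-
import re

# Only re.UNICODE is passed; re.LOCALE is dropped because the text is a
# Unicode string, not a locale-encoded bytestring.
WORD_PATTERN = re.compile(u"\\w+", re.UNICODE)


def find_words(text):
    """Return (start, length, word) for every \\w+ match in text."""
    return [(m.start(), m.end() - m.start(), m.group())
            for m in WORD_PATTERN.finditer(text)]


print(find_words(u"I am a"))
```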

>          unicode_pattern=re.compile(self.pattern,re.UNICODE|re.LOCALE)
>
>          for word_object in unicode_pattern.finditer(text):
>              if not self.dict.spell(word_object.group()):
>                  print word_object.group()
>                  self.setFormat(word_object.start(), word_object.end() - word_object.start(), format)
>
> But whenever I pass unicode values into my QTextEdit the re.finditer() does not seem to collect it.
>
> When I pass "I am a नेपाली" into the QTextEdit. The output is like this:
>
>      I I I a I am I am I am a I am a I am a I am a I am a I am a I am a I am a
>
> It is completely ignoring the unicode. What might be the issue. I am new to PyQt and regex. Im using Python 2.7 and PyQt4.
>
There's an alternative regex implementation at:

http://pypi.python.org/pypi/regex

It's a drop-in replacement for the re module, but with a lot of
additions, including better handling of Unicode.
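
If installing another module isn't an option, one possible workaround
is to find the words without a regex at all and count combining marks
as word characters. This is a rough sketch under that assumption, not
what the regex module does internally and not a full implementation of
Unicode word boundaries; is_word_char and iter_words are hypothetical
helper names.

```python
# -*- coding: utf-8 -*-
import unicodedata


def is_word_char(ch):
    """Treat letters (L*), combining marks (M*), numbers (N*) and '_' as word characters."""
    return ch == u"_" or unicodedata.category(ch)[0] in u"LMN"


def iter_words(text):
    """Yield (start, end, word) for each run of word characters in text."""
    start = None
    for i, ch in enumerate(text):
        if is_word_char(ch):
            if start is None:
                start = i  # a new word begins here
        elif start is not None:
            yield start, i, text[start:i]
            start = None
    if start is not None:
        # The text ended while still inside a word.
        yield start, len(text), text[start:]


sample = u"I am a \u0928\u0947\u092a\u093e\u0932\u0940"
for start, end, word in iter_words(sample):
    print("%d-%d" % (start, end))
```

In highlightBlock() the loop over unicode_pattern.finditer(text) could
then be replaced by a loop over iter_words(text), passing start and
end - start to setFormat().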
