
re.finditer() skips unicode into selection

Started by	akshay.ksth@gmail.com
First post	2013-06-26 12:18 -0700
Last post	2013-06-26 21:25 -0700
Articles	7 — 4 participants


  re.finditer() skips unicode into selection akshay.ksth@gmail.com - 2013-06-26 12:18 -0700
    Re: re.finditer() skips unicode into selection Terry Reedy <tjreedy@udel.edu> - 2013-06-26 16:14 -0400
    Re: re.finditer() skips unicode into selection MRAB <python@mrabarnett.plus.com> - 2013-06-26 21:24 +0100
    Re: re.finditer() skips unicode into selection darpan6aya <akshay.ksth@gmail.com> - 2013-06-26 20:26 -0700
      Re: re.finditer() skips unicode into selection darpan6aya <akshay.ksth@gmail.com> - 2013-06-26 20:31 -0700
        Re: re.finditer() skips unicode into selection darpan6aya <akshay.ksth@gmail.com> - 2013-06-26 20:39 -0700
          Re: re.finditer() skips unicode into selection darpan6aya <akshay.ksth@gmail.com> - 2013-06-26 21:25 -0700

#49271 — re.finditer() skips unicode into selection

From	akshay.ksth@gmail.com
Date	2013-06-26 12:18 -0700
Subject	re.finditer() skips unicode into selection
Message-ID	<1d412b4a-b043-4723-909a-54c6c06cebd4@googlegroups.com>

I am using the following Highlighter class for Spell Checking to work on my QTextEdit.

class Highlighter(QSyntaxHighlighter):
    pattern = ur'\w+'
    def __init__(self, *args):
        QSyntaxHighlighter.__init__(self, *args)
        self.dict = None

    def setDict(self, dict):
        self.dict = dict

    def highlightBlock(self, text):
        if not self.dict:
            return
        text = unicode(text)
        format = QTextCharFormat()
        format.setUnderlineColor(Qt.red)
        format.setUnderlineStyle(QTextCharFormat.SpellCheckUnderline)
        unicode_pattern=re.compile(self.pattern,re.UNICODE|re.LOCALE)

        for word_object in unicode_pattern.finditer(text):
            if not self.dict.spell(word_object.group()):
                print word_object.group()
                self.setFormat(word_object.start(), word_object.end() - word_object.start(), format)

But whenever I pass unicode values into my QTextEdit, re.finditer() does not seem to match them.

When I type "I am a नेपाली" into the QTextEdit, the output looks like this:

    I I I a I am I am I am a I am a I am a I am a I am a I am a I am a I am a

It is completely ignoring the unicode. What might be the issue? I am new to PyQt and regex. I'm using Python 2.7 and PyQt4.


#49274

From	Terry Reedy <tjreedy@udel.edu>
Date	2013-06-26 16:14 -0400
Message-ID	<mailman.3899.1372277713.3114.python-list@python.org>
In reply to	#49271

On 6/26/2013 3:18 PM, akshay.ksth@gmail.com wrote:
> I am using the following Highlighter class for Spell Checking to work on my QTextEdit.
>
> class Highlighter(QSyntaxHighlighter):
>      pattern = ur'\w+'
>      def __init__(self, *args):
>          QSyntaxHighlighter.__init__(self, *args)
>          self.dict = None
>
>      def setDict(self, dict):
>          self.dict = dict
>
>      def highlightBlock(self, text):
>          if not self.dict:
>              return
>          text = unicode(text)
>          format = QTextCharFormat()
>          format.setUnderlineColor(Qt.red)
>          format.setUnderlineStyle(QTextCharFormat.SpellCheckUnderline)
>          unicode_pattern=re.compile(self.pattern,re.UNICODE|re.LOCALE)
>
>          for word_object in unicode_pattern.finditer(text):
>              if not self.dict.spell(word_object.group()):
>                  print word_object.group()
>                  self.setFormat(word_object.start(), word_object.end() - word_object.start(), format)
>
> But whenever I pass unicode values into my QTextEdit the re.finditer() does not seem to collect it.
>
> When I pass "I am a नेपाली" into the QTextEdit. The output is like this:
>
>      I I I a I am I am I am a I am a I am a I am a I am a I am a I am a I am a
>
> It is completely ignoring the unicode.

The whole text is unicode. It is ignoring the non-ASCII characters, as you
asked it to with re.LOCALE.

With 3.3.2:
import re

pattern = re.compile(r'\w+', re.LOCALE)
text = "I am a नेपाली"

for word in pattern.finditer(text):
     print(word.group())
 >>>
I
am
a

Delete ', re.LOCALE' and the following are also printed:
न
प
ल

There is an issue on the tracker about the vowel marks in नेपाली being 
mis-seen as word separators, but that is another issue.

Lesson: when you do not understand output, simplify code to see what 
changes. Separating re issues from framework issues is a big step in 
that direction.
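
On a current Python 3 the experiment above cannot be rerun verbatim: since 3.6, as far as I know, re.LOCALE is rejected for str patterns. A minimal sketch of the same comparison, assuming re.ASCII as a stand-in for the ASCII-only behaviour (my substitution, not what was run above):

```python
import re

# "I am a नेपाली", with the Nepali word spelled out as code points.
text = "I am a \u0928\u0947\u092a\u093e\u0932\u0940"

# re.ASCII restricts \w to [a-zA-Z0-9_], so only the English words match.
ascii_words = re.findall(r"\w+", text, re.ASCII)

# Without the flag, \w is Unicode-aware for str patterns.
unicode_words = re.findall(r"\w+", text)

print(ascii_words)
print(unicode_words)
```

The first list holds only the ASCII words; the second also picks up Devanagari letters.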

> What might be the issue. I am new to PyQt and regex. Im using Python
> 2.7 and PyQt4.

-- 
Terry Jan Reedy


#49275

From	MRAB <python@mrabarnett.plus.com>
Date	2013-06-26 21:24 +0100
Message-ID	<mailman.3900.1372278293.3114.python-list@python.org>
In reply to	#49271

On 26/06/2013 20:18, akshay.ksth@gmail.com wrote:
> I am using the following Highlighter class for Spell Checking to work on my QTextEdit.
>
> class Highlighter(QSyntaxHighlighter):

In Python 2.7, the re module has a somewhat limited idea of what a
"word" character is. It recognises 'DEVANAGARI LETTER NA' as a letter,
but 'DEVANAGARI VOWEL SIGN E' as a diacritic. The pattern ur'(?u)\w+'
will therefore split "नेपाली" into 3 parts.
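
To see the distinction for yourself, unicodedata reports the general category of each code point. A small sketch in Python 3 syntax, with the word written as escapes so the code points are unambiguous:

```python
import unicodedata

# "नेपाली" written as explicit code points.
word = "\u0928\u0947\u092a\u093e\u0932\u0940"

# Lo = letter, Mn / Mc = combining marks (the vowel signs).
categories = [unicodedata.category(ch) for ch in word]

for ch, cat in zip(word, categories):
    print("U+%04X  %-2s  %s" % (ord(ch), cat, unicodedata.name(ch)))
```

The letters come back as Lo and the vowel signs as Mn or Mc, which is why a \w that only accepts letters and digits breaks the word at each vowel sign.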

>      pattern = ur'\w+'
>      def __init__(self, *args):
>          QSyntaxHighlighter.__init__(self, *args)
>          self.dict = None
>
>      def setDict(self, dict):
>          self.dict = dict
>
>      def highlightBlock(self, text):
>          if not self.dict:
>              return
>          text = unicode(text)
>          format = QTextCharFormat()
>          format.setUnderlineColor(Qt.red)
>          format.setUnderlineStyle(QTextCharFormat.SpellCheckUnderline)

The LOCALE flag is for locale-sensitive 1-byte per character
bytestrings. It's rarely useful.

The UNICODE flag is for dealing with Unicode strings, which is what you
need here. You shouldn't be using both at the same time!

>          unicode_pattern=re.compile(self.pattern,re.UNICODE|re.LOCALE)
>
>          for word_object in unicode_pattern.finditer(text):
>              if not self.dict.spell(word_object.group()):
>                  print word_object.group()
>                  self.setFormat(word_object.start(), word_object.end() - word_object.start(), format)
>
> But whenever I pass unicode values into my QTextEdit the re.finditer() does not seem to collect it.
>
> When I pass "I am a नेपाली" into the QTextEdit. The output is like this:
>
>      I I I a I am I am I am a I am a I am a I am a I am a I am a I am a I am a
>
> It is completely ignoring the unicode. What might be the issue. I am new to PyQt and regex. Im using Python 2.7 and PyQt4.
>
There's an alternative regex implementation at:

http://pypi.python.org/pypi/regex

It's a drop-in replacement for the re module, but with a lot of
additions, including better handling of Unicode.
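
If adding a third-party dependency is not an option, one stdlib-only sketch is to scan the text and count combining marks as word characters. The helper names is_word_char and iter_words are made up for this example, and the choice of categories is an assumption of mine, not a description of what the regex module does internally. Python 3 syntax:

```python
import unicodedata


def is_word_char(ch):
    """Hypothetical helper: letters, numbers, combining marks, connectors."""
    cat = unicodedata.category(ch)
    return cat[0] in "LN" or cat in ("Mn", "Mc", "Pc")


def iter_words(text):
    """Yield (start, end, word) for each run of word characters."""
    start = None
    for i, ch in enumerate(text):
        if is_word_char(ch):
            if start is None:
                start = i          # a new word begins here
        elif start is not None:
            yield start, i, text[start:i]
            start = None
    if start is not None:          # word running to the end of the text
        yield start, len(text), text[start:]


# "I am a नेपाली", with the Nepali word spelled out as code points.
text = "I am a \u0928\u0947\u092a\u093e\u0932\u0940"
words = list(iter_words(text))
for start, end, word in words:
    print(start, end, word)
```

The start and end values are code-point offsets into the string, so they can play the same role as word_object.start() and word_object.end() in highlightBlock().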


#49293

From	darpan6aya <akshay.ksth@gmail.com>
Date	2013-06-26 20:26 -0700
Message-ID	<8c3a7439-742d-41fb-b19a-ad54ea3580c6@googlegroups.com>
In reply to	#49271

Thanks MRAB, your suggestion worked. But then it raised an error:

    'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

I corrected this by encoding the word to 'utf-8'. The code looks like this now:

        pattern = ur'(?u)\w+' 

        def __init__(self, *args):
            QSyntaxHighlighter.__init__(self, *args)
            self.dict = None
            
        def setDict(self, dict):
            self.dict = dict
            
        def highlightBlock(self, text):
            if not self.dict:
                return
            text = unicode(text)
            format = QTextCharFormat()
            format.setUnderlineColor(Qt.red)
            format.setUnderlineStyle(QTextCharFormat.SpellCheckUnderline)
            
            unicode_pattern=re.compile(self.pattern,re.UNICODE)
                    
            for word_object in unicode_pattern.finditer(text):
                if not self.dict.spell(word_object.group().encode('utf-8')):
                    print word_object.group().encode('utf-8')
                    self.setFormat(word_object.start(), word_object.end() - word_object.start(), format)
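
For anyone hitting the same error: in Python 2, handing a unicode object to code that expects a byte string typically triggers an implicit encode with the 'ascii' codec, which fails on any non-ASCII character. A minimal sketch of the explicit version in Python 3 syntax (that the spell checker wants UTF-8 bytes is an assumption taken from the fix above):

```python
# "नेपाली" written as explicit code points.
word = "\u0928\u0947\u092a\u093e\u0932\u0940"

try:
    word.encode("ascii")           # what the implicit conversion attempts
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    print(exc)

utf8_bytes = word.encode("utf-8")  # explicit encode, as in the fix above
print(len(word), len(utf8_bytes))
```

The offsets passed to setFormat() should still come from the unicode string, as they do above, because the encoded length differs (6 characters but 18 bytes here).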

The problem now is that all the vowel signs are separated from the root word, so that if you type मेरो, the म and े are printed separately. (The े appears as a box instead.) What am I doing wrong?

Like this.

मेरो नाम रुपा हो।


#49294

From	darpan6aya <akshay.ksth@gmail.com>
Date	2013-06-26 20:31 -0700
Message-ID	<f82346b7-8ab2-459a-bb56-9d7fc9fee7b2@googlegroups.com>
In reply to	#49293

> 
> मेरो नाम रुपा हो।

^^ Sorry, this didn't come out as I expected. Ignore it.


#49295

From	darpan6aya <akshay.ksth@gmail.com>
Date	2013-06-26 20:39 -0700
Message-ID	<04042736-acf4-4ebe-beab-42c75d5a9260@googlegroups.com>
In reply to	#49294

Here's a screenshot: http://i41.tinypic.com/35002rr.png


#49298

From	darpan6aya <akshay.ksth@gmail.com>
Date	2013-06-26 21:25 -0700
Message-ID	<10d70775-6f3f-484a-a76d-164b099c4397@googlegroups.com>
In reply to	#49295

Thanks MRAB, your alternative regex implementation worked flawlessly.
It works now.

