Re: encoding error in python 27

From	Peter Otten <__peter__@web.de>
Subject	Re: encoding error in python 27
Date	2013-02-24 09:34 +0100
Organization	None
References	<a3d3d352-c170-4165-9552-741869106830@googlegroups.com> <86c880ca-ab2d-4406-832a-129235cf59bd@googlegroups.com>
Newsgroups	comp.lang.python
Message-ID	<mailman.2397.1361694874.2939.python-list@python.org>


Hala Gamal wrote:

> thank you :)it worked well for small file but when i enter big file,, i
> obtain this error: "Traceback (most recent call last):
>   File "D:\Python27\yarab (4).py", line 46, in <module>
>     writer.add_document(**doc)
>   File "build\bdist.win32\egg\whoosh\filedb\filewriting.py", line 369, in
>   add_document
>     items = field.index(value)
>   File "build\bdist.win32\egg\whoosh\fields.py", line 466, in index
>     return [(txt, 1, 1.0, '') for txt in self._tiers(num)]
>   File "build\bdist.win32\egg\whoosh\fields.py", line 454, in _tiers
>     yield self.to_text(num, shift=shift)
>   File "build\bdist.win32\egg\whoosh\fields.py", line 487, in to_text
>     return self._to_text(self.prepare_number(x), shift=shift,
>   File "build\bdist.win32\egg\whoosh\fields.py", line 476, in
>   prepare_number
>     x = self.type(x)
> UnicodeEncodeError: 'decimal' codec can't encode characters in position
> 0-4: invalid decimal Unicode string" i don't know realy where is the
> problem? On Friday, February 22, 2013 4:55:22 PM UTC+2, Hala Gamal wrote:
>> my code works well with english file but when i use text file
>> encodede"utf-8" "my file contain some arabic letters" it doesn't work.

I guess that one of the fields you require to be NUMERIC contains non-digit 
characters. Replace the line

>>       writer.add_document(**doc)

with something similar to

         try:
             writer.add_document(**doc)
         except UnicodeEncodeError:
             print "Skipping malformed line", repr(i) 

This will let you inspect the lines your script cannot handle. If they are 
indeed "malformed", as I am guessing, you can fix your input data.
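If you would rather find the offending values before they reach the writer, a small check along these lines may help. It is only a sketch: find_bad_numeric_values() is a hypothetical helper, not part of Whoosh, and it assumes the NUMERIC fields use int as their number type (the traceback shows the field calling self.type(x), so adjust if yours is float).

```python
def find_bad_numeric_values(doc, numeric_fields):
    """Return (field, value) pairs whose value int() cannot convert.

    Hypothetical helper, not a Whoosh API. Assumes the numeric type is int.
    """
    bad = []
    for name in numeric_fields:
        value = doc.get(name)
        if value is None:
            continue  # field missing from this document, nothing to check
        try:
            int(value)
        except ValueError:
            # UnicodeEncodeError is a subclass of ValueError, so this also
            # covers the "'decimal' codec can't encode" case on Python 2.
            bad.append((name, value))
    return bad


# Made-up sample document: "pages" holds Arabic letters instead of digits.
doc = {u"id": u"12", u"pages": u"\u0643\u062a\u0627\u0628", u"title": u"x"}
for name, value in find_bad_numeric_values(doc, [u"id", u"pages"]):
    print(repr(name) + " has a non-numeric value: " + repr(value))
```

Note that int() accepts any Unicode decimal digits, so a value written in Arabic-Indic digits converts fine; it is letters or stray punctuation in a NUMERIC field that trigger the error.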

i is a terrible name for a line in a file, btw. Also, you should avoid 
readlines(), which reads the whole file into memory, and instead iterate over 
the file object directly:

import codecs

with codecs.open("tt.txt", encoding='utf-8-sig') as textfile:
    for line in textfile: # no readlines(), can handle 
                          # text files of arbitrary size
        ...
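To make the "Skipping malformed line" message easier to act on, the same lazy iteration can carry a line number along. The following is a rough sketch, not the original script: iter_lines() is a hypothetical name, and it uses io.open() in place of codecs.open() because io.open() behaves the same way on Python 2.7 and Python 3.

```python
import io


def iter_lines(path, encoding="utf-8-sig"):
    """Yield (line_number, text) pairs without loading the whole file.

    Hypothetical helper. The 'utf-8-sig' codec drops a leading BOM if
    one is present, so the first line does not start with u'\\ufeff'.
    """
    with io.open(path, encoding=encoding) as textfile:
        # enumerate() numbers the lines from 1 as they are read
        for lineno, line in enumerate(textfile, 1):
            yield lineno, line.rstrip(u"\r\n")
```

In the indexing loop you would then write something like "for lineno, line in iter_lines("tt.txt"):" and include lineno in the message printed when a document is skipped.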


Thread

encoding error in python 27 Hala Gamal <halagamal2009@gmail.com> - 2013-02-22 06:55 -0800
  Re: encoding error in python 27 Peter Otten <__peter__@web.de> - 2013-02-22 16:40 +0100
  Re: encoding error in python 27 MRAB <python@mrabarnett.plus.com> - 2013-02-22 17:35 +0000
  Re: encoding error in python 27 Hala Gamal <halagamal2009@gmail.com> - 2013-02-23 20:31 -0800
    Re: encoding error in python 27 Peter Otten <__peter__@web.de> - 2013-02-24 09:34 +0100
