From: Peter Otten <__peter__@web.de>
To: python-list@python.org
Subject: Re: encoding error in python 27
Date: Sun, 24 Feb 2013 09:34:57 +0100
Newsgroups: comp.lang.python

Hala Gamal wrote:

> thank you :) it worked well for a small file, but when I enter a big file I
> obtain this error:
>
> Traceback (most recent call last):
>   File "D:\Python27\yarab (4).py", line 46, in
>     writer.add_document(**doc)
>   File "build\bdist.win32\egg\whoosh\filedb\filewriting.py", line 369, in add_document
>     items = field.index(value)
>   File "build\bdist.win32\egg\whoosh\fields.py", line 466, in index
>     return [(txt, 1, 1.0, '') for txt in self._tiers(num)]
>   File "build\bdist.win32\egg\whoosh\fields.py", line 454, in _tiers
>     yield self.to_text(num, shift=shift)
>   File "build\bdist.win32\egg\whoosh\fields.py", line 487, in to_text
>     return self._to_text(self.prepare_number(x), shift=shift,
>   File "build\bdist.win32\egg\whoosh\fields.py", line 476, in prepare_number
>     x = self.type(x)
> UnicodeEncodeError: 'decimal' codec can't encode characters in position
> 0-4: invalid decimal Unicode string
>
> I don't really know where the problem is.
>
> On Friday, February 22, 2013 4:55:22 PM UTC+2, Hala Gamal wrote:
>> My code works well with an English file, but when I use a text file
>> encoded as UTF-8 that contains some Arabic letters, it doesn't work.

I guess that one of the fields you require to be NUMERIC contains non-digit
characters.
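To make that guess concrete, here is a minimal sketch of how a numeric conversion can reject text. It assumes the field converts its value with int(), which is only inferred from the `x = self.type(x)` line in the traceback. The helper name is_integer_text and the sample strings are made up for illustration. The sketch is written for Python 3; as far as I know Python 2.7 reports non-ASCII, non-digit text as UnicodeEncodeError, which is a subclass of ValueError, so the same except clause should cover both.

```python
def is_integer_text(value):
    """Return True if int() accepts the text, False otherwise."""
    try:
        int(value)
    except ValueError:  # UnicodeEncodeError is a subclass of ValueError
        return False
    return True


# Hypothetical sample values, not taken from the original data file.
samples = [
    u"42",                              # plain ASCII digits
    u"\u0661\u0662\u0663",              # Arabic-Indic digits
    u"\u0645\u0631\u062d\u0628\u0627",  # Arabic letters, no digits
    u"12a",                             # digits mixed with a letter
    u"",                                # empty field
]

for sample in samples:
    print(repr(sample), is_integer_text(sample))
```

Running a check like this over the values destined for the NUMERIC field, before calling writer.add_document(**doc), would show which input lines need fixing.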
Replace the line

>> writer.add_document(**doc)

with something similar to

    try:
        writer.add_document(**doc)
    except UnicodeEncodeError:
        print "Skipping malformed line", repr(i)

This will allow you to inspect the lines your script cannot handle, and if
they are indeed "malformed", as I am guessing, you can fix your input data.

i is a terrible name for a line in a file, btw. Also, you should avoid
readlines(), which reads the whole file into memory, and instead iterate over
the file object directly:

    with codecs.open("tt.txt", encoding='utf-8-sig') as textfile:
        for line in textfile:  # no readlines(), can handle
                               # text files of arbitrary size
            ...
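Both suggestions can be combined into one loop. The following is only a sketch under stated assumptions: index_lines and demo_handler are hypothetical names, demo_handler stands in for whatever builds doc and calls writer.add_document(**doc), and io.open is used in place of codecs.open (both accept an encoding argument). It catches ValueError, the parent class of UnicodeEncodeError, and is written for Python 3.

```python
import io
import os
import tempfile


def index_lines(path, handle_line):
    """Pass each line of a UTF-8 file to handle_line(); return rejected lines."""
    skipped = []
    # 'utf-8-sig' decodes UTF-8 and drops a leading byte order mark if present.
    with io.open(path, encoding="utf-8-sig") as textfile:
        # Iterating over the file object reads one line at a time.
        for lineno, line in enumerate(textfile, 1):
            try:
                handle_line(line)
            except ValueError:  # includes UnicodeEncodeError
                skipped.append((lineno, line))
    return skipped


def demo_handler(line):
    # Hypothetical stand-in for building doc and calling
    # writer.add_document(**doc); it only tries a numeric conversion.
    int(line.strip())


# Build a small throwaway input file for the demonstration.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(path, "w", encoding="utf-8") as outfile:
    outfile.write(u"1\n\u0645\u0631\u062d\u0628\u0627\n3\n")

skipped = index_lines(path, demo_handler)
os.remove(path)

for lineno, line in skipped:
    print("Skipping malformed line", lineno, repr(line))
```

Collecting the rejected lines with their line numbers, instead of only printing them, makes it easier to go back and repair the input file afterwards.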