Groups > comp.lang.python > #49412 > unrolled thread
| Started by | fobos3@gmail.com |
|---|---|
| First post | 2013-06-29 04:29 -0700 |
| Last post | 2013-06-29 16:20 +0000 |
| Articles | 7 — 6 participants |
MeCab UTF-8 Decoding Problem fobos3@gmail.com - 2013-06-29 04:29 -0700
Re: MeCab UTF-8 Decoding Problem Giorgos Tzampanakis <giorgos.tzampanakis@gmail.com> - 2013-06-29 13:46 +0000
Re: MeCab UTF-8 Decoding Problem Dave Angel <d@davea.name> - 2013-06-29 10:02 -0400
Re: MeCab UTF-8 Decoding Problem Terry Reedy <tjreedy@udel.edu> - 2013-06-29 11:32 -0400
Re: MeCab UTF-8 Decoding Problem Terry Reedy <tjreedy@udel.edu> - 2013-06-29 11:55 -0400
Re: MeCab UTF-8 Decoding Problem MRAB <python@mrabarnett.plus.com> - 2013-06-29 17:12 +0100
Re: MeCab UTF-8 Decoding Problem Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-29 16:20 +0000
| From | fobos3@gmail.com |
|---|---|
| Date | 2013-06-29 04:29 -0700 |
| Subject | MeCab UTF-8 Decoding Problem |
| Message-ID | <f4fe97e3-5949-4c52-97d7-4995b8891efd@googlegroups.com> |
Hi,
I am trying to use a program called MeCab, which does syntax analysis on Japanese text. The problem I am having is that it returns a byte string, and if I try to print it, it prints question marks for almost all characters. However, if I try to use .decode, it throws an error. Here is my code:
#!/usr/bin/python
# -*- coding:utf-8 -*-

import MeCab
tagger = MeCab.Tagger("-Owakati")
text = 'MeCabで遊んでみよう!'

result = tagger.parse(text)
print result

result = result.decode('utf-8')
print result
And here is the output:
MeCab �� �� ��んで�� �� ��う!
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    result = result.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-7: invalid continuation byte
------------------
(program exited with code: 1)
Press return to continue
Also my terminal is able to display Japanese characters properly. For example print '日本語' works perfectly fine.
Any ideas?
| From | Giorgos Tzampanakis <giorgos.tzampanakis@gmail.com> |
|---|---|
| Date | 2013-06-29 13:46 +0000 |
| Message-ID | <slrnkstpk5.f04.giorgos.tzampanakis@brilliance.eternal-september.org> |
| In reply to | #49412 |
On 2013-06-29, fobos3@gmail.com wrote:
> Hi,
>
> I am trying to use a program called MeCab, which does syntax analysis on
> Japanese text. The problem I am having is that it returns a byte string
> and if I try to print it, it prints question marks for almost all
> characters. However, if I try to use .decode, it throws an error. Here
> is my code:
>
> #!/usr/bin/python
> # -*- coding:utf-8 -*-
>
> import MeCab
> tagger = MeCab.Tagger("-Owakati")
> text = 'MeCabで遊んでみよう!'
>
> result = tagger.parse(text)
> print result
>
> result = result.decode('utf-8')
> print result
>
> And here is the output:
>
> MeCab �� �� ��んで�� �� ��う!
>
> Traceback (most recent call last):
> File "test.py", line 11, in <module>
> result = result.decode('utf-8')
> File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-7:
> invalid continuation byte
>
>
> ------------------
> (program exited with code: 1)
> Press return to continue
>
Find out what the output of tagger.parse is. Your program assumes it is a
bytestring that contains the utf-8 encoded representation of some text,
but it is obvious that this assumption is wrong.
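
One way to follow this advice is to look at the raw bytes rather than at whatever the terminal renders. A minimal sketch: the hard-coded byte string is only a stand-in for the real tagger.parse() result (an assumption, not actual MeCab output), and describe() is a hypothetical helper name.

```python
# -*- coding: utf-8 -*-
# Stand-in for the value returned by tagger.parse(text); the real bytes may differ.
result = u'MeCab で'.encode('utf-8')

def describe(data):
    """Return the type name and a space-separated hex dump of a byte string."""
    hex_dump = ' '.join('%02x' % b for b in bytearray(data))
    return type(data).__name__, hex_dump

kind, dump = describe(result)
print(kind)   # 'str' on Python 2, 'bytes' on Python 3
print(dump)
```

Comparing the hex dump with the byte sequences of known encodings (UTF-8, EUC-JP, Shift_JIS) is usually more informative than the printed text, because the terminal's own decoding is no longer involved.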
--
Real (i.e. statistical) tennis and snooker player rankings and ratings:
http://www.statsfair.com/
| From | Dave Angel <d@davea.name> |
|---|---|
| Date | 2013-06-29 10:02 -0400 |
| Message-ID | <mailman.3990.1372514590.3114.python-list@python.org> |
| In reply to | #49412 |
On 06/29/2013 07:29 AM, fobos3@gmail.com wrote:
> Hi,
Using Python 2.7 on Linux, presumably? It'd be better to be explicit.
>
> I am trying to use a program called MeCab, which does syntax analysis on Japanese text. The problem I am having is that it returns a byte string and if I try to print it, it prints question marks for almost all characters. However, if I try to use .decode, it throws an error. Here is my code:
What do the MeCab docs say the tagger.parse byte string represents?
Maybe it's not text at all. But surely it's not utf-8.
>
> #!/usr/bin/python
> # -*- coding:utf-8 -*-
>
> import MeCab
> tagger = MeCab.Tagger("-Owakati")
> text = 'MeCabで遊んでみよう!'
>
> result = tagger.parse(text)
> print result
>
> result = result.decode('utf-8')
> print result
>
> And here is the output:
>
> MeCab �� �� ��んで�� �� ��う!
>
> Traceback (most recent call last):
> File "test.py", line 11, in <module>
> result = result.decode('utf-8')
> File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-7: invalid continuation byte
>
>
> ------------------
> (program exited with code: 1)
> Press return to continue
>
> Also my terminal is able to display Japanese characters properly. For example print '日本語' works perfectly fine.
Are your terminal and your text editor using utf-8, or something else?
Can you put your print statement in the source file above, and it'll
also work fine?
Are you actually running it from the terminal, or some GUI? I notice
you get "(program exited with code: 1)" and "Press return to continue".
Neither of those is standard terminal fare on any OS I know of.
--
DaveA
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2013-06-29 11:32 -0400 |
| Message-ID | <mailman.3994.1372519986.3114.python-list@python.org> |
| In reply to | #49412 |
On 6/29/2013 10:02 AM, Dave Angel wrote:
> On 06/29/2013 07:29 AM, fobos3@gmail.com wrote:
>> Hi,
>
> Using Python 2.7 on Linux, presumably? It'd be better to be explicit.
>
>>
>> I am trying to use a program called MeCab, which does syntax analysis
>> on Japanese text.
It is generally nice to give a link when asking about 3rd party
software. https://code.google.com/p/mecab/
In this case, nearly all the non-boilerplate text is Japanese ;-(.
>> The problem I am having is that it returns a byte string
and the problem with bytes is that they can have any encoding.
In Python 2 (indicated by your print *statements*), a byte string is
just a string.
>> and if I try to print it, it prints question marks for almost
>> all characters. However, if I try to use .decode, it throws an error.
>> Here is my code:
>
> What do the MeCab docs say the tagger.parse byte string represents?
> Maybe it's not text at all. But surely it's not utf-8.
https://mecab.googlecode.com/svn/trunk/mecab/doc/index.html
MeCab: Yet Another Part-of-Speech and Morphological Analyzer
followed by Japanese.
>> #!/usr/bin/python
>> # -*- coding:utf-8 -*-
>>
>> import MeCab
>> tagger = MeCab.Tagger("-Owakati")
>> text = 'MeCabで遊んでみよう!'
Parts of this appear in the output, as indicated by spaces.
'MeCabで遊 んで みよ う!'
>> result = tagger.parse(text)
>> print result
>>
>> result = result.decode('utf-8')
>> print result
>>
>> And here is the output:
>>
>> MeCab �� �� ��んで�� �� ��う!
Python normally prints bytes with ascii chars representing either
themselves or other values with hex escapes. This looks more like
unicode sent to a terminal with a limited character set. I would add
print type(result)
to be sure.
>> Traceback (most recent call last):
>> File "test.py", line 11, in <module>
>> result = result.decode('utf-8')
>> File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
>> return codecs.utf_8_decode(input, errors, True)
>> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-7:
>> invalid continuation byte
>>
>>
>> ------------------
>> (program exited with code: 1)
>> Press return to continue
>>
>> Also my terminal is able to display Japanese characters properly. For
>> example print '日本語' works perfectly fine.
--
Terry Jan Reedy
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2013-06-29 11:55 -0400 |
| Message-ID | <mailman.3995.1372521366.3114.python-list@python.org> |
| In reply to | #49412 |
On 6/29/2013 11:32 AM, Terry Reedy wrote:
>>> I am trying to use a program called MeCab, which does syntax analysis
>>> on Japanese text.
>
> It is generally nice to give a link when asking about 3rd party
> software. https://code.google.com/p/mecab/
> In this case, nearly all the non-boilerplate text is Japanese ;-(.
My daughter translated the summary paragraph for me.
MeCab is an open source morphological analysis engine developed
through a collaborative unit project between Kyoto University's
Informatics Research Department and Nippon Telegraph and Telephone
Corporation Communication Science Laboratories. Its fundamental premise
is a design which is general-purpose and not reliant on a language,
dictionary, or corpus. It uses Conditional Random Fields (CRF) for the
estimation of the parameters, and has improved performance over ChaSen,
which uses a hidden Markov model. In addition, on average it is faster
than ChaSen, Juman, and KAKASI. Incidentally, the creator's favorite
food is mekabu (thick leaves of wakame, a kind of edible seaweed, from
near the root of the stalk).
--
Terry Jan Reedy
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2013-06-29 17:12 +0100 |
| Message-ID | <mailman.3998.1372522376.3114.python-list@python.org> |
| In reply to | #49412 |
On 29/06/2013 12:29, fobos3@gmail.com wrote:
> Hi,
>
> I am trying to use a program called MeCab, which does syntax analysis on Japanese text. The problem I am having is that it returns a byte string and if I try to print it, it prints question marks for almost all characters. However, if I try to use .decode, it throws an error. Here is my code:
>
> #!/usr/bin/python
> # -*- coding:utf-8 -*-
>
> import MeCab
> tagger = MeCab.Tagger("-Owakati")
> text = 'MeCabで遊んでみよう!'
This is a bytestring. Are you sure it shouldn't be a Unicode string
instead, i.e. u'MeCabで遊んでみよう!'?
>
> result = tagger.parse(text)
> print result
>
> result = result.decode('utf-8')
> print result
>
> And here is the output:
>
> MeCab �� �� ��んで�� �� ��う!
>
> Traceback (most recent call last):
> File "test.py", line 11, in <module>
> result = result.decode('utf-8')
> File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-7: invalid continuation byte
>
>
> ------------------
> (program exited with code: 1)
> Press return to continue
>
> Also my terminal is able to display Japanese characters properly. For example print '日本語' works perfectly fine.
>
> Any ideas?
>
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-06-29 16:20 +0000 |
| Message-ID | <51cf0948$0$29999$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #49412 |
On Sat, 29 Jun 2013 04:29:23 -0700, fobos3 wrote:
> Hi,
>
> I am trying to use a program called MeCab, which does syntax analysis on
> Japanese text. The problem I am having is that it returns a byte string
> and if I try to print it, it prints question marks for almost all
> characters. However, if I try to use .decode, it throws an error. Here
> is my code:
>
> #!/usr/bin/python
> # -*- coding:utf-8 -*-
>
> import MeCab
> tagger = MeCab.Tagger("-Owakati")
> text = 'MeCabで遊んでみよう!'
I see from below you are using Python 2.7.
Here you are using a byte-string rather than Unicode. The actual bytes
that you get *may* be indeterminate. I don't think that Python guarantees
that just because the source file is declared as UTF-8, that *implicit*
encoding into bytes will necessarily use UTF-8.
Even if it does, it is still better to use an explicit Unicode string,
and explicitly encode into bytes using whatever encoding MeCab expects
you to use, say:
text = u'MeCabで遊んでみよう!'.encode('utf-8')
By the way, what makes you think that MeCab expects, and returns, text
encoded using UTF-8?
> result = tagger.parse(text)
> print result
>
> result = result.decode('utf-8')
> print result
>
> And here is the output:
>
> MeCab �� �� ��んで�� �� ��う!
MeCab has returned a bunch of bytes, representing some text in some
encoding. When you print those bytes, your terminal uses whatever its
default encoding is (probably UTF-8, on a Linux system) and tries to make
sense of the bytes, using � for any byte it cannot make sense of. This is
good evidence that MeCab is *not* actually using UTF-8.
And sure enough, when you try to decode it manually:
> Traceback (most recent call last):
> File "test.py", line 11, in <module>
> result = result.decode('utf-8')
> File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-7:
> invalid continuation byte
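
Both symptoms, replacement characters when the bytes are displayed and a UnicodeDecodeError on a strict decode, can be reproduced without MeCab. The sketch below assumes, purely for illustration, that the producer emits EUC-JP; nothing in this thread confirms which encoding MeCab actually used here.

```python
# -*- coding: utf-8 -*-
text = u'日本語'

# Assumption for illustration only: the producer emits EUC-JP, not UTF-8.
euc_bytes = text.encode('euc_jp')

# Lenient decoding substitutes U+FFFD for undecodable bytes, roughly what a
# UTF-8 terminal does when asked to display them.
shown = euc_bytes.decode('utf-8', 'replace')
print(u'\ufffd' in shown)

# Strict decoding of the same bytes raises instead of substituting.
try:
    euc_bytes.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
print(strict_failed)

# Decoding with the codec that produced the bytes recovers the text.
print(euc_bytes.decode('euc_jp') == text)
```

The same pattern would appear with any mismatch between the producer's encoding and the consumer's, so seeing it does not by itself identify which encoding is in play.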
Assuming that the bytes being returned are *supposed* to be encoded in
UTF-8, it's possible that MeCab is simply buggy and cannot produce proper
UTF-8 encoded byte strings. This wouldn't surprise me -- after all, using
*byte strings* as non-ASCII text strongly suggests that the author
doesn't understand Unicode very well.
But perhaps more likely, MeCab isn't using UTF-8 at all. What does the
documentation say?
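
If the documentation is hard to read, a rough empirical check is to try a few likely encodings in turn. This is only a sketch: guess_encoding() is a hypothetical helper, the candidate list is a guess at common Japanese encodings, and the sample bytes are a stand-in for real tagger.parse() output.

```python
# -*- coding: utf-8 -*-
def guess_encoding(data, candidates=('utf-8', 'euc_jp', 'shift_jis')):
    """Return (codec, text) for the first candidate that decodes cleanly, else None."""
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    return None

# Hypothetical stand-in for the bytes returned by tagger.parse().
sample = u'みよう'.encode('euc_jp')
print(guess_encoding(sample))
```

A clean decode is not proof that the codec is the right one, since some byte sequences are valid in more than one encoding. Treat the result as a hint to confirm against the documentation or the dictionary configuration.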
A third possibility is that the string you feed to MeCab is simply
mangled beyond recognition due to the way you create it using the
implicit encoding from chars to bytes. Change the line
text = 'MeCab ...'
to use an explicit Unicode string and encode, as above, and maybe the
error will go away.
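
The "explicit Unicode string, explicit encode" approach can be wrapped so that bytes only exist at the boundary with the library. A minimal sketch under stated assumptions: ENCODING is a placeholder for whatever charset MeCab is actually configured for, parse_text() is a hypothetical helper, and fake_parse() is a stand-in that merely echoes its input in place of tagger.parse.

```python
# -*- coding: utf-8 -*-
ENCODING = 'utf-8'  # assumption: replace with the charset MeCab really uses

def parse_text(parse, text, encoding=ENCODING):
    """Encode text, pass the bytes to a bytes-in/bytes-out parser, decode the result."""
    raw = parse(text.encode(encoding))
    return raw.decode(encoding)

def fake_parse(data):
    # Hypothetical stand-in for tagger.parse; it simply returns its input.
    return data

original = u'MeCabで遊んでみよう!'
print(parse_text(fake_parse, original) == original)
```

With the real tagger, the call would be parse_text(tagger.parse, text), and a decode error at that point would indicate that ENCODING does not match what the library returns.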
--
Steven