
MeCab UTF-8 Decoding Problem

Started by	fobos3@gmail.com
First post	2013-06-29 04:29 -0700
Last post	2013-06-29 16:20 +0000
Articles	7 — 6 participants


  MeCab UTF-8 Decoding Problem fobos3@gmail.com - 2013-06-29 04:29 -0700
    Re: MeCab UTF-8 Decoding Problem Giorgos Tzampanakis <giorgos.tzampanakis@gmail.com> - 2013-06-29 13:46 +0000
    Re: MeCab UTF-8 Decoding Problem Dave Angel <d@davea.name> - 2013-06-29 10:02 -0400
    Re: MeCab UTF-8 Decoding Problem Terry Reedy <tjreedy@udel.edu> - 2013-06-29 11:32 -0400
    Re: MeCab UTF-8 Decoding Problem Terry Reedy <tjreedy@udel.edu> - 2013-06-29 11:55 -0400
    Re: MeCab UTF-8 Decoding Problem MRAB <python@mrabarnett.plus.com> - 2013-06-29 17:12 +0100
    Re: MeCab UTF-8 Decoding Problem Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-29 16:20 +0000

#49412 — MeCab UTF-8 Decoding Problem

From	fobos3@gmail.com
Date	2013-06-29 04:29 -0700
Subject	MeCab UTF-8 Decoding Problem
Message-ID	<f4fe97e3-5949-4c52-97d7-4995b8891efd@googlegroups.com>

Hi,

I am trying to use a program called MeCab, which does syntax analysis on Japanese text. The problem I am having is that it returns a byte string and if I try to print it, it prints question marks for almost all characters. However, if I try to use .decode, it throws an error. Here is my code:

#!/usr/bin/python
# -*- coding:utf-8 -*-

import MeCab
tagger = MeCab.Tagger("-Owakati")
text = 'MeCabで遊んでみよう！'

result = tagger.parse(text)
print result

result = result.decode('utf-8')
print result

And here is the output:

MeCab �� �� ��んで�� �� ��う！ 

Traceback (most recent call last):
  File "test.py", line 11, in <module>
    result = result.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-7: invalid continuation byte


------------------
(program exited with code: 1)
Press return to continue

Also my terminal is able to display Japanese characters properly. For example print '日本語' works perfectly fine.

Any ideas?


#49418

From	Giorgos Tzampanakis <giorgos.tzampanakis@gmail.com>
Date	2013-06-29 13:46 +0000
Message-ID	<slrnkstpk5.f04.giorgos.tzampanakis@brilliance.eternal-september.org>
In reply to	#49412

On 2013-06-29, fobos3@gmail.com wrote:

> Hi,
>
> I am trying to use a program called MeCab, which does syntax analysis on
> Japanese text. The problem I am having is that it returns a byte string
> and if I try to print it, it prints question marks for almost all
> characters. However, if I try to use .decode, it throws an error. Here
> is my code:
>
> #!/usr/bin/python
> # -*- coding:utf-8 -*-
>
> import MeCab
> tagger = MeCab.Tagger("-Owakati")
> text = 'MeCabで遊んでみよう！'
>
> result = tagger.parse(text)
> print result
>
> result = result.decode('utf-8')
> print result
>
> And here is the output:
>
> MeCab �� �� ��んで�� �� ��う！
>
> Traceback (most recent call last):
>   File "test.py", line 11, in <module>
>     result = result.decode('utf-8')
>   File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
>     return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-7:
> invalid continuation byte
>
>
> ------------------
> (program exited with code: 1)
> Press return to continue
>

Find out what the output of tagger.parse is. Your program assumes it is a
bytestring that contains the utf-8 encoded representation of some text,
but it is obvious that this assumption is wrong.
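
One way to act on this is to ask Python where decoding stops and why, instead of looking at the printed text. The following is only a sketch, written for Python 3: `find_decode_failure` is a hypothetical helper name, and `sample` is a hand-made EUC-JP byte string standing in for whatever `tagger.parse` returns. It is not real MeCab output and says nothing about which encoding MeCab actually used here.

```python
def find_decode_failure(data, encoding="utf-8"):
    """Return None if data decodes cleanly, else (start, end, bad_bytes, reason)."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start and exc.end delimit the offending slice of exc.object.
        return exc.start, exc.end, exc.object[exc.start:exc.end], exc.reason
    return None


# Stand-in data (an assumption for illustration): text encoded as EUC-JP.
sample = u"MeCabで".encode("euc_jp")

print(repr(sample))                           # every byte is visible in the repr
print(find_decode_failure(sample, "euc_jp"))  # the matching codec decodes cleanly
print(find_decode_failure(sample, "utf-8"))   # the wrong codec reports where it failed
```

The position and reason in the result narrow down whether the data is in a different encoding altogether or is mostly valid with a few damaged bytes.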


-- 
Real (i.e. statistical) tennis and snooker player rankings and ratings:
http://www.statsfair.com/


#49421

From	Dave Angel <d@davea.name>
Date	2013-06-29 10:02 -0400
Message-ID	<mailman.3990.1372514590.3114.python-list@python.org>
In reply to	#49412

On 06/29/2013 07:29 AM, fobos3@gmail.com wrote:
> Hi,

Using Python 2.7 on Linux, presumably?  It'd be better to be explicit.

>
> I am trying to use a program called MeCab, which does syntax analysis on Japanese text. The problem I am having is that it returns a byte string and if I try to print it, it prints question marks for almost all characters. However, if I try to use .decode, it throws an error. Here is my code:

What do the MeCab docs say the tagger.parse byte string represents? 
Maybe it's not text at all.  But surely it's not utf-8.
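
If the docs do not settle it, a rough check is to try a few encodings commonly used for Japanese text and see which ones decode without error. This is a sketch under assumptions, written for Python 3: the candidate list is a guess rather than anything taken from the MeCab documentation, `decodable_as` is a hypothetical helper, and `sample` is stand-in data.

```python
# Candidate codecs (an assumption: common Japanese encodings in Python's stdlib).
CANDIDATES = ("utf-8", "euc_jp", "shift_jis", "iso2022_jp")


def decodable_as(data, candidates=CANDIDATES):
    """Return {encoding: decoded_text} for every candidate that decodes cleanly."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            pass  # this codec cannot interpret the bytes
    return results


# Stand-in for the tagger output: a short string encoded as EUC-JP.
sample = u"日本語".encode("euc_jp")
print(sorted(decodable_as(sample)))  # names of the codecs that accepted the bytes
```

A clean decode does not prove the guess is right, since more than one codec can accept the same bytes. The decoded text still has to be read and compared with the input.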

>
> #!/usr/bin/python
> # -*- coding:utf-8 -*-
>
> import MeCab
> tagger = MeCab.Tagger("-Owakati")
> text = 'MeCabで遊んでみよう！'
>
> result = tagger.parse(text)
> print result
>
> result = result.decode('utf-8')
> print result
>
> And here is the output:
>
> MeCab �� �� ��んで�� �� ��う！
>
> Traceback (most recent call last):
>    File "test.py", line 11, in <module>
>      result = result.decode('utf-8')
>    File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
>      return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-7: invalid continuation byte
>
>
> ------------------
> (program exited with code: 1)
> Press return to continue
>
> Also my terminal is able to display Japanese characters properly. For example print '日本語' works perfectly fine.

Are your terminal and your text editor using utf-8, or something else? 
Can you put your print statement in the source file above, and it'll 
also work fine?

Are you actually running it from the terminal, or some GUI?  I notice 
you get "(program exited with code: 1)" and "Press return to continue". 
Neither of those is standard terminal fare on any OS I know of.

-- 
DaveA


#49427

From	Terry Reedy <tjreedy@udel.edu>
Date	2013-06-29 11:32 -0400
Message-ID	<mailman.3994.1372519986.3114.python-list@python.org>
In reply to	#49412

On 6/29/2013 10:02 AM, Dave Angel wrote:
> On 06/29/2013 07:29 AM, fobos3@gmail.com wrote:
>> Hi,
>
> Using Python 2.7 on Linux, presumably?  It'd be better to be explicit.
>
>>
>> I am trying to use a program called MeCab, which does syntax analysis
>> on Japanese text.

It is generally nice to give a link when asking about 3rd party 
software.  https://code.google.com/p/mecab/
In this case, nearly all the non-boilerplate text is Japanese ;-(.

>> The problem I am having is that it returns a byte string

and the problem with bytes is that they can have any encoding.
In Python 2 (indicated by your print *statements*), a byte string is 
just a string.

>> and if I try to print it, it prints question marks for almost
>> all characters. However, if I try to use .decode, it throws an error.
>> Here is my code:
>
> What do the MeCab docs say the tagger.parse byte string represents?
> Maybe it's not text at all.  But surely it's not utf-8.

https://mecab.googlecode.com/svn/trunk/mecab/doc/index.html
MeCab: Yet Another Part-of-Speech and Morphological Analyzer
followed by Japanese.

>> #!/usr/bin/python
>> # -*- coding:utf-8 -*-
>>
>> import MeCab
>> tagger = MeCab.Tagger("-Owakati")
>> text = 'MeCabで遊んでみよう！'

Parts of this appear in the output, as indicated by spaces.
'MeCabで遊 んで みよ う！'

>> result = tagger.parse(text)
>> print result
>>
>> result = result.decode('utf-8')
>> print result
>>
>> And here is the output:
>>
>> MeCab �� �� ��んで�� �� ��う！

Python normally prints bytes with ascii chars representing either 
themselves or other values with hex escapes. This looks more like 
unicode sent to a terminal with a limited character set. I would add

print type(result)

to be sure.
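
A slightly fuller version of that check, as a sketch written for Python 3: `kind_of` is a hypothetical helper, and `raw` is stand-in data rather than the real return value of `tagger.parse`.

```python
def kind_of(value):
    """Classify a value as 'bytes', 'text', or 'other'."""
    if isinstance(value, bytes):      # on Python 2, bytes is an alias for str
        return "bytes"
    if isinstance(value, type(u"")):  # unicode on Python 2, str on Python 3
        return "text"
    return "other"


raw = u"日本語".encode("utf-8")  # stand-in for a value returned by a library

print(kind_of(raw))
print(repr(raw))                          # non-ASCII bytes show up as \x.. escapes
print(len(raw), len(raw.decode("utf-8")))  # byte count versus character count
```

The repr is useful because it does not depend on what the terminal can display.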

>> Traceback (most recent call last):
>>    File "test.py", line 11, in <module>
>>      result = result.decode('utf-8')
>>    File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
>>      return codecs.utf_8_decode(input, errors, True)
>> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-7:
>> invalid continuation byte
>>
>>
>> ------------------
>> (program exited with code: 1)
>> Press return to continue
>>
>> Also my terminal is able to display Japanese characters properly. For
>> example print '日本語' works perfectly fine.

-- 
Terry Jan Reedy


#49428

From	Terry Reedy <tjreedy@udel.edu>
Date	2013-06-29 11:55 -0400
Message-ID	<mailman.3995.1372521366.3114.python-list@python.org>
In reply to	#49412

On 6/29/2013 11:32 AM, Terry Reedy wrote:

>>> I am trying to use a program called MeCab, which does syntax analysis
>>> on Japanese text.
>
> It is generally nice to give a link when asking about 3rd party
> software.  https://code.google.com/p/mecab/
> In this case, nearly all the non-boilerplate text is Japanese ;-(.

My daughter translated the summary paragraph for me.

MeCab is an open source morphological analysis engine 
developed through a collaborative unit project between Kyoto 
University's Informatics Research Department and Nippon Telegraph and 
Telephone Corporation Communication Science Laboratories. Its 
fundamental premise is a design which is general-purpose and not reliant 
on a language, dictionary, or corpus. It uses Conditional Random Fields 
(CRF) for the estimation of the parameters, and has improved performance 
over ChaSen, which uses a hidden Markov model. In addition, on average 
it is faster than ChaSen, Juman, and KAKASI. Incidentally, the creator's 
favorite food is mekabu (thick leaves of wakame, a kind of edible 
seaweed, from near the root of the stalk).

-- 
Terry Jan Reedy


#49432

From	MRAB <python@mrabarnett.plus.com>
Date	2013-06-29 17:12 +0100
Message-ID	<mailman.3998.1372522376.3114.python-list@python.org>
In reply to	#49412

On 29/06/2013 12:29, fobos3@gmail.com wrote:
> Hi,
>
> I am trying to use a program called MeCab, which does syntax analysis on Japanese text. The problem I am having is that it returns a byte string and if I try to print it, it prints question marks for almost all characters. However, if I try to use .decode, it throws an error. Here is my code:
>
> #!/usr/bin/python
> # -*- coding:utf-8 -*-
>
> import MeCab
> tagger = MeCab.Tagger("-Owakati")
> text = 'MeCabで遊んでみよう！'

This is a bytestring. Are you sure it shouldn't be a Unicode string
instead, i.e. u'MeCabで遊んでみよう！'?

>
> result = tagger.parse(text)
> print result
>
> result = result.decode('utf-8')
> print result
>
> And here is the output:
>
> MeCab �� �� ��んで�� �� ��う！
>
> Traceback (most recent call last):
>    File "test.py", line 11, in <module>
>      result = result.decode('utf-8')
>    File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
>      return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-7: invalid continuation byte
>
>
> ------------------
> (program exited with code: 1)
> Press return to continue
>
> Also my terminal is able to display Japanese characters properly. For example print '日本語' works perfectly fine.
>
> Any ideas?
>


#49433

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-06-29 16:20 +0000
Message-ID	<51cf0948$0$29999$c3e8da3$5496439d@news.astraweb.com>
In reply to	#49412

On Sat, 29 Jun 2013 04:29:23 -0700, fobos3 wrote:

> Hi,
> 
> I am trying to use a program called MeCab, which does syntax analysis on
> Japanese text. The problem I am having is that it returns a byte string
> and if I try to print it, it prints question marks for almost all
> characters. However, if I try to use .decode, it throws an error. Here
> is my code:
> 
> #!/usr/bin/python
> # -*- coding:utf-8 -*-
> 
> import MeCab
> tagger = MeCab.Tagger("-Owakati")
> text = 'MeCabで遊んでみよう！'

I see from below you are using Python 2.7.

Here you are using a byte-string rather than Unicode. The actual bytes 
that you get *may* be indeterminate. I don't think that Python guarantees 
that just because the source file is declared as UTF-8, that *implicit* 
encoding into bytes will necessarily use UTF-8.

Even if it does, it is still better to use an explicit Unicode string, 
and explicitly encode into bytes using whatever encoding MeCab expects 
you to use, say:

text = u'MeCabで遊んでみよう！'.encode('utf-8')

By the way, what makes you think that MeCab expects, and returns, text 
encoded using UTF-8?

> result = tagger.parse(text)
> print result
> 
> result = result.decode('utf-8')
> print result
> 
> And here is the output:
> 
> MeCab �� �� ��んで�� �� ��う！

MeCab has returned a bunch of bytes, representing some text in some 
encoding. When you print those bytes, your terminal uses whatever its 
default encoding is (probably UTF-8, on a Linux system) and tries to make 
sense of the bytes, using � for any byte it cannot make sense of. This is 
good evidence that MeCab is *not* actually using UTF-8.
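
The effect is easy to reproduce without MeCab. The sketch below (Python 3) rests on an assumption made purely for illustration: that something inserts a space inside a multi-byte UTF-8 sequence. It shows that a lenient decoder substitutes U+FFFD for the damaged bytes while the strict decoder refuses them. It does not establish that this is what happened to the output above.

```python
REPLACEMENT = u"\ufffd"  # U+FFFD REPLACEMENT CHARACTER, what terminals draw as a boxed "?"

text = u"MeCabで遊んでみよう！"
utf8 = text.encode("utf-8")

# Assumption for illustration: a space is inserted at byte offset 7, which
# falls inside the three-byte UTF-8 sequence for the first 'で' (bytes 5 to 7).
broken = utf8[:7] + b" " + utf8[7:]

# Lenient decoding never raises; undecodable bytes become U+FFFD.
lenient = broken.decode("utf-8", errors="replace")

# Strict decoding (the default) raises on the same input.
try:
    broken.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

print(lenient.count(REPLACEMENT) > 0)  # replacement characters are present
print(strict_ok)                       # strict decoding did not succeed
```

The characters that were not split still decode normally, which is one way output can end up partly readable and partly replacement characters.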

And sure enough, when you try to decode it manually:

> Traceback (most recent call last):
>   File "test.py", line 11, in <module>
>     result = result.decode('utf-8')
>   File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
>     return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-7:
> invalid continuation byte

Assuming that the bytes being returned are *supposed* to be encoded in 
UTF-8, it's possible that MeCab is simply buggy and cannot produce proper 
UTF-8 encoded byte strings. This wouldn't surprise me -- after all, using 
*byte strings* as non-ASCII text strongly suggests that the author 
doesn't understand Unicode very well.

But perhaps more likely, MeCab isn't using UTF-8 at all. What does the 
documentation say?

A third possibility is that the string you feed to MeCab is simply 
mangled beyond recognition due to the way you create it using the 
implicit encoding from chars to bytes. Change the line

text = 'MeCab ...'

to use an explicit Unicode string and encode, as above, and maybe the 
error will go away.
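
Put together, that means keeping text as Unicode inside the program and converting only at the boundary with the library. A minimal sketch for Python 3, assuming the tagger accepts and returns byte strings as in the Python 2 setup of this thread: `EchoTagger` is a hypothetical stand-in so the example runs without MeCab installed, `parse_text` is a hypothetical helper, and the right value for `encoding` is whatever MeCab is configured to use, which this thread has not established.

```python
class EchoTagger(object):
    """Hypothetical stand-in for a tagger: takes bytes, returns bytes."""

    def parse(self, data):
        if not isinstance(data, bytes):
            raise TypeError("expected a byte string")
        return data + b"\n"


def parse_text(tagger, text, encoding):
    """Encode Unicode text for the tagger and decode its reply the same way."""
    raw_in = text.encode(encoding)   # explicit, never implicit
    raw_out = tagger.parse(raw_in)
    return raw_out.decode(encoding)  # raises if the reply is not in `encoding`


result = parse_text(EchoTagger(), u"MeCabで遊んでみよう！", "utf-8")
print(len(result))
```

With a real tagger, a `UnicodeDecodeError` from the last step would indicate that the reply is not valid in the assumed encoding.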

-- 
Steven

