
how to detect the encoding used for a specific text data ?

Started by	iMath <redstone-cold@163.com>
First post	2012-12-20 03:57 -0800
Last post	2012-12-21 09:14 -0500
Articles	11 — 7 participants


  how to detect the encoding used for a specific text data ? iMath <redstone-cold@163.com> - 2012-12-20 03:57 -0800
    Re: how to detect the encoding used for a specific text data ? iMath <redstone-cold@163.com> - 2012-12-20 04:06 -0800
    Re: how to detect the encoding used for a specific text data ? Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2012-12-20 14:48 +0200
    Re: how to detect the encoding used for a specific text data ? "Stefan H. Holek" <stefan@epy.co.at> - 2012-12-20 14:17 +0100
      Re: how to detect the encoding used for a specific text data ? iMath <redstone-cold@163.com> - 2012-12-20 05:50 -0800
        Re: how to detect the encoding used for a specific text data ? Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2012-12-20 16:10 +0200
      Re: how to detect the encoding used for a specific text data ? iMath <redstone-cold@163.com> - 2012-12-20 05:50 -0800
    Re: how to detect the encoding used for a specific text data ? Christian Heimes <christian@python.org> - 2012-12-20 15:19 +0100
    Re: how to detect the encoding used for a specific text data ? rurpy@yahoo.com - 2012-12-20 09:48 -0800
    Re: how to detect the encoding used for a specific text data ? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-12-21 12:38 +0000
    Re: how to detect the encoding used for a specific text data ? Dave Angel <d@davea.name> - 2012-12-21 09:14 -0500

#35190 — how to detect the encoding used for a specific text data ?

From	iMath <redstone-cold@163.com>
Date	2012-12-20 03:57 -0800
Subject	how to detect the encoding used for a specific text data ?
Message-ID	<c6eeb756-65be-4c50-88a8-1f94bd772fe8@googlegroups.com>

 how to detect the encoding used for a specific text data ?


#35191

From	iMath <redstone-cold@163.com>
Date	2012-12-20 04:06 -0800
Message-ID	<18e8de2e-3741-4089-b46f-7390683e01c4@googlegroups.com>
In reply to	#35190

On Thursday, December 20, 2012 at 7:57:19 PM UTC+8, iMath wrote:
> how to detect the encoding used for a specific text data ?

On windows XP


#35192

From	Jussi Piitulainen <jpiitula@ling.helsinki.fi>
Date	2012-12-20 14:48 +0200
Message-ID	<qot4njgyjmj.fsf@ruuvi.it.helsinki.fi>
In reply to	#35190

iMath writes:

>  how to detect the encoding used for a specific text data ?

The practical thing to do is to try an encoding and see whether you
find the expected frequent letters of the relevant languages in the
decoded text, or the most frequent words. This is likely to help you
decide between some of the most common encodings. Some decoding
attempts may even raise an exception, which should be a clue.
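
A minimal sketch of the "try an encoding and see" step, for illustration only. The candidate list, the helper names, and the Finnish sample are assumptions made here, not something from the original post. Note that latin-1 accepts every byte sequence, so a successful decode there proves nothing by itself; the letter count is the part that carries information.

```python
# Hypothetical candidate list; pick the encodings plausible for your data.
CANDIDATES = ["utf-8", "cp1252", "latin-1"]

def decodable_as(data, candidates=CANDIDATES):
    """Return (encoding, text) for every candidate that decodes without error."""
    results = []
    for name in candidates:
        try:
            results.append((name, data.decode(name)))
        except UnicodeDecodeError:
            # A failed decode rules this candidate out.
            continue
    return results

def count_expected(text, letters):
    """Count how many characters of text belong to the expected letters."""
    return sum(ch in letters for ch in text)

# Sample bytes, assumed to come from a Latin-1 source.
sample = "hyvää päivää".encode("latin-1")
for name, text in decodable_as(sample):
    print(name, repr(text), count_expected(text, "äö"))
```

Candidates that survive and also show the expected frequent letters are the ones worth a closer look.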

Strictly speaking, it cannot be done with complete certainty. There
are lots of Finnish texts that are identical whether you think they
are in Latin-1 or Latin-9. A further text from the same source might
still reveal the difference, so the distinction matters.

Short Finnish texts might also be identical whether you think they are
in Latin-1 or UTF-8, but the situation is different: a couple of
frequent letters turn into nonsense in the wrong encoding. It's easy
to tell at a glance.
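
To make the two cases concrete, here is a small sketch. The sample strings are chosen here for illustration; the codec names are the standard Python ones for Latin-1 (iso-8859-1) and Latin-9 (iso-8859-15).

```python
text = "hyvää päivää"  # illustrative Finnish sample

# Latin-1 and Latin-9 give the same bytes here, because the letter ä
# has the same code in both, so this text cannot tell them apart.
assert text.encode("iso-8859-1") == text.encode("iso-8859-15")

# A later text with a euro sign would reveal the difference:
# the Latin-9 byte for the euro sign is the currency sign in Latin-1.
euro = "€".encode("iso-8859-15")
wrong_euro = euro.decode("iso-8859-1")
print(wrong_euro)

# Pure ASCII text is identical in Latin-1 and UTF-8 ...
assert "talo".encode("iso-8859-1") == "talo".encode("utf-8")

# ... but UTF-8 bytes for ä turn into two nonsense characters when
# they are read as Latin-1.
mojibake = text.encode("utf-8").decode("iso-8859-1")
print(mojibake)
```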

Sometimes texts declare their encoding. That should be a clue, but in
practice the declaration may be false. Sometimes there is a stray
character that violates the declared or assumed encoding, or a part of
the text is in one encoding and another part in another. Bad source.
You decide how important it is to deal with the mess. (This only
happens in the real world.)

Good luck.


#35194

From	"Stefan H. Holek" <stefan@epy.co.at>
Date	2012-12-20 14:17 +0100
Message-ID	<mailman.1097.1356010989.29569.python-list@python.org>
In reply to	#35190

On 20.12.2012, at 12:57, iMath wrote:

> how to detect the encoding used for a specific text data ?

http://pypi.python.org/pypi?%3Aaction=search&term=detect+encoding

-- 
Stefan H. Holek
stefan@epy.co.at


#35195

From	iMath <redstone-cold@163.com>
Date	2012-12-20 05:50 -0800
Message-ID	<88f8c2ea-a217-4d0a-84a1-6de3a433d7ce@googlegroups.com>
In reply to	#35194

which package to use ?


#35197

From	Jussi Piitulainen <jpiitula@ling.helsinki.fi>
Date	2012-12-20 16:10 +0200
Message-ID	<qotzk18x1a7.fsf@ruuvi.it.helsinki.fi>
In reply to	#35195

iMath writes:

> which package to use ?

Read the text in as a "bytes object" (bytes); it has a .decode method
that you can experiment with. Strings (str) are Unicode and have an
.encode method. Both methods let you specify the desired encoding and
what to do when there are errors.

help(bytes.decode)
help(str.encode)
help(open)
<http://docs.python.org/3.3/library/stdtypes.html>

In Python 2.7 and before, strings seem to do double duty and have both
the .encode and .decode methods, so Python version matters here.
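
A short Python 3 sketch of those two methods and the errors argument. The sample bytes are made up for illustration.

```python
data = "päivää".encode("utf-8")     # str -> bytes
text = data.decode("utf-8")         # bytes -> str
assert text.encode("utf-8") == data

# Latin-1 bytes are not valid UTF-8, so decoding them shows what the
# errors argument does.
bad = b"p\xe4iv\xe4\xe4"

replaced = bad.decode("utf-8", errors="replace")  # marks bad bytes with U+FFFD
ignored = bad.decode("utf-8", errors="ignore")    # silently drops bad bytes
print(repr(replaced), repr(ignored))

try:
    bad.decode("utf-8")              # the default, errors="strict", raises
except UnicodeDecodeError as exc:
    print("strict decode failed:", exc.reason)
```

The open function takes the same encoding and errors arguments when a file is opened in text mode.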


#35196

From	iMath <redstone-cold@163.com>
Date	2012-12-20 05:50 -0800
Message-ID	<mailman.1098.1356011427.29569.python-list@python.org>
In reply to	#35194

which package to use ?


#35198

From	Christian Heimes <christian@python.org>
Date	2012-12-20 15:19 +0100
Message-ID	<mailman.1099.1356013179.29569.python-list@python.org>
In reply to	#35190

On 20.12.2012 12:57, iMath wrote:
>  how to detect the encoding used for a specific text data ?

You can't.

It's not possible unless the file format can specify the encoding
somehow, e.g. like XML's header <?xml version="1.0" encoding="UTF-8"?>.
Sometimes you can try and make an educated guess. But it's just a guess
and it may give you wrong results.
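
For the XML case, a rough sketch of reading the declared encoding from the start of the data. The helper name and the regular expression are assumptions made for this example; the pattern only works when the declaration itself is in an ASCII-compatible encoding, and a real XML parser does this job more thoroughly.

```python
import re

# Hypothetical pattern: matches an XML declaration at the very start of
# the data and captures the value of its encoding attribute.
_DECL = re.compile(rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')

def declared_xml_encoding(data):
    """Return the encoding named in the XML declaration, or None."""
    match = _DECL.match(data)
    return match.group(1).decode("ascii") if match else None

doc = b'<?xml version="1.0" encoding="UTF-8"?><root/>'
print(declared_xml_encoding(doc))
print(declared_xml_encoding(b"<root/>"))
```

The declaration is still only a claim made by the file, so decoding with the declared encoding can fail.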

Christian


#35208

From	rurpy@yahoo.com
Date	2012-12-20 09:48 -0800
Message-ID	<6ae211f3-2199-4163-9844-1b2cfa9e0215@googlegroups.com>
In reply to	#35190

On Thursday, December 20, 2012 4:57:19 AM UTC-7, iMath wrote:
> how to detect the encoding used for a specific text data ?

The chardet package will probably do what you want:
  http://pypi.python.org/pypi/chardet
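
A minimal usage sketch, assuming chardet is installed (it is a third-party package, not part of the standard library). The wrapper function and the sample text are made up for illustration; chardet.detect takes bytes and returns a dictionary that includes the guessed encoding and a confidence value.

```python
try:
    import chardet  # third-party package
except ImportError:
    chardet = None  # the sketch degrades to "no guess" without it

def guess_encoding(data):
    """Return (encoding, confidence) from chardet, or None if chardet is missing."""
    if chardet is None:
        return None
    result = chardet.detect(data)
    return result["encoding"], result["confidence"]

sample = "hyvää päivää, tämä on pieni testi".encode("utf-8")
print(guess_encoding(sample))
```

The result is a statistical guess, so short inputs tend to give low confidence or a wrong answer.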


#35285

From	Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date	2012-12-21 12:38 +0000
Message-ID	<mailman.1144.1356093497.29569.python-list@python.org>
In reply to	#35190

On 20 December 2012 11:57, iMath <redstone-cold@163.com> wrote:
>  how to detect the encoding used for a specific text data ?

Normally encoding is given in some way by the context of the data.
Otherwise no general solution is possible.

On a related note: how to answer question with no context on mailing list?


#35292

From	Dave Angel <d@davea.name>
Date	2012-12-21 09:14 -0500
Message-ID	<mailman.1149.1356099297.29569.python-list@python.org>
In reply to	#35190

On 12/21/2012 07:38 AM, Oscar Benjamin wrote:
> <snip>
> On a related note: how to answer question with no context on mailing
> list? 

Depends on how you're reading/responding.  I'll assume you're using an
email client like Thunderbird, and that you do NOT subscribe in digest form.

Most general way is to use Reply-All, and remove any recipients you
don't want there, but make sure you keep the python-list recipient.

Alternatively, if you're using Thunderbird or another with similar
capability, use Reply-list, which is smart enough to only keep the list
entry.

Or, what I used to do, reply, then add the python-list@python.org to the
list of recipients.  That's error prone.

I hope this answers your question.

-- 

DaveA

