Groups > comp.lang.python > #35190 > unrolled thread
| Started by | iMath <redstone-cold@163.com> |
|---|---|
| First post | 2012-12-20 03:57 -0800 |
| Last post | 2012-12-21 09:14 -0500 |
| Articles | 11 — 7 participants |
how to detect the encoding used for a specific text data ? iMath <redstone-cold@163.com> - 2012-12-20 03:57 -0800
Re: how to detect the encoding used for a specific text data ? iMath <redstone-cold@163.com> - 2012-12-20 04:06 -0800
Re: how to detect the encoding used for a specific text data ? Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2012-12-20 14:48 +0200
Re: how to detect the encoding used for a specific text data ? "Stefan H. Holek" <stefan@epy.co.at> - 2012-12-20 14:17 +0100
Re: how to detect the encoding used for a specific text data ? iMath <redstone-cold@163.com> - 2012-12-20 05:50 -0800
Re: how to detect the encoding used for a specific text data ? Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2012-12-20 16:10 +0200
Re: how to detect the encoding used for a specific text data ? iMath <redstone-cold@163.com> - 2012-12-20 05:50 -0800
Re: how to detect the encoding used for a specific text data ? Christian Heimes <christian@python.org> - 2012-12-20 15:19 +0100
Re: how to detect the encoding used for a specific text data ? rurpy@yahoo.com - 2012-12-20 09:48 -0800
Re: how to detect the encoding used for a specific text data ? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-12-21 12:38 +0000
Re: how to detect the encoding used for a specific text data ? Dave Angel <d@davea.name> - 2012-12-21 09:14 -0500
| From | iMath <redstone-cold@163.com> |
|---|---|
| Date | 2012-12-20 03:57 -0800 |
| Subject | how to detect the encoding used for a specific text data ? |
| Message-ID | <c6eeb756-65be-4c50-88a8-1f94bd772fe8@googlegroups.com> |
how to detect the encoding used for a specific text data ?
| From | iMath <redstone-cold@163.com> |
|---|---|
| Date | 2012-12-20 04:06 -0800 |
| Message-ID | <18e8de2e-3741-4089-b46f-7390683e01c4@googlegroups.com> |
| In reply to | #35190 |
On Thursday, December 20, 2012 at 7:57:19 PM UTC+8, iMath wrote:
> how to detect the encoding used for a specific text data ?

On Windows XP
| From | Jussi Piitulainen <jpiitula@ling.helsinki.fi> |
|---|---|
| Date | 2012-12-20 14:48 +0200 |
| Message-ID | <qot4njgyjmj.fsf@ruuvi.it.helsinki.fi> |
| In reply to | #35190 |
iMath writes:
> how to detect the encoding used for a specific text data ?

The practical thing to do is to try an encoding and see whether you find the expected frequent letters of the relevant languages in the decoded text, or the most frequent words. This is likely to help you decide between some of the most common encodings. Some decoding attempts may even raise an exception, which should be a clue.

Strictly speaking, it cannot be done with complete certainty. There are lots of Finnish texts that are identical whether you think they are in Latin-1 or Latin-9. A further text from the same source might still reveal the difference, so the distinction matters.

Short Finnish texts might also be identical whether you think they are in Latin-1 or UTF-8, but the situation is different: a couple of frequent letters turn into nonsense in the wrong encoding. It's easy to tell at a glance.

Sometimes texts declare their encoding. That should be a clue, but in practice the declaration may be false. Sometimes there is a stray character that violates the declared or assumed encoding, or a part of the text is in one encoding and another part in another. Bad source. You decide how important it is to deal with the mess. (This only happens in the real world.)

Good luck.
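The trial-decoding approach described above can be sketched roughly as follows. This is only an illustration, not anything posted in the thread: the helper name `try_decodings`, the candidate list, and the Finnish sample string are assumptions chosen for the example.

```python
def try_decodings(data, candidates=("utf-8", "iso-8859-1", "iso-8859-15")):
    """Return {encoding: decoded text} for every candidate that decodes cleanly.

    Hypothetical helper; the candidate list is an assumption and should be
    adapted to the languages you expect.
    """
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            # A failed decode is itself a clue: the data is not in this encoding.
            continue
    return results


# A short Finnish text stored as Latin-1 bytes.
sample = "Hyvää päivää".encode("iso-8859-1")
decoded = try_decodings(sample)

# UTF-8 rejects these bytes, while Latin-1 and Latin-9 give the same text,
# so this sample alone cannot tell those two apart.
print(sorted(decoded))
print(decoded["iso-8859-1"] == decoded["iso-8859-15"])

# The opposite mistake: UTF-8 bytes read as Latin-1 decode without error,
# but the frequent letter "ä" turns into the two-character nonsense "Ã¤".
mojibake = "päivää".encode("utf-8").decode("iso-8859-1")
print(mojibake)
```

A clean decode only means the bytes are valid in that encoding, not that the guess is right; the decoded text still has to be checked for plausible letters and words.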
| From | "Stefan H. Holek" <stefan@epy.co.at> |
|---|---|
| Date | 2012-12-20 14:17 +0100 |
| Message-ID | <mailman.1097.1356010989.29569.python-list@python.org> |
| In reply to | #35190 |
On 20.12.2012, at 12:57, iMath wrote:
> how to detect the encoding used for a specific text data ?

http://pypi.python.org/pypi?%3Aaction=search&term=detect+encoding

-- 
Stefan H. Holek
stefan@epy.co.at
| From | iMath <redstone-cold@163.com> |
|---|---|
| Date | 2012-12-20 05:50 -0800 |
| Message-ID | <88f8c2ea-a217-4d0a-84a1-6de3a433d7ce@googlegroups.com> |
| In reply to | #35194 |
which package to use ?
| From | Jussi Piitulainen <jpiitula@ling.helsinki.fi> |
|---|---|
| Date | 2012-12-20 16:10 +0200 |
| Message-ID | <qotzk18x1a7.fsf@ruuvi.it.helsinki.fi> |
| In reply to | #35195 |
iMath writes:
> which package to use ?

Read the text in as a "bytes object" (bytes), then it has a .decode method that you can experiment with. Strings (str) are Unicode and have an .encode method. These methods allow you to specify a desired encoding and what to do when there are errors.

help(bytes.decode)
help(str.encode)
help(open)

<http://docs.python.org/3.3/library/stdtypes.html>

In Python 2.7 and before, strings seem to do double duty and have both the .encode and .decode methods, so Python version matters here.
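A small Python 3 sketch of that experiment, reading a file as bytes and decoding it with different error handlers. The temporary file, its contents, and the variable names are made up so the example runs on its own.

```python
import os
import tempfile

# Create a small Latin-1 file so the example is self-contained (assumed data).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write("päivää".encode("iso-8859-1"))

# Opening in binary mode ("rb") yields a bytes object, not a str.
with open(path, "rb") as f:
    raw = f.read()

# The default error handler is "strict": a wrong guess raises an exception.
strict_failed = False
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    strict_failed = True

# Other handlers keep going: "replace" inserts U+FFFD, "ignore" drops bad bytes.
replaced = raw.decode("utf-8", errors="replace")
ignored = raw.decode("utf-8", errors="ignore")

# With the right encoding the text comes back intact, and str.encode reverses it.
text = raw.decode("iso-8859-1")
round_trip = text.encode("iso-8859-1")
```

The "replace" and "ignore" handlers are convenient for a first look at unknown data, but both lose information, so "strict" is the safer choice once the encoding is known.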
| From | iMath <redstone-cold@163.com> |
|---|---|
| Date | 2012-12-20 05:50 -0800 |
| Message-ID | <mailman.1098.1356011427.29569.python-list@python.org> |
| In reply to | #35194 |
which package to use ?
| From | Christian Heimes <christian@python.org> |
|---|---|
| Date | 2012-12-20 15:19 +0100 |
| Message-ID | <mailman.1099.1356013179.29569.python-list@python.org> |
| In reply to | #35190 |
On 20.12.2012 12:57, iMath wrote:
> how to detect the encoding used for a specific text data ?

You can't. It's not possible unless the file format can specify the encoding somehow, e.g. like XML's header <?xml version="1.0" encoding="UTF-8"?>. Sometimes you can try and make an educated guess. But it's just a guess and it may give you wrong results.

Christian
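For formats that do declare their encoding, the declaration can be read from the raw bytes before decoding. The following is a rough sketch for the XML case only; the function name and the regular expression are assumptions, it handles only ASCII-compatible encodings (not UTF-16, for instance), and a real XML parser does this more carefully.

```python
import re

# Looks for encoding="..." inside an XML declaration at the very start of the data.
_XML_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_xml_encoding(data):
    """Return the encoding named in the XML declaration, or None if there is none.

    Hypothetical helper for illustration; it does not validate the document.
    """
    match = _XML_DECL.match(data[:200])
    return match.group(1).decode("ascii") if match else None


doc = '<?xml version="1.0" encoding="UTF-8"?><note>päivää</note>'.encode("utf-8")
name = declared_xml_encoding(doc)

# The declaration is only a claim, so decoding can still fail if it is false.
text = doc.decode(name) if name else None
```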
| From | rurpy@yahoo.com |
|---|---|
| Date | 2012-12-20 09:48 -0800 |
| Message-ID | <6ae211f3-2199-4163-9844-1b2cfa9e0215@googlegroups.com> |
| In reply to | #35190 |
On Thursday, December 20, 2012 4:57:19 AM UTC-7, iMath wrote:
> how to detect the encoding used for a specific text data ?

The chardet package will probably do what you want:
http://pypi.python.org/pypi/chardet
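A minimal sketch of how chardet is typically called, assuming the package is installed (`pip install chardet`). The wrapper `guess_encoding` and the sample text are made up for the example; `chardet.detect` takes bytes and returns a dictionary that includes an `encoding` and a `confidence` entry.

```python
try:
    import chardet  # third-party package, not part of the standard library
except ImportError:
    chardet = None


def guess_encoding(data):
    """Return (encoding, confidence) as guessed by chardet.

    Hypothetical wrapper; returns None when chardet is not installed.
    """
    if chardet is None:
        return None
    result = chardet.detect(data)
    return result["encoding"], result["confidence"]


guess = guess_encoding("Hyvää päivää, mitä kuuluu?".encode("utf-8"))
print(guess)
```

The result is a statistical guess with a confidence value, so short or ambiguous inputs can easily be misidentified, which is consistent with the caveats raised elsewhere in the thread.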
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
|---|---|
| Date | 2012-12-21 12:38 +0000 |
| Message-ID | <mailman.1144.1356093497.29569.python-list@python.org> |
| In reply to | #35190 |
On 20 December 2012 11:57, iMath <redstone-cold@163.com> wrote:
> how to detect the encoding used for a specific text data ?

Normally encoding is given in some way by the context of the data. Otherwise no general solution is possible.

On a related note: how to answer question with no context on mailing list?
| From | Dave Angel <d@davea.name> |
|---|---|
| Date | 2012-12-21 09:14 -0500 |
| Message-ID | <mailman.1149.1356099297.29569.python-list@python.org> |
| In reply to | #35190 |
On 12/21/2012 07:38 AM, Oscar Benjamin wrote:
> <snip>
> On a related note: how to answer question with no context on mailing
> list?

Depends on how you're reading/responding. I'll assume you're using an email client like Thunderbird, and that you do NOT subscribe in digest form.

Most general way is to use Reply-All, and remove any recipients you don't want there, but make sure you keep the python-list recipient.

Alternatively, if you're using Thunderbird or another with similar capability, use Reply-list, which is smart enough to only keep the list entry.

Or, what I used to do, reply, then add the python-list@python.org to the list of recipients. That's error prone.

I hope this answers your question.

-- 
DaveA