How to support a non-standard encoding?

Started by	Ivan <ivan@llaisdy.com>
First post	2012-01-06 10:03 +0000
Last post	2012-01-08 12:16 +0100
Articles	7 — 5 participants

  How to support a non-standard encoding? Ivan <ivan@llaisdy.com> - 2012-01-06 10:03 +0000
    Re: How to support a non-standard encoding? Tim Wintle <tim.wintle@teamrubber.com> - 2012-01-06 13:47 +0000
    Re: How to support a non-standard encoding? Ivan Uemlianin <ivan@llaisdy.com> - 2012-01-06 14:03 +0000
    Re: How to support a non-standard encoding? jmfauth <wxjmfauth@gmail.com> - 2012-01-06 12:00 -0800
      Re: How to support a non-standard encoding? Tim Wintle <tim.wintle@teamrubber.com> - 2012-01-06 20:42 +0000
        Re: How to support a non-standard encoding? Ivan <ivan@llaisdy.com> - 2012-01-08 08:50 +0000
      Re: How to support a non-standard encoding? Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa915@spamschutz.glglgl.de> - 2012-01-08 12:16 +0100

#18594 — How to support a non-standard encoding?

From	Ivan <ivan@llaisdy.com>
Date	2012-01-06 10:03 +0000
Subject	How to support a non-standard encoding?
Message-ID	<je6guf$pns$1@localhost.localdomain>

Dear All

I'm developing a python application for which I need to support a 
non-standard character encoding (specifically ISO 6937/2-1983, Addendum 
1-1989).  Here are some of the properties of the encoding and its use in 
the application:

   - I need to read and write data to/from files.  The file format
     includes two sections in different character encodings (so I
     shan't be able to use codecs.open()).

   - iso-6937 sections include non-printing control characters

   - iso-6937 is a variable-width encoding, e.g. "A" = [0x41],
     "Ä" = [0xC8, 0x41]; all non-spacing diacritical marks are in the
     range 0xC0-0xCF.
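
To illustrate the variable-width property, here is a minimal Python 3 sketch that groups each diacritic byte with the letter that follows it. The name split_characters is made up for illustration, and the sketch assumes that every byte in 0xC0-0xCF is followed by exactly one base-letter byte.

```python
# Illustrative only: split ISO 6937 bytes into per-character byte groups.
DIACRITIC_RANGE = range(0xC0, 0xD0)  # non-spacing marks 0xC0-0xCF, as stated above


def split_characters(data):
    """Group each diacritic prefix byte with the base letter that follows it."""
    groups = []
    i = 0
    while i < len(data):
        if data[i] in DIACRITIC_RANGE:
            if i + 1 >= len(data):
                raise ValueError("diacritic byte at end of input")
            groups.append(data[i:i + 2])  # two-byte character
            i += 2
        else:
            groups.append(data[i:i + 1])  # one-byte character
            i += 1
    return groups
```

For example, split_characters(b"\xc8Ab") yields the two-byte group for "Ä" followed by the one-byte group for "b".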

By any chance is there anyone out there working on iso-6937?

Otherwise, I think I need to write a new codec to support reading and 
writing this data.  Does anyone know of any tutorials or blog posts on 
implementing a codec for a non-standard character encoding?  Would 
anyone be interested in reading one?

With thanks and best wishes

Ivan


-- 
============================================================
Ivan A. Uemlianin
Llaisdy
Speech Technology Research and Development

                     ivan@llaisdy.com
                      www.llaisdy.com
                          llaisdy.wordpress.com
               github.com/llaisdy
                      www.linkedin.com/in/ivanuemlianin

     "Froh, froh! Wie seine Sonnen, seine Sonnen fliegen"
                      (Schiller, Beethoven)
============================================================

#18599

From	Tim Wintle <tim.wintle@teamrubber.com>
Date	2012-01-06 13:47 +0000
Message-ID	<mailman.4478.1325857654.27778.python-list@python.org>
In reply to	#18594

On Fri, 2012-01-06 at 10:03 +0000, Ivan wrote:
> Dear All
> 
> I'm developing a python application for which I need to support a 
> non-standard character encoding (specifically ISO 6937/2-1983, Addendum 
> 1-1989).

If your system version of iconv contains that encoding (mine does) then
you could use a wrapped iconv library to avoid re-inventing the wheel.

I've got a forked version of the "iconv" package from pypi available
here:

<https://github.com/timwintle/iconv-python>

.. it should work on python2.5-2.7

Tim

#18603

From	Ivan Uemlianin <ivan@llaisdy.com>
Date	2012-01-06 14:03 +0000
Message-ID	<mailman.4480.1325860009.27778.python-list@python.org>
In reply to	#18594

Dear Tim

Thanks for your help.

 > If your system version of iconv contains that encoding, ...

Alas, it doesn't:

     $ iconv -l |grep 6937
     $

Also, I'd like to package the app so other people could use it, so I 
wouldn't want to depend too much on the local OS.

Best wishes

Ivan



#18617

From	jmfauth <wxjmfauth@gmail.com>
Date	2012-01-06 12:00 -0800
Message-ID	<1480875f-d133-40a1-8fd1-dd31a2dd430b@d10g2000vbh.googlegroups.com>
In reply to	#18594

On 6 jan, 11:03, Ivan <i...@llaisdy.com> wrote:
> Dear All
>
> I'm developing a python application for which I need to support a
> non-standard character encoding (specifically ISO 6937/2-1983, Addendum
> 1-1989).  Here are some of the properties of the encoding and its use in
> the application:
>
>    - I need to read and write data to/from files.  The file format
>      includes two sections in different character encodings (so I
>      shan't be able to use codecs.open()).
>
>    - iso-6937 sections include non-printing control characters
>
>    - iso-6937 is a variable-width encoding, e.g. "A" = [0x41],
>      "Ä" = [0xC8, 0x41]; all non-spacing diacritical marks are in the
>      range 0xC0-0xCF.
>
> By any chance is there anyone out there working on iso-6937?
>
> Otherwise, I think I need to write a new codec to support reading and
> writing this data.  Does anyone know of any tutorials or blog posts on
> implementing a codec for a non-standard character encoding?  Would
> anyone be interested in reading one?
>

Take a look at the files (Python modules) in
...\Lib\encodings. This is where all the codecs
live. Python picks them up automatically as
long as they are present in that directory.

I remember that, a long time ago and just for fun, I created
such a codec quite easily. I picked one of the files as a
template and modified its "table". It was a
byte <-> byte table.
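
A minimal Python 3 sketch of that table-driven approach, modelled on how the single-byte modules in the encodings directory are laid out. The encoding itself is hypothetical (Latin-1 with byte 0xA4 remapped to the euro sign) and only serves to show the mechanism; the function names decode and encode are likewise illustrative.

```python
import codecs

# Hypothetical one-byte encoding: identical to Latin-1 except that byte 0xA4
# means the euro sign. The table is made up purely for illustration.
decoding_table = "".join(chr(i) for i in range(256))
decoding_table = decoding_table[:0xA4] + "\u20ac" + decoding_table[0xA5:]

# Build the reverse (str -> byte) table from the 256-character decoding table.
encoding_table = codecs.charmap_build(decoding_table)


def decode(data, errors="strict"):
    # Returns a (str, number_of_bytes_consumed) tuple.
    return codecs.charmap_decode(data, errors, decoding_table)


def encode(text, errors="strict"):
    # Returns a (bytes, number_of_characters_consumed) tuple.
    return codecs.charmap_encode(text, errors, encoding_table)
```

A table like this only covers one byte per character, so it would not be enough on its own for the two-byte sequences of iso-6937.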

For a multibyte encoding it may be a little more
complicated; you could take a look at, e.g., the mbcs.py codec.

Distributing such a codec may be a problem.

----

Another simple approach, OS-independent:

You probably do not write your code in iso-6937; you
only need to encode/decode some byte sequences
"on the fly". In that case, work with bytes and write
a pair of encoding/decoding functions, with a
purpose-built <dict> [*] as a helper. It's not that complicated.
Use <unicode> (Py2) or <str> (Py3, the recommended
way ;-) ) as the pivot encoding.

[*] I once created such a dict from
# http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt

I never checked whether it corresponds to the "official" cp1252
codec.
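
The dict-based approach could look roughly like the Python 3 sketch below. It is deliberately incomplete: only the 0xC8 entry (diaeresis) is confirmed by the example in the original question, the other two table entries are assumptions that should be checked against the ISO 6937 tables, and bytes below 0x80 are assumed to be plain ASCII. The names COMBINING, PREFIX, decode_6937 and encode_6937 are made up for illustration.

```python
import unicodedata

# Diacritic prefix byte -> Unicode combining character.
# Only 0xC8 is confirmed by this thread; the others are assumptions.
COMBINING = {
    0xC1: "\u0300",  # grave (assumed, verify against the standard)
    0xC2: "\u0301",  # acute (assumed, verify against the standard)
    0xC8: "\u0308",  # diaeresis ("Ä" = [0xC8, 0x41])
}
PREFIX = {mark: byte for byte, mark in COMBINING.items()}


def decode_6937(data):
    """Decode bytes to str; unknown bytes >= 0x80 raise UnicodeDecodeError."""
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte in COMBINING:
            if i + 1 >= len(data):
                raise ValueError("diacritic byte at end of input")
            # The encoding puts the mark before the base letter; Unicode
            # puts the combining character after it.
            base = data[i + 1:i + 2].decode("ascii")
            out.append(base + COMBINING[byte])
            i += 2
        else:
            out.append(bytes([byte]).decode("ascii"))
            i += 1
    # Compose base + mark into a single precomposed character where one exists.
    return unicodedata.normalize("NFC", "".join(out))


def encode_6937(text):
    """Encode str to bytes; handles one known mark per base letter."""
    out = bytearray()
    for ch in unicodedata.normalize("NFD", text):
        if ch in PREFIX:
            if not out:
                raise ValueError("combining mark without a base letter")
            # Move the mark in front of the base letter already written.
            out.insert(len(out) - 1, PREFIX[ch])
        else:
            out += ch.encode("ascii")
    return bytes(out)
```

With this table, decode_6937(b"\xc8A") gives "Ä" and encode_6937("Ä") gives back the same two bytes.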

jmf

#18619

From	Tim Wintle <tim.wintle@teamrubber.com>
Date	2012-01-06 20:42 +0000
Message-ID	<mailman.4494.1325882556.27778.python-list@python.org>
In reply to	#18617

On Fri, 2012-01-06 at 12:00 -0800, jmfauth wrote:
> The distribution of such a codec may be a problem.

The codecs module has a function for this: codecs.register(), which
takes a search function that returns the codec for a given name.
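
A minimal Python 3 sketch of how that registration might look, so that the codec can ship inside the application instead of the Lib\encodings directory. The codec name "x-demo-6937" and the helper names are hypothetical, and the encode/decode bodies are ASCII-only placeholders standing in for the real conversion.

```python
import codecs


def _decode(data, errors="strict"):
    # Placeholder body: a real codec would do the iso-6937 conversion here.
    text = bytes(data).decode("ascii", errors)
    return text, len(data)  # (result, amount of input consumed)


def _encode(text, errors="strict"):
    # Placeholder body, mirroring _decode.
    data = text.encode("ascii", errors)
    return data, len(text)  # (result, amount of input consumed)


def _search(name):
    # The name arrives lower-cased; depending on the Python version, hyphens
    # may or may not have been turned into underscores, so accept both.
    if name.replace("-", "_") == "x_demo_6937":
        return codecs.CodecInfo(name="x-demo-6937", encode=_encode, decode=_decode)
    return None  # not ours; let other search functions try


# Typically done once, e.g. when the application's package is imported.
codecs.register(_search)
```

After registration, b"abc".decode("x-demo-6937") and "abc".encode("x-demo-6937") go through the functions above like any built-in codec.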

Tim

#18666

From	Ivan <ivan@llaisdy.com>
Date	2012-01-08 08:50 +0000
Message-ID	<jeblda$5gk$1@localhost.localdomain>
In reply to	#18619

Dear jmf, Tim

Thanks for these pointers.  They look very useful.

I'll have a go and report back (with success I hope).

Best wishes

Ivan



#18668

From	Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa915@spamschutz.glglgl.de>
Date	2012-01-08 12:16 +0100
Message-ID	<jebtu9$5m2$1@r03.glglgl.gl>
In reply to	#18617

On 06.01.2012 21:00, jmfauth wrote:

> Another simple approach, os independent.
>
> You probably do not write your code in iso-6937, but
> you only need to encode/decode some bytes sequence
> "on the fly". In that case, work with bytes, create
> a couple of coding / decoding functions with a
> created <dict> [*] as helper. It's not so complicate.
> Use <unicode> Py2 or <str> Py3 (the recommended
> way ;-) ) as pivot encoding.

Those encoding/decoding functions are exactly what it takes to create a
codec; there is not much more to it.


Thomas
