Groups > comp.lang.python > #29093 > unrolled thread

Re: Least-lossy string.encode to us-ascii?

Started by	Tim Chase <python.list@tim.thechases.com>
First post	2012-09-13 18:54 -0500
Last post	2012-09-13 19:09 -0700
Articles	6 — 4 participants

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: Least-lossy string.encode to us-ascii? Tim Chase <python.list@tim.thechases.com> - 2012-09-13 18:54 -0500
    Re: Least-lossy string.encode to us-ascii? Mark Tolonen <metolone@gmail.com> - 2012-09-13 19:09 -0700
      Re: Least-lossy string.encode to us-ascii? Tim Chase <python.list@tim.thechases.com> - 2012-09-13 21:34 -0500
        Re: Least-lossy string.encode to us-ascii? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-14 04:05 +0000
      Re: Least-lossy string.encode to us-ascii? Terry Reedy <tjreedy@udel.edu> - 2012-09-14 16:57 -0400
    Re: Least-lossy string.encode to us-ascii? Mark Tolonen <metolone@gmail.com> - 2012-09-13 19:09 -0700

#29093 — Re: Least-lossy string.encode to us-ascii?

From	Tim Chase <python.list@tim.thechases.com>
Date	2012-09-13 18:54 -0500
Subject	Re: Least-lossy string.encode to us-ascii?
Message-ID	<mailman.654.1347580392.27098.python-list@python.org>

On 09/13/12 18:36, Terry Reedy wrote:
> On 9/13/2012 5:26 PM, Tim Chase wrote:
>> I've got a bunch of text in Portuguese and to transmit them, need to
>> have them in us-ascii (7-bit).  I'd like to keep as much information
>> as possible, just stripping accents, cedillas, tildes, etc.
> 
> 'keep as much information as possible' would mean an effectively 
> lossless transliteration, which you could do with a dict.
> {<o-with-accent>: 'o', <c-cedilla>: 'c,' (or pick something that would 
> never occur in normal text of the sort you are transmitting), ...}

Vlastimil's solution kept the characters but stripped them of their
accents/tildes/cedillas/etc, doing just what I wanted, all using the
stdlib.  Hard to do better than that :-)
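
Vlastimil's post falls outside the window shown here, so his exact code is not reproduced. A minimal sketch of the usual stdlib technique, assuming Python 3 (the helper name `strip_accents` is hypothetical): decompose with NFKD, then drop whatever does not fit in ASCII.

```python
import unicodedata


def strip_accents(text):
    """Hypothetical helper: lossy conversion of text to plain ASCII."""
    # NFKD splits an accented letter into its base letter plus
    # one or more combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # 'ignore' silently drops everything outside ASCII: the combining
    # marks, and also any character that has no ASCII decomposition.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_accents("serviço móvil"))  # servico movil
```

Note that this is lossy by design: a letter such as "ø", which has no decomposition, disappears entirely rather than becoming "o".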

-tkc

#29099

From	Mark Tolonen <metolone@gmail.com>
Date	2012-09-13 19:09 -0700
Message-ID	<8a35c480-7594-4202-afe8-f03db9418301@googlegroups.com>
In reply to	#29093

On Thursday, September 13, 2012 4:53:13 PM UTC-7, Tim Chase wrote:
> On 09/13/12 18:36, Terry Reedy wrote:
> > On 9/13/2012 5:26 PM, Tim Chase wrote:
> >> I've got a bunch of text in Portuguese and to transmit them, need to
> >> have them in us-ascii (7-bit).  I'd like to keep as much information
> >> as possible, just stripping accents, cedillas, tildes, etc.
> >
> > 'keep as much information as possible' would mean an effectively
> > lossless transliteration, which you could do with a dict.
> > {<o-with-accent>: 'o', <c-cedilla>: 'c,' (or pick something that would
> > never occur in normal text of the sort you are transmitting), ...}
>
> Vlastimil's solution kept the characters but stripped them of their
> accents/tildes/cedillas/etc, doing just what I wanted, all using the
> stdlib.  Hard to do better than that :-)
>
> -tkc

How about using UTF-7 for transmission and decode on the other end?  This keeps the transmission all 7-bit, and no loss.

    >>> s=u"serviço móvil".encode('utf-7')
    >>> print s
    servi+AOc-o m+APM-vil
    >>> print s.decode('utf-7')
    serviço móvil
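
The session above is Python 2. Under Python 3, where `str` is already Unicode and `encode` returns `bytes`, the same round trip might look like this sketch:

```python
text = "serviço móvil"

# Encode to UTF-7: the result is a bytes object in which every
# byte value is below 128, so it survives a 7-bit channel.
wire = text.encode("utf-7")
print(wire)  # b'servi+AOc-o m+APM-vil'
assert all(byte < 128 for byte in wire)

# The receiver must know to decode as UTF-7 to get the text back.
restored = wire.decode("utf-7")
assert restored == text
```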

-Mark

#29103

From	Tim Chase <python.list@tim.thechases.com>
Date	2012-09-13 21:34 -0500
Message-ID	<mailman.660.1347590022.27098.python-list@python.org>
In reply to	#29099

On 09/13/12 21:09, Mark Tolonen wrote:
> On Thursday, September 13, 2012 4:53:13 PM UTC-7, Tim Chase wrote:
>> Vlastimil's solution kept the characters but stripped them of their
>> accents/tildes/cedillas/etc, doing just what I wanted, all using the
>> stdlib.  Hard to do better than that :-)
> 
> How about using UTF-7 for transmission and decode on the other end?  This keeps the transmission all 7-bit, and no loss.
> 
>     >>> s=u"serviço móvil".encode('utf-7')
>     >>> print s
>     servi+AOc-o m+APM-vil
>     >>> print s.decode('utf-7')
>     serviço móvil

Nice if I control both ends of the pipe.  Unfortunately, I only
control what goes in, and I want it to be as un-screw-uppable as
possible when it comes out the other end (may be web, CSV files,
PDFs, FTP'ed file dumps, spreadsheets, word-processing documents,
etc), and us-ascii is the lowest-common-denominator of
unscrewuppableness while requiring nothing of the other end. :-)

-tkc

#29110

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-09-14 04:05 +0000
Message-ID	<5052ad0c$0$29981$c3e8da3$5496439d@news.astraweb.com>
In reply to	#29103

On Thu, 13 Sep 2012 21:34:52 -0500, Tim Chase wrote:

> On 09/13/12 21:09, Mark Tolonen wrote:
>> On Thursday, September 13, 2012 4:53:13 PM UTC-7, Tim Chase wrote:
>>> Vlastimil's solution kept the characters but stripped them of their
>>> accents/tildes/cedillas/etc, doing just what I wanted, all using the
>>> stdlib.  Hard to do better than that :-)
>> 
>> How about using UTF-7 for transmission and decode on the other end? 
>> This keeps the transmission all 7-bit, and no loss.
>> 
>>     >>> s=u"serviço móvil".encode('utf-7')
>>     >>> print s
>>     servi+AOc-o m+APM-vil
>>     >>> print s.decode('utf-7')
>>     serviço móvil
> 
> Nice if I control both ends of the pipe.  Unfortunately, I only control
> what goes in, and I want it to be as un-screw-uppable as possible when
> it comes out the other end (may be web, CSV files, PDFs, FTP'ed file
> dumps, spreadsheets, word-processing documents, etc), and us-ascii is
> the lowest-common-denominator of unscrewuppableness while requiring
> nothing of the other end. :-)

Wrong. It requires support for US-ASCII. What if the other end is an IBM 
mainframe using EBCDIC?
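
To illustrate the EBCDIC point: the same characters map to different byte values under an EBCDIC code page, so "plain ASCII" is still an encoding the receiver has to share. A small sketch, assuming Python 3 and its `cp037` codec (one of several EBCDIC variants; which one a given mainframe uses is an assumption here):

```python
text = "servico movil"

ascii_bytes = text.encode("ascii")
ebcdic_bytes = text.encode("cp037")  # cp037: one EBCDIC code page

# Same text, two different byte sequences.
print(ascii_bytes.hex())
print(ebcdic_bytes.hex())
assert ascii_bytes != ebcdic_bytes

# Each sequence only turns back into the text with its own codec.
assert ebcdic_bytes.decode("cp037") == text
assert ascii_bytes.decode("ascii") == text
```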

Frankly, I am appalled that you are intentionally perpetuating the 
ignorance of US-ASCII-only applications, not because you have no choice 
about inter-operating with some ancient, brain-dead application, but 
because you artificially choose to follow an obsolete *and incorrect* 
standard.

It is *incorrect* because you can change the meaning of text by stripping 
accents and deleting characters. Consequences can include murder and suicide:

http://gizmodo.com/382026/a-cellphones-missing-dot-kills-two-people-puts-three-more-in-jail
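
A Portuguese illustration of the same hazard: "avó" (grandmother) and "avô" (grandfather) differ only by accent, so accent stripping makes them indistinguishable. A sketch, assuming Python 3 and the NFKD-plus-ignore way of stripping (`to_ascii` is a hypothetical name):

```python
import unicodedata
from collections import defaultdict


def to_ascii(text):
    """Hypothetical helper: strip accents by decomposing and dropping."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


words = ["avó", "avô", "coco", "cocô"]

# Group the original words by their stripped form.
collisions = defaultdict(list)
for word in words:
    collisions[to_ascii(word)].append(word)

# Any group with more than one member has lost a distinction.
for plain, originals in collisions.items():
    if len(originals) > 1:
        print(plain, "<-", originals)
# avo <- ['avó', 'avô']
# coco <- ['coco', 'cocô']
```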

At least tell me that "ASCII only" is merely an *option* for your 
application, not the only choice, and that it defaults to UTF-8 which is 
the right standard to use for text.

-- 
Steven

#29189

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-09-14 16:57 -0400
Message-ID	<mailman.721.1347656284.27098.python-list@python.org>
In reply to	#29099

On 9/13/2012 10:09 PM, Mark Tolonen wrote:
> On Thursday, September 13, 2012 4:53:13 PM UTC-7, Tim Chase wrote:
>> On 09/13/12 18:36, Terry Reedy wrote:

>>> 'keep as much information as possible' would mean an effectively
>>> lossless transliteration, which you could do with a dict.
>>> {<o-with-accent>: 'o', <c-cedilla>: 'c,' (or pick something that would

>> Vlastimil's solution kept the characters but stripped them of their
>> accents/tildes/cedillas/etc, doing just what I wanted, all using the
>> stdlib.  Hard to do better than that :-)

You mean, hard to do better than what you think you want, as opposed to 
what you said you wanted in both the subject line and the text line I 
quoted. What you need depends on why you need ascii only text and what 
the recipient will do with the ascii only text. Print it on an 
ascii-only printer? Or something similar? If so, a lossy encoding may be 
sufficient, but why not let the recipient decide to toss info?

> How about using UTF-7 for transmission and decode on the other end?
> This keeps the transmission all 7-bit, and no loss.
>
>     >>> s=u"serviço móvil".encode('utf-7')
>     >>> print s
>     servi+AOc-o m+APM-vil
>     >>> print s.decode('utf-7')
>     serviço móvil

Nice. I was barely aware of and forgot that option. This and similar 
suggestions to use existing methods is much better than my hackish approach.
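
For comparison, the dict-based transliteration quoted at the top of this post could be sketched roughly as follows. This assumes Python 3; the table, the choice of "`" as an escape character, and the function names are all hypothetical, not something posted in the thread.

```python
# Hypothetical table: each accented letter becomes an escape character
# followed by two ASCII characters. The escape character '`' is assumed
# never to occur in the text being transmitted.
TABLE = {
    "ç": "`c,",
    "ó": "`o'",
    "ã": "`a~",
    "ê": "`e^",
}


def encode(text):
    if "`" in text:
        raise ValueError("escape character already present in text")
    # str.translate accepts a dict keyed by code point. Characters that
    # are not in the table pass through unchanged, so a real table would
    # have to cover every non-ASCII character that can occur.
    return text.translate({ord(char): token for char, token in TABLE.items()})


def decode(data):
    # Every token starts with the escape character, so plain text
    # can never be mistaken for a token.
    for char, token in TABLE.items():
        data = data.replace(token, char)
    return data


wire = encode("serviço móvil")
print(wire)  # servi`c,o m`o'vil
assert all(ord(c) < 128 for c in wire)
assert decode(wire) == "serviço móvil"
```

Like UTF-7, this is only lossless if the receiving end applies the reverse mapping.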

-- 
Terry Jan Reedy

#29100

From	Mark Tolonen <metolone@gmail.com>
Date	2012-09-13 19:09 -0700
Message-ID	<mailman.657.1347588580.27098.python-list@python.org>
In reply to	#29093

On Thursday, September 13, 2012 4:53:13 PM UTC-7, Tim Chase wrote:
> On 09/13/12 18:36, Terry Reedy wrote:
> > On 9/13/2012 5:26 PM, Tim Chase wrote:
> >> I've got a bunch of text in Portuguese and to transmit them, need to
> >> have them in us-ascii (7-bit).  I'd like to keep as much information
> >> as possible, just stripping accents, cedillas, tildes, etc.
> >
> > 'keep as much information as possible' would mean an effectively
> > lossless transliteration, which you could do with a dict.
> > {<o-with-accent>: 'o', <c-cedilla>: 'c,' (or pick something that would
> > never occur in normal text of the sort you are transmitting), ...}
>
> Vlastimil's solution kept the characters but stripped them of their
> accents/tildes/cedillas/etc, doing just what I wanted, all using the
> stdlib.  Hard to do better than that :-)
>
> -tkc

How about using UTF-7 for transmission and decode on the other end?  This keeps the transmission all 7-bit, and no loss.

    >>> s=u"serviço móvil".encode('utf-7')
    >>> print s
    servi+AOc-o m+APM-vil
    >>> print s.decode('utf-7')
    serviço móvil

-Mark
