Groups > comp.lang.python > #29093 > unrolled thread
| Started by | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| First post | 2012-09-13 18:54 -0500 |
| Last post | 2012-09-13 19:09 -0700 |
| Articles | 6 — 4 participants |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: Least-lossy string.encode to us-ascii? Tim Chase <python.list@tim.thechases.com> - 2012-09-13 18:54 -0500
Re: Least-lossy string.encode to us-ascii? Mark Tolonen <metolone@gmail.com> - 2012-09-13 19:09 -0700
Re: Least-lossy string.encode to us-ascii? Tim Chase <python.list@tim.thechases.com> - 2012-09-13 21:34 -0500
Re: Least-lossy string.encode to us-ascii? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-14 04:05 +0000
Re: Least-lossy string.encode to us-ascii? Terry Reedy <tjreedy@udel.edu> - 2012-09-14 16:57 -0400
Re: Least-lossy string.encode to us-ascii? Mark Tolonen <metolone@gmail.com> - 2012-09-13 19:09 -0700
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2012-09-13 18:54 -0500 |
| Subject | Re: Least-lossy string.encode to us-ascii? |
| Message-ID | <mailman.654.1347580392.27098.python-list@python.org> |
On 09/13/12 18:36, Terry Reedy wrote:
> On 9/13/2012 5:26 PM, Tim Chase wrote:
>> I've got a bunch of text in Portuguese and to transmit them, need to
>> have them in us-ascii (7-bit). I'd like to keep as much information
>> as possible, just stripping accents, cedillas, tildes, etc.
>
> 'keep as much information as possible' would mean an effectively
> lossless transliteration, which you could do with a dict.
> {<o-with-accent>: 'o', <c-cedilla>: 'c,' (or pick something that would
> never occur in normal text of the sort you are transmitting), ...}
Vlastimil's solution kept the characters but stripped them of their
accents/tildes/cedillas/etc, doing just what I wanted, all using the
stdlib. Hard to do better than that :-)
-tkc
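Vlastimil's code is not quoted in this part of the thread, so the following is only a guess at what a stdlib-only accent stripper might look like, not a reproduction of his post. It assumes Python 3 and the usual `unicodedata` decomposition approach; the function name `strip_accents` is made up for illustration.

```python
import unicodedata


def strip_accents(text):
    """Return an ASCII-only approximation of text (lossy)."""
    # NFKD splits each accented letter into a base letter followed by
    # one or more combining marks, e.g. "ç" becomes "c" + U+0327.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors="ignore" drops the combining marks
    # and any other character that has no ASCII equivalent.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_accents("serviço móvil"))  # servico movil
```

Letters that have no decomposition (for example "ø" or "ß") are dropped entirely by this sketch, so it suits Portuguese better than it suits arbitrary text.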
| From | Mark Tolonen <metolone@gmail.com> |
|---|---|
| Date | 2012-09-13 19:09 -0700 |
| Message-ID | <8a35c480-7594-4202-afe8-f03db9418301@googlegroups.com> |
| In reply to | #29093 |
On Thursday, September 13, 2012 4:53:13 PM UTC-7, Tim Chase wrote:
> On 09/13/12 18:36, Terry Reedy wrote:
> > On 9/13/2012 5:26 PM, Tim Chase wrote:
> >> I've got a bunch of text in Portuguese and to transmit them, need to
> >> have them in us-ascii (7-bit). I'd like to keep as much information
> >> as possible, just stripping accents, cedillas, tildes, etc.
> >
> > 'keep as much information as possible' would mean an effectively
> > lossless transliteration, which you could do with a dict.
> > {<o-with-accent>: 'o', <c-cedilla>: 'c,' (or pick something that would
> > never occur in normal text of the sort you are transmitting), ...}
>
> Vlastimil's solution kept the characters but stripped them of their
> accents/tildes/cedillas/etc, doing just what I wanted, all using the
> stdlib. Hard to do better than that :-)
>
> -tkc
How about using UTF-7 for transmission and decode on the other end? This keeps the transmission all 7-bit, and no loss.
>>> s=u"serviço móvil".encode('utf-7')
>>> print s
servi+AOc-o m+APM-vil
>>> print s.decode('utf-7')
serviço móvil
-Mark
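The session above is Python 2 (`print` statement, `u""` literal, `str.decode`). As a rough Python 3 equivalent, assuming nothing beyond the built-in `utf-7` codec, the same round trip might look like this; the variable names are illustrative only.

```python
text = "serviço móvil"

# str.encode returns bytes; every byte of UTF-7 output is below 128.
encoded = text.encode("utf-7")
print(encoded)  # b'servi+AOc-o m+APM-vil'
print(all(byte < 128 for byte in encoded))

# Decoding on the receiving side restores the original text exactly.
decoded = encoded.decode("utf-7")
print(decoded == text)
```

This only helps when the receiver knows to decode UTF-7; a reader who treats the bytes as plain ASCII sees the `+AOc-` sequences literally.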
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2012-09-13 21:34 -0500 |
| Message-ID | <mailman.660.1347590022.27098.python-list@python.org> |
| In reply to | #29099 |
On 09/13/12 21:09, Mark Tolonen wrote:
> On Thursday, September 13, 2012 4:53:13 PM UTC-7, Tim Chase wrote:
>> Vlastimil's solution kept the characters but stripped them of their
>> accents/tildes/cedillas/etc, doing just what I wanted, all using the
>> stdlib. Hard to do better than that :-)
>
> How about using UTF-7 for transmission and decode on the other end? This keeps the transmission all 7-bit, and no loss.
>
> >>> s=u"serviço móvil".encode('utf-7')
> >>> print s
> servi+AOc-o m+APM-vil
> >>> print s.decode('utf-7')
> serviço móvil
Nice if I control both ends of the pipe. Unfortunately, I only
control what goes in, and I want it to be as un-screw-uppable as
possible when it comes out the other end (may be web, CSV files,
PDFs, FTP'ed file dumps, spreadsheets, word-processing documents,
etc), and us-ascii is the lowest-common-denominator of
unscrewuppableness while requiring nothing of the other end. :-)
-tkc
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-09-14 04:05 +0000 |
| Message-ID | <5052ad0c$0$29981$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #29103 |
On Thu, 13 Sep 2012 21:34:52 -0500, Tim Chase wrote:
> On 09/13/12 21:09, Mark Tolonen wrote:
>> On Thursday, September 13, 2012 4:53:13 PM UTC-7, Tim Chase wrote:
>>> Vlastimil's solution kept the characters but stripped them of their
>>> accents/tildes/cedillas/etc, doing just what I wanted, all using the
>>> stdlib. Hard to do better than that :-)
>>
>> How about using UTF-7 for transmission and decode on the other end?
>> This keeps the transmission all 7-bit, and no loss.
>>
>> >>> s=u"serviço móvil".encode('utf-7')
>> >>> print s
>> servi+AOc-o m+APM-vil
>> >>> print s.decode('utf-7')
>> serviço móvil
>
> Nice if I control both ends of the pipe. Unfortunately, I only control
> what goes in, and I want it to be as un-screw-uppable as possible when
> it comes out the other end (may be web, CSV files, PDFs, FTP'ed file
> dumps, spreadsheets, word-processing documents, etc), and us-ascii is
> the lowest-common-denominator of unscrewuppableness while requiring
> nothing of the other end. :-)
Wrong. It requires support for US-ASCII. What if the other end is an IBM
mainframe using EBCDIC?
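To make the EBCDIC point concrete, here is a small sketch assuming Python 3 and its bundled `cp500` codec, which is one of several EBCDIC code pages and is chosen here purely as an example. The same text produces different bytes under the two encodings, so ASCII bytes read as EBCDIC come out as something else.

```python
text = "servico movil"

ascii_bytes = text.encode("ascii")
ebcdic_bytes = text.encode("cp500")  # cp500: one EBCDIC code page

# The two byte sequences differ even though the text is pure ASCII.
print(ascii_bytes != ebcdic_bytes)

# An EBCDIC reader handed the ASCII bytes would not recover the text.
misread = ascii_bytes.decode("cp500", errors="replace")
print(misread == text)
```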
Frankly, I am appalled that you are intentionally perpetuating the
ignorance of US-ASCII-only applications, not because you have no choice
about inter-operating with some ancient, brain-dead application, but
because you artificially choose to follow an obsolete *and incorrect*
standard.
It is *incorrect* because you can change the meaning of text by stripping
accents and deleting characters. Consequences can include murder and suicide:
http://gizmodo.com/382026/a-cellphones-missing-dot-kills-two-people-puts-three-more-in-jail
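A Portuguese illustration of the same risk, as a sketch only: "avô" (grandfather) and "avó" (grandmother) differ by a single accent. The helper below is a hypothetical stdlib-based stripper written for this example, assuming Python 3; it is not code from the thread.

```python
import unicodedata


def strip_accents(text):
    # Decompose accented letters, then discard everything non-ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


grandfather = strip_accents("avô")
grandmother = strip_accents("avó")

# Both words collapse to the same ASCII string, so the distinction
# cannot be recovered by the recipient.
print(grandfather, grandmother, grandfather == grandmother)
```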
At least tell me that "ASCII only" is merely an *option* for your
application, not the only choice, and that it defaults to UTF-8 which is
the right standard to use for text.
--
Steven
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2012-09-14 16:57 -0400 |
| Message-ID | <mailman.721.1347656284.27098.python-list@python.org> |
| In reply to | #29099 |
On 9/13/2012 10:09 PM, Mark Tolonen wrote:
> On Thursday, September 13, 2012 4:53:13 PM UTC-7, Tim Chase wrote:
>> On 09/13/12 18:36, Terry Reedy wrote:
>>> 'keep as much information as possible' would mean an effectively
>>> lossless transliteration, which you could do with a dict.
>>> {<o-with-accent>: 'o', <c-cedilla>: 'c,' (or pick something that would
>> Vlastimil's solution kept the characters but stripped them of their
>> accents/tildes/cedillas/etc, doing just what I wanted, all using the
>> stdlib. Hard to do better than that :-)
You mean, hard to do better than what you think you want, as opposed to
what you said you wanted in both the subject line and the text line I
quoted. What you need depends on why you need ascii only text and what
the recipient will do with the ascii only text. Print it on an
ascii-only printer? Or something similar? If so, a lossy encoding may be
sufficient, but why not let the recipient decide to toss info?
> How about using UTF-7 for transmission and decode on the other end?
> This keeps the transmission all 7-bit, and no loss.
>
> >>> s=u"serviço móvil".encode('utf-7')
> >>> print s
> servi+AOc-o m+APM-vil
> >>> print s.decode('utf-7')
> serviço móvil
Nice. I was barely aware of and forgot that option. This and similar
suggestions to use existing methods are much better than my hackish approach.
--
Terry Jan Reedy
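For comparison, the dict-based transliteration described earlier in the thread could be sketched roughly as below, assuming Python 3. The mapping and the function names are hypothetical: the ASCII stand-ins ("c," and so on) are arbitrary picks, and the scheme is reversible only if those stand-ins never occur in the plain text being sent.

```python
# Hypothetical table: each accented letter maps to an ASCII stand-in.
TO_ASCII = {
    "\u00e7": "c,",  # c with cedilla
    "\u00f3": "o'",  # o with acute accent
    "\u00e3": "a~",  # a with tilde
}
FROM_ASCII = {marker: char for char, marker in TO_ASCII.items()}


def to_ascii(text):
    # str.maketrans accepts a dict of single characters to strings.
    return text.translate(str.maketrans(TO_ASCII))


def from_ascii(text):
    # Reverse step: swap each stand-in back for its accented letter.
    for marker, char in FROM_ASCII.items():
        text = text.replace(marker, char)
    return text


sent = to_ascii("serviço móvil")
print(sent)
print(from_ascii(sent))
```

Ordinary punctuation breaks the reverse step (a word ending in "c" followed by a comma would be misread), which is why the choice of stand-ins matters.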
| From | Mark Tolonen <metolone@gmail.com> |
|---|---|
| Date | 2012-09-13 19:09 -0700 |
| Message-ID | <mailman.657.1347588580.27098.python-list@python.org> |
| In reply to | #29093 |
On Thursday, September 13, 2012 4:53:13 PM UTC-7, Tim Chase wrote:
> On 09/13/12 18:36, Terry Reedy wrote:
> > On 9/13/2012 5:26 PM, Tim Chase wrote:
> >> I've got a bunch of text in Portuguese and to transmit them, need to
> >> have them in us-ascii (7-bit). I'd like to keep as much information
> >> as possible, just stripping accents, cedillas, tildes, etc.
> >
> > 'keep as much information as possible' would mean an effectively
> > lossless transliteration, which you could do with a dict.
> > {<o-with-accent>: 'o', <c-cedilla>: 'c,' (or pick something that would
> > never occur in normal text of the sort you are transmitting), ...}
>
> Vlastimil's solution kept the characters but stripped them of their
> accents/tildes/cedillas/etc, doing just what I wanted, all using the
> stdlib. Hard to do better than that :-)
>
> -tkc
How about using UTF-7 for transmission and decode on the other end? This keeps the transmission all 7-bit, and no loss.
>>> s=u"serviço móvil".encode('utf-7')
>>> print s
servi+AOc-o m+APM-vil
>>> print s.decode('utf-7')
serviço móvil
-Mark