Re: Least-lossy string.encode to us-ascii?

From	Christian Heimes <lists@cheimes.de>
Subject	Re: Least-lossy string.encode to us-ascii?
Date	2012-09-14 00:00 +0200
References	<50524F6F.6070604@tim.thechases.com>
Newsgroups	comp.lang.python
Message-ID	<mailman.643.1347573660.27098.python-list@python.org>

On 13.09.2012 23:26, Tim Chase wrote:
> I've got a bunch of text in Portuguese and to transmit them, need to
> have them in us-ascii (7-bit).  I'd like to keep as much information
> as possible, just stripping accents, cedillas, tildes, etc.  So
> "serviço móvil" becomes "servico movil".  Is there anything stock
> that I've missed?  I can do mystring.encode('us-ascii', 'replace')
> but that doesn't keep as much information as I'd hope.

The unidecode [1] package contains a large mapping of Unicode characters
to ASCII. It even supports cool stuff like Chinese to ASCII:

>>> import unidecode
>>> print u"\u5317\u4EB0"
北亰
>>> print unidecode.unidecode(u"\u5317\u4EB0")
Bei Jing
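
If an extra dependency is not an option, a stdlib-only approach may be
enough for the Portuguese case in the question. This is a minimal sketch
and not what unidecode does internally: it decomposes each character
with unicodedata.normalize("NFKD") and then discards everything that is
not ASCII. The helper name strip_accents is made up for illustration.

```python
import unicodedata


def strip_accents(text):
    """Return an ASCII-only approximation of text (hypothetical helper).

    NFKD splits an accented letter into its base letter plus combining
    marks; encoding with errors="ignore" then drops the marks along with
    any other character that has no ASCII form.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_accents(u"servi\u00e7o m\u00f3vil"))  # servico movil
```

Unlike unidecode, this silently drops characters that have no
decomposition (Chinese, for example), so it only suits text that is
mostly Latin letters with diacritics.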

icu4c and pyicu [2] may contain more methods for conversion but they
require binary extensions. By the way, ICU can do a lot of cool stuff, too:

>>> import icu
>>> rbf = icu.RuleBasedNumberFormat(icu.URBNFRuleSetTag.SPELLOUT,
...                                 icu.Locale.getUS())
>>> rbf.format(23)
u'twenty-three'
>>> rbf.format(100000)
u'one hundred thousand'

Regards,
Christian

[1] http://pypi.python.org/pypi/Unidecode/0.04.9
[2] http://pypi.python.org/pypi/PyICU/1.4