
Re: How do I automate the removal of all non-ascii characters from my code?

Started by	Stefan Behnel <stefan_ml@behnel.de>
First post	2011-09-12 10:43 +0200
Last post	2011-09-13 20:13 +0200
Articles	13 — 9 participants


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: How do I automate the removal of all non-ascii characters from my code? Stefan Behnel <stefan_ml@behnel.de> - 2011-09-12 10:43 +0200
    Re: How do I automate the removal of all non-ascii characters from my code? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-09-12 18:49 +1000
      Re: How do I automate the removal of all non-ascii characters from my code? Dave Angel <davea@ieee.org> - 2011-09-12 08:09 -0400
      Re: How do I automate the removal of all non-ascii characters from my code? jmfauth <wxjmfauth@gmail.com> - 2011-09-12 07:47 -0700
        Re: How do I automate the removal of all non-ascii characters from my code? "Rhodri James" <rhodri@wildebst.demon.co.uk> - 2011-09-12 22:39 +0100
          Re: How do I automate the removal of all non-ascii characters from my code? jmfauth <wxjmfauth@gmail.com> - 2011-09-13 00:49 -0700
            Re: How do I automate the removal of all non-ascii characters from my code? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-09-13 18:15 +1000
              Re: How do I automate the removal of all non-ascii characters from my code? jmfauth <wxjmfauth@gmail.com> - 2011-09-13 02:04 -0700
      Re: How do I automate the removal of all non-ascii characters from my code? ron <vacorama@gmail.com> - 2011-09-13 05:31 -0700
        Re: How do I automate the removal of all non-ascii characters from my code? Vlastimil Brom <vlastimil.brom@gmail.com> - 2011-09-13 15:33 +0200
        Re: How do I automate the removal of all non-ascii characters from my code? Alec Taylor <alec.taylor6@gmail.com> - 2011-09-14 01:02 +1000
          Re: How do I automate the removal of all non-ascii characters from my code? Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2011-09-13 18:29 +0300
        Re: How do I automate the removal of all non-ascii characters from my code? Vlastimil Brom <vlastimil.brom@gmail.com> - 2011-09-13 20:13 +0200

#13162 — Re: How do I automate the removal of all non-ascii characters from my code?

From	Stefan Behnel <stefan_ml@behnel.de>
Date	2011-09-12 10:43 +0200
Subject	Re: How do I automate the removal of all non-ascii characters from my code?
Message-ID	<mailman.1021.1315817058.27778.python-list@python.org>

Alec Taylor, 12.09.2011 10:33:
> from creole import html2creole
>
> from BeautifulSoup import BeautifulSoup
>
> VALID_TAGS = ['strong', 'em', 'p', 'ul', 'li', 'br', 'b', 'i', 'a', 'h1', 'h2']
>
> def sanitize_html(value):
>
>     soup = BeautifulSoup(value)
>
>     for tag in soup.findAll(True):
>         if tag.name not in VALID_TAGS:
>             tag.hidden = True
>
>     return soup.renderContents()
> html2creole(u(sanitize_html('''<h1
> style="margin-left:76.8px;margin-right:0;text-indent:0;">Abstract</h1>
>     <p class="Standard"
> style="margin-left:76.8px;margin-right:0;text-indent:0;">
> [more stuff here]
> ''')))

Hi,

I'm not sure what you are trying to say with the above code, but if it's 
the code that fails for you with the exception you posted, I would guess 
that the problem is in the "[more stuff here]" part, which likely contains 
a non-ASCII character. Note that you didn't declare the source file 
encoding above. Do as Gary told you.

Stefan


#13164

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2011-09-12 18:49 +1000
Message-ID	<4e6dc7b4$0$29986$c3e8da3$5496439d@news.astraweb.com>
In reply to	#13162

On Mon, 12 Sep 2011 06:43 pm Stefan Behnel wrote:

> I'm not sure what you are trying to say with the above code, but if it's
> the code that fails for you with the exception you posted, I would guess
> that the problem is in the "[more stuff here]" part, which likely contains
> a non-ASCII character. Note that you didn't declare the source file
> encoding above. Do as Gary told you.

Even with a source code encoding, you will probably have problems with
source files including \xe2 and other "bad" chars. Unless they happen to
fall inside a quoted string literal, I would expect to get a SyntaxError.

I have come across this myself. While I haven't really investigated in great
detail, it appears to happen when copying and pasting code from a document
(usually HTML) which uses non-breaking spaces instead of \x20 space
characters. All it takes is just one to screw things up.

-- 
Steven
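
A minimal sketch, not from the thread, of one way to locate the stray characters described above before trying to remove them. It assumes Python 3 and a source file already read into a string; the helper name find_non_ascii is hypothetical.

```python
import unicodedata


def find_non_ascii(source_text):
    """Return (line, column, char, unicode_name) for each non-ASCII character."""
    hits = []
    # Split on "\n" only, so that unusual Unicode line separators are
    # reported as hits instead of being swallowed as line breaks.
    for line_no, line in enumerate(source_text.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                name = unicodedata.name(ch, "UNKNOWN")
                hits.append((line_no, col_no, ch, name))
    return hits


# A pasted line in which a no-break space replaced an ordinary space.
sample = "x = 1\ny\u00a0= 2\n"
for hit in find_non_ascii(sample):
    print(hit)
```

Reporting the position and Unicode name, rather than deleting blindly, makes it easier to decide whether a character should be dropped or replaced.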


#13170

From	Dave Angel <davea@ieee.org>
Date	2011-09-12 08:09 -0400
Message-ID	<mailman.1028.1315829434.27778.python-list@python.org>
In reply to	#13164

Steven D'Aprano wrote:
> On Mon, 12 Sep 2011 06:43 pm Stefan Behnel wrote:
>
>> I'm not sure what you are trying to say with the above code, but if it's
>> the code that fails for you with the exception you posted, I would guess
>> that the problem is in the "[more stuff here]" part, which likely contains
>> a non-ASCII character. Note that you didn't declare the source file
>> encoding above. Do as Gary told you.
> Even with a source code encoding, you will probably have problems with
> source files including \xe2 and other "bad" chars. Unless they happen to
> fall inside a quoted string literal, I would expect to get a SyntaxError.
>
> I have come across this myself. While I haven't really investigated in great
> detail, it appears to happen when copying and pasting code from a document
> (usually HTML) which uses non-breaking spaces instead of \x20 space
> characters. All it takes is just one to screw things up.
>
>

For me, more common than non-breaking space is the "smart quotes" 
characters.  In that case, one probably doesn't want to delete them, but 
instead convert them into standard quotes.

DaveA
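
A minimal sketch, not from the thread, of the conversion suggested above: map typographic quotes and no-break spaces to their plain ASCII counterparts instead of deleting them. It assumes Python 3; the REPLACEMENTS table and the function name normalize_pasted_code are hypothetical and would need extending for other characters.

```python
# Hypothetical mapping; extend it with whatever your sources produce.
REPLACEMENTS = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u00a0": " ",   # no-break space
})


def normalize_pasted_code(text):
    """Replace common copy/paste artifacts with ASCII equivalents."""
    return text.translate(REPLACEMENTS)


print(normalize_pasted_code("print(\u201chello\u201d)\u00a0# pasted"))
```

Characters that are not in the table pass through unchanged, so this step can be followed by a check for any remaining non-ASCII characters.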


#13183

From	jmfauth <wxjmfauth@gmail.com>
Date	2011-09-12 07:47 -0700
Message-ID	<98d81fe1-79df-4e37-87aa-399a78b52353@bi2g2000vbb.googlegroups.com>
In reply to	#13164

On 12 sep, 10:49, Steven D'Aprano <steve
+comp.lang.pyt...@pearwood.info> wrote:
>
> Even with a source code encoding, you will probably have problems with
> source files including \xe2 and other "bad" chars. Unless they happen to
> fall inside a quoted string literal, I would expect to get a SyntaxError.
>

This is absurd and complete nonsense. The purpose
of a coding directive is to inform the engine that
is processing a text file about the "language" it
has to speak. It can be an HTML, py or TeX file.
If you have a problem, it's probably a mismatch between
your coding directive and the real encoding of the
file. Typical case: ascii/utf-8 without signature.

jmf


#13201

From	"Rhodri James" <rhodri@wildebst.demon.co.uk>
Date	2011-09-12 22:39 +0100
Message-ID	<op.v1ps4fqia8ncjz@gnudebst>
In reply to	#13183

On Mon, 12 Sep 2011 15:47:00 +0100, jmfauth <wxjmfauth@gmail.com> wrote:

> On 12 sep, 10:49, Steven D'Aprano <steve
> +comp.lang.pyt...@pearwood.info> wrote:
>>
>> Even with a source code encoding, you will probably have problems with
>> source files including \xe2 and other "bad" chars. Unless they happen to
>> fall inside a quoted string literal, I would expect to get a  
>> SyntaxError.
>>
>
> This is absurd and complete nonsense. The purpose
> of a coding directive is to inform the engine that
> is processing a text file about the "language" it
> has to speak. It can be an HTML, py or TeX file.
> If you have a problem, it's probably a mismatch between
> your coding directive and the real encoding of the
> file. Typical case: ascii/utf-8 without signature.

Now read what Steven wrote again.  The issue is that the program contains  
characters that are syntactically illegal.  The "engine" can be perfectly  
correctly translating a character as a smart quote or a non breaking space  
or an e-umlaut or whatever, but that doesn't make the character legal!

-- 
Rhodri James *-* Wildebeest Herder to the Masses


#13219

From	jmfauth <wxjmfauth@gmail.com>
Date	2011-09-13 00:49 -0700
Message-ID	<44bd982b-e740-4c77-974e-aa8c9eb47d6e@y4g2000vbx.googlegroups.com>
In reply to	#13201

On 12 sep, 23:39, "Rhodri James" <rho...@wildebst.demon.co.uk> wrote:

> Now read what Steven wrote again.  The issue is that the program contains  
> characters that are syntactically illegal.  The "engine" can be perfectly  
> correctly translating a character as a smart quote or a non breaking space  
> or an e-umlaut or whatever, but that doesn't make the character legal!
>

Yes, you are right. I did not understand it that way.

However, a small correction/clarification: illegal characters
do not exist. One can "only" have ill-formed encoded code
points or an illegal encoded code point representing a
character/glyph.

Basically, in the present case, the issue is most probably
a mismatch between the coding directive and the real
encoding, with "no coding directive" == 'ascii'.

jmf


#13220

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2011-09-13 18:15 +1000
Message-ID	<4e6f112c$0$29997$c3e8da3$5496439d@news.astraweb.com>
In reply to	#13219

On Tue, 13 Sep 2011 05:49 pm jmfauth wrote:

> On 12 sep, 23:39, "Rhodri James" <rho...@wildebst.demon.co.uk> wrote:
> 
> 
>> Now read what Steven wrote again.  The issue is that the program contains
>> characters that are syntactically illegal.  The "engine" can be perfectly
>> correctly translating a character as a smart quote or a non breaking
>> space or an e-umlaut or whatever, but that doesn't make the character
>> legal!
>>
> 
> Yes, you are right. I did not understand it that way.
> 
> However, a small correction/clarification: illegal characters
> do not exist. One can "only" have ill-formed encoded code
> points or an illegal encoded code point representing a
> character/glyph.

You are wrong there. There are many ASCII characters which are illegal in
Python source code, at least outside of comments and string literals, and
possibly even there.

>>> code = "x = 1 + \b 2"  # all ASCII characters
>>> print(code)
x = 1 + 2
>>> exec(code)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
    x = 1 + 2
            ^
SyntaxError: invalid syntax


Now, imagine that a \b ASCII backspace character somehow gets
introduced into your source file. When you go to run the file, or import
it, you will get a SyntaxError. Changing the encoding will not help.



-- 
Steven
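
A minimal sketch, not from the thread, that checks the point above with compile() rather than exec(). It assumes Python 3, where the input is already a decoded string, so no encoding declaration is involved at all; the helper name is_valid_source is hypothetical.

```python
def is_valid_source(code):
    """Return True if the string compiles as Python source, else False."""
    try:
        compile(code, "<sample>", "exec")
    except SyntaxError:
        return False
    return True


print(is_valid_source("x = 1 + 2"))        # plain ASCII source
print(is_valid_source("x = 1 + \b 2"))     # ASCII backspace between tokens
print(is_valid_source("x\u00a0= 1"))       # no-break space between tokens
print(is_valid_source("s = '\u00a0'"))     # no-break space inside a literal
```

The two middle cases are rejected and the last one is accepted, which matches the distinction drawn in the thread between characters inside and outside string literals.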


#13222

From	jmfauth <wxjmfauth@gmail.com>
Date	2011-09-13 02:04 -0700
Message-ID	<4a674dba-078b-4353-a6ae-01c205460e91@w12g2000yqa.googlegroups.com>
In reply to	#13220

On 13 sep, 10:15, Steven D'Aprano <steve
+comp.lang.pyt...@pearwood.info> wrote:

The intrinsic coding of the characters is one thing;
the use of a byte stream that is supposed to represent
text is another.
jmf


#13226

From	ron <vacorama@gmail.com>
Date	2011-09-13 05:31 -0700
Message-ID	<ee50be9b-2710-423b-9c69-744cc173ac85@dq7g2000vbb.googlegroups.com>
In reply to	#13164

On Sep 12, 4:49 am, Steven D'Aprano <steve
+comp.lang.pyt...@pearwood.info> wrote:
> On Mon, 12 Sep 2011 06:43 pm Stefan Behnel wrote:
>
> > I'm not sure what you are trying to say with the above code, but if it's
> > the code that fails for you with the exception you posted, I would guess
> > that the problem is in the "[more stuff here]" part, which likely contains
> > a non-ASCII character. Note that you didn't declare the source file
> > encoding above. Do as Gary told you.
>
> Even with a source code encoding, you will probably have problems with
> source files including \xe2 and other "bad" chars. Unless they happen to
> fall inside a quoted string literal, I would expect to get a SyntaxError.
>
> I have come across this myself. While I haven't really investigated in great
> detail, it appears to happen when copying and pasting code from a document
> (usually HTML) which uses non-breaking spaces instead of \x20 space
> characters. All it takes is just one to screw things up.
>
> --
> Steven

Depending on the load, you can do something like:

"".join([x for x in string if ord(x) < 128])

It's worked great for me in cleaning input on webapps where there's a
lot of copy/paste from varied sources.


#13231

From	Vlastimil Brom <vlastimil.brom@gmail.com>
Date	2011-09-13 15:33 +0200
Message-ID	<mailman.1072.1315920782.27778.python-list@python.org>
In reply to	#13226

2011/9/13 ron <vacorama@gmail.com>:
>
> Depending on the load, you can do something like:
>
> "".join([x for x in string if ord(x) < 128])
>
> It's worked great for me in cleaning input on webapps where there's a
> lot of copy/paste from varied sources.
> --
> http://mail.python.org/mailman/listinfo/python-list
>
Well, for this kind of dirty "data cleaning" you may as well use e.g.

>>> u"äteöxt ÛÜÝ wiÉÊËÌthÞßà áânoûüýþn ASɔɕɖCɗɘəɚɛIɗɘəɚɛIεζ iηθιn жзbetийклweeჟრსn .ტუ..ფ".encode("ascii", "ignore").decode("ascii")
u'text  with non ASCII in between ...'
>>>

vbr
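
A minimal sketch, not from the thread, that puts the two suggestions above side by side. It assumes Python 3, where string literals are Unicode without the u prefix; the function names strip_by_filter and strip_by_codec are hypothetical.

```python
def strip_by_filter(text):
    """Keep only characters whose code point is below 128."""
    return "".join(ch for ch in text if ord(ch) < 128)


def strip_by_codec(text):
    """Let the ascii codec silently drop whatever it cannot encode."""
    return text.encode("ascii", "ignore").decode("ascii")


sample = "x \u2264 1 caf\u00e9"
print(strip_by_filter(sample))
print(strip_by_codec(sample))
```

Both give the same result for this input, and both discard information: the comparison operator and the accented letter disappear without any warning.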


#13234

From	Alec Taylor <alec.taylor6@gmail.com>
Date	2011-09-14 01:02 +1000
Message-ID	<mailman.1074.1315926128.27778.python-list@python.org>
In reply to	#13226

Hmm, nothing mentioned so far works for me...

Here's a very small test case:

>>> python -u "Convert to Creole.py"
  File "Convert to Creole.py", line 1
SyntaxError: Non-ASCII character '\xe2' in file Convert to Creole.py
on line 1, but no encoding declared; see
http://www.python.org/peps/pep-0263.html for details
>>> Exit Code: 1

Line 1: a=u'''≤'''.encode("ascii", "ignore").decode("ascii")

On Tue, Sep 13, 2011 at 11:33 PM, Vlastimil Brom
<vlastimil.brom@gmail.com> wrote:
> 2011/9/13 ron <vacorama@gmail.com>:
>>
>> Depending on the load, you can do something like:
>>
>> "".join([x for x in string if ord(x) < 128])
>>
>> It's worked great for me in cleaning input on webapps where there's a
>> lot of copy/paste from varied sources.
> Well, for this kind of dirty "data cleaning" you may as well use e.g.
>
>>>> u"äteöxt ÛÜÝ wiÉÊËÌthÞßà áânoûüýþn ASɔɕɖCɗɘəɚɛIɗɘəɚɛIεζ iηθιn жзbetийклweeჟრსn .ტუ..ფ".encode("ascii", "ignore").decode("ascii")
> u'text  with non ASCII in between ...'
>>>>
>
> vbr


#13235

From	Jussi Piitulainen <jpiitula@ling.helsinki.fi>
Date	2011-09-13 18:29 +0300
Message-ID	<qotty8gwjeo.fsf@ruuvi.it.helsinki.fi>
In reply to	#13234

Alec Taylor writes:

> Hmm, nothing mentioned so far works for me...
> 
> Here's a very small test case:
> 
> >>> python -u "Convert to Creole.py"
>   File "Convert to Creole.py", line 1
> SyntaxError: Non-ASCII character '\xe2' in file Convert to Creole.py
> on line 1, but no encoding declared; see
> http://www.python.org/peps/pep-0263.html for details
> >>> Exit Code: 1
> 
> Line 1: a=u'''≤'''.encode("ascii", "ignore").decode("ascii")

The people who told you to declare the source code encoding in the
source file would like to see Line 0.

See <http://www.python.org/peps/pep-0263.html>.

[1001] ruuvi$ cat ctc.py
# coding=utf-8
print u'''x ≤ 1'''.encode("ascii", "ignore").decode("ascii")
[1002] ruuvi$ python ctc.py
x  1
[1003] ruuvi$
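
A minimal sketch, not from the thread, of how to check which encoding a source file declares. It assumes Python 3 and its standard tokenize.detect_encoding(); the helper name declared_encoding is hypothetical. Note one difference from the Python 2 session above: Python 3 falls back to utf-8 when no declaration is present, while Python 2 falls back to ASCII, which is why the undeclared file in this thread fails.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding Python 3 would use to decode these source bytes."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _first_lines = tokenize.detect_encoding(readline)
    return encoding


with_declaration = b"# coding=utf-8\nprint('x \xe2\x89\xa4 1')\n"
without_declaration = b"x = 1\n"

print(declared_encoding(with_declaration))
print(declared_encoding(without_declaration))
```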


#13239

From	Vlastimil Brom <vlastimil.brom@gmail.com>
Date	2011-09-13 20:13 +0200
Message-ID	<mailman.1081.1315937627.27778.python-list@python.org>
In reply to	#13226

2011/9/13 Alec Taylor <alec.taylor6@gmail.com>:
> Hmm, nothing mentioned so far works for me...
>
> Here's a very small test case:
>
>>>> python -u "Convert to Creole.py"
>  File "Convert to Creole.py", line 1
> SyntaxError: Non-ASCII character '\xe2' in file Convert to Creole.py
> on line 1, but no encoding declared; see
> http://www.python.org/peps/pep-0263.html for details
>>>> Exit Code: 1
>
> Line 1: a=u'''≤'''.encode("ascii", "ignore").decode("ascii")
>
> On Tue, Sep 13, 2011 at 11:33 PM, Vlastimil Brom
> <vlastimil.brom@gmail.com> wrote:
>> 2011/9/13 ron <vacorama@gmail.com>:
>>>
>>> Depending on the load, you can do something like:
>>>
>>> "".join([x for x in string if ord(x) < 128])
>>>
>>> It's worked great for me in cleaning input on webapps where there's a
>>> lot of copy/paste from varied sources.
>> Well, for this kind of dirty "data cleaning" you may as well use e.g.
>>
>>>>> u"äteöxt ÛÜÝ wiÉÊËÌthÞßà áânoûüýþn ASɔɕɖCɗɘəɚɛIɗɘəɚɛIεζ iηθιn жзbetийклweeჟრსn .ტუ..ფ".encode("ascii", "ignore").decode("ascii")
>> u'text  with non ASCII in between ...'
>>>>>
>>
>> vbr
>

Ok, in that case the encoding would probably be utf-8; \xe2 is just
the first byte of the encoded data:

>>> u'≤'.encode("utf-8")
'\xe2\x89\xa4'
>>>

Setting this encoding at the beginning of the file, as mentioned
before, might solve the problem while retaining the symbol in question
(or you could move from syntax error to some unicode related error
depending on other circumstances...).

vbr
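
A minimal sketch, not from the thread, of why the error message names '\xe2': it is the first byte that an ASCII decoder cannot accept. It assumes Python 3, where encoded data is a bytes object; the helper name first_non_ascii_byte is hypothetical.

```python
def first_non_ascii_byte(data):
    """Return the first byte that the ascii codec rejects, or None."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first undecodable byte.
        return data[exc.start]
    return None


encoded = "\u2264".encode("utf-8")   # the less-than-or-equal sign
print(encoded)
print(hex(first_non_ascii_byte(encoded)))
```

The same three bytes decode cleanly once they are read as utf-8, which is what the encoding declaration tells the interpreter to do.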

