| Started by | Stefan Behnel <stefan_ml@behnel.de> |
|---|---|
| First post | 2011-09-12 10:43 +0200 |
| Last post | 2011-09-13 20:13 +0200 |
| Articles | 13 — 9 participants |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: How do I automate the removal of all non-ascii characters from my code? Stefan Behnel <stefan_ml@behnel.de> - 2011-09-12 10:43 +0200
Re: How do I automate the removal of all non-ascii characters from my code? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-09-12 18:49 +1000
Re: How do I automate the removal of all non-ascii characters from my code? Dave Angel <davea@ieee.org> - 2011-09-12 08:09 -0400
Re: How do I automate the removal of all non-ascii characters from my code? jmfauth <wxjmfauth@gmail.com> - 2011-09-12 07:47 -0700
Re: How do I automate the removal of all non-ascii characters from my code? "Rhodri James" <rhodri@wildebst.demon.co.uk> - 2011-09-12 22:39 +0100
Re: How do I automate the removal of all non-ascii characters from my code? jmfauth <wxjmfauth@gmail.com> - 2011-09-13 00:49 -0700
Re: How do I automate the removal of all non-ascii characters from my code? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-09-13 18:15 +1000
Re: How do I automate the removal of all non-ascii characters from my code? jmfauth <wxjmfauth@gmail.com> - 2011-09-13 02:04 -0700
Re: How do I automate the removal of all non-ascii characters from my code? ron <vacorama@gmail.com> - 2011-09-13 05:31 -0700
Re: How do I automate the removal of all non-ascii characters from my code? Vlastimil Brom <vlastimil.brom@gmail.com> - 2011-09-13 15:33 +0200
Re: How do I automate the removal of all non-ascii characters from my code? Alec Taylor <alec.taylor6@gmail.com> - 2011-09-14 01:02 +1000
Re: How do I automate the removal of all non-ascii characters from my code? Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2011-09-13 18:29 +0300
Re: How do I automate the removal of all non-ascii characters from my code? Vlastimil Brom <vlastimil.brom@gmail.com> - 2011-09-13 20:13 +0200
| From | Stefan Behnel <stefan_ml@behnel.de> |
|---|---|
| Date | 2011-09-12 10:43 +0200 |
| Subject | Re: How do I automate the removal of all non-ascii characters from my code? |
| Message-ID | <mailman.1021.1315817058.27778.python-list@python.org> |
Alec Taylor, 12.09.2011 10:33:
> from creole import html2creole
>
> from BeautifulSoup import BeautifulSoup
>
> VALID_TAGS = ['strong', 'em', 'p', 'ul', 'li', 'br', 'b', 'i', 'a', 'h1', 'h2']
>
> def sanitize_html(value):
>     soup = BeautifulSoup(value)
>
>     for tag in soup.findAll(True):
>         if tag.name not in VALID_TAGS:
>             tag.hidden = True
>
>     return soup.renderContents()
> html2creole(u(sanitize_html('''<h1
> style="margin-left:76.8px;margin-right:0;text-indent:0;">Abstract</h1>
> <p class="Standard"
> style="margin-left:76.8px;margin-right:0;text-indent:0;">
> [more stuff here]
> """))
Hi,
I'm not sure what you are trying to say with the above code, but if it's
the code that fails for you with the exception you posted, I would guess
that the problem is in the "[more stuff here]" part, which likely contains
a non-ASCII character. Note that you didn't declare the source file
encoding above. Do as Gary told you.
Stefan
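To check the guess above, one can list where the non-ASCII bytes in a source file actually sit. The following is a minimal sketch, not code from the thread: the helper name `find_non_ascii` and the sample bytes are assumptions made for illustration, and the sketch is written for Python 3 even though the thread is about Python 2.

```python
def find_non_ascii(source_bytes):
    """Return (line_number, column, byte_value) for every byte above 127."""
    hits = []
    for line_no, line in enumerate(source_bytes.splitlines(), start=1):
        # Iterating over a bytes object in Python 3 yields integers.
        for col, byte in enumerate(line, start=1):
            if byte > 127:
                hits.append((line_no, col, byte))
    return hits


# Hypothetical sample: a two-line source file containing U+2264, saved as UTF-8.
sample = "x = 1\ny = u'\u2264'\n".encode("utf-8")

for line_no, col, byte in find_non_ascii(sample):
    print("line %d, column %d: byte 0x%02x" % (line_no, col, byte))
```

For a real file, the bytes would come from `open(path, "rb").read()`; the first reported byte is the one the interpreter complains about.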
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2011-09-12 18:49 +1000 |
| Message-ID | <4e6dc7b4$0$29986$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #13162 |
On Mon, 12 Sep 2011 06:43 pm Stefan Behnel wrote:
> I'm not sure what you are trying to say with the above code, but if it's
> the code that fails for you with the exception you posted, I would guess
> that the problem is in the "[more stuff here]" part, which likely contains
> a non-ASCII character. Note that you didn't declare the source file
> encoding above. Do as Gary told you.
Even with a source code encoding, you will probably have problems with
source files including \xe2 and other "bad" chars. Unless they happen to
fall inside a quoted string literal, I would expect to get a SyntaxError.
I have come across this myself. While I haven't really investigated in great
detail, it appears to happen when copying and pasting code from a document
(usually HTML) which uses non-breaking spaces instead of \x20 space
characters. All it takes is just one to screw things up.
--
Steven
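The non-breaking-space case can be reproduced without touching a file. This is a small sketch under Python 3 (the thread itself concerns Python 2, where the error message differs); the helper name `compiles` and the `"<pasted>"` filename are illustrative assumptions.

```python
NBSP = "\u00a0"  # non-breaking space, as often pasted from HTML


def compiles(source):
    """Return True if the text compiles as Python source, False on SyntaxError."""
    try:
        compile(source, "<pasted>", "exec")
        return True
    except SyntaxError:
        return False


print(compiles("x = 1"))                  # ordinary spaces
print(compiles("x" + NBSP + "= 1"))       # NBSP between tokens
print(compiles("x = 'a" + NBSP + "b'"))   # NBSP inside a string literal
```

The second call is the problematic one: the character decodes correctly but is not valid between tokens, whereas the same character inside a string literal is accepted.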
| From | Dave Angel <davea@ieee.org> |
|---|---|
| Date | 2011-09-12 08:09 -0400 |
| Message-ID | <mailman.1028.1315829434.27778.python-list@python.org> |
| In reply to | #13164 |
On 01/-10/-28163 02:59 PM, Steven D'Aprano wrote:
> On Mon, 12 Sep 2011 06:43 pm Stefan Behnel wrote:
>
>> I'm not sure what you are trying to say with the above code, but if it's
>> the code that fails for you with the exception you posted, I would guess
>> that the problem is in the "[more stuff here]" part, which likely contains
>> a non-ASCII character. Note that you didn't declare the source file
>> encoding above. Do as Gary told you.
> Even with a source code encoding, you will probably have problems with
> source files including \xe2 and other "bad" chars. Unless they happen to
> fall inside a quoted string literal, I would expect to get a SyntaxError.
>
> I have come across this myself. While I haven't really investigated in great
> detail, it appears to happen when copying and pasting code from a document
> (usually HTML) which uses non-breaking spaces instead of \x20 space
> characters. All it takes is just one to screw things up.
>
>
For me, more common than the non-breaking space are the "smart quotes"
characters. In that case, one probably doesn't want to delete them, but
instead convert them into standard quotes.
DaveA
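Converting rather than deleting can be done with a translation table. A minimal sketch in Python 3, assuming only the four common curly quote characters matter; the names `SMART_QUOTES` and `straighten_quotes` are made up for this example.

```python
# Map the code points of curly quotes to their plain ASCII counterparts.
SMART_QUOTES = {
    0x2018: "'",   # left single quotation mark
    0x2019: "'",   # right single quotation mark
    0x201C: '"',   # left double quotation mark
    0x201D: '"',   # right double quotation mark
}


def straighten_quotes(text):
    """Replace curly quotes with straight ones, leaving everything else alone."""
    return text.translate(SMART_QUOTES)


pasted = "print(\u201chello\u201d)"
print(straighten_quotes(pasted))
```

Other typographic characters (dashes, ellipses) could be added to the table in the same way if they turn up in pasted code.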
| From | jmfauth <wxjmfauth@gmail.com> |
|---|---|
| Date | 2011-09-12 07:47 -0700 |
| Message-ID | <98d81fe1-79df-4e37-87aa-399a78b52353@bi2g2000vbb.googlegroups.com> |
| In reply to | #13164 |
On 12 sep, 10:49, Steven D'Aprano <steve +comp.lang.pyt...@pearwood.info> wrote:
>
> Even with a source code encoding, you will probably have problems with
> source files including \xe2 and other "bad" chars. Unless they happen to
> fall inside a quoted string literal, I would expect to get a SyntaxError.
>
This is absurd and complete nonsense. The purpose
of a coding directive is to inform the engine that
is processing a text file about the "language" it
has to speak. It can be an html, py or tex file.
If you have a problem, it is probably a mismatch between
your coding directive and the real coding of the
file. Typical case: ascii/utf-8 without signature.
jmf
| From | "Rhodri James" <rhodri@wildebst.demon.co.uk> |
|---|---|
| Date | 2011-09-12 22:39 +0100 |
| Message-ID | <op.v1ps4fqia8ncjz@gnudebst> |
| In reply to | #13183 |
On Mon, 12 Sep 2011 15:47:00 +0100, jmfauth <wxjmfauth@gmail.com> wrote:
> On 12 sep, 10:49, Steven D'Aprano <steve
> +comp.lang.pyt...@pearwood.info> wrote:
>>
>> Even with a source code encoding, you will probably have problems with
>> source files including \xe2 and other "bad" chars. Unless they happen to
>> fall inside a quoted string literal, I would expect to get a
>> SyntaxError.
>>
>
> This is absurd and complete nonsense. The purpose
> of a coding directive is to inform the engine that
> is processing a text file about the "language" it
> has to speak. It can be an html, py or tex file.
> If you have a problem, it is probably a mismatch between
> your coding directive and the real coding of the
> file. Typical case: ascii/utf-8 without signature.
Now read what Steven wrote again. The issue is that the program contains
characters that are syntactically illegal. The "engine" can be perfectly
correctly translating a character as a smart quote or a non-breaking space
or an e-umlaut or whatever, but that doesn't make the character legal!
--
Rhodri James *-* Wildebeest Herder to the Masses
| From | jmfauth <wxjmfauth@gmail.com> |
|---|---|
| Date | 2011-09-13 00:49 -0700 |
| Message-ID | <44bd982b-e740-4c77-974e-aa8c9eb47d6e@y4g2000vbx.googlegroups.com> |
| In reply to | #13201 |
On 12 sep, 23:39, "Rhodri James" <rho...@wildebst.demon.co.uk> wrote:
> Now read what Steven wrote again. The issue is that the program contains
> characters that are syntactically illegal. The "engine" can be perfectly
> correctly translating a character as a smart quote or a non breaking space
> or an e-umlaut or whatever, but that doesn't make the character legal!
>
Yes, you are right. I did not understand it that way.
However, a small correction/precision. Illegal characters
do not exist. One can "only" have ill-formed encoded code
points or an illegal encoded code point representing a
character/glyph.
Basically, in the present case, the issue is most probably
a mismatch between the coding directive and the real
coding, with "no coding directive" == 'ascii'.
jmf
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2011-09-13 18:15 +1000 |
| Message-ID | <4e6f112c$0$29997$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #13219 |
On Tue, 13 Sep 2011 05:49 pm jmfauth wrote:
> On 12 sep, 23:39, "Rhodri James" <rho...@wildebst.demon.co.uk> wrote:
>
>
>> Now read what Steven wrote again. The issue is that the program contains
>> characters that are syntactically illegal. The "engine" can be perfectly
>> correctly translating a character as a smart quote or a non breaking
>> space or an e-umlaut or whatever, but that doesn't make the character
>> legal!
>>
>
> Yes, you are right. I did not understand in that way.
>
> However, a small correction/precision. Illegal characters
> do not exist. One can "only" have ill-formed encoded code
> points or an illegal encoded code point representing a
> character/glyph.
You are wrong there. There are many ASCII characters which are illegal in
Python source code, at least outside of comments and string literals, and
possibly even there.
>>> code = "x = 1 + \b 2"  # all ASCII characters
>>> print(code)
x = 1 + 2
>>> exec(code)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
    x = 1 + 2
            ^
SyntaxError: invalid syntax
Now, imagine that a \b ASCII backspace character somehow gets
introduced into your source file. When you go to run the file, or import
it, you will get a SyntaxError. Changing the encoding will not help.
--
Steven
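A stray control character like the one above is invisible in most editors, so it can help to search for it explicitly. This is a rough sketch in Python 3; the helper name `find_control_chars` is hypothetical, and the set of control characters treated as harmless (tab, newline, carriage return, form feed) is an assumption of this example rather than a statement from the thread.

```python
# Control characters that routinely appear in source files and are left alone.
ALLOWED_CONTROLS = {"\t", "\n", "\r", "\f"}


def find_control_chars(source):
    """Return (index, repr) pairs for ASCII control characters in the text."""
    return [
        (index, repr(ch))
        for index, ch in enumerate(source)
        if ord(ch) < 32 and ch not in ALLOWED_CONTROLS
    ]


code = "x = 1 + \b 2"  # all ASCII, but contains a backspace
print(find_control_chars(code))
```

The reported index points at the backspace, which no encoding declaration would make acceptable.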
| From | jmfauth <wxjmfauth@gmail.com> |
|---|---|
| Date | 2011-09-13 02:04 -0700 |
| Message-ID | <4a674dba-078b-4353-a6ae-01c205460e91@w12g2000yqa.googlegroups.com> |
| In reply to | #13220 |
On 13 sep, 10:15, Steven D'Aprano <steve +comp.lang.pyt...@pearwood.info> wrote:
The intrinsic coding of the characters is one thing;
the usage of a byte stream supposed to represent a text
is another thing.
jmf
| From | ron <vacorama@gmail.com> |
|---|---|
| Date | 2011-09-13 05:31 -0700 |
| Message-ID | <ee50be9b-2710-423b-9c69-744cc173ac85@dq7g2000vbb.googlegroups.com> |
| In reply to | #13164 |
On Sep 12, 4:49 am, Steven D'Aprano <steve +comp.lang.pyt...@pearwood.info> wrote:
> On Mon, 12 Sep 2011 06:43 pm Stefan Behnel wrote:
>
> > I'm not sure what you are trying to say with the above code, but if it's
> > the code that fails for you with the exception you posted, I would guess
> > that the problem is in the "[more stuff here]" part, which likely contains
> > a non-ASCII character. Note that you didn't declare the source file
> > encoding above. Do as Gary told you.
>
> Even with a source code encoding, you will probably have problems with
> source files including \xe2 and other "bad" chars. Unless they happen to
> fall inside a quoted string literal, I would expect to get a SyntaxError.
>
> I have come across this myself. While I haven't really investigated in great
> detail, it appears to happen when copying and pasting code from a document
> (usually HTML) which uses non-breaking spaces instead of \x20 space
> characters. All it takes is just one to screw things up.
>
> --
> Steven
Depending on the load, you can do something like:
"".join([x for x in string if ord(x) < 128])
It's worked great for me in cleaning input on webapps where there's a
lot of copy/paste from varied sources.
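Wrapped in a function, the one-liner above might look like the following Python 3 sketch. The name `strip_non_ascii` is an illustrative assumption; note that the filter works on already-decoded text, so it cleans data at run time but cannot repair a source file that the interpreter refuses to parse.

```python
def strip_non_ascii(text):
    """Drop every character whose code point is outside the ASCII range."""
    return "".join(ch for ch in text if ord(ch) < 128)


print(strip_non_ascii("x \u2264 1"))    # the sign is removed, both spaces stay
print(strip_non_ascii("caf\u00e9"))     # accented letters are lost, not converted
```

Because characters are silently discarded, this suits "dirty" input cleaning better than text whose meaning depends on the removed symbols.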
| From | Vlastimil Brom <vlastimil.brom@gmail.com> |
|---|---|
| Date | 2011-09-13 15:33 +0200 |
| Message-ID | <mailman.1072.1315920782.27778.python-list@python.org> |
| In reply to | #13226 |
2011/9/13 ron <vacorama@gmail.com>:
>
> Depending on the load, you can do something like:
>
> "".join([x for x in string if ord(x) < 128])
>
> It's worked great for me in cleaning input on webapps where there's a
> lot of copy/paste from varied sources.
> --
> http://mail.python.org/mailman/listinfo/python-list
>
Well, for this kind of dirty "data cleaning" you may as well use e.g.
>>> u"äteöxt ÛÜÝ wiÉÊËÌthÞßà áânoûüýþn ASɔɕɖCɗɘəɚɛIɗɘəɚɛIεζ iηθιn жзbetийклweeჟრსn .ტუ..ფ".encode("ascii", "ignore").decode("ascii")
u'text with non ASCII in between ...'
>>>
vbr
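The second argument to `encode` selects an error handler, and "ignore" is only one of the standard choices. The sketch below, written for Python 3 with a made-up sample string, compares three of the built-in handlers so that the trade-off between dropping, marking and escaping is visible.

```python
text = "x \u2264 1"  # hypothetical input containing one non-ASCII character

# "ignore" silently drops what cannot be encoded.
dropped = text.encode("ascii", "ignore").decode("ascii")

# "replace" substitutes a question mark, so the loss stays visible.
marked = text.encode("ascii", "replace").decode("ascii")

# "backslashreplace" keeps the information as an escape sequence.
escaped = text.encode("ascii", "backslashreplace").decode("ascii")

print(repr(dropped))
print(repr(marked))
print(repr(escaped))
```

Which handler fits depends on whether the removed characters carry meaning; for the "less than or equal" sign in the thread, dropping it changes what the text says.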
| From | Alec Taylor <alec.taylor6@gmail.com> |
|---|---|
| Date | 2011-09-14 01:02 +1000 |
| Message-ID | <mailman.1074.1315926128.27778.python-list@python.org> |
| In reply to | #13226 |
Hmm, nothing mentioned so far works for me...
Here's a very small test case:
>>> python -u "Convert to Creole.py"
File "Convert to Creole.py", line 1
SyntaxError: Non-ASCII character '\xe2' in file Convert to Creole.py
on line 1, but no encoding declared; see
http://www.python.org/peps/pep-0263.html for details
>>> Exit Code: 1
Line 1: a=u'''≤'''.encode("ascii", "ignore").decode("ascii")
On Tue, Sep 13, 2011 at 11:33 PM, Vlastimil Brom
<vlastimil.brom@gmail.com> wrote:
> 2011/9/13 ron <vacorama@gmail.com>:
>>
>> Depending on the load, you can do something like:
>>
>> "".join([x for x in string if ord(x) < 128])
>>
>> It's worked great for me in cleaning input on webapps where there's a
>> lot of copy/paste from varied sources.
>> --
>> http://mail.python.org/mailman/listinfo/python-list
>>
> Well, for this kind of dirty "data cleaning" you may as well use e.g.
>
>>>> u"äteöxt ÛÜÝ wiÉÊËÌthÞßà áânoûüýþn ASɔɕɖCɗɘəɚɛIɗɘəɚɛIεζ iηθιn жзbetийклweeჟრსn .ტუ..ფ".encode("ascii", "ignore").decode("ascii")
> u'text with non ASCII in between ...'
>>>>
>
> vbr
> --
> http://mail.python.org/mailman/listinfo/python-list
>
| From | Jussi Piitulainen <jpiitula@ling.helsinki.fi> |
|---|---|
| Date | 2011-09-13 18:29 +0300 |
| Message-ID | <qotty8gwjeo.fsf@ruuvi.it.helsinki.fi> |
| In reply to | #13234 |
Alec Taylor writes:
> Hmm, nothing mentioned so far works for me...
>
> Here's a very small test case:
>
> >>> python -u "Convert to Creole.py"
> File "Convert to Creole.py", line 1
> SyntaxError: Non-ASCII character '\xe2' in file Convert to Creole.py
> on line 1, but no encoding declared; see
> http://www.python.org/peps/pep-0263.html for details
> >>> Exit Code: 1
>
> Line 1: a=u'''≤'''.encode("ascii", "ignore").decode("ascii")
The people who told you to declare the source code encoding in the
source file would like to see Line 0.
See <http://www.python.org/peps/pep-0263.html>.
[1001] ruuvi$ cat ctc.py
# coding=utf-8
print u'''x ≤ 1'''.encode("ascii", "ignore").decode("ascii")
[1002] ruuvi$ python ctc.py
x 1
[1003] ruuvi$
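Whether a file carries such a declaration can also be checked programmatically. This is a sketch for Python 3 only, using `tokenize.detect_encoding` from the standard library; the helper name `declared_encoding` and the sample sources are assumptions for illustration. One difference from the Python 2 interpreter used in the thread: Python 3 falls back to UTF-8, not ASCII, when no declaration is present.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding Python 3 would use to read these source bytes."""
    # detect_encoding looks for a BOM or a PEP 263 cookie in the first two lines.
    encoding, _lines_read = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


with_cookie = b"# coding=latin-1\nprint('x')\n"
without_cookie = b"print('x')\n"

print(declared_encoding(with_cookie))
print(declared_encoding(without_cookie))
```

The returned name is normalized by the library, so a cookie spelled `latin-1` is reported under its canonical form.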
| From | Vlastimil Brom <vlastimil.brom@gmail.com> |
|---|---|
| Date | 2011-09-13 20:13 +0200 |
| Message-ID | <mailman.1081.1315937627.27778.python-list@python.org> |
| In reply to | #13226 |
2011/9/13 Alec Taylor <alec.taylor6@gmail.com>:
> Hmm, nothing mentioned so far works for me...
>
> Here's a very small test case:
>
>>>> python -u "Convert to Creole.py"
> File "Convert to Creole.py", line 1
> SyntaxError: Non-ASCII character '\xe2' in file Convert to Creole.py
> on line 1, but no encoding declared; see
> http://www.python.org/peps/pep-0263.html for details
>>>> Exit Code: 1
>
> Line 1: a=u'''≤'''.encode("ascii", "ignore").decode("ascii")
>
> On Tue, Sep 13, 2011 at 11:33 PM, Vlastimil Brom
> <vlastimil.brom@gmail.com> wrote:
>> 2011/9/13 ron <vacorama@gmail.com>:
>>>
>>> Depending on the load, you can do something like:
>>>
>>> "".join([x for x in string if ord(x) < 128])
>>>
>>> It's worked great for me in cleaning input on webapps where there's a
>>> lot of copy/paste from varied sources.
>>> --
>>> http://mail.python.org/mailman/listinfo/python-list
>>>
>> Well, for this kind of dirty "data cleaning" you may as well use e.g.
>>
>>>>> u"äteöxt ÛÜÝ wiÉÊËÌthÞßà áânoûüýþn ASɔɕɖCɗɘəɚɛIɗɘəɚɛIεζ iηθιn жзbetийклweeჟრსn .ტუ..ფ".encode("ascii", "ignore").decode("ascii")
>> u'text with non ASCII in between ...'
>>>>>
>>
>> vbr
>> --
>> http://mail.python.org/mailman/listinfo/python-list
>>
>
Ok, in that case the encoding probably would be utf-8; \xe2 is just
the first part of the encoded data
>>> u'≤'.encode("utf-8")
'\xe2\x89\xa4'
>>>
Setting this encoding at the beginning of the file, as mentioned
before, might solve the problem while retaining the symbol in question
(or you could move from syntax error to some unicode related error
depending on other circumstances...).
vbr
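The relationship between the reported byte and the symbol can be checked directly. A minimal Python 3 sketch, assuming the file was saved as UTF-8 as suggested above; the variable names are illustrative.

```python
symbol = "\u2264"             # the "less than or equal" sign from the test case
raw = symbol.encode("utf-8")  # the bytes as they would sit in the file

print(raw)  # b'\xe2\x89\xa4'

# Reading those bytes as ASCII fails on the very first one, 0xe2.
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    print("ascii decoding fails at byte offset", exc.start)

# Reading them as UTF-8 recovers the original character.
print(raw.decode("utf-8") == symbol)
```

This matches the error message in the test case: the interpreter stops at the first byte of the three-byte sequence because no encoding was declared.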