Newsgroup: comp.lang.python (thread #53323)

UnicodeDecodeError issue

Started by: Ferrous Cranus <nikos@superhost.gr>
First post: 2013-08-31 09:41 +0300
Last post: 2013-09-02 20:49 -0400
Articles: 20 on this page of 50, 11 participants



Contents

  UnicodeDecodeError issue Ferrous Cranus <nikos@superhost.gr> - 2013-08-31 09:41 +0300
    Re: UnicodeDecodeError issue Chris Angelico <rosuav@gmail.com> - 2013-08-31 16:53 +1000
      Re: UnicodeDecodeError issue Ferrous Cranus <nikos@superhost.gr> - 2013-08-31 10:02 +0300
        Re: UnicodeDecodeError issue Ferrous Cranus <nikos@superhost.gr> - 2013-08-31 10:18 +0300
    Re: UnicodeDecodeError issue Peter Otten <__peter__@web.de> - 2013-08-31 09:25 +0200
      Re: UnicodeDecodeError issue Ferrous Cranus <nikos@superhost.gr> - 2013-08-31 10:58 +0300
        Re: UnicodeDecodeError issue Ferrous Cranus <nikos@superhost.gr> - 2013-08-31 11:31 +0300
          Re: UnicodeDecodeError issue Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-08-31 11:28 +0000
            Re: UnicodeDecodeError issue Ferrous Cranus <nikos@superhost.gr> - 2013-08-31 15:58 +0300
              Re: UnicodeDecodeError issue Ferrous Cranus <nikos@superhost.gr> - 2013-08-31 16:07 +0300
              Re: UnicodeDecodeError issue Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-08-31 15:44 +0000
    Re: UnicodeDecodeError issue Ferrous Cranus <nikos.gr33k@gmail.com> - 2013-08-31 23:50 -0700
      Re: UnicodeDecodeError issue Chris Angelico <rosuav@gmail.com> - 2013-09-01 17:12 +1000
        Re: UnicodeDecodeError issue Ferrous Cranus <nikos@superhost.gr> - 2013-09-01 10:23 +0300
          Re: UnicodeDecodeError issue Chris Angelico <rosuav@gmail.com> - 2013-09-01 17:28 +1000
          Re: UnicodeDecodeError issue Dave Angel <davea@davea.name> - 2013-09-01 10:35 +0000
            Re: UnicodeDecodeError issue Ferrous Cranus <nikos@superhost.gr> - 2013-09-01 16:59 +0300
              Re: UnicodeDecodeError issue Dave Angel <davea@davea.name> - 2013-09-01 15:40 +0000
          Re: UnicodeDecodeError issue Chris Angelico <rosuav@gmail.com> - 2013-09-01 20:51 +1000
      Re: UnicodeDecodeError issue Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-09-01 08:35 +0000
        Re: UnicodeDecodeError issue Ferrous Cranus <nikos@superhost.gr> - 2013-09-01 17:08 +0300
          Re: UnicodeDecodeError issue Ferrous Cranus <nikos@superhost.gr> - 2013-09-01 17:25 +0300
          Re: UnicodeDecodeError issue Dave Angel <davea@davea.name> - 2013-09-01 15:36 +0000
            Re: UnicodeDecodeError issue Ferrous Cranus <nikos@superhost.gr> - 2013-09-01 19:10 +0300
              Re: UnicodeDecodeError issue Ferrous Cranus <nikos@superhost.gr> - 2013-09-02 01:23 +0300
                Re: UnicodeDecodeError issue Dave Angel <davea@davea.name> - 2013-09-01 23:14 +0000
                  Re: UnicodeDecodeError issue Ferrous Cranus <nikos@superhost.gr> - 2013-09-02 07:16 +0300
                    Re: UnicodeDecodeError issue Dave Angel <davea@davea.name> - 2013-09-02 11:38 +0000
                      Re: UnicodeDecodeError issue Ferrous Cranus <nikos@superhost.gr> - 2013-09-02 14:49 +0300
                        Re: UnicodeDecodeError issue Dave Angel <davea@davea.name> - 2013-09-02 12:21 +0000
                          Re: UnicodeDecodeError issue Ferrous Cranus <nikos@superhost.gr> - 2013-09-02 18:05 +0300
                            Re: UnicodeDecodeError issue Dave Angel <davea@davea.name> - 2013-09-02 18:28 +0000
                              Re: UnicodeDecodeError issue Ferrous Cranus <nikos.gr33k@gmail.com> - 2013-09-04 01:35 -0700
                                Re: UnicodeDecodeError issue Dave Angel <davea@davea.name> - 2013-09-04 11:26 +0000
                                  Re: UnicodeDecodeError issue Ferrous Cranus <nikos@superhost.gr> - 2013-09-04 14:38 +0300
                                    Re: UnicodeDecodeError issue Dave Angel <davea@davea.name> - 2013-09-04 12:38 +0000
                                      Re: UnicodeDecodeError issue Ferrous Cranus <nikos@superhost.gr> - 2013-09-04 17:29 +0300
                                        Re: UnicodeDecodeError issue Dave Angel <davea@davea.name> - 2013-09-05 00:17 +0000
                                          Re: UnicodeDecodeError issue Steven D'Aprano <steve@pearwood.info> - 2013-09-05 03:07 +0000
                                            Re: UnicodeDecodeError issue Chris Angelico <rosuav@gmail.com> - 2013-09-05 13:59 +1000
                                              Re: UnicodeDecodeError issue Steven D'Aprano <steve@pearwood.info> - 2013-09-05 05:28 +0000
                    Re: UnicodeDecodeError issue MRAB <python@mrabarnett.plus.com> - 2013-09-02 12:56 +0100
                    Re: UnicodeDecodeError issue Dave Angel <davea@davea.name> - 2013-09-02 12:24 +0000
                    Re: UnicodeDecodeError issue MRAB <python@mrabarnett.plus.com> - 2013-09-02 15:44 +0100
                      Re: UnicodeDecodeError issue wxjmfauth@gmail.com - 2013-09-03 08:23 -0700
                        Re: UnicodeDecodeError issue Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-09-04 10:01 +0200
                          Re: UnicodeDecodeError issue wxjmfauth@gmail.com - 2013-09-04 07:08 -0700
                    Re: UnicodeDecodeError issue Chris Angelico <rosuav@gmail.com> - 2013-09-03 08:45 +1000
                      Re: UnicodeDecodeError issue Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-09-03 14:56 +0000
                    Re: UnicodeDecodeError issue Joel Goldstick <joel.goldstick@gmail.com> - 2013-09-02 20:49 -0400



#53415

From: Ferrous Cranus <nikos@superhost.gr>
Date: 2013-09-01 17:08 +0300
Message-ID: <kvvhoi$1v4o$1@news.ntua.gr>
In reply to: #53406

On 1/9/2013 11:35 AM, Steven D'Aprano wrote:
> On Sat, 31 Aug 2013 23:50:23 -0700, Ferrous Cranus wrote:
>
>> On Saturday, 31 August 2013 9:41:27 AM UTC+3, Ferrous Cranus wrote:
>>> Suddenly my website superhost.gr, running my main Python script, presents
>>> me with this error:
>>>
>>> Code:
>>>
>>> UnicodeDecodeError('utf-8', b'\xb6\xe3\xed\xf9\xf3\xf4\xef
>>> \xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2', 0, 1,
>>> 'invalid start byte')
>>>
>>> Does anyone know what this means?
>>>
>>> --
>>> Webhost <http://superhost.gr>
>>
>> Good morning Steven,
>>
>> Yes, I'm aware that I need to define variables before I try to make use of
>> them. I have studied all of your examples and then reviewed my code, and I
>> can *assure* you that the statement that tries to set the 'host'
>> variable is very early at the top of the script (after the imports, of
>> course), and the cur.execute comes after.
>>
>> The problem here is not what you say, that I try to drink a coffee
>> before actually making one, but rather that I cannot drink the
>> coffee although I know *I have tried* to make one first.
>>
>> I will upload the code to pastebin to prove what I am saying.
>>
>> http://pastebin.com/J97guApQ
>
>
> You are mistaken. In lines 20-25, you have this:
>
> try:
>      gi = pygeoip.GeoIP('/usr/local/share/GeoIPCity.dat')
>      city = gi.time_zone_by_addr( os.environ['REMOTE_ADDR'] ) or
>          gi.time_zone_by_addr( os.environ['HTTP_CF_CONNECTING_IP'] )
>      host = socket.gethostbyaddr( os.environ['REMOTE_ADDR'] )[0] or
>          socket.gethostbyaddr( os.environ['HTTP_CF_CONNECTING_IP'] )[0]
>          or "Proxy Detected"
> except Exception as e:
>          print( repr(e), file=open( '/tmp/err.out', 'w' ) )
>
>
> An error occurs inside that block, *before* host gets set. Who knows what
> the error is? You have access to the err.out file, but apparently you
> aren't reading it to find out.
>
> Then, 110 lines later, at line 135, you try to access the value of "host"
> that never got set.
>
> Your job is to read the error in /tmp/err.out, see what is failing, and
> fix it.
>
>

But I am, Steven! That is why I read it immediately after my script runs
in the browser.

I have even included a sys.exit(0) after the try/except block.

Here it is:


errout = open( '/tmp/err.out', 'w' )    # opens and truncates the error output file
try:
    gi = pygeoip.GeoIP('/usr/local/share/GeoIPCity.dat')
    city = gi.time_zone_by_addr( os.environ['REMOTE_ADDR'] ) or gi.time_zone_by_addr( os.environ['HTTP_CF_CONNECTING_IP'] )
    host = socket.gethostbyaddr( os.environ['REMOTE_ADDR'] )[0] or socket.gethostbyaddr( os.environ['HTTP_CF_CONNECTING_IP'] )[0] or "Proxy Detected"
except Exception as e:
    print( "Xyzzy exception-", repr( sys.exc_info() ), file=errout )
    errout.flush()

sys.exit(0)

and the output of the error file is:


nikos@superhost.gr [~]# cat /tmp/err.out
UnicodeDecodeError('utf-8', b'\xb6\xe3\xed\xf9\xf3\xf4\xef 
\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2', 0, 1, 
'invalid start byte')
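
One way to find out which statement in the try block raises the error is to run each lookup on its own and log a full traceback. This is a minimal sketch, not the code from the script: the helper lookup(), the stand-in failing_lookup(), and the fallback strings are hypothetical, and an in-memory buffer stands in for /tmp/err.out.

```python
import io
import traceback


def lookup(label, func, default, log):
    """Run one lookup; on failure, log the full traceback and return a default."""
    try:
        return func()
    except Exception:
        print("lookup failed:", label, file=log)
        traceback.print_exc(file=log)
        log.flush()
        return default


def failing_lookup():
    # Hypothetical stand-in for a call such as socket.gethostbyaddr(...)
    # that ends up decoding bytes which are not valid UTF-8.
    return b'\xb6\xe3\xed'.decode('utf-8')


log = io.StringIO()  # stands in for open('/tmp/err.out', 'w')

# Each name is always bound afterwards, even when its lookup fails.
host = lookup("host", failing_lookup, "Unknown host", log)
city = lookup("city", lambda: "Europe/Athens", "Unknown city", log)

print(host, city)
print(log.getvalue())
```

With one call per lookup, the log names the failing step, and later code that uses host no longer depends on the try block having completed.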

-- 
Webhost <http://superhost.gr>




#53416

From: Ferrous Cranus <nikos@superhost.gr>
Date: 2013-09-01 17:25 +0300
Message-ID: <kvvioj$21rg$1@news.ntua.gr>
In reply to: #53415

On 1/9/2013 5:08 PM, Ferrous Cranus wrote:
> On 1/9/2013 11:35 AM, Steven D'Aprano wrote:
>> On Sat, 31 Aug 2013 23:50:23 -0700, Ferrous Cranus wrote:
>>
>>> On Saturday, 31 August 2013 9:41:27 AM UTC+3, Ferrous Cranus wrote:
>>>> Suddenly my website superhost.gr, running my main Python script, presents
>>>> me with this error:
>>>>
>>>> Code:
>>>>
>>>> UnicodeDecodeError('utf-8', b'\xb6\xe3\xed\xf9\xf3\xf4\xef
>>>> \xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2', 0, 1,
>>>> 'invalid start byte')
>>>>
>>>> Does anyone know what this means?
>>>>
>>>> --
>>>> Webhost <http://superhost.gr>
>>>
>>> Good morning Steven,
>>>
>>> Yes, I'm aware that I need to define variables before I try to make use of
>>> them. I have studied all of your examples and then reviewed my code, and I
>>> can *assure* you that the statement that tries to set the 'host'
>>> variable is very early at the top of the script (after the imports, of
>>> course), and the cur.execute comes after.
>>>
>>> The problem here is not what you say, that I try to drink a coffee
>>> before actually making one, but rather that I cannot drink the
>>> coffee although I know *I have tried* to make one first.
>>>
>>> I will upload the code to pastebin to prove what I am saying.
>>>
>>> http://pastebin.com/J97guApQ
>>
>>
>> You are mistaken. In lines 20-25, you have this:
>>
>> try:
>>      gi = pygeoip.GeoIP('/usr/local/share/GeoIPCity.dat')
>>      city = gi.time_zone_by_addr( os.environ['REMOTE_ADDR'] ) or
>>          gi.time_zone_by_addr( os.environ['HTTP_CF_CONNECTING_IP'] )
>>      host = socket.gethostbyaddr( os.environ['REMOTE_ADDR'] )[0] or
>>          socket.gethostbyaddr( os.environ['HTTP_CF_CONNECTING_IP'] )[0]
>>          or "Proxy Detected"
>> except Exception as e:
>>          print( repr(e), file=open( '/tmp/err.out', 'w' ) )
>>
>>
>> An error occurs inside that block, *before* host gets set. Who knows what
>> the error is? You have access to the err.out file, but apparently you
>> aren't reading it to find out.
>>
>> Then, 110 lines later, at line 135, you try to access the value of "host"
>> that never got set.
>>
>> Your job is to read the error in /tmp/err.out, see what is failing, and
>> fix it.
>>
>>
>
> But I am, Steven! That is why I read it immediately after my script runs
> in the browser.
>
> I have even included a sys.exit(0) after the try/except block.
>
> Here it is:
>
>
> errout = open( '/tmp/err.out', 'w' )        # opens and truncates the
> error output file
> try:
>      gi = pygeoip.GeoIP('/usr/local/share/GeoIPCity.dat')
>      city = gi.time_zone_by_addr( os.environ['REMOTE_ADDR'] ) or
> gi.time_zone_by_addr( os.environ['HTTP_CF_CONNECTING_IP'] )
>      host = socket.gethostbyaddr( os.environ['REMOTE_ADDR'] )[0] or
> socket.gethostbyaddr( os.environ['HTTP_CF_CONNECTING_IP'] )[0] or "Proxy
> Detected"
> except Exception as e:
>      print( "Xyzzy exception-", repr( sys.exc_info() ), file=errout )
>      errout.flush()
>
> sys.exit(0)
>
> and the output of error file is:
>
>
> nikos@superhost.gr [~]# cat /tmp/err.out
> UnicodeDecodeError('utf-8', b'\xb6\xe3\xed\xf9\xf3\xf4\xef
> \xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2', 0, 1,
> 'invalid start byte')
>


But I noticed that err.out and /usr/local/apache/logs/error_log produce
different output.

In any case, I checked both:


nikos@superhost.gr [~]# chmod 777 /tmp/err2.out

Output of error_log:
nikos@superhost.gr [~]# [Sun Sep 01 14:23:46 2013] [error] [client 173.245.49.120] Premature end of script headers: metrites.py
[Sun Sep 01 14:23:46 2013] [error] [client 173.245.49.120] File does not exist: /home/nikos/public_html/500.shtml



I have also changed the error output filename.
It turns out empty.

nikos@superhost.gr [~]# cat /tmp/err2.out

-- 
Webhost <http://superhost.gr>



#53417

From: Dave Angel <davea@davea.name>
Date: 2013-09-01 15:36 +0000
Message-ID: <mailman.450.1378049809.19984.python-list@python.org>
In reply to: #53415

On 1/9/2013 10:08, Ferrous Cranus wrote:

   <snip>
> Here is it:
>
>
> errout = open( '/tmp/err.out', 'w' )		# opens and truncates the error 
> output file
> try:
> 	gi = pygeoip.GeoIP('/usr/local/share/GeoIPCity.dat')
> 	city = gi.time_zone_by_addr( os.environ['REMOTE_ADDR'] ) or 
> gi.time_zone_by_addr( os.environ['HTTP_CF_CONNECTING_IP'] )
> 	host = socket.gethostbyaddr( os.environ['REMOTE_ADDR'] )[0] or 
> socket.gethostbyaddr( os.environ['HTTP_CF_CONNECTING_IP'] )[0] or "Proxy 
> Detected"
> except Exception as e:
> 	print( "Xyzzy exception-", repr( sys.exc_info() ), file=errout )
>      errout.flush()
>
> sys.exit(0)
>
> and the output of error file is:
>
>
> nikos@superhost.gr [~]# cat /tmp/err.out
> UnicodeDecodeError('utf-8', b'\xb6\xe3\xed\xf9\xf3\xf4\xef 
> \xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2', 0, 1, 
> 'invalid start byte')
>

Nope.  The label  "Xyzzy exception" is not in that file, so that's not
the file you created in this run.  Further, if that line existed before,
it would have been wiped out by the open with mode "w".

I suggest you add yet another write to that file, immediately after
opening it:

errout = open( '/tmp/err.out', 'w' )    # opens and truncates the error output file
print("starting run", file=errout)
errout.flush()

Until you can reliably examine the same file that was logging your
errors, you're just spinning your wheels.  You might even want to write
the time to the file, so that you can tell whether it was now, or 2 days
ago that the run was made.
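
A sketch of such a start marker, with a timestamp and process id so that a stale file is easy to spot. The path under the temporary directory is an assumption made for this example; the script in this thread writes to /tmp/err.out.

```python
import datetime
import os
import tempfile

# Hypothetical log location for this sketch only.
log_path = os.path.join(tempfile.gettempdir(), "err_demo.out")

with open(log_path, "w") as errout:       # "w" truncates output from earlier runs
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    print("starting run", stamp, "pid", os.getpid(), file=errout)
    errout.flush()                         # make the marker visible right away

# Read the marker back to confirm that this run wrote the file.
with open(log_path) as f:
    first_line = f.readline()

print(first_line, end="")
```

If the marker is missing or carries an old timestamp, the file being inspected is not the one the current run wrote.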


-- 
DaveA



#53419

From: Ferrous Cranus <nikos@superhost.gr>
Date: 2013-09-01 19:10 +0300
Message-ID: <kvvoua$2ifa$1@news.ntua.gr>
In reply to: #53417

On 1/9/2013 6:36 PM, Dave Angel wrote:
> On 1/9/2013 10:08, Ferrous Cranus wrote:
>
>     <snip>
>> Here is it:
>>
>>
>> errout = open( '/tmp/err.out', 'w' )		# opens and truncates the error
>> output file
>> try:
>> 	gi = pygeoip.GeoIP('/usr/local/share/GeoIPCity.dat')
>> 	city = gi.time_zone_by_addr( os.environ['REMOTE_ADDR'] ) or
>> gi.time_zone_by_addr( os.environ['HTTP_CF_CONNECTING_IP'] )
>> 	host = socket.gethostbyaddr( os.environ['REMOTE_ADDR'] )[0] or
>> socket.gethostbyaddr( os.environ['HTTP_CF_CONNECTING_IP'] )[0] or "Proxy
>> Detected"
>> except Exception as e:
>> 	print( "Xyzzy exception-", repr( sys.exc_info() ), file=errout )
>>       errout.flush()
>>
>> sys.exit(0)
>>
>> and the output of error file is:
>>
>>
>> nikos@superhost.gr [~]# cat /tmp/err.out
>> UnicodeDecodeError('utf-8', b'\xb6\xe3\xed\xf9\xf3\xf4\xef
>> \xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2', 0, 1,
>> 'invalid start byte')
>>
>
> Nope.  The label  "Xyzzy exception" is not in that file, so that's not
> the file you created in this run.  Further, if that line existed before,
> it would have been wiped out by the open with mode "w".
>
> I suggest you add yet another write to that file, immediately after
> opening it:
>
> errout = open( '/tmp/err.out', 'w' )    # opens and truncates the error output file
> print("starting run", file=errout)
> errout.flush()
>
> Until you can reliably examine the same file that was logging your
> errors, you're just spinning your wheels.  You might even want to write
> the time to the file, so that you can tell whether it was now, or 2 days
> ago that the run was made.
>
>


I tried it and it printed nothing.
But suddenly the webpage started to run, and I get no invalid byte entries
and no weird messages; files.py is working as expected.
What on earth?

Now I have this error:

# =================================================================================================================
# DATABASE INSERTS - do not increment the counter if a Cookie is set to the visitors browser already
# =================================================================================================================
if( not vip and re.search( r'(msn|gator|amazon|yandex|reverse|cloudflare|who|fetch|barracuda|spider|google|crawl|pingdom)', host ) is None ):

    print( "i'm in and data is: ", host )
    try:
        #find the needed counter for the page URL
        if os.path.exists( path + page ) or os.path.exists( cgi_path + page ):
            cur.execute('''SELECT ID FROM counters WHERE url = %s''', page )
            data = cur.fetchone()        #URL is unique, so should only be one

        if not data:
            #first time for page; primary key is automatic, hit is defaulted
            cur.execute('''INSERT INTO counters (url) VALUES (%s)''', page )
            cID = cur.lastrowid        #get the primary key value of the new record
        else:
            #found the page, save primary key and use it to issue hit UPDATE
            cID = data[0]
            cur.execute('''UPDATE counters SET hits = hits + 1 WHERE ID = %s''', cID )

        #find the visitor record for the (saved) cID and current host
        cur.execute('''SELECT * FROM visitors WHERE counterID = %s and host = %s''', (cID, host) )
        data = cur.fetchone()        #cID&host are unique

        if not data:
            #first time for this host on this page, create new record
            cur.execute('''INSERT INTO visitors (counterID, host, city, useros, browser, lastvisit) VALUES (%s, %s, %s, %s, %s, %s)''', (cID, host, city, useros, browser, date) )
        else:
            #found the page, save its primary key for later use
            vID = data[0]
            #UPDATE record using retrieved vID
            cur.execute('''UPDATE visitors SET city = %s, useros = %s, browser = %s, hits = hits + 1, lastvisit = %s
                                    WHERE counterID = %s and host = %s''', (city, useros, browser, date, vID, host) )

        con.commit()        #if we made it here, the transaction is complete

    except pymysql.ProgrammingError as e:
        print( repr(e) )
        con.rollback()        #something failed, rollback the entire transaction
        sys.exit(0)


I get no counter increment when visitors visit my webpage.
What on earth is going on?

How did the previous error with the invalid byte get solved?

-- 
Webhost <http://superhost.gr>



#53437

From: Ferrous Cranus <nikos@superhost.gr>
Date: 2013-09-02 01:23 +0300
Message-ID: <l00epm$1ce3$1@news.ntua.gr>
In reply to: #53419

On 1/9/2013 7:10 PM, Ferrous Cranus wrote:
> On 1/9/2013 6:36 PM, Dave Angel wrote:
>> On 1/9/2013 10:08, Ferrous Cranus wrote:
>>
>>     <snip>
>>> Here is it:
>>>
>>>
>>> errout = open( '/tmp/err.out', 'w' )        # opens and truncates the
>>> error
>>> output file
>>> try:
>>>     gi = pygeoip.GeoIP('/usr/local/share/GeoIPCity.dat')
>>>     city = gi.time_zone_by_addr( os.environ['REMOTE_ADDR'] ) or
>>> gi.time_zone_by_addr( os.environ['HTTP_CF_CONNECTING_IP'] )
>>>     host = socket.gethostbyaddr( os.environ['REMOTE_ADDR'] )[0] or
>>> socket.gethostbyaddr( os.environ['HTTP_CF_CONNECTING_IP'] )[0] or "Proxy
>>> Detected"
>>> except Exception as e:
>>>     print( "Xyzzy exception-", repr( sys.exc_info() ), file=errout )
>>>       errout.flush()
>>>
>>> sys.exit(0)
>>>
>>> and the output of error file is:
>>>
>>>
>>> nikos@superhost.gr [~]# cat /tmp/err.out
>>> UnicodeDecodeError('utf-8', b'\xb6\xe3\xed\xf9\xf3\xf4\xef
>>> \xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2', 0, 1,
>>> 'invalid start byte')
>>>
>>
>> Nope.  The label  "Xyzzy exception" is not in that file, so that's not
>> the file you created in this run.  Further, if that line existed before,
>> it would have been wiped out by the open with mode "w".
>>
>> I suggest you add yet another write to that file, immediately after
>> opening it:
>>
>> errout = open( '/tmp/err.out', 'w' )    # opens and truncates the error output file
>> print("starting run", file=errout)
>> errout.flush()
>>
>> Until you can reliably examine the same file that was logging your
>> errors, you're just spinning your wheels.  You might even want to write
>> the time to the file, so that you can tell whether it was now, or 2 days
>> ago that the run was made.
>>
>>
>
>
> I tried it and it printed nothing.
> But suddenly the webpage started to run, and I get no invalid byte entries
> and no weird messages; files.py is working as expected.
> What on earth?
>
> Now I have this error:
>
> #
> =================================================================================================================
>
> # DATABASE INSERTS - do not increment the counter if a Cookie is set to
> the visitors browser already
> #
> =================================================================================================================
>
> if( not vip and re.search(
> r'(msn|gator|amazon|yandex|reverse|cloudflare|who|fetch|barracuda|spider|google|crawl|pingdom)',
> host ) is None ):
>
>      print( "i'm in and data is: ", host )
>      try:
>          #find the needed counter for the page URL
>          if os.path.exists( path + page ) or os.path.exists( cgi_path +
> page ):
>              cur.execute('''SELECT ID FROM counters WHERE url = %s''',
> page )
>              data = cur.fetchone()        #URL is unique, so should only
> be one
>
>          if not data:
>              #first time for page; primary key is automatic, hit is
> defaulted
>              cur.execute('''INSERT INTO counters (url) VALUES (%s)''',
> page )
>              cID = cur.lastrowid        #get the primary key value of
> the new record
>          else:
>              #found the page, save primary key and use it to issue hit
> UPDATE
>              cID = data[0]
>              cur.execute('''UPDATE counters SET hits = hits + 1 WHERE ID
> = %s''', cID )
>
>          #find the visitor record for the (saved) cID and current host
>          cur.execute('''SELECT * FROM visitors WHERE counterID = %s and
> host = %s''', (cID, host) )
>          data = cur.fetchone()        #cID&host are unique
>
>          if not data:
>              #first time for this host on this page, create new record
>              cur.execute('''INSERT INTO visitors (counterID, host, city,
> useros, browser, lastvisit) VALUES (%s, %s, %s, %s, %s, %s)''', (cID,
> host, city, useros, browser, date) )
>          else:
>              #found the page, save its primary key for later use
>              vID = data[0]
>              #UPDATE record using retrieved vID
>              cur.execute('''UPDATE visitors SET city = %s, useros = %s,
> browser = %s, hits = hits + 1, lastvisit = %s
>                                      WHERE counterID = %s and host =
> %s''', (city, useros, browser, date, vID, host) )
>
>          con.commit()        #if we made it here, the transaction is
> complete
>
>      except pymysql.ProgrammingError as e:
>          print( repr(e) )
>          con.rollback()        #something failed, rollback the entire
> transaction
>          sys.exit(0)
>
>
> I get no counter increment when visitors visit my webpage.
> What on earth is going on?
>
> How did the previous error with the invalid byte get solved?
>
I still wonder how the invalid byte message disappeared.

-- 
Webhost <http://superhost.gr>



#53438

From: Dave Angel <davea@davea.name>
Date: 2013-09-01 23:14 +0000
Message-ID: <mailman.462.1378077287.19984.python-list@python.org>
In reply to: #53437

On 1/9/2013 18:23, Ferrous Cranus wrote:

    <snip>
>>
> I still wonder how the invalid byte message disappeared.
>

Too bad you never bothered to narrow it down to its source.  It could
be anywhere on those three lines.  If I had to guess, I'd figure it was
one of those environment variables.  The Linux environment variables are
strings of bytes, and the os.environ is a dict of strings.  Apparently
it converts them using utf-8, and if you've somehow set them using some
other encoding, you could be getting that error.

Have you tried to decode those bytes in various encodings other than
utf-8?
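
Trying several encodings can be sketched as follows. The byte string is copied from the error message earlier in the thread; the list of candidate encodings is only a guess for a Greek-language server, and try_decodings() is a hypothetical helper written for this example.

```python
# Bytes copied from the UnicodeDecodeError message in this thread.
raw = (b'\xb6\xe3\xed\xf9\xf3\xf4\xef \xfc\xed\xef\xec\xe1 '
       b'\xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2')

# Candidate encodings are an assumption, not something the thread establishes.
candidates = ["utf-8", "iso8859_7", "cp1253", "latin-1"]


def try_decodings(data, encodings):
    """Map each encoding name to the decoded text, or to None if decoding fails."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


results = try_decodings(raw, candidates)
for name, text in results.items():
    print(f"{name:10} {text!r}")
```

A successful decode does not prove the encoding is right: latin-1 accepts any byte sequence, so the result still has to be read by someone who knows the language. To test a suspect environment variable in Python 3 on a POSIX system, os.fsencode(os.environ[name]) should give back its raw bytes, which can then be passed to the same helper.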

-- 
Signature file not found



#53453

From: Ferrous Cranus <nikos@superhost.gr>
Date: 2013-09-02 07:16 +0300
Message-ID: <l013f1$230h$1@news.ntua.gr>
In reply to: #53438

On 2/9/2013 2:14 AM, Dave Angel wrote:
> On 1/9/2013 18:23, Ferrous Cranus wrote:
>
>      <snip>
>>>
>> I still wonder how the invalid byte message disappeared.
>>
>
> Too bad you never bothered to narrow it down to its source.


If only I had known how, up until yesterday, when the errors were still appearing.


> It could
> be anywhere on those three lines.  If I had to guess, I'd figure it was
> one of those environment variables.  The Linux environment variables are
> strings of bytes, and the os.environ is a dict of strings.  Apparently
> it converts them using utf-8, and if you've somehow set them using some
> other encoding, you could be getting that error.
>
> Have you tried to decode those bytes in various encodings other than
> utf-8 ?


No, because I wasn't aware of which string/variable they pertained to.


-- 
Webhost <http://superhost.gr>



#53474

From: Dave Angel <davea@davea.name>
Date: 2013-09-02 11:38 +0000
Message-ID: <mailman.484.1378121913.19984.python-list@python.org>
In reply to: #53453

On 2/9/2013 00:16, Ferrous Cranus wrote:


>>
>> Have you tried to decode those bytes in various encodings other than
>> utf-8 ?
>
>
> No, because I wasn't aware of which string/variable they pertained to.
>
>

  http://pypi.python.org/pypi/chardet

is a package which tries to 'guess' an encoding for a string of bytes. 
I happen to have the 2.7 version installed, but not the 3.x version, so
the following is in 2.7. Same thing should work in 3.3....

>>> chardet.detect(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2')
{'confidence': 0.9638983132261467, 'encoding': 'windows-1253'}
>>> print b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2'.decode('windows-1253')
¶γνωστοόνομα συστήματος


I don't have a clue what it might be;  it's not English, and I don't
know whatever language it may be in.

Does that string make any sense to you?  You may want to try it on your
own machine, since the email may obscure the encoding.  Or you might
want to do the decode using whatever the default encoding is for that
server.

The Linux 'file' utility thinks this string is in ISO-8859, so you might
want to try a decode('ISO-8859-1') as well.  (and maybe  ISO-8859-2, -3,
-4, and -5)
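
The two Greek 8-bit encodings can be compared directly in Python 3. This is a sketch under the assumption that the text is Greek; the byte string is copied from the original error message (including the space after \xef), and the codec names are the ones Python ships.

```python
raw = (b'\xb6\xe3\xed\xf9\xf3\xf4\xef \xfc\xed\xef\xec\xe1 '
       b'\xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2')

as_cp1253 = raw.decode('cp1253')        # the encoding chardet guessed
as_iso8859_7 = raw.decode('iso8859_7')  # the ISO Greek encoding

# Collect every position where the two decodings disagree.
differences = [(i, a, b)
               for i, (a, b) in enumerate(zip(as_cp1253, as_iso8859_7))
               if a != b]

print(as_cp1253)
print(as_iso8859_7)
print(differences)
```

The two results differ only in the first character: byte 0xb6 is a pilcrow in windows-1253 but a capital alpha with tonos in ISO-8859-7, which suggests the latter is the better fit for this string.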




-- 
DaveA



#53475

From: Ferrous Cranus <nikos@superhost.gr>
Date: 2013-09-02 14:49 +0300
Message-ID: <l01tvf$1a8k$1@news.ntua.gr>
In reply to: #53474

On 2/9/2013 2:38 PM, Dave Angel wrote:
> On 2/9/2013 00:16, Ferrous Cranus wrote:
>
>
>>>
>>> Have you tried to decode those bytes in various encodings other than
>>> utf-8 ?
>>
>>
>> No, because I wasn't aware of which string/variable they pertained to.
>>
>>
>
>    http://pypi.python.org/pypi/chardet
>
> is a package which tries to 'guess' an encoding for a string of bytes.
> I happen to have the 2.7 version installed, but not the 3.x version, so
> the following is in 2.7. Same thing should work in 3.3....
>
>>>> chardet.detect(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2')
> {'confidence': 0.9638983132261467, 'encoding': 'windows-1253'}
>>>> print b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2'.decode('windows-1253')
> ¶γνωστοόνομα συστήματος
>
>
> I don't have a clue what it might be;  it's not English, and I don't
> know whatever language it may be in.
>
> Does that string make any sense to you?

Yes, it does. It means "Unknown Hostname".

> The Linux 'file' utility thinks this string is in ISO-8859, so you might
> want to try a decode('ISO-8859-1') as well.  (and maybe  ISO-8859-2, -3,
> -4, and -5)

How did you test it? As far as I know, the utility analyzes a file's
encoding, not the encoding of a string.

nikos@superhost.gr [~]# file www/cgi-bin/files.py
www/cgi-bin/files.py: a /usr/bin/python script text executable


-- 
Webhost <http://superhost.gr>



#53478

From: Dave Angel <davea@davea.name>
Date: 2013-09-02 12:21 +0000
Message-ID: <mailman.487.1378124525.19984.python-list@python.org>
In reply to: #53475

On 2/9/2013 07:49, Ferrous Cranus wrote:
    <snip>
> On 2/9/2013 2:38 PM, Dave Angel wrote:
>>
>> Does that string make any sense to you?
>
> Yes, it does. It means "Unknown Hostname".
>
>> The Linux 'file' utility thinks this string is in ISO-8859, so you might
>> want to try a decode('ISO-8859-1') as well.  (and maybe  ISO-8859-2, -3,
>> -4, and -5)
>
> How did you test it? As far as I know, the utility analyzes a file's
> encoding, not the encoding of a string.
>

Starting with the byte string in the error message:

>>> f = open("junk.txt", "w")
>>> f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')
>>> f.close()


> nikos@superhost.gr [~]# file www/cgi-bin/files.py
> www/cgi-bin/files.py: a /usr/bin/python script text executable
>
>
No point in doing that, as the string in question doesn't exist there.

-- 
DaveA



#53492

From: Ferrous Cranus <nikos@superhost.gr>
Date: 2013-09-02 18:05 +0300
Message-ID: <l029ev$2a1g$1@news.ntua.gr>
In reply to: #53478

On 2/9/2013 3:21 PM, Dave Angel wrote:
> Starting with the byte string in the error message:
>>>> f = open("junk.txt", "w")
>>>> f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')
>>>> f.close()


Indeed, but yet again, 'file' checks the encoding of the filename that
contains the lines above, not of the actual strings.


-- 
Webhost <http://superhost.gr>



#53522

From: Dave Angel <davea@davea.name>
Date: 2013-09-02 18:28 +0000
Message-ID: <mailman.511.1378146537.19984.python-list@python.org>
In reply to: #53492

On 2/9/2013 11:05, Ferrous Cranus wrote:

> On 2/9/2013 3:21 PM, Dave Angel wrote:
>> Starting with the byte string in the error message:
>>>>> f = open("junk.txt", "w")
>>>>> f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')
>>>>> f.close()
>
>
> Indeed, but yet again, 'file' checks the encoding of the filename that
> contains the lines above, not of the actual strings.
>
>

'file' does nothing interesting with the filename, it just opens it and
examines the contents.  For example,

file www/cgi-bin/files.py

will examine the Python source file, not run it.

So first in the interpreter, I ran

>>> f = open("junk.txt", "w")
>>> f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')
>>> f.close()

then at the bash prompt, I ran:

davea@think2:~$ file junk.txt 
junk.txt: ISO-8859 text
davea@think2:~$ 
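
A note on reproducing the three interpreter lines above: they pass a bytes object to a file opened in text mode ("w"). That works on Python 2, but on Python 3 it raises TypeError. The following is only a sketch, assuming Python 3, that writes the same bytes in binary mode. The temporary directory is an addition for illustration, so that nothing in the current directory gets overwritten.

```python
import os
import tempfile

# Byte string copied from the error message quoted in this thread.
data = (b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 '
        b'\xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')

# Hypothetical location: a fresh temporary directory (not part of the thread).
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "junk.txt")

# "wb" stores the bytes exactly as given; no encoding step is involved.
with open(path, "wb") as f:
    f.write(data)

# Read the file back in binary mode to confirm what ended up on disk.
with open(path, "rb") as f:
    on_disk = f.read()

print(path)
print(on_disk == data)
```

Running 'file' on the printed path then examines exactly those bytes.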





-- 
DaveA



#53609

From: Ferrous Cranus <nikos.gr33k@gmail.com>
Date: 2013-09-04 01:35 -0700
Message-ID: <3e549761-4323-4379-b4e4-ce51597d59c0@googlegroups.com>
In reply to: #53522

On Monday, 2 September 2013 9:28:36 PM UTC+3, Dave Angel wrote:
> On 2/9/2013 11:05, Ferrous Cranus wrote:
>
> > On 2/9/2013 3:21 PM, Dave Angel wrote:
> >> Starting with the byte string in the error message:
> >>>>> f = open("junk.txt", "w")
> >>>>> f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')
> >>>>> f.close()
> >
> > Indeed but yet again, file checks out the encoding of the filename that
> > consists of these lines above, not of the actual strings.
>
> 'file' does nothing interesting with the filename, it just opens it and
> examines the contents.  For example,
>
> file www/cgi-bin/files.py
>
> will examine the Python source file, not run it.
>
> So first in the interpreter, I ran
>
> >>>> f = open("junk.txt", "w")
> >>>> f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')
> >>>> f.close()
>
> then at the bash prompt, I ran:
>
> davea@think2:~$ file junk.txt
> junk.txt: ISO-8859 text


That is one clever idea, Dave.

I take it that the charset of the file 'junk.txt' gets identified from the encoding of the characters read from within the file?

But wait a minute: what editor did you use to write these 3 lines?
I mean, I am a bit confused.

For example, I run 'nano tets.py' and put this inside it:

f = open("junk.txt", "w")
f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')
f.close()

Then I save the file from within nano, which by default uses the utf-8 charset.

How would 'file' be able to detect that the byte string inside is supposed to be greek-iso?



#53616

From: Dave Angel <davea@davea.name>
Date: 2013-09-04 11:26 +0000
Message-ID: <mailman.38.1378294002.5461.python-list@python.org>
In reply to: #53609

On 4/9/2013 04:35, Ferrous Cranus wrote:

> On Monday, 2 September 2013 9:28:36 PM UTC+3, Dave Angel wrote:
>> On 2/9/2013 11:05, Ferrous Cranus wrote:
>>
>> > On 2/9/2013 3:21 PM, Dave Angel wrote:
>> >> Starting with the byte string in the error message:
>> >>>>> f = open("junk.txt", "w")
>> >>>>> f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')
>> >>>>> f.close()
>> >
>> > Indeed but yet again, file checks out the encoding of the filename that
>> > consists of these lines above, not of the actual strings.
>>
>> 'file' does nothing interesting with the filename, it just opens it and
>> examines the contents.  For example,
>>
>> file www/cgi-bin/files.py
>>
>> will examine the Python source file, not run it.
>>
>> So first in the interpreter, I ran
>>
>> >>>> f = open("junk.txt", "w")
>> >>>> f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')
>> >>>> f.close()
>>
>> then at the bash prompt, I ran:
>>
>> davea@think2:~$ file junk.txt
>> junk.txt: ISO-8859 text
>
>
> That is one Clever Idea Dave.
>
> I take it that the charset of the file 'junk.txt' gets identified by the characters encoding that read form within the file?

'file' only guesses the most likely encoding for 'junk.txt'.  But at
least it can know it's not utf-8, since that would give a decoding
error.

That's why, whenever 'file' makes its verdict, it's up to you to check
it by displaying the data after decoding it with that tentative
encoding.
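
As a rough illustration of that check (a sketch, assuming Python 3): UTF-8 can be ruled out because decoding fails, and a tentative encoding can be tried and inspected by eye. The choice of ISO-8859-7 here is an assumption; 'file' only reported "ISO-8859" without naming a part.

```python
# Byte string copied from the error message quoted in this thread.
data = (b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 '
        b'\xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')

# Step 1: UTF-8 is strict, so invalid input raises UnicodeDecodeError.
try:
    data.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print("not UTF-8:", exc.reason)

# Step 2: decode with the tentative encoding and look at the result.
# ISO-8859-7 (Greek) is a guess that a human still has to confirm.
text = data.decode("iso-8859-7")
print(text)
```

A successful decode in step 2 proves nothing by itself; the point is to read the printed text and judge whether it makes sense.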

>
> But wait a minute: What editor do you uses to write these 3 lines?
> I mean am a bit confused.

As I said right above, "in the interpreter, I ran"...
And if that's not clear enough, you can see the >>>> prompts that the
Python interpreter uses.  By interpreter, I mean I ran Python with no
parameters.  I did not run IDLE or any other IDE that might take it
upon itself to interfere.


>
> i for example i 'nano tets.py' which has within:
>
> f = open("junk.txt", "w") 
> f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n') 
> f.close() 
>
> then when i save the file within nano for example by default in utf-8 charset

That's the encoding for the file tets.py, and you'll notice that it's
actually ASCII.  Notice that the string I copied from the error message
uses escape sequences for all non-ASCII bytes.

>
> how would it be able to detect the bytestring within that is supposed to be of greek-iso's

I wouldn't be running 'file' on the tets.py file, but on the junk.txt
file created when you run
    python tets.py

So since the tets.py file was a sidetrack, I just ran those three lines
in the interpreter.

-- 
DaveA



#53618

From: Ferrous Cranus <nikos@superhost.gr>
Date: 2013-09-04 14:38 +0300
Message-ID: <l07641$vh2$1@dont-email.me>
In reply to: #53616

On 4/9/2013 2:26 PM, Dave Angel wrote:
> On 4/9/2013 04:35, Ferrous Cranus wrote:
>
>> On Monday, 2 September 2013 9:28:36 PM UTC+3, Dave Angel wrote:
>>> On 2/9/2013 11:05, Ferrous Cranus wrote:
>>>
>>>> On 2/9/2013 3:21 PM, Dave Angel wrote:
>>>>> Starting with the byte string in the error message:
>>>>>>>> f = open("junk.txt", "w")
>>>>>>>> f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')
>>>>>>>> f.close()
>>>>
>>>> Indeed but yet again, file checks out the encoding of the filename that
>>>> consists of these lines above, not of the actual strings.
>>>
>>> 'file' does nothing interesting with the filename, it just opens it and
>>> examines the contents.  For example,
>>>
>>> file www/cgi-bin/files.py
>>>
>>> will examine the Python source file, not run it.
>>>
>>> So first in the interpreter, I ran
>>>
>>>>>>> f = open("junk.txt", "w")
>>>>>>> f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')
>>>>>>> f.close()
>>>
>>> then at the bash prompt, I ran:
>>>
>>> davea@think2:~$ file junk.txt
>>> junk.txt: ISO-8859 text
>>
>>
>> That is one Clever Idea Dave.
>>
>> I take it that the charset of the file 'junk.txt' gets identified by the characters encoding that read form within the file?
>
> 'file' only guesses the most likely encoding for 'junk.txt'  But at
> least it can know it's not utf-8, since that would give an decoding
> error.
>
> That's why, whenever 'file' makes its verdict, it's up to you to check
> it by displaying the data after decoding it with that tentative
> encoding.
>
>>
>> But wait a minute: What editor do you uses to write these 3 lines?
>> I mean am a bit confused.
>
> As I said right above, "in the interpreter, I ran"...
> And if that's not clear enough, you can see the >>>> prompts that the
> Python interpreter uses.  By interpeter, I mean I ran Python with no
> parameters.  I did not run IDLE or any other IDE, that might take it
> upon itself to interfere.
>
>
>>
>> i for example i 'nano tets.py' which has within:
>>
>> f = open("junk.txt", "w")
>> f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')
>> f.close()
>>
>> then when i save the file within nano for example by default in utf-8 charset
>
> That's the encoding for the file tets.py, and you'll notice that it's
> actually ASCII.  Notice that the string I copied from the error message
> uses escape sequences for all non-ASCII bytes.
>
>>
>> how would it be able to detect the bytestring within that is supposed to be of greek-iso's
>
> I wouldn't be running 'file' on the tets.py file, but on the junk.txt
> file created when you run
>      python tets.py
>
> So since the tets.py file was a sidetrack, I just ran those three lines
> in the interpreter.
>
I'm still confused about this.

Say we save those 3 lines inside junk.txt, and we save it by default as utf-8.

When we run 'file junk.txt', what will 'file' respond with?

The filename's charset?

Or will it look at the byte string within to decide what encoding it uses?

-- 
Webhost <http://superhost.gr>



#53624

From: Dave Angel <davea@davea.name>
Date: 2013-09-04 12:38 +0000
Message-ID: <mailman.42.1378298304.5461.python-list@python.org>
In reply to: #53618

On 4/9/2013 07:38, Ferrous Cranus wrote:

> On 4/9/2013 2:26 PM, Dave Angel wrote:

>>
>>>>
>>>> So first in the interpreter, I ran
>>>>
>>>>
>>>>
>>>>>>>> f = open("junk.txt", "w")
>>>>
>>>>>>>> f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')
>>>>
>>>>>>>> f.close()
>>>>
>>>>
>>>>
         <snip>
>> So since the tets.py file was a sidetrack, I just ran those three lines
>> in the interpreter.
>>
> I'm still consused about this.
>
> say we save those 3 lines inside junk.txt and we save it by default as utf-8
>
> when we 'file junk.txt'
>
> what will file respond with?

junk2.txt: ASCII text

>
> filename's charset?
>
> or
>
> will it llook at the bystering within to decide what encoding it uses?
>

'file' isn't magic.  And again, it doesn't look at the filename, it
looks at the content.  What heuristics it uses, I don't know, but it has
hundreds of them.   ( I wish you hadn't confused the issue by using the
same name junk.txt for an entirely different purpose) When it looks at a
file like this one, it looks only at the bytes within it. In this
case, the instance of 'file' on my machine decides it's an ASCII file.

If I add a silly shebang line

#!/usr/tmp/pyttthon

it says
junk2.txt: a /usr/tmp/pyttthon script, ASCII text executable

It doesn't know it's python, it just trusts the shebang line.  And it
identifies it as ASCII, not utf-8, since there are no non-ascii
characters in it.  It certainly does not try to interpret the b'xxxx'
byte string by Python syntax rules.
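
To make the "only the bytes count" point concrete, here is a deliberately crude content-based guesser. It is a sketch only: the function name `crude_guess` and its three categories are invented for illustration, and this is not the set of heuristics that 'file' really uses.

```python
def crude_guess(raw):
    """Very rough content-based guess; NOT the algorithm used by 'file'."""
    # Every byte below 0x80 means the data is plain ASCII.
    if all(byte < 0x80 for byte in raw):
        return "ASCII text"
    # Otherwise see whether the bytes at least form valid UTF-8.
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "some 8-bit encoding (cannot tell which)"
    return "UTF-8 text"


# The Python source line as it sits on disk: backslashes, letters, digits.
source_line = rb"f.write(b'\xb6\xe3\xed')"
# The bytes that the same literal produces once Python executes it.
executed = b'\xb6\xe3\xed'
# Greek text saved as UTF-8, for comparison.
greek_utf8 = "\u0386\u03b3\u03bd".encode("utf-8")

print(crude_guess(source_line))
print(crude_guess(executed))
print(crude_guess(greek_utf8))
```

The source line lands in the ASCII bucket because the guesser, like 'file', never interprets the escape sequences.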




-- 
DaveA



#53627

From: Ferrous Cranus <nikos@superhost.gr>
Date: 2013-09-04 17:29 +0300
Message-ID: <l07g4b$lcv$3@dont-email.me>
In reply to: #53624

On 4/9/2013 3:38 PM, Dave Angel wrote:
> 'file' isn't magic.  And again, it doesn't look at the filename, it
> looks at the content.
So, you are saying that it looks at the content of the file and not at
the encoding we used to save the file?

But the contents have within:

f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 
\xf3\xf\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')

so it should have said greek-iso and not ascii.

-- 
Webhost <http://superhost.gr>



#53660

From: Dave Angel <davea@davea.name>
Date: 2013-09-05 00:17 +0000
Message-ID: <mailman.68.1378340276.5461.python-list@python.org>
In reply to: #53627

On 4/9/2013 10:29, Ferrous Cranus wrote:

> On 4/9/2013 3:38 PM, Dave Angel wrote:
>> 'file' isn't magic.  And again, it doesn't look at the filename, it
>> looks at the content.
> So, you are saying that it looks a the content of the file and not of 
> what encoding we used to save the file into?

That's right.  There's no place where your text editor stores the
encoding it used, so 'file' has to guess, based only on the content.
>
> But the contents have within:
>
> f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 
> \xf3\xf\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')
>
> so it should have said greek-iso and not ascii.
>

No, that line is totally ASCII.  Only when it's EXECUTED by Python will
a non ASCII byte string object be created.  Like I said, 'file' doesn't
know the first thing about Python syntax, nor should it.
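
The difference between the source text and the object it creates can be checked directly. A small sketch, assuming Python 3; `ast.literal_eval` stands in here for "executed by Python", and the shortened four-byte literal is just an excerpt chosen for illustration.

```python
import ast

# The literal exactly as typed in a source file: a raw string keeps the
# backslashes, so this is 19 ordinary characters.
literal_text = r"b'\xb6\xe3\xed\xf9'"

# Encoding to ASCII succeeds, so the source text itself is pure ASCII.
source_bytes = literal_text.encode("ascii")

# Evaluating the literal is what turns the escapes into real bytes.
value = ast.literal_eval(literal_text)

print(len(literal_text), "source characters ->", len(value), "bytes")
print(max(source_bytes) < 128, max(value) < 128)
```

Only `value`, the result of evaluation, contains bytes above 127; the text that a tool like 'file' sees on disk does not.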

-- 
Signature file not found



#53664

From: Steven D'Aprano <steve@pearwood.info>
Date: 2013-09-05 03:07 +0000
Message-ID: <5227f57e$0$2743$c3e8da3$76491128@news.astraweb.com>
In reply to: #53660

On Thu, 05 Sep 2013 00:17:36 +0000, Dave Angel wrote:

> On 4/9/2013 10:29, Ferrous Cranus wrote:
> 
>> On 4/9/2013 3:38 PM, Dave Angel wrote:
>>> 'file' isn't magic.  And again, it doesn't look at the filename, it
>>> looks at the content.
>> So, you are saying that it looks a the content of the file and not of
>> what encoding we used to save the file into?
> 
> That's right.  There's no place where your text editor stores the
> encoding it used, so 'file' has to guess, based only on the content.

Correct. The thing that people often fail to understand is that there is 
no *reliable* way to store the encoding used for a text file in the text 
file itself. The encoding is *metadata*, not data: it is data about the 
data, and consequently it has to be stored "out of band". It has to be 
stored somewhere else, outside of the file.

In the case of text files, it is usually not stored anywhere at all. IBM 
mainframes assume that text files are using EBCDIC; modern Linux systems 
assume text files are UTF-8; old DOS applications assume text files are 
ASCII. Some text editors will try to guess the encoding, using various 
heuristics such as "if the file starts with \xFE\xFF it is UTF-16" but 
none of them are foolproof:

http://blogs.msdn.com/b/oldnewthing/archive/2004/03/24/95235.aspx

sometimes with amusing consequences:

http://www.hoax-slayer.com/bush-hid-the-facts-notepad.html
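
A sketch of the byte-order-mark heuristic mentioned above, assuming Python 3. The helper name `encoding_from_bom` is made up for this example, and UTF-32 marks are left out to keep it short. The last call shows why the heuristic is not foolproof.

```python
import codecs


def encoding_from_bom(raw):
    """Guess an encoding from a leading byte order mark, or return None."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith(codecs.BOM_UTF16_BE):   # b'\xfe\xff'
        return "utf-16-be"
    if raw.startswith(codecs.BOM_UTF16_LE):   # b'\xff\xfe'
        return "utf-16-le"
    return None


# Genuine UTF-16 big-endian data with a BOM in front.
real_utf16 = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
# Latin-1 text that merely happens to begin with the same two bytes.
latin1_text = "\u00fe\u00ffabc".encode("latin-1")

print(encoding_from_bom(real_utf16))
print(encoding_from_bom(b"plain ascii"))
print(encoding_from_bom(latin1_text))
```

The Latin-1 sample is reported as UTF-16 even though it is not, which is exactly the kind of wrong guess the linked articles describe.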



>> But the contents have within:
>>
>> f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1
>> \xf3\xf\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')
>>
>> so it should have said greek-iso and not ascii.

But the above byte string is also valid ISO-8859-5 (Cyrillic):

'Жуэљѓєяќэяьсѓ\x0fѓєоьсєяђ\n'

ISO-8859-2 (Central European):

'śăíůóôďüíďěáó\x0fóôŢěáôďň\n'

and ISO-8859-4 (Baltic):

'ļãíųķôīüíīėáķ\x0fķôŪėáôīō\n'


Surely you don't expect the file utility to actually recognise that 
'Άγνωστοόνομασ\x0fστήματος\n' makes a valid Greek phrase while the others 
are not meaningful?
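
The comparison above is easy to reproduce. This sketch assumes Python 3 and uses the byte string as it appeared in the original error message, not the line-wrapped copy quoted just above (where one escape lost a digit), so the output differs slightly from the strings shown.

```python
# Byte string copied from the error message quoted in this thread.
data = (b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 '
        b'\xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')

# Greek, Cyrillic, Central European, Baltic.
candidates = ["iso-8859-7", "iso-8859-5", "iso-8859-2", "iso-8859-4"]

# Every one of these decodes without error; only the meaning differs.
decoded = {name: data.decode(name) for name in candidates}

for name, text in decoded.items():
    print(name, repr(text))
```

None of the four decodes raises an exception, so validity alone cannot tell the encodings apart; only a reader who knows Greek can pick the right one.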



> No, that line is totally ASCII.  Only when it's EXECUTED by Python will
> a non ASCII byte string object be created.  Like I said, 'file' doesn't
> know the first thing about Python syntax, nor should it.

Technically, it's not ASCII, since ASCII only knows about bytes \x00 
through \x7F (decimal 0 through 127). That's why it isn't correct to 
describe Python bytes strings as "ASCII strings". They're byte strings 
that happen to be displayed as ASCII-plus-other-stuff.



-- 
Steven



#53668

From: Chris Angelico <rosuav@gmail.com>
Date: 2013-09-05 13:59 +1000
Message-ID: <mailman.71.1378353565.5461.python-list@python.org>
In reply to: #53664

On Thu, Sep 5, 2013 at 1:07 PM, Steven D'Aprano <steve@pearwood.info> wrote:
> Technically, it's not ASCII, since ASCII only knows about bytes \x00
> through \x7F (decimal 0 through 127). That's why it isn't correct to
> describe Python bytes strings as "ASCII strings". They're byte strings
> that happen to be displayed as ASCII-plus-other-stuff.

The line of code is itself entirely ASCII. The sequence REVERSE
SOLIDUS, LATIN SMALL LETTER X, LATIN SMALL LETTER B, DIGIT SIX is four
Unicode characters that are in the ASCII set. That Python interprets
them as representing the byte value 182 doesn't change that; the line
of code *is* ASCII.
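
The four character names can be confirmed with the standard `unicodedata` module. A short sketch, assuming Python 3:

```python
import unicodedata

# The escape sequence as source text: four separate characters.
escape = r"\xb6"

# Look up the Unicode name of each character and check it is in ASCII range.
names = [unicodedata.name(ch) for ch in escape]
all_ascii = all(ord(ch) < 128 for ch in escape)

# The value that Python derives from the two hex digits.
byte_value = int(escape[2:], 16)

for name in names:
    print(name)
print(all_ascii, byte_value)
```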

ChrisA


