From: Ian Kelly
Date: Wed, 25 May 2011 19:06:04 -0600
Subject: Re: Python 3.2 bug? Reading the last line of a file
To: python-list@python.org
Newsgroups: comp.lang.python

On Wed, May 25, 2011 at 3:52 PM, MRAB wrote:
> What do you mean by "may include the decoder state in its return value"?
>
> It does make sense that the values returned from tell() won't be in the
> middle of an encoded sequence of bytes.

If you take a look at the source code, tell() returns a long that
includes decoder state data in the upper bytes. For example:

>>> data = b' ' + '\u0302a'.encode('utf-16')
>>> data
b' \xff\xfe\x02\x03a\x00'
>>> f = open('test.txt', 'wb')
>>> f.write(data)
7
>>> f.close()
>>> f = open('test.txt', 'r', encoding='utf-16')
>>> f.read()
Traceback (most recent call last):
  File "", line 1, in
  File "c:\python32\lib\codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "c:\python32\lib\encodings\utf_16.py", line 61, in _buffer_decode
    codecs.utf_16_ex_decode(input, errors, 0, final)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 6-6: truncated data

The problem of course is the initial space, throwing off the decoder.
We can try to seek past it:

>>> f.seek(1)
1
>>> f.read()
'\ufeff\u0302a'

But notice that since we're not reading from the beginning of the
file, the BOM has now been interpreted as data. However:

>>> f.seek(1 + (2 << 65))
73786976294838206465
>>> f.read()
'\u0302a'

And you can see that instead of reading from position
73786976294838206465 it has read from position 1 starting in the "read
a BOM" state. Note that I wouldn't recommend doing anything remotely
like this in production code, not least because the value that I
passed into seek() is platform-dependent. This is just a
demonstration of how the seek() value can include decoder state.

Cheers,
Ian
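
For contrast with the hand-built seek() value above, here is a minimal sketch of the portable way to use these values: treat whatever tell() returns as an opaque cookie and only ever hand it back to seek() on the same stream. The function name reread_from_mark and the sample text are hypothetical, and the sketch uses an in-memory io.BytesIO rather than test.txt. It deliberately asserts nothing about which bits of the cookie are set, since that layout is an implementation detail.

```python
import io


def reread_from_mark(raw_bytes, encoding="utf-16", chars_before_mark=1):
    """Read a few characters, remember the position, then re-read from it.

    Hypothetical helper for illustration only. The value returned by
    tell() is treated as an opaque cookie: it is never inspected or
    modified, only passed back to seek() on the same text stream.
    """
    stream = io.TextIOWrapper(io.BytesIO(raw_bytes), encoding=encoding)

    head = stream.read(chars_before_mark)  # consume some decoded characters
    cookie = stream.tell()                 # opaque; may encode decoder state

    first_pass = stream.read()             # read the rest of the stream

    stream.seek(cookie)                    # return to the remembered position
    second_pass = stream.read()            # should match the first pass

    return head, cookie, first_pass, second_pass


if __name__ == "__main__":
    sample = "h\u00e9llo w\u00f6rld".encode("utf-16")  # BOM + UTF-16 data
    head, cookie, first_pass, second_pass = reread_from_mark(sample)
    print(repr(head), cookie, repr(first_pass), repr(second_pass))
```

Because the cookie is only round-tripped and never constructed by hand, the sketch does not depend on byte order or on how a given Python build packs decoder state into the integer.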