Content-Type: text/plain; charset="UTF-8"
Message-Id: <1450471.fLXkozyCbt@PointedEars.de>
From: Thomas 'PointedEars' Lahn <PointedEars@web.de>
Reply-To: Thomas 'PointedEars' Lahn <usenet@PointedEars.de>
Organization: PointedEars Software (PES)
Date: Sat, 23 Apr 2011 21:33:33 +0200
User-Agent: KNode/4.4.7
Content-Transfer-Encoding: 8Bit
Subject: Re: detecting newline character
Newsgroups: comp.lang.python
References: <4DB315D7.1020405@rulez.sk> <mailman.781.1303585944.9059.python-list@python.org>
Followup-To: comp.lang.python
MIME-Version: 1.0

Chris Rebert wrote:

> On Sat, Apr 23, 2011 at 11:09 AM, Daniel Geržo <danger@rulez.sk> wrote:
>> I need to detect the newline characters used in the file I am reading.
>> For this purpose I am using the following code:
>>
>> def _read_lines(self):
>>     with contextlib.closing(codecs.open(self.path, "rU")) as fobj:
>>         fobj.readlines()
>>         if isinstance(fobj.newlines, tuple):
>>             self.newline = fobj.newlines[0]
>>         else:
>>             self.newline = fobj.newlines
>>
>> This works fine if I call codecs.open() without the encoding argument. I am
>> testing with an ASCII English text file, and in that case fobj.newlines is
>> correctly detected as '\r\n'. However, when I call codecs.open() with the
>> encoding='ascii' argument, fobj.newlines is None and I can't figure out
>> why. Reading the PEP at http://www.python.org/dev/peps/pep-0278/ I don't
>> see any reason why I would end up with newlines being None after I call
>> readlines().
>>
>> Does anyone have an idea?
> 
> I would hypothesize that it's an interaction bug between universal
> newlines and codecs.open().
> 
> […]
> I would speculate that the upshot of this is that codecs.open() ends
> up calling built-in open() with a nonsense `mode` of "rUb" or similar,
> resulting in strange behavior.
> 
> If this explanation is correct, then there are 2 bugs:
> 1. Built-in open() should treat "b" and "U" as mutually exclusive and
> reject mode strings which involve both.
> 2. codecs.open() should either reject modes involving "U", or be fixed
> so that they work as expected.

You might be correct that it was a bug (already fixed in versions newer than
2.5), since codecs.open() in my Python 2.6 reads as follows:

def open(filename, mode='rb', encoding=None, errors='strict', buffering=1):
    """
    …
    """
    if encoding is not None:
        if 'U' in mode:
            # No automatic conversion of '\n' is done on reading and writing
            mode = mode.strip().replace('U', '')
            if mode[:1] not in set('rwa'):
                mode = 'r' + mode
        if 'b' not in mode:
            # Force opening of the file in binary mode
            mode = mode + 'b'
    file = __builtin__.open(filename, mode, buffering)
    if encoding is None:
        return file
    info = lookup(encoding)
    srw = StreamReaderWriter(file, info.streamreader, info.streamwriter,
                             errors)
    # Add attributes to simplify introspection
    srw.encoding = encoding
    return srw
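
So as soon as an encoding is passed, the 'U' is stripped from the mode and
the file is opened in binary mode. No universal-newline handling takes place,
which would explain why fobj.newlines stays None. Below is a minimal sketch of
that, not the stdlib code: effective_mode() is a hypothetical helper that
copies only the mode handling from above, and detect_newline() is a
hypothetical helper that assumes io.open() (in the io module since
Python 2.6) is an acceptable replacement for codecs.open().

```python
import io
import os
import tempfile


def effective_mode(mode, encoding=None):
    """Hypothetical helper: mirror the mode rewriting quoted above."""
    if encoding is not None:
        if 'U' in mode:
            # 'U' is dropped, so universal newlines are never enabled
            mode = mode.strip().replace('U', '')
            if mode[:1] not in set('rwa'):
                mode = 'r' + mode
        if 'b' not in mode:
            # The underlying file is always opened in binary mode
            mode = mode + 'b'
    return mode


def detect_newline(path, encoding):
    """Hypothetical helper: decode with io.open() and report the newline."""
    # newline=None (the default) enables universal newlines while decoding
    with io.open(path, 'r', encoding=encoding) as fobj:
        fobj.read()
        newlines = fobj.newlines
    # newlines is None, a single string, or a tuple of strings
    if isinstance(newlines, tuple):
        return newlines[0]
    return newlines


if __name__ == '__main__':
    print(effective_mode('rU'))           # no encoding: mode is left alone
    print(effective_mode('rU', 'ascii'))  # with encoding: 'U' is gone

    # Write a small CRLF-terminated sample file and inspect it
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, 'wb') as out:
            out.write(b'first\r\nsecond\r\n')
        print(repr(detect_newline(path, 'ascii')))
    finally:
        os.remove(path)
```

With the sample file above, detect_newline() should report '\r\n' even
though an encoding is given, because io.open() decodes and translates
newlines in the same text layer.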

-- 
PointedEars