From: Thomas 'PointedEars' Lahn
Newsgroups: comp.lang.python
Subject: Re: detecting newline character
Date: Sat, 23 Apr 2011 21:33:33 +0200
Message-Id: <1450471.fLXkozyCbt@PointedEars.de>
References: <4DB315D7.1020405@rulez.sk>

On Sat, Apr 23, 2011 at 11:09 AM, Daniel Geržo wrote:
>> I need to detect the newline characters used in the file I am reading.
>> For this purpose I am using the following code:
>>
>> def _read_lines(self):
>>     with contextlib.closing(codecs.open(self.path, "rU")) as fobj:
>>         fobj.readlines()
>>         if isinstance(fobj.newlines, tuple):
>>             self.newline = fobj.newlines[0]
>>         else:
>>             self.newline = fobj.newlines
>>
>> This works fine, if I call codecs.open() without encoding argument; I am
>> testing with an ASCII enghlish text file, and in such case the
>> fobj.newlines is correctly detected being as '\r\n'. However, when I call
>> codecs.open() with encoding='ascii' argument, the fobj.newlines is None
>> and I can't figure out why that is the case. Reading the PEP at
>> http://www.python.org/dev/peps/pep-0278/ I don't see any reason why would
>> I end up with newlines being None after I call readlines().
>>
>> Anyone has an idea?
>
> I would hypothesize that it's an interaction bug between universal
> newlines and codecs.open().
>
> […]
> I would speculate that the upshot of this is that codecs.open() ends
> up calling built-in open() with a nonsense `mode` of "rUb" or similar,
> resulting in strange behavior.
>
> If this explanation is correct, then there are 2 bugs:
> 1. Built-in open() should treat "b" and "U" as mutually exclusive and
> reject mode strings which involve both.
> 2. codecs.open() should either reject modes involving "U", or be fixed
> so that they work as expected.

You might be correct that it is a bug (already fixed in versions newer than
2.5), since codecs.open() from my Python 2.6 reads as follows:

  def open(filename, mode='rb', encoding=None, errors='strict', buffering=1):
      """
      …
      """
      if encoding is not None:
          if 'U' in mode:
              # No automatic conversion of '\n' is done on reading and writing
              mode = mode.strip().replace('U', '')
              if mode[:1] not in set('rwa'):
                  mode = 'r' + mode
          if 'b' not in mode:
              # Force opening of the file in binary mode
              mode = mode + 'b'
      file = __builtin__.open(filename, mode, buffering)
      if encoding is None:
          return file
      info = lookup(encoding)
      srw = StreamReaderWriter(file, info.streamreader, info.streamwriter, errors)
      # Add attributes to simplify introspection
      srw.encoding = encoding
      return srw

-- 
PointedEars
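
P.S. To see what the mode handling in the codecs.open() source quoted above does to the OP's "rU", the branch logic can be isolated into a small function. This is only a sketch: `effective_mode` is a hypothetical helper name, not part of the codecs module. It copies the mode rewriting from the quoted source and does not open any file.

```python
def effective_mode(mode='rb', encoding=None):
    """Return the mode string that the quoted codecs.open() would hand
    to the built-in open().  Hypothetical helper, for illustration only."""
    if encoding is not None:
        if 'U' in mode:
            # With an encoding, the universal-newlines flag is stripped
            mode = mode.strip().replace('U', '')
            if mode[:1] not in set('rwa'):
                mode = 'r' + mode
        if 'b' not in mode:
            # With an encoding, the file is always opened in binary mode
            mode = mode + 'b'
    return mode


if __name__ == '__main__':
    print(effective_mode('rU'))           # no encoding: mode passed through
    print(effective_mode('rU', 'ascii'))  # encoding given: 'U' dropped, 'b' added
```

Without an encoding the mode stays "rU"; with encoding='ascii' it becomes "rb". If that reading is right, the underlying file is opened in binary mode without universal-newline translation, which would explain why fobj.newlines is None in the second case.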