From: Ian Kelly
Date: Thu, 5 Jun 2014 18:05:34 -0600
Subject: Re: Unicode and Python - how often do you index strings?
Newsgroups: comp.lang.python

On Thu, Jun 5, 2014 at 2:34 PM, Albert-Jan Roskam wrote:
>> If you want to be really picky about removing exactly one line
>> terminator, then this captures all the relatively modern variations:
>> re.sub('\r?\n$|\n?\r$', '', line, count=1)
>
> or perhaps: re.sub("[^ \S]+$", "", line)

That will remove more than one terminator, plus tabs, since the class
matches any whitespace except a space and the + makes it greedy. Points
for including \f and \v though.

I suppose if we want to be absolutely correct, we should follow the
Unicode standard:

re.sub(r'\r?\n$|[\r\v\f\x85\u2028\u2029]$', '', line, count=1)
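
For anyone who wants to try this, here is a minimal sketch of the Unicode-style pattern wrapped in a helper. The function name strip_one_terminator is made up for illustration, and the sketch assumes Python 3.3 or later so that \u escapes are understood inside the pattern. It also anchors with \Z rather than $, because in Python $ additionally matches just before a final newline, which is a small departure from the pattern as posted.

```python
import re

# Assumption: "line terminator" means CR LF, LF, CR, VT, FF, NEL (U+0085),
# LS (U+2028) or PS (U+2029). CR LF is tried first so it counts as one unit.
# \Z anchors strictly at the end of the string, unlike $, which in Python
# can also match just before a trailing newline.
_TERMINATOR = re.compile(r'(?:\r\n|[\n\r\v\f\x85\u2028\u2029])\Z')


def strip_one_terminator(line):
    """Return line with at most one trailing line terminator removed."""
    # Argument order for sub is (replacement, string); count=1 is
    # redundant with the \Z anchor but kept to mirror the original call.
    return _TERMINATOR.sub('', line, count=1)
```

With this, strip_one_terminator("abc\r\n") gives "abc", while "abc\n\n" loses only its last newline and "abc\t\n" keeps its tab.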