From: Christopher Arndt
Newsgroups: de.comp.lang.python
Subject: [Python-de] re.split and Unicode in Python 3
Date: Fri, 29 Jul 2016 16:45:16 +0200

I have just noticed this odd behavior in Python 3.5:

Python 3.5.1+ (default, Mar 30 2016, 22:46:26)
[GCC 5.3.1 20160330] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> s = 'One\u2003Two'
>>> re.search('\s+', s)
<_sre.SRE_Match object; span=(3, 4), match='\u2003'>
>>> re.search('\s+', s, re.ASCII)
>>>
^^^ # --> No match
>>> re.split('\s+', s)
['One', 'Two']
>>> re.split('\s+', s, re.ASCII)
['One', 'Two']

Bug?

For context: '\u2003' is an em space, i.e. a whitespace character in Unicode. With re.ASCII, re.search no longer treats it as whitespace, yet re.split still splits on it.

Chris
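
A possible explanation, sketched here rather than confirmed in the post: the documented signature is re.split(pattern, string, maxsplit=0, flags=0), so a third positional argument is taken as maxsplit, not as flags. Under that reading, re.ASCII (an integer flag with value 256) would only cap the number of splits at 256, and the pattern would still be matched in Unicode mode. The variable names below are illustrative only; the pattern is written as a raw string and maxsplit is passed by keyword to keep the comparison explicit.

```python
import re

s = 'One\u2003Two'  # U+2003 EM SPACE between the two words

# Default (Unicode) matching: \s covers the em space, so the string is split.
unicode_split = re.split(r'\s+', s)

# Flags passed by keyword: \s is restricted to ASCII whitespace, so no split.
ascii_split = re.split(r'\s+', s, flags=re.ASCII)

# What the positional call in the session amounts to under this reading:
# re.ASCII lands in the maxsplit slot, i.e. "split at most 256 times",
# and the pattern is still matched in Unicode mode.
maxsplit_split = re.split(r'\s+', s, maxsplit=int(re.ASCII))

print(unicode_split)
print(ascii_split)
print(maxsplit_split)
```

If this reading is right, re.search behaves differently simply because its third parameter is flags, and passing flags=re.ASCII by keyword to re.split gives the result the session was expecting.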