Date: Sun, 17 May 2015 15:12:36 -0500
From: Tim Chase <python.list@tim.thechases.com>
To: Johannes Bauer <dfnsonfsduifb@gmx.de>
Cc: python-list@python.org
Subject: Re: textwrap.wrap() breaks non-breaking spaces
In-Reply-To: <mjaqqd$t0m$1@news.albasani.net>
References: <mjaqqd$t0m$1@news.albasani.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.98.1431894065.17265.python-list@python.org>

On 2015-05-17 21:39, Johannes Bauer wrote:
> Hey there,
> 
> so that textwrap.wrap() breaks non-breaking spaces, is this a bug or
> intended behavior? For example:
> 
> Python 3.4.0 (default, Apr 11 2014, 13:05:11)
> [GCC 4.8.2] on linux
> 
> >>> import textwrap
> >>> for line in textwrap.wrap("foo dont\xa0break " * 20):
> ...     print(line)
> ...
> foo dont break foo dont break foo dont break foo dont break foo dont
> break foo dont break foo dont break foo dont break foo dont break
> foo dont break foo dont break foo dont break foo dont break foo
> dont break foo dont break foo dont break foo dont break foo dont
> break foo dont break foo dont break
> 
> Apparently it does recognize that \xa0 is a kind of space, but it
> thinks it can break any space. The point of \xa0 being exactly to
> avoid this kind of thing.
> 
> Any remedy or ideas?

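Since textwrap.wrap() is built on the TextWrapper class, you
can subclass that and add a negative lookahead assertion to
its two word-separator regexes, so that a run of whitespace
chosen for splitting cannot start at a non-breaking space.
I've written that piece as a non-raw string so that Python
itself expands the "\u00a0" escape (hence the doubled
backslash in "\\s"). The re module has also accepted \uXXXX
escapes in str patterns since Python 3.3, so a raw string
should work there too.  You can compare the two regular
expressions with those in the original source file in your
$STDLIB/textwrap.py.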

  import textwrap
  import re
  
  class MyWrapper(textwrap.TextWrapper):
    wordsep_re = re.compile(
      '((?!\u00a0)\\s+|'                      # whitespace not starting at U+00A0
      r'[^\s\w]*\w+[^0-9\W]-(?=\w+[^0-9\W])|' # hyphenated words
      r'(?<=[\w\!\"\'\&\.\,\?])-{2,}(?=\w))') # em-dash
  
    # This less funky little regex just splits on recognized spaces. E.g.
    #   "Hello there -- you goof-ball, use the -b option!"
    # splits into
    #   Hello/ /there/ /--/ /you/ /goof-ball,/ /use/ /the/ /-b/ /option!/
    wordsep_simple_re = re.compile('((?!\u00a0)\\s+)')
  
  s = 'foo dont\u00a0break ' * 20
  
  wrapper = MyWrapper()
  for line in wrapper.wrap(s):
    print(line)

Based on my tests, it gives the results you were looking
for.
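
For what it's worth, here is a small sketch of how that
lookahead behaves under re.split(), which is what TextWrapper
uses to chunk the text.  The names NBSP, plain, guarded,
per_char and chunks are mine, not anything from textwrap.  One
caveat it shows: the lookahead only checks the first character
of a whitespace run, so an ordinary space directly followed by
a no-break space is still swallowed as one separator.  The
per_char variant checks every character of the run instead.

```python
import re

NBSP = "\u00a0"  # hypothetical helper name, not part of textwrap

# Unguarded whitespace split: \s also matches U+00A0 in str patterns.
plain = re.compile(r"(\s+)")
# Lookahead as in the subclass above: a run may not *start* at U+00A0.
guarded = re.compile("((?!\u00a0)\\s+)")
# Stricter variant: every character of the run must differ from U+00A0.
per_char = re.compile("((?:(?!\u00a0)\\s)+)")

def chunks(pattern, text):
    """Split with separators kept and empty strings dropped."""
    return [c for c in pattern.split(text) if c]

simple = "foo dont" + NBSP + "break bar"
mixed = "a " + NBSP + "b"  # ordinary space followed by a no-break space

print(chunks(plain, simple))    # the no-break space becomes its own chunk
print(chunks(guarded, simple))  # "dont\xa0break" stays in one chunk
print(chunks(guarded, mixed))   # the run " \xa0" is still split off
print(chunks(per_char, mixed))  # only the ordinary space is split off
```

If your text can contain such mixed runs, swapping the
per_char pattern into the subclass might be the safer choice,
though I have only tried it on small strings like these.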

-tkc