Re: Python NBSP DWIM

Started by	Skip Montanaro <skip.montanaro@gmail.com>
First post	2015-06-10 09:28 -0500
Last post	2015-06-11 13:28 +1000
Articles	10 — 4 participants

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: Python NBSP DWIM Skip Montanaro <skip.montanaro@gmail.com> - 2015-06-10 09:28 -0500
    Re: Python NBSP DWIM Steven D'Aprano <steve@pearwood.info> - 2015-06-11 03:11 +1000
      Re: Python NBSP DWIM random832@fastmail.us - 2015-06-10 21:02 -0400
      Re: Python NBSP DWIM Chris Angelico <rosuav@gmail.com> - 2015-06-11 11:09 +1000
      Re: Python NBSP DWIM Steven D'Aprano <steve@pearwood.info> - 2015-06-11 12:26 +1000
        Re: Python NBSP DWIM Chris Angelico <rosuav@gmail.com> - 2015-06-11 13:05 +1000
          Re: Python NBSP DWIM Steven D'Aprano <steve@pearwood.info> - 2015-06-11 13:27 +1000
            Re: Python NBSP DWIM Chris Angelico <rosuav@gmail.com> - 2015-06-11 13:37 +1000
        Re: Python NBSP DWIM random832@fastmail.us - 2015-06-10 23:18 -0400
        Re: Python NBSP DWIM Chris Angelico <rosuav@gmail.com> - 2015-06-11 13:28 +1000

#92402 — Re: Python NBSP DWIM

From	Skip Montanaro <skip.montanaro@gmail.com>
Date	2015-06-10 09:28 -0500
Subject	Re: Python NBSP DWIM
Message-ID	<mailman.344.1433946513.13271.python-list@python.org>

On Wed, Jun 10, 2015 at 8:28 AM, Tim Chase
<python.list@tim.thechases.com> wrote:
> Is this a bug?

Looks like it's been reported a few times with slightly different context:

https://bugs.python.org/issue6537
https://bugs.python.org/issue16623
https://bugs.python.org/issue20491
https://bugs.python.org/issue1390608

The couple times it's come up in the context of str.split, it's been
rejected, since the purpose of that method is to split words.

Skip

#92411

From	Steven D'Aprano <steve@pearwood.info>
Date	2015-06-11 03:11 +1000
Message-ID	<55786fd5$0$13003$c3e8da3$5496439d@news.astraweb.com>
In reply to	#92402

On Thu, 11 Jun 2015 12:28 am, Skip Montanaro wrote:

> On Wed, Jun 10, 2015 at 8:28 AM, Tim Chase
> <python.list@tim.thechases.com> wrote:
>> Is this a bug?
> 
> Looks like it's been reported a few times with slightly different context:
> 
> https://bugs.python.org/issue6537
> https://bugs.python.org/issue16623
> https://bugs.python.org/issue20491
> https://bugs.python.org/issue1390608
> 
> The couple times it's come up in the context of str.split, it's been
> rejected, since the purpose of that method is to split words.

That reasoning is ... strange. The whole point of the NBSP is specifically
*not* to split on it. If you wanted it to split, you would use a regular
space.

(Oh, and for the record, there are at least two non-breaking spaces in
Unicode, U+00A0 "NO-BREAK SPACE" and U+202F "NARROW NO-BREAK SPACE".)

http://www.unicode.org/charts/PDF/U0080.pdf
http://www.unicode.org/charts/PDF/U2000.pdf


Non-breaking spaces should be used when you want to prevent
word-wrapping, and also for "open form" compound words:

http://grammar.ccc.commnet.edu/grammar/compounds.htm

textwrap should also treat NBSPs as non-spaces for the purposes of wrapping.

As a work-around, I think this should work:

- split the string on NBSPs;

- for each substring returned, split normally;

- merge sub-substrings.


def split(s):
    """Split on whitespace, except NBSP.

    >>> split(u'hello world spam\\u00A0eggs cheese')
    [u'hello', u'world', u'spam\\xa0eggs', u'cheese']

    """
    words = []
    NBSP = u'\u00A0'
    substrings = s.split(NBSP)
    for i, sub in enumerate(substrings):
        parts = sub.split()
        if i == 0:
            words.extend(parts)
        else:
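            # An NBSP at the start or end of the string, or next to
            # ordinary whitespace, leaves no word on one side to join.
            head = parts[0] if parts else u''
            if words:
                words[-1] += NBSP + head
            else:
                words.append(NBSP + head)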
            words.extend(parts[1:])
    return words
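
For comparison, here is a rough regex-based sketch of the same idea for
Python 3, where str.split() with no arguments also splits on U+00A0. The
name split_keep_nbsp and the decision to protect U+202F as well are
assumptions made for illustration; they are not part of the function above.

```python
import re

# Whitespace runs that are allowed to break words: anything matched by \s
# except the two no-break spaces named in this thread (an assumed set).
_BREAKING_WS = re.compile(r'[^\S\u00A0\u202F]+')

def split_keep_nbsp(s):
    """Split on whitespace, leaving U+00A0 and U+202F inside words."""
    # Drop the empty strings produced by leading or trailing whitespace.
    return [word for word in _BREAKING_WS.split(s) if word]

print(split_keep_nbsp('hello world spam\u00A0eggs cheese'))
```

Unlike the split-and-merge approach, this leaves an NBSP that sits next
to an ordinary space attached to the following word rather than joining
the two neighbours.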
        

-- 
Steven

#92439

From	random832@fastmail.us
Date	2015-06-10 21:02 -0400
Message-ID	<mailman.372.1433984539.13271.python-list@python.org>
In reply to	#92411

On Wed, Jun 10, 2015, at 20:09, Chris Angelico wrote:
> And U+FEFF "ZERO WIDTH NO-BREAK SPACE", notable because it's also used as
> the byte-order mark (as its counterpart, U+FFFE, is unallocated). I've
> been fighting with VLC Media Player over the font it uses for subtitles;
> for some bizarre reason, that font represents U+FEFF not with zero pixels
> of emptiness, but with a box containing the letters "ZWN" "BSP" on two
> lines. Yeah, because that totally takes up zero width and looks like blank
> space.

As I understand it, the proper behavior is that the ZWNBSP that is the
byte order mark shall never appear in an in-memory representation of the
first line of a BOM-encoded file, or any other line of the concatenation
of two BOM-encoded files, but should "vanish" when the file is opened
and first read from. So it shouldn't be showing up in your subtitles
regardless of its rendering behavior.
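
In Python terms, the "vanishing" described above is roughly what the
utf-8-sig codec does on decode, while plain utf-8 keeps the BOM as a
leading U+FEFF. A small sketch, with made-up sample text (the string
below is an assumption, not real subtitle data):

```python
# A UTF-8 byte string that starts with a BOM and also has U+FEFF mid-text.
raw = '\ufeffline one\nzero\ufeffwidth\n'.encode('utf-8')

plain = raw.decode('utf-8')      # BOM survives as a leading U+FEFF
sig = raw.decode('utf-8-sig')    # leading BOM is dropped on decode

print(repr(plain[:5]))
print(repr(sig[:5]))
print('\ufeff' in sig)           # the mid-text U+FEFF is still there
```

Only the leading occurrence is stripped; a U+FEFF in the middle of the
text comes through either way.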

The real world, needless to say, isn't so nice.

IIRC there's also a font in MS windows that uses various glyphs which
are zero-width, but are not blank, to represent ZWJ, ZWNJ, RLM, and LRM.
Good for seeing what is happening, bad for actually rendering text
that's intended to contain these characters. Though there's another
argument that ideally a rendering engine should not render any such
glyph unless something like "visible controls" has been selected (the
real world, again, isn't so nice, which is why most symbols intended for
visible control style rendering have their own distinct code points
rather than using those of the control characters they represent).

#92440

From	Chris Angelico <rosuav@gmail.com>
Date	2015-06-11 11:09 +1000
Message-ID	<mailman.373.1433984973.13271.python-list@python.org>
In reply to	#92411

On Thu, Jun 11, 2015 at 11:02 AM, <random832@fastmail.us> wrote:
>
> On Wed, Jun 10, 2015, at 20:09, Chris Angelico wrote:
> > And U+FEFF "ZERO WIDTH NO-BREAK SPACE", notable because it's also used as
> > the byte-order mark (as its counterpart, U+FFFE, is unallocated). I've
> > been fighting with VLC Media Player over the font it uses for subtitles;
> > for some bizarre reason, that font represents U+FEFF not with zero pixels
> > of emptiness, but with a box containing the letters "ZWN" "BSP" on two
> > lines. Yeah, because that totally takes up zero width and looks like blank
> > space.
>
> As I understand it, the proper behavior is that the ZWNBSP that is the
> byte order mark shall never appear in an in-memory representation of the
> first line of a BOM-encoded file, or any other line of the concatenation
> of two BOM-encoded files, but should "vanish" when the file is opened
> and first read from. So it shouldn't be showing up in your subtitles
> regardless of its rendering behavior.

It's a perfectly valid character for other purposes; it's coming up in
the middle of pieces of text, which should be 100% legal. No, it's a
font problem.

ChrisA

#92442

From	Steven D'Aprano <steve@pearwood.info>
Date	2015-06-11 12:26 +1000
Message-ID	<5578f1be$0$12979$c3e8da3$5496439d@news.astraweb.com>
In reply to	#92411

On Thu, 11 Jun 2015 10:09 am, Chris Angelico wrote:

> On Thu, Jun 11, 2015 at 3:11 AM, Steven D'Aprano <steve@pearwood.info>
> wrote:
>> (Oh, and for the record, there are at least two non-breaking spaces in
>> Unicode, U+00A0 "NO-BREAK SPACE" and U+202F "NARROW NO-BREAK SPACE".)
>>
>> http://www.unicode.org/charts/PDF/U0080.pdf
>> http://www.unicode.org/charts/PDF/U2000.pdf
> 
> And U+FEFF "ZERO WIDTH NO-BREAK SPACE", 

No, despite the name, that is not a space character, it is a formatting
character. Due to Unicode's stability policy, the name is stuck forever,
but it should not be treated as a space character:

py> import unicodedata
py> unicodedata.category(' ')
'Zs'
py> unicodedata.category('\u00A0')  # NBSP
'Zs'
py> unicodedata.category('\uFEFF')  # ZWNBSP
'Cf'

Ideally, outside of the BOM, you should never come across a ZWNBSP. You
should use U+2060 WORD JOINER instead. But if you do come across one
outside of the BOM, it should be treated as a legitimate non-space
character:

http://www.unicode.org/faq/utf_bom.html#bom6
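
A quick way to see the practical effect under Python 3 (a sketch only;
the sample string is invented for illustration): the format characters
are not whitespace as far as str.isspace() and str.split() are
concerned, so they stay inside the word.

```python
import unicodedata

# SPACE, NBSP, NARROW NBSP, WORD JOINER, ZWNBSP
for ch in ('\u0020', '\u00A0', '\u202F', '\u2060', '\uFEFF'):
    print('U+%04X' % ord(ch), unicodedata.category(ch), ch.isspace())

# A ZWNBSP in the middle of a word does not split it.
joined = 'spam\uFEFFeggs'
print(joined.split())
```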

Although ZWNBSP is a "default ignorable" code point, I believe that the font
is well within its rights to show it with a visible glyph:

    "Fonts can contain glyphs intended for visible display of 
    default ignorable code points that would otherwise be 
    rendered invisibly when not supported."

http://www.unicode.org/faq/unsup_char.html

> notable because it's also used as 
> the byte-order mark (as its counterpart, U+FFFE, is unallocated). I've 
> been fighting with VLC Media Player over the font it uses for subtitles;
> for some bizarre reason, that font represents U+FEFF not with zero pixels
> of emptiness, but with a box containing the letters "ZWN" "BSP" on two
> lines. Yeah, because that totally takes up zero width and looks like blank
> space.

Why do the subtitles contain ZWNBSP in the first place? Surely they're not
English subtitles?

-- 
Steven

#92443

From	Chris Angelico <rosuav@gmail.com>
Date	2015-06-11 13:05 +1000
Message-ID	<mailman.374.1433991937.13271.python-list@python.org>
In reply to	#92442

On Thu, Jun 11, 2015 at 12:26 PM, Steven D'Aprano <steve@pearwood.info> wrote:
> No, despite the name, that is not a space character, it is a formatting
> character. Due to Unicode's stability policy, the name is stuck forever,
> but it should not be treated as a space character:
>
> py> unicodedata.category(' ')
> 'Zs'
> py> unicodedata.category('\u00A0')  # NBSP
> 'Zs'
> py> unicodedata.category('\uFEFF')  # ZWNBSP
> 'Cf'
>
>
> Ideally, outside of the BOM, you should never come across a ZWNBSP. You
> should use U+2060 WORD JOINER instead. But if you do come across one
> outside of the BOM, it should be treated as a legitimate non-space
> character:
>
> http://www.unicode.org/faq/utf_bom.html#bom6
>
> Although ZWNBSP is a "default ignorable" code point, I believe that the font
> is well within its rights to show it with a visible glyph:
>
>     "Fonts can contain glyphs intended for visible display of
>     default ignorable code points that would otherwise be
>     rendered invisibly when not supported."
>
> http://www.unicode.org/faq/unsup_char.html

Huh. Okay, my bad. I was under the impression that it was supposed to
take up no width, as the name implies, but stability trumps logic
sometimes. Learn something new every day.

>> notable because it's also used as
>> the byte-order mark (as its counterpart, U+FFFE, is unallocated). I've
>> been fighting with VLC Media Player over the font it uses for subtitles;
>> for some bizarre reason, that font represents U+FEFF not with zero pixels
>> of emptiness, but with a box containing the letters "ZWN" "BSP" on two
>> lines. Yeah, because that totally takes up zero width and looks like blank
>> space.
>
> Why do the subtitles contain ZWNBSP in the first place? Surely they're not
> English subtitles?

No, they're not :) The character comes up in the Cantonese and
Japanese subs for Once Upon A December.

http://youtu.be/CEpcUeWP0bg
http://youtu.be/WFZAaHrHens

Possibly some others in the series as well. It may well be a fault in
the subtitles, but most programs I've seen don't show U+FEFF as a big
fat box.

ChrisA

#92447

From	Steven D'Aprano <steve@pearwood.info>
Date	2015-06-11 13:27 +1000
Message-ID	<5579000e$0$12986$c3e8da3$5496439d@news.astraweb.com>
In reply to	#92443

On Thu, 11 Jun 2015 01:05 pm, Chris Angelico wrote:
[...]
>> Why do the subtitles contain ZWNBSP in the first place? Surely they're
>> not English subtitles?
> 
> No, they're not :) The character comes up in the Cantonese and
> Japanese subs for Once Upon A December.
> 
> http://youtu.be/CEpcUeWP0bg
> http://youtu.be/WFZAaHrHens
> 
> Possibly some others in the series as well. It may well be a fault in
> the subtitles, but most programs I've seen don't show U+FEFF as a big
> fat box.

I think that for backwards compatibility, applications (or fonts) are
permitted to treat U+FEFF as a zero-width invisible character, so perhaps
you can raise a feature request with VLC.



-- 
Steven

#92451

From	Chris Angelico <rosuav@gmail.com>
Date	2015-06-11 13:37 +1000
Message-ID	<mailman.380.1433993883.13271.python-list@python.org>
In reply to	#92447

On Thu, Jun 11, 2015 at 1:27 PM, Steven D'Aprano <steve@pearwood.info> wrote:
> On Thu, 11 Jun 2015 01:05 pm, Chris Angelico wrote:
> [...]
>>> Why do the subtitles contain ZWNBSP in the first place? Surely they're
>>> not English subtitles?
>>
>> No, they're not :) The character comes up in the Cantonese and
>> Japanese subs for Once Upon A December.
>>
>> http://youtu.be/CEpcUeWP0bg
>> http://youtu.be/WFZAaHrHens
>>
>> Possibly some others in the series as well. It may well be a fault in
>> the subtitles, but most programs I've seen don't show U+FEFF as a big
>> fat box.
>
> I think that for backwards compatibility, applications (or fonts) are
> permitted to treat U+FEFF as a zero-width invisible character, so perhaps
> you can raise a feature request with VLC.

Yeah. Well, like I said - learn something new every day. I didn't know
it wasn't a bug. (Though it'd still be a font issue, not a VLC one.
With other fonts, it comes up looking different, in some cases
invisible. Unfortunately, the fonts that look good aren't the fonts
that have glyphs for all characters, so I need to figure out why font
substitution isn't working right. But that's a separate issue.)

ChrisA

#92444

From	random832@fastmail.us
Date	2015-06-10 23:18 -0400
Message-ID	<mailman.375.1433992684.13271.python-list@python.org>
In reply to	#92442

On Wed, Jun 10, 2015, at 23:05, Chris Angelico wrote:
> http://youtu.be/CEpcUeWP0bg
> http://youtu.be/WFZAaHrHens

An example of the actual subtitle text would be more useful than a
youtube link to the video, since we're unlikely to be able to see what
context the character appears in if our client doesn't show it. (I don't
think the default youtube player does). And you haven't even included a
time code.

#92448

From	Chris Angelico <rosuav@gmail.com>
Date	2015-06-11 13:28 +1000
Message-ID	<mailman.377.1433993314.13271.python-list@python.org>
In reply to	#92442

On Thu, Jun 11, 2015 at 1:18 PM,  <random832@fastmail.us> wrote:
> On Wed, Jun 10, 2015, at 23:05, Chris Angelico wrote:
>> http://youtu.be/CEpcUeWP0bg
>> http://youtu.be/WFZAaHrHens
>
> An example of the actual subtitle text would be more useful than a
> youtube link to the video, since we're unlikely to be able to see what
> context the character appears in if our client doesn't show it. (I don't
> think the default youtube player does). And you haven't even included a
> time code.

Unfortunately I can't really offer anything better, as the text I saw
was after a lot of processing (youtube-dl, then some other
post-processing), and I don't actually remember which file it was that
bugged me about this, now. But the subs/annotations (visible in the
default player if you turn on "Subtitles" down the bottom) do include
U+FEFF; in each case, it's on the very last line of the song, although
that's not where I remember it occurring.
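
For anyone who wants to pin down where such characters sit in a
downloaded subtitle file, something along these lines might help. This
is a sketch: find_format_chars is a hypothetical helper and the sample
text is invented, not taken from the actual subtitles.

```python
import unicodedata

def find_format_chars(text):
    """Return (line, column, name) for every category-Cf character."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if unicodedata.category(ch) == 'Cf':
                hits.append((lineno, col, unicodedata.name(ch, 'UNKNOWN')))
    return hits

# Invented sample with a U+FEFF on its last line.
sample = 'first line\nlast\uFEFF line of the song\n'
print(find_format_chars(sample))
```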

ChrisA
