Groups > comp.lang.python > #92402 > unrolled thread
| Started by | Skip Montanaro <skip.montanaro@gmail.com> |
|---|---|
| First post | 2015-06-10 09:28 -0500 |
| Last post | 2015-06-11 13:28 +1000 |
| Articles | 10 — 4 participants |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: Python NBSP DWIM Skip Montanaro <skip.montanaro@gmail.com> - 2015-06-10 09:28 -0500
Re: Python NBSP DWIM Steven D'Aprano <steve@pearwood.info> - 2015-06-11 03:11 +1000
Re: Python NBSP DWIM random832@fastmail.us - 2015-06-10 21:02 -0400
Re: Python NBSP DWIM Chris Angelico <rosuav@gmail.com> - 2015-06-11 11:09 +1000
Re: Python NBSP DWIM Steven D'Aprano <steve@pearwood.info> - 2015-06-11 12:26 +1000
Re: Python NBSP DWIM Chris Angelico <rosuav@gmail.com> - 2015-06-11 13:05 +1000
Re: Python NBSP DWIM Steven D'Aprano <steve@pearwood.info> - 2015-06-11 13:27 +1000
Re: Python NBSP DWIM Chris Angelico <rosuav@gmail.com> - 2015-06-11 13:37 +1000
Re: Python NBSP DWIM random832@fastmail.us - 2015-06-10 23:18 -0400
Re: Python NBSP DWIM Chris Angelico <rosuav@gmail.com> - 2015-06-11 13:28 +1000
| From | Skip Montanaro <skip.montanaro@gmail.com> |
|---|---|
| Date | 2015-06-10 09:28 -0500 |
| Subject | Re: Python NBSP DWIM |
| Message-ID | <mailman.344.1433946513.13271.python-list@python.org> |
On Wed, Jun 10, 2015 at 8:28 AM, Tim Chase <python.list@tim.thechases.com> wrote:
> Is this a bug?
Looks like it's been reported a few times with slightly different context:
https://bugs.python.org/issue6537
https://bugs.python.org/issue16623
https://bugs.python.org/issue20491
https://bugs.python.org/issue1390608
The couple times it's come up in the context of str.split, it's been
rejected, since the purpose of that method is to split words.
Skip
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2015-06-11 03:11 +1000 |
| Message-ID | <55786fd5$0$13003$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #92402 |
On Thu, 11 Jun 2015 12:28 am, Skip Montanaro wrote:
> On Wed, Jun 10, 2015 at 8:28 AM, Tim Chase
> <python.list@tim.thechases.com> wrote:
>> Is this a bug?
>
> Looks like it's been reported a few times with slightly different context:
>
> https://bugs.python.org/issue6537
> https://bugs.python.org/issue16623
> https://bugs.python.org/issue20491
> https://bugs.python.org/issue1390608
>
> The couple times it's come up in the context of str.split, it's been
> rejected, since the purpose of that method is to split words.
That reasoning is ... strange. The whole point of the NBSP is specifically
*not* to split on it. If you wanted it to split, you would use a regular
space.
(Oh, and for the record, there are at least two non-breaking spaces in
Unicode, U+00A0 "NO-BREAK SPACE" and U+202F "NARROW NO-BREAK SPACE".)
http://www.unicode.org/charts/PDF/U0080.pdf
http://www.unicode.org/charts/PDF/U2000.pdf
Non-breaking spaces should be used when you want to prevent
word-wrapping, and also for "open form" compound words:
http://grammar.ccc.commnet.edu/grammar/compounds.htm
textwrap should also treat NBSPs as non-spaces for the purposes of wrapping.
As a work-around, I think this should work:
- split the string on NBSPs;
- for each substring returned, split normally;
- merge sub-substrings.
def split(s):
    """Split on whitespace, except NBSP.
    >>> split(u'hello world spam\\u00A0eggs cheese')
    [u'hello', u'world', u'spam\\xa0eggs', u'cheese']
    """
    words = []
    NBSP = u'\u00A0'
    substrings = s.split(NBSP)
    for i, sub in enumerate(substrings):
        parts = sub.split()
        if i == 0:
            words.extend(parts)
        else:
            words[-1] += NBSP + parts[0]
            words.extend(parts[1:])
    return words
--
Steven
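The workaround above assumes every NBSP has a word on both sides; a string that starts or ends with an NBSP makes `words[-1]` or `parts[0]` raise IndexError. One way around that is to split with a regular expression that matches whitespace other than the non-breaking spaces. This is only a sketch, assuming Python 3 (where `\s` in a str pattern covers Unicode whitespace); the name `split_keep_nbsp` is made up for illustration and is not from the thread.

```python
import re

# Runs of whitespace, excluding U+00A0 NO-BREAK SPACE and
# U+202F NARROW NO-BREAK SPACE (the two characters named above).
_BREAKING_WS = re.compile(r"[^\S\u00A0\u202F]+")


def split_keep_nbsp(s):
    """Split on whitespace, but never on a non-breaking space."""
    # re.split() yields empty strings for leading/trailing whitespace;
    # drop them so the result looks like str.split() output.
    return [word for word in _BREAKING_WS.split(s) if word]


print(split_keep_nbsp("hello world spam\u00A0eggs cheese"))
print(split_keep_nbsp("\u00A0lead trail\u00A0"))
```

A leading or trailing NBSP simply stays attached to its neighbouring word, so no special-casing of the first substring is needed.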
| From | random832@fastmail.us |
|---|---|
| Date | 2015-06-10 21:02 -0400 |
| Message-ID | <mailman.372.1433984539.13271.python-list@python.org> |
| In reply to | #92411 |
On Wed, Jun 10, 2015, at 20:09, Chris Angelico wrote:
> And U+FEFF "ZERO WIDTH NO-BREAK SPACE", notable because it's also used as
> the byte-order mark (as its counterpart, U+FFFE, is unallocated). I've
> been fighting with VLC Media Player over the font it uses for subtitles;
> for some bizarre reason, that font represents U+FEFF not with zero pixels
> of emptiness, but with a box containing the letters "ZWN" "BSP" on two
> lines. Yeah, because that totally takes up zero width and looks like blank
> space.
As I understand it, the proper behavior is that the ZWNBSP that is the
byte order mark shall never appear in an in-memory representation of the
first line of a BOM-encoded file, or any other line of the concatenation
of two BOM-encoded files, but should "vanish" when the file is opened
and first read from. So it shouldn't be showing up in your subtitles
regardless of its rendering behavior. The real world, needless to say,
isn't so nice.
IIRC there's also a font in MS windows that uses various glyphs which
are zero-width, but are not blank, to represent ZWJ, ZWNJ, RLM, and LRM.
Good for seeing what is happening, bad for actually rendering text
that's intended to contain these characters. Though there's another
argument that ideally a rendering engine should not render any such
glyph unless something like "visible controls" has been selected (the
real world, again, isn't so nice, which is why most symbols intended for
visible control style rendering have their own distinct code points
rather than using those of the control characters they represent).
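In Python terms, the "vanishing" BOM described above corresponds roughly to decoding with the `utf-8-sig` codec rather than plain `utf-8`. The following is a small sketch, assuming Python 3 and made-up sample bytes; it also shows why concatenated files are the awkward case.

```python
import codecs

# Hypothetical file contents: a UTF-8 BOM followed by some text.
data = codecs.BOM_UTF8 + "subtitle".encode("utf-8")

plain = data.decode("utf-8")         # BOM survives as a leading U+FEFF
stripped = data.decode("utf-8-sig")  # leading BOM is removed

# Only the first BOM is treated as a signature, so naive concatenation
# of two BOM-prefixed files leaves a U+FEFF in the middle of the text.
joined = (data + data).decode("utf-8-sig")

print(ascii(plain))
print(ascii(stripped))
print(ascii(joined))
```

Whether a stray U+FEFF in real subtitle data comes from this kind of concatenation is not something the thread establishes; the sketch only illustrates the mechanism.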
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-06-11 11:09 +1000 |
| Message-ID | <mailman.373.1433984973.13271.python-list@python.org> |
| In reply to | #92411 |
On Thu, Jun 11, 2015 at 11:02 AM, <random832@fastmail.us> wrote:
> On Wed, Jun 10, 2015, at 20:09, Chris Angelico wrote:
>> And U+FEFF "ZERO WIDTH NO-BREAK SPACE", notable because it's also used as
>> the byte-order mark (as its counterpart, U+FFFE, is unallocated). I've
>> been fighting with VLC Media Player over the font it uses for subtitles;
>> for some bizarre reason, that font represents U+FEFF not with zero pixels
>> of emptiness, but with a box containing the letters "ZWN" "BSP" on two
>> lines. Yeah, because that totally takes up zero width and looks like blank
>> space.
>
> As I understand it, the proper behavior is that the ZWNBSP that is the
> byte order mark shall never appear in an in-memory representation of the
> first line of a BOM-encoded file, or any other line of the concatenation
> of two BOM-encoded files, but should "vanish" when the file is opened
> and first read from. So it shouldn't be showing up in your subtitles
> regardless of its rendering behavior.
It's a perfectly valid character for other purposes; it's coming up in
the middle of pieces of text, which should be 100% legal. No, it's a
font problem.
ChrisA
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2015-06-11 12:26 +1000 |
| Message-ID | <5578f1be$0$12979$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #92411 |
On Thu, 11 Jun 2015 10:09 am, Chris Angelico wrote:
> On Thu, Jun 11, 2015 at 3:11 AM, Steven D'Aprano <steve@pearwood.info>
> wrote:
>> (Oh, and for the record, there are at least two non-breaking spaces in
>> Unicode, U+00A0 "NO-BREAK SPACE" and U+202F "NARROW NO-BREAK SPACE".)
>>
>> http://www.unicode.org/charts/PDF/U0080.pdf
>> http://www.unicode.org/charts/PDF/U2000.pdf
>
> And U+FEFF "ZERO WIDTH NO-BREAK SPACE",
No, despite the name, that is not a space character, it is a formatting
character. Due to Unicode's stability policy, the name is stuck forever,
but it should not be treated as a space character:
py> unicodedata.category(' ')
'Zs'
py> unicodedata.category('\u00A0') # NBSP
'Zs'
py> unicodedata.category('\uFEFF') # ZWNBSP
'Cf'
Ideally, outside of the BOM, you should never come across a ZWNBSP. You
should use U+2060 WORD JOINER instead. But if you do come across one
outside of the BOM, it should be treated as a legitimate non-space
character:
http://www.unicode.org/faq/utf_bom.html#bom6
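The distinction drawn above can be checked side by side. This is a small sketch, assuming Python 3; the `describe` helper and the `CANDIDATES` table are illustrative names, not part of any library.

```python
import unicodedata

CANDIDATES = {
    "SPACE": "\u0020",
    "NO-BREAK SPACE": "\u00A0",
    "NARROW NO-BREAK SPACE": "\u202F",
    "WORD JOINER": "\u2060",
    "ZERO WIDTH NO-BREAK SPACE": "\uFEFF",
}


def describe(ch):
    """Return (general category, str.isspace(), whether str.split() breaks on it)."""
    return (
        unicodedata.category(ch),
        ch.isspace(),
        len(("a" + ch + "b").split()) == 2,
    )


for name, ch in CANDIDATES.items():
    print("U+%04X %-26s %s" % (ord(ch), name, describe(ch)))
```

The two non-breaking spaces are category Zs, so str.split() with no arguments breaks on them, while WORD JOINER and ZWNBSP are Cf and are left inside the word.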
Although ZWNBSP is a "default ignorable" code point, I believe that the font
is well within its rights to show it with a visible glyph:
"Fonts can contain glyphs intended for visible display of
default ignorable code points that would otherwise be
rendered invisibly when not supported."
http://www.unicode.org/faq/unsup_char.html
> notable because it's also used as
> the byte-order mark (as its counterpart, U+FFFE, is unallocated). I've
> been fighting with VLC Media Player over the font it uses for subtitles;
> for some bizarre reason, that font represents U+FEFF not with zero pixels
> of emptiness, but with a box containing the letters "ZWN" "BSP" on two
> lines. Yeah, because that totally takes up zero width and looks like blank
> space.
Why do the subtitles contain ZWNBSP in the first place? Surely they're not
English subtitles?
--
Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-06-11 13:05 +1000 |
| Message-ID | <mailman.374.1433991937.13271.python-list@python.org> |
| In reply to | #92442 |
On Thu, Jun 11, 2015 at 12:26 PM, Steven D'Aprano <steve@pearwood.info> wrote:
> No, despite the name, that is not a space character, it is a formatting
> character. Due to Unicode's stability policy, the name is stuck forever,
> but it should not be treated as a space character:
>
> py> unicodedata.category(' ')
> 'Zs'
> py> unicodedata.category('\u00A0') # NBSP
> 'Zs'
> py> unicodedata.category('\uFEFF') # ZWNBSP
> 'Cf'
>
>
> Ideally, outside of the BOM, you should never come across a ZWNBSP. You
> should use U+2060 WORD JOINER instead. But if you do come across one
> outside of the BOM, it should be treated as a legitimate non-space
> character:
>
> http://www.unicode.org/faq/utf_bom.html#bom6
>
> Although ZWNBSP is a "default ignorable" code point, I believe that the font
> is well within its rights to show it with a visible glyph:
>
> "Fonts can contain glyphs intended for visible display of
> default ignorable code points that would otherwise be
> rendered invisibly when not supported."
>
> http://www.unicode.org/faq/unsup_char.html
Huh. Okay, my bad. I was under the impression that it was supposed to
take up no width, as the name implies, but stability trumps logic
sometimes. Learn something new every day.
>> notable because it's also used as
>> the byte-order mark (as its counterpart, U+FFFE, is unallocated). I've
>> been fighting with VLC Media Player over the font it uses for subtitles;
>> for some bizarre reason, that font represents U+FEFF not with zero pixels
>> of emptiness, but with a box containing the letters "ZWN" "BSP" on two
>> lines. Yeah, because that totally takes up zero width and looks like blank
>> space.
>
> Why do the subtitles contain ZWNBSP in the first place? Surely they're not
> English subtitles?
No, they're not :) The character comes up in the Cantonese and
Japanese subs for Once Upon A December.
http://youtu.be/CEpcUeWP0bg
http://youtu.be/WFZAaHrHens
Possibly some others in the series as well. It may well be a fault in
the subtitles, but most programs I've seen don't show U+FEFF as a big
fat box.
ChrisA
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2015-06-11 13:27 +1000 |
| Message-ID | <5579000e$0$12986$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #92443 |
On Thu, 11 Jun 2015 01:05 pm, Chris Angelico wrote:
[...]
>> Why do the subtitles contain ZWNBSP in the first place? Surely they're
>> not English subtitles?
>
> No, they're not :) The character comes up in the Cantonese and
> Japanese subs for Once Upon A December.
>
> http://youtu.be/CEpcUeWP0bg
> http://youtu.be/WFZAaHrHens
>
> Possibly some others in the series as well. It may well be a fault in
> the subtitles, but most programs I've seen don't show U+FEFF as a big
> fat box.
I think that for backwards compatibility, applications (or fonts) are
permitted to treat U+FEFF as a zero-width invisible character, so perhaps
you can raise a feature request with VLC.
--
Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-06-11 13:37 +1000 |
| Message-ID | <mailman.380.1433993883.13271.python-list@python.org> |
| In reply to | #92447 |
On Thu, Jun 11, 2015 at 1:27 PM, Steven D'Aprano <steve@pearwood.info> wrote:
> On Thu, 11 Jun 2015 01:05 pm, Chris Angelico wrote:
> [...]
>>> Why do the subtitles contain ZWNBSP in the first place? Surely they're
>>> not English subtitles?
>>
>> No, they're not :) The character comes up in the Cantonese and
>> Japanese subs for Once Upon A December.
>>
>> http://youtu.be/CEpcUeWP0bg
>> http://youtu.be/WFZAaHrHens
>>
>> Possibly some others in the series as well. It may well be a fault in
>> the subtitles, but most programs I've seen don't show U+FEFF as a big
>> fat box.
>
> I think that for backwards compatibility, applications (or fonts) are
> permitted to treat U+FEFF as a zero-width invisible character, so perhaps
> you can raise a feature request with VLC.
Yeah. Well, like I said - learn something new every day. I didn't know
it wasn't a bug. (Though it'd still be a font issue, not a VLC one.
With other fonts, it comes up looking different, in some cases
invisible. Unfortunately, the fonts that look good aren't the fonts
that have glyphs for all characters, so I need to figure out why font
substitution isn't working right. But that's a separate issue.)
ChrisA
| From | random832@fastmail.us |
|---|---|
| Date | 2015-06-10 23:18 -0400 |
| Message-ID | <mailman.375.1433992684.13271.python-list@python.org> |
| In reply to | #92442 |
On Wed, Jun 10, 2015, at 23:05, Chris Angelico wrote:
> http://youtu.be/CEpcUeWP0bg
> http://youtu.be/WFZAaHrHens
An example of the actual subtitle text would be more useful than a
youtube link to the video, since we're unlikely to be able to see what
context the character appears in if our client doesn't show it. (I don't
think the default youtube player does). And you haven't even included a
time code.
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-06-11 13:28 +1000 |
| Message-ID | <mailman.377.1433993314.13271.python-list@python.org> |
| In reply to | #92442 |
On Thu, Jun 11, 2015 at 1:18 PM, <random832@fastmail.us> wrote:
> On Wed, Jun 10, 2015, at 23:05, Chris Angelico wrote:
>> http://youtu.be/CEpcUeWP0bg
>> http://youtu.be/WFZAaHrHens
>
> An example of the actual subtitle text would be more useful than a
> youtube link to the video, since we're unlikely to be able to see what
> context the character appears in if our client doesn't show it. (I don't
> think the default youtube player does). And you haven't even included a
> time code.
Unfortunately I can't really offer anything better, as the text I saw
was after a lot of processing (youtube-dl, then some other
post-processing), and I don't actually remember which file it was that
bugged me about this, now. But the subs/annotations (visible in the
default player if you turn on "Subtitles" down the bottom) do include
U+FEFF; in each case, it's on the very last line of the song, although
that's not where I remember it occurring.
ChrisA
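For anyone wanting to pull out the kind of context asked for above, something along these lines could locate U+FEFF in already-decoded subtitle text. It is a rough sketch, assuming Python 3; `find_zwnbsp` and the sample lines are invented for illustration and are not the actual subtitle data.

```python
ZWNBSP = "\uFEFF"


def find_zwnbsp(lines):
    """Yield (line number, column, marked-up line) for each U+FEFF found."""
    for lineno, line in enumerate(lines, start=1):
        for col, ch in enumerate(line):
            if ch == ZWNBSP:
                # Swap the invisible character for a visible marker so the
                # surrounding text can be pasted into a message as-is.
                yield lineno, col, line.replace(ZWNBSP, "<U+FEFF>")


# Made-up sample lines standing in for a decoded subtitle file.
sample = ["first line", "last\ufeff line"]
for hit in find_zwnbsp(sample):
    print(hit)
```

Passing an open text file (decoded with plain `utf-8`, so a BOM is not silently dropped) instead of the sample list would report every occurrence along with its position.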