MIME-Version: 1.0
Date: Sat, 7 Mar 2015 01:11:20 +1100
Subject: Re: Newbie question about text encoding
From: Chris Angelico <rosuav@gmail.com>
Cc: "python-list@python.org" <python-list@python.org>
Content-Type: text/plain; charset=UTF-8
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.112.1425651082.21433.python-list@python.org>

On Sat, Mar 7, 2015 at 1:03 AM,  <random832@fastmail.us> wrote:
> On Fri, Mar 6, 2015, at 08:39, Chris Angelico wrote:
>> Number of code points is the most logical way to length-limit
>> something. If you want to allow users to set their display names but
>> not to make arbitrarily long ones, limiting them to X code points is
>> the safest way (and preferably do an NFC or NFD normalization before
>> counting, for consistency);
>
> Why are you length-limiting it? Storage space? Limit it in whatever
> encoding they're stored in. Why are combining marks "pathological" but
> surrogate characters not? Display space? Limit it by columns. If you're
> going to allow a Japanese user's name to be twice as wide, you've got a
> problem when you go to display it.

To stop people from putting three paragraphs of lorem ipsum in and
calling it a username.

>> this means you disallow pathological cases
>> where every base character has innumerable combining marks added.
>
> No it doesn't. If you limit it to, say, fifty, someone can still post
> two base characters with twenty combining marks each. If you actually
> want to disallow this, you've got to do more work. You've disallowed
> some of the pathological cases, some of the time, by coincidence. And
> limiting the number of UTF-8 bytes, or the number of UTF-16 code points,
> will accomplish this just as well.

They can, but then they're limited to two base characters. They can't
have fifty base characters with twenty combining marks each. That's
the point.
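
A rough sketch of the kind of check I mean, assuming Python 3 (where
len() of a str counts code points). The names MAX_CODE_POINTS and
within_limit are made up for illustration, and 50 is just the figure
from the example above:

```python
import unicodedata

MAX_CODE_POINTS = 50  # hypothetical limit, taken from the example above


def within_limit(name, limit=MAX_CODE_POINTS):
    """Return True if the NFC-normalized name has at most `limit` code points."""
    # Normalize first so that equivalent spellings count the same:
    # "e" + U+0301 and the precomposed U+00E9 both count as one code point.
    normalized = unicodedata.normalize("NFC", name)
    # In Python 3, len() on a str counts code points.
    return len(normalized) <= limit


# Two base characters with twenty combining marks each: fits.
two_stacks = ("x" + "\u0301" * 20) * 2
# Fifty base characters with twenty combining marks each: rejected.
fifty_stacks = ("x" + "\u0301" * 20) * 50
```

Here within_limit(two_stacks) passes and within_limit(fifty_stacks)
doesn't, which is all the limit is meant to guarantee.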

> Now, if you intend to _silently truncate_ it to the desired length, you
> certainly don't want to leave half a character in, of course. But who's
> to say the base character plus first few combining marks aren't also
> "half a character"? If you're _splitting_ a string, rather than merely
> truncating it, you probably don't want those combining marks at the
> beginning of part two.

So you truncate to the desired length; then, if the first character of
the trimmed-off section is a combining mark (going by its Unicode
general category), you keep trimming from the end of what's left until
you've removed a character that isn't one, i.e. the base character
those marks were attached to. Then, if you no longer have any content
whatsoever, reject the name. Simple.
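
Something along these lines, as an untested-in-production sketch. The
function names are hypothetical, and I'm assuming "combining mark"
means general category Mn, Mc or Me, and that the name gets
NFC-normalized before cutting:

```python
import unicodedata


def is_combining_mark(ch):
    # Assumption: treat general categories Mn, Mc and Me as combining marks.
    return unicodedata.category(ch).startswith("M")


def truncate_name(name, limit):
    """Truncate to `limit` code points without leaving a partial character."""
    name = unicodedata.normalize("NFC", name)
    if len(name) <= limit:
        return name
    kept, rest = name[:limit], name[limit:]
    if is_combining_mark(rest[0]):
        # The cut fell inside a base+marks sequence. Drop trailing marks,
        # then the base character they were attached to.
        while kept:
            removed = kept[-1]
            kept = kept[:-1]
            if not is_combining_mark(removed):
                break
    if not kept:
        # Nothing left: reject the name.
        raise ValueError("name has no content left after truncation")
    return kept
```

So "abx" followed by three U+0301 marks, cut at four code points, comes
back as "ab", and a single base character buried under marks gets
rejected outright. This works on base-plus-marks sequences only, not
full grapheme clusters.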

ChrisA