From: random832@fastmail.us
To: python-list@python.org
Subject: Re: Newbie question about text encoding
Date: Fri, 06 Mar 2015 09:03:11 -0500
Newsgroups: comp.lang.python

On Fri, Mar 6, 2015, at 08:39, Chris Angelico wrote:
> Number of code points is the most logical way to length-limit
> something. If you want to allow users to set their display names but
> not to make arbitrarily long ones, limiting them to X code points is
> the safest way (and preferably do an NFC or NFD normalization before
> counting, for consistency);

Why are you length-limiting it? Storage space? Limit it in whatever
encoding they're stored in. Why are combining marks "pathological" but
characters that need surrogate pairs not? Display space? Limit it by
columns. If you're going to allow a Japanese user's name to be twice as
wide, you've got a problem when you go to display it.

> this means you disallow pathological cases
> where every base character has innumerable combining marks added.

No, it doesn't. If you limit it to, say, fifty, someone can still post
two base characters with twenty combining marks each. If you actually
want to disallow this, you've got to do more work. You've disallowed
some of the pathological cases, some of the time, by coincidence. And
limiting the number of UTF-8 bytes, or the number of UTF-16 code units,
will accomplish this just as well.

Now, if you intend to _silently truncate_ it to the desired length, you
certainly don't want to leave half a character in, of course. But who's
to say the base character plus first few combining marks aren't also
"half a character"?

If you're _splitting_ a string, rather than merely truncating it, you
probably don't want those combining marks at the beginning of part two.
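
To make the different measures concrete, here is a rough Python 3 sketch.
The function names are made up for illustration and are not from anything
in this thread. "Combining mark" is approximated as "general category
starts with M", which is not full grapheme-cluster segmentation, and the
column count is only a guess at what a terminal would do.

```python
import unicodedata


def is_mark(ch):
    # Approximation: treat any character in category Mn, Mc or Me as a mark.
    return unicodedata.category(ch).startswith("M")


def utf8_bytes(s):
    return len(s.encode("utf-8"))


def utf16_units(s):
    # Each UTF-16 code unit is two bytes; astral characters take two units.
    return len(s.encode("utf-16-le")) // 2


def columns(s):
    # Rough display width: wide/fullwidth characters take two columns,
    # nonspacing and enclosing marks take none, everything else takes one.
    width = 0
    for ch in s:
        if unicodedata.category(ch) in ("Mn", "Me"):
            continue
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


def max_marks_per_base(s):
    # Longest run of consecutive marks, i.e. the most marks piled on one base.
    longest = run = 0
    for ch in s:
        if is_mark(ch):
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest


def truncate(s, limit):
    # Keep at most `limit` code points, but never cut between a base
    # character and its marks: drop the whole cluster instead.
    if len(s) <= limit:
        return s
    end = limit
    while end > 0 and is_mark(s[end]):
        end -= 1
    return s[:end]


# Two base characters with twenty combining acute accents each.
pathological = "a" + "\u0301" * 20 + "b" + "\u0301" * 20
japanese = "\u5c71\u7530\u592a\u90ce"

print(len(pathological))                 # 42 code points, under a limit of 50
print(max_marks_per_base(pathological))  # 20
print(utf8_bytes(pathological))          # 82
print(len(japanese), columns(japanese))  # 4 code points, 8 columns
```

Rejecting the pathological case takes an explicit check along the lines of
max_marks_per_base(), whichever unit the overall limit is counted in.
Note also that truncate(pathological, 10) returns an empty string, because
the first cluster alone is 21 code points long.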