From: Chris Angelico
Newsgroups: comp.lang.python
Subject: Re: How to waste computer memory?
Date: Fri, 18 Mar 2016 23:37:21 +1100

On Fri, Mar 18, 2016 at 10:46 PM, Steven D'Aprano wrote:
> On Fri, 18 Mar 2016 06:00 pm, Ian Kelly wrote:
>
>> On Thu, Mar 17, 2016 at 1:21 PM, Rick Johnson wrote:
>>> In the event that i change my mind about Unicode, and/or for
>>> the sake of others, who may want to know, please provide a
>>> list of languages that *YOU* think handle Unicode better than
>>> Python, starting with the best first. Thanks.
>
> Better than Python? Easy-peasy:
>
> List of languages with Unicode handling which is better than Python = []
>
> I'm not aware of any language with better or more complete Unicode
> functionality than Python's. (That doesn't necessarily mean that they don't
> exist.)

And this also doesn't preclude languages that have *as good* handling
as Python's, of which I know of one off-hand, and there may be any
number.
(Trivial case: Take Python 3.5, change the definition of a block to be
{ } instead of indentation, and release it as Bracethon 1.0. Voila, a
distinct-yet-related language whose Unicode handling is exactly as
good as Python's.)

>> jmf has been asked this before, and as I recall he seems to feel that
>> UTF-8 should be used for all purposes, ignoring the limitations of
>> that encoding such as that indexing becomes an O(n) operation.
>
> Technically, UTF-8 doesn't *necessarily* imply indexing is O(n). For
> instance, your UTF-8 string might consist of an array of bytes containing
> the string, plus an array of indexes to the start of each code point. For
> example, the string:
>
> “abcπßЊ•𒀁”
>
> (including the quote marks) is 10 code points in length and 22 bytes as
> UTF-8. Grouping the (hex) bytes for each code point, we have:
>
> e2809c 61 62 63 cf80 c39f d08a e280a2 f0928081 e2809d
>
> so we could get an O(1) UTF-8 string by recording the bytes (in hex) plus the
> indexes (in decimal) at which each code point starts:
>
> e2809c616263cf80c39fd08ae280a2f0928081e2809d
>
> 0 3 4 5 6 8 10 12 15 19
>
> but (assuming each index needs 2 bytes, which supports strings up to 65535
> characters in length), that's actually LESS memory efficient than UTF-32:
> 42 bytes versus 40.

A lot of strings will have few enough non-ASCII characters that the
extra bytes they add total no more than 255. (For example, all strings
with no more than 86 total characters, since a code point adds at most
3 bytes beyond its first.) You could store, instead of the indexes
themselves, a series of one-byte offsets:

e2809c616263cf80c39fd08ae280a2f0928081e2809d

0 2 2 2 2 3 4 5 7 10

Locating a byte based on its character position is still O(1); you
look up that position in the offset table, add that to your original
character position, and you have the byte location. For strings with
too many non-ASCII codepoints, you'd need some other representation,
but at that point, it might be worth just switching to UTF-32.
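
To make the offset-table idea concrete, here is a minimal sketch in Python. The class name OffsetIndexedUTF8 and its method byte_index are made up for illustration (this is not how CPython stores strings); it keeps the UTF-8 bytes plus one offset byte per code point, where each offset is the byte index minus the character index. For the example string that comes to 22 + 10 = 32 bytes of payload, against 42 for the two-byte index table and 40 for UTF-32.

```python
class OffsetIndexedUTF8:
    """UTF-8 bytes plus a table of one-byte offsets (hypothetical sketch)."""

    def __init__(self, text):
        self.data = text.encode("utf-8")
        offsets = []
        byte_pos = 0
        for char_pos, ch in enumerate(text):
            # Offset = byte index minus character index, i.e. the number
            # of extra bytes contributed by all earlier code points.
            offsets.append(byte_pos - char_pos)
            byte_pos += len(ch.encode("utf-8"))
        # Offsets never decrease, so checking the last one is enough.
        if offsets and offsets[-1] > 255:
            raise ValueError("offsets no longer fit in one byte each")
        self.offsets = bytes(offsets)

    def __len__(self):
        # Length in code points, not bytes.
        return len(self.offsets)

    def byte_index(self, i):
        # O(1): one table lookup and one addition.
        return i + self.offsets[i]

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError("string index out of range")
        start = self.byte_index(i)
        end = self.byte_index(i + 1) if i + 1 < len(self) else len(self.data)
        return self.data[start:end].decode("utf-8")


# The example string from the post, written with escapes:
# left quote, a, b, c, pi, sharp s, Cyrillic Nje, bullet, U+12001, right quote
sample = OffsetIndexedUTF8(
    "\u201cabc\u03c0\u00df\u040a\u2022\U00012001\u201d"
)
print(list(sample.offsets))  # [0, 2, 2, 2, 2, 3, 4, 5, 7, 10]
print([sample.byte_index(i) for i in range(len(sample))])
# [0, 3, 4, 5, 6, 8, 10, 12, 15, 19]
```

The ValueError branch is where a real implementation would fall back to some other representation, such as wider offsets or UTF-32.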

Of course, O(1) isn't the ultimate goal to the exclusion of all else.
For a simple sequential parser, indexing might be such a rare
operation that it's okay for it to be O(N), as you're never going to
index more than a few characters from a known position. Or if you're
trying to search a few gig of text, it's entirely possible that
transcoding into an indexable format is a complete waste of time, and
it's better to just work with a stream of bytes straight off the disk.
But for a general string type in a high level language, I'm normally
going to assume that indexing is fairly cheap.

ChrisA