From: Chris Angelico
Newsgroups: comp.lang.python
Subject: Re: chunking a long string?
Date: Sat, 9 Nov 2013 19:26:20 +1100

On Sat, Nov 9, 2013 at 7:14 PM, wrote:
> If you wish to count the frequency of chars in a text
> and store the results in a dict, {char: number_of_that_char, ...},
> do not forget to save the key in utf-XXX, it saves memory.

Oh, if you're that concerned about memory usage of individual
characters, try storing them as integers:

>>> import sys
>>> sys.getsizeof("a")
26
>>> sys.getsizeof("a".encode("utf-32"))
25
>>> sys.getsizeof("a".encode("utf-8"))
18
>>> sys.getsizeof(ord("a"))
14

I really don't see that UTF-32 is much of an advantage here.
UTF-8 happens to be, because I used an ASCII character, but the
integer beats them all, even for larger numbers:

>>> sys.getsizeof(ord("\U0001d11e"))
16

And there's even less difference on my Linux box, but of course, you
never compare against Linux because Python 3.2 wide builds don't suit
your numbers.

For longer strings, there's an even more efficient way to store them.
Just store the memory address - that's going to be 4 bytes or 8,
depending on whether it's a 32-bit or 64-bit build of Python. There's
a name for this method of comparing strings: interning. Some languages
do it automatically for all strings, others (like Python) only when
you ask for it. Suddenly it doesn't matter at all what the storage
format is - if the two strings are the same, their addresses are the
same, and conversely. That's how to make it cheap.

> Hint: If you attempt to do the same exercise with
> words in a "latin" text, never forget the length average
> of a word is approximatively 1000 chars.

I think you're confusing length of word with value of picture.

ChrisA
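
A minimal sketch of the interning point above, assuming CPython 3, where
`sys.intern` is the explicit "ask for it" call. The sample string and the
variable names are illustrative only. Whether two equal strings start out
as separate objects is an implementation detail, so the sketch relies only
on what `sys.intern` promises.

```python
import struct
import sys

# Size of a pointer in this build: 4 on a 32-bit Python, 8 on a 64-bit one.
pointer_size = struct.calcsize("P")
print("pointer size:", pointer_size)

# Build two equal strings at runtime so they are not merged as literals.
a = "".join(["chunking a ", "long string"])
b = "".join(["chunking a ", "long string"])

# Equal contents; identity before interning is not guaranteed either way.
print("equal:", a == b)
print("same object before interning:", a is b)

# sys.intern returns the canonical copy for a given string value.
ia = sys.intern(a)
ib = sys.intern(b)

# After interning, equal strings are the same object, so an identity
# (address) check is enough to compare them.
print("same object after interning:", ia is ib)
```

Once both strings are interned, `ia is ib` is a single pointer comparison,
whatever the internal storage format of the string happens to be.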