From: Chris Angelico
Newsgroups: comp.lang.python
Subject: Re: Storing a big amount of path names
Date: Fri, 12 Feb 2016 16:02:43 +1100
References: <56BD4DCD.4080401@mrabarnett.plus.com>
On Fri, Feb 12, 2016 at 3:45 PM, Paulo da Silva wrote:
>> Correct. Two equal strings, passed to sys.intern(), will come back as
>> identical strings, which means they use the same memory. You can have
>> a million references to the same string and it takes up no additional
>> memory.
> I have being playing with this and found that it is not always true!
> For example:
>
> In [1]: def f(s):
>    ...:     print(id(sys.intern(s)))
>    ...:
>
> In [2]: import sys
>
> In [3]: f("12345")
> 139805480756480
>
> In [4]: f("12345")
> 139805480755640
>
> In [5]: f("12345")
> 139805480756480
>
> In [6]: f("12345")
> 139805480756480
>
> In [7]: f("12345")
> 139805480750864
>
> I think a dict, as MRAB suggested, is needed.
> At the end of the store process I may delete the dict.

I'm not 100% sure of what's going on here, but my suspicion is that a
string that isn't being used is allowed to be flushed from the
dictionary. If you retain a reference to the string (not to its id,
but to the string itself), you shouldn't see that change. By doing the
dict yourself, you guarantee that ALL the strings will be retained,
which can never be _less_ memory than interning them all, and can
easily be _more_.

>> But I reiterate: Don't even bother with this unless you know your
>> program is running short of memory.
>
> Yes, it is.
> This is part of a previous post (sets of equal files) and I need lots of
> memory for performance reasons. I only have 2G in this computer.

How many files, roughly? Do you ever look at the contents of the
files? Most likely, you'll be dwarfing the files' names with their
contents. Unless you actually have over two million unique files, each
one with over a thousand characters in the name, you can't use all
that 2GB with file names. If virtual memory is active, all that'll
happen is that you dip into the swapper / page file a bit...
and THAT is when you start looking at reducing memory usage. Don't
bother optimizing until you need to, and even then, you measure first
to see what part of the program actually needs to be optimized.

> I already had implemented a solution. I used two dicts. One to map
> dirnames to an int handler and the other to map the handler to dir
> names. At the end I deleted the 1st. one because I only need to get the
> dirname from the handler. But I thought there should be a better choice.

If all your dir names are interned, their identities (approximately
the values returned by id(), but not quite) will be those handlers for
you, without any overhead and without any complexity.

ChrisA
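
For illustration, here is a minimal sketch of the point about retaining
references, assuming CPython 3. The helper name make_records and the
sample paths are made up for the example and are not from the thread.
Because each record holds on to the interned dir name, the interned
copy stays alive, and equal dir names end up as one shared object that
can serve directly as the "handler".

```python
import sys


def make_records(paths):
    """Split each path into (dirname, filename), interning the dirname.

    The records keep a reference to every interned dirname, so the
    interned copy cannot be dropped while the records are alive.
    """
    records = []
    for path in paths:
        # rpartition builds a fresh str object for the dirname each time.
        dirname, _, filename = path.rpartition("/")
        records.append((sys.intern(dirname), filename))
    return records


# Hypothetical sample data.
paths = [
    "/home/paulo/photos/a.jpg",
    "/home/paulo/photos/b.jpg",
    "/home/paulo/docs/c.txt",
]
records = make_records(paths)

# Equal dir names are now the very same object...
same_object = records[0][0] is records[1][0]

# ...so only one string object is stored per distinct directory.
distinct = len({id(dirname) for dirname, _ in records})

print(same_object, distinct)
```

Contrast this with f("12345") in the quoted session, where the result
of sys.intern() is thrown away as soon as id() has been printed. With
nothing holding a reference, the next call may well intern a fresh
object, which would fit the changing ids shown there.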