
Storing a big amount of path names

Started by	Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt>
First post	2016-02-12 00:31 +0000
Last post	2016-02-12 11:46 +0530
Articles	20 — 11 participants


  Storing a big amount of path names Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-12 00:31 +0000
    Re: Storing a big amount of path names Chris Angelico <rosuav@gmail.com> - 2016-02-12 11:39 +1100
    Re: Storing a big amount of path names Ben Finney <ben+python@benfinney.id.au> - 2016-02-12 11:44 +1100
    Re: Storing a big amount of path names Tim Chase <python.list@tim.thechases.com> - 2016-02-11 19:13 -0600
      Re: Storing a big amount of path names Rob Gaddi <rgaddi@highlandtechnology.invalid> - 2016-02-12 02:17 +0000
    Re: Storing a big amount of path names MRAB <python@mrabarnett.plus.com> - 2016-02-12 03:13 +0000
    Re: Storing a big amount of path names Chris Angelico <rosuav@gmail.com> - 2016-02-12 14:49 +1100
      Re: Storing a big amount of path names Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-12 04:15 +0000
        Re: Storing a big amount of path names Chris Angelico <rosuav@gmail.com> - 2016-02-12 15:23 +1100
          Re: Storing a big amount of path names Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-12 04:45 +0000
            Re: Storing a big amount of path names Chris Angelico <rosuav@gmail.com> - 2016-02-12 16:02 +1100
              Re: Storing a big amount of path names Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-12 05:49 +0000
              Re: Storing a big amount of path names Steven D'Aprano <steve@pearwood.info> - 2016-02-12 16:51 +1100
          Re: Storing a big amount of path names Rob Gaddi <rgaddi@highlandtechnology.invalid> - 2016-02-12 17:05 +0000
            Re: Storing a big amount of path names Chris Angelico <rosuav@gmail.com> - 2016-02-13 04:18 +1100
            Re: Storing a big amount of path names Mark Lawrence <breamoreboy@yahoo.co.uk> - 2016-02-12 21:37 +0000
            Re: Storing a big amount of path names Ben Finney <ben+python@benfinney.id.au> - 2016-02-13 08:49 +1100
            Re: Storing a big amount of path names Matt Wheeler <funkyhat@gmail.com> - 2016-02-12 23:31 +0000
              Re: Storing a big amount of path names mkondrashin@gmail.com - 2016-02-13 12:19 -0800
    Re: Storing a big amount of path names srinivas devaki <mr.eightnoteight@gmail.com> - 2016-02-12 11:46 +0530

#102833 — Storing a big amount of path names

From	Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt>
Date	2016-02-12 00:31 +0000
Subject	Storing a big amount of path names
Message-ID	<n9j94f$712$1@gioia.aioe.org>

Hi!

What is the best (lowest memory usage) way to store lots of pathnames
in memory, where:

1. Each pathname is a pair: pathname=(dirname,filename)
2. There are many different dirnames, but far fewer dirnames than pathnames
3. dirnames are generally long (many characters)

The idea is to share the common dirnames.

More realistically, I store not just the pathnames but objects, each
one a MyFile containing:
self.name - <base name>
getPathname(self) - <full pathname>
other stuff

import os

class MyFile:

  __allfiles=[]  # class-level list of every MyFile created

  def __init__(self,dirname,filename):
    self.dirname=dirname  # But I want to share this with other files
    self.name=filename
    MyFile.__allfiles.append(self)
    ...

  def getPathname(self):
    return os.path.join(self.dirname,self.name)

  ...

Thanks for any suggestion.
Paulo


#102834

From	Chris Angelico <rosuav@gmail.com>
Date	2016-02-12 11:39 +1100
Message-ID	<mailman.60.1455237579.22075.python-list@python.org>
In reply to	#102833

On Fri, Feb 12, 2016 at 11:31 AM, Paulo da Silva
<p_s_d_a_s_i_l_v_a_ns@netcabo.pt> wrote:
> What is the best (shortest memory usage) way to store lots of pathnames
> in memory where:
>
> 1. Path names are pathname=(dirname,filename)
> 2. There many different dirnames but much less than pathnames
> 3. dirnames have in general many chars
>
> The idea is to share the common dirnames.
>
> More realistically not only the pathnames are stored but objects each
> object being a MyFile containing
> self.name - <base name>
> getPathname(self) - <full pathname>
> other stuff

Just store them in the most obvious way, and don't worry about memory
usage. How many path names are you likely to have? A million? You can
still afford to have 1KB pathnames and it'll take up no more than a
gigabyte of RAM - and most computers throw around gigs of virtual
memory like it's nothing.

ChrisA


#102835

From	Ben Finney <ben+python@benfinney.id.au>
Date	2016-02-12 11:44 +1100
Message-ID	<mailman.61.1455237865.22075.python-list@python.org>
In reply to	#102833

Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> writes:

> What is the best (shortest memory usage) way to store lots of
> pathnames in memory

I challenge the premise. Why is “shortest memory usage” your criterion
for “best”, here?

How have you determined that factors like “easily understandable when
reading”, or “using standard Python idioms”, are less important?

As for “lots of pathnames”, how many are you expecting? Python's
built-in container types are highly optimised for quite large amounts of
data.

Have you measured an implementation with normal built-in container types
with your expected quantity of items, and confirmed that the performance
is unacceptable?

> Thanks for any suggestion.

I would suggest that the assumption you have too much data for Python's
built-in container types, is an assumption that should be rigorously
tested because it is likely not true.

-- 
 \      “We suffer primarily not from our vices or our weaknesses, but |
  `\    from our illusions.” —Daniel J. Boorstin, historian, 1914–2004 |
_o__)                                                                  |
Ben Finney


#102837

From	Tim Chase <python.list@tim.thechases.com>
Date	2016-02-11 19:13 -0600
Message-ID	<mailman.63.1455239783.22075.python-list@python.org>
In reply to	#102833

On 2016-02-12 00:31, Paulo da Silva wrote:
> What is the best (shortest memory usage) way to store lots of
> pathnames in memory where:
> 
> 1. Path names are pathname=(dirname,filename)
> 2. There many different dirnames but much less than pathnames
> 3. dirnames have in general many chars
> 
> The idea is to share the common dirnames.

Well, you can create a dict that has dirname->list(filenames) which
will reduce the dirname to a single instance.  You could store that
dict in the class, shared by all of the instances, though that starts
to pick up a code-smell.
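
A minimal sketch of the dirname->list(filenames) idea, assuming the input is a plain iterable of full pathnames. The helper name group_by_dirname and the sample paths are made up for illustration and are not from the original post:

```python
import os
from collections import defaultdict


def group_by_dirname(pathnames):
    """Map each dirname to the list of base names found under it.

    Each distinct dirname is stored once, as a dict key.
    """
    groups = defaultdict(list)
    for pathname in pathnames:
        dirname, filename = os.path.split(pathname)
        groups[dirname].append(filename)
    return groups


# Hypothetical sample data.
paths = ["/data/a/x.txt", "/data/a/y.txt", "/data/b/z.txt"]
groups = group_by_dirname(paths)
```

Whether this actually saves memory depends on how many files share each directory, so it is worth measuring before committing to it.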

But unless you're talking about an obscenely large number of
dirnames & filenames, or a severely resource-limited machine, just
use the default built-ins.  If you start to push the boundaries of
system resources, then I'd try the "anydbm" module (named "dbm" in
Python 3) or use the "shelve" module to marshal them out to disk.
Finally, you *could* create an actual sqlite database on disk if size
really does exceed reasonable system specs.

-tkc


#102839

From	Rob Gaddi <rgaddi@highlandtechnology.invalid>
Date	2016-02-12 02:17 +0000
Message-ID	<n9jfcc$oqr$1@dont-email.me>
In reply to	#102837

Tim Chase wrote:

> On 2016-02-12 00:31, Paulo da Silva wrote:
>> What is the best (shortest memory usage) way to store lots of
>> pathnames in memory where:
>> 
>> 1. Path names are pathname=(dirname,filename)
>> 2. There many different dirnames but much less than pathnames
>> 3. dirnames have in general many chars
>> 
>> The idea is to share the common dirnames.
>
> Well, you can create a dict that has dirname->list(filenames) which
> will reduce the dirname to a single instance.  You could store that
> dict in the class, shared by all of the instances, though that starts
> to pick up a code-smell.
>
> But unless you're talking about an obscenely large number of
> dirnames & filenames, or a severely resource-limited machine, just
> use the default built-ins.  If you start to push the boundaries of
> system resources, then I'd try the "anydbm" module or use the
> "shelve" module to marshal them out to disk.  Finally, you *could*
> create an actual sqlite database on disk if size really does exceed
> reasonable system specs.
>
> -tkc
>

It's probably more memory efficient to make a list of lists, and just
declare that element[0] of each list is the dirname.  That way you're not
wasting memory on the unused entries of the hashtable.

But unless the OP has both a) upwards of a million entries and b) let's
say at least 20 filenames to each dirname, it's not worth doing.

Now, if you do really have a million entries, one thing that would help
with memory is setting __slots__ for MyFile rather than letting it
create an instance dictionary for each one.
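
A rough sketch of what that could look like, assuming MyFile only needs the two attributes shown in the original post (the real class has "other stuff" that would need its own slot names):

```python
import os


class MyFile:
    # With __slots__ there is no per-instance __dict__; attributes are
    # stored in fixed slots, which is usually smaller per object.
    __slots__ = ("dirname", "name")

    def __init__(self, dirname, filename):
        self.dirname = dirname
        self.name = filename

    def getPathname(self):
        return os.path.join(self.dirname, self.name)


f = MyFile("/data/a", "x.txt")
```

The trade-off is that instances can no longer grow arbitrary new attributes; every attribute has to be listed in __slots__.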

-- 
Rob Gaddi, Highland Technology -- www.highlandtechnology.com

Email address domain is currently out of order.  See above to fix.


#102840

From	MRAB <python@mrabarnett.plus.com>
Date	2016-02-12 03:13 +0000
Message-ID	<mailman.65.1455246809.22075.python-list@python.org>
In reply to	#102833

On 2016-02-12 00:31, Paulo da Silva wrote:
> Hi!
>
> What is the best (shortest memory usage) way to store lots of pathnames
> in memory where:
>
> 1. Path names are pathname=(dirname,filename)
> 2. There many different dirnames but much less than pathnames
> 3. dirnames have in general many chars
>
> The idea is to share the common dirnames.
>
> More realistically not only the pathnames are stored but objects each
> object being a MyFile containing
> self.name - <base name>
> getPathname(self) - <full pathname>
> other stuff
>
> class MyFile:
>
>    __allfiles=[]
>
>    def __init__(self,dirname,filename):
>      self.dirname=dirname  # But I want to share this with other files
>      self.name=filename
>      MyFile.__allfiles.append(self)
>      ...
>
>    def getPathname(self):
>      return os.path.join(self.dirname,self.name)
>
>    ...
>
Apart from all of the other answers that have been given:

 >>> p1 = 'foo/bar'
 >>> p2 = 'foo/bar'
 >>> id(p1), id(p2)
(982008930176, 982008930120)
 >>> d = {}
 >>> id(d.setdefault(p1, p1))
982008930176
 >>> id(d.setdefault(p2, p2))
982008930176

The dict maps equal strings (dirnames) to the same string, so you won't 
have multiple copies.
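
Wrapped up as a small helper, the same setdefault trick might look like the sketch below. The class name DirnamePool and its share() method are invented here for illustration:

```python
class DirnamePool:
    """Hand back one shared string object per distinct dirname."""

    def __init__(self):
        self._pool = {}

    def share(self, dirname):
        # The first time a dirname is seen it is stored and returned;
        # later equal strings get the stored object back instead.
        return self._pool.setdefault(dirname, dirname)


pool = DirnamePool()
# Build the strings at run time so they start out as distinct objects.
a = pool.share("".join(["foo", "/", "bar"]))
b = pool.share("".join(["foo/", "bar"]))
same_object = a is b
```

The pool itself keeps every dirname alive for as long as the pool exists, so it can be dropped once all the objects have been built.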


#102841

From	Chris Angelico <rosuav@gmail.com>
Date	2016-02-12 14:49 +1100
Message-ID	<mailman.66.1455248959.22075.python-list@python.org>
In reply to	#102833

On Fri, Feb 12, 2016 at 2:13 PM, MRAB <python@mrabarnett.plus.com> wrote:
> Apart from all of the other answers that have been given:
>
>>>> p1 = 'foo/bar'
>>>> p2 = 'foo/bar'
>>>> id(p1), id(p2)
> (982008930176, 982008930120)
>>>> d = {}
>>>> id(d.setdefault(p1, p1))
> 982008930176
>>>> id(d.setdefault(p2, p2))
> 982008930176
>
> The dict maps equal strings (dirnames) to the same string, so you won't have
> multiple copies.

Simpler to let the language do that for you:

>>> import sys
>>> p1 = sys.intern('foo/bar')
>>> p2 = sys.intern('foo/bar')
>>> id(p1), id(p2)
(139621017266528, 139621017266528)

ChrisA


#102842

From	Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt>
Date	2016-02-12 04:15 +0000
Message-ID	<n9jm9n$m77$1@gioia.aioe.org>
In reply to	#102841

At 03:49 on 12-02-2016, Chris Angelico wrote:
> On Fri, Feb 12, 2016 at 2:13 PM, MRAB <python@mrabarnett.plus.com> wrote:
>> Apart from all of the other answers that have been given:
>>
...
> 
> Simpler to let the language do that for you:
> 
>>>> import sys
>>>> p1 = sys.intern('foo/bar')
>>>> p2 = sys.intern('foo/bar')
>>>> id(p1), id(p2)
> (139621017266528, 139621017266528)
> 

I didn't know about id or sys.intern :-)
I need to look at them ...

As I understand it, in the MyFile class I can do

self.dirname=sys.intern(dirname) # dirname passed as arg to the __init__

and the character string doesn't get repeated.
Is this correct?


#102843

From	Chris Angelico <rosuav@gmail.com>
Date	2016-02-12 15:23 +1100
Message-ID	<mailman.67.1455251003.22075.python-list@python.org>
In reply to	#102842

On Fri, Feb 12, 2016 at 3:15 PM, Paulo da Silva
<p_s_d_a_s_i_l_v_a_ns@netcabo.pt> wrote:
> At 03:49 on 12-02-2016, Chris Angelico wrote:
>> On Fri, Feb 12, 2016 at 2:13 PM, MRAB <python@mrabarnett.plus.com> wrote:
>>> Apart from all of the other answers that have been given:
>>>
> ...
>>
>> Simpler to let the language do that for you:
>>
>>>>> import sys
>>>>> p1 = sys.intern('foo/bar')
>>>>> p2 = sys.intern('foo/bar')
>>>>> id(p1), id(p2)
>> (139621017266528, 139621017266528)
>>
>
> I didn't know about id or sys.intern :-)
> I need to look at them ...
>
> As I can understand I can do in MyFile class
>
> self.dirname=sys.intern(dirname) # dirname passed as arg to the __init__
>
> and the character string doesn't get repeated.
> Is this correct?

Correct. Two equal strings, passed to sys.intern(), will come back as
identical strings, which means they use the same memory. You can have
a million references to the same string and it takes up no additional
memory.
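
For illustration, a minimal sketch of that in the MyFile class, assuming Python 3 (in Python 2, intern() was a builtin rather than living in sys). The sample dirnames are made up:

```python
import sys


class MyFile:
    def __init__(self, dirname, filename):
        # Equal dirnames collapse to one shared string object.
        self.dirname = sys.intern(dirname)
        self.name = filename


# Build the dirnames at run time so they begin as separate objects.
first = MyFile("".join(["/very/long/", "dir"]), "a.txt")
second = MyFile("".join(["/very/", "long/dir"]), "b.txt")
shared = first.dirname is second.dirname
```

Because both instances hold a reference to the interned string, it stays alive and both attributes point at the same object.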

But I reiterate: Don't even bother with this unless you know your
program is running short of memory. Start by coding things in the
simple and obvious way, and then fix problems only when you see them.

ChrisA


#102844

From	Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt>
Date	2016-02-12 04:45 +0000
Message-ID	<n9jo24$o42$1@gioia.aioe.org>
In reply to	#102843

At 04:23 on 12-02-2016, Chris Angelico wrote:
> On Fri, Feb 12, 2016 at 3:15 PM, Paulo da Silva
> <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> wrote:
>> At 03:49 on 12-02-2016, Chris Angelico wrote:
>>> On Fri, Feb 12, 2016 at 2:13 PM, MRAB <python@mrabarnett.plus.com> wrote:
>>>> Apart from all of the other answers that have been given:
>>>>
>> ...
>>>
>>> Simpler to let the language do that for you:
>>>
>>>>>> import sys
>>>>>> p1 = sys.intern('foo/bar')
>>>>>> p2 = sys.intern('foo/bar')
>>>>>> id(p1), id(p2)
>>> (139621017266528, 139621017266528)
>>>
>>
>> I didn't know about id or sys.intern :-)
>> I need to look at them ...
>>
>> As I can understand I can do in MyFile class
>>
>> self.dirname=sys.intern(dirname) # dirname passed as arg to the __init__
>>
>> and the character string doesn't get repeated.
>> Is this correct?
> 
> Correct. Two equal strings, passed to sys.intern(), will come back as
> identical strings, which means they use the same memory. You can have
> a million references to the same string and it takes up no additional
> memory.
I have been playing with this and found that it is not always true!
For example:

In [1]: def f(s):
   ...:     print(id(sys.intern(s)))
   ...:

In [2]: import sys

In [3]: f("12345")
139805480756480

In [4]: f("12345")
139805480755640

In [5]: f("12345")
139805480756480

In [6]: f("12345")
139805480756480

In [7]: f("12345")
139805480750864

I think a dict, as MRAB suggested, is needed.
At the end of the store process I may delete the dict.

> 
> But I reiterate: Don't even bother with this unless you know your
> program is running short of memory.

Yes, it is.
This is part of a previous post (sets of equal files) and I need lots of
memory for performance reasons. I only have 2G in this computer.

I had already implemented a solution. I used two dicts: one to map
dirnames to an int handler and the other to map the handler to dir
names. At the end I deleted the first one because I only need to get the
dirname from the handler. But I thought there should be a better choice.

Thanks
Paulo


#102845

From	Chris Angelico <rosuav@gmail.com>
Date	2016-02-12 16:02 +1100
Message-ID	<mailman.68.1455253371.22075.python-list@python.org>
In reply to	#102844

On Fri, Feb 12, 2016 at 3:45 PM, Paulo da Silva
<p_s_d_a_s_i_l_v_a_ns@netcabo.pt> wrote:
>> Correct. Two equal strings, passed to sys.intern(), will come back as
>> identical strings, which means they use the same memory. You can have
>> a million references to the same string and it takes up no additional
>> memory.
> I have being playing with this and found that it is not always true!
> For example:
>
> In [1]: def f(s):
>    ...:     print(id(sys.intern(s)))
>    ...:
>
> In [2]: import sys
>
> In [3]: f("12345")
> 139805480756480
>
> In [4]: f("12345")
> 139805480755640
>
> In [5]: f("12345")
> 139805480756480
>
> In [6]: f("12345")
> 139805480756480
>
> In [7]: f("12345")
> 139805480750864
>
> I think a dict, as MRAB suggested, is needed.
> At the end of the store process I may delete the dict.

I'm not 100% sure of what's going on here, but my suspicion is that a
string that isn't being used is allowed to be flushed from the
dictionary. If you retain a reference to the string (not to its id,
but to the string itself), you shouldn't see that change. By doing the
dict yourself, you guarantee that ALL the strings will be retained,
which can never be _less_ memory than interning them all, and can
easily be _more_.

>> But I reiterate: Don't even bother with this unless you know your
>> program is running short of memory.
>
> Yes, it is.
> This is part of a previous post (sets of equal files) and I need lots of
> memory for performance reasons. I only have 2G in this computer.

How many files, roughly? Do you ever look at the contents of the
files? Most likely, you'll be dwarfing the files' names with their
contents. Unless you actually have over two million unique files, each
one with over a thousand characters in the name, you can't use all
that 2GB with file names.

If virtual memory is active, all that'll happen is that you dip into
the swapper / page file a bit... and THAT is when you start looking at
reducing memory usage. Don't bother optimizing until you need to, and
even then, you measure first to see what part of the program actually
needs to be optimized.

> I already had implemented a solution. I used two dicts. One to map
> dirnames to an int handler and the other to map the handler to dir
> names. At the end I deleted the 1st. one because I only need to get the
> dirname from the handler. But I thought there should be a better choice.

If all your dir names are interned, their identities (approximately
the values returned by id(), but not quite) will be those handlers for
you, without any overhead and without any complexity.

ChrisA


#102847

From	Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt>
Date	2016-02-12 05:49 +0000
Message-ID	<n9jrq0$sce$1@gioia.aioe.org>
In reply to	#102845

At 05:02 on 12-02-2016, Chris Angelico wrote:
> On Fri, Feb 12, 2016 at 3:45 PM, Paulo da Silva
> <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> wrote:
...

>> I think a dict, as MRAB suggested, is needed.
>> At the end of the store process I may delete the dict.
> 
> I'm not 100% sure of what's going on here, but my suspicion is that a
> string that isn't being used is allowed to be flushed from the
> dictionary.

You are right. I have tried with a small class and it seems to work.
Thanks.
...

> 
> How many files, roughly? Do you ever look at the contents of the
> files? Most likely, you'll be dwarfing the files' names with their
> contents. Unless you actually have over two million unique files, each
> one with over a thousand characters in the name, you can't use all
> that 2GB with file names.

It's not only the filenames.
The more memory I have, the more expensive but faster an algorithm I can
implement.

Thank you very much for your nice suggestion which also contributed to
my Python knowledge.

Thank you all who responded.
Paulo


#102848

From	Steven D'Aprano <steve@pearwood.info>
Date	2016-02-12 16:51 +1100
Message-ID	<56bd72db$0$1615$c3e8da3$5496439d@news.astraweb.com>
In reply to	#102845

On Fri, 12 Feb 2016 04:02 pm, Chris Angelico wrote:

> On Fri, Feb 12, 2016 at 3:45 PM, Paulo da Silva
> <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> wrote:
>>> Correct. Two equal strings, passed to sys.intern(), will come back as
>>> identical strings, which means they use the same memory. You can have
>>> a million references to the same string and it takes up no additional
>>> memory.
>> I have being playing with this and found that it is not always true!

It is true, but only for the lifetime of the string. Once the string is
garbage collected, it is removed from the cache as well. If you then add
the string again, you may not get the same id.

py> mystr = "hello world"
py> str2 = sys.intern(mystr)
py> str3 = "hello world"
py> mystr is str2  # same string object, as str2 is interned
True
py> mystr is str3  # not the same string object
False

But if we delete all references to the string objects, the intern cache is
also flushed, and we may not get the same id:

py> del str2, str3
py> id(mystr)  # remember this ID number
3079482600
py> del mystr
py> id(sys.intern("hello world"))  # a new entry in the cache
3079227624

This is the behaviour you want: if a string is completely deleted, you don't
want it remaining in the intern cache taking up memory.

> I'm not 100% sure of what's going on here, but my suspicion is that a
> string that isn't being used is allowed to be flushed from the
> dictionary. If you retain a reference to the string (not to its id,
> but to the string itself), you shouldn't see that change. By doing the
> dict yourself, you guarantee that ALL the strings will be retained,
> which can never be _less_ memory than interning them all, and can
> easily be _more_.

Yep. Back in the early days, interned strings were immortal and lasted
forever. That wasted memory, and is no longer the case.
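
To make the lifetime point concrete, here is a small sketch. It relies on CPython behaviour, and the helper name make() is invented for the example. As long as one reference to the interned string is kept, interning an equal string again returns the very same object:

```python
import sys


def make(parts):
    """Build a fresh string at run time (not a shared literal)."""
    return "".join(parts)


# 'kept' holds a reference, so the entry stays in the intern table.
kept = sys.intern(make(["hello", " ", "world"]))
again = sys.intern(make(["hello ", "world"]))
same_while_alive = kept is again
```

In the f("12345") experiment nothing held on to the result of sys.intern(), so the string could be freed between calls and a later call was free to create a new object with a different id.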

-- 
Steven


#102866

From	Rob Gaddi <rgaddi@highlandtechnology.invalid>
Date	2016-02-12 17:05 +0000
Message-ID	<n9l3ck$v6o$1@dont-email.me>
In reply to	#102843

Chris Angelico wrote:

> Start by coding things in the
> simple and obvious way, and then fix problems only when you see them.

Is that statement available in 10 foot letters etched into stone?

-- 
Rob Gaddi, Highland Technology -- www.highlandtechnology.com

Email address domain is currently out of order.  See above to fix.


#102867

From	Chris Angelico <rosuav@gmail.com>
Date	2016-02-13 04:18 +1100
Message-ID	<mailman.82.1455297504.22075.python-list@python.org>
In reply to	#102866

On Sat, Feb 13, 2016 at 4:05 AM, Rob Gaddi
<rgaddi@highlandtechnology.invalid> wrote:
> Chris Angelico wrote:
>
>> Start by coding things in the
>> simple and obvious way, and then fix problems only when you see them.
>
> Is that statement available in 10 foot letters etched into stone?

I actually had that built behind my house, at one point. Sadly, the
letters sank until they were partly embedded into the ground, and
what's left says, in the local language, "Go stick your head in a
PHP", so it's lit up only for special celebrations.

ChrisA


#102872

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2016-02-12 21:37 +0000
Message-ID	<mailman.86.1455313059.22075.python-list@python.org>
In reply to	#102866

On 12/02/2016 17:05, Rob Gaddi wrote:
> Chris Angelico wrote:
>
>> Start by coding things in the
>> simple and obvious way, and then fix problems only when you see them.
>
> Is that statement available in 10 foot letters etched into stone?
>

Hopefully not, as that would be a waste. It should be made more obvious 
by using a red hot poker to engrave it onto every newbie's forehead. 
Even then some simply wouldn't take a blind bit of notice.

-- 
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence


#102874

From	Ben Finney <ben+python@benfinney.id.au>
Date	2016-02-13 08:49 +1100
Message-ID	<mailman.88.1455313781.22075.python-list@python.org>
In reply to	#102866

Chris Angelico <rosuav@gmail.com> writes:

> I actually had that built behind my house, at one point. Sadly, the
> letters sank until they were partly embedded into the ground, and
> what's left says, in the local language, "Go stick your head in a
> PHP", so it's lit up only for special celebrations.

Douglas Adams, you are sorely missed.

-- 
 \        “The greatest tragedy in mankind's entire history may be the |
  `\       hijacking of morality by religion.” —Arthur C. Clarke, 1991 |
_o__)                                                                  |
Ben Finney


#102876

From	Matt Wheeler <funkyhat@gmail.com>
Date	2016-02-12 23:31 +0000
Message-ID	<mailman.89.1455319898.22075.python-list@python.org>
In reply to	#102866

On 12 Feb 2016 21:37, "Mark Lawrence" <breamoreboy@yahoo.co.uk> wrote:
> Hopefully not as that would be a waste, it should be made more obvious by
> using a red hot poker to engrave it onto every newbies' forehead. Even then
> some simply wouldn't take a blind bit of notice.

Yes sorry about that, I think our aim was a little off with a few of the
brandings.

--
Matt Wheeler
http://funkyh.at


#102889

From	mkondrashin@gmail.com
Date	2016-02-13 12:19 -0800
Message-ID	<243825dd-451a-405e-ad41-855622e80b06@googlegroups.com>
In reply to	#102876

In my application I used two approaches:
1. Storing paths as a tree (as directories form a tree).
2. For long lists of similar paths, storing only the differences between strings.
Though this was a C++/Obj-C project, I can share the diff code with you if you drop me a line (mkondrashin & gmail , com)


#102849

From	srinivas devaki <mr.eightnoteight@gmail.com>
Date	2016-02-12 11:46 +0530
Message-ID	<mailman.69.1455257811.22075.python-list@python.org>
In reply to	#102833

On Feb 12, 2016 6:05 AM, "Paulo da Silva" <p_s_d_a_s_i_l_v_a_ns@netcabo.pt>
wrote:
>
> Hi!
>
> What is the best (shortest memory usage) way to store lots of pathnames
> in memory where:
>
> 1. Path names are pathname=(dirname,filename)
> 2. There many different dirnames but much less than pathnames
> 3. dirnames have in general many chars
>
> The idea is to share the common dirnames.
>
> More realistically not only the pathnames are stored but objects each
> object being a MyFile containing
> self.name - <base name>
> getPathname(self) - <full pathname>
> other stuff
>
> class MyFile:
>
>   __allfiles=[]
>
>   def __init__(self,dirname,filename):
>     self.dirname=dirname  # But I want to share this with other files
>     self.name=filename
>     MyFile.__allfiles.append(self)
>     ...
>
>   def getPathname(self):
>     return os.path.join(self.dirname,self.name)
>

What you want is a trie data structure, which won't use extra memory
when your strings share a common base path.

Instead of constructing a character trie, make it a string trie, i.e.
each directory name is a node and all the files and folders in it are its
children. Each node can be one of two types: a file or a folder.

If you think about it, this is the most intuitive way to represent the
file structure in your program.

You can get the directory name of a file object by traversing its
parents.
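
A rough sketch of such a string trie, assuming POSIX-style "/" separators. The names Node, child(), pathname() and insert() are invented for this example:

```python
class Node:
    """One path component (a directory or a file) in the trie."""

    __slots__ = ("name", "parent", "children")

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}  # component name -> Node

    def child(self, name):
        """Return the child with this name, creating it if needed."""
        node = self.children.get(name)
        if node is None:
            node = self.children[name] = Node(name, self)
        return node

    def pathname(self):
        """Rebuild the full path by walking up through the parents."""
        parts = []
        node = self
        while node.parent is not None:
            parts.append(node.name)
            node = node.parent
        return "/" + "/".join(reversed(parts))


def insert(root, pathname):
    """Add a pathname to the trie and return its final node."""
    node = root
    for part in pathname.strip("/").split("/"):
        node = node.child(part)
    return node


root = Node("")
x = insert(root, "/data/a/x.txt")
y = insert(root, "/data/a/y.txt")
```

Each directory component is stored once, but every node carries its own overhead (an object plus a dict), so for some data sets this may use more memory than interned dirnames. It is worth measuring both.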

I hope this helps.

Regards
Srinivas Devaki
Junior (3rd yr) student at Indian School of Mines,(IIT Dhanbad)
Computer Science and Engineering Department
ph: +91 9491 383 249
telegram_id: @eightnoteight

