MIME-Version: 1.0
In-Reply-To: <52f219c5$0$29972$c3e8da3$5496439d@news.astraweb.com>
References: <8e4c1ab1-e65d-483f-ad9d-6933ae2052c3@googlegroups.com> <lcred6$q3r$1@ger.gmane.org> <mailman.6405.1391542145.18130.python-list@python.org> <7e7d3200-a4ae-4842-ad8d-68b4435b9006@googlegroups.com> <52f219c5$0$29972$c3e8da3$5496439d@news.astraweb.com>
Date: Wed, 5 Feb 2014 22:44:47 +1100
Subject: Re: Finding size of Variable
From: Chris Angelico <rosuav@gmail.com>
Cc: "python-list@python.org" <python-list@python.org>
Content-Type: text/plain; charset=UTF-8
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.6418.1391600696.18130.python-list@python.org>

On Wed, Feb 5, 2014 at 10:00 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
>> where stopWords.txt is a file of size 4KB
>
> My guess is that if you split a 4K file into words, then put the words
> into a list, you'll probably end up with 6-8K in memory.

I'd guess rather more; Python strings have a fair bit of fixed
overhead, so with a whole lot of small strings, it will get more
costly.

>>> import sys
>>> sys.version
'3.4.0b2 (v3.4.0b2:ba32913eb13e, Jan  5 2014, 16:23:43) [MSC v.1600 32 bit (Intel)]'
>>> sys.getsizeof("asdf")
29

"Stop words" tend to be short, rather than long, words, so I'd look at
an average of 2-3 letters per word. Assuming they're separated by
spaces or newlines, that means there'll be roughly a thousand of them
in the file. Going by the figure above (29 bytes for a 4-character
string, so 25 bytes of fixed overhead each), that's about 25K of
overhead on top of the text itself. A bit less if the words are
longer, and therefore fewer, but still quite a bit. (Byte strings have
slightly less overhead, 17 bytes apiece on this build.)
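
If you want to check the estimate on your own build rather than trust my
arithmetic, here's a rough sketch. The function name and the sample word
list are made up for illustration (I don't have the real stopWords.txt),
and the numbers will differ between Python versions and between 32-bit
and 64-bit builds, so treat the output as a ballpark only.

```python
import sys


def estimate_word_list_size(text):
    """Return (raw_chars, total_bytes) for the words in text.

    total_bytes is the size of the list object plus the size of each
    string in it, as reported by sys.getsizeof.
    """
    words = text.split()
    raw_chars = sum(len(w) for w in words)
    # getsizeof() on a list counts only the list's own pointer array,
    # not the strings it refers to, so the strings are added separately.
    total_bytes = sys.getsizeof(words) + sum(sys.getsizeof(w) for w in words)
    return raw_chars, total_bytes


# Hypothetical stand-in for stopWords.txt: a thousand short words.
sample = " ".join(["an", "the", "of", "to", "and", "is", "it", "in"] * 125)
raw, total = estimate_word_list_size(sample)
print("file size:", len(sample))
print("characters in words:", raw)
print("estimated bytes in memory:", total)
```

To try it on the real file, pass in open("stopWords.txt").read() instead
of the sample string. Note that this adds up per-object sizes only; it
says nothing about allocator padding or any other references your
program holds.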

ChrisA