

Groups > comp.lang.python > #65415 > unrolled thread

Finding size of Variable

Started by: Ayushi Dalmia <ayushidalmia2604@gmail.com>
First post: 2014-02-04 03:28 -0800
Last post: 2014-02-05 15:22 +0000
Articles 20 on this page of 137 — 29 participants



Contents

  Finding size of Variable Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-02-04 03:28 -0800
    Re: Finding size of Variable Peter Otten <__peter__@web.de> - 2014-02-04 12:40 +0100
      Re: Finding size of Variable Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-02-04 04:43 -0800
        Re: Finding size of Variable Asaf Las <roegltd@gmail.com> - 2014-02-04 04:53 -0800
          Re: Finding size of Variable Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-02-04 05:18 -0800
        Re: Finding size of Variable Dave Angel <davea@davea.name> - 2014-02-04 08:09 -0500
          Re: Finding size of Variable Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-02-04 05:19 -0800
            Re: Finding size of Variable Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2014-02-04 09:06 -0500
              Re: Finding size of Variable Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-02-04 21:00 -0800
    Re:Finding size of Variable Dave Angel <davea@davea.name> - 2014-02-04 14:21 -0500
      Re: Finding size of Variable Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-02-04 21:15 -0800
        Re: Finding size of Variable Peter Otten <__peter__@web.de> - 2014-02-05 09:27 +0100
    Re: Finding size of Variable Tim Golden <mail@timgolden.me.uk> - 2014-02-04 19:28 +0000
    Re: Finding size of Variable Tim Chase <python.list@tim.thechases.com> - 2014-02-04 13:29 -0600
      Re: Finding size of Variable Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-02-04 21:35 -0800
        Re: Finding size of Variable Rustom Mody <rustompmody@gmail.com> - 2014-02-04 21:45 -0800
          Re: Finding size of Variable Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-02-04 22:00 -0800
        Re: Finding size of Variable Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-02-05 11:00 +0000
          Re: Finding size of Variable Chris Angelico <rosuav@gmail.com> - 2014-02-05 22:44 +1100
            Re: Finding size of Variable wxjmfauth@gmail.com - 2014-02-06 02:15 -0800
              Re: Finding size of Variable Ned Batchelder <ned@nedbatchelder.com> - 2014-02-06 06:10 -0500
                Re: Finding size of Variable wxjmfauth@gmail.com - 2014-02-06 05:51 -0800
                  Re: Finding size of Variable wxjmfauth@gmail.com - 2014-02-06 06:15 -0800
                  Re: Finding size of Variable Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-02-08 02:48 +0000
                    Re: Finding size of Variable Ethan Furman <ethan@stoneleaf.us> - 2014-02-07 19:02 -0800
                    Re: Finding size of Variable Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-02-08 13:17 +0000
                    Re: Finding size of Variable David Hutto <dwightdhutto@gmail.com> - 2014-02-08 17:45 -0500
                      Re: Finding size of Variable Rustom Mody <rustompmody@gmail.com> - 2014-02-08 17:25 -0800
                        Re: Finding size of Variable David Hutto <dwightdhutto@gmail.com> - 2014-02-08 21:56 -0500
                        Re: Finding size of Variable Chris Angelico <rosuav@gmail.com> - 2014-02-09 13:59 +1100
                        Re: Finding size of Variable David Hutto <dwightdhutto@gmail.com> - 2014-02-08 22:07 -0500
                        Re: Finding size of Variable Ned Batchelder <ned@nedbatchelder.com> - 2014-02-08 22:09 -0500
                        Re: Finding size of Variable David Hutto <dwightdhutto@gmail.com> - 2014-02-08 22:09 -0500
                        Re: Finding size of Variable Ned Batchelder <ned@nedbatchelder.com> - 2014-02-08 22:16 -0500
                          Re: Finding size of Variable Rustom Mody <rustompmody@gmail.com> - 2014-02-08 19:30 -0800
                    Re: Finding size of Variable wxjmfauth@gmail.com - 2014-02-10 06:07 -0800
                      Re: Finding size of Variable Asaf Las <roegltd@gmail.com> - 2014-02-10 06:25 -0800
                        Re: Finding size of Variable Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-02-10 14:39 +0000
                      Re: Finding size of Variable Tim Chase <python.list@tim.thechases.com> - 2014-02-10 08:43 -0600
                        Re: Finding size of Variable wxjmfauth@gmail.com - 2014-02-11 10:53 -0800
                          Re: Finding size of Variable Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-02-11 19:04 +0000
                            Re: Finding size of Variable wxjmfauth@gmail.com - 2014-02-11 23:49 -0800
                              Re: Finding size of Variable Chris Angelico <rosuav@gmail.com> - 2014-02-12 19:06 +1100
                                Re: Finding size of Variable Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2014-02-12 10:57 +0200
                                  Re: Finding size of Variable Chris Angelico <rosuav@gmail.com> - 2014-02-12 20:24 +1100
                                    Re: Finding size of Variable Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2014-02-12 11:35 +0200
                              Working with the set of real numbers (was: Finding size of Variable) Ben Finney <ben+python@benfinney.id.au> - 2014-02-12 19:17 +1100
                                Re: Working with the set of real numbers (was: Finding size of Variable) wxjmfauth@gmail.com - 2014-02-12 00:35 -0800
                                  Re: Working with the set of real numbers (was: Finding size of Variable) wxjmfauth@gmail.com - 2014-02-12 00:46 -0800
                                  Re: Working with the set of real numbers Ben Finney <ben+python@benfinney.id.au> - 2014-02-12 19:52 +1100
                                Re: Working with the set of real numbers (was: Finding size of Variable) Grant Edwards <invalid@invalid.invalid> - 2014-02-12 15:24 +0000
                                  Re: Working with the set of real numbers (was: Finding size of Variable) "Gisle Vanem" <gvanem@yahoo.no> - 2014-02-12 17:23 +0100
                              Re: Working with the set of real numbers (was: Finding size of Variable) Chris Angelico <rosuav@gmail.com> - 2014-02-12 19:47 +1100
                                Re: Working with the set of real numbers (was: Finding size of Variable) Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2014-02-12 11:23 +0200
                                Re: Working with the set of real numbers (was: Finding size of Variable) albert@spenarnc.xs4all.nl (Albert van der Horst) - 2014-03-04 02:45 +0000
                                  Re: Working with the set of real numbers (was: Finding size of Variable) Chris Angelico <rosuav@gmail.com> - 2014-03-04 14:02 +1100
                                    Re: Working with the set of real numbers (was: Finding size of Variable) Rustom Mody <rustompmody@gmail.com> - 2014-03-03 19:13 -0800
                                      Re: Working with the set of real numbers (was: Finding size of Variable) Chris Angelico <rosuav@gmail.com> - 2014-03-04 14:46 +1100
                                        Re: Working with the set of real numbers (was: Finding size of Variable) Rustom Mody <rustompmody@gmail.com> - 2014-03-03 21:19 -0800
                                        Re: Working with the set of real numbers (was: Finding size of Variable) Steven D'Aprano <steve@pearwood.info> - 2014-03-04 05:53 +0000
                                          Re: Working with the set of real numbers (was: Finding size of Variable) Chris Angelico <rosuav@gmail.com> - 2014-03-04 17:35 +1100
                                            Re: Working with the set of real numbers Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-03-05 00:05 +1300
                                              Re: Working with the set of real numbers Chris Angelico <rosuav@gmail.com> - 2014-03-04 23:43 +1100
                                            Re: Working with the set of real numbers Marko Rauhamaa <marko@pacujo.net> - 2014-03-04 21:49 +0200
                                              Re: Working with the set of real numbers Chris Angelico <rosuav@gmail.com> - 2014-03-05 06:58 +1100
                                              Re: Working with the set of real numbers Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2014-03-04 20:55 +0000
                                                Re: Working with the set of real numbers Marko Rauhamaa <marko@pacujo.net> - 2014-03-04 23:05 +0200
                                                  Re: Working with the set of real numbers Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2014-03-04 22:08 +0000
                                              Re: Working with the set of real numbers Chris Angelico <rosuav@gmail.com> - 2014-03-05 08:18 +1100
                                              Re: Working with the set of real numbers Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2014-03-04 22:02 +0000
                                              Re: Working with the set of real numbers Chris Angelico <rosuav@gmail.com> - 2014-03-05 09:18 +1100
                                              Re: Working with the set of real numbers Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2014-03-04 22:54 +0000
                                              Re: Working with the set of real numbers Chris Angelico <rosuav@gmail.com> - 2014-03-05 10:01 +1100
                                              Re: Working with the set of real numbers Dave Angel <davea@davea.name> - 2014-03-04 18:20 -0500
                                          Re: Working with the set of real numbers (was: Finding size of Variable) Ian Kelly <ian.g.kelly@gmail.com> - 2014-03-04 04:19 -0700
                                            Re: Working with the set of real numbers (was: Finding size of Variable) albert@spenarnc.xs4all.nl (Albert van der Horst) - 2014-03-05 02:27 +0000
                                          Re: Working with the set of real numbers (was: Finding size of Variable) Ian Kelly <ian.g.kelly@gmail.com> - 2014-03-04 04:23 -0700
                                    Re: Working with the set of real numbers (was: Finding size of Variable) albert@spenarnc.xs4all.nl (Albert van der Horst) - 2014-03-05 02:15 +0000
                                      Re: Working with the set of real numbers (was: Finding size of Variable) Steven D'Aprano <steve@pearwood.info> - 2014-03-05 03:41 +0000
                                        Re: Working with the set of real numbers (was: Finding size of Variable) Rustom Mody <rustompmody@gmail.com> - 2014-03-04 20:15 -0800
                                          Re: Working with the set of real numbers (was: Finding size of Variable) Roy Smith <roy@panix.com> - 2014-03-04 23:25 -0500
                                            Re: Working with the set of real numbers Ben Finney <ben+python@benfinney.id.au> - 2014-03-05 15:37 +1100
                                              Re: Working with the set of real numbers Rustom Mody <rustompmody@gmail.com> - 2014-03-04 20:57 -0800
                                              Re: Working with the set of real numbers Roy Smith <roy@panix.com> - 2014-03-05 00:29 -0500
                              Re: Working with the set of real numbers Ben Finney <ben+python@benfinney.id.au> - 2014-02-12 19:56 +1100
                              Re: Working with the set of real numbers Chris Angelico <rosuav@gmail.com> - 2014-02-12 20:16 +1100
                              Re: Working with the set of real numbers Ben Finney <ben+python@benfinney.id.au> - 2014-02-12 21:07 +1100
                                Re: Working with the set of real numbers Rustom Mody <rustompmody@gmail.com> - 2014-02-12 06:11 -0800
                                  Re: Working with the set of real numbers Ian Kelly <ian.g.kelly@gmail.com> - 2014-02-12 13:45 -0700
                                    Re: Working with the set of real numbers Rustom Mody <rustompmody@gmail.com> - 2014-02-12 17:47 -0800
                                Re: Working with the set of real numbers Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-02-13 11:09 +1300
                                Re: Working with the set of real numbers Steven D'Aprano <steve@pearwood.info> - 2014-02-13 03:31 +0000
                                  Re: Working with the set of real numbers Ben Finney <ben+python@benfinney.id.au> - 2014-02-13 14:45 +1100
                                  Re: Working with the set of real numbers Chris Angelico <rosuav@gmail.com> - 2014-02-13 15:17 +1100
                              Re: Working with the set of real numbers Chris Angelico <rosuav@gmail.com> - 2014-02-12 21:20 +1100
                                Re: Working with the set of real numbers wxjmfauth@gmail.com - 2014-02-12 02:55 -0800
                                  Re: Working with the set of real numbers Ned Batchelder <ned@nedbatchelder.com> - 2014-02-12 06:55 -0500
                                Re: Working with the set of real numbers Marko Rauhamaa <marko@pacujo.net> - 2014-02-12 14:48 +0200
                                  Re: Working with the set of real numbers Chris Angelico <rosuav@gmail.com> - 2014-02-13 00:20 +1100
                                    Re: Working with the set of real numbers Marko Rauhamaa <marko@pacujo.net> - 2014-02-12 16:13 +0200
                                      Re: Working with the set of real numbers Chris Angelico <rosuav@gmail.com> - 2014-02-13 04:52 +1100
                                        Re: Working with the set of real numbers Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-02-13 11:24 +1300
                                          Re: Working with the set of real numbers Dave Angel <davea@davea.name> - 2014-02-12 17:56 -0500
                                            Re: Working with the set of real numbers Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-02-14 18:26 +1300
                              Re: Working with the set of real numbers Ben Finney <ben+python@benfinney.id.au> - 2014-02-12 22:44 +1100
                              Re: Working with the set of real numbers Chris Angelico <rosuav@gmail.com> - 2014-02-12 22:58 +1100
                                Re: Working with the set of real numbers Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-02-13 11:32 +1300
                                  Re: Working with the set of real numbers Grant Edwards <invalid@invalid.invalid> - 2014-02-12 23:23 +0000
                              Re: Finding size of Variable Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-02-12 14:04 +0000
                                Re: Finding size of Variable Rustom Mody <rustompmody@gmail.com> - 2014-02-12 06:14 -0800
                                  Re: Finding size of Variable Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-02-12 14:25 +0000
                                    Re: Finding size of Variable Rustom Mody <rustompmody@gmail.com> - 2014-02-12 06:32 -0800
                              Re: Working with the set of real numbers Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2014-02-13 12:48 +0000
                                Re: Working with the set of real numbers Marko Rauhamaa <marko@pacujo.net> - 2014-02-13 16:00 +0200
                                  Re: Working with the set of real numbers Chris Angelico <rosuav@gmail.com> - 2014-02-14 06:25 +1100
                                    Re: Working with the set of real numbers Marko Rauhamaa <marko@pacujo.net> - 2014-02-13 21:47 +0200
                                      Re: Working with the set of real numbers Chris Angelico <rosuav@gmail.com> - 2014-02-14 07:08 +1100
                                      Re: Working with the set of real numbers Devin Jeanpierre <jeanpierreda@gmail.com> - 2014-02-13 22:05 -0800
                                        Re: Working with the set of real numbers Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-02-15 00:30 +1300
                                          Re: Working with the set of real numbers Devin Jeanpierre <jeanpierreda@gmail.com> - 2014-02-14 16:26 -0800
                                      Re: Working with the set of real numbers albert@spenarnc.xs4all.nl (Albert van der Horst) - 2014-03-05 02:38 +0000
                                    Re: Working with the set of real numbers Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-02-14 19:37 +1300
                                      Re: Working with the set of real numbers Chris Angelico <rosuav@gmail.com> - 2014-02-14 17:44 +1100
                                        Re: Working with the set of real numbers Rustom Mody <rustompmody@gmail.com> - 2014-02-14 07:13 -0800
                                      Re: Working with the set of real numbers Dave Angel <davea@davea.name> - 2014-02-14 07:30 -0500
                                      Re: Working with the set of real numbers Grant Edwards <invalid@invalid.invalid> - 2014-02-14 15:09 +0000
                                  Re: Working with the set of real numbers Rotwang <sg552@hotmail.co.uk> - 2014-02-13 21:29 +0000
                                    Re: Working with the set of real numbers Marko Rauhamaa <marko@pacujo.net> - 2014-02-14 00:00 +0200
                                      Re: Working with the set of real numbers Rotwang <sg552@hotmail.co.uk> - 2014-02-13 22:21 +0000
                                        Re: Working with the set of real numbers Marko Rauhamaa <marko@pacujo.net> - 2014-02-14 01:16 +0200
                              Re: Working with the set of real numbers Ben Finney <ben+python@benfinney.id.au> - 2014-02-14 03:57 +1100
                      Re: Finding size of Variable Ned Batchelder <ned@nedbatchelder.com> - 2014-02-10 10:02 -0500
                      Re: Finding size of Variable Neil Cerutti <neilc@norwich.edu> - 2014-02-11 14:29 +0000
          Re: Finding size of Variable Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2014-02-05 22:14 -0500
        Re: Finding size of Variable Dave Angel <davea@davea.name> - 2014-02-05 08:43 -0500
          Re: Finding size of Variable Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-02-05 06:33 -0800
            Re: Finding size of Variable Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-02-05 15:22 +0000



#65415 — Finding size of Variable

From: Ayushi Dalmia <ayushidalmia2604@gmail.com>
Date: 2014-02-04 03:28 -0800
Subject: Finding size of Variable
Message-ID: <8e4c1ab1-e65d-483f-ad9d-6933ae2052c3@googlegroups.com>
Hello,

I have 10 files and I need to merge them (using K way merging). The size of each file is around 200 MB. Suppose I am keeping the merged data in a variable named mergedData. I had thought of checking the size of mergedData using sys.getsizeof(), but it somehow doesn't give the actual amount of memory occupied.

For example, if a file on my file system occupies 4 KB and I read all its lines into a list, the size of the list is only around 2100 bytes.

Where am I going wrong? What are the alternatives I can try?



#65416

From: Peter Otten <__peter__@web.de>
Date: 2014-02-04 12:40 +0100
Message-ID: <mailman.6384.1391514038.18130.python-list@python.org>
In reply to: #65415
Ayushi Dalmia wrote:

> I have 10 files and I need to merge them (using K way merging). The size
> of each file is around 200 MB. Now suppose I am keeping the merged data in
> a variable named mergedData, I had thought of checking the size of
> mergedData using sys.getsizeof() but it somehow doesn't give the actual
> value of the memory occupied.
> 
> For example, if a file in my file system occupies 4 KB of data, if I read
> all the lines in a list, the size of the list is around 2100 bytes only.
> 
> Where am I going wrong? What are the alternatives I can try?

getsizeof() gives you the size of the list only; to complete the picture you 
have to add the sizes of the lines.

However, why do you want to keep track of the actual memory used by 
variables in your script? You should instead concentrate on the algorithm, 
and as long as either the size of the dataset is manageable or you can limit 
the amount of data accessed at a given time you are golden.
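
The point about getsizeof() can be illustrated with a minimal sketch. It assumes CPython and Python 3; the variable names and the sample data are made up for illustration. Two lists with the same number of elements report the same size, however long their strings are, because the list stores only references.

```python
import sys

# Two lists with the same number of elements but very different contents.
short_lines = [str(i) for i in range(1000)]
long_lines = [str(i) * 100 for i in range(1000)]

# getsizeof() on a list counts the list object only (header plus one
# reference per element), so both lists report the same size.
shallow_short = sys.getsizeof(short_lines)
shallow_long = sys.getsizeof(long_lines)

# Adding the size of every string shows the real difference.
deep_short = shallow_short + sum(sys.getsizeof(s) for s in short_lines)
deep_long = shallow_long + sum(sys.getsizeof(s) for s in long_lines)

print(shallow_short, shallow_long)
print(deep_short, deep_long)
```

The exact byte counts depend on the interpreter version and platform, so they are not shown here.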



#65418

From: Ayushi Dalmia <ayushidalmia2604@gmail.com>
Date: 2014-02-04 04:43 -0800
Message-ID: <2728aca8-735b-4c38-9e7e-a164e8ed36f9@googlegroups.com>
In reply to: #65416
On Tuesday, February 4, 2014 5:10:25 PM UTC+5:30, Peter Otten wrote:
> Ayushi Dalmia wrote:
>
> > I have 10 files and I need to merge them (using K way merging). The size
> > of each file is around 200 MB. Now suppose I am keeping the merged data in
> > a variable named mergedData, I had thought of checking the size of
> > mergedData using sys.getsizeof() but it somehow doesn't give the actual
> > value of the memory occupied.
> >
> > For example, if a file in my file system occupies 4 KB of data, if I read
> > all the lines in a list, the size of the list is around 2100 bytes only.
> >
> > Where am I going wrong? What are the alternatives I can try?
>
> getsizeof() gives you the size of the list only; to complete the picture you
> have to add the sizes of the lines.
>
> However, why do you want to keep track of the actual memory used by
> variables in your script? You should instead concentrate on the algorithm,
> and as long as either the size of the dataset is manageable or you can limit
> the amount of data accessed at a given time you are golden.

As I said, I need to merge large files and I cannot afford more I/O operations. So in order to minimise the I/O operation I am writing in chunks. Also, I need to use the merged files as indexes later which should be loaded in the memory for fast access. Hence the concern.

Can you please elaborate on the point of taking lines into consideration?



#65419

From: Asaf Las <roegltd@gmail.com>
Date: 2014-02-04 04:53 -0800
Message-ID: <b512a99b-59f8-4721-a51b-ad3a1be4b2d0@googlegroups.com>
In reply to: #65418
On Tuesday, February 4, 2014 2:43:21 PM UTC+2, Ayushi Dalmia wrote:
> 
> As I said, I need to merge large files and I cannot afford more I/O 
> operations. So in order to minimise the I/O operation I am writing in 
> chunks. Also, I need to use the merged files as indexes later which 
> should be loaded in the memory for fast access. Hence the concern.
> Can you please elaborate on the point of taking lines into consideration?

have you tried os.sendfile()? 

http://docs.python.org/dev/library/os.html#os.sendfile



#65421

From: Ayushi Dalmia <ayushidalmia2604@gmail.com>
Date: 2014-02-04 05:18 -0800
Message-ID: <6b515ace-8a4c-46b4-ab9d-a20922d917cc@googlegroups.com>
In reply to: #65419
On Tuesday, February 4, 2014 6:23:19 PM UTC+5:30, Asaf Las wrote:
> On Tuesday, February 4, 2014 2:43:21 PM UTC+2, Ayushi Dalmia wrote:
> >
> > As I said, I need to merge large files and I cannot afford more I/O
> > operations. So in order to minimise the I/O operation I am writing in
> > chunks. Also, I need to use the merged files as indexes later which
> > should be loaded in the memory for fast access. Hence the concern.
> > Can you please elaborate on the point of taking lines into consideration?
>
> have you tried os.sendfile()?
>
> http://docs.python.org/dev/library/os.html#os.sendfile

os.sendfile will not serve my purpose. I need not only to merge the files but also to keep the result sorted, so some post-processing is needed.



#65420

From: Dave Angel <davea@davea.name>
Date: 2014-02-04 08:09 -0500
Message-ID: <mailman.6385.1391519162.18130.python-list@python.org>
In reply to: #65418
Ayushi Dalmia <ayushidalmia2604@gmail.com> Wrote in message:

>> getsizeof() gives you the size of the list only; to complete the picture you
>> have to add the sizes of the lines.
>>
>> However, why do you want to keep track of the actual memory used by
>> variables in your script? You should instead concentrate on the algorithm,
>> and as long as either the size of the dataset is manageable or you can limit
>> the amount of data accessed at a given time you are golden.
>
> As I said, I need to merge large files and I cannot afford more I/O operations. So in order to minimise the I/O operation I am writing in chunks. Also, I need to use the merged files as indexes later which should be loaded in the memory for fast access. Hence the concern.
>
> Can you please elaborate on the point of taking lines into consideration?

Please don't doublespace your quotes. If you must use
googlegroups, fix its bugs before posting.

There's usually no net gain in trying to 'chunk' your output to a
text file. The Python file system already knows how to do that
for a sequential file.
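
As a rough illustration of that buffering (a sketch, assuming Python 3; the function name, buffer size and sample records are hypothetical): the built-in open() accepts a buffering argument, so many small write() calls are batched before they reach the operating system.

```python
import os
import tempfile


def write_lines(path, lines, buffer_size=1024 * 1024):
    """Write lines one at a time, relying on the file object's own buffer."""
    # buffering=buffer_size asks open() for a buffer of about that many
    # bytes, so the small write() calls below are grouped into fewer
    # operating-system writes.
    with open(path, "w", buffering=buffer_size) as out:
        for line in lines:
            out.write(line)


# Demonstration with a temporary file and made-up records.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
write_lines(demo_path, ("record %d\n" % i for i in range(10000)))
demo_size = os.path.getsize(demo_path)
os.remove(demo_path)
```

Because the lines come from a generator, only the buffer is held in memory, never the whole output.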

For a list of strings, just add the getsizeof for the list to the sum
of the getsizeof of all the list items.

-- 
DaveA



#65422

From: Ayushi Dalmia <ayushidalmia2604@gmail.com>
Date: 2014-02-04 05:19 -0800
Message-ID: <40d95427-0c96-46af-9efe-0343953ac460@googlegroups.com>
In reply to: #65420
On Tuesday, February 4, 2014 6:39:00 PM UTC+5:30, Dave Angel wrote:
> Ayushi Dalmia <ayushidalmia2604@gmail.com> Wrote in message:
>
> >> getsizeof() gives you the size of the list only; to complete the picture you
> >> have to add the sizes of the lines.
> >>
> >> However, why do you want to keep track of the actual memory used by
> >> variables in your script? You should instead concentrate on the algorithm,
> >> and as long as either the size of the dataset is manageable or you can limit
> >> the amount of data accessed at a given time you are golden.
> >
> > As I said, I need to merge large files and I cannot afford more I/O operations. So in order to minimise the I/O operation I am writing in chunks. Also, I need to use the merged files as indexes later which should be loaded in the memory for fast access. Hence the concern.
> >
> > Can you please elaborate on the point of taking lines into consideration?
>
> Please don't doublespace your quotes. If you must use
> googlegroups, fix its bugs before posting.
>
> There's usually no net gain in trying to 'chunk' your output to a
> text file. The Python file system already knows how to do that
> for a sequential file.
>
> For a list of strings, just add the getsizeof for the list to the sum
> of the getsizeof of all the list items.
>
> --
> DaveA

Hey! 

I need to chunk out the outputs, otherwise it will give a MemoryError. I need to do some postprocessing on the data read from the file too. If I do not stop before the memory error, I won't be able to perform any more operations on it.



#65426

From: Dennis Lee Bieber <wlfraed@ix.netcom.com>
Date: 2014-02-04 09:06 -0500
Message-ID: <mailman.6389.1391523017.18130.python-list@python.org>
In reply to: #65422
On Tue, 4 Feb 2014 05:19:48 -0800 (PST), Ayushi Dalmia
<ayushidalmia2604@gmail.com> declaimed the following:


>I need to chunk out the outputs, otherwise it will give a MemoryError. I need to do some postprocessing on the data read from the file too. If I do not stop before the memory error, I won't be able to perform any more operations on it.

	10 200MB files is only 2GB... Most any 64-bit processor these days can
handle that. Even some 32-bit systems could handle it (WinXP booted with
the server option gives 3GB to user processes -- if the 4GB was installed
in the machine).

	However, you speak of an n-way merge. The traditional merge operation
only reads one record from each file at a time, examines them for "first",
writes that "first", reads next record from the file "first" came from, and
then reassesses the set.
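
A minimal sketch of that traditional merge, assuming Python 3 and inputs that are each already sorted line by line. The standard library's heapq.merge does the "find the first, then read the next record from that file" bookkeeping. The function name and the in-memory sample data are hypothetical stand-ins for real files.

```python
import heapq
import io


def merge_sorted_files(inputs, output):
    """Merge already-sorted line iterables into output, one line at a time."""
    # heapq.merge keeps only one pending line per input in memory and
    # always yields the smallest of them next.
    for line in heapq.merge(*inputs):
        output.write(line)


# In-memory stand-ins for two sorted files.
a = io.StringIO("apple\ncherry\n")
b = io.StringIO("banana\ndate\n")
out = io.StringIO()
merge_sorted_files([a, b], out)
merged_text = out.getvalue()
```

With real files, the inputs would be open file objects, which also iterate line by line.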

	You mention needing to chunk the data -- that implies performing a merge
sort in which you read a few records from each file into memory, sort them,
and write them out to newFile1; then read the same number of records from
each file, sort, and write them to newFile2, up to however many files you
intend to work with -- at that point you go back and append the next chunk
to newFile1. When done, each file contains chunks of n*r records. You now
make newFilex the inputs, read/merge the records from those chunks
outputting to another file1, when you reach the end of the first chunk in
the files you then read/merge the second chunk into another file2. You
repeat this process until you end up with only one chunk in one file.
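
The following is a simplified variant of that idea, not the multi-pass scheme described above: it sorts fixed-size chunks into temporary "run" files and then merges all runs in a single pass with heapq.merge. It is a sketch assuming Python 3; the function names, the run size and the sample records are hypothetical.

```python
import heapq
import io
import itertools
import os
import tempfile


def write_sorted_runs(lines, run_size):
    """Split an iterable of lines into sorted temporary run files."""
    paths = []
    it = iter(lines)
    while True:
        # Only run_size lines are held in memory at any time.
        chunk = sorted(itertools.islice(it, run_size))
        if not chunk:
            break
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as run:
            run.writelines(chunk)
        paths.append(path)
    return paths


def merge_runs(paths, output):
    """Merge the sorted run files into output, then delete the runs."""
    files = [open(p) for p in paths]
    try:
        output.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()
        for p in paths:
            os.remove(p)


# Made-up records; zero-padded so that text order equals numeric order.
records = ["%03d\n" % n for n in (5, 3, 9, 1, 7, 2, 8)]
run_paths = write_sorted_runs(records, run_size=3)
out = io.StringIO()
merge_runs(run_paths, out)
result_text = out.getvalue()
```

This version opens every run file at once, so it only suits cases where the number of runs stays below the operating system's open-file limit.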
-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
    wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/



#65468

From: Ayushi Dalmia <ayushidalmia2604@gmail.com>
Date: 2014-02-04 21:00 -0800
Message-ID: <ce4abca2-27f9-419d-a7d0-352ff53b533f@googlegroups.com>
In reply to: #65426
On Tuesday, February 4, 2014 7:36:48 PM UTC+5:30, Dennis Lee Bieber wrote:
> On Tue, 4 Feb 2014 05:19:48 -0800 (PST), Ayushi Dalmia
> <ayushidalmia2604@gmail.com> declaimed the following:
>
> >I need to chunk out the outputs, otherwise it will give a MemoryError. I need to do some postprocessing on the data read from the file too. If I do not stop before the memory error, I won't be able to perform any more operations on it.
>
> 	10 200MB files is only 2GB... Most any 64-bit processor these days can
> handle that. Even some 32-bit systems could handle it (WinXP booted with
> the server option gives 3GB to user processes -- if the 4GB was installed
> in the machine).
>
> 	However, you speak of an n-way merge. The traditional merge operation
> only reads one record from each file at a time, examines them for "first",
> writes that "first", reads next record from the file "first" came from, and
> then reassesses the set.
>
> 	You mention needing to chunk the data -- that implies performing a merge
> sort in which you read a few records from each file into memory, sort them,
> and write them out to newFile1; then read the same number of records from
> each file, sort, and write them to newFile2, up to however many files you
> intend to work with -- at that point you go back and append the next chunk
> to newFile1. When done, each file contains chunks of n*r records. You now
> make newFilex the inputs, read/merge the records from those chunks
> outputting to another file1, when you reach the end of the first chunk in
> the files you then read/merge the second chunk into another file2. You
> repeat this process until you end up with only one chunk in one file.
>
> --
> 	Wulfraed                 Dennis Lee Bieber         AF6VN
>     wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/

The approach you describe for merging the files is an option, but it will involve a lot of I/O operations. Also, I do not want the size of a file to grow beyond a certain point. When a file reaches a certain size limit, I want to start writing to a new file. This is because I want to load them into memory again later.
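
One way to cap the size of each output file without measuring in-memory objects is to count the bytes as they are written. The sketch below assumes Python 3 and UTF-8 output; the function name, the file naming scheme and the sample lines are hypothetical.

```python
import os
import shutil
import tempfile


def write_with_size_limit(lines, directory, max_bytes, prefix="index"):
    """Write lines to numbered files, starting a new file at max_bytes."""
    paths = []
    out = None
    written = 0
    for line in lines:
        data = line.encode("utf-8")
        # Start a new file if this line would push the current one past
        # the limit. A single line longer than max_bytes still gets a
        # file of its own rather than being split.
        if out is None or (written and written + len(data) > max_bytes):
            if out is not None:
                out.close()
            name = "%s_%04d.txt" % (prefix, len(paths))
            path = os.path.join(directory, name)
            out = open(path, "wb")
            paths.append(path)
            written = 0
        out.write(data)
        written += len(data)
    if out is not None:
        out.close()
    return paths


# Demonstration with made-up index lines and a tiny limit.
demo_dir = tempfile.mkdtemp()
sample = ["word%d 1,2,3\n" % i for i in range(100)]
parts = write_with_size_limit(sample, demo_dir, max_bytes=256)
part_sizes = [os.path.getsize(p) for p in parts]
shutil.rmtree(demo_dir)
```

The limit here applies to the size on disk. The same data loaded back into Python objects will generally occupy more memory than that.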



#65444

From: Dave Angel <davea@davea.name>
Date: 2014-02-04 14:21 -0500
Message-ID: <mailman.6402.1391541507.18130.python-list@python.org>
In reply to: #65415
 Ayushi Dalmia <ayushidalmia2604@gmail.com> Wrote in message:

> 
> Where am I going wrong? What are the alternatives I can try?

You've rejected all the alternatives so far without showing your
code, or even properly specifying your problem.

To get the "total" size of a list of strings, try (untested):

a = sys.getsizeof (mylist )
for item in mylist:
    a += sys.getsizeof (item)

This can be high if some of the strings are interned and get
 counted twice. But you're not likely to get closer without some
 knowledge of the data objects and where they come
 from.

-- 
DaveA
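
The double counting mentioned above is easy to see when a list holds many references to one string object. This is a small illustrative sketch (Python 3); `naive_size` and `unique_size` are hypothetical names, and `unique_size` is just one possible way to count each object once.

```python
import sys


def naive_size(strings):
    """The accumulation from the post: every list slot is counted."""
    return sys.getsizeof(strings) + sum(sys.getsizeof(s) for s in strings)


def unique_size(strings):
    """Count each distinct string object once, keyed by id()."""
    distinct = {id(s): s for s in strings}
    return sys.getsizeof(strings) + sum(sys.getsizeof(s) for s in distinct.values())


word = "the"
repeated = [word] * 1000          # 1000 references to ONE string object
print(naive_size(repeated), unique_size(repeated))
```

The two results differ by 999 times the size of the single string, which is the overestimate the naive sum produces for shared objects.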



#65469

From: Ayushi Dalmia <ayushidalmia2604@gmail.com>
Date: 2014-02-04 21:15 -0800
Message-ID: <723729ee-8e74-4d65-aa6f-742051a94101@googlegroups.com>
In reply to: #65444
On Wednesday, February 5, 2014 12:51:31 AM UTC+5:30, Dave Angel wrote:
> Ayushi Dalmia <ayushidalmia2604@gmail.com> Wrote in message:
>
> > Where am I going wrong? What are the alternatives I can try?
>
> You've rejected all the alternatives so far without showing your
>  code, or even properly specifying your problem.
>
> To get the "total" size of a list of strings,  try (untested):
>
> a = sys.getsizeof (mylist )
> for item in mylist:
>     a += sys.getsizeof (item)
>
> This can be high if some of the strings are interned and get
>  counted twice. But you're not likely to get closer without some
>  knowledge of the data objects and where they come
>  from.
>
> -- 
> DaveA

Hello Dave, 

I thought it would save everyone's time if I explained only the relevant part of my problem. Here is what I am trying to do:

I am trying to index the current Wikipedia dump without using databases and build a search engine for Wikipedia documents. Note, I CANNOT USE DATABASES.

My approach:

I am parsing the Wikipedia pages with a SAX parser, and after every 'X' pages I dump the words along with their posting lists (a posting list is the list of doc ids in which the word is present) into a separate file. These files may contain the same word, so I need to merge them and write the final index again. The final index files must be of limited size, because I need to load them into memory again later. This is where I am stuck. I need to know how to determine the size of the content in a variable before I write it to the file.

Here is the code for my merging:

import bz2
import heapq
import os
from collections import defaultdict

def mergeFiles(pathOfFolder, countFile):
    listOfWords={}
    indexFile={}
    topOfFile={}
    flag=[0]*countFile
    data=defaultdict(list)
    heap=[]
    countFinalFile=0
    for i in xrange(countFile):
        fileName = pathOfFolder+'\index'+str(i)+'.txt.bz2'
        indexFile[i]= bz2.BZ2File(fileName, 'rb')
        flag[i]=1
        topOfFile[i]=indexFile[i].readline().strip()
        listOfWords[i] = topOfFile[i].split(' ')
        if listOfWords[i][0] not in heap:
            heapq.heappush(heap, listOfWords[i][0])        
            
    while any(flag)==1:
        temp = heapq.heappop(heap)
        for i in xrange(countFile):
            if flag[i]==1:
                if listOfWords[i][0]==temp:

                    # This is where I am stuck. I cannot wait until a memory
                    # error, as I need to do some postprocessing too.
                    try:
                        data[temp].extend(listOfWords[i][1:])
                    except MemoryError:
                        writeFinalIndex(data, countFinalFile, pathOfFolder)
                        data=defaultdict(list)
                        countFinalFile+=1

                    topOfFile[i]=indexFile[i].readline().strip()   
                    if topOfFile[i]=='':
                            flag[i]=0
                            indexFile[i].close()
                            os.remove(pathOfFolder+'\index'+str(i)+'.txt.bz2')
                    else:
                        listOfWords[i] = topOfFile[i].split(' ')
                        if listOfWords[i][0] not in heap:
                            heapq.heappush(heap, listOfWords[i][0])
    writeFinalIndex(data, countFinalFile, pathOfFolder)

countFile is the number of files, and the writeFinalIndex method writes the accumulated index to a file.
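
One way to avoid waiting for `MemoryError` is to keep a running estimate of how much has been accumulated and write the index out once a chosen budget is reached. The following is a rough sketch for Python 3 under stated assumptions: `PostingBuffer`, `budget_bytes`, and the `write_index` callback (standing in for `writeFinalIndex`) are hypothetical, and the estimate deliberately ignores dict and list overhead, so it is a lower bound rather than a true size.

```python
import sys
from collections import defaultdict


class PostingBuffer(object):
    """Accumulate word -> doc-id lists and track a rough byte estimate."""

    def __init__(self, budget_bytes, write_index):
        self.budget_bytes = budget_bytes
        self.write_index = write_index   # hypothetical stand-in for writeFinalIndex
        self.data = defaultdict(list)
        self.estimate = 0
        self.files_written = 0

    def add(self, word, doc_ids):
        # Count a word once, when it first appears, plus every doc id added.
        if word not in self.data:
            self.estimate += sys.getsizeof(word)
        self.data[word].extend(doc_ids)
        self.estimate += sum(sys.getsizeof(d) for d in doc_ids)

    def flush_if_full(self):
        """Call between words so that one word never straddles two files."""
        if self.estimate >= self.budget_bytes:
            self.flush()

    def flush(self):
        if self.data:
            self.write_index(dict(self.data), self.files_written)
            self.files_written += 1
        self.data = defaultdict(list)
        self.estimate = 0


written = []
buf = PostingBuffer(200, lambda data, n: written.append((n, data)))
for word in ["apple", "bat", "cat", "dog"]:
    buf.add(word, ["1", "2", "3"])
    buf.flush_if_full()
buf.flush()   # write whatever is left at the end
print(len(written))
```

The budget has to be chosen with headroom, since the real memory use is higher than the estimate.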



#65473

From: Peter Otten <__peter__@web.de>
Date: 2014-02-05 09:27 +0100
Message-ID: <mailman.6417.1391588841.18130.python-list@python.org>
In reply to: #65469
Ayushi Dalmia wrote:

> On Wednesday, February 5, 2014 12:51:31 AM UTC+5:30, Dave Angel wrote:
>> Ayushi Dalmia <ayushidalmia2604@gmail.com> Wrote in message:
>>
>> > Where am I going wrong? What are the alternatives I can try?
>>
>> You've rejected all the alternatives so far without showing your
>>  code, or even properly specifying your problem.
>>
>> To get the "total" size of a list of strings,  try (untested):
>>
>> a = sys.getsizeof (mylist )
>> for item in mylist:
>>     a += sys.getsizeof (item)
>>
>> This can be high if some of the strings are interned and get
>>  counted twice. But you're not likely to get closer without some
>>  knowledge of the data objects and where they come
>>  from.
>>
>> --
>> DaveA
> 
> Hello Dave,
> 
> I just thought that saving others time is better and hence I explained
> only the subset of my problem. Here is what I am trying to do:
> 
> I am trying to index the current wikipedia dump without using databases
> and create a search engine for Wikipedia documents. Note, I CANNOT USE
> DATABASES. My approach:
> 
> I am parsing the wikipedia pages using SAX Parser, and then, I am dumping
> the words along with the posting list (a list of doc ids in which the word
> is present) into different files after reading 'X' number of pages. Now
> these files may have the same word and hence I need to merge them and
> write the final index again. Now these final indexes must be of limited
> size as I need to be of limited size. This is where I am stuck. I need to
> know how to determine the size of content in a variable before I write
> into the file.
> 
> Here is the code for my merging:
> 
> def mergeFiles(pathOfFolder, countFile):
>     listOfWords={}
>     indexFile={}
>     topOfFile={}
>     flag=[0]*countFile
>     data=defaultdict(list)
>     heap=[]
>     countFinalFile=0
>     for i in xrange(countFile):
>         fileName = pathOfFolder+'\index'+str(i)+'.txt.bz2'
>         indexFile[i]= bz2.BZ2File(fileName, 'rb')
>         flag[i]=1
>         topOfFile[i]=indexFile[i].readline().strip()
>         listOfWords[i] = topOfFile[i].split(' ')
>         if listOfWords[i][0] not in heap:
>             heapq.heappush(heap, listOfWords[i][0])

At this point you have already done it wrong as your heap contains the 
complete data and you have done a lot of O(N) tests on the heap. 
This is both slow and consumes a lot of memory. See

http://code.activestate.com/recipes/491285-iterator-merge/

for a sane way to merge sorted data from multiple files.  Your code becomes 
(untested)

with open("outfile.txt", "wb") as outfile:

    infiles = []
    for i in xrange(countFile):
        filename = os.path.join(pathOfFolder, 'index'+str(i)+'.txt.bz2')
        infiles.append(bz2.BZ2File(filename, "rb"))

    outfile.writelines(imerge(*infiles))

    for infile in infiles:
        infile.close()

Once you have your data in a single file you can read from that file and do 
the postprocessing you mention below.
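
For readers without the recipe at hand: the standard library's `heapq.merge` performs the same kind of lazy merge of already-sorted inputs. The sketch below (Python 3) is an assumption-laden illustration, not the recipe's `imerge`: it assumes each input file is sorted by word and holds lines of the form `word id id ...`; `records` and `merge_indexes` are hypothetical names. It additionally joins the posting lists of words that occur in several inputs.

```python
import bz2
import heapq
import itertools
import os
import tempfile


def records(lines):
    """Yield (word, [doc ids]) for each non-empty 'word id id ...' line."""
    for line in lines:
        parts = line.split()
        if parts:
            yield parts[0], parts[1:]


def merge_indexes(paths, out_path):
    """Merge sorted bz2 index files into one text file, combining the
    posting lists of words that appear in more than one input."""
    infiles = [bz2.open(p, "rt") for p in paths]
    try:
        # heapq.merge holds one pending record per input, not the whole data.
        merged = heapq.merge(*[records(f) for f in infiles])
        with open(out_path, "w") as out:
            for word, group in itertools.groupby(merged, key=lambda rec: rec[0]):
                postings = [doc for _, docs in group for doc in docs]
                out.write(" ".join([word] + postings) + "\n")
    finally:
        for f in infiles:
            f.close()


folder = tempfile.mkdtemp()
inputs = []
for i, text in enumerate(["apple 1 2\ncat 3\n", "apple 7\nbat 4\n"]):
    path = os.path.join(folder, "index%d.txt.bz2" % i)
    with bz2.open(path, "wt") as f:
        f.write(text)
    inputs.append(path)
merge_indexes(inputs, os.path.join(folder, "merged.txt"))
```

Memory use stays proportional to the number of input files plus the longest single posting list, regardless of the total index size.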

             
>     while any(flag)==1:
>         temp = heapq.heappop(heap)
>         for i in xrange(countFile):
>             if flag[i]==1:
>                 if listOfWords[i][0]==temp:
> 
>                     # This is where I am stuck. I cannot wait until a memory
>                     # error, as I need to do some postprocessing too.
>                     try:
>                         data[temp].extend(listOfWords[i][1:])
>                     except MemoryError:
>                         writeFinalIndex(data, countFinalFile, pathOfFolder)
>                         data=defaultdict(list)
>                         countFinalFile+=1
> 
>                     topOfFile[i]=indexFile[i].readline().strip()
>                     if topOfFile[i]=='':
>                             flag[i]=0
>                             indexFile[i].close()
>                             os.remove(pathOfFolder+'\index'+str(i)+'.txt.bz2')
>                     else:
>                         listOfWords[i] = topOfFile[i].split(' ')
>                         if listOfWords[i][0] not in heap:
>                             heapq.heappush(heap, listOfWords[i][0])
>     writeFinalIndex(data, countFinalFile, pathOfFolder)
> 
> countFile is the number of files and writeFileIndex method writes into the
> file.



#65446

From: Tim Golden <mail@timgolden.me.uk>
Date: 2014-02-04 19:28 +0000
Message-ID: <mailman.6404.1391542093.18130.python-list@python.org>
In reply to: #65415
On 04/02/2014 19:21, Dave Angel wrote:
>   Ayushi Dalmia <ayushidalmia2604@gmail.com> Wrote in message:
>
>>
>> Where am I going wrong? What are the alternatives I can try?
>
> You've rejected all the alternatives so far without showing your
>   code, or even properly specifying your problem.
>
> To get the "total" size of a list of strings,  try (untested):
>
> a = sys.getsizeof (mylist )
> for item in mylist:
>      a += sys.getsizeof (item)

The documentation for sys.getsizeof:

   http://docs.python.org/dev/library/sys#sys.getsizeof

warns about the limitations of this function when applied to a 
container, and even points to a recipe by Raymond Hettinger which 
attempts to do a more complete job.

TJG
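
To illustrate what a "more complete job" on containers can look like, here is a minimal recursive sketch for Python 3. It is not Raymond Hettinger's recipe; `total_size` is a hypothetical helper that only follows dicts, lists, tuples and sets, and it counts each object once by `id()`.

```python
import sys
from collections import defaultdict


def total_size(obj, seen=None):
    """Approximate deep size of obj in bytes.

    Follows dicts, lists, tuples and sets; any other object contributes
    only its own sys.getsizeof(). Shared objects are counted once.
    """
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        for key, value in obj.items():
            size += total_size(key, seen) + total_size(value, seen)
    elif isinstance(obj, (list, tuple, set, frozenset)):
        for item in obj:
            size += total_size(item, seen)
    return size


index = defaultdict(list)
index["python"].extend(["12", "57", "903"])
index["merge"].extend(["57"])
print(sys.getsizeof(index), total_size(index))
```

The second number is larger because it includes the keys, the posting lists and the doc-id strings, none of which `sys.getsizeof(index)` sees.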



#65447

From: Tim Chase <python.list@tim.thechases.com>
Date: 2014-02-04 13:29 -0600
Message-ID: <mailman.6405.1391542145.18130.python-list@python.org>
In reply to: #65415
On 2014-02-04 14:21, Dave Angel wrote:
> To get the "total" size of a list of strings,  try (untested):
> 
> a = sys.getsizeof (mylist )
> for item in mylist:
>     a += sys.getsizeof (item)

I always find this sort of accumulation weird (well, at least in
Python; it's the *only* way in many other languages) and would write
it as

  a = getsizeof(mylist) + sum(getsizeof(item) for item in mylist)

-tkc




#65470

From: Ayushi Dalmia <ayushidalmia2604@gmail.com>
Date: 2014-02-04 21:35 -0800
Message-ID: <7e7d3200-a4ae-4842-ad8d-68b4435b9006@googlegroups.com>
In reply to: #65447
On Wednesday, February 5, 2014 12:59:46 AM UTC+5:30, Tim Chase wrote:
> On 2014-02-04 14:21, Dave Angel wrote:
> > To get the "total" size of a list of strings,  try (untested):
> > 
> > a = sys.getsizeof (mylist )
> > for item in mylist:
> >     a += sys.getsizeof (item)
>
> I always find this sort of accumulation weird (well, at least in
> Python; it's the *only* way in many other languages) and would write
> it as
>
>   a = getsizeof(mylist) + sum(getsizeof(item) for item in mylist)
>
> -tkc

This also doesn't give the true size. I did the following:

import sys
data=[]
f=open('stopWords.txt','r')

for line in f:
    line=line.split()
    data.extend(line)

print sys.getsizeof(data)

where stopWords.txt is a file of size 4KB



#65471

From: Rustom Mody <rustompmody@gmail.com>
Date: 2014-02-04 21:45 -0800
Message-ID: <c7e4d66d-8c27-4229-a92e-49e3f68e1440@googlegroups.com>
In reply to: #65470
On Wednesday, February 5, 2014 11:05:05 AM UTC+5:30, Ayushi Dalmia wrote:
> This also doesn't gives the true size. I did the following:

> import sys
> data=[]
> f=open('stopWords.txt','r')

> for line in f:
>     line=line.split()
>     data.extend(line)

> print sys.getsizeof(data)

> where stopWords.txt is a file of size 4KB

Try getsizeof("".join(data))

General advice:
- It has been recommended (by Chris??) that you use a database
- You say you can't use a database (for whatever reason)

Now the fact is you NEED database (functionality)
How to escape this catch-22 situation?
In computer science it's called, somewhat sardonically, "Greenspun's 10th rule"

And the best way out is to 

1 isolate those aspects of database functionality you need 
2 temporarily forget about your original problem and implement the
(subset of) DBMS functionality you need
3 Use 2 above to implement 1



#65472

From: Ayushi Dalmia <ayushidalmia2604@gmail.com>
Date: 2014-02-04 22:00 -0800
Message-ID: <691fecec-c02a-4b0c-99ee-711c5371abad@googlegroups.com>
In reply to: #65471
On Wednesday, February 5, 2014 11:15:09 AM UTC+5:30, Rustom Mody wrote:
> On Wednesday, February 5, 2014 11:05:05 AM UTC+5:30, Ayushi Dalmia wrote:
> > This also doesn't gives the true size. I did the following:
>
> > import sys
> > data=[]
> > f=open('stopWords.txt','r')
>
> > for line in f:
> >     line=line.split()
> >     data.extend(line)
>
> > print sys.getsizeof(data)
>
> > where stopWords.txt is a file of size 4KB
>
> Try getsizeof("".join(data))
>
> General advice:
> - You have been recommended (by Chris??) that you should use a database
> - You say you cant use a database (for whatever reason)
>
> Now the fact is you NEED database (functionality)
> How to escape this catch-22 situation?
> In computer science its called somewhat sardonically "Greenspun's 10th rule"
>
> And the best way out is to 
>
> 1 isolate those aspects of database functionality you need 
> 2 temporarily forget about your original problem and implement the
> (subset of) DBMS functionality you need
> 3 Use 2 above to implement 1

Hello Rustom,

Thanks for the enlightenment. I did not know about Greenspun's Tenth Rule; it is interesting to know. However, this is an academic project and not a research one, so I do not have the liberty to choose what to work with. Life would be easier with databases, but I am not allowed to use them. Thanks for the tip. I will try to replicate that functionality.



#65474

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2014-02-05 11:00 +0000
Message-ID: <52f219c5$0$29972$c3e8da3$5496439d@news.astraweb.com>
In reply to: #65470
On Tue, 04 Feb 2014 21:35:05 -0800, Ayushi Dalmia wrote:

> On Wednesday, February 5, 2014 12:59:46 AM UTC+5:30, Tim Chase wrote:
>> On 2014-02-04 14:21, Dave Angel wrote:
>> 
>> > To get the "total" size of a list of strings,  try (untested):
>> 
>> > 
>> > a = sys.getsizeof (mylist )
>> > for item in mylist:
>> >     a += sys.getsizeof (item)
>> 
>> 
>> I always find this sort of accumulation weird (well, at least in
>> Python; it's the *only* way in many other languages) and would write
>> it as
>> 
>>   a = getsizeof(mylist) + sum(getsizeof(item) for item in mylist)
>> 
> 
> This also doesn't gives the true size. I did the following:


What do you mean by "true size"?

Do you mean the amount of space a certain amount of data will take in 
memory? With or without the overhead of object headers? Or do you mean 
how much space it will take when written to disk? You have not been clear 
what you are trying to measure.

If you are dealing with one-byte characters, you can measure the amount 
of memory they take up (excluding object overhead) by counting the number 
of characters: 23 one-byte characters requires 23 bytes. Plus the object 
overhead gives:

py> sys.getsizeof('a'*23)
44

44 bytes (23 bytes for the 23 single-byte characters, plus 21 bytes 
overhead). One thousand such characters takes:

py> sys.getsizeof('a'*1000)
1021

If you write such a string to disk, it will take 1000 bytes (or 1KB), 
unless you use some sort of compression.
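
The difference between the in-memory figure and the on-disk figure can be checked directly. This is a small sketch for Python 3; the exact `sys.getsizeof` result depends on the interpreter version and platform, so no particular overhead value is assumed here.

```python
import os
import sys
import tempfile

text = "a" * 1000

in_memory = sys.getsizeof(text)       # characters plus the object header
encoded = text.encode("ascii")        # the bytes that actually go to disk

with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(encoded)
    path = handle.name

on_disk = os.path.getsize(path)
os.remove(path)

print(in_memory, len(encoded), on_disk)
```

When the goal is to limit file size, `len()` of the encoded bytes is the number to track; `sys.getsizeof` answers a different question.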

> import sys
> data=[]
> f=open('stopWords.txt','r')
> 
> for line in f:
>     line=line.split()
>     data.extend(line)
> 
> print sys.getsizeof(data)

This will give you the amount of space taken by the list object. It will 
*not* give you the amount of space taken by the individual strings.

A Python list looks like this:


    | header | array of pointers |


The header is of constant or near-constant size; the array depends on the 
number of items in the list. It may be bigger than the list, e.g. a list 
with 1000 items might have allocated space for 2000 items. It will never 
be smaller.
 
getsizeof(list) only counts the direct size of that list, including the 
array, but not the things which the pointers point at. If you want the 
total size, you need to count them as well.
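
A short sketch (Python 3, CPython assumed) makes both points concrete: the list's own size does not depend on what its items are, and a list grown by appending may hold spare capacity.

```python
import sys

# Same number of slots, very different contents: the list size is identical
# because only the array of pointers is measured.
short_words = ["a"] * 100
long_words = ["a" * 1000] * 100
print(sys.getsizeof(short_words), sys.getsizeof(long_words))

# A list grown one append at a time may have allocated room for more items
# than it currently holds, so it is never smaller than an exact-size list.
grown = []
for _ in range(1000):
    grown.append(None)
exact = [None] * 1000
print(sys.getsizeof(grown), sys.getsizeof(exact))
```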


> where stopWords.txt is a file of size 4KB

My guess is that if you split a 4K file into words, then put the words 
into a list, you'll probably end up with 6-8K in memory.


-- 
Steven



#65475

From: Chris Angelico <rosuav@gmail.com>
Date: 2014-02-05 22:44 +1100
Message-ID: <mailman.6418.1391600696.18130.python-list@python.org>
In reply to: #65474
On Wed, Feb 5, 2014 at 10:00 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
>> where stopWords.txt is a file of size 4KB
>
> My guess is that if you split a 4K file into words, then put the words
> into a list, you'll probably end up with 6-8K in memory.

I'd guess rather more; Python strings have a fair bit of fixed
overhead, so with a whole lot of small strings, it will get more
costly.

>>> sys.version
'3.4.0b2 (v3.4.0b2:ba32913eb13e, Jan  5 2014, 16:23:43) [MSC v.1600 32 bit (Intel)]'
>>> sys.getsizeof("asdf")
29

"Stop words" tend to be short, rather than long, words, so I'd look at
an average of 2-3 letters per word. Assuming they're separated by
spaces or newlines, that means there'll be roughly a thousand of them
in the file, for about 25K of overhead. A bit less if the words are
longer, but still quite a bit. (Byte strings have slightly less
overhead, 17 bytes apiece, but still quite a bit.)

ChrisA
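
The estimate above can be reproduced on any interpreter by separating character payload from per-object overhead. This is a rough sketch for Python 3; the five words are a made-up stand-in for a real stop-word file, and every entry is treated as if it were a separate string object, as it would be after reading and splitting a file.

```python
import sys


def string_overhead(words):
    """Return (payload, overhead): total characters, and the bytes that
    sys.getsizeof reports beyond one byte per character."""
    payload = sum(len(w) for w in words)
    total = sum(sys.getsizeof(w) for w in words)
    return payload, total - payload


# Hypothetical stand-in for roughly a thousand short stop words.
words = ["the", "of", "and", "to", "in"] * 200
payload, overhead = string_overhead(words)
print(payload, overhead, overhead // len(words))
```

For short ASCII words the overhead dwarfs the payload, which is why the in-memory total ends up several times the file size.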



#65528

From: wxjmfauth@gmail.com
Date: 2014-02-06 02:15 -0800
Message-ID: <acdae8c8-2b59-4289-9f2b-1e4dd52cbd62@googlegroups.com>
In reply to: #65475
On Wednesday, February 5, 2014 12:44:47 UTC+1, Chris Angelico wrote:
> On Wed, Feb 5, 2014 at 10:00 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
> >> where stopWords.txt is a file of size 4KB
> >
> > My guess is that if you split a 4K file into words, then put the words
> > into a list, you'll probably end up with 6-8K in memory.
>
> I'd guess rather more; Python strings have a fair bit of fixed
> overhead, so with a whole lot of small strings, it will get more
> costly.
>
> >>> sys.version
> '3.4.0b2 (v3.4.0b2:ba32913eb13e, Jan  5 2014, 16:23:43) [MSC v.1600 32 bit (Intel)]'
> >>> sys.getsizeof("asdf")
> 29
>
> "Stop words" tend to be short, rather than long, words, so I'd look at
> an average of 2-3 letters per word. Assuming they're separated by
> spaces or newlines, that means there'll be roughly a thousand of them
> in the file, for about 25K of overhead. A bit less if the words are
> longer, but still quite a bit. (Byte strings have slightly less
> overhead, 17 bytes apiece, but still quite a bit.)
>
> ChrisA

>>> sum([sys.getsizeof(c) for c in ['a']])
26
>>> sum([sys.getsizeof(c) for c in ['a', 'a€']])
68
>>> sum([sys.getsizeof(c) for c in ['a', 'a€', 'aa€']])
112
>>> sum([sys.getsizeof(c) for c in ['a', 'a€', 'aa€', 'aaa€']])
158
>>> sum([sys.getsizeof(c) for c in ['a', 'a€', 'aa€', 'aaa€', 'aaaaaaaaaaaaaaaaaaaa€']])
238
>>> 
>>> 
>>> sum([sys.getsizeof(c.encode('utf-32-be')) for c in ['a']])
21
>>> sum([sys.getsizeof(c.encode('utf-32-be')) for c in ['a', 'a€']])
46
>>> sum([sys.getsizeof(c.encode('utf-32-be')) for c in ['a', 'a€', 'aa€']])
75
>>> sum([sys.getsizeof(c.encode('utf-32-be')) for c in ['a', 'a€', 'aa€', 'aaa€']])
108
>>> sum([sys.getsizeof(c.encode('utf-32-be')) for c in ['a', 'a€', 'aa€', 'aaa€', 'aaaaaaaaaaaaaaaaaaaa€']])
209
>>> 
>>> 
>>> sum([sys.getsizeof(c) for c in ['a', 'a€', 'aa€']*3])
336
>>> sum([sys.getsizeof(c) for c in ['aa€aa€']*3])
150
>>> sum([sys.getsizeof(c.encode('utf-32')) for c in ['a', 'a€', 'aa€']*3])
261
>>> sum([sys.getsizeof(c.encode('utf-32')) for c in ['aa€aa€']*3])
135
>>>

jmf
