Groups > comp.lang.python > #65415 > unrolled thread
| Started by | Ayushi Dalmia <ayushidalmia2604@gmail.com> |
|---|---|
| First post | 2014-02-04 03:28 -0800 |
| Last post | 2014-02-05 15:22 +0000 |
| Articles | 20 on this page of 159 — 30 participants |
Finding size of Variable Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-02-04 03:28 -0800
Re: Finding size of Variable Peter Otten <__peter__@web.de> - 2014-02-04 12:40 +0100
Re: Finding size of Variable Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-02-04 04:43 -0800
Re: Finding size of Variable Asaf Las <roegltd@gmail.com> - 2014-02-04 04:53 -0800
Re: Finding size of Variable Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-02-04 05:18 -0800
Re: Finding size of Variable Dave Angel <davea@davea.name> - 2014-02-04 08:09 -0500
Re: Finding size of Variable Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-02-04 05:19 -0800
Re: Finding size of Variable Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2014-02-04 09:06 -0500
Re: Finding size of Variable Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-02-04 21:00 -0800
Re:Finding size of Variable Dave Angel <davea@davea.name> - 2014-02-04 14:21 -0500
Re: Finding size of Variable Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-02-04 21:15 -0800
Re: Finding size of Variable Peter Otten <__peter__@web.de> - 2014-02-05 09:27 +0100
Re: Finding size of Variable Tim Golden <mail@timgolden.me.uk> - 2014-02-04 19:28 +0000
Re: Finding size of Variable Tim Chase <python.list@tim.thechases.com> - 2014-02-04 13:29 -0600
Re: Finding size of Variable Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-02-04 21:35 -0800
Re: Finding size of Variable Rustom Mody <rustompmody@gmail.com> - 2014-02-04 21:45 -0800
Re: Finding size of Variable Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-02-04 22:00 -0800
Re: Finding size of Variable Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-02-05 11:00 +0000
Re: Finding size of Variable Chris Angelico <rosuav@gmail.com> - 2014-02-05 22:44 +1100
Re: Finding size of Variable wxjmfauth@gmail.com - 2014-02-06 02:15 -0800
Re: Finding size of Variable Ned Batchelder <ned@nedbatchelder.com> - 2014-02-06 06:10 -0500
Re: Finding size of Variable wxjmfauth@gmail.com - 2014-02-06 05:51 -0800
Re: Finding size of Variable wxjmfauth@gmail.com - 2014-02-06 06:15 -0800
Re: Finding size of Variable Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-02-08 02:48 +0000
Re: Finding size of Variable Ethan Furman <ethan@stoneleaf.us> - 2014-02-07 19:02 -0800
Re: Finding size of Variable Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-02-08 13:17 +0000
Re: Finding size of Variable David Hutto <dwightdhutto@gmail.com> - 2014-02-08 17:45 -0500
Re: Finding size of Variable Rustom Mody <rustompmody@gmail.com> - 2014-02-08 17:25 -0800
Re: Finding size of Variable David Hutto <dwightdhutto@gmail.com> - 2014-02-08 21:56 -0500
Re: Finding size of Variable Chris Angelico <rosuav@gmail.com> - 2014-02-09 13:59 +1100
Re: Finding size of Variable David Hutto <dwightdhutto@gmail.com> - 2014-02-08 22:07 -0500
Re: Finding size of Variable Ned Batchelder <ned@nedbatchelder.com> - 2014-02-08 22:09 -0500
Re: Finding size of Variable David Hutto <dwightdhutto@gmail.com> - 2014-02-08 22:09 -0500
Re: Finding size of Variable Ned Batchelder <ned@nedbatchelder.com> - 2014-02-08 22:16 -0500
Re: Finding size of Variable Rustom Mody <rustompmody@gmail.com> - 2014-02-08 19:30 -0800
Re: Finding size of Variable wxjmfauth@gmail.com - 2014-02-10 06:07 -0800
Re: Finding size of Variable Asaf Las <roegltd@gmail.com> - 2014-02-10 06:25 -0800
Re: Finding size of Variable Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-02-10 14:39 +0000
Re: Finding size of Variable Tim Chase <python.list@tim.thechases.com> - 2014-02-10 08:43 -0600
Re: Finding size of Variable wxjmfauth@gmail.com - 2014-02-11 10:53 -0800
Re: Finding size of Variable Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-02-11 19:04 +0000
Re: Finding size of Variable wxjmfauth@gmail.com - 2014-02-11 23:49 -0800
Re: Finding size of Variable Chris Angelico <rosuav@gmail.com> - 2014-02-12 19:06 +1100
Re: Finding size of Variable Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2014-02-12 10:57 +0200
Re: Finding size of Variable Chris Angelico <rosuav@gmail.com> - 2014-02-12 20:24 +1100
Re: Finding size of Variable Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2014-02-12 11:35 +0200
Working with the set of real numbers (was: Finding size of Variable) Ben Finney <ben+python@benfinney.id.au> - 2014-02-12 19:17 +1100
Re: Working with the set of real numbers (was: Finding size of Variable) wxjmfauth@gmail.com - 2014-02-12 00:35 -0800
Re: Working with the set of real numbers (was: Finding size of Variable) wxjmfauth@gmail.com - 2014-02-12 00:46 -0800
Re: Working with the set of real numbers Ben Finney <ben+python@benfinney.id.au> - 2014-02-12 19:52 +1100
Re: Working with the set of real numbers (was: Finding size of Variable) Grant Edwards <invalid@invalid.invalid> - 2014-02-12 15:24 +0000
Re: Working with the set of real numbers (was: Finding size of Variable) "Gisle Vanem" <gvanem@yahoo.no> - 2014-02-12 17:23 +0100
Re: Working with the set of real numbers (was: Finding size of Variable) Chris Angelico <rosuav@gmail.com> - 2014-02-12 19:47 +1100
Re: Working with the set of real numbers (was: Finding size of Variable) Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2014-02-12 11:23 +0200
Re: Working with the set of real numbers (was: Finding size of Variable) albert@spenarnc.xs4all.nl (Albert van der Horst) - 2014-03-04 02:45 +0000
Re: Working with the set of real numbers (was: Finding size of Variable) Chris Angelico <rosuav@gmail.com> - 2014-03-04 14:02 +1100
Re: Working with the set of real numbers (was: Finding size of Variable) Rustom Mody <rustompmody@gmail.com> - 2014-03-03 19:13 -0800
Re: Working with the set of real numbers (was: Finding size of Variable) Chris Angelico <rosuav@gmail.com> - 2014-03-04 14:46 +1100
Re: Working with the set of real numbers (was: Finding size of Variable) Rustom Mody <rustompmody@gmail.com> - 2014-03-03 21:19 -0800
Re: Working with the set of real numbers (was: Finding size of Variable) Steven D'Aprano <steve@pearwood.info> - 2014-03-04 05:53 +0000
Re: Working with the set of real numbers (was: Finding size of Variable) Chris Angelico <rosuav@gmail.com> - 2014-03-04 17:35 +1100
Re: Working with the set of real numbers Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-03-05 00:05 +1300
Re: Working with the set of real numbers Chris Angelico <rosuav@gmail.com> - 2014-03-04 23:43 +1100
Re: Working with the set of real numbers Marko Rauhamaa <marko@pacujo.net> - 2014-03-04 21:49 +0200
Re: Working with the set of real numbers Chris Angelico <rosuav@gmail.com> - 2014-03-05 06:58 +1100
Re: Working with the set of real numbers Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2014-03-04 20:55 +0000
Re: Working with the set of real numbers Marko Rauhamaa <marko@pacujo.net> - 2014-03-04 23:05 +0200
Re: Working with the set of real numbers Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2014-03-04 22:08 +0000
Re: Working with the set of real numbers Chris Angelico <rosuav@gmail.com> - 2014-03-05 08:18 +1100
Re: Working with the set of real numbers Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2014-03-04 22:02 +0000
Re: Working with the set of real numbers Chris Angelico <rosuav@gmail.com> - 2014-03-05 09:18 +1100
Re: Working with the set of real numbers Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2014-03-04 22:54 +0000
Re: Working with the set of real numbers Chris Angelico <rosuav@gmail.com> - 2014-03-05 10:01 +1100
Re: Working with the set of real numbers Dave Angel <davea@davea.name> - 2014-03-04 18:20 -0500
Re: Working with the set of real numbers Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2014-03-05 11:59 +0000
Re: Working with the set of real numbers Dave Angel <davea@davea.name> - 2014-03-05 07:57 -0500
Re: Working with the set of real numbers Dave Angel <davea@davea.name> - 2014-03-05 08:32 -0500
Re: Working with the set of real numbers Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2014-03-06 12:27 +0000
Re: Working with the set of real numbers Chris Angelico <rosuav@gmail.com> - 2014-03-07 00:16 +1100
Re: Working with the set of real numbers (was: Finding size of Variable) Ian Kelly <ian.g.kelly@gmail.com> - 2014-03-04 04:19 -0700
Re: Working with the set of real numbers (was: Finding size of Variable) albert@spenarnc.xs4all.nl (Albert van der Horst) - 2014-03-05 02:27 +0000
Re: Working with the set of real numbers (was: Finding size of Variable) Ian Kelly <ian.g.kelly@gmail.com> - 2014-03-04 04:23 -0700
Re: Working with the set of real numbers (was: Finding size of Variable) albert@spenarnc.xs4all.nl (Albert van der Horst) - 2014-03-05 02:15 +0000
Re: Working with the set of real numbers (was: Finding size of Variable) Steven D'Aprano <steve@pearwood.info> - 2014-03-05 03:41 +0000
Re: Working with the set of real numbers (was: Finding size of Variable) Rustom Mody <rustompmody@gmail.com> - 2014-03-04 20:15 -0800
Re: Working with the set of real numbers (was: Finding size of Variable) Roy Smith <roy@panix.com> - 2014-03-04 23:25 -0500
Re: Working with the set of real numbers Ben Finney <ben+python@benfinney.id.au> - 2014-03-05 15:37 +1100
Re: Working with the set of real numbers Rustom Mody <rustompmody@gmail.com> - 2014-03-04 20:57 -0800
Re: Working with the set of real numbers Roy Smith <roy@panix.com> - 2014-03-05 00:29 -0500
Re: Working with the set of real numbers (was: Finding size of Variable) Steven D'Aprano <steve@pearwood.info> - 2014-03-05 07:52 +0000
Re: Working with the set of real numbers (was: Finding size of Variable) Steven D'Aprano <steve@pearwood.info> - 2014-03-05 08:38 +0000
Re: Working with the set of real numbers (was: Finding size of Variable) wxjmfauth@gmail.com - 2014-03-05 01:00 -0800
Re: Working with the set of real numbers Ned Batchelder <ned@nedbatchelder.com> - 2014-03-05 06:23 -0500
Re: Working with the set of real numbers (was: Finding size of Variable) Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2014-03-05 12:21 +0000
Re: Working with the set of real numbers (was: Finding size of Variable) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-03-05 17:43 +0000
Re: Working with the set of real numbers (was: Finding size of Variable) Chris Angelico <rosuav@gmail.com> - 2014-03-06 05:01 +1100
Re: Working with the set of real numbers (was: Finding size of Variable) Chris Kaynor <ckaynor@zindagigames.com> - 2014-03-05 10:03 -0800
Re: Working with the set of real numbers (was: Finding size of Variable) Grant Edwards <invalid@invalid.invalid> - 2014-03-05 19:13 +0000
Re: Working with the set of real numbers (was: Finding size of Variable) Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2014-03-05 21:22 +0000
Re: Working with the set of real numbers (was: Finding size of Variable) Roy Smith <roy@panix.com> - 2014-03-05 21:31 -0500
Re: Working with the set of real numbers (was: Finding size of Variable) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-03-06 03:06 +0000
Re: Working with the set of real numbers (was: Finding size of Variable) Chris Angelico <rosuav@gmail.com> - 2014-03-06 14:14 +1100
Re: Working with the set of real numbers (was: Finding size of Variable) Roy Smith <roy@panix.com> - 2014-03-05 23:05 -0500
Re: Working with the set of real numbers (was: Finding size of Variable) Grant Edwards <invalid@invalid.invalid> - 2014-03-06 03:34 +0000
Re: Working with the set of real numbers Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-03-05 12:50 +0000
Re: Working with the set of real numbers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-03-05 17:49 +0000
Re: Working with the set of real numbers Ben Finney <ben+python@benfinney.id.au> - 2014-02-12 19:56 +1100
Re: Working with the set of real numbers Chris Angelico <rosuav@gmail.com> - 2014-02-12 20:16 +1100
Re: Working with the set of real numbers Ben Finney <ben+python@benfinney.id.au> - 2014-02-12 21:07 +1100
Re: Working with the set of real numbers Rustom Mody <rustompmody@gmail.com> - 2014-02-12 06:11 -0800
Re: Working with the set of real numbers Ian Kelly <ian.g.kelly@gmail.com> - 2014-02-12 13:45 -0700
Re: Working with the set of real numbers Rustom Mody <rustompmody@gmail.com> - 2014-02-12 17:47 -0800
Re: Working with the set of real numbers Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-02-13 11:09 +1300
Re: Working with the set of real numbers Steven D'Aprano <steve@pearwood.info> - 2014-02-13 03:31 +0000
Re: Working with the set of real numbers Ben Finney <ben+python@benfinney.id.au> - 2014-02-13 14:45 +1100
Re: Working with the set of real numbers Chris Angelico <rosuav@gmail.com> - 2014-02-13 15:17 +1100
Re: Working with the set of real numbers Chris Angelico <rosuav@gmail.com> - 2014-02-12 21:20 +1100
Re: Working with the set of real numbers wxjmfauth@gmail.com - 2014-02-12 02:55 -0800
Re: Working with the set of real numbers Ned Batchelder <ned@nedbatchelder.com> - 2014-02-12 06:55 -0500
Re: Working with the set of real numbers Marko Rauhamaa <marko@pacujo.net> - 2014-02-12 14:48 +0200
Re: Working with the set of real numbers Chris Angelico <rosuav@gmail.com> - 2014-02-13 00:20 +1100
Re: Working with the set of real numbers Marko Rauhamaa <marko@pacujo.net> - 2014-02-12 16:13 +0200
Re: Working with the set of real numbers Chris Angelico <rosuav@gmail.com> - 2014-02-13 04:52 +1100
Re: Working with the set of real numbers Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-02-13 11:24 +1300
Re: Working with the set of real numbers Dave Angel <davea@davea.name> - 2014-02-12 17:56 -0500
Re: Working with the set of real numbers Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-02-14 18:26 +1300
Re: Working with the set of real numbers Ben Finney <ben+python@benfinney.id.au> - 2014-02-12 22:44 +1100
Re: Working with the set of real numbers Chris Angelico <rosuav@gmail.com> - 2014-02-12 22:58 +1100
Re: Working with the set of real numbers Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-02-13 11:32 +1300
Re: Working with the set of real numbers Grant Edwards <invalid@invalid.invalid> - 2014-02-12 23:23 +0000
Re: Finding size of Variable Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-02-12 14:04 +0000
Re: Finding size of Variable Rustom Mody <rustompmody@gmail.com> - 2014-02-12 06:14 -0800
Re: Finding size of Variable Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-02-12 14:25 +0000
Re: Finding size of Variable Rustom Mody <rustompmody@gmail.com> - 2014-02-12 06:32 -0800
Re: Working with the set of real numbers Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2014-02-13 12:48 +0000
Re: Working with the set of real numbers Marko Rauhamaa <marko@pacujo.net> - 2014-02-13 16:00 +0200
Re: Working with the set of real numbers Chris Angelico <rosuav@gmail.com> - 2014-02-14 06:25 +1100
Re: Working with the set of real numbers Marko Rauhamaa <marko@pacujo.net> - 2014-02-13 21:47 +0200
Re: Working with the set of real numbers Chris Angelico <rosuav@gmail.com> - 2014-02-14 07:08 +1100
Re: Working with the set of real numbers Devin Jeanpierre <jeanpierreda@gmail.com> - 2014-02-13 22:05 -0800
Re: Working with the set of real numbers Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-02-15 00:30 +1300
Re: Working with the set of real numbers Devin Jeanpierre <jeanpierreda@gmail.com> - 2014-02-14 16:26 -0800
Re: Working with the set of real numbers albert@spenarnc.xs4all.nl (Albert van der Horst) - 2014-03-05 02:38 +0000
Re: Working with the set of real numbers Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-02-14 19:37 +1300
Re: Working with the set of real numbers Chris Angelico <rosuav@gmail.com> - 2014-02-14 17:44 +1100
Re: Working with the set of real numbers Rustom Mody <rustompmody@gmail.com> - 2014-02-14 07:13 -0800
Re: Working with the set of real numbers Dave Angel <davea@davea.name> - 2014-02-14 07:30 -0500
Re: Working with the set of real numbers Grant Edwards <invalid@invalid.invalid> - 2014-02-14 15:09 +0000
Re: Working with the set of real numbers Rotwang <sg552@hotmail.co.uk> - 2014-02-13 21:29 +0000
Re: Working with the set of real numbers Marko Rauhamaa <marko@pacujo.net> - 2014-02-14 00:00 +0200
Re: Working with the set of real numbers Rotwang <sg552@hotmail.co.uk> - 2014-02-13 22:21 +0000
Re: Working with the set of real numbers Marko Rauhamaa <marko@pacujo.net> - 2014-02-14 01:16 +0200
Re: Working with the set of real numbers Ben Finney <ben+python@benfinney.id.au> - 2014-02-14 03:57 +1100
Re: Finding size of Variable Ned Batchelder <ned@nedbatchelder.com> - 2014-02-10 10:02 -0500
Re: Finding size of Variable Neil Cerutti <neilc@norwich.edu> - 2014-02-11 14:29 +0000
Re: Finding size of Variable Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2014-02-05 22:14 -0500
Re: Finding size of Variable Dave Angel <davea@davea.name> - 2014-02-05 08:43 -0500
Re: Finding size of Variable Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-02-05 06:33 -0800
Re: Finding size of Variable Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-02-05 15:22 +0000
| From | Ayushi Dalmia <ayushidalmia2604@gmail.com> |
|---|---|
| Date | 2014-02-04 03:28 -0800 |
| Subject | Finding size of Variable |
| Message-ID | <8e4c1ab1-e65d-483f-ad9d-6933ae2052c3@googlegroups.com> |
Hello,

I have 10 files and I need to merge them (using K-way merging). The size of each file is around 200 MB. Suppose I keep the merged data in a variable named mergedData. I had thought of checking the size of mergedData using sys.getsizeof(), but it somehow doesn't give the actual amount of memory occupied.

For example, if a file in my file system occupies 4 KB and I read all its lines into a list, the size of the list is only around 2100 bytes.

Where am I going wrong? What are the alternatives I can try?
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2014-02-04 12:40 +0100 |
| Message-ID | <mailman.6384.1391514038.18130.python-list@python.org> |
| In reply to | #65415 |
Ayushi Dalmia wrote:

> I have 10 files and I need to merge them (using K way merging). The size
> of each file is around 200 MB. Now suppose I am keeping the merged data in
> a variable named mergedData, I had thought of checking the size of
> mergedData using sys.getsizeof() but it somehow doesn't gives the actual
> value of the memory occupied.
>
> For example, if a file in my file system occupies 4 KB of data, if I read
> all the lines in a list, the size of the list is around 2100 bytes only.
>
> Where am I going wrong? What are the alternatives I can try?

getsizeof() gives you the size of the list only; to complete the picture you have to add the sizes of the lines.

However, why do you want to keep track of the actual memory used by variables in your script? You should instead concentrate on the algorithm, and as long as either the size of the dataset is manageable or you can limit the amount of data accessed at a given time you are golden.
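The point about getsizeof() can be checked with a small sketch. It assumes CPython and Python 3, and the helper name make_lines is hypothetical, used only to build two lists with the same number of items but different line lengths:

```python
import sys


def make_lines(count, width):
    """Build `count` distinct strings, each `width` characters wide (illustrative helper)."""
    return [str(i).zfill(width) for i in range(count)]


short_lines = make_lines(100, 10)
long_lines = make_lines(100, 10000)

# getsizeof() on a list covers only the list object itself:
# its header plus one pointer slot per item, not the strings it refers to.
shallow_short = sys.getsizeof(short_lines)
shallow_long = sys.getsizeof(long_lines)

# Adding the size of every line gives a figure closer to the real footprint.
total_short = shallow_short + sum(sys.getsizeof(s) for s in short_lines)
total_long = shallow_long + sum(sys.getsizeof(s) for s in long_lines)

print("list only:", shallow_short, shallow_long)
print("list plus lines:", total_short, total_long)
```

The two "list only" figures come out equal even though one list refers to far more text, which is consistent with a list of lines appearing smaller than the file it was read from.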
| From | Ayushi Dalmia <ayushidalmia2604@gmail.com> |
|---|---|
| Date | 2014-02-04 04:43 -0800 |
| Message-ID | <2728aca8-735b-4c38-9e7e-a164e8ed36f9@googlegroups.com> |
| In reply to | #65416 |
On Tuesday, February 4, 2014 5:10:25 PM UTC+5:30, Peter Otten wrote:
> Ayushi Dalmia wrote:
>
> > I have 10 files and I need to merge them (using K way merging). The size
> > of each file is around 200 MB. Now suppose I am keeping the merged data in
> > a variable named mergedData, I had thought of checking the size of
> > mergedData using sys.getsizeof() but it somehow doesn't gives the actual
> > value of the memory occupied.
> >
> > For example, if a file in my file system occupies 4 KB of data, if I read
> > all the lines in a list, the size of the list is around 2100 bytes only.
> >
> > Where am I going wrong? What are the alternatives I can try?
>
> getsizeof() gives you the size of the list only; to complete the picture you
> have to add the sizes of the lines.
>
> However, why do you want to keep track of the actual memory used by
> variables in your script? You should instead concentrate on the algorithm,
> and as long as either the size of the dataset is manageable or you can limit
> the amount of data accessed at a given time you are golden.

As I said, I need to merge large files and I cannot afford more I/O operations. So in order to minimise the I/O operation I am writing in chunks. Also, I need to use the merged files as indexes later which should be loaded in the memory for fast access. Hence the concern.

Can you please elaborate on the point of taking lines into consideration?
| From | Asaf Las <roegltd@gmail.com> |
|---|---|
| Date | 2014-02-04 04:53 -0800 |
| Message-ID | <b512a99b-59f8-4721-a51b-ad3a1be4b2d0@googlegroups.com> |
| In reply to | #65418 |
On Tuesday, February 4, 2014 2:43:21 PM UTC+2, Ayushi Dalmia wrote:
>
> As I said, I need to merge large files and I cannot afford more I/O
> operations. So in order to minimise the I/O operation I am writing in
> chunks. Also, I need to use the merged files as indexes later which
> should be loaded in the memory for fast access. Hence the concern.
> Can you please elaborate on the point of taking lines into consideration?

have you tried os.sendfile()?

http://docs.python.org/dev/library/os.html#os.sendfile
| From | Ayushi Dalmia <ayushidalmia2604@gmail.com> |
|---|---|
| Date | 2014-02-04 05:18 -0800 |
| Message-ID | <6b515ace-8a4c-46b4-ab9d-a20922d917cc@googlegroups.com> |
| In reply to | #65419 |
On Tuesday, February 4, 2014 6:23:19 PM UTC+5:30, Asaf Las wrote:
> On Tuesday, February 4, 2014 2:43:21 PM UTC+2, Ayushi Dalmia wrote:
>
> > As I said, I need to merge large files and I cannot afford more I/O
> > operations. So in order to minimise the I/O operation I am writing in
> > chunks. Also, I need to use the merged files as indexes later which
> > should be loaded in the memory for fast access. Hence the concern.
> > Can you please elaborate on the point of taking lines into consideration?
>
> have you tried os.sendfile()?
>
> http://docs.python.org/dev/library/os.html#os.sendfile

os.sendfile will not serve my purpose. I not only need to merge files, but do it in a sorted way. Thus some postprocessing is needed.
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2014-02-04 08:09 -0500 |
| Message-ID | <mailman.6385.1391519162.18130.python-list@python.org> |
| In reply to | #65418 |
Ayushi Dalmia <ayushidalmia2604@gmail.com> Wrote in message:
>> getsizeof() gives you the size of the list only; to complete the picture you
>> have to add the sizes of the lines.
>>
>> However, why do you want to keep track of the actual memory used by
>> variables in your script? You should instead concentrate on the algorithm,
>> and as long as either the size of the dataset is manageable or you can limit
>> the amount of data accessed at a given time you are golden.
>
> As I said, I need to merge large files and I cannot afford more I/O operations. So in order to minimise the I/O operation I am writing in chunks. Also, I need to use the merged files as indexes later which should be loaded in the memory for fast access. Hence the concern.
>
> Can you please elaborate on the point of taking lines into consideration?
>

Please don't doublespace your quotes. If you must use googlegroups, fix its bugs before posting.

There's usually no net gain in trying to 'chunk' your output to a text file. The python file system already knows how to do that for a sequential file.

For list of strings just add the getsizeof for the list to the sum of the getsizeof of all the list items.

--
DaveA
| From | Ayushi Dalmia <ayushidalmia2604@gmail.com> |
|---|---|
| Date | 2014-02-04 05:19 -0800 |
| Message-ID | <40d95427-0c96-46af-9efe-0343953ac460@googlegroups.com> |
| In reply to | #65420 |
On Tuesday, February 4, 2014 6:39:00 PM UTC+5:30, Dave Angel wrote:
> Ayushi Dalmia <ayushidalmia2604@gmail.com> Wrote in message:
>
> >> getsizeof() gives you the size of the list only; to complete the picture you
> >> have to add the sizes of the lines.
> >>
> >> However, why do you want to keep track of the actual memory used by
> >> variables in your script? You should instead concentrate on the algorithm,
> >> and as long as either the size of the dataset is manageable or you can limit
> >> the amount of data accessed at a given time you are golden.
> >
> > As I said, I need to merge large files and I cannot afford more I/O operations. So in order to minimise the I/O operation I am writing in chunks. Also, I need to use the merged files as indexes later which should be loaded in the memory for fast access. Hence the concern.
> >
> > Can you please elaborate on the point of taking lines into consideration?
> >
>
> Please don't doublespace your quotes. If you must use
> googlegroups, fix its bugs before posting.
>
> There's usually no net gain in trying to 'chunk' your output to a
> text file. The python file system already knows how to do that
> for a sequential file.
>
> For list of strings just add the getsizeof for the list to the sum
> of the getsizeof of all the list items.
>
> --
> DaveA

Hey!

I need to chunk out the outputs, otherwise it will give a MemoryError. I need to do some postprocessing on the data read from the file too. If I do not stop before the memory error, I won't be able to perform any more operations on it.
| From | Dennis Lee Bieber <wlfraed@ix.netcom.com> |
|---|---|
| Date | 2014-02-04 09:06 -0500 |
| Message-ID | <mailman.6389.1391523017.18130.python-list@python.org> |
| In reply to | #65422 |
On Tue, 4 Feb 2014 05:19:48 -0800 (PST), Ayushi Dalmia
<ayushidalmia2604@gmail.com> declaimed the following:
>I need to chunk out the outputs otherwise it will give Memory Error. I need to do some postprocessing on the data read from the file too. If I donot stop before memory error, I won't be able to perform any more operations on it.
10 200MB files is only 2GB... Most any 64-bit processor these days can
handle that. Even some 32-bit systems could handle it (WinXP booted with
the server option gives 3GB to user processes -- if the 4GB was installed
in the machine).
However, you speak of an n-way merge. The traditional merge operation
only reads one record from each file at a time, examines them for "first",
writes that "first", reads next record from the file "first" came from, and
then reassesses the set.
You mention needing to chunk the data -- that implies performing a merge
sort in which you read a few records from each file into memory, sort them,
and write them out to newFile1; then read the same number of records from
each file, sort, and write them to newFile2, up to however many files you
intend to work with -- at that point you go back and append the next chunk
to newFile1. When done, each file contains chunks of n*r records. You now
make newFilex the inputs, read/merge the records from those chunks
outputting to another file1, when you reach the end of the first chunk in
the files you then read/merge the second chunk into another file2. You
repeat this process until you end up with only one chunk in one file.
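The traditional merge described above, where only one record per input file is held in memory at a time, can be sketched with the standard library's heapq.merge. This is a minimal illustration in Python 3 and rests on two assumptions: each input file is already sorted, and every line ends with a newline. The function name merge_sorted_files and the temporary file names are hypothetical.

```python
import heapq
import os
import tempfile


def merge_sorted_files(in_paths, out_path):
    """Merge already-sorted text files line by line into out_path."""
    files = [open(p, "r", encoding="utf-8") for p in in_paths]
    try:
        with open(out_path, "w", encoding="utf-8") as out:
            # heapq.merge reads lazily from each file iterator, so only
            # one pending line per input file is kept in memory.
            out.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()


# Small demonstration with three sorted runs written to temporary files.
with tempfile.TemporaryDirectory() as tmp:
    runs = [["apple\n", "melon\n"], ["banana\n", "zebra\n"], ["cherry\n"]]
    paths = []
    for n, lines in enumerate(runs):
        path = os.path.join(tmp, "run%d.txt" % n)
        with open(path, "w", encoding="utf-8") as f:
            f.writelines(lines)
        paths.append(path)

    merged_path = os.path.join(tmp, "merged.txt")
    merge_sorted_files(paths, merged_path)
    with open(merged_path, encoding="utf-8") as f:
        merged = f.read().split()

print(merged)
```

With this approach the memory used depends on the number of input files and the longest line, not on the total size of the data.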
--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com HTTP://wlfraed.home.netcom.com/
| From | Ayushi Dalmia <ayushidalmia2604@gmail.com> |
|---|---|
| Date | 2014-02-04 21:00 -0800 |
| Message-ID | <ce4abca2-27f9-419d-a7d0-352ff53b533f@googlegroups.com> |
| In reply to | #65426 |
On Tuesday, February 4, 2014 7:36:48 PM UTC+5:30, Dennis Lee Bieber wrote:
> On Tue, 4 Feb 2014 05:19:48 -0800 (PST), Ayushi Dalmia
> <ayushidalmia2604@gmail.com> declaimed the following:
>
> >I need to chunk out the outputs otherwise it will give Memory Error. I need to do some postprocessing on the data read from the file too. If I donot stop before memory error, I won't be able to perform any more operations on it.
>
> 10 200MB files is only 2GB... Most any 64-bit processor these days can
> handle that. Even some 32-bit systems could handle it (WinXP booted with
> the server option gives 3GB to user processes -- if the 4GB was installed
> in the machine).
>
> However, you speak of an n-way merge. The traditional merge operation
> only reads one record from each file at a time, examines them for "first",
> writes that "first", reads next record from the file "first" came from, and
> then reassesses the set.
>
> You mention needed to chunk the data -- that implies performing a merge
> sort in which you read a few records from each file into memory, sort them,
> and right them out to newFile1; then read the same number of records from
> each file, sort, and write them to newFile2, up to however many files you
> intend to work with -- at that point you go back and append the next chunk
> to newFile1. When done, each file contains chunks of n*r records. You now
> make newFilex the inputs, read/merge the records from those chunks
> outputting to another file1, when you reach the end of the first chunk in
> the files you then read/merge the second chunk into another file2. You
> repeat this process until you end up with only one chunk in one file.
> --
> Wulfraed Dennis Lee Bieber AF6VN
> wlfraed@ix.netcom.com HTTP://wlfraed.home.netcom.com/

The way you mentioned for merging the files is an option, but it will involve a lot of I/O operations.

Also, I do not want the size of a file to increase beyond a certain point. When the file size reaches a certain limit, I want to start writing to a new file. This is because I want to load the files into memory again later.
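One way to cap the size of each output file is to count the bytes as they are written and switch to a new file before the limit would be exceeded. The sketch below is only an illustration in Python 3: the class name RollingWriter, the "final" file prefix and the 20-byte limit are all hypothetical, and it counts bytes on disk rather than memory in use.

```python
import os
import tempfile


class RollingWriter:
    """Write lines to numbered files, starting a new file at a byte limit."""

    def __init__(self, folder, max_bytes, prefix="final"):
        self.folder = folder
        self.max_bytes = max_bytes
        self.prefix = prefix
        self.index = 0
        self.written = 0
        self.out = None

    def _open_next(self):
        # Close the current file (if any) and move on to the next number.
        if self.out is not None:
            self.out.close()
            self.index += 1
        name = "%s%d.txt" % (self.prefix, self.index)
        self.out = open(os.path.join(self.folder, name), "wb")
        self.written = 0

    def write_line(self, line):
        data = line.encode("utf-8")
        # Roll over before a write that would push a non-empty file past the limit.
        too_big = self.written and self.written + len(data) > self.max_bytes
        if self.out is None or too_big:
            self._open_next()
        self.out.write(data)
        self.written += len(data)

    def close(self):
        if self.out is not None:
            self.out.close()


with tempfile.TemporaryDirectory() as tmp:
    writer = RollingWriter(tmp, max_bytes=20)
    for n in range(5):
        writer.write_line("word%03d\n" % n)  # each line is 8 bytes
    writer.close()
    sizes = [os.path.getsize(os.path.join(tmp, name))
             for name in sorted(os.listdir(tmp))]

print(sizes)
```

A single line longer than the limit still goes into a file of its own, so the limit is a target rather than a hard guarantee.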
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2014-02-04 14:21 -0500 |
| Message-ID | <mailman.6402.1391541507.18130.python-list@python.org> |
| In reply to | #65415 |
Ayushi Dalmia <ayushidalmia2604@gmail.com> Wrote in message:
>
> Where am I going wrong? What are the alternatives I can try?
You've rejected all the alternatives so far without showing your
code, or even properly specifying your problem.
To get the "total" size of a list of strings, try (untested):
a = sys.getsizeof (mylist )
for item in mylist:
    a += sys.getsizeof (item)
This can be high if some of the strings are interned and get
counted twice. But you're not likely to get closer without some
knowledge of the data objects and where they come
from.
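The double counting mentioned above can be avoided by remembering which objects have already been measured. A minimal sketch in Python 3, assuming CPython (where id() identifies an object for as long as it is alive); the function name total_size is hypothetical:

```python
import sys


def total_size(strings):
    """Approximate bytes used by a list of strings, counting each object once."""
    seen = set()
    total = sys.getsizeof(strings)
    for item in strings:
        # The same string object may appear many times in the list;
        # only add its size the first time it is seen.
        if id(item) not in seen:
            seen.add(id(item))
            total += sys.getsizeof(item)
    return total


word = "wikipedia"
repeated = [word] * 1000  # 1000 references to a single string object

naive = sys.getsizeof(repeated) + sum(sys.getsizeof(s) for s in repeated)
print(total_size(repeated), naive)
```

For a list in which every string is a distinct object, both figures are the same; they diverge only when objects are shared.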
--
DaveA
| From | Ayushi Dalmia <ayushidalmia2604@gmail.com> |
|---|---|
| Date | 2014-02-04 21:15 -0800 |
| Message-ID | <723729ee-8e74-4d65-aa6f-742051a94101@googlegroups.com> |
| In reply to | #65444 |
On Wednesday, February 5, 2014 12:51:31 AM UTC+5:30, Dave Angel wrote:
> Ayushi Dalmia <ayushidalmia2604@gmail.com> Wrote in message:
> >
> > Where am I going wrong? What are the alternatives I can try?
>
> You've rejected all the alternatives so far without showing your
> code, or even properly specifying your problem.
>
> To get the "total" size of a list of strings, try (untested):
>
> a = sys.getsizeof (mylist )
> for item in mylist:
>     a += sys.getsizeof (item)
>
> This can be high if some of the strings are interned and get
> counted twice. But you're not likely to get closer without some
> knowledge of the data objects and where they come
> from.
>
> --
> DaveA

Hello Dave,
I just thought that saving others' time was better, and hence I explained only a subset of my problem. Here is what I am trying to do:
I am trying to index the current wikipedia dump without using databases and create a search engine for Wikipedia documents. Note, I CANNOT USE DATABASES.
My approach:
I am parsing the wikipedia pages using SAX Parser, and then, I am dumping the words along with the posting list (a list of doc ids in which the word is present) into different files after reading 'X' number of pages. Now these files may have the same word and hence I need to merge them and write the final index again. Now these final indexes must be of limited size as I need to be of limited size. This is where I am stuck. I need to know how to determine the size of content in a variable before I write into the file.
Here is the code for my merging:
    import bz2
    import heapq
    import os
    from collections import defaultdict

    def mergeFiles(pathOfFolder, countFile):
        listOfWords={}
        indexFile={}
        topOfFile={}
        flag=[0]*countFile
        data=defaultdict(list)
        heap=[]
        countFinalFile=0
        for i in xrange(countFile):
            fileName = pathOfFolder+'\index'+str(i)+'.txt.bz2'
            indexFile[i]= bz2.BZ2File(fileName, 'rb')
            flag[i]=1
            topOfFile[i]=indexFile[i].readline().strip()
            listOfWords[i] = topOfFile[i].split(' ')
            if listOfWords[i][0] not in heap:
                heapq.heappush(heap, listOfWords[i][0])
        while any(flag)==1:
            temp = heapq.heappop(heap)
            for i in xrange(countFile):
                if flag[i]==1:
                    if listOfWords[i][0]==temp:
                        # This is where I am stuck. I cannot wait until memory
                        # error, as I need to do some postprocessing too.
                        try:
                            data[temp].extend(listOfWords[i][1:])
                        except MemoryError:
                            writeFinalIndex(data, countFinalFile, pathOfFolder)
                            data=defaultdict(list)
                            countFinalFile+=1
                        topOfFile[i]=indexFile[i].readline().strip()
                        if topOfFile[i]=='':
                            flag[i]=0
                            indexFile[i].close()
                            os.remove(pathOfFolder+'\index'+str(i)+'.txt.bz2')
                        else:
                            listOfWords[i] = topOfFile[i].split(' ')
                            if listOfWords[i][0] not in heap:
                                heapq.heappush(heap, listOfWords[i][0])
        writeFinalIndex(data, countFinalFile, pathOfFolder)
countFile is the number of files, and the writeFinalIndex method writes into the file.
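One way to keep each output file under a size limit, without measuring in-memory objects at all, is to count the bytes of every line as it is written and switch files before the limit would be exceeded. The sketch below is only an illustration in Python 3: the class `RollingWriter`, the file naming scheme and the 20-byte limit are hypothetical, and it counts uncompressed bytes, so a bz2-compressed output would need a different measure.

```python
import os
import tempfile

class RollingWriter:
    """Write text lines to numbered files, starting a new file before the
    current one would exceed max_bytes (hypothetical helper, not from the thread)."""

    def __init__(self, folder, max_bytes, prefix="final"):
        self.folder, self.max_bytes, self.prefix = folder, max_bytes, prefix
        self.count = 0          # index of the current output file
        self.written = 0        # bytes written to the current file so far
        self.out = self._open()

    def _open(self):
        name = "%s%d.txt" % (self.prefix, self.count)
        return open(os.path.join(self.folder, name), "wb")

    def write_line(self, line):
        payload = (line + "\n").encode("utf-8")
        # Roll over only if the file already holds data, so a single
        # oversized line still gets written somewhere.
        if self.written and self.written + len(payload) > self.max_bytes:
            self.out.close()
            self.count += 1
            self.written = 0
            self.out = self._open()
        self.out.write(payload)
        self.written += len(payload)

    def close(self):
        self.out.close()

folder = tempfile.mkdtemp()
writer = RollingWriter(folder, max_bytes=20)
for entry in ["apple 1 2 3", "banana 4 5", "cherry 6"]:
    writer.write_line(entry)
writer.close()
sizes = [os.path.getsize(os.path.join(folder, f)) for f in sorted(os.listdir(folder))]
print(sizes)
```

Because the byte count is taken from the encoded payload, it matches what lands on disk, which is the quantity that matters when the files are read back later.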
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2014-02-05 09:27 +0100 |
| Message-ID | <mailman.6417.1391588841.18130.python-list@python.org> |
| In reply to | #65469 |
Ayushi Dalmia wrote:
> On Wednesday, February 5, 2014 12:51:31 AM UTC+5:30, Dave Angel wrote:
>> Ayushi Dalmia <ayushidalmia2604@gmail.com> Wrote in message:
>> >
>> > Where am I going wrong? What are the alternatives I can try?
>>
>> You've rejected all the alternatives so far without showing your
>> code, or even properly specifying your problem.
>>
>> To get the "total" size of a list of strings, try (untested):
>>
>>     a = sys.getsizeof (mylist )
>>     for item in mylist:
>>         a += sys.getsizeof (item)
>>
>> This can be high if some of the strings are interned and get
>> counted twice. But you're not likely to get closer without some
>> knowledge of the data objects and where they come
>> from.
>>
>> --
>> DaveA
>
> Hello Dave,
>
> I just thought that saving others time is better and hence I explained
> only the subset of my problem. Here is what I am trying to do:
>
> I am trying to index the current wikipedia dump without using databases
> and create a search engine for Wikipedia documents. Note, I CANNOT USE
> DATABASES. My approach:
>
> I am parsing the wikipedia pages using SAX Parser, and then, I am dumping
> the words along with the posting list (a list of doc ids in which the word
> is present) into different files after reading 'X' number of pages. Now
> these files may have the same word and hence I need to merge them and
> write the final index again. Now these final indexes must be of limited
> size as I need to be of limited size. This is where I am stuck. I need to
> know how to determine the size of content in a variable before I write
> into the file.
>
> Here is the code for my merging:
>
> def mergeFiles(pathOfFolder, countFile):
>     listOfWords={}
>     indexFile={}
>     topOfFile={}
>     flag=[0]*countFile
>     data=defaultdict(list)
>     heap=[]
>     countFinalFile=0
>     for i in xrange(countFile):
>         fileName = pathOfFolder+'\index'+str(i)+'.txt.bz2'
>         indexFile[i]= bz2.BZ2File(fileName, 'rb')
>         flag[i]=1
>         topOfFile[i]=indexFile[i].readline().strip()
>         listOfWords[i] = topOfFile[i].split(' ')
>         if listOfWords[i][0] not in heap:
>             heapq.heappush(heap, listOfWords[i][0])
At this point you have already done it wrong as your heap contains the
complete data and you have done a lot of O(N) tests on the heap.
This is both slow and consumes a lot of memory. See
http://code.activestate.com/recipes/491285-iterator-merge/
for a sane way to merge sorted data from multiple files. Your code becomes
(untested)
    with open("outfile.txt", "wb") as outfile:
        infiles = []
        for i in xrange(countFile):
            filename = os.path.join(pathOfFolder, 'index'+str(i)+'.txt.bz2')
            infiles.append(bz2.BZ2File(filename, "rb"))
        outfile.writelines(imerge(*infiles))
        for infile in infiles:
            infile.close()
Once you have your data in a single file you can read from that file and do
the postprocessing you mention below.
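For reference, the standard library's heapq.merge performs the same kind of lazy merge as the linked recipe, so a separate imerge helper may not be needed. The following is a minimal Python 3 sketch that uses in-memory stand-ins for the index files; the wrapper `merge_sorted_lines` and the sample lines are hypothetical, and it assumes every input is already sorted by whole line.

```python
import heapq
import io

def merge_sorted_lines(*sources):
    """Lazily yield lines from several already-sorted line iterables in sorted order."""
    return heapq.merge(*sources)

# In-memory stand-ins for the sorted index files; real code would pass open file objects.
first = io.StringIO("apple 1 2\ncherry 3\n")
second = io.StringIO("banana 7\ncherry 9\n")
merged = list(merge_sorted_lines(first, second))
print(merged)
```

Only one pending line per input is held at a time, so memory use depends on the number of files rather than on their size. Lines for the same word end up adjacent, which makes combining their posting lists a single sequential pass.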
>     while any(flag)==1:
>         temp = heapq.heappop(heap)
>         for i in xrange(countFile):
>             if flag[i]==1:
>                 if listOfWords[i][0]==temp:
>                     # This is where I am stuck. I cannot wait until memory
>                     # error, as I need to do some postprocessing too.
>                     try:
>                         data[temp].extend(listOfWords[i][1:])
>                     except MemoryError:
>                         writeFinalIndex(data, countFinalFile, pathOfFolder)
>                         data=defaultdict(list)
>                         countFinalFile+=1
>                     topOfFile[i]=indexFile[i].readline().strip()
>                     if topOfFile[i]=='':
>                         flag[i]=0
>                         indexFile[i].close()
>                         os.remove(pathOfFolder+'\index'+str(i)+'.txt.bz2')
>                     else:
>                         listOfWords[i] = topOfFile[i].split(' ')
>                         if listOfWords[i][0] not in heap:
>                             heapq.heappush(heap, listOfWords[i][0])
>     writeFinalIndex(data, countFinalFile, pathOfFolder)
>
> countFile is the number of files and writeFileIndex method writes into the
> file.
| From | Tim Golden <mail@timgolden.me.uk> |
|---|---|
| Date | 2014-02-04 19:28 +0000 |
| Message-ID | <mailman.6404.1391542093.18130.python-list@python.org> |
| In reply to | #65415 |
On 04/02/2014 19:21, Dave Angel wrote:
> Ayushi Dalmia <ayushidalmia2604@gmail.com> Wrote in message:
>>
>> Where am I going wrong? What are the alternatives I can try?
>
> You've rejected all the alternatives so far without showing your
> code, or even properly specifying your problem.
>
> To get the "total" size of a list of strings, try (untested):
>
>     a = sys.getsizeof (mylist )
>     for item in mylist:
>         a += sys.getsizeof (item)
The documentation for sys.getsizeof:
http://docs.python.org/dev/library/sys#sys.getsizeof
warns about the limitations of this function when applied to a container,
and even points to a recipe by Raymond Hettinger which attempts to do a
more complete job.
TJG
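To show what a "more complete job" might involve, here is a simplified Python 3 sketch of a recursive size function. It is not the recipe the documentation points to: the name `deep_getsizeof` is hypothetical, and it only descends into dict, list, tuple, set and frozenset objects.

```python
import sys

def deep_getsizeof(obj, seen=None):
    """Approximate total size of obj plus everything it references, counting
    each object once. Handles only dict, list, tuple, set and frozenset
    containers; a simplified sketch, not the recipe the documentation links to."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size

# Hypothetical posting lists: word -> list of document ids.
postings = {"python": ["12", "57"], "index": ["3"]}
print(sys.getsizeof(postings), deep_getsizeof(postings))
```

The first number covers only the dict object itself; the second also includes its keys, the lists and the strings inside them.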
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2014-02-04 13:29 -0600 |
| Message-ID | <mailman.6405.1391542145.18130.python-list@python.org> |
| In reply to | #65415 |
On 2014-02-04 14:21, Dave Angel wrote:
> To get the "total" size of a list of strings, try (untested):
>
>     a = sys.getsizeof (mylist )
>     for item in mylist:
>         a += sys.getsizeof (item)
I always find this sort of accumulation weird (well, at least in
Python; it's the *only* way in many other languages) and would write
it as
    a = getsizeof(mylist) + sum(getsizeof(item) for item in mylist)
-tkc
| From | Ayushi Dalmia <ayushidalmia2604@gmail.com> |
|---|---|
| Date | 2014-02-04 21:35 -0800 |
| Message-ID | <7e7d3200-a4ae-4842-ad8d-68b4435b9006@googlegroups.com> |
| In reply to | #65447 |
On Wednesday, February 5, 2014 12:59:46 AM UTC+5:30, Tim Chase wrote:
> On 2014-02-04 14:21, Dave Angel wrote:
> > To get the "total" size of a list of strings, try (untested):
> >
> >     a = sys.getsizeof (mylist )
> >     for item in mylist:
> >         a += sys.getsizeof (item)
>
> I always find this sort of accumulation weird (well, at least in
> Python; it's the *only* way in many other languages) and would write
> it as
>
>     a = getsizeof(mylist) + sum(getsizeof(item) for item in mylist)
>
> -tkc
This also doesn't give the true size. I did the following:
    import sys
    data=[]
    f=open('stopWords.txt','r')
    for line in f:
        line=line.split()
        data.extend(line)
    print sys.getsizeof(data)
where stopWords.txt is a file of size 4KB
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2014-02-04 21:45 -0800 |
| Message-ID | <c7e4d66d-8c27-4229-a92e-49e3f68e1440@googlegroups.com> |
| In reply to | #65470 |
On Wednesday, February 5, 2014 11:05:05 AM UTC+5:30, Ayushi Dalmia wrote:
> This also doesn't gives the true size. I did the following:
> import sys
> data=[]
> f=open('stopWords.txt','r')
> for line in f:
>     line=line.split()
>     data.extend(line)
> print sys.getsizeof(data)
> where stopWords.txt is a file of size 4KB
Try getsizeof("".join(data))
General advice:
- You have been recommended (by Chris??) to use a database
- You say you can't use a database (for whatever reason)
Now the fact is you NEED database (functionality)
How to escape this catch-22 situation?
In computer science it's called, somewhat sardonically, "Greenspun's 10th rule"
And the best way out is to
1 isolate those aspects of database functionality you need
2 temporarily forget about your original problem and implement the
(subset of) DBMS functionality you need
3 Use 2 above to implement 1
| From | Ayushi Dalmia <ayushidalmia2604@gmail.com> |
|---|---|
| Date | 2014-02-04 22:00 -0800 |
| Message-ID | <691fecec-c02a-4b0c-99ee-711c5371abad@googlegroups.com> |
| In reply to | #65471 |
On Wednesday, February 5, 2014 11:15:09 AM UTC+5:30, Rustom Mody wrote:
> On Wednesday, February 5, 2014 11:05:05 AM UTC+5:30, Ayushi Dalmia wrote:
> > This also doesn't gives the true size. I did the following:
> >
> > import sys
> > data=[]
> > f=open('stopWords.txt','r')
> >
> > for line in f:
> >     line=line.split()
> >     data.extend(line)
> >
> > print sys.getsizeof(data)
> >
> > where stopWords.txt is a file of size 4KB
>
> Try getsizeof("".join(data))
>
> General advice:
> - You have been recommended (by Chris??) that you should use a database
> - You say you cant use a database (for whatever reason)
>
> Now the fact is you NEED database (functionality)
> How to escape this catch-22 situation?
> In computer science its called somewhat sardonically "Greenspun's 10th rule"
>
> And the best way out is to
>
> 1 isolate those aspects of database functionality you need
> 2 temporarily forget about your original problem and implement the dbms
> (subset of) DBMS functionality you need
> 3 Use 2 above to implement 1
Hello Rustom,
Thanks for the enlightenment. I did not know about Greenspun's Tenth Rule; it is interesting to learn about. However, this is an academic project and not a research one, so I do not have the liberty to choose what to work with. Life would be easier with databases, but I am not allowed to use them. Thanks for the tip. I will try to replicate that functionality.
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-02-05 11:00 +0000 |
| Message-ID | <52f219c5$0$29972$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #65470 |
On Tue, 04 Feb 2014 21:35:05 -0800, Ayushi Dalmia wrote:
> On Wednesday, February 5, 2014 12:59:46 AM UTC+5:30, Tim Chase wrote:
>> On 2014-02-04 14:21, Dave Angel wrote:
>>
>> > To get the "total" size of a list of strings, try (untested):
>>
>> >
>> >     a = sys.getsizeof (mylist )
>> >     for item in mylist:
>> >         a += sys.getsizeof (item)
>>
>>
>> I always find this sort of accumulation weird (well, at least in
>> Python; it's the *only* way in many other languages) and would write
>> it as
>>
>> a = getsizeof(mylist) + sum(getsizeof(item) for item in mylist)
>>
>
> This also doesn't gives the true size. I did the following:
What do you mean by "true size"?
Do you mean the amount of space a certain amount of data will take in
memory? With or without the overhead of object headers? Or do you mean
how much space it will take when written to disk? You have not been clear
what you are trying to measure.
If you are dealing with one-byte characters, you can measure the amount
of memory they take up (excluding object overhead) by counting the number
of characters: 23 one-byte characters require 23 bytes. Adding the object
overhead gives:
py> sys.getsizeof('a'*23)
44
44 bytes (23 bytes for the 23 single-byte characters, plus 21 bytes
overhead). One thousand such characters takes:
py> sys.getsizeof('a'*1000)
1021
If you write such a string to disk, it will take 1000 bytes (or 1KB),
unless you use some sort of compression.
> import sys
> data=[]
> f=open('stopWords.txt','r')
>
> for line in f:
>     line=line.split()
>     data.extend(line)
>
> print sys.getsizeof(data)
This will give you the amount of space taken by the list object. It will
*not* give you the amount of space taken by the individual strings.
A Python list looks like this:
| header | array of pointers |
The header is of constant or near-constant size; the array depends on the
number of items in the list. The array may be bigger than the list needs,
e.g. a list with 1000 items might have allocated space for 2000 items. It
will never be smaller.
getsizeof(list) only counts the direct size of that list, including the
array, but not the things which the pointers point at. If you want the
total size, you need to count them as well.
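A small Python 3 demonstration of this point, using made-up data: two lists with the same number of items report the same size from getsizeof, however large the strings they point at are.

```python
import sys

short_words = ["a"] * 1000          # 1000 references to a tiny string
long_words = ["a" * 1000] * 1000    # 1000 references to a 1000-character string
# Both lists hold the same number of pointers, so their own sizes match
# even though the strings they point at differ greatly in size.
print(sys.getsizeof(short_words) == sys.getsizeof(long_words))
print(sys.getsizeof(short_words[0]), sys.getsizeof(long_words[0]))
```

The exact numbers depend on the Python version and platform, which is why none are shown here.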
> where stopWords.txt is a file of size 4KB
My guess is that if you split a 4K file into words, then put the words
into a list, you'll probably end up with 6-8K in memory.
--
Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-02-05 22:44 +1100 |
| Message-ID | <mailman.6418.1391600696.18130.python-list@python.org> |
| In reply to | #65474 |
On Wed, Feb 5, 2014 at 10:00 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
>> where stopWords.txt is a file of size 4KB
>
> My guess is that if you split a 4K file into words, then put the words
> into a list, you'll probably end up with 6-8K in memory.
I'd guess rather more; Python strings have a fair bit of fixed
overhead, so with a whole lot of small strings, it will get more
costly.
>>> sys.version
'3.4.0b2 (v3.4.0b2:ba32913eb13e, Jan 5 2014, 16:23:43) [MSC v.1600 32
bit (Intel)]'
>>> sys.getsizeof("asdf")
29
"Stop words" tend to be short, rather than long, words, so I'd look at
an average of 2-3 letters per word. Assuming they're separated by
spaces or newlines, that means there'll be roughly a thousand of them
in the file, for about 25K of overhead. A bit less if the words are
longer, but still quite a bit. (Byte strings have slightly less
overhead, 17 bytes apiece, but still quite a bit.)
ChrisA
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2014-02-06 02:15 -0800 |
| Message-ID | <acdae8c8-2b59-4289-9f2b-1e4dd52cbd62@googlegroups.com> |
| In reply to | #65475 |
On Wednesday, February 5, 2014 12:44:47 UTC+1, Chris Angelico wrote:
> On Wed, Feb 5, 2014 at 10:00 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
> >> where stopWords.txt is a file of size 4KB
> >
> > My guess is that if you split a 4K file into words, then put the words
> > into a list, you'll probably end up with 6-8K in memory.
>
> I'd guess rather more; Python strings have a fair bit of fixed
> overhead, so with a whole lot of small strings, it will get more
> costly.
>
> >>> sys.version
> '3.4.0b2 (v3.4.0b2:ba32913eb13e, Jan 5 2014, 16:23:43) [MSC v.1600 32
> bit (Intel)]'
> >>> sys.getsizeof("asdf")
> 29
>
> "Stop words" tend to be short, rather than long, words, so I'd look at
> an average of 2-3 letters per word. Assuming they're separated by
> spaces or newlines, that means there'll be roughly a thousand of them
> in the file, for about 25K of overhead. A bit less if the words are
> longer, but still quite a bit. (Byte strings have slightly less
> overhead, 17 bytes apiece, but still quite a bit.)
>
> ChrisA
>>> sum([sys.getsizeof(c) for c in ['a']])
26
>>> sum([sys.getsizeof(c) for c in ['a', 'a€']])
68
>>> sum([sys.getsizeof(c) for c in ['a', 'a€', 'aa€']])
112
>>> sum([sys.getsizeof(c) for c in ['a', 'a€', 'aa€', 'aaa€']])
158
>>> sum([sys.getsizeof(c) for c in ['a', 'a€', 'aa€', 'aaa€', 'aaaaaaaaaaaaaaaaaaaa€']])
238
>>>
>>>
>>> sum([sys.getsizeof(c.encode('utf-32-be')) for c in ['a']])
21
>>> sum([sys.getsizeof(c.encode('utf-32-be')) for c in ['a', 'a€']])
46
>>> sum([sys.getsizeof(c.encode('utf-32-be')) for c in ['a', 'a€', 'aa€']])
75
>>> sum([sys.getsizeof(c.encode('utf-32-be')) for c in ['a', 'a€', 'aa€', 'aaa€']])
108
>>> sum([sys.getsizeof(c.encode('utf-32-be')) for c in ['a', 'a€', 'aa€', 'aaa€', 'aaaaaaaaaaaaaaaaaaaa€']])
209
>>>
>>>
>>> sum([sys.getsizeof(c) for c in ['a', 'a€', 'aa€']*3])
336
>>> sum([sys.getsizeof(c) for c in ['aa€aa€']*3])
150
>>> sum([sys.getsizeof(c.encode('utf-32')) for c in ['a', 'a€', 'aa€']*3])
261
>>> sum([sys.getsizeof(c.encode('utf-32')) for c in ['aa€aa€']*3])
135
>>>
jmf