Groups > comp.lang.python > #58824 > unrolled thread
| Started by | Roy Smith <roy@panix.com> |
|---|---|
| First post | 2013-11-08 12:48 -0500 |
| Last post | 2013-11-09 00:54 +0000 |
| Articles | 20 on this page of 22 — 6 participants |
chunking a long string? Roy Smith <roy@panix.com> - 2013-11-08 12:48 -0500
Re: chunking a long string? wxjmfauth@gmail.com - 2013-11-08 12:43 -0800
Re: chunking a long string? Chris Angelico <rosuav@gmail.com> - 2013-11-09 07:53 +1100
Re: chunking a long string? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-11-08 20:57 +0000
Re: chunking a long string? Tim Chase <python.list@tim.thechases.com> - 2013-11-08 15:04 -0600
Re: chunking a long string? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-11-08 21:06 +0000
Re: chunking a long string? Chris Angelico <rosuav@gmail.com> - 2013-11-09 08:04 +1100
Re: chunking a long string? Chris Angelico <rosuav@gmail.com> - 2013-11-09 08:17 +1100
Re: chunking a long string? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-09 00:46 +0000
Re: chunking a long string? wxjmfauth@gmail.com - 2013-11-09 00:14 -0800
Re: chunking a long string? Chris Angelico <rosuav@gmail.com> - 2013-11-09 19:26 +1100
Re: chunking a long string? Roy Smith <roy@panix.com> - 2013-11-09 09:37 -0500
Re: chunking a long string? Chris Angelico <rosuav@gmail.com> - 2013-11-10 02:02 +1100
Re: chunking a long string? Roy Smith <roy@panix.com> - 2013-11-09 10:21 -0500
Re: chunking a long string? Chris Angelico <rosuav@gmail.com> - 2013-11-10 02:30 +1100
Re: chunking a long string? Roy Smith <roy@panix.com> - 2013-11-09 10:35 -0500
Re: chunking a long string? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-09 15:37 +0000
Re: chunking a long string? Chris Angelico <rosuav@gmail.com> - 2013-11-10 09:14 +1100
Re: chunking a long string? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-10 06:39 +0000
Re: chunking a long string? Chris Angelico <rosuav@gmail.com> - 2013-11-10 19:46 +1100
Re: chunking a long string? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-11-09 10:13 +0000
Re: chunking a long string? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-09 00:54 +0000
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-11-08 12:48 -0500 |
| Subject | chunking a long string? |
| Message-ID | <mailman.2232.1383932895.18130.python-list@python.org> |
I have a long string (several Mbytes). I want to iterate over it in
manageable chunks (say, 1 kbyte each). For (a small) example, if I
started with "this is a very long string", and I wanted 10 character
chunks, I should get:
"this is a "
"very long "
"string"
This seems like something itertools would do, but I don't see anything.
Is there something, or do I just need to loop and slice (and worry about
getting all the edge conditions right) myself?
---
Roy Smith
roy@panix.com
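A minimal sketch of the loop-and-slice approach the question mentions, not an itertools recipe. The function name `chunks` and its `size` parameter are made up for illustration. Slicing past the end of a string simply returns what is left, so the short final chunk needs no special handling:

```python
def chunks(s, size):
    """Yield successive slices of s, each at most `size` items long."""
    if size <= 0:
        raise ValueError("size must be positive")
    # range() stops before len(s); slicing clamps the end index,
    # so the last chunk is just whatever remains.
    for start in range(0, len(s), size):
        yield s[start:start + size]


for chunk in chunks("this is a very long string", 10):
    print(repr(chunk))
```

Because it is a generator, only one chunk is materialised at a time, which is probably what you want for a string of several Mbytes.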
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-11-08 12:43 -0800 |
| Message-ID | <c1bb3377-4425-4707-9ae7-aa7251cebc75@googlegroups.com> |
| In reply to | #58824 |
"(say, 1 kbyte each)": one "kilo" of characters or bytes?
Glad to read some users are still living in an ascii world,
at the "Unicode time" where an encoded code point size may vary
between 1-4 bytes.
Oops, sorry, I'm wrong, it can be much more.
>>> sys.getsizeof('ab')
27
>>> sys.getsizeof('a\U0001d11e')
48
>>>
jmf
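The characters-versus-bytes distinction can be made concrete. This sketch assumes Python 3, where `str` is a sequence of code points and `len()` counts code points rather than encoded bytes; the variable names are illustrative only.

```python
s = 'a\U0001d11e'   # 'a' followed by MUSICAL SYMBOL G CLEF

chars = len(s)                              # code points in the str
utf8_bytes = len(s.encode('utf-8'))         # 1 byte for 'a' + 4 for the clef
utf32_bytes = len(s.encode('utf-32-le'))    # 4 bytes per code point, no BOM

print(chars, utf8_bytes, utf32_bytes)
```

Slicing a `str` into 1000-item chunks therefore gives 1000 characters per chunk; a 1 kbyte limit would have to be applied to the encoded bytes instead.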
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-11-09 07:53 +1100 |
| Message-ID | <mailman.2254.1383943995.18130.python-list@python.org> |
| In reply to | #58855 |
On Sat, Nov 9, 2013 at 7:43 AM, <wxjmfauth@gmail.com> wrote:
> Oops, sorry, I'm wrong, it can be much more.
>
>>>> sys.getsizeof('ab')
> 27
>>>> sys.getsizeof('a\U0001d11e')
> 48
>>>>
I know, overhead sucks doesn't it. Python is really abysmal at that;
look how big a single bit is:
>>> sys.getsizeof(1)
14
>>> sys.getsizeof(True)
14
On the flip side, Python gets really awesome at some other things.
Your operating system probably takes an entire CD to distribute, maybe
even a DVD, so that's either 700MB or 4.7GB, give or take. Look how
efficiently Python can represent it:
>>> sys.getsizeof(os)
36
Wow!
ChrisA
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2013-11-08 20:57 +0000 |
| Message-ID | <mailman.2255.1383944273.18130.python-list@python.org> |
| In reply to | #58855 |
On 08/11/2013 20:43, wxjmfauth@gmail.com wrote:
>
> "(say, 1 kbyte each)": one "kilo" of characters or bytes?
>
> Glad to read some users are still living in an ascii world,
> at the "Unicode time" where an encoded code point size may vary
> between 1-4 bytes.
>
>
> Oops, sorry, I'm wrong, it can be much more.
>
>>>> sys.getsizeof('ab')
> 27
>>>> sys.getsizeof('a\U0001d11e')
> 48
>>>>
>
> jmf
>
>
For any newcomers please ignore the rubbish that "Joseph McCarthy" Faust
comes up with from time to time. He's been asked repeatedly to come up
with evidence to support his claims regarding PEP 393, the Flexible
String Representation, but he never does, clearly because he can't.
Instead he provides micro benchmarks or meaningless numbers like those
above.
--
Python is the second best programming language in the world.
But the best has yet to be invented. Christian Tismer
Mark Lawrence
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2013-11-08 15:04 -0600 |
| Message-ID | <mailman.2257.1383944592.18130.python-list@python.org> |
| In reply to | #58855 |
On 2013-11-09 07:53, Chris Angelico wrote:
> On the flip side, Python gets really awesome at some other things.
> Your operating system probably takes an entire CD to distribute,
> maybe even a DVD, so that's either 700MB or 4.7GB, give or take.
> Look how efficiently Python can represent it:
>
> >>> sys.getsizeof(os)
> 36
Someone has been hanging out too much over on that thread about
compressing random data ;-)
-tkc
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2013-11-08 21:06 +0000 |
| Message-ID | <mailman.2258.1383944797.18130.python-list@python.org> |
| In reply to | #58855 |
On 08/11/2013 20:53, Chris Angelico wrote:
> On Sat, Nov 9, 2013 at 7:43 AM, <wxjmfauth@gmail.com> wrote:
>> Oops, sorry, I'm wrong, it can be much more.
>>
>>>>> sys.getsizeof('ab')
>> 27
>>>>> sys.getsizeof('a\U0001d11e')
>> 48
>>>>>
>
> I know, overhead sucks doesn't it. Python is really abysmal at that;
> look how big a single bit is:
>
>>>> sys.getsizeof(1)
> 14
>>>> sys.getsizeof(True)
> 14
>
> On the flip side, Python gets really awesome at some other things.
> Your operating system probably takes an entire CD to distribute, maybe
> even a DVD, so that's either 700MB or 4.7GB, give or take. Look how
> efficiently Python can represent it:
>
>>>> sys.getsizeof(os)
> 36
>
> Wow!
>
> ChrisA
>
Those figures look really good but I actually want figures that do
things my way, even if the figures aren't as good or even suck
completely. Can you help me with this even if I've already asked 42
times before but have always been given the same figures in response?
--
Python is the second best programming language in the world.
But the best has yet to be invented. Christian Tismer
Mark Lawrence
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-11-09 08:04 +1100 |
| Message-ID | <mailman.2259.1383945116.18130.python-list@python.org> |
| In reply to | #58855 |
On Sat, Nov 9, 2013 at 8:04 AM, Tim Chase <python.list@tim.thechases.com> wrote:
> On 2013-11-09 07:53, Chris Angelico wrote:
>> On the flip side, Python gets really awesome at some other things.
>> Your operating system probably takes an entire CD to distribute,
>> maybe even a DVD, so that's either 700MB or 4.7GB, give or take.
>> Look how efficiently Python can represent it:
>>
>> >>> sys.getsizeof(os)
>> 36
>
> Someone has been hanging out too much over on that thread about
> compressing random data ;-)
Hey, that's a bit unfair! Operating systems aren't full of random
data! At least... well, I can't speak for Windows here...
*dives for cover*
ChrisA
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-11-09 08:17 +1100 |
| Message-ID | <mailman.2260.1383945462.18130.python-list@python.org> |
| In reply to | #58855 |
On Sat, Nov 9, 2013 at 8:06 AM, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:
> Those figures look really good but I actually want figures that do things my
> way, even if the figures aren't as good or even suck completely. Can you
> help me with this even if I've already asked 42 times before but have always
> been given the same figures in response?
Yep! I can even offer you an hourglass figure. Here, watch this figure
of an hourglass while you wait for a different answer...
ChrisA
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-11-09 00:46 +0000 |
| Message-ID | <527d85e8$0$29983$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #58855 |
On Fri, 08 Nov 2013 12:43:43 -0800, wxjmfauth wrote:
> "(say, 1 kbyte each)": one "kilo" of characters or bytes?
>
> Glad to read some users are still living in an ascii world, at the
> "Unicode time" where an encoded code point size may vary between 1-4
> bytes.
>
>
> Oops, sorry, I'm wrong,
That part is true.
> it can be much more.
That part is false. You're measuring the overhead of the object
structure, not the per-character storage. This has been the case going
back since at least Python 2.2: strings are objects, and have overhead.
>>>> sys.getsizeof('ab')
> 27
27 bytes for two characters! Except it isn't, it's actually 25 bytes for
the object header and two bytes for the two characters.
>>>> sys.getsizeof('a\U0001d11e')
> 48
And here you have four bytes each for the two characters and a 40 byte
header. Observe:
py> c = '\U0001d11e'
py> len(c)
1
py> sys.getsizeof(2*c) - sys.getsizeof(c)
4
py> sys.getsizeof(1000*c) - sys.getsizeof(999*c)
4
How big is the object overhead on a (say) thousand character string? Just
one percent:
py> (sys.getsizeof(1000*c) - 4000)/4000
0.01
--
Steven
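The same measurement can be wrapped in a small helper. This is only a sketch, assuming CPython 3.3 or later (PEP 393); `sys.getsizeof` results are implementation details that differ between builds, and the names `marginal_bytes` and `overhead` are invented here.

```python
import sys


def marginal_bytes(ch, n=1000):
    """Bytes added by one more copy of ch to an n-character string."""
    return sys.getsizeof(ch * n) - sys.getsizeof(ch * (n - 1))


def overhead(ch, n=1000):
    """Size of an n-character string beyond its per-character storage."""
    return sys.getsizeof(ch * n) - n * marginal_bytes(ch, n)


# One sample character from each storage width: ASCII, Latin-1,
# the rest of the BMP, and an astral code point.
for ch in ('a', '\xe9', '\u20ac', '\U0001d11e'):
    print(ascii(ch), marginal_bytes(ch), overhead(ch))
```

The per-character cost depends on the widest code point in the string, while the fixed overhead stays small relative to any long string.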
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-11-09 00:14 -0800 |
| Message-ID | <39112f0b-f834-4e4a-86f2-ca19078e6de4@googlegroups.com> |
| In reply to | #58889 |
On Saturday, 9 November 2013 01:46:32 UTC+1, Steven D'Aprano wrote:
> On Fri, 08 Nov 2013 12:43:43 -0800, wxjmfauth wrote:
>
> > "(say, 1 kbyte each)": one "kilo" of characters or bytes?
> >
> > Glad to read some users are still living in an ascii world, at the
> > "Unicode time" where an encoded code point size may vary between 1-4
> > bytes.
> >
> > Oops, sorry, I'm wrong,
>
> That part is true.
>
> > it can be much more.
>
> That part is false. You're measuring the overhead of the object
> structure, not the per-character storage. This has been the case going
> back since at least Python 2.2: strings are objects, and have overhead.
>
> >>>> sys.getsizeof('ab')
> > 27
>
> 27 bytes for two characters! Except it isn't, it's actually 25 bytes for
> the object header and two bytes for the two characters.
>
> >>>> sys.getsizeof('a\U0001d11e')
> > 48
>
> And here you have four bytes each for the two characters and a 40 byte
> header. Observe:
>
> py> c = '\U0001d11e'
> py> len(c)
> 1
> py> sys.getsizeof(2*c) - sys.getsizeof(c)
> 4
> py> sys.getsizeof(1000*c) - sys.getsizeof(999*c)
> 4
>
> How big is the object overhead on a (say) thousand character string? Just
> one percent:
>
> py> (sys.getsizeof(1000*c) - 4000)/4000
> 0.01
--------
Sure, the new phone "xyz" does not cost 600$, it only costs
100$ more than the "abc" 500$ phone model.
If you wish to count the frequency of chars in a text
and store the results in a dict, {char: number_of_that_char, ...},
do not forget to save the key in utf-XXX, it saves memory.
After all, it is much more funny to waste its time in coding
and in attempting to handle unicode properly and to observe
this poor Python wasting its time in conversions.
>>> sys.getsizeof('a')
26
>>> sys.getsizeof('\U0001d11e')
44
>>> sys.getsizeof('\U0001d11e'.encode('utf-32'))
25
Hint: If you attempt to do the same exercise with
words in a "latin" text, never forget the average length
of a word is approximately 1000 chars.
jmf
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-11-09 19:26 +1100 |
| Message-ID | <mailman.2283.1383985583.18130.python-list@python.org> |
| In reply to | #58914 |
On Sat, Nov 9, 2013 at 7:14 PM, <wxjmfauth@gmail.com> wrote:
> If you wish to count the frequency of chars in a text
> and store the results in a dict, {char: number_of_that_char, ...},
> do not forget to save the key in utf-XXX, it saves memory.
Oh, if you're that concerned about memory usage of individual
characters, try storing them as integers:
>>> sys.getsizeof("a")
26
>>> sys.getsizeof("a".encode("utf-32"))
25
>>> sys.getsizeof("a".encode("utf-8"))
18
>>> sys.getsizeof(ord("a"))
14
I really don't see that UTF-32 is much advantage here. UTF-8 happens
to be, because I used an ASCII character, but the integer beats them
all, even for larger numbers:
>>> sys.getsizeof(ord("\U0001d11e"))
16
And there's even less difference on my Linux box, but of course, you
never compare against Linux because Python 3.2 wide builds don't suit
your numbers.
For longer strings, there's an even more efficient way to store them.
Just store the memory address - that's going to be 4 bytes or 8,
depending on whether it's a 32-bit or 64-bit build of Python. There's
a name for this method of comparing strings: interning. Some languages
do it automatically for all strings, others (like Python) only when
you ask for it. Suddenly it doesn't matter at all what the storage
format is - if the two strings are the same, their addresses are the
same, and conversely. That's how to make it cheap.
> Hint: If you attempt to do the same exercise with
> words in a "latin" text, never forget the average length
> of a word is approximately 1000 chars.
I think you're confusing length of word with value of picture.
ChrisA
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-11-09 09:37 -0500 |
| Message-ID | <roy-9831F9.09375409112013@news.panix.com> |
| In reply to | #58915 |
In article <mailman.2283.1383985583.18130.python-list@python.org>,
Chris Angelico <rosuav@gmail.com> wrote:
> Some languages [intern] automatically for all strings, others
> (like Python) only when you ask for it.
What does "only when you ask for it" mean?
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-11-10 02:02 +1100 |
| Message-ID | <mailman.2298.1384009376.18130.python-list@python.org> |
| In reply to | #58943 |
On Sun, Nov 10, 2013 at 1:37 AM, Roy Smith <roy@panix.com> wrote:
> In article <mailman.2283.1383985583.18130.python-list@python.org>,
> Chris Angelico <rosuav@gmail.com> wrote:
>
>> Some languages [intern] automatically for all strings, others
>> (like Python) only when you ask for it.
>
> What does "only when you ask for it" mean?
You can explicitly intern a Python string with the sys.intern()
function, which returns either the string itself or an
indistinguishable "interned" string. Two equal strings, when interned,
will return the same object:
>>> foo = "asdf"
>>> bar = "as"
>>> bar += "df"
>>> foo is bar
False
Note that the Python interpreter is free to answer True there, but
there's no mandate for it.
>>> foo = sys.intern(foo)
>>> bar = sys.intern(bar)
>>> foo is bar
True
Now it's mandated. The two strings must be the same object. Interning
in this way makes string equality come down to an 'is' check, which is
potentially a lot faster than actual string equality.
Some languages (eg Pike) do this automatically with all strings - the
construction of any string includes checking to see if it's a
duplicate of any other string. This adds cost to string manipulation
and speeds up string comparisons; since the engine knows that all
strings are interned, it can do the equivalent of an 'is' check for
any string equality.
So what I meant, in terms of storage/representation efficiency, is
that you can store duplicate strings very efficiently if you simply
increment the reference counts of the same few objects. Python won't
necessarily do that for you; check memory usage of something like
this:
strings = [open("some_big_file").read() for _ in range(10000)]
And compare against this:
strings = [sys.intern(open("some_big_file").read()) for _ in range(10000)]
In a language that guarantees string interning, the syntax of the
former would have the memory consumption of the latter. Whether that
memory saving and improved equality comparison is worth the effort of
dictionarification is one of those eternally-debatable points.
ChrisA
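A file-free sketch of the same comparison, counting distinct objects instead of measuring memory. It assumes Python 3; `make_text` is a hypothetical stand-in for reading a file, building a fresh string at run time. Only the `sys.intern()` result is guaranteed; how many distinct objects the plain list holds is up to the implementation.

```python
import sys


def make_text():
    # Stand-in for open("some_big_file").read(): build the string at run time
    return " ".join(["lorem", "ipsum", "dolor", "sit", "amet"] * 100)


plain = [make_text() for _ in range(1000)]
interned = [sys.intern(make_text()) for _ in range(1000)]

# Equal strings, but possibly many separate objects
print(len({id(s) for s in plain}))
# sys.intern() hands back one shared object for equal strings
print(len({id(s) for s in interned}))
```

Each temporary string passed to `sys.intern()` can be freed as soon as the shared copy is returned, so the interned list holds 1000 references to a single object.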
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-11-09 10:21 -0500 |
| Message-ID | <roy-4EDEFD.10212609112013@news.panix.com> |
| In reply to | #58944 |
In article <mailman.2298.1384009376.18130.python-list@python.org>,
Chris Angelico <rosuav@gmail.com> wrote:
> On Sun, Nov 10, 2013 at 1:37 AM, Roy Smith <roy@panix.com> wrote:
> > In article <mailman.2283.1383985583.18130.python-list@python.org>,
> > Chris Angelico <rosuav@gmail.com> wrote:
> >
> >> Some languages [intern] automatically for all strings, others
> >> (like Python) only when you ask for it.
> >
> > What does "only when you ask for it" mean?
>
> You can explicitly intern a Python string with the sys.intern()
> function
> [long, and good, explanation of interning]
But, you missed the point of my question. You said that Python does
this "only when you ask for it". That implies it never interns strings
if you don't ask for it, which is clearly not true:
$ python
Python 2.7.1 (r271:86832, Jul 31 2011, 19:30:53)
[...]
>>> x = "foo"
>>> y = "foo"
>>> x is y
True
I think what you're trying to say is that there are several possible
interning policies:
1) Strings are never interned
2) Strings are always interned
3) Strings are optionally interned, at the discretion of the
implementation
4) The user may force a specific string to be interned by explicitly
requesting it.
and that Pike implements #1, while Python implements #3 and #4.
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-11-10 02:30 +1100 |
| Message-ID | <mailman.2301.1384011026.18130.python-list@python.org> |
| In reply to | #58947 |
On Sun, Nov 10, 2013 at 2:21 AM, Roy Smith <roy@panix.com> wrote:
> But, you missed the point of my question. You said that Python does
> this "only when you ask for it". That implies it never interns strings
> if you don't ask for it, which is clearly not true:
>
> $ python
> Python 2.7.1 (r271:86832, Jul 31 2011, 19:30:53)
> [...]
>>>> x = "foo"
>>>> y = "foo"
>>>> x is y
> True
Ah! Yes, that's true; literals are interned - I forgot that. But
anything from an external source won't be, hence my example with
reading in the contents of a file.
> I think what you're trying to say is that there are several possible
> interning policies:
>
> 1) Strings are never interned
>
> 2) Strings are always interned
>
> 3) Strings are optionally interned, at the discretion of the
> implementation
>
> 4) The user may force a specific string to be interned by explicitly
> requesting it.
>
> and that Pike implements #1, while Python implements #3 and #4.
Pike implements #2, I presume that was a typo. And yes, the interning
of literals falls under #3, while sys.intern() gives #4. Use of #1
would be restricted to languages with mutable strings, I would expect,
for the same reason that Python tuples might be shared but lists won't
be.
ChrisA
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-11-09 10:35 -0500 |
| Message-ID | <roy-5051BF.10352509112013@news.panix.com> |
| In reply to | #58950 |
In article <mailman.2301.1384011026.18130.python-list@python.org>,
Chris Angelico <rosuav@gmail.com> wrote:
> Pike implements #2, I presume that was a typo.
Duh. Yes.
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-11-09 15:37 +0000 |
| Message-ID | <527e569f$0$29972$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #58943 |
On Sat, 09 Nov 2013 09:37:54 -0500, Roy Smith wrote:
> In article <mailman.2283.1383985583.18130.python-list@python.org>,
> Chris Angelico <rosuav@gmail.com> wrote:
>
>> Some languages [intern] automatically for all strings, others (like
>> Python) only when you ask for it.
>
> What does "only when you ask for it" mean?
In Python 2:
help(intern)
In Python 3:
import sys
help(sys.intern)
for more info.
I think that Chris is wrong about Python "only" interning
strings if you explicitly ask for it. I recall that Python will (may?)
automatically intern strings which look like identifiers (e.g. "spam" but
not "Hello World" or "123abc"). Let's see now:
# using Python 3.1 on Linux
py> s = "spam"
py> t = "spam"
py> s is t
True
but:
py> z = ''.join(["sp", "am"])
py> z is s
False
However:
py> u = "123abc"
py> v = "123abc"
py> u is v
True
Hmmm, obviously the rules are a tad more complicated than I thought...
in any case, you shouldn't rely on automatic interning since it is an
implementation dependent optimization and will probably change without
notice.
--
Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-11-10 09:14 +1100 |
| Message-ID | <mailman.2310.1384035279.18130.python-list@python.org> |
| In reply to | #58953 |
On Sun, Nov 10, 2013 at 2:37 AM, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> I think that Chris is wrong about Python "only" interning
> strings if you explicitly ask for it. I recall that Python will (may?)
> automatically intern strings which look like identifiers (e.g. "spam" but
> not "Hello World" or "123abc").
I'm pretty sure it's simply that literals are interned, or at least
shared across a module (and the interactive interpreter "counts" as a
module). And it might still only be ones which look like identifiers,
because:
>>> foo = "lorem ipsum dolor sit amet"
>>> bar = "lorem ipsum dolor sit amet"
>>> foo is bar
False
My "only" was false because of the sharing/interning of (some)
literals, which I'd forgotten about; however, there's still the
distinction that I was trying to draw, that in Python _some strings_
are interned (a feature you can explicitly request), rather than _all
strings_ being interned. And as is typical of python-list, it's this
extremely minor point that became the new course of the thread - my
main point was not whether all, some, or no strings get interned, but
that string interning makes the storage space of duplicate strings
immaterial :)
ChrisA
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-11-10 06:39 +0000 |
| Message-ID | <527f2a2a$0$29972$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #58969 |
On Sun, 10 Nov 2013 09:14:28 +1100, Chris Angelico wrote:
> And
> as is typical of python-list, it's this extremely minor point that
> became the new course of the thread -
You say that as if it were a bad thing :-P
> my main point was not whether all,
> some, or no strings get interned, but that string interning makes the
> storage space of duplicate strings immaterial :)
True. It's not just a memory saver[1], but a time saver too. Using Python
3.3:
py> from timeit import Timer
py> t1 = Timer('s == t', setup='s = "a b"*10000; t = "a b"*10000')
py> t2 = Timer('s == t',
... setup='from sys import intern; s = intern("a b"*10000); '
... 't = intern("a b"*10000)')
py> min(t1.repeat(number=100000))
7.651959054172039
py> min(t2.repeat(number=100000))
0.00881262868642807
String equality does a short-cut of checking for identity; if the strings
are interned, they will be identical.
[1] Assuming that you actually do have duplicate strings. If every string
is unique, interning them potentially wastes memory.
--
Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-11-10 19:46 +1100 |
| Message-ID | <mailman.2324.1384073196.18130.python-list@python.org> |
| In reply to | #58986 |
On Sun, Nov 10, 2013 at 5:39 PM, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> On Sun, 10 Nov 2013 09:14:28 +1100, Chris Angelico wrote:
>
>> And
>> as is typical of python-list, it's this extremely minor point that
>> became the new course of the thread -
>
> You say that as if it were a bad thing :-P
More a curiosity than a bad thing.
ChrisA