
chunking a long string?

Started by	Roy Smith <roy@panix.com>
First post	2013-11-08 12:48 -0500
Last post	2013-11-09 00:54 +0000
Articles	20 on this page of 22 — 6 participants


  chunking a long string? Roy Smith <roy@panix.com> - 2013-11-08 12:48 -0500
    Re: chunking a long string? wxjmfauth@gmail.com - 2013-11-08 12:43 -0800
      Re: chunking a long string? Chris Angelico <rosuav@gmail.com> - 2013-11-09 07:53 +1100
      Re: chunking a long string? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-11-08 20:57 +0000
      Re: chunking a long string? Tim Chase <python.list@tim.thechases.com> - 2013-11-08 15:04 -0600
      Re: chunking a long string? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-11-08 21:06 +0000
      Re: chunking a long string? Chris Angelico <rosuav@gmail.com> - 2013-11-09 08:04 +1100
      Re: chunking a long string? Chris Angelico <rosuav@gmail.com> - 2013-11-09 08:17 +1100
      Re: chunking a long string? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-09 00:46 +0000
        Re: chunking a long string? wxjmfauth@gmail.com - 2013-11-09 00:14 -0800
          Re: chunking a long string? Chris Angelico <rosuav@gmail.com> - 2013-11-09 19:26 +1100
            Re: chunking a long string? Roy Smith <roy@panix.com> - 2013-11-09 09:37 -0500
              Re: chunking a long string? Chris Angelico <rosuav@gmail.com> - 2013-11-10 02:02 +1100
                Re: chunking a long string? Roy Smith <roy@panix.com> - 2013-11-09 10:21 -0500
                  Re: chunking a long string? Chris Angelico <rosuav@gmail.com> - 2013-11-10 02:30 +1100
                    Re: chunking a long string? Roy Smith <roy@panix.com> - 2013-11-09 10:35 -0500
              Re: chunking a long string? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-09 15:37 +0000
                Re: chunking a long string? Chris Angelico <rosuav@gmail.com> - 2013-11-10 09:14 +1100
                  Re: chunking a long string? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-10 06:39 +0000
                    Re: chunking a long string? Chris Angelico <rosuav@gmail.com> - 2013-11-10 19:46 +1100
          Re: chunking a long string? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-11-09 10:13 +0000
    Re: chunking a long string? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-09 00:54 +0000


#58824 — chunking a long string?

From	Roy Smith <roy@panix.com>
Date	2013-11-08 12:48 -0500
Subject	chunking a long string?
Message-ID	<mailman.2232.1383932895.18130.python-list@python.org>

I have a long string (several Mbytes).  I want to iterate over it in manageable chunks (say, 1 kbyte each).  For (a small) example, if I started with "this is a very long string", and I wanted 10 character chunks, I should get:

"this is a "
"very long "
"string"

This seems like something itertools would do, but I don't see anything.  Is there something, or do I just need to loop and slice (and worry about getting all the edge conditions right) myself?

---
Roy Smith
roy@panix.com
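
A minimal sketch of the loop-and-slice approach the question mentions, assuming the chunks are counted in characters rather than encoded bytes. The helper name `chunks` is hypothetical and does not come from the thread or from itertools:

```python
def chunks(text, size):
    """Yield successive slices of `text`, each at most `size` characters long."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(text), size):
        # Slicing past the end of a string is safe in Python:
        # the final chunk simply comes out shorter.
        yield text[start:start + size]


for piece in chunks("this is a very long string", 10):
    print(repr(piece))
```

Because slicing clamps at the end of the string, the only edge condition left to check is a non-positive chunk size.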


#58855

From	wxjmfauth@gmail.com
Date	2013-11-08 12:43 -0800
Message-ID	<c1bb3377-4425-4707-9ae7-aa7251cebc75@googlegroups.com>
In reply to	#58824

"(say, 1 kbyte each)": one "kilo" of characters or bytes?

Glad to read some users are still living in an ascii world,
at the "Unicode time" where an encoded code point size may vary
between 1-4 bytes.


Oops, sorry, I'm wrong, it can be much more.

>>> sys.getsizeof('ab')
27
>>> sys.getsizeof('a\U0001d11e')
48
>>>

jmf


#58857

From	Chris Angelico <rosuav@gmail.com>
Date	2013-11-09 07:53 +1100
Message-ID	<mailman.2254.1383943995.18130.python-list@python.org>
In reply to	#58855

On Sat, Nov 9, 2013 at 7:43 AM,  <wxjmfauth@gmail.com> wrote:
> Oops, sorry, I'm wrong, it can be much more.
>
>>>> sys.getsizeof('ab')
> 27
>>>> sys.getsizeof('a\U0001d11e')
> 48
>>>>

I know, overhead sucks doesn't it. Python is really abysmal at that;
look how big a single bit is:

>>> sys.getsizeof(1)
14
>>> sys.getsizeof(True)
14

On the flip side, Python gets really awesome at some other things.
Your operating system probably takes an entire CD to distribute, maybe
even a DVD, so that's either 700MB or 4.7GB, give or take. Look how
efficiently Python can represent it:

>>> sys.getsizeof(os)
36

Wow!

ChrisA


#58858

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2013-11-08 20:57 +0000
Message-ID	<mailman.2255.1383944273.18130.python-list@python.org>
In reply to	#58855

On 08/11/2013 20:43, wxjmfauth@gmail.com wrote:
>
> "(say, 1 kbyte each)": one "kilo" of characters or bytes?
>
> Glad to read some users are still living in an ascii world,
> at the "Unicode time" where an encoded code point size may vary
> between 1-4 bytes.
>
>
> Oops, sorry, I'm wrong, it can be much more.
>
>>>> sys.getsizeof('ab')
> 27
>>>> sys.getsizeof('a\U0001d11e')
> 48
>>>>
>
> jmf
>
>

For any newcomers please ignore the rubbish that "Joseph McCarthy" Faust 
comes up with from time to time.  He's been asked repeatedly to come up 
with evidence to support his claims regarding PEP 393, the Flexible 
String Representation, but he never does, clearly because he can't. 
Instead he provides micro benchmarks or meaningless numbers like those 
above.

-- 
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence


#58861

From	Tim Chase <python.list@tim.thechases.com>
Date	2013-11-08 15:04 -0600
Message-ID	<mailman.2257.1383944592.18130.python-list@python.org>
In reply to	#58855

On 2013-11-09 07:53, Chris Angelico wrote:
> On the flip side, Python gets really awesome at some other things.
> Your operating system probably takes an entire CD to distribute,
> maybe even a DVD, so that's either 700MB or 4.7GB, give or take.
> Look how efficiently Python can represent it:
> 
> >>> sys.getsizeof(os)  
> 36

Someone has been hanging out too much over on that thread about
compressing random data ;-)

-tkc


#58863

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2013-11-08 21:06 +0000
Message-ID	<mailman.2258.1383944797.18130.python-list@python.org>
In reply to	#58855

On 08/11/2013 20:53, Chris Angelico wrote:
> On Sat, Nov 9, 2013 at 7:43 AM,  <wxjmfauth@gmail.com> wrote:
>> Oops, sorry, I'm wrong, it can be much more.
>>
>>>>> sys.getsizeof('ab')
>> 27
>>>>> sys.getsizeof('a\U0001d11e')
>> 48
>>>>>
>
> I know, overhead sucks doesn't it. Python is really abysmal at that;
> look how big a single bit is:
>
>>>> sys.getsizeof(1)
> 14
>>>> sys.getsizeof(True)
> 14
>
> On the flip side, Python gets really awesome at some other things.
> Your operating system probably takes an entire CD to distribute, maybe
> even a DVD, so that's either 700MB or 4.7GB, give or take. Look how
> efficiently Python can represent it:
>
>>>> sys.getsizeof(os)
> 36
>
> Wow!
>
> ChrisA
>

Those figures look really good but I actually want figures that do 
things my way, even if the figures aren't as good or even suck 
completely.  Can you help me with this even if I've already asked 42 
times before but have always been given the same figures in response?

-- 
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence


#58865

From	Chris Angelico <rosuav@gmail.com>
Date	2013-11-09 08:04 +1100
Message-ID	<mailman.2259.1383945116.18130.python-list@python.org>
In reply to	#58855

On Sat, Nov 9, 2013 at 8:04 AM, Tim Chase <python.list@tim.thechases.com> wrote:
> On 2013-11-09 07:53, Chris Angelico wrote:
>> On the flip side, Python gets really awesome at some other things.
>> Your operating system probably takes an entire CD to distribute,
>> maybe even a DVD, so that's either 700MB or 4.7GB, give or take.
>> Look how efficiently Python can represent it:
>>
>> >>> sys.getsizeof(os)
>> 36
>
> Someone has been hanging out too much over on that thread about
> compressing random data ;-)

Hey, that's a bit unfair! Operating systems aren't full of random data!

At least... well, I can't speak for Windows here...

*dives for cover*

ChrisA


#58867

From	Chris Angelico <rosuav@gmail.com>
Date	2013-11-09 08:17 +1100
Message-ID	<mailman.2260.1383945462.18130.python-list@python.org>
In reply to	#58855

On Sat, Nov 9, 2013 at 8:06 AM, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:
> Those figures look really good but I actually want figures that do things my
> way, even if the figures aren't as good or even suck completely.  Can you
> help me with this even if I've already asked 42 times before but have always
> been given the same figures in response?

Yep! I can even offer you an hourglass figure. Here, watch this figure
of an hourglass while you wait for a different answer...

ChrisA


#58889

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-11-09 00:46 +0000
Message-ID	<527d85e8$0$29983$c3e8da3$5496439d@news.astraweb.com>
In reply to	#58855

On Fri, 08 Nov 2013 12:43:43 -0800, wxjmfauth wrote:

> "(say, 1 kbyte each)": one "kilo" of characters or bytes?
> 
> Glad to read some users are still living in an ascii world, at the
> "Unicode time" where an encoded code point size may vary between 1-4
> bytes.
> 
> 
> Oops, sorry, I'm wrong, 

That part is true.

> it can be much more.

That part is false. You're measuring the overhead of the object 
structure, not the per-character storage. This has been the case going 
back to at least Python 2.2: strings are objects, and have overhead.

>>>> sys.getsizeof('ab')
> 27

27 bytes for two characters! Except it isn't, it's actually 25 bytes for 
the object header and two bytes for the two characters.

>>>> sys.getsizeof('a\U0001d11e')
> 48

And here you have four bytes each for the two characters and a 40 byte 
header. Observe:

py> c = '\U0001d11e'
py> len(c)
1
py> sys.getsizeof(2*c) - sys.getsizeof(c)
4
py> sys.getsizeof(1000*c) - sys.getsizeof(999*c)
4

How big is the object overhead on a (say) thousand character string? Just 
one percent:

py> (sys.getsizeof(1000*c) - 4000)/4000
0.01

-- 
Steven
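
A small sketch of the differencing technique used in the post above: subtracting the sizes of two strings that differ by one character cancels the fixed object header and leaves only the per-character cost. The helper name `bytes_per_char` is hypothetical, and the exact figures assume CPython 3.3 or later, where PEP 393 picks 1, 2, or 4 bytes per character depending on the widest code point in the string:

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate per-character storage for a string made only of `ch`.

    The header size is the same for both strings, so it drops out
    of the difference.
    """
    return sys.getsizeof(ch * n) - sys.getsizeof(ch * (n - 1))


# An ASCII letter, a BMP character (euro sign), and a non-BMP character.
for ch in ("a", "\u20ac", "\U0001d11e"):
    print(repr(ch), bytes_per_char(ch))
```

Other Python implementations may report different numbers, since `sys.getsizeof` results are implementation-specific.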


#58914

From	wxjmfauth@gmail.com
Date	2013-11-09 00:14 -0800
Message-ID	<39112f0b-f834-4e4a-86f2-ca19078e6de4@googlegroups.com>
In reply to	#58889

On Saturday, 9 November 2013 01:46:32 UTC+1, Steven D'Aprano wrote:
> On Fri, 08 Nov 2013 12:43:43 -0800, wxjmfauth wrote:
>
> > "(say, 1 kbyte each)": one "kilo" of characters or bytes?
> >
> > Glad to read some users are still living in an ascii world, at the
> > "Unicode time" where an encoded code point size may vary between 1-4
> > bytes.
> >
> >
> > Oops, sorry, I'm wrong,
>
> That part is true.
>
> > it can be much more.
>
> That part is false. You're measuring the overhead of the object
> structure, not the per-character storage. This has been the case going
> back to at least Python 2.2: strings are objects, and have overhead.
>
> >>>> sys.getsizeof('ab')
> > 27
>
> 27 bytes for two characters! Except it isn't, it's actually 25 bytes for
> the object header and two bytes for the two characters.
>
> >>>> sys.getsizeof('a\U0001d11e')
> > 48
>
> And here you have four bytes each for the two characters and a 40 byte
> header. Observe:
>
> py> c = '\U0001d11e'
> py> len(c)
> 1
> py> sys.getsizeof(2*c) - sys.getsizeof(c)
> 4
> py> sys.getsizeof(1000*c) - sys.getsizeof(999*c)
> 4
>
> How big is the object overhead on a (say) thousand character string? Just
> one percent:
>
> py> (sys.getsizeof(1000*c) - 4000)/4000
> 0.01


--------

Sure, the new phone "xyz" does not cost 600$, it only costs
100$ more than the "abc" 500$ phone model.


If you wish to count the frequency of chars in a text
and store the results in a dict, {char: number_of_that_char, ...},
do not forget to save the key in utf-XXX, it saves memory.

After all, it is much more fun to waste one's time coding
and attempting to handle unicode properly, and to watch
this poor Python waste its time in conversions.

>>> sys.getsizeof('a')
26
>>> sys.getsizeof('\U0001d11e')
44
>>> sys.getsizeof('\U0001d11e'.encode('utf-32'))
25


Hint: If you attempt to do the same exercise with
words in a "latin" text, never forget that the average length
of a word is approximately 1000 chars.

jmf


#58915

From	Chris Angelico <rosuav@gmail.com>
Date	2013-11-09 19:26 +1100
Message-ID	<mailman.2283.1383985583.18130.python-list@python.org>
In reply to	#58914

On Sat, Nov 9, 2013 at 7:14 PM,  <wxjmfauth@gmail.com> wrote:
> If you wish to count the frequency of chars in a text
> and store the results in a dict, {char: number_of_that_char, ...},
> do not forget to save the key in utf-XXX, it saves memory.

Oh, if you're that concerned about memory usage of individual
characters, try storing them as integers:

>>> sys.getsizeof("a")
26
>>> sys.getsizeof("a".encode("utf-32"))
25
>>> sys.getsizeof("a".encode("utf-8"))
18
>>> sys.getsizeof(ord("a"))
14

I really don't see that UTF-32 is much advantage here. UTF-8 happens
to be, because I used an ASCII character, but the integer beats them
all, even for larger numbers:
>>> sys.getsizeof(ord("\U0001d11e"))
16

And there's even less difference on my Linux box, but of course, you
never compare against Linux because Python 3.2 wide builds don't suit
your numbers.

For longer strings, there's an even more efficient way to store them.
Just store the memory address - that's going to be 4 bytes or 8,
depending on whether it's a 32-bit or 64-bit build of Python. There's
a name for this method of comparing strings: interning. Some languages
do it automatically for all strings, others (like Python) only when
you ask for it. Suddenly it doesn't matter at all what the storage
format is - if the two strings are the same, their addresses are the
same, and conversely. That's how to make it cheap.

> Hint: If you attempt to do the same exercise with
> words in a "latin" text, never forget that the average length
> of a word is approximately 1000 chars.

I think you're confusing length of word with value of picture.

ChrisA


#58943

From	Roy Smith <roy@panix.com>
Date	2013-11-09 09:37 -0500
Message-ID	<roy-9831F9.09375409112013@news.panix.com>
In reply to	#58915

In article <mailman.2283.1383985583.18130.python-list@python.org>,
 Chris Angelico <rosuav@gmail.com> wrote:

> Some languages [intern] automatically for all strings, others
> (like Python) only when you ask for it.

What does "only when you ask for it" mean?


#58944

From	Chris Angelico <rosuav@gmail.com>
Date	2013-11-10 02:02 +1100
Message-ID	<mailman.2298.1384009376.18130.python-list@python.org>
In reply to	#58943

On Sun, Nov 10, 2013 at 1:37 AM, Roy Smith <roy@panix.com> wrote:
> In article <mailman.2283.1383985583.18130.python-list@python.org>,
>  Chris Angelico <rosuav@gmail.com> wrote:
>
>> Some languages [intern] automatically for all strings, others
>> (like Python) only when you ask for it.
>
> What does "only when you ask for it" mean?

You can explicitly intern a Python string with the sys.intern()
function, which returns either the string itself or an
indistinguishable "interned" string. Two equal strings, when interned,
will return the same object:

>>> foo = "asdf"
>>> bar = "as"
>>> bar += "df"
>>> foo is bar
False

Note that the Python interpreter is free to answer True there, but
there's no mandate for it.

>>> foo = sys.intern(foo)
>>> bar = sys.intern(bar)
>>> foo is bar
True

Now it's mandated. The two strings must be the same object. Interning
in this way makes string equality come down to an 'is' check, which is
potentially a lot faster than actual string equality.

Some languages (eg Pike) do this automatically with all strings - the
construction of any string includes checking to see if it's a
duplicate of any other string. This adds cost to string manipulation
and speeds up string comparisons; since the engine knows that all
strings are interned, it can do the equivalent of an 'is' check for
any string equality.

So what I meant, in terms of storage/representation efficiency, is
that you can store duplicate strings very efficiently if you simply
increment the reference counts of the same few objects. Python won't
necessarily do that for you; check memory usage of something like
this:

strings = [open("some_big_file").read() for _ in range(10000)]

And compare against this:

strings = [sys.intern(open("some_big_file").read()) for _ in range(10000)]

In a language that guarantees string interning, the syntax of the
former would have the memory consumption of the latter. Whether that
memory saving and improved equality comparison is worth the effort of
dictionarification is one of those eternally-debatable points.

ChrisA
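
A self-contained sketch of the memory point made above, assuming Python 3 (where the function lives in `sys`). The `load` helper is a hypothetical stand-in for re-reading the same file contents; it builds its result at run time so that the compiler cannot share a literal:

```python
import sys

# Two equal strings built at run time.
foo = "".join(["as", "df"])
bar = "".join(["as", "df"])
print(foo == bar)        # equal contents

foo = sys.intern(foo)
bar = sys.intern(bar)
print(foo is bar)        # equal interned strings are the same object


def load():
    """Hypothetical stand-in for open("some_big_file").read()."""
    return "".join(["some ", "big ", "payload"])


plain = [load() for _ in range(1000)]
shared = [sys.intern(load()) for _ in range(1000)]

# Count distinct objects held by each list.
print(len({id(s) for s in plain}), len({id(s) for s in shared}))
```

The interned list holds many references to one object, so its storage cost is one copy of the string plus the list of pointers.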


#58947

From	Roy Smith <roy@panix.com>
Date	2013-11-09 10:21 -0500
Message-ID	<roy-4EDEFD.10212609112013@news.panix.com>
In reply to	#58944

In article <mailman.2298.1384009376.18130.python-list@python.org>,
 Chris Angelico <rosuav@gmail.com> wrote:

> On Sun, Nov 10, 2013 at 1:37 AM, Roy Smith <roy@panix.com> wrote:
> > In article <mailman.2283.1383985583.18130.python-list@python.org>,
> >  Chris Angelico <rosuav@gmail.com> wrote:
> >
> >> Some languages [intern] automatically for all strings, others
> >> (like Python) only when you ask for it.
> >
> > What does "only when you ask for it" mean?
> 
> You can explicitly intern a Python string with the sys.intern()
> function
> [long, and good, explanation of interning]

But, you missed the point of my question.  You said that Python does 
this "only when you ask for it".  That implies it never interns strings 
if you don't ask for it, which is clearly not true:

$ python
Python 2.7.1 (r271:86832, Jul 31 2011, 19:30:53) 
[...]
>>> x = "foo"
>>> y = "foo"
>>> x is y
True

I think what you're trying to say is that there are several possible 
interning policies:

1) Strings are never interned

2) Strings are always interned

3) Strings are optionally interned, at the discretion of the 
implementation

4) The user may force a specific string to be interned by explicitly 
requesting it.

and that Pike implements #1, while Python implements #3 and #4.


#58950

From	Chris Angelico <rosuav@gmail.com>
Date	2013-11-10 02:30 +1100
Message-ID	<mailman.2301.1384011026.18130.python-list@python.org>
In reply to	#58947

On Sun, Nov 10, 2013 at 2:21 AM, Roy Smith <roy@panix.com> wrote:
> But, you missed the point of my question.  You said that Python does
> this "only when you ask for it".  That implies it never interns strings
> if you don't ask for it, which is clearly not true:
>
> $ python
> Python 2.7.1 (r271:86832, Jul 31 2011, 19:30:53)
> [...]
>>>> x = "foo"
>>>> y = "foo"
>>>> x is y
> True

Ah! Yes, that's true; literals are interned - I forgot that. But
anything from an external source won't be, hence my example with
reading in the contents of a file.

> I think what you're trying to say is that there are several possible
> interning policies:
>
> 1) Strings are never interned
>
> 2) Strings are always interned
>
> 3) Strings are optionally interned, at the discretion of the
> implementation
>
> 4) The user may force a specific string to be interned by explicitly
> requesting it.
>
> and that Pike implements #1, while Python implements #3 and #4.

Pike implements #2, I presume that was a typo. And yes, the interning
of literals falls under #3, while sys.intern() gives #4. Use of #1
would be restricted to languages with mutable strings, I would expect,
for the same reason that Python tuples might be shared but lists won't
be.

ChrisA


#58952

From	Roy Smith <roy@panix.com>
Date	2013-11-09 10:35 -0500
Message-ID	<roy-5051BF.10352509112013@news.panix.com>
In reply to	#58950

In article <mailman.2301.1384011026.18130.python-list@python.org>,
 Chris Angelico <rosuav@gmail.com> wrote:

> Pike implements #2, I presume that was a typo.

Duh.  Yes.


#58953

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-11-09 15:37 +0000
Message-ID	<527e569f$0$29972$c3e8da3$5496439d@news.astraweb.com>
In reply to	#58943

On Sat, 09 Nov 2013 09:37:54 -0500, Roy Smith wrote:

> In article <mailman.2283.1383985583.18130.python-list@python.org>,
>  Chris Angelico <rosuav@gmail.com> wrote:
> 
>> Some languages [intern] automatically for all strings, others (like
>> Python) only when you ask for it.
> 
> What does "only when you ask for it" mean?

In Python 2:

help(intern)

In Python 3:

import sys
help(sys.intern)

for more info. I think that Chris is wrong about Python "only" interning 
strings if you explicitly ask for it. I recall that Python will (may?) 
automatically intern strings which look like identifiers (e.g. "spam" but 
not "Hello World" or "123abc"). Let's see now:

# using Python 3.1 on Linux

py> s = "spam"
py> t = "spam"
py> s is t
True

but:

py> z = ''.join(["sp", "am"])
py> z is s
False

However:

py> u = "123abc"
py> v = "123abc"
py> u is v
True

Hmmm, obviously the rules are a tad more complicated than I thought... in 
any case, you shouldn't rely on automatic interning since it is an 
implementation dependent optimization and will probably change without 
notice.

-- 
Steven
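
A short sketch of the practical consequence, assuming nothing beyond the standard language: compare string contents with `==`, and treat the result of `is` on strings as an implementation detail unless both sides were explicitly interned.

```python
import sys

s = "spam"
z = "".join(["sp", "am"])   # same contents, built at run time

print(s == z)   # compares contents; this is the portable test
print(s is z)   # compares identity; result depends on the implementation

# Identity becomes reliable only once both strings are interned.
print(sys.intern(s) is sys.intern(z))
```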


#58969

From	Chris Angelico <rosuav@gmail.com>
Date	2013-11-10 09:14 +1100
Message-ID	<mailman.2310.1384035279.18130.python-list@python.org>
In reply to	#58953

On Sun, Nov 10, 2013 at 2:37 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> I think that Chris is wrong about Python "only" interning
> strings if you explicitly ask for it. I recall that Python will (may?)
> automatically intern strings which look like identifiers (e.g. "spam" but
> not "Hello World" or "123abc").

I'm pretty sure it's simply that literals are interned, or at least
shared across a module (and the interactive interpreter "counts" as a
module). And it might still only be ones which look like identifiers,
because:

>>> foo = "lorem ipsum dolor sit amet"
>>> bar = "lorem ipsum dolor sit amet"
>>> foo is bar
False

My "only" was false because of the sharing/interning of (some)
literals, which I'd forgotten about; however, there's still the
distinction that I was trying to draw, that in Python _some strings_
are interned (a feature you can explicitly request), rather than _all
strings_ being interned. And as is typical of python-list, it's this
extremely minor point that became the new course of the thread - my
main point was not whether all, some, or no strings get interned, but
that string interning makes the storage space of duplicate strings
immaterial :)

ChrisA


#58986

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-11-10 06:39 +0000
Message-ID	<527f2a2a$0$29972$c3e8da3$5496439d@news.astraweb.com>
In reply to	#58969

On Sun, 10 Nov 2013 09:14:28 +1100, Chris Angelico wrote:

> And
> as is typical of python-list, it's this extremely minor point that
> became the new course of the thread - 

You say that as if it were a bad thing :-P

> my main point was not whether all,
> some, or no strings get interned, but that string interning makes the
> storage space of duplicate strings immaterial :)

True. It's not just a memory saver[1], but a time saver too. Using Python 
3.3:

py> from timeit import Timer
py> t1 = Timer('s == t', setup='s = "a b"*10000; t = "a b"*10000')
py> t2 = Timer('s == t', 
...     setup='from sys import intern; s = intern("a b"*10000); '
...           't = intern("a b"*10000)')
py> min(t1.repeat(number=100000))
7.651959054172039
py> min(t2.repeat(number=100000))
0.00881262868642807

String equality does a short-cut of checking for identity; if the strings 
are interned, they will be identical.

[1] Assuming that you actually do have duplicate strings. If every string 
is unique, interning them potentially wastes memory.

-- 
Steven
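
A scaled-down sketch of the timing comparison above, runnable as a script rather than at the prompt. The variable names and the smaller `number` are choices made here to keep the run short; absolute timings will differ from the figures quoted in the post:

```python
import sys
import timeit

base = "a b"

# Two equal but distinct 30000-character strings.
s, t = base * 10000, base * 10000

# The same contents, interned, so both names refer to one object.
si, ti = sys.intern(base * 10000), sys.intern(base * 10000)

plain = min(timeit.repeat("s == t", globals={"s": s, "t": t},
                          number=10000, repeat=3))
interned = min(timeit.repeat("s == t", globals={"s": si, "t": ti},
                             number=10000, repeat=3))

print("distinct objects:", plain)
print("interned objects:", interned)
```

With distinct objects, equality has to compare the full contents; with interned strings the identity short-cut answers immediately.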


#58989

From	Chris Angelico <rosuav@gmail.com>
Date	2013-11-10 19:46 +1100
Message-ID	<mailman.2324.1384073196.18130.python-list@python.org>
In reply to	#58986

On Sun, Nov 10, 2013 at 5:39 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Sun, 10 Nov 2013 09:14:28 +1100, Chris Angelico wrote:
>
>> And
>> as is typical of python-list, it's this extremely minor point that
>> became the new course of the thread -
>
> You say that as if it were a bad thing :-P

More a curiosity than a bad thing.

ChrisA
