Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #43306 > unrolled thread

performance of script to write very long lines of random chars

Started bygry <georgeryoung@gmail.com>
First post2013-04-10 18:21 -0700
Last post2013-04-11 10:47 +0100
Articles 16 — 7 participants

Back to article view | Back to comp.lang.python


Contents

  performance of script to write very long lines of random chars gry <georgeryoung@gmail.com> - 2013-04-10 18:21 -0700
    Re: performance of script to write very long lines of random chars Chris Angelico <rosuav@gmail.com> - 2013-04-11 11:45 +1000
      Re: performance of script to write very long lines of random chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-04-11 05:33 +0000
        Re: performance of script to write very long lines of random chars Chris Angelico <rosuav@gmail.com> - 2013-04-11 15:53 +1000
    Re: performance of script to write very long lines of random chars Michael Torrie <torriem@gmail.com> - 2013-04-10 19:52 -0600
      Re: performance of script to write very long lines of random chars gry <georgeryoung@gmail.com> - 2013-04-10 19:40 -0700
        Re: performance of script to write very long lines of random chars Chris Angelico <rosuav@gmail.com> - 2013-04-11 13:14 +1000
    Re: performance of script to write very long lines of random chars MRAB <python@mrabarnett.plus.com> - 2013-04-11 04:09 +0100
    Re: performance of script to write very long lines of random chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-04-11 07:47 +0000
      Re: performance of script to write very long lines of random chars Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-04-11 10:47 +0100
        Re: performance of script to write very long lines of random chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-04-11 10:50 +0000
          Re: performance of script to write very long lines of random chars Robert Kern <robert.kern@gmail.com> - 2013-04-11 16:49 +0530
          Re: performance of script to write very long lines of random chars Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-04-11 13:05 +0100
          Re: performance of script to write very long lines of random chars Robert Kern <robert.kern@gmail.com> - 2013-04-11 19:06 +0530
          Re: performance of script to write very long lines of random chars Chris Angelico <rosuav@gmail.com> - 2013-04-11 23:56 +1000
    Re: performance of script to write very long lines of random chars Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-04-11 10:47 +0100

#43306 — performance of script to write very long lines of random chars

Fromgry <georgeryoung@gmail.com>
Date2013-04-10 18:21 -0700
Subjectperformance of script to write very long lines of random chars
Message-ID<24dc619b-7abd-4be3-aa92-f858eb4ab85f@n4g2000yqj.googlegroups.com>
Dear pythonistas,
   I am writing a tiny utility to produce a file consisting of a
specified number of lines of a given length of random ascii
characters.  I am hoping to find a more time and memory efficient way,
that is still fairly simple clear, and _pythonic_.

I would like to have something that I can use at both extremes of
data:

   32M chars per line * 100 lines
or
   5 chars per line * 1e8 lines.

E.g., the output of bigrand.py for 10 characters, 2 lines might be:

gw2+M/5t&.
S[[db/l?Vx

I'm using python 2.7.0 on linux.  I need to use only out-of-the box
modules, since this has to work on a bunch of different computers.
At this point I'm especially concerned with the case of a few very
long lines, since that seems to use a lot of memory, and take a long
time.
Characters are a slight subset of the printable ascii's, specified in
the examples below.  My first naive try was:

from sys import stdout
import random
nchars = 32000000
rows = 10
avail_chrs =
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&
\'()*+,-./:;<=>?@[\\]^_`{}'

def make_varchar(nchars):
    return (''.join([random.choice(avail_chrs) for i in
range(nchars)]))

for l in range(rows):
    stdout.write(make_varchar(nchars))
    stdout.write('\n')

This version used around 1.2GB resident/1.2GB virtual of memory for
3min 38sec.


My second try uses much less RAM, but more CPU time, and seems rather,
umm, un-pythonic (the array module always seems a little un
pythonic...)

from sys import stdout
from array import array
import random
nchars = 32000000
rows = 10
avail_chrs =
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&
\'()*+,-./:;<=>?@[\\]^_`{}'
a = array('c', 'X' * nchars)

for l in range(rows):
    for i in xrange(nchars):
        a[i] = random.choice(avail_chrs)
    a.tofile(stdout)
    stdout.write('\n')

This version using array took 4 min, 29 sec, using 34MB resident/110
virtual. So, much smaller than the first attempt, but a bit slower.
Can someone suggest a better code?  And help me understand the
performance issues here?

-- George

[toc] | [next] | [standalone]


#43308

FromChris Angelico <rosuav@gmail.com>
Date2013-04-11 11:45 +1000
Message-ID<mailman.434.1365644739.3114.python-list@python.org>
In reply to#43306
On Thu, Apr 11, 2013 at 11:21 AM, gry <georgeryoung@gmail.com> wrote:
> avail_chrs =
> '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&
> \'()*+,-./:;<=>?@[\\]^_`{}'

Is this exact set of characters a requirement? For instance, would it
be acceptable to instead use this set of characters?

avail_chrs = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'

Your alphabet has 92 characters, this one only 64... the advantage is
that it's really easy to work with a 64-character set; in fact, for
this specific set, it's the standard called Base 64, and Python
already has a module for working with it. All you need is a random
stream of eight-bit characters, which can be provided by os.urandom().

So here's a much simpler version of your program, following the
cut-down character set I offer:

import os
import base64
nchars = 32000000
rows = 10
# Note: If nchars is one higher than a multiple of 4 (eg 5, 9, 101),
# the lines will be one character short (4, 8, 100).
nchars = nchars * 3 // 4
for l in range(rows):
    print(base64.b64encode(os.urandom(nchars)).strip(b'='))


If you can guarantee that your nchars will always be a multiple of 4,
you can drop the .strip() call.

This is going to be *immensely* faster than calling random.choice()
for every character, but it depends on a working os.urandom (it'll
raise NotImplementedError if there's no suitable source). I know it's
available on OS/2, Windows, and Linux, but don't have others handy to
test. If by "a bunch of different computers" you mean exclusively
Linux computers, this should be fine.

ChrisA

[toc] | [prev] | [next] | [standalone]


#43316

FromSteven D'Aprano <steve+comp.lang.python@pearwood.info>
Date2013-04-11 05:33 +0000
Message-ID<51664b43$0$29977$c3e8da3$5496439d@news.astraweb.com>
In reply to#43308
On Thu, 11 Apr 2013 11:45:31 +1000, Chris Angelico wrote:

> On Thu, Apr 11, 2013 at 11:21 AM, gry <georgeryoung@gmail.com> wrote:
>> avail_chrs =
>> '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&
>> \'()*+,-./:;<=>?@[\\]^_`{}'
> 
> Is this exact set of characters a requirement? For instance, would it be
> acceptable to instead use this set of characters?
> 
> avail_chrs =
> 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'
> 
> Your alphabet has 92 characters, this one only 64... the advantage is
> that it's really easy to work with a 64-character set; in fact, for this
> specific set, it's the standard called Base 64, and Python already has a
> module for working with it. All you need is a random stream of eight-bit
> characters, which can be provided by os.urandom().

I was originally going to write that using the base64 module would 
introduce bias into the random strings, but after a little investigation, 
I don't think it does.

Or at least, if it does, it's a fairly subtle bias, and not detectable by 
the simple technique I used: inspect the mean, and the mean deviation 
from the mean.


from os import urandom
from base64 import b64encode

data = urandom(1000000)
m = sum(data)/len(data)
md = sum(abs(v - m) for v in data)/len(data)
print("Mean and mean deviation of urandom:", m, md)

encoded = b64encode(data).strip(b'=')
chars = (b'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdef'
         b'ghijklmnopqrstuvwxyz0123456789+/')
values = [chars.index(v) for v in encoded]
m = sum(values)/len(values)
md = sum(abs(v - m) for v in values)/len(values)
print("Mean and mean deviation of encoded data:", m, md)



When I run this, it prints:


Mean and mean deviation of urandom: 127.451652 63.95331188965717
Mean and mean deviation of encoded data: 31.477027511486245 
15.991177272527072

I would expect 127 64 and 32 16, so we're pretty close. That's not to say 
that there aren't any other biases or correlations in the encoded data, 
but after a simplistic test, it looks okay to me.



-- 
Steven

[toc] | [prev] | [next] | [standalone]


#43320

FromChris Angelico <rosuav@gmail.com>
Date2013-04-11 15:53 +1000
Message-ID<mailman.440.1365659618.3114.python-list@python.org>
In reply to#43316
On Thu, Apr 11, 2013 at 3:33 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> I was originally going to write that using the base64 module would
> introduce bias into the random strings, but after a little investigation,
> I don't think it does.

Assuming that os.urandom() returns bytes with perfectly fair
distribution (exactly equal chance of any value 00-FF - it probably
does, or close to it), and assuming that you work with exact multiples
of 3 bytes and 4 output characters, base64 will give you perfectly
fair distribution of result characters. You take three bytes (24 bits)
and turn them into four characters (6 bits per character, = 24 bits).
You might see some bias if you use less than a full set of four output
characters, though; I haven't dug into the details to check that.

ChrisA

[toc] | [prev] | [next] | [standalone]


#43309

FromMichael Torrie <torriem@gmail.com>
Date2013-04-10 19:52 -0600
Message-ID<mailman.435.1365645159.3114.python-list@python.org>
In reply to#43306
On 04/10/2013 07:21 PM, gry wrote:
> from sys import stdout
> from array import array
> import random
> nchars = 32000000
> rows = 10
> avail_chrs =
> '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&
> \'()*+,-./:;<=>?@[\\]^_`{}'
> a = array('c', 'X' * nchars)
> 
> for l in range(rows):
>     for i in xrange(nchars):
>         a[i] = random.choice(avail_chrs)
>     a.tofile(stdout)
>     stdout.write('\n')
> 
> This version using array took 4 min, 29 sec, using 34MB resident/110
> virtual. So, much smaller than the first attempt, but a bit slower.
> Can someone suggest a better code?  And help me understand the
> performance issues here?

Why are you using an array?  Why not just rely on the OS to buffer the
output.  Just write your characters straight to stdout instead of
placing them in an array.

At that point I believe this program will be as fast as is possible in
Python.

[toc] | [prev] | [next] | [standalone]


#43310

Fromgry <georgeryoung@gmail.com>
Date2013-04-10 19:40 -0700
Message-ID<15b233c5-f961-479e-aec1-fe1467bd99d3@e8g2000yqg.googlegroups.com>
In reply to#43309
On Apr 10, 9:52 pm, Michael Torrie <torr...@gmail.com> wrote:
> On 04/10/2013 07:21 PM, gry wrote:
>
>
>
>
>
>
>
>
>
> > from sys import stdout
> > from array import array
> > import random
> > nchars = 32000000
> > rows = 10
> > avail_chrs =
> > '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&
> > \'()*+,-./:;<=>?@[\\]^_`{}'
> > a = array('c', 'X' * nchars)
>
> > for l in range(rows):
> >     for i in xrange(nchars):
> >         a[i] = random.choice(avail_chrs)
> >     a.tofile(stdout)
> >     stdout.write('\n')
>
> > This version using array took 4 min, 29 sec, using 34MB resident/110
> > virtual. So, much smaller than the first attempt, but a bit slower.
> > Can someone suggest a better code?  And help me understand the
> > performance issues here?
>
> Why are you using an array?  Why not just rely on the OS to buffer the
> output.  Just write your characters straight to stdout instead of
> placing them in an array.
>
> At that point I believe this program will be as fast as is possible in
> Python.

Appealing idea, but it's slower than the array solution: 5min 13
secs.  vs 4min 30sec for the array:

for l in range(rows):
    for i in xrange(nchars):
        stdout.write(random.choice(avail_chrs))
    stdout.write('\n')


os.urandom does look promising -- I have to have full control over the
charset, but urandom is very fast at generating big random strings...
stay tuned...

[toc] | [prev] | [next] | [standalone]


#43313

FromChris Angelico <rosuav@gmail.com>
Date2013-04-11 13:14 +1000
Message-ID<mailman.437.1365650080.3114.python-list@python.org>
In reply to#43310
On Thu, Apr 11, 2013 at 12:40 PM, gry <georgeryoung@gmail.com> wrote:
> Appealing idea, but it's slower than the array solution: 5min 13
> secs.  vs 4min 30sec for the array:
>
> for l in range(rows):
>     for i in xrange(nchars):
>         stdout.write(random.choice(avail_chrs))
>     stdout.write('\n')
>
>
> os.urandom does look promising -- I have to have full control over the
> charset, but urandom is very fast at generating big random strings...
> stay tuned...

Without actually profiling it, my first guess would be that calling
random.choice() for every character is an optimization target. (NOTE:
Do profile it, if the urandom method isn't sufficient straight-off.)
You may want to consider, for instance, generating larger random
numbers and doing some kind of translation on them - which is
fundamentally what the urandom/b64encode method is doing.

ChrisA

[toc] | [prev] | [next] | [standalone]


#43312

FromMRAB <python@mrabarnett.plus.com>
Date2013-04-11 04:09 +0100
Message-ID<mailman.436.1365649974.3114.python-list@python.org>
In reply to#43306
On 11/04/2013 02:21, gry wrote:
> Dear pythonistas,
>     I am writing a tiny utility to produce a file consisting of a
> specified number of lines of a given length of random ascii
> characters.  I am hoping to find a more time and memory efficient way,
> that is still fairly simple clear, and _pythonic_.
>
> I would like to have something that I can use at both extremes of
> data:
>
>     32M chars per line * 100 lines
> or
>     5 chars per line * 1e8 lines.
>
> E.g., the output of bigrand.py for 10 characters, 2 lines might be:
>
> gw2+M/5t&.
> S[[db/l?Vx
>
> I'm using python 2.7.0 on linux.  I need to use only out-of-the box
> modules, since this has to work on a bunch of different computers.
> At this point I'm especially concerned with the case of a few very
> long lines, since that seems to use a lot of memory, and take a long
> time.
> Characters are a slight subset of the printable ascii's, specified in
> the examples below.  My first naive try was:
>
> from sys import stdout
> import random
> nchars = 32000000
> rows = 10
> avail_chrs =
> '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&
> \'()*+,-./:;<=>?@[\\]^_`{}'
>
> def make_varchar(nchars):
>      return (''.join([random.choice(avail_chrs) for i in
> range(nchars)]))
>
> for l in range(rows):
>      stdout.write(make_varchar(nchars))
>      stdout.write('\n')
>
> This version used around 1.2GB resident/1.2GB virtual of memory for
> 3min 38sec.
>
>
> My second try uses much less RAM, but more CPU time, and seems rather,
> umm, un-pythonic (the array module always seems a little un
> pythonic...)
>
> from sys import stdout
> from array import array
> import random
> nchars = 32000000
> rows = 10
> avail_chrs =
> '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&
> \'()*+,-./:;<=>?@[\\]^_`{}'
> a = array('c', 'X' * nchars)
>
> for l in range(rows):
>      for i in xrange(nchars):
>          a[i] = random.choice(avail_chrs)
>      a.tofile(stdout)
>      stdout.write('\n')
>
> This version using array took 4 min, 29 sec, using 34MB resident/110
> virtual. So, much smaller than the first attempt, but a bit slower.
> Can someone suggest a better code?  And help me understand the
> performance issues here?
>
Names in the global scope are stored in a dict, but local to a function
are stored in slots and can be accessed more quickly.

'avail_chrs' and 'random.choice' are referred to many times, so making
'avail_chrs' local and making a local reference to 'random.choice' will
help.


from sys import stdout
from array import array
import random

def generate():
     avail_chrs = 
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}'
     rnd = random.choice

     for l in range(rows):
         stdout.write(''.join([rnd(avail_chrs) for i in xrange(nchars)]))
         stdout.write('\n')

nchars = 32000000
rows = 10
generate()

[toc] | [prev] | [next] | [standalone]


#43334

FromSteven D'Aprano <steve+comp.lang.python@pearwood.info>
Date2013-04-11 07:47 +0000
Message-ID<51666aae$0$29977$c3e8da3$5496439d@news.astraweb.com>
In reply to#43306
On Wed, 10 Apr 2013 18:21:51 -0700, gry wrote:

> Dear pythonistas,
>    I am writing a tiny utility to produce a file consisting of a
> specified number of lines of a given length of random ascii characters. 
> I am hoping to find a more time and memory efficient way, that is still
> fairly simple clear, and _pythonic_.

Here's another option: use string.translate to map random bytes to the 
printable ASCII bytes.


import string

all_bytes = ''.join(map(chr, range(256)))
printable = all_bytes[33:127]
n = 127 + len(printable)
extras = all_bytes[127:n]
_deletions = all_bytes[:33] + all_bytes[n:]
_table = string.maketrans(extras, printable)


def random_chars(length, size=1000):
    # Make local bindings for speed.
    import string, os
    table, deletions = _table, _deletions
    # Generate random strings.
    buffer = []
    while True:
        while len(buffer) < length:
            bytes = string.translate(os.urandom(size), table, deletions)
            buffer.extend(bytes)
        yield ''.join(buffer[:length])
        buffer = buffer[length:]



Then use it like this:

# I want seven lines of twelve char strings.
make = random_chars(12)
for i in range(7):
    print next(make)



One thing to be aware of: urandom may run out of entropy, and then it 
will slow down a lot. If you don't care about cryptographic randomness, 
you could use this instead:

import random
def myrandom(n):
    return [random.randint(0, 255) for i in xrange(n)]


although that will actually be slower unless urandom has run out of 
entropy.



-- 
Steven

[toc] | [prev] | [next] | [standalone]


#43344

FromOscar Benjamin <oscar.j.benjamin@gmail.com>
Date2013-04-11 10:47 +0100
Message-ID<mailman.453.1365673692.3114.python-list@python.org>
In reply to#43334
On 11 April 2013 08:47, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:

> One thing to be aware of: urandom may run out of entropy, and then it
> will slow down a lot. If you don't care about cryptographic randomness,
> you could use this instead:

Reading this I'm realising that I don't really know what os.urandom
is. How exactly is it generating random numbers and what do you mean
by it running out of entropy?


Oscar

[toc] | [prev] | [next] | [standalone]


#43349

FromSteven D'Aprano <steve+comp.lang.python@pearwood.info>
Date2013-04-11 10:50 +0000
Message-ID<51669576$0$29977$c3e8da3$5496439d@news.astraweb.com>
In reply to#43344
On Thu, 11 Apr 2013 10:47:43 +0100, Oscar Benjamin wrote:

> On 11 April 2013 08:47, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
> 
>> One thing to be aware of: urandom may run out of entropy, and then it
>> will slow down a lot. If you don't care about cryptographic randomness,
>> you could use this instead:
> 
> Reading this I'm realising that I don't really know what os.urandom is.
> How exactly is it generating random numbers and what do you mean by it
> running out of entropy?

I am not an expert, but here goes...

Some (most?) modern operating systems provide a cryptographically strong 
source of non-deterministic randomness. The non-deterministic part comes 
from external "stuff", which is called "entropy". Typical sources of 
entropy include network events, user key-presses, moving the mouse, and 
(presumably in machines with special hardware), even thermal noise in 
electrical components.

If the OS hasn't collected enough entropy, urandom can block until it 
has. This can be a problem, e.g. I've experienced issues where scripts 
relying indirectly on urandom that run at system reboot can block for 
minutes at a time, waiting for entropy. If those scripts run before the 
networking software, and before any users log in and start running apps, 
the script can block forever waiting for entropy that never arrives.

(Where "forever" == "70 minute boot times".)

Entropy is used and discarded, so urandom needs the OS to continually 
replenish the amount of entropy. Under normal circumstances, this it 
does, but if you grab lots of urandom output on a system which is 
otherwise quiet and not doing anything, it could run out.

If I've got any of this wrong, corrections will be gratefully acceptable.



-- 
Steven

[toc] | [prev] | [next] | [standalone]


#43353

FromRobert Kern <robert.kern@gmail.com>
Date2013-04-11 16:49 +0530
Message-ID<mailman.461.1365679158.3114.python-list@python.org>
In reply to#43349
On 2013-04-11 16:20, Steven D'Aprano wrote:
> On Thu, 11 Apr 2013 10:47:43 +0100, Oscar Benjamin wrote:
>
>> On 11 April 2013 08:47, Steven D'Aprano
>> <steve+comp.lang.python@pearwood.info> wrote:
>>
>>> One thing to be aware of: urandom may run out of entropy, and then it
>>> will slow down a lot. If you don't care about cryptographic randomness,
>>> you could use this instead:
>>
>> Reading this I'm realising that I don't really know what os.urandom is.
>> How exactly is it generating random numbers and what do you mean by it
>> running out of entropy?
>
> I am not an expert, but here goes...
>
> Some (most?) modern operating systems provide a cryptographically strong
> source of non-deterministic randomness. The non-deterministic part comes
> from external "stuff", which is called "entropy". Typical sources of
> entropy include network events, user key-presses, moving the mouse, and
> (presumably in machines with special hardware), even thermal noise in
> electrical components.
>
> If the OS hasn't collected enough entropy, urandom can block until it
> has. This can be a problem, e.g. I've experienced issues where scripts
> relying indirectly on urandom that run at system reboot can block for
> minutes at a time, waiting for entropy. If those scripts run before the
> networking software, and before any users log in and start running apps,
> the script can block forever waiting for entropy that never arrives.
>
> (Where "forever" == "70 minute boot times".)
>
> Entropy is used and discarded, so urandom needs the OS to continually
> replenish the amount of entropy. Under normal circumstances, this it
> does, but if you grab lots of urandom output on a system which is
> otherwise quiet and not doing anything, it could run out.
>
> If I've got any of this wrong, corrections will be gratefully acceptable.

Just one important thing: os.urandom() does not block to wait for more entropy. 
Only os.random() does.

http://en.wikipedia.org/wiki//dev/random

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
  that is made terrible by our own mad attempt to interpret it as though it had
  an underlying truth."
   -- Umberto Eco

[toc] | [prev] | [next] | [standalone]


#43355

FromOscar Benjamin <oscar.j.benjamin@gmail.com>
Date2013-04-11 13:05 +0100
Message-ID<mailman.463.1365681967.3114.python-list@python.org>
In reply to#43349
On 11 April 2013 11:50, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Thu, 11 Apr 2013 10:47:43 +0100, Oscar Benjamin wrote:
>
>> On 11 April 2013 08:47, Steven D'Aprano
>> <steve+comp.lang.python@pearwood.info> wrote:
>>
>>> One thing to be aware of: urandom may run out of entropy, and then it
>>> will slow down a lot. If you don't care about cryptographic randomness,
>>> you could use this instead:
>>
>> Reading this I'm realising that I don't really know what os.urandom is.
>> How exactly is it generating random numbers and what do you mean by it
>> running out of entropy?
>
> Some (most?) modern operating systems provide a cryptographically strong
> source of non-deterministic randomness. The non-deterministic part comes
> from external "stuff", which is called "entropy". Typical sources of
> entropy include network events, user key-presses, moving the mouse, and
> (presumably in machines with special hardware), even thermal noise in
> electrical components.

> Entropy is used and discarded, so urandom needs the OS to continually
> replenish the amount of entropy. Under normal circumstances, this it
> does, but if you grab lots of urandom output on a system which is
> otherwise quiet and not doing anything, it could run out.

Okay, so I understand what entropy is in the thermodynamic sense and
also in the mathematical (Shannon) sense but I'm still confused about
what it means that the OS is somehow storing entropy. Do you mean that
it is always maintaining a buffer of what it considers to be random
bytes that it slowly builds up from noise that is made accessible to
the OS from the hardware?


Oscar

[toc] | [prev] | [next] | [standalone]


#43367

FromRobert Kern <robert.kern@gmail.com>
Date2013-04-11 19:06 +0530
Message-ID<mailman.472.1365687426.3114.python-list@python.org>
In reply to#43349
On 2013-04-11 17:35, Oscar Benjamin wrote:
> On 11 April 2013 11:50, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> On Thu, 11 Apr 2013 10:47:43 +0100, Oscar Benjamin wrote:
>>
>>> On 11 April 2013 08:47, Steven D'Aprano
>>> <steve+comp.lang.python@pearwood.info> wrote:
>>>
>>>> One thing to be aware of: urandom may run out of entropy, and then it
>>>> will slow down a lot. If you don't care about cryptographic randomness,
>>>> you could use this instead:
>>>
>>> Reading this I'm realising that I don't really know what os.urandom is.
>>> How exactly is it generating random numbers and what do you mean by it
>>> running out of entropy?
>>
>> Some (most?) modern operating systems provide a cryptographically strong
>> source of non-deterministic randomness. The non-deterministic part comes
>> from external "stuff", which is called "entropy". Typical sources of
>> entropy include network events, user key-presses, moving the mouse, and
>> (presumably in machines with special hardware), even thermal noise in
>> electrical components.
>
>> Entropy is used and discarded, so urandom needs the OS to continually
>> replenish the amount of entropy. Under normal circumstances, this it
>> does, but if you grab lots of urandom output on a system which is
>> otherwise quiet and not doing anything, it could run out.
>
> Okay, so I understand what entropy is in the thermodynamic sense and
> also in the mathematical (Shannon) sense but I'm still confused about
> what it means that the OS is somehow storing entropy. Do you mean that
> it is always maintaining a buffer of what it considers to be random
> bytes that it slowly builds up from noise that is made accessible to
> the OS from the hardware?

Yes.

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
  that is made terrible by our own mad attempt to interpret it as though it had
  an underlying truth."
   -- Umberto Eco

[toc] | [prev] | [next] | [standalone]


#43370

FromChris Angelico <rosuav@gmail.com>
Date2013-04-11 23:56 +1000
Message-ID<mailman.474.1365688589.3114.python-list@python.org>
In reply to#43349
On Thu, Apr 11, 2013 at 10:05 PM, Oscar Benjamin
<oscar.j.benjamin@gmail.com> wrote:
> On 11 April 2013 11:50, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> Some (most?) modern operating systems provide a cryptographically strong
>> source of non-deterministic randomness. The non-deterministic part comes
>> from external "stuff", which is called "entropy". Typical sources of
>> entropy include network events, user key-presses, moving the mouse, and
>> (presumably in machines with special hardware), even thermal noise in
>> electrical components.
>
>> Entropy is used and discarded, so urandom needs the OS to continually
>> replenish the amount of entropy. Under normal circumstances, this it
>> does, but if you grab lots of urandom output on a system which is
>> otherwise quiet and not doing anything, it could run out.
>
> Okay, so I understand what entropy is in the thermodynamic sense and
> also in the mathematical (Shannon) sense but I'm still confused about
> what it means that the OS is somehow storing entropy. Do you mean that
> it is always maintaining a buffer of what it considers to be random
> bytes that it slowly builds up from noise that is made accessible to
> the OS from the hardware?

Correct. And Steven's right about most of what he says (modulo the
urandom vs random distinction, as Robert Kern pointed out - urandom
won't block, but it's not guaranteed to be cryptographically secure);
I'll just add that one of the best sources of entropy is a solid
cylinder, rotated at high velocity in a sealed container filled with a
fluid, and entropy is found in the eddies. Many computers have a
device of this nature - the solid cylinder is thin and flat and
referred to as a "disk", the fluid it's in is air, and the sealed
container is your hard disk drive.

The details will vary, but broadly speaking, the /dev/random driver
(or its equivalent) maintains an ever-increasing buffer of entropic
bits, accumulated as they arrive from the various sources, and often
saved to disk on shutdown to permit faster boot (which helps to avoid
the problem Steven described of 70-minute boot times - on an all-SSD
computer with no human being attached, entropy really can be very hard
to obtain); whenever a program asks for bytes from it, it delivers
them and removes that much "recorded entropy" from its buffer. For
many purposes, it's sufficient to take 4 or 8 bytes of /dev/random
entropy and use that to seed a PRNG, but if you're using /dev/urandom
and it's not a critical server, I wouldn't worry too much about
drawing too much off it. (On a web server that's constantly serving
HTTPS requests, for instance, I'd be cautious about reading too much
from /dev/urandom as it might cause the web server to block waiting
for /dev/random. Might kill your TPS for a while.)

ChrisA

[toc] | [prev] | [next] | [standalone]


#43343

FromOscar Benjamin <oscar.j.benjamin@gmail.com>
Date2013-04-11 10:47 +0100
Message-ID<mailman.451.1365673654.3114.python-list@python.org>
In reply to#43306
On 11 April 2013 02:21, gry <georgeryoung@gmail.com> wrote:
> Dear pythonistas,
>    I am writing a tiny utility to produce a file consisting of a
> specified number of lines of a given length of random ascii
> characters.  I am hoping to find a more time and memory efficient way,
> that is still fairly simple clear, and _pythonic_.
>
> I would like to have something that I can use at both extremes of
> data:
>
>    32M chars per line * 100 lines
> or
>    5 chars per line * 1e8 lines.

I would definitely use numpy for this. The script below seems to be
io-bound on my machine:

#!/usr/bin/env python

from sys import stdout
import numpy as np
from numpy import random

CHARS = (
    '0123456789'
    'abcdefghijklmnopqrstuvwxyz'
    'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    '!#$%& \'()*+,-./:;<=>?@[\\]^_`{}'
    )

ARRAY_CHARS = np.frombuffer(CHARS, np.uint8)
NCHARS = len(CHARS)

CHUNK_SIZE = 4096
NCOLS = 32000000
NROWS = 10

def chunk_sizes(total, chunk_size):
    numchunks, remainder = divmod(total, chunk_size)
    for n in range(numchunks):
        yield chunk_size
    if remainder:
        yield remainder

def chunks():
    bytes_per_line = NCOLS + 1
    total_bytes = bytes_per_line * NROWS
    newline_index = NCOLS
    newline = ord('\n')
    for size in chunk_sizes(total_bytes, CHUNK_SIZE):
        chars = ARRAY_CHARS[random.randint(0, NCHARS, size)]
        chars[newline_index::bytes_per_line] = newline
        newline_index = (newline_index - CHUNK_SIZE) % bytes_per_line
        yield chars

for chunk in chunks():
    chunk.tofile(stdout)

[toc] | [prev] | [standalone]


Back to top | Article view | comp.lang.python


csiph-web