Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #43306 > unrolled thread
| Started by | gry <georgeryoung@gmail.com> |
|---|---|
| First post | 2013-04-10 18:21 -0700 |
| Last post | 2013-04-11 10:47 +0100 |
| Articles | 16 — 7 participants |
Back to article view | Back to comp.lang.python
performance of script to write very long lines of random chars gry <georgeryoung@gmail.com> - 2013-04-10 18:21 -0700
Re: performance of script to write very long lines of random chars Chris Angelico <rosuav@gmail.com> - 2013-04-11 11:45 +1000
Re: performance of script to write very long lines of random chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-04-11 05:33 +0000
Re: performance of script to write very long lines of random chars Chris Angelico <rosuav@gmail.com> - 2013-04-11 15:53 +1000
Re: performance of script to write very long lines of random chars Michael Torrie <torriem@gmail.com> - 2013-04-10 19:52 -0600
Re: performance of script to write very long lines of random chars gry <georgeryoung@gmail.com> - 2013-04-10 19:40 -0700
Re: performance of script to write very long lines of random chars Chris Angelico <rosuav@gmail.com> - 2013-04-11 13:14 +1000
Re: performance of script to write very long lines of random chars MRAB <python@mrabarnett.plus.com> - 2013-04-11 04:09 +0100
Re: performance of script to write very long lines of random chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-04-11 07:47 +0000
Re: performance of script to write very long lines of random chars Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-04-11 10:47 +0100
Re: performance of script to write very long lines of random chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-04-11 10:50 +0000
Re: performance of script to write very long lines of random chars Robert Kern <robert.kern@gmail.com> - 2013-04-11 16:49 +0530
Re: performance of script to write very long lines of random chars Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-04-11 13:05 +0100
Re: performance of script to write very long lines of random chars Robert Kern <robert.kern@gmail.com> - 2013-04-11 19:06 +0530
Re: performance of script to write very long lines of random chars Chris Angelico <rosuav@gmail.com> - 2013-04-11 23:56 +1000
Re: performance of script to write very long lines of random chars Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-04-11 10:47 +0100
| From | gry <georgeryoung@gmail.com> |
|---|---|
| Date | 2013-04-10 18:21 -0700 |
| Subject | performance of script to write very long lines of random chars |
| Message-ID | <24dc619b-7abd-4be3-aa92-f858eb4ab85f@n4g2000yqj.googlegroups.com> |
Dear pythonistas,
I am writing a tiny utility to produce a file consisting of a
specified number of lines of a given length of random ascii
characters. I am hoping to find a more time and memory efficient way,
that is still fairly simple clear, and _pythonic_.
I would like to have something that I can use at both extremes of
data:
32M chars per line * 100 lines
or
5 chars per line * 1e8 lines.
E.g., the output of bigrand.py for 10 characters, 2 lines might be:
gw2+M/5t&.
S[[db/l?Vx
I'm using python 2.7.0 on linux. I need to use only out-of-the box
modules, since this has to work on a bunch of different computers.
At this point I'm especially concerned with the case of a few very
long lines, since that seems to use a lot of memory, and take a long
time.
Characters are a slight subset of the printable ascii's, specified in
the examples below. My first naive try was:
from sys import stdout
import random
nchars = 32000000
rows = 10
avail_chrs =
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&
\'()*+,-./:;<=>?@[\\]^_`{}'
def make_varchar(nchars):
return (''.join([random.choice(avail_chrs) for i in
range(nchars)]))
for l in range(rows):
stdout.write(make_varchar(nchars))
stdout.write('\n')
This version used around 1.2GB resident/1.2GB virtual of memory for
3min 38sec.
My second try uses much less RAM, but more CPU time, and seems rather,
umm, un-pythonic (the array module always seems a little un
pythonic...)
from sys import stdout
from array import array
import random
nchars = 32000000
rows = 10
avail_chrs =
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&
\'()*+,-./:;<=>?@[\\]^_`{}'
a = array('c', 'X' * nchars)
for l in range(rows):
for i in xrange(nchars):
a[i] = random.choice(avail_chrs)
a.tofile(stdout)
stdout.write('\n')
This version using array took 4 min, 29 sec, using 34MB resident/110
virtual. So, much smaller than the first attempt, but a bit slower.
Can someone suggest a better code? And help me understand the
performance issues here?
-- George
[toc] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-04-11 11:45 +1000 |
| Message-ID | <mailman.434.1365644739.3114.python-list@python.org> |
| In reply to | #43306 |
On Thu, Apr 11, 2013 at 11:21 AM, gry <georgeryoung@gmail.com> wrote:
> avail_chrs =
> '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&
> \'()*+,-./:;<=>?@[\\]^_`{}'
Is this exact set of characters a requirement? For instance, would it
be acceptable to instead use this set of characters?
avail_chrs = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'
Your alphabet has 92 characters, this one only 64... the advantage is
that it's really easy to work with a 64-character set; in fact, for
this specific set, it's the standard called Base 64, and Python
already has a module for working with it. All you need is a random
stream of eight-bit characters, which can be provided by os.urandom().
So here's a much simpler version of your program, following the
cut-down character set I offer:
import os
import base64
nchars = 32000000
rows = 10
# Note: If nchars is one higher than a multiple of 4 (eg 5, 9, 101),
# the lines will be one character short (4, 8, 100).
nchars = nchars * 3 // 4
for l in range(rows):
print(base64.b64encode(os.urandom(nchars)).strip(b'='))
If you can guarantee that your nchars will always be a multiple of 4,
you can drop the .strip() call.
This is going to be *immensely* faster than calling random.choice()
for every character, but it depends on a working os.urandom (it'll
raise NotImplementedError if there's no suitable source). I know it's
available on OS/2, Windows, and Linux, but don't have others handy to
test. If by "a bunch of different computers" you mean exclusively
Linux computers, this should be fine.
ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-04-11 05:33 +0000 |
| Message-ID | <51664b43$0$29977$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #43308 |
On Thu, 11 Apr 2013 11:45:31 +1000, Chris Angelico wrote:
> On Thu, Apr 11, 2013 at 11:21 AM, gry <georgeryoung@gmail.com> wrote:
>> avail_chrs =
>> '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&
>> \'()*+,-./:;<=>?@[\\]^_`{}'
>
> Is this exact set of characters a requirement? For instance, would it be
> acceptable to instead use this set of characters?
>
> avail_chrs =
> 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'
>
> Your alphabet has 92 characters, this one only 64... the advantage is
> that it's really easy to work with a 64-character set; in fact, for this
> specific set, it's the standard called Base 64, and Python already has a
> module for working with it. All you need is a random stream of eight-bit
> characters, which can be provided by os.urandom().
I was originally going to write that using the base64 module would
introduce bias into the random strings, but after a little investigation,
I don't think it does.
Or at least, if it does, it's a fairly subtle bias, and not detectable by
the simple technique I used: inspect the mean, and the mean deviation
from the mean.
from os import urandom
from base64 import b64encode
data = urandom(1000000)
m = sum(data)/len(data)
md = sum(abs(v - m) for v in data)/len(data)
print("Mean and mean deviation of urandom:", m, md)
encoded = b64encode(data).strip(b'=')
chars = (b'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdef'
b'ghijklmnopqrstuvwxyz0123456789+/')
values = [chars.index(v) for v in encoded]
m = sum(values)/len(values)
md = sum(abs(v - m) for v in values)/len(values)
print("Mean and mean deviation of encoded data:", m, md)
When I run this, it prints:
Mean and mean deviation of urandom: 127.451652 63.95331188965717
Mean and mean deviation of encoded data: 31.477027511486245
15.991177272527072
I would expect 127 64 and 32 16, so we're pretty close. That's not to say
that there aren't any other biases or correlations in the encoded data,
but after a simplistic test, it looks okay to me.
--
Steven
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-04-11 15:53 +1000 |
| Message-ID | <mailman.440.1365659618.3114.python-list@python.org> |
| In reply to | #43316 |
On Thu, Apr 11, 2013 at 3:33 PM, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote: > I was originally going to write that using the base64 module would > introduce bias into the random strings, but after a little investigation, > I don't think it does. Assuming that os.urandom() returns bytes with perfectly fair distribution (exactly equal chance of any value 00-FF - it probably does, or close to it), and assuming that you work with exact multiples of 3 bytes and 4 output characters, base64 will give you perfectly fair distribution of result characters. You take three bytes (24 bits) and turn them into four characters (6 bits per character, = 24 bits). You might see some bias if you use less than a full set of four output characters, though; I haven't dug into the details to check that. ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Michael Torrie <torriem@gmail.com> |
|---|---|
| Date | 2013-04-10 19:52 -0600 |
| Message-ID | <mailman.435.1365645159.3114.python-list@python.org> |
| In reply to | #43306 |
On 04/10/2013 07:21 PM, gry wrote:
> from sys import stdout
> from array import array
> import random
> nchars = 32000000
> rows = 10
> avail_chrs =
> '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&
> \'()*+,-./:;<=>?@[\\]^_`{}'
> a = array('c', 'X' * nchars)
>
> for l in range(rows):
> for i in xrange(nchars):
> a[i] = random.choice(avail_chrs)
> a.tofile(stdout)
> stdout.write('\n')
>
> This version using array took 4 min, 29 sec, using 34MB resident/110
> virtual. So, much smaller than the first attempt, but a bit slower.
> Can someone suggest a better code? And help me understand the
> performance issues here?
Why are you using an array? Why not just rely on the OS to buffer the
output. Just write your characters straight to stdout instead of
placing them in an array.
At that point I believe this program will be as fast as is possible in
Python.
[toc] | [prev] | [next] | [standalone]
| From | gry <georgeryoung@gmail.com> |
|---|---|
| Date | 2013-04-10 19:40 -0700 |
| Message-ID | <15b233c5-f961-479e-aec1-fe1467bd99d3@e8g2000yqg.googlegroups.com> |
| In reply to | #43309 |
On Apr 10, 9:52 pm, Michael Torrie <torr...@gmail.com> wrote:
> On 04/10/2013 07:21 PM, gry wrote:
>
>
>
>
>
>
>
>
>
> > from sys import stdout
> > from array import array
> > import random
> > nchars = 32000000
> > rows = 10
> > avail_chrs =
> > '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&
> > \'()*+,-./:;<=>?@[\\]^_`{}'
> > a = array('c', 'X' * nchars)
>
> > for l in range(rows):
> > for i in xrange(nchars):
> > a[i] = random.choice(avail_chrs)
> > a.tofile(stdout)
> > stdout.write('\n')
>
> > This version using array took 4 min, 29 sec, using 34MB resident/110
> > virtual. So, much smaller than the first attempt, but a bit slower.
> > Can someone suggest a better code? And help me understand the
> > performance issues here?
>
> Why are you using an array? Why not just rely on the OS to buffer the
> output. Just write your characters straight to stdout instead of
> placing them in an array.
>
> At that point I believe this program will be as fast as is possible in
> Python.
Appealing idea, but it's slower than the array solution: 5min 13
secs. vs 4min 30sec for the array:
for l in range(rows):
for i in xrange(nchars):
stdout.write(random.choice(avail_chrs))
stdout.write('\n')
os.urandom does look promising -- I have to have full control over the
charset, but urandom is very fast at generating big random strings...
stay tuned...
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-04-11 13:14 +1000 |
| Message-ID | <mailman.437.1365650080.3114.python-list@python.org> |
| In reply to | #43310 |
On Thu, Apr 11, 2013 at 12:40 PM, gry <georgeryoung@gmail.com> wrote:
> Appealing idea, but it's slower than the array solution: 5min 13
> secs. vs 4min 30sec for the array:
>
> for l in range(rows):
> for i in xrange(nchars):
> stdout.write(random.choice(avail_chrs))
> stdout.write('\n')
>
>
> os.urandom does look promising -- I have to have full control over the
> charset, but urandom is very fast at generating big random strings...
> stay tuned...
Without actually profiling it, my first guess would be that calling
random.choice() for every character is an optimization target. (NOTE:
Do profile it, if the urandom method isn't sufficient straight-off.)
You may want to consider, for instance, generating larger random
numbers and doing some kind of translation on them - which is
fundamentally what the urandom/b64encode method is doing.
ChrisA
[toc] | [prev] | [next] | [standalone]
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2013-04-11 04:09 +0100 |
| Message-ID | <mailman.436.1365649974.3114.python-list@python.org> |
| In reply to | #43306 |
On 11/04/2013 02:21, gry wrote:
> Dear pythonistas,
> I am writing a tiny utility to produce a file consisting of a
> specified number of lines of a given length of random ascii
> characters. I am hoping to find a more time and memory efficient way,
> that is still fairly simple clear, and _pythonic_.
>
> I would like to have something that I can use at both extremes of
> data:
>
> 32M chars per line * 100 lines
> or
> 5 chars per line * 1e8 lines.
>
> E.g., the output of bigrand.py for 10 characters, 2 lines might be:
>
> gw2+M/5t&.
> S[[db/l?Vx
>
> I'm using python 2.7.0 on linux. I need to use only out-of-the box
> modules, since this has to work on a bunch of different computers.
> At this point I'm especially concerned with the case of a few very
> long lines, since that seems to use a lot of memory, and take a long
> time.
> Characters are a slight subset of the printable ascii's, specified in
> the examples below. My first naive try was:
>
> from sys import stdout
> import random
> nchars = 32000000
> rows = 10
> avail_chrs =
> '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&
> \'()*+,-./:;<=>?@[\\]^_`{}'
>
> def make_varchar(nchars):
> return (''.join([random.choice(avail_chrs) for i in
> range(nchars)]))
>
> for l in range(rows):
> stdout.write(make_varchar(nchars))
> stdout.write('\n')
>
> This version used around 1.2GB resident/1.2GB virtual of memory for
> 3min 38sec.
>
>
> My second try uses much less RAM, but more CPU time, and seems rather,
> umm, un-pythonic (the array module always seems a little un
> pythonic...)
>
> from sys import stdout
> from array import array
> import random
> nchars = 32000000
> rows = 10
> avail_chrs =
> '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&
> \'()*+,-./:;<=>?@[\\]^_`{}'
> a = array('c', 'X' * nchars)
>
> for l in range(rows):
> for i in xrange(nchars):
> a[i] = random.choice(avail_chrs)
> a.tofile(stdout)
> stdout.write('\n')
>
> This version using array took 4 min, 29 sec, using 34MB resident/110
> virtual. So, much smaller than the first attempt, but a bit slower.
> Can someone suggest a better code? And help me understand the
> performance issues here?
>
Names in the global scope are stored in a dict, but local to a function
are stored in slots and can be accessed more quickly.
'avail_chrs' and 'random.choice' are referred to many times, so making
'avail_chrs' local and making a local reference to 'random.choice' will
help.
from sys import stdout
from array import array
import random
def generate():
avail_chrs =
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}'
rnd = random.choice
for l in range(rows):
stdout.write(''.join([rnd(avail_chrs) for i in xrange(nchars)]))
stdout.write('\n')
nchars = 32000000
rows = 10
generate()
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-04-11 07:47 +0000 |
| Message-ID | <51666aae$0$29977$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #43306 |
On Wed, 10 Apr 2013 18:21:51 -0700, gry wrote:
> Dear pythonistas,
> I am writing a tiny utility to produce a file consisting of a
> specified number of lines of a given length of random ascii characters.
> I am hoping to find a more time and memory efficient way, that is still
> fairly simple clear, and _pythonic_.
Here's another option: use string.translate to map random bytes to the
printable ASCII bytes.
import string
all_bytes = ''.join(map(chr, range(256)))
printable = all_bytes[33:127]
n = 127 + len(printable)
extras = all_bytes[127:n]
_deletions = all_bytes[:33] + all_bytes[n:]
_table = string.maketrans(extras, printable)
def random_chars(length, size=1000):
# Make local bindings for speed.
import string, os
table, deletions = _table, _deletions
# Generate random strings.
buffer = []
while True:
while len(buffer) < length:
bytes = string.translate(os.urandom(size), table, deletions)
buffer.extend(bytes)
yield ''.join(buffer[:length])
buffer = buffer[length:]
Then use it like this:
# I want seven lines of twelve char strings.
make = random_chars(12)
for i in range(7):
print next(make)
One thing to be aware of: urandom may run out of entropy, and then it
will slow down a lot. If you don't care about cryptographic randomness,
you could use this instead:
import random
def myrandom(n):
return [random.randint(0, 255) for i in xrange(n)]
although that will actually be slower unless urandom has run out of
entropy.
--
Steven
[toc] | [prev] | [next] | [standalone]
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
|---|---|
| Date | 2013-04-11 10:47 +0100 |
| Message-ID | <mailman.453.1365673692.3114.python-list@python.org> |
| In reply to | #43334 |
On 11 April 2013 08:47, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote: > One thing to be aware of: urandom may run out of entropy, and then it > will slow down a lot. If you don't care about cryptographic randomness, > you could use this instead: Reading this I'm realising that I don't really know what os.urandom is. How exactly is it generating random numbers and what do you mean by it running out of entropy? Oscar
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-04-11 10:50 +0000 |
| Message-ID | <51669576$0$29977$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #43344 |
On Thu, 11 Apr 2013 10:47:43 +0100, Oscar Benjamin wrote: > On 11 April 2013 08:47, Steven D'Aprano > <steve+comp.lang.python@pearwood.info> wrote: > >> One thing to be aware of: urandom may run out of entropy, and then it >> will slow down a lot. If you don't care about cryptographic randomness, >> you could use this instead: > > Reading this I'm realising that I don't really know what os.urandom is. > How exactly is it generating random numbers and what do you mean by it > running out of entropy? I am not an expert, but here goes... Some (most?) modern operating systems provide a cryptographically strong source of non-deterministic randomness. The non-deterministic part comes from external "stuff", which is called "entropy". Typical sources of entropy include network events, user key-presses, moving the mouse, and (presumably in machines with special hardware), even thermal noise in electrical components. If the OS hasn't collected enough entropy, urandom can block until it has. This can be a problem, e.g. I've experienced issues where scripts relying indirectly on urandom that run at system reboot can block for minutes at a time, waiting for entropy. If those scripts run before the networking software, and before any users log in and start running apps, the script can block forever waiting for entropy that never arrives. (Where "forever" == "70 minute boot times".) Entropy is used and discarded, so urandom needs the OS to continually replenish the amount of entropy. Under normal circumstances, this it does, but if you grab lots of urandom output on a system which is otherwise quiet and not doing anything, it could run out. If I've got any of this wrong, corrections will be gratefully acceptable. -- Steven
[toc] | [prev] | [next] | [standalone]
| From | Robert Kern <robert.kern@gmail.com> |
|---|---|
| Date | 2013-04-11 16:49 +0530 |
| Message-ID | <mailman.461.1365679158.3114.python-list@python.org> |
| In reply to | #43349 |
On 2013-04-11 16:20, Steven D'Aprano wrote: > On Thu, 11 Apr 2013 10:47:43 +0100, Oscar Benjamin wrote: > >> On 11 April 2013 08:47, Steven D'Aprano >> <steve+comp.lang.python@pearwood.info> wrote: >> >>> One thing to be aware of: urandom may run out of entropy, and then it >>> will slow down a lot. If you don't care about cryptographic randomness, >>> you could use this instead: >> >> Reading this I'm realising that I don't really know what os.urandom is. >> How exactly is it generating random numbers and what do you mean by it >> running out of entropy? > > I am not an expert, but here goes... > > Some (most?) modern operating systems provide a cryptographically strong > source of non-deterministic randomness. The non-deterministic part comes > from external "stuff", which is called "entropy". Typical sources of > entropy include network events, user key-presses, moving the mouse, and > (presumably in machines with special hardware), even thermal noise in > electrical components. > > If the OS hasn't collected enough entropy, urandom can block until it > has. This can be a problem, e.g. I've experienced issues where scripts > relying indirectly on urandom that run at system reboot can block for > minutes at a time, waiting for entropy. If those scripts run before the > networking software, and before any users log in and start running apps, > the script can block forever waiting for entropy that never arrives. > > (Where "forever" == "70 minute boot times".) > > Entropy is used and discarded, so urandom needs the OS to continually > replenish the amount of entropy. Under normal circumstances, this it > does, but if you grab lots of urandom output on a system which is > otherwise quiet and not doing anything, it could run out. > > If I've got any of this wrong, corrections will be gratefully acceptable. Just one important thing: os.urandom() does not block to wait for more entropy. Only os.random() does. http://en.wikipedia.org/wiki//dev/random -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
[toc] | [prev] | [next] | [standalone]
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
|---|---|
| Date | 2013-04-11 13:05 +0100 |
| Message-ID | <mailman.463.1365681967.3114.python-list@python.org> |
| In reply to | #43349 |
On 11 April 2013 11:50, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote: > On Thu, 11 Apr 2013 10:47:43 +0100, Oscar Benjamin wrote: > >> On 11 April 2013 08:47, Steven D'Aprano >> <steve+comp.lang.python@pearwood.info> wrote: >> >>> One thing to be aware of: urandom may run out of entropy, and then it >>> will slow down a lot. If you don't care about cryptographic randomness, >>> you could use this instead: >> >> Reading this I'm realising that I don't really know what os.urandom is. >> How exactly is it generating random numbers and what do you mean by it >> running out of entropy? > > Some (most?) modern operating systems provide a cryptographically strong > source of non-deterministic randomness. The non-deterministic part comes > from external "stuff", which is called "entropy". Typical sources of > entropy include network events, user key-presses, moving the mouse, and > (presumably in machines with special hardware), even thermal noise in > electrical components. > Entropy is used and discarded, so urandom needs the OS to continually > replenish the amount of entropy. Under normal circumstances, this it > does, but if you grab lots of urandom output on a system which is > otherwise quiet and not doing anything, it could run out. Okay, so I understand what entropy is in the thermodynamic sense and also in the mathematical (Shannon) sense but I'm still confused about what it means that the OS is somehow storing entropy. Do you mean that it is always maintaining a buffer of what it considers to be random bytes that it slowly builds up from noise that is made accessible to the OS from the hardware? Oscar
[toc] | [prev] | [next] | [standalone]
| From | Robert Kern <robert.kern@gmail.com> |
|---|---|
| Date | 2013-04-11 19:06 +0530 |
| Message-ID | <mailman.472.1365687426.3114.python-list@python.org> |
| In reply to | #43349 |
On 2013-04-11 17:35, Oscar Benjamin wrote: > On 11 April 2013 11:50, Steven D'Aprano > <steve+comp.lang.python@pearwood.info> wrote: >> On Thu, 11 Apr 2013 10:47:43 +0100, Oscar Benjamin wrote: >> >>> On 11 April 2013 08:47, Steven D'Aprano >>> <steve+comp.lang.python@pearwood.info> wrote: >>> >>>> One thing to be aware of: urandom may run out of entropy, and then it >>>> will slow down a lot. If you don't care about cryptographic randomness, >>>> you could use this instead: >>> >>> Reading this I'm realising that I don't really know what os.urandom is. >>> How exactly is it generating random numbers and what do you mean by it >>> running out of entropy? >> >> Some (most?) modern operating systems provide a cryptographically strong >> source of non-deterministic randomness. The non-deterministic part comes >> from external "stuff", which is called "entropy". Typical sources of >> entropy include network events, user key-presses, moving the mouse, and >> (presumably in machines with special hardware), even thermal noise in >> electrical components. > >> Entropy is used and discarded, so urandom needs the OS to continually >> replenish the amount of entropy. Under normal circumstances, this it >> does, but if you grab lots of urandom output on a system which is >> otherwise quiet and not doing anything, it could run out. > > Okay, so I understand what entropy is in the thermodynamic sense and > also in the mathematical (Shannon) sense but I'm still confused about > what it means that the OS is somehow storing entropy. Do you mean that > it is always maintaining a buffer of what it considers to be random > bytes that it slowly builds up from noise that is made accessible to > the OS from the hardware? Yes. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-04-11 23:56 +1000 |
| Message-ID | <mailman.474.1365688589.3114.python-list@python.org> |
| In reply to | #43349 |
On Thu, Apr 11, 2013 at 10:05 PM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote: > On 11 April 2013 11:50, Steven D'Aprano > <steve+comp.lang.python@pearwood.info> wrote: >> Some (most?) modern operating systems provide a cryptographically strong >> source of non-deterministic randomness. The non-deterministic part comes >> from external "stuff", which is called "entropy". Typical sources of >> entropy include network events, user key-presses, moving the mouse, and >> (presumably in machines with special hardware), even thermal noise in >> electrical components. > >> Entropy is used and discarded, so urandom needs the OS to continually >> replenish the amount of entropy. Under normal circumstances, this it >> does, but if you grab lots of urandom output on a system which is >> otherwise quiet and not doing anything, it could run out. > > Okay, so I understand what entropy is in the thermodynamic sense and > also in the mathematical (Shannon) sense but I'm still confused about > what it means that the OS is somehow storing entropy. Do you mean that > it is always maintaining a buffer of what it considers to be random > bytes that it slowly builds up from noise that is made accessible to > the OS from the hardware? Correct. And Steven's right about most of what he says (modulo the urandom vs random distinction, as Robert Kern pointed out - urandom won't block, but it's not guaranteed to be cryptographically secure); I'll just add that one of the best sources of entropy is a solid cylinder, rotated at high velocity in a sealed container filled with a fluid, and entropy is found in the eddies. Many computers have a device of this nature - the solid cylinder is thin and flat and referred to as a "disk", the fluid it's in is air, and the sealed container is your hard disk drive. The details will vary, but broadly speaking, the /dev/random driver (or its equivalent) maintains an ever-increasing buffer of entropic bits, accumulated as they arrive from the various sources, and often saved to disk on shutdown to permit faster boot (which helps to avoid the problem Steven described of 70-minute boot times - on an all-SSD computer with no human being attached, entropy really can be very hard to obtain); whenever a program asks for bytes from it, it delivers them and removes that much "recorded entropy" from its buffer. For many purposes, it's sufficient to take 4 or 8 bytes of /dev/random entropy and use that to seed a PRNG, but if you're using /dev/urandom and it's not a critical server, I wouldn't worry too much about drawing too much off it. (On a web server that's constantly serving HTTPS requests, for instance, I'd be cautious about reading too much from /dev/urandom as it might cause the web server to block waiting for /dev/random. Might kill your TPS for a while.) ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
|---|---|
| Date | 2013-04-11 10:47 +0100 |
| Message-ID | <mailman.451.1365673654.3114.python-list@python.org> |
| In reply to | #43306 |
On 11 April 2013 02:21, gry <georgeryoung@gmail.com> wrote:
> Dear pythonistas,
> I am writing a tiny utility to produce a file consisting of a
> specified number of lines of a given length of random ascii
> characters. I am hoping to find a more time and memory efficient way,
> that is still fairly simple clear, and _pythonic_.
>
> I would like to have something that I can use at both extremes of
> data:
>
> 32M chars per line * 100 lines
> or
> 5 chars per line * 1e8 lines.
I would definitely use numpy for this. The script below seems to be
io-bound on my machine:
#!/usr/bin/env python
from sys import stdout
import numpy as np
from numpy import random
CHARS = (
'0123456789'
'abcdefghijklmnopqrstuvwxyz'
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
'!#$%& \'()*+,-./:;<=>?@[\\]^_`{}'
)
ARRAY_CHARS = np.frombuffer(CHARS, np.uint8)
NCHARS = len(CHARS)
CHUNK_SIZE = 4096
NCOLS = 32000000
NROWS = 10
def chunk_sizes(total, chunk_size):
numchunks, remainder = divmod(total, chunk_size)
for n in range(numchunks):
yield chunk_size
if remainder:
yield remainder
def chunks():
bytes_per_line = NCOLS + 1
total_bytes = bytes_per_line * NROWS
newline_index = NCOLS
newline = ord('\n')
for size in chunk_sizes(total_bytes, CHUNK_SIZE):
chars = ARRAY_CHARS[random.randint(0, NCHARS, size)]
chars[newline_index::bytes_per_line] = newline
newline_index = (newline_index - CHUNK_SIZE) % bytes_per_line
yield chars
for chunk in chunks():
chunk.tofile(stdout)
[toc] | [prev] | [standalone]
Back to top | Article view | comp.lang.python
csiph-web