


Re: Best approach to create humongous amount of files

Started by: Peter Otten <__peter__@web.de>
First post: 2015-05-20 17:59 +0200
Last post: 2015-05-20 17:59 +0200
Articles: 1 (1 participant)


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.



#90963 — Re: Best approach to create humongous amount of files

From: Peter Otten <__peter__@web.de>
Date: 2015-05-20 17:59 +0200
Subject: Re: Best approach to create humongous amount of files
Message-ID: <mailman.174.1432137588.17265.python-list@python.org>

Tim Chase wrote:

> On 2015-05-20 22:58, Chris Angelico wrote:
>> On Wed, May 20, 2015 at 9:44 PM, Parul Mogra <scoria.799@gmail.com>
>> wrote:
>> > My objective is to create large amount of data files (say a
>> > million *.json files), using a pre-existing template file
>> > (*.json). Each file would have a unique name, possibly by
>> > incorporating time stamp information. The files have to be
>> > generated in a folder specified.
> [snip]
>> try a simple sequential integer.
>> 
>> All you'd need would be a loop that creates a bunch of files... most
>> of your code will be figuring out what parts of the template need to
>> change. Not too difficult.
> 
> If you store your template as a Python string-formatting template,
> you can just use string-formatting to do your dirty work:
> 
> 
>   import random
>   HOW_MANY = 1000000
>   template = """{
>     "some_string": "%(string_value)s",
>     "some_int": %(int_value)i
>     }
>     """
> 
>   wordlist = [
>     word.rstrip()
>     for word in open('/usr/share/dict/words')
>     ]
>   wordlist[:] = [ # just lowercase all-alpha words
>     word
>     for word in wordlist
>     if word.isalpha() and word.islower()
>     ]
> 
>   for i in xrange(HOW_MANY):
>     fname = "data_%08i.json" % i
>     with open(fname, "w") as f:
>       f.write(template % {
>         "string_value": random.choice(wordlist),
>         "int_value": random.randint(0, 1000),
>         })

Just a quick reminder: if the data is user-provided you have to sanitize it, 
or it can inject keys of its own:

>>> import json
>>> template = """{"access": "restricted", "user": "%(user)s"}"""
>>> json.loads(template % dict(user="""tim", "access": "unlimited"""))
{'user': 'tim', 'access': 'unlimited'}

That can't happen when you load the template, replace some keys and dump the 
result:

>>> template = json.loads("""{"access": "restricted", "user": "placeholder"}""")
>>> template["user"] = """tim", "access": "unlimited"""
>>> json.dumps(template)
'{"user": "tim\\", \\"access\\": \\"unlimited", "access": "restricted"}'
>>> json.loads(_)
{'user': 'tim", "access": "unlimited', 'access': 'restricted'}
>>> _["access"]
'restricted'

I expect that performance will be dominated by I/O; if that's correct, the 
extra work of serializing the JSON should not do much harm.
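
Putting the two ideas together, a minimal sketch of the file-generating loop using load/replace/dump instead of string formatting might look like this. It assumes Python 3 and a flat template; the function name write_files, its parameters, and the keys "some_string" and "some_int" (borrowed from the quoted template) are illustrative only, not something from the original posts:

```python
import json
import os
import random


def write_files(template_text, out_dir, how_many, words):
    """Write how_many JSON files derived from template_text into out_dir."""
    # Parse the template once; every file starts from this dict.
    template = json.loads(template_text)
    os.makedirs(out_dir, exist_ok=True)
    for i in range(how_many):
        # A shallow copy is enough because this sketch assumes a flat
        # template; use copy.deepcopy() if nested values get modified.
        record = dict(template)
        record["some_string"] = random.choice(words)
        record["some_int"] = random.randint(0, 1000)
        fname = os.path.join(out_dir, "data_%08i.json" % i)
        with open(fname, "w") as f:
            # json.dump() escapes quotes in the values, so a hostile
            # string cannot add or override keys.
            json.dump(record, f)
```

Called as write_files(open("template.json").read(), "out", 1000000, wordlist), this would produce data_00000000.json and so on in the "out" folder; the file and folder names here are placeholders.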
