To: python-list@python.org
From: Peter Otten <__peter__@web.de>
Subject: Re: Best approach to create humongous amount of files
Date: Wed, 20 May 2015 17:59:33 +0200
Organization: None
References: <CAPkZ3MS5SiGH9OCe9RSTmakF681O+qM572y49FuDBmBix=aiFg@mail.gmail.com> <CAPTjJmppiMpVjTBt5CH_6DGSdCWw5aoDU+jY-3wMs5Ai7tPdKw@mail.gmail.com> <20150520100723.3a34a775@bigbox.christie.dr>
Mime-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 7Bit
User-Agent: KNode/4.13.3
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.174.1432137588.17265.python-list@python.org>

Tim Chase wrote:

> On 2015-05-20 22:58, Chris Angelico wrote:
>> On Wed, May 20, 2015 at 9:44 PM, Parul Mogra <scoria.799@gmail.com>
>> wrote:
>> > My objective is to create large amount of data files (say a
>> > million *.json files), using a pre-existing template file
>> > (*.json). Each file would have a unique name, possibly by
>> > incorporating time stamp information. The files have to be
>> > generated in a folder specified.
> [snip]
>> try a simple sequential integer.
>> 
>> All you'd need would be a loop that creates a bunch of files... most
>> of your code will be figuring out what parts of the template need to
>> change. Not too difficult.
> 
> If you store your template as a Python string-formatting template,
> you can just use string-formatting to do your dirty work:
> 
> 
>   import random
>   HOW_MANY = 1000000
>   template = """{
>     "some_string": "%(string)s",
>     "some_int": %(int)i
>     }
>     """
> 
>   wordlist = [
>     word.rstrip()
>     for word in open('/usr/share/dict/words')
>     ]
>   wordlist[:] = [ # just lowercase all-alpha words
>     word
>     for word in wordlist
>     if word.isalpha() and word.islower()
>     ]
> 
>   for i in range(HOW_MANY):
>     fname = "data_%08i.json" % i
>     with open(fname, "w") as f:
>       # keys must match the %(string)s and %(int)i placeholders
>       f.write(template % {
>         "string": random.choice(wordlist),
>         "int": random.randint(0, 1000),
>         })

Just a quick reminder: if the data is user-provided you have to sanitize it, 
or a crafted value can inject extra keys into the generated JSON:

>>> import json
>>> template = """{"access": "restricted", "user": "%(user)s"}"""
>>> json.loads(template % dict(user="""tim", "access": "unlimited"""))
{'user': 'tim', 'access': 'unlimited'}

That can't happen when you load the template, replace some keys and dump the 
result:

>>> template = json.loads("""{"access": "restricted", "user": "placeholder"}""")
>>> template["user"] = """tim", "access": "unlimited"""
>>> json.dumps(template)
'{"user": "tim\\", \\"access\\": \\"unlimited", "access": "restricted"}'
>>> json.loads(_)
{'user': 'tim", "access": "unlimited', 'access': 'restricted'}
>>> _["access"]
'restricted'
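
Applied to the file-generation loop, the load/replace/dump approach might look 
roughly like the sketch below. The names write_files and TEMPLATE_TEXT are 
made up for illustration, and the template is assumed to be a flat JSON object 
so that a shallow copy per file is enough:

```python
import json
import os
import tempfile

# Hypothetical template; in practice this would be read from the *.json file.
TEMPLATE_TEXT = '{"access": "restricted", "user": "placeholder"}'


def write_files(template_text, out_dir, users):
    """Write one JSON file per user name and return the paths created."""
    template = json.loads(template_text)  # parse the template only once
    paths = []
    for i, user in enumerate(users):
        record = dict(template)  # shallow copy; assumes a flat template
        record["user"] = user    # json.dump escapes the value for us
        path = os.path.join(out_dir, "data_%08i.json" % i)
        with open(path, "w") as f:
            json.dump(record, f)
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as out_dir:
        hostile = 'tim", "access": "unlimited'
        for path in write_files(TEMPLATE_TEXT, out_dir, ["tim", hostile]):
            with open(path) as f:
                print(os.path.basename(path), json.load(f)["access"])
```

Both files should come back with "access" still set to "restricted", because 
the hostile name is escaped on output instead of being spliced into the text.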

I expect that performance will be dominated by I/O; if that's correct the 
extra work of serializing the JSON should not do much harm.
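
If you want to check that expectation on your own machine, something along 
these lines could serve as a rough measurement. It is only a sketch: the 
function name measure and the record contents are invented here, and the 
numbers will depend heavily on the filesystem and on OS caching:

```python
import json
import os
import tempfile
import time

# Hypothetical stand-in for the parsed template.
TEMPLATE = {"access": "restricted", "user": "placeholder"}


def measure(count):
    """Return seconds spent serializing vs. writing `count` small JSON files."""
    records = [dict(TEMPLATE, user="user%d" % i) for i in range(count)]

    # Time the serialization on its own.
    start = time.perf_counter()
    texts = [json.dumps(record) for record in records]
    serialize_seconds = time.perf_counter() - start

    # Time creating and writing the files on their own.
    with tempfile.TemporaryDirectory() as out_dir:
        start = time.perf_counter()
        for i, text in enumerate(texts):
            with open(os.path.join(out_dir, "data_%08i.json" % i), "w") as f:
                f.write(text)
        write_seconds = time.perf_counter() - start

    return {"serialize": serialize_seconds, "write": write_seconds}


if __name__ == "__main__":
    print(measure(10000))
```

Comparing the two numbers for a realistic count and a realistic template 
should show whether json.dumps matters at all next to the file creation.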