Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #90963 > unrolled thread
| Started by | Peter Otten <__peter__@web.de> |
|---|---|
| First post | 2015-05-20 17:59 +0200 |
| Last post | 2015-05-20 17:59 +0200 |
| Articles | 1 — 1 participant |
Back to article view | Back to comp.lang.python
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: Best approach to create humongous amount of files Peter Otten <__peter__@web.de> - 2015-05-20 17:59 +0200
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2015-05-20 17:59 +0200 |
| Subject | Re: Best approach to create humongous amount of files |
| Message-ID | <mailman.174.1432137588.17265.python-list@python.org> |
Tim Chase wrote:
> On 2015-05-20 22:58, Chris Angelico wrote:
>> On Wed, May 20, 2015 at 9:44 PM, Parul Mogra <scoria.799@gmail.com>
>> wrote:
>> > My objective is to create large amount of data files (say a
>> > million *.json files), using a pre-existing template file
>> > (*.json). Each file would have a unique name, possibly by
>> > incorporating time stamp information. The files have to be
>> > generated in a folder specified.
> [snip]
>> try a simple sequential integer.
>>
>> All you'd need would be a loop that creates a bunch of files... most
>> of your code will be figuring out what parts of the template need to
>> change. Not too difficult.
>
> If you store your template as a Python string-formatting template,
> you can just use string-formatting to do your dirty work:
>
>
> import random
> HOW_MANY = 1000000
> template = """{
> "some_string": "%(string)s",
> "some_int": %(int)i
> }
> """
>
> wordlist = [
> word.rstrip()
> for word in open('/usr/share/dict/words')
> ]
> wordlist[:] = [ # just lowercase all-alpha words
> word
> for word in wordlist
> if word.isalpha() and word.islower()
> ]
>
> for i in xrange(HOW_MANY):
> fname = "data_%08i.json" % i
> with open(fname, "w") as f:
> f.write(template % {
> "string_value": random.choice(wordlist),
> "int_value": random.randint(0, 1000),
> })
Just a quick reminder: if the data is user-provided you have to sanitize it:
>>> template = """{"access": "restricted", "user": "%(user)s"}"""
>>> json.loads(template % dict(user="""tim", "access": "unlimited"""))
{'user': 'tim', 'access': 'unlimited'}
That can't happen when you load the template, replace some keys and dump the
result:
>>> template = json.loads("""{"access": "restricted", "user":
"placeholder"}""")
>>> template["user"] = """tim", "access": "unlimited"""
>>> json.dumps(template)
'{"user": "tim\\", \\"access\\": \\"unlimited", "access": "restricted"}'
>>> json.loads(_)
{'user': 'tim", "access": "unlimited', 'access': 'restricted'}
>>> _["access"]
'restricted'
I expect that performance will be dominated by I/O; if that's correct the
extra work of serializing the JSON should not do much harm.
Back to top | Article view | comp.lang.python
csiph-web