From: Peter Otten <__peter__@web.de>
Newsgroups: comp.lang.python
To: python-list@python.org
Subject: Re: Best approach to create humongous amount of files
Date: Wed, 20 May 2015 17:59:33 +0200
References: <20150520100723.3a34a775@bigbox.christie.dr>

Tim Chase wrote:

> On 2015-05-20 22:58, Chris Angelico wrote:
>> On Wed, May 20, 2015 at 9:44 PM, Parul Mogra wrote:
>> > My objective is to create large amount of data files (say a
>> > million *.json files), using a pre-existing template file
>> > (*.json). Each file would have a unique name, possibly by
>> > incorporating time stamp information. The files have to be
>> > generated in a folder specified.
> [snip]
>> try a simple sequential integer.
>>
>> All you'd need would be a loop that creates a bunch of files... most
>> of your code will be figuring out what parts of the template need to
>> change. Not too difficult.
>
> If you store your template as a Python string-formatting template,
> you can just use string-formatting to do your dirty work:
>
>
>   import random
>   HOW_MANY = 1000000
>   template = """{
>       "some_string": "%(string)s",
>       "some_int": %(int)i
>       }
>   """
>
>   wordlist = [
>       word.rstrip()
>       for word in open('/usr/share/dict/words')
>       ]
>   wordlist[:] = [ # just lowercase all-alpha words
>       word
>       for word in wordlist
>       if word.isalpha() and word.islower()
>       ]
>
>   for i in range(HOW_MANY):
>     fname = "data_%08i.json" % i
>     with open(fname, "w") as f:
>       f.write(template % {
>         "string": random.choice(wordlist),
>         "int": random.randint(0, 1000),
>         })

Just a quick reminder: if the data is user-provided you have to sanitize it,
or a crafted value can inject extra keys:

>>> import json
>>> template = """{"access": "restricted", "user": "%(user)s"}"""
>>> json.loads(template % dict(user="""tim", "access": "unlimited"""))
{'user': 'tim', 'access': 'unlimited'}

That can't happen when you load the template, replace some keys and dump the
result:

>>> template = json.loads("""{"access": "restricted", "user": "placeholder"}""")
>>> template["user"] = """tim", "access": "unlimited"""
>>> json.dumps(template)
'{"user": "tim\\", \\"access\\": \\"unlimited", "access": "restricted"}'
>>> json.loads(_)
{'user': 'tim", "access": "unlimited', 'access': 'restricted'}
>>> _["access"]
'restricted'

I expect that performance will be dominated by I/O; if that's correct the
extra work of serializing the JSON should not do much harm.
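
For illustration, the load/replace/dump approach could be folded into the file-writing loop roughly like this. It is only a sketch: the function name write_json_files, its parameters (out_dir, how_many, words) and the field names some_string and some_int are assumptions made for the example, not anything from the original template.

```python
import copy
import json
import os
import random


def write_json_files(template, out_dir, how_many, words):
    """Write `how_many` numbered JSON files derived from `template`.

    `template` is an already-parsed dict (e.g. from json.load), not a
    format string. All names here are hypothetical, chosen for the sketch.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i in range(how_many):
        # Copy so the parsed template itself is never modified.
        record = copy.deepcopy(template)
        record["some_string"] = random.choice(words)
        record["some_int"] = random.randint(0, 1000)
        path = os.path.join(out_dir, "data_%08i.json" % i)
        with open(path, "w") as f:
            # The serializer escapes quotes inside values, so a hostile
            # string stays a string and cannot add or override keys.
            json.dump(record, f)
        paths.append(path)
    return paths
```

Called as write_json_files(template, "out", HOW_MANY, wordlist), with template obtained from json.load() on the pre-existing template file, this should produce the same sequentially numbered files as the loop quoted above. For a million files you may prefer not to collect the paths at all; they are returned here only to make the sketch easy to check.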