Re: Best approach to create humongous amount of files

From Peter Otten <__peter__@web.de>
Subject Re: Best approach to create humongous amount of files
Date Wed, 20 May 2015 17:59:33 +0200
Newsgroups comp.lang.python
Message-ID <mailman.174.1432137588.17265.python-list@python.org>


Tim Chase wrote:

> On 2015-05-20 22:58, Chris Angelico wrote:
>> On Wed, May 20, 2015 at 9:44 PM, Parul Mogra <scoria.799@gmail.com>
>> wrote:
>> > My objective is to create large amount of data files (say a
>> > million *.json files), using a pre-existing template file
>> > (*.json). Each file would have a unique name, possibly by
>> > incorporating time stamp information. The files have to be
>> > generated in a folder specified.
> [snip]
>> try a simple sequential integer.
>> 
>> All you'd need would be a loop that creates a bunch of files... most
>> of your code will be figuring out what parts of the template need to
>> change. Not too difficult.
> 
> If you store your template as a Python string-formatting template,
> you can just use string-formatting to do your dirty work:
> 
> 
>   import random
>   HOW_MANY = 1000000
>   template = """{
>     "some_string": "%(string)s",
>     "some_int": %(int)i
>     }
>     """
> 
>   wordlist = [
>     word.rstrip()
>     for word in open('/usr/share/dict/words')
>     ]
>   wordlist[:] = [ # just lowercase all-alpha words
>     word
>     for word in wordlist
>     if word.isalpha() and word.islower()
>     ]
> 
>   for i in range(HOW_MANY):
>     fname = "data_%08i.json" % i
>     with open(fname, "w") as f:
>       f.write(template % {
>         # keys must match the %(string)s and %(int)i placeholders
>         "string": random.choice(wordlist),
>         "int": random.randint(0, 1000),
>         })

Just a quick reminder: if the data is user-provided you have to sanitize it,
otherwise a crafted value can override other keys in the result:

>>> import json
>>> template = """{"access": "restricted", "user": "%(user)s"}"""
>>> json.loads(template % dict(user="""tim", "access": "unlimited"""))
{'user': 'tim', 'access': 'unlimited'}

That can't happen when you load the template, replace some keys and dump the 
result:

>>> template = json.loads("""{"access": "restricted", "user": "placeholder"}""")
>>> template["user"] = """tim", "access": "unlimited"""
>>> json.dumps(template)
'{"user": "tim\\", \\"access\\": \\"unlimited", "access": "restricted"}'
>>> json.loads(_)
{'user': 'tim", "access": "unlimited', 'access': 'restricted'}
>>> _["access"]
'restricted'
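
Applied to the original task of writing many files, the load/replace/dump
approach might look roughly like the sketch below. The function name
write_files and its parameters (template_text, out_dir, how_many, words) are
made up for illustration, and the template is assumed to be a flat JSON object
with "some_string" and "some_int" keys, as in Tim's example.

```python
import json
import os
import random


def write_files(template_text, out_dir, how_many, words):
    """Write how_many JSON files derived from one parsed template.

    Hypothetical helper: the name and parameters are assumptions made
    for this sketch, not part of the original thread.
    """
    template = json.loads(template_text)  # parse once, reuse for every file
    os.makedirs(out_dir, exist_ok=True)
    for i in range(how_many):
        record = dict(template)  # shallow copy; enough for a flat template
        record["some_string"] = random.choice(words)
        record["some_int"] = random.randint(0, 1000)
        fname = os.path.join(out_dir, "data_%08i.json" % i)
        with open(fname, "w") as f:
            json.dump(record, f)  # the encoder escapes quotes inside values
```

A nested template would need copy.deepcopy() instead of dict() so that the
files do not share mutable sub-objects.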

I expect that performance will be dominated by I/O; if that's correct the 
extra work of serializing the JSON should not do much harm.
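
If you want to check the CPU side of that expectation on your own machine,
something like the following rough sketch compares the two ways of building
the text, without any file I/O. The helper names by_formatting and by_dumping
are invented here, and the numbers it prints will vary between systems, so
treat it as a starting point only.

```python
import json
import timeit

# Same data shape as the quoted example; both templates are assumptions.
template_text = '{"some_string": "%(string)s", "some_int": %(int)i}'
template = json.loads('{"some_string": "placeholder", "some_int": 0}')


def by_formatting(word, number):
    # Plain string formatting: fast, but does no escaping at all.
    return template_text % {"string": word, "int": number}


def by_dumping(word, number):
    # Copy the parsed template, set the values, serialize with json.
    record = dict(template)
    record["some_string"] = word
    record["some_int"] = number
    return json.dumps(record)


# Time 100000 calls of each variant and print the elapsed seconds.
for func in (by_formatting, by_dumping):
    seconds = timeit.timeit(lambda: func("spam", 42), number=100000)
    print("%s: %.3f s" % (func.__name__, seconds))
```

Comparing those figures with the time needed to create the same number of
files on the target filesystem shows whether I/O really dominates.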
