Re: optomizations

Date Tue, 23 Apr 2013 03:03:25 +0100
From MRAB <python@mrabarnett.plus.com>
Subject Re: optomizations
Newsgroups comp.lang.python
Message-ID <mailman.947.1366682596.3114.python-list@python.org> (permalink)


On 23/04/2013 02:19, Rodrick Brown wrote:
> I would like some feedback on possible solutions to make this script run
> faster.
> The system is pegged at 100% CPU and it takes a long time to complete.
>
>
> #!/usr/bin/env python
>
> import gzip
> import re
> import os
> import sys
> from datetime import datetime
> import argparse
>
> if __name__ == '__main__':
>      parser = argparse.ArgumentParser()
>      parser.add_argument('-f', dest='inputfile', type=str, help='data file to parse')
>      parser.add_argument('-o', dest='outputdir', type=str, default=os.getcwd(), help='Output directory')
>      args = parser.parse_args()
>
>      if len(sys.argv[1:]) < 1:
>          parser.print_usage()
>          sys.exit(-1)
>
>      print(args)
>      if args.inputfile and os.path.exists(args.inputfile):
>          try:
>              with gzip.open(args.inputfile) as datafile:
>                  for line in datafile:
>                      line = line.replace('mediacdn.xxx.com', 'media.xxx.com')
>                      line = line.replace('staticcdn.xxx.co.uk', 'static.xxx.co.uk')

These next 2 lines are duplicates; the second will have no effect (I
think!).

>                      line = line.replace('cdn.xxx', 'www.xxx')
>                      line = line.replace('cdn.xxx', 'www.xxx')

Won't the next line also do the work of the preceding 2 lines?

>                      line = line.replace('cdn.xx', 'www.xx')
>                      siteurl = line.split()[6].split('/')[2]
>                      line = re.sub(r'\bhttps?://%s\b' % siteurl, "", line, 1)
>
>                      (day, month, year, hour, minute, second) = (line.split()[3]).replace('[','').replace(':','/').split('/')
>                      datelog = '{} {} {}'.format(month, day, year)
>                      dateobj = datetime.strptime(datelog, '%b %d %Y')
>
>                      outfile = '{}{}{}_combined.log'.format(dateobj.year, dateobj.month, dateobj.day)
>                      outdir = (args.outputdir + os.sep + siteurl)
>
>                      if not os.path.exists(outdir):
>                          os.makedirs(outdir)
>
>                      with open(outdir + os.sep + outfile, 'w+') as outf:
>                          outf.write(line)
>
>          except IOError, err:
>              sys.stderr.write("Error unable to read or extract inputfile: {} {}\n".format(args.inputfile, err))
>              sys.exit(-1)
>

I wonder whether it'll make a difference if you read a chunk at a time
(datafile.read(chunk_size) + datafile.readline() to ensure you have
complete lines), perform the replacements on it (so that you're working 
on several lines in one go), and then split it into lines for further
processing.
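
A rough sketch of that chunked approach, written for Python 3 and not
measured against the original, so treat any speed-up as unconfirmed. The
names iter_chunks and process, and the 1 MiB default chunk size, are
made up for illustration. It assumes datafile yields text, e.g. from
gzip.open(path, 'rt').

```python
import io


def iter_chunks(datafile, chunk_size=1 << 20):
    """Yield blocks of text that always end on a line boundary."""
    while True:
        chunk = datafile.read(chunk_size)
        if not chunk:
            break
        # read() may stop mid-line; readline() fetches the rest of that line.
        chunk += datafile.readline()
        yield chunk


def process(datafile, chunk_size=1 << 20):
    """Apply the replacements once per chunk, then hand back single lines."""
    for chunk in iter_chunks(datafile, chunk_size):
        chunk = chunk.replace('mediacdn.xxx.com', 'media.xxx.com')
        chunk = chunk.replace('staticcdn.xxx.co.uk', 'static.xxx.co.uk')
        # This one also covers the two 'cdn.xxx' replacements.
        chunk = chunk.replace('cdn.xx', 'www.xx')
        # Note: splitlines() also splits on a few separators besides '\n'
        # (such as '\r'), which should not matter for ordinary log lines.
        for line in chunk.splitlines(True):
            yield line
```

Each line that comes out of process() can then go through the same
split/strptime/write steps as before.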

Another thing you could try is caching the result of parsing the date,
using (month, day, year) as the key and outfile as the value in a dict.

A third thing you could try is not writing a file for every line
(doesn't the 'w+' mode truncate the file?), but saving the output for
each chunk (see first suggestion) and then writing the files afterwards,
at the end of the chunk.
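
A sketch of the batching idea, with flush_pending and pending as
made-up names. It opens each file in 'a' (append) mode on the assumption
that every line for a given site and date should be kept; 'w+' does
truncate on open, so the script as posted would leave only the last
line written to each file.

```python
import os
from collections import defaultdict


def flush_pending(pending):
    """Append the buffered lines to their files, one open() per file."""
    for path, lines in pending.items():
        outdir = os.path.dirname(path)
        if outdir and not os.path.isdir(outdir):
            os.makedirs(outdir)
        # 'a' appends, so earlier chunks written to the same file survive.
        with open(path, 'a') as outf:
            outf.writelines(lines)
    pending.clear()


# While processing a chunk, collect lines per output path:
#     pending = defaultdict(list)
#     pending[outdir + os.sep + outfile].append(line)
# and call flush_pending(pending) once the chunk is done.
```

Because the files are appended to, running the script twice over the
same input would duplicate lines, so stale output has to be removed
first.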
