
optomizations

Started by	Rodrick Brown <rodrick.brown@gmail.com>
First post	2013-04-22 21:19 -0400
Last post	2013-04-24 00:52 +1000
Articles	14 — 8 participants


  optomizations Rodrick Brown <rodrick.brown@gmail.com> - 2013-04-22 21:19 -0400
    Re: optomizations Roy Smith <roy@panix.com> - 2013-04-22 21:53 -0400
      Re: optomizations Dan Stromberg <drsalists@gmail.com> - 2013-04-22 20:15 -0700
      Re: optomizations Rodrick Brown <rodrick.brown@gmail.com> - 2013-04-23 00:20 -0400
        Re: optomizations Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-04-23 04:38 +0000
      Re: optomizations Chris Angelico <rosuav@gmail.com> - 2013-04-23 12:03 +1000
    Re: optomizations Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-04-23 04:00 +0000
      Re: optomizations Chris Angelico <rosuav@gmail.com> - 2013-04-23 14:08 +1000
      percent faster than format()? (was: Re: optomizations) Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> - 2013-04-23 09:46 +0200
        Re: percent faster than format()? (was: Re: optomizations) Chris “Kwpolska” Warrick <kwpolska@gmail.com> - 2013-04-23 10:26 +0200
          Re: percent faster than format()? Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> - 2013-04-23 16:57 +0200
            Re: percent faster than format()? Lele Gaifax <lele@metapensiero.it> - 2013-04-23 17:44 +0200
        Re: percent faster than format()? (was: Re: optomizations) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-04-23 14:36 +0000
          Re: percent faster than format()? (was: Re: optomizations) Chris Angelico <rosuav@gmail.com> - 2013-04-24 00:52 +1000

#44130 — optomizations

From	Rodrick Brown <rodrick.brown@gmail.com>
Date	2013-04-22 21:19 -0400
Subject	optomizations
Message-ID	<mailman.944.1366680414.3114.python-list@python.org>


I would like some feedback on possible solutions to make this script run
faster.
The system is pegged at 100% CPU and it takes a long time to complete.


#!/usr/bin/env python

import gzip
import re
import os
import sys
from datetime import datetime
import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-f', dest='inputfile', type=str, help='data file to parse')
    parser.add_argument('-o', dest='outputdir', type=str, default=os.getcwd(), help='Output directory')
    args = parser.parse_args()

    if len(sys.argv[1:]) < 1:
        parser.print_usage()
        sys.exit(-1)

    print(args)
    if args.inputfile and os.path.exists(args.inputfile):
        try:
            with gzip.open(args.inputfile) as datafile:
                for line in datafile:
                    line = line.replace('mediacdn.xxx.com', 'media.xxx.com')
                    line = line.replace('staticcdn.xxx.co.uk', 'static.xxx.co.uk')
                    line = line.replace('cdn.xxx', 'www.xxx')
                    line = line.replace('cdn.xxx', 'www.xxx')
                    line = line.replace('cdn.xx', 'www.xx')
                    siteurl = line.split()[6].split('/')[2]
                    line = re.sub(r'\bhttps?://%s\b' % siteurl, "", line, 1)

                    (day, month, year, hour, minute, second) = (line.split()[3]).replace('[','').replace(':','/').split('/')
                    datelog = '{} {} {}'.format(month, day, year)
                    dateobj = datetime.strptime(datelog, '%b %d %Y')

                    outfile = '{}{}{}_combined.log'.format(dateobj.year, dateobj.month, dateobj.day)
                    outdir = (args.outputdir + os.sep + siteurl)

                    if not os.path.exists(outdir):
                        os.makedirs(outdir)

                    with open(outdir + os.sep + outfile, 'w+') as outf:
                        outf.write(line)

        except IOError, err:
            sys.stderr.write("Error unable to read or extract inputfile: {} {}\n".format(args.inputfile, err))
            sys.exit(-1)


#44133

From	Roy Smith <roy@panix.com>
Date	2013-04-22 21:53 -0400
Message-ID	<roy-A32AAF.21531122042013@news.panix.com>
In reply to	#44130

In article <mailman.944.1366680414.3114.python-list@python.org>,
 Rodrick Brown <rodrick.brown@gmail.com> wrote:

> I would like some feedback on possible solutions to make this script run
> faster.

If I had to guess, I would think this stuff:

>                     line = line.replace('mediacdn.xxx.com', 'media.xxx.com')
>                     line = line.replace('staticcdn.xxx.co.uk', 'static.xxx.co.uk')
>                     line = line.replace('cdn.xxx', 'www.xxx')
>                     line = line.replace('cdn.xxx', 'www.xxx')
>                     line = line.replace('cdn.xx', 'www.xx')
>                     siteurl = line.split()[6].split('/')[2]
>                     line = re.sub(r'\bhttps?://%s\b' % siteurl, "", line, 1)

You make 6 copies of every line.  That's slow.  But I'm also going to 
quote something I wrote here a couple of months back:

> I've been doing some log analysis.  It's been taking a grovelingly long 
> time, so I decided to fire up the profiler and see what's taking so 
> long.  I had a pretty good idea of where the ONLY TWO POSSIBLE hotspots 
> might be (looking up IP addresses in the geolocation database, or 
> producing some pretty pictures using matplotlib).  It was just a matter 
> of figuring out which it was. 
> 
> As with most attempts to out-guess the profiler, I was totally, 
> absolutely, and embarrassingly wrong. 

So, my real advice to you is to fire up the profiler and see what it 
says.
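
As a rough illustration of that advice, here is a minimal sketch of running a function under the standard library's cProfile and printing the most expensive calls. It assumes Python 3 (the original script is Python 2), and `process_lines` and `profile_call` are hypothetical names standing in for the per-line work in the script, not anything from the original post.

```python
import cProfile
import io
import pstats


def process_lines(lines):
    """Hypothetical stand-in for the per-line work in the original script."""
    out = []
    for line in lines:
        line = line.replace('mediacdn.xxx.com', 'media.xxx.com')
        line = line.replace('cdn.xx', 'www.xx')
        out.append(line)
    return out


def profile_call(func, *args):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # Sort by cumulative time and keep only the ten most expensive entries.
    stats.sort_stats('cumulative').print_stats(10)
    return result, buf.getvalue()


sample = ['GET http://cdn.xxx.com/a.png\n'] * 1000
result, report = profile_call(process_lines, sample)
print(report)
```

For a whole script, running `python -m cProfile -s cumulative script.py` produces a similar report without any code changes.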


#44135

From	Dan Stromberg <drsalists@gmail.com>
Date	2013-04-22 20:15 -0700
Message-ID	<mailman.948.1366686954.3114.python-list@python.org>
In reply to	#44133


On Mon, Apr 22, 2013 at 6:53 PM, Roy Smith <roy@panix.com> wrote:

>
> So, my real advice to you is to fire up the profiler and see what it
> says.


I agree.

Fire up a  line-oriented profiler and only then start trying to improve the
hot spots.


#44141

From	Rodrick Brown <rodrick.brown@gmail.com>
Date	2013-04-23 00:20 -0400
Message-ID	<mailman.951.1366691420.3114.python-list@python.org>
In reply to	#44133


On Apr 22, 2013, at 11:18 PM, Dan Stromberg <drsalists@gmail.com> wrote:

> On Mon, Apr 22, 2013 at 6:53 PM, Roy Smith <roy@panix.com> wrote:
>
>> So, my real advice to you is to fire up the profiler and see what it
>> says.
>
> I agree.
>
> Fire up a line-oriented profiler and only then start trying to improve the
> hot spots.

Got a doc or URL? I have no experience working with Python profilers.


#44142

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-04-23 04:38 +0000
Message-ID	<51761044$0$29872$c3e8da3$5496439d@news.astraweb.com>
In reply to	#44141

On Tue, 23 Apr 2013 00:20:59 -0400, Rodrick Brown wrote:

> Got a doc or URL I have no experience working with python profilers.


https://duckduckgo.com/html/?q=python%20profiler



This is also good:

http://pymotw.com/2/profile/



-- 
Steven


#44169

From	Chris Angelico <rosuav@gmail.com>
Date	2013-04-23 12:03 +1000
Message-ID	<mailman.966.1366707799.3114.python-list@python.org>
In reply to	#44133

On Tue, Apr 23, 2013 at 11:53 AM, Roy Smith <roy@panix.com> wrote:
> In article <mailman.944.1366680414.3114.python-list@python.org>,
>  Rodrick Brown <rodrick.brown@gmail.com> wrote:
>
>> I would like some feedback on possible solutions to make this script run
>> faster.
>
> If I had to guess, I would think this stuff:
>
>>                     line = line.replace('mediacdn.xxx.com', 'media.xxx.com')
>>                     line = line.replace('staticcdn.xxx.co.uk', 'static.xxx.co.uk')
>>                     line = line.replace('cdn.xxx', 'www.xxx')
>>                     line = line.replace('cdn.xxx', 'www.xxx')
>>                     line = line.replace('cdn.xx', 'www.xx')
>>                     siteurl = line.split()[6].split('/')[2]
>>                     line = re.sub(r'\bhttps?://%s\b' % siteurl, "", line, 1)
>
> You make 6 copies of every line.  That's slow.

One of those is a regular expression substitution, which is also
likely to be a hot-spot. But definitely profile.

ChrisA


#44137

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-04-23 04:00 +0000
Message-ID	<51760754$0$29872$c3e8da3$5496439d@news.astraweb.com>
In reply to	#44130

On Mon, 22 Apr 2013 21:19:23 -0400, Rodrick Brown wrote:

> I would like some feedback on possible solutions to make this script run
> faster.
> The system is pegged at 100% CPU and it takes a long time to complete.

Have you profiled the app to see where it is spending all its time?

What does "a long time" mean? For instance:

"It takes two hours to process a 15KB file" -- you have a problem.

"It takes 20 minutes to process a 15GB file" -- and why are you 
complaining?

Or somewhere in the middle... 

But before profiling, I suggest you clean up the program. For example:

        if args.inputfile and os.path.exists(args.inputfile):

Don't do that. There really isn't any point in checking whether the input 
file exists, since:

1) Just because it exists doesn't mean you can read it;

2) Just because you can read it doesn't mean it is a valid gzip file;

3) Just because it is a valid gzip file that you can read *now*, doesn't 
mean that it still will be in 10 milliseconds when you actually try to 
open the file.

A lot can happen in 10ms, or 1ms. The file might be deleted, or 
overwritten, or permissions changed. Change that to:

        try:
            with gzip.open(args.inputfile) as datafile:
                for line in datafile:

and catch the exception if the file doesn't exist, or cannot be read. 
Which you already do, which just demonstrates that the call to 
os.path.exists is a waste of effort. 
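
A minimal sketch of that "just try it" approach, assuming Python 3; `count_lines` is a hypothetical helper used only for illustration, and the temporary files exist only so the example is self-contained:

```python
import gzip
import os
import sys
import tempfile


def count_lines(path):
    """Open a gzip file directly and let a failure surface as an exception."""
    try:
        with gzip.open(path, 'rt') as datafile:
            return sum(1 for _ in datafile)
    except OSError as err:
        # OSError covers a missing file, bad permissions, and data that is
        # not gzip at all (gzip.BadGzipFile is an OSError subclass in 3.8+).
        sys.stderr.write('Unable to read {}: {}\n'.format(path, err))
        return None


tmpdir = tempfile.mkdtemp()
good = os.path.join(tmpdir, 'access.log.gz')
with gzip.open(good, 'wt') as f:
    f.write('one\ntwo\n')

bad = os.path.join(tmpdir, 'bad.gz')
with open(bad, 'w') as f:
    f.write('this is not gzip data')

print(count_lines(good))
print(count_lines(bad))
print(count_lines(os.path.join(tmpdir, 'missing.gz')))
```

No separate existence check is needed: all three failure modes listed above end up in the same `except` branch.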

Then look for wasted effort like this:

line = line.replace('cdn.xxx', 'www.xxx')
line = line.replace('cdn.xx', 'www.xx')

Surely the first line is redundant, since it would be correctly caught 
and replaced by the second?
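
A quick check of that claim, using made-up sample URLs (the hostnames are placeholders in the style of the original post): since 'cdn.xxx' begins with 'cdn.xx', the shorter replacement alone should give the same result as both together.

```python
# Hypothetical sample strings; only the 'cdn.xx' / 'cdn.xxx' parts matter.
samples = [
    'http://cdn.xxx.com/index.html',
    'http://cdn.xx.co.uk/logo.png',
    'http://www.example.org/untouched',
]

results = []
for s in samples:
    both = s.replace('cdn.xxx', 'www.xxx').replace('cdn.xx', 'www.xx')
    only_second = s.replace('cdn.xx', 'www.xx')
    # The longer pattern adds nothing: the shorter one already matches it.
    assert both == only_second
    results.append(only_second)

print(results)
```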

Also, you're searching the file system *for every line* in the input 
file. Pull this outside of the loop and have it run once:

                    if not os.path.exists(outdir):
                        os.makedirs(outdir)

Likewise for opening and closing the output file, which you currently 
open and close for every line. It only needs to be opened and closed 
once.

If it comes down to micro-optimizations to shave a few microseconds off, 
consider using string % formatting rather than the format method. But 
really, if you find yourself shaving microseconds off something that runs 
for ten minutes, you have to ask why you're bothering.

-- 
Steven


#44139

From	Chris Angelico <rosuav@gmail.com>
Date	2013-04-23 14:08 +1000
Message-ID	<mailman.949.1366690116.3114.python-list@python.org>
In reply to	#44137

On Tue, Apr 23, 2013 at 2:00 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> Also, you're searching the file system *for every line* in the input
> file. Pull this outside of the loop and have it run once:
>
>                     if not os.path.exists(outdir):
>                         os.makedirs(outdir)
>
> Likewise for opening and closing the output file, which you currently
> open and close for every line. It only needs to be opened and closed
> once.

The outdir depends on the line, though. Hence my suggestion to retain
the open files in a dictionary.
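
One way that idea could look, as a sketch only and assuming Python 3. `OutputFiles` is a hypothetical helper, not code from the thread. It opens each output path once, creates the directory at that point, and reuses the handle afterwards. It also uses append mode, which is an assumption about the intent: the original 'w+' mode truncates the file on every open, so only the last line written would survive.

```python
import os
import tempfile


class OutputFiles:
    """Keep one open handle per output path so each file is opened once."""

    def __init__(self):
        self._handles = {}

    def write(self, path, text):
        handle = self._handles.get(path)
        if handle is None:
            directory = os.path.dirname(path)
            if directory:
                # Runs once per new path instead of once per input line.
                os.makedirs(directory, exist_ok=True)
            handle = self._handles[path] = open(path, 'a')
        handle.write(text)

    def close(self):
        for handle in self._handles.values():
            handle.close()
        self._handles.clear()


base = tempfile.mkdtemp()
outputs = OutputFiles()
records = [
    ('www.xxx.com', '2013422_combined.log', 'first line\n'),
    ('media.xxx.com', '2013422_combined.log', 'other site\n'),
    ('www.xxx.com', '2013422_combined.log', 'second line\n'),
]
for site, name, line in records:
    outputs.write(os.path.join(base, site, name), line)
outputs.close()

with open(os.path.join(base, 'www.xxx.com', '2013422_combined.log')) as f:
    www_lines = f.read()
print(www_lines)
```

If the log covers a very large number of distinct sites and dates, the number of simultaneously open files could hit the operating system's limit, so a real version might need to close the least recently used handles.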

ChrisA


#44165 — percent faster than format()? (was: Re: optomizations)

From	Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com>
Date	2013-04-23 09:46 +0200
Subject	percent faster than format()? (was: Re: optomizations)
Message-ID	<dsqh4a-66p.ln1@satorlaser.homedns.org>
In reply to	#44137

Am 23.04.2013 06:00, schrieb Steven D'Aprano:
> If it comes down to micro-optimizations to shave a few microseconds off,
> consider using string % formatting rather than the format method.

Why? I don't see any obvious difference between the two...


Greetings!

Uli


#44167 — Re: percent faster than format()? (was: Re: optomizations)

From	Chris “Kwpolska” Warrick <kwpolska@gmail.com>
Date	2013-04-23 10:26 +0200
Subject	Re: percent faster than format()? (was: Re: optomizations)
Message-ID	<mailman.964.1366705622.3114.python-list@python.org>
In reply to	#44165

On Tue, Apr 23, 2013 at 9:46 AM, Ulrich Eckhardt
<ulrich.eckhardt@dominolaser.com> wrote:
> Am 23.04.2013 06:00, schrieb Steven D'Aprano:
>>
>> If it comes down to micro-optimizations to shave a few microseconds off,
>> consider using string % formatting rather than the format method.
>
>
> Why? I don't see any obvious difference between the two...
>
>
> Greetings!
>
> Uli
>

$ python -m timeit "a = '{0} {1} {2}'.format(1, 2, 42)"
1000000 loops, best of 3: 0.824 usec per loop
$ python -m timeit "a = '%s %s %s' % (1, 2, 42)"
10000000 loops, best of 3: 0.0286 usec per loop

--
Kwpolska <http://kwpolska.tk> | GPG KEY: 5EAAEA16
stop html mail                | always bottom-post
http://asciiribbon.org        | http://caliburn.nl/topposting.html


#44189 — Re: percent faster than format()?

From	Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com>
Date	2013-04-23 16:57 +0200
Subject	Re: percent faster than format()?
Message-ID	<m4ki4a-33r.ln1@satorlaser.homedns.org>
In reply to	#44167

Am 23.04.2013 10:26, schrieb Chris “Kwpolska” Warrick:
> On Tue, Apr 23, 2013 at 9:46 AM, Ulrich Eckhardt
> <ulrich.eckhardt@dominolaser.com> wrote:
>> Am 23.04.2013 06:00, schrieb Steven D'Aprano:
>>>
>>> If it comes down to micro-optimizations to shave a few microseconds off,
>>> consider using string % formatting rather than the format method.
>>
>>
>> Why? I don't see any obvious difference between the two...
[...]
>
> $ python -m timeit "a = '{0} {1} {2}'.format(1, 2, 42)"
> 1000000 loops, best of 3: 0.824 usec per loop
> $ python -m timeit "a = '%s %s %s' % (1, 2, 42)"
> 10000000 loops, best of 3: 0.0286 usec per loop
>

Well, I don't question that for at least some CPython implementations 
one is faster than the other. I don't see a reason why one must be 
faster than the other, though. In other words, I don't understand where 
the slower one spends the extra time to achieve basically the same 
result. To me, the only difference is the syntax, and even that 
difference is small.

So again, why is one faster than the other? What am I missing?

Uli


#44191 — Re: percent faster than format()?

From	Lele Gaifax <lele@metapensiero.it>
Date	2013-04-23 17:44 +0200
Subject	Re: percent faster than format()?
Message-ID	<mailman.979.1366731853.3114.python-list@python.org>
In reply to	#44189

Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> writes:

> So again, why is one faster than the other? What am I missing?

The .format() syntax is actually a function call, and that alone carries
some overhead. Even optimizing away the lookup gives only a small advantage:

>>> from timeit import Timer
>>> setup = "a = 'spam'; b = 'ham'; c = 'eggs'"
>>> t1 = Timer("'%s, %s and %s for breakfast' % (a, b, c)", setup)
>>> t2 = Timer("'{}, {} and {} for breakfast'.format(a, b, c)", setup)
>>> print(min(t1.repeat()))
0.3076407820044551
>>> print(min(t2.repeat()))
0.44008257299719844
>>> setup = "a = 'spam'; b = 'ham'; c = 'eggs'; f = '{}, {} and {} for breakfast'.format"
>>> t3 = Timer("f(a, b, c)", setup)
>>> print(min(t3.repeat()))
0.418146252995939

But building the call frame still takes its bit of time.

ciao, lele.
-- 
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
lele@metapensiero.it  |                 -- Fortunato Depero, 1929.


#44183 — Re: percent faster than format()? (was: Re: optomizations)

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-04-23 14:36 +0000
Subject	Re: percent faster than format()? (was: Re: optomizations)
Message-ID	<51769c74$0$29977$c3e8da3$5496439d@news.astraweb.com>
In reply to	#44165

On Tue, 23 Apr 2013 09:46:53 +0200, Ulrich Eckhardt wrote:

> Am 23.04.2013 06:00, schrieb Steven D'Aprano:
>> If it comes down to micro-optimizations to shave a few microseconds
>> off, consider using string % formatting rather than the format method.
> 
> Why? I don't see any obvious difference between the two...


Possibly the state of the art has changed since then, but some years ago 
% formatting was slightly faster than the format method. Let's try it and 
see:

# Using Python 3.3.

py> from timeit import Timer
py> setup = "a = 'spam'; b = 'ham'; c = 'eggs'"
py> t1 = Timer("'%s, %s and %s for breakfast' % (a, b, c)", setup)
py> t2 = Timer("'{}, {} and {} for breakfast'.format(a, b, c)", setup)
py> print(min(t1.repeat()))
0.8319804421626031
py> print(min(t2.repeat()))
1.2395259491167963


Looks like the format method is about 50% slower.



-- 
Steven


#44186 — Re: percent faster than format()? (was: Re: optomizations)

From	Chris Angelico <rosuav@gmail.com>
Date	2013-04-24 00:52 +1000
Subject	Re: percent faster than format()? (was: Re: optomizations)
Message-ID	<mailman.976.1366728786.3114.python-list@python.org>
In reply to	#44183

On Wed, Apr 24, 2013 at 12:36 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> # Using Python 3.3.
>
> py> from timeit import Timer
> py> setup = "a = 'spam'; b = 'ham'; c = 'eggs'"
> py> t1 = Timer("'%s, %s and %s for breakfast' % (a, b, c)", setup)
> py> t2 = Timer("'{}, {} and {} for breakfast'.format(a, b, c)", setup)
> py> print(min(t1.repeat()))
> 0.8319804421626031
> py> print(min(t2.repeat()))
> 1.2395259491167963
>
>
> Looks like the format method is about 50% slower.

Figures on my hardware are (naturally) different, with a similar (but
slightly more pronounced) difference:

>>> sys.version
'3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 32 bit (Intel)]'
>>> print(min(t1.repeat()))
1.4841416995735415
>>> print(min(t2.repeat()))
2.5459869899666074
>>> t3 = Timer("a+', '+b+' and '+c+' for breakfast'", setup)
>>> print(min(t3.repeat()))
1.5707538248576327
>>> t4 = Timer("''.join([a, ', ', b, ' and ', c, ' for breakfast'])", setup)
>>> print(min(t4.repeat()))
1.5026834416105999

So on the face of it, format() is slower than everything else by a
good margin... until you note that repeat() is doing one million
iterations, so those figures are effectively in microseconds. Yeah, I
think I can handle a couple of microseconds.
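
To make the "effectively in microseconds" arithmetic explicit, here is a small sketch that divides the best total time by the iteration count. The statements are the ones timed above; the smaller `number` and the `per_call_usec` dictionary are choices made for this illustration, and the figures will differ from machine to machine.

```python
from timeit import Timer

setup = "a = 'spam'; b = 'ham'; c = 'eggs'"
candidates = {
    'percent': "'%s, %s and %s for breakfast' % (a, b, c)",
    'format': "'{}, {} and {} for breakfast'.format(a, b, c)",
}

# repeat() defaults to one million iterations per run, which is why its
# totals in seconds can be read directly as microseconds per call. A smaller
# count is used here to keep the run short, so the division is done by hand.
number = 100000
per_call_usec = {}
for name, stmt in candidates.items():
    best = min(Timer(stmt, setup).repeat(repeat=3, number=number))
    per_call_usec[name] = best / number * 1e6
    print('{}: {:.3f} usec per call'.format(name, per_call_usec[name]))
```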

ChrisA

