| Started by | Rodrick Brown <rodrick.brown@gmail.com> |
|---|---|
| First post | 2013-04-22 21:19 -0400 |
| Last post | 2013-04-24 00:52 +1000 |
| Articles | 14 — 8 participants |
optomizations Rodrick Brown <rodrick.brown@gmail.com> - 2013-04-22 21:19 -0400
Re: optomizations Roy Smith <roy@panix.com> - 2013-04-22 21:53 -0400
Re: optomizations Dan Stromberg <drsalists@gmail.com> - 2013-04-22 20:15 -0700
Re: optomizations Rodrick Brown <rodrick.brown@gmail.com> - 2013-04-23 00:20 -0400
Re: optomizations Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-04-23 04:38 +0000
Re: optomizations Chris Angelico <rosuav@gmail.com> - 2013-04-23 12:03 +1000
Re: optomizations Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-04-23 04:00 +0000
Re: optomizations Chris Angelico <rosuav@gmail.com> - 2013-04-23 14:08 +1000
percent faster than format()? (was: Re: optomizations) Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> - 2013-04-23 09:46 +0200
Re: percent faster than format()? (was: Re: optomizations) Chris “Kwpolska” Warrick <kwpolska@gmail.com> - 2013-04-23 10:26 +0200
Re: percent faster than format()? Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> - 2013-04-23 16:57 +0200
Re: percent faster than format()? Lele Gaifax <lele@metapensiero.it> - 2013-04-23 17:44 +0200
Re: percent faster than format()? (was: Re: optomizations) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-04-23 14:36 +0000
Re: percent faster than format()? (was: Re: optomizations) Chris Angelico <rosuav@gmail.com> - 2013-04-24 00:52 +1000
| From | Rodrick Brown <rodrick.brown@gmail.com> |
|---|---|
| Date | 2013-04-22 21:19 -0400 |
| Subject | optomizations |
| Message-ID | <mailman.944.1366680414.3114.python-list@python.org> |
I would like some feedback on possible solutions to make this script run
faster.
The system is pegged at 100% CPU and it takes a long time to complete.
#!/usr/bin/env python
import gzip
import re
import os
import sys
from datetime import datetime
import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-f', dest='inputfile', type=str,
                        help='data file to parse')
    parser.add_argument('-o', dest='outputdir', type=str,
                        default=os.getcwd(), help='Output directory')
    args = parser.parse_args()
    if len(sys.argv[1:]) < 1:
        parser.print_usage()
        sys.exit(-1)
    print(args)
    if args.inputfile and os.path.exists(args.inputfile):
        try:
            with gzip.open(args.inputfile) as datafile:
                for line in datafile:
                    line = line.replace('mediacdn.xxx.com', 'media.xxx.com')
                    line = line.replace('staticcdn.xxx.co.uk', 'static.xxx.co.uk')
                    line = line.replace('cdn.xxx', 'www.xxx')
                    line = line.replace('cdn.xxx', 'www.xxx')
                    line = line.replace('cdn.xx', 'www.xx')
                    siteurl = line.split()[6].split('/')[2]
                    line = re.sub(r'\bhttps?://%s\b' % siteurl, "", line, 1)
                    (day, month, year, hour, minute, second) = (
                        (line.split()[3]).replace('[','').replace(':','/').split('/'))
                    datelog = '{} {} {}'.format(month, day, year)
                    dateobj = datetime.strptime(datelog, '%b %d %Y')
                    outfile = '{}{}{}_combined.log'.format(dateobj.year,
                                                          dateobj.month, dateobj.day)
                    outdir = (args.outputdir + os.sep + siteurl)
                    if not os.path.exists(outdir):
                        os.makedirs(outdir)
                    with open(outdir + os.sep + outfile, 'w+') as outf:
                        outf.write(line)
        except IOError, err:
            sys.stderr.write("Error unable to read or extract inputfile: {} {}\n".format(args.inputfile, err))
            sys.exit(-1)
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-04-22 21:53 -0400 |
| Message-ID | <roy-A32AAF.21531122042013@news.panix.com> |
| In reply to | #44130 |
In article <mailman.944.1366680414.3114.python-list@python.org>,
Rodrick Brown <rodrick.brown@gmail.com> wrote:
> I would like some feedback on possible solutions to make this script run
> faster.
If I had to guess, I would think this stuff:
>     line = line.replace('mediacdn.xxx.com', 'media.xxx.com')
>     line = line.replace('staticcdn.xxx.co.uk', 'static.xxx.co.uk')
>     line = line.replace('cdn.xxx', 'www.xxx')
>     line = line.replace('cdn.xxx', 'www.xxx')
>     line = line.replace('cdn.xx', 'www.xx')
>     siteurl = line.split()[6].split('/')[2]
>     line = re.sub(r'\bhttps?://%s\b' % siteurl, "", line, 1)
You make 6 copies of every line. That's slow. But I'm also going to
quote something I wrote here a couple of months back:
> I've been doing some log analysis. It's been taking a grovelingly long
> time, so I decided to fire up the profiler and see what's taking so
> long. I had a pretty good idea of where the ONLY TWO POSSIBLE hotspots
> might be (looking up IP addresses in the geolocation database, or
> producing some pretty pictures using matplotlib). It was just a matter
> of figuring out which it was.
>
> As with most attempts to out-guess the profiler, I was totally,
> absolutely, and embarrassingly wrong.
So, my real advice to you is to fire up the profiler and see what it
says.
| From | Dan Stromberg <drsalists@gmail.com> |
|---|---|
| Date | 2013-04-22 20:15 -0700 |
| Message-ID | <mailman.948.1366686954.3114.python-list@python.org> |
| In reply to | #44133 |
On Mon, Apr 22, 2013 at 6:53 PM, Roy Smith <roy@panix.com> wrote:
> So, my real advice to you is to fire up the profiler and see what it
> says.
I agree. Fire up a line-oriented profiler and only then start trying to
improve the hot spots.
| From | Rodrick Brown <rodrick.brown@gmail.com> |
|---|---|
| Date | 2013-04-23 00:20 -0400 |
| Message-ID | <mailman.951.1366691420.3114.python-list@python.org> |
| In reply to | #44133 |
On Apr 22, 2013, at 11:18 PM, Dan Stromberg <drsalists@gmail.com> wrote:
> On Mon, Apr 22, 2013 at 6:53 PM, Roy Smith <roy@panix.com> wrote:
>> So, my real advice to you is to fire up the profiler and see what it
>> says.
> I agree. Fire up a line-oriented profiler and only then start trying to
> improve the hot spots.
Got a doc or URL I have no experience working with python profilers.
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-04-23 04:38 +0000 |
| Message-ID | <51761044$0$29872$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #44141 |
On Tue, 23 Apr 2013 00:20:59 -0400, Rodrick Brown wrote:
> Got a doc or URL I have no experience working with python profilers.
https://duckduckgo.com/html/?q=python%20profiler
This is also good:
http://pymotw.com/2/profile/
--
Steven
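For a first look, the standard library's cProfile module is often enough. The sketch below, in Python 3 syntax, profiles a stand-in function and prints the most expensive calls. The names `process_lines` and `profile_call` and the sample log line are hypothetical placeholders, not part of the original script.

```python
import cProfile
import io
import pstats


def process_lines(lines):
    """Hypothetical stand-in for the per-line work done by the script."""
    out = []
    for line in lines:
        line = line.replace('cdn.xx', 'www.xx')
        out.append(line.split()[0])
    return out


def profile_call(func, *args):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and show at most the ten costliest entries.
    stats.sort_stats('cumulative').print_stats(10)
    return result, stream.getvalue()


if __name__ == '__main__':
    sample = ['cdn.xxx.com GET /index.html'] * 10000
    _, report = profile_call(process_lines, sample)
    print(report)
```

A whole script can also be profiled without any code changes by running `python -m cProfile -s cumulative` followed by the script name and its usual arguments.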
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-04-23 12:03 +1000 |
| Message-ID | <mailman.966.1366707799.3114.python-list@python.org> |
| In reply to | #44133 |
On Tue, Apr 23, 2013 at 11:53 AM, Roy Smith <roy@panix.com> wrote:
> In article <mailman.944.1366680414.3114.python-list@python.org>,
> Rodrick Brown <rodrick.brown@gmail.com> wrote:
>
>> I would like some feedback on possible solutions to make this script run
>> faster.
>
> If I had to guess, I would think this stuff:
>
>>     line = line.replace('mediacdn.xxx.com', 'media.xxx.com')
>>     line = line.replace('staticcdn.xxx.co.uk', 'static.xxx.co.uk')
>>     line = line.replace('cdn.xxx', 'www.xxx')
>>     line = line.replace('cdn.xxx', 'www.xxx')
>>     line = line.replace('cdn.xx', 'www.xx')
>>     siteurl = line.split()[6].split('/')[2]
>>     line = re.sub(r'\bhttps?://%s\b' % siteurl, "", line, 1)
>
> You make 6 copies of every line. That's slow.
One of those is a regular expression substitution, which is also
likely to be a hot-spot. But definitely profile.
ChrisA
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-04-23 04:00 +0000 |
| Message-ID | <51760754$0$29872$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #44130 |
On Mon, 22 Apr 2013 21:19:23 -0400, Rodrick Brown wrote:
> I would like some feedback on possible solutions to make this script run
> faster.
> The system is pegged at 100% CPU and it takes a long time to complete.
Have you profiled the app to see where it is spending all its time?
What does "a long time" mean? For instance:
"It takes two hours to process a 15KB file" -- you have a problem.
"It takes 20 minutes to process a 15GB file" -- and why are you
complaining?
Or somewhere in the middle...
But before profiling, I suggest you clean up the program. For example:
    if args.inputfile and os.path.exists(args.inputfile):
Don't do that. There really isn't any point in checking whether the input
file exists, since:
1) Just because it exists doesn't mean you can read it;
2) Just because you can read it doesn't mean it is a valid gzip file;
3) Just because it is a valid gzip file that you can read *now*, doesn't
mean that it still will be in 10 milliseconds when you actually try to
open the file.
A lot can happen in 10ms, or 1ms. The file might be deleted, or
overwritten, or permissions changed. Change that to:
    try:
        with gzip.open(args.inputfile) as datafile:
            for line in datafile:
and catch the exception if the file doesn't exist, or cannot be read.
Which you already do, which just demonstrates that the call to
os.path.exists is a waste of effort.
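A minimal sketch of that approach in Python 3 syntax, using a hypothetical helper named `count_lines`. It attempts the open and the read directly and treats OSError as the single failure path; this covers a missing file and a permissions failure, and on the Python 3 versions I am aware of it also covers gzip's "not a gzipped file" error.

```python
import gzip
import sys


def count_lines(path):
    """Hypothetical helper: return the line count, or None on failure."""
    try:
        # No os.path.exists() check: just attempt the open and the read.
        with gzip.open(path, 'rt') as datafile:
            return sum(1 for _ in datafile)
    except OSError as err:
        sys.stderr.write('Error: unable to read {}: {}\n'.format(path, err))
        return None
```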
Then look for wasted effort like this:
    line = line.replace('cdn.xxx', 'www.xxx')
    line = line.replace('cdn.xx', 'www.xx')
Surely the first line is redundant, since it would be correctly caught
and replaced by the second?
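A quick check of that claim on a made-up sample line (the hostname and path are placeholders):

```python
sample = 'GET http://cdn.xxx.com/img/logo.png HTTP/1.1'

# Both replacements, in the order the script applies them.
with_both = sample.replace('cdn.xxx', 'www.xxx').replace('cdn.xx', 'www.xx')

# Only the shorter, more general pattern.
with_second_only = sample.replace('cdn.xx', 'www.xx')

print(with_both == with_second_only)  # prints True
```

Every occurrence of 'cdn.xxx' starts with 'cdn.xx', so the shorter pattern rewrites it to 'www.xx' plus the trailing 'x', which is the same result.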
Also, you're searching the file system *for every line* in the input
file. Pull this outside of the loop and have it run once:
    if not os.path.exists(outdir):
        os.makedirs(outdir)
Likewise for opening and closing the output file, which you currently
open and close for every line. It only needs to be opened and closed
once.
If it comes down to micro-optimizations to shave a few microseconds off,
consider using string % formatting rather than the format method. But
really, if you find yourself shaving microseconds off something that runs
for ten minutes, you have to ask why you're bothering.
--
Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-04-23 14:08 +1000 |
| Message-ID | <mailman.949.1366690116.3114.python-list@python.org> |
| In reply to | #44137 |
On Tue, Apr 23, 2013 at 2:00 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> Also, you're searching the file system *for every line* in the input
> file. Pull this outside of the loop and have it run once:
>
>     if not os.path.exists(outdir):
>         os.makedirs(outdir)
>
> Likewise for opening and closing the output file, which you currently
> open and close for every line. It only needs to be opened and closed
> once.
The outdir depends on the line, though. Hence my suggestion to retain
the open files in a dictionary.
ChrisA
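A rough sketch of that idea, in Python 3 syntax. The function name `split_by_key`, the `key_func` parameter, and the file naming are hypothetical; the point is only that each output file is opened once and then reused. It assumes the number of distinct output files stays below the operating system's open-file limit.

```python
import os


def split_by_key(lines, outputdir, key_func):
    """Write each line to a file chosen by key_func, opening each file once.

    key_func(line) must return a (subdirectory, filename) tuple; it is a
    hypothetical stand-in for the site and date parsing in the script.
    """
    open_files = {}  # maps (subdirectory, filename) -> open file object
    try:
        for line in lines:
            key = key_func(line)
            outf = open_files.get(key)
            if outf is None:
                outdir = os.path.join(outputdir, key[0])
                # exist_ok avoids a separate os.path.exists() check.
                os.makedirs(outdir, exist_ok=True)
                outf = open(os.path.join(outdir, key[1]), 'w')
                open_files[key] = outf
            outf.write(line)
    finally:
        # Close everything, even if handling a line raised an exception.
        for outf in open_files.values():
            outf.close()
    return len(open_files)
```

Because each file is opened only once per run, mode 'w' keeps every line written to it. Reopening the same path with 'w+' for each line truncates the file each time, so only the last line written would survive.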
| From | Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> |
|---|---|
| Date | 2013-04-23 09:46 +0200 |
| Subject | percent faster than format()? (was: Re: optomizations) |
| Message-ID | <dsqh4a-66p.ln1@satorlaser.homedns.org> |
| In reply to | #44137 |
On 23.04.2013 06:00, Steven D'Aprano wrote:
> If it comes down to micro-optimizations to shave a few microseconds off,
> consider using string % formatting rather than the format method.
Why? I don't see any obvious difference between the two...
Greetings!
Uli
| From | Chris “Kwpolska” Warrick <kwpolska@gmail.com> |
|---|---|
| Date | 2013-04-23 10:26 +0200 |
| Subject | Re: percent faster than format()? (was: Re: optomizations) |
| Message-ID | <mailman.964.1366705622.3114.python-list@python.org> |
| In reply to | #44165 |
On Tue, Apr 23, 2013 at 9:46 AM, Ulrich Eckhardt
<ulrich.eckhardt@dominolaser.com> wrote:
> On 23.04.2013 06:00, Steven D'Aprano wrote:
>>
>> If it comes down to micro-optimizations to shave a few microseconds off,
>> consider using string % formatting rather than the format method.
>
>
> Why? I don't see any obvious difference between the two...
>
>
> Greetings!
>
> Uli
>
$ python -m timeit "a = '{0} {1} {2}'.format(1, 2, 42)"
1000000 loops, best of 3: 0.824 usec per loop
$ python -m timeit "a = '%s %s %s' % (1, 2, 42)"
10000000 loops, best of 3: 0.0286 usec per loop
--
Kwpolska <http://kwpolska.tk> | GPG KEY: 5EAAEA16
stop html mail | always bottom-post
http://asciiribbon.org | http://caliburn.nl/topposting.html
| From | Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> |
|---|---|
| Date | 2013-04-23 16:57 +0200 |
| Subject | Re: percent faster than format()? |
| Message-ID | <m4ki4a-33r.ln1@satorlaser.homedns.org> |
| In reply to | #44167 |
On 23.04.2013 10:26, Chris “Kwpolska” Warrick wrote:
> On Tue, Apr 23, 2013 at 9:46 AM, Ulrich Eckhardt
> <ulrich.eckhardt@dominolaser.com> wrote:
>> On 23.04.2013 06:00, Steven D'Aprano wrote:
>>>
>>> If it comes down to micro-optimizations to shave a few microseconds off,
>>> consider using string % formatting rather than the format method.
>>
>>
>> Why? I don't see any obvious difference between the two...
[...]
>
> $ python -m timeit "a = '{0} {1} {2}'.format(1, 2, 42)"
> 1000000 loops, best of 3: 0.824 usec per loop
> $ python -m timeit "a = '%s %s %s' % (1, 2, 42)"
> 10000000 loops, best of 3: 0.0286 usec per loop
>
Well, I don't question that for at least some CPython implementations
one is faster than the other. I don't see a reason why one must be
faster than the other though. In other words, I don't understand where
the other one needs more time to achieve basically the same. To me, the
only difference is the syntax, but not greatly so.
So again, why is one faster than the other? What am I missing?
Uli
| From | Lele Gaifax <lele@metapensiero.it> |
|---|---|
| Date | 2013-04-23 17:44 +0200 |
| Subject | Re: percent faster than format()? |
| Message-ID | <mailman.979.1366731853.3114.python-list@python.org> |
| In reply to | #44189 |
Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> writes:
> So again, why is one faster than the other? What am I missing?
The .format() syntax is actually a function call, and that alone carries
some overhead. Even optimizing away the method lookup gains only a little:
>>> from timeit import Timer
>>> setup = "a = 'spam'; b = 'ham'; c = 'eggs'"
>>> t1 = Timer("'%s, %s and %s for breakfast' % (a, b, c)", setup)
>>> t2 = Timer("'{}, {} and {} for breakfast'.format(a, b, c)", setup)
>>> print(min(t1.repeat()))
0.3076407820044551
>>> print(min(t2.repeat()))
0.44008257299719844
>>> setup = "a = 'spam'; b = 'ham'; c = 'eggs'; f = '{}, {} and {} for breakfast'.format"
>>> t3 = Timer("f(a, b, c)", setup)
>>> print(min(t3.repeat()))
0.418146252995939
But building the call frame still takes its bit of time.
ciao, lele.
--
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
lele@metapensiero.it | -- Fortunato Depero, 1929.
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-04-23 14:36 +0000 |
| Subject | Re: percent faster than format()? (was: Re: optomizations) |
| Message-ID | <51769c74$0$29977$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #44165 |
On Tue, 23 Apr 2013 09:46:53 +0200, Ulrich Eckhardt wrote:
> On 23.04.2013 06:00, Steven D'Aprano wrote:
>> If it comes down to micro-optimizations to shave a few microseconds
>> off, consider using string % formatting rather than the format method.
>
> Why? I don't see any obvious difference between the two...
Possibly the state of the art has changed since then, but some years ago
% formatting was slightly faster than the format method. Let's try it and
see:
# Using Python 3.3.
py> from timeit import Timer
py> setup = "a = 'spam'; b = 'ham'; c = 'eggs'"
py> t1 = Timer("'%s, %s and %s for breakfast' % (a, b, c)", setup)
py> t2 = Timer("'{}, {} and {} for breakfast'.format(a, b, c)", setup)
py> print(min(t1.repeat()))
0.8319804421626031
py> print(min(t2.repeat()))
1.2395259491167963
Looks like the format method is about 50% slower.
--
Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-04-24 00:52 +1000 |
| Subject | Re: percent faster than format()? (was: Re: optomizations) |
| Message-ID | <mailman.976.1366728786.3114.python-list@python.org> |
| In reply to | #44183 |
On Wed, Apr 24, 2013 at 12:36 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> # Using Python 3.3.
>
> py> from timeit import Timer
> py> setup = "a = 'spam'; b = 'ham'; c = 'eggs'"
> py> t1 = Timer("'%s, %s and %s for breakfast' % (a, b, c)", setup)
> py> t2 = Timer("'{}, {} and {} for breakfast'.format(a, b, c)", setup)
> py> print(min(t1.repeat()))
> 0.8319804421626031
> py> print(min(t2.repeat()))
> 1.2395259491167963
>
>
> Looks like the format method is about 50% slower.
Figures on my hardware are (naturally) different, with a similar (but
slightly more pronounced) difference:
>>> sys.version
'3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 32 bit (Intel)]'
>>> print(min(t1.repeat()))
1.4841416995735415
>>> print(min(t2.repeat()))
2.5459869899666074
>>> t3 = Timer("a+', '+b+' and '+c+' for breakfast'", setup)
>>> print(min(t3.repeat()))
1.5707538248576327
>>> t4 = Timer("''.join([a, ', ', b, ' and ', c, ' for breakfast'])", setup)
>>> print(min(t4.repeat()))
1.5026834416105999
So on the face of it, format() is slower than everything else by a
good margin... until you note that repeat() is doing one million
iterations, so those figures are effectively in microseconds. Yeah, I
think I can handle a couple of microseconds.
ChrisA
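To make that conversion explicit, the following sketch divides each total by the iteration count. The variable names and the smaller `number` are choices made here, not taken from the posts, and no output is shown because timings vary by machine and Python version.

```python
from timeit import Timer

setup = "a = 'spam'; b = 'ham'; c = 'eggs'"
statements = {
    'percent': "'%s, %s and %s for breakfast' % (a, b, c)",
    'format': "'{}, {} and {} for breakfast'.format(a, b, c)",
}

number = 100000  # iterations per timing run (timeit's default is 1000000)

per_call_usec = {}
for name, stmt in statements.items():
    # repeat() returns the total seconds for each run of `number` iterations.
    best = min(Timer(stmt, setup).repeat(repeat=3, number=number))
    per_call_usec[name] = best / number * 1e6

for name, usec in per_call_usec.items():
    print('{}: {:.3f} usec per call'.format(name, usec))
```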