Path: csiph.com!news.swapon.de!fu-berlin.de!uni-berlin.de!not-for-mail
From: Michael Selik <michael.selik@gmail.com>
Newsgroups: comp.lang.python
Subject: Re: Drowning in a teacup?
Date: Sat, 02 Apr 2016 14:36:45 +0000
Lines: 116
Message-ID: <mailman.363.1459607817.28225.python-list@python.org>
References: <ndmlj2$pb0$1@gioia.aioe.org> <ndnm9v$j07$1@ger.gmane.org> <CAGgTfkO4HuzoRBgrGJ3ccqcD+VB4_7O-g19NUk9FaYSBpQ2HkA@mail.gmail.com> <ndnomv$mf5$1@ger.gmane.org> <ndo722$rv4$1@ger.gmane.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
In-Reply-To: <ndo722$rv4$1@ger.gmane.org>
Precedence: list
Xref: csiph.com comp.lang.python:106286

On Sat, Apr 2, 2016 at 6:32 AM Peter Otten <__peter__@web.de> wrote:

> Vito De Tullio wrote:
>
> > Michael Selik wrote:
> >
> >>> > I need to scan a list of strings. If one of the elements matches the
> >>> > beginning of a search keyword, that element needs to snap to the
> front
> >>> > of the list.
> >>>
> >>> I know this post regards the function passing, but, on you specific
> >>> problem,
> >>> can't you just ... sort the list with a custom key?
> >>
> >> If the number of matches is small relative to the size of the list, I'd
> >> expect the sort would be slower than most of the other suggestions.
> >
> > umm... I take the liberty to set up a little test
> >
> >     $ cat test_sort_prefix.py
> >     #!/usr/bin/env python3
> >
> >     from sys import argv
> >     from timeit import timeit
> >
> >
> >     def sort_in_place(l):
> >         def key(e):
> >             return not e.startswith('yes')
> >         l.sort(key=key)
> >
> >
> >     def original_solution(mylist):
> >         for i in range(len(mylist)):
> >             if(mylist[i].startswith('yes')):
> >                 mylist[:] = [mylist[i]] + mylist[:i] + mylist[i+1:]
>
> original_solution() reverses the order of the matches while sort_in_place()
> doesn't. If the order is not relevant I'd try something like
>
> def exchange(items):
>     pos = 0
>     for index, item in enumerate(items):
>         if item.startswith("yes"):
>             items[pos], items[index] = item, items[pos]
>             pos += 1
>
> >     def main():
> >         if argv[1] == 'sort_in_place':
> >             f = sort_in_place
> >         elif argv[1] == 'original_solution':
> >             f = original_solution
> >         else:
> >             raise Exception()
> >
> >         nomatches = int(argv[2])
> >         matches = int(argv[3])
> >
> >         l = ["no_" for _ in range(nomatches)] + ["yes_" for _ in
> > range(matches)]
>
> Python's timsort knows how to deal with (partially) sorted lists. I suggest
> that you add
>
> random.seed(42) # same list for all script invocations
> random.shuffle(l)
>
> here to spread the matches somewhat over the list.
>
> >         print(timeit("f(l)", number=100, globals={"f": f, "l": l}))
>
> Remember that f(l) modifies l, so all but the first invocation will see a
> list where all matches are at the beginning. In other words: in 99 out of
> 100 runs the list is already sorted.
>
> While adding the overhead of copying the list I would expect the results of
>
> timeit("f(l[:])", ...)
>
> to be more realistic.
>
> >     if __name__ == '__main__':
> >         main()
> >     $ ./test_sort_prefix.py sort_in_place 1000000 3
> >     33.260575089996564
> >     $ ./test_sort_prefix.py original_solution 1000000 3
> >     35.93009542999789
> >     $
> >
> >
> > in my tests there is no sensible difference between the two solutions...
> > and the number of matches is risible :)
>
> Indeed, as the number of matches rises your sort() approach will likely
> fare
> better.
>
> Your conclusions may be correct, but the benchmark to support them is
> flawed.
>
>
Option 1: scan and match, pop, insert
Scan and match is O(n)
pop is O(n) ... gotta shift everything over one after deleting
insert is O(n) on a list, better use deque.appendleft for O(1)

Option 2: map keyfunc, sort
Mapping the keyfunction is O(n)
(tim)Sorting is O(n log n) worst case, O(n) best case.

Option 1 using a deque will be O(n).
Option 2 will be O(n log n) worst case and O(n) best case.

On small datasets the constant factors matter, but then it's small, so who
cares about speed? (partially joking)