Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #76495 > unrolled thread
| Started by | Chiu Hsiang Hsu <wdv4758h@gmail.com> |
|---|---|
| First post | 2014-08-18 10:18 -0700 |
| Last post | 2014-08-19 18:00 -0600 |
| Articles | 9 — 4 participants |
Back to article view | Back to comp.lang.python
efficient partial sort in Python ? Chiu Hsiang Hsu <wdv4758h@gmail.com> - 2014-08-18 10:18 -0700
Re: efficient partial sort in Python ? Ian Kelly <ian.g.kelly@gmail.com> - 2014-08-18 12:48 -0600
Re: efficient partial sort in Python ? Dan Stromberg <drsalists@gmail.com> - 2014-08-18 14:42 -0700
Re: efficient partial sort in Python ? Chiu Hsiang Hsu <wdv4758h@gmail.com> - 2014-08-19 12:37 -0700
Re: efficient partial sort in Python ? Terry Reedy <tjreedy@udel.edu> - 2014-08-19 18:11 -0400
Re: efficient partial sort in Python ? Dan Stromberg <drsalists@gmail.com> - 2014-08-19 16:05 -0700
Re: efficient partial sort in Python ? Dan Stromberg <drsalists@gmail.com> - 2014-08-19 16:10 -0700
Re: efficient partial sort in Python ? Dan Stromberg <drsalists@gmail.com> - 2014-08-19 16:22 -0700
Re: efficient partial sort in Python ? Ian Kelly <ian.g.kelly@gmail.com> - 2014-08-19 18:00 -0600
| From | Chiu Hsiang Hsu <wdv4758h@gmail.com> |
|---|---|
| Date | 2014-08-18 10:18 -0700 |
| Subject | efficient partial sort in Python ? |
| Message-ID | <51dfbe9b-f6e0-4532-bc2d-e7ce2fc282b5@googlegroups.com> |
I know that Python use Timsort as default sorting algorithm and it is efficient, but I just wanna have a partial sorting (n-largest/smallest elements). In the current state, I can only find this kind of functions from heapq, but it's too slow for this problem, the pure c sorted function can easily beat heapq.nlargest even though it sort all the elements. I think it would be better if we have an effeicient partial sorting function, maybe it could be like this : partial_sorted(data, n), data.partial_sort(n) Does anyone have any more information for partial sorting ? Btw, C++ has a partial_sort function in <algorithm>, but I don't know what algorithm it really use.
[toc] | [next] | [standalone]
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2014-08-18 12:48 -0600 |
| Message-ID | <mailman.13109.1408387733.18130.python-list@python.org> |
| In reply to | #76495 |
On Mon, Aug 18, 2014 at 11:18 AM, Chiu Hsiang Hsu <wdv4758h@gmail.com> wrote: > I know that Python use Timsort as default sorting algorithm and it is efficient, > but I just wanna have a partial sorting (n-largest/smallest elements). > > In the current state, I can only find this kind of functions from heapq, > but it's too slow for this problem, > the pure c sorted function can easily beat heapq.nlargest even though it sort all the elements. > > I think it would be better if we have an effeicient partial sorting function, > maybe it could be like this : partial_sorted(data, n), data.partial_sort(n) > > Does anyone have any more information for partial sorting ? > > Btw, C++ has a partial_sort function in <algorithm>, but I don't know what algorithm it really use. According to http://www.sgi.com/tech/stl/partial_sort.html the STL implementation also uses heapsort. For better partial sorting algorithms, try Wikipedia: http://en.wikipedia.org/wiki/Partial_sorting However, I think you're still not likely to beat Timsort for smallish data sets unless you also implement it in C.
[toc] | [prev] | [next] | [standalone]
| From | Dan Stromberg <drsalists@gmail.com> |
|---|---|
| Date | 2014-08-18 14:42 -0700 |
| Message-ID | <mailman.13116.1408398154.18130.python-list@python.org> |
| In reply to | #76495 |
On Mon, Aug 18, 2014 at 10:18 AM, Chiu Hsiang Hsu <wdv4758h@gmail.com> wrote: > I know that Python use Timsort as default sorting algorithm and it is efficient, > but I just wanna have a partial sorting (n-largest/smallest elements). Perhaps heapq with Pypy? Or with nuitka? Or with numba?
[toc] | [prev] | [next] | [standalone]
| From | Chiu Hsiang Hsu <wdv4758h@gmail.com> |
|---|---|
| Date | 2014-08-19 12:37 -0700 |
| Message-ID | <3fb3b4d1-a7e2-4912-a878-7d5e1798aee6@googlegroups.com> |
| In reply to | #76518 |
On Tuesday, August 19, 2014 5:42:27 AM UTC+8, Dan Stromberg wrote: > On Mon, Aug 18, 2014 at 10:18 AM, Chiu Hsiang Hsu <wdv4758h@gmail.com> wrote: > > > I know that Python use Timsort as default sorting algorithm and it is efficient, > > > but I just wanna have a partial sorting (n-largest/smallest elements). > > > > Perhaps heapq with Pypy? Or with nuitka? Or with numba? I heard of PyPy and numba before, but I doesn't know nuitka, thanks for your information. Another problem with heapq is the memory usage, it cost a lot of more memory with heapq in CPython (I test it in 3.4 with 1000000 float numbers) compare to sorted. For curiosity, there are many speed up solution in Python (like Cython, PyPy), I hasn't use Cython before, I guess PyPy is a more convient way to speed up current Python code (?), so how does Cython compare to PyPy ? (speed, code, flexibility, or anything else)
[toc] | [prev] | [next] | [standalone]
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2014-08-19 18:11 -0400 |
| Message-ID | <mailman.13168.1408486351.18130.python-list@python.org> |
| In reply to | #76602 |
On 8/19/2014 3:37 PM, Chiu Hsiang Hsu wrote: > On Tuesday, August 19, 2014 5:42:27 AM UTC+8, Dan Stromberg wrote: >> On Mon, Aug 18, 2014 at 10:18 AM, Chiu Hsiang Hsu >> <wdv4758h@gmail.com> wrote: >> >>> I know that Python use Timsort as default sorting algorithm and >>> it is efficient, Especially with data that is partially ordered already. >>> but I just wanna have a partial sorting (n-largest/smallest >>> elements). >> Perhaps heapq with Pypy? Or with nuitka? Or with numba? > I heard of PyPy and numba before, but I doesn't know nuitka, thanks > for your information. > > Another problem with heapq is the memory usage, it cost a lot of more > memory with heapq in CPython (I test it in 3.4 with 1000000 float > numbers) compare to sorted. > > For curiosity, there are many speed up solution in Python (like > Cython, PyPy), I hasn't use Cython before, I guess PyPy is a more > convient way to speed up current Python code (?), so how does Cython > compare to PyPy ? (speed, code, flexibility, or anything else) Why are you rejecting the previous answer: use .sort and pick out whatever items you want? -- Terry Jan Reedy
[toc] | [prev] | [next] | [standalone]
| From | Dan Stromberg <drsalists@gmail.com> |
|---|---|
| Date | 2014-08-19 16:05 -0700 |
| Message-ID | <mailman.13175.1408489553.18130.python-list@python.org> |
| In reply to | #76602 |
On Tue, Aug 19, 2014 at 12:37 PM, Chiu Hsiang Hsu <wdv4758h@gmail.com> wrote: > On Tuesday, August 19, 2014 5:42:27 AM UTC+8, Dan Stromberg wrote: >> On Mon, Aug 18, 2014 at 10:18 AM, Chiu Hsiang Hsu <wdv4758h@gmail.com> wrote: >> >> > I know that Python use Timsort as default sorting algorithm and it is efficient, >> >> > but I just wanna have a partial sorting (n-largest/smallest elements). >> >> >> >> Perhaps heapq with Pypy? Or with nuitka? Or with numba? > Another problem with heapq is the memory usage, it cost a lot of more memory with heapq in CPython (I test it in 3.4 with 1000000 float numbers) compare to sorted. This surprises me. I believe heapq probably keeps values in a python list with no extra references, by making node i's left child and right child be array elements 2*i and 2*i+1, respectively. A heap of some sort probably is best algorithmically. You're probably just up against a high constant. On the other hand, there are many kinds of heaps. > For curiosity, there are many speed up solution in Python (like Cython, PyPy), I hasn't use Cython before, > I guess PyPy is a more convient way to speed up current Python code (?), > so how does Cython compare to PyPy ? (speed, code, flexibility, or anything else) PyPy is really fast for CPU-intensive workloads, but CPython is better for I/O. I tested a single CPU-intensive microbenchmark of Cython and PyPy (also Jython and CPython). PyPy was fastest (http://stromberg.dnsalias.org/~strombrg/backshift/documentation/performance/index.html). I haven't yet compared numba or nuitka or Shedskin. When you use heapq, are you putting all the values in the heap, or just up to n at a time (evicting the worst value, one at a time as you go)? If you're doing the former, it's basically a heapsort which probably won't beat timsort. If you're doing the latter, that should be pretty good.
[toc] | [prev] | [next] | [standalone]
| From | Dan Stromberg <drsalists@gmail.com> |
|---|---|
| Date | 2014-08-19 16:10 -0700 |
| Message-ID | <mailman.13177.1408489863.18130.python-list@python.org> |
| In reply to | #76602 |
On Tue, Aug 19, 2014 at 4:05 PM, Dan Stromberg <drsalists@gmail.com> wrote: > When you use heapq, are you putting all the values in the heap, or > just up to n at a time (evicting the worst value, one at a time as you > go)? If you're doing the former, it's basically a heapsort which > probably won't beat timsort. If you're doing the latter, that should > be pretty good. It just occurred to me that evicting the highest value each step of the way might be expensive with a traditional heap (since the value at the largest array index isn't necessarily the largest). I've experimented with doing this using a treap a bit: http://stromberg.dnsalias.org/~strombrg/treap/ You might try that.
[toc] | [prev] | [next] | [standalone]
| From | Dan Stromberg <drsalists@gmail.com> |
|---|---|
| Date | 2014-08-19 16:22 -0700 |
| Message-ID | <mailman.13179.1408490569.18130.python-list@python.org> |
| In reply to | #76602 |
On Tue, Aug 19, 2014 at 4:10 PM, Dan Stromberg <drsalists@gmail.com> wrote: > On Tue, Aug 19, 2014 at 4:05 PM, Dan Stromberg <drsalists@gmail.com> wrote: >> When you use heapq, are you putting all the values in the heap, or >> just up to n at a time (evicting the worst value, one at a time as you >> go)? If you're doing the former, it's basically a heapsort which >> probably won't beat timsort. If you're doing the latter, that should >> be pretty good. > > It just occurred to me that evicting the highest value each step of > the way might be expensive with a traditional heap (since the value at > the largest array index isn't necessarily the largest). I've > experimented with doing this using a treap a bit: > http://stromberg.dnsalias.org/~strombrg/treap/ You might try that. Or more directly relevant: http://stromberg.dnsalias.org/~strombrg/highest/
[toc] | [prev] | [next] | [standalone]
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2014-08-19 18:00 -0600 |
| Message-ID | <mailman.13183.1408492909.18130.python-list@python.org> |
| In reply to | #76602 |
On Tue, Aug 19, 2014 at 5:05 PM, Dan Stromberg <drsalists@gmail.com> wrote: > When you use heapq, are you putting all the values in the heap, or > just up to n at a time (evicting the worst value, one at a time as you > go)? If you're doing the former, it's basically a heapsort which > probably won't beat timsort. If you're doing the latter, that should > be pretty good. The latter is what the heapq.nsmallest and heap.nlargest functions do. nsmallest uses a max heap, so the value being evicted is still found at index 0. Curious then if it's performing worse memory-wise than sorted.
[toc] | [prev] | [standalone]
Back to top | Article view | comp.lang.python
csiph-web