Parallelization of Python on GPU?

Started by	John Ladasky <john_ladasky@sbcglobal.net>
First post	2015-02-25 18:35 -0800
Last post	2015-02-26 21:54 +0100
Articles	17 — 7 participants

  Parallelization of Python on GPU? John Ladasky <john_ladasky@sbcglobal.net> - 2015-02-25 18:35 -0800
    Re: Parallelization of Python on GPU? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-26 14:02 +1100
      Re: Parallelization of Python on GPU? John Ladasky <john_ladasky@sbcglobal.net> - 2015-02-25 20:01 -0800
      Re: Parallelization of Python on GPU? Jason Swails <jason.swails@gmail.com> - 2015-02-26 10:06 -0500
      Re: Parallelization of Python on GPU? Sturla Molden <sturla.molden@gmail.com> - 2015-02-26 16:53 +0000
      Re: Parallelization of Python on GPU? Terry Reedy <tjreedy@udel.edu> - 2015-02-26 12:16 -0500
      Re: Parallelization of Python on GPU? Jason Swails <jason.swails@gmail.com> - 2015-02-26 12:48 -0500
      Re: Parallelization of Python on GPU? Sturla Molden <sturla.molden@gmail.com> - 2015-02-26 22:10 +0100
      Re: Parallelization of Python on GPU? Jason Swails <jason.swails@gmail.com> - 2015-02-26 17:28 -0500
    Re: Parallelization of Python on GPU? Ethan Furman <ethan@stoneleaf.us> - 2015-02-25 19:03 -0800
    Re: Parallelization of Python on GPU? Ethan Furman <ethan@stoneleaf.us> - 2015-02-25 19:05 -0800
      Re: Parallelization of Python on GPU? John Ladasky <john_ladasky@sbcglobal.net> - 2015-02-25 21:53 -0800
        Re: Parallelization of Python on GPU? Christian Gollwitzer <auriocus@gmx.de> - 2015-02-27 19:55 +0100
    Re: Parallelization of Python on GPU? Jason Swails <jason.swails@gmail.com> - 2015-02-26 10:27 -0500
    Re: Parallelization of Python on GPU? Sturla Molden <sturla.molden@gmail.com> - 2015-02-26 16:40 +0000
      Re: Parallelization of Python on GPU? John Ladasky <john_ladasky@sbcglobal.net> - 2015-02-26 09:34 -0800
        Re: Parallelization of Python on GPU? Sturla Molden <sturla.molden@gmail.com> - 2015-02-26 21:54 +0100

#86459 — Parallelization of Python on GPU?

From	John Ladasky <john_ladasky@sbcglobal.net>
Date	2015-02-25 18:35 -0800
Subject	Parallelization of Python on GPU?
Message-ID	<82642f3a-49e8-4982-b135-66ffc04d67d9@googlegroups.com>

I've been working with machine learning for a while.  Many of the standard packages (e.g., scikit-learn) have fitting algorithms which run in single threads.  These algorithms are not themselves parallelized.  Perhaps, due to their unique mathematical requirements, they cannot be parallelized.

When one is investigating several potential models of one's data with various settings for free parameters, it is still sometimes possible to speed things up.  On a modern machine, one can use Python's multiprocessing.Pool to run separate instances of scikit-learn fits.  I am currently using ten of the twelve 3.3 GHz CPU cores on my machine to do just that.  And I can still browse the web with no observable lag.  :^)
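
A minimal sketch of the pattern described above, assuming a hypothetical `fit_one` function that stands in for a single scikit-learn fit. The parameter names `C` and `epsilon`, the grid values, and the toy scoring formula are illustrative assumptions, not taken from the original code.

```python
import itertools
from multiprocessing import Pool


def fit_one(params):
    """Stand-in for one model fit; returns (params, score).

    Hypothetical placeholder: a real version would fit a scikit-learn
    estimator with these settings and return its validation score.
    """
    c, epsilon = params
    score = -((c - 10.0) ** 2) - (epsilon - 0.1) ** 2  # toy objective
    return params, score


def parameter_grid(c_values, epsilon_values):
    """Every (C, epsilon) combination, one tuple per independent fit."""
    return list(itertools.product(c_values, epsilon_values))


if __name__ == "__main__":
    grid = parameter_grid([1.0, 10.0, 100.0], [0.01, 0.1, 1.0])
    # Ten workers, leaving two of the twelve cores free for interactive use.
    with Pool(processes=10) as pool:
        results = pool.map(fit_one, grid)
    best_params, best_score = max(results, key=lambda r: r[1])
    print("best settings:", best_params)
```

Each fit is independent of the others, which is why this coarse-grained approach scales across CPU cores without any change to the fitting algorithm itself.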

Still, I'm waiting hours for jobs to finish.  Support vector regression fitting is hard.

What I would REALLY like to do is to take advantage of my GPU.  My NVidia graphics card has 1152 cores and a 1.0 GHz clock.  I wouldn't mind borrowing a few hundred of those GPU cores at a time, and see what they can do.  In theory, I calculate that I can speed up the job by another five-fold.

The trick is that each process would need to run some PYTHON code, not CUDA or OpenCL.  The child process code isn't particularly fancy.  (I should, for example, be able to switch that portion of my code to static typing.)

What is the most effective way to accomplish this task?

I came across a reference to a package called "Urutu" which may be what I need, however it doesn't look like it is widely supported.

I would love it if the Python developers themselves added the ability to spawn GPU processes to the Multiprocessing module!

Thanks for any advice and comments.

#86462

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2015-02-26 14:02 +1100
Message-ID	<54ee8ce2$0$11109$c3e8da3@news.astraweb.com>
In reply to	#86459

John Ladasky wrote:


> What I would REALLY like to do is to take advantage of my GPU.

I can't help you with that, but I would like to point out that GPUs 
typically don't support IEEE-754 maths, which means that while they are 
likely significantly faster, they're also likely significantly less 
accurate. And any two different brands/models of GPU are likely to give 
different results. (Possibly not *very* different, but considering the mess 
that floating point maths was prior to IEEE-754, possibly *very* different.)

Personally, I wouldn't trust GPU floating point for serious work. Maybe for 
quick and dirty exploration of the data, but I'd then want to repeat any 
calculations using the main CPU before using the numbers anywhere :-)



-- 
Steve

#86467

From	John Ladasky <john_ladasky@sbcglobal.net>
Date	2015-02-25 20:01 -0800
Message-ID	<459a9366-19ef-4f98-9087-e50430a8655e@googlegroups.com>
In reply to	#86462

On Wednesday, February 25, 2015 at 7:03:23 PM UTC-8, Steven D'Aprano wrote:

> I would like to point out that GPUs 
> typically don't support IEEE-754 maths, which means that while they are 
> likely significantly faster, they're also likely significantly less 
> accurate.

Historically, that has been true.  According to this document...

https://developer.nvidia.com/sites/default/files/akamai/cuda/files/NVIDIA-CUDA-Floating-Point.pdf

...NVidia's GPU cards which implement "compute capability" versions 2.0 and higher are IEEE-754 compliant, both for single- and double-precision floating point operations.

The current "compute capability" version is 5.2, so there are several generations of GPU hardware out there by now which should give satisfactory floating-point results.

#86510

From	Jason Swails <jason.swails@gmail.com>
Date	2015-02-26 10:06 -0500
Message-ID	<mailman.19262.1424967059.18130.python-list@python.org>
In reply to	#86462

On Thu, 2015-02-26 at 14:02 +1100, Steven D'Aprano wrote:
> John Ladasky wrote:
> 
> 
> > What I would REALLY like to do is to take advantage of my GPU.
> 
> I can't help you with that, but I would like to point out that GPUs 
> typically don't support IEEE-754 maths, which means that while they are 
> likely significantly faster, they're also likely significantly less 
> accurate. And any two different brands/models of GPU are likely to give 
> different results. (Possibly not *very* different, but considering the mess 
> that floating point maths was prior to IEEE-754, possibly *very* different.)

This hasn't been true in NVidia GPUs manufactured since ca. 2008.

> Personally, I wouldn't trust GPU floating point for serious work. Maybe for 
> quick and dirty exploration of the data, but I'd then want to repeat any 
> calculations using the main CPU before using the numbers anywhere :-)

There is a *huge* dash toward GPU computing in the scientific computing
sector.  Since I started as a graduate student in computational
chemistry/physics in 2008, I watched as state-of-the-art supercomputers
running tens of thousands to hundreds of thousands of cores were
overtaken in performance by a $500 GPU (today the GTX 780 or 980) you
can put in a desktop.  I went from running all of my calculations on a
CPU cluster in 2009 to running 90% of my calculations on a GPU by the
time I graduated in 2013... and for people without as ready access to
supercomputers as myself the move was even more pronounced.

This work is very serious, and numerical precision is typically of
immense importance.  See, e.g.,
http://www.sciencedirect.com/science/article/pii/S0010465512003098 and
http://pubs.acs.org/doi/abs/10.1021/ct400314y

In our software, we can run simulations on a GPU or a CPU and the
results are *literally* indistinguishable.  The transition to GPUs was
accompanied by a series of studies that investigated precisely your
concerns... we would never have started using GPUs if we didn't trust
GPU numbers as much as we did from the CPU.

And NVidia is embracing this revolution (obviously) -- they are putting
a lot of time, effort, and money into ensuring the success of GPU high
performance computing.  It is here to stay in the immediate future, and
refusing to use the technology will leave those that *could* benefit
from it at a severe disadvantage. (That said, GPUs aren't good at
everything, and CPUs are also here to stay.)

And GPU performance gains are outpacing CPU performance gains -- I've
seen about two orders of magnitude improvement in computational
throughput over the past 6 years through the introduction of GPU
computing and improvements in GPU hardware.

All the best,
Jason

-- 
Jason M. Swails
BioMaPS,
Rutgers University
Postdoctoral Researcher

#86517

From	Sturla Molden <sturla.molden@gmail.com>
Date	2015-02-26 16:53 +0000
Message-ID	<mailman.19272.1424969625.18130.python-list@python.org>
In reply to	#86462

GPU computing is great if you have the following:

1. Your data structures are arrays of floating point numbers.
2. You have a data-parallel problem.
3. You are happy with single precision.
4. You have time to code everything in CUDA or OpenCL.
5. You have enough video RAM to store your data.

For Python the easiest solution is to use Numba Pro.

Sturla

#86522

From	Terry Reedy <tjreedy@udel.edu>
Date	2015-02-26 12:16 -0500
Message-ID	<mailman.19276.1424971039.18130.python-list@python.org>
In reply to	#86462

On 2/26/2015 10:06 AM, Jason Swails wrote:
> On Thu, 2015-02-26 at 14:02 +1100, Steven D'Aprano wrote:
>> John Ladasky wrote:
>>
>>
>>> What I would REALLY like to do is to take advantage of my GPU.
>>
>> I can't help you with that, but I would like to point out that GPUs
>> typically don't support IEEE-754 maths, which means that while they are
>> likely significantly faster, they're also likely significantly less
>> accurate. And any two different brands/models of GPU are likely to give
>> different results. (Possibly not *very* different, but considering the mess
>> that floating point maths was prior to IEEE-754, possibly *very* different.)
>
> This hasn't been true in NVidia GPUs manufactured since ca. 2008.
>
>> Personally, I wouldn't trust GPU floating point for serious work. Maybe for
>> quick and dirty exploration of the data, but I'd then want to repeat any
>> calculations using the main CPU before using the numbers anywhere :-)
>
> There is a *huge* dash toward GPU computing in the scientific computing
> sector.  Since I started as a graduate student in computational
> chemistry/physics in 2008, I watched as state-of-the-art supercomputers
> running tens of thousands to hundreds of thousands of cores were
> overtaken in performance by a $500 GPU (today the GTX 780 or 980) you
> can put in a desktop.  I went from running all of my calculations on a
> CPU cluster in 2009 to running 90% of my calculations on a GPU by the
> time I graduated in 2013... and for people without as ready access to
> supercomputers as myself the move was even more pronounced.
>
> This work is very serious, and numerical precision is typically of
> immense importance.  See, e.g.,
> http://www.sciencedirect.com/science/article/pii/S0010465512003098 and
> http://pubs.acs.org/doi/abs/10.1021/ct400314y
>
> In our software, we can run simulations on a GPU or a CPU and the
> results are *literally* indistinguishable.  The transition to GPUs was
> accompanied by a series of studies that investigated precisely your
> concerns... we would never have started using GPUs if we didn't trust
> GPU numbers as much as we did from the CPU.
>
> And NVidia is embracing this revolution (obviously) -- they are putting
> a lot of time, effort, and money into ensuring the success of GPU high
> performance computing.  It is here to stay in the immediate future, and
> refusing to use the technology will leave those that *could* benefit
> from it at a severe disadvantage. (That said, GPUs aren't good at
> everything, and CPUs are also here to stay.)
>
> And GPU performance gains are outpacing CPU performance gains -- I've
> seen about two orders of magnitude improvement in computational
> throughput over the past 6 years through the introduction of GPU
> computing and improvements in GPU hardware.

Thanks for the update.

-- 
Terry Jan Reedy

#86525

From	Jason Swails <jason.swails@gmail.com>
Date	2015-02-26 12:48 -0500
Message-ID	<mailman.19278.1424972875.18130.python-list@python.org>
In reply to	#86462

On Thu, 2015-02-26 at 16:53 +0000, Sturla Molden wrote:
> GPU computing is great if you have the following:
> 
> 1. Your data structures are arrays of floating point numbers.

It actually works equally great, if not better, for integers.

> 2. You have a data-parallel problem.

This is the biggest one, IMO. ^^^

> 3. You are happy with single precision.

NVidia GPUs have had double-precision maths in hardware since compute
capability 1.3 (GTX 280).  That's ca. 2008.  Even in optimized CPU code,
you can get up to a ~50% benefit going from double to single precision
(it's rarely that high, but 20-30% is commonplace in my experience of
optimized code).  It's admittedly a bigger hit on most GPUs, but there
are ways to work around it (e.g., fixed precision), and you can still do
double precision work where it's needed.  One of the articles I linked
previously demonstrates that a hybrid precision model (based on fixed
precision) provides exactly the same numerical stability as double
precision (which is much better than pure single precision) for that
application.

Double precision can often be avoided in many parts of a calculation,
using it only where those bits matter (like accumulators with
potentially small contributions, subtractions of two numbers of similar
magnitude, etc.).
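
The accumulator case can be illustrated with a small CPU-only sketch. It emulates single precision by rounding through `struct`, and compares a single-precision accumulator against a double-precision one fed the same single-precision inputs. The helper names (`to_float32`, `sum_single`, `sum_mixed`) and the choice of input are assumptions made for illustration; this is not the fixed-precision scheme from the linked articles.

```python
import math
import struct


def to_float32(x):
    """Round a Python float (double) to the nearest IEEE-754 single."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_single(values):
    """Accumulate in single precision: round after every addition."""
    total = 0.0
    for v in values:
        total = to_float32(total + v)
    return total


def sum_mixed(values):
    """Single-precision inputs, double-precision accumulator."""
    total = 0.0
    for v in values:
        total += v
    return total


# Many small contributions of equal size, stored in single precision.
values = [to_float32(0.1)] * 100_000
reference = math.fsum(values)  # correctly rounded sum of the inputs
error_single = abs(sum_single(values) - reference)
error_mixed = abs(sum_mixed(values) - reference)
print("single accumulator error:", error_single)
print("double accumulator error:", error_mixed)
```

Only the accumulator is widened here; the inputs stay in single precision, which is the sense in which double precision is spent "only where those bits matter".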

> 4. You have time to code everything in CUDA or OpenCL.

This is the second biggest one, IMO. ^^^

> 5. You have enough video RAM to store your data.

Again, it can be worked around, but if you can't fit everything on the
GPU, the frequent GPU->CPU transfers involved can have a devastating
effect on performance, and limiting them is painstaking work.

> 
> For Python the easiest solution is to use Numba Pro.

Agreed, although I've never actually tried PyCUDA before...

All the best,
Jason

#86545

From	Sturla Molden <sturla.molden@gmail.com>
Date	2015-02-26 22:10 +0100
Message-ID	<mailman.19289.1424985028.18130.python-list@python.org>
In reply to	#86462

On 26/02/15 18:48, Jason Swails wrote:
> On Thu, 2015-02-26 at 16:53 +0000, Sturla Molden wrote:
>> GPU computing is great if you have the following:
>>
>> 1. Your data structures are arrays of floating point numbers.
>
> It actually works equally great, if not better, for integers.

Right, but not complicated data structures with a lot of references or 
pointers. It requires that data be laid out in regular arrays, and then it 
acts on these arrays in a data-parallel manner. It is designed to 
process vertices in parallel for computer graphics, and that is a 
limitation which is always there. It is not a CPU with 1024 cores. It is 
a "floating point monster" which can process 1024 vectors in parallel. 
You write a tiny kernel in a C-like language (CUDA, OpenCL) to process 
one vector, and then it will apply the kernel to all the vectors in an 
array of vectors. It is very comparable to how GLSL and Direct3D vertex 
and fragment shaders work. (The reason for which should be obvious.) The 
GPU is actually great for a lot of things in science, but it is not a 
CPU. The biggest mistake in the GPGPU hype is the idea that the GPU will 
behave like a CPU with many cores.
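
The kernel-per-element model described above can be emulated on the CPU in plain Python. This is only a sketch of the programming model, not of any CUDA or OpenCL API: `launch` and `scale_and_offset_kernel` are hypothetical names, and the loop runs sequentially where a GPU would run the calls in parallel.

```python
def scale_and_offset_kernel(i, xs, ys, out):
    """Work done for ONE element; a GPU would run one thread per index."""
    out[i] = 2.0 * xs[i] + ys[i]


def launch(kernel, n, *arrays):
    """Apply `kernel` at every index 0..n-1.

    Sequential here; on a GPU these calls run in parallel, which is only
    valid because no index depends on the result of another.
    """
    for i in range(n):
        kernel(i, *arrays)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(xs)
launch(scale_and_offset_kernel, len(xs), xs, ys, out)
print(out)
```

Note that the kernel touches only flat arrays and its own index. Anything that needs to follow references between objects, as a Python interpreter does constantly, does not fit this shape.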

Sturla

#86552

From	Jason Swails <jason.swails@gmail.com>
Date	2015-02-26 17:28 -0500
Message-ID	<mailman.19294.1424989738.18130.python-list@python.org>
In reply to	#86462

On Thu, Feb 26, 2015 at 4:10 PM, Sturla Molden <sturla.molden@gmail.com>
wrote:

> On 26/02/15 18:48, Jason Swails wrote:
>
>> On Thu, 2015-02-26 at 16:53 +0000, Sturla Molden wrote:
>>
>>> GPU computing is great if you have the following:
>>>
>>> 1. Your data structures are arrays of floating point numbers.
>>>
>>
>> It actually works equally great, if not better, for integers.
>>
>
> Right, but not complicated data structures with a lot of references or
> pointers. It requires that data be laid out in regular arrays, and then it acts
> on these arrays in a data-parallel manner. It is designed to process
> vertices in parallel for computer graphics, and that is a limitation which
> is always there. It is not a CPU with 1024 cores. It is a "floating point
> monster" which can process 1024 vectors in parallel. You write a tiny
> kernel in a C-like language (CUDA, OpenCL) to process one vector, and then
> it will apply the kernel to all the vectors in an array of vectors. It is
> very comparable to how GLSL and Direct3D vertex and fragment shaders work.
> (The reason for which should be obvious.) The GPU is actually great for a
> lot of things in science, but it is not a CPU. The biggest mistake in the
> GPGPU hype is the idea that the GPU will behave like a CPU with many cores.

Very well summarized.  At least in my field, though, it is well-known that
GPUs are not 'uber-fast CPUs'.  Algorithms have been redesigned, programs
rewritten to take advantage of their architecture.  It has been a *massive*
investment of time and resources, but (unlike the Xeon Phi coprocessor [1])
has reaped most of its promised rewards.

--Jason

[1] I couldn't resist the jab.  At several times the cost of the top of the
line NVidia gaming card, the GPU is about 15-20x faster...

#86463

From	Ethan Furman <ethan@stoneleaf.us>
Date	2015-02-25 19:03 -0800
Message-ID	<mailman.19232.1424919818.18130.python-list@python.org>
In reply to	#86459

On 02/25/2015 06:35 PM, John Ladasky wrote:
> What I would REALLY like to do is to take advantage of my GPU.  My NVidia graphics
> card has 1152 cores and a 1.0 GHz clock.  I wouldn't mind borrowing a few hundred
> of those GPU cores at a time, and see what they can do.  In theory, I calculate
> that I can speed up the job by another five-fold.

Only free for academic use:

  https://developer.nvidia.com/how-to-cuda-python


Unsure, but it looks free to use:

  http://mathema.tician.de/software/pycuda/


and, of course, the StackOverflow question:

  http://stackoverflow.com/q/5957554/208880

--
~Ethan~

#86465

From	Ethan Furman <ethan@stoneleaf.us>
Date	2015-02-25 19:05 -0800
Message-ID	<mailman.19234.1424919987.18130.python-list@python.org>
In reply to	#86459

Oh, and this one:

  http://www.cs.toronto.edu/~tijmen/gnumpy.html

--
~Ethan~

#86478

From	John Ladasky <john_ladasky@sbcglobal.net>
Date	2015-02-25 21:53 -0800
Message-ID	<fecd3a22-21bb-42fc-97a4-bbfc54b7958d@googlegroups.com>
In reply to	#86465

Thanks for the various links, Ethan.  I have encountered PyCUDA before, but not the other options.

So far, I'm not seeing code examples which appear to do what I would like, which is simply to farm out one Python process to one GPU core.  The examples all appear to parallelize array operations.  I know, that's the easier way to break up a task.

I may have to bite the bullet and learn how to use this:

http://mklab.iti.gr/project/GPU-LIBSVM

#86583

From	Christian Gollwitzer <auriocus@gmx.de>
Date	2015-02-27 19:55 +0100
Message-ID	<mcqei5$45q$1@dont-email.me>
In reply to	#86478

Am 26.02.15 um 06:53 schrieb John Ladasky:
> Thanks for the various links, Ethan.  I have encountered PyCUDA before, but not the other options.
> 
> So far, I'm not seeing code examples which appear to do what I would like, which is simply to farm out one Python process to one GPU core.  The examples all appear to parallelize array operations.  I know, that's the easier way to break up a task.
> 
> I may have to bite the bullet and learn how to use this:
> 
> http://mklab.iti.gr/project/GPU-LIBSVM
> 

If you can get this to run on your machine, it will surely outperform
anything you could build yourself with a Python-CUDA bridge. GPU
programming is hard, and efficient GPU programming is really hard. To
get an impression, this talk shows how some changes to an OpenCL program
can improve the speed by 60x compared to a naive implementation:

http://web.archive.org/web/20101217181349/http://developer.amd.com/zones/OpenCLZone/Events/assets/Optimizations-ImageConvolution1.pdf

	Christian

#86507

From	Jason Swails <jason.swails@gmail.com>
Date	2015-02-26 10:27 -0500
Message-ID	<mailman.19260.1424964440.18130.python-list@python.org>
In reply to	#86459

On Wed, 2015-02-25 at 18:35 -0800, John Ladasky wrote:
> I've been working with machine learning for a while.  Many of the
> standard packages (e.g., scikit-learn) have fitting algorithms which
> run in single threads.  These algorithms are not themselves
> parallelized.  Perhaps, due to their unique mathematical requirements,
> they cannot be parallelized.
> 
> When one is investigating several potential models of one's data with
> various settings for free parameters, it is still sometimes possible
> to speed things up.  On a modern machine, one can use Python's
> multiprocessing.Pool to run separate instances of scikit-learn fits.
> I am currently using ten of the twelve 3.3 GHz CPU cores on my machine
> to do just that.  And I can still browse the web with no observable
> lag.  :^)
> 
> Still, I'm waiting hours for jobs to finish.  Support vector
> regression fitting is hard.
> 
> What I would REALLY like to do is to take advantage of my GPU.  My
> NVidia graphics card has 1152 cores and a 1.0 GHz clock.  I wouldn't
> mind borrowing a few hundred of those GPU cores at a time, and see
> what they can do.  In theory, I calculate that I can speed up the job
> by another five-fold.
> 
> The trick is that each process would need to run some PYTHON code, not
> CUDA or OpenCL.  The child process code isn't particularly fancy.  (I
> should, for example, be able to switch that portion of my code to
> static typing.)
> 
> What is the most effective way to accomplish this task?

GPU computing is a lot more than simply saying "run this on a GPU".  To
realize the performance gains promised by a GPU, you need to tailor your
algorithms to take advantage of their hardware... SIMD reigns supreme,
and thread divergence and branching are far more expensive than they
are in CPU computing.  So even if you decide to somehow translate your
Python code into a CUDA kernel, there is a good chance that you will be
woefully disappointed in the resulting speedup (or even more so if you
actually get a slowdown :)).  For example, a simple reduction is more
expensive on a GPU than it is on a CPU for small arrays.  A dot product,
for example, has a part that's super fast on the GPU (element-by-element
multiplication), and then a part that gets a lot slower (summing up all
elements of the resulting multiplication).  Each core on the GPU is a
lot slower than a CPU (which is why a 1000-CUDA-core GPU doesn't run
anywhere near 1000x faster than a CPU), so you really only get gains
when they can all work efficiently together.
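
The two stages of the dot product can be sketched in plain Python. The function name `dot_product_steps` and the pairwise (tree) reduction are assumptions chosen for illustration: the multiply stage is a single fully parallel step, while the sum needs roughly log2(n) dependent rounds, each with half as many active workers as the last.

```python
def dot_product_steps(a, b):
    """Return (dot product, number of reduction rounds).

    Assumes len(a) == len(b) >= 1.
    """
    # Stage 1: element-by-element multiply; every product is independent.
    partial = [x * y for x, y in zip(a, b)]

    # Stage 2: pairwise (tree) reduction; each round depends on the last.
    rounds = 0
    while len(partial) > 1:
        if len(partial) % 2:
            partial.append(0.0)  # pad so every element has a partner
        partial = [partial[i] + partial[i + 1]
                   for i in range(0, len(partial), 2)]
        rounds += 1
    return partial[0], rounds


value, rounds = dot_product_steps([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0])
print(value, rounds)
```

For small arrays the rounds of synchronization cost more than the arithmetic they save, which is consistent with a reduction being cheaper on the CPU in that regime.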

Another example -- matrix multiplies are *fast*.  Diagonalizations are
slow (which is why in my field where diagonalizations are common
requirements, they are often done on the CPU while *building* the matrix
is done on the GPU).
> 
> I came across a reference to a package called "Urutu" which may be
> what I need, however it doesn't look like it is widely supported.

Urutu seems to be built on PyCUDA and PyOpenCL (which are both written
by the same person, Andreas Kloeckner at UIUC in the United States).

Another package I would suggest looking into is numba, from Continuum
Analytics: https://github.com/numba/numba.  Unlike Urutu, their package
is built on LLVM and Python bindings they've written to implement
numpy-aware JIT capabilities.  I believe they also permit compiling down
to a GPU kernel through LLVM.  One downside I've experienced with that
package is that LLVM does not yet have a stable API (as I understand
it), so they often lag behind support for the latest versions of LLVM.
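
A minimal sketch of numba's JIT on the CPU, assuming numba may or may not be installed. If the import fails, a do-nothing fallback decorator (an assumption added so the sketch still runs) leaves the function as ordinary Python. Compiling to a GPU kernel is not shown here.

```python
try:
    from numba import njit  # real numba decorator, if installed
except ImportError:
    def njit(func):
        """Fallback so the sketch still runs (uncompiled) without numba."""
        return func


@njit
def sum_of_squares(n):
    """Plain Python loop; numba compiles it to machine code on first call."""
    total = 0.0
    for i in range(n):
        total += i * i
    return total


print(sum_of_squares(10))
```

The function body stays ordinary Python, which is the attraction for code that is already simple and numeric.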
> 
> I would love it if the Python developers themselves added the ability
> to spawn GPU processes to the Multiprocessing module!

I would be stunned if this actually happened.  If you're worried about
performance, you get at least an order of magnitude performance boost by
going to numpy or writing the kernel directly in C or Fortran.  CPython
itself just isn't structured to run on a GPU... maybe pypy will tackle
that at some point in the probably-distant future.

All the best,
Jason

-- 
Jason M. Swails
BioMaPS,
Rutgers University
Postdoctoral Researcher

#86513

From	Sturla Molden <sturla.molden@gmail.com>
Date	2015-02-26 16:40 +0000
Message-ID	<mailman.19268.1424968864.18130.python-list@python.org>
In reply to	#86459

If you are doing SVM regression with scikit-learn you are using libSVM.
There is a CUDA accelerated version of this C library here:
http://mklab.iti.gr/project/GPU-LIBSVM

You can presumably reuse the wrapping code from scikit-learn.

Sturla

#86524

From	John Ladasky <john_ladasky@sbcglobal.net>
Date	2015-02-26 09:34 -0800
Message-ID	<d6531afe-3e55-4141-a0db-67eed984da5d@googlegroups.com>
In reply to	#86513

On Thursday, February 26, 2015 at 8:41:26 AM UTC-8, Sturla Molden wrote:
> If you are doing SVM regression with scikit-learn you are using libSVM.
> There is a CUDA accelerated version of this C library here:
> http://mklab.iti.gr/project/GPU-LIBSVM
> 
> You can presumably reuse the wrapping code from scikit-learn.
> 
> Sturla

Hi Sturla,  I recognize your name from the scikit-learn mailing list.  

If you look a few posts above yours in this thread, I am aware of gpu-libsvm.  I don't know if I'm up to the task of reusing the scikit-learn wrapping code, but I am giving that option some serious thought.  It isn't clear to me that gpu-libsvm can handle both SVM and SVR, and I have need of both algorithms. 

My training data sets are around 5000 vectors long.  IF that graph on the gpu-libsvm web page is any indication of what I can expect from my own data (I note that they didn't specify the GPU card they're using), I might realize a 20x increase in speed.

#86543

From	Sturla Molden <sturla.molden@gmail.com>
Date	2015-02-26 21:54 +0100
Message-ID	<mailman.19287.1424984054.18130.python-list@python.org>
In reply to	#86524

On 26/02/15 18:34, John Ladasky wrote:

> Hi Sturla,  I recognize your name from the scikit-learn mailing list.
>
> If you look a few posts above yours in this thread, I am aware of gpu-libsvm.  I don't know if I'm up to the task of reusing the scikit-learn wrapping code, but I am giving that option some serious thought.  It isn't clear to me that gpu-libsvm can handle both SVM and SVR, and I have need of both algorithms.
>
> My training data sets are around 5000 vectors long.  IF that graph on the gpu-libsvm web page is any indication of what I can expect from my own data (I note that they didn't specify the GPU card they're using), I might realize a 20x increase in speed.

A GPU is a "floating point monster", not a CPU. It is not designed to 
run things like CPython. It is also only designed to run threads in 
parallel on its cores, not processes. And as you know, in Python there 
is something called the GIL. Further, the GPU has hard-wired fine-grained 
load scheduling for data-parallel problems (e.g. matrix multiplication 
for vertex processing in 3D graphics). It is not like a thread on a GPU 
is comparable to a thread on a CPU. It is more like a parallel work 
queue, with the kind of abstraction you find in Apple's GCD.

I don't think it is really doable to make something like CPython run with 
thousands of parallel instances on a GPU. A GPU is not designed for 
that. A GPU is great if you can pass millions of floating point vectors 
as items to the work queue, with a tiny amount of computation per item. 
It would be crippled if you passed a thousand CPython interpreters and 
expected them to do a lot of work.

Also, as it is libSVM that does the math in your case, you need to get 
libSVM to run on the GPU, not CPython.

In most cases the best hardware for parallel scientific computing 
(taking economy and flexibility into account) is a Linux cluster which 
supports MPI. You can then use mpi4py or Cython to use MPI from your 
Python code.

Sturla
