Path: csiph.com!usenet.pasdenom.info!news.redatomik.org!newsfeed.xs4all.nl!newsfeed4a.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
Subject: Re: Parallelization of Python on GPU?
From: Jason Swails <jason.swails@gmail.com>
To: python-list@python.org
Date: Thu, 26 Feb 2015 12:48:03 -0500
In-Reply-To: <1915907417446661989.682673sturla.molden-gmail.com@news.gmane.org>
References: <82642f3a-49e8-4982-b135-66ffc04d67d9@googlegroups.com> <54ee8ce2$0$11109$c3e8da3@news.astraweb.com> <1424963166.30927.73.camel@gmail.com> <1915907417446661989.682673sturla.molden-gmail.com@news.gmane.org>
Content-Type: text/plain; charset="UTF-8"
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.19278.1424972875.18130.python-list@python.org>
Lines: 48
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:86525

On Thu, 2015-02-26 at 16:53 +0000, Sturla Molden wrote:
> GPU computing is great if you have the following:
> 
> 1. Your data structures are arrays floating point numbers.

It actually works equally great, if not better, for integers.

> 2. You have a data-parallel problem.

This is the biggest one, IMO. ^^^

> 3. You are happy with single precision.

NVidia GPUs have double-precision maths in hardware since compute
capability 1.2 (GTX 280).  That's ca. 2008.  In optimized CPU code, you
still get ~50% benefit going from double to single precision (it's
rarely ever that high, but 20-30% is commonplace in my experience of
optimized code).  It's admittedly a bigger hit on most GPUs, but there
are ways to work around it (e.g., fixed precision), and you can still do
double precision work where it's needed.  One of the articles I linked
previously demonstrates that a hybrid precision model (based on fixed
precision) provides exactly the same numerical stability as double
precision (which is much better than pure single precision) for that
application.

Double precision can often be avoided in many parts of a calculation,
using it only where those bits matter (like accumulators with
potentially small contributions, subtractions of two numbers of similar
magnitude, etc.).

> 4. You have time to code erything in CUDA or OpenCL.

This is the second biggest one, IMO. ^^^

> 5. You have enough video RAM to store your data.

Again, it can be worked around, but the frequent GPU->CPU xfers involved
if you can't fit everything on the GPU can be painstaking to limit its
potentially devastating effects on performance.

> 
> For Python the easiest solution is to use Numba Pro.

Agreed, although I've never actually tried PyCUDA before...

All the best,
Jason