Groups > comp.compilers > #2572 > unrolled thread

Parsing using a Graphics Processing Unit (GPU)?

Started by	Roger L Costello <costello@mitre.org>
First post	2020-08-31 10:35 +0000
Last post	2020-09-09 14:09 -0700
Articles	12 — 9 participants

Back to article view | Back to comp.compilers

  Parsing using a Graphics Processing Unit (GPU)? Roger L Costello <costello@mitre.org> - 2020-08-31 10:35 +0000
    Re: Parsing using a Graphics Processing Unit (GPU)? Christian Gollwitzer <auriocus@gmx.de> - 2020-09-01 09:22 +0200
      Re: Parsing using a Graphics Processing Unit (GPU)? Hans-Peter Diettrich <DrDiettrich1@netscape.net> - 2020-09-01 19:02 +0200
      Parsing using a Graphics Processing Unit (GPU)? Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2020-09-02 01:14 +0300
      Re: Parsing using a Graphics Processing Unit (GPU)? Jan Ziak <0xe2.0x9a.0x9b@gmail.com> - 2020-09-02 02:13 -0700
        Re: Parsing using a Graphics Processing Unit (GPU)? Hans-Peter Diettrich <DrDiettrich1@netscape.net> - 2020-09-02 22:34 +0200
    Re: Parsing using a Graphics Processing Unit (GPU)? "A. K." <minforth@arcor.de> - 2020-09-01 01:25 -0700
      Re: Parsing using a Graphics Processing Unit (GPU)? arnold@skeeve.com (Aharon Robbins) - 2020-09-02 05:43 +0000
    Re: Parsing using a Graphics Processing Unit (GPU)? Elijah Stone <elronnd@elronnd.net> - 2020-09-01 20:13 -0700
      Re: Parsing using a Graphics Processing Unit (GPU)? Roger L Costello <costello@mitre.org> - 2020-09-02 11:45 +0000
    Re: Parsing using a Graphics Processing Unit (GPU)? Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2020-09-02 17:57 +0300
    Re: Parsing using a Graphics Processing Unit (GPU)? gah4 <gah4@u.washington.edu> - 2020-09-09 14:09 -0700

#2572 — Parsing using a Graphics Processing Unit (GPU)?

From	Roger L Costello <costello@mitre.org>
Date	2020-08-31 10:35 +0000
Subject	Parsing using a Graphics Processing Unit (GPU)?
Message-ID	<20-09-001@comp.compilers>

Hi Folks,

I am reading a book [1] on machine learning and the book says some pretty
interesting things:

"In the search for more speed, machine learning researchers started taking
advantage of special hardware found in some computers, originally designed to
improve graphics performance. You may have heard these called graphics cards.
... Those graphics cards contain a GPU, or graphics processing unit. Unlike a
general purpose CPU, a GPU is designed to perform specific tasks, and do them
well. One of those tasks is to carry out arithmetic, including matrix
multiplication, in a highly parallel way. ... GPUs have many more [than CPUs]
arithmetic cores, thousands are fairly common today. This means a huge
workload can be split amongst all those cores and the job can be done
quickly."

Neat!

Has the parsing community found a way to take advantage of GPUs?

From the above excerpt, it appears that GPUs are especially good at
arithmetic. When I think of parsing, I don't think of lots of arithmetic.
Perhaps someone has devised a way to recast the parsing problem into an
arithmetic problem?

Any thoughts you might have on:

(a) parsing-using-GPUs, and
(b) recasting-the-parsing-problem-into-an-arithmetic-problem

would be appreciated.  /Roger

[1] "Make Your First GAN with Pytorch" by Tariq Rashid

[Parsing is not usually an important factor in compiler performance.
The slow parts are the lexer, because it has to look at every
character of the input, and some optimizations that have to analyze
the entire intermediate form of the program. The first step in lexing
is to identify what class each character is, e.g., identifier, white
space, or operator. Perhaps a GPU could do vector lookups to speed
that up. For optimizations, I can sort of imagine how some analyses
like reachability might be expressible as matrices. -John]

[toc] | [next] | [standalone]

#2573

From	Christian Gollwitzer <auriocus@gmx.de>
Date	2020-09-01 09:22 +0200
Message-ID	<20-09-002@comp.compilers>
In reply to	#2572

Am 31.08.20 um 12:35 schrieb Roger L Costello:
> Has the parsing community found a way to take advantage of GPUs?

I don't think that you can speed up parsing a lot using GPUs. GPUs are
generally good at parallel processing. A single thread is usually slower
than a CPU thread, and there is overhead, because they are not
self-contained - i.e. you can usually speed up only some part of a
program, and it needs to be uploaded to the GPU and downloaded back.
GPUs also have faster memory, *if* you access it either in big blocks or
as a serial stream, in which case the compiler can transform it to block
access. For random accesses, the memory is slower.

I have done some basic GPU programming, and I think that parsing is not
a parallel task in that sense. The parser reads the input as a stream of
tokens; you can't split the C file at some arbitrary point in half and
parse both parts independently.

	Christian

[toc] | [prev] | [next] | [standalone]

#2575

From	Hans-Peter Diettrich <DrDiettrich1@netscape.net>
Date	2020-09-01 19:02 +0200
Message-ID	<20-09-004@comp.compilers>
In reply to	#2573

Am 01.09.2020 um 09:22 schrieb Christian Gollwitzer:
> Am 31.08.20 um 12:35 schrieb Roger L Costello:
>> Has the parsing community found a way to take advantage of GPUs?


> I have done some basic GPU programming, and I think that parsing is not
> a parallel task in that sense. The parser reads the input as a stream of
> tokens;

and acts *differently* depending on that input, while a GPU performs
essentially the *same* operation to all elements of a stream (merge,
scale...).

> you can't split the C file at some arbitrary point in half and
> parse both parts independently.

Multiple files can be parsed in parallel just with a multi-core CPU.
More parallel modules require more symbol tables etc., what limits the
amount of concurrent threads by available (cached!) memory.

DoDi

[toc] | [prev] | [next] | [standalone]

#2576

From	Christopher F Clark <christopher.f.clark@compiler-resources.com>
Date	2020-09-02 01:14 +0300
Message-ID	<20-09-005@comp.compilers>
In reply to	#2573

Our esteemed moderator wrote:
[Parsing is not usually an important factor in compiler performance.
The slow parts are the lexer, because it has to look at every
character of the input, and some optimizations that have to analyze
the entire intermediate form of the program. The first step in lexing
is to identify what class each character is, e.g., identifier, white
space, or operator. Perhaps a GPU could do vector lookups to speed
that up. For optimizations, I can sort of imagine how some analyses
like reachability might be expressible as matrices. -John]

As to parallelizing lexers, we did some experiments on speeding up
regular expressions (mostly for Snort) while I was at Intel.   Charlie
Lasswell did experiments on starting parallel scans on different parts
of a stream or file that showed some promise.  The hyperscan team (aka
Sensory Networks, Geoff Langdale &co) had some interesting results on
when you could skip ahead and guess the correct state when using
NFAs--I believe they called it "and then a miracle happens" (following
the famous cartoon about a mathematical proof with some missing steps
on a whiteboard).  Michela Becchi worked on when you could predict
states involved in xFA loops would terminate.  Geoff had me do a mode
they called Maclellan (the HS folks had some interesting naming
conventions, one of them was the use of US Civil War generals for
major variations/modes) which was turning NFAs into parallelized DFAs
for Hyperscan.  The regular expression engine we put into the Cave
Creek chip, did a different form of parallel DFAs, spawning them when
certain conditions were met, and for things like the Aho-Corasick
algorithm you can show that there tends to be a "bushy" part near the
head of the automata, with "skinny" parts being forked off after a
small number of bytes--that was the key insight that led to our chip.

So, there are definite possibilities for doing SIMD regular expression
matching that would be suitable for GPU implementation.  Geoff and I
discussed how to approach it on numerous occasions.  It just never
quite got past the to-consider list.

Some of this "could" also be applied to parsing, particularly in a GLR
style parser, where you have forests.  But our moderator is correct in
saying the big boost would be in speeding up lexing.

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------

[toc] | [prev] | [next] | [standalone]

#2579

From	Jan Ziak <0xe2.0x9a.0x9b@gmail.com>
Date	2020-09-02 02:13 -0700
Message-ID	<20-09-008@comp.compilers>
In reply to	#2573

On Tuesday, September 1, 2020 at 6:03:27 PM UTC+2, Christian Gollwitzer wrote:
> The parser reads the input as a stream of
> tokens; you can't split the C file at some arbitrary point in half and
> parse both parts independently.

Of course you can split asm/C/C++/Go/Python/Rust/etc file at arbitrary
points: when the state of the lexical analyzer collapses to a single
state starting from a random file position with an arbitrary starting
state.

-atom

[toc] | [prev] | [next] | [standalone]

#2582

From	Hans-Peter Diettrich <DrDiettrich1@netscape.net>
Date	2020-09-02 22:34 +0200
Message-ID	<20-09-011@comp.compilers>
In reply to	#2579

Am 02.09.2020 um 11:13 schrieb Jan Ziak:
> On Tuesday, September 1, 2020 at 6:03:27 PM UTC+2, Christian Gollwitzer wrote:
>> The parser reads the input as a stream of
>> tokens; you can't split the C file at some arbitrary point in half and
>> parse both parts independently.
>
> Of course you can split asm/C/C++/Go/Python/Rust/etc file at arbitrary
> points:

Certainly not for C/C++. The preprocessor and included files modify the
original source. That's one of the reasons for the slow C/C++ compilation.

> when the state of the lexical analyzer collapses to a single
> state starting from a random file position with an arbitrary starting
> state.

If you mean the lexer by "lexical analysis", that's only a front end for
the parser. A compiler needs more state and context information like
symbol tables, which can not be built from arbitrary parts of a source file.

DoDi
[One can certainly pipeline the C preprocessor and later phases but I'm
guessing that's not what you're talking about here. -John]

[toc] | [prev] | [next] | [standalone]

#2574

From	"A. K." <minforth@arcor.de>
Date	2020-09-01 01:25 -0700
Message-ID	<20-09-003@comp.compilers>
In reply to	#2572

Am Dienstag, 1. September 2020 06:44:53 UTC+2 schrieb Roger L Costello:
> Has the parsing community found a way to take advantage of GPUs?

The old IT classic still holds: Early optimization is the root of all evil.

IOW: first identify your bottlenecks by profiling your parsing routines.
Repeat it.

Most probably bottlenecks will have to do with I/O speed and selected
algorithms. Memory access speed perhaps, CPU/GPU performance very behind.

[toc] | [prev] | [next] | [standalone]

#2578

From	arnold@skeeve.com (Aharon Robbins)
Date	2020-09-02 05:43 +0000
Message-ID	<20-09-007@comp.compilers>
In reply to	#2574

In article <20-09-003@comp.compilers>, A. K. <minforth@arcor.de> wrote:
>Am Dienstag, 1. September 2020 06:44:53 UTC+2 schrieb Roger L Costello:
>The old IT classic still holds: Early optimization is the root of all evil.

Jus to give credit where credit is due, it was Donald Knuth who said
"Premature optimization is the root of all evil."

IIRC it was in his famous "Structured Programming with GOTO Statements"
paper.

Arnold

[toc] | [prev] | [next] | [standalone]

#2577

From	Elijah Stone <elronnd@elronnd.net>
Date	2020-09-01 20:13 -0700
Message-ID	<20-09-006@comp.compilers>
In reply to	#2572

On Mon, 31 Aug 2020, Roger L Costello wrote:

> Any thoughts you might have on:
>
> (a) parsing-using-GPUs, and
> (b) recasting-the-parsing-problem-into-an-arithmetic-problem

Look up Aaron Hsu's Ph.D thesis, _A data parallel compiler hosted on the
GPU_ (https://scholarworks.iu.edu/dspace/handle/2022/24749).  As John (and
others) mention, I don't think the GPU is an especially interesting target
to speed up parsing specifically, but it may be a fruitful line of
inquiry.  If so then that thesis is, as far as I can tell, the only
research that has been done so far; maybe you can make your own
experiments based on it.

--
time flies like an arrow;
fruit flies like a banana

[toc] | [prev] | [next] | [standalone]

#2580

From	Roger L Costello <costello@mitre.org>
Date	2020-09-02 11:45 +0000
Message-ID	<20-09-009@comp.compilers>
In reply to	#2577

> Look up Aaron Hsu's Ph.D thesis,

> A data parallel compiler hosted on the GPU

> (https://scholarworks.iu.edu/dspace/handle/2022/24749)

Abstract:
This work describes a general, scalable method for building data-parallel by
construction tree transformations that exhibit simplicity, directness of
expression, and high-performance on both CPU and GPU architectures when
executed on either interpreted or compiled platforms across a wide range of
data sizes, as exemplified and expounded by the exposition of a complete
compiler for a lexically scoped, functionally oriented programming commercial
language. The entire source code to the compiler written in this method
requires only 17 lines of simple code compared to roughly 1000 lines of
equivalent code in the domain-specific compiler construction framework,
Nanopass, and requires no domain specific techniques, libraries, or
infrastructure support. It requires no sophisticated abstraction barriers to
retain its concision and simplicity of form. The execution performance of the
compiler scales along multiple dimensions: it consistently outperforms the
equivalent traditional compiler by orders of magnitude in memory usage and run
time at all data sizes and achieves this performance on both interpreted and
compiled platforms across CPU and GPU hardware using a single source code for
both architectures and no hardware-specific annotations or code. It does not
use any novel domain-specific inventions of technique or process, nor does it
use any sophisticated language or platform support. Indeed, the source does
not utilize branching, conditionals, if statements, pattern matching, ADTs,
recursions, explicit looping, or other non-trivial control or dispatch, nor
any specialized data models.
---------------------------------------------------
Wow!

Thanks for the reference Elijah!

/Roger

[toc] | [prev] | [next] | [standalone]

#2581

From	Christopher F Clark <christopher.f.clark@compiler-resources.com>
Date	2020-09-02 17:57 +0300
Message-ID	<20-09-010@comp.compilers>
In reply to	#2572

To follow up on the part of the question about using GPUs to do
arithmetic to implement DFA transitions.

It can "easily" be done.  The base ideas can be found in the Wu Mamber
string search algorithm.  You treat states like (usually one-hot) bits
in a fixed length (usually one word) bit vector.  Your transition
function then simply maps bit vectors to bit vectors.  You can do that
as a SIMD instruction, thus it is suitable for a GPU to execute.

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------

[toc] | [prev] | [next] | [standalone]

#2596

From	gah4 <gah4@u.washington.edu>
Date	2020-09-09 14:09 -0700
Message-ID	<20-09-025@comp.compilers>
In reply to	#2572

On Monday, August 31, 2020 at 9:44:53 PM UT
C-7, Roger L Costello wrote:
> Hi Folks,
>
> I am reading a book [1] on machine learning and the book says some pretty
> interesting things:
>
> "In the search for more speed, machine learning researchers started taking
> advantage of special hardware found in some computers, originally designed to
> improve graphics performance. You may have heard these called graphics cards.

The usual way to use a GPU is for single precision floating point.  They might also
be able to do fixed point of a reasonable size.

As above, parsing is usually fast enough.

GPUs are often enough used for dynamic programming, which is sometimes
used for optimization and code generation in compilers.  It might be that those
could use a speed boost. This might be more true for unusual architectures where
optimal code generation is more important.

[toc] | [prev] | [standalone]

csiph-web

Parsing using a Graphics Processing Unit (GPU)?

Contents

#2572 — Parsing using a Graphics Processing Unit (GPU)?

#2573

#2575

#2576

#2579

#2582

#2574

#2578

#2577

#2580

#2581

#2596