Groups > comp.compilers > #3081 > unrolled thread

What does it mean to "move characters" in the lexer?

Started by	Roger L Costello <costello@mitre.org>
First post	2022-06-21 10:27 +0000
Last post	2022-06-22 01:05 +0000
Articles	6 — 5 participants

Back to article view | Back to comp.compilers

  What does it mean to "move characters" in the lexer? Roger L Costello <costello@mitre.org> - 2022-06-21 10:27 +0000
    Re: What does it mean to "move characters" in the lexer? gah4 <gah4@u.washington.edu> - 2022-06-21 10:30 -0700
      Re: What does it mean to "move characters" in the lexer? Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-06-22 00:44 +0300
        Re: What does it mean to "move characters" in the lexer? Kaz Kylheku <480-992-1380@kylheku.com> - 2022-06-22 01:13 +0000
          Re: What does it mean to "move characters" in the lexer? Thomas Koenig <tkoenig@netcologne.de> - 2022-06-22 11:45 +0000
    Re: What does it mean to "move characters" in the lexer? Kaz Kylheku <480-992-1380@kylheku.com> - 2022-06-22 01:05 +0000

#3081 — What does it mean to "move characters" in the lexer?

From	Roger L Costello <costello@mitre.org>
Date	2022-06-21 10:27 +0000
Subject	What does it mean to "move characters" in the lexer?
Message-ID	<22-06-057@comp.compilers>

Hi Folks,

Page 89 of the dragon book says:

Because a large amount of time can be consumed moving characters, specialized
buffering techniques have been developed to reduce the amount of overhead to
process an input character.

Page 102 of "A Retargetable C Compiler: Design and Implementation" says:

The lexical analyzer's main activity is moving characters, so minimizing the
amount of character movement helps increase speed.

And on page 103 it says:

An important consequence of this design is that most of the input characters
are accessed by *cp and many characters are never moved. Only identifiers
(excluding keywords) and string literals that appear in executable code are
copied out of the buffer into permanent storage.

I don't understand what they mean by "moving characters". Do they mean copying
characters? Do they mean reading characters from a file into memory? Would you
explain what this "character movement" thing is all about, please?

/Roger
[Keeping in mind that this was written in the 1970s, they mean copying strings
of characters from one place to another.  On a PDP-11.  With 64K bytes of memory.
It is still true that character processing in a lexer consumes a large fraction
of the time in compilers.  -John]

[toc] | [next] | [standalone]

#3082

From	gah4 <gah4@u.washington.edu>
Date	2022-06-21 10:30 -0700
Message-ID	<22-06-058@comp.compilers>
In reply to	#3081

On Tuesday, June 21, 2022 at 9:25:12 AM UTC-7, Roger L Costello wrote:

(snip)
> Because a large amount of time can be consumed moving characters, specialized
> buffering techniques have been developed to reduce the amount of overhead to
> process an input character.

(snip)
> I don't understand what they mean by "moving characters". Do they mean copying
> characters? Do they mean reading characters from a file into memory? Would you
> explain what this "character movement" thing is all about, please?

Yes it is copying, and yes it can take a lot of the time.

On many systems, the disk controller reads the data into its own buffer,
and then the OS copies the data from the controller buffer into its buffer.

Then when the user does an I/O (input) request, the data is copied
into the program's own buffer, and finally into the place where the data
actually goes.  So maybe four copies.

Early in the days of TCP/IP there was trailer encapsulation.
(I never saw it used, but some have the ability to turn it on.)

If you follow the ISO seven letter model, or even if you don't.

The program gives data to TCP, which divides it up into
packets to send.  Each of those gets a TCP header.
It is then passed to IP where IP puts its header on.
And then before sending, it gets an Ethernet header.

Since there is often something before the buffer, but
the buffer might not be full, so there might be space at
the end, there was trailer encapsulation.  Instead of
putting the TCP and IP header on the beginning, you
put them on the end!  Less copying!

I believe people found other ways to reduce copying, though.

The I/O hardware for IBM S/360 copies data directly
from the I/O device into memory.  (Memory was expensive!)
Also, it is blocked on disk the same as it is for the user,
unlike most systems now.  It would be usual, though,
for the last copy -- from the I/O buffer to/from the actual
data area -- to be an actual copy.  IBM has locate mode
I/O to eliminate that one.  For locate mode, instead of
copying, the program gets a pointer to the actual buffer.
(That works in assembly and PL/I, C hadn't been invented.)

For write, you request the address of the output buffer,
operate on the data there, and then request it be written.

There has been much work over the years on reducing
the amount of data copying, or operations needed to
copy it.  For byte addressed machines, to copy data
a whole word at a time.  (Depending on alignment.)

There are also search algorithms like Boyer-Moore,
to search strings without looking at every character.
[Now you can usually ask operating systems to map a file into your
process so there is no extra copying at all, the disk reads a block
into a page frame and you address the data directly in that page
frame. In a program called grepcidr that does grep-like searches for
IP address strings, for large files it got somewhat faster when I
switched from stdio to mapping the whole file in and treating it as
one big string. This is pretty remote from compilers, though. tl;dr
less copying is faster. -John]

[toc] | [prev] | [next] | [standalone]

#3086

From	Christopher F Clark <christopher.f.clark@compiler-resources.com>
Date	2022-06-22 00:44 +0300
Message-ID	<22-06-064@comp.compilers>
In reply to	#3082

While worrying about copying characters around in compilers isn't given
much thought these days, it is very relevant to people implementing
networking software and also those doing hardware accelerators and their
device drivers.  The startup I'm working with these days, spends a lot of
time worrying about zero-copy abstractions, i.e. how to avoid moving data
around.   Of course, that doesn't surprise me as we are building hardware
accelerators and lots of the staff has a networking background and our
accelerators communicate with each other over network connections or shared
memory, but the less the data moves, the faster throughput we get with less
energy usage and usually with less hardware too.

--
******************************************************************************

Chris Clark                  email:
christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------

[toc] | [prev] | [next] | [standalone]

#3088

From	Kaz Kylheku <480-992-1380@kylheku.com>
Date	2022-06-22 01:13 +0000
Message-ID	<22-06-066@comp.compilers>
In reply to	#3086

On 2022-06-21, Christopher F Clark <christopher.f.clark@compiler-resources.com> wrote:
> While worrying about copying characters around in compilers isn't given
> much thought these days, it is very relevant to people implementing
> networking software and also those doing hardware accelerators and their
> device drivers.

Don't write off buffering optimization'sn as yesterday's game just yet;
e.g. this could still be relevant to someone needing to parse gigabytes
of JSON or XML. Or maybe even just megabytes.

I remember reading some article some years ago whereby some Javascript
programmer discovered it was faster to read JSON from a file using
dedicated JSON routines available in Javascript, than to declare the
same syntax in the Javascript program as a literal and let it be
scanned along with the program and available to it that way.

(I realize there may be other reasons for the performance difference,
because JSON isn't Javascript, but likely part of it was that the JS
implementation didn't care about being efficient for large amounts of
data: the "hot spot" isn't going to be the scanning stage that takes
place oncem before the program has a chance to execute any sort of
loop.)

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

[toc] | [prev] | [next] | [standalone]

#3092

From	Thomas Koenig <tkoenig@netcologne.de>
Date	2022-06-22 11:45 +0000
Message-ID	<22-06-071@comp.compilers>
In reply to	#3088

Kaz Kylheku <480-992-1380@kylheku.com> schrieb:

> I remember reading some article some years ago whereby some Javascript
> programmer discovered it was faster to read JSON from a file using
> dedicated JSON routines available in Javascript, than to declare the
> same syntax in the Javascript program as a literal and let it be
> scanned along with the program and available to it that way.

This came up on comp.arch recently.

There is an insanely fast JSON parser ad UTF-8 validator based
on SIMD to be found at https://github.com/simdjson/simdjson .
They select a different length of vector according to
the CPU version they find.  The algorithm is described at
https://arxiv.org/pdf/1902.08318.pdf.  It
heavily relies on special-casing for JSON and for the SIMD
instructions that are available.

A general SIMD-based parser generator is likely to be even harder
to write and will probably not outperform the package above (nor,
for that case, a traditional character-at-a-time approach).

Is there research on this?

[toc] | [prev] | [next] | [standalone]

#3087

From	Kaz Kylheku <480-992-1380@kylheku.com>
Date	2022-06-22 01:05 +0000
Message-ID	<22-06-065@comp.compilers>
In reply to	#3081

On 2022-06-21, Roger L Costello <costello@mitre.org> wrote:
> Hi Folks,
>
> Page 89 of the dragon book says:
>
> Because a large amount of time can be consumed moving characters, specialized
> buffering techniques have been developed to reduce the amount of overhead to
> process an input character.

It's not clear what exactly this is referring to, but probably just to
the practice of making multiple copies of the same data in the
processing stack. If we put an imaginary tracer on a single character,
we may see it hopping among multiple buffers. In the Chaper 2 lexical
analyzer, getchar is used to read a character; getchar fills a buffer in
the stdio stream, and the program is sucking it out from there one
character at a time. So to build a lexeme, it has to have its own buffer
for the lexeme, which is another copy of the data.

The technique described in chapter 3 allows bulk reads into a buffer,
eliminating the stream library. The tokens are delimited right inside
the buffer, reducing a copy.

[The technique seems closely related to the "flip buffer",
which can be found in places like TTY implementations of Unix
kernels. Linux had one; there was once a "struct tty_flip_buffer".
At a quick glance, it looks like that today there are nmore than two
buffers used which are allocated and freed on the fly. There is still a
<linux/tty_flip.h> which contains a modicum of "flip" terminology,]

[toc] | [prev] | [standalone]

csiph-web

What does it mean to "move characters" in the lexer?

Contents

#3081 — What does it mean to "move characters" in the lexer?

#3082

#3086

#3088

#3092

#3087