Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.compilers > #3081 > unrolled thread
| Started by | Roger L Costello <costello@mitre.org> |
|---|---|
| First post | 2022-06-21 10:27 +0000 |
| Last post | 2022-06-22 01:05 +0000 |
| Articles | 6 — 5 participants |
Back to article view | Back to comp.compilers
What does it mean to "move characters" in the lexer? Roger L Costello <costello@mitre.org> - 2022-06-21 10:27 +0000
Re: What does it mean to "move characters" in the lexer? gah4 <gah4@u.washington.edu> - 2022-06-21 10:30 -0700
Re: What does it mean to "move characters" in the lexer? Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-06-22 00:44 +0300
Re: What does it mean to "move characters" in the lexer? Kaz Kylheku <480-992-1380@kylheku.com> - 2022-06-22 01:13 +0000
Re: What does it mean to "move characters" in the lexer? Thomas Koenig <tkoenig@netcologne.de> - 2022-06-22 11:45 +0000
Re: What does it mean to "move characters" in the lexer? Kaz Kylheku <480-992-1380@kylheku.com> - 2022-06-22 01:05 +0000
| From | Roger L Costello <costello@mitre.org> |
|---|---|
| Date | 2022-06-21 10:27 +0000 |
| Subject | What does it mean to "move characters" in the lexer? |
| Message-ID | <22-06-057@comp.compilers> |
Hi Folks, Page 89 of the dragon book says: Because a large amount of time can be consumed moving characters, specialized buffering techniques have been developed to reduce the amount of overhead to process an input character. Page 102 of "A Retargetable C Compiler: Design and Implementation" says: The lexical analyzer's main activity is moving characters, so minimizing the amount of character movement helps increase speed. And on page 103 it says: An important consequence of this design is that most of the input characters are accessed by *cp and many characters are never moved. Only identifiers (excluding keywords) and string literals that appear in executable code are copied out of the buffer into permanent storage. I don't understand what they mean by "moving characters". Do they mean copying characters? Do they mean reading characters from a file into memory? Would you explain what this "character movement" thing is all about, please? /Roger [Keeping in mind that this was written in the 1970s, they mean copying strings of characters from one place to another. On a PDP-11. With 64K bytes of memory. It is still true that character processing in a lexer consumes a large fraction of the time in compilers. -John]
[toc] | [next] | [standalone]
| From | gah4 <gah4@u.washington.edu> |
|---|---|
| Date | 2022-06-21 10:30 -0700 |
| Message-ID | <22-06-058@comp.compilers> |
| In reply to | #3081 |
On Tuesday, June 21, 2022 at 9:25:12 AM UTC-7, Roger L Costello wrote: (snip) > Because a large amount of time can be consumed moving characters, specialized > buffering techniques have been developed to reduce the amount of overhead to > process an input character. (snip) > I don't understand what they mean by "moving characters". Do they mean copying > characters? Do they mean reading characters from a file into memory? Would you > explain what this "character movement" thing is all about, please? Yes it is copying, and yes it can take a lot of the time. On many systems, the disk controller reads the data into its own buffer, and then the OS copies the data from the controller buffer into its buffer. Then when the user does an I/O (input) request, the data is copied into the program's own buffer, and finally into the place where the data actually goes. So maybe four copies. Early in the days of TCP/IP there was trailer encapsulation. (I never saw it used, but some have the ability to turn it on.) If you follow the ISO seven letter model, or even if you don't. The program gives data to TCP, which divides it up into packets to send. Each of those gets a TCP header. It is then passed to IP where IP puts its header on. And then before sending, it gets an Ethernet header. Since there is often something before the buffer, but the buffer might not be full, so there might be space at the end, there was trailer encapsulation. Instead of putting the TCP and IP header on the beginning, you put them on the end! Less copying! I believe people found other ways to reduce copying, though. The I/O hardware for IBM S/360 copies data directly from the I/O device into memory. (Memory was expensive!) Also, it is blocked on disk the same as it is for the user, unlike most systems now. It would be usual, though, for the last copy -- from the I/O buffer to/from the actual data area -- to be an actual copy. IBM has locate mode I/O to eliminate that one. For locate mode, instead of copying, the program gets a pointer to the actual buffer. (That works in assembly and PL/I, C hadn't been invented.) For write, you request the address of the output buffer, operate on the data there, and then request it be written. There has been much work over the years on reducing the amount of data copying, or operations needed to copy it. For byte addressed machines, to copy data a whole word at a time. (Depending on alignment.) There are also search algorithms like Boyer-Moore, to search strings without looking at every character. [Now you can usually ask operating systems to map a file into your process so there is no extra copying at all, the disk reads a block into a page frame and you address the data directly in that page frame. In a program called grepcidr that does grep-like searches for IP address strings, for large files it got somewhat faster when I switched from stdio to mapping the whole file in and treating it as one big string. This is pretty remote from compilers, though. tl;dr less copying is faster. -John]
[toc] | [prev] | [next] | [standalone]
| From | Christopher F Clark <christopher.f.clark@compiler-resources.com> |
|---|---|
| Date | 2022-06-22 00:44 +0300 |
| Message-ID | <22-06-064@comp.compilers> |
| In reply to | #3082 |
While worrying about copying characters around in compilers isn't given much thought these days, it is very relevant to people implementing networking software and also those doing hardware accelerators and their device drivers. The startup I'm working with these days, spends a lot of time worrying about zero-copy abstractions, i.e. how to avoid moving data around. Of course, that doesn't surprise me as we are building hardware accelerators and lots of the staff has a networking background and our accelerators communicate with each other over network connections or shared memory, but the less the data moves, the faster throughput we get with less energy usage and usually with less hardware too. -- ****************************************************************************** Chris Clark email: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. Web Site: http://world.std.com/~compres 23 Bailey Rd voice: (508) 435-5016 Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------
[toc] | [prev] | [next] | [standalone]
| From | Kaz Kylheku <480-992-1380@kylheku.com> |
|---|---|
| Date | 2022-06-22 01:13 +0000 |
| Message-ID | <22-06-066@comp.compilers> |
| In reply to | #3086 |
On 2022-06-21, Christopher F Clark <christopher.f.clark@compiler-resources.com> wrote: > While worrying about copying characters around in compilers isn't given > much thought these days, it is very relevant to people implementing > networking software and also those doing hardware accelerators and their > device drivers. Don't write off buffering optimization'sn as yesterday's game just yet; e.g. this could still be relevant to someone needing to parse gigabytes of JSON or XML. Or maybe even just megabytes. I remember reading some article some years ago whereby some Javascript programmer discovered it was faster to read JSON from a file using dedicated JSON routines available in Javascript, than to declare the same syntax in the Javascript program as a literal and let it be scanned along with the program and available to it that way. (I realize there may be other reasons for the performance difference, because JSON isn't Javascript, but likely part of it was that the JS implementation didn't care about being efficient for large amounts of data: the "hot spot" isn't going to be the scanning stage that takes place oncem before the program has a chance to execute any sort of loop.) -- TXR Programming Language: http://nongnu.org/txr Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
[toc] | [prev] | [next] | [standalone]
| From | Thomas Koenig <tkoenig@netcologne.de> |
|---|---|
| Date | 2022-06-22 11:45 +0000 |
| Message-ID | <22-06-071@comp.compilers> |
| In reply to | #3088 |
Kaz Kylheku <480-992-1380@kylheku.com> schrieb: > I remember reading some article some years ago whereby some Javascript > programmer discovered it was faster to read JSON from a file using > dedicated JSON routines available in Javascript, than to declare the > same syntax in the Javascript program as a literal and let it be > scanned along with the program and available to it that way. This came up on comp.arch recently. There is an insanely fast JSON parser ad UTF-8 validator based on SIMD to be found at https://github.com/simdjson/simdjson . They select a different length of vector according to the CPU version they find. The algorithm is described at https://arxiv.org/pdf/1902.08318.pdf. It heavily relies on special-casing for JSON and for the SIMD instructions that are available. A general SIMD-based parser generator is likely to be even harder to write and will probably not outperform the package above (nor, for that case, a traditional character-at-a-time approach). Is there research on this?
[toc] | [prev] | [next] | [standalone]
| From | Kaz Kylheku <480-992-1380@kylheku.com> |
|---|---|
| Date | 2022-06-22 01:05 +0000 |
| Message-ID | <22-06-065@comp.compilers> |
| In reply to | #3081 |
On 2022-06-21, Roger L Costello <costello@mitre.org> wrote: > Hi Folks, > > Page 89 of the dragon book says: > > Because a large amount of time can be consumed moving characters, specialized > buffering techniques have been developed to reduce the amount of overhead to > process an input character. It's not clear what exactly this is referring to, but probably just to the practice of making multiple copies of the same data in the processing stack. If we put an imaginary tracer on a single character, we may see it hopping among multiple buffers. In the Chaper 2 lexical analyzer, getchar is used to read a character; getchar fills a buffer in the stdio stream, and the program is sucking it out from there one character at a time. So to build a lexeme, it has to have its own buffer for the lexeme, which is another copy of the data. The technique described in chapter 3 allows bulk reads into a buffer, eliminating the stream library. The tokens are delimited right inside the buffer, reducing a copy. [The technique seems closely related to the "flip buffer", which can be found in places like TTY implementations of Unix kernels. Linux had one; there was once a "struct tty_flip_buffer". At a quick glance, it looks like that today there are nmore than two buffers used which are allocated and freed on the fly. There is still a <linux/tty_flip.h> which contains a modicum of "flip" terminology,]
[toc] | [prev] | [standalone]
Back to top | Article view | comp.compilers
csiph-web