| Started by | Alex McDonald <blog@rivadpm.com> |
|---|---|
| First post | 2013-03-11 14:17 -0700 |
| Last post | 2013-03-12 15:22 -0700 |
| Articles | 20 on this page of 36 — 10 participants |
Hosted Forths on multicore machines Alex McDonald <blog@rivadpm.com> - 2013-03-11 14:17 -0700
Re: Hosted Forths on multicore machines "Clyde W. Phillips Jr." <cwpjr02@gmail.com> - 2013-03-11 17:41 -0700
Re: Hosted Forths on multicore machines Alex McDonald <blog@rivadpm.com> - 2013-03-12 06:38 -0700
Re: Hosted Forths on multicore machines Bernd Paysan <bernd.paysan@gmx.de> - 2013-03-12 02:19 +0100
Re: Hosted Forths on multicore machines Alex McDonald <blog@rivadpm.com> - 2013-03-12 06:05 -0700
Re: Hosted Forths on multicore machines Alex McDonald <blog@rivadpm.com> - 2013-03-12 06:34 -0700
Re: Hosted Forths on multicore machines Bernd Paysan <bernd.paysan@gmx.de> - 2013-03-13 01:51 +0100
Re: Hosted Forths on multicore machines "Elizabeth D. Rather" <erather@forth.com> - 2013-03-12 09:46 -1000
Re: Hosted Forths on multicore machines Roelf Toxopeus <rt4all@notthis.hetnet.nl> - 2013-03-13 11:30 +0100
Re: Hosted Forths on multicore machines Andrew Haley <andrew29@littlepinkcloud.invalid> - 2013-03-13 04:35 -0500
Re: Hosted Forths on multicore machines Alex McDonald <blog@rivadpm.com> - 2013-03-13 06:55 -0700
Re: Hosted Forths on multicore machines Andrew Haley <andrew29@littlepinkcloud.invalid> - 2013-03-13 09:49 -0500
Re: Hosted Forths on multicore machines Paul Rubin <no.email@nospam.invalid> - 2013-03-13 08:18 -0700
Re: Hosted Forths on multicore machines Andrew Haley <andrew29@littlepinkcloud.invalid> - 2013-03-13 10:39 -0500
Re: Hosted Forths on multicore machines Alex McDonald <blog@rivadpm.com> - 2013-03-13 16:36 -0700
Re: Hosted Forths on multicore machines Andrew Haley <andrew29@littlepinkcloud.invalid> - 2013-03-14 03:44 -0500
Re: Hosted Forths on multicore machines anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2013-03-14 09:06 +0000
Re: Hosted Forths on multicore machines Andrew Haley <andrew29@littlepinkcloud.invalid> - 2013-03-14 06:26 -0500
Re: Hosted Forths on multicore machines anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2013-03-14 15:41 +0000
Re: Hosted Forths on multicore machines Bernd Paysan <bernd.paysan@gmx.de> - 2013-03-14 17:56 +0100
Re: Hosted Forths on multicore machines Andrew Haley <andrew29@littlepinkcloud.invalid> - 2013-03-15 03:26 -0500
Re: Hosted Forths on multicore machines Bernd Paysan <bernd.paysan@gmx.de> - 2013-03-16 23:11 +0100
Re: Hosted Forths on multicore machines Andrew Haley <andrew29@littlepinkcloud.invalid> - 2013-03-12 05:01 -0500
Re: Hosted Forths on multicore machines Alex McDonald <blog@rivadpm.com> - 2013-03-12 06:10 -0700
Re: Hosted Forths on multicore machines Roelf Toxopeus <rt4all@notthis.hetnet.nl> - 2013-03-12 17:04 +0100
Re: Hosted Forths on multicore machines Andrew Haley <andrew29@littlepinkcloud.invalid> - 2013-03-12 11:28 -0500
Re: Hosted Forths on multicore machines Roelf Toxopeus <rt4all@notthis.hetnet.nl> - 2013-03-12 19:52 +0100
Re: Hosted Forths on multicore machines morrimichael@gmail.com - 2013-03-12 10:11 -0700
Re: Hosted Forths on multicore machines Alex McDonald <blog@rivadpm.com> - 2013-03-12 11:34 -0700
Re: Hosted Forths on multicore machines Roelf Toxopeus <rt4all@notthis.hetnet.nl> - 2013-03-12 19:44 +0100
Re: Hosted Forths on multicore machines Alex McDonald <blog@rivadpm.com> - 2013-03-12 13:13 -0700
Re: Hosted Forths on multicore machines Roelf Toxopeus <rt4all@notthis.hetnet.nl> - 2013-03-15 15:44 +0100
Re: Hosted Forths on multicore machines Alex McDonald <blog@rivadpm.com> - 2013-03-15 09:37 -0700
Re: Hosted Forths on multicore machines Alex McDonald <blog@rivadpm.com> - 2013-03-12 11:37 -0700
Re: Hosted Forths on multicore machines Roelf Toxopeus <rt4all@notthis.hetnet.nl> - 2013-03-13 10:46 +0100
Re: Hosted Forths on multicore machines the_gavino_himself <visphatesjava@gmail.com> - 2013-03-12 15:22 -0700
| From | Alex McDonald <blog@rivadpm.com> |
|---|---|
| Date | 2013-03-11 14:17 -0700 |
| Subject | Hosted Forths on multicore machines |
| Message-ID | <fd73df23-4c41-4cd7-936a-3141d752af25@g16g2000vbf.googlegroups.com> |
As one of three Forth projects I'm pursuing at the moment (the others
are 64-bit and an optimising compiler), I've been looking at specific
solutions for employing more than 1 core of my 8 (!) core laptop. (If
you own a laptop of less than 5 years' vintage, it probably has more
than 1 core, and multicore machines have been available for a long
time in server land.)

There are two common architectures: SMP and NUMA. And there is a
plethora of chips and OSes that support these; Linux, Windows,
Android on ARM, Intel/AMD, to name some of the commonest.

This has raised a number of questions for hosted Forths, some of which
I pose here.

Forth's traditional "multiuser" feature (PAUSE) isn't adequate. It
simply isn't extendable to multicore machines. The model assumes a
single core with all work running on a single thread. There are no
synchronisation primitives, and there is no way to handle asynchronous
interrupts.

Multicore support would appear to require:

. Memory fencing primitives (such as MFENCE, SFENCE and LFENCE on x86)
. Sync primitives (CAS or LL/SC)
. Words for creating & managing threads
. Word(s) for querying processor attributes (such as # of cores)
. Closures (for which I have a proposal)

Would it be possible to agree that these are a reasonable set, and
that hosted Forths would benefit from standardising some of these?

I've looked at a number of other languages that support these kinds of
operations.

. Go uses high level constructs; this model looks interesting.
. C++ now has language intrinsics such as atomic_compare_exchange()
. Lambdas and blocks in C++, Objective-C and others
. Haskell, Erlang ...

Comments welcome.
| From | "Clyde W. Phillips Jr." <cwpjr02@gmail.com> |
|---|---|
| Date | 2013-03-11 17:41 -0700 |
| Message-ID | <72c028ab-a4d0-4ff2-a9fa-19bbcb272fb1@googlegroups.com> |
| In reply to | #20562 |
On Monday, March 11, 2013 4:17:25 PM UTC-5, Alex McDonald wrote:
> As one of three Forth projects I'm pursuing at the moment (the others
> are 64bit and an optimising compiler) I've been looking at specific
> solutions for employing more than 1 core of my 8 (!) core laptop. (If
> you own a laptop of less than 5 years vintage, it probably has more
> than 1 core, and multicore machines have been available for a long
> time in server land.)
>
> There are two common architectures; SMP and NUMA. And there are a
> plethora of chips and OSes that support these; Linux, Windows,
> Android on ARM, Intel/AMD to name some of the commonest.
>
> This has raised a number of questions for hosted Forths, some of which
> I pose here.
>
> Forth's traditional "multiuser" feature isn't adequate (PAUSE). It
> simply isn't extendable to multi core machines. The model assumes a
> single core with all work running on a single thread. There are no
> synchronisation primitives, and there is no way to handle asynchronous
> interrupts.
>
> Multicore support would appear to require:
>
> . Memory fencing primitives (such as MFENCE SFENCE and LFENCE on x86)
> . Sync primitives (CAS or LL/SC)
> . Words for creating & managing threads
> . Word(s) for querying processor attributes (such as # of cores)
> . Closures (for which I have a proposal)
>
> Would it be possible to agree that these are a reasonable set, and
> that hosted Forths would benefit from standardising some of these?
>
> I've looked at a number of other languages that support these kinds of
> operations.
>
> . Go uses high level constructs; this model looks interesting.
> . C++ now has language intrinsics such as atomic_compare_exchange()
> . Lambdas and blocks in C++, Objective-C and others
> . Haskell, Erlang ...
>
> Comments welcome.

Would there be any value in segregating vocabularies to different cores, vs each being a complete system?
| From | Alex McDonald <blog@rivadpm.com> |
|---|---|
| Date | 2013-03-12 06:38 -0700 |
| Message-ID | <b252ac73-2faa-4f13-8fa6-507e1c0525ef@o9g2000pbt.googlegroups.com> |
| In reply to | #20574 |
On Mar 12, 12:41 am, "Clyde W. Phillips Jr." <cwpj...@gmail.com> wrote:
> Would there be any value in segregating vocabularies to different cores, vs each being a complete system?

I'm not sure how that helps distribute work. It sounds like a
multi-user system, one per core, if I understand what you're saying.
Then we're into IPC or RPC for communication; the discussion is for
something a little more tightly coupled than that.
| From | Bernd Paysan <bernd.paysan@gmx.de> |
|---|---|
| Date | 2013-03-12 02:19 +0100 |
| Message-ID | <khlvr3$jq3$1@online.de> |
| In reply to | #20562 |
Alex McDonald wrote:
> As one of three Forth projects I'm pursuing at the moment (the others
> are 64bit and an optimising compiler) I've been looking at specific
> solutions for employing more than 1 core of my 8 (!) core laptop. (If
> you own a laptop of less than 5 years vintage, it probably has more
> than 1 core, and multicore machines have been available for a long
> time in server land.)
Even my smartphone has four cores. Single-core on Android sucks.
> Forth's traditional "multiuser" feature isn't adequate (PAUSE). It
> simply isn't extendable to multi core machines.
Well, in Gforth's unix/pthread.fs (using Posix threads as multitasker),
PAUSE just maps to sched_yield().
I have
NewTask ( stacksize -- task )
\G creates a task, uses stacksize for stack, rstack, fpstack, locals
NewTask4 ( ssize rsize fpsize lsize -- task )
\G creates a task, each stack individually sized
Activation with
activate ( task r:cont -- )
\G activates task, the current procedure will be continued there
pass ( x1..xn n task r:cont -- x1..xn )
\G activates task, and passes n parameters from the data stack
sema ( "name" -- )
\G creates a semaphore "name" ( -- addr )
lock ( addr -- )
\G Acquires the lock
unlock ( addr -- )
\G releases the lock
stop ( -- )
\G stops the current task, waiting for events
stop-ns ( timeout-ns -- )
\G stops the current task, waiting for events or timeout in nanoseconds
There's an event system which needs some examples. The easiest thing is to
wake and stop other tasks:
event: ->wake ;
event: ->sleep stop ;
: wake ( task -- ) <event ->wake event> ;
: sleep ( task -- ) <event ->sleep event> ;
Events are sent as a sequence of events, enclosed in <event .. event>. The ->
convention is still questionable, as Gforth also has a recognizer that uses
->something for TO something (eliminating the need for the parsing TO). It
is just a convention, though, you can name your events whatever you like.
<event ( -- )
\G starts a sequence of events
event> ( task -- )
\G ends a sequence and sends it to the mentioned task
You can send literals, strings, and floats as part of the events:
elit, ( n -- )
\G sends a literal
e$, ( addr u -- )
\G sends a string (actually only the address and the count, because it's
\G shared memory)
eflit, ( r -- )
\G sends a float
?events ( x1..xn -- y1..ym )
\G checks for events and executes them
event: ( "name" -- )
\G defines an event and the reaction to it as Forth code
> The model assumes a
> single core with all work running on a single thread. There are no
> synchronisation primitives, and there is no way to handle asynchronous
> interrupts.
Yes. The lock/unlock and the events as described above serve for
synchronisation. The events construct commands which are executed in the
other task's context; IMHO this is the most forthish way to let threads
communicate with each other.
Example: You use a few helper threads to download files from the internet via
HTTP, e.g. 4 like IE does for fetching inline images; the helpers are chosen
in a round-robin fashion. It's asynchronous, events queue up, and you want
to be notified when it's done, and then you want as notification *what* has
been done (i.e. which url, and what the data is).
event: ->wdone ( content u1 url u2 -- ) cache-url rerender ;
event: ->wget ( url u task -- ) >r
2dup wget <event 2swap e$, e$, ->wdone r> event> ;
: wget-async ( url u -- )
<event e$, up@ elit, ->wget wget-helper event> ;
> Multicore support would appear to require:
>
> . Memory fencing primitives (such as MFENCE SFENCE and LFENCE on x86)
x86 is easy, because it has TSO memory, anyways. ARM has weak ordering, so
you need it there. We probably still need this to instruct the compiler
that yes, it has to do fencing. There are some optimizations the compiler
is not allowed to do, like fetching the same value twice when there's only
one @ in the code (VFX is an offender here). A compiler-only fence between
@ and DUP can help that.
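
For comparison, a minimal C11 sketch of the same guarantee, under the assumption that the shared cell is declared atomic. The names `shared_cell` and `fetch_dup_add` are hypothetical and only illustrate the idea: a relaxed atomic load issues no hardware fence, but the compiler has to treat it as exactly one fetch, so the C analogue of `@ DUP +` cannot be compiled into two separate reads that might disagree.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical shared variable that another thread may update. */
_Atomic int shared_cell = 21;

/* C analogue of the Forth phrase  @ DUP +
 * The relaxed load is performed once; the value is then used twice.
 * Because the load is atomic, the compiler may not re-read the cell
 * for the second use, so the result is always even. */
int fetch_dup_add(_Atomic int *p)
{
    assert(p != NULL);
    int v = atomic_load_explicit(p, memory_order_relaxed);
    return v + v;
}
```

Relaxed ordering is chosen here because the point is only to pin down the number of fetches; ordering against other memory accesses would need acquire/release or a separate fence.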
> . Sync primitives (CAS or LL/SC)
I think we should have polling loops like LL/SC outside of the high level
code. CAS (and CAS2 for two values), a non-conditional exchange (I use the
name !@ for that, but usually I only need the unlocked version), and an
atomic increment are useful primitives.
> . Words for creating & managing threads
Just use the same words we used for creating tasks for the PAUSE
multitasker.
> . Word(s) for querying processor attributes (such as # of cores)
cores ( -- n )?
You may need it if you write code that should just utilize all cores, but it
is not something you get easily ported among platforms. glibc has a
sysconf() query for it.
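
A minimal sketch of what such a query could look like on the C side, assuming a glibc-style system. `_SC_NPROCESSORS_ONLN` is a widely available extension (glibc, the BSDs, macOS) rather than a core POSIX name, which is part of the portability problem mentioned above. The function names `cores` and `worker_count` are hypothetical.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical C counterpart of a Forth word  cores ( -- n ).
 * Returns the number of processors currently online, or 1 if the
 * query is unsupported or fails. */
long cores(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n < 1 ? 1 : n;
}

/* Clamp a requested number of worker threads to the cores available. */
long worker_count(long wanted)
{
    assert(wanted >= 1);
    long n = cores();
    return wanted < n ? wanted : n;
}
```

A Forth system could expose the result through its C interface; the fallback to 1 keeps callers working on platforms where the query is missing.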
> . Closures (for which I have a proposal)
I prefer PASS here. Just pass a few parameters on the stack to the code
that runs in the other task.
> Would it be possible to agree that these are a reasonable set, and
> that hosted Forths would benefit from standardising some of these?
Well, maybe. The stuff you explain is pretty low-level. The events above
are much higher level, and make programming with multiple threads very easy.
You don't have to worry about low-level synchronizations, a bunch of events
between <event and event> is constructed and then transmitted as a whole to
the receiver (how that is achieved is totally up to the system implementer,
including the necessary memory fence).
> I've looked at a number of other languages that support these kinds of
> operations.
>
> . Go uses high level constructs; this model looks interesting.
Yes, the higher level constructs are more useful.
> . C++ now has language intrinsics such as atomic_compare_exchange()
> . Lambdas and blocks in C++, Objective-C and others
The <event event> thing is my Forthish equivalent to these lambdas/blocks.
> . Haskell, Erlang ...
>
> Comments welcome.
--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://bernd-paysan.de/
| From | Alex McDonald <blog@rivadpm.com> |
|---|---|
| Date | 2013-03-12 06:05 -0700 |
| Message-ID | <e0f429ac-b7ba-45ff-9900-293bd90cf992@oz4g2000pbc.googlegroups.com> |
| In reply to | #20579 |
On Mar 12, 1:19 am, Bernd Paysan <bernd.pay...@gmx.de> wrote:
> Alex McDonald wrote:
> > As one of three Forth projects I'm pursuing at the moment (the others
> > are 64bit and an optimising compiler) I've been looking at specific
> > solutions for employing more than 1 core of my 8 (!) core laptop. (If
> > you own a laptop of less than 5 years vintage, it probably has more
> > than 1 core, and multicore machines have been available for a long
> > time in server land.)
>
> Even my smartphone has four cores. Single-core on Android sucks.
>
> > Forth's traditional "multiuser" feature isn't adequate (PAUSE). It
> > simply isn't extendable to multi core machines.
>
> Well, in Gforth's unix/pthread.fs (using Posix threads as multitasker),
> PAUSE just maps to sched_yield().

That's not useful in a multicore environment. With 4 cores and 4
threads, one thread active per core, it will return immediately.
Unless there's processor affinity set on the thread, any core can run
any thread; all PAUSE does is pester the OS scheduler for no net
return. It's only useful in single core/multithread environments; for
instance, you release a mutex and want to have other tasks that are
waiting get CPU cycles.

> I have
>
> NewTask ( stacksize -- task )
> \G creates a task, uses stacksize for stack, rstack, fpstack, locals
> NewTask4 ( ssize rsize fpsize lsize -- task )
> \G creates a task, each stack individually sized
>
> Activation with
>
> activate ( task r:cont -- )
> \G activates task, the current procedure will be continued there
>
> pass ( x1..xn n task r:cont -- x1..xn )
> \G activates task, and passes n parameters from the data stack
>
> sema ( "name" -- )
> \G creates a semaphore "name" ( -- addr )
>
> lock ( addr -- )
> \G Acquires the lock
>
> unlock ( addr -- )
> \G releases the lock
>
> stop ( -- )
> \G stops the current task, waiting for events
>
> stop-ns ( timeout-ns -- )
> \G stops the current task, waiting for events or timeout in nanoseconds
>
> There's an event system which needs some examples. The easiest thing is to
> wake and stop other tasks:
>
> event: ->wake ;
> event: ->sleep stop ;
>
> : wake ( task -- ) <event ->wake event> ;
> : sleep ( task -- ) <event ->sleep event> ;
>
> Events are sent as a sequence of events, enclosed in <event .. event>. The ->
> convention is still questionable, as Gforth also has a recognizer that uses
> ->something for TO something (eliminating the need for the parsing TO). It
> is just a convention, though, you can name your events whatever you like.
>
> <event ( -- )
> \G starts a sequence of events
>
> event> ( task -- )
> \G ends a sequence and sends it to the mentioned task
>
> You can send literals, strings, and floats as part of the events:
>
> elit, ( n -- )
> \G sends a literal
>
> e$, ( addr u -- )
> \G sends a string (actually only the address and the count, because it's
> \G shared memory)
>
> eflit, ( r -- )
> \G sends a float
>
> ?events ( x1..xn -- y1..ym )
> \G checks for events and executes them
>
> event: ( "name" -- )
> \G defines an event and the reaction to it as Forth code

I'll take a look at that; Win32Forth has a similar tasker and
semaphores. I'm not a big fan of locks, especially for queue
management.

> > The model assumes a
> > single core with all work running on a single thread. There are no
> > synchronisation primitives, and there is no way to handle asynchronous
> > interrupts.
>
> Yes. The lock/unlock and the events as described above serve for
> synchronisation. The events construct commands which are executed in the
> other task's context; IMHO this is the most forthish way to let threads
> communicate with each other.

That's where a closure-like facility becomes useful, since each task
has a quite distinct stack and environment.

> Example: You use a few helper threads to download files from the internet via
> HTTP, e.g. 4 like IE does for fetching inline images; the helpers are chosen
> in a round-robin fashion. It's asynchronous, events queue up, and you want
> to be notified when it's done, and then you want as notification *what* has
> been done (i.e. which url, and what the data is).
>
> event: ->wdone ( content u1 url u2 -- ) cache-url rerender ;
> event: ->wget ( url u task -- ) >r
> 2dup wget <event 2swap e$, e$, ->wdone r> event> ;
>
> : wget-async ( url u -- )
> <event e$, up@ elit, ->wget wget-helper event> ;
>
> > Multicore support would appear to require:
>
> > . Memory fencing primitives (such as MFENCE SFENCE and LFENCE on x86)
>
> x86 is easy, because it has TSO memory, anyways.

Only by using RMW instructions. There are temporal issues with load
x/store x being seen as store x/load x by other processors due to OOO
execution.

> ARM has weak ordering, so
> you need it there. We probably still need this to instruct the compiler
> that yes, it has to do fencing. There are some optimizations the compiler
> is not allowed to do, like fetching the same value twice when there's only
> one @ in the code (VFX is an offender here). A compiler-only fence between
> @ and DUP can help that.

Or a word like volatile@ ?

I think we need a set of read/write acquire/release primitives for
ARM and other relaxed consistency processors. But I don't know enough
about them to comment.

> > . Sync primitives (CAS or LL/SC)
>
> I think we should have polling loops like LL/SC outside of the high level
> code. CAS (and CAS2 for two values), a non-conditional exchange (I use the
> name !@ for that, but usually I only need the unlocked version), and an
> atomic increment are useful primitives.

( v    -- volatile address
  c    -- comparand
  x    -- exchange value
  vval -- original value of v )

: atomic-@!     ( x v -- vval )       \ as per x86 XCHG
: atomic-2@!    ( dx v -- dvval )     \ see *
: atomic-@+!    ( x v -- vval )       \ as per x86 XADD
: atomic-2@+!   ( dx v -- dvval )     \ see *
: atomic-@cmp!  ( c x v -- vval )     \ as per x86 cmpxchg (32) or cmpxchg8b (64)
: atomic-2@cmp! ( dc dx v -- dvval )  \ ditto cmpxchg8b (32) or cmpxchg16b (64)

Atomic increment is

: atomic-incr ( -- n ) 1 var atomic-@+! ;

*On a 32bit(64bit) x86 processor, doing a 2 cell atomic exchange or add
will be problematic since there's no 64(128) bit equivalent, and it
will need to be emulated with cmpxchg8b(16b) in a spin loop.

> > . Words for creating & managing threads
>
> Just use the same words we used for creating tasks for the PAUSE
> multitasker.
>
> > . Word(s) for querying processor attributes (such as # of cores)
>
> cores ( -- n )?
>
> You may need it if you write code that should just utilize all cores, but it
> is not something you get easily ported among platforms. glibc has a
> sysconf() query for it.
>
> > . Closures (for which I have a proposal)

I'll come back to this, since it relates to producer/consumer queues.

> I prefer PASS here. Just pass a few parameters on the stack to the code
> that runs in the other task.
>
> > Would it be possible to agree that these are a reasonable set, and
> > that hosted Forths would benefit from standardising some of these?
>
> Well, maybe. The stuff you explain is pretty low-level. The events above
> are much higher level, and make programming with multiple threads very easy.
> You don't have to worry about low-level synchronizations, a bunch of events
> between <event and event> is constructed and then transmitted as a whole to
> the receiver (how that is achieved is totally up to the system implementer,
> including the necessary memory fence).
>
> > I've looked at a number of other languages that support these kinds of
> > operations.
>
> > . Go uses high level constructs; this model looks interesting.
>
> Yes, the higher level constructs are more useful.
>
> > . C++ now has language intrinsics such as atomic_compare_exchange()
> > . Lambdas and blocks in C++, Objective-C and others
>
> The <event event> thing is my Forthish equivalent to these lambdas/blocks.
>
> > . Haskell, Erlang ...
>
> > Comments welcome.
>
> --
> Bernd Paysan
> "If you want it done right, you have to do it yourself"
> http://bernd-paysan.de/
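
As a point of comparison for the single-cell atomic words listed in this post, here is a minimal C11 sketch of the same three operations. It assumes a cell is an `intptr_t`; the function names (`xchg_cell`, `xadd_cell`, `cas_cell`, `incr_cell`) are hypothetical stand-ins for atomic-@!, atomic-@+!, atomic-@cmp! and atomic-incr, not an existing API. Each returns the value the cell held before the operation, matching the `vval` convention in the stack comments.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* atomic-@!  ( x v -- vval ): store x, return the previous value. */
intptr_t xchg_cell(_Atomic intptr_t *v, intptr_t x)
{
    assert(v != NULL);
    return atomic_exchange(v, x);
}

/* atomic-@+!  ( x v -- vval ): add x, return the previous value. */
intptr_t xadd_cell(_Atomic intptr_t *v, intptr_t x)
{
    assert(v != NULL);
    return atomic_fetch_add(v, x);
}

/* atomic-@cmp!  ( c x v -- vval ): if the cell equals c, store x.
 * The value observed in the cell is returned either way, so the
 * caller knows the exchange happened when the result equals c. */
intptr_t cas_cell(_Atomic intptr_t *v, intptr_t c, intptr_t x)
{
    assert(v != NULL);
    intptr_t seen = c;
    /* On failure C11 writes the observed value into seen;
     * on success seen still holds c, which was the old value. */
    atomic_compare_exchange_strong(v, &seen, x);
    return seen;
}

/* atomic-incr: built from the fetch-and-add, as in the post. */
intptr_t incr_cell(_Atomic intptr_t *v)
{
    return xadd_cell(v, 1);
}
```

The double-cell variants are left out: as the footnote in the post says, they need a wider compare-and-exchange in a retry loop, and whether that is lock-free depends on the target.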
| From | Alex McDonald <blog@rivadpm.com> |
|---|---|
| Date | 2013-03-12 06:34 -0700 |
| Message-ID | <7b6f28a8-83d1-429e-9c20-22959df667c3@q9g2000pbf.googlegroups.com> |
| In reply to | #20589 |
On Mar 12, 1:05 pm, Alex McDonald <b...@rivadpm.com> wrote:
> On Mar 12, 1:19 am, Bernd Paysan <bernd.pay...@gmx.de> wrote:
>
> > Alex McDonald wrote:

[snip]

> > > Forth's traditional "multiuser" feature isn't adequate (PAUSE). It
> > > simply isn't extendable to multi core machines.
>
> > Well, in Gforth's unix/pthread.fs (using Posix threads as multitasker),
> > PAUSE just maps to sched_yield().
>
> That's not useful in a multicore environment. With 4 cores and 4
> threads, one thread active per core, it will return immediately.
> Unless there's processor affinity set on the thread, any core can run
> any thread; all PAUSE does is pester the OS scheduler for no net
> return. It's only useful in single core/multithread environments; for
> instance, you release a mutex and want to have other tasks that are
> waiting get CPU cycles.

(adding) Or if you have an issue with priority inversion, PAUSE may be
useful. But it is avoidable with good design.
| From | Bernd Paysan <bernd.paysan@gmx.de> |
|---|---|
| Date | 2013-03-13 01:51 +0100 |
| Message-ID | <khoiia$rt6$1@online.de> |
| In reply to | #20591 |
Alex McDonald wrote:
> On Mar 12, 1:05 pm, Alex McDonald <b...@rivadpm.com> wrote:
>> On Mar 12, 1:19 am, Bernd Paysan <bernd.pay...@gmx.de> wrote:
>>
>> > Alex McDonald wrote:
>
> [snip]
>
>> > > Forth's traditional "multiuser" feature isn't adequate (PAUSE). It
>> > > simply isn't extendable to multi core machines.
>>
>> > Well, in Gforth's unix/pthread.fs (using Posix threads as multitasker),
>> > PAUSE just maps to sched_yield().
>>
>> That's not useful in a multicore environment. With 4 cores and 4
>> threads, one thread active per core, it will return immediately.
>> Unless there's processor affinity set on the thread, any core can run
>> any thread; all PAUSE does is pester the OS scheduler for no net
>> return. It's only useful in single core/multithread environments; for
>> instance, you release a mutex and want to have other tasks that are
>> waiting get CPU cycles.
>
> (adding) Or if you have an issue with priority inversion, PAUSE may be
> useful. But it is avoidable with good design.

I don't actually use PAUSE. It's just there, for "compatibility reasons",
and it calls sched_yield(), which sometimes may be useful.

I exclusively use the <event ->somesignal task event> construct. It does
everything I need: it sends an atomic (multi-)message from one task to the
other, and I can define the sequence point (the event> side is the sender's
sequence point, the ?events is the receiver's sequence point).

Example: I'm splitting my net2o packet handler code off into a task at the
moment. This means you can write event-driven programs that communicate
with each other through net2o. It turned out to be a trivial exercise:
just put a <event ... task event> wrapper around the packet handler, and if
there is an event to signal inside the code that handles packets, signal it.

I need to expose the unsent event queue length, because there's no need to
send empty queues around (or is this just an optimization issue?
<event task event> should really do nothing, and do it fast).

This is still somewhat in flux, because <event is not really needed.
Whatever events come, queue up, and event> sends them. The only possible
reason why you might want <event is for nesting, something like

<event ->foo ... <event ->bar a event> ... b event>

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://bernd-paysan.de/
| From | "Elizabeth D. Rather" <erather@forth.com> |
|---|---|
| Date | 2013-03-12 09:46 -1000 |
| Message-ID | <ddKdnRHFKf2-G6LMnZ2dnUVZ_uWdnZ2d@supernews.com> |
| In reply to | #20589 |
On 3/12/13 3:05 AM, Alex McDonald wrote:
>>> Forth's traditional "multiuser" feature isn't adequate (PAUSE). It
>>> simply isn't extendable to multi core machines.
>>
>> Well, in Gforth's unix/pthread.fs (using Posix threads as multitasker),
>> PAUSE just maps to sched_yield().
>
> That's not useful in a multicore environment. With 4 cores and 4
> threads, one thread active per core, it will return immediately.
> Unless there's processor affinity set on the thread, any core can run
> any thread; all PAUSE does is pester the OS scheduler for no net
> return. It's only useful in single core/multithread environments; for
> instance, you release a mutex and want to have other tasks that are
> waiting get CPU cycles.

Many years ago, even before multicore machines came out, we concluded
that the traditional Forth multitasker is inappropriate in a hosted
environment. We completely redesigned task management for Windows
threads. The result is documented in the SwiftForth manual (included
with all SwiftForth downloads, including the eval version).

Cheers,
Elizabeth

--
==================================================
Elizabeth D. Rather   (US & Canada)   800-55-FORTH
FORTH Inc.                         +1 310.999.6784
5959 West Century Blvd. Suite 700
Los Angeles, CA 90045
http://www.forth.com

"Forth-based products and Services for real-time
applications since 1973."
==================================================
| From | Roelf Toxopeus <rt4all@notthis.hetnet.nl> |
|---|---|
| Date | 2013-03-13 11:30 +0100 |
| Message-ID | <rt4all-CD47F1.11304913032013@[10.12.75.213]> |
| In reply to | #20604 |
In article <ddKdnRHFKf2-G6LMnZ2dnUVZ_uWdnZ2d@supernews.com>,
"Elizabeth D. Rather" <erather@forth.com> wrote:

> We completely redesigned task management for Windows
> threads. The result is documented in the SwiftForth manual (included
> with all SwiftForth downloads, including the eval version).

What I appreciate very much is that the diverse implementations for
PolyForth, SwiftX, SwiftForth Windows, SwiftForth Linux and SwiftForth OSX
use a common set of names/words.

1980s code written for a traditional multitasking Forth under a
monotasking OS on a single-core CPU runs as good as unaltered on a
multicore-tasking Forth under a multicore-tasking OS on a multicore CPU.
I think that's quite neat.
(Mach2/AtariGEM/68000 -> Coco-SF/OSX/i7)

Thanks!

-Roelf
| From | Andrew Haley <andrew29@littlepinkcloud.invalid> |
|---|---|
| Date | 2013-03-13 04:35 -0500 |
| Message-ID | <p-Sdndgja6PN1d3MnZ2dnUVZ_j2dnZ2d@supernews.com> |
| In reply to | #20589 |
Alex McDonald <blog@rivadpm.com> wrote:
> On Mar 12, 1:19 am, Bernd Paysan <bernd.pay...@gmx.de> wrote:
>> Well, in Gforth's unix/pthread.fs (using Posix threads as multitasker),
>> PAUSE just maps to sched_yield().
>
> That's not useful in a multicore environment. With 4 cores and 4
> threads, one thread active per core, it will return immediately.
> Unless there's processor affinity set on the thread, any core can run
> any thread; all PAUSE does is pester the OS scheduler for no net
> return. It's only useful in single core/multithread environments; for
> instance, you release a mutex and want to have other tasks that are
> waiting get CPU cycles.

That is not true. When you have a number of threads all trying to
acquire a lock it makes sense to use some kind of exponential backoff.
First you simply spin, then call PAUSE, and then only if you still
haven't acquired a lock do you need to go the heavyweight route of
creating a mutex and blocking. This is a huge win in a heavily
contended environment where locks are held very briefly.

Andrew.
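
The escalation described in this post can be sketched in C roughly as follows. This is a simplified illustration under stated assumptions, not the scheme of any particular system: the type `backoff_lock`, the function names and the retry limits are all hypothetical, and the final phase sleeps for exponentially growing intervals instead of creating a mutex and blocking on it, purely to keep the sketch short and self-contained.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>
#include <time.h>

enum { SPIN_TRIES = 100, YIELD_TRIES = 10 };   /* arbitrary limits */

typedef struct { atomic_flag held; } backoff_lock;   /* hypothetical */

/* One attempt: returns nonzero if the lock was free and is now ours. */
int backoff_try(backoff_lock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held,
                                              memory_order_acquire);
}

/* Acquire the lock; returns the phase that succeeded:
 * 0 = plain spinning, 1 = after sched_yield(), 2 = after sleeping. */
int backoff_acquire(backoff_lock *l)
{
    assert(l != NULL);

    for (int i = 0; i < SPIN_TRIES; i++)        /* phase 0: spin */
        if (backoff_try(l)) return 0;

    for (int i = 0; i < YIELD_TRIES; i++) {     /* phase 1: yield */
        sched_yield();
        if (backoff_try(l)) return 1;
    }

    long ns = 1000;                             /* phase 2: back off */
    for (;;) {
        struct timespec ts = { 0, ns };
        nanosleep(&ts, NULL);
        if (backoff_try(l)) return 2;
        if (ns < 1000000) ns *= 2;              /* double, cap ~1 ms */
    }
}

void backoff_release(backoff_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

A lock is initialised with `backoff_lock l = { ATOMIC_FLAG_INIT };`. Returning the phase makes it easy to measure how often the cheap paths are enough, which is the claim at issue in this part of the thread.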
| From | Alex McDonald <blog@rivadpm.com> |
|---|---|
| Date | 2013-03-13 06:55 -0700 |
| Message-ID | <376c6d69-3339-4d48-8df3-1e04730e3ac0@y9g2000vbb.googlegroups.com> |
| In reply to | #20618 |
On Mar 13, 9:35 am, Andrew Haley <andre...@littlepinkcloud.invalid> wrote:
> Alex McDonald <b...@rivadpm.com> wrote:
> > On Mar 12, 1:19 am, Bernd Paysan <bernd.pay...@gmx.de> wrote:
> >> Well, in Gforth's unix/pthread.fs (using Posix threads as multitasker),
> >> PAUSE just maps to sched_yield().
>
> > That's not useful in a multicore environment. With 4 cores and 4
> > threads, one thread active per core, it will return immediately.
> > Unless there's processor affinity set on the thread, any core can run
> > any thread; all PAUSE does is pester the OS scheduler for no net
> > return. It's only useful in single core/multithread environments; for
> > instance, you release a mutex and want to have other tasks that are
> > waiting get CPU cycles.
>
> That is not true. When you have a number of threads all trying to
> acquire a lock it makes sense to use some kind of exponential backoff.
> First you simply spin, then call PAUSE, and then only if you still
> haven't acquired a lock do you need to go the heavyweight route of
> creating a mutex and blocking. This is a huge win in a heavily
> contended environment where locks are held very briefly.
>
> Andrew.
It's the translation of PAUSE to Yield() or sched_yield() or equivalent that's the issue. The scheduler has been given no reason to make the switch; a yield says "If there's someone with a higher priority, run them as I'm willing to wait. Otherwise run me." You're not saying "I'm waiting for a lock, so run the guy that holds it." For several tasks of equal or lower priority to the lock getter, nothing happens and the yield simply returns immediately, so the likelihood of getting the lock after a spin then yield is (a) on a single core, exactly zero (b) on a multicore, it depends on whether the lock holding thread has released it in the time it takes to do the yield. It may not be running on any other core either, so either way it's close to zero as well.
Windows & POSIX programmers seem to be advised to use sleep for a very small interval (not zero which appears to be equivalent to a yield) to kick the scheduler into action, as sleeping indicates you are willing to give up your timeslice. Now you're saying "Do something else, I need a rest. Get back to me later." Yield (or PAUSE) won't work. You've got to sleep.
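Editor's sketch of the sleep-based alternative argued for above: retry the lock, and between attempts sleep for a small nonzero interval instead of yielding. This is a hedged illustration only. `lock_with_sleep` and its parameters are hypothetical, and the actual sleep granularity depends on the OS timer, so the requested interval is a lower bound rather than a promise.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for nanosleep() under strict -std=c11 */
#endif
#include <pthread.h>
#include <stddef.h>
#include <time.h>

/* Try to take the lock up to max_tries times, sleeping sleep_ns
 * nanoseconds (must be below one second) after each failed attempt.
 * Returns the number of attempts used, or -1 if the lock was never free. */
int lock_with_sleep(pthread_mutex_t *m, long sleep_ns, int max_tries)
{
    struct timespec nap = { .tv_sec = 0, .tv_nsec = sleep_ns };

    for (int tries = 1; tries <= max_tries; tries++) {
        if (pthread_mutex_trylock(m) == 0)
            return tries;
        /* Sleeping, unlike yielding, always gives up the timeslice. */
        nanosleep(&nap, NULL);
    }
    return -1;
}
```

A caller that gets -1 back would normally fall through to a blocking `pthread_mutex_lock`, which is the same final stage as in the spin-then-yield scheme.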
| From | Andrew Haley <andrew29@littlepinkcloud.invalid> |
|---|---|
| Date | 2013-03-13 09:49 -0500 |
| Message-ID | <eZydnctgMdVxDN3MnZ2dnUVZ_sqdnZ2d@supernews.com> |
| In reply to | #20635 |
Alex McDonald <blog@rivadpm.com> wrote:
> On Mar 13, 9:35 am, Andrew Haley <andre...@littlepinkcloud.invalid>
> wrote:
>> Alex McDonald <b...@rivadpm.com> wrote:
>> > On Mar 12, 1:19 am, Bernd Paysan <bernd.pay...@gmx.de> wrote:
>> >> Well, in Gforth's unix/pthread.fs (using Posix threads as multitasker),
>> >> PAUSE just maps to sched_yield().
>>
>> > That's not useful in a multicore environment. With 4 cores and 4
>> > threads, one thread active per core, it will return immediately.
>> > Unless there's processor affinity set on the thread, any core can run
>> > any thread; all PAUSE does is pester the OS scheduler for no net
>> > return. It's only useful in single core/multithread environments; for
>> > instance, you release a mutex and want to have other tasks that are
>> > waiting get CPU cycles.
>>
>> That is not true. When you have a number of threads all trying to
>> acquire a lock it makes sense to use some kind of exponential backoff.
>> First you simply spin, then call PAUSE, and then only if you still
>> haven't acquired a lock do you need to go the heavyweight route of
>> creating a mutex and blocking. This is a huge win in a heavily
>> contended environment where locks are held very briefly.
>
> It's the translation of PAUSE to Yield() or sched_yield() or
> equivalent that's the issue. The scheduler has been given no reason
> to make the switch; a yield says "If there's someone with a higher
> priority,
... higher or same ...
> run them as I'm willing to wait. Otherwise run me." You're not
> saying "I'm waiting for a lock, so run the guy that holds it."
Usually you don't know who holds it, and it would be effortful to find out, so you just spin for a short while.
> For several tasks of equal or lower priority to the lock getter,
> nothing happens and the yield simply returns immediately, so the
> likelihood of getting the lock after a spin then yield is (a) on a
> single core, exactly zero (b) on a multicore, it depends on whether
> the lock holding thread has released it in the time it takes to do
> the yield.
Right, and often it will have done. If you have a lot of threads contending for locks held for a long time, you might as well use heavyweight locks and block. That's not the case I'm talking about, which is fairly high contention but locks held for a short time. I'm talking about a bunch of worker tasks of the same priority.
> It may not be running on any other core either, so either way it's
> close to zero as well.
Why is it close to zero? Not IME.
> Windows & POSIX programmers seem to be advised to use sleep for a very
> small interval (not zero which appears to be equivalent to a yield) to
> kick the scheduler into action, as sleeping indicates you are willing
> to give up your timeslice. Now you're saying "Do something else, I
> need a rest. Get back to me later."
>
> Yield (or PAUSE) won't work. You've got to sleep.
Not necessarily: even a few laps of an empty loop or just retrying the lock may well be enough.
Andrew.
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2013-03-13 08:18 -0700 |
| Message-ID | <7xoben1g7l.fsf@ruckus.brouhaha.com> |
| In reply to | #20638 |
Andrew Haley <andrew29@littlepinkcloud.invalid> writes: > Not necessarily: even a few laps of an empty loop or just retrying the > lock may well be enough. I wonder if spinning is still reasonable with current cpu's, where memory accesses (to retry the lock) cost 100's of cpu cycles, and the energy dissipation from spinning is potentially costly in its own right. Is there any hardware assist for this type of thing?
| From | Andrew Haley <andrew29@littlepinkcloud.invalid> |
|---|---|
| Date | 2013-03-13 10:39 -0500 |
| Message-ID | <_fWdnYZr1Js3AN3MnZ2dnUVZ_rqdnZ2d@supernews.com> |
| In reply to | #20639 |
Paul Rubin <no.email@nospam.invalid> wrote:
> Andrew Haley <andrew29@littlepinkcloud.invalid> writes:
>> Not necessarily: even a few laps of an empty loop or just retrying the
>> lock may well be enough.
>
> I wonder if spinning is still reasonable with current cpu's, where
> memory accesses (to retry the lock) cost 100's of cpu cycles, and
> the energy dissipation from spinning is potentially costly in its
> own right.
It is. We have snoopy caches, and you won't retry many times before blocking. Even on a fast box, blocking for a futex takes at least a microsecond whereas a read from L3 is tens of nanoseconds.
> Is there any hardware assist for this type of thing?
Yes, there is. Intel has the PAUSE instruction, which allows another thread to use the processor. It's recommended for just this purpose.
Andrew.
| From | Alex McDonald <blog@rivadpm.com> |
|---|---|
| Date | 2013-03-13 16:36 -0700 |
| Message-ID | <1ebc328a-190a-4207-862a-0b1175e15c68@f6g2000yqm.googlegroups.com> |
| In reply to | #20640 |
On Mar 13, 3:39 pm, Andrew Haley <andre...@littlepinkcloud.invalid> wrote:
> Paul Rubin <no.em...@nospam.invalid> wrote:
> > Andrew Haley <andre...@littlepinkcloud.invalid> writes:
> >> Not necessarily: even a few laps of an empty loop or just retrying the
> >> lock may well be enough.
>
> > I wonder if spinning is still reasonable with current cpu's, where
> > memory accesses (to retry the lock) cost 100's of cpu cycles, and
> > the energy dissipation from spinning is potentially costly in its
> > own right.
>
> It is. We have snoopy caches, and you won't retry many times before
> blocking. Even on a fast box, blocking for a futex takes at least a
> microsecond whereas a read from L3 is tens of nanoseconds.
>
> > Is there any hardware assist for this type of thing?
>
> Yes, there is. Intel has the PAUSE instruction, which allows another
> thread to use the processor. It's recommended for just this purpose.
It does? I thought it was to reduce certain side effects on Xeon and slow the spin down on later processors, hence reducing power burn; no thread switching.
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/intref_cls/common/intref_sse2_pause.htm
>
> Andrew.
| From | Andrew Haley <andrew29@littlepinkcloud.invalid> |
|---|---|
| Date | 2013-03-14 03:44 -0500 |
| Message-ID | <ZNOdnQ295_xlENzMnZ2dnUVZ_rKdnZ2d@supernews.com> |
| In reply to | #20646 |
Alex McDonald <blog@rivadpm.com> wrote:
> On Mar 13, 3:39 pm, Andrew Haley <andre...@littlepinkcloud.invalid>
> wrote:
>> Paul Rubin <no.em...@nospam.invalid> wrote:
>> > Andrew Haley <andre...@littlepinkcloud.invalid> writes:
>> >> Not necessarily: even a few laps of an empty loop or just retrying the
>> >> lock may well be enough.
>>
>> > I wonder if spinning is still reasonable with current cpu's, where
>> > memory accesses (to retry the lock) cost 100's of cpu cycles, and
>> > the energy dissipation from spinning is potentially costly in its
>> > own right.
>>
>> It is. We have snoopy caches, and you won't retry many times before
>> blocking. Even on a fast box, blocking for a futex takes at least a
>> microsecond whereas a read from L3 is tens of nanoseconds.
>>
>> > Is there any hardware assist for this type of thing?
>>
>> Yes, there is. Intel has the PAUSE instruction, which allows another
>> thread to use the processor. It's recommended for just this purpose.
>
> It does? I thought it was to reduce certain side effects on Xeon and
> slow the spin down on later processors, hence reducing power burn; no
> thread switching.
Hmmm. The documentation isn't so great. PAUSE is certainly recommended in spin-wait loops and it causes a small delay. I confess that I had assumed that the other thread in a hyperthreaded core would run, but this may well not be the case.
http://software.intel.com/en-us/articles/long-duration-spin-wait-loops-on-hyper-threading-technology-enabled-intel-processors
Andrew.
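Editor's sketch of a spin-wait loop that uses the PAUSE instruction in the recommended position. It assumes C11 atomics and, on x86-64, the `_mm_pause()` intrinsic; `spin_try_lock`, `spin_unlock` and `cpu_relax` are hypothetical names. The code takes no position on what PAUSE does for the sibling hyperthread; it is used here only as a hint that the loop is a spin-wait, and it compiles to nothing on other architectures.

```c
#include <stdatomic.h>
#include <stdbool.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define cpu_relax() _mm_pause()   /* emits the x86 PAUSE instruction */
#else
#define cpu_relax() ((void)0)     /* no equivalent assumed elsewhere */
#endif

/* Try to take the lock, giving up after max_spins attempts so the caller
 * can fall back to blocking. Returns true if the lock was acquired. */
bool spin_try_lock(atomic_bool *lock, int max_spins)
{
    for (int i = 0; i < max_spins; i++) {
        /* Read first: while the lock is held, waiters only read their own
         * cached copy and do not fight over the cache line with writes. */
        if (!atomic_load_explicit(lock, memory_order_relaxed) &&
            !atomic_exchange_explicit(lock, true, memory_order_acquire))
            return true;
        cpu_relax();
    }
    return false;
}

void spin_unlock(atomic_bool *lock)
{
    atomic_store_explicit(lock, false, memory_order_release);
}
```

Bounding the loop with `max_spins` reflects the point made earlier in the thread that you will not retry many times before blocking.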
| From | anton@mips.complang.tuwien.ac.at (Anton Ertl) |
|---|---|
| Date | 2013-03-14 09:06 +0000 |
| Message-ID | <2013Mar14.100629@mips.complang.tuwien.ac.at> |
| In reply to | #20657 |
Andrew Haley <andrew29@littlepinkcloud.invalid> writes:
[PAUSE on Intel CPUs]
>Hmmm. The documentation isn't so great. PAUSE is certainly
>recommended in spin-wait loops and it causes a small delay. I confess
>that I had assumed that the other thread in a hyperthreaded core would
>run, but this may well not be the case.
In SMT (simultaneous multi-threading) all the threads that are on a
CPU are running. This comes out clearer in the technical term _S_MT
than in the marketing name "hyperthreading". One might have
instructions for changing the priority of resource allocation, but I
don't think that any SMT processor implements such priorities.
Concerning what PAUSE really does,
<http://www.postgresql.org/message-id/3FECD103.5040105@colorfullife.com>
looks plausible. It says that (on the Pentium 4) it just causes a
delay; the main purpose apparently is to avoid some Pentium-4-specific
performance penalty, but as a side-effect it also reduces power
consumption.
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2013: http://www.euroforth.org/ef13/
| From | Andrew Haley <andrew29@littlepinkcloud.invalid> |
|---|---|
| Date | 2013-03-14 06:26 -0500 |
| Message-ID | <OqCdnQbDBsxyLtzMnZ2dnUVZ_sCdnZ2d@supernews.com> |
| In reply to | #20660 |
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
> Andrew Haley <andrew29@littlepinkcloud.invalid> writes:
> [PAUSE on Intel CPUs]
>>Hmmm. The documentation isn't so great. PAUSE is certainly
>>recommended in spin-wait loops and it causes a small delay. I confess
>>that I had assumed that the other thread in a hyperthreaded core would
>>run, but this may well not be the case.
>
> In SMT (simultaneous multi-threading) all the threads that are on a
> CPU are running.
Of course.
> This comes out clearer in the technical term _S_MT than in the
> marketing name "hyperthreading". One might have instructions for
> changing the priority of resource allocation, but I don't think that
> any SMT processor implements such priorities.
Certainly not, no. Priorities are the domain of scheduler software.
> Concerning what PAUSE really does,
> <http://www.postgresql.org/message-id/3FECD103.5040105@colorfullife.com>
> looks plausible. It says that (on the Pentium 4) it just causes a
> delay; the main purpose apparently is to avoid some Pentium-4-specific
> performance penalty, but as a side-effect it also reduces power
> consumption.
That seems reasonable. However, I doubt that PAUSE delays the other thread running in the same core, and thus it allows the other thread full access to the core's execution units rather than pointlessly spinning, which was my point. If it really does stall both active threads, then I'm wrong.
Andrew.
| From | anton@mips.complang.tuwien.ac.at (Anton Ertl) |
|---|---|
| Date | 2013-03-14 15:41 +0000 |
| Message-ID | <2013Mar14.164137@mips.complang.tuwien.ac.at> |
| In reply to | #20662 |
Andrew Haley <andrew29@littlepinkcloud.invalid> writes:
>Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
>> This comes out clearer in the technical term _S_MT than in the
>> marketing name "hyperthreading". One might have instructions for
>> changing the priority of resource allocation, but I don't think that
>> any SMT processor implements such priorities.
>
>Certainly not, no.
I would not put it in the realm of the impossible, it just has not
been done AFAIK. E.g., one thread could get all the resources it
asks for, and the others would get the remainder.
>> Concerning what PAUSE really does,
>> <http://www.postgresql.org/message-id/3FECD103.5040105@colorfullife.com>
>> looks plausible. It says that (on the Pentium 4) it just causes a
>> delay; the main purpose apparently is to avoid some Pentium-4-specific
>> performance penalty, but as a side-effect it also reduces power
>> consumption.
>
>That seems reasonable. However, I doubt that PAUSE delays the other
>thread running in the same core, and thus it allows the other thread
>full access to the core's execution units
Yes, that's how I understand it.
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2013: http://www.euroforth.org/ef13/
| From | Bernd Paysan <bernd.paysan@gmx.de> |
|---|---|
| Date | 2013-03-14 17:56 +0100 |
| Message-ID | <khsvfj$e1c$1@online.de> |
| In reply to | #20639 |
Paul Rubin wrote:
> Andrew Haley <andrew29@littlepinkcloud.invalid> writes:
>> Not necessarily: even a few laps of an empty loop or just retrying the
>> lock may well be enough.
>
> I wonder if spinning is still reasonable with current cpu's, where
> memory accesses (to retry the lock) cost 100's of cpu cycles, and the
> energy dissipation from spinning is potentially costly in its own right.
> Is there any hardware assist for this type of thing?
Transactional memory should give us the necessary assists. The way a transactional memory works has two parts: one side is to observe others stealing the cache lines you read from within the transaction, the other is to commit the stores you made as atomic writes. Fortunately, most of what you need there as hardware has been available for literally decades...
So you have a few cache lines to observe, and a few cache lines to keep as uncommitted copies. The code looks like
  begin transaction
  do some reads
  do some computations
  do some shadow writes
  atomic commit
If you lost some reads or someone messed with the cache lines you write into, the atomic commit will fail and jump again to the begin. Now, if you are waiting for a lock, you might want some "pause", i.e.
  begin transaction
  read a lock
  check if it is free
  if not: pause and repeat
  do a shadow write to acquire the lock
  atomic commit
PAUSE here would mean "wait until one of the observed cache lines is modified". My PAUSE would have a timeout, because when the timeout expires, you should rethink your spin loop strategy, and do a full block.
--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://bernd-paysan.de/
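Editor's sketch, with a substitution stated plainly: this is not hardware transactional memory. It is a C11 compare-and-swap retry loop, which follows the same read, compute, shadow write, atomic commit shape for a single word and jumps back to the start when the commit fails. `atomic_update` and `triple_plus_one` are hypothetical names chosen for illustration.

```c
#include <stdatomic.h>

/* Atomically replace *cell with compute(*cell) and return the new value.
 * If another thread changes *cell between the read and the commit, the
 * commit fails and the computation is redone on the fresh value. */
long atomic_update(_Atomic long *cell, long (*compute)(long))
{
    long seen = atomic_load(cell);                 /* do some reads */
    for (;;) {
        long shadow = compute(seen);               /* compute into a private copy */
        /* Atomic commit: succeeds only if *cell still equals seen.
         * On failure, seen is refreshed with the current value. */
        if (atomic_compare_exchange_weak(cell, &seen, shadow))
            return shadow;
    }
}

/* Example computation used to exercise atomic_update(). */
long triple_plus_one(long x)
{
    return 3 * x + 1;
}
```

Unlike the transactional scheme described above, this covers only one memory word and has no way to wait for a cache line to change, so a timeout or a pause between retries would have to be added by hand.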