GA144 article

Started by	Mark Wills <markrobertwills@yahoo.co.uk>
First post	2012-07-18 03:48 -0700
Last post	2012-08-06 02:21 -0700
Articles	20 on this page of 62 — 13 participants

  GA144 article Mark Wills <markrobertwills@yahoo.co.uk> - 2012-07-18 03:48 -0700
    Re: GA144 article rickman <gnuarm@gmail.com> - 2012-07-19 10:51 -0700
      Re: GA144 article Mark Wills <markrobertwills@yahoo.co.uk> - 2012-07-20 01:30 -0700
        Re: GA144 article Paul Rubin <no.email@nospam.invalid> - 2012-07-20 01:48 -0700
          Re: GA144 article Mark Wills <markrobertwills@yahoo.co.uk> - 2012-07-20 03:25 -0700
          Re: GA144 article Syd Rumpo <usenet@neonica.co.uk> - 2012-07-20 11:59 +0100
        Re: GA144 article rickman <gnuarm@gmail.com> - 2012-07-22 18:59 -0700
          Re: GA144 article Paul Rubin <no.email@nospam.invalid> - 2012-07-22 19:16 -0700
            Re: GA144 article rickman <gnuarm@gmail.com> - 2012-07-22 19:31 -0700
              Re: GA144 article Paul Rubin <no.email@nospam.invalid> - 2012-07-22 20:08 -0700
                Re: GA144 article rickman <gnuarm@gmail.com> - 2012-07-23 16:24 -0700
                  Re: GA144 article Paul Rubin <no.email@nospam.invalid> - 2012-07-23 22:33 -0700
                    Re: GA144 article rickman <gnuarm@gmail.com> - 2012-07-24 14:04 -0700
              Re: GA144 article vandys@vsta.org - 2012-07-23 03:34 +0000
              Re: GA144 article Bernd Paysan <bernd.paysan@gmx.de> - 2012-07-23 14:11 +0200
                Re: GA144 article rickman <gnuarm@gmail.com> - 2012-07-23 16:47 -0700
                Re: GA144 article "Clyde W. Phillips Jr." <cwpjr02@gmail.com> - 2012-07-24 20:11 -0700
            Re: GA144 article Richard Owlett <rowlett@pcnetinc.com> - 2012-07-23 02:04 -0500
    Re: GA144 article Daniel Kalny <dkalny@seznam.cz> - 2012-07-24 02:37 -0700
      Re: GA144 article Paul Rubin <no.email@nospam.invalid> - 2012-07-24 03:12 -0700
        Re: GA144 article stephenXXX@mpeforth.com (Stephen Pelc) - 2012-07-24 10:36 +0000
          Re: GA144 article Paul Rubin <no.email@nospam.invalid> - 2012-07-24 10:28 -0700
            Re: GA144 article Albert van der Horst <albert@spenarnc.xs4all.nl> - 2012-07-24 22:12 +0000
        Re: GA144 article rickman <gnuarm@gmail.com> - 2012-07-24 14:49 -0700
          Re: GA144 article Paul Rubin <no.email@nospam.invalid> - 2012-07-26 13:25 -0700
            Re: GA144 article rickman <gnuarm@gmail.com> - 2012-07-27 13:17 -0700
              Re: GA144 article Paul Rubin <no.email@nospam.invalid> - 2012-07-27 14:31 -0700
                Re: GA144 article rickman <gnuarm@gmail.com> - 2012-07-27 22:02 -0700
                  Re: GA144 article Paul Rubin <no.email@nospam.invalid> - 2012-07-28 23:03 -0700
                    Re: GA144 article marko <marko@marko.marko> - 2012-07-30 09:12 +1000
                      Re: GA144 article rickman <gnuarm@gmail.com> - 2012-07-31 01:09 -0700
                        Re: GA144 article marko <marko@marko.marko> - 2012-08-01 00:19 +1000
                          Re: GA144 article rickman <gnuarm@gmail.com> - 2012-07-31 09:59 -0700
                            Re: GA144 article marko <marko@marko.marko> - 2012-08-01 10:13 +1000
                        Re: GA144 article Bernd Paysan <bernd.paysan@gmx.de> - 2012-08-01 00:59 +0200
                          Re: GA144 article rickman <gnuarm@gmail.com> - 2012-07-31 16:54 -0700
                            Re: GA144 article Luca Saiu <positron@gnu.org> - 2012-08-04 10:46 +0200
                              Re: GA144 article rickman <gnuarm@gmail.com> - 2012-08-05 11:04 -0700
                                Re: GA144 article Luca Saiu <positron@gnu.org> - 2012-08-06 02:02 +0200
                                Re: GA144 article Mark Wills <markrobertwills@yahoo.co.uk> - 2012-08-06 02:16 -0700
                    Re: GA144 article rickman <gnuarm@gmail.com> - 2012-07-30 08:23 -0700
                      Re: GA144 article Paul Rubin <no.email@nospam.invalid> - 2012-07-30 21:33 -0700
                        Re: GA144 article rickman <gnuarm@gmail.com> - 2012-07-31 01:51 -0700
                          Re: GA144 article Paul Rubin <no.email@nospam.invalid> - 2012-07-31 11:48 -0700
                            Re: GA144 article rickman <gnuarm@gmail.com> - 2012-07-31 14:54 -0700
                              Re: GA144 article Paul Rubin <no.email@nospam.invalid> - 2012-07-31 16:09 -0700
                                Re: GA144 article rickman <gnuarm@gmail.com> - 2012-08-01 12:57 -0700
                                  Re: GA144 article Paul Rubin <no.email@nospam.invalid> - 2012-08-01 14:16 -0700
                                    Re: GA144 article vandys@vsta.org - 2012-08-01 22:32 +0000
                                    Re: GA144 article rickman <gnuarm@gmail.com> - 2012-08-01 16:03 -0700
                                      Re: GA144 article Paul Rubin <no.email@nospam.invalid> - 2012-08-01 16:13 -0700
                            Re: GA144 article rickman <gnuarm@gmail.com> - 2012-08-01 16:10 -0700
                              Re: GA144 article Paul Rubin <no.email@nospam.invalid> - 2012-08-01 22:18 -0700
                                Re: GA144 article rickman <gnuarm@gmail.com> - 2012-08-02 11:47 -0700
                                  Re: GA144 article Paul Rubin <no.email@nospam.invalid> - 2012-08-02 14:32 -0700
                                    Re: GA144 article rickman <gnuarm@gmail.com> - 2012-08-02 16:01 -0700
                                      Re: GA144 article Paul Rubin <no.email@nospam.invalid> - 2012-08-02 16:38 -0700
                                        Re: GA144 article rickman <gnuarm@gmail.com> - 2012-08-03 08:40 -0700
                                          Re: GA144 article Paul Rubin <no.email@nospam.invalid> - 2012-08-03 09:19 -0700
                                            Re: GA144 article rickman <gnuarm@gmail.com> - 2012-08-03 11:15 -0700
                                              Re: GA144 article Paul Rubin <no.email@nospam.invalid> - 2012-08-05 15:50 -0700
                                                Re: GA144 article Mark Wills <markrobertwills@yahoo.co.uk> - 2012-08-06 02:21 -0700

#14553

From	rickman <gnuarm@gmail.com>
Date	2012-07-30 08:23 -0700
Message-ID	<0c5e8f67-c185-4e41-b8b4-21b29d6feaef@googlegroups.com>
In reply to	#14509

On Sunday, July 29, 2012 2:03:38 AM UTC-4, Paul Rubin wrote:
> rickman <gnuarm@gmail.com> writes:
> 
> > No, no FPGA I know of would allow you to use even 5% of the routing
> > resources.  That is the point.
> 
> If you're using 5% of the routing and 10% of the LUT's, why did you
> buy a $1500 FPGA instead of a $100 one?  What did the extra $1400 get you?

It wasn't my decision.  Please don't mix up my comments.  The 10% of LUTs was a specific product that was allowing room for future expansion.  The 5% of routing is just how FPGAs are designed.  They have to have lots of routing because they don't know how you will be connecting the LUTs.  In the earlier days they minimized the routing and as a result in some designs it was not possible to use all of the LUTs.  I don't have a measured number for how much of the routing was typically used, but I expect it was still well under 50%.  

The inability to fully utilize the LUTs became a marketing issue so the trade off was moved to provide more routing vs the number of LUTs and now it is a rare design that can't get >95% utilization of the LUTs when needed (assuming speed is not also required, speed vs LUT utilization will always be a tradeoff).  As a result most designs don't even use 5% of the routing in an FPGA.  No one complains because they don't know really.  They "see" the LUT utilization and so they care about it, they don't "see" routing utilization and so don't care about it. 

CPU cores are highly "visible" and so people complain if they can't be used 100%.  But what matters is getting your design done.  If they took out half the cores to provide more complex routing, would that make you happier?  Unfortunately this would also slow down the comms as there would need to be more muxing, switching and who knows what.  I seriously doubt that the current method is actually "bad", but there is not much to compare it against.  

> > Worry not if you can use all the resources, worry if your design will
> > fit.  It's that simple.
> 
> That's the thing, if you're hoping to use a $100 FPGA but your
> application won't fit, then maybe you have to use a $200 one with 2x
> more CLB's.  And if it still won't fit, there's the $300 one, the $400
> one, etc.  Once you hit the very biggest ones, there's still stuff that
> won't fit, so you wish there were some that were even bigger, or you
> wait for next year's models, etc.  It's the same with ram, hard drives,
> parallel GPU units, or whatever.

I have no idea how this applies to using the GA144.  

> By comparison it's hard to think of any applications that won't fit a
> GA144, but that would fit in an imagined GA288 with 2x the nodes.  The
> GA144 itself is already hard to fill, as we're finding.

Who said it is hard to fill?  I have ideas for digital receivers which may well max out the chip.  At 40 MHz a small FIR filter requires a full core, a bigger filter uses multiple cores.  This is not a memory limitation as some think, because they only look at one way to implement a FIR filter.  I am attempting to use the primary resource, the CPU horsepower, and depending on the specific function it can get used up quickly.  So there are apps that will use as many of the GA144 nodes as possible by the topology.  

> >> Can I use a 144 core x86?  Yes.  What about 144,000 cores?  144
> >> million?  Yes and yes.
> >
> > They can't even keep 4 x86 CPUs supplied with data from memory so I
> > have no idea what you would do with 144 of them on a single chip.
> 
> They are making a 50+ core x86 (Knights Corner) but I was thinking more
> along the lines of x86 multi-chip supercomputers with 1000's of cores.
> Yes those exist and there's lots of applications for them.

It has been clearly demonstrated that the x86 architecture is memory limited at 2 cores per memory bus.  At four cores each processor runs at a lower efficiency and at 8 cores there is little total improvement over four cores.  Someone may be making an x86 50 core chip, but how well does it work and what is the app?  I expect it is much more restricted than the GA144 in how it can be used effectively.  I don't expect to see one in a laptop anytime soon. 

What is your point about this?

> > If you think the number 144 is where the magic comes from then you
> > don't get the GA144.  How about a GA32?  Only 32 processors to occupy
> > with all the same IO.  Does that make you any more satisfied with the
> > GA32?
> 
> The GA32 makes more sense than the GA144, since it has more connectivity
> and more i/o per core, so in that sense is better balanced, plus it's
> smaller and therefore presumably costs less.  The GA144 is a fairly
> large chip, probably larger than an ARM M3 or the like.  Of course
> I can understand why GA might have come out with the flagship product
> first, to make the biggest impression.  But I think the GA144 would
> be better with somewhat different parameter choices.

There is no more connectivity per core in the GA32 than there is in the GA144.  The same I/O with fewer processors, yes.  

> > Why would you use RAM in a node that is not next to the node using the RAM?  
> 
> Because there's not enough connectivity between nodes to put the ram
> immediately adjacent, a lot of the time.  As a trivial example, say your
> algorithm needs 512 words of ram, or 8 nodes used as ram.  The 8 nodes
> are no big deal, but the processing node has just 4 interconnects, so
> you need a bunch of extra nodes and delays to move the data around.

You can pipedream all day.  If you need 512 words of RAM on one node it is likely you need to rethink your algorithm.  That's not a frivolous comment, it is the truth.  Just like factoring in Forth, to work on this chip you need to factor your algorithms appropriately. 

> Routing in an FPGA as I understand it is just wires and some gate
> delays, but if you want to route stuff between GA144 nodes, it takes
> software in the nodes to copy the data around, inserting significant
> delays at each copying step compared to just routing with wires.

"Just" wires and gates (wires, switches and buffers actually).  The entire chip is "just" wires and gates.  There is no software on an FPGA unless you design a CPU.  Of course a 144 node processor chip has comms software.  Duh!

Here is software to link any two predefined ports unidirectionally.  "go" is the definition name and "go ;" at the end is optimized to a jump.  The time between the @ and the !b is about 1.5 ns.  If that's not fast enough for you then you can forget using FPGAs as they typically have at least that much routing time in any given path (unless you optimize with hand placement).

go @ !b go ;

> Maybe we could think of the GA144 as something like an FPGA with not
> enough routing.

You are welcome to think of the GA144 any way you choose.

> > So far the only thing that has tickled my imagination for this chip
> > (other than software defined radios, SDR) is a single chip
> > oscilloscope.
> 
> SDR is a really cool idea and it will be great if you can use a GA144
> for that.

SDR will be the *next* project.  I'm trying to do this in bite sized pieces.  My current step is to figure out how to run with SDRAM and why GA abandoned using SDRAM on their eval board.  There must have been a reason because the GA144 has ROM code for SDRAM.  

Rick

#14561

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-07-30 21:33 -0700
Message-ID	<7xa9yg1re5.fsf@ruckus.brouhaha.com>
In reply to	#14553

> If they took out half the cores to provide more complex routing, would
> that make you happier?  Unfortunately this would also slow down the
> comms as there would need to be more muxing, switching and who knows what.  

I asked Jeff Fox on this newsgroup about adding more comm channels to
the cores and he seemed to think it would not have been a big deal in
terms of chip area.  My idea was simply to add diagonal connections
between the cores to go with the existing orthogonal ones, a fairly
minor design tweak intended to make it easier to route signals around.

>> if you're hoping to use a $100 FPGA but your
>> application won't fit, then maybe you have to use a $200 one... 
> I have no idea how this applies to using the GA144.  

There are two reasons your design might not be suitable for a given
model of FPGA:

1) Your FPGA is not big enough (too few CLB's), but a bigger one
   would fix the problem.
2) Your design is somehow unsuitable for FPGA implementation no
   matter how big an FPGA you can get (within current technological
   limits).

That reason #1 applies a lot of the time and that there are many
different sizes and cost levels of FPGA's indicates that the FPGA
concept is scalable over a wide range.  I haven't seen this demonstrated
of the GA design so far.

> Who said it is hard to fill?  I have ideas for digital receivers which
> may well max out the chip.  At 40 MHz a small FIR filter requires a
> full core, a bigger filter uses multiple cores. 

OK, that is a good point, you can use a single core to do a few MAC's,
and have a pipeline of them that the signal makes its way through to
implement an FIR filter.  Parallel filter banks could indeed use a lot
of nodes that way.  I wonder how the chip area and power consumed would
compare to a conventional DSP approach.
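
A minimal Python sketch of that pipeline idea, assuming each node holds a few taps and a short delay line and hands both the oldest sample and a running partial sum to its neighbour. The names `FirStage` and `pipelined_fir` are hypothetical, and this models only the data flow, not F18A code, word width, or timing:

```python
from collections import deque


class FirStage:
    """One hypothetical node: a few taps plus a delay line of the same length."""

    def __init__(self, taps):
        self.taps = list(taps)
        self.delay = deque([0] * len(self.taps), maxlen=len(self.taps))

    def step(self, x_in, partial_in):
        # The oldest sample leaves this stage and feeds the next one.
        x_out = self.delay[-1]
        self.delay.appendleft(x_in)  # maxlen drops the oldest sample
        acc = partial_in + sum(t * d for t, d in zip(self.taps, self.delay))
        return x_out, acc


def pipelined_fir(samples, taps, taps_per_stage):
    """Run samples through a chain of stages, each doing a few MACs."""
    stages = [FirStage(taps[i:i + taps_per_stage])
              for i in range(0, len(taps), taps_per_stage)]
    out = []
    for x in samples:
        partial = 0
        for stage in stages:
            x, partial = stage.step(x, partial)
        out.append(partial)
    return out


def direct_fir(samples, taps):
    """Reference: y[n] = sum over j of taps[j] * samples[n - j]."""
    return [sum(taps[j] * samples[n - j]
                for j in range(len(taps)) if n - j >= 0)
            for n in range(len(samples))]
```

Splitting the taps across more stages leaves the output unchanged; it only changes how many MACs each stage performs per sample, which is the quantity that would decide how many nodes a given sample rate needs.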

>> >> Can I use a 144 core x86?  Yes.
>> >
>> > They can't even keep 4 x86 CPUs supplied with data from memory so I
>> > have no idea what you would do with 144 of them on a single chip.

A cache miss today is like a page fault from yesteryear, even with a
single core cpu.  The idea is they have megabytes of cache per core so
you can avoid going off-chip for many critical things.

> Someone may be making an x86 50 core chip, but how well does it work
> and what is the app?  I expect it is much more restricted than the
> GA144 in how it can be used effectively.

Initial app will be a PCIe coprocessor, general purpose but mainly
intended to compete in the supercomputer and GPU space, it seems.
See: http://en.wikipedia.org/wiki/Xeon_Phi

> There is no more connectivity per core in the GA32 than there is in
> the GA144.

There's more (proportional) connectivity in the sense that each node on
the 32 can communicate with 1/8 of the other nodes, but only 1/36th of
them on the 144.  And, it takes a lot more hops on the 144 to get a
signal from a given node to a distant one.  A traditional array computer
with a hypercube interconnect needs O(log n) max hops between nodes when
there are n nodes.  I guess the GA's are O(sqrt n) as the array gets
large but between 32 and 144 there's a pretty big jump.
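
The hop counts can be sketched in Python. The 8 x 18 grid for the GA144 is the published layout; the 4 x 8 grid used for a 32-node part is an assumption for illustration only, and the function names are hypothetical. The neighbour fraction follows the rough 4/n figure used above, which ignores edge and corner nodes:

```python
import math


def mesh_max_hops(rows, cols):
    """Worst-case hops between two corners of a 2-D orthogonal mesh."""
    return (rows - 1) + (cols - 1)


def hypercube_max_hops(n_nodes):
    """Worst-case hops in a hypercube large enough for n_nodes."""
    return math.ceil(math.log2(n_nodes))


def neighbor_fraction(n_nodes, links_per_node=4):
    """Rough share of the array one node can reach directly."""
    return links_per_node / n_nodes


if __name__ == "__main__":
    for name, rows, cols in (("32-node (assumed 4x8)", 4, 8),
                             ("GA144 (8x18)", 8, 18)):
        n = rows * cols
        print(name, "mesh:", mesh_max_hops(rows, cols),
              "hypercube:", hypercube_max_hops(n),
              "neighbor fraction:", neighbor_fraction(n))
```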

> You can pipedream all day.  If you need 512 words of RAM on one node
> it is likely you need to rethink your algorithm.  That's not a
> frivolous comment, it is the truth.  Just like factoring in Forth, to
> work on this chip you need to factor your algorithms appropriately.

You probably also have to eliminate a lot of algorithms and application
areas that need ram.  Vector quantization, MPEG motion estimation(?),
the FFT(?), etc.  Admittedly my own interests go more towards data than
signals, stuff that needs ram but isn't really in the area that the chip
aims at.

> The time between the @ and the !b is about 1.5 ns. ..
> go @ !b go ;

I don't think it's that fast, per the databook
 http://www.greenarraychips.com/home/documents/greg/DB001-110412-F18A.pdf
pages 9 and 12.  The @ and the !b and the jump are each 5.1 ns so
that whole loop takes over 15 ns per word moved.  You might be able to
get it to 12ns with unext instead of jump.  And I guess for some
purposes you don't have to count the jump as part of the latency
for the data, but 10ns is still pretty bad, especially if you have
to incur it through several nodes.  
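
The arithmetic above, written out in Python. It takes the 5.1 ns per operation figure as given; whether that figure applies to port reads and writes is exactly what is in dispute in this thread, so the numbers are only as good as that assumption. Function names are hypothetical:

```python
INSTR_NS = 5.1  # per-operation time assumed in the post above


def loop_time_ns(n_ops, op_ns=INSTR_NS):
    """Time for one pass of a loop made of n_ops equally slow operations."""
    return n_ops * op_ns


def relay_latency_ns(hops, per_hop_ns):
    """Added latency when a word is relayed through several nodes."""
    return hops * per_hop_ns


def rate_mhz(period_ns):
    """Words per microsecond for a given loop period."""
    return 1000.0 / period_ns


if __name__ == "__main__":
    full_loop = loop_time_ns(3)   # @, !b and the jump
    data_path = loop_time_ns(2)   # @ and !b only
    print("loop period (ns):", full_loop)
    print("transfer rate (MHz):", rate_mhz(full_loop))
    print("latency over 3 relay nodes (ns):", relay_latency_ns(3, data_path))
```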

> My current step is to figure out how to run with SDRAM and
> why GA abandoned using SDRAM on their eval board.  There must have
> been a reason because the GA144 has ROM code for SDRAM.

You could ask them...

#14568

From	rickman <gnuarm@gmail.com>
Date	2012-07-31 01:51 -0700
Message-ID	<420f3e55-76d1-43eb-85b7-182c02f3fd22@googlegroups.com>
In reply to	#14561

On Tuesday, July 31, 2012 12:33:22 AM UTC-4, Paul Rubin wrote:
> > If they took out half the cores to provide more complex routing, would
> > that make you happier?  Unfortunately this would also slow down the
> > comms as there would need to be more muxing, switching and who knows what.  
> 
> I asked Jeff Fox on this newsgroup about adding more comm channels to
> the cores and he seemed to think it would not have been a big deal in
> terms of chip area.  My idea was simply to add diagonal connections
> between the cores to go with the existing orthogonal ones, a fairly
> minor design tweak intended to make it easier to route signals around.

Maybe you should build your own chip then?  Maybe this is a minor design tweak, but I don't think it would add much to the comms on the chip.  If you are going to improve the internode comms I would suggest a cube arrangement.

...snip...

> > The time between the @ and the !b is about 1.5 ns. ..
> > go @ !b go ;
> 
> I don't think it's that fast, per the databook
>  http://www.greenarraychips.com/home/documents/greg/DB001-110412-F18A.pdf
> pages 9 and 12.  The @ and the !b and the jump are each 5.1 ns so
> that whole loop takes over 15 ns per word moved.  You might be able to
> get it to 12ns with unext instead of jump.  And I guess for some
> purposes you don't have to count the jump as part of the latency
> for the data, but 10ns is still pretty bad, especially if you have
> to incur it through several nodes.  

The F18A databook page 9, "5.1 nanoseconds when reading or writing internal memory".  These ops are not to or from "internal memory".  The simulator will tell how much time code takes I believe.  When I get to that point I'll give it a try. 

The loop time is not so important as is the latency.  The cycle time does not need to be any faster than a node can generate or process data so there is no real issue there.  The latency determines the delay in getting from one point in the chip to another and may be important, especially if you need to calibrate and/or match the delay.  In fact I have some concerns about syncing delays between separate paths for multiple ADC inputs. 

> > My current step is to figure out how to run with SDRAM and
> > why GA abandoned using SDRAM on their eval board.  There must have
> > been a reason because the GA144 has ROM code for SDRAM.
> 
> You could ask them...

Yes, but so far it has been a difficult process to get useful answers.  The answers only provoke more questions really.  So far they have said that the SDRAM interface was only tried on the S40 and was a failure because it was "too specific".  So why is this in the ROM code on the GA144?  Also, the GA144 data sheet says the ROM code is for "an external SDRAM using 18-bit data."  I can't find any 18 bit SDRAM chips these days.  I think they were using a 32 bit part and ignoring 14 bits.  Also, I just realized the part number they gave me was for a DDR part which requires a differential clock.  This is just a complication with no value in my opinion.  An SDRAM using a single wire clock will work just as well since it will be exceedingly hard to run these devices at 100 MHz, much less 166 or 200 MHz.  Also, the SDR device I am looking at is lower power than the DDR device they used. 

Rick

#14593

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-07-31 11:48 -0700
Message-ID	<7x1ujrydfd.fsf@ruckus.brouhaha.com>
In reply to	#14568

rickman <gnuarm@gmail.com> writes:
>>  My idea was simply to add diagonal connections
> If you are going to improve the internode comms I would suggest
> a cube arrangement.

That would be more traditional and probably better for the user, but it
would mean having to run traces across long distances on the chip, which
I figure is a bigger redesign, as well as being a philosophical shift
away from a 2-D grid computer.

>>  http://www.greenarraychips.com/home/documents/greg/DB001-110412-F18A.pdf
>> pages 9 and 12.
> The F18A databook page 9, "5.1 nanoseconds when reading or writing
> internal memory".  These ops are not to or from "internal memory".

Page 12 section 3.3 paragraph 2:

    When a node operates on a port the data transfer occurs at
    approximately memory speed unless the other node connected to the
    port is not yet performing the complementary operation; in this case
    the operating node suspends...

That's why I figured 5.1 ns for the port transfers, similar to memory.
I also have to wonder how straightforward it is to get the adjacent node
exactly in sync so that the port never blocks.  Don't forget you may
need bidirectional wires for some uses, i.e. yet more delays.

> The cycle time does not need to be any faster than a node can generate
> or process data so there is no real issue there. 

At 15 ns cycle time (about 66 MHz) for the transfers, it's possible that
a simple processing loop on a 700 MHz node can outrun it.

> the GA144 data sheet says the ROM
> code is for "an external SDRAM using 18-bit data."  I can't find any
> 18 bit SDRAM chips these days. 

I remember a page that I think was on Chuck's site, about the ram setup
on the Haypress Creek board.  But I've just spent a while unsuccessfully
looking for it.

#14597

From	rickman <gnuarm@gmail.com>
Date	2012-07-31 14:54 -0700
Message-ID	<94845d94-80bf-48ec-8676-46553238c95f@googlegroups.com>
In reply to	#14593

On Tuesday, July 31, 2012 2:48:38 PM UTC-4, Paul Rubin wrote:
> rickman <gnuarm@gmail.com> writes:
> 
> >>  My idea was simply to add diagonal connections
> 
> > If you are going to improve the internode comms I would suggest
> > a cube arrangement.
> 
> That would be more traditional and probably better for the user, but it
> would mean having to run traces across long distances on the chip, which
> I figure is a bigger redesign, as well as being a philosophical shift
> away from a 2-D grid computer.

I think that is irrelevant.  If you want to improve the comms, then improve the comms.  Don't waste time with pointless changes. 

> >>  http://www.greenarraychips.com/home/documents/greg/DB001-110412-F18A.pdf
> >> pages 9 and 12.
> 
> > The F18A databook page 9, "5.1 nanoseconds when reading or writing
> > internal memory".  These ops are not to or from "internal memory".
> 
> 
> Page 12 section 3.3 paragraph 2:
> 
>     When a node operates on a port the data transfer occurs at
>     approximately memory speed unless the other node connected to the
>     port is not yet performing the complementary operation; in this case
>     the operating node suspends...
> 
> That's why I figured 5.1 ns for the port transfers, similar to memory.
> I also have to wonder how straightforward it is to get the adjacent node
> exactly in sync so that the port never blocks.  Don't forget you may
> need bidirectional wires for some uses, i.e. yet more delays.

Don't get confused about how they sync.  Whichever one does their operation first just waits.  Then the second to complete their side of the transfer proceeds without waiting and the first starts running again.  There is no dealing with "exactly in sync".  Anything else requires the node to read the status and do logic, in other words poll the port which is very inefficient.  

In the real world the app will be implemented to work with the protocol.  Since you don't have an app in mind it is hard to say what will work well and what won't. 

> > The cycle time does not need to be any faster than a node can generate
> > or process data so there is no real issue there. 
> 
> At 15 ns cycle time (about 66 MHz) for the transfers, it's possible that
> a simple processing loop on a 700 MHz node can outrun it.

Yes, if it is doing nearly nothing.  Remember that the "simple processing loop" also has to do comms to talk to the next node.  

> > the GA144 data sheet says the ROM
> > code is for "an external SDRAM using 18-bit data."  I can't find any
> > 18 bit SDRAM chips these days. 
> 
> I remember a page that I think was on Chuck's site, about the ram setup
> on the Haypress Creek board.  But I've just spent a while unsuccessfully
> looking for it.

I've read a lot of Chuck's web pages and the only thing I remember reading is that they didn't like the way the SDRAM interface worked out.  Chuck didn't do it; someone else did.  I am going to use SDRAM (assuming I decide this effort is worth building a PCB) as it is so much faster in burst mode than an SRAM can be.  The input side (writing from o'scope inputs) may be the bottleneck in the end.  This is random writes, so there may not be a huge difference between the two depending on the details of the design requirements.  I still don't know what is feasible given the resources.

Rick

#14599

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-07-31 16:09 -0700
Message-ID	<7xsjc7wmrn.fsf@ruckus.brouhaha.com>
In reply to	#14597

rickman <gnuarm@gmail.com> writes:
> I think that is irrelevant.  If you want to improve the comms, then
> improve the comms.  Don't waste time with pointless changes.

I don't think it's pointless.  The suggestion was based on trying to
code various algorithms for the GA and finding that I ran out of
capacity, and more channels would have helped.  It's possible that the
GA designers looked into the idea and found it would have created some
problem that I don't know about, but my discussion with Jeff made me
think it wouldn't have been a big deal.

> Don't get confused about how they sync.  Whichever one does their
> operation first just waits.  ...  There is no dealing with "exactly in
> sync".  

I'm not worried about locking up or getting the wrong data, but just
that any wait time imposes latency on the requesting node, slowing
the whole thing down.

#14627

From	rickman <gnuarm@gmail.com>
Date	2012-08-01 12:57 -0700
Message-ID	<a218ea56-0265-4559-a462-51a7f1a182cb@googlegroups.com>
In reply to	#14599

On Tuesday, July 31, 2012 7:09:48 PM UTC-4, Paul Rubin wrote:
> rickman <gnuarm@gmail.com> writes:
> 
> > I think that is irrelevant.  If you want to improve the comms, then
> > improve the comms.  Don't waste time with pointless changes.
> 
> I don't think it's pointless.  The suggestion was based on trying to
> code various algorithms for the GA and finding that I ran out of
> capacity, and more channels would have helped.  It's possible that the
> GA designers looked into the idea and found it would have created some
> problem that I don't know about, but my discussion with Jeff made me
> think it wouldn't have been a big deal.

I can't say anything about your app, but I can't see how the interprocessor comms capacity is lacking in any way relative to the speed of the processors. 

> > Don't get confused about how they sync.  Whichever one does their
> > operation first just waits.  ...  There is no dealing with "exactly in
> > sync".  
> 
> I'm not worried about locking up or getting the wrong data, but just
> that any wait time imposes latency on the requesting node, slowing
> the whole thing down.

How do you sync anything without waiting?  If you are thinking in these terms you really don't understand how computers work at a low level.  Every FF in a CPU slows down the propagation of signals in order to synchronize them.  Even async designs like the F18 CPU delay the latching of data until *after* the data input to the FF is stable.  Time is always utilized to sync signals.

You don't have to sync the comms.  There are ways to read and write ports without waiting.  But that is too slow for most designs.  

I think someone suggested that the GA144 is somewhat like a systolic array.  I suggest you take a look at how they work.  This will help you understand the synchronization via the comms channels.  

Rick

#14632

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-08-01 14:16 -0700
Message-ID	<7x62928g8u.fsf@ruckus.brouhaha.com>
In reply to	#14627

rickman <gnuarm@gmail.com> writes:
> I can't say anything about your app, but I can't see how the
> interprocessor comms capacity is lacking in any way relative to the
> speed of the processors.

I remember I was trying to code the RC4 stream cipher, which wants a 256
byte ram array most straightforwardly implemented using a word per byte.
With diagonal connections you could put the algorithm on one node and
use 4 of the 8 adjacent nodes as ram, quite nice.  With just orthogonal
connections unless there's a clever layout I didn't spot, you have to
use 7 nodes and the ram data has to cross 3 hops with some delays at
each of the relay nodes (they're not just wire nodes).  I looked into
doing it with 3 nodes packing 2 bytes into each ram word, but it was
ugly and inefficient (tons of bit shifts) and not at all clear that I
could fit all the instructions into ram.
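
A small Python sketch of the node-count arithmetic, assuming 64 words of RAM per F18A node and one byte stored per word. It ignores any RAM the serving code on each memory node would itself need, and the helper names are hypothetical:

```python
WORDS_PER_NODE = 64  # F18A RAM size in 18-bit words


def ram_nodes_needed(words, words_per_node=WORDS_PER_NODE):
    """Nodes required to hold `words` words, rounding up."""
    return -(-words // words_per_node)  # ceiling division


def locate(address, words_per_node=WORDS_PER_NODE):
    """Map a flat address to (node index, offset within that node)."""
    return divmod(address, words_per_node)


if __name__ == "__main__":
    print("RC4 state, 256 words:", ram_nodes_needed(256), "nodes")
    print("512-word buffer:", ram_nodes_needed(512), "nodes")
    print("address 200 lives at (node, offset):", locate(200))
```

With only four orthogonal neighbours, one of which is normally needed for input or output, four RAM nodes cannot all sit next to the compute node, which is where the relay hops described above come from.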

I may look at the app note md5 implementation to see if it would be
obviously smaller/faster with more connections.  The existing way is
pretty ugly in my opinion.

> How do you sync anything without waiting?  

I just mean that if the writer node is faster than the reader, then the
reader doesn't have to wait.

#14634

From	vandys@vsta.org
Date	2012-08-01 22:32 +0000
Message-ID	<a7tp3iF32cU1@mid.individual.net>
In reply to	#14632

Paul Rubin <no.email@nospam.invalid> wrote:
> I may look at the app note md5 implementation to see if it would be
> obviously smaller/faster with more connections.  The existing way is
> pretty ugly in my opinion.

I'm still hoping somebody with real hardware can benchmark the
implementation.  The walk-through gives me the impression that implementing
the needed memory space across multiple nodes will really impact performance.
I talked with Jeff Fox about fewer nodes with more memory, but he seemed
pretty sure that there was no way to substantially increase the node memory,
even if you had gained some real estate by reducing the number of nodes.

-- 
Andy Valencia
Home page: http://www.vsta.org/andy/
To contact me: http://www.vsta.org/contact/andy.html

#14635

From	rickman <gnuarm@gmail.com>
Date	2012-08-01 16:03 -0700
Message-ID	<bca2ac04-f41c-4a22-9321-7ad5ba387a29@googlegroups.com>
In reply to	#14632

On Wednesday, August 1, 2012 5:16:49 PM UTC-4, Paul Rubin wrote:
> rickman <gnuarm@gmail.com> writes:
> 
> > How do you sync anything without waiting?  
> 
> I just mean that if the writer node is faster than the reader, then the
> reader doesn't have to wait.

If the writer is faster than the reader, the reader will never keep up!  It all needs to be synced.  If the writer writes and the reader is not ready the writer waits.  If the reader reads and there is no data, the reader waits.  Unless they are perfectly matched, one has to wait.

Rick

#14637

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-08-01 16:13 -0700
Message-ID	<7xvch2jje4.fsf@ruckus.brouhaha.com>
In reply to	#14635

rickman <gnuarm@gmail.com> writes:
> If the writer is faster than the reader, the reader will never keep up! 

Yes, it's ok if the writer blocks, if the overall throughput is
determined by how fast the reader can process the data (including
reading it).  That's what we were talking about.  It's important in that
situation to keep the reader from blocking.  If you can make the writer
fast enough that it has to block without ever making the reader block,
you're doing fine.
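
A toy model of the blocking behaviour under discussion, as a sketch only. It assumes a transfer happens the instant both sides are at the port and takes no time itself, and that each side needs a fixed amount of work per item. The function name rendezvous and the time units are made up for illustration.

```python
def rendezvous(n_items, writer_cost, reader_cost):
    """Model n_items blocking transfers between one writer and one reader.

    Returns (writer_wait, reader_wait, finish_time) in arbitrary time units.
    """
    writer_ready = writer_cost   # when the writer first has an item to send
    reader_ready = 0             # when the reader first asks for an item
    writer_wait = 0
    reader_wait = 0
    for _ in range(n_items):
        # The transfer happens once both sides have reached the port.
        t = max(writer_ready, reader_ready)
        writer_wait += t - writer_ready
        reader_wait += t - reader_ready
        writer_ready = t + writer_cost   # writer produces the next item
        reader_ready = t + reader_cost   # reader processes the item it got
    return writer_wait, reader_wait, reader_ready


if __name__ == "__main__":
    print(rendezvous(4, 2, 5))   # fast writer, slow reader
    print(rendezvous(4, 5, 2))   # slow writer, fast reader
    print(rendezvous(4, 3, 3))   # matched speeds
```

In this model a faster writer absorbs all the steady-state waiting and the reader stalls only once, while the first item is being produced. With the costs swapped, the reader stalls on every item.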

#14636

From	rickman <gnuarm@gmail.com>
Date	2012-08-01 16:10 -0700
Message-ID	<677bc860-fa98-4ca8-a557-7aa70c11876f@googlegroups.com>
In reply to	#14593

On Tuesday, July 31, 2012 2:48:38 PM UTC-4, Paul Rubin wrote:
> rickman <gnuarm@gmail.com> writes:
> 
> > The F18A databook page 9, "5.1 nanoseconds when reading or writing
> > internal memory".  These ops are not to or from "internal memory".
> 
> Page 12 section 3.3 paragraph 2:
> 
>     When a node operates on a port the data transfer occurs at
>     approximately memory speed unless the other node connected to the
>     port is not yet performing the complementary operation; in this case
>     the operating node suspends...
> 
> That's why I figured 5.1 ns for the port transfers, similar to memory.
> I also have to wonder how straightforward it is to get the adjacent node
> exactly in sync so that the port never blocks.  Don't forget you may
> need bidirectional wires for some uses, i.e. yet more delays.

GA144 data book page 19...

TIOREG IO register read/write opcode time @+ @b @ !+ !b ! 3250 3500 4100 pS

The last three numbers are Min Typ and Max timings.  There is also an instruction fetch adder depending on whether the instruction has been prefetched or not.  So the only way to really time code is to use the simulator.  It's just too tricky to do by hand. 

Rick

#14648

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-08-01 22:18 -0700
Message-ID	<7xk3xhj2gw.fsf@ruckus.brouhaha.com>
In reply to	#14636

rickman <gnuarm@gmail.com> writes:
> TIOREG IO register read/write opcode time @+ @b @ !+ !b ! 3250 3500 4100 pS
> The last three numbers are Min Typ and Max timings.  There is also an
> instruction fetch adder depending on whether the instruction has been
> prefetched or not.  

The simulator would be better but if I guess an instruction fetch is
needed every 3.6 instructions (18 bits = 3.6 * 5 bits), and if
instruction fetch costs 5.1 ns like other memory accesses, we're at 3500
+ 5100/3.6 ps = 4916 ps, awfully close to 5.1 ns and maybe equal if
there's a little bit more delay in there somewhere.

I wonder how much real GA hardware is actually out there being
programmed.  Does anyone on this newsgroup have an evaluation board?

#14668

From	rickman <gnuarm@gmail.com>
Date	2012-08-02 11:47 -0700
Message-ID	<19981049-2688-4f99-83cc-6f6b1ac5221d@googlegroups.com>
In reply to	#14648

On Thursday, August 2, 2012 1:18:55 AM UTC-4, Paul Rubin wrote:
> rickman <gnuarm@gmail.com> writes:
> 
> > TIOREG IO register read/write opcode time @+ @b @ !+ !b ! 3250 3500 4100 pS
> > The last three numbers are Min Typ and Max timings.  There is also an
> > instruction fetch adder depending on whether the instruction has been
> > prefetched or not.  
> 
> The simulator would be better but if I guess an instruction fetch is
> needed every 3.6 instructions (18 bits = 3.6 * 5 bits), and if
> instruction fetch costs 5.1 ns like other memory accesses, we're at 3500
> + 5100/3.6 ps = 4916 ps, awfully close to 5.1 ns and maybe equal if
> there's a little bit more delay in there somewhere.
> 
> I wonder how much real GA hardware is actually out there being
> programmed.  Does anyone on this newsgroup have an evaluation board?

You speculate too much.  The time required to fetch instructions depends on the instruction currently being executed.  There is a prefetch function but it has to restart when other memory accesses are made.  See the timing info in the manual.  

Also, your math is beyond me.  What is the meaning of 3.6?  Why do you care about this number?  What is 3500? 

Rick

#14672

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-08-02 14:32 -0700
Message-ID	<7xk3xhou8t.fsf@ruckus.brouhaha.com>
In reply to	#14668

rickman <gnuarm@gmail.com> writes:
> You speculate too much.  The time required to fetch instructions
> depends on the instruction currently being executed. 

3.6 is a guess at an average.  Obviously the simulator or actual timings
will be more accurate.

> Also, your math is beyond me.  What is the meaning of 3.6?  Why do you
> care about this number?  What is 3500?

3.6 = 18 bits/word / 5 bits per instruction (certain instructions
fit in the remaining 3 bits at the end of a word but they are not
always usable, so I counted them as 0.6).

3500 ps = typical port operation time that you posted, not counting
possible instruction fetch.

I'll be interested to see better numbers.  The above is what's known as
a back-of-envelope calculation based on incomplete info.
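
The back-of-envelope estimate written out as arithmetic, with everything in picoseconds. The inputs are the figures from this thread: 3500 ps typical port operation, an assumed 5100 ps per instruction fetch, and a guessed 3.6 instructions per word. The function name estimate_ps is made up for illustration, and the result is no more reliable than those guesses.

```python
PORT_OP_PS = 3500      # typical port read/write opcode time quoted from the data book
MEM_FETCH_PS = 5100    # assumption: an instruction fetch costs the same as a memory access
INSTR_PER_WORD = 3.6   # guessed average number of instructions per 18-bit word


def estimate_ps(port_op_ps=PORT_OP_PS,
                mem_fetch_ps=MEM_FETCH_PS,
                instr_per_word=INSTR_PER_WORD):
    """Port operation time plus the fetch cost amortized over one word's instructions."""
    return port_op_ps + mem_fetch_ps / instr_per_word


if __name__ == "__main__":
    est = estimate_ps()
    print(f"{est:.0f} ps = {est / 1000:.2f} ns")
```

Passing instr_per_word=4 gives the figure for the case where every slot holds a useful instruction.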

#14678

From	rickman <gnuarm@gmail.com>
Date	2012-08-02 16:01 -0700
Message-ID	<c999b0ef-d293-4ebd-9a2d-e76eb0cb73cb@googlegroups.com>
In reply to	#14672

On Thursday, August 2, 2012 5:32:18 PM UTC-4, Paul Rubin wrote:
> rickman <gnuarm@gmail.com> writes:
> 
> > You speculate too much.  The time required to fetch instructions
> > depends on the instruction currently being executed. 
> 
> 3.6 is a guess at an average.  Obviously the simulator or actual timings
> will be more accurate.
> 
> > Also, your math is beyond me.  What is the meaning of 3.6?  Why do you
> > care about this number?  What is 3500?
> 
> 3.6 = 18 bits/word / 5 bits per instruction (certain instructions
> fit in the remaining 3 bits at the end of a word but they are not
> always usable, so I counted them as 0.6).
> 
> 3500 ps = typical port operation time that you posted, not counting
> possible instruction fetch.
> 
> I'll be interested to see better numbers.  The above is what's known as
> a back-of-envelope calculation based on incomplete info.

I saw how you got the 3.6 number, but what is its purpose?  There are four instructions in a word.  Why call it 3.6 instructions?  Why are you adding 3500 to the other numbers you posted?  I don't see the meaning of this calculation.

Rick

#14679

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-08-02 16:38 -0700
Message-ID	<7xy5lw985e.fsf@ruckus.brouhaha.com>
In reply to	#14678

rickman <gnuarm@gmail.com> writes:
> I saw how you got the 3.6 number, but what is its purpose?  There are
> four instructions in a word.  Why call it 3.6 instructions?

There's only 4 instructions per word if you can actually use all 4
slots.  The 4th slot is useless a lot of the time because there are only
3 bits in it, so only 8 of the 32 possible opcodes can go into it, so
you end up having to put a no-op there.  The other slots are also often
not usable for instructions or cause an extra fetch, because you need to
bring in a literal.  4 instructions is the maximum, and it's achievable
only some of the time, so the average is less than 4.  I picked 3.6 as a
reasonable-looking estimate.  If you've got a better one, go ahead and
post it.
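
The slot argument above as a small sketch. It takes the 18-bit word as three 5-bit slots plus one 3-bit slot, as described in this thread, and treats the chance that the last slot holds a useful instruction as a free parameter. It ignores literals and jumps, which lower the average further. The names SLOT_BITS, encodable_opcodes and average_instructions are made up for illustration.

```python
SLOT_BITS = (5, 5, 5, 3)   # slot widths in one 18-bit instruction word
OPCODE_BITS = 5            # a full opcode is 5 bits, so 32 opcodes in total


def encodable_opcodes(bits):
    """How many distinct opcodes a slot of the given width can select."""
    return min(2 ** bits, 2 ** OPCODE_BITS)


def average_instructions(p_last_slot_used):
    """Average instructions per word if the short slot is useful with this probability."""
    full_slots = sum(1 for b in SLOT_BITS if b >= OPCODE_BITS)
    return full_slots + p_last_slot_used


if __name__ == "__main__":
    for bits in SLOT_BITS:
        print(bits, "bits ->", encodable_opcodes(bits), "opcodes")
    print(average_instructions(0.6))
```

Counting the last slot as 0.6 of an instruction reproduces the 3.6 guess. Counting it as always usable gives the maximum of 4.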

> Why are you adding 3500 to the other numbers you posted?  I don't see
> the meaning of this calculation.

3500 for the port delay per your post, plus some unknown (guesstimated
above) amount for instruction fetch, averaged out across instructions.
You wrote:

> GA144 data book page 19...
> TIOREG IO register read/write opcode time @+ @b @ !+ !b ! 3250 3500 4100 pS
> The last three numbers are Min Typ and Max timings

3500 according to you is typical, so I used that.  Again, if you don't
like my estimate, feel free to post your own.  I don't think there is
enough info to do a really reliable calculation, but the available info
seems like a reasonable starting point.  Otherwise we have to assume the
worst possible case, which is intolerable delays in getting stuff around
between nodes.

#14691

From	rickman <gnuarm@gmail.com>
Date	2012-08-03 08:40 -0700
Message-ID	<407431d3-e67e-4344-aae9-7098856d4ff3@googlegroups.com>
In reply to	#14679

On Thursday, August 2, 2012 7:38:37 PM UTC-4, Paul Rubin wrote:
> rickman <gnuarm@gmail.com> writes:
> 
> > I saw how you got the 3.6 number, but what is its purpose?  There are
> > four instructions in a word.  Why call it 3.6 instructions?
> 
> There's only 4 instructions per word if you can actually use all 4
> slots.  The 4th slot is useless a lot of the time because there are only
> 3 bits in it, so only 8 of the 32 possible opcodes can go into it, so
> you end up having to put a no-op there.  The other slots are also often
> not usable for instructions or cause an extra fetch, because you need to
> bring in a literal.  4 instructions is the maximum, and it's achievable
> only some of the time, so the average is less than 4.  I picked 3.6 as a
> reasonable-looking estimate.  If you've got a better one, go ahead and
> post it.
> 
> > Why are you adding 3500 to the other numbers you posted?  I don't see
> > the meaning of this calculation.
> 
> 3500 for the port delay per your post, plus some unknown (guesstimated
> above) amount for instruction fetch, averaged out across instructions.
> You wrote:
> 
> > GA144 data book page 19...
> > TIOREG IO register read/write opcode time @+ @b @ !+ !b ! 3250 3500 4100 pS
> > The last three numbers are Min Typ and Max timings
> 
> 3500 according to you is typical, so I used that.  Again, if you don't
> like my estimate, feel free to post your own.  I don't think there is
> enough info to do a really reliable calculation, but the available info
> seems like a reasonable starting point.  Otherwise we have to assume the
> worst possible case, which is intolerable delays in getting stuff around
> between nodes.

I don't know what you are doing.  You are explaining the tiny details and not the overall picture.  

Rick

#14697

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-08-03 09:19 -0700
Message-ID	<7x8vdw3q3m.fsf@ruckus.brouhaha.com>
In reply to	#14691

rickman <gnuarm@gmail.com> writes:
> I don't know what you are doing.  You are explaining the tiny details
> and not the overall picture.

The overall picture is spelled out on page 12 of the databook, where it
says that reading from a port takes about the same amount of time as
reading from memory, and it says that reading from memory takes about
5.1 ns.  I did a calculation that didn't claim to be rigorously grounded
but which seemed to confirm that the stated number was plausible.

#14701

From	rickman <gnuarm@gmail.com>
Date	2012-08-03 11:15 -0700
Message-ID	<bf496233-0aaf-4103-b9a8-0ed6ecc65d32@googlegroups.com>
In reply to	#14697

On Friday, August 3, 2012 12:19:41 PM UTC-4, Paul Rubin wrote:
> rickman <gnuarm@gmail.com> writes:
> 
> > I don't know what you are doing.  You are explaining the tiny details
> > and not the overall picture.
> 
> The overall picture is spelled out on page 12 of the databook, where it
> says that reading from a port takes about the same amount of time as
> reading from memory, and it says that reading from memory takes about
> 5.1 ns.  I did a calculation that didn't claim to be rigorously grounded
> but which seemed to confirm that the stated number was plausible.

Your calculations make no sense to me.  Sorry, but unless you can explain them a bit more I have no idea why you are adding the numbers you are adding.  I'm not even sure what the resulting number is an estimate of. 

Also, I have already explained to you that the part of the data book you are quoting is a gross simplification of the real timing data.  Read the timing section of the GA144 data book, DB002, sec 5.3 Typical Instruction Timings.  They give timing for each instruction type and also the adders for instruction fetch based on the instruction mix in a given instruction word.  

Rick
