Wilco Dijkstra Guest
|
Posted: Thu Aug 28, 2008 4:08 am Post subject: Re: Future architectures [was Re: Intel details future Larra |
|
|
"MooseFET" <kensmith@rahul.net> wrote in message
news:484df48f-a0c2-4d1a-a8d2-0e0278684b6c@i20g2000prf.googlegroups.com...
| Quote: | On Aug 27, 5:05 pm, "Wilco Dijkstra"
<Wilco.removethisDijks...@ntlworld.com> wrote:
"MooseFET" <kensm...@rahul.net> wrote in message
news:d46d5e1c-2df4-436a-a2a6-ac823557b1c4@q5g2000prf.googlegroups.com...
On Aug 27, 7:05 am, "Wilco Dijkstra"
<Wilco.removethisDijks...@ntlworld.com> wrote:
"MooseFET" <kensm...@rahul.net> wrote in message
news:ac9d3411-d83d-49d5-8310-bd649ed43a69@n33g2000pri.googlegroups.com...
On Aug 26, 7:51 pm, Terje Mathisen <terje.mathi...@hda.hydro.com> wrote:
JosephKK wrote:
On 25 Aug 2008 18:32:43 GMT, n...@cus.cam.ac.uk (Nick Maclaren) wrote:
That doesn't help with denormals, though! You can't always use them
in denormalised form, and can rarely use the same code as for normal
numbers unless you normalise them.
Regards,
Nick Maclaren.
Would you kindly explain to me how to normalize a denormal without
expanding the exponent range?
You don't.
I.e. a sw library will almost certainly choose to work in an internal
exponent format with a lot more bits, like a 32-bit int.
To follow the spec you have to denormalize (if needed) the result again
after each operation, unless you can fake it exactly.
One possible idea would be to mask away (with proper rounding) the
bottom bits that would have been shifted away during the conversion to
external exponent range.
If I was coding such a library I would most likely convert to an
internal format with a longer mantissa and a base 256 exponent. While
the numbers are being held in the internals, a few extra bytes needed
for such a format would be a low price to pay for the greater speed.
A base 256 number with a longer mantissa speeds up adding and
subtracting at the cost of some speed in the multiply and divide.
Making the mantissa a multiple of the natural word length of the
processor gets you most of that back.
The mantissa would normally already be 32-bit or 64-bit internally.
Addition is already as simple as:
res = manta + (mantb >> (expa - expb));
I can't see how rebasing the exponent could possibly simplify this.
It is that nasty ">>" operator, and the "if" logic you forgot to
include, that makes it slow. If the processor you are working with
doesn't do the shifts quickly, the base 256 exponent speeds things up
a lot.
Consider writing a floating point package for a Z80 and see how it
makes a huge difference in an extreme case.
Base256 could help on CPUs with slow shifts, but only if you use a
non-IEEE format. If you convert between base256 and IEEE exponent
for every operation then you end up with more shifting overall.
I think you missed the point about keeping the numbers as base 256
while they are being worked on. This means that you only need to
convert to and from the IEEE format on the way in and out. If you are
doing an FFT, the conversion time would be small compared to the
savings in the FFT.
|
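For reference, the one-line addition quoted above can be expanded into a small C sketch that includes the "if" logic under discussion. The `unpacked_t` type and the `add_aligned` name are assumptions made for illustration; no poster's library looks like this. Operands are assumed positive, exponents are assumed close enough that their difference fits in 32 bits, and rounding, renormalisation and mantissa overflow are all left out.

```c
#include <stdint.h>

/* Hypothetical unpacked internal format: value = mant * 2^exp.
   Names and layout are assumptions for illustration only. */
typedef struct {
    uint32_t mant;  /* unsigned mantissa, sign handled elsewhere */
    int32_t  exp;   /* wide internal exponent */
} unpacked_t;

/* Add two positive unpacked values, truncating shifted-out bits
   (no rounding, no renormalisation, no overflow handling). */
unpacked_t add_aligned(unpacked_t a, unpacked_t b)
{
    /* The "if" logic: make a the operand with the larger exponent. */
    if (a.exp < b.exp) {
        unpacked_t t = a;
        a = b;
        b = t;
    }
    uint32_t shift = (uint32_t)(a.exp - b.exp);
    /* Shifting a 32-bit value by 32 or more is undefined in C. */
    uint32_t small = (shift < 32) ? (b.mant >> shift) : 0;
    unpacked_t r = { a.mant + small, a.exp };
    return r;
}
```

On a CPU with a barrel shifter the variable shift is a single instruction. On an 8-bit part it becomes a loop, which is where the base 256 argument comes from.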
Indeed. But if you use a non-IEEE format most of the time, why not use
it all the time? Using IEEE on 8-bit micros seems overkill... Have you
seen Steve Wozniak's amazingly compact 6502 FP emulation code?
That's a reasonable format - with base256 it would likely be faster still.
| Quote: | The above
shift would need at most a 4-bit shift on an 8-bitter, so it's not too bad at all.
A 4-bit shift takes quite a bit of time on an 8-bitter. You have only
one carry to transfer the bits between bytes so it is usually faster to
do 4 one-bit shifts. Here it is for an 8051:
CLR C ; Shift in a zero
MOV A,LSB ; Load the lowest
RLC A ; Shift up one
MOV LSB,A
MOV A,LSB+1 ; Next byte
RLC A
MOV LSB+1,A
MOV A,LSB+2 ; Next byte
RLC A
MOV LSB+2,A
MOV A,LSB+3 ; Next byte
RLC A
MOV LSB+3,A
As you can see, it comes out to 13 instructions per one-bit shift.
This makes it well worth avoiding if you can.
|
Note you can use XCH in the above code to get it down to 9 instructions.
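For readers who do not read 8051 assembly, the quoted shift sequence corresponds roughly to the C sketch below. The function names are hypothetical, and the mantissa is assumed to be four bytes stored least significant byte first, as in the quoted listing. The last function shows what a base 256 exponent buys: alignment becomes a whole-byte move with no carry chain.

```c
#include <stdint.h>

/* Shift a 4-byte little-endian mantissa left by one bit,
   passing a single carry bit from byte to byte, as RLC does. */
void shl1_bytes(uint8_t m[4])
{
    uint8_t carry = 0;                          /* CLR C */
    for (int i = 0; i < 4; i++) {
        uint8_t next = (uint8_t)(m[i] >> 7);    /* bit leaving this byte */
        m[i] = (uint8_t)((m[i] << 1) | carry);  /* RLC A */
        carry = next;
    }
}

/* A 4-bit shift repeats the whole pass four times. */
void shl4_bytes(uint8_t m[4])
{
    for (int n = 0; n < 4; n++)
        shl1_bytes(m);
}

/* With a base 256 exponent, alignment is a whole-byte move instead. */
void shl8_bytes(uint8_t m[4])
{
    m[3] = m[2];
    m[2] = m[1];
    m[1] = m[0];
    m[0] = 0;
}
```

The sketch shifts left, like the quoted listing. Aligning the smaller operand in an addition would shift right from the most significant byte, with the same cost per bit.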
Wilco |
|
Andrew Reilly Guest
|
Posted: Thu Aug 28, 2008 6:16 am Post subject: Re: Future architectures [was Re: Intel details future Larra |
|
|
On Wed, 27 Aug 2008 18:05:53 +0200, Terje Mathisen wrote:
| Quote: | Wilco Dijkstra wrote:
"MooseFET" <kensmith@rahul.net> wrote in message
A base 256 number with a longer mantissa speeds up adding and
subtracting at the cost of some speed in the multiply and divide.
Making the mantissa a multiple of the natural word length of the
processor gets you most of that back.
The mantissa would normally already be 32-bit or 64-bit internally.
Addition is already as simple as:
res = manta + (mantb >> (expa - expb));
I can't see how rebasing the exponent could possibly simplify this.
With long mantissas, like 100+ bits for 128-bit fp, the mantissa shift
operation has to involve multiple integer registers, while using mod 256
allows byte moves (or on some architectures like x86, unaligned loads)
to skip the shifts entirely.
|
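A minimal sketch of the byte-move alignment described in the quote, assuming a 128-bit mantissa held as 16 little-endian bytes and an exponent that counts powers of 256. `MANT_BYTES` and `align_right_bytes` are illustrative names, not part of any library mentioned in the thread.

```c
#include <stdint.h>
#include <string.h>

#define MANT_BYTES 16  /* assumed 128-bit mantissa, little-endian bytes */

/* Right-align src by `bytes` whole bytes (divide by 256^bytes),
   discarding the low bytes. No bit shifts are needed.
   dst and src must not overlap. */
void align_right_bytes(uint8_t dst[MANT_BYTES],
                       const uint8_t src[MANT_BYTES],
                       unsigned bytes)
{
    memset(dst, 0, MANT_BYTES);
    if (bytes < MANT_BYTES)
        memcpy(dst, src + bytes, MANT_BYTES - bytes);
}
```

Whether the copy beats a multi-register shift depends on the processor, which is the point raised in the reply that follows.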
In the ARM (and some other processors), byte permutations and unaligned
loads are achieved with shifts, so this "advantage" is quite processor
dependent. Well, I guess that x86 and the general ability to do
unaligned loads probably shifts the numbers towards byte ops as time goes
on.
Cheers,
--
Andrew |
|
Andrew Reilly Guest
|
Posted: Thu Aug 28, 2008 7:05 am Post subject: Re: Future architectures [was Re: Intel details future Larra |
|
|
On Thu, 28 Aug 2008 00:08:49 +0100, Wilco Dijkstra wrote:
| Quote: | Indeed. But if you use a non-IEEE format most of the time, why not use
it all the time? Using IEEE on 8-bit micros seems overkill... Have you
seen Steve Wozniak's amazingly compact 6502 FP emulation code? That's a
reasonable format - with base256 it would likely be faster still.
|
I vaguely remember being quite impressed with Bill Gates' five-byte
floating point format used in TRS-80 Z-80 Basic. Seemed quite
reasonable, under the circumstances. Both of those pre-date IEEE FP of
course, so compatibility wasn't a concern.
Perhaps I was just easily impressed back then :-)
Cheers,
--
Andrew |
|
JosephKK Guest
|
Posted: Mon Sep 01, 2008 1:53 am Post subject: Re: Von Neumann and revisionists [Re: Future architectures [ |
|
|
On Mon, 25 Aug 2008 06:08:42 -0700 (PDT), already5chosen@yahoo.com
wrote:
| Quote: | On Aug 25, 3:53 pm, already5cho...@yahoo.com wrote:
On Aug 25, 3:36 pm, n...@cus.cam.ac.uk (Nick Maclaren) wrote:
In article <411fa0ee-729c-4c18-81b0-abaebdeef...@l42g2000hsc.googlegroups.com>, already5cho...@yahoo.com writes:
|
|> Conclusion: the people that use the term "Von Neumann architecture" as
|> a common replacement for "architecture based on interpreting of serial
|> or near-serial instruction streams fetched from random-access memory"
|> are true revisionists.
Ah. Well, I side with Backus - who is both massively more eminent
than I am and of a previous generation.
http://portal.acm.org/citation.cfm?id=359579
Are you claiming that he was being a revisionist in that?
Regards,
Nick Maclaren.
Sorry, ACM portal refuses to show me what you mean.
Figured out that you most likely had in mind this particular citation:
"Conventional programming languages are growing ever more enormous,
but not stronger. Inherent defects at the most basic level cause them
to be both fat and weak: their primitive word-at-a-time style of
programming inherited from their common ancestor--the von Neumann
computer... etc"
Yes, Backus is most certainly a revisionist. The property he is
talking about predated Von Neumann's contribution. If anything, he
should have praised Von Neumann for showing us one possible way out of
the maze, although probably not the best one from a performance perspective.
|
Caches or not, memory speed has been more performance limiting than
CPU speed for decades. Multiple CPUs on a single socket only
aggravate this. Multiple memory busses might help. |
|
John Doe Guest
|
Posted: Mon Sep 01, 2008 2:14 am Post subject: Re: Von Neumann and revisionists [Re: Future architectures [ |
|
|
JosephKK <quiettechblue@yahoo.com> wrote:
| Quote: | On Mon, 25 Aug 2008 06:08:42 -0700 (PDT), already5chosen@yahoo.com
|
....
| Quote: | Yes, Backus is most certainly a revisionist. The property he is
talking about predated Von Neumann's contribution. If anything, he
should have praised Von Neumann for showing us one possible way
out of the maze, although probably not the best one from a performance
perspective.
Caches or not, memory speed has been more performance limiting
than CPU speed for decades. Multiple CPUs on a single socket only
aggravate this. Multiple memory busses might help.
|
BWAAAHAHAHAAAAAA!!!!
Sounds like someone who is fishing for the motivation to upgrade.
I'll let you know when my multiple core CPU cannot use all cores at
100%. Multiple core CPUs are the biggest hardware performance leap
in many years. Bet on it.
--
The first big front wheel rollerblades.
http://www.flickr.com/photos/27532210@N04/2565924423/
Google Groups is destroying the USENET archive. |
|
Andrew Reilly Guest
|
Posted: Mon Sep 01, 2008 5:13 am Post subject: Re: Von Neumann and revisionists [Re: Future architectures [ |
|
|
On Sun, 31 Aug 2008 21:14:26 +0000, John Doe wrote:
| Quote: | JosephKK <quiettechblue@yahoo.com> wrote:
On Mon, 25 Aug 2008 06:08:42 -0700 (PDT), already5chosen@yahoo.com
...
Yes, Backus is most certainly a revisionist. The property he is talking
about predated Von Neumann's contribution. If anything, he should have
praised Von Neumann for showing us one possible way out of the maze,
although probably not the best one from a performance perspective.
Caches or not, memory speed has been more performance limiting than CPU
speed for decades. Multiple CPUs on a single socket only aggravate
this. Multiple memory busses might help.
BWAAAHAHAHAAAAAA!!!!
Sounds like someone who is fishing for the motivation to upgrade.
I'll let you know when my multiple core CPU cannot use all cores at
100%. Multiple core CPUs are the biggest hardware performance leap in
many years. Bet on it.
|
They may very well show you that they're running at 100% in a CPU use
meter administered by a time-sharing OS, but do you know how much of that
100% is the processor stalled, waiting for off-chip memory? [*1] Is the
throughput on your problem of choice four (or whatever) times what it is
on a single core?
Well, sometimes it is. My own algorithms fit neatly into two categories:
totally contained in cache (for modern values of cache), and totally
memory bandwidth limited, so I am happy to have a couple of extra cores.
I can imagine applications where it makes little difference, though.
[1] This is the single statistic that I most wish for, in an operating
system performance display, and I don't know how to get it. Is it
possible?
Cheers,
--
Andrew |
|
John Doe Guest
|
Posted: Mon Sep 01, 2008 6:42 am Post subject: Re: Von Neumann and revisionists [Re: Future architectures [ |
|
|
Andrew Reilly <andrew-newspost@areilly.bpc-users.org> wrote:
| Quote: | On Sun, 31 Aug 2008 21:14:26 +0000, John Doe wrote:
JosephKK <quiettechblue@yahoo.com> wrote:
|
....
| Quote: | Caches or not, memory speed has been more performance limiting
than CPU speed for decades. Multiple CPUs on a single socket
only aggravate this. Multiple memory busses might help.
I'll let you know when my multiple core CPU cannot use all cores
at 100%. Multiple core CPUs are the biggest hardware performance
leap in many years. Bet on it.
They may very well show you that they're running at 100% in a CPU
use meter administered by a time-sharing OS, but do you know how
much of that 100% is the processor stalled, waiting for off-chip
memory?
|
If I needed to know, I'd probably use Performance Monitor in Windows
XP.
| Quote: | [*1] Is the throughput on your problem of choice four (or
whatever) times what it is on a single core?
|
It's close enough for me.
| Quote: | I can imagine applications where it makes little difference,
though.
|
Some applications don't take advantage of multiple cores, but that's
not necessarily the CPU's fault. A good example is Supreme Commander
and a tiny utility called CoreMaximizer. Without the utility, one
core bounces against 100% and causes a replay to stutter while the
other core is 50 or 60%. With the utility, both cores are almost
even and there is a noticeable improvement in performance without
stuttering.
--
The first big front wheel rollerblades.
http://www.flickr.com/photos/27532210@N04/2565924423/ |
|
Dmitriy V'jukov Guest
|
Posted: Mon Sep 01, 2008 7:49 am Post subject: Re: AMD working on scaleable hardware-based atomic transacti |
|
|
On Sep 1, 04:09, "Chris M. Thomasson" <n...@spam.invalid> wrote:
| Quote: | Here ya go:
http://www.amd64.org/fileadmin/user_upload/pub/epham08-asf-eval.pdf
|
Cool!
It recalls Sun's HTM design, but AMD's design gives explicit control
over which locations to lock and which not to lock.
Interesting. When will we see it in AMD hardware?
Also, if Intel does not support HTM, using it will be very
problematic.
Dmitriy V'jukov |
|
Terje Mathisen Guest
|
Posted: Mon Sep 01, 2008 10:47 am Post subject: Re: Von Neumann and revisionists [Re: Future architectures [ |
|
|
Andrew Reilly wrote:
| Quote: | They may very well show you that they're running at 100% in a CPU use
meter administered by a time-sharing OS, but do you know how much of that
100% is the processor stalled, waiting for off-chip memory? [*1] Is the
[snip]
[1] This is the single statistic that I most wish for, in an operating
system performance display, and I don't know how to get it. Is it
possible?
|
It might very well be:
The Intel EMON counters (and similar on most other architectures) allow
you to count the number of cycles spent waiting for memory as well as
the number of cache misses.
Dividing the first count by the second gives the average clock cycles
per miss, and total wait cycles divided by total cycles is the fraction
of time lost to memory stalls.
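As a sketch of that arithmetic, assuming the three raw counts have already been read by some other means: the `counters_t` struct and both function names are hypothetical, and no real counter-reading API is shown here.

```c
#include <stdint.h>

/* Hypothetical raw counter readings over one sampling interval. */
typedef struct {
    uint64_t total_cycles;  /* all clock cycles in the interval */
    uint64_t stall_cycles;  /* cycles spent waiting for memory */
    uint64_t cache_misses;  /* cache misses counted */
} counters_t;

/* Average stall cycles per cache miss (0 if no misses were counted). */
double cycles_per_miss(counters_t c)
{
    return c.cache_misses
        ? (double)c.stall_cycles / (double)c.cache_misses
        : 0.0;
}

/* Fraction of all cycles lost to memory stalls (0 if no cycles). */
double stall_fraction(counters_t c)
{
    return c.total_cycles
        ? (double)c.stall_cycles / (double)c.total_cycles
        : 0.0;
}
```

For example, 250,000 stall cycles and 1,000 misses over 1,000,000 total cycles would give 250 cycles per miss and a stall fraction of 0.25.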
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
Ken Hagan Guest
|
Posted: Mon Sep 01, 2008 1:55 pm Post subject: Re: Von Neumann and revisionists [Re: Future architectures [ |
|
|
On Mon, 01 Sep 2008 02:42:32 +0100, John Doe <jdoe@usenetlove.invalid>
wrote:
| Quote: | If I needed to know, I'd probably use Performance Monitor in Windows
XP.
|
Straight out of the box, Performance Monitor doesn't have a suitable
counter for this. I believe Sysinternals do a widget that exposes
values from the RDPMC instruction. That might help if there were a
suitable performance counter to report.
But is there? There are certainly counters for cache misses, but those
don't necessarily affect performance in an OoO processor. I suppose
you could quote the "total number of instructions retired" value as
a proxy for "useful CPU work done", but the 100% level for that depends
on the available ILP.
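To make the proxy concrete, here is a small sketch that turns a retired-instruction count into instructions per cycle and then into a percentage of an assumed peak. Both function names and the `assumed_peak_ipc` parameter are hypothetical; the peak is exactly the number that is hard to pin down, because it depends on the core and on the ILP the code offers.

```c
#include <stdint.h>

/* Instructions per cycle from two hypothetical counter readings. */
double ipc(uint64_t instr_retired, uint64_t cycles)
{
    return cycles ? (double)instr_retired / (double)cycles : 0.0;
}

/* "Utilisation" relative to an assumed peak IPC. The peak is a guess
   that depends on the core and on the ILP the code actually offers. */
double relative_utilisation(uint64_t instr_retired, uint64_t cycles,
                            double assumed_peak_ipc)
{
    return assumed_peak_ipc > 0.0
        ? ipc(instr_retired, cycles) / assumed_peak_ipc
        : 0.0;
}
```

The same measurement reads as 50% busy against a peak of 3.0 and 100% busy against a peak of 1.5, which illustrates why the "100% level" is ambiguous.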
I suppose part of the problem is the lack of a suitable definition for
when the CPU is "usefully busy". |
|
MitchAlsup Guest
|
Posted: Mon Sep 01, 2008 7:21 pm Post subject: Re: AMD working on scaleable hardware-based atomic transacti |
|
|
see this thread from midway down the first page:
http://groups.google.com/group/comp.arch/browse_thread/thread/cedc2e76568ab0c5/934e3485d711d4a0?hl=en#934e3485d711d4a0 |
|
Martin Brown Guest
|
Posted: Tue Sep 02, 2008 1:39 am Post subject: Re: Von Neumann and revisionists [Re: Future architectures [ |
|
|
Andrew Reilly wrote:
| Quote: | On Sun, 31 Aug 2008 21:14:26 +0000, John Doe wrote:
JosephKK <quiettechblue@yahoo.com> wrote:
On Mon, 25 Aug 2008 06:08:42 -0700 (PDT), already5chosen@yahoo.com
...
Yes, Backus is most certainly a revisionist. The property he is talking
about predated Von Neumann's contribution. If anything, he should have
praised Von Neumann for showing us one possible way out of the maze,
although probably not the best one from a performance perspective.
Caches or not, memory speed has been more performance limiting than CPU
speed for decades. Multiple CPUs on a single socket only aggravate
this. Multiple memory busses might help.
BWAAAHAHAHAAAAAA!!!!
Sounds like someone who is fishing for the motivation to upgrade.
I'll let you know when my multiple core CPU cannot use all cores at
100%. Multiple core CPUs are the biggest hardware performance leap in
many years. Bet on it.
They may very well show you that they're running at 100% in a CPU use
meter administered by a time-sharing OS, but do you know how much of that
100% is the processor stalled, waiting for off-chip memory? [*1] Is the
throughput on your problem of choice four (or whatever) times what it is
on a single core?
Well, sometimes it is. My own algorithms fit neatly into two categories:
totally contained in cache (for modern values of cache), and totally
memory bandwidth limited, so I am happy to have a couple of extra cores.
I can imagine applications where it makes little difference, though.
[1] This is the single statistic that I most wish for, in an operating
system performance display, and I don't know how to get it. Is it
possible?
|
If you don't mind a little work figuring out how to use them, the Intel
performance monitoring counters can be configured to do what you want.
You have to run a custom Ring-0 driver to access them (so it isn't for
the faint-hearted). It can be very informative.
Look for ia32 from University of Texas or similar utilities for Linux.
Some of the chess optimisation work done for multiple cores is very
enlightening about the difficulties of load balancing an algorithm
across 4 or more cores without saturating external memory bandwidth.
Regards,
Martin Brown |
|
David Schwartz Guest
|
Posted: Thu Sep 04, 2008 3:21 am Post subject: Re: AMD working on scaleable hardware-based atomic transacti |
|
|
On Sep 3, 4:43 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
| Quote: | Apparently, interrupts are deferred by the OS. I also believe that this
deferment is adjustable by mutating a so-called watch-dog counter.
|
Up to a point configurable by the OS. That sounds pretty nice to me.
| Quote: | Well, I've read it more carefully, and it seems to sort of say that
they 'undo' all writes if there's an abort with a special buffer.
Their description seems kind of confusing to me. If they hold all
writes in a special buffer, how big is it?
I believe the buffer is big enough to hold at least 7-8 words. If your
transactions need more than that, then ASF is not the right tool for the
job...
|
Well then I question how many real-world problems this will actually
help in any significant way, but the design seems sound. It will
certainly simplify the design of things like reader-writer locks.
Perhaps the biggest advantage will be easing the tradeoff between
correctness and performance. Right now, for example, it's easy to
create an obviously-correct implementation of a reader/writer lock
under x86 Linux. It's also not too hard to create a heavily-optimized
implementation of a reader/writer lock. It is, however, an unholy
bitch to create a heavily-optimized reader/writer lock that one can be
confident is correct. It takes serious expertise and reviews from
multiple people to make sure something doesn't slip by. With this, I
could do it in half an hour, and be quite confident it had no subtle
bugs.
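For contrast, the "obviously-correct" end of that trade-off might look like the sketch below: one POSIX mutex and one condition variable, with no attempt at speed or fairness. This is an illustration under assumed names (`simple_rwlock`, `srw_*`), not the poster's code and not ASF code. It can starve writers under a steady stream of readers.

```c
#include <pthread.h>
#include <stddef.h>

/* A deliberately simple reader/writer lock: one mutex, one condition
   variable. It favours clarity over speed and can starve writers. */
typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    int readers;   /* number of active readers */
    int writer;    /* 1 while a writer holds the lock */
} simple_rwlock;

void srw_init(simple_rwlock *l)
{
    pthread_mutex_init(&l->mu, NULL);
    pthread_cond_init(&l->cv, NULL);
    l->readers = 0;
    l->writer = 0;
}

void srw_rdlock(simple_rwlock *l)
{
    pthread_mutex_lock(&l->mu);
    while (l->writer)                      /* wait out any writer */
        pthread_cond_wait(&l->cv, &l->mu);
    l->readers++;
    pthread_mutex_unlock(&l->mu);
}

void srw_rdunlock(simple_rwlock *l)
{
    pthread_mutex_lock(&l->mu);
    if (--l->readers == 0)                 /* last reader wakes waiters */
        pthread_cond_broadcast(&l->cv);
    pthread_mutex_unlock(&l->mu);
}

void srw_wrlock(simple_rwlock *l)
{
    pthread_mutex_lock(&l->mu);
    while (l->writer || l->readers > 0)    /* need exclusive access */
        pthread_cond_wait(&l->cv, &l->mu);
    l->writer = 1;
    pthread_mutex_unlock(&l->mu);
}

void srw_wrunlock(simple_rwlock *l)
{
    pthread_mutex_lock(&l->mu);
    l->writer = 0;
    pthread_cond_broadcast(&l->cv);
    pthread_mutex_unlock(&l->mu);
}
```

Every operation takes the mutex, so readers serialise briefly on entry and exit. Removing that cost is where the hard-to-verify optimisations begin.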
DS |
|
Chris M. Thomasson Guest
|
Posted: Thu Sep 04, 2008 4:43 am Post subject: Re: AMD working on scaleable hardware-based atomic transacti |
|
|
[added comp.arch
additional context for this topic can be found here:
http://groups.google.com/group/comp.arch/browse_frm/thread/cedc2e76568ab0c5
]
"David Schwartz" <davids@webmaster.com> wrote in message
news:92d923c4-1fdf-4287-a632-582a473b8891@i20g2000prf.googlegroups.com...
| Quote: | Is this intended to be usable in ordinary application code? It
appears, at least to me, to be basically unusable. Maybe I'm missing
something.
The problem is that once you pass the 'ACQUIRE' point, your code must
run without pre-emption until you hit the 'COMMIT' point. If there's
an interrupt, your code will be "aborted" (see section 3.3). The
specification provides a way to detect an abort -- you jump back
magically to your 'ACQUIRE' point and return an error. But I don't see
how that helps you if the abort occurred after some, but not all, of
your writes took place.
The example code, such as Figure 1's supposed DCAS doesn't even try to
handle this case. It will fail horribly if the code is interrupted or
pre-empted in the 'critical section'.
IMO, like so much synchronization code, the supposed 'simplicity' of
this approach (and maybe even its alleged performance advantage) will
evaporate when it has to handle all the nasty things that can happen
in the real world.
Maybe I'm missing something. If so, what?
|
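The all-or-nothing behaviour the quoted question is worried about can be stated as a software model of DCAS (double compare-and-swap). The sketch below is an assumption-laden illustration only: a single global mutex stands in for the hardware's atomicity, the name `dcas_model` is hypothetical, and nothing here uses or imitates actual ASF instructions.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Software model of DCAS semantics only. A single global mutex stands
   in for the hardware's atomicity. This is not ASF code. */
static pthread_mutex_t dcas_mu = PTHREAD_MUTEX_INITIALIZER;

/* If *a == old_a and *b == old_b, store new_a and new_b and return
   true. Otherwise change nothing and return false. */
bool dcas_model(uintptr_t *a, uintptr_t *b,
                uintptr_t old_a, uintptr_t old_b,
                uintptr_t new_a, uintptr_t new_b)
{
    bool ok = false;
    pthread_mutex_lock(&dcas_mu);
    if (*a == old_a && *b == old_b) {
        *a = new_a;   /* both writes take effect together... */
        *b = new_b;
        ok = true;
    }                 /* ...or neither does */
    pthread_mutex_unlock(&dcas_mu);
    return ok;
}
```

The model is atomic only with respect to other callers of `dcas_model`. A hardware version would have to give the same guarantee even when an interrupt arrives between the two stores, which is the case the question asks about.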
This post might shed some light:
http://groups.google.com/group/comp.arch/msg/75109b08ea8a1ca4
Apparently, interrupts are deferred by the OS. I also believe that this
deferment is adjustable by mutating a so-called watch-dog counter.
"David Schwartz" <davids@webmaster.com> wrote in message
news:faf93c3d-dd22-4035-a107-afea1ada2a62@k36g2000pri.googlegroups.com...
On Sep 3, 7:27 am, David Schwartz <dav...@webmaster.com> wrote:
| Quote: | Well, I've read it more carefully, and it seems to sort of say that
they 'undo' all writes if there's an abort with a special buffer.
Their description seems kind of confusing to me. If they hold all
writes in a special buffer, how big is it?
|
I believe the buffer is big enough to hold at least 7-8 words. If your
transactions need more than that, then ASF is not the right tool for the
job... |
|
Chris M. Thomasson Guest
|
Posted: Thu Sep 04, 2008 4:45 am Post subject: Re: AMD working on scaleable hardware-based atomic transacti |
|
|
"Dmitriy V'jukov" <dvyukov@gmail.com> wrote in message
news:86c0c11f-785f-4456-9ee5-bfa5d4d8993a@56g2000hsm.googlegroups.com...
On Sep 1, 04:09, "Chris M. Thomasson" <n...@spam.invalid> wrote:
| Quote: | Here ya go:
http://www.amd64.org/fileadmin/user_upload/pub/epham08-asf-eval.pdf
Cool!
It recalls Sun's HTM design, but AMD's design gives explicit control
over which locations to lock and which not to lock.
Interesting. When will we see it in AMD hardware?
|
| Quote: | Also, if Intel does not support HTM, using it will be very
problematic.
|
Good point. Humm... I guess that Intel will just have to acquire a license
from AMD!
;^) |
|