| Author |
Message |
techG Guest
|
Posted: Wed Nov 12, 2008 11:27 am Post subject: sobel on DM648 -- too slow |
|
|
Hi,
I've implemented a simple Sobel filter on a TMS320DM648 DSP (clock
freq = 891 MHz).
Without calculating phase, and with compiler optimizations turned on, the
execution time is about 124 ms on a 320x240 image, while on a PC this
can run in real time (with phase calculation!).
Are there any optimization strategies to get running times similar to
the PC?
Giulio |
|
| |
|
Rune Allnor Guest
|
Posted: Wed Nov 12, 2008 11:53 am Post subject: Re: sobel on DM648 -- too slow |
|
|
On 12 Nov, 12:27, techG <giuliopul...@gmail.com> wrote:
| Quote: | Hi,
I've implemented a simple Sobel filter on a TMS320DM648 DSP (clock
freq = 891 MHZ).
Without calculating phase, and turning on compiler optimizations, the
execution time is about 124 ms on a 320x240 image, while on PC this
can run in real-time (with phase calculation!).
|
Some factors that might account for the speed difference:
1) Clock frequency (~3GHz vs 0.8GHz)
2) Multi-core PCs
3) The MMX instruction set
So if you run the software on a duo-core 3.2 PC in
parallel mode the PC is a factor 8 faster right there.
If both of those cores use the MMX instruction set,
it might gain another factor 5-10 or so. Maybe more.
Rune |
|
| |
|
Rune Allnor Guest
|
Posted: Wed Nov 12, 2008 11:57 am Post subject: Re: sobel on DM648 -- too slow |
|
|
On 12 Nov, 12:53, Rune Allnor <all...@tele.ntnu.no> wrote:
| Quote: | On 12 Nov, 12:27, techG <giuliopul...@gmail.com> wrote:
Hi,
I've implemented a simple Sobel filter on a TMS320DM648 DSP (clock
freq = 891 MHZ).
Without calculating phase, and turning on compiler optimizations, the
execution time is about 124 ms on a 320x240 image, while on PC this
can run in real-time (with phase calculation!).
Some factors that might account for the speed difference:
1) Clock frequency (~3GHz vs 0.8GHz)
2) Multi-core PCs
3) The MMX instruction set
So if you run the software on a duo-core 3.2 PC in
|
That should be "3.2 GHz PC"
| Quote: | parallel mode the PC is a factor 8 faster right there.
If both of those cores use the MMX instruction set,
it might gain another factor 5-10 or so. Maybe more.
Rune |
|
|
| |
|
techG Guest
|
Posted: Thu Nov 13, 2008 10:28 am Post subject: Re: sobel on DM648 -- too slow |
|
|
On Nov 12, 12:57 pm, Rune Allnor <all...@tele.ntnu.no> wrote:
| Quote: | On 12 Nov, 12:53, Rune Allnor <all...@tele.ntnu.no> wrote:
On 12 Nov, 12:27, techG <giuliopul...@gmail.com> wrote:
Hi,
I've implemented a simple Sobel filter on a TMS320DM648 DSP (clock
freq = 891 MHZ).
Without calculating phase, and turning on compiler optimizations, the
execution time is about 124 ms on a 320x240 image, while on PC this
can run in real-time (with phase calculation!).
Some factors that might account for the speed difference:
1) Clock frequency (~3GHz vs 0.8GHz)
2) Multi-core PCs
3) The MMX instruction set
So if you run the software on a duo-core 3.2 PC in
That should be "3.2 GHz PC"
parallel mode the PC is a factor 8 faster right there.
If both of those cores use the MMX instruction set,
it might gain another factor 5-10 or so. Maybe more.
Rune
|
I found the bottleneck: it's RAM read/write access. For example,
copying 150 KB from one address to another takes 7 ms (using 32-bit
words, which is the maximum data bus width of the RAM). The theoretical
time for this transfer is about 72 µs (the RAM data rate is 2.1 GB/s).
How is that possible? |
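The figures in this post can be checked with a few lines of arithmetic. The sketch below is only a sanity check, not a measurement: the helper names are made up for illustration, and it assumes "150 KB" means 150 * 1024 bytes and that "2.1 GB/s" means 2.1e9 bytes per second.

```c
#include <assert.h>

/* Ideal time in microseconds to move `bytes` at `bytes_per_sec`. */
static double transfer_time_us(double bytes, double bytes_per_sec)
{
    assert(bytes_per_sec > 0.0);
    return bytes / bytes_per_sec * 1e6;
}

/* Effective throughput in MB/s (1 MB = 1e6 bytes) of a measured copy. */
static double effective_mb_per_s(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / seconds / 1e6;
}

/* CPU cycles spent per copied word for a measured copy. */
static double cycles_per_word(double seconds, double clock_hz, double words)
{
    assert(words > 0.0);
    return seconds * clock_hz / words;
}
```

With the numbers from the post (150 KB, 7 ms, 891 MHz, 32-bit words) this gives roughly 73 µs for the ideal transfer, about 22 MB/s effective throughput, and on the order of 160 CPU cycles per 32-bit word. A per-word cost that large suggests each access is paying a full external-memory round trip rather than streaming.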
|
| |
|
Nils Guest
|
Posted: Thu Nov 13, 2008 5:13 pm Post subject: Re: sobel on DM648 -- too slow |
|
|
techG wrote:
| Quote: | I found the bottleneck: it's RAM read/write access. For example,
copying 150 Kb from an address to another one takes 7 ms (using 32-bit
long words, that is maximum data bus width of RAM). The theoretical
speed for this transfer is about 72 µs (RAM data rate is 2.1 GBps)..
how can it be possible?
|
It's the second-level cache. Each time you have a cache miss, the L2
cache reads in a 128-byte cache line and the DSP stalls for
that time. This factor can be significant.
There are two ways around this:
1. Disable the second-level cache. The easiest way to do this is to use
the DSP/BIOS functions, but this disables the entire cache. If you have
multiple threads running, this can be a disaster if a task switch
happens. A cleaner way is to use the registers in the MMU to disable
the cache just for certain memory pages (read the docs - as far as I
remember these registers are called MARxxx or something like that).
2. Use the DMA to stream the data from RAM into the internal SRAM. Using
DMA will provide a much faster way to access the memory. With a bit of
trickery you could use two DMA-channels in parallel to stream the data
in and out of the SRAM and do the processing with the DSP in parallel.
It's tricky to get the DMA working, and the TI DMA components have a
high call overhead, so you will most likely end up implementing the DMA
code on your own. However, once it works you can expect your algorithm to
be limited just by the available bandwidth. The DSP is fast enough to do
the processing in parallel.
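The streaming scheme in option 2 can be sketched as a ping-pong (double-buffer) loop. This is a portable illustration, not EDMA code: `dma_start_copy`, `dma_wait`, `process_block` and `BLOCK_BYTES` are hypothetical names, and the DMA is simulated with `memcpy`, so nothing actually overlaps here. On real hardware `dma_start_copy` would program a channel and return at once, and `dma_wait` would block until all outstanding transfers are done.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 1024u  /* hypothetical size of one on-chip block */

/* Stand-in for starting a DMA transfer (simulated: copies immediately). */
static void dma_start_copy(uint8_t *dst, const uint8_t *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Stand-in for waiting on all outstanding transfers (nothing to wait for). */
static void dma_wait(void)
{
}

/* Example per-block work: invert every pixel in place. */
static void process_block(uint8_t *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = (uint8_t)(255u - buf[i]);
}

/* Stream `total` bytes through two on-chip blocks: while block k is being
 * processed, block k+1 is already being fetched. */
static void process_ping_pong(uint8_t *ext_out, const uint8_t *ext_in,
                              size_t total)
{
    static uint8_t onchip[2][BLOCK_BYTES];  /* would live in internal SRAM */
    size_t nblocks = (total + BLOCK_BYTES - 1) / BLOCK_BYTES;

    assert(ext_out != NULL && ext_in != NULL);
    if (total == 0)
        return;

    dma_start_copy(onchip[0], ext_in, total < BLOCK_BYTES ? total : BLOCK_BYTES);
    for (size_t k = 0; k < nblocks; k++) {
        size_t off = k * BLOCK_BYTES;
        size_t n = total - off < BLOCK_BYTES ? total - off : BLOCK_BYTES;
        uint8_t *cur = onchip[k & 1];

        dma_wait();  /* block k is on chip; write-back of block k-1 is done */
        if (k + 1 < nblocks) {
            size_t noff = off + BLOCK_BYTES;
            size_t nn = total - noff < BLOCK_BYTES ? total - noff : BLOCK_BYTES;
            dma_start_copy(onchip[(k + 1) & 1], ext_in + noff, nn);
        }
        process_block(cur, n);                  /* overlaps the fetch above */
        dma_start_copy(ext_out + off, cur, n);  /* stream the result out */
    }
    dma_wait();
}
```

The `dma_wait` at the top of each iteration also covers the previous write-back, which is what makes it safe to refill the same on-chip block two iterations later.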
Good luck,
Nils Pipenbrinck |
|
| |
|
Martin Thompson Guest
|
Posted: Thu Nov 13, 2008 5:58 pm Post subject: Re: sobel on DM648 -- too slow |
|
|
Hi Giulio,
techG <giuliopulina@gmail.com> writes:
| Quote: | Hi,
I've implemented a simple Sobel filter on a TMS320DM648 DSP (clock
freq = 891 MHZ).
Without calculating phase, and turning on compiler optimizations, the
execution time is about 124 ms on a 320x240 image, while on PC this
can run in real-time (with phase calculation!).
|
That's about 1.6 µs per pixel, or ~1400 cycles per pixel! Something's
badly wrong there, as that DSP can issue up to 8 instructions/cycle.
A Sobel (doing both horizontal and vertical) needs (per pixel of
output):
  - 3 new pixel reads
  - 6 subtracts
  - 2 multiplies (but only by 2, so a simple shift is good!)
and either
  - 2 pixel writes (H and V)
or
  - a magnitude calculation and one write:
    2 muls, 1 add and 1 sqrt (if you need it, usually not)
Is the image in external RAM? What sort?
Is the cache enabled?
Is the code written to do calculations whilst waiting for the RAM
latency? Or is that killing it?
You're not treating the pixels as floating point are you? :-)
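For reference, an all-integer 3x3 Sobel along these lines might look like the sketch below. It is not the original poster's code. The function name and the border handling (border pixels set to 0) are assumptions, and the magnitude uses the common |gx| + |gy| approximation in place of the square root, clamped to 8 bits.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Integer Sobel on an 8-bit image of width*height pixels.
 * Output is |gx| + |gy| clamped to 255; border pixels are set to 0. */
static void sobel_u8(uint8_t *dst, const uint8_t *src, int width, int height)
{
    assert(width >= 3 && height >= 3);
    memset(dst, 0, (size_t)width * (size_t)height);

    for (int y = 1; y < height - 1; y++) {
        const uint8_t *r0 = src + (size_t)(y - 1) * width;  /* row above */
        const uint8_t *r1 = r0 + width;                     /* current row */
        const uint8_t *r2 = r1 + width;                     /* row below */

        for (int x = 1; x < width - 1; x++) {
            /* The "multiplies" are by 2 only; a compiler will normally
             * turn them into a shift or an add. */
            int gx = (r0[x + 1] - r0[x - 1])
                   + 2 * (r1[x + 1] - r1[x - 1])
                   + (r2[x + 1] - r2[x - 1]);
            int gy = (r2[x - 1] - r0[x - 1])
                   + 2 * (r2[x] - r0[x])
                   + (r2[x + 1] - r0[x + 1]);
            int mag = abs(gx) + abs(gy);

            dst[(size_t)y * width + x] = (uint8_t)(mag > 255 ? 255 : mag);
        }
    }
}
```

Everything stays in `int`, so no floating point is involved at any stage.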
| Quote: | Are there any optimization strategies to get running times similar to
PC?
|
I'd write it such that it was DMAing a line of image into internal RAM
whilst processing the 3 lines above it. Dunno how hard you'd have to
push on the compiler rope to make it optimal, but it shouldn't be that
hard to do. Possibly unroll the line-loop 3 times to avoid shifting
the intermediate values around in the variables you use to store the
3x3 pixels you are working on.
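A rough sketch of that line-buffer arrangement follows, assuming four on-chip line slots (three in use, one being filled) that are rotated by index rather than copied. `fetch_line` is a hypothetical stand-in for a DMA line fetch and is simulated with `memcpy`; `MAX_WIDTH`, the function names and the |gx| + |gy| magnitude are likewise assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_WIDTH 320  /* line length of the 320x240 image in the thread */

/* Stand-in for a DMA line fetch into on-chip RAM (hypothetical helper).
 * A real version would start the transfer and return immediately. */
static void fetch_line(uint8_t *onchip, const uint8_t *ext, int width)
{
    memcpy(onchip, ext, (size_t)width);
}

/* One output row from three input rows; |gx| + |gy| clamped to 255. */
static void sobel_row(uint8_t *out, const uint8_t *top, const uint8_t *mid,
                      const uint8_t *bot, int width)
{
    out[0] = 0;
    out[width - 1] = 0;
    for (int x = 1; x < width - 1; x++) {
        int gx = (top[x + 1] - top[x - 1]) + 2 * (mid[x + 1] - mid[x - 1])
               + (bot[x + 1] - bot[x - 1]);
        int gy = (bot[x - 1] - top[x - 1]) + 2 * (bot[x] - top[x])
               + (bot[x + 1] - top[x + 1]);
        int mag = abs(gx) + abs(gy);
        out[x] = (uint8_t)(mag > 255 ? 255 : mag);
    }
}

/* Sobel over a whole image, holding only four lines on chip at a time. */
static void sobel_line_buffered(uint8_t *ext_dst, const uint8_t *ext_src,
                                int width, int height)
{
    static uint8_t lines[4][MAX_WIDTH];  /* 3 in use + 1 being filled */

    assert(width >= 3 && width <= MAX_WIDTH && height >= 3);
    memset(ext_dst, 0, (size_t)width);                                 /* top */
    memset(ext_dst + (size_t)(height - 1) * width, 0, (size_t)width);  /* bottom */

    for (int y = 0; y < 3; y++)
        fetch_line(lines[y], ext_src + (size_t)y * width, width);

    for (int y = 1; y < height - 1; y++) {
        /* Fetch the line below the current window into the free slot. */
        if (y + 2 < height)
            fetch_line(lines[(y + 2) & 3],
                       ext_src + (size_t)(y + 2) * width, width);
        sobel_row(ext_dst + (size_t)y * width,
                  lines[(y - 1) & 3], lines[y & 3], lines[(y + 1) & 3], width);
    }
}
```

For brevity the result is written straight to the destination image; a real version would probably stage each output line on chip as well and DMA it out.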
Cheers,
Martin
--
martin.j.thompson@trw.com
TRW Conekt - Consultancy in Engineering, Knowledge and Technology
http://www.conekt.net/electronics.html |
|
| |
|
techG Guest
|
Posted: Tue Nov 18, 2008 3:24 pm Post subject: Re: sobel on DM648 -- too slow |
|
|
Thank you all for the help!
Now I'm transferring data (one frame at a time) between external RAM and
L2 using EDMA, and that's very fast.
With this modification, the filter runs in about 8 ms, but I
think the DSP can do better on a 320x240 image. Furthermore, I can't
do the phase calculation in real time because it takes 77 ms.
So I've implemented another edge-detection filter (ADM edge detector)
and it runs in about 15 ms; this filter is very simple (it doesn't
require any multiplication or division), but requires more memory
accesses. I'm looking for a way to optimize memory access (if possible)
to reduce processing time.
P.S. I still can't figure out why external memory is so slow when
accessing it directly, given that the L2 and L1D caches are disabled by
default.
Giulio |
|
| |
|
Randy Yates Guest
|
Posted: Wed Nov 19, 2008 12:12 am Post subject: Re: sobel on DM648 -- too slow |
|
|
techG <giuliopulina@gmail.com> writes:
| Quote: | [...]
Ps I still can't figure out why external memory is so slow when
accessing it directly, because L2 and L1D cache are disabled by
default.
|
Because it's probably SDRAM, which means there are all sorts of overhead
cycles in accessing it, like setting up column and row addresses, etc.
--
% Randy Yates % "Bird, on the wing,
%% Fuquay-Varina, NC % goes floating by
%%% 919-577-9882 % but there's a teardrop in his eye..."
%%%% <yates@ieee.org> % 'One Summer Dream', *Face The Music*, ELO
http://www.digitalsignallabs.com |
|
| |
|
Martin Thompson Guest
|
Posted: Wed Nov 19, 2008 6:55 pm Post subject: Re: sobel on DM648 -- too slow |
|
|
techG <giuliopulina@gmail.com> writes:
| Quote: | Thank you all for the help!
Now I'm transferring data (one frame a time) between external RAM and
L2 using EDMA, that's very fast.
With this modification, the filter is running in about 8 ms, but I
think the DSP can do better on a 320x240 image. Furthermore, I can't
do phase calculation in real-time because it runs in 77 ms.
So I've implemented another edge-detection filter (ADM edge detector)
and it runs in about 15 ms; this filter is very simple (it doesn't
require any multiplication or division), but requires more memory
access. I'm looking for a way to optimize memory access (if possible)
to reduce elaboration timings.
|
Make sure you reuse intermediate values that you've already read. If
you are accessing the frame buffer as an array, the compiler may or
may not be smart enough to notice. I've had more success being explicit.
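One way to be explicit about reuse, sketched here for a 3x3 Sobel row, is to keep the window's columns in local variables and load only the three new pixels per output. The function name and the |gx| + |gy| magnitude are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One Sobel output row with explicit reuse: columns a (left) and b (centre)
 * are carried over in locals; only column c (right) is read from memory. */
static void sobel_row_reuse(uint8_t *out, const uint8_t *top,
                            const uint8_t *mid, const uint8_t *bot, int width)
{
    assert(width >= 3);

    int a0 = top[0], a1 = mid[0], a2 = bot[0];
    int b0 = top[1], b2 = bot[1];  /* centre pixel itself is never needed */

    out[0] = 0;
    out[width - 1] = 0;
    for (int x = 1; x < width - 1; x++) {
        /* The only three memory reads for this output pixel. */
        int c0 = top[x + 1], c1 = mid[x + 1], c2 = bot[x + 1];

        int gx = (c0 - a0) + 2 * (c1 - a1) + (c2 - a2);
        int gy = (a2 - a0) + 2 * (b2 - b0) + (c2 - c0);
        int mag = abs(gx) + abs(gy);
        out[x] = (uint8_t)(mag > 255 ? 255 : mag);

        /* Slide the window one pixel to the right. */
        a0 = b0; a1 = mid[x]; a2 = b2;
        b0 = c0; b2 = c2;
        a1 = (int)mid[x];
    }
}
```

Note that this version still re-reads `mid[x]` when sliding the window, because the centre pixel carries no Sobel weight and is not otherwise held in a local; carrying it in one more variable would remove that read too. Whether hand-written reuse beats the compiler's own scheduling is something to measure on the target.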
| Quote: |
Ps I still can't figure out why external memory is so slow when
accessing it directly, because L2 and L1D cache are disabled by
default.
|
SDRAM has a high bandwidth for streaming operations, but if you are
jumping around the image, there is a setup overhead to be paid. Even
if you are performing your operation in a streaming order, there's
still a significant penalty in terms of latency, so your processing
will go:
* read pixel 1
the DSP will then read pixels 1-8 (? depending on cache line size). You
have to wait until it's done
* operate on pixel 1
* read pixel 2 - this is in cache, so will be fast
* operate on pixel 2
etc..
until you need to read pixel 9, then a new read has to start and you
have to wait again.
If you can interleave your reads and operates so that you can operate
on stuff that has been precached, while the next few pixels are
cached, things will be faster.
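On a desktop compiler the same idea can be sketched with a prefetch hint: request the next chunk's cache lines, then work on the chunk that is already loaded. This assumes GCC or Clang, where `__builtin_prefetch` is available; on other compilers the macro below compiles to nothing. `LINE_BYTES`, `CHUNK_BYTES` and the thresholding operation are placeholders, and on the DM648 the equivalent would more likely be done with DMA or a cache touch loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch((p))
#else
#define PREFETCH(p) ((void)0)  /* no hint available: plain loop */
#endif

#define LINE_BYTES  64u    /* hypothetical cache line size */
#define CHUNK_BYTES 1024u  /* hypothetical amount of work per step */

/* Threshold an image chunk by chunk, asking for chunk k+1 while chunk k
 * is being processed. */
static void threshold_prefetched(uint8_t *dst, const uint8_t *src,
                                 size_t n, uint8_t t)
{
    assert(dst != NULL && src != NULL);

    for (size_t start = 0; start < n; start += CHUNK_BYTES) {
        size_t end = n - start < CHUNK_BYTES ? n : start + CHUNK_BYTES;
        size_t next_end = n - end < CHUNK_BYTES ? n : end + CHUNK_BYTES;

        /* Request the next chunk's cache lines before touching this one. */
        for (size_t i = end; i < next_end; i += LINE_BYTES)
            PREFETCH(src + i);

        /* Operate on the chunk that should already be cached. */
        for (size_t i = start; i < end; i++)
            dst[i] = (uint8_t)(src[i] > t ? 255 : 0);
    }
}
```

The result is identical with or without the hint; only the timing changes, and by how much depends entirely on the memory system.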
Cheers,
Martin
--
martin.j.thompson@trw.com
TRW Conekt - Consultancy in Engineering, Knowledge and Technology
http://www.conekt.net/electronics.html |
|
| |
|
|