OpenSTM32 Community Site | FPU going too slow

FPU going too slow

Posted by dautrevaux on 2016-05-05 12:13 France

Hi Florent,

Indeed, you should always be precise when typing floating-point constants. The Cortex-M4 only supports single precision in hardware, but as soon as a double-precision operand appears in an expression (even 1.0, which can be represented exactly in single precision), the compiler *must* perform that computation in double precision, which means software emulation instead of FPU instructions...
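To illustrate the point about constant typing, here is a minimal sketch in plain C11. The names `IS_DOUBLE`, `scale_slow` and `scale_fast` are made up for this example; only the promotion rule itself comes from the C language.

```c
/* Evaluates to 1 if the expression has type double, 0 otherwise (C11 _Generic). */
#define IS_DOUBLE(expr) _Generic((expr), double: 1, default: 0)

/* Unsuffixed constant: 1.5 is a double, so x is converted to double and the
 * multiply is done in double precision. On a Cortex-M4 with a single-precision
 * FPU this becomes a call into the software floating-point library. */
float scale_slow(float x)
{
    return x * 1.5;
}

/* 'f' suffix: both operands are float, so the multiply stays in single
 * precision and can map to an FPU instruction. */
float scale_fast(float x)
{
    return x * 1.5f;
}
```

Here `IS_DOUBLE(x * 1.5)` is 1 while `IS_DOUBLE(x * 1.5f)` is 0 for a `float x`. If you are using GCC, the `-Wdouble-promotion` warning should flag the first form for you.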

Regarding the suggestion of hand-optimizing the generated code, it *may* earn you some performance, but in fact it usually needs a full *rewrite* of the code to be significantly better than what the compiler generates at the -O3 optimization level. Starting from non-optimized code will give you the satisfaction of gaining a lot of performance compared with that non-optimized compiler output, but you may still end up slower than the optimized code. On the other hand, the optimized code is usually totally unreadable and very difficult to relate to the source code, so it is not a good base for hand optimization. The reason optimized code is so messy is that the compiler tries to optimize register usage over the whole function, reducing loads, stores and moves, while taking data dependencies between instructions into account by moving loads earlier in the code so that memory access times are hidden.

Executing code from RAM instead of flash may give you a more significant performance increase, but it depends a lot on the size of your inner loops and on the precise MCU you are using. High-end STM32 MCUs have an ART (Adaptive Real-Time) memory accelerator that (on an STM32F411, for example) provides CoreMark benchmark performance equivalent to 0 wait-state memory access, through instruction prefetch, instruction caching and branch acceleration mechanisms. On such processors, executing from RAM may therefore not give you such a big increase in performance but, of course, YMMV.

Hope this helps clarify things,

Bernard (Ac6)


Posted by dautrevaux on 2016-05-05 12:30 France

Just one clarification to my previous message: to be sure flash memory prefetch and caching are enabled, check the FLASH_ACR register; three bits are relevant here:

ICEN - Instruction Cache Enable, which enables an instruction cache (64 lines of 16 bytes) in front of the flash
DCEN - Data Cache Enable, which enables caching of up to 8 lines of 16 bytes for data accesses to the flash (for accessing literal pools, for example)
PRFTEN - Prefetch Enable, which enables instruction prefetching

These bits are not set on reset, but may be set by firmware initialization; if they are not, you should set them in your own code to benefit from these features... But don’t forget to reset the caches before enabling them (by setting DCRST and ICRST in the same register), and be careful *not* to change the flash latency, which will have been set by the firmware when raising the CPU frequency.
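The sequence above could be sketched as follows. This is only an illustration: the `ACR_*` macros and the function name `flash_enable_accelerator` are hypothetical, and the bit positions are what I would expect from the STM32F4 reference manual, so check them against the manual for your exact part. The register is passed in as a pointer so the same code can be tried on a plain variable.

```c
#include <stdint.h>

/* Assumed FLASH_ACR bit positions (STM32F4 family); verify for your MCU. */
#define ACR_PRFTEN (1u << 8)   /* prefetch enable          */
#define ACR_ICEN   (1u << 9)   /* instruction cache enable */
#define ACR_DCEN   (1u << 10)  /* data cache enable        */
#define ACR_ICRST  (1u << 11)  /* instruction cache reset  */
#define ACR_DCRST  (1u << 12)  /* data cache reset         */

/* Reset both caches, then enable prefetch and caching.
 * Only read-modify-write is used, so the latency field in the low bits
 * is left exactly as the firmware configured it. */
void flash_enable_accelerator(volatile uint32_t *acr)
{
    *acr &= ~(ACR_ICEN | ACR_DCEN);              /* caches off before resetting them */
    *acr |=  (ACR_ICRST | ACR_DCRST);            /* request cache reset              */
    *acr &= ~(ACR_ICRST | ACR_DCRST);            /* release the reset bits           */
    *acr |=  (ACR_PRFTEN | ACR_ICEN | ACR_DCEN); /* enable prefetch and both caches  */
}
```

On the target, with the CMSIS device header included, the call would presumably be `flash_enable_accelerator(&FLASH->ACR);`.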

Hope this helps,

Bernard (Ac6)
