System Workbench for STM32

FPU going too slow


Hi Florent,

Indeed, you should always be precise when typing floating-point constants. The Cortex-M4 FPU only supports single precision in hardware and, as soon as an operation has a double-precision operand (even 1.0, which can be represented exactly in single precision), the compiler *must* convert the other operand and perform that operation in double, which on this core means calling software routines. Write 1.0f to keep the computation in single precision.
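
To illustrate the point about constant typing, here is a minimal sketch in C. The function names are made up for this example; the static_assert lines only check the type the compiler gives to each expression, they do not measure speed.

```c
#include <assert.h>  /* static_assert (C11) and assert */

/* Mixed expression: 0.5 is a double constant, so x is converted to double,
   the multiply is done in double (a library call on a single-precision FPU),
   and the result is converted back to float on return. */
float halve_with_double_constant(float x)
{
    static_assert(sizeof(x * 0.5) == sizeof(double),
                  "float * double constant has type double");
    return x * 0.5;
}

/* Single-precision expression: the f suffix makes the constant a float,
   so the multiply can use the hardware FPU. */
float halve_with_float_constant(float x)
{
    static_assert(sizeof(x * 0.5f) == sizeof(float),
                  "float * float constant has type float");
    return x * 0.5f;
}
```

Both functions return the same value here; only the cost differs on the target. If you build with GCC, the -Wdouble-promotion warning can help spot places where a float is implicitly promoted to double.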

Regarding the suggestion of hand-optimizing the generated code: it *may* gain you some performance, but in practice it usually takes a full *rewrite* of the code to be significantly better than what the compiler generates at the -O3 optimization level. Starting from non-optimized code will give you the satisfaction of gaining a lot of performance relative to that compiler-generated code, but you may still end up slower than the optimized code. On the other hand, optimized code is usually totally unreadable and very difficult to relate to the source code, so it is not a good base for hand optimization. Optimized code is so messy because the compiler tries to optimize register usage across the whole function, reducing loads, stores and moves, while also taking data dependencies between instructions into account: it moves loads earlier in your code so that memory access times are hidden.

Executing code from RAM instead of flash may give you a more significant performance increase, but it depends a lot on the size of your inner loops and on the precise MCU you are using. High-end STM32 MCUs have an ART (Adaptive Real-Time) memory accelerator that, on an STM32F411 for example, provides CoreMark benchmark performance equivalent to 0 wait-state memory access, through instruction prefetch, instruction caching and branch acceleration mechanisms. On such processors, executing from RAM may therefore not give you such a big increase in performance but, of course, YMMV.

Hope this helps clarify things,

Bernard (Ac6)


Just one clarification to my previous message: to be sure that flash memory prefetch and caching are enabled, check the FLASH_ACR register; three bits are significant in this case:

  • ICEN - Instruction Cache Enable, that enables an instruction cache (64 lines of 16 bytes) in front of the flash
  • DCEN - Data Cache Enable, that enables caching of up to 8 lines of 16 bytes for data accesses to the flash (for accessing literal pools, for example)
  • PRFTEN - Prefetch Enable, that enables prefetching instructions

These bits are not set on reset, but may be set by the firmware initialization; if they are not, you should set them in your own code to benefit from these features. But don’t forget to reset the caches before enabling them (by setting DCRST and ICRST in the same register), and be careful *not* to change the flash latency, which will have been set by the firmware when raising the CPU frequency.
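
The sequence described above could be sketched as follows. This is only an illustration under stated assumptions: the ACR_* names and the function name are local to this example, and the bit positions are the ones given for FLASH_ACR in the STM32F4 reference manuals, so check them against the manual for your exact device. The register is passed as a pointer so that the same code can be exercised on a plain variable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local names for the FLASH_ACR bits (assumed STM32F4 layout). */
#define ACR_PRFTEN ((uint32_t)1 << 8)   /* prefetch enable          */
#define ACR_ICEN   ((uint32_t)1 << 9)   /* instruction cache enable */
#define ACR_DCEN   ((uint32_t)1 << 10)  /* data cache enable        */
#define ACR_ICRST  ((uint32_t)1 << 11)  /* instruction cache reset  */
#define ACR_DCRST  ((uint32_t)1 << 12)  /* data cache reset         */

/* Reset both caches, then enable caches and prefetch.
   Only read-modify-write accesses are used, so the latency field
   (low bits of the register) is left exactly as the firmware set it. */
void flash_accel_enable(volatile uint32_t *acr)
{
    assert(acr != NULL);
    *acr &= ~(ACR_ICEN | ACR_DCEN);               /* caches off before resetting them */
    *acr |= (ACR_ICRST | ACR_DCRST);              /* request the cache resets         */
    *acr &= ~(ACR_ICRST | ACR_DCRST);             /* release the reset bits           */
    *acr |= (ACR_PRFTEN | ACR_ICEN | ACR_DCEN);   /* enable prefetch and both caches  */
}
```

On the target, and assuming the CMSIS device header is included, the call would be flash_accel_enable(&FLASH->ACR); it is best done after the firmware has set the clock and the flash latency.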

Hope this helps,

Bernard (Ac6)