System Workbench for STM32


[solved] FPU going too slow

Hi!

I’m working on an audio application on the Nucleo F411RE and I’ve noticed that my processing is too slow, which makes the application skip some samples.

Looking at my disassembly, I figured that, given the number of instructions and the 100 MHz CPU clock I supposedly set in STM32CubeMX, it should be a lot faster.

I checked the SYSCLK value and it is 100 MHz as expected. To be 100% sure, I put 1000 “nop” instructions in my main loop and measured 10 µs, which does correspond to a 100 MHz clock.

I measured exactly the time taken by my processing: 14.5 µs, i.e. 1450 clock cycles. I think that’s way too much, considering that the processing is pretty simple:

for(i = 0; i < 12; i++)
{
    el1.osci.phase += el1.osci.phaseInc;
    if(el1.osci.phase >= 1.0)
        el1.osci.phase -= 1.0;
    el1.osci.value = sine[ (int16_t)(el1.osci.phase * RES) ];
    el1.val += el1.osci.value * el1.osci.amp;
}


It shouldn’t be more than about 500 cycles, right? I looked at the disassembly and it does use FPU instructions like vadd.f32...
For example, the last line compiles to:
vfma.f32 => 3 cycles
vcvt.s32.f32 => 1 cycle
vstr => 2 cycles
ldrh.w => 2 cycles

(according to this question). So that’s a total of 8 cycles for that line, which is the “biggest” one.
So I don’t really get why it’s so slow... Maybe because I’m using structures or something?

If anybody has an idea, I’d be glad to hear it :-)

Thanks!

Hi, I really can’t tell until you show the whole “offending” disassembly.

General advice:

1. Turn on optimizations (-O2 or -O3) so that GCC can optimize the loops and... the other things :-)
2. Use e.g. 1.0f instead of 1.0 (in the C standard the latter is a double while the former is a float, and the STM32F4 hardware only supports single-precision float)


What I have done in similar situations:

Option 1) Go in with the debugger and open the disassembly window on the routine.
Option 2) Some compilers have a ‘compile to assembly’ option; turn it on and get the assembly version of your code.
Option 3) Find a disassembler and disassemble the compiled code.

Now that you have the assembly version of the ‘C’, review it. In all the cases I have seen, there is a huge waste of time moving values into and out of registers.

But to start with, #ifdef out your ‘C’ routine and substitute the assembly version (as-is, no mods). Compile it and make sure it works.

Now, start by removing the unnecessary loads and stores. Often, after every instruction, the intermediate value is copied from the temporary register into memory, then back out of memory into the (sometimes same) register.

Using the assembly manual, you can calculate how much processor time is taken by each assembly instruction.

In the past I have gotten much more speed out of an operation by doing this.

After that, look to see if you can substitute one assembly instruction for several others. Sometimes the compiler defaults to using generic assembly instead of taking advantage of specialized instructions.

Make sure you comment the code well.

Good Luck.
-Matt


Thanks guys, I actually found the “1.0f” answer elsewhere as well, and it was my main problem. By the standard, an unsuffixed floating-point constant is indeed a double-precision float, causing the floating-point operations to be done in software and not in hardware (-> super slow).

My compiler optimisation was already set to -O3.

Thanks Matt, that’s a very good method. I will try that for the critical parts of my program.
Some other leads I’ve been given: run some parts of the code from RAM instead of flash (some info in this question), and use an integer phasor (it’s better for audio applications as it reduces jitter, and most integer operations are faster), although I couldn’t do that yet considering the other parts of my program. One last lead was to choose a suitable clock frequency (apparently, because of wait states, clocking faster doesn’t always speed things up. I have a little explanation from Stack Exchange if you guys are interested; I haven’t fully understood it myself :P)
Do you advise compiling with optimization for the disassembly? Because when optimized, it gets really messy!
Also, shouldn’t the compiler’s optimization try to limit unnecessary register loads and stores?

Thanks again for helping me!

Florent



Hi Florent,

Indeed, you should always be precise when typing floating-point constants; the Cortex-M4 only supports single precision in hardware but, as soon as you have a double-precision float in an expression (even 1.0, which can be expressed exactly in single precision), the compiler *must* do the whole computation in double...

Regarding the suggestion of hand-optimizing the generated code: it *may* earn you some performance, but it usually takes a full *rewrite* of the code to be significantly better than what the compiler generates at the -O3 optimization level. Starting from non-optimized code will give you the satisfaction of gaining a lot of performance with respect to the compiler-generated code, but you may still end up slower than the optimized code. On the opposite side, the optimized code is usually totally unreadable and very difficult to relate to the source code, so it is not a good base for hand-optimizing. The reason optimized code is so messy is that the compiler tries to optimize register usage for the whole function, reducing loads/stores and moves, while taking data dependencies between instructions into account by moving loads up in your code so that memory access times are masked.

Executing code from RAM instead of flash may give you a more significant performance increase, but it depends a lot on the size of your inner loops (and the precise MCU you are using), as high-end STM32 MCUs have an ART (Adaptive Real-Time) memory accelerator that (on an STM32F411, for example) provides CoreMark benchmark performance equivalent to 0-wait-state memory access (through instruction prefetch, instruction caching and branch acceleration mechanisms). Thus, on such processors, executing from RAM may not give you such a big increase in performance but, of course, YMMV.

Hope this helps clarifying things,

Bernard (Ac6)


One clarification to my previous message: to be sure flash memory prefetch and caching are enabled, check the FLASH_ACR register; three bits are significant in this case:

  • ICEN - Instruction Cache Enable: enables a 64-line (16 bytes per line) instruction cache in front of the flash
  • DCEN - Data Cache Enable: enables caching of up to 8 lines of 16 bytes for data accesses to the flash (for accessing literal pools, for example)
  • PRFTEN - Prefetch Enable: enables instruction prefetching

These bits are not set on reset, but may be set by firmware initialization; if they are not, you should set them in your own code to benefit from these features... But don’t forget to reset the caches before enabling them (by setting DCRST and ICRST in the same register), and be careful *not* to change the flash latency that was set by the firmware when raising the CPU frequency.

Hope this helps,

Bernard (Ac6)


Hi Bernard.
Thank you very much for all these clarifications, I understand much better now! I am gonna do some experimentation with moving some parts to RAM and post the results here (if there are any conclusive results).

About the FLASH_ACR register: I should set DCRST and ICRST, *then* set the three bits to enable prefetching and the caches?
Last thing: you said the STM32F411 has 0 wait states at any clock frequency (and I’ve read that on the product page as well). That means 100 MHz IS indeed the best solution to go fast, right? I don’t have to determine the best clock value, given that increasing the clock speed will not add any wait states? This leads me to another question: why would someone run the F411 (or another part) slower than 100 MHz if it’s 0 wait states at 100 MHz? My first thought was power consumption, which might make a significant difference... Are there any other criteria?

Thanks a lot for your help,

Florent


Hi Florent,

Yes, the main reason to run slower is to save power; however, if you carefully use WFI (or WFE) to sleep when you don’t have anything to do (waiting to be woken up by some interrupt: even masked ones for WFI, only unmasked ones for WFE), then you may achieve about the same power consumption at 100 MHz, but with better response times due to the higher CPU frequency.

Of course, this is only valid if you are interrupt-driven; if you use polling, you can’t use WFI/WFE, and then decreasing the CPU frequency to what gives you the needed performance will provide better battery life.

Bernard (Ac6)