Indeed, you should always be precise when typing floating-point constants: the Cortex-M4 FPU only supports single precision in hardware, so as soon as an expression contains a double-precision value (even a constant like 1.0, which can be represented exactly in single precision), the compiler *must* carry out the whole computation in double precision, in software...
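As a minimal sketch of the point (the function names are mine, not from any library): in C, the constant `1.0` has type `double`, so it drags the whole expression into software double precision, while `1.0f` keeps it in the single precision the Cortex-M4 FPU handles in hardware.

```c
/* Both functions compute the same value; only the constant's type differs. */

float scale_double(float x)
{
    /* 1.0 is a double constant: x is promoted to double, the multiply
       is done in (software) double precision on a Cortex-M4, and the
       result is converted back to float. */
    return x * 1.0;
}

float scale_float(float x)
{
    /* 1.0f keeps the whole expression in single precision, so the
       hardware FPU can do the multiply directly. */
    return x * 1.0f;
}
```

Compiling with `-Wdouble-promotion` (GCC/Clang) flags accidental promotions like the first one.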
Regarding the suggestion of hand-optimizing the generated code: it *may* earn you some performance, but in practice it usually takes a full *rewrite* to beat what the compiler generates at the -O3 optimization level by a significant margin. Starting from non-optimized code will give you the satisfaction of gaining a lot of performance relative to the compiler-generated code, but you may still end up slower than the -O3 output; on the opposite side, optimized code is usually almost unreadable and very difficult to relate back to the source code, so it is not a good base for hand optimization either. The reason optimized code is so messy is that the compiler tries to optimize register usage across the whole function, reducing loads/stores and moves, while taking data dependencies between instructions into account by moving loads earlier in the code so that memory access latency is masked.
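To see this for yourself, a simple loop is enough (this is my illustration, not code from the question): compile the function below twice, with `arm-none-eabi-gcc -O0 -S` and with `-O3 -S`, and compare the assembly. At -O0 the accumulator and indices bounce through the stack on every iteration; at -O3 everything lives in registers, the loop is unrolled, and loads are scheduled ahead of the multiplies that consume them.

```c
/* A deliberately plain dot product: small enough to study the
   generated assembly at different optimization levels. */
float dot(const float *a, const float *b, int n)
{
    float acc = 0.0f;              /* 0.0f: stay in single precision */
    for (int i = 0; i < n; ++i)
        acc += a[i] * b[i];        /* at -O3, loads are hoisted early */
    return acc;
}
```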
Executing code from RAM instead of flash may give you a more significant performance increase, but it depends a lot on the size of your inner loops (and on the precise MCU you are using): high-end STM32 MCUs have an ART (Adaptive Real-Time) memory accelerator that (on an STM32F411, for example) provides CoreMark benchmark performance equivalent to 0 wait-state memory access, thanks to instruction prefetch, instruction caching and branch acceleration mechanisms. On such processors, executing from RAM may therefore not give you that big an increase, but, of course, YMMV.
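For reference, a hedged sketch of how a function is placed in RAM with GCC-based STM32 toolchains: a section attribute. The section name `.RamFunc` is an assumption on my part; it must actually exist in your linker script (STM32CubeIDE-generated scripts define it, other setups may use a different name such as `.ccmram`), and your startup code must copy that section from flash to RAM at boot.

```c
/* Hot inner loop fetched from RAM instead of flash.
   ".RamFunc" is an ASSUMED section name: check your linker script.
   noinline keeps the code in that section instead of being inlined
   back into flash-resident callers. */
__attribute__((section(".RamFunc"), noinline))
void halve_block(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * 0.5f;    /* 0.5f: single precision constant */
}
```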
Hope this helps clarify things,