Floating-Point Math Speed vs Precision

IMPORTANT: Useful feedback revealed that some of these measures are seriously flawed. A major update is on the way.


Follow-up on arithmetic speed vs precision, this time with the C floating-point math functions.

After the basic arithmetic operations, let's look at the cost of the C math functions. This time we are interested in two points: which math operations are fast, slow, or deadly slow, and what the impact of extending the floating-point precision is.

Setup

The method is identical:

  • 10 MByte random arrays processed in a loop applying a single operation (a harness sketch follows this list): for (int i=0; i<size; i++) c[i] = OP(a[i], b[i]);
  • 32, 64, 80, and 128-bit floating-point data;
  • use of the type-generic C99 math functions (macros), plus the three arithmetic operations for comparison;
  • 32 measurements with microsecond precision, keeping the median, normalized to a 1GHz clock frequency;
  • code compiled with gcc 4.9 -Ofast and eglibc 2.19;
  • no hyper-threading or CPU frequency scaling, process pinned to a single CPU core.
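
For reference, here is a minimal sketch of such a timing harness. This is not the author's actual code (linked at the end of the post); the names (usec, NRUNS, SIZE) and the choice of sin as the measured operation are illustrative:

    /* Minimal timing-harness sketch, not the author's actual code.
       Build: gcc -Ofast bench.c -lm (add -lrt for clock_gettime on
       older glibc). */
    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <tgmath.h>   /* type-generic macros: sin() picks sinf() for float */

    #define SIZE  (10 * 1024 * 1024 / sizeof(float))   /* 10 MBytes of floats */
    #define NRUNS 32

    static double usec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e6 + ts.tv_nsec * 1e-3;
    }

    static int cmp(const void *p, const void *q)
    {
        double a = *(const double *)p, b = *(const double *)q;
        return (a > b) - (a < b);
    }

    int main(void)
    {
        float *a = malloc(SIZE * sizeof(float));
        float *c = malloc(SIZE * sizeof(float));
        double t[NRUNS];

        /* random input in ]0, 1], valid for every measured function */
        for (size_t i = 0; i < SIZE; i++)
            a[i] = ((double)rand() + 1.0) / ((double)RAND_MAX + 1.0);

        for (int r = 0; r < NRUNS; r++) {
            double t0 = usec();
            for (size_t i = 0; i < SIZE; i++)
                c[i] = sin(a[i]);               /* the measured operation */
            t[r] = usec() - t0;
        }

        qsort(t, NRUNS, sizeof(double), cmp);   /* keep the median run */
        /* print c[0] so the compiler cannot discard the loop */
        printf("median: %g us for %zu calls (check: %g)\n",
               t[NRUNS / 2], (size_t)SIZE, (double)c[0]);

        free(a);
        free(c);
        return 0;
    }

Pinning to one core can be done from the shell with taskset -c 0 ./bench. Note also that once times are normalized to a 1GHz clock, one nanosecond per call reads directly as one cycle per call.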

Measurements are split across three charts, on three different scales. Cosine functions are omitted; they take exactly the same time as their sine variants. Input values are randomly drawn from the ]0, 1] interval so that every function's domain is satisfied.

On some CPUs designed for mobile computing on laptops, active power saving, heat control, frequency scaling, and throttling result in unstable performance numbers. I had to manually pick the results that made sense from multiple runs.

Measures

x86-64

x86-64 Intel Haswell

Intel Core i7 4770 - 2013 - 3.40GHz

Compiler is gcc 4.8 in Ubuntu 14.04.

x86-64 Intel Ivy Bridge

Intel Xeon E5 2650v2 - 2013 - 2.60GHz

The five fast operations (sum, product, division, absolute value, and square root) are all implemented by a single instruction and only differ in the number of cycles they require. For 32-bit float data, absolute value runs in 2 cycles, sum and product in 3 cycles, division and square root in 6 cycles.
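
As a rough illustration (assuming scalar SSE code generation on x86-64; the actual output should be checked with gcc -S -Ofast on a given setup), each of these operations compiles to one machine instruction:

    /* Each body below typically becomes a single x86-64 SSE instruction
       (addss, mulss, divss, andps with a sign mask, sqrtss); this is an
       expectation to verify on the compiled code, not a guarantee. */
    #include <math.h>

    float f_sum (float a, float b) { return a + b; }      /* addss  */
    float f_prod(float a, float b) { return a * b; }      /* mulss  */
    float f_div (float a, float b) { return a / b; }      /* divss  */
    float f_abs (float a)          { return fabsf(a); }   /* andps  */
    float f_sqrt(float a)          { return sqrtf(a); }   /* sqrtss */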

The other operations require multiple instructions. Exponential takes about 12 cycles, as do hyperbolic sine and tangent, whose implementations are also based on the exponential. Arcsine has the same speed, and arctangent is about twice as slow.
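
The dependency on the exponential is visible in the textbook identities; a real libm adds range reduction and accuracy fixes, so the reference versions below only sketch why the costs track each other:

    #include <math.h>

    /* Reference definitions (not the libm implementations): the cost of
       sinh/tanh is dominated by one exp() evaluation plus cheap arithmetic. */
    double sinh_ref(double x) { return (exp(x) - exp(-x)) / 2.0; }
    double tanh_ref(double x) { double e = exp(2.0 * x); return (e - 1.0) / (e + 1.0); }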

Finally, the slow functions need around 100 cycles per call in 64-bit precision, but with large differences in 32 bits: logarithm and sine take 30 cycles, but the worst, tangent, climbs to 1000 cycles per call in single precision!

→ similar CPU: Core i5 3317U

Intel Core i5 3317U - 2012 - 1.70GHz (mobile)

Compiler is gcc 4.8 in Ubuntu 14.04.

x86-64 Intel Nehalem

Intel Xeon X7560 - 2010 - 2.27GHz

x86-64 Intel Harpertown

Intel Xeon X3323 - 2008 - 2.5GHz

Compiler is gcc 4.8 and libc is eglibc 2.11 on Ubuntu 10.04.

Power

Power IBM POWER8

IBM Power S812L - 2014 - 3.02GHz

Floating-point with 80-bit precision is not available. Here, 128-bit floating point is double-double arithmetic, not IEEE754 binary128. These measurements were obtained through virtualization (PowerKVM).
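
For readers unfamiliar with the format: a double-double represents a number as the unevaluated sum of two 64-bit doubles, giving roughly 106 bits of significand but a narrower exponent range than binary128. A minimal addition sketch (simplified; real implementations handle special values and normalization more carefully):

    typedef struct { double hi, lo; } dd;   /* value = hi + lo */

    /* Knuth's TwoSum: the exact sum of a and b, split into result + error */
    static dd two_sum(double a, double b)
    {
        double s = a + b;
        double v = s - a;
        double e = (a - (s - v)) + (b - v);
        return (dd){ s, e };
    }

    /* Simplified double-double addition */
    static dd dd_add(dd a, dd b)
    {
        dd s = two_sum(a.hi, b.hi);
        double lo = s.lo + a.lo + b.lo;
        double hi = s.hi + lo;              /* renormalize the pair */
        return (dd){ hi, lo - (hi - s.hi) };
    }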

ARM

ARM Cortex-A17

Rockchip RK3288 - 2014 - 1.80GHz

Floating-point with 80- or 128-bit precision is not available on ARM devices. Compiler is gcc 4.8 in Ubuntu 14.04.

ARM Cortex-A9

Rockchip RK3188 - 2013 - 1.80GHz

Compiler is gcc 4.8 in Ubuntu 14.04.

ARM11

Broadcom BCM2835 - 2012 - 700MHz

For this slower CPU (from the Raspberry Pi board), we reduced the array size to 1 MByte. Compiler is gcc 4.6 and libc is eglibc 2.13 on Raspbian 7. This CPU has no floating-point unit, hence the very slow speeds observed here.

Conclusions

Performance figures differ a lot across CPU generations, and possibly across compiler and libc versions too. The general rules of thumb are:

  • Single-instruction operations (sum, product, division, absolute value, square root) take 3 to 10 cycles.
  • Fast multiple-instruction operations (exponential and hyperbolic functions, arcsine and arctangent) take 10 to 50 cycles.
  • Slow multiple-instruction operations (logarithm, power, trigonometric functions) take 50 to 1000 cycles.

Tangent and sine usually have the worst computation times in our measurements, except on recent CPUs/compilers where sine is much faster but tangent remains as slow. Deciding whether the improvement comes from the SSE4.2 instruction set or from clever gcc 4.9 optimizations would require more measurements and a look at the compiled code.

Smaller is also usually faster, with operations on 32 bits taking fewer cycles than those on 64 bits, and 128-bit math being deadly slow. But the numbers get closer for the slow functions, possibly because their complex implementations defeat every possible vector and pipeline speedup. For the slowest functions, 64-bit computations are sometimes even much faster than the same operations on 32 bits, for reasons that still lack any explanation.

And to avoid any misunderstanding, I end with the mandatory disclaimer: these numbers are only meaningful for comparing math functions and data types. The performance of real programs will depend on many other factors: instruction sequences, branching, memory access and cache misses, multi-threading, etc.

PS: I am interested in adding other CPUs and architectures; let me know if you can run my code on other machines, or (better!) let me access them for a couple of hours.


Timing code is here. Inspiration is from Jon Bentley's "Programming Pearls".

Updates:

  • 2015/01/07: normalized to 1GHz clock frequency, bounded input to the ]0, 1] interval, added trigonometric functions, removed redundant CPU models
  • 2015/01/13: added Intel Xeon X3323, IBM Power S812L, and Broadcom BCM2835 CPUs, corrected legends, added comments
  • 2015/01/15: added Intel Core i5 3317U and Intel Core i7 4770, thanks to Martin Etchart; added Rockchip RK3188 and Rockchip RK3288, thanks to Damien Challet.
  • 2015/01/19: code cleanup and renamed sections