IMPORTANT: Useful feedback revealed that some of these measures are seriously flawed. A major update is on the way.

Follow-up on arithmetic speed vs precision, this time with the C floating-point math functions.

After basic arithmetic operations, let's look at the cost of the C math functions. This time we are interested in two points: which math operations are fast, slow, or deadly slow; and what is the impact of extending the floating-point precision.

# Setup

The method is identical:

- 10 MByte random arrays processed in a loop on a single operation:
  `for (int i=0; i<size; i++) c[i] = OP(a[i], b[i]);`
- 32, 64, 80 and 128 bits floating-point data;
- use of the type-generic C99 math functions (macros), plus the three arithmetic operations for comparison;
- 32 measures with microsecond precision; the median is kept, normalized to a 1 GHz clock frequency;
- code compiled with `gcc 4.9 -Ofast` and `eglibc 2.19`;
- no hyper-threading or CPU frequency scaling; process pinned to a single CPU core.

Measures are split across three charts, on three different scales. Cosine functions are omitted: they take exactly the same time as their sine variants. Input values are taken at random in the ]0, 1] interval so that every function accepts them.

On some CPUs designed for mobile computing on laptops, active power saving, heat control, frequency scaling and throttling result in unstable performance numbers. I had to manually pick the results that made sense from multiple runs.

# Measures

## x86-64

### x86-64 Intel Haswell

Intel Core i7 4770 - 2013 - 3.40GHz

Compiler is `gcc 4.8` in Ubuntu 14.04.

### x86-64 Intel Ivy Bridge

Intel Xeon E5 2650v2 - 2013 - 2.60GHz

The five fast operations (sum, product, division, absolute value and square root) are all implemented by a single instruction and only differ by the number of cycles they require. For 32 bits float data, absolute value runs in 2 cycles, sum and product in 3 cycles, division and square root in 6 cycles.

The other operations require multiple instructions. Exponential takes about 12 cycles, like hyperbolic sine and tangent, whose implementations are also based on the exponential. Arcsine has the same speed, and arctangent is about twice as slow.

Finally, slow functions need around 100 cycles per call in 64 bits precision, but with large differences in 32 bits: logarithm and sine take 30 cycles, but the worst, tangent, climbs to 1000 cycles per call in single precision!

### x86-64 Intel Nehalem

Intel Xeon X7560 - 2010 - 2.27GHz

### x86-64 Intel Harpertown

Intel Xeon X3323 - 2008 - 2.5GHz

Compiler is `gcc 4.8` and libc is `eglibc 2.11` on Ubuntu 10.04.

## Power

### Power IBM POWER8

IBM Power S812L - 2014 - 3.02GHz

Floating-point with 80 bits precision is not available. Here floating-point
on 128 bits is double-double, *not* IEEE754 binary128. These measures
were obtained through virtualization (PowerKVM).

## ARM

### ARM Cortex-A17

Rockchip RK3288 - 2014 - 1.80GHz

Floating-point with 80 or 128 bits precision are not available on ARM devices.
Compiler is `gcc 4.8` in Ubuntu 14.04.

### ARM Cortex-A9

Rockchip RK3188 - 2013 - 1.80GHz

Compiler is `gcc 4.8` in Ubuntu 14.04.

### ARM11

Broadcom BCM2835 - 2012 - 700MHz

For this slower CPU (from the Raspberry Pi board), we reduced the array size to 1 MByte. Compiler is `gcc 4.6` and libc is `eglibc 2.13` on Raspbian 7. This CPU has no floating-point unit, thus the very slow speed observed here.

# Conclusions

Performance figures differ a lot across CPU generations, and possibly across compiler and libc versions too. The general rules of thumb are:

- Single-instruction operations (sum, product, division, absolute value, square root) take 3 to 10 cycles.
- Fast multiple-instruction operations (exponential and hyperbolic functions, inverse sine and tangent) take 10 to 50 cycles.
- Slow multiple-instruction operations (logarithm, power, trigonometric functions) take 50 to 1000 cycles.

Tangent and sine usually have the worst computation times in our measures, except on recent CPUs/compilers where sine is much faster but tangent is still as slow. Deciding whether the improvement comes from the `SSE4.2` instruction set or from clever `gcc 4.9` optimizations would require more measures and a look at the compiled code.

Smaller is also usually faster: operations on 32 bits take fewer cycles than those on 64 bits, and 128 bits maths is deadly slow. But the numbers get closer on slow functions, possibly because their complex implementations defeat every possible vector and pipeline speedup. For the slowest functions, 64 bits computations are sometimes even much faster than the same operations on 32 bits, for reasons that still lack an explanation.

And to avoid any misunderstanding, I end with the mandatory disclaimer: these numbers are only meaningful to compare math functions and data types. The performance of real programs will depend on many other factors: instruction sequences, branching, memory accesses and cache misses, multi-threading, etc.

PS: I am interested in adding other CPUs and architectures; let me know if you can run my code on other machines, or (better!) let me access them for a couple of hours.

Timing code is here. Inspiration is from Jon Bentley's "Programming Pearls".

Updates:

- 2015/01/07: normalized to 1GHz clock frequency, bounded input to the ]0-1] interval, added trigonometric functions, removed redundant CPU models
- 2015/01/13: added Intel Xeon X3323, IBM Power S812L, and Broadcom BCM2835 CPUs, corrected legends, added comments
- 2015/01/15: added Intel Core i5 3317U and Intel Core i7 4770, thanks to Martin Etchart; added Rockchip RK3188 and Rockchip RK3288, thanks to Damien Challet.
- 2015/01/19: code cleanup and renamed sections