IMPORTANT: Useful feedback revealed that some of these measures are seriously flawed. A major update is on the way.

Follow-up on my notes on code speedup. We measure the computation cost of arithmetic operations on different data types and different (Intel64) CPUs. We see that 64-bit integers are slow, 128-bit floating point is terrible and 80-bit extended precision is no better; division is always slower than the other operations (integer and floating-point); and smaller is usually better. Yes, that was expected, but isn't it better when backed by hard code and numbers?

If you have the option to choose between data types for a numerical
implementation, part of the decision may be guided by a precision
*vs* speed trade-off. For informed decisions, you need hard numbers,
actual measures of the performance penalty you may hit in exchange for
more precision. So let's run a few timers.

# Setup

We are interested in the cost of repeated operations, including
the possible speed-up of vector SIMD instructions that process multiple
data at once, so the timing code runs various operations in a loop over
arrays, basically something like
`for (int i=0; i<size; i++) c[i] = a[i] OP b[i];`.
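In C, one kernel per (type, operator) pair can be generated with a macro; this is a minimal sketch, not the actual benchmark code, and the kernel names are illustrative:

```c
#include <stddef.h>

/* Generate one benchmark kernel per (type, operator) pair.
 * The caller must consume the output array, otherwise an
 * optimizing compiler may remove the whole loop. */
#define DEFINE_KERNEL(name, type, op)                       \
    void name(type *c, const type *a, const type *b,        \
              size_t size)                                  \
    {                                                       \
        for (size_t i = 0; i < size; i++)                   \
            c[i] = a[i] op b[i];                            \
    }

DEFINE_KERNEL(add_i32, int, +)
DEFINE_KERNEL(mul_f64, double, *)
DEFINE_KERNEL(and_i8, signed char, &)
```

Keeping the loop body this simple is what lets the compiler vectorize it with SIMD instructions when they exist for the given type and operator.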

Integer data types are signed, on 8, 16, 32 and 64 bits (C `char`,
`short`, `int` and `long` in the LP64 model). Floating-point types are
on 32, 64, 80 and 128 bits (IEEE754 single, double, extended double
and quadruple precision[^ieee754_gcc]).
We always process arrays of the same size, 10 MBytes (in the ballpark
of an 8-bit gray-scale photograph). Note that processing the *same*
values as different data types would also add a cache-miss penalty for
the types using more memory space.

[^ieee754_gcc]: In gcc, these are `float`, `double`, `__float80` and `__float128`.

Processing time for the whole array is measured 32 times with microsecond precision, and for every operation we keep the median. Results are normalized to a 1 GHz clock frequency, for CPU comparison.

Unless otherwise mentioned, code is compiled by `gcc 4.9` with the
`-Ofast` option. The environment is usually a Debian 8.0 Jessie
distribution, natively or in a `chroot`. Hyper-threading and dynamic
CPU frequency scaling are deactivated (via `/sys/devices/system/cpu`),
and the process is pinned to a single CPU core (with `taskset`).
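That setup can be scripted roughly like this (a sketch, run as root; the exact sysfs paths depend on the kernel and cpufreq driver, and `./benchmark` stands in for the timing binary):

```shell
# Force the "performance" governor on every core to disable
# dynamic frequency scaling (paths depend on the cpufreq driver).
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done

# Disable hyper-threading by offlining the logical sibling of each
# physical core (cpu4 here; the numbering is topology-dependent, see
# /sys/devices/system/cpu/cpu*/topology/thread_siblings_list).
echo 0 > /sys/devices/system/cpu/cpu4/online

# Pin the benchmark process to a single core (core 0 here).
taskset -c 0 ./benchmark
```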

We only show the measures for the `AND` bit-wise operator; the other
operators `OR` and `XOR` have exactly the same performance, on every
CPU tested.

With some CPUs designed for mobile computing on laptops, active power saving, heat control, frequency scaling and throttling result in unstable performance numbers. I had to manually pick the results that made sense from multiple runs.

# Measures

## x86-64

### x86-64 Intel Haswell

Intel Core i7 4770 - 2013 - 3.40GHz

Compiler is `gcc 4.8` in Ubuntu 14.04.

### x86-64 Intel Ivy Bridge

Intel Xeon E5 2650v2 - 2013 - 2.60GHz

For integer data, speed is the same for sums, products, and bit-wise
operations (`AND`, `OR`, `XOR`). It takes roughly the same time on 8,
16 and 32 bits, but with a 5% penalty each time we double the
precision. Operations on 64 bits are clearly slower, at 65% of the
speed we get for 32 bits.

The other group is division and modulo, which get similar numbers, with modulo 10% slower. Integer division is three times slower than sum or product, with similar performance on 8, 16 and 32 bits (same 5% penalty). Computing on 64 bits is at 30% of the speed we get on 32 bits.

The first result from the floating-point measures is that extended and quad precision (80 and 128 bits) have terrible performance, probably because both need to be implemented in software, as there are no hardware instructions for these operations.
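As a reminder of what those extra bits buy, here is a minimal check (assuming x86-64, where `long double` is the 80-bit extended type with a 64-bit mantissa; the function names are illustrative):

```c
#include <float.h>

/* Adding eps to 1.0 is "lost" when eps falls below half an ulp
 * of 1.0 in the chosen type: the sum rounds back to exactly 1.0. */
int lost_in_double(double eps)
{
    double sum = 1.0 + eps;
    return sum == 1.0;
}

int lost_in_long_double(long double eps)
{
    long double sum = 1.0L + eps;
    return sum == 1.0L;
}
```

On x86-64, `lost_in_double(1e-18)` is true (double has a 53-bit mantissa, so the ulp of 1.0 is about 2.2e-16), while `lost_in_long_double(1e-18L)` is false: the 64-bit mantissa of the 80-bit type resolves the difference.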

Otherwise, sum is faster than product, which is faster than division,
with roughly 4/2/1 speed ratios. But these operations have very
different speed *vs* precision profiles:

- sum is 30% slower in double precision;
- product is twice as *fast* in double precision;
- division has the *same speed* in single and double precision.

The key explanation for these numbers certainly lies in which SIMD instructions are available and used for each operation and precision. To be investigated... later.

### x86-64 Intel Nehalem

Intel Xeon X7560 - 2010 - 2.27GHz

→ similar CPUs: Xeon L3406, Core i5 M520

### x86-64 Intel Harpertown

Intel Xeon X3323 - 2008 - 2.5GHz

Compiler is `gcc 4.8` in Ubuntu 10.04.

### x86-64 VIA Isaiah

VIA Nano U2250 - 2008 - 1.6GHz

## Power

### Power IBM POWER8

IBM Power S812L - 2014 - 3.02GHz

Floating-point with 80 bits precision is not available. Here, floating-point
on 128 bits is double-double, *not* IEEE754 binary128. This is the first
time I use a Power CPU, with its massive multithreading, so I can't
guarantee I didn't misinterpret some numbers. Moreover, these measures
were obtained through virtualization (PowerKVM).

## ARM

### ARM Cortex-A17

Rockchip RK3288 - 2014 - 1.80GHz

Floating-point with 80 or 128 bits precision is not available on ARM devices.
Compiler is `gcc 4.8` in Ubuntu 14.04.

### ARM Cortex-A9

Rockchip RK3188 - 2013 - 1.80GHz

Compiler is `gcc 4.8` in Ubuntu 14.04.

### ARM11

Broadcom BCM2835 - 2012 - 700MHz

For this slower CPU (from the Raspberry Pi board), we reduced the
array size to 1 MByte. Compiler is `gcc 4.6` in Raspbian 7.

# Conclusions

General rules:

- Integer sums (and `AND`/`OR`/`XOR`) and products take the same time; divisions (and modulo) are three times slower.
- Floating-point products are twice as slow as sums, and divisions even slower.
- Floating-point operations are always slower than integer ones at the same data size.
- Smaller is faster.
- 64-bit integer precision is really slow.
- 32-bit floats are faster than 64-bit on sums, but not really on products and divisions.
- 80 and 128 bits precision should only be used when absolutely necessary: they are *very* slow.

Special cases:

- On x86-64 AVX, float product is *faster* on 64-bit data than on 32-bit.
- On POWER8 AltiVec, float product is *as fast* as sum, at every precision; integer operations are performed at the same speed on 8, 16, 32 or 64 bits.
- On ARM1176, bitwise integer operators are faster than additions.

These numbers are only meaningful for comparing arithmetic instructions and data types. Performance of real programs will depend on many other factors: instruction sequences, branching, memory accesses and cache misses, multi-threading, etc.

PS: I am interested in adding other CPUs and architectures; let me know if you can run my code on other machines, or (better!) let me access them for a couple of hours.

Timing code is here. Inspiration is from Jon Bentley's "Programming Pearls".

Updates:

- 2014/12/16: added Broadcom BCM2835 (ARM, Raspberry Pi), Intel Xeon X5450 and Xeon X3323, and IBM Power S812L.
- 2014/12/18: added VIA Nano U2250, normalized to 1GHz clock frequency.
- 2015/01/15: added Intel Core i5 3317U and Intel Core i7 4770, thanks to Martin Etchart; added Rockchip RK3188 and Rockchip RK3288, thanks to Damien Challet.
- 2015/01/19: code cleanup and renamed sections

Follow-up: