Integer and Floating-Point Arithmetic Speed vs Precision

IMPORTANT: Useful feedback revealed that some of these measures are seriously flawed. A major update is on the way.


Follow-up on my notes on code speedup. We measure the computation cost of arithmetic operations on different data types and different (Intel64) CPUs. We see that 64-bit integers are slow, 128-bit floating-point is terrible and 80-bit extended precision is no better, division is always slower than the other operations (integer and floating-point), and smaller is usually better. Yes, that was expected, but it's better when backed by hard code and numbers, isn't it?

If you have the option to choose between data types for a numerical implementation, part of the decision may be guided by a precision vs speed trade-off. For informed decisions, you need hard numbers, actual measures of the performance penalty you may hit in exchange for more precision. So let's run a few timers.

Setup

We are interested in the cost of repeated operations, including the possible speed-up of vector SIMD instructions to process multiple data at once, so the timing code runs various operations in a loop over arrays, something basically like for (int i=0; i<size; i++) c[i] = a[i] OP b[i];.
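For concreteness, here is a minimal sketch of the kind of loop being timed, written as a standalone C function; the function name and the float element type are illustrative, not taken from the actual benchmark code.

    #include <stddef.h>

    /* Illustrative example of the measured pattern: one arithmetic operation
     * applied element-wise over two input arrays.  With -Ofast, gcc is
     * expected to vectorize such a loop with SIMD instructions when possible. */
    void array_add_f32(const float *a, const float *b, float *c, size_t size)
    {
        for (size_t i = 0; i < size; i++)
            c[i] = a[i] + b[i];
    }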

Integer data types are signed, on 8, 16, 32 and 64 bits (C char, short, int and long in the LP64 model). Floating-point types are on 32, 64, 80 and 128 bits (IEEE754 single, double, extended and quadruple precision; with GCC these are the float, double, __float80 and __float128 types). We always process arrays of the same total size, 10 MBytes (in the ballpark of an 8-bit gray-scale photograph). Note that processing the same values with different data types would also add a cache-miss penalty for the types using more memory space.
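For reference, the types compared can be declared as below with GCC on x86-64; __float80 and __float128 are GCC extensions, not standard C, so this snippet assumes that compiler and platform.

    /* Integer types (LP64 model) and floating-point types compared here;
     * __float80 and __float128 are GCC extensions on x86-64. */
    signed char i8;     /* 8 bits            */
    short       i16;    /* 16 bits           */
    int         i32;    /* 32 bits           */
    long        i64;    /* 64 bits (LP64)    */

    float       f32;    /* IEEE754 single    */
    double      f64;    /* IEEE754 double    */
    __float80   f80;    /* x87 extended      */
    __float128  f128;   /* IEEE754 quadruple */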

For every operation, the processing time over the whole array is measured 32 times with microsecond precision, and we keep the median. Results are normalized to a 1 GHz clock frequency for CPU comparison.
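A sketch of this timing protocol is below; the use of clock_gettime() and the exact median computation are my assumptions here, not necessarily what the benchmark code does.

    #include <stdlib.h>
    #include <time.h>

    #define NB_RUNS 32

    static int cmp_double(const void *pa, const void *pb)
    {
        double a = *(const double *) pa, b = *(const double *) pb;
        return (a > b) - (a < b);
    }

    /* Run the array loop NB_RUNS times and return the median time
     * in microseconds. */
    static double median_usec(void (*run)(void))
    {
        double t[NB_RUNS];
        struct timespec t0, t1;

        for (int i = 0; i < NB_RUNS; i++) {
            clock_gettime(CLOCK_MONOTONIC, &t0);
            run();                                  /* the measured array loop */
            clock_gettime(CLOCK_MONOTONIC, &t1);
            t[i] = (t1.tv_sec - t0.tv_sec) * 1e6
                 + (t1.tv_nsec - t0.tv_nsec) * 1e-3;
        }
        qsort(t, NB_RUNS, sizeof(double), cmp_double);
        return (t[NB_RUNS / 2 - 1] + t[NB_RUNS / 2]) / 2.;
    }

Normalizing to 1 GHz then presumably amounts to scaling the measured time by the CPU clock frequency in GHz.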

Unless otherwise mentioned, code is compiled by gcc 4.9 with the -Ofast option. The environment is usually a Debian 8.0 Jessie distribution, natively or in a chroot. Hyper-threading and dynamic CPU frequency scaling are deactivated (via /sys/devices/system/cpu), and the process is pinned to a single CPU core (with taskset).
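The process is pinned with taskset from the shell; the same effect can be obtained in-process with sched_setaffinity(), as in this sketch (the pin_to_core() helper is only an illustration, not part of the benchmark).

    #define _GNU_SOURCE
    #include <sched.h>

    /* Pin the calling process to one CPU core, the in-process equivalent
     * of `taskset -c core`; error checking omitted for brevity. */
    static void pin_to_core(int core)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(core, &set);
        sched_setaffinity(0, sizeof(set), &set);    /* pid 0 = calling process */
    }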

We only show the measures for the bit-wise AND operator; the other operators, OR and XOR, have exactly the same performance on every CPU tested.

With some CPUs designed for mobile computing on laptops, active power saving, heat control, frequency scaling and throttling result in unstable performance numbers. I had to manually pick, from multiple runs, the results that made some sense.

Measures

x86-64

x86-64 Intel Haswell

Intel Core i7 4770 - 2013 - 3.40GHz

Compiler is gcc 4.8 in Ubuntu 14.04.

x86-64 Intel Ivy Bridge

Intel Xeon E5 2650v2 - 2013 - 2.60GHz

For integer data, speed is the same for sums, products and bit-wise operations (AND, OR, XOR). These operations take roughly the same time on 8, 16 and 32 bits, with a 5% penalty each time we double the precision. Operations on 64 bits are clearly slower, at 65% of the speed we get for 32 bits.

The other group is division and modulo, which get similar numbers, with modulo about 10% lower. Integer division is three times slower than sum or product, with similar performance on 8, 16 and 32 bits (the same 5% penalty). Computing on 64 bits is at 30% of the speed we get on 32 bits.

The first result from the floating-point measures is that extended and quadruple precision (80 and 128 bits) have terrible performance, probably because both need to be implemented in software, as there is no hardware instruction for these operations.

Otherwise, sum is faster than product, which is faster than division, with roughly 4/2/1 speed ratios. But these operations have very different speed vs precision profiles:

  • sum is 30% slower in double precision;
  • product is twice as fast in double precision;
  • division has the same speed in single and double precision.

The key explanation for these numbers certainly lies in which SIMD instructions are available and used for each operation and precision. To be investigated... later.

→ similar CPU: Core i5 3317U

Intel Core i5 3317U - 2012 - 1.70GHz (mobile)

Compiler is gcc 4.8 in Ubuntu 14.04.

x86-64 Intel Nehalem

Intel Xeon X7560 - 2010 - 2.27GHz

→ similar CPUs: Xeon L3406, Core i5 M520

x86-64 Intel Harpertown

Intel Xeon X3323 - 2008 - 2.5GHz

Compiler is gcc 4.8 in Ubuntu 10.04.

→ similar CPU: Xeon X5450

Intel Xeon X5450 - 2007 - 3GHz

Compiler is gcc 4.8 in Ubuntu 14.04.

x86-64 VIA Isaiah

VIA Nano U2250 - 2008 - 1.6GHz

Power

Power IBM POWER8

IBM Power S812L - 2014 - 3.02GHz

Floating-point with 80-bit precision is not available. Here, floating-point on 128 bits is double-double, not IEEE754 binary128. This is the first time I use a Power CPU, with its massive multithreading, so I can't guarantee I didn't misinterpret some numbers. Moreover, these measures were obtained through virtualization (PowerKVM).

ARM

ARM Cortex-A17

Rockchip RK3288 - 2014 - 1.80GHz

Floating-point with 80- or 128-bit precision is not available on ARM devices. Compiler is gcc 4.8 in Ubuntu 14.04.

ARM Cortex-A9

Rockchip RK3188 - 2013 - 1.80GHz

Compiler is gcc 4.8 in Ubuntu 14.04.

ARM11

Broadcom BCM2835 - 2012 - 700MHz

For this slower CPU (from the Raspberry Pi board), we reduced the array size to 1 MByte. Compiler is gcc 4.6 in Raspbian 7.

Conclusions

General rules:

  • Integer sums (and AND/OR/XOR) and products take the same time; divisions (and modulo) are three times slower.
  • Floating-point products are twice as slow as sums, and divisions are even slower.
  • Floating-point operations are always slower than integer operations at the same data size.
  • Smaller is faster.
  • 64-bit integer arithmetic is really slow.
  • 32-bit floats are faster than 64-bit on sums, but not really on products and divisions.
  • 80- and 128-bit precision should only be used when absolutely necessary; they are very slow.

Special cases:

  • On x86-64 AVX, floating-point product is faster on 64-bit data than on 32-bit.
  • On POWER8 AltiVec, floating-point product is as fast as sum at every precision; integer operations are performed at the same speed on 8-, 16-, 32- and 64-bit integers.
  • On ARM1176, bitwise integer operators are faster than additions.

These numbers are only meaningful to compare arithmetic instructions and data types. Performance of real programs will depend on many other factors: instruction sequences, branching, memory access and cache misses, multi-threading, etc.

PS: I am interested in adding other CPUs and architectures; let me know if you can run my code on other machines, or (better!) let me access them for a couple of hours.


Timing code is here. Inspiration is from Jon Bentley's "Programming Pearls".

Updates:

  • 2014/12/16: added Broadcom BCM2835 (ARM, Raspberry Pi), Intel Xeon X5450 and Xeon X3323, and IBM Power S812L.
  • 2014/12/18: added VIA Nano U2250, normalized to 1GHz clock frequency.
  • 2015/01/15: added Intel Core i5 3317U and Intel Core i7 4770, thanks to Martin Etchart; added Rockchip RK3188 and Rockchip RK3288, thanks to Damien Challet.
  • 2015/01/19: code cleanup and renamed sections.

Follow-up: