samedi 25 août 2012

Benchmarking of CPUs: the importance of the number of memory channels

I am involved in a project aiming at improving the resolution of CP problems by using massive parallelism.
Thus, I learnt parallelism like a novice.
First, I thought that it was not complex to use at the same time all the cores of a modern multicore CPU. Then, I learnt that it is not so easy ...
My first test was about a simple problem requiring a lot of memory (about 500Mo). I was thinking that I could run the problem 4 times at the same moment on a 4 cores machine without any problem, that is it would not need more time than with one core (or a little bit more due to the frequency change).  The first result was clear : NO, it does not work like that.
Thus, I made some experiments using different machines. I learnt that the results may be quite different! I reproduce the results here:
Using k threads means that I try to solve the very same problem k times in parallel (one per thread). The ms column corresponds to the number of ms that are elapsed for solving all the problems (that is the time for all the threads to finish)


 
Intel i7 920
 
 
 
Intel i7 3820
 
 
4 cœurs
2 canaux mémoire
4 cœurs
4 canaux mémoire
 
2,66 Ghz
2 threads / cœur
3,6 Ghz
2 threads / cœur
#threads
ms
perf/thread
ratio / 1 thread
ms
perf/thread
 
1
835
835
1,0
680
680
1,0
2
1000
500
1,7
750
375
1,8
4
1365
341
2,4
915
229
3,0
6
1900
317
2,6
1220
203
3,3
8
2460
308
2,7
1530
191
3,6
10
3240
324
2,6
1970
197
3,5
12
3850
321
2,6
 
2250
188
3,6

 

 
Intel i7 3930
 
 
AMD FX8150
 
 
6 cœurs
4 canaux mémoire
8 cœurs
2 canaux mémoire
 
3,2 Ghz
2 threads / cœur
3,6 Ghz
1 thread/cœur
#threads
ms
perf/thread
ms
perf/thread
 
1
715
715
1,0
757
757
1,0
2
790
395
1,8
1100
550
1,4
4
900
225
3,2
1800
450
1,7
6
1090
182
3,9
2600
433
1,7
8
1390
174
4,1
3325
416
1,8
10
1630
163
4,4
4120
412
1,8
12
1890
158
4,5
 
5000
417
1,8


 
Intel 4x E7-4870
 
 
 
10 cœurs/CPU
4 canaux mémoire
 
2,4 Ghz
2 threads / cœur
 
#threads
ms
perf/thread
 
1
1744
1744
1,0
 
2
1733
867
2,0
 
4
2089
522
3,3
 
40
2712
68
25,7
 
80
4484
56
31,1
 
120
8350
70
25,1
 
160
10433
65
26,7
 
320
20566
64
27,1
 

 
What do we learn? When using memory, one of the most important thing is the number of memory channels. We also learn that the latest Intel CPUs (3870 and 3930) are really good.
Note that in these benchmarks there is no concurrent access in memory: the memory is splitted into several parts at the beginning.

Can we observe these results for all problems? The answer is NO. When, we do not require so much memory the results are quite different, as shown by the following tables; for these benchmarks, the same protocol is used, but the problem which is solved by each thread requires at most 1Mo of memory.


 
Intel i7 920
 
 
 
Intel i7 3820
 
 
4 cœurs
2 canaux mémoire
4 cœurs
4 canaux mémoire
 
2,66 Ghz
2 threads / cœur
3,6 Ghz
2 threads / cœur
#threads
ms
perf/thread
ratio / 1 thread
ms
perf/thread
 
1
1790
1790
1,0
1100
1100
1,0
2
1790
895
2,0
1140
570
1,9
4
2000
500
3,6
1300
325
3,4
6
2800
467
3,8
1830
305
3,6
8
3610
451
4,0
2260
283
3,9
10
4500
450
4,0
2810
281
3,9
12
5400
450
4,0
 
3300
275
4,0

 

 
Intel i7 3930
 
 
AMD FX8150
 
 
6 cœurs
4 canaux mémoire
8 cœurs
2 canaux mémoire
 
3,2 Ghz
2 threads / cœur
3,6 Ghz
1 thread/cœur
#threads
ms
perf/thread
ms
perf/thread
 
1
1210
1210
1,0
2000
2000
1,0
2
1210
605
2,0
2051
1026
2,0
4
1285
321
3,8
2160
540
3,7
6
1650
275
4,4
2625
438
4,6
8
1880
235
5,1
2830
354
5,7
10
2240
224
5,4
3850
385
5,2
12
2620
218
5,5
 
4490
374
5,3

 

 
Intel 4x E7-4870
 
 
 
10 cœurs/CPU
4 canaux mémoire
 
2,4 Ghz
2 threads / cœur
 
#threads
ms
perf/thread
 
1
1910
1910
1,0
 
2
1920
960
2,0
 
4
1920
480
4,0
 
40
3030
76
25,2
 
80
4080
51
37,5
 
120
6230
52
36,8
 
160
8230
51
37,1
 
320
16260
51
37,6
 

 
In this case, it is clear that the results of the AMD CPU are better than in the previous case. However, it is still clearly outperformed by recent Intel CPUs.

For those who can be interested, here are the results for an Intel i7 2600

i7 2600
ms perf/thread
685 685 1,0
841 421 1,6
1423 356 1,9
1943 324 2,1
2693 337 2,0
3532 353 1,9
4338 362 1,9

I also compare Windows (VS 2012) and Linux (GCC latest version) on the very same machine (core I7 3930): the maximum difference is 10% (Windows is faster when we do not require too much memory)

These results give me some inputs on my quest of 100x improvement machine.

These results also show that we have to be careful when giving results for parallelisms. The impact of the machine that is used is more important than for sequential result