samedi 8 décembre 2012

The danger of unbalanced PC


CPAIOR is one of my favourite conferences because it “provides an opportunity for researchers in one area to learn about techniques in the others”.
A few years ago, Jean-Charles complained that there was no PC member from ILOG (now IBM) at CP2008. I agree with him it was not enough considering the role this company plays in the field.
Last year at CPAIOR12, 7% of the CP members were from IBM.
This year it’s 22%. I’m not sure people from IBM will complain any more ;-)

Although I have nothing against this company, I feel a bit uncomfortable with this percentage and I can see some risk there. Let me illustrate this with a tweet from LocalSolver (not IBM) (commercial local search solver):

“Don't publish dedicated heuristic approaches without having tried ‪#LocalSolver on your combinatorial problem: your paper might be rejected!”

I don’t know if this was a joke or not, but to me it illustrates perfectly the danger there is if a same affiliation is over-represented in a PC of a conference. There are other important players (commercial or open-source) also building nice solvers that are not represented at all in this PC committee… (for instance localsolver ;-))

I don’t blame anybody and I like the fact the people from industry are represented in PC. This is only because I like CPAIOR that I make this remark. But I’m sure I’m not the only one looking at the PC members composition before considering submitting a paper and it would be really sad if people don’t submit because they feel uncomfortable with the PC composition.

What do you think should be the maximum percentage of people with a same affiliation in such a conference? Maybe this should be written in the status of the conference?


samedi 25 août 2012

Benchmarking of CPUs: the importance of the number of memory channels

I am involved in a project aiming at improving the resolution of CP problems by using massive parallelism.
Thus, I learnt parallelism like a novice.
First, I thought that it was not complex to use at the same time all the cores of a modern multicore CPU. Then, I learnt that it is not so easy ...
My first test was about a simple problem requiring a lot of memory (about 500Mo). I was thinking that I could run the problem 4 times at the same moment on a 4 cores machine without any problem, that is it would not need more time than with one core (or a little bit more due to the frequency change).  The first result was clear : NO, it does not work like that.
Thus, I made some experiments using different machines. I learnt that the results may be quite different! I reproduce the results here:
Using k threads means that I try to solve the very same problem k times in parallel (one per thread). The ms column corresponds to the number of ms that are elapsed for solving all the problems (that is the time for all the threads to finish)


 
Intel i7 920
 
 
 
Intel i7 3820
 
 
4 cœurs
2 canaux mémoire
4 cœurs
4 canaux mémoire
 
2,66 Ghz
2 threads / cœur
3,6 Ghz
2 threads / cœur
#threads
ms
perf/thread
ratio / 1 thread
ms
perf/thread
 
1
835
835
1,0
680
680
1,0
2
1000
500
1,7
750
375
1,8
4
1365
341
2,4
915
229
3,0
6
1900
317
2,6
1220
203
3,3
8
2460
308
2,7
1530
191
3,6
10
3240
324
2,6
1970
197
3,5
12
3850
321
2,6
 
2250
188
3,6

 

 
Intel i7 3930
 
 
AMD FX8150
 
 
6 cœurs
4 canaux mémoire
8 cœurs
2 canaux mémoire
 
3,2 Ghz
2 threads / cœur
3,6 Ghz
1 thread/cœur
#threads
ms
perf/thread
ms
perf/thread
 
1
715
715
1,0
757
757
1,0
2
790
395
1,8
1100
550
1,4
4
900
225
3,2
1800
450
1,7
6
1090
182
3,9
2600
433
1,7
8
1390
174
4,1
3325
416
1,8
10
1630
163
4,4
4120
412
1,8
12
1890
158
4,5
 
5000
417
1,8


 
Intel 4x E7-4870
 
 
 
10 cœurs/CPU
4 canaux mémoire
 
2,4 Ghz
2 threads / cœur
 
#threads
ms
perf/thread
 
1
1744
1744
1,0
 
2
1733
867
2,0
 
4
2089
522
3,3
 
40
2712
68
25,7
 
80
4484
56
31,1
 
120
8350
70
25,1
 
160
10433
65
26,7
 
320
20566
64
27,1
 

 
What do we learn? When using memory, one of the most important thing is the number of memory channels. We also learn that the latest Intel CPUs (3870 and 3930) are really good.
Note that in these benchmarks there is no concurrent access in memory: the memory is splitted into several parts at the beginning.

Can we observe these results for all problems? The answer is NO. When, we do not require so much memory the results are quite different, as shown by the following tables; for these benchmarks, the same protocol is used, but the problem which is solved by each thread requires at most 1Mo of memory.


 
Intel i7 920
 
 
 
Intel i7 3820
 
 
4 cœurs
2 canaux mémoire
4 cœurs
4 canaux mémoire
 
2,66 Ghz
2 threads / cœur
3,6 Ghz
2 threads / cœur
#threads
ms
perf/thread
ratio / 1 thread
ms
perf/thread
 
1
1790
1790
1,0
1100
1100
1,0
2
1790
895
2,0
1140
570
1,9
4
2000
500
3,6
1300
325
3,4
6
2800
467
3,8
1830
305
3,6
8
3610
451
4,0
2260
283
3,9
10
4500
450
4,0
2810
281
3,9
12
5400
450
4,0
 
3300
275
4,0

 

 
Intel i7 3930
 
 
AMD FX8150
 
 
6 cœurs
4 canaux mémoire
8 cœurs
2 canaux mémoire
 
3,2 Ghz
2 threads / cœur
3,6 Ghz
1 thread/cœur
#threads
ms
perf/thread
ms
perf/thread
 
1
1210
1210
1,0
2000
2000
1,0
2
1210
605
2,0
2051
1026
2,0
4
1285
321
3,8
2160
540
3,7
6
1650
275
4,4
2625
438
4,6
8
1880
235
5,1
2830
354
5,7
10
2240
224
5,4
3850
385
5,2
12
2620
218
5,5
 
4490
374
5,3

 

 
Intel 4x E7-4870
 
 
 
10 cœurs/CPU
4 canaux mémoire
 
2,4 Ghz
2 threads / cœur
 
#threads
ms
perf/thread
 
1
1910
1910
1,0
 
2
1920
960
2,0
 
4
1920
480
4,0
 
40
3030
76
25,2
 
80
4080
51
37,5
 
120
6230
52
36,8
 
160
8230
51
37,1
 
320
16260
51
37,6
 

 
In this case, it is clear that the results of the AMD CPU are better than in the previous case. However, it is still clearly outperformed by recent Intel CPUs.

For those who can be interested, here are the results for an Intel i7 2600

i7 2600
ms perf/thread
685 685 1,0
841 421 1,6
1423 356 1,9
1943 324 2,1
2693 337 2,0
3532 353 1,9
4338 362 1,9

I also compare Windows (VS 2012) and Linux (GCC latest version) on the very same machine (core I7 3930): the maximum difference is 10% (Windows is faster when we do not require too much memory)

These results give me some inputs on my quest of 100x improvement machine.

These results also show that we have to be careful when giving results for parallelisms. The impact of the machine that is used is more important than for sequential result