relative performance of 4770K across linux and windows

Tim Wilkens

18 Sep, 2013 08:06 PM

I was looking at the performance of the Haswell 4770K on Geekbench 3.0.2 under Linux and Windows. I note that the Windows binary delivers 20 GB/s of memory bandwidth, while the Linux binary on my platform delivers only 11.2 GB/s. I'd suggest removing the compiler from the equation, or otherwise addressing this disparity between two different binaries of the same benchmark on the same processor. This is not an OS issue but a compiler issue.

Here's the same processor in Windows:
http://browser.primatelabs.com/geekbench3/29389

and my scores are:

Geekbench 3.0.2 Pro : http://www.primatelabs.com/geekbench/

System Information
  Operating System Ubuntu 12.04.2 LTS 3.2.0-48-generic x86_64
  Model ASUS All Series
  Motherboard ASUSTeK COMPUTER INC. Z87-PLUS
  Processor Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz @ 3.40 GHz
                        1 Processor, 4 Cores, 4 Threads
  Processor ID GenuineIntel Family 6 Model 60 Stepping 3
  L1 Instruction Cache 32.0 KB x 2
  L1 Data Cache 32.0 KB x 2
  L2 Cache 256 KB x 2
  L3 Cache 8.00 MB
  Memory 15.6 GB
  BIOS American Megatrends Inc. 1007
  Compiler Clang 3.3 (tags/RELEASE_33/final)

Integer
  AES
    single-core 5465 4.68 GB/sec
    multi-core 9787 8.38 GB/sec
  Twofish
    single-core 3177 178.3 MB/sec
    multi-core 12709 713.2 MB/sec
  SHA1
    single-core 3746 406.6 MB/sec
    multi-core 14986 1.59 GB/sec
  SHA2
    single-core 4192 181.4 MB/sec
    multi-core 16798 726.8 MB/sec
  BZip2 Compress
    single-core 2939 11.9 MB/sec
    multi-core 11735 47.7 MB/sec
  BZip2 Decompress
    single-core 2926 15.9 MB/sec
    multi-core 11542 62.6 MB/sec
  JPEG Compress
    single-core 3329 46.4 Mpixels/sec
    multi-core 13315 185.5 Mpixels/sec
  JPEG Decompress
    single-core 4844 119.8 Mpixels/sec
    multi-core 19225 475.3 Mpixels/sec
  PNG Compress
    single-core 2990 2.39 Mpixels/sec
    multi-core 11943 9.54 Mpixels/sec
  PNG Decompress
    single-core 3416 39.4 Mpixels/sec
    multi-core 13630 157.1 Mpixels/sec
  Sobel
    single-core 4494 163.5 Mpixels/sec
    multi-core 17804 647.9 Mpixels/sec
  Lua
    single-core 4967 4.46 MB/sec
    multi-core 19037 17.1 MB/sec
  Dijkstra
    single-core 2695 9.67 Mpairs/sec
    multi-core 7733 27.8 Mpairs/sec

Floating Point
  BlackScholes
    single-core 2445 10.9 Mnodes/sec
    multi-core 9759 43.4 Mnodes/sec
  Mandelbrot
    single-core 3046 3.12 Gflops
    multi-core 12183 12.5 Gflops
  Sharpen Filter
    single-core 3021 2.24 Gflops
    multi-core 12070 8.95 Gflops
  Blur Filter
    single-core 2393 2.28 Gflops
    multi-core 9566 9.12 Gflops
  SGEMM
    single-core 4596 12.9 Gflops
    multi-core 17948 50.3 Gflops
  DGEMM
    single-core 3980 5.85 Gflops
    multi-core 15222 22.4 Gflops
  SFFT
    single-core 3351 3.53 Gflops
    multi-core 13428 14.2 Gflops
  DFFT
    single-core 3443 3.14 Gflops
    multi-core 13654 12.4 Gflops
  N-Body
    single-core 4813 1.79 Mpairs/sec
    multi-core 19176 7.12 Mpairs/sec
  Ray Trace
    single-core 4413 5.20 Mpixels/sec
    multi-core 17808 21.0 Mpixels/sec

Memory
  Stream Copy
    single-core 2807 11.2 GB/sec
    multi-core 2956 11.8 GB/sec
  Stream Scale
    single-core 2802 11.2 GB/sec
    multi-core 2946 11.8 GB/sec
  Stream Add
    single-core 2786 12.6 GB/sec
    multi-core 2963 13.4 GB/sec
  Stream Triad
    single-core 2893 12.7 GB/sec
    multi-core 3050 13.4 GB/sec

Benchmark Summary
  Integer Score 3686 13431
  Floating Point Score 3453 13711
  Memory Score 2821 2978

  Geekbench Score 3419 11452

The DGEMM, SGEMM, DFFT and SFFT scores are very low and not very representative of what the cores can really do. DGEMM should hit 50+ GFLOPs, while we're achieving only 1/7 of what is theoretically and practically achievable. Similar statements can be made regarding the *FFT components. If you don't want to use assembly, I think much better performance can be achieved with well-written code compiled by a vectorizing compiler.
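
For reference, the arithmetic behind a theoretical-peak figure like this can be sketched as follows. This is a rough estimate, not anything taken from Geekbench itself: it assumes two 256-bit FMA units per Haswell core, the 3.50 GHz base clock from the system information above (turbo ignored), and the hypothetical helper names `peak_gflops` and `fraction_of_peak`.

```python
# Back-of-envelope peak floating-point rate for one Haswell core.
# Assumptions (not from Geekbench): 2 FMA units per core, 256-bit vectors,
# 1 FMA = 2 flops, base clock only (turbo ignored).

def peak_gflops(clock_ghz, cores=1, simd_bits=256, element_bits=64, fma_units=2):
    """Theoretical peak in Gflops: lanes * 2 flops per FMA * FMA units * clock * cores."""
    lanes = simd_bits // element_bits          # 4 doubles or 8 floats per vector
    flops_per_cycle = lanes * 2 * fma_units    # each FMA counts as multiply + add
    return flops_per_cycle * clock_ghz * cores


def fraction_of_peak(measured_gflops, peak):
    """Share of the theoretical peak that a measured result reaches."""
    return measured_gflops / peak


BASE_CLOCK_GHZ = 3.5  # "i7-4770K CPU @ 3.50GHz" in the report above

dp_core = peak_gflops(BASE_CLOCK_GHZ)                   # double precision, one core
sp_core = peak_gflops(BASE_CLOCK_GHZ, element_bits=32)  # single precision, one core

# Single-core results from the report above: DGEMM 5.85 Gflops, SGEMM 12.9 Gflops.
print(f"DP peak/core: {dp_core:.1f} Gflops, DGEMM at {fraction_of_peak(5.85, dp_core):.1%}")
print(f"SP peak/core: {sp_core:.1f} Gflops, SGEMM at {fraction_of_peak(12.9, sp_core):.1%}")
```

Under these assumptions the double-precision peak comes to 56 Gflops per core, so a tuned DGEMM landing above 50 Gflops is plausible. The peak is an upper bound, and the real clock under load may differ from the base clock used here.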

Comparing other results with what I just measured in Linux, the Linux score for AES is 5/8 of that on Windows, which is another issue, and I'm sure there are more. My comments are meant only to be constructive. Thanks.

  1. 1 Posted by Tim Wilkens on 20 Sep, 2013 01:31 AM

    Any reply to this email? I would think the same hardware should perform similarly, but evidently there's a very large discrepancy in performance. There seems to be a serious performance issue when comparing different platforms. I have no idea what Apple has done with their LLVM, but I'd imagine it's not very comparable to what is available on Linux. Also, for x86 processors, there are optimized math libraries available on Linux and Windows (and, I'd imagine, Apple) with *GEMM and *FFT routines you could call. That way you're testing something relevant to performance other than memory latency. Your *GEMM should be close to what's theoretically achievable. Lastly, there are optimized AES and other encryption implementations, even ISA instructions, so why not use them?

    Tim

  2. Support Staff 2 Posted by John on 20 Sep, 2013 08:27 PM

    Hi Tim,

    Thank you for your questions, and sorry for the delay in getting back to you. We appreciate when users consider the scores so carefully.

    During development of the benchmark we run tests similar to what you are doing here: we run the benchmark on the same hardware across different OSes. We see small performance differences, but are unable to reproduce the severe discrepancy that you have observed on either Sandy Bridge or Ivy Bridge (we do not have a Haswell system to test).

    We enable auto-vectorization on all compilers. However, SFFT is not vectorized on any platform, so I do not believe that vectorization is causing the differences you are observing.

    We certainly agree that the same hardware should perform similarly, and we strive for this. We want the scores to represent the hardware performing well, but we also intend the scores to reflect the execution performance of real-world application code. We expect that a programmer will write their code once and compile that same code for each platform that they support. We choose not to use optimized vendor libraries for workloads such as GEMM and FFT, since we expect such libraries to be optimized for each target architecture, making a direct comparison of the scores troublesome. Furthermore, if the libraries are proprietary, we don't know exactly what optimizations the library performs.

    We use the AES-NI instructions on systems that support them. Similarly, we use the SHA1 instructions when they are available. In this way we have encryption and hash functions implemented in both hardware and software: AES and SHA1 in hardware, and Twofish and SHA2 in software.

    I hope this addresses your concerns and helps to explain our design choices. Let me know if you have any further questions or concerns and I'd be happy to help out.

    Best,
    John

  3. Support Staff 3 Posted by John on 20 Sep, 2013 08:31 PM

    Hi Tim,

    One thing I forgot to mention regarding the low AES performance you observed on Linux. We're aware of situations where Linux is slow to "ramp up" the frequency of the processor (relative to Windows). Since AES is the first workload executed, that might be what is happening here? In our testing AES performance is almost identical between Windows and Linux on the same hardware.

    Best,
    John
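
One way to check for the slow frequency ramp-up described above is to look at the cpufreq settings Linux exposes through sysfs before running the benchmark. The sketch below is only an illustration: the helper name `read_cpufreq` is hypothetical, and it assumes the kernel's cpufreq driver is loaded so that the `scaling_*` files exist (missing files are reported as `None`).

```python
from pathlib import Path


def read_cpufreq(cpu=0, root="/sys/devices/system/cpu"):
    """Return the governor and frequencies (kHz) for one CPU; None where unavailable."""
    base = Path(root) / f"cpu{cpu}" / "cpufreq"
    info = {}
    for name in ("scaling_governor", "scaling_cur_freq", "scaling_max_freq"):
        try:
            text = (base / name).read_text().strip()
        except OSError:
            # File is absent (no cpufreq driver, non-Linux OS) or unreadable.
            info[name] = None
            continue
        # Frequencies are plain integers in kHz; the governor is a name.
        info[name] = int(text) if text.isdigit() else text
    return info


if __name__ == "__main__":
    print(read_cpufreq())
```

If the governor is an on-demand style policy and `scaling_cur_freq` sits well below `scaling_max_freq` while idle, the first workload may start at a reduced clock. Running a short CPU-bound warm-up beforehand, or rerunning the benchmark and comparing AES results, is one way to test that explanation.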

  4. 4 Posted by Paul Hoon on 08 Jan, 2017 06:43 PM

    A big disappointment here, also. Similar test runs on the same hex-core hardware showed Windows 8 faster (by about the same factor you showed), in both integer and floating point, than a "Scientific Linux" version we tested.

    A number of knowledgeable people in computer science told me that Linux would be fastest for scientific/math computing. That turned out not to be true.

    We need to complete three continuous weeks (24X7) of multivariate statistical computations for 3/4 million cancer patients.

    Would you have found better FP and INT performance had you tested the commercial Red Hat version of linux instead?
