Benchmark CIFAR10 on Tensorflow with CUDA 8 and cuDNN 6.0 vs CUDA 9 and cuDNN 7.0.5

Tensorflow 1.5.0 speed comparison with Tensorflow 1.4.1

Tensorflow 1.5.0 has been officially released. Among its various new features, one of the biggest is support for CUDA 9 and cuDNN 7, which promises double-speed training on Volta GPUs with FP16. But how does it fare on a plain old GeForce 840M? We are going to run a benchmark on the CIFAR-10 dataset to find out.

We installed tensorflow-gpu from the official pip packages or built it with Bazel to run our tests. If you want to learn more about how we did that, check out another article of ours here.

We will be performing our benchmark on the famous CIFAR-10 dataset. The CIFAR-10 dataset (Canadian Institute For Advanced Research) is a collection of images that are commonly used to train machine learning and computer vision algorithms. It is one of the most widely used datasets for machine learning research. The CIFAR-10 dataset contains 60,000 32×32 color images in 10 different classes. The 10 different classes represent airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. There are 6000 images of each class.
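The dataset figures quoted above fit together as simple arithmetic, which the following short sketch spells out (numbers taken from the description above; the 50,000/10,000 train/test split is the dataset's standard convention):

```python
# CIFAR-10 composition: 10 classes x 6,000 images each = 60,000 images total,
# conventionally split into 50,000 training and 10,000 test images.
NUM_CLASSES = 10
IMAGES_PER_CLASS = 6_000
TRAIN_IMAGES, TEST_IMAGES = 50_000, 10_000

total = NUM_CLASSES * IMAGES_PER_CLASS
assert total == TRAIN_IMAGES + TEST_IMAGES == 60_000

# Each image is 32x32 pixels with 3 color channels, so one uint8 image
# occupies 3,072 bytes before any encoding.
bytes_per_image = 32 * 32 * 3

print(total, bytes_per_image)  # 60000 3072
```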

(Source: https://en.wikipedia.org/wiki/CIFAR-10)

We are going to perform our benchmark using the cifar10_train.py file found in tutorials/image/cifar10 in the tensorflow models repo on GitHub. The benchmark machine is a Dell Inspiron 15-3878 with an Intel i7 processor and 8 GB of RAM, equipped with an Nvidia GeForce 840M GPU (compute capability 5.0). We will run the script on the three configurations of Tensorflow, CUDA Toolkit, and cuDNN listed below.

  1. Tensorflow gpu 1.4.1 with cuda 8.0 and cudnn 6.0
  2. Tensorflow gpu 1.5.0 with cuda 9.0 and cudnn 7.0.5
  3. Tensorflow gpu 1.5.0 with cuda 9.1 and cudnn 7.0.5

We will run the benchmark for a maximum of 10,000 steps and calculate the total duration of those 10,000 steps using the time.time() function.
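A minimal sketch of that timing approach, with a placeholder `train()` function standing in for the actual cifar10_train.py training loop (the name and loop body are illustrative, not the script's real code):

```python
import time

def train(max_steps):
    # Placeholder for the cifar10_train.py training loop; in the real script
    # this runs the TensorFlow session for `max_steps` training steps.
    for _ in range(max_steps):
        pass

start = time.time()
train(max_steps=10_000)
elapsed = time.time() - start
print("Total duration: %.1f seconds" % elapsed)
```

Note that time.time() measures wall-clock time, so the result includes startup, data-queue filling, and checkpoint/logging overhead, not just the training steps themselves.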

1. Benchmark Tensorflow GPU 1.4.1 with CUDA Toolkit 8.0 and cuDNN 6.0

For this configuration, we installed the official prebuilt pip package of Tensorflow GPU 1.4.1, along with CUDA Toolkit 8.0 and cuDNN 6.0, which this Tensorflow version requires. Upon running 10,000 steps on the CIFAR-10 dataset, here’s what we found:

[Screenshot: Tensorflow 1.4.1 + CUDA 8.0 benchmark output]

We can see this setup is averaging roughly 900 examples per second and 0.14 seconds per batch of images. It took a total of around 1558 seconds, roughly 26 minutes, to run 10,000 steps on the CIFAR-10 dataset.
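These two figures are consistent with each other if we assume the script's default batch size of 128 (an assumption; that is the cifar10_train.py default, but we did not change it explicitly): dividing batch size by time per batch recovers the throughput, and multiplying time per batch by the step count approximates the training time.

```python
batch_size = 128       # cifar10_train.py default batch size (assumed unchanged)
sec_per_batch = 0.14   # observed average for this run
steps = 10_000

examples_per_sec = batch_size / sec_per_batch
total_seconds = steps * sec_per_batch

print(round(examples_per_sec))  # 914, close to the observed ~900 examples/sec
print(total_seconds)            # 1400.0 s of pure step time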

2. Benchmark Tensorflow GPU 1.5.0 with CUDA Toolkit 9.0 and cuDNN 7.0.5

For this configuration, we installed the official prebuilt pip package of Tensorflow GPU 1.5.0, along with CUDA Toolkit 9.0 and cuDNN 7.0.5. Upon running 10,000 steps on the CIFAR-10 dataset, here’s what we found:

[Screenshot: Tensorflow 1.5.0 + CUDA 9.0 benchmark output]

We can see this setup is averaging roughly 1,240 examples per second and 0.103 seconds per batch of images. It took a total of around 1106 seconds, roughly 18 minutes, to run 10,000 steps on the CIFAR-10 dataset.

3. Benchmark Tensorflow GPU 1.5.0 with CUDA Toolkit 9.1 and cuDNN 7.0.5

Since the official pip package of Tensorflow GPU 1.5.0 does not ship with CUDA Toolkit 9.1 support, we had to build Tensorflow 1.5.0 against CUDA Toolkit 9.1 ourselves to perform this test. Here are the results upon running 10,000 steps on the CIFAR-10 dataset:

[Screenshot: Tensorflow 1.5.0 + CUDA 9.1 benchmark output]

We can see this setup performs almost identically to the previous one in terms of examples per second and seconds per batch. It took a total of around 1046 seconds, roughly 17 and a half minutes, to run 10,000 steps on the CIFAR-10 dataset.

Conclusion

From the above we can conclude that CUDA 9 support in Tensorflow 1.5.0 considerably speeds up training even on GPUs other than Volta, and without FP16. While there is little difference in training speed between Tensorflow 1.5.0 with CUDA 9.0 and with CUDA 9.1, Tensorflow 1.5.0 is comfortably faster than Tensorflow 1.4.1, cutting training time by around 30%.
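The roughly-30% figure comes straight from the three wall-clock totals measured above:

```python
# Wall-clock totals for 10,000 steps, in seconds (from the runs above)
t_tf14_cuda8 = 1558.0    # Tensorflow 1.4.1 + CUDA 8.0
t_tf15_cuda90 = 1106.0   # Tensorflow 1.5.0 + CUDA 9.0
t_tf15_cuda91 = 1046.0   # Tensorflow 1.5.0 + CUDA 9.1

# Fraction of training time saved relative to the Tensorflow 1.4.1 baseline
saving_90 = 1 - t_tf15_cuda90 / t_tf14_cuda8
saving_91 = 1 - t_tf15_cuda91 / t_tf14_cuda8

print("CUDA 9.0: %.0f%% less time" % (100 * saving_90))  # prints "CUDA 9.0: 29% less time"
print("CUDA 9.1: %.0f%% less time" % (100 * saving_91))  # prints "CUDA 9.1: 33% less time"
```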


3 Comments on Benchmark CIFAR10 on Tensorflow with CUDA 8 and cuDNN 6.0 vs CUDA 9 and cuDNN 7.0.5

  1. Would like to rerun the benchmark on my laptop, tf cifar10 seems to be different now on GitHub.
    There is no cifar10_train.py in the repo, but a cifar10_main.py.
    Tried that but the messages on the screen look quite different and it does not stop at step 10000.
    Please give me a hint how to run the benchmark with the current version on GitHub or a link to the version you used for your benchmark. May I ask for the parameters you used for the test as well?

    Thanks for your effort

    • The file is in the master branch at models/tutorials/image/cifar10/; please check there.

  2. Arun,

    thanks for your fast reply and sorry for asking a question which I would not have asked if I had read your blog carefully. I first had looked into models/official/resnet where is some more cifar10 stuff.

    Maybe I am overlooking something again – I could not find how exactly to apply the time.time() function to measure the runtime.

    What I did then:
    time python cifar10_train.py --max_steps=10000
    Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
    2018-02-14 20:20:53.543009: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA

    totalMemory: 3.95GiB freeMemory: 3.91GiB
    2018-02-14 20:20:04.744556: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Quadro M1200, pci bus id: 0000:01:00.0, compute capability: 5.0)
    2018-02-14 20:20:06.905543: step 0, loss = 4.66 (524.1 examples/sec; 0.244 sec/batch)
    2018-02-14 20:20:07.272666: step 10, loss = 4.61 (3486.6 examples/sec; 0.037 sec/batch)
    2018-02-14 20:20:07.590760: step 20, loss = 4.51 (4024.1 examples/sec; 0.032 sec/batch)

    2018-02-14 20:26:31.374743: step 9950, loss = 0.78 (3963.3 examples/sec; 0.032 sec/batch)
    2018-02-14 20:26:31.699994: step 9960, loss = 0.89 (3935.4 examples/sec; 0.033 sec/batch)
    2018-02-14 20:26:32.024433: step 9970, loss = 0.79 (3945.3 examples/sec; 0.032 sec/batch)
    2018-02-14 20:26:32.348809: step 9980, loss = 1.12 (3946.1 examples/sec; 0.032 sec/batch)
    2018-02-14 20:26:32.667940: step 9990, loss = 1.07 (4010.9 examples/sec; 0.032 sec/batch)

    real 5m41,373s
    user 16m13,551s
    sys 2m26,351s

    so on the Dell Precision 5520 with a Quadro M1200 GPU the benchmark needs about 342 sec. Python setup: conda tensorflow-gpu 1.5.0 cuDNN 7.0.5 CUDA 9.0
