Why you should use different backends when deploying deep learning models

Assessing the value of choosing the right backend for your DNN model


Intro

Although this rapid development of deep learning frameworks and inference backends is very welcome and great for the ML community, the caveat is that each of these frameworks performs better or worse depending on the model type and the hardware platform the model is executed on. This increases the complexity of deciding where and how to deploy your models. Which AWS instance should I use for my model? With which backend should I perform inference? What is the cheapest deployment configuration for my model and use case? Unfortunately, there are no trivial answers to these questions, especially since the inference performance (latency and/or throughput) of your DNN model depends on the hardware and framework it is executed on.

Why you should care

Speeding up your model's inference runtime directly translates into proportionally lower operational costs. If you can run your model 10x faster, then you can either serve 10x more inference requests at the same cost or, equivalently, reduce your deployment cost by switching to a cheaper hardware platform (e.g. cheaper AWS EC2 instances) and/or by reducing the number of processors running your model (e.g., if you were using 10 CPU instances, you now only require one).
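
As a back-of-the-envelope illustration of this point, the cost of serving a fixed number of requests on a single instance scales inversely with throughput. The sketch below uses made-up placeholder prices and throughputs, not actual AWS quotes or benchmark results.

# Back-of-the-envelope estimate of the cost of serving 1M requests on one instance.
# Prices and throughputs are illustrative placeholders only.

def cost_per_million(price_per_hour_usd, throughput_sps):
    # Cost (USD) to serve 1,000,000 requests at the given samples/second.
    seconds_needed = 1_000_000 / throughput_sps
    return price_per_hour_usd * seconds_needed / 3600

baseline = cost_per_million(price_per_hour_usd=0.68, throughput_sps=10)   # slower backend
faster   = cost_per_million(price_per_hour_usd=0.68, throughput_sps=100)  # 10x faster backend
print(f"baseline: ${baseline:.2f} per 1M requests, 10x faster: ${faster:.2f}")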

Thus, if you do not want to spend unnecessary resources and want to maximise the value of your model, choosing the right framework and/or hardware for running inference can be of utmost importance.

The crux of the matter

Simply put, you need to benchmark your model's inference speed for all of those backends on the target hardware platform and select the one that executes your model the fastest.
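
A minimal benchmarking loop could look like the sketch below. It assumes you have wrapped each backend's single-sample inference call in a Python callable; the wrapper names in the comments are hypothetical.

import time
import statistics

def benchmark(predict_fn, sample, warmup=10, iters=100):
    # Warm up to avoid measuring one-off costs (lazy initialization, JIT compilation, caches).
    for _ in range(warmup):
        predict_fn(sample)
    timings = []
    for _ in range(iters):
        start = time.perf_counter()
        predict_fn(sample)
        timings.append(time.perf_counter() - start)
    latency = statistics.median(timings)      # seconds per sample
    return latency, 1.0 / latency             # (latency, throughput in samples/second)

# backends = {"onnxruntime": ort_predict, "openvino": ov_predict}   # your own single-sample wrappers
# results = {name: benchmark(fn, sample) for name, fn in backends.items()}
# best = min(results, key=lambda name: results[name][0])            # backend with the lowest latency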

Use cases & insights

We benchmarked:

  • 8 different DNN models: BERT, Efficientnet-lite4, Fast-neural-style, Mobilenet-v2, Resnet50-v1, Shufflenet-v2, Super-res and YOLO-v4,
  • using 7 different backends: TensorFlow, PyTorch, ONNX-Runtime, OpenVINO, Nuphar, TensorRT and CUDA,
  • across 10 different AWS instances: c6g.medium, c6g.large, c6g.2xlarge, m5n.2xlarge, g4dn.xlarge, c6g.4xlarge, m5dn.2xlarge, c5a.4xlarge, c6g.8xlarge and c5a.8xlarge.

In these experiments we considered EC2 instances for simplicity, since the operational costs of performing inference can be calculated in a relatively straightforward manner, especially when comparisons across different hardware platforms are desired. However, we want to mention that there are other AWS services you may want to consider for deploying your models, for instance Lambda, where compute power can be scaled up on demand. We also considered only a batch size of 1, in order to reflect a real-time request-response use case.

Here are some of the main takeaways we attained from analysing the results.

Takeaway 1: Different inference backends perform better/worse on different models.

Average throughput of BERT benchmarked on a c5a.4xlarge EC2 AWS instance for several different backends. Batch size was equal to 1. Latency is the inverse of the throughput (seconds/sample).

In these plots, we see the throughput (measured as the average number of samples processed per second) of different models when benchmarked with different backends.

Average throughput of Resnet50 benchmarked on a c5a.4xlarge EC2 AWS instance for several different backends. Batch size was equal to 1. Latency is the inverse of the throughput (seconds/sample).

You can see that the throughput and latency of a model can vary quite significantly across different backends (the latency is the inverse of the throughput in this case, since we consider a batch size of 1 and only a single instance). There is no consistency regarding which backend performs best across models.

In some cases, the difference between the worst performing (slowest) backend and the best performing (fastest) one can be quite significant, sometimes as large as 10x (e.g., in the Resnet50 example the difference is about 7x).

This stresses the importance of benchmarking your DNN model on several backends on your target hardware platform before deploying it, if you want to avoid spending up to 7x more on resources than necessary.

Takeaway 2: Different inference backends perform better/worse on different hardware platforms.

Average throughput of BERT benchmarked on a c5n.2xlarge EC2 AWS instance for several different backends. Batch size was equal to 1. Latency is the inverse of the throughput (seconds/sample).

Here we show the throughput of the same BERT model shown above, benchmarked on a different EC2 AWS instance.

As you can see, the best performing backend is now OpenMP, whereas in the example above it was OpenVINO. A backend's relative performance thus changes across hardware platforms. This shows again that you cannot rely on a single backend if you want to deploy your model in the most cost-optimal manner across different hardware platforms.

Takeaway 3: You can make more cost-optimal decisions about where and how to deploy your model if you analyse the benchmark results across different backends and hardware platforms together.

Average throughput of Resnet50 benchmarked on different EC2 AWS instances. Batch size was equal to 1. The throughput of the fastest backend on each instance is displayed. The instances are ordered from cheapest (left) to most expensive (right); the price in USD/h is displayed at the top. Latency is the inverse of the throughput (seconds/sample).

Here we benchmarked the model's runtime across different EC2 AWS instances and ordered the instances from the cheapest hourly price to the most expensive (left to right).

Such a plot can be quite informative and helpful when deciding where and how to deploy your DNN model. For example, suppose you want to use only one instance to deploy your model (perhaps for simplicity); then choosing the cheapest deployment option becomes pretty trivial. You simply draw a horizontal line at the average throughput and/or latency value that your specific use case requires and choose the cheapest instance whose bar crosses the line, i.e. the first one from the left.

In the example above, we can see that for the Resnet50 model the cheapest deployment option is the c5a.large instance if the use case requires an average throughput of 10 samples/second. However, if it requires 100 samples/second, then the cheapest option is the g4dn.xlarge GPU instance.
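
A minimal sketch of that selection rule, assuming you have already collected the best-backend throughput for each instance (the prices and throughputs below are placeholders, not our measured results):

instances = {
    # name: (price in USD/h, throughput of the fastest backend in samples/s) -- placeholder values
    "c6g.medium":  (0.034, 6.0),
    "c5a.large":   (0.077, 25.0),
    "g4dn.xlarge": (0.526, 300.0),
}

def cheapest_meeting(required_sps):
    # Keep only the instances whose bar crosses the horizontal line, then take the cheapest one.
    candidates = [(price, name) for name, (price, sps) in instances.items() if sps >= required_sps]
    return min(candidates)[1] if candidates else None

print(cheapest_meeting(10))   # -> "c5a.large"
print(cheapest_meeting(100))  # -> "g4dn.xlarge"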

Things get a little more interesting if you consider scaling out your model (i.e. using multiples of the same instance) in order to meet the throughput requirements. In that case one can ask: should I use many cheap instances or a few expensive ones?

A plot like the one shown below can guide you in making a decision, since you could simply plot the throughput of a stack of instances in the same graph and compare the options as discussed previously.

Average throughput of BERT benchmarked on different EC2 AWS instances. Batch size was equal to 1. The throughput of the fastest backend on each instance is displayed. Here we also included the throughput and price when using 3 c6g.medium instances and 28 c5a.large instances for comparison.

If your use case requires BERT to run at 50 samples/second, then it is still 4x cheaper to book a single g4dn.xlarge GPU instance than to scale out to 28 c5a.large CPU instances, although both meet the throughput requirement of processing 50 samples/second. However, if your use case only requires handling, say, 2 inference requests per second, then deploying BERT on 3 c6g.medium CPU instances becomes a 5x cheaper option than deploying it on a single g4dn.xlarge GPU instance.
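
The comparison boils down to simple arithmetic: round the required throughput up to a whole number of instances and multiply by the hourly price. A sketch, with placeholder per-instance throughputs and prices chosen to roughly mirror the BERT example above:

import math

def scale_out_cost(required_sps, per_instance_sps, price_usd_per_hour):
    # Number of identical instances needed to reach the required throughput, and their total hourly cost.
    n_instances = math.ceil(required_sps / per_instance_sps)
    return n_instances, n_instances * price_usd_per_hour

print(scale_out_cost(50, per_instance_sps=1.8, price_usd_per_hour=0.077))   # ~28 CPU instances
print(scale_out_cost(50, per_instance_sps=90.0, price_usd_per_hour=0.526))  # 1 GPU instance, ~4x cheaper
print(scale_out_cost(2, per_instance_sps=0.7, price_usd_per_hour=0.034))    # 3 small CPU instances, ~5x cheaper than the GPU
print(scale_out_cost(2, per_instance_sps=90.0, price_usd_per_hour=0.526))   # still 1 GPU instance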

A note on latency: keep in mind that scaling out to multiple cheap instances does not reduce the latency! Again, since we consider a batch size of 1, the latency is the inverse of the throughput of a single instance (seconds/sample). In the example above, the latency of BERT on a single c6g.medium CPU is 1.43 s. Thus, choosing 3 of those instances to run BERT in parallel means that you can handle 2 user requests per second, but each user will still have to wait more than a second for their response. It is up to your use case requirements whether this is acceptable or not. The inference latency of a DNN model can only be reduced by scaling vertically, that is, by choosing a more powerful and more expensive instance, and/or by speeding up the model itself, for example by choosing a better backend.
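
In numbers, using the single-instance figure from the example above (1.43 s per request on a c6g.medium-style instance):

per_instance_sps = 1 / 1.43                        # ~0.7 samples/s on a single small instance
n_instances = 3
aggregate_sps = n_instances * per_instance_sps     # ~2.1 requests/s handled by the fleet
latency_s = 1 / per_instance_sps                   # still 1.43 s per request, no matter how many instances
print(f"aggregate throughput: {aggregate_sps:.1f} samples/s, per-request latency: {latency_s:.2f} s")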

These insights show that you can make more cost-optimal decisions about where and how to deploy your model if you benchmark it across different backends and hardware platforms before deployment.

However, installing and developing the benchmarking code for all those different backends can be quite tedious and mundane, and not everyone is willing to dedicate time to it. But as shown above, it pays off, since significant cost savings can be attained (in the Resnet50 example above, the difference between the worst and best performing backend was about 7x, and in some other cases we saw even 10x). We stress that these gains are entirely up for grabs and that converting the models usually takes less than a second! So don't miss out on this free lunch!

DNN Bench

DNN Bench is an open-source library that lets you benchmark the inference runtime (latency and throughput) of DNN models on several different backends by simply running the command

./bench_model.sh path_to_model --device=cpu --tf --onnxruntime \
--openvino --pytorch --nuphar

The example command above specifies that the runtime is measured on a CPU and tested with the frameworks TensorFlow, ONNX-Runtime, OpenVINO, PyTorch and Nuphar. When the tests finish, multiple JSON files containing the respective results are created. You can use the provided visualization script, or build your own, to view the results:
python vis/plot_results.py results_dir plots_dir
You can then select the best performing (fastest) backend as the deployment option for your model. Note that if you want to make your model production-ready for real-time request-response use cases, you should build a microservice based on the Docker image that performs best for your use case.
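
For example, a minimal sketch of that selection step, assuming each result file reports a backend name and a throughput figure; the actual field names in DNN Bench's JSON output may differ, so adapt the keys to what you find in results_dir:

import json
from pathlib import Path

def fastest_backend(results_dir):
    # Scan the benchmark result files and return the backend with the highest throughput.
    best_name, best_sps = None, 0.0
    for path in Path(results_dir).glob("*.json"):
        result = json.loads(path.read_text())
        name = result.get("backend", path.stem)       # assumed field name
        sps = float(result.get("throughput", 0.0))    # assumed field name (samples/second)
        if sps > best_sps:
            best_name, best_sps = name, sps
    return best_name, best_sps

print(fastest_backend("results_dir"))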

*** Please leave a comment if you are interested in knowing more about how you could effortlessly convert your data-processing pipeline and your model from one framework to another, and/or how to build a microservice around it. We will be publishing a new blog post where we will be discussing this topic in more depth.

Conclusion

  • Increasing the inference runtime speed of your model directly translates to reducing the operational costs involved in performing inference.
  • The inference performance (latency and throughput) of your DNN model depends on the model's architecture and on the framework/backend and hardware platform it is executed on.
  • Benchmarking your model across different backends and, if possible, different hardware platforms, can be very helpful in guiding you to make the most cost-optimal deployment decision (where and how should I deploy it?) for your DNN model and use-case.
  • DNN Bench is an open-source tool designed for this use case: it aims to make benchmarking of DNNs across different backends and hardware platforms as simple as possible, so that everyone can squeeze the best out of their models for their particular use cases.
  • Our experiments show that GPUs become increasingly more cost-efficient as the throughput requirements increase. However, for lower throughputs, scaling out the model with cheaper instances may be a cheaper option than using more powerful but expensive instances, such as GPUs.

You can see the full benchmark results we attained from our experiments here.

ToriML is a COSS company building tools for developers and enterprises for inference acceleration and model deployment of ML algorithms.