Why you should use different backends when deploying deep learning models
Assessing the value of choosing the right backend for your DNN model
TensorFlow and PyTorch are arguably the most popular deep neural network (DNN) frameworks to date, and they remain the status quo for deploying models in production. Unfortunately, they are not always the best-performing frameworks for running inference with your DNN models. In fact, a plethora of other inference engines has been developed in recent years, e.g. OpenVINO, CUDA, TensorRT, etc., all trying to improve the inference speed of your models on various hardware platforms.
Although development along these lines is very welcome and great for the ML community, the caveat is that each of these frameworks performs better or worse depending on the model type and the hardware platform the model is executed on. This increases the complexity of deciding where and how to deploy your models. Which AWS instance should I use for my model? Under which backend should I perform inference? What is the cheapest deployment configuration for my model and use case? There are unfortunately no trivial answers to these questions, because the inference performance (latency and/or throughput) of your DNN model depends on the hardware and framework it is executed on.
Why you should care
If you choose the wrong framework-hardware fit for your model, you may be running it significantly slower than necessary. Choosing the right backend can sometimes speed up your model by 10x with absolutely no accuracy loss!
Speeding up your model's inference runtime directly translates to proportionally lower operational costs. If you can run your model 10x faster, then you can either serve 10x more inference requests at the same cost, or, equivalently, reduce your deployment cost by switching to a cheaper hardware platform (e.g. cheaper AWS EC2 instances) and/or reducing the number of processors running your model (e.g. if you were using 10 CPU instances, you now only require one).
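The proportionality between speed and cost can be sketched in a few lines. All numbers below (request rate, throughputs, price) are illustrative placeholders, not measured values or real AWS quotes:

```python
import math

def instances_needed(request_rate, throughput_per_instance):
    """Minimum number of instances to sustain a given request rate."""
    return math.ceil(request_rate / throughput_per_instance)

rate = 100.0          # incoming inference requests per second (assumed)
slow_tput = 10.0      # samples/s per instance on the slow backend (assumed)
fast_tput = 100.0     # samples/s after a 10x backend speed-up
price_per_hour = 0.5  # illustrative $/hr per instance, not a real quote

slow_cost = instances_needed(rate, slow_tput) * price_per_hour  # 10 instances
fast_cost = instances_needed(rate, fast_tput) * price_per_hour  # 1 instance
print(slow_cost / fast_cost)  # -> 10.0: a 10x speed-up cuts the bill 10x
```

The same calculation applies if you keep the fleet size fixed and serve 10x more traffic instead.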
Thus, if you do not want to waste resources and want to maximise the value of your model, choosing the right framework and/or hardware for running inference is of utmost importance.
The crux of the matter
If different frameworks/backends perform differently depending on the model type and the hardware, then how do I know which one to choose?
Simply put, you need to benchmark your model's inference speed for all those backends on the target hardware platform and select the one that executes your model the fastest.
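The core of such a benchmark is a simple timing loop. The sketch below assumes you have wrapped each backend's inference call in a plain Python function (the wrapper names are hypothetical); it warms up first, then reports mean latency and throughput for batch size 1:

```python
import time
import statistics

def benchmark(predict_fn, sample, warmup=10, runs=100):
    """Time a single-sample inference function; report latency and throughput."""
    for _ in range(warmup):          # warm up caches / lazy initialization
        predict_fn(sample)
    latencies = []
    for _ in range(runs):
        t0 = time.perf_counter()
        predict_fn(sample)
        latencies.append(time.perf_counter() - t0)
    mean_latency = statistics.mean(latencies)
    return {"latency_s": mean_latency, "throughput": 1.0 / mean_latency}

# Hypothetical per-backend wrappers you would supply yourself:
# backends = {"onnxruntime": ort_predict, "openvino": ov_predict}
# results = {name: benchmark(fn, sample) for name, fn in backends.items()}
# best = min(results, key=lambda name: results[name]["latency_s"])
```

This is exactly the kind of boilerplate DNN Bench (introduced below) automates across backends.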
Use cases & insights
Many of our clients want to deploy their DNN models for real-time request-response use cases. One example is analysing users' sentiment in real time as they chat with the customer service team. Another is performing real-time classification of images. For such use cases, the model runs inference with a batch size of 1, since a real-time response is desired for each incoming inference request.
To investigate this, we benchmarked:
- 8 different DNN models: BERT, Efficientnet-lite4, Fast-neural-style, Mobilenet-v2, Resnet50-v1, Shufflenet-v2, Super-res and YOLO-v4
- 7 different backends: TensorFlow, PyTorch, ONNX-Runtime, OpenVINO, Nuphar, TensorRT and CUDA
- 10 different AWS instances: c6g.medium, c6g.large, c6g.2xlarge, m5n.2xlarge, g4dn.xlarge, c6g.4xlarge, m5dn.2xlarge, c5a.4xlarge, c6g.8xlarge and c5a.8xlarge.
In these experiments we considered EC2 instances for simplicity, since the operational costs of performing inference can be calculated in a relatively straightforward manner, especially when comparisons across different hardware platforms are desired. However, note that there are other AWS services you may want to consider for deploying your models, for instance Lambda, where compute power can be scaled up on demand. We also considered only batch sizes of 1 in order to reflect the above-mentioned use case.
Here are some of the main takeaways we attained from analysing the results.
Takeaway 1: Different inference backends perform better/worse on different models.
In the plots on the left, we see the throughput (measured as the average number of samples processed per second) of different models when benchmarked on different backends.
You can see that the throughput and latency of a model can vary quite significantly across different backends (here, latency is the inverse of throughput, since we consider a batch size of 1 on a single instance). There is no consistency regarding which backend performs best across models.
In some cases, the difference between the worst-performing (slowest) and the best-performing (fastest) backend can be quite significant, sometimes as much as 10x (e.g. in the Resnet50 example on the left, the difference is about 7x).
This stresses the importance of benchmarking your DNN model on several backends on your target hardware platform before deploying it, if you want to avoid spending up to 7x more on resources than necessary.
Takeaway 2: Different inference backends perform better/worse on different hardware platforms.
On the left we show the throughput of the same BERT model as above, now benchmarked on a different EC2 AWS instance.
As you can see, the best-performing backend is now OpenMP, whereas in the example above it was OpenVINO. A backend's performance thus changes across hardware platforms. This again shows that you cannot rely on a single backend if you want to deploy your model in the most cost-optimal manner across different hardware platforms.
Takeaway 3: You can make better cost-optimal decisions regarding where and how to deploy your model if you analyse the benchmark results across different backends and hardware platforms all together.
Here we benchmarked the models' runtimes across different EC2 AWS instances and ordered the instances from cheapest $/hr usage to most expensive (left to right).
Such a plot can be quite informative when deciding where and how to deploy your DNN model. For example, suppose you want to use only one instance for deploying your model (perhaps for simplicity); then choosing the cheapest deployment option becomes trivial. You simply draw a horizontal line at the average throughput and/or latency your specific use case requires and choose the cheapest instance that crosses that line, i.e. the first one from the left.
In the example above, we can see that for the Resnet50 model the cheapest deployment option is the c5a.large instance if the use case requires an average throughput of 10 samples/second. However, if it requires 100 samples/second, then the cheapest option is the g4dn.xlarge GPU instance.
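The "horizontal line" rule is easy to mechanize once you have the benchmark numbers. The table below uses illustrative placeholder prices and throughputs (not our measured results), but the selection logic is the general one:

```python
# Hypothetical benchmark table: instance -> ($/hr, samples/s for the model).
# All figures are illustrative placeholders, not real quotes or measurements.
instances = {
    "c6g.medium":  (0.034, 4.0),
    "c5a.large":   (0.077, 15.0),
    "g4dn.xlarge": (0.526, 300.0),
}

def cheapest(required_tput):
    """Cheapest single instance whose throughput meets the requirement."""
    feasible = [(price, name) for name, (price, tput) in instances.items()
                if tput >= required_tput]
    return min(feasible)[1] if feasible else None  # lowest price wins

print(cheapest(10))   # -> c5a.large   (with these placeholder numbers)
print(cheapest(100))  # -> g4dn.xlarge (only the GPU meets the bar)
```

Replacing the placeholder table with your own benchmark results reproduces the graphical procedure described above.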
Things get a little more interesting if you consider scaling out your model (i.e. using multiples of the same instance) in order to meet the throughput requirements. In that case, the question becomes: should I use several cheap instances or a few expensive ones?
A plot like the one shown below can guide this decision: simply plot the throughput of a stack of instances in the same graph and compare the options as discussed previously.
If your use case requires BERT to run at 50 samples/second, it is still 4x cheaper to book a single g4dn.xlarge GPU instance than to scale out to 28 c5a.large CPU instances, although both meet the requirement of processing 50 samples/second. However, if you instead need to handle, say, 2 inference requests per second, then deploying BERT on 3 c6g.medium CPU instances becomes 5x cheaper than deploying it on a single g4dn.xlarge GPU instance.
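The scale-out comparison is a small ceiling-division exercise. The sketch below loosely mirrors the low-throughput BERT case (the c6g.medium latency of 1.43s comes from the text; the prices and the GPU throughput are assumed placeholders):

```python
import math

def fleet_cost(price_per_hr, tput_per_instance, required_tput):
    """Instances needed and hourly cost to meet a throughput target by scaling out."""
    n = math.ceil(required_tput / tput_per_instance)
    return n, round(n * price_per_hr, 3)

# ($/hr, samples/s) pairs; prices and GPU throughput are assumptions:
c6g_medium = (0.034, 1 / 1.43)   # ~0.7 samples/s per instance (from the text)
g4dn_xlarge = (0.526, 60.0)      # assumed GPU throughput

print(fleet_cost(*c6g_medium, 2))   # -> (3, 0.102): 3 cheap CPUs
print(fleet_cost(*g4dn_xlarge, 2))  # -> (1, 0.526): ~5x more per hour
```

For a 50 samples/second target the same function flips the verdict in favour of the GPU, matching the pattern described above.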
A note on latency: keep in mind that scaling out to multiple cheap instances does not reduce latency! Again, since we consider batch sizes of 1, the latency is the inverse of a single instance's throughput (seconds/sample). In the example above, the latency of BERT on a single c6g.medium CPU is 1.43s. Choosing 3 of those instances to run BERT in parallel means you can handle 2 user requests per second, but each user still has to wait more than a second for their response. It is up to your use case requirements whether this is acceptable. The inference latency of a DNN model can only be reduced by scaling vertically, i.e. by choosing a more powerful and more expensive instance, and/or by speeding up the model itself, for example by choosing a better backend.
These insights show that you can make better cost-optimal decisions regarding where and how to deploy your model, if you benchmark across different backends and hardware platforms before deploying it.
However, installing and developing benchmarking code for all those different backends can be quite tedious and mundane, and not everyone is willing to dedicate time to it. But as shown above, it pays off, since significant cost savings can be attained (in the Resnet50 example above, the difference between the worst- and best-performing backend was about 7x, and in some other cases we saw even 10x). We stress that these gains are entirely up for grabs, and converting a model usually takes less than a second! So don't miss out on this free lunch!
We have built DNN Bench to help solve this issue. We wanted to build something that lets you benchmark your DNN model against several different backends without having to worry about building the entire benchmarking infrastructure.
DNN Bench is an open-source library that lets you benchmark the inference runtime (latency and throughput) of DNN models on several different backends by simply running the command
./bench_model.sh path_to_model --device=cpu --tf --onnxruntime \
--openvino --pytorch --nuphar
The example command above specifies that the runtime is measured on a CPU and tested for the frameworks TensorFlow, ONNX-Runtime, OpenVINO, PyTorch and Nuphar. When the tests finish, multiple JSON files containing the respective results are created. You can use the provided visualization script, or build your own, to view the results.
python vis/plot_results.py results_dir plots_dir
You can then select the best-performing (fastest) backend as the deployment option for your model. Note that if you want to make your model production-ready for real-time request-response use cases, you should then build a microservice based on the docker image that performs best for your use case.
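If you prefer to pick the winner programmatically rather than visually, a few lines suffice. Note the JSON schema below is an assumption for illustration (one file per backend, with a list of per-run latencies under a "latencies" key); the exact format DNN Bench emits may differ:

```python
import json
from pathlib import Path

def best_backend(results_dir):
    """Return the backend (file stem) with the lowest mean latency.

    Assumes one JSON file per backend of the form
    {"latencies": [0.012, 0.011, ...]} -- a hypothetical schema.
    """
    scores = {}
    for f in Path(results_dir).glob("*.json"):
        runs = json.loads(f.read_text())
        scores[f.stem] = sum(runs["latencies"]) / len(runs["latencies"])
    return min(scores, key=scores.get)
```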
*** Please leave a comment if you are interested in knowing more about how you could effortlessly convert your data-processing pipeline and your model from one framework to another, and/or how to build a microservice around it. We will publish a new blog post discussing this topic in more depth.
The main takeaways of this article are:
- Increasing the inference runtime speed of your model directly translates to reducing the operational costs involved in performing inference.
- The inference performance (latency and throughput) of your DNN model depends on the model's architecture and on the framework/backend and hardware platform it is executed on.
- Benchmarking your model across different backends and, if possible, different hardware platforms, can be very helpful in guiding you to make the most cost-optimal deployment decision (where and how should I deploy it?) for your DNN model and use-case.
- DNN Bench is an open-source tool designed for this use case, namely, it aims to make benchmarking of DNNs across different backends and hardware platforms as simple as possible, so that everyone can squeeze out the best of their models for their particular use-cases.
- Our experiments show that GPUs become increasingly more cost-efficient as the throughput requirements increase. However, for lower throughputs, scaling out the model with cheaper instances may be a cheaper option than using more powerful but expensive instances, such as GPUs.
You can see the full benchmark results we attained from our experiments here.