Why you should use different backends when deploying deep learning models

Assessing the value of choosing the right backend for your DNN model


Intro

Why you should care

The crux of the matter

Use cases & insights

Takeaway 1: Different inference backends perform better/worse on different models.

Average throughput of BERT benchmarked on a c5a.4xlarge AWS EC2 instance for several backends, at batch size 1. Latency is the inverse of throughput (seconds/sample).
Average throughput of ResNet50 benchmarked on a c5a.4xlarge AWS EC2 instance for several backends, at batch size 1. Latency is the inverse of throughput (seconds/sample).
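The throughput and latency numbers in these charts come down to a simple timing loop; a minimal sketch, with a dummy `predict` callable standing in for a real backend's inference call (the actual benchmark harness is more involved):

```python
import time

def measure(predict, sample, warmup=10, runs=100):
    """Time repeated single-sample inference and report
    throughput (samples/s) and latency (s/sample)."""
    for _ in range(warmup):      # warm up caches / lazy initialization
        predict(sample)
    start = time.perf_counter()
    for _ in range(runs):
        predict(sample)
    elapsed = time.perf_counter() - start
    throughput = runs / elapsed
    latency = elapsed / runs     # inverse of throughput at batch size 1
    return throughput, latency

# Dummy stand-in for a real backend call (e.g. an ONNX Runtime session run)
throughput, latency = measure(lambda x: sum(x), list(range(1000)))
```

The same loop is run once per backend on the same model and hardware, which is what makes the per-backend bars directly comparable.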

Takeaway 2: Different inference backends perform better/worse on different hardware platforms.

Average throughput of BERT benchmarked on a c5n.2xlarge AWS EC2 instance for several backends, at batch size 1. Latency is the inverse of throughput (seconds/sample).

Takeaway 3: By analysing benchmark results across backends and hardware platforms together, you can make better cost-optimal decisions about where and how to deploy your model.

Average throughput of ResNet50 benchmarked on different AWS EC2 instances, at batch size 1. For each instance, the throughput of its fastest backend is shown. Instances are ordered from cheapest (left) to most expensive (right), with the price in USD/h displayed at the top. Latency is the inverse of throughput (seconds/sample).
Average throughput of BERT benchmarked on different AWS EC2 instances, at batch size 1. For each instance, the throughput of its fastest backend is shown. For comparison, we also include the combined throughput and price of three c6g.medium instances and of 28 c5a.large instances.
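One way to combine throughput and instance price into a single cost metric, as this takeaway suggests, is samples processed per dollar. A sketch with illustrative numbers (not taken from the benchmarks above):

```python
# Hypothetical per-instance results: (throughput in samples/s, price in USD/h).
# The numbers are illustrative only, not measured results.
instances = {
    "c6g.medium":  (35.0, 0.034),
    "c5a.large":   (60.0, 0.077),
    "c5a.4xlarge": (240.0, 0.616),
}

def samples_per_dollar(throughput, usd_per_hour):
    """Samples processed for one dollar of instance time."""
    return throughput * 3600 / usd_per_hour

# Rank instances by cost efficiency, best first
ranked = sorted(instances.items(),
                key=lambda kv: samples_per_dollar(*kv[1]),
                reverse=True)
for name, (tp, price) in ranked:
    print(f"{name}: {samples_per_dollar(tp, price):,.0f} samples/$")
```

With this metric, a cheap instance with modest throughput can beat a fast but expensive one, which is exactly why scaling out many small instances (as in the BERT comparison above) can be attractive.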

DNN Bench

# Benchmark a model on CPU across five inference backends
./bench_model.sh path_to_model --device=cpu --tf --onnxruntime \
--openvino --pytorch --nuphar
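DNN Bench produces one result per requested backend; once the results are collected, picking the winner for a model is a one-liner. A sketch assuming the results have been loaded into a plain dict of backend name to average throughput (the tool's actual output format may differ, and the numbers below are made up):

```python
# Hypothetical benchmark output: backend -> average throughput (samples/s).
# Illustrative values only; real results depend on model and hardware.
results = {
    "tf": 18.2,
    "onnxruntime": 31.5,
    "openvino": 42.7,
    "pytorch": 20.1,
    "nuphar": 28.9,
}

# Fastest backend for this model on this instance
best_backend = max(results, key=results.get)
print(best_backend)
```

Repeating this per model and per instance type yields exactly the "fastest backend on each instance" view used in the charts above.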

Conclusion

ToriML is a commercial open-source (COSS) company that builds inference-acceleration and model-deployment tools for developers and enterprises.