HW/SW Inference-time Optimizations for Reliable Embedded Convolutional Neural Networks


Enrico Macii, Massimo Poncino, Andrea Calimera, Daniele Jahier Pagliari, Roberto Giorgio Rizzo, Valentino Peluso and Luca Mocerino


Institution(s)

Politecnico di Torino

Presentation type

Technical presentation

Abstract

Embedded Convolutional Neural Networks (ConvNets) are driving the evolution of ubiquitous systems that can sense and understand the environment autonomously. Through algorithmic optimizations such as topology scaling, pruning, and quantization, they can be shrunk to fit the embedded CPUs of portable end-nodes, like smartphones or drones, reducing resource usage with a negligible accuracy drop. Nonetheless, ConvNets still suffer quality and performance penalties when deployed in real-life scenarios, where input patterns collected in harsh environments can mislead the model. Moreover, building systems capable of sustaining continuous inference over long periods remains a primary concern, since the limited Thermal Design Power (TDP) of general-purpose embedded CPUs prevents execution at maximum speed.

This presentation proposes inference-time optimizations at both the HW and SW level to manage the accuracy-performance tradeoff of embedded ConvNets in real-life applications.

We first introduce AdapTTA, an adaptive implementation of the Test-Time Augmentation (TTA) strategy that tackles on-the-field accuracy drops by feeding the model several altered versions of the same input and computing the final outcome as a consensus of the aggregated partial predictions. AdapTTA reduces the cost of these redundant runs through a mechanism that dynamically controls the number of augmented inferences depending on the complexity of the input. We then present Topology Voltage Frequency Scaling (TVFS), a performance management technique for reliable and predictable embedded multi-inference tasks operating on a continuous stream of data under latency and power constraints.
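The adaptive TTA mechanism can be illustrated with a minimal sketch (not the authors' implementation; `model`, the augmentation list, and the confidence threshold are placeholder assumptions):

```python
import numpy as np

def adaptive_tta(model, image, augmentations, conf_threshold=0.9):
    """Adaptive Test-Time Augmentation (illustrative sketch).

    Runs the model on progressively more augmented copies of the input
    and stops early once the aggregated prediction is confident enough,
    skipping the remaining redundant inferences for easy inputs.
    """
    agg = None
    used = 0
    for augment in augmentations:
        probs = model(augment(image))   # softmax scores, shape (n_classes,)
        agg = probs if agg is None else agg + probs
        used += 1
        consensus = agg / used          # consensus: mean of partial predictions
        if consensus.max() >= conf_threshold:
            break                       # confident enough: stop early
    return int(np.argmax(consensus)), used
```

An easy input whose first prediction already exceeds the threshold costs a single inference, while an ambiguous one consumes the full augmentation budget, which is the source of AdapTTA's frame-rate gain over static TTA.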

Experimental results on ConvNets for image classification deployed onto commercial ARM Cortex-A CPUs demonstrate that AdapTTA reaches higher frame rates at the same accuracy gain as existing static TTA strategies, whereas TVFS sustains fast and continuous inference (up to 2000), ensuring limited accuracy loss and a better power profile (i.e., temperature below the critical on-chip threshold).
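The voltage-frequency side of such a policy can be sketched as follows (a hypothetical illustration: the operating-point table, cycle count, and caps are made-up values, and the full TVFS technique also scales the network topology, which is omitted here):

```python
# Hypothetical operating points for an embedded CPU: (freq_MHz, power_W).
OPERATING_POINTS = [(600, 0.4), (1000, 0.9), (1400, 1.6), (1800, 2.5)]

def select_dvfs_point(latency_budget_ms, cycles_per_inference, power_cap_w):
    """Pick the lowest-power V/F point that still meets the latency budget.

    Among the points under the power cap, choose the one whose predicted
    inference latency fits the deadline with minimum power draw, keeping
    temperature below the critical threshold during continuous inference.
    """
    best = None
    for freq_mhz, power_w in OPERATING_POINTS:
        # 1 MHz = 1e3 cycles per millisecond.
        latency_ms = cycles_per_inference / (freq_mhz * 1e3)
        if power_w <= power_cap_w and latency_ms <= latency_budget_ms:
            if best is None or power_w < best[1]:
                best = (freq_mhz, power_w)
    return best  # None if no operating point is feasible
```

Tightening the latency budget pushes the selection toward faster, hotter points, which is why coupling the V/F choice with topology scaling helps keep long-running inference streams within the TDP.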


Additional material

  • Presentation slides: [pdf]
