



# An Open-Source Overlay for Reconfigurable, Accelerator-Rich Embedded Systems

Gianluca Bellocchi, Alessandro Capotondi, and Andrea Marongiu

**IWES 2021** 

**University of Modena and Reggio Emilia**, *<name>.<surname>@unimore.it* 

Fondo di Ateneo per la Ricerca FAR2020



### Introduction



## **Accelerator-Rich Paradigm**

#### What has to be simplified?

#### System-Level Design

- Build and evaluate accelerator-rich systems
  - Expensive
  - Time-consuming

#### Design Space Exploration (DSE)

- Key effects only manifest at system-level
- User knobs:
  - System optimization
  - Accelerator optimization

## **Recent Contribution**

- A first proposal to simplify the deployment of hardware accelerators
  - Design methodology
    - Overlay-based
    - Plug-and-play integration of HW accelerators
  - Experimental results
    - Resource cost (LUT, FF, BRAM, DSP)
    - Application profiling
    - Comparison with Xilinx HLS flow

G. Bellocchi, A. Capotondi, F. Conti and A. Marongiu, *A RISC-V-based FPGA Overlay to Simplify Embedded Accelerator Deployment*, 24th Euromicro Conference on Digital System Design (2021)

# **Starting Point**

#### PULP architecture

- PULP stands for «Parallel Ultra Low Power»
- Open and Scalable HW/SW research and development platform
- Cluster-based architecture
- RISC-V ISA compliant





Website: pulp-platform.org

#### HERO

A. Kurth, A. Capotondi, P. Vogel, L. Benini and A. Marongiu, *HERO: An open-source research platform for HW/SW exploration of heterogeneous manycore systems* (2018)

- FPGA emulation of heterogeneous and massively parallel PULP systems
- Instantiable with COTS FPGA-based heterogeneous SoCs




## **Overlay Architecture**

#### What is it?

- Hardware abstraction layer
- Overlays the original FPGA fabric
  - Hides hardware details

#### Features:

- Coarse-grained
  - Rapid swapping of architectural blocks
- Avoids the FPGA design flow
  - Improved design productivity
- Programmable via standard APIs for heterogeneous compute platforms

# **Accelerator Integration Methodology**

#### Streamer

Specialized DMA controller that transforms streams into memory accesses

#### Controller

- Register file to host runtime parameters
- Control FSM for coarse-grained control/(re)-configuration
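
The division of labour between the two wrapper blocks can be illustrated with a small behavioural model. This is only a sketch under assumptions: the class names, the register names (`src_base`, `dst_base`, `length`), the three-state FSM and the 4-byte word size are hypothetical and are not taken from the actual PULP wrapper.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class State(Enum):
    """Coarse-grained control states (hypothetical set)."""
    IDLE = auto()
    RUNNING = auto()
    DONE = auto()


@dataclass
class Streamer:
    """Turns a stream request into a list of word-aligned memory addresses."""
    word_bytes: int = 4  # assumed word size

    def addresses(self, base, length, stride=1):
        # One memory access per stream element.
        return [base + i * stride * self.word_bytes for i in range(length)]


@dataclass
class Controller:
    """Register file for runtime parameters plus a minimal control FSM."""
    regs: dict = field(default_factory=dict)
    state: State = State.IDLE

    def write_reg(self, name, value):
        # (Re)configuration is only allowed while the accelerator is not running.
        if self.state is State.RUNNING:
            raise RuntimeError("cannot reconfigure while running")
        self.regs[name] = value

    def start(self):
        self.state = State.RUNNING

    def finish(self):
        self.state = State.DONE


def run_job(ctrl, streamer, memory, kernel):
    """Stream inputs from memory, apply the kernel, stream results back."""
    ctrl.start()
    src = streamer.addresses(ctrl.regs["src_base"], ctrl.regs["length"])
    dst = streamer.addresses(ctrl.regs["dst_base"], ctrl.regs["length"])
    for s, d in zip(src, dst):
        memory[d] = kernel(memory[s])
    ctrl.finish()
    return ctrl.state
```

In this model the datapath (`kernel`) only ever sees stream elements, while addressing and job parameters stay in the streamer and the controller.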




#### **Open Questions**

#### Choosing a proper way to interconnect accelerators is a primary requirement

1. Which type of interconnect topology best fits our needs?
2. What about the clustering level?
3. How do the accelerators work together?
   - Parallel vs. sequential execution

#### How?

- The accelerator wrapper toolchain is a good starting point!
  - Goal: new functionality to support the generation of multiple overlay configurations
- Optimization knobs
  - <u>System-level</u>
    - Memory hierarchy, control cores, DMA, etc.
    - Accelerator interconnections
    - Accelerator scheduling
  - <u>Accelerator-level</u>
    - Data port parallelism, local buffers, datapath pipelining, loop unrolling, etc.
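
One way to picture the generation of multiple overlay configurations is as a Cartesian product over the knobs listed above. The sketch below is purely illustrative: the field names, the candidate values and the `OverlayConfig` type are assumptions and do not describe the actual toolchain interface.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class OverlayConfig:
    """One point in the design space (hypothetical knob set)."""
    n_clusters: int          # system-level: clustering level
    accels_per_cluster: int  # system-level: accelerators per cluster
    interconnect: str        # system-level: interconnect flavour
    data_ports: int          # accelerator-level: data port parallelism
    unroll: int              # accelerator-level: loop unrolling factor


def enumerate_configs(knobs):
    """Return every combination of the candidate knob values."""
    names = list(knobs)
    return [
        OverlayConfig(**dict(zip(names, values)))
        for values in product(*(knobs[n] for n in names))
    ]


# Candidate values are placeholders chosen for illustration only.
KNOBS = {
    "n_clusters": [1, 2],
    "accels_per_cluster": [1, 2, 4],
    "interconnect": ["cluster", "multi_cluster", "heterogeneous"],
    "data_ports": [1, 2],
    "unroll": [1, 4],
}

configs = enumerate_configs(KNOBS)
```

Even this small knob set yields 72 configurations, which is why key effects that only manifest at system level make exhaustive evaluation expensive.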

# Accelerator design flow

*Figure: accelerator design flow (labels: HLS-compiled accelerators, hand-crafted accelerator, PULP accelerators, ...).*







#### **#1 – Cluster Interconnection**




#### **#2 – Multi-Cluster Interconnection**



#### **#3 – Heterogeneous Interconnection**




Overlay library

Accelerator Library









#### **Use Cases ~ Particle Filter**







To appear in DATE 2022






#### Latency breakdown of Particle Filter algorithm




#### **Use Cases ~ C4D**







### Conclusions

1. Innovative methodologies to simplify accelerator-rich deployment are crucial!
   - Choosing a proper way to interconnect accelerators is a primary requirement
     - System design
     - Design space exploration
2. Overlay-based solution
   - Proxy core → simplified and less expensive control!
   - Overlay cost: ~20% LUT usage on Xilinx ZU9EG MPSoC
   - Latency comparable to the Xilinx Vivado HLS methodology, and up to 4.08x speedup compared to the ARM host core

# Future Work (A)

- Tightly-Coupled Bandwidth Monitoring and Regulation for Accelerator-Rich Architectures
  - How to achieve accurate control of task activities in accelerator-rich architectures?
    - Control of main memory bandwidth usage in an FPGA-based heterogeneous SoC
    - Integration of Runtime Bandwidth Regulator (RBR) in overlay-based
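
To make the idea of bandwidth regulation concrete, the sketch below models a generic budget-per-period regulator: each accelerator may move a fixed number of bytes per regulation period, and accesses beyond the budget are stalled until the next period. This is a common regulation scheme used here as an assumption; it is not a description of the RBR design, and all names and parameters are hypothetical.

```python
class BudgetRegulator:
    """Budget-per-period memory bandwidth regulator (behavioural sketch)."""

    def __init__(self, budget_bytes, period_cycles):
        self.budget_bytes = budget_bytes
        self.period_cycles = period_cycles
        self.used = 0          # bytes consumed in the current period
        self.period_start = 0  # first cycle of the current period

    def request(self, cycle, nbytes):
        """Return the cycle at which an access of nbytes may issue."""
        if nbytes > self.budget_bytes:
            raise ValueError("access larger than the per-period budget")
        # A request cannot issue before the current period has started.
        cycle = max(cycle, self.period_start)
        # Move to the period that contains `cycle` and replenish the budget.
        if cycle >= self.period_start + self.period_cycles:
            elapsed = (cycle - self.period_start) // self.period_cycles
            self.period_start += elapsed * self.period_cycles
            self.used = 0
        # Budget exhausted: stall until the start of the next period.
        if self.used + nbytes > self.budget_bytes:
            self.period_start += self.period_cycles
            self.used = 0
            cycle = self.period_start
        self.used += nbytes
        return cycle
```

With a budget of 128 bytes per 100 cycles, two 64-byte accesses issue immediately and a third one in the same period is delayed to the start of the next period.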

# Future Work (B)

- Optimization strategies for hardware wrapper generation
  - Some hardware-mapped applications result in common hardware components
  - To further reduce FPGA occupation, we can automate the search for common wrapper components to be shared among different acceleration kernels
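
A first step towards automating that search could be a simple count of which components appear in more than one wrapper. The sketch below assumes each wrapper is described as a list of component names; the kernel and component names are invented for illustration, and whether a candidate can really be shared (timing, contention) would still have to be checked separately.

```python
from collections import Counter


def component_counts(wrappers):
    """Count in how many wrappers each component type appears."""
    return Counter(c for comps in wrappers.values() for c in set(comps))


def shared_components(wrappers):
    """Component types used by more than one wrapper (sharing candidates)."""
    counts = component_counts(wrappers)
    return sorted(c for c, n in counts.items() if n > 1)


def instances_saved(wrappers):
    """Instances removed if every shared type were instantiated only once."""
    counts = component_counts(wrappers)
    return sum(n - 1 for n in counts.values() if n > 1)


# Hypothetical wrapper descriptions for three acceleration kernels.
WRAPPERS = {
    "fir": ["streamer_in", "streamer_out", "regfile", "fir_datapath"],
    "mmul": ["streamer_in", "streamer_out", "regfile", "mmul_datapath"],
    "fft": ["streamer_in", "regfile", "fft_datapath"],
}

candidates = shared_components(WRAPPERS)
saved = instances_saved(WRAPPERS)
```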







# Thanks for your attention!
