## A cycle-accurate methodology to improve PREM-like memory bandwidth underutilization on FPGA-based HeSoCs

Gianluca Brilli<sup>\*</sup>, Giacomo Valente<sup>+</sup>, Alessandro Capotondi<sup>\*</sup>, Tania di Mascio<sup>+</sup>, Paolo Burgio<sup>\*</sup>, Paolo Valente<sup>\*</sup> and Andrea Marongiu<sup>\*</sup>

#### IWES, 2021

<sup>\*</sup>University of Modena and Reggio Emilia, <name>.<surname>@unimore.it <sup>+</sup>University of L'Aquila, <name>.<surname>@univaq.it

FRACTAL

EDGE





Fondo di Ateneo per la Ricerca FAR2020



## Motivations (1)

- As the number of engines grows in next-generation HeSoCs, interference on shared interconnects and main memory increasingly inflates tasks' execution time.



#### Versal ACAP (up to 16x)



G. Brilli, A. Capotondi, P. Burgio and A. Marongiu, *Understanding and Mitigating Memory Interference in FPGA-based HeSoCs*, 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2022.



## Motivations (2)

- Available **memory bandwidth regulation** mechanisms are:
  - too loosely coupled and coarse-grained, in both actuation and monitoring;
  - or **platform-specific**.



## Contributions

- Runtime Bandwidth Regulator (RBR)
  - Tightly-coupled monitoring & throttling;
  - Minimal timing overhead (1 clock cycle);
  - High-precision QoS regulation;
  - Evaluation on Xilinx Zynq UltraScale+ MPSoC.

• This work is currently under submission, so details cannot be disclosed.



## Background



- Some examples:
  - NVIDIA Xavier;
  - Xilinx Zynq UltraScale+;
  - Xilinx Versal.
- HeSoCs are an emerging trend.



### The proposed mechanism



- **Control Core**: tightly coupled with the DMA and the Sniffer/Throttler.
  - Sets the amount of bandwidth the DMA can use (CLI).
- **DMA**: performs **controlled** memory transactions.
- **Sniffer/Throttler**: constantly monitors DMA activity and regulates DMA transactions.
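The interplay of the three blocks can be illustrated with a small software model. This is only a sketch under stated assumptions: the actual RBR design is not disclosed here, so the window-plus-budget scheme, the class `WindowThrottler`, and its parameters (`window_cycles`, `budget_beats`) are hypothetical names chosen for illustration, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class WindowThrottler:
    """Hypothetical model of a sniffer/throttler.

    Assumption: the DMA may issue at most `budget_beats` data beats
    in every window of `window_cycles` clock cycles.
    """
    window_cycles: int
    budget_beats: int
    cycle: int = 0
    used: int = 0

    def set_budget(self, budget_beats: int) -> None:
        # Stands in for the control core programming a new bandwidth share.
        self.budget_beats = budget_beats

    def tick(self, dma_wants_beat: bool) -> bool:
        """Advance one clock cycle; return True if a beat is granted."""
        if self.cycle % self.window_cycles == 0:
            self.used = 0  # a new regulation window starts
        self.cycle += 1
        granted = dma_wants_beat and self.used < self.budget_beats
        if granted:
            self.used += 1  # the sniffer counts the beat it just observed
        return granted


def run(throttler: WindowThrottler, cycles: int) -> int:
    """Count granted beats for a DMA that requests a beat every cycle."""
    return sum(throttler.tick(True) for _ in range(cycles))


throttler = WindowThrottler(window_cycles=100, budget_beats=25)
granted = run(throttler, 1000)
utilization = granted / 1000  # fraction of cycles carrying data
```

In this model a DMA that always requests data is held to 25 beats per 100-cycle window, so `utilization` equals the configured share. Calling `set_budget` between windows mimics the control core changing the share at run time.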



## **Experimental Results (1)**

• **Exp1** – Tightly-coupled versus Loosely-coupled Monitoring and Throttling.

#### • Objective:

- Test the ability of our system to follow a **bandwidth profile** (e.g. provided by a system scheduler).
- Compare against loosely-coupled solutions on a Xilinx Zynq UltraScale+ MPSoC
  - based on the Xilinx AXI Performance Monitor (APM).



## **Experimental Results (2)**

• **Exp1** – Tightly-coupled versus Loosely-coupled Monitoring and Throttling.



- Black dashed line: target bandwidth profile (e.g. from a system scheduler);
- Blue: our tightly-coupled (TC) solution;
- Green: loosely-coupled (LC) solution based on APM.



## **Experimental Results (3)**

• **Exp1** – Tightly-coupled versus Loosely-coupled Monitoring and Throttling.

#### • Results:

- Our Tightly-Coupled solution (TCMT) follows a bandwidth profile with a period of 32 µs;
- Platform-dependent Loosely-Coupled solutions (LCMT) need a slower scheduling tick, with a period of at least 384 µs;
- A **12x** improvement (384 µs / 32 µs) over Zynq UltraScale+ solutions.
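To see why the regulation period matters, the toy model below compares a regulator that can update its setting every 32 µs with one limited to 384 µs, both asked to follow a profile that changes every 32 µs. It is a sketch under assumptions, not the experiment itself: the profile values, the function `tracking_error`, and the idea that a regulator simply holds its last setting between updates are all hypothetical.

```python
def tracking_error(profile, step_us, regulator_period_us):
    """Mean absolute gap between target and applied bandwidth.

    `profile` holds one target (as a fraction of peak bandwidth) per
    `step_us` microseconds. The regulator re-reads the target only every
    `regulator_period_us` microseconds and holds it in between.
    """
    total_us = len(profile) * step_us
    applied = profile[0]
    error = 0.0
    for t in range(total_us):
        if t % regulator_period_us == 0:
            applied = profile[t // step_us]  # regulator update point
        error += abs(profile[t // step_us] - applied)
    return error / total_us


# Hypothetical profile: 24 steps of 32 us each (768 us in total).
profile = [0.2, 0.8, 0.5, 1.0, 0.3, 0.6] * 4

tc_error = tracking_error(profile, step_us=32, regulator_period_us=32)
lc_error = tracking_error(profile, step_us=32, regulator_period_us=384)
speedup = 384 // 32  # ratio between the two minimum periods
```

Under these assumptions the 32 µs regulator tracks the profile exactly, while the 384 µs one misses every intermediate step and accumulates a non-zero error.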



## **Experimental Results (4)**

• **Exp2** – QoS for Memory Interference Mitigation.

#### • Objective:

- Test the ability of our system to mitigate memory interference on a heterogeneous system (Xilinx ZUS+).



## **Experimental Results (5)**

• **Exp2** – QoS for Memory Interference Mitigation.

#### Exp-Setup:

- 3 ACTs performing **controlled** memory reads;
- Real applications on the APU and RPU;
- All actors must meet their deadlines (except ACT3, which is Best Effort).

#### Scenarios:

- Very Tight (VT): max 20% of tolerated slowdown
- Tight (T): max 40% of tolerated slowdown
- Medium (M): max 60% of tolerated slowdown
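The scenario thresholds can be expressed as a simple acceptance check. This is a minimal sketch that assumes slowdown is measured relative to execution in isolation; the names `SCENARIOS`, `slowdown` and `meets_scenario`, and the example timings, are hypothetical and not taken from the experiments.

```python
# Maximum tolerated slowdown per scenario, as listed above.
SCENARIOS = {"VT": 0.20, "T": 0.40, "M": 0.60}


def slowdown(isolated_time: float, contended_time: float) -> float:
    """Relative slowdown of a task under interference (0.3 means 30% slower)."""
    return contended_time / isolated_time - 1.0


def meets_scenario(scenario: str, isolated_time: float, contended_time: float) -> bool:
    """True if the observed slowdown stays within the scenario's threshold."""
    return slowdown(isolated_time, contended_time) <= SCENARIOS[scenario]


# Hypothetical task: 10 ms in isolation, 13 ms under interference (30% slower).
results = {name: meets_scenario(name, 10.0, 13.0) for name in SCENARIOS}
```

With these example timings the task violates the Very Tight scenario but is accepted under Tight and Medium.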



## **Experimental Results (6)**

• **Exp2** – QoS for Memory Interference Mitigation.



Figure 11: Ratio of accepted QoS setups with uniform thresholds for Workload 1.

Unfeasible configurations!

With ZUS+ QoS ecosystem.

Serrano-Cases, Alejandro, Juan M. Reina, Jaume Abella, Enrico Mezzetti and Francisco J. Cazorla. *Leveraging Hardware QoS to Control Contention in the Xilinx Zynq UltraScale+ MPSoC*, ECRTS 2021.



## **Experimental Results (7)**

• **Exp2** – QoS for Memory Interference Mitigation.





## Conclusion

- We introduced a **fine-grained QoS control** via **tightly-coupled bandwidth monitoring and regulation**.
  - 12x faster than the loosely-coupled bandwidth regulation mechanisms of the Zynq UltraScale+ MPSoC;
  - More accurate than the ZUS+-based QoS ecosystem.



# Thank you! Gianluca Brilli

High-Performance Real-Time Lab