Hozhabr, S.H. (2026). Design and Evaluation of FPGA Implementations of Deep Learning Models: From Real-Time Object Detection to Graph Convolutional Networks.
Design and Evaluation of FPGA Implementations of Deep Learning Models: From Real-Time Object Detection to Graph Convolutional Networks
Seyed Hani Hozhabr
2026-05-01
Abstract
This thesis investigates the design and evaluation of FPGA-based implementations for accelerating deep learning workloads, with a particular focus on real-time object detection and Graph Convolutional Networks (GCNs). The first part of the work presents the first dedicated and comprehensive survey of real-time object detection on FPGAs, analyzing state-of-the-art architectures, acceleration techniques, and optimization strategies across more than a decade of research. To enable fair comparison among heterogeneous implementations, the study introduces pixel throughput as a resolution-independent performance metric and provides an in-depth classification of algorithmic and hardware design approaches, together with a discussion of the challenges and emerging trends shaping next-generation FPGA-based detection systems. Building on the insights gained from this survey, the second contribution evaluates a practical deployment of Tiny YOLOv2 on an AMD Xilinx Zynq UltraScale+ MPSoC. The implemented heterogeneous system achieves 30 FPS real-time performance while consuming only 1.1 W, outperforming GPU-based implementations (15 FPS at 13.5 W) and CPU baselines (2.5–8 FPS at 2.1–5.3 W) by large margins in both energy efficiency and performance–power ratio. A detailed profiling study using the Vitis AI Profiler reveals key execution bottlenecks, including CPU–DPU synchronization stalls and underutilized DDR bandwidth, offering quantitative insights into system-level inefficiencies and guiding future optimization efforts for FPGA-based MPSoC deployments. The final part of the thesis introduces a lightweight, fully streaming FPGA accelerator for Graph Convolutional Networks, designed for mid-range and power-constrained devices. The architecture integrates scalable sparse–dense matrix multiplication (SpMM) units, an efficient systolic array for dense transformations, and a dataset-aware 16-bit fixed-point quantization strategy that preserves full classification accuracy.
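As an aside, the pixel-throughput metric mentioned above can be illustrated with a minimal sketch. This is not code from the thesis; it simply assumes the natural definition of pixel throughput as frames per second multiplied by pixels per frame, which makes detectors running at different input resolutions directly comparable:

```python
def pixel_throughput(fps: float, width: int, height: int) -> float:
    """Pixels processed per second: frame rate times pixels per frame.

    Resolution-independent in the sense that two systems with
    different input sizes can be compared on one scale.
    """
    return fps * width * height


# Hypothetical comparison: a 30 FPS detector at 416x416 input
# (Tiny YOLOv2's common input size) vs. a 60 FPS detector at 224x224.
a = pixel_throughput(30, 416, 416)  # 5,191,680 pixels/s
b = pixel_throughput(60, 224, 224)  # 3,010,560 pixels/s
```

Under this definition, the lower-FPS system can still deliver the higher pixel throughput, which is exactly the distinction raw FPS comparisons obscure.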
Implemented on a Kintex-7 FPGA, the accelerator achieves 1.77–3.47 ms inference latency on the Cora and Citeseer datasets while consuming only 0.305–0.553 W, delivering up to 1,082 graphs/J (over 50× the energy efficiency of contemporary GPU implementations) while requiring just 54–55 DSPs. These results highlight its suitability for real-time, embedded, and edge-AI applications where energy and resources are severely limited. Overall, this dissertation provides a unified and experimentally validated body of methodologies for the FPGA acceleration of deep learning workloads. Through systematic analysis, empirical bottleneck characterization, and novel architectural design, it advances the state of the art in real-time and energy-aware deep learning on reconfigurable platforms and establishes a foundation for future research on efficient heterogeneous AI computing.
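The graphs/J figure follows from the reported latency and power numbers. The sketch below is illustrative only: it assumes one graph is processed per inference, so throughput is the reciprocal of latency, and energy efficiency is throughput divided by power. The specific pairing of a latency value with a power value is an assumption, not taken from the thesis:

```python
def graphs_per_joule(latency_s: float, power_w: float) -> float:
    """Energy efficiency assuming one graph inference per run:
    (1 / latency) graphs/s divided by power in watts."""
    return 1.0 / (latency_s * power_w)


# With the best reported latency (1.77 ms) and an assumed operating
# power of ~0.52 W (within the reported 0.305-0.553 W range), the
# result lands near the quoted ~1,082 graphs/J.
eff = graphs_per_joule(1.77e-3, 0.522)
```

This back-of-the-envelope check shows how the headline efficiency figure relates to the raw latency and power measurements.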
https://hdl.handle.net/11365/1315054
