Application-specific instruction set processors (ASIPs) have established themselves as an important implementation option for modern SoCs, namely when standard processor IP cannot meet challenging application-specific requirements and fixed hardware is not flexible enough. Heterogeneous multicore systems that include ASIPs are becoming mainstream. Domains such as artificial intelligence, image and video processing, and automated driving assistance have fueled the development of such ASIPs and triggered many university projects. Processor design projects such as the RISC-V initiative have also generated a lot of interest. With all the commercial activity around RISC-V these days, it has outgrown its origins at UC Berkeley.
Synopsys’ ASIP Designer is the market leading tool for:
Synopsys’ ASIP Designer is used by leading companies around the globe with hundreds of successful projects to date.
At this informal event, leading university teams will present results from their ongoing ASIP projects in a variety of application domains such as AI accelerators. Synopsys will share insight on market trends, and provide a technical update on ASIP Designer along with reference examples.
Please use your business or university email address to register to access the proceedings.
Please contact us at firstname.lastname@example.org if you have questions or are unable to access the proceedings.
Our lineup of speakers and topics is hand-curated by our experienced team. Have a look below for our full ASIP University Day 2022 agenda.
Falco Munsche, Technical Marketing Manager, Synopsys
Patrick Verbist, Product Marketing Manager, Synopsys
Falco Munsche, Technical Marketing Manager, Synopsys
ASIPs have established themselves as an implementation option next to standard processor IP and fixed-function RTL. They combine hardware specialization with flexibility through software programmability. This talk will provide an introduction to Synopsys' ASIP Designer tool suite, its target markets, and how Synopsys collaborates with university partners in this domain.
Lucas Ferreira, Doctoral Student, Integrated Electronics Systems, Lund University
In computer vision, feature extraction algorithms compress the image into a sparse set of trackable keypoints. They enable navigation-critical systems such as Simultaneous Localization And Mapping (SLAM) in autonomous robots, as well as other applications such as augmented reality and 3D reconstruction. Most of these applications run on battery-powered devices that share a very stringent power budget. Near-to-sensor computing of feature extraction algorithms allows for several design optimizations. First, the overall on-chip memory requirements can be reduced, and second, the internal data movement can be minimized. This work explores the use of an Application Specific Instruction Set Processor (ASIP), designed with Synopsys ASIP Designer and optimized for performing feature extraction in a real-time and energy-efficient manner. The ASIP features a Very Long Instruction Word (VLIW) architecture comprising one RV32I RISC-V slot and three vector slots. The on-chip memory sub-system implements parallel multi-bank memories with near-memory data shuffling to enable single-cycle multi-pattern vector access. Oriented FAST and Rotated BRIEF (ORB) is thoroughly explored to validate the proposed architecture, achieving a throughput of 140 Frames-Per-Second (FPS) for VGA images at one scale, while reducing the number of memory accesses by 2 orders of magnitude compared to other embedded general-purpose architectures.
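For readers unfamiliar with ORB, its keypoint detector is based on the FAST segment test. The sketch below shows that test in plain Python and is not taken from the talk. The threshold of 20, the arc length of 9 (the common FAST-9 variant), and the name is_fast_corner are assumptions chosen for illustration. The orientation and BRIEF descriptor stages of ORB are not shown.

```python
# Offsets (dx, dy) of the 16-pixel circle of radius 3 used by FAST,
# listed in order around the circle.
CIRCLE = [
    (0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
    (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3),
]


def is_fast_corner(img, x, y, threshold=20, arc=9):
    """Segment test: (x, y) is a corner if `arc` contiguous circle pixels
    are all brighter than center + threshold or all darker than
    center - threshold. `img` is a list of rows of integer intensities.
    threshold and arc are illustrative defaults (assumptions)."""
    center = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in CIRCLE]
    # Append the first arc-1 pixels again so runs that wrap around are found.
    extended = ring + ring[:arc - 1]
    for sign in (1, -1):  # +1 tests "brighter", -1 tests "darker"
        run = 0
        for value in extended:
            if sign * (value - center) > threshold:
                run += 1
                if run >= arc:
                    return True
            else:
                run = 0
    return False
```

Each test reads the same 16 neighbors of a pixel. This is the kind of fixed access pattern that a multi-bank memory with data shuffling could serve as a single vector access.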
Jashandeep Dhaliwal, Junior Fellow, Experimental Physics Department, CERN, Geneva
This study presents the analysis and implementation of embedded processors for on-chip data processing and readout for High Energy Physics applications. Given the limited power, area, and latency budget available, an exploration of Application Specific Instruction-set Processors (ASIPs) has been conducted for the first time in the Micro-Electronics group at CERN. Two microprocessor examples have been chosen as a starting point: Trv32p3 and Tmicro, characterized respectively by 32-bit RISC-V and 16-bit non-RISC-V instruction sets. Both microprocessors have been customized with an AMBA APB system bus protocol interface for integration in a SoC environment. Moreover, additional load and store instructions have been added to their instruction sets. An application-specific test code performing filtering and clustering of particles verified their correct functionality with real physics data as input. Results are strongly dependent on the application and input data set. The number of cycles is strongly dominated by read and write accesses to the central register file and Data Memory. Finally, a complete RTL-to-GDS implementation flow in a 28 nm CMOS technology provided the relevant figures of merit regarding achievable frequency, area occupation, and power consumption completing the evaluation and enabling further optimization.
Erik Brockmeyer, Senior Applications Engineer, Synopsys
We will present the design of an application-specific processor that accelerates the sorting of large data sets with millions of elements. Such sorting is used in many applications, for instance particle filters for dynamic grid mapping. We will present a SIMD processor with custom instructions for sorting the elements of a vector. Sorting of a 64-element vector is done in a single instruction that is based on a bitonic sorting network. This instruction is used to hierarchically sort a large array. The array is stored in off-chip memory. Portions of the array are transferred between the off-chip memory and the processor local memory, where they are sorted and merged with other portions. We will also address the DMA control, the data buffer management, and the scheduling of sub-tasks.
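The sort-then-merge scheme described in this abstract can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the presented design. bitonic_sort stands in for the single 64-element sort instruction, and sort_large stands in for the hierarchical pass over the array. Both names, the infinity padding, and the use of heapq.merge in place of the processor's merge step are assumptions, and DMA transfers and buffer management are not modeled.

```python
import heapq

VECTOR_LEN = 64  # vector width named in the abstract


def bitonic_sort(vec):
    """Sort a power-of-two-length list with a bitonic sorting network.
    Software stand-in for the single-instruction vector sort."""
    n = len(vec)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    a = list(vec)
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # distance between compared elements
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Compare-exchange in the direction this block requires.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


def sort_large(data):
    """Sort each 64-element portion, then merge the sorted portions."""
    pad = (-len(data)) % VECTOR_LEN
    padded = list(data) + [float("inf")] * pad   # padding sorts to the end
    chunks = [bitonic_sort(padded[i:i + VECTOR_LEN])
              for i in range(0, len(padded), VECTOR_LEN)]
    merged = list(heapq.merge(*chunks))          # assumed merge strategy
    return merged[:len(data)]                    # drop the padding
```

In the network, every compare-exchange within one (k, j) step touches a disjoint pair of elements, so each step can be done on all lanes at once. That independence is what makes a bitonic network a natural fit for a SIMD instruction.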
Moussa Traore, M.A.Sc., AR Engineer, Polytechnique Montréal
Binarized Neural Networks (BNNs), where the weights and neuron activations are expressed with 1 or 2 bits, have significant potential for drastic power consumption reduction because of trivial computation requirements and a reduction by an order of magnitude in data movement. Previous work has shown that this can be achieved without sacrificing accuracy, but it remains difficult to increase BNN throughput on a processor with a standard instruction set and datapath. Recent research has aimed to exploit BNN potential by finding ways to map these networks to the underlying hardware more effectively, especially in the case of FPGAs. This paper introduces a VLIW processor with a specialized instruction set to efficiently compute the inference of LUT-based binary neurons in a single clock cycle. When operating at full capacity, the specialized processor uses 21 parallel data memories and 21 computing slots. A specialized input-output register loads the input image and reads out the processor inference results. On the MNIST data set, the processor achieves a 2994× increase in throughput compared to an unoptimized baseline VLIW processor, at 116× the initial hardware cost, with an accuracy of 98.15%.
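To illustrate what a LUT-based binary neuron computes, the sketch below packs the binary inputs into a table index and reads a single output bit. The abstract does not say how the tables are derived. Building the table from XNOR-style weight matching against a threshold, and the names build_lut and lut_neuron, are assumptions made for illustration only.

```python
def build_lut(weights, threshold):
    """Precompute the output bit for every input pattern of a binary neuron.
    weights is a list of 0/1 bits. The neuron fires when at least
    `threshold` inputs equal their weight (an assumed XNOR-popcount rule).
    Returns the table packed into one integer, bit `index` = output."""
    k = len(weights)
    lut = 0
    for index in range(1 << k):
        matches = sum(1 for b in range(k) if ((index >> b) & 1) == weights[b])
        if matches >= threshold:
            lut |= 1 << index
    return lut


def lut_neuron(lut, input_bits):
    """Evaluate the neuron: pack the input bits into an index, read one bit."""
    index = 0
    for b, bit in enumerate(input_bits):
        index |= (bit & 1) << b
    return (lut >> index) & 1
```

Once the table exists, inference for one neuron is a single lookup with no arithmetic, which is why such a neuron is a plausible candidate for a one-cycle instruction.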
Mircea R. Stan, Virginia Microelectronics Consortium Professor, ECE Department, University of Virginia
AI-RISC is a scalable processor developed over the past 3 years using hardware, ISA and software co-design, incorporating TVM on the front end and ASIP Designer on the back end. AI-RISC extends the open-source RISC-V architecture for accelerating edge AI applications by integrating hardware accelerators as AI functional units (AFUs) inside the RISC-V processor pipeline. This allows AI accelerators to be integrated at a fine granularity and treated as regular functional units during the execution of instructions. AI-RISC extends the RISC-V ISA with custom instructions that directly target the added AFUs, which allows seamless processing of both AI and non-AI tasks on the same hardware. Additionally, AI-RISC integrates a complete software stack including compiler, assembler, linker, simulator and profiler, while preserving the high-level programming abstraction offered by popular AI domain-specific languages and frameworks such as TensorFlow, PyTorch, MXNet and Keras. Evaluation results show that AI-RISC accelerates the processing of the vector-matrix multiply (VMM) kernel by 17.63x and of the ResNet-8 neural network model from the industry-standard MLPerf Tiny benchmark by 4.41x compared to a RISC-V processor baseline. AI-RISC also outperforms the state-of-the-art Arm Cortex-A72 IoT edge processor by 2.45x on average over the complete MLPerf Tiny inference benchmark.
Dominik Auras, Senior Applications Engineer, Synopsys
Typical network applications, like intrusion detection or web application firewall, look for known patterns in packet payloads. To handle high bandwidth data streams, we can first look for short fixed-length string fragments of the pattern rules, before applying expensive regular expression matching to the identified pattern match candidates. This multipattern string matching acceleration case study presents an ASIP featuring a 7-way VLIW ISA, custom memory organization, dedicated HW operators and efficient loop operations.
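The two-stage approach described here, a cheap fragment lookup followed by regular expression matching on the candidates only, can be sketched as below. This is a software illustration under stated assumptions rather than the ASIP implementation. The fragment length of 4 bytes, the rule format, and the names build_prefilter and scan are hypothetical.

```python
import re

FRAGMENT_LEN = 4  # assumed fixed fragment length, in bytes


def build_prefilter(rules):
    """rules maps a rule name to (literal, regex), both bytes.
    The first FRAGMENT_LEN bytes of the literal become the fragment
    that the fast first stage looks for."""
    table = {}
    for name, (literal, regex) in rules.items():
        fragment = literal[:FRAGMENT_LEN]
        assert len(fragment) == FRAGMENT_LEN, "literal too short"
        table.setdefault(fragment, []).append((name, re.compile(regex)))
    return table


def scan(payload, table):
    """Stage 1: slide over the payload and look up each fragment.
    Stage 2: run the expensive regex only for rules whose fragment hit."""
    candidates = {}
    for i in range(len(payload) - FRAGMENT_LEN + 1):
        for name, regex in table.get(payload[i:i + FRAGMENT_LEN], ()):
            candidates[name] = regex
    return sorted(name for name, regex in candidates.items()
                  if regex.search(payload))
```

For example, with a rule whose literal is b"UNION" and whose regex is rb"UNION\s+SELECT", the payload b"GET /UNIONIZED" becomes a candidate in stage 1 but is rejected in stage 2, while most payloads never reach the regex at all.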
Maria Auras-Rodriguez, Senior Software Engineer, Synopsys
ASIP Designer’s compiler-in-the-loop and synthesis-in-the-loop methodologies include performing SW compilation and HW synthesis runs multiple times while designing an application-specific processor, using the results to guide the ongoing architectural exploration. In this presentation, we show how the interoperability between ASIP Designer and RTL-Architect facilitates the synthesis-in-the-loop design approach, both during earlier design stages with processor model modifications and during RTL implementation. We present a case study that uses the ASIP RTL Explorer utility from ASIP Designer to systematically generate various RTL implementations from a medium-throughput AI accelerator processor model. Then, using RTL-Architect's exploration capabilities, we can perform a comparative analysis of the implementation variants with respect to the power consumption for running MobileNet inference. The objective is to find an energy-efficient RTL implementation of the processor model at hand.