Issue 1, 2011
Optimizing Multicore System Performance
Frank Schirrmeister, director of product marketing for Synopsys’ System-Level Solutions, and Patrick Sheridan, Senior Staff Product Marketing Manager for Synopsys Platform Architect, outline new technology that enables system architects to define and analyze multicore systems. Multicore Optimization technology, part of Platform Architect, enables design teams to more accurately predict system performance using SystemC months before software is available.
Today’s SoCs include more hardware functions and software, and share more resources, than ever before. It’s likely that the same architecture will have to accommodate multiple application use-cases, which makes it increasingly difficult to design an architecture that serves all of them efficiently. Multicore architectures compound the difficulty: performance has become very hard to predict. This makes life even more difficult for the system and hardware architects who would like to forecast the performance of their architectures before the software is ready – ideally in the early concept stage of the design.
Only by predicting the dynamic application performance will design teams avoid the risk of under-design and over-design. Under-designing a chip means that it won’t meet its specification, resulting in an uncompetitive product that the team will probably have to re-design, which will inevitably incur schedule delays, extra costs and missed market windows. If design teams have no way to accurately predict system performance, some will deliberately try to ‘play it safe’ by over-engineering their designs so that they avoid any possibility of not meeting the product specs. However, over-engineering a design also carries risks, including excessive product cost (which reduces profit) and inefficient use of power.
According to VDC Research, 43.4% of respondents to a survey of the embedded systems market claim that system architecture design and specification is a major cause of project delays, second only to project management and planning issues.
What system architects need is more help to define the optimum system architecture to support all their application use-cases cost-effectively. This help must be available much earlier in the design cycle, and be more robust and accurate than the spreadsheet-based models that engineers have used in the past. If system engineers can find and resolve multicore performance issues while architecture changes are still feasible, they can avoid re-working the hardware; something that becomes increasingly expensive as the implementation progresses.
Synopsys has introduced new Multicore Optimization technology within Platform Architect to make it possible to easily create and use performance models of dynamic multicore applications in SystemC. By capturing performance models of dynamic multicore applications in the early concept phase of system architecture design, architects can measure, analyze and optimize the hardware/software system architecture months before the software is available.
Platform Architect with Multicore Optimization technology reduces the risk of over-design and/or under-design in consumer, wireless communications and automotive system products, helping ensure cost-effective and successful delivery of complex multicore SoCs.
Accurately specifying complex systems has always been challenging. Verbal specifications, or those written in English, can be ambiguous and open to interpretation. These days, designing a complex multicore system usually requires that providers and consumers of hardware and software IP collaborate. Having executable performance models that hardware and software partners can share, and which don’t depend on having final hardware and software, improves the effectiveness and precision of any collaboration.
Optimizing Multicore Systems
The new Multicore Optimization technology enables Platform Architect users to create task-driven workload models of the end-product application, known as task-graphs, enabling analysis and optimization of hardware/software partitioning and system performance. After hardware/software partitioning is finalized, architects reuse the same task-graphs and task-driven traffic for SoC-level architecture exploration and IP selection, as well as interconnect and memory subsystem performance optimization. Benefits include optimized multicore system performance, shorter evaluation times and faster time to market.
Multicore Optimization technology in Synopsys Platform Architect comprises two major aspects:
- the modeling of an application as a SystemC task graph, and
- a Virtual Processing Unit (VPU), which represents a processing element for the execution of a task graph.
MCO Application Task Graph
Synopsys has defined an API on top of SystemC for the modeling of applications. It provides typical SystemC concepts such as “tasks”, “events”, “waiting”, “interfaces” and so on. In addition, tasks have well-defined states including “created”, “ready”, “running”, “waiting” and “suspended”. Tasks can consume processing time.
Providing an API layer on top of SystemC enables system architects to control the activation of tasks and the notification of events by an intermediate scheduling layer. This scheduling layer (or “task manager”) coordinates the execution of a task graph. Architects can start and stop tasks and entire task graphs at runtime, enabling them to model large-scale changes in the application workload.
Based on the task modeling API, Synopsys provides a set of task and communication libraries, so users can rapidly compose a task graph without having to spend a lot of time and effort creating models manually. The communication libraries provide interfaces and channels for data-flow- and control-flow-oriented communication. The task libraries provide a set of generic configurable tasks to create a non-functional performance model of arbitrary application topology. Of course, users can also use the task modeling API to create arbitrary task and communication models themselves.
Figure 1: Powerful system-level analysis views provide early visibility and quantitative measurement of dynamic application performance on multicore architectures.
Virtual Processing Unit
The VPU models the processing element, which can execute a task graph, or a portion of the task graph. It supports preemption and time-slicing of tasks for modeling of interrupts and arbitrary scheduling algorithms. If they wish, system architects can extend the set of default scheduling algorithms. The VPU also comes with a library of components for traffic generation, cache modeling, inter-VPU communication, interrupt handling, etc., to model the realistic execution of a task graph with high accuracy.
VPU Configuration Parameters
System architects can easily configure the VPU by changing the following parameters:
- Clock port: The VPU is connected to a clock generator to compute the processing delay of a task graph based on the actual clock period. Dynamic changes in the clock frequency are taken into account, so the effects of frequency scaling are accurately represented.
- Number of bus ports: The VPU can have an arbitrary number of TLM-2.0 initiator and target ports. Initiator ports send out instruction fetches and data transactions to the interconnect. The VPU needs target ports if external initiators access the local memory inside the VPU.
- Number of interrupt ports: The VPU can have an arbitrary number of interrupt ports to react to external interrupt events.
- Traffic generation: System architects can configure traffic generation separately for each VPU. This essentially represents the instruction fetch unit as well as the load/store units.
- Level-1 cache: System architects can add stochastic level-1 caches to each VPU, which provide additional parameters such as the line size and the miss probability.
In addition, users can configure a number of software-related parameters at the VPU level:
- Scheduling algorithm: Users can configure the scheduling algorithm separately for each VPU. By default, the algorithm provides priority-based and round-robin scheduling.
- Preemption: Each VPU has additional scheduling-related parameters to configure preemption and time-slicing.
- Drivers: If one task communicates with another task mapped to other VPUs or with other platform components (such as memories or peripherals), the task-level communication needs to be transformed into platform-level communication. This way the task graph itself remains independent of a specific platform. Synopsys provides a set of generic drivers for shared memory access, DMA communication, and FIFO-based communication. Users can add drivers for specific communication schemes.
- Interrupt Service Routine (ISR): Dedicated ISR tasks organize the sensitivity to external events. Again, the goal is to keep the application task graph independent of a particular platform, so that system architects can explore different mapping options without modifying the task graph itself.
Task Configuration Parameters
The topology of a task graph determines the coarse-grain properties of the application like task-level parallelism and precedence. Each task can have additional parameters. For example, a task in the Generic Task Library provides the following set of configuration parameters:
- Priority: This parameter is evaluated by priority-based scheduling algorithms.
- Job id: Users can group tasks into jobs, so execution control can be applied to a set of tasks with a single command. This way, users can start or stop an entire application at runtime.
- Processing time: Determines the inherent processing time of a task, i.e. how long it occupies a VPU (excluding communication delays).
- Fetch/Load/Store probabilities: Specify the communication requirements of a task. TLM buses and memories simulate the duration of the transaction, which extends the time a task occupies a VPU.
- Address space: Optionally, a task can specify the location of instruction memory and data memory, so the traffic generators in the VPU are directed to the correct memories or memory regions.
- Further optional parameters fine-tune the behavior of each task.
The goal of the tasks in the Generic Task Library is to enable the rapid creation of a non-functional performance model of the application without manual modeling effort. Users can also create their own tasks with arbitrary configuration parameters.
Figure 2: Spreadsheet input and the Generic Task Library make it easy to create architecture performance models of multicore systems for early analysis
Defining Multicore Architectures
Multicore architectures are composed from regular SystemC TLM models such as interconnects, memories and DMAs. These TLM models are typically available in the Platform Architect model library. VPUs represent all programmable, configurable, or fixed-logic processing elements.
The next step in defining a multicore architecture is to map the application task graph onto the VPUs, which assigns tasks to physical resources.
Where communicating tasks are mapped to different VPUs, the system architect needs to map the channels between these tasks to a suitable communication mechanism provided by the platform architecture. Users achieve this communication refinement by using drivers (defined above). The drivers define how a task interface (e.g., a message queue) is mapped into the platform (e.g., a memory mapped circular buffer) and how the system achieves synchronization between the tasks on the different VPUs (e.g., interrupts).
The configuration parameters of the VPU determine the nature of the multicore platform. For example, a set of identical VPUs would represent a symmetric multiprocessing (SMP) cluster. VPUs with different parameters (scheduling algorithm, clock period, traffic generation, etc.) would represent an asymmetric multiprocessing (AMP) sub-system. Today, the Multicore Optimization solution for Synopsys Platform Architect provides a generic VPU configuration. Although it does not provide a library of pre-configured VPUs representing specific processing elements, users can customize VPU configurations for their systems to mimic a specific processor.
The new Multicore Optimization technology embedded in Synopsys’ Platform Architect gives architects a clear understanding of the application and its required features at an early stage of their projects. This insight into system performance, the hardware and software allocation of available resources, software scheduling scenarios and architecture decisions greatly reduces the overall design cycle time. The new technology allows users to find and resolve multicore performance issues while architecture changes are still feasible, avoiding costly re-work of hardware and software implementations.
About the author
As director of product marketing at Synopsys, Frank Schirrmeister is currently responsible for the system-level design solutions including virtual prototyping for embedded software-driven development, architecture development and processor design. Before joining Synopsys, Frank held senior management positions in the areas of embedded software, semiconductor IP development and design services at Imperas, ChipVision, Cadence, AXYS Design Automation and SICAN Microelectronics. Most recently, he served as vice president of marketing at Imperas in the area of multicore software development, and at Cadence as group director of verification marketing in the design and verification business unit. Frank has an MSEE from the Technical University of Berlin, Germany. He also writes the Synopsys System-Level Design Blog “A View from the Top.”
As Senior Staff Product Marketing Manager at Synopsys, Patrick Sheridan is responsible for Synopsys’ system-level solution for multicore platform architecture design. In addition to his responsibilities at Synopsys, he currently serves as the Executive Director of the Open SystemC Initiative (OSCI). Pat has 27 years’ experience in the marketing and business development of high-technology hardware and software products. Prior to joining Synopsys, he worked at CoWare, Hewlett-Packard and Cadence Design Systems, and provided marketing consulting to successful start-up companies in Silicon Valley. Pat has a BS in Computer Engineering from Iowa State University.
Source: VDC Research, 2008 Embedded Software Market Intelligence Service, Track 2, Software/System Modeling and Test Tools, Volume 2, Virtual System Prototyping, Simulation Tools for Software Development and Verification, page 18