|A Physics-Based Three-Dimensional Analytical Model for RDF-Induced Threshold Voltage Variations|
In this paper, a 3-D analytical model is proposed to capture the threshold voltage, surface potential, and electric field variations induced by random dopant fluctuations in the channel region of metal-oxide-semiconductor field-effect transistors. The 3-D model treats the effect of each dopant separately and is based on fundamental laws of physics. The proposed approach enables determination of transistor threshold voltage variations with both very low computational cost and high accuracy. Using the developed model, we performed statistical analysis, simulating more than 100 000 transistor samples. Interestingly, the results showed that, although the distribution of the threshold voltage for large-channel transistors is Gaussian, for scaled transistors, it is non-Gaussian. Furthermore, the proposed model predicts known formulas, which are proven for 1-D analysis and large transistors, simply by setting the appropriate transistor size. As a consequence, this model is a logical extension of the theory of large transistors to nanoscaled devices.
Oct 24, 2012
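To illustrate why dopant statistics might look Gaussian for large channels but not for scaled ones, the following is a minimal sketch and not the paper's 3-D model. It assumes dopants are placed independently and uniformly, so the dopant count in the channel volume is Poisson distributed. The function name, the doping level, and the device dimensions are hypothetical values chosen only for illustration.

```python
import math

def dopant_count_stats(doping_cm3, width_nm, length_nm, depth_nm):
    """Mean, relative spread, and skewness of a Poisson dopant count.

    Assumption (not from the paper): dopants are independent and uniformly
    distributed, so the count inside the volume follows a Poisson law.
    """
    nm_to_cm = 1e-7
    volume_cm3 = (width_nm * nm_to_cm) * (length_nm * nm_to_cm) * (depth_nm * nm_to_cm)
    mean = doping_cm3 * volume_cm3
    rel_sigma = 1.0 / math.sqrt(mean)  # sigma / mean for a Poisson variable
    skewness = 1.0 / math.sqrt(mean)   # Poisson skewness; 0 means symmetric
    return mean, rel_sigma, skewness

# Hypothetical devices: same doping and depth, very different channel areas.
large = dopant_count_stats(1e18, 1000, 1000, 30)
small = dopant_count_stats(1e18, 20, 20, 30)
```

With these assumed numbers the large device holds tens of thousands of dopants and its count distribution is nearly symmetric, while the small device holds about a dozen and is visibly skewed. This covers only number fluctuation; the position-dependent effect of each dopant, which the abstract says the model treats separately, is not captured here.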
|Cryogenic Operation of Junctionless Nanowire Transistors|
This letter presents the properties of nMOS junctionless nanowire transistors (JNTs) under cryogenic operation. Experimental results of drain current, subthreshold slope, maximum transconductance at low electric field, and threshold voltage, as well as its variation with temperature, are presented. Unlike in classical devices, the drain current of JNTs decreases when temperature is lowered, although the maximum transconductance increases when the temperature is lowered down to 125 K. An analytical model for the threshold voltage is proposed to explain the influence of nanowire width and doping concentration on its variation with temperature. It is shown that the wider the nanowire or the lower the doping concentration, the higher the threshold voltage variation with temperature.
Oct 24, 2012
|Elastic-Buffer Flow Control for On-Chip Networks|
This paper presents elastic buffers (EBs), an efficient flow-control scheme that uses the storage already present in pipelined channels in place of explicit input virtual channel buffers (VCBs). With this approach, the channels themselves act as distributed FIFO buffers. Without VCBs, and hence virtual channels (VCs), deadlock prevention is achieved by duplicating physical channels. We develop a channel occupancy detector to apply universal globally adaptive load-balancing (UGAL) routing to load balance traffic in networks using EBs. Using EBs results in up to 8% (12% for low-swing channels) improvement in peak throughput per unit power compared to a VC flow-control network. These gains allow for a wider network data path to be used to offset the removal of VCBs and increase throughput for a fixed power budget. EB networks have identical zero-load latency to VC networks operating under the same frequency. The microarchitecture of an EB router is considerably simpler than a VC router because allocators and credits are not required. For 5×5 mesh routers, this results in an 18% improvement in the cycle time.
Oct 24, 2012
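The idea that a pipelined channel can itself behave as a distributed FIFO can be sketched with a toy cycle-level model. This is a simplified illustration under stated assumptions, not the paper's circuit: each stage is assumed to hold at most one flit, backpressure is assumed to propagate along the whole chain within a cycle, and the `step` function and its arguments are hypothetical names.

```python
def step(stages, incoming, sink_ready):
    """Advance a chain of single-flit elastic stages by one cycle.

    stages: list of flits or None; index 0 is nearest the source.
    incoming: flit offered by the source this cycle, or None.
    sink_ready: True if the receiver accepts a flit this cycle.
    Returns (new_stages, delivered_flit, source_accepted).
    """
    new = list(stages)
    delivered = None
    last = len(new) - 1
    # The final stage drains only when the sink is ready.
    if new[last] is not None and sink_ready:
        delivered = new[last]
        new[last] = None
    # Walk from sink to source so a stage that just emptied can be refilled
    # by the flit behind it in the same cycle.
    for i in range(last - 1, -1, -1):
        if new[i] is not None and new[i + 1] is None:
            new[i + 1] = new[i]
            new[i] = None
    # The source is stalled when the first stage is still occupied.
    accepted = False
    if incoming is not None and new[0] is None:
        new[0] = incoming
        accepted = True
    return new, delivered, accepted

# Fill a three-stage channel while the sink stalls, then release it.
chain = [None, None, None]
log = []
for flit, ready in [("a", False), ("b", False), ("c", False), ("d", False), ("d", True)]:
    chain, delivered, accepted = step(chain, flit, ready)
    log.append((delivered, accepted))
```

In this run the channel absorbs three flits while the sink is stalled, refuses the fourth, and then delivers them in arrival order once the sink becomes ready, with no credits or separate input buffer involved.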
|Silicon-die Thermal Monitoring Using Embedded Sensor Cells Unit|
Thermal monitoring is essential in integrated circuits (ICs) and VLSI chips, which are multilayer structures built from stacks of different materials. An increase in the internal temperature of VLSI circuits can lead to serious thermal and thermo-mechanical problems. Due to aggressive technology scaling, VLSI integration density and power density have increased drastically. Research on thermal phenomena at the micro-scale level is essential for SoC and MEMS-based applications. However, various measurement techniques are needed to understand the thermal behavior of a VLSI chip. In particular, measurement techniques for surface temperature distributions of large VLSI systems are a highly challenging research topic. This paper presents an algorithm and experimental results for a silicon-die thermal monitoring method using an embedded sensor cells unit. Sensor implementation results and analysis are also presented.
Oct 24, 2012
|Low-Power Functionality Enhanced Computation Architecture Using Spin-Based Devices|
Power consumption in CMOS integrated circuits increases with every technology generation due to increased subthreshold and gate leakage currents. To cope with this problem, researchers have started looking at logic devices based on electron spin, as an alternative to charge-based CMOS, for realizing low-power integrated circuits with low active power dissipation and zero standby leakage. In this paper, we investigate spin-based logic devices that employ a low-power spin-torque switching mechanism for circuit operation. We have developed a Functionality Enhanced All Spin Logic (FEASL) architecture and a synthesis framework using Logically Passively Self Dual (LPSD) formulation. This methodology enables the design of large functional logic blocks, especially low-power adders and multipliers, which constitute the building blocks of all arithmetic logic units (ALUs). In addition, we have investigated three different variants of ASL (low-power, medium-power/medium-performance, and high-performance), and we analyze their merits and drawbacks at the circuit/architecture level. We synthesized a Discrete Cosine Transform (DCT) algorithm using adders and multipliers to show the efficacy of the proposed FEASL approach in designing digital signal processing (DSP) systems. Compared to a 15 nm CMOS implementation, the FEASL-based DCT shows 88% improvement in power and 83% in PDP with 43% degradation in performance.
Sep 26, 2012
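The three FEASL figures quoted above can be cross-checked with simple arithmetic. The sketch below assumes that PDP is the power-delay product and that "43% degradation in performance" means a 43% longer delay; both readings are assumptions, since the abstract does not define them. The function name is hypothetical.

```python
def pdp_improvement(power_reduction, delay_increase):
    """Fractional PDP improvement from fractional power and delay changes.

    Assumes PDP = power * delay, with both quantities expressed relative
    to the CMOS baseline (baseline = 1.0).
    """
    relative_power = 1.0 - power_reduction
    relative_delay = 1.0 + delay_increase
    return 1.0 - relative_power * relative_delay

# 88% lower power combined with a 43% longer delay.
improvement = pdp_improvement(0.88, 0.43)
```

Under these assumptions the relative PDP is 0.12 × 1.43 ≈ 0.17, an improvement of roughly 83%, which is consistent with the reported figure.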
|Allocator Implementations for Network-on-Chip Routers |
The present contribution explores the design space for virtual channel (VC) and switch allocators in network-on-chip (NoC) routers. Based on detailed RTL-level implementations, we evaluate representative allocator architectures in terms of matching quality, delay, area and power and investigate the sensitivity of these properties to key network parameters. We introduce a scheme for sparse VC allocation that limits transitions between groups of VCs based on the function they perform, and reduces the VC allocator's delay, area and power by up to 41%, 90% and 83%, respectively. Furthermore, we propose a pessimistic mechanism for speculative switch allocation that reduces switch allocator delay by up to 23% compared to a conventional implementation without increasing the router's zero-load latency. Finally, we quantify the effects of the various design choices discussed in the paper on overall network performance by presenting simulation results for two exemplary 64-node NoC topologies.
Sep 26, 2012
|Effect of Nonlinear Summation of Synaptic Currents on the Input-Output Properties of Spinal Motoneurons|
A single spinal motoneuron receives tens of thousands of synapses. The neurotransmitters released by many of these synapses act on ionotropic receptors and alter the driving potential of neighboring synapses. This interaction introduces an intrinsic nonlinearity in motoneuron input–output properties where the response to two simultaneous inputs is less than the linear sum of the responses to each input alone. Our goal was to determine the impact of this nonlinearity on the current delivered to the soma during activation of predetermined numbers and distributions of excitatory and inhibitory synapses. To accomplish this goal we constructed compartmental models constrained by detailed measurements of the geometry of the dendritic trees of three feline motoneurons. The current “lost” as a result of local changes in driving potential was substantial and resulted in a highly nonlinear relationship between the number of active synapses and the current reaching the soma. Background synaptic activity consisting of a balanced activation of excitatory and inhibitory synapses further decreased the current delivered to the soma, but reduced the nonlinearity with respect to the total number of active excitatory synapses. Unexpectedly, simulations that mimicked experimental measures of nonlinear summation, activation of two sets of excitatory synapses, resulted in nearly linear summation. This result suggests that nonlinear summation can be difficult to detect, despite the substantial “loss” of current arising from nonlinear summation. The magnitude of this “loss” appears to limit motoneuron activity, based solely on activation of ionotropic receptors, to levels that are inadequate to generate functionally meaningful muscle forces.
Sep 26, 2012
|AXR-CMP: Architecture Support in Accelerator-Rich CMPs|
To improve performance/power efficiency, we expect that future CMPs may use special-purpose accelerators extensively. This work discusses hardware architectural support for accelerator-rich CMPs. First, we introduce an efficient cache management scheme for accelerators to mitigate memory latency by overlapping data transfer with computation. Second, we present a hardware resource management scheme for accelerator sharing. This scheme supports sharing and arbitration of multiple cores for a common set of accelerators, and it uses a software-based priority mechanism to provide feedback to cores that indicates the wait time before acquiring a particular resource. Finally, we propose architectural support that allows us to compose a larger virtual accelerator out of multiple smaller accelerators, and chain multiple accelerators together with minimal intervention of the requesting core. Experimental results show significant performance and energy improvement compared to approaches that use OS-based accelerator management, and achieve on average 9X in performance (up to 40.17X) and 32X in energy efficiency (up to 90X) over a software implementation, with minimal hardware overhead.
Sep 26, 2012
|Combined Loop Transformation and Hierarchy Allocation for Data Reuse Optimization Design Compiler|
External memory bandwidth is a crucial bottleneck in the majority of computation-intensive applications for both performance and power consumption. Data reuse is an important technique for reducing external memory accesses by utilizing the memory hierarchy. Loop transformation for data locality and memory hierarchy allocation are two major steps in the data reuse optimization flow, but they have been carried out independently. This paper presents a combined approach which optimizes loop transformation and memory hierarchy allocation simultaneously to achieve globally optimal results on external memory bandwidth and on-chip data reuse buffer size. We develop an efficient and optimal solution to the combined problem by decomposing the solution space into two subspaces with linear and nonlinear constraints respectively. We show that we can significantly prune the solution space without losing its optimality. Experimental results show that our scheme can save up to 31% of on-chip memory size compared to the separated two-step method when the memory hierarchy allocation problem is not trivial. Also, run-time complexity is acceptable for the practical cases.
Aug 27, 2012
|Energy-Efficient Pipeline Templates for High-Performance Asynchronous Circuits|
We present two novel energy-efficient pipeline templates for high-throughput asynchronous circuits. The proposed templates, called N-P and N-Inverter pipelines, use a single-track handshake protocol. There are multiple stages of logic within each pipeline. The proposed techniques minimize handshake overheads associated with input tokens and intermediate logic nodes within a pipeline template. Each template can pack a significant amount of logic in a single stage, while still maintaining a fast cycle time of only 18 transitions. Noise and timing robustness constraints of our pipelined circuits are quantified across all process corners. A completion detection scheme based on wide NOR gates is presented, which results in significant latency and energy savings, especially as the number of outputs increases. To fully quantify all design trade-offs, three separate pipeline implementations of an 8x8-bit Booth-encoded array multiplier are presented. Compared to a standard QDI pipeline implementation, the N-Inverter and N-P pipeline implementations reduced the energy-delay product by 38.5% and 44% respectively. The overall multiplier latency was reduced by 20.2% and 18.7%, while the total transistor width was reduced by 35.6% and 46% with N-Inverter and N-P pipeline templates respectively.
Aug 27, 2012
|Core Cannibalization Architecture: Improving Lifetime Chip Performance for Multicore Processors in the Presence of Hard Faults|
To improve the lifetime performance of a multicore chip with simple cores, we propose the Core Cannibalization Architecture (CCA). A chip with CCA provisions a fraction of the cores as cannibalizable cores (CCs). In the absence of hard faults, the CCs function just like normal cores. In the presence of hard faults, the CCs can be cannibalized for spare parts at the granularity of pipeline stages. We have designed and laid out CCA chips composed of multiple OpenRISC 1200 cores. Our results show that CCA improves the chips’ lifetime performances, compared to chips without CCA.
Aug 27, 2012
|Comparison of the Inhibition of Renshaw Cells During Subthreshold and Suprathreshold Conditions Using Anatomically and Physiologically Realistic Models|
Inhibitory synaptic inputs to Renshaw cells are concentrated on the soma and the juxtasomatic dendrites. In the present study, we investigated whether this proximal bias leads to more effective inhibition under different neuronal operating conditions. Using compartmental models based on detailed anatomical measurements of intracellularly stained Renshaw cells, we compared the inhibition produced by glycine/γ-aminobutyric acid-A (GABAA) synapses when distributed with a proximal bias to the inhibition produced when the same synapses were distributed uniformly (i.e., with no regional bias). The comparison was conducted in subthreshold and suprathreshold conditions. The latter were mimicked by voltage clamping the soma to 55 mV. The voltage clamp reduces nonlinear interactions between excitatory and inhibitory synapses. We hypothesized that for electrotonically compact cells such as Renshaw cells, the strength of the inhibition would become much less dependent on synaptic location in suprathreshold conditions. This hypothesis was not confirmed. The inhibition produced when inhibitory inputs were proximally distributed was always stronger than when the same inputs were uniformly distributed. In fact, the relative effectiveness of proximally distributed inhibitory inputs over uniformly distributed synapses was greater in suprathreshold conditions than that in subthreshold conditions. The somatic voltage clamp minimized saturation of inhibitory driving potentials. Because this effect was greatest near the soma, the current produced by more distal synapses suffered a greater loss because of saturation. Conversely, in subthreshold conditions, the effectiveness of proximal synapses was substantially reduced at high levels of background synaptic activity because of saturation. Our results suggest glycine/GABAA synapses on Renshaw cells are strategically distributed to block the powerful excitatory drive produced by recurrent collaterals from motoneurons.
Aug 27, 2012
|Application Exploration for 3-D Integrated Circuits: TCAM, FIFO, and FFT Case Studies|
3-D stacking and integration can provide system advantages. This paper explores application drivers and computer-aided design (CAD) for 3-D integrated circuits (ICs). Interconnect-rich applications especially benefit, sometimes up to the equivalent of two technology nodes. This paper presents physical-design case studies of ternary content-addressable memories (TCAMs), first-in first-out (FIFO) memories, and an 8192-point fast Fourier transform (FFT) processor in order to quantify the benefit of the through-silicon vias in an available 180-nm 3-D process. The TCAM shows a 23% power reduction and the FFT shows a 22% reduction in cycle-time, coupled with an 18% reduction in energy per transform.
Jul 25, 2012
|Argus: Low-Cost, Comprehensive Error Detection in Simple Cores|
Argus, a novel approach for detecting errors in simple processor cores, dynamically verifies the correctness of the four tasks performed by a von Neumann core: control flow, data flow, computation, and memory access. Argus detects transient and permanent errors, with far lower impact on performance and chip area than previous techniques.
Jul 25, 2012
|Design of a Link-Controller architecture for Multiple Serial Link Protocols|
This paper introduces a novel Multi-mode Serial Link Controller (MMSLC) for the logic physical layer (PHY) and data link layer (DLL) of USB 3.0, PCIe 2.0 and SATA 3.0. Functions defined in these protocols are grouped based on qualifying similarities and workload. The framework consists of a configurable circuit, a programmable accelerator and an event processor for flexible implementation. This MMSLC can essentially substitute for three individual link controllers across protocols, thus achieving area reduction. An RTL-level implementation is completed, and the synthesis results are shown at the end of this paper.
Jul 25, 2012
|A Two-Step Readout CMOS Image Sensor Active Pixel Architecture|
In this paper, we introduce a 5-transistor (5T) active pixel sensor (APS) structure and a specialized oscillator readout circuit. The pixel keeps a reasonable fill factor of 43% using an n-well and p-sub photodiode with an area of 5 µm x 5 µm and generates a two-step signal response to the illumination. The pixel successfully extends output swing to 0.72 V. Measured pixel random noise is 2.5 mV, achieving 51 dB signal-to-noise ratio (SNR). A readout circuit is also implemented using a ring oscillator to replace the traditional design with analog-to-digital converter (ADC) circuitry. It generates a frequency output that is recorded by counters to perform signal digitization. The design is implemented with an array of 32 x 92 pixels in a 0.13 µm digital CMOS process and tested with a 1.25 V supply voltage.
Jul 25, 2012
|Asymmetric Drain Spacer Extension (ADSE) FinFETs for Low-Power and Robust SRAMs|
In this paper, we analyze and optimize FinFETs with asymmetric drain spacer extension (ADSE) that introduces a gate underlap only on the drain side. We present a physics-based discussion of current–voltage relationships, short channel effects, and leakage and show the application of ADSE FinFETs in 6T static random access memory (SRAM) bit cell. By exploiting asymmetry in current, we show that it is possible to achieve improvement in both read and write stability for the 6T SRAM bit cell, along with reduction in cell leakage at the cost of negligible increase in access time and area. We also propose a general circuit-aware device optimization methodology for SRAM design. We use this methodology to optimize the underlap in ADSE FinFETs. Compared to conventional FinFETs, we achieve 57% reduction in leakage, 11% improvement in read static-noise margin, and 6% improvement in write margin, with 7% increase in access time and cell area.
Jun 27, 2012
|Stage Number Optimization for Switched Capacitor Power Converters in Micro-Scale Energy Harvesting|
Micro-scale energy harvesting has become an increasingly viable and promising option for powering ultra-low power systems. A power converter is a key component in microscale energy harvesting systems. Various design parameters of the power converter, most notably the number of stages in a multi-stage power converter, play a crucial role in determining the amount of electrical power that can be extracted from a micro-scale energy transducer such as a miniature solar cell. Existing stage number optimization techniques for switched capacitor power converters, when used for energy harvesting systems, result in a substantial degradation in the amount of harvested electrical power. To address this problem, this paper proposes a new stage number optimization technique for switched capacitor power converters that maximizes the net harvested power in micro-scale energy harvesting systems. The proposed technique is based on a new figure-of-merit that is well suited for energy-harvesting systems. We have validated the proposed technique through circuit simulations using IBM 65nm technology. Our simulation results demonstrate that the proposed stage number optimization technique results in an increase of 60% - 290% in net harvested power, compared to existing stage number optimization techniques.
Jun 27, 2012
|Accurate and Scalable IO Buffer Macromodel Based on Surrogate Modeling|
In this paper, a new method is proposed to generate accurate and scalable macromodels for input/output buffers. The method characterizes the physically based model elements with adaptive multivariate surrogate modeling techniques in order to achieve high fidelity and process–voltage–temperature scalability. Both single-ended and differential output buffer circuit examples demonstrate that the proposed modeling method offers good accuracy and flexible scalability to facilitate signal integrity analysis.
Jun 27, 2012
|Cubic Ring Networks: A Polymorphic Topology for Network-on-Chip|
As chip multiprocessors transition from multi-core to many-core, on-chip network power is increasingly becoming a key barrier to scalability. Studies have shown that on-chip networks can consume up to 36% of the total chip power, while analysis of network traffic reveals that for extended periods of execution time, network load is well below the network capacity in many applications. In recent studies, researchers have proposed to exploit this temporal variability in network traffic to dynamically turn off links, buffers and segments of the on-chip routers. In this work, we make the case for a polymorphic topology, called Cubic Ring (cRing), that allows dynamically turning off over 30% of resources in a 2D network (and more in higher dimensional networks), with less than 5% increase in average distance. As a result, cRing networks provide an elegant way to trade off network bandwidth for lower (static) power. A complete formalism for the proposed cRing topologies and the associated routing algorithm is presented, along with evaluation under synthetic workloads.
Jun 27, 2012
|A Novel Source-Body Biasing Technique for RF to DC Voltage Multipliers in 0.18μm CMOS Technology |
This paper presents a novel source-body biasing technique for RF to DC voltage multipliers designed in 0.18 μm CMOS Technology for applications where CMOS integration is required. The proposed technique increases voltage gain and efficiency by cancelling body effect and reverse leakage currents. Simulation results using HSPICE software are presented to verify and illustrate the technique by applying it to different topologies tested at different frequencies. Results show that Peak Conversion Efficiencies (PCE) as high as 16.7% can be achieved.
May 10, 2012
|Energy Efficient Many-core Processor for Recognition and Mining using Spin-based Memory|
Emerging workloads such as Recognition, Mining and Synthesis present great opportunities for many-core parallel computing, but also place significant demands on the memory system. Spin-based devices have shown great promise in enabling high-density, energy-efficient memory. In this paper, we present the design and evaluation of a many-core domain-specific processor for Recognition and Data Mining (RM) using spin-based memory. The RM processor has a two-level on-chip memory hierarchy consisting of a streaming access first-level memory and a random access second-level memory. Based on the memory access characteristics, we suggest the use of Domain Wall Memory (DWM) and Spin Transfer Torque Magnetic RAM (STT MRAM) to realize the first and second levels, respectively. We develop architectural models of DWM and STT MRAM, and use them to evaluate the proposed design and explore various architectural tradeoffs in the RM processor. We evaluate the proposed design by comparing it to a CMOS based design at the same 45nm technology node. For three representative RM algorithms (Support Vector Machines, k-means clustering, and GLVQ classification), the iso-area spin memory based design achieves an energy-delay product improvement of 1.5X-3X. Our results suggest that spin based memory technologies can enable significant improvements in energy efficiency and performance for highly parallel, data-intensive workloads.
May 10, 2012
|VEDA: Variation-aware Energy-efficient Discrete wavelet transform Architecture|
In this paper, we present a unified approach to an energy-efficient, variation-tolerant design of the Discrete Wavelet Transform (DWT) in the context of image processing applications. Most image processing applications do not require exactly correct numerical outputs. We exploit this important feature and propose a design methodology for DWT which shows energy-quality tradeoffs at each level of design hierarchy, starting from the algorithm level down to the architecture and circuit levels, by taking advantage of the limited perceptual ability of the Human Visual System. A unique feature of this design methodology is that it guarantees robustness under process variability and facilitates aggressive voltage over-scaling. Simulation results show significant energy savings (74%-83%) with minor degradations in output image quality, and show that the design averts catastrophic failures under process variations, compared to a conventional design.
May 10, 2012
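The observation that image outputs need not be numerically exact can be illustrated with a one-level Haar transform, the simplest wavelet. This is a generic sketch and not the VEDA architecture: the abstract does not say which wavelet or approximation the authors use, and the unnormalized average/difference form, the threshold, and the sample pixel row below are all assumptions chosen for illustration.

```python
def haar_forward(signal):
    """One-level Haar transform: pairwise averages and half-differences."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Rebuild the samples from averages and half-differences."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([a + d, a - d])
    return out

def drop_small_details(detail, threshold):
    """Zero every detail coefficient whose magnitude is below the threshold."""
    return [d if abs(d) >= threshold else 0 for d in detail]

# Hypothetical row of 8-bit pixel values (even length required).
pixels = [100, 102, 98, 97, 150, 90, 91, 91]
approx, detail = haar_forward(pixels)
exact = haar_inverse(approx, detail)
lossy = haar_inverse(approx, drop_small_details(detail, threshold=2))
max_error = max(abs(p - q) for p, q in zip(pixels, lossy))
```

Here discarding the small detail coefficients changes no pixel by more than one gray level while the sharp edge is preserved, which hints at how much numerical slack such workloads can offer.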
|Low-Power AES Coprocessor in 0.18μm CMOS Technology for Secure Microsystems|
This paper presents an implementation of the Advanced Encryption Standard (AES) algorithm in 0.18 µm CMOS technology. The core module was found to be low power and low area and is meant to act as a cryptographic coprocessor in microsystems requiring additional security. The design was fabricated using Canadian Microelectronics Corporation's (CMC) digital design flow and packaged using a 40 pin Dual-inline Package. A potential system architecture using the fabricated module is also presented.
May 10, 2012