aCortex: An Energy-Efficient Multipurpose Mixed-Signal Inference Accelerator

MOHAMMAD BAVANDPOUR®, MOHAMMAD R. MAHMOODI® (Graduate Student Member, IEEE), and DMITRI B. STRUKOV® (Senior Member, IEEE)
Department of Electrical and Computer Engineering, University of California at Santa Barbara, Santa Barbara, CA 93117 USA
CORRESPONDING AUTHOR: M. BAVANDPOUR (mbavandpour@ece.ucsb.edu)

ABSTRACT We introduce “aCortex,” an extremely energy-efficient, fast, compact, and versatile neuromorphic processor architecture suitable for the acceleration of a wide range of neural network inference models. The most important feature of our processor is a configurable mixed-signal computing array of vector-by-matrix multiplier (VMM) blocks utilizing embedded nonvolatile memory arrays for storing weight matrices. Analog peripheral circuitry for data conversion and high-voltage programming are shared among a large array of VMM blocks to facilitate compact and energy-efficient analog-domain VMM operation of different types of neural network layers. Other unique features of aCortex include configurable chain of buffers and data buses, simple and efficient instruction set architecture and its corresponding multiagent controller, programmable quantization range, and a customized refresh-free embedded dynamic random access memory. The energy-optimal aCortex with 4-bit analog computing precision was designed in a 55-nm process with embedded NOR flash memory. Its physical performance was evaluated using experimental data from testing individual circuit elements and physical layout of key components for several common benchmarks, namely, Inception-v1 and ResNet-152, two state-of-the-art deep feedforward networks for image classification, and GNTM, Google’s deep recurrent network for language translation. The system-level simulation results for these benchmarks show the energy efficiency of 97, 106, and 336 TOp/J, respectively, combined with up to 15 TOp/s computing throughput and 0.27-MB/mm² storage efficiency. Such estimated performance results compare favorably with those of previously reported mixed-signal accelerators based on much less mature aggressively scaled resistive switching memories.

INDEX TERMS Artificial neural networks, floating-gate memory, machine learning, mixed-signal circuits, neuromorphic inference accelerator, nonvolatile memory (NVM).

I. INTRODUCTION

THE rapidly growing range of applications of machine learning for image classification, speech recognition, and natural language processing along with maturing of the neural network algorithms, especially for deep learning, led to an urgent need in specialized neuromorphic hardware [1]–[3]. At least, for the next several years, the demand for fast, low-precision inference accelerators will remain higher than for higher-precision systems for network training, as projected by NVidia Corporation, a leading company in the machine learning hardware [4].

The vast majority of the proposed neuromorphic accelerators from industry and academia are digital [5]–[8]—see also an extensive review of various proposals in [2]. The most natural approaches, however, are based on analog and mixed-signal circuits [9]–[30]. Though the core principles of analog computing had been developed almost four decades ago [9], [10], its efficient implementations were enabled only recently [14]–[30] due to the emergence of novel continuous-state, nonvolatile, memory devices [31], [32]. Such memories enable very dense implementation of weights and in-memory computing for vector-by-matrix multiplication, the most common operation in machine learning. Among different candidates, the resistive switching memories, including phase change and conductive bridge memories, and metal-oxide memristors (also known as ReRAM or RRAM [31]) are perhaps the most promising due to their excellent scaling prospects. Their technology, however, is still in need of improvement, which is less of a problem for another excellent candidate, floating gate (FG) memories, e.g., those based on redesigned commercial-grade-embedded NOR flash [22], [32], [33]. Though planar FG cells are less dense than passively integrated memristors, their main advantage is FG cell amplification, which simplifies and reduces the overhead...
of peripheral circuitry. It is worth noting that the limited endurance of memristors and FG memories is less of an issue for inference applications since the weights are typically reprogrammed infrequently.

In this article, we present a multipurpose inference accelerator, dubbed “aCortex,” which is designed to capitalize on in-memory mixed-signal computing with nonvolatile memories. Though the idea of employing a mixed-signal vector-by-matrix multiplier (VMM) based on nonvolatile memories for multipurpose inference accelerators is not new [23]–[27], our work is novel in several aspects. Its key advantage is more extensive use of analog computing not only for VMM computation but also for data transfer. Such an approach minimizes the area/energy/delay overhead of the sensing and data conversion peripheries that are key factors limiting the efficiency of the mixed-signal neural accelerators [23]–[30]. A more compact design, in turn, reduces communication overhead due to shorter distances for data transfer. Moreover, data transfer overhead is further reduced by using a configurable chain of buffers exploiting the data reuse for convolution operation, programmable data buses that can be efficiently tailored on the fly to a particular utilization of mixed-signal array, and a custom-designed refresh-free embedded dynamic random access memory (eDRAM) tailored to meet the retention time requirement. We also propose a simple and efficient instruction set architecture (ISA) along with a multiagent controller, which takes advantage of the eligible time-overlap between consecutive micro-operations while minimizing the instruction memory (IM) requirement. Finally, we developed a system-level estimator that imports the target network’s computational graph along with experimental and circuit-level simulation results for different architecture components, including digital-to-analog converters (DACs), analog-to-digital converters (ADCs), sense amplifiers, memory cells, digital blocks, and buses, maps the weight kernels onto the 2-D array of nonvolatile memory (NVM) blocks, and finally produces a comprehensive performance report considering various nonidealities, such as leakages and line parasitics. Using such a simulator, we perform a detailed performance analysis based on the actual layout in the 55-nm process with embedded NOR flash memory. Note that unlike many proposals based on emerging memory technologies, core components have been previously taped out using commercial processes and successfully tested, and we used such experimental data in our analysis.

In Section S.I in the Supplementary Material, we introduce today’s major neural layer types and present their hardware-friendly rearrangement targeting a weight-stationary implementation. The overall aCortex architecture and operation scheme, as well as the internal structure of its main components, are presented in Section II. This section also introduces the proposed ISA along with the controller architecture. More details on the ISA are provided in Section S.II in the Supplementary Material. The general framework for mapping applications into aCortex and our case study for three representative neural network inference tasks are provided in Section III. In Section IV, we provide the circuit diagram and experimental/simulation results for the FG-based implementation of aCortex’s core computing units in a 55-nm technology node. We then perform a design space exploration for architectural parameters and provide a detailed system-level report for a semi-optimal design point. Related prior works are discussed and compared with aCortex in Section S.III in the Supplementary Material.

II. aCortex ARCHITECTURE

A. TOP-LEVEL ARCHITECTURE

As shown in Fig. 1(a), the major processor’s components are auxiliary unit (AUX), microcontroller, main memory, and two mixed-signal processing units (MSPUs). Each MSPU includes a configurable chain of input digital buffer blocks (DBBs), a flexible 2-D array of VMM blocks, and an array of output neuron blocks. The architecture can be loosely characterized as Harvard weight-major type [2]. The instructions are stored in a microcontroller’s dedicated SRAM-based IM. All frequently changing data, i.e., input, output, and temporary data, are kept in eDRAM-based main memory, while fixed weights, which would be typically precomputed at ex-situ training, are stored in NVM arrays of MSPU’s VMM blocks.

The inference task is specified by a program code based on custom instructions and the corresponding set of neural layer weight matrices. Assuming that the code is loaded and all weights are set up accordingly, the inference is computed by loading input data to the main memory, executing code to perform the inference task, and storing the computed results back in the main memory. In particular, the stationary weight matrices corresponding to various network layers are packed in the 2-D array of VMM blocks [see Fig. 1(b)], and the inference is performed in a layer-by-layer manner by sequentially reading the layer input from the main memory into the DBBs, activating the appropriate VMM and neuron blocks to perform the target neural layer, and then temporarily storing the intermediate results in the main memory for computing the next layer. Note that some of the neural layers, such as CNN and LSTM, require multiple VMM operations with various input data on the same weight matrix to complete. In this case, the corresponding VMM/neuron blocks are activated multiple times during the execution of each neural network layer—more details on that are provided in Section III and Section S.I in the Supplementary Material.

Flexible activation of VMM/neuron blocks enables the compact implementation of a set of neural network layers with various VMM sizes while maximizing the energy efficiency (EE) by cutting off the active power consumption of unutilized VMM/neuron blocks. Moreover, aCortex minimizes the energy overhead of data transfer by cutting off the unutilized portion of data buses and effectively reducing their length via disabling further data propagation. For example, Fig. 1(c) shows the active blocks and buses when the processor is computing the first layer of the neural network shown in Fig. 1(b).

All inference data, i.e., inputs and outputs of the MSPU’s VMM blocks, as well as effective weight precision, are p bit. A set of p bits is defined as one data word. We specifically consider p = 4, which is typically sufficient for running state-of-the-art image classification inference without loss of functional performance [34], [35]. Blocks and buses are sized according to a global architecture parameter K, defined as the granularity of neural computation and data transfer on aCortex.
IEEE Journal on Exploratory Solid-State Computational Devices and Circuits

FIGURE 1. (a) aCortex top-level architecture. The location of the components crudely corresponds to the actual layout and is chosen to reduce data transfer overhead. For clarity, the architecture is shown for \( N = M = 4 \), and most of the control lines and the circuitry for testing/weight tuning are omitted. (b) Example of a weight kernel mapping on aCortex VMM blocks, layer-by-layer operation scheduling, and corresponding content of main memory over time. (c) aCortex active blocks/bus portions during the execution of layer #1 of the neural network shown in (b).

B. MAIN MEMORY

Main memory is implemented with eDRAM technology, whose retention time is tailored for a refresh-free operation. In particular, it is organized as an array of eDRAM blocks, each with a \( K_p \)-bit I/O data port. The multibanked structure allows us to read and write \( K \) data words simultaneously, i.e., to supply data to one DBB and receive data from one neuron block. The required memory capacity and its retention time are calculated by monitoring the memory content and the longest lifetime of intermediate data for the inference of the target neural networks [see Fig. 1(b)].

C. MIXED-SIGNAL PROCESSING UNIT: ARRAY OF VMM BLOCKS

Each MSPU is comprised of two \( N \)-by-\( M \) arrays (quadrants) of VMM circuit blocks located on each side of a column with \( N \) neuron blocks. Each VMM block features \( K \)-by-\( 2K \) array of NVM cells, which is suitable for implementing analog-mode differential \( K \)-by-\( K \) VMM operation; \( K_p \)-bit front-end DACs; and \( 2K \) back-end local sensing circuitry [see Fig. 2(a)]. The data to a single column of VMM block array are fed via \( K_p \)-bit wide digital programmable “VMM input” bus from the corresponding distributed memory buffer block. The VMM block outputs are connected via analog “VMM output” buses, which are \( 2K \) lines wide, to the corresponding neuron blocks. More circuit details on the data conversion and sensing/summation for the considered VMM design based on 2-D-NOR flash memory technology are provided in Section IV-A.

Column/row enable lines (denotes as CE/RE) span the MSPU quadrants in vertical/horizontal directions and are used to activate the desired VMM blocks at each processing step. Specifically, a given VMM block is activated only if its compute enable signal \((\text{CoE} = CE \land \text{RE})\) is equal to “1” [see Fig. 2(b)]. These control lines allow us to flexibly implement a wide range of VMM sizes (from \( K \times K \) to \( 2MK \times NK \)) while cutting off the active power consumption of unutilized VMM/neuron blocks. Moreover, a \( K_p \)-bit wide VMM input bus repeater with horizontally shared enable control line \( \text{RBE} \) is integrated into each VMM block [see Fig. 2(a)]

to speed up data propagation as well as to minimize its energy overhead by cutting the unutilized portion of the VMM input bus. The logic circuitry and single-bit registers producing CE and RE/RBE control lines are integrated into the digital buffer and neuron blocks, respectively.

The programming/erasure circuitry consists of decoders and level-shifters that are shared by NVM arrays of VMM

FIGURE 2. (a) Schematic of mixed-signal VMM and neuron blocks, including their connectivity and required control signals (blue). Neuron block includes four stages, namely, global sensing, rescaling unit, activation function, and ADC. (b) Control circuitry that facilitates flexible activation of the target VMM/neuron block and cuts unutilized portions of the VMM input and store buses. Square labeled with “c” denotes comparator for selecting a specific neuron. For clarity, some of the details, e.g., circuitry for setting up enable bits and some neuron control circuitry, are not shown.
blocks and are placed at the outer margins of the VMM block arrays [see Fig. 1(a)] and row/column access switches, placed at the periphery of each VMM block, controlling the applied signals.

**D. MIXED-SIGNAL PROCESSING UNIT: NEURON BLOCK**

A neuron block includes $K$ identical neuron units. All units perform in parallel summation/integration of the analog data supplied from the corresponding VMM output lines, rescale the integrated data (if needed), apply the selected activation function, and finally convert the results to the digital domain using ADCs [see Fig. 2(a)]. The rescaling unit enables a wide range of quantization ranges, e.g., needed to operate with different VMM input sizes, via adjusting neuron’s analog input amplitude to match the fixed operating input range of the activation functions and ADC units. The activation function can be selected from linear, rectified linear, sigmoid, or hyperbolic tangent types, to support a wide range of neural layers (see Section S.I in the Supplementary Material). Neuron block outputs are digitized with $p$-bit ADCs and temporarily latched in their embedded digital memory. The digital results are then passed via $Kp$-bit digital “store” bus to the main memory. Such data transfer, i.e., the “store” operation, is performed in one step per neuron block so that, e.g., a total of $N$ steps are required to transfer data from all neurons in one MSPU. The specific number of processor cycles required for each step varies based on the location of the neuron block, i.e., its distance from the main memory. In particular, the store bus data are passed via a $Kp$-bit 2-to-1 multiplexer in each neuron block. These multiplexers act as a bus repeater for the utilized portion of the store bus and are also used to decouple and deactivate its unutilized portion.

The RE and RBE signals (which control the target neuron block and the corresponding VMM blocks it serves and cut the VMM input bus, respectively) are configured with “row-select” single-bit flip-flops of the neuron blocks [see Fig. 2(b)]. Specifically, the row-select flip-flops of neuron blocks are connected via xor gates, with the output of the xor gate directly controlling the RE line. To preselect RE signals for the contiguous set of rows (e.g., rows 10–15), row-select bits of neuron blocks in the first (10), closest to the main memory, and in the last +1 (16) row of the selected set of rows are set to “1,” while others are set to “0.” This implementation results in activating all selected rows (i.e., setting $RE = 1$ for rows 10–15). On the other hand, using a simple nand gate that detects a transition from selected to unselected row, RBE is deasserted only for the last +1 row of the selected group of rows [see Fig. 2(b)], which effectively cuts the downstream VMM input bus.

**E. MIXED-SIGNAL PROCESSING UNIT: DIGITAL BUFFER BLOCK**

The data to DBBs in each quadrant are supplied from the main memory via $Kp$-bit wide digital “load” bus [see Fig. 1(a)]. The DBB’s internal logic circuitry is designed to flexibly support various data flow scenarios—from the simple load (Fig. 3(b): scenario #1) to “load and shift” configuration (Fig. 3(b): scenarios #2 and #3). The flexible “load and shift” configuration enables efficient computation of CNV layers with a wide range of specifications (filter size and stride) while taking advantage of the rowwise data reuse [see Fig. S1b in the Supplementary Material].

Specifically, the input to each DBB is supplied by a $Kp$-bit-wide 2-to-1 multiplexer, which selects the input source between the load bus (i.e., main memory) and the previous DBB. For simplicity, we assume that only one DBB can be loaded from the load bus at one step, by setting the address of the target DBB on the buffer address bus. Similarly to the data transfer from neurons, the number of processor cycles required for each step varies based on the location of the DBB.

The shift operations are supported by properly configuring a “shift-bit” flip-flop in each DBB and are masked by the “shift enable” line. In particular, the shift register configurations are programmed by setting shift bits to “1” for all the DBBs except for the first block of shift register [see Fig. 3(b)]. Setting shift bit to “1” of a particular DBB is performed concurrently with loading that DBB from load bus by asserting “shift mode” control line. (Note that rowwise execution of the CNV layer typically starts by loading all DBBs with row data from main memory, with all shift operations disabled by deasserting shift enable signal line so that all shift bits are typically configured at this time.) Once shift bits are properly configured, simultaneously with loading new data into DBB, the already loaded data in DBBs can be shifted between the remaining blocks of that shift register [e.g., shifted to the right in scenarios #2 and #3 of Fig. 3(b)]. This is performed with an internal “master shift command” signal that enables latching data from previous DBBs of the specific shift register whenever its shift bit and enable signals are set to “1.”

In addition, each DBB has a single-bit SR latch to specify the target VMM block columns, i.e., to set column enable (CE) signal, which is configured similar to shift-bit flip-flop. It also has a bus repeater on the load bus, which is disabled when the DBB is loaded, to stop downstream data propagation on the bus [see Fig. 3(a)].
F. MIXED-SIGNAL PROCESSING UNIT: VMM COMPUTATION

Assuming that the target VMM columns are preselected (CE=1) and the desired row-select bits are set, VMM operation is performed by applying a positive pulse to the "VMM_OP" command signal. This signal enables the neuron blocks associated with the selected rows as well as their input VMM blocks for which CE = 1 via propagating the RE and then CoE lines. Naturally, the VMM_OP pulse width should be longer than the worst case end-to-end VMM operation time, i.e., the time it takes from the moment inputs are applied to DAC to that of latching ADC outputs.

G. AUXILIARY UNIT

Auxiliary unit (AUX) is provisioned to perform less frequent digital computations in neuromorphic inference. In particular, this block is used to perform in parallel Kp-bit vector-by-vector operations, such as additions, subtractions, and fixed-precision multiplications. It also performs comparison in max-pooling operation, which is typically used in CNNs. (An average pooling, another typical operation, can be implemented directly in a mixed-signal domain with properly adjusted weights.) AUX consists of an array of arithmetic logic units (ALUs), multipliers, and internal registers. During max-pooling operation, the register holds the current maximum value and feeds it to one of the ALU’s inputs to compare with the next input value fed from the main memory through the load bus. The outputs of AUX blocks can be written back to the main memory via the store bus.

H. CONTROLLER AND INSTRUCTION SET ARCHITECTURE

The most typical operations on aCortex involve loading multiple DBB blocks with data from the main memory, performing the vector-by-matrix computation, and then moving the results from neuron blocks back into the main memory. Accordingly, aCortex controller includes three separate agents as loader, operator, and collector, each dedicated to performing one of these frequently used operations [see Fig. 4(a)]. These agents are configured and synchronized using the main controller. Such multiagent structure shrinks the code size, enables eligible time overlapping between these operations, and minimizes the engagement of the main controller, which, in turn, reduces the controller time overhead. The main components of each agent are listed in Fig. 4(c).

III. APPLICATION MAPPING

Application mapping process of a given neural network on aCortex involves checking accelerator resource requirements (i.e., main memory and MSPU) and producing a weight mapping and machine code to set up the accelerator. A similar process is also used in the accelerator design to estimate the required main memory and MSPU resources and to optimize their specifications (M, N, K) for the target range of applications. Such a process starts with extracting the computational graph and finding the optimal topological partitioning and ordering of the computational steps [see Fig. 5(a)]. In such a graph, the vertexes are computation kernels (i.e., neural layers), while the edges are input/output data. In general, finding the optimal sorting of a given graph is an NP-hard problem. However, the layer-by-layer operation scheme of aCortex resulted from our energy-optimal design, reduces the complexity of the problem. After obtaining the graph and scheduling order, the next step is to check the memory requirement throughout the inference steps. In aCortex, 3-D data are stored in the main memory in a row first, column second, and channel/feature map third order [see Fig. 5(b)]. Following this order, multiple channels of one data pixel (and adjacent pixels in a row in the case of CNV stride > 1) are always grouped into K-word-long data packs (see Fig. S1b in the Supplementary Material). One pack of K-word data is then mapped into a word line of the main memory block and can be read/written simultaneously. Such data placement in the main memory enables burst mode read/write using the controller’s loader/collector. Note that, due to quantization, such a scheme may result in underutilized memory, e.g., when the number of channels/feature maps is not divisible by K.

Considering such data arrangement, the memory usage after each inference step (i.e., neural layer) is calculated by

---

**FIGURE 4.** (a) Top-level representation of the multiagent controller and connectivity between different agents and the rest of the architecture. (b) Examples of agent operation timing diagram when the load operation is completed before (scenario #1) and after (scenario #2) the collect operation. (c) Main components of each controller’s agent.
drawing a cut in the computational graph that separates the already computed portions of the graph (network processing steps) from the upcoming ones and adding up all the edges that are crossed. Since, during the execution of each layer, both its inputs and outputs are present in the main memory, the upper bound for the total memory usage is calculated by adding up the memory usages for two consecutive cuts and subtracting their overlap edges.

We applied this algorithm to three studied networks, namely, Inception-v1 [36] and ResNet-152 [37] DNNs for image classification, and GNMT [38], Google’s neural machine translation network featuring a 16-layer LSTM network with bidirectional encoder layers, with the vector length of 1024 and the sequence size of 10. For example, Fig. 5(c) shows such a process for an Inception neural layer and an unfolded two-layer LSTM network. As shown in Fig. 5(c), the memory requirement is limited by the initial layers of the DNNs (for 4-bit computation with the data pack quantization of $K = 64$).

In the next step, the weight matrices are mapped into MSPUs using a greedy search algorithm [see Fig. 6(a)], for which input parameters are the number of available MSPU’s columns ($M'$), architecture granularity parameter ($K$), the number of tries (epoch), and the list of weight kernels (LoK). In one iteration of the algorithm, the kernels are first randomly ordered and then greedily mapped in a row-first manner, in the given order, to the array of VMM blocks. Such procedure is repeated epoch times, and mapping configuration with the smallest number of occupied VMM blocks is selected. Furthermore, the mapping process is repeated for different $M' \leq M$ to search for a square-shaped mapping of occupied VMM blocks [see Fig. 6(b)] to minimize the average data transfer distance between MSPU’s active VMM/neuron blocks and the main memory (and hence to increase energy efficiency)—see more discussion on that in Section IV-B. The results of this optimization process for our target networks are shown in Fig. 6(c)–(e).

The mapped locations of weight kernels are then used in the compilation process to generate the machine code for handling the data flow and operation of the network. The number of instructions for initializing and operating each layer type is estimated for the system-level analysis based on the proposed ISA (presented in Section II-H and Section S.II in the Supplementary Material).

IV. CIRCUIT DESIGN AND PERFORMANCE EVALUATION
A. 55-nm FG-BASED IMPLEMENTATION OF MIXED-SIGNAL BLOCKS

While aCortex can be realized with a variety of NVM technologies, the focus of this article is on a mature industrial-grade flash-memory technology that has already enabled
extremely compact and energy-efficient implementations of mixed-signal circuits [20]. The key advantage of such implementation, compared with, e.g., those based on resistive switching devices [31], is the FG cell’s inherent signal amplification and low operation currents that greatly relax the requirement for sensing circuitry gain and enable very compact peripheral circuits [20]. More specifically, the proposed architecture was evaluated for the implementation based on ESF-3 (embedded split-gate flash) technology in which FG devices were redesigned for analog computing applications [17].

The proposed mixed-signal circuits allow for efficient implementation of different types of activation functions, kernel sizes, and quantization ranges, which are essential features of multipurpose inference accelerator [see Fig. 7(a)]. In particular, the VMM operation is implemented using a gate-coupled design [17]. In such an approach, inputs and outputs are both encoded as currents, and the weights are encoded as the memory state of the array FG devices.

The front-end input conversion is realized using current steering (CS) DAC architecture, a viable choice considering its low power consumption, compact footprint, and fast turn ON/OFF time at relatively low precision. A 4-bit PMOS-based CS DAC circuits source the current into peripheral FG devices.

The developed current-mode global sensing circuit of a neuron block has excellent wideband current following behavior and provides very low input impedance while limiting drain voltage distortion. To reduce process variation overhead associated with an offset error, two additional FG devices are provisioned in each channel (drain line) and are used either to source or sink the input-referred offset current.

The input current scaling of a neuron block is implemented using a binary-weighted current mirror structure controlled by the multibit digital input [see Fig. 7(a)]. As already mentioned in Section II.D, this feature is needed to adjust quantization range in accordance with the maximum VMM circuit output currents (which would vary, e.g., with the size of the weight kernels) and, hence, to minimize losses in functional performance due to quantization of activation function outputs.

The 4-bit current-mode ADC has a 1-bit per stage cyclic design that generates the 4-bit digital output in four cycles. FG transistors are also employed for offset/compensation in the high-speed comparator and to generate reference currents. The sigmoid and ReLU activation functions are implemented directly in the ADC, without using any other additional circuitry, by appropriately choosing reference currents, i.e., quantization levels of the ADC. Linear and hyperbolic tangent functions are emulated via input biasing and appropriate weight scaling [see Fig. 7(b)]. DACs/ADCs are designed for unipolar data and can be reused, without any circuit modification, for bipolar data by utilizing offset-binary representation. More details on the circuit structure for the design of PVT-resilient CS DAC circuit and algorithmic ADC can be found in [20].

Fig. 7(c) provides the VMM performance results as a function of K based on the measurements of ESF3 memory devices and postlayout simulations of peripheral circuitry. These data are used to estimate system-level performance.

B. MAIN MEMORY AND BUSES IN 55-nm TECHNOLOGY NODE

The main memory is implemented using asymmetric 2T gain embedded DRAM cells with boosted power supply [39]. The retention time of eDRAM cells was changed to 100 μs by reducing leakage and adjusting biases, with a 99.9% bit yield confirmed by block-level Monte Carlo simulations [39]. The memory performance was modeled using the CACTI tool [40]. We also developed a bus area/energy/delay model as a function of bus length and repeater size using postlayout simulations considering all device/interconnect parasitics.

C. SYSTEM-LEVEL RESULTS AND DESIGN SPACE EXPLORATION

We have developed a software framework that utilizes the postlayout energy/speed/area metrics for all the aCortex’s building blocks (buffers, buses, DACs, ADCs, integrators, and digital circuits) to evaluate the system-level performance for any target DNN/RNN network. This framework uses the list of processing tasks for a given neural network to map the
VMM kernels on the NVM devices embedded in the VMM blocks and then generate a detailed performance report for the given set of architecture specifications. Using such a tool, we have performed a preliminary exploration of architectural parameters (i.e., $K$ and MSPU aspect ratio, $AR = M/N$) to optimize the processor performance for the aforementioned target neural networks. A detailed study of these benchmark networks (see Figs. 5 and 6) has shown that a 1MB MM is sufficient to store all intermediate data, while the flow control program requires at most 4KB of IM. Moreover, the controller energy/delay is estimated in an instruction-by-instruction manner in which the required machine code for initializing and performing each layer type has been evaluated.

Fig. S2 in the Supplementary Material shows the system-level EE, throughput, and area with respect to $K$ and $AR$ for these networks. Larger $K$ typically results in higher throughput due to wider bus widths and consequently higher data transfer rate. It also improves the EE by reducing the VMM block peripheral circuitry energy consumption (trend clearly seen in Fig. S2c in the Supplementary Material). Moreover, the increase in throughput results in lower leakage energy, which further increases EE. However, for the networks with medium and small weight kernels (i.e., ResNet-152 and Inception-v1), larger $K$ results in underutilization of active blocks and buses as well as an increase in the number of required VMM blocks to map the network, which, in turn, increases the energy overhead of the analog peripheral circuitry and buses. As such negative effects outweigh the positive ones, the overall system-level EE and throughput are getting worse for larger $K$. For DNNs, the load bus delay/energy typically plays a more significant role in the system throughput/EE compared with the store bus due to a higher input-to-output data transfer ratio in convolution operations. Such property leads to higher throughput/EE for smaller AR (i.e., relatively shorter load bus) in these networks. The opposite of this trend is observed for GNMT (trend clearly seen in Fig. S2c in the Supplementary Material). This trend does not hold due to network-specific weight packing efficiency with respect to AR.

The accelerator area decreases for larger values of $K$ due to the larger sharing factor of analog peripheries, hence smaller area overhead of DACs and sensing circuitry (see Fig. S2 in the Supplementary Material). This trend does not hold for Inception-v1 in which smaller weight kernels result in the VMM block underutilization and lower weight packing efficiency which outweigh the gain in peripheral circuitry area efficiency.

The results for different networks indicate $K = 64$ and $AR \approx 2$ as a semioptimal design point for which block utilization and load/store bus energy/delay are somewhat balanced. Detailed performance report and area/energy breakdowns for this design point are presented in Table 1. As these results show, the energy consumption is dominated by data transfer and intermediate data storage for DNNs with smaller size weight kernels, such as Inception-v1 and ResNet-152 networks. On the other hand, the energy consumption is dominated by sensing circuitry and DACs, even despite larger accelerator area (and, hence, larger data transfer energy consumption) for GNMT inference task. This is because GNMT inference involves larger size VMM operations (weight kernels), which ultimately leads to a larger compute-to-communication ratio and allows us to take better advantage of analog-domain computing. Also, note that area overhead of high-voltage programming/erasure and VMM peripheral circuits is quite low (compared with other NVM-based accelerators) due to their effective sharing, and, in fact, the area of aCortex is dominated by FG memory cells. Thus, the detailed results show that the integration density is one of the key properties of memory devices for the energy-efficient inference accelerators.

The performance comparison of aCortex against its major competitors (fully digital [5]–[7] and mixed-signal [23], [24]) shows that aCortex achieves significantly higher performance, especially for mobile/IoT applications, for which the storage efficiency (MB/mm$^2$) and EE are the most important metrics (see Fig. S3 in the Supplementary Material). In order to make a fair comparison, we performed a highly optimistic projection of the performance metrics for the mixed-signal architectures to 55-nm, 4-bit design point. According to these estimations, aCortex achieves $28\times$–$65\times$ improvement in EE over ISAAC [23]/PUMA [24] while maintaining a comparable SE and enduring a relatively small drop in throughput ($\sim 0.3\times$–$0.4\times$). Note that these architectures do not consider the overhead of programming/erasure circuitry that could impact the performance results.

### Table 1. aCortex system-level results and breakdowns *

<table>
<thead>
<tr>
<th>Network Specification</th>
<th>Inc-v1</th>
<th>ResNet-152</th>
<th>GNMT</th>
</tr>
</thead>
<tbody>
<tr>
<td>Number of Parameters</td>
<td>$7.2 \times 10^4$</td>
<td>$5.52 \times 10^2$</td>
<td>$1.3 \times 10^3$</td>
</tr>
<tr>
<td>Number of Operations</td>
<td>$5.2 \times 10^3$</td>
<td>$2.0 \times 10^3$</td>
<td>$2.6 \times 10^3$</td>
</tr>
<tr>
<td>Architectural Specfication</td>
<td>$K$</td>
<td>$M$</td>
<td>$N$ (top/bottom)</td>
</tr>
<tr>
<td>Main Memory Capacity</td>
<td>64</td>
<td>38</td>
<td>108/16</td>
</tr>
<tr>
<td>Main Memory Utilization (%)</td>
<td>47.8</td>
<td>59.8</td>
<td>5.07</td>
</tr>
<tr>
<td>VMM block Utilization (%)</td>
<td>67.5</td>
<td>87.6</td>
<td>100</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Energy Breakdown (%)</th>
<th>Main Memory</th>
<th>Sensing Circuitry</th>
<th>FG Arrays</th>
<th>DACs</th>
<th>Neuron Blocks</th>
<th>Programming/Erasure**</th>
<th>Others</th>
<th>Leakage</th>
<th>Buses</th>
<th>Performance Summary</th>
</tr>
</thead>
<tbody>
<tr>
<td>Area (mm$^2$)</td>
<td>37</td>
<td>146</td>
<td>293</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Power (mW)</td>
<td>20.6</td>
<td>26.5</td>
<td>44.5</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>EE (TOP/s)</td>
<td>97</td>
<td>106.2</td>
<td>353.8</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Throughput (TOP/s)</td>
<td>2.00</td>
<td>2.9</td>
<td>14.94</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

* Because of more optimal overapped execution and refined estimates for DAC circuits, the numbers are slightly adjusted compared to preliminary ones reported in Ref. [19]. ** The area overhead of programming/erasure circuitry is estimated as $(245.5M+120N) \times K \mu$m$^2$. 
REFERENCES

S.I. Neural Layers and Weight-Stationary Hardware Compatibility

Artificial neural networks (ANNs) consist of a common core computing “neuron” cell. Capturing the basic behavior of its biological counterpart, the artificial neuron calculates the weighted summation of inputs passing through a (non)linear activation function, \( f(\cdot) \), as \( y = f(\sum_{i=1}^{n} x_i w_i) \) where \( x \) and \( w \) are inputs and weights, respectively. Targeting various applications, different multi-layer ANNs with various layer-types (i.e. neuron connectivity and activation functions) have been developed [1]. We next briefly review the layer operation for today’s most popular ANN models, which, e.g., occupy 95% of Google’s data center workload [5], and present their weight-stationary hardware-friendly re-arrangement.

A. Fully-Connected (FC) Layer

FC is the most common ANN layer, e.g. in multi-layer perceptron (MLP) networks, which are used for classification, prediction, etc., and convolution neural networks (CNN), which are mainly used for image classification/recognition [1]. In the FC layer, each neuron in the input layer feeds all the output neurons through a set of weights. As shown in Fig. S1a, FC layer can be re-organized as a weight-stationary VMM followed by an activation function. In such VMM, weight matrix (green) is stationary, and the input elements (blue) are vertically shared and propagated through all the weight columns at the same time. Accordingly, all weight locations simultaneously perform multiply-and-add operation, and the outputs (shown with orange color) are calculated in parallel.

B. Convolutional (CNV) Layer

CNV is the core neural layer of CNNs, the dominant network in computer vision, which uses a special connectivity pattern to efficiently exploit the spatial locality of inputs while extracting image features. This layer type includes multiple channels of 3D weight matrices, each applied over the whole 3D input data in a sliding window fashion, to produce its corresponding output feature map as shown in Fig. S1b. This figure also shows our target scheme to map the CNV operation into a weight-stationary VMM structure. In this scheme, CNV operation is performed in a row-first manner in which one output pixel (for all the channels) is calculated at a time. At the input, the CNV row-wise data reuse is exploited using multiple “load and shift” chains of input buffers. This input buffer array feeds a stationary 2D weight matrix, which is a reshaped and stacked representation of the 3D CNV filters. Accordingly, the output vector represents different channels of one output pixel location at a time. Note that in this scheme, different input channels of each input pixel location (\( C_I \) elements) can be grouped in one input data pack without disturbing the data flow (Fig. S1b). Moreover, for the strides larger than unity (\( s > 1 \)), every \( s \) adjacent pixels in the row direction can also be grouped into one input data pack. Hence, for a given CNV operation, the data pack size can be any divisor of \( C_I d s \).

C. Recurrent Layer

Recurrent neural layers aim to extract and interpret the information encoded into the temporal locality in a sequence of inputs using a feedback connection and a sequential operation. Long-short-term-memory (LSTM) is one of the most popular recurrent layers, which is widely used in language translation and speech recognition [1]. Fig. S1c shows the original LSTM structure and its weight-stationary re-arrangement. As shown,
the LSTM’s computational effort is dominated by VMM operations for which the input vector is obtained via concatenating the current element of the input sequence and a hidden state \([h_{t-1}, x_t]\). Accordingly, the VMM’s weight matrix is obtained by stacking the forget (f), candidate (C), input (i), and output (o) weight matrices. The rest of the LSTM computation includes basic element-wise vector operations and recurrent data transfer for the next step. Note that unlike CNV, the LSTM computation includes bipolar inputs and multiple activation function types which calls for a more generalized computing scheme supporting such cases – see Fig. 7b and its discussion in Section IV.A. Moreover, LSTM layers typically have very large weight matrices resulting in a larger compute-to-communication ratio. Hence the computing efficiency typically plays a more significant role (compared to data transfer efficiency) in the overall efficiency of inference task for such layers.

S.II. INSTRUCTION SET ARCHITECTURE DETAILS

The main controller instructions are:

Agent configuration: CNF agent, (mstr), (nstr/bstr)

Configure the agent (i.e. loader/collector) with appropriate parameters such as memory stride (mstr) and neuron/buffer stride (nstr/bstr) which are immediate fields in the instruction.

Load: M2B rm, rb, cnt, smode, sen

Command the loader to load cnt data packs (each K words) starting from initial memory address specified by register rm into the digital buffer blocks starting from the initial address specified in register rb (assuming that strides are pre-configured). smode field specifies the new value loaded to the shift bit in each digital buffer block, and sen enables “load and shift” operation.

Compute: VMM nsf, af

Command the operator to start VMM computation while simultaneously configuring the neuron scaling factor (nsf) and activation function type (af).

Collect: N2M rn, rm, cnt

Command the collector to collect cnt data packs (each K words) from neuron blocks starting from initial address specified by register rm into the memory locations starting from initial address specified in register rm (assuming that strides are pre-configured).

Row select: RSEL n_addr

Set the row select bit in the neuron block specified by n_addr to “1”.

Synchronize: WAIT agent

Halt the main controller until the target agent finishes its task.

Reset: RST

Reset all the column and row select bits.

The remaining instructions include simple arithmetic (i.e. add/sub, addi/subi) and (non-)conditional control (i.e. jmp, djnz, call, return) instructions. Note that for the considered applications, all data in main memory are used before they have to be refreshed. Therefore, for simplicity, we will not discuss refresh operation, though its implementation is straightforward and can be performed explicitly using either M2B instruction or automatically by memory controller.

S.III PRIOR WORK

At the system-level, many efforts have been recently made to exploit the efficiency of mixed-signal operators and develop an efficient DNN/RNN processor architecture [23-30].

Specifically, ISAAC [23] and PUMA [24] architectures are 2D mesh structure of tiles where each tile contains several small fixed-size ReRAM-based VMM units (typically 128×128) with dedicated input/output peripheral circuitry. In these architectures, one shared memory is implemented in each tile for storing intermediate data and communication between VMMs, while the communication between the tiles are performed through a shared 2D bus structure. Such heavily granular multi-core design approach is followed with the aim of increasing the utilization, minimizing the data transfer overhead, and maximizing throughput via pipelining and parallel processing. However, data conversion and communication overhead due to partial VMM operation, static power consumption of the analog blocks, large area overhead of the neurons / DACs / ADCs, and large control and communication overhead between tiles/VMMs limit the performance of such architectures, especially when running relatively complex computational graphs such as of Inception and ResNet.

RENO [25] and Harmonica [26] are, respectively, a ReRAM-based reconfigurable neuromorphic computing accelerator and a heterogeneous computing system based on such accelerator. This accelerator utilizes a mixed-signal centralized mesh interconnect network to reduce DAC/ADC overhead while increasing the throughput via passing the analog output directly to the next layer. However, this approach is only optimized for fully-connected multi-layer networks and associative memories. Moreover, despite lowering the DAC/ADC overhead, the accelerator performance is impacted by costly mixed-signal routers.

Piplayer [27] explores the trade-off between hardware resource of ReRAM array and performance utilizing the notion of parallelism granularity targeting both training and inference. This architecture uses a spike-based integrate and fire scheme to eliminate DAC/ADC overhead. However, input multi-level encoding and output spike generation overhead still results in inferior efficiency.

PRIME [28] is a ReRAM based architecture proposing the application of morphable memory blocks with small extra add-on circuitry which can be configured as computational unit on demand. Such morphable memory blocks result in a compact and energy efficient design by reusing the memory block peripheries for computation. However, the performance of PRIME is negatively impacted by lack of data-reuse for convolution, high data conversion/transfer overhead due to small analog-domain VMM, i.e. kernel breakdown, and latency overhead due to SA/neuron sharing. Additionally, the very limited switching endurance of ReRAM makes the main idea of PRIME hardly practical.
Supplementary Information Fig. S2. Design space exploration for aCortex performance metrics, i.e. energy efficiency (TOp/J), throughput (TOp/s), and area (mm²), with respect to the key architectural parameters, i.e. granularity (K) and aspect ratio, (AR=M/N) for three benchmark neural networks (GNMT-1024 RNN; ResNet-152 and Inception-v1 DNNs).


Technology node 28 nm 28 nm 65 nm 32 nm 32 nm 55 nm
Approach digital digital digital ReRAM ReRAM 2D-NOR
Clock (MHz) 606 700 200 1200 1000 700
Precision (bits) 16 fixed point 8 fixed point 1-16 (4 here) 16 fixed point 16 fixed point 4 fixed point
Area (mm²) 88 330 16 85.4 90.6 293
Power (W) 20.1 40 297 65.8 62.5 0.044
Throughput (TOp/s) 5.54 92 1.38 39.9 52.31 14.94
SE (MB/mm²) 0.2 off-chip off-chip 0.74 (0.25*) 0.76 (0.257*) 0.273
EE (TOp/J) 0.286 0.43 11.6 0.35 (5.14*) 0.84 (12.09*) 335.8

Supplementary Information Fig. S3. (a) Performance comparison of aCortex with the state-of-the-art digital and mixed-signal neuromorphic processor architectures. Except for TPU, all performance results are based on simulations. * High optimistic mapping of performance metrics to 4-bit computing precision and 55-nm technology node. #The performance numbers do not include overhead of external memory access (weights/intermediate data). (b) Energy efficiency versus throughput scatter plot for the approaches listed in (a). The size of bubbles represents the area of the processors.

RAPIDNN [29] is also a ReRAM based architecture which aims to improve the hardware performance through minimizing the required computing precision while achieving similar network accuracy. Precision is reduced by utilizing a reinterpretation mechanism (non-linear quantization of inputs/weights/outputs based on statistical data). Moreover, in this architecture all neural functionalities are implemented inside the memory using a direct digital lookup table-based technique which eliminates costly DAC/ADC/neuron. However, semi-sequential VMM operation and lack of data-reuse for convolution result in performance drop for large scale neuromorphic applications especially those involving convolution operation. Besides, the architecture suffers from data encoding overhead despite eliminating data conversion overhead.