

# Training and operation of an integrated neuromorphic network based on metal-oxide memristors

M. Prezioso<sup>1</sup>\*, F. Merrikh-Bayat<sup>1</sup>\*, B. D. Hoskins<sup>1</sup>\*, G. C. Adam<sup>1</sup>, K. K. Likharev<sup>2</sup> & D. B. Strukov<sup>1</sup>

Despite much progress in semiconductor integrated circuit technology, the extreme complexity of the human cerebral cortex<sup>1</sup>, with its approximately 10<sup>14</sup> synapses, makes the hardware implementation of neuromorphic networks with a comparable number of devices exceptionally challenging. To provide comparable complexity while operating much faster and with manageable power dissipation, networks<sup>2</sup> based on circuits<sup>3,4</sup> combining complementary metal-oxide-semiconductors (CMOSs) and adjustable two-terminal resistive devices (memristors) have been developed. In such circuits, the usual CMOS stack is augmented with one<sup>3</sup> or several<sup>4</sup> crossbar layers, with memristors at each crosspoint. There have recently been notable improvements in the fabrication of such memristive crossbars and their integration with CMOS circuits<sup>5-12</sup>, including first demonstrations<sup>5,6,12</sup> of their vertical integration. Separately, discrete memristors have been used as artificial synapses in neuromorphic networks<sup>13-18</sup>. Very recently, such experiments have been extended<sup>19</sup> to crossbar arrays of phase-change memristive devices. The adjustment of such devices, however, requires an additional transistor at each crosspoint, and hence these devices are much harder to scale than metal-oxide memristors<sup>11,20,21</sup>, whose nonlinear current-voltage curves enable transistor-free operation. Here we report the experimental implementation of transistor-free metal-oxide memristor crossbars, with device variability sufficiently low to allow operation of integrated neural networks, in a simple network: a single-layer perceptron (an algorithm for linear classification). The network can be taught in situ using a coarse-grain variety of the delta rule algorithm<sup>22</sup> to perform the perfect classification of $3 \times 3$-pixel black/white images into three classes (representing letters). This demonstration is an important step towards much larger and more complex memristive neuromorphic networks.

In a hybrid CMOS/memristor circuit, the CMOS subsystem contacts each wire, and hence can address each memristor on the add-on crossbar(s), using a specific 'CMOL' area-distributed interface<sup>3,4</sup>. The basic idea of hybrid neuromorphic networks (so-called CrossNets<sup>2</sup>) is to use this opportunity to connect CMOS-implemented hardware models of neuron bodies with the memristive crossbar(s), whose wires play the parts of axons and dendrites and whose memristors mimic biological synapses. The simple, two-terminal, transistor-free topology of metal-oxide memristors may enable CrossNets to achieve extremely high density: much higher than that of pure-CMOS neuromorphic networks (including those based on CMOS-modelled memristors<sup>23</sup>, floating-gate<sup>24</sup> and ferroelectric<sup>25</sup> memory cells), and even higher than that of their biological prototypes. For example, a CrossNet based on a hybrid CMOS/memristor circuit with five layers of 30-nm-pitch crossbars, two memristors per synapse, and 10<sup>4</sup> synapses per neural cell would have an areal density of about 25 million cells per square centimetre, that is, higher than that in the human cerebral cortex, at comparable average connectivity<sup>1</sup>. Estimates show that such CrossNets may also provide comparable power efficiency, at a much higher operation speed: for example, an intercell signal transfer delay of about 0.02 ms (compared to about 10 ms in biological systems) at an easily manageable energy dissipation rate of about 1 W cm$^{-2}$.
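The areal-density figure quoted above can be checked with a back-of-envelope calculation. The sketch below assumes that each memristor occupies one pitch-by-pitch crosspoint footprint and ignores any CMOS overhead; the function name `cells_per_cm2` is illustrative only. Under these assumptions the estimate comes out at roughly 2.8 × 10<sup>7</sup> cells per square centimetre, the same order as the "about 25 million" stated in the text.

```python
# Back-of-envelope areal density of a CrossNet, using the figures quoted in the text.
# Assumption: one memristor per pitch-by-pitch crosspoint footprint, no CMOS overhead.

CM_IN_NM = 1e7  # 1 cm = 10^7 nm


def cells_per_cm2(pitch_nm, layers, memristors_per_synapse, synapses_per_cell):
    """Return the estimated number of neural cells per square centimetre."""
    crosspoints_per_layer = (CM_IN_NM / pitch_nm) ** 2
    memristors = crosspoints_per_layer * layers
    synapses = memristors / memristors_per_synapse
    return synapses / synapses_per_cell


density = cells_per_cm2(pitch_nm=30, layers=5,
                        memristors_per_synapse=2, synapses_per_cell=1e4)
print(f"{density:.2e} cells per cm^2")
```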

However, the practical implementation of such networks is still very challenging, owing to the specific physical mechanism of resistance change in most prospective metal-oxide-based memristors—a reversible modulation of the concentration profile of oxygen vacancies<sup>11,20,21</sup>. On the positive side, the atomic scale of the vacancy position modulation implies the possibility of memristor scaling down to few-nanometre dimensions, which has been confirmed by recent experiments<sup>26,27</sup>. On the negative side, such a small scale makes the device-to-device reproducibility of device parameters, most importantly the voltage required for memristor electroforming and switching<sup>20,21</sup>, difficult to achieve with the currently used fabrication technologies. Device variability is the main reason why the only (to our knowledge) demonstrations of memristive neuromorphic networks were based on disconnecting each memristor from the crossbar for individual forming, using either a crossbar with external (off-chip) wires<sup>18</sup>, or an individual switch transistor at each crosspoint<sup>19</sup>. Both these approaches are incompatible with the goal of reaching the extremely high density of neuromorphic networks discussed above.

The main goal of this work was an experimental demonstration of a fully operational neural network based on an integrated, transistor-free crossbar with metal-oxide memristors. To reach this goal, a large reduction of memristor variability was essential, and to achieve it, we used binary-oxide $Al_2O_3/TiO_{2-x}$ stacks (see inset to Fig. 1b). Their fabrication procedure was generally close to that described in ref. 27, but with the important difference of using low-temperature (<300 °C) reactive sputtering for film deposition, which enables monolithic three-dimensional integration. The stack was first optimized by conducting an exhaustive experimental search over a range of titanium dioxide compositions and layer thicknesses (from 5 nm to 100 nm) to find the parameter range providing the lowest forming voltages. Within that range, the device performance (most importantly the memristor uniformity and the current-voltage curve nonlinearity) was further optimized by varying the aluminium oxide thickness from 1 nm to 5 nm (Supplementary Information Section 1).

The optimized technology was then used to fabricate an integrated memristive crossbar with $12 \times 12$ devices (Fig. 1), with a few process modifications to increase the metal electrode thickness, so that the line resistances were reduced to about 800 $\Omega$ for the top layer of the crossbar and 600 $\Omega$ for its bottom layer. The crossbars retained the excellent uniformity of virgin (pre-formed) crossbar-integrated devices (see Supplementary Figs 3, 4 and 5), allowing individual electric forming and tuning of each memristor. The electroforming was performed by grounding the corresponding bottom electrode and applying a current-controlled ramp-up to the top electrode, while leaving all other line potentials floating (Supplementary Fig. 4). To minimize current leakage during the subsequent forming of other devices, each formed memristor was immediately switched into its low-current (OFF) state. The measured individual characteristics of the formed memristors were mostly similar to those of stand-alone devices, except for a somewhat smaller (~100) ON/OFF current ratio. This difference may be partly explained by current leakage through other crosspoints at the measurements, and partly by the somewhat smaller switching voltages used for the crossbar to lower the risk of device damage. In addition, some deviations from the optimal device performance could be caused by the electron-beam evaporation of thicker electrodes, which required breaking of the vacuum, as opposed to the fully in situ sputtering of single device layers, and their subsequent annealing (see Supplementary Information).

<sup>1</sup>Department of Electrical and Computer Engineering, University of California at Santa Barbara, Santa Barbara, California 93106, USA. <sup>2</sup>Department of Physics and Astronomy, Stony Brook University, Stony Brook, New York 11794, USA.

<sup>\*</sup>These authors contributed equally to this work.

**Figure 1** | **Memristor crossbar.** **a**, Integrated $12 \times 12$ crossbar with an $Al_2O_3/TiO_{2-x}$ memristor at each crosspoint. **b**, A typical current–voltage curve of a formed memristor. **c**, Absolute values of conductance change under the effect of 500-$\mu$s voltage pulses of two polarities, as a function of the initial conductance, for various pulse amplitudes. The inset in **b** shows the device cross-section schematically.

The fabricated memristive crossbar was used to implement a simple artificial neural network with the top-level (functional) scheme shown in Fig. 2. This is a single-layer perceptron<sup>22</sup> with ten inputs and three outputs, fully connected with  $10 \times 3 = 30$  synaptic weights (Fig. 2b).

As the scheme shows, the perceptron's outputs  $f_i$  (with i = 1, 2, 3) are calculated as nonlinear 'activation' functions:

$$f_i = \tanh(\beta I_i) \tag{1}$$

of the vector-by-matrix product components:

$$I_i = \sum_{j=1}^{10} W_{ij} V_j \tag{2}$$

Here $V_j$ with $j = 1, \ldots, 9$ are the input signals, $V_{10}$ is a constant bias, $\beta$ is a parameter controlling the function's nonlinearity, and $W_{ij}$ are adjustable (trainable) synaptic weights. Such a network is sufficient for performing, for example, the classification of $3 \times 3$-pixel black-and-white images into three classes, with nine network inputs $(V_1, \ldots, V_9)$ corresponding to the pixel values. We tested the network on a set of $N = 30$ patterns, including three stylized letters ('z', 'v' and 'n') and three sets of nine noisy versions, one set per letter, each version formed by flipping one of the pixels of the original image (see Fig. 2c). Because of the very limited size of the set, it was used for both training and testing.
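As an illustration of equations (1) and (2) and of how the noisy patterns are formed, the following minimal sketch builds one class of ten patterns (an original image plus its nine one-pixel flips) and evaluates the perceptron outputs. The glyph, the random weights and the helper names `one_pixel_flips` and `perceptron_outputs` are hypothetical; the actual letters are those of Fig. 2c. Inputs are scaled to ±0.1 V with a −0.1 V bias, as in the experiment.

```python
import numpy as np


def one_pixel_flips(image):
    """Return nine copies of a 3x3 image (+1/-1 pixels), each with a different pixel flipped."""
    flat = np.asarray(image).reshape(9)
    noisy = np.tile(flat, (9, 1))
    idx = np.arange(9)
    noisy[idx, idx] *= -1
    return noisy


def perceptron_outputs(W, V, beta):
    """Equations (1) and (2): f_i = tanh(beta * sum_j W_ij V_j)."""
    return np.tanh(beta * (W @ V))


# Hypothetical glyph (+1 = black, -1 = white); the real letters are those of Fig. 2c.
glyph = np.array([[1, 1, 1],
                  [-1, 1, -1],
                  [1, 1, 1]])
pixels = np.vstack([glyph.reshape(1, 9), one_pixel_flips(glyph)])  # 1 + 9 patterns

# Scale pixels to +/-0.1 V and append the constant bias input V_10 = -0.1 V.
V = 0.1 * np.hstack([pixels, -np.ones((10, 1))])  # shape (10 patterns, 10 inputs)

rng = np.random.default_rng(0)
W = rng.uniform(-50e-6, 50e-6, size=(3, 10))  # illustrative weights, in siemens
f = perceptron_outputs(W, V.T, beta=2e5)      # shape (3 outputs, 10 patterns)
print(f.shape)
```

Repeating the construction for three glyphs gives the 3 × 10 = 30 patterns of the benchmark set.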

**Figure 2** | **Pattern classification experiment (top-level description).** **a**, Input image. **b**, The single-layer perceptron for classification of $3 \times 3$ binary images. **c**, The used input pattern set. **d**, The flow chart of one epoch of the used *in situ* training algorithm. In **d**, the grey-shaded boxes show the steps implemented inside the crossbar, while those with solid black borders denote the only steps required to perform the classification operation.

**Figure 3** | **Pattern classification experiment (physical-level description).** **a**, An implementation of a single-layer perceptron using a $10 \times 6$ fragment of the memristive crossbar. **b**, An example of the classification operation for a specific input pattern (stylized letter 'z'), with the crossbar input signals equal to $+V_{\rm R}$ or $-V_{\rm R}$, depending on the pixel colour. (The read and write biases were always $V_{\rm R}=0.1~{\rm V}$ and $V_{\rm W}^{\pm}=\pm1.3~{\rm V}$, respectively.) **c**, An example of the weight adjustment in a specific (first positive) column, for a specific error matrix. At the step shown, only the synapses whose weights should be increased (marked by '+' in the table on the left) are adjusted, that is, the memristor conductances $G_{1,1}^+$, $G_{1,2}^+$, $G_{1,5}^+$, $G_{1,6}^+$ and $G_{1,9}^+$ are being increased.

**Figure 4** | **Pattern classification experiment: results.** **a**, Convergence of network outputs, during the training process, to the perfect value (zero), for six training runs from different initial states. **b**, The evolution of output signals, averaged over all patterns of a specific class. The inset in **a** shows the distribution of weights W in the initial state and immediately after epoch 21, when perfect classification is achieved for the first time for this particular run. The classification was considered successful when the output signal $f_i$ corresponding to the correct class of the applied pattern was larger than all other outputs. Such perfect classification was achieved, on average, after 23 epochs, with the standard deviation of ten epochs. The training illustrated by **b** was continued even after the perfect classification had been achieved on epoch 21, to verify that the difference between the output signals continued to increase (unlike the 'perceptron rule' training used in ref. 18).

Physically, each input signal was represented by a voltage $V_j$ equal to either +0.1 V or -0.1 V, corresponding, respectively, to the black or white pixel, while the bias input $V_{10}$ was equal to -0.1 V. Such coding makes the benchmark input set balanced, in particular ensuring that the sum of all input signals across all patterns of a particular class is close to zero, which speeds up the convergence process<sup>28</sup>. To sustain this balance at the network's output as well, each synapse was implemented with two memristors, so that the total number of memristors in the crossbar was $30 \times 2 = 60$. Using external electronics to enforce the virtual ground conditions on each column line, and to subtract currents flowing in the adjacent columns to form a differential output signal $I_i$, we ensured that Ohm's law applied to each column of the crossbar gave a result identical to equation (2), with differential weights:

$$W_{ij} = G_{ij}^{+} - G_{ij}^{-} \tag{3}$$

where  $G_{ij}^{\pm}$  is the effective conductance of each memristor, namely the I/V ratio at voltage 0.1 V. For our devices, these effective conductances were in the range 10–100  $\mu$ S, so that currents  $I_i$  were of the order of a few microamperes. Activation functions—see equation (1)—were also implemented, using external electronics, with the slope  $\beta=2\times10^5$  A<sup>-1</sup> chosen according to the recommendation in ref. 28, confirmed by our own computer simulations (Supplementary Fig. 10).
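The differential readout of equations (2) and (3) can be sketched numerically as follows. This is an idealized model, assuming perfectly linear devices, zero line resistance and ideal virtual grounds; the conductance values are random numbers drawn from the 10–100 µS range quoted above, not measured data, and `differential_currents` is a hypothetical helper name.

```python
import numpy as np


def differential_currents(G_plus, G_minus, V):
    """Differential column currents, I_i = sum_j (G+_ij - G-_ij) * V_j."""
    I_plus = G_plus @ V    # current collected by each 'positive' column
    I_minus = G_minus @ V  # current collected by each 'negative' column
    return I_plus - I_minus


rng = np.random.default_rng(1)
G_plus = rng.uniform(10e-6, 100e-6, size=(3, 10))   # siemens, illustrative
G_minus = rng.uniform(10e-6, 100e-6, size=(3, 10))  # siemens, illustrative

V = 0.1 * rng.choice([-1.0, 1.0], size=10)  # +/-0.1 V pixel inputs
V[9] = -0.1                                  # constant bias input V_10

I = differential_currents(G_plus, G_minus, V)  # amperes
f = np.tanh(2e5 * I)                           # equation (1) with beta = 2e5 A^-1
print(I, f)
```

With conductances in this range and 0.1 V inputs, each differential current is bounded by 10 × 90 µS × 0.1 V = 90 µA, and is typically a few microamperes.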

The network was trained *in situ*, that is, without using its external computer model, using the Manhattan update rule<sup>29</sup>, which is essentially a coarse-grain, batch-mode variation of the usual delta rule of supervised training<sup>22</sup>. At each iteration ('epoch') of this procedure, sketched in Fig. 2d, patterns from the training set were applied, one by one, to the network's input, and its outputs $f_i(n)$, where $n$ is the pattern number, were used to calculate the delta-rule weight increments:

$$\Delta_{ij}(n) = \delta_i(n) V_j(n)$$

with

$$\delta_i(n) = \left[ f_i^{(g)}(n) - f_i(n) \right] \left. \frac{\mathrm{d}f}{\mathrm{d}I} \right|_{I = I_i(n)} \tag{4}$$

Here  $f_i^{(g)}(n)$  is the target value of the *i*th output for the *n*th input pattern. (In our system these values were chosen to be +0.85 for the output corresponding to the correct pattern class, and -0.85 for the output corresponding to the wrong class.) Once all *N* patterns of the training set had been applied, and all  $\Delta_{ij}(n)$  calculated, the synaptic weights were modified using the following Manhattan update rule:

$$\Delta W_{ij} = \eta \operatorname{sgn} \sum_{n=1}^{N} \Delta_{ij}(n) \tag{5}$$

where $\eta$ is a constant that scales the training rate. (The only difference between the Manhattan update rule and the batch-mode delta rule is the binary quantization, expressed in equation (5) by the 'sgn' function, which simplifies the hardware implementation of the delta rule.)
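One epoch of equations (4) and (5) can be written compactly in software. The sketch below is an idealized software model, not the hardware procedure: it assumes that every weight changes by exactly ±η, and the toy inputs, targets and the value of `eta` are illustrative assumptions.

```python
import numpy as np


def manhattan_epoch(W, V, targets, beta, eta):
    """One batch epoch of the Manhattan update rule, equations (4) and (5).

    W: (outputs, inputs) weights; V: (N, inputs) patterns; targets: (N, outputs).
    """
    I = V @ W.T                         # equation (2), shape (N, outputs)
    f = np.tanh(beta * I)               # equation (1)
    dfdI = beta * (1.0 - f ** 2)        # derivative of tanh(beta * I)
    delta = (targets - f) * dfdI        # equation (4)
    grad_sum = delta.T @ V              # sum over n of delta_i(n) * V_j(n)
    return W + eta * np.sign(grad_sum)  # equation (5)


# Toy data (illustrative only): 5 random patterns, 4 inputs, 3 outputs.
rng = np.random.default_rng(0)
V = 0.1 * rng.choice([-1.0, 1.0], size=(5, 4))
targets = 0.85 * rng.choice([-1.0, 1.0], size=(5, 3))

W0 = np.zeros((3, 4))
W1 = manhattan_epoch(W0, V, targets, beta=2e5, eta=1e-6)
print(W1)
```

Because only the sign of the accumulated increment is used, the hardware needs to know just the direction of each weight change, which maps naturally onto fixed-amplitude set and reset pulses.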


Physically, in our system the weights were modified in parallel for each column of the crossbar (corresponding to a certain value of index *i* in the above formulas), using two sequential voltage pulses. Namely, first a 'set' pulse with amplitude  $V_W^+ = 1.3 \text{ V}$  was applied to increase the conductances of the synapses whose  $\Delta G$  values, calculated from equation (5), were positive; then a 'reset' pulse  $V_{\rm W}^-=-1.3~{\rm V}$  was applied to the remaining synapses of that column (see Fig. 3c). This fixed-amplitude pulse procedure followed the Manhattan update rule only approximately, because the actual training rate  $\Delta G$  depends on the initial conductance G of the memristor (see Fig. 1c and Supplementary Fig. 6). (For  $G = 20 \mu S$ ,  $\Delta G$  was close to  $+60 \mu S$  for the set pulse and  $-5 \mu S$  for the reset pulse, while for  $G = 65 \mu S$ , the changes were close, respectively, to  $+24 \mu S$  and  $-55 \mu S$ .) Owing to the specific (though quite representative<sup>11</sup>) switching dynamics of our devices, the best classification performance was achieved when the memristors had been initialized somewhere in the middle of their conductance range, around 35 µS (Supplementary Fig. 7b). At such initialization, the perfect classification was reached, on average, after 23 training epochs (see Fig. 4).
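To see why fixed-amplitude pulses follow the Manhattan rule only approximately, consider a toy model of the state-dependent conductance change. It simply interpolates linearly between the two operating points quoted above (G = 20 µS and G = 65 µS), holds the end values outside that interval, and clips the result to the 10–100 µS range. The linear interpolation and the clipping are assumptions made for illustration, not a device model from the paper.

```python
import numpy as np

# Operating points quoted in the text (all values in microsiemens).
G_POINTS = np.array([20.0, 65.0])    # initial conductance
SET_DG = np.array([60.0, 24.0])      # change after one +1.3 V set pulse
RESET_DG = np.array([-5.0, -55.0])   # change after one -1.3 V reset pulse


def pulse_update(G, set_pulse):
    """Toy model: conductance (uS) after one fixed-amplitude pulse.

    Assumption: linear interpolation between the two quoted points,
    constant extrapolation outside them, result clipped to 10-100 uS.
    """
    dG = np.interp(G, G_POINTS, SET_DG if set_pulse else RESET_DG)
    return np.clip(G + dG, 10.0, 100.0)


for G in (20.0, 35.0, 65.0):
    print(G, pulse_update(G, True), pulse_update(G, False))
```

In this model, a device near the low end of its range responds strongly to set pulses and weakly to reset pulses, and the opposite holds near the high end, so the effective step size differs from synapse to synapse.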

In summary, here we have experimentally demonstrated an artificial neural network using memristors integrated into a dense, transistor-free crossbar circuit. This crossbar performed, on the physical (Ohm's law) level, the analogue vector-by-matrix multiplication of equations (2) and (3), which is by far the most computationally intensive part of the operation of any neuromorphic network used repeatedly in the same environment. The other operations, described by equations (1), (4) and (5), were performed by external electronics, but they are much less critical for network performance, and in future, larger CrossNets may be (at least partly) assisted by CMOS subsystems. This is an important step towards the effective analogue-hardware implementation of much more complex neuromorphic networks, from multilayer-perceptron classifiers with deep learning<sup>30</sup> to elaborate CrossNet-based cognitive systems. Recent experiments<sup>27</sup> with similar but smaller (discrete) devices imply that such circuits may be scaled down to devices of 30 nm across or less, that is, to networks with a density of approximately 10<sup>10</sup> synapses per square centimetre in each crossbar layer.

#### Received 16 December 2014; accepted 19 March 2015.

1. Mountcastle, V. B. *The Cerebral Cortex* (Harvard Univ. Press, 1998).
2. Likharev, K. K. CrossNets: neuromorphic hybrid CMOS/nanoelectronic networks. *Sci. Adv. Mater.* **3**, 322–331 (2011).
3. Likharev, K. K. Hybrid CMOS/nanoelectronic circuits: opportunities and challenges. *J. Nanoelectron. Optoelectron.* **3**, 203–230 (2008).
4. Strukov, D. B. & Williams, R. S. Four-dimensional address topology for circuits with stacked multilayer crossbar arrays. *Proc. Natl Acad. Sci. USA* **106**, 20155–20158 (2009).
5. Xia, Q. et al. Memristor-CMOS hybrid integrated circuits for configurable logic. *Nano Lett.* **9**, 3640–3645 (2009).
6. Chevallier, C. J. et al. 0.13μm 64Mb multi-layered conductive metal-oxide memory. *Int. Solid-State Circuits Conf.* **10**, 260–261 (2010).
7. Miyamura, M. et al. Programmable cell array using rewritable solid-electrolyte switch integrated in 90 nm CMOS. *Int. Solid-State Circuits Conf.* **11**, 228–229 (2011).
8. Kawahara, A. et al. An 8Mb multi-layered cross-point ReRAM macro with 443MB/s write throughput. *Int. Solid-State Circuits Conf.* **12**, 432–434 (2012).
9. Kim, G. H. et al. 32 × 32 crossbar array resistive memory composed of a stacked Schottky diode and unipolar resistive memory. *Adv. Funct. Mater.* **23**, 1440–1449 (2013).
10. Kim, K.-H. et al. A functional hybrid memristor crossbar-array/CMOS system for data storage and neuromorphic applications. *Nano Lett.* **12**, 389–395 (2012).
11. Yang, J. J., Strukov, D. B. & Stewart, D. R. Memristive devices for computing. *Nature Nanotechnol.* **8**, 13–24 (2013).
12. Liu, T. et al. A 130.7-mm<sup>2</sup> 2-layer 32-Gb ReRAM memory device in 24-nm technology. *IEEE J. Solid-State Circuits* **49**, 140–153 (2014).
13. Jo, S. H. et al. Nanoscale memristor device as synapse in neuromorphic systems. *Nano Lett.* **10**, 1297–1301 (2010).
14. Chanthbouala, A. et al. A ferroelectric memristor. *Nature Mater.* **11**, 860–864 (2012).
15. Seo, K. et al. Analog memory and spike-timing-dependent plasticity characteristics of a nanoscale titanium oxide bilayer resistive switching device. *Nanotechnology* **22**, 254023 (2011).
16. Ohno, T. et al. Short-term plasticity and long-term potentiation mimicked in single inorganic synapses. *Nature Mater.* **10**, 591–595 (2011).
17. Ziegler, M. et al. An electronic version of Pavlov's dog. *Adv. Funct. Mater.* **22**, 2744–2749 (2012).
18. Alibart, F., Zamanidoost, E. & Strukov, D. B. Pattern classification by memristive crossbar circuits with ex-situ and in-situ training. *Nature Commun.* **4**, 2072 (2013).
19. Eryilmaz, S. B. et al. Brain-like associative learning using a nanoscale non-volatile phase change synaptic device array. *Front. Neurosci.* **8**, 205 (2014).
20. Waser, R., Dittman, R., Staikov, G. & Szot, K. Redox-based resistive switching memories. *Adv. Mater.* **21**, 2632–2663 (2009).
21. Wong, H. S. P. et al. Metal-oxide RRAM. *Proc. IEEE* **100**, 1951–1970 (2012).
22. Hertz, J., Krogh, A. & Palmer, R. G. *Introduction to the Theory of Neural Computation* (Perseus, 1991).
23. Pershin, Y. V. & Di Ventra, M. Experimental demonstration of associative memory with memristive neural network. *Neural Netw.* **23**, 881–886 (2010).
24. Hasler, J. & Marr, B. Finding a roadmap to achieve large neuromorphic hardware systems. *Front. Neurosci.* **7**, 118 (2013).
25. Kaneko, Y., Nishitani, Y. & Ueda, M. Ferroelectric artificial synapses for recognition of a multishaded image. *IEEE Trans. Electron. Dev.* **61**, 2827–2833 (2014).
26. Pi, S., Lin, P. & Xia, Q. Cross point arrays of 8 nm × 8 nm memristive devices fabricated with nanoimprint lithography. *J. Vacuum Sci. Technol. B* **31**, 06FA02–1 (2013).
27. Govoreanu, B. et al. Vacancy-modulated conductive oxide resistive RAM (VMCO-RRAM). *IEDM Tech Dig.* 10.2. 1–4 http://dx.doi.org/10.1109/IEDM.2013.6724599 (2013).
28. LeCun, Y., Bottou, L., Orr, G. B. & Müller, K.-R. Efficient backprop. *Lect. Notes Comput. Sci.* **7700**, 9–48 (2012).
29. Schiffmann, W., Joost, M. & Werner, R. *Optimization of the Backpropagation Algorithm for Training Multilayer Perceptrons* https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.53.6869&rep=rep1&type=pdf (Technical Report, Institute of Physics, University of Koblenz, 1994).
30. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. *Neural Inf. Process. Systems* **12**, 1097–1105 (2012).

**Supplementary Information** is available in the online version of the paper.

**Acknowledgements** We acknowledge useful discussions with F. Alibart, I. Kataeva, W. Lu, L. Sengupta, S. Stemmer, and E. Zamanidoost. This work was supported by the AFOSR under the MURI grant FA9550-12-1-0038, by DARPA under contract number HR0011-13-C-0051UPSIDE via BAE Systems, and by the DENSO Corporation, Japan.

**Author Contributions** M.P., F.M.-B., B.D.H., K.K.L., and D.B.S. designed the research. M.P., B.D.H., and G.C.A. performed fabrication and device testing. M.P. and F.M.-B. performed pattern classifier experiments. All authors discussed and interpreted results. M.P., K.K.L., and D.B.S. wrote the manuscript. K.K.L. and D.B.S. advised on all parts of the project.

**Author Information** Reprints and permissions information is available at www.nature.com/reprints. The authors declare no competing financial interests. Readers are welcome to comment on the online version of the paper. Correspondence and requests for materials should be addressed to M.P. (mprezioso@ece.ucsb.edu) and D.B.S. (strukov@ece.ucsb.edu).

### **Supplementary Information**

#### 1. Material and device stack optimization

Lower electroforming voltages reduce the electrical stress as well as current overshoot during the forming, which is a known risk factor contributing to device variability<sup>1</sup>. It had been noticed that lower forming voltages may be achieved in devices with higher conductivity, obtained by a combination of oxide layer thickness reduction and stoichiometry adjustment<sup>2,3</sup>. In our experiments, the oxygen concentration was continuously reduced by controlling the oxygen flow rate during the layer growth to the point at which resistive switching nearly ceased. The switching voltages of such devices were close to, or less than, the electroforming voltage. However, many devices could not be turned off after switching and instead were shunted, or in some cases exhibited spontaneous complementary resistive switching. The observed nonlinearity of the I-V curves of such devices was also not large enough for crossbar operation. To address these issues, an Al<sub>2</sub>O<sub>3</sub> layer was added to the device stack. (Multilayers and bilayers had been used previously to improve device performance<sup>4,5</sup>.) An excessive increase of the Al<sub>2</sub>O<sub>3</sub> barrier thickness causes an increase of the forming voltage, so that an optimum thickness had to be selected. As a result of material and stack optimization, the most suitable thicknesses of the TiO<sub>2-x</sub> and Al<sub>2</sub>O<sub>3</sub> layers turned out to be close to 30 nm and 4 nm, respectively (Fig. S1b).

#### 2. Electrical characterization of single devices

To perform electrical characterization, single 200 nm  $\times$  200 nm devices of the dog-bone geometry (Fig. S1b) were fabricated first. (The inset in that figure shows the material stack parameters; note that they are slightly different from those in the crossbar-integrated devices – cf. the inset in Fig. 1b of the main text.) Figure S1a shows typical switching curves of such single devices, obtained by applying bipolar voltage sweeps. (To exhibit the *I-V* curve nonlinearity better, Fig. S1a also shows Ohmic currents for several resistance values.)

Figure S1c shows representative endurance test results for single devices. The data are obtained by repeatedly (over 5,000 times total) applying a sequence of set (-2 V, 500 µs), readout (0.1 V, 1 ms), reset (2 V, 500 µs), and again readout voltage pulses to a single device. The figure does not reflect the fact that approximately 7% of the negative pulses failed to switch off the device; however, this behavior may be attributed to an imperfect endurance setup rather than any deep device problem. This conclusion is supported by the fact that the device was being switched after each failure event. Some aging (in the form of a slight degradation of the ON/OFF dynamic range) is also visible, but generally the devices are rather robust. For example, during our experiments with the crossbar circuit, each of its memristors had been subjected, on the average, to 200,000 set/reset pulses even prior to the successful classification experiment.

As customary in nonvolatile memory technology, the retention test shown in Fig. S1d was carried out with a sample heated to 350 K. The device was first switched into a certain resistive state, and then its resistance was measured repeatedly by applying a 100 mV bias every 2 seconds during a 50,000-second time interval. Such retention measurements were carried out for ON (highly conductive), OFF (highly resistive), and some intermediate states. Based on the measurements, the retention at room temperature is expected to exceed 10 years.



**Figure S1.** Isolated Pt/TiO<sub>2-x</sub>/Al<sub>2</sub>O<sub>3</sub>/Pt memristive devices: (a) typical switching and electroforming behavior of a single device, (b) micrograph of a single device and its stack's structure, (c) switching endurance under a stream of  $\pm 2$  V, 500-μs pulses, and (d) retention of 3 initial states at 350 K. To highlight the data trends, large markers on panels (c) and (d) show the results of every  $100^{th}$  measurement at the endurance test, and of every  $1,000^{th}$  measurement at the retention test. Note that panel (d) shows significantly lower OFF state currents (and hence much higher ON/OFF current ratio) as compared to those of panels (a) and (c) due to higher voltages applied to reset the device.

#### 3. Crossbar circuit fabrication and packaging

Crossbar lines, 200 nm wide and separated by 400 nm gaps, were formed on 4" silicon wafers covered by 200 nm of thermal SiO<sub>2</sub>. After the standard cleaning and rinse, fabrication started with e-beam evaporation of a Ta (5 nm)/Pt (60 nm) bilayer over a patterned photoresist to form the bottom electrodes ("rows"). After liftoff, the wafer was descummed by active oxygen dry etching at 200°C for 10 minutes. Then, a blanket film consisting of a 4-nm sputtered Al<sub>2</sub>O<sub>3</sub> barrier and a 30-nm TiO<sub>2</sub> switching layer was deposited from a fully oxidized target and a partially oxidized target, respectively. This bilayer was then removed by etching in an ICP chamber using CHF<sub>3</sub> plasma, while preserving it in the future crossbar area by a pre-deposited negative photoresist. After stripping the photoresist in the 1165 solvent for 3 h at 80°C, the wafer was cleaned using a mild descum procedure performed in a RIE chamber for 15 seconds with 10 mTorr oxygen plasma at 300 W. Next, the top electrodes ("columns") consisting of 15 nm Ti and 60 nm Pt were deposited and patterned using e-beam evaporation and liftoff. Finally, the wire bonding pads were formed by e-beam deposition of Cr (10 nm) / Ni (30 nm) / Au (500 nm). All lithographic steps were performed using a DUV stepper with a 248 nm laser. After fabrication and dicing, the dies were annealed in a reducing atmosphere (10% H<sub>2</sub>, 90% N<sub>2</sub>) for 30 minutes at 300°C.

A single die was wire-bonded (with gold wires) to the DIP40 package, using a thermosonic bonding process. The process was simplified due to thicker Au metallization of the outer contacts, which also helped to reduce overall wire resistance. Figure S2a shows an optical image of a die mounted onto the package. It shows, in particular, 6 gold wires bonded to the pads of each chip side, with a total of 12 wires for the columns (top and bottom sides of a crossbar) and 12 wires for the rows (left and right sides). Figure S2b shows an SEM image of the crossbar area of the chip.

#### 4. Electrical characterization and forming of crossbar circuits

A detailed electrical characterization was performed on a $10 \times 8$ section of the crossbar, which was later utilized in the classification experiment. All electrical characterizations were performed using the Agilent B1500A parameter analyzer. In addition, the Agilent B5250A switching matrix was employed for testing packaged crossbar circuits and carrying out the pattern classification experiment. The parameter analyzer and the switching matrix were controlled by a personal computer via a GPIB interface using a custom C code. All write and read pulses were 500 $\mu$s long. For the memristor adjustment, we used the "V/2 scheme", in which the selected rows and columns were voltage-biased at +V/2 and -V/2, respectively, so that only the selected device experienced the full voltage V. For device state readout, we voltage-biased the selected column, connected the selected row to a virtual ground, and physically grounded all the other lines.
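The effect of the V/2 scheme on each crosspoint can be illustrated with a short calculation. The sketch assumes that the selected row is held at +V/2, the selected column at -V/2 (the usual form of the scheme), and that all unselected lines are grounded; the sign convention (row potential minus column potential) and the function name are assumptions made for illustration.

```python
import numpy as np


def half_bias_voltages(n_rows, n_cols, sel_row, sel_col, V):
    """Voltage across every crosspoint, taken as row potential minus column potential."""
    row_v = np.zeros(n_rows)      # unselected rows assumed grounded
    col_v = np.zeros(n_cols)      # unselected columns assumed grounded
    row_v[sel_row] = +V / 2
    col_v[sel_col] = -V / 2
    return row_v[:, None] - col_v[None, :]


drops = half_bias_voltages(10, 8, sel_row=2, sel_col=5, V=1.3)
print(drops[2, 5])   # selected device: full voltage V
print(drops[2, 0])   # half-selected device on the selected row: V/2
print(drops[0, 0])   # unselected device: no voltage
```

Only the selected device sees the full write voltage; devices sharing one biased line with it see half of that voltage, and all others see none.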



**Figure S2**. Microphotographs of the  $12\times12$  crossbar with integrated Pt/Al<sub>2</sub>O<sub>3</sub>/TiO<sub>2-x</sub>/Pt memristors: (a) the bonded chip, and (b) zoom-in on the crossbar area. Figure 1a in the main text shows the further zoom-in on the crossbar.

Specifically, we first characterized the "virgin" devices (not electroformed yet). Figures S3a and S3b show the recorded map of conductances measured at 0.1 V and the corresponding histogram, while Fig. S3c shows a representative set of dc *I-V* curves. After this characterization, the electroforming procedure was performed. For the forming, a quasi-DC current ramp was applied to the selected column line, while the selected row line was grounded and all the remaining (unselected) lines were kept floating. Such a fixed-current technique protects a device from excessive stress during its electroforming, when the device's resistance drops sharply. To minimize the current leakage during forming, the already formed devices were switched to their low-conductance state. More particularly, the devices of a 2×2 subarray were formed first. Then the devices in additional rows and columns were formed, so that the subarray of formed devices was gradually increased: first to 3×3 devices, then to 4×4, and so on. Figure S4a shows the map of forming voltages for the working section of the crossbar, while Fig. S4b shows the corresponding histogram. Additionally, Fig. S4c shows the electroforming process dynamics (for the diagonal devices, which were formed last to complete forming of the corresponding subarray) on the [*I, V*] plane.

Following the electroforming procedure, we characterized the effective switching thresholds for all devices in the working section of the crossbar array (Fig. S5). The threshold set / reset voltages were measured by first programming devices to their high / low resistive states, and then applying a sequence of 500- $\mu$ s pulses of the appropriate polarity with gradually increasing amplitude to measure the smallest voltage that caused a resistance change by more than 2 k $\Omega$ . The evolution of device conductance during the reset and set switching is shown in Figs. S5a and S5b, respectively, while Figs. S5c and S5d show maps of the corresponding effective reset and set threshold voltages. Panel (e) of Fig. S5 presents the data from its panels (c) and (d) in the form of histograms. The devices marked with 'X' in the threshold maps could not be switched with the largest applied voltages.
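The threshold search described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the custom C code used in the experiment: the names `find_threshold` and `ToyDevice` are hypothetical, the toy device merely stands in for the real instrument calls, and the resistance change is assumed to be measured relative to the state before the pulse sequence.

```python
def find_threshold(apply_pulse, read_resistance, amplitudes, min_change_ohm=2e3):
    """Return the first pulse amplitude that changes the resistance by more
    than min_change_ohm, or None if no amplitude in the sweep does so.

    apply_pulse(v) and read_resistance() are caller-supplied callbacks
    (hypothetical stand-ins for the instrument control routines).
    """
    r_initial = read_resistance()  # assumed reference: state before the sweep
    for v in amplitudes:
        apply_pulse(v)
        if abs(read_resistance() - r_initial) > min_change_ohm:
            return v
    return None  # corresponds to devices marked 'X' in the threshold maps


class ToyDevice:
    """Hypothetical device: resistance drops by 5 kOhm once |v| >= threshold."""

    def __init__(self, threshold, resistance=50e3):
        self.threshold = threshold
        self.resistance = resistance

    def pulse(self, v):
        if abs(v) >= self.threshold:
            self.resistance -= 5e3

    def read(self):
        return self.resistance


# Sweep from 0.1 V to 1.5 V in 0.1 V steps (illustrative values only).
sweep = [0.1 * k for k in range(1, 16)]
device = ToyDevice(threshold=0.9)
print(find_threshold(device.pulse, device.read, sweep))
```

Returning `None` for a device that never switches keeps such devices distinguishable from those with a measured threshold.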



**Figure S3.** Pre-formed (virgin sample) characterization of a  $10\times8$  section of the crossbar: (a) color-coded map of device conductances measured at 0.1 V, (b) the corresponding histogram, and (c) dc *I-V* curves of four representative devices. The average conductance and the standard deviation of conductances represented in panels (b) and (c) are 0.43  $\mu$S and 0.08  $\mu$S, respectively.

Of these data, the fact most important for applications is that the spread in effective switching voltages is narrow enough to avoid the infamous half-select problem.<sup>6</sup> For example, application of voltage +1.4 V to any device (besides those marked with X) ensures its set adjustment, while half of that voltage (i.e. +0.7 V) is below the smallest observed set threshold voltage. This fact prevents disturbance of half-selected devices, which are connected to just one of the voltage-biased lines. The switching threshold data were also important for identifying a range of reasonable switching voltages, following two main criteria: balancing the set and reset dynamics, and avoiding permanent damage to the devices, hence prolonging their life.
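The half-select condition can be expressed as a simple numerical check. The Python sketch below is illustrative only: the function name `half_select_safe` is ours, the threshold list is hypothetical placeholder data rather than the measured map of Fig. S5, and devices that could not be switched are assumed to be excluded from the list.

```python
def half_select_safe(v_write, set_thresholds):
    """Return True if v_write sets every listed device while v_write / 2
    stays below every listed set threshold.

    set_thresholds: effective set threshold voltages (V) of the switchable
    devices (devices marked 'X' are assumed to be excluded beforehand).
    """
    thresholds = list(set_thresholds)
    full_ok = v_write >= max(thresholds)     # selected device sees the full voltage
    half_ok = v_write / 2 < min(thresholds)  # half-selected devices see half of it
    return full_ok and half_ok


# Hypothetical thresholds (V), chosen only to illustrate the check.
example_thresholds = [0.75, 0.8, 0.9, 1.0, 1.1]
print(half_select_safe(1.4, example_thresholds))
```

The same check applies to the reset transition after replacing voltages and thresholds with their magnitudes.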



**Figure S4.** Characterization of memristor electroforming in a 10×8 section of the crossbar. (a) Color-coded map of the forming voltages, and (b) their histogram. (c) Typical forming switching curves of the last-formed devices in each partial array (of the size indicated in the legend). The average forming voltage and the standard deviation for the data shown on panels (b) and (c) are, respectively, 1.91 V and 0.07 V.

The switching behavior of crossbar devices was further characterized in stress conditions similar to those imposed by the neuromorphic network training procedure. In particular, we have studied the evolution of device conductance due to a train of rectangular pulses of the same voltage magnitude. Figure S6 shows typical results of such an experiment, for a specific device. The experiment was repeated three times, with different pulse magnitudes, for both set and reset transitions. The plots clearly show a saturation effect: the conductance change produced by a single voltage pulse is gradually reduced (on average), so that the conductance approaches a certain value for each particular pulse magnitude. (This effect is summarized in Fig. 1c of the main text, which shows the final conductance change as a function of the initial conductance.) The saturation effect needs to be taken into account when conducting the neural network experiment: as stated in the main text, it is much better to initialize devices in the middle of their dynamic range, so that substantial conductance changes can be achieved at the beginning of the training process.



**Figure S5.** Characterization of the set and reset thresholds of formed memristors: (a, b) device conductance dynamics under applied voltage ramps of opposite polarities; (c, d) the effective threshold voltage maps for set and reset transitions, and (e) their histograms. The histograms in panel (e) do not include the 3 devices that could not be switched to the OFF state with voltages above -1.5 V; these devices are marked with crosses in panels (c) and (d). The average effective set / reset switching threshold voltages and their standard deviations, for the data shown in panel (e), are, respectively, 0.9 V / -1.17 V and 0.1 V / 0.12 V.



**Figure S6.** Evolution of a memristor's conductance (measured at 0.1 V) under the effect of 500-μs pulse trains of several magnitudes, for the (a) reset and (b) set switching. Note that unlike Figures S5a and S5b, this figure shows the change in conductance for fixed-magnitude pulses applied repeatedly to the same device. The dashed lines are just guides for the eye.

#### 5. Network training result analysis

The key metric of neural network classifier training is the difference it creates between network outputs induced by input patterns belonging to different classes. The two panels of Figure S7a show this difference for our memristor crossbar network. Namely, these are the histograms of the differences between currents in the "correct" and "incorrect" outputs for each of 30 input patterns. (Evidently, there are 60 such differences: for each of 30 inputs in 3 classes, there is one correct output and two incorrect ones.) In a perfectly trained classifier, all these differences have to be positive. The top panel shows that before training, the differences are small and have random signs. During the training the differences increased, so that, as the lower panel shows, after a sufficient number (in this particular case, 21) of training epochs, all of them have become positive, signaling 100% classification fidelity.
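The counting of output differences can be illustrated with a short Python sketch. The function names and the synthetic currents below are hypothetical and serve only to show how 30 patterns with 3 outputs each yield 60 differences; they are not the measured data of Fig. S7a.

```python
def output_margins(currents, labels):
    """Differences between the 'correct' output current and each 'incorrect' one.

    currents: list of per-pattern lists of output currents (one per class).
    labels:   list of correct class indices, one per pattern.
    """
    margins = []
    for outputs, correct in zip(currents, labels):
        for k, value in enumerate(outputs):
            if k != correct:
                margins.append(outputs[correct] - value)
    return margins


def perfectly_classified(currents, labels):
    """True if every correct-minus-incorrect difference is positive."""
    return all(m > 0 for m in output_margins(currents, labels))


# Synthetic example: 30 patterns, 3 classes, idealized output currents.
labels = [i % 3 for i in range(30)]
currents = [[1.0 if k == lab else 0.0 for k in range(3)] for lab in labels]
print(len(output_margins(currents, labels)), perfectly_classified(currents, labels))
```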

Figure S7b shows another useful way of understanding network dynamics during its training, namely the distributions of conductances at three phases of the training process. (Note that the same information, but for only two phases, is presented in the inset of Fig. 4a of the main text.) For convenience, Fig. S7c shows the distributions of the effective weights W, i.e. the differences between the conductances of devices that form each differential pair, for the same three phases of training. As the data show, the change in the weight distributions created by the training is less dramatic than that of the output signals shown in panel (a), i.e. the training imposes mostly relative weight changes (which are important for the correct classification), rather than a global change.



**Figure S7.** Additional data from a particular pattern classification experiment ("Run 1") for which perfect classification was first achieved after 21 training epochs. (a) Histogram of differences between currents in the "correct" output (corresponding to the input pattern's class) and the other two channels. (b) Histogram of device conductances at the initial state, measured after training epoch 21, and after epoch 54 (for which classification is also perfect). (c) Histogram of the differential weights *W* defined by Eq. (3) of the main text. In panel (b), the average values / standard deviations of the measured conductances are 36.3 μS / 9 μS, 41.9 μS / 13 μS, and 42.4 μS / 13.4 μS, when measured in the initial state, after epoch 21, and after epoch 54, respectively. The corresponding averages / deviations of the weights in panel (c) are -0.24 μS / 2.83 μS, -5.36 μS / 16.8 μS, and -1.17 μS / 17.1 μS.

Finally, Figures S8 and S9 show the weight maps measured before and after training for 6 separate training experiments ("runs"). (The convergence graphs for these experiments are shown in Fig. 4a of the main text.) The comparison of weight maps in Fig. S8 shows that even though the initialization procedure was similar in all runs, there were some run-to-run variations in the state of the same device. (To quantify the variations, Figs. S8b and S9b show relative standard deviations for each of 30 weights.)
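The relative variation used in Figs. S8b and S9b is the ratio of the standard deviation to the average value of a given weight across runs. A minimal Python sketch is given below; whether the population or the sample standard deviation was used is not stated, so the use of the population form (`statistics.pstdev`) is an assumption, and the numerical values are hypothetical.

```python
from statistics import mean, pstdev


def relative_variation(values_across_runs):
    """(Standard deviation)-to-(average value) ratio of one weight over runs.

    Assumes the population standard deviation; the ratio is undefined
    (division by zero) if the average is exactly zero.
    """
    return pstdev(values_across_runs) / mean(values_across_runs)


# Hypothetical values of one weight (in microsiemens) over 6 runs.
weight_over_runs = [4.0, 5.0, 6.0, 5.0, 4.5, 5.5]
print(relative_variation(weight_over_runs))
```

Because the weights can have either sign, the ratio may be negative or very large when the average is close to zero.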



**Figure S8:** Initial weight values: (a) the maps of initial weights  $W_{ij}$  (in siemens) measured in 6 separate training experiments ("runs"), and (b) the corresponding relative variation, i.e. (standard deviation)-to-(average value) ratio for each weight. On panel (a), red and blue background colors indicate, respectively, negative and positive values of  $W_{ij}$ , while on panel (b), the colors are used just to emphasize lower (greenish) and higher (reddish) values. Narrow columns with gray / white cells show positive / negative input signals from noiseless images of the classes corresponding to the particular synaptic columns - see Fig. 3b in the main text.



**Figure S9:** Final synaptic weight values after the network training process convergence: (a) the maps of weights  $W_{ij}$  (in siemens) measured after the perfect classification has been reached, for 6 separate runs (after, respectively, 21, 6, 33, 26, 35, and 18 training epochs for Runs 1, 2, 3, 4, 5, and 6 – see Fig. 4a); (b) the corresponding relative variation for each weight. The color coding is the same as in Fig. S8.

As Fig. S9 shows, after training the weights become markedly different - not only in their magnitude, but in many cases also in their sign, despite the fact that each final distribution provides perfect classification fidelity. This is in agreement with the well-known fact that the classifier training using the Delta Rule algorithm and its cousins (such as the Manhattan Update Rule used in this work) does not result in a unique synaptic weight distribution – unless the initial weight values are exactly the same in each training run.

#### 6. Computer simulations

We have also carried out extensive computer simulations of our neural network, in order to understand the impact of various parameters, most importantly of the initial device conductances, on the classifier's fidelity and convergence speed. To perform the simulation, an approximate device model was derived from the data shown in Fig. 1c of the main text. In this model, the conductance increase ( $\Delta G_{\text{set}}$ ) by a positive voltage pulse and its decrease ( $\Delta G_{\text{reset}}$ ) by a negative pulse are calculated as

$$\Delta G_{\text{set}}(G) = +10^{-3} (10^6 G - 10^6 G_{\text{min}} + 10^{\text{vset / slope}})^{-\text{slope}}, \tag{S1}$$

$$\Delta G_{\text{reset}}(G) = -10^{-3} (-10^6 G + 10^6 G_{\text{max}} + 10^{\text{vreset / slope}})^{-\text{slope}}, \tag{S2}$$

where the constant *slope*, for our specific pulse amplitude and duration, was taken equal to 2, while parameters *vset* and *vreset* were randomly chosen from the range [1, 5.5] for every memristor before each run. Constants  $G_{\min} = 10 \, \mu \text{S}$  and  $G_{\max} = 100 \, \mu \text{S}$  are the minimum and maximum conductance values; G is always clipped between these values after each update. Such a simple model captures two main features of the memristors, namely the saturation of the switching dynamics (Fig. 1c) and switching threshold variations (Fig. S5).
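Equations (S1) and (S2), together with the clipping step, can be transcribed into Python as in the sketch below. It is an illustration rather than the simulation code used in this work: the function names are ours, G is assumed to be expressed in siemens (so that the factor 10<sup>6</sup> converts it to microsiemens), and a uniform distribution of *vset* and *vreset* over [1, 5.5] is also an assumption.

```python
import random

G_MIN = 10e-6    # minimum conductance, S
G_MAX = 100e-6   # maximum conductance, S
SLOPE = 2.0      # value quoted for the pulse amplitude and duration used


def delta_g_set(g, vset, slope=SLOPE):
    """Conductance increase for one positive pulse, Eq. (S1); g in siemens."""
    return 1e-3 * (1e6 * g - 1e6 * G_MIN + 10 ** (vset / slope)) ** (-slope)


def delta_g_reset(g, vreset, slope=SLOPE):
    """Conductance decrease for one negative pulse, Eq. (S2); g in siemens."""
    return -1e-3 * (-1e6 * g + 1e6 * G_MAX + 10 ** (vreset / slope)) ** (-slope)


def apply_pulse(g, polarity, vset, vreset):
    """Apply one set (polarity > 0) or reset pulse, then clip to [G_MIN, G_MAX]."""
    dg = delta_g_set(g, vset) if polarity > 0 else delta_g_reset(g, vreset)
    return min(max(g + dg, G_MIN), G_MAX)


# Per-device parameters drawn before a run (uniform sampling is assumed here).
rng = random.Random(0)
vset = rng.uniform(1.0, 5.5)
vreset = rng.uniform(1.0, 5.5)

g = 55e-6  # start in the middle of the dynamic range
for _ in range(100):
    g = apply_pulse(g, +1, vset, vreset)
print(f"conductance after 100 set pulses: {g * 1e6:.2f} uS")
```

In this form the update magnitude shrinks as the conductance moves toward the limit in the direction of switching, which reproduces the saturation described above.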

Figure S10 shows the most important results of these simulations. First, they have confirmed that the classifier fidelity and convergence speed are the best when the initial conductances of the devices are in the middle of their dynamic range. The second important simulation result is that the performance is rather insensitive to the choice of parameter  $\beta$ , with the optimal value close to  $\beta = 2 \times 10^5$ . As was stated in the main text, these results were used in the network training.

#### 7. Toward practical applications

We believe that our work is the first proof-of-concept demonstration that passively integrated memristive crossbar circuits can be used to perform a classification task. However, due to its small size, this network is not by itself practical, and several major steps have to be made toward larger, useful neuromorphic networks.

Besides the primary, self-evident task of fabricating much larger crossbars with smaller memristors of (at least) similar quality and reproducibility, there is also the challenge of their efficient training. For a large crossbar, the batch training algorithm described in the paper would come with a substantial circuit overhead, if it is fully implemented on the same chip as the network. In fact, in the batch mode, the training circuit has to hold at least  $\sim M_1 \times M_2$  continuous intermediate values, for example the numbers  $\Delta_{ij}$  participating in Eq. (5), until the next weight update. (Here  $M_{1,2}$  are the linear sizes of the crossbar array.) Such a training circuit, implemented in the usual CMOS technology, would have a much larger chip footprint than the crossbar itself.



**Figure S10:** Major results of computer simulation of our neural network, using a realistic memristor model. (a) The number of training epochs required to achieve the perfect classification, and (b) the fraction of training experiments with the perfect convergence, as functions of the initial memristor conductances. Each point is an average over 100 runs; for each run the weights were randomly initialized within a 5-$\mu$S conductance window around the value indicated on the horizontal axis. If an experiment took more than 50 epochs for convergence, for the purposes of panel (b), it was considered a failure. These "failed" experiments were excluded when calculating the error bars on panel (a).

However, several factors make this task less hopeless than it may look. First of all, for most important practical applications, a neural network is used repeatedly for classification of patterns of the same type. In these cases it needs to be trained only in the beginning, and then may be used with the same set of weights for a long time. (In fact, this is the approach used in the most advanced recent demonstrations of neural network hardware chips.<sup>7,8</sup>) Such rare training may be assisted by external (digital) computers which, in particular, would store all the intermediate values. According to our recent results,<sup>9</sup> this approach has substantial advantages over purely ex-situ training in a "precursor" software network, with the subsequent import of synaptic weights into the crossbar (as discussed, e.g., in Ref. 2 of the main text), because the former procedure makes it possible to mitigate the detrimental effects of variability in both the memristors and the neuron circuits.

Another opportunity is to train the network in situ, using a local online ("stochastic") training procedure, for which the requirements for external memory would be substantially reduced. In this case, just one circuit per row and per column of the memristive crossbar, i.e.  $\sim (M_1+M_2)$  circuits per  $M_1\times M_2$  memristors, may be sufficient to implement all training and neuron functions. As the crossbar is scaled up in the future for more complex tasks (up to  $M_{1,2} \sim 10^3$ - $10^4$  deemed necessary for some cognitive architectures), the relative size of this circuit overhead would be much decreased. Figure S11 shows some results of our preliminary attempt at implementing an online training algorithm using the same memristive crossbar. Perfect classification has been achieved for the small set of input patterns used, but unfortunately, in its current form the online learning requires too many synaptic updates to achieve perfect network performance.
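The scaling argument can be made concrete with a few lines of Python. The sketch below simply compares the $\sim M_1 \times M_2$ intermediate values needed in the batch mode with the $\sim (M_1+M_2)$ peripheral circuits needed in the online mode; the function name and the choice of square arrays are illustrative assumptions.

```python
def training_overhead(m1, m2):
    """Return (batch-mode stored values, online-mode circuits, their ratio)
    for an m1 x m2 crossbar, using the order-of-magnitude estimates only."""
    batch_values = m1 * m2       # ~M1 x M2 intermediate values held per update
    online_circuits = m1 + m2    # ~(M1 + M2) peripheral circuits
    return batch_values, online_circuits, online_circuits / batch_values


for m in (10, 1_000, 10_000):
    batch, online, ratio = training_overhead(m, m)
    print(f"M = {m}: batch ~{batch} values, online ~{online} circuits, "
          f"circuits per memristor ~{ratio:.1e}")
```

For a square array the number of circuits per memristor falls as 2/M, which is why the relative overhead becomes small for large crossbars.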

Another important question that must be addressed in practical applications is the generalization performance of the classifier. As Fig. S11 shows, for online training we could reach perfect classification performance on a small separate test set. However, this training was too slow for our basic experiments with 30 images in 3 classes. As stated in the main text, these experiments were carried out without a separate test set. However, using the measured values of weights recorded after every training epoch, we could use computer simulations to estimate the network's generalization ability, i.e. its possible classification performance on a separate test set (not used in training). The set consisted of all possible patterns (36 patterns in each of 3 classes) obtained from the ideal images (Fig. 2c of the main text) by flipping 2 pixels (see Fig. S12b). Though the classification improved during training (Fig. S12a), typically it is not perfect by its end. This is not surprising, given the fact that the minimum Hamming distances between patterns of the test and training sets are 2 and 4, respectively. Indeed, even a human would hardly be able to classify all these patterns correctly; please have one more look at Fig. S12b.
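The size of the test set follows from the number of ways to choose 2 of the 9 pixels, i.e. 36 per class. The Python sketch below enumerates such patterns; the ±1 pixel encoding and the example image are hypothetical placeholders, not the actual ideal images of Fig. 2c.

```python
from itertools import combinations


def flip_pixels(pattern, n_flips):
    """Return all patterns obtained from `pattern` by flipping exactly
    n_flips pixels. Pixels are assumed to be encoded as +1 / -1."""
    variants = []
    for indices in combinations(range(len(pattern)), n_flips):
        variant = list(pattern)
        for i in indices:
            variant[i] = -variant[i]
        variants.append(tuple(variant))
    return variants


def hamming(a, b):
    """Number of pixel positions in which two patterns differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical 3x3 image, flattened row by row (not one of the real classes).
ideal = (+1, -1, +1,
         +1, +1, +1,
         +1, -1, +1)
test_patterns = flip_pixels(ideal, 2)
print(len(test_patterns))
```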



**Figure S11:** Preliminary results for the pattern classification experiment using online ("stochastic") Manhattan Rule training: (a) Classification convergence for the training and test sets, and (b) the sets used. On panel (a), "iteration" means an application of one pattern from the training set, followed by the corresponding weight update. (There were 12 iterations in each training epoch.) Note that here perfect classification was achieved on the (admittedly, extremely small) test set. (The evaluation on this set was performed only at the end of training.)

As a result, we believe that a convincing demonstration of the generalization ability is only possible on larger input vectors, which in turn require larger networks (and hence larger crossbar arrays). Such a demonstration remains one of our major future goals.



**Figure S12:** Pattern classification experiment (Run 1): (a) Classification convergence for the training and test sets, and (b) the test set used. (The training set and the training method were the same as described in the main text.) The bottom plot on panel (a) shows one of the curves (the black one) from Fig. 4a, extended to the subsequent training epochs.

#### References

- 1. Young-Fisher, K. G. *et al.* Leakage current-forming voltage relation and oxygen gettering in HfO<sub>x</sub> RRAM devices. *IEEE Electron Device Lett.* **34**, 750–752 (2013).
- 2. Yang, J. J. *et al.* The mechanism of electroforming of metal oxide memristive switches. *Nanotechnology* **20**, 215201 (2009).
- 3. Lentz, F. *et al.* Current compliance-dependent nonlinearity in TiO<sub>2</sub> RRAM. *IEEE Electron Device Lett.* **34**, 996–998 (2013).
- 4. Goux, L. *et al.* Ultralow sub-500nA operating current high-performance TiN/Al<sub>2</sub>O<sub>3</sub>/HfO<sub>2</sub> /Hf/TiN bipolar RRAM achieved through understanding-based stack-engineering. *VLSI Symp. Technol.* '12, 159–160 (2012).
- 5. Wu, H. *et al.* Resistive switching performance improvement of Ta<sub>2</sub>O<sub>5-x</sub>/TaO<sub>y</sub> bilayer ReRAM devices by inserting AlO<sub>δ</sub> barrier layer. *IEEE Electron Device Lett.* **35**, 39–41 (2014).
- 6. Strukov, D. B., Likharev, K. K. Reconfigurable nano-crossbar architectures. *Nanoelectronics and Information Technology*, Waser, R. (ed.), 3rd ed. (Wiley, Weinheim, Germany, 2012).
- 7. Merolla, P. A. *et al.* A million spiking-neuron integrated circuit with a scalable communication network and interface. *Science* **345**, 668–673 (2014).
- 8. Farabet, C. *et al.* NeuFlow: A runtime reconfigurable dataflow processor for vision. *CVPRW'11*, 109–116 (2011).
- 9. Kataeva, I. *et al.* Efficient training algorithms for neural networks based on memristive crossbar circuits. *submitted to IJCNN'15*, (2015), available online at https://www.ece.ucsb.edu/~strukov/papers/2015/mlpclassifier.pdf.