# Learning spike-based correlations and conditional probabilities in silicon #### AaronP.Shon DavidHsu ChrisDiorio DepartmentofComputerScienceandEngineering UniversityofWashington Seattle,WA98195-2350USA {aaron,hsud,diorio}@cs.washington.edu #### Abstract We have designed and fabricated a VLSI synapse that can learn a conditional probability or correlation between spike-based inputs and feedback signals. The synapse is low power, compact, provides nonvolatile weight storage, and can perform simultaneous multiplication and adaptation. We can calibrate arrays of synapses to ensure uniform adaptation characteristics. Finally, adaptation in our synapse does not necessarily depend on the signals used for computation. Consequently, our synapse can implement learning rules that correlate past and present synaptic activity. We provide analysis and experimental chip results demonstrating the operation in learning and calibration mode, and show how to use our synapse to implement various learning rules in silicon. # 1 Introduction Computation with conditional probabilities and correlations underlies many models of neurally inspired information processing. For example, in the sequence-learning neural network models proposed by Levy [1], synapses storethelog conditional probability that a presynaptic spike occurred given that the postsynaptic neuron spiked sometime later. Boltzmann machine synapses learn the difference between the correlations of pairs of neurons in the sleep and wake phase [2]. In most neural models, computation and adaptation occurs at the synaptic level. Hence, as iliconsynapse that can learn conditional probabilities or correlations between pre-and post-synaptic signals can be a keypart of many silicon neural-learning architectures. We have designed and implemented a silicon synapse, in a 0.35 $\mu m CMOS$ process, that learns a synaptic weight that corresponds to the conditional probability or correlation between binary input and feedback signals. This circuit utilizes floating-gate transistors to provide both nonvolatile storage and weight adaptation mechanisms [3]. In addition, the circuit is compact, low power, and provides simultaneous adaptation and computation. Our circuit improves upon previous implementations of floating-gate based learning synapses [3,4,5] inseveral ways. First, our synapse appears to be the first spike-based floating-gate synapse that implements ageneral learning principle, rather than aparticular learning rule [4,5]. We demonstrate the synapse appears to be the first spike-based floating-gate synapse that implements agenerallearning principle, rather than aparticular learning rule [4,5]. We demonstrate the synapse appears to be the first spike-based floating-gate synapse that implements agenerallearning principle, rather than aparticular learning rule [4,5]. strate that our synapse can learn either the conditional probability or the correlation between input and feedback signals. Consequently, we can implement a widerange of synaptic learning networks withour circuit. Second, unlike the general correlational learning synapse proposed by Hasler et. al. [3], our synapse can implement learning rules that correlate pre-and post synaptic activity that occur at different times. Learning algorithms that employ time-separated correlations include both temporal difference learning [6] and recently postulated temporally asymmetric Hebbian learning [7]. Hasler's correlational floating-gate synapse can only perform updates based on the present input and feedback signals, and is therefore unsuitable for learning rules that correlate signals that occur at different times. Because signals that control adaptation and computation in our synapse are separate, our circuit can implement the setime-dependent learning rules. Finally, we can calibrate our synapses to remove mismatch between the adaptation mechanisms of individual synapses. Mismatch between the same adaptation mechanisms on different floating-gate transistors limits the accuracy of learning rules based on these devices. This problem has been noted in previous circuits that use floating-gate adaptation [4,8]. In our circuit, different synapses can learn widely divergent weights from the same inputs because of component mismatch. We provide a calibration mechanism that enables identical adaptation across multiple synapses despite device mismatch. To our knowledge, this circuit is the first instance of a floating-gate learning circuit that includes this feature. This paper is organized as follows. First, we provide a brief introduction to floating-gate transistors. Next, we provide a description and analysis of our synapse, demonstrating that it can learn the conditional probability or correlation between a pair of binary signals. We then describe the calibration circuitry and show its effectiveness in compensating for adaptation mismatches. Finally, we discuss how this synapse can be used for silicon implementations of various learning networks. ## 2 Floating-gatetransistors Because our circuit relies on floating-gate transistors to achieve adaptation, we begin by briefly discussing these devices. A floating-gate transistor (e.g. transistor M 30fFig. 1(a)) 2. A control gate cacomprises a MOSFET whose gate is isolated on all sides by SiO pacitively couples signals to the floating gate. Charge stored on the floating gate implements a nonvolatile analog weight; the transistor's output current varies with both the floating-gate voltage and the control-gate voltage. We use Fowler-Nordheim tunneling [9] to increase the floating-gate charge, and impact-ionized hot-electron injection (IHEI) [10] to decrease the floating-gate charge. We tunnel by placing a high voltage on a tunnelingimplant, denoted by the arrowin Fig. 1(a). We inject by imposing more than about 3V across the drain and source of transistor M 3. The circuital lows simultaneous adaptation and computation, because neither tunneling nor IHE I interfere with circuit operation. $V_{\text{tun}}$ , we can approximate the magnitude of the Over a wide range of tunneling voltages tunnelingcurrent $I_{tun}$ as[4]: $$I_{nun} = I_{nun0} \exp\left(V_{nun} - V_{fg}\right) / V_{\chi} \tag{1}$$ where $V_{\text{tun}}$ is the tunneling-implant voltage, $V_{\text{fg}}$ is the floating-gate voltage, and and $V_{\chi}$ are fit constants. Over a wide range of transistor drain and source voltages, we can approximate the magnitude of the injection current $I_{\text{inj}}$ as [4]: $$I_{inj} = I_{inj0} I_s^{1 - U_t / V_{\gamma}} \exp((V_s - V_d) / V_{\gamma})$$ (2) # 3 The silicon synapse We show our silicon synapse in Fig. 1. The synapse stores an analog weight W, multiplies W by a binary input $X_{\rm in}$ , and adapts W to either a conditional probability $P(X_{\rm cor}|Y)$ or a correlation $P(X_{\rm cor}Y)$ . $X_{\rm in}$ is analogous to a presynaptic input, while Y is analogous to a postsynaptic signal or error feedback. $X_{\rm cor}$ is a presynaptic adaptation signal, and typically has some relationship with $X_{\rm in}$ . We can implement different learning rules by altering the relationship between $X_{\rm cor}$ and $X_{\rm in}$ . For some examples, see section 4. We now describe the circuit in more detail. The drain current of floating-gate transistor 4 is fixed, Wdepends $M_4$ represents the weight value W. Because the control gate of M 1. Wecanswitchthedraincurrentonor solelyonthechargeonfloating-gatecapacitorC off using transistor M 7; this switching action corresponds to a multiplication of the weight value Wby a binary input signal, $X_{\rm in}$ . We choose values for the drain voltage of the M<sub>4</sub>to preventinjection. A second floating-gate transistor M 3, whose gate is also connected to C<sub>1</sub>, controls adaptation by injection and tunneling. Simultaneously high input signals $X_{cor}$ and Y cause injection, increasing the weight. A high $V_{\rm tun}$ causes tunneling, decreasing the weight. We either choose to correlate a high $V_{\rm tun}$ with signal Yor provide a fixed high $V_{\text{tun}}$ throughout the adaptation process. The choice determines whether the circuitlearnsaconditionalprobabilityoracorrelation, respectively. Because the drain current sourced by M $_4$ provides is the weight W, we can express W in terms of M $_4$ 's floating-gate voltage, $V_{\rm fg}$ . $V_{\rm fg}$ includes the effects of both the fixed controlgate voltage and the variable floating-gate charge. The expression differs depending on whether the readout transistor is operating in the subthreshold or above-threshold regime. We provide both expressions below: $$W = \begin{cases} I_0 \exp(-\kappa^2 V_{fg} / (1 + \kappa) U_t) \text{ belowthreshold} \\ \beta \left( V_0 - \frac{\kappa^2 V_{fg}}{(1 + \kappa)} \right)^2 & \text{above threshold} \end{cases}$$ (3) Here $V_0$ is a constant that depends on the threshold voltage and on $V_{\rm dd}$ , $U_{\rm t}$ is the thermal voltage kT/q, $\kappa$ is the floating-gate-to-channel coupling coefficient, and I $_0$ is a fixed bias current. Eq. 3 shows that W depends solely on $V_{\rm fg}$ , (all the other factors are constants). These equations differ slightly from standard equations for the source current through a transistor due to source degeneration caused by M $_4$ . This degeneration smoothes the nonlinear relationship between $V_{\rm fg}$ and $I_{\rm s}$ ; its addition to the circuit is optional. ## 3.1 Weight adaptation Because W depends on $V_{\rm fg}$ , we can control W by tunneling or injecting transistor M 3. In this section, we show that these mechanisms enable our circuit to learn the correlation or conditional probability between inputs $X_{\rm cor}$ (which we will refer to as X) and Y. Our analysis assumes that these statistics are fixed oversome period during which adaptation occurs. The change in floating-gate voltage, and hence the weight, discussed below should therefore be interpreted in terms of the expected weight change due to the statistics of the inputs. We discuss learning of conditional probabilities; a slight change in the tunneling signal, described previously, allows us to learn correlations in stead. We first derive the injection equation for the floating-gate voltage in terms of the joint probability P(X,Y) by considering the relationship between the input signals and $I_s$ , $V_s$ , Fig. 1. (a) Synapse schematic. (b) Plot of equilibrium weight in the subthreshold regime versus the conditional probability P(X|Y), showing both experimental chipdata and a fit from Eq. 7 (c). Plot of equilibrium weight versus conditional probability in the above-threshold regime, again showing chip data and a fit from Eq. 7. and $V_{\rm d}$ of M $_3$ . We assume that transistor M $_1$ is in saturation, constraining $I_{\rm s}$ at M $_3$ to be constant. Presentation of a joint binary event ( X,Y) closes nFET switches M $_5$ and M $_6$ , pulling the drain voltage $V_{\rm d}$ of M $_3$ to 0 V and causing injection. Therefore the probability that $V_{\rm d}$ is low enough to cause injection is the probability of the joint event Pr( X,Y). By Eq. 2, the amount of the injection is also dependent on M $_3$ 's source voltage $V_{\rm s}$ . Because M3 is constrained to a fixed channel current, a drop in the floating-gate voltage, causes a drop in $V_{\rm s}$ of magnitude $\kappa\Delta V_{\rm fg}$ . Substituting these expressions into Eq. 2 results in a floating-gate voltage update of: $$(dV_{fg}/dt_{inj} = -I_{inj}Pr(X,Y)exp(XV_{fg}/V),$$ (4) where $I_{\text{inj0}}$ also includes the constant source current. Eq. 4 shows that the floating-gate voltageupdateduetoinjectionisafunction of the probability of the joint event (X,Y). Next we analyze the effects of tunneling on the floating-gate voltage. The origin of the tunneling signal determines whether the synapse is learning a conditional probability or a correlation. If the circuit is learning a conditional probability, occurrence of the conditioning event Y gates a corresponding high-voltage ( $\sim$ 9V) signal onto the tunneling implant. Consequently, we can express the change in floating-gate voltage due to tunneling interms of the probability of Y, and the floating-gate voltage. $$(dV_{fg}/dt)_{tun} = I_{tun0} \Pr(Y) \exp(-V_{fg}/V_{\chi})$$ (5) Eq. 5 shows that the floating-gate voltage update due to tunneling is a function of the probability of the event Y. #### 3.2 Weightequilibrium Todemonstrate that our circuit learns P(X|Y), we show that the equilibrium weight of the synapse is solely a function of P(X|Y). The equilibrium weight of the synapse is the weight value where the expected weight change over time equals zero. This weight value corresponds to the floating-gate voltage where injection and tunneling currents are equal. To find this voltage, we equate Eq's. 4 and 5 and solve: $$V_{fg}^{eq} = \frac{-1}{\left(\kappa/V_{v} + 1/V_{x}\right)} \left(\log \Pr(X \mid Y) + \log \frac{I_{inj0}}{I_{tun0}}\right)$$ $$\tag{6}$$ Toderivetheequilibriumweight, wesubstitute Eq. 6 into Eq. 3 and solve: $$W_{eq} = \begin{cases} I_0 \left( \frac{I_{inj0}}{I_{tun0}} \Pr(X \mid Y) \right)^{\alpha} \text{ belowthreshold} \\ \beta \left( V_0 + \eta \left( \log \frac{I_{inj0}}{I_{tun0}} + \log \left( \Pr(X \mid Y) \right) \right) \right)^2 \text{ abovethreshold} \end{cases}$$ $$\text{where } \alpha = \frac{\kappa^2}{(1+\kappa)U_t(\kappa/V_{\gamma} + 1/V_{\chi})} \text{ and } \eta = \frac{\kappa^2}{(1+\kappa)(\kappa/V_{\gamma} + 1/V_{\chi})}.$$ $$(7)$$ Consequently, the equilibrium weight is a function of the conditional probability below threshold and a function of the log-squared conditional probability above threshold. Note that the equilibrium weight is stable because of negative feedback in the tunneling and injection processes. Therefore, the weight will always converge to the equilibrium value shown in Eq. 7. Figs. 1(b) and (c) show the equilibrium weight versus the conditional P(X|Y) for both sub-and above-threshold circuits, along with fits to Eq. 7. Note that both the sub- and above-threshold relationship between P(X|Y) and the equilibrium weighten ables us to compute the probability of a vector of synaptic inputs X given a post-synaptic response Y. In both cases, we can apply the outputs currents of an array of synapses through diodes, and then add the resulting voltages via a capacitive voltage divider, resulting in a voltage that is a linear function of log P(X|Y). #### 3.3 Calibration circuitry Mismatch between injection and tunneling in different floating-gate transistors can greatly reduce the ability of our synapses to learn meaningful values. Experimental data fromfloating-gate transistors fabricated in a0.35 $\,$ µmprocess show that injection varies by asmuch as 2:1 across achip, and tunneling by up to 1.2:1. The effect of this mismatch on our synapses causes the weight equilibrium of different synapses to differ by a multiplicative gain. Fig. 2 (b) shows the equilibrium weights of an array of six synapses exposed to identical input signals. The variation of the synaptic weights is of the same order of magnitude as the weights themselves, making large arrays of synapses all but useless for implementing many learning algorithms. We alleviate this problem by calibrating our synapses to equalize the pre-exponential tunneling and injection constants. Because the dependence of the equilibrium weight on these constants is determined by the ratio of $I_{\rm inj0}/I_{\rm tun0}$ , our calibration process changes to equalize the ratio of injection to tunneling across all synapses. We choose to calibrate injection because we can easily change $I_{\rm inj0}$ by altering the drain current through $I_{\rm inj0}$ . Our calibration procedure is a self-convergent memory write [11], that causes the equilibrium weight of every synapse to equal the current $I_{\text{cal}}$ . Calibration requires many operat- ingcycles, where, during each cycle, we first increase the equilibrium weight of the synapse, and second, we let the synapse adapt to the new equilibrium weight. We create the calibrated synapse by modifying our original synapse according to Fig. 2(a). We convert M<sub>1</sub> into a floating-gate transistor, whose floating-gate charge thereby sets M<sub>3</sub>'s channel current, providing control of $I_{\text{inj0}}$ of Eq. 7. Transistor M 8 modifies M 1's V<sub>cal</sub>islow.M <sub>9</sub>'sgateisonly gatechargebymeansofinjectionwhenM 9'sgateislowand lowwhentheequilibriumweight Wislessthan $I_{cal}$ . During calibration, injection and tunneling on M<sub>3</sub> are continuously active. We apply a pulse train to $V_{\rm cal}$ ; during each pulse period, $V_{\text{cal}}$ is predominately high. When $V_{\text{cal}}$ is high, the synapse adapts towards its equilibrium weight. When $V_{cal}$ pulses low, M $_8$ injects, increasing the synapse's equilibrium weight W. We repeat this process until the equilibrium weight W matches $I_{cal}$ , causing M<sub>9</sub>'s gate voltage to rise, disabling $V_{\rm cal}$ and with it injection. To ensure that a precalibrated synapse has an equilibrium weight below $I_{\rm cal}$ , we use tunneling to erase all bias transistors prior to calibration. Fig. 2(c) shows the equilibrium weights of six synapses after calibration. The data show that calibration can reduce the effect of mismatched adaptationonthesynapse's learned weight to a small fraction of the weight itself. Because M $_1$ is a floating-gate transistor, its parasitic gate-drain capacitance causes a mild dependence between M $_1$ 's drain voltage and source current. Consequently, M $_3$ 's floating-gate voltage now affects its source current (through M $_1$ 's drain voltage), and we can model M $_3$ as a source-degenerated pFET [3]. The new expression for the injection current in M $_3$ is: Fig. 3. Amethodforachieving spike-time dependent plasticity in silicon. $$(dV_{fg}/dt_{inj} = -I_{inj}Pr(X) \text{ Wexp } \left(V_{fg}\left(\frac{\kappa}{V_{\gamma}} - \frac{\kappa k_{1}}{U_{t}}\right)\right)$$ (8) where $k_1$ is close to zero. The new expression for injection slightly changes the $\alpha$ and $\eta$ terms of the weight equilibrium in Eq. 7, although the qualitative relationship between the weight equilibrium and the conditional probability remains the same. # 4 Implementingsiliconsynapticlearningrules In this section we discuss how to implement a variety of learning rules from the computational-neurobiology and neural-network literature with our synapsecircuit. We can use our circuit to implement a Hebbian learning rule. Simultaneously activating both $M_5$ and $M_6$ is an alogous to heterosynaptic L TP based on synchronized pre-and post-synaptic signals, and activating tunneling with the postsynaptic Y is an alogous to homosynaptic L TD. In our synapse, we tie $X_{in}$ and $X_{cor}$ to gether and correlate $V_{tun}$ with Y. Our synapse is also capable of emulating a Boltzmann weight-update rule [2]. This weight-update rulederives from the difference between correlations among neurons when the network receives external input, and when the network operates in a free running phase (denoted as clamped and unclamped phases respectively). With weight decay, a Boltzmann synapse learns the difference between correlations in the clamped and unclamped phase. We can create a Boltzmann synapse from a pair of our circuits, in which the effective weight is the difference between the weights of the two synapses. To implement a weight update, we update one silicon synapse based on pre- and postsynaptic signals in the clamped phase, and update the other synapse in the unclamped phase. We dothis by sending $X_{\rm in}$ to $X_{\rm cor}$ of one synapse in the clamped phase, and sending $X_{\rm in}$ to $X_{\rm cor}$ of the other synapse in the negative phase. $V_{\rm tun}$ remains constant throughout adaptation. Finally, we consider implementing a temporally asymmetric Hebbian learning rule [7] using our synapse. In temporally asymmetric Hebbian learning, a synapse exhibits LTP or LTD if the presynaptic input occurs before or after the postsynaptic response, respectively. We implement an asymmetric learning synapse using two of our circuits, where the synaptic weight is the difference in the weights of the two circuit. We show the circuit in Fig. 3. Each neuron sends two signals: a neuronal output, and an adaptation time window that is active for some time afterwards. Therefore, the combined synapse receives two presynaptic signals and two postsynaptic response, Y, with the presynaptic input, X, determines whether the synapse undergoes LTPorLTD.If Yoccurs before X, Y's time window correlates with X, causing injection on the negative synapse, decreasing the weight. If Yoccurs after X, Y correlates with X's time window, causing injection on the positive synapse, increasing the weight. Hence, our circuit can use the relative timing between presynaptic and postsynaptic activity to implement learning. ### 5 Conclusion We have described a silicon synapse that implements a wide range of spike-based learning rules, and that does not suffer from device mismatch. We have also described how we can implement various silicon-learning networks using this synapse. In addition, although we have only analyzed the learning properties of the synapse for binary signals, we can instead use pulse-coded analog signals. One possible avenue for future work is to analyze the implications of different pulse-coded schemes on the circuit's adaptive behavior. #### Acknowledgements This work was supported by the National Science Foundation and by the Office of Naval Research. Aaron Shon was also supported by a NDSEG fellowship. We thank Anhai Doan and the anonymous reviewers for helpful comments. #### References [1] W.B.Levy, "Acomputational approach to hippocampal function," in R.D. Hawkins and G.H. Bower (eds.), Computational Models of Learning in Simple Neural Systems, The Psychology of Learning and Motivation vol. 23, pp. 243-305, San Diego, CA: Academic Press, 1989. - [2] D.H. Ackley, G. Hinton, and T. Sejnowski, "A learning algorithm for Boltzmann machines," Cognitive Science vol. 9, pp. 147-169, 1985. - [3] P. Hasler, B. A. Minch, J. Dugger, and C. Diorio, "Adaptive circuits and synapses using pFET floating-gate devices," in G. Cauwenberghs and M. Bayoumi (eds.) Learning in Silicon, pp. 33-65, Kluwer Academic, 1999. - [4] P. Hafliger, A spike-based learning rule and its implementation in analog hardware thesis, ETHZurich, 1999. , Ph.D. - [5] C. Diorio, P. Hasler, B. A. Minch, and C. Mead, "A floating-gate MOS learning array with locally computer weight updates," *IEEE Transactions on Electron Devices* vol. 44(12), pp. 2281-2289,1997. - [6] R. Sutton, "Learning to predict by the methods of temporal difference," *Machine Learning*, vol. 3, pp. 9-44,1988. - [7] H.Markram, J.Lübke, M. Frotscher, and B. Sakmann, "Regulation of synaptic efficacy by coincidenceofpostsynaptic APsand EPSPs," *Science* vol. 275,pp.213-215,1997. - [8] A. Pesavento, T. Horiuchi, C. Diorio, and C. Koch, "Adaptation of current signals with floating-gate circuits," in Proceedings of the 7 th International Conference on Microelectronics for Neural, Fuzzy, and Bio-Inspired Systems (Microneuro 99), pp. 128-134, 1999. - [9] M. Lenzlinger and E. H. Snow. "Fowler-Nordheim tunneling into thermally grown SiO JournalofAppliedPhysics vol. 40(1), pp. 278--283,1969. - [10] E. Takeda, C. Yang, and A. Miura-Hamada, Hot Carrier Effects in MOSDevices , San Diego, CA: Academic Press, 1995. - [11] C.Diorio, "Ap-channel MOS synapsetransistor with self-convergent memory writes," *IEEE Journal of Solid-State Circuits* vol. 36(5), pp. 816-822, 2001.