Low-Power Edge Computing with Optical Neural Networks via WDM Weight Broadcasting
20230274156 · 2023-08-31
Cpc classification
G06N3/0675
PHYSICS
Abstract
NetCast is an optical neural network architecture that circumvents constraints on deep neural network (DNN) inference at the edge. Many DNNs have weight matrices that are too large to run on edge processors, leading to limitations on DNN inference at the edge or bandwidth bottlenecks between the edge and server that hosts the DNN. With NetCast, a weight server stores the DNN weight matrix in local memory, modulates the weights onto different spectral channels of an optical carrier, and distributes the weights to one or more clients via optical links. Each client stores the activations, or layer inputs, for the DNN and computes the matrix-vector product of those activations with the weights from the weight server in the optical domain. This multiplication can be performed coherently by interfering the spectrally multiplexed weights with spectrally multiplexed activations or incoherently by modulating the weight signal from the weight server with the activations.
Claims
1. A method comprising: at a server, generating a weight signal comprising an optical carrier modulated with a set of spectrally multiplexed weights for a deep neural network (DNN); transmitting the weight signal from the server to a client via an optical link; and at the client, computing a matrix-vector product of (i) the set of spectrally multiplexed weights modulated onto the optical carrier and (ii) inputs to a layer of the DNN.
2. The method of claim 1, wherein generating the weight signal comprises retrieving the set of spectrally multiplexed weights from a memory of the server.
3. The method of claim 1, wherein generating the weight signal comprises, at each of a plurality of time steps, modulating wavelength-division multiplexed (WDM) channels of the optical carrier with respective entries of a column of a weight matrix of the DNN.
4. The method of claim 3, wherein computing the matrix-vector product comprises: modulating the weight signal with the inputs to the layer of the DNN; demultiplexing the WDM channels of the weight signal modulated with the input to the layer of the DNN; and sensing powers of the respective WDM channels of the weight signal modulated with the input to the layer of the DNN.
5. The method of claim 4, wherein modulating the weight signal with the inputs to the layer of the DNN comprises: intensity modulating inputs to a Mach-Zehnder modulator with amplitudes of the inputs to the layer of the DNN; and encoding signs of the inputs to the layer of the DNN with the Mach-Zehnder modulator.
6. The method of claim 1, wherein generating the weight signal comprises: modulating an intensity of the optical carrier with amplitudes of the set of spectrally multiplexed weights before coupling the optical carrier into a set of ring resonators; and modulating the optical carrier with signs of the set of spectrally multiplexed weights using the ring resonators.
7. The method of claim 1, wherein: generating the weight signal comprises encoding the set of spectrally multiplexed weights in a complex amplitude of the optical carrier; and computing the matrix-vector product comprises detecting interference of the weight signal with a local oscillator modulated with the inputs to the layer of the DNN.
8. The method of claim 1, wherein the spectrally multiplexed weights form a weight matrix and computing the matrix-vector product of (i) the set of spectrally multiplexed weights modulated onto the optical carrier and (ii) inputs to the layer of the DNN comprises: weighting columns of the weight matrix with the inputs to the layer of the DNN to produce spectrally multiplexed products; demultiplexing the spectrally multiplexed products; and detecting the spectrally multiplexed products with respective photodetectors.
9. The method of claim 8, wherein weighting the columns of the weight matrix with the inputs to the layer of the DNN comprises simultaneously modulating a plurality of wavelength channels.
10. The method of claim 1, wherein the spectrally multiplexed weights form a weight matrix and computing the matrix-vector product of (i) the set of spectrally multiplexed weights modulated onto the optical carrier and (ii) inputs to the layer of the DNN comprises: weighting rows of the weight matrix with the inputs to the layer of the DNN to produce temporally multiplexed products; and detecting the temporally multiplexed products with at least one photodetector.
11. The method of claim 10, wherein weighting the rows of the weight matrix with the inputs to the layer of the DNN comprises independently modulating each of a plurality of wavelength channels.
12-20. (canceled)
Description
BRIEF DESCRIPTIONS OF THE DRAWINGS
[0019] The skilled artisan will understand that the drawings primarily are for illustrative purposes and are not intended to limit the scope of the inventive subject matter described herein. The drawings are not necessarily to scale; in some instances, various aspects of the inventive subject matter disclosed herein may be shown exaggerated or enlarged in the drawings to facilitate an understanding of different features. In the drawings, like reference characters generally refer to like features (e.g., functionally similar and/or structurally similar elements).
DETAILED DESCRIPTION
[0039]
[0040] The output port of the beam splitter 113 is coupled to the optical link 120, which can be a fiber link 121 (e.g., polarization-maintaining fiber (PMF) or single-mode fiber (SMF) with polarization control at the output), free-space link 122, or optical link with fan-outs 123 for connecting to multiple clients 130. If the server 110 is connected to multiple clients 130, it can be connected to each client 130 via a different (type of) optical link 120. In addition, a given optical link 120 may include multiple segments, including multiple fiber or free-space segments connected by amplifiers or repeaters.
[0041] Each client 130 includes a PBS 131 with two output ports, which are coupled to respective input ports of a Mach-Zehnder modulator (MZM) 133 with a phase modulator 132 in the path from one PBS output to the corresponding MZM input. The outputs of the MZM 133 are demultiplexed into an array of difference detectors 135, one per wavelength channel. Demultiplexing can be achieved with various passive optics, including arrayed waveguide gratings, unbalanced Mach-Zehnder trees, and ring filter arrays (shown here). In the ring-based implementation, the light is filtered with banks of WDM ring resonators 134. The ring resonators 134 in each bank are tuned to the same resonance frequencies ω.sub.1 through ω.sub.M as the micro-ring modulators 112 in the server 110. Each resonator 134 is paired with a corresponding resonator in the other bank that is tuned to the same resonance frequency. These pairs of resonators 134 are evanescently coupled to respective differential detectors 135, such that each differential detector 135 is coupled to a pair of resonators 134 resonant at the same frequency (e.g., ω.sub.1). In this arrangement, the pairs of resonators 134 act as passband filters that couple light at a particular frequency from the MZM 133 to the respective differential detectors 135.
[0042] The differential detectors 135 are coupled to an analog-to-digital converter (ADC) 136 that converts analog signals from the differential detectors 135 into digital signals that can be stored in a RAM 137. The RAM 137 also stores inputs to one or more layers of the DNN. The RAM 137 is coupled to a DAC 138 that is coupled in turn to the MZM 133. The DAC 138 drives the MZM 133 with the DNN layer inputs stored in the RAM 137 as described below.
[0043] The NetCast optical neural network 100 works as follows. Data is encoded using a combination of time multiplexing and WDM: the server 110 and client 130 perform an M×N matrix-vector product in N time steps over M wavelength channels. At each time step (indexed by n), the server 110 broadcasts a column w.sub.n of the weight matrix to the client 130 via the optical link 120. The server 110 modulates the weight matrix elements, which are stored in the RAM 113, on the frequency comb to produce a weight signal using the broadband modulator (e.g., micro-ring resonators 112). Then the server 110 transmits this weight signal to the client 130 via the optical link 120. The MZM 133 in the client 130 multiplies the weight signal with the input to the corresponding DNN layer, which is stored in the client RAM 137. The pair of 1-to-M WDMs (e.g., M ring resonators 134) and M difference photodetectors 135 (one set per wavelength) in the client 130 demultiplex the outputs of the MZM 133. These outputs are the products of the weights with the input vector stored in the client's RAM 137, w.sub.mnx.sub.n. Integrating over all N time steps, the total charge accumulated on each difference detector 135 is
$$\gamma_m = \sum_n w_{mn} x_n \tag{1}$$
performing the desired matrix-vector product.
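The time-multiplexed product of Eq. (1) can be illustrated with a minimal numerical sketch. It models only the arithmetic (one weight-matrix column per time step, one accumulating detector per wavelength channel), not the optics; the function name `netcast_matvec` and the example values are hypothetical.

```python
def netcast_matvec(weights, activations):
    """Accumulate sum_n w[m][n] * x[n] the way the detectors integrate charge.

    weights: M rows (wavelength channels) by N columns (time steps).
    activations: N layer inputs, one applied per time step.
    """
    num_channels = len(weights)
    num_steps = len(activations)
    charge = [0.0] * num_channels  # one difference detector per wavelength channel
    for n in range(num_steps):
        # The server broadcasts column n: one weight per wavelength channel.
        column = [weights[m][n] for m in range(num_channels)]
        # The client applies the same activation x_n to every channel at once.
        x_n = activations[n]
        for m in range(num_channels):
            charge[m] += column[m] * x_n  # detector m integrates w_mn * x_n
    return charge


W = [[0.5, -1.0, 0.25],
     [1.0, 0.0, -0.5]]   # M = 2 channels, N = 3 time steps (illustrative values)
x = [1.0, 0.5, -1.0]
y = netcast_matvec(W, x)
print(y)  # [-0.25, 1.5]
```

Each pass of the outer loop corresponds to one time step, so an M×N product takes N steps regardless of M.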
[0044]
where Δ.sub.mn is the cavity detuning of the m.sup.th ring modulator 112 (couples to ω.sub.m) at time step n.
[0045] The PBS 115 combines the through- and drop-port outputs of the ring modulators 112 into orthogonal polarizations of a polarization-maintaining fiber (PMF) link 121, which transmits the combined through- and drop-port outputs to the client 130 as a weight signal. If the through and drop beams have the same polarization (e.g., transverse electric (TE)), there may also be a polarization rotator coupled to one input port of the PBS 115 to rotate the polarization of one input to the PBS 115 (e.g., from TE to transverse magnetic (TM)), so that the inputs are coupled to the same output port of the PBS 115 as orthogonal modes (e.g., TE and TM modes propagating in the same waveguide 121). The optical link 120 may be over fiber or free space and may include optical fan-out to multiple clients as explained above. If the link loss or fan-out ratio is large, the server output can be pre-amplified by an erbium-doped fiber amplifier (EDFA) or another suitable optical amplifier (not shown).
[0046] At the end of the link 120, the weight signal enters the client 130, where the second PBS 131 separates the polarizations and the phase shifter 132 (
[0047] Finally, the WDM channels are demultiplexed using the ring resonators 134 and the power in each channel is read out on a corresponding photodetector 135. In this case, with a ring-based WDM transmitter, the difference current between the MZM outputs evaluates to:
[0048] The first term in Eq. (4) is a product between a DNN weight (encoded as |t.sub.mn|.sup.2−|r.sub.mn|.sup.2) and an activation (encoded as cos(2θ.sub.n)). The second term Re[t*.sub.mnr.sub.mn]sin(2θ.sub.n) is unwanted: it comes from interference between the through- and drop-port outputs on the MZM 133. This interference can be suppressed or eliminated by ensuring the fields are ±π/2 out of phase (true in the critically coupled case Eq. (2)), by offsetting them with a time delay (though this reduces the throughput by a factor of two), or by using two MZMs rather than one (at the cost of extra complexity).
[0049] NetCast uses time multiplexing, and the matrix-vector product is derived by integrating over multiple time steps. For clarity, label the wavelength channels with index m and time steps with index n. In each time step n, the weight server 110 outputs a column of this matrix w.sub.:,n, where the weights are related to the modulator transmission coefficients (and hence the detuning) and the activation x.sub.n is encoded in the MZM phase:
For lossless modulators (k.sub.1=k.sub.2=k/2), the range of accessible weights is w.sub.mn∈[−1, +1]; for lossy modulators, the lower bound is stricter: w.sub.mn∈[−1+2k.sub.abs/k, +1]. To reach all activations in the full range x.sub.n∈[−1, 1], the modulation should hit all points in θ∈[−π/2, π/2]; this condition can be achieved using a driver with V.sub.pp=V.sub.π.
[0050] After integrating Eq. (4) over the time steps, the difference charge for detector pair m is:
$$\gamma_m = \sum_n \Delta I_{mn} = \sum_n w_{mn} x_n \tag{7}$$
which is the desired matrix-vector product.
[0051] At a high level, the NetCast architecture encodes the neural network (the weights) into optical pulses and broadcasts it to lightweight clients 130 for processing, hence the name NetCast.
NetCast Architecture Variants
[0052] The NetCast concept is very flexible. For example, if one has a stable local oscillator, one can use homodyne detection rather than differential power detection to create a coherent version. While NetCast does not rely on coherent detection or interference, coherent detection can improve performance. In addition, one can replace the fast MZM with an array of slow ring modulators to integrate the signal over frequency rather than time (computing x.sup.Tw instead of wx). Finally, there are a number of ways to reduce the noise incurred in differential detection if many of the signals are small.
Coherent NetCast
[0053]
[0054] This architecture 300 is called a coherent architecture because the weight data is encoded in coherent amplitudes, and the client 330 performs coherent homodyne detection using a local oscillator (LO) 340. A tap coupler (e.g., a 90:10 beam splitter) 341 couples a small fraction of the output of the LO 340 to one port of a differential detector 342 and the remainder to the input of an MZM 333. Likewise, the other port of the differential detector 342 receives a fraction of the weight signal from the server 310 via another tap coupler 332. The output of the differential detector 342 drives a phase-locking circuit 343 that stabilizes the carrier frequency and repetition rate of the LO 340 in a phase-locked loop (PLL). The second tap coupler 332 couples the remainder of the weight signal to a 50:50 beam splitter 344 whose other input port is coupled to the output of the MZM 333. The output ports of this 50:50 beam splitter 344 are fed to respective input ports of a WDM homodyne detector 334.
[0055] For concreteness,
[0056] As in
[0057] One advantage of coherent detection at the client 330 is increased data rate. The coherent scheme shown in
[0058] Another advantage of the coherent scheme is increased signal-to-noise ratio (SNR), especially at low signal powers. This is especially relevant for long-distance free-space links where the transmission efficiency is very low. Homodyne detection with a sufficiently strong LO allows this signal to be measured down to the quantum limit, rather than being swamped by Johnson noise.
[0059] Assume that inputs and weights are scaled to lie in the range x.sub.n, w.sub.mn∈[−1,1]. The comb line amplitudes input to the homodyne detector, normalized to photon number, are α.sub.mn.sup.(w)=α.sub.ww.sub.mn and α.sub.mn.sup.(x)=α.sub.xx.sub.n. In the weak-signal limit α.sub.w«α.sub.x, the difference charge accumulated on each photodetector, per time step, is:
$$\langle Q/e \rangle = 2\alpha_w \alpha_x w_{mn} x_n, \qquad \langle Q/e \rangle_{\mathrm{rms}} \equiv \alpha_x |x_n| \tag{8}$$
The mean and standard deviation of the output signal are therefore:
As expected from Eq. (8), the SNR grows with the energy per weight pulse (before modulation) |α.sub.w|.sup.2: the mean signal scales as α.sub.w, while the shot-noise amplitude does not depend on α.sub.w. The ONN's performance may be impaired if the SNR is too low; this sets a lower bound on the optical received power, analogous to the ONN standard quantum limit.
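The scaling of the homodyne SNR with the weight amplitude can be checked with a short sketch built directly on Eq. (8). It assumes that the shot noise in different time steps is independent, so the per-step variances add; the function name `homodyne_snr` and the example values are hypothetical.

```python
import math


def homodyne_snr(weights_row, activations, alpha_w, alpha_x):
    """SNR of the integrated difference charge for one output channel.

    Per time step, Eq. (8) gives a mean of 2*alpha_w*alpha_x*w*x and an rms
    of alpha_x*|x|. Assumption: noise is independent between time steps.
    """
    mean = sum(2 * alpha_w * alpha_x * w * x
               for w, x in zip(weights_row, activations))
    variance = sum((alpha_x * abs(x)) ** 2 for x in activations)
    return abs(mean) / math.sqrt(variance)


row = [0.5, -1.0, 0.25]   # one row of the weight matrix (illustrative)
x = [1.0, 0.5, -1.0]      # layer inputs (illustrative)

snr_low = homodyne_snr(row, x, alpha_w=3.0, alpha_x=10.0)
snr_high = homodyne_snr(row, x, alpha_w=6.0, alpha_x=10.0)
print(snr_low, snr_high)
```

Doubling α.sub.w (four times the photons per weight) doubles the SNR, and the LO amplitude α.sub.x cancels out of the ratio.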
[0060] The same protocol can also work if the weight data is sent over an RF link; in this case a mixer is used in place of an optical homodyne detector. An advantage of using an optical link is the much higher data capacity, driven by the 10.sup.4-10.sup.5× higher carrier frequency.
Additional NetCast Variants
[0061] NetCast is very extensible: it can detect coherently or incoherently, integrate over frequency or time, and in the case of incoherent detection, additional complexity can lower the receiver noise.
[0062]
[0063]
[0064] In the TIFS client 130, the optical signal is modulated by a broadband MZM 133, which modulates all wavelength channels simultaneously. This weights the columns of the weight matrix w.sub.mn by activations x.sub.n. The resulting wavelength channels are demultiplexed 134′ and the product is detected on the difference detector 135′ after time integration (sum over the columns of the weighted matrix, Σ.sub.nw.sub.mnx.sub.n).
[0065] In the FITS client 130′, the optical signal is sent through a weight bank 134, which independently modulates each wavelength channel. This weights the rows of the weight matrix w.sub.mn by activations x.sub.m. The resulting signal is detected on a difference detector; at time step n, the difference current is the sum of all contributing wavelength channels (sum over the rows of the weighted matrix, Σ.sub.mw.sub.mnx.sub.m).
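The difference between the two integration schemes can be sketched numerically. This is only an arithmetic illustration under the indexing used in this description (m for wavelength channels, n for time steps); the function names and example values are hypothetical.

```python
def tifs_product(w, x):
    """Integrate over time steps: one output per wavelength channel m.

    x holds one activation per time step n; each activation scales a
    whole column of w, and each detector sums its channel over time.
    """
    num_channels, num_steps = len(w), len(w[0])
    return [sum(w[m][n] * x[n] for n in range(num_steps))
            for m in range(num_channels)]


def fits_product(w, x):
    """Integrate over wavelength channels: one output per time step n.

    x holds one activation per channel m; each activation scales a
    whole row of w, and one detector sums all channels at each step.
    """
    num_channels, num_steps = len(w), len(w[0])
    return [sum(w[m][n] * x[m] for m in range(num_channels))
            for n in range(num_steps)]


W = [[0.5, -1.0, 0.25],
     [1.0, 0.0, -0.5]]                       # M = 2 channels, N = 3 time steps
y_tifs = tifs_product(W, [1.0, 0.5, -1.0])   # computes w x
y_fits = fits_product(W, [1.0, -0.5])        # computes x^T w
print(y_tifs, y_fits)
```

The same broadcast weight signal therefore supports either w x or x.sup.Tw, depending only on how the client modulates and integrates.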
[0066] The low-noise incoherent servers 410 and clients 430 and 430′, shown in the bottom row of
[0067]
[0068] Simple and low-noise incoherent servers and clients can be mixed and matched depending on the desired neural network performance and system complexity. To show the advantage of the low-noise configurations, consider the following four cases, named S/S, S/LN, LN/S, LN/LN (simple server/simple client, simple server/low-noise client, etc.). In each case, start with an unweighted frequency comb with amplitudes α.sub.w, where N.sub.wt=|α.sub.w|.sup.2 is the number of photons per weight (at the source), and normalize variables so that w, x∈[−1,1].
[0069] 1. S/S: The weight bank (WB) encodes w.sub.mn into the differential power in two channels, which are multiplexed with a PBS. These are |α.sub.±|.sup.2=(1/2)(1±w.sub.mn)N.sub.wt. At the client, these channels are remixed with the MZM (avoiding interference) to give |α′.sub.±|.sup.2=(1/2)(1±w.sub.mnx.sub.n)N.sub.wt. Thus the differential charge is Q.sub.det=|α′.sub.+|.sup.2−|α′.sub.−|.sup.2=w.sub.mnx.sub.nN.sub.wt, while the total absorbed charge, which sets the shot noise, is Q.sub.tot=|α′.sub.+|.sup.2+|α′.sub.−|.sup.2=N.sub.wt.
[0070] 2. S/LN: The inputs are the same as in S/S, but the client has an additional pair of intensity modulators (IM) before the MZM as shown in
TABLE 1 Scheme
[0073] These cases are enumerated in Table 1. While they collect the same differential charge Q.sub.det=w.sub.mnx.sub.nN.sub.wt, the total PD charge, which sets the shot-noise limit, varies considerably if many of the inputs or weights are small (or zero). This is generally true, especially for DNN weights which are often pruned to save memory.
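The S/S case above can be worked through numerically to show why small weights or inputs are costly in the simple scheme. The sketch uses only the S/S expressions for the two detected powers; the function name and the photon number are hypothetical.

```python
def simple_scheme_charges(w, x, n_wt):
    """Detected charges for the simple server / simple client (S/S) case.

    Returns (Q_det, Q_tot) in units of photons, using
    |alpha'_+-|^2 = (1/2) * (1 +- w*x) * N_wt.
    """
    plus = 0.5 * (1 + w * x) * n_wt
    minus = 0.5 * (1 - w * x) * n_wt
    q_det = plus - minus   # differential charge: carries the product w*x
    q_tot = plus + minus   # total absorbed charge: sets the shot noise
    return q_det, q_tot


N_WT = 1000.0  # photons per weight at the source (illustrative value)

q_det, q_tot = simple_scheme_charges(0.5, -0.5, N_WT)
print(q_det, q_tot)

# A pruned (zero) weight yields no signal but still deposits all N_wt photons.
q_det_zero, q_tot_zero = simple_scheme_charges(0.0, 0.8, N_WT)
print(q_det_zero, q_tot_zero)
```

In S/S the total charge stays at N.sub.wt however small w.sub.mnx.sub.n is, which is the shot-noise penalty that the low-noise variants aim to reduce.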
[0074] From the PD charge, it is possible to calculate the shot noise on the logical output γ.sub.m. In general, we will have:
$$\gamma_m = \sum_n w_{mn} x_n + N(0, \sigma_m^2) \tag{10}$$
[0075] The right column of Table 1 compares the noise amplitudes σ.sub.m for the four incoherent schemes (as well as the coherent scheme, Eq. (9)). As expected, the low-noise and coherent schemes have lower noise amplitudes than the simple scheme. Also, because (∥x∥.sub.2).sup.2≤∥x∥.sub.1 (an application of Hölder's inequality, given |x.sub.n|≤1), the coherent scheme is superior to S/LN. But whether LN/LN or Coherent is best may depend on the weights.
[0076] Because time and frequency are Fourier conjugates, the noise analysis is the same for the FITS and TIFS integration schemes, with the replacements w→w.sup.T and N→M (swap time bins with frequency channels). In addition, a side benefit of the low-noise schemes is robustness to phase errors: because the MZMs are always in a BAR or CROSS configuration, there is no interference between α.sub.+ and α.sub.− and the relative phase no longer matters.
Performance
Throughput
[0077] If the client runs as a matrix-vector multiplier, e.g., as shown in
[0078] Fundamentally, the channel capacity of the optical link between the server and client is usually limited by crosstalk. In this architecture, crosstalk takes two forms: (1) temporal crosstalk and (2) frequency crosstalk. Temporal crosstalk arises from the finite photon lifetime in the ring modulators and their finite RC time constant. Lumping these together gives an approximate modulator response time τ=√(1/k.sup.2+(RC).sup.2). For efficient modulators, RC≈1/k, so τ≈√2/k. Temporal crosstalk can have the form X.sub.t=e.sup.−T/τ, where T is the time between weights. This sets an upper limit on the symbol rate R=1/T of the modulators:
where ƒ.sub.0 is the optical carrier frequency and Q is the ring's quality factor.
[0079] Frequency crosstalk occurs among channels of the WDM receiver (even for a perfect WDM, the transmitter rings have frequency crosstalk). This is set by the Lorentzian lineshape X.sub.ω=(k/2).sup.2/(Δω.sup.2+(k/2).sup.2), where Δω is the spacing between neighboring WDM channels. In the low-crosstalk case Δω»k, this gives a minimum channel spacing:
[0080] Analog crosstalk should be sufficiently low for the DNN to function. An analog crosstalk of X.sub.t≲0.05 is usually sufficient. Assuming frequency crosstalk has a similar threshold (X.sub.t=X.sub.ω=X), the channel capacity is bounded by:
Here B is the bandwidth (in Hz) and C.sub.0 is the normalized symbol rate (units 1/Hz-s).
[0081] Table 2 shows the capacity as a function of crosstalk. These values are in the same ballpark as the HBM memory bandwidth of high-end GPUs (e.g., 6-12 Tbps). In the matrix-vector case of 1 MAC/wt, it may not be possible to reach GPU- or TPU-level arithmetic performance (>50 TMAC/s). Reaching that level could involve optical fan-out in the client to reuse weights (as mentioned above; GPUs and TPUs do this anyway) or operating beyond the C-band.
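The Table 2 values can be approximately reproduced from the crosstalk model described above. The sketch below is a reconstruction under stated assumptions, not a formula quoted from this description: it takes τ≈√2/k with X=e.sup.−T/τ for the symbol rate, the low-crosstalk Lorentzian limit X≈(k/(2Δω)).sup.2 for the channel spacing, and 2πB/Δω channels in a band of B hertz. Multiplying rate by channel count then gives a normalized symbol rate C.sub.0=2√2·π·√X/ln(1/X). The function name is hypothetical.

```python
import math


def normalized_symbol_rate(crosstalk):
    """Normalized symbol rate C0 (weights per second per hertz of bandwidth).

    Assumed model (a reconstruction, see lead-in):
      symbol rate   R      <= k / (sqrt(2) * ln(1/X))
      channel pitch d_omega >= k / (2 * sqrt(X))
      channels      = 2*pi*B / d_omega
    so C0 = R * channels / B = 2*sqrt(2)*pi*sqrt(X) / ln(1/X).
    """
    return (2 * math.sqrt(2) * math.pi * math.sqrt(crosstalk)
            / math.log(1 / crosstalk))


C_BAND_HZ = 4.4e12  # C-band bandwidth quoted with Table 2

for x in (0.1, 0.05, 0.01, 0.005, 0.001):
    c0 = normalized_symbol_rate(x)
    capacity_twt = c0 * C_BAND_HZ / 1e12
    print(f"X={x:<6} C0={c0:.2f}  capacity={capacity_twt:.2f} Twt/s")
```

Under these assumptions the computed C.sub.0 values round to the Table 2 entries, and the linewidth k cancels, so the bound depends only on the crosstalk tolerance and the available bandwidth.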
[0082] There may also be practical bandwidth limits set by dispersion in the MZM, long fiber links, PBS, or free-space optics. Many of these bandwidth limits can be circumvented with appropriate engineering.
TABLE 2. Maximum link bandwidth as a function of crosstalk. The C-band is the wavelength range 1530-1565 nm where EDFAs operate (B = 4.4 THz). The rightmost column gives the equivalent digital data capacity, assuming 8-bit weights.

| Crosstalk χ | Symbol rate C.sub.0 | Capacity C (C-band) | Capacity ×8 b/wt |
|---|---|---|---|
| 0.1 | 1.22 | 5.3 Twt/s | 43 Tbps |
| 0.05 | 0.66 | 2.9 Twt/s | 23 Tbps |
| 0.01 | 0.19 | 850 Gwt/s | 6.8 Tbps |
| 0.005 | 0.12 | 520 Gwt/s | 4.2 Tbps |
| 0.001 | 0.04 | 180 Gwt/s | 1.2 Tbps |

Laser Power/SQL
[0083] The server should emit enough laser power to maintain a reasonable SNR at the detector. The noise can be modeled as a Gaussian term in the matrix-vector product of each DNN layer. Following Eq. (10), one writes:
$$\gamma_m = \sum_n w_{mn} x_n + N(0, \sigma^2), \qquad \sigma = \sqrt{\sigma_j^2 + \sigma_s^2} \tag{14}$$
[0084] Here, σ.sub.j and σ.sub.s are the Johnson- and shot-noise contributions, respectively. Johnson noise gives rise to so-called kTC noise fluctuations on the charge of a capacitor; these fluctuations scale as (ΔQ).sub.rms=√(kTC) and can dominate for readout circuits (detector and transimpedance amplifier (TIA)) with large capacitance. Shot noise, due to the quantization of light into photons, may dominate in the case of high optical powers or coherent detection (with a strong LO).
[0085] There are at least two ways to define the basis for benchmarking laser power. First, the basis can be defined based on the source power in the frequency comb at the weight server before the WDM-MZM. Denote this as N.sub.src. This is the same as N.sub.wt used elsewhere in this specification. Second, the basis can be defined based on the transmitted power (averaged) at the weight server's output, denoted N.sub.tr. This may be much lower than N.sub.src if many weights are zero and a low-noise or coherent detection scheme is used. Received power (at the client) is just N.sub.tr times the link efficiency. Source power is a convenient basis without practical amplifiers, but as long as it is possible to amplify the signal efficiently without too much dispersion, nonlinearity, or crosstalk, transmitted power may be a more convenient basis. In addition, using transmitted power leads to more favorable results in many cases.
[0086] To calculate the energy bound imposed by noise in the ONN, consider running the neural network with additive Gaussian noise in each layer (Eq. (14)) and computing the noise limit, the largest tolerable noise amplitude σ.sub.max. This depends on the DNN and the tolerance to error.
[0087]
[0088] The largest tolerable noise amplitude σ.sub.max can be used to obtain a conservative estimate for the energy metric (either N.sub.src or N.sub.tr) since σ=√(σ.sub.j.sup.2+σ.sub.s.sup.2) depends on the optical energy. First, the Johnson noise scales inversely with N.sub.src and sets a lower bound on it:
Table 3 lists the kTC noise, the corresponding minimum energy per MAC E.sub.min, and the minimum power (at a rate of 1 TMAC/s).
TABLE 3. Johnson (kTC) noise as a function of capacitance C and corresponding minimum source energy per MAC E.sub.min.

| | C = 1 fF | 10 fF | 100 fF | 1 pF |
|---|---|---|---|---|
| ⟨ΔQ/e⟩.sub.rms | 13 | 40 | 130 | 400 |
| E.sub.min × σ.sub.max | 1.6 aJ | 5.1 aJ | 16 aJ | 51 aJ |
| P.sub.min × σ.sub.max.sup.† | 1.6 μW | 5.1 μW | 16 μW | 51 μW |

.sup.†Power P.sub.min calculated at 1 TMAC/s.
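The kTC entries of Table 3 can be approximately reproduced with a short calculation. The temperature (300 K) and the wavelength (1550 nm) are assumptions, since neither is stated with the table; the energy row is modeled as one photon per noise electron, which is an inference that happens to match the tabulated values. Function names are hypothetical.

```python
import math

K_BOLTZMANN = 1.380649e-23    # J/K
E_CHARGE = 1.602176634e-19    # C
PLANCK = 6.62607015e-34       # J*s
LIGHT_SPEED = 2.99792458e8    # m/s


def ktc_noise_electrons(capacitance_f, temperature_k=300.0):
    """RMS kTC charge fluctuation sqrt(kTC), expressed in electrons."""
    return math.sqrt(K_BOLTZMANN * temperature_k * capacitance_f) / E_CHARGE


def photon_energy_j(wavelength_m=1550e-9):
    """Energy of one photon at the assumed wavelength."""
    return PLANCK * LIGHT_SPEED / wavelength_m


for cap in (1e-15, 10e-15, 100e-15, 1e-12):
    electrons = ktc_noise_electrons(cap)
    energy_aj = electrons * photon_energy_j() / 1e-18  # E_min * sigma_max
    power_uw = energy_aj * 1e-18 * 1e12 / 1e-6         # at 1 TMAC/s
    print(f"C={cap:.0e} F  dQ/e={electrons:.1f}  "
          f"E={energy_aj:.1f} aJ  P={power_uw:.1f} uW")
```

Because the noise charge grows as √C, a tenfold reduction in readout capacitance lowers the Johnson-noise energy floor by about a factor of three.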
TABLE 4. Shot noise for the incoherent and NetCast schemes (Table 1) and corresponding coefficients F.sub.src and F.sub.tr (Eq. (17)).

| Scheme | Noise σ.sup.2 | Coefficient F.sub.src | Coefficient F.sub.tr |
|---|---|---|---|
| S/S | N/N.sub.src | 1 | 1 |
| S/LN | ⟨|x.sub.n|⟩ (N/N.sub.src) | ⟨|x.sub.n|⟩ | ⟨|x.sub.n|⟩.sup.2 |
| LN/S | ⟨|w.sub.mn|⟩ (N/N.sub.src) | ⟨|w.sub.mn|⟩ | ⟨|w.sub.mn|⟩.sup.2 |
| LN/LN | ⟨|w.sub.mnx.sub.n|⟩ (N/N.sub.src) | ⟨|w.sub.mnx.sub.n|⟩ | ⟨|w.sub.mn|⟩⟨|w.sub.mnx.sub.n|⟩ |
| Coherent | ⟨|x.sub.n|.sup.2⟩ (N/N.sub.src) | ⟨|x.sub.n|.sup.2⟩ | ⟨|x.sub.n|.sup.2⟩⟨|w.sub.mn|.sup.2⟩ |
[0089] The shot noise term σ.sub.s scales inversely with the square root of power. This sets a lower bound on the optical power called the Standard Quantum Limit (SQL) because it arises from fundamental quantum fluctuations in coherent states (rather than thermal fluctuations, which can be avoided with a sufficiently small capacitance, or using avalanching or on-chip gain before the detector). The SQL may be relevant here for two reasons: (1) optical power budgets are much lower owing to laser efficiency, free-carrier effects, and nonlinear effects (while chips can tolerate 100 W of heating, most silicon-on-insulator (SOI) waveguides take at most 100 mW); and (2) links can be very low efficiency in many applications (e.g., long-distance free-space). Therefore, unlike the HD-ONN, a NetCast system may operate near the SQL.
[0090] Define coefficients F.sub.src and F.sub.tr by:
The power bound set by shot noise is therefore:
[0091] Thus, the energy bound is closely related to the coefficients F.sub.src, F.sub.tr. These coefficients can be obtained from the form of σ (Table 1); Table 4 lists the coefficients for each scheme. As mentioned above, by reducing the noise in the case of sparse or nearly-sparse weights or activations (⟨|x.sub.n|⟩, ⟨|w.sub.mn|⟩«1), low-noise designs can reduce the required laser power by a large factor. These factors F.sub.src and F.sub.tr, shown in Table 5 for the same MNIST neural networks, allow for a 10.sup.3× reduction in optical power consumption compared to the “simple” design.
[0092] At first glance, such a reduction seems unimportant because, even with the simple design, the noise-limited energy is E.sub.min=1.4 fJ/MAC, sufficiently low that on-chip electronics, e.g., DACs, ADCs, and memory, are likely to dominate. However, this noise-limited energy means that even at a modest throughput of 1 TMAC/s there should be 1.4 mW of optical power at the receiver. Given that lasers and EDFAs support at most 10-100 mW, this places a limit on the allowed optical fan-out, to say nothing of link loss or eye safety. For especially lossy links (e.g., drones connected at long distance over free space), there is a strong incentive to reduce E.sub.min as much as possible, even if it doesn't affect the client-side power budget.
[0093] Fortunately, both the coherent scheme and the LN/LN incoherent schemes can operate at very low transmitted energies of a few photons/MAC, enabling P.sub.min<1 μW even at 1 TMAC/s. With such a client, a 10 mW source can tolerate link losses (or fan-out ratios) of up to 10.sup.4. Alternatively, a lower-loss link could deliver enough power for 100 TMAC/s of computation, beating the TPU with a sub-mW (optical) power budget.
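The link-budget reasoning in the preceding paragraphs can be followed with a small calculation that converts photons per MAC into optical power and then into a tolerable loss or fan-out ratio. The 1550 nm wavelength is an assumption (it is not stated with Table 5), and the function names are hypothetical; the photon counts are the Small NN entries for S/S (source) and LN/LN (transmitted).

```python
PLANCK = 6.62607015e-34     # J*s
LIGHT_SPEED = 2.99792458e8  # m/s


def min_power_w(photons_per_mac, mac_rate_hz, wavelength_m=1550e-9):
    """Optical power needed to deliver photons_per_mac at a given MAC rate."""
    photon_energy = PLANCK * LIGHT_SPEED / wavelength_m  # assumed wavelength
    return photons_per_mac * photon_energy * mac_rate_hz


def max_link_loss(source_power_w, required_power_w):
    """Largest power ratio (link loss or fan-out) the source can absorb."""
    return source_power_w / required_power_w


RATE = 1e12  # 1 TMAC/s

p_simple = min_power_w(11_000, RATE)   # S/S: about 1.4 mW
p_low_noise = min_power_w(6.0, RATE)   # LN/LN, transmitted: below 1 uW
print(f"S/S:   {p_simple * 1e3:.2f} mW")
print(f"LN/LN: {p_low_noise * 1e9:.0f} nW")

# A 10 mW source against a 1 uW requirement leaves a 10^4 loss/fan-out margin.
print(max_link_loss(10e-3, 1e-6))
```

The same margin can be spent either on link loss or on fan-out to more clients, which is why the few-photon schemes matter most for lossy free-space links.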
[0094] For the low-noise incoherent schemes, Johnson noise may dominate over shot noise because the shot-noise bound is so low. To suppress Johnson noise, signal pre-amplification (e.g., with an EDFA or a semiconductor optical amplifier) or avalanching detectors can be used.
TABLE 5. Coefficients F.sub.src and F.sub.tr in Eq. (17), and estimated minimum power required to achieve acceptable SNR, both at the source (assuming no amplification) and as transmitted power.

| Network | Scheme | F.sub.src | N.sub.min (source) | E.sub.min (source) | P.sub.min.sup.† (source) | F.sub.tr | N.sub.min (transmitted) | E.sub.min (transmitted) | P.sub.min.sup.† (transmitted) |
|---|---|---|---|---|---|---|---|---|---|
| Small NN | S/S | 1.000 | 11,000 | 1.4 fJ | 1.4 mW | 1.000 | 11,000 | 1.4 fJ | 1.4 mW |
| Small NN | S/LN | 0.092 | 1,300 | 160 aJ | 160 μW | 0.092 | 1,300 | 160 aJ | 160 μW |
| Small NN | LN/S | 0.130 | 530 | 67 aJ | 67 μW | 0.020 | 57 | 7.2 aJ | 7.2 μW |
| Small NN | LN/LN | 0.015 | 110 | 15 aJ | 15 μW | 0.002 | 6.0 | 770 zJ | 770 nW |
| Small NN | Coherent | 0.061 | 1,100 | 140 aJ | 140 μW | 0.002 | 7.5 | 960 zJ | 960 nW |
| Large NN | S/S | 1.000 | 1,100 | 140 aJ | 140 μW | 1.000 | 1,100 | 140 aJ | 140 μW |
| Large NN | S/LN | 0.076 | 102 | 13 aJ | 13 μW | 0.076 | 102 | 13 aJ | 13 μW |
| Large NN | LN/S | 0.091 | 175 | 23 aJ | 23 μW | 0.011 | 27 | 3.6 aJ | 3.6 μW |
| Large NN | LN/LN | 0.009 | 17 | 2.2 aJ | 2.2 μW | 0.001 | 2.7 | 340 zJ | 340 nW |
| Large NN | Coherent | 0.048 | 86 | 11 aJ | 11 μW | 0.0007 | 1.5 | 180 zJ | 180 nW |

.sup.†Power P.sub.min calculated at 1 TMAC/s.
Client Electrical Power Consumption
[0095] Electrical power consumption at the client depends on: (1) fetching activations (the inputs to the DNN layer) from client memory, (2) driving the MZM, and (3) reading and digitizing the detector outputs.
[0096] By broadcasting the weights from the server to the client(s), NetCast eliminates the need to retrieve weights from client memory. In general, the weights of a DNN take up much more memory than the activations. For a fully connected layer, weights take up O(N.sup.2) memory while activations only take up O(N) (batching evens this out a bit, but the size of the mini-batch is usually smaller than N). Moreover, unlike the weights, all of which should be stored somewhere, during inference only the current layer's activations need to be stored at any time (excepting branch points and residual layers). Thus, the ratio of weights to activations should increase with the depth of the network and the size of its layers.
[0097] Without the weights, the client may be able to store the entire DNN's state in on-chip memory, eliminating dynamic random-access memory (DRAM) reads on the client side. Moreover, even when reading from on-chip memory, there is a data reuse factor of M from wavelength multiplexing in the MZM as shown in
[0098] Driving the MZM at the client does not consume much electrical power either. A free carrier-based uni-traveling-carrier (UTC) MZM transmitter uses O(1) pJ/bit. As with the memory reads, WDM amortizes the driver cost over M channels, so the energy per MAC is O(1/M) pJ. With many channels, the driving cost can be driven below tens of femtojoules/MAC. (This assumes the MZM is UTC over the whole bandwidth and neglects dispersion). More exotic modulators (e.g., based on LiNbO.sub.3, organic polymers, BaTiO.sub.3, or photonic crystals) could reduce the modulation cost to femtojoules, which would again be amortized by the 1/M factor from WDM. However, few-fJ/MAC performance is already possible with modulators available in foundries today.
[0099] Reading and digitizing the detector outputs at the client also consumes small amounts of electrical power. Readout and digitization power consumption is usually dominated by the analog-to-digital conversion (ADC), which is O(1) pJ/sample at 8 bits of precision. It may be possible to scale ADC energies down to 100 fJ or less by sacrificing a bit or two without harming performance. In any event, after dividing by N >100, the ADC cost is at most tens of femtojoules/MAC.
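The amortization arguments for the modulator driver and the ADC can be made concrete with a short sketch. The per-operation energies are the order-of-magnitude figures quoted above (about 1 pJ each); the channel count M=100 and vector length N=100 are illustrative assumptions, and the function name is hypothetical.

```python
def amortized_energy_j(energy_per_operation_j, reuse_factor):
    """Energy per MAC when one operation is shared by reuse_factor MACs."""
    return energy_per_operation_j / reuse_factor


M_CHANNELS = 100  # wavelength channels sharing one MZM symbol (assumed)
N_STEPS = 100     # time steps integrated before one ADC sample (assumed)

# One MZM symbol modulates all M channels, so its cost is split M ways.
mzm_per_mac = amortized_energy_j(1e-12, M_CHANNELS)
# One ADC sample reads out a sum over N time steps, so its cost is split N ways.
adc_per_mac = amortized_energy_j(1e-12, N_STEPS)

print(f"MZM drive: {mzm_per_mac * 1e15:.1f} fJ/MAC")
print(f"ADC:       {adc_per_mac * 1e15:.1f} fJ/MAC")
```

With these assumed values both contributions land at about 10 fJ/MAC, in line with the tens-of-femtojoules figures given in the text.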
[0100] The client may consume power for other operations, including tuning and controlling the ring resonators used as filters. Thermal ring tuning can raise the system-level power consumption figure for ring modulators from fJ/bit to pJ/bit. If the receiver WDM (designed with ring arrays as in
Server Electrical Power Consumption
[0101] In the highest power consumption scenario, the weight server stores all of its weights in DRAM and achieves zero local data reuse, so the power budget is dominated by DRAM reads (about 20 pJ/wt at 8-bit precision). At a target bandwidth of 1 Twt/s, this is approximately 20 W. The transmitter may add a few watts (assuming O(1) pJ/wt as before), and then there is the optical power considered earlier.
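The power figures in paragraph [0101] follow from multiplying the energy per weight by the weight rate. A minimal Python sketch of that arithmetic is given below; the function name `server_power_watts` is hypothetical, and the 1 pJ/wt transmitter cost is one assumed value within the O(1) pJ/wt range mentioned above.

```python
def server_power_watts(pj_per_weight, weights_per_second):
    """Power in watts for a per-weight energy cost at a given weight rate."""
    return pj_per_weight * 1e-12 * weights_per_second


dram_w = server_power_watts(20.0, 1e12)  # DRAM reads: about 20 pJ/wt at 1 Twt/s
tx_w = server_power_watts(1.0, 1e12)     # transmitter: assumed 1 pJ/wt
print(f"DRAM: {dram_w:.1f} W, transmitter: {tx_w:.1f} W")
```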
[0102] The NetCast server-client architecture can lead to entirely new dataflows because the server is freed from the tasks of computation and memory writes. For example, the weight server may be constructed as a wafer-scale weight server that stores the weights in static random-access memory (SRAM). With commensurate modulator improvements, the energy consumption can be reduced by orders of magnitude. In a wafer-scale server, the data should be stored locally to avoid both off- and on-chip interconnect costs.
[0103]
[0104] At first glance, a switching tree may seem energy-intensive if each leaf on the tree contains one weight and the switches are toggled every clock cycle. But in this case, each leaf can contain many weights and can wait for many clock cycles before switching. This greatly reduces the burden on the switching network. Even in the case where weights are stored in DRAM, however, NetCast should operate at reasonable powers with existing technology.
Applications for NetCast
[0105] There are many edge computing scenarios where smart sensors have a direct line of sight or a fiber-optic connection to a server but are power-starved. For example, complex machinery like aircraft contain hundreds of sensors that can be linked through fibers inside the airframe, as shown in
[0106]
[0107]
[0108] NetCast offers several advantages over other schemes of edge processing with DNNs. To start, it integrates the optical power in the analog domain and reads it out at the end, so the energy consumption is O(1/N) times smaller than digital optical neural networks. It can be used to implement large DNNs (e.g., with more than 10.sup.8 weights), which is not possible with today's integrated circuits. It can operate without phase coherence, which relaxes requirements on the stability of the links connecting the server to the clients. In addition, the links are not imaging links; they can be fiber-optic links or single-mode free-space links with simple Gaussian optics. Finally, the chip area scales as O(M), not O(MN) or O(N.sup.2), because NetCast is output-stationary, unlike schemes that are weight-stationary.
Distributed Training
[0109] Another exciting possibility is to perform distributed training using two-way optical links between the server and the client. Training allows the server to update its weights in real time from data being processed on the clients. The following method for training is compatible with NetCast and runs on similar hardware.
[0110] DNN training is a two-step process. First, the gradients of the loss function J with respect to the activations, χ.sub.n=∂J/∂x.sub.n and ψ.sub.m=∂J/∂y.sub.m, are computed by back-propagation. Within each layer, the backpropagation relation is:
χ.sub.n=Σ.sub.mw.sub.mnψ.sub.m (18)
and between layers it is:
ψ.sub.n=g′(x.sub.n)χ.sub.n (19)
In vectorized form, Eq. (18) can be written as the matrix product χ=w.sup.Tψ, while Eq. (19) is an elementwise weighting of the vector elements ψ=g′(x)χ.
[0111] Second, compute the weight update δ.sub.mn=∂J/∂w.sub.mn, i.e., the gradient of J with respect to the weights:
δ.sub.mn=ψ.sub.mx.sub.n (20)
which is just the vector outer product δ=ψx.sup.T. These relations are summarized in Table 6 and illustrated in
TABLE 6 Comparison of inference, backpropagation, and weight updates. The first two can be cast as matrix-vector multiplications with one optical input, an electrical input, and an electrical output ((O, E) → E). The weight update is different, taking the form of an outer product between two electrical inputs to produce an optical output ((E, E) → O).
Operation        Inputs                         Output                      Format
Inference        Weights w, Activations x       Activations y = wx          (O, E) → E
Backprop         Weights w, Gradients ψ         Gradients χ = w.sup.Tψ      (O, E) → E
Weight update    Activations x, Gradients ψ     Updates δ = ψx.sup.T        (E, E) → O
[0112] Backpropagation relies on a matrix-vector product. In terms of optics, this is straightforward to perform in NetCast: simply swap w for w.sup.T and everything runs the same as for inference. For the weight update, given the activation x and gradient ψ, compute the outer product δ=ψx.sup.T, and transmit the result (encoded optically in a compatible format) to the server.
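The three operations compared in Table 6 can be sketched numerically in plain Python. This is an electronic illustration of the linear algebra only, not of the optical hardware. The helper names, the toy 2×3 weight matrix, and the squared-error loss are assumptions chosen for the example; the finite-difference function is included only as a sanity check on the backpropagation relation.

```python
def matvec(w, x):
    """y = w x for an M x N matrix w (list of rows) and a length-N vector x."""
    return [sum(w_mn * x_n for w_mn, x_n in zip(row, x)) for row in w]


def transpose(w):
    """Return w^T as a list of rows."""
    return [list(col) for col in zip(*w)]


def outer(psi, x):
    """delta[m][n] = psi[m] * x[n], the weight-update outer product."""
    return [[p * x_n for x_n in x] for p in psi]


# Toy layer with M = 2 outputs and N = 3 inputs (arbitrary values).
w = [[0.5, -1.0, 2.0],
     [1.5, 0.0, -0.5]]
x = [1.0, 2.0, 3.0]
target = [1.0, -1.0]


def loss(w, x):
    """Illustrative loss J = 0.5 * sum((y - target)^2); not from the source."""
    return 0.5 * sum((y_m - t_m) ** 2
                     for y_m, t_m in zip(matvec(w, x), target))


def finite_difference_chi(w, x, n, eps=1e-6):
    """Central-difference estimate of dJ/dx_n, used only as a check."""
    x_plus, x_minus = list(x), list(x)
    x_plus[n] += eps
    x_minus[n] -= eps
    return (loss(w, x_plus) - loss(w, x_minus)) / (2 * eps)


y = matvec(w, x)                                  # inference: y = w x
psi = [y_m - t_m for y_m, t_m in zip(y, target)]  # psi_m = dJ/dy_m for this loss
chi = matvec(transpose(w), psi)                   # backprop: chi = w^T psi
delta = outer(psi, x)                             # weight update: delta = psi x^T

print("y =", y)
print("chi =", chi)
print("delta =", delta)
```

Inference and backpropagation use the same matrix-vector routine with w or w.sup.T, whereas the weight update has a different structure (two vectors in, one matrix out), which is why it calls for a separate optical encoding.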
[0113] Since the weight update is a matrix, it can be encoded in the same time-frequency format as the weight matrix as shown in
[0114]
[0115] In the simple client 730a of
Q.sub.det=|α.sub.mn.sup.(+)|.sup.2−|α.sub.mn.sup.(−)|.sup.2∝ψ.sub.mx.sub.n=δ.sub.mn (21)
[0116] If many of the activations or weights are very small, it can be difficult to resolve the signal Q.sub.det because of the large shot noise. The low-noise client 730a′ in
[0117] The coherent server 710b and client 730b share a common LO and so can encode the weights coherently. This involves cascading a frequency comb from a comb source 731 through a slow WDM-MZM 732b into a fast broadband MZM 733 on the client side and beating the resulting training signal against an LO comb from an LO 711 in a WDM homodyne detector 712b at the server 710b. In this case, the signal field (rather than power) scales as ψ.sub.mx.sub.n. With an LO amplitude α, the charge in each detector is Q.sub.±=(1/2)(α±{square root over (N.sub.src)}ψ.sub.mx.sub.n).sup.2 and the difference charge Q.sub.+−Q.sub.−=2α{square root over (N.sub.src)}ψ.sub.mx.sub.n scales as ψ.sub.mx.sub.n.
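The homodyne expression in paragraph [0117] can be checked numerically. The Python sketch below evaluates the two detector charges and their difference; the function names and the example values of α, N.sub.src, ψ.sub.m, and x.sub.n are assumptions for illustration, and noise is deliberately left out.

```python
import math


def homodyne_charges(alpha, n_src, psi_m, x_n):
    """Return (Q_plus, Q_minus) = 0.5 * (alpha +/- sqrt(n_src) * psi_m * x_n)^2."""
    signal = math.sqrt(n_src) * psi_m * x_n
    q_plus = 0.5 * (alpha + signal) ** 2
    q_minus = 0.5 * (alpha - signal) ** 2
    return q_plus, q_minus


def difference_charge(alpha, n_src, psi_m, x_n):
    """Q_plus - Q_minus, which expands to 2 * alpha * sqrt(n_src) * psi_m * x_n."""
    q_plus, q_minus = homodyne_charges(alpha, n_src, psi_m, x_n)
    return q_plus - q_minus


# Example values (arbitrary): the squared terms cancel in the difference,
# leaving a signed result that is linear in psi_m * x_n.
alpha, n_src, psi_m, x_n = 100.0, 400.0, 0.5, -0.2
expected = 2 * alpha * math.sqrt(n_src) * psi_m * x_n
print(difference_charge(alpha, n_src, psi_m, x_n), expected)
```

Because the difference is linear in the signal field, the sign of ψ.sub.mx.sub.n is preserved, and the LO amplitude α acts as a gain factor on the detected signal.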
TABLE 7 Comparison of the simple, low-noise, and coherent NetCast training schemes.
Scheme       Power N.sub.tr/N.sub.src          Signal Q.sub.det                                  Noise ⟨ΔQ.sup.2⟩             σ.sub.J.sup.2 × N.sub.src.sup.2    σ.sub.S.sup.2 × N.sub.src    σ.sub.S.sup.2 × N.sub.tr
Simple       1                                 N.sub.srcψ.sub.mx.sub.n                           N.sub.src                    kTC/e.sup.2                        1                            1
Low-Noise    |ψ.sub.m||x.sub.n|                N.sub.srcψ.sub.mx.sub.n                           N.sub.src|ψ.sub.mx.sub.n|    kTC/e.sup.2                        |ψ.sub.m||x.sub.n|           |ψ.sub.m|.sup.2|x.sub.n|.sup.2
Coherent     |ψ.sub.m|.sup.2|x.sub.n|.sup.2    2α{square root over (N.sub.src)}ψ.sub.mx.sub.n    α.sup.2                      —
[0118] Like inference, the accuracy of training in NetCast is limited by detector noise, which is a function of the optical power. In the large-signal limit, this noise leads to a Gaussian term in the calculated outer product:
δ.sub.mn=ψ.sub.mx.sub.n+N(0, σ.sub.mn.sup.2) (22)
[0119] While σ.sub.mn often depends on the specific matrix element, it can be more convenient to look at the average σ.sup.2=⟨σ.sub.mn.sup.2⟩. This noise variance is a sum of Johnson and shot-noise terms, σ.sup.2=σ.sub.J.sup.2+σ.sub.S.sup.2, which scale as σ.sub.J∝N.sub.src.sup.−1 and σ.sub.S∝N.sub.src.sup.−1/2. Table 7 compares the noise amplitudes for the three training schemes. For a trained DNN, |x.sub.n|<0.1; if this remains true in training and ψ.sub.m is similarly sparse, the low-noise design can reduce noise (or reduce power at fixed noise) by a factor of 10.sup.3-10.sup.4 compared to the simple design. The noise reduction (or energy savings) of the coherent design may also be significant.
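The quoted factor of 10.sup.3-10.sup.4 can be reproduced with a short calculation. The Python sketch below rests on one reading of the σ.sub.S.sup.2 × N.sub.tr column of Table 7, taken per matrix element: a factor of 1 for the simple scheme and |ψ.sub.m|.sup.2|x.sub.n|.sup.2 for the low-noise scheme. That reading, the function names, and the example magnitudes are assumptions. Johnson noise and the coherent scheme are not modeled.

```python
def shot_noise_variance(scheme, psi_abs, x_abs, n_tr):
    """Shot-noise variance sigma_S^2 at a transmitted photon number n_tr.

    Assumed normalized factors (one reading of Table 7):
    simple -> 1, low-noise -> (|psi| * |x|)^2.
    """
    if scheme == "simple":
        factor = 1.0
    elif scheme == "low-noise":
        factor = (psi_abs * x_abs) ** 2
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return factor / n_tr


def noise_reduction(psi_abs, x_abs):
    """Ratio of simple to low-noise shot-noise variance at equal n_tr."""
    simple = shot_noise_variance("simple", psi_abs, x_abs, 1.0)
    low_noise = shot_noise_variance("low-noise", psi_abs, x_abs, 1.0)
    return simple / low_noise


# Example magnitudes (arbitrary) for sparse activations and gradients.
for psi_abs, x_abs in ((0.1, 0.1), (0.3, 0.1)):
    print(f"|psi|={psi_abs}, |x|={x_abs}: "
          f"reduction ~ {noise_reduction(psi_abs, x_abs):.0f}x")
```

With both magnitudes near 0.1 the ratio is about 10.sup.4, and it falls toward 10.sup.3 as the gradient magnitude rises to about 0.3, in line with the range stated above.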
[0120] If training is really distributed, the server may receive weight updates from multiple clients. While the client-side power budget for weight transmission is quite low (O(M)+O(N) for an M×N matrix), on the server side, it is O(MN) since every weight is read to memory. If the server processes the weight updates of the clients independently, it may run into severe bandwidth and energy bottlenecks. Therefore, it can be highly advantageous to combine these updates optically before the server reads them out.
[0121]
[0122]
{circumflex over (α)}.sub.k=α.sub.k+N(0,1/4) ⇒ Σ.sub.k{circumflex over (α)}.sub.k=Σ.sub.kα.sub.k+N(0,K/4) (23)
to first combining the fields optically (α=K.sup.−1/2Σ.sub.kα.sub.k) and then performing homodyne detection:
{circumflex over (α)}=K.sup.−1/2Σ.sub.kα.sub.k+N(0,1/4)=K.sup.−1/2[Σ.sub.kα.sub.k+N(0,K/4)] (24)
[0123] The results in Eqs. (23) and (24) differ by a scaling factor; the SNR is the same. Therefore, in the coherent scheme, the weight updates can be combined without loss of signal. Beyond this, another advantage of the coherent scheme is speed: without interleaving, it is much faster in the case of many clients. In the incoherent case, interleaving can limit the weight update rate to the bounds derived above. By contrast, with coherent optics, these weight updates are optically batched and the bound no longer applies. This could be a major advantage in systems that have many clients and are (optical) throughput-limited.
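The claim that Eqs. (23) and (24) give the same SNR can be illustrated with a small Monte Carlo simulation in Python. The sketch assumes that each homodyne measurement adds independent Gaussian noise of variance 1/4 and reads the summed noise in Eq. (23) as having variance K/4. The function names, the four example field amplitudes, the trial count, and the fixed random seed are illustrative choices.

```python
import math
import random
import statistics


def detect_then_sum(alphas, rng):
    """Eq. (23): homodyne-detect each field (noise variance 1/4), then add."""
    return sum(a + rng.gauss(0.0, 0.5) for a in alphas)


def combine_then_detect(alphas, rng):
    """Eq. (24): combine fields with K^(-1/2) scaling, then detect once."""
    k = len(alphas)
    return sum(alphas) / math.sqrt(k) + rng.gauss(0.0, 0.5)


def snr(samples):
    """Squared mean divided by variance of repeated measurements."""
    return statistics.fmean(samples) ** 2 / statistics.variance(samples)


rng = random.Random(0)
alphas = [1.0, 2.0, 0.5, 1.5]  # K = 4 client fields (arbitrary values)
trials = 20000

snr_23 = snr([detect_then_sum(alphas, rng) for _ in range(trials)])
snr_24 = snr([combine_then_detect(alphas, rng) for _ in range(trials)])
analytic = sum(alphas) ** 2 / (len(alphas) / 4)  # (sum of alpha_k)^2 / (K/4)

print(f"Eq. (23): {snr_23:.2f}  Eq. (24): {snr_24:.2f}  analytic: {analytic:.2f}")
```

Both estimates should land close to the analytic value, while the optical combination needs only one detection per matrix element regardless of the number of clients.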
Conclusion
[0124] While various inventive embodiments have been described and illustrated herein, those of ordinary skill in the art will readily envision a variety of other means and/or structures for performing the function and/or obtaining the results and/or one or more of the advantages described herein, and each of such variations and/or modifications is deemed to be within the scope of the inventive embodiments described herein. More generally, those skilled in the art will readily appreciate that all parameters, dimensions, materials, and configurations described herein are meant to be exemplary and that the actual parameters, dimensions, materials, and/or configurations will depend upon the specific application or applications for which the inventive teachings is/are used. Those skilled in the art will recognize or be able to ascertain, using no more than routine experimentation, many equivalents to the specific inventive embodiments described herein. It is, therefore, to be understood that the foregoing embodiments are presented by way of example only and that, within the scope of the appended claims and equivalents thereto, inventive embodiments may be practiced otherwise than as specifically described and claimed. Inventive embodiments of the present disclosure are directed to each individual feature, system, article, material, kit, and/or method described herein. In addition, any combination of two or more such features, systems, articles, materials, kits, and/or methods, if such features, systems, articles, materials, kits, and/or methods are not mutually inconsistent, is included within the inventive scope of the present disclosure.
[0125] Also, various inventive concepts may be embodied as one or more methods, of which an example has been provided. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.
[0126] All definitions, as defined and used herein, should be understood to control over dictionary definitions, definitions in documents incorporated by reference, and/or ordinary meanings of the defined terms.
[0127] The indefinite articles “a” and “an,” as used herein in the specification and in the claims, unless clearly indicated to the contrary, should be understood to mean “at least one.”
[0128] The phrase “and/or,” as used herein in the specification and in the claims, should be understood to mean “either or both” of the elements so conjoined, i.e., elements that are conjunctively present in some cases and disjunctively present in other cases. Multiple elements listed with “and/or” should be construed in the same fashion, i.e., “one or more” of the elements so conjoined. Other elements may optionally be present other than the elements specifically identified by the “and/or” clause, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, a reference to “A and/or B”, when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.
[0129] As used herein in the specification and in the claims, “or” should be understood to have the same meaning as “and/or” as defined above. For example, when separating items in a list, “or” or “and/or” shall be interpreted as being inclusive, i.e., the inclusion of at least one, but also including more than one, of a number or list of elements, and, optionally, additional unlisted items. Only terms clearly indicated to the contrary, such as “only one of” or “exactly one of,” or, when used in the claims, “consisting of,” will refer to the inclusion of exactly one element of a number or list of elements. In general, the term “or” as used herein shall only be interpreted as indicating exclusive alternatives (i.e., “one or the other but not both”) when preceded by terms of exclusivity, such as “either,” “one of,” “only one of,” or “exactly one of.” “Consisting essentially of,” when used in the claims, shall have its ordinary meaning as used in the field of patent law.
[0130] As used herein in the specification and in the claims, the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements. This definition also allows that elements may optionally be present other than the elements specifically identified within the list of elements to which the phrase “at least one” refers, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, “at least one of A and B” (or, equivalently, “at least one of A or B,” or, equivalently “at least one of A and/or B”) can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc.
[0131] As used herein in the specification and in the claims, when a numerical range is expressed in terms of two values connected by the word “between,” it should be understood that the range includes the two values as part of the range.
[0132] In the claims, as well as in the specification above, all transitional phrases such as “comprising,” “including,” “carrying,” “having,” “containing,” “involving,” “holding,” “composed of,” and the like are to be understood to be open-ended, i.e., to mean including but not limited to. Only the transitional phrases “consisting of” and “consisting essentially of” shall be closed or semi-closed transitional phrases, respectively, as set forth in the United States Patent Office Manual of Patent Examining Procedures, Section 2111.03.