Power/Performance/Cost Efficiency of Adiabatic Circuits, as a function of Device On/Off Power Ratios Michael P. Frank CISE Department / ECE Dept. Brown Bag Seminar Tue., Mar. 26 ITRS Feature Size Projections 1000000 uP chan L DRAM 1/2 p 100000 min Tox Human hair thickness max Tox Eukaryotic

cell Feature Size (nanometers) 10000 1000 Bacterium Virus 100 Protein molecule 10 DNA molecule thickness

1 Atom 0.1 1955 1960 1965 1970 1975 1980 1985 1990 1995 2000 2005 2010 2015 2020 2025 2030 2035 2040 2045 2050 Year of First Product Shipment We are here ITRS Feature Size Projections 1000 Bacterium uP chan L DRAM 1/2 p min Tox max Tox Feature Size (nanometers) 100

Virus Protein molecule 10 DNA molecule thickness 1 Atom 0.1 1995 2000 2005 We are here

2010 2015 2020 2025 2030 Year of First Product Shipment 2035 2040 2045 2050 Source: ITRS 99

Across Multiple Technologies Vacuum Tubes Integrated Circuits Mechanical Discrete Electromechanical Transistors Relays Source: Kurzweil, The Age of Spiritual Machines, pp. 22-25 Min transistor switching energy, kTs Trend of minimum transistor switching energy 1000000 High 100000

Low 10000 trend 1000 100 10 1 1995 2005 2015 2025 2035 Year of First Product Shipment CV2 based on ITRS 99 figures for Vdd and minimum transistor gate capacitance. T=300 K

Information & Entropy Example: System with 3 twostate subsystems, such as quantum spins. Physical Information Known information plus entropy = maximum entropy = maximum known information Known Information Entropy Spin label 1 2 3 Informational

Status Entropy Known information Entropy 1 2 3 Ruled out by some knowledge 23=8 states Illustrating Landauers principle Before bit erasure s0 0 1

1 0 sN1 sN 0 2N states 0 sN-1 Unitary

(1-1) evolution State of bit to be erased. s0 s0 sN-1 N states 0

N states After bit erasure s2N-1 0 State of rest of system (thermal modes, &c.) Conventional Gates are Irreversible Logic gate behavior (on receiving new input):

Many-to-one transformation of local state! Required to dissipate bT by Landauer principle Incurs CV2 dissipation in 2 out of 4 cases. Example: Static CMOS Inverter: in out Transformation of local state: Just before After transition: transition: in out 0 0 0 1 1 0 1 1

in out 0 1 1 0 Adiabatic Charging in CMOS Exact formula (if R const.): Ediss f 1 f e 1/ f 1 CV 2 for frequency factor f : RC/t Adiabaticity is Fundamental Adiabatic (dissipation quickness) processes can occur in any type of system. Cf. Adiabatic theorem of quantum mechanics. Specific adiabatic logics have been described for many proposed future device technologies:

Superconducting (Likharev 82, Averin et al. 01) Nanomechanical (Drexler 92, & Merkle mid-90s) Quantum-dot (Lent & Tougaw, mid-90s-present) Quantum computing implementations (inherently) Claim: Work on architectures & analysis for adiabatic CMOS will still apply post-CMOS! Adiabatic Rules for Transistors Rule 1: Never turn on a transistor if it has a nonzero voltage across it! I.e., between its source & drain terminals. Why: This erases info. & causes CV2 disspation. Rule 2: Never apply a nonzero voltage across a transistor even during any onoff transition!

Why: When partially turned on, the transistor has relatively low R, gets rel. high P=V2/R dissipation. Corollary: Never turn off a transistor when it has a nonzero current going through it! Why: As R gradually increases, the V=IR voltage drop will build, and then rule 2 will be violated. Adiabatic Rules continued Transistor Rule 3: Never apply a large voltage across any on transistor. Why: So transition will be more reversible; dissipation will approach CV2(RC/t), not CV2. Adiabatic rules for other components: Diodes: Dont use them at all! There is always a built-in voltage drop across them! Resistors: Avoid moderate network resistances. e.g. stay away from range >10 k and <1 M Capacitors: Minimize, reliability permitting.

Note: Adiabatic dissipation scales with C2! Transistor Rules Summarized Legal transitions in green. (For n- or p-FETs.) Dissipative states and transitions in red. high high off high low low on low high high

off low off low off high low on high low on low on high

Transformation of local state: Just before transition: After transition: in out 0 1 in out 0 1 1 0 Simple Reversible CMOS Latch Uses a standard CMOS transmission gate Sequence of operation: (1) input initially matches latch contents (output), (2) input changesoutput changes, (3) latch closes,

(4) input removed. b a Before input: in out a a P in out P b a Input arrived: in out

a a b b Input removed: in out a a a b Generic Frictional Coefficients Normal defs. of friction (coeff. of sliding friction, viscosity, etc.) may not apply to all processes. For a given mechanism executing a specified process (i.e., following a specified desired trajectory or -ies) adiabatically over a time t: Energy coefficient: cE = Elostt = Elost/q Energy dissipated from traj. per unit of quickness Note quickness q = 1/t has units like Hz Entropy coefficient: cS = Smadet = Smade/q New entropy generated per unit of quickness

Note that cE = cST at temperature T. What matters! Energy Coefficient in Electronics For charging capacitive load C by voltage V through effective resistance R: cE = Elostt = (CV2RC/t)t = C2V2R If the resistances are voltage-controlled switches with gain factor k controlled by the same voltage V, then effective R 1/kV cE = C2V/k In constant-field-scaled CMOS, k 1/hox , C , and V , so cE 3/ = 4; Elost = cE/t 4/ = 3 (like CV2 energy) Entropy coefficients of some reversible logic gate operations From Frank 98, Ultimate theoretical models of nanocomputers (Nanotechnology journal): SCRL, circa 1997:

~1 b/Hz Optimistic reversible CMOS: ~10 b/kHz Merkles quantum FET: ~1.2 b/GHz Nanomechanical rod logic: ~.07 b/GHz Superconducting PQ gate: ~25 b/THz Helical logic: ~.01 b/THz How low can you go? We dont really know! Quantifying Leakage For a given structured system: Leakage power: Pleak = dEleak / dt Spontaneous entropy generation rate: Sleak = dSleak / dt Again, note Pleak = Sleak T at temperature T.

Minimum Losses w. Leakage topt cS cE Pleak S leak Etot = Eadia + Eleak Eleak = Pleaktr 2 Pleak cE 2T S leak cS Eadia = cE / tr Min. energy & Roff/Ron ratio Note that:

cE = C2V2Ron and if dominant leakage is source/drain: Pleak = V2/Roff So: cEPleak = C2V4/(Roff/Ron) Emin = 2(cEPleak)1/2 = 2CV2(Roff/Ron)1/2 So: Qmax = CV2 / (2CV2(Roff/Ron)1/2) = (Roff/Ron)1/2 = (Ion/Ioff)1/2 Clock/Power Supply Desiderata Requirements for an adiabatic timing signal / power supply: Generate trapezoidal waveform with very flat high/low regions Flatness limits Q of logic. Waveform during transitions is ideally linear, But this does not affect maximum Q, only energy coefficient.

Operate resonantly with logic, with high Q. Power supply Q will limit overall system Q Reasonable cost, compared to logic it powers. If possible, scale Q t (cycle time) Required to be considered an adiabatic mechanism. May conflict w. inductor scaling laws! At the least, Q should be high at leakage-limited speed (Ideally, independent of t.) Supply concepts in my research Superpose several sinusoidal signals from phasesynchronized oscillators at harmonics of fundamental frequency Weight these frequency components as per Fourier transform of desired waveform

Create relatively high-L integrated inductors via vertical, helical metal coils Only thin oxide layers between turns Use mechanically oscillating, capacitive MEMS structures in vacuo as high-Q (~10k) oscillator Use geometry to get desired wave shape directly A MEMS Supply Concept Energy stored mechanically. Variable coupling strength -> custom wave shape. Can reduce losses through balancing, filtering. Issue: How to adjust frequency? Summary of Limiting Factors

When considering adiabaticizing a system: What fraction of system power is in logic? fL Vs. Displays, transmitters, propulsion. What fraction of logic is done adiabatically? fa Can be all, but w. cost-efficiency overheads. How large is the Ion/Ioff ratio of switches? Affects leakage & minimum adiabatic energy. What is the Qsup of the resonant power supply? What is the relative cost of power & logic? r$ E.g. decreasing power cost by r$ by increasing HW cost by r$ will not help. Power premium Minimizing cost/performance

$P = Cost of power in original system $H = Cost of logic HW in original system $P = r$$H; $H = $P/r$ For cost-efficiency inverse to energy savings: $tot,min = $Pr$-1/2 + $Hr$1/2 = 2 $Pr$-1/2 $tot,orig = $P + $H = (1+r$)$H = ((1+r$)/r$) $P $tot,orig/$tot,min = (1+r$)r$-1/2 r$1/2 for large r$ Summary of adiabatic limits Cost-effective adiabatic energy savings factor : Sa = Econv / Eadia in cost-effective adiabatic system Some rough upper bounds on Sa: Sa ~ 1/(1fL) Sa ~ 1/(1fa) Sa ~ (Ion/Ioff)1/2 Sa Qsup

(worse than these 1/2 Sa ~ r$ for non-ideal Discussion ignores benefits from adiabatics of denser computations) packing & smaller communications delays in parallel algorithms. Motivation for this study We want to know how to carry out any arbitrary computation in a way that is reversible to an arbitrarily high degree. Up to limits set by leakage, power supply, etc. We want to do this as efficiently as possible: Using as few device ticks as possible (spacetime) Minimizes HW cost, & leakage losses Using as few adiabatic transitions as possible (ops)

Minimizes frictional losses But, a desired computation may be originally specd in terms of irreversible primitives. General-Case vs. Special-Case Wed like to know two kinds of things: For arbitrary general-purpose computations, How to automatically emulate them in a fairly efficient reversible way, w/o needing new intelligent/creative design work in each case? For various specific computations of interest, What are the most efficient reversible algorithms? Or at least, the most efficient that we can find? Note: These may not look anything like the most efficient irreversible algorithms!

The Landauer embedding The obvious embedding of irreversible ops into expanding reversible ones leads to a linear increase in space through time. (Landauer 61) Or, increase in width of an input-consuming circuit Expanding operations (e.g., AND) Desired output Garbage bits input Circuit depth, or time Lecerf Reversal Lecerf (63) was interested in the group-theory question of whether an iterated permutation of

items would eventually return to initial item. Proved undecidable by reducing Turings halting problem to this question, w. a reversible TM. Reversible TM reverses direction instead of halting. Returns to initial state iff irreversible TM would halt. Desired Only problem: output No useful output data! f f 1 Garbage Input Copy of Input The Bennett Trick Bennett (73) pointed out that you could simply

fan-out (reversibly copy) the desired output before reversing. Note O(T) storage is still temporarily needed! Desired output f f 1 Copy of Input Input Garbage Triangle Representation Represents any use of Bennett 73 embedding State of irrev. comp. @ time ti+ti ti State of irrev.

comp. @ time ti Time in irreversible system Adiabatic Process Forward Reverse phase phase Time in reversible system Mass on any vertical line = space usage @ that time Improving Spacetime Efficiency Bennett 73 transforms a computation taking spacetime

ST to one taking (ST2) in the worst case. Can we do better? Bennett 89: Described a technique that takes spacetime Actually, can generalize slightly and arrange for exponent on log 2 3 1.59 T to be 1+, where 0 (very slowly) ( S T log T ) ( S T log T ) Lange, McKenzie, Tapp 97: Space (S) is possible, if you use time (exp((S))) Not any more spacetime-efficient than Bennett.

Pebble Game Representation Triangle representation k=2 n=3 k=3 n=2 Analysis of Bennett Algorithm n = # of recursive levels of algorithm (n+1 spikes) k = # of lower-level iterations to go forward 1 higherlevel step Tr = # of reversible lowest-level steps executed = c(2k1)n (c a small constant, e.g. 2) Ti = # of irreversible steps emulated = kn So, n = logk Ti, and so Tr = c(2k1)log Ti/log k = celog(2k1)log(Ti)/log k = cTilog(2k 1)/log k

E.g. k=2: Tr = 2Tilog(3)/log(2) Cost-Efficiency Analysis Total cost of doing a computation includes: Spacetime costs (storage used, integrated over time) Includes time-amortized manufacturing cost Includes cost of total energy leakage leakage from any in-use storage element Irreversibility costs (energy loss from irrev. ops) Total number of irreversible bit-erasures, CV2 > kT each. Adiabatic costs (energy loss from reversible ops.) Proportional to number na of adiabatic ops performed, times ce, divided by time top of a single op. Bennett 89 alg. is not optimal k=2 n=3

k=3 n=2 Just look at all the spacetime it wastes!!! Parallel Frank02 algorithm We can simply squish the triangles closer together to eliminate the wasted spacetime! Resulting algorithm is linear time for all n and k and dominates Ben89 for time, spacetime, & energy! Emulated time k=2 n=3 k=3 n=2 k=4 n=1 Real time

Setup for Analysis For energy-dominated limit, let cost $ equal energy. c$ = energy coefficient, r$ = r$(min) = leakage power $i = energy dissipation per irreversible state-change Let the on/off ratio Ron/off = r$(max)/r$(min) = Pmax/Pmin. Note that c$ $itmin = $i ($i / r$(max)), so r$(max) $i2/c$ So Ron/off $i2 / c$r$(min) = $i2 / c$r$ Time Taken There are n levels of recursion. Each multiplies the width of the base of the triangle by k. Lowest-level triangles take time ctop. Total time is thus ctopkn.

k=4 n=1 Width 4 sub-units Number of Adiabatic Ops Each triangle contains k + (k 1) = 2k 1 immediate sub-triangles. There are n levels of recursion. Thus number of adiabatic ops is c(2k 1)n k=3 n=2 52 = 25 little triangles (adiabatic operations) Spacetime Usage Each triangle includes the spacetime usage of all k 1 of its subtriangles, Plus, k 2 (k 2)(k 1) k 2 3k 2

i i 1 2 2 additional spacetime units, each consisting of 1 storage unit, for time topkn1 k=5 1 n=1 2 3 1+2+3 units 1 state of irrev. mach. Being stored Time top kn-1 Resulting recurrence relation: ST(k,0) = 1 (or c) ST(k,n) = (2k1)ST(k,n1) + (k23k+2)kn1/2

Reversible Cost Adiabatic cost plus spacetime cost: $r = $a + $r = (2k-1)nc$/t + ST(k,n)r$t Minimizing over t gives: $r = 2[(2k-1)n ST(k,n) c$ r$]1/2 But, in energy-dominated limit, c$ r$ $i2 / Ron/off, So: $r = 2$i [(2k-1)n ST(k,n) / Ron/off]1/2 Tot. Cost, Orig. Cost, Advantage Total cost $i for irreversible operation performed at end of algorithm, plus reversible cost, gives: $tot = $i {1 + 2[(2k-1)n ST(k,n) / Ron/off]1/2} Original irreversible machine performing kn ops would use cost $orig = $ikn, so, Advantage ratio between reversible & irreversible cost, n

$ R$(i/r ) (k , n) orig $tot k (2k 1) n M (k , n) 1 2 Ron/off Optimization Algorithm For any given value on Ron/off, Scan the possible values of n (up to some limit), For each of those, scan the possible values of k, Until the maximum R$(i/r) for that n is found (the function only has a single local maximum)

And return the max R$(i/r) over all n tried. Cost-Efficiency Gains, Modified Ben89 Advantage in Arbitrary Computation 100000000 0.6198 y = 1.741x 10000000 70 60 1000000 50 100000

10000 ac p S 1000 100 etim En e 10 e up w blo y = 0.3905x0.3896

30 ed v a s rgy k 1 20 10 n 0.1 1 100

10000 1000000 40 10000000 0 On/Off Ratio of Individual Devices 1E+10 0 1E+12 out hw n k

Asymptotic Scaling The potential energy savings factor scales as R$ ~0.4, (i/r) Ron/off while the spacetime overhead goes only as R$(i/r) R$(i/r)~0.45, or Ron/off~0.18. E.g., with an Ron/off of 109, you can do worstcase computation in an adiabatic circuit with: An energy savings of up to a factor of 1,200 ! But, this point is 700,000 less hardware-efficient! Conclusions A new, more spacetime-efficient & energy-efficient algorithm for doing arbitrary computations adiabatically has been described. The energy savings in worst-case computations goes as the ~0.4th power of device on/off ratio. Best case computations: 0.5th power. However, the reduction in spacetime efficiency scales with energy savings to the ~1.6th power. Still much faster than we would like!

Adiabatics can be generally cost-effective, but still only for heavily energy-dominated apps.