US20040030858A1 - Reshuffled communications processes in pipelined asynchronous circuits - Google Patents


Publication number
US20040030858A1
Authority
US
United States
Prior art keywords
inputs
logic
input
circuit
precharge
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US10/294,044
Inventor
Andrew Lines
Alain Martin
Uri Cummings
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
California Institute of Technology CalTech
Original Assignee
California Institute of Technology CalTech
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by California Institute of Technology CalTech
Priority to US10/294,044
Publication of US20040030858A1
Assigned to CALIFORNIA INSTITUTE OF TECHNOLOGY reassignment CALIFORNIA INSTITUTE OF TECHNOLOGY ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: MARTIN, ALAIN J., LINES, ANDREW M., CUMMINGS, URI
Assigned to ADD INC. reassignment ADD INC. LICENSE (SEE DOCUMENT FOR DETAILS). Assignors: CALIFORNIA INSTITUTE OF TECHNOLOGY
Priority to US11/433,203 (US7934031B2)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00: Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38: Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3867: Concurrent instruction execution, e.g. pipeline, look ahead using instruction pipelines
    • G06F9/3869: Implementation aspects, e.g. pipeline latches; pipeline synchronisation and clocking

Definitions

  • This specification describes communicating sequential processes (CSP) which are implemented as quasi delay insensitive asynchronous circuits. More specifically the present specification teaches reshuffling communication sequences and combining computation with buffering to produce pipelined circuits.
  • Asynchronous processors are known as described in U.S. Pat. No. 5,752,050. These processors process an information stream without a global clock synchronizing the operation.
  • An asynchronous processor pipeline scheme uses the basic layout shown in FIG. 1.
  • A first process 100 communicates with a second process 110, which in turn sends a message to the next process.
  • The messages use a four-phase handshake.
  • In the first phase, the sender raises the request line.
  • In the second phase, the receiver raises the acknowledge line.
  • In the third phase, the sender lowers the request line.
  • In the fourth phase, the receiver lowers the acknowledge line.
  • In the handshaking expansion language (HSE), the handshake on channel X is described as X+; Xa+; X−; Xa−.
  • In FIG. 1, the request between 100 and 110 is the L wire (102).
  • The acknowledge for that communication is La (108).
  • The request between 110 and 120 is the R wire (104), and the acknowledge is Ra (106).
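The four phases can be traced in a few lines of Python. This is only an illustrative sketch: the generator name `four_phase_handshake` and the 0/1 wire model are assumptions made here, not part of the specification.

```python
def four_phase_handshake():
    """Yield (event, request, acknowledge) for one cycle on a channel X."""
    x, xa = 0, 0              # request and acknowledge wires start low (idle)
    x = 1                     # phase 1: sender raises the request
    yield ("X+", x, xa)
    xa = 1                    # phase 2: receiver raises the acknowledge
    yield ("Xa+", x, xa)
    x = 0                     # phase 3: sender lowers the request
    yield ("X-", x, xa)
    xa = 0                    # phase 4: receiver lowers the acknowledge
    yield ("Xa-", x, xa)


trace = list(four_phase_handshake())
for event, x, xa in trace:
    print(f"{event:4s} X={x} Xa={xa}")
```

After the fourth phase both wires are low again, so the same cycle can repeat for the next message.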
  • Pipelined asynchronous circuits are known as “Bundled-Data” or “Micropipelines” and have a synchronous style data path which is “clocked” by asynchronous self-timed control elements. These control elements handshake between pipeline stages with a request/acknowledge pair. The delay of the datapath logic is estimated with a delay-element in the control, so that the request to the next pipeline state is not made until the data is assumed to be valid.
  • The alternative style involves (quasi) delay-insensitive circuits, for which no delay assumptions are made.
  • In this style, the prior art is embodied in the Caltech Asynchronous Microprocessor patent. Datapaths are still separated from control, as in the bundled-data case, but completion detection circuitry is added instead of delay lines to detect when the data is valid. Communication between processes occurs via delay-insensitive channels with a 4 phase handshake. In between latches or buffers, logic can be performed by unpipelined weak-condition logic blocks.
  • The present system teaches a way of pipelining this handshake to allow certain processes to occur closer to simultaneously.
  • The disclosed system is a delay insensitive system that uses a combination of logic and buffering to resequence certain operations.
  • A new way of pipelining quasi-delay-insensitive circuits is disclosed in which control is not explicitly separated from the datapath. No extra buffers or latches are added between logic blocks. Instead, the state-holding property of a buffer is combined directly with a dual-rail domino logic computation.
  • The tokens travel through the pipeline as in the case of simple buffers. The tokens also carry values which are computed upon. By not separating control from data, and by carefully designing the circuit parts which handle the handshakes, higher throughput is expected.
  • The extra handshake circuitry typically adds no more than 50% area.
  • The supporting circuitry which handles the handshake takes place in precharge domino logic of a type that is common in synchronous design. Additional circuits detect the validity of the input and output channels (common in asynchronous design). An acknowledge circuit acknowledges the inputs and precharges the logic.
  • The circuit implementations disclosed in this patent include components for logic computation, plus components to detect the validity of the input and output data, and another component to generate the acknowledges and precharge the logic.
  • The details and composition of these pieces generate fast quasi delay insensitive circuits superior to the prior art.
  • Further enhancements of this combined buffer/logic cell are also disclosed. These include the ability to conditionally communicate on either inputs or outputs, so as to implement routing functionality. Also, mechanisms for efficiently implementing internal state variables are described.
  • FIG. 1 shows a basic pipelining system and some of the signals used in that system
  • FIG. 1A shows a basic precharge type buffer in block diagram form
  • FIG. 2 shows a basic weak condition half buffer circuit
  • FIG. 3 shows the transistor diagrams for the weak condition half buffer
  • FIG. 4 shows a precharge buffer, with the transistor arrangement at the top; and the gate arrangement at the bottom;
  • FIG. 5 shows a split precharge circuit
  • FIG. 6 shows a merge precharge circuit
  • FIG. 7 shows a Reg precharge circuit
  • The present system is based on a way of pipelining the information in the FIG. 1 drawing using precharge logic that allows the operations to occur in parallel. Pipelining allows a system to carry out more than one operation at the same time. Put another way, a pipelined system does not need to wait for one action to be completed before the other action is carried out. However, if one attempts to reset data before using it, then the data is lost.
  • The present system teaches a way of dealing with this issue by reshuffling the communication sequence, storing certain information within the sequence, and enabling more efficient pipelining of information.
  • A “pipeline” is a linear sequence of buffers where the output of one buffer connects to the input of the next buffer as shown in FIG. 1.
  • “Tokens” 99 are sent into the input end of the pipeline, and flow through each buffer to the output end. The tokens remain in first-in-first-out (FIFO) order.
  • For synchronous pipelines, the tokens usually advance through one stage on each clock cycle. For asynchronous pipelines there is no global clock to synchronize the movement. Instead, each token moves forward down the pipeline when there is an empty cell in front of it; otherwise, the token stalls. Effectively, the tokens behave like cars on a freeway.
  • The buffer capacity or “slack” of an asynchronous pipeline is proportional to the maximum number of tokens that can be packed into the pipeline without stalling the input end of the pipeline.
  • The “throughput” is the number of tokens per second which pass a given stage in the pipeline.
  • The “forward latency” is the time it takes a given token to travel the length of the pipeline.
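The token movement rule described above (advance only when the cell in front is empty, otherwise stall) can be sketched as a small discrete-step simulation. The functions `step` and `run`, the use of `None` for an empty cell, and the assumption that the output end is always ready to take a token are all illustrative choices made here, not details from the specification.

```python
def step(cells, incoming=None):
    """Advance each token whose next cell was empty at the start of the step.

    cells[0] is the input end and cells[-1] the output end; None marks an
    empty cell. Returns (new_cells, emitted_token, accepted).
    """
    old = list(cells)
    new = list(cells)
    emitted = None
    if old[-1] is not None:            # assumed: the environment always accepts
        emitted = old[-1]
        new[-1] = None
    for i in range(len(old) - 1):
        if old[i] is not None and old[i + 1] is None:
            new[i + 1] = old[i]        # token moves into the empty cell ahead
            new[i] = None
    accepted = incoming is not None and old[0] is None
    if accepted:
        new[0] = incoming              # otherwise the input end stalls
    return new, emitted, accepted


def run(tokens, n_cells, max_steps=1000):
    """Push tokens through an n_cells pipeline and return them in output order."""
    cells = [None] * n_cells
    pending = list(tokens)
    out = []
    for _ in range(max_steps):
        if not pending and all(c is None for c in cells):
            break
        incoming = pending[0] if pending else None
        cells, emitted, accepted = step(cells, incoming)
        if accepted:
            pending.pop(0)
        if emitted is not None:
            out.append(emitted)
    return out


print(run([1, 2, 3, 4, 5], n_cells=3))
```

Because a token never overtakes the one in front of it, the output order matches the input order, which is the FIFO property stated above.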
  • A single rail buffer has the communicating sequential processes (CSP) specification *[L; R]. Using a passive protocol for L and a lazy active protocol for R, the buffer will have the handshaking expansion (HSE):
  • The present system recognizes that certain sequences are the most interesting among these sequences.
  • The present application reshuffles the sequence in order to do these first.
  • Equation 1 represents a four phase protocol.
  • The first two actions, [L]; La↑, represent waiting for L to become active, and acknowledging that.
  • The second two actions represent L becoming inactive.
  • The third two actions represent waiting for R to become active.
  • The fourth two actions represent R inactive.
  • Another option is to reshuffle the waits and events to reduce the amount of sequencing and the number of state variables, in order to maximize the throughput and minimize the latency of the pipeline.
  • The first requirement for a valid reshuffling is that the handshaking expansion maintains the handshaking protocols on L and R. That is, the projection on the L channel is *[[L]; La↑; [¬L]; La↓] and the projection on the R channel is *[[¬Ra]; R↑; [Ra]; R↓].
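Taking the two projections one after the other, in the order in which the text describes equation 1 (the L handshake first, then the R handshake), suggests the following form for the unreshuffled expansion. This is a reconstruction from the surrounding description and should be read as an assumption rather than as the original equation:

```latex
\[
  *\bigl[\,[L];\ L^{a}\!\uparrow;\ [\neg L];\ L^{a}\!\downarrow;\
           [\neg R^{a}];\ R\!\uparrow;\ [R^{a}];\ R\!\downarrow\,\bigr]
\]
```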
  • The number of completed L↑ minus the number of completed R↑ should be at least zero to conserve the number of tokens in the pipeline. Also, since this is a “buffer”, it should introduce some nonzero slack. Hence, La↑ should not wait for the corresponding [Ra], or the reshuffling will have zero slack. This is the “constant response time” requirement.
  • The L and R channels may be expanded to encode data. If the reshuffling moves R↑ past the corresponding La↑, then the “L” data would disappear before R↑ is done.
  • The data here is saved in a buffer, here implemented as an internal state variable proportional to the number of bits on R or L. That data would need to be saved in internal state bits, since the L data may disappear as soon as La↑ occurs. These additional internal state bits are undesirable, so La↑ will follow R↑.
  • B1 and B2 are also very similar to PCFB, except they have more sequencing. However, that extra sequencing simplifies the production rule for en to ¬R → en↑ instead of ¬R ∧ ¬La → en↑, as in the case of PCFB. The inventors therefore do not believe that these will always be inferior to PCFB. However, due to the extra sequencing and additional transistors elsewhere, these reshufflings will likely seldom, if ever, be better than PCFB.
  • MSFB has the least possible sequencing of any of these reshufflings.
  • MSFB requires two state variables and has more complicated production rules than PCFB. It has a possible advantage in speed since it allows R↓ to happen a little earlier. If one counts transitions, it turns out that the next buffer in the pipeline (if it is reshuffled similarly) will not even raise Ra until after La↑ occurs. This might not really be an advantage at all.
  • WCHB indicates weak-condition logic.
  • PC indicates precharge logic.
  • HB indicates a halfbuffer (slack 1/2).
  • FB indicates a fullbuffer (slack 1).
  • The three best reshufflings are:
    PCFB ≡ *[[¬Ra ∧ L]; R↑; La↑; en↓; ([Ra]; R↓), ([¬L]; La↓); en↑]
    PCHB ≡ *[[¬Ra ∧ L]; R↑; La↑; [Ra]; R↓; [¬L]; La↓]
    WCHB ≡ *[[¬Ra ∧ L]; R↑; La↑; [Ra ∧ ¬L]; R↓; La↓]
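The constraints stated earlier (each channel keeps its four-phase protocol, La↑ follows R↑, and La↑ does not wait for [Ra]) can be checked mechanically on a linearized cycle. The sketch below is an assumption-laden illustration: the string tokens, the splitting of a combined guard such as [¬Ra ∧ L] into two consecutive waits, and the function name `satisfies_constraints` are all choices made here. It does not model the parallel branches of PCFB or the state variable en.

```python
L_PROTOCOL = ["[L]", "La+", "[~L]", "La-"]
R_PROTOCOL = ["[~Ra]", "R+", "[Ra]", "R-"]


def projection(sequence, protocol):
    """Keep only the waits and events that belong to one channel."""
    return [item for item in sequence if item in protocol]


def satisfies_constraints(sequence):
    """Check one cycle, written starting from the idle state."""
    if projection(sequence, L_PROTOCOL) != L_PROTOCOL:
        return False                      # L handshake protocol broken
    if projection(sequence, R_PROTOCOL) != R_PROTOCOL:
        return False                      # R handshake protocol broken
    if sequence.index("La+") < sequence.index("R+"):
        return False                      # L data could vanish before R is set
    if sequence.index("La+") > sequence.index("[Ra]"):
        return False                      # La+ waits for [Ra]: zero slack
    return True


PCHB = ["[~Ra]", "[L]", "R+", "La+", "[Ra]", "R-", "[~L]", "La-"]
WCHB = ["[~Ra]", "[L]", "R+", "La+", "[Ra]", "[~L]", "R-", "La-"]
UNRESHUFFLED = ["[L]", "La+", "[~L]", "La-", "[~Ra]", "R+", "[Ra]", "R-"]
ZERO_SLACK = ["[~Ra]", "[L]", "R+", "[Ra]", "La+", "R-", "[~L]", "La-"]

for name, seq in [("PCHB", PCHB), ("WCHB", WCHB),
                  ("unreshuffled", UNRESHUFFLED), ("zero slack", ZERO_SLACK)]:
    print(name, satisfies_constraints(seq))
```

The unreshuffled sequence is rejected only because it acknowledges L before producing R, which is the case the text says would require extra internal state bits.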
  • FIG. 1A shows a box and arrow diagram of the standard components of a PCHB or PCFB cell.
  • The various parts of the circuit may be thought of as logic, input completion, output completion, and enable generation.
  • The logic is shown as precharge dual rail domino logic with two enabling gates, the internal enable and the output enable coming back from the next cell in the pipeline.
  • The inverted logic is followed by inverters to restore it to the normal sense.
  • The completion circuits are standard NOR or NAND gates and C-element trees which compute the validity of the inputs and the validity of the outputs.
  • The “enable” circuit generates the input acknowledge(s) and the internal enable (en) of the cell.
  • The PCHB and PCFB differ only in the exact implementation of this enable circuit.
  • A?a means receive data a on channel A, and Y!g means send data g on channel Y.
  • P receives some inputs, then sends out functions computed from these inputs.
  • The channels A, B, X, and Y must encode some data. The usual way to do this is using sets of 1-of-N rails for each channel. For instance, to send two bits, one could use two sets of 1-of-2 rails with one acknowledge, or one set of 1-of-4 rails with one acknowledge.
  • A rail is identified by the channel name with a superscript for the 1-of-N wire which is active, and a subscript for what group of 1-of-N wires it belongs to (if there is more than one group in the channel).
  • The corresponding acknowledge will be the channel name with an “a” superscript, or an “e” superscript if it is used in the inverted sense.
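The two encodings mentioned for sending two bits can be illustrated as follows. This is a minimal sketch: the helper names and the representation of a rail group as a Python list of 0/1 values are assumptions made for illustration, and the acknowledge wire is not modelled.

```python
def encode_1_of_n(value, n):
    """Return n rails with exactly the rail numbered `value` raised."""
    if not 0 <= value < n:
        raise ValueError("value out of range for a 1-of-%d code" % n)
    return [1 if i == value else 0 for i in range(n)]


def is_neutral(rails):
    """All rails low: no data is present on this group."""
    return not any(rails)


def is_valid(rails):
    """Exactly one rail high: the group carries a value."""
    return sum(rails) == 1


def decode(rails):
    if not is_valid(rails):
        raise ValueError("rails are not in a valid state")
    return rails.index(1)


def two_bits_as_dual_rail(bits):
    """Two 1-of-2 groups, least significant bit first."""
    return [encode_1_of_n((bits >> k) & 1, 2) for k in range(2)]


def two_bits_as_1_of_4(bits):
    """One 1-of-4 group carrying both bits."""
    return encode_1_of_n(bits, 4)


print(two_bits_as_dual_rail(2))   # two groups of two rails
print(two_bits_as_1_of_4(2))      # one group of four rails
```

Both forms use four data wires for two bits; the 1-of-4 form raises one wire per token, while the dual-rail form raises two.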
  • [¬Ra] indicates a wait for all the output acknowledges to be false.
  • [Ra] indicates a wait for all the output acknowledges to be true.
  • La↑ indicates making true all the input acknowledges in parallel.
  • La↓ indicates making them false.
  • R↑ means that all the outputs are set to their valid states in parallel.
  • R↓ means that all the outputs are set to their neutral states. When R↑ occurs, it means that particular rails of the outputs are made true, depending on which rails of L are true. This expands R↑ into a set of exclusive selection statements executing in parallel.
  • The PCFB version of a P with dual rail channels would therefore be: *[[¬Xa ∧ f0(A, B, ...) → X0↑ [] ¬Xa ∧ f1(A, B, ...) → X1↑], [¬Ya ∧ g0(A, B, ...) → Y0↑ [] ¬Ya ∧ g1(A, B, ...) → Y1↑], ...; Aa↑, Ba↑, ...; en↓; [Xa → X0↓, X1↓], [Ya → Y .
  • The f0, f1, g0, and g1 are boolean expansions in the data rails of the input channels. They are derived from the f and g of the CSP and indicate the conditions for raising the various data rails of the output channels. Note that each output channel waits only for its own acknowledge, which is less sequenced than a direct translation of the PCFB template would be.
  • Aa and Ba tend to switch at about the same time. They could actually be combined into a single ABa which would wait for the conjunction of the guards on Aa and Ba. Combining the acknowledges tends to reduce the area of the circuit, but might slow it down. The best decision depends on the circumstances.
  • Three handshaking expansion reshufflings for this process are:
    WCHB_BUF ≡ *[[¬Ra ∧ L0 → R0↑ [] ¬Ra ∧ L1 → R1↑]; La↑; [Ra ∧ ¬L0 ∧ ¬L1 → R0↓, R1↓]; La↓]
    PCHB_BUF ≡ *[[¬Ra ∧ L0 → R0↑ [] ¬Ra ∧ L1 → R1↑]; La↑; [Ra → R0↓, R1↓]; [¬L0 ∧ ¬L1 → La↓]]
    PCFB_BUF ≡ *[[¬Ra ∧ L0 → R0↑ [] ¬Ra ∧ L1
  • Extra inverters can be added to WCHB_BUF to get 10 transitions per cycle. These inverters can actually speed up the throughput, despite the increased transition count, because inverters have high gain. Also, the 6 transitions per cycle buffer would invert the senses of the data and acknowledges after every stage, which is highly inconvenient when composing different pipelined cells. As a standard practice, most pipelined logic cells will be done with 2 transitions of forward latency, but more complicated circuits will have 5, 7 or even 9 transitions backward latency, yielding transitions per cycle from 10 to 22 (even numbers only, of course).
  • The three handshaking expansion reshufflings are: WCHB_FA ≡ *[[Se ∧ XOR0(A, B, C) → S0↑ [] Se ∧ XOR1(A, B, C) → S1↑], [De ∧ MAJ0(A, B, C) → D0↑ [] De ∧ MAJ1(A, B, C) → D1↑]; Fe↓; [¬Se ∧ ¬A0 ∧ ¬A1 ∧ ¬C0 → S0↓, S1↓], [¬De ∧ ¬B0 ∧ ¬B1 ∧ ¬C1 → D0↓, D1↓]
  • The validity of the outputs S and D implies the validity of the inputs, because the S must check all of A, B, and C.
  • The test for the neutrality of the inputs is split between S↓ and D↓. This works as long as both S↓ and D↓ check at least one input's neutrality completely, and both rails of S and D wait for the same expansion.
  • The expansion for the neutrality of the inputs is obviously too large to implement as a single production rule. Instead, the neutrality test must be decomposed into several operators. The usual decomposition is “nor” gates for each dual rail input, followed by a 3-input c-element. Fe↓ must now wait for the validity of the inputs just to acknowledge the internal transitions. However, this means the logic for S and D no longer needs to fully check validity of the inputs; it is not required to be weak-condition.
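The claim that S must check all of A, B, and C can be seen from a small dual-rail model. The text names XOR0, XOR1, MAJ0, and MAJ1 without spelling them out; the sum-of-products forms below are the usual sum (parity) and carry (majority) functions and are assumed here for illustration, with each channel modelled as a `(rail0, rail1)` pair.

```python
from itertools import product


def dual(bit):
    """Dual-rail encoding of one bit as (rail0, rail1)."""
    return (1, 0) if bit == 0 else (0, 1)


def xor_rails(a, b, c):
    """Return (S0, S1). Every product term uses one rail of each input."""
    s = [0, 0]
    for x, y, z in product((0, 1), repeat=3):
        if a[x] and b[y] and c[z]:
            s[(x + y + z) % 2] = 1
    return tuple(s)


def maj_rails(a, b, c):
    """Return (D0, D1). Each term needs only two agreeing inputs."""
    d0 = (a[0] and b[0]) or (a[0] and c[0]) or (b[0] and c[0])
    d1 = (a[1] and b[1]) or (a[1] and c[1]) or (b[1] and c[1])
    return (int(bool(d0)), int(bool(d1)))


# Exhaustive check against integer addition: S is the sum bit, D the carry.
for a, b, c in product((0, 1), repeat=3):
    s = xor_rails(dual(a), dual(b), dual(c)).index(1)
    d = maj_rails(dual(a), dual(b), dual(c)).index(1)
    assert s + 2 * d == a + b + c

# With A still neutral, S stays neutral but D can already become valid.
print(xor_rails((0, 0), dual(0), dual(0)))
print(maj_rails((0, 0), dual(0), dual(0)))
```

In this model S cannot become valid until all three inputs are valid, whereas D alone would not guarantee it, which is why the text relies on S for the validity check.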
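The decomposition described above (a NOR gate per dual-rail input feeding a 3-input C-element) can be modelled behaviourally. This is a sketch under stated assumptions: the class `CElement`, the function names, and the initial output value are illustrative and say nothing about the transistor-level circuit.

```python
class CElement:
    """Output follows the inputs when they all agree, otherwise holds state."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, *inputs):
        if all(inputs):
            self.out = 1
        elif not any(inputs):
            self.out = 0
        return self.out      # mixed inputs: keep the previous output


def nor(rail0, rail1):
    """1 when a dual-rail input is neutral (both rails low)."""
    return int(not (rail0 or rail1))


def inputs_neutral(c_element, a, b, c):
    """1 once all three inputs are neutral, 0 once all three are valid."""
    return c_element.update(nor(*a), nor(*b), nor(*c))


celem = CElement(initial=1)                                  # start neutral
print(inputs_neutral(celem, (1, 0), (0, 1), (1, 0)))         # all valid
print(inputs_neutral(celem, (0, 0), (0, 1), (1, 0)))         # A reset only
print(inputs_neutral(celem, (0, 0), (0, 0), (0, 0)))         # all neutral
```

The state-holding behaviour is what lets a single signal report both "all inputs valid" and "all inputs neutral" without glitching while the inputs change one at a time.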
  • the bubble-reshuffled and decomposed production rules for WCHB_FA are: ⁇ S e ⁇ XOR 0 ⁇ ( A , B , C ) ⁇ S 0 _ ⁇ ⁇ S e ⁇ XOR 1 ⁇ ( A , B , C ) ⁇ S 1 _ ⁇ ⁇ D e ⁇ MAJ 0 ⁇ ( A , B , C ) ⁇ D 0 _ ⁇ ⁇ D e ⁇ MAJ 1 ⁇ ( A , B , C ) ⁇ D 1 _ ⁇ ⁇ ⁇ S 0 _ ⁇ S 0 ⁇ ⁇ ⁇ S 1 _ ⁇ S 1 ⁇ ⁇ ⁇ D 0 _ ⁇ D 0 ⁇ ⁇ ⁇ D 1 _ ⁇ D 1 ⁇ ⁇ ( ⁇ S 0 _ ⁇ ⁇ D 1 _ ) ⁇ ( ⁇ D 0 _ ⁇ ⁇ D 1 _ ) ⁇ ( ⁇ D
  • The circuit diagram is shown in FIG. 3.
  • The pull-up logic for S0, S1, D0, and D1 has 4 P-type transistors in series. This can be quite weak, due to the lower mobility of holes. Other WCHB circuits can be even worse. Since all the inputs are checked for neutrality before the outputs reset, a process with three inputs and only one output would end up with 7 p-transistors in series to reset that output.
  • The present system uses the “precharge-logic” reshufflings, PCHB_FA or PCFB_FA. These test the neutrality of the inputs in a different place, which is more easily decomposed into manageable gates, and does not slow the forward latency.
  • the PCHB_FA reshuffling has the production: rules: ⁇ A 0 ⁇ A 1 ⁇ A v _ ⁇ ⁇ B 0 ⁇ B 1 ⁇ B v _ ⁇ ⁇ C 0 ⁇ C 1 ⁇ C v _ ⁇ ⁇ F e ⁇ S e ⁇ XOR 0 ⁇ ( A , B , C ) ⁇ S 0 _ ⁇ ⁇ F e ⁇ S e ⁇ XOR 1 ⁇ ( A , B , C ) ⁇ S 1 _ ⁇ ⁇ F e ⁇ D e ⁇ MAJ 0 ⁇ ( A , B , C ) ⁇ D 0 _ ⁇ ⁇ F e ⁇ D e ⁇ MAJ 1 ⁇ ( A , B , C ) ⁇ D 1 _ ⁇ ⁇ ⁇ ⁇ S 0 _ ⁇ S 0 ⁇ ⁇ ⁇ S 1 _ ⁇ S 1 ⁇ ⁇ ⁇ D
  • This circuit can be made faster by adding two inverters to Fe and then two more to produce the F e used internally (which is now called en). This circuit is shown in FIG. 4.
  • a PCFB_FA reshuffling would have only slightly different production rules: ⁇ A 0 ⁇ A 1 ⁇ A v _ ⁇ ⁇ B 0 ⁇ B 1 ⁇ B v _ ⁇ C 0 ⁇ C 1 ⁇ C v _ ⁇ ⁇ en ⁇ S e ⁇ XOR 0 ⁇ ( A , B , C ) ⁇ S 0 _ ⁇ ⁇ en ⁇ S e ⁇ XOR 1 ⁇ ( A , B , C ) ⁇ S 1 _ ⁇ ⁇ en ⁇ D e ⁇ MAJ 0 ⁇ ( A , B , C ) ⁇ D 0 _ ⁇ ⁇ en ⁇ D e ⁇ MAJ 1 ⁇ ( A , B , C ) ⁇ D 1 _ ⁇ ⁇ ⁇ S 0 _ ⁇ S 0 ⁇ ⁇ ⁇ S 1 _ ⁇ S 1 ⁇ ⁇ ⁇ D 0
  • The WCHB_FA has only 10 transitions per cycle, while the PCHB_FA has 14 and the PCFB_FA has 12 (7 on the setting phase, but 5 on the resetting phase, since the L and R handshakes reset in parallel).
  • Although the WCHB_FA has fewer transistors, to make it reasonably fast the 4 P-transistors in series must be made very large.
  • Both PCHB_FA and PCFB_FA are substantially faster in throughput and latency.
  • PCFB_FA is the fastest of all, since it relies heavily on n-transistors and saves 2 transitions on the reset phase.
  • PCFB_FA can be larger than PCHB_FA, due to the extra state variable en and the extra completion SDv. If the speed of the fulladder is not critical, the PCHB_FA seems to be the best choice.
  • The WCHB reshuffling tends to be best only for buffers and copies ([L?x;R!x,S!x]).
  • The PCHB is the workhorse for most applications; it is both small and fast. When exceptional speed is called for, the PCFB dominates. It is also especially good at completing 1-of-N codes where N is very large, since the completion can be done by a circuit which looks like a tied-or pulldown as opposed to many stages of combinational logic.
  • The reshufflings can actually be mixed together, with each channel in the cell using a different one. This is most commonly useful when a cell computes on some inputs using PCHB, but also copies some inputs directly to outputs using WCHB. In this case, the neutrality detection for the WCHB outputs is only one p-gate, which is no worse than an extra en gate.
  • Another common class of logic circuits uses shared control inputs to process multi-bit words. This is similar to a fulladder.
  • The control is just another input, which happens to have a large fanout to many output channels. Since the outputs only sparsely depend on the inputs (usually with a bit to bit correspondence), the number of gates in series in the logic often does not become prohibitive. However, if the number of bits is large, e.g. 32, the completion of all the inputs and outputs will take many stages in a c-element tree, which adds to the cycle time, as does the load on the broadcast of the control data.
  • a dual-rail version of P1 with a PCFB reshuffling is: * [ [ do_x ⁇ ( A , B , ... ) ⁇ ⁇ X a ⁇ f 0 ( A , B , ... ⁇ ) ⁇ X 0 ⁇ ⁇ ⁇ do_x ⁇ ( A , B , ... ) ⁇ ⁇ X a ⁇ f 1 ⁇ ( A , B , ... ) ⁇ X 1 ⁇ ⁇ do_x ⁇ ( A , B , ... ) ⁇ skip ] , do_y ⁇ ( A , B , ... ) ⁇ ⁇ Y a ⁇ g ⁇ ⁇ ( A , B , ... )
  • A skip causes no visible change in state, so the next statements in sequence (Aa↑, Ba↑, ...) must actually look directly at the boolean expansion for do_x(A, B, ...) and do_y(A, B, ...) in addition to the output rails X0, X1, Y0, Y1.
  • Another approach is to introduce a new variable to represent the do_x and do_y cases.
  • The skips are replaced with no_x↑ and no_y↑, respectively, and no_x↓ is added to X0↓, X1↓ and no_y↓ to Y0↓, Y1↓.
  • The production rules are simply produced as if X and Y were 1-of-3 channels instead of 1-of-2, except the extra rail doesn't check the right acknowledge, or, in fact, leave the cell.
  • PCHB_SPLIT ⁇ ⁇ * [ [ A e ⁇ S 0 ⁇ L o ⁇ A 0 ⁇ ⁇ A e ⁇ S 0 ⁇ L 1 ⁇ A 1 ⁇ ⁇ S 1 ⁇ skip ] , ⁇ [ ⁇ B e ⁇ S 1 ⁇ L o ⁇ B 0 ⁇ ⁇ B e ⁇ S 1 ⁇ L 1 ⁇ B 1 ⁇ ⁇ S 1 ⁇ skip ] ; SL e ⁇ ; ⁇ [ ⁇ A e ⁇ ⁇ A 0 ⁇ ⁇ A 1 ⁇ A 0 ⁇ A 1 ⁇ ] , ⁇ [ ⁇ B e ⁇ ⁇ B 0 ⁇ ⁇ B 1 ⁇ B 0 ⁇ B 1 ⁇ ] ⁇ SL e ⁇ ]
  • The overlined A (Ā) in this context refers to a probe of the value of A, not just its availability. This is not standard in CSP, but is a useful extension which is easily implemented in handshaking expansion. Basically, the booleans for do_a, do_b, no_a, and no_b may inspect the rails of A and B in order to decide whether to actually receive from the channels. The selection statements will suspend until either do_a or no_a is true. These expansions are required to be stable; that is, as additional inputs show up, they may not become false as a result.
  • the PCFB version of the Handshaking expansion is: u 0 ⁇ , u 1 ⁇ , v 0 ⁇ , v 1 ⁇ , ... ⁇ ; * [ ⁇ f 0 ( A , B , ... ⁇ ) ⁇ X 0 ⁇ ⁇ f 1 ( A , B , ... ⁇ ) ⁇ X 1 ⁇ ] , [ ⁇ g 0 ( A , B , ... ⁇ ) ⁇ Y 0 ⁇ ⁇ f 1 ( A , B , ... ⁇ ) ⁇ Y 1 ⁇ ] , ... ⁇ , [ ⁇ do_a ⁇ ( A , B ) ⁇ u 1 ⁇ ⁇ no_a ⁇ ( A , B ) ⁇ u 0 ⁇ ] , [ do_b ⁇ ( A , B ) ⁇ v 1 ⁇ ⁇ no_b ⁇ ( A , B ) ⁇
  • This general template can be greatly simplified. For instance, if a set of unconditional inputs completely controls the conditions for reading the others, these can be thought of as the “control” inputs. If raising the acknowledges of the various inputs is sequenced so that the conditional ones precede the control ones, then the variables u and v may be eliminated without causing stability problems. Also in some cases the u and v may be substituted with an expansion of the outputs, instead of stored separately.
  • The circuit for the merge process reverses the split of the last section by conditionally reading one of two data input channels (A and B) to the single output channel X based on a control input M.
  • The CSP is *[M?m; [¬m → A?x [] m → B?x]; X!x].
  • The simplification of acknowledging the data inputs A and B before the control input M is used.
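The merge CSP can be mimicked at the token level with queues, together with a split so that the "reverses the split" relationship is visible. This is a behavioural sketch only: the function names are hypothetical, the mapping of control value 0 to channel A and 1 to channel B is an assumption taken from the pairing of M0 with A in the reshuffling, and no handshaking is modelled.

```python
from collections import deque


def split(controls, data):
    """Route each data token to A (control 0) or B (control 1)."""
    a, b = deque(), deque()
    for s, x in zip(controls, data):
        (a if s == 0 else b).append(x)
    return a, b


def merge(controls, a, b):
    """For each control token m, read one token from A or B and send it to X."""
    a, b = deque(a), deque(b)      # copy so the caller's queues are untouched
    out = []
    for m in controls:
        x = b.popleft() if m else a.popleft()
        out.append(x)
    return out


controls = [0, 1, 1, 0, 1]
data = list("abcde")
a, b = split(controls, data)
print(list(a), list(b))
print(merge(controls, a, b))
```

Feeding the same control stream to both processes restores the original token order, which is the sense in which merge undoes split.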
  • the PCHB reshuffling is: PCHB ⁇ ⁇ MERGE ⁇ ⁇ * [ [ X e ⁇ ( M 0 ⁇ A 0 ⁇ M 1 ⁇ B 0 ) ⁇ X 0 ⁇ ⁇ X e ⁇ ( M 0 ⁇ A 1 ⁇ M 1 ⁇ B 1 ] ⁇ X 1 ⁇ ] , ⁇ [ ⁇ M 0 ⁇ A e ⁇ ⁇ M 1 ⁇ B e ⁇ ] ; ⁇ M e ⁇ ; ⁇ [ ⁇ X e -> X 0 ⁇ , X 1 ⁇ ] , ⁇ [ ⁇ A 0 ⁇ ⁇ A 1 ⁇ ⁇ M 0 ⁇ ⁇ A e ⁇ A e ⁇ ] , ⁇ [ ⁇ ⁇ B 0 ⁇ ⁇ B 1 ⁇ ⁇ M 1 ⁇ ⁇ B e ⁇ B ⁇ ] , M e ⁇ ]
  • A subtle simplification used here is to make Ae↑ and Be↑ check the corresponding M0 and M1. This reduces the guard condition for Me↑ and makes the reset phase symmetric with the set phase. Some decomposition is done to add Av, Bv and Xv to do validity and neutrality checks.
  • the PRS is: ⁇ A 0 ⁇ A 1 ⁇ A v _ ⁇ ⁇ B 0 ⁇ B 1 ⁇ B v _ ⁇ ⁇ ⁇ A v ⁇ ⁇ B v _ ⁇ B v ⁇ ⁇ M e ⁇ X e ⁇ ( M 0 ⁇ A 0 ⁇ M 1 ⁇ B 0 ) ⁇ X 0 _ ⁇ ⁇ M e ⁇ X e ⁇ ( M 0 ⁇ A 1 ⁇ M 1 ⁇ B 1 ) ⁇ X 1 _ ⁇ ⁇ ⁇ X 0 _ ⁇ X 0 ⁇ ⁇ ⁇ X 1 _ ⁇ X 1 ⁇ ⁇ ⁇ X 0 _ ⁇ ⁇ X 1 _ ⁇ X 1 ⁇ ⁇ ⁇ X 0 _ ⁇ ⁇ X 1 _ ⁇ X v ⁇ A v ⁇ M 0 ⁇ X v ⁇ A e
  • If the state variable is exclusively set or used in a cycle, a simple modification of the standard pipelined reshuffling will suffice.
  • The state variable s is assigned to a dual-rail value at the same time the outputs are produced. On the reset phase, it remains stable. Unlike the usual return-to-zero variables, s will only briefly transition through neutrality between valid states. If s doesn't change, it does not go through a neutral state at all.
  • The CSP for this behavior is expressed just like P3, except the semicolon before the assignment to s is replaced with a comma. This is made possible by the assumption that s only changes when the outputs X and Y do not depend on it; this avoids any stability problems.
  • the PCFB version of the Handshaking expansion for this type of state holding process is: * [ [ ⁇ X a ⁇ f 0 ( s , A , B , ... ⁇ ) ⁇ X 0 ⁇ ⁇ ⁇ X a ⁇ f 1 ( s , A , B , ... ⁇ ) ⁇ X 1 ⁇ ] , [ ⁇ ⁇ Y a ⁇ g 0 ( s , A , B , ... ⁇ ) ⁇ Y 0 ⁇ ⁇ ⁇ Y a ⁇ g 1 ( s , A , B , ... ⁇ ) ⁇ Y 1 ⁇ ] , [ ⁇ h 0 ( A , B , ... ⁇ ) ⁇ s 0 ⁇ ⁇ h 1 ⁇ ( A , B , ... ) ⁇ s 0 ⁇ ; s 1 ⁇ ] , ... ⁇ ; A
  • PCHB_REG PCHB_REG ⁇ x 0 ⁇ , x 1 ⁇ ; * [ [ C 1 ⁇ R e ⁇ x 0 ⁇ R 0 ⁇ ⁇ ⁇ C 1 ⁇ R e ⁇ x 1 ⁇ R 1 ⁇ C 0 ⁇ L 0 ⁇ x 1 ⁇ ; x 0 ⁇ C 0 ⁇ L 1 ⁇ x 0 ⁇ ; x 1 ⁇ ] ; ⁇ [ C 0 ⁇ L e ⁇ ⁇ C 1 ⁇ skip ] ; ⁇ C e ⁇ l ⁇ [ ⁇ R e ⁇ ⁇ R 0 ⁇ ⁇ R 1 ⁇ R 0 ⁇ , R 1 ⁇ ] ; ⁇ [ ⁇ L 0 ⁇ ⁇ L 1 ⁇ L e ⁇ ] ; ⁇ [ ⁇ C 0 ⁇ ⁇ C 1 ⁇ C e ⁇ ] ]
  • The most general form of state holding cell is one where the state variable can be used and set in any cycle. In order to do this, it is necessary to have separate storage locations for the new state and the old state. This may be done by introducing an extra state variable t which holds the new state until s is used.
  • This “go” signal may be added to a PCFB as well.
  • The “go” signal must also be checked before producing the left enables, or instabilities will result. This has the side effect of reducing the slack to one half, but this is irrelevant when the goal is high speed.
  • the “go-” must not wait for the right enable (re) to go down since it won't (as no data was sent on the last cycle). Instead of a c-element this gives the PRS: “en & re & ⁇ no_r->go+” and “ ⁇ en & (no_r
  • The output completion is taken before the inverters, since this allows the use of a NAND gate instead of a NOR gate and gets the completion done a transition earlier. However, it is possible to complete from after the inverters as well. This is particularly useful when you can share the output completion circuit of one cell with the input completion of the next cell in the pipeline.
  • Although this patent primarily presents asynchronous circuits in a quasi-delay-insensitive framework, it may prove desirable to introduce timing assumptions in order to simplify or speed up the circuit.
  • Several useful non-QDI circuits can be derived simply by omitting transistors from a QDI WCHB, PCHB, or PCFB circuit. It is preferred if the introduced timing assumptions can be met entirely by estimating the delays within the cell, without making assumptions on the delays of its environment. Several simple modifications can satisfy this property.
  • The weak-condition halfbuffer variety works well for buffers and copies without logic.
  • The precharge-logic half-buffering is the simplest good way to implement most logic cells.
  • The precharge-logic full-buffering has an advantage in speed and is good at decoupling the handshakes of neighboring units. It should be used when necessary to improve the throughput.

Abstract

An asynchronous processor that has reshuffled processes to implement precharge logic.

Description

    CROSS REFERENCE TO RELATED APPLICATIONS
  • [0001] This application is a continuation of U.S. application Ser. No. 09/501,638, filed Feb. 10, 2000, which is a continuation of U.S. application Ser. No. 09/360,468, filed Jul. 22, 1999, which claims the benefit of U.S. Provisional Application No. 60/093,840, filed on Jul. 22, 1998, all of which are incorporated herein by reference.
  • STATEMENT AS TO FEDERALLY SPONSORED RESEARCH
  • [0002] This application may have received funding under U.S. Government Grant No. DAAH-04-94-G-0274 awarded by the Department of Army.
  • [0003] This specification describes communicating sequential processes (CSP) which are implemented as quasi delay insensitive asynchronous circuits. More specifically the present specification teaches reshuffling communication sequences and combining computation with buffering to produce pipelined circuits.
  • BACKGROUND
  • [0004] Asynchronous processors are known as described in U.S. Pat. No. 5,752,050. These processors process an information stream without a global clock synchronizing the operation.
  • [0005] An asynchronous processor pipeline scheme uses the basic layout shown in FIG. 1. A first process 100 communicates with a second process 110 that in turn sends a message to the next process. The messages use a four phase handshake. In the first phase, the sender raises the request line. In the second phase, the receiver raises the acknowledge line. In the third phase, the sender lowers the request line. In the fourth phase, the receiver lowers the acknowledge line. In the handshaking expansion language (HSE), the handshake on channel X is described as X+; Xa+; X−; Xa−. In FIG. 1, the request between 100 and 110 is the L wire (102). The acknowledge for that communication is La (108). The request between 110 and 120 is the R wire (104), and the acknowledge is Ra (106).
  • [0006] This is a basic request, acknowledge system. The request [L] is acknowledged (La), then acted on R↑, then acknowledged again (Ra).
  • [0007] Pipelined asynchronous circuits are known as “Bundled-Data” or “Micropipelines” and have a synchronous style data path which is “clocked” by asynchronous self-timed control elements. These control elements handshake between pipeline stages with a request/acknowledge pair. The delay of the datapath logic is estimated with a delay-element in the control, so that the request to the next pipeline state is not made until the data is assumed to be valid.
  • [0008] The alternative style involves (quasi) delay-insensitive circuits, for which no delay assumptions are made. In this style, the prior art is embodied in the Caltech Asynchronous Microprocessor patent. Datapaths are still separated from control, as in the bundled-data case, but completion detection circuitry is added instead of delay lines to detect when the data is valid. Communication between processes occurs via delay-insensitive channels with a 4 phase handshake. In between latches or buffers, logic can be performed by unpipelined weak-condition logic blocks.
  • SUMMARY
  • [0009] The present system teaches a way of pipelining this handshake to allow certain processes to occur closer to simultaneously. The disclosed system is a delay insensitive system that uses a combination of logic and buffering to resequence certain operations.
  • [0010] A new way of pipelining quasi-delay-insensitive circuits is disclosed in which control is not explicitly separated from the datapath. No extra buffers or latches are added between logic blocks. Instead, the state-holding property of a buffer is combined directly with a dual-rail domino logic computation. The tokens travel through the pipeline as in the case of simple buffers. The tokens also carry values which are computed upon. By not separating control from data, and by carefully designing the circuit parts which handle the handshakes, higher throughput is expected. The extra handshake circuitry typically adds no more than 50% area.
  • [0011] The supporting circuitry which handles the handshake is implemented in precharge domino logic of a type that is common in synchronous design. Additional circuits detect the validity of the input and output channels (common in asynchronous design). An acknowledge circuit acknowledges the inputs and precharges the logic.
  • [0012] The circuit implementations disclosed in this patent include components for logic computation, plus components to detect the validity of the input and output data, and another component to generate the acknowledges and precharge the logic. The details and composition of these pieces generate fast quasi-delay-insensitive circuits superior to the prior art.
  • [0013] Further enhancements of this combined buffer/logic cell are also disclosed. These include the ability to conditionally communicate on either inputs or outputs, so as to implement routing functionality. Also, mechanisms for efficiently implementing internal state variables are described.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • [0014] These and other aspects will now be described in detail with reference to the accompanying drawings, wherein:
  • [0015] FIG. 1 shows a basic pipelining system and some of the signals used in that system;
  • [0016] FIG. 1A shows a basic precharge type buffer in block diagram form;
  • [0017] FIG. 2 shows a basic weak condition half buffer circuit;
  • [0018] FIG. 3 shows the transistor diagrams for the weak condition half buffer;
  • [0019] FIG. 4 shows a precharge buffer, with the transistor arrangement at the top and the gate arrangement at the bottom;
  • [0020] FIG. 5 shows a split precharge circuit;
  • [0021] FIG. 6 shows a merge precharge circuit;
  • [0022] FIG. 7 shows a Reg precharge circuit.
  • DESCRIPTION OF THE EMBODIMENT
  • [0023] The present system is based on a way of pipelining the information in the FIG. 1 drawing using precharge logic that allows the operations to occur in parallel. Pipelining allows a system to carry out more than one operation at the same time. Put another way, a pipelined system does not need to wait for one action to be completed before the other action is carried out. However, if one attempts to reset data before using it, then the data is lost.
  • [0024] The present system teaches a way of dealing with this issue by reshuffling the communication sequence, storing certain information within the sequence, and enabling more efficient pipelining of information.
  • [0025] A “pipeline” is a linear sequence of buffers where the output of one buffer connects to the input of the next buffer, as shown in FIG. 1. “Tokens” 99 are sent into the input end of the pipeline, and flow through each buffer to the output end. The tokens remain in first-in-first-out (FIFO) order.
  • [0026] For synchronous pipelines, the tokens usually advance through one stage on each clock cycle. For asynchronous pipelines there is no global clock to synchronize the movement. Instead, each token moves forward down the pipeline when there is an empty cell in front of it; otherwise, the token stalls. Effectively, the tokens have similar behavior to cars on a freeway.
  • [0027] The buffer capacity or “slack” of an asynchronous pipeline is proportional to the maximum number of tokens that can be packed into the pipeline without stalling the input end of the pipeline. The “throughput” is the number of tokens per second which pass a given stage in the pipeline. The “forward latency” is the time it takes a given token to travel the length of the pipeline.
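  • The token behavior described above (advance when the cell ahead is empty, otherwise stall) can be illustrated with a small sketch. It rests on assumptions that are not in the specification: each cell holds at most one token, and movement is evaluated in discrete rounds purely for illustration, whereas a real asynchronous pipeline has no global step. The names step, cells, stalled, and received are hypothetical.

```python
from typing import List, Optional, Tuple

Cell = Optional[str]

def step(cells: List[Cell], inject: Cell = None, drain: bool = True
         ) -> Tuple[List[Cell], Cell, bool]:
    """One illustrative round: returns (new cells, token that left, inject accepted)."""
    cells = list(cells)
    out = None
    if drain and cells[-1] is not None:       # output end hands its token on
        out, cells[-1] = cells[-1], None
    for i in range(len(cells) - 2, -1, -1):   # scan from the output end backwards
        if cells[i] is not None and cells[i + 1] is None:
            cells[i + 1], cells[i] = cells[i], None   # empty cell ahead: advance
    accepted = inject is not None and cells[0] is None
    if accepted:
        cells[0] = inject
    return cells, out, accepted

cells: List[Cell] = [None, None, None]
stalled = []
for tok in "abcd":                            # output blocked: the pipeline fills up
    cells, _, ok = step(cells, inject=tok, drain=False)
    if not ok:
        stalled.append(tok)
full = list(cells)

received = []
for _ in range(3):                            # unblock the output and drain
    cells, out, _ = step(cells)
    received.append(out)
```

  • In this model three tokens fit before the input end stalls on the fourth, and the tokens leave in the order they entered.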
  • Buffer Reshuffling
  • [0028] A single rail buffer has the communicating sequential processes (CSP) specification *[L;R]. Using a passive protocol for L and a lazy active protocol for R, the buffer will have the handshaking expansion (HSE):
    *[[L]; La↑; [¬L]; La↓; [¬Ra]; R↑; [Ra]; R↓]  (1)
  • [0029] In English, the handshaking expansion for this buffer is as follows: Wait for L to become true. Set La true. Wait for L to become false. Set La false. Wait for Ra to become false. Set R true. Wait for Ra to become true. Set R false. Repeat infinitely.
  • [0030] The present system recognizes that certain sequences are the most interesting among these sequences. The present application reshuffles the sequence in order to do these first.
  • [0031] In effect, equation (1) represents a four-phase protocol. The first two actions, [L]; La↑, represent waiting for L to become active, and acknowledging that. The second two actions represent L becoming inactive. The third two actions represent waiting for R to become active. The fourth two actions represent R becoming inactive.
  • [0032] The environment will perform *[[¬La]; L↑; [La]; L↓] and *[[R]; Ra↑; [¬R]; Ra↓]. The wait for L, or [L], is interpreted to be the arrival of an input token, and the transition R↑ is the beginning of the output token. Buffers are used herein to preserve the desired FIFO order and properties of a pipeline.
  • [0033] Direct implementation of this handshaking expansion can use a state variable to distinguish the first half from the second half. This represents a large amount of sequencing in each cycle.
  • [0034] Another option is to reshuffle the waits and events to reduce the amount of sequencing and the number of state variables, in order to maximize the throughput and minimize the latency of the pipeline.
  • [0035] The first requirement for a valid reshuffling is that the handshaking expansion maintains the handshaking protocols on L and R. That is, the projection on the L channel is *[[L]; La↑; [¬L]; La↓] and the projection on the R channel is *[[¬Ra]; R↑; [Ra]; R↓]. In addition, the number of completed L↑ minus the number of completed R↑ (the slack of the buffer) should be at least zero to conserve the number of tokens in the pipeline. Also, since this is a “buffer”, it should introduce some nonzero slack. Hence, the La↑ should not wait for the corresponding [Ra], or the reshuffling will have zero slack. This is the “constant response time” requirement.
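  • The projection requirement can be checked mechanically for a single loop iteration. The sketch below is an illustration under stated assumptions: a reshuffling is written as a flat list of wait and transition labels (the ASCII labels such as "[~L]" and "La+" are hypothetical stand-ins for [¬L] and La↑), a combined wait like [¬Ra ∧ L] is split into its two parts, and parallel branches would have to be linearized by hand. It does not check the slack requirements.

```python
L_PROTOCOL = ["[L]", "La+", "[~L]", "La-"]
R_PROTOCOL = ["[~Ra]", "R+", "[Ra]", "R-"]

def project(sequence, protocol):
    """Keep only the waits and transitions that belong to one channel."""
    return [item for item in sequence if item in protocol]

def keeps_protocol(sequence, protocol):
    """True if one loop iteration projects onto exactly the four-phase order."""
    return project(sequence, protocol) == protocol

def needs_data_storage(sequence):
    """True if La+ comes before R+, so the L data could vanish before it is used."""
    return sequence.index("La+") < sequence.index("R+")

# Equation (1) and two sequential reshufflings, one iteration each.
ORIGINAL = ["[L]", "La+", "[~L]", "La-", "[~Ra]", "R+", "[Ra]", "R-"]
PCHB = ["[~Ra]", "[L]", "R+", "La+", "[Ra]", "R-", "[~L]", "La-"]
WCHB = ["[~Ra]", "[L]", "R+", "La+", "[Ra]", "[~L]", "R-", "La-"]
# A deliberately invalid ordering: La is lowered before waiting for L to fall.
BROKEN = ["[~Ra]", "[L]", "R+", "La+", "[Ra]", "R-", "La-", "[~L]"]

for name, seq in [("ORIGINAL", ORIGINAL), ("PCHB", PCHB),
                  ("WCHB", WCHB), ("BROKEN", BROKEN)]:
    print(name, keeps_protocol(seq, L_PROTOCOL), keeps_protocol(seq, R_PROTOCOL))
```

  • Equation (1) passes both projections but raises La before R, which is the case that would force the L data to be stored.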
  • [0036] Although these three requirements are sufficient to guarantee a correct implementation, one more is useful. The L and R channels may be expanded to encode data. If the reshuffling moves the R↑ past the corresponding La↑, then the L data could disappear before R↑ is done, since the L data may disappear as soon as La↑ occurs. That data would then need to be saved in a buffer, implemented as internal state bits proportional in number to the bits on R or L. These additional internal state bits are undesirable, so La↑ will follow R↑.
  • [0037] There are nine valid reshufflings, each labeled below:
    MSFB ≡ *[[¬Ra ∧ L]; R↑; ([Ra]; R↓), (La↑; [¬L]; La↓)]
    PCFB ≡ *[[¬Ra ∧ L]; R↑; La↑; ([Ra]; R↓), ([¬L]; La↓)]
    PCHB ≡ *[[¬Ra ∧ L]; R↑; La↑; [Ra]; R↓; [¬L]; La↓]
    WCHB ≡ *[[¬Ra ∧ L]; R↑; La↑; [Ra ∧ ¬L]; R↓; La↓]
    B1 ≡ *[[¬Ra ∧ L]; R↑; La↑; [Ra ∧ ¬L]; La↓; R↓]
    B2 ≡ *[[¬Ra ∧ L]; R↑; La↑; [¬L]; La↓; [Ra]; R↓]
    B3 ≡ *[[¬Ra ∧ L]; R↑; La↑; [¬L]; [Ra]; R↓, La↓]
    B4 ≡ *[[¬Ra ∧ L]; R↑; La↑; [Ra]; R↓, [¬L]; La↓]
    B5 ≡ *[[¬Ra ∧ L]; R↑; La↑; [Ra ∧ ¬L]; R↓, La↓]
  • [0038] It takes two state variables to implement the MSFB reshuffling. The PCFB, B1, B2, B3, B4, and B5 reshufflings all require one state variable en (short for enable), with en↓ inserted after La↑ and en↑ inserted before the end.
  • [0039] Selection of which of these reshufflings is the best can assume that the goal is fewer transistors and faster operation. By that metric, the present inventors believe that B3, B4, and B5 are always inferior to PCFB. They all require the same state variable. They provide only a subset of the behavior of PCFB, with additional waits that may be unnecessary. These waits add extra transistors and slow the circuit down, compared to PCFB.
  • [0040] B1 and B2 are also very similar to PCFB, except they have more sequencing. However, that extra sequencing simplifies the production rule for en↑ to ¬R → en↑, instead of ¬R ∧ ¬La → en↑ in the case of PCFB. The inventors therefore do not believe that these will always be inferior to PCFB. However, due to the extra sequencing and additional transistors elsewhere, these reshufflings will likely seldom, if ever, be better than PCFB.
  • [0041] The MSFB has the least possible sequencing of any of these reshufflings. However, MSFB requires two state variables and has more complicated production rules than PCFB. It has a possible advantage in speed since it allows R↓ to happen a little earlier. If one counts transitions, however, it turns out that the next buffer in the pipeline (if it is reshuffled similarly) will not even raise Ra until after La↑ occurs. This might therefore not really be an advantage at all.
  • [0042] That leaves the three most interesting reshufflings: WCHB, PCHB, and PCFB. The names are derived from characteristics of the circuit implementations. WC indicates weak-condition logic. PC indicates precharge logic. HB indicates a halfbuffer (slack ½), and FB indicates a fullbuffer (slack 1).
  • [0043] In the halfbuffer reshufflings, only every other stage can have a token on its output channel, since a token on that channel blocks the previous stage from producing an output token. In practice, each of these reshufflings has advantages for certain applications, so they are all useful. With state variables inserted, the three best reshufflings are:
    PCFB ≡ *[[¬Ra ∧ L]; R↑; La↑; en↓; ([Ra]; R↓), ([¬L]; La↓); en↑]
    PCHB ≡ *[[¬Ra ∧ L]; R↑; La↑; [Ra]; R↓; [¬L]; La↓]
    WCHB ≡ *[[¬Ra ∧ L]; R↑; La↑; [Ra ∧ ¬L]; R↓; La↓]
  • [0044] Note that the first three parts of each reshuffling are the same.
  • [0045] FIG. 1A shows a box and arrow diagram of the standard components of a PCHB or PCFB cell. The various parts of the circuit may be thought of as logic, input completion, output completion, and enable generation. The logic is shown as precharge dual rail domino logic with two enabling gates, the internal enable and the output enable coming back from the next cell in the pipeline. The inverted logic is followed by inverters to restore it to the normal sense. The completion circuits are standard NOR or NAND gates and C-element trees which compute the validity of the inputs and the validity of the outputs. Finally, the “enable” circuit generates the input acknowledge(s) and the internal enable (en) of the cell. The PCHB and PCFB differ only in the exact implementation of this enable circuit.
  • [0046] Logic with Buffering
  • [0047] Suppose it is desired to implement a unit with CSP of the form:
    P ≡ *[A?a, B?b, …; X!f(a, b, …), Y!g(a, b, …), …]
  • [0048] where A?a means receive data a on channel A, and Y!g means send data g on channel Y.
  • [0049] On each cycle, P receives some inputs, then sends out functions computed from these inputs. The channels A, B, X, and Y must encode some data. The usual way to do this is using sets of 1-of-N rails for each channel. For instance, to send two bits, one could use two sets of 1-of-2 rails with one acknowledge, or one set of 1-of-4 rails with one acknowledge.
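  • The two encodings mentioned for a two-bit value can be compared with a short sketch. The helper names (encode_1_of_n, as_two_1_of_2, as_one_1_of_4) and the choice to list the low-order bit first are illustrative assumptions; only the data rails are modeled, not the acknowledge.

```python
def encode_1_of_n(value: int, n: int) -> list:
    """Return n rails with exactly the rail numbered `value` raised."""
    if not 0 <= value < n:
        raise ValueError("value must be in range(n)")
    return [1 if wire == value else 0 for wire in range(n)]

def is_valid(group) -> bool:
    """A 1-of-N group carries data when exactly one rail is high."""
    return sum(group) == 1

def is_neutral(group) -> bool:
    """A 1-of-N group is neutral when every rail is low."""
    return sum(group) == 0

def as_two_1_of_2(value: int) -> list:
    """Two bits as two dual-rail groups, low-order bit first (an assumed order)."""
    return [encode_1_of_n((value >> bit) & 1, 2) for bit in (0, 1)]

def as_one_1_of_4(value: int) -> list:
    """Two bits as a single 1-of-4 group."""
    return encode_1_of_n(value, 4)

print(as_two_1_of_2(2), as_one_1_of_4(2))
```

  • Both forms use four data wires for two bits; they differ in how many rails switch per token (two for the dual-rail pair, one for the 1-of-4 group).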
  • [0050] As a notational convention, a rail is identified by the channel name with a superscript for the 1-of-N wire which is active, and a subscript for what group of 1-of-N wires it belongs to (if there is more than one group in the channel). The corresponding acknowledge will be the channel name with an “a” superscript, or an “e” superscript if it is used in the inverted sense.
  • [0051] As in the single rail buffer case, P could be implemented by expanding each channel communication into a handshaking expansion. Direct implementation of this handshaking expansion requires state variables for the a, b variables and more. It could produce an enormously big and slow circuit. Some reshuffling is desired. The PCFB, PCHB, and WCHB reshufflings will be the most useful ones. The correspondence between the single rail “templates” for PCFB, PCHB, and WCHB and a process like P is as follows. The L and La represent all the input data and acknowledges. The R and Ra represent all the output data and acknowledges. [L] indicates a wait for the validity of all inputs, and [¬L] indicates a wait for the neutrality of all inputs. [¬Ra] indicates a wait for all the output acknowledges to be false, and [Ra] indicates a wait for all the output acknowledges to be true. La↑ indicates making true all the input acknowledges in parallel, and La↓ indicates making them false. R↑ means that all the outputs are set to their valid states in parallel. R↓ means that all the outputs are set to their neutral states. When R↑ occurs, it means that particular rails of the outputs are made true, depending on which rails of L are true. This expands R↑ into a set of exclusive selection statements executing in parallel.
  • [0052] Unfortunately, the inventors have recognized that this simple translation may introduce more sequencing than necessary. Of the various actions which occur in parallel, like setting all the outputs valid (R↑), each action might need to wait for only a portion of the preceding guard ([¬Ra ∧ L]). For instance, raising X0↑ or X1↑ needs to check [¬Xa] but not [¬Ya]. Similarly, the semicolons between actions (R↑; La↑) might also over-sequence. However, this cannot be easily fixed while still using the handshaking expansion language. For instance, in the sequence X↑, Y↑; Aa↑, Ba↑, it might be necessary for Aa↑ to wait for [X] only (if Y↑ did not use the value of A) while Ba↑ might need to wait for [X ∧ Y]. This case could be written as X↑, Y↑, ([X]; Aa↑), ([X ∧ Y]; Ba↑). However, this may make the written handshaking expansion more difficult to understand. If the next actions are not fully sequenced, it could get even worse. In the limit, the handshaking expansion just mirrors the actual production rule set (PRS). To skirt the issue, a handshaking expansion that is somewhat over-sequenced can be used, with the understanding that the unnecessary sequencing will be optimized out in the compilation to production rules.
  • [0053] The PCFB version of a P with dual rail channels would therefore be:
    *[[¬Xa ∧ f0(A, B, …) → X0↑ [] ¬Xa ∧ f1(A, B, …) → X1↑],
      [¬Ya ∧ g0(A, B, …) → Y0↑ [] ¬Ya ∧ g1(A, B, …) → Y1↑], …;
      Aa↑, Ba↑, …; en↓;
      [Xa → X0↓, X1↓], [Ya → Y0↓, Y1↓], …,
      [¬A0 ∧ ¬A1 → Aa↓], [¬B0 ∧ ¬B1 → Ba↓], …;
      en↑]
  • [0054] In this handshaking expansion, f0, f1, g0, and g1 are boolean expressions in the data rails of the input channels. They are derived from the f and g of the CSP and indicate the conditions for raising the various data rails of the output channels. Note that each output channel waits only for its own acknowledge, which is less sequenced than a direct translation of the PCFB template would be.
  • [0055] In P it is seen that Aa and Ba tend to switch at about the same time. They could actually be combined into a single ABa which would wait for the conjunction of the guards on Aa and Ba. Combining the acknowledges tends to reduce the area of the circuit, but might slow it down. The best decision depends on the circumstances.
  • Examples of Logic with Buffering
  • [0056] To put the previous section into practice, several CSP processes with the same form as P are compiled into pipelined circuits. The simplest CSP buffer that encodes data has a dual rail input L and a dual rail output R. The CSP is *[L?x; R!x]. Three handshaking expansion reshufflings for this process are:
    WCHB_BUF ≡ *[[¬Ra ∧ L0 → R0↑ [] ¬Ra ∧ L1 → R1↑]; La↑;
                 [Ra ∧ ¬L0 ∧ ¬L1 → R0↓, R1↓]; La↓]
    PCHB_BUF ≡ *[[¬Ra ∧ L0 → R0↑ [] ¬Ra ∧ L1 → R1↑]; La↑;
                 [Ra → R0↓, R1↓]; [¬L0 ∧ ¬L1 → La↓]]
    PCFB_BUF ≡ *[[¬Ra ∧ L0 → R0↑ [] ¬Ra ∧ L1 → R1↑]; La↑; en↓;
                 [Ra → R0↓, R1↓], [¬L0 ∧ ¬L1 → La↓]; en↑]
  • [0057] After bubble-reshuffling (which suggests using the inverted acknowledges, Le and Re), the production rules for the WCHB_BUF follow, where a leading underscore (as in _R0) denotes the inverted-sense node. The circuit diagram for a WCHB is shown in FIG. 2.
    Re ∧ L0 → _R0↓
    Re ∧ L1 → _R1↓
    ¬_R0 → R0↑
    ¬_R1 → R1↑
    ¬_R0 ∨ ¬_R1 → _Le↑
    _Le → Le↓

    ¬Re ∧ ¬L0 → _R0↑
    ¬Re ∧ ¬L1 → _R1↑
    _R0 → R0↓
    _R1 → R1↓
    _R0 ∧ _R1 → _Le↓
    ¬_Le → Le↑
  • [0058] The other handshaking expansions can be implemented similarly, but they are both somewhat bigger. For this reshuffling, the validity and neutrality of the output data R imply the validity and neutrality of the input data L. Logic which has this property is called “weak-condition”. It means that L does not need to be checked anywhere else besides in R. The WCHB also gets some of its semicolons implemented for free. The semicolon in La↑; [Ra ∧ ¬L] is implemented by the environment, as is the implicit semicolon at the end of the loop. The WCHB has some inherent benefits. However, it turns out that although WCHB works well for buffers, the “weak-condition” requirement can cause problems with other circuits.
  • [0059] This WCHB_BUF bubble-reshuffling has 2 transitions of forward latency and 3 transitions of “backward” latency (for the path from the right acknowledge to the left acknowledge). Combining these times for the whole handshake yields 2+3+2+3=10 transitions per cycle.
  • [0060] The 10 transitions per cycle include extra inverters; without them, the WCHB_BUF would have 6 transitions per cycle. These inverters can actually speed up the throughput, despite the increased transition count, because inverters have high gain. Also, the 6 transitions per cycle buffer would invert the senses of the data and acknowledges after every stage, which is highly inconvenient when composing different pipelined cells. As a standard practice, most pipelined logic cells will be done with 2 transitions of forward latency, but more complicated circuits will have 5, 7, or even 9 transitions of backward latency, yielding transitions per cycle from 10 to 22 (even numbers only, of course).
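  • The cycle arithmetic used here (the forward and backward latencies are each paid once on the set phase and once on the reset phase, as in 2+3+2+3) can be written as a one-line helper. The function name is a hypothetical label; the latency values are the ones quoted in the text.

```python
def transitions_per_cycle(forward: int, backward: int) -> int:
    """Set phase plus reset phase, each costing forward + backward transitions."""
    return 2 * (forward + backward)

# Forward latency of 2 with the backward latencies mentioned in the text.
counts = [transitions_per_cycle(2, backward) for backward in (3, 5, 7, 9)]
print(counts)
```

  • Because the sum is doubled, the result is always even, which is why only even transition counts occur.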
  • [0061] Next consider a fulladder, with the CSP *[A?a, B?b, C?c; S!XOR(a,b,c), D!MAJ(a,b,c)]. The A, B, C, S, and D channels are dual rail. The acknowledges for A, B, and C are combined into a single Fe. Inverted acknowledges are used from the start. The three handshaking expansion reshufflings are:
    WCHB_FA ≡ *[[Se ∧ XOR0(A,B,C) → S0↑ [] Se ∧ XOR1(A,B,C) → S1↑],
                [De ∧ MAJ0(A,B,C) → D0↑ [] De ∧ MAJ1(A,B,C) → D1↑];
                Fe↓;
                [¬Se ∧ ¬A0 ∧ ¬A1 ∧ ¬C0 → S0↓, S1↓],
                [¬De ∧ ¬B0 ∧ ¬B1 ∧ ¬C1 → D0↓, D1↓];
                Fe↑]
    PCHB_FA ≡ *[[Se ∧ XOR0(A,B,C) → S0↑ [] Se ∧ XOR1(A,B,C) → S1↑],
                [De ∧ MAJ0(A,B,C) → D0↑ [] De ∧ MAJ1(A,B,C) → D1↑];
                Fe↓;
                [¬Se → S0↓, S1↓], [¬De → D0↓, D1↓];
                [¬A0 ∧ ¬A1 ∧ ¬B0 ∧ ¬B1 ∧ ¬C0 ∧ ¬C1 → Fe↑]]
    PCFB_FA ≡ *[[Se ∧ XOR0(A,B,C) → S0↑ [] Se ∧ XOR1(A,B,C) → S1↑],
                [De ∧ MAJ0(A,B,C) → D0↑ [] De ∧ MAJ1(A,B,C) → D1↑];
                Fe↓; en↓;
                [¬Se → S0↓, S1↓], [¬De → D0↓, D1↓],
                [¬A0 ∧ ¬A1 ∧ ¬B0 ∧ ¬B1 ∧ ¬C0 ∧ ¬C1 → Fe↑];
                en↑]
  • [0062] In the WCHB_FA, the validity of the outputs S and D implies the validity of the inputs, because the S must check all of A, B, and C. The test for the neutrality of the inputs is split between S↓ and D↓. This works as long as both S↓ and D↓ check at least one input's neutrality completely, and both rails of S and D wait for the same expression. In both PCHB_FA and PCFB_FA, the expression for the neutrality of the inputs is obviously too large to implement as a single production rule. Instead, the neutrality test must be decomposed into several operators. The usual decomposition is “nor” gates for each dual rail input, followed by a 3-input C-element. Fe↓ must now wait for the validity of the inputs just to acknowledge the internal transitions. However, this means the logic for S and D no longer needs to fully check validity of the inputs; it is not required to be weak-condition.
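  • The rail-level functions XOR0, XOR1, MAJ0, and MAJ1 are not spelled out in the text. The sketch below assumes the usual sum-of-products forms over the dual-rail inputs and checks them against ordinary binary addition; the function names and the tuple layout (rail 0, rail 1) are illustrative assumptions, and the handshake signals (Se, De, Fe) are left out.

```python
from itertools import product

def rails(bit):
    """Dual-rail code for one bit as (rail 0, rail 1)."""
    return (bit == 0, bit == 1)

NEUTRAL = (False, False)

def xor0(A, B, C):   # sum bit is 0: an even number of inputs are 1
    return ((A[0] and B[0] and C[0]) or (A[0] and B[1] and C[1]) or
            (A[1] and B[0] and C[1]) or (A[1] and B[1] and C[0]))

def xor1(A, B, C):   # sum bit is 1: an odd number of inputs are 1
    return ((A[0] and B[0] and C[1]) or (A[0] and B[1] and C[0]) or
            (A[1] and B[0] and C[0]) or (A[1] and B[1] and C[1]))

def maj0(A, B, C):   # carry is 0: at least two inputs are 0
    return (A[0] and B[0]) or (A[0] and C[0]) or (B[0] and C[0])

def maj1(A, B, C):   # carry is 1: at least two inputs are 1
    return (A[1] and B[1]) or (A[1] and C[1]) or (B[1] and C[1])

def full_adder_rails(A, B, C):
    """Return the (S, D) output rails for the given input rails."""
    return (xor0(A, B, C), xor1(A, B, C)), (maj0(A, B, C), maj1(A, B, C))

for a, b, c in product((0, 1), repeat=3):
    S, D = full_adder_rails(rails(a), rails(b), rails(c))
    print(a, b, c, S, D)
```

  • Every term of the sum output involves all three inputs, so a valid S implies valid inputs. The carry terms look at only two inputs each, so D alone can become valid while one input is still neutral.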
  • [0063] The bubble-reshuffled and decomposed production rules for WCHB_FA are as follows, where a leading underscore denotes the inverted-sense node:
    Se ∧ XOR0(A,B,C) → _S0↓
    Se ∧ XOR1(A,B,C) → _S1↓
    De ∧ MAJ0(A,B,C) → _D0↓
    De ∧ MAJ1(A,B,C) → _D1↓
    ¬_S0 → S0↑
    ¬_S1 → S1↑
    ¬_D0 → D0↑
    ¬_D1 → D1↑
    (¬_S0 ∨ ¬_S1) ∧ (¬_D0 ∨ ¬_D1) → _Fe↑
    _Fe → Fe↓

    ¬Se ∧ ¬A0 ∧ ¬A1 ∧ ¬C0 → _S0↑
    ¬Se ∧ ¬A0 ∧ ¬A1 ∧ ¬C0 → _S1↑
    ¬De ∧ ¬B0 ∧ ¬B1 ∧ ¬C1 → _D0↑
    ¬De ∧ ¬B0 ∧ ¬B1 ∧ ¬C1 → _D1↑
    _S0 → S0↓
    _S1 → S1↓
    _D0 → D0↓
    _D1 → D1↓
    _S0 ∧ _S1 ∧ _D0 ∧ _D1 → _Fe↓
    ¬_Fe → Fe↑
  • [0064] The circuit diagram is shown in FIG. 3. The pull-up logic for S0, S1, D0, and D1 has 4 P-type transistors in series. This can be quite weak, due to the lower mobility of holes. Other WCHB circuits can be even worse. Since all the inputs are checked for neutrality before the outputs reset, a process with three inputs and only one output would end up with 7 P-type transistors in series to reset that output.
  • [0065] The present system uses the “precharge-logic” reshufflings, PCHB_FA or PCFB_FA. These test the neutrality of the inputs in a different place, which is more easily decomposed into manageable gates and does not slow the forward latency. The PCHB_FA reshuffling has the following production rules, where a leading underscore denotes the inverted-sense node:
    A0 ∨ A1 → _Av↓
    B0 ∨ B1 → _Bv↓
    C0 ∨ C1 → _Cv↓
    Fe ∧ Se ∧ XOR0(A,B,C) → _S0↓
    Fe ∧ Se ∧ XOR1(A,B,C) → _S1↓
    Fe ∧ De ∧ MAJ0(A,B,C) → _D0↓
    Fe ∧ De ∧ MAJ1(A,B,C) → _D1↓
    ¬_S0 → S0↑
    ¬_S1 → S1↑
    ¬_D0 → D0↑
    ¬_D1 → D1↑
    ¬_Av ∧ ¬_Bv ∧ ¬_Cv → ABCv↑
    ¬_S0 ∨ ¬_S1 → Sv↑
    ¬_D0 ∨ ¬_D1 → Dv↑
    Sv ∧ Dv ∧ ABCv → Fe↓

    ¬A0 ∧ ¬A1 → _Av↑
    ¬B0 ∧ ¬B1 → _Bv↑
    ¬C0 ∧ ¬C1 → _Cv↑
    ¬Se ∧ ¬Fe → _S0↑
    ¬Se ∧ ¬Fe → _S1↑
    ¬De ∧ ¬Fe → _D0↑
    ¬De ∧ ¬Fe → _D1↑
    _S0 → S0↓
    _S1 → S1↓
    _D0 → D0↓
    _D1 → D1↓
    _Av ∧ _Bv ∧ _Cv → ABCv↓
    _S0 ∧ _S1 → Sv↓
    _D0 ∧ _D1 → Dv↓
    ¬Sv ∧ ¬Dv ∧ ¬ABCv → Fe↑
  • [0066] This circuit can be made faster by adding two inverters to Fe and then two more to produce the Fe used internally (which is now called en). This circuit is shown in FIG. 4.
  • [0067] A PCFB_FA reshuffling would have only slightly different production rules (a leading underscore again denotes the inverted-sense node):
    A0 ∨ A1 → _Av↓
    B0 ∨ B1 → _Bv↓
    C0 ∨ C1 → _Cv↓
    en ∧ Se ∧ XOR0(A,B,C) → _S0↓
    en ∧ Se ∧ XOR1(A,B,C) → _S1↓
    en ∧ De ∧ MAJ0(A,B,C) → _D0↓
    en ∧ De ∧ MAJ1(A,B,C) → _D1↓
    ¬_S0 → S0↑
    ¬_S1 → S1↑
    ¬_D0 → D0↑
    ¬_D1 → D1↑
    ¬_Av ∧ ¬_Bv ∧ ¬_Cv → ABCv↑
    ¬_S0 ∨ ¬_S1 → Sv↑
    ¬_D0 ∨ ¬_D1 → Dv↑
    en ∧ Sv ∧ Dv ∧ ABCv → Fe↓
    Sv ∧ Dv → _SDv↓
    ¬Fe ∧ ¬_SDv → _en↑
    _en → en↓

    ¬A0 ∧ ¬A1 → _Av↑
    ¬B0 ∧ ¬B1 → _Bv↑
    ¬C0 ∧ ¬C1 → _Cv↑
    ¬Se ∧ ¬en → _S0↑
    ¬Se ∧ ¬en → _S1↑
    ¬De ∧ ¬en → _D0↑
    ¬De ∧ ¬en → _D1↑
    _S0 → S0↓
    _S1 → S1↓
    _D0 → D0↓
    _D1 → D1↓
    _Av ∧ _Bv ∧ _Cv → ABCv↓
    _S0 ∧ _S1 → Sv↓
    _D0 ∧ _D1 → Dv↓
    ¬en ∧ ¬ABCv → Fe↑
    ¬Sv ∧ ¬Dv → _SDv↑
    Fe ∧ _SDv → _en↓
    ¬_en → en↑
  • [0068] Comparing the three fulladder reshufflings, the WCHB_FA has only 10 transitions per cycle, while the PCHB_FA has 14 and the PCFB_FA has 12 (7 on the setting phase, but 5 on the resetting phase, since the L and R handshakes reset in parallel). Although the WCHB_FA has fewer transistors, to make it reasonably fast, the 4 P-transistors in series must be made very large. Despite the lower transition count of the WCHB_FA, both PCHB_FA and PCFB_FA are substantially faster in throughput and latency. PCFB_FA is the fastest of all, since it relies heavily on n-transistors and saves 2 transitions on the reset phase. However, PCFB_FA can be larger than PCHB_FA, due to the extra state variable en and the extra completion SDv. If the speed of the fulladder is not critical, the PCHB_FA seems to be the best choice.
  • [0069] In general, the WCHB reshuffling tends to be best only for buffers and copies (*[L?x; R!x, S!x]). The PCHB is the workhorse for most applications; it is both small and fast. When exceptional speed is called for, the PCFB dominates. It is also especially good at completing 1-of-N codes where N is very large, since the completion can be done by a circuit which looks like a tied-or pulldown as opposed to many stages of combinational logic. The reshufflings can actually be mixed together, with each channel in the cell using a different one. This is most commonly useful when a cell computes on some inputs using PCHB, but also copies some inputs directly to outputs using WCHB. In this case, the neutrality detection for the WCHB outputs is only one p-gate, which is no worse than an extra en gate.
  • [0070] Another common class of logic circuits uses shared control inputs to process multi-bit words. This is similar to a fulladder. The control is just another input, which happens to have a large fanout to many output channels. Since the outputs only sparsely depend on the inputs (usually with a bit to bit correspondence), the number of gates in series in the logic often does not become prohibitive. However, if the number of bits is large (e.g., 32), the completion of all the inputs and outputs will take many stages in a C-element tree, which adds to the cycle time, as does the load on the broadcast of the control data. To make high throughput datapath logic, it can be better to break the datapath up into manageable chunks (perhaps 4 or 8 bits), and send buffered copies of the control tokens to each chunk. This cuts down the cycle time, but does not change the high-level meaning, except to introduce extra slack.
  • [0071] Conditionally Producing Outputs
  • [0072] Although the cells discussed in the previous section can be shown to be Turing complete (they can be turned into a von Neumann state machine, with some outputs fed back through buffers to store state), they are clearly inefficient for many applications. A very useful extension is the ability to skip a communication on a channel on a given cycle. This turns out to require only a few minor modifications to the scheme as presented so far.
  • [0073] Suppose the process completes at most one communication per cycle on the outputs, but always receives all its inputs. The CSP would be:
    P1 ≡ *[A?a, B?b, …;
           [do_x(a, b, …) → X!f(a, b, …) [] ¬do_x(a, b, …) → skip],
           [do_y(a, b, …) → Y!g(a, b, …) [] ¬do_y(a, b, …) → skip], …]
  • [0074] As above, this can reshuffle like WCHB, PCHB, or PCFB. The selection statements for the outputs expand into exclusive selections for setting the output rails, plus a new case for producing no output at all on the channel. A dual-rail version of P1 with a PCFB reshuffling is:
    *[[do_x(A, B, …) ∧ ¬Xa ∧ f0(A, B, …) → X0↑
       [] do_x(A, B, …) ∧ ¬Xa ∧ f1(A, B, …) → X1↑
       [] ¬do_x(A, B, …) → skip],
      [do_y(A, B, …) ∧ ¬Ya ∧ g0(A, B, …) → Y0↑
       [] do_y(A, B, …) ∧ ¬Ya ∧ g1(A, B, …) → Y1↑
       [] ¬do_y(A, B, …) → skip], …;
      Aa↑, Ba↑, …; en↓;
      [Xa ∨ ¬X0 ∧ ¬X1 → X0↓, X1↓], [Ya ∨ ¬Y0 ∧ ¬Y1 → Y0↓, Y1↓], …,
      [¬A0 ∧ ¬A1 → Aa↓], [¬B0 ∧ ¬B1 → Ba↓], …;
      en↑]
  • [0075] Note that the resetting of the output channels X and Y must accommodate the cases when those channels were not used. Since they produce no outputs, they must not wait for the acknowledges. Adding in the ¬X0 ∧ ¬X1 terms will allow the wait to be completed vacuously. This does not actually generate any production rules. This handshaking expansion can be compiled into production rules, but there are some tricky details.
  • [0076] An interesting choice arises from the use of the skip. A skip causes no visible change in state, so the next statements in sequence (Aa↑, Ba↑, …) must actually look directly at the boolean expressions for ¬do_x(A, B, …) and ¬do_y(A, B, …) in addition to the output rails X0, X1, Y0, Y1.
  • [0077] The completion condition for setting the outputs would be en ∧ (X0 ∨ X1 ∨ ¬do_x(A, B, …)) ∧ (Y0 ∨ Y1 ∨ ¬do_y(A, B, …)). However, this expression cannot be used directly in the guards for Aa↑ and Ba↑, since if one fired first, it could destabilize the other. (This would work if Aa and Ba were combined into one acknowledge.)
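  • The completion condition en ∧ (X0 ∨ X1 ∨ ¬do_x) ∧ (Y0 ∨ Y1 ∨ ¬do_y) can be evaluated directly as a boolean function. The sketch below is illustrative only: the function name outputs_complete and the argument layout (each channel as a pair of rails, do_x and do_y as already-evaluated booleans) are assumptions, and it says nothing about the stability problem noted in the text.

```python
def outputs_complete(en, X, Y, do_x, do_y):
    """True once every output channel is either valid or skipped this cycle."""
    x_done = X[0] or X[1] or not do_x   # X produced a value, or X is skipped
    y_done = Y[0] or Y[1] or not do_y   # Y produced a value, or Y is skipped
    return en and x_done and y_done

# X is used and has raised rail 1; Y is skipped and stays neutral.
print(outputs_complete(True, (False, True), (False, False), True, False))
```

  • A skipped channel contributes through its ¬do term, so the condition can become true even though that channel's rails never leave the neutral state.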
  • [0078] Another approach is to introduce a new variable to represent the ¬do_x and ¬do_y cases. Suppose the skips are replaced with no_x↑ and no_y↑, respectively, and no_x↓ is added to X0↓, X1↓ and no_y↓ to Y0↓, Y1↓. Now the production rules are simply produced as if X and Y were 1-of-3 channels instead of 1-of-2, except that the extra rail does not check the right acknowledge or, in fact, leave the cell.
  • [0079] Finally, there are many cases where some expression of the outputs is sufficient to produce the output completion expression without reference to the inputs. For instance, if one input is used to decide if a certain output is used, but is also copied to another output, the copied output could be used to check the completion of the optional output. Similarly, if two output channels are used exclusively, such that one or the other will be used each cycle, the completion for both is just the or of each one's completion.
  • [0080] To put this discussion into practice, a split is implemented: a fundamental routing process which uses one control input to route a data input to one of two output channels. The simple one-bit CSP is *[S?s, L?x; [¬s → A!x [] s → B!x]]. The PCHB reshuffling is:
    PCHB_SPLIT ≡ *[[Ae ∧ S0 ∧ L0 → A0↑ [] Ae ∧ S0 ∧ L1 → A1↑ [] S1 → skip],
                   [Be ∧ S1 ∧ L0 → B0↑ [] Be ∧ S1 ∧ L1 → B1↑ [] S0 → skip];
                   SLe↓;
                   [¬Ae ∨ ¬A0 ∧ ¬A1 → A0↓, A1↓], [¬Be ∨ ¬B0 ∧ ¬B1 → B0↓, B1↓];
                   [¬S0 ∧ ¬S1 ∧ ¬L0 ∧ ¬L1 → SLe↑]]
  • [0081] The first two selection statements are known to be finished when A0 ∨ A1 ∨ B0 ∨ B1. Hence, this will be used as the guard for SLe↓. The bubble-reshuffled production rules are as follows, where a leading underscore denotes the inverted-sense node:
    S0 ∨ S1 → _Sv↓
    L0 ∨ L1 → _Lv↓
    SLe ∧ Ae ∧ S0 ∧ L0 → _A0↓
    SLe ∧ Ae ∧ S0 ∧ L1 → _A1↓
    SLe ∧ Be ∧ S1 ∧ L0 → _B0↓
    SLe ∧ Be ∧ S1 ∧ L1 → _B1↓
    ¬_A0 → A0↑
    ¬_A1 → A1↑
    ¬_B0 → B0↑
    ¬_B1 → B1↑
    ¬_Sv ∧ ¬_Lv → SLv↑
    ¬_A0 ∨ ¬_A1 ∨ ¬_B0 ∨ ¬_B1 → ABv↑
    ABv ∧ SLv → SLe↓

    ¬S0 ∧ ¬S1 → _Sv↑
    ¬L0 ∧ ¬L1 → _Lv↑
    ¬SLe ∧ ¬Ae → _A0↑
    ¬SLe ∧ ¬Ae → _A1↑
    ¬SLe ∧ ¬Be → _B0↑
    ¬SLe ∧ ¬Be → _B1↑
    _A0 → A0↓
    _A1 → A1↓
    _B0 → B0↓
    _B1 → B1↓
    _Sv ∧ _Lv → SLv↓
    _A0 ∧ _A1 ∧ _B0 ∧ _B1 → ABv↓
    ¬ABv ∧ ¬SLv → SLe↑
  • The circuit is shown in FIG. 5. [0082]
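  • The routing behavior of the split can be checked with a small behavioral model. The following Python sketch is an illustration only and is not part of the disclosed circuit. It assumes dual-rail values are modeled as tuples (rail0, rail1), and the helper names `encode` and `split_set_phase` are hypothetical. It evaluates the set-phase guards of the split and the completion guard A0∨A1∨B0∨B1 that enables SLe↓.

```python
def encode(bit):
    """Return a dual-rail (rail0, rail1) encoding of a bit; None means neutral."""
    if bit is None:
        return (0, 0)
    return (1, 0) if bit == 0 else (0, 1)


def split_set_phase(s, l, a_e=1, b_e=1):
    """Evaluate the set-phase guards of the split for dual-rail inputs S and L.

    Returns (A, B, done): the dual-rail outputs and the completion guard
    A0 or A1 or B0 or B1 that allows the shared input enable to fall.
    """
    s0, s1 = s
    l0, l1 = l
    # Control value 0 routes L to A, control value 1 routes L to B.
    a = (int(a_e and s0 and l0), int(a_e and s0 and l1))
    b = (int(b_e and s1 and l0), int(b_e and s1 and l1))
    done = bool(a[0] or a[1] or b[0] or b[1])
    return a, b, done
```

  • For example, `split_set_phase(encode(0), encode(1))` drives only rail A1, while a low output enable (`a_e=0`) holds the outputs neutral so that the inputs are not yet acknowledged.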
  • Conditionally Reading Inputs [0083]
  • It is also highly useful to be able to conditionally read inputs. Normally the condition is read in on a separate unconditional channel, but in general it could be any expansion of the rails of the inputs. A CSP template for this type of cell would be: [0084]

$$
\begin{aligned}
\mathrm{P2} \equiv {*}[\,&[\mathit{do\_a}(\overline{A}, \overline{B}) \rightarrow A?a \;[\!]\; \mathit{no\_a}(\overline{A}, \overline{B}) \rightarrow a := \text{unused}],\\
&[\mathit{do\_b}(\overline{A}, \overline{B}) \rightarrow B?b \;[\!]\; \mathit{no\_b}(\overline{A}, \overline{B}) \rightarrow b := \text{unused}],\\
&X!f(a, b),\ Y!g(a, b),\ \ldots\,]
\end{aligned}
$$
  • The $\overline{A}$ in this context refers to a probe of the value of A, not just its availability. This is not standard in CSP, but is a useful extension which is easily implemented in Handshaking expansion. Basically, the booleans for do_a, do_b, no_a, and no_b may inspect the rails of A and B in order to decide whether to actually receive from the channels. The selection statements will suspend until either do_a or no_a is true. These expansions are required to be stable; that is, as additional inputs show up, they may not become false as a result. [0085]
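  • The stability requirement on the guards can be checked by enumeration. The following Python sketch is an illustration under stated assumptions: each channel is modeled as None (not yet arrived), 0, or 1, and the guard functions `do_a`, `no_a`, and `bad` are hypothetical examples rather than guards taken from this disclosure. A guard is treated as stable if, once true, it stays true for every way the still-neutral inputs may later arrive.

```python
from itertools import product

VALUES = (None, 0, 1)  # None models a channel that has not yet arrived


def is_stable(guard):
    """Return True if guard(a, b) never turns false as neutral inputs arrive."""
    for a, b in product(VALUES, repeat=2):
        if not guard(a, b):
            continue
        # A neutral channel may stay neutral or become 0 or 1; a valid one is fixed.
        later_a = VALUES if a is None else (a,)
        later_b = VALUES if b is None else (b,)
        if not all(guard(x, y) for x, y in product(later_a, later_b)):
            return False
    return True


def do_a(a, b):
    """Hypothetical guard: read A when the control value on B is 1."""
    return b == 1


def no_a(a, b):
    """Hypothetical guard: skip A when the control value on B is 0."""
    return b == 0


def bad(a, b):
    """Unstable guard: true while A is absent, false once A arrives."""
    return a is None
```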
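  • For the Handshaking expansion, instead of assigning “unused” to an internal variable, the f and g expansions examine the inputs directly. The results of the do_a/no_a and do_b/no_b expansions must be latched into internal variables u and v, so that A and B may be acknowledged in parallel without destabilizing the guards of do_a and the like. The PCFB version of the Handshaking expansion is: [0086]

$$
\begin{aligned}
&u^0{\downarrow}, u^1{\downarrow}, v^0{\downarrow}, v^1{\downarrow}, \ldots;\\
&{*}[\,[f_0(A, B, \ldots) \rightarrow X^0{\uparrow} \;[\!]\; f_1(A, B, \ldots) \rightarrow X^1{\uparrow}],\\
&\quad[g_0(A, B, \ldots) \rightarrow Y^0{\uparrow} \;[\!]\; g_1(A, B, \ldots) \rightarrow Y^1{\uparrow}], \ldots,\\
&\quad[\mathit{do\_a}(A, B) \rightarrow u^1{\uparrow} \;[\!]\; \mathit{no\_a}(A, B) \rightarrow u^0{\uparrow}],\\
&\quad[\mathit{do\_b}(A, B) \rightarrow v^1{\uparrow} \;[\!]\; \mathit{no\_b}(A, B) \rightarrow v^0{\uparrow}], \ldots;\\
&\quad[u^1 \rightarrow A^a{\uparrow} \;[\!]\; u^0 \rightarrow \mathrm{skip}],\ [v^1 \rightarrow B^a{\uparrow} \;[\!]\; v^0 \rightarrow \mathrm{skip}], \ldots;\\
&\quad en{\downarrow};\\
&\quad[X^a \rightarrow X^0{\downarrow}, X^1{\downarrow}],\ [Y^a \rightarrow Y^0{\downarrow}, Y^1{\downarrow}], \ldots,\\
&\quad(u^0{\downarrow}, u^1{\downarrow};\ [(\neg A^0 \wedge \neg A^1) \vee \neg A^a \rightarrow A^a{\downarrow}]),\\
&\quad(v^0{\downarrow}, v^1{\downarrow};\ [(\neg B^0 \wedge \neg B^1) \vee \neg B^a \rightarrow B^a{\downarrow}]), \ldots;\\
&\quad en{\uparrow}\,]
\end{aligned}
$$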
  • Similarly to the conditional output Handshaking expansion, the guards for Aa↓ and Ba↓ are weakened to allow the vacuous case. The skip again can pose a problem, since it makes no change in the state. However, with the u0 and v0 variables it is possible to infer the skip and generate the correct guard for en. On the reset phase, the u and v must return to the neutral state. There are several places to put this, but the symmetric placement which sequences them with the Aa↓ and Ba↓ simplifies the PRS. [0087]
  • In many cases, this general template can be greatly simplified. For instance, if a set of unconditional inputs completely controls the conditions for reading the others, these can be thought of as the “control” inputs. If raising the acknowledges of the various inputs is sequenced so that the conditional ones precede the control ones, then the variables u and v may be eliminated without causing stability problems. Also in some cases the u and v may be substituted with an expansion of the outputs, instead of stored separately. [0088]
  • As a concrete example, the circuit for the merge process reverses the split of the last section by conditionally reading one of two data input channels (A and B) to the single output channel X based on a control input M. The CSP is *[M?m; [¬m→A?x [] m→B?x]; X!x]. Here the simplification of acknowledging the data inputs A and B before the control input M is used. The PCHB reshuffling is: [0089]

$$
\begin{aligned}
\mathrm{PCHB\_MERGE} \equiv {*}[\,&[X^e \wedge (M^0 \wedge A^0 \vee M^1 \wedge B^0) \rightarrow X^0{\uparrow} \;[\!]\; X^e \wedge (M^0 \wedge A^1 \vee M^1 \wedge B^1) \rightarrow X^1{\uparrow}],\\
&[M^0 \rightarrow A^e{\downarrow} \;[\!]\; M^1 \rightarrow B^e{\downarrow}];\ M^e{\downarrow};\\
&[\neg X^e \rightarrow X^0{\downarrow}, X^1{\downarrow}],\\
&[(\neg A^0 \wedge \neg A^1 \wedge \neg M^0) \vee A^e \rightarrow A^e{\uparrow}],\ [(\neg B^0 \wedge \neg B^1 \wedge \neg M^1) \vee B^e \rightarrow B^e{\uparrow}];\\
&M^e{\uparrow}\,]
\end{aligned}
$$
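  • A subtle simplification used here is to make Ae↑ and Be↑ check the corresponding ¬M0 and ¬M1. This reduces the guard condition for Me↑ and makes the reset phase symmetric with the set phase. Some decomposition is done to add Av, Bv and Xv to do validity and neutrality checks. After bubble-reshuffling, the PRS (set phase on the left, reset phase on the right) is: [0090]

$$
\begin{array}{ll}
A^0 \vee A^1 \rightarrow \overline{A^v}{\downarrow} & \neg A^0 \wedge \neg A^1 \rightarrow \overline{A^v}{\uparrow}\\
B^0 \vee B^1 \rightarrow \overline{B^v}{\downarrow} & \neg B^0 \wedge \neg B^1 \rightarrow \overline{B^v}{\uparrow}\\
\neg\overline{A^v} \rightarrow A^v{\uparrow} & \overline{A^v} \rightarrow A^v{\downarrow}\\
\neg\overline{B^v} \rightarrow B^v{\uparrow} & \overline{B^v} \rightarrow B^v{\downarrow}\\
M^e \wedge X^e \wedge (M^0 \wedge A^0 \vee M^1 \wedge B^0) \rightarrow \overline{X^0}{\downarrow} & \neg M^e \wedge \neg X^e \rightarrow \overline{X^0}{\uparrow}\\
M^e \wedge X^e \wedge (M^0 \wedge A^1 \vee M^1 \wedge B^1) \rightarrow \overline{X^1}{\downarrow} & \neg M^e \wedge \neg X^e \rightarrow \overline{X^1}{\uparrow}\\
\neg\overline{X^0} \rightarrow X^0{\uparrow} & \overline{X^0} \rightarrow X^0{\downarrow}\\
\neg\overline{X^1} \rightarrow X^1{\uparrow} & \overline{X^1} \rightarrow X^1{\downarrow}\\
\neg\overline{X^0} \vee \neg\overline{X^1} \rightarrow X^v{\uparrow} & \overline{X^0} \wedge \overline{X^1} \rightarrow X^v{\downarrow}\\
A^v \wedge M^0 \wedge X^v \rightarrow A^e{\downarrow} & \neg A^v \wedge \neg M^0 \wedge \neg X^v \rightarrow A^e{\uparrow}\\
B^v \wedge M^1 \wedge X^v \rightarrow B^e{\downarrow} & \neg B^v \wedge \neg M^1 \wedge \neg X^v \rightarrow B^e{\uparrow}\\
\neg A^e \vee \neg B^e \rightarrow \overline{M^e}{\uparrow} & A^e \wedge B^e \rightarrow \overline{M^e}{\downarrow}\\
\overline{M^e} \rightarrow M^e{\downarrow} & \neg\overline{M^e} \rightarrow M^e{\uparrow}
\end{array}
$$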
  • As usual for PCHB reshuffling, most of the work is done in a large network of n-transistors. The circuit is shown in FIG. 6. [0091]
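  • The set phase of the merge can be modeled in the same behavioral style. The following Python sketch is an illustration only. It assumes dual-rail values are tuples (rail0, rail1), and the function name `merge_set_phase` is hypothetical. It computes the output rails from the guards Xe∧(M0∧A0∨M1∧B0) and Xe∧(M0∧A1∨M1∧B1), and reports which data input would be acknowledged once the output is valid.

```python
def merge_set_phase(m, a, b, x_e=1):
    """Evaluate the merge set-phase guards on dual-rail (rail0, rail1) tuples.

    Returns (X, acked): the dual-rail output and the name of the data input
    whose enable falls ("A" or "B"), or None while the output is still neutral.
    """
    m0, m1 = m
    x0 = int(x_e and ((m0 and a[0]) or (m1 and b[0])))
    x1 = int(x_e and ((m0 and a[1]) or (m1 and b[1])))
    acked = None
    if x0 or x1:
        # Only the selected data input is acknowledged; the other is left untouched.
        acked = "A" if m0 else "B"
    return (x0, x1), acked
```

  • With M carrying 0, only A is consumed: a value waiting on B stays on its channel for a later cycle.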
  • Internal State [0092]
  • Another extension to this design style is the ability to store internal state from one cycle to the next. A CSP template for a state holding process with state variable s is: [0093]

$$
\mathrm{P3} \equiv s := \mathit{initial\_s};\ {*}[\,A?a, B?b, \ldots;\ X!f(s, a, b, \ldots),\ Y!g(s, a, b, \ldots), \ldots;\ s := h(s, a, b, \ldots)\,]
$$
  • This can be implemented in a variety of ways. The simplest, which requires no new circuits, is to feed an output of a normal pipelined cell back around to an input, via several buffer stages. One of these feedback buffers is initialized containing a token with the value of the initial state. Enough buffers must be used to avoid deadlock, and even more are needed to maximize the throughput. Therefore, this solution can be quite large. For control circuitry, where area is less of an issue, this is often adequate. As an added benefit, the feed forward portion of the state machine can be implemented as several sequential stages of pipelined logic, which correspondingly reduces the number of feedback buffers necessary and allows far more complicated functions. [0094]
  • Aside from using feedback buffers, there are three main approaches to retaining state, of increasing generality and complexity. First, pipelining channels by themselves store state. Usually, these values move forward down the pipeline, passing each stage only once. However, if a stage uses but does not acknowledge its input, the input value will still be there on the next cycle. Essentially, the token is stopped and sampled many times. In CSP, this can be expressed with the probe of the value of the channel. A conditional input type of circuit is used, which uses an input to produce outputs without acknowledging that input. This technique can be used for certain problems. For example, a loop unroller could take an instruction on the input channel, and produce many copies of it on an output channel based on a control input. Of course, this type of state variable can never be set, only read one or more times from an input. [0095]
  • If the state variable is exclusively set or used in a cycle, a simple modification of the standard pipelined reshuffling will suffice. The state variable s is assigned to a dual-rail value at the same time the outputs are produced. On the reset phase, it remains stable. Unlike the usual return-to-zero variables, s will only briefly transition through neutrality between valid states. If s doesn't change, it does not go through a neutral state at all. The CSP for this behavior is expressed just like P3, except the semicolon before the assignment to s is replaced with a comma. This is made possible by the assumption that s only changes when the outputs X and Y do not depend on it; this avoids any stability problems. [0096]
  • The only tricky thing about deriving the Handshaking expansion for this is the assignment statement. Basically, the assignment is done by lowering the opposite rail first, then raising the desired rail. This guarantees that the variable passes through neutral when it changes, and also bubble-reshuffles nicely. The completion detection of this assignment is basically equivalent to checking that the value of s corresponds to the inputs to s. So s:=x becomes [x0→s1↓; s0↑ [] x1→s0↓; s1↑]; [x0∧s0 ∨ x1∧s1]. The PCFB version of the Handshaking expansion for this type of state holding process is: [0097]

$$
\begin{aligned}
{*}[\,&[\neg X^a \wedge f_0(s, A, B, \ldots) \rightarrow X^0{\uparrow} \;[\!]\; \neg X^a \wedge f_1(s, A, B, \ldots) \rightarrow X^1{\uparrow}],\\
&[\neg Y^a \wedge g_0(s, A, B, \ldots) \rightarrow Y^0{\uparrow} \;[\!]\; \neg Y^a \wedge g_1(s, A, B, \ldots) \rightarrow Y^1{\uparrow}], \ldots,\\
&[h_0(A, B, \ldots) \rightarrow s^1{\downarrow}; s^0{\uparrow} \;[\!]\; h_1(A, B, \ldots) \rightarrow s^0{\downarrow}; s^1{\uparrow}], \ldots;\\
&A^a{\uparrow}, B^a{\uparrow}, \ldots;\ en{\downarrow};\\
&[X^a \rightarrow X^0{\downarrow}, X^1{\downarrow}],\ [Y^a \rightarrow Y^0{\downarrow}, Y^1{\downarrow}], \ldots,\\
&[\neg A^0 \wedge \neg A^1 \rightarrow A^a{\downarrow}],\ [\neg B^0 \wedge \neg B^1 \rightarrow B^a{\downarrow}], \ldots;\\
&en{\uparrow}\,]
\end{aligned}
$$
  • It is often desirable to decompose the completion detection of the state variable into a 4-phase completion variable sv which detects the completion of the assignment on the set phase and is cleared on the reset phase. This makes it easier to have multiple state variables. One thing to note is that the assignment sequence and completion has 3 transitions if it changes state, and therefore often takes more transitions than a typical output channel. However, on the reset phase or if the state is unchanged, this only takes 1 transition. Another caveat is that the state variable shown here works best only for dual-rail 1-bit state variables. [0098]
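  • The dual-rail assignment and its completion check can be traced with a short model. The following Python sketch is an illustration only. It assumes the state is a two-element list [s0, s1], and the function names `assign` and `assignment_complete` are hypothetical. It lowers the opposite rail first, raises the desired rail second, and then evaluates the completion condition x0∧s0 ∨ x1∧s1.

```python
def assign(s, x):
    """Assign bit x to the dual-rail state s = [s0, s1]; return the transitions.

    The opposite rail is lowered first and the desired rail is raised second,
    so s passes through neutral only when its value actually changes.
    """
    transitions = []
    opposite, desired = (1, 0) if x == 0 else (0, 1)
    if s[opposite]:
        s[opposite] = 0
        transitions.append(f"s{opposite}-")
    if not s[desired]:
        s[desired] = 1
        transitions.append(f"s{desired}+")
    return transitions


def assignment_complete(s, x):
    """Completion check: (x0 and s0) or (x1 and s1)."""
    x0, x1 = (1, 0) if x == 0 else (0, 1)
    return bool((x0 and s[0]) or (x1 and s[1]))
```

  • A change of state produces two rail transitions and an unchanged state produces none. Adding one transition of a completion variable to each case gives the counts of 3 and 1 stated above.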
  • As an example of this type of state variable, consider the “register” process x:=0; *[C?c; [c→R!x [] ¬c→L?x]]. This uses a control channel C to decide whether to read or write the state bit x via the input and output channels L and R. Obviously, the state bit is exclusively used or set on any given cycle. This process also conditionally communicates on L and R. [0099]
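  • The PCHB version of the Handshaking expansion is: [0100]

$$
\begin{aligned}
\mathrm{PCHB\_REG} \equiv\ & x^0{\uparrow}, x^1{\downarrow};\\
{*}[\,&[C^1 \wedge R^e \wedge x^0 \rightarrow R^0{\uparrow} \;[\!]\; C^1 \wedge R^e \wedge x^1 \rightarrow R^1{\uparrow}\\
&\ \;[\!]\; C^0 \wedge L^0 \rightarrow x^1{\downarrow}; x^0{\uparrow} \;[\!]\; C^0 \wedge L^1 \rightarrow x^0{\downarrow}; x^1{\uparrow}];\\
&[C^0 \rightarrow L^e{\downarrow} \;[\!]\; C^1 \rightarrow \mathrm{skip}];\ C^e{\downarrow};\\
&[\neg R^e \vee (\neg R^0 \wedge \neg R^1) \rightarrow R^0{\downarrow}, R^1{\downarrow}];\ [\neg L^0 \wedge \neg L^1 \rightarrow L^e{\uparrow}];\ [\neg C^0 \wedge \neg C^1 \rightarrow C^e{\uparrow}]\,]
\end{aligned}
$$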
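  • The PRS has a few tricky features. Due to the exclusive pattern of the communications, the rules for Ce can be simplified. The decomposed and bubble-reshuffled PRS follows (set phase on the left, reset phase on the right; the rules for x have no reset counterpart). The circuit is shown in FIG. 7. [0101]

$$
\begin{array}{ll}
C^e \wedge C^1 \wedge R^e \wedge x^0 \rightarrow \overline{R^0}{\downarrow} & \neg C^e \wedge \neg R^e \rightarrow \overline{R^0}{\uparrow}\\
C^e \wedge C^1 \wedge R^e \wedge x^1 \rightarrow \overline{R^1}{\downarrow} & \neg C^e \wedge \neg R^e \rightarrow \overline{R^1}{\uparrow}\\
\neg\overline{R^0} \rightarrow R^0{\uparrow} & \overline{R^0} \rightarrow R^0{\downarrow}\\
\neg\overline{R^1} \rightarrow R^1{\uparrow} & \overline{R^1} \rightarrow R^1{\downarrow}\\
\neg\overline{R^0} \vee \neg\overline{R^1} \rightarrow R^v{\uparrow} & \overline{R^0} \wedge \overline{R^1} \rightarrow R^v{\downarrow}\\
R^v \rightarrow \overline{R^v}{\downarrow} & \neg R^v \rightarrow \overline{R^v}{\uparrow}\\
C^e \wedge C^0 \wedge L^0 \rightarrow x^1{\downarrow} & \\
C^e \wedge C^0 \wedge L^1 \rightarrow x^0{\downarrow} & \\
\neg L^0 \wedge \neg x^0 \rightarrow x^1{\uparrow} & \\
\neg L^1 \wedge \neg x^1 \rightarrow x^0{\uparrow} & \\
C^e \wedge (x^0 \wedge L^0 \vee x^1 \wedge L^1) \rightarrow L^e{\downarrow} & \neg C^e \wedge \neg L^0 \wedge \neg L^1 \rightarrow L^e{\uparrow}\\
\neg L^e \vee \neg\overline{R^v} \rightarrow \overline{C^e}{\uparrow} & L^e \wedge \overline{R^v} \rightarrow \overline{C^e}{\downarrow}\\
\overline{C^e} \rightarrow C^e{\downarrow} & \neg\overline{C^e} \wedge \neg C^0 \wedge \neg C^1 \rightarrow C^e{\uparrow}
\end{array}
$$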
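  • At the CSP level, the register process can be exercised with a simple sequential model. The following Python sketch is an illustration only. The function name `register` and the use of Python iterables to stand in for the channels C, L, and R are assumptions made for the example. It models x:=0; *[C?c; [c→R!x [] ¬c→L?x]] and shows that L is consumed only on write cycles and R is produced only on read cycles.

```python
def register(controls, writes):
    """Behavioral model of x:=0; *[C?c; [c -> R!x [] not c -> L?x]].

    controls: iterable of control bits received on C.
    writes:   iterable of bits offered on L, consumed only when c is 0.
    Returns the list of values sent on R.
    """
    x = 0                       # initial state, x := 0
    writes = iter(writes)
    sent = []
    for c in controls:
        if c:
            sent.append(x)      # read cycle: send the stored bit on R
        else:
            x = next(writes)    # write cycle: receive a new bit from L
    return sent
```

  • For instance, `register([1, 0, 1, 1, 0, 1], [1, 0])` reads the initial 0, writes 1, reads it twice, writes 0, and reads it once.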
  • The most general form of state holding cell is one where the state variable can be used and set in any cycle. In order to do this, it is necessary to have separate storage locations for the new state and the old state. This may be done by introducing an extra state variable t which holds the new state until s is used. The CSP for this is: [0102]

$$
\mathrm{P4} \equiv s := 0;\ {*}[\,A?a, B?b, \ldots;\ X!f(s, a, b, \ldots),\ Y!g(s, a, b, \ldots),\ t := h(s, a, b, \ldots), \ldots;\ s := t\,]
$$
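  • When this is converted into a Handshaking expansion, there are several choices for where to put the assignment s:=t. It works best to do this assignment on the reset phase of the channel handshakes. After the assignment s:=t, t returns to neutral just like a channel. The PCFB version of this type of cell is: [0103]

$$
\begin{aligned}
&s := 0;\\
&{*}[\,[\neg X^a \wedge f_0(s, A, B, \ldots) \rightarrow X^0{\uparrow} \;[\!]\; \neg X^a \wedge f_1(s, A, B, \ldots) \rightarrow X^1{\uparrow}],\\
&\quad[\neg Y^a \wedge g_0(s, A, B, \ldots) \rightarrow Y^0{\uparrow} \;[\!]\; \neg Y^a \wedge g_1(s, A, B, \ldots) \rightarrow Y^1{\uparrow}], \ldots,\\
&\quad[h_0(s, A, B, \ldots) \rightarrow t^0{\uparrow} \;[\!]\; h_1(s, A, B, \ldots) \rightarrow t^1{\uparrow}], \ldots;\\
&\quad A^a{\uparrow}, B^a{\uparrow}, \ldots;\ en{\downarrow};\\
&\quad[X^a \rightarrow X^0{\downarrow}, X^1{\downarrow}],\ [Y^a \rightarrow Y^0{\downarrow}, Y^1{\downarrow}], \ldots,\\
&\quad[t^0 \rightarrow s^1{\downarrow}; s^0{\uparrow}; t^0{\downarrow} \;[\!]\; t^1 \rightarrow s^0{\downarrow}; s^1{\uparrow}; t^1{\downarrow}], \ldots,\\
&\quad[\neg A^0 \wedge \neg A^1 \rightarrow A^a{\downarrow}],\ [\neg B^0 \wedge \neg B^1 \rightarrow B^a{\downarrow}], \ldots;\\
&\quad en{\uparrow}\,]
\end{aligned}
$$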
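  • The assignment statements may be compiled into production rules as before. Of special interest is the compilation of the sequence [t0→s1↓; s0↑; t0↓ [] t1→s0↓; s1↑; t1↓]. Due to correlations of the data, this compiles into the simple (bubble-reshuffled) production rules: [0104]

$$
\begin{array}{l}
\neg en \wedge \neg\overline{t^0} \rightarrow s^0{\uparrow}\\
\neg en \wedge \neg\overline{t^1} \rightarrow s^1{\uparrow}\\
s^0 \wedge \overline{t^1} \rightarrow s^1{\downarrow}\\
s^1 \wedge \overline{t^0} \rightarrow s^0{\downarrow}\\
\neg en \wedge \neg s^1 \rightarrow \overline{t^0}{\uparrow}\\
\neg en \wedge \neg s^0 \rightarrow \overline{t^1}{\uparrow}
\end{array}
$$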
  • The s0 and s1 should also be reset to the correct initial value. The completion of this sequence is just the normal check for $\overline{t^0} \wedge \overline{t^1}$. If the state variable doesn't change, this sequence takes only 1 transition, since the first 4 rules are vacuous. If the state changes it takes 3 transitions. This is 2 transitions longer than the reset of a normal output channel, so this should be considered when optimizing the low-level production rule decomposition. This type of structure only works well if s and t are dual-rail, although several dual-rail state variables can be used in parallel to encode more states. [0105]
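  • The transition counts quoted above can be reproduced with a small production-rule simulation. The following Python sketch is an illustration under stated assumptions. It encodes one reading of the six rules, in which nt0 and nt1 stand for the inverted rails of t, every pull-up is guarded by negated literals, and every pull-down by positive literals. The names `RULES` and `run` are hypothetical. Rules are fired one at a time until no non-vacuous rule is enabled.

```python
# Assumed reading of the six rules, as (guard, target, new value):
#   ~en & ~nt0 -> s0+     ~en & ~nt1 -> s1+
#   s0 & nt1   -> s1-     s1 & nt0   -> s0-
#   ~en & ~s1  -> nt0+    ~en & ~s0  -> nt1+
RULES = [
    (lambda v: not v["en"] and not v["nt0"], "s0", 1),
    (lambda v: not v["en"] and not v["nt1"], "s1", 1),
    (lambda v: v["s0"] and v["nt1"], "s1", 0),
    (lambda v: v["s1"] and v["nt0"], "s0", 0),
    (lambda v: not v["en"] and not v["s1"], "nt0", 1),
    (lambda v: not v["en"] and not v["s0"], "nt1", 1),
]


def run(state):
    """Fire enabled, non-vacuous rules until none remain; return the firing order."""
    fired = []
    progress = True
    while progress:
        progress = False
        for guard, target, value in RULES:
            if guard(state) and state[target] != value:
                state[target] = value
                fired.append(f"{target}{'+' if value else '-'}")
                progress = True
                break
    return fired
```

  • Starting from s=1 with t0 asserted (nt0 low), the model fires three transitions and ends with s=0 and t neutral. Starting from s=0 it fires only the reset of nt0.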
  • In addition, extensions to these cells which allow for conditionally receiving inputs or conditionally sending outputs were explained. Finally, various approaches to storing internal state in the cells are disclosed. [0106]
  • The prior state of the art was to use un-pipelined weak condition logic. Extra buffers or registers would be added between blocks of logic to add some pipelining. This approach was smaller, but much slower. The extra buffers also increased the forward latency. Essentially, in the limit of using more and more buffers, they should eventually be merged into the logic and all cells should be “maximally” pipelined. That is, any discrete stage of logic gets its own pipelining, so that no more slack could be added without just adding excess buffers. In practice, the cost of such fine pipelining amounts to a 50% to 100% increase in area over a completely un-pipelined circuit. It reduces the latency (since no separate buffers are added), and, of course, increases the throughput. At this natural limit of pipelining, all handshakes between neighboring cells require a small number of transitions per cycle, typically 14 to 18. The internal cycles usually keep up. This yields a very high peak throughput (comparable to 14-transition-per-cycle hyper-pipelined synchronous designs like the DEC Alpha) but is more easily composable. However, composing fast pipelined cells in various patterns can yield much lower system throughputs unless special care is taken to match the latencies as well as the throughputs of the units. [0107]
  • Several simple modifications to these pipelined circuit templates are also useful and novel. [0108]
  • 1. Go Signal [0109]
  • In the PCHB, it is possible to separate out the “en & re” expressions for the logic pulldown and “˜en & ˜re” for the logic pullup into a 2-input c-element of “en” and “re” which generates a single “go” signal used to precharge and enable the logic. This improves the forward latency and analog safety of the logic, although it adds 4 transitions to the handshake on the output channel. [0110]
  • With more care, this “go” signal may be added to a PCFB as well. In this case, the “go” signal must also be checked before producing the left enables, or instabilities will result. This has the side effect of reducing the slack to one half, but this is irrelevant when the goal is high speed. When a “go” signal is used with conditional outputs, the “go-” must not wait for the right enable (re) to go down since it won't (as no data was sent on the last cycle). Instead of a c-element this gives the PRS: “en & re & ˜no_r->go+” and “˜en & (no_r|˜re)->go-”. [0111]
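  • The two forms of the “go” signal can be compared as next-state functions. The following Python sketch is an illustration only. The function names `c_element` and `next_go` are hypothetical, and signals are modeled as 0/1 integers. The first function is a plain 2-input c-element of “en” and “re”, and the second follows the conditional-output rules “en & re & ~no_r -> go+” and “~en & (no_r | ~re) -> go-”.

```python
def c_element(prev, a, b):
    """2-input c-element: follow the inputs when they agree, else hold prev."""
    return a if a == b else prev


def next_go(go, en, re, no_r):
    """Next value of the state-holding go node with conditional outputs.

    Pull-up:   en & re & ~no_r     -> go+
    Pull-down: ~en & (no_r | ~re)  -> go-
    In all other cases the node holds its previous value.
    """
    if en and re and not no_r:
        return 1
    if not en and (no_r or not re):
        return 0
    return go
```

  • When no data was sent on the cycle (no_r high), go falls as soon as en is low, without waiting for the right enable, which would never go down in that case.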
  • 2. Shared Input/Output Completion [0112]
  • In most of these examples, the output completion is taken before the inverters, since this allows the use of a NAND gate instead of a NOR gate and gets the completion done a transition earlier. However, it is possible to complete from after the inverters as well. This is particularly useful when the output completion circuit of one cell can be shared with the input completion of the next cell in the pipeline. [0113]
  • 3. Timing Assumptions [0114]
  • Although this patent primarily presents asynchronous circuits in a quasi-delay-insensitive framework, it may prove desirable to introduce timing assumptions in order to simplify or speed up the circuit. Several useful non-QDI circuits can be derived simply by omitting transistors from a QDI WCHB, PCHB, or PCFB circuit. It is preferred if the introduced timing assumptions can be met entirely by estimating the delays within the cell, without making assumptions on the delays of its environment. Several simple modifications can satisfy this property. [0115]
  • For example, in a PCFB with a single “go” signal, it can be assumed that the output will precharge quickly after the “go” goes low. The fact that “go” is low can be taken to imply that the output data is precharged, or soon will be. This “implied neutrality” timing assumption can eliminate many transistors of completion detection, and allow the next cycle to begin earlier. In a similar fashion, the input validity can sometimes be ignored if the output validity implies that all input channels are valid. [0116]
  • Of the various types of state-holding cells, the more restricted versions generally have simpler and faster implementations, and should therefore be used if possible. For the most general case, either a pair of state variables should be used, or if area is not an issue, a feedback loop of buffers. [0117]
  • Three main types of handshaking reshuffling have proved superior for different circumstances. The weak-condition half-buffer variety works well for buffers and copies without logic. The precharge-logic half-buffer is the simplest good way to implement most logic cells. The precharge-logic full-buffer has an advantage in speed and is good at decoupling the handshakes of neighboring units. It should be used when necessary to improve the throughput. [0118]
  • Although only a few embodiments have been described in detail above, other embodiments are contemplated by the inventor and are intended to be encompassed within the following claims. In addition, other modifications are contemplated and are also intended to be covered. [0119]

Claims (14)

What is claimed is:
1. An asynchronous circuit, comprising:
a first process;
a second process, communicating with said first process;
wherein said first and second processes communicate using precharge logic that receives inputs in a first gate, and tests neutrality of said inputs in a second gate separate from said first gate.
2. A circuit as in claim 1 where each of said first and second processes communicate via request and acknowledges.
3. A circuit as in claim 1 further comprising determining a specified request, setting a state variable to represent said specified request, and resetting said specified request before acknowledging or acting on it.
4. A circuit as in claim 1 wherein said first and second processes communicate according to
PCFB ≡ *[[¬Ra∧L]; R↑; La↑; en↓; ([Ra]; R↓), ([¬L]; La↓); en↑].
5. A circuit as in claim 1 wherein said first and second processes communicate according to
PCHB ≡ *[[¬Ra∧L]; R↑; La↑; [Ra]; R↓; [¬L]; La↓].
6. A circuit according to claim 1 wherein said precharge logic includes a first portion which computes validity of inputs and a second portion which computes validity of outputs.
7. A method of communicating between first and second processes without a synchronizing global clock, comprising:
receiving requests in a first gate; and
acknowledging said requests prior to completion of action thereon, and
determining neutrality of said inputs in a second gate that is separate from said first gate that receives the input.
8. A method as in claim 7 wherein said device set test validity of the input using at least one transistor which is unconnected to the transistor of the device that receives the inputs.
9. A method of operating an asynchronous process, comprising:
receiving a request for some action to occur;
reshuffling responses that usually occur relative to said request, said reshuffling responses including using a precharge logic.
10. A cell, comprising:
a buffering element;
a logic element, connected to said buffering element, said logic element having a dual rail precharge domino logic block which computes an output based on an input;
a completion tree for an input channel and a completion tree for an output channel; and
a control circuit which combines the completion trees to generate an input acknowledge and to precharge the logic element.
11. A cell as in claim 10, wherein said input acknowledge does not wait for neutrality of output data, and also producing an enable which does wait for neutrality of output data.
12. A cell as in claim 10, wherein said outputs are conditionally produced by indicating a condition with an extra wire.
13. A cell as in claim 10, wherein said inputs are conditional inputs.
14. A cell as in claim 10, wherein said input is only conditionally acknowledged, and said logic determines which inputs to acknowledge.
US10/294,044 1998-07-22 2001-07-18 Reshuffled communications processes in pipelined asynchronous circuits Abandoned US20040030858A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US10/294,044 US20040030858A1 (en) 1998-07-22 2001-07-18 Reshuffled communications processes in pipelined asynchronous circuits
US11/433,203 US7934031B2 (en) 1998-07-22 2006-05-11 Reshuffled communications processes in pipelined asynchronous circuits

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US9384098P 1998-07-22 1998-07-22
US36046899A 1999-07-22 1999-07-22
US50163800A 2000-02-10 2000-02-10
US10/294,044 US20040030858A1 (en) 1998-07-22 2001-07-18 Reshuffled communications processes in pipelined asynchronous circuits

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US50163800A Continuation 1998-07-22 2000-02-10

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US11/433,203 Continuation US7934031B2 (en) 1998-07-22 2006-05-11 Reshuffled communications processes in pipelined asynchronous circuits

Publications (1)

Publication Number Publication Date
US20040030858A1 true US20040030858A1 (en) 2004-02-12

Family

ID=22241108

Family Applications (2)

Application Number Title Priority Date Filing Date
US10/294,044 Abandoned US20040030858A1 (en) 1998-07-22 2001-07-18 Reshuffled communications processes in pipelined asynchronous circuits
US11/433,203 Expired - Fee Related US7934031B2 (en) 1998-07-22 2006-05-11 Reshuffled communications processes in pipelined asynchronous circuits

Family Applications After (1)

Application Number Title Priority Date Filing Date
US11/433,203 Expired - Fee Related US7934031B2 (en) 1998-07-22 2006-05-11 Reshuffled communications processes in pipelined asynchronous circuits

Country Status (5)

Country Link
US (2) US20040030858A1 (en)
EP (1) EP1121631B1 (en)
AU (1) AU5123799A (en)
DE (1) DE69935924T2 (en)
WO (1) WO2000005644A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006117391A1 (en) * 2005-05-04 2006-11-09 Etat Francais, représenté par le Secretariat General de la Defense Nationale Device forming a logic gate for minimizing the differences in electrical or electromagnetic behavior in an integrated circuit manipulating a secret
US20080294879A1 (en) * 2005-09-05 2008-11-27 Nxp B.V. Asynchronous Ripple Pipeline
US20110029941A1 (en) * 2008-06-18 2011-02-03 University Of Southern California Multi-level domino, bundled data, and mixed templates

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8817029B2 (en) * 2005-10-26 2014-08-26 Via Technologies, Inc. GPU pipeline synchronization and control system and method
US9965342B2 (en) * 2010-03-16 2018-05-08 Arm Limited Synchronization in data processing layers
US8791717B2 (en) * 2010-07-15 2014-07-29 Nanyang Technological University Asynchronous-logic circuit for full dynamic voltage control
US8994406B2 (en) * 2011-12-19 2015-03-31 Nanyang Technological University Digital cell
US8854075B2 (en) * 2012-03-06 2014-10-07 Tiempo Delay-insensitive asynchronous circuit
US9520180B1 (en) 2014-03-11 2016-12-13 Hypres, Inc. System and method for cryogenic hybrid technology computing and memory
US9576094B2 (en) 2014-08-20 2017-02-21 Taiwan Semiconductor Manufacturing Company, Ltd. Logic circuit and system and computer program product for logic synthesis
WO2018122658A1 (en) * 2016-12-27 2018-07-05 Semiconductor Energy Laboratory Co., Ltd. Semiconductor device
WO2019241979A1 (en) * 2018-06-22 2019-12-26 Huawei Technologies Co., Ltd. Method of deadlock detection and synchronization-aware optimizations on asynchronous processor architectures

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5434520A (en) * 1991-04-12 1995-07-18 Hewlett-Packard Company Clocking systems and methods for pipelined self-timed dynamic logic circuits
US5752070A (en) * 1990-03-19 1998-05-12 California Institute Of Technology Asynchronous processors
US6038656A (en) * 1997-09-12 2000-03-14 California Institute Of Technology Pipelined completion for asynchronous communication
US6152613A (en) * 1994-07-08 2000-11-28 California Institute Of Technology Circuit implementations for asynchronous processors
US6301655B1 (en) * 1997-09-15 2001-10-09 California Institute Of Technology Exception processing in asynchronous processor
US6381692B1 (en) * 1997-07-16 2002-04-30 California Institute Of Technology Pipelined asynchronous processing
US6502180B1 (en) * 1997-09-12 2002-12-31 California Institute Of Technology Asynchronous circuits with pipelined completion process

Family Cites Families (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3290511A (en) * 1960-08-19 1966-12-06 Sperry Rand Corp High speed asynchronous computer
US4680701A (en) * 1984-04-11 1987-07-14 Texas Instruments Incorporated Asynchronous high speed processor having high speed memories with domino circuits contained therein
US4710650A (en) * 1986-08-26 1987-12-01 American Telephone And Telegraph Company, At&T Bell Laboratories Dual domino CMOS logic circuit, including complementary vectorization and integration
GB8711991D0 (en) * 1987-05-21 1987-06-24 British Aerospace Asynchronous communication systems
US4912348A (en) 1988-12-09 1990-03-27 Idaho Research Foundation Method for designing pass transistor asynchronous sequential circuits
US5121003A (en) * 1990-10-10 1992-06-09 Hal Computer Systems, Inc. Zero overhead self-timed iterative logic
TW226057B (en) 1991-12-23 1994-07-01 Philips Nv
DE4214981A1 (en) * 1992-05-06 1993-11-11 Siemens Ag Asynchronous logic circuit for 2-phase operation
US5544342A (en) * 1993-06-30 1996-08-06 International Business Machines Corporation System and method for prefetching information in a processing system
EP0650117B1 (en) 1993-10-21 2002-04-10 Sun Microsystems, Inc. Counterflow pipeline
US5440182A (en) 1993-10-22 1995-08-08 The Board Of Trustees Of The Leland Stanford Junior University Dynamic logic interconnect speed-up circuit
US5642501A (en) * 1994-07-26 1997-06-24 Novell, Inc. Computer method and apparatus for asynchronous ordered operations
US5732233A (en) * 1995-01-23 1998-03-24 International Business Machines Corporation High speed pipeline method and apparatus
DE69621763T2 (en) 1995-08-23 2003-02-06 Koninkl Philips Electronics Nv DATA PROCESSING SYSTEM WITH AN ASYNCHRONOUS PIPELINE
GB2310738B (en) * 1996-02-29 2000-02-16 Advanced Risc Mach Ltd Dynamic logic pipeline control
US5889979A (en) * 1996-05-24 1999-03-30 Hewlett-Packard, Co. Transparent data-triggered pipeline latch
US5737614A (en) * 1996-06-27 1998-04-07 International Business Machines Corporation Dynamic control of power consumption in self-timed circuits
US5920899A (en) 1997-09-02 1999-07-06 Acorn Networks, Inc. Asynchronous pipeline whose stages generate output request before latching data
US6055620A (en) * 1997-09-18 2000-04-25 Lg Semicon Co., Ltd. Apparatus and method for system control using a self-timed asynchronous control structure
US6044453A (en) * 1997-09-18 2000-03-28 Lg Semicon Co., Ltd. User programmable circuit and method for data processing apparatus using a self-timed asynchronous control structure
US5949259A (en) 1997-11-19 1999-09-07 Atmel Corporation Zero-delay slew-rate controlled output buffer
US5973512A (en) 1997-12-02 1999-10-26 National Semiconductor Corporation CMOS output buffer having load independent slewing
US6049882A (en) * 1997-12-23 2000-04-11 Lg Semicon Co., Ltd. Apparatus and method for reducing power consumption in a self-timed system

Also Published As

Publication number Publication date
WO2000005644A1 (en) 2000-02-03
EP1121631A4 (en) 2001-09-19
EP1121631A1 (en) 2001-08-08
EP1121631B1 (en) 2007-04-25
DE69935924D1 (en) 2007-06-06
AU5123799A (en) 2000-02-14
DE69935924T2 (en) 2008-01-10
US7934031B2 (en) 2011-04-26
US20060212628A1 (en) 2006-09-21

Similar Documents

Publication Publication Date Title
US7934031B2 (en) Reshuffled communications processes in pipelined asynchronous circuits
Lines Pipelined asynchronous circuits
Ozdag et al. High-speed QDI asynchronous pipelines
US7053665B2 (en) Circuits and methods for high-capacity asynchronous pipeline processing
US6850092B2 (en) Low latency FIFO circuits for mixed asynchronous and synchronous systems
Ferretti et al. Single-track asynchronous pipeline templates using 1-of-N encoding
KR100231605B1 (en) Apparatus of reduced power consumption for semiconductor memory device
JP4146519B2 (en) How to establish self-synchronization of each configurable element in a programmable component
JP2947356B2 (en) Parallel processing system and parallel processor synchronization method
US5386585A (en) Self-timed data pipeline apparatus using asynchronous stages having toggle flip-flops
US6956406B2 (en) Static storage element for dynamic logic
US8495543B2 (en) Multi-level domino, bundled data, and mixed templates
US20030146074A1 (en) Asynchronous crossbar with deterministic or arbitrated control
US6502180B1 (en) Asynchronous circuits with pipelined completion process
Branover et al. Asynchronous design by conversion: Converting synchronous circuits into asynchronous ones
US6954909B2 (en) Method for synthesizing domino logic circuits
Coates et al. Automatic synthesis of fast compact self-timed control circuits
Farnsworth et al. A hybrid asynchronous system design environment
US7047392B2 (en) Data processing apparatus and method for controlling staged multi-pipeline processing
Plana et al. Concurrency-oriented optimization for low-power asynchronous systems
US6339835B1 (en) Pseudo-anding in dynamic logic circuits
US7053664B2 (en) Null value propagation for FAST14 logic
US20030042935A1 (en) Static transmission of FAST14 logic 1-of-N signals
Mathew et al. A data-driven micropipeline structure using DSDCVSL
Fang Width-adaptive and non-uniform access asynchronous register files

Legal Events

Date Code Title Description
AS Assignment

Owner name: CALIFORNIA INSTITUTE OF TECHNOLOGY, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LINES, ANDREW M.;MARTIN, ALAIN J.;CUMMINGS, URI;REEL/FRAME:014744/0892;SIGNING DATES FROM 20030130 TO 20040414

AS Assignment

Owner name: ADD INC., CALIFORNIA

Free format text: LICENSE;ASSIGNOR:CALIFORNIA INSTITUTE OF TECHNOLOGY;REEL/FRAME:016216/0334

Effective date: 20010801

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION