WO1990005950A1 - Data flow multiprocessor system - Google Patents

Data flow multiprocessor system

Publication number
WO1990005950A1
Authority
WO
WIPO (PCT)
Prior art keywords
instruction
token
value
processing system
operands
Prior art date
Application number
PCT/US1989/005105
Other languages
French (fr)
Inventor
Gregory M. Papadopoulos
David E. Culler
Arvind
Original Assignee
Massachusetts Institute Of Technology
Priority date
Filing date
Publication date
Application filed by Massachusetts Institute Of Technology
Priority to EP89912813A (patent EP0444088B1)
Priority to DE68925646T (patent DE68925646T2)
Publication of WO1990005950A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/82 Architectures of general purpose stored program computers data or demand driven
    • G06F15/825 Dataflow computers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3867 Concurrent instruction execution, e.g. pipeline, look ahead using instruction pipelines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/448 Execution paradigms, e.g. implementations of programming paradigms
    • G06F9/4494 Execution paradigms, e.g. implementations of programming paradigms data driven

Definitions

  • SIMD Single-instruction-multiple-data stream
  • MIMD multiple-instruction-multiple-data-stream
  • MIMD systems can be further classified by the level at which they extract parallelism.
  • Coarse grained systems are those in which relatively large units of computation are performed in parallel. Although the units of computation are performed in parallel, execution still proceeds sequentially within each unit. Fine grained systems, in contrast, perform small units of computation in parallel. Since the units of computation that may be performed in parallel in fine-grained systems are much smaller than in coarse-grained systems, fewer operations are performed sequentially and the
  • fine-grained systems is effectively controlling the extraordinary amount of parallel activity of the systems, particularly since the increase in
  • MIMD machines have produced only mildly encouraging results. Generally, such machines have used software as the primary tool for instilling parallelism. In programming such
  • Data flow machines are a special variety of fine-grained machines that attempt to execute models of data flow through a data processing system. These models are known as data flow diagrams. Data flow diagrams are comprised of nodes and edges. The nodes represent instructions, and the edges represent data dependencies. To be more precise, the nodes represent operators and the edges represent operands. Data flow machines operate by processing the
  • the data flow diagrams impose only a partial order of execution.
  • the instructions are executed in data flow machines whenever the operands they require are available. Data flow machines are, thus, not constrained by the rigid total order of execution found in sequential machines. This flexibility allows data flow machines to schedule execution of instructions asynchronously. In contrast, sequential machines based on the von Neumann model execute instructions only when the instruction pointer is pointing to the instructions. The primary benefit of asynchronous scheduling is the greater exposure of latent parallelism.
  • Dynamic data flow machines are those machines which can dynamically allocate an instance of a function. In other words, the memory required for an instance of a function need not be preplanned, rather it can be allocated when the function is executed.
  • Static data flow machines in contrast, can only statically allocate an instance of a function. As such, they must preplan the storage required for each instance of the function prior to execution.
  • the tagged token data flow machine allows simultaneous applications of a function by tagging each operand with a context identifier that specifies the
  • the combination of the tag and the operand constitutes a token.
  • the tags of two tokens must match if they are destined to the same instruction.
  • the tagged token architecture must have a means for correctly matching the tags.
  • An associative memory has been relied upon as the matching mechanism.
  • the embodiment is comprised of at least one processing element and a plurality of memory locations.
  • the processing element includes a processing pipeline of four stages. The first stage fetches instruction from memory to operate on a token. The second stage of the pipeline performs operand matching on the token as indicated by the fetched instruction. The third stage continues processing on the token by performing an ALU operation specified by the fetched instruction. Lastly, the fourth stage of the
  • pipeline forms a new token or tokens from the results of the ALU operation and the fetched instruction.
  • This data processing system is preferably a multiple processor data flow processing system.
  • the tokens referred to above are data objects that are comprised of two fields.
  • the first field is a tag that includes a pointer to the beginning of an activation frame and a pointer to an instruction.
  • Activation frames are contiguous blocks of memory locations that are used to store information
  • the instruction pointer points to the
  • the second field of the token is simply a data value.
  • this data value typically represents an operand value.
  • Each processing element includes not only a processing pipeline but also a token buffer. This buffer stores the tokens while they are waiting to be processed within the processing pipeline. Each processing element has its own token buffer.
  • each of the processing elements preferably operates in parallel with its counterparts.
  • the token buffer is preferably comprised of a plurality of stacks that are prioritized such that tokens leave a higher priority stack for the pipeline before tokens leave a lower priority stack. Since the buffer is organized as stacks, it operates on a last in, first out (LIFO) basis. It should be noted, however, that the buffer need not operate in this manner; rather, it may operate on a first in, first out (FIFO) basis. The LIFO approach, however, provides a more efficient implementation.
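The prioritized-stack buffer described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the class name `TokenBuffer`, the `levels` parameter, and the convention that index 0 is the highest priority are all assumptions made for the example.

```python
class TokenBuffer:
    """Token buffer built from prioritized LIFO stacks (hypothetical sketch)."""

    def __init__(self, levels=2):
        # One stack per priority level; index 0 is the highest priority (assumed).
        self._stacks = [[] for _ in range(levels)]

    def push(self, token, priority=0):
        self._stacks[priority].append(token)

    def pop(self):
        """Return the next token for the pipeline, or None if the buffer is empty."""
        for stack in self._stacks:          # scan from highest to lowest priority
            if stack:
                return stack.pop()          # LIFO within a priority level
        return None
```

Tokens pushed at a lower priority are only released once every higher priority stack is empty, and within one level the most recently pushed token leaves first.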
  • the activation frames serve as a place for matching of operands for operands that are destined to the same data flow node.
  • each memory location within an activation frame includes a state field for indicating a current state of the memory location and a value field for holding a value.
  • the state field may indicate a number of different types of information. For instance, it may indicate the data type of the value in a value field. Similarly, it may indicate whether a value is stored in the value field. It is this second type of indication that is utilized to
  • the matching function is an important part of execution of arithmetic/logical instructions within the data processing system.
  • the course of execution of an arithmetic/logical instruction begins with fetching of the instruction. Once the instruction is fetched, a memory location indicated by the
  • the instruction is accessed. Typically, this memory location is within an activation frame. Once the memory location is accessed, the state field is examined and the value field is operated on as determined by the arithmetic/logical instruction and the current value in the state field. Once the state field and value field have been operated on, the arithmetic/logical instruction is either performed or not performed, depending upon the current state of the memory location and the instruction.
  • the operations that are performed on the state field and value field are determined to a great extent by the time at which the operand encoded by the token arrives at the memory location.
  • a first operand of an instruction is stored in the memory location when the memory location is empty.
  • the state field is changed to reflect that an operand of the instruction is stored in the memory location.
  • a second operand of the instruction is received.
  • the memory location is checked to see if the first operand is stored therein. Since the first operand is already stored in the memory location, it is read out of the memory location and sent with the second available operand to be further processed.
  • the state of the memory location be changed to indicate that the memory location is empty.
  • the above approach is applicable with dyadic operations; however, when monadic operations are specified by the instruction, the operand value of the current token is forwarded to the processing element without the necessity of examining the contents of the memory location.
  • each token contains an instruction pointer within its tag field.
  • this pointer is
  • Each instruction contains a matching rule for matching operands. It also contains a rule for computing an effective address of a storage location on which the matching rule operates. Additionally, the
  • the instruction contains an ALU operation to be performed by the ALU of the data processing system and lastly, a token forming rule for forming new tokens that result from execution of the instruction.
  • the dyadic and monadic matching rules have been discussed above, but the sticking matching rule has not yet been discussed.
  • the sticking matching rule tells the system to write a value of a token into the value field of a location if the state field of the
  • the rule for computing an effective address of a storage location as specified by the instruction selects one of three possible addressing approaches.
  • the memory location is specified at an absolute address indicated by an offset that is contained within the instruction.
  • the memory location is an address located at the address specified by the activation frame pointer of the tag offset by the offset contained within the instruction.
  • the third option employs yet another approach. In this final approach the memory location is at the address pointed to by the pointer to an instruction offset by the offset contained within the instruction.
  • the ALU operation indicated by the instruction tells the system what operation is sought to be performed by the ALU on the operands. For instance, if the ALU operation was specified as an addition operation when the ALU operation stage of the pipeline acted on the given token, it would add the two operands passed to it.
  • the final rule specified by the instruction is the token forming rule. This rule tells the system how to form new output tokens from the operations that have been performed. This rule is employed in the tag forming portion of the ALU operation stage as well as in the token forming stage.
  • the processing pipeline preferably contains a capability to handle exceptions.
  • One means that may be employed is to have each activity associated with a token recorded as the token enters the pipeline. When an exception occurs, the value in the register remains unchanged until the exception is resolved. The resolution of the exception is performed by issuing an exception handling token. This exception handling token preferably may not be interrupted.
  • each processing element is preferably assigned a given region of storage in which activation frames and code are stored for that processing element.
  • a token may indicate the processing element for which it is destined by encoding within the tag a processing element
  • Figure 1 shows a sample data flow diagram
  • FIG. 2 shows the major components of the data flow machine.
  • Figure 3 shows the major components of a
  • Figure 4 shows the relationship between tokens, instructions, and activation frames.
  • Figure 5 shows the state transition diagram for the arithmetic matching rule.
  • Figure 6 shows the state transition diagram for the sticky matching rule.
  • Figure 7 shows the effect of the three basic token forming rules.
  • Figure 8 shows the major components of the token forming stage of the pipeline.
  • FIG. 9 shows the partitioning scheme of memory amongst processing elements.
  • Figure 10 illustrates the interleaving
  • Figure 11 shows the fields of a token.
  • Figure 12 shows the fields of a tag.
  • Figure 13 shows the sub-fields of the map field and shows how interleaving strategies are encoded.
  • Figure 14 shows the fields of an instruction.
  • Figure 15 shows the decoding strategy.
  • A sample data flow diagram is shown in Figure 1. Specifically, Figure 1
  • Node 10 represents a
  • node 10 has two input edges 14 and 16 that represent A and B respectively.
  • the output edge 18 from node 10 has the value AxB.
  • node 12 which also represents a
  • Output edge 24 has the value CxD.
  • the two output edges 18 and 24 from these nodes 10 and 12 then enter an addition node 26.
  • the resulting output edge 28 represents (AxB)+(CxD).
  • each of the operations represented by the nodes would be performed in sequential order.
  • the machine would first multiply A times B, then it would multiply C times D and lastly, it would add the product A times B and C times D. There is, however, no reason to impose such an order if the operands are available.
  • the operations AxB and CxD are performed simultaneously. The resulting products are subsequently summed.
  • the data flow machine substitutes the arbitrary sequential order imposed on such operations with an order imposed by the operations themselves.
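As an illustration of the data-driven order described for Figure 1, the sketch below fires every node whose operands are available and records which nodes fire together. The graph encoding, the edge names `e18`, `e24` and `e28` (chosen to echo the figure's edge numbers), and the function `run` are assumptions made for this example only.

```python
import operator

# Hypothetical encoding of Figure 1: node id -> (operator, input edges, output edge).
GRAPH = {
    10: (operator.mul, ("A", "B"), "e18"),
    12: (operator.mul, ("C", "D"), "e24"),
    26: (operator.add, ("e18", "e24"), "e28"),
}

def run(graph, inputs):
    """Fire nodes whenever all of their operands are available."""
    values = dict(inputs)      # edge name -> value currently on that edge
    pending = dict(graph)      # nodes that have not fired yet
    order = []                 # one sorted list of node ids per firing step
    while pending:
        ready = [n for n, (_, ins, _) in pending.items()
                 if all(e in values for e in ins)]
        if not ready:
            raise ValueError("some operands never arrive")
        results = {}
        for n in ready:        # these firings are independent of one another
            op, ins, out = pending.pop(n)
            results[out] = op(*(values[e] for e in ins))
        values.update(results)
        order.append(sorted(ready))
    return values, order

values, order = run(GRAPH, {"A": 2, "B": 3, "C": 4, "D": 5})
```

With these inputs the two multiplication nodes become ready in the same step and the addition node fires only after both products exist, which is the partial order the diagram imposes.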
  • the present invention includes a plurality of processing elements 3, each associated with an assigned region of memory 4, and global memory units 2.
  • An interconnection network 1 comprised of logic circuitry is provided to
  • processing elements 3 may access any of the global memory units 2 and act in parallel with the other processing elements 3.
  • FIG. 3 shows a typical processing element 3 in more detail.
  • each processing element 3 includes a processing pipeline 36. Also included is a token queue 34 for storing tokens waiting to be processed.
  • An assigned local portion 4 of global memory is allocated for storing activation frames 45 and code 32 for processing use.
  • an exception handler 43 and a set of registers 41 are provided within the processing element 3.
  • the memory 4 is distinguishable from the memory of traditional machines built under the von Neumann model. Instead of merely consisting of a value, each of the memory locations of the present invention comprises two fields: a presence field and a value field. This memory organization in the present invention is significant, for it alters the common view of memory being merely a location for storing values.
  • the dynamically assigned portion of memory 4 in the present invention has both a presence state and a value.
  • the presence state may affect the value's significance, as well as alter the execution of instructions on the value. Further, the presence field may be manipulated independently from the value field. This design of memory makes it a more powerful tool, as will be more apparent from the discussion that follows.
  • the memory 4 is used to store activation frames 45, heap storage 6 and code sections 32.
  • an implementation may have separate memories for code 32 and activation frames 45.
  • Activation frames 45 can be viewed as all the
  • tokens 30 are akin to embellished operands that identify not only the operand values but also identify the
  • a token 30 can be viewed as a tuple comprising a tag (c.s_p) and a value (v).
  • the tag identifies the instruction and activity associated with the operand that the token represents.
  • the tag is comprised of three fields: a context pointer (c), a statement number (s), and a port indicator (p).
  • the context pointer (c) points to the beginning of an activation frame 45 in which the operand
  • the statement number (s) points to a specific memory location where the instruction of the operand is stored, and the port (p) indicates whether the token enters a node representing the instruction on the left input edge or the right input edge.
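The token layout described above can be written down as a small data structure. This is a sketch under stated assumptions: the names `Port`, `Tag`, `Token` and `same_node` are hypothetical, and integers stand in for the context pointer and statement number.

```python
from dataclasses import dataclass
from enum import Enum

class Port(Enum):
    LEFT = "l"
    RIGHT = "r"

@dataclass(frozen=True)
class Tag:
    c: int     # context pointer: start of the activation frame
    s: int     # statement number: location of the instruction
    p: Port    # port: left or right input edge of the node

@dataclass(frozen=True)
class Token:
    tag: Tag
    value: object   # the operand value (v)

def same_node(a: Token, b: Token) -> bool:
    """True when two tokens have like tags that differ only as to port."""
    return (a.tag.c, a.tag.s) == (b.tag.c, b.tag.s) and a.tag.p != b.tag.p
```

Two tokens that share a context and statement number but arrive on opposite ports are the two operands of one instruction instance.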
  • the processing element 3 of Figure 3 is perhaps most easily viewed as a token processing system. It processes tokens 30 to bring about execution of instructions 32. In terms of the data flow model, this system processes operands to execute data flow diagrams, and it continues processing operands until told to stop. In more accurate terms, the tokens trigger the production of activities that are
  • tokens 30 leave the token queue 34 whereupon they enter the processing pipeline 36.
  • Multiple tokens are typically present in different stages of the pipeline at any given time. Each such token is processed in parallel with the other tokens in the pipeline.
  • the pipeline represents a second level of parallelism within the present invention that is distinct from the parallelism attributable to the simultaneous computations of multiple processors.
  • More than one token 30 may enter the pipeline at a time, but generally only one token enters the pipeline at a time.
  • the number of tokens 30 that may simultaneously enter the pipeline 36 is dictated by the multiprocessing capability of the particular design employed; that is, each processing element may operate on multiple tokens in parallel.
  • the first stage of the processing pipeline 36 is the
  • the system retrieves the instruction at that location (see the memory location pointed to by arrow 31 in Figure 4).
  • An instruction of the present invention can be viewed as a tuple.
  • an instruction equals (E.r, W, A, T.d) where E.r specifies a method of determining the effective address for the storage location on which the matching rule will operate; W specifies the matching rule; A specifies the ALU operation; and T.d specifies the token forming rule.
  • instructions are limited to at most two operands.
  • the token 30 and information from the instruction 32 are passed on to the operand matching stage 40 of the pipeline 36.
  • the system looks to match operands destined for the same node (i.e. those having like tags differing only as to port).
  • This stage is the means for the system to check if all the operands necessary for execution of an instruction 32 are available or not. If they are available, the
  • the ALU operation stage 42 which is comprised of a tag former 42A and an ALU 42B.
  • the tag former 42A forms the tag portion of output tokens
  • the ALU forms the value portion of output tokens.
  • the operation specified by the instruction is performed by an ALU 42B within the stage.
  • new tags for the results of the ALU operation are formed by the tag former 42A.
  • the tag former 42A and ALU 42B may be further pipelined into substages so as to balance overall performance.
  • the tags are formed in
  • The output from this stage 42 enters the token forming stage 44 where the output tokens are formed.
  • the resulting output tokens that typically carry the result of the
  • operation then may travel to a number of different locations. First, they may travel within the
  • processing element 3 to the token queue 34 or to the pipeline 36. Second, they may travel to other processing elements 3, and third, they may travel to a memory unit 2. Where they travel is dictated by the tag portion of each output token, for the tag specifies a particular processing element 3 or memory unit 2 as will be discussed below.
  • the output token's tag specifies which memory unit 2 is to be accessed.
  • the output token 30 also specifies the operation to be performed on the memory location (i.e. a read operation or a write
  • the memory controller of the memory unit 2 uses the operation specification to perform the desired operation. If a read is requested, the memory unit 2 produces a token 30 comprising the data read out of the memory unit 2 and a tag specifying a given processing element. Once produced, this token 30 travels to the appropriate processing element where it joins the other tokens 30 being processed by the destination processing element.
  • Output tokens 30 may also travel, as mentioned above, to other processing elements 3.
  • Each processing element 3 can communicate with all other processing elements and with any memory unit 2. To communicate with such components, it need only generate an output token having a tag that specifies the processing element 3 or memory unit 2 to which it is destined. Moreover, such communications can be performed in parallel with ongoing computations, for the
  • processing elements 3 need not wait for a response to the communication.
  • an activation frame must first be allocated before it can be used.
  • an activation frame is allocated for each routine that is performed.
  • the activation frame is allocated. This allocation is performed by the operating system.
  • the operating system maintains a list of free activation frames such that when a request for a new activation frame is received, it simply pops an activation frame off the free list and assigns it to the call of the routine. Moreover, when the call to the routine is completed, the activation frame is added back on to the free list. In performing such allocations and deallocations, the operating system makes certain that the same activation frame is not assigned to multiple routine calls at the same time.
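The free-list allocation of activation frames can be sketched as below. The class `FrameAllocator` and its method names are hypothetical, and frames are represented only by their base addresses; the check in `release` is an added safeguard for the example, not something the text specifies.

```python
class FrameAllocator:
    """Free-list allocator for activation frames (hypothetical sketch)."""

    def __init__(self, frame_bases):
        self._free = list(frame_bases)   # base addresses of unused frames
        self._in_use = set()             # frames currently assigned to a call

    def allocate(self):
        """Pop a frame off the free list and assign it to a routine call."""
        if not self._free:
            raise MemoryError("no free activation frames")
        base = self._free.pop()
        self._in_use.add(base)
        return base

    def release(self, base):
        """Return a frame to the free list when the routine call completes."""
        if base not in self._in_use:     # guard added for this sketch
            raise ValueError("frame is not currently allocated")
        self._in_use.remove(base)
        self._free.append(base)
```

Because a frame leaves the free list when it is allocated and only returns on release, the same frame cannot be assigned to two routine calls at once.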
  • each activation frame is processed by a specific processing element, and each processing element may be allocated thousands of activation frames. Since the activation frames are allocated as required and requested with execution of prior code blocks, the processing elements are called upon dynamically with execution of the code.
  • Routines are code blocks and calls to such routines are made by prior code blocks. These code blocks are defined by the compiler which decides which nodes of a data flow graph constitute a code block and thus share a single activation frame. The compiler, thus, establishes the interprocessor granularity of the process. Smaller code blocks characterize a fine grain system having greater potential for parallelism with increased
  • code is not shared; rather each processing element 3 has a copy of the entire code in its assigned memory 4. All of the
  • a cache system may be used to load code as required.
  • the system can look to perform matching of operands.
  • the operand matching stage 40 looks to the E.r field contained within the instruction 32 that is to be executed.
  • the E.r field specifies one of three effective addressing modes.
  • One possible effective addressing mode is frame relative mode.
  • In frame relative mode, the address of the memory location is located within an activation frame 45 by adding an offset (r) to the context pointer (c) contained within the tag of the token 30. This scheme is illustrated in Figure 4. Note the arrow 35 in Figure 4 from c pointing to the beginning of the activation frame 45 and note the other arrow 33 pointing to the location specified by the context (c) plus the offset (r).
  • absolute mode specifies the address as the absolute address (r).
  • In code relative mode, the address is specified by the statement number from the tag plus the offset (r).
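The three effective addressing modes reduce to simple arithmetic on the offset (r), the context pointer (c) and the statement number (s). The sketch below assumes plain integer addresses; the `Mode` enum and the function name `effective_address` are hypothetical.

```python
from enum import Enum

class Mode(Enum):
    FRAME_RELATIVE = 1
    ABSOLUTE = 2
    CODE_RELATIVE = 3

def effective_address(mode, r, c, s):
    """Compute the address of the location on which the matching rule operates.

    r: offset from the instruction's E.r field
    c: context pointer from the token's tag
    s: statement number from the token's tag
    """
    if mode is Mode.FRAME_RELATIVE:
        return c + r      # location inside the activation frame
    if mode is Mode.ABSOLUTE:
        return r          # r is itself the address
    if mode is Mode.CODE_RELATIVE:
        return s + r      # location relative to the instruction
    raise ValueError("unknown addressing mode")
```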
  • Matching rules are basically a means of generating activities.
  • an activity can be thought of as a tuple (c.s, v_l, v_r, A, T.d) where c.s is the context pointer and statement number from the tag shared by the matching tokens; v_l is the value of the token on the left port; v_r is the value of the token on the right port; A is an ALU operation; and T.d is the token forming rule.
  • This rule is referred to as W_dyadic.
  • the system looks at the memory location within an activation frame specified by the effective address and checks the presence field. If a value is
  • the presence field equals "present", and if a value is not currently residing there, the presence field equals "empty".
  • If the location's presence field indicates that it is in the "empty" state 48 (Figure 5), the system 50 writes the value of the token into the value field of that memory location and changes the presence field to the "present" state 52. On the other hand, if the presence field is initially in the "present" state 52, the value field of the memory location is read, and an activity is issued 54.
  • This aspect of the scheme reflects the asynchronous nature of the scheduling of execution of the instructions.
  • a location in an activation frame need only be large enough to store a value and a state.
  • the incoming tag of the first token to arrive may be discarded as it can be
  • W_monadic
  • the left input port is used for input in monadic operations.
  • the value field of the token on the left input port is read and an activity is issued. The presence state is not affected. The activity takes on the context (c), statement number (s), and value (v_l) of the input token on the left input port.
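The dyadic and monadic matching rules can be sketched as operations on a memory location that holds a presence field and a value field. The `Location` class, the port labels "l" and "r", and the function names are assumptions for illustration. Resetting the location to "empty" after a match follows the summary given earlier in the text.

```python
from dataclasses import dataclass

EMPTY, PRESENT = "empty", "present"

@dataclass
class Location:
    presence: str = EMPTY
    value: object = None

def match_dyadic(loc, port, value):
    """Return (v_l, v_r) when both operands have arrived, otherwise None."""
    if loc.presence == EMPTY:
        # First operand to arrive: store it and wait for its partner.
        loc.value, loc.presence = value, PRESENT
        return None
    # Second operand: read out the stored one and mark the location empty.
    stored, loc.value, loc.presence = loc.value, None, EMPTY
    # The stored operand came in on the port opposite to the current token's.
    return (value, stored) if port == "l" else (stored, value)

def match_monadic(value):
    """Issue an activity at once; no memory location is examined."""
    return (value, None)
```

Only the value and the state are stored; the port of the waiting operand is inferred from the port of the token that completes the match.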
  • Constant operands to which this rule is applied are of two types.
  • the first type is literal constants which are those known at the time of compilation.
  • the second type is frame constants which are those values that may be established in an activation frame to be shared by many activities within a given invocation of a code block.
  • W_sticky: It allows a single active operand to be written into the location specified by the effective address prior to the arrival of the constant.
  • the presence state is the "empty" state 56 and an active operand is in the matching stage, the operand's value is written 62 into the location, and the presence state is changed to the "present" state 58.
  • the active operand must be at the left port input. If the constant arrives after the active operand (i.e., an input operand subsequently arrives on the right port), the value of the constant is exchanged with the value of the operand in the location, and an activity is issued 66. The presence state is changed to the "constant" state 60.
  • non-constants must enter on the left port.
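A sketch of the sticky matching rule as a three-state machine follows. The transitions for an active operand arriving at an empty location and for a constant arriving after the active operand come from the text above. The remaining transitions (a constant arriving first, an active operand arriving once the constant is in place, and the error case) are assumptions about Figure 6 and are marked as such in the comments. All names are hypothetical.

```python
from dataclasses import dataclass

EMPTY, PRESENT, CONSTANT = "empty", "present", "constant"

@dataclass
class Location:
    presence: str = EMPTY
    value: object = None

def match_sticky(loc, port, value):
    """Return (v_l, v_r) if an activity is issued, otherwise None."""
    if port == "l":                       # active (non-constant) operand
        if loc.presence == EMPTY:
            loc.value, loc.presence = value, PRESENT
            return None
        if loc.presence == CONSTANT:
            # Assumption: the constant stays in place and is reused.
            return (value, loc.value)
        # Assumption: a second active operand before the constant is an error.
        raise RuntimeError("only a single active operand may wait")
    # Right port: the constant operand.
    if loc.presence == PRESENT:
        # Exchange the constant with the waiting operand and issue an activity.
        waiting, loc.value, loc.presence = loc.value, value, CONSTANT
        return (waiting, value)
    # Assumption: a constant arriving first is simply stored.
    loc.value, loc.presence = value, CONSTANT
    return None
```

Once the location reaches the "constant" state, later active operands can be matched against the stored constant without the constant being sent again, under the assumptions noted.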
  • the presence state need not be limited to encoding the presence or absence of a value, rather the presence state may also encode other information such as data types.
  • the presence state may dynamically alter the execution of an instruction. For instance, if an instruction specifies the sticky matching rule, the operation performed at the memory location on which the matching rule acts is determined by the presence state. Thus, the instruction is conditioned on the presence state and performs different operations given different presence states.
  • The central aim of the ALU operation stage 42 is to produce tags and values to pass on to the token forming stage 44.
  • This stage relies on the A and T.d fields of the instruction 32 to direct system activity.
  • the token forming rule, T.d, specifies how the new tags and values are to be formed.
  • the tags formed are in large part determined by the d field of the token forming rule.
  • the d field specifies the destination addresses to be given the newly formed tokens. In particular, d equals (s1.p', s2.p'').
  • the first is the arithmetic rule denoted as T_arith. It directs the ALU operation stage to apply the ALU operation (A) to the two values (v_l, v_r) in the activity to generate an output value (v'). At least one token is produced and a second token (noted in brackets in Figure 7) may be produced.
  • the output tags for the tokens (c.s'_p' and c.s''_p'') are generated by applying increments (s1 and s2) to the statement number (s) contained in the incoming tags and by supplying new ports (p' and p'') as specified by the d field of the token forming rule. Note that under this rule these output tokens have the same context as the original tokens.
  • the second token forming rule, the send rule, is denoted as T_send. It sets the output token tag equal to the left value (v_l) of the activity, which is of data type TAG, and it sets the output token value equal to the right value (v_r) of the activity. It can generate another token if necessary, but it generates the new token's tag in the same manner as in the arithmetic rule, by adding an increment to the statement number.
  • the value field of the second output token is set equal to the right port value (v_r).
  • the primary aim of this rule is to send a value to a different context. As such, it provides an
  • the third and final basic rule is the extract rule, T_extract, which sends a value equal to the current context to an instruction within the current context. It can only generate one output token.
  • It generates the two new tags (c.s'_p' and c.s''_p'') from the activity tag (c.s) by adding increments (s1 and s2) to s as is done in the arithmetic rule.
  • the first generated tag (c.s'_p') is used as the output token's tag, and the second tag (c.s''_p'') is used as the output token's value.
  • the extract rule and the send rule may be combined to form an extract-send rule that sends a tag from within one context (c) as an argument to another context.
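The three basic token forming rules can be sketched as functions that return lists of (tag, value) pairs. Here a tag is modelled as a plain tuple (c, s, p) and the d field as a list of (increment, port) pairs. These encodings and the function names are assumptions, and the optional second token of the send rule is omitted for brevity.

```python
import operator

def t_arith(c, s, v_l, v_r, alu, d):
    """Arithmetic rule: apply the ALU operation and emit one or two tokens.

    d is a list of one or two (increment, port) destinations.
    """
    v_out = alu(v_l, v_r)
    # Output tags keep the context c; only the statement number and port change.
    return [((c, s + inc, port), v_out) for inc, port in d]

def t_send(v_l, v_r):
    """Send rule: v_l is itself a tag (c', s', p'); v_r is the value sent to it."""
    return [(v_l, v_r)]

def t_extract(c, s, d):
    """Extract rule: emit one token whose value is a tag in the current context."""
    (s1, p1), (s2, p2) = d
    return [((c, s + s1, p1), (c, s + s2, p2))]
```

For example, `t_arith(100, 7, 2, 3, operator.add, [(1, "l"), (2, "r")])` yields two tokens in context 100, both carrying the sum, addressed to statements 8 and 9.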
  • v 1 a tag specifying the new context
  • Another example of a combination token forming rule is the inc-s-send rule, denoted as T_inc-s-send.
  • T inc-s-send can be represented by: (c.s, A, T inc-s-send .d) ⁇ + s2) p" ,
  • This instruction is used primarily for passing arguments and results.
  • An additional combination token forming rule is the T fetch rule. It combines all three of the basic token forming rules. This rule is used to read elements from an array where c is the base address and v is an index. Specifically,
  • T switch acts in a manner consistent with the switch instruction discussed below.
  • the token forming rule dictates the output tag and output value. Further, it should be noted that there is a close relationship between the token formation stage 44 and the ALU operation stage 42. Both rely heavily on the token forming rules (T.d). The difference between the stages is that the ALU operation stage 42 only produces non-joined output tags and output values. The combining of these outputs to produce output tokens is accomplished by the token formation stage 44.
  • the token formation stage 44 can be implemented with the assistance of a number of multiplexers (76 and 78 in Figure 8).
  • the values and tags produced by the ALU operation stage 42 are fed into multiplexers which select the appropriate combinations as dictated by the token forming rule (T.d).
  • the select lines of the multiplexer are controlled by the opcode (T.d) of the instruction. The opcodes will be discussed in greater detail below.
  • Tokens 30 generated by the token forming stage 44 typically exit the pipeline 36 and return to the token queue 34. They are subsequently processed by the pipeline 36. Thus, there is a continuous flow through the pipeline until execution is completed. They may, however, travel to other destinations.
  • the token forming stage 44 contains logic that examines the output token to see where it is destined and routes the output token accordingly.
  • the other destinations to which output tokens may travel include other processing elements 3 and memory units 2.
  • An output token may travel to another processing element when a change-tag instruction (which will be described below) is executed that sends the output token to a new context associated with a different processing element 3. Further, an output token may travel to a memory unit 2 when a memory access instruction is specified by the tag. Such memory access instructions (e.g., read, write) are handled asynchronously by the memory units 2.
  • the processing element need not wait while the memory access is being performed, rather it continues the processing.
  • the actual memory access is
  • a traditional queue may be used, but one optimization is for the queue to be comprised of a series of stacks (See Figure 3).
  • the idea behind the use of stacks is to try to create a sort of cache queue.
  • a FIFO buffer would be useful, but not as practical for the present purposes.
  • a FIFO would not easily provide for priority scheduling since, in a FIFO scheme, all tokens must wait their turn to be processed.
  • a FIFO controls parallelism in the wrong way. If the tasks to be executed are viewed as a tree wherein those tasks that must be executed early in the execution process to allow other tasks to be executed are near the top of the tree, a FIFO approach would unfold the tree in a breadth-first manner. What is desired is an approach that can control unfolding of the tree in a
  • a LIFO approach can control unfolding in such a depth-first manner.
  • the LIFO approach induces more locality than the FIFO
  • the preferred embodiment uses stacks that are organized by priority.
  • the tokens 30 are removed off the highest priority stack 70 ( Figure 3) until it is empty. Once the highest priority stack 70 is empty, the tokens 30 are removed from the next highest priority stack 72. This continues until execution is complete.
  • a preferred implementation is to utilize only two stacks. Most tokens enter the highest priority stack 70, but some tokens are set aside in the next highest stack 72 to delay their processing.
  • priority for purposes of deciding which stack a token 30 enters is to encode the priority in the token 30 as part of the tag.
  • This option has the advantage of being dynamic but results in an increase in the complexity of the system. Such an approach can encode priority dynamically but can also permit static control.
  • the other option is to have the priority encoded in the destination as specified by the d field of the token forming rule of an
  • This stacking scheme allows for high priority tokens 30 to be processed quickly and early on in the execution of the instructions. This characteristic allows instructions which are condition precedents to the execution of a number of waiting instructions to be fully executed early in the execution process so as to free the path for execution of the waiting instructions. The net result is greater control over exposed parallelism.
  • Another optimization embodied within the present invention is to provide a direct path (74 in
  • the present invention also concerns an optimization for exception handling.
  • the system included in the present invention includes a special register set (41 in Figure 3).
  • This register set 41 records activities in the pipeline 36. In particular, it records the tag, the left value (v l ), and the right value (v r ) of each activity in the pipeline 36.
  • An exception is an event such as mismatched data types in an operation, an attempt to divide by zero, etc.
  • the instruction set plays a particularly important role in system operation.
  • the instruction set is
  • This class of instructions, the dyadic arithmetic instruction class, performs arithmetic operations on two input operands.
  • a typical example of a dyadic arithmetic instruction is the add
  • This instruction can be summarized as:
  • Inputs: (c.s l , v l ), (c.s r , v r ). Outputs: (c.s' p' , v l + v r ), (c.s'' p'' , v l + v r ).
  • a second class of instructions completes the set of arithmetic instructions.
  • This second class is known as the monadic arithmetic instructions. Like their dyadic counterparts, they perform arithmetic operations; however, they act on only one operand as opposed to two.
  • a leading example of a monadic arithmetic instruction is the float instruction. This instruction can be summarized as:
  • Inputs Outputs: (c.s" p" , v 1 ) (c.s' p' , float(v 1 ))
  • Identity instructions represent an additional class of instruction.
  • a primary instruction of this class is the identity instruction. This instruction can be summarized as:
  • the identity instruction passes the value of the input operand to another context.
  • Inputs Outputs: (c.s” p" , v 1 ) (c.s" p" , v 1 )
  • T T arith
  • the gate instruction differs in that it copies the operand of the left input port and forwards that value when a value, called a trigger, is received on the right input port.
  • Conditional instructions are used to institute conditional execution. The sole conditional instruction employed is the switch instruction. It demands two input operands. One of the input operands must be a value, and the other input must be a boolean. It produces one of two possible output value choices for the given inputs. Which output value is chosen depends on the value of the boolean input. If the boolean input is TRUE, the first output is chosen:
  • Tag manipulation is carried out by a special class of tag manipulation instructions. This class contains three basic instructions. The first such instruction, change-tag, can be summarized as:
  • Extract-tag, in contrast, can be summarized as:
  • T T extract It is restricted to a single output.
  • the resulting output token has a tag that shares the context with the input operand. It has a statement number that is the sum of an increment and the statement number of the input. Moreover, the value field of the output token equals a tag that is in the same context as the input but has a statement number that is the sum of an additional increment and the statement number.
  • the adjust-offset instruction can be summarized as:
  • the first input's value specifies a new tag
  • the second input's value specifies an offset.
  • This instruction produces up to two output tokens.
  • the first output token is in the same context as the inputs and has a value equal to the first input's value offset by the second input's value.
  • the second output token also has the same value and shares the same context with the inputs, but its statement number is offset.
  • the present system can readily provide for iterative operations, such as loops. It provides for loop implementation by assigning each iteration of a loop a new context. It then uses a change-tag instruction on the new context to send to the current iteration the tokens from the previous iteration. When the iteration is complete, it frees the context so that the context can be assigned to the next iteration. As such, the parent context can be passed from iteration to iteration. Hence, the loop unfolds during execution as a tail recursion of the N activation frames, where N is the number of iterations in the loop. To maintain efficiency in this approach, activation frames used in loops are recycled.
  • an activation frame is allocated every time a procedure is called.
  • the operating system of the data flow processing system maintains a list of free activation frames. When a procedure is called it pops an activation frame off the free list. Similarly, when a procedure call is completed, it no longer needs the activation frame so the activation frame it used is returned to the free list.
  • the invention is a multiprocessor system. It, thus, must be able to communicate readily amongst processing elements 3.
  • the preferred embodiment provides this capability by allowing tokens to freely flow from one processing element to another.
  • the present invention is easily composed from individual
  • composition of multiple processing elements into a multiple processor system is easily achieved. Further, the fine-grained nature of the computations that are performed in parallel likewise allows such easy composition of single processing elements 2 into the multiple processor system. The fine-grained nature of the system also provides the benefit of facilitating easy compilation of code.
  • the assignment of storage space is not necessarily fixed; hence, space used as an activation frame by one processing element 2 may subsequently be used by another processing element 2.
  • the approach has the additional advantage of making it possible to dynamically reallocate the partitions.
  • One disadvantage of partitioning the address space concerns the allocation of large data structures.
  • the present system interleaves such large data structures across multiple processors to produce an even distribution of network traffic and processor loads. This interleaving is performed word-by-word.
  • Non-interleaved and interleaved approaches are provided for by conceptually dividing the address space into regions where increments to the context advance either within a processor or across processors depending on the subdomain
  • each processor is assigned exclusive activation frames, the question arises whether code is also shared. Generally, it is preferred that a copy of the code for each code block be present on every processing element that executes the code block. Hence, the destination instructions are local and non-blocking. The result is to heighten design simplicity at the minimal cost of a larger memory. Alternatively, a cache system may be utilized. It was noted previously that each memory
  • For purposes of this capability, presence bits for adjacent locations are coalesced into words of the size equal to a machine word.
  • the preferred embodiment utilizes a 72 bit word wherein 64 bits are a value field and 8 bits are a type field. Tokens are comprised of two words (See Figure 11). The first word is the tag and the second word is a value-part. Tags are of data type, TAG, whereas value-parts may be of several data types including TAG, FLT (floating point), INT (integer) and BITS (unsigned integer).
  • Tags can be further broken down into a number of fields (See Figure 12).
  • the leading bit of a tag indicates the port of the operand. A zero indicates the left port, and a one indicates the right port.
  • the next 7 bits are the MAP field (See Figure 13).
  • the first 2 bits of the MAP field are the HASH indicator which selects an interleaving strategy. The strategies are listed in the table of Figure 13.
  • the other 5 bits are N field which equals the
  • IP instruction pointer
  • the PE follows the instruction pointer, IP, field, that points to an instruction in memory and is 10 bits long.
  • the final field of a tag is the FP field. It is a frame pointer and is 22 bits long. It points to the particular frame amongst those assigned a particular subdomain.
  • the tag c.s is formed primarily out of the PE, FP and IP fields.
  • c is comprised of the frame pointer FP, and the processing element designation PE such that PE comprises the most significant bits of c and FP comprises the least significant bits of c.
  • s is comprised of PE and IP. PE comprises the most significant bits of s, and IP comprises the least significant bits of s.
  • Instructions are only 32 bits long and comprised of only four fields: OPCODE, r, PORT and s (See
  • OPCODE is 10 bits long and specifies an instruction OPCODE. Also 10 bits long is r. It is an unsigned offset used to compute the effective address of the operand. PORT defines the port for the destination tags. Lastly, s is an offset to IP for one of the destination tags in twos complement form. If two destination tags are required by a given instruction, the second tag is generated by adding 1 to the incoming IP and setting port to 1.
  • the operand matching stage 40 can be subdivided into three substages.
  • the first substage 80 computes the effective address of the memory location where the operand matching takes place.
  • the second substage 82 operates on the presence bits, and the third substage 84 either fetches or stores the operand at the effective address.
  • the 10 bit OPCODE 90 field is used as an address to a first level decode table.
  • Entries 109 in the first level decode table have four fields.
  • a BASE field 92 specifies the base address for an entry in the second level decode.
  • the TMAP field 94 specifies one of 32 type maps.
  • the PMAP field 96 specifies one of 64 presence maps and lastly, the EA field 98 specifies the effective address generation mode.
  • the OPCODE 90 is used to look up a first level decode table entry 109, and the EA field 98 of the entry is examined. It is 2 bits long. If both bits are zero, the address equals FP + r; if both bits are one, the address equals r; and if the leading bit is one and the trailing bit is a zero, the address equals IP + r.
  • the TMAP (94 in Figure 15) field is also examined.
  • the TMAP field 94 selects one of 32 type maps.
  • the type maps are two dimensional arrays of size 256 by 2. Each entry has 2 bits.
  • the second substage 82 is the presence bits substage.
  • the system looks at the memory location specified by the particular address. From the location it reads the 2 presence bits. It uses these bits along with the port of the token in the substage, the type code bits 100 that were read from the type map and the PMAP field 96 of first level code entry to look up an entry 110 in the presence map table.
  • the presence map table has sixty-four entries and each entry has four fields.
  • the BRA field 102 is for four-way branch control and will be discussed below.
  • the FZ field 104 determines whether the force-to-zero override is exerted. It, likewise, will be discussed more below.
  • the FOP field 106 specifies which operand fetch/store
  • the final field 108 specifies the new value of the presence bits. It is denoted as NEXT.
  • fetch/store operation specified by the FOP field is carried out.
  • the contents of the location specified by the effective address are passed on to the next stage of the pipeline.
  • the BRA field 102 of the presence map and the BASE field 92 of the first level decode entry are ORed to produce an address in a second level decode.
  • the second level decode table entry is used to specify parameters used in system operation. If the FZ field 104 is one, the BASE field 92 is forced to zero before being ORed.
  • the net result is the second level decode table entry 111 is set as an absolute address of 0, 1, 2 or 3.
  • the matching mechanism has been simplified so as to remove the inbred complexity found in associative memory systems.
  • the unique memory design is
  • the fine-grained nature of the present system provides a number of benefits. First, it exposes the maximum amount of parallelism.
  • the present invention optimizes performance, simplicity and cost effectiveness.
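
The second level decode addressing mentioned in the list above (the BRA field ORed with the BASE field, with the FZ field forcing BASE to zero) can be illustrated with a short Python sketch. The function name and argument names are hypothetical, and the sketch assumes that BRA is a 2 bit value and that BASE is chosen so that the OR acts as a four-way branch; neither detail is confirmed by the text.

```python
def second_level_address(base: int, bra: int, fz: int) -> int:
    """Combine BASE and BRA into a second level decode address (illustrative only)."""
    if not 0 <= bra <= 3:
        # Assumption: four-way branch control means BRA takes values 0..3.
        raise ValueError("BRA selects one of four branches")
    if fz:
        # Force-to-zero override: BASE is zeroed before the OR,
        # so the result is the absolute address 0, 1, 2 or 3.
        base = 0
    return base | bra
```

With a hypothetical BASE of 0b101100 and BRA of 2, the function returns 0b101110; with FZ set, the same inputs return 2.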

Abstract

A data flow processing system has a plurality of processing elements and memory units. Communication amongst processing elements and amongst processing elements and memory units is facilitated by an interconnection network. Each processing element is pipelined. The system operates upon data objects known as tokens. The tokens initiate activity within the processing element pipelines. Included within the activities initiated by the tokens is execution of instructions. Operands for instructions are matched in non-associative portions of memory known as activation frames. The activation frame memory locations have a state field that indicates whether a value is present or not in the activation frame. The state field may also indicate other information about an activation frame memory location. The state field is used to determine what action is taken at an activation frame memory location when an instruction is executed. The state field also determines the scheduling of execution of instructions. Moreover, the state field may be manipulated independent of a value held at an activation frame memory location.

Description

DATA FLOW MULTIPROCESSOR SYSTEM
Background of the Invention
Current data processing systems can be classified by the level at which they choose to employ parallelism. The degree of parallelism in a system is to a great extent more a product of how a system is programmed than its inherent characteristics. In single-instruction-single-data stream (SISD) systems, only one instruction is executed at a time, and only one data stream is used. The amount of parallelism in execution of instructions in such systems is minimal. Single-instruction-multiple-data stream (SIMD) systems also execute only a single instruction at a time; however, they have multiple data streams and thus can act on multiple operands in parallel. As a result, they experience a higher level of parallelism in execution than SISD systems. An even higher level of parallelism is achieved in multiple-instruction-multiple-data-stream (MIMD) systems, for both data and instructions are processed in parallel.
MIMD systems can be further classified by the level at which they extract parallelism; specifically, they can be classified by the size of the units of computation that they perform in parallel. Coarse-grained systems are those in which relatively large units of computation are performed in parallel. Although the units of computation are performed in parallel, execution still proceeds sequentially within each unit. Fine-grained systems, in contrast, perform small units of computation in parallel. Since the units of computation that may be performed in parallel in fine-grained systems are much smaller than in coarse-grained systems, fewer operations are performed sequentially and the inherent potential for parallelism is much greater.
Given this classification scheme, the greatest inherent potential for parallel activity rests with MIMD systems. Moreover, amongst MIMD systems, the greatest inherent potential for parallel activity is found with fine-grained systems. Such systems have small sized units of computation and provide for easy computation. Furthermore, processing elements that operate on fine grains of computation are readily composed into multiprocessor systems. The problem, however, in efficiently implementing such fine-grained systems is effectively controlling the extraordinary amount of parallel activity of the systems, particularly since the increase in parallelism results in a dramatic rise in inter-task communication. The increase in parallelism also presents complex computation scheduling and synchronization problems.
Efforts to date at MIMD machines have produced only mildly encouraging results. Generally, such machines have used software as the primary tool for instilling parallelism. In programming such software, programmers have decided where to install parallelism and have had to account for the difficult interactions of the programs with the machine. As a result, the programmer has borne the brunt of the burden for deciding how to bring about parallel execution. Given the complex and confusing nature of these decisions, MIMD machines have proven unappealing to most users. Moreover, the resulting software has tended to be difficult to debug, unreliable, and not portable. To make matters worse, these machines have not performed as well as expected, despite the level of effort required to operate them.
Data flow machines are a special variety of fine-grained machines that attempt to execute models of data flow through a data processing system. These models are known as data flow diagrams. Data flow diagrams are comprised of nodes and edges. The nodes represent instructions, and the edges represent data dependencies. To be more precise, the nodes represent operators and the edges represent operands. Data flow machines operate by processing the operands.
The data flow diagrams impose only a partial order of execution. The instructions are executed in data flow machines whenever the operands they require are available. Data flow machines are, thus, not constrained by the rigid total order of execution found in sequential machines. This flexibility allows data flow machines to schedule execution of instructions asynchronously. In contrast, sequential machines based on the von Neumann model execute instructions only when the instruction pointer is pointing to the instructions. The primary benefit of asynchronous scheduling is the greater exposure of latent parallelism.
Data flow machines can be further classified into two categories: dynamic and static. Dynamic data flow machines are those machines which can dynamically allocate an instance of a function. In other words, the memory required for an instance of a function need not be preplanned, rather it can be allocated when the function is executed. Static data flow machines, in contrast, can only statically allocate an instance of a function. As such, they must preplan the storage required for each instance of the function prior to execution.
The Tagged-Token Data Flow Architecture developed at the Massachusetts Institute of Technology is a leading example of a dynamic data flow machine. It is described in Arvind, S.A. Brobst, and G.K. Maa, "Evaluation of the MIT Tagged-Token Data Flow Project", Technical Report CSG Memo, MIT Laboratory for Computer Science, 1988. The tagged-token data flow machine allows simultaneous applications of a function by tagging each operand with a context identifier that specifies the activation of the function to which it belongs.
In a tagged-token data flow architecture, the combination of the tag and the operand constitutes a token. The tags of two tokens must match if they are destined to the same instruction. Hence, the tagged token architecture must have a means for correctly matching the tags. An associative memory has been relied upon as the matching mechanism.
Summary of the Invention
A data processing system of a preferred embodiment is comprised of at least one processing element and a plurality of memory locations. The processing element includes a processing pipeline of four stages. The first stage fetches an instruction from memory to operate on a token. The second stage of the pipeline performs operand matching on the token as indicated by the fetched instruction. The third stage continues processing on the token by performing an ALU operation specified by the fetched instruction. Lastly, the fourth stage of the pipeline forms a new token or tokens from the results of the ALU operation and the fetched instruction.
This data processing system is preferably a multiple processor data flow processing system.
The tokens referred to above are data objects that are comprised of two fields. The first field is a tag that includes a pointer to the beginning of an activation frame and a pointer to an instruction. Activation frames are contiguous blocks of memory locations that are used to store information necessary to execute a block of instructions. In particular, they serve as the meeting ground for matching operands destined to the same data flow node. The instruction pointer points to the instruction that is to be executed when the given token is processed in the pipeline of a processing element. The second field of the token is simply a data value. In terms of the data flow model, this data value typically represents an operand value.
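The two-field token layout described above can be sketched as a pair of small records. This is only an illustration in Python, not the encoding used by the system: the class and field names (Tag, Token, frame_pointer, instruction_pointer) are hypothetical, and field widths are not modeled.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Tag:
    """First field of a token (names are illustrative assumptions)."""
    frame_pointer: int        # beginning of the activation frame
    instruction_pointer: int  # instruction to execute when the token is processed


@dataclass(frozen=True)
class Token:
    """A token pairs a tag with a data value, typically an operand."""
    tag: Tag
    value: Any
```

Under this sketch, two operands destined to the same instruction in the same activation frame carry equal frame and instruction pointers, for example Token(Tag(100, 7), 3.0) and Token(Tag(100, 7), 4.0).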
Each processing element includes not only a processing pipeline but also a token buffer. This buffer stores the tokens while they are waiting to be processed within the processing pipeline. Each processing element has its own token buffer. Moreover, each of the processing elements preferably operates in parallel with its counterparts. The token buffer is preferably comprised of a plurality of stacks that are prioritized such that tokens leave a higher priority stack for the pipeline before tokens leave a lower priority stack. Since the buffer is organized as a stack, it operates on a last in first out (LIFO) basis. It should be noted, however, that the buffer need not operate in this manner; rather, it may operate on a first in first out (FIFO) basis. The LIFO approach, however, provides a more efficient implementation.
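The prioritized LIFO token buffer described above can be sketched as a list of stacks that is drained from the highest priority downward. This is a minimal Python illustration under stated assumptions: the class name TokenBuffer, its methods, and the default of two priority levels are hypothetical choices, not details taken from the text.

```python
from typing import Any, List, Optional


class TokenBuffer:
    """Token buffer built from LIFO stacks; index 0 is the highest priority."""

    def __init__(self, levels: int = 2) -> None:
        # One Python list per priority level, each used as a stack.
        self.stacks: List[List[Any]] = [[] for _ in range(levels)]

    def push(self, token: Any, priority: int = 0) -> None:
        """Place a token on the stack for the given priority level."""
        self.stacks[priority].append(token)

    def pop(self) -> Optional[Any]:
        """Return the newest token from the highest priority non-empty stack."""
        for stack in self.stacks:
            if stack:
                return stack.pop()
        return None  # buffer is empty
```

Pushing "a", "b", a low priority token, and then "c" yields "c", "b", "a" before the low priority token is released, which shows both the LIFO order within a stack and the priority order between stacks.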
As mentioned above, the activation frames serve as a place for matching operands that are destined to the same data flow node. The activation frame memory locations are able to provide such operand matching because of their structure. In particular, each memory location within an activation frame includes a state field for indicating a current state of the memory location and a value field for holding a value. The state field may indicate a number of different types of information. For instance, it may indicate the data type of the value in a value field. Similarly, it may indicate whether a value is stored in the value field. It is this second type of indication that is utilized to implement the matching function.
The matching function is an important part of execution of arithmetic/logical instructions within the data processing system. The course of execution of an arithmetic/logical instruction begins with fetching of the instruction. Once the instruction is fetched, a memory location indicated by the instruction is accessed. Typically, this memory location is within an activation frame. Once the memory location is accessed, the state field is examined and the value field is operated on as determined by the arithmetic/logical instruction and the current value in the state field. Once the state field and value field have been operated on, the arithmetic/logical instruction is either performed or not performed, depending upon the current state of the memory location and the instruction.
The operations that are performed on the state field and value field are determined to a great extent by the time at which the operand encoded by the token arrives at the memory location. In the typical course of events, a first operand of an instruction is stored in the memory location when the memory location is empty. Once it is stored in the memory location, the state field is changed to reflect that an operand of the instruction is stored in the memory location. Subsequently, a second operand of the instruction is received. Upon receiving this second operand, the memory location is checked to see if the first operand is stored therein. Since the first operand is already stored in the memory location, it is read out of the memory location and sent with the second available operand to be further processed. It is preferred that, after the first available operand is removed from the memory location, the state of the memory location be changed to indicate that the memory location is empty. The above approach is applicable to dyadic operations; however, when monadic operations are specified by the instruction, the operand value of the current token is forwarded to the processing element without the necessity of examining the contents of the memory location.
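The dyadic and monadic matching behavior described above can be sketched as a small state machine over one activation frame location. This is a minimal Python illustration: the names Slot, match_dyadic and match_monadic and the state labels are hypothetical, and the sketch ignores which input port each operand arrives on.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple

EMPTY, PRESENT = "empty", "present"


@dataclass
class Slot:
    """One activation frame location: a state field plus a value field."""
    state: str = EMPTY
    value: Any = None


def match_dyadic(slot: Slot, operand: Any) -> Optional[Tuple[Any, Any]]:
    """Return (stored, arriving) when both operands are available, else None."""
    if slot.state == EMPTY:
        # First operand to arrive: store it and record that it is present.
        slot.state, slot.value = PRESENT, operand
        return None
    # Second operand: read the partner out and mark the location empty again.
    stored = slot.value
    slot.state, slot.value = EMPTY, None
    return stored, operand


def match_monadic(operand: Any) -> Tuple[Any]:
    """Monadic instructions forward the operand without examining the location."""
    return (operand,)
```

The first call to match_dyadic on an empty slot returns None and leaves the operand waiting; the second call returns both operands and resets the slot, so the same location can be reused by a later activation.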
As mentioned above, each token contains an instruction pointer within its tag field. When a token enters the first stage of the pipeline (i.e., the instruction fetch stage), this pointer is utilized to locate the particular instruction. Each instruction contains a matching rule for matching operands. It also contains a rule for computing an effective address of a storage location on which the matching rule operates. Additionally, the instruction contains an ALU operation to be performed by the ALU of the data processing system and, lastly, a token forming rule for forming new tokens that result from execution of the instruction.
The dyadic and monadic matching rules have been discussed above, but the sticky matching rule has not yet been discussed. The sticky matching rule tells the system to write a value of a token into the value field of a location if the state field of the location indicates that another value is not present. When it writes such a value, it changes the state field of the location to indicate that a value is now present. On the other hand, if the value of the token is a constant, this value is written into the value field, but the state field is changed to indicate not that a normal value is present but rather that a constant value is present. If a normal value is present within the value field and a constant value subsequently arrives, the constant value is exchanged for the normal value. Furthermore, once the constant value is written into the value field, it is not removed until explicitly cleared. All accesses to the memory location after that point merely read a constant value out of the memory location.
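The sticky matching rule described above can be sketched as a three-state location (empty, normal value present, constant present). This is a minimal Python illustration with hypothetical names (Slot, match_sticky, clear). It covers only the cases the summary spells out; combinations the summary does not describe, such as a second normal value arriving while a normal value is waiting, raise an error rather than guess at the behavior.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple

EMPTY, PRESENT, CONSTANT = "empty", "present", "constant"


@dataclass
class Slot:
    """One activation frame location: a state field plus a value field."""
    state: str = EMPTY
    value: Any = None


def match_sticky(slot: Slot, operand: Any,
                 is_constant: bool = False) -> Optional[Tuple[Any, Any]]:
    """Return (constant, normal value) when both are available, else None."""
    if slot.state == EMPTY:
        # Nothing present yet: store the operand and record its kind.
        slot.value = operand
        slot.state = CONSTANT if is_constant else PRESENT
        return None
    if slot.state == PRESENT and is_constant:
        # Exchange: the constant takes the location, the waiting value is released.
        waiting, slot.value, slot.state = slot.value, operand, CONSTANT
        return operand, waiting
    if slot.state == CONSTANT and not is_constant:
        # The constant sticks: it is read but not removed.
        return slot.value, operand
    raise ValueError("case not described in the summary")


def clear(slot: Slot) -> None:
    """Explicitly clear the location, the only way to remove a constant."""
    slot.state, slot.value = EMPTY, None
```

Once a constant is stored, every later normal operand pairs with it and the location keeps its constant state until clear is called.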
The rule for computing an effective address of a storage location, as specified by the instruction, selects one of three possible addressing approaches. In the first of these options, the memory location is at an absolute address indicated by an offset that is contained within the instruction. In the second option, the memory location is at the address specified by the activation frame pointer of the tag, offset by the offset contained within the instruction. The third option employs yet another approach. In this final approach, the memory location is at the address pointed to by the pointer to an instruction, offset by the offset contained within the instruction.
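The three addressing options described above reduce to a simple selection between the instruction offset alone, the frame pointer plus the offset, and the instruction pointer plus the offset. The Python sketch below illustrates this; the enum AddressMode, its member names, and the function signature are hypothetical, and address widths and wraparound are not modeled.

```python
from enum import Enum


class AddressMode(Enum):
    """The three effective address options (names are illustrative)."""
    ABSOLUTE = "absolute"                 # address = r
    FRAME_RELATIVE = "frame"              # address = FP + r
    INSTRUCTION_RELATIVE = "instruction"  # address = IP + r


def effective_address(mode: AddressMode, fp: int, ip: int, r: int) -> int:
    """Compute the location on which the matching rule operates."""
    if mode is AddressMode.ABSOLUTE:
        return r
    if mode is AddressMode.FRAME_RELATIVE:
        return fp + r
    return ip + r
```

Frame-relative addressing is what lets the same instruction match operands in different activation frames: the offset r is fixed in the instruction while the frame pointer comes from the token's tag.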
The ALU operation indicated by the instruction tells the system what operation is sought to be performed by the ALU on the operands. For instance, if the ALU operation were specified as an addition operation, the ALU operation stage of the pipeline would add the two operands passed to it when it acted on the given token. The final rule specified by the instruction is the token forming rule. This rule tells the system how to form new output tokens from the operations that have been performed. This rule is employed in the tag forming portion of the ALU operation stage as well as in the token forming stage.
The processing pipeline preferably contains a capability to handle exceptions. One means that may be employed is to have each activity associated with a token recorded in a register as the token enters the pipeline. When an exception occurs, the value in the register remains unchanged until the exception is resolved. The resolution of the exception is performed by issuing an exception handling token. This exception handling token preferably may not be interrupted.
So as to ensure no conflict between processing elements acting in parallel, each processing element is preferably assigned a given region of storage in which the activation frames and code associated with that processing element are stored. A token may indicate the processing element for which it is destined by encoding within the tag a processing element designation.
Brief Description of the Drawings
The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of a preferred embodiment of the invention, as illustrated in the accompanying drawings.
Figure 1 shows a sample data flow diagram.
Figure 2 shows the major components of the data flow machine.
Figure 3 shows the major components of a processor of the data flow machine.
Figure 4 shows the relationship between tokens, instructions, and activation frames.
Figure 5 shows the state transition diagram for the arithmetic matching rule.
Figure 6 shows the state transition diagram for the sticky matching rule.
Figure 7 shows the effect of the three basic token forming rules.
Figure 8 shows the major components of the token forming stage of the pipeline.
Figure 9 shows the partitioning scheme of memory amongst processing elements.
Figure 10 illustrates the interleaving strategies.
Figure 11 shows the fields of a token.
Figure 12 shows the fields of a tag.
Figure 13 shows the sub-fields of the map field and shows how interleaving strategies are encoded.
Figure 14 shows the fields of an instruction.
Figure 15 shows the decoding strategy.
Detailed Description of the Preferred Embodiment
The preferred embodiment of the present invention includes a data flow multiprocessor system. As mentioned previously, data flow systems execute data flow diagrams. A sample data flow diagram is shown in Figure 1. Specifically, Figure 1 illustrates a data flow diagram for (AxB)+(CxD). The operands (i.e., A, B, C, and D) are represented as edges, and the operators (i.e., x and +) are represented as nodes. Node 10 represents a multiplication operation. That node 10 has two input edges 14 and 16 that represent A and B respectively. The output edge 18 from node 10 has the value AxB. Similarly, node 12, which also represents a multiplication operation, has input edges 20 and 22 that represent C and D respectively. Output edge 24 has the value CxD. The two output edges 18 and 24 from these nodes 10 and 12 then enter an addition node 26. The resulting output edge 28 represents (AxB)+(CxD).
In a traditional sequential machine, each of the operations represented by the nodes would be performed in sequential order. In the example shown in Figure 1, the machine would first multiply A times B, then it would multiply C times D, and lastly it would add the products A times B and C times D. There is, however, no reason to impose such an order if the operands are available. Thus, in a parallel processing data flow machine, if the operands A, B, C, and D are available, the operations AxB and CxD are performed simultaneously. The resulting products are subsequently summed. By operating in this manner, the data flow machine replaces the arbitrary sequential order imposed on such operations with an order imposed by the operations themselves.
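The data-driven ordering described above can be illustrated with a minimal Python sketch. It models the Figure 1 graph using the node and edge reference numerals from the text; the function name `run_graph` and the notion of a "wave" of simultaneously ready nodes are illustrative assumptions, and the loop only simulates which nodes could fire together rather than running them on parallel hardware.

```python
import operator

# Illustrative model of the Figure 1 graph for (A x B) + (C x D).
# Each node fires as soon as both of its input edges carry a value;
# no program counter dictates the order.
GRAPH = {
    # node id: (operation, left input edge, right input edge, output edge)
    10: (operator.mul, 14, 16, 18),
    12: (operator.mul, 20, 22, 24),
    26: (operator.add, 18, 24, 28),
}

def run_graph(graph, initial_edges):
    """Fire every node whose operands are available until none remain."""
    edges = dict(initial_edges)   # edge id -> value
    pending = set(graph)          # nodes that have not fired yet
    waves = []                    # groups of nodes that could fire together
    while pending:
        ready = sorted(n for n in pending
                       if graph[n][1] in edges and graph[n][2] in edges)
        if not ready:
            raise RuntimeError("graph is stuck: operands missing")
        results = {}
        for n in ready:
            op, left, right, out = graph[n]
            results[out] = op(edges[left], edges[right])
        edges.update(results)
        pending -= set(ready)
        waves.append(ready)
    return edges, waves

# A=2, B=3, C=4, D=5 placed on input edges 14, 16, 20, 22.
edges, waves = run_graph(GRAPH, {14: 2, 16: 3, 20: 4, 22: 5})
```

In this sketch both multiplication nodes become ready in the same wave, and the addition node fires only once edges 18 and 24 carry values.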
A view of the major components of the preferred embodiment of the present invention is shown in Figure 2. Specifically, the present invention includes a plurality of processing elements 3, each associated with an assigned region of memory 4, and global memory units 2. An interconnection network 1 comprised of logic circuitry is provided to facilitate communication amongst processing elements 3 as well as amongst global memory units 2 and processing elements 3. Each of the processing elements 3 may access any of the global memory units 2 and acts in parallel with the other processing elements 3.
Figure 3 shows a typical processing element 3 in more detail. As can be seen in Figure 3, each processing element 3 includes a processing pipeline 36. Also included is a token queue 34 for storing tokens waiting to be processed. An assigned local portion 4 of global memory is allocated for storing activation frames 45 and code 32 for processing use. To provide exception handling capabilities, an exception handler 43 and a set of registers 41 are provided within the processing element 3. The memory 4 is distinguishable from the memory of traditional machines built under the von Neumann model. Instead of merely consisting of a value, each of the memory locations of the present invention comprises two fields: a presence field and a value field. This memory organization is significant, for it alters the common view of memory as merely a location for storing values. The dynamically assigned portion of memory 4 in the present invention has both a presence state and a value. The presence state may affect the value's significance, as well as alter the execution of instructions on the value. Further, the presence field may be manipulated independently from the value field. This design of memory makes it a more powerful tool, as will be more apparent from the discussion that follows.
The memory 4 is used to store activation frames 45, heap storage 6 and code sections 32. Alternatively, an implementation may have separate memories for code 32 and activation frames 45. Activation frames 45 can be viewed as all the locations required for the invocation of a function or a code block. They play a particularly important role in the present invention, for they constitute the working memory in which tokens with matching tags meet. Tokens will be discussed in more detail below.
In the present invention, tokens 30 (Figure 3) are akin to embellished operands that identify not only the operand values but also the particular activation of an instruction to which they belong. In particular, a token 30 can be viewed as a tuple comprising a tag (c.sp) and a value (v). A token 30, thus, is a tuple (c.sp, v). The tag identifies the instruction and activity associated with the operand that the token represents. The tag is comprised of three fields: a context pointer (c), a statement number (s), and a port indicator (p).
The context pointer (c) points to the beginning of an activation frame 45 in which the operand represented by the token will be matched. The statement number (s) points to a specific memory location where the instruction of the operand is stored, and the port (p) indicates whether the token enters a node representing the instruction on the left input edge or the right input edge.
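As a rough illustration of the token structure just described, the following Python sketch models a tag and a token as named tuples. The class names, the field types, and the `"l"`/`"r"` port encoding are assumptions made for this sketch and are not the bit-level formats of Figures 11 and 12.

```python
from typing import NamedTuple

class Tag(NamedTuple):
    c: int   # context pointer: start of the activation frame
    s: int   # statement number: address of the instruction
    p: str   # port: "l" (left input edge) or "r" (right input edge)

class Token(NamedTuple):
    tag: Tag
    v: float  # operand value

def same_activity(a: Token, b: Token) -> bool:
    """True when two tokens have like tags differing only as to port."""
    return (a.tag.c, a.tag.s) == (b.tag.c, b.tag.s) and a.tag.p != b.tag.p

left = Token(Tag(c=0x100, s=7, p="l"), 3.0)
right = Token(Tag(c=0x100, s=7, p="r"), 4.0)
other = Token(Tag(c=0x200, s=7, p="r"), 4.0)  # same statement, other context
```

Here `left` and `right` are partners for the same activation of statement 7, while `other` belongs to a different activation frame and must not match either of them.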
The processing element 3 of Figure 3 is perhaps most easily viewed as a token processing system. It processes tokens 30 to bring about execution of instructions 32. In terms of the data flow model, this system processes operands to execute data flow diagrams, and it continues processing operands until told to stop. In more accurate terms, the tokens trigger the production of activities that are performed in the pipeline 36.
During operation of the system, within each processing element 3, tokens 30 leave the token queue 34, whereupon they enter the processing pipeline 36. Multiple tokens are typically present in different stages of the pipeline at any given time. Each such token is processed in parallel with the other tokens in the pipeline. Hence, the pipeline represents a second level of parallelism within the present invention that is distinct from the parallelism attributable to the simultaneous computations of multiple processors.
More than one token 30 may enter the pipeline at a time, but generally only one token does so. The number of tokens 30 that may simultaneously enter the pipeline 36 is dictated by the multiprocessing capability of the particular design employed; that is, each processing element may operate on multiple tokens in parallel. The first stage of the processing pipeline 36 is the instruction fetch stage 38. In this stage, the system looks at the statement number (s) contained in the tag of the incoming token 30. This statement number (s) corresponds to the memory location for the instruction that is to operate on the token 30. After determining the statement number, the system retrieves the instruction at that location (see the memory location pointed to by arrow 31 in Figure 4).
An instruction of the present invention can be viewed as a tuple. In particular, an instruction equals (E.r, W, A, T.d) where E.r specifies a method of determining the effective address for the storage location on which the matching rule will operate; W specifies the matching rule; A specifies the ALU operation; and T.d specifies the token forming rule. These elements of the tuple need not be encoded by separate opcodes; rather, all may be encoded by a single opcode as will be seen below. In the preferred embodiment the instructions are limited to at most two operands. Thus, instructions traditionally having more than two operands must be broken down into combinations of instructions having at most two operands.
Once the instruction 32 is fetched, the token 30 and information from the instruction 32 are passed on to the operand matching stage 40 of the pipeline 36. In this stage, the system looks to match operands destined for the same node (i.e., those having like tags differing only as to port). This stage is the means by which the system checks whether all the operands necessary for execution of an instruction 32 are available. If they are available, the instruction 32 is executed. If they are not available, the operand specified by the token is generally written into the matching location. The specifics of what occurs are discussed in more detail below.
The processing then continues in the next stage of the pipeline 36: the ALU operation stage 42, which is comprised of a tag former 42A and an ALU 42B. In general, the tag former 42A forms the tag portion of output tokens, and the ALU forms the value portion of output tokens. In this stage 42, the operation specified by the instruction is performed by the ALU 42B within the stage. Moreover, new tags for the results of the ALU operation are formed by the tag former 42A. The tag former 42A and ALU 42B may be further pipelined into substages so as to balance overall performance. The tags are formed in accordance with the token forming rule of the instruction being executed. The output from this stage 42 enters the token forming stage 44, where the output tokens are formed. The resulting output tokens, which typically carry the result of the operation, may then travel to a number of different locations. First, they may travel within the processing element 3 to the token queue 34 or to the pipeline 36. Second, they may travel to other processing elements 3, and third, they may travel to a memory unit 2. Where they travel is dictated by the tag portion of each output token, for the tag specifies a particular processing element 3 or memory unit 2, as will be discussed below.
If an output token 30 is traveling to a memory unit 2, the output token's tag specifies which memory unit 2 is to be accessed. The output token 30 also specifies the operation to be performed on the memory location (i.e., a read operation or a write operation). The memory controller of the memory unit 2 uses the operation specification to perform the desired operation. If a read is requested, the memory unit 2 produces a token 30 comprising the data read out of the memory unit 2 and a tag specifying a given processing element. Once produced, this token 30 travels to the appropriate processing element, where it joins the other tokens 30 being processed by the destination processing element.
Output tokens 30 may also travel, as mentioned above, to other processing elements 3. In particular, upon exiting the pipeline 36 of the processing element 3 that produced them, they are directed to the interconnection network 1, where they travel to the processing element 3 specified by the tag of the token 30. Once there, they join other tokens at the processing element 3 and are processed. All of the tokens in a single processing element 3 have tags that specify the same processing element 3.
Bearing in mind the many different paths an output token 30 may follow, one can appreciate the diversity of communication options available in the present invention and the ease with which such communications are implemented. Each processing element 3 can communicate with all other processing elements and with any memory unit 2. To communicate with such components, it need only generate an output token having a tag that specifies the processing element 3 or memory unit 2 to which it is destined. Moreover, such communications can be performed in parallel with ongoing computations, for the processing elements 3 need not wait for a response to the communication.
In light of the above discussion, it should be apparent that there is a great deal of activity occurring in parallel within the present invention. To maintain proper synchronization, certain ground rules must be followed. One of the ground rules is that an activation frame must be allocated before it can be used. Thus, an activation frame is allocated for each routine that is performed. In particular, once a routine is to be called, the activation frame is allocated. This allocation is performed by the operating system. The operating system maintains a list of free activation frames such that when a request for a new activation frame is received, it simply pops an activation frame off the free list and assigns it to the call of the routine. Moreover, when the call to the routine is completed, the activation frame is added back onto the free list. In performing such allocations and deallocations, the operating system makes certain that the same activation frame is not assigned to multiple routine calls at the same time.
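The free-list discipline described above might be sketched in Python as follows. The class name `FrameAllocator`, the equal-sized frame layout, and the `in_use` bookkeeping set are assumptions introduced for illustration; the text does not specify how the operating system lays out or tracks frames.

```python
class FrameAllocator:
    """Hands out activation frame base addresses from a free list."""

    def __init__(self, base, frame_size, count):
        # Hypothetical layout: `count` equal-sized frames starting at `base`.
        self.free = [base + i * frame_size for i in range(count)]
        self.in_use = set()

    def allocate(self):
        """Pop a frame off the free list for a new routine call."""
        if not self.free:
            raise MemoryError("no free activation frames")
        frame = self.free.pop()
        self.in_use.add(frame)
        return frame

    def release(self, frame):
        """Return a frame to the free list when its call completes."""
        if frame not in self.in_use:
            raise ValueError("frame is not currently allocated")
        self.in_use.remove(frame)
        self.free.append(frame)

alloc = FrameAllocator(base=0x1000, frame_size=16, count=2)
f1 = alloc.allocate()
f2 = alloc.allocate()
alloc.release(f1)      # the first call completes
f3 = alloc.allocate()  # its frame is reused by a later call
```

Because a frame moves to `in_use` when it is popped and returns to the free list only on release, the same frame cannot be assigned to two routine calls at once.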
In the preferred system, each activation frame is processed by a specific processing element, and each processing element may be allocated thousands of activation frames. Since the activation frames are allocated as required and requested with execution of prior code blocks, the processing elements are called upon dynamically with execution of the code.
Routines are code blocks, and calls to such routines are made by prior code blocks. These code blocks are defined by the compiler, which decides which nodes of a data flow graph constitute a code block and thus share a single activation frame. The compiler, thus, establishes the interprocessor granularity of the process. Smaller code blocks characterize a fine grain system having greater potential for parallelism with increased interprocessor communications.
In this embodiment, code is not shared; rather, each processing element 3 has a copy of the entire code in its assigned memory 4. Not all of the instructions encoded within the code are performed by each processing element. Instead, only those instructions that are pointed to by the tokens processed by the given processing element 3 are executed by that processing element 3. Thus, only a proportional share of the entire code is typically performed by a given processing element. As an alternative, a cache system may be used to load code as required.
Once the activation frames are allocated and the code blocks are defined, the system can look to perform matching of operands. In particular, the operand matching stage 40 looks to the E.r field contained within the instruction 32 that is to be executed. The E.r field specifies one of three effective addressing modes. One possible effective addressing mode is frame relative mode. In frame relative mode, the address of the memory location is located within an activation frame 45 by adding an offset (r) to the context pointer (c) contained within the tag of the token 30. This scheme is illustrated in Figure 4. Note the arrow 35 in Figure 4 from c pointing to the beginning of the activation frame 45, and note the other arrow 33 pointing to the location specified by the context (c) plus the offset (r). In contrast, absolute mode specifies the address as the absolute address (r). Lastly, in code relative mode, the address is specified by the statement number from the tag plus the offset (r).
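The three effective addressing modes can be summarized in a short Python sketch. The function name and the string labels for the modes are illustrative assumptions; only the three address computations themselves are taken from the text.

```python
def effective_address(mode, r, c, s):
    """Return the address on which the matching rule operates.

    mode: "frame", "absolute", or "code" (labels are illustrative)
    r:    offset carried in the E.r field of the instruction
    c, s: context pointer and statement number from the token's tag
    """
    if mode == "frame":      # frame relative: context pointer plus offset
        return c + r
    if mode == "absolute":   # absolute: the offset is the address itself
        return r
    if mode == "code":       # code relative: statement number plus offset
        return s + r
    raise ValueError(f"unknown addressing mode: {mode!r}")
```

For a token with c = 0x100 and s = 40 and an instruction offset r = 3, the three modes would select locations 0x103, 3, and 43 respectively.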
Once it has located the address on which the matching rule is to operate, the processing element 3 can then perform the matching rule. Matching rules are basically a means of generating activities. For illustrative purposes, an activity can be thought of as a tuple (c.s, v1, vr, A, T.d) where c.s is the context pointer and statement number from the tag shared by the matching tokens; v1 is the value of the token on the left port; vr is the value of the token on the right port; A is an ALU operation; and T.d is the token forming rule.
Which matching rule is applied depends on the operation specified in the instruction and on the operands. The matching rule that is most commonly used operates on two operands. This rule is referred to as Wdyadic. When executing this rule, the system looks at the memory location within an activation frame specified by the effective address and checks the presence field. If a value is currently residing in the value field, the presence field equals "present", and if a value is not currently residing there, the presence field equals "empty".
If the location's presence field indicates that it is in the "empty" state 48 (Figure 5), the system writes the value of the token into the value field of that memory location 50 and changes the presence field to the "present" state 52. On the other hand, if the presence field is initially in the "present" state 52, the value field of the memory location is read, and an activity is issued 54. How the issued activity is generated can be best explained in equation form:
(c.s1, v1), (c.sr, vr) → (c.s, v1, vr, A, T.d)
wherein A and T.d are derived from the relevant instruction, and c.s, v1, and vr are derived from the tokens 30. In addition to issuing an activity, the rule changes the presence state from the "present" state 52 to the "empty" state 48. This scheme greatly simplifies the task of matching operands in dyadic operations. There is no longer the need for costly associative memory.
Further, if one operand arrives prior to another, it is not lost. Rather, it merely waits until the other is available. This aspect of the scheme reflects the asynchronous nature of the scheduling of execution of the instructions. In addition, a location in an activation frame need only be large enough to store a value and a state. The incoming tag of the first token to arrive may be discarded, as it can be reconstructed from the tag of the second token to arrive.
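The dyadic matching rule can be sketched in Python as a two-state machine over a single memory location. The names `Slot` and `match_dyadic` and the `"l"`/`"r"` port encoding are assumptions for this sketch; the ALU operation and token forming rule that complete the activity would come from the fetched instruction and are omitted here.

```python
EMPTY, PRESENT = "empty", "present"

class Slot:
    """One memory location: a presence field plus a value field."""
    def __init__(self):
        self.presence = EMPTY
        self.value = None

def match_dyadic(slot, port, value):
    """Apply the dyadic rule; return (v_left, v_right) if an activity issues."""
    if slot.presence == EMPTY:
        # First operand to arrive: store it and wait for its partner.
        slot.value = value
        slot.presence = PRESENT
        return None
    # Second operand: read the stored partner and reset the location.
    stored = slot.value
    slot.value = None
    slot.presence = EMPTY
    # The arriving token's port tells us which side the stored value was on,
    # so the first token's tag never needed to be kept.
    return (value, stored) if port == "l" else (stored, value)

slot = Slot()
first = match_dyadic(slot, "r", 4.0)    # right operand arrives first
second = match_dyadic(slot, "l", 10.0)  # left operand completes the pair
```

Note that the location stores only a value and a state, in keeping with the observation above that the first token's tag can be discarded.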
If a monadic operator is specified, a different matching rule is applied. This rule is denoted Wmonadic. By convention the left input port is used for input in monadic operations. In accordance with this matching rule, the value field of the token on the left input port is read and an activity is issued. The presence state is not affected. The activity takes on the context (c), statement number (s), and value (v1) of the input token on the left input port. In particular,
(c.s1, v1) → (c.s, v1, v1, A, T.d)
Another possibility is that a dyadic operator is specified but with a constant operand. If that is the case, the sticky matching rule, denoted as Wsticky, is applied. It is called a sticky matching rule because once the constant is written into the location, it is not extracted on subsequent accesses. Constant operands to which this rule is applied are of two types. The first type is literal constants, which are those known at the time of compilation. The second type is frame constants, which are those values that may be established in an activation frame to be shared by many activities within a given invocation of a code block.
Figure 6 shows the state transition diagram of Wsticky. It allows a single active operand to be written into the location specified by the effective address prior to the arrival of the constant. In particular, if the presence state is the "empty" state 56 and an active operand is in the matching stage, the operand's value is written 62 into the location, and the presence state is changed to the "present" state 58. Note that the active operand must be at the left port input. If the constant arrives after the active operand (i.e., an input operand subsequently arrives on the right port), the value of the constant is exchanged with the value of the operand in the location, and an activity is issued 66. The presence state is changed to the "constant" state 60.
Suppose, on the other hand, that the location is initially in the "empty" state 56 and that the constant arrives on the right port input prior to the active operand on the left port input. In that case, the constant is written at the specified memory location 64, and the presence state is changed to the "constant" state 60. Once the constant is written into the memory location, the arrival of a subsequent active operand on the left port input does not alter the presence state but, instead, results in reading of the constant and issuance 68 of an activity. For this rule to work effectively in this embodiment, constants must enter on the right port and non-constants must enter on the left port.
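The sticky rule's three-state behavior can be sketched in Python as below. The names `Slot` and `match_sticky` are illustrative assumptions. Only the four transitions described in the text are modeled; any other combination of state and port raises an error in this sketch, because the text does not say how such cases are handled.

```python
EMPTY, PRESENT, CONSTANT = "empty", "present", "constant"

class Slot:
    """One memory location: a presence field plus a value field."""
    def __init__(self):
        self.presence = EMPTY
        self.value = None

def match_sticky(slot, port, value):
    """Apply the sticky rule; return (v_left, v_right) if an activity issues."""
    state = slot.presence
    if state == EMPTY and port == "l":
        # Active operand arrives before the constant: hold it.
        slot.value, slot.presence = value, PRESENT
        return None
    if state == EMPTY and port == "r":
        # Constant arrives first: it sticks in the location.
        slot.value, slot.presence = value, CONSTANT
        return None
    if state == PRESENT and port == "r":
        # Constant arrives second: exchange it with the waiting operand.
        waiting = slot.value
        slot.value, slot.presence = value, CONSTANT
        return (waiting, value)
    if state == CONSTANT and port == "l":
        # Later active operands read the constant without removing it.
        return (value, slot.value)
    raise ValueError(f"transition not described: {state}, port {port}")

a = Slot()
a_first = match_sticky(a, "l", 5)   # active operand waits
a_second = match_sticky(a, "r", 2)  # constant arrives, activity issues
a_third = match_sticky(a, "l", 7)   # constant is reused

b = Slot()
b_first = match_sticky(b, "r", 3)   # constant arrives first
b_second = match_sticky(b, "l", 1)
```

In both orderings the location ends in the "constant" state and keeps the constant for every subsequent left-port arrival.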
It should be pointed out that the presence state need not be limited to encoding the presence or absence of a value, rather the presence state may also encode other information such as data types. Furthermore, the presence state may dynamically alter the execution of an instruction. For instance, if an instruction specifies the sticky matching rule, the operation performed at the memory location on which the matching rule acts is determined by the presence state. Thus, the instruction is conditioned on the presence state and performs different operations given different presence states.
Once the activities are generated, they are passed on to the next stage of the pipeline 36: the ALU operation stage 42. The central aim of this stage is to produce tags and values to pass on to the token forming stage 44. This stage relies on the A and T.d fields of the instruction 32 to direct system activity.
In the ALU operation stage 42, the token forming rule, T.d, specifies how the new tags and values are to be formed. The tags formed are in large part determined by the d field of the token forming rule. The d field specifies the destination addresses to be given the newly formed tokens. In particular, d equals (s1.p', s2.p"). The tag for the first newly formed token is generally c.s'p', where s' = s + s1, and the tag for the second token is generally c.s"p", where s" = s + s2.
There are three basic token forming rules (see Figure 7). The first is the arithmetic rule, denoted as Tarith. It directs the ALU operation stage to apply the ALU operation (A) to the two values (v1, vr) in the activity to generate an output value (v'). At least one token is produced, and a second token (noted in brackets in Figure 7) may be produced. The output tags for the tokens (c.s'p' and c.s"p") are generated by applying increments (s1 and s2) to the statement number (s) contained in the incoming tags and by supplying new ports (p' and p") as specified by the d field of the token forming rule. Note that under this rule these output tokens have the same context as the original tokens.
The second token forming rule, the send rule, is denoted as Tsend. It sets the output token tag equal to the left value (v1) of the activity, which is of data type TAG, and it sets the output token value equal to the right value (vr) of the activity. It can generate another token if necessary, but it generates the new token's tag in the same manner as tags are generated in the arithmetic rule, by adding an increment to the statement number. The value field of the second output token is set equal to the right port value (vr). The primary aim of this rule is to send a value to a different context. As such, it provides an inter-activation communication capability.
The third and final basic rule is the extract rule, Textract, which sends a value equal to the current context to an instruction within the current context. It can only generate one output token. It generates two new tags (c.s'p' and c.s"p") from the activity tag (c.s) by adding increments (s1 and s2) to s, as is done in the arithmetic rule. The first generated tag (c.s'p') is used as the output token's tag, and the second tag (c.s"p") is used as the output token's value.
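The three basic token forming rules can be sketched in Python as functions from an activity to a list of output tokens. Tags are modeled as `(c, s, p)` tuples, and the d field as a pair of `(increment, port)` pairs; these representations and the function names are assumptions for illustration. The text does not say which increment the optional second token of the send rule uses, so the choice of the first destination below is a guess.

```python
import operator

def new_tags(c, s, d):
    """Form the two destination tags by incrementing s and setting ports."""
    (s1, p1), (s2, p2) = d
    return (c, s + s1, p1), (c, s + s2, p2)

def t_arith(c, s, vl, vr, alu, d, two_outputs=False):
    """Arithmetic rule: apply the ALU operation; stay in the same context."""
    tag1, tag2 = new_tags(c, s, d)
    v = alu(vl, vr)
    return [(tag1, v), (tag2, v)] if two_outputs else [(tag1, v)]

def t_send(c, s, vl, vr, d, two_outputs=False):
    """Send rule: vl is itself a tag; deliver vr to that tag."""
    tag1, _ = new_tags(c, s, d)
    tokens = [(vl, vr)]
    if two_outputs:
        # Assumption: the optional second token uses the first destination.
        tokens.append((tag1, vr))
    return tokens

def t_extract(c, s, d):
    """Extract rule: one token whose value is a tag in the current context."""
    tag1, tag2 = new_tags(c, s, d)
    return [(tag1, tag2)]

d = ((1, "l"), (2, "r"))
arith_out = t_arith(8, 10, 3, 4, operator.add, d, two_outputs=True)
send_out = t_send(8, 10, (32, 5, "l"), 9, d)
extract_out = t_extract(8, 10, d)
```

The arithmetic and extract outputs keep context 8, whereas the send output is addressed to context 32, which is how a value crosses from one activation to another.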
These three basic token forming rules are also used in conjunction to produce additional token forming rules. For example, the extract rule and the send rule may be combined to form an extract-send rule that sends a tag from within one context (c) as an argument to another context (ĉ). In particular, if v1 equals a tag specifying the new context, ĉ.ŝ, the output token equals (ĉ.(ŝ + s2)p", c.s'p').
Another example of a combination token forming rule is the inc-s-send rule, denoted as Tinc-s-send. It allows an adjustment to the new s field by employing the second increment (s2) and the second port indicator (p") of the d field. In equation form, with the left value v1 being a tag ĉ.ŝ, Tinc-s-send can be represented by:
(c.s, ĉ.ŝ, vr, A, Tinc-s-send.d) → (ĉ.(ŝ + s2)p", vr)
This instruction is used primarily for passing arguments and results.
An additional combination token forming rule is the Tfetch rule. It combines all three of the basic token forming rules. This rule is used to read elements from an array, where the left value v1 is a tag ĉ.ŝ whose context field ĉ is the base address and vr is an index. Specifically,
(c.s, ĉ.ŝ, vr, inc-c, Tfetch.d) → ((ĉ + vr).ŝp", c.s'p').
Also worth mentioning is the specially defined switch token forming rule, denoted as Tswitch. It acts in a manner consistent with the switch instruction discussed below. In particular,
(c.s, v1, TRUE, A, Tswitch.d) → (c.s'p', v1)
and
(c.s, v1, FALSE, A, Tswitch.d) → (c.s"p", v1).
It can be seen that only when the arithmetic rule is applied does the ALU field (A) of the instruction have much significance in the token forming process. In all other cases, the token forming rule dictates the output tag and output value. Further, it should be noted that there is a close relationship between the token formation stage 44 and the ALU operation stage 42. Both rely heavily on the token forming rules (T.d). The difference between the stages is that the ALU operation stage 42 only produces non-joined output tags and output values. The combining of these outputs to produce output tokens is accomplished by the token formation stage 44.
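The switch rule given above can be illustrated with a short Python sketch. As an assumption for illustration, tags are modeled as `(c, s, p)` tuples and the d field as a pair of `(increment, port)` pairs; the function name `t_switch` is hypothetical.

```python
def t_switch(c, s, vl, vr, d):
    """Switch rule: route vl to one of two destinations, selected by vr."""
    (s1, p1), (s2, p2) = d
    if vr is True:
        return ((c, s + s1, p1), vl)   # TRUE: first destination
    if vr is False:
        return ((c, s + s2, p2), vl)   # FALSE: second destination
    raise TypeError("switch expects a boolean on the right port")

d = ((1, "l"), (5, "l"))
taken = t_switch(8, 10, 42, True, d)
not_taken = t_switch(8, 10, 42, False, d)
```

Exactly one output token is produced per activity, carrying the left value unchanged, so the ALU operation plays no part in forming the result.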
The token formation stage 44 can be implemented with the assistance of a number of multiplexers (76 and 78 in Figure 8). In particular, the values and tags produced by the ALU operation stage 42 are fed into multiplexers which select the appropriate combinations as dictated by the token forming rule (T.d). There are multiplexers 76 for the tag portion of the tokens 30 and multiplexers 78 for the value portions of the tokens 30. The select lines of the multiplexer are controlled by the opcode (T.d) of the instruction. The opcodes will be discussed in greater detail below.
Tokens 30 generated by the token forming stage 44 typically exit the pipeline 36 and return to the token queue 34. They are subsequently processed by the pipeline 36. Thus, there is a continuous flow through the pipeline until execution is completed. They may, however, travel to other destinations. The token forming stage 44 contains logic that examines the output token to see where it is destined and routes the output token accordingly.
The other destinations to which output tokens may travel include other processing elements 3 and memory units 2. An output token may travel to another processing element when a change-tag instruction (which will be described below) is executed that sends the output token to a new context associated with a different processing element 3. Further, an output token may travel to a memory unit 2 when a memory access instruction is specified by the tag. Such memory access instructions (e.g., read, write) are handled asynchronously by the memory units 2.
The processing element need not wait while the memory access is being performed; rather, it continues processing. The actual memory access is performed by a controller within the particular memory unit 2 that is accessed. If appropriate, a token is sent by the memory unit 2 back to a given processing element 3 when the memory access is completed. Such asynchronous memory accesses constitute a third level of parallelism of the present invention, for the memory accesses are performed in parallel with the computations of the pipelines 36 of the various processing elements 3.
Thus far, little has been said of the token queue 34. A traditional queue may be used, but one optimization is for the queue to be comprised of a series of stacks (see Figure 3). The idea behind the use of stacks is to try to create a sort of cache queue. A FIFO buffer would be useful, but not as practical for the present purposes. A FIFO would not easily provide for priority scheduling since, in a FIFO scheme, all tokens must wait their turn to be processed. Moreover, a FIFO controls parallelism in the wrong way. If the tasks to be executed are viewed as a tree wherein those tasks that must be executed early in the execution process to allow other tasks to be executed are near the top of the tree, a FIFO approach would unfold the tree in a breadth-first manner. What is desired is an approach that can control unfolding of the tree in a depth-first manner. A LIFO approach can control unfolding in such a depth-first manner. The LIFO approach induces more locality than the FIFO approach. More importantly, a LIFO queue can be cached from local memory, whereas a FIFO queue is difficult to cache. The preferred embodiment uses stacks that are organized by priority. The tokens 30 are removed from the highest priority stack 70 (Figure 3) until it is empty. Once the highest priority stack 70 is empty, the tokens 30 are removed from the next highest priority stack 72. This continues until execution is complete. A preferred implementation is to utilize only two stacks. Most tokens enter the highest priority stack 70, but some tokens are set aside in the next highest stack 72 to delay their processing.
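A token queue built from priority-ordered stacks might be sketched in Python as follows. The class name `TokenQueue`, the numeric priority convention (0 is highest), and the use of Python lists as stacks are assumptions for this sketch.

```python
class TokenQueue:
    """Token queue built from LIFO stacks ordered by priority."""

    def __init__(self, levels=2):
        # Index 0 is the highest priority stack.
        self.stacks = [[] for _ in range(levels)]

    def push(self, token, priority=0):
        self.stacks[priority].append(token)

    def pop(self):
        """Take the newest token from the highest non-empty stack."""
        for stack in self.stacks:
            if stack:
                return stack.pop()
        return None  # every stack is empty

q = TokenQueue()
q.push("a")
q.push("deferred", priority=1)  # set aside to delay its processing
q.push("b")
order = [q.pop(), q.pop(), q.pop(), q.pop()]
```

The most recently pushed high-priority token leaves first, which gives the depth-first unfolding described above, and the deferred token is reached only after the high-priority stack has drained.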
One option that may be used to determine priority for purposes of deciding which stack a token 30 enters is to encode the priority in the token 30 as part of the tag. This option has the advantage of being dynamic but results in an increase in the complexity of the system. Such an approach can encode priority dynamically but can also permit static control. The other option is to have the priority encoded in the destination as specified by the d field of the token forming rule of an instruction. This option is more static than encoding priority in the tag but has the benefit of being readily implemented.
This stacking scheme allows high priority tokens 30 to be processed quickly and early in the execution of the instructions. This characteristic allows instructions which are conditions precedent to the execution of a number of waiting instructions to be fully executed early in the execution process so as to free the path for execution of the waiting instructions. The net result is greater control over exposed parallelism.
Another optimization embodied within the present invention is to provide a direct path (74 in Figure 3) for output from the pipeline 36 back into the pipeline 36 input. This path 74 bypasses the token queue 34. As a result, only a single port queue is required, since only one output or one input is needed for a given cycle. If two output tokens are produced by the pipeline 36, one of the tokens 30 follows the bypass path 74 to be recirculated into the pipeline 36, and the other enters the token queue 34. If only one output token is produced by the pipeline 36, it follows the bypass path 74. If no token is produced, a token is removed from the token queue and inserted into the pipeline 36.
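The bypass policy described above can be expressed as a small Python sketch. The function name `next_input` and the use of a plain list as the queue are assumptions; which of two output tokens takes the bypass path is not specified in the text, so the choice of the first one below is arbitrary.

```python
def next_input(outputs, queue):
    """Choose the token entering the pipeline on the next cycle.

    outputs: tokens produced by the pipeline this cycle (0, 1, or 2)
    queue:   the token queue, modelled as a Python list used as a stack
    """
    if len(outputs) == 2:
        queue.append(outputs[1])  # one queue write, no queue read
        return outputs[0]         # the other token takes the bypass path
    if len(outputs) == 1:
        return outputs[0]         # bypass; the queue is untouched
    return queue.pop() if queue else None  # one queue read

q = ["x"]
r1 = next_input(["a", "b"], q)  # two outputs: "a" bypasses, "b" is queued
r2 = next_input(["c"], q)       # one output: bypass only
r3 = next_input([], q)          # no output: draw from the queue
```

Every branch performs at most one queue access per cycle, which is why a single port queue suffices.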
The present invention also concerns an optimization for exception handling. The system included in the present invention includes a special register set (41 in Figure 3). This register set 41 records activities in the pipeline 36. In particular, it records the tag, the left value (v1), and the right value (vr) of each activity in the pipeline 36. These registers figure into system operation when an exception occurs. An exception is an event such as mismatched data types in an operation or an attempt to divide by zero.
When an exception does occur, it is noted in the ALU output. The system then freezes the value of the activity which caused the exception in the register 41 and flags the activity which had an exception. A noninterruptable token is generated to call an exception handling operation. This new token follows the bypass path 74 directly back into the pipeline 36. The new token activates the exception handler 43 which consults the activity frozen in the register 41 to decide whether to interpret or restart the
offending instruction. After the exception handler 43 has made its decision, processing continues as usual.
As is evident from the above discussion, the instruction set plays a particularly important role in system operation. The instruction set comprises several different classes of instructions. One of the major classes is the dyadic arithmetic instruction class. Instructions of this class perform arithmetic operations on two input operands. A typical example of a dyadic arithmetic instruction is the add instruction. This instruction can be summarized as:
add
Inputs: (c.s_l, v_l), (c.s_r, v_r)
Outputs: (c.s'_p', v_l + v_r) [(c.s''_p'', v_l + v_r)]
where ALU rule A = +
and token forming rule T = T_arith
It produces a value that is the sum of two input operands. Another example is the fgeq instruction. This instruction can be summarized as:
fgeq
Inputs: (c.s_l, v_l), (c.s_r, v_r)
Outputs: (c.s'_p', v_l ≥ v_r) [(c.s''_p'', v_l ≥ v_r)]
A = ≥
T = T_arith
It returns a value of TRUE if the first operand is greater than or equal to the second operand and returns FALSE otherwise.
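The behavior of the two dyadic instructions summarized above can be sketched in Python as follows. This is a hedged illustration rather than the described hardware: the names ALU_RULES and execute_dyadic, the representation of a token as a ((context, statement, port), value) pair, and the passing of destinations as an explicit list are assumptions made for the example.

```python
import operator

# ALU rules named in the text: A = + for add, A = >= for fgeq.
ALU_RULES = {"add": operator.add, "fgeq": operator.ge}


def execute_dyadic(opcode, context, v_l, v_r, destinations):
    """Apply the ALU rule to two matched operands and form the output tokens.

    destinations -- one or two (statement, port) pairs; the second pair
                    models the optional bracketed output in the summaries.
    Returns a list of ((context, statement, port), value) tokens.
    """
    if not 1 <= len(destinations) <= 2:
        raise ValueError("a dyadic instruction forms one or two output tokens")
    result = ALU_RULES[opcode](v_l, v_r)
    # Every output token stays in the input context and carries the same result.
    return [((context, s, p), result) for (s, p) in destinations]
```

For example, execute_dyadic("fgeq", 7, 2.0, 3.0, [(10, "l")]) forms a single token in context 7 whose value is False, because the left operand is smaller than the right operand.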
A second class of instructions completes the set of arithmetic instructions. This second class is known as the monadic arithmetic instructions. They, like their dyadic counterparts, perform arithmetic operations. They, however, act on only one operand as opposed to two operands. A leading example of a monadic arithmetic instruction is the float instruction. This instruction can be summarized as:
float
Input: (c.s_l, v_l)
Outputs: (c.s'_p', float(v_l)) [(c.s''_p'', float(v_l))]
A = float
T = T_arith
It converts an integer operand into a floating point number.
Identity instructions represent an additional class of instruction. A primary instruction of this class is the identity instruction. This instruction can be summarized as:
identity
Input: (c.s_l, v_l)
Outputs: (c.s'_p', v_l) [(c.s''_p'', v_l)]
A = nop
T = T_switch
The identity instruction passes the value of the input operand to another context. The gate instruction, in contrast, can be summarized as:
gate
Inputs: (c.s_l, v_l), (c.s_r, x)
Outputs: (c.s'_p', v_l) [(c.s''_p'', v_l)]
A = nop
T = T_arith
The gate instruction differs in that it copies the operand of the left input port and forwards that value when a value, called a trigger, is received on the right input port.
Conditional instructions are used to institute conditional execution. The sole conditional instruction employed is the switch instruction. It demands two input operands. One of the input operands must be a value, and the other input must be a boolean. It produces one of two possible output tokens for the given inputs. Which output is chosen depends on the value of the boolean input. If the boolean input is TRUE, the first output is chosen:
switch - TRUE
Inputs: (c.s_l, v_l), (c.s_r, TRUE)
Output: (c.s'_p', v_l)
A = nop
T = T_switch
If the boolean input is FALSE, the second output token is chosen:
switch - FALSE
Inputs: (c.s_l, v_l), (c.s_r, FALSE)
Output: (c.s''_p'', v_l)
A = nop
T = T_switch
Tag manipulation is carried out by a special class of tag manipulation instructions. This class contains three basic instructions. The first such instruction, change-tag, can be summarized as:
change-tag
[Figure imgf000039_0001: inputs and outputs of the change-tag instruction]
It is useful in communicating values between contexts. As noted, it requires two inputs. The value of the left input denotes the new context, and the value of the right input equals the value to be forwarded to the new context. Together these two input values comprise an output token. In addition, there is provided the option of producing a second output token in the current context having the right input's value.
Extract-tag, in contrast, can be summarized as:
extract-tag
Input: (c.s_l, x)
Output: (c.s'_p', c.s''_p'')
A = nop
T = T_extract
It is restricted to a single output. The resulting output token has a tag that shares the context with the input operand. It has a statement number that is the sum of an increment and the statement number of the input. Moreover, the value field of the output token equals a tag in the same context as the input but with a statement number that is the sum of an additional increment and the statement number of the input.
The adjust-offset instruction can be summarized as:
adjust-offset
[Figure imgf000040_0001: inputs and outputs of the adjust-offset instruction]
It provides a more complex operation. It utilizes two inputs. The first input's value specifies a new tag, and the second input's value specifies an offset. This instruction produces up to two output tokens. The first output token is in the same context as the inputs and has a value equal to the first input's value offset by the second input's value. The second output token also has the same value and shares the same context with the inputs, but its statement number is offset.
Other instruction classes exist; however, they are not of significance to the present discussion. Nevertheless, it should be noted that the present invention is not limited exclusively to the instructions discussed in detail but rather encompasses the entire set of instructions embodied by the described classes.
The present system can readily provide for iterative operations, such as loops. It provides for loop implementation by assigning each iteration of a loop a new context. It then uses a change-tag instruction on the new context to send to the current iteration the tokens from the previous iteration. When the iteration is complete, it frees the context so that the context can be assigned to the next iteration. As such, the parent context can be passed from iteration to iteration. Hence, the loop unfolds during execution as a tail recursion of the N activation frames, where N is the number of iterations in the loop. To maintain efficiency in this approach, activation frames used in loops are recycled.
In general, an activation frame is allocated every time a procedure is called. The operating system of the data flow processing system maintains a list of free activation frames. When a procedure is called, it pops an activation frame off the free list. Similarly, when a procedure call is completed, it no longer needs its activation frame, so the frame is returned to the free list.
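The free list handling described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the class name FrameAllocator, its method names, and the use of small integers as frame identifiers are hypothetical and are not part of the described operating system.

```python
class FrameAllocator:
    """Keeps the list of free activation frames (hypothetical helper)."""

    def __init__(self, frame_ids):
        # Frames are identified here by plain integers; that is an assumption.
        self.free_list = list(frame_ids)

    def call(self):
        """Pop an activation frame off the free list for a procedure call."""
        if not self.free_list:
            raise MemoryError("no free activation frames")
        return self.free_list.pop()

    def ret(self, frame):
        """Return a frame to the free list when the call completes."""
        self.free_list.append(frame)
```

A loop that assigns a new context to each iteration and frees it when the iteration completes would call call() and ret() once per iteration, so the same few frames are recycled rather than N frames being held at once.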
The preferred embodiment of the present invention is a multiprocessor system. It thus must be able to communicate readily amongst processing elements 3. The preferred embodiment provides this capability by allowing tokens to flow freely from one processing element to another. Moreover, the present invention is easily composed from individual processing elements. Since the same entities account for intra-processor traffic as well as inter-processor traffic, composition of multiple processing elements into a multiple processor system is easily achieved. Further, the fine-grained nature of the computations that are performed in parallel likewise allows such easy composition of single processing elements 2 into the multiple processor system. The fine-grained nature of the system also provides the benefit of facilitating easy compilation of code.
Since the present invention is a multiple processor system, there must be a means of partitioning processes amongst the various processing elements 2. A simple and workable approach is to assign each processing element a region of storage (See Figure 9). The tokens specify the processing element 2 that is assigned a given context through their tags. In particular, the two leading bits of the context pointers indicate the appropriate processing element. This approach has the advantage of removing the possibility of interprocessor conflict and the advantage of providing rapid access to memory. The assignment of storage space is not necessarily fixed; hence, space used as an activation frame by one processing element 2 may subsequently be used by another processing element 2. Thus, the approach has the additional advantage of making it possible to dynamically reallocate the partitions.
One disadvantage of partitioning the address space in this way concerns the allocation of large data structures. The present system interleaves such large data structures across multiple processors to produce an even distribution of network traffic and processor loads. This interleaving is performed word-by-word. Non-interleaved and interleaved approaches are provided for by conceptually dividing the address space into regions where increments to the context advance either within a processor or across processors, depending on the subdomain specified.
A subdomain is a collection of 2^N processing elements, wherein N is specified in the N field as established by the tag of the token (described below in the discussion of decoding). If N = 0, any increments to c map onto the same processing element (See Figure 10). If N = 1, however, any increments to c alternate between two processing elements. If N = 2, increments to c alternate between four processing elements, and so on. Figure 10 illustrates the basic operation of this approach.
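The effect of the N field on word-by-word interleaving can be sketched in Python as follows. This is an illustration under assumptions: the function name pe_for_increment is hypothetical, the subdomain is assumed to begin at a processing element number that is a multiple of 2^N, and a simple round-robin order is assumed. The actual interleaving strategies are those selected by the HASH indicator and listed in Figure 13, which is not reproduced here.

```python
def pe_for_increment(base_pe, n, k):
    """Processing element reached after k increments to c.

    base_pe -- processing element holding the starting context
    n       -- value of the N field; the subdomain has 2**n elements
    k       -- number of increments applied to c
    """
    size = 1 << n                       # 2**n processing elements
    first = base_pe - (base_pe % size)  # first element of the (assumed aligned) subdomain
    # Round-robin over the subdomain; the order is an assumption of this sketch.
    return first + (base_pe + k) % size
```

With n = 0 every increment stays on the starting element, with n = 1 successive increments alternate between two elements, and with n = 2 they cycle through four.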
Given that each processor is assigned exclusive activation frames, the question arises whether code is also shared. Generally, it is preferred that a copy of the code for each code block be present on every processing element that executes the code block. Hence, the destination instructions are local and non-blocking. The result is to heighten design simplicity at the minimal cost of a larger memory. Alternatively, a cache system may be utilized.
It was noted previously that each memory location has associated presence bits. It is helpful if a large number of these presence bits can be changed simultaneously with a single instruction. To facilitate this capability, presence bits for adjacent locations are coalesced into words of a size equal to a machine word. In the preferred embodiment, there are two presence bits for each memory location. Thirty-two bits constitute a word. Thus, 16 sets of presence bits are stored as a word.
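The coalescing of presence bits into words can be sketched in Python as follows. This is a sketch under assumptions: the function names are hypothetical, and placing the presence bits of the lowest-numbered location in the least significant bits of the word is an arbitrary choice made for the example.

```python
BITS_PER_LOCATION = 2
WORD_BITS = 32
LOCATIONS_PER_WORD = WORD_BITS // BITS_PER_LOCATION  # 16 sets per word


def get_presence(words, location):
    """Read the 2 presence bits of one memory location."""
    word, slot = divmod(location, LOCATIONS_PER_WORD)
    return (words[word] >> (slot * BITS_PER_LOCATION)) & 0b11


def set_presence(words, location, state):
    """Write the 2 presence bits of one memory location."""
    word, slot = divmod(location, LOCATIONS_PER_WORD)
    shift = slot * BITS_PER_LOCATION
    cleared = words[word] & ~(0b11 << shift)
    words[word] = (cleared | ((state & 0b11) << shift)) & 0xFFFFFFFF


def clear_word(words, word):
    """Reset the presence bits of 16 adjacent locations with one write."""
    words[word] = 0
```

The last function shows the benefit named in the text: a single word write changes the presence bits of sixteen adjacent locations at once.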
Another concern in implementing the system is the word length chosen. The preferred embodiment utilizes a 72 bit word wherein 64 bits are a value field and 8 bits are a type field. Tokens are comprised of two words (See Figure 11). The first word is the tag and the second word is a value-part. Tags are of data type, TAG, whereas value-parts may be of several data types including TAG, FLT (floating point), INT (integer) and BITS (unsigned integer).
Tags can be further broken down into a number of fields (See Figure 12). The leading bit of a tag indicates the port of the operand. A zero indicates the left port, and a one indicates the right port. The next 7 bits are the MAP field (See Figure 13). The first 2 bits of the MAP field are the HASH indicator, which selects an interleaving strategy. The strategies are listed in the table of Figure 13. The other 5 bits are the N field, which equals the logarithm base two of the number of processors in the subdomain.
Following the MAP field is a 24 bit IP field that gives the instruction pointer address of an instruction on the processor specified by the processing element, PE, field. With memory access tokens, the PE field actually specifies a memory unit 2 rather than a processing element 3. The PE field follows the instruction pointer, IP, field and is 10 bits long. The final field of a tag is the FP field. It is a frame pointer and is 22 bits long. It points to the particular frame amongst those assigned to a particular subdomain.
The tag c.s is formed primarily out of the PE, FP and IP fields. In particular, c is comprised of the frame pointer FP, and the processing element designation PE such that PE comprises the most significant bits of c and FP comprises the least significant bits of c. Further, s is comprised of PE and IP. PE comprises the most significant bits of s, and IP comprises the least significant bits of s.
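The tag layout described above, and the way c and s are formed from it, can be sketched in Python as follows. This is an illustrative sketch: the names FIELDS, pack_tag, unpack_tag and context_and_statement are hypothetical, and reading "leading bit" as the most significant bit of the 64 bit value field is an assumption. The field widths (1, 2, 5, 24, 10 and 22 bits) are those given in the text and add up to 64.

```python
# Field order from the leading bit onward: PORT, MAP (HASH + N), IP, PE, FP.
FIELDS = [("port", 1), ("hash", 2), ("n", 5), ("ip", 24), ("pe", 10), ("fp", 22)]


def pack_tag(**values):
    """Pack the named fields into a 64 bit tag, leading field first."""
    tag = 0
    for name, width in FIELDS:
        v = values[name]
        if not 0 <= v < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        tag = (tag << width) | v
    return tag


def unpack_tag(tag):
    """Split a 64 bit tag back into its named fields."""
    out = {}
    for name, width in reversed(FIELDS):
        out[name] = tag & ((1 << width) - 1)
        tag >>= width
    return out


def context_and_statement(fields):
    """Form c and s with PE as the most significant bits of each."""
    c = (fields["pe"] << 22) | fields["fp"]  # PE above the 22 bit FP
    s = (fields["pe"] << 24) | fields["ip"]  # PE above the 24 bit IP
    return c, s
```

Because PE occupies the most significant bits of both c and s in this sketch, a frame and the instruction acting on it are named relative to the same processing element.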
Instructions are only 32 bits long and are comprised of only four fields: OPCODE, r, PORT and s (See Figure 14). OPCODE is 10 bits long and specifies an instruction OPCODE. Also 10 bits long is r. It is an unsigned offset used to compute the effective address of the operand. PORT defines the port for the destination tags. Lastly, s is an offset to IP for one of the destination tags in twos complement form. If two destination tags are required by a given instruction, the second tag is generated by adding 1 to the incoming IP and setting port to 1.
Knowing the formats for the tags, tokens and instructions, one can look at the operand matching stage 40 of the pipeline 36 in more detail (Figure 3). Specifically, the operand matching stage 40 can be subdivided into three substages. The first substage 80 computes the effective address of the memory location where the operand matching takes place. The second substage 82 operates on the presence bits, and the third substage 84 either fetches or stores the operand at the effective address.
During the first substage (See "First Level Decode" in Figure 15), the 10 bit OPCODE field 90 is used as an address to a first level decode table. Entries 109 in the first level decode table have four fields. A BASE field 92 specifies the base address for an entry in the second level decode. The TMAP field 94, on the other hand, specifies one of 32 type maps. In addition, the PMAP field 96 specifies one of 64 presence maps and, lastly, the EA field 98 specifies the effective address generation mode.
The OPCODE 90 is used to look up a first level decode table entry 109, and the EA field 98 of the entry is examined. It is 2 bits long. If both bits are zero, the address equals FP + r; if both bits are one, the address equals r; and if the leading bit is one and the trailing bit is a zero, the address equals IP + r.
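The three effective address generation modes described above can be sketched in Python as follows. This is a minimal sketch: the function name effective_address is hypothetical, and because the text does not describe the remaining bit pattern (leading bit zero, trailing bit one), the sketch rejects it rather than guessing its meaning.

```python
def effective_address(ea_mode, fp, ip, r):
    """Compute the operand matching address from the 2 bit EA field.

    ea_mode -- 2 bit EA field of the first level decode entry
    fp, ip  -- frame pointer and instruction pointer from the token's tag
    r       -- unsigned offset from the instruction
    """
    if ea_mode == 0b00:
        return fp + r  # frame relative
    if ea_mode == 0b11:
        return r       # absolute
    if ea_mode == 0b10:
        return ip + r  # instruction pointer relative
    raise ValueError("EA mode 0b01 is not described in the text")
```

For example, with fp = 100, ip = 200 and r = 3, the three described modes give addresses 103, 3 and 203 respectively.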
During this first substage 80 (Figure 3), the TMAP field (94 in Figure 15) is also examined. As mentioned above, the TMAP field 94 selects one of 32 type maps. The type maps are two dimensional arrays of size 256 by 2. Each entry has 2 bits. These 2 bits represent a mapping from the port and the data type of the value of the token in the first substage. To conclude the substage, the token, effective address and type code are passed on to the next substage 82 (Figure 3).
The second substage 82 is the presence bits substage. In this substage 82 the system looks at the memory location specified by the effective address. From the location it reads the 2 presence bits. It uses these bits, along with the port of the token in the substage, the type code bits 100 that were read from the type map, and the PMAP field 96 of the first level decode entry, to look up an entry 110 in the presence map table. The presence map table has sixty-four entries and each entry has four fields. The BRA field 102 is for four-way branch control and will be discussed below. The FZ field 104 determines whether the force-to-zero override is exerted. It, likewise, will be discussed more below. The FOP field 106 specifies which operand fetch/store operation is to be performed (i.e., read, write, exchange, or exchange-decrement). The final field 108 specifies the new value of the presence bits. It is denoted as NEXT.
In the third and final substage 84, the fetch/store operation specified by the FOP field is carried out. The contents of the location specified by the effective address are passed on to the next stage of the pipeline. Furthermore, the BRA field 102 of the presence map and the BASE field 92 of the first level decode entry are ORed to produce an address in a second level decode. The second level decode table entry is used to specify parameters used in system operation. If the FZ field 104 is one, the BASE field 92 is forced to zero before being ORed. The net result is that the second level decode table entry 111 is then selected by an absolute address of 0, 1, 2 or 3.
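The formation of the second level decode address can be sketched in Python as follows. This is a hedged illustration: the function name second_level_address is hypothetical, and the width of the BASE field is not assumed.

```python
def second_level_address(base, bra, fz):
    """Combine BASE, BRA and FZ into a second level decode address.

    base -- BASE field of the first level decode entry
    bra  -- 2 bit BRA field of the presence map entry (four-way branch)
    fz   -- FZ field; when one, BASE is forced to zero before the OR
    """
    if not 0 <= bra <= 3:
        raise ValueError("BRA is a four-way branch selector (0 to 3)")
    if fz:
        base = 0  # force-to-zero override
    return base | bra
```

With fz set, the result depends only on bra, so the address is one of the absolute locations 0, 1, 2 or 3 regardless of the opcode's BASE value.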
The above-described system relieves the programmer of the herculean task of instilling parallelism into the data processing system. Since parallelism is inherent in the instruction set and implemented via hardware, this system is much easier to use than present parallel processing machines. The matching mechanism has been simplified so as to remove the inbred complexity found in associative memory systems. The unique memory design is primarily responsible for the diminished overhead.
In addition, the fine-grained nature of the present system provides a number of benefits. First, it exposes the maximum amount of parallelism. Second, it can readily tolerate memory latency because it can process other tokens while a memory access request is being serviced. Third, it is easier to compile than coarse-grained systems. In sum, the present invention optimizes performance, simplicity and cost effectiveness.
While the invention has been particularly shown and described with reference to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A data processing system comprising
a) at least one processor for processing arithmetic/logical instructions;
b) a plurality of memory locations that are each accessed by the at least one processor of the data processing system to match operands of an arithmetic/logical instruction processed by said at least one processor; and
c) logic means for detecting a current state of a memory location accessed by the at least one processor to determine whether an operand is present at the memory location so as to facilitate proper matching of operands.
2. A data processing system as recited in Claim 1 wherein each memory location is the sole location wherein matching of operands of a particular execution of an instruction takes place and each memory location contains a state field as well as a value field such that the significance of the value field depends on the state field.
3. A data processing system as recited in Claim 1 wherein each processor comprises:
a) a buffer for storing tokens that represent operands as the tokens wait to be processed;
b) a processing pipeline in communication with the memory locations and the buffer for processing tokens stored in the buffer, comprising
an instruction fetch stage for fetching instructions to manipulate the tokens in the pipeline;
an operand matching stage for matching the operands responsive to the fetched instruction;
an operation stage for performing operations specified by the fetched instruction; and
a token formation stage for forming new tokens carrying results of the operations specified by the fetched instruction.
4. A data processing system as recited in Claim 3 wherein each token comprises a tag for indicating an address for an instruction that acts upon the token as well as an activation frame, and also comprises a value for storing a piece of data.
5. A data processing system as recited in Claim 3 wherein a fetched instruction encodes a matching rule for matching operands, a rule for computing an effective address of a storage location on which the matching rule operates, an ALU operation to be performed by an ALU of the data processing system, and a token forming rule for forming new tokens that result from execution of the instruction.
6. A data processing system as recited in Claim 1, 2, 3, 4 or 5 wherein the data processing system is a parallel multiple processor system.
7. A data processing system as recited in Claim 1, 2, 3, 4 or 5 wherein the data processing system is a data flow processing system.
8. A data processing system as recited in Claim 1 wherein the data processing system is a tagged token data processing system which relies on non-associative memory for matching operands, wherein, in order to match operands, the system directs operands to a shared memory location.
9. A data processing system which is a tagged token data processing system having a processing pipeline and memory and relies on non-associative memory for matching operands wherein, in order to match operands, the system directs operands to a shared memory location.
10. A data processing system as recited in Claim 3 or 9 comprising a plurality of registers for recording activities associated with the tokens in each stage of the pipeline so that if an exception occurs, an activity that caused the exception is available, and an exception handler for resolving exceptions that occur in the pipeline that examines the plurality of registers to find an activity for each exception that occurs.
11. A data processing system as recited in Claim 9 wherein each shared memory location includes a value field and a state field.
12. A data processing system as recited in Claim 2 or 11 wherein the system obeys matching rules which include a sticky matching rule that tells the system to write a value of a token into a value field of a memory location if a state field of the location indicates that another value is not present, and changes the state field of the location to indicate that a value is present if the value of the token is not a constant, and changes the state field of the location to indicate that a constant is present if the value of the token is a constant.
13. A data processing system as recited in Claim 3 or 11 wherein instructions are decoded prior to matching of operands so that locations of operands are known prior to matching of operands to facilitate easy matching of said operands.
14. In a data processing system, a method of executing an arithmetic/logical instruction comprising the steps of:
a) fetching the arithmetic/logical instruction;
b) accessing a memory location indicated by the arithmetic/logical instruction to locate an operand to be used in execution of the arithmetic/logical instruction; and
c) determining whether to perform the arithmetic/logical instruction by examining a current state of the memory location.
15. In a data flow processing system, a method comprising:
providing tokens, each token comprising a frame pointer, an instruction pointer and a value, the instruction pointer pointing to an instruction which processes the values of tokens with identical frame pointers and identical instruction pointers;
responsive to a token, addressing a storage location in memory identified by the frame pointer and the instruction pointer;
determining whether a value is stored in the addressed storage location, and
if a value is stored, performing the operation determined by the instruction pointer on the stored value and the value of the token to create a new token, and
if the value is not stored, storing the value of the token in the storage location.
16. In a data processing system, a method of matching operands of a single instruction, comprising the steps of:
a) storing a first available operand of the instruction in an empty memory location;
b) altering a state of the memory location to reflect that an operand of the instruction is stored at the memory location;
c) locating a second available operand of the instruction;
d) checking the memory location to see if the first available operand of the instruction is stored therein; and
e) reading the first available operand from the memory location and sending the operands to a processing means to execute the instruction.
17. A method as recited in Claim 16 further comprising the step of changing the state of the memory location to indicate that the memory location is empty after the step of sending the operands to a processing means.
18. A method as recited in Claim 16 wherein the data processing system is a multiple processor system.
19. A method of handling exceptions in a pipelined data processing system comprising
a) storing activities in a register as they enter a pipeline;
b) flagging the activity when an exception occurs;
c) halting processing of the activity;
d) freezing the value of the activity in the register so that its value may not be changed or replaced;
e) replacing the activity with a noninterruptible exception activity that remedies the exception;
f) processing the exception activity; and
g) continuing the processing of the activity.
20. In a tagged token data flow processing system, a token data object comprising
a) a tag comprising
a pointer to the beginning of an activation frame; and
a pointer to an instruction; and
b) a value.
PCT/US1989/005105 1988-11-18 1989-11-16 Data flow multiprocessor system WO1990005950A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP89912813A EP0444088B1 (en) 1988-11-18 1989-11-16 Data flow multiprocessor system
DE68925646T DE68925646T2 (en) 1988-11-18 1989-11-16 PIPELINE MULTIPROCESSOR SYSTEM

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US27449888A 1988-11-18 1988-11-18
US274,498 1988-11-18
US07/396,480 US5241635A (en) 1988-11-18 1989-08-21 Tagged token data processing system with operand matching in activation frames
US396,480 1989-08-21

Publications (1)

Publication Number Publication Date
WO1990005950A1 true WO1990005950A1 (en) 1990-05-31

Family

ID=26956858

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1989/005105 WO1990005950A1 (en) 1988-11-18 1989-11-16 Data flow multiprocessor system

Country Status (5)

Country Link
US (1) US5241635A (en)
EP (1) EP0444088B1 (en)
JP (1) JPH04503416A (en)
DE (1) DE68925646T2 (en)
WO (1) WO1990005950A1 (en)

JP3503638B1 (en) * 2002-09-26 2004-03-08 日本電気株式会社 Cryptographic device and cryptographic program
US7979384B2 (en) * 2003-11-06 2011-07-12 Oracle International Corporation Analytic enhancements to model clause in structured query language (SQL)
US8074051B2 (en) 2004-04-07 2011-12-06 Aspen Acquisition Corporation Multithreaded processor with multiple concurrent pipelines per thread
US7716455B2 (en) * 2004-12-03 2010-05-11 Stmicroelectronics, Inc. Processor with automatic scheduling of operations
RU2281546C1 (en) * 2005-06-09 2006-08-10 Бурцева Тамара Андреевна Method for processing information on basis of data stream and device for realization of said method
US7451293B2 (en) * 2005-10-21 2008-11-11 Brightscale Inc. Array of Boolean logic controlled processing elements with concurrent I/O processing and instruction sequencing
TW200806039A (en) * 2006-01-10 2008-01-16 Brightscale Inc Method and apparatus for processing algorithm steps of multimedia data in parallel processing systems
US8244718B2 (en) * 2006-08-25 2012-08-14 Teradata Us, Inc. Methods and systems for hardware acceleration of database operations and queries
WO2008027567A2 (en) * 2006-09-01 2008-03-06 Brightscale, Inc. Integral parallel machine
US20080059762A1 (en) * 2006-09-01 2008-03-06 Bogdan Mitu Multi-sequence control for a data parallel system
US20080059763A1 (en) * 2006-09-01 2008-03-06 Lazar Bivolarski System and method for fine-grain instruction parallelism for increased efficiency of processing compressed multimedia data
US9563433B1 (en) 2006-09-01 2017-02-07 Allsearch Semi Llc System and method for class-based execution of an instruction broadcasted to an array of processing elements
US20080244238A1 (en) * 2006-09-01 2008-10-02 Bogdan Mitu Stream processing accelerator
US20080055307A1 (en) * 2006-09-01 2008-03-06 Lazar Bivolarski Graphics rendering pipeline
US20080059467A1 (en) * 2006-09-05 2008-03-06 Lazar Bivolarski Near full motion search algorithm
WO2009147613A1 (en) * 2008-06-02 2009-12-10 Nxp B.V. Viewer credit account for a multimedia broadcasting system
GB2471067B (en) 2009-06-12 2011-11-30 Graeme Roy Smith Shared resource multi-thread array processor
WO2013100783A1 (en) 2011-12-29 2013-07-04 Intel Corporation Method and system for control signalling in a data path module
US10331583B2 (en) 2013-09-26 2019-06-25 Intel Corporation Executing distributed memory operations using processing elements connected by distributed channels
US10067871B2 (en) * 2014-12-13 2018-09-04 Via Alliance Semiconductor Co., Ltd Logic analyzer for detecting hangs
US11106467B2 (en) 2016-04-28 2021-08-31 Microsoft Technology Licensing, Llc Incremental scheduler for out-of-order block ISA processors
US10572376B2 (en) 2016-12-30 2020-02-25 Intel Corporation Memory ordering in acceleration hardware
US10223002B2 (en) * 2017-02-08 2019-03-05 Arm Limited Compare-and-swap transaction
US10515049B1 (en) 2017-07-01 2019-12-24 Intel Corporation Memory circuits and methods for distributed memory hazard detection and error recovery
US10515046B2 (en) 2017-07-01 2019-12-24 Intel Corporation Processors, methods, and systems with a configurable spatial accelerator
US10469397B2 (en) 2017-07-01 2019-11-05 Intel Corporation Processors and methods with configurable network-based dataflow operator circuits
US11086816B2 (en) 2017-09-28 2021-08-10 Intel Corporation Processors, methods, and systems for debugging a configurable spatial accelerator
US10496574B2 (en) 2017-09-28 2019-12-03 Intel Corporation Processors, methods, and systems for a memory fence in a configurable spatial accelerator
US10564980B2 (en) * 2018-04-03 2020-02-18 Intel Corporation Apparatus, methods, and systems for conditional queues in a configurable spatial accelerator
US11307873B2 (en) 2018-04-03 2022-04-19 Intel Corporation Apparatus, methods, and systems for unstructured data flow in a configurable spatial accelerator with predicate propagation and merging
US10853073B2 (en) 2018-06-30 2020-12-01 Intel Corporation Apparatuses, methods, and systems for conditional operations in a configurable spatial accelerator
US11200186B2 (en) 2018-06-30 2021-12-14 Intel Corporation Apparatuses, methods, and systems for operations in a configurable spatial accelerator
US10891240B2 (en) 2018-06-30 2021-01-12 Intel Corporation Apparatus, methods, and systems for low latency communication in a configurable spatial accelerator
US10678724B1 (en) 2018-12-29 2020-06-09 Intel Corporation Apparatuses, methods, and systems for in-network storage in a configurable spatial accelerator
US10817291B2 (en) 2019-03-30 2020-10-27 Intel Corporation Apparatuses, methods, and systems for swizzle operations in a configurable spatial accelerator
US10915471B2 (en) 2019-03-30 2021-02-09 Intel Corporation Apparatuses, methods, and systems for memory interface circuit allocation in a configurable spatial accelerator
US11037050B2 (en) 2019-06-29 2021-06-15 Intel Corporation Apparatuses, methods, and systems for memory interface circuit arbitration in a configurable spatial accelerator

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4511958A (en) * 1978-01-30 1985-04-16 Patelhold Patentverwertungs- & Elektro-Holding Ag Common bus access system using plural configuration tables for failure tolerant token passing among processors
US4591971A (en) * 1981-10-15 1986-05-27 National Research Development Corporation Method and apparatus for parallel processing of digital signals using multiple independent signal processors
US4675806A (en) * 1982-03-03 1987-06-23 Fujitsu Limited Data processing unit utilizing data flow ordered execution
US4757440A (en) * 1984-04-02 1988-07-12 Unisys Corporation Pipelined data stack with access through-checking
US4780820A (en) * 1983-11-07 1988-10-25 Masahiro Sowa Control flow computer using mode and node driving registers for dynamically switching between parallel processing and emulation of von neuman processors
US4841436A (en) * 1985-05-31 1989-06-20 Matsushita Electric Industrial Co., Ltd. Tag Data processing apparatus for a data flow computer

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4153932A (en) * 1974-03-29 1979-05-08 Massachusetts Institute Of Technology Data processing apparatus for highly parallel execution of stored programs
US4145733A (en) * 1974-03-29 1979-03-20 Massachusetts Institute Of Technology Data processing apparatus for highly parallel execution of stored programs
US4149240A (en) * 1974-03-29 1979-04-10 Massachusetts Institute Of Technology Data processing apparatus for highly parallel execution of data structure operations
US3962706A (en) * 1974-03-29 1976-06-08 Massachusetts Institute Of Technology Data processing apparatus for highly parallel execution of stored programs
US4130885A (en) * 1976-08-19 1978-12-19 Massachusetts Institute Of Technology Packet memory system for processing many independent memory transactions concurrently
US4128882A (en) * 1976-08-19 1978-12-05 Massachusetts Institute Of Technology Packet memory system with hierarchical structure
US4229790A (en) * 1978-10-16 1980-10-21 Denelcor, Inc. Concurrent task and instruction processor and method
US4837676A (en) * 1984-11-05 1989-06-06 Hughes Aircraft Company MIMD instruction flow computer architecture
JPS61276032A (en) * 1985-05-31 1986-12-06 Matsushita Electric Ind Co Ltd Information processing device
JP2564805B2 (en) * 1985-08-08 1996-12-18 日本電気株式会社 Information processing device
US5021947A (en) * 1986-03-31 1991-06-04 Hughes Aircraft Company Data-flow multiprocessor architecture with three dimensional multistage interconnection network for efficient signal and data processing
US4811214A (en) * 1986-11-14 1989-03-07 Princeton University Multinode reconfigurable pipeline computer
US4893234A (en) * 1987-01-15 1990-01-09 United States Department Of Energy Multi-processor including data flow accelerator module
US4922413A (en) * 1987-03-24 1990-05-01 Center For Innovative Technology Method for concurrent execution of primitive operations by dynamically assigning operations based upon computational marked graph and availability of data
US4916652A (en) * 1987-09-30 1990-04-10 International Business Machines Corporation Dynamic multiple instruction stream multiple data multiple pipeline apparatus for floating-point single instruction stream single data architectures

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5499349A (en) * 1989-05-26 1996-03-12 Massachusetts Institute Of Technology Pipelined processor with fork, join, and start instructions using tokens to indicate the next instruction for each of multiple threads of execution
WO2001097054A2 (en) * 2000-06-13 2001-12-20 Synergestic Computing Systems Aps Synergetic data flow computing system
WO2001097054A3 (en) * 2000-06-13 2002-04-11 Synergestic Computing Systems Synergetic data flow computing system

Also Published As

Publication number Publication date
DE68925646D1 (en) 1996-03-21
JPH04503416A (en) 1992-06-18
EP0444088B1 (en) 1996-02-07
US5241635A (en) 1993-08-31
EP0444088A1 (en) 1991-09-04
DE68925646T2 (en) 1996-10-17

Similar Documents

Publication Publication Date Title
EP0444088B1 (en) Data flow multiprocessor system
Gurd et al. The Manchester prototype dataflow computer
Dally et al. The message-driven processor: A multicomputer processing node with efficient mechanisms
Melhem Introduction to parallel computing
Thistle et al. A processor architecture for Horizon
Veen Dataflow machine architecture
Lee et al. Issues in dataflow computing
Goldstein et al. Enabling primitives for compiling parallel languages
Amamiya et al. Datarol: A massively parallel architecture for functional languages
Ruggiero Throttle Mechanisms for the Manchester Dataflow Machine
Oliker et al. Parallelization of a dynamic unstructured application using three leading paradigms
Hum et al. A high-speed memory organization for hybrid dataflow/von Neumann computing
Culler Multithreading: Fundamental limits, potential gains, and alternatives
Hum et al. A novel high-speed memory organization for fine-grain multi-thread computing
Gao et al. Towards an efficient hybrid dataflow architecture model
Haines Distributed runtime support for task and data management
Krishnan et al. Executing sequential binaries on a clustered multithreaded architecture with speculation support
Najjar et al. A Quantitative Analysis of Dataflow Program Execution-Preliminaries to a Hybrid Design
Topham et al. Context flow: An alternative to conventional pipelined architectures
Yang et al. Fei teng 64 stream processing system: architecture, compiler, and programming
Dai et al. A basic architecture supporting LGDG computation
Anderson et al. Design overview of the Shiva
Nicolau et al. ROPE: a statically scheduled supercomputer architecture
Formella et al. Cost effectiveness of data flow machines and vector processors
Ha et al. Design and Implementation of a Massively Parallel Multithreaded Architecture: DAVRID

Legal Events

Date Code Title Description
AK Designated states
Kind code of ref document: A1
Designated state(s): JP

AL Designated countries for regional patents
Kind code of ref document: A1
Designated state(s): AT BE CH DE ES FR GB IT LU NL SE

WWE WIPO information: entry into national phase
Ref document number: 1989912813
Country of ref document: EP

WWP WIPO information: published in national office
Ref document number: 1989912813
Country of ref document: EP

WWG WIPO information: grant in national office
Ref document number: 1989912813
Country of ref document: EP