US20050005084A1 - Scalable processing architecture - Google Patents

Scalable processing architecture

Info

Publication number
US20050005084A1
Authority
US
United States
Prior art keywords
instructions
preselected
instruction
group
computation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US10/829,668
Inventor
Douglas Burger
Stephen Keckler
Karthikeyan Sankaralingam
Ramadass Nagarajan
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Texas System
Original Assignee
University of Texas System
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Texas System filed Critical University of Texas System
Priority to US10/829,668 priority Critical patent/US20050005084A1/en
Assigned to BOARD OF REGENTS, THE UNIVERSITY OF TEXAS SYSTEM reassignment BOARD OF REGENTS, THE UNIVERSITY OF TEXAS SYSTEM ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: BURGER, DOUG, KECKLER, STEPHEN W., NAGARAJAN, RAMADASS, SANKARALINGAM, KARTHIKEYAN
Publication of US20050005084A1 publication Critical patent/US20050005084A1/en
Assigned to NATIONAL SCIENCE FOUNDATION reassignment NATIONAL SCIENCE FOUNDATION CONFIRMATORY LICENSE (SEE DOCUMENT FOR DETAILS). Assignors: UNIVERSITY OF TEXAS AUSTIN
Priority to US12/136,645 priority patent/US8055881B2/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/80 Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8007 Single instruction multiple data [SIMD] multiprocessors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/448 Execution paradigms, e.g. implementations of programming paradigms
    • G06F9/4494 Data-driven execution paradigms

Definitions

  • A processor or control logic 220 may access some form of computer-readable media, such as the memory 276.
  • A system 260 having nodes 200 may also include a processor 220 coupled to a memory 276, volatile or nonvolatile.
  • Computer-readable media may comprise computer storage media and communications media.
  • Computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for the storage of information, such as computer-readable instructions, data structures, program modules, or other data.
  • Communications media specifically embodies computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, coded information signal, and/or other transport mechanism, and includes any information delivery media.
  • The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
  • Communications media also includes wired media, such as a wired network or direct-wired connections, and wireless media, such as acoustic, optical, radio frequency, infrared, and other wireless media. Combinations of any of the above are also included within the scope of computer-readable and/or accessible media.
  • Another embodiment of the invention may include an article 290 comprising a machine-accessible medium or memory 276 having associated data, wherein the data, when accessed, results in a machine performing activities such as partitioning a program into a plurality of groups of instructions, assigning a group of instructions selected from the plurality of groups to a plurality of interconnected preselected computation nodes, loading the group of instructions onto the nodes, and executing the group of instructions as each instruction in the group receives all necessary associated operands for execution.
  • Other activities may include partitioning the program into the plurality of groups of instructions, as performed by a compiler or a run-time trace mapper; statically assigning all of the plurality of groups of instructions for execution; and dynamically issuing one or more instructions selected from one or more of the plurality of groups for execution.
  • Still further activities can include generating a wakeup token to reserve an output data channel connecting selected computation nodes; routing output data arising from executing the group of instructions to one or more consumer nodes, wherein the address of each consumer node is included in a token associated with at least one instruction in the group; detecting execution termination of a group of instructions (including an output having architecturally visible data); and committing the architecturally visible data to a register file and/or a memory.

Abstract

A computation node according to various embodiments of the invention includes at least one input port capable of being coupled to at least one first other computation node, a first store coupled to the input port(s) to store input data, a second store to receive and store instructions, an instruction wakeup unit to match the input data to the instructions, at least one execution unit to execute the instructions, using the input data to produce output data, and at least one output port capable of being coupled to at least one second other computation node. The node may also include a router to direct the output data from the output port(s) to the second other node. A system according to various embodiments of the invention includes an external instruction sequencer to fetch a group of instructions, and one or more interconnected, preselected computation nodes. An article according to an embodiment of the invention includes a medium having instructions capable of causing a machine to partition a program into a plurality of groups of instructions, assign one or more of the instruction groups to a plurality of interconnected, preselected computation nodes, load the instruction groups onto the nodes, and execute the instruction groups as each instruction in each group receives all necessary associated operands for execution.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application is a continuation under 35 U.S.C. 111(a) of PCT/US02/34965, filed on Oct. 31, 2002 and published in English on May 8, 2003, which claims priority under 35 U.S.C. 119(e) to U.S. Provisional Application Ser. No. 60/334,764, filed on Oct. 31, 2001, which is incorporated herein by reference.
  • TECHNICAL FIELD
  • Embodiments of the invention relate to apparatus, systems, articles, and methods for data processing using distributed computational elements.
  • BACKGROUND OF THE INVENTION
  • The performance of conventional microarchitectures, measured in Instructions Per Cycle (IPC), has improved by approximately 50-60% per year. This growth has typically been achieved by increasing the number of transistors on a chip and/or increasing the instruction cycle clock speed. However, these results will not continue to scale with respect to future technologies (90 nanometers and below), because fundamental pipelining limits and wire delays bind such architectures to their data communications systems.
  • Instruction-Level Parallelism (ILP), in which multiple instructions are executed in parallel, offers another way to convert a growing transistor budget into computational performance. One approach to increasing the use of ILP is via conventional superscalar processor cores that detect parallelism at run-time. However, the amount of ILP that can be detected is limited by the issue window, the complexity of which grows as the square of the number of entries. Conventional superscalar architectures also rely on frequently accessed global structures, which slow down the system clock or increase the depth of the pipeline.
  • Another approach to the implementation of parallel processing is taken by VLIW machines, where ILP analysis is performed at compile time. Instruction scheduling is done by the compiler, orchestrating the flow of execution in a static manner. However, this approach works well only for statically predictable codes, and suffers when dynamic events occur: a run-time stall in one functional unit, or a cache miss, forces the entire machine to stall, since all functional units are synchronized. Thus, there is a need for new computational architectures that capitalize on the transistor miniaturization trend while overcoming communications bottlenecks.
  • SUMMARY OF THE INVENTION
  • The apparatus, systems, and methods described herein provide a simplified approach to increasing the amount of ILP that can be applied to programs, taking advantage of multiple, interconnected, and possibly identical computation nodes. The essence of the approach involves scheduling instructions statically for execution across specially preselected, interconnected nodes, and then issuing the instructions dynamically for execution.
  • A computation node according to various embodiments of the invention includes an input port capable of being coupled to at least one first other computation node, a first store to store input data, a second store to receive and store instructions, an instruction wakeup unit to match input data to instructions, at least one execution unit to execute the instructions and produce output data from the input data, and an output port capable of being coupled to at least one second other computation node. The node may also include a router to direct the output data from the output port to the second other node. A system according to various embodiments of the invention can include one or more interconnected, preselected computation nodes and an external instruction sequencer (coupled to the second store in the nodes) to fetch instruction groups.
  • An article according to an embodiment of the invention includes a medium having instructions capable of causing a machine to partition a program into a plurality of groups of instructions, assign one or more of the instruction groups to a plurality of interconnected, preselected computation nodes, load the instruction groups on to the nodes, and execute the instruction groups as each instruction in each group receives all necessary associated operands for execution.
  • This summary is intended to provide an exemplary overview of the subject matter further described hereinbelow. It is not intended to provide an exhaustive or exclusive explanation of various embodiments of the invention. The Detailed Description which follows is included to provide further information about such embodiments.
  • DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a schematic block diagram of an apparatus according to an embodiment of the invention;
  • FIG. 2 is a schematic block diagram of a system according to an embodiment of the invention; and
  • FIG. 3 is a flow diagram illustrating a method according to an embodiment of the invention.
  • DETAILED DESCRIPTION
  • In the following detailed description of various embodiments of the invention, information with respect to making and using the various embodiments, including a best mode of practicing such embodiments, is provided. Thus, reference is made to the accompanying drawings which form a part hereof, and in which are shown by way of illustration, and not of limitation, specific embodiments in which the invention may be practiced. In the drawings, like numerals describe substantially similar components throughout the several views. The embodiments illustrated are described in sufficient detail to enable those skilled in the art to practice the teachings disclosed herein. Other embodiments may be utilized and derived therefrom, such that electrical, structural, and logical substitutions and changes may be made without departing from the scope of this disclosure. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of various embodiments of the invention is defined only by the appended claims, along with the full range of equivalents to which such claims are entitled.
  • Herein a new architecture is disclosed that takes into consideration the technology constraints of wire delays and pipelining limits. Instructions are scheduled statically for execution on a computation substrate, and then issued dynamically. The computation substrate can be configured as a two- (or more) dimensional grid of computation nodes communicatively coupled via an interconnection network. A compiler partitions the program into a sequence of blocks (e.g., basic blocks or hyperblocks), performs renaming of temporaries, and schedules instructions in a block to nodes in the grid. Instruction traces generated at run-time can also be used instead of (or in addition to) blocks generated by the compiler. In either case, blocks and/or traces are fetched one at a time and their instructions are assigned or mapped to the computation nodes en masse. Execution proceeds in a dataflow fashion, with each instruction sending its results directly to other instructions that use them. A set of interfaces is used by the computation substrate to access external data.
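  • To make this flow concrete, the following minimal sketch (in Python) shows one way a compiler-like tool might partition a linear instruction sequence into fixed-size blocks and statically map each block onto a two-dimensional grid before any instruction is fetched. The toy instruction format, block size, and row-major placement are illustrative assumptions, not the scheduler of this disclosure:

```python
# Hypothetical sketch: partition a program into instruction groups and
# statically map each group onto a 2-D grid of computation nodes.
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    name: str          # name of this instruction's result
    op: str            # e.g., "ld", "add", "mul"
    srcs: tuple = ()   # names of the producing instructions (dataflow inputs)

GRID_ROWS, GRID_COLS = 4, 4
BLOCK_SIZE = GRID_ROWS * GRID_COLS   # assume one instruction per node per frame

def partition(program):
    """Split a program, contiguous in program order, into groups."""
    return [program[i:i + BLOCK_SIZE] for i in range(0, len(program), BLOCK_SIZE)]

def map_block(block):
    """Naive row-major placement: instruction i -> grid slot (row, col).
    A real scheduler would place producers near consumers instead."""
    return {instr.name: divmod(i, GRID_COLS) for i, instr in enumerate(block)}

program = [Instr("a", "ld"), Instr("b", "ld"),
           Instr("c", "add", ("a", "b")), Instr("d", "mul", ("c", "a"))]
for block in partition(program):
    print(map_block(block))   # e.g., {'a': (0, 0), 'b': (0, 1), ...}
```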
  • FIG. 1 is a schematic block diagram of an apparatus according to an embodiment of the invention. The apparatus 100, such as a computation node 100, includes one or more input ports 102 capable of being communicatively coupled (in some embodiments by using transport channels 104) to at least one substantially simultaneously preselected computation node 110. The input ports 102 receive input data 112, such as operands OP1, OP2, OP3, OP4, . . . , OPn1, . . . , OPn2 112.
  • As used herein, the term “substantially simultaneously preselected” means, with respect to a computation node 100 (i.e., an executing node), that the mapping of an instruction to the executing node was determined prior to fetching the instruction, and at about the same time as the nodes providing the input data to be used or consumed by the mapped instruction (i.e., one or more producer nodes), as well as the nodes consuming any output data provided by the executing node as a result of executing the mapped instruction (i.e., one or more consumer nodes), were also determined. Further, as used herein, a “group of instructions” means some selected number of instructions, contiguous in program order, assigned as a group to a plurality of nodes prior to any single instruction in the group being fetched. It should be noted that individual instructions included in a selected group of instructions are typically, although not always, mapped to a corresponding number of specific nodes in order to minimize the physical distance traveled by operands included in a critical path associated with the group. Mapping can also, or in addition, be accomplished so as to minimize the execution time associated with a particular group of instructions.
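  • The mapping goal just described (keeping operand travel short along producer-to-consumer edges) can be illustrated with a hedged placement heuristic, reusing the Instr records from the previous sketch. The Manhattan-distance cost model and the greedy loop below are assumptions for exposition only:

```python
# Hypothetical placement heuristic: put each instruction at the free grid
# slot closest to its already-placed producers, so operands travel fewer hops.
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def routing_cost(block, mapping):
    """Total operand travel distance over all producer -> consumer edges."""
    return sum(manhattan(mapping[src], mapping[instr.name])
               for instr in block for src in instr.srcs
               if src in mapping)   # operands from outside the group cost nothing

def greedy_map(block, rows=4, cols=4):
    free = [(r, c) for r in range(rows) for c in range(cols)]
    mapping = {}
    for instr in block:   # block is contiguous in program order
        producers = [mapping[s] for s in instr.srcs if s in mapping]
        slot = min(free, key=lambda f: sum(manhattan(f, p) for p in producers))
        free.remove(slot)
        mapping[instr.name] = slot
    return mapping
```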
  • The node 100 includes a first store 116, which may have one or more memory locations, coupled to the input port(s) 102 to store the input data 112. Thus the first store 116 can be used to retain one or more operands or elements of input data 112. In addition, the node 100 can include a second store 118, also comprising one or more memory locations, coupled to an external instruction sequencer 120. The second store 118 receives and stores one or more instructions INST1, INST2, . . . , INSTn 122 from the external instruction sequencer 120. As used here, the term “external” means, with respect to the instruction sequencer 120, that the sequencer 120 is not located within any of the computation nodes 100, 110, 140 to which it can pass instructions 122. One or more instructions 122 can be included in an instruction group 123.
  • The node 100 also includes an instruction wakeup unit 124 to match the input data 112 to the instruction 122, and at least one execution unit 126 to execute the instructions 122 passed to the execution unit 126 by the wakeup unit 124 using the input data 112 and the instructions 122 to produce output data 130. The execution unit 126 can include one or more arithmetic logic units, floating point units, memory address units, branch units, or any combination of these units.
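  • The patent gives no pseudocode for the wakeup unit, but its matching behavior can be sketched as follows: an instruction waits in the second store until every operand it names has arrived in the first store, then issues and produces an addressed output datum. The dictionary-based instruction encoding is an invented illustration:

```python
# Illustrative model of one computation node: operands accumulate in the
# first store, instructions wait in the second store, and the wakeup logic
# issues any instruction whose operands are all present.
class NodeModel:
    OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

    def __init__(self):
        self.first_store = {}    # operand name -> value
        self.second_store = []   # instructions loaded by the sequencer

    def load(self, instructions):
        self.second_store.extend(instructions)

    def deliver(self, name, value):
        """An operand arrives on an input port; returns any fired results."""
        self.first_store[name] = value
        return self.wakeup()

    def wakeup(self):
        fired = []
        for instr in list(self.second_store):
            if all(s in self.first_store for s in instr["srcs"]):
                self.second_store.remove(instr)
                args = [self.first_store[s] for s in instr["srcs"]]
                result = self.OPS[instr["op"]](*args)
                fired.append((instr["dest"], result))  # (destination address, datum)
        return fired

node = NodeModel()
node.load([{"op": "add", "srcs": ("x", "y"), "dest": (1, 2)}])
assert node.deliver("x", 3) == []             # still waiting on operand y
assert node.deliver("y", 4) == [((1, 2), 7)]  # all operands matched; fires
```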
  • One or more output ports 134 included in the node 100 can be coupled to at least one substantially simultaneously preselected computation node 140 (in some embodiments, by using transport channels 104). It should be noted that the substantially simultaneously preselected nodes 110, 140 can be similar to or identical to node 100. The node 100 can also include a router 142 to send or direct the output data 130 from the output ports 134 to one or more substantially simultaneously preselected computation nodes 140.
  • The instructions 122 can include a destination address 144 associated with the substantially simultaneously preselected computation node 140. The router 142 is capable of using the destination address ADD1 144, for example, to direct the output data 130 to the node 140. Destination addresses 144 can be generated by, among other mechanisms, a compiler and a run-time trace mapper. In some embodiments, the output data 130 can include the destination address ADD1 144 associated with the computation node 140. However, the router 142 is capable of using the destination address ADD1 144 to direct the output data 130 to the computation node 140, whether the address ADD1 144 is included in the output data 130, or merely associated with the output data 130.
  • Similarly, the input data OP1 112 can include, or be associated with, a destination address ADD1 148 associated with the computation node 140, such that the router 142 is capable of using the destination address ADD1 148 to send or direct the input data OP1 112 directly to the computation node 140. This may occur, for example, if the input data OP1 112 is ultimately destined for use within the node 140 (i.e., the consuming node 140), but must pass through the node 100 as part of the shortest path from the node 110 that produces or provides it (i.e., the producing or provider node 110). The node 100 is also capable of routing output data 130 back to itself along the bypass path 143, such that an instruction INSTn 122 can include a destination address ADDn 144 associated with the node 100. The router 142 is then also capable of using the destination address ADDn 144 to direct the output data 130 back to the node 100 for use with another instruction INSTn 122.
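  • One plausible way to realize the router's use of destination addresses is dimension-order routing on the grid, sketched below; the patent does not mandate any particular routing algorithm, so the X-then-Y rule is an assumption. A datum whose destination is the current node (including the bypass path, where a result returns to its own node) is simply delivered to the local operand store:

```python
# Hypothetical routing step: move a datum one hop toward its destination
# address on a 2-D grid, X dimension first, then Y.
def route_step(here, dest):
    """Return the next (row, col) hop, or None to deliver locally."""
    (r, c), (dr, dc) = here, dest
    if (r, c) == (dr, dc):
        return None                              # consume here (or bypass to self)
    if c != dc:
        return (r, c + (1 if dc > c else -1))    # route along X first
    return (r + (1 if dr > r else -1), c)        # then along Y

assert route_step((0, 0), (0, 0)) is None        # local delivery / bypass path
assert route_step((0, 0), (2, 2)) == (0, 1)      # pass-through hop toward (2, 2)
```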
  • To further enhance the capabilities of the node 100, one or more of the output ports 134 can be communicatively coupled to a direct channel 150, which, unlike the transport channels 104, bypasses the router 142 and makes a direct connection between the execution unit 126 and the node 140. Thus, one or more of the input ports 102 of the node 140 can be communicatively coupled to the direct channel 150. Similarly, one or more of the input ports 102 of the node 100 can be communicatively coupled to a direct channel 152, which bypasses the router (not shown) included in the node 110 to make a direct connection between the execution unit 126 (not shown) in the node 110 and the node 100.
  • FIG. 2 is a schematic block diagram of a system according to an embodiment of the invention. Here a system 260, such as a processor 260, is shown to include an external instruction sequencer 220, similar to or identical to the external instruction sequencer 120 (shown in FIG. 1), as well as one or more nodes 200, similar to or identical to nodes 100, 110, and/or 140 (also shown in FIG. 1), included on a computation substrate 261. The external instruction sequencer 220 is used to fetch one or more groups of instructions, such as the instruction groups GRP1, GRP2, . . . , GRPn 223. A single group of instructions may include one or more instructions, such as the instructions INST1, INST2, . . . INSTn 122 shown in FIG. 1.
  • As seen in FIG. 2, the nodes 200 can be connected in just about any topology desired. For example, each node 200 can be connected to each and every other node 200 included in the system 260. One or more nodes 200 can also be connected to just one other node 200, or to some selected number of nodes 200, up to and including all of the nodes 200 included in the system 260.
  • Connections between the nodes 200 can be effected via routers (not shown) included in the nodes 200, and transport channels 204, which are similar to or identical to transport channels 104 shown in FIG. 1, and/or direct channels 250, which are similar to or identical to the direct channels 150, 152 shown in FIG. 1. As noted previously, the direct channels 250 include communication connections between selected nodes 200 that bypass routers (not shown) in the nodes 200, so that input data and output data can be sent directly from one node 200 to another node 200 (e.g., from node 262 to node 264) without travelling through a router or other nodes 200. In most embodiments, however, the input ports of each node 200 are connected to the output ports of at least one other preselected node 200, such as a substantially simultaneously preselected node 200. Similarly, the output ports of each node 200 are connected to the input ports of at least one other preselected node 200, such as a substantially simultaneously preselected node 200. For example, as shown in FIG. 2, the input ports of node 266 (among other nodes) are connected to the output ports of substantially simultaneously preselected node 262, and the output ports of node 266 (among other nodes) are connected to the input ports of substantially simultaneously preselected node 268. In turn, the output ports of node 268 (among other nodes) are connected to the input ports of substantially simultaneously preselected node 264. Almost any coupling topology desired can be addressed by using the transport channels 204 and the direct channels 250 to connect the nodes 200.
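  • A short sketch of how the two kinds of channels might be tabulated when a grid is wired up; the nearest-neighbor transport mesh and the single example direct link are assumptions, since almost any topology is permitted:

```python
# Illustrative wiring: router-to-router transport channels form a mesh,
# while direct channels connect an execution unit straight to another node.
def build_channels(rows, cols, direct_pairs=()):
    transport = {}
    for r in range(rows):
        for c in range(cols):
            transport[(r, c)] = [(r + dr, c + dc)
                                 for dr, dc in ((0, 1), (0, -1), (1, 0), (-1, 0))
                                 if 0 <= r + dr < rows and 0 <= c + dc < cols]
    direct = dict(direct_pairs)   # source node -> node reached without a router
    return transport, direct

transport, direct = build_channels(4, 4, direct_pairs=[((0, 0), (1, 1))])
print(transport[(0, 0)])   # [(0, 1), (1, 0)] -- neighbors reached via routers
print(direct[(0, 0)])      # (1, 1) -- router bypassed on this link
```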
  • The system 260 can also include a register file 270, a memory interface 272, stitch logic 273, and a block termination control 274, each communicatively coupled to the nodes 200. In addition, the system 260 can include one or more memories 276 coupled to the external instruction sequencer 220, as well as to the memory interface 272. The memory 276 can be volatile or nonvolatile, or a combination of these types, and may include various types of storage devices, such as Random Access Memory (RAM), Read Only Memory (ROM), FLASH memory, disk storage, and any other type of storage device or medium. The memory 276 can comprise an instruction memory used to store one or more selected groups of instructions, such as the group of instructions GRP1 223.
  • The register file 270 operates to receive indications 278 to send operands or input data to be used by instructions within the nodes 200, and/or to receive data upon completion of an instruction group, and/or to store values produced in a group of instructions that are live outside of the group in which the value is produced. The register file 270 can be coupled to the block termination control 274, which operates to detect execution termination of a selected group of instructions, such as the group of instructions GRP1 223. The performance of the system 260 may be further enhanced by including an instruction cache C1, C2, . . . , Cm 280 for each one of the m rows of nodes 200. It should be noted that, when all of the consumers of a particular datum reside within the group of instructions that produces that datum, it is not necessary to write the datum to a register file.
  • The stitch logic module 273 is used to accomplish “register-stitching”. This can occur when, for example, executing a first group of instructions produces output data (typically written to a register), and a second, concurrently executing group of instructions can use the output data so produced as input data. In this case, the output data can be forwarded directly from the first group to the second group via the stitch logic module 273.
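  • The stitching idea can be sketched as a small forwarding table (names invented; the deliver call matches the node model sketched earlier): consumers in a concurrently executing group subscribe to a register, and a write from the producing group is forwarded to them directly while still being committed to the register file:

```python
# Hypothetical stitch logic: forward a register value produced by one
# group of instructions directly to consumers in a concurrent group.
class StitchLogic:
    def __init__(self):
        self.waiting = {}   # register name -> list of subscribed consumer nodes

    def subscribe(self, reg, node):
        self.waiting.setdefault(reg, []).append(node)

    def on_group_write(self, reg, value, register_file):
        register_file[reg] = value      # the value is live outside its group
        for node in self.waiting.pop(reg, []):
            node.deliver(reg, value)    # forwarded; no register-file read needed
```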
  • The nodes 100, 110, 140, 200, 262, 264, 266, 268; input ports 102; first store 116; second store 118; external instruction sequencer 120; instruction wakeup unit 124; execution unit 126; output ports 134; router 142; direct channels 150, 152, 250; system 260; register file 270; memory interface 272; stitch logic module 273; block termination control 274; memory 276; and instruction caches 280 may all be characterized as “modules” herein. Such modules may include hardware circuitry, and/or one or more processors and/or memory circuits, software program modules, and/or firmware, and combinations thereof, as desired by the architect of the nodes 100, 200 and the system 260, and as appropriate for particular implementations of various embodiments of the invention.
  • One of ordinary skill in the art will understand that the apparatus and systems of the present invention can be used in applications other than for parallel instruction processing, and thus, embodiments of the invention are not to be so limited. The illustrations of nodes 100, 200, and a system 260 are intended to provide a general understanding of the structure of various embodiments of the present invention, and are not intended to serve as a complete description of all the elements and features of apparatus and systems that might make use of the structures described herein.
  • Applications that may include the novel apparatus and systems of the present invention include electronic circuitry used in communication and signal processing circuitry, modems, processor modules, embedded processors, and application-specific modules, including multilayer, multi-chip modules. Such apparatus and systems may further be utilized as sub-components within a variety of electronic systems, including cellular telephones, personal computers, dedicated data acquisition systems, and others.
  • FIG. 3 is a flow diagram illustrating a method according to an embodiment of the invention. The method 311 may begin with partitioning a program into a plurality of groups of instructions at block 321, and continue with assigning one or more groups of instructions selected from the plurality of groups of instructions to a plurality of interconnected preselected computation nodes, such as the nodes 100, 200 shown in FIGS. 1 and 2, respectively, at block 331. The nodes 100, 200 in the group of interconnected preselected nodes can be connected in any topology desired, as noted above.
  • One or more of the groups in the plurality of instruction groups can be a basic block, a hyperblock, or a superblock. Alternatively, or in addition, one or more of the groups in the plurality of instruction groups can be an instruction trace constructed by a hardware trace construction unit at run time.
  • At block 341, the method 311 may continue with loading the assigned group(s) of instructions on to the plurality of interconnected preselected computation nodes, and executing the group(s) of instructions at block 345 as each one of the instructions in a respective group of instructions receives all necessary associated operands for execution.
  • Partitioning the program into groups of instructions can be performed by a compiler at block 351. Partitioning the program into groups of instructions can also be performed by a run-time trace mapper, or a combination of a compiler and a run-time trace mapper, at block 351.
  • Loading the assigned group(s) of instructions on to the plurality of interconnected preselected computation nodes at block 341 can include sending at least two instructions selected from the group of instructions from an instruction sequencer to a selected computation node included in the plurality of interconnected preselected computation nodes for storage in a store at block 355. In any case, one or more of the plurality of groups (including all of the groups) can be statically assigned for execution.
  • Alternatively, or in addition, loading the assigned group(s) of instructions on to the plurality of interconnected preselected computation nodes at block 341 can include sending multiple sets of instructions at block 361, such as sending a first set of instructions selected from a first group of instructions (selected from a plurality of groups of instructions) from an instruction sequencer to the plurality of interconnected preselected computation nodes for storage in a first frame (included in a first computation node), and sending a second set of instructions selected from the first group of instructions from the instruction sequencer to the plurality of interconnected preselected computation nodes for storage in a second frame (included in the first computation node).
  • As used herein, the term “frame” means a designated set of buffers spanning a plurality of nodes 100, wherein one buffer (e.g., selected from the stores 116, 118) of a particular frame is typically (but not always) included in each node 100 in the plurality of nodes 100. Thus, each frame may permit mapping the same number of instructions as there are nodes 100. Each group of instructions can span multiple frames, and multiple groups of instructions can be mapped on to multiple frames. For example, the number of frames may be selected so as to be equal to the number of instruction storage locations in the store 116 and/or store 118. However, the embodiments of the invention are not so limited.
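  • The frame arithmetic implied above can be sketched as follows; the 4x4 grid and eight-frame depth are invented numbers used only to show how a group larger than the grid spills across consecutive frames:

```python
# Illustrative frame accounting: one instruction buffer per node per frame,
# so a group with more instructions than nodes occupies several frames.
ROWS, COLS, FRAMES = 4, 4, 8
NODES = ROWS * COLS

def frames_needed(group_len):
    return -(-group_len // NODES)    # ceiling division

def assign_frames(groups, next_free=0):
    """Map each group id to a run of frame indices, wrapping modulo FRAMES."""
    plan = {}
    for gid, group in enumerate(groups):
        n = frames_needed(len(group))
        plan[gid] = [(next_free + k) % FRAMES for k in range(n)]
        next_free = (next_free + n) % FRAMES
    return plan

print(assign_frames([["i%d" % k for k in range(16)],
                     ["j%d" % k for k in range(40)]]))
# {0: [0], 1: [1, 2, 3]} -- the 40-instruction group spans three frames
```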
  • Assigning one or more groups of instructions to a plurality of interconnected preselected computation nodes at block 331 can also include assigning a first group of instructions to a first set of frames included in the plurality of interconnected preselected computation nodes, assigning a second group of instructions to a second set of frames included in the plurality of interconnected preselected computation nodes (wherein the first group and the second group of instructions are capable of concurrent execution), and wherein at least one output datum associated with the first group of instructions is written to a register file and passed directly to the second group of instructions for use as an input datum by the second group of instructions.
  • Other assignments can be made, including assignments for other sets of instructions. For example, once assignments for a third and fourth set of instructions have been made, and continuing the previous example, loading the assigned group(s) of instructions on to the plurality of interconnected preselected computation nodes at block 341 can include sending a third set of instructions selected from a second group of instructions (selected from the plurality of instruction groups) from an instruction sequencer to the plurality of interconnected preselected computation nodes for storage in the first frame, and sending a fourth set of instructions selected from the second group of instructions from an instruction sequencer to the plurality of interconnected preselected computation nodes for storage in the second frame.
  • Executing the group of instructions as each one of the instructions in the group of instructions receives all necessary associated operands for execution at block 345 can include matching at least one instruction selected from the group of instructions with at least one operand received from another computation node included in the plurality of interconnected preselected computation nodes at block 365. In any case, one or more of the instructions included in at least one of the plurality of instruction groups (including an entire group or all of the instruction groups) can be dynamically issued for execution. A non-limiting sketch of this operand matching appears following this Detailed Description.
  • The method may continue with generating one or more wakeup tokens to reserve one or more output data channels (e.g., transport channels or direct channels) to connect selected computation nodes included in the plurality of interconnected preselected computation nodes at block 371. Generating wakeup tokens may operate to accelerate the wakeup of one or more corresponding consuming instructions (i.e., instructions which receive the data).
  • The method 311 may continue, at block 375, with routing one or more output data arising from executing the group of instructions to one or more consumer nodes (e.g., nodes coupled to the output ports of the producing or provider node(s)) included in the plurality of interconnected preselected computation nodes, wherein the addresses of the consumer nodes are included in a token associated with at least one instruction included in the group of instructions. The method 311 may also include detecting execution termination of one or more groups of instructions at block 381. If one or more of the instructions includes an output having architecturally visible data, the method 311 may conclude with committing the architecturally visible data to a register file and/or memory at block 385. It should be noted that multiple groups of instructions can execute concurrently, and that data from each group is typically (although embodiments of the invention are not so limited) committed to register files and/or memory in serial fashion, with the data from a previously-terminating group being committed prior to data from a later-terminating group. A non-limiting sketch of this termination detection and group-ordered commit likewise appears following this Detailed Description.
  • Referring to the methods just described, it should be clear that some embodiments of the present invention may also be realized in the context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules may include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. As such, any of the modules 100, 102, 110, 116, 118, 120, 124, 126, 134, 140, 142, 150, 152, 200, 250, 260, 262, 264, 266, 268, 270, 272, 274, 276, and 280 described herein may include software operative on one or more processors to perform methods according to the teachings of various embodiments of the present invention.
  • One of ordinary skill in the art will understand, upon reading and comprehending this disclosure, the manner in which a software program can be launched from a computer readable medium in a computer-based system to execute the functions defined in the software program. One of ordinary skill in the art will further understand the various programming languages that may be employed to create one or more software programs designed to implement and perform the methods disclosed herein. The programs can be structured in an object-oriented format using an object-oriented language such as Java, Smalltalk, or C++. Alternatively, the programs can be structured in a procedure-oriented format using a procedural language, such as COBOL or C. The software components may communicate using any of a number of mechanisms that are well known to those skilled in the art, such as application program interfaces (APIs) or interprocess communication techniques such as the Remote Procedure Call (RPC). However, the teachings of various embodiments of the present invention are not limited to any particular programming language or environment.
  • As is evident from the preceding description, and referring back to FIGS. 1 and 2, it can be seen that during the operation of the nodes 100, 200 (as well as the system 260), a processor or control logic 220 may access some form of computer-readable media, such as memory 276. Thus, a system 260 having nodes 200 according to an embodiment of the invention may also include a processor 220 coupled to a memory 274, volatile or nonvolatile.
  • By way of example and not limitation, computer-readable media may comprise computer storage media and communications media. Computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Communications media specifically embodies computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave, coded information signal, and/or other transport mechanism, and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example and not limitation, communications media also includes wired media, such as a wired network or direct-wired connections, and wireless media, such as acoustic, optical, radio frequency, infrared, and other wireless media. Combinations of any of the above are also included within the scope of computer-readable and/or accessible media.
  • Thus, it is now easily understood that another embodiment of the invention may include an article 290 comprising a machine-accessible medium or memory 276 having associated data, wherein the data, when accessed, results in a machine performing activities such as partitioning a program into a plurality of groups of instructions, assigning a group of instructions selected from the plurality of groups of instructions to a plurality of interconnected preselected computation nodes, loading the group of instructions to the plurality of interconnected preselected computation nodes, and executing the group of instructions as each one of the instructions in the group of instructions receives all necessary associated operands for execution.
  • As noted above, other activities may include partitioning the program into the plurality of groups of instructions, as performed by a compiler or a run-time trace mapper (a non-limiting sketch of such partitioning appears following this Detailed Description); statically assigning all of the plurality of groups of instructions for execution; and dynamically issuing one or more instructions selected from one or more of the plurality of groups of instructions for execution. Still further activities can include generating a wakeup token to reserve an output data channel to connect selected computation nodes included in the plurality of interconnected preselected computation nodes; routing one or more output data arising from executing the group of instructions to one or more consumer nodes included in the plurality of interconnected preselected computation nodes, wherein the address of each one of the consumer nodes is included in a token associated with at least one instruction included in the group of instructions; detecting execution termination of a group of instructions (including an output having architecturally visible data); and committing the architecturally visible data to a register file and/or a memory.
  • Although specific embodiments have been illustrated and described herein, those of ordinary skill in the art will appreciate that any arrangement which is calculated to achieve the same purpose may be substituted for the specific embodiments shown. This disclosure is intended to cover any and all adaptations or variations of various embodiments of the present invention. It is to be understood that the above Detailed Description has been made in an illustrative fashion, and not a restrictive one. Combinations of the above embodiments, and other embodiments not specifically described herein will be apparent to those of skill in the art upon reviewing the above description. The scope of various embodiments of the invention includes any other applications in which the above structures and methods are used. Therefore, the scope of various embodiments of the invention should be determined with reference to the appended claims, along with the full range of equivalents to which such claims are entitled.
  • It is emphasized that the Abstract is provided to comply with 37 C.F.R. § 1.72(b), requiring an abstract that will allow the reader to quickly ascertain the nature of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. It should also be noted that in the foregoing Detailed Description, various features may be grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the invention require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own as a separate preferred embodiment.
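The sketches below are offered purely as non-limiting illustrations of selected activities described above. They are written in Python, and every identifier, node count, and instruction format in them is an assumption made for exposition, not a definitive implementation of any embodiment. The first sketch models partitioning a program into a plurality of groups of instructions by splitting a linear instruction stream at branch boundaries, roughly as a compiler or run-time trace mapper might.

# Non-limiting sketch: partition a linear instruction stream into groups
# at branch boundaries. The instruction encoding (plain strings) and the
# is_branch predicate are hypothetical.
def partition(program, is_branch):
    groups, current = [], []
    for inst in program:
        current.append(inst)
        if is_branch(inst):        # a branch ends the current group
            groups.append(current)
            current = []
    if current:                    # trailing instructions form a group
        groups.append(current)
    return groups

blocks = partition(["ld", "add", "br", "mul", "st", "br"],
                   is_branch=lambda op: op == "br")
assert blocks == [["ld", "add", "br"], ["mul", "st", "br"]]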
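The second sketch models the frame-based loading of blocks 341, 355, and 361: each frame contributes one instruction buffer per computation node, so an instruction sequencer can distribute a group of instructions spanning several frames across the grid. The grid size, frame count, and the Node, Sequencer, and load_group names are assumptions.

# Non-limiting sketch: a frame spans the grid, contributing one
# instruction buffer per node; a group of instructions may span frames.
NUM_NODES = 16     # hypothetical 4x4 grid of computation nodes
NUM_FRAMES = 4     # hypothetical instruction buffers per node

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.frames = [None] * NUM_FRAMES   # one instruction slot per frame

class Sequencer:
    """Sends sets of instructions from a group to nodes, frame by frame."""
    def __init__(self, nodes):
        self.nodes = nodes

    def load_group(self, group, frame_ids):
        # Slot (f, n) is the buffer that frame f contributes to node n.
        slots = [(f, n) for f in frame_ids for n in range(NUM_NODES)]
        for inst, (f, n) in zip(group, slots):
            self.nodes[n].frames[f] = inst

grid = [Node(i) for i in range(NUM_NODES)]
Sequencer(grid).load_group([f"op{i}" for i in range(32)], frame_ids=[0, 1])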
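The third sketch models the operand matching of blocks 345 and 365: an instruction held in a node's store issues only after every necessary associated operand has arrived, and its output is routed toward the consumer addresses carried with the instruction. The two-operand format and the Instruction, deliver, and wakeup names are assumptions; the reservation of output data channels by wakeup tokens (block 371) is not modeled.

# Non-limiting sketch of dataflow-style instruction wakeup.
class Instruction:
    def __init__(self, opcode, num_operands, consumers):
        self.opcode = opcode
        self.operands = [None] * num_operands
        self.consumers = consumers          # destination addresses

    def deliver(self, index, value):
        """Match an arriving operand to its slot; report readiness."""
        self.operands[index] = value
        return all(v is not None for v in self.operands)

def execute(inst):
    a, b = inst.operands                    # assumes two operands
    return a + b if inst.opcode == "add" else a * b

def wakeup(inst, index, value, route):
    """Fire the instruction once its last operand arrives."""
    if inst.deliver(index, value):
        result = execute(inst)
        for dest in inst.consumers:         # router uses embedded address
            route(result, dest)

sent = []
add = Instruction("add", 2, consumers=[("node3", "frame0", "slot1")])
wakeup(add, 0, 4, route=lambda v, d: sent.append((v, d)))  # not ready yet
wakeup(add, 1, 5, route=lambda v, d: sent.append((v, d)))  # fires, routes 9
assert sent == [(9, ("node3", "frame0", "slot1"))]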
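The final sketch models blocks 381 and 385: termination of a group is detected when its last expected output has been produced, and architecturally visible outputs are then committed to a register file and/or memory serially, with data from a previously-terminating group committed before data from a later-terminating group. The GroupContext name and the (kind, location) output encoding are assumptions.

# Non-limiting sketch of group termination detection and serial commit.
class GroupContext:
    def __init__(self, group_id, expected_outputs):
        self.group_id = group_id
        self.pending = expected_outputs     # outputs still outstanding
        self.visible = {}                   # buffered visible side effects

    def record_output(self, target, value):
        """Buffer one output; return True when the group terminates."""
        self.visible[target] = value
        self.pending -= 1
        return self.pending == 0

def commit(register_file, memory, terminated):
    # Earlier-terminating groups commit before later ones (serial order).
    for ctx in sorted(terminated, key=lambda c: c.group_id):
        for (kind, where), value in ctx.visible.items():
            (register_file if kind == "reg" else memory)[where] = value

regs, mem = {}, {}
g0, g1 = GroupContext(0, 1), GroupContext(1, 1)
g1.record_output(("reg", 7), 42)
g0.record_output(("mem", 0x100), 9)
commit(regs, mem, [g1, g0])
assert regs == {7: 42} and mem == {0x100: 9}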

Claims (36)

1. An apparatus, comprising:
at least one input port capable of being coupled to at least one substantially simultaneously preselected first other computation node, the input port to receive input data;
a first store coupled to the at least one input port to store the input data;
a second store coupled to an external instruction sequencer, the second store to receive and store an instruction from the external instruction sequencer;
an instruction wakeup unit to match the input data to the instruction;
at least one execution unit to execute the instruction using the input data to produce output data; and
at least one output port capable of being coupled to at least one substantially simultaneously preselected second other computation node.
2. The apparatus of claim 1, further comprising:
a router to direct the output data from the at least one output port to the at least one substantially simultaneously preselected second other computation node.
3. The apparatus of claim 2, wherein the instruction includes a destination address associated with the at least one substantially simultaneously preselected second other computation node, and wherein the router is capable of using the destination address to direct the output data to the at least one substantially simultaneously preselected second other computation node.
4. The apparatus of claim 3, wherein the destination address is generated by a mechanism selected from the group consisting of: a compiler and a run-time trace mapper.
5. The apparatus of claim 2, wherein the instruction includes a destination address associated with the computation node, and wherein the router is capable of using the destination address to direct the output data to the computation node.
6. The apparatus of claim 1, wherein the execution unit comprises at least one calculation module selected from the group consisting of: an arithmetic logic unit, a floating point unit, a memory address unit, and a branch unit.
7. The apparatus of claim 1, wherein the second store is capable of storing multiple instructions.
8. The apparatus of claim 1, wherein the first store is capable of storing multiple operands.
9. The apparatus of claim 1, wherein the at least one output port is coupled to a direct channel, and wherein an input port of the at least one substantially simultaneously preselected second other computation node is coupled to the direct channel.
10. The apparatus of claim 1, wherein the at least one input port is coupled to a direct channel, and wherein an output port of the at least one substantially simultaneously preselected first other computation node is coupled to the direct channel.
11. A system, comprising:
an external instruction sequencer to fetch a group of instructions including an instruction; and
a first preselected computation node including at least one input port capable of being coupled to at least one first other preselected computation node, the input port to receive input data, a first store coupled to the at least one input port to store the input data, a second store coupled to the instruction sequencer, the second store to receive and store the instruction, an instruction wakeup unit to match the input data to the instruction, at least one execution unit to execute the instruction using the input data to produce output data, at least one output port capable of being coupled to at least one second other preselected computation node, and a router to direct the output data from the at least one output port to the at least one second other preselected computation node.
12. The system of claim 11, further comprising:
a second preselected computation node including at least one input port capable of being coupled to at least one third other preselected computation node, the input port to receive other input data, a first store coupled to the at least one input port to store the other input data, a second store coupled to the instruction sequencer, the second store to receive and store an other instruction selected from the group of instructions, an instruction wakeup unit to match the other input data to the other instruction, at least one execution unit to execute the other instruction using the other input data to produce other output data, at least one output port capable of being coupled to at least one fourth other preselected computation node, and a router to direct the other output data from the at least one output port to the at least one fourth other preselected computation node.
13. The system of claim 12, further comprising:
a register file to receive indications to send operands to be used by instructions at the first and the second preselected computation nodes.
14. The system of claim 12, wherein the output port of the first preselected computation node is coupled to the input port of the second preselected computation node.
15. The system of claim 11, further comprising:
a grid of computation nodes including the first preselected computation node, wherein the grid of computation nodes includes M rows of computation nodes, and wherein each one of the M rows of computation nodes includes an instruction cache.
16. The system of claim 11, further comprising:
an instruction memory coupled to the instruction sequencer, the instruction memory to store the group of instructions.
17. The system of claim 11, further comprising:
a block termination control module to detect execution termination of the group of instructions; and
a register file coupled to the block termination control module.
18. A method, comprising:
partitioning a program into a plurality of groups of instructions;
assigning a group of instructions selected from the plurality of groups of instructions to a plurality of interconnected preselected computation nodes;
loading the group of instructions to the plurality of interconnected preselected computation nodes; and
executing the group of instructions as each one of the instructions in the group of instructions receives all necessary associated operands for execution.
19. The method of claim 18, wherein at least one computation node included in the plurality of interconnected preselected computation nodes has at least one input port capable of being coupled to at least one preselected first other computation node included in the plurality of interconnected preselected computation nodes, the input port to receive input data, a first store coupled to the at least one input port to store the input data, a second store coupled to an instruction sequencer, the second store to receive and store the at least one instruction, an instruction wakeup unit to match the input data to the at least one instruction, at least one execution unit to execute the at least one instruction using the input data to produce output data, at least one output port capable of being coupled to at least one second other preselected computation node included in the plurality of interconnected preselected computation nodes, and a router to direct the output data from the at least one output port to the at least one preselected second other computation node.
20. The method of claim 18, wherein at least one of the plurality of groups of instructions is a basic block.
21. The method of claim 18, wherein at least one of the plurality of groups of instructions is a hyperblock.
22. The method of claim 18, wherein at least one of the plurality of groups of instructions is a superblock.
23. The method of claim 18, wherein at least one of the plurality of groups of instructions is an instruction trace constructed by a hardware trace construction unit at run time.
24. The method of claim 18, wherein loading the group of instructions to the plurality of interconnected preselected computation nodes includes:
sending at least two instructions selected from the group of instructions from an instruction sequencer to a selected computation node included in the plurality of interconnected preselected computation nodes for storage in a store.
25. The method of claim 18, wherein executing the group of instructions as each one of the instructions in the group of instructions receives all necessary associated operands for execution includes:
matching at least one instruction selected from the group of instructions with at least one operand received from an other computation node included in the plurality of interconnected preselected computation nodes.
26. The method of claim 18, wherein loading the group of instructions to the plurality of interconnected preselected computation nodes includes:
sending a first set of instructions selected from a first group of instructions selected from the plurality of groups of instructions from an instruction sequencer to the plurality of interconnected preselected computation nodes for storage in a first frame included in a first computation node included in the plurality of interconnected preselected computation nodes; and
sending a second set of instructions selected from the first group of instructions from the instruction sequencer to the plurality of interconnected preselected computation nodes for storage in a second frame included in the first computation node.
27. The method of claim 18, wherein assigning a group of instructions selected from the plurality of groups of instructions to a plurality of interconnected preselected computation nodes includes:
assigning a first group of instructions to a first set of frames included in the plurality of interconnected preselected computation nodes;
assigning a second group of instructions to a second set of frames included in the plurality of interconnected preselected computation nodes, wherein the first group and the second group of instructions are capable of concurrent execution, and wherein at least one output datum associated with the first group of instructions is written to a register file and passed directly to the second group of instructions for use as an input datum by the second group of instructions.
28. An article comprising a machine-accessible medium having associated data, wherein the data, when accessed, results in a machine performing:
partitioning a program into a plurality of groups of instructions;
assigning a group of instructions selected from the plurality of groups of instructions to a plurality of interconnected preselected computation nodes;
loading the group of instructions to the plurality of interconnected preselected computation nodes; and
executing the group of instructions as each one of the instructions in the group of instructions receives all necessary associated operands for execution.
29. The article of claim 28, wherein partitioning the program into the plurality of groups of instructions is performed by a compiler.
30. The article of claim 28, wherein partitioning the program into the plurality of groups of instructions is performed by a run-time trace mapper.
31. The article of claim 28, wherein the machine-accessible medium further includes data, which when accessed by the machine, results in the machine performing:
statically assigning all of the plurality of groups of instructions for execution.
32. The article of claim 31, wherein the machine-accessible medium further includes data, which when accessed by the machine, results in the machine performing:
dynamically issuing one or more instructions from at least one of the plurality of groups of instructions for execution.
33. The article of claim 28, wherein the machine-accessible medium further includes data, which when accessed by the machine, results in the machine performing:
generating a wakeup token to reserve an output data channel to connect selected computation nodes included in the plurality of interconnected preselected computation nodes.
34. The article of claim 28, wherein the machine-accessible medium further includes data, which when accessed by the machine, results in the machine performing:
detecting execution termination of the group of instructions including an output having architecturally visible data; and
committing the architecturally visible data to a register file.
35. The article of claim 28, wherein the machine-accessible medium further includes data, which when accessed by the machine, results in the machine performing:
detecting execution termination of the group of instructions including an output having architecturally visible data; and
committing the architecturally visible data to a memory.
36. The article of claim 28, wherein the machine-accessible medium further includes data, which when accessed by the machine, results in the machine performing:
routing an output datum arising from executing the group of instructions to a consumer node included in the plurality of interconnected preselected computation nodes, wherein the address of the consumer node is included in a token associated with at least one instruction included in the group of instructions.
US10/829,668 2001-10-31 2004-04-22 Scalable processing architecture Abandoned US20050005084A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US10/829,668 US20050005084A1 (en) 2001-10-31 2004-04-22 Scalable processing architecture
US12/136,645 US8055881B2 (en) 2001-10-31 2008-06-10 Computing nodes for executing groups of instructions

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US33476401P 2001-10-31 2001-10-31
PCT/US2002/034965 WO2003038645A2 (en) 2001-10-31 2002-10-31 A scalable processing architecture
US10/829,668 US20050005084A1 (en) 2001-10-31 2004-04-22 Scalable processing architecture

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2002/034965 Continuation WO2003038645A2 (en) 2001-10-31 2002-10-31 A scalable processing architecture

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US12/136,645 Division US8055881B2 (en) 2001-10-31 2008-06-10 Computing nodes for executing groups of instructions

Publications (1)

Publication Number Publication Date
US20050005084A1 true US20050005084A1 (en) 2005-01-06

Family

ID=23308723

Family Applications (2)

Application Number Title Priority Date Filing Date
US10/829,668 Abandoned US20050005084A1 (en) 2001-10-31 2004-04-22 Scalable processing architecture
US12/136,645 Expired - Fee Related US8055881B2 (en) 2001-10-31 2008-06-10 Computing nodes for executing groups of instructions

Family Applications After (1)

Application Number Title Priority Date Filing Date
US12/136,645 Expired - Fee Related US8055881B2 (en) 2001-10-31 2008-06-10 Computing nodes for executing groups of instructions

Country Status (3)

Country Link
US (2) US20050005084A1 (en)
AU (1) AU2002363142A1 (en)
WO (1) WO2003038645A2 (en)

Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7962717B2 (en) * 2007-03-14 2011-06-14 Xmos Limited Message routing scheme
WO2011159309A1 (en) 2010-06-18 2011-12-22 The Board Of Regents Of The University Of Texas System Combined branch target and predicate prediction
US9792252B2 (en) 2013-05-31 2017-10-17 Microsoft Technology Licensing, Llc Incorporating a spatial array into one or more programmable processor cores
GB2535547B (en) * 2015-04-21 2017-01-11 Adaptive Array Systems Ltd Data processor
US10175988B2 (en) 2015-06-26 2019-01-08 Microsoft Technology Licensing, Llc Explicit instruction scheduler state information for a processor
US9720693B2 (en) 2015-06-26 2017-08-01 Microsoft Technology Licensing, Llc Bulk allocation of instruction blocks to a processor instruction window
US9940136B2 (en) 2015-06-26 2018-04-10 Microsoft Technology Licensing, Llc Reuse of decoded instructions
US10409599B2 (en) 2015-06-26 2019-09-10 Microsoft Technology Licensing, Llc Decoding information about a group of instructions including a size of the group of instructions
US10169044B2 (en) 2015-06-26 2019-01-01 Microsoft Technology Licensing, Llc Processing an encoding format field to interpret header information regarding a group of instructions
US10346168B2 (en) 2015-06-26 2019-07-09 Microsoft Technology Licensing, Llc Decoupled processor instruction window and operand buffer
US10191747B2 (en) 2015-06-26 2019-01-29 Microsoft Technology Licensing, Llc Locking operand values for groups of instructions executed atomically
US9952867B2 (en) 2015-06-26 2018-04-24 Microsoft Technology Licensing, Llc Mapping instruction blocks based on block size
US11755484B2 (en) 2015-06-26 2023-09-12 Microsoft Technology Licensing, Llc Instruction block allocation
US9946548B2 (en) 2015-06-26 2018-04-17 Microsoft Technology Licensing, Llc Age-based management of instruction blocks in a processor instruction window
US10409606B2 (en) 2015-06-26 2019-09-10 Microsoft Technology Licensing, Llc Verifying branch targets
US10180840B2 (en) 2015-09-19 2019-01-15 Microsoft Technology Licensing, Llc Dynamic generation of null instructions
US10198263B2 (en) 2015-09-19 2019-02-05 Microsoft Technology Licensing, Llc Write nullification
US11016770B2 (en) 2015-09-19 2021-05-25 Microsoft Technology Licensing, Llc Distinct system registers for logical processors
US11126433B2 (en) 2015-09-19 2021-09-21 Microsoft Technology Licensing, Llc Block-based processor core composition register
US10031756B2 (en) 2015-09-19 2018-07-24 Microsoft Technology Licensing, Llc Multi-nullification
US20170083327A1 (en) 2015-09-19 2017-03-23 Microsoft Technology Licensing, Llc Implicit program order
US10061584B2 (en) 2015-09-19 2018-08-28 Microsoft Technology Licensing, Llc Store nullification in the target field
US10936316B2 (en) 2015-09-19 2021-03-02 Microsoft Technology Licensing, Llc Dense read encoding for dataflow ISA
US10871967B2 (en) 2015-09-19 2020-12-22 Microsoft Technology Licensing, Llc Register read/write ordering
US10095519B2 (en) 2015-09-19 2018-10-09 Microsoft Technology Licensing, Llc Instruction block address register
US10719321B2 (en) 2015-09-19 2020-07-21 Microsoft Technology Licensing, Llc Prefetching instruction blocks
US11681531B2 (en) 2015-09-19 2023-06-20 Microsoft Technology Licensing, Llc Generation and use of memory access instruction order encodings
US10768936B2 (en) 2015-09-19 2020-09-08 Microsoft Technology Licensing, Llc Block-based processor including topology and control registers to indicate resource sharing and size of logical processor
US10678544B2 (en) 2015-09-19 2020-06-09 Microsoft Technology Licensing, Llc Initiating instruction block execution using a register access instruction
US11531552B2 (en) 2017-02-06 2022-12-20 Microsoft Technology Licensing, Llc Executing multiple programs simultaneously on a processor core
US10824429B2 (en) 2018-09-19 2020-11-03 Microsoft Technology Licensing, Llc Commit logic and precise exceptions in explicit dataflow graph execution architectures

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5197137A (en) * 1989-07-28 1993-03-23 International Business Machines Corporation Computer architecture for the concurrent execution of sequential programs
NL9100598A (en) * 1991-04-05 1992-11-02 Henk Corporaal Microprocessor circuit with extended and flexible architecture - provides separation between data transfer and data processing operations
AU6964501A (en) * 2000-06-13 2001-12-24 Nobel Limited Liability Company Synergic computation system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4814978A (en) * 1986-07-15 1989-03-21 Dataflow Computer Corporation Dataflow processing element, multiprocessor, and processes
US5276819A (en) * 1987-05-01 1994-01-04 Hewlett-Packard Company Horizontal computer having register multiconnect for operand address generation during execution of iterations of a loop of program code
US5241635A (en) * 1988-11-18 1993-08-31 Massachusetts Institute Of Technology Tagged token data processing system with operand matching in activation frames
US6282583B1 (en) * 1991-06-04 2001-08-28 Silicon Graphics, Inc. Method and apparatus for memory access in a matrix processor computer
US6338129B1 (en) * 1997-06-30 2002-01-08 Bops, Inc. Manifold array processor

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090204788A1 (en) * 2005-03-29 2009-08-13 Theseus Research, Inc. Programmable pipeline array
US7930517B2 (en) * 2005-03-29 2011-04-19 Wave Semiconductor, Inc. Programmable pipeline array
US20140195715A1 (en) * 2006-08-22 2014-07-10 Mosaid Technologies Incorporated Scalable memory system
US7454597B2 (en) 2007-01-02 2008-11-18 International Business Machines Corporation Computer processing system employing an instruction schedule cache
US20080162884A1 (en) * 2007-01-02 2008-07-03 International Business Machines Corporation Computer processing system employing an instruction schedule cache
US10698859B2 (en) 2009-09-18 2020-06-30 The Board Of Regents Of The University Of Texas System Data multicasting with router replication and target instruction identification in a distributed multi-core processing architecture
US20110072239A1 (en) * 2009-09-18 2011-03-24 Board Of Regents, University Of Texas System Data multicasting in a distributed processor system
US20160202983A1 (en) * 2013-08-19 2016-07-14 Shanghai Xinhao Microelectronics Co., Ltd. Processor system and method based on instruction read buffer
CN104424129A (en) * 2013-08-19 2015-03-18 上海芯豪微电子有限公司 Cache system and method based on read buffer of instructions
US10067767B2 (en) * 2013-08-19 2018-09-04 Shanghai Xinhao Microelectronics Co., Ltd. Processor system and method based on instruction read buffer
US10656948B2 (en) 2013-08-19 2020-05-19 Shanghai Xinhao Microelectronics Co. Ltd. Processor system and method based on instruction read buffer
US9946549B2 (en) 2015-03-04 2018-04-17 Qualcomm Incorporated Register renaming in block-based instruction set architecture
US20170083431A1 (en) * 2015-09-19 2017-03-23 Microsoft Technology Licensing, Llc Debug support for block-based processor
US10452399B2 (en) 2015-09-19 2019-10-22 Microsoft Technology Licensing, Llc Broadcast channel architectures for block-based processors
US10776115B2 (en) * 2015-09-19 2020-09-15 Microsoft Technology Licensing, Llc Debug support for block-based processor
US11106467B2 (en) 2016-04-28 2021-08-31 Microsoft Technology Licensing, Llc Incremental scheduler for out-of-order block ISA processors
US11687345B2 (en) 2016-04-28 2023-06-27 Microsoft Technology Licensing, Llc Out-of-order block-based processors and instruction schedulers using ready state data indexed by instruction position identifiers
US10496409B2 (en) * 2016-11-22 2019-12-03 The Arizona Board Of Regents Method and system for managing control of instruction and process execution in a programmable computing system
US20180143833A1 (en) * 2016-11-22 2018-05-24 The Arizona Board Of Regents Method and system for managing control of instruction and process execution in a programmable computing system
US10963379B2 (en) 2018-01-30 2021-03-30 Microsoft Technology Licensing, Llc Coupling wide memory interface to wide write back paths
US11726912B2 (en) 2018-01-30 2023-08-15 Microsoft Technology Licensing, Llc Coupling wide memory interface to wide write back paths
US11386031B2 (en) * 2020-06-05 2022-07-12 Xilinx, Inc. Disaggregated switch control path with direct-attached dispatch

Also Published As

Publication number Publication date
WO2003038645A3 (en) 2004-03-04
AU2002363142A1 (en) 2003-05-12
US20080244230A1 (en) 2008-10-02
US8055881B2 (en) 2011-11-08
WO2003038645A2 (en) 2003-05-08

Similar Documents

Publication Publication Date Title
US8055881B2 (en) Computing nodes for executing groups of instructions
US20190004878A1 (en) Processors, methods, and systems for a configurable spatial accelerator with security, power reduction, and performance features
US5872987A (en) Massively parallel computer including auxiliary vector processor
US7363467B2 (en) Dependence-chain processing using trace descriptors having dependency descriptors
CN108108188B (en) Memory fragments for supporting code block execution by using virtual cores instantiated by partitionable engines
CN108376097B (en) Register file segments for supporting code block execution by using virtual cores instantiated by partitionable engines
US9275002B2 (en) Tile-based processor architecture model for high-efficiency embedded homogeneous multicore platforms
US7734895B1 (en) Configuring sets of processor cores for processing instructions
US7526636B2 (en) Parallel multithread processor (PMT) with split contexts
US7020763B2 (en) Computer processing architecture having a scalable number of processing paths and pipelines
US20110231616A1 (en) Data processing method and system
US10747709B2 (en) Memory network processor
US20030182376A1 (en) Distributed processing multi-processor computer
CN100573500C (en) Stream handle IP kernel based on the Avalon bus
US20180246847A1 (en) Highly efficient scheduler for a fine grained graph processor
US11868250B1 (en) Memory design for a processor
CN113407483B (en) Dynamic reconfigurable processor for data intensive application
JP4589305B2 (en) Reconfigurable processor array utilizing ILP and TLP
Stepchenkov et al. Recurrent data-flow architecture: features and realization problems
Bhagyanath et al. Buffer allocation for exposed datapath architectures
Gao et al. Towards an efficient hybrid dataflow architecture model
Pöppl et al. Shallow water waves on a deep technology stack: Accelerating a finite volume tsunami model using reconfigurable hardware in invasive computing
Dimitroulakos et al. Exploring the design space of an optimized compiler approach for mesh-like coarse-grained reconfigurable architectures
US20230367604A1 (en) Method of interleaved processing on a general-purpose computing core
Keckler et al. Computing nodes for executing groups of instructions

Legal Events

Date Code Title Description
AS Assignment

Owner name: BOARD OF REGENTS, THE UNIVERSITY OF TEXAS SYSTEM,

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:BURGER, DOUG;KECKLER, STEPHEN W.;SANKARALINGAM, KARTHIKEYAN;AND OTHERS;REEL/FRAME:015089/0949

Effective date: 20040812

AS Assignment

Owner name: NATIONAL SCIENCE FOUNDATION, VIRGINIA

Free format text: CONFIRMATORY LICENSE;ASSIGNOR:UNIVERSITY OF TEXAS AUSTIN;REEL/FRAME:018412/0406

Effective date: 20050905

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: NATIONAL SCIENCE FOUNDATION, VIRGINIA

Free format text: CONFIRMATORY LICENSE;ASSIGNOR:UNIVERSITY OF TEXAS AUSTIN;REEL/FRAME:033232/0102

Effective date: 20050905

AS Assignment

Owner name: NATIONAL SCIENCE FOUNDATION, VIRGINIA

Free format text: CONFIRMATORY LICENSE;ASSIGNOR:UNIVERSITY OF TEXAS, AUSTIN;REEL/FRAME:041276/0399

Effective date: 20170216