WO1996014617A1 - Multicomputer system and method - Google Patents
Multicomputer system and method
- Publication number
- WO1996014617A1 (PCT/US1994/012921)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- clm
- computer
- set forth
- multicomputer system
- program
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/45—Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
- G06F8/456—Parallelism detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/448—Execution paradigms, e.g. implementations of programming paradigms
- G06F9/4494—Execution paradigms, e.g. implementations of programming paradigms data driven
Definitions
- This invention relates to a system for connecting a plurality of computers to a network for parallel processing.
DESCRIPTION OF THE RELEVANT ART
- Systems with multiple processors may be divided into two categories: systems with physically shared memory, called multi-processors, and systems without physically shared memory, called multi-computers.
- The execution mode for sequential machines is single instruction, single data (SISD).
- An SISD computer operates a single instruction, I, on a single datum, D, one at a time, in an arrangement commonly known as the von Neumann machine, as shown in FIG. 1.
- The first method of parallel computing is single instruction, multiple data (SIMD) computing, commonly known as vector processing.
- In SIMD computing, the processors are loaded with the same set of instructions, but each processor operates on a different set of data.
- SIMD computing has each processor calculating, for a set of data, D1, D2, D3, D4, using the same instruction set I, in parallel, as shown in FIG. 2.
- The second method of parallel computing is multiple instruction, multiple data (MIMD) computing.
- MIMD computing has different data processed by different instruction sets, as indicated in FIG. 3.
- MIMD computing breaks the execution of parallel software into pieces, thereby providing multiple instruction sets for multiple processors, I1, I2, I3, I4.
- In MIMD computing there are multiple data sets, D1, D2, D3, D4, and each data set respectively is fed into a corresponding instruction set.
- The third method of parallel computing is multiple instruction, single data (MISD) computing, commonly known as a pipeline system, as shown in FIG. 4.
- In FIG. 4, data D1, D2, D3, D4 go into instruction sets I1, I2, I3, I4.
- The second data D2 go into the first processor I1 as the first data D1, having been processed by the first processor I1, go into the second processor I2.
- MISD computing contributes to overall efficiency only when there are at least two input instances at the pipe intake, with a maximal speedup of k times, where k is the number of pipe stages.
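- As a worked check on this claim (assuming unit-time stages): n inputs pass through a k-stage pipeline in k + (n - 1) stage times instead of nk, so the speedup is S(n) = nk / (k + n - 1), which equals 1 for a single input (n = 1) and approaches the maximal k as n grows.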
- a general object of the invention is a multicomputer system and method for parallel execution of application programs.
- Another object of the invention is to construct a scalable, fault tolerant and self-scheduling computer architecture for multi-computer systems.
- A multicomputer system and connectionless computing method are provided for use with a plurality of computers and a backbone.
- The multicomputer system is thus called a connectionless machine (CLM).
- The CLM backbone has a unidirectional slotted ring structure but can be simulated, with less efficiency, by any interconnection network that provides point-to-point, broadcast and exclusive-read operations.
- the CLM backbone may use most types of medium for interconnecting the plurality of computers.
- At least one of the computers, and in general several computers, send labeled messages, in data packets, over the CLM backbone.
- The messages may be sets of instructions, I1, I2, I3, I4, . . ., or sets of data, D1, D2, D3, D4, . . .
- Each computer receives a full set of partitioned segments belonging to one application program. Each program segment, or a subset of a partitioned program, consumes and produces labeled data messages.
- a computer having no work to perform is considered to be sitting idle.
- the first computer encountering a message that matches the requirements of a program segment may exclusively receive that message.
- The program segment on the first computer is enabled when all required labeled data messages are received.
- When the first computer generates new data, they are sent as new data messages which may activate many other computers. This process continues until all data messages are processed.
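- The consume-and-produce cycle described above can be sketched in C as below; the in-memory ring, the function names and the two-input segment S1 are illustrative assumptions rather than the patent's interfaces:

    /* Sketch: one node exclusively consuming labeled tuples from a simulated
       backbone and firing a program segment once its full input set matches. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_TUPLES 16

    typedef struct { const char *label; double value; } tuple_t;

    static tuple_t ring[MAX_TUPLES];   /* simulated rotating backbone */
    static int     ring_len = 0;

    static void put_tuple(const char *label, double v) {
        ring[ring_len++] = (tuple_t){ label, v };   /* new tuple enters the ring */
    }

    /* EXCLUSIVE-READ: remove and return the first tuple whose label matches. */
    static int take_tuple(const char *label, tuple_t *out) {
        for (int i = 0; i < ring_len; i++) {
            if (strcmp(ring[i].label, label) == 0) {
                *out = ring[i];
                ring[i] = ring[--ring_len];         /* consumed slot is reused */
                return 1;
            }
        }
        return 0;
    }

    int main(void) {
        put_tuple("X_1", 3.0);                      /* initial data tuples */
        put_tuple("Y_1", 4.0);

        /* Segment S1 requires X_1 and Y_1; it fires only when both match. */
        tuple_t x, y;
        if (take_tuple("X_1", &x) && take_tuple("Y_1", &y)) {
            put_tuple("Z_1", x.value + y.value);    /* result re-enters the ring */
            printf("S1 fired: Z_1 = %.1f\n", x.value + y.value);
        }
        return 0;
    }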
- FIG. 1 illustrates a single instruction, single data machine
- FIG. 2 illustrates a single instruction, multiple data machine
- FIG. 3 illustrates a multiple instruction, multiple data machine
- FIG. 4 illustrates a multiple instruction, single data machine
- FIG. 5 illustrates a plurality of computing machines employing a tuple space
- FIG. 6 illustrates the CLM architecture
- FIG. 7 is a MIMD data flow example
- FIG. 8 is a SIMD component example
- FIG. 9 illustrates a CLM compiler and operating system interface
- FIGS. 10A and 10B illustrate the CLM compilation principle
- FIGS. 11A and 11B illustrate the concept of
- FIG. 12 shows an L-block master section procedure
- FIG. 13 shows an L-block worker section procedure
- FIG. 14 shows an R-block master section procedure
- FIG. 15 shows an R-block worker section procedure
- FIG. 16 illustrates an exclusive-read deadlock
- FIG. 17 shows a CLM operating system extension
- FIG. 18 illustrates a method of the present invention for processing and executing application programs
- FIG. 19 shows a conceptual implementation of a CLM processor with a backbone interface
- FIG. 20 illustrates an implementation of CLM with three processors
- FIG. 21 illustrates a CLM construction with two backbones
- The multicomputer system and method have a plurality of computers connected to a connectionless machine (CLM) backbone.
- Each computer connected to the CLM backbone has a local memory for receiving data packets transmitted on the CLM backbone.
- the plurality of computers seek or contend to receive data packets, and thus receive messages, from the CLM backbone.
- a condition may exist where the CLM backbone becomes saturated with data packets, i.e., all the slots on the CLM backbone are full of data packets. The saturation will be automatically resolved when some computers become less busy. If all computers stay busy for an extended period of time, adjustments to the computer processing powers and local write-buffer sizes can prevent or resolve this condition.
- a computer is defined as a device having a central processing unit (CPU) or a processor.
- The computers in a CLM system may be heterogeneous or homogeneous, and each computer has a local memory and optional external storage.
- Each computer preferably has more than 0.5 million floating point operations per second (MFLOPS) of processing power.
- a CLM backbone is defined as having a unidirectional slotted ring structure or simulated by any interconnection network that can perform or simulate point-to-point, broadcast and exclusive-read operations.
- The network can use fiber optic or copper cables, radio or micro waves, infra-red signals or any other type of medium for interconnecting the computers.
- a message is defined to be a labeled information entity containing one or more data packets.
- a message may include at least one set of instructions or data.
- the term 'data tuple' is used to refer to 'labeled message'.
- the set of instructions operates on at least one set of data.
- A data packet is defined to be a limited number of bits containing useful information and a label for program segment matching purposes.
- a data packet is the basic building block for transmitting messages through the CLM backbone.
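- The patent does not fix a wire layout for the packet; one plausible C rendering, with field names and widths as assumptions (the S bit and the roughly 1 KB slot echo the shift-register description given later in the text), is:

    #include <stdint.h>

    /* Hypothetical on-wire layout of one CLM data packet; all field names
       and widths are illustrative. */
    typedef struct {
        uint8_t  s_bit;         /* slot availability flag (cf. the S bit, FIG. 19) */
        uint8_t  exclusive;     /* 1 if part of an EXCLUSIVE-READ message */
        uint16_t label_index;   /* compiler-generated index used for tuple matching */
        uint16_t seq;           /* packet position within its message */
        uint16_t total;         /* number of packets in the message */
        uint8_t  payload[1016]; /* data bytes: 8 header bytes + 1016 = 1024 */
    } clm_packet_t;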
- A broadcast is defined to be a communication originating from a network device to all other network devices.
- a network device typically includes a computer; however, any device in general may be connected to the CLM backbone.
- A multicast is defined to be a communication originating from a network device to a group of selected network devices with specially arranged addresses.
- The term broadcast is used for identifying both broadcasts and multicasts unless indicated differently.
- The CLM multicomputer system is a multicomputer processor architecture for providing a solution to problems in multi-processor (multi-CPU) systems in scheduling, programmability, scalability and fault tolerance.
- The following description is of the CLM architecture, operating principle and the designs for a CLM compiler and an operating system extension.
- a multicomputer system and method are provided for use with a plurality of computers, and a CLM backbone 120.
- the CLM multicomputer system uses the CLM backbone 120 as a medium for transmitting the plurality of labeled data packets corresponding to the message, for transmitting processed data packets, and for transmitting EXCLUSIVE-READ signals.
- the CLM backbone 120 may be a local area network or a wide area network, with the CLM backbone 120 using any type of medium for interconnecting the computers.
- the medium may be cable, fiber optics, parallel wires, radio waves, etc.
- The CLM architecture is based on a parallel computing model called a Rotating Parallel Random Access Machine (RPRAM).
- The common bus for connecting parallel processors and memories in conventional Parallel Random Access Machine (PRAM) systems is replaced, in the RPRAM, by a high speed rotating unidirectional slotted ring, i.e. the CLM backbone 120.
- A CLM system operates according to the dataflow principle; i.e. every application program is represented as a set of labeled program segments, or subsets of an application program, partitioned according to the natural data dependencies.
- the CLM backbone 120 may transmit data tuples, embodied as labeled data packets, as initial data tuples of the application program, in the backbone.
- a computer completing the processing of a data tuple returns the results of the processing into the CLM backbone 120 in the form of new data tuples.
- a new set of computers may be triggered by the new data tuples to process the new data tuples, and the processing of initial and new data tuples continues until the application program is completed.
- The CLM multicomputer system executes an application program including a plurality of program segments, with each program segment labeled according to data dependencies or parallelism among the program segments, and with each program segment connected to other program segments by a corresponding labeled data tuple.
- the CLM backbone 120 responsive to an execution command, transmits the labeled program segments to the plurality of computers to load the labeled program segments into each computer of the plurality of computers.
- The CLM backbone 120 also circulates the labeled data tuples through the plurality of computers. At least one of the plurality of computers, when it receives a matching set of data tuples, activates the program segment corresponding to the labeled data tuples, processes the activated program segments, and transmits on the CLM backbone 120 the results of the processing as labeled data tuples.
- The CLM backbone 120 continues to transmit the labeled data tuples to the plurality of computers until the application program is completed.
- A coarse grain vector, or SIMD, processor forms when a number of computers are activated upon the reading of vectorized data tuples (tuples of the same prefix), thereby triggering the same program segment existing on these computers.
- A coarse grain MIMD processor forms when different program segments on different computers are activated at the same time.
- processor pipelines stabilize to provide an additional acceleration for the inherently sequential parts of the application.
- FIG. 7 illustrates the formation of a coarse grain MIMD processor.
- computers 100, 101, 102 coupled to the CLM backbone 120
- Computers 100, 101 execute program segments S1 and S2, respectively, in parallel as MIMD, followed by computer 102 executing segment S3.
- program segments in this example use the EXCLUSIVE-READ operation since the original dataflow does not have the sharing of data elements.
- FIG. 8 illustrates the formation of a coarse grain SIMD processor.
- computers 100, 101, 102 coupled to the CLM backbone 120
- Computers 100, 101, 102 execute program segments S1 and S2 in parallel as SIMD.
- Computer 100 exclusively reads tuple a, so segment S1 is executed only once, on computer 100.
- Segment S1 generates 1000 vectorized tuples labeled 'd1' to 'd1000', and triggers at least three computers (waiting for 'd*'s) to run in parallel as an extended version of conventional vector processors; i.e. an SIMD component.
- S3 may only be activated on one of the three computers due to EXCLUSIVE-READs.
- a set of sequentially dependent program segments executed by the computers of the CLM system may form a coarse grain processor pipeline. If the data tuples continue to be input, the pipeline may contribute to a potential speedup of computations.
- Multiple parallel segments may execute at the same time and compete for available data tuples.
- Multiple backbone segments can transmit simultaneously. Labeled tuples may circulate in the backbone until either consumed or recycled.
- the unidirectional high speed CLM backbone 120 fuses the active data tuples from local memories into a rotating global memory.
- the CLM backbone 120 circulates only active data tuples required by the processing elements at all times. Thus, if the overall processing power is balanced with the backbone capacity, the worst tuple matching time is the maximum latency of the backbone.
- the multicomputer scheduling problem is transformed into a data dependency analysis problem and an optimal data granule size problem at the expense of redundant local disk space or memory space.
- MIMD and pipeline segments are natural products of the program dependency analysis.
- the parallelization of SIMD segments requires dependency tests on all repetitive entities (loops or recursive calls). For example, any upwardly independent loop is a candidate for parallelization. All candidates are ranked by the
- The parallelized loop uses a variable data granule size. Adjusting this size affects the overall computing versus communications ratio in the loop's parallelization, and the best ratio gives an optimal performance. A heuristic algorithm has been developed to adjust the size dynamically to deliver reasonably good performance. Recursive segments are treated analogously by the R-block vectorization described below.
- any sequential program may automatically distribute and parallelize onto the CLM architecture.
- the CLM compilation is independent of the number of computers in a CLM system.
- A CLM compiler generates code for execution regardless of the number of computers.
- An application's natural structure also has an impact on performance. For example, an application having many duplicably decomposable segments performs better than an application having few duplicably decomposable segments. Applications with a wider range of data granule sizes perform better than applications with narrow data granule ranges. The impact is most obvious when the CLM processor powers vary widely.
- the CLM backbone capacity and the aggregate computer power also limit the maximum performance deliverable by a CLM system for an application.
- The theoretical power limit of a CLM may be modeled by the equation P = min( CD × min(DD, R), Σ Pi ), in which:
- P is the CLM theoretical performance limit in MFLOPS
- CD is the average computation density of an application in floating point operations per byte (FLOPB)
- DD is the average data density of an application; i.e. total input, output, and intermediate data in millions of bytes preferably transmitted in one time unit (second);
- R is the CLM ring capacity in MBPS
- Pi is the individual CLM processor power in millions of floating point operations per second (MFLOPS).
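- As an illustration of the above model (with invented numbers): for an application with CD = 10 FLOPB and DD = 50 MB per second on a ring of capacity R = 100 MBPS driven by 200 processors of 1 MFLOPS each, P = min(10 × min(50, 100), 200) = min(500, 200) = 200 MFLOPS; such a system is processor-bound, and added ring capacity would not raise the limit.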
- A distributed CLM system may include up to 10,000 computers with a maximum round trip latency of less than 8 seconds. With a 100 ns per-computer delay, a 32,000-computer centralized CLM system has a 3.2 millisecond maximum round trip latency. Any application requiring more than 10 seconds of computing time can benefit from either CLM.
- The CLM software architecture includes a compiler and an operating system extension, and, in the preferred embodiment, every computer of the CLM system runs a multiprogrammed operating system, such as Unix(tm), VMS(tm) and OS2(tm).
- the CLM compiler analyzes the data dependency and embedded parallelism; i.e. the sub-program duplicability, of a given sequential program.
- the CLM compiler generates a set of partitioned program segments and a program-segment-to-data-tuple mapping.
- the CLM operating system extension uses the mapping to provide a fast searchless tuple matching mechanism.
- a CLM system is a data reduction machine that uses parallel processors to reduce the natural dataflow graph of a computer application.
- the embedded parallelism of the application is defined by the CLM compiler.
- The CLM architecture of the present invention trades space redundancy for scheduling efficiency, programmability and fault tolerance.
- The CLM compiler is a sequential program re-structurer; i.e. the CLM compiler takes a sequential program description as an input in a specific programming language, for example, the C programming language. In use, the CLM compiler generates a set of distributable independent program segments and a data-tuple-to-program-segment matching table (DTPS_TBL).
- the local CLM operating system extensions load the DTPS_TBL, and prepare each computer for a subsequent RUN command.
- FIG. 9 illustrates the relationship between the CLM compiler 555 and the CLM operating system (O/S) extension 560.
- the CLM 'run X' command sends application X's data tuples to all CLM processors through the O/S extension 560.
- the compile-link 580 means that the CLM compiler can
- the CLM compilation command at the operating system level has the following syntax:
- the -D option designates a data density threshold for determining the worthiness of parallelization of repetitive program segments.
- the -G option designates a grain size value for optimizing the difference.
- the -F option activates fault tolerance code generation.
- the -R option may impact the overall performance due to the extra overhead introduced in fault detection and recovery.
- the time-out value of -F may range from microseconds to 60 seconds.
- the -V option controls the depth of vectorization. A greater value of V may be used to obtain better performance when a large number of CLM processors are employed.
- The -F option defines the factoring percentage used in a heuristic tuple sizing algorithm.
- CLM compiler commands for other programming languages may have similar syntax structure.
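- For concreteness, a C-language compilation might be invoked as in the line below; the driver name clmc and the option values are hypothetical, while the option letters are the ones described above:

    clmc -D 4 -G 100 -V 2 -T 50 -P 20 prog.c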
- FIGS. 10A and 10B illustrate the CLM compiler operating principle. Every sequentially composed program, regardless of the number of sub-programs, can be transformed into one linear sequence of single instructions.
- the repetitive segments indicate the natural partitions of the program. Every partitioned segment can be converted into an independent program by finding input and output data structures of the partitioned segment.
- the repetitive segments may be further parallelized using the SIMD principle. For example, for an iterative segment, independent loop instances are considered duplicable on all CLM processors. Then a data vectorization is performed to enable the simultaneous activation of the same loop core on as many as possible CLM processors, i.e., dynamic SIMD processors.
- a recursive state tuple is constructed to force the duplicated kernels to perform a breadth-first search over the implied recursion tree.
- The CLM compiler 555 operates in the following phases: Data Structure Initialization, Blocking, Tagging, L-Block Vectorization, R-Block Vectorization, S-Block Processing, Formulate Tuples and Map Generation.
- Data Structure Initialization: if the input specification is syntactically correct, the data structure initialization phase builds an internal data structure to store all information necessary to preserve the semantics of the sequential program, including:
- A linear list of statements for the main program body is the primary output of the data structure initialization phase. In the list, all procedures and functions have been substituted in-line.
- Blocking: the blocking phase scans the linear statements and blocks the linear statements into segments according to the following criterion:
- L-Block Vectorization: the L-block vectorization phase analyzes the duplicability of each L-block, and performs the following tasks:
- the L-Block vectorization phase may optionally attempt to order the number of
- v. Calculate the computational density D of the L-block by counting the number of fixed and floating point operations, including the inner loops. If D is less than the threshold specified by the -D option at the compilation command, the L-block is not considered worth parallelizing.
- Starting from the outermost loop, mark the Vth loop as an SIMD candidate, with V specified at the compilation command by the -V option.
- R-Block Vectorization: the R-block vectorization phase is responsible for analyzing the duplicability of recursive functions and procedures by performing the following tasks:
- S-Block Processing: S-block processing is performed to enhance the efficiency of the potential processor pipelines if the intended program is to operate on a continuous stream of data, as specified by the presence of the -P option at the compilation command.
- Each S-block is inspected for the total number of statements contained in the S-block, and all S-blocks are spliced into segments less than or equal to the value given by the -P option.
- ii. Collect all data structures returned by each block.
- The returned data structures include all global data structures referred to on the left-hand side of all statements in the block and the modified parameters returned from function/procedure calls;
- Formulate Tuples: the formulation of tuples starts from the output of each block. For each block, the optimal number of output tuples equals the number of sequentially dependent program segments, thereby preserving the original parallelism by retaining the natural dataflow.
- Every output tuple is assigned a tuple name.
- the tuple name consists of the sequential program name, the block name, and the first variable name in the collection of output variables.
- the assigned names are to be propagated to the inputs of all related program segments;
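- For illustration (the dotted separator is an assumption; the patent only names the three components): a hypothetical program fft with a block S2 whose first output variable is X_2 would produce the tuple name fft.S2.X_2, and every program segment consuming that output would list fft.S2.X_2 among its input tuple names.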
- X is modified by three blocks:
- X_1 will be EXCLUSIVE-READ by S2 with the updated value in X_2. The similar process continues for the third block.
- FIGS. 11A and 11B assume the same definition as X.
- Each SIMD L-block is further spliced into two sections: a master section and a worker section.
- the master section contains statements for scattering the vectorized data tuples and for gathering the resulting data tuples.
- The globally referenced data structures in the worker section are structured as READ tuples in the worker code.
- The vectorized tuples are scattered in G groups, where G is calculated according to the following loop scheduling algorithm, developed based on "Factoring: A Method for Scheduling Parallel Loops," by S.F. Hummel, E. Schonberg and L.E. Flynn, Communications of the ACM, Vol. 35, No. 8, 90-101, August 1992.
- The disclosed algorithm was also developed based on the algorithm published in "A Virtual Bus Architecture for Dynamic Parallel Processing" by K.C. Lee, who discussed a modular time-space-time network comprising multiple time division networks connected by a nonblocking space division network (switches). Except for the exclusive-read implementation, the rest is compatible with the multiple-backbone CLM architecture.
- The tuple size G(Sj) is defined as follows:
- the value of P can be automatically determined by the CLM backbone management layer and transmitted to the program at execution time.
- the value T is from the -T option at the compilation command line.
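- The exact expression for G is not reproduced above; the sketch below follows the factoring discipline of the cited Hummel-Schonberg-Flynn paper, with the percentage divisor, the driver loop and all names as assumptions:

    #include <stdio.h>

    /* Factoring-style tuple sizing: scatter a loop in successive batches of P
       tuples, each covering a shrinking share of the remaining iterations.
       The role of t as a percentage divisor is an assumption. */
    int main(void) {
        int remaining = 1000;   /* loop iterations left to scatter          */
        int P = 4;              /* processors, determined at execution time */
        int t = 50;             /* factoring percentage (cf. the -T option) */
        int group = 1;
        while (remaining > 0) {
            int g = (remaining * t) / (100 * P);   /* tuple size this group */
            if (g < 1) g = 1;
            for (int j = 0; j < P && remaining > 0; j++) {
                int size = g < remaining ? g : remaining;
                printf("group %d: tuple of %d iterations\n", group, size);
                remaining -= size;
            }
            group++;
        }
        return 0;
    }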
- the worker section is padded with statements for retrieving tuples in the beginning of the worker section, and with statements for packing the results into new tuples at the end of the worker section.
- the new spliced segments are distinguished by suffixes M and W respectively.
- The master section performs the reading and getting 200 of all the global data tuples; the scattering of the vectorized data tuples in G groups; and the gathering of the result tuples.
- The worker section performs the reading 310 of global data tuples; getting 320 a vectorized tuple; adding the vectorized data 335 (T,Sj,#) to the result tuple; putting 350 the result tuple to the backbone; checking 355 if Sj>T, then looping back to get a vectorized tuple 320; otherwise ending 360.
- Each duplicable R-block is spliced into two sections: a master section and a worker section.
- the master section contains statements for generating an initial state tuple and a live tuple count TP_CNT, with TP_CNT initially set to 1.
- the initial state tuple is a collection of all data structures required for subsequent computation.
- The master section also has statements for collecting the results. As shown in FIG. 14, the master section operates using the procedure of assembling 365 a state tuple from the collection of all read-only dependent data structures in the R-block, with the state tuple being ST_TUPLE (G, d1, d2, ..., dk), where G is the grain size given at compilation time using the -G option.
- The master section also creates a result tuple using the globally updated data structures and return data structures, via syntactical analysis of the recursion body, and puts the result tuple into the backbone.
- The master section generates 370 a live tuple count TP_CNT; sets 375 the live tuple count TP_CNT to 1; puts 380 the state tuple and the live tuple count to the CLM backbone 120; gets 385 a term tuple from the CLM backbone 120; gets 390 a result tuple from the CLM backbone 120; unpacks 395 the result tuple; and returns 400 the unpacked results.
- The worker section gets 405 the state tuple ST_TUPLE from the CLM backbone 120; unpacks 410 the state tuple; and calculates 415 using the state tuple;
- the worker section also updates 425 the result tuple during the calculation via exclusive-reads and writes to the backbone. It then creates 440 N new state tuples according to the results of the calculations. It gets 445 TP_CNT from the backbone and sets it to TP_CNT + N -1. If TP_CNT becomes 455 zero, then it creates 460 a "term" tuple and puts it into the backbone. Otherwise, it puts 475 N new state tuples into the backbone and loops back to the beginning.
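- The live-count bookkeeping of FIGS. 14 and 15 can be condensed into the C sketch below, a single-threaded simulation in which a local stack stands in for the backbone and the toy recursion tree is an assumption:

    #include <stdio.h>

    /* TP_CNT protocol: each consumed state tuple spawns N new ones and
       adjusts the live count by N - 1; a count of zero emits "term". */
    static int tp_cnt = 1;                /* master initializes the count to 1 */

    static int children_of(int depth) {   /* toy recursion: binary, depth 3 */
        return depth < 3 ? 2 : 0;
    }

    int main(void) {
        int stack[64], top = 0;           /* stack stands in for the backbone */
        stack[top++] = 0;                 /* initial state tuple, depth 0 */
        while (top > 0) {
            int depth = stack[--top];     /* worker gets a state tuple */
            int n = children_of(depth);   /* computes, spawning n new states */
            tp_cnt += n - 1;              /* exclusive read-modify-write of TP_CNT */
            for (int i = 0; i < n; i++)
                stack[top++] = depth + 1; /* put n new state tuples back */
            if (tp_cnt == 0)
                printf("term tuple emitted: recursion complete\n");
        }
        return 0;
    }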
- EXCLUSIVE-READ deadlock prevention is performed by processing the multiple-EXCLUSIVE-READ blocks.
- A block with K exclusive-read input tuples, K > 1, implements the protocol shown in FIG. 16, where a count CNT is set 500 to equal 1; input tuples are read 505 from the CLM backbone 120; and a check 510 is made whether the input tuple is to be read, as opposed to being exclusively read. If the input tuple is to be read, the input tuple is read 515. Otherwise, a check 520 is made whether the input tuple is to be exclusively read and the count CNT is less than K. If the input tuple is to be exclusively read and the count CNT is less than K, then the input tuple is read 525 and the count is incremented by setting 530 CNT equal to CNT + 1.
- If the input tuple is to be exclusively read but the count CNT is greater than or equal to K, the input tuple is exclusively read 535.
- the procedure illustrated in FIG. 16 prevents possible deadlocks when K exclusive tuples are acquired by L different program segments on L computers, with L > 1, with the deadlock resulting in no exclusive tuples progressing.
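- In C form, the FIG. 16 protocol amounts to the loop below; treating a plain READ as a copy that leaves the tuple on the ring is the only assumption beyond the text:

    #include <stdio.h>

    /* A segment needing K exclusive-read inputs copies the first K - 1 with
       a plain READ (tuple stays on the ring) and exclusively consumes only
       the last, so two nodes can never each hold a partial, mutually
       blocking subset. */
    #define K 3

    int main(void) {
        const char *inputs[K] = { "a_1", "a_2", "a_3" };
        int cnt = 1;                                      /* step 500 */
        for (int i = 0; i < K; i++) {
            if (cnt < K) {
                printf("READ %s (copy left on ring), CNT=%d\n", inputs[i], cnt);
                cnt++;                                    /* steps 525, 530 */
            } else {
                printf("EXCLUSIVE-READ %s (consumed)\n", inputs[i]);  /* 535 */
            }
        }
        return 0;
    }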
- Map Generation: for each program segment, generate the initial input tuple names and mark the initial input tuple names with RD, to be read, or ERD, to be exclusively read.
- the initial input tuples in the map should NOT contain the tuples belonging to the gathering parts, such as 245 and 285 in FIG. 12 and 385 and 390 in FIG. 14. These tuples are to be enabled after the completion of the scattering part.
- The CLM O/S extension 560 contains the following elements:
- An Event Scheduler 585, a program using CLM-extended TCP/IP to interpret the network traffic.
- The Event Scheduler 585 branches into four different servers: a Program Management Server (PMS), a Data Token Server (DTS), a General Process Management Server (GPMS), and a Conventional Operating System Interface (COSI).
- Program Management Server (PMS):
- The program storage and removal functions act as a simple interface with the existing O/S file systems. After receiving an activation command, or a RUN command, for an application program, the PMS builds a memory image for every related segment:
- The segment image can contain only the Trigger Tuple Name Table (TTNT) and the Disk Address entry. Similar to a demand paging concept, a segment with a matching tuple is fetched from local disk to local memory.
- The trigger tuple name table size is adjustable at system generation time. The PMS also creates an indexed table (MTBL) from DTPS_TBL to the newly created segment images according to the referenced tuple names.
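- Structurally, the PMS tables can be pictured as the C declarations below; the field widths and table sizes are assumptions, and only the TTNT-plus-disk-address image and the tuple-index-to-image MTBL mapping come from the text:

    #include <stdint.h>

    #define TTNT_SIZE 8   /* trigger tuple name table size, set at sysgen time */

    /* Memory image of one program segment: the tuple indices that trigger it
       plus the disk address of its body, fetched on demand like a page. */
    typedef struct {
        uint16_t trigger_tuples[TTNT_SIZE]; /* compiler-assigned tuple indices */
        uint32_t disk_address;              /* where the segment code resides  */
    } segment_image_t;

    /* MTBL: maps a referenced tuple index to the segment images it may enable. */
    typedef struct {
        uint16_t         tuple_index;
        segment_image_t *images[4];         /* fan-out bound is illustrative */
    } mtbl_entry_t;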
- Data Token Server (DTS):
- General Process Management Server (GPMS) 620, for managing the KILL, SUSPEND, and RESUME commands, as well as general process status (GPS) commands for the CLM processes.
- Conventional Operating System Interface (COSI):
- The present invention uses a method for executing an application program including a plurality of program segments.
- The method includes the steps of receiving 1040 an execution command; transmitting 1045, on the CLM backbone 120, the labeled program segments to the plurality of computers; loading 1050 the labeled program segments into each computer of the plurality of computers; transmitting 1055, on the CLM backbone 120, the labeled data tuples to the plurality of computers; receiving 1060, at a receiving computer, a labeled data tuple; activating 1065, at the receiving computer, in response to the labeled data tuple, the program segments corresponding to the labeled data tuple; processing 1070, at the receiving computer, the activated program segments; transmitting 1075, on the CLM backbone 120, the results of the processing of the program segments as labeled data tuples; and continuing 1080 to process and transmit labeled data tuples until the application program is completed.
- FIG. 19 illustrates a CLM processor with a single backbone interface.
- The S bit indicates the availability of the register to the backbone.
- the Shift_Clock controls the backbone rotating frequency.
- The shifting register can typically hold 1024 bytes of information.
- a) Purge of returned messages: this protocol checks the register content against BUFFER_OUT. If there is a match, and if the message is not EXCLUSIVE-READ or the message is not the last packet in an EXCLUSIVE-READ message, then the content in BUFFER_OUT is purged. This empties the message slots. A returned EXCLUSIVE-READ message (or the last packet of the message) will keep circulating until it is consumed.
- b) Tuple name matching:
- A data tuple in the Register contains an index generated by the compiler. This index is also recorded in the DTPS_TBL and the local TTNTs. A simple test using the index in the local TTNTs can determine the match.
- EXCLUSIVE-READ deadlock avoidance: when more than one CPU exclusively reads one of many tuples belonging to the same program segment, or more than one CPU exclusively reads some of the packets belonging to the same tuple, none of the CPUs will ever be completely matched. The deadlock is avoided by exclusively reading only the last EXCLUSIVE-READ tuple in a TTNT, or the last packet of an EXCLUSIVE-READ tuple.
- FIG. 20 illustrates a CLM with three processors and a single backbone. In this structure, point-to-point, broadcast and exclusive-read of any message can be performed by any processor.
- When the BUFFER_OUTs on all processors are full, the system enters a "lock state".
- the 'lock state' may be automatically unlocked when some of the CPUs become available for computing.
- the system enters a "deadlock state" when all CPUs are blocked for output and no empty slot is available on the backbone.
- FIG. 21 illustrates the CPU-to-backbone interface for a CLM system with two backbones. Both READ and WRITE operations can be performed on either backbone.
- the backbone initialization command sets all message (register) headers to 0.
- Processors are not guaranteed to share the backbone in a "fair" fashion: the closer neighbors of the sender will be busier than those further away. In general, this should not affect the overall CLM performance, since when all the closer neighbors are busy, the further neighbors will eventually get something to do.
- Patent application No. 08/029,882, A MEDIUM ACCESS CONTROL PROTOCOL FOR SINGLE-BUS MULTIMEDIA FAIR ACCESS LOCAL AREA NETWORKS, by Zheng Liu, can give both the fairness property and the multi-media capabilities.
- the present invention also demonstrates a feasible design of a single backbone and a multiple-backbone CLM system.
- the disclosed protocols illustrate the principles for implementing:
- The present invention automatically partitions any sequential program into program segments and uses a method for executing an application program including a plurality of inter-relating program segments, with each program segment labeled according to data dependencies and parallelism among the program segments.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU11746/95A AU1174695A (en) | 1994-11-07 | 1994-11-07 | Multicomputer system and method |
PCT/US1994/012921 WO1996014617A1 (en) | 1994-11-07 | 1994-11-07 | Multicomputer system and method |
EP95902495A EP0791194A4 (en) | 1994-11-07 | 1994-11-07 | Multicomputer system and method |
JP8515266A JPH10508714A (en) | 1994-11-07 | 1994-11-07 | Multicomputer system and method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US1994/012921 WO1996014617A1 (en) | 1994-11-07 | 1994-11-07 | Multicomputer system and method |
Publications (1)
Publication Number | Publication Date |
---|---|
WO1996014617A1 true WO1996014617A1 (en) | 1996-05-17 |
Family
ID=22243254
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US1994/012921 WO1996014617A1 (en) | 1994-11-07 | 1994-11-07 | Multicomputer system and method |
Country Status (4)
Country | Link |
---|---|
EP (1) | EP0791194A4 (en) |
JP (1) | JPH10508714A (en) |
AU (1) | AU1174695A (en) |
WO (1) | WO1996014617A1 (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2001014970A2 (en) * | 1999-08-25 | 2001-03-01 | Infineon Technologies Ag | Event scheduler and method for analyzing an event-oriented program code |
GB2374443B (en) * | 2001-02-14 | 2005-06-08 | Clearspeed Technology Ltd | Data processing architectures |
CN102567079A (en) * | 2011-12-29 | 2012-07-11 | 中国人民解放军国防科学技术大学 | Parallel program energy consumption simulation estimating method based on progressive trace update |
EP3539261A4 (en) * | 2016-11-14 | 2020-10-21 | Temple University Of The Commonwealth System Of Higher Education | System and method for network-scale reliable parallel computing |
CN113590166A (en) * | 2021-08-02 | 2021-11-02 | 腾讯数码(深圳)有限公司 | Application program updating method and device and computer readable storage medium |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE602007014413D1 (en) * | 2007-03-06 | 2011-06-16 | Nec Corp | DATA TRANSFER NETWORK AND CONTROL DEVICE FOR A SYSTEM WITH AN ARRAY OF PROCESSING ELEMENTS, EITHER EITHER SELF- OR COMMONLY CONTROLLED |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5021947A (en) * | 1986-03-31 | 1991-06-04 | Hughes Aircraft Company | Data-flow multiprocessor architecture with three dimensional multistage interconnection network for efficient signal and data processing |
US5313647A (en) * | 1991-09-20 | 1994-05-17 | Kendall Square Research Corporation | Digital data processor with improved checkpointing and forking |
-
1994
- 1994-11-07 AU AU11746/95A patent/AU1174695A/en not_active Abandoned
- 1994-11-07 JP JP8515266A patent/JPH10508714A/en active Pending
- 1994-11-07 EP EP95902495A patent/EP0791194A4/en not_active Withdrawn
- 1994-11-07 WO PCT/US1994/012921 patent/WO1996014617A1/en not_active Application Discontinuation
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5021947A (en) * | 1986-03-31 | 1991-06-04 | Hughes Aircraft Company | Data-flow multiprocessor architecture with three dimensional multistage interconnection network for efficient signal and data processing |
US5313647A (en) * | 1991-09-20 | 1994-05-17 | Kendall Square Research Corporation | Digital data processor with improved checkpointing and forking |
Non-Patent Citations (1)
Title |
---|
See also references of EP0791194A4 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2001014970A2 (en) * | 1999-08-25 | 2001-03-01 | Infineon Technologies Ag | Event scheduler and method for analyzing an event-oriented program code |
WO2001014970A3 (en) * | 1999-08-25 | 2002-08-01 | Infineon Technologies Ag | Event scheduler and method for analyzing an event-oriented program code |
GB2374443B (en) * | 2001-02-14 | 2005-06-08 | Clearspeed Technology Ltd | Data processing architectures |
US8127112B2 (en) | 2001-02-14 | 2012-02-28 | Rambus Inc. | SIMD array operable to process different respective packet protocols simultaneously while executing a single common instruction stream |
CN102567079A (en) * | 2011-12-29 | 2012-07-11 | 中国人民解放军国防科学技术大学 | Parallel program energy consumption simulation estimating method based on progressive trace update |
CN102567079B (en) * | 2011-12-29 | 2014-07-16 | 中国人民解放军国防科学技术大学 | Parallel program energy consumption simulation estimating method based on progressive trace update |
EP3539261A4 (en) * | 2016-11-14 | 2020-10-21 | Temple University Of The Commonwealth System Of Higher Education | System and method for network-scale reliable parallel computing |
US11588926B2 (en) | 2016-11-14 | 2023-02-21 | Temple University—Of the Commonwealth System of Higher Education | Statistic multiplexed computing system for network-scale reliable high-performance services |
CN113590166A (en) * | 2021-08-02 | 2021-11-02 | 腾讯数码(深圳)有限公司 | Application program updating method and device and computer readable storage medium |
CN113590166B (en) * | 2021-08-02 | 2024-03-26 | 腾讯数码(深圳)有限公司 | Application program updating method and device and computer readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
JPH10508714A (en) | 1998-08-25 |
EP0791194A1 (en) | 1997-08-27 |
EP0791194A4 (en) | 1998-12-16 |
AU1174695A (en) | 1996-05-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US5517656A (en) | Multicomputer system and method | |
US11128519B2 (en) | Cluster computing | |
US5021947A (en) | Data-flow multiprocessor architecture with three dimensional multistage interconnection network for efficient signal and data processing | |
Kruskal et al. | Efficient synchronization of multiprocessors with shared memory | |
JP2882475B2 (en) | Thread execution method | |
Willcock et al. | AM++ a generalized active message framework | |
CA2367977C (en) | Distributed digital rule processor for single system image on a clustered network and method | |
Grimshaw et al. | Portable run-time support for dynamic object-oriented parallel processing | |
US8595736B2 (en) | Parsing an application to find serial and parallel data segments to minimize mitigation overhead between serial and parallel compute nodes | |
JPH08185325A (en) | Code generation method in compiler and compiler | |
WO1996014617A1 (en) | Multicomputer system and method | |
EP0420142B1 (en) | Parallel processing system | |
Gaudiot et al. | Performance evaluation of a simulated data-flow computer with low-resolution actors | |
Wrench | A distributed and-or parallel prolog network | |
Shekhar et al. | Linda sub system on transputers | |
Moreira et al. | Autoscheduling in a shared memory multiprocessor | |
Stricker et al. | Decoupling communication services for compiled parallel programs | |
Solworth | Epochs | |
Buzzard | High performance communications for hypercube multiprocessors | |
Arapov et al. | Managing the computing space in the mpC compiler | |
JP3514578B2 (en) | Communication optimization method | |
Barak et al. | The MPE toolkit for supporting distributed applications | |
Athas et al. | Multicomputers | |
Painter et al. | ACLMPL: Portable and efficient message passing for MPPs | |
Moura et al. | High level thread-based competitive or-parallelism in logtalk |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): AU BB BG BR CA FI HU JP KP KR LK MG MN MW NO PL RO SD SE |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR NE SN TD TG |
|
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
ENP | Entry into the national phase |
Ref document number: 2204518 Country of ref document: CA Ref country code: CA Ref document number: 2204518 Kind code of ref document: A Format of ref document f/p: F |
|
WWE | Wipo information: entry into national phase |
Ref document number: 1995902495 Country of ref document: EP |
|
WWP | Wipo information: published in national office |
Ref document number: 1995902495 Country of ref document: EP |
|
WWW | Wipo information: withdrawn in national office |
Ref document number: 1995902495 Country of ref document: EP |