US20030182518A1 - Parallel processing method for inverse matrix for shared memory type scalar parallel computer - Google Patents

Parallel processing method for inverse matrix for shared memory type scalar parallel computer

Info

Publication number
US20030182518A1
Authority
US
United States
Prior art keywords
matrix
parallel
blocks
block
updated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US10/288,984
Inventor
Makoto Nakanishi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Assigned to FUJITSU, LIMITED. Assignment of assignors interest (see document for details). Assignor: NAKANISHI, MAKOTO
Publication of US20030182518A1
Priority to US10/692,533 (published as US7483937B2)

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 — Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 — Complex mathematical operations
    • G06F 17/16 — Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization


Abstract

An LU decomposition is carried out on blocks E and H. Then, a block B is updated using an upper triangular portion of the block E, and a block D is updated using a lower triangular portion of the block E. At this time, the blocks F and I have already been updated in the LU decomposition. Then, using the blocks B, D, F, and H, the blocks A, C, G, and I are updated, an upper triangular portion of the block E is updated, and the blocks D and F are updated. Then, the second updating process is performed on the block E. Using the result of the process, the blocks B and H are updated. Finally, the block E is updated, the pivot interchanging process is completed, and the process terminates. These processes on the blocks are performed in parallel by a plurality of divided threads.

Description

    BACKGROUND OF THE INVENTION
  • 1. Field of the Invention [0001]
  • The present invention relates to an arithmetic process using a shared memory type scalar parallel computer. [0002]
  • 2. Description of the Related Art [0003]
  • Conventionally, when an inverse matrix of a matrix is obtained using a vector computer, a computing method based on the Gauss-Jordan method is used, relying on the assumption that memory can be accessed quickly. For example, methods such as double unrolling, in which the instructions of a loop are unrolled and executed together, are applied. [0004]
  • Described below is the method of obtaining an inverse matrix in the Gauss-Jordan method (or referred to simply as the Gauss method). (In the explanation below, the interchange of pivots is omitted, but a process of interchanging row vectors for the interchange of pivots is actually performed). [0005]
  • Assume that A indicates a matrix for which an inverse matrix should be calculated, and x and y indicate arbitrary column vectors. Ax=y is expressed as simultaneous linear equations as follows when matrix elements are explicitly written.[0006]
  • a_11 x_1 + a_12 x_2 + … + a_1n x_n = y_1
  • a_21 x_1 + a_22 x_2 + … + a_2n x_n = y_2
  •      ⋮
  • a_n1 x_1 + a_n2 x_2 + … + a_nn x_n = y_n
  • If the above listed equations are transformed into By = x, then B is the inverse matrix of A. [0007]
  • 1) The equation in the first row is divided by a_11. [0008]
  • 2) Compute (the i-th row) − (the first row) × a_i1 for i > 1. [0009]
  • 3) To make the coefficient of x_2 in the second row equal to 1, multiply the second row by the reciprocal of the coefficient of x_2. [0010]
  • 4) Compute (the i-th row) − (the second row) × a_i2 for i > 2. [0011]
  • 5) The above mentioned operation is continued up to the (n−1)th row. [0012]
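  • For illustration, the whole procedure applied to [A|I] can be written as a minimal NumPy sketch (ours, not the patent's code; pivot interchange is omitted as in the explanation above, so the leading elements are assumed nonzero):

        import numpy as np

        def gauss_jordan_inverse(a):
            # Transform [A | I] row by row; when A has become I, the
            # right half holds A^-1. No pivoting: illustrative only.
            a = a.astype(float, copy=True)
            n = a.shape[0]
            b = np.eye(n)
            for i in range(n):
                piv = a[i, i]
                a[i, :] /= piv            # give row i a leading 1
                b[i, :] /= piv
                for j in range(n):        # eliminate column i elsewhere
                    if j != i:
                        c = a[j, i]
                        a[j, :] -= c * a[i, :]
                        b[j, :] -= c * b[i, :]
            return b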
  • The interchange of column vectors accompanying the interchange of pivots is described below. [0013]
  • Both sides of Ax=y are multiplied by the matrix P corresponding to the interchange of pivots.[0014]
  • PAx=Py=z
  • When the following equation holds with the matrix B,[0015]
  • x=Bz
  • then B is expressed as follows.[0016]
  • B = (PA)^−1 = A^−1 P^−1
  • That is, by right-multiplying the obtained B by P, an inverse matrix of A can be obtained. Actually, it is necessary to interchange the column vectors. [0017]
  • In the equation, P = P_n P_{n−1} … P_1, and each P_k is an orthogonal transform having matrix elements P_ii = 0, P_ij = 1, P_jj = 0, and P_ji = 1, corresponding to the interchange of rows i and j. [0018]
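  • As a sketch (our own illustration, with 0-based indices), the final column interchange can be replayed from the pivot history in reverse order:

        import numpy as np

        def undo_pivots(b, ip):
            # ip[i] is the row interchanged with row i during elimination.
            # Right-multiplying by P = Pn...P1 applies Pn first, so the
            # column swaps are replayed in reverse order.
            for i in reversed(range(len(ip))):
                if ip[i] != i:
                    b[:, [i, ip[i]]] = b[:, [ip[i], i]]
            return b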
  • In the vector computer, an inverse matrix is computed by the above method on the assumption that memory accesses are fast. In the case of the shared memory type scalar computer, however, the frequency of accessing the shared memory increases with the size of the matrix to be computed, severely limiting performance. Therefore, it is necessary to perform the above matrix computation by utilizing the fast cache memory provided for each processor of the shared memory type scalar computer. That is, since the shared memory is accessed frequently if the computation is performed row by row or column by column, an algorithm is needed that localizes the computation assigned to each processor by dividing the matrix into blocks, with each processor performing the largest possible amount of work on data held in its cache memory before accessing the shared memory, thereby reducing the frequency of shared memory accesses. [0019]
  • SUMMARY OF THE INVENTION
  • The present invention aims at providing a method of computing an inverse matrix at a high speed in a parallel process. [0020]
  • The method according to the present invention is a parallel processing method for an inverse matrix for shared memory type scalar parallel computer, and includes the steps of: specifying a predetermined square block in a matrix for which an inverse matrix is to be obtained; decomposing the matrix into upper left, left side, lower left, upper, lower, upper right, right side, and lower right blocks surrounding the square block positioned in the center; dividing each of the decomposed blocks into the number of processors and LU-decomposing the square block and the lower, right side, and lower right blocks in parallel; updating the left side, upper, lower, and right side blocks in parallel in a recursive program, and further updating in parallel using the blocks updated in the recursive program on the upper left, lower left, upper right, and lower right blocks; updating a predetermined square block in plural stages using one processor; and setting the position of the square block such that it can sequentially move on the diagonal line of the matrix, and obtaining an inverse matrix of the matrix by repeating the above mentioned steps. [0021]
  • According to the present invention, the arithmetic operation of an inverse matrix is an updating process performed on each block, and each block is updated in parallel in a plurality of processors. Thus, the shared memory is accessed after the largest possible amount of operations are performed on a block stored in the cache provided for each processor, thereby realizing a localizing algorithm and performing a high-speed arithmetic operation for an inverse matrix. [0022]
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 shows the configuration of the hardware of the shared memory type scalar computer according to an embodiment of the present invention; [0023]
  • FIG. 2 is a diagram ([0024] 1) showing the order of computation according to an embodiment of the present invention;
  • FIG. 3 is a diagram ([0025] 2) showing the order of computation according to an embodiment of the present invention;
  • FIG. 4 is a diagram ([0026] 3) showing the order of computation according to an embodiment of the present invention;
  • FIG. 5 is a diagram ([0027] 4) showing the order of computation according to an embodiment of the present invention;
  • FIG. 6 is a diagram ([0028] 5) showing the order of computation according to an embodiment of the present invention;
  • FIGS. 7A through 7F are diagrams ([0029] 6) showing the order of computation according to an embodiment of the present invention;
  • FIGS. 8A through 8C are diagrams ([0030] 7) showing the order of computation according to an embodiment of the present invention;
  • FIG. 9 shows a pseudo code ([0031] 1) according to an embodiment of the present invention;
  • FIG. 10 shows a pseudo code ([0032] 2) according to an embodiment of the present invention;
  • FIG. 11 shows a pseudo code ([0033] 3) according to an embodiment of the present invention;
  • FIG. 12 shows a pseudo code ([0034] 4) according to an embodiment of the present invention;
  • FIG. 13 shows a pseudo code ([0035] 5) according to an embodiment of the present invention;
  • FIG. 14 shows a pseudo code ([0036] 6) according to an embodiment of the present invention; and
  • FIG. 15 shows a pseudo code ([0037] 7) according to an embodiment of the present invention.
  • DESCRIPTION OF THE PREFERRED EMBODIMENTS
  • The present embodiment provides an algorithm of obtaining an inverse matrix of a given matrix. In the shared memory type scalar computer, it is necessary to increase the ratio of computation to load. Therefore, according to an embodiment of the present invention, a method of dividing a matrix into blocks is used so that the arithmetic can be performed efficiently as matrix products. Furthermore, the computation density is enhanced by updating part of the matrix with a recursively calculated product of matrices of varying dimensions. [0038]
  • FIG. 1 shows the configuration of the hardware of the shared memory type scalar computer according to an embodiment of the present invention. [0039]
  • Processors [0040] 10-1 through 10-n have primary cache memory, and the primary cache memory can also be incorporated into a processor. Each of the processors 10-1 through 10-n is provided with secondary cache memory 13-1 through 13-n, and the secondary cache memory 13-1 through 13-n are connected to an interconnection network 12. The interconnection network 12 is provided with memory modules 11-1 through 11-n which are shared memory. The processors 10-1 through 10-n read necessary data for arithmetic operations from the memory modules 11-1 through 11-n, store the data in the secondary cache memory 13-1 through 13-n or the primary cache memory through the interconnection network 12, and perform arithmetic operations.
  • In this case, reading data from the memory modules [0041] 11-1 through 11-n to the secondary cache memory 13-1 through 13-n or the primary cache memory, and writing the computed data from the secondary cache memory 13-1 through 13-n or the primary cache memory back to the memory modules 11-1 through 11-n, are performed much more slowly than the processors 10-1 through 10-n operate. Therefore, if these writing and reading operations are frequently performed, the performance of the entire computer is badly lowered.
  • Therefore, to maintain high performance of the entire computer, an algorithm of reducing the access to the memory modules [0042] 11-1 through 11-n and performing the largest possible amount of computation in a local system formed by the secondary cache memory 13-1 through 13-n, the primary cache memory, and the processors 10-1 through 10-n is required.
  • Therefore, according to an embodiment of the present invention, computation for an inverse matrix is performed as follows. [0043]
  • Only a rightward updating process is performed on a block having a certain block width in the matrix to be computed, so that the information needed to update the portions outside the block is preserved. That is, the remaining portion is updated by the outer product of the column vector selected as the center and the row vector orthogonal to it, exactly as in the Gauss-Jordan method described above for the conventional technology. Written explicitly, step 2) of the Gauss-Jordan method above becomes the following. [0044]
  • a_ij ← a_ij − c_1j × a_i1 (where i, j > 1 and c_1j = a_1j / a_11)
  • This equation means that, selecting the column vector in the first column as the center and the row vector in the first row orthogonal to it, the submatrix formed by the second and subsequent rows and the second and subsequent columns is updated by the outer product of that column vector and that row vector. [0045]
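  • In NumPy terms (an arbitrary 3×3 example of ours), the whole step is a single rank-1 outer-product update of the trailing submatrix:

        import numpy as np

        a = np.array([[4., 2., 1.],
                      [2., 5., 3.],
                      [1., 3., 6.]])
        c = a[0, 1:] / a[0, 0]              # c_1j = a_1j / a_11
        a[1:, 1:] -= np.outer(a[1:, 0], c)  # outer product of the first
                                            # column and the first row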
  • Furthermore, Ax = y can be rewritten, using the unit matrix I, as Ax = Iy, and when A is transformed by the Gauss-Jordan method, the unit matrix on the right side is transformed as well. [0046]
  • In updating the unit matrix, the element (i, i) of the column corresponding to the coefficient eliminated in A becomes 1/a_ii in the right matrix, and the values of the other elements in that column of the right matrix are obtained by multiplying the elements of the column eliminated in A by 1/a_ii and then by −1. In a column block matrix, the elements to the left are also to be eliminated. [0047]
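  • A small sketch of this column transform (our illustration): eliminating column i of A yields the column stored in the right matrix as follows:

        import numpy as np

        a = np.array([[4., 2.],
                      [2., 5.]])
        i = 0
        col = np.empty(a.shape[0])
        col[i] = 1.0 / a[i, i]              # diagonal becomes 1/a_ii
        mask = np.arange(a.shape[0]) != i
        col[mask] = -a[mask, i] / a[i, i]   # others become -(a_ki / a_ii)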
  • FIGS. 2 through 8 show the order of computation according to an embodiment of the present invention. [0048]
  • The order of computation is described below. [0049]
  • First, in FIG. 2, the matrix M is divided into blocks of column block matrices from the left side, and the updating process is performed in the above mentioned method. [0050]
  • 1) E and H are LU decomposed (refer to Patent Application No. Hei-12-358232). [0051]
  • 2) B is updated by B ← BU^−1 using the upper triangular portion U of E. [0052]
  • 3) D ← L^−1D and F ← L^−1F using the lower triangular portion L of E. [0053]
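  • Steps 2) and 3) are triangular solves. A hedged SciPy sketch (pivoting is ignored for brevity, and which factor carries the unit diagonal depends on the LU convention, so the flags below are assumptions):

        import numpy as np
        from scipy.linalg import lu, solve_triangular

        E = np.random.rand(4, 4)
        B = np.random.rand(3, 4)   # block above E
        D = np.random.rand(4, 3)   # block on the left of E
        _, L, U = lu(E)            # SciPy: L unit lower, U upper

        B = solve_triangular(U, B.T, trans='T').T   # B <- B U^-1
        D = solve_triangular(L, D, lower=True,      # D <- L^-1 D
                             unit_diagonal=True)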
  • 4) A, C, G, and I are updated.[0054]
  • A←A−B×D, C←C−B×F, G←G−H×D, I←I−H×F
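  • These four updates are plain matrix products on disjoint blocks, which is what makes this stage easy to parallelize; a toy NumPy illustration of ours, with E occupying rows and columns iof+1 through iof+blk:

        import numpy as np

        n, iof, blk = 9, 3, 3
        m = np.random.rand(n, n)
        A, B, C = m[:iof, :iof], m[:iof, iof:iof+blk], m[:iof, iof+blk:]
        D, F = m[iof:iof+blk, :iof], m[iof:iof+blk, iof+blk:]
        G, H, I = (m[iof+blk:, :iof], m[iof+blk:, iof:iof+blk],
                   m[iof+blk:, iof+blk:])

        A -= B @ D   # the four products touch disjoint blocks, so each
        C -= B @ F   # can run on its own processor, or the second
        G -= H @ D   # dimension of each product can be divided equally
        I -= H @ F   # among the processors, as described later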
  • 5) The upper triangular portion of E is updated. Since this portion has not been updated in the LU decomposition, it is updated at this timing. [0055]
  • 6) The update of the upper portion of D and F is performed according to the information about D, E, and F. [0056]
  • At this time, the updating process is performed using a matrix product in the recursive program to enhance the computation density. [0057]
  • a) Recursive Program [0058]
  • The upper portions of D and F can be computed by the outer products of the respective row vectors and the column vector of E orthogonal to each row vector. To enhance the density of computation, the following recursive program is used. The upper portions of D and F here mean, as described with reference to FIG. 8, the upper portions of the row blocks of D and F. As FIG. 8 clearly shows, D and F can be updated from the upper portions of the row blocks. [0059]
    recursive subroutine rowupdate ( )
    if (update width < 10) then
c     The product of the matrix excluding the diagonal element and the
c     row block matrix is subtracted from the upper triangular matrix
c     of E (refer to the explanation of FIG. 8 for details).
    else
c     The update width is divided into a first half and a second half.
c     The first half is updated; then the second half is updated after
c     calling rowupdate.
      call rowupdate ( )
    endif
    return
    end
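  • A runnable NumPy rendering of this halving recursion (a sketch under our own naming: s holds the strict upper triangle of E, b the corresponding rows of D or F, and the width threshold mirrors the pseudo code):

        import numpy as np

        def rowupdate(s, b):
            # Back-elimination b <- T^-1 b, where T is unit upper
            # triangular with strict upper part s.
            n = b.shape[0]
            if n < 10:                     # base case: outer products
                for i in range(n - 1, -1, -1):
                    b[:i, :] -= np.outer(s[:i, i], b[i, :])
                return
            h = n // 2
            rowupdate(s[h:, h:], b[h:, :])    # second half first
            b[:h, :] -= s[:h, h:] @ b[h:, :]  # one dense matrix product
                                              # carries most of the work
            rowupdate(s[:h, :h], b[:h, :])    # then the first half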
  • 7) Then, while moving the elements on the diagonal line in the lower right direction, the rectangular portion in the column direction on the lower left with respect to the diagonal line is updated. As a result, the information required to update B and H is generated. [0070]
  • 8) The portions on the right of B and H are updated. This process is performed as follows. [0071]
  • With d indicating the block width, the column numbers are assigned sequentially from left to right as i, …, i+d−1. [0072]
  • The reciprocal of a_ii is obtained. The other portions are obtained by dividing each element by a_ii and inverting the sign. [0073]
  • The portion on the left of the (i+1)th row is divided by a_{i+1,i+1}, and the reciprocal of a_{i+1,i+1} is obtained. [0074]
  • The left portion is updated by multiplying the row i+1 by the column i+1 of the left portion, and then performing subtraction. [0075]
  • The above mentioned processes are repeated. [0076]
  • This portion is further divided according to the above procedures, which are also rearranged into a recursive algorithm. [0077]
  • 9) Finally, the rectangular portion in the row direction on the upper left portion of the element on the diagonal line of E is updated. [0078]
  • 10) Finally, the column vectors are interchanged in the reverse order of the history of the pivot interchanges. [0079]
  • The portion updated in 5) above is updated by E2 = E2 − a×b while sliding the diagonal elements of the diagonal-line portion (E2) shown in FIG. 3 along the diagonal line. [0080]
  • This portion has not been updated in the LU decomposition. After the update, the information about the upper triangular matrix required to update D and F can be obtained. [0081]
  • The portion updated in 7) above is updated by E3 = E3 − c×d while sliding the diagonal elements of the diagonal-line portion shown in FIG. 4. [0082]
  • Prior to the update, c is multiplied by the reciprocal of the diagonal element. [0083]
  • Consider that the diagonal element after the update is the reciprocal of the original diagonal element. The character d is considered to be obtained by multiplying each column element by a diagonal element before update, and then multiplying the result by −1. [0084]
  • As a result, the information about the lower triangular matrix required to update the portions on the left of B and H is obtained. [0085]
  • In 9) above, the updating process is performed by E1 = E1 − a×c while sliding the diagonal element of the diagonal-line portion shown in FIG. 5. [0086]
  • The ‘a’ after the update is considered to be obtained by multiplying the column element of a by the original diagonal element, and then by −1. [0087]
  • D) Details of the parallelism for the shared memory type scalar computer [0088]
  • 0) The LU decomposition on E, F, H, and I is performed in parallel using the parallel algorithm of the LU decomposition. [0089]
  • 1) Control of block width [0090]
  • An embodiment of the present invention is provided with a system of adjusting a block width depending on the scale of a problem and the number of processors used in the parallelism. [0091]
  • Since the block E is updated by one processor, the block width is chosen so that the cost of this portion is negligible (about 1%) relative to the total cost. [0092]
  • 2) In the update in C 4) above, the updating processes by the respective matrix products are performed in parallel. Each processor equally divides the second dimension for parallel sharing in computation. [0093]
  • 3) In the update in C 5), that is, the update of D and F is performed by each processor updating in parallel the areas obtained by equally dividing the second dimension of D and F. [0094]
  • 4) The update in C 8) above is performed in parallel by each processor by sharing the areas obtained by equally dividing the first dimension of H. [0095]
  • 5) Finally, the interchanging process on column vectors is performed in parallel by equally dividing the first dimension of the entire matrix into areas. [0096]
  • The broken lines shown in FIG. 6 show an example of dividing the areas when updating matrix portions in parallel. [0097]
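  • The equal division used in 2) through 5) can be sketched as follows (our illustration; the names correspond to the ‘len’, ‘is’, and ‘ie’ of the pseudo code described later):

        def thread_slice(n, nthrd, ithrd):
            # Divide positions 0..n-1 equally among nthrd threads and
            # return the (is, ie) range handled by thread ithrd.
            ln = (n + nthrd - 1) // nthrd   # 'len': width per thread
            is_ = ithrd * ln
            ie = min(is_ + ln, n)
            return is_, ie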
  • Access to Memory in Updating B or H [0098]
  • Described below is the case in which the recursive program operates up to the depth of 2. [0099]
  • FIGS. 7A and 7B, 7C and 7D, and 7E and 7F each form a pair. [0100]
  • First updated is the diagonal-line portion shown in FIG. 7A. At this time, the diagonal-line portion and the dotted line triangular portion are used. [0101]
  • Refer to the pseudo code for details. Next, the diagonal-line portion shown in FIG. 7A is updated using the horizontal-line portion shown in FIG. 7A and the rectangular diagonal-line portion shown in FIG. 7B. Then, the horizontal-line portion shown in FIG. 7A is updated using the triangular portion in bold lines. [0102]
  • Next, the diagonal-line portion shown in FIG. 7C is updated by the product of the right white-painted rectangle shown in FIG. 7C and the rectangle in bold lines shown in FIG. 7D. [0103]
  • Then, the diagonal-line portion shown in FIG. 7E is updated using the triangular portion in broken lines shown in FIG. 7F. Furthermore, the diagonal-line portion shown in FIG. 7E is updated by the product of the horizontal-line portion shown in FIG. 7E and the rectangle with diagonal lines shown in FIG. 7F. Finally, the horizontal-line portion is updated using the triangular portion in bold lines shown in FIG. 7F. [0104]
  • In the above mentioned procedure, reference to E is common. Therefore, E can be stored in the cache of each processor for reference. In addition, for example, references to and updates of B are processed in parallel on the areas in the row direction (areas obtained by dividing by bold broken lines). [0105]
  • Memory Access in Updating D or F [0106]
  • Described below is the case in which the recursive program operates up to the depth of 2. [0107]
  • First updated is the left diagonal-line portion shown in FIG. 8A. At this time, the diagonal-line portion and the right triangular portion in broken lines shown in FIG. 8A are used. [0108]
  • Refer to the pseudo code for details. Then, the left diagonal-line portion shown in FIG. 8A is updated using the right rectangular portion with diagonal lines shown in FIG. 8A and the left vertical-line portion shown in FIG. 8A. Then, the left horizontal-line portion shown in FIG. 8A is updated using the right triangular portion in bold lines shown in FIG. 8A. [0109]
  • Next, the left diagonal-line portion shown in FIG. 8B is updated by the product of the right rectangle in bold lines shown in FIG. 8B and the left white painted rectangle shown in FIG. 8B. [0110]
  • Then, the left diagonal-line portion shown in FIG. 8C is updated using the right triangular portion in broken lines shown in FIG. 8C. The left diagonal-line portion shown in FIG. 8C is updated using the product of the right rectangle with diagonal lines shown in FIG. 8C and the left vertical-line portion shown in FIG. 8C. Finally, the left vertical-line portion shown in FIG. 8C is updated using the right triangular portion in bold lines shown in FIG. 8C. [0111]
  • In the above mentioned procedure, reference to E is common. Therefore, E can be stored in the cache of each processor for reference. Furthermore, the reference to and update of D is processed in parallel on the areas in the row direction (areas obtained by dividing by bold broken lines). [0112]
  • FIGS. 9 through 15 show the pseudo code according to an embodiment of the present invention. [0113]
  • FIG. 9 shows the pseudo code of the main algorithm of the parallelism algorithm of an inverse matrix. In the following pseudo code, the row having C at the left end is a comment row. ‘array a(k,n)’ is an array storing the elements of the matrix whose inverse matrix is to be obtained. ‘ip (n)’ is an array storing the information used in interchanging rows in the LU decomposition subroutine. For the algorithm of the LU decomposition subroutine, refer to Japanese Patent Application Laid-open No.Hei-12-358232. ‘nb’ refers to the number of blocks specified when the LU decomposition is performed. [0114]
  • When the LU decomposition is completed on one specified block, and if ip(i) is larger than i, then the i-th row of the matrix is interchanged with the ip(i)th row. Then, the update subroutine is called, and the matrix is updated. The processes from the LU decomposition to the update subroutine are repeatedly performed until the processes are performed on all specified blocks. For the final specified block, another LU decomposition and update are performed, thereby terminating the process. [0115]
  • FIG. 10 shows the pseudo code for updating the remaining portions according to the information about the LU decomposition. [0116]
  • In the update routine shown in FIG. 10, the updating process is performed on each of the blocks A through H. For the blocks A through D and G, a dedicated subroutine is further called. Since the block I has already been updated when the LU decomposition is carried out, no subroutine is prepared for it here. [0117]
  • When the update routine terminates for the blocks A through D and G, the barrier synchronization is attained. Then, if the number of the processor (thread number) is 1, the ‘update 1 of e’ is performed. This process is performed by the e-update1 subroutine. After the process, the barrier synchronization is attained. [0118]
  • ‘len’ indicates the width of a block to be processed in one thread. ‘is’ indicates the first position of the block to be processed, and ‘ie’ is the last position of the block to be processed. ‘df-update’ indicates the subroutine for updating the blocks D and F. When the blocks D and F have been updated, the first position of the block plus the block width is stored as the first position (nbase2) of a new block, ‘len’ is computed anew, the first and last positions ‘is2’ and ‘ie2’ of the blocks are computed anew, and D and F are updated by df-update, thereby attaining the barrier synchronization. [0119]
  • Additionally, as for ‘update 2 of e’, when the thread number is 1, the update subroutine e-update2 of the block E is called and barrier synchronization is performed. Similarly, as described above, ‘len’, ‘is’, and ‘ie’ are computed, the update routine bh-update is called for the blocks B and H, ‘nbase2’ is obtained, ‘len’, ‘is2’, and ‘ie2’ are obtained, and the process is performed by bh-update again, thereby attaining the barrier synchronization. [0120]
  • Furthermore, when the thread number is 1, the process is performed by e-update3 as ‘update 3 of e’, thereby attaining the barrier synchronization. [0121]
  • Afterwards, to restore the state of interchanged pivots to the original state, ‘len’, ‘is’, and ‘ie’ are computed, then columns are interchanged by the subroutine ‘exchange’, the barrier synchronization is attained, and the thread is deleted, thereby terminating the process. [0122]
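  • The thread and barrier structure of FIG. 10 can be sketched schematically with Python threads (ours; the subroutine bodies are omitted):

        import threading

        nthrd = 4
        barrier = threading.Barrier(nthrd)

        def update_worker(ithrd):
            # ... update my slices of the blocks A through D and G ...
            barrier.wait()
            if ithrd == 0:       # single-thread step: 'update 1 of e'
                pass             # e-update1 would run here
            barrier.wait()
            # ... df-update on my slice, barrier, bh-update, and so on ...

        threads = [threading.Thread(target=update_worker, args=(i,))
                   for i in range(nthrd)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()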
  • FIG. 11 shows the pseudo code of the update subroutine of the blocks B and D. [0123]
  • In updating the block B, the subroutine b-update accesses the shared matrix a(k,n), and ‘len’, ‘is1’, and ‘ie1’, having the same meanings as above, are computed. ‘iof’ indicates the number of the starting column of the block B. Then, using the matrix TRU-U, the upper triangular matrix whose diagonal elements are set to 1, the block B of the array a is updated by the equation shown in FIG. 11. The expression ‘is:ie’ indicates the matrix elements from ‘is’ to ‘ie’. [0124]
  • In updating the block D, the subroutine d-update computes a similar parameter, and updates the matrix a by the equation shown in FIG. 11 according to the lower triangular matrix TRL in the block E. [0125]
  • FIG. 12 shows the pseudo code of the update subroutine for the blocks C and A. [0126]
  • In the update subroutine c-update for the block C, the block C is updated by the multiplication of the blocks B and F. a (1:iof, is2:ie2) indicates the block C, a (1:iof, iof+1:iof+blk) indicates the block B, and a (iof+1:iof+blk, is2:ie2) indicates the block F. [0127]
  • In the update subroutine a-update of the block A, the block A is updated using the blocks B and D. a (1:iof, is2:ie2) indicates the block A, a (1:iof, iof+1:iof+blk) indicates the block B, and a (iof+1:iof+blk, is2:ie2) indicates the block D. [0128]
  • FIG. 13 shows the pseudo code indicating the first and second update of the blocks G and E. [0129]
  • In the update subroutine a-update of the block G, as in the above mentioned subroutines, ‘len’, ‘is2’, ‘ie2’, ‘iof’, etc., indicating the width, the starting position, the ending position, etc. of a block, are computed, and the block G is updated using the blocks D and H. a (iof+1:n, is2:ie2) indicates the block G, a (iof+1:n, iof+1:iof+blk) indicates the block H, and a (iof+1:iof+blk, is2:ie2) indicates the block D. [0130]
  • In the first update subroutine e-update1 of the block E, the triangular matrix above the diagonal elements of E is updated using the column vector s (1:i, i) before each diagonal element and the row vector s (i, i+1:blk) after it. [0131]
  • In the second update subroutine e-update2 of the block E, the diagonal element of the upper triangular matrix of the block E is updated to the value obtained by dividing the element value before the update by the diagonal element value; the block is updated using the row vector s (i, 1:i−1) before the diagonal element and the column vector s (i+1:blk, i) after it; the elements of the lower triangular matrix of the block E are updated to the values obtained by dividing them by the sign-inverted diagonal element; and the diagonal element is updated to the reciprocal of the diagonal element of the block E. [0132]
  • FIG. 14 shows the pseudo code of the final update for the block E, and the update subroutine for the blocks D and F. [0133]
  • In the final update subroutine e-update3 of the block E, the upper triangular matrix of the block E is updated by the column vector s (1:i−1, i) before the diagonal element and the row vector s (i, 1:i−1), and the elements before the diagonal element of the block E are updated by multiplying them by the diagonal element before the update. [0134]
  • In the update subroutine df-update for the blocks D and F, if the width ‘len’ of a block is smaller than 10, the block D or F (depending on the arguments ‘is’ and ‘ie’ of the subroutine) is updated by the element values s (1:i−1, i) and its own row vector a (i, is:ie). The element values of the block D or F are expressed by a (1:i−1, is:ie); when the subroutine reads a matrix element, the read position is offset by the above mentioned nbase, so the block D or F is addressed by computing a column number for the element values 1 through i−1. When ‘len’ is 20 or larger and 32 or smaller, len1 and len2 are defined, df-update is recursively called, the process shown in FIG. 14 is performed, and df-update is called again, thereby terminating the process. [0135]
  • FIG. 15 shows the pseudo code of the update subroutine of the blocks B and H. [0136]
  • In FIG. 15, bh-update performs the updating process by the operation shown in FIG. 15 when ‘len’ is smaller than 10. If ‘len’ is 20 or larger and 32 or smaller, then len1 and len2 are defined. Otherwise, len1 and len2 are defined differently, bh-update is called, the operation is performed by the equation shown in FIG. 15, and bh-update is called again, thereby terminating the process. [0137]
  • According to an embodiment of the present invention, compared with the equivalent function of SUN's numeric operation library, the SUN Performance Library (another method, which obtains an inverse matrix after an LU decomposition), the process can be performed 6.6 times faster using 7 CPUs. [0138]
  • Refer to the following textbook for the common algorithm of matrix computation. [0139]
  • G. H. Golub and C. F. Van Loan, "Matrix Computations", The Johns Hopkins University Press, third edition, 1996. [0140]
  • According to the present invention, a method of solving an inverse matrix can be realized with high performance and scalability. [0141]

Claims (8)

What is claimed is:
1. A parallel processing method for an inverse matrix for a shared memory type scalar parallel computer, comprising:
specifying a predetermined square block in a matrix for which an inverse matrix is to be obtained;
decomposing the matrix into upper left, left side, lower left, upper, lower, upper right, right side, and lower right blocks surrounding the square block positioned in the center;
dividing each of the decomposed blocks into the number of processors and LU decomposing the square block and the lower, right side, and lower right blocks in parallel;
updating the left side, upper, lower, and right side blocks in parallel in a recursive program, and further updating in parallel using the blocks updated in the recursive program on the upper left, lower left, upper right, and lower right blocks;
updating a predetermined square block in plural stages using one processor; and
setting the position of the square block such that it can sequentially move on the diagonal line of the matrix, and obtaining an inverse matrix of the matrix by repeating the above mentioned steps.
2. The method according to claim 1, wherein said shared memory type scalar parallel computer comprises a plurality of processors, plural units of cache memory provided for respective processors, plural units of shared memory, and an interconnection network for connection to be made such that the units can be communicated.
3. The method according to claim 1, wherein
said method is used to realize a Gauss-Jordan method for parallel computation for each block.
4. The method according to claim 1, wherein
a width of a division used when each block is divided for parallel computation is set such that a total amount of computation of square blocks not processed in parallel can be about 1% of entire computation from a size of a matrix from which an inverse matrix is obtained and from a number of processors available in a parallel process.
5. A program for realizing a parallel processing method for an inverse matrix for a shared memory type scalar parallel computer, comprising:
specifying a predetermined square block in a matrix for which an inverse matrix is to be obtained;
decomposing the matrix into upper left, left side, lower left, upper, lower, upper right, right side, and lower right blocks surrounding the square block positioned in the center;
dividing each of the decomposed blocks into the number of processors and LU decomposing the square block and the lower, right side, and lower right blocks in parallel;
updating the left side, upper, lower, and right side blocks in parallel in a recursive program, and further updating in parallel using the blocks updated in the recursive program on the upper left, lower left, upper right, and lower right blocks;
updating a predetermined square block in plural stages using one processor; and
setting the position of the square block such that it can sequentially move on the diagonal line of the matrix, and obtaining an inverse matrix of the matrix by repeating the above mentioned steps.
6. The program according to claim 5, wherein
said shared memory type scalar parallel computer comprises a plurality of processors, plural units of cache memory provided for respective processors, plural units of shared memory, and an interconnection network for connection to be made such that the units can be communicated.
7. The program according to claim 5, wherein
said method is used to realize a Gauss-Jordan method for parallel computation for each block.
8. The program according to claim 5, wherein
a width of a division used when each block is divided for parallel computation is set such that a total amount of computation of square blocks not processed in parallel can be about 1% of entire computation from a size of a matrix from which an inverse matrix is obtained and from a number of processors available in a parallel process.
US10/288,984 2002-03-22 2002-11-06 Parallel processing method for inverse matrix for shared memory type scalar parallel computer Abandoned US20030182518A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US10/692,533 US7483937B2 (en) 2002-03-22 2003-10-24 Parallel processing method for inverse matrix for shared memory type scalar parallel computer

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2002-079909 2002-03-22
JP2002079909 2002-03-22

Related Child Applications (2)

Application Number Title Priority Date Filing Date
US10/692,533 Continuation US7483937B2 (en) 2002-03-22 2003-10-24 Parallel processing method for inverse matrix for shared memory type scalar parallel computer
US10/692,533 Continuation-In-Part US7483937B2 (en) 2002-03-22 2003-10-24 Parallel processing method for inverse matrix for shared memory type scalar parallel computer

Publications (1)

Publication Number Publication Date
US20030182518A1 true US20030182518A1 (en) 2003-09-25

Family

ID=28035680

Family Applications (2)

Application Number Title Priority Date Filing Date
US10/288,984 Abandoned US20030182518A1 (en) 2002-03-22 2002-11-06 Parallel processing method for inverse matrix for shared memory type scalar parallel computer
US10/692,533 Expired - Fee Related US7483937B2 (en) 2002-03-22 2003-10-24 Parallel processing method for inverse matrix for shared memory type scalar parallel computer

Family Applications After (1)

Application Number Title Priority Date Filing Date
US10/692,533 Expired - Fee Related US7483937B2 (en) 2002-03-22 2003-10-24 Parallel processing method for inverse matrix for shared memory type scalar parallel computer

Country Status (1)

Country Link
US (2) US20030182518A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070083290A1 (en) * 2005-10-12 2007-04-12 Kenichiro Nagasaka Apparatus and method for computing operational-space physical quantity
US20090319592A1 (en) * 2007-04-19 2009-12-24 Fujitsu Limited Parallel processing method of tridiagonalization of real symmetric matrix for shared memory scalar parallel computer
CN102486727A (en) * 2010-12-03 2012-06-06 同济大学 Multinuclear parallel crout decomposition method for ultra-large scale matrix based on TBB (Treading Building Block)
US9244798B1 (en) * 2011-06-20 2016-01-26 Broadcom Corporation Programmable micro-core processors for packet parsing with packet ordering
US9455598B1 (en) 2011-06-20 2016-09-27 Broadcom Corporation Programmable micro-core processors for packet parsing

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7792895B1 (en) 2006-06-16 2010-09-07 Nvidia Corporation Efficient matrix multiplication on a parallel processing device
US7836118B1 (en) * 2006-06-16 2010-11-16 Nvidia Corporation Hardware/software-based mapping of CTAs to matrix tiles for efficient matrix multiplication
US7912889B1 (en) 2006-06-16 2011-03-22 Nvidia Corporation Mapping the threads of a CTA to the elements of a tile for efficient matrix multiplication
WO2008018188A1 (en) * 2006-08-08 2008-02-14 Kyoto University Eigen value decomposing device and eigen value decomposing method
US8417755B1 (en) 2008-05-28 2013-04-09 Michael F. Zimmer Systems and methods for reducing memory traffic and power consumption in a processing environment by solving a system of linear equations
US8495120B2 (en) * 2009-06-15 2013-07-23 Em Photonics, Inc. Method for using a graphics processing unit for accelerated iterative and direct solutions to systems of linear equations
US10331762B1 (en) * 2017-12-07 2019-06-25 International Business Machines Corporation Stream processing for LU decomposition

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5226171A (en) * 1984-12-03 1993-07-06 Cray Research, Inc. Parallel vector processing system for individual and broadcast distribution of operands and control information
US5251097A (en) * 1990-06-11 1993-10-05 Supercomputer Systems Limited Partnership Packaging architecture for a highly parallel multiprocessor system
US5301342A (en) * 1990-12-20 1994-04-05 Intel Corporation Parallel processing computer for solving dense systems of linear equations by factoring rows, columns, and diagonal, inverting the diagonal, forward eliminating, and back substituting
US5333117A (en) * 1993-10-04 1994-07-26 Nec Research Institute, Inc. Parallel MSD arithmetic using an opto-electronic shared content-addressable memory processor
US5428803A (en) * 1992-07-10 1995-06-27 Cray Research, Inc. Method and apparatus for a unified parallel processing architecture
US5490278A (en) * 1991-07-12 1996-02-06 Matsushita Electric Industrial Co., Ltd. Data processing method and apparatus employing parallel processing for solving systems of linear equations
US5561784A (en) * 1989-12-29 1996-10-01 Cray Research, Inc. Interleaved memory access system having variable-sized segments logical address spaces and means for dividing/mapping physical address into higher and lower order addresses
US5596518A (en) * 1994-05-10 1997-01-21 Matsushita Electric Industrial Co., Ltd. Orthogonal transform processor

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2956800B2 (en) * 1991-09-19 1999-10-04 株式会社日立製作所 Computer system for simultaneous linear equations
JP3542184B2 (en) 1994-12-15 2004-07-14 株式会社日立製作所 Linear calculation method
JP3639206B2 (en) 2000-11-24 2005-04-20 富士通株式会社 Parallel matrix processing method and recording medium in shared memory type scalar parallel computer
US7003542B2 (en) * 2002-01-02 2006-02-21 Intel Corporation Apparatus and method for inverting a 4×4 matrix
US7065545B2 (en) * 2002-05-07 2006-06-20 Quintero-De-La-Garza Raul Gera Computer methods of vector operation for reducing computation time

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5226171A (en) * 1984-12-03 1993-07-06 Cray Research, Inc. Parallel vector processing system for individual and broadcast distribution of operands and control information
US5561784A (en) * 1989-12-29 1996-10-01 Cray Research, Inc. Interleaved memory access system having variable-sized segments logical address spaces and means for dividing/mapping physical address into higher and lower order addresses
US5251097A (en) * 1990-06-11 1993-10-05 Supercomputer Systems Limited Partnership Packaging architecture for a highly parallel multiprocessor system
US5301342A (en) * 1990-12-20 1994-04-05 Intel Corporation Parallel processing computer for solving dense systems of linear equations by factoring rows, columns, and diagonal, inverting the diagonal, forward eliminating, and back substituting
US5490278A (en) * 1991-07-12 1996-02-06 Matsushita Electric Industrial Co., Ltd. Data processing method and apparatus employing parallel processing for solving systems of linear equations
US5428803A (en) * 1992-07-10 1995-06-27 Cray Research, Inc. Method and apparatus for a unified parallel processing architecture
US5333117A (en) * 1993-10-04 1994-07-26 Nec Research Institute, Inc. Parallel MSD arithmetic using an opto-electronic shared content-addressable memory processor
US5596518A (en) * 1994-05-10 1997-01-21 Matsushita Electric Industrial Co., Ltd. Orthogonal transform processor

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070083290A1 (en) * 2005-10-12 2007-04-12 Kenichiro Nagasaka Apparatus and method for computing operational-space physical quantity
US8140189B2 (en) * 2005-10-12 2012-03-20 Sony Corporation Apparatus and method for computing operational-space physical quantity
US20090319592A1 (en) * 2007-04-19 2009-12-24 Fujitsu Limited Parallel processing method of tridiagonalization of real symmetric matrix for shared memory scalar parallel computer
US8527569B2 (en) 2007-04-19 2013-09-03 Fujitsu Limited Parallel processing method of tridiagonalization of real symmetric matrix for shared memory scalar parallel computer
CN102486727A (en) * 2010-12-03 2012-06-06 同济大学 Multinuclear parallel crout decomposition method for ultra-large scale matrix based on TBB (Treading Building Block)
US9244798B1 (en) * 2011-06-20 2016-01-26 Broadcom Corporation Programmable micro-core processors for packet parsing with packet ordering
US9455598B1 (en) 2011-06-20 2016-09-27 Broadcom Corporation Programmable micro-core processors for packet parsing

Also Published As

Publication number Publication date
US20040093470A1 (en) 2004-05-13
US7483937B2 (en) 2009-01-27

Similar Documents

Publication Publication Date Title
KR101298393B1 (en) Training convolutional neural networks on graphics processing units
EP3499427A1 (en) Method and electronic device for convolution calculation in neutral network
US20030182518A1 (en) Parallel processing method for inverse matrix for shared memory type scalar parallel computer
Dong et al. LU factorization of small matrices: Accelerating batched DGETRF on the GPU
Miyamoto et al. Fast calculation of Haralick texture features
DE202017103725U1 (en) Block operations for an image processor having a two-dimensional execution path matrix and a two-dimensional shift register
CN112991142B (en) Matrix operation method, device, equipment and storage medium for image data
JP3526976B2 (en) Processor and data processing device
CN110807170B (en) Method for realizing Same convolution vectorization of multi-sample multi-channel convolution neural network
JP2002163246A (en) Parallel matrix processing method in shared memory type scalar parallel computer and recording medium
DE202017103727U1 (en) Core processes for block operations on an image processor with a two-dimensional runway matrix and a two-dimensional shift register
Yamazaki et al. One-sided dense matrix factorizations on a multicore with multiple GPU accelerators
US8527569B2 (en) Parallel processing method of tridiagonalization of real symmetric matrix for shared memory scalar parallel computer
Andrew et al. Implementing QR factorization updating algorithms on GPUs
Slyusar A family of face products of matrices and its properties
US9582474B2 (en) Method and apparatus for performing a FFT computation
Guerreiro et al. Exact hypervolume subset selection through incremental computations
US7603402B2 (en) Solution program recording media for simultaneous linear equations having band coefficient matrix
US20030187898A1 (en) Parallel processing method of an eigenvalue problem for a shared-memory type scalar parallel computer
CN113989169A (en) Expansion convolution accelerated calculation method and device
Nepomniaschaya An associative version of the Bellman-Ford algorithm for finding the shortest paths in directed graphs
Coulaud et al. Parallelization of semi-Lagrangian Vlasov codes
König et al. Computing eigenvectors of block tridiagonal matrices based on twisted block factorizations
JP3983188B2 (en) Parallel processing method of inverse matrix for shared memory type scalar parallel computer
JP2022074442A (en) Arithmetic device and arithmetic method

Legal Events

Date Code Title Description
AS Assignment

Owner name: FUJITSU, LIMITED, JAPAN

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:NAKANISHI, MAKOTO;REEL/FRAME:013474/0777

Effective date: 20020717

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION