WO1990000287A1 - Intelligent floating-point memory - Google Patents

Intelligent floating-point memory

Info

Publication number
WO1990000287A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
vector
bits
bus
memory
Prior art date
Application number
PCT/US1989/002864
Other languages
French (fr)
Inventor
Steven G. Morton
Original Assignee
Morton Steven G
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
1988-07-01
Filing date
1989-06-30
Publication date
Application filed by Morton Steven G
Publication of WO1990000287A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443 Sum of products
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/483 Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers

Abstract

The methods of operation and logic design of a digital, memory-plus-processor unit (203) are given. This unit both stores and multiplies matrices within it. Each unit stores one or more bits of each element of a matrix. Multiple units (202, 203) work together to give the precision of the matrix desired. Data may be represented in either fixed-point or floating-point format. Each unit combines a high capacity, very-wide-word memory (300) that stores one or more large matrices, a lower capacity memory (319) that is loaded from the high capacity memory and stores a vector or a smaller matrix, and highly parallel, simple processing logic (307) that feeds an on-chip, global adder and accumulator circuit (308).

Description

Intelligent Floating-Point Memory
Technical Field
The invention relates primarily to semiconductor devices that multiply matrices using either fixed-point or floating-point arithmetic. Matrix multiplication is a fundamental operation that is required for digital signal processing, pattern recognition, graphics, and scientific-and-engineering computing. This operation may be advantageously implemented by the digital, parallel processing units described herein that combine a small amount of processor logic with a large amount of memory into a single unit in a modular fashion. Multiple such units work together to provide the performance and precision desired, and to handle more complex problems than a single unit can handle alone.
Background Art
U.S. patent #4,777,614 (Ward), 11 Oct 1988, "Digital Data Processor for Matrix-Vector Multiplication", describes a two-dimensional, orthogonal, systolic array of bit-level processing cells that multiply matrices. The matrix must be fed into the array from one direction while the vector is fed in from an orthogonal direction.
U.S. patent #4,493,048 (Kung et al), 08 Jan 1985, "Systolic Array Apparatus for Matrix Computations", describes a two-dimensional, systolic array of inner-product-step processors. Data must be fed into the array for computations to be done.
U.S. patent #4,150,434 (Shibayama et al), 17 Apr 1979, "Matrix Arithmetic Apparatus", describes a pipelined system for performing matrix operations. Each arithmetic element handles an entire matrix element, rather than being distributed across multiple chips, and is not colocated with bulk memory for economy of communication.
U.S. patent #3,440,611 (Falkoff et al), 22 Apr 1969, "Parallel Operations in a Vector Arithmetic Computing System", describes a vector arithmetic, multiprocessor computing system for performing multiword-parallel vector operations such as sum reduction and search for largest. Each arithmetic element handles an entire matrix element, rather than being distributed across multiple chips, and is not colocated with bulk memory for economy of communication.
M. Duranton and J. A. Sirat, in "Learning on VLSI: A General Purpose Digital Neurochip", in the Proceedings of the International Joint Conference on Neural Networks, June 1989, page II-613 (abstract only), describe a chip that simultaneously multiplies multiple elements of a matrix and vector, and sums these products. The multiplication is done in several steps to minimize the amount of hardware required to build the multiplier. The matrix and vector are stored in two separate memories that are on the same chip as the arithmetic logic. The chip supports only 8-bit and 16-bit precision of the matrix, is specifically designed for connection to a single microprocessor (the Transputer), does not readily support its use with other identical chips to spread the bits of the matrix over multiple chips, and only performs fixed-point arithmetic.
The applicant of this application filed an international patent application, "An Intelligent Memory Chip for Matrix Multiplication", serial number PCT/US88/04433, that describes how to perform matrix multiplication and convolution with fixed-point representation within multiple identical, specially configured memory chips.
Disclosure of Invention
The object of the invention is to multiply matrices having either a fixed-point or a floating-point representation. The design is modular so that multiple units can work together to provide whatever precision is desired. Since complex problems are represented by large amounts of data, the bulk of each unit is memory which stores the data. Simple processing logic is placed on the same unit as the memory to minimize its complexity and cost, and to avoid transferring the data from the memory to distant processing logic. The interface to the memory is made as simple as possible, like the interface to common memory chips, which have no processing logic.

Brief Description of the Drawings
The details of carrying out the invention are described by schematics, block diagrams, equations and methods in the following Figures. The Figures are according to this invention unless otherwise noted.
Figure 1 is equations for matrix multiplication (prior art).
Figure 2 is a block diagram of the intelligent floating-point memory module using multiple intelligent floating-point memory units.
Figure 3 is a block diagram of the intelligent floating-point memory unit.
Figure 4 is a diagram showing the spatial and temporal implementation of the multiplier.
Figure 5 is a timing diagram for the clock generator.
Figure 6 is a table of data storage formats for a row of the matrix memory.
Figure 7 is a block diagram of the matrix memory.
Figure 8 is a block diagram of the vector memory.
Figure 9 is a block diagram of the normalizer cell.
Figure 10 is a table of functions for the normalizer cell.
Figure 11 is a block diagram of the processor logic cells and adder-and-accumulator.
Figure 12 is a block diagram of one processor logic cell.
Figure 13 is a block diagram of the 8-port floating-point vector accumulator and interface unit.
Figure 14 is a block diagram of the 8-port mantissa section.
Figure 15 is a block diagram of the 8 by 40-bit weighted adder.
Figure 16 is a block diagram of the bus interface.
NOTE: For the convenience of the reader, all reference numbers are tied to the Figure numbers. Reference numbers are of the form XXYY, where XX is the Figure number, exclusive of any letter suffix, and YY is a reference number within that Figure.

Modes for Carrying Out the Invention
Figure 1 shows the equations that define the multiplication of matrices with four columns. In equation 1.1, a large input matrix [A] has four columns as required by 3-D graphics, and has as many rows as are required to store the many vectors that represent 3-D objects. One hundred thousand to a million vectors may easily be required. The precision of each element of each of these vectors is typically 32 bits, in either fixed-point or floating-point format. This invention supports both formats.
A transformation matrix [B] is multiplied times each of many of the row-vectors in the input matrix. The input matrix may represent multiple objects, each having its own transformation matrix, in which case the transformation matrix must be changed depending upon which object is being handled. It is assumed, however, that the vectors of each object are grouped in adjacent rows of the input matrix in order to minimize the number of times that a new transformation matrix must be loaded.
The output matrix [C] is the conventional product of the input matrix and the transformation matrix. The invention provides for the storing of the output matrix within the intelligent memory chips without disturbing the external data bus or processor, in which case successive transformations may easily be computed.
It is important to note that this invention may be applied to matrices with any number of columns. The preferred embodiment is given with four columns because 3-D graphics is a large market which requires four columns, but there is no limitation to four columns. It is also important to note that this invention applies to the multiplication of matrices regardless of their application. The choice of graphics, signal processing, scientific computing or whatever has no bearing upon the principle of operation.

The method used by the invention to multiply the matrices is to decompose the operation into a sequence of simpler operations. Each row of the input matrix is handled in turn. As shown in equation 1.2, the first row of the input matrix is multiplied by the transformation matrix to produce the first row-vector of the output matrix.

The multiplication of a row-vector times a transformation matrix is further decomposed as shown in equation 1.3. In this invention, a row-vector of the input matrix is multiplied by the first column of the transformation matrix to produce the first element, a0, of the row-vector of the output matrix. The same row-vector of the input matrix is then multiplied by the second column of the transformation matrix to produce the second element of the row-vector of the output matrix. This process is repeated for the remaining columns of the transformation matrix. The next row-vector of the input matrix is then handled, and so on.
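For illustration only, the decomposition of equations 1.2 and 1.3 can be sketched in a few lines of Python (the function and variable names are illustrative, not part of the specification):

```python
# Sketch of equations 1.2 and 1.3: each row-vector of the input
# matrix [A] is multiplied by the transformation matrix [B], one
# output element (one column dot-product) at a time.
def transform(A, B):
    rows = len(A)               # as many rows as there are object vectors
    cols = len(B[0])            # four columns in the 3-D graphics case
    C = [[0] * cols for _ in range(rows)]
    for k in range(rows):       # equation 1.2: one row-vector at a time
        for j in range(cols):   # equation 1.3: one column of [B] at a time
            C[k][j] = sum(A[k][i] * B[i][j] for i in range(len(B)))
    return C
```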
Figure 2 shows the block diagram of an intelligent floating-point memory module. If 32-bit precision of the input matrix is desired and eight intelligent floating-point memory chips as 202 and 203 share the load, then each unit stores four bits of each element of the matrix. Each unit is typically implemented as a single chip, so the words 'chip' and 'unit' will be used interchangeably from here on. These chips are controlled by a control unit 208 that is commanded by a host processor via the microprocessor (uP) data bus 210 to multiply a matrix composed of a group of row-vectors times a transformation matrix composed of four column-vectors.

Since each intelligent floating-point memory unit computes only a part of a product, the partial products as 206 and 207 must be combined to form a complete product. The floating-point vector accumulator and interface unit 209 combines these partial products. Note that scale factors are associated with each of the partial products as 206. Like the use of ordinary 4-bit memory chips, the memory data bus 201 places four bits of each element of the matrix in each four-bit chip, in which case bits 3:0 are placed in chip 203 and bits 31:28 (i.e., 31, 30, 29 and 28) are placed in chip 202. Since the least significant bit of chip 203 has a weight of 2^0, its partial product (PP) has a scale factor of 2^0. Likewise, chip 202 has bit 28 as its least significant bit, in which case its partial product 206 has a scale factor of 2^28.
The invention is not restricted to the use of four bits in each chip. Four is given for the example because many applications require 32 bits of precision and it is often convenient from a packaging point of view to use eight chips with four bits each. The spatial significance of the chips is indicated by the slice type (ST) input. Only the most significant slice, chip 202, has the slice type equal to 1 since it stores the sign bit (assuming two's complement arithmetic); the other chips as 203 have slice type = 0.
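A minimal sketch of how the eight partial products recombine, assuming two's-complement fixed-point data and an unsigned vector operand for simplicity (names are illustrative):

```python
# Chip i stores bits 4i+3:4i of a matrix element, so its partial
# product carries a scale factor of 2^(4i); only the most significant
# slice (slice type = 1) interprets its four bits as signed.
def combine_partial_products(slices, b):
    """slices[i] = 4-bit field held by chip i (i = 0 is least
    significant); b = the vector operand common to all chips."""
    total = 0
    for i, s in enumerate(slices):
        if i == len(slices) - 1 and s >= 8:   # sign bit set in the MS slice
            s -= 16                           # two's-complement weight
        total += (s * b) << (4 * i)           # apply the 2^(4i) scale factor
    return total
```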
As will be explained, the partial product line as 206 carries a variety of information between the vector accumulator unit 209 and the intelligent memory chips. The sign bus 204 provides all intelligent memory chips with the sign bits of the matrix elements being used for a particular operation. These sign bits are located in the most significant slice. This bus has four bits for chips that operate upon matrices with four columns.

Figure 3 shows the block diagram of the intelligent floating-point memory chip. It has three main sections: a matrix memory 300, processor logic and a vector memory 319. A matrix is loaded into the matrix memory via the matrix address 301, matrix data 302 and matrix control 303 lines. A transformation matrix is loaded into the vector memory from the matrix memory under control of the vector control lines 321. Information from both memories enters the processor logic and is processed. In the preferred embodiment, MDB, the number of matrix data bits, is 4, and MAB, the number of matrix address bits, is 9, corresponding to a memory with multiplexed row and column addresses and a total of 1M bits of storage (256K * 4).
The processor logic includes a buffer 305 for the sign bus 306, an adder and accumulator 308, a floating-point interface (FPI, 309) and a clock generator 317. The floating-point interface produces the partial product lines (PP, 310) which it drives if the partial product output enable line (PPOE, 311) is asserted and an output is required. The interface also receives information from the partial product lines and conveys it to the delta exponent bus (DE, 314). When the chip is designed to handle matrices with four columns, there are four logic cells within block 307 and four normalizer cells within block 313. The clock generator 317 receives the clock 315 and reset lines 316. The processor control (PC) and slice type (ST) lines drive all of the processor logic.
Figure 4 shows the spatial and temporal implementation of a multiplier as used by this invention. Assuming 32-bit by 32-bit operation for sake of example, the product can be implemented with thirty-two small multipliers, each handling 4 columns of bits and 8 rows of bits of the computation. Each binary product and sum can be handled by a tiny 1-bit cell as 400. Note that some of the small, multi-bit multipliers as 402 operate only upon unsigned bits (A3:0 and B7:0), whereas others as 401 have a combination of signed and unsigned bits (A31:28 and B7:0) depending upon whether bit A31 or bit B31 is being handled. Both A and B inputs to multiplier 403 are signed. A flexible means of specifying which combination of bits is to be handled by each chip is provided by the invention so that a single type of chip can handle all cases.
The invention handles each 4-bit wide slice of the multiplier in a single chip, in which case 8 chips provide 32 bits — the precision required for graphics. Each 8-bit-high slice is provided in time by passing successive sets of data through a single 4-bit by 8-bit multiplier. More or fewer cycles may be used to provide more or less precision. Data is placed in the chips so that the spatial dimension determines the precision of the row-vector of the input matrix, while the temporal dimension determines the precision of the column-vector of the transformation matrix.
Note that the amount of multiplier logic for each product on each chip is only 1/32 of a 32-bit by 32-bit multiplier. As a result, the manufacturing yield is high since the amount of logic is very small. Even less logic may be used by using more cycles to compute the product. An 8-bit-high slice was chosen to provide a particular level of performance with a clock rate that is easily achieved with fabrication capability in 1988. Similarly, 32-bit precision of the row-vector of the input matrix could be provided by 4 chips with 8-bit wide slices rather than having 8 chips with 4-bit wide slices. The choice is purely an engineering one based upon the configuration that best suits an application. Note, however, that as the width increases, the amount of logic increases, reducing yields, increasing the number of connections on each chip and increasing costs. The yield problem, while small, could be handled by placing spare multipliers on a chip and connecting them depending upon where defects fall.
The total amount of multiplier logic per chip is thus four 4-bit by 8-bit multipliers, equivalent to one 8-bit by 16-bit multiplier. The adder and accumulator reduces the data rate on the partial product pins by combining the results of four 8-bit cycles. The clock generator is required to provide the multi-cycle operation for each product.
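A sketch of the temporal slicing within one chip, assuming the vector bytes arrive least significant first (per the vector memory bus sequence given later) and an unsigned operand:

```python
# One 4-bit matrix slice times a 32-bit vector element, computed in
# four cycles of a single 4-bit by 8-bit multiplier; each cycle's
# product is weighted by 2^(8t), the role of the 8-bit shift in the
# accumulator.
def four_by_32(a4, b32):
    acc = 0
    for t in range(4):                  # cycles T0..T3, ascending byte weight
        byte = (b32 >> (8 * t)) & 0xFF
        acc += (a4 * byte) << (8 * t)   # one 4 x 8 product per cycle
    return acc                          # equals a4 * b32
```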
The type of memory technology used for the matrix memory 300 and the vector memory 319, and its storage capacity, are chosen to suit the application and the state of technology; they are not fundamental to the invention. Typically, the capacity of the matrix memory will be many times greater than the capacity of the vector memory.
Figure 5 shows the signals generated by the clock generator 317 in response to clock 315 and reset 316. Reset 316 is asserted upon system initialization to synchronize the operation of all intelligent memory chips. Each chip then acts as a simple finite state machine. The timing signals shown may be generated by many shift register or counter means as are known in the art. Four cycles are shown as T0 (507), T1 (508), T2 (509) and T3 (510), consistent with the use of four 8-bit temporal slices to accomplish a 32-bit multiplication. Other precisions can be provided by varying the number of cycles, either dynamically (under program control) or statically (at hardware design time).
FClk 502 is the "fast clock", MClk 504 is the "medium clock" (MClk phase 1), and SClk 506 is the "slow clock" (SClk phase 3). The fast clock is the rate that successive 8-bit portions of one column-vector of the transformation matrix are handled. The medium clock is the rate at which partial products are conveyed between chips. The slow clock is the rate at which complete, 32-bit products are computed. The bits shown for Tn (n = 0 to 3) are at the output of the Vector Memory. The bits shown for T0,1 (511) and T2,3 (512) are at the output of the Partial Product bus.
Figure 6 is a table of data storage formats for the matrix memory 300. The processor logic operates upon data that is read at one time from a row of the matrix memory, so there are constraints upon how data is stored. These constraints are key to the operation of the invention. Note that the many row-vectors of the input matrix (shown under matrix format) are stored efficiently, where each bit of each element is stored only once in a group of intelligent memory chips. However, the few column-vectors of the transformation matrix are passed through the matrix memory to the vector memory and are stored redundantly. (Close attention must be paid to which vector is which to avoid confusion.) Since each intelligent memory chip uses the entire column-vector, and there are eight intelligent memory chips in the module shown in Figure 2, there is eight-fold redundancy in the storage of the column-vector. This redundancy of storage eliminates a massive wiring problem that would arise if the column-vector were stored only once and wired to all of the chips that need it.
Assuming the placement of four bits of each element of the input matrix in the matrix memory 300 (MDB = 4 in Figure 3), the matrix memory would be implemented from four planes (700, 701, 702 and 703) of memory cells. Note also that a single bit is written into each plane from the matrix data bus 302, as is conventional for a by-4 memory. Further assuming that matrices with four columns are operated upon by the intelligent memory chip, then four columns from each plane are selected simultaneously by the multiplexer 706 and connected to the matrix memory bus 304. Each bit is the same weight of bit for each of the four columns of the row-vector as shown in Figure 6. (This is an important point.)
For example, the data that may be present on the matrix memory bus in chip 203 in Figure 2 is:

Matrix Memory Bus B:  15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00

Matrix format:
  Row K, B:            3  2  1  0  3  2  1  0  3  2  1  0  3  2  1  0
  Column:              3  3  3  3  2  2  2  2  1  1  1  1  0  0  0  0

Vector format:
  Col L, B:            3  2  1  0  3  2  1  0  3  2  1  0  3  2  1  0
  Row:                 3  3  3  3  2  2  2  2  1  1  1  1  0  0  0  0

Similarly, the data that may be present on the matrix memory bus in the next most significant chip is:

Matrix Memory Bus B:  15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00

Matrix format:
  (higher) Row K, B:   7  6  5  4  7  6  5  4  7  6  5  4  7  6  5  4
  Column:              3  3  3  3  2  2  2  2  1  1  1  1  0  0  0  0

Vector format:
  (same) Col L, B:     3  2  1  0  3  2  1  0  3  2  1  0  3  2  1  0
  Row:                 3  3  3  3  2  2  2  2  1  1  1  1  0  0  0  0
Figure 7 shows that each memory plane as 703 has sense amplifiers and column read/write logic 704, as is common for a high speed RAM. The outputs of the sense amplifiers are also connected to an output register and multiplexers 706 to store the entire row of data and provide the multiple access described in Figure 6. The column write logic is also connected to an input register 705 that operates in a read-modify-write mode where multiple bits may be changed each time that a row of the memory is written. (If each of the four planes of the RAM stores 256K bits and is a square array of storage cells, then there are 512 sense amplifiers per plane and 512/4 = 128 four-column groups from which the multiplexer may select.) This input register allows multiple results to be accumulated and written into one row of the memory at the same time on an infrequent basis, rather than requiring memory write cycles to be performed at the same rate as vector dot-products are computed. The choice of the bits to be loaded is made by the matrix address lines, and the choice of loading the register rather than the memory is made by the matrix control lines.
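The read-modify-write behavior of the input register can be sketched as follows (a simplification; in the hardware the selection is made by the matrix address and control lines):

```python
# Read a whole row into the input register, replace only the addressed
# bits, and write the row back later, so many dot-product results can
# accumulate between infrequent memory write cycles.
def read_modify_write_row(row_bits, updates):
    """row_bits: list of bits read from one memory row;
    updates: {bit_position: new_bit}, chosen by the matrix address."""
    register = list(row_bits)             # load the input register
    for position, bit in updates.items():
        register[position] = bit          # modify only the selected bits
    return register                       # written back on command
```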
Figure 8 is a block diagram of the vector memory 319. A typical minimum configuration would be 16 words by 32 bits so that it can store a 4-by-4 transformation matrix with 32-bit precision. Note, however, that each word has 8 bits from each of the elements in the four rows of a column-vector, rather than the 32 bits of a single element. This format is required because the four logic cells 307 work on 8 bits of each element at a time. A sequence of four words provides the 32 bits of each of the four elements. Note that:
1. A column-vector from the vector memory is multiplied by a row-vector from the matrix memory.

2. The connections from the Matrix Memory Bus to the Input Bus are:

Input Bus B:           31:28 27:24 23:20 19:16 15:12 11:08 07:04 03:00
Matrix Memory Bus B:   15:12 15:12 11:08 11:08 07:04 07:04 03:00 03:00
Write Column Select:     1     0     1     0     1     0     1     0

3. The content of the Vector Memory Bus versus time is: T0 - B7:0, T1 - B15:8, T2 - B23:16, T3 - B31:24 from each of the four rows of the column-vector being processed.
Unlike the matrix memory 300 that is accessed infrequently because of the storage capacity of the output register 706, the vector memory is accessed often, at the fast clock rate. This speed difference is reasonable since the matrix memory is large and presumably slow, whereas the vector memory is small and presumably fast. However, the speed requirements of the vector memory can be reduced by increasing its width in proportion to a reduction in its height (the number of words). In this case the sense amplifiers 805 would read additional columns and a set of multiplexers (not shown) would select some of them for passage to the vector memory bus. The write logic 806 provides the loading of sections of the vector memory from the matrix memory. The read logic 807 includes registers that allow the definition of a set of locations in the vector memory to be accessed cyclically for repetitive use of the transformation matrix.
Writing into the vector memory from the matrix memory takes two cycles per word since a word is 32 bits wide but the matrix memory bus 304, which supplies the data, is only 16 bits wide. It thus takes 32 cycles to load the vector memory with a 4-by-4 transformation matrix, so it is desirable to minimize the frequency with which it is loaded to minimize the amount of time that no computations are being performed. This problem can be alleviated by increasing the amount of storage and providing dual-port storage cells so that one transformation matrix may be loaded while another is being used. Alternatively, the matrix memory bus could be made wider. The method for loading the vector memory from the matrix memory is:
1. Select a row of the matrix memory with Matrix Address and load the row into the Output Register. This operation, like any operation using Matrix Address or the matrix memory, preempts use of the matrix memory.
2. Select a starting point in the Output Register with Matrix Address. This point must be bit 0 of the column-vector.
3. Select a starting address in the vector memory with Matrix Address. This point must be bit 0 of the column-vector.
4. Initialize a counter, N, to 0.
5. Move bits [(3:0) + 8N] of each row of the vector from the Output Register to the even nibbles (B: 3:0, 11:8, 19:16, 27:24) of the vector memory.
6. Advance to the next point (bit(4 + 8N)) in the Output Register.
7. Move bits [(7:4) + 8N] of each row of the vector from the Output Register to the odd nibbles (B: 7:4, 15:12, 23:20, 31:28) of the vector memory.
8. Advance to the next point (bit(8 + 8N)) in the Output Register.
9. Advance to the next row (bit(8 + 8N)) of the vector memory.
10. N = N + 1. If N < 4 then go to step 5 to handle successive groups of 8 bits of the Vector.
11. Done. The new vector is ready for use and the next may be loaded.
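A sketch of steps 5 through 10 in Python, treating the Output Register contents as four 32-bit integers (one per row of the column-vector); the two nibble moves per word reflect the 16-bit matrix memory bus:

```python
# Each pass N moves bits (7:0)+8N of every row into one 32-bit
# vector-memory word: first the even nibbles (steps 5-6), then the
# odd nibbles (steps 7-8).
def load_vector_memory(output_rows):
    memory = []
    for n in range(4):                                 # step 10: N = 0..3
        word = 0
        for r in range(4):
            even = (output_rows[r] >> (8 * n)) & 0xF       # bits (3:0)+8N
            odd = (output_rows[r] >> (8 * n + 4)) & 0xF    # bits (7:4)+8N
            word |= even << (8 * r)        # nibbles B 3:0, 11:8, 19:16, 27:24
            word |= odd << (8 * r + 4)     # nibbles B 7:4, 15:12, 23:20, 31:28
        memory.append(word)
    return memory
```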
Figure 9 is a block diagram of the normalizer cell. The table of functions of the normalizer cell is given in Figure 10. The basic idea is that each vector mantissa may be shifted to the extent necessary to line up the binary point of all of the products of the mantissas. No shifting is performed for the product with the largest exponent; all others are shifted to match it, where the shift reduces the magnitude of the product. Low order bits of a vector element may thus be lost, but this produces an error in the sum of products that is very nearly the same as produced by losing bits when adding a series of products with conventional devices. However, if these small errors are significant, then the product can be computed with additional precision. This is possible because all bits of each vector element are available in each intelligent memory chip. Any increase in precision (beyond eight bits) is in direct proportion to the reduction in performance from handling the additional bits. The shifting of the mantissa is implemented in two stages. Shifts of 8 bits are handled by loading different registers as 905, 906, 907. Shifts of 1 to 7 bits are handled by multiplexer 901 during the calculation of the product. Unlike conventional barrel shifters that require a large number of gates to implement large shifts, the structure in Figure 9 requires relatively little logic because the byte-serial architecture allows the selective loading of register 2 (905), register 1 (906) or register 0 (907) under control of the SC (shift control) block 900.
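The two-stage shift reduces to a simple right shift; a sketch, assuming the Delta Exponent is the total number of bits to shift:

```python
# Stage 1: multiples of 8 bits are absorbed by loading the mantissa
# into a lower register (905-907).  Stage 2: the residual 1-to-7-bit
# shift is taken by multiplexer 901.  Low-order bits are lost, exactly
# as when aligning addends in a conventional floating-point adder.
def denormalize(mantissa, delta_exponent):
    coarse, fine = divmod(delta_exponent, 8)
    staged = mantissa >> (8 * coarse)   # register selection
    return staged >> fine               # multiplexer selection
```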
Note in Figure 9 that the bit assignments for the four cells are:
Cell 0: FPI Bus B4:0, Vector Memory Bus B7:0, Shifter Bus B7:0.
Cell 1: FPI Bus B9:5, Vector Memory Bus B15:8, Shifter Bus B15:8.
Cell 2: FPI Bus B14:10, Vector Memory Bus B23:16, Shifter Bus B23:16.
Cell 3: FPI Bus B19:15, Vector Memory Bus B31:24, Shifter Bus B31:24.
The eight OR gates 908 are used to set the hidden bit of the mantissa as the data is being loaded into registers 2 to 0. The hidden bit is set only if the vector element is not zero as indicated by the DE (Delta Exponent) bus 314. In the IEEE single precision floating-point format, bit 22 is the hidden bit, corresponding to bit 6 of the OR bus 909.
Figure 11 is a block diagram of the floating-point processor logic. There are four processor cells, as 1102, that perform multiplication. Each receives four bits from the matrix memory bus and eight bits from the shifter bus 312. The products from these multipliers are summed and stored by the adder/register 1103. The sum of products is added by the ALU 1104 to the accumulated sum of products stored by the register 1105. The accumulated sum of products is shifted by 8 bits per cycle, corresponding to the use of successively higher weight sets of eight bits from the shifter bus. The final sum of products is stored in the register 1106 and passed in two sets of bits to the floating-point interface 309 by the multiplexer 1107 via the Internal Partial Product data lines 322. Note in Figure 11 that:
1. The assignment of the Matrix Memory Bus to each Cell's input "A" is:

Cell 0 B0 to B3 is Matrix Memory Bus B0 to B3.
Cell 1 B0 to B3 is Matrix Memory Bus B4 to B7.
Cell 2 B0 to B3 is Matrix Memory Bus B8 to B11.
Cell 3 B0 to B3 is Matrix Memory Bus B12 to B15.
2. The ALU function is Σ = A during the first cycle for each column-vector, otherwise Σ = A plus B.
3. The mantissa sign (MS) input to each Cell from the sign bus 306 is used only for floating-point. It is the product of signs of the row-vector and column-vector elements it is working on.
Also note for single precision floating-point that the mantissa has only 23 bits, rather than 32, hence three cycles (each handling eight bits), rather than four cycles, are required to compute a product. However, four cycles are needed per product to pass data between the intelligent memory chips and the vector accumulator chip 209, so no increase in performance is realized. Thus an additional eight bits of precision may be provided without degrading performance, but this is not shown.
Figure 12 is the block diagram of the floating-point processor cell. It includes a hidden-bit register 1203 that specifies the bit position of the mantissa hidden bit for each element of the input matrix. This bit is set only for bit 22, a bit that occurs in a single intelligent memory chip, whereas the hidden bit for the column-vector mantissa occurs in each chip since the entire column-vector is stored in each chip. (When a non-zero number is in floating-point format, one bit of the mantissa does not need to be stored because it is always a 1; it is called the "hidden bit".) The mask register 1202 is used to inhibit (turn off) the bits that are not in the mantissa of an element of the input matrix. For the IEEE single precision format, bits 21:0 are in the mantissa and the corresponding mask bits would be 1's; the remaining mask bits would be 0's.
These registers must be loaded via the matrix memory under the control of the processor control lines 318 before computations begin.
Note that only the logic for single precision floating-point is described. The extension to double precision requires a doubling of the width of the vector memory and the normalizer, and twice as many intelligent memory chips to provide 64-bit matrix precision. The mask register and hidden bit register are sufficiently general so that no changes are required; they would simply be loaded with appropriate data.
The "F" gates as 1204 handle masking and hidden bit injection. Each gate has three, 1-bit inputs and a 1-bit output. The function is: ([Data Register AND Mask Register] OR [Hidden Bit Register AND (NOT Matrix Zero)]).
For floating-point operations, the multiplier 1206 operates only on positive values since floating-point mantissas are stored in sign/magnitude format. For fixed-point operations, a signed multiplier is used whenever a sign bit is present, as indicated by the slice type and clocks. For floating-point operations, the sign block "S" 1208 receives the sign of the matrix mantissa from the sign bus, and the sign of the column-vector mantissa from the vector memory bus. If the sign of the product is negative, then the ALU 1209 complements the output from the multiplier 1206. Otherwise, and for fixed-point operations, the data passes unchanged. For fixed-point arithmetic, the normalizers act as though they were invisible, the mask registers are set so that all bits are used, and the hidden bit registers are cleared. Note that the number of cycles to produce a floating-point product is much greater than the number of cycles to compute a fixed-point product because information must flow from the intelligent memory chips to the vector accumulator chip and back. However, the rate of dot-product computation is the same for both fixed-point and floating-point.
For fixed-point arithmetic, the multiplier 1206 is controlled by two signals so as to implement the correct section of a 32-bit by 32-bit multiplier. The spatial selection is made by slice type (ST) which distinguishes the most significant slice of the matrix from lower significance slices. The temporal selection is made by T0 which distinguishes the most significant slice of the vector from lower significance slices. The choice of timing signal T0 is a result of the pipelining of the structure; it is the signal that is asserted when the most significant bits of the vector are flowing through the multiplier.
The data register 1201 and mask register 1202 are loaded when commanded by the processor control inputs to the chip, whereas the vector register 1205 is loaded whenever run (one of the processor control inputs) is asserted, which is whenever multiplications are being performed.
Figure 13 is a block diagram of the floating-point vector accumulator and interface chip 209. The exponent section 1303 performs the exponent arithmetic required by the method for performing floating-point arithmetic described herein. The mantissa section 1306 checks for matrix mantissas that are zero and sends this information to the exponent section via the exponent bus 1305. The presence or absence of a zero mantissa for each element is sent to all intelligent memory chips via the exponent section. The mantissa section receives exponent information from the exponent section and includes normalization logic as is known in the art to provide the correct floating-point format of the final vector dot-product.
The mantissa section 1306 also combines the partial products into a complete product. The bus interface 1307 routes data between the intelligent memory chips via the MD bus 201, the mantissa section and the microprocessor data bus 210. A clock generator 1308 provides timing signals; it is synchronized by reset 1310 to the clock generator as 317 in each intelligent memory chip to facilitate the passing of data between chips. The reset signal can be shifted in time between the vector accumulator chip and the group of intelligent memory chips should changes in the pipelines require it.
Figure 14 is a block diagram of the logic within the mantissa section that combines partial products. The speed requirements of the weighted adder 1406 have been minimized by buffering the partial products as 1400 in such a way as to present the full 40 bits of each partial product to the adder at the same time. The first bank of registers as 1403 delays the partial product so that the second bank of registers as 1404 can capture the entire partial product and hold it for an entire slow clock period.
The output 1407 of the adder 1406 passes through a register 1408 to a rounding unit 1409. For graphics, the 68-bit product 1407 would typically be truncated or rounded to 32 bits. The choice of manipulation provided by the round unit depends upon the needs of the application. A full precision product could be passed if it were desired to combine the outputs from multiple vector accumulators in order to handle matrices with more than 4 columns. The output of the round unit is captured by an output register 1410.
Figure 15 is the block diagram of the 8-port by 40-bit weighted adder 1406. It is composed of four adder/registers 1502 to 1505 in the first layer. (An adder/register is an adder followed by a pipeline register.) Each combines a pair of partial products whose significance differs by a factor of 16, consistent with each intelligent memory chip handling 4 bits. The least significant partial product, PP0 1500, feeds the 40 AL ("A" least significant) inputs to adder/register 1502. The most significant bit is used to extend the product by 4 bits at the AM ("A" most significant) inputs. The four least significant bits of the other input to the adder/register are set to zero at the BL input. PP1 1501 is fed into the 40 most significant bits at the BM input. The adder/register then computes a 44-bit sum. The other adder/registers in the same level operate similarly.
The second layer of adder/registers 1507 and 1506 combine the outputs from the first layer of adder/registers. Adder/register 1508 in the third layer combines the results from the second layer and produces sum out 1509.
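The tree is easy to verify in a sketch: each layer doubles the significance gap closed by the shift, so three layers combine all eight 40-bit partial products (the weights assume 4-bit slices; names are illustrative):

```python
# Layer 1 closes a 4-bit gap (factor 16), layer 2 an 8-bit gap,
# layer 3 a 16-bit gap; the result is the sum out (1509).
def weighted_add(pp):                 # pp[0] is the least significant of 8
    layer1 = [pp[2*i] + (pp[2*i+1] << 4) for i in range(4)]    # 44-bit sums
    layer2 = [layer1[2*i] + (layer1[2*i+1] << 8) for i in range(2)]
    return layer2[0] + (layer2[1] << 16)
```

The result equals the sum of pp[i] * 2^(4i) over all eight ports, consistent with the 68-bit product width cited for 1407.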
Figure 16 is a block diagram of the bus interface 1307. It has two driver/receivers 1600, 1602 to communicate with the external busses, MD and I/O. The novel feature of the interface is its eight sets of multiplexers, where each set has four multiplexers as 1606 to 1609. These multiplexers provide the loading of each 4-bit nibble of the vector into each of 8 intelligent memory chips.
The method for loading a vector into the matrix memory is described below. The objective is to load all bits of each element of the column-vector into each intelligent memory chip, whereas only four bits of each element of the matrix are loaded into each chip.
The following procedure is executed for each 32-bit element (or row) of each column-vector. It is controlled by the control unit 208 in the intelligent floating-point memory module shown in Figure 2. The control unit is instructed by a microprocessor to perform the operation and to store the vector at a particular address. The procedure is:

1. The vector element is loaded into the Write Vector Register in the vector accumulator chip. The element may come from the microprocessor data bus, the memory data bus or the dot product output.
2. Under control of matrix control and matrix address, the input registers in each intelligent memory chip are loaded from the row of the matrix memory that will store the vector element.
3. No writing to this row of the matrix memory may be performed while the following steps take place.
4. For each 4-bit nibble, where nibble N comprises bits 4N+3 to 4N, and N = 0 to 7:
4a. The selected nibble from the vector write register in the vector accumulator chip is replicated eight times by the multiplexers in the bus interface logic in the vector accumulator chip to fill the entire 32-bit word. The selection of the nibble is made by the vector accumulator control (VAC) lines.
4b. The word containing the replicated nibble is passed to the memory data bus and received by the intelligent memory chips.
4c. This nibble is written into the input register in the intelligent memory chips, where the position in the register is selected by the memory address lines. The position must be selected in accordance with the "table of data storage formats in a row of the matrix memory" (Figure 6).
5. The input register is written back into the row of the matrix memory. Only the bits that were updated by the nibbles of the column-vector are changed.
6. Full use of the matrix memory may resume.
7. The next element of the vector may now be handled as described above.
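Step 4a amounts to a nibble replication, which can be sketched in one line (assuming 32-bit words and eight 4-bit chips):

```python
# Replicate nibble n of the vector element into all eight nibble
# positions of the memory-data-bus word, so every chip sees the same
# four bits; 0x11111111 has a 1 in each nibble, and a 4-bit value
# cannot generate carries between nibbles.
def replicate_nibble(element, n):
    nibble = (element >> (4 * n)) & 0xF
    return nibble * 0x11111111
```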
In a typical operation, a matrix of row-vectors is multiplied by a transformation matrix. Each product is available at the output of the mantissa section 1306 and flows to the register 1601 in the bus interface 1307. The output of register 1601 is turned on, all other devices connected to the bus 1605 are turned off, and the product flows through driver/receiver 1600 to the MD bus for loading back into the intelligent memory chips.

The intelligent memory chips may be read or written like common memory chips by the host processor via the microprocessor data bus that is connected to the I/O port of driver/receiver 1602. Data is written into the memory chips by turning on the receiver in driver/receiver 1602 and the driver in driver/receiver 1600. The situation is reversed to read data from the intelligent memory chips back into the microprocessor. The control unit 208 would receive a bank-select signal that it passes to the memory chips and the vector accumulator chip to determine when they should respond.
The use of multiple intelligent memory chips to perform floating-point operations is now given. First, the terminology for describing the operation of floating-point data is given below. As is common practice, each matrix element is the product of a mantissa and a power of 2. As is also common practice, the mantissa is held in sign/magnitude representation and the exponent is held in offset binary.
Let [M] = [M0 ... M(N-1)] and [V] = [V0 ... V(N-1)], where:

Mk = MMantissa(k) * 2^MExponent(k) = MMk * 2^MEk
Vk = VMantissa(k) * 2^VExponent(k) = VMk * 2^VEk
1.0 <= magnitude of mantissa < 2.0, or magnitude = 0.
The conventional method for computing the floating-point product of a row-vector and a column-vector is given below. First, a running sum in floating-point format (generally an extended format) is initialized. Then, and this is a key point, the product of the mantissas of a pair of elements is computed in fixed-point representation, and then a shift of the smaller of the product and the running sum is performed so that the exponent of the product and the running sum are the same, at which point the product is added to the running sum. When a series of products is complete, the result is renormalized. Note that this method works on one product at a time.
For C = [M] * [V]^T (ignoring special cases):
1  MT0 = 0; initialize MantissaTemporary0; accumulator
2  ET0 = 0; initialize ExponentTemporary0; accumulator
3  k = 0; initialize loop counter
4
5  MT1 = MMk * VMk; multiply mantissas in fixed-point
6  ET1 = MEk + VEk; add exponents in fixed-point
7  ET0 = larger of ET0 and ET1
8  shift MT1 and adjust ET1 so ET1 = ET0; denormalize
9  MT0 = MT0 + MT1; accumulate - add mantissas in fixed-point
10 normalize MT0 and adjust ET0; renormalize
11 k = k + 1
12 If k < N Then GOTO 5
13
14 C = MT0 * 2^ET0
The computation method that is implemented by intelligent floating-point memory chips is given below. Since an intelligent memory chip adds several products simultaneously, the solution is to shift each element of the vector prior to computing the product. The products thus have a common exponent and may be added. Any number of products may be handled simultaneously by this method, although the preferred embodiment of the chip described herein operates upon only four. This method is possible because the entire vector is stored in each chip, so all bits are available to perform the shift. The matrix, on the other hand, is spread over multiple chips and cannot be shifted without a serious interconnection problem between chips.
The details of this method are:
Let degree of parallelism = P = number of products computed in a group:
1  MT0 = 0; initialize MantissaTemporary0; accumulator
2  ET0 = 0; initialize ExponentTemporary0; accumulator
3  k = 0; initialize loop counter
4
5  in parallel: E0 = MEk + VEk, ..., E(P-1) = ME(k+P-1) + VE(k+P-1); sum the pairs of exponents in the group
6  EP = Max(E0, ..., E(P-1)); find the largest exponent
7  in parallel: S0 = E0 - EP, ..., S(P-1) = E(P-1) - EP
8  in parallel: MT1 = [MMk * (VMk * 2^S0)] + ... + [MM(k+P-1) * (VM(k+P-1) * 2^S(P-1))]; multiply and add at the same time
9  normalize MT1 and adjust EP; renormalize
10 add (MT1, EP) to (MT0, ET0); accumulate as normal
11 k = k + P
12 If k < N Then GOTO 5
13
14 C = MT0 * 2^ET0
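An executable sketch of this group method for P = 4, using Python integers for the fixed-point mantissa fields (names are illustrative; in the hardware the work is split between the intelligent memory chips and the vector accumulator chip):

```python
# Lines 5-8: sum exponent pairs, pick the largest, pre-shift each
# vector mantissa, then multiply-and-add the whole group at once.
# Lines 9-10: fold the group result into the running sum the same way
# the conventional method does.
def group_dot(mm, me, vm, ve, P=4):
    mt0, et0 = 0, 0
    for k in range(0, len(mm), P):
        e = [me[k+i] + ve[k+i] for i in range(P)]
        ep = max(e)
        s = [ep - ei for ei in e]                     # right-shift amounts
        mt1 = sum(mm[k+i] * (vm[k+i] >> s[i]) for i in range(P))
        et_acc = max(et0, ep)
        mt0 = (mt0 >> (et_acc - et0)) + (mt1 >> (et_acc - ep))
        et0 = et_acc
    return mt0, et0                                   # C = mt0 * 2**et0
```

As the text notes, the pre-shift loses low-order vector bits, producing very nearly the same error as the one-at-a-time conventional accumulation.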
The Partial Product (PP) Bus is used as follows:
Cycle 0: Matrix Data -> Vector Accumulator Chip from each Intelligent Memory Chip.
Cycle 1: PP19:0 -> Vector Accumulator Chip from each Intelligent Memory Chip.
Cycle 2: Delta Exponent -> all Intelligent Memory Chips in parallel from the Vector Accumulator Chip.
Cycle 3: PP39:20 -> Vector Accumulator Chip from each Intelligent Memory Chip.
The operations required to multiply a row of a matrix times a column of a vector are as follows. The operations would be highly pipelined and overlapped for efficient operation, but these complexities are ignored for simplicity of explanation. It is assumed that appropriate data has been loaded into the matrix memory from a host processor prior to beginning this procedure.
0. The Mask Register and the Hidden Bit Register are initialized in turn from the Output Register according to the desired floating-point format. For the IEEE 32-bit standard: (1) Mask Register bits 31-22 are cleared (sign bit and exponent bits) and bits 21-0 (mantissa or significand bits) are set, and (2) Hidden Bit Register bit 22 is set and all other bits are cleared. This initialization enables identical intelligent memory chips to provide the appropriate function for each bit.
1. The Output Register is loaded with a row of the Matrix Memory.
2. A portion of the Output Register is selected as the "current vector column".
3. The Vector Memory is loaded with the current vector column.
4. All bits of each element in the current vector column are sent to the Vector Accumulator Chip. They are stored in the Exponent Section.
5. The Output Register is loaded with a row of the matrix from the Matrix Memory.
6. A portion of the Output Register is selected as the "current matrix row".
7. All bits of each element in the current matrix row are sent to the Vector Accumulator Chip. They are stored in the Mantissa Section.
8. The sign bit of each element in the current matrix row is sent to all Intelligent Memory Chips via the Sign Bus.
9. The Exponent Section in the Vector Accumulator Chip computes the exponents of the products, selects the maximum exponent, and computes the Delta Exponents - the difference between the maximum and each product's exponent.
10. The Mantissa Section in the Vector Accumulator Chip checks each matrix element for a zero value.
11. The Exponent Section in the Vector Accumulator Chip checks each vector element for a zero value.
12. The Delta Exponents are sent to all Intelligent Memory Chips. The checks for zero for the matrix and vector are used to form the Delta Exponent.
13. The Vector Memory is loaded into Registers 2 to 0 in the Normalizer Cell under control of the Delta Exponent, shifting the vector mantissa by multiples of 8 bits. The Hidden Bit is not set if a vector element is zero.
14. Registers 2 to 0 in the Normalizer Cell are loaded into Registers 5 to 3.
15. The sign bit of each product is computed by the Processor Logic Cell.
16. The product of the matrix row and vector column is computed. The hidden bit of each matrix element is suppressed if the matrix element is zero. Each element of the vector column is shifted by up to seven bits by the Normalizer cell to complete the adjustment according to the Delta Exponent.
17. Go to the next vector column and/or matrix row.
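Steps 9 through 12 can be sketched as follows (a simplification: the zero flags from the Mantissa and Exponent Sections fold into the Delta Exponents sent back to the chips; names are illustrative):

```python
# Form the product exponents, select the maximum over the non-zero
# terms, and pair each Delta Exponent with a zero flag so the chips
# can suppress the hidden bit of a zero element (step 13).
def delta_exponents(me, ve, m_zero, v_zero):
    n = len(me)
    prod = [me[i] + ve[i] for i in range(n)]              # step 9
    live = [i for i in range(n) if not (m_zero[i] or v_zero[i])]
    max_exp = max((prod[i] for i in live), default=0)
    deltas = [(max_exp - prod[i], m_zero[i] or v_zero[i]) for i in range(n)]
    return max_exp, deltas
```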
Industrial Applicability
The intelligent floating-point memory units described herein may be used for a wide range of applications requiring the rapid manipulation of large matrices of numerical data. These applications include digital signal processing, pattern recognition, three-dimensional graphics, and scientific-and-engineering computing.

Claims

1. An intelligent floating-point memory unit capable of performing a NumPoints (number of points - two or more) vector dot-product calculation upon data having a fixed-point or a floating-point representation and comprising:

(a) a matrix memory (300) capable of storing one or more matrices and one or more vectors, and receiving matrix control inputs and matrix address inputs, and driving/receiving NumMBits (number of matrix bits - one or more) matrix data pins, and comprising:

(a1) NumMBits memory-and-logic planes (703), each plane being connected to one bit of the matrix data bus and comprising:

(a1a) memory cells arranged in a multiplicity of rows and columns, and storing a row of the matrix in a row of memory cells,

(a1b) row-selection means for receiving the matrix address inputs and the matrix control inputs, and selecting one row of the memory cells,

(a1c) access means for reading information from, and writing information to, the memory cells, where information may be conveyed between a single column in a plane of the memory cells and one of the matrix data pins,

(a1d) an input register that stores a row of bits received from the access means for each plane of memory cells, that replaces those bits by one or more bits received from one of the matrix data pins, where the selection of each bit replaced is made by the matrix address inputs in conjunction with the matrix control inputs,

(a1e) an output register that stores a row of bits received from the access means and a multiplexer that selects NumPoints of these bits to be connected to the matrix memory bus,

(b) a vector memory (319) receiving the matrix memory bus, matrix address pins and vector control pins, producing the vector memory bus, and comprising:

(b1) multiple sets of words of storage, each word having NumPoints sections, each section having NumVBits (number of vector bits) bits, where NumVBits is an integer multiple of NumMBits, where data is loaded from the matrix memory via the output register and the matrix memory bus into the vector memory in such a way that for the bits that represent the mantissa: (1) the bits within each section have ascending bit weight, (2) the same set of bit weights is used in each of the sections in a single word, and (3) as many words are used in each set as are required to store all of the mantissa bits of each element of the vector, in which case successive words have bits with ascending weights,
(c) processor logic receiving the matrix memory bus, vector memory bus, and processor control pins, where one of the processor control pins is "MSS slice" (aka "slice type"), and producing the partial product signals and the sign bus, and comprising:

(c1) a clock generator (317) that produces a repetitive sequence of timing signals as T0 to TL,

(c2) NumPoints normalizer cells (313), each receiving NumVBits of the vector memory bus and a portion of the DE bus, where each cell receives a different portion of each bus, and producing NumVBits of the Shifter bus, and comprising:

(c2a) a set of registers (as 905 to 907) that receive the vector memory bus, where the position that each portion of an element of a vector is loaded into the registers depends upon a value conveyed by the DE bus, and where registers that are not loaded are set to zero,

(c2b) a set of registers (as 902 to 904) that are loaded in parallel from the set of registers (as 905 to 907) and shift the element of the vector by NumVBits at a time where the least significant bits are lost first,

(c2c) a set of multiplexers (901) that receives the outputs of the 2 * NumVBits of the least significant bits of the registers (as 902 to 904) and produces NumVBits of the shifter bus, where the choice of NumVBits of adjacent bits read from the registers depends upon a value conveyed by the DE bus,

(c3) NumPoints logic cells (307), each receiving one bit of the sign bus, NumMBits bits of the matrix memory bus and NumVBits of the shifter bus, where each cell receives a different portion of each bus and produces a product output, and comprising:
(c3a) a mask register (1202) receiving NumMBits of the matrix memory bus under command of the processor control pins,

(c3b) a hidden bit register (1203) receiving NumMBits of the matrix memory bus under command of the processor control pins,

(c3c) NumMBits gates (as 1204), each receiving one bit from the mask register, one bit from the hidden bit register and a signal, Matrix Zero, conveyed by a state of the DE bus, where the function of each gate is: {[Data Register AND Mask Register] OR [Hidden Bit Register AND (NOT Matrix Zero)]},

(c3d) a multiplier that forms the product of its gated matrix memory bus input and its shifter bus input, and produces the product output,

(c3e) an ALU (1209) that passes the output of the multiplier if the sign of the overall product is positive or complements the output if the sign is negative,
(c3e) an ALU (1209) that passes the output of the multiplier if the sign of the overall product is positive or complements the output if the sign is negative, ° (c4) an adder and accumulator (308) receiving the product outputs from the logic cells and producing the internal partial product output, and comprising:
(c4a) an adder/register (1103) that sums the product outputs, (c4b) an ALU (1104) and register (1105) that sum the adder (1103) 5 output over time, where the weight of the adder output in each successive cycle is 2~NumVBits times the weight of the preceeding cycle, where the weight 1 when a new summation starts,
(c4c) a register and multiplexer (1106) that receives the completed sum from the ALU (1103) and register (1105), where the multiplexer passes portions of the output of the register to the internal partial product in sequential cycles,
and (d) a floating-point interface (309) that (1) receives the matrix memory bus and sends it on the partial product bus, (2) receives the internal partial product and sends it on the partial product bus, and (3) receives the partial product bus and sends it as the DE bus.
2. A method for computing the dot-product of vectors represented in floating-point format, comprising a series of steps:
Step 1: For vectors [M] and [V], each having N elements, computation of the set of the sums of exponents, where S(i) = [exponent of M(i)] plus [exponent of V(i)], for i = 0 to N-1,
Step 2: Selection of the largest exponent, LE, from the set of sums of exponents.
Step 3: Calculation of the set of differential exponents where DE(i) = LE - S(i), for i = 0 to N-1,

Step 4: Reduction in the magnitude of the mantissa of V(i) by 2^DE(i), where the scaled mantissa of V(i) = mantissa of V(i) / 2^DE(i), for i = 0 to N-1,

Step 5: Calculation of the sum of the products of the mantissas of the M vector and the scaled mantissas of the V vector, where Sum = [mantissa of M(0) * scaled mantissa of V(0)] + ... + [mantissa of M(N-1) * scaled mantissa of V(N-1)],

Step 6: Normalization of the Sum, including calculation of its exponent,
Step 7: Addition of LE to the exponent of the normalized Sum.
Step 8: Exit.
3. A nibble-serial, element-parallel method for multiplying a row-vector times a column-vector, comprising:
(a) a data structure, comprising: 5 (al) two or more vectors, each with NumPoints (number of points) elements,
(a2) one or more row-vectors [R(i)], each of whose elements has M * NumMBits (NumMBits = number of matrix bits) bits of precision in two's complement notation,
(a3) one or more column-vectors [C(i)], each of whose elements has V * NumVBits (NumVBits = number-of-vector-bits) bits of precision in two's complement notation,
(b) an intelligent memory module capable of synchronizing all of its members, comprising:
(b1) M identical storage-and-processing units, aka intelligent memory units, each unit containing a cycle counter and operating upon NumPoints (number of points) elements of a vector, storing a group of NumMBits (number of matrix bits) adjacent bits of each element of the row-vector, where unit 0 stores bit 0 and groups of ascending bits are stored in ascending units, storing all bits of each element of the column-vector, operating upon NumVBits (number of vector bits) adjacent bits of each element of the vector during a single cycle as selected by its cycle counter, producing a series of sub-partial products, and producing a partial product output following the last cycle,
(b2) one global combinatorial unit, aka a vector accumulator unit (208), that receives the partial products from each of the M intelligent memory units and combines these partial products according to their respective significance to form a complete product, and
(c) a sequence of steps, comprising:
Step 0: Initialization of a cycle counter,
Step 1: Initialization of a sub-partial product accumulator in each unit,
Step 2: For each of the M units, for each of the NumPoints elements, selection of NumVBits of the vector by the cycle counter, and multiplication of its NumMBits matrix bits by its NumVBits vector bits,
Step 3: For each of the M units, summation of the results of the multiplications and addition of the summation to the sub-partial product accumulator according to the significance of the summation of the results of the multiplications,
Step 4: Advance to the next cycle, and if there are more cycles then go to Step 2,
Step 5: For each of the M units, assignment of the sub-partial product to the partial product, and transmission of each partial product to the global combinatorial unit,
Step 6: Combination of the M partial products according to their significance and production of the final product,
Step 7: Exit.
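
As a behavioral illustration of this method, the Python sketch below models M units, each holding one NumMBits group of every row-vector element and consuming the column-vector NumVBits per cycle. Unsigned arithmetic and the names unit_partial_product and row_times_column are assumptions of the sketch, not terms of the claim (the claim itself uses two's complement).

NUM_M_BITS, NUM_V_BITS = 4, 4   # illustrative widths

def unit_partial_product(matrix_slices, vector, v_cycles):
    # One intelligent memory unit: matrix_slices[p] is its NumMBits group of
    # element p; vector[p] is the full column-vector element.
    acc = 0                                     # Steps 0-1: reset
    v_mask = (1 << NUM_V_BITS) - 1
    for cycle in range(v_cycles):               # Step 4: nibble-serial loop
        shift = cycle * NUM_V_BITS              # cycle counter selects slice
        cycle_sum = sum(m * ((v >> shift) & v_mask)      # Steps 2-3
                        for m, v in zip(matrix_slices, vector))
        acc += cycle_sum << shift               # weight by significance
    return acc                                  # Step 5: partial product

def row_times_column(R, C, m_units, v_cycles):
    # Step 6: combine the M partial products according to the significance
    # of each unit's NumMBits group.
    m_mask = (1 << NUM_M_BITS) - 1
    partials = [unit_partial_product([(r >> (u * NUM_M_BITS)) & m_mask for r in R],
                                     C, v_cycles)
                for u in range(m_units)]
    return sum(p << (u * NUM_M_BITS) for u, p in enumerate(partials))

R, C = [0x12, 0x34], [0x56, 0x78]   # two 8-bit elements: M = V = 2
assert row_times_column(R, C, m_units=2, v_cycles=2) == sum(r * c for r, c in zip(R, C))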
4. A vector accumulator and interface unit (208) capable of operating upon words with M bits where M = K * NumMBits and K is an integer, comprising:
(a) an arithmetic section (1306) comprising:
(a1) a weighted adder tree having K multi-bit inputs and producing a sum output, where the inputs are scaled prior to being added so that the weight of the least significant bit of successive inputs increases by a factor of 2^NumMBits, and the weight of the least significant bit of the least significant input is 1,
(a2) a round unit (1409) producing a rounded output and capable of passing a portion of the sum output to its rounded output,
(b) a bus interface (1307) receiving a select input and comprising:
(b1) a write register (1601) capable of receiving the rounded output and a value from an external data bus,
(b2) K sets of multiplexers with NumMBits multiplexers in each set (as 1606 to 1609) and capable of replicating any group of NumMBits adjacent bits from the output of the write register throughout an M-bit word, where the least significant bit in a group is bit number (J * NumMBits) for J = 0 to K-1 and the group is chosen by the select input.
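
The weighted adder tree of (a1) and the replicating write path of (b2) reduce to simple shift-and-sum operations, sketched below in Python under the same illustrative assumptions as above (unsigned values; weighted_adder_tree and replicate_group are coined names).

NUM_M_BITS = 4   # illustrative width

def weighted_adder_tree(partials):
    # (a1): input k is scaled by 2^(NumMBits * k) before the summation, so
    # the least significant input carries weight 1.
    return sum(p << (NUM_M_BITS * k) for k, p in enumerate(partials))

def replicate_group(word, j, k_groups):
    # (b2): replicate the NumMBits group whose least significant bit is bit
    # J * NumMBits throughout an M-bit word, M = K * NumMBits.
    group = (word >> (j * NUM_M_BITS)) & ((1 << NUM_M_BITS) - 1)
    return sum(group << (NUM_M_BITS * k) for k in range(k_groups))

print(hex(weighted_adder_tree([0x3, 0x5, 0x1])))      # 0x153
print(hex(replicate_group(0x1234, j=1, k_groups=4)))  # 0x3333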
PCT/US1989/002864 1988-07-01 1989-06-30 Intelligent floating-point memory WO1990000287A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US21466588A 1988-07-01 1988-07-01
US214,665 1988-07-01

Publications (1)

Publication Number Publication Date
WO1990000287A1 true WO1990000287A1 (en) 1990-01-11

Family

ID=22799979

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1989/002864 WO1990000287A1 (en) 1988-07-01 1989-06-30 Intelligent floating-point memory

Country Status (1)

Country Link
WO (1) WO1990000287A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4051551A (en) * 1976-05-03 1977-09-27 Burroughs Corporation Multidimensional parallel access computer memory system
US4204208A (en) * 1977-08-30 1980-05-20 Harris Corporation Display of video images

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0446721A2 (en) * 1990-03-16 1991-09-18 Texas Instruments Incorporated Distributed processing memory
EP0446721A3 (en) * 1990-03-16 1992-05-06 Texas Instruments Incorporated Distributed processing memory
US5751987A (en) * 1990-03-16 1998-05-12 Texas Instruments Incorporated Distributed processing memory chip with embedded logic having both data memory and broadcast memory
WO1994022090A1 (en) * 1993-03-23 1994-09-29 David Siu Fu Chung Intelligent memory architecture
AU673069B2 (en) * 1993-03-23 1996-10-24 David Siu Fu Chung Intelligent memory architecture
US5677864A (en) * 1993-03-23 1997-10-14 Chung; David Siu Fu Intelligent memory architecture
WO2001097007A2 (en) * 2000-06-09 2001-12-20 Cirrus Logic, Inc. Math coprocessor
WO2001097007A3 (en) * 2000-06-09 2002-04-11 Cirrus Logic Inc Math coprocessor
US6912557B1 (en) 2000-06-09 2005-06-28 Cirrus Logic, Inc. Math coprocessor
JP2014521898A (en) * 2011-08-04 2014-08-28 ティセンクルップ・ビルシュタイン・ゲーエムベーハー Lightweight shock absorber for vehicles

Similar Documents

Publication Publication Date Title
US10817587B2 (en) Reconfigurable matrix multiplier system and method
EP2017743B1 (en) High speed and efficient matrix multiplication hardware module
US5226171A (en) Parallel vector processing system for individual and broadcast distribution of operands and control information
US5081573A (en) Parallel processing system
US4933895A (en) Cellular array having data dependent processing capabilities
US6901422B1 (en) Matrix multiplication in a vector processing system
US4777614A (en) Digital data processor for matrix-vector multiplication
Ma et al. Multiplier policies for digital signal processing
TW405093B (en) Data processor and data processing system
US5555429A (en) Multiport RAM based multiprocessor
US5287532A (en) Processor elements having multi-byte structure shift register for shifting data either byte wise or bit wise with single-bit output formed at bit positions thereof spaced by one byte
US5669010A (en) Cascaded two-stage computational SIMD engine having multi-port memory and multiple arithmetic units
KR100291383B1 (en) Module calculation device and method supporting command for processing digital signal
WO1989006014A1 (en) An intelligent memory chip for matrix multiplication
US5179714A (en) Parallel bit serial data processor
EP0083967B1 (en) Monolithic fast fourier transform circuit
JPH07200261A (en) Logic unit
US20200104669A1 (en) Methods and Apparatus for Constructing Digital Circuits for Performing Matrix Operations
US4680727A (en) Complex multiplier for binary two&#39;s complement numbers
EP0237204A2 (en) Bit-slice digital processor for correlation and convolution
US5422836A (en) Circuit arrangement for calculating matrix operations in signal processing
WO1990000287A1 (en) Intelligent floating-point memory
JP2001067206A (en) System and method for executing modular multiplication
GB2262637A (en) Padding scheme for optimized multiplication.
Buric et al. Bit-serial inner product processors in VLSI

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): JP KR

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE FR GB IT LU NL SE