US20010023425A1 - Method and apparatus for rounding in a multiplier arithmetic - Google Patents

Method and apparatus for rounding in a multiplier arithmetic Download PDF

Info

Publication number
US20010023425A1
US20010023425A1 US09/782,475 US78247501A US2001023425A1 US 20010023425 A1 US20010023425 A1 US 20010023425A1 US 78247501 A US78247501 A US 78247501A US 2001023425 A1 US2001023425 A1 US 2001023425A1
Authority
US
United States
Prior art keywords
multiplier
product
bit
redundant
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US09/782,475
Other versions
US6397238B2 (en
Inventor
Stuart Oberman
Norbert Juffa
Ming Siu
Frederick Weber
Ravikrishna Cherukuri
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced Silicon Technologies LLC
Original Assignee
Advanced Micro Devices Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Advanced Micro Devices Inc filed Critical Advanced Micro Devices Inc
Priority to US09/782,475 priority Critical patent/US6397238B2/en
Publication of US20010023425A1 publication Critical patent/US20010023425A1/en
Application granted granted Critical
Publication of US6397238B2 publication Critical patent/US6397238B2/en
Assigned to ADVANCED MICRO DEVICES, INC. reassignment ADVANCED MICRO DEVICES, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: CHERUKURI, RAVIKRISHNA, SIU, MING, JUFFA, NORBERT, OBERMAN, STUART, WEBER, FRED
Assigned to ADVANCED SILICON TECHNOLOGIES, LLC reassignment ADVANCED SILICON TECHNOLOGIES, LLC ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: ADVANCED MICRO DEVICES, INC.
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52Multiplying; Dividing
    • G06F7/523Multiplying only
    • G06F7/53Multiplying only in parallel-parallel fashion, i.e. both operands being entered in parallel
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443Sum of products
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/3017Runtime instruction translation, e.g. macros
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3802Instruction prefetching
    • G06F9/3804Instruction prefetching for branches, e.g. hedging, branch folding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2207/00Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F2207/38Indexing scheme relating to groups G06F7/38 - G06F7/575
    • G06F2207/3804Details
    • G06F2207/3808Details concerning the type of numbers or the way they are handled
    • G06F2207/3828Multigauge devices, i.e. capable of handling packed numbers without unpacking them
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/499Denomination or exception handling, e.g. rounding or overflow
    • G06F7/49905Exception handling
    • G06F7/4991Overflow or underflow
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/499Denomination or exception handling, e.g. rounding or overflow
    • G06F7/49936Normalisation mentioned as feature only
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/499Denomination or exception handling, e.g. rounding or overflow
    • G06F7/49942Significance control
    • G06F7/49947Rounding
    • G06F7/49963Rounding to nearest
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/499Denomination or exception handling, e.g. rounding or overflow
    • G06F7/49994Sign extension
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52Multiplying; Dividing
    • G06F7/523Multiplying only
    • G06F7/533Reduction of the number of iteration steps or stages, e.g. using the Booth algorithm, log-sum, odd-even
    • G06F7/5334Reduction of the number of iteration steps or stages, e.g. using the Booth algorithm, log-sum, odd-even by using multiple bit scanning, i.e. by decoding groups of successive multiplier bits in order to select an appropriate precalculated multiple of the multiplicand as a partial product
    • G06F7/5336Reduction of the number of iteration steps or stages, e.g. using the Booth algorithm, log-sum, odd-even by using multiple bit scanning, i.e. by decoding groups of successive multiplier bits in order to select an appropriate precalculated multiple of the multiplicand as a partial product overlapped, i.e. with successive bitgroups sharing one or more bits being recoded into signed digit representation, e.g. using the Modified Booth Algorithm
    • G06F7/5338Reduction of the number of iteration steps or stages, e.g. using the Booth algorithm, log-sum, odd-even by using multiple bit scanning, i.e. by decoding groups of successive multiplier bits in order to select an appropriate precalculated multiple of the multiplicand as a partial product overlapped, i.e. with successive bitgroups sharing one or more bits being recoded into signed digit representation, e.g. using the Modified Booth Algorithm each bitgroup having two new bits, e.g. 2nd order MBA

Definitions

  • This invention relates generally to the field of microprocessors and, more particularly, to rounding the results of iterative calculations in microprocessors.
  • Microprocessors are typically designed with a number of “execution units” that are each optimized to perform a particular set of functions or instructions. For example, one or more execution units within a microprocessor may be optimized to perform memory accesses, i.e., load and store operations. Other execution units may be optimized to perform general arithmetic and logic functions, e.g., shifts and compares. Many microprocessors also have specialized execution units configured to perform more complex arithmetic operations such as multiplication and reciprocal operations. These specialized execution units typically comprise hardware that is optimized to perform one or more particular arithmetic functions. In the case of multiplication, the optimized hardware is typically referred to as a “multiplier.”
  • multipliers were implemented using designs that conserved die space at the expense of arithmetic performance. Until recently, this was not a major problem because most applications, i.e., non-scientific applications such as word processors, did not frequently generate multiplication instructions.
  • recent advances in computer technology and software are placing greater emphasis upon multiplier performance. For example, three dimensional computer graphics, rendering, and multimedia applications all rely heavily upon a microprocessor's arithmetic capabilities, particularly multiplication and multiplication-related operations.
  • microprocessor designers have favored performance-oriented designs that use more die space. Unfortunately, the increased die space needed for these high performance multipliers reduces the space available for other execution units within the microprocessor. Thus, a mechanism for increasing multiplier performance while conserving die space in needed.
  • MMX and 3D graphics instructions are often implemented as “vectored” instructions.
  • Vectored instructions have operands that are partitioned into separate sections, each of which is independently operated upon.
  • a vectored multiply instruction may operate upon a pair of 32-bit operands, each of which is partitioned into two 16-bit sections or four 8-bit sections.
  • a vectored multiply instruction Upon execution of a vectored multiply instruction, corresponding sections of each operand are independently multiplied.
  • FIG. 1 illustrates the differences between a scalar (i.e., non-vectored) multiplication and a vector multiplication.
  • a scalar (i.e., non-vectored) multiplication To quickly execute vectored multiply instructions, many microprocessors use a number of multipliers in parallel. In order to conserve die space, a mechanism for reducing the number of multipliers in a microprocessor is desirable. Furthermore, a mechanism for reducing the amount of support hardware (e.g., bus lines) that may be required for each multiplier is also desirable.
  • support hardware e.g., bus lines
  • microprocessor Another factor that may affect the number of multipliers used within a microprocessor is the microprocessor's ability to operate upon multiple data types. Most microprocessors must support multiple data types. For example, x86 compatible microprocessors execute instructions that are defined to operate upon an integer data type and instructions that are defined to operate upon floating point data types. Floating point data can represent numbers within a much larger range than integer data. For example, a 32-bit signed integer can represent the integers between ⁇ 2 31 and 2 31 ⁇ 1 (using two's complement format).
  • a 32-bit (“single precision”) floating point number as defined by the Institute of Electrical and Electronic Engineers (IEEE) Standard 754 has a range (in normalized format) from 2 ⁇ 126 to 2 ⁇ 127 ⁇ (2 ⁇ 2 ⁇ 23 ) in both positive and negative numbers. While both integer and floating point data types are capable of representing positive and negative values, integers are considered to be “signed” for multiplication purposes, while floating point numbers are considered to be “unsigned.” Integers are considered to be signed because they are stored in two's complement representation.
  • FIG. 2A an exemplary format for an 8-bit integer 100 is shown. As illustrated in the figure, negative integers are represented using the two's complement format 104 . To negate an integer, all bits are inverted to obtain the one's complement format 102 . A constant of one is then added to the least significant bit (LSB).
  • LSB least significant bit
  • FIG. 2B an exemplary format for a 32-bit (single precision) floating point number is shown.
  • a floating point number is represented by a significand, an exponent and a sign bit.
  • the base for the floating point number is raised to the power of the exponent and multiplied by the significand to arrive at the number represented.
  • base 2 is typically used.
  • the significand comprises a number of bits used to represent the most significant digits of the number.
  • the significand comprises one bit to the left of the radix point and the remaining bits to the right of the radix point. In order to save space, the bit to the left of the radix point, known as the integer bit, is not explicitly stored.
  • the multiplier may perform signed and unsigned scalar and vector multiplication using the same hardware.
  • the multiplier may receive either signed or unsigned operands in either scalar or packed vector format and accordingly output a signed or unsigned result that is either a scalar or a vector quantity.
  • this embodiment may reduce the total number of multipliers needed within a microprocessor because it may be shared by execution units and perform both scalar and vector multiplication. This space savings may in turn allow designers to optimize the multiplier for speed without fear of using too much die space.
  • speed may be increased by configuring the multiplier to perform fast rounding and normalization. This may be accomplished configuring the multiplier to calculate two version of an operand, e.g., an overflow version and a non-overflow version, in parallel and then select between the two versions.
  • an operand e.g., an overflow version and a non-overflow version
  • the multiplier may be further optimized to perform certain calculations such as evaluating constant powers of an operand (e.g., reciprocal or reciprocal square root operations). Iterative formulas may be used to recast these operations into multiplication operations. Iterative formulas generate intermediate products which are used in subsequent iterations to achieve greater accuracy.
  • the multiplier may be configured to store these intermediate products for future iterations.
  • the multiplier may be configured to compress these intermediate products before storing them, which may further conserve die space.
  • the multiplier may comprise a partial product generator, a selection logic unit, and an adder.
  • the multiplier may also comprise a multiplicand input configured to receive a multiplicand operand (signed or unsigned), a multiplier input configured to receive a multiplier operand (also signed or unsigned), and a sign-in input.
  • the sign-in input is configured to receive a sign-in signal indicative of whether the multiplier is to perform signed or unsigned multiplication.
  • the partial product generator which is coupled to the multiplicand input, is configured to generate a plurality of partial products based upon the multiplicand operand.
  • the selection logic unit which is coupled to the partial product generator and the multiplier input, is configured to select a number of partial products from the partial product generator based upon the multiplier operand.
  • the adder which is coupled to the selection logic unit, is configured to sum the selected partial products to form a final product.
  • the final product which may be signed or unsigned, may then be output to other parts of the microprocessor.
  • the multiplier may further comprise an “effective sign” calculation unit.
  • the calculation unit may comprise a pair of AND gates, each configured to receive the most significant bit of one operand and the sign-in signal. The output of each AND gate is used as the effective sign for that gate's operand.
  • the effective sign may be appended to each operand for use as the operand's sign during the multiplication process.
  • the effective sign may allow both unsigned operands and signed operands to be multiplied on the same hardware.
  • a method for operating a multiplier within a microprocessor comprises receiving a multiplier operand, a multiplicand operand, and a sign-in signal from other functional units within the microprocessor.
  • An effective sign bit for the multiplicand operand is generated from the sign-in signal and the most significant bit of the multiplicand operand.
  • a plurality of partial products may then be calculated from the effective sign bit and the multiplicand operand.
  • a number of the partial products may be selected according to the multiplier operand.
  • the partial products are then summed, and the results are output.
  • the steps may be performed in parallel or in a different order.
  • the multiplier may be capable of multiplying one pair of N-bit operands or two pairs of N/2-bit operands simultaneously.
  • the multiplier may comprise a multiplier input and a multiplicand input, each configured to receive an operand comprising one N-bit value or two N/2-bit values.
  • the multiplier may also comprise a partial product generator coupled to the multiplicand input, wherein the partial product generator is configured to generate a plurality of partial products based upon the value of the multiplicand operand.
  • the multiplier may further comprise a selection logic unit coupled to the partial product generator and the multiplier input. The selection logic unit may be configured to select a plurality of partial products from the partial product generator based upon the value of the multiplier operand.
  • An adder may be coupled to the selection logic unit to receive and sum the selected partial products to form a final product comprising either one 2N-bit value or two N-bit values.
  • the multiplier may receive a vector_in signal indicating whether vector or scalar multiplication is to be formed.
  • a method for operating a multiplier capable of scalar and vector multiplication may comprise receiving a multiplier operand, a multiplicand operand, and a vector-in signal as inputs from functional units within the microprocessor and then calculating a number of partial products from the multiplicand operand using inverters and shifting logic. Certain partial products may be selected according to the multiplier operand. The selected partial products may then be summed to generate a final product. The final product may be in scalar form if the vector_in signal is unasserted, and in vector form if the vector_in signal is asserted.
  • the multiplier may also be configured to calculate vector dot products and may comprise a multiplier input and a multiplicand input, each configured to receive a vector.
  • a partial product generator may be coupled to the multiplicand input and may be configured to generate a plurality of partial products based upon one of the vectors.
  • a first adder may be coupled to receive the partial products and sum them to generate vector component products for each pair of vector components.
  • a second adder may be coupled to the first adder and may be configured to receive and sum the vector component products to form a sum value and a carry value.
  • a third adder may be configured to receive the sum and carry values and one or more vector component products from the first adder. The third adder may be configured to output the sum of the sum and carry values (and any carry bits resulting from the summation of the one or more vector components) as a final result.
  • the multiplier may be configured to output the results in segments or portions. This may advantageously reduce the amount of interface logic and the number of bus lines needed to support the multiplier. Furthermore, the segments or portions may be rounded.
  • the multiplier may comprise a multiplier input, a multiplicand input, and a partial product generator. The generator is coupled to the multiplicand input and is configured to generate one or more partial products.
  • An adder, coupled to the partial product generator and the multiplier input, may be configured to receive a number of the partial products. The adder may sum the partial products together with rounding constants to form a plurality of vector component products which are logically divided into portions. One or more of the portions may be rounded.
  • the multiplier may be configured to round its outputs in a number of different modes.
  • an apparatus and method for rounding and normalizing results within a multiplier is also contemplated.
  • the apparatus comprises an adder configured to receive a plurality of redundant-form components.
  • the adder is configured to sum the redundant-form components to generate a first non-redundant-form result.
  • the adder may also be configured to generate a second non-redundant-form result comprising the sum of the redundant-form components plus a constant.
  • Two shifters are configured to receive the results. Both shifters may be controlled by the most significant bits of the results they receive.
  • a multiplexer may be coupled to receive the output from the shifters and select one of them for output based upon the least significant bits in the first non-redundant-form result. By generating more than version of the result (e.g., the result and the result plus a constant) in parallel, rounding may be accomplished in less time than previously required.
  • a multiplier configured to round and normalize products is also contemplated.
  • the multiplier may comprise two paths.
  • Each path may comprise one or more adders, each configured to receive a redundant-form product and reduce it to a non-redundant form. The first path does so assuming no overflow will occur, while the second path does so assuming an overflow will occur.
  • a multiplexer may be coupled to the outputs of the two paths, so as to select between the results from the first and second paths.
  • a method for rounding and normalizing results within a multiplier comprises multiplying a first operand and a second operand to form a plurality of redundant-form components.
  • a rounding constant is generated and added to the redundant-form component in two different bit positions. The first position assumes an overflow will occur, while the second position assumes no overflow will occur.
  • a particular set of bits are selected for output as the final result from either the first addition or the second addition.
  • the apparatus comprises an initial estimate generator configured to receive the operand and output an initial estimate of the operand raised to the desired constant power.
  • a multiplier may be coupled to receive the operand and the initial estimate, wherein the multiplier is configured to calculate the product of the initial estimate and the operand.
  • a first plurality of inverters may be coupled to receive, invert, and normalize selected bits from the product to form a first approximation, wherein the first approximation assumes an overflow has occurred in the multiplier.
  • a second plurality of inverters may be coupled to receive, invert, and normalize selected bits from the product to form a second approximation, wherein the second approximation assumes an overflow has not occurred in the multiplier.
  • a multiplexer may be configured to select either the first or second approximations for output.
  • the method comprises determining an initial estimate of the operand raised to a first constant power.
  • the operand and the initial estimate are then multiplied in the multiplier to form a first product.
  • a normalized first intermediate approximation is calculated by performing a bit-wise inversion on the first product assuming an overflow occurred during the multiplying.
  • a normalized second intermediate approximation is calculated by performing a bit-wise inversion on the first product assuming no overflow occurred during the multiplying.
  • a set of bits are selected from either the first intermediate approximation or the second intermediate approximation to form a selected approximation that may be output or used in subsequent iterations to achieve a more accurate result.
  • the apparatus may comprise two adders and a multiplexer.
  • the first adder is configured to receive the redundant-form value and add a rounding constant to its guard bit position, thereby forming a first rounded result, wherein the guard bit position is selected assuming no overflow will occur.
  • the second adder is similarly configured and performs the same addition assuming, however, that an overflow will occur.
  • a multiplexer is configured to select either the first rounded result or the second rounded result based upon one or more of the most significant bits from the first and second rounded results. Performing the rounding in parallel may advantageously speed the process by allowing normalization to take place in parallel with the multiplexer's selection.
  • a method for operating a multiplier that compresses intermediate results comprises calculating an intermediate product to a predetermined number of bits of accuracy.
  • a signaling bit is selected from the intermediate product.
  • the signaling bit is equal to each of the N most significant bits of the intermediate product.
  • the intermediate product is compressed by replacing the N most significant bits of the intermediate product with the signaling bit.
  • the compressed intermediate product is then stored into a storage register. During the next iteration, the storage register is read to determine the value of the compressed intermediate product.
  • the compressed intermediate product may be expanded to form an expanded intermediate product by padding the compressed intermediate product with copies of the signaling bit.
  • a multiplier configured to perform iterative calculations and to compress intermediate products is also contemplated.
  • the multiplier comprises a multiplier input, a multiplicand input, and a partial product generator as described in previous embodiments.
  • the multiplier also comprises a partial product array adder which is configured to receive and add a selected plurality of partial products to form an intermediate product.
  • Compression logic may be coupled to the partial product array adder.
  • the compression logic may comprise a wire shifter configured to replace a predetermined number of bits of the intermediate product with a single signal bit, which represents the information stored in the predetermined number of bits. The signal bit is selected so that it equals the value of each individual bit within the predetermined number of bits.
  • a storage register may be coupled to receive and store the compressed intermediate product from the compression logic.
  • the multiplier may be configured to add an adjustment constant to increase the frequency of exactly rounded results.
  • the multiplier may comprise a multiplier input configured to receive a multiplier operand, a multiplicand input configured to receive a multiplicand operand, a partial product generator, and selection logic.
  • the partial product generator is coupled to the multiplicand input and configured to generate one or more partial products based upon the multiplicand operand.
  • the selection logic may be coupled to the partial product generator and the multiplier, wherein the selection logic is configured to select a plurality of partial products based upon the multiplier.
  • the partial product array adder may be coupled to the selection logic, wherein the adder is configured to receive and sum a number of the partial products and an adjustment constant to form a product. The adjustment constant is selected to increase the frequency that the result is exactly rounded.
  • a method for increasing the frequency of exactly rounded results comprises receiving an operand and determining an initial estimate of the result of an iterative calculation using the operand.
  • the initial estimate and the operand are multiplied to generate an intermediate result.
  • the multiplication is repeated a predetermined number of times, wherein the intermediate result is used in place of the initial estimate in subsequent iterations.
  • the final repetition generates a final result, and an adjustment constant may be added to the final result, wherein the adjustment constant increases the probability that the final result will equal the exactly rounded result of the iterative calculation.
  • FIG. 1 is a diagram illustrating an exemplary scalar multiplication and an exemplary vector multiplication.
  • FIG. 2A is a diagram of an exemplary integer data format using two's complement representation.
  • FIG. 2B is a diagram of an exemplary floating point data format.
  • FIG. 3 is a block diagram of one embodiment of an exemplary microprocessor.
  • FIG. 4 is a block diagram of one embodiment of the computational core from the microprocessor of FIG. 3.
  • FIG. 5A illustrates one embodiment of the shift-and-add algorithm for binary multiplication.
  • FIG. 5B illustrates one embodiment of Booth's algorithm for binary multiplication.
  • FIG. 6 is a block diagram illustrating details of one embodiment of the multiplier from FIG. 4.
  • FIG. 7 is a block diagram illustrating the operation of the multiplier from FIG. 6 for unsigned operands.
  • FIG. 8 is a block diagram illustrating an example of the operation of the multiplier from FIG. 6 for signed operands.
  • FIG. 9 is a block diagram illustrating another example of the operation of the multiplier from FIG. 6 for signed operands.
  • FIG. 10 is a diagram illustrating one embodiment of the multiplier from FIG. 4 that is configured to perform vector multiplication.
  • FIG. 11A is a diagram that illustrates details of one embodiment of the partial product generator from FIG. 6.
  • FIG. 11B is a diagram that illustrates in detail part of one embodiment of the selection logic from FIG. 6.
  • FIG. 12A-B is a diagram that illustrates details of one embodiment of the selection logic and adder from FIG. 6.
  • FIG. 13 is a diagram illustrating another embodiment of the multiplier from FIG. 4 that is configured to perform vector multiplication.
  • FIG. 14 is a diagram illustrating yet another embodiment of the multiplier from FIG. 4 that is configured to perform vector multiplication.
  • FIG. 15 is a diagram illustrating one embodiment of a multiplier that is configured to calculate the vector dot product of a pair of vector operands.
  • FIG. 16 is a diagram illustrating another embodiment of a multiplier that is configured to calculate the vector dot product of a pair of vector operands.
  • FIG. 17 is a diagram illustrating one embodiment of a multiplier that is configured to return vector component products in portions, some of which may be rounded.
  • FIG. 18 is a diagram illustrating another embodiment of a multiplier that is configured to return vector component products in portions, some of which may be rounded.
  • FIG. 19 is a diagram illustrating one embodiment of the multiplier from FIG. 6 configured to perform rounding.
  • FIG. 20 is a diagram illustrating a numerical example of the operation of the multiplier from FIG. 19.
  • FIG. 21 is a diagram illustrating details of one embodiment of the sticky bit logic from FIG. 19.
  • FIG. 22 is a diagram illustrating a numerical example of the operation of the multiplier from FIG. 19.
  • FIG. 23 is a diagram illustrating a no ther embodiment of the multiplier from FIG. 6.
  • FIG. 24 is a flowchart illustrating one embodiment of a method for calculating the reciprocal of an operand.
  • FIG. 25 is a flowchart illustrating one embodiment of a method for calculating the reciprocal square root of an operand.
  • FIG. 26 is a diagram illustrating one embodiment of the multiplier from FIG. 6 that is configured to perform iterative calculations.
  • FIG. 27 is a diagram illustrating details of one exemplary embodiment of the non-overflow and overflow logic units from FIG. 26.
  • FIG. 28 is a diagram illustrating details of another exemplary embodiment of non-overflow and overflow logic units from FIG. 26.
  • FIG. 29A is a flowchart illustrating one possible method for fast compression.
  • FIG. 29B is a flowchart illustrating one possible method for fast decompression.
  • FIG. 30 is a diagram illustrating one embodiment of the multiplier from FIG. 4 configured to compress intermediate products.
  • FIG. 31A is a figure illustrating one possible method for compression.
  • FIG. 31B is a figure illustrating another possible method for compression.
  • FIG. 32 is a figure illustrating one embodiment of a multiplier configured to achieve a higher frequency of exactly rounded results.
  • FIG. 33A is a diagram illustrating an example of a vector multiplication using two multipliers.
  • FIG. 33B is a diagram illustrating another example of a multiplication using two multipliers.
  • FIG. 34 is a block diagram of one embodiment of a computer system configured to utilize the microprocessor of FIG. 3.
  • microprocessor 10 comprises a predecode logic block 12 , a bus interface unit 24 , and a level one-cache controller 18 , all of which are coupled to the following three caches: a level-one instruction cache 14 , a level-one data cache 26 , and an on-chip level-two cache 40 . Both instruction cache 14 and data cache 26 are configured with translation lookaside buffers, i.e., TLBs 16 and 28 , respectively.
  • Microprocessor 10 further comprises a decode unit 20 which receives instructions from instruction cache 14 , decodes them, and then forwards them to an execution engine 30 in accordance with inputs received from a branch logic unit 22 .
  • Execution engine 30 comprises a scheduler buffer 32 , an instruction control unit 34 , and a plurality of execution units 36 A- 36 F.
  • Scheduler buffer 32 is coupled to receive decoded instructions from decode unit 20 and convey them to execution units 36 in accordance with input received from instruction control unit 34 .
  • execution units 36 A-F include a load unit 36 A, a store unit 36 B, two integer/MMX/3D units 36 C and 36 D, a floating point unit 36 E, and a branch resolving unit 36 F.
  • Load unit 36 A receives input from data cache 26 , while store unit 36 B interfaces with data cache 26 via a store queue 38 .
  • Integer/MMX/3D units 36 C and 36 D, and floating point unit 36 E collectively form a computational core 42 for microprocessor 10 .
  • Computational core 42 may further comprise other execution units and specialized hardware such as multipliers.
  • instruction cache 14 is organized into sectors, with each sector including two 32-byte cache lines.
  • the two cache lines within each sector share a common tag but have separate state bits that indicate the status of the line.
  • two forms of cache misses may take place: (1) sector replacement and (2) cache line replacement.
  • sector replacement the cache miss is caused by a tag mismatch in instruction cache 14 .
  • the required cache line is supplied by external memory via bus interface unit 24 .
  • the cache line within the sector that is not needed is then marked invalid.
  • a tag matches the requested address but the corresponding cache line is marked as invalid.
  • the required cache line is then supplied by external memory, but unlike a sector replacement, the cache line within the sector that was not requested remains unaltered.
  • other organizations and replacement policies for instruction cache 14 may be utilized.
  • microprocessor 10 may be configured to perform prefetching only in the case of sector replacements. During sector replacement, the required cache line is filled. If the required cache line is in the first half of the sector, the other cache line in the sector is prefetched. If the required cache line is in the second half of the sector, no prefetching is performed. Other prefetching methodologies may also be employed in different embodiments of microprocessor 10 .
  • predecode logic block 12 When cache lines of instruction data are retrieved from external memory by bus interface unit 24 , the data is conveyed to predecode logic block 12 .
  • the instructions processed by microprocessor 10 and stored in cache 14 are variable-length (e.g., the x86 instruction set). Because decoding variable-length instructions is particularly complex, predecode logic 12 may be configured to provide additional information to be stored in instruction cache 14 to aid during decode. In one embodiment, predecode logic 12 generates “predecode bits” for each byte in instruction cache 14 . The predecode bits may provide various information useful during the decode process, e.g., the number of bytes to the start of the next variable-length instruction. The predecode bits are passed to decode unit 20 when instruction bytes are requested from cache 14 .
  • instruction cache 14 is implemented as a 32-Kbyte, two-way set-associative, writeback cache.
  • the cache line size may be 32 bytes in this embodiment.
  • Cache 14 also includes a 64-entry TLB that may be used to speed linear to physical address translation. Other variations of instruction cache 14 are possible and contemplated.
  • Instruction cache 14 receives instruction fetch addresses from cache controller 18 . In one embodiment, up to 16 bytes may be fetched from cache 14 per clock cycle. The fetched information is placed into an instruction buffer that feeds into decode unit 20 . In one embodiment of microprocessor 10 , fetching may occur along a single execution stream with seven outstanding branches taken. In another embodiment, fetching may take place along multiple execution streams.
  • the instruction fetch logic within cache controller 18 is capable of retrieving any 16 contiguous instruction bytes within a 32-byte boundary of cache 14 with no additional penalty when the 16 bytes cross a cache line boundary. New instructions are loaded into the instruction buffer as the current instructions are consumed by decode unit 20 .
  • Other configurations of cache controller 18 are also possible and contemplated.
  • decode logic 20 may be configured to decode multiple instructions per processor clock cycle.
  • Decode unit 20 may further be configured to accept instruction and predecode bytes from the instruction buffer (in x86 format), locate actual instruction boundaries, and generates corresponding “RISC ops”.
  • RISC ops are fixed-format internal instructions, most of which are executable by microprocessor 10 in a single clock cycle.
  • microprocessor 10 RISC ops are combined to form every function in the x86 instruction set.
  • Microprocessor 10 may use a combination of decoders to convert x86 instructions into RISC ops.
  • the hardware comprises three sets of decoders: two parallel short decoders, one long decoder, and one vector decoder.
  • the parallel short decoders translate the most commonly-used x86 instructions (e.g., moves, shifts, branches, etc.) into zero, one, or two RISC ops each.
  • the short decoders only operate on x86 instructions that are up to seven bytes long. In addition, they are configured to decode up to two x86 instructions per clock cycle. Commonly-used x86 instructions which are greater than seven bytes long, as well as those semi-commonly-used instructions that are up to seven bytes long, are handled by the long decoder.
  • the long decoder in decode unit 20 only performs one decode per clock cycle generating up to four RISC ops. All other translations (complex instructions, interrupts, etc.) are handled by a combination of the vector decoder and an on-chip ROM. For complex operations, the vector decoder logic provides the first set of RISC ops and an initial address to a sequence of further RISC ops within the on-chip ROM. The RISC ops fetched from the on-chip ROM are of the same type that are generated by the hardware decoders.
  • decode unit 20 generates a group of four RISC ops each clock cycle. For clock cycles in which four RISC ops cannot be generated, decode unit 20 places RISC NOP operations in the remaining slots of the grouping. These groupings of RISC ops (and possible NOPs) are then conveyed to scheduler buffer 32 . It is noted that in other embodiments, microprocessor 10 may be configured to decode other instructions sets in place of, or in addition to, the x86 instruction set.
  • Instruction control logic 34 contains the logic necessary to manage out-of-order execution of instructions stored in scheduler buffer 32 . Instruction control logic 34 also manages data forwarding, register renaming, simultaneous issue and retirement of RISC ops, and speculative execution. In one embodiment, scheduler buffer 32 holds up to 24 RISC ops at one time, which is equivalent to a maximum of twelve x86 instructions. When possible, instruction control logic 34 may simultaneously issue (from buffer 32 ) RISC ops to any available execution units 36 . In one embodiment, control logic 34 may be configured to issue up to six and retire up to four RISC ops per clock cycle.
  • store unit 36 A and load unit 36 B may each have two-stage pipelines.
  • Store unit 36 A may be configured to perform memory and register writes such that the data is available for loading after one clock cycle.
  • load unit 36 B may be configured to perform memory reads such that the data is available after two clock cycles.
  • Other configurations for load and store units 36 A and 36 B are also possible with varying latencies.
  • Execution unit 36 G (the branch resolving unit) is separate from branch prediction logic 22 in that it resolves conditional branches such as JCC and LOOP after the branch condition has been evaluated.
  • Branch resolving unit 36 G allows efficient speculative execution, enabling microprocessor 10 to execute instructions beyond conditional branches before knowing whether the branch prediction was correct.
  • microprocessor 10 may be configured to handle up to seven outstanding branches in one embodiment.
  • Branch prediction logic 22 coupled to decode unit 20 , is configured to increase the accuracy with which conditional branches are predicted in microprocessor 10 . Ten to twenty percent of the instructions in typical applications include conditional branches. Branch prediction logic 22 is configured to handle this type of program behavior and its negative effects on instruction execution, such as stalls due to delayed instruction fetching. In one embodiment, branch prediction logic 22 includes an 8192-entry branch history table, a 16-entry by 16 byte branch target cache, and a 16-entry return address stack. Branch prediction logic 22 may implement a two-level adaptive history algorithm using the branch history table. The table stores executed branch information, predicts individual branches, and predicts behavior of groups of branches. In one embodiment, the branch history table does not store predicted target addresses in order to save space. Instead, the addresses are calculated on-the-fly during the decode stage.
  • branch target cache within branch logic 22 supplies the first 16 bytes at that address directly to the instruction buffer (assuming a hit occurs in the branch target cache).
  • branch prediction logic 22 achieves branch prediction rates of over 95%.
  • Branch logic 22 may also include special circuitry designed to optimize the CALL and RET instructions. This circuitry allows the address of the next instruction following the CALL instruction in memory to be pushed onto a return address stack. When microprocessor 10 encounters a RET instruction, branch logic 22 pops this address from the return stack and begins fetching.
  • data cache 26 may also be organized as two-way set associative 32-Kbyte storage.
  • data TLB 28 includes 128 entries that may be used to translate linear to physical addresses.
  • data cache 26 may also be sectored.
  • Data cache 26 may further implement a MESI (modified-exclusive-shared-invalid) protocol to track cache line status. Other configurations of data cache 26 are also possible and are contemplated.
  • MESI modified-exclusive-shared-invalid
  • computation core 42 comprises three execution units 36 C-E and a multiplier 50 .
  • Integer/MMX/3D execution unit 36 C is a fixed point execution unit which is configured to operate on all ALU operations, as well as multiplies, divides (both signed and unsigned), shifts, and rotates.
  • integer/MMX/3D execution unit 36 E Integer Y unit
  • Execution units 36 C and 36 D are also configured to accelerate performance of software written using multimedia and 3D graphics instructions. Applications that can take advantage of multimedia and 3D graphics instructions include 3D modeling and rendering, video and audio compression/decompression, speech recognition, and telephony. Execution units 36 C and 36 D may be configured to execute multimedia instructions in a single clock cycle. Many of these instructions are designed to perform the same operation to multiple sets of data at once (i.e., vector processing). In one embodiment, execution units 36 C and 36 D use registers which are mapped onto the stack of floating point unit 36 E.
  • Execution unit 36 E contains an IEEE 754-compatible floating point unit designed to accelerate the performance of software which utilizes the x86 instruction set. Floating point software is typically written to manipulate numbers that are either very large or small, require a great deal of precision, or result from complex mathematical operations such as transcendentals. Floating point execution unit 36 E may comprise an adder unit, a multiply unit, and a divide/square root unit. In one embodiment, these low-latency units are configured to execute floating point instructions in as few as two clock cycles.
  • execution units 36 C and 36 D are coupled to multiplier 50 and are configured to utilize multiplier 50 as a shared resource.
  • this configuration allows both execution units 36 C and 36 D to perform multiplication without requiring two multipliers.
  • each execution unit 36 C and 36 D may each have their own dedicated multiplier. Still other configurations are possible and contemplated. For example, two n-bit multipliers may be shared by execution units 36 C and 36 D.
  • Multiplier 50 may be configured to accept operands that are 76-bits wide (i.e., the width of the significand in a double precision floating point data type), thereby providing the same functionality as two separate 32-bit multipliers while further alleviating the need for a separate multiplier in floating point unit 36 E.
  • execution units 36 C- 36 E may be directly coupled to multiplier 50 , with each execution unit sharing multiplier 50 .
  • Multiplier 50 may also be configured to perform both signed and unsigned multiplication. Advantageously, this allows multiplier 50 to support both integer multiplication for MMX instructions, and floating point multiplication for 3D graphics instructions.
  • multiplier 50 may be configured to perform multiplication using a number of different algorithms
  • the embodiment shown in the figure is configured to use a modified version of Booth's Algorithm to improve multiplication times.
  • Booth's algorithm relies upon calculating a number of partial products and then summing them to obtain a final product.
  • Booth's algorithm is able to improve multiplication times over the standard “add-and-shift” algorithm by reducing the number of partial products that need to be summed in order to obtain the final product. For example, in performing an 8-bit by 8-bit multiplication, the shift-and-add algorithm generates eight partial products. By contrast, same 8-bit by 8-bit multiplication using the 2-bit version of Booth's algorithm generates only five partial products. This reduction in the number of partial products is illustrated in FIGS. 5A and 5B.
  • multiplier 50 comprises a partial product generator 60 , a partial product selection logic unit 62 , and an adder 64 .
  • partial product generator 60 is coupled to selection logic unit 62 , which is in turn coupled to adder 64 .
  • the execution unit conveys two operands to multiplier 50 , i.e., a multiplicand operand 72 and a multiplier operand 74 .
  • Partial product generator 60 is coupled to receive multiplicand operand 72 , which is used as a starting value for calculating a plurality of partial products 70 .
  • partial product generator 60 is configured to use the 2-bit version of Booth's algorithm, the following partial products would be generated: the multiplicand itself (“+M”), a shifted version of the multiplicand (“+2M”), an inverted version of the multiplicand (“ ⁇ M”), a shifted and inverted version of the multiplicand (“ ⁇ 2M”), and two constants, i.e., a positive zero (“+0”) and a negative zero (“ ⁇ 0”) in two's complement form.
  • Partial product selection unit 62 is coupled to receive multiplier operand 74 .
  • Selection unit 62 is configured to select a number of partial products from generator 60 based upon particular fields within multiplier operand 74 . For example, using the 2-bit version of Booth's algorithm, multiplier operand 74 is padded with leading and trailing zeros (assuming an unsigned multiplication is being performed), and then one partial product is selected by each 3-bit field within the operand.
  • adder 64 is configured to receive and sum the partial products selected by selection unit 62 . As noted in the figure, the selected partial products 68 are shifted before they are summed. The resulting final product 76 is output to the execution unit that transmitted the operands. As previously noted, multiplier 50 may advantageously be configured to perform both signed and unsigned multiplication. This is described in greater detail below.
  • multiplier 50 further comprises a “sign-in” input 78 , which indicates whether a signed or unsigned multiplication is to be performed. Sign-in input 78 is coupled to AND gate 86 A, which also receives the most significant bit (“MSB”) of multiplier operand 74 .
  • MSB most significant bit
  • AND gate 86 A outputs an “effective sign” bit 90 for multiplier operand 74 which is copied and appended to multiplier operand 74 for use by selection logic unit 62 .
  • Sign-in input 78 is also routed to AND gate 88 B, which similarly calculates and appends an effective sign bit 92 for multiplicand operand 72 . While other effective sign calculation logic may be used, the configuration illustrated advantageously generates an effective sign of zero for all unsigned operands and positive signed operands using a minimum amount of logic. Furthermore, in the embodiment shown only signed negative operands receive an asserted effective sign bit.
  • Partial product generation logic 60 uses multiplicand operand 72 and effective sign bit 92 to generate a number of partial products 80 A- 80 C.
  • a shifted version 80 A of multiplicand operand 72 is generated by shifting logic 84 B.
  • Shifted version 80 A is equivalent to two times the multiplicand operand (+2M).
  • inverters 98 generate an inverted (i.e., one's complement) version ( ⁇ M) of multiplicand operand 72 .
  • Shifting logic 84 A is used to generate a shifted and inverted version 80 C ( ⁇ 2M) of multiplicand operand 72 .
  • Partial product generation logic 60 also generates constants for use as partial products, e.g., positive zero 82 B (+0) and negative zero 82 A ( ⁇ 0). As illustrated in the figure, each partial product 80 A, 80 B, 80 C, 72 , 82 A, and 82 B may have an extra constant bit 88 associated with it. Extra constant bit 88 is asserted only for negative partial products, i.e., ⁇ M, ⁇ 2M, and ⁇ 0, and is added to the partial product within adder 64 to generate two's complement versions of the inverted partial products.
  • the shaded areas of the figure denote constants that may be designed into multiplier 50 .
  • selection logic 62 is configured to select partial products based upon 3-bit fields from multiplier operand 74 .
  • Multiplier operand 74 is padded with zeros and copies of effective sign bit 90 so that there are no fractional 3-bit fields.
  • Selection logic 62 may comprise a number of multiplexers 94 A- 94 F, one for each partial product to be selected.
  • Each multiplexer 94 A- 94 E is controlled by a different 3-bit field from multiplier operand 74 .
  • the 3-bit fields determine which partial product from those generated by partial product generator 60 , i.e., +M, +2M, ⁇ M, ⁇ 2M, +0, ⁇ 0, will be selected.
  • Adder 64 is configured to receive and sum the selected partial products. As illustrated in the figure, the partial products are shifted before being summed. Some of the partial products may have prefix bits added to eliminate the need for sign extending the partial product's most significant bit (i.e., sign bit) to the maximum width of final product 76 . The prefixes may be generated using simple inverters coupled to the partial product's most significant bit and constants. Once the partial products are shifted, padded, and summed, final product 76 is output and conveyed to the execution unit that provided the operands. Adder 64 may use a number of different algorithms for summing the partial products. For example, adder 64 may configured as a carry look-ahead adder, a carry skip adder, a carry select adder, a carry-save adder, or a carry propagate adder.
  • the exemplary values in the figure illustrate the unsigned multiplication of two values, 240 10 and 230 10 .
  • Sign-in input 78 is unasserted because unsigned multiplication to be performed.
  • Sign-in input 78 may be provided by the same execution unit that provided the operands.
  • the execution unit may generate sign-in input bit 78 based upon the type of multiply instruction it received. In the example shown in the figure, effective signs 90 and 92 are both zero because sign-in input 78 is unasserted.
  • an 8-bit by 8-bit version of multiplier 50 is able to multiply 8-bit unsigned operands (i.e., operands that do not have a sign bit) having values from 0 to 255 to obtain a 16-bit unsigned result.
  • multiplier 50 is performing signed multiplication.
  • Sign-in input 78 is asserted because signed multiplication is to be performed.
  • multiplicand operand 72 equals 100 10
  • multiplier operand 74 equals ⁇ 50 10 .
  • Multiplier operand 74 is received in two's complement format because it is a negative signed value.
  • effective sign bit 90 (as calculated by AND gate 88 A) is asserted.
  • effective sign bit 92 for multiplicand operand 72 is unasserted because multiplicand operand 72 is positive.
  • the final product 76 is a negative 16-bit number ( ⁇ 5000 10 ) represented in two's complement format with the MSB indicating the sign.
  • multiplier 50 performing a signed multiplication is shown.
  • both multiplier operand 74 having a value of ⁇ 50 10
  • multiplicand operand 72 having a value of ⁇ 100 10
  • the multiplication results in a signed final product 76 (having a value of 5000 10 ) that is positive.
  • multiplier 50 may advantageously perform both signed and unsigned multiplication with the same hardware.
  • multiplier 50 may advantageously be configured to use Booth's algorithm to further increase multiplication performance.
  • multiplier 50 capable of performing vector multiplication.
  • multiplier 50 comprises partial product generator 60 , selection logic 62 , and adder 64 .
  • This embodiment of multiplier 50 is configured to perform component-wise vector multiplication of two pairs of N-bit values (A 1 ⁇ B 1 and A 2 ⁇ B 2 ) simultaneously or a scalar multiplication of one pair of 2N-bit values (A ⁇ B).
  • multiplier 50 may take the place of three separate multipliers (i.e., one for scalar multiplication and two for the vector multiplication), thereby saving valuable die space.
  • multiplier 50 has several features which allow it to perform both scalar and component-wise vector multiplication.
  • multiplier 50 functions as previously disclosed, i.e., adder 64 will sum the partial products selected by selection logic 62 from partial product generator 60 to form final product 76 .
  • multiplier 50 is configured to effectively operate as two separate multipliers. This behavior ensures that the results generated by multiplier 50 will equal the results that would have been generated had two separate multipliers been used.
  • multiplier 50 receives a vector_in input signal 120 .
  • a plurality of multiplexers within selection logic 62 e.g., multiplexers 122 and 124 ) effectively isolate the two “logical halves” of multiplier 50 . This separation prevents partial products from one pair of vector components (e.g., A 1 and B 1 ) from interfering with the multiplication of another pair of vector components (e.g., A 2 and B 2 ).
  • the operation of multiplexers 122 and 124 is described in greater detail below.
  • multiplicand operand 72 and multiplier operand 74 may each comprise a vector (two N-bit values) or a scalar value (a single 2N-bit value).
  • multiplicand operand 72 may comprise a vector (A 2 , A 1 ) or a single scalar value A.
  • the partial products selected by selection logic 62 may be logically divided into four quadrants 130 - 136 for component-wise vector multiplications (assuming vector operands each having two vector components).
  • Quadrant 130 represents the higher order bits of partial products selected by the least significant vector component of vector multiplier 74 (i.e., B 1 ).
  • Quadrant 132 represents the lower order bits of partial products selected by the least significant vector component of vector multiplier 74 (i.e., B 1 ).
  • Quadrant 134 represents the lower order bits of partial products selected by the most significant vector component of vector multiplier 74 (i.e., B 2 ).
  • Quadrant 136 represents the higher order bits of partial products selected by the most significant vector component of vector multiplier 74 (i.e., B 2 ).
  • the least significant bits of partial products selected by vector component B2 located within quadrant 134 may affect the addition performed to generate A 1 ⁇ B 1 within final product 76 .
  • multiplexer 124 is configured to “zero-out” the lower order bits of partial products located within quadrant 134 .
  • the higher order bits of partial products selected by vector component B 1 may extend into quadrant 130 , thereby possibly affecting the summation used to form B 1 ⁇ B2 within final product 76 .
  • additional multiplexers similar to multiplexer 124 may be used to zero-out the higher order bits within quadrant 130 .
  • Multiplexer 122 also assists in the logical separation that is advantageous for component-wise vector multiplication. Staggered bit fields within multiplier operand 74 are used to select partial products from partial product generator 60 . When a bit field encompasses bits from more than one vector component within multiplier operand 74 , the resulting partial product may also be “corrupted.” For example, selecting a partial product using one bit from vector component B 1 and two bits from vector component B2 (as illustrated in the figure) will result in a partial product that is partially representative of vector component B 1 and partially representative of vector component B 2 . This is undesirable because B 1 is to be multiplied with A 1 separately from B 2 . To remedy this, a multiplexer 122 may be used.
  • multiplexer 122 may zero-out the unwanted bit or bits (e.g., the most significant bit from B 1 as shown in the figure).
  • the partial product selected by multiplexer 94 B will reflect only the bit values within the desired vector component.
  • a second multiplexer similar to multiplexer 122 may zero out the opposite bits.
  • two partial products may be selected, one representing the end of vector operand B 1 and one representing the beginning of vector operand B 2 .
  • the zeroing-out of bits for partial product selection and summation are illustrated in more detail by way of a numerical example in FIGS. 11A through 12.
  • FIG. 11A More detail of one embodiment of partial product generator 60 is shown.
  • an additional effective sign bit 172 A- 172 F may be generated for the lower-order portion of each partial product.
  • the same logic may be used as previously disclosed, with AND-gate 86 B being duplicated (see AND-gate 86 C) to generate an effective sign for each lower-order vector component.
  • multiplier 50 may be configured to perform both signed and unsigned vector multiplication.
  • Generator 60 may also be configured to generate separate constant bits 88 A-F (referred to as S 1 ) and 170 A-F (referred to as S 2 ) to further improve separability when the selected partial products are summed in adder 64 .
  • the extra constant bits 170 A-F and effective sign bits 172 A-F may simply remain unused or unselected during scalar multiplication.
  • Sign_in input 78 is unasserted to indicate that an unsigned multiplication is being performed.
  • selection logic 62 may comprise a plurality of multiplexers 310 A-B, 312 A-B, 314 A-B, and 316 A-B. These multiplexers operate to select particular bits from partial product generator 60 according to the status of vector_in signal 120 . Each partial product has its own set of selection multiplexers (excluding constants +0 and ⁇ 0 which are simply fed through as is; see 320 A and 320 B).
  • multiplexer 310 A selects bits [ 9 - 0 ] from the partial product ⁇ 2M and outputs them to the rest of selection logic 62 and adder 64 if vector_in is asserted. This may ensure that both effective sign bits 92 A and 172 A are conveyed to adder 64 . Two effective sign bits are needed because two separate multiplications are being performed. Conversely, if vector_in is unasserted (indicating a scalar multiplication), extra effective sign bit 172 A is not needed, thus multiplexer 31 OA selects bits [ 9 - 6 , 4 - 0 ] and outputs them as bits [ 0 - 8 ].
  • bit [S 1 ] may be passed through as it is needed in both cases (scalar and component-wise vector multiplication).
  • Multiplexer 310 B selects bit [S 2 ] if vector_in signal 10 is asserted, thereby providing two constants 88 A and 170 A. If vector_in signal 120 is not asserted and scalar multiplication is being performed, bit [S 2 ] is not needed (and may cause an incorrect result if it is passed through to adder 64 ).
  • multiplexer 310 B is configured to select and convey a constant zero in place of actual S 2 bit 170 A if scalar multiplication is performed.
  • Multiplexers 312 A-B, 314 A-B, and 316 A-B operate in a similar fashion. Each multiplexer may be configured to select the required bits from partial product generator 60 without passing extra bits unless they are needed.
  • selection logic 62 comprises a plurality of multiplexers 94 A- 94 F as in the previous embodiments. Note that multiplexers 312 A-B, 314 A-B, and 316 A-B are not shown, but are instead included within partial product generator 60 . Selection logic 62 further comprises multiplexers 152 - 156 , which operate to select two portions of partial products: (1) a portion of the partial product corresponding to the higher order bits of vector operand B 1 , and (2) a portion of the partial product corresponding to the lower order bits of vector operand B 2 .
  • Multiplexer 156 selects this “combination” partial product when vector_in signal 120 is asserted.
  • this configuration may remedy the problem of summation corruption when a bit field encompassing bits from more than one vector operand is used to select a partial product. This problem is described in greater detail below (see FIGS. 13 and 14).
  • adder 64 comprises three pluralities of multiplexers 160 A- 160 D, 162 A- 162 E, and 164 C- 164 E.
  • Multiplexers 160 A- 160 D are controlled by vector_in signal 120 and operate to “zero-out” portions of the partial products to prevent corruption of the vector components within final product 76 during the summation within adder 64 .
  • Multiplexers 164 C-E are also controlled by vector_in signal 120 and operate to select either extra constant bits 140 C- 140 E (in the event of a vector multiplication) or a zero constant (in the event of a scalar multiplication) for addition into the more significant product.
  • Multiplexers 162 A- 162 D are controlled by sign_in input 78 and are configured to select either the effective sign bit of the more significant portion of the selected partial product (in the event of a signed vector multiplication) or the actual sign (in the event of an unsigned vector multiplication).
  • Multiplexers 164 C- 164 E are also controlled by vector_in signal 102 and perform the same function as multiplexers 310 B, 312 B, 314 B, and 316 B, i.e., they select a constant zero instead of extra constant bit S 2 if scalar multiplication is performed. Note that other configurations of logic for zeroing out and partial product selection are possible and contemplated. Further note that multiplexers 160 A- 160 D, 162 A- 162 E, and 164 C- 164 E may be configured as part of adder 64 , selection logic unit 62 , or as a separate part of multiplier 50 .
  • adder 64 may further comprise a plurality of multiplexers (not shown) to prevent carries across the boundaries of vector operands within final product 76 when summing the selected partial products. This boundary is represented by a dashed line 178 in the figure.
  • multiplier 50 may utilize different configurations of multiplexers.
  • multiplexers 160 A- 160 C may be configured to select either additional sign-extension bits or the most significant bits of the selected partial products.
  • multiplexers 160 A- 160 C may be configured to pad each selected partial product with prefix bits until the most significant bit of each selected product corresponds to the most significant bit of final product 76 (as indicated by dashed bit positions 170 A- 170 B).
  • the prefix bits may comprise a constant, sign extension bits, or a combination thereof.
  • FIGS. 11 A-B and 12 together illustrate the exemplary component-wise multiplication of two vector operands, i.e., multiplier operand 74 having a value of (3,12), i.e., B 2 ⁇ 3 and B 1 ⁇ 12, and multiplicand operand 72 having a value of (6,7), i.e., A 2 ⁇ 6, and A 1 ⁇ 7, resulting in final product 76 having a value of (18,84).
  • multiplier operand 74 having a value of (3,12), i.e., B 2 ⁇ 3 and B 1 ⁇ 12
  • multiplicand operand 72 having a value of (6,7), i.e., A 2 ⁇ 6, and A 1 ⁇ 7
  • final product 76 having a value of (18,84).
  • multiplier operand 74 having a value of (3,12), i.e., B 2 ⁇ 3 and B 1 ⁇ 12
  • multiplicand operand 72 having a value of (6,7), i.e.
  • multiplier operand 74 and multiplicand operand 72 may also be varied, e.g., 32-bits or 64-bits, as may the widths of their vector components.
  • multiplier 50 may be configured to return only a portion of final product 76 per clock cycle. For example, the most significant vector component of final product 76 may be returned during a first clock cycle. Other vector components may be returned during subsequent clock cycles in order of their significance.
  • multiplier 50 further comprises multiplexer 138 .
  • multiplexer 138 When vector_in signal 120 is asserted, component-wise vector multiplication is performed. If the summing of partial products generates one or more carry bits 140 , the upper vector component in final product 144 may be corrupted if carry bits 140 are allowed to propagate across boundary 176 .
  • multiplier 50 may comprise one or more carry multiplexers 138 to prevent carry bits from propagating to higher order vector components within final product 76 .
  • multiplexers 138 may be configured to propagate carry bits normally. As shown in the figure, in this embodiment of multiplier 50 the partial products in quadrant 130 are zeroed out such that they will not affect the value of final product 144 .
  • multiplier 50 another embodiment of multiplier 50 is shown.
  • the partial products in quadrant 130 are not zeroed out. Instead, the selected partial products in quadrant 132 are allowed to sign extend across quadrant 130 .
  • final product 76 will have a lower order vector component 142 that will be negative and may result in a sign extensions across quadrant 130 . This sign extension may affect the value of the more significant vector component 144 within final product 76 .
  • Multiplexer 146 is configured to insert a constant to be summed with the selected partial products to form final product vector component 144 .
  • the constant (e.g., a binary value of one) is calculated to compensate for a negative sign extension across final product 144 .
  • a negative sign extension may be equivalent to “11111111,” thus adding a constant of one (i.e., “00000001”) will negate the effect of the sign extension on result vector component 144 .
  • an XOR-gate 148 may be used in conjunction with vector_in input 120 to control multiplexer 146 so that the constant is only added when final product 142 will be negative and a component-wise vector multiplication is being performed. As illustrated, XOR-gate 148 may receive the sign bits (i.e., the most significant bits) of vector components A 1 and B 1 as inputs.
  • Multiplier 50 may also be configured to calculate the “vector dot product” or inner product of two vectors.
  • the following example illustrates the calculation of a vector dot product. Assuming vector A equals (x1, x2, x3), and vector B equals (y1, y2, y3), then the vector dot product A•B equals x1y1+x2y2+x3y3.
  • calculation of the dot product entails performing a component-wise vector multiplication and then summing the vector component products.
  • multiplier 50 configured to calculate the vector dot product is shown.
  • partial products 190 are summed within adder 64 to form vector component products 192 A-N.
  • Each vector component product 192 A-N corresponds to one vector pair within multiplicand operand 72 and multiplier operand 74 as previously disclosed.
  • Vector component products 192 A-N are then summed using a plurality of carry-propagate adders 194 A-N to form final result 196 , which may then be output for use by other parts of microprocessor 10 .
  • each vector component product 192 A-F is represented by more than one value.
  • each vector component product 192 A-F may be represented by two values, a sum value 198 A-F and a carry value 200 A-F.
  • a set of carry-save adders may be used within adder 64 to sum partial products 192 in redundant form.
  • carry-save adders may significantly reduce the amount of time and die space required to sum partial products 192 .
  • a carry-save adder will take three bits of the same significance and produce a sum value (having the same significance) and a carry value (having a significance one bit higher than the sum value).
  • the term “carry-propagate adder” denotes an adder that is not a carry-save adder.
  • a carry-save adder may be implemented as a number of independent full adders.
  • vector component products 192 A- 192 F may be summed together using a second set of carry-save adders 202 A-J.
  • a carry-propagate adder 204 may be used to perform the final summation. Note, however, that this configuration may require further modification if multiplier 50 is configured to propagate sign extension and carry bits as illustrated in FIG. 14. The embodiment of multiplier 50 illustrated in FIG. 14 relies upon carries from less significant products propagating into the more significant ones.
  • summing partial products 190 and products 192 A-F using carry-save adders may cause final result 196 to be less than the correct result by one unit-in-the-last-place (ULP) for each product below the most significant product. This is because carries from lower products are not incorporated into upper products during carry-save adds.
  • ULP unit-in-the-last-place
  • carry-propagate adder 204 may be configured to accept summands having a width equal to the cumulative width of all products 192 A-F. Assuming the length of each operand (multiplier and multiplicand) is n bits wide and comprises p vector components, each product 192 A-F will have a width of 2n/p. Thus to accommodate all products 192 A- 192 F, adder 204 may be 2n bits wide or wider.
  • each product 192 - 192 F e.g., sum values 198 A-F and carry values 200 A-F
  • adder 204 excluding the most significant product 192 F
  • the final two summands remaining from the carry-save summation of products 192 A- 192 F are input to adder 204 as the most significant inputs.
  • adder 204 will output a 2n-bit wide result, only the most significant 2n/p bits comprise the final result 196 .
  • This configuration advantageously allows adder 204 to propagate carry bits from lower order products to higher order products, thereby ensuring a proper result while still retaining the advantages associated with carry-save addition.
  • the cost in die space of having a 2n-bit wide carry-propagate adder such as adder 204 may be reduced if other functions to performed by multiplier 50 also require a wide carry-propagate adder.
  • multiplier 50 may be configured to accept operands having varying widths (n), and varying numbers of vector components (p).
  • multiplier 50 may be configured to calculate the dot product of two vector operands, each 64-bits wide and each having four vector components.
  • multiplier 50 may be configured to conserve hardware resources (e.g., signal lines and registers) by returning only a portion of the final product (or products, in the case of component-wise vector multiplication) per clock cycle. For example, the higher order bits of the final product may be returned first, and then the lower order bits may be returned in subsequent clock cycles. However, in some embodiments it may be advantageous to return the higher order bits rounded to the nearest unit in the last place (“ULP”).
  • ULP nearest unit in the last place
  • FIG. 17 a diagram of another embodiment of multiplier 50 is shown.
  • This embodiment is configured to round the higher order bits of each vector component product to the nearest ULP.
  • partial products 190 are reduced in redundant form (e.g., a sum value and a carry value for each pairs of vector components) by adder 64 .
  • a plurality of adders 210 A- 210 F. are used to add a rounding constant 214 to each vector component product.
  • Rounding constant 214 may comprise a single asserted bit (i.e., a “one-hot”) added to the bit position below the least significant bit position in the portion of the vector component to be rounded.
  • each adder 210 A- 210 F is configured to receive the redundant form of a single vector component product.
  • adder 210 A is configured to receive sum value 198 A and carry value 200 A and combine them with rounding constant 214 .
  • Adder 210 A combines these three values and generates a redundant form output comprising a new sum value and a new carry value.
  • adders 210 A- 210 F may be configured as independent carry-save adders, thereby preventing carry-bits caused by rounding constant 214 from propagating to more significant vector component products.
  • each adder 210 A- 210 F are coupled to the inputs of one of a plurality of carry-propagate adders 212 A- 212 F.
  • Each carry-propagate adder 212 A- 212 F is configured to sum the outputs of adders 210 A- 210 F and thereby generate a non-redundant form of each vector component product.
  • the rounded MSBs of each vector product may be output first, while the remaining least significant bits (“LSBs”) may be output during a subsequent clock cycle.
  • Adders 212 A- 212 F may be configured independently to avoid the possibility of an unwanted carry-bit propagating across vector product boundaries.
  • additional adders may be configured to generate the LSBs (which are unrounded) separately from the MSBs.
  • adder 212 A may be configured to generate the rounded MSBs by summing the sum and carry values generated by adder 210 A, while an additional adder may be configured to sum the lower bits of sum value 198 A and carry value 200 A to generate the LSBs.
  • each adder 210 A- 210 F and 212 A- 212 F is configured to perform addition without propagating carry bits from one vector component product to another. While this may be desirable in many configurations, the non-propagation of carry bits may disrupt some configurations of adder 50 .
  • the embodiment illustrated in FIG. 14 relies upon the propagation of sign extension bits across vector component product boundaries. If carry bits are not allowed to propagate during the final addition stages which convert the redundant-from vector component products to non-redundant-form, the higher order products may be incorrect.
  • multiplier 50 which rounds the higher order bits of each vector component product, yet still allows carry bits to propagate across consecutive vector component product boundaries.
  • rounding constant 214 is once again added to the redundant form sum values 198 A- 198 F and carry values 200 A- 200 F of each vector component product by carry-save adders 210 A- 210 F.
  • carry-propagate adders 212 A- 212 F are used for each vector component product.
  • each adder 212 A- 212 F may equal the number of bits in the vector component product itself plus all of the bits corresponding to less significant vector component products. For example, assuming each vector component product is eight bits wide, adder 212 B may be 16 bits wide and may add redundant vector component values 198 A- 198 C and 200 A- 200 C.
  • adder 212 B may be 16 bits wide and may add redundant vector component values 198 A- 198 C and 200 A- 200 C.
  • undesired carry-out bits from each vector component product will not affect higher order vector component products in this configuration.
  • the carry bits that may be required for correct operation of the embodiment of multiplier 50 illustrated in FIG. 14 still propagate to form the correct result despite possible sign-extensions.
  • multiplier 50 may be configured to round and return the upper portions of scalar products and vector dot products in addition to vector component products.
  • the types of adders used may also be changed according to the implementation, e.g., carry-propagate adders may be used through out in conjunction with multiplexers configured to prevent carry bits from propagating across vector component product boundaries.
  • various control signals e.g., a round_in signal, may be used to indicate whether rounding is to be performed.
  • multiplier and multiplicand operands are received in normalized form.
  • a binary number is said to be normalized when the most significant asserted bit is directly to the left of the binary radix point. For example, 1.010011 2 is normalized, while 10.10011 2 and 0.01010011 2 are not.
  • the number is shifted either right or left until the most significant asserted bit is directly to the left of the binary radix point. The number's exponent is then increased or decreased an amount equal to the number of positions that the number was shifted.
  • multiplier 50 When multiplier 50 performs floating point multiplication, it receives two normalized significands.
  • multiplier 64 may be configured to output the results in normalized form.
  • multiplier 50 may receive two 32-bit normalized significands as operands and be configured to output one 32-bit result in normalized form. After multiplier 50 generates and selects the partial products, they are summed by adder 64 to create the final result. As the final result may be in redundant form, it may be passed through a carry-propagate adder as previously described. Once in non-redundant form, the result is rounded and normalized before being output. Different methods of rounding are possible.
  • IEEE Standard 754 defines four different rounding methods: round to nearest (even), round to positive infinity, round to minus infinity, and round to zero.
  • the round to nearest method is particularly useful because it ensures that the error in the final product is at most one-half ULP (unit in the last place).
  • multiplier 50 comprises two “paths” which are configured to perform IEEE rounding and normalization by calculating two results in parallel, i.e., one result assuming there is an overflow and one result assume no overflow.
  • This embodiment comprises a pair of carry-save adders 276 A-B, a pair of carry-propagate adders 278 A-B, a pair of sticky bit logic units 286 A-B, and a pair of LSB fix-up logic units 288 A-B.
  • the “no-overflow path” comprises carry-save adder 276 A, carry-propagate adder 278 A, sticky bit logic unit 286 A, and LSB fix-up logic unit 288 A
  • the “overflow path” comprises carry-save adder 276 B, carry-propagate adder 278 B, sticky bit logic unit 286 B, and LSB fix-up logic unit 288 B.
  • Both carry-save adders 276 A and 276 B are configured to receive sum value 274 A and carry value 274 B from partial product array adder 64 .
  • Each carry-save adder 276 A and 276 B is also configured to receive a rounding constant 268 from multiplexer 266 .
  • Multiplexer 266 is configured to select rounding constant 268 from one of four rounding constants.
  • the first rounding constant is a hard-wired constant one and is selected when rounding mode input 270 indicates that round to nearest (even) is the selected rounding mode.
  • the constant is added to the guard bit position by both carry save adders 276 A and 276 B.
  • the second rounding constant is a hard-wired zero and is selected when rounding mode input 270 indicates that round to zero (truncate) is the selected rounding mode.
  • the third rounding constant is the sign of the final product of the multiplication being performed.
  • This sign may be obtained by exclusively ORing the sign bit 260 A of multiplicand operand 72 and the sign bit 260 B of multiplier operand 74 within XOR gate 262 .
  • the resulting sign bit is added to the guard bit position, and each bit position less significant than the guard bit position, by carry-save adders 276 A and 276 B.
  • the fourth rounding constant is the inversion of the third rounding constant. It may obtained by inverting the rounding constant obtained from XOR gate 262 with inverter 264 .
  • the resulting inverted sign bit is added to the guard bit position and each bit position less significant than the guard bit position by carry-save adders 276 A and 276 B.
  • Carry-save adders 276 A and 276 B are configured to receive and add sum value 274 A, carry value 274 B, and the selected rounding constant from multiplexer 266 .
  • Carry-save adders 276 A and 276 B convey their results in redundant form to carry-propagate adders 278 A and 278 B, respectively.
  • Carry-propagate adders 278 A and 278 B reduce the results to non-redundant form 282 A and 282 B and convey them to LSB fix-up logic units 288 A and 288 B, respectively.
  • sticky bit logic units 280 A-B calculate sticky bits 286 A-B.
  • Sticky bit logic units 280 A-B each receive sum value 274 A and carry value 274 B as inputs. The calculation of sticky bits and the operation of sticky bit logic units 280 A-B are described in greater detail below.
  • LSB fix-up logic units 288 A and 288 B are coupled to carry-propagate adders 278 A-B and sticky bit logic units 280 A-B.
  • Fix-up logic units 288 A-B are configured to conditionally invert the least significant bit of the non-redundant results received from adders 278 A-B.
  • L and G may be calculated within fix-up units 288 A-B using sum value 274 A and carry value 274 .
  • the calculation of L and G may be performed in parallel with the additions performed by adders 276 A-B and 278 A-B and need not include a rounding constant.
  • L and G may be calculated within fix-up units 288 A-B, or by using an extra component within multiplier 50 (e.g., a third pair of carry-save/carry-propagate adders).
  • the fix-up may advantageously compensate for cases in which adders 276 A-B have added a constant when a constant was not actually needed (e.g., result+1 is generated when result+0 is needed).
  • the desired number of upper bits from the outputs of LSB fix-up logic units 288 A and 288 B may be conveyed to multiplexer 290 , which selects one of the two values (overflow or no overflow) as output 292 .
  • Multiplexer 290 may be controlled by MSB 284 from the output of fix-up logic unit 288 A. By looking at the most significant bit, a determination of whether an overflow occurred can be made. If an overflow occurred, the upper bits from the output of LSB fix-up logic unit 288 A are selected. If an overflow did not occur, the upper bits from the output of LSB fix-up logic unit 288 B are selected.
  • MSB 284 may be the most significant bit of the output from fix-up logic unit 288 B.
  • multiplier 50 only one fix-up logic unit may be needed.
  • the single fix-up logic unit may be coupled to the output of multiplexer 290 and perform the fix-up before final result 292 is output.
  • exponent control logic unit 254 is also controlled by the same signal that controls multiplexer 290 . If an overflow occurs, exponent control logic unit 254 is configured to increment the corresponding exponent. This completes the normalization of the output.
  • the embodiment of multiplier 50 depicted in the figure may be able to round and normalize the final result in less time because normalization is performed in parallel. Furthermore, the fix-up is performed while multiplexer 290 is selecting a result (overflow or no overflow). This may further reduce the cycle time of this embodiment of multiplier 50 .
  • FIG. 20 a diagram illustrating the operation of one embodiment of carry-save adders 276 A and 276 B is shown.
  • the example assumes eight bit sum and carry values 274 A-B are being rounded to four bit values and that round to nearest (even) is being performed.
  • Adders 276 A-B each receive sum value 274 A, carry value 274 B, and rounding constant 268 as inputs.
  • adder 276 A is configured to add a constant one to the guard bit position of sum value 274 A and constant value 274 B assuming there will not be an overflow.
  • the guard bit position is the bit position that is one bit less significant than the least significant bit of the portion to be output.
  • An overflow occurs when the summation of sum value 274 A, carry value 274 B, and any added rounding constants, creates a carry out from the bit position directly to the left of the binary radix point.
  • An overflow may require the result to be shifted to the right (and the corresponding exponent to be incremented) in order to produce a normalized output.
  • adder 276 A adds a constant one to the guard bit position of sum value 274 A and carry value 274 B assuming there will be no overflow.
  • adder 276 B adds rounding constant 268 to the guard bit position of sum value 274 A and carry value 274 B assuming there is an overflow.
  • adder 286 B adds the constant one in a different bit position than adder 276 A. For this reason, adders 276 A and 276 B each generate a different result.
  • the results from adder 276 A are conveyed to carry propagate adder 278 A, which is configured to reduce them to non-redundant form.
  • the results from adder 276 B are conveyed to carry propagate adder 278 B, which operates in manner similar to adder 278 A.
  • sticky bit logic unit 280 A receives the lower four bits of the sum and carry values ( 350 and 352 , respectively) generated by adder 276 A.
  • a constant 354 e.g., 1111
  • the output from exclusive NOR gate 342 A is routed to 4-input OR gate 344 A, which outputs sticky bit 286 A.
  • Sticky bit logic 280 B is configured similarly to sticky bit logic 280 A, but it may be configured to receive one extra bit, e.g., five bits as opposed to four bits, due to the assumed overflow.
  • FIG. 22 a numerical example of the operation of the embodiment of multiplier 50 from FIG. 20 is shown.
  • This example assumes an eight bit output from adder 64 is being rounded to a four bit result.
  • the figure shows each of the four IEEE rounding modes being performed by both carry-save adders 276 A and 276 B.
  • the selected rounding constant 268 corresponds to the rounding mode.
  • the selected rounding constant 268 is added to sum value 274 A and carry value 274 B by carry save adders 276 A and 276 B.
  • the starting bit position to which the constant is added varies from adder 276 A to adder 276 B.
  • sticky bit logic units 280 A and 280 B each calculate their own version of the sticky bit ( 286 A and 286 B, respectively), also reflecting whether or not an overflow is presumed to occur.
  • LSB fix-up logic units 288 A and 288 B fix-up (invert) the LSB of output 282 A, if necessary.
  • the LSB and Guard bit are taken from the sum of sum value 274 A and carry value 274 B without selected rounding constant 268 .
  • the upper four bits are output to multiplexer 290 .
  • LSB fix-up logic 288 A and 288 B may each comprise a single inverter configured to invert the least significant bit of results 282 A and 282 B, respectively.
  • multiplier 50 Other configurations of multiplier 50 are possible and contemplated.
  • FIG. 23 another embodiment of multiplier 50 configured to perform rounding and normalization is shown.
  • the “fix-up” or inversion of the LSB is performed by a single LSB fix-up logic unit 288 after multiplexer 290 performs the overflow/no overflow selection.
  • a second multiplexer 290 B is included to select which sticky bit 286 A or 286 B will be used by LSB fix-up logic unit 288 in determining whether to perform the inversion.
  • the rounding and normalization hardware disclosed herein may be configured to round and normalize redundant results from other functional units also, e.g., adders.
  • microprocessor 10 already contains a highly optimized multiplier 50 , it would be advantageous to perform other calculations on multiplier 50 as well, e.g., division.
  • This may be accomplished by recasting division operations into reciprocal operations followed by multiplication operations.
  • the operation “A divided by B” (A/B) may be recast into “A multiplied by the reciprocal of B” (A ⁇ B ⁇ 1 ).
  • Forming the reciprocal of B may also be recast into a series of multiplication operations by using a version of the Newton-Raphson iteration.
  • the initial estimate, X 0 may be determined in a number of different ways.
  • X 0 may be read from a ROM table using B as the index, wherein X 0 approximates 1/B.
  • X 0 may be calculated directly from B or from one or more ROM tables configured to output seed values.
  • the seed values may be manipulated, e.g., using arithmetic and combinational logic, to determine X 0 .
  • the first iteration may be performed. Thereafter, the results from each iteration are used in place of X 0 in subsequent iterations. This forces X n+1 to converge on 1/B in a quadratic fashion.
  • FIG. 24 a flowchart depicting one embodiment of a method to calculate the reciprocal using multiplier 50 is shown.
  • X 0 is calculated first (step 700 ). Once X 0 is determined, it is multiplied by B (step 702 ). The results are then routed down two parallel paths 706 and 708 , one that assumes an overflow took place in the multiplication (path 706 ), and another that assumes no overflow occurred (path 708 ). Because X 0 is close to 1/B, the product of X 0 and B will be close to one, i.e., either slightly over one or slightly under one.
  • an overflow will only occur during the multiplication if X 0 is slightly greater than one (i.e., of the form 10.000 . . . with an exponent equal to 2 ⁇ 1 ). If there is no overflow, the result will be slightly less than one (i.e., in the form 01.111 . . . with an effective exponent equal to 2 ⁇ 1 ).
  • the term (2 ⁇ X 0 ⁇ B) is formed within each path by inverting the (X 0 ⁇ B) results. Since (X 0 ⁇ B) is close to one, (2 ⁇ X 0 ⁇ B) may be approximated by the absolute value of the two's complement of (X 0 ⁇ B). To further speed the calculation, the one's complement may be used because it only differs by a one in the least significant digit.
  • the approximations for (2 ⁇ X 0 ⁇ B) are performed in parallel within each path (steps 710 and 712 ). Specifically, in overflow path 706 , the bits are inverted to get 01.111 . . . (with an effective exponent equaling 2 ⁇ 1 ). In non-overflow path 708 , the bits are inverted to get 10.000 . . . (with an effective exponent equaling 2 ⁇ 1 ). Note that the sign bit of each intermediate value may also be forced to zero (positive).
  • either the overflow path result or the non-overflow path result is selected (step 714 ). This selection can be performed by examining the result from the path that assumes no overflow occurred. If the most significant bit of this result is a one, then an overflow occurred within the non-overflow path, and the result from the overflow path should be selected as the proper result. The corresponding sign and exponent bits are also selected along with the result. Note that different bits may be selected from each path. This is illustrated by the following example. Assuming the product from the multiplier is 64 bits wide, then the bits may be numbered from 0 (the least significant bit) to 63 (the overflow bit), with the binary radix point located between the most significant bit 62 and the most significant fractional bit 61 .
  • bits 62 through 0 are selected with the radix point positioned between bits 62 and 61 . If an overflow has not occurred, bits 63 though 0 are selected with the radix point positioned between bits 63 and 62 .
  • bits 10.000 . . . may be selected as 1.0000 . . . (along with a hardwired exponent equaling 2 0 ).
  • this configuration may save time by normalizing the inverted bits without requiring a dedicated normalization step. Note that other configurations and other widths are contemplated.
  • all the bits from the selected path need not be used. In some embodiments fewer bits may be selected, and in other embodiments extra bits may be padded with constants to meet a desired length.
  • the iteration may be performed any number of times (e.g., one, two, or five times). Using two paths may advantageously eliminate the need for normalization because the exponent and sign bits can be hard-wired based upon the known limits of the incoming operands and whether or not an overflow occurs.
  • multiplier 50 may be configured to calculate the reciprocal square root of an operand B using a modified version of the Newton-Raphson iteration.
  • the equation Y n+1 Y n ⁇ (3 ⁇ B ⁇ Y n 2 )/2 may be used to calculate the reciprocal square root of B.
  • the initial estimate, Y 0 may be determined in a number of ways, e.g., by using initial estimate generators that perform calculations on seed values read from ROM tables using B. In this iteration Y 0 approximately equals 1/ ⁇ square root ⁇ B. Each subsequent iteration of the equation forces Y n+1 to converges on 1/ ⁇ square root ⁇ B in a quadratic fashion.
  • both Y 0 and Y 0 2 may be produced using the same initial estimate generator that was used for the reciprocal calculation described above. This may be desirable because determining Y 0 2 may eliminate the need for a multiplication operation to form Y 0 2 from Y 0 .
  • an initial estimate generator refers to any hardware capable of generating an initial value such as X 0 or Y 0 , e.g., one or more ROM tables configured to output seed values that may be used to calculate the initial value using arithmetic and combinational logic.
  • FIG. 25 a flowchart depicting one embodiment of a method to calculate the reciprocal square root using multiplier 50 is shown.
  • Y 0 2 and Y 0 are determined first (step 730 ). Once Y 0 2 is determined, it is multiplied by B to form the term (B x Y 0 2 ) (step 732 ). The results are then routed down two parallel paths 734 and 736 , one that assumes an overflow took place in the multiplication (path 736 ), and another that assumes no overflow occurred (path 734 ). Because Y 0 2 is close to 1/B, the product of Y 0 2 and B will be close to one, i.e., either slightly over or slightly under one.
  • the overflow path 736 forms the one's complement by inverting the result (B ⁇ Y 0 2 ) (step 740 ).
  • the resulting value has the form 01.111 . . . with an effective exponent of 2 ⁇ 1 and approximates (2 ⁇ B ⁇ Y 0 2 ).
  • a one is effectively added to the result to form 1.111 . . . with an exponent of 2 0 (step 744 ).
  • This value is then be right shifted one bit to reflect the division by two in the term (3 ⁇ B ⁇ Y 0 2 )/2 (step 748 ). This results in a value having the form 1.111 . . . with an exponent of 2 ⁇ 1 (step 748 ).
  • the non-overflow path 734 also forms the one's complement by inverting the result (B ⁇ Y 0 2 ) (step 738 ).
  • the resulting value has the form 10.000 . . . with an effective exponent of 2 ⁇ 1 .
  • This form is normalized to 1.000 . . . with an exponent of 2 0 , which approximates (2 ⁇ B ⁇ Y 0 2 ).
  • a one is effectively added to the result to form 10.000 . . . (step 742 ).
  • This value is then be shifted right one bit to reflect the division by two in the term (3 ⁇ B ⁇ Y 0 2 )/2 (step 746 ).
  • the result has the form 1.000 . . . In this path, the result's exponent is forced to 2 0 (step 746 ).
  • either the overflow path result or the non-overflow path result is selected (step 750 ).
  • This selection can be performed as previously disclosed, i.e., based upon the value of the most significant bit of the result from each path. Different bits may be selected from each path to eliminate the need for normalization.
  • the iterative calculation may be performed any number of times (e.g., one, two, or five times).
  • using two paths may eliminate the need for normalization because the exponent and sign bits may be hard coded based upon the known limits of the incoming operands and whether or not an overflow occurs.
  • multiplier 50 configured to evaluate constant powers of an operand.
  • This embodiment may be configured to evaluate one or more constant powers of an operand such as ⁇ 1 (reciprocal), ⁇ 1/2, (reciprocal square root), and 1/2 (square root).
  • this embodiment of multiplier 50 comprises a non-overflow logic unit 770 A, an overflow logic unit 770 B, an initial estimate generator (IEG) 774 , two multiplexers 776 and 780 , and a control logic 778 .
  • IEG initial estimate generator
  • non-overflow logic unit 770 A and overflow logic unit 770 B may also be referred to herein as a “non-overflow path,” and “overflow path,” respectively.
  • Initial estimate generator 774 is coupled to receive multiplier operand 74 and communicate initial estimates, e.g., X 0 and Y 0 , to multiplexer 776 .
  • initial estimates e.g., X 0 and Y 0
  • X 0 Y 0 2 ⁇ B
  • Y 0 1/ ⁇ square root ⁇ B.
  • Multiplexer 776 is configured to select the first multiplication operand from either multiplicand operand 72 or the initial estimate output by initial estimate generator 774 .
  • multiplexer 780 is configured to select the second operand to be multiplied from either multiplier operand 74 or result 292 from multiplexer 290 .
  • Control logic 778 receives control signal 772 and controls multiplexers 776 and 780 , exponent control logic 254 , and logic units 770 A-B.
  • Non-overflow logic 770 A is coupled to receive values from LSB fix-up logic 288 A and output values to multiplexer 290 .
  • overflow logic 770 B is coupled to receive values from LSB fix-up logic 288 B and also output values to multiplexer 290 .
  • Logic units 770 A-B are controlled by control logic unit 778 , which indicates which, if any, constant power operation is being performed. If a constant power operation is not being performed, logic units 770 A-B may be configured to simply allow values from fix-up logic units 288 A-B to propagate through to multiplexer 290 unchanged.
  • logic units 770 A-B are configured to form approximations by inverting selected bits from the values received from fix-up logic units 770 A-B.
  • Logic units 770 A-B are also configured to force (e.g., hard-wire) the exponents associated with the values received from fix-up logic units 288 A-B to fixed values. These fixed exponents are communicated to exponent control logic 254 . Alternatively, exponent control logic 254 may force the exponents to fixed constants when instructed to do so by logic units 770 A-B.
  • Logic units 770 A-B may each comprise a plurality of inverters, a wire shifter, and one or more hard-wired constants.
  • a wire shifter is a plurality of signal, data, or control lines that are selectively connected and or offset to provide fixed shifting and routing. The following examples illustrate the operation of logic units 770 A-B and multiplier 50 in more detail.
  • multiplier 50 receives the operand to be inverted (referred to herein as “operand B”) as multiplier operand 74 .
  • operand B the operand to be inverted
  • multiplexer 780 is configured to select operand B.
  • Initial estimate generator 774 also receives operand B and in response outputs an initial estimate or approximation of the reciprocal (referred to as X 0 ) to multiplexer 776 .
  • Multiplexer 776 is configured to select, based upon control signals from control logic 778 , the initial estimate, which is then multiplied by operand B to form the quantity (X 0 ⁇ B).
  • the quantity (X 0 ⁇ B) propagates through multiplier 50 until it reaches logic units 770 A- 770 B.
  • Non-overflow logic unit 770 A receives a version from fix-up logic 288 A that assumes no overflow has occurred. Based upon control signal 772 , non-overflow logic unit 770 A inverts the version of (X 0 ⁇ B) it receives to approximate the quantity (2 ⁇ X 0 ⁇ B). Non-overflow logic unit 770 A may be configured to normalize its output by forcing the corresponding exponent to a constant, e.g., 2 0 . Note all references herein are to unbiased exponents. For example, an unbiased exponent 2 0 may translate to a biased exponent of 2 7F (assuming a +7F 16 or +127 10 bias).
  • overflow logic unit 770 B receives a version from fix-up logic 288 B that assumes an overflow has occurred and inverts it. Overflow logic unit 770 B may also be configured to normalize its output by forcing the corresponding exponent to a constant, e.g., 2 ⁇ 1 . Note that in some embodiments, not all bits from fix-up logic units 288 A-B may be used or inverted by logic units 770 A-B.
  • multiplexer 290 is configured to select one of the approximations based upon the value of MSB 284 from the output of fix-up logic unit 288 A. As previously noted, by looking at the most significant bit a determination of whether an overflow occurred can be made. If an overflow occurred, the approximation for the quantity (2 ⁇ X 0 ⁇ B) from logic unit 770 B (the overflow path) is selected. If an overflow did not occur, the approximation for the quantity (2 ⁇ X 0 ⁇ B) from logic unit 770 A (the non-overflow path) is selected. Note that other control configurations are possible, e.g., MSB 284 may be the most significant bit of the output from fix-up logic unit 288 B.
  • multiplexer 290 Once the appropriate approximation for the quantity (2 ⁇ X 0 ⁇ B) has been selected by multiplexer 290 , it is routed to multiplexers 776 and 780 . Multiplexer 780 is directed by control logic 778 to select the approximation so that it may be multiplied by initial estimate X 0 to form the quantity X 0 ⁇ (2 ⁇ X 0 ⁇ B). During this multiplication, however, logic units 770 A-B are configured to allow the values from fix-up logic units 288 A-B to pass through unchanged. The result selected by multiplexer 290 is the approximation of the reciprocal of operand B after one Newton-Raphson iteration. As previously noted, the process may be repeated a number of times to achieve greater accuracy.
  • multiplier 50 When a reciprocal square root operation is performed, multiplier 50 operates in much the same fashion as previously described for a reciprocal operation.
  • the operand to be raised to the ⁇ 1/2 power (referred to herein as “operand B”) is received as multiplier operand 74 .
  • multiplexer 780 is configured to select operand B.
  • Initial estimate generator 774 also receives operand B and in response outputs an initial estimate or approximation of the reciprocal (referred to as Y 0 2 , which equals X 0 ) and the reciprocal square root (referred to as Y 0 ) to multiplexer 776 .
  • Multiplexer 776 is configured to select, based upon control signals from control logic 778 , the initial estimate Y 0 2 , which is then multiplied by operand B to form the quantity (Y 0 2 ⁇ B).
  • the quantity (Y 0 2 ⁇ B) propagates through multiplier 50 until it reaches logic units 770 A- 770 B.
  • Non-overflow logic unit 770 A receives a version from fix-up logic 288 A that assumes no overflow has occurred. Based upon control signal 772 , non-overflow logic unit 770 A inverts the version of quantity (Y 0 2 ⁇ B) it receives to approximate the quantity (2 ⁇ Y 0 2 ⁇ B).
  • Logic unit 770 A also pads the most significant bit of the quantity (2 ⁇ Y 0 2 ⁇ B) with a constant one to approximate the quantity (3 ⁇ Y 0 2 ⁇ B). Logic unit 770 A may then normalize the quantity (3 ⁇ Y 0 2 ⁇ B) by selectively routing (e.g., wire-shifting) bits to multiplexer 290 in a particular position or offset and by forcing the corresponding exponent to 2 0.
  • Overflow logic unit 770 B may be similarly configured to invert the version of quantity (Y 0 2 ⁇ B) it receives to approximate the quantity (2 ⁇ Y 0 2 ⁇ B). Logic unit 770 B also pads the most significant bit of the quantity (2 ⁇ Y 0 2 ⁇ B) with a constant one to approximate the quantity (3 ⁇ Y 0 2 ⁇ B). Logic unit 770 B may then normalize the quantity (3 ⁇ Y 0 2 ⁇ B) by selectively routing bits to multiplexer 290 in a particular position or offset and by forcing the corresponding exponent to 2 ⁇ 1 . Note that in some embodiments, not all bits from fix-up logic units 288 A-B may be used or inverted by logic units 770 A-B.
  • multiplexer 290 is configured to select one of the approximations based upon the value of MSB 284 from the output of fix-up logic unit 288 A. As previously noted, by looking at the most significant bit a determination of whether an overflow occurred can be made. If an overflow occurred, the approximation for the quantity (3 ⁇ Y 0 2 ⁇ B) from logic unit 770 B (the overflow path) is selected. If an overflow did not occur, the approximation for the quantity (3 ⁇ Y 0 2 ⁇ B) from logic unit 770 A (the non-overflow path) is selected. Other control configurations are also possible.
  • multiplexer 780 is directed by control logic 778 to select the approximation so that it may be multiplied by initial estimate Y 0 that was read from initial estimate generator 744 to form the quantity Y 0 ⁇ (3 ⁇ Y 0 2 ⁇ B).
  • logic units 770 A-B are configured to allow the values from fix-up logic units 288 A-B to pass through unchanged.
  • the result selected by multiplexer 290 is the approximation of the reciprocal square root of operand B after one Newton-Raphson iteration.
  • the process may be repeated a number of times to achieve greater accuracy. However, in subsequent iterations the result may be squared to form Y n 2 , which is then used in place of the initial estimate Y 0 2 from initial estimate generator 774 .
  • non-overflow logic 770 A and overflow logic 770 B may instead be configured to receive rounded and normalized value 292 from multiplexer 290 , in which case a separate multiplexer (not shown) may be needed to select between the values output by non-overflow logic 770 A and overflow logic 770 B.
  • registers may be used to store various intermediate results, e.g., the inputs to multiplexers 776 and 780 , and the results from multiplexer 290 . The registers may the store the intermediate results for use during subsequent clock cycles.
  • non-overflow logic unit 770 A configured to calculate the quantity (3 ⁇ Y 0 2 ⁇ B) for the reciprocal square root calculation are shown.
  • inverters 792 A invert selected bits to approximate the quantity (2 ⁇ Y 0 2 ⁇ B). Note that an inverter may not be required for all bits, e.g., the most and least significant bits.
  • Constants 794 A are then used to replace the most significant bit of the quantity (2 ⁇ Y 0 2 ⁇ B) 790 A to approximate the quantity (3 ⁇ Y 0 2 ⁇ B), which is output to multiplexer 290 .
  • a constant or control signal may be routed to exponent control logic 254 to force the corresponding exponent to 2 0 .
  • a numerical example further illustrates the operation of non-overflow logic unit 770 A.
  • the value 1.111 . . . ⁇ 2 ⁇ 1 is received from LSB fix-up logic 228 A as an approximation of the quantity (Y 0 2 ⁇ B) 790 A.
  • inverters 792 A invert the quantity to generate 10.000 . . . ⁇ 2 ⁇ 1 as an approximation of the quantity (2 ⁇ Y 0 2 ⁇ B).
  • constants 794 A are used to replace the most significant bits. The results are shifted, resulting in the quantity 1.00000 . . . , and the corresponding exponent is forced to 2 0 . Note that the most and least significant bits of the quantity (Y 0 2 ⁇ B) 790 A may not be incorporated into the quantity (2 ⁇ Y 0 2 ⁇ B).
  • Overflow logic 770 B operates in a similar fashion. However, the most significant bit of quantity 790 B is replaced with only a single constant 794 B, and bits 30 through 0 are incorporated into the quantity (2 ⁇ Y 0 2 ⁇ B).
  • a numerical example further illustrates the operation of overflow logic unit 770 B.
  • the value 10.000 . . . ⁇ 2 ⁇ 1 is received from LSB fix-up logic 228 B as an approximation of the quantity (Y 0 2 ⁇ B) 790 B.
  • inverters 792 B invert the quantity to generate 01.111 . . . ⁇ 2 ⁇ 1 as an approximation of the quantity (2 ⁇ Y 0 2 ⁇ B).
  • constant 794 B is used to replace the most significant bit. The results are shifted, resulting in the quantity 1.1111 . . . , and the corresponding exponent is forced to 2 ⁇ 1 .
  • non-overflow logic 770 A and overflow logic 770 B are shown.
  • the embodiments shown in the figure are configured to return the quantity (2 ⁇ Y 0 2 ⁇ B) for the reciprocal calculation.
  • a numerical example illustrates the operation of non-overflow logic 770 A and overflow logic 770 B.
  • non-overflow logic 770 A receives a value 1.111 . . . ⁇ 2 ⁇ 1 as the quantity (Y 0 2 ⁇ B) 790 A
  • inverters 792 A are used to invert the bits (excluding the least significant bit) to obtain a value 0.000 . . .
  • Constant 796 A is then used to pad the most significant bit position.
  • the remaining bits are all shifted one position, with the result 1.0000 . . . being output to multiplexer 290 .
  • the corresponding exponent is forced to 2 0 by signal 796 A.
  • Overflow path 770 B operates in a similar fashion. For example, assuming value 790 B is 1.000 . . . ⁇ 2 ⁇ 1 , then inverters 792 B generate the value 0.111 . . . which is shifted and output to multiplexer 290 as the value 1.11 . . . Note the least significant bit may be padded with a constant 794 B, e.g., zero, while the corresponding exponent is forced to 2 ⁇ 1 by signal 796 B.
  • non-overflow logic 770 A and overflow logic 770 B and multiplier 50 are also contemplated.
  • the least significant bit from quantity 790 B may be duplicated instead of using constant 794 B.
  • Other constant values may also be used, and the widths of quantities 770 A-B may be reduced before they are routed to multiplexer 290 (e.g., from 32 bits to 24 bits).
  • Other logic components may be used in place of inverters 792 A-B, and the bit routing structure disclosed above may be replaced by other logic components, e.g., a shifter.
  • non-overflow logic 770 A and overflow logic 770 B may be provided in other components internal or external to multiplier 50 .
  • multiplier 50 may be configured to perform both reciprocal and reciprocal square root functions, e.g., by incorporating two versions of non-overflow logic 770 A and overflow logic 770 B, or by incorporating multiplexers within non-overflow logic 770 A and overflow logic 770 B to select which routing of bits and constants should be applied.
  • multiplier 50 calculates intermediate products which may be stored in registers. During the next iteration, the intermediate product may be read from the register and used as an operand.
  • each iteration may introduce rounding errors that accumulate in the final result. For example, assuming an N-bit significand, the results from each multiplication have significands that are 2N bits wide. This result may be rounded to N-bits or some other width. The greater the number of iterations, the larger the potential rounding error may be in the final result. For obvious reasons, it is desirable to reduce the magnitude of this rounding error.
  • the intermediate product is first calculated to N extra significant bits (step 600 ), wherein N is a predetermined constant. For example, assuming multiplier 50 receives 24-bit operands, multiplier 50 may calculate intermediate products to a precision of 28 bits. In this case, N equals 4 bits.
  • the next-to-most significant bit is examined (step 602 ). The value of the next-to-most significant bit determines the value of a signaling bit. If the next-to-most significant bit equals one (step 604 ), then the signaling bit equals one also.
  • the signaling bit is used to replace a portion of the intermediate product, thereby compressing the intermediate product (step 608 ).
  • the portion replaced by the signaling bit is N+1 bits wide. While this method assumes that the portion being replaced comprises entirely one's or zero's, this may be a safe assumption when certain types of iterations are being performed. For example, when performing the Newton-Raphson iterations previously disclosed for calculating the square root and inverse square root, the products (2 ⁇ B ⁇ X n ) and (3 ⁇ B ⁇ Y n 2 ) are formed during each iteration.
  • the maximum number of bits that may be compressed in a particular implementation may be determined by examining the number of bits that have the same values over the entire range of possible operand values. For example, if an embodiment using a 32-bit significand is determined to have nine bits that have the same value for all possible operand values, then the 32-bit results may compressed so that they may be stored in 24-bit registers.
  • the compression may be performed conditionally. For example, in one embodiment the compression may be performed only if a comparison of the bits to be compressed shows that they all have the same value. While many different types of hardware may be used to perform this comparison, one possible configuration may utilize multiple input AND gates and multiple input NAND gates. If the testing logic determines that the bits to be compressed do not all have the same value, then the operand may stored by truncating the extra least significant bits. While this implementation may lose the benefit of increased accuracy in some cases, this may be adequate if the bits to be compressed rarely have different values.
  • the compressed intermediate product is needed for the next iteration, it may be decompressed.
  • FIG. 29B one possible method for decompressing the compressed intermediate product is illustrated.
  • the compressed intermediate product is read from the storage register (step 612 ).
  • the compressed intermediate product is expanded by padding the next-to-most significant bits with copies of the signaling bit (step 614 ).
  • the number of copies of the signaling bit that are padded or inserted below the most significant bit in this embodiment equals N ⁇ 1.
  • the expanded intermediate product now has the same width as the original intermediate product. For example, if the compressed intermediate product comprises 24 bits, and the original intermediate product comprises 28 bits, then the signaling bit will be copied 4 times to render an expanded intermediate product having 28 bits.
  • FIGS. 29A and 29B no information is lost in the compression and decompression process.
  • the bits replaced by the signaling bit need not be the most significant bit. They may begin with the next-to-most significant bit. For example, if the most significant bit of the intermediate product is bit 27 , the bits replaced by the signal bit may comprise bits 22 through 26 . Further note that the signaling bit may simply be a particular bit within the intermediate product, i.e., an extra calculation to determine the signal bit is not required. Furthermore, the signal bit need not be the most significant or least significant bit in the range of bits to be compressed, i.e., the signal bit may be a bit in the middle of the range of bits to be compressed.
  • multiplier 50 configured to compress intermediate products.
  • this embodiment of multiplier 50 comprises partial product generator 60 , selection logic 62 , and partial product array adder 64 .
  • This embodiment also comprises demultiplexer 622 , 24-bit storage register 638 , and multiplexers 776 and 780 .
  • Demultiplexer 622 receives an intermediate product 620 from partial product array adder 64 .
  • Other embodiments are also contemplated.
  • demultiplexer 622 may receive intermediate product 620 from multiplexer 290 (see FIG. 26).
  • Demultiplexer 622 routes intermediate product 620 according to an iterative control signal 644 .
  • intermediate product 620 is routed to storage register 638 . If, on the other hand, iterative control signal 644 indicates that an iterative operation is not being performed, then intermediate product 620 may be routed to standard rounding logic (not shown) and then output. In another embodiment, intermediate product 620 may be rounded before reaching demultiplexer 622 . In this case, demultiplexer 622 may simply route product 620 to an output of multiplier 50 if an iterative calculation is not being performed.
  • storage register 638 is configured to store intermediate product 620 until it is needed for the next iteration.
  • the signal and data lines coupled to the inputs and outputs of storage register 638 may be referred to herein as a wire shifter because they provide a fixed shifting function.
  • storage register 638 may be implemented so that it is smaller than intermediate product 620 . Assuming, for example, that intermediate product is 28 bits wide, storage register 638 may be configured to store only the 24 least significant bits of intermediate product 620 .
  • bit 22 may be selected as a signal bit 632 to replace the four next-to-most significant bits 636 .
  • storage register 638 may be configured to store bits 0 - 21 , signal bit 632 , and bit 27 .
  • bits 0 - 21 , signal bit 632 , and bit 27 are read from storage register 638 .
  • signal bit 632 is copied four times to recreate bits 23 - 26 .
  • no information from intermediate product 620 is lost in the compression and decompression cycle.
  • Multiplier 50 may also be configured with optional testing logic 624 , which is configured to determine whether the five most significant bits from intermediate product 620 have the same value.
  • testing logic 624 may comprise five-input AND gate 626 , five-input NOR gate 628 , and two-input OR gate 630 .
  • the output from two-input OR gate 630 may be used in a number of ways, e.g., to signal an error condition or to cause register 638 to store the 24 most significant bits without compression.
  • testing logic 624 may be omitted.
  • demultiplexer 622 may also be omitted.
  • product 620 may be rounded and then routed to both storage register 638 and the outputs of multiplexer 50 .
  • external logic may be used to ensure that functional units or other parts of the microprocessor will not use the data output by multiplier 50 until the iterative calculation is completed.
  • uncompressed intermediate product 656 may comprise 28 bits numbered 0 through 27 . If the five most significant bits 652 from intermediate product 656 all have the same value, they may be compressed into one signal bit 654 . This allows uncompressed intermediate product 656 to be represented and stored as compressed intermediate product 658 , thereby using only 24 bits. When compressed intermediate product 658 is uncompressed, signal bit 654 is copied four times to create the four most significant bits of uncompressed intermediate product 660 .
  • intermediate product 676 is characterized by having five equal bits 672 directly below most significant bit 680 .
  • the five equal bits 672 may be compressed into one signal bit 674 even though they are not the most significant bits in intermediate product 676 .
  • Compressed product 678 is still able to fit within a 24 bit storage register.
  • To decompress compressed product 678 four copies of signal bit 674 are inserted below most significant bit 680 within uncompressed product 682 .
  • no information from intermediate product 676 is lost in the process.
  • the contemplated compression method may be used regardless of where the bits having equal values are located.
  • no information is lost if the bits having equal values are located in the same position in each iteration.
  • One method to determine whether the calculated result equals the exactly rounded result is to multiply the calculated result (calculated to at least one extra bit of accuracy, i.e., N+1 bits) and the original operand B (assuming the reciprocal of B has been calculated).
  • the exactly rounded result may then be selected from the following three values: the N-bit result (without the extra bit) plus one in the least significant bit; the N-bit result minus one in the least significant bit; or the N-bit result plus zero.
  • the exactly rounded result is selected based upon the value of the extra computed bit (i.e., bit N+1) and whether the result of multiplication was greater than one, less than one, or equal to one.
  • multiplier 50 may be configured to achieve nearly the same accuracy (i.e., computing the exactly rounded result with a probability close to one) by adding an “adjustment constant” to the result produced from the last step of the iterative calculation before rounding.
  • the probability that the calculated result is higher than the exactly rounded result (“P high ”) may be greater than the probability that the calculated result is lower than the exactly rounded result (“P low ”).
  • P equal the probability that the calculated result will equal the exactly rounded result
  • P high the probability that the calculated result will equal the exactly rounded result
  • P low the probability that the calculated result will equal the exactly rounded result
  • P high the probability that the calculated result will equal the exactly rounded result
  • P low the probability that the calculated result will equal the exactly rounded result
  • P high the probability that the calculated result will equal the exactly rounded result
  • P high the probability that the calculated result will equal the exactly rounded result
  • P high , P low , and P equal may be determined by passing a large number of differing input operand values through the iterative calculation (as performed by the multiplier) and then comparing each result with the corresponding exactly rounded result.
  • a computer program may be particularly useful in performing the comparisons.
  • the comparisons may also be performed before rounding, i.e., comparing the infinitely precise results with the results from the multiplier's final iteration before they are rounded.
  • multiplier 50 configured to add a correction constant is shown.
  • multiplier 50 may be configured similarly to other embodiments disclosed herein that are capable of performing iterative calculations.
  • control logic 778 is configured to convey adjustment constant 800 to partial product array adder 64 during the last step of an iterative calculation.
  • Partial product array adder 64 then sums adjustment constant 800 with the selected partial products from selection logic 62 .
  • this configuration may not require an additional set of adders to sum adjustment constant 800 into the result.
  • multiplier 50 i.e., carry-save adders 276 A-B, carry-propagate adders 278 A-B, sticky bit logic units 280 A-B, LSB fix-up logic units 288 A-B, and logic units 770 A-B).
  • adjustment constant 800 may instead be summed into the result by carry-save adders 276 A-B and or carry-propagate adders 278 A-B.
  • Another embodiment may incorporate an extra adder (not shown) to sum adjustment constant 800 with result 292 from multiplexer 290 .
  • this configuration may require additional logic in the event the result becomes denormalized or experiences an overflow as a result of the addition.
  • Control logic 778 may be configured to convey adjustment constant 800 to partial product array adder 64 during the final multiplication in each iteration, or just for the final multiplication during the final iteration. For example, if the iterative calculation involves two multiplication operations for each iteration, and three iterations are required to achieve the desired accuracy, then control logic 778 may be configured to convey adjustment constant 800 during the final multiplication of each iteration, i.e., three times, or only once during the second multiplication of the third and final iteration. In yet another embodiment, control logic 778 may convey adjustment constant 800 to partial product adder 64 during every multiplication in the iteration.
  • control logic unit 778 may store a number of different adjustment constants 800 , e.g., one for each type of iteration. In such a configuration, control logic unit 778 may convey the appropriate adjustment constant that corresponds to the type of iterative calculation being performed. Control logic unit 778 receives an indication of which iterative calculation is being perform via control signal 722 . For example, when control signal 722 indicates that the reciprocal iterative calculation is to be performed, control logic 778 may convey a first adjustment constant 800 to partial product array adder 64 . However, when control signal 722 indicates that the reciprocal square root iterative calculation is being performed, control logic 778 may convey a second, different adjustment constant to partial product array adder 64 .
  • multiplier 50 may be configured to calculate three versions of each result, i.e., a first result generated without adding an adjustment constant, a second result generated by adding an adjustment constant, and a third result generated by subtracting the adjustment constant.
  • the second and third results could be calculated by adding different adjustment constants. These results may be calculated in parallel by multiplier 50 .
  • a multiplication may be performed as described above to determine whether the result is correct, too high, or too low. The corresponding result may then be selected.
  • Multipliers 50 A and 50 B may be configured similarly to multiplier 50 as described in previous embodiments. As shown in the figure, multipliers 50 A and 50 B are configured to operate in parallel to execute a vector multiplication of a pair of vectors each comprising four 16-bit operands 380 A- 380 D and 382 A- 382 D. Note operands 380 A- 380 D may come from a first 64-bit MMX register, while operands 382 A- 382 D may come from a second 64-bit MMX register.
  • multipliers 50 A and 50 B operate in parallel to multiply a pair of vectors each comprising two 32-bit operands 384 A- 384 B and 386 A- 386 B.
  • operands 384 A- 384 B may come from a first 64-bit MMX register
  • operands 386 A- 386 B may come from a second 64-bit MMX register.
  • each individual multiplier 50 A and 50 B is performing a scalar multiplication.
  • multiplier 50 A may perform a 32-bit scalar multiplication independent from multiplier 50 B. While multiplier 50 A performs the multiplication, multiplier 50 B may sit idle or perform an independent multiplication operation.
  • FIG. 34 a block diagram of one embodiment of a computer system 400 including microprocessor 10 is shown.
  • Microprocessor 10 is coupled to a variety of system components through a bus bridge 402 .
  • a main memory 404 is coupled to bus bridge 402 through a memory bus 406
  • a graphics controller 408 is coupled to bus bridge 402 through an AGP bus 410 .
  • a plurality of PCI devices 412 A- 412 B are coupled to bus bridge 402 through a PCI bus 414 .
  • a secondary bus bridge 416 may further be provided to accommodate an electrical interface to one or more EISA or ISA devices 418 through an EISA/ISA bus 420 .
  • Microprocessor 10 is coupled to bus bridge 402 through a CPU bus 424 .
  • Bus bridge 402 provides an interface between microprocessor 10 , main memory 404 , graphics controller 408 , and devices attached to PCI bus 414 .
  • bus bridge 402 identifies the target of the operation (e.g. a particular device or, in the case of PCI bus 414 , that the target is on PCI bus 414 ).
  • Bus bridge 402 routes the operation to the targeted device.
  • Bus bridge 402 generally translates an operation from the protocol used by the source device or bus to the protocol used by the target device or bus.
  • secondary bus bridge 416 may further incorporate additional functionality, as desired.
  • secondary bus bridge 416 includes a master PCI arbiter (not shown) for arbitrating ownership of PCI bus 414 .
  • An input/output controller (not shown), either external from or integrated with secondary bus bridge 416 , may also be included within computer system 400 to provide operational support for a keyboard and mouse 422 and for various serial and parallel ports, as desired.
  • An external cache unit (not shown) may further be coupled to CPU bus 424 between microprocessor 10 and bus bridge 402 in other embodiments. Alternatively, the external cache may be coupled to bus bridge 402 and cache control logic for the external cache may be integrated into bus bridge 402 .
  • Main memory 404 is a memory in which application programs are stored and from which microprocessor 10 primarily executes.
  • a suitable main memory 404 comprises DRAM (Dynamic Random Access Memory), and preferably a plurality of banks of SDRAM (Synchronous DRAM).
  • PCI devices 412 A- 412 B are illustrative of a variety of peripheral devices such as, for example, network interface cards, video accelerators, audio cards, hard or floppy disk drives or drive controllers, SCSI (Small Computer Systems Interface) adapters and telephony cards.
  • ISA device 418 is illustrative of various types of peripheral devices, such as a modem, a sound card, and a variety of data acquisition cards such as GPIB or field bus interface cards.
  • Graphics controller 408 is provided to control the rendering of text and images on a display 426 .
  • Graphics controller 408 may embody a typical graphics accelerator generally known in the art to render three-dimensional data structures which can be effectively shifted into and from main memory 404 .
  • Graphics controller 408 may therefore be a master of AGP bus 410 in that it can request and receive access to a target interface within bus bridge 402 to thereby obtain access to main memory 404 .
  • a dedicated graphics bus accommodates rapid retrieval of data from main memory 404 .
  • graphics controller 408 may further be configured to generate PCI protocol transactions on AGP bus 410 .
  • the AGP interface of bus bridge 402 may thus include functionality to support both AGP protocol transactions as well as PCI protocol target and initiator transactions.
  • Display 426 is any electronic display upon which an image or text can be presented.
  • a suitable display 426 includes a cathode ray tube (“CRT”), a liquid crystal display (“LCD”), etc.
  • computer system 400 may be a multiprocessing computer system including additional microprocessors (e.g. microprocessor 10 a shown as an optional component of computer system 400 ).
  • microprocessor 10 a may be similar to microprocessor 10 . More particularly, microprocessor 10 a may be an identical copy of microprocessor 10 .
  • Microprocessor 10 a may share CPU bus 424 with microprocessor 10 (as shown in FIG. 5) or may be connected to bus bridge 402 via an independent bus.
  • a signal is “asserted” if it conveys a value indicative of a particular condition. Conversely, a signal is “deasserted” if it conveys a value indicative of a lack of a particular condition.
  • a signal may be defined to be asserted when it conveys a logical zero value or, conversely, when it conveys a logical one value.
  • various values have been described as being discarded in the above discussion. A value may be discarded in a number of manners, but generally involves modifying the value such that it is ignored by logic circuitry which receives the value.
  • the logic state of the value may be inverted to discard the value.
  • the value is an n-bit value, one of the n-bit encodings may indicate that the value is invalid. Setting the value to the invalid encoding causes the value to be discarded.
  • an n-bit value may include a valid bit indicative, when set, that the n-bit value is valid. Resetting the valid bit may comprise discarding the value. Other methods of discarding a value may be used as well.

Abstract

A multiplier capable of performing signed and unsigned scalar and vector multiplication is disclosed. The multiplier is configured to receive signed or unsigned multiplier and multiplicand operands in scalar or packed vector form. An effective sign for the multiplier and multiplicand operands may be calculated and used to create and select a number of partial products according to Booth's algorithm. Once the partial products have been created and selected, they may be summed and the results may be output. The results may be signed or unsigned, and may represent vector or scalar quantities. When a vector multiplication is performed, the multiplier may be configured to generate and select partial products so as to effectively isolate the multiplication process for each pair of vector components. The multiplier may also be configured to sum the products of the vector components to form the vector dot product. The final product may be output in segments so as to require fewer bus lines. The segments may be rounded by adding a rounding constant. Rounding and normalization may be performed in two paths, one assuming an overflow will occur, the other assuming no overflow will occur. The multiplier may also be configured to perform iterative calculations to evaluate constant powers of an operand. Intermediate products that are formed may be rounded and normalized in two paths and then compressed and stored for use in the next iteration. An adjustment constant may also be added to increase the frequency of exactly rounded results.

Description

    BACKGROUND OF THE INVENTION
  • 1. Field of the Invention [0001]
  • This invention relates generally to the field of microprocessors and, more particularly, to rounding the results of iterative calculations in microprocessors. [0002]
  • 2. Description of the Related Art [0003]
  • Microprocessors are typically designed with a number of “execution units” that are each optimized to perform a particular set of functions or instructions. For example, one or more execution units within a microprocessor may be optimized to perform memory accesses, i.e., load and store operations. Other execution units may be optimized to perform general arithmetic and logic functions, e.g., shifts and compares. Many microprocessors also have specialized execution units configured to perform more complex arithmetic operations such as multiplication and reciprocal operations. These specialized execution units typically comprise hardware that is optimized to perform one or more particular arithmetic functions. In the case of multiplication, the optimized hardware is typically referred to as a “multiplier.”[0004]
  • In older microprocessors, multipliers were implemented using designs that conserved die space at the expense of arithmetic performance. Until recently, this was not a major problem because most applications, i.e., non-scientific applications such as word processors, did not frequently generate multiplication instructions. However, recent advances in computer technology and software are placing greater emphasis upon multiplier performance. For example, three dimensional computer graphics, rendering, and multimedia applications all rely heavily upon a microprocessor's arithmetic capabilities, particularly multiplication and multiplication-related operations. As a result, in recent years microprocessor designers have favored performance-oriented designs that use more die space. Unfortunately, the increased die space needed for these high performance multipliers reduces the space available for other execution units within the microprocessor. Thus, a mechanism for increasing multiplier performance while conserving die space in needed. [0005]
  • The die space used by multipliers is of particular importance to microprocessor designers because many microprocessors, e.g., those configured to execute MMXTM (multimedia extension) or 3D graphics instructions, may use more than one multiplier. MMX and 3D graphics instructions are often implemented as “vectored” instructions. Vectored instructions have operands that are partitioned into separate sections, each of which is independently operated upon. For example, a vectored multiply instruction may operate upon a pair of 32-bit operands, each of which is partitioned into two 16-bit sections or four 8-bit sections. Upon execution of a vectored multiply instruction, corresponding sections of each operand are independently multiplied. FIG. 1 illustrates the differences between a scalar (i.e., non-vectored) multiplication and a vector multiplication. To quickly execute vectored multiply instructions, many microprocessors use a number of multipliers in parallel. In order to conserve die space, a mechanism for reducing the number of multipliers in a microprocessor is desirable. Furthermore, a mechanism for reducing the amount of support hardware (e.g., bus lines) that may be required for each multiplier is also desirable. [0006]
  • Another factor that may affect the number of multipliers used within a microprocessor is the microprocessor's ability to operate upon multiple data types. Most microprocessors must support multiple data types. For example, x86 compatible microprocessors execute instructions that are defined to operate upon an integer data type and instructions that are defined to operate upon floating point data types. Floating point data can represent numbers within a much larger range than integer data. For example, a 32-bit signed integer can represent the integers between −2[0007] 31 and 231−1 (using two's complement format). In contrast, a 32-bit (“single precision”) floating point number as defined by the Institute of Electrical and Electronic Engineers (IEEE) Standard 754 has a range (in normalized format) from 2−126 to 2−127×(2−2−23) in both positive and negative numbers. While both integer and floating point data types are capable of representing positive and negative values, integers are considered to be “signed” for multiplication purposes, while floating point numbers are considered to be “unsigned.” Integers are considered to be signed because they are stored in two's complement representation.
  • Turning now to FIG. 2A, an exemplary format for an 8-[0008] bit integer 100 is shown. As illustrated in the figure, negative integers are represented using the two's complement format 104. To negate an integer, all bits are inverted to obtain the one's complement format 102. A constant of one is then added to the least significant bit (LSB).
  • Turning now to FIG. 2B, an exemplary format for a 32-bit (single precision) floating point number is shown. A floating point number is represented by a significand, an exponent and a sign bit. The base for the floating point number is raised to the power of the exponent and multiplied by the significand to arrive at the number represented. In microprocessors, [0009] base 2 is typically used. The significand comprises a number of bits used to represent the most significant digits of the number. Typically, the significand comprises one bit to the left of the radix point and the remaining bits to the right of the radix point. In order to save space, the bit to the left of the radix point, known as the integer bit, is not explicitly stored. Instead, it is implied in the format of the number. Additional information regarding floating point numbers and operations performed thereon may be obtained in IEEE Standard 754. Unlike the integer representation, two's complement format is not typically used in the floating point representation. Instead, sign and magnitude form are used. Thus, only the sign bit is changed when converting from a positive value 106 to a negative value 108. For this reason, many microprocessors use two multipliers, i.e., one for signed values (two's complement format) and another for unsigned values (sign and magnitude format). Thus, a mechanism for increasing floating point, integer, and vector multiplier performance while conserving die space is needed.
  • SUMMARY OF THE INVENTION
  • The problems outlined above are in large part solved by a multiplier configured in accordance with the present invention. In one embodiment, the multiplier may perform signed and unsigned scalar and vector multiplication using the same hardware. The multiplier may receive either signed or unsigned operands in either scalar or packed vector format and accordingly output a signed or unsigned result that is either a scalar or a vector quantity. Advantageously, this embodiment may reduce the total number of multipliers needed within a microprocessor because it may be shared by execution units and perform both scalar and vector multiplication. This space savings may in turn allow designers to optimize the multiplier for speed without fear of using too much die space. [0010]
  • In another embodiment, speed may be increased by configuring the multiplier to perform fast rounding and normalization. This may be accomplished configuring the multiplier to calculate two version of an operand, e.g., an overflow version and a non-overflow version, in parallel and then select between the two versions. [0011]
  • In other embodiments, the multiplier may be further optimized to perform certain calculations such as evaluating constant powers of an operand (e.g., reciprocal or reciprocal square root operations). Iterative formulas may be used to recast these operations into multiplication operations. Iterative formulas generate intermediate products which are used in subsequent iterations to achieve greater accuracy. In some embodiments, the multiplier may be configured to store these intermediate products for future iterations. Advantageously, the multiplier may be configured to compress these intermediate products before storing them, which may further conserve die space. [0012]
  • In one embodiment, the multiplier may comprise a partial product generator, a selection logic unit, and an adder. The multiplier may also comprise a multiplicand input configured to receive a multiplicand operand (signed or unsigned), a multiplier input configured to receive a multiplier operand (also signed or unsigned), and a sign-in input. The sign-in input is configured to receive a sign-in signal indicative of whether the multiplier is to perform signed or unsigned multiplication. The partial product generator, which is coupled to the multiplicand input, is configured to generate a plurality of partial products based upon the multiplicand operand. The selection logic unit, which is coupled to the partial product generator and the multiplier input, is configured to select a number of partial products from the partial product generator based upon the multiplier operand. The adder, which is coupled to the selection logic unit, is configured to sum the selected partial products to form a final product. The final product, which may be signed or unsigned, may then be output to other parts of the microprocessor. [0013]
  • In addition, the multiplier may further comprise an “effective sign” calculation unit. In one embodiment, the calculation unit may comprise a pair of AND gates, each configured to receive the most significant bit of one operand and the sign-in signal. The output of each AND gate is used as the effective sign for that gate's operand. The effective sign may be appended to each operand for use as the operand's sign during the multiplication process. Advantageously, the effective sign may allow both unsigned operands and signed operands to be multiplied on the same hardware. [0014]
  • A method for operating a multiplier within a microprocessor is also contemplated. In one embodiment, the method comprises receiving a multiplier operand, a multiplicand operand, and a sign-in signal from other functional units within the microprocessor. An effective sign bit for the multiplicand operand is generated from the sign-in signal and the most significant bit of the multiplicand operand. A plurality of partial products may then be calculated from the effective sign bit and the multiplicand operand. Next, a number of the partial products may be selected according to the multiplier operand. The partial products are then summed, and the results are output. In other embodiments, the steps may be performed in parallel or in a different order. [0015]
  • In another embodiment, the multiplier may be capable of multiplying one pair of N-bit operands or two pairs of N/2-bit operands simultaneously. The multiplier may comprise a multiplier input and a multiplicand input, each configured to receive an operand comprising one N-bit value or two N/2-bit values. The multiplier may also comprise a partial product generator coupled to the multiplicand input, wherein the partial product generator is configured to generate a plurality of partial products based upon the value of the multiplicand operand. The multiplier may further comprise a selection logic unit coupled to the partial product generator and the multiplier input. The selection logic unit may be configured to select a plurality of partial products from the partial product generator based upon the value of the multiplier operand. An adder may be coupled to the selection logic unit to receive and sum the selected partial products to form a final product comprising either one 2N-bit value or two N-bit values. The multiplier may receive a vector_in signal indicating whether vector or scalar multiplication is to be formed. [0016]
  • A method for operating a multiplier capable of scalar and vector multiplication is also contemplated. The method may comprise receiving a multiplier operand, a multiplicand operand, and a vector-in signal as inputs from functional units within the microprocessor and then calculating a number of partial products from the multiplicand operand using inverters and shifting logic. Certain partial products may be selected according to the multiplier operand. The selected partial products may then be summed to generate a final product. The final product may be in scalar form if the vector_in signal is unasserted, and in vector form if the vector_in signal is asserted. [0017]
  • In another embodiment, the multiplier may also be configured to calculate vector dot products and may comprise a multiplier input and a multiplicand input, each configured to receive a vector. A partial product generator may be coupled to the multiplicand input and may be configured to generate a plurality of partial products based upon one of the vectors. A first adder may be coupled to receive the partial products and sum them to generate vector component products for each pair of vector components. A second adder may be coupled to the first adder and may be configured to receive and sum the vector component products to form a sum value and a carry value. A third adder may be configured to receive the sum and carry values and one or more vector component products from the first adder. The third adder may be configured to output the sum of the sum and carry values (and any carry bits resulting from the summation of the one or more vector components) as a final result. [0018]
  • In yet another embodiment, the multiplier may be configured to output the results in segments or portions. This may advantageously reduce the amount of interface logic and the number of bus lines needed to support the multiplier. Furthermore, the segments or portions may be rounded. In this embodiment, the multiplier may comprise a multiplier input, a multiplicand input, and a partial product generator. The generator is coupled to the multiplicand input and is configured to generate one or more partial products. An adder, coupled to the partial product generator and the multiplier input, may be configured to receive a number of the partial products. The adder may sum the partial products together with rounding constants to form a plurality of vector component products which are logically divided into portions. One or more of the portions may be rounded. [0019]
  • In another embodiment, the multiplier may be configured to round its outputs in a number of different modes. Thus, an apparatus and method for rounding and normalizing results within a multiplier is also contemplated. In one embodiment, the apparatus comprises an adder configured to receive a plurality of redundant-form components. The adder is configured to sum the redundant-form components to generate a first non-redundant-form result. The adder may also be configured to generate a second non-redundant-form result comprising the sum of the redundant-form components plus a constant. Two shifters are configured to receive the results. Both shifters may be controlled by the most significant bits of the results they receive. A multiplexer may be coupled to receive the output from the shifters and select one of them for output based upon the least significant bits in the first non-redundant-form result. By generating more than version of the result (e.g., the result and the result plus a constant) in parallel, rounding may be accomplished in less time than previously required. [0020]
  • A multiplier configured to round and normalize products is also contemplated. In one embodiment, the multiplier may comprise two paths. Each path may comprise one or more adders, each configured to receive a redundant-form product and reduce it to a non-redundant form. The first path does so assuming no overflow will occur, while the second path does so assuming an overflow will occur. A multiplexer may be coupled to the outputs of the two paths, so as to select between the results from the first and second paths. [0021]
  • A method for rounding and normalizing results within a multiplier is also contemplated. In one embodiment, the method comprises multiplying a first operand and a second operand to form a plurality of redundant-form components. A rounding constant is generated and added to the redundant-form component in two different bit positions. The first position assumes an overflow will occur, while the second position assumes no overflow will occur. A particular set of bits are selected for output as the final result from either the first addition or the second addition. [0022]
  • Also contemplated is an apparatus for evaluating a constant power of an operand using a multiplier. In one embodiment, the apparatus comprises an initial estimate generator configured to receive the operand and output an initial estimate of the operand raised to the desired constant power. A multiplier may be coupled to receive the operand and the initial estimate, wherein the multiplier is configured to calculate the product of the initial estimate and the operand. A first plurality of inverters may be coupled to receive, invert, and normalize selected bits from the product to form a first approximation, wherein the first approximation assumes an overflow has occurred in the multiplier. A second plurality of inverters may be coupled to receive, invert, and normalize selected bits from the product to form a second approximation, wherein the second approximation assumes an overflow has not occurred in the multiplier. A multiplexer may be configured to select either the first or second approximations for output. [0023]
  • Also contemplated is a method for evaluating a constant power of an operand using a multiplier. In one embodiment, the method comprises determining an initial estimate of the operand raised to a first constant power. The operand and the initial estimate are then multiplied in the multiplier to form a first product. A normalized first intermediate approximation is calculated by performing a bit-wise inversion on the first product assuming an overflow occurred during the multiplying. A normalized second intermediate approximation is calculated by performing a bit-wise inversion on the first product assuming no overflow occurred during the multiplying. Finally, a set of bits are selected from either the first intermediate approximation or the second intermediate approximation to form a selected approximation that may be output or used in subsequent iterations to achieve a more accurate result. [0024]
  • An apparatus for rounding and normalizing a redundant-form value is also contemplated. In one embodiment, the apparatus may comprise two adders and a multiplexer. The first adder is configured to receive the redundant-form value and add a rounding constant to its guard bit position, thereby forming a first rounded result, wherein the guard bit position is selected assuming no overflow will occur. The second adder is similarly configured and performs the same addition assuming, however, that an overflow will occur. A multiplexer is configured to select either the first rounded result or the second rounded result based upon one or more of the most significant bits from the first and second rounded results. Performing the rounding in parallel may advantageously speed the process by allowing normalization to take place in parallel with the multiplexer's selection. [0025]
  • A method for operating a multiplier that compresses intermediate results is also contemplated. In one embodiment, this method comprises calculating an intermediate product to a predetermined number of bits of accuracy. Next, a signaling bit is selected from the intermediate product. The signaling bit is equal to each of the N most significant bits of the intermediate product. Next, the intermediate product is compressed by replacing the N most significant bits of the intermediate product with the signaling bit. The compressed intermediate product is then stored into a storage register. During the next iteration, the storage register is read to determine the value of the compressed intermediate product. The compressed intermediate product may be expanded to form an expanded intermediate product by padding the compressed intermediate product with copies of the signaling bit. [0026]
  • A multiplier configured to perform iterative calculations and to compress intermediate products is also contemplated. In one embodiment, the multiplier comprises a multiplier input, a multiplicand input, and a partial product generator as described in previous embodiments. The multiplier also comprises a partial product array adder which is configured to receive and add a selected plurality of partial products to form an intermediate product. Compression logic may be coupled to the partial product array adder. The compression logic may comprise a wire shifter configured to replace a predetermined number of bits of the intermediate product with a single signal bit, which represents the information stored in the predetermined number of bits. The signal bit is selected so that it equals the value of each individual bit within the predetermined number of bits. A storage register may be coupled to receive and store the compressed intermediate product from the compression logic. [0027]
  • In another embodiment, the multiplier may be configured to add an adjustment constant to increase the frequency of exactly rounded results. In such an embodiment, the multiplier may comprise a multiplier input configured to receive a multiplier operand, a multiplicand input configured to receive a multiplicand operand, a partial product generator, and selection logic. In one embodiment, the partial product generator is coupled to the multiplicand input and configured to generate one or more partial products based upon the multiplicand operand. The selection logic may be coupled to the partial product generator and the multiplier, wherein the selection logic is configured to select a plurality of partial products based upon the multiplier. The partial product array adder may be coupled to the selection logic, wherein the adder is configured to receive and sum a number of the partial products and an adjustment constant to form a product. The adjustment constant is selected to increase the frequency that the result is exactly rounded. [0028]
  • A method for increasing the frequency of exactly rounded results is also contemplated. In one embodiment, the method comprises receiving an operand and determining an initial estimate of the result of an iterative calculation using the operand. The initial estimate and the operand are multiplied to generate an intermediate result. The multiplication is repeated a predetermined number of times, wherein the intermediate result is used in place of the initial estimate in subsequent iterations. The final repetition generates a final result, and an adjustment constant may be added to the final result, wherein the adjustment constant increases the probability that the final result will equal the exactly rounded result of the iterative calculation.[0029]
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • Other objects and advantages of the invention will become apparent upon reading the following detailed description and upon reference to the accompanying drawings in which: [0030]
  • FIG. 1 is a diagram illustrating an exemplary scalar multiplication and an exemplary vector multiplication. [0031]
  • FIG. 2A is a diagram of an exemplary integer data format using two's complement representation. [0032]
  • FIG. 2B is a diagram of an exemplary floating point data format. [0033]
  • FIG. 3 is a block diagram of one embodiment of an exemplary microprocessor. [0034]
  • FIG. 4 is a block diagram of one embodiment of the computational core from the microprocessor of FIG. 3. [0035]
  • FIG. 5A illustrates one embodiment of the shift-and-add algorithm for binary multiplication. [0036]
  • FIG. 5B illustrates one embodiment of Booth's algorithm for binary multiplication. [0037]
  • FIG. 6 is a block diagram illustrating details of one embodiment of the multiplier from FIG. 4. [0038]
  • FIG. 7 is a block diagram illustrating the operation of the multiplier from FIG. 6 for unsigned operands. [0039]
  • FIG. 8 is a block diagram illustrating an example of the operation of the multiplier from FIG. 6 for signed operands. [0040]
  • FIG. 9 is a block diagram illustrating another example of the operation of the multiplier from FIG. 6 for signed operands. [0041]
  • FIG. 10 is a diagram illustrating one embodiment of the multiplier from FIG. 4 that is configured to perform vector multiplication. [0042]
  • FIG. 11A is a diagram that illustrates details of one embodiment of the partial product generator from FIG. 6. [0043]
  • FIG. 11B is a diagram that illustrates in detail part of one embodiment of the selection logic from FIG. 6. [0044]
  • FIG. 12A-B is a diagram that illustrates details of one embodiment of the selection logic and adder from FIG. 6. [0045]
  • FIG. 13 is a diagram illustrating another embodiment of the multiplier from FIG. 4 that is configured to perform vector multiplication. [0046]
  • FIG. 14 is a diagram illustrating yet another embodiment of the multiplier from FIG. 4 that is configured to perform vector multiplication. [0047]
  • FIG. 15 is a diagram illustrating one embodiment of a multiplier that is configured to calculate the vector dot product of a pair of vector operands. [0048]
  • FIG. 16 is a diagram illustrating another embodiment of a multiplier that is configured to calculate the vector dot product of a pair of vector operands. [0049]
  • FIG. 17 is a diagram illustrating one embodiment of a multiplier that is configured to return vector component products in portions, some of which may be rounded. [0050]
  • FIG. 18 is a diagram illustrating another embodiment of a multiplier that is configured to return vector component products in portions, some of which may be rounded. [0051]
  • FIG. 19 is a diagram illustrating one embodiment of the multiplier from FIG. 6 configured to perform rounding. [0052]
  • FIG. 20 is a diagram illustrating a numerical example of the operation of the multiplier from FIG. 19. [0053]
  • FIG. 21 is a diagram illustrating details of one embodiment of the sticky bit logic from FIG. 19. [0054]
  • FIG. 22 is a diagram illustrating a numerical example of the operation of the multiplier from FIG. 19. [0055]
  • FIG. 23 is a diagram illustrating a no ther embodiment of the multiplier from FIG. 6. [0056]
  • FIG. 24 is a flowchart illustrating one embodiment of a method for calculating the reciprocal of an operand. [0057]
  • FIG. 25 is a flowchart illustrating one embodiment of a method for calculating the reciprocal square root of an operand. [0058]
  • FIG. 26 is a diagram illustrating one embodiment of the multiplier from FIG. 6 that is configured to perform iterative calculations. [0059]
  • FIG. 27 is a diagram illustrating details of one exemplary embodiment of the non-overflow and overflow logic units from FIG. 26. [0060]
  • FIG. 28 is a diagram illustrating details of another exemplary embodiment of non-overflow and overflow logic units from FIG. 26. [0061]
  • FIG. 29A is a flowchart illustrating one possible method for fast compression. [0062]
  • FIG. 29B is a flowchart illustrating one possible method for fast decompression. [0063]
  • FIG. 30 is a diagram illustrating one embodiment of the multiplier from FIG. 4 configured to compress intermediate products. [0064]
  • FIG. 31A is a figure illustrating one possible method for compression. [0065]
  • FIG. 31B is a figure illustrating another possible method for compression. [0066]
  • FIG. 32 is a figure illustrating one embodiment of a multiplier configured to achieve a higher frequency of exactly rounded results. [0067]
  • FIG. 33A is a diagram illustrating an example of a vector multiplication using two multipliers. [0068]
  • FIG. 33B is a diagram illustrating another example of a multiplication using two multipliers. [0069]
  • FIG. 34 is a block diagram of one embodiment of a computer system configured to utilize the microprocessor of FIG. 3.[0070]
  • While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present invention as defined by the appended claims. [0071]
  • DETAILED DESCRIPTION OF AN EMBODIMENT
  • Turning now to FIG. 3, a block diagram of one embodiment of a [0072] microprocessor 10 is shown. As depicted, microprocessor 10 comprises a predecode logic block 12, a bus interface unit 24, and a level one-cache controller 18, all of which are coupled to the following three caches: a level-one instruction cache 14, a level-one data cache 26, and an on-chip level-two cache 40. Both instruction cache 14 and data cache 26 are configured with translation lookaside buffers, i.e., TLBs 16 and 28, respectively. Microprocessor 10 further comprises a decode unit 20 which receives instructions from instruction cache 14, decodes them, and then forwards them to an execution engine 30 in accordance with inputs received from a branch logic unit 22.
  • [0073] Execution engine 30 comprises a scheduler buffer 32, an instruction control unit 34, and a plurality of execution units 36A-36F. Note that blocks referred to herein with a reference number followed by a letter may be collectively referred to by the reference number alone. For example, execution units 36A-F may be collectively referred to as execution units 36. Scheduler buffer 32 is coupled to receive decoded instructions from decode unit 20 and convey them to execution units 36 in accordance with input received from instruction control unit 34. In one embodiment, execution units 36A-F include a load unit 36A, a store unit 36B, two integer/MMX/ 3D units 36C and 36D, a floating point unit 36E, and a branch resolving unit 36F. Load unit 36A receives input from data cache 26, while store unit 36B interfaces with data cache 26 via a store queue 38. Integer/MMX/ 3D units 36C and 36D, and floating point unit 36E collectively form a computational core 42 for microprocessor 10. Computational core 42 may further comprise other execution units and specialized hardware such as multipliers.
  • Before describing [0074] computational core 42 in detail, other features of microprocessor 10 will be discussed. In one embodiment, instruction cache 14 is organized into sectors, with each sector including two 32-byte cache lines. The two cache lines within each sector share a common tag but have separate state bits that indicate the status of the line. Accordingly, two forms of cache misses (and associated cache fills) may take place: (1) sector replacement and (2) cache line replacement. In the case of sector replacement, the cache miss is caused by a tag mismatch in instruction cache 14. Thus the required cache line is supplied by external memory via bus interface unit 24. The cache line within the sector that is not needed is then marked invalid. In the case of a cache line replacement, a tag matches the requested address but the corresponding cache line is marked as invalid. The required cache line is then supplied by external memory, but unlike a sector replacement, the cache line within the sector that was not requested remains unaltered. In alternate embodiments, other organizations and replacement policies for instruction cache 14 may be utilized.
  • In one embodiment, [0075] microprocessor 10 may be configured to perform prefetching only in the case of sector replacements. During sector replacement, the required cache line is filled. If the required cache line is in the first half of the sector, the other cache line in the sector is prefetched. If the required cache line is in the second half of the sector, no prefetching is performed. Other prefetching methodologies may also be employed in different embodiments of microprocessor 10.
  • When cache lines of instruction data are retrieved from external memory by [0076] bus interface unit 24, the data is conveyed to predecode logic block 12. In one embodiment, the instructions processed by microprocessor 10 and stored in cache 14 are variable-length (e.g., the x86 instruction set). Because decoding variable-length instructions is particularly complex, predecode logic 12 may be configured to provide additional information to be stored in instruction cache 14 to aid during decode. In one embodiment, predecode logic 12 generates “predecode bits” for each byte in instruction cache 14. The predecode bits may provide various information useful during the decode process, e.g., the number of bytes to the start of the next variable-length instruction. The predecode bits are passed to decode unit 20 when instruction bytes are requested from cache 14.
  • In one embodiment, [0077] instruction cache 14 is implemented as a 32-Kbyte, two-way set-associative, writeback cache. The cache line size may be 32 bytes in this embodiment. Cache 14 also includes a 64-entry TLB that may be used to speed linear to physical address translation. Other variations of instruction cache 14 are possible and contemplated.
  • [0078] Instruction cache 14 receives instruction fetch addresses from cache controller 18. In one embodiment, up to 16 bytes may be fetched from cache 14 per clock cycle. The fetched information is placed into an instruction buffer that feeds into decode unit 20. In one embodiment of microprocessor 10, fetching may occur along a single execution stream with seven outstanding branches taken. In another embodiment, fetching may take place along multiple execution streams.
  • In one embodiment, the instruction fetch logic within [0079] cache controller 18 is capable of retrieving any 16 contiguous instruction bytes within a 32-byte boundary of cache 14 with no additional penalty when the 16 bytes cross a cache line boundary. New instructions are loaded into the instruction buffer as the current instructions are consumed by decode unit 20. Other configurations of cache controller 18 are also possible and contemplated.
  • In one embodiment, decode [0080] logic 20 may be configured to decode multiple instructions per processor clock cycle. Decode unit 20 may further be configured to accept instruction and predecode bytes from the instruction buffer (in x86 format), locate actual instruction boundaries, and generates corresponding “RISC ops”. RISC ops are fixed-format internal instructions, most of which are executable by microprocessor 10 in a single clock cycle. In one embodiment of microprocessor 10, RISC ops are combined to form every function in the x86 instruction set. Microprocessor 10 may use a combination of decoders to convert x86 instructions into RISC ops. In one embodiment, the hardware comprises three sets of decoders: two parallel short decoders, one long decoder, and one vector decoder. The parallel short decoders translate the most commonly-used x86 instructions (e.g., moves, shifts, branches, etc.) into zero, one, or two RISC ops each. The short decoders only operate on x86 instructions that are up to seven bytes long. In addition, they are configured to decode up to two x86 instructions per clock cycle. Commonly-used x86 instructions which are greater than seven bytes long, as well as those semi-commonly-used instructions that are up to seven bytes long, are handled by the long decoder.
  • The long decoder in [0081] decode unit 20 only performs one decode per clock cycle generating up to four RISC ops. All other translations (complex instructions, interrupts, etc.) are handled by a combination of the vector decoder and an on-chip ROM. For complex operations, the vector decoder logic provides the first set of RISC ops and an initial address to a sequence of further RISC ops within the on-chip ROM. The RISC ops fetched from the on-chip ROM are of the same type that are generated by the hardware decoders.
  • In one embodiment, decode [0082] unit 20 generates a group of four RISC ops each clock cycle. For clock cycles in which four RISC ops cannot be generated, decode unit 20 places RISC NOP operations in the remaining slots of the grouping. These groupings of RISC ops (and possible NOPs) are then conveyed to scheduler buffer 32. It is noted that in other embodiments, microprocessor 10 may be configured to decode other instructions sets in place of, or in addition to, the x86 instruction set.
  • [0083] Instruction control logic 34 contains the logic necessary to manage out-of-order execution of instructions stored in scheduler buffer 32. Instruction control logic 34 also manages data forwarding, register renaming, simultaneous issue and retirement of RISC ops, and speculative execution. In one embodiment, scheduler buffer 32 holds up to 24 RISC ops at one time, which is equivalent to a maximum of twelve x86 instructions. When possible, instruction control logic 34 may simultaneously issue (from buffer 32) RISC ops to any available execution units 36. In one embodiment, control logic 34 may be configured to issue up to six and retire up to four RISC ops per clock cycle.
  • In one embodiment, [0084] store unit 36A and load unit 36B may each have two-stage pipelines. Store unit 36A may be configured to perform memory and register writes such that the data is available for loading after one clock cycle. Similarly, load unit 36B may be configured to perform memory reads such that the data is available after two clock cycles. Other configurations for load and store units 36A and 36B are also possible with varying latencies.
  • Execution unit [0085] 36G (the branch resolving unit) is separate from branch prediction logic 22 in that it resolves conditional branches such as JCC and LOOP after the branch condition has been evaluated. Branch resolving unit 36G allows efficient speculative execution, enabling microprocessor 10 to execute instructions beyond conditional branches before knowing whether the branch prediction was correct. As described above, microprocessor 10 may be configured to handle up to seven outstanding branches in one embodiment.
  • [0086] Branch prediction logic 22, coupled to decode unit 20, is configured to increase the accuracy with which conditional branches are predicted in microprocessor 10. Ten to twenty percent of the instructions in typical applications include conditional branches. Branch prediction logic 22 is configured to handle this type of program behavior and its negative effects on instruction execution, such as stalls due to delayed instruction fetching. In one embodiment, branch prediction logic 22 includes an 8192-entry branch history table, a 16-entry by 16 byte branch target cache, and a 16-entry return address stack. Branch prediction logic 22 may implement a two-level adaptive history algorithm using the branch history table. The table stores executed branch information, predicts individual branches, and predicts behavior of groups of branches. In one embodiment, the branch history table does not store predicted target addresses in order to save space. Instead, the addresses are calculated on-the-fly during the decode stage.
  • To avoid a clock cycle penalty for a cache fetch when a branch is predicted taken, a branch target cache within [0087] branch logic 22 supplies the first 16 bytes at that address directly to the instruction buffer (assuming a hit occurs in the branch target cache). In one embodiment, branch prediction logic 22 achieves branch prediction rates of over 95%.
  • [0088] Branch logic 22 may also include special circuitry designed to optimize the CALL and RET instructions. This circuitry allows the address of the next instruction following the CALL instruction in memory to be pushed onto a return address stack. When microprocessor 10 encounters a RET instruction, branch logic 22 pops this address from the return stack and begins fetching.
  • Like [0089] instruction cache 14, data cache 26 may also be organized as two-way set associative 32-Kbyte storage. In one embodiment, data TLB 28 includes 128 entries that may be used to translate linear to physical addresses. Like instruction cache 14, data cache 26 may also be sectored. Data cache 26 may further implement a MESI (modified-exclusive-shared-invalid) protocol to track cache line status. Other configurations of data cache 26 are also possible and are contemplated.
  • Computational Core [0090]
  • Turning now to FIG. 4, more detail of one embodiment of [0091] computation core 42 is shown.
  • In one embodiment, [0092] computation core 42 comprises three execution units 36C-E and a multiplier 50. Integer/MMX/3D execution unit 36C is a fixed point execution unit which is configured to operate on all ALU operations, as well as multiplies, divides (both signed and unsigned), shifts, and rotates. In contrast, integer/MMX/3D execution unit 36E (Integer Y unit) is a fixed point execution unit configured to operate only on the basic word and doubleword ALU operations (ADD, AND, CMP, etc.).
  • [0093] Execution units 36C and 36D are also configured to accelerate performance of software written using multimedia and 3D graphics instructions. Applications that can take advantage of multimedia and 3D graphics instructions include 3D modeling and rendering, video and audio compression/decompression, speech recognition, and telephony. Execution units 36C and 36D may be configured to execute multimedia instructions in a single clock cycle. Many of these instructions are designed to perform the same operation to multiple sets of data at once (i.e., vector processing). In one embodiment, execution units 36C and 36D use registers which are mapped onto the stack of floating point unit 36E.
  • [0094] Execution unit 36E contains an IEEE 754-compatible floating point unit designed to accelerate the performance of software which utilizes the x86 instruction set. Floating point software is typically written to manipulate numbers that are either very large or small, require a great deal of precision, or result from complex mathematical operations such as transcendentals. Floating point execution unit 36E may comprise an adder unit, a multiply unit, and a divide/square root unit. In one embodiment, these low-latency units are configured to execute floating point instructions in as few as two clock cycles.
  • In one embodiment, [0095] execution units 36C and 36D are coupled to multiplier 50 and are configured to utilize multiplier 50 as a shared resource. Advantageously, this configuration allows both execution units 36C and 36D to perform multiplication without requiring two multipliers. In another configuration, each execution unit 36C and 36D may each have their own dedicated multiplier. Still other configurations are possible and contemplated. For example, two n-bit multipliers may be shared by execution units 36C and 36D. Configuring microprocessor 10 with two multipliers each having a width of 32-bits advantageously allows two single precision multiplications to be executed in parallel (each operand/significand is 24 bits wide), or one MMX packed multiply (i.e., multiplying a pair of vectors wherein each vector comprises four 16-bit components). In another embodiment, multiplier 50 may be configured to accept operands that are 76-bits wide (i.e., the width of the significand in a double precision floating point data type), thereby providing the same functionality as two separate 32-bit multipliers while further alleviating the need for a separate multiplier in floating point unit 36E. In such an embodiment, execution units 36C-36E may be directly coupled to multiplier 50, with each execution unit sharing multiplier 50.
  • [0096] Multiplier 50 may also be configured to perform both signed and unsigned multiplication. Advantageously, this allows multiplier 50 to support both integer multiplication for MMX instructions, and floating point multiplication for 3D graphics instructions.
  • While [0097] multiplier 50 may be configured to perform multiplication using a number of different algorithms, the embodiment shown in the figure is configured to use a modified version of Booth's Algorithm to improve multiplication times. Booth's algorithm relies upon calculating a number of partial products and then summing them to obtain a final product. Booth's algorithm is able to improve multiplication times over the standard “add-and-shift” algorithm by reducing the number of partial products that need to be summed in order to obtain the final product. For example, in performing an 8-bit by 8-bit multiplication, the shift-and-add algorithm generates eight partial products. By contrast, same 8-bit by 8-bit multiplication using the 2-bit version of Booth's algorithm generates only five partial products. This reduction in the number of partial products is illustrated in FIGS. 5A and 5B.
  • Turning now to FIG. 6, more detail of one embodiment of [0098] multiplier 50 is shown. In this embodiment, multiplier 50 comprises a partial product generator 60, a partial product selection logic unit 62, and an adder 64. As shown in the figure, partial product generator 60 is coupled to selection logic unit 62, which is in turn coupled to adder 64. When one of execution units 36C-36E receives an instruction invoking the multiplication function, the execution unit conveys two operands to multiplier 50, i.e., a multiplicand operand 72 and a multiplier operand 74. Partial product generator 60 is coupled to receive multiplicand operand 72, which is used as a starting value for calculating a plurality of partial products 70. For example, if partial product generator 60 is configured to use the 2-bit version of Booth's algorithm, the following partial products would be generated: the multiplicand itself (“+M”), a shifted version of the multiplicand (“+2M”), an inverted version of the multiplicand (“−M”), a shifted and inverted version of the multiplicand (“−2M”), and two constants, i.e., a positive zero (“+0”) and a negative zero (“−0”) in two's complement form.
  • Partial [0099] product selection unit 62 is coupled to receive multiplier operand 74. Selection unit 62 is configured to select a number of partial products from generator 60 based upon particular fields within multiplier operand 74. For example, using the 2-bit version of Booth's algorithm, multiplier operand 74 is padded with leading and trailing zeros (assuming an unsigned multiplication is being performed), and then one partial product is selected by each 3-bit field within the operand.
  • Finally, adder [0100] 64 is configured to receive and sum the partial products selected by selection unit 62. As noted in the figure, the selected partial products 68 are shifted before they are summed. The resulting final product 76 is output to the execution unit that transmitted the operands. As previously noted, multiplier 50 may advantageously be configured to perform both signed and unsigned multiplication. This is described in greater detail below.
  • Scalar Unsigned Multiplication [0101]
  • Turning now to FIG. 7, details of one embodiment of [0102] multiplier 50 are shown. The figure also illustrates the operation of multiplier 50 for an unsigned multiplication. While the figure shows an 8-bit by 8-bit multiplier using the 2-bit version of Booth's algorithm, other configurations are possible and contemplated, e.g., a 32-bit by 32-bit multiplier using a 3-bit version of Booth's algorithm. In this embodiment, multiplier 50 further comprises a “sign-in” input 78, which indicates whether a signed or unsigned multiplication is to be performed. Sign-in input 78 is coupled to AND gate 86A, which also receives the most significant bit (“MSB”) of multiplier operand 74. AND gate 86A outputs an “effective sign” bit 90 for multiplier operand 74 which is copied and appended to multiplier operand 74 for use by selection logic unit 62. Sign-in input 78 is also routed to AND gate 88B, which similarly calculates and appends an effective sign bit 92 for multiplicand operand 72. While other effective sign calculation logic may be used, the configuration illustrated advantageously generates an effective sign of zero for all unsigned operands and positive signed operands using a minimum amount of logic. Furthermore, in the embodiment shown only signed negative operands receive an asserted effective sign bit.
  • Partial [0103] product generation logic 60 uses multiplicand operand 72 and effective sign bit 92 to generate a number of partial products 80A-80C. For example, a shifted version 80A of multiplicand operand 72 is generated by shifting logic 84B. Shifted version 80A is equivalent to two times the multiplicand operand (+2M). Similarly, inverters 98 generate an inverted (i.e., one's complement) version (−M) of multiplicand operand 72. Shifting logic 84A is used to generate a shifted and inverted version 80C (−2M) of multiplicand operand 72. Partial product generation logic 60 also generates constants for use as partial products, e.g., positive zero 82B (+0) and negative zero 82A (−0). As illustrated in the figure, each partial product 80A, 80B, 80C, 72, 82A, and 82B may have an extra constant bit 88 associated with it. Extra constant bit 88 is asserted only for negative partial products, i.e., −M, −2M, and −0, and is added to the partial product within adder 64 to generate two's complement versions of the inverted partial products. The shaded areas of the figure denote constants that may be designed into multiplier 50.
  • Once [0104] partial product generator 60 has generated the partial products, selection logic 62 is configured to select partial products based upon 3-bit fields from multiplier operand 74. Multiplier operand 74 is padded with zeros and copies of effective sign bit 90 so that there are no fractional 3-bit fields. Selection logic 62 may comprise a number of multiplexers 94A-94F, one for each partial product to be selected. Each multiplexer 94A-94E is controlled by a different 3-bit field from multiplier operand 74. The 3-bit fields determine which partial product from those generated by partial product generator 60, i.e., +M, +2M, −M, −2M, +0, −0, will be selected. The selected partial products are then conveyed to adder 64. Using 2-bit Booth decoding, Table 1 describes how partial products will be selected.
    TABLE 1
    3-bit Multiplier Field Value Partial Product Selected
    000 +0
    001 +M
    010 +M
    011 +2M
    100 −2M
    101 −M
    110 −M
    111 −0
  • [0105] Adder 64 is configured to receive and sum the selected partial products. As illustrated in the figure, the partial products are shifted before being summed. Some of the partial products may have prefix bits added to eliminate the need for sign extending the partial product's most significant bit (i.e., sign bit) to the maximum width of final product 76. The prefixes may be generated using simple inverters coupled to the partial product's most significant bit and constants. Once the partial products are shifted, padded, and summed, final product 76 is output and conveyed to the execution unit that provided the operands. Adder 64 may use a number of different algorithms for summing the partial products. For example, adder 64 may configured as a carry look-ahead adder, a carry skip adder, a carry select adder, a carry-save adder, or a carry propagate adder.
  • The exemplary values in the figure illustrate the unsigned multiplication of two values, 240[0106] 10 and 23010. Sign-in input 78 is unasserted because unsigned multiplication to be performed. Sign-in input 78 may be provided by the same execution unit that provided the operands. The execution unit may generate sign-in input bit 78 based upon the type of multiply instruction it received. In the example shown in the figure, effective signs 90 and 92 are both zero because sign-in input 78 is unasserted. As shown in the illustration, an 8-bit by 8-bit version of multiplier 50 is able to multiply 8-bit unsigned operands (i.e., operands that do not have a sign bit) having values from 0 to 255 to obtain a 16-bit unsigned result.
  • Scalar Signed Multiplication [0107]
  • Turning now to FIG. 8, the same 8-bit by 8-bit version of [0108] multiplier 50 is shown. In this figure, however, multiplier 50 is performing signed multiplication. Sign-in input 78 is asserted because signed multiplication is to be performed. In the example illustrated, multiplicand operand 72 equals 10010, while multiplier operand 74 equals −50 10. Multiplier operand 74 is received in two's complement format because it is a negative signed value. Thus its effective sign bit 90 (as calculated by AND gate 88A) is asserted. In contrast, effective sign bit 92 for multiplicand operand 72 is unasserted because multiplicand operand 72 is positive. The final product 76 is a negative 16-bit number (−500010) represented in two's complement format with the MSB indicating the sign.
  • Turning now to FIG. 9, another example of [0109] multiplier 50 performing a signed multiplication is shown. In this example, however, both multiplier operand 74 (having a value of −5010) and multiplicand operand 72 (having a value of −10010) are received in two's complement format. The multiplication results in a signed final product 76 (having a value of 500010) that is positive. As FIGS. 6-8 illustrate, multiplier 50 may advantageously perform both signed and unsigned multiplication with the same hardware. Furthermore, multiplier 50 may advantageously be configured to use Booth's algorithm to further increase multiplication performance.
  • Component-wise Vector Multiplication [0110]
  • As previously noted, recent advances have placed a greater emphasis on microprocessors' multimedia and graphics performance. Multimedia and 3D extensions to the basic x86 instruction set include vectored multiply instructions to improve performance. Turning now to FIG. 10, an embodiment of [0111] multiplier 50 capable of performing vector multiplication is shown. As in previous embodiments, multiplier 50 comprises partial product generator 60, selection logic 62, and adder 64. This embodiment of multiplier 50 is configured to perform component-wise vector multiplication of two pairs of N-bit values (A1×B1 and A2×B2) simultaneously or a scalar multiplication of one pair of 2N-bit values (A×B). Advantageously, multiplier 50 may take the place of three separate multipliers (i.e., one for scalar multiplication and two for the vector multiplication), thereby saving valuable die space.
  • In this embodiment, [0112] multiplier 50 has several features which allow it to perform both scalar and component-wise vector multiplication. When scalar multiplication is performed, multiplier 50 functions as previously disclosed, i.e., adder 64 will sum the partial products selected by selection logic 62 from partial product generator 60 to form final product 76. When performing component-wise vector multiplication, however, multiplier 50 is configured to effectively operate as two separate multipliers. This behavior ensures that the results generated by multiplier 50 will equal the results that would have been generated had two separate multipliers been used. To indicate whether multiplier 50 should perform component-wise vector multiplication or scalar multiplication, multiplier 50 receives a vector_in input signal 120. When an asserted vector_in signal is received, a plurality of multiplexers within selection logic 62 (e.g., multiplexers 122 and 124) effectively isolate the two “logical halves” of multiplier 50. This separation prevents partial products from one pair of vector components (e.g., A1 and B1) from interfering with the multiplication of another pair of vector components (e.g., A2 and B2). The operation of multiplexers 122 and 124 is described in greater detail below.
  • As shown in the figure, [0113] multiplicand operand 72 and multiplier operand 74 may each comprise a vector (two N-bit values) or a scalar value (a single 2N-bit value). For example, multiplicand operand 72 may comprise a vector (A2, A1) or a single scalar value A. The partial products selected by selection logic 62 may be logically divided into four quadrants 130-136 for component-wise vector multiplications (assuming vector operands each having two vector components). Quadrant 130 represents the higher order bits of partial products selected by the least significant vector component of vector multiplier 74 (i.e., B1). Quadrant 132 represents the lower order bits of partial products selected by the least significant vector component of vector multiplier 74 (i.e., B1). Quadrant 134 represents the lower order bits of partial products selected by the most significant vector component of vector multiplier 74 (i.e., B2). Quadrant 136 represents the higher order bits of partial products selected by the most significant vector component of vector multiplier 74 (i.e., B2).
  • As the selected partial products are shifted before being summed in [0114] adder 64, the least significant bits of partial products selected by vector component B2 located within quadrant 134 may affect the addition performed to generate A1×B1 within final product 76. To prevent this “corruption” of final product 76, multiplexer 124 is configured to “zero-out” the lower order bits of partial products located within quadrant 134. Similarly, in some embodiments the higher order bits of partial products selected by vector component B1 may extend into quadrant 130, thereby possibly affecting the summation used to form B1×B2 within final product 76. Thus additional multiplexers similar to multiplexer 124 may be used to zero-out the higher order bits within quadrant 130.
  • Multiplexer [0115] 122 also assists in the logical separation that is advantageous for component-wise vector multiplication. Staggered bit fields within multiplier operand 74 are used to select partial products from partial product generator 60. When a bit field encompasses bits from more than one vector component within multiplier operand 74, the resulting partial product may also be “corrupted.” For example, selecting a partial product using one bit from vector component B1 and two bits from vector component B2 (as illustrated in the figure) will result in a partial product that is partially representative of vector component B1 and partially representative of vector component B2. This is undesirable because B1 is to be multiplied with A1 separately from B2. To remedy this, a multiplexer 122 may be used. When a bit field encompasses bits from more than one vector component, multiplexer 122 may zero-out the unwanted bit or bits (e.g., the most significant bit from B1 as shown in the figure). Thus, the partial product selected by multiplexer 94B will reflect only the bit values within the desired vector component. A second multiplexer similar to multiplexer 122 may zero out the opposite bits. Thus two partial products may be selected, one representing the end of vector operand B1 and one representing the beginning of vector operand B2. The zeroing-out of bits for partial product selection and summation are illustrated in more detail by way of a numerical example in FIGS. 11A through 12.
  • Turning now to FIG. 11A, more detail of one embodiment of [0116] partial product generator 60 is shown. To support component-wise vector multiplication when the vector components are signed, an additional effective sign bit 172A-172F may be generated for the lower-order portion of each partial product. The same logic may be used as previously disclosed, with AND-gate 86B being duplicated (see AND-gate 86C) to generate an effective sign for each lower-order vector component. Advantageously, multiplier 50 may be configured to perform both signed and unsigned vector multiplication. Generator 60 may also be configured to generate separate constant bits 88A-F (referred to as S1) and 170A-F (referred to as S2) to further improve separability when the selected partial products are summed in adder 64. The extra constant bits 170A-F and effective sign bits 172A-F may simply remain unused or unselected during scalar multiplication. Note the figure illustrates one possible set of partial products generated for an unsigned component-wise vector multiplication wherein the multiplicand operand 72 has the values of (6,7), i.e., A2×6 and A1=7. Sign_in input 78 is unasserted to indicate that an unsigned multiplication is being performed.
  • Turning now to FIG. 11B, detail of part of one embodiment of [0117] selection logic 62 is shown. In order to support both scalar and vector multiplication, selection logic 62 may comprise a plurality of multiplexers 310A-B, 312A-B, 314A-B, and 316A-B. These multiplexers operate to select particular bits from partial product generator 60 according to the status of vector_in signal 120. Each partial product has its own set of selection multiplexers (excluding constants +0 and −0 which are simply fed through as is; see 320A and 320B). For example, multiplexer 310A selects bits [9-0] from the partial product −2M and outputs them to the rest of selection logic 62 and adder 64 if vector_in is asserted. This may ensure that both effective sign bits 92A and 172A are conveyed to adder 64. Two effective sign bits are needed because two separate multiplications are being performed. Conversely, if vector_in is unasserted (indicating a scalar multiplication), extra effective sign bit 172A is not needed, thus multiplexer 31 OA selects bits [9-6, 4-0] and outputs them as bits [0-8]. The extra effective sign bit 172A is removed, and a constant zero is padded to the output to create bit [9]. As indicated in the figure, bit [S1] may be passed through as it is needed in both cases (scalar and component-wise vector multiplication). Multiplexer 310B selects bit [S2] if vector_in signal 10 is asserted, thereby providing two constants 88A and 170A. If vector_in signal 120 is not asserted and scalar multiplication is being performed, bit [S2] is not needed (and may cause an incorrect result if it is passed through to adder 64). Thus, multiplexer 310B is configured to select and convey a constant zero in place of actual S2 bit 170A if scalar multiplication is performed. Multiplexers 312A-B, 314A-B, and 316A-B operate in a similar fashion. Each multiplexer may be configured to select the required bits from partial product generator 60 without passing extra bits unless they are needed.
  • Turning now to FIG. 12A-B, more details of one embodiment of [0118] selection logic 62 and adder 64 are shown. In this embodiment, selection logic 62 comprises a plurality of multiplexers 94A-94F as in the previous embodiments. Note that multiplexers 312A-B, 314A-B, and 316A-B are not shown, but are instead included within partial product generator 60. Selection logic 62 further comprises multiplexers 152-156, which operate to select two portions of partial products: (1) a portion of the partial product corresponding to the higher order bits of vector operand B1, and (2) a portion of the partial product corresponding to the lower order bits of vector operand B2. Multiplexer 156 then selects this “combination” partial product when vector_in signal 120 is asserted. Advantageously, this configuration may remedy the problem of summation corruption when a bit field encompassing bits from more than one vector operand is used to select a partial product. This problem is described in greater detail below (see FIGS. 13 and 14).
  • In this embodiment, [0119] adder 64 comprises three pluralities of multiplexers 160A-160D, 162A-162E, and 164C-164E. Multiplexers 160A-160D are controlled by vector_in signal 120 and operate to “zero-out” portions of the partial products to prevent corruption of the vector components within final product 76 during the summation within adder 64. Multiplexers 164C-E are also controlled by vector_in signal 120 and operate to select either extra constant bits 140C-140E (in the event of a vector multiplication) or a zero constant (in the event of a scalar multiplication) for addition into the more significant product. Multiplexers 162A-162D are controlled by sign_in input 78 and are configured to select either the effective sign bit of the more significant portion of the selected partial product (in the event of a signed vector multiplication) or the actual sign (in the event of an unsigned vector multiplication). Multiplexers 164C-164E are also controlled by vector_in signal 102 and perform the same function as multiplexers 310B, 312B, 314B, and 316B, i.e., they select a constant zero instead of extra constant bit S2 if scalar multiplication is performed. Note that other configurations of logic for zeroing out and partial product selection are possible and contemplated. Further note that multiplexers 160A-160D, 162A-162E, and 164C-164E may be configured as part of adder 64, selection logic unit 62, or as a separate part of multiplier 50.
  • In addition to the features disclosed above, [0120] adder 64 may further comprise a plurality of multiplexers (not shown) to prevent carries across the boundaries of vector operands within final product 76 when summing the selected partial products. This boundary is represented by a dashed line 178 in the figure. Other embodiments of multiplier 50 may utilize different configurations of multiplexers. For example, multiplexers 160A-160C may be configured to select either additional sign-extension bits or the most significant bits of the selected partial products. In addition, multiplexers 160A-160C may be configured to pad each selected partial product with prefix bits until the most significant bit of each selected product corresponds to the most significant bit of final product 76 (as indicated by dashed bit positions 170A-170B). The prefix bits may comprise a constant, sign extension bits, or a combination thereof.
  • Note that FIGS. [0121] 11A-B and 12 together illustrate the exemplary component-wise multiplication of two vector operands, i.e., multiplier operand 74 having a value of (3,12), i.e., B2×3 and B1×12, and multiplicand operand 72 having a value of (6,7), i.e., A2×6, and A1×7, resulting in final product 76 having a value of (18,84). Further note that while the figures and exemplary embodiments have illustrated a multiplier configured to perform component-wise vector multiplication on vector operands having up to two vector components, other configurations are possible and contemplated, e.g. vectors having four or six vector components may be multiplied component-wise in parallel. Furthermore, a number of multipliers configured similarly to multiplier 50 may be used in parallel to achieve even higher performance. The widths of multiplier operand 74 and multiplicand operand 72 may also be varied, e.g., 32-bits or 64-bits, as may the widths of their vector components.
  • In addition, other embodiments of [0122] multiplier 50 may be configured to return only a portion of final product 76 per clock cycle. For example, the most significant vector component of final product 76 may be returned during a first clock cycle. Other vector components may be returned during subsequent clock cycles in order of their significance.
  • Turning now to FIG. 13, another embodiment of [0123] multiplier 50 is shown. In this embodiment, multiplier 50 further comprises multiplexer 138. When vector_in signal 120 is asserted, component-wise vector multiplication is performed. If the summing of partial products generates one or more carry bits 140, the upper vector component in final product 144 may be corrupted if carry bits 140 are allowed to propagate across boundary 176. To prevent this, multiplier 50 may comprise one or more carry multiplexers 138 to prevent carry bits from propagating to higher order vector components within final product 76. When multiplier 50 is performing scalar multiplication, multiplexers 138 may be configured to propagate carry bits normally. As shown in the figure, in this embodiment of multiplier 50 the partial products in quadrant 130 are zeroed out such that they will not affect the value of final product 144.
  • Turning now to FIG. 14, another embodiment of [0124] multiplier 50 is shown. In this embodiment, the partial products in quadrant 130 are not zeroed out. Instead, the selected partial products in quadrant 132 are allowed to sign extend across quadrant 130. In some instances, e.g., when vector components A1 and B1 have opposite signs, final product 76 will have a lower order vector component 142 that will be negative and may result in a sign extensions across quadrant 130. This sign extension may affect the value of the more significant vector component 144 within final product 76. Multiplexer 146 is configured to insert a constant to be summed with the selected partial products to form final product vector component 144. The constant (e.g., a binary value of one) is calculated to compensate for a negative sign extension across final product 144. For example, a negative sign extension may be equivalent to “11111111,” thus adding a constant of one (i.e., “00000001”) will negate the effect of the sign extension on result vector component 144. As this sign extension occurs only when vector components A1 and B1 have different signs, an XOR-gate 148 may be used in conjunction with vector_in input 120 to control multiplexer 146 so that the constant is only added when final product 142 will be negative and a component-wise vector multiplication is being performed. As illustrated, XOR-gate 148 may receive the sign bits (i.e., the most significant bits) of vector components A1 and B1 as inputs.
  • Vector Dot Product [0125]
  • [0126] Multiplier 50 may also be configured to calculate the “vector dot product” or inner product of two vectors. The following example illustrates the calculation of a vector dot product. Assuming vector A equals (x1, x2, x3), and vector B equals (y1, y2, y3), then the vector dot product A•B equals x1y1+x2y2+x3y3. As this example illustrates, calculation of the dot product entails performing a component-wise vector multiplication and then summing the vector component products.
  • Turning now to FIG. 15, one embodiment of [0127] multiplier 50 configured to calculate the vector dot product is shown. As shown in the figure, partial products 190 are summed within adder 64 to form vector component products 192A-N. Each vector component product 192A-N corresponds to one vector pair within multiplicand operand 72 and multiplier operand 74 as previously disclosed. Vector component products 192A-N are then summed using a plurality of carry-propagate adders 194A-N to form final result 196, which may then be output for use by other parts of microprocessor 10.
  • Turning now to FIG. 16, another embodiment of [0128] multiplier 50 configured to calculate the vector dot product is shown. In this embodiment, however, partial products 190 summed by adder 64 are kept in redundant form, i.e., each vector component product 192A-F is represented by more than one value. For example, each vector component product 192A-F may be represented by two values, a sum value 198A-F and a carry value 200A-F. A set of carry-save adders (not shown) may be used within adder 64 to sum partial products 192 in redundant form. Advantageously, carry-save adders may significantly reduce the amount of time and die space required to sum partial products 192. At the single-bit level, a carry-save adder will take three bits of the same significance and produce a sum value (having the same significance) and a carry value (having a significance one bit higher than the sum value). In contrast, the term “carry-propagate adder” denotes an adder that is not a carry-save adder. In one embodiment, a carry-save adder may be implemented as a number of independent full adders.
  • Once [0129] vector component products 192A-192F have been formed, they may be summed together using a second set of carry-save adders 202A-J. When the number of values remaining to be summed is reduced to two, a carry-propagate adder 204 may be used to perform the final summation. Note, however, that this configuration may require further modification if multiplier 50 is configured to propagate sign extension and carry bits as illustrated in FIG. 14. The embodiment of multiplier 50 illustrated in FIG. 14 relies upon carries from less significant products propagating into the more significant ones. In this case, summing partial products 190 and products 192A-F using carry-save adders may cause final result 196 to be less than the correct result by one unit-in-the-last-place (ULP) for each product below the most significant product. This is because carries from lower products are not incorporated into upper products during carry-save adds.
  • To ensure that [0130] final result 196 is correct when multiplier 50 is configured in a manner similar to the embodiment of FIG. 14, carry-propagate adder 204 may be configured to accept summands having a width equal to the cumulative width of all products 192A-F. Assuming the length of each operand (multiplier and multiplicand) is n bits wide and comprises p vector components, each product 192A-F will have a width of 2n/p. Thus to accommodate all products 192A-192F, adder 204 may be 2n bits wide or wider. The redundant forms of each product 192-192F (e.g., sum values 198A-F and carry values 200A-F) are conveyed as inputs to adder 204 (excluding the most significant product 192F). In place of the most significant product 192F, the final two summands remaining from the carry-save summation of products 192A-192F are input to adder 204 as the most significant inputs. While adder 204 will output a 2n-bit wide result, only the most significant 2n/p bits comprise the final result 196. This configuration advantageously allows adder 204 to propagate carry bits from lower order products to higher order products, thereby ensuring a proper result while still retaining the advantages associated with carry-save addition. Furthermore, the cost in die space of having a 2n-bit wide carry-propagate adder such as adder 204 may be reduced if other functions to performed by multiplier 50 also require a wide carry-propagate adder.
  • As with previous embodiments, this embodiment of [0131] multiplier 50 may be configured to accept operands having varying widths (n), and varying numbers of vector components (p). For example, multiplier 50 may be configured to calculate the dot product of two vector operands, each 64-bits wide and each having four vector components.
  • Rounded Products [0132]
  • As previously noted, some embodiments of [0133] multiplier 50 may be configured to conserve hardware resources (e.g., signal lines and registers) by returning only a portion of the final product (or products, in the case of component-wise vector multiplication) per clock cycle. For example, the higher order bits of the final product may be returned first, and then the lower order bits may be returned in subsequent clock cycles. However, in some embodiments it may be advantageous to return the higher order bits rounded to the nearest unit in the last place (“ULP”).
  • Turning now to FIG. 17, a diagram of another embodiment of [0134] multiplier 50 is shown. This embodiment is configured to round the higher order bits of each vector component product to the nearest ULP. As in the previous embodiment (illustrated in FIG. 16), partial products 190 are reduced in redundant form (e.g., a sum value and a carry value for each pairs of vector components) by adder 64. However, in this embodiment a plurality of adders 210A-210F. are used to add a rounding constant 214 to each vector component product. Rounding constant 214 may comprise a single asserted bit (i.e., a “one-hot”) added to the bit position below the least significant bit position in the portion of the vector component to be rounded. For example, assuming a vector component product has a width of 8 bits, and the four most significant bits (MSBs) are to be rounded, then a constant one would be added to the fourth bit (as illustrated in Table 2). By adding a constant one in the appropriate bit position, the upper portion of the vector component product may be rounded efficiently and without large amounts of additional logic.
    TABLE 2
    Bit Number -> 7 (MSB) 6 5 4 3 2 1 0 (LSB)
    Vector Component 0 1 1 0 1 0 1 1
    Product
    Rounding Constant
    0 0 0 0 1 0 0 0
    Rounded MSBs 0 1 1 1
    Output
  • As shown in FIG. 17, each [0135] adder 210A-210F is configured to receive the redundant form of a single vector component product. For example, adder 210A is configured to receive sum value 198A and carry value 200A and combine them with rounding constant 214. Adder 210A combines these three values and generates a redundant form output comprising a new sum value and a new carry value. Advantageously, adders 210A-210F may be configured as independent carry-save adders, thereby preventing carry-bits caused by rounding constant 214 from propagating to more significant vector component products. The outputs of each adder 210A-210F are coupled to the inputs of one of a plurality of carry-propagate adders 212A-212F. Each carry-propagate adder 212A-212F is configured to sum the outputs of adders 210A-210F and thereby generate a non-redundant form of each vector component product. The rounded MSBs of each vector product may be output first, while the remaining least significant bits (“LSBs”) may be output during a subsequent clock cycle. Adders 212A-212F may be configured independently to avoid the possibility of an unwanted carry-bit propagating across vector product boundaries.
  • In another embodiment, additional adders (not shown) may be configured to generate the LSBs (which are unrounded) separately from the MSBs. Advantageously, this may prevent the rounding process from altering the value of the LSBs. For example, [0136] adder 212A may be configured to generate the rounded MSBs by summing the sum and carry values generated by adder 210A, while an additional adder may be configured to sum the lower bits of sum value 198A and carry value 200A to generate the LSBs.
  • In the previously described embodiments, each [0137] adder 210A-210F and 212A-212F is configured to perform addition without propagating carry bits from one vector component product to another. While this may be desirable in many configurations, the non-propagation of carry bits may disrupt some configurations of adder 50. For example, the embodiment illustrated in FIG. 14 relies upon the propagation of sign extension bits across vector component product boundaries. If carry bits are not allowed to propagate during the final addition stages which convert the redundant-from vector component products to non-redundant-form, the higher order products may be incorrect.
  • Turning now to FIG. 18, an embodiment of [0138] multiplier 50 which rounds the higher order bits of each vector component product, yet still allows carry bits to propagate across consecutive vector component product boundaries, is shown. In this embodiment, rounding constant 214 is once again added to the redundant form sum values 198A-198F and carry values 200A-200F of each vector component product by carry-save adders 210A-210F. In order to allow carries from partial products 190 to propagate without allowing carries from rounding constant 214 to propagate, separate carry-propagate adders 212A-212F are used for each vector component product. The length of each adder 212A-212F may equal the number of bits in the vector component product itself plus all of the bits corresponding to less significant vector component products. For example, assuming each vector component product is eight bits wide, adder 212B may be 16 bits wide and may add redundant vector component values 198A-198C and 200A-200C. Advantageously, undesired carry-out bits from each vector component product will not affect higher order vector component products in this configuration. Furthermore, the carry bits that may be required for correct operation of the embodiment of multiplier 50 illustrated in FIG. 14 still propagate to form the correct result despite possible sign-extensions.
  • Note that other configurations of [0139] multiplier 50 are possible. For example, rounding constant 214 may be incorporated within the logic of adder 64, thereby potentially eliminating the need for an extra level of adders. Furthermore, multiplier 50 may be configured to round and return the upper portions of scalar products and vector dot products in addition to vector component products. The types of adders used may also be changed according to the implementation, e.g., carry-propagate adders may be used through out in conjunction with multiplexers configured to prevent carry bits from propagating across vector component product boundaries. In addition, various control signals, e.g., a round_in signal, may be used to indicate whether rounding is to be performed.
  • Fast Rounding and Normalization [0140]
  • Another possible area for improving the speed of multiplication relates to rounding and normalization. When performing floating point multiplication, the multiplier and multiplicand operands (i.e., the significands of two floating point numbers) are received in normalized form. A binary number is said to be normalized when the most significant asserted bit is directly to the left of the binary radix point. For example, 1.010011[0141] 2 is normalized, while 10.100112 and 0.010100112 are not. In order to normalize a binary number, the number is shifted either right or left until the most significant asserted bit is directly to the left of the binary radix point. The number's exponent is then increased or decreased an amount equal to the number of positions that the number was shifted.
  • When [0142] multiplier 50 performs floating point multiplication, it receives two normalized significands. In some embodiments, multiplier 64 may be configured to output the results in normalized form. For example, multiplier 50 may receive two 32-bit normalized significands as operands and be configured to output one 32-bit result in normalized form. After multiplier 50 generates and selects the partial products, they are summed by adder 64 to create the final result. As the final result may be in redundant form, it may be passed through a carry-propagate adder as previously described. Once in non-redundant form, the result is rounded and normalized before being output. Different methods of rounding are possible. For example, IEEE Standard 754 defines four different rounding methods: round to nearest (even), round to positive infinity, round to minus infinity, and round to zero. The round to nearest method is particularly useful because it ensures that the error in the final product is at most one-half ULP (unit in the last place).
  • Turning now to FIG. 19, another embodiment of [0143] multiplier 50 is shown. This embodiment comprises two “paths” which are configured to perform IEEE rounding and normalization by calculating two results in parallel, i.e., one result assuming there is an overflow and one result assume no overflow. This embodiment comprises a pair of carry-save adders 276A-B, a pair of carry-propagate adders 278A-B, a pair of sticky bit logic units 286A-B, and a pair of LSB fix-up logic units 288A-B. The “no-overflow path” comprises carry-save adder 276A, carry-propagate adder 278A, sticky bit logic unit 286A, and LSB fix-up logic unit 288A, while the “overflow path” comprises carry-save adder 276B, carry-propagate adder 278B, sticky bit logic unit 286B, and LSB fix-up logic unit 288B. Both carry-save adders 276A and 276B are configured to receive sum value 274A and carry value 274B from partial product array adder 64. Each carry-save adder 276A and 276B is also configured to receive a rounding constant 268 from multiplexer 266.
  • [0144] Multiplexer 266 is configured to select rounding constant 268 from one of four rounding constants. The first rounding constant is a hard-wired constant one and is selected when rounding mode input 270 indicates that round to nearest (even) is the selected rounding mode. The constant is added to the guard bit position by both carry save adders 276A and 276B. The second rounding constant is a hard-wired zero and is selected when rounding mode input 270 indicates that round to zero (truncate) is the selected rounding mode. The third rounding constant is the sign of the final product of the multiplication being performed. This sign may be obtained by exclusively ORing the sign bit 260A of multiplicand operand 72 and the sign bit 260B of multiplier operand 74 within XOR gate 262. The resulting sign bit is added to the guard bit position, and each bit position less significant than the guard bit position, by carry-save adders 276A and 276B. The fourth rounding constant is the inversion of the third rounding constant. It may obtained by inverting the rounding constant obtained from XOR gate 262 with inverter 264. The resulting inverted sign bit is added to the guard bit position and each bit position less significant than the guard bit position by carry-save adders 276A and 276B.
  • Carry-save [0145] adders 276A and 276B are configured to receive and add sum value 274A, carry value 274B, and the selected rounding constant from multiplexer 266. Carry- save adders 276A and 276B convey their results in redundant form to carry-propagate adders 278A and 278B, respectively. Carry-propagate adders 278A and 278B reduce the results to non-redundant form 282A and 282B and convey them to LSB fix-up logic units 288A and 288B, respectively.
  • In parallel with the addition performed by [0146] adders 276A-B and 278A-B, sticky bit logic units 280A-B calculate sticky bits 286A-B. Sticky bit logic units 280A-B each receive sum value 274A and carry value 274B as inputs. The calculation of sticky bits and the operation of sticky bit logic units 280A-B are described in greater detail below.
  • LSB fix-up [0147] logic units 288A and 288B are coupled to carry-propagate adders 278A-B and sticky bit logic units 280A-B. Fix-up logic units 288A-B are configured to conditionally invert the least significant bit of the non-redundant results received from adders 278A-B. In one embodiment, fix-up logic units 288A-B are configured to perform the inversion or “fix-up” when the “round to nearest” mode is being performed and the following equation is true: (inverse of L)·(G)·(inverse of S)=1, wherein L and G are the least significant bits (LSBs) and guard bits, respectively, of the sum of sum value 274A and carry value 274B, and wherein S is the corresponding sticky bit (either 286A or 286B). Note that L and G may be calculated within fix-up units 288A-B using sum value 274A and carry value 274. The calculation of L and G may be performed in parallel with the additions performed by adders 276A-B and 278A-B and need not include a rounding constant. L and G may be calculated within fix-up units 288A-B, or by using an extra component within multiplier 50 (e.g., a third pair of carry-save/carry-propagate adders). The fix-up may advantageously compensate for cases in which adders 276A-B have added a constant when a constant was not actually needed (e.g., result+1 is generated when result+0 is needed).
  • Next, the desired number of upper bits from the outputs of LSB fix-up [0148] logic units 288A and 288B may be conveyed to multiplexer 290, which selects one of the two values (overflow or no overflow) as output 292. Multiplexer 290 may be controlled by MSB 284 from the output of fix-up logic unit 288A. By looking at the most significant bit, a determination of whether an overflow occurred can be made. If an overflow occurred, the upper bits from the output of LSB fix-up logic unit 288A are selected. If an overflow did not occur, the upper bits from the output of LSB fix-up logic unit 288B are selected. Note that other control configurations are also possible, e.g., MSB 284 may be the most significant bit of the output from fix-up logic unit 288B. Furthermore, in some embodiments of multiplier 50 only one fix-up logic unit may be needed. For example, the single fix-up logic unit may be coupled to the output of multiplexer 290 and perform the fix-up before final result 292 is output.
  • In one embodiment, exponent [0149] control logic unit 254 is also controlled by the same signal that controls multiplexer 290. If an overflow occurs, exponent control logic unit 254 is configured to increment the corresponding exponent. This completes the normalization of the output.
  • Advantageously, the embodiment of [0150] multiplier 50 depicted in the figure may be able to round and normalize the final result in less time because normalization is performed in parallel. Furthermore, the fix-up is performed while multiplexer 290 is selecting a result (overflow or no overflow). This may further reduce the cycle time of this embodiment of multiplier 50.
  • Turning now to FIG. 20, a diagram illustrating the operation of one embodiment of carry-save [0151] adders 276A and 276B is shown. The example assumes eight bit sum and carry values 274A-B are being rounded to four bit values and that round to nearest (even) is being performed. Adders 276A-B each receive sum value 274A, carry value 274B, and rounding constant 268 as inputs. In the example shown, adder 276A is configured to add a constant one to the guard bit position of sum value 274A and constant value 274B assuming there will not be an overflow. The guard bit position is the bit position that is one bit less significant than the least significant bit of the portion to be output. An overflow occurs when the summation of sum value 274A, carry value 274B, and any added rounding constants, creates a carry out from the bit position directly to the left of the binary radix point. An overflow may require the result to be shifted to the right (and the corresponding exponent to be incremented) in order to produce a normalized output.
  • As the figure illustrates, adder [0152] 276A adds a constant one to the guard bit position of sum value 274A and carry value 274B assuming there will be no overflow. In contrast, adder 276B adds rounding constant 268 to the guard bit position of sum value 274A and carry value 274B assuming there is an overflow. Thus, adder 286B adds the constant one in a different bit position than adder 276A. For this reason, adders 276A and 276B each generate a different result. The results from adder 276A are conveyed to carry propagate adder 278A, which is configured to reduce them to non-redundant form. Similarly, the results from adder 276B are conveyed to carry propagate adder 278B, which operates in manner similar to adder 278A.
  • Turning now to FIG. 21, more detail of one embodiment of sticky [0153] bit logic unit 280A is shown. As the figure illustrates, sticky bit logic 280A receives the lower four bits of the sum and carry values (350 and 352, respectively) generated by adder 276A. A constant 354 (e.g., 1111) is added to the sum and carry bits within carry save adder 340A, thereby generating two different 4-bit outputs which are routed to exclusive NOR gate 342A. The output from exclusive NOR gate 342A is routed to 4-input OR gate 344A, which outputs sticky bit 286A. Sticky bit logic 280B is configured similarly to sticky bit logic 280A, but it may be configured to receive one extra bit, e.g., five bits as opposed to four bits, due to the assumed overflow.
  • Turning now to FIG. 22, a numerical example of the operation of the embodiment of [0154] multiplier 50 from FIG. 20 is shown. This example assumes an eight bit output from adder 64 is being rounded to a four bit result. The figure shows each of the four IEEE rounding modes being performed by both carry-save adders 276A and 276B. The selected rounding constant 268 corresponds to the rounding mode. The selected rounding constant 268 is added to sum value 274A and carry value 274B by carry save adders 276A and 276B. As the figure illustrates, the starting bit position to which the constant is added varies from adder 276A to adder 276B. As previously noted, this is because adder 276A adds the constant to the guard bit position assuming there is no overflow, while adder 276B assumes there is an overflow. In parallel, sticky bit logic units 280A and 280B each calculate their own version of the sticky bit (286A and 286B, respectively), also reflecting whether or not an overflow is presumed to occur.
  • Next, LSB fix-up [0155] logic units 288A and 288B fix-up (invert) the LSB of output 282A, if necessary. As the figure illustrates, the fix-up is only performed when round to nearest (even) is the selected rounding mode and the formula (inverse of LSB)·(Guard bit)·(inverse of Sticky Bit)=1 is true. Note that in this embodiment the LSB and Guard bit are taken from the sum of sum value 274A and carry value 274B without selected rounding constant 268. After the fix-up, the upper four bits are output to multiplexer 290. In one embodiment, LSB fix-up logic 288A and 288B may each comprise a single inverter configured to invert the least significant bit of results 282A and 282B, respectively.
  • Other configurations of [0156] multiplier 50 are possible and contemplated. Turning now to FIG. 23, another embodiment of multiplier 50 configured to perform rounding and normalization is shown. In this embodiment, the “fix-up” or inversion of the LSB is performed by a single LSB fix-up logic unit 288 after multiplexer 290 performs the overflow/no overflow selection. A second multiplexer 290B is included to select which sticky bit 286A or 286B will be used by LSB fix-up logic unit 288 in determining whether to perform the inversion. Note the rounding and normalization hardware disclosed herein may be configured to round and normalize redundant results from other functional units also, e.g., adders.
  • Fast Newton-Raphson Iteration to Calculate the Reciprocal (1/B) [0157]
  • As [0158] microprocessor 10 already contains a highly optimized multiplier 50, it would be advantageous to perform other calculations on multiplier 50 as well, e.g., division. This may be accomplished by recasting division operations into reciprocal operations followed by multiplication operations. For example, the operation “A divided by B” (A/B) may be recast into “A multiplied by the reciprocal of B” (A×B−1). Forming the reciprocal of B may also be recast into a series of multiplication operations by using a version of the Newton-Raphson iteration. The Newton-Raphson iteration uses the equation X1=X0×(2−X0×B) to calculate the reciprocal of B. The initial estimate, X0, may be determined in a number of different ways. For example, X0 may be read from a ROM table using B as the index, wherein X0 approximates 1/B. In another embodiment, X0 may be calculated directly from B or from one or more ROM tables configured to output seed values. The seed values may be manipulated, e.g., using arithmetic and combinational logic, to determine X0. Once X0 is known, the first iteration may be performed. Thereafter, the results from each iteration are used in place of X0 in subsequent iterations. This forces Xn+1 to converge on 1/B in a quadratic fashion.
  • Turning now to FIG. 24, a flowchart depicting one embodiment of a method to calculate the reciprocal using [0159] multiplier 50 is shown. As previously noted, X0 is calculated first (step 700). Once X0 is determined, it is multiplied by B (step 702). The results are then routed down two parallel paths 706 and 708, one that assumes an overflow took place in the multiplication (path 706), and another that assumes no overflow occurred (path 708). Because X0 is close to 1/B, the product of X0 and B will be close to one, i.e., either slightly over one or slightly under one. As a result, an overflow will only occur during the multiplication if X0 is slightly greater than one (i.e., of the form 10.000 . . . with an exponent equal to 2−1). If there is no overflow, the result will be slightly less than one (i.e., in the form 01.111 . . . with an effective exponent equal to 2−1).
  • After the multiplication, the term (2−X[0160] 0×B) is formed within each path by inverting the (X0×B) results. Since (X0×B) is close to one, (2−X0×B) may be approximated by the absolute value of the two's complement of (X0×B). To further speed the calculation, the one's complement may be used because it only differs by a one in the least significant digit. The approximations for (2−X0×B) are performed in parallel within each path (steps 710 and 712). Specifically, in overflow path 706, the bits are inverted to get 01.111 . . . (with an effective exponent equaling 2−1). In non-overflow path 708, the bits are inverted to get 10.000 . . . (with an effective exponent equaling 2−1). Note that the sign bit of each intermediate value may also be forced to zero (positive).
  • Next, either the overflow path result or the non-overflow path result is selected (step [0161] 714). This selection can be performed by examining the result from the path that assumes no overflow occurred. If the most significant bit of this result is a one, then an overflow occurred within the non-overflow path, and the result from the overflow path should be selected as the proper result. The corresponding sign and exponent bits are also selected along with the result. Note that different bits may be selected from each path. This is illustrated by the following example. Assuming the product from the multiplier is 64 bits wide, then the bits may be numbered from 0 (the least significant bit) to 63 (the overflow bit), with the binary radix point located between the most significant bit 62 and the most significant fractional bit 61. If an overflow has occurred, bits 62 through 0 are selected with the radix point positioned between bits 62 and 61. If an overflow has not occurred, bits 63 though 0 are selected with the radix point positioned between bits 63 and 62. Thus bits 10.000 . . . may be selected as 1.0000 . . . (along with a hardwired exponent equaling 20). Advantageously, this configuration may save time by normalizing the inverted bits without requiring a dedicated normalization step. Note that other configurations and other widths are contemplated. Furthermore, all the bits from the selected path need not be used. In some embodiments fewer bits may be selected, and in other embodiments extra bits may be padded with constants to meet a desired length.
  • After the appropriate bits are selected, the result is routed back to the multiplier, which multiplies it with X[0162] 0 to complete the first iteration and form X1 (step 716). If the desired accuracy has been achieved (step 718), the results are output (step 722). If the desired accuracy has not been achieved (step 720), the iteration is repeated to form X2, wherein X2=X1×(2−X1×B). As with the first iteration, the term (X1×B) is close to one. The results of the multiplication are once again passed down paths 706 and 708 in parallel.
  • Depending upon the accuracy of the initial guess X[0163] 0 and the accuracy desired in the final result, the iteration may be performed any number of times (e.g., one, two, or five times). Using two paths may advantageously eliminate the need for normalization because the exponent and sign bits can be hard-wired based upon the known limits of the incoming operands and whether or not an overflow occurs.
  • Fast Newton-Raphson Iteration to Calculate the Reciprocal Square Root (1/{square root}B) [0164]
  • In another embodiment, [0165] multiplier 50 may be configured to calculate the reciprocal square root of an operand B using a modified version of the Newton-Raphson iteration. The equation Yn+1=Yn×(3−B×Yn 2)/2 may be used to calculate the reciprocal square root of B. Once again, the initial estimate, Y0, may be determined in a number of ways, e.g., by using initial estimate generators that perform calculations on seed values read from ROM tables using B. In this iteration Y0 approximately equals 1/{square root}B. Each subsequent iteration of the equation forces Yn+1 to converges on 1/{square root}B in a quadratic fashion. In one embodiment, both Y0 and Y0 2 may be produced using the same initial estimate generator that was used for the reciprocal calculation described above. This may be desirable because determining Y0 2 may eliminate the need for a multiplication operation to form Y0 2 from Y0. As used herein, an initial estimate generator refers to any hardware capable of generating an initial value such as X0 or Y0, e.g., one or more ROM tables configured to output seed values that may be used to calculate the initial value using arithmetic and combinational logic.
  • Turning now to FIG. 25, a flowchart depicting one embodiment of a method to calculate the reciprocal square [0166] root using multiplier 50 is shown. As previously noted, Y0 2 and Y0 are determined first (step 730). Once Y0 2 is determined, it is multiplied by B to form the term (B x Y0 2) (step 732). The results are then routed down two parallel paths 734 and 736, one that assumes an overflow took place in the multiplication (path 736), and another that assumes no overflow occurred (path 734). Because Y0 2 is close to 1/B, the product of Y0 2 and B will be close to one, i.e., either slightly over or slightly under one. As a result, an overflow will only occur during the multiplication if the result (B×Y0 2) is slightly greater than one (i.e., of the form 10.000 . . . with an effective exponent equal to 2−1). If there is no overflow, the result will be slightly less than one (i.e., in the form 01.111 . . . with an effective exponent equal to 2).
  • After the multiplication, the [0167] overflow path 736 forms the one's complement by inverting the result (B×Y0 2) (step 740). The resulting value has the form 01.111 . . . with an effective exponent of 2−1 and approximates (2−B×Y0 2). To form the term (3−B×Y0 2), a one is effectively added to the result to form 1.111 . . . with an exponent of 20 (step 744). This value is then be right shifted one bit to reflect the division by two in the term (3−B×Y0 2)/2 (step 748). This results in a value having the form 1.111 . . . with an exponent of 2−1 (step 748).
  • The [0168] non-overflow path 734 also forms the one's complement by inverting the result (B×Y0 2) (step 738). The resulting value, however, has the form 10.000 . . . with an effective exponent of 2−1. This form is normalized to 1.000 . . . with an exponent of 20, which approximates (2−B×Y0 2). To approximate the term (3−B×Y0 2), a one is effectively added to the result to form 10.000 . . . (step 742). This value is then be shifted right one bit to reflect the division by two in the term (3−B×Y0 2)/2 (step 746). The result has the form 1.000 . . . In this path, the result's exponent is forced to 20 (step 746).
  • Next, either the overflow path result or the non-overflow path result is selected (step [0169] 750). This selection can be performed as previously disclosed, i.e., based upon the value of the most significant bit of the result from each path. Different bits may be selected from each path to eliminate the need for normalization.
  • The selected result is then routed back to the multiplier, which multiplies it with Y[0170] 0 (determined during step 730) to complete the first iteration and form Y1 (step 752). If the desired accuracy has been achieved (step 754), the results are output (step 756). If the desired accuracy has not been achieved, the iteration is repeated to form Y2, wherein Y2=Y1×(3−B×Y1 2)/2 (step 758). However, unlike the first iteration, subsequent iterations may require an additional multiplication to form the term Yn 2 (step 760). As with the first iteration, the term (B×Y1 2) is close to one. Once this term has been calculated, the results are once again passed down the two paths (overflow 736 and non-overflow 734) in parallel.
  • Depending upon the accuracy of the initial guess Y[0171] 0 and the accuracy desired in the final result, the iterative calculation may be performed any number of times (e.g., one, two, or five times). Advantageously, using two paths (overflow and non-overflow) may eliminate the need for normalization because the exponent and sign bits may be hard coded based upon the known limits of the incoming operands and whether or not an overflow occurs.
  • Note that the steps in the figures are show in a serial fashion for explanatory purposes only. Some steps may be performed in parallel or in a different order. Further note that the method above may also be used to determine the square root of an operand. To implement the square root function, an additional multiplication may be performed during each iteration. [0172]
  • Turning now to FIG. 26, an embodiment of [0173] multiplier 50 configured to evaluate constant powers of an operand is shown. This embodiment may be configured to evaluate one or more constant powers of an operand such as −1 (reciprocal), −1/2, (reciprocal square root), and 1/2 (square root). In addition to the features of the previous embodiments, this embodiment of multiplier 50 comprises a non-overflow logic unit 770A, an overflow logic unit 770B, an initial estimate generator (IEG) 774, two multiplexers 776 and 780, and a control logic 778. Note that non-overflow logic unit 770A and overflow logic unit 770B may also be referred to herein as a “non-overflow path,” and “overflow path,” respectively.
  • [0174] Initial estimate generator 774 is coupled to receive multiplier operand 74 and communicate initial estimates, e.g., X0 and Y0, to multiplexer 776. Note as used herein, X0=Y0 2≈B, and Y0≈1/{square root}B. Multiplexer 776 is configured to select the first multiplication operand from either multiplicand operand 72 or the initial estimate output by initial estimate generator 774. Similarly, multiplexer 780 is configured to select the second operand to be multiplied from either multiplier operand 74 or result 292 from multiplexer 290. Control logic 778 receives control signal 772 and controls multiplexers 776 and 780, exponent control logic 254, and logic units 770A-B. Non-overflow logic 770A is coupled to receive values from LSB fix-up logic 288A and output values to multiplexer 290. Similarly, overflow logic 770B is coupled to receive values from LSB fix-up logic 288B and also output values to multiplexer 290. Logic units 770A-B are controlled by control logic unit 778, which indicates which, if any, constant power operation is being performed. If a constant power operation is not being performed, logic units 770A-B may be configured to simply allow values from fix-up logic units 288A-B to propagate through to multiplexer 290 unchanged.
  • When a constant power operation is being performed, [0175] logic units 770A-B are configured to form approximations by inverting selected bits from the values received from fix-up logic units 770A-B. Logic units 770A-B are also configured to force (e.g., hard-wire) the exponents associated with the values received from fix-up logic units 288A-B to fixed values. These fixed exponents are communicated to exponent control logic 254. Alternatively, exponent control logic 254 may force the exponents to fixed constants when instructed to do so by logic units 770A-B. Logic units 770A-B may each comprise a plurality of inverters, a wire shifter, and one or more hard-wired constants. A wire shifter is a plurality of signal, data, or control lines that are selectively connected and or offset to provide fixed shifting and routing. The following examples illustrate the operation of logic units 770A-B and multiplier 50 in more detail.
  • Example of a Reciprocal Operation [0176]
  • When a reciprocal operation is performed, [0177] multiplier 50 receives the operand to be inverted (referred to herein as “operand B”) as multiplier operand 74. Initially, multiplexer 780 is configured to select operand B. Initial estimate generator 774 also receives operand B and in response outputs an initial estimate or approximation of the reciprocal (referred to as X0) to multiplexer 776. Multiplexer 776 is configured to select, based upon control signals from control logic 778, the initial estimate, which is then multiplied by operand B to form the quantity (X0×B). The quantity (X0×B) propagates through multiplier 50 until it reaches logic units 770A-770B. Non-overflow logic unit 770A receives a version from fix-up logic 288A that assumes no overflow has occurred. Based upon control signal 772, non-overflow logic unit 770A inverts the version of (X0×B) it receives to approximate the quantity (2−X0×B). Non-overflow logic unit 770A may be configured to normalize its output by forcing the corresponding exponent to a constant, e.g., 20. Note all references herein are to unbiased exponents. For example, an unbiased exponent 20 may translate to a biased exponent of 27F (assuming a +7F16 or +12710 bias). Similarly, overflow logic unit 770B receives a version from fix-up logic 288B that assumes an overflow has occurred and inverts it. Overflow logic unit 770B may also be configured to normalize its output by forcing the corresponding exponent to a constant, e.g., 2−1. Note that in some embodiments, not all bits from fix-up logic units 288A-B may be used or inverted by logic units 770A-B.
  • Once the overflow and non-overflow approximations for the quantity (2−X[0178] 0×B) have been output by logic units 770A-B, multiplexer 290 is configured to select one of the approximations based upon the value of MSB 284 from the output of fix-up logic unit 288A. As previously noted, by looking at the most significant bit a determination of whether an overflow occurred can be made. If an overflow occurred, the approximation for the quantity (2−X0×B) from logic unit 770B (the overflow path) is selected. If an overflow did not occur, the approximation for the quantity (2−X0×B) from logic unit 770A (the non-overflow path) is selected. Note that other control configurations are possible, e.g., MSB 284 may be the most significant bit of the output from fix-up logic unit 288B.
  • Once the appropriate approximation for the quantity (2−X[0179] 0×B) has been selected by multiplexer 290, it is routed to multiplexers 776 and 780. Multiplexer 780 is directed by control logic 778 to select the approximation so that it may be multiplied by initial estimate X0 to form the quantity X0×(2−X0×B). During this multiplication, however, logic units 770A-B are configured to allow the values from fix-up logic units 288A-B to pass through unchanged. The result selected by multiplexer 290 is the approximation of the reciprocal of operand B after one Newton-Raphson iteration. As previously noted, the process may be repeated a number of times to achieve greater accuracy.
  • Example of a Reciprocal Square Root Operation [0180]
  • When a reciprocal square root operation is performed, [0181] multiplier 50 operates in much the same fashion as previously described for a reciprocal operation. The operand to be raised to the −1/2 power (referred to herein as “operand B”) is received as multiplier operand 74. Initially, multiplexer 780 is configured to select operand B. Initial estimate generator 774 also receives operand B and in response outputs an initial estimate or approximation of the reciprocal (referred to as Y0 2, which equals X0) and the reciprocal square root (referred to as Y0) to multiplexer 776.
  • [0182] Multiplexer 776 is configured to select, based upon control signals from control logic 778, the initial estimate Y0 2, which is then multiplied by operand B to form the quantity (Y0 2×B). The quantity (Y0 2×B) propagates through multiplier 50 until it reaches logic units 770A-770B. Non-overflow logic unit 770A receives a version from fix-up logic 288A that assumes no overflow has occurred. Based upon control signal 772, non-overflow logic unit 770A inverts the version of quantity (Y0 2×B) it receives to approximate the quantity (2−Y0 2×B). Logic unit 770A also pads the most significant bit of the quantity (2−Y0 2×B) with a constant one to approximate the quantity (3−Y0 2×B). Logic unit 770A may then normalize the quantity (3−Y0 2×B) by selectively routing (e.g., wire-shifting) bits to multiplexer 290 in a particular position or offset and by forcing the corresponding exponent to 20.
  • [0183] Overflow logic unit 770B may be similarly configured to invert the version of quantity (Y0 2×B) it receives to approximate the quantity (2−Y0 2×B). Logic unit 770B also pads the most significant bit of the quantity (2−Y0 2×B) with a constant one to approximate the quantity (3−Y0 2×B). Logic unit 770B may then normalize the quantity (3−Y0 2×B) by selectively routing bits to multiplexer 290 in a particular position or offset and by forcing the corresponding exponent to 2−1. Note that in some embodiments, not all bits from fix-up logic units 288A-B may be used or inverted by logic units 770A-B.
  • Once the overflow and non-overflow approximations for the quantity (3−Y[0184] 0 2×B) have been output by logic units 770A-B, multiplexer 290 is configured to select one of the approximations based upon the value of MSB 284 from the output of fix-up logic unit 288A. As previously noted, by looking at the most significant bit a determination of whether an overflow occurred can be made. If an overflow occurred, the approximation for the quantity (3−Y0 2×B) from logic unit 770B (the overflow path) is selected. If an overflow did not occur, the approximation for the quantity (3−Y0 2×B) from logic unit 770A (the non-overflow path) is selected. Other control configurations are also possible.
  • Once the approximation for the quantity (3−Y[0185] 0 2×B) has been selected by multiplexer 290, it is routed to multiplexer 780. Multiplexer 780 is directed by control logic 778 to select the approximation so that it may be multiplied by initial estimate Y0 that was read from initial estimate generator 744 to form the quantity Y0×(3−Y0 2×B). During this multiplication, however, logic units 770A-B are configured to allow the values from fix-up logic units 288A-B to pass through unchanged. The result selected by multiplexer 290 is the approximation of the reciprocal square root of operand B after one Newton-Raphson iteration. As previously noted, the process may be repeated a number of times to achieve greater accuracy. However, in subsequent iterations the result may be squared to form Yn 2, which is then used in place of the initial estimate Y0 2 from initial estimate generator 774.
  • Note other configurations of [0186] multiplier 50 are possible and contemplated. For example, non-overflow logic 770A and overflow logic 770B may instead be configured to receive rounded and normalized value 292 from multiplexer 290, in which case a separate multiplexer (not shown) may be needed to select between the values output by non-overflow logic 770A and overflow logic 770B. In some embodiments of multiplier 50, registers may be used to store various intermediate results, e.g., the inputs to multiplexers 776 and 780, and the results from multiplexer 290. The registers may the store the intermediate results for use during subsequent clock cycles.
  • Turning to FIG. 27, details of one exemplary embodiment of [0187] non-overflow logic unit 770A configured to calculate the quantity (3−Y0 2×B) for the reciprocal square root calculation are shown. When the quantity (Y0 2×B) 790A is received from LSB fix-up logic 228A, inverters 792A invert selected bits to approximate the quantity (2−Y0 2×B). Note that an inverter may not be required for all bits, e.g., the most and least significant bits. Constants 794A are then used to replace the most significant bit of the quantity (2−Y0 2×B) 790A to approximate the quantity (3−Y0 2×B), which is output to multiplexer 290. A constant or control signal may be routed to exponent control logic 254 to force the corresponding exponent to 20.
  • A numerical example further illustrates the operation of [0188] non-overflow logic unit 770A. First, the value 1.111 . . . ×2−1 is received from LSB fix-up logic 228A as an approximation of the quantity (Y0 2×B) 790A. Next, inverters 792A invert the quantity to generate 10.000 . . . ×2−1 as an approximation of the quantity (2−Y0 2×B). Finally, constants 794A are used to replace the most significant bits. The results are shifted, resulting in the quantity 1.00000 . . . , and the corresponding exponent is forced to 20. Note that the most and least significant bits of the quantity (Y0 2×B) 790A may not be incorporated into the quantity (2−Y0 2×B).
  • [0189] Overflow logic 770B operates in a similar fashion. However, the most significant bit of quantity 790B is replaced with only a single constant 794B, and bits 30 through 0 are incorporated into the quantity (2−Y0 2×B). A numerical example further illustrates the operation of overflow logic unit 770B. First, the value 10.000 . . . ×2−1 is received from LSB fix-up logic 228B as an approximation of the quantity (Y0 2×B) 790B. Next, inverters 792B invert the quantity to generate 01.111 . . . ×2−1 as an approximation of the quantity (2−Y0 2×B). Finally, constant 794B is used to replace the most significant bit. The results are shifted, resulting in the quantity 1.1111 . . . , and the corresponding exponent is forced to 2−1.
  • Turning now to FIG. 28, another exemplary embodiment of [0190] non-overflow logic 770A and overflow logic 770B is shown. The embodiments shown in the figure are configured to return the quantity (2−Y0 2×B) for the reciprocal calculation. A numerical example illustrates the operation of non-overflow logic 770A and overflow logic 770B. Assuming non-overflow logic 770A receives a value 1.111 . . . ×2−1 as the quantity (Y0 2×B) 790A, then inverters 792A are used to invert the bits (excluding the least significant bit) to obtain a value 0.000 . . . Constant 796A is then used to pad the most significant bit position. The remaining bits are all shifted one position, with the result 1.0000 . . . being output to multiplexer 290. The corresponding exponent is forced to 20 by signal 796A.
  • [0191] Overflow path 770B operates in a similar fashion. For example, assuming value 790B is 1.000 . . . ×2−1, then inverters 792B generate the value 0.111 . . . which is shifted and output to multiplexer 290 as the value 1.11 . . . Note the least significant bit may be padded with a constant 794B, e.g., zero, while the corresponding exponent is forced to 2−1 by signal 796B.
  • Note the examples and figures referred to herein are exemplary. Other configurations for [0192] non-overflow logic 770A and overflow logic 770B and multiplier 50 are also contemplated. For example, the least significant bit from quantity 790B may be duplicated instead of using constant 794B. Other constant values may also be used, and the widths of quantities 770A-B may be reduced before they are routed to multiplexer 290 (e.g., from 32 bits to 24 bits). Other logic components may be used in place of inverters 792A-B, and the bit routing structure disclosed above may be replaced by other logic components, e.g., a shifter. The functionality provided by non-overflow logic 770A and overflow logic 770B may be provided in other components internal or external to multiplier 50. In addition, multiplier 50 may be configured to perform both reciprocal and reciprocal square root functions, e.g., by incorporating two versions of non-overflow logic 770A and overflow logic 770B, or by incorporating multiplexers within non-overflow logic 770A and overflow logic 770B to select which routing of bits and constants should be applied.
  • Compression of Intermediate Products [0193]
  • When performing iterative calculations, [0194] multiplier 50 calculates intermediate products which may be stored in registers. During the next iteration, the intermediate product may be read from the register and used as an operand. Unfortunately, each iteration may introduce rounding errors that accumulate in the final result. For example, assuming an N-bit significand, the results from each multiplication have significands that are 2N bits wide. This result may be rounded to N-bits or some other width. The greater the number of iterations, the larger the potential rounding error may be in the final result. For obvious reasons, it is desirable to reduce the magnitude of this rounding error.
  • One possible method to reduce the rounding error is to calculate extra bits for each intermediate product and then round at lower bit positions. Each iteration may generate accurate bits in lower (less significant) bit positions than the previous iteration. However, due to the fixed size of the storage registers within [0195] multiplier 50, the extra bits will not fit unless the registers within multiplier 50 are widened accordingly. There are several potential drawbacks to using wider registers, including the additional die space requirements and the additional architectural state requirements for context switches. Thus, a mechanism for maintaining the accuracy provided by the extra bits without using wider registers may be desirable.
  • One possible method for providing such extra accuracy without increasing the size of the storage registers is to compress the intermediate results before they are stored. However, not all compression algorithms are well suited for use within [0196] multiplier 50. One concern, in particular, is speed. Another concern is the die space required to implement the compression.
  • Turning now to FIG. 29A, a flowchart illustrating one possible method for fast compression is shown. In the embodiment illustrated, the intermediate product is first calculated to N extra significant bits (step [0197] 600), wherein N is a predetermined constant. For example, assuming multiplier 50 receives 24-bit operands, multiplier 50 may calculate intermediate products to a precision of 28 bits. In this case, N equals 4 bits. Once the intermediate product is calculated, the next-to-most significant bit is examined (step 602). The value of the next-to-most significant bit determines the value of a signaling bit. If the next-to-most significant bit equals one (step 604), then the signaling bit equals one also. If, on the other hand, the next-to-most significant bit equals zero (step 606), then the signaling bit equals zero. The signaling bit is used to replace a portion of the intermediate product, thereby compressing the intermediate product (step 608). In one embodiment, the portion replaced by the signaling bit is N+1 bits wide. While this method assumes that the portion being replaced comprises entirely one's or zero's, this may be a safe assumption when certain types of iterations are being performed. For example, when performing the Newton-Raphson iterations previously disclosed for calculating the square root and inverse square root, the products (2−B×Xn) and (3−B×Yn 2) are formed during each iteration. As previously noted, these product are very close to one (e.g., either slightly over one, 1.00000000 . . . ×2°, or slightly under one, 1.11111111 . . . ×2−1). Accordingly, many of the leading bits (excluding the most significant bit in some cases) of the products are identical, i.e., either all zeros or ones, from one iteration to the next with differences occurring in the less significant bits. This property allows the method illustrated in FIG. 29A to be used effectively.
  • The maximum number of bits that may be compressed in a particular implementation may be determined by examining the number of bits that have the same values over the entire range of possible operand values. For example, if an embodiment using a 32-bit significand is determined to have nine bits that have the same value for all possible operand values, then the 32-bit results may compressed so that they may be stored in 24-bit registers. [0198]
  • While the present embodiment illustrated in the figure performs the compression whenever a particular iterative calculation is performed, in other embodiments the compression may be performed conditionally. For example, in one embodiment the compression may be performed only if a comparison of the bits to be compressed shows that they all have the same value. While many different types of hardware may be used to perform this comparison, one possible configuration may utilize multiple input AND gates and multiple input NAND gates. If the testing logic determines that the bits to be compressed do not all have the same value, then the operand may stored by truncating the extra least significant bits. While this implementation may lose the benefit of increased accuracy in some cases, this may be adequate if the bits to be compressed rarely have different values. [0199]
  • When the compressed intermediate product is needed for the next iteration, it may be decompressed. Turning now to FIG. 29B, one possible method for decompressing the compressed intermediate product is illustrated. First, the compressed intermediate product is read from the storage register (step [0200] 612). Next, the compressed intermediate product is expanded by padding the next-to-most significant bits with copies of the signaling bit (step 614). The number of copies of the signaling bit that are padded or inserted below the most significant bit in this embodiment equals N−1. Advantageously, the expanded intermediate product now has the same width as the original intermediate product. For example, if the compressed intermediate product comprises 24 bits, and the original intermediate product comprises 28 bits, then the signaling bit will be copied 4 times to render an expanded intermediate product having 28 bits. Advantageously, using the methods illustrated in FIGS. 29A and 29B, no information is lost in the compression and decompression process.
  • Note that the bits replaced by the signaling bit need not be the most significant bit. They may begin with the next-to-most significant bit. For example, if the most significant bit of the intermediate product is [0201] bit 27, the bits replaced by the signal bit may comprise bits 22 through 26. Further note that the signaling bit may simply be a particular bit within the intermediate product, i.e., an extra calculation to determine the signal bit is not required. Furthermore, the signal bit need not be the most significant or least significant bit in the range of bits to be compressed, i.e., the signal bit may be a bit in the middle of the range of bits to be compressed.
  • Turning now to FIG. 30, one embodiment of [0202] multiplier 50 configured to compress intermediate products is shown. As in previous embodiments, this embodiment of multiplier 50 comprises partial product generator 60, selection logic 62, and partial product array adder 64. This embodiment also comprises demultiplexer 622, 24-bit storage register 638, and multiplexers 776 and 780. Demultiplexer 622 receives an intermediate product 620 from partial product array adder 64. Other embodiments are also contemplated. For example, demultiplexer 622 may receive intermediate product 620 from multiplexer 290 (see FIG. 26). Demultiplexer 622 routes intermediate product 620 according to an iterative control signal 644. For example, if iterative control signal 644 indicates that an iterative operation is being performed, then intermediate product 620 is routed to storage register 638. If, on the other hand, iterative control signal 644 indicates that an iterative operation is not being performed, then intermediate product 620 may be routed to standard rounding logic (not shown) and then output. In another embodiment, intermediate product 620 may be rounded before reaching demultiplexer 622. In this case, demultiplexer 622 may simply route product 620 to an output of multiplier 50 if an iterative calculation is not being performed.
  • In the event an iterative operation is being performed, [0203] storage register 638 is configured to store intermediate product 620 until it is needed for the next iteration. The signal and data lines coupled to the inputs and outputs of storage register 638 may be referred to herein as a wire shifter because they provide a fixed shifting function. As previously noted, storage register 638 may be implemented so that it is smaller than intermediate product 620. Assuming, for example, that intermediate product is 28 bits wide, storage register 638 may be configured to store only the 24 least significant bits of intermediate product 620. Assuming the five next-to-most significant bits of intermediate product 620 all have the same value for the particular iteration being performed, then bit 22 may be selected as a signal bit 632 to replace the four next-to-most significant bits 636. Thus, as the figured illustrates, storage register 638 may be configured to store bits 0-21, signal bit 632, and bit 27.
  • When the next iteration is performed, bits [0204] 0-21, signal bit 632, and bit 27 are read from storage register 638. To recreate the full 28 bits of intermediate product 620, signal bit 632 is copied four times to recreate bits 23-26. Advantageously, no information from intermediate product 620 is lost in the compression and decompression cycle.
  • [0205] Multiplier 50 may also be configured with optional testing logic 624, which is configured to determine whether the five most significant bits from intermediate product 620 have the same value. In one embodiment, testing logic 624 may comprise five-input AND gate 626, five-input NOR gate 628, and two-input OR gate 630. The output from two-input OR gate 630 may be used in a number of ways, e.g., to signal an error condition or to cause register 638 to store the 24 most significant bits without compression.
  • In some embodiments, [0206] testing logic 624 may be omitted. Furthermore, demultiplexer 622 may also be omitted. In such embodiments, product 620 may be rounded and then routed to both storage register 638 and the outputs of multiplexer 50. In the event of an iterative calculation, external logic may be used to ensure that functional units or other parts of the microprocessor will not use the data output by multiplier 50 until the iterative calculation is completed.
  • Turning now to FIG. 31A, a figure illustrating one possible method for compression is shown. As the figure illustrates, uncompressed [0207] intermediate product 656 may comprise 28 bits numbered 0 through 27. If the five most significant bits 652 from intermediate product 656 all have the same value, they may be compressed into one signal bit 654. This allows uncompressed intermediate product 656 to be represented and stored as compressed intermediate product 658, thereby using only 24 bits. When compressed intermediate product 658 is uncompressed, signal bit 654 is copied four times to create the four most significant bits of uncompressed intermediate product 660.
  • Turning now to FIG. 31B, a figure illustrating another possible method for compressing intermediate product is shown. In this embodiment, [0208] intermediate product 676 is characterized by having five equal bits 672 directly below most significant bit 680. As the figure illustrates, the five equal bits 672 may be compressed into one signal bit 674 even though they are not the most significant bits in intermediate product 676. Compressed product 678 is still able to fit within a 24 bit storage register. To decompress compressed product 678, four copies of signal bit 674 are inserted below most significant bit 680 within uncompressed product 682. Once again, no information from intermediate product 676 is lost in the process. As this example illustrates, the contemplated compression method may be used regardless of where the bits having equal values are located. Advantageously, no information is lost if the bits having equal values are located in the same position in each iteration.
  • Achieving Higher Frequencies of Exactly Rounded Results [0209]
  • When an infinitely precise result is rounded to the nearest machine number, the maximum possible error is one-half of a unit in the last place (ulp). When performing an iterative calculation such as the Newton-Raphson iterations discussed above for the reciprocal and reciprocal square root, the results converge toward the infinitely precise result. However, due to limitations in the number of bits of precision that are available, the number of iterations performed, and the approximations discussed above to improve the speed of each iteration, some input operands may generate results from [0210] multiplier 50 that do not equal the infinitely precise result rounded to the nearest machine number (also referred to as the “exactly rounded result”). This holds true even when each iteration is configured to use the “round to nearest” rounding mode.
  • Thus, it would be desirable to increase the frequency or probability that the calculated result equals the exactly rounded result. One method to determine whether the calculated result equals the exactly rounded result is to multiply the calculated result (calculated to at least one extra bit of accuracy, i.e., N+1 bits) and the original operand B (assuming the reciprocal of B has been calculated). The exactly rounded result may then be selected from the following three values: the N-bit result (without the extra bit) plus one in the least significant bit; the N-bit result minus one in the least significant bit; or the N-bit result plus zero. The exactly rounded result is selected based upon the value of the extra computed bit (i.e., bit N+1) and whether the result of multiplication was greater than one, less than one, or equal to one. [0211]
  • Rather than computing the exactly rounded result with a probability of one as described above (i.e., performing an extra multiplication step), [0212] multiplier 50 may be configured to achieve nearly the same accuracy (i.e., computing the exactly rounded result with a probability close to one) by adding an “adjustment constant” to the result produced from the last step of the iterative calculation before rounding. Depending upon the actual implementation of the multiplier (e.g., the number of bits of precision, the number of iterations performed, and the accuracy of the initial approximation) the probability that the calculated result is higher than the exactly rounded result (“Phigh”) may be greater than the probability that the calculated result is lower than the exactly rounded result (“Plow”). If this is the case, then adding a negative adjustment constant in the final step of the iteration may increase the probability that the calculated result will equal the exactly rounded result (“Pequal”). Similarly, if Phigh is less than Plow, then adding a positive adjustment constant may increase Pequal. The probabilities Phigh, Plow, and Pequal may be determined by passing a large number of differing input operand values through the iterative calculation (as performed by the multiplier) and then comparing each result with the corresponding exactly rounded result. A computer program may be particularly useful in performing the comparisons. The comparisons may also be performed before rounding, i.e., comparing the infinitely precise results with the results from the multiplier's final iteration before they are rounded.
  • Turning now to FIG. 32, an embodiment of [0213] multiplier 50 configured to add a correction constant is shown. Generally, multiplier 50 may be configured similarly to other embodiments disclosed herein that are capable of performing iterative calculations. However, in this embodiment control logic 778 is configured to convey adjustment constant 800 to partial product array adder 64 during the last step of an iterative calculation. Partial product array adder 64 then sums adjustment constant 800 with the selected partial products from selection logic 62. Advantageously, this configuration may not require an additional set of adders to sum adjustment constant 800 into the result. Another potential advantage of this configuration is that any overflows or denormalizations that occur as a result of adjustment constant 800 are addressed by the rounding and normalization process already built into multiplier 50 (i.e., carry-save adders 276A-B, carry-propagate adders 278A-B, sticky bit logic units 280A-B, LSB fix-up logic units 288A-B, and logic units 770A-B).
  • In another embodiment, adjustment constant [0214] 800 may instead be summed into the result by carry-save adders 276A-B and or carry-propagate adders 278A-B. Another embodiment may incorporate an extra adder (not shown) to sum adjustment constant 800 with result 292 from multiplexer 290. However, this configuration may require additional logic in the event the result becomes denormalized or experiences an overflow as a result of the addition.
  • [0215] Control logic 778 may be configured to convey adjustment constant 800 to partial product array adder 64 during the final multiplication in each iteration, or just for the final multiplication during the final iteration. For example, if the iterative calculation involves two multiplication operations for each iteration, and three iterations are required to achieve the desired accuracy, then control logic 778 may be configured to convey adjustment constant 800 during the final multiplication of each iteration, i.e., three times, or only once during the second multiplication of the third and final iteration. In yet another embodiment, control logic 778 may convey adjustment constant 800 to partial product adder 64 during every multiplication in the iteration.
  • In yet another embodiment, [0216] control logic unit 778 may store a number of different adjustment constants 800, e.g., one for each type of iteration. In such a configuration, control logic unit 778 may convey the appropriate adjustment constant that corresponds to the type of iterative calculation being performed. Control logic unit 778 receives an indication of which iterative calculation is being perform via control signal 722. For example, when control signal 722 indicates that the reciprocal iterative calculation is to be performed, control logic 778 may convey a first adjustment constant 800 to partial product array adder 64. However, when control signal 722 indicates that the reciprocal square root iterative calculation is being performed, control logic 778 may convey a second, different adjustment constant to partial product array adder 64.
  • In another embodiment, [0217] multiplier 50 may be configured to calculate three versions of each result, i.e., a first result generated without adding an adjustment constant, a second result generated by adding an adjustment constant, and a third result generated by subtracting the adjustment constant. Alternatively, the second and third results could be calculated by adding different adjustment constants. These results may be calculated in parallel by multiplier 50. Once the result without the adjustment constant is generated, a multiplication may be performed as described above to determine whether the result is correct, too high, or too low. The corresponding result may then be selected.
  • Exemplary Configuration Using Two Multipliers [0218]
  • Turning now to FIG. 33A, an example of a vector multiplication using two [0219] multipliers 50A and 50B is shown. Multipliers 50A and 50B may be configured similarly to multiplier 50 as described in previous embodiments. As shown in the figure, multipliers 50A and 50B are configured to operate in parallel to execute a vector multiplication of a pair of vectors each comprising four 16-bit operands 380A-380D and 382A-382D. Note operands 380A-380D may come from a first 64-bit MMX register, while operands 382A-382D may come from a second 64-bit MMX register.
  • Turning now to FIG. 33B, another example of a vector [0220] multiplication using multipliers 50A and 50B is shown. In this configuration, multipliers 50A and 50B operate in parallel to multiply a pair of vectors each comprising two 32-bit operands 384A-384B and 386A-386B. Once again, operands 384A-384B may come from a first 64-bit MMX register, while operands 386A-386B may come from a second 64-bit MMX register. Further note that while a vector operation is being performed, each individual multiplier 50A and 50B is performing a scalar multiplication. Other modes of operation are also contemplated, for example, multiplier 50A may perform a 32-bit scalar multiplication independent from multiplier 50B. While multiplier 50A performs the multiplication, multiplier 50B may sit idle or perform an independent multiplication operation.
  • Exemplary Computer System Using Multiplier [0221]
  • Turning now to FIG. 34, a block diagram of one embodiment of a [0222] computer system 400 including microprocessor 10 is shown. Microprocessor 10 is coupled to a variety of system components through a bus bridge 402. Other embodiments are possible and contemplated. In the depicted system, a main memory 404 is coupled to bus bridge 402 through a memory bus 406, and a graphics controller 408 is coupled to bus bridge 402 through an AGP bus 410. Finally, a plurality of PCI devices 412A-412B are coupled to bus bridge 402 through a PCI bus 414. A secondary bus bridge 416 may further be provided to accommodate an electrical interface to one or more EISA or ISA devices 418 through an EISA/ISA bus 420. Microprocessor 10 is coupled to bus bridge 402 through a CPU bus 424.
  • Bus bridge [0223] 402 provides an interface between microprocessor 10, main memory 404, graphics controller 408, and devices attached to PCI bus 414. When an operation is received from one of the devices connected to bus bridge 402, bus bridge 402 identifies the target of the operation (e.g. a particular device or, in the case of PCI bus 414, that the target is on PCI bus 414). Bus bridge 402 routes the operation to the targeted device. Bus bridge 402 generally translates an operation from the protocol used by the source device or bus to the protocol used by the target device or bus.
  • In addition to providing an interface to an ISA/EISA bus for PCI bus [0224] 414, secondary bus bridge 416 may further incorporate additional functionality, as desired. For example, in one embodiment, secondary bus bridge 416 includes a master PCI arbiter (not shown) for arbitrating ownership of PCI bus 414. An input/output controller (not shown), either external from or integrated with secondary bus bridge 416, may also be included within computer system 400 to provide operational support for a keyboard and mouse 422 and for various serial and parallel ports, as desired. An external cache unit (not shown) may further be coupled to CPU bus 424 between microprocessor 10 and bus bridge 402 in other embodiments. Alternatively, the external cache may be coupled to bus bridge 402 and cache control logic for the external cache may be integrated into bus bridge 402.
  • [0225] Main memory 404 is a memory in which application programs are stored and from which microprocessor 10 primarily executes. A suitable main memory 404 comprises DRAM (Dynamic Random Access Memory), and preferably a plurality of banks of SDRAM (Synchronous DRAM).
  • [0226] PCI devices 412A-412B are illustrative of a variety of peripheral devices such as, for example, network interface cards, video accelerators, audio cards, hard or floppy disk drives or drive controllers, SCSI (Small Computer Systems Interface) adapters and telephony cards. Similarly, ISA device 418 is illustrative of various types of peripheral devices, such as a modem, a sound card, and a variety of data acquisition cards such as GPIB or field bus interface cards.
  • [0227] Graphics controller 408 is provided to control the rendering of text and images on a display 426. Graphics controller 408 may embody a typical graphics accelerator generally known in the art to render three-dimensional data structures which can be effectively shifted into and from main memory 404. Graphics controller 408 may therefore be a master of AGP bus 410 in that it can request and receive access to a target interface within bus bridge 402 to thereby obtain access to main memory 404. A dedicated graphics bus accommodates rapid retrieval of data from main memory 404. For certain operations, graphics controller 408 may further be configured to generate PCI protocol transactions on AGP bus 410. The AGP interface of bus bridge 402 may thus include functionality to support both AGP protocol transactions as well as PCI protocol target and initiator transactions. Display 426 is any electronic display upon which an image or text can be presented. A suitable display 426 includes a cathode ray tube (“CRT”), a liquid crystal display (“LCD”), etc.
  • It is noted that, while the AGP, PCI, and ISA or EISA buses have been used as examples in the above description, any bus architectures may be substituted as desired. It is further noted that [0228] computer system 400 may be a multiprocessing computer system including additional microprocessors (e.g. microprocessor 10 a shown as an optional component of computer system 400). Microprocessor 10 a may be similar to microprocessor 10. More particularly, microprocessor 10 a may be an identical copy of microprocessor 10. Microprocessor 10 a may share CPU bus 424 with microprocessor 10 (as shown in FIG. 5) or may be connected to bus bridge 402 via an independent bus.
  • It is still further noted that the present discussion may refer to the assertion of various signals. As used herein, a signal is “asserted” if it conveys a value indicative of a particular condition. Conversely, a signal is “deasserted” if it conveys a value indicative of a lack of a particular condition. A signal may be defined to be asserted when it conveys a logical zero value or, conversely, when it conveys a logical one value. Additionally, various values have been described as being discarded in the above discussion. A value may be discarded in a number of manners, but generally involves modifying the value such that it is ignored by logic circuitry which receives the value. For example, if the value comprises a bit, the logic state of the value may be inverted to discard the value. If the value is an n-bit value, one of the n-bit encodings may indicate that the value is invalid. Setting the value to the invalid encoding causes the value to be discarded. Additionally, an n-bit value may include a valid bit indicative, when set, that the n-bit value is valid. Resetting the valid bit may comprise discarding the value. Other methods of discarding a value may be used as well. [0229]
  • Although the embodiments above have been described in considerable detail, other versions are possible. Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications. [0230]

Claims (30)

What is claimed is:
1. A microprocessor comprising
a first multiplier configured to receive a first multiplier operand and a first multiplicand operand; and
a second multiplier configured to receive a second multiplier operand and a second multiplicand operand;
wherein said first and second multipliers are coupled together and are configured to perform at least two types of multiplication, and
wherein said first type of multiplication is vector multiplication and is performed when said first and second multiplicand operands each comprise different components of a vector multiplicand operand and said first and second multiplier operands each comprise different components of a vector multiplier operand.
2. The microprocessor as recited in
claim 1
, wherein said second type of multiplication is scalar multiplication and is performed when said first and second multiplicand operands both comprise different portions of a single N-bit scalar multiplicand and said first and second multiplier operands both comprise different portions of a single N-bit scalar multiplier, wherein said scalar multiplication is performed in parallel in said first and second multipliers and results in a 2N-bit wide scalar result.
3. The microprocessor as recited in
claim 2
, wherein said first multiplier is configured to propagate carry values to said second multiplier only when said scalar multiplication is being performed.
4. The microprocessor as recited in
claim 2
, wherein said first and second multipliers are configured to perform a third type of multiplication, wherein said third type of multiplication is independent vector multiplication and is performed when said first and second multiplier operands and said first and second multiplicand operands each comprise an independent vector having at least two vector components, and wherein said independent vector multiplication results in two independent vector products, each having at least two vector components.
5. The microprocessor as recited in
claim 4
, wherein said first and second multipliers are further configured to perform a fourth type of multiplication, wherein said fourth type of multiplication is independent scalar multiplication and is performed when said first and second multiplier operands and said first and second multiplicand operands each comprise an independent scalar value.
6. The microprocessor as recited in
claim 5
, wherein said first and second multipliers each comprise:
a partial product array adder configured to sum a plurality of partial products to generate a redundant form result;
an overflow logic unit coupled to said partial product array adder and configured to convert said redundant form result into a first non-redundant form result assuming an overflow will occur during the formation of said redundant form product;
a non-overflow logic unit coupled to said partial product array adder and configured to convert said redundant form result into a second non-redundant form result assuming that no overflow will occur during the formation of said redundant form product; and
a multiplexer configured to select either said first or second non-redundant form result for output.
7. A microprocessor capable of evaluating a constant power of an operand in an iterative manner comprising:
an initial estimate generator configured to receive said operand and output an initial estimate of said operand's reciprocal;
a multiplier coupled to said initial estimate generator and configured to receive said operand and said initial estimate Y0, wherein said multiplier is configured to form a product during a first iteration by multiplying said initial estimate and said operand, wherein said multiplier comprises:
an overflow logic unit comprising a first plurality of inverters coupled to receive and invert selected bits from said product to form a first approximation of the quantity (2−Y0 2×B), wherein said first approximation assumes an overflow has occurred during the formation of said product, wherein said overflow logic unit is configured to pad the most significant bit of said first approximation with a constant one to approximate the quantity (3−Y0 2×B);
a non-overflow logic unit comprising a second plurality of inverters coupled to receive and invert selected bits from said product to form a second approximation of the quantity (2−Y0 2×B), wherein said second approximation assumes an overflow has not occurred during the formation of said product, wherein said non-overflow logic unit is configured to pad the most significant bit of said second approximation with a constant one to approximate the quantity (3−Y0 2×B); and
a multiplexer configured to select either said first or second approximations for output as an intermediate product;
compression logic coupled to receive and compress said intermediate product in a loss-free manner; and
a storage register coupled to said logic and configured to store said compressed intermediate product.
8. The microprocessor as recited in
claim 7
, wherein said logic is configured to compress said intermediate product by representing a predetermined number of bits below the most significant bit in said intermediate product with a single bit having the same bit value as said predetermined number of bits.
9. The microprocessor as recited in
claim 7
, wherein said microprocessor further comprises decompression logic coupled to said storage register and configured to decompress said compressed intermediate product, wherein said multiplier is configured to receive said decompressed intermediate product in lieu of said initial estimate when performing a second iteration.
10. The microprocessor as recited in claim, 7 further comprising:
routing logic coupled to said multiplexer, wherein said routing logic is configured to route said intermediate product to said multiplier in place of said initial estimate, wherein said multiplier is configured to perform a second multiplication to create a final result by using said intermediate product in place of said initial estimate;
an adder configured to sum an adjustment constant to said final result to form a second version of said final result;
an adder configured to subtract an adjustment constant from said final result to form a third version of said final result; and
selection logic configured to select for output either said final result, said second version of said final result, or said third version of said final result.
11. The microprocessor as recited in
claim 10
, wherein said adjustment constant is predetermined in order to improve the probability that the result selected for output equals the exactly rounded result.
12. A multiplier capable of supporting different rounding modes comprising:
a multiplier input configured to receive a multiplier operand;
a multiplicand input configured to receive a multiplicand operand;
a partial product generator configured to receive said multiplicand operand and generate a corresponding plurality of potential partial products;
partial product selection logic configured to receive said multiplier operand and select at least one of said potential partial products;
a partial product array adder coupled to partial product selection logic and configured to receive and sum said selected partial products to form a redundant-form product;
rounding constant selection logic configured to select a rounding constant corresponding to the selected rounding mode;
a first path comprising one or more adders, wherein said first path is configured to receive, sum, and round said redundant-form product and said rounding constant, assuming no overflow will occur, to form a first non-redundant-form product;
a second path comprising one or more adders, wherein said second path is configured to receive, sum, and round said redundant-form product and said rounding constant, assuming an overflow will occur, to form a second non-redundant-form product; and
selection logic coupled to said first path and said second path, wherein said selection logic is configured to select between said first non-redundant-form product and said second non-redundant-form product.
13. The multiplier as recited in
claim 12
, further comprising exponent control logic that is configured to adjust an exponent corresponding to said redundant form-product to reflect whether an overflow occurred during said rounding.
14. The multiplier as recited in
claim 12
, wherein said first path comprises a first sticky bit logic unit coupled to receive said redundant-form product and calculate a first sticky bit assuming no overflow will occur, and wherein said second path comprises a second sticky bit logic unit coupled to receive said redundant-form product and calculate a second sticky bit assuming an overflow will occur.
15. The multiplier as recited in
claim 12
, wherein said first path comprises a first fix-up logic unit configured to receive said first non-redundant-form product and said first sticky bit, wherein said first fix-up logic unit is configured to conditionally invert the least significant bit of said first non-redundant-form product, wherein said second path comprises a second fix-up logic unit configured to receive said second non-redundant-form product and said second sticky bit, wherein said second fix-up logic unit is configured to conditionally invert the least significant bit of said second non-redundant-form product.
16. The multiplier as recited in
claim 12
, wherein said multiplier is configured to perform iterative calculations comprising at least two iterations, wherein each iteration comprises at least one multiplication operation, and wherein said multiplier further comprises:
compression logic coupled to said selection logic and configured to compress said non-redundant-form product by representing a predetermined number of bits below the most significant bit in said non-redundant-form product with a single bit having the same bit value as said predetermined number of bits;
a plurality of storage registers for storing said compressed non-redundant-from products; and
decompression logic coupled to said plurality of storage registers, wherein said decompression logic is configured to decompress said compressed non-redundant-from products, wherein said multiplier is configured to use said decompressed product in a future iteration.
17. The multiplier as recited in
claim 16
, wherein said multiplier further comprises a means for adding a compensation constant to the non-redundant-form product of the final iteration.
18. The multiplier as recited in
claim 17
, wherein said compensation constant is selected to improve the probability that said non-redundant-form product of the final iteration will equal the exactly-rounded result.
19. The multiplier as recited in
claim 12
, wherein said multiplier is configured to perform both signed and unsigned multiplication, and wherein said multiplier further comprises a means for generating an effective sign for said multiplicand operand, wherein said means for generating an effective sign is coupled to said partial product generator, and wherein said effective sign is used by said partial product generator to generate said potential partial products.
20. The multiplier as recited in
claim 12
, wherein said multiplier is configured to perform both scalar and vector component-wise multiplication, wherein said partial product array adder comprises two regions, wherein said partial product array adder is configured to propagate carry information across said regions only when scalar multiplication is being performed.
21. The multiplier as recited in
claim 20
, wherein said multiplier further comprises a plurality of adders coupled to said partial product array and configured to sum said vector component results to form the vector dot product of said multiplier and multiplicand operands, wherein said multiplier and multiplicand operands each comprise a plurality of vector components.
22. The multiplier as recited in
claim 12
, wherein said partial product generator is configured to generate one set of partial products for each pair of vector components when vector multiplication is being performed.
23. A method for performing multiplication operations within a microprocessor comprising:
generating a plurality of partial products based upon said first operand;
selecting at least one of said partial products based upon a second operand;
generating a rounding constant based upon a selected rounding mode;
summing the selected partial products and said rounding constant to create a redundant form result;
reducing said redundant form result to a final product, wherein said final product is non-redundant; and
increasing the frequency of exactly rounded results by:
calculating a first version of said final product without using an adjustment constant;
calculating a second version of said final product by adding said adjustment constant to said final product;
calculating a third version of said final product by subtracting said adjustment constant from said final product; and
selecting a rounded product for output from said first, second, and third products.
24. The method as recited in
claim 23
, wherein said rounded product is selected based upon a comparison of the product of said first version of said final product and one of said operands.
25. The method as recited in
claim 23
, further comprising:
generating an effective sign bit based upon the most significant bit of a first operand, and
using said effective sign bit when generating said plurality of partial products.
26. The method as recited in
claim 25
, wherein said reducing further comprises:
reducing and normalizing the redundant form result to a first non-redundant form result assuming no overflow;
reducing and normalizing the redundant form result to a second non-redundant form result assuming an overflow; and
selecting a final product from said first and second non-redundant form results based upon at least one bit from said redundant form result.
27. A multiplier capable of multiplying one pair of N-bit operands or two pairs of N/2-bit operands simultaneously comprising:
a multiplier input configured to receive a multiplier operand, wherein said multiplier operand comprises one N-bit value or two N/2-bit values;
a multiplicand input configured to receive a multiplicand operand, wherein said multiplicand operand comprises one N-bit value or two N/2-bit values;
a partial product generator coupled to said multiplicand input, wherein said partial product generator is configured to generate a plurality of partial products based upon said multiplicand operand;
a selection logic unit coupled to said partial product generator and said multiplier input, wherein said selection logic unit is configured to select a plurality of partial products from said partial product generator based upon said multiplier operand; and
a partial product array adder coupled to said selection logic unit, wherein said adder comprises a less significant portion and a more significant portion, wherein said adder is configured to sum the partial products selected by said selection unit to create a redundant-form final product comprises either a single 2N-bit value or two N-bit values, wherein said less significant portion is configured to propagate carry information to said more significant portion only when said multiplier is multiplying two N-bit numbers.
28. The multiplier as recited in
claim 27
, further comprising:
an overflow logic unit coupled to said partial product array adder and configured to convert said redundant-form final product into a first non-redundant form result assuming an overflow will occur during the formation of said redundant-form product;
a non-overflow logic unit coupled to said partial product array adder and configured to convert said redundant form final product into a second non-redundant form result assuming that no overflow will occur during the formation of said redundant form product; and
a multiplexer configured to select either said first or second non-redundant form result as the final non-redundant form result.
29. The multiplier as recited in
claim 28
, further comprising:
a second adder coupled to said multiplexer, wherein said second adder is configured to receive said final non-redundant form result and sum it with an adjustment constant to form a first adjusted result;
a third adder coupled to said multiplexer, wherein said third adder is configured to receive said final non-redundant form result and subtract an adjustment constant to form a second adjusted result; and
a second multiplexer configured to select either said final non-redundant form result, said first adjusted result, or said second adjusted result for output.
30. The multiplier as recited in
claim 28
, further comprising:
a rounding constant selection unit configured to select a rounding constant based upon the selected rounding mode, wherein said partial product array adder comprises a plurality of carry-save adders, wherein said partial product array adder is configured to sum said rounding constant with the partial products selected by said selection unit to create said redundant-form final product.
US09/782,475 1998-08-14 2001-02-12 Method and apparatus for rounding in a multiplier Expired - Lifetime US6397238B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US09/782,475 US6397238B2 (en) 1998-08-14 2001-02-12 Method and apparatus for rounding in a multiplier

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US09/134,171 US6223198B1 (en) 1998-08-14 1998-08-14 Method and apparatus for multi-function arithmetic
US09/782,475 US6397238B2 (en) 1998-08-14 2001-02-12 Method and apparatus for rounding in a multiplier

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US09/134,171 Division US6223198B1 (en) 1997-10-23 1998-08-14 Method and apparatus for multi-function arithmetic

Publications (2)

Publication Number Publication Date
US20010023425A1 true US20010023425A1 (en) 2001-09-20
US6397238B2 US6397238B2 (en) 2002-05-28

Family

ID=22462083

Family Applications (3)

Application Number Title Priority Date Filing Date
US09/134,171 Expired - Lifetime US6223198B1 (en) 1997-10-23 1998-08-14 Method and apparatus for multi-function arithmetic
US09/782,475 Expired - Lifetime US6397238B2 (en) 1998-08-14 2001-02-12 Method and apparatus for rounding in a multiplier
US09/782,474 Expired - Lifetime US6381625B2 (en) 1998-08-14 2001-02-12 Method and apparatus for calculating a power of an operand

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US09/134,171 Expired - Lifetime US6223198B1 (en) 1997-10-23 1998-08-14 Method and apparatus for multi-function arithmetic

Family Applications After (1)

Application Number Title Priority Date Filing Date
US09/782,474 Expired - Lifetime US6381625B2 (en) 1998-08-14 2001-02-12 Method and apparatus for calculating a power of an operand

Country Status (1)

Country Link
US (3) US6223198B1 (en)

Cited By (57)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6460177B1 (en) * 1999-09-22 2002-10-01 Lucent Technologies Inc. Method for target-specific development of fixed-point algorithms employing C++ class definitions
US20040179681A1 (en) * 2003-03-14 2004-09-16 Samsung Electronics Co., Ltd. Apparatus and method for performing montgomery type modular multiplication
US20050144217A1 (en) * 2002-04-10 2005-06-30 Broadcom Corporation Low-error fixed-width modified booth multiplier
US20060224646A1 (en) * 2005-03-30 2006-10-05 Chhavi Kishore System (s), method (s), and apparatus for converting unsigned fixed length codes (decoded from exponential golomb codes) to signed fixed length codes
US20060253520A1 (en) * 2005-05-05 2006-11-09 Mips Technologies, Inc. Processor core and multiplier that support both vector and single value multiplication
US20060253519A1 (en) * 2005-05-05 2006-11-09 Mips Technologies, Inc. Processor core and multiplier that support a multiply and difference operation by inverting sign bits in booth recoding
US20070239817A1 (en) * 2006-04-07 2007-10-11 Oki Electric Industry Co., Ltd. Rounding computing method and computing device therefor
US7814137B1 (en) 2007-01-09 2010-10-12 Altera Corporation Combined interpolation and decimation filter for programmable logic device
US7822799B1 (en) 2006-06-26 2010-10-26 Altera Corporation Adder-rounder circuitry for specialized processing block in programmable logic device
US7836117B1 (en) 2006-04-07 2010-11-16 Altera Corporation Specialized processing block for programmable logic device
US7865541B1 (en) 2007-01-22 2011-01-04 Altera Corporation Configuring floating point operations in a programmable logic device
US7930336B2 (en) 2006-12-05 2011-04-19 Altera Corporation Large multiplier for programmable logic device
US7949699B1 (en) 2007-08-30 2011-05-24 Altera Corporation Implementation of decimation filter in integrated circuit device using ram-based data storage
US7948267B1 (en) 2010-02-09 2011-05-24 Altera Corporation Efficient rounding circuits and methods in configurable integrated circuit devices
US8041759B1 (en) 2006-02-09 2011-10-18 Altera Corporation Specialized processing block for programmable logic device
US8266198B2 (en) 2006-02-09 2012-09-11 Altera Corporation Specialized processing block for programmable logic device
US8266199B2 (en) 2006-02-09 2012-09-11 Altera Corporation Specialized processing block for programmable logic device
US8301681B1 (en) 2006-02-09 2012-10-30 Altera Corporation Specialized processing block for programmable logic device
US8307023B1 (en) 2008-10-10 2012-11-06 Altera Corporation DSP block for implementing large multiplier on a programmable integrated circuit device
US8386550B1 (en) 2006-09-20 2013-02-26 Altera Corporation Method for configuring a finite impulse response filter in a programmable logic device
US8386553B1 (en) 2006-12-05 2013-02-26 Altera Corporation Large multiplier for programmable logic device
US8396914B1 (en) 2009-09-11 2013-03-12 Altera Corporation Matrix decomposition in an integrated circuit device
US8412756B1 (en) 2009-09-11 2013-04-02 Altera Corporation Multi-operand floating point operations in a programmable integrated circuit device
US8468192B1 (en) 2009-03-03 2013-06-18 Altera Corporation Implementing multipliers in a programmable integrated circuit device
US8484265B1 (en) 2010-03-04 2013-07-09 Altera Corporation Angular range reduction in an integrated circuit device
US8510354B1 (en) 2010-03-12 2013-08-13 Altera Corporation Calculation of trigonometric functions in an integrated circuit device
US8539016B1 (en) 2010-02-09 2013-09-17 Altera Corporation QR decomposition in an integrated circuit device
US8539014B2 (en) 2010-03-25 2013-09-17 Altera Corporation Solving linear matrices in an integrated circuit device
US8543634B1 (en) 2012-03-30 2013-09-24 Altera Corporation Specialized processing block for programmable integrated circuit device
US8577951B1 (en) 2010-08-19 2013-11-05 Altera Corporation Matrix operations in an integrated circuit device
US8589463B2 (en) 2010-06-25 2013-11-19 Altera Corporation Calculation of trigonometric functions in an integrated circuit device
US8601044B2 (en) 2010-03-02 2013-12-03 Altera Corporation Discrete Fourier Transform in an integrated circuit device
US8620980B1 (en) 2005-09-27 2013-12-31 Altera Corporation Programmable device with specialized multiplier blocks
US8645450B1 (en) 2007-03-02 2014-02-04 Altera Corporation Multiplier-accumulator circuitry and methods
US8645451B2 (en) 2011-03-10 2014-02-04 Altera Corporation Double-clocked specialized processing block in an integrated circuit device
US8645449B1 (en) 2009-03-03 2014-02-04 Altera Corporation Combined floating point adder and subtractor
US8650236B1 (en) 2009-08-04 2014-02-11 Altera Corporation High-rate interpolation or decimation filter in integrated circuit device
US8650231B1 (en) 2007-01-22 2014-02-11 Altera Corporation Configuring floating point operations in a programmable device
US8706790B1 (en) 2009-03-03 2014-04-22 Altera Corporation Implementing mixed-precision floating-point operations in a programmable integrated circuit device
US8762443B1 (en) 2011-11-15 2014-06-24 Altera Corporation Matrix operations in an integrated circuit device
US8812576B1 (en) 2011-09-12 2014-08-19 Altera Corporation QR decomposition in an integrated circuit device
US8856201B1 (en) 2004-11-10 2014-10-07 Altera Corporation Mixed-mode multiplier using hard and soft logic circuitry
US8862650B2 (en) 2010-06-25 2014-10-14 Altera Corporation Calculation of trigonometric functions in an integrated circuit device
US8949298B1 (en) 2011-09-16 2015-02-03 Altera Corporation Computing floating-point polynomials in an integrated circuit device
US8959137B1 (en) 2008-02-20 2015-02-17 Altera Corporation Implementing large multipliers in a programmable integrated circuit device
US8996600B1 (en) 2012-08-03 2015-03-31 Altera Corporation Specialized processing block for implementing floating-point multiplier with subnormal operation support
US9053045B1 (en) 2011-09-16 2015-06-09 Altera Corporation Computing floating-point polynomials in an integrated circuit device
US9098332B1 (en) 2012-06-01 2015-08-04 Altera Corporation Specialized processing block with fixed- and floating-point structures
US9189200B1 (en) 2013-03-14 2015-11-17 Altera Corporation Multiple-precision processing block in a programmable integrated circuit device
US9207909B1 (en) 2012-11-26 2015-12-08 Altera Corporation Polynomial calculations optimized for programmable integrated circuit device structures
US9348795B1 (en) 2013-07-03 2016-05-24 Altera Corporation Programmable device using fixed and configurable logic to implement floating-point rounding
US9600278B1 (en) 2011-05-09 2017-03-21 Altera Corporation Programmable device using fixed and configurable logic to implement recursive trees
US9684488B2 (en) 2015-03-26 2017-06-20 Altera Corporation Combined adder and pre-adder for high-radix multiplier circuit
US9853841B1 (en) * 2016-06-23 2017-12-26 Huawei Technologies Co., Ltd. Low complexity slicer architectures for N-tap look-ahead decision feedback equalizer (DFE) circuit implementations
US10942706B2 (en) 2017-05-05 2021-03-09 Intel Corporation Implementation of floating-point trigonometric functions in an integrated circuit device
WO2022011308A1 (en) * 2020-07-09 2022-01-13 The Regents Of The University Of California Bit-parallel vector composability for neural acceleration
WO2022081383A1 (en) * 2020-10-15 2022-04-21 Gigantor Technologies Inc. Custom mass multiplication circuits

Families Citing this family (83)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7509486B1 (en) * 1999-07-08 2009-03-24 Broadcom Corporation Encryption processor for performing accelerated computations to establish secure network sessions connections
US6745318B1 (en) * 1999-08-18 2004-06-01 Sanjay Mansingh Method and apparatus of configurable processing
US6546480B1 (en) * 1999-10-01 2003-04-08 Hitachi, Ltd. Instructions for arithmetic operations on vectored data
US6598063B1 (en) * 2000-08-14 2003-07-22 Lntel Corporation Fast calculation of (A/B)K by a parallel floating-point processor
US6681237B1 (en) * 2000-09-07 2004-01-20 International Business Machines Corporation Exponentiation circuit for graphics adapter
JP3563043B2 (en) * 2001-05-31 2004-09-08 株式会社半導体理工学研究センター Method for calculating reciprocal of square root, calculation circuit, and program
ITTO20010818A1 (en) * 2001-08-17 2003-02-17 Telecom Italia Lab Spa CIRCUIT FOR ELEVATING TO POWER.
US20030065699A1 (en) * 2001-10-01 2003-04-03 Koninklijke Philips Electronics N.V. Split multiplier for efficient mixed-precision DSP
US7065545B2 (en) * 2002-05-07 2006-06-20 Quintero-De-La-Garza Raul Gera Computer methods of vector operation for reducing computation time
US7016930B2 (en) * 2002-10-25 2006-03-21 Arm Limited Apparatus and method for performing operations implemented by iterative execution of a recurrence equation
US7260595B2 (en) * 2002-12-23 2007-08-21 Arithmatica Limited Logic circuit and method for carry and sum generation and method of designing such a logic circuit
US7308471B2 (en) 2003-03-28 2007-12-11 Arithmatica Limited Method and device for performing operations involving multiplication of selectively partitioned binary inputs using booth encoding
WO2004104820A2 (en) * 2003-05-23 2004-12-02 Arithmatica Limited A sum bit generation circuit
US20050086040A1 (en) * 2003-10-02 2005-04-21 Curtis Davis System incorporating physics processing unit
US7739479B2 (en) * 2003-10-02 2010-06-15 Nvidia Corporation Method for providing physics simulation data
US7895411B2 (en) * 2003-10-02 2011-02-22 Nvidia Corporation Physics processing unit
GB2409065B (en) * 2003-12-09 2006-10-25 Advanced Risc Mach Ltd Multiplexing operations in SIMD processing
GB2409062C (en) * 2003-12-09 2007-12-11 Advanced Risc Mach Ltd Aliasing data processing registers
GB2409060B (en) * 2003-12-09 2006-08-09 Advanced Risc Mach Ltd Moving data between registers of different register data stores
GB2409066B (en) * 2003-12-09 2006-09-27 Advanced Risc Mach Ltd A data processing apparatus and method for moving data between registers and memory
GB2411975B (en) * 2003-12-09 2006-10-04 Advanced Risc Mach Ltd Data processing apparatus and method for performing arithmetic operations in SIMD data processing
GB2409068A (en) * 2003-12-09 2005-06-15 Advanced Risc Mach Ltd Data element size control within parallel lanes of processing
GB2411973B (en) * 2003-12-09 2006-09-27 Advanced Risc Mach Ltd Constant generation in SMD processing
GB2409064B (en) * 2003-12-09 2006-09-13 Advanced Risc Mach Ltd A data processing apparatus and method for performing in parallel a data processing operation on data elements
GB2409067B (en) * 2003-12-09 2006-12-13 Advanced Risc Mach Ltd Endianess compensation within a SIMD data processing system
GB2411974C (en) * 2003-12-09 2009-09-23 Advanced Risc Mach Ltd Data shift operations
GB2409059B (en) * 2003-12-09 2006-09-27 Advanced Risc Mach Ltd A data processing apparatus and method for moving data between registers and memory
GB2409063B (en) * 2003-12-09 2006-07-12 Advanced Risc Mach Ltd Vector by scalar operations
GB2411976B (en) * 2003-12-09 2006-07-19 Advanced Risc Mach Ltd A data processing apparatus and method for moving data between registers and memory
GB2409061B (en) * 2003-12-09 2006-09-13 Advanced Risc Mach Ltd Table lookup operation within a data processing system
US7472155B2 (en) * 2003-12-29 2008-12-30 Xilinx, Inc. Programmable logic device with cascading DSP slices
US7860915B2 (en) * 2003-12-29 2010-12-28 Xilinx, Inc. Digital signal processing circuit having a pattern circuit for determining termination conditions
US7567997B2 (en) * 2003-12-29 2009-07-28 Xilinx, Inc. Applications of cascading DSP slices
US8495122B2 (en) * 2003-12-29 2013-07-23 Xilinx, Inc. Programmable device with dynamic DSP architecture
US7882165B2 (en) * 2003-12-29 2011-02-01 Xilinx, Inc. Digital signal processing element having an arithmetic logic unit
US7480690B2 (en) * 2003-12-29 2009-01-20 Xilinx, Inc. Arithmetic circuit with multiplexed addend inputs
US7853636B2 (en) * 2003-12-29 2010-12-14 Xilinx, Inc. Digital signal processing circuit having a pattern detector circuit for convergent rounding
US7840630B2 (en) 2003-12-29 2010-11-23 Xilinx, Inc. Arithmetic logic unit circuit
US7870182B2 (en) * 2003-12-29 2011-01-11 Xilinx Inc. Digital signal processing circuit having an adder circuit with carry-outs
US7844653B2 (en) * 2003-12-29 2010-11-30 Xilinx, Inc. Digital signal processing circuit having a pre-adder circuit
US7840627B2 (en) * 2003-12-29 2010-11-23 Xilinx, Inc. Digital signal processing circuit having input register blocks
US7467175B2 (en) * 2003-12-29 2008-12-16 Xilinx, Inc. Programmable logic device with pipelined DSP slices
US7467177B2 (en) * 2003-12-29 2008-12-16 Xilinx, Inc. Mathematical circuit with dynamic rounding
US7853632B2 (en) * 2003-12-29 2010-12-14 Xilinx, Inc. Architectural floorplan for a digital signal processing circuit
US7853634B2 (en) * 2003-12-29 2010-12-14 Xilinx, Inc. Digital signal processing circuit having a SIMD circuit
US7849119B2 (en) * 2003-12-29 2010-12-07 Xilinx, Inc. Digital signal processing circuit having a pattern detector circuit
US7865542B2 (en) * 2003-12-29 2011-01-04 Xilinx, Inc. Digital signal processing block having a wide multiplexer
GB2410097B (en) * 2004-01-13 2006-11-01 Advanced Risc Mach Ltd A data processing apparatus and method for performing data processing operations on floating point data elements
GB2411978B (en) * 2004-03-10 2007-04-04 Advanced Risc Mach Ltd Inserting bits within a data word
US20050251644A1 (en) * 2004-05-06 2005-11-10 Monier Maher Physics processing unit instruction set architecture
CN101002277A (en) * 2004-05-12 2007-07-18 斯班逊有限公司 Semiconductor device and its control method
US7774393B1 (en) * 2004-06-30 2010-08-10 Oracle America, Inc. Apparatus and method for integer to floating-point format conversion
US8044951B1 (en) * 2004-07-02 2011-10-25 Nvidia Corporation Integer-based functionality in a graphics shading language
US7958498B1 (en) 2004-07-02 2011-06-07 Nvidia Corporation Methods and systems for processing a geometry shader program developed in a high-level shading language
US7746347B1 (en) 2004-07-02 2010-06-29 Nvidia Corporation Methods and systems for processing a geometry shader program developed in a high-level shading language
US9557994B2 (en) * 2004-07-13 2017-01-31 Arm Limited Data processing apparatus and method for performing N-way interleaving and de-interleaving operations where N is an odd plural number
US7650266B2 (en) * 2005-05-09 2010-01-19 Nvidia Corporation Method of simulating deformable object using geometrically motivated model
US20080276067A1 (en) * 2007-05-01 2008-11-06 Via Technologies, Inc. Method and Apparatus for Page Table Pre-Fetching in Zero Frame Display Channel
US7627744B2 (en) * 2007-05-10 2009-12-01 Nvidia Corporation External memory accessing DMA request scheduling in IC of parallel processing engines according to completion notification queue occupancy level
US8972472B2 (en) * 2008-03-25 2015-03-03 Densbits Technologies Ltd. Apparatus and methods for hardware-efficient unbiased rounding
US8271572B2 (en) * 2008-10-14 2012-09-18 The Research Foundation Of State University Of New York Generating partial sums
US8479133B2 (en) * 2009-01-27 2013-07-02 Xilinx, Inc. Method of and circuit for implementing a filter in an integrated circuit
US8543635B2 (en) * 2009-01-27 2013-09-24 Xilinx, Inc. Digital signal processing block with preadder stage
US8316071B2 (en) * 2009-05-27 2012-11-20 Advanced Micro Devices, Inc. Arithmetic processing unit that performs multiply and multiply-add operations with saturation and method therefor
US8990282B2 (en) * 2009-09-21 2015-03-24 Arm Limited Apparatus and method for performing fused multiply add floating point operation
US8570336B2 (en) * 2009-12-08 2013-10-29 Intel Corporation Texture unit for general purpose computing
EP2334006B1 (en) * 2009-12-10 2016-03-23 Nxp B.V. Side-channel resistant modular exponentiation
US8965945B2 (en) * 2011-02-17 2015-02-24 Arm Limited Apparatus and method for performing floating point addition
US9122517B2 (en) 2012-06-11 2015-09-01 International Business Machines Corporation Fused multiply-adder with booth-encoding
KR102057648B1 (en) 2013-01-04 2019-12-20 삼성전자주식회사 Mutiplication method and modular multiplier using redundant form recoding
US9645791B2 (en) 2014-06-16 2017-05-09 Apple Inc. Multiplier unit with speculative rounding for use with division and square-root operations
US10275247B2 (en) * 2015-03-28 2019-04-30 Intel Corporation Apparatuses and methods to accelerate vector multiplication of vector elements having matching indices
US10387366B2 (en) * 2015-10-08 2019-08-20 Via Alliance Semiconductor Co., Ltd. Neural network unit with shared activation function units
WO2017184011A1 (en) 2016-04-21 2017-10-26 Oracle International Corporation Iterative division with reduced latency
WO2017196204A1 (en) 2016-05-13 2017-11-16 Oracle International Corporation Methods for constructing lookup tables for division and square root implementations
US10586148B2 (en) * 2016-12-31 2020-03-10 Via Alliance Semiconductor Co., Ltd. Neural network unit with re-shapeable memory
US10565492B2 (en) * 2016-12-31 2020-02-18 Via Alliance Semiconductor Co., Ltd. Neural network unit with segmentable array width rotator
US10565494B2 (en) * 2016-12-31 2020-02-18 Via Alliance Semiconductor Co., Ltd. Neural network unit with segmentable array width rotator
US10318317B2 (en) 2017-05-12 2019-06-11 Tenstorrent Inc. Processing core with operation suppression based on contribution estimate
WO2019066797A1 (en) * 2017-09-27 2019-04-04 Intel Corporation Instructions for vector multiplication of unsigned words with rounding
US10447983B2 (en) * 2017-11-15 2019-10-15 Nxp Usa, Inc. Reciprocal approximation circuit
CN108364065B (en) * 2018-01-19 2020-09-11 上海兆芯集成电路有限公司 Microprocessor for booth multiplication
WO2023200817A1 (en) * 2022-04-11 2023-10-19 Nima Badizadegan Circuit, system and method for computer division approximation

Family Cites Families (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3633018A (en) 1969-12-18 1972-01-04 Ibm Digital division by reciprocal conversion technique
US3777132A (en) 1972-02-23 1973-12-04 Burroughs Corp Method and apparatus for obtaining the reciprocal of a number and the quotient of two numbers
US4163287A (en) 1978-04-20 1979-07-31 Northern Telecom Limited Binary multiplier circuit including coding circuit
GB2108736B (en) 1981-10-27 1984-12-12 Standard Telephones Cables Ltd Sum of products multiplier
DE3274164D1 (en) 1982-12-23 1986-12-11 Ibm Method and apparatus for division operations
JPS62229440A (en) * 1986-03-31 1987-10-08 Toshiba Corp Array multiplier
US4849923A (en) 1986-06-27 1989-07-18 Digital Equipment Corporation Apparatus and method for execution of floating point operations
EP0383965A1 (en) 1989-02-21 1990-08-29 International Business Machines Corporation Multiplier
US5212661A (en) 1989-10-16 1993-05-18 Matsushita Electric Industrial Co., Ltd. Apparatus for performing floating point arithmetic operation and rounding the result thereof
US5157624A (en) 1990-12-13 1992-10-20 Micron Technology, Inc. Machine method to perform newton iterations for reciprocal square roots
US5206823A (en) 1990-12-13 1993-04-27 Micron Technology, Inc. Apparatus to perform Newton iterations for reciprocal and reciprocal square root
JPH05241792A (en) 1992-02-27 1993-09-21 Nec Corp System and device for floating point addition and subtraction
US5539682A (en) * 1992-08-07 1996-07-23 Lsi Logic Corporation Seed generation technique for iterative, convergent digital computations
US5343416A (en) 1992-11-02 1994-08-30 Intel Corporation Method and apparatus for re-configuring a partial product reduction tree
US5606677A (en) 1992-11-30 1997-02-25 Texas Instruments Incorporated Packed word pair multiply operation forming output including most significant bits of product and other bits of one input
EP0924601B1 (en) * 1993-11-23 2001-09-26 Hewlett-Packard Company, A Delaware Corporation Parallel data processing in a single processor
IL116210A0 (en) 1994-12-02 1996-01-31 Intel Corp Microprocessor having a compare operation and a method of comparing packed data in a processor
US5729481A (en) 1995-03-31 1998-03-17 International Business Machines Corporation Method and system of rounding for quadratically converging division or square root
GB9514684D0 (en) 1995-07-18 1995-09-13 Sgs Thomson Microelectronics An arithmetic unit
US5953241A (en) * 1995-08-16 1999-09-14 Microunity Engeering Systems, Inc. Multiplier array processing system with enhanced utilization at lower precision for group multiply and sum instruction
US6240338B1 (en) * 1995-08-22 2001-05-29 Micron Technology, Inc. Seed ROM for reciprocal computation
US5737257A (en) 1995-09-13 1998-04-07 Holtek Microelectronics, Inc. Method and apparatus for compression of integer multiplication table
US5943250A (en) * 1996-10-21 1999-08-24 Samsung Electronics Co., Ltd. Parallel multiplier that supports multiple numbers with different bit lengths
US5841684A (en) 1997-01-24 1998-11-24 Vlsi Technology, Inc. Method and apparatus for computer implemented constant multiplication with multipliers having repeated patterns including shifting of replicas and patterns having at least two digit positions with non-zero values
US6134574A (en) * 1998-05-08 2000-10-17 Advanced Micro Devices, Inc. Method and apparatus for achieving higher frequencies of exactly rounded results
US6115733A (en) * 1997-10-23 2000-09-05 Advanced Micro Devices, Inc. Method and apparatus for calculating reciprocals and reciprocal square roots
US6269384B1 (en) * 1998-03-27 2001-07-31 Advanced Micro Devices, Inc. Method and apparatus for rounding and normalizing results within a multiplier
US6026483A (en) * 1997-10-23 2000-02-15 Advanced Micro Devices, Inc. Method and apparatus for simultaneously performing arithmetic on two or more pairs of operands
US6038583A (en) * 1997-10-23 2000-03-14 Advanced Micro Devices, Inc. Method and apparatus for simultaneously multiplying two or more independent pairs of operands and calculating a rounded products
US6055555A (en) * 1997-12-29 2000-04-25 Intel Corporation Interface for performing parallel arithmetic and round operations
US6035318A (en) * 1998-03-31 2000-03-07 Intel Corporation Booth multiplier for handling variable width operands

Cited By (71)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6460177B1 (en) * 1999-09-22 2002-10-01 Lucent Technologies Inc. Method for target-specific development of fixed-point algorithms employing C++ class definitions
US20050144217A1 (en) * 2002-04-10 2005-06-30 Broadcom Corporation Low-error fixed-width modified booth multiplier
US7334200B2 (en) * 2002-04-10 2008-02-19 Broadcom Corporation Low-error fixed-width modified booth multiplier
US20040179681A1 (en) * 2003-03-14 2004-09-16 Samsung Electronics Co., Ltd. Apparatus and method for performing montgomery type modular multiplication
US8209369B2 (en) 2003-03-14 2012-06-26 Samsung Electronics Co., Ltd. Signal processing apparatus and method for performing modular multiplication in an electronic device, and smart card using the same
US20080065713A1 (en) * 2003-03-14 2008-03-13 Samsung Electronics Co., Ltd. Signal processing apparatus and method for performing modular multiplication in an electronic device, and smart card using the same
US7564971B2 (en) * 2003-03-14 2009-07-21 Samsung Electronics Co., Ltd. Apparatus and method for performing Montgomery type modular multiplication
US8856201B1 (en) 2004-11-10 2014-10-07 Altera Corporation Mixed-mode multiplier using hard and soft logic circuitry
US7801935B2 (en) * 2005-03-30 2010-09-21 Broadcom Corporation System (s), method (s), and apparatus for converting unsigned fixed length codes (decoded from exponential golomb codes) to signed fixed length codes
US20060224646A1 (en) * 2005-03-30 2006-10-05 Chhavi Kishore System (s), method (s), and apparatus for converting unsigned fixed length codes (decoded from exponential golomb codes) to signed fixed length codes
US8234326B2 (en) * 2005-05-05 2012-07-31 Mips Technologies, Inc. Processor core and multiplier that support both vector and single value multiplication
US20060253520A1 (en) * 2005-05-05 2006-11-09 Mips Technologies, Inc. Processor core and multiplier that support both vector and single value multiplication
US20060253519A1 (en) * 2005-05-05 2006-11-09 Mips Technologies, Inc. Processor core and multiplier that support a multiply and difference operation by inverting sign bits in booth recoding
US8229991B2 (en) 2005-05-05 2012-07-24 Mips Technologies, Inc. Processor core and multiplier that support a multiply and difference operation by inverting sign bits in booth recoding
US8620980B1 (en) 2005-09-27 2013-12-31 Altera Corporation Programmable device with specialized multiplier blocks
US8266198B2 (en) 2006-02-09 2012-09-11 Altera Corporation Specialized processing block for programmable logic device
US8301681B1 (en) 2006-02-09 2012-10-30 Altera Corporation Specialized processing block for programmable logic device
US8041759B1 (en) 2006-02-09 2011-10-18 Altera Corporation Specialized processing block for programmable logic device
US8266199B2 (en) 2006-02-09 2012-09-11 Altera Corporation Specialized processing block for programmable logic device
US7836117B1 (en) 2006-04-07 2010-11-16 Altera Corporation Specialized processing block for programmable logic device
US7912888B2 (en) * 2006-04-07 2011-03-22 Oki Semiconductor Co., Ltd. Rounding computing method and computing device therefor
US20070239817A1 (en) * 2006-04-07 2007-10-11 Oki Electric Industry Co., Ltd. Rounding computing method and computing device therefor
US7822799B1 (en) 2006-06-26 2010-10-26 Altera Corporation Adder-rounder circuitry for specialized processing block in programmable logic device
US8386550B1 (en) 2006-09-20 2013-02-26 Altera Corporation Method for configuring a finite impulse response filter in a programmable logic device
US9063870B1 (en) 2006-12-05 2015-06-23 Altera Corporation Large multiplier for programmable logic device
US9395953B2 (en) 2006-12-05 2016-07-19 Altera Corporation Large multiplier for programmable logic device
US8386553B1 (en) 2006-12-05 2013-02-26 Altera Corporation Large multiplier for programmable logic device
US7930336B2 (en) 2006-12-05 2011-04-19 Altera Corporation Large multiplier for programmable logic device
US8788562B2 (en) 2006-12-05 2014-07-22 Altera Corporation Large multiplier for programmable logic device
US7814137B1 (en) 2007-01-09 2010-10-12 Altera Corporation Combined interpolation and decimation filter for programmable logic device
US7865541B1 (en) 2007-01-22 2011-01-04 Altera Corporation Configuring floating point operations in a programmable logic device
US8650231B1 (en) 2007-01-22 2014-02-11 Altera Corporation Configuring floating point operations in a programmable device
US8645450B1 (en) 2007-03-02 2014-02-04 Altera Corporation Multiplier-accumulator circuitry and methods
US7949699B1 (en) 2007-08-30 2011-05-24 Altera Corporation Implementation of decimation filter in integrated circuit device using ram-based data storage
US8959137B1 (en) 2008-02-20 2015-02-17 Altera Corporation Implementing large multipliers in a programmable integrated circuit device
US8307023B1 (en) 2008-10-10 2012-11-06 Altera Corporation DSP block for implementing large multiplier on a programmable integrated circuit device
US8468192B1 (en) 2009-03-03 2013-06-18 Altera Corporation Implementing multipliers in a programmable integrated circuit device
US8645449B1 (en) 2009-03-03 2014-02-04 Altera Corporation Combined floating point adder and subtractor
US8706790B1 (en) 2009-03-03 2014-04-22 Altera Corporation Implementing mixed-precision floating-point operations in a programmable integrated circuit device
US8650236B1 (en) 2009-08-04 2014-02-11 Altera Corporation High-rate interpolation or decimation filter in integrated circuit device
US8412756B1 (en) 2009-09-11 2013-04-02 Altera Corporation Multi-operand floating point operations in a programmable integrated circuit device
US8396914B1 (en) 2009-09-11 2013-03-12 Altera Corporation Matrix decomposition in an integrated circuit device
US8539016B1 (en) 2010-02-09 2013-09-17 Altera Corporation QR decomposition in an integrated circuit device
US7948267B1 (en) 2010-02-09 2011-05-24 Altera Corporation Efficient rounding circuits and methods in configurable integrated circuit devices
US8601044B2 (en) 2010-03-02 2013-12-03 Altera Corporation Discrete Fourier Transform in an integrated circuit device
US8484265B1 (en) 2010-03-04 2013-07-09 Altera Corporation Angular range reduction in an integrated circuit device
US8510354B1 (en) 2010-03-12 2013-08-13 Altera Corporation Calculation of trigonometric functions in an integrated circuit device
US8539014B2 (en) 2010-03-25 2013-09-17 Altera Corporation Solving linear matrices in an integrated circuit device
US8812573B2 (en) 2010-06-25 2014-08-19 Altera Corporation Calculation of trigonometric functions in an integrated circuit device
US8589463B2 (en) 2010-06-25 2013-11-19 Altera Corporation Calculation of trigonometric functions in an integrated circuit device
US8862650B2 (en) 2010-06-25 2014-10-14 Altera Corporation Calculation of trigonometric functions in an integrated circuit device
US8577951B1 (en) 2010-08-19 2013-11-05 Altera Corporation Matrix operations in an integrated circuit device
US8645451B2 (en) 2011-03-10 2014-02-04 Altera Corporation Double-clocked specialized processing block in an integrated circuit device
US9600278B1 (en) 2011-05-09 2017-03-21 Altera Corporation Programmable device using fixed and configurable logic to implement recursive trees
US8812576B1 (en) 2011-09-12 2014-08-19 Altera Corporation QR decomposition in an integrated circuit device
US8949298B1 (en) 2011-09-16 2015-02-03 Altera Corporation Computing floating-point polynomials in an integrated circuit device
US9053045B1 (en) 2011-09-16 2015-06-09 Altera Corporation Computing floating-point polynomials in an integrated circuit device
US8762443B1 (en) 2011-11-15 2014-06-24 Altera Corporation Matrix operations in an integrated circuit device
US8543634B1 (en) 2012-03-30 2013-09-24 Altera Corporation Specialized processing block for programmable integrated circuit device
US9098332B1 (en) 2012-06-01 2015-08-04 Altera Corporation Specialized processing block with fixed- and floating-point structures
US8996600B1 (en) 2012-08-03 2015-03-31 Altera Corporation Specialized processing block for implementing floating-point multiplier with subnormal operation support
US9207909B1 (en) 2012-11-26 2015-12-08 Altera Corporation Polynomial calculations optimized for programmable integrated circuit device structures
US9189200B1 (en) 2013-03-14 2015-11-17 Altera Corporation Multiple-precision processing block in a programmable integrated circuit device
US9348795B1 (en) 2013-07-03 2016-05-24 Altera Corporation Programmable device using fixed and configurable logic to implement floating-point rounding
US9684488B2 (en) 2015-03-26 2017-06-20 Altera Corporation Combined adder and pre-adder for high-radix multiplier circuit
US9853841B1 (en) * 2016-06-23 2017-12-26 Huawei Technologies Co., Ltd. Low complexity slicer architectures for N-tap look-ahead decision feedback equalizer (DFE) circuit implementations
US20170373887A1 (en) * 2016-06-23 2017-12-28 Huong Ho Low complexity slicer architectures for n-tap look-ahead decision feedback equalizer (dfe) circuit implementations
US10942706B2 (en) 2017-05-05 2021-03-09 Intel Corporation Implementation of floating-point trigonometric functions in an integrated circuit device
WO2022011308A1 (en) * 2020-07-09 2022-01-13 The Regents Of The University Of California Bit-parallel vector composability for neural acceleration
WO2022081383A1 (en) * 2020-10-15 2022-04-21 Gigantor Technologies Inc. Custom mass multiplication circuits
US11748061B2 (en) 2020-10-15 2023-09-05 Gigantor Technologies Inc. Custom mass multiplication circuits

Also Published As

Publication number Publication date
US6397238B2 (en) 2002-05-28
US20010010051A1 (en) 2001-07-26
US6381625B2 (en) 2002-04-30
US6223198B1 (en) 2001-04-24

Similar Documents

Publication Publication Date Title
US6397238B2 (en) Method and apparatus for rounding in a multiplier
US6134574A (en) Method and apparatus for achieving higher frequencies of exactly rounded results
US6269384B1 (en) Method and apparatus for rounding and normalizing results within a multiplier
US6038583A (en) Method and apparatus for simultaneously multiplying two or more independent pairs of operands and calculating a rounded products
US6144980A (en) Method and apparatus for performing multiple types of multiplication including signed and unsigned multiplication
US6085213A (en) Method and apparatus for simultaneously multiplying two or more independent pairs of operands and summing the products
US6397239B2 (en) Floating point addition pipeline including extreme value, comparison and accumulate functions
CN109643228B (en) Low energy mantissa multiplication for floating point multiply-add operations
US6115733A (en) Method and apparatus for calculating reciprocals and reciprocal square roots
US6487575B1 (en) Early completion of iterative division
US6490607B1 (en) Shared FP and SIMD 3D multiplier
CN106325811B (en) Method in a microprocessor
Zhang et al. Efficient multiple-precision floating-point fused multiply-add with mixed-precision support
US6360241B1 (en) Computer method and apparatus for division and square root operations using signed digit
US6523050B1 (en) Integer to floating point conversion using one's complement with subsequent correction to eliminate two's complement in critical path
US6256653B1 (en) Multi-function bipartite look-up table
US5426600A (en) Double precision division circuit and method for digital signal processor
US6115732A (en) Method and apparatus for compressing intermediate products
US5918062A (en) Microprocessor including an efficient implemention of an accumulate instruction
JP5719341B2 (en) Mechanism for fast detection of overshifts in floating point units
US6026483A (en) Method and apparatus for simultaneously performing arithmetic on two or more pairs of operands
US8019805B1 (en) Apparatus and method for multiple pass extended precision floating point multiplication
Agarwal et al. Series approximation methods for divide and square root in the Power3/sup TM/processor
US6941334B2 (en) Higher precision divide and square root approximations
WO2000011546A1 (en) Method and apparatus for concurrently executing multiplication and iterative operations

Legal Events

Date Code Title Description
STCF Information on status: patent grant

Free format text: PATENTED CASE

FPAY Fee payment

Year of fee payment: 4

FPAY Fee payment

Year of fee payment: 8

FPAY Fee payment

Year of fee payment: 12

AS Assignment

Owner name: ADVANCED MICRO DEVICES, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:OBERMAN, STUART;JUFFA, NORBERT;SIU, MING;AND OTHERS;SIGNING DATES FROM 19980803 TO 19980812;REEL/FRAME:036613/0380

AS Assignment

Owner name: ADVANCED SILICON TECHNOLOGIES, LLC, NEW HAMPSHIRE

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:ADVANCED MICRO DEVICES, INC.;REEL/FRAME:036700/0001

Effective date: 20150925