US20170147288A1 - Floating point multiply accumulator multi-precision mantissa aligner - Google Patents
Floating point multiply accumulator multi-precision mantissa aligner
- Publication number
- US20170147288A1 (application US 15/391,470)
- Authority
- US
- United States
- Prior art keywords
- shift
- exponent
- precision
- binary number
- mantissa
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/483—Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
- G06F7/487—Multiplying; Dividing
- G06F7/4876—Multiplying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F5/00—Methods or arrangements for data conversion without changing the order or content of the data handled
- G06F5/01—Methods or arrangements for data conversion without changing the order or content of the data handled for shifting, e.g. justifying, scaling, normalising
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F5/00—Methods or arrangements for data conversion without changing the order or content of the data handled
- G06F5/01—Methods or arrangements for data conversion without changing the order or content of the data handled for shifting, e.g. justifying, scaling, normalising
- G06F5/012—Methods or arrangements for data conversion without changing the order or content of the data handled for shifting, e.g. justifying, scaling, normalising in floating-point computations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/483—Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/52—Multiplying; Dividing
- G06F7/523—Multiplying only
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/544—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
- G06F7/5443—Sum of products
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2205/00—Indexing scheme relating to group G06F5/00; Methods or arrangements for data conversion without changing the order or content of the data handled
- G06F2205/003—Reformatting, i.e. changing the format of data representation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2207/00—Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F2207/38—Indexing scheme relating to groups G06F7/38 - G06F7/575
- G06F2207/3804—Details
- G06F2207/3808—Details concerning the type of numbers or the way they are handled
- G06F2207/3812—Devices capable of handling different types of numbers
- G06F2207/382—Reconfigurable for different fixed word lengths
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/483—Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
- G06F7/485—Adding; Subtracting
Definitions
- Embodiments of this invention relate generally to processors and processing circuits, and, more particularly, to a method and apparatus for a floating point multiply accumulator (FMAC) multi-precision mantissa aligner.
- Processors and processing circuits have evolved, becoming faster and more power intensive. With increased speed and capabilities, processors and processing circuits must be adapted to run more efficiently and with greater flexibility. As technology for these devices has progressed, a need has developed for performance and efficiency improvements. However, complexity, power and performance considerations introduce substantial barriers to these improvements. Additionally, circuit area and circuit overhead requirements (e.g., routing and layout) present barriers to improvements.
- Multi-precision mantissa alignment may alleviate or reduce the abovementioned barriers to power reduction, efficiency and flexibility.
- Support for two parallel single-precision operations embedded in a higher-precision datapath is not found in existing FMACs.
- State of the art FMACs are thus incapable of improving power usage, overhead, efficiency and flexibility through the use of parallel single-precision operations.
- In one aspect of the present invention, a processing device includes first, second and third precision operation circuits.
- the processing device further includes a shared, bit-shifting circuit that is communicatively coupled to the first, second and third precision operation circuits.
- In another aspect of the invention, a method includes multiplying a first binary number and a second binary number to obtain a product, where multiplying includes adding a first exponent value associated with the first binary number to a second exponent value associated with the second binary number to obtain an exponent sum, and multiplying a first mantissa value associated with the first binary number by a second mantissa value associated with the second binary number.
- the method also includes that the exponent adding and the mantissa multiplying are performed substantially in parallel.
- the method further includes performing at least one of adding a third binary number to the product or subtracting the third binary number from the product.
- A computer readable storage device is encoded with data that, when implemented in a manufacturing facility, adapts the manufacturing facility to create an apparatus that includes first, second and third precision operation circuits.
- the apparatus further includes a shared, bit-shifting circuit that is communicatively coupled to the first, second and third precision operation circuits.
- FIG. 1 schematically illustrates a simplified block diagram of a computer system including one or more FMACs, according to one embodiment
- FIG. 2 shows a simplified block diagram of multi-precision FMAC, according to one embodiment
- FIG. 3 provides a simplified block diagram of multi-precision FMAC(s) on a silicon die/chip, according to one embodiment
- FIG. 4 illustrates an exemplary detailed representation of a multi-precision FMAC produced in a semiconductor fabrication facility, according to one embodiment
- FIG. 5 illustrates a schematic diagram of an FMAC, according to one exemplary embodiment
- FIG. 6 illustrates a schematic diagram of data alignment using an FMAC, according to one exemplary embodiment
- FIG. 7 illustrates a schematic diagram of FMAC mantissa fields in an aligned dataflow, according to one exemplary embodiment.
- FIG. 8 illustrates a flowchart depicting steps for shifting and aligning data, according to one exemplary embodiment.
- the terms “substantially” and “approximately” may mean within 85%, 90%, 95%, 98% and/or 99%. In some cases, as would be understood by a person of ordinary skill in the art, the terms “substantially” and “approximately” may indicate that differences, while perceptible, may be negligible or small enough to be ignored.
- Embodiments of the present invention generally provide for methods and apparatus for a floating point multiply accumulator (FMAC) multi-precision mantissa aligner. It is contemplated that various embodiments described herein are not mutually exclusive. That is, the various embodiments described herein may be implemented simultaneously with, or independently of, each other, as would be apparent to one of ordinary skill in the art having the benefit of this disclosure. The embodiments described herein show a novel design that efficiently solves the problems described above. The embodiments described herein may utilize multi-precision mantissa alignment for a FMAC comprising two parallel single-precision (SP) operations (operation circuits), as well as an extended-/double-precision (EP/DP) operation (operation circuit).
- The Institute of Electrical and Electronics Engineers (IEEE) defines floating point formats, including single and double precision, in the IEEE 754 standard.
- Binary numbers may be formatted such that they comprise two distinct portions: an exponent portion and a mantissa portion.
- SP operations use 23 bits for the mantissa and 8 bits for the exponent.
- EP operations use 64 bits for the mantissa and 15 bits for the exponent.
- DP operations use 52 bits for the mantissa and 11 bits for the exponent.
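The SP and DP field widths listed above can be checked by unpacking the raw bits of a number. The following is a minimal Python sketch, assuming the standard IEEE 754 binary32 and binary64 encodings (one sign bit, then the exponent field, then the stored mantissa bits); the function names `split_single` and `split_double` are illustrative and do not come from the patent.

```python
import struct

def split_single(x: float):
    """Return (sign, biased_exponent, fraction) of x encoded as binary32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF        # 8 exponent bits
    fraction = bits & ((1 << 23) - 1)     # 23 stored mantissa bits
    return sign, exponent, fraction

def split_double(x: float):
    """Return (sign, biased_exponent, fraction) of x encoded as binary64."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF       # 11 exponent bits
    fraction = bits & ((1 << 52) - 1)     # 52 stored mantissa bits
    return sign, exponent, fraction
```

For a normal number the leading 1 of the mantissa is implicit, which is why a 23-bit stored SP mantissa corresponds to a 24-bit value once that bit is restored.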
- the embodiments described herein may allow for decreased latency in floating point multiply-add operations as well as higher throughput.
- the embodiments described herein may also allow for power and/or area optimization for floating point multiply-add circuits.
- the computer system 100 may be a personal computer, a laptop computer, a handheld computer, a tablet computer, a mobile device, a telephone, a personal data assistant (“PDA”), a server, a mainframe, a work terminal, or the like.
- the computer system includes a main structure 110 which may be a computer motherboard, circuit board or printed circuit board, a desktop computer enclosure and/or tower, a laptop computer base, a server enclosure, part of a mobile device, personal data assistant (PDA), or the like.
- the main structure 110 may include a graphics card 120 .
- the graphics card 120 may be a Radeon™ graphics card from Advanced Micro Devices (“AMD”) or any other graphics card using memory, in alternate embodiments.
- the graphics card 120 may, in different embodiments, be connected on a Peripheral Component Interconnect (“PCI”) Bus (not shown), a PCI-Express Bus (not shown), an Accelerated Graphics Port (“AGP”) Bus (also not shown), or any other connection known in the art.
- embodiments of the present invention are not limited by the connectivity of the graphics card 120 to the main computer structure 110 .
- computer runs an operating system such as Linux, Unix, Windows, Mac OS, or the like.
- the graphics card 120 may contain a graphics processing unit (GPU) 125 used in processing graphics data.
- the GPU 125 may include one or more embedded memories (not shown).
- the embedded memory(ies) may be an embedded random access memory (“RAM”), an embedded static random access memory (“SRAM”), or an embedded dynamic random access memory (“DRAM”).
- the embedded memory(ies) may be an embedded RAM (e.g., an SRAM).
- the embedded memory(ies) may be embedded in the graphics card 120 in addition to, or instead of, being embedded in the GPU 125 .
- the graphics card 120 may be referred to as a circuit board, a printed circuit board, a daughter card or the like.
- the computer system 100 includes a central processing unit (“CPU”) 140 , which is connected to a northbridge 145 .
- the CPU 140 and northbridge 145 may be housed on the motherboard (not shown) or some other structure of the computer system 100 .
- the graphics card 120 may be coupled to the CPU 140 via the northbridge 145 or some other connection as is known in the art.
- the CPU 140, northbridge 145 and GPU 125 may be included in a single package or as part of a single die or “chip(s)” (not shown).
- Alternative embodiments which alter the arrangement of various components illustrated as forming part of main structure 110 are also contemplated.
- the CPU 140 may include one or more multi-precision FMACs 130 .
- the multi-precision FMACs 130 may include a multi-precision mantissa aligner comprising two or more parallel single-precision operation circuits (described below with respect to FIG. 5 ).
- the northbridge 145 may be coupled to a system RAM (or DRAM) 155 ; in other embodiments, the system RAM 155 may be coupled directly to the CPU 140 .
- the system RAM 155 may be of any RAM type known in the art; the type of RAM 155 does not limit the embodiments of the present invention.
- the northbridge 145 may be connected to a southbridge 150 .
- the northbridge 145 and southbridge 150 may be on the same chip in the computer system 100 , or the northbridge 145 and southbridge 150 may be on different chips.
- the southbridge 150 may be connected to one or more data storage units 160 using a data connection or bus 199 .
- the data storage units 160 may be hard drives, solid state drives, magnetic tape, or any other writable media used for storing data.
- one or more of the data storage units may be SATA data storage units and the data connection 199 may be a SATA bus/connection.
- the data storage units 160 may contain one or more multi-precision FMACs 130 .
- the central processing unit 140 , northbridge 145 , southbridge 150 , graphics processing unit 125 , DRAM 155 and/or embedded RAM may be a computer chip or a silicon-based computer chip, or may be part of a computer chip or a silicon-based computer chip.
- the various components of the computer system 100 may be operatively, electrically and/or physically connected or linked with a bus 195 or more than one bus 195 .
- the computer system 100 may be connected to one or more display units 170 , input devices 180 , output devices 185 and/or other peripheral devices 190 . It is contemplated that in various embodiments, these elements may be internal or external to the computer system 100 , and may be wired or wirelessly connected, without affecting the scope of the embodiments of the present invention.
- the display units 170 may be internal or external monitors, television screens, handheld device displays, and the like.
- the input devices 180 may be any one of a keyboard, mouse, track-ball, stylus, mouse pad, mouse button, joystick, scanner or the like.
- the output devices 185 may be any one of a monitor, printer, plotter, copier or other output device.
- the peripheral devices 190 may be any other device which can be coupled to a computer: a CD/DVD drive capable of reading and/or writing to corresponding physical digital media, a universal serial bus (“USB”) device, Zip Drive, external floppy drive, external hard drive, phone and/or broadband modem, router/gateway, access point and/or the like.
- any number of computer systems 100 may be communicatively coupled and/or connected to each other through a network infrastructure. In various embodiments, such connections may be wired or wireless without limiting the scope of the embodiments described herein.
- the network may be a local area network (LAN), wide area network (WAN), personal network, company intranet or company network, the Internet, or the like.
- the computer systems 100 connected to the network via the network infrastructure may be a personal computer, a laptop computer, a handheld computer, a tablet computer, a mobile device, a telephone, a personal data assistant (PDA), a server, a mainframe, a work terminal, any other computing device described herein, and/or the like.
- the number of computers connected to the network may vary; in practice any number of computer systems 100 may be coupled/connected using the network.
- computer systems 100 may include one or more graphics cards and/or graphics processing units (GPUs).
- the graphics cards 120 may contain one or more GPUs 125 used in processing graphics data.
- the GPU 125 may include a multi-precision FMAC 130 .
- the multi-precision FMAC 130 may be embedded in the graphics card 120 in addition to, or instead of, being embedded in the GPU 125 .
- Although certain exemplary aspects of the graphics card 120 and/or the GPU(s) 125 are not described herein, such exemplary aspects may or may not be included in various embodiments without limiting the spirit and scope of the embodiments of the present invention, as would be understood by one of skill in the art.
- the graphics processing unit 125 and multi-precision FMAC 130 may reside on the same silicon chip as the CPU 140 and/or the northbridge 145 .
- the multi-precision FMAC 130 may reside on the same silicon chip as the CPU 140 .
- the silicon chip(s) may be used in a computer system 100 in place of, or in addition to, the graphics card 120 .
- the silicon chip(s) may be housed on the motherboard (not shown) or other structure of the computer system 100 .
- Turning to FIG. 2, a simplified, exemplary representation of the multi-precision FMAC 130, which may be used in silicon die/chips 440 as well as in devices depicted in FIG. 1, is illustrated, according to various embodiments.
- the multi-precision FMAC 130 may take on any of a variety of forms, including those described herein, without departing from the spirit and scope of the instant invention
- the silicon die/chip 440 is illustrated as including one or more of the multi-precision FMACs 130.
- various embodiments of the multi-precision FMAC 130 may be used in a wide variety of electronic devices, including, but not limited to, central processing units, motherboards, graphics cards, graphics processors, combinatorial logic implementations, stand-alone controllers, other integrated circuits (ICs), digital signal processors (DSPs), and/or the like.
- one or more of the multi-precision FMACs 130 may be included on the silicon die/chips 440 (or computer chip).
- the silicon die/chips 440 may contain one or more different configurations of the multi-precision FMACs 130 (e.g., a multi-precision FMAC 130 configured to include parallel SP operations/operational circuits).
- the silicon chips 440 may be produced on a silicon wafer 430 in a fabrication facility (or “fab”) 490. That is, the silicon wafers 430 and the silicon die/chips 440 may be referred to as the output, or product of, the fab 490.
- the silicon die/chips 440 may be used in electronic devices, such as those described above in this disclosure.
- an exemplary FMAC multiply-add operation may be conceptualized as: A × B ± C (“A multiplied by B, plus or minus C”).
- When two numbers are multiplied, their exponents may be added and their mantissas may be multiplied. Addition may require that exponents are “lined up”; in order to be added, numbers may need to have the same exponent.
- the mantissa of one or more operands may be shifted. By shifting the operand(s), the effective exponent of a number may be changed.
- For example, when adding 1.0 × 10^3 and 2.0 × 10^2, the exponents would need to be equalized.
- 1.0 × 10^3 may be equalized with 2.0 × 10^2 by shifting the exponent 10^3, making it 10^2.
- the resulting number would then be 10 × 10^2.
- the numbers could then be added by adding mantissas (10 + 2), each having an exponent of “2”, for a result of 12 × 10^2 (i.e., 1.2 × 10^3).
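The decimal alignment example above can be written out as a short Python sketch. It assumes integer mantissas and base-10 exponents purely for illustration; the function name `add_aligned` is hypothetical, and a hardware aligner would work on binary mantissas instead.

```python
def add_aligned(m1: int, e1: int, m2: int, e2: int):
    """Add m1 * 10**e1 and m2 * 10**e2 by first equalizing the exponents.

    Both operands are rescaled to the smaller exponent, so the mantissas
    can be added directly. Returns (mantissa_sum, common_exponent).
    """
    e = min(e1, e2)
    m1 *= 10 ** (e1 - e)   # 1 x 10^3 becomes 10 x 10^2
    m2 *= 10 ** (e2 - e)   # operand already at the common exponent is unchanged
    return m1 + m2, e
```

Calling `add_aligned(1, 3, 2, 2)` reproduces the worked example: the mantissas 10 and 2 are added at exponent 2.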
- the FMAC 130 may comprise two parallel single-precision operation circuits such that the exponent addition and the mantissa multiplication (e.g., for multiplication operations) may be performed in parallel or substantially in parallel.
- the shifting of one or more operand mantissas (e.g., for addition operations) may also be performed in parallel or substantially in parallel to the mantissa multiplication.
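The multiply step (exponents added, mantissas multiplied) can be mimicked in software to show that the two computations do not depend on each other. This is a rough Python sketch using `math.frexp` and `math.ldexp`; the function name `multiply_via_fields` is an assumption for illustration, and the code runs sequentially, so it only shows the data independence that lets hardware perform both paths in parallel.

```python
import math

def multiply_via_fields(a: float, b: float) -> float:
    """Multiply a and b by handling exponents and mantissas separately."""
    ma, ea = math.frexp(a)          # a == ma * 2**ea, with 0.5 <= |ma| < 1
    mb, eb = math.frexp(b)          # b == mb * 2**eb
    exponent_sum = ea + eb          # exponent path: needs only ea and eb
    mantissa_product = ma * mb      # mantissa path: needs only ma and mb
    return math.ldexp(mantissa_product, exponent_sum)
```

Neither line of the two paths reads the other's result, which is the property the parallel circuits rely on.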
- the multi-precision FMAC 130 may contain circuitry to perform multi-precision mantissa alignment and/or multi-precision mantissa alignment using two parallel single-precision operation circuits.
- the illustrated FMAC 130 may comprise shift blocks, shift blocks for extended precision (EP), double precision (DP) and/or single precision (SP).
- DP operations may be performed by the EP operations blocks. It should be noted, however, that in various embodiments, alternate DP operational blocks may be used.
- a block may also be referred to as a circuit, circuit portion and/or circuit block.
- the illustrated exemplary FMAC 130 may comprise an invert block 510 configured to invert the addend mantissa Mc 505 , a shift 1X block 515 , a shift 4X block 520 , an EP shift 16X block 525 , an EP shift 64X block 530 , an SP-lo shift 16X block 535 , an SP-lo shift 64X block 540 , an SP-hi shift 16X block 545 , an SP-hi shift 64X block 550 , and/or an overlap block 555 .
- the invert block 510 may be configured to invert bits of the Mc 505 if subtracting the product and addend instead of adding. In one embodiment, the inversion may be controlled by the invert controls 507 .
- the various shift blocks described herein may be adapted to, configured to, and/or capable of shifting a binary number (or number of another format) by a given number of bits.
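- a 1X shift block (e.g., shift 1X block 515 ) may shift a binary number a given number of times by one (1) bit. That is, a 1X shift block may shift a binary number by 0, 1, 2 or 3 bits.
- a 4X shift block (e.g., shift 4X block 520 ) may shift a binary number a given number of times by four (4) bits. That is, a 4X shift block may shift a binary number by 0, 4, 8 or 12 bits.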
- a 16X shift block (e.g., EP shift 16X block 525 , SP-lo shift 16X block 535 and/or SP-hi shift 16X block 545 ) may shift a binary number a given number of times by sixteen (16) bits. That is, a 16X shift block may shift a binary number by 0, 16, 32 or 48 bits.
- a 64X shift block (e.g., EP shift 64X block 530 , SP-lo shift 64X block 540 and/or SP-hi shift 64X block 550 ) may shift a binary number a given number of times by sixty-four (64) bits. That is, a 64X shift block may shift a binary number by 0, 64, 128 or 192 bits.
- shifting of bits may be performed by shifting zero (“0”) or more bits; that is, if a determination is made that a binary number should not be shifted, it may be said that the binary number was shifted by zero (“0”) bits.
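Taken together, the 1X, 4X, 16X and 64X stages can produce any shift from 0 to 255 bits, because each stage contributes one base-4 digit of the shift amount. The Python sketch below illustrates that decomposition; the function names are hypothetical, and it models a single shifter chain rather than the separate EP, SP-lo and SP-hi back ends.

```python
def stage_selects(shift: int):
    """Split a shift amount (0..255) into four base-4 digits.

    Returns [sel_1x, sel_4x, sel_16x, sel_64x], each in 0..3.
    """
    if not 0 <= shift <= 255:
        raise ValueError("shift must be in 0..255")
    return [(shift >> (2 * k)) & 0b11 for k in range(4)]

def staged_right_shift(value: int, shift: int) -> int:
    """Apply the shift one stage at a time, as a four-stage aligner would."""
    for stage, digit in enumerate(stage_selects(shift)):
        value >>= digit * (4 ** stage)   # 0 to 3 steps of 1, 4, 16 or 64 bits
    return value
```

For example, a shift of 77 bits decomposes into 1 + 12 + 0 + 64, so the stage selects are 1, 3, 0 and 1.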
- the overlap block 555 may, in one or more embodiments, be adapted to handle un-overlapped product and addend mantissa cases in the last aligner stage, as illustratively shown in FIG. 5 . In the case that the addend mantissa shift will result in the addend mantissa not potentially overlapping with the product mantissa, then the addend mantissa is effectively concatenated with the product mantissa.
- the overlap block 555 may output a 256-bit result R 599 .
- the exemplary FMAC 130 illustrated in FIG. 5 may be conceptualized for illustrative purposes as shifting in four stages.
- the first two stages may be shared.
- the shift 1X block 515 may be referred to as a first stage (i.e., shifting 0, 1, 2, or 3 bits) and the shift 4X block 520 may be referred to as a second stage (i.e., shifting 0, 4, 8, or 12 bits).
- These stages may be shared in that the output of the shift 1X block 515 and the shift 4X block 520 may be output to the EP (DP) shift 16X block 525 , the SP-lo shift 16X block 535 and/or the SP-hi shift 16X block 545 , as illustrated in FIG.
- the output of the shared stages of the FMAC 130 may be sent across the datapath to the three back-end shifters of the FMAC 130 (i.e., to the EP (DP) aligner and/or one or both of the SP aligners, SP-lo and SP-hi, as described below with respect to FIG. 6 ).
- the shared front-end stages allow the data to traverse the full (vertical) length of the 64-bit datapath just once (from the shift 4X block 520 output to the EP (DP) shift 16X block 525 , the SP-lo shift 16X block 535 and/or the SP-hi shift 16X block 545 inputs), which improves routing efficiency.
- the multi-precision FMAC 130 may comprise an exponent difference block 560 and a decoder block 565 .
- the exponent difference block 560 and the decoder block 565 may be adapted to take one or more inputs from a system and, based at least in part on the inputs, determine the shifting schedule for the multi-precision FMAC 130 operations.
- the exponent difference calculation is performed in the exponent difference block 560.
- the mantissa(s) may need to be shifted in order to have them align their respective binary points.
- the exponent difference block 560 may comprise a 4 to 2 adder/compressor 557 and/or a 2 to 1 carry propagate adder 559 .
- the exponent difference block 560 may take as inputs: the exponent of term A (Ea) 556 a, the exponent of term B (Eb) 556 b, the exponent of term C (Ec) 556 c, and a bias signal Bias 556 d.
- the adder 557 and the adder 559 may, in one or more embodiments, perform an effective subtraction operation to determine the difference between the sum of the product exponents Ea 556 a and Eb 556 b, and the addend exponent Ec 556 c.
- the bias 556 d may be adapted to bias the exponent values such that some or all of the shift operations in the FMAC 130 may be performed by shifting to the right, rather than shifting to the left as well as removing any additional bias in order to perform the calculation.
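One way to picture the exponent difference with a bias is the Python sketch below. It assumes IEEE-style biased exponent fields, and the parameters `format_bias`, `offset` and `max_shift` are hypothetical: the text does not give the value of Bias 556 d or the shift range, so the numbers in the usage note are placeholders only.

```python
def alignment_shift(ea: int, eb: int, ec: int,
                    format_bias: int, offset: int, max_shift: int) -> int:
    """Right-shift amount for the addend mantissa (illustrative only).

    ea, eb, ec  : biased exponent fields of A, B and C
    format_bias : exponent bias of the format (assumed, e.g. 127 for SP)
    offset      : extra constant so typical cases yield a non-negative shift
    max_shift   : largest shift the aligner supports
    """
    product_exp = ea + eb - format_bias   # adding two biased exponents doubles the bias
    diff = product_exp - ec               # product exponent minus addend exponent
    return min(max(diff + offset, 0), max_shift)   # clamp to right shifts only
```

With the placeholder values `format_bias=127`, `offset=26` and `max_shift=255`, three operands that all have exponent field 127 give a shift of 26; an addend much larger than the product clamps to 0, and a much smaller one clamps to 255.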
- the outputs of the exponent difference block 560 may be decoded using the block 565 .
- the decoder block 565 may comprise one or more decoders (Dec) 561 a - n and/or one or more multiplexors 567 in order to generate the necessary shift controls from the calculated exponent difference for all the aligner complex stages.
- the multiplexors 567 may be 2-to-1 multiplexors and may be used to select between two exponent difference calculations (“Product minus Addend” and “Addend minus Product”). These two subtractions (differences) may be calculated in parallel using the carryout of one of the adders, and may be used to select which difference calculation is valid for the first two sets of 2-to-4 decoders for the four shift 1X controls and the four shift 4X controls described below. Calculating both differences in parallel and then selecting the proper difference may be done to further minimize latency.
- “one-hot” select n-to-1 multiplexors and decoders may be used extensively for realizing the shifters Shift 1X 515 , Shift 4X 520 , etc.
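The compute-both-then-select idea and the one-hot shift controls can be sketched in Python as follows. The function names are hypothetical, the sign test stands in for the adder carry-out, and only the 1X and 4X controls are decoded; this is a behavioral illustration, not a description of the actual circuit.

```python
def one_hot_2to4(sel: int):
    """2-to-4 decoder: exactly one of the four output lines is 1."""
    return [1 if i == (sel & 0b11) else 0 for i in range(4)]

def select_difference(product_exp: int, addend_exp: int) -> int:
    """Form both differences unconditionally, then pick the valid one."""
    p_minus_a = product_exp - addend_exp   # "Product minus Addend"
    a_minus_p = addend_exp - product_exp   # "Addend minus Product"
    borrow = p_minus_a < 0                 # stands in for the adder carry-out
    return a_minus_p if borrow else p_minus_a   # 2-to-1 multiplexor

def shift_controls(diff: int):
    """One-hot selects for the shift 1X and shift 4X stages."""
    return one_hot_2to4(diff), one_hot_2to4(diff >> 2)
```

Because neither subtraction waits on the comparison, the selection adds only a multiplexor delay, which is the latency argument made above.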
- This use of standard cells may increase routing efficiency and decrease footprint area.
- the 64X shift stages are straight (i.e., horizontal) routes in the datapath. That is, the routing of the 64X shift stages is optimally done to minimize the distance traveled by these shift stages.
- Turning to FIG. 6, a graphical representation of an illustrative side-by-side alignment of respective addends, post-alignment, for an EP (DP) operation and two parallel SP operations is depicted, in accordance with one embodiment.
- bit numbers [ 193 : 0 ] 605 are shown alongside the alignments.
- bit 193 is the most significant bit (MSB) and bit 0 is the least significant bit (LSB).
- an EP source addend portion 610 and one or two SP source addend portions 615 , 620 (SP-hi and SP-lo, respectively) may be aligned.
- the EP source addend portion 610 may be 64 bits and the SP source addend portions 615, 620 may each be 24 bits. In one embodiment, aligner shifting is performed by shifting to the right. The most significant bit of the SP-hi source addend portion 615 may be aligned with the most significant bit of the EP source addend portion 610. The least significant bit of the SP-lo source addend portion 620 may be aligned with the least significant bit of the EP source addend portion 610.
- the EP source addend portion 610 may have a corresponding EP aligner output 625 , that may be, in one embodiment, 194 bits.
- the SP-hi source addend portion 615 may have a corresponding SP-hi incrementer aligner output 630 , that may be, in one embodiment, 26 bits.
- the SP-hi incrementer aligner output 630 may have its most significant bit aligned with the most significant bit (bit 193 ) of the EP aligner output 625 .
- the SP-lo source addend portion 620 may have a corresponding SP-lo incrementer aligner output 635 , that may be, in one embodiment, 26 bits.
- the SP-lo incrementer aligner output 635 may have its least significant bit aligned with the one hundred twenty-eighth bit (bit 128 ) of the EP aligner output 625 .
- the SP-hi source addend portion 615 may have a corresponding SP-hi adder aligner output 640 , that may be, in one embodiment, 48 bits.
- the SP-hi adder aligner output 640 may have its most significant bit aligned with the one hundred twenty-seventh bit (bit 127 ) of the EP aligner output 625 .
- the SP-lo source addend portion 620 may have a corresponding SP-lo adder aligner output 645 , that may be, in one embodiment, 48 bits.
- the SP-lo adder aligner output 645 may have its least significant bit aligned with the least significant bit (bit 0 ) of the EP aligner output 625 . It should be noted that in one embodiment, if only one SP operation is needed, only the SP-lo may be used/needed to perform a single, single-precision (SP) operation.
- the aligner output fields 625 , 640 and/or 645 may each have an accompanying sticky field 650 (EP), 655 (SP-hi), 660 (SP-lo) that may be adapted to facilitate rounding up and/or down to the nearest bit. That is, as bits are shifted out, the sticky fields 650 (EP), 655 (SP-hi), 660 (SP-lo) may keep track of the shifted out bits, and may perform a logical OR operation to influence rounding.
- the sticky field 650 (EP) may comprise 64 bits, the sticky field 655 (SP-hi) may comprise 24 bits, and the sticky field 660 (SP-lo) may comprise 24 bits.
- rounding may be performed up to the next bit if a sticky field indicates the shifted out bits are more than (or equal to) half way to the next bit, or down to the current bit if the associated sticky field indicates the shifted out bits are less than half way to the next bit.
- the rounding may be performed to a specific bit (current or next). It should be noted that the sticky fields described above may be used to maintain precision, but may not otherwise affect the end result of the operations described herein.
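The sticky behavior described above (tracking shifted-out bits and OR-ing them together) can be sketched as follows. This is a simplified model, not the disclosed circuit: it collapses the sticky field into a single bit, and the function name `shift_right_sticky` is hypothetical.

```python
def shift_right_sticky(mantissa, shift):
    """Shift `mantissa` right by `shift` bits.

    Returns (shifted_value, sticky), where sticky is the logical OR of
    every bit that was shifted out (assumption: one summary bit).
    """
    if shift < 0:
        raise ValueError("shift must be non-negative")
    shifted_out = mantissa & ((1 << shift) - 1)  # bits that fall off the end
    sticky = 1 if shifted_out != 0 else 0
    return mantissa >> shift, sticky


print(shift_right_sticky(0b101100, 2))  # no 1 bits are lost
print(shift_right_sticky(0b101101, 2))  # a 1 bit is lost, so sticky is set
```

A later rounding step could consult the sticky value to tell an exact result from one where nonzero bits were discarded.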
- Turning now to FIG. 7 , a schematic diagram of FMAC mantissa fields (and their respective bit lengths) in an aligned 64-bit dataflow is illustrated, according to one exemplary embodiment.
- the aligned datapath may be “folded” in order to comport with a 64-bit datapath bit pitch to align data efficiently through the processor and fit into a fixed physical footprint.
- the actual width of the aligned data may be greater than the width of the datapath, and in some cases may be about three times the width of the datapath (e.g., the aligned data may be 194 bits wide, while the datapath may be 64 bits wide).
- various portions of data such as those depicted in FIG. 6 and described above, may be transmitted on a datapath that is narrower than the data itself.
- an FMAC operation may begin.
- the flow may proceed in parallel to 810 and 835 .
- At 810 , the exponent and bias values may be obtained.
- the exponent values may be the Ea 556 a, the Eb 556 b and the Ec 556 c, and the bias value may be the Bias 556 d.
- From 810 , the flow may proceed to 820 to calculate the difference between the product and addend exponents.
- the adder 557 and the adder 559 may perform an effective subtraction operation between the sum of the product exponents (the Ea 556 a and the Eb 556 b ) and the addend exponent (the Ec 556 c ).
- From 820 , the flow may then proceed to 830 where the shift controls may be determined. In one embodiment, the shift controls may be based on (determined from) the output of the adder 559 as applied to the decoders 566 a - n and the muxes 567 , as shown in FIG. 5 . From 830 , the flow may proceed to 840 .
- At 835 , a mantissa value (e.g., the Mc 505 ) may be input into the invert block 510 . In one embodiment, the mantissa input may be inverted if a subtraction operation is performed by the FMAC. From 835 , the flow may proceed to 840 and/or to 880 .
- At 840 , the inverted or non-inverted mantissa may be input into a 1X shifter (e.g., the Shift 1X 515 ). From 840 , the flow may proceed to 850 . At 850 , the 1X shifted mantissa may be input into the 4X shift block (e.g., the Shift 4X 520 ). From 850 , the flow may proceed to 860 , 863 and/or 866 . At 860 , the output of the 4X shift from 850 may be input into the 16X extended-precision shifter (e.g., 525 ).
- From 860 , the flow may proceed to 870 where the output of the 16X shift from 860 may be input into the 64X extended-precision shifter (e.g., 530 ). From 870 , the flow may proceed to 880 .
- At 863 , the output of the 4X shift from 850 may be input into the 16X single-precision lo shifter (e.g., 535 ).
- From 863 , the flow may proceed to 873 where the output of the 16X shift from 863 may be input into the 64X single-precision lo shifter (e.g., 540 ). From 873 , the flow may proceed to 880 .
- At 866 , the output of the 4X shift from 850 may be input into the 16X single-precision hi shifter (e.g., 545 ). From 866 , the flow may proceed to 876 where the output of the 16X shift from 866 may be input into the 64X single-precision hi shifter (e.g., 550 ). From 876 , the flow may proceed to 880 . At 880 , any un-overlapped product and addend mantissa cases may be aligned. From 880 , the flow may continue to 890 . At 890 , the FMAC operation may be ended.
- the actions shown in FIG. 8 may be performed sequentially, in parallel, substantially in parallel or in alternate order(s) without departing from the spirit and scope of the embodiments presented herein.
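The step in which shift controls are determined from the exponent difference and applied through decoders and muxes can be sketched in software. This is a behavioral model under stated assumptions: each shifter stage is taken to be a one-hot 4-to-1 selection among four pre-shifted candidates, and the names `decode_2_to_4`, `mux_stage` and `shift_controls` are hypothetical.

```python
def decode_2_to_4(pair):
    """2-to-4 decoder: turn a 2-bit value into four one-hot select lines."""
    if not 0 <= pair <= 3:
        raise ValueError("pair must be a 2-bit value")
    return [1 if i == pair else 0 for i in range(4)]


def mux_stage(value, selects, weight):
    """One shifter stage: one-hot select among value >> (0, w, 2w, 3w)."""
    if sum(selects) != 1:
        raise ValueError("selects must be one-hot")
    candidates = [value >> (i * weight) for i in range(4)]
    return candidates[selects.index(1)]


def shift_controls(difference):
    """One-hot controls for the 1X, 4X, 16X and 64X stages, finest first."""
    return [decode_2_to_4((difference >> (2 * k)) & 0b11) for k in range(4)]


controls = shift_controls(37)          # 37 = 1*1 + 1*4 + 2*16 + 0*64
print(controls)
print(bin(mux_stage(0b11110000, decode_2_to_4(1), 4)))
```

Because exactly one select line is high per stage, each stage reduces to picking one of four wired candidates, which is one plausible reading of why one-hot controls suit this kind of shifter.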
- Hardware descriptive languages (HDL) may be used in designing very large scale integration (VLSI) circuits. Examples of HDL are VHDL and Verilog/Verilog-XL, but other HDL formats not listed may be used.
- the HDL code (e.g., register transfer level (RTL) code/data)
- GDSII data is a descriptive file format and may be used in different embodiments to represent a three-dimensional model of a semiconductor product or device. Such models may be used by semiconductor manufacturing facilities to create semiconductor products and/or devices.
- the GDSII data may be stored as a database or other program storage structure. This data may also be stored on a computer readable storage device (e.g., data storage units 160 , RAMs 155 (including embedded RAMs), compact discs, DVDs, solid state storage and/or the like). In one embodiment, the GDSII data (or other similar data) may be adapted to configure a manufacturing facility (e.g., through the use of mask works) to create devices capable of embodying various aspects of the instant invention.
- this GDSII data may be programmed into a computer 100 , processor 125 / 140 or controller, which may then control, in whole or part, the operation of a semiconductor manufacturing facility (or fab) to create semiconductor products and devices.
- silicon wafers containing FMACs with multi-precision mantissa aligners (e.g., an FMAC utilizing parallel single-precision operations, as described herein)
Abstract
A processing device is provided that includes a first, second and third precision operation circuit. The processing device further includes a shared, bit-shifting circuit that is communicatively coupled to the first, second and third precision operation circuits. A method is also provided for multiplying a first and second binary number including adding a first exponent value associated with the first binary number to a second exponent value associated with the second binary number and multiplying a first mantissa value associated with the first binary number to a second mantissa value associated with the second binary number. The method includes performing the exponent adding and mantissa multiplying substantially in parallel. The method further includes performing at least one of adding or subtracting a third binary number to the product. Also provided is a computer readable storage device encoded with data for adapting a manufacturing facility to create an apparatus.
Description
- 1. Field of the Invention
- Embodiments of this invention relate generally to processors and processing circuits, and, more particularly, to a method and apparatus for a floating point multiply accumulator (FMAC) multi-precision mantissa aligner.
- 2. Description of Related Art
- Processors and processing circuits have evolved, becoming faster and more power intensive. With increased speed and capabilities, processors and processing circuits must be adapted to run more efficiently and with greater flexibility. As technology for these devices has progressed, a need has developed for performance and efficiency improvements. However, complexity, power and performance considerations introduce substantial barriers to these improvements. Additionally, circuit area and circuit overhead requirements (e.g., routing and layout) provide barriers to improvements.
- Multi-precision mantissa alignment may alleviate or reduce the abovementioned barriers to power reduction, efficiency and flexibility. In modern implementations for FMACs, support for two parallel single-precision operations embedded in a higher precision datapath is not found. State of the art FMACs are thus incapable of improving power usage, overhead, efficiency and flexibility through the use of parallel single-precision operations.
- In one aspect of the present invention, a processing device is provided. The processing device includes a first, second and third precision operation circuit. The processing device further includes a shared, bit-shifting circuit that is communicatively coupled to the first, second and third precision operation circuits.
- In another aspect of the invention, a method is provided. The method includes multiplying a first binary number and a second binary number to obtain a product, where multiplying includes adding a first exponent value associated with the first binary number to a second exponent value associated with the second binary number to obtain an exponent sum and multiplying a first mantissa value associated with the first binary number to a second mantissa value associated with the second binary number. The method also includes that the exponent adding and the mantissa multiplying are performed substantially in parallel. The method further includes performing at least one of adding a third binary number to the product or subtracting the third binary number from the product.
- In yet another aspect of the invention, a computer readable storage device encoded with data that, when implemented in a manufacturing facility, adapts the manufacturing facility to create an apparatus is provided. The apparatus includes a first, second and third precision operation circuit. The apparatus further includes a shared, bit-shifting circuit that is communicatively coupled to the first, second and third precision operation circuits.
- The invention may be understood by reference to the following description taken in conjunction with the accompanying drawings, in which the leftmost significant digit(s) in the reference numerals denote(s) the first figure in which the respective reference numerals appear, and in which:
- FIG. 1 schematically illustrates a simplified block diagram of a computer system including one or more FMACs, according to one embodiment;
- FIG. 2 shows a simplified block diagram of multi-precision FMAC, according to one embodiment;
- FIG. 3 provides a simplified block diagram of multi-precision FMAC(s) on a silicon die/chip, according to one embodiment;
- FIG. 4 illustrates an exemplary detailed representation of a multi-precision FMAC produced in a semiconductor fabrication facility, according to one embodiment;
- FIG. 5 illustrates a schematic diagram of an FMAC, according to one exemplary embodiment;
- FIG. 6 illustrates a schematic diagram of data alignment using an FMAC, according to one exemplary embodiment;
- FIG. 7 illustrates a schematic diagram of FMAC mantissa fields in an aligned dataflow, according to one exemplary embodiment; and
- FIG. 8 illustrates a flowchart depicting steps for shifting and aligning data, according to one exemplary embodiment.
- While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and are herein described in detail. It should be understood, however, that the description herein of specific embodiments is not intended to limit the invention to the particular forms disclosed, but, on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the appended claims.
- Illustrative embodiments of the invention are described below. In the interest of clarity, not all features of an actual implementation are described in this specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions may be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another. Moreover, it will be appreciated that such a development effort might be complex and time-consuming, but may nevertheless be a routine undertaking for those of ordinary skill in the art having the benefit of this disclosure.
- Embodiments of the present invention will now be described with reference to the attached figures. Various structures, connections, systems and devices are schematically depicted in the drawings for purposes of explanation only and so as to not obscure the disclosed subject matter with details that are well known to those skilled in the art. Nevertheless, the attached drawings are included to describe and explain illustrative examples of the present invention. The words and phrases used herein should be understood and interpreted to have a meaning consistent with the understanding of those words and phrases by those skilled in the relevant art. No special definition of a term or phrase, i.e., a definition that is different from the ordinary and customary meaning as understood by those skilled in the art, is intended to be implied by consistent usage of the term or phrase herein. To the extent that a term or phrase is intended to have a special meaning, i.e., a meaning other than that understood by skilled artisans, such a special definition will be expressly set forth in the specification in a definitional manner that directly and unequivocally provides the special definition for the term or phrase.
- As used herein, the terms “substantially” and “approximately” may mean within 85%, 90%, 95%, 98% and/or 99%. In some cases, as would be understood by a person of ordinary skill in the art, the terms “substantially” and “approximately” may indicate that differences, while perceptible, may be negligible or be small enough to be ignored.
- Embodiments of the present invention generally provide for methods and apparatus for a floating point multiply accumulator (FMAC) multi-precision mantissa aligner. It is contemplated that various embodiments described herein are not mutually exclusive. That is, the various embodiments described herein may be implemented simultaneously with, or independently of, each other, as would be apparent to one of ordinary skill in the art having the benefit of this disclosure. The embodiments described herein show a novel design that efficiently solves the problems described above. The embodiments described herein may utilize multi-precision mantissa alignment for an FMAC comprising two parallel single-precision (SP) operations (operation circuits), as well as an extended-/double-precision (EP/DP) operation (operation circuit). The Institute of Electrical and Electronics Engineers (IEEE) has set forth industry standards for SP, EP and DP operations. Binary numbers may be formatted such that they comprise two distinct portions: an exponent portion and a mantissa portion. SP operations use 23 bits for the mantissa and 8 bits for the exponent, EP operations use 64 bits for the mantissa and 15 bits for the exponent, and DP operations use 52 bits for the mantissa and 11 bits for the exponent. The embodiments described herein may allow for decreased latency in floating point multiply-add operations as well as higher throughput. The embodiments described herein may also allow for power and/or area optimization for floating point multiply-add circuits.
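The split of a binary number into an exponent portion and a mantissa portion can be seen by unpacking a single-precision value in software. The sketch below uses Python's standard `struct` module; the function name `sp_fields` is hypothetical, and the example is an illustration of the SP storage format rather than part of the disclosed circuit.

```python
import struct


def sp_fields(x):
    """Split a single-precision value into (sign, exponent, mantissa) fields.

    The exponent is the stored 8-bit biased value (bias 127) and the
    mantissa is the stored 23-bit fraction, without the implicit leading 1.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa


print(sp_fields(1.0))
print(sp_fields(-2.5))
```

For normalized values the hardware mantissa is one bit wider than the stored fraction because of the implicit leading 1, which is consistent with the 24-bit SP source addend portions mentioned elsewhere in this description.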
- Turning now to
FIG. 1, a block diagram of an exemplary computer system 100, in accordance with an embodiment of the present invention, is illustrated. In various embodiments the computer system 100 may be a personal computer, a laptop computer, a handheld computer, a tablet computer, a mobile device, a telephone, a personal data assistant (“PDA”), a server, a mainframe, a work terminal, or the like. The computer system includes a main structure 110 which may be a computer motherboard, circuit board or printed circuit board, a desktop computer enclosure and/or tower, a laptop computer base, a server enclosure, part of a mobile device, personal data assistant (PDA), or the like. In one embodiment, the main structure 110 may include a graphics card 120. In one embodiment, the graphics card 120 may be a Radeon™ graphics card from Advanced Micro Devices (“AMD”) or any other graphics card using memory, in alternate embodiments. The graphics card 120 may, in different embodiments, be connected on a Peripheral Component Interconnect (“PCI”) Bus (not shown), PCI-Express Bus (not shown), an Accelerated Graphics Port (“AGP”) Bus (also not shown), or any other connection known in the art. It should be noted that embodiments of the present invention are not limited by the connectivity of the graphics card 120 to the main computer structure 110. In one embodiment, the computer runs an operating system such as Linux, Unix, Windows, Mac OS, or the like.
- In one embodiment, the
graphics card 120 may contain a graphics processing unit (GPU) 125 used in processing graphics data. The GPU 125, in one embodiment, may include one or more embedded memories (not shown). In one embodiment, the embedded memory(ies) may be an embedded random access memory (“RAM”), an embedded static random access memory (“SRAM”), or an embedded dynamic random access memory (“DRAM”). In one or more embodiments, the embedded memory(ies) may be an embedded RAM (e.g., an SRAM). In alternate embodiments, the embedded memory(ies) may be embedded in the graphics card 120 in addition to, or instead of, being embedded in the GPU 125. In various embodiments the graphics card 120 may be referred to as a circuit board, a printed circuit board, a daughter card or the like.
- In one embodiment, the
computer system 100 includes a central processing unit (“CPU”) 140, which is connected to a northbridge 145. The CPU 140 and northbridge 145 may be housed on the motherboard (not shown) or some other structure of the computer system 100. It is contemplated that in certain embodiments, the graphics card 120 may be coupled to the CPU 140 via the northbridge 145 or some other connection as is known in the art. For example, the CPU 140, northbridge 145 and GPU 125 may be included in a single package or as part of a single die or “chip(s)” (not shown). Alternative embodiments which alter the arrangement of various components illustrated as forming part of main structure 110 are also contemplated. The CPU 140, in certain embodiments, may include one or more multi-precision FMACs 130. The multi-precision FMACs 130 may include a multi-precision mantissa aligner comprising two or more parallel single-precision operation circuits (described below with respect to FIG. 5). In certain embodiments, the northbridge 145 may be coupled to a system RAM (or DRAM) 155; in other embodiments, the system RAM 155 may be coupled directly to the CPU 140. The system RAM 155 may be of any RAM type known in the art; the type of RAM 155 does not limit the embodiments of the present invention. In one embodiment, the northbridge 145 may be connected to a southbridge 150. In other embodiments, the northbridge 145 and southbridge 150 may be on the same chip in the computer system 100, or the northbridge 145 and southbridge 150 may be on different chips. In various embodiments, the southbridge 150 may be connected to one or more data storage units 160 using a data connection or bus 199. The data storage units 160 may be hard drives, solid state drives, magnetic tape, or any other writable media used for storing data. In one embodiment, one or more of the data storage units may be SATA data storage units and the data connection 199 may be a SATA bus/connection.
Additionally, the data storage units 160 may contain one or more multi-precision FMACs 130. In various embodiments, the central processing unit 140, northbridge 145, southbridge 150, graphics processing unit 125, DRAM 155 and/or embedded RAM may be a computer chip or a silicon-based computer chip, or may be part of a computer chip or a silicon-based computer chip. In one or more embodiments, the various components of the computer system 100 may be operatively, electrically and/or physically connected or linked with a bus 195 or more than one bus 195.
- In different embodiments, the
computer system 100 may be connected to one or more display units 170, input devices 180, output devices 185 and/or other peripheral devices 190. It is contemplated that in various embodiments, these elements may be internal or external to the computer system 100, and may be wired or wirelessly connected, without affecting the scope of the embodiments of the present invention. The display units 170 may be internal or external monitors, television screens, handheld device displays, and the like. The input devices 180 may be any one of a keyboard, mouse, track-ball, stylus, mouse pad, mouse button, joystick, scanner or the like. The output devices 185 may be any one of a monitor, printer, plotter, copier or other output device. The peripheral devices 190 may be any other device which can be coupled to a computer: a CD/DVD drive capable of reading and/or writing to corresponding physical digital media, a universal serial bus (“USB”) device, Zip Drive, external floppy drive, external hard drive, phone and/or broadband modem, router/gateway, access point and/or the like. To the extent certain exemplary aspects of the computer system 100 are not described herein, such exemplary aspects may or may not be included in various embodiments without limiting the spirit and scope of the embodiments of the present invention as would be understood by one of skill in the art.
- In one embodiment, any number of
computer systems 100 may be communicatively coupled and/or connected to each other through a network infrastructure. In various embodiments, such connections may be wired or wireless without limiting the scope of the embodiments described herein. The network may be a local area network (LAN), wide area network (WAN), personal network, company intranet or company network, the Internet, or the like. In one embodiment, the computer systems 100 connected to the network via the network infrastructure may be a personal computer, a laptop computer, a handheld computer, a tablet computer, a mobile device, a telephone, a personal data assistant (PDA), a server, a mainframe, a work terminal, any other computing device described herein, and/or the like. The number of computers connected to the network may vary; in practice any number of computer systems 100 may be coupled/connected using the network.
- In one embodiment,
computer systems 100 may include one or more graphics cards and/or graphics processing units (GPUs). The graphics cards 120 may contain one or more GPUs 125 used in processing graphics data. The GPU 125, in one embodiment, may include a multi-precision FMAC 130. In alternate embodiments, the multi-precision FMAC 130 may be embedded in the graphics card 120 in addition to, or instead of, being embedded in the GPU 125. To the extent certain exemplary aspects of the graphics card 120 and/or the GPU(s) 125 are not described herein, such exemplary aspects may or may not be included in various embodiments without limiting the spirit and scope of the embodiments of the present invention as would be understood by one of skill in the art. In one embodiment, the graphics processing unit 125 and multi-precision FMAC 130 may reside on the same silicon chip as the CPU 140 and/or the northbridge 145. In another embodiment, the multi-precision FMAC 130 may reside on the same silicon chip as the CPU 140. In such embodiments, the silicon chip(s) may be used in a computer system 100 in place of, or in addition to, the graphics card 120. The silicon chip(s) may be housed on the motherboard (not shown) or other structure of the computer system 100.
- Turning now to
FIG. 2, a simplified, exemplary representation of the multi-precision FMAC 130 which may be used in silicon die/chips 440, as well as devices depicted in FIG. 1, according to various embodiments, is illustrated. However, those skilled in the art will appreciate that the multi-precision FMAC 130 may take on any of a variety of forms, including those described herein, without departing from the spirit and scope of the instant invention.
- Turning to
FIG. 3, the silicon die/chip 440 is illustrated as including one or more of the multi-precision FMACs 130. As discussed above, various embodiments of the multi-precision FMAC 130 may be used in a wide variety of electronic devices, including, but not limited to, central processing units, motherboards, graphics cards, graphics processors, combinatorial logic implementations, stand-alone controllers, other integrated circuits (ICs), digital signal processors (DSPs), and/or the like.
- Turning now to
FIG. 4, in accordance with one embodiment, and as described above, one or more of the multi-precision FMACs 130 may be included on the silicon die/chips 440 (or computer chip). The silicon die/chips 440 may contain one or more different configurations of the multi-precision FMACs 130 (e.g., a multi-precision FMAC 130 configured to include parallel SP operations/operational circuits). The silicon chips 440 may be produced on a silicon wafer 430 in a fabrication facility (or “fab”) 490. That is, the silicon wafers 430 and the silicon die/chips 440 may be referred to as the output, or product of, the fab 490. The silicon die/chips 440 may be used in electronic devices, such as those described above in this disclosure.
- Turning now to
FIG. 5, a diagram of an exemplary implementation of a portion of the multi-precision FMAC 130 is illustrated, according to one embodiment. For purposes of illustration, an exemplary FMAC multiply-add operation may be conceptualized as: A×B±C (“A multiplied by B, plus or minus C”). In order to multiply two binary numbers, their exponents may be added and their mantissas may be multiplied. Addition may require that exponents are “lined up”; in order to be added, numbers may need to have the same exponent. In order to accomplish this alignment/equalization, the mantissa of one or more operands may be shifted. By shifting the operand(s), the effective exponent of a number may be changed. For example, when adding 1.0×10^3 and 2.0×10^2 the exponents would need to be equalized. In one embodiment, 1.0×10^3 may be equalized with 2.0×10^2 by shifting the exponent 10^3, making it 10^2. The resulting number would then be 10×10^2. The numbers could then be added by adding mantissas (10+2), each having an exponent of “2”, for a result of 12×10^2 (i.e., 1.2×10^3). In one or more embodiments, the FMAC 130 may comprise two parallel single-precision operation circuits such that the exponent addition and the mantissa multiplication (e.g., for multiplication operations) may be performed in parallel or substantially in parallel. Similarly, the shifting of one or more operand mantissas (e.g., for addition operations) may also be performed in parallel or substantially in parallel to the mantissa multiplication.
- As previously described, in one or more embodiments, the
multi-precision FMAC 130 may contain circuitry to perform multi-precision mantissa alignment and/or multi-precision mantissa alignment using two parallel single-precision operation circuits. The illustrated FMAC 130 may comprise shift blocks, shift blocks for extended precision (EP), double precision (DP) and/or single precision (SP). For purposes of the discussion herein, DP operations may be performed by the EP operations blocks. It should be noted, however, that in various embodiments, alternate DP operational blocks may be used. As described herein, a block may also be referred to as a circuit, circuit portion and/or circuit block. The illustrated exemplary FMAC 130 may comprise an invert block 510 configured to invert the addend mantissa Mc 505, a shift 1X block 515, a shift 4X block 520, an EP shift 16X block 525, an EP shift 64X block 530, an SP-lo shift 16X block 535, an SP-lo shift 64X block 540, an SP-hi shift 16X block 545, an SP-hi shift 64X block 550, and/or an overlap block 555. The invert block 510 may be configured to invert bits of the Mc 505 if subtracting the product and addend instead of adding. In one embodiment, the inversion may be controlled by the invert controls 507. The various shift blocks described herein may be adapted to, configured to, and/or capable of shifting a binary number (or number of another format) by a given number of bits. For example, a 1X shift block (e.g., shift 1X block 515) may shift a binary number a given number of times by a single bit. That is, a 1X shift block may shift a binary number by 0, 1, 2 or 3 bits. A 4X shift block (e.g., shift 4X block 520) may shift a binary number a given number of times by four (4) bits. That is, a 4X shift block may shift a binary number by 0, 4, 8 or 12 bits. A 16X shift block (e.g., EP shift 16X block 525, SP-lo shift 16X block 535 and/or SP-hi shift 16X block 545) may shift a binary number a given number of times by sixteen (16) bits.
That is, a 16X shift block may shift a binary number by 0, 16, 32 or 48 bits. A 64X shift block (e.g., EP shift 64X block 530, SP-lo shift 64X block 540 and/or SP-hi shift 64X block 550) may shift a binary number a given number of times by sixty-four (64) bits. That is, a 64X shift block may shift a binary number by 0, 64, 128 or 192 bits. As described herein, shifting of bits may be performed by shifting zero (“0”) or more bits; that is, if a determination is made that a binary number should not be shifted, it may be said that the binary number was shifted by zero (“0”) bits. The overlap block 555 may, in one or more embodiments, be adapted to handle un-overlapped product and addend mantissa cases in the last aligner stage, as illustratively shown in FIG. 5. In the case that the addend mantissa shift will result in the addend mantissa not potentially overlapping with the product mantissa, then the addend mantissa is effectively concatenated with the product mantissa. The overlap block 555 may output a 256-bit result R 599.
- The
exemplary FMAC 130 illustrated in FIG. 5 may be conceptualized for illustrative purposes as shifting in four stages. In one embodiment, the first two stages may be shared. The shift 1X block 515 may be referred to as a first stage (i.e., shifting 0, 1, 2, or 3 bits) and the shift 4X block 520 may be referred to as a second stage (i.e., shifting 0, 4, 8, or 12 bits). These stages may be shared in that the output of the shift 1X block 515 and the shift 4X block 520 may be output to the EP (DP) shift 16X block 525, the SP-lo shift 16X block 535 and/or the SP-hi shift 16X block 545, as illustrated in FIG. 5, and/or may be output across the datapath to the three back end shifters of the FMAC 130 (i.e., to the EP (DP) aligner and/or one or both of the SP aligners, SP-lo and SP-hi, as described below with respect to FIG. 6). It is noted that the routing efficiency afforded by having the shared front end stages (i.e., the shift 1X block 515 and the shift 4X block 520) allows the data to traverse the full (vertical) length of the 64-bit datapath just once (from the shift 4X block 520 output to the EP (DP) shift 16X block 525, the SP-lo shift 16X block 535 and/or the SP-hi shift 16X block 545 inputs), thus optimizing/increasing routing efficiency.
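The four-stage arrangement described above can be modeled in a few lines. This is a behavioral sketch under stated assumptions, not the disclosed circuit: the mantissa is assumed to start at the top of a wider field and to move right, a single back-end path stands in for the EP, SP-lo and SP-hi paths, and all function names are hypothetical.

```python
# Stage weights from the text: 1X, 4X, 16X and 64X, each stage applying
# its weight 0, 1, 2 or 3 times.
STAGE_WEIGHTS = (1, 4, 16, 64)


def stage_digits(amount):
    """Split a shift amount (0..255) into four base-4 digits, finest first."""
    if not 0 <= amount <= 255:
        raise ValueError("shift amount must fit in four 2-bit digits")
    return [(amount >> (2 * k)) & 0b11 for k in range(4)]


def shared_front_end(value, d1x, d4x):
    """First two stages (1X then 4X), shared by every precision path."""
    value >>= d1x * STAGE_WEIGHTS[0]
    value >>= d4x * STAGE_WEIGHTS[1]
    return value


def back_end(value, d16x, d64x):
    """Last two stages (16X then 64X); one such pair per path in the text."""
    value >>= d16x * STAGE_WEIGHTS[2]
    value >>= d64x * STAGE_WEIGHTS[3]
    return value


def align(mantissa, mantissa_bits, field_bits, amount):
    """Place the mantissa at the top of a wider field, then shift in stages."""
    value = mantissa << (field_bits - mantissa_bits)
    d1x, d4x, d16x, d64x = stage_digits(amount)
    return back_end(shared_front_end(value, d1x, d4x), d16x, d64x)


m = 0x8000000000000001
print(hex(align(m, 64, 256, 37)))
```

Since the stage digits are simply the base-4 digits of the shift amount, the staged result matches a single right shift by the full amount, while the two finest stages only need to be built once.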
It should be noted that the least significant bits (LSBs) of the exponent differences for all calculations are naturally available first and thus may be used first to optimize the aligner for lowest latency. That is, the finest 1X shifting (the first two LSBs [0:1], decoded by 00=shift 0, 10=shift 1, 01=shift 2 and 11=shift 3) may occur first, followed by the 4-bit shifting (the next two significant bits [2:3], decoded by 00=shift 0, 10=shift 4, 01=shift 8 and 11=shift 12), the 16-bit shifting (the next two significant bits [4:5], decoded by 00=shift 0, 10=shift 16, 01=shift 32 and 11=shift 48), and the 64-bit shifting (the next two significant bits [6:7], decoded by 00=shift 0, 10=shift 64, 01=shift 128 and 11=shift 192) as the more significant bits of the exponent differences become available. The multi-precision FMAC 130 may comprise an exponent difference block 560 and a decoder block 565. The exponent difference block 560 and the decoder block 565 may be adapted to take one or more inputs from a system and, based at least in part on the inputs, determine the shifting schedule for the multi-precision FMAC 130 operations. The exponent difference calculation is effected in the exponent difference block 560. Based, at least in part, on the difference calculated between the exponents, the mantissa(s) may need to be shifted in order to align their respective binary points. The exponent difference block 560 may comprise a 4 to 2 adder/compressor 557 and/or a 2 to 1 carry propagate adder 559. The exponent difference block 560 may take as inputs: the exponent of term A (Ea) 556 a, the exponent of term B (Eb) 556 b, the exponent of term C (Ec) 556 c, and a bias signal Bias 556 d. The adder 557 and the adder 559 may, in one or more embodiments, perform an effective subtraction operation to determine the difference between the sum of the product exponents Ea 556 a and Eb 556 b, and the addend exponent Ec 556 c.
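The effective subtraction described above can be illustrated arithmetically. The sketch below is an assumption-laden model rather than the adder structure of FIG. 5: it treats the bias input as the single format bias that must be removed once from the sum of two biased exponents, forms the difference in both directions, and uses a sign test where hardware might use an adder carry-out. The function name `exponent_difference` is hypothetical.

```python
def exponent_difference(ea, eb, ec, bias):
    """Return (shift_magnitude, product_is_larger) for A*B +/- C.

    ea, eb, ec are biased exponents; `bias` is assumed to be the format
    bias, removed once so the product exponent is comparable with ec.
    """
    product_exp = ea + eb - bias          # biased exponent of the product
    product_minus_addend = product_exp - ec
    addend_minus_product = ec - product_exp
    # Both differences exist side by side; the sign picks the valid one.
    product_is_larger = product_minus_addend >= 0
    if product_is_larger:
        return product_minus_addend, True
    return addend_minus_product, False


# Single-precision style example with bias 127:
print(exponent_difference(130, 128, 127, 127))  # product exponent above addend
print(exponent_difference(127, 127, 130, 127))  # addend exponent above product
```

The returned magnitude is the quantity whose low-order bits would drive the finest shifter stage first, as the paragraph above describes.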
In one or more embodiments, the bias 556 d may be adapted to bias the exponent values such that some or all of the shift operations in the FMAC 130 may be performed by shifting to the right, rather than shifting to the left, as well as to remove any additional bias in order to perform the calculation. The outputs of the exponent difference block 560 may be decoded using the decoder block 565. The decoder block 565 may comprise one or more decoders (Dec) 566 a-n and/or one or more multiplexors 567 in order to generate the necessary shift controls from the calculated exponent difference for all the aligner complex stages. In one embodiment, the multiplexors 567 may be 2-to-1 multiplexors and may be used to select between two exponent difference calculations (“Product minus Addend” and “Addend minus Product”). These two subtractions (differences) may be calculated in parallel, and the carryout of one of the adders may be used to select which difference calculation is valid for the first two sets of 2-to-4 decoders, which generate the four shift 1X controls and the four shift 4X controls described below. Calculating both differences in parallel and then selecting the proper difference may be done to further minimize latency.
- For example, in one embodiment, the Dec 566 a may output control signals to the Shift 1X 515 that are decoded from bits [0:1] of the exponent difference block 560 (i.e., shift control bits [0:1], decoded as 00=shift 0, 10=shift 1, 01=shift 2 and 11=shift 3). Similarly, in one embodiment, the Dec 566 b may output four shift control signals to the Shift 4X 520 that are decoded from bits [2:3] of the exponent difference block 560 (i.e., shift control bits [2:3], decoded as 00=shift 0, 10=shift 4, 01=shift 8 and 11=shift 12). It should be noted that “one-hot” select n-to-1 multiplexors and decoders (where n is the number of multiplexor inputs) may be used extensively for realizing the shifters Shift 1X 515, Shift 4X 520, etc. (e.g., n=4 for shifting by 0, 1, 2, or 3 bits, and n=4 for shifting by 0, 4, 8, or 12 bits; etc.). This use of standard cells may increase routing efficiency and decrease footprint area. It should be noted that the 64X shift stages are straight (i.e., horizontal) routes in the datapath. That is, the 64X shift stages are routed so as to minimize the distance traveled.
- Turning now to
FIG. 6, a graphical representation of an illustrative side-by-side alignment of respective addends, post-alignment, for an EP (DP) operation and two parallel SP operations is depicted, in accordance with one embodiment. For illustrative convenience, bit numbers [193:0] 605 are shown alongside the alignments. In one embodiment, bit 193 is the most significant bit (MSB) and bit 0 is the least significant bit (LSB). As shown, an EP source addend portion 610, and one or two SP source addend portions 615, 620 (SP-hi and SP-lo, respectively) may be aligned. The EP source addend portion 610 may be 64 bits. The most significant bit of the SP-hi source addend portion 615 may be aligned with the most significant bit of the EP source addend portion 610. The least significant bit of the SP-lo source addend portion 620 may be aligned with the least significant bit of the EP source addend portion 610.
- The EP source addend portion 610 may have a corresponding EP aligner output 625 that may be, in one embodiment, 194 bits. The SP-hi source addend portion 615 may have a corresponding SP-hi incrementer aligner output 630 that may be, in one embodiment, 26 bits. The SP-hi incrementer aligner output 630 may have its most significant bit aligned with the most significant bit (bit 193) of the EP aligner output 625. The SP-lo source addend portion 620 may have a corresponding SP-lo incrementer aligner output 635 that may be, in one embodiment, 26 bits. The SP-lo incrementer aligner output 635 may have its least significant bit aligned with bit 128 of the EP aligner output 625.
- The SP-hi source addend portion 615 may have a corresponding SP-hi adder aligner output 640 that may be, in one embodiment, 48 bits. The SP-hi adder aligner output 640 may have its most significant bit aligned with bit 127 of the EP aligner output 625. The SP-lo source addend portion 620 may have a corresponding SP-lo adder aligner output 645 that may be, in one embodiment, 48 bits. The SP-lo adder aligner output 645 may have its least significant bit aligned with the least significant bit (bit 0) of the EP aligner output 625. It should be noted that in one embodiment, if only one SP operation is needed, only the SP-lo portion may be used to perform a single, single-precision (SP) operation.
- In one or more embodiments, the aligner output fields
- Turning now to FIG. 7, a schematic diagram of FMAC mantissa fields (and their respective bit lengths) in an aligned 64-bit dataflow is illustrated, according to one exemplary embodiment. The aligned datapath may be “folded” in order to comport with a 64-bit datapath bit pitch, to align data efficiently through the processor and to fit into a fixed physical footprint. It should be noted that the actual width of the aligned data may be greater than the width of the datapath, and in some cases may be three times the width of the datapath (e.g., the aligned data may be 194 bits wide, while the datapath may be 64 bits wide). As shown in FIG. 7, various portions of data, such as those depicted in FIG. 6 and described above, may be transmitted on a datapath that is narrower than the data itself.
- Turning to
FIG. 8, a flowchart depicting steps for shifting and aligning data is shown, according to one exemplary embodiment. At 805, an FMAC operation may begin. The flow may proceed in parallel to 810 and 835. At 810, the exponent and bias values may be obtained. In one embodiment, the exponent values may be the Ea 556 a, the Eb 556 b and the Ec 556 c, and the bias value may be the Bias 556 d. The flow may proceed to 820 to calculate the difference between the product and addend exponents. In one embodiment, the adder 557 and the adder 559 may perform an effective subtraction operation between the sum of the product exponents (the Ea 556 a and the Eb 556 b) and the addend exponent (the Ec 556 c). The flow may then proceed to 830 where the shift controls may be determined. In one embodiment, the shift controls may be based on (determined from) the output of the adder 559 as applied to the decoders 566 a-n and the muxes 567, as shown in FIG. 5. From 830, the flow may proceed to 840. At 835, a mantissa value (e.g., the Mc 505) may be input into the invert block 510. In one embodiment, the mantissa input may be inverted if a subtraction operation is performed by the FMAC. From 835, the flow may proceed to 840 and/or to 880.
- At 840, the inverted or non-inverted mantissa may be input into a 1X shifter (e.g., the Shift 1X 515). From 840, the flow may proceed to 850. At 850, the output of the 1X shifted mantissa may be input into the 4X shift block (e.g., the Shift 4X 520). From 850, the flow may proceed to 860, 863 and/or 866. At 860, the output of the 4X shift from 850 may be input into the 16X extended-precision shifter (e.g., 525). From 860, the flow may proceed to 870 where the output of the 16X shift from 860 may be input into the 64X extended-precision shifter (e.g., 530). From 870, the flow may proceed to 880. At 863, the output of the 4X shift from 850 may be input into the 16X single-precision lo shifter (e.g., 535). From 863, the flow may proceed to 873 where the output of the 16X shift from 863 may be input into the 64X single-precision lo shifter (e.g., 540). From 873, the flow may proceed to 880. At 866, the output of the 4X shift from 850 may be input into the 16X single-precision hi shifter (e.g., 545). From 866, the flow may proceed to 876 where the output of the 16X shift from 866 may be input into the 64X single-precision hi shifter (e.g., 550). From 876, the flow may proceed to 880. At 880, any un-overlapped product and addend mantissa cases may be aligned. From 880, the flow may continue to 890. At 890, the FMAC operation may be ended.
- In accordance with one or more embodiments, the actions shown in FIG. 8 may be performed sequentially, in parallel, substantially in parallel or in alternate order(s) without departing from the spirit and scope of the embodiments presented herein.
- It is also contemplated that, in some embodiments, different kinds of hardware description languages (HDL) may be used in the process of designing and manufacturing very large scale integration circuits (VLSI circuits) such as semiconductor products and devices and/or other types of semiconductor devices. Some examples of HDL are VHDL and Verilog/Verilog-XL, but other HDL formats not listed may be used. In one embodiment, the HDL code (e.g., register transfer level (RTL) code/data) may be used to generate GDS data, GDSII data and the like. GDSII data, for example, is a descriptive file format and may be used in different embodiments to represent a three-dimensional model of a semiconductor product or device. Such models may be used by semiconductor manufacturing facilities to create semiconductor products and/or devices. The GDSII data may be stored as a database or other program storage structure. This data may also be stored on a computer readable storage device (e.g., data storage units 160, RAMs 155 (including embedded RAMs), compact discs, DVDs, solid state storage and/or the like). In one embodiment, the GDSII data (or other similar data) may be adapted to configure a manufacturing facility (e.g., through the use of mask works) to create devices capable of embodying various aspects of the instant invention. In other words, in various embodiments, this GDSII data (or other similar data) may be programmed into a computer 100, processor 125/140 or controller, which may then control, in whole or part, the operation of a semiconductor manufacturing facility (or fab) to create semiconductor products and devices. For example, in one embodiment, silicon wafers containing FMACs with multi-precision mantissa aligners (e.g., an FMAC utilizing parallel single-precision operations, as described herein) may be created using the GDSII data (or other similar data).
- It should also be noted that while various embodiments may be described in terms of precision mantissa aligners, it is contemplated that the embodiments described herein may have a wide range of applicability as would be apparent to one of skill in the art having the benefit of this disclosure.
- The particular embodiments disclosed above are illustrative only, as the invention may be modified and practiced in different but equivalent manners apparent to those skilled in the art having the benefit of the teachings herein. Furthermore, no limitations are intended to the details of construction or design as shown herein, other than as described in the claims below. It is therefore evident that the particular embodiments disclosed above may be altered or modified and all such variations are considered within the scope and spirit of the claimed invention.
- Accordingly, the protection sought herein is as set forth in the claims below.
Claims (20)
1. (canceled)
2. (canceled)
3. (canceled)
4. (canceled)
5. (canceled)
6. (canceled)
7. (canceled)
8. A method comprising:
multiplying, at a processing device, a first binary number and a second binary number to obtain a product, wherein multiplying comprises adding a first exponent value associated with the first binary number to a second exponent value associated with the second binary number to obtain an exponent sum and multiplying a first mantissa value associated with the first binary number to a second mantissa value associated with the second binary number, and wherein the exponent adding and the mantissa multiplying are performed substantially in parallel; and
performing, at the processing device, at least one of adding a third binary number to the product or subtracting the third binary number from the product.
9. The method of claim 8, further comprising:
comparing the exponent sum to a third exponent value associated with the third binary number; and
shifting a third mantissa associated with the third binary number based at least in part on the comparison of the third exponent value to the exponent sum.
10. The method of claim 9, further comprising biasing at least one of the first exponent value, second exponent value, third exponent value or the exponent sum, such that shifting is performed by shifting to the right.
11. The method of claim 8, wherein the first, second and third binary numbers conform to at least one of an extended-precision standard, a double-precision standard or a single-precision standard.
12. The method of claim 11, wherein the multiplying and the at least one of adding or subtracting are performed by embedding at least one lower precision operation in at least one higher precision operational datapath.
13. The method of claim 11, wherein performing the exponent adding and the mantissa multiplying substantially in parallel comprises performing the exponent adding and the mantissa multiplying using two parallel, single-precision operations; and
wherein a higher precision datapath and at least one lower precision datapath comprise a shared datapath portion.
14. (canceled)
15. (canceled)
16. (canceled)
17. (canceled)
18. (canceled)
19. (canceled)
20. (canceled)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/391,470 US20170147288A1 (en) | 2011-09-06 | 2016-12-27 | Floating point multiply accumulator multi-precision mantissa aligner |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/226,071 US9141337B2 (en) | 2011-09-06 | 2011-09-06 | Floating point multiply accumulator multi-precision mantissa aligner |
US14/824,691 US9557963B2 (en) | 2011-09-06 | 2015-08-12 | Floating point multiply accumulator multi-precision mantissa aligner |
US15/391,470 US20170147288A1 (en) | 2011-09-06 | 2016-12-27 | Floating point multiply accumulator multi-precision mantissa aligner |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/824,691 Division US9557963B2 (en) | 2011-09-06 | 2015-08-12 | Floating point multiply accumulator multi-precision mantissa aligner |
Publications (1)
Publication Number | Publication Date |
---|---|
US20170147288A1 (en) | 2017-05-25 |
Family
ID=47753969
Family Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/226,071 Active 2034-03-07 US9141337B2 (en) | 2011-09-06 | 2011-09-06 | Floating point multiply accumulator multi-precision mantissa aligner |
US14/824,691 Active US9557963B2 (en) | 2011-09-06 | 2015-08-12 | Floating point multiply accumulator multi-precision mantissa aligner |
US15/391,470 Abandoned US20170147288A1 (en) | 2011-09-06 | 2016-12-27 | Floating point multiply accumulator multi-precision mantissa aligner |
Country Status (1)
Country | Link |
---|---|
US (3) | US9141337B2 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10019227B2 (en) | 2014-11-19 | 2018-07-10 | International Business Machines Corporation | Accuracy-conserving floating-point value aggregation |
US9904545B2 (en) | 2015-07-06 | 2018-02-27 | Samsung Electronics Co., Ltd. | Bit-masked variable-precision barrel shifter |
CN109753268B (en) * | 2017-11-08 | 2021-02-02 | 北京思朗科技有限责任公司 | Multi-granularity parallel operation multiplier |
US11288040B2 (en) * | 2019-06-07 | 2022-03-29 | Intel Corporation | Floating-point dot-product hardware with wide multiply-adder tree for machine learning accelerators |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4758972A (en) * | 1986-06-02 | 1988-07-19 | Raytheon Company | Precision rounding in a floating point arithmetic unit |
US5513362A (en) * | 1992-04-23 | 1996-04-30 | Matsushita Electric Industrial Co., Ltd. | Method of and apparatus for normalization of a floating point binary number |
US5559730A (en) * | 1994-02-18 | 1996-09-24 | Matsushita Electric Industrial Co., Ltd. | Shift operation unit and shift operation method |
US6256655B1 (en) * | 1998-09-14 | 2001-07-03 | Silicon Graphics, Inc. | Method and system for performing floating point operations in unnormalized format using a floating point accumulator |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5892698A (en) * | 1996-04-04 | 1999-04-06 | Hewlett-Packard Company | 2's complement floating-point multiply accumulate unit |
US6571266B1 (en) * | 2000-02-21 | 2003-05-27 | Hewlett-Packard Development Company, L.P. | Method for acquiring FMAC rounding parameters |
US6779013B2 (en) * | 2001-06-04 | 2004-08-17 | Intel Corporation | Floating point overflow and sign detection |
Also Published As
Publication number | Publication date |
---|---|
US20130060828A1 (en) | 2013-03-07 |
US9141337B2 (en) | 2015-09-22 |
US20150347090A1 (en) | 2015-12-03 |
US9557963B2 (en) | 2017-01-31 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | AS | Assignment | Owner name: ADVANCED MICRO DEVICES, INC., CALIFORNIA; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: HILKER, SCOTT; REEL/FRAME: 040985/0677; Effective date: 20110901 |
 | STCV | Information on status: appeal procedure | Free format text: ON APPEAL -- AWAITING DECISION BY THE BOARD OF APPEALS |
 | STCV | Information on status: appeal procedure | Free format text: BOARD OF APPEALS DECISION RENDERED |
 | STCB | Information on status: application discontinuation | Free format text: ABANDONED -- AFTER EXAMINER'S ANSWER OR BOARD OF APPEALS DECISION |