CN105359119A - Memory architectures having wiring structures that enable different access patterns in multiple dimensions - Google Patents


Info

Publication number
CN105359119A
CN105359119A (application number CN201480036153.XA)
Authority
CN
China
Prior art keywords
processor
memory
storer
wordline
access
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201480036153.XA
Other languages
Chinese (zh)
Other versions
CN105359119B (en)
Inventor
Alper Buyuktosunoglu
Philip G. Emma
Allan M. Hartstein
M. B. Healy
K. K. Kailas
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp filed Critical International Business Machines Corp
Publication of CN105359119A publication Critical patent/CN105359119A/en
Application granted granted Critical
Publication of CN105359119B publication Critical patent/CN105359119B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01R - MEASURING ELECTRIC VARIABLES; MEASURING MAGNETIC VARIABLES
    • G01R31/00 - Arrangements for testing electric properties; Arrangements for locating electric faults; Arrangements for electrical testing characterised by what is being tested not provided for elsewhere
    • G01R31/28 - Testing of electronic circuits, e.g. by signal tracer
    • G01R31/317 - Testing of digital circuits
    • G01R31/3177 - Testing of logic operation, e.g. by logic analysers
    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01R - MEASURING ELECTRIC VARIABLES; MEASURING MAGNETIC VARIABLES
    • G01R31/00 - Arrangements for testing electric properties; Arrangements for locating electric faults; Arrangements for electrical testing characterised by what is being tested not provided for elsewhere
    • G01R31/28 - Testing of electronic circuits, e.g. by signal tracer
    • G01R31/317 - Testing of digital circuits
    • G01R31/3181 - Functional testing
    • G01R31/3185 - Reconfiguring for testing, e.g. LSSD, partitioning
    • G01R31/318533 - Reconfiguring for testing, e.g. LSSD, partitioning, using scanning techniques, e.g. LSSD, Boundary Scan, JTAG
    • G01R31/318536 - Scan chain arrangements, e.g. connections, test bus, analog signals
    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11C - STATIC STORES
    • G11C29/00 - Checking stores for correct operation; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04 - Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C29/08 - Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C29/12 - Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
    • G11C29/18 - Address generation devices; Devices for accessing memories, e.g. details of addressing circuits
    • G11C29/30 - Accessing single arrays
    • G11C29/32 - Serial access; Scan testing

Abstract

Multi-dimensional memory architectures are provided having access wiring structures that enable different access patterns in multiple dimensions. Furthermore, three-dimensional multiprocessor systems are provided having multi-dimensional cache memory architectures with access wiring structures that enable different access patterns in multiple dimensions.

Description

Memory architectures having wiring structures that enable different access patterns in multiple dimensions
Cross reference
This application claims priority to U.S. Patent Application Serial No. 13/927,846, filed on June 26, 2013, the disclosure of which is incorporated herein by reference.
Technical field
The field relates generally to multi-dimensional memory architectures having access wiring structures that enable different access patterns in multiple dimensions, and to three-dimensional (3-D) multiprocessor systems having multi-dimensional cache memory architectures with access wiring structures that enable different access patterns in multiple dimensions.
Background
In the field of semiconductor processor chip fabrication, many companies manufactured single-core processor chips during the early stages of processor technology. Over roughly the past decade, as Moore's Law has continued to shrink feature sizes, many companies and other entities have designed processor chips having multiple processors on a single layer. As the number of on-chip processors grows, however, on-chip communication between the processors becomes problematic. For example, as the 2-D size of a processor chip grows to accommodate more processors, the increasing length of the horizontal wiring between processors (in the millimeter to centimeter range) introduces cycle delays in processor-to-processor communication and requires the use of high-performance on-chip drivers for the communication links between processors. Moreover, these cycle delays increase as operating frequency increases.
Summary of the invention
Embodiments of the invention generally include multi-dimensional memory architectures having access wiring structures that enable different access patterns in multiple dimensions, as well as 3-D multiprocessor systems having multi-dimensional cache memory architectures with access wiring structures that enable different access patterns in multiple dimensions.
For example, in one embodiment of the invention, a memory structure includes a first level of memory and a second level of memory. The first level of memory includes an array of memory cells and a first access wiring structure having a first pattern of wordlines and bitlines. Each memory cell in the array includes a storage element and a first access device connected to the storage element and to the first access wiring structure. The second level of memory includes a second access wiring structure having a second pattern of wordlines and bitlines, and a plurality of second access devices connected to the second access wiring structure. The second access devices are also connected to corresponding storage elements of the first level of memory. The first pattern of wordlines and bitlines of the first access wiring structure is different from the second pattern of wordlines and bitlines of the second access wiring structure, thereby providing different patterns of access to the same array of memory cells.
In another embodiment of the invention, a memory structure includes a first level of memory and a second level of memory. The first level of memory includes a first array of memory cells and a first access wiring structure having a first pattern of wordlines and bitlines. Each memory cell in the first array includes a first storage element and a first access device connected to the first storage element and to the first access wiring structure. The second level of memory includes a second array of memory cells and a second access wiring structure having a second pattern of wordlines and bitlines. Each memory cell in the second array includes a second storage element and a second access device connected to the second storage element and to the second access wiring structure. The memory structure further includes a plurality of wordlines connected to memory cells across the first and second levels of memory.
In yet another embodiment of the invention, a method of accessing memory includes storing data in an array of memory cells, accessing the data in the array using a first pattern of access wiring connected to the memory cells, and accessing the data in the array using a second pattern of access wiring connected to the memory cells, wherein the first and second patterns of access wiring are different. In one embodiment, the array of memory cells is a 2-D array; in another embodiment, it is a 3-D array. In one embodiment, the first pattern of access wiring is disposed in a first plane of the 3-D array, and the second pattern of access wiring is disposed in a second plane of the 3-D array different from the first plane. The first and second planes may be parallel or perpendicular.
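The dual-pattern access idea above can be sketched in a few lines of Python (a hypothetical software model, purely illustrative; the patent describes hardware wiring, not software): one shared array of storage elements is reachable through two differently oriented sets of wordlines, so a full row or a full column can each be delivered by a single wordline activation.

```python
# Hypothetical model of one memory cell array served by two access wiring
# patterns: pattern-1 wordlines run horizontally and select a row, while
# pattern-2 wordlines run vertically and select a column of the same cells.

class DualPatternArray:
    def __init__(self, n):
        self.n = n
        self.cells = [[0] * n for _ in range(n)]  # shared storage elements

    def write(self, i, j, value):
        self.cells[i][j] = value

    def read_pattern1(self, wordline):
        # First wiring pattern: activating wordline i delivers row i on the bitlines.
        return [self.cells[wordline][j] for j in range(self.n)]

    def read_pattern2(self, wordline):
        # Second wiring pattern: activating wordline j delivers column j instead.
        return [self.cells[i][wordline] for i in range(self.n)]

mem = DualPatternArray(4)
for i in range(4):
    for j in range(4):
        mem.write(i, j, 10 * i + j)

row1 = mem.read_pattern1(1)  # one access returns a whole row
col2 = mem.read_pattern2(2)  # one access returns a whole column of the same cells
```

The point of the sketch is that both read paths touch the very same `cells` storage, mirroring how the second-level access devices connect to the first level's storage elements.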
These and other embodiments of the invention will become apparent from the following detailed description of exemplary embodiments, which is to be read in conjunction with the accompanying drawings.
Brief description of the drawings
FIG. 1 is a schematic view of a processor chip.
FIG. 2 is a schematic view of a 3-D stacked multiprocessor according to an exemplary embodiment of the invention.
FIG. 3 is a schematic view of a chip package structure.
FIG. 4 conceptually illustrates a 3-D stacked multiprocessor structure according to another exemplary embodiment of the invention.
FIG. 5 schematically illustrates a physical implementation of a 3-D stacked multiprocessor structure according to another exemplary embodiment of the invention, based on the conceptual implementation shown in FIG. 4.
FIG. 6 schematically illustrates a method for controlling multimodal operation of a 3-D stacked multiprocessor structure, according to an exemplary embodiment of the invention.
FIG. 7 is a schematic view of a processor to which principles of the invention may be applied.
FIG. 8 is a schematic view of a 3-D stacked multiprocessor device comprising a pair of processors having the same layout as the processor of FIG. 7, according to an exemplary embodiment of the invention.
FIG. 9A is a schematic view of a 3-D stacked multiprocessor device comprising first and second processors vertically stacked on each other with their L2 and L3 caches aligned, according to an exemplary embodiment of the invention.
FIG. 9B is a schematic view of the 3-D stacked multiprocessor device of FIG. 9A, with the aligned L3 caches conjoined for operation by the first and second processors as a shared L3 cache, according to an exemplary embodiment of the invention.
FIG. 9C is a schematic view of the 3-D stacked multiprocessor device of FIG. 9A, with the aligned L2 and L3 caches conjoined for operation by the first and second processors as a shared L2 cache and a shared L3 cache, according to an exemplary embodiment of the invention.
FIG. 10 is a schematic view of a 3-D stacked multiprocessor device according to another exemplary embodiment of the invention.
FIG. 11 schematically illustrates communication links between various components of the processors shown in FIG. 10, according to an exemplary embodiment of the invention.
FIG. 12 schematically illustrates a processor interconnect structure for a planar processor system.
FIG. 13 schematically illustrates a processor interconnect structure for a 3-D stacked multiprocessor system, according to an exemplary embodiment of the invention.
FIG. 14 schematically illustrates a processor interconnect structure for a 3-D stacked multiprocessor system, according to another exemplary embodiment of the invention.
FIG. 15 is a schematic top view of a 3-D stacked multiprocessor system having a processor interconnect structure based on that of FIG. 14, according to an exemplary embodiment of the invention.
FIG. 16 schematically illustrates a processor interconnect structure for a 3-D stacked multiprocessor system, according to yet another exemplary embodiment of the invention.
FIG. 17A schematically illustrates two processors having identical layouts, wherein corresponding regions of the two identical processors are identified as being faster or slower than their counterparts, according to an exemplary embodiment of the invention.
FIG. 17B schematically illustrates a 3-D stacked processor structure formed by vertically stacking the two processors shown in FIG. 17A and operated as a single processor composed of the fastest of the corresponding regions of each processor, according to an exemplary embodiment of the invention.
FIG. 18 schematically illustrates a method for implementing a run-ahead function in a 3-D stacked processor system, according to an exemplary embodiment of the invention.
FIG. 19 schematically illustrates a 3-D stacked processor structure formed by vertically stacking a plurality of processors, each having a similar layout of state registers, wherein the processors can operate independently or can operate cooperatively to share their state registers, according to an exemplary embodiment of the invention.
FIG. 20 illustrates various modes of operation of the 3-D stacked processor structure of FIG. 19.
FIG. 21 is a flow diagram illustrating one mode of operation of the 3-D stacked processor structure of FIG. 19.
FIG. 22 schematically illustrates a memory array to which embodiments of the invention may be applied.
FIGs. 23A, 23B and 23C collectively illustrate a method for constructing a memory structure comprising multiple levels of memory with different access patterns, according to an exemplary embodiment of the invention.
FIG. 24 schematically illustrates a process of multiplying two matrices stored in 4x4 memory blocks A and B and storing the result of the matrix multiplication in a 4x4 block C.
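As a concrete illustration of the matrix-multiplication scenario of FIG. 24 (the values below are invented for illustration, not taken from the patent), multiplying two 4x4 blocks A and B into a block C consumes a row of A and a column of B for every output element, which is exactly the mixed row/column access that motivates having both access patterns in one memory:

```python
# Multiply two 4x4 matrix blocks A and B and store the result in block C.
# Each C[i][j] needs row i of A and column j of B, so a memory that can
# fetch either a row or a column in one operation serves both operands.

N = 4
A = [[i + j for j in range(N)] for i in range(N)]
B = [[1 if i == j else 0 for j in range(N)] for i in range(N)]  # identity, for a checkable result
C = [[0] * N for _ in range(N)]

for i in range(N):
    for j in range(N):
        C[i][j] = sum(A[i][k] * B[k][j] for k in range(N))  # row of A times column of B
```

With B chosen as the identity matrix, C reproduces A, which makes the example easy to verify by hand.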
FIG. 25 schematically illustrates a method for accessing the rows and columns of a memory using single primitive operations, according to an exemplary embodiment of the invention.
FIG. 26 illustrates a memory array comprising an array of memory cells and a diagonal access wiring pattern, according to an exemplary embodiment of the invention.
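A diagonal (oblique) wordline pattern of the kind FIG. 26 depicts can be modeled by a simple index mapping (a sketch under my own naming, not the patent's circuit): diagonal wordline k touches cell (i, (i + k) mod N) in each row i, so every diagonal crosses each row and each column exactly once and its N cells land on N distinct bitlines.

```python
# Cells selected by diagonal wordline k in an n x n array: one cell per row,
# with the column index advancing by one position (mod n) from row to row.

def diagonal_cells(k, n):
    return [(i, (i + k) % n) for i in range(n)]

cells = diagonal_cells(1, 4)
cols = sorted(j for _, j in cells)  # each column index appears exactly once
```

Because the selected cells occupy distinct columns, a diagonal activation can deliver all of them in parallel on the bitlines, just as a conventional horizontal wordline does for a row.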
FIG. 27 illustrates a memory array comprising an array of memory cells and a diagonal access wiring pattern, according to another exemplary embodiment of the invention.
FIG. 28 illustrates a memory array comprising an array of memory cells and a row-shifted access wiring pattern, according to yet another exemplary embodiment of the invention.
FIG. 29 schematically illustrates a 3-D memory structure enabling 3-D access patterns across multiple levels of memory, according to an exemplary embodiment of the invention.
FIGs. 30A, 30B and 30C schematically illustrate methods for accessing data in multiple dimensions using the exemplary 3-D memory structure of FIG. 29, according to exemplary embodiments of the invention.
FIG. 31 illustrates a method for storing a 2-D data array in memory so as to enable access to both its rows and its columns in a single operation, according to an exemplary embodiment of the invention.
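One way to realize what FIG. 31 describes is the classic skewed-storage layout, sketched below under the assumption that both a horizontal wordline and a diagonal wordline are available (the patent's exact mapping may differ): logical row i is stored rotated by i positions, so a logical row comes out of a horizontal wordline and a logical column comes out of a diagonal wordline, in both cases with one element per bitline.

```python
def store_skewed(data):
    # Place logical element (i, j) at physical column (i + j) mod n of row i.
    n = len(data)
    phys = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            phys[i][(i + j) % n] = data[i][j]
    return phys

def read_logical_row(phys, i):
    # Horizontal wordline i, then un-rotate the bitline outputs.
    n = len(phys)
    return [phys[i][(i + j) % n] for j in range(n)]

def read_logical_col(phys, j):
    # Diagonal wordline: one cell per physical row, each on a distinct bitline.
    n = len(phys)
    return [phys[i][(i + j) % n] for i in range(n)]

data = [[10 * i + j for j in range(4)] for i in range(4)]
phys = store_skewed(data)
```

The skew guarantees that the n cells of a logical column occupy n different physical columns, which is what makes a single-operation column read possible.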
FIG. 32 schematically illustrates a method for storing the data of a 3-D array in a 3-D memory structure, according to an exemplary embodiment of the invention.
FIG. 33 is a schematic side view of a multi-chip system to which embodiments of the invention may be applied.
FIG. 34 is a high-level view of a 3-D computer processor system to which embodiments of the invention may be applied.
FIG. 35 is a schematic side view of a multi-chip system according to an embodiment of the invention.
FIG. 36 illustrates a 3-D computer processor system according to an embodiment of the invention, constructed by conjoining a plurality of the multi-chip systems shown in FIG. 35.
FIG. 37 schematically illustrates a technique for connecting a global bus to each multi-chip system of a 3-D computer processor system, according to an embodiment of the invention.
FIG. 38 illustrates a 3-D computer processor system according to another embodiment of the invention.
FIG. 39 illustrates a 3-D computer processor system according to yet another embodiment of the invention.
FIG. 40 schematically illustrates a 3-D processing system having at least one test layer with circuitry for scan testing of a functional layer and for system state checkpointing, according to an embodiment of the invention.
FIG. 41 schematically illustrates a framework of test-layer circuitry for scan testing of functional layers and for system state checkpointing in a 3-D processing system, according to an embodiment of the invention.
FIG. 42 schematically illustrates a 3-D processing system having at least one test layer with circuitry for scan testing of multiple functional layers and for system state checkpointing, according to another embodiment of the invention.
FIG. 43 schematically illustrates a 3-D processing system having multiple test layers, each with circuitry for scan testing of multiple functional layers and for system state checkpointing, according to yet another embodiment of the invention.
FIG. 44 schematically illustrates circuitry of a test layer and a functional layer of a 3-D processing system, according to an embodiment of the invention.
FIG. 45 is a flow diagram of a method for checkpointing and restoring system state in a 3-D processing system having at least one test layer with circuitry for context switching of a functional layer and for system state checkpointing, according to an embodiment of the invention.
Detailed description
Exemplary embodiments of the invention will now be described in further detail with regard to 3-D multiprocessor devices formed by connecting a plurality of processors in a stacked configuration, and to methods for controlling 3-D stacked multiprocessor devices to selectively operate in one or more of a plurality of resource-aggregation and resource-sharing modes.
FIG. 1 is a schematic view of a processor chip to which principles of the invention may be applied. More specifically, FIG. 1 schematically illustrates a processor chip 10 comprising a semiconductor die 12 having a plurality of processors C1, C2, ..., C49 (generally denoted Cn) formed on the die 12. The processors Cn are arranged in a "planar" system, wherein each processor Cn has its own dedicated footprint in 2-D space. As is understood by those of ordinary skill in the art, the processors Cn may be connected to each other in the 2-D plane using horizontal wiring and electrical interconnects formed as part of the BEOL (back end of line) structure of the chip 10.
In a planar system as shown in FIG. 1, communication between processors becomes problematic as the number of processors grows. For example, as the 2-D size of the chip grows to accommodate more processors, the length of the horizontal wiring between processors increases (into the mm or cm range), resulting in cycle delays on the processor-to-processor communication links. These cycle delays require the use of high-performance on-chip drivers for the communication links between processors. Moreover, the cycle delays increase as the operating frequency increases.
Principles of the invention utilize chip-stacking techniques to form 3-D stacked multiprocessor structures from multi-processor chips, wherein two or more processor chips are aggregated into a single stacked system having a single-chip "footprint" (i.e., the stacked processor chips appear as a single chip). The term "processor chip" as used herein refers to any semiconductor chip or die having one or more processors. The term "multi-processor chip" as used herein refers to any semiconductor chip or die having two or more processors. In general, in a 3-D stacked structure, two or more chip layers include processors that are aligned and connected to each other using short vertical interconnects, such that a processor in one layer is aligned with and connected to a corresponding processor in another layer. It is to be understood that when two different processors or processor components/elements in different processor chip layers are said to be "aligned" with each other, the term "aligned" means, for example, that the two different processors or processor components/elements at least partially overlap, or fully overlap, each other on the different layers. In this regard, two processors or processor components/elements on different processor chip layers may be fully aligned, such that the processors or components occupy the same 2-D position in each plane of the 3-D stack of processor chips. Alternatively, the processors or processor components/elements may be substantially aligned, but with some offset between their 2-D positions in the planes of the 3-D stack of processor chips.
For example, FIG. 2 is a schematic view of a 3-D stacked multiprocessor according to an exemplary embodiment of the invention. More specifically, FIG. 2 schematically illustrates a 3-D stacked processor chip 20 comprising a first processor chip 22A and a second processor chip 22B vertically stacked on the first processor chip 22A. In the exemplary embodiment of FIG. 2, the processor chips 22A and 22B are substantially identical (identical in component structure, though possibly different in interconnect structure) and are depicted as having 49 integrated processors each, similar to the processor chip 10 of FIG. 1. In particular, the first processor chip 22A comprises a plurality of processors C1A, C2A, ..., C49A, and the second processor chip 22B comprises a plurality of processors C1B, C2B, ..., C49B. The first and second processor chips 22A and 22B are stacked on each other and connected to each other such that pairs of processors C1A/C1B, C2A/C2B, ..., C49A/C49B (generally, CnA/CnB) are aligned with, and interconnected to, each other using short vertical connections.
In the exemplary structure depicted in FIG. 2, each aligned processor stack CnA/CnB comprises a plurality of vertically connected processors that generally share the same I/O connections. These I/O connections are internally multiplexed, such that at each processor location in 2-D space, the plurality of vertically stacked (and interconnected) processors CnA/CnB logically appear, and operate, as a single processor from the perspective of the other processor stacks. Principles of the invention can be extended such that a plurality of 3-D stacked processor chips (such as shown in FIG. 2) are packaged together on a package substrate. These principles will be described in further detail with reference to FIGs. 3, 4 and 5.
FIG. 3 is a schematic view of a chip package structure to which principles of the invention may be applied. In particular, FIG. 3 depicts a processor system 30 comprising a package substrate 32 and a plurality of processor chips P1, P2, P3, P4, P5 and P6 mounted on the package substrate 32. The package substrate 32 comprises a plurality of electrical interconnects and traces that form electrical wiring 34, which provides all-to-all connections between the processor chips P1, P2, P3, P4, P5 and P6. The processor chips P1 through P6 are identical, and may each be a multi-processor chip having a plurality of processors.
FIGs. 4 and 5 schematically illustrate a 3-D stacked multiprocessor system according to another exemplary embodiment of the invention. In particular, FIG. 4 is a conceptual view of an exemplary 3-D stacked multiprocessor package structure 40. Similar to the package structure 30 of FIG. 3, the 3-D stacked multiprocessor package structure 40 of FIG. 4 comprises a package substrate 32 and a plurality of first-layer processor chips P1A, P2A, P3A, P4A, P5A and P6A mounted on the package substrate 32. The package substrate 32 comprises a plurality of electrical interconnects and traces that form electrical wiring 34, which provides all-to-all connections between the processor chips P1A through P6A. The processor chips P1A through P6A are identical, and may each be a multi-processor chip having a plurality of processors.
As further illustrated in FIG. 4, a plurality of second-layer processor chips P1B, P2B, P3B, P4B, P5B and P6B are vertically disposed on, and mounted to, the corresponding first-layer processor chips P1A through P6A using short vertical connections 36. The second-layer processor chips P1B through P6B are identical to the corresponding first-layer processor chips P1A through P6A, and may each be a multi-processor chip having a plurality of processors. FIG. 4 depicts a plurality of dashed lines 34a representing virtual all-to-all wiring between the second-layer processor chips P1B through P6B in the second package layer. This virtual wiring 34a does not physically exist; rather, it represents that the second-layer processor chips P1B through P6B are connected to each other and can communicate using the same physical wiring 34 formed on the package substrate 32.
FIG. 5 schematically illustrates a physical implementation of a 3-D stacked multiprocessor structure 50 according to another exemplary embodiment of the invention, based on the conceptual implementation shown in FIG. 4. As depicted in FIG. 5, the only wiring that physically exists in the 3-D stacked multiprocessor package structure 50 is the wiring 34 formed on the package substrate 32 and the short vertical connections 36 formed between the corresponding processor chip stacks P1A/P1B, P2A/P2B, P3A/P3B, P4A/P4B, P5A/P5B and P6A/P6B. In the 3-D stacked multiprocessor package structure 50 of FIG. 5, the processor chips within a given vertical stack P1A/P1B, ..., P6A/P6B can communicate with each other using the vertical connections 36 formed between the processor chips (these vertical connections 36 include connections formed between corresponding aligned processors in the different processor chip layers).
In accordance with exemplary embodiments of the invention, two processor chips can be conjoined using known semiconductor fabrication techniques, wherein two identical processor chips can be bonded together "face-to-back" or "face-to-face". In a face-to-back configuration, the active surface (face) of a first processor chip is bonded to the non-active surface (back) of a second processor chip, with the processors and other corresponding elements of the two processor chips aligned. In this structure, vertical wiring (e.g., conductive vias) can be formed in the active surface of the first processor chip and exposed as a first array of contact pads on the face of the first processor chip, and vertical wiring (e.g., through-silicon vias) can be formed through the back side of the second processor chip and exposed as a second array of contact pads on the non-active surface of the second processor chip. When the first and second processor chips are bonded face-to-back, the first and second arrays of contact pads can be soldered together, thereby forming short vertical connections between the aligned processor elements. To shorten the length of the vertical connections, the back side of the second processor chip can be ground down using known techniques to make the die thinner.
In " face-to-face (face-to-face) " configuration, wherein two of mirror image identical processor chips (functionally identical) are bound each other, the active surface (face) of such first processor is bound to the active surface (face) of the second processor chips, and the processor of two chips and other elements are alignment.With this structure, vertical wires (such as conductive via) can be formed in the active surface of first processor chip and to expose as the first touch panel array of the active surface of first processor, and vertical wires can be formed in the active surface of the second processor chips and the second touch panel array exposed as in the active surface of the second processor chips.When the first and second processor chips by face-to-face in conjunction with time, the first and second touch panel arrays can be welded together, thus formed alignment processor elements between short vertical connection.
In a 3-D stacked processor system, two or more processors that are approximately (or exactly) co-located in their planar footprints but lie on different layers can operate independently, or can operate cooperatively by aggregating and/or sharing resources to enhance functionality and to push operating thresholds, reliability and performance beyond what would be attainable in a planar system, wherein each chip occupies its own space in a 2-D package. Various methods for controlling 3-D stacked multiprocessors to selectively operate in one or more resource-aggregation and/or resource-sharing modes are described in further detail below with reference to FIGs. 6-18. In general, exemplary methods for selectively controlling 3-D stacked multiprocessors enable a group of stacked processors to operate concurrently, yet independently of each other, for certain applications. For other applications, as discussed below, two or more vertically stacked processors can be selectively controlled to operate in a cooperative manner, sharing or aggregating resources (e.g., threads, execution units, caches, etc.) across multiple layers and using the short vertical connections between the processor layers as high-speed communication links, so as to provide enhanced operation.
In accordance with exemplary embodiments of the invention, control schemes can be employed to control the multimodal operation of two or more vertically stacked processors, such that the processors in a vertical stack can be selectively controlled to operate independently or in a cooperative manner. For example, FIG. 6 schematically illustrates a method for controlling the multimodal operation of a 3-D stacked multiprocessor structure, according to an exemplary embodiment of the invention. In particular, the control scheme 60 shown in FIG. 6 comprises a multiplexer 61 that selectively receives as inputs a plurality of configuration parameter sets 62 and 64 and a configuration mode control signal 66. The different configuration parameter sets A and B are selectively output as machine inputs 68 to a given vertical stack of processors, wherein the machine inputs configure the processor stack to operate in one of a plurality of different operating modes specified by the machine inputs 68. Although two input configuration parameter sets A and B are shown for ease of illustration, three or more different configuration parameter sets can be input to, and selectively output by, the multiplexer 61. It is to be appreciated that the control scheme of FIG. 6 is local to one processor stack, and each processor stack in a given processor system would have a corresponding control circuit as shown in FIG. 6.
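The selection step of such a control scheme can be sketched as follows (the set names and parameter contents are placeholders of my own, not values from the patent): a mode-control signal picks one of the stored configuration parameter sets, and the selected set is what reaches the processor stack as its machine input.

```python
# Two hypothetical configuration parameter sets, standing in for sets A and B
# of the FIG. 6 control scheme; three or more sets would work the same way.
CONFIG_SETS = {
    "A": {"share_cache": False, "stack_mode": "independent"},
    "B": {"share_cache": True, "stack_mode": "cooperative"},
}

def config_mux(mode_control):
    # Plays the role of multiplexer 61 driven by the mode control signal 66.
    return CONFIG_SETS[mode_control]

machine_input = config_mux("B")  # the machine input 68 delivered to the stack
```

In hardware this selection happens in a multiplexer rather than a lookup, but the behavior is the same: the control signal alone determines which parameter set configures the stack.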
The control system 60 of FIG. 6 can be controlled by a global control system, such as a service processor that scans in the control information and outputs a configuration control signal 66 to each multiplexer 61 in the processor system to configure the processor stacks in a given mode. Using circuitry internal (on-chip) to the vertically stacked processors, the machine input 68 output from each multiplexer 61 to the corresponding processor stack can be further multiplexed and/or decoded to control the various I/O ports (which may be shared or bypassed) and to control the other modes of sharing and/or aggregation between the processors in the different layers of a given processor stack.
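The selection performed by the control scheme of FIG. 6 can be sketched in software. The following is a minimal illustrative model, not part of the patent: the field names, parameter-set contents, and the two modes chosen are all hypothetical, and stand in for whatever configuration bits a real implementation would drive onto the machine input 68.

```python
# Hypothetical software model of the FIG. 6 control scheme: a multiplexer
# selects one of several configuration parameter sets based on a mode
# control signal, and the selected set configures one processor stack.
from dataclasses import dataclass

@dataclass(frozen=True)
class ConfigParams:
    """One configuration parameter set (e.g., set A or set B in FIG. 6)."""
    share_l2: bool         # join the aligned L2 caches across layers
    share_execution: bool  # aggregate execution-unit elements across layers
    chips_enabled: tuple   # which chip layers are powered on

# Hypothetical parameter sets corresponding to inputs 62 and 64.
CONFIG_A = ConfigParams(share_l2=False, share_execution=False, chips_enabled=(0, 1))
CONFIG_B = ConfigParams(share_l2=True, share_execution=True, chips_enabled=(0, 1))

def mode_multiplexer(select_b: bool) -> ConfigParams:
    """Models multiplexer 61: the mode control signal 66 selects which
    parameter set is driven onto the machine input 68 of the stack."""
    return CONFIG_B if select_b else CONFIG_A

# The selected set would then be decoded on-chip to steer I/O-port sharing
# and resource aggregation for that one processor stack.
independent = mode_multiplexer(select_b=False)
cooperative = mode_multiplexer(select_b=True)
print(independent.share_l2, cooperative.share_l2)  # False True
```

As in the text, each stack would own one such selector, while a global service processor drives all the select signals.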
In the various exemplary embodiments of the invention discussed below, when two or more processors in a vertical stack spatially overlap, the processors and their components can be combined synergistically in various ways, giving the tuple of processors several new applications for enhancing performance. It should first be noted that, because a vertical processor stack places two or more processors (exactly or approximately) directly on top of one another, this might seem impractical at first glance, since it roughly doubles the heat associated with any hotspot (and hotspots tend to be most pronounced in processors). With this in mind, exemplary control schemes can be implemented to control the power of the co-located processors in a stack, for example by running the stacked processors at lower power levels (e.g., by modulating the supply voltage and/or operating frequency), such that the total power (e.g., the overall power density and/or total power consumption) remains manageable.
More specifically, in one exemplary embodiment of the invention, a 3-D stacked processor device, fabricated by vertically stacking and connecting a plurality of processor chips, can be operated in one of a plurality of operating modes to control the power of the 3-D stacked processor device. For example, in a 3-D stacked processor device having first and second processor chips, the device can be selectively operated in a first mode in which the first processor chip is turned on and the second processor chip is turned off. In the first mode, each processor of the first processor chip is turned on and may operate at maximum frequency and full power, at the total power that the package structure can support (e.g., the power density at any hotspot is controlled such that, for the given package structure, the heat at a given hotspot in the package is not excessive).
In another operating mode, the 3-D stacked processor device can be selectively operated in a second mode in which the first and second processor chips are both turned on. In one instance, both processor chips can operate at maximum frequency and at a power level whose total power (e.g., power density or power consumption) the package structure can support. In another instance of the second mode, each processor in the first and second processor chips may operate at less than full power, such that the total power of the 3-D stacked processor device is substantially the same as the total power of the device when only the first processor chip (or only the second processor chip) has each of its processors operating at full power and/or maximum frequency. That is, to obtain the same power-consumption or power-density profile, the processors in each processor chip layer may operate at a lower supply voltage (or a lower operating frequency), such that the aggregate power consumption is the same as or similar to that of the first mode, in which only one processor chip layer is active.
The power control scheme according to principles of the invention is based on the realization that the operating frequency of a processor need only be reduced by a much smaller ratio (e.g., 10%) in order to reduce the power supplied to the processor by a significant ratio (e.g., 50%). A power control scheme can be used to selectively control the supply voltage of the processors, or to adjust the operating frequency, or both, so as to adjust the overall power consumption of each processor chip. Thus, in a 3-D stacked processor chip structure having multiple processor planes, the ability to modulate the supply voltage and to selectively turn off subsets of the processor planes allows the system to have a range of operating modes, including one or more modes in which multiple processor planes operate at a lower voltage so as to keep the total power substantially the same as the total power consumed when running the processors of a single plane (or so as to maintain, at any given hotspot of the 3-D stacked processor chip structure, the same power density as when running the processors of a single plane).
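The arithmetic behind this trade-off can be checked with the classical CMOS dynamic-power approximation P ≈ C·V²·f. The sketch below is illustrative only: the normalized operating point and the assumption that supply voltage may be freely lowered at the reduced frequency are simplifications, not device characteristics taken from the patent.

```python
# Numerical sketch of the power-control rationale: with dynamic power
# modeled as P ~ C * V^2 * f, a modest frequency reduction permits a
# voltage reduction whose combined effect cuts power dramatically.

def dynamic_power(c: float, v: float, f: float) -> float:
    """Classical CMOS dynamic (switching) power approximation."""
    return c * v * v * f

C = 1.0                     # normalized switched capacitance
V_FULL, F_FULL = 1.0, 1.0   # normalized full-power operating point

p_full = dynamic_power(C, V_FULL, F_FULL)

# Run at 90% frequency (a 10% reduction, as in the text) and solve for
# the voltage that halves total power: V' = sqrt(0.5 * P / (C * f')).
f_low = 0.9 * F_FULL
v_low = (0.5 * p_full / (C * f_low)) ** 0.5

p_low = dynamic_power(C, v_low, f_low)
print(f"frequency reduced by {100 * (1 - f_low):.0f}%")        # 10%
print(f"voltage reduced by {100 * (1 - v_low):.0f}%")          # ~25%
print(f"power reduced by {100 * (1 - p_low / p_full):.0f}%")   # 50%
```

Under this model, two stacked chips each running at the reduced point together consume about the same power as one chip at full voltage and frequency, which is the constant-power constraint the text describes.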
Within a 3-D processor stack, each group of vertically stacked processors uses the same set of interconnect signals, both inside and outside the package, in each power-controlled operating mode. In this regard, because each processor chip layer in a vertical stack shares the same interconnect signals, less communication bandwidth (less I/O bandwidth) is requested when the processor chips operate at a lower frequency (in the second mode). Accordingly, principles of the invention employ multiplexing techniques to reuse the interconnect signals and the package I/O signals, exploiting the lower bandwidth demand generated by each layer in the 3-D stack as a result of the lower-frequency operation required to keep the power consumption constant.
In other exemplary embodiments of the invention, a processor system comprises two or more layers of stacked processor chips, wherein each processor chip comprises one or more processors, wherein the processors in different processor chip layers are connected by vertical connections between the layers, and wherein a mode control circuit (such as shown and described above with reference to FIG. 6) can selectively configure two or more processors in different chip layers to operate in one of a plurality of operating modes. For example, in one operating mode, one, several, or all of the processor chips in a given stack can run independently, wherein the vertical connections between the independently running processor chip layers can be used as communication links between the independently running processor chips of the stack.
In another operating mode, various components/resources in different processor chip layers can be aggregated to enhance the microarchitecture of one or more processors in the different processor chip layers. As understood by those skilled in the art, the term "microarchitecture" of a processor refers to the physical (hardware) configuration of the processor. The microarchitecture of a processor includes components such as the caches, the bus structure (path widths), the arrangement and number of execution units, the instruction units, the arithmetic units, and so on. For example, assume that a 3-D stacked processor chip device comprises a first processor chip having a first processor and a second processor chip having a second processor. In one operating mode, when the first and second processor chips are both active, the microarchitecture of the first processor of the first processor chip can be configured or enhanced by aggregating elements from both the first and second processors, and the microarchitecture of the second processor of the second processor chip can likewise be configured or enhanced by aggregating elements from both processors. In another embodiment, the first processor chip can be active and the second processor chip inactive, wherein the microarchitecture of the first processor of the active first processor chip is enhanced by utilizing a portion of the second processor of the inactive second processor chip. The aggregated elements can be portions of execution units, register sets, caches, etc.
In yet another exemplary operating mode, various elements/resources in different processor chip layers can be "shared" between different processors in different processor chip layers. For example, as explained below, two different processors in different processor chip layers can combine their caches (e.g., their L1, L2, or L3 caches) to create a cache of twice the size that both processors actively share. In this instance, the different processors share the aggregated (combined) components or resources. In another exemplary operating mode, two or more different processors in different processor chip layers can be combined to run a single processor image. Exemplary embodiments of the invention showing different operating modes for aggregating and/or sharing and/or combining processor resources are described in further detail below with reference to FIGs. 7, 8, 9A, 9B, 9C, 10, 11, 12, 13, 14, 15, 16, 17A, 17B, 18, 19, 20 and 21.
For example, FIGs. 7 and 8 illustrate an exemplary operating mode for selectively configuring different processors in different processor chip layers to aggregate and/or share portions of the execution units of the different processors, so as to enhance the execution capability of one or more of the processors. FIG. 7 is a schematic view of a processor 70 to which principles of the invention can be applied. FIG. 7 schematically illustrates the microarchitecture of the processor 70, wherein the processor 70 comprises various components such as an L3 cache 71, an L2 cache 72, an execution unit 73, and an instruction unit 74. The execution unit 73 comprises a first floating point unit 75, a second floating point unit 76 (wherein the first and second floating point units 75 and 76 are identical), and a set of floating point registers 77. Using multiple instances of the processor 70 of FIG. 7, a 3-D stacked multiprocessor structure can be built as shown in FIG. 8.
In particular, FIG. 8 is a schematic view of a 3-D stacked multiprocessor device 80 comprising a first processor 70A and a second processor 70B vertically stacked on the first processor 70A. In the exemplary embodiment of FIG. 8, the processors 70A and 70B are structurally identical and have the processor layout depicted in FIG. 7. Specifically, the first processor 70A comprises an L3 cache 71A, an L2 cache 72A, an execution unit 73A, and an instruction unit 74A. The execution unit 73A comprises a first floating point unit 75A, a second floating point unit 76A (wherein the first and second floating point units 75A and 76A are identical), and a set of floating point registers 77A. Likewise, the second processor 70B comprises an L3 cache 71B, an L2 cache 72B, an execution unit 73B, and an instruction unit 74B. The execution unit 73B comprises a first floating point unit 75B, a second floating point unit 76B (wherein the first and second floating point units 75B and 76B are identical), and a set of floating point registers 77B.
In an exemplary embodiment of the invention, the execution units 73A and 73B of the first and second processors 70A and 70B are aligned with each other and are connected to each other using short vertical connections. With this structure, the execution units can be wired vertically such that, for the two processors 70A and 70B shown in FIG. 8, the execution unit 73A of the first processor 70A can functionally comprise half of the elements of the processor pair's execution units 73A/73B, and the execution unit 73B of the second processor 70B can functionally comprise the other half of the elements of the pair 73A/73B, wherein the half taken from each pair is selected so as to reduce the planar area of each execution unit.
This 3-D aggregation of execution units is superior to the traditional planar geometry. In a traditional planar system, the execution units of two processors located on the same plane can be connected so that the output of one execution unit can be input to the second execution unit. However, the "horizontal" electrical connections between the execution units of the two processors can be relatively long (e.g., 5mm to 20mm), such that there may be one or two "dead" cycles in the signal transmission between the processors, causing unwanted latency. By comparison, in a 3-D stacked processor-on-processor architecture such as that of FIG. 8, half of the elements of each processor's execution unit are effectively aggregated into a new execution unit, so that the execution unit in each plane is effectively smaller in area. Because the like elements of each processor are spatially co-located, connecting the execution unit elements vertically across the 3-D layers achieves the aggregation of the components of the two processors within that area.
For example, in the exemplary embodiment of FIG. 8, assume that each processor 70A and 70B has two identical floating point units 75A/76A and 75B/76B. In the first processor plane 70A, because of the horizontal distance between the floating point units 75A and 76A, a latency of 1-2 cycles is incurred in transmitting a signal from the output of the first floating point unit 75A to the input of the second floating point unit 76A. However, if the co-located pair of first floating point units 75A and 75B in the two planes is vertically connected, and the co-located pair of second floating point units 76A and 76B is vertically connected, then the execution unit 73A of the first processor 70A can utilize the vertically connected pair of first floating point units 75A and 75B, and the execution unit 73B of the second processor 70B can utilize the vertically connected pair of second floating point units 76A and 76B, such that the execution unit of each processor 70A and 70B still has two floating point units.
In a functional processor, the vertical connections between the processor elements 75A and 75B, and between the processor elements 76A and 76B, provide shorter paths and allow each processor 70A and 70B to be built using elements from processors on different planes of the 3-D architecture. This effectively reduces the planar geometry of each processor and, because the path from the output of one execution unit (on one plane) to the input of an execution unit (on another plane) is faster, removes dead cycles from the execution flow. As will be explained in further detail below, these principles can be applied to other aligned components of the execution units, such as arithmetic units, and to other processor elements, such as the L2 and L3 caches.
As in the exemplary embodiment of the invention depicted in FIG. 8, each of the processors 70A and 70B can also be used independently of the other, in which case the vertical connections between the processor units across the processor layers are not used to aggregate or share resources. For example, in one operating mode, the processors 70A and 70B can both run (typically on unrelated programs) at reduced power, such that the total power is substantially the same as the power of operating only one of the processors 70A or 70B at full capacity at a time. In another operating mode, one of the processors 70A and 70B can be turned off while the other runs in a fast mode at, e.g., double power (a "turbo" mode).
In another exemplary embodiment of the invention, in an enhanced "turbo" operating mode, one of the processors 70A or 70B can be disabled (inactive) while the other runs in a fast, double-power (turbo) mode, but wherein the active processor can use certain elements of the execution unit of the inactive processor, thereby enhancing its execution capability. For example, in the exemplary embodiment of FIG. 8, the second processor 70B (the primary processor) can be turned on and operated at increased power in a high-speed turbo mode while the first processor 70A is turned off, but wherein the microarchitecture of the second (active) processor 70B is enhanced by using elements of the first (inactive) processor 70A. By way of specific example, when operating in the enhanced turbo mode, the execution unit 73B of the second (active) processor 70B can utilize the floating point units 75A and 76A and the registers 77A of the first (inactive) processor 70A, such that the second processor 70B operates at enhanced speed with four floating point units 75A, 75B, 76A and 76B and the extra registers 77A. This enhanced architecture allows the second processor 70B to run code faster and more efficiently. With this architecture, the mode control scheme can be configured so that a given processor can be turned off while the supply lines to components of the inactive processor are connected or disconnected as needed, thereby selectively turning on or off one or more components of the inactive processor.
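The three modes just described for the FIG. 8 processor pair can be summarized in a short model. This is a hypothetical sketch: the mode names, the per-layer resource counts, and the dictionary representation are illustrative conveniences, not interfaces defined by the patent.

```python
# Illustrative model of the operating modes of the FIG. 8 processor pair:
# independent operation, single-processor turbo, and enhanced turbo in
# which the active layer borrows the inactive layer's vertically adjacent
# floating point units and registers.

FPUS_PER_LAYER = 2       # each processor has two identical FPUs (75, 76)
REG_SETS_PER_LAYER = 1   # each processor has one FP register set (77)

def stack_resources(mode: str) -> dict:
    if mode == "independent":
        # Both layers on; each runs its own two FPUs at reduced power.
        return {"active_layers": 2, "fpus_per_active": FPUS_PER_LAYER,
                "reg_sets_per_active": REG_SETS_PER_LAYER}
    if mode == "turbo":
        # One layer off; the other runs at roughly double power.
        return {"active_layers": 1, "fpus_per_active": FPUS_PER_LAYER,
                "reg_sets_per_active": REG_SETS_PER_LAYER}
    if mode == "enhanced_turbo":
        # One layer off, but its FPUs and registers are borrowed through
        # the short vertical connections.
        return {"active_layers": 1, "fpus_per_active": 2 * FPUS_PER_LAYER,
                "reg_sets_per_active": 2 * REG_SETS_PER_LAYER}
    raise ValueError(f"unknown mode: {mode}")

print(stack_resources("enhanced_turbo")["fpus_per_active"])  # 4
```

The point of the model is the last case: the active processor ends up with four floating point units, as in the enhanced-turbo example of the text, without any increase in its own planar area.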
In another exemplary embodiment of the invention, vertical connections can be used to join different caches in different layers of the processor chips, such that the processors can operate the caches at any particular level of the cache hierarchy as a single shared cache. For example, if two stacked processors have their L2 caches aligned and their L3 caches aligned, then the aligned L2 cache pair can run as a single shared L2 cache of twice the capacity, and the aligned L3 cache pair can run as a single shared L3 cache of twice the capacity. These principles will be explained in further detail below with reference to FIGs. 9A, 9B and 9C.
FIG. 9A is a schematic view of a 3-D stacked multiprocessor device 90 comprising a first processor 90A and a second processor 90B vertically stacked on the first processor 90A. In the exemplary embodiment of FIG. 9A, the processors 90A and 90B are structurally identical and have respective processor cores 91A and 91B, L2 caches 92A and 92B, and L3 caches 93A and 93B. As depicted in FIG. 9A, the L2 caches 92A and 92B are aligned and have the same footprint (2-D area). Likewise, the L3 caches 93A and 93B are aligned and have the same footprint. In this 3-D stacked architecture, the aligned L2 caches 92A and 92B can be vertically connected and run as a single shared L2 cache. Likewise, the aligned L3 caches 93A and 93B can be vertically connected and run as a single shared L3 cache.
For example, FIG. 9B is a schematic view of the 3-D stacked multiprocessor device 90 of FIG. 9A, in which the L3 caches 93A and 93B are joined and can be operated by one or both of the processors 90A and 90B as a shared L3 cache 93A/B. Similarly, FIG. 9C is a schematic view of the 3-D stacked multiprocessor device 90 of FIG. 9A, in which the L2 caches 92A and 92B are joined and can be operated by one or both of the processors 90A and 90B as a shared L2 cache 92A/B. In particular, in one exemplary embodiment, the L2 and L3 caches of the processors 90A and 90B are vertically wired together such that the L2 and L3 caches can be used in two selectable modes: either as independent caches, in which the cross-layer connections between them are not used, or shared across the layers, thereby enhancing the cache capability of all the processors in those layers.
One advantage of the 3-D stacked cache architecture is that the storage capacity of a cache is doubled without an increase in the cache access time. Indeed, the access speed of a cache is generally assumed to be proportional to the square root of the cache area. In the exemplary embodiments of FIGs. 9B and 9C, because the footprints of the corresponding L2 and L3 caches spatially overlap, vertically connecting the aligned L2 and L3 caches does not increase the cache area. In this regard, because the area of the joined L2 cache 92A/B and the area of the joined L3 cache 93A/B do not increase (by virtue of the vertical connections), the cache access speed remains the same. To enable the processors 90A and 90B to access the same cache address space when running different programs, a cache control scheme is implemented to control and organize a shared cache directory and to maintain cache coherence between the cache layers.
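The square-root rule of thumb cited above makes the advantage easy to quantify. The sketch below is a back-of-envelope illustration under that stated assumption, with normalized (hypothetical) areas; it is not a circuit-level timing model.

```python
# Back-of-envelope sketch of the cache argument: doubling capacity in the
# plane grows the array footprint (and, by the sqrt-of-area rule of thumb
# cited in the text, the access time), whereas doubling capacity by
# stacking an aligned cache on a second layer leaves the footprint -- and
# hence the modeled access time -- unchanged.
import math

def access_time(area: float, k: float = 1.0) -> float:
    """Access time modeled as proportional to sqrt(cache area)."""
    return k * math.sqrt(area)

base_area = 1.0                            # normalized single-layer cache
t_base = access_time(base_area)

t_planar_2x = access_time(2 * base_area)   # planar doubling grows the array
t_stacked_2x = access_time(base_area)      # 3-D doubling keeps the footprint

print(f"planar 2x capacity:  {t_planar_2x / t_base:.2f}x access time")   # 1.41x
print(f"stacked 2x capacity: {t_stacked_2x / t_base:.2f}x access time")  # 1.00x
```

Under this model, the planar route to double capacity costs roughly a 41% longer access time, while the stacked route costs none, which is the advantage the text claims for the joined caches of FIGs. 9B and 9C.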
In another exemplary embodiment of the invention, a 3-D stacked processor device can be built comprising multiple stacked processors that can be combined to increase the number of threads available to a single processor image within the 3-D stack. For example, in a 3-D stacked processor device comprising a first processor chip having a first processor and a second processor chip having a second processor, both processor chips can be active, wherein the first and second processors are configured to run as a single processor and to aggregate their threads so as to increase the number of threads usable by the first and second processors. This effectively increases the multithreading capability of a single processor within the 3-D stack, without the overhead (threads) that a single processor would otherwise incur in having to support the extra threads by itself. These principles will be explained further with reference to FIGs. 10 and 11.
FIG. 10 is a schematic view of a 3-D stacked processor device 100 comprising a first processor 100A and a second processor 100B vertically stacked on the first processor 100A. In the embodiment of FIG. 10, the first and second processors 100A and 100B are multithreaded processors and have identical processor and register sets. In particular, the first processor 100A comprises four sets of registers 101A, 102A, 103A and 104A to execute four threads. Similarly, the second processor 100B comprises four sets of registers 101B, 102B, 103B and 104B to execute four threads.
In the exemplary embodiment of FIG. 10, by vertically aligning and connecting the processors 100A and 100B, the 3-D processor stack can be aggregated to run as a single multithreaded processor with a relatively larger number of threads. For example, in one instance of FIG. 10, the eight threads 101A, 101B, 102A, 102B, 103A, 103B, 104A and 104B of the two processors 100A and 100B can run jointly, such that the 3-D processor stack 100 appears as a single processor running 8 threads. Separately, for system-level arbitration in 3-D, when two or more processors are aligned, the group of processors will appear as a single node in the system's arbitration scheme. In this way, an arbitration "tree", as discussed below, does not grow in complexity when additional processors are added on new stacked planes.
In a traditional planar system, a processor can be fabricated with a growing number of independent register sets to execute more threads that can run concurrently, so as to increase multiprogramming throughput. However, as the number of threads of a processor grows, the planar dimensions of the processor also grow, causing cycle delays in the communication between the register sets and the processor execution units, as well as increased power. With a 3-D stacked architecture as shown in FIG. 10, the processors can instead be simplified to have smaller register sets supporting fewer threads per processor, while aggregating the threads across the processor layers as needed to increase the overall number of threads available to a given layer. For example, assuming that the dominant workload of a given application runs with four or fewer threads, the processors 100A and 100B as shown in FIG. 10 can each be optimized as four-thread processors. If a given workload requires more than four threads to be executed (up to 8 threads), then the processors 100A and 100B in the 3-D processor stack 100 can be combined and run as a single processor having 8 threads.
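The mode decision described above can be expressed as a small policy sketch. This is hypothetical illustration only: the function, its mode names, and the dispatch thresholds are inventions of the sketch, standing in for whatever system software or firmware would select the stack configuration.

```python
# Hypothetical sketch of the thread-aggregation decision of FIG. 10: each
# layer is optimized for n = 4 threads, and the aligned pair is fused into
# a single 2n-way (8-thread) processor image only when the workload
# demands more than one layer's worth of threads.

N_PER_LAYER = 4  # each processor layer holds four register sets (threads)

def configure_stack(runnable_threads: int) -> str:
    if runnable_threads <= 0:
        raise ValueError("thread count must be positive")
    if runnable_threads <= N_PER_LAYER:
        # Common case: one 4-thread processor image suffices; the other
        # layer may run an unrelated program or be powered down.
        return "independent"
    if runnable_threads <= 2 * N_PER_LAYER:
        # Fuse the aligned layers into one 8-thread processor image.
        return "fused-2n"
    raise ValueError("workload exceeds the stack's thread capacity")

print(configure_stack(3))  # independent
print(configure_stack(8))  # fused-2n
```

This mirrors the design point argued in the text: optimize the base processor for n threads, since that is the common case, and pay for 2n-way operation only on demand.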
In the exemplary embodiment of FIG. 10, control schemes and communication links are implemented to support the aggregation of threads across the different layers, to connect the caches between the layers, and to maintain cache coherence. The control schemes and communication links are designed such that, when the threads of the different layers actually share their address space, every processor sees the same state. These concepts are schematically illustrated in FIG. 11.
In particular, FIG. 11 schematically illustrates communication links between various components of the processors shown in FIG. 10, according to an exemplary embodiment of the invention. As depicted in FIG. 11, the first processor 100A comprises a first processor unit 105A, associated L2 and L3 caches 110A, an instruction cache 112A, a data cache 114A, and a plurality of register sets 101A, 102A, 103A and 104A (also referred to hereinafter as T0, T2, T4 and T6). Similarly, the second processor 100B comprises a second processor unit 105B, associated L2 and L3 caches 110B, an instruction cache 112B, a data cache 114B, and a plurality of register sets 101B, 102B, 103B and 104B (also referred to hereinafter as T1, T3, T5 and T7).
The instruction caches 112A and 112B and the data caches 114A and 114B receive program instructions and data stored in the respective L2 or L3 caches 110A and/or 110B. For example, the L2 and/or L3 caches 110A and/or 110B can be joined and shared as described above with reference to FIG. 9C. The program instructions stored in the instruction caches 112A and 112B can be executed by the respective processors 105A and 105B for one or more threads, and the execution state of a given thread is stored in a respective one of the thread state registers T0, T1, T2, T3, T4, T5, T6 and T7. As data is produced by the execution of the program instructions, the processor 105A stores data to its data cache 114A and the processor 105B stores data to its data cache 114B. According to principles of the invention, additional communication links 116 between the processors 105A and 105B and the data caches 114A and 114B are used to facilitate consistent storage. The communication links 116 may be implemented processor-upon-processor, since the ports are spatially juxtaposed when the processors are aligned.
Although the exemplary embodiments of FIGs. 10 and 11 depict processors that each have register sets supporting 4 active threads, principles of the invention can be extended to processors each having n threads, wherein, if each processor is n-way multithreaded, the processor pair can be run as a 2n-way multithreaded processor, as seen by the rest of the system. Moreover, this implementation is particularly useful because, most of the time, only n threads are run (so that no processor is heavily threaded), thereby allowing the base processor to be optimized for running n threads, while retaining the ability to expand the system to run 2n threads when needed.
As noted above, when two or more processors in a 3-D stacked configuration are aligned, the processors will appear as a single node in the system's arbitration scheme. With this scheme, an arbitration "tree" (or, more generally, a processor interconnect structure) can be created that does not grow in complexity when additional processors are added on new stacked planes. Exemplary processor interconnect structures according to principles of the invention are described in further detail below with reference to FIGs. 12, 13, 14, 15 and 16.
FIG. 12 schematically illustrates a processor interconnect scheme for a planar processor system. In particular, FIG. 12 depicts a planar processor system 120 comprising a first processor chip 120A and a second processor chip 120B disposed on the same plane. The first processor chip 120A comprises a plurality of processors P1A, P2A, P3A, P4A, P5A, P6A, P7A and P8A (collectively, PnA) and their associated L3 caches. The processors PnA of the first processor chip 120A communicate over a processor interconnect structure 122A. Similarly, the second processor chip 120B comprises a plurality of processors P1B, P2B, P3B, P4B, P5B, P6B, P7B and P8B (collectively, PnB) and their respective L3 caches. The processors PnB of the second processor chip 120B communicate over a processor interconnect structure 122B. In the exemplary embodiment of FIG. 12, the processor interconnect structures 122A and 122B are depicted as "tree" structures implementing a standard arbitration scheme.
Further, as depicted in FIG. 12, a bus connection structure 124 is used to interconnect the communication buses 122A and 122B. In the planar system 120 of FIG. 12, this bus connection structure 124 is relatively long in the 2-D plane. According to principles of the invention, this processor interconnect structure can be simplified in a 3-D stacked architecture, as depicted for example in FIG. 13. In particular, FIG. 13 schematically illustrates a processor interconnect scheme for a 3-D stacked multiprocessor system according to an exemplary embodiment of the invention. FIG. 13 depicts a processor system 130 comprising a first processor chip 130A and a second processor chip 130B disposed on the first processor chip 130A. The first processor chip 130A comprises a plurality of processors P1A, P2A, ..., P8A (collectively, PnA), which are interconnected and communicate using a processor interconnect structure 132A. Similarly, the second processor chip 130B comprises a plurality of processors P1B, P2B, ..., P8B (collectively, PnB), which are interconnected and communicate using a processor interconnect structure 132B. The processor interconnect structures 132A and 132B are depicted as "tree" structures implementing a standard arbitration scheme.
As further depicted in FIG. 13, a connecting bus structure 134 is used to interconnect the processor interconnect structures 132A and 132B. The overall processor interconnect scheme of FIG. 13 is conceptually similar to that of FIG. 12, except that the connecting bus structure 134 (which joins the processor interconnect structures 132A and 132B) is formed using vertical connections between the stacked processor chips 130A and 130B. In this regard, the vertical connecting bus structure 134 is much shorter in length than the planar connecting bus structure 124 depicted in FIG. 12. Accordingly, the overall processor interconnect scheme of FIG. 13 is effectively smaller and faster than that of FIG. 12.
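The magnitude of this shortening can be illustrated numerically. Both lengths below are hypothetical placeholders (the patent gives 5mm-20mm for horizontal inter-processor wiring earlier in the text, and through-layer connections in stacked chips are typically on the order of tens of micrometers), so the ratio is indicative, not measured.

```python
# Rough wire-length comparison between the planar bus connection 124 of
# FIG. 12 and the vertical connection 134 of FIG. 13. A planar hop between
# adjacent chips runs millimeters of wire; a through-layer vertical hop
# runs only tens of micrometers. All dimensions are hypothetical.

PLANAR_BUS_MM = 10.0    # e.g., a chip-to-chip trace in the 2-D plane
VERTICAL_BUS_MM = 0.05  # e.g., a ~50 um through-layer connection

ratio = PLANAR_BUS_MM / VERTICAL_BUS_MM
print(f"vertical link is ~{ratio:.0f}x shorter")  # ~200x
```

Even allowing wide error bars on both numbers, the vertical path is shorter by orders of magnitude, which is why the FIG. 13 scheme is described as smaller and faster than the planar scheme of FIG. 12.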
FIG. 14 schematically illustrates a processor interconnect structure for a 3-D stacked multiprocessor system according to another exemplary embodiment of the invention. FIG. 14 schematically illustrates a 3-D stacked processor structure 140 having a processor interconnect structure that is topologically equivalent to that of the 3-D stacked processor of FIG. 13, but faster and more simplified in size. More specifically, as shown in FIG. 14, the processor interconnect scheme is implemented using the tree structure 132B on the second processor chip 130B together with a plurality of vertical connections 141, 142, 143, 144, 145, 146, 147 and 148, which extend from the end points of the tree bus structure 132B on the second processor chip 130B to the respective processors of the first processor chip 130A. The processor interconnect scheme of FIG. 14 takes into account that the processors on the first and second processor chips 130A and 130B are aligned with each other, such that the end points of the tree bus structures 132A and 132B of the first and second processor chips 130A and 130B are also aligned (see FIG. 13). With this vertical alignment, the vertical bus connections 141, 142, 143, 144, 145, 146, 147 and 148 (as shown in FIG. 14) can be implemented in place of the single vertical bus interconnect 134 (as shown in FIG. 13). Because each end point of the bus tree structure 132B on the upper processor chip 130B is aligned to a corresponding end point of the bus tree structure 132A of the lower processor chip 130A, short vertical connections can be used to join the end points of the two tree structures 132A and 132B, thereby allowing one of the tree structures 132A and 132B to be omitted or left unused. These principles are further discussed and illustrated below with reference to FIG. 15.
In particular, Figure 15 is a schematic top view of a 3-D stacked multiprocessor system having a processor interconnect structure based on the scheme of Figure 14, according to an example embodiment of the invention. Figure 15 depicts a 3-D stacked multiprocessor system 150, which is a physical realization of the conceptual system of Figure 14, wherein the processors PnA on the lower processor chip 130A and the processors PnB on the upper processor chip 130B are aligned with the end points of the bus tree structure 132B. This allows short vertical conductive via connections 141, 142, 143, 144, 145, 146, 147 and 148 to connect the bus tree structure 132B, at its respective end points, to the processor pairs P1A/P1B, P2A/P2B, P3A/P3B, P4A/P4B, P5A/P5B, P6A/P6B, P7A/P7B and P8A/P8B. Because these vertical conductive via connections are relatively short, each up/down pair of processors can be regarded as a single vertical drop on the global bus 132B. Moreover, compared with the single vertical bus connecting structure 134 shown in Figure 13, the use of the vertical vias 141, 142, ..., 148 provides shorter communication links between aligned processors.
Figure 16 schematically illustrates a processor interconnect structure for a 3-D stacked multiprocessor system according to yet another example embodiment of the invention. Figure 16 shows a 3-D stacked processor structure 160 with a bus framework similar to that of Figure 14, except for the addition and use of an extra tree structure 162A on the lower processor chip 130A. The extra tree structure 162A can be used to shorten the communication links between in-plane processors and to increase communication bandwidth. In particular, in the example embodiment of Figure 16, the tree structure 162A can be used for processor-to-processor communication between the processors PnA on the first processor chip 130A without having to use the short vertical bus interconnects 141, 142, ..., 148 or the upper tree structure 132B. Similarly, the tree structure 132B can be used for processor-to-processor communication between the processors PnB on the second processor chip 130B without having to use the short vertical bus interconnects 141, 142, ..., 148 or the lower bus tree structure 162A.
In another control scheme, the two tree structures 162A and 132B can be used concurrently, in conjunction with the short vertical interconnects 141, 142, ..., 148, to provide two independent communication links between any two processors, thereby enabling a 2x increase in communication bandwidth. For instance, assume that each of the tree structures 132B and 162A is a 16-byte bus, which would require 16 cycles to communicate 256 bytes of information between processors. In this embodiment, the communication bandwidth can be increased to 32 bytes at a time (16 bytes per link), thereby increasing the amount of information communicated in 16 cycles to 512 bytes.
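The bandwidth arithmetic above can be sketched as a small illustrative calculation (the function name and parameters are ours, not from the patent): with two equal-width tree buses driven in parallel, the bytes moved per fixed cycle count double.

```python
# Illustrative sketch of the Figure 16 dual-bus bandwidth claim: two 16-byte
# tree buses used in parallel move twice the data of one bus in 16 cycles.
def bytes_transferred(bus_width_bytes: int, parallel_links: int, cycles: int) -> int:
    """Total bytes moved when `parallel_links` equal-width buses run in parallel."""
    return bus_width_bytes * parallel_links * cycles

single = bytes_transferred(16, 1, 16)  # one tree bus only
dual = bytes_transferred(16, 2, 16)    # both tree buses plus vertical interconnects
print(single, dual)                    # 256 512
```

This matches the figures in the embodiment: 256 bytes over a single link versus 512 bytes when both links are active.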
In another example embodiment of the invention, a 3-D stacked multiprocessor device can be constructed to include a plurality of processors that are conjoined and configured as a single ultra-fast processor by selectively combining the fastest components of each vertically stacked processor. With state-of-the-art technologies, there can be considerable variation in device performance between otherwise identical processors, such that some subsystems of one processor may be faster than the same subsystems of another identical processor while, for other subsystems, the relationship is reversed. Indeed, owing to variations in device size and shape, variations in doping, and the like, a group of identical processors formed on a given wafer, having identical layouts and macro components, can have components that are faster or slower than the corresponding components of another identical processor.
In this regard, according to another example embodiment of the invention, when two processors on different processor chip layers (first and second processors) have an identical layout of subsystem regions, the first and second processors can, in one mode of operation, be configured to run as a single processor by combining the faster of the corresponding subsystem regions of the first and second processors and shutting off the slower of the corresponding subsystem regions. These principles are described and discussed in further detail below with reference to Figures 17A and 17B.
In particular, Figure 17A schematically illustrates two processors having identical layouts according to an example embodiment of the invention, wherein corresponding regions of the two identical processors are identified as being faster or slower than their counterpart regions. Specifically, Figure 17A depicts two identical processors 170A and 170B having eleven identical primary regions (macros) R1, R2, R3, R4, R5, R6, R7, R8, R9, R10 and R11. After fabrication, these regions of the processors are speed-tested because, even when the processors are identical, some given regions will be faster or slower than the same regions of the other identical processor. In the example embodiment of Figure 17A, regions R1, R3, R4, R6, R8, R9 and R11 of the first processor 170A are identified (labeled "F") as being faster than the same regions of the identical processor 170B. Conversely, regions R2, R5, R7 and R10 of the second processor 170B are identified (labeled "F") as being faster than the same regions of the identical processor 170A.
Figure 17B is a schematic view of a 3-D stacked multiprocessor system 170 comprising the processors 170A and 170B of Figure 17A, according to an example embodiment of the invention. In particular, Figure 17B schematically illustrates a 3-D stacked multiprocessor structure that is formed by vertically stacking the two processors shown in Figure 17A and that operates as a single processor composed of the fastest of the corresponding regions of each processor. In Figure 17B, the processors are vertically connected in alignment, such that the corresponding regions R1, R2, ..., R11 are aligned with and connected to one another. The caches and execution resources of the two processors 170A and 170B are vertically connected in alignment, such that the 3-D stacked multiprocessor system 170 can operate in one of several modes.
For example, as discussed above, in one mode the processors 170A and 170B can operate as independent processors, wherein each processor is active and runs at half power. In another example embodiment, one of the processors 170A or 170B can run at full or enhanced power (a "turbo" mode) while the other processor is turned off. In yet another embodiment, the processors 170A and 170B can run as a single processor that comprises those regions of each processor identified as the fastest of the corresponding regions, such that the resulting processor can operate as a single ultra-fast processor having a speed faster than could be achieved using all the components from only one processor layer. For instance, in the example embodiment of Figure 17B, the 3-D stacked multiprocessor system 170 can run as a single processor comprising eleven regions made up of the fast regions R1, R3, R4, R6, R8, R9 and R11 of the first processor 170A and the fast regions R2, R5, R7 and R10 of the second processor 170B.
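The region-selection step above can be sketched as follows. This is a hypothetical illustration (the function, the speed values, and the region names beyond the figure's labels are ours, not the patent's): after per-region speed testing, each region of the composite processor is sourced from whichever layer's instance tested faster.

```python
# Hypothetical sketch of composing a single fast processor from the fastest
# instance of each identically laid-out region across two stacked layers.
def compose_fastest(speeds_a: dict, speeds_b: dict) -> dict:
    """Map each region to the layer ('A' or 'B') whose instance tested faster."""
    return {r: ("A" if speeds_a[r] >= speeds_b[r] else "B") for r in speeds_a}

# Example speed-test results (arbitrary units), in the spirit of Figure 17A.
speeds_a = {"R1": 1.2, "R2": 1.0, "R3": 1.3, "R4": 1.1, "R5": 0.9}
speeds_b = {"R1": 1.0, "R2": 1.1, "R3": 1.1, "R4": 1.0, "R5": 1.2}
picks = compose_fastest(speeds_a, speeds_b)
print(picks)  # {'R1': 'A', 'R2': 'B', 'R3': 'A', 'R4': 'A', 'R5': 'B'}
```

The slower counterpart of each selected region would be powered off, keeping the composite within the power envelope of one processor.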
In another example embodiment of the invention, a 3-D stacked multiprocessor device can have a plurality of conjoined processors that logically operate as a single processor image, but wherein at least one of the processors is used for a "run-ahead" function. In particular, by way of example, in a 3-D stacked multiprocessor device having first and second processors that are vertically connected in alignment with each other, the first processor can be a primary processor responsible for the architected state of the machine, while the secondary processor, which cannot change the architected state of the machine and is not constrained by the architecture or the program, can run ahead of the primary processor to resolve branches and generate misses early.
In this example embodiment, the caches and execution resources of the first and second processors are joined so that they can be used, for example, in one of two alternative modes: either as independent processors, wherein the connections between the processor layers are unused, or cooperatively, wherein the primary processor executes the program and the secondary processor runs a simpler version of the program, such that the secondary processor can run ahead of the primary processor, generating memory requests and resolving branches, the results of which can be used by the primary processor to avoid long-latency memory accesses and branch mispredictions that would otherwise occur. This concept of implementing a run-ahead or assist-thread in a 3-D stacked processor system is described in further detail below with reference to Figure 18.
In particular, Figure 18 schematically illustrates a method for implementing a run-ahead function in a 3-D stacked processor system according to an example embodiment of the invention. Specifically, Figure 18 depicts a plurality of operations 181 and 182, performed by a primary processor running the main thread with respect to a memory 183 shared by the primary and secondary processors, and a plurality of operations 184, 185, 186, 187, 188 and 189, performed by the secondary processor operating as a run-ahead thread in cooperation with the primary processor.
In particular, as shown in Figure 18, when a program is executed in the 3-D stacked processor system, the primary processor fetches instructions 181 from the shared memory 183 and executes each program instruction 182. In executing the instructions, the primary processor fetches program data from the shared memory 183 and maintains the machine state (stores) so that it is visible to all external entities. In other words, the primary processor executes the program exactly: it executes instruction operations in the correct order and signals state changes to the rest of the system only when those changes are known to be correct. To make the program execute faster and with greater instruction-level parallelism, however, the secondary processor runs as a "run-ahead" processor, wherein the secondary processor does not guarantee correct and legal operation and does not signal state changes to the rest of the system. Instead, it runs as fast as possible in a speculative manner, unencumbered by instructions that are irrelevant to the program flow. By operating in this manner, the run-ahead processor resolves many branches and generates many of the necessary cache misses earlier than the primary processor would be able to. This allows the primary processor to run faster than it otherwise would.
In particular, as shown in Figure 18, the secondary processor fetches instructions 184 from the shared memory 183 and executes certain instructions, such as data fetch instructions, fetching data 185 from the shared memory 183 in response to such data fetch instructions. The secondary processor will also execute data store instructions and perform memory access operations 186 to determine whether the necessary data resides in the memory 183. The secondary processor further executes simple instructions 187 and branch instructions 188, and discards or otherwise ignores all other fetched instructions 189 that are irrelevant to determining cache misses or resolving branch redirections. In step 186, when the secondary processor sees a data store instruction coming, it determines whether a cache line exists for the data to be stored. If the cache line does not exist, the secondary processor generates a cache miss and proceeds to have a cache line allocated for the data store and to obtain the appropriate permissions to store the data into the newly allocated cache line (that is, to ensure that the new cache line is in a "data store ready" state). If the cache line does exist, the secondary processor determines whether the cache line is in the "data store ready" state and, if not, proceeds to obtain the appropriate permissions. In this way, when the primary processor executes the data store instruction, the cache line will already be available and in the "store ready" state, thereby avoiding a cache miss in the execution flow.
The secondary processor (the run-ahead processor) accelerates the primary processor by resolving contingencies before the primary processor encounters them. The secondary processor can run ahead in this manner because it does not need to execute every instruction and does not need to execute program operations exactly. In the 3-D stacked configuration, because the primary and secondary processors are spatially coincident and joined by short vertical connections, they can share and observe each other's execution state and can be synchronized far more easily and robustly than in a coplanar configuration, in which long wiring runs would be required to exchange the appropriate synchronization information. Even with coplanar wiring between coplanar processors, the coplanar processors would likely be unable to observe each other's state consistently. In the 3-D stacked configuration, the communication of shared values and the synchronization of the process flow between the worker thread and the main thread are much more readily implemented through the short vertical connections between the resources of the primary and secondary processors.
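The run-ahead filtering described for Figure 18 can be sketched as follows. This is a minimal illustration under assumed names (the instruction encoding, the opcode categories, and the cache model are ours): the secondary thread keeps only loads, stores, simple ALU operations, and branches, skips everything else, and warms cache lines before the primary thread needs them.

```python
# Minimal sketch of a run-ahead thread: execute only miss- and
# branch-relevant instructions, prefetching lines the primary will touch.
RELEVANT_OPS = {"load", "store", "alu", "branch"}

def run_ahead(instructions, cache: set):
    """Speculatively touch cache lines; return the misses resolved early."""
    misses = []
    for op, line in instructions:
        if op not in RELEVANT_OPS:
            continue                      # irrelevant to misses/branches: skip it
        if op in ("load", "store") and line not in cache:
            misses.append(line)           # generate the miss early...
            cache.add(line)               # ...so the line is ready for the primary
    return misses

cache = {"L1"}
trace = [("load", "L1"), ("fp_div", "L9"), ("store", "L2"), ("load", "L2")]
print(run_ahead(trace, cache))  # ['L2'] — one miss, taken before the primary arrives
```

By the time the primary thread reaches the store to line L2, the line is already present, which is the "store ready" behavior the embodiment describes.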
In another example embodiment of the invention, a 3-D stacked processor device can have a plurality of conjoined processors that logically operate as a single processor image, but wherein a portion of their architected storage operates as private storage space (or scratchpad space) that is not accessible to processors outside the 3-D stack. In other words, the multiple processors can be conjoined into a single operating entity (externally, a single "processor") having a private storage area usable as scratchpad space and for organizing other data structures, wherein this private storage area is invisible to the other operating entities in the system. When a tuple of processors runs as a single logical processor (in run-ahead mode, hyper-turbo mode, or any other combination), one or more of the caches of the tuple can be used as private storage with program-specific structure.
In other example embodiments of the invention, as depicted in Figure 19, a 3-D stacked processor structure can be formed by vertically stacking a plurality of processors, each having a similar state-register layout, wherein the multiple processors can operate independently or operate cooperatively so as to share their state registers. More specifically, Figure 19 schematically illustrates a first processor 190A and a second processor 190B, which are vertically stacked to form a 3-D stacked processor structure 190. In the example embodiment of Figure 19, each of the processors 190A and 190B has an identical layout of state registers (depicted generally as groups of one or more rectangles). For example, the first processor 190A and the second processor 190B have groups 191A and 191B of identical state registers, disposed in substantially identical 2-D regions of the respective processors. It is to be understood that the depiction of the state-register sets (the groups of rectangles) on each of the processors 190A and 190B is arbitrary and is intended only to convey generally that a processor includes its state registers.
The sets of state registers on each of the processors 190A and 190B can be used to store the "state" of the respective processor 190A or 190B at the end of each operating cycle of that processor. The term "state" refers to the information required to fully capture the execution state of a program running on a given processor (what the program's execution has completed so far). As understood by those skilled in the art, the "state" includes the general-purpose registers, control registers, condition codes, address registers, and any other registers holding essential status information. Assume a program is running on the first processor 190A. At some given point in the program's run (the completion of an operating cycle of the processor 190A), the "state" of the first processor 190A can be scanned out of its state registers and saved into the corresponding state registers of the second processor 190B, and the second processor 190B can use the scanned-in state information stored in its state registers to begin executing the same program from the point at which the first processor 190A stopped executing it. On this basis, the program can continue on the second processor 190B, running from the stopping point on the first processor 190A, without the program being aware that it has been moved to a different processor. The "state" is thus all the static information that would be required to fully capture everything about the processor's operation at any given time: the set of registers that completely specifies all the information relevant to the program running on the processor.
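The state hand-off just described can be sketched in miniature. This is a hypothetical model (the `Processor` class and its fields are illustrative stand-ins for the architected registers, not the patent's definitions): at an operating-cycle boundary, the source layer's registers are scanned into the destination layer, which then resumes at the same program point.

```python
# Hypothetical sketch of the Figure 19 state migration between aligned layers.
from dataclasses import dataclass, field

@dataclass
class Processor:
    regs: dict = field(default_factory=dict)  # general-purpose/control registers
    pc: int = 0                               # program position reached so far

def migrate(src: Processor, dst: Processor) -> None:
    """Scan src's state registers into dst at an operating-cycle boundary."""
    dst.regs = dict(src.regs)
    dst.pc = src.pc

a = Processor(regs={"r0": 7, "cc": 1}, pc=400)
b = Processor()
migrate(a, b)
print(b.regs, b.pc)  # {'r0': 7, 'cc': 1} 400 — b resumes exactly where a stopped
```

The program itself never observes the move, because everything it could observe is contained in the transferred state.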
In general, the techniques discussed above can be used to fabricate the 3-D stacked structure 190 of Figure 19, wherein the processors 190A and 190B are stacked on each other and vertically connected such that the state-register sets (and other components) of the processors 190A and 190B of the respective layers are aligned and interconnected using vertical connections. As noted above, the term "aligned" means that corresponding components of the processor layers 190A and 190B are disposed directly over one another (occupying substantially the same 2-D space in each layer of the vertical 3-D space) or, as discussed above, are offset by a fixed, uniform stagger, giving the processors 190A and 190B some offset so as to produce a lower power density. In this manner, by implementing suitable vertical connections between the state registers and other components, the 3-D stacked processor 190 can operate in one of several modes.
Figure 20 is the table of multiple operational modes of showing the processor structure that the 3-D of Figure 19 is stacking.Such as, as shown in Figure 20, in a kind of operational mode (" normally " pattern), first and second processor 190A and 190B can as independently processor operation, wherein each processor 190A and 190B is active (being opened) and with lower than capacity operation (such as, each operate in half-power).In another operational mode (" turbo " pattern), a 190A (or 190B) in processor operates in total power, and another processor is closed.In " normally " pattern, each half that may operate in their peak power in processor 190A and 190B, such processor takes up room (footprint) to by power identical for the single processor had with operate in full speed degree (turbo pattern)." normally " operational mode can be utilized, power identical like this and cooling architecture can deal with this two kinds of situations, that is, operate in the single processor of full speed degree (turbo pattern), or operate in the processor pair of speed (normal mode) of reduction.
It is to be understood that the terms "full speed", "full power" and "maximum safe speed" as used herein all refer to the operating speed at which a given processor will operate correctly for all possible situations and inputs. The "maximum safe speed" of a given processor is determined in advance using various computer simulations, models and measurements of the operating characteristics of the given processor product. The processor product is shipped rated at that maximum speed and is not run faster. In fact, for most of the possible states in which a given processor may run, and for most possible programs and inputs, the processor could actually run faster than the "maximum safe speed". However, because particular combinations of state and inputs can cause run-time errors when a processor runs beyond its known "maximum safe speed", the "full speed" limit is typically imposed so that no problems will occur in any operating situation and environment in which the processor runs.
In another example embodiment of the invention, as shown in the table of Figure 20, the example 3-D structure 190 of Figure 19 may operate in a mode referred to as "hyper-turbo", in which one of the processors (e.g., processor 190A) can run at greater than full power (e.g., at an operating speed faster than the maximum safe speed of that processor) while the other processor (e.g., processor 190B) is deactivated (turned off), but with the state registers of the deactivated processor remaining active for use in "checkpointing" the state of the active processor. In this example embodiment, the architecture of the active processor in the stack is augmented with the state registers of the other, inactive processor in the stack, so that the active processor can run at an enhanced (hyper-turbo) operating speed while using the state registers of the inactive processor to store the current state information at the end of each operating cycle of the active processor, for the purpose of checkpointing the active processor's state in case an execution error occurs (as it may when the active processor runs at an enhanced operating speed deemed to exceed the "safe" speed).
Figure 21 is a flow diagram of a mode of operation of the 3-D stacked processor structure 190 of Figure 19 operating in the "hyper-turbo" mode, according to an example embodiment of the invention. Initially, the processor stack of Figure 19 can be selectively controlled to enter the "hyper-turbo" mode by activating the primary processor (e.g., processor 190A) to operate at a speed exceeding its "maximum safe speed", turning off the secondary processor (e.g., 190B), and allowing the state registers of the secondary (inactive) processor to remain active for use by the primary processor (step 200). During execution of a given program, upon the completion of each operating cycle, the primary processor will begin the next operating cycle (step 201). If the current cycle is completed (affirmative result in step 202) and no error occurred during the current cycle (negative result in step 203), the current state of the primary processor (at the end of the current cycle) is checkpointed (stored) in the state registers of the secondary processor (step 204), and the next operating cycle begins (step 201).
If an error does occur during the current operating cycle (affirmative result in step 203), the state of the primary processor is rolled back one cycle (step 205) by copying in the current contents of the checkpointed state held in the state registers of the second processor. The checkpointed state in the state registers of the secondary processor is the state of the primary processor's state registers as it existed at the point in time when the primary processor's previous operating cycle completed. The primary processor then re-executes the current operating cycle (the cycle in which the error occurred) using the checkpointed state obtained from the state registers of the secondary processor (step 206). In one example embodiment, this re-execution (step 206) is preferably performed with the primary processor running at its "safe" maximum speed, to ensure that the program operation that caused the problem at the higher speed will execute accurately and without error.
Once the current operating cycle (executed at the normal safe speed) completes (affirmative result in step 207), the current state of the primary processor is checkpointed in the state registers of the secondary processor (step 208). Thereafter, the processor stack returns to the hyper-turbo mode, wherein the primary processor resumes running at a speed exceeding its maximum safe speed (step 200). In the example process of Figure 21, the primary processor can run at a speed exceeding its maximum safe speed because its state is checkpointed for every completed operating cycle, and because, given the stacked geometry, the recovery action can be performed simply and quickly (that is, the state is restored by refreshing the state registers of the primary processor, accessed over the vertical connections, with the contents of the state checkpoint held in the state registers of the secondary processor).
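The Figure 21 control flow can be sketched as a small loop. This is an illustrative model under assumed interfaces (the `run_cycle` callback and its signature are ours): run each cycle at enhanced speed, checkpoint into the spare register set after every good cycle, and on an error replay that one cycle at safe speed from the checkpoint.

```python
# Minimal sketch of the hyper-turbo checkpoint/rollback loop of Figure 21.
def hyper_turbo(cycles, run_cycle):
    """run_cycle(state, fast) -> (new_state, ok). Returns the final state."""
    state = {"pc": 0}
    checkpoint = dict(state)                 # spare (inactive-layer) registers
    for _ in range(cycles):
        new_state, ok = run_cycle(dict(state), fast=True)   # enhanced speed
        if not ok:                           # error detected in this cycle:
            # roll back to the last checkpoint and replay at safe speed
            new_state, _ = run_cycle(dict(checkpoint), fast=False)
        state = new_state
        checkpoint = dict(state)             # checkpoint the completed cycle
    return state

# A cycle model that fails at enhanced speed only when pc == 2.
def run_cycle(s, fast):
    ok = not (fast and s["pc"] == 2)
    if ok:
        s["pc"] += 1
    return s, ok

print(hyper_turbo(4, run_cycle))  # {'pc': 4} — the faulty cycle was replayed safely
```

Because every completed cycle is checkpointed, the rollback window is exactly one cycle, which is what makes the fast-but-occasionally-wrong operating point usable.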
In the example embodiments discussed above with reference to Figures 9A, 9B and 9C, for example, different caches formed on different levels of processor chips in a 3-D stacked processor structure can be conjoined using vertical connections, such that the processors can operate any given level of cache in the cache hierarchy as a single shared cache. For example, as noted above, aligned L2 caches (on two different levels) can be operated as a single shared L2 cache having twice the capacity, and aligned L3 caches (on two different levels) can be operated as a single shared L3 cache having twice the capacity. In other example embodiments of the invention, described in detail below with reference to Figures 22-32, the caches of different processor chips can be constructed with wiring structures that enable different access patterns in multiple dimensions.
Figure 22 schematically illustrates a memory array to which embodiments of the invention may be applied. In particular, Figure 22 schematically illustrates a memory array 210 comprising a 2-D array of memory cells 220, which are accessible via a plurality of word lines (WL0, WL1, ..., WLn) and a plurality of bit lines (BL0, BL1, ..., BLn). Each memory cell 220 comprises an access transistor 222 and a storage element 224 (e.g., a capacitor), wherein the access transistor 222 has a gate terminal connected to a word line, a source terminal connected to a bit line, and a drain terminal connected to the storage element 224. In the illustrative memory array 210 of Figure 22, each row of memory cells 220 is connected to the same word line, wherein each row of memory cells 220 comprises the group of bits making up a given byte, word, memory line, etc. Furthermore, each column of memory cells 220 is connected to the same bit line, wherein each memory cell 220 connected to a given bit line corresponds to a given bit position within the given quantum (byte, word, cache line, etc.) being read from or written to the memory 210.
Each word line (WL0, WL1, ..., WLn) is connected to a corresponding driver 226, which operates to activate and deactivate the given word line. The driver 226 for a given word line applies a voltage to the gate terminal of each access transistor 222 in the row of memory cells connected to that word line, which turns on each access transistor 222 in that row. A word line corresponds to a full decoding of the address used to perform a read or write operation, such that at any given time only one word line is activated. Furthermore, each bit line (BL0, BL1, ..., BLn) is connected to a corresponding receiver 228 (e.g., a sense amplifier) to sense the charge (a logic "0" or a logic "1") on the storage element 224 (capacitor) of the given memory cell 220 that is connected to that bit line and selected by the given word line. For a read or write operation, one word line (row) is activated by the corresponding word-line driver 226, which turns on each access transistor 222 of each memory cell 220 in that given row. Once the given word line is activated, one or more bits (columns) in the selected word line are accessed via the corresponding bit lines. All of the bits connected to a given bit line are joined together, but only one bit is selected at any given time.
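The word-line/bit-line addressing just described can be modeled in a few lines. This is an illustrative sketch (the class and method names are ours, not the patent's): one fully decoded word line activates a row, and the bit lines then select individual bits within that row.

```python
# Illustrative model of the Figure 22 array: word lines select rows,
# bit lines select bit positions within the activated row.
class MemoryArray:
    def __init__(self, rows, cols):
        self.cells = [[0] * cols for _ in range(rows)]  # storage elements

    def read_row(self, wordline):
        """Activate one word line; sense every bit line in that row."""
        return list(self.cells[wordline])

    def write(self, wordline, bitline, value):
        """With the word line active, drive one bit line to store a bit."""
        self.cells[wordline][bitline] = value

m = MemoryArray(rows=4, cols=8)
m.write(wordline=1, bitline=3, value=1)
print(m.read_row(1))  # [0, 0, 0, 1, 0, 0, 0, 0]
print(m.read_row(0))  # [0, 0, 0, 0, 0, 0, 0, 0] — other rows are untouched
```

The key constraint the model reflects is that exactly one word line is active at a time, so a single access is always confined to one row.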
Figure 22 schematically illustrates a DRAM (dynamic random-access memory), in which the state (e.g., logic "0" or logic "1") of each memory cell is stored as a charge on the capacitor serving as the storage element 224. As is known in the art, in other memory architectures, such as SRAM (static random-access memory), the row-column "framework" of word lines and bit lines is the same, but each storage element comprises several transistors rather than a capacitor.
According to embodiments of the invention, multi-dimensional memory architectures can be constructed by stacking multiple levels of memory (e.g., DRAM, SRAM) having access wiring that enables different access patterns, in multiple dimensions, for reads and writes. For example, Figures 23A, 23B and 23C collectively illustrate a method for constructing a memory structure comprising multiple levels of memory with different access patterns, according to an example embodiment of the invention. More specifically, Figure 23A schematically illustrates a first level of memory 230A, Figure 23B schematically illustrates a second level of memory 230B, and Figure 23C schematically illustrates a 3-D memory 230C with the second level of memory 230B (Figure 23B) disposed on the first level of memory 230A (Figure 23A). The first level of memory 230A shown in Figure 23A comprises an array of memory cells MC1, MC2, MC3 and MC4, each comprising an access transistor 222A and a storage element 224A, a plurality of word lines (e.g., WL0_A, WL1_A), a plurality of bit lines (e.g., BL0_A, BL1_A), word-line drivers 226A and bit-line receivers 228A. The first level of memory 230A is similar in structure and operation to the memory array 210 described above with reference to Figure 22, except that in Figure 23A, for ease of illustration, only four memory cells MC1, MC2, MC3 and MC4, two word lines (rows) WL0_A and WL1_A, and two bit lines (columns) BL0_A and BL1_A are shown. The word lines extend in the row direction and the bit lines extend in the column direction.
The second level of memory 230B shown in Figure 23B comprises a plurality of memory cells MC1, MC2, MC3 and MC4 corresponding to the memory cells MC1, MC2, MC3 and MC4 shown in Figure 23A. Each of the memory cells shown in Figure 23B comprises an access transistor 222B and a vertical via connection 224B. The vertical via connection 224B connects to the storage element of the corresponding memory cell formed on a different memory layer. For example, the vertical via connection 224B connects to the storage element 224A of the corresponding memory cell on the first level of memory 230A. In addition, the second level of memory 230B in Figure 23B comprises a plurality of word lines (e.g., WL0_B, WL1_B), a plurality of bit lines (e.g., BL0_B, BL1_B), word-line drivers 226B and bit-line receivers 228B. In Figure 23B, each word line extends vertically (by column) and is connected to the gate terminals of the access transistors 222B in the memory cells of the given column. Moreover, each bit line extends horizontally (by row) and is connected to the source terminals of the access transistors 222B in the memory cells of the given row.
According to an example embodiment of the invention, the second level of memory 230B is disposed on the first level of memory 230A to form the 3-D memory architecture depicted in Figure 23C. In particular, Figure 23C shows a 3-D memory 230C in which each of the memory cells MC1, MC2, MC3 and MC4 comprises one storage element 224A and two access transistors 222A and 222B on different memory layers, for accessing the same storage element 224A using two different access wiring patterns. The access transistor 222B on the second level of memory 230B is connected through the vertical connection 224B to the storage element 224A on the corresponding first level of memory 230A. In the 3-D memory architecture of Figure 23C, while there is a single storage element 224A for each of the memory cells MC1, MC2, MC3 and MC4, the word-line and bit-line wiring, word-line drivers and bit-line sense circuitry on the different layers of memory provide different access patterns to the same memory cells MC1, MC2, MC3 and MC4.
In particular, as depicted in Figure 23C, the wordlines WL0_A and WL1_A on the first-level memory 230A are arranged orthogonal to the wordlines WL0_B and WL1_B on the second-level memory 230B. Likewise, the bit lines BL0_A and BL1_A on the first-level memory 230A are arranged orthogonal to the bit lines BL0_B and BL1_B on the second-level memory 230B. In this regard, for each bit (memory cell), the two orthogonal wiring patterns on the first- and second-level memories 230A and 230B enable access to a data structure in different dimensions (e.g., the rows and columns of an array). For example, the wordlines (WL0_A and WL1_A) on the first-level memory 230A can be used to access the memory cells of a horizontal row of the 3-D memory 230C, while the wordlines (WL0_B and WL1_B) on the second-level memory 230B can be used to access the memory cells of a vertical column of the 3-D memory 230C. Because the storage element 224A of each memory cell MC1, MC2, MC3 and MC4 is commonly connected to two different access transistors 222A and 222B, the 3-D memory 230C enables access to the same array of bits using the different wordline and bit line geometries on the different memory levels.
In one embodiment of the invention, the 3-D memory 230C of Figure 23C is implemented as a cache memory structure. In another embodiment of the invention, the 3-D memory 230C of Figure 23C is implemented as a main system memory structure. Moreover, it is to be understood that the levels of memory 230A and 230B forming the structure shown in Figure 23C may be implemented as "conceptual levels of memory" or as separate "physical levels of memory."
In particular, with "physical levels of memory," the levels of memory 230A and 230B are built on separate substrates or chips, wherein the two separate substrates or chips are mounted to each other to form a stacked 3-D structure. With the first-level memory 230A shown in Figure 23A, the circuit components 222A, 224A, 226A and 228A can be built on the active surface of a first substrate, while the access wiring pattern of the wordlines (WL0_A, WL1_A) and bit lines (BL0_A, BL1_A) is fabricated as part of the BEOL structure of the first substrate. Similarly, the second-level memory 230B shown in Figure 23B comprises a second substrate, on whose active surface the components 222B, 226B and 228B are formed, while the access wiring pattern of the wordlines (WL0_B, WL1_B) and bit lines (BL0_B, BL1_B) is fabricated as part of the BEOL structure of the second substrate. The connections 224B can be vertical through-via connections that extend through the first and second substrates and connect the access transistors 222B on the second-level memory 230B to the storage elements 224A on the first-level memory 230A.
With "conceptual levels of memory," the levels of memory 230A and 230B are built on the same substrate or chip, providing one physical level of memory but two different conceptual levels of memory. In this embodiment, the circuit components 222A, 222B, 226A, 226B, 228A and 228B are formed on the active surface of a single substrate, and the two different wordline and bit line access wiring patterns are fabricated as part of the same BEOL structure over the active surface of that substrate. In this regard, a cache can be built with multiple "conceptual" levels on a single chip (e.g., a processor chip), such that the same 2-D array of memory cells can be accessed using two different wiring access patterns. In one embodiment of the invention, caches with multiple conceptual levels can be used to build each of the L2 and/or L3 caches on the first and second processors 90A and 90B as shown in Figs. 9A, 9B and 9C, respectively.
The memory architecture of Figure 23C is useful in a variety of programs in which fetching data in multiple dimensions would increase the efficiency and speed of a particular workload. For example, the wiring framework of the memory structure of Figure 23C would be useful in applications such as matrix multiplication, as described with reference to Figures 24 and 25. In particular, Figure 24 schematically illustrates memory blocks A, B and C, each representing a 4x4 block of memory having 4 rows and 4 columns. Memory block C represents the result of the matrix multiplication (AxB) of the rows and columns of memory blocks A and B. It is to be appreciated that memory blocks A, B and C can be regarded as different memory structures, or as different portions of the same memory structure. In the example shown in Figure 24, when the matrix multiplication AxB is applied, a given entry (bit) of memory block C is computed as the dot product of a row vector of memory block A and a column vector of memory block B, as follows:
C_ij = RA_i · CB_j
where RA_i represents the row vector of memory block A with row index i (where i = 1, 2, 3 or 4), and CB_j represents the column vector of memory block B with column index j (where j = 1, 2, 3 or 4). For example, the entry C_ij in the memory cell for i=1 and j=1 would be computed as:
C_11 = RA_1 · CB_1 = (A11 x B11) + (A12 x B21) + (A13 x B31) + (A14 x B41).
As implied above, the matrix multiplication of memory block A and memory block B requires fetching the rows of memory block A and the columns of memory block B. Assuming memory blocks A and B have a conventional framework as shown in Figure 22, fetching a given row of memory block A requires one primitive operation (in a high-level programming language) to fetch the entire row. Indeed, because each row is accessed by a wordline, every memory cell of a given row is activated by a unique wordline address, and each memory cell along that row is read out via a respective bit line. For example, the first row RA1 of memory block A (which comprises A11, A12, A13 and A14) can be read out in a single operation by inputting the unique address associated with row RA1 to activate its wordline, and then activating the bit lines associated with each column CA1, CA2, CA3 and CA4 to sense the data from memory cell locations A11, A12, A13 and A14, thereby reading row RA1 in a single operation.
On the other hand, in the conventional framework of Figure 22, because each column of the array in memory block B is stored one element per row of the memory, fetching a column from memory block B requires multiple operations. For example, to read the first column CB1 (B11, B21, B31 and B41) of memory block B in Figure 24, each row RB1, RB2, RB3 and RB4 of memory block B must be activated sequentially, and at any given time only one bit (B11, B21, B31 or B41) of the target column CB1 is accessed from each activated row. This requires 4 sequential wordline activations and read operations.
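The access-count asymmetry described above can be sketched in a few lines. The following is an illustrative model (the function names and the counting of "wordline activations" are assumptions for illustration, not part of the patent text): in a conventional single-wordline-per-row memory, a row costs one activation while a column costs one per row, whereas a second orthogonal wiring layer reads a column in one operation.

```python
# Model a 4x4 block B as a list of rows and count wordline activations.
B = [[f"B{i+1}{j+1}" for j in range(4)] for i in range(4)]

def fetch_row_conventional(mem, i):
    # One wordline activation reads out the entire row (Figure 22 case).
    return mem[i], 1

def fetch_col_conventional(mem, j):
    # Each row must be activated in turn to pick off one bit of the column.
    col, activations = [], 0
    for row in mem:
        col.append(row[j])
        activations += 1
    return col, activations

def fetch_col_dual_wiring(mem, j):
    # With an orthogonal second wordline layer (Figure 23C case),
    # a full column is sensed in a single operation.
    return [row[j] for row in mem], 1

col, n_conv = fetch_col_conventional(B, 0)
_, n_dual = fetch_col_dual_wiring(B, 0)
print(col, n_conv, n_dual)  # ['B11', 'B21', 'B31', 'B41'] 4 1
```

Under this model, the dual-wiring structure reduces a column fetch from 4 activations to 1, matching the 4 sequential reads described for column CB1.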
According to an embodiment of the invention, the architecture of Figure 23C can be used to build memory blocks A and B of Figure 24, such that a row of memory block A and a column of memory block B can each be fetched in a single operation. For example, Figure 25 schematically illustrates a method of fetching the rows and columns of the memory blocks using single primitive operations, according to an example embodiment of the invention. In particular, Figure 25 depicts that a single row (the second row) of memory block A can be fetched in a single primitive instruction LDA [2, i], using the access layer of the memory in which the wordlines extend horizontally, where LD represents a "load" operation in assembly language. Similarly, Figure 25 depicts that a single column (the fourth column) of memory block B can be fetched in a single primitive instruction LDB [i, 4], using the access layer of the memory in which the wordlines extend vertically.
The example embodiment of Figure 23C depicts a 3-D memory structure comprising two levels of memory with different, orthogonal access wiring patterns. In other embodiments of the invention, a 3-D memory structure can be formed with 3 or more different access wiring patterns for accessing one or more levels of storage. Moreover, while Figure 23C depicts a 3-D memory structure having one level of storage accessed by two different wiring patterns, in other embodiments of the invention a 3-D memory structure is built with two or more levels of storage, wherein each level of storage is shared by one or more layers of different access wiring geometries. Furthermore, in other example embodiments of the invention, access wiring patterns other than orthogonal patterns can be implemented, as shown in Figures 26, 27 and 28.
In general, Figure 26 depicts a memory array comprising an array of memory cells and a diagonal access wiring pattern, according to an example embodiment of the invention. More specifically, Figure 26 depicts a memory array 240 comprising 64 memory cells (M) arranged in a 2-D array of 8 rows (R1, R2, ..., R8) and 8 columns (C1, C2, ..., C8). The memory array 240 comprises multiple wordlines WL1, WL2, WL3, WL4, WL5, WL6, WL7 and WL8 arranged in a diagonal access pattern, wherein each wordline is connected to one memory cell M(i, j) from each row and each column, where i denotes the row index and j denotes the column index. For example, wordline WL1 is connected to memory cells M(1,8), M(2,7), M(3,6), M(4,5), M(5,4), M(6,3), M(7,2) and M(8,1). Furthermore, wordline WL2 is connected to memory cells M(1,1), M(2,8), M(3,7), M(4,6), M(5,5), M(6,4), M(7,3) and M(8,2). Although not specifically shown in Figure 26, in one embodiment of the invention, all of the bit lines of the memory array 240 in Figure 26 can run in either the column direction or the row direction. Thus, when a given wordline is activated, every bit line can be activated simultaneously to sense the one bit in each row activated by the given wordline.
In addition, Figure 27 depicts a memory array comprising an array of memory cells and a diagonal access wiring pattern, according to another example embodiment of the invention. More specifically, Figure 27 depicts a memory array 250 comprising 64 memory cells (M) arranged in a 2-D array of 8 rows (R1, R2, ..., R8) and 8 columns (C1, C2, ..., C8), similar to Figure 26, but wherein the multiple wordlines WL1, WL2, WL3, WL4, WL5, WL6, WL7 and WL8 in Figure 27 are arranged in a diagonal access pattern that is the mirror image of the wordline wiring pattern shown in Figure 26. In Figure 27, each wordline is connected to one memory cell M(i, j) from each row and each column, where i denotes the row index and j denotes the column index. For example, wordline WL8 is connected to memory cells M(1,1), M(2,2), M(3,3), M(4,4), M(5,5), M(6,6), M(7,7) and M(8,8). Although not specifically shown in Figure 27, in one embodiment of the invention, all of the bit lines of the memory array 250 in Figure 27 may run in either the column direction or the row direction. Thus, when a given wordline is activated, every bit line can be activated simultaneously to sense the one bit in each row activated by the given wordline.
Figure 28 depicts a memory array comprising an array of memory cells and a shifted access wiring pattern, according to another example embodiment of the invention. More specifically, Figure 28 depicts a memory array 260 comprising 64 memory cells (M) arranged in a 2-D array of 8 rows (R1, R2, ..., R8) and 8 columns (C1, C2, ..., C8), similar to Figures 26 and 27, but wherein the multiple wordlines WL1, WL2, WL3, WL4, WL5, WL6, WL7 and WL8 in Figure 28 are arranged in a row-shift access pattern. In particular, the wordlines in Figure 28 are shown extending in the column direction for at least two rows, then shifting diagonally to another column (where they again extend for at least two rows), and so on. For example, wordline WL1 is connected to memory cells M(1,4), M(2,4), M(3,3), M(4,3), M(5,2), M(6,2), M(7,1) and M(8,1). Although not specifically shown in Figure 28, in one embodiment of the invention, the bit lines of the memory array 260 in Figure 28 can run in the row direction, such that when a given wordline is activated, every bit line can be activated to sense the one bit in each row activated by the given wordline.
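The diagonal pattern of Figure 26 can be described by simple index arithmetic. The following sketch (the formula is inferred from the M(i, j) examples given in the text for WL1 and WL2, and is an assumption rather than the patent's wiring definition) enumerates the cells touched by each diagonal wordline of the 8x8 array.

```python
N = 8  # 8x8 array of Figure 26

def diagonal_wordline(k):
    # Wordline WLk (k = 1..8) touches exactly one cell per row and per
    # column: row i pairs with column ((k - i - 1) mod N) + 1, so WL1
    # covers the anti-diagonal and each later wordline shifts it by one.
    return [(i, ((k - i - 1) % N) + 1) for i in range(1, N + 1)]

print(diagonal_wordline(1))  # (1,8), (2,7), ..., (8,1) as in the text
print(diagonal_wordline(2))  # (1,1), (2,8), ..., (8,2) as in the text
```

Because each wordline visits every row and every column exactly once, activating one wordline lets all bit lines (whether run row-wise or column-wise) each sense a distinct bit simultaneously, as the paragraph above describes.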
It is to be appreciated that the access wiring patterns shown in Figures 23C, 26, 27 and 28 are example embodiments, and that other access wiring patterns can be implemented. A 3-D memory structure can have multiple layers of different access wiring geometries, such as those shown in Figures 23C, 26, 27 and 28, commonly connected to one level of storage. As noted above, the orthogonal access wiring patterns shown in Figure 23C enable a data structure to be accessed in different dimensions (e.g., the rows and columns of an array). The irregular access patterns of Figures 26, 27 and 28 allow data to be stored in arbitrary patterns in support of cryptography and error checking. For example, the access wiring patterns of Figures 26, 27 and 28 can be used to store data in an arbitrary pattern, such that the data is effectively encrypted with a unique pattern. Moreover, if simple parity is maintained in each dimension, the different access wiring patterns can be used to perform robust error correction on the array. For example, if the parity of a given row and a given column is bad, then the bit at the intersection of that row and column can be determined to be the erroneous bit.
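The cross-parity idea at the end of the paragraph above can be illustrated as follows. This is a minimal sketch under assumed conventions (even parity per row and per column, a single flipped bit), not the patent's circuit: the row whose parity fails and the column whose parity fails intersect at the bad bit.

```python
def find_flipped_bit(bits, row_parity, col_parity):
    # bits: 2-D list of 0/1; *_parity: parities recorded before the error.
    bad_rows = [i for i, row in enumerate(bits)
                if sum(row) % 2 != row_parity[i]]
    bad_cols = [j for j in range(len(bits[0]))
                if sum(row[j] for row in bits) % 2 != col_parity[j]]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]  # intersection locates the bad bit
    return None  # no single-bit error detected

data = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
rp = [sum(r) % 2 for r in data]
cp = [sum(r[j] for r in data) % 2 for j in range(4)]
data[2][1] ^= 1  # inject a single-bit error
print(find_flipped_bit(data, rp, cp))  # (2, 1)
```

With two independent access dimensions wired into the same cells, both parity checks can be computed without rearranging the data in software.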
In other embodiments of the invention, a 3-D memory structure can be configured with multiple layers of storage, wherein data can be stored and accessed in three dimensions. For example, Figure 29 schematically illustrates a 3-D memory structure that enables 3-D access patterns across multiple levels of storage, according to an example embodiment of the invention. In particular, Figure 29 depicts a 3-D memory structure 270 comprising multiple levels of memory (e.g., plane 0, plane 1, plane 2, ...), wherein each memory level comprises one level of storage and at least one 2-D access wiring geometry. By way of illustration, Figure 29 shows a first-level memory (plane 0) comprising a 2-D array of memory cells M1, M2, M3 and M4 that is accessible within the given plane using an orthogonal access wiring pattern comprising wordlines (WL0_0, WL1_0) and bit lines (BL0_0, BL1_0). A second-level memory (plane 1) comprises a 2-D array of memory cells M5, M6, M7 and M8 that is accessible within the given plane using an orthogonal access wiring pattern comprising wordlines (WL0_1, WL1_1) and bit lines (BL0_1, BL1_1). In addition, a third-level memory (plane 2) comprises a 2-D array of memory cells M9, M10, M11 and M12 that is accessible within the given plane using an orthogonal access wiring pattern comprising wordlines (WL0_2, WL1_2) and bit lines (BL0_2, BL1_2).
In addition, the 3-D memory structure 270 of Figure 29 comprises multiple vertical wordlines WL0_3, WL1_3, WL2_3 and WL3_3, which are connected to columns of memory cells spanning the different levels of storage. In particular, a first vertical wordline WL0_3 is connected to memory cells M3, M7 and M11 across the three planes (planes 0, 1 and 2). A second vertical wordline WL1_3 is connected to memory cells M1, M5 and M9 across the three planes. A third vertical wordline WL2_3 is connected to memory cells M4, M8 and M12 across the three planes. A fourth vertical wordline WL3_3 is connected to memory cells M2, M6 and M10 across the three planes. In this regard, Figure 29 depicts a 3-D storage structure in which data can be stored along any one of three dimensions and accessed along any one of three dimensions. These concepts are described further with reference to Figures 30A, 30B and 30C.
In particular, Figures 30A, 30B and 30C schematically illustrate methods for accessing data in multiple dimensions using the example 3-D memory structure of Figure 29. Specifically, Figure 30A depicts a method for accessing data in a y-z plane of the memory structure 270 of Figure 29 for a fixed x value (memory cells M1, M2, M5, M6, M9 and M10). Figure 30B depicts a method for accessing data in an x-y plane of the memory structure 270 of Figure 29 for a fixed z value (memory cells M5, M6, M7 and M8). Figure 30C depicts a method for accessing data in an x-z plane of the memory structure 270 of Figure 29 for a fixed y value (memory cells M1, M3, M5, M7, M9 and M11). The 3-D structure of Figure 29 supports the use of new primitive operations for moving data. For example, in any dimension, the data of one layer can be moved into an orthogonal dimension as a single primitive operation. As an example, in Figure 30B, the data of the x-y plane for a fixed z value can be moved, as a single primitive operation, to the memory locations of another x-y plane for another z value. In other embodiments of the invention, a primitive operation can be defined to transpose (exchange) the data of two parallel planes. The multiple horizontal and vertical lines shown in Figures 30A, 30B and 30C are depicted with double-headed arrows since, depending on the wiring framework implemented, these lines generally represent wordlines and/or bit lines.
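The plane-at-a-time accesses and the bulk plane-move primitive described above can be modeled functionally. The following sketch assumes a mem[x][y][z] coordinate convention (an illustration choice, not specified in the patent text); each `plane_fixed_*` function corresponds to one of Figures 30A-30C, and `move_plane_z` models moving one x-y plane to another z value as a single bulk operation.

```python
def make_mem(nx, ny, nz):
    # mem[x][y][z], with each cell named by its coordinates.
    return [[[f"({x},{y},{z})" for z in range(nz)]
             for y in range(ny)] for x in range(nx)]

def plane_fixed_x(mem, x):  # y-z plane, as in Figure 30A
    return [[mem[x][y][z] for z in range(len(mem[0][0]))]
            for y in range(len(mem[0]))]

def plane_fixed_z(mem, z):  # x-y plane, as in Figure 30B
    return [[mem[x][y][z] for y in range(len(mem[0]))]
            for x in range(len(mem))]

def plane_fixed_y(mem, y):  # x-z plane, as in Figure 30C
    return [[mem[x][y][z] for z in range(len(mem[0][0]))]
            for x in range(len(mem))]

def move_plane_z(mem, z_src, z_dst):
    # One bulk "primitive": copy an entire x-y plane to another z value.
    for x in range(len(mem)):
        for y in range(len(mem[0])):
            mem[x][y][z_dst] = mem[x][y][z_src]

mem = make_mem(2, 2, 2)
print(plane_fixed_z(mem, 0))  # [['(0,0,0)', '(0,1,0)'], ['(1,0,0)', '(1,1,0)']]
```

In hardware, each of these whole-plane reads would correspond to activating one set of the plane's wordlines and sensing all associated bit lines at once, rather than looping as the software model does.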
It is to be appreciated that each memory cell shown in Figure 29 (and in Figures 30A, 30B and 30C) can represent a single bit, a byte, a word, a cache line, or any other quantum of data. It is to be further understood that, for ease of illustration, each 2-D memory plane (plane 0, plane 1, plane 2) shown in Figure 29 is depicted as having 4 memory cells and 2 wordlines and 2 bit lines, but each memory plane can have more memory cells, wordlines and bit lines. Moreover, while only 3 2-D planes of memory are shown in Figure 29, a 3-D memory structure can be built with either 2 levels of 2-D storage or 4 or more levels of 2-D storage, wherein each level of storage has an associated access wiring pattern. Indeed, while Figure 29 shows one access wiring geometry associated with each 2-D level of storage, one or more of the memory planes (plane 0, plane 1, plane 2) can have two or more associated access wiring patterns, such that the 2-D array of data of a given memory plane can be accessed with different wiring geometries, as discussed above with reference to Figure 23C.
It is to be further understood that, as noted above, each level (plane) of memory of the 3-D memory structure 270 shown in Figure 29 may be implemented as a physical level of memory or as a conceptual level of memory. For example, in one embodiment of the invention, the 3-D memory structure 270 may be implemented on a single substrate or chip, wherein all of the memory circuit components (access transistors, storage elements, drivers, sense amplifiers, etc.) are formed on the single substrate or chip, and wherein all of the wiring is fabricated as part of the BEOL structure of the single chip. In this embodiment, all of the storage bits of the 3-D memory would be disposed in a single 2-D plane, but the access wiring geometry would be designed to connect the storage bits of the memory cells in such a way as to create a virtual 3-D memory structure as conceptually illustrated in Figure 29.
In an alternative embodiment of the invention, to obtain enhanced storage density, each level (plane) of memory of the 3-D memory structure 270 shown in Figure 29 is formed on a separate substrate or chip, wherein the different substrates/chips are stacked on one another to form a physical 3-D stacked memory structure. In this embodiment, each substrate/chip has the storage elements, access devices and access wiring geometry associated with its given level of memory, wherein vertical through-via connections are formed through the different substrates/chips to create the vertical access wiring (e.g., wordlines) for accessing memory cells across the different physical levels of memory. As an example, in one embodiment of the invention, a cache with multiple physical levels using the structure of Figure 29 can be utilized to create the joint L2 and L3 caches between the first and second processors 90A and 90B, respectively, as shown in Fig. 9C.
In other embodiments of the invention, the 3-D memory structure 270 shown in Figure 29 can be fabricated as a combination of conceptual and physical levels of memory. For example, assuming a 4-level memory structure, 2 of the 4 levels of memory can be created in a first substrate as first and second conceptual levels of storage, while the remaining 2 levels of memory can be fabricated in a separate, second substrate as third and fourth conceptual levels of storage. The first and second substrates (each having two conceptual levels of memory) can be stacked on one another to form a 3-D stacked structure with 4 levels of memory.
As discussed above with reference to Figures 23C and 24, for example, a 2-D array (data structure) can be stored in a memory structure having one level of memory with two different access wiring patterns, such that an entire row or an entire column of the 2-D array can be accessed with one primitive operation. In other embodiments of the invention, a 2-D data array structure can be stored in a standard memory structure having one level of memory and one access wiring pattern, such that an entire row or column can be fetched in one operation. For example, Figure 31 depicts a method for storing a 2-D data array in memory according to an example embodiment of the invention, which enables access to both rows and columns in one operation. Figure 31 schematically illustrates a memory array 280 comprising a 2-D array of memory cells arranged in 4 rows (R0, R1, R2 and R3) and 4 columns (C0, C1, C2 and C3), wherein the memory cells are accessible via an access wiring structure comprising 4 wordlines (WL0, WL1, WL2, WL3) and 4 bit lines (BL0, BL1, BL2, BL3).
The memory array 280 of Figure 31 is depicted as storing a 4x4 data array structure comprising data elements A(i, j), where i denotes the row index and j denotes the column index. In contrast to the data storage arrangement of memory block A shown in Figure 24, the rows and columns of the data array structure shown in Figure 31 are stored in a skewed arrangement, such that all of the elements of a given row are stored in different columns, and all of the elements of a given column are stored in different rows. In particular, by rotating the columns of each row by its row index, the data elements A(i, j) of the data array structure can be stored one per memory cell such that the data is skewed in both rows and columns.
For example, in Figure 31, the 0th row (R0) of the memory 280 holds the first-row data (A11, A12, A13 and A14) of the data structure in its normal position. However, the second-row data (A21, A22, A23 and A24) is stored in the 1st row (R1) of the memory 280 with its data elements shifted right by 1. Likewise, the third-row data (A31, A32, A33, A34) is stored in the 2nd row (R2) of the memory 280 with its data elements shifted right by 2, and the fourth-row data (A41, A42, A43, A44) is stored in the 3rd row (R3) of the memory 280 with its data elements shifted right by 3. In this regard, each row and each column of the data structure A falls in different rows and columns of the memory array 280, which allows any row and any column to be fetched in a single operation. For example, the first row RA1 of the data structure A (elements A11, A12, A13, A14) can be accessed by activating wordline WL0 and then activating each bit line BL0, BL1, BL2 and BL3 to sense each element (A11, A12, A13, A14) of the first row RA1 of the data array structure A in one operation. In addition, the first column CA1 of the data array structure A (elements A11, A21, A31, A41, as shown by the hatching 282) can be accessed by activating each wordline WL0-WL3 and then activating each bit line BL0-BL3 to sense each element (A11, A21, A31, A41) of the first column CA1 of the data array structure A in one operation.
In a similar fashion, the second, third and fourth rows and columns of the data array structure can be sensed from the memory 280, except that a rotation method 284 is used to rotate the bits left by some number of positions, as required to place the bits in the proper order. For example, when the second row of the data array structure is read out, the data elements on bit lines BL0, BL1, BL2 and BL3 will be in the order A24, A21, A22 and A23. A rotate operation of 1 bit position would then be applied to place the data elements in their proper order, i.e., A21, A22, A23 and A24.
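The skewed layout of Figure 31 and the fix-up rotation can be sketched end to end. This is an illustrative model following the description above (the helper names and the direction conventions are assumptions): each logical row i is stored rotated right by i, so a row read needs one wordline activation plus a rotate, and a column read picks one element off each bit line.

```python
N = 4
A = [[f"A{i+1}{j+1}" for j in range(N)] for i in range(N)]

def rot_right(row, k):
    k %= len(row)
    return row[-k:] + row[:-k] if k else list(row)

def rot_left(row, k):
    return rot_right(row, -k)

# Skewed physical storage: memory row i holds logical row i rotated by i.
mem = [rot_right(A[i], i) for i in range(N)]

def read_row(i):
    # One wordline activation, then undo the skew with a rotate.
    return rot_left(mem[i], i)

def read_col(j):
    # Logical element A(i, j) sits at memory column (j + i) mod N, so
    # each bit line contributes exactly one element of the column.
    return [mem[i][(j + i) % N] for i in range(N)]

print(mem[1])       # ['A24', 'A21', 'A22', 'A23'], as in the text
print(read_row(1))  # ['A21', 'A22', 'A23', 'A24']
print(read_col(0))  # ['A11', 'A21', 'A31', 'A41']
```

Note that the rotation is pure data reordering after the sense operation; the number of wordline activations per row or column fetch stays at one, which is the point of the skewed arrangement.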
In an alternative embodiment of the invention, the example storage method discussed above with reference to Figure 31 can be extended to 3-D applications, such as shown in Figure 32. Figure 32 schematically illustrates a method for storing a 3-D data array in a 3-D memory structure, according to an example embodiment of the invention. Figure 32 depicts a cubic structure comprising a 4x4x4 matrix of memory elements. For example, this cubic structure is representative of a 3-D memory structure having the architecture depicted in Figure 29. In this embodiment, the data of a 3-D array can be stored in the 3-D memory by skewing the rows and columns within each 4x4 2-D plane, as well as in the vertical (stacking) direction. In Figure 32, each number (1, 2, 3, 4) on the cubes of the memory structure 290 identifies the row of a given 2-D array to which the data element belongs, together with the associated column position of that row within the given 2-D array.
The memory arrangement of Figure 32 allows any 2-D slice of a 4x4x4 (3-D) data structure to be kept in any 4x4 (2-D) plane of the 3-D memory system, such that the data elements of each 2-D data slice can be accessed in one operation. In other embodiments, 2-D data can be mapped into a 3-D memory structure by storing one dimension across the planes. For example, a 4x16 2-D matrix can be stored in the 3-D memory of Figure 32 by striping the 4x16 array into 4x4 sections and storing each 4x4 section in a separate plane of the 3-D memory structure. Moreover, assuming a 3-D memory built as a 64x256 memory with multiple levels, the data of a 256x256 2-D array can be stored in the 3-D memory system by first dividing the 256-row dimension into 4 separate parts (e.g., forming 4 64x256 parts), and storing each of the 4 parts on a different one of the 4 64x256 levels of the 3-D memory.
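The 256x256-onto-4-levels mapping described above reduces to simple index arithmetic. The following sketch makes the assumed convention explicit (splitting the row dimension into contiguous 64-row bands, one band per level; the function names are illustrative):

```python
ROWS, COLS, LEVELS = 256, 256, 4
ROWS_PER_LEVEL = ROWS // LEVELS  # 64 rows of the 2-D array per level

def to_3d(i, j):
    # Logical element (i, j) of the 256x256 array -> (level, local row, column)
    # in a 3-D memory organized as 4 levels of 64x256.
    return i // ROWS_PER_LEVEL, i % ROWS_PER_LEVEL, j

def to_2d(level, r, j):
    # Inverse mapping, back to the logical 2-D coordinates.
    return level * ROWS_PER_LEVEL + r, j

print(to_3d(200, 17))  # (3, 8, 17): row 200 lands on level 3, local row 8
```

Any contiguous-band split would do; the choice here simply keeps whole rows intact within a level, so a row fetch never crosses levels.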
Other embodiments of the invention include structures and methods for implementing a 3-D computer processor system using multichip systems. For example, Figure 33 is a schematic side view of a multichip system to which embodiments of the invention can be applied. In particular, Figure 33 shows a multichip system 300 comprising a package substrate 310, a 3-D computing stack 320 mounted to the substrate 310 using a surface-mount structure 330 (e.g., a ball grid array structure), and a cold plate 340 mounted on the computing stack 320. The computing stack 320 comprises multiple stacked layers, including one or more processor core layers 321, an interconnect and I/O wiring layer 322, an L3 cache layer 323, multiple L4 cache layers 324, an optional layer 325, and a power converter layer 326. Each layer 321, 322, 323, 324, 325 and 326 comprises a semiconductor chip 321A, 322A, 323A, 324A, 325A and 326A, respectively, having a front-side (active) surface and a back-side (inactive) surface opposite the active surface.
The power converter layer 326 comprises circuitry for converting the higher-voltage power delivered through the package substrate 310 (e.g., 10V) to the lower-voltage power (e.g., 1V) supplied to the active circuitry of the multiple layers. The power converter layer 326 can comprise other circuits and circuit components, such as capacitors and accelerator circuits for implementing other standard functions. An accelerator is, for example, an ASIC hardware engine that performs a specific function. The back side of the power converter layer 326 is connected to the package substrate 310 via the surface-mount structure 330. The optional layer 325 can comprise spare memory or other features. The L4 cache layers 324 comprise multiple memory layers (L1, L2, L3 and L4) mounted together face-to-back. The back side of the L3 cache layer 323 is mounted to the face of the first layer L1 of the L4 cache stack 324. The active surface 323A of the L3 cache layer 323 may further comprise driver and control circuitry for controlling the multiple L4 cache layers 324.
In one embodiment, the processor core layers 321 comprise multiple processor chips, wherein each processor chip can comprise one or more processors. For example, the processor chips can be connected using the techniques described above with reference to Figures 13, 14, 15 and 16. The interconnect and I/O wiring layer 322 comprises the wiring that connects the processor core layers 321 to each other, wherein the interconnect and I/O wiring layer comprises multiple input/output ports that are globally connected to, and shared by, the multiple processor core layers 321. In the example embodiment of Figure 33, the lowest processor core layer in the stack of processor core layers 321 is shown mounted face-to-face with the interconnect and I/O wiring layer 322 via an interconnect array 327 (e.g., solder balls).
The interconnect and I/O wiring layer 322 comprises a wiring network to link together the various local memory layers (i.e., memory layers 323 and 324) to create a storage system. For example, the multiple memory layers can be connected to each other and controlled using one or more of the techniques described above with reference to Figs. 9A, 9B, 9C and 22 through 32. In addition, the interconnect and I/O wiring layer 322 comprises a wiring network to connect the input/output ports, globally shared by the stack of processor core layers 321, to the aggregated storage system formed by the interconnected memory layers 323 and 324. Furthermore, a global interconnect bus, comprising vertical wiring and interconnects formed through the memory layers 323, 324 and the power converter layer 326, is formed to connect the interconnect and I/O wiring layer 322 to the wiring formed in the package substrate (via the surface-mount structure 330).
While Figure 33 depicts one computing stack 320, multiple computing stacks can be mounted on a common substrate to form a multiprocessor computing system. For example, Figure 34 is a high-level view of a 3-D computer processor system to which embodiments of the invention can be applied. In particular, Figure 34 depicts a 3-D multiprocessor computing system 400 having multiple computing stacks 420 that are mounted on a common substrate 410 and cooled by a common cold plate structure 440 thermally coupled to the upper surfaces of the computing stacks 420. The computing stacks 420 shown in Figure 34 can have the same or similar structure as the computing stack 320 shown in Figure 33. The package substrate 410 comprises multiple electrical interconnects and traces forming electrical wiring that provides all-to-all connections between the multiple computing stacks 420. The cold plate 440 can be a structure supporting liquid cooling, or a thermal spreader plate supporting air cooling.
In the embodiment of Figure 34, using a common cold plate 440 to cool each computing stack 420 can be problematic for a variety of reasons. For example, for reasons understood by those of ordinary skill in the art, depending on the cooling technique employed (e.g., liquid cooling, air cooling), the common cold plate 440 may not provide sufficient cooling capacity to adequately cool the different computing stacks 420 located at different positions of the cold plate 440. Moreover, as the cold plate 440 expands and contracts (due to its coefficient of thermal expansion), different pressures and strains may be applied to different areas of the cold plate 440 and to the thermal interfaces at the upper surfaces of the computing stacks 420, which is difficult to manage. For example, the displacement between the surface of the cold plate 440 and the surface of a given computing stack 420 is greater for those computing stacks 420 located farther from the center of the cold plate, which causes greater stress on, and possible damage to, the cold plate 440 and the thermal interfaces of the computing stacks 420 located closer to the outer perimeter of the cold plate 440. Furthermore, with the 3-D computing system 400 of Figure 34, the fabrication of the package substrate 410 will be more expensive and complex because of the multiple levels of wiring required to link all of the computing stacks 420 together. Indeed, depending on the number of computing stacks 420 forming the system, and on the particular wiring network structure used, the package substrate 410 can have 100 or more levels of wiring, which may be very expensive to manufacture.
In other embodiments of the invention, the problems associated with the common cooling plate 440 and with a package substrate having complex wiring are eliminated by constructing a 3-D computer processor system comprising multiple multichip systems in an aggregated structure, wherein the aggregated structure incorporates multiple local power and cooling layers together with a global interconnect structure that connects the multichip systems of the aggregated structure. For example, Figures 35, 36, 37, 38 and 39 schematically illustrate embodiments of 3-D computer processor systems comprising multiple multichip systems. Figure 35 is a side view of a multichip system according to an embodiment of the invention. Figure 36 depicts a 3-D computer processor system according to an embodiment of the invention, which is constructed by joining multiple multichip systems as shown in Figure 35.
In particular, Figure 35 depicts a multichip system 500 comprising a local power converter layer 510, multiple (m) memory layers 520, a local interconnect and I/O wiring layer 530, multiple processor core layers 540, and a local cooling layer 550. The local cooling layer 550 comprises a local inlet 552 and a local outlet 554. The local power converter layer 510 comprises a local power supply connection 512 and a local ground connection 514. The multichip system 500 further comprises a global bus 560 that runs through the stacked structure and is connected to the local interconnect and I/O wiring layer 530. The layers 510, 520, 530, 540 and 550 of the multichip system 500 are similar in structure and function to the corresponding layers 326, 324/323, 322, 321 and 340 of the multichip system 300 shown in Figure 33. However, the multichip system 500 shown in Figure 35 serves as a building block for a 3-D computer processor system that is constructed by physically aggregating and joining multiple multichip systems as shown in Figure 35.
Figure 36 schematically illustrates a 3-D computer processor system 600 according to an embodiment of the invention, which is formed by stacking multiple multichip systems (such as the multichip system 500 shown in Figure 35) in a vertical structure. In particular, Figure 36 shows a 3-D computer processor system 600 comprising ten multichip systems (500_1, 500_2, ..., 500_10) vertically stacked on one another. The system 600 comprises a global power supply structure 610 connected to the local power converter layer 510 of each multichip system (500_1, 500_2, ..., 500_10), and a global coolant system 650 connected to the local inlets and outlets of the local cooling layer 550 of each multichip system (500_1, 500_2, ..., 500_10). In this embodiment, the 3-D computer processor system 600 is cooled by an aggregated cooling system made up of the independent local cooling systems 550 of the individual multichip systems 500. This structure eliminates the requirements and problems associated with the common cooling plate 440 shown in Figure 34.
Figure 37 schematically illustrates a technique for connecting a global bus to each multichip system of a 3-D computer processor system, according to an embodiment of the invention. In particular, Figure 37 shows a 3-D computer processor system 700 comprising multiple multichip systems (701, 702, 703, 704, 705, 706, 707, 708, 709 and 710) and a global interconnect structure 760 that connects the multichip systems of the 3-D computer processor system 700. For ease of illustration, the global bus 760 in Figure 37 is generically depicted as a shared bus connecting each of the multichip systems (701, 702, 703, 704, 705, 706, 707, 708, 709 and 710) in the 3-D computer processor system 700. In one embodiment, the global bus 760 can be an electrical bus formed by the wiring and interconnects that run through each of the multiple chip layers forming the multichip systems (701, 702, 703, 704, 705, 706, 707, 708, 709 and 710). For example, the bus element 560 shown in Figure 35 represents a portion of the global bus 760 of Figure 37, which runs through the local chip layers of each multichip system 500 and is connected to the local interconnect and I/O wiring layer 530 of each multichip system.
The global bus 760 shown in Figure 37 is connected to the local interconnect and I/O wiring layer of each multichip system (701, 702, 703, 704, 705, 706, 707, 708, 709 and 710) forming the 3-D computer processor system 700. As discussed above, within a given multichip system, the local interconnect and I/O wiring layer connects all of the processor core layers to each other, connects all of the memory layers 520 to each other, and connects all of the local processor cores and memory layers to each other. The global bus 760 enables point-to-point communication between the individual multichip systems (701, 702, 703, 704, 705, 706, 707, 708, 709 and 710) in the 3-D computer processor system 700. The global bus 760 eliminates the need for the wiring structure provided by the package substrate 410 (to connect the individual compute stacks 420 in the 3-D computer processor system 400 shown in Figure 34). In this embodiment, assuming each layer in the system 700 is 100 microns thick, 100 layers in the 3-D system 700 would be about 1 cm thick, and the overall length of the global bus 760 wired between the outermost multichip systems 701 and 710 of the 3-D computer processor system 700 would not be problematic.
In another embodiment of the invention, the global bus 760 can be formed using an optical fiber system with laser communication. In this embodiment, point-to-point communication on a shared fiber bus can be realized by assigning a different communication signal wavelength (color) to each multichip system (701, 702, 703, 704, 705, 706, 707, 708, 709 and 710) in the 3-D computer processor system 700. For example, a base wavelength can be assigned to the first multichip system 701, and each remaining multichip system (702, 703, 704, 705, 706, 707, 708, 709 and 710) can then be assigned an incrementally larger (or smaller) laser wavelength. An optical fiber system would allow the multichip systems (701, 702, 703, 704, 705, 706, 707, 708, 709 and 710) to transmit information to the other multichip systems on the shared bus 760 without having to wait for control of the shared bus 760, as would be required if the global bus 760 were implemented electrically. Whether the global bus 760 is implemented optically or electrically, a coherence scheme would be implemented to control and coordinate the point-to-point communications on the shared global bus 760.
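The wavelength-assignment scheme described above can be sketched in software: each multichip system receives the base wavelength plus a fixed per-system increment, so every transmitter owns a distinct channel and no bus arbitration is needed. The numeric values below (a 1550 nm base and 0.8 nm channel spacing) are illustrative assumptions, not figures from this specification.

```python
# Illustrative sketch: assign each multichip system its own laser wavelength
# so all ten can share one fiber bus without arbitration. The base wavelength
# and channel spacing are assumed values, chosen for illustration only.

BASE_WAVELENGTH_NM = 1550.0   # assumed base wavelength for system 701
CHANNEL_SPACING_NM = 0.8      # assumed increment per multichip system

def assign_wavelengths(system_ids):
    """Map each multichip system ID to a unique wavelength (in nm)."""
    return {sid: BASE_WAVELENGTH_NM + i * CHANNEL_SPACING_NM
            for i, sid in enumerate(system_ids)}

systems = list(range(701, 711))          # multichip systems 701..710
channels = assign_wavelengths(systems)

# Each system transmits only on its own wavelength, so no sender ever
# has to wait for control of the shared bus.
assert channels[701] == 1550.0
assert channels[710] == 1550.0 + 9 * 0.8
assert len(set(channels.values())) == len(systems)  # all wavelengths distinct
```

Because the channels are disjoint, a receiver tuned to a given wavelength unambiguously identifies the sender, which is what makes point-to-point communication on the shared fiber possible.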
Figure 38 depicts a 3-D computer processor system according to another embodiment of the invention. In particular, Figure 38 shows a 3-D computer processor system 800 comprising multiple multichip systems 820. As described above, each multichip system 820 comprises multiple joined layers 826, where the layers 826 include processor chips, memory chips, local interconnect and I/O wiring layers, and the like. Some of the multichip systems 820 may comprise only processor chips, only memory chips, or a combination thereof. Each multichip system 820 further comprises a local power converter layer 822 and a local cooling layer 824. As discussed above with regard to other embodiments, the local cooling layer 824 has inlets and outlets that are connected to a global coolant system 830.
The 3-D computer processor system 800 further comprises a substrate 810 on which the multiple multichip systems 820 are mounted. In particular, the multiple chips and layers of the multichip systems 820 can be edge-mounted to the substrate 810. In one embodiment, the substrate 810 comprises wiring and components to provide a power distribution network for supplying global power to each local power converter layer 822, and to implement the wiring of a global electrical bus that is edge-connected to the local interconnect and I/O wiring layers of the multichip systems 820. In another embodiment, the substrate 810 comprises wiring and components to implement the power distribution network, while the wiring and interconnects formed in the multiple layers of the aggregated multichip systems 820 are used to construct a global interconnect network that extends longitudinally from one end to the other and runs through the multichip systems 820.
Figure 39 depicts a 3-D computer processor system according to another embodiment. In particular, similar to Figure 38, Figure 39 shows a 3-D computer processor system 900 comprising multiple multichip systems 920 that are edge-mounted to a substrate 910. As described above, each multichip system 920 comprises multiple layers 926 joined together, where the layers 926 include processor chips, memory chips, local interconnect and I/O wiring layers, and the like. Some multichip systems may comprise only processor chips, only memory chips, or a combination thereof. Each multichip system 920 further comprises a local power converter layer 922. The substrate 910 of the 3-D computer processor system 900 comprises wiring and components to provide a power distribution network for supplying global power to each local power converter layer 922, and to implement the wiring of a global electrical bus that is edge-connected to the local interconnect and I/O wiring layers of the multichip systems 920.
In addition, in the 3-D computer processor system 900 of Figure 39, the multichip systems 920 are edge-mounted to the substrate 910 with spaces 932 disposed between adjacent multichip systems. A capping layer 930 is connected to the upper surfaces of the multiple layers of the multichip systems 920 to provide mechanical stability, and to provide enclosed cavities defined by the spaces 932, through which pressurized air or a cooling medium can flow to provide cooling for the multichip systems 920.
As discussed below with reference to Figures 40, 41, 42, 43, 44 and 45, in other embodiments of the invention, a 3-D computer processor system is constructed as a joined stack of chips with multiple layers, wherein at least one chip layer has circuitry for scan testing of the functional circuitry of the other layers (e.g., processor core layers, memory layers, etc.), and which supports dynamic checkpointing, fast context switching, and fast recovery of system state. With the current state of the art in semiconductor technology, large-scale integrated circuits are typically fabricated using DFT (design-for-test) techniques, wherein an integrated circuit is designed with scan test circuitry that is used to test for internal fault conditions of the integrated circuit during chip fabrication. Scan test circuitry typically comprises scan chains and/or scan loops, which are formed by serially linking together multiple scan cells, wherein the scan chains and/or scan loops access the internal node states of the integrated circuit. Scan cells can be realized using latches or flip-flops (e.g., scannable flip-flops, such as scan-enabled D flip-flops).
In general, during a test process, scan chains and/or scan loops are used to set up, and read back, specific states in various blocks of the integrated circuit under test, the purpose of the functional testing being to determine whether certain portions of the integrated circuit design operate correctly. A scan cell (e.g., a scannable flip-flop) is configured to select one of two inputs — a data input (D) and a scan input (SI). During a scan shift phase, the scan inputs (SI) of the scan cells are enabled so that a test pattern is shifted into the scan inputs of the scan cells, and the test pattern is applied to the inputs of the combinational logic blocks of the integrated circuit; the scan cells of a given scan chain are thereby configured to form a serial shift register. Following the shift phase, a scan capture phase is performed by enabling the data (D) inputs of the scan cells, to capture the data output from the combinational logic blocks in response to the test pattern. Thereafter, the scan inputs (SI) of the scan cells are enabled again to shift out the output data captured by the scan cells. In this regard, the scan testing of an integrated circuit is performed in two repeating phases — namely, a scan shift phase, in which the scan cells of the scan chain are configured as a serial shift register to shift each input and output scan pattern in and out, and a scan capture phase, in which the scan cells of the scan chain capture the data output from the combinational logic blocks of the integrated circuit. The captured data is shifted out and compared with expected patterns to determine whether the combinational logic blocks operate as expected.
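The two-phase shift/capture protocol described above can be modeled behaviorally in software: with scan-enable asserted the cells act as one serial shift register, and with it deasserted each cell captures the output of the combinational logic it observes. The 4-bit chain and the toy combinational block (a bitwise NOT) below are invented for illustration; they are not circuits from this specification.

```python
# Behavioral sketch of the two-phase scan test: a 4-bit scan chain wrapped
# around a hypothetical combinational block that inverts each bit.

def shift_phase(chain, pattern):
    """SE=1: cells form a serial shift register. Shift a pattern in while the
    previous contents shift out. Returns (new chain state, bits shifted out)."""
    out = []
    for bit in pattern:
        out.append(chain[-1])          # last cell's value exits the chain
        chain = [bit] + chain[:-1]     # remaining bits move down one cell
    return chain, out

def capture_phase(chain):
    """SE=0: each cell captures the combinational logic's response (here, NOT)."""
    return [1 - b for b in chain]

chain = [0, 0, 0, 0]
chain, _ = shift_phase(chain, [1, 0, 1, 1])        # phase 1: shift pattern in
captured = capture_phase(chain)                    # phase 2: capture responses
_, observed = shift_phase(captured, [0, 0, 0, 0])  # shift captured data out
assert captured == [0, 0, 1, 0]                 # NOT of the shifted-in bits
assert observed == list(reversed(captured))     # data emerges last-cell first
```

Note that the shift-out of one capture can overlap with the shift-in of the next pattern, which is why the two phases repeat in alternation during a full test.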
A scan chain typically comprises a very long bit sequence. As such, inputting a complete set of scan test patterns to a chip, and outputting the scan results from the chip, requires a significant amount of time, which limits the speed at which portions of the integrated circuit can be tested. One way to avoid such scan chain I/O limitations is to build integrated circuits with built-in self-test (BIST) modules that can rapidly generate, run, and check test patterns within the integrated circuit itself. However, BIST modules can occupy a relatively large area on the chip, which serves no purpose during normal operation of the chip. Moreover, to implement scan chains, the integrated circuit needs to include additional wiring channels and additional latches/flip-flops for connecting the latches/flip-flops of the chip into scan chains, as well as additional logic to support the scan test operations. Connecting the scan cells to form scan chains, providing I/O routing for the scan chain bits, and providing the additional wiring required for routing scan chain clock signals can occupy significant routing resources of the chip, and thereby cause undue growth in chip area consumption and circuit path delay.
Embodiments of the invention include 3-D processing systems having multiple chip layers joined in a stacked structure, wherein one or more test chip layers are fabricated to specifically or primarily include test structures, such as BIST modules, scan wiring, test I/O wiring, and scan control functions and logic circuitry, to support and perform scan testing of the functional circuitry of one or more other chip layers (e.g., processor layers, memory layers, other functional chip layers, etc.). In one embodiment, the test chip layer is a permanent fixture that is included in the 3-D semiconductor product sold to the customer. In another embodiment, the test chip layer is a temporary component that is used to test the functional circuitry of the other chip layers of the 3-D semiconductor device, and is removed before the final product is sold to the customer. As will be described in further detail below, in other embodiments in which the test layer permanently remains a part of the final product, the test layer can be constructed to further include control circuitry to capture state data from one or more functional chip layers, and to store the state data of the one or more functional chip layers, so as to provide system state checkpointing and application context switching functions.
Figure 40 schematically illustrates a 3-D computing system having at least one test layer, according to an embodiment of the invention, wherein the test layer has circuitry for scan testing of a functional layer and for system state checkpointing. In particular, Figure 40 is a schematic side view of a semiconductor device 1000 comprising a first chip 1002 and a second chip 1004 that are physically joined by an array of interconnects 1006 (e.g., solder balls) to form a stacked structure. The first chip 1002 includes functional circuitry formed on a front-side (active) surface 1002A of the first chip 1002. Depending on the chip type, the type of functional circuitry will vary (e.g., processor cores, memory arrays, etc.). In the embodiment shown in Figure 40, the first chip 1002 is a processor chip having one or more processor cores. In other embodiments, the first chip 1002 can be a memory chip or another type of functional chip having functional circuitry for a given application. Regardless of the chip type, the functional circuitry of the first chip 1002 will include multiple scan cells with memory elements (such as scannable flip-flops and latches).
In one embodiment of the invention, the second chip 1004 can be a test layer having scan chain configuration and scan test circuitry (a test architecture) and a test input/output (I/O) interface 1004A. The scan cells of the functional circuitry of the first chip 1002 are connected, via the scan test I/O interface 1004A, to the scan test circuitry on the second chip 1004. The scan test I/O interface 1004A comprises a wide array or arrangement of I/O pads disposed over an extensive area of the active surface of the second chip 1004. As described in more detail below, the scan test circuitry on the second chip 1004 operates to dynamically configure the electrical connections between the scan cells on the first chip 1002, so as to form scan chains or scan loops for testing portions of the functional circuitry on the first chip 1002.
In another embodiment of the invention, the second chip 1004 has, for example, system state capture and restore control circuitry and other supporting circuitry, to capture system state data from the functional circuitry of the first chip 1002 and to restore a desired system state of the functional circuitry of the first chip 1002, thereby providing a system state retrieval layer for system state checkpointing and application context switching functions. In this embodiment, the functional circuitry will have multiple memory elements, such as registers and caches, and typically other elements that store data representing the current system state of the functional circuitry. These memory elements on the first chip 1002 are connected, via a system state I/O interface 1004B, to the system state capture and restore control circuitry on the second chip 1004. The system state I/O interface 1004B comprises a wide array or arrangement of I/O pads disposed over an extensive area of the active surface of the second chip 1004.
For purposes of illustration, the test I/O interface 1004A and the system state I/O interface 1004B are shown in Figure 40 as separate elements because, in one embodiment of the invention, the I/O pads and point wiring structures of the test and system state I/O interfaces 1004A and 1004B are logically separate from one another and form independent interfaces. However, the I/O pads and point wiring structures of the test and system state I/O interfaces 1004A and 1004B can be interspersed or interleaved with one another, such that the test and system state I/O interfaces 1004A and 1004B of the second chip 1004 span an extensive area of the active surface 1002A of the first chip 1002, so as to minimize the interconnect lengths between the control circuitry on the second chip 1004 and the functional circuitry on the first chip 1002.
Figure 41 schematically illustrates a framework of test layer circuitry for scan testing of functional layers in a 3-D processing system and for system state checkpointing, according to an embodiment of the invention. In particular, Figure 41 depicts an embodiment of a test chip 1100 comprising circuitry to support scan chain configuration and testing, and system state capture and restore, according to an embodiment of the invention. The test chip 1100 of Figure 41 illustrates one embodiment of the second chip 1004 of Figure 40. As shown in Figure 41, the test chip 1100 comprises a test I/O interface 1004A, a system state I/O interface 1004B, checkpointing control circuitry 1010, context switch control circuitry 1012, a memory 1014, scan chain configuration circuitry 1016, scan chain configuration and test control circuitry 1022, a scan chain output multiplexer 1028, an output register 1030, a test I/O controller 1032, and a test interface 1034. The memory 1014 can be volatile memory or non-volatile memory, or the test layer 1100 can include both non-volatile and volatile memory, depending on the application. The scan chain configuration circuitry 1016 comprises demultiplexer circuitry 1018 and multiplexer circuitry 1020. The scan chain configuration and test control circuitry 1022 comprises a BIST module 1024 and a test clock generator 1026.
The various components 1016, 1022, 1028, 1030, 1032 and 1034 of the test chip 1100 support scan test functions, which will be described in further detail below with reference to Figure 44. Briefly, the scan test circuitry 1016 and 1022 operates to dynamically configure the electrical connections between the scan cells of the functional circuitry on a given functional chip layer, so as to form scan chains or scan loops for testing portions of the functional circuitry. As will be described in further detail with reference to Figure 44, the data outputs of the scan cells on the functional chip layer are connected, through the test I/O interface 1004A, to the inputs of the demultiplexer circuitry 1018; the outputs of the demultiplexer circuitry 1018 are connected to the inputs of the multiplexer circuitry 1020; and the outputs of the multiplexer circuitry 1020 are connected, through the test I/O interface 1004A, to the scan inputs of the scan cells on the functional chip layer. The scan chain configuration and test control circuitry 1022 generates control signals to selectively control the demultiplexer circuitry 1018 and the multiplexer circuitry 1020, so as to dynamically configure the electrical connections between scan cell outputs and scan cell inputs, via an electrical interconnect network that is dynamically formed on the test chip 1100 by the scan chain configuration circuitry 1016.
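At a high level, the demultiplexer/multiplexer fabric just described behaves like a programmable routing table that decides which scan cell's output feeds which scan cell's scan input. The sketch below models that view; the cell names and chain orders are invented for illustration and do not correspond to reference numerals in this specification.

```python
# Hedged sketch: dynamic scan chain configuration modeled as a routing table
# linking scan cell outputs to scan cell inputs. Cell names are hypothetical.

def configure_chain(cells):
    """Build a routing table that links the listed cells into one serial chain."""
    return {cells[i]: cells[i + 1] for i in range(len(cells) - 1)}

def trace_chain(routing, start):
    """Follow the configured connections from the chain's first cell."""
    order, cell = [start], start
    while cell in routing:
        cell = routing[cell]
        order.append(cell)
    return order

# Configuration A: a chain over the cells covering circuit blocks of interest.
route_a = configure_chain(["c1", "c3", "c4"])
assert trace_chain(route_a, "c1") == ["c1", "c3", "c4"]

# Configuration B: the same fabric re-routed over different cells — the
# "dynamic" aspect of scan chain configuration on the test layer.
route_b = configure_chain(["c2", "c5"])
assert trace_chain(route_b, "c2") == ["c2", "c5"]
```

The practical benefit modeled here is that the functional layer needs no fixed chain topology: the test layer can stitch different subsets of scan cells into different chains for different test runs.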
The BIST module 1024 implements standard functions and control circuitry to generate and apply test patterns, which are scanned into the scan input ports of the scan cells of the functional chip layer connected to the test I/O interface 1004A. The test clock generator 1026 generates test clock signals of desired test frequencies, which are used to operate the scan cells of the functional chip layer so as to perform scan testing at the test clock frequencies. The scan chain output multiplexer 1028 selects one of the scan chain outputs of the multiple scan chains on the first chip 1002, and the output of the selected scan chain is stored in the output register 1030. The scan chain output multiplexer 1028 is gated by selection control signals generated by the test I/O controller 1032. An external test agent provides various test control signals and test patterns through the test interface 1034, which are processed by the test I/O controller 1032 and delivered to the scan chain configuration and test control circuitry 1022, so as to implement scan test operations via the external control signals and test patterns output from the test I/O controller 1032. Scan test signals and test pattern data are input to the test I/O controller 1032 via the test interface 1034. Via the test interface 1034, the test I/O controller 1032 accesses the scan chain output data stored in the register 1030, and this data is output to an external test system.
As will be described in further detail below with reference to Figure 45, the components 1010, 1012 and 1014 of the test chip 1100 support system state capture and restore functions. Briefly, the checkpointing control circuitry 1010 is used to perform dynamic checkpointing of processes running on the functional chip layers. As noted above, the functional circuitry on a functional chip layer will have multiple memory elements, such as registers and caches, and typically other elements that store data representing the current system state of the functional circuitry. In one embodiment, the checkpointing control circuitry 1010 automatically and periodically backs up (captures and stores) the entire microarchitectural state of the functional circuitry within a small number of cycles, without polluting any caches or other state-holding structures. By way of specific example, the checkpointing process can be periodic, or can be initiated by specific events programmed into the logic of the checkpointing control circuitry 1010. In other embodiments, a checkpoint can be initiated by the actual process being checkpointed. For a process to initiate a checkpoint, new instructions are added to the instruction set for initiating such events. In this embodiment, the checkpointing control circuitry 1010 activates a checkpoint (an architectural state store or retrieve function) in response to an instruction received from the functional circuitry of a given functional chip in the 3-D processing system.
The memory 1014 can be used to store copies of the microarchitectural state captured at multiple points in time. The captured state can be used for multiple purposes. For example, when a recoverable error occurs, the entire microarchitectural state can be rewritten from a copy stored in the memory within a small number of cycles. In effect, when an error is found while a process is running, the system can be restored to a "known good" state (a checkpoint), and the process can be re-run from that checkpoint. Of course, given sufficient storage in the memory 1014, it is feasible to capture and store multiple checkpoints of a given process in time sequence, and/or to capture and store multiple checkpoints of the different threads that may be running on the functional chip layers. In addition, when a critical event (such as a power failure) occurs, a checkpoint of important information can be captured and stored in the memory 1014 immediately. These checkpoints can be extracted instantaneously upon shutdown, which allows for a more robust recovery. For example, the current state at the time of a power failure can be captured and then, upon power-up, rapidly transferred to the given functional system via the high bandwidth and short electrical interconnects provided by the system state I/O interface 1004B.
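The checkpoint-and-rollback behavior described above can be sketched as a bounded store of state snapshots: periodic captures accumulate, and on error the most recent "known good" copy is restored. The capacity of three snapshots and the register names below are invented for illustration; this is a software analogy, not the control circuitry 1010 itself.

```python
# Illustrative sketch of periodic checkpointing with rollback, assuming the
# checkpoint store holds a bounded number of snapshots (bound of 3 assumed).
from collections import deque
import copy

class CheckpointStore:
    def __init__(self, capacity=3):
        self.snapshots = deque(maxlen=capacity)  # oldest snapshots age out

    def capture(self, state):
        """Store a deep copy so later mutation can't corrupt the checkpoint."""
        self.snapshots.append(copy.deepcopy(state))

    def restore_latest(self):
        """Roll back to the most recent 'known good' state."""
        return copy.deepcopy(self.snapshots[-1])

store = CheckpointStore()
state = {"pc": 0x100, "regs": [0, 0]}
store.capture(state)                   # periodic checkpoint

state["pc"] = 0x104
state["regs"][0] = 7                   # ...process keeps running...
store.capture(state)                   # next periodic checkpoint

state["pc"] = 0xDEAD                   # an error corrupts the running state
state = store.restore_latest()         # recover from the last checkpoint
assert state == {"pc": 0x104, "regs": [7, 0]}
```

Keeping several snapshots in time sequence, as the paragraph above notes, allows rollback past the most recent checkpoint if that one also predates the fault.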
In other embodiments, the memory 1014 can store chip-specific information regarding known (static) problems of one or more functional chip layers of the 3-D processing system. For example, if a particular portion of the functional circuitry of a given functional chip is known to work improperly, this information can be maintained in the memory 1014, such that when the functional chip is used in the future, the scan chain configuration and test control circuitry 1022 will know not to configure that (known) non-working portion of the functional circuitry. In addition, the memory 1014 can be used to store test programs and test patterns that are used by the scan chain configuration and test control circuitry 1022 to implement the scan test functions. As noted above, depending on the application, the memory 1014 can be volatile memory or non-volatile memory, or the test layer can implement both volatile and non-volatile memory. For example, for applications that are not concerned with recovery from serious faults, but simply perform functions to enable context switching or recovery from less severe faults, the memory 1014 may be implemented as volatile memory.
In addition, the context switch control circuitry 1012 is used to perform application context switching, wherein the microarchitecture of a given functional layer can be switched back and forth between the contexts of different applications without suffering the cost of cache pollution and re-execution of setup code. The context switch control circuitry 1012 operates to capture the current system state for an application context switch, and to store the captured state in the memory 1014. For example, when system state is captured for an application context switch, the current context of a given application, as represented by the current data stored in the various caches of the functional chip layer, can be captured and stored in the memory 1014 under the operation of the context switch control circuitry 1012. This allows a new application context to start more quickly, because saving the original context is accomplished automatically. Moreover, the test layer can have the ability to store a context in the place in the system itself where that context is held, while the system runs a process in parallel with the new context. In essence, the test layer extracts a "checkpoint" of the interrupted process, and the checkpoint data is stored as a low-priority batch job that can run in parallel with the newly started process. The ability to perform low-penalty context switches enables many optimizations in a multiprogramming environment that would otherwise be too expensive in conventional systems.
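The save-and-resume cycle of a low-penalty context switch can be sketched as follows: the running application's state is checkpointed into the state store, and the next application's previously saved context (or a fresh one) is restored. The application names, register contents, and the dictionary standing in for the memory 1014 are all invented for illustration under stated assumptions.

```python
# Hedged sketch of low-penalty application context switching: save the
# current context, restore the next one. Contexts shown are hypothetical.
import copy

saved_contexts = {}   # stands in for context storage in the state memory

def switch_context(current_app, current_state, next_app):
    """Save the running context, then resume the next app's saved context."""
    saved_contexts[current_app] = copy.deepcopy(current_state)
    # A brand-new application starts from an empty context.
    return copy.deepcopy(saved_contexts.get(next_app, {"pc": 0, "regs": []}))

state_a = {"pc": 0x2000, "regs": [1, 2, 3]}
state_b = switch_context("app_a", state_a, "app_b")   # app_b starts fresh
state_b["pc"] = 0x3004                                # app_b runs for a while
state_a_resumed = switch_context("app_b", state_b, "app_a")
assert state_a_resumed == {"pc": 0x2000, "regs": [1, 2, 3]}  # app_a intact
assert saved_contexts["app_b"]["pc"] == 0x3004
```

Because the save is handled by dedicated circuitry on the test layer rather than by the processor itself, the new context can begin executing while the old context is still being written out, which is the "low-priority batch job" behavior described above.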
In other embodiments of the invention, because the test layer can be fabricated configurably and can include programmable memory, and because the test layer can be fabricated with connections to functional chip layers at known physical positions (physical connections to the other parts of the stack), a universal test layer can be fabricated that can be used with chips of many different functionalities. That is, by defining the physical contacts between the universal test layer and a functional layer, any functional layer can be built to conform to those predefined contacts. In other words, the test layer can be built with a standard I/O interface (both physical and logical), which enables reuse of the test chip for testing multiple different functional chips. Furthermore, in another embodiment, a functional layer can also have a (smaller) test structure formed thereon, which can be driven by the test layer. This can not only be "easier" for some functional systems, but can also address the following situation: a given functional layer includes proprietary third-party structures that a standard, universal test layer does not test. Indeed, if these structures are proprietary, the third party may not wish to reveal their content, but would instead test them by running its own self-tests.
In other embodiments of the invention, a 3-D processing system can be implemented with two or more functional layers and/or two or more dedicated test layers. For example, Figure 42 schematically illustrates a 3-D processing system according to another embodiment of the invention, having at least one test layer with circuitry for scan testing of multiple functional layers and for system state checkpointing. In particular, Figure 42 is a schematic side view of a semiconductor device 1200 comprising a first functional chip 1202 (having a front-side (active) surface 1202A), a second functional chip 1204 (having a front-side (active) surface 1204A), and a test chip 1206 (having a test I/O interface 1206A and a system state I/O interface 1206B). The second functional chip 1204 is physically joined to the test chip via an array of interconnects 1208 (e.g., solder balls), and the first and second functional chips 1202 and 1204 are mounted face-to-back to form the stacked structure. In the embodiment of Figure 42, the test chip 1206 implements separate dedicated circuitry and functions to test the functional chips 1202 and 1204. In this embodiment, the test I/O interface 1206A and the system state I/O interface 1206B are connected to the functional circuitry on the front (active) side 1202A via vertical connections that run through the second functional chip 1204.
Figure 43 schematically illustrates a 3-D computer processor system according to another embodiment of the invention, having multiple test layers with circuitry for scan testing of multiple functional layers and for system state checkpointing. In particular, Figure 43 is a schematic side view of a semiconductor device 1300 comprising a first functional chip 1302 (having a front-side (active) surface 1302A), a first test chip 1304 (having a test I/O interface 1304A and a system state I/O interface 1304B), a second functional chip 1306 (having a front-side (active) surface 1306A), and a second test chip 1308 (having a test I/O interface 1308A and a system state I/O interface 1308B). The first functional chip 1302 is physically joined to the first test chip 1304 via an array of interconnects 1310 (e.g., solder balls); the second functional chip 1306 is physically joined to the second test chip 1308 via an array of interconnects 1312 (e.g., solder balls).
In the embodiment of Figure 43, each of the test chips 1304 and 1308 comprises independent, dedicated circuitry for scan testing and/or system state save/restore of the corresponding one of the functional chips 1302 and 1306. Although the functional chips 1302 and 1306 are not directly joined to each other, the test chip 1304 can be made very thin, so that the direct electrical connections running through the test chip 1304 between the two functional chips 1302 and 1306 (e.g., processor core layers) can be relatively short, enabling high-speed communication between the functional chips 1302 and 1306, for example using the various interconnect techniques described above with reference to Figures 14 and 15. Using known techniques, the backside of the first test chip 1304 is connected to the backside of the second functional chip 1306 to bond the chips together and to provide I/O pad connections to electrical wiring (e.g., through-silicon vias) formed through the first test chip 1304 and the second functional chip 1306.
It is to be understood that, although Figures 40, 42 and 43 show the functional chip layers as processor chips, the functional chip layers may be other types of chips, such as memory chips and other types of functional chips that may be included in a 3-D processing system for a given application. Moreover, although Figure 41 depicts a test chip comprising circuitry to support test, checkpointing and context switch control functions, in other embodiments of the invention a test chip may comprise only scan test circuitry, only checkpointing or context switch control circuitry, or any combination of scan test, checkpointing and context switch control circuitry.
Figure 44 schematically illustrates circuitry of a test layer and a functional layer of a 3-D processing system according to an embodiment of the invention. In particular, Figure 44 generally depicts a functional layer 1400 comprising functional circuitry 1402, which includes a plurality of scan units 1404, 1406, 1408, 1410 and 1412 dispersed among a plurality of circuit blocks 1414, 1416, 1418 and 1420 of the functional circuitry 1402 that can be scan tested. In one embodiment of the invention as shown in Figure 44, each scan unit 1404, 1406, 1408, 1410 and 1412 is a scan-type D flip-flop comprising a data (D) input port, a scan input (SI) port, a data (Q) output port, a clock (CLK) input port, and a scan enable (SE) control port. As further shown in Figure 44, a test layer 1422 comprises multiplexer/demultiplexer circuitry 1424, which includes a plurality of multiplexers M1, M2, M3 and a plurality of demultiplexers D1, D2 and D3. The test layer 1422 further comprises a scan chain configuration and test control circuit 1426, a scan chain output multiplexer 1428, an output register 1430, a test I/O controller 1432 and an interface 1434, having the same or similar functions as the corresponding components described above with reference to Figure 41.
The test layer 1422 further comprises a scan enable signal controller 1436 and a test clock generator 1438, which operate under the control of the scan chain configuration and test control circuit 1426. The scan enable signal controller 1436 operates under control of the scan chain configuration and test control circuit 1426 to generate a scan enable signal, which is sent through the scan test I/O interface of the test layer 1422 to the scan enable (SE) input ports of the scan units in the functional layer 1400. In addition, the test clock generator 1438 generates a test clock, which is input to the clock (CLK) ports of the scan units so that scan testing can be performed at a desired test frequency (which is different from the frequency of the normal functional-mode clock signal).
In the example embodiment shown in Figure 44, the scan chain configuration and test control circuit 1426 generates control signals to control the multiplexers M1, M2, M3 and the demultiplexers D1, D2, D3 so as to dynamically configure a network connecting the scan units 1404, 1406, 1408, 1410 and 1412, thereby forming scan chains and/or scan loops for testing the functional logic 1414, 1416, 1418 and 1420. In particular, as shown in Figure 44, each demultiplexer D1, D2 and D3 has an input connected (through the scan test I/O interface of the test layer 1422) to the output of a scan unit in the functional layer 1400, and two or more outputs each connected to a different input of one of the multiplexers M1, M2, M3. Likewise, each multiplexer M1, M2, M3 has an output connected (through the scan test I/O interface of the test layer 1422) to the input of a scan unit in the functional layer 1400, and two or more inputs connected to the outputs of different demultiplexers D1, D2 and D3. The scan chain configuration and test control circuit 1426 generates control signals to control the demultiplexers and multiplexers so as to dynamically configure the electrical interconnections between the outputs and inputs of the scan units on the functional layer, thereby forming scan chains and/or scan loops.
For example, as shown in Figure 44, the data output ports (Q) of scan units 1404, 1406 and 1408 are connected to the inputs of demultiplexers D1, D2 and D3, respectively. In addition, the scan input ports (SI) of scan units 1406, 1408 and 1410 are connected to the outputs of multiplexers M1, M2 and M3, respectively. In this embodiment, by selectively controlling the multiplexer/demultiplexer circuitry 1424 (via control signals output from the scan chain configuration and test control circuit 1426), the output of a scan unit can be connected to the scan input of a different scan unit to create different scan chains and/or scan loops. For example, by selecting the output of demultiplexer D1 that is connected to the input of the desired multiplexer M1, M2 or M3, and correspondingly controlling that multiplexer M1, M2 or M3, the data output (Q) of scan unit 1404 (which is connected to the input of demultiplexer D1) can be routed to the scan input of one of the scan units 1406, 1408 or 1410.
It is to be understood that not every scan unit output (Q) in the functional circuitry 1402 of Figure 44 needs to be connected to the input of a demultiplexer, and not every scan unit input (SI) needs to be connected to the output of a multiplexer. Indeed, as shown in Figure 44 with regard to scan units 1410 and 1412, a series of two or more scan units can be connected to one another to form a scan unit segment similar to a conventional scan chain (e.g., the data output (Q) of one scan unit is connected to the scan input (SI) of another scan unit). In such an embodiment, each scan unit segment can begin at a multiplexer and end at a demultiplexer (i.e., the output of a multiplexer is connected to the scan input (SI) of the first scan unit of a given segment, and the data output (Q) of the last scan unit of the segment is connected to the input of a demultiplexer).
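The routing just described can be sketched in software. The following Python model is purely illustrative (the class, unit and chain names are invented, and the demultiplexer/multiplexer select logic is abstracted into a chain-ordering choice made by the test layer); it only shows how the same scan flip-flops can be stitched into different scan chains, with a bit advancing one chain position per test clock.

```python
# Illustrative sketch of the Figure 44 scan network (names invented; the
# mux/demux select signals are abstracted into an ordering choice).

class ScanUnit:
    """A scan-type D flip-flop reduced to its scan path (SI -> Q)."""
    def __init__(self, name):
        self.name = name
        self.state = 0                  # bit currently latched in the flop

    def shift(self, si_bit):
        """One test clock in scan mode: latch SI, expose the old state on Q."""
        q_bit, self.state = self.state, si_bit
        return q_bit

def build_chain(units, order):
    """Model one mux/demux configuration: `order` lists which units are
    stitched Q -> SI, playing the role of the demux/mux select signals."""
    return [units[name] for name in order]

def shift_pattern(chain, pattern):
    """Shift a test pattern through the chain, one bit per test clock.
    Walking the chain in order while each flop returns its *old* state
    reproduces the simultaneous shift of a real scan chain."""
    out = []
    for bit in pattern:
        for unit in chain:
            bit = unit.shift(bit)       # Q of this unit feeds SI of the next
        out.append(bit)                 # the last unit's Q is the chain output
    return out

units = {n: ScanUnit(n) for n in ("U1404", "U1406", "U1408", "U1410")}
# Two different select configurations -> two chains over the same flops.
chain_a = build_chain(units, ["U1404", "U1406", "U1408", "U1410"])
chain_b = build_chain(units, ["U1404", "U1410"])   # shorter chain, skips two flops
```

With chain_b selected, a bit scanned into U1404 reaches U1410 in two test clocks instead of four, which is the kind of reconfiguration the mux/demux network on the test layer makes possible.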
This dynamic configurability enables a wide range of additional capabilities. The scan units of the functional layer 1400 can be configured into multiple scan chains that each connect the same bits but in different orders, which enables a rich variety of test procedures that help reduce test time or increase the number of tests that can be run in a given period of time. For example, if two functions to be tested back-to-back require bits that lie at different distances along a given scan chain, it is possible that each bit lies at a shorter distance along a different scan chain (among the numerous scan chains that can be created in the functional layer 1400 by dynamically controlling the multiplexer/demultiplexer circuitry 1424 on the test layer 1422). This allows the scan operations to complete in less time. Because the test layer 1422 is sparsely populated with control circuitry (relative to the circuitry and wiring in the functional layer 1400), there is ample room on the test layer 1422 for a large network of multiplexer/demultiplexer circuitry 1424 for configuring scan chains or scan loops in numerous different ways, and for configuring multiple different domains of scan loops (where any particular scan unit in the functional layer 1400 can belong to more than one domain). This facilitates very specific scan tests, and allows configurations that are more "efficient" for a particular test without having to be "comprehensive." By "efficient," we mean that the function under test can be configured to allow shorter and more thorough testing. By "not comprehensive," we mean that, in any specific test, parts of the circuitry may be left untested, because it is known that those parts will be tested comprehensively (and efficiently) by different tests with different test wrappers or test chains and/or configurations. This stands in sharp contrast to conventional scan test techniques, in which scan chains and scan loops are not flexibly configurable.
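To make the timing argument above concrete, here is a small sketch (bit names and orderings invented; chains are modeled as position lists, not real wiring) of why having multiple chain orderings over the same bits shortens scan-out:

```python
# The clocks needed to scan out a set of bits is set by the deepest needed
# bit in whichever chain ordering the test layer selects.

def scanout_clocks(chain_order, needed_bits):
    """Clocks to shift every needed bit to the chain output.
    Position 0 is the flop nearest the output; a flop at position p
    needs p + 1 shift clocks to reach the output."""
    return max(chain_order.index(b) for b in needed_bits) + 1

# Two orderings of the same four bits (two mux/demux configurations).
order_x = ["B3", "B2", "B1", "B0"]   # B0 and B1 sit deep in this chain
order_y = ["B0", "B1", "B2", "B3"]   # B0 and B1 sit right at the output

needed = {"B0", "B1"}
best = min((order_x, order_y), key=lambda o: scanout_clocks(o, needed))
# order_x needs 4 clocks for these bits; order_y needs only 2.
```

Selecting order_y halves the scan-out time for this pair of bits, which is the effect described above of routing each needed bit onto a chain where it lies at a shorter distance.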
Figure 45 depicts a flow diagram of a method for saving and restoring system state in a 3-D processing system having at least one layer with circuitry for system state checkpointing and context switching of a functional layer, according to an embodiment of the invention. For purposes of illustration, the method of Figure 45 is described with reference to example modes of operation of the checkpointing control circuit 1010 and the context switch control circuit 1012 in the test layer 1100 of Figure 41. Figure 45 depicts a process for saving system state and a process for restoring system state, which run in parallel after system initialization. The initial step of both processes comprises system initialization (block 1500). Following system initialization, the process for saving system state is initialized, wherein the system enters a wait state awaiting a system state save trigger event (block 1502). In one embodiment of the invention, the system state save trigger event comprises the expiration of a time period that initiates a checkpointing operation. In another embodiment, the system state save trigger event comprises a context switch event that triggers a transition between contexts of different applications being executed by the functional circuitry of a functional chip.
Whether a checkpointing or a context switch operation is initiated, in response to the occurrence of a system state save trigger event, the checkpointing or context switch control circuit (1010 or 1012 of Figure 41) operates to capture state data representing the current system state of the functional circuitry on the functional chip (e.g., a processor or memory chip) (block 1504), and the captured state data is transferred to the test layer (block 1506). In one embodiment, a plurality of memory elements are present in the functional circuitry, including registers and caches that store data representing the current system state of the functional circuitry. The checkpointing control circuit 1010 or the context switch control circuit 1012 is connected, through the system state I/O interface, to access circuitry in the functional layer, and controls the transfer of the captured system state data from the access circuitry in the functional layer to the test layer, where the state data is stored in the memory 1014 disposed on the test layer, or in some memory disposed on another layer separate from the test layer.
In addition, following system initialization, the process for restoring system state is initialized, wherein the system enters a wait state awaiting a state restore trigger event (block 1510). In one embodiment of the invention, for checkpointing applications, a state restore trigger event may be a power failure or a recoverable system error. For context switch applications, a state restore trigger event may be a context switch event that triggers a transition between contexts of different applications being executed by the functional circuitry in the functional layer. Upon receipt of a state restore trigger event (affirmative result in block 1510), the checkpointing control circuit 1010 or the context switch control circuit 1012 accesses a copy of the state data associated with the target system state from memory (block 1512). The state data is then transferred to the functional layer through the system state I/O interface of the test layer, under the control of the control circuitry on the test layer (block 1514). The target system state of the functional circuitry is then restored by storing the accessed copy of the state data into the target caches/registers of the functional layer (block 1516).
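The save/restore flow of blocks 1500-1516 can be summarized in a short illustrative model. Everything here is invented for illustration (the register names, the checkpoint tag, and the dictionary standing in for the memory 1014); it only shows the control sequence: snapshot on a save trigger, write-back on a restore trigger.

```python
# Minimal sketch of the Figure 45 checkpoint/restore control flow.

class FunctionalLayer:
    """Stand-in for the functional circuitry's registers and caches."""
    def __init__(self):
        self.registers = {"pc": 0, "sp": 0}
        self.cache = {}

class TestLayerController:
    """Stand-in for checkpoint circuit 1010 / context switch circuit 1012."""
    def __init__(self):
        self.state_store = {}   # plays the role of memory 1014 on the test layer

    def save_state(self, layer, tag):
        # Blocks 1504/1506: capture state data and move it to the test layer.
        self.state_store[tag] = {
            "registers": dict(layer.registers),
            "cache": dict(layer.cache),
        }

    def restore_state(self, layer, tag):
        # Blocks 1512-1516: fetch the stored copy and write it back.
        snap = self.state_store[tag]
        layer.registers = dict(snap["registers"])
        layer.cache = dict(snap["cache"])

layer = FunctionalLayer()
ctrl = TestLayerController()

layer.registers["pc"] = 0x100
ctrl.save_state(layer, tag="checkpoint-0")   # e.g. periodic checkpoint timer fires

layer.registers["pc"] = 0x2FF                # execution continues, then a
ctrl.restore_state(layer, tag="checkpoint-0")  # recoverable error triggers restore
```

After the restore trigger, the functional layer's registers hold the checkpointed values again, mirroring block 1516.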
Although example embodiments of the present invention have been described herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be made therein by one skilled in the art without departing from the scope of the appended claims.

Claims (34)

1. A memory structure, comprising:
a first level of memory comprising a first array of memory cells and a first access wiring structure having a first pattern of word lines and bit lines, wherein each memory cell in the first array of memory cells comprises a storage element and a first access device connected to the storage element and the first access wiring structure; and
a second level of memory comprising a second access wiring structure having a second pattern of word lines and bit lines, and a plurality of second access devices connected to the second access wiring structure, wherein the second access devices are connected to corresponding storage elements of the first level of memory;
wherein the first pattern of word lines and bit lines of the first access wiring structure is different from the second pattern of word lines and bit lines of the second access wiring structure.
2. The memory structure of claim 1, wherein the word lines of the first access wiring structure are arranged orthogonal to the word lines of the second access wiring structure.
3. The memory structure of claim 1, wherein the word lines of at least one of the first and second access wiring structures extend diagonally across the first array of memory cells.
4. The memory structure of claim 1, wherein the word lines of at least one of the first and second access wiring structures extend in a row-shifted pattern.
5. The memory structure of claim 1, wherein the first level of memory is formed on a first substrate and the second level of memory is formed on a second substrate, wherein the first and second substrates are stacked on one another, and wherein the first and second substrates comprise vertical through vias to connect the second access devices on the second level of memory to the corresponding storage elements on the first level of memory.
6. The memory structure of claim 5, wherein the first and second substrates are separate memory chips.
7. The memory structure of claim 5, wherein the first and second substrates are separate processor chips.
8. The memory structure of claim 1, wherein the first level of memory and the second level of memory are formed on a single substrate.
9. The memory structure of claim 8, wherein the single substrate is a memory chip.
10. The memory structure of claim 8, wherein the single substrate is a processor chip.
11. The memory structure of claim 1, further comprising:
a third level of memory comprising a second array of memory cells and a third access wiring structure having a third pattern of word lines and bit lines, wherein each memory cell in the second array of memory cells comprises a second storage element and a third access device connected to the second storage element and the third access wiring structure; and
a plurality of word lines connecting memory cells of the first and third levels of memory.
12. The memory structure of claim 11, wherein the third pattern of word lines and bit lines is the same as at least one of the first pattern and the second pattern of word lines and bit lines.
13. The memory structure of claim 11, wherein the first level of memory is formed on a first substrate, the second level of memory is formed on a second substrate, and the third level of memory is formed on a third substrate, and wherein the first, second and third substrates are stacked on one another.
14. The memory structure of claim 11, wherein the first level of memory, the second level of memory and the third level of memory are all formed on a single substrate.
15. The memory structure of claim 11, wherein the first level of memory and the second level of memory are formed on a first substrate, the third level of memory is formed on a second substrate, and the first and second substrates are stacked on one another.
16. The memory structure of claim 1, wherein the memory structure is a cache memory.
17. The memory structure of claim 1, wherein the memory structure is a main system memory.
18. A memory structure, comprising:
a first level of memory comprising a first array of memory cells and a first access wiring structure having a first pattern of word lines and bit lines, wherein each memory cell in the first array of memory cells comprises a first storage element and a first access device connected to the first storage element and the first access wiring structure;
a second level of memory comprising a second array of memory cells and a second access wiring structure having a second pattern of word lines and bit lines, wherein each memory cell in the second array of memory cells comprises a second storage element and a second access device connected to the second storage element and the second access wiring structure; and
a plurality of word lines connected to memory cells across the first and second levels of memory.
19. The memory structure of claim 18, wherein the first pattern of word lines and bit lines of the first access wiring structure is the same as the second pattern of word lines and bit lines of the second access wiring structure.
20. The memory structure of claim 18, wherein the first pattern of word lines and bit lines of the first access wiring structure is different from the second pattern of word lines and bit lines of the second access wiring structure.
21. The memory structure of claim 18, wherein the first level of memory is formed on a first substrate, the second level of memory is formed on a second substrate, and the first and second substrates are stacked on one another, and wherein the first and second substrates comprise vertical through vias, the vertical through vias forming the plurality of word lines connected to memory cells across the first and second levels of memory.
22. The memory structure of claim 21, wherein the first and second substrates are separate memory chips.
23. The memory structure of claim 21, wherein the first and second substrates are separate processor chips.
24. The memory structure of claim 18, wherein the first level of memory and the second level of memory are formed on a single substrate.
25. The memory structure of claim 24, wherein the single substrate is a memory chip.
26. The memory structure of claim 24, wherein the single substrate is a processor chip.
27. The memory structure of claim 18, wherein the memory structure is a cache memory.
28. The memory structure of claim 18, wherein the memory structure is a main system memory.
29. A method of accessing memory, comprising:
storing data in an array of memory cells;
accessing the data in the array of memory cells using a first pattern of access wiring connected to the memory cells; and
accessing the data in the array of memory cells using a second pattern of access wiring connected to the memory cells;
wherein the first and second patterns of access wiring are different.
30. The method of claim 29, wherein the array of memory cells is a 2-D array of memory cells.
31. The method of claim 29, wherein the array of memory cells is a 3-D array of memory cells.
32. The method of claim 31, wherein the first pattern of access wiring is disposed in a first plane of the 3-D array and the second pattern of access wiring is disposed in a second plane of the 3-D array, the second plane being different from the first plane.
33. The method of claim 32, wherein the first and second planes are parallel.
34. The method of claim 32, wherein the first and second planes are perpendicular.
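As an informal illustration of the claimed method (not part of the claims; the class and method names are invented), the following model shows one cell array served by two different access wiring patterns, here a row-oriented word line set and an orthogonal column-oriented word line set, so that a whole row or a whole column can be read with a single word-line selection:

```python
# Illustrative model of a cell array with two access wiring patterns.

class DualPatternMemory:
    def __init__(self, rows, cols):
        self.cells = [[0] * cols for _ in range(rows)]

    def write(self, r, c, value):
        self.cells[r][c] = value

    def read_row(self, r):
        """First wiring pattern: word line r selects every cell in row r."""
        return list(self.cells[r])

    def read_col(self, c):
        """Second, orthogonal pattern: word line c selects every cell in column c."""
        return [row[c] for row in self.cells]

mem = DualPatternMemory(rows=4, cols=4)
for i in range(4):
    mem.write(i, i, i + 1)       # put 1..4 on the diagonal
```

The same stored bits are reachable through either pattern; with only the row-oriented wiring, gathering a column would take one activation per row, whereas the second pattern delivers it in a single access, which is the benefit the different word line and bit line patterns provide.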
CN201480036153.XA 2013-06-26 2014-01-23 Memory architectures having wiring structures that enable different access patterns in multiple dimensions Active CN105359119B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US13/927,846 US9383411B2 (en) 2013-06-26 2013-06-26 Three-dimensional processing system having at least one layer with circuitry dedicated to scan testing and system state checkpointing of other system layers
US13/927,846 2013-06-26
PCT/US2014/012663 WO2014209433A1 (en) 2013-06-26 2014-01-23 Memory architectures having wiring structures that enable different access patterns in multiple dimensions

Publications (2)

Publication Number Publication Date
CN105359119A true CN105359119A (en) 2016-02-24
CN105359119B CN105359119B (en) 2017-11-17

Family

ID=52116922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201480036153.XA Active CN105359119B (en) Memory architectures having wiring structures that enable different access patterns in multiple dimensions

Country Status (3)

Country Link
US (2) US9383411B2 (en)
CN (1) CN105359119B (en)
WO (1) WO2014209433A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169404A (en) * 2016-03-07 2017-09-15 杭州海存信息技术有限公司 Distributed mode processor containing three-dimensional memory array
CN110955629A (en) * 2018-09-26 2020-04-03 广达电脑股份有限公司 Computing device
US11527523B2 (en) * 2018-12-10 2022-12-13 HangZhou HaiCun Information Technology Co., Ltd. Discrete three-dimensional processor
CN115933997A (en) * 2023-01-30 2023-04-07 南京芯驰半导体科技有限公司 Data access method, related device and storage medium

Families Citing this family (7)

Publication number Priority date Publication date Assignee Title
WO2013110273A1 (en) * 2012-01-27 2013-08-01 Kk-Electronic A/S Control system for power stacks in a power converter, power converter with such control system and wind turbine with such power converter
KR20140065678A (en) * 2012-11-20 2014-05-30 에스케이하이닉스 주식회사 Semiconductor apparatus and operating method for semiconductor apparatus using the same
JP6507887B2 (en) * 2015-07-01 2019-05-08 株式会社デンソー Drive unit
TWI651513B (en) * 2016-11-15 2019-02-21 財團法人工業技術研究院 Three dimensional measuring system and measuring method thereof
US11580059B2 (en) * 2019-07-31 2023-02-14 Marvell Asia Pte. Ltd. Multi-port memory architecture for a systolic array
US11862602B2 (en) * 2019-11-07 2024-01-02 Adeia Semiconductor Technologies Llc Scalable architecture for reduced cycles across SOC
EP4071593A4 (en) * 2021-02-26 2023-08-23 Beijing Vcore Technology Co.,Ltd. Stacked cache system based on sedram, and control method and cache device

Citations (6)

Publication number Priority date Publication date Assignee Title
US6336177B1 (en) * 1997-09-19 2002-01-01 Silicon Graphics, Inc. Method, system and computer program product for managing memory in a non-uniform memory access system
CN1920879A (en) * 2005-07-05 2007-02-28 英特尔公司 Identifying and accessing individual memory devices in a memory channel
US20080005634A1 (en) * 2006-06-29 2008-01-03 Grise Gary D Scan chain circuitry that enables scan testing at functional clock speed
US20080010570A1 (en) * 2006-06-20 2008-01-10 Canon Kabushiki Kaisha Semiconductor integrated circuit
CN101877248A (en) * 2009-05-01 2010-11-03 索尼公司 SIC (semiconductor integrated circuit), signal conditioning package and output data diffusion method
US7985989B2 (en) * 2005-08-31 2011-07-26 Macronix International Co., Ltd. Stacked bit line dual word line nonvolatile memory

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6336177B1 (en) * 1997-09-19 2002-01-01 Silicon Graphics, Inc. Method, system and computer program product for managing memory in a non-uniform memory access system
CN1920879A (en) * 2005-07-05 2007-02-28 英特尔公司 Identifying and accessing individual memory devices in a memory channel
US7985989B2 (en) * 2005-08-31 2011-07-26 Macronix International Co., Ltd. Stacked bit line dual word line nonvolatile memory
US20080010570A1 (en) * 2006-06-20 2008-01-10 Canon Kabushiki Kaisha Semiconductor integrated circuit
US20080005634A1 (en) * 2006-06-29 2008-01-03 Grise Gary D Scan chain circuitry that enables scan testing at functional clock speed
CN101877248A (en) * 2009-05-01 2010-11-03 Sony Corp. Semiconductor integrated circuit, information processing apparatus, and output data diffusion method

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169404A (en) * 2016-03-07 2017-09-15 HangZhou HaiCun Information Technology Co., Ltd. Distributed pattern processor containing a three-dimensional memory array
CN107301222A (en) * 2016-03-07 2017-10-27 HangZhou HaiCun Information Technology Co., Ltd. Big data memory with built-in data analysis function
CN107316014A (en) * 2016-03-07 2017-11-03 HangZhou HaiCun Information Technology Co., Ltd. Memory with built-in image recognition function
CN107317803A (en) * 2016-03-07 2017-11-03 HangZhou HaiCun Information Technology Co., Ltd. Processor for enhancing network security
CN107358254A (en) * 2016-03-07 2017-11-17 HangZhou HaiCun Information Technology Co., Ltd. Processor for image recognition
CN107357828A (en) * 2016-03-07 2017-11-17 HangZhou HaiCun Information Technology Co., Ltd. Memory with built-in speech recognition function
CN107393537A (en) * 2016-03-07 2017-11-24 HangZhou HaiCun Information Technology Co., Ltd. Processor for speech recognition
CN107392017A (en) * 2016-03-07 2017-11-24 HangZhou HaiCun Information Technology Co., Ltd. Memory with built-in virus detection function
CN110955629A (en) * 2018-09-26 2020-04-03 Quanta Computer Inc. Computing device
US11527523B2 (en) * 2018-12-10 2022-12-13 HangZhou HaiCun Information Technology Co., Ltd. Discrete three-dimensional processor
CN115933997A (en) * 2023-01-30 2023-04-07 Nanjing SemiDrive Technology Co., Ltd. Data access method, related device, and storage medium

Also Published As

Publication number Publication date
US20160209470A1 (en) 2016-07-21
CN105359119B (en) 2017-11-17
US20150006986A1 (en) 2015-01-01
WO2014209433A1 (en) 2014-12-31
US9696379B2 (en) 2017-07-04
US9383411B2 (en) 2016-07-05

Similar Documents

Publication Publication Date Title
CN105579979B (en) Three-dimensional processing system having multiple caches that can be partitioned, combined, and managed according to more than one set of rules and/or configurations
CN105359119A (en) Memory architectures having wiring structures that enable different access patterns in multiple dimensions
US9389876B2 (en) Three-dimensional processing system having independent calibration and statistical collection layer
US20220164294A1 (en) Cyber security and tamper detection techniques with a distributed processor memory chip
CN103514139B (en) Stacked multiprocessor structure and method for implementing reliable processor operation
Lowrie et al. Reconfigurable tree architectures using subtree oriented fault tolerance
US11656662B2 (en) Layered super-reticle computing : architectures and methods
CN103378076B (en) Semiconductor device, semiconductor package, and method of operating a computer
US9190118B2 (en) Memory architectures having wiring structures that enable different access patterns in multiple dimensions
Ozdemir et al. Quantifying and coping with parametric variations in 3D-stacked microarchitectures
CN104050141B (en) Computer processor system with local power and cooling layers and global interconnect
CN103377171B (en) Processor system, semiconductor package, and method for operating a computer processor
CN103377169A (en) Processor system and method for operating a computer processor
Ding et al. A mathematical programming method for constructing the shortest interconnection VLSI arrays
Li et al. Fault clustering technique for 3D memory BISR
Distante et al. Array partitioning: a methodology for reconfigurability and reconfiguration problems
Han et al. Effective Spare Line Allocation Built-in Redundancy Analysis With Base Common Spare for Yield Improvement of 3D Memory
Upadhyaya et al. Design of a Multi-Level Fault-Tolerant Mesh (MFTM) for High Reliability Applications

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant