WO2008100324A9 - Method for achieving very high bandwidth between the levels of a cache hierarchy in 3-dimensional structures, and a 3-dimensional structure resulting therefrom
Classifications
- H01L25/0657 — Stacked arrangements of devices
- H01L25/0652 — Assemblies with the devices arranged next to and on each other (mixed assemblies)
- H01L25/18 — Assemblies of devices of types provided for in two or more different subgroups of the same main group
- H01L2225/06513 — Bump or bump-like direct electrical connections between devices (e.g., flip-chip connection, solder bumps)
- H01L2225/06517 — Bump or bump-like direct electrical connections from device to substrate
- H01L2225/06541 — Conductive via connections through the device (e.g., vertical interconnects, through silicon via [TSV])
- H01L2225/06572 — Auxiliary carrier between devices, the carrier having an electrical connection structure
- H01L2225/06589 — Thermal management (e.g., cooling)
- H01L23/642 — Capacitive arrangements
- H01L2924/0002 — Not covered by any one of groups H01L24/00 and H01L2224/00
- H01L2924/15311 — Connection portion formed only on the surface of the substrate opposite to the die mounting surface, being a ball array (e.g., BGA)
Definitions
- the reason that bus widths tend to be much narrower than cache line sizes (e.g., 8 bytes instead of 128 bytes) is that planar wiring capability is limited. Because these busses tend to be long, they also tend to be wired in high-level (e.g., relatively thick) wire to minimize resistance so as to maximize speed. Wide busses (e.g., much wider than 8 bytes) would impose considerable blockages on the upper levels of metal, so they are generally not used.
- busses tend to be slower than processors (e.g., 1 Gigahertz instead of 2 Gigahertz) because they are long (e.g., 5-10 millimeters or more), since they connect large aerial structures (caches) in a plane.
- an exemplary feature of the present invention is to provide a method that maximizes the number of interconnections between the levels in a cache hierarchy so as to maximize the bus widths, and a cache hierarchy resulting from such a novel method.
- An exemplary feature of the present invention is to minimize the bus wire lengths between the levels in a cache hierarchy so as to maximize the bus speed and so as to minimize the energy required per transfer.
- a method for arranging bits within a cache hierarchy implemented on multiple physical planes, such that horizontal wiring distances in intra-level busses are minimized, includes physically partitioning each cache level into cache islands, each cache island including a subset of congruence classes. The partitioning is performed in correspondence across cache levels, such that a cache island containing specific congruence classes at one cache level is directly above or below the cache island containing the corresponding congruence classes of an adjacent cache level; i.e., each cache island is physically positioned directly over the corresponding cache islands of the other cache levels.
- a method of deploying computing infrastructure, in which recordable, computer-readable code is integrated into a computing system and combines with the computing system for arranging bits within a cache hierarchy implemented on multiple physical planes such that horizontal wiring distances in intra-level busses are minimized, includes physically partitioning each cache level into cache islands, each cache island including a subset of congruence classes, wherein the partitioning is performed in correspondence across cache levels such that the congruence classes within a cache island at one cache level correspond to the same congruence classes within a corresponding cache island at another cache level, and physically positioning each cache island directly over the corresponding cache islands of different cache levels.
- a computer-readable medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus performs a method for causing a computer to perform the exemplary methods described herein.
- a method is provided for the physical arrangement of the bits in cache hierarchies implemented in three (3) dimensions, so as to minimize the planar wiring required in the busses connecting the levels of the hierarchy. In this way, the data paths between the levels are primarily determined by the vias themselves. This leads to very short, (and hence fast) and low power busses.
- the bits within each level of a cache hierarchy are arranged such that transfers between levels in the hierarchy do not involve very much of the planar wiring in any level.
- the present invention restricts bit placement within the plane by grouping the bits in accordance with two (orthogonal) criteria such that the movement between cache levels is nearly completely vertical in a 3-dimensional structure.
- the present inventors have recognized that an emerging technology is the ability to make 3-dimensional structures by thinning chips, laminating them together, and interconnecting them with through-vias in an areal grid.
- such structures provide very wide interconnecting busses.
- the present invention provides for interconnecting multiple planes of memory together on a wide vertical bus to provide unprecedented amounts of bandwidth between adjacent levels in a cache hierarchy.
- the present invention provides a method for organizing the bits within the arrays of the adjacent levels so that minimal x-y wiring is required to transfer data.
- the wide bus can be almost entirely vertical, and also can be very short (e.g., a few hundred microns). This provides a bus that can run at much higher speed, and much lower power, than conventional 2-dimensional busses.
- Figure 1 exemplarily illustrates a cache hierarchy showing an L1.5 SRAM plane and an L2 including two planes of eDRAM;
- Figure 2 logically illustrates cache geometries for L1.5 and L2 caches, according to exemplary aspects of the present invention;
- Figure 3 exemplarily illustrates a bitline structure in which 32 bits in each of the eDRAM planes can drive the corresponding 8 bits on the SRAM plane;
- Figures 4A-4B exemplarily illustrate a multicore example of 8 processors, each with a private cache hierarchy and some intra-processor interconnection;
- Figure 5 illustrates an exemplary method according to the present invention; and
- Figure 6 illustrates another exemplary method according to the present invention.
- the L1.5 is taken to be 1 Megabyte, and the L2 is 8 Megabytes.
- these caches can be arranged so that they are in "aerial correspondence.” That is, the areas of the two caches can be made approximately the same. One of them can be placed on top of the other. This is preferred (e.g., required) so that when moving data between the two levels, large horizontal movements are not required. Preferably, substantially only vertical movements will be needed. Then, the bits of the two caches will be arranged so that the physically corresponding bits (between the two levels) are in identical x-y (planar) proximities.
- the capacity of the L2 is (in this case) 8X larger than the capacity of the L1.5.
- the invention preferably makes them aerially commensurate by using different circuit densities, and by using multiple planes of storage.
- preferably, the L1.5 is implemented in 6T SRAM, and the L2 is implemented in embedded DRAM (eDRAM).
- eDRAM offers a 4X density improvement over SRAM.
- facilitating an 8X capacity difference then requires two planes of eDRAM to implement the L2, corresponding to a single plane of SRAM for the L1.5.
- Figure 1 illustrates a cache hierarchy structure 100 showing an L1.5 SRAM plane 110 (having, for example, 1 MB of SRAM) and an L2 cache 120 comprising two planes of eDRAM 120A, 120B (each having 4 MB, for example).
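The plane count above follows from simple capacity arithmetic; the following sketch reproduces it (the constant names are illustrative, not from the disclosure):

```python
# Illustrative sizing of the stack (names are ours, not the patent's).
L1_5_BYTES = 1 << 20        # 1 MB L1.5, implemented in 6T SRAM
L2_BYTES = 8 << 20          # 8 MB L2, implemented in eDRAM
EDRAM_DENSITY_VS_SRAM = 4   # eDRAM stores ~4X more bits per unit area

capacity_ratio = L2_BYTES // L1_5_BYTES                  # 8X
edram_planes = capacity_ratio // EDRAM_DENSITY_VS_SRAM   # 8 / 4 = 2 planes
assert edram_planes == 2    # two eDRAM planes stack over one SRAM plane
```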
- Figure 2 logically illustrates exemplary cache geometries 210, 220 for the L1.5 and L2 caches, respectively.
- a combined L1.5/L2 directory is chosen to be constructed in SRAM and placed on the SRAM plane, although this choice is arbitrary and not essential to the present invention.
- a "combined" directory is really an L2 directory (having 32K entries), where each entry has an additional field to specify its status (MESI) in the L1.5, in the case that the corresponding data is resident in the L1.5. This was estimated to take about 1.6 square millimeters, assuming 8 bytes for a directory entry.
- on a directory search, if there is an L1.5 miss and an L2 hit, then the directory should identify the line in the L2 that is to be brought into the L1.5, and it should choose the location (set within congruence class) in the L1.5 into which it is to be placed. Therefore, it should select one of 32K lines in the L2, which it can do by transmitting as few as 15 bits up to the L2 plane, or as many as 32K select signals, depending on how many vias are desired for use, and where one wishes to place decoders. The choice here is arbitrary (with regard to teaching this invention), albeit important to the goals of the design. The point is that the number of vias used for this can be small.
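The via-count trade-off described above is a choice between an encoded and a fully decoded select; a sketch of the arithmetic (the disclosure fixes neither choice):

```python
import math

L2_LINES = 32 * 1024   # 32K lines in the L2

# Encoded select: send the line index in binary up to the L2 plane
# and decode it there.
encoded_vias = math.ceil(math.log2(L2_LINES))   # 15 signal vias

# Fully decoded select: one select wire per line, no decoder on the L2 plane.
decoded_vias = L2_LINES                          # 32K signal vias

assert encoded_vias == 15
assert decoded_vias == 32 * 1024
```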
- a first (and trivial) reason for this mapping limitation is that each bit in a cache line occupies a unique position. That is, a line is a contiguous and ordered group of 256 bytes, and each of those bytes is a contiguous and ordered group of 9 bits. Therefore, there are 2304 unique bit positions in a cache line.
- bit 0 of the L2 line is placed into bit position 0 in the L1.5 line,
- bit 1 of the L2 line is placed into bit position 1 in the L1.5 line, and so on.
- bit 0 (and all other bits) can only be placed into its corresponding bit position. It cannot be placed into any other bit position.
- the L1.5/L2 combination can be viewed as 512 (the number of L1.5 congruence classes) independent L1.5/L2 sub-caches, each in its own little island with its own bus between the L1.5 and L2.
- each sub-cache is an L1.5 congruence class and the set of L2 congruence classes that map to it. Therefore, the 1M L1.5 / 8M L2 aggregation can be thought of as 512 independent 2K L1.5 / 16K L2 aggregations, and each of those has 2304 independent bit positions.
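The sub-cache ("island") view can be sketched as an address mapping. This is an illustrative model only (the function and constants are ours), assuming 256-byte lines with the congruence-class index taken from the bits directly above the line offset:

```python
L1_5_CLASSES = 512   # number of L1.5 congruence classes
LINE_BYTES = 256     # cache line size

def l1_5_class(addr: int) -> int:
    """Congruence class selected by the address bits above the line offset."""
    return (addr // LINE_BYTES) % L1_5_CLASSES

# Two addresses in the same class belong to the same sub-cache island,
# so a miss transfer between levels never leaves that island's vertical bus.
assert l1_5_class(0x12300) == l1_5_class(0x92300)

# Each of the 512 sub-caches holds 2K of L1.5 and 16K of L2:
assert (1 << 20) // L1_5_CLASSES == 2 * 1024
assert (8 << 20) // L1_5_CLASSES == 16 * 1024
```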
- Figure 3 exemplarily illustrates a bitline structure 300 in which 32 bits in each of the eDRAM planes 310, 320 can drive the corresponding 8 bits on the SRAM plane 330.
- the data path is shown as having independent signals for each eDRAM plane. For a fixed number of vias, this would imply half the bus width (per line) than what was described above, although in this configuration, two lines could be moved at the same time (e.g., one from each eDRAM plane). Alternatively, a single via could be used, and only one plane could be selected at a time for transmission.
- the last row in the table basically represents the situation implied above - using 80K signal vias. That is, for a 256-byte bus width, we can transfer a 256-byte line in a single cycle. With our 80K signal vias, we can have 32 such busses, hence can have 32 lines in flight simultaneously. Partitioning the 512 congruence classes into 32 islands (each with its own bus) would put 16 congruence classes into each island - as shown in the first column.
- the first row shows that if one partitioned the cache into 512 islands, each having a single congruence class, then with 80K signal vias, each of the islands could only have a 16-byte bus, and it would take 16 cycles to move a 256-byte line. In this case, 512 lines can be in flight at the same time, but it takes longer to move a line.
- Each row in the table shows a different partitioning, but each of the partitionings has the same bandwidth capacity: 8 Kilobytes per cycle.
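The table's invariant can be checked with a short sketch. It treats the ~80K signal vias as carrying 8 Kilobytes of data per cycle, as the text implies (the row set and names are illustrative):

```python
LINE_BYTES = 256                    # cache line size (bytes)
TOTAL_BYTES_PER_CYCLE = 8 * 1024    # ~80K signal vias ~= 8 KB of data/cycle
L1_5_CLASSES = 512

table = []
for islands in (512, 256, 128, 64, 32):
    bus_bytes = TOTAL_BYTES_PER_CYCLE // islands   # per-island bus width
    table.append({
        "islands": islands,
        "classes_per_island": L1_5_CLASSES // islands,
        "bus_bytes": bus_bytes,
        "cycles_per_line": LINE_BYTES // bus_bytes,
        "lines_in_flight": islands,
    })
    # Aggregate bandwidth capacity is the same for every partitioning:
    assert islands * bus_bytes == TOTAL_BYTES_PER_CYCLE   # 8 KB per cycle

# First row: 512 single-class islands, 16-byte busses, 16 cycles per line.
assert table[0] == {"islands": 512, "classes_per_island": 1, "bus_bytes": 16,
                    "cycles_per_line": 16, "lines_in_flight": 512}
# Last row: 32 islands of 16 classes, 256-byte busses, one cycle per line.
assert table[-1]["bus_bytes"] == 256 and table[-1]["cycles_per_line"] == 1
```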
- Figures 4A-4B illustrate a multicore example of 8 processors, each with a private hierarchy and some intra-processor interconnection.
- Figures 4A-4B show a generalization of a processor chip in which there are 8 processors, each with a private L1.5 cache.
- Figure 4A shows the 8 processors with the L1.5s. Of course, the area is dominated by the L1.5s (which is not apparent in the conceptual view shown).
- FIG. 4B shows that same 8-processor chip on the bottom of a stack of 4 memory chips (e.g., eDRAM) 440A-440D, although any number is feasible.
- Figure 5 shows an exemplary method 500 according to the present invention for arranging bits within a cache hierarchy implemented on multiple physical planes such that horizontal wiring distances in intra-level busses are minimized.
- the method can include co-designing physical structures of cache levels for optimizing interconnections between logically adjacent levels of the cache hierarchy, wherein the cache levels are physically positioned over each other.
- Figure 6 shows an exemplary method 600 according to the present invention for arranging bits within a cache hierarchy implemented on multiple physical planes such that horizontal wiring distances in intra-level busses are minimized.
- the method can include physically partitioning each cache level into cache islands (e.g., step 610).
- each cache island preferably includes a subset of congruence classes. It is also noted that the partitioning is performed in correspondence across cache levels, such that the congruence classes within a cache island at one cache level map to the same congruence classes of a corresponding cache island at a different cache level.
- the method 600 includes physically positioning each cache island directly over the corresponding cache islands of different cache levels (e.g., step 620).
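Steps 610 and 620 of method 600 can be sketched as follows. This is an illustrative model only; the function and field names are ours, not from the disclosure:

```python
def partition_into_islands(n_classes: int, n_islands: int):
    """Step 610 (sketch): split the congruence classes into equal islands."""
    per_island = n_classes // n_islands
    return [list(range(i * per_island, (i + 1) * per_island))
            for i in range(n_islands)]

def place_levels(islands, levels=("L1.5", "L2-plane-0", "L2-plane-1")):
    """Step 620 (sketch): give the corresponding island of every cache level
    the same planar slot, so inter-level transfers are purely vertical."""
    return [{"slot": i, "classes": classes, "levels": list(levels)}
            for i, classes in enumerate(islands)]

stack = place_levels(partition_into_islands(512, 32))
assert len(stack) == 32 and len(stack[0]["classes"]) == 16
assert stack[5]["slot"] == 5   # same x-y slot reused at every level
```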
Abstract
A method of electronic computing, and more specifically, a method of design of cache hierarchies in 3-dimensional chips, and a cache hierarchy resulting therefrom, including a physical arrangement of bits in cache hierarchies implemented in 3 dimensions such that the planar wiring required in the busses connecting the levels of the hierarchy is minimized. In this way, the data paths between the levels are primarily the vias themselves, which leads to very short, hence fast and low power busses.
Description
METHOD FOR ACHIEVING VERY HIGH BANDWIDTH BETWEEN THE LEVELS
OF A CACHE HIERARCHY IN 3-DIMENSIONAL STRUCTURES, AND A 3-
DIMENSIONAL STRUCTURE RESULTING THEREFROM
BACKGROUND OF THE INVENTION
Field of the Invention
[0001] The present invention generally relates to a method for electronic computing, and more specifically, to a method for design of cache hierarchies in 3-dimensional chips, and 3-dimensional cache hierarchy structures resulting therefrom.
Description of the Related Art
[0002] The present invention provides a method in which a natural synergy is achieved by marrying two evolving fields of endeavor, which has not been recognized by the conventional methods. [0003] First, recent work has demonstrated the viability of interconnecting two or more planes of circuits by thinning those planes (e.g., to a few hundred microns or less), etching dense via patterns in them, and then interconnecting them with metallization processes. The resulting structure is a monolithic "chip" including multiple planes of circuits. This advance is quite literally a new dimension in the scaling of circuit density. [0004] Second, as circuit density has scaled, single chips have grown to contain more and more of the computer system. Two decades ago, it was a revelation that an entire processor could fit on a single chip. At the 180 nanometer CMOS node, it was a revelation that not only the processor's Level-1 cache (L1) was contained, but for the first time it was also feasible to
include the next level of cache, L2, on the chip with the processor. Additionally, about a decade ago, the first single-chip multiprocessors began being produced.
[0005] At densities facilitated at the 90 nanometer node and beyond, together with the aforementioned ability to create 3-dimensional structures, single chip systems of the future will contain not only multiple processors, but also multiple levels of the cache hierarchy.
[0006] The access time of a cache is determined, to a large extent, by its area. Therefore, a processor's Level-1 cache (L1), which is integral to the processor pipeline itself, is kept small so that its access time is commensurate with the processor speed, which today can be several Gigahertz. Because the L1 is small, it cannot contain all of the data that will be used by the processor when running programs. When the processor references a datum that is not contained within the L1, this is called an L1 "cache miss."
[0007] In the event of an L1 miss, the reference is forwarded to the next level in the hierarchy (say, L2), to determine whether the datum is there. [0008] If the requested datum is in the L2 cache, then data (including the datum that was specifically referenced) are moved from the L2 cache to the L1 cache, and the original reference is satisfied.
[0009] If the referenced datum is not in the L2 cache, then this is also a "cache miss" (an L2 "cache miss"), and the reference continues to percolate up the hierarchy (e.g., say, to L3 and above). By convention, higher levels in a cache hierarchy are physically larger (and hence, hold more data), and thus, progressively slower.
[0010] Data in a cache are stored in "lines," which are contiguous chunks of data (i.e., being a power-of-2 number of bytes long, aligned on boundaries corresponding to this size). Thus, when a cache miss is serviced, it is not merely the specific datum that was requested that is
moved down the hierarchy. Instead, the entire cache line containing the requested datum is moved. Data are stored in "lines" for two reasons.
[0011] First, each entry in a cache has a corresponding directory entry (e.g., in a cache directory) that contains information about the cache entry. If each byte in a cache were given its own entry, there would be a prohibitive number of directory entries (i.e., equal to the number of bytes in the cache) making the administrative overhead for the cache (the directory) huge. Thus, instead, there is one directory entry per cache line (which is typically between 32-256 bytes today, depending on the processor).
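The overhead argument in paragraph [0011] is easy to quantify; a sketch with illustrative sizes (the 1 MB cache and 128-byte line are examples, not values from the disclosure):

```python
def directory_entries(cache_bytes: int, line_bytes: int) -> int:
    """One directory entry per line of the given size."""
    assert cache_bytes % line_bytes == 0
    return cache_bytes // line_bytes

CACHE_BYTES = 1 << 20   # a 1 MB cache (illustrative)

per_byte_entries = directory_entries(CACHE_BYTES, 1)    # entry per byte
per_line_entries = directory_entries(CACHE_BYTES, 128)  # 128-byte lines

assert per_byte_entries == 1_048_576   # prohibitive directory overhead
assert per_line_entries == 8_192       # manageable: one entry per line
```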
[0012] Second, program reference patterns exhibit what is called "spatial locality of reference," meaning that if a particular datum is referenced, then it is very likely that other data that are physically proximate (e.g., by address) to the referenced datum will also be referenced.
Thus, by bringing in an entire cache line, more of the spatial context of the program is captured, thereby reducing the number of misses.
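Because lines are a power of 2 in size and aligned on their own size, the line containing any address is found by masking off the low-order offset bits, which is why spatially proximate references land in the same line. A small sketch (the 256-byte line size matches the embodiment described later):

```python
LINE_SIZE = 256                        # bytes; a power of 2

def line_address(addr):
    """Address of the aligned cache line containing addr."""
    return addr & ~(LINE_SIZE - 1)     # clear the offset-within-line bits

# Spatially proximate addresses fall within the same line:
print(hex(line_address(0x12345)))      # 0x12300
print(hex(line_address(0x123ff)))      # 0x12300 -- same line
print(hex(line_address(0x12400)))      # 0x12400 -- next line
```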
[0013] The bandwidth between levels in a hierarchy is equal to the amount of data that is moved per unit of time. It is noted that the bandwidth includes both necessary movement (e.g., the data that are actually used) and unnecessary movement (e.g., data that are never referenced).
[0014] To achieve high performance in a processor, it is important not to incur many misses, since misses are a dominant component of delay. One method of reducing the number of misses incurred is to anticipate what might be used by a running program, and to "prefetch" those data down the hierarchy before they are referenced. In this way, if the program actually does reference what was anticipated, there is no miss. However, the more a processor speculates about what might be referenced (so as to prefetch), the more unnecessary movement takes place, since some of what is anticipated will be wrong. This means that facilitating higher performance by the elimination of misses will require more bandwidth.
[0015] The actual bandwidth used (as defined above) is the amount of data moved per unit time. If the program runs fast, the unit of time will be shorter, hence the bandwidth higher. Notice the distinction between the actual bandwidth and the "bandwidth capacity," which is equal to the maximum amount of data that could be moved if the busses were 100% utilized. Bandwidth capacity is equal to the width of the bus (in bytes) times the bus frequency.

[0016] For example, if an 8-byte bus runs at 1 Gigahertz, then the bandwidth capacity of the bus is 8 Gigabytes per second. It is noted also that if the processor runs at 2 Gigahertz, then the bus is 2 times slower than the processor. And if the cache line size is 128 bytes, then the 8-byte bus requires 16 bus cycles (which is 2 X 16 = 32 processor cycles) to move a cache line. Some of the data that is moved during these 32 processor cycles is useless. Further, if a subsequent miss occurs during the large window in which this cache line is being moved, the subsequent miss can be further delayed by the bus transfer that is already in progress.

[0017] For this reason, it is important to have a bandwidth capacity that is much larger than the actual bandwidth demand. Very high bandwidth facilitates two things that are crucial to high performance computing.
[0018] First, very high bandwidth allows cache lines to be transferred very quickly so that the transfers will not interfere with other miss traffic in the system. (For example, if the bus above were 128 bytes wide and ran at 3 Gigahertz, it could transfer the cache line in a single processor cycle.)
[0019] Second, having an ample surplus of bandwidth capacity facilitates operations like prefetching, which will place a much higher bandwidth demand on the system.
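The bandwidth-capacity arithmetic of the example in paragraph [0016], and the single-cycle wide-bus case of paragraph [0018], can be reproduced directly:

```python
# Numbers from the example: 8-byte bus at 1 GHz, 2 GHz processor,
# 128-byte cache line.
bus_bytes, bus_ghz = 8, 1.0
proc_ghz = 2.0
line_bytes = 128

capacity_gb_per_s = bus_bytes * bus_ghz              # 8 GB/s
bus_cycles_per_line = line_bytes // bus_bytes        # 16 bus cycles per line
proc_cycles_per_line = int(bus_cycles_per_line * proc_ghz / bus_ghz)  # 32

# A 128-byte bus (at 3 GHz, faster than the processor) moves the whole
# line in one bus cycle, i.e., within a single processor cycle.
wide_bus_cycles = line_bytes // 128                  # 1
print(capacity_gb_per_s, bus_cycles_per_line, proc_cycles_per_line)
```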
[0020] The reason that bus widths tend to be much narrower than cache line sizes (e.g., 8 bytes instead of 128 bytes) is that planar wiring capability is limited. Because these busses tend to be long, they also tend to be wired in high-level (e.g., relatively thick or fat) wire to minimize resistance so as to maximize speed. Wide busses (e.g., much wider than 8 bytes) would impose considerable blockages on the upper levels of metal, so they are generally not used.
[0021] Busses tend to be slower than processors (e.g., 1 Gigahertz instead of 2 Gigahertz) because they are too long (e.g., 5-10 millimeters or more) since they are connecting large aerial structures (caches) in a plane.
SUMMARY OF THE INVENTION
[0022] In view of the foregoing and other exemplary problems, drawbacks, and disadvantages of the conventional methods and structures, an exemplary feature of the present invention is to provide a method which maximizes the number of interconnections between the levels in a cache hierarchy so as to maximize the bus widths, and the cache hierarchy resulting from such a novel method.
[0023] Particularly, the present invention provides a method in which a natural synergy is achieved by marrying two evolving fields of endeavor, which has not been recognized by the conventional methods.

[0024] An exemplary feature of the present invention is to minimize the bus wire lengths between the levels in a cache hierarchy so as to maximize the bus speed and so as to minimize the energy required per transfer.
[0025] In a first aspect of the present invention, a method for arranging bits within a cache hierarchy implemented on multiple physical planes such that horizontal wiring distances in intra-level busses are minimized, includes physically partitioning each cache level into cache islands, each cache island including a subset of congruence classes, wherein the partitioning is performed in correspondence across cache levels such that a "cache island" containing specific congruence classes at one cache level is directly above or below the corresponding cache island which contains the corresponding congruence classes of an adjacent cache level, i.e., physically positioning each cache island directly over the corresponding cache islands of different cache levels.
[0026] In yet another aspect of the present invention, a method of deploying computing infrastructure in which recordable, computer-readable code is integrated into a computing system, and combines with the computing system for arranging bits within a cache hierarchy implemented on multiple physical planes such that horizontal wiring distances in intra-level busses are minimized, includes physically partitioning each cache level into cache islands, each cache island including a subset of congruence classes, wherein the partitioning is performed in correspondence across cache levels such that the congruence classes within a cache island at one cache level correspond to the same congruence classes within a corresponding cache island at another cache level, and physically positioning each cache island directly over the corresponding cache islands of different cache levels.
[0027] In another aspect of the present invention, a computer-readable medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus performs the exemplary methods described herein.
[0028] In the present invention, a method is provided for the physical arrangement of the bits in cache hierarchies implemented in three (3) dimensions, so as to minimize the planar wiring required in the busses connecting the levels of the hierarchy. In this way, the data paths between the levels are primarily determined by the vias themselves. This leads to very short (and hence fast), low-power busses.
[0029] In the present invention, the bits within each level of a cache hierarchy are arranged such that transfers between levels in the hierarchy involve very little of the planar wiring in any level. The present invention restricts bit placement within the plane by grouping the bits in accordance with two (orthogonal) criteria such that the movement between cache levels is nearly completely vertical in a 3-dimensional structure.
[0030] For example, the present inventors have recognized that an emerging technology is the ability to make 3-dimensional structures by thinning chips, laminating them together, and interconnecting them with through-vias in an areal grid. Notably, such structures can provide very wide interconnecting busses.
[0031] The present invention provides for interconnecting multiple planes of memory together on a wide vertical bus to provide unprecedented amounts of bandwidth between adjacent levels in a cache hierarchy. The present invention provides a method for organizing the bits within the arrays of the adjacent levels so that minimal x-y wiring is required to transfer data. In this way, the wide bus can be almost entirely vertical, and also can be very short (e.g., a few hundred microns). This provides a bus that can run at much higher speed, and at much lower power, than conventional 2-dimensional busses.
BRIEF DESCRIPTION OF THE DRAWINGS
[0032] The foregoing and other exemplary purposes, aspects and advantages will be better understood from the following detailed description of exemplary aspects of the invention with reference to the drawings, in which:

[0033] Figure 1 exemplarily illustrates a cache hierarchy showing an L1.5 SRAM plane and an L2 including two planes of eDRAM;
[0034] Figure 2 logically illustrates cache geometries for L1.5 and L2 caches, according to exemplary aspects of the present invention;
[0035] Figure 3 exemplarily illustrates a bitline structure in which 32 bits in each of the eDRAM planes can drive the corresponding 8 bits on the SRAM plane;
[0036] Figures 4A-4B exemplarily illustrate a multicore example of 8 processors, each with a private cache hierarchy and some intra-processor interconnection;
[0037] Figure 5 illustrates an exemplary method according to the present invention; and

[0038] Figure 6 illustrates another exemplary method according to the present invention.
DETAILED DESCRIPTION OF EXEMPLARY ASPECTS OF THE INVENTION
[0039] Referring now to the drawings, and more particularly to Figures 1-6, there are shown exemplary aspects of the method and structures according to the present invention.

[0040] For purposes of illustrating the principal concepts of the present invention, the description will first focus on a simple 2-level hierarchy comprising an L1.5 cache and an L2 cache.
Roughly speaking, it is a good rule of thumb that adjacent levels in a hierarchy have a capacity
difference of an order of magnitude or so. In this example, the L1.5 is taken to be 1 Megabyte, and the L2 is 8 Megabytes.
[0041] First, these caches can be arranged so that they are in "aerial correspondence." That is, the areas of the two caches can be made approximately the same. One of them can be placed on top of the other. This is preferred (e.g., required) so that when moving data between the two levels, large horizontal movements are not required. Preferably, substantially only vertical movements will be needed. Then, the bits of the two caches will be arranged so that the physically corresponding bits (between the two levels) are in identical x-y (planar) proximities. This will eliminate the need for much x-y (planar) wiring in the inter-planar communication paths.
[0042] Although the capacity of the L2 is (in this case) 8X larger than the capacity of the L1.5, the invention preferably makes them aerially commensurate by using different circuit densities, and by using multiple planes of storage.
[0043] For example, let the L1.5 be implemented in 6T SRAM, and let the L2 be implemented in embedded DRAM (eDRAM). For argument's sake, assume that eDRAM offers a 4X density improvement over SRAM. To facilitate an 8X capacity difference then requires two planes of eDRAM to implement the L2 to correspond to a single plane of SRAM for the L1.5. Figure 1 shows this basic arrangement.
[0044] That is, Figure 1 illustrates a cache hierarchy structure 100 showing an L1.5 SRAM plane 110 (having, for example, 1 MB of SRAM) and an L2 cache 120 comprising two planes of eDRAM 120A, 120B (having, for example, 4 MB each).
[0045] Assuming a 0.65 square micron SRAM cell with a 75% array efficiency, 1 Megabyte of SRAM is approximately 8.2 square millimeters. Using this number, a total area in the neighborhood of 10 square millimeters can be aimed for (after having de-rated the array efficiency to account for rearranging the bits, and after having put in an areal array of vias).

[0046] In this exemplary embodiment, a line size of 256 bytes is chosen. The 1 Megabyte L1.5 then has 4K lines, and the 8 Megabyte L2 has 32K lines. The set associativity of the L1.5 is chosen as 8-way, and the set associativity of the L2 is chosen as 16-way. This makes the number of congruence classes in the L1.5 be 4K / 8 = 512, and the number of congruence classes in the L2 be 32K / 16 = 2K. These geometries are shown in Figure 2.
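The geometry and area figures quoted above can be re-derived as a sanity check. This sketch uses the patent's 9-bits-per-byte convention; it verifies only the arithmetic, not the physical design:

```python
MB = 1024 * 1024
line_bytes = 256

# L1.5: 1 Megabyte, 8-way set associative
l15_lines = (1 * MB) // line_bytes        # 4096 lines (4K)
l15_classes = l15_lines // 8              # 512 congruence classes

# L2: 8 Megabytes, 16-way set associative
l2_lines = (8 * MB) // line_bytes         # 32768 lines (32K)
l2_classes = l2_lines // 16               # 2048 congruence classes (2K)

# SRAM area: 0.65 um^2 per cell, 9 bits per byte, 75% array efficiency
cells = 1 * MB * 9
area_mm2 = cells * 0.65e-6 / 0.75         # ~8.2 square millimeters
print(l15_classes, l2_classes, round(area_mm2, 1))
```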
[0047] That is, Figure 2 logically illustrates exemplary cache geometries 210, 220 for the L1.5 and L2 caches, respectively.

[0048] Because this is a composite L1.5/L2 structure, a combined L1.5/L2 directory is chosen to be constructed in SRAM and put on the SRAM plane, although this is arbitrary, and not essential to the present invention. Basically, a "combined" directory is really an L2 directory (having 32K entries), where each entry has an additional field to specify its status (MESI) in the L1.5, in the case that the corresponding data are resident in the L1.5. This was estimated to take about 1.6 square millimeters, assuming 8 bytes for a directory entry.
[0049] On a directory search, if there is an L1.5 miss and an L2 hit, then the directory should identify the line in the L2 that is to be brought into the L1.5, and it should choose the location (set within congruence class) in the L1.5 into which it is to be placed. Therefore, it should select one of 32K lines in the L2, which it can do by transmitting as few as 15 bits up to the L2 plane, or as many as 32K select signals, depending on how many vias are desired for use, and where one wishes to place decoders. The choice here is arbitrary (with regard to teaching this invention), albeit important to the goals of the design. The point is that the number of vias used for this can be small.
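The two endpoints of the select-signal trade-off mentioned above (a fully encoded line index decoded on the L2 plane, versus fully decoded one-hot selects) amount to:

```python
import math

l2_lines = 32 * 1024                 # 32K lines in the L2

# Fully encoded: send a binary line index up to the L2 plane
encoded_bits = math.ceil(math.log2(l2_lines))     # 15 select vias

# Fully decoded: one one-hot select signal per line
decoded_selects = l2_lines                        # 32768 select vias

print(encoded_bits, decoded_selects)
```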
[0050] Initially, a goal is to maximize the width of the data bus. For a 256-byte line, were a line to be moved in a single transfer, it would require 2304 signals (256 bytes X 9 bits per byte). Using one power via per signal via requires a total count of about 5000 vias to do this (together with address selects and various control signals). Using a via size of 16 square microns yields a total via area of 0.08 square millimeters. So in fact, if allowed a total via area overhead of 20% (of the 10 square millimeter area target), 32 such busses could be used, allowing as many as 32 cache lines (8 Kilobytes) to be moved simultaneously in an area of about 12 square millimeters.

[0051] While this amount of bandwidth capacity is staggering by today's standards, the question remains as to how to route this many signals (say, 2500 X 32 = 80K) in a feasible way, and so that those wires are very short.
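The via-count and via-area arithmetic above can be reproduced as follows, treating the ~5000-via figure as the stated estimate that bundles power, select, and control vias with the data signals:

```python
# One 256-byte bus moved in a single transfer:
data_signals = 256 * 9                    # 2304 data signals (9 bits per byte)
vias_per_bus = 5000                       # stated estimate: signal + power
                                          # vias, plus selects and control
via_um2 = 16.0                            # via size in square microns
bus_via_area_mm2 = vias_per_bus * via_um2 / 1e6   # 0.08 mm^2 per bus

# 32 such busses:
busses = 32
signal_vias = 2500 * busses               # ~80K signal vias, as in the text
in_flight_bytes = busses * 256            # 8 Kilobytes moving simultaneously
print(data_signals, bus_via_area_mm2, signal_vias, in_flight_bytes)
```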
[0052] The first observation made is that, of all of the bits in the L2 (8 Megabytes X 9 bits per byte = 72 Megabits) and of all of the bits in the L1.5 (9 Megabits), it is not the case that any of the 72 Megabits in the L2 can be moved to any of the 9 Megabits in the L1.5. In fact, for any particular bit in the L1.5, there are only 64 bits in the L2 that can map into that particular bit. This is true for two independent reasons as described below. The present invention physically arranges the bits of the L1.5 and L2 so that only those bits of the L2 that can map to a bit in the L1.5 are above it.

[0053] A first (and trivial) reason for this mapping limitation is that each bit in a cache line occupies a unique position. That is, a line is a contiguous and ordered group of 256 bytes, and each of those bytes is a contiguous and ordered group of 9 bits. Therefore, there are 2304 unique bit positions in a cache line. When a line is moved from the L2 to the L1.5, bit 0 of the L2 line is placed into bit position 0 in the L1.5 line, bit 1 of the L2 line is placed into bit position 1 in the L1.5 line, and so on. Bit 0 (and all other bits) can only be placed into its corresponding bit position. It cannot be placed into any other bit position.
[0054] Secondly, and less obviously, for any congruence class in the L1.5, there are only a handful of congruence classes in the L2 that can map to it. Specifically, if there are n1 congruence classes in the L1.5, and n2 congruence classes in the L2, then only n2/n1 L2 congruence classes can map to a given L1.5 congruence class. In the example being shown, the numbers are 2K / 512 = 4 L2 congruence classes per L1.5 congruence class. Further, those L2 congruence classes cannot map to any other L1.5 congruence class.

[0055] To put it differently, the L1.5/L2 combination can be viewed as 512 (the number of L1.5 congruence classes) independent L1.5/L2 sub-caches, each in its own little island with its own bus between the L1.5 and L2. Each sub-cache is an L1.5 congruence class and the set of L2 congruence classes that map to it. Therefore, the 1M L1.5 / 8M L2 aggregation can be thought of as 512 independent 2K L1.5 / 16K L2 aggregations, and each of those has 2304 independent bit positions.

[0056] It is noted that a line coming from the L2 (which can reside in any of the 16 sets of the 4 congruence classes of its sub-cache) can be placed into any of the 8 sets of the L1.5 congruence class. If the cache is partitioned down to the bit position, it means that there are 16 X 4 = 64 bits in the L2 that can map to any of 8 bits (the set associativity of the L1.5) in the L1.5. Thus, conceivably, the smallest "island" in this cache would correspond to an 8-bit island of the L1.5 residing underneath 2 (recall that there are 2 eDRAM planes) 32-bit islands of the eDRAM.

[0057] This situation is shown in Figure 3. That is, Figure 3 exemplarily illustrates a bitline structure 300 in which 32 bits in each of the eDRAM planes 310, 320 can drive the corresponding 8 bits on the SRAM plane 330.
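The 64-bits-per-bit mapping claim combines the two restrictions above, and can be checked directly from the running example's geometries:

```python
# Geometries from the running example
l15_classes, l15_assoc = 512, 8
l2_classes, l2_assoc = 2048, 16

# Only n2/n1 L2 congruence classes map to a given L1.5 congruence class
classes_per_subcache = l2_classes // l15_classes        # 4

# A line can sit in any set of any of those L2 classes...
l2_source_bits = l2_assoc * classes_per_subcache        # 16 x 4 = 64

# ...and can land in any of the L1.5 sets of the one target class
l15_target_bits = l15_assoc                             # 8

print(classes_per_subcache, l2_source_bits, l15_target_bits)
```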
[0058] In Figure 3, the data path is shown as having independent signals for each eDRAM plane. For a fixed number of vias, this would imply half the bus width (per line) than what was described above, although in this configuration, two lines could be moved at the same time (e.g., one from each eDRAM plane). Alternatively, a single via could be used, and only one plane could be selected at a time for transmission.
[0059] While these particular partitionings are too small to lead to good array efficiencies, they demonstrate that the total x-y (planar) motion required for an L2-L1.5 transfer is very small indeed, i.e., it can be made almost completely vertical. It is noted that since the vertical distance is small, e.g., 400 microns for the three 200-micron thick planes shown here, the transfer speeds can be much faster than what would be achievable in several millimeters of horizontal wiring (i.e., than what is possible in a plane). Additionally, since the capacitance is very small, the energy required per transfer is also much smaller.
[0060] Assuming that a real design would trade off array efficiency with the size of the islands (which determines the length of horizontal wire required in a transfer), the table below shows several ways to partition the caches, and the busses that result using 80K signal vias:

Congruence classes per island | Number of islands | Bus width per island | Cycles to move a 256-byte line
---|---|---|---
1 | 512 | 16 bytes | 16
2 | 256 | 32 bytes | 8
4 | 128 | 64 bytes | 4
8 | 64 | 128 bytes | 2
16 | 32 | 256 bytes | 1
[0061] The last row in the table basically represents the situation implied above - using 80K signal vias. That is, for a 256-byte bus width, we can transfer a 256-byte line in a single cycle. With our 80K signal vias, we can have 32 such busses, hence can have 32 lines in flight simultaneously. Partitioning the 512 congruence classes into 32 islands (each with its own bus) would put 16 congruence classes into each island - as shown in the first column.

[0062] On the other end of the spectrum, the first row shows that if one partitioned the cache into 512 islands, each having a single congruence class, then with 80K signal vias, each of the islands could only have a 16-byte bus, and it would take 16 cycles to move a 256-byte line. In this case, 512 lines can be in flight at the same time, but it takes longer to move a line.

[0063] Each row in the table shows a different partitioning, but each of the partitionings has the same bandwidth capacity: 8 Kilobytes per cycle.

[0064] While the above descriptions pertain to a single L1.5 / L2 pair, the question arises as to how this same reasoning applies to a multicore system. Figures 4A-4B illustrate a multicore example of 8 processors, each with a private hierarchy and some intra-processor interconnection.

[0065] That is, Figures 4A-4B show a generalization of a processor chip in which there are 8 processors, each with a private L1.5 cache. Figure 4A shows the 8 processors with the L1.5s. Of course, the area is dominated by the L1.5s (which is not apparent in the conceptual view shown).
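The partitioning trade-off discussed above can be regenerated mechanically. This sketch ignores control vias and uses the patent's 9-bits-per-byte convention; the constant 8 Kilobytes-per-cycle capacity falls out of the fixed signal-via budget:

```python
LINE_BYTES = 256
TOTAL_CLASSES = 512
TOTAL_BYTES_PER_CYCLE = 8 * 1024     # fixed by the ~80K signal-via budget

rows = []
for islands in (512, 256, 128, 64, 32):
    classes_per_island = TOTAL_CLASSES // islands
    bus_bytes = TOTAL_BYTES_PER_CYCLE // islands   # per-island bus width
    cycles_per_line = LINE_BYTES // bus_bytes
    rows.append((islands, classes_per_island, bus_bytes, cycles_per_line))
    # aggregate capacity is islands * bus_bytes = 8192 bytes/cycle in every row

for row in rows:
    print(row)
# (512, 1, 16, 16)  -- one congruence class per island, 16 cycles per line
# ...
# (32, 16, 256, 1)  -- 16 classes per island, one cycle per line
```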
[0066] While any number of processors is acceptable, the reason that 8 was chosen in this example is that the chip can be partitioned into 9 similar regions, with the outside 8 regions each holding a processor and its cache, where the central region can be used for a switch, to handle traffic between the caches. [0067] Figure 4B shows that same 8-processor chip on the bottom of a stack of 4 memory chips (e.g., eDRAM) 440A-440D, although any number is feasible. Just as the bottom processor chip is partitioned into 9 physical regions, each of the memory chips above is so partitioned as well (although this isn't explicitly shown). Essentially, this system is just 8 copies of the single
system described above, with some switching (at least on the processor chip, but perhaps on the memory chips as well) to facilitate intra-processor communications.
[0068] Figure 5 shows an exemplary method 500 according to the present invention for arranging bits within a cache hierarchy implemented on multiple physical planes such that horizontal wiring distances in intra-level busses are minimized. As shown in Figure 5, the method can include co-designing physical structures of cache levels for optimizing interconnections between logically adjacent levels of the cache hierarchy, wherein the cache levels are physically positioned over each other.

[0069] Figure 6 shows an exemplary method 600 according to the present invention for arranging bits within a cache hierarchy implemented on multiple physical planes such that horizontal wiring distances in intra-level busses are minimized. As shown in Figure 6, the method can include physically partitioning each cache level into cache islands (e.g., step 610). It is noted that each cache island preferably can include a subset of congruence classes. It is also noted that the partitioning is performed in correspondence across cache levels such that the congruence classes within a cache island at one cache level map to the same congruence classes of a corresponding cache island at a different cache level. The method 600 includes physically positioning each cache island directly over the corresponding cache islands of different cache levels (e.g., step 620).

[0070] While the invention has been described in terms of several exemplary aspects, those skilled in the art will recognize that the invention can be practiced with modification within the spirit and scope of the appended claims.
[0071] Further, it is noted that Applicant's intent is to encompass equivalents of all claim elements, even if amended later during prosecution.
CLAIMS

What is claimed is:

1. A method of arranging bits within a cache hierarchy implemented on multiple physical planes such that horizontal wiring distances in intra-level busses are minimized, said method comprising: co-designing physical structures of cache levels for optimizing interconnections between logically adjacent levels of said cache hierarchy, wherein said cache levels are positioned over each other.
2. The method according to claim 1, wherein said co-designing comprises: physically partitioning each said cache level into cache islands, each said cache island including a subset of congruence classes, wherein said partitioning is performed in correspondence across cache levels such that the congruence classes within a cache island at one cache level map to same congruence classes of a corresponding cache island at a different cache level.
3. The method according to claim 1, wherein said bits within said cache hierarchy are arranged to minimize horizontal transfers between said cache levels in the cache hierarchy.
4. The method according to claim 1, wherein adjacent cache levels are arranged to include different capacities.
5. The method according to claim 1, wherein adjacent cache levels are arranged to include same capacities.
6. The method according to claim 1, wherein one of said cache levels in the cache hierarchy comprises a plurality of physical planes.
7. The method according to claim 1, wherein areas of each cache level are arranged to be substantially the same.
8. The method according to claim 2, wherein said partitioning is selected to optimize a size of each said cache island.
9. A method of deploying computing infrastructure in which recordable, computer-readable code is integrated into a computing system, and combines with said computing system to perform the method according to Claim 1.
10. A computer-readable medium tangibly embodying a program of machine readable instructions executable by a digital processing apparatus to perform a method of causing a computer to perform the method according to Claim 1.
11. A method of arranging bits within a cache hierarchy implemented on multiple physical planes such that horizontal wiring distances in intra-level busses are minimized, said method comprising: physically partitioning each cache level into cache islands, each said cache island including a subset of congruence classes, wherein said partitioning is performed in correspondence across cache levels such that the congruence classes within a cache island at one cache level map to same congruence classes of a corresponding cache island at a different cache level; and positioning each said cache island over the corresponding cache islands of different cache levels.
12. The method according to claim 11, wherein said bits within said cache hierarchy are arranged to minimize horizontal transfers between said cache levels in the cache hierarchy.
13. The method according to claim 11, wherein adjacent cache levels are arranged to include different capacities.
14. The method according to claim 11, wherein adjacent cache levels are arranged to include same capacities.
15. The method according to claim 11, wherein one of said cache levels in the cache hierarchy comprises a plurality of physical planes.
16. The method according to claim 11, wherein areas of each cache level are arranged to be substantially the same.
17. The method according to claim 11, wherein said partitioning is selected to optimize a size of each said cache island.
18. A method of deploying computing infrastructure in which recordable, computer-readable code is integrated into a computing system, and combines with said computing system to perform the method according to Claim 11.
19. A computer-readable medium tangibly embodying a program of machine readable instructions executable by a digital processing apparatus to perform a method of causing a computer to perform the method according to Claim 11.
20. A cache hierarchy implemented on multiple physical planes such that horizontal wiring distances in intra-level busses are minimized, said cache hierarchy comprising: a plurality of cache levels physically partitioned into cache islands, each said cache island including a subset of congruence classes, wherein said cache levels are partitioned in correspondence across cache levels such that the congruence classes within a cache island at one cache level map to same congruence classes of a corresponding cache island at a different cache level, and wherein said cache islands are positioned over the corresponding cache islands of different cache levels.
21. The cache hierarchy according to claim 20, wherein adjacent cache levels include different capacities.
22. The cache hierarchy according to claim 20, wherein adjacent cache levels include same capacities.

23. The cache hierarchy according to claim 20, wherein areas of each cache level are substantially the same.

24. The cache hierarchy according to claim 20, wherein one of said cache levels in the cache hierarchy comprises a plurality of physical planes.

25. The cache hierarchy according to claim 20, wherein said partitioning is selected to optimize a size of each said cache island.

26. The cache hierarchy according to claim 20, wherein said cache hierarchy is arranged to minimize horizontal transfers between said cache levels in the cache hierarchy.

27. The cache hierarchy according to claim 20, wherein data paths between said cache levels comprise vias.
28. A design tool for designing a cache hierarchy according to claim 20, which is implemented on multiple physical planes such that horizontal wiring distances in intra-level busses are minimized.
29. A design tool comprising: a cache hierarchy according to claim 20, which is implemented on multiple physical planes such that horizontal wiring distances in intra-level busses are minimized.
30. A computer system, comprising: a cache hierarchy implemented on multiple physical planes such that horizontal wiring distances in intra-level busses are minimized, said cache hierarchy comprising: a plurality of cache levels physically partitioned into cache islands, each said cache island including a subset of congruence classes, wherein said cache levels are partitioned in correspondence across cache levels such that the congruence classes within a cache island at one cache level map to same congruence classes of a corresponding cache island at a different cache level, and wherein said cache islands are positioned over the corresponding cache islands of different cache levels.
31. A computer system according to claim 30, wherein said cache islands are physically positioned directly over the corresponding cache islands of different cache levels.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2007800188856A CN101473436B (en) | 2006-06-16 | 2007-06-15 | Method for achieving very high bandwidth between the levels of a cache hierarchy in 3-dimensional structures, and a 3-dimensional structure resulting therefrom |
EP07863368A EP2036126A2 (en) | 2006-06-16 | 2007-06-15 | Method for achieving very high bandwidth between the levels of a cache hierarchy in 3-dimensional structures, and a 3-dimensional structure resulting therefrom |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/453,885 | 2006-06-16 | ||
US11/453,885 US7616470B2 (en) | 2006-06-16 | 2006-06-16 | Method for achieving very high bandwidth between the levels of a cache hierarchy in 3-dimensional structures, and a 3-dimensional structure resulting therefrom |
US11/538,567 | 2006-10-04 | ||
US11/538,567 US7518225B2 (en) | 2006-06-16 | 2006-10-04 | Chip system architecture for performance enhancement, power reduction and cost reduction |
Publications (3)
Publication Number | Publication Date |
---|---|
WO2008100324A2 WO2008100324A2 (en) | 2008-08-21 |
WO2008100324A9 true WO2008100324A9 (en) | 2009-05-22 |
WO2008100324A3 WO2008100324A3 (en) | 2011-01-13 |
Family
ID=38860723
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2007/071370 WO2008100324A2 (en) | 2006-06-16 | 2007-06-15 | Method for achieving very high bandwidth between the levels of a cache hierarchy in 3-dimensional structures, and a 3- dimensional structure resulting therefrom |
Country Status (4)
Country | Link |
---|---|
US (3) | US7616470B2 (en) |
EP (1) | EP2036126A2 (en) |
CN (1) | CN101473436B (en) |
WO (1) | WO2008100324A2 (en) |
Families Citing this family (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7602062B1 (en) * | 2005-08-10 | 2009-10-13 | Altera Corporation | Package substrate with dual material build-up layers |
JP4824397B2 (en) * | 2005-12-27 | 2011-11-30 | Ibiden Co., Ltd. | Multilayer printed wiring board |
US8110899B2 (en) * | 2006-12-20 | 2012-02-07 | Intel Corporation | Method for incorporating existing silicon die into 3D integrated stack |
US8032711B2 (en) * | 2006-12-22 | 2011-10-04 | Intel Corporation | Prefetching from dynamic random access memory to a static random access memory |
US20080237738A1 (en) * | 2007-03-27 | 2008-10-02 | Christoph Andreas Kleint | Integrated circuit, cell, cell arrangement, method for manufacturing an integrated circuit, method for manufacturing a cell arrangement; memory module |
US9229887B2 (en) * | 2008-02-19 | 2016-01-05 | Micron Technology, Inc. | Memory device with network on chip methods, apparatus, and systems |
US7978721B2 (en) | 2008-07-02 | 2011-07-12 | Micron Technology, Inc. | Multi-serial interface stacked-die memory architecture |
US8086913B2 (en) | 2008-09-11 | 2011-12-27 | Micron Technology, Inc. | Methods, apparatus, and systems to repair memory |
US20100078788A1 (en) | 2008-09-26 | 2010-04-01 | Amir Wagiman | Package-on-package assembly and method |
JP2010108204A (en) * | 2008-10-30 | 2010-05-13 | Hitachi Ltd | Multichip processor |
US8417974B2 (en) * | 2009-11-16 | 2013-04-09 | International Business Machines Corporation | Power efficient stack of multicore microprocessors |
US9123552B2 (en) | 2010-03-30 | 2015-09-01 | Micron Technology, Inc. | Apparatuses enabling concurrent communication between an interface die and a plurality of dice stacks, interleaved conductive paths in stacked devices, and methods for forming and operating the same |
US8466543B2 (en) | 2010-05-27 | 2013-06-18 | International Business Machines Corporation | Three dimensional stacked package structure |
US8299608B2 (en) | 2010-07-08 | 2012-10-30 | International Business Machines Corporation | Enhanced thermal management of 3-D stacked die packaging |
KR20120079397A (en) * | 2011-01-04 | 2012-07-12 | Samsung Electronics Co., Ltd. | Stacked semiconductor device and manufacturing method thereof |
US8569874B2 (en) | 2011-03-09 | 2013-10-29 | International Business Machines Corporation | High memory density, high input/output bandwidth logic-memory structure and architecture |
KR20140109914A (en) * | 2011-12-01 | 2014-09-16 | Conversant Intellectual Property Management Inc. | CPU with stacked memory |
CN102662909B (en) * | 2012-03-22 | 2013-12-25 | East China University of Technology | Three-dimensional many-core system on chip |
US8891279B2 (en) | 2012-09-17 | 2014-11-18 | International Business Machines Corporation | Enhanced wiring structure for a cache supporting auxiliary data output |
US9378793B2 (en) * | 2012-12-20 | 2016-06-28 | Qualcomm Incorporated | Integrated MRAM module |
US9037791B2 (en) | 2013-01-22 | 2015-05-19 | International Business Machines Corporation | Tiered caching and migration in differing granularities |
US9336144B2 (en) * | 2013-07-25 | 2016-05-10 | Globalfoundries Inc. | Three-dimensional processing system having multiple caches that can be partitioned, conjoined, and managed according to more than one set of rules and/or configurations |
CN107564825B (en) * | 2017-08-29 | 2018-09-21 | Ruili Integrated Circuit Co., Ltd. | Double-sided chip package structure and manufacturing method thereof |
CN107564881B (en) * | 2017-08-29 | 2018-09-21 | Ruili Integrated Circuit Co., Ltd. | Stacked three-dimensional chip package structure and manufacturing method thereof |
FR3082656B1 (en) | 2018-06-18 | 2022-02-04 | Commissariat Energie Atomique | INTEGRATED CIRCUIT COMPRISING MACROS AND ITS MANUFACTURING METHOD |
CN110540164A (en) * | 2019-10-09 | 2019-12-06 | Taicang Quanzhong Intelligent Equipment Co., Ltd. | Bottle-type buffer storage machine |
EP4071593A4 (en) * | 2021-02-26 | 2023-08-23 | Beijing Vcore Technology Co., Ltd. | Stacked cache system based on sedram, and control method and cache device |
CN113096706B (en) * | 2021-03-09 | 2023-06-16 | Yangtze Advanced Memory Industrial Innovation Center Co., Ltd. | CPU and manufacturing method thereof |
CN113097383B (en) * | 2021-03-09 | 2023-07-18 | Yangtze Advanced Memory Industrial Innovation Center Co., Ltd. | CPU and manufacturing method thereof |
US11887908B2 (en) | 2021-12-21 | 2024-01-30 | International Business Machines Corporation | Electronic package structure with offset stacked chips and top and bottom side cooling lid |
CN114244920B (en) * | 2021-12-29 | 2024-02-09 | Suzhou Centec Communications Co., Ltd. | Method, system, and chip for compatible stacking of new and old chips |
WO2023203435A1 (en) * | 2022-04-22 | 2023-10-26 | Semiconductor Energy Laboratory Co., Ltd. | Semiconductor device |
Family Cites Families (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5133061A (en) | 1987-10-29 | 1992-07-21 | International Business Machines Corporation | Mechanism for improving the randomization of cache accesses utilizing a bit-matrix multiplication permutation of cache addresses |
US5502667A (en) * | 1993-09-13 | 1996-03-26 | International Business Machines Corporation | Integrated multichip memory module structure |
US6059835A (en) | 1997-06-13 | 2000-05-09 | International Business Machines Corporation | Performance evaluation of processor operation using trace pre-processing |
US6175160B1 (en) | 1999-01-08 | 2001-01-16 | Intel Corporation | Flip-chip having an on-chip cache memory |
US6725334B2 (en) * | 2000-06-09 | 2004-04-20 | Hewlett-Packard Development Company, L.P. | Method and system for exclusive two-level caching in a chip-multiprocessor |
US6678814B2 (en) * | 2001-06-29 | 2004-01-13 | International Business Machines Corporation | Method and apparatus for allocating data usages within an embedded dynamic random access memory device |
JP4047788B2 (en) * | 2003-10-16 | 2008-02-13 | Matsushita Electric Industrial Co., Ltd. | Compiler device and linker device |
US7130967B2 (en) * | 2003-12-10 | 2006-10-31 | International Business Machines Corporation | Method and system for supplier-based memory speculation in a memory subsystem of a data processing system |
US7217994B2 (en) | 2004-12-01 | 2007-05-15 | Kyocera Wireless Corp. | Stack package for high density integrated circuits |
US7305523B2 (en) * | 2005-02-12 | 2007-12-04 | International Business Machines Corporation | Cache memory direct intervention |
US7533321B2 (en) * | 2005-09-13 | 2009-05-12 | International Business Machines Corporation | Fault tolerant encoding of directory states for stuck bits |
US7404041B2 (en) * | 2006-02-10 | 2008-07-22 | International Business Machines Corporation | Low complexity speculative multithreading system based on unmodified microprocessor core |
JP4208895B2 (en) * | 2006-05-30 | 2009-01-14 | Toshiba Corporation | Cache memory device and processing method |
- 2006
  - 2006-06-16 US US11/453,885 patent/US7616470B2/en not_active Expired - Fee Related
  - 2006-10-04 US US11/538,567 patent/US7518225B2/en active Active
- 2007
  - 2007-06-15 WO PCT/US2007/071370 patent/WO2008100324A2/en active Application Filing
  - 2007-06-15 CN CN2007800188856A patent/CN101473436B/en not_active Expired - Fee Related
  - 2007-06-15 EP EP07863368A patent/EP2036126A2/en not_active Withdrawn
- 2008
  - 2008-05-07 US US12/116,771 patent/US7986543B2/en not_active Expired - Fee Related
Also Published As
Publication number | Publication date |
---|---|
WO2008100324A3 (en) | 2011-01-13 |
WO2008100324A2 (en) | 2008-08-21 |
US7616470B2 (en) | 2009-11-10 |
US7518225B2 (en) | 2009-04-14 |
US20070294479A1 (en) | 2007-12-20 |
US20080209126A1 (en) | 2008-08-28 |
US7986543B2 (en) | 2011-07-26 |
CN101473436B (en) | 2011-04-13 |
CN101473436A (en) | 2009-07-01 |
EP2036126A2 (en) | 2009-03-18 |
US20070290315A1 (en) | 2007-12-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7986543B2 (en) | Method for achieving very high bandwidth between the levels of a cache hierarchy in 3-dimensional structures, and a 3-dimensional structure resulting therefrom | |
Li et al. | Design and management of 3D chip multiprocessors using network-in-memory | |
US10310976B2 (en) | System and method for concurrently checking availability of data in extending memories | |
US8234453B2 (en) | Processor having a cache memory which is comprised of a plurality of large scale integration | |
Madan et al. | Optimizing communication and capacity in a 3D stacked reconfigurable cache hierarchy | |
EP2353095B1 (en) | A spiral cache memory and method of operating a spiral cache memory | |
US10963022B2 (en) | Layered super-reticle computing : architectures and methods | |
US20060081971A1 (en) | Signal transfer methods for integrated circuits | |
CN102203747A (en) | Storage array tile supporting systolic movement operations | |
Poremba et al. | There and back again: Optimizing the interconnect in networks of memory cubes | |
US7444473B1 (en) | Speculative memory accesses in a proximity communication-based off-chip cache memory architecture | |
CN116610630B (en) | Multi-core system and data transmission method based on network-on-chip | |
US7496712B1 (en) | Proximity communication-based off-chip cache memory architectures | |
Jagasivamani et al. | Tileable monolithic ReRAM memory design | |
CN105930300A (en) | Three-dimensional in-chip cache based processor structure and method for manufacturing same | |
US11844223B1 (en) | Ferroelectric memory chiplet as unified memory in a multi-dimensional packaging | |
US20240078195A1 (en) | Systems, methods, and devices for advanced memory technology | |
Daneshtalab et al. | Memory-efficient logic layer communication platform for 3D-stacked memory-on-processor architectures | |
US11822475B2 (en) | Integrated circuit with 3D partitioning | |
US11789641B2 (en) | Three dimensional circuit systems and methods having memory hierarchies | |
CN115309670A (en) | Memory chip, electronic device and memory system | |
WO2024049823A1 (en) | Locality-based data processing | |
Franzon et al. | Applications and design styles for 3DIC | |
TW202331519A (en) | Computer system and memory management method based on wafer-on-wafer architecture | |
CN117690808A (en) | Method for producing chip |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| WWE | Wipo information: entry into national phase | Ref document number: 200780018885.6; Country of ref document: CN |
| DPE2 | Request for preliminary examination filed before expiration of 19th month from priority date (pct application filed from 20040101) | |
| NENP | Non-entry into the national phase | Ref country code: DE |
| WWE | Wipo information: entry into national phase | Ref document number: 2007863368; Country of ref document: EP |
| NENP | Non-entry into the national phase | Ref country code: RU |
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 07863368; Country of ref document: EP; Kind code of ref document: A2 |