WO2008100324A9 - Method for achieving very high bandwidth between the levels of a cache hierarchy in 3-dimensional structures, and a 3-dimensional structure resulting therefrom - Google Patents

Method for achieving very high bandwidth between the levels of a cache hierarchy in 3-dimensional structures, and a 3-dimensional structure resulting therefrom

Info

Publication number
WO2008100324A9
Authority
WO
WIPO (PCT)
Prior art keywords
cache
levels
hierarchy
island
level
Prior art date
Application number
PCT/US2007/071370
Other languages
French (fr)
Other versions
WO2008100324A3 (en)
WO2008100324A2 (en)
Inventor
Philip George Emma
John Ulrich Knickerbocker
Chirag S Patel
Original Assignee
IBM
Philip George Emma
John Ulrich Knickerbocker
Chirag S Patel
Priority date
Filing date
Publication date
Application filed by IBM, Philip George Emma, John Ulrich Knickerbocker, Chirag S. Patel
Priority to CN2007800188856A (CN101473436B)
Priority to EP07863368A (EP2036126A2)
Publication of WO2008100324A2
Publication of WO2008100324A9
Publication of WO2008100324A3

Classifications

    • H - ELECTRICITY
    • H01 - ELECTRIC ELEMENTS
    • H01L - SEMICONDUCTOR DEVICES NOT COVERED BY CLASS H10
    • H01L25/00 - Assemblies consisting of a plurality of individual semiconductor or other solid state devices; Multistep manufacturing processes thereof
    • H01L25/03 - all the devices being of a type provided for in the same subgroup of groups H01L27/00 - H01L33/00, or in a single subclass of H10K, H10N, e.g. assemblies of rectifier diodes
    • H01L25/04 - the devices not having separate containers
    • H01L25/065 - the devices being of a type provided for in group H01L27/00
    • H01L25/0657 - Stacked arrangements of devices
    • H01L25/0652 - the devices being arranged next and on each other, i.e. mixed assemblies
    • H01L25/18 - the devices being of types provided for in two or more different subgroups of the same main group of groups H01L27/00 - H01L33/00, or in a single subclass of H10K, H10N
    • H01L2225/00 - Details relating to assemblies covered by the group H01L25/00 but not provided for in its subgroups
    • H01L2225/03 - All the devices being of a type provided for in the same subgroup of groups H01L27/00 - H01L33/648 and H10K99/00
    • H01L2225/04 - the devices not having separate containers
    • H01L2225/065 - the devices being of a type provided for in group H01L27/00
    • H01L2225/06503 - Stacked arrangements of devices
    • H01L2225/06513 - Bump or bump-like direct electrical connections between devices, e.g. flip-chip connection, solder bumps
    • H01L2225/06517 - Bump or bump-like direct electrical connections from device to substrate
    • H01L2225/06541 - Conductive via connections through the device, e.g. vertical interconnects, through silicon via [TSV]
    • H01L2225/06572 - Auxiliary carrier between devices, the carrier having an electrical connection structure
    • H01L2225/06589 - Thermal management, e.g. cooling
    • H01L23/00 - Details of semiconductor or other solid state devices
    • H01L23/58 - Structural electrical arrangements for semiconductor devices not otherwise provided for, e.g. in combination with batteries
    • H01L23/64 - Impedance arrangements
    • H01L23/642 - Capacitive arrangements
    • H01L2924/00 - Indexing scheme for arrangements or methods for connecting or disconnecting semiconductor or solid-state bodies as covered by H01L24/00
    • H01L2924/0001 - Technical content checked by a classifier
    • H01L2924/0002 - Not covered by any one of groups H01L24/00, H01L24/00 and H01L2224/00
    • H01L2924/15 - Details of package parts other than the semiconductor or other solid state devices to be connected
    • H01L2924/151 - Die mounting substrate
    • H01L2924/153 - Connection portion
    • H01L2924/1531 - Connection portion being formed only on the surface of the substrate opposite to the die mounting surface
    • H01L2924/15311 - Connection portion being formed only on the surface of the substrate opposite to the die mounting surface, being a ball array, e.g. BGA

Abstract

A method of electronic computing, and more specifically, a method of design of cache hierarchies in 3-dimensional chips, and a cache hierarchy resulting therefrom, including a physical arrangement of bits in cache hierarchies implemented in 3 dimensions such that the planar wiring required in the busses connecting the levels of the hierarchy is minimized. In this way, the data paths between the levels are primarily the vias themselves, which leads to very short, hence fast and low power busses.

Description

METHOD FOR ACHIEVING VERY HIGH BANDWIDTH BETWEEN THE LEVELS OF A CACHE HIERARCHY IN 3-DIMENSIONAL STRUCTURES, AND A 3-DIMENSIONAL STRUCTURE RESULTING THEREFROM
BACKGROUND OF THE INVENTION
Field of the Invention
[0001] The present invention generally relates to a method for electronic computing, and more specifically, to a method for the design of cache hierarchies in 3-dimensional chips, and to the 3-dimensional cache hierarchy structures resulting therefrom.
Description of the Related Art
[0002] The present invention provides a method in which a natural synergy is achieved by marrying two evolving fields of endeavor, a synergy that has not been recognized by conventional methods. [0003] First, recent work has demonstrated the viability of interconnecting two or more planes of circuits by thinning those planes (e.g., to a few hundred microns or less), etching dense via patterns in them, and then interconnecting them with metallization processes. The resulting structure is a monolithic "chip" including multiple planes of circuits. This advance is quite literally a new dimension in the scaling of circuit density. [0004] Second, as circuit density has scaled, single chips have grown to contain more and more of the computer system. Two decades ago, it was a revelation that an entire processor could fit on a single chip. At the 180 nanometer CMOS node, it became feasible to include not only the processor's Level-1 cache (L1), but, for the first time, also the next level of cache, L2, on the chip with the processor. Additionally, about a decade ago, the first single-chip multiprocessors began being produced.
[0005] At densities facilitated at the 90 nanometer node and beyond, together with the aforementioned ability to create 3-dimensional structures, single-chip systems of the future will contain not only multiple processors, but also multiple levels of the cache hierarchy.
[0006] The access time of a cache is determined, to a large extent, by its area. Therefore, a processor's Level-1 cache (L1), which is integral to the processor pipeline itself, is kept small so that its access time is commensurate with the processor speed, which today can be several Gigahertz. Because the L1 is small, it cannot contain all of the data that will be used by the processor when running programs. When the processor references a datum that is not contained within the L1, this is called an L1 "cache miss."
[0007] In the event of an L1 miss, the reference is forwarded to the next level in the hierarchy (say, L2) to determine whether the datum is there. [0008] If the requested datum is in the L2 cache, then data (including the datum that was specifically referenced) are moved from the L2 cache to the L1 cache, and the original reference is satisfied.
[0009] If the referenced datum is not in the L2 cache, then this is also a "cache miss" (an L2 "cache miss"), and the reference continues to percolate up the hierarchy (e.g., say, to L3 and above). By convention, higher levels in a cache hierarchy are physically larger (and hence, hold more data), and thus, progressively slower.
[0010] Data in a cache are stored in "lines," which are contiguous chunks of data (i.e., being a power-of-2 number of bytes long, aligned on boundaries corresponding to this size). Thus, when a cache miss is serviced, it is not merely the specific datum that was requested that is moved down the hierarchy. Instead, the entire cache line containing the requested datum is moved. Data are stored in "lines" for two reasons.
[0011] First, each entry in a cache has a corresponding directory entry (e.g., in a cache directory) that contains information about the cache entry. If each byte in a cache were given its own entry, there would be a prohibitive number of directory entries (i.e., equal to the number of bytes in the cache), making the administrative overhead for the cache (the directory) huge. Thus, instead, there is one directory entry per cache line (which is typically between 32-256 bytes today, depending on the processor).
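For illustration, the following sketch (in Python, not part of the original text) replays the arithmetic behind lines and directory overhead in paragraphs [0010]-[0011]: a power-of-2 line size lets an address be split into a line address and a byte offset, and one directory entry per line instead of per byte shrinks the directory by a factor equal to the line size. The 1 Megabyte capacity and 256-byte line size are illustrative values drawn from the example used later in this description.

```python
# Illustrative sketch: line/offset arithmetic and directory overhead for a cache
# whose lines are a power-of-2 number of bytes, aligned on line-size boundaries.

def split_address(addr: int, line_size: int):
    """Return (line_address, byte_offset) for a byte address."""
    assert line_size & (line_size - 1) == 0, "line size must be a power of 2"
    offset_bits = line_size.bit_length() - 1
    return addr >> offset_bits, addr & (line_size - 1)

cache_bytes = 1 << 20          # assumed 1 Megabyte cache (as in the later example)
line_size = 256                # assumed 256-byte lines (the text cites 32-256 bytes)

line_addr, offset = split_address(0x12345, line_size)
print(f"line address = {line_addr:#x}, offset within line = {offset}")

# One directory entry per byte would need 1M entries; one per 256-byte line needs 4K.
print("entries per byte:", cache_bytes)
print("entries per line:", cache_bytes // line_size)
```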
[0012] Second, program reference patterns exhibit what is called "spatial locality of reference," meaning that if a particular datum is referenced, then it is very likely that other data that are physically proximate (e.g., by address) to the referenced datum will also be referenced.
Thus, by bringing in an entire cache line, more of the spatial context of the program is captured, thereby reducing the number of misses.
[0013] The bandwidth between levels in a hierarchy is equal to the amount of data that is moved per unit of time. It is noted that the bandwidth includes both necessary movement (e.g., the data that are actually used) and unnecessary movement (e.g., data that are not ever referenced).
[0014] To achieve high performance in a processor, it is important not to take many misses, since misses are a dominant component of delay. One method of reducing the number of misses incurred is to anticipate what might be used by a running program, and to "prefetch" that data down the hierarchy before it is referenced. In this way, if the program actually does reference what was anticipated, there is no miss. However, the more a processor speculates about what might be referenced (so as to prefetch), the more unnecessary movement takes place, since some of what is anticipated will be wrong. This means that facilitating higher performance by the elimination of misses will require more bandwidth.
[0015] The actual bandwidth used (as defined above) is the amount of data moved per unit time. If the program runs fast, the unit of time will be shorter, and hence the bandwidth higher. Notice the distinction between the actual bandwidth and the "bandwidth capacity," which is equal to the maximum amount of data that could be moved if the busses were 100% utilized. Bandwidth capacity is equal to the width of the bus (in bytes) times the bus frequency. [0016] For example, if an 8-byte bus runs at 1 Gigahertz, then the bandwidth capacity of the bus is 8 Gigabytes per second. It is also noted that if the processor runs at 2 Gigahertz, then the bus runs at half the speed of the processor. And if the cache line size is 128 bytes, then the 8-byte bus requires 16 bus cycles (which is 2 X 16 = 32 processor cycles) to move a cache line. Some of the data moved during these 32 processor cycles is useless. Further, if a subsequent miss occurs during the large window in which this cache line is being moved, the subsequent miss can be further delayed by the bus transfer that is already in progress. [0017] For this reason, it is important to have a bandwidth capacity that is much larger than the actual bandwidth demand. Very high bandwidth facilitates two things that are crucial to high performance computing.
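The bandwidth-capacity arithmetic of paragraphs [0015]-[0016] can be made concrete with a short calculation. The sketch below (illustrative, not part of the original text) simply reproduces the 8-byte bus, 1 Gigahertz bus clock, 2 Gigahertz processor clock, and 128-byte line quoted above.

```python
# Bandwidth capacity = bus width (bytes) x bus frequency, per paragraph [0015].
bus_width_bytes = 8
bus_hz = 1e9                   # 1 Gigahertz bus
cpu_hz = 2e9                   # 2 Gigahertz processor
line_bytes = 128

capacity_bytes_per_s = bus_width_bytes * bus_hz
bus_cycles_per_line = line_bytes // bus_width_bytes
cpu_cycles_per_line = bus_cycles_per_line * int(cpu_hz // bus_hz)

print(f"bandwidth capacity: {capacity_bytes_per_s / 1e9:.0f} GB/s")   # 8 GB/s
print(f"bus cycles to move a line: {bus_cycles_per_line}")            # 16
print(f"processor cycles to move a line: {cpu_cycles_per_line}")      # 32
```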
[0018] First, very high bandwidth allows cache lines to be transferred very quickly so that the transfers will not interfere with other miss traffic in the system. (For example, if the bus above were 128 bytes wide and ran at 3 Gigahertz, it could transfer the cache line in a single processor cycle.)
[0019] Second, having an ample surplus of bandwidth capacity facilitates operations like prefetching, which will place a much higher bandwidth demand on the system. [0020] The reason that bus widths tend to be much narrower than cache line sizes (e.g., 8 bytes instead of 128 bytes) is that planar wiring capability is limited. Because these busses tend to be long, they also tend to be wired in high-level (e.g., relatively thick or fat) wire to minimize resistance so as to maximize speed. Wide busses (e.g., much wider than 8 bytes) would impose considerable blockages on the upper levels of metal, so they are generally not used.
[0021] Busses tend to be slower than processors (e.g., 1 Gigahertz instead of 2 Gigahertz) because they are too long (e.g., 5-10 millimeters or more) since they are connecting large aerial structures (caches) in a plane.
SUMMARY OF THE INVENTION
[0022] In view of the foregoing and other exemplary problems, drawbacks, and disadvantages of the conventional methods and structures, an exemplary feature of the present invention is to provide a method which maximizes the number of interconnections between the levels in a cache hierarchy so as to maximize the bus widths, and to provide the cache hierarchy resulting from such a novel method.
[0023] Particularly, the present invention provides a method in which a natural synergy is achieved by marrying two evolving fields of endeavor, a synergy that has not been recognized by conventional methods. [0024] An exemplary feature of the present invention is to minimize the bus wire lengths between the levels in a cache hierarchy, so as to maximize the bus speed and to minimize the energy required per transfer.
[0025] In a first aspect of the present invention, a method for arranging bits within a cache hierarchy implemented on multiple physical planes, such that horizontal wiring distances in intra-level busses are minimized, includes physically partitioning each cache level into cache islands, each cache island including a subset of congruence classes, wherein the partitioning is performed in correspondence across cache levels such that a "cache island" containing specific congruence classes at one cache level is directly above or below the corresponding cache island which contains the corresponding congruence classes of an adjacent cache level, i.e., physically positioning each cache island directly over the corresponding cache islands of different cache levels.
[0026] In yet another aspect of the present invention, a method of deploying computing infrastructure in which recordable, computer-readable code is integrated into a computing system, and combines with the computing system for arranging bits within a cache hierarchy implemented on multiple physical planes such that horizontal wiring distances in intra-level busses are minimized, includes physically partitioning each cache level into cache islands, each cache island including a subset of congruence classes, wherein the partitioning is performed in correspondence across cache levels such that the congruence classes within a cache island at one cache level correspond to the same congruence classes within a corresponding cache island at another cache level, and physically positioning each cache island directly over the corresponding cache islands of different cache levels.
[0027] In another aspect of the present invention, a computer-readable medium tangibly embodies a program of machine-readable instructions executable by a digital processing apparatus to cause a computer to perform the exemplary methods described herein.
[0028] In the present invention, a method is provided for the physical arrangement of the bits in cache hierarchies implemented in three (3) dimensions, so as to minimize the planar wiring required in the busses connecting the levels of the hierarchy. In this way, the data paths between the levels are primarily determined by the vias themselves. This leads to very short (and hence fast), low-power busses.
[0029] In the present invention, the bits within each level of a cache hierarchy are arranged such that transfers between levels in the hierarchy do not involve very much of the planar wiring in any level. The present invention restricts bit placement within the plane by grouping the bits in accordance with two (orthogonal) criteria such that the movement between cache levels is nearly completely vertical in a 3-dimensional structure.
[0030] For example, the present inventors have recognized that an emerging technology is the ability to make 3-dimensional structures by thinning chips, laminating them together, and interconnecting them with through-vias in an areal grid. Notably, such structures can provide very wide interconnecting busses.
[0031] The present invention provides for interconnecting multiple planes of memory together on a wide vertical bus to provide unprecedented amounts of bandwidth between adjacent levels in a cache hierarchy. The present invention provides a method for organizing the bits within the arrays of the adjacent levels so that minimal x-y wiring is required to transfer data. In this way, the wide bus can be almost entirely vertical, and also can be very short (e.g., a few hundred microns). This provides a bus that can run at much higher speed, and much lower power, than conventional 2-dimensional busses.
BRIEF DESCRIPTION OF THE DRAWINGS
[0032] The foregoing and other exemplary purposes, aspects and advantages will be better understood from the following detailed description of exemplary aspects of the invention with reference to the drawings, in which: [0033] Figure 1 exemplarily illustrates a cache hierarchy showing an L1.5 SRAM plane and an L2 including two planes of eDRAM;
[0034] Figure 2 logically illustrates cache geometries for L1.5 and L2 caches, according to exemplary aspects of the present invention;
[0035] Figure 3 exemplarily illustrates a bitline structure in which 32 bits in each of the eDRAM planes can drive the corresponding 8 bits on the SRAM plane;
[0036] Figures 4A-4B exemplarily illustrate a multicore example of 8 processors, each with a private cache hierarchy and some intra-processor interconnection;
[0037] Figure 5 illustrates an exemplary method according to the present invention; and
[0038] Figure 6 illustrates another exemplary method according to the present invention.
DETAILED DESCRIPTION OF EXEMPLARY ASPECTS OF THE INVENTION
[0039] Referring now to the drawings, and more particularly to Figures 1-6, there are shown exemplary aspects of the method and structures according to the present invention. [0040] For the purposes of illustrating the principal concepts of the present invention, the discussion herein will first focus on a simple 2-level hierarchy comprising an L1.5 cache and an L2 cache.
As a rule of thumb, adjacent levels in a hierarchy have a capacity difference of roughly an order of magnitude. In this example, the L1.5 is taken to be 1 Megabyte, and the L2 is 8 Megabytes.
[0041] First, these caches can be arranged so that they are in "aerial correspondence." That is, the areas of the two caches can be made approximately the same. One of them can be placed on top of the other. This is preferred (e.g., required) so that when moving data between the two levels, large horizontal movements are not required. Preferably, substantially only vertical movements will be needed. Then, the bits of the two caches will be arranged so that the physically corresponding bits (between the two levels) are in identical x-y (planar) proximities.
This will eliminate the need for much x-y (planar) wiring in the intra-planar communication paths.
[0042] Although the capacity of the L2 is (in this case) 8X larger than the capacity of the L1.5, the invention preferably makes them aerially commensurate by using different circuit densities, and by using multiple planes of storage.
[0043] For example, let the L1.5 be implemented in 6T SRAM, and let the L2 be implemented in embedded DRAM (eDRAM). For argument's sake, assume that eDRAM offers a 4X density improvement over SRAM. To facilitate an 8X capacity difference then requires two planes of eDRAM to implement the L2 corresponding to a single plane of SRAM for the L1.5. Figure 1 shows this basic arrangement.
[0044] That is, Figure 1 illustrates a cache hierarchy structure 100 showing an L1.5 SRAM plane 110 (having, for example, 1 MB of SRAM) and an L2 cache 120 comprising two planes of eDRAM 120A, 120B (each having, for example, 4 MB).
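The plane count in paragraph [0043] follows from a single division; the one-line sketch below (illustrative only) uses the 8X capacity ratio and the assumed 4X eDRAM density advantage stated above.

```python
# Planes of the denser memory needed so the larger level occupies roughly the
# same footprint as the smaller level (assumptions as stated in [0042]-[0043]).
capacity_ratio = 8      # L2 is 8X the capacity of the L1.5
density_ratio = 4       # eDRAM assumed 4X denser than SRAM

planes_of_edram = capacity_ratio // density_ratio
print(planes_of_edram)  # 2 planes of eDRAM over 1 plane of SRAM
```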
[0045] Assuming a 0.65 square micron SRAM cell with a 75% array efficiency, 1 Megabyte of SRAM is approximately 8.2 square millimeters. Using this number, a total area in the neighborhood of 10 square millimeters can be aimed for (after having de-rated the array efficiency to account for rearranging the bits, and after having put in an aerial array of vias). [0046] In this exemplary embodiment, a line size of 256 bytes is chosen. The 1 Megabyte L1.5 then has 4K lines, and the 8 Megabyte L2 has 32K lines. The set associativity of the L1.5 is chosen as 8-way, and the set associativity of the L2 is chosen as 16-way. This makes the number of congruence classes in the L1.5 be 4K / 8 = 512, and the number of congruence classes in the L2 be 32K / 16 = 2K. These geometries are shown in Figure 2.
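The geometry numbers in paragraph [0046] follow mechanically from the chosen capacities, line size, and set associativities; the short sketch below (illustrative, not part of the original text) re-derives them, along with the SRAM area estimate of paragraph [0045], using 9 bits per byte as elsewhere in this description.

```python
# Derive the L1.5/L2 geometries of Figure 2 from the parameters chosen in [0045]-[0046].
line_bytes = 256

l15_bytes, l15_assoc = 1 << 20, 8          # L1.5: 1 Megabyte, 8-way set associative
l2_bytes, l2_assoc = 8 << 20, 16           # L2: 8 Megabytes, 16-way set associative

l15_lines = l15_bytes // line_bytes        # 4K lines
l2_lines = l2_bytes // line_bytes          # 32K lines
l15_classes = l15_lines // l15_assoc       # 512 congruence classes
l2_classes = l2_lines // l2_assoc          # 2K congruence classes

# Area estimate for the SRAM plane, using the 0.65 um^2 cell and 75% array
# efficiency quoted in [0045], with 9 bits per byte.
cell_um2, efficiency = 0.65, 0.75
sram_mm2 = l15_bytes * 9 * cell_um2 / efficiency / 1e6

print(l15_lines, l2_lines, l15_classes, l2_classes)   # 4096 32768 512 2048
print(f"~{sram_mm2:.1f} mm^2 of SRAM")                # ~8.2 mm^2, matching [0045]
```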
[0047] That is, Figure 2 logically illustrates exemplary cache geometries 210, 220 for the L1.5 and L2 caches, respectively. [0048] Because this is a composite L1.5/L2 structure, a combined L1.5/L2 directory is constructed in SRAM and placed on the SRAM plane, although this choice is arbitrary and not essential to the present invention. Basically, a "combined" directory is really an L2 directory (having 32K entries), where each entry has an additional field to specify its status (MESI) in the L1.5, in the case that the corresponding data is resident in the L1.5. This was estimated to take about 1.6 square millimeters, assuming 8 bytes for a directory entry.
[0049] On a directory search, if there is an L1.5 miss and an L2 hit, then the directory should identify the line in the L2 that is to be brought into the L1.5, and it should choose the location (set within congruence class) in the L1.5 into which it is to be placed. Therefore, it should select one of 32K lines in the L2, which it can do by transmitting as few as 15 bits up to the L2 plane, or as many as 32K select signals, depending on how many vias are desired for use, and where one wishes to place decoders. The choice here is arbitrary (with regard to teaching this invention), albeit important to the goals of the design. The point is that the number of vias used for this can be small. [0050] Initially, a goal is to maximize the width of the data bus. For a 256-byte line, were a line to be moved in a single transfer, it would require 2304 signals (256 bytes X 9 bits per byte). Using one signal via per power via requires a total count of about 5000 vias to do this (together with address selects and various control signals). Using a via size of 16 square microns yields a total via area of 0.08 square millimeters. So in fact, if allowed a total via area overhead of 20% (of the 10 square millimeter area target), 32 such busses could be placed in use, and as many as 32 cache lines (this is 8 Kilobytes) could be moved simultaneously in an area of about 12 square millimeters. [0051] While this amount of bandwidth capacity is staggering by today's standards, the question remains as to how to route this many signals (say 2500 X 32 = 80K) in a feasible way, and so that those wires are very short.
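Paragraphs [0050]-[0051] bundle several small calculations (signals per line, vias per bus, via area, and the number of busses). The sketch below (illustrative, not part of the original text) replays those figures; the rounding to "about 12 square millimeters" is the description's own.

```python
# Via-count and via-area arithmetic from [0050]-[0051].
line_bytes, bits_per_byte = 256, 9
signals_per_line = line_bytes * bits_per_byte        # 2304 data signals per 256-byte line
vias_per_bus = 5000                                  # one power via per signal via, plus
                                                     # address selects and controls ([0050])
via_um2 = 16.0                                       # 16 um^2 per via
bus_via_mm2 = vias_per_bus * via_um2 / 1e6           # 0.08 mm^2 of via area per bus

busses = 32
total_via_mm2 = busses * bus_via_mm2                 # ~2.6 mm^2 for 32 busses
total_mm2 = 10.0 + total_via_mm2                     # ~12 mm^2, as estimated in [0050]
bytes_in_flight_kb = busses * line_bytes / 1024      # 8 Kilobytes moving at once
total_signal_vias = busses * 2500                    # ~80K signal vias ([0051])

print(signals_per_line, total_via_mm2, total_mm2, bytes_in_flight_kb, total_signal_vias)
```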
[0052] The first observation made is that, of all of the bits in the L2 (8 Megabytes X 9 bits per byte = 72 Megabits) and of all of the bits in the L1.5 (9 Megabits), it is not the case that any of the 72 Megabits in the L2 can be moved to any of the 9 Megabits in the L1.5. In fact, for any particular bit in the L1.5, there are only 64 bits in the L2 that can map into that particular bit. This is true for two independent reasons, as described below. The present invention physically arranges the bits of the L1.5 and L2 so that only those bits of the L2 that can map to a bit in the L1.5 are above it. [0053] A first (and trivial) reason for this mapping limitation is that each bit in a cache line occupies a unique position. That is, a line is a contiguous and ordered group of 256 bytes, and each of those bytes is a contiguous and ordered group of 9 bits. Therefore, there are 2304 unique bit positions in a cache line. When a line is moved from the L2 to the L1.5, bit 0 of the L2 line is placed into bit position 0 in the L1.5 line, bit 1 of the L2 line is placed into bit position 1 in the L1.5 line, and so on. Bit 0 (and all other bits) can only be placed into its corresponding bit position. It cannot be placed into any other bit position.
[0054] Secondly, and less obviously, for any congruence class in the L1.5, there are only a handful of congruence classes in the L2 that can map to it. Specifically, if there are n1 congruence classes in the L1.5, and n2 congruence classes in the L2, then only n2/n1 L2 congruence classes can map to a given L1.5 congruence class. In the example being shown, the numbers are 2K / 512 = 4 L2 congruence classes per L1.5 congruence class. Further, those L2 congruence classes cannot map to any other L1.5 congruence class. [0055] To put it differently, the L1.5/L2 combination can be viewed as 512 (the number of L1.5 congruence classes) independent L1.5/L2 sub-caches, each in its own little island with its own bus between the L1.5 and L2. Each subcache is an L1.5 congruence class and the set of L2 congruence classes that map to it. Therefore, the 1M L1.5 / 8M L2 aggregation can be thought of as 512 independent 2K L1.5 / 16K L2 aggregations, and each of those has 2304 independent bit positions. [0056] It is noted that a line coming from the L2 (which can reside in any of the 16 sets of the 4 congruence classes of its subcache) can be placed into any of the 8 sets of the L1.5 congruence class. If the cache is partitioned down to the bit position, it means that there are 16 X 4 = 64 bits in the L2 that can map to any of 8 bits (the set associativity of the L1.5) in the L1.5. Thus, conceivably, the smallest "island" in this cache would correspond to an 8-bit island of the L1.5 residing underneath 2 (recall that there are 2 eDRAM planes) 32-bit islands of the eDRAM. [0057] This situation is shown in Figure 3. That is, Figure 3 exemplarily illustrates a bitline structure 300 in which 32 bits in each of the eDRAM planes 310, 320 can drive the corresponding 8 bits on the SRAM plane 330. [0058] In Figure 3, the data path is shown as having independent signals for each eDRAM plane. For a fixed number of vias, this would imply half the bus width (per line) than what was described above, although in this configuration, two lines could be moved at the same time (e.g., one from each eDRAM plane). Alternatively, a single via could be used, and only one plane could be selected at a time for transmission.
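The mapping restriction in paragraphs [0054]-[0056] reduces to simple arithmetic on congruence classes. The sketch below (illustrative) shows how each L1.5 congruence class, together with the n2/n1 L2 congruence classes that map to it, forms an independent sub-cache island, and how many L2 bits can reach a given L1.5 bit position. The particular modulo rule used to pick which L2 classes map to which L1.5 class is an assumption for illustration; the description does not specify the index derivation.

```python
# Sub-cache "islands" implied by [0054]-[0056]: with n1 L1.5 congruence classes and
# n2 L2 congruence classes, exactly n2/n1 L2 classes map to each L1.5 class.
n1, n2 = 512, 2048                       # L1.5 and L2 congruence classes
l15_assoc, l2_assoc = 8, 16              # set associativities

l2_classes_per_l15_class = n2 // n1      # 4

def l15_class_of(l2_class: int) -> int:
    # Assumed mapping for illustration: L2 class c maps to L1.5 class c mod n1.
    return l2_class % n1

sub_caches = {c: [c + k * n1 for k in range(l2_classes_per_l15_class)] for c in range(n1)}
assert all(l15_class_of(x) == c for c, xs in sub_caches.items() for x in xs)

# For one bit position, the L2 bits that can map to one of the 8 L1.5 bits:
l2_bits_per_position = l2_classes_per_l15_class * l2_assoc   # 4 * 16 = 64
l15_bits_per_position = l15_assoc                             # 8

print(l2_classes_per_l15_class, l2_bits_per_position, l15_bits_per_position)
print(sub_caches[0])   # L2 classes forming island 0 under the assumed mapping
```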
[0059] While these particular partitionings are too small to lead to good array efficiencies, they demonstrate that the total x-y (planar) motion required for an L2-L1.5 transfer is very small indeed, i.e., it can be made almost completely vertical. It is noted that since the vertical distance is small, e.g., 400 microns for the three 200-micron thick planes shown here, the transfer speeds can be much faster than what would be achievable in several millimeters of horizontal wiring (i.e., than what is possible in a plane). Additionally, since the capacitance is very small, the energy required per transfer is also much smaller.
[0060] Assuming that a real design would trade off array efficiency against the size of the islands (which determines the length of horizontal wire required in a transfer), the table below shows several ways to partition the caches, and the busses that result using 80K signal vias:
[Table (rendered as an image in the original publication): partitionings of the 512 congruence classes into islands; the first column gives the congruence classes per island, with the remaining columns giving the number of islands, the per-island bus width, the cycles needed to move a 256-byte line, and the number of lines in flight, all for 80K signal vias.]
[0061] The last row in the table basically represents the situation implied above, using 80K signal vias. That is, for a 256-byte bus width, we can transfer a 256-byte line in a single cycle. With our 80K signal vias, we can have 32 such busses, and hence can have 32 lines in flight simultaneously. Partitioning the 512 congruence classes into 32 islands (each with its own bus) would put 16 congruence classes into each island, as shown in the first column. [0062] On the other end of the spectrum, the first row shows that if one partitioned the cache into 512 islands, each having a single congruence class, then with 80K signal vias, each of the islands could only have a 16-byte bus, and it would take 16 cycles to move a 256-byte line. In this case, 512 lines can be in flight at the same time, but it takes longer to move a line. [0063] Each row in the table shows a different partitioning, but each of the partitionings has the same bandwidth capacity: 8 Kilobytes per cycle (a sketch that regenerates this trade-off follows this passage). [0064] While the above descriptions pertain to a single L1.5 / L2 pair, the question arises as to how this same reasoning applies to a multicore system. Figures 4A-4B illustrate a multicore example of 8 processors, each with a private hierarchy and some intra-processor interconnection. [0065] That is, Figures 4A-4B show a generalization of a processor chip in which there are 8 processors, each with a private L1.5 cache. Figure 4A shows the 8 processors with the L1.5s. Of course, the area is dominated by the L1.5s (which is not apparent in the conceptual view shown).
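The partitioning trade-off of paragraphs [0061]-[0063] can be regenerated from the two constraints it embodies: roughly 80K signal vias shared across all islands (8 Kilobytes per cycle in aggregate) and 256-byte lines. In the sketch below (illustrative, not part of the original text), only the 512-island and 32-island endpoints are stated explicitly in the description; the intermediate rows are interpolations under those same constraints.

```python
# Regenerate the island-partitioning trade-off of [0060]-[0063] from its constraints:
# ~80K signal vias (8 Kilobytes per cycle in aggregate) and 256-byte (2304-bit) lines.
total_congruence_classes = 512
line_bytes = 256
total_bus_bytes_per_cycle = 32 * line_bytes          # 8 Kilobytes/cycle in every row

for islands in (512, 256, 128, 64, 32):              # each island gets its own bus
    classes_per_island = total_congruence_classes // islands
    bus_bytes = total_bus_bytes_per_cycle // islands  # per-island bus width
    cycles_per_line = line_bytes // bus_bytes
    print(f"{classes_per_island:3d} classes/island  {islands:3d} islands  "
          f"{bus_bytes:3d}-byte bus  {cycles_per_line:2d} cycles/line  "
          f"{islands:3d} lines in flight")
```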
[0066] While any number of processors is acceptable, the reason that 8 was chosen in this example is that the chip can be partitioned into 9 similar regions, with the outside 8 regions each holding a processor and its cache, where the central region can be used for a switch, to handle traffic between the caches. [0067] Figure 4B shows that same 8-processor chip on the bottom of a stack of 4 memory chips (e.g., eDRAM) 440A-440D, although any number is feasible. Just as the bottom processor chip is partitioned into 9 physical regions, each of the memory chips above is so partitioned as well (although this isn't explicitly shown). Essentially, this system is just 8 copies of the single system described above, with some switching (at least on the processor chip, but perhaps on the memory chips as well) to facilitate intra-processor communications.
[0068] Figure 5 shows an exemplary method 500 according to the present invention for arranging bits within a cache hierarchy implemented on multiple physical planes such that horizontal wiring distances in intra-level busses are minimized. As shown in Figure 5, the method can include co-designing physical structures of cache levels for optimizing interconnections between logically adjacent levels of the cache hierarchy, wherein the cache levels are physically positioned over each other. [0069] Figure 6 shows an exemplary method 600 according to the present invention for arranging bits within a cache hierarchy implemented on multiple physical planes such that horizontal wiring distances in intra-level busses are minimized. As shown in Figure 6, the method can include physically partitioning each cache level into cache islands (e.g., step 610). It is noted that each cache island preferably can include a subset of congruence classes. It is also noted that the partitioning is performed in correspondence across cache levels such that the congruence classes within a cache island at one cache level map to the same congruence classes of a corresponding cache island at a different cache level. The method 600 also includes physically positioning each cache island directly over the corresponding cache islands of different cache levels (e.g., step 620); a sketch illustrating these two steps appears at the end of this description. [0070] While the invention has been described in terms of several exemplary aspects, those skilled in the art will recognize that the invention can be practiced with modification within the spirit and scope of the appended claims. [0071] Further, it is noted that Applicant's intent is to encompass equivalents of all claim elements, even if amended later during prosecution.
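As a final illustration of method 600, the sketch below models step 610 (partitioning each level into islands of congruence classes) and step 620 (placing corresponding islands at the same x-y site on stacked planes). The data structures, the even split of classes among islands, and the treatment of the two eDRAM planes as one logical L2 level are illustrative assumptions; the description only requires that corresponding islands sit directly above or below one another.

```python
from dataclasses import dataclass

# Illustrative model of method 600: step 610 partitions each level into islands of
# congruence classes; step 620 stacks corresponding islands at the same x-y site.

@dataclass
class Island:
    level: str          # e.g., "L1.5" or "L2"
    plane: int          # physical plane of the stack holding this island
    site: int           # x-y site index shared by corresponding islands
    classes: list       # congruence classes held by this island

def partition_and_stack(levels, islands_count):
    """levels: dict of level name -> (congruence classes, plane). Returns Islands."""
    placed = []
    for name, (n_classes, plane) in levels.items():
        per_island = n_classes // islands_count
        for site in range(islands_count):
            # Step 610: assign a block of classes to island `site` (the precise
            # class-to-island assignment is an illustrative choice; the requirement
            # is that corresponding classes across levels share a site).
            classes = list(range(site * per_island, (site + 1) * per_island))
            # Step 620: the shared `site` index models vertical correspondence.
            placed.append(Island(name, plane, site, classes))
    return placed

# Example: the 1 MB L1.5 on plane 0 and the 8 MB L2 (its two eDRAM planes modeled
# as one logical level on plane 1), both split into 32 islands as in the last table row.
stack = partition_and_stack({"L1.5": (512, 0), "L2": (2048, 1)}, islands_count=32)
print(stack[0])    # L1.5 island at site 0
print(stack[32])   # the L2 island stacked directly over it (same site index)
```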

Claims

CLAIMS
What is claimed is:
1. A method of arranging bits within a cache hierarchy implemented on multiple physical planes such that horizontal wiring distances in intra-level busses are minimized, said method comprising: co-designing physical structures of cache levels for optimizing interconnections between logically adjacent levels of said cache hierarchy, wherein said cache levels are positioned over each other.
2. The method according to claim 1, wherein said co-designing comprises: physically partitioning each said cache level into cache islands, each said cache island including a subset of congruence classes, wherein said partitioning is performed in correspondence across cache levels such that the congruence classes within a cache island at one cache level map to same congruence classes of a corresponding cache island at a different cache level.
3. The method according to claim 1, wherein said bits within said cache hierarchy are arranged to minimize horizontal transfers between said cache levels in the cache hierarchy.
4. The method according to claim 1, wherein adjacent cache levels are arranged to include different capacities.
5. The method according to claim 1, wherein adjacent cache levels are arranged to include same capacities.
6. The method according to claim 1, wherein one of said cache levels in the cache hierarchy comprises a plurality of physical planes.
7. The method according to claim 1, wherein areas of each cache level are arranged to be substantially the same.
8. The method according to claim 2, wherein said partitioning is selected to optimize a size of each said cache island.
9. A method of deploying computing infrastructure in which recordable, computer-readable code is integrated into a computing system, and combines with said computing system to perform the method according to Claim 1.
10. A computer-readable medium tangibly embodying a program of machine readable instructions executable by a digital processing apparatus to perform a method of causing a computer to perform the method according to Claim 1.
11. A method of arranging bits within a cache hierarchy implemented on multiple physical planes such that horizontal wiring distances in intra-level busses are minimized, said method comprising: physically partitioning each cache level into cache islands, each said cache island including a subset of congruence classes, wherein said partitioning is performed in correspondence across cache levels such that the congruence classes within a cache island at one cache level map to same congruence classes of a corresponding cache island at a different cache level; and positioning each said cache island over the corresponding cache islands of different cache levels.
12. The method according to claim 11, wherein said bits within said cache hierarchy are arranged to minimize horizontal transfers between said cache levels in the cache hierarchy.
13. The method according to claim 11, wherein adjacent cache levels are arranged to include different capacities.
14. The method according to claim 11, wherein adjacent cache levels are arranged to include same capacities.
15. The method according to claim 11, wherein one of said cache levels in the cache hierarchy comprises a plurality of physical planes.
16. The method according to claim 11, wherein areas of each cache level are arranged to be substantially the same.
17. The method according to claim 11, wherein said partitioning is selected to optimize a size of each said cache island.
18. A method of deploying computing infrastructure in which recordable, computer-readable code is integrated into a computing system, and combines with said computing system to perform the method according to Claim 11.
19. A computer-readable medium tangibly embodying a program of machine readable instructions executable by a digital processing apparatus to perform a method of causing a computer to perform the method according to Claim 11.
20. A cache hierarchy implemented on multiple physical planes such that horizontal wiring distances in intra-level busses are minimized, said cache hierarchy comprising: a plurality of cache levels physically partitioned into cache islands, each said cache island including a subset of congruence classes, wherein said cache levels are partitioned in correspondence across cache levels such that the congruence classes within a cache island at one cache level map to same congruence classes of a corresponding cache island at a different cache level, and wherein said cache islands are positioned over the corresponding cache islands of different cache levels.
21. The cache hierarchy according to claim 20, wherein adjacent cache levels include different capacities.
22. The cache hierarchy according to claim 20, wherein adjacent cache levels include same capacities.
23. The cache hierarchy according to claim 20, wherein areas of each cache level are substantially the same.
24. The cache hierarchy according to claim 20, wherein one of said cache levels in the cache hierarchy comprises a plurality of physical planes.
25. The cache hierarchy according to claim 20, wherein said partitioning is selected to optimize a size of each said cache island.
26. The cache hierarchy according to claim 20, wherein said cache hierarchy is arranged to minimize horizontal transfers between said cache levels in the cache hierarchy.
27. The cache hierarchy according to claim 20, wherein data paths between said cache levels comprise vias.
28. A design tool for designing a cache hierarchy according to claim 20, which is implemented on multiple physical planes such that horizontal wiring distances in intra-level busses are minimized.
29. A design tool comprising: a cache hierarchy according to claim 20, which is implemented on multiple physical planes such that horizontal wiring distances in intra-level busses are minimized.
30. A computer system, comprising: a cache hierarchy implemented on multiple physical planes such that horizontal wiring distances in intra-level busses are minimized, said cache hierarchy comprising: a plurality of cache levels physically partitioned into cache islands, each said cache island including a subset of congruence classes, wherein said cache levels are partitioned in correspondence across cache levels such that the congruence classes within a cache island at one cache level map to same congruence classes of a corresponding cache island at a different cache level, and wherein said cache islands are positioned over the corresponding cache islands of different cache levels.
31. A computer system according to claim 30, wherein said cache islands are physically positioned directly over the corresponding cache islands of different cache levels.
PCT/US2007/071370 2006-06-16 2007-06-15 Method for achieving very high bandwidth between the levels of a cache hierarchy in 3-dimensional structures, and a 3- dimensional structure resulting therefrom WO2008100324A2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN2007800188856A CN101473436B (en) 2006-06-16 2007-06-15 Method for achieving very high bandwidth between the levels of a cache hierarchy in 3-dimensional structures, and a 3-dimensional structure resulting therefrom
EP07863368A EP2036126A2 (en) 2006-06-16 2007-06-15 Method for achieving very high bandwidth between the levels of a cache hierarchy in 3-dimensional structures, and a 3-dimensional structure resulting therefrom

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US11/453,885 2006-06-16
US11/453,885 US7616470B2 (en) 2006-06-16 2006-06-16 Method for achieving very high bandwidth between the levels of a cache hierarchy in 3-dimensional structures, and a 3-dimensional structure resulting therefrom
US11/538,567 2006-10-04
US11/538,567 US7518225B2 (en) 2006-06-16 2006-10-04 Chip system architecture for performance enhancement, power reduction and cost reduction

Publications (3)

Publication Number Publication Date
WO2008100324A2 WO2008100324A2 (en) 2008-08-21
WO2008100324A9 true WO2008100324A9 (en) 2009-05-22
WO2008100324A3 WO2008100324A3 (en) 2011-01-13

Family

ID=38860723

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2007/071370 WO2008100324A2 (en) 2006-06-16 2007-06-15 Method for achieving very high bandwidth between the levels of a cache hierarchy in 3-dimensional structures, and a 3- dimensional structure resulting therefrom

Country Status (4)

Country Link
US (3) US7616470B2 (en)
EP (1) EP2036126A2 (en)
CN (1) CN101473436B (en)
WO (1) WO2008100324A2 (en)

Families Citing this family (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7602062B1 (en) * 2005-08-10 2009-10-13 Altera Corporation Package substrate with dual material build-up layers
JP4824397B2 (en) * 2005-12-27 2011-11-30 イビデン株式会社 Multilayer printed wiring board
US8110899B2 (en) * 2006-12-20 2012-02-07 Intel Corporation Method for incorporating existing silicon die into 3D integrated stack
US8032711B2 (en) * 2006-12-22 2011-10-04 Intel Corporation Prefetching from dynamic random access memory to a static random access memory
US20080237738A1 (en) * 2007-03-27 2008-10-02 Christoph Andreas Kleint Integrated circuit, cell, cell arrangement, method for manufacturing an integrated circuit, method for manufacturing a cell arrangement; memory module
US9229887B2 (en) * 2008-02-19 2016-01-05 Micron Technology, Inc. Memory device with network on chip methods, apparatus, and systems
US7978721B2 (en) 2008-07-02 2011-07-12 Micron Technology Inc. Multi-serial interface stacked-die memory architecture
US8086913B2 (en) 2008-09-11 2011-12-27 Micron Technology, Inc. Methods, apparatus, and systems to repair memory
US20100078788A1 (en) 2008-09-26 2010-04-01 Amir Wagiman Package-on-package assembly and method
JP2010108204A (en) * 2008-10-30 2010-05-13 Hitachi Ltd Multichip processor
US8417974B2 (en) * 2009-11-16 2013-04-09 International Business Machines Corporation Power efficient stack of multicore microprocessors
US9123552B2 (en) 2010-03-30 2015-09-01 Micron Technology, Inc. Apparatuses enabling concurrent communication between an interface die and a plurality of dice stacks, interleaved conductive paths in stacked devices, and methods for forming and operating the same
US8466543B2 (en) 2010-05-27 2013-06-18 International Business Machines Corporation Three dimensional stacked package structure
US8299608B2 (en) 2010-07-08 2012-10-30 International Business Machines Corporation Enhanced thermal management of 3-D stacked die packaging
KR20120079397A (en) * 2011-01-04 2012-07-12 삼성전자주식회사 Stacked semiconductor device and manufacturing method thereof
US8569874B2 (en) 2011-03-09 2013-10-29 International Business Machines Corporation High memory density, high input/output bandwidth logic-memory structure and architecture
KR20140109914A (en) * 2011-12-01 2014-09-16 컨버전트 인텔렉츄얼 프로퍼티 매니지먼트 인코포레이티드 Cpu with stacked memory
CN102662909B (en) * 2012-03-22 2013-12-25 东华理工大学 Three-dimensional many-core system on chip
US8891279B2 (en) 2012-09-17 2014-11-18 International Business Machines Corporation Enhanced wiring structure for a cache supporting auxiliary data output
US9378793B2 (en) * 2012-12-20 2016-06-28 Qualcomm Incorporated Integrated MRAM module
US9037791B2 (en) 2013-01-22 2015-05-19 International Business Machines Corporation Tiered caching and migration in differing granularities
US9336144B2 (en) * 2013-07-25 2016-05-10 Globalfoundries Inc. Three-dimensional processing system having multiple caches that can be partitioned, conjoined, and managed according to more than one set of rules and/or configurations
CN107564825B (en) * 2017-08-29 2018-09-21 睿力集成电路有限公司 A kind of chip double-side encapsulating structure and its manufacturing method
CN107564881B (en) * 2017-08-29 2018-09-21 睿力集成电路有限公司 A kind of chip stack stereo encapsulation structure and its manufacturing method
FR3082656B1 (en) 2018-06-18 2022-02-04 Commissariat Energie Atomique INTEGRATED CIRCUIT COMPRISING MACROS AND ITS MANUFACTURING METHOD
CN110540164A (en) * 2019-10-09 2019-12-06 太仓全众智能装备有限公司 Bottle type buffer memory machine
EP4071593A4 (en) * 2021-02-26 2023-08-23 Beijing Vcore Technology Co.,Ltd. Stacked cache system based on sedram, and control method and cache device
CN113096706B (en) * 2021-03-09 2023-06-16 长江先进存储产业创新中心有限责任公司 CPU and manufacturing method thereof
CN113097383B (en) * 2021-03-09 2023-07-18 长江先进存储产业创新中心有限责任公司 CPU and manufacturing method thereof
US11887908B2 (en) 2021-12-21 2024-01-30 International Business Machines Corporation Electronic package structure with offset stacked chips and top and bottom side cooling lid
CN114244920B (en) * 2021-12-29 2024-02-09 苏州盛科通信股份有限公司 New and old chip stacking head compatible method and system and chip
WO2023203435A1 (en) * 2022-04-22 2023-10-26 株式会社半導体エネルギー研究所 Semiconductor device

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5133061A (en) 1987-10-29 1992-07-21 International Business Machines Corporation Mechanism for improving the randomization of cache accesses utilizing abit-matrix multiplication permutation of cache addresses
US5502667A (en) * 1993-09-13 1996-03-26 International Business Machines Corporation Integrated multichip memory module structure
US6059835A (en) 1997-06-13 2000-05-09 International Business Machines Corporation Performance evaluation of processor operation using trace pre-processing
US6175160B1 (en) 1999-01-08 2001-01-16 Intel Corporation Flip-chip having an on-chip cache memory
US6725334B2 (en) * 2000-06-09 2004-04-20 Hewlett-Packard Development Company, L.P. Method and system for exclusive two-level caching in a chip-multiprocessor
US6678814B2 (en) * 2001-06-29 2004-01-13 International Business Machines Corporation Method and apparatus for allocating data usages within an embedded dynamic random access memory device
JP4047788B2 (en) * 2003-10-16 2008-02-13 松下電器産業株式会社 Compiler device and linker device
US7130967B2 (en) * 2003-12-10 2006-10-31 International Business Machines Corporation Method and system for supplier-based memory speculation in a memory subsystem of a data processing system
US7217994B2 (en) 2004-12-01 2007-05-15 Kyocera Wireless Corp. Stack package for high density integrated circuits
US7305523B2 (en) * 2005-02-12 2007-12-04 International Business Machines Corporation Cache memory direct intervention
US7533321B2 (en) * 2005-09-13 2009-05-12 International Business Machines Corporation Fault tolerant encoding of directory states for stuck bits
US7404041B2 (en) * 2006-02-10 2008-07-22 International Business Machines Corporation Low complexity speculative multithreading system based on unmodified microprocessor core
JP4208895B2 (en) * 2006-05-30 2009-01-14 株式会社東芝 Cache memory device and processing method

Also Published As

Publication number Publication date
WO2008100324A3 (en) 2011-01-13
WO2008100324A2 (en) 2008-08-21
US7616470B2 (en) 2009-11-10
US7518225B2 (en) 2009-04-14
US20070294479A1 (en) 2007-12-20
US20080209126A1 (en) 2008-08-28
US7986543B2 (en) 2011-07-26
CN101473436B (en) 2011-04-13
CN101473436A (en) 2009-07-01
EP2036126A2 (en) 2009-03-18
US20070290315A1 (en) 2007-12-20

Similar Documents

Publication Publication Date Title
US7986543B2 (en) Method for achieving very high bandwidth between the levels of a cache hierarchy in 3-dimensional structures, and a 3-dimensional structure resulting therefrom
Li et al. Design and management of 3D chip multiprocessors using network-in-memory
US10310976B2 (en) System and method for concurrently checking availability of data in extending memories
US8234453B2 (en) Processor having a cache memory which is comprised of a plurality of large scale integration
Madan et al. Optimizing communication and capacity in a 3D stacked reconfigurable cache hierarchy
EP2353095B1 (en) A spiral cache memory and method of operating a spiral cache memory
US10963022B2 (en) Layered super-reticle computing : architectures and methods
US20060081971A1 (en) Signal transfer methods for integrated circuits
CN102203747A (en) Storage array tile supporting systolic movement operations
Poremba et al. There and back again: Optimizing the interconnect in networks of memory cubes
US7444473B1 (en) Speculative memory accesses in a proximity communication-based off-chip cache memory architecture
CN116610630B (en) Multi-core system and data transmission method based on network-on-chip
US7496712B1 (en) Proximity communication-based off-chip cache memory architectures
Jagasivamani et al. Tileable monolithic ReRAM memory design
CN105930300A (en) Three-dimensional in-chip cache based processor structure and method for manufacturing same
US11844223B1 (en) Ferroelectric memory chiplet as unified memory in a multi-dimensional packaging
US20240078195A1 (en) Systems, methods, and devices for advanced memory technology
Daneshtalab et al. Memory-efficient logic layer communication platform for 3D-stacked memory-on-processor architectures
US11822475B2 (en) Integrated circuit with 3D partitioning
US11789641B2 (en) Three dimensional circuit systems and methods having memory hierarchies
CN115309670A (en) Memory chip, electronic device and memory system
WO2024049823A1 (en) Locality-based data processing
Franzon et al. Applications and design styles for 3DIC
TW202331519A (en) Computer system and memory management method based on wafer-on-wafer architecture
CN117690808A (en) Method for producing chip

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 200780018885.6

Country of ref document: CN

DPE2 Request for preliminary examination filed before expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2007863368

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: RU

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07863368

Country of ref document: EP

Kind code of ref document: A2