CA2298780A1 - I/O page kill definition for improved DMA and L1/L2 cache performance


Publication number
CA2298780A1
Authority
CA
Canada
Prior art keywords
cache
page
buffer
memory
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA002298780A
Other languages
French (fr)
Inventor
Gary Dean Anderson
Ronald Xavier Arroyo
Bradly George Frey
Guy Lynn Guthrie
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp
Publication of CA2298780A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806 Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0815 Cache consistency protocols
    • G06F12/0831 Cache consistency protocols using a bus scheme, e.g. with bus monitoring or watching means
    • G06F12/0835 Cache consistency protocols using a bus scheme, e.g. with bus monitoring or watching means for main memory peripheral accesses (e.g. I/O or DMA)
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 Handling requests for interconnection or transfer
    • G06F13/20 Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/28 Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 Handling requests for interconnection or transfer
    • G06F13/16 Handling requests for interconnection or transfer for access to memory bus
    • G06F13/1668 Details of memory controller
    • G06F13/1673 Details of memory controller using buffers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G06F13/40 Bus structure
    • G06F13/4004 Coupling between buses
    • G06F13/4027 Coupling between buses using bus bridges
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G06F13/42 Bus transfer protocol, e.g. handshake; Synchronisation
    • G06F13/4204 Bus transfer protocol, e.g. handshake; Synchronisation on a parallel bus
    • G06F13/4221 Bus transfer protocol, e.g. handshake; Synchronisation on a parallel bus being an input/output bus, e.g. ISA bus, EISA bus, PCI bus, SCSI bus
    • G06F13/4226 Bus transfer protocol, e.g. handshake; Synchronisation on a parallel bus being an input/output bus, e.g. ISA bus, EISA bus, PCI bus, SCSI bus with asynchronous protocol
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2213/00 Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F2213/0024 Peripheral component interconnect [PCI]

Abstract

A special 'I/O' page is defined as having a large size (e.g., 4K bytes), but with distinctive cache line characteristics. For DMA reads, the first cache line in the I/O page may be accessed, by a PCI Host Bridge, as a cacheable read and all other lines are accessed as non-cacheable (DMA Read with no intent to cache). For DMA writes, the PCI Host Bridge accesses all cache lines as cacheable. The PCI Host Bridge maintains a cache snoop granularity of the I/O page size for data, which means that if the Host Bridge detects a store (invalidate) type system bus operation on any cache line within an I/O page, cached data within that page is invalidated (L1/L2 caches continue to treat all cache lines in this page as cacheable). By defining the first line as cacheable, only one cache line need be invalidated on the system bus by the L1/L2 cache in order to cause invalidation of the whole page of data in the PCI Host Bridge. All stores to the other cache lines in the I/O Page can occur directly in the L1/L2 cache without system bus operations, since these lines have been left in the 'modified' state in the L1/L2 cache.

Description

I/O PAGE KILL DEFINITION FOR IMPROVED DMA AND L1/L2 CACHE PERFORMANCE

BACKGROUND OF THE INVENTION
Field of the Invention
The present invention relates in general to data processing systems and in particular to processing systems which pre-fetch data from a main memory and one or more cache memories. More particularly, the present invention relates to improving performance of direct memory access and cache memory.
Description of the Prior Art
In modern microprocessor systems, processor cycle time continues to decrease as technology continues to improve. Also, design techniques of speculative execution, deeper pipelines, more execution elements and the like, continue to improve the performance of processing systems. The improved performance puts a heavier burden on the system's memory interface since the processor demands data and instructions more rapidly from memory. To increase the performance of processing systems, cache memory systems are often implemented.
Processing systems employing cache memories are well known in the art. Cache memories are very high-speed memory devices that increase the speed of a data processing system by making current programs and data available to a processor ("CPU") with a minimal amount of latency. Large on-chip caches (L1 caches) are implemented to help reduce memory latency, and they are often augmented by larger off-chip caches (L2 caches). The cache serves as a storage area for cache line data. Cache memory is typically divided into "lines" with each line having an associated "tag" and attribute bits. The lines in cache memory contain copies of data from main memory. For instance, a "4K page" of data in cache may be defined as comprising 32 lines of data from memory having 128 bytes in each line.
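The arithmetic behind this example can be made concrete with a short sketch. The Python below is illustrative only; the names `PAGE_SIZE`, `LINE_SIZE`, and `locate` are hypothetical and do not appear in the specification.

```python
PAGE_SIZE = 4096  # bytes in a "4K page"
LINE_SIZE = 128   # bytes per cache line, as in the example above

LINES_PER_PAGE = PAGE_SIZE // LINE_SIZE  # 32 lines per page


def locate(address):
    """Split a byte address into (page number, line index in page, byte offset in line)."""
    page = address // PAGE_SIZE
    line = (address % PAGE_SIZE) // LINE_SIZE
    offset = address % LINE_SIZE
    return page, line, offset


print(LINES_PER_PAGE)   # 32
print(locate(0x1234))   # (1, 4, 52)
```

A line index of 0 identifies the first cache line of a page, which is the line the invention later treats specially.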
The primary advantage behind cache memory systems is that by keeping the most frequently accessed instructions and data in the fast cache memory, the average memory access time of the overall processing system will approach the access time of the cache. Although cache memory is only a small fraction of the size of main memory, a large fraction of memory requests are successfully found in the fast cache memory because of the "locality of reference" property of programs. This property holds that memory references are confined to a few localized areas of memory (in this instance, L1 and L2).
The basic operation of cache memories is well-known. When the processor needs to access memory, the cache is examined. If the word addressed by the processor is found in the cache, it is read from the fast memory. If the word addressed by the processor is not found in the cache, the main memory is accessed to read the word. A block of words containing the word being accessed is then transferred from main memory to cache memory. In this manner, additional data is transferred to cache (pre-fetched) so that future references to memory will likely find the required words in the fast cache memory.
Pre-fetching techniques are often implemented to supply memory data to the on-chip L1 cache ahead of time to reduce latency. Ideally, data and instructions are pre-fetched far enough in advance so that a copy of the instructions and data is always in the L1 cache when the processor needs it. Pre-fetching of instructions and/or data is well-known in the art.
In a system which requires high I/O Direct Memory Access (DMA) performance (e.g., graphics), a typical management of system memory data destined for I/O may be as follows:
1) A system processor produces data by doing a series of stores into a set of 4K byte (4K) page buffers in system memory space. This causes the data to be marked as 'modified' (valid in the cache, not written back to system memory) in the cache.
2) The processor initiates an I/O device to perform a DMA Read to these 4K pages as they are produced.
3) The I/O device does a series of DMA reads into system memory.
4) A PCI Host Bridge, which performs DMA operations on behalf of the I/O device, pre-fetches and caches data in a 'shared' (valid in cache, valid in system memory) state. The L1/L2 changes each data cache line from the 'modified' state to the 'shared' state as the PCI Host Bridge reads the data (i.e., L1/L2 intervenes and either supplies the data directly or 'pushes' it to memory where it can be read).
5) When the DMA device finishes, the 4K buffer is re-used (i.e., software has a fixed set of buffers that the data circulates through).
In order to maintain DMA I/O performance, a PCI Host Bridge may contain its own cache which it uses to pre-fetch/cache data in the shared state. This allows DMA data to be moved close to the data consumer (i.e., an I/O device) to maximize DMA Read performance. When the PCI Host Bridge issues a cacheable read on the system bus, the L1/L2 goes from the 'modified' to the 'shared' state. This state change produces a performance penalty when the software wants to re-use this 4K page cache space to store the new DMA data, since every line in the L1/L2 cache has been changed to the 'shared' state. In order for the new stores to take place, the L1/L2 has to perform a system bus command for each line to indicate that the line is being taken from 'shared' to 'modified.' This must occur for each cache line (there are 32) in the 4K page even though the old data is of no use (the PCI Host Bridge needs an indication that its data is now invalid). The added memory coherency traffic, 32 system bus commands, that must be done on the system bus to change the state of all these cache lines to 'modified' before the new store may be executed can degrade processor performance significantly.
It has been shown that stores to a 4K page by the processor may take 4-5 times longer when the L1/L2 cache is in the 'shared' state as opposed to being in the 'modified' state. This is due to added coherency traffic needed on the system bus to change the state of each cache line to 'modified'.
It would be desirable to provide a method and apparatus that increases the speed and efficiency of a Direct Memory Access device. It would also be desirable to provide a method and apparatus to reduce the number of system bus commands required to change state of a page of data in the L1/L2 cache.
Summary of the Invention
It is therefore one object of the present invention to provide a method and apparatus that will reduce the number of system bus commands required to change state of a buffer in an L1/L2 cache.

It is another object of the present invention to provide a method and apparatus that will increase the speed and efficiency of Direct Memory Access devices.
It is yet another object of the present invention to provide a method and apparatus that allows a cache to clear a memory buffer with one bus operation.
The foregoing objects are achieved as is now described.
A special 'I/O' page is defined as having a large size (e.g., 4K bytes), but with distinctive cache line characteristics. For DMA reads, the first cache line in the I/O page may be accessed, by a PCI Host Bridge, as a cacheable read and all other lines are accessed as non-cacheable (DMA Read with no intent to cache). For DMA writes, the PCI Host Bridge accesses all cache lines as cacheable. The PCI Host Bridge maintains a cache snoop granularity of the I/O page size for data, which means that if the Host Bridge detects a store (invalidate) type system bus operation on any cache line within an I/O page, cached data within that page is invalidated (L1/L2 caches continue to treat all cache lines in this page as cacheable). By defining the first line as cacheable, only one cache line need be invalidated on the system bus by the L1/L2 cache in order to cause invalidation of the whole page of data in the PCI Host Bridge. All stores to the other cache lines in the I/O Page can occur directly in the L1/L2 cache without system bus operations, since these lines have been left in the 'modified' state in the L1/L2 cache.
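The saving described in this summary can be illustrated with a small model. The Python below is only a sketch under stated assumptions: a 4K page of 32 lines, every line 'modified' in the L1/L2 before the DMA read, and one system bus command per 'shared' line on re-use. The function names are hypothetical and not taken from the specification.

```python
LINES_PER_PAGE = 32  # 4K page / 128-byte lines, per the background example


def l1l2_states_after_dma_read(io_page):
    """L1/L2 line states once the host bridge has DMA-read a whole page.

    Every line starts 'modified' (the processor has just stored to it).
    A cacheable read forces 'modified' -> 'shared'; a read with no intent
    to cache lets the L1/L2 keep the line 'modified'.
    """
    states = []
    for line in range(LINES_PER_PAGE):
        # I/O page: only the first line is read as cacheable.
        cacheable = (line == 0) if io_page else True
        states.append("shared" if cacheable else "modified")
    return states


def bus_commands_to_reuse(states):
    """Each 'shared' line needs one system bus command before a store."""
    return sum(1 for state in states if state == "shared")


conventional = bus_commands_to_reuse(l1l2_states_after_dma_read(io_page=False))
io_page = bus_commands_to_reuse(l1l2_states_after_dma_read(io_page=True))
print(conventional, io_page)  # 32 1
```

Under these assumptions a conventional page costs 32 bus commands to re-use, while an I/O page costs one.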
The above as well as additional objectives, features, and advantages of the present invention will become apparent in the following detailed written description.
Brief Description of the Drawings
The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives, and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
Figure 1 depicts a high-level block diagram of a data processing system in which a preferred embodiment of the present invention may be implemented;
Figure 2A is a high-level flow diagram of a method for utilizing a special DMA I/O page in accordance with a preferred embodiment of the present invention;
Figure 2B depicts a high-level flow diagram of the method for re-using the special DMA I/O page in accordance with a preferred embodiment of the present invention;
Figure 3 is a flow diagram for L1/L2 coherency procedure for performing processor stores in accordance with a preferred embodiment of the present invention;
Figure 4 depicts a high-level flow diagram of the method for utilizing a special DMA I/O page wherein PCI Host Bridge may service DMA requests in accordance with a preferred embodiment of the present invention; and
Figure 5 is a high-level flow diagram of a portion of the method for utilizing a special DMA I/O page wherein PCI Host Bridge may snoop System Bus coherency, in accordance with a preferred embodiment of the present invention.
Description of the Preferred Embodiment
With reference now to the figures, and in particular with reference to Figure 1, a multiprocessor data processing system in accordance with a preferred embodiment of the present invention is depicted. Data processing system 100 is a multiprocessor (SMP) system that includes processors 102 (only one shown), which preferably comprise one of the PowerPC™ family of processors available from International Business Machines of Armonk, New York. Although only one processor is depicted in the exemplary embodiment, those skilled in the art will appreciate that additional processors may be utilized in a multiprocessor data processing system in accordance with the present invention.
Processor 102 includes a level one (L1) cache 104. In order to minimize data access latency, one or more additional levels of cache memory may be implemented within data processing system 100, such as a level two (L2) cache 106. The lower cache level, L2, is employed to stage data to the L1 cache and typically has a larger storage capacity but a longer access latency. For example, L1 cache 104 may have a storage capacity of 32KB and an access latency of approximately 1-2 processor cycles. L2 cache 106 might have a storage capacity of 512KB but an access latency of 5 processor cycles. L2 cache 106 serves as intermediate storage between processor 102 and system memory 110, which typically has a much larger storage capacity but may have an access latency of greater than 50 processor cycles.
Both the number of levels in the cache hierarchy and the cache hierarchy configuration employed in data processing system 100 may vary. L2 cache 106 is a dedicated cache connected between CPU 102 and system memory 110 (via system bus 112). Those skilled in the art will recognize that various permutations of levels and configurations depicted may be implemented.
L2 cache 106 is connected to system memory 110 via system bus 112. Also connected to system bus 112 are a memory controller 114 and PCI host bridge 108. Memory controller 114 regulates access to system memory 110. Software can organize within system memory 110 buffer regions that are utilized by DMA memory 124 (e.g., DMA memory 124 may be a set of 4K page buffers in system memory 110 space). PCI host bridge 108 connects system bus 112 to PCI bus 116, which provides connections for I/O devices such as a graphics adapter providing a connection for a display (not shown), and for I/O devices 118 and 120. System bus 112, PCI host bridge 108, and PCI bus 116 thus form an interconnect coupling the attached devices, for which alternative implementations are known in the art.
An input/output (I/O) subsystem typically is made up of I/O bus 116, such as a Peripheral Component Interconnect (PCI) bus, to which is attached several I/O devices 118 and 120 along with PCI host bridge (PCIHB) 108. I/O bus 116 is used to connect one or more I/O devices to system bus 112 via PCIHB 108 and allows I/O devices 118 and 120 to transfer commands and data to/from system memory 110 via PCIHB 108.
PCIHB 108 may pass processor commands from system bus 112 to I/O bus 116 when processor 102 wants to access I/O devices 118 and 120. Additionally, PCIHB 108 may also pass direct memory accesses (DMA) from I/O bus 116 initiated by I/O devices 118 and 120 to system memory 110. For DMA access, PCIHB 108 may pre-fetch and cache data to help improve DMA performance. PCIHB 108 behaves very much like processor 102 on system bus 112 in issuing system bus commands to access system memory 110 and to maintain coherency across L1/L2 caches 104 and 106 as well as its own cache 109.
An I/O Master Device is a device that may initiate a DMA on I/O bus 116 which transfers data from system memory 110 to some other location (and vice versa) via PCIHB 108. In this block diagram, I/O device 120 represents an I/O Master Device capable of transferring data to and from system memory 110. These types of transfers may be done without intervention by processor 102.
I/O devices 118 and 120 may comprise conventional peripheral devices including a graphical pointing device such as a mouse or trackball, a display, and a printer, all of which may be interfaced to PCI bus 116 via conventional adapters. Non-volatile memory 122 may comprise a hard disk drive and stores an operating system and other software controlling operation of system 100, which are loaded into volatile system memory 110 in response to system 100 being powered on. Those skilled in the art will recognize that data processing system 100 may include many additional components not shown in Figure 1, such as serial and parallel ports, connections to networks or attached devices, etc. Such modifications and variations are within the spirit and scope of the present invention.
Within DMA buffer 124, data may be stored in, for instance, 4K page buffers 130 and 132 consisting of 32 lines of data of 128 bytes each. Before L1/L2 caches 104 and 106 can execute a store from processor 102 to a line that is in the shared state in the L1/L2 cache, a separate system bus operation is required in order to inform the other caches to invalidate each cache's copy. Since this is done for each cache line, the processor is slowed down due to the number of repetitive bus operations to clear one page buffer to make room for new data. The present invention sets up a 4K page buffer (I/O) so that the buffer may be cleared in one bus operation instead of 32 bus operations.
Typical 4K page buffers are represented by buffers 130 and 132. 4K I/O page buffers, from the present invention, are represented by buffers 134 and 136. Lines of data within the buffers are represented by the blocks within the buffers and a crosshatch within a block represents a shared state.
In buffer 130 all the cache lines are shared after the DMA access completes, requiring individual system bus operations for each cache line (32 lines) before the buffer may be cleared. The cache lines of buffer 132 are shown as modified, allowing data to be written to buffer 132. The first cache line of I/O buffer 134 is in a shared state, with the remaining lines in a modified state, after the DMA access completes as required by the present invention. All cache lines in I/O buffer 136 are in a modified state. In contrast to converting the coherency state of buffer 130 to that of buffer 132, the conversion of the coherency state of buffer 134 to that of buffer 136 requires only the first line in I/O buffer 134 to be changed in order to allow data to be stored to I/O buffer 134. In comparison, converting the state of an I/O page buffer (which requires changing only one line) would take substantially less time than clearing a typical buffer (which requires changing the state of 32 lines).
Referring to Figure 2A, a high-level flow diagram of a method for utilizing a special DMA I/O page in accordance with a preferred embodiment of the present invention is illustrated. The process begins with step 202, which depicts a software application acquiring a 4K I/O page, currently not in use, to create data for a PCI I/O device to read later. The process next passes to step 204, which illustrates the software application accomplishing a series of stores to the 4K I/O page, where at least one of the stores is to the first cache line in the 4K I/O page. The process continues to step 206, which depicts the software application initiating a DMA device to perform a DMA Read of the 4K I/O page via the PCI Host Bridge, where at least one of the reads is to the first cache line in the 4K I/O Page. The process next passes to step 208, which illustrates a determination of whether the software application has more data to send. If not, the process is complete. If there is more data to send, the process instead returns to step 202 wherein the software application acquires a 4K I/O Page buffer not in use.
Referring now to Figure 2B, a high-level flow diagram of the method for re-using the special DMA I/O page in accordance with a preferred embodiment of the present invention is depicted. The process begins with step 222, which depicts a determination of whether an I/O device has completed a DMA read from an I/O page buffer. If not, the process returns to step 222 and repeats the step. If the I/O device has completed a DMA read from the I/O Page buffer, the process instead passes to step 224, which illustrates software marking the I/O Page buffer as "ready for re-use by the software application." The process then returns to step 222 and determines whether an I/O device is finished with a DMA read to an I/O Page buffer.
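The buffer circulation of Figures 2A and 2B can be sketched as a small free-list. This is a hypothetical software model, not code from the specification; the class `IoPagePool` and its method names are assumptions introduced for illustration, and the step numbers in the comments map only loosely onto the flow diagrams.

```python
from collections import deque


class IoPagePool:
    """Fixed set of 4K I/O page buffers that data circulates through."""

    def __init__(self, page_count):
        self.free = deque(range(page_count))  # pages ready for re-use
        self.in_dma = set()                   # pages a device is still reading

    def acquire(self):
        """Step 202: take an I/O page not in use, or None if all are busy."""
        return self.free.popleft() if self.free else None

    def start_dma(self, page):
        """Step 206: hand the filled page to the DMA device."""
        self.in_dma.add(page)

    def dma_complete(self, page):
        """Steps 222/224: mark the page ready for re-use once the read is done."""
        self.in_dma.discard(page)
        self.free.append(page)


pool = IoPagePool(2)
first = pool.acquire()
second = pool.acquire()
exhausted = pool.acquire()      # None: both pages are in use
pool.start_dma(first)
pool.dma_complete(first)
reused = pool.acquire()         # the first page comes back around
```

In this model the application, on re-acquiring a page, would begin by storing to its first cache line, which is the store that triggers the page invalidation in the host bridge.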
Referring to Figure 3, a high-level flow diagram for an L1/L2 coherency process for performing processor stores in accordance with a preferred embodiment of the present invention is illustrated. The process begins with step 300, which depicts starting the procedure. The process proceeds to step 302, which illustrates a determination of whether a processor is attempting to execute a store operation. If not, the process returns to step 302 and repeats the step. If the processor is trying to execute a store operation, the process passes instead to step 304, which depicts the L1/L2 cache checking the state of the L1/L2 cache line before permitting the store to complete. The process then passes to step 306, which illustrates a determination of whether the L1/L2 cache line state is 'Invalid'. If the cache line is 'Invalid', the process proceeds to step 308, which depicts an instruction issued to perform a 'Read with Intent to Modify' operation on the system bus to read a copy of the cache line and set the line to the 'Modified' state. The process then proceeds to step 318, which illustrates the processor's store instruction being executed into the L1/L2 cache.
Returning to step 306, if the L1/L2 cache line state is not 'Invalid', the process proceeds to step 310, which depicts a determination of whether the L1/L2 cache line state is 'shared'. If the cache line is 'shared', the process proceeds to step 312, which illustrates a 'data claim' operation being executed on the system bus in order to gain ownership of the line and change the line to a 'modified' state. The process then passes to step 318 where the processor's store function is executed into the L1/L2 cache. If, instead, the L1/L2 cache line state is not 'shared', the process proceeds to step 314, which depicts a determination of whether the L1/L2 cache line state is 'modified'. If the cache line is not modified, the process passes to step 316, which illustrates an error message, since there are assumed to be only three L1/L2 cache line states.
Returning to step 314, if the L1/L2 cache line state is 'Modified', the process passes instead to step 318, which depicts the processor's store being executed into the L1/L2 cache. The process then returns to step 302, which illustrates the processor attempting to execute another store.
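The store-side decision of Figure 3 reduces to a three-way branch on the line state. The following Python sketch models that branch; the function name `process_store` and the string labels for bus operations are illustrative assumptions, not identifiers from the specification.

```python
def process_store(line_state):
    """Return (system bus operation or None, resulting line state) for one store.

    Mirrors steps 306-318: only 'invalid', 'shared' and 'modified' are
    assumed to exist as L1/L2 line states.
    """
    if line_state == "invalid":
        # Step 308: fetch the line and take ownership in one bus operation.
        return "read_with_intent_to_modify", "modified"
    if line_state == "shared":
        # Step 312: claim ownership; other holders must invalidate their copy.
        return "data_claim", "modified"
    if line_state == "modified":
        # Step 318 directly: the store completes with no bus traffic.
        return None, "modified"
    # Step 316: any other state is treated as an error.
    raise ValueError("unexpected L1/L2 line state: %r" % (line_state,))


print(process_store("shared"))    # ('data_claim', 'modified')
print(process_store("modified"))  # (None, 'modified')
```

The `None` result for a 'modified' line is the case the I/O page is designed to make common: 31 of the 32 lines in a re-used I/O page take this path.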
Referring to Figure 4, a high-level flow diagram of the method for utilizing a special DMA I/O page, wherein the PCI Host Bridge may service DMA requests in accordance with a preferred embodiment of the present invention, is illustrated. The process begins with step 400, which depicts the I/O page buffer being designated. The process proceeds to step 402, which illustrates a determination of whether an I/O device is trying to execute a DMA read. If not, the process returns to step 402, and repeats until a DMA read is determined. If an I/O device is attempting to execute a DMA read, the process proceeds instead to step 404, which depicts the PCI Host Bridge checking the state of lines in the Host Bridge cache. Next the process passes to step 406, which illustrates a determination of whether the PCI Host Bridge cache is in the 'Invalid' state. If the cache is in the 'Invalid' state, the process proceeds to step 408, which depicts a determination of whether the DMA read to the I/O page is a read of the first cache line of the I/O page or a read of a conventional I/O buffer (not an I/O page). If the read is a read of the first cache line of an I/O Page or any cache line in a conventional I/O buffer, the process passes to step 412, which illustrates a 'Read' system bus operation being executed to retrieve a shared copy of the line. The L1/L2 cache is forced to change the state of the line from 'modified' to 'shared'. The process then proceeds to step 418, which depicts the PCI Host Bridge delivering the DMA read data to the I/O device.
Returning to step 408, if the read is of an I/O Page, but not to the first cache line in the page, the process passes instead to step 410, which illustrates a 'Read with No Intent to Cache' system bus operation to retrieve a shared copy of the line, and the L1/L2 cache may keep the cache line in a 'modified' state. The process then passes to step 418, which depicts the PCI Host Bridge delivering DMA read data to the I/O device.
Returning now to step 406, if the PCI Host Bridge cache is not in the 'Invalid' state, the process instead passes to step 414, which illustrates a determination of whether the L1/L2 cache line is in the 'shared' state. If the cache line is not in the 'shared' state, the process proceeds to step 416, which depicts an error message since there are assumed to be only two PCI Host Bridge cache line states. Returning to step 414, if the L1/L2 cache line is in the shared state, the process proceeds to step 418, which illustrates the PCI Host Bridge delivering the DMA Read data to the I/O device. The process continues to step 402, which illustrates an I/O device attempting to execute a DMA read to an I/O page.
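The read-side decision of Figure 4 can be sketched in the same style. The Python below is a simplified model: it follows the abstract in issuing a read with no intent to cache for every I/O page line other than the first, and it assumes the bridge holds the fetched data in a 'shared' state in both cases. The function name `service_dma_read` and the bus-operation labels are hypothetical.

```python
def service_dma_read(bridge_state, is_io_page, line_index, l1l2_state):
    """Return (bus operation or None, new bridge state, new L1/L2 state)."""
    if bridge_state == "invalid":
        if not is_io_page or line_index == 0:
            # Step 412: cacheable 'Read'; a 'modified' L1/L2 line drops to 'shared'.
            new_l1l2 = "shared" if l1l2_state == "modified" else l1l2_state
            return "read", "shared", new_l1l2
        # Step 410: the L1/L2 supplies the data but keeps its line state.
        return "read_no_intent_to_cache", "shared", l1l2_state
    if bridge_state == "shared":
        # Steps 414/418: data already held by the bridge, no bus operation.
        return None, "shared", l1l2_state
    # Step 416: only two bridge line states are assumed.
    raise ValueError("unexpected host bridge line state: %r" % (bridge_state,))


print(service_dma_read("invalid", True, 0, "modified"))
# ('read', 'shared', 'shared')
print(service_dma_read("invalid", True, 5, "modified"))
# ('read_no_intent_to_cache', 'shared', 'modified')
```

The second call shows the property the scheme depends on: after the DMA read, line 5 of the I/O page is still 'modified' in the L1/L2.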
To manage the coherence of the 'I/O pages', the PCI Host Bridge is triggered to invalidate a 4K 'I/O page' by storing to the first cache line of the page before the 4K 'I/O page' can be re-used. The PCI Host Bridge treats the first cache line as special on DMA reads because the first line is devised to appear as a cacheable read to L1/L2 caches. The L1/L2 cache does a system bus coherency access indicating the processor's intentions to change the first cache line from 'shared' to 'modified'. The PCI Host Bridge is snooping on a 4K page granularity (size) so when a store occurs to the first cache line of a 4K page, the PCI Host Bridge will invalidate the entire page, avoiding all the system bus traffic required to invalidate every cache line in the 4K page.
Referring now to Figure 5, a high-level flow diagram of the I/O page invalidation portion of the method for utilizing a special DMA I/O page wherein PCI Host Bridge may snoop System Bus coherency, in accordance with a preferred embodiment of the present invention, is depicted. The process begins with step 500, which depicts beginning the invalidation procedure. The process proceeds to step 502, which illustrates a determination of whether the L1/L2 cache is trying to perform a system bus operation that will change the state of an L1/L2 cache line that hits a 4K I/O Page marked 'shared' by the PCI Host Bridge. If not, the process passes to step 504, which depicts no action being taken by the PCI Host Bridge. The process returns to step 502 and repeats.

Returning to step 502, if the L1/L2 cache is trying to perform a system bus operation that will change the state of an L1/L2 cache line, the process instead passes to step 506, which illustrates the PCI Host Bridge invalidating the subject 4K page (e.g., I/O page) of data in the PCI Host Bridge cache, since the page was marked 'shared'.
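The page-granularity snoop of Figure 5 can be modeled as a lookup keyed by page number rather than by line address. The sketch below is an assumption-laden illustration: the class `HostBridgeCache`, its methods, and the set of bus operations treated as store (invalidate) type are all hypothetical names chosen for the example.

```python
PAGE_SIZE = 4096  # snoop granularity assumed equal to the 4K I/O page size

# Bus operations treated as store (invalidate) type; labels are illustrative.
INVALIDATING_OPS = {"data_claim", "read_with_intent_to_modify"}


class HostBridgeCache:
    def __init__(self):
        self.pages = {}  # page number -> set of line addresses held 'shared'

    def hold_line(self, address):
        """Record a line fetched on behalf of a DMA read."""
        self.pages.setdefault(address // PAGE_SIZE, set()).add(address)

    def snoop(self, address, bus_op):
        """Step 502: check a snooped bus operation; step 506: kill the page."""
        page = address // PAGE_SIZE
        if bus_op in INVALIDATING_OPS and page in self.pages:
            del self.pages[page]   # every held line in the page is dropped
            return True
        return False               # step 504: no action


bridge = HostBridgeCache()
for line in range(32):
    bridge.hold_line(0x2000 + line * 128)    # bridge holds a full 4K page

invalidated = bridge.snoop(0x2000, "data_claim")  # store to the first line
```

A single snooped operation on the first line removes all 32 held lines, which is why no further bus traffic is needed for the rest of the page.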
By defining only the first cache line in a 4K I/O page to be read as cacheable, the L1 cache will still have all but the first cache line in the 'modified' state when it attempts to re-use the 4K
buffer. Only the first line will be in the 'shared' state when the DMA is performed. Software will store to the first cache line in the I/O page whenever it is going to re-use a page so that the PCI Host Bridge is aware that it should invalidate the page. A DMA read or DMA write to the first cache line of an I/O page causes the L1/L2 to change the first cache line from'modified' to 'shared'.
This I/O page, as defined according to a preferred embodiment of the present invention, greatly improves the performance of the processor when it is creating a new 4K page by storing to an old re-usable 4K I/O page, since the store to the first cache line of a 4K page will require only a single system bus transaction to take the L1/L2 from a 'shared' state to a 'modified' state. All other cache lines in the I/O page are left in the 'modified' state in the L1/L2, so processor stores to these cache lines can go directly into the L1/L2 cache, requiring no system bus coherency traffic.
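The saving can be made concrete with simple arithmetic. The sketch below assumes a baseline in which every cache line of the buffer would be in the 'shared' state after a DMA read, so that each line needs its own coherency transaction before it can be stored to. The cache line sizes used (128 and 64 bytes) are illustrative assumptions, as this passage does not state a line size, and the function name is hypothetical.

```python
PAGE_SIZE = 4096  # 4K I/O page


def bus_transactions_on_reuse(line_size, io_page_scheme):
    """Coherency transactions needed to rewrite one 4K buffer (illustrative)."""
    lines_per_page = PAGE_SIZE // line_size
    # With the I/O page definition only the first line is 'shared'; the
    # remaining lines stay 'modified' and need no bus traffic.
    return 1 if io_page_scheme else lines_per_page


baseline_128 = bus_transactions_on_reuse(128, io_page_scheme=False)
io_page_128 = bus_transactions_on_reuse(128, io_page_scheme=True)
baseline_64 = bus_transactions_on_reuse(64, io_page_scheme=False)
```

Under these assumptions a 128-byte line gives 32 lines per page, so the assumed baseline costs 32 transactions per re-use against one for the I/O page definition.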
The present invention may be applied to systems where memory pages are accessed by different means. An additional embodiment of the present invention may be provided for a system that utilizes a Translation Control Entry (TCE) table in a PCI Host Bridge. A TCE table is usually provided in a PCI Host Bridge for use in accessing system memory above a set limit, for example four gigabytes (GB). In such a system, the TCE entry itself may be used as a trigger mechanism instead of using a first cache line in a 4K I/O page. In this instance, the PCI Host Bridge could perform ALL reads as 'Reads with no intent to cache' (no longer treated as special) and program logic invalidates a page by doing a store to the TCE entry that was used for the DMA read each time the page is being re-used (i.e., the PCI Host Bridge invalidates any data it fetched within a 4K page if the TCE it used to fetch the data was modified). As in the I/O page embodiment, system bus traffic is considerably reduced.
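The TCE-triggered variant can be sketched in the same spirit. Everything named here (`TceBridgeSketch`, `store_tce`, `dma_read`, the dictionary layout) is a hypothetical illustration, and the model assumes that a TCE simply maps an index to a 4K page address. It shows only the triggering rule: a store to a TCE discards whatever the bridge fetched through that TCE.

```python
class TceBridgeSketch:
    """Hypothetical model: a store to a TCE invalidates data fetched through it."""

    def __init__(self):
        self.tce_table = {}  # TCE index -> base address of a 4K page
        self.fetched = {}    # TCE index -> offsets of data held by the bridge

    def store_tce(self, index, page_address):
        # Software stores to the TCE each time the page is re-used; the bridge
        # then drops any data it fetched using this TCE.
        self.tce_table[index] = page_address
        self.fetched.pop(index, None)

    def dma_read(self, index, offset):
        # Performed as a 'read with no intent to cache'; the bridge records
        # which TCE it used so a later TCE store can trigger invalidation.
        self.fetched.setdefault(index, set()).add(offset)
        return self.tce_table[index] + offset


tce_bridge = TceBridgeSketch()
tce_bridge.store_tce(3, 0x10000)
addr = tce_bridge.dma_read(3, 0x80)
tce_bridge.dma_read(3, 0x100)
held_before = len(tce_bridge.fetched[3])
tce_bridge.store_tce(3, 0x10000)  # page is being re-used
```

After the second `store_tce` call the model holds no data for TCE 3, which corresponds to the bridge invalidating what it fetched within that 4K page.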
It is important to note that while the present invention has been described in the context of a fully functional device, those skilled in the art will appreciate that the mechanism of the present invention and/or aspects thereof are capable of being distributed in the form of a computer usable medium of instructions in a variety of forms, and that the present invention applies equally regardless of the particular type of signal bearing media used to actually carry out the distribution. Examples of computer usable media include: nonvolatile, hard-coded type media such as read only memories (ROMs) or erasable, electrically programmable read only memories (EEPROMs), recordable type media such as floppy disks, hard disk drives and CD-ROMs, and transmission type media such as digital and analog communication links.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.

Claims (21)

1. A method for improving direct memory access and cache performance, comprising the steps of:
defining a memory buffer wherein a first cache line within said memory buffer is to be read as cacheable;
further defining remaining cache lines within said memory buffer as being in a modified state; and
responsive to an attempt by said cache to change said first cache line from a shared state to a modified state, invalidating said entire memory buffer.
2. The method of claim 1, further comprising:
triggering a PCI host bridge to invalidate said memory buffer.
3. The method of claim 2, further comprising:
performing a store operation to said first cache line of said page.
4. The method of claim 1, further comprising:
snooping a page buffer in system memory.
5. The method of claim 4, further comprising:
invalidating entire page of data in said buffer if said first cache line changed from said shared state to said modified state.
6. An apparatus for improving direct memory access and cache performance, comprising:
a cache for storing information;
control logic for executing cache operations;
hardware for defining a memory buffer wherein a first cache line within said memory buffer is to be read as cacheable; and
means for simultaneously invalidating all cache lines within said memory buffer without additional system bus commands.
7. The apparatus of claim 6, further comprising:
logic means capable of reserving memory storage.
8. The apparatus of claim 6, further comprising:
additional logic means for further defining remaining cache lines within said memory buffer as being in a modified state.
9. The apparatus of claim 6, further comprising:
logic for triggering a PCI Host Bridge to invalidate said memory.
10. The apparatus of claim 6, further comprising:
means for performing a store operation to said first cache line of said page.
11. The apparatus of claim 6, further comprising:
snoop means for snooping a page buffer in system memory.
12. The apparatus of claim 1, wherein means for simultaneously invalidating all cache lines within said memory buffer without additional system bus commands, further comprises:
means for invalidating entire page of data in said memory buffer if said first cache line is changed from said shared to said modified state.
13. A computer program product having instructions within a computer readable medium for improving direct memory access and cache performance, comprising:
instructions within said computer readable medium for defining a page buffer wherein a first cache line within said page buffer is to be read as cacheable;
instructions within said computer readable medium for further defining remaining cache lines within said page buffer as being in a modified state; and
responsive to an attempt by said cache to change said first cache line from a shared to a modified state, instructions within said computer readable medium for invalidating contents of said page buffer.
14. The computer program product of claim 13, further comprising:
instructions within said computer readable medium for triggering a PCI Host Bridge to invalidate contents of said page buffer.
15. The computer program product of claim 14, further comprising:
instructions within said computer readable medium for performing a store operation to said first cache line of said page buffer.
16. The computer program product of claim 13, further comprising:
instructions within said computer readable medium for snooping said page buffer in system memory.
17. The computer program product of claim 16, further comprising:
instructions within said computer readable medium for invalidating entire page of data in said buffer if said first cache line changed from said shared state to said modified state.
18. A method for improving direct memory access and cache performance, comprising the steps of:
defining an address entry for a 4K I/O page within an address table for a cache memory;
utilizing said entry as a trigger mechanism;
performing all reads to said 4K I/O page as 'read with no intent to cache';
storing a Direct Memory Access (DMA) read to a location referenced by said entry; and
invalidating all data within said 4K I/O page.
19. The method of claim 18, further comprising:
determining if said 4K I/O page is to be re-used.
20. The method of claim 17, wherein storing said DMA read to said entry, further comprises:
triggering a host bridge to invalidate all data in said 4K I/O page.
21. The method of claim 17, further comprises:
writing new data to said 4K page.
CA002298780A 1999-03-31 2000-02-16 I/o page kill definition for improved dma and l1/l2 cache performance Abandoned CA2298780A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US09/282,631 1999-03-31
US09/282,631 US6338119B1 (en) 1999-03-31 1999-03-31 Method and apparatus with page buffer and I/O page kill definition for improved DMA and L1/L2 cache performance

Publications (1)

Publication Number Publication Date
CA2298780A1 true CA2298780A1 (en) 2000-09-30

Family

ID=23082392

Family Applications (1)

Application Number Title Priority Date Filing Date
CA002298780A Abandoned CA2298780A1 (en) 1999-03-31 2000-02-16 I/o page kill definition for improved dma and l1/l2 cache performance

Country Status (6)

Country Link
US (1) US6338119B1 (en)
JP (1) JP3893008B2 (en)
KR (1) KR100353656B1 (en)
CN (1) CN1240000C (en)
CA (1) CA2298780A1 (en)
TW (1) TW457433B (en)

Families Citing this family (46)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7257158B1 (en) 1998-05-18 2007-08-14 Kendyl A. Román System for transmitting video images over a computer network to a remote receiver
US7671864B2 (en) * 2000-01-14 2010-03-02 Roman Kendyl A Faster image processing
US7233619B1 (en) 1998-12-21 2007-06-19 Roman Kendyl A Variable general purpose compression for video images (ZLN)
US8290034B2 (en) 1998-12-21 2012-10-16 Zin Stai Pte. In, Llc Video transmission and display including bit-wise sub-sampling video compression
US8170095B2 (en) * 1998-12-21 2012-05-01 Zin Stai Pte. In, Llc Faster image processing
US8416847B2 (en) * 1998-12-21 2013-04-09 Zin Stai Pte. In, Llc Separate plane compression using plurality of compression methods including ZLN and ZLD methods
US20030005428A1 (en) * 2001-05-26 2003-01-02 Roman Kendyl A. Global media exchange
US8068544B2 (en) * 1998-12-21 2011-11-29 Zin Stai Pte. In, Llc Compression with doppler enhancement
US7016417B1 (en) 1998-12-23 2006-03-21 Kendyl A. Roman General purpose compression for video images (RHN)
US8004572B2 (en) * 1999-05-17 2011-08-23 Zin Stai Pte. In, Llc System for transmitting a video stream over a computer network to a remote receiver
US7191462B1 (en) 1999-11-08 2007-03-13 Kendyl A. Román System for transmitting video images over a computer network to a remote receiver
US6574682B1 (en) * 1999-11-23 2003-06-03 Zilog, Inc. Data flow enhancement for processor architectures with cache
US6728835B1 (en) * 2000-08-30 2004-04-27 Unisys Corporation Leaky cache mechanism
US6820161B1 (en) * 2000-09-28 2004-11-16 International Business Machines Corporation Mechanism for allowing PCI-PCI bridges to cache data without any coherency side effects
US6470429B1 (en) * 2000-12-29 2002-10-22 Compaq Information Technologies Group, L.P. System for identifying memory requests as noncacheable or reduce cache coherence directory lookups and bus snoops
US6463510B1 (en) * 2000-12-29 2002-10-08 Compaq Information Technologies Group, L.P. Apparatus for identifying memory requests originating on remote I/O devices as noncacheable
US6684297B2 (en) * 2001-04-11 2004-01-27 Sun Microsystems, Inc. Reverse directory for facilitating accesses involving a lower-level cache
US6701417B2 (en) * 2001-04-11 2004-03-02 Sun Microsystems, Inc. Method and apparatus for supporting multiple cache line invalidations per cycle
US6973528B2 (en) * 2002-05-22 2005-12-06 International Business Machines Corporation Data caching on bridge following disconnect
US20050038946A1 (en) * 2003-08-12 2005-02-17 Tadpole Computer, Inc. System and method using a high speed interface in a system having co-processors
US7610433B2 (en) * 2004-02-05 2009-10-27 Research In Motion Limited Memory controller interface
US7765534B2 (en) 2004-04-30 2010-07-27 International Business Machines Corporation Compiler with cache utilization optimizations
US7484016B2 (en) * 2004-06-30 2009-01-27 Intel Corporation Apparatus and method for high performance volatile disk drive memory access using an integrated DMA engine
TWI243999B (en) * 2004-08-17 2005-11-21 Via Tech Inc Apparatus and related method for maintaining read caching data of south bridge with north bridge
US20060100997A1 (en) * 2004-10-27 2006-05-11 Wall Gary C Data caching
US20060179173A1 (en) * 2005-02-02 2006-08-10 Bockhaus John W Method and system for cache utilization by prefetching for multiple DMA reads
US7328310B2 (en) * 2005-02-02 2008-02-05 Hewlett-Packard Development Company, L.P. Method and system for cache utilization by limiting number of pending cache line requests
US7330940B2 (en) * 2005-02-02 2008-02-12 Hewlett-Packard Development Company, L.P. Method and system for cache utilization by limiting prefetch requests
US20060179174A1 (en) * 2005-02-02 2006-08-10 Bockhaus John W Method and system for preventing cache lines from being flushed until data stored therein is used
US20060184735A1 (en) * 2005-02-15 2006-08-17 Maxwell Technologies, Inc. Methodology for effectively utilizing processor cache in an electronic system
US20090063734A1 (en) * 2005-03-14 2009-03-05 Matsushita Electric Industrial Co., Ltd. Bus controller
US7533198B2 (en) * 2005-10-07 2009-05-12 International Business Machines Corporation Memory controller and method for handling DMA operations during a page copy
US7734842B2 (en) * 2006-03-28 2010-06-08 International Business Machines Corporation Computer-implemented method, apparatus, and computer program product for managing DMA write page faults using a pool of substitute pages
JP4785637B2 (en) * 2006-06-16 2011-10-05 キヤノン株式会社 Data transfer apparatus and control method thereof
US20090119460A1 (en) * 2007-11-07 2009-05-07 Infineon Technologies Ag Storing Portions of a Data Transfer Descriptor in Cached and Uncached Address Space
CN101600029B (en) * 2008-06-06 2013-05-08 博通集成电路(上海)有限公司 System and method for reducing background noise
US8806140B1 (en) * 2009-12-16 2014-08-12 Applied Micro Circuits Corporation Dynamic memory module switching with read prefetch caching
US20130262780A1 (en) * 2012-03-30 2013-10-03 Srilatha Manne Apparatus and Method for Fast Cache Shutdown
GB2529425A (en) * 2014-08-19 2016-02-24 Ibm Data processing apparatus and method
CN106484334A (en) * 2016-10-20 2017-03-08 郑州云海信息技术有限公司 A kind of release pre-reads the method and device of resource
CN108628766B (en) * 2017-03-23 2024-01-23 三星电子株式会社 Nonvolatile memory, computing system, and reading method
KR101951309B1 (en) * 2017-04-19 2019-04-29 서울시립대학교 산학협력단 Data processing apparatus and data processing method
US10599568B2 (en) * 2018-04-09 2020-03-24 Intel Corporation Management of coherent links and multi-level memory
CN111049566B (en) * 2019-11-20 2022-03-08 中国航空工业集团公司西安航空计算技术研究所 Information transfer method and airborne LRM module
CN114157621A (en) * 2020-09-07 2022-03-08 华为技术有限公司 Method and device for sending clearing message
US20240053891A1 (en) * 2022-08-12 2024-02-15 Advanced Micro Devices, Inc. Chipset Attached Random Access Memory

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4885680A (en) * 1986-07-25 1989-12-05 International Business Machines Corporation Method and apparatus for efficiently handling temporarily cacheable data
JPH0612363A (en) * 1992-06-26 1994-01-21 Toshiba Corp Memory controller and multiprocessor system
US5613153A (en) * 1994-10-03 1997-03-18 International Business Machines Corporation Coherency and synchronization mechanisms for I/O channel controllers in a data processing system
JP2852232B2 (en) * 1996-04-24 1999-01-27 新潟日本電気株式会社 Computer
US5884100A (en) * 1996-06-06 1999-03-16 Sun Microsystems, Inc. Low-latency, high-throughput, integrated cache coherent I/O system for a single-chip processor
US5953538A (en) * 1996-11-12 1999-09-14 Digital Equipment Corporation Method and apparatus providing DMA transfers between devices coupled to different host bus bridges
US6128711A (en) * 1996-11-12 2000-10-03 Compaq Computer Corporation Performance optimization and system bus duty cycle reduction by I/O bridge partial cache line writes
EP0887738B1 (en) * 1997-06-27 2008-08-13 Bull S.A. Interface bridge between a system bus and local buses with translation of local addresses for system space access programmable by address space
JPH11272555A (en) * 1998-03-20 1999-10-08 Fujitsu Ltd Cache memory control system
US6003106A (en) * 1998-05-27 1999-12-14 International Business Machines Corporation DMA cache control logic

Also Published As

Publication number Publication date
TW457433B (en) 2001-10-01
JP3893008B2 (en) 2007-03-14
CN1240000C (en) 2006-02-01
JP2000305842A (en) 2000-11-02
US6338119B1 (en) 2002-01-08
KR100353656B1 (en) 2002-09-19
CN1268695A (en) 2000-10-04
KR20010006757A (en) 2001-01-26

Legal Events

Date Code Title Description
EEER Examination request
FZDE Discontinued