METHODS AND APPARATUS FOR FACILITATING DIRECT MEMORY ACCESS
CROSS-REFERENCE TO A RELATED APPLICATION
This application claims priority under 35 U.S.C. §119(e) of Provisional Patent
Application No. 60/121,022, filed February 22, 1999 (Attorney Docket No.: SCHIP003P), naming Henry Stracovsky as inventor and assigned to the assignee of the present application, which is incorporated herein by reference for all purposes.
FIELD OF THE INVENTION
The present invention relates in general to computing systems and, more particularly, to direct memory access of a memory in a data processing system. More particularly still, the present invention is directed towards a method and apparatus that facilitates the direct accessing of a memory device by a high speed processor.
BACKGROUND OF THE INVENTION
In many applications, high performance processors are used to perform processing on packet oriented objects. An example of this is an Ethernet to ATM protocol bridge. In these situations, the processor is typically called on to examine and modify some wrapper that encompasses a data payload, for example. The mechanics of data transfer are such that the data is typically deposited in, and subsequently moved out of, main memory. Unfortunately, this is very unfavorable to microprocessor oriented processing, as a high performance CPU is typically an order of magnitude faster than main memory. It is not unusual in this type of application for a processor to spend as much as 50 percent of its cycles waiting for memory accesses to be completed. Since most high performance microprocessors are isolated from main memory by a set of cache memories, it is possible to speed up performance of the application by manually (i.e., under program control) invalidating portions of the cache into which pertinent data from a new packet will be read. While this technique can significantly improve performance, it is processor and cache architecture specific and thus requires a significant understanding of the underlying hardware by the programmer. Furthermore, optimized code (as would be required in this case) is almost never portable, therefore requiring new optimization for every new product.
Therefore, what is desired are improved platform independent techniques for accessing memory in a high speed processing environment.
SUMMARY OF THE INVENTION
To achieve the foregoing, and in accordance with the purpose of the present invention, methods and apparatus for facilitating a direct memory access in a computing system are described. In one embodiment, in a computing system having a central processing unit (CPU) and a main system memory, a DMA engine coupled to the CPU performs a first DMA process that identifies a desired data set. Once identified, the DMA engine moves the identified desired data set from the main system memory to a memory segment that is temporally closer to the processor than is the main system memory.
In a preferred embodiment, the segment of memory is a scratch pad type memory that is incorporated into the processor.
In another embodiment, in a computing system having a central processing unit (CPU) and a main system memory arranged to store data, a method for moving a desired data set from the main system memory to a memory segment that is temporally closer to the CPU than is the main system memory is described. A plurality of data sets that includes the desired data set is distributed across a buffer pool included in the main system memory. The desired data set is then identified as being stored in an associated local buffer that is part of the buffer pool. The identified desired data set is fetched from the local buffer and moved from the local buffer to the memory segment that is temporally closer to the CPU than is the main system memory.
In a preferred embodiment, the segment of memory is a scratch pad type memory that is incorporated into the processor.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:
Fig. 1 shows a computing system in accordance with an embodiment of the invention;
Fig. 2 illustrates the format of a data packet transmitted in a representative Ethernet network compliant with the IEEE 802.3 (1985) standard;
Fig. 3 shows a flowchart detailing a process that describes an operation of the computing system in accordance with an embodiment of the invention;
Fig. 4 shows a flowchart detailing a particular implementation of the identification shown in Fig. 3 in accordance with an embodiment of the invention; and
Fig. 5 illustrates a typical, general-purpose computer system suitable for implementing the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
Reference will now be made in detail to a preferred embodiment of the invention. An example of the preferred embodiment is illustrated in the accompanying drawings. While the invention will be described in conjunction with a preferred embodiment, it will be understood that it is not intended to limit the invention to one preferred embodiment. To the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims.
Substantial improvements in memory access can be achieved by using a hardware based DMA (direct memory access) that moves pertinent data either to a cache memory or to some high speed scratch pad memory that is temporally closer to a processor than is main memory. When a cache memory is used, the processor supports an update operation on a snoop hit of a data write to memory by an alternate master. However, when the high speed scratch memory is used, the DMA performs a memory to memory transfer from main memory to the scratch pad memory. In either case, a special DMA channel is typically required, as data packets accumulate in memory before they are processed by the CPU. This means that if data is placed in the cache or high speed scratch pad as it is initially brought in from the I/O device, large amounts of cache or scratch pad are used up, thus displacing other potentially valuable data or code. If a sufficient number of packets are queued up, it is possible to overflow the cache or scratch pad, thus potentially creating a net loss in performance. Therefore, in a preferred embodiment, the inventive DMA process performs a DMA fetch operation by fetching data into a ping-pong type buffer arrangement. In this arrangement, at least two scratch pads or cache segments are used in such a way that data is fetched into one buffer or segment of cache while the processor is operating on the other buffer or segment of cache.
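The ping-pong arrangement described above can be sketched as follows; the structure and function names are illustrative assumptions rather than part of any particular implementation:

```c
enum { NBUF = 2, BUF_SIZE = 256 };

/* Two buffers: the CPU operates on one while the DMA engine fills the other. */
struct pingpong {
    unsigned char buf[NBUF][BUF_SIZE];
    int active;                       /* index of the buffer the CPU is using */
};

/* The DMA engine always targets the buffer the CPU is NOT using. */
static int dma_target(const struct pingpong *pp) { return pp->active ^ 1; }

/* When the CPU finishes its buffer, the roles swap. */
static void swap_buffers(struct pingpong *pp) { pp->active ^= 1; }
```

Because fetching and processing overlap, the memory latency of the next packet is hidden behind the computation on the current one.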
In a preferred embodiment, pertinent data instances are defined by an offset from a packet start and a packet size. When the offset portion is divided by the buffer size, the integer portion indicates the number of buffers that must be traversed and the remainder indicates the offset into the buffer itself.
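The division described above can be expressed directly; the helper name is hypothetical:

```c
/* Divide the offset from the packet start by the uniform buffer size.
 * The integer quotient is the number of buffers to traverse; the
 * remainder is the offset into the final buffer. */
static void locate_data(unsigned offset, unsigned buf_size,
                        unsigned *buffers_to_traverse, unsigned *buf_offset)
{
    *buffers_to_traverse = offset / buf_size;
    *buf_offset          = offset % buf_size;
}
```

For example, with 512-byte buffers an offset of 1600 bytes lies in the fourth buffer (three buffers traversed) at an internal offset of 64 bytes.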
Turning now to Fig. 1, a computing system 100 is shown in accordance with an embodiment of the invention. The computing system 100 includes an I/O device 102 coupled by way of a DMA channel 104 to a main memory 106. The main memory 106 is typically formed of dynamic random access memory (DRAM) arranged to store data and code. The main memory 106 is, in turn, coupled to a DMA controller unit 108 that includes a DMA engine 110 implemented in either hardware or software. The DMA controller unit 108 also includes several data buffers 112-1 through 112-n, of which only 112-1 and 112-2 are shown. In
the described embodiment, each of the data buffers 112 is arranged to store a portion, or portions, of a data packet that has been pre-fetched as determined by the DMA engine 110. In a preferred embodiment, the buffers 112 are coupled in a ping pong type arrangement in order to prevent the accumulation of a large number of data packets in the buffers 112. In this way, the pipelining of the DMA channel 104 substantially improves overall processor throughput.
In the described embodiment, the pre-fetched data is then made available to a processor unit 114 having, in some embodiments, an L2 type cache memory 116 coupled thereto. In a preferred embodiment, the fetched data is stored in a memory segment 115 that is incorporated as part of the processor 114. It should be noted that when a cache memory is used, it must be able to update a copy of shared data from the bus when shared dirty data is broadcast to the bus.
Fig. 2 illustrates the format of a data packet transmitted in a representative Ethernet network compliant with the IEEE 802.3 (1985) standard. A packet generally includes a preamble 202 which is 8 bytes long. The last byte (or octet) in the preamble is a start frame delimiter (not shown). After the start frame delimiter, a destination address (DA) 204, which is 6 bytes long, is used to identify the node that is to receive the Ethernet packet. Following the DA 204 is a source address (SA) 206, which is 6 bytes long and is used to identify the transmitting node directly on the transmitted packet. After the SA 206, a length/type field (L/T) 208 (typically 2 bytes) is generally used to indicate the length and type of the data field that follows. As is well known in the art, if a length is provided, the packet is classified as an 802.3 packet, and if the type field is provided, the packet is classified as an Ethernet packet.
The following data field is identified as LLC data 210 since the data field also includes information that may have been encoded by an LLC layer described below. A pad 212 is also shown following LLC data 210. As is well known in the art, if a given Ethernet packet is less than 64 bytes, most media access controllers add a padding of 1's and 0's following LLC data 210 in order to increase the Ethernet packet size to at least 64 bytes. Once pad 212 is added, if necessary, a 4 byte cyclic redundancy check (CRC) field 214 is appended to the end of the packet in order to check for corrupted packets at a receiving end. As used herein, a "frame" should be understood to be a sub-portion of data contained within a packet.
Typically, in an Ethernet type network, the processor 114 would only be interested in a portion, or portions, of the Ethernet frame 200 (such as the source address field SA 206 and/or the destination address field DA 204). In these cases, the DMA engine 110 would parse a particular Ethernet frame and store the parsed result in any number of arbitrary buffers included in the buffer pool 118 that are logically linked by, for example, an associated set of descriptors.
Although not shown, another well known packetized data format, referred to as TCP (Transmission Control Protocol), is a method (protocol) used along with the Internet Protocol (IP) to send data in the form of message units between computers over the Internet. While IP handles the actual delivery of the data, TCP keeps track of the individual units of data (called packets) into which a message is divided for efficient routing through the Internet. For example, when an HTML file is sent to a client from a Web server, the Transmission Control Protocol (TCP) program layer in that server divides the file into one or more packets, numbers the packets, and then forwards them individually to the IP program layer. Although each packet has the same destination IP address, it may get routed differently through the network. At the other end (the client program), TCP reassembles the individual packets and waits until all have arrived before forwarding them to the client. TCP is known as a connection-oriented protocol, which means that a connection is established and maintained until the message or messages to be exchanged by the application programs at each end have been exchanged. TCP is responsible for ensuring that a message is divided into the packets that IP manages and for reassembling the packets back into the complete message at the other end.
As with an Ethernet frame, a data packet associated with the TCP program layer would have associated with it a TCP header that includes all information related to, for example, source and destination addresses.
In a similar manner as with the Ethernet frame 200, in a TCP based communication system, since the processor 114 may only be interested in a small portion of the TCP header (the destination address, for example), the DMA engine 110 breaks the data up into several portions, only some of which may be pertinent to the current processor task. Since data packets can be relatively large, the DMA engine 110 distributes incoming data (in some embodiments based upon selecting pertinent portions of the data to be processed by the processor 114) into a series of buffers that in the described embodiment take the form of a buffer pool 118 that is logically defined by a particular memory management scheme. Packets may thus span several buffers, and pertinent data from a wrapper may therefore be located in any of a number of arbitrary buffers. In order to fetch a data packet, the buffers must be logically linked. The linking of buffers can be performed in any number of ways, such as in a descriptor ring, a linked list of descriptors, or simply as a linked list of buffers. During the prefetch operation, the DMA engine must therefore be capable of "walking down" the list to the location of the pertinent data and then transferring only this data. If multiple instances of data are required, the DMA engine must continue to the next instance, transfer it to the scratch pad buffer or cache segment and, if so required, proceed to the next instance.
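As one illustration of the linked-list-of-descriptors option mentioned above, the buffers holding a single packet might be linked as follows (field names are assumptions, not part of the disclosure):

```c
#include <stddef.h>

/* Each descriptor points at one buffer in the pool and at the next buffer
 * holding data of the same packet. */
struct buf_desc {
    unsigned char   *data;  /* backing buffer in main memory */
    unsigned         len;   /* bytes of this packet stored in the buffer */
    struct buf_desc *next;  /* next buffer of the packet, or NULL */
};

/* Walking the chain recovers, for example, the total packet length. */
static unsigned packet_len(const struct buf_desc *d)
{
    unsigned total = 0;
    for (; d != NULL; d = d->next)
        total += d->len;
    return total;
}
```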
As directed by the DMA engine 110, data is pre-fetched from a particular one of the buffers of the buffer pool 118, as required by the processor 114, based upon a descriptor which links the particular buffer to the data portion stored therein, and is stored in the buffer 112-2
where it is made available to the processor 114. By linking each of the buffers of the buffer pool 118, the DMA engine 110 can use the pointer to determine the next buffer for which data must be pre-fetched and stored in the buffer 112-1, which is made available to the processor 114 in a ping pong style management scheme. In some embodiments, it is possible to provide the desired data to the processor cache instead of a scratch pad memory. Such a scheme is possible if the processor cache supports the replacement of cache lines from a front side bus. As an example, a snooping protocol, such as the protocol known in the art as the "Illinois" snooping protocol, can be used. This particular protocol allows a local cache to update an entry from another cache or, as in the described embodiment, from the DMA engine 110. In this implementation, a programmer would define a
"shared" memory region that would be initially induced by program fetches into the cache and then subsequently updated by the DMA engine 110. If the data were to be updated, it would have to be explicitly written into main memory or provided to another DMA process if that DMA process accepted coherency updates from the cache. In another embodiment, an I/O event causes the DMA channel 104 to bring data into the system memory 106 by distributing it over the buffer pool 120. Then the processor 114 starts a processing task on selected data, referred to as desired data, from a particular data set which is typically in the form of a data packet. The processor 114 then initiates another DMA channel (not shown) to fetch the desired data set (i.e., selected) from a next packet to be processed from main memory to a location in the memory segment 115 (in the form of either a scratch pad or a cache memory) such the fetch occurs in parallel with the processing of the current packet by the processor 114. In this way, the second DMA channel fetches the desired data into a designated location from the main memory 106 and then notifies the processor 114 when the fetching task is completed. Referring now to Fig. 3, a process 300 describing an operation of the computing system
100 is illustrated in accordance with an embodiment of the invention. The process 300 begins at 302 by the processor initiating a DMA process, which in some embodiments can include accessing data from a DMA register. At 304, a descriptor is retrieved for the current data packet, after which, at 305, appropriate data offsets associated with the desired data are calculated. Next, at 306, a local buffer is identified that contains the desired data. In one embodiment, the local buffer containing the desired data is identified by way of a "pointer walk" process described in further detail in Fig. 4. Once the desired local buffer has been identified, the data offset in the identified local buffer is calculated at 308, after which the desired data packet (or portion) is moved to a selected cache element at 310. In a preferred embodiment, the cache element is temporally closer to the processor than is, for example, a main memory. At 312, a determination is made whether or not the calculated offset is the last offset. If it is determined that the calculated offset is not the last offset, then control is passed back to 306 where the next local buffer containing desired data is identified. However, if it is determined that
the calculated offset is the last offset, then the processor is notified that appropriate data is available to be retrieved from the cache element at 314.
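A minimal software sketch of process 300, assuming the buffers of a packet form a linked list of descriptors and that each desired field fits within a single buffer (all names are illustrative), might read:

```c
#include <string.h>

struct desc { const unsigned char *buf; unsigned size; const struct desc *next; };

/* For each desired offset (306/308), walk the descriptor chain, copy the
 * field into a scratch area (310), and finally set the ready flag (314).
 * Returns -1 if an offset runs past the end of the chain (error case). */
static int dma_fetch(const struct desc *head,
                     const unsigned *offsets, const unsigned *lens, int n,
                     unsigned char *scratch, int *ready)
{
    unsigned char *dst = scratch;
    for (int i = 0; i < n; i++) {
        const struct desc *d = head;
        unsigned dpa = 0;                          /* passed data accumulator */
        while (d && dpa + d->size <= offsets[i]) { /* pointer walk (Fig. 4) */
            dpa += d->size;
            d = d->next;
        }
        if (!d)
            return -1;
        memcpy(dst, d->buf + (offsets[i] - dpa), lens[i]); /* move, 310 */
        dst += lens[i];
    }
    *ready = 1;                                    /* notify processor, 314 */
    return 0;
}
```

In hardware, the copy loop would be replaced by the second DMA channel's memory-to-memory transfer, and the ready flag by an interrupt or status register.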
Turning now to Fig. 4, a flowchart of a process 400 is shown describing a particular implementation of the identification operation 306, and more particularly the pointer walk process mentioned above, in accordance with a particular embodiment of the invention. It should be noted that although the identification operation described is directed at the pointer walk process typically carried out by the DMA engine 110, any appropriate identification process can in fact be used with the invention.
In the described embodiment, the pointer walk process 400 begins at 402 by adding a current buffer size to a passed data accumulator (DPA). At 404, a determination is made whether or not the value of the DPA is greater than the desired data offset. If it is determined that the DPA value is greater than the desired data offset, then the pointer walk process is determined to be completed at 406 and the process 400 stops. Otherwise, control is passed to 408 where a determination is made whether or not the current descriptor is the last descriptor in the descriptor chain. If it is determined that the current descriptor is in fact the last descriptor in the chain, then control is passed to 410 where an error flag is raised. Otherwise, control is passed to 412 where the next descriptor is fetched, after which control is passed back to 402.
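The pointer walk of process 400 transcribes directly into the following sketch (structure and function names are illustrative):

```c
#include <stddef.h>

/* One descriptor per buffer; only the buffer size matters for the walk. */
struct walk_desc { unsigned size; struct walk_desc *next; };

/* Returns the descriptor whose buffer contains the desired offset, or
 * NULL when the chain ends first (the error case of operation 410). */
static struct walk_desc *pointer_walk(struct walk_desc *d, unsigned offset)
{
    unsigned dpa = 0;                 /* passed data accumulator */
    while (d != NULL) {
        dpa += d->size;               /* 402: add current buffer size */
        if (dpa > offset)             /* 404: DPA greater than offset? */
            return d;                 /* 406: walk complete */
        d = d->next;                  /* 412: fetch next descriptor */
    }
    return NULL;                      /* 408/410: end of chain, raise error */
}
```

For a chain of 512-byte buffers, an offset of 1000 falls in the second buffer, while an offset beyond the total chained size returns the error indication.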
Fig. 5 illustrates a typical, general-purpose computer system 500 suitable for implementing the present invention. The computer system 500 includes any number of processors 502 (also referred to as central processing units, or CPUs) that are coupled to memory devices including primary storage devices 504 (typically a read only memory, or ROM) and primary storage devices 506 (typically a random access memory, or RAM).
Computer system 500 or, more specifically, CPUs 502, may be arranged to support a virtual machine, as will be appreciated by those skilled in the art. As is well known in the art, ROM acts to transfer data and instructions unidirectionally to the CPUs 502, while RAM is used typically to transfer data and instructions in a bidirectional manner. CPUs 502 may generally include any number of processors. Both primary storage devices 504, 506 may include any suitable computer-readable media. A secondary storage medium 508, which is typically a mass memory device, is also coupled bidirectionally to CPUs 502 and provides additional data storage capacity. The mass memory device 508 is a computer-readable medium that may be used to store programs including computer code, data, and the like. Typically, mass memory device 508 is a storage medium such as a hard disk or a tape which is generally slower than primary storage devices 504, 506. Mass memory storage device 508 may take the form of a magnetic or paper tape reader or some other well-known device. It will be appreciated that the information retained within the mass memory device 508 may, in appropriate cases, be incorporated in standard fashion as part of RAM 506 as virtual memory. A specific primary storage device 504 such as a CD-ROM may also pass data unidirectionally to the CPUs 502.
CPUs 502 are also coupled to one or more input/output devices 510 that may include, but are not limited to, devices such as video monitors, track balls, mice, keyboards, microphones, touch-sensitive displays, transducer card readers, magnetic or paper tape readers, tablets, styluses, voice or handwriting recognizers, or other well-known input devices such as, of course, other computers. Finally, CPUs 502 optionally may be coupled to a computer or telecommunications network, e.g., an Internet network or an intranet network, using a network connection as shown generally at 512. With such a network connection, it is contemplated that the CPUs 502 might receive information from the network, or might output information to the network in the course of performing the above-described method steps. Such information, which is often represented as a sequence of instructions to be executed using CPUs 502, may be received from and outputted to the network, for example, in the form of a computer data signal embodied in a carrier wave. The above-described devices and materials will be familiar to those of skill in the computer hardware and software arts.
Although only a few embodiments of the present invention have been described, it should be understood that the present invention may be embodied in many other specific forms without departing from the spirit or the scope of the present invention. By way of example, operations involved with performing the inventive DMA process may be reordered. Operations may also be removed or added without departing from the spirit or the scope of the present invention.