BRIDGES PERFORMING REMOTE READS AND WRITES AS UNCACHEABLE COHERENT OPERATIONS
RELATED APPLICATIONS
This application claims priority pursuant to 35 U.S.C. § 120 as a continuation-in-part (CIP) application of application Ser. No. 10/269,922, filed Oct. 11, 2002, now U.S. Pat. No. 7,206,879, issued Apr. 17, 2007. The Ser. No. 10/269,922 application claims priority pursuant to 35 U.S.C. § 119(e) to U.S. Provisional Patent Application Ser. No. 60/380,740, filed May 15, 2002; U.S. Provisional Patent Application Ser. No. 60/331,789, filed Nov. 20, 2001; U.S. Provisional Patent Application Ser. No. 60/344,713, filed Dec. 24, 2001; U.S. Provisional Patent Application Ser. No. 60/348,777, filed Jan. 14, 2002; and U.S. Provisional Patent Application Ser. No. 60/348,717, filed Jan. 14, 2002; all of the above-listed provisional applications are incorporated herein by reference in their entirety.
Furthermore, this application is related to U.S. patent application Ser. No. 10/270,016, filed Oct. 11, 2002, now U.S. Pat. No. 7,227,870, issued Jun. 5, 2007; and U.S. patent application Ser. No. 10/269,666, filed Oct. 11, 2002, now U.S. Pat. No. 6,912,602, issued Jun. 28, 2005; each of which is incorporated herein by reference in its entirety.
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention is directed in general to data communications. In one aspect, the present invention relates to a method and system for improving read and write operations in high-speed data communication systems.
2. Related Art
As is known, communication technologies that link electronic devices are many and varied, servicing communications via both physical media and wirelessly. Some communication technologies interface a pair of devices, other communication technologies interface small groups of devices, and still other communication technologies interface large groups of devices.
Examples of communication technologies that couple small groups of devices include buses within digital computers, e.g., PCI (peripheral component interconnect) bus, ISA (industry standard architecture) bus, USB (universal serial bus), and SPI (system packet interface). One relatively new communication technology for coupling relatively small groups of devices is the HyperTransport (HT) technology, previously known as the Lightning Data Transport technology (HyperTransport I/O Link Specification "HT Standard"). The HT Standard sets forth definitions for a high-speed, low-latency protocol that can interface with today's buses like AGP, PCI, SPI, 1394, USB 2.0, and 1 Gbit Ethernet as well as next generation buses including AGP 8x, InfiniBand, PCI-X, PCI 3.0, and 10 Gbit Ethernet. HT interconnects provide high-speed data links between coupled devices. Most HT enabled devices include at least a pair of HT ports so that HT enabled devices may be daisy-chained. In an HT chain or fabric, each coupled device may communicate with each other coupled device using appropriate addressing and control. Examples of devices that may be HT chained include packet data routers, server computers, data storage devices, and other computer peripheral devices, among others.
Of these devices that may be HT chained together, many require significant processing capability and significant memory capacity. While a device or group of devices having a large amount of memory and significant processing resources may be capable of performing a large number of tasks, significant operational difficulties exist in coordinating the operation of multiprocessors. For example, while each processor may be capable of executing a large number of operations in a given time period, the operation of the processors must be coordinated and memory must be managed to assure coherency of cached copies. In a typical multi-processor installation, each processor typically includes a Level 1 (L1) cache coupled to a group of processors via a processor bus. The processor bus is most likely contained upon a printed circuit board. A Level 2 (L2) cache and a memory controller (that also couples to memory) also typically couple to the processor bus. Thus, each of the processors has access to the shared L2 cache and the memory controller and can snoop the processor bus for its cache coherency purposes. This multiprocessor installation (node) is generally accepted and functions well in many environments.
Because network switches and web servers often require more processing and storage capacity than can be provided by a single small group of processors sharing a processor bus, multiple processor/memory groups (nodes) are sometimes contained in a single device. In these instances, the nodes may be rack mounted and may be coupled via a back plane of the rack. Unfortunately, while the sharing of memory by processors within a single node is a fairly straightforward task, the sharing of memory between nodes is a daunting task. Memory accesses between nodes are slow and severely degrade the performance of the installation. Many other shortcomings in the operation of multiple node systems also exist. These shortcomings relate to cache coherency operations, interrupt service operations, etc. For example, when data write operations are implemented in a multi-node system using cacheable store commands, such cache stores require a read of the line before the store can complete. In multi-node systems where latencies for the reads are large, this can greatly reduce the write bandwidth out of a CPU.
Therefore, a need exists for methods and/or apparatuses for improving read and write bandwidth in a multi-node system without sacrificing data coherency. Further limitations and disadvantages of conventional systems will become apparent to one of skill in the art after reviewing the remainder of the present application with reference to the drawings and detailed description which follow.
SUMMARY OF THE INVENTION
In accordance with the present invention, a system and method are provided for improving the bandwidth for data read and write operations in a multi-node system by using uncacheable read and write commands to a home node in the multi-node system so that the home node can determine whether the commands need to enter the coherent memory space. In a selected embodiment where nodes are connected via HT interfaces, posted commands are used to transmit uncacheable write commands over the HT fabric to a remote home node so that no response is required from the home node. When both cacheable and uncacheable memory operations are mixed in a multi-node system, a producer-consumer software model may be used to require that the data and flag be co-located in the home node's memory and that the producer write both the data and flag using regular HT I/O commands.
In one embodiment of the invention, a system for managing data in multiple data processing devices using common data paths comprises a first data processing system comprising a memory, wherein the memory comprises a cacheable coherent memory space; and a second data processing system communicatively coupled to the first data processing system with the second data processing system comprising at least one bridge, wherein the bridge is operable to perform an uncacheable remote access to the cacheable coherent memory space of the first data processing system.
In some embodiments of the invention, the access performed by the bridge comprises a data write to the memory of the first data processing system for incorporation into the cacheable coherent memory space of the first data system. In other embodiments of the invention, the access performed by the bridge comprises a data read from the cacheable coherent memory space of the first data system.
In various embodiments of the invention, the data written by the bridge during the uncacheable remote access is processed by the first data system to convert the data to conform to a cacheable coherent memory protocol in the cacheable memory space. The converted data in the cacheable coherent memory space is accessed by an agent subsequent to the conversion. The remote access by said bridge and the subsequent access by the agent conform to a producer-consumer protocol, wherein the bridge corresponds to the producer and the agent corresponds to the consumer of the producer-consumer protocol.
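The producer-consumer relationship described above can be pictured with a minimal sketch. The class names HomeMemory, Bridge, and Agent, the single-flag encoding, and the in-order write assumption are illustrative only and are not defined by this application; the sketch simply assumes that writes from the producer reach the home node in the order issued, so that the flag never becomes visible before the data it guards.

```python
# Illustrative model only: HomeMemory, Bridge, and Agent are hypothetical names.

class HomeMemory:
    """Home node memory holding the data buffer and its flag together."""

    def __init__(self, size):
        self.data = [0] * size
        self.flag = 0  # assumed encoding: 0 = buffer empty, 1 = data ready


class Bridge:
    """Producer: issues in-order, uncacheable writes toward the home node."""

    def __init__(self, home):
        self.home = home

    def produce(self, payload):
        # Write the data first ...
        for i, value in enumerate(payload):
            self.home.data[i] = value
        # ... then the flag, so the flag is never set before the data lands.
        self.home.flag = 1


class Agent:
    """Consumer: checks the flag, then reads the data from home memory."""

    def __init__(self, home):
        self.home = home

    def consume(self, length):
        if self.home.flag != 1:
            return None  # nothing has been published yet
        payload = self.home.data[:length]
        self.home.flag = 0  # hand the buffer back to the producer
        return payload
```

In this sketch the consumer never reads the buffer unless the flag is set, which is why the data and flag are kept in the same home memory and written in a fixed order.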
The objects, advantages and other novel features of the present invention will be apparent from the following detailed description when read in conjunction with the appended claims and attached drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 shows a block diagram of a network multiprocessor switching system-on-a-chip.
FIG. 2 is a block diagram of one embodiment of a packet processing system including two (or more) of the systems shown in FIG. 1.
FIG. 3 is a block diagram of a first example of communication in the packet processing system shown in FIG. 2.
FIG. 4 is a block diagram of a second example of communication in the packet processing system shown in FIG. 2.
DETAILED DESCRIPTION
An apparatus and method in accordance with the present invention provide a system for reading and writing data in a system of multiprocessor switching chips. A system level description of the operation of an embodiment of the multiprocessor switching system of the present invention is shown in FIG. 1 which depicts a schematic block diagram of a multiprocessor device 100 in accordance with the present invention. The multiprocessor device 100 may be an integrated circuit or it may be constructed from discrete components. The multiprocessor device 100 includes a plurality of processing units 102, 106, 110, 114, cache memory 118, memory controller 122, which interfaces with on and/or off-chip system memory 125, an internal bus 130, a node controller 134, a switching module 140, a packet manager 148, a system controller 152, an I/O Bridge 156 which interfaces the system bus to various system interfaces, and a plurality of configurable packet based interfaces 162, 166, 170, such as three flexible HyperTransport/SPI-4 Phase 2 links.
As shown in FIG. 1, the four processors 102, 106, 110, 114 are joined to the internal bus 130. When implemented as standard MIPS64 cores, the processors 102, 106, 110, 114 have floating-point support, and are independent, allowing applications to be migrated from one processor to another if necessary. The processors 102, 106, 110, 114 may be designed to any instruction set architecture, and may execute programs written to that instruction set architecture. Exemplary instruction set architectures may include the MIPS instruction set architecture (including the MIPS-3D and MIPS MDMX application specific extensions), the IA-32 or IA-64 instruction set architectures developed by Intel Corp., the PowerPC instruction set architecture, the Alpha instruction set architecture, the ARM instruction set architecture, or any other instruction set architecture. The system 100 may include any number of processors (e.g., as few as one processor, two processors, four processors, etc.). In addition, each processing unit 102, 106, 110, 114 may include a memory sub-system (level 1 cache) of an instruction cache and a data cache and may support separately, or in combination, one or more processing functions. With respect to the processing system example of FIG. 2, each processing unit 102, 106, 110, 114 may be a destination within multiprocessor device 100 and/or each processing function executed by the processing modules 102, 106, 110, 114 may be a source within the processor device 100.
The internal bus 130 may be any form of communication medium between the devices coupled to the bus. For example, the bus 130 may include shared buses, crossbar connections, point-to-point connections in a ring, star, or any other topology, meshes, cubes, etc. In selected embodiments, the internal bus 130 may be a split transaction bus (i.e., having separate address and data phases). The data phases of various transactions on the bus may proceed out of order with the address phases. The bus may also support coherency and thus may include a response phase to transmit coherency response information. The bus may employ a distributed arbitration scheme, and may be pipelined. The bus may employ any suitable signaling technique. For example, differential signaling may be used for high speed signal transmission. Other embodiments may employ any other signaling technique (e.g., TTL, CMOS, GTL, HSTL, etc.). Other embodiments may employ non-split transaction buses arbitrated with a single arbitration for address and data and/or a split transaction bus in which the data bus is not explicitly arbitrated. Either a central arbitration scheme or a distributed arbitration scheme may be used, according to design choice. Furthermore, the bus may not be pipelined, if desired. In addition, the internal bus 130 may be a high-speed (e.g., 128-Gbit/s) 256 bit cache line wide split transaction cache coherent multiprocessor bus that couples the processing units 102, 106, 110, 114, cache memory 118, memory controller 122 (illustrated for architecture purposes as being connected through cache memory 118), node controller 134 and packet manager 148 together. The bus 130 may run in big-endian and little-endian modes, and may implement the standard MESI protocol to ensure coherency between the four CPUs, their level 1 caches, and the shared level 2 cache 118. 
In addition, the bus 130 may be implemented to support all on-chip peripherals, including the input/output bridge interface 156 for the generic bus, SMbus, UARTs, GPIO, Ethernet MAC and PCI/PCI-X interface.
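For readers unfamiliar with the MESI protocol mentioned above, the following simplified sketch tracks the state of a single cache line across several caches sharing a snooped bus. It is a textbook-style model under stated assumptions, not the bus 130 implementation: the names State and SnoopBus are hypothetical, write-back of modified data is implied rather than modeled, and the shared L2 cache is not represented.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class SnoopBus:
    """Tracks one cache line's MESI state in each of several caches."""

    def __init__(self, num_caches):
        self.states = [State.INVALID] * num_caches

    def read(self, requester):
        if self.states[requester] != State.INVALID:
            return  # read hit: no bus transaction, state unchanged
        others_hold = False
        for i, state in enumerate(self.states):
            if i != requester and state != State.INVALID:
                others_hold = True
                # A snooping holder in M (after write-back) or E drops to S.
                self.states[i] = State.SHARED
        # The requester gets E if it is the only holder, otherwise S.
        self.states[requester] = State.SHARED if others_hold else State.EXCLUSIVE

    def write(self, requester):
        # All other copies are invalidated; the writer holds the line in M.
        for i in range(len(self.states)):
            if i != requester:
                self.states[i] = State.INVALID
        self.states[requester] = State.MODIFIED
```

The write path shows why a cacheable store needs ownership of the line first: every other copy must be invalidated before the store completes.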
The cache memory 118 may function as an L2 cache for the processing units 102, 106, 110, 114, node controller 134 and/or packet manager 148. With respect to the processing system example of FIG. 2, the cache memory 118 may be a destination within multiprocessor device 100.
The memory controller 122 provides an interface to system memory, which, when the multiprocessor device 100 is an integrated circuit, may be off-chip and/or on-chip. With respect to the processing system example of FIG. 2, the system memory may be a destination within the multiprocessor device 100 and/or memory locations within the system memory may be individual destinations within the device 100 (as illustrated with channels 0-3). Accordingly, the system memory may include one or more destinations for the multi-node processing systems. The memory controller 122 is configured to access the system memory in response to read and write commands received on the bus 130. The L2 cache 118 may be coupled to the bus 130 for caching various blocks from the system memory for more rapid access by agents coupled to the bus 130. In such embodiments, the memory controller 122 may receive a hit signal from the L2 cache 118, and if a hit is detected in the L2 cache for a given read/write command, the memory controller 122 may not respond to that command. Generally, a read command causes a transfer of data from the system memory (although some read commands may be serviced from a cache such as an L2 cache or a cache in the processors 102, 106, 110, 114) and a write command causes a transfer of data to the system memory (although some write commands may be serviced in a cache, similar to reads). The memory controller 122 may be designed to access any of a variety of types of memory. For example, the memory controller 122 may be designed for synchronous dynamic random access memory (SDRAM), and more particularly double data rate (DDR) SDRAM. Alternatively, the memory controller 122 may be designed for DRAM, DDR synchronous graphics RAM (SGRAM), DDR fast cycle RAM (FCRAM), DDR-II SDRAM, Rambus DRAM (RDRAM), SRAM, or any other suitable memory device or combinations of the above mentioned memory devices.
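The interaction between the L2 hit signal and the memory controller described above may be sketched as follows. The names L2Cache, MemoryController, and bus_read are hypothetical, the dictionary-backed memory is a stand-in for real storage, and the cache fill on a miss is an assumption added for illustration.

```python
class L2Cache:
    """Hypothetical L2 model: a dictionary of cached lines keyed by address."""

    def __init__(self):
        self.lines = {}

    def lookup(self, address):
        """Return (hit, data); hit plays the role of the L2 hit signal."""
        if address in self.lines:
            return True, self.lines[address]
        return False, None


class MemoryController:
    """Hypothetical controller that stays silent when the L2 signals a hit."""

    def __init__(self, memory):
        self.memory = memory  # dictionary standing in for system memory

    def read(self, address, l2_hit):
        if l2_hit:
            return None  # the L2 services the command; do not respond
        return self.memory.get(address, 0)


def bus_read(address, l2, controller):
    """Service one read command and report which agent supplied the data."""
    hit, cached = l2.lookup(address)
    from_memory = controller.read(address, hit)
    if hit:
        return "l2", cached
    l2.lines[address] = from_memory  # assumed fill policy: cache on a miss
    return "memory", from_memory
```

The first read of an address is answered by memory and the second by the L2 cache, with the controller remaining silent on the hit.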
The node controller 134 functions as a bridge between the internal bus 130 and the configurable packet-based interfaces 162, 166, 170. Accordingly, accesses originated on either side of the node controller will be translated and sent on to the other. The node controller also supports the distributed shared memory model associated with the cache coherency non-uniform memory access (CC-NUMA) protocol.
The packet manager 148 circuitry communicates packets between the interfaces 162, 166, 170 and the system memory, and may be a direct memory access (DMA) engine that writes packets received from the switching module 140 into input queues of the system memory and reads packets from output queues of the system memory to the appropriate configurable packet-based interface 162, 166, 170. The packet manager 148 may include a packet manager input and a packet manager output, each having its own DMA engine and associated cache memory. The cache memory may be arranged as first-in-first-out (FIFO) buffers that respectively support the input queues and output queues.
The packet manager circuit 148 comprises circuitry shared by the interfaces 162, 166, 170. The packet manager may generate write commands to the memory controller 122 to write received packets to the system memory, and may generate read commands to read packets from the system memory for transmission by one of the interfaces 162, 166, 170. In some embodiments, the packet manager 148 may be a more efficient use of hardware than having individual DMA engines for each of the interfaces 162, 166, 170. Additionally, the packet manager may simplify communication on the bus 130, in some embodiments, for packet data transfers. It is noted that, in some embodiments, the system 100 may include an L2 cache coupled to the bus 130. The packet manager 148 may be configured, in some embodiments, to cause a portion of the packet data to be stored into the L2 cache in addition to being stored in memory. In some embodiments, the packet manager 148 may use descriptors to locate the memory locations for reading and writing packet data.
The descriptors may be stored in the L2 cache or in main memory. The packet manager 148 may read and write the descriptors as well.
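One common way to organize such descriptors is a ring of buffer records with an ownership bit. The sketch below is an assumption for illustration only: the descriptor layout, the field names (buffer_address, capacity, length, hw_owned), and the drop-on-full behavior are hypothetical and are not specified in this description.

```python
from dataclasses import dataclass


@dataclass
class Descriptor:
    buffer_address: int    # where the packet data lives in system memory
    capacity: int          # size of the buffer in bytes
    length: int = 0        # bytes written by the packet manager
    hw_owned: bool = True  # True while the packet manager may fill the buffer


class InputQueue:
    """Hypothetical input queue: a ring of descriptors over a flat memory."""

    def __init__(self, memory, descriptors):
        self.memory = memory  # bytearray standing in for system memory
        self.ring = descriptors
        self.next = 0

    def receive(self, packet):
        desc = self.ring[self.next]
        if not desc.hw_owned or len(packet) > desc.capacity:
            return False  # no free buffer or packet too large: drop it here
        start = desc.buffer_address
        self.memory[start:start + len(packet)] = packet  # the DMA write
        desc.length = len(packet)
        desc.hw_owned = False  # hand the filled buffer to software
        self.next = (self.next + 1) % len(self.ring)
        return True
```

Software would consume a filled buffer by reading desc.length bytes at desc.buffer_address and then setting hw_owned back to True.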
In some embodiments, the interfaces 162, 166, 170 may have dedicated communication paths to the node controller 134 or packet manager 148. However, in the illustrated embodiment, the system 100 employs a switch 140. The switch 140 may selectively couple one of the receive/transmit interfaces 162, 166, 170 to the node controller 134 or packet manager 148 to transfer received data. The switch 140 may selectively couple the packet manager 148 to one of the interfaces 162, 166, 170 to transfer packet data from the packet manager 148 to the interfaces 162, 166, 170 for transmission on the corresponding ports 172, 174, 176. The switch 140 may have request/grant interfaces to each of the interfaces 162, 166, 170 and the packet manager 148 for requesting transfers and granting those transfers. As will be appreciated, a receive/transmit interface includes any circuitry configured to communicate on a port according to the protocol defined for the port. The interface may include receive circuitry configured to receive communications on the port and to transmit the received communications to other circuitry internal to the system that includes the interface. The interface may also include transmit circuitry configured to receive communications from the other circuitry internal to the system and configured to transmit the communications on the port. The switching module 140 functions to direct data traffic, which may be in a generic format, between the node controller 134 and the configurable packet-based interfaces 162, 166, 170 and between the packet manager 148 and the configurable packet-based interfaces.
The generic format may include 8 byte data words or 16 byte data words formatted in accordance with a proprietary protocol, in accordance with asynchronous transfer mode (ATM) cells, in accordance with internet protocol (IP) packets, in accordance with transmission control protocol/internet protocol (TCP/IP) packets, and/or in general, in accordance with any packet-switched protocol or circuit-switched protocol. In a selected embodiment, a 256-Gbit/s switch 140 connects the on-chip memory 118 and processors 102, 106, 110, 114 to the three HyperTransport/SPI-4 links 162, 166, 170, and provides transparent forwarding of network, ccNUMA access, and HyperTransport packets when necessary.
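The request/grant handshake between the switch 140 and its sources can be illustrated with a minimal sketch. The arbitration policy is not specified in this description, so the model below assumes the simplest one: a request is granted if the destination is idle and denied otherwise. The class name Switch and the string labels for sources and destinations are hypothetical.

```python
class Switch:
    """Hypothetical request/grant model: one transfer per destination."""

    def __init__(self):
        self.busy = {}  # destination -> source currently holding the grant

    def request(self, source, destination):
        """Grant the transfer if the destination is idle; otherwise deny."""
        if destination in self.busy:
            return False
        self.busy[destination] = source
        return True

    def release(self, source, destination):
        """End a transfer; only the source holding the grant may release."""
        if self.busy.get(destination) == source:
            del self.busy[destination]
```

A denied source would retry its request after the current holder releases the destination.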
The configurable packet-based interfaces 162, 166, 170 generally function to convert data between a high-speed communication protocol (e.g., HT, SPI, etc.) utilized between multiprocessor devices 100 and the generic format of data within the multiprocessor devices 100. Accordingly, the configurable packet-based interface 162, 166, 170 may convert received HT or SPI packets into the generic format packets or data words for processing within the multiprocessor device 100, such as by using a receiver interface (which amplifies and time aligns the data received via the physical link and then converts the received protocol-formatted data into data from a plurality of virtual channels having the generic format), hash and route block and receiver buffer for holding the data until a routing decision is made. Packets arriving through receiver interface(s) of the chip can be decoded in either SPI-4 mode (native packet mode) or in HyperTransport (HT) mode, in which case a special extension called Packet-over-HT (PoHT) is used to transfer the packets. From a logical perspective, both modes provide almost identical services. In addition, the configurable packet-based interfaces 162, 166, 170 may convert outbound (transmit) data of a plurality of virtual channels in the generic format received from the switching module 140 into HT packets or SPI packets, such as by using a transmitter formatter and transmitter interface, which take