WO2003048955A1

WO2003048955A1 - Multi-processor system

Info

Publication number: WO2003048955A1
Application number: PCT/JP2002/012523
Authority: WO
Inventors: Koji Hosogi; Kiyokazu Nishioka; Toru Nojiri; Kazuhiko Tanaka
Original assignee: Hitachi, Ltd.
Priority date: 2001-12-03
Filing date: 2002-11-29
Publication date: 2003-06-12
Also published as: JPWO2003048955A1

Abstract

In data communication between processors, unnecessary data transfer between the processors is eliminated, thereby preventing performance degradation. Moreover, in a multiprocessor using an interleaved cache, a decrease in the use efficiency of the cache memory due to a fixed interleave configuration is prevented. A multiprocessor system includes a plurality of processors (50), each having a data cache (26), and a main memory (13), which are connected to each other via a bus (10). The system has a data transfer engine (11) that has a region for storing information on data access and issues load instructions and store instructions to the data cache (26) according to the information. Each of the processors (50) has judgment means (22) for storing information on sharing of the data cache and referencing this information upon reception of an address to be accessed, so as to judge which processor is to be accessed.

Description

Details

Multiprocessor system

The present invention relates to a multiprocessor system, and more particularly to a technique for performing high-speed communication between processors.

Background art

For media processing that requires high processing power, such as real-time processing, software pipelining is becoming mainstream: in a multiprocessor environment with multiple processors, coprocessors, and the like, the processing is divided, and the divided processing is assigned to the individual processors and executed in parallel.

In such a multiprocessor environment, when cooperating and executing the divided processing in parallel, it is necessary to transfer data between processors. Therefore, conventionally, data has been transferred between processors by using a method in which a memory is shared by a plurality of processors and the memory is accessed.

Generally, this is achieved by having multiple processors share one main memory or second-level cache memory. However, the access latency in such a structure, with shared memory at or below the second level, is several times to several tens of times greater than that of the first-level cache memory, and processor performance drops markedly as a result.

Various approaches have been taken to avoid this. Examples include a method using a data cache shared among multiple processors and broadcast access (Japanese Patent Laid-Open No. 10-254779), a method of maintaining data consistency with store-through cache memory and snoop control (Japanese Patent Application Laid-Open No. 8-2976464), and a method using an interleaved cache with fixed addresses as the shared data cache (Japanese Patent Laid-Open No. 3-172690). However, in the broadcast method, and in the method using store-through cache memory and snoop control, useless accesses are made to processors that do not need the data transfer, and the redundant transfers reduce performance to between one half and about one tenth. Furthermore, in the snoop method, having a dedicated address tag for snooping increases the chip area. In the interleaving method using fixed addresses, data transfers are not always distributed efficiently because the interleave configuration is fixed; when only one interleave is used, the use efficiency falls, and on average only about half the performance can be obtained.

Disclosure of the invention

A first object of the present invention is to eliminate useless data transfer between processors in data communication between processors and prevent performance degradation.

A second object of the present invention is to prevent a decrease in the use efficiency of a cache memory due to a fixed interleave configuration in a system that uses a shared interleave cache between processors.

According to a first aspect of the present invention, there is provided a multiprocessor system in which a plurality of processors each having a data cache and a main memory are connected by a bus. The system includes a data transfer engine that has a plurality of areas for storing information specifying the data cache or main memory to be accessed, information specifying the address to be accessed, and information indicating the access type, and that issues load instructions and store instructions to a data cache or the main memory in accordance with the information recorded in the areas.

According to a second aspect of the present invention, there is provided a multiprocessor system in which a plurality of processors each having a data cache are connected by a bus. Each processor has an area for setting which other processors the data cache is shared with (including the case where it is not shared) and an area for setting the size of the data cache to be shared, and includes a determination unit that, on receiving an address to be accessed, determines which processor should be accessed by referring to these two areas.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 is a block diagram for explaining the configuration of the first exemplary embodiment of the present invention. FIG. 2 is a block diagram for explaining the configuration of the second embodiment of the present invention. FIG. 3 is a block diagram for explaining the mapping control unit 22 according to the second embodiment of the present invention.

FIG. 4 is a diagram for explaining a boundary register 41 and a shared processor register 33 according to the second embodiment of the present invention.

FIG. 5 is a diagram for explaining the sharing of the data cache of the processor 1 in the second embodiment of the present invention.

FIG. 6 is a block diagram for explaining the configuration of the third exemplary embodiment of the present invention.

BEST MODE FOR CARRYING OUT THE INVENTION

Embodiments of the present invention will be described in detail with reference to the drawings. First, a first embodiment of the present invention will be described with reference to FIG. FIG. 1 is a block diagram for explaining a configuration of a multiprocessor system according to the present embodiment.

As shown in the figure, in this multiprocessor system, NP processors 1 each having a data cache 2 therein are connected to an internal bus 10 and share a main memory 13. At this time, the main memory 13 is connected to the internal bus 10 via a main memory control unit 12 including a main memory control circuit and an interface. Further, a data transfer engine 11 which is a characteristic part of the present embodiment is connected to the internal bus 10.

The data transfer engine 11 issues load / store instructions to the data cache 2 in a processor 1 or to the main memory 13 connected to the internal bus 10. In this way it controls data transfer between one processor 1a and another processor 1b, or between a processor 1 and the main memory 13.

In FIG. 1, each processor 1 includes a data cache 2, a load / store control unit 3, a CPU 4, and an internal bus control unit 5.

The data cache 2 can be a general data cache including a data memory for storing a part of the data in the main memory 13 and a tag memory for storing address data.

The load / store control unit 3 is a control circuit for accessing the data cache 2. The exchange of control signals, memory addresses, store data, load data, and the like between the load / store control unit 3 and the data cache 2 is performed via a path 6.

The CPU 4 can be, for example, a general-purpose CPU, a dedicated coprocessor for a specific use, or the like.

In the present embodiment, there are two routes by which access requests to the data cache 2 are issued to the load / store control unit 3. One is a normal load / store request issued by the CPU 4 and notified to the load / store control unit 3 via a path 7. The other is a load / store request issued by the internal bus control unit 5 and sent to the load / store control unit 3 via the path 8.

However, a load / store request issued by the internal bus control unit 5 does not originate from the internal bus control unit 5 as request master. It originates from the data transfer engine 11 as request master and is issued to the internal bus control unit 5 in each processor 1 via the internal bus 10. At this time, the internal bus control unit 5 operates as a slave module on the internal bus 10.

The load / store control unit 3 arbitrates between these two types of load / store requests and accesses the data cache 2 via the path 6. That is, in the present embodiment, when communication between the processor 1a and the processor 1b is required (for example, when transferring the contents of the data cache 2a of one processor 1a to the data cache 2b of another processor 1b), or when data communication between the data cache 2 of a processor 1 and the main memory 13 is required (for example, when prefetching data), the data transfer engine 11 controls the process of reading the data from the source and writing it to the destination.

Here, the data transfer engine 11 will be described. The data transfer engine 11 is an engine that can issue, to a processor 1 in the multiprocessor system via the internal bus 10, load instructions for reading data from the data cache 2 and store instructions for writing data to the data cache 2. Similarly, it can issue load instructions and store instructions to the main memory 13 via the main memory control unit 12. Here, it is assumed that each processor 1 and the main memory control unit 12, which act as slaves for these accesses, have identification information, and that the data transfer engine 11 uses this identification information to access the processor 1 or the main memory control unit 12.

As shown in FIG. 1, the data transfer engine 11 can be configured to include, for example, an internal bus interface 111, an address generator 112, and a buffer 113.

The address generator 112 generates addresses for reading data from and writing data to the modules connected to the internal bus 10 (the data caches 2 in the processors 1 and the main memory 13). The address generator 112 also generates a selection signal specifying which module is to be accessed. To perform these processes, the address generator 112 is provided with a group of registers that hold the start address, width, pitch, number of repetitions, module identification information, entry number of the buffer 113 (information specifying the storage location), and read / write as the access type. The address generator 112 can hold a plurality of sets, with these registers forming one set. The value of each register can be set by software, for example via the operating system.

As shown in FIG. 1, the address generator 112 generates addresses based on the start address, width, pitch, and number of repetitions set in the registers (these are referred to as "address generation information"), and can thereby generate addresses that specify a two-dimensional area 121. Which processor 1 or main memory 13 to access is determined from the identification information. The generated address and the selection signal are transmitted to the internal bus interface 111. Of course, the address generation information used by the address generator is not limited to this.
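The address generation described above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the text does not define the exact meaning of width and pitch, so the sketch assumes byte-granular addresses, `width` bytes per row, and `pitch` bytes between the starts of consecutive rows. The function name `generate_addresses` and the example values are hypothetical.

```python
from typing import Iterator


def generate_addresses(start: int, width: int, pitch: int, repeat: int) -> Iterator[int]:
    """Yield the byte addresses of a two-dimensional region.

    Assumed semantics (not stated in the text): `repeat` rows of `width`
    bytes each, with consecutive rows starting `pitch` bytes apart.
    """
    if width <= 0 or repeat <= 0:
        raise ValueError("width and repeat must be positive")
    for row in range(repeat):
        row_base = start + row * pitch  # first byte of this row
        for offset in range(width):
            yield row_base + offset


# Hypothetical example: 3 rows of 4 bytes, rows 16 bytes apart.
addrs = list(generate_addresses(start=0x1000, width=4, pitch=16, repeat=3))
```

With these example values the region covers 0x1000 to 0x1003, 0x1010 to 0x1013, and 0x1020 to 0x1023.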

At the time of write access (when write is indicated by the register), the data transfer engine 11 reads data from the buffer 113 according to the entry number and transfers it to the internal bus interface 111. The internal bus interface 111 specifies the output destination based on the input address and the selection signal, and outputs the data read from the buffer 113 via the internal bus 10.

At the time of read access (when read is indicated by the register), the data read via the internal bus 10 is transferred to the buffer 113, and the buffer 113 stores the data at the entry number set by the register.

For example, suppose the following two sets are set in the registers of the address generator 112: (1) address generation information: A0, identification information: processor 0, entry number: B0, read / write: read; (2) address generation information: A1, identification information: processor 1, entry number: B0, read / write: write. This indicates that the data in the address area specified by the address generation information A0 is read from the processor 0 and stored in the buffer entry number B0, and that the data stored in the buffer entry number B0 is written to the address area of the processor 1 specified by the address generation information A1. That is, data is transferred from the processor 0 to the processor 1.

In this way, the data transfer engine 11 can perform data transfer between the processor 1a and the processor 1b, or between a processor 1 and the main memory 13, in parallel with the processing in the CPU 4 of the processor 1. At this time, since the data transfer engine 11 transfers only the necessary data from a specific transfer source to a specific transfer destination, no traffic due to unnecessary data transfer occurs.
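The two-set example above can be mimicked in software as a minimal sketch. Everything here is an assumption for illustration: the class names `Descriptor` and `DataTransferEngine`, the dictionary-based memories, and the concrete addresses are hypothetical, and an explicit address list stands in for the address generation information A0 and A1.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Descriptor:
    addresses: List[int]  # stands in for the address generation information
    module: str           # identification information of the target module
    entry: int            # buffer entry number
    access: str           # access type: "read" or "write"


class DataTransferEngine:
    """Toy model: each module is a dict mapping address to value."""

    def __init__(self, modules: Dict[str, Dict[int, int]], entries: int = 4):
        self.modules = modules
        self.buffer: List[List[int]] = [[] for _ in range(entries)]

    def run(self, descriptors: List[Descriptor]) -> None:
        for d in descriptors:
            memory = self.modules[d.module]
            if d.access == "read":
                # Load from the module into the selected buffer entry.
                self.buffer[d.entry] = [memory[a] for a in d.addresses]
            elif d.access == "write":
                # Store the buffered data to the module.
                for addr, value in zip(d.addresses, self.buffer[d.entry]):
                    memory[addr] = value
            else:
                raise ValueError(f"unknown access type: {d.access}")


# Hypothetical contents: processor 0 holds two values, processor 1 is empty.
modules = {"processor0": {0x100: 11, 0x101: 22}, "processor1": {}}
engine = DataTransferEngine(modules)
engine.run([
    Descriptor([0x100, 0x101], "processor0", 0, "read"),   # set (1)
    Descriptor([0x200, 0x201], "processor1", 0, "write"),  # set (2)
])
```

After `run`, the values read from processor 0 appear at the destination addresses of processor 1, and no other module is touched, which is the point made in the text about avoiding unnecessary transfers.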

The data transfer engine 11 is started by interrupt or polling, thereby synchronizing the processors.

If the request issued by the CPU 4 or the internal bus control unit 5 is a store instruction, the load / store control unit 3 performs a write process on the data cache 2 via the path 6.

If the access hits in the data cache 2, the store data is written directly to the data cache 2. If the access misses in the data cache 2, the load / store control unit 3 transfers the cache-miss address to the internal bus control unit 5 via the path 9, and the internal bus control unit 5 issues a read request for that address to the main memory control unit 12 via the internal bus 10. The main memory control unit 12 has a general main memory interface, reads the data corresponding to the requested address from the main memory 13 via the path 14, and transfers the data via the internal bus 10 to the internal bus control unit 5 in the processor 1. The transferred data is passed to the load / store control unit 3 via the path 9, and the cache fill process for the data cache 2 is executed. This sequence is similar to a general data cache fill.

When the access request to the data cache 2 is a load instruction, control is likewise similar to that of a general data cache. In this case, the load / store control unit 3 performs a load access to the data cache 2 via the path 6.

In the case of a cache hit in the data cache 2, the data is read from the data cache 2, and the load data is returned via the path 7 or 8 to the CPU 4 or the internal bus control unit 5 that requested the access. In the case of a data cache miss, the same sequence as for a cache miss at the time of a store is performed, and the load data is returned to the CPU 4 or the internal bus control unit 5.

By such processing, the target data can be reliably stored in the data cache 2. That is, for example, it can operate as a prefetch to the data cache 2.

Next, a second embodiment of the present invention will be described.

The multiprocessor system according to the present embodiment uses an interleaving method in which data caches are shared among a plurality of processors each having a local data cache. It can be set which processors in the system share the data cache and what size of data cache is to be allocated. FIG. 2 is a block diagram illustrating a configuration of a multiprocessor system according to the second embodiment.

In the figure, the multiprocessor system is configured by NP processors 20, each having a data cache 26, connected by a global bus 28.

Each processor 20 includes a data cache 26, a CPU 4, a load / store control unit 21, and a mapping control unit 22. The global bus 28 has an arbiter 25 for bus arbitration.

Here, the load / store control unit 21 processes load instructions and store instructions for the data cache 26 issued from the CPU 4 of the local (own) processor 20 or from the CPU 4 of another processor 20 sharing the data cache 26.

When a load or store to a data cache 26 is performed, the mapping control unit 22 determines which processor 20 on the global bus 28 holds the data cache 26 that should be accessed.

When the mapping control unit 22 determines that the data cache 26 in its own processor 20 (the local processor 20) is to be accessed, the load / store control unit 21 accesses the local data cache 26 through the local bus 27.

On the other hand, when the mapping control unit 22 determines that the data cache 26 in another processor 20 (another processor 20 connected by the global bus 28) is to be accessed, the data cache 26 in the target processor 20 is accessed via the global bus 28. Next, the mapping control unit 22 will be described in more detail with reference to FIG. 3.

The mapping control unit 22 includes a shared processor register 33 indicating which processors 20 are to share the data cache, a boundary register 30 indicating the size of the data cache to be allocated, and a shifter 31 for shifting an address 23 input from the load / store control unit 21 in accordance with the value of the boundary register 30.

Further, each processor 20 in the multiprocessor is provided with a processor ID for identifying the processor, and the mapping control unit 22 holds a processor ID 36 indicating the processor 20 itself.

Further, the mapping control unit 22 has a selector 34 that outputs a processor selection signal 24 to the arbiter 25 based on the output of the shifter 31, the processor ID 36, and the value of the shared processor register 33.

The values of the shared processor register 33 and the boundary register 30 can be set, for example, via an operating system that controls the multiprocessor system. For example, application software executed in the multiprocessor system sets the values of the shared processor register 33 and the boundary register 30 so that the data cache is shared in a way suited to executing that application software.

Here, an example of a format used in the shared processor register 33 in the present embodiment will be described. As described above, the shared processor register 33 is a register for specifying which of the plurality of processors 20 connected to the global bus 28 should share the data cache.

For a multiprocessor system consisting of NP processors 20, the bit width of the shared processor register 33 is log2(NP). When all bits of the shared processor register 33 are "0", the data cache is not shared with other processors 20. When only the lowest 1 bit of the shared processor register 33 is set to "1", the cache is shared with the processor 20 whose processor ID 36 is the same except for the lowest 1 bit. In this case, two processors 20 share the data cache. It is assumed that the processors 20 are given processor IDs 36 in ascending order from 0x0.

Similarly, when the data cache is shared among the group of processors 20 whose processor IDs 36 are the same except for the lower m bits, the m bits from the low-order side of the shared processor register 33 are set to "1". That is, sharing is performed among 2, 4, 8, ... processors, so that with m bits set the data cache is shared among 2^m processors 20. When a processor 20 shares the data cache with all other processors 20, all bits of the shared processor register 33 are set to "1". In addition, the processors 20 sharing the cache with each other have the same value in the shared processor register 33.
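The register format above amounts to a bit mask over the processor ID. The following is a minimal sketch of that reading; the function name `sharing_group` is hypothetical, and the check that the register holds a contiguous run of low-order ones is an added assumption based on the format described.

```python
from typing import List


def sharing_group(processor_id: int, shared_reg: int, num_processors: int) -> List[int]:
    """Return the IDs of the processors that share a data cache with
    `processor_id`, given its shared processor register value."""
    # The format sets m low-order bits to 1, i.e. the value is 2**m - 1.
    if shared_reg & (shared_reg + 1):
        raise ValueError("expected a run of low-order 1 bits")
    base = processor_id & ~shared_reg  # ID with the lower m bits cleared
    return [pid for pid in range(num_processors)
            if (pid & ~shared_reg) == base]


# Eight processors, register values written as 3-bit binary literals.
no_sharing = sharing_group(0b000, 0b000, 8)   # all bits 0: not shared
pair = sharing_group(0b010, 0b001, 8)         # lowest bit 1: two processors
quad = sharing_group(0b101, 0b011, 8)         # two low bits 1: four processors
everyone = sharing_group(0b011, 0b111, 8)     # all bits 1: all processors
```

Each group has 2^m members for m bits set, matching the 2, 4, 8, ... progression described in the text.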

Next, an example of a format used in the boundary register 30, which is a register indicating the allocated size of the data cache 26, will be described for the present embodiment. When the size of each data cache 26 is determined, the address boundary of each data cache 26 is obtained. (The addresses of the data caches 26 are assumed to be assigned in the order of the processor IDs of the sharing processors 20.)

Here, the minimum address boundary C (in bytes), which is the minimum unit of the allocation size of the data cache, is determined in advance. The minimum address boundary C may, for example, be made settable via the operating system. If a bank-interleave configuration with the minimum address boundary is adopted in the data cache sharing, all bits of the boundary register 30 are set to "0". Then, each time the size of the data cache 26 is doubled relative to the minimum address boundary C, the bits of the boundary register 30 are set to "1" in order from the low-order side. For example, the size of the data cache 26 of a processor 20 in which only the lower 2 bits of the boundary register 30 are "1" is set to four times the minimum address boundary C.
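Under this format, the allocated size is the minimum address boundary C doubled once per bit set in the boundary register. A minimal sketch follows; the function name `allocation_size` is hypothetical, and the default of 1024 bytes for C is chosen only because 1 KB is the value used in the examples of the text.

```python
def allocation_size(boundary_reg: int, min_boundary: int = 1024) -> int:
    """Allocated data cache size in bytes for a boundary register value.

    Each bit set to 1 (filled from the low-order side) doubles the size
    relative to the minimum address boundary C.
    """
    if boundary_reg & (boundary_reg + 1):
        raise ValueError("expected a run of low-order 1 bits")
    ones = bin(boundary_reg).count("1")  # number of doublings
    return min_boundary << ones


size_min = allocation_size(0b000)    # C x 1
size_x4 = allocation_size(0b011)     # C x 4
size_x8 = allocation_size(0b111)     # C x 8
```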

The format of the boundary register 30 and the shared processor register 33 is not limited to the above example.

Next, the processing of the mapping control unit 22 will be described.

The address 23 input to the mapping control unit 22 is shifted in the shifter 31 based on the value of the boundary register 30 and the like, and becomes a shift address 35.

Here, the shift address 35 is the bit field <log2(minimum address boundary C) + ∑{boundary register 30} + log2(NP) - 1 : log2(minimum address boundary C) + ∑{boundary register 30}> of the address 23.

Here, ∑{boundary register 30} is the sum of the bits of the boundary register 30, which specifies the address space to interleave. For example, when the boundary register is represented by 3 bits, it is ∑{bit 0, bit 1, bit 2}. The shift address 35 when the minimum address boundary C is 1 KB and the number of processors NP is 8 is as follows.

Boundary register = 000: <12:10>

Boundary register = 001: <13:11>

Boundary register = 011: <14:12>

Boundary register = 111: <15:13>

For example, if the number of processors NP is 8, the value of the boundary register 30 is 0x011, the minimum address boundary C is 1 KB, and the input address is <31:0>, the shift address 35 is <log2(1 KB) + ∑{0, 1, 1} + log2(8) - 1 : log2(1 KB) + ∑{0, 1, 1}>, that is, address bits <14:12>.

The shift address 35 and the processor ID 36 are input to the selector 34. The selector 34 generates a processor selection signal 24 indicating the processor ID to be accessed, based on the value of the shared processor register 33, and transmits the processor selection signal 24 to the arbiter 25.

In the selector 34, the value of the shared processor register 33 is examined bit by bit. Where a bit of the shared processor register 33 is "1", the corresponding bit of the shift address 35 is selected; where it is "0", the corresponding bit of the processor ID 36 is selected. The processor selection signal 24 is generated from the selected bits.

For example, suppose a processor 20 having a processor ID 36 of 0x101 shares a data cache among four processors 20 (shared processor register 33 = 0x011) and the shift address 35 is 0x010. Then the first bit is the "0" of the shift address 35, the second bit is the "1" of the shift address 35, and the third bit is the "1" of the processor ID 36, so that the processor selection signal 24 becomes 0x110. This indicates that the data cache 26 of the processor 20 whose processor ID is 0x110 should be accessed.
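The bit-field rule for the shift address can be checked with a short sketch. It assumes that the minimum address boundary C and the number of processors NP are powers of two, so that log2 can be taken from `int.bit_length`; the function names are hypothetical.

```python
from typing import Tuple


def shift_address_range(boundary_reg: int, num_processors: int,
                        min_boundary: int = 1024) -> Tuple[int, int]:
    """Return (high, low) bit positions of the shift address field.

    Assumes min_boundary and num_processors are powers of two.
    """
    log2_c = min_boundary.bit_length() - 1
    log2_np = num_processors.bit_length() - 1
    low = log2_c + bin(boundary_reg).count("1")  # sum of boundary register bits
    high = low + log2_np - 1
    return high, low


def shift_address(address: int, boundary_reg: int, num_processors: int,
                  min_boundary: int = 1024) -> int:
    """Extract the shift address field from a full address."""
    high, low = shift_address_range(boundary_reg, num_processors, min_boundary)
    width = high - low + 1
    return (address >> low) & ((1 << width) - 1)


# NP = 8, C = 1 KB, boundary register 011 (binary): bits <14:12>.
field = shift_address_range(0b011, 8)
value = shift_address(0x2000, 0b011, 8)  # bit 13 set
```

For the four boundary register values 000, 001, 011, and 111 this yields the fields <12:10>, <13:11>, <14:12>, and <15:13> listed in the text.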
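The per-bit selection performed by the selector 34 is equivalent to a bitwise multiplexer. The sketch below reproduces the example; the function name `select_processor` is hypothetical. The text writes these three-bit values with a 0x prefix, while the sketch uses Python binary literals for the same bit patterns.

```python
def select_processor(shift_addr: int, processor_id: int, shared_reg: int) -> int:
    """Bitwise multiplexer: where shared_reg has a 1, take the bit from the
    shift address; where it has a 0, take the bit from the processor ID."""
    return (shift_addr & shared_reg) | (processor_id & ~shared_reg)


# Processor ID 101, shared processor register 011, shift address 010.
target = select_processor(shift_addr=0b010, processor_id=0b101, shared_reg=0b011)
```

Here `target` is 0b110: the two low-order bits come from the shift address and the high-order bit from the processor ID.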

When the mapping control unit 22 determines that the data cache 26 in the local processor 20 is to be accessed, the load / store control unit 21 uses the local bus 27 to execute load instructions or store instructions on the local data cache 26.

If the mapping control unit 22 determines that the data cache 26 in another processor 20 is to be accessed, the load / store control unit 21 requests the bus right from the arbiter 25 and, once the bus right is obtained, uses the global bus 28 to execute a load instruction or a store instruction on the data cache 26 in the other processor 20.

The data cache 26 to be accessed arbitrates between accesses from the local bus 27 and accesses from the global bus 28, and then actually accesses the data memory and tag memory in the data cache 26. A further specific example of a multiprocessor system using this method will be described with reference to FIGS. 4 and 5.

In this example, the multiprocessor system is composed of eight processors (processor IDs are assumed to be from 0x000 to 0x111), and the minimum address boundary C is set to 1 KB.

In FIG. 4, processor 0 (0x000) and processor 1 (0x001) have a shared register 42 of 0x000, indicating that they do not perform data cache sharing. In this case, the value of the boundary register 41 has no meaning. Processor 2 (0x010) and processor 3 (0x011) have a shared register 42 of 0x001 (processors whose processor IDs have the same high-order 2 bits), indicating that the two processors share the data cache with each other. Each of their boundary registers 41 is 0x011 (C x 4 = 4 KB). In addition, processor 4 (0x100), processor 5 (0x101), processor 6 (0x110), and processor 7 (0x111) have a shared register 42 of 0x011 (processors whose processor IDs have the same high-order 1 bit), indicating that the four processors share the data cache with one another. Their boundary registers 41 are all 0x000 (C x 1 = 1 KB).

FIG. 5 is a diagram showing an image of data cache sharing of the multiprocessor system shown in FIG.

As shown in the figure, processor 0 (0x000) and processor 1 (0x001) operate as multiprocessors with distributed caches (reference numerals 45, 46). Processor 2 (0x010) and processor 3 (0x011) operate as multiprocessors connected by a shared data cache, in an interleave configuration in which addresses 0 to 4 KB - 1 are allocated to processor 2 (0x010) and addresses 4 KB to 8 KB - 1 are allocated to processor 3 (reference numeral 47). Processor 4 (0x100) to processor 7 (0x111) operate as multiprocessors connected by a shared data cache, and the interleave address boundary is 1 KB (reference numeral 48).

As described above, according to the present embodiment, in the interleaving method in which a data cache is shared among a plurality of processors each including a local data cache, it can be set which processors in the system share the data cache and, when sharing, what size of data cache is to be allocated.

As a result, application software executed in this multiprocessor system can share the data cache in a way suited to its execution, and a decrease in cache memory use efficiency due to a fixed interleave configuration can be prevented.

Here, due to floor plan constraints, the processors 20 may be located physically far apart, and bus arbitration is also added. Therefore, compared with the latency for accessing the local data cache 26, the latency for accessing the data cache 26 of another processor is generally larger. Also, the main cause of an increase in the number of stall cycles of the processor 20 is load instructions. Therefore, for non-local accesses, the logic scale of the arbiter 25 and the global bus 28 can be reduced by imposing the constraint that only store instructions can be executed.
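The configuration of FIGS. 4 and 5 can be traced end to end by combining the shift address rule with the per-bit selection. The sketch below is a self-contained illustration under the same assumptions as the text's example (NP = 8, C = 1 KB); the function name `target_processor` and the sample addresses are hypothetical.

```python
def target_processor(address: int, processor_id: int, shared_reg: int,
                     boundary_reg: int, num_processors: int = 8,
                     min_boundary: int = 1024) -> int:
    """Return the ID of the processor whose data cache holds `address`,
    as seen from `processor_id` (powers of two assumed for NP and C)."""
    low = (min_boundary.bit_length() - 1) + bin(boundary_reg).count("1")
    width = num_processors.bit_length() - 1
    shift_addr = (address >> low) & ((1 << width) - 1)
    # Shared bits come from the address, the rest from the processor ID.
    return (shift_addr & shared_reg) | (processor_id & ~shared_reg)


KB = 1024

# Processors 2 and 3: shared register 001, boundary register 011 (4 KB units).
pair_targets = [target_processor(a, 0b010, 0b001, 0b011)
                for a in (0, 4 * KB - 1, 4 * KB, 8 * KB - 1, 8 * KB)]

# Processors 4 to 7: shared register 011, boundary register 000 (1 KB units).
quad_targets = [target_processor(a, 0b100, 0b011, 0b000)
                for a in (0, 1 * KB, 2 * KB, 3 * KB, 4 * KB)]

# Processor 0: shared register 000, so every address stays local.
local_target = target_processor(5 * KB, 0b000, 0b000, 0b000)
```

In this sketch, addresses 0 to 4 KB - 1 resolve to processor 2 and 4 KB to 8 KB - 1 to processor 3, after which the pattern repeats; for processors 4 to 7 consecutive 1 KB blocks rotate through the four caches.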

FIG. 6 shows a third embodiment of the present invention. The basic configuration is a combination of the two embodiments described above, and is a multiprocessor system in which NP processors 50 are connected by an internal bus 10 and a global bus 28. The same parts as those in the first embodiment and the second embodiment are denoted by the same reference numerals. The load / store control unit 51 has the functions of both the load / store control unit 3 in the first embodiment and the load / store control unit 21 in the second embodiment.

Since the data transfer by the data transfer engine 11 of the first embodiment is started by interrupt or polling, there is overhead before the transfer starts. However, since data can be transferred independently of the processor 50, large-capacity data can be transferred at high speed. In the shared data cache system of the second embodiment, in which the global bus 28 and the arbiter 25 are connected, both large-capacity and small-capacity data transfers are possible, but latency is high for instructions that use the data cache 26 of another processor. The present embodiment has a configuration that takes these two characteristics into account so that the application software can use each where appropriate.

As described above, according to the first embodiment of the present invention, in data communication between processors, useless data transfer between processors can be eliminated, and performance degradation can be prevented.

Further, according to the second embodiment of the present invention, in a method using an interleaved cache of a multiprocessor, it is possible to prevent a decrease in the use efficiency of the cache memory due to a fixed interleave configuration.

Industrial applicability

The present invention can be applied to data communication in a multiprocessor system for the purpose of eliminating useless data transfer between processors in data communication among a plurality of processors each provided with a data cache, and for the purpose of preventing, in a method using an interleaved cache of a multiprocessor, a decrease in the use efficiency of the cache memory due to a fixed interleave configuration.

Claims

1.

A multiprocessor system in which a plurality of processors each having a data cache and a main memory are connected by a bus, the multiprocessor system comprising a data transfer engine that has a plurality of areas for storing information specifying the data cache or main memory to be accessed, information specifying an address to be accessed, and information specifying an access type, and that issues load instructions and store instructions to a data cache or the main memory according to the information recorded in the areas.

2.

The multiprocessor system according to claim 1, wherein the data transfer engine includes a buffer area for temporarily storing data, stores data read from the data cache or main memory in the buffer area based on the issued load instruction, and transmits data stored in the buffer area to a data cache or main memory together with the issued store instruction.

3.

A multiprocessor system in which a plurality of processors each having a data cache are connected by a bus, wherein each processor comprises: an area for setting which other processors, if any, the data cache is shared with; an area for setting the size of the data cache to be shared; and determination means that, when an address to be accessed is received, determines which processor is to be accessed by referring to the two areas.

4.

In the multiprocessor system according to claim 3,

The multiprocessor system is characterized in that the data cache is shared by a bank interleave method.

5.

In the multiprocessor system according to claim 3 or 4,

A multiprocessor system characterized in that when the access type is load, the determination means determines not to access another processor.

6.

A multiprocessor system comprising a data transfer engine that has a plurality of areas for storing, in association with each other, information for specifying a data cache or a main memory to be accessed, information for specifying an address to be accessed, and information for indicating an access type, and that issues load instructions and store instructions to the data cache or main memory according to the information recorded in the areas, wherein each processor comprises: an area for setting which other processors, if any, the data cache is shared with; an area for setting the size of the shared data cache; and determination means that, when an address to be accessed is received, determines which processor should be accessed by referring to the two areas.