US9292284B2 - Parallel arithmetic device, data processing system with parallel arithmetic device, and data processing program - Google Patents

Info

Publication number
US9292284B2
US13/935,790 (US201313935790A), granted as US9292284B2
Authority
US
United States
Prior art keywords
loop, description, data, arithmetic, data processing
Prior art date
Legal status (assumed, not a legal conclusion): Active, expires
Application number
US13/935,790
Other versions
US20140019726A1 (en)
Inventor
Takao Toi
Taro Fujii
Yoshinosuke Kato
Toshiro KITAOKA
Current Assignee
Renesas Electronics Corp
Original Assignee
Renesas Electronics Corp
Application filed by Renesas Electronics Corp filed Critical Renesas Electronics Corp
Assigned to RENESAS ELECTRONICS CORPORATION reassignment RENESAS ELECTRONICS CORPORATION ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: FUJII, TARO, KATO, YOSHINOSUKE, KITAOKA, TOSHIRO, TOI, TAKAO
Publication of US20140019726A1
US15/042,527 (published as US20160162291A1) claims priority
Application granted
Publication of US9292284B2
Change of address recorded for RENESAS ELECTRONICS CORPORATION
Legal status: Active
Expiration date adjusted

Classifications

    • G06F9/3001 Arithmetic instructions (arrangements for executing specific machine instructions on data operands)
    • G06F17/505; G06F17/5054
    • G06F30/327 Logic synthesis; behaviour synthesis, e.g. mapping logic, HDL to netlist, high-level language to RTL or netlist
    • G06F30/34 Circuit design for reconfigurable circuits, e.g. field programmable gate arrays [FPGA] or programmable logic devices [PLD]
    • G06F30/343 Logical level
    • G06F9/3887 Concurrent instruction execution using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
    • G06F15/7867 Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
    • Y02B60/1207; Y02B60/1225
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A parallel arithmetic device includes a status management section, a plurality of processor elements, and a plurality of switch elements for determining the relation of coupling of each of the processor elements. Each of the processor elements includes an instruction memory for memorizing a plurality of operation instructions corresponding respectively to a plurality of contexts so that an operation instruction corresponding to the context selected by the status management section is read out, and a plurality of arithmetic units for performing arithmetic processes in parallel on a plurality of sets of input data in a manner compliant with the operation instruction read out from the instruction memory.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS
The disclosure of Japanese Patent Application No. 2012-154903 filed on Jul. 10, 2012 including the specification, drawings, and abstract is incorporated herein by reference in its entirety.
BACKGROUND
The present invention relates to a parallel arithmetic device, a data processing system with the parallel arithmetic device, and a computer-readable medium storing a data processing program.
A coarse-grained dynamically reconfigurable processor (parallel arithmetic device or array-type processor) forms a circuit by dynamically changing the processing of each of a plurality of processor elements and the relation of coupling therebetween in accordance with an externally input object code. Hence, the dynamically reconfigurable processor can reuse circuit resources to perform a complicated arithmetic process with a small-scale circuit. An exemplary configuration of the dynamically reconfigurable processor is disclosed, for instance, in Japanese Patents Nos. 3921367 and 3861898.
A single-instruction multiple-data (SIMD) processor, which performs a plurality of arithmetic processes in parallel upon receipt of a single operation instruction, is disclosed in Japanese Patents Nos. 4292197 and 4699002 and in Japanese Unexamined Patent Application Publication (Translation of PCT Application) No. 2010-539582.
SUMMARY
When processing a plurality of sets of data in a parallel manner, a related-art dynamically reconfigurable processor (parallel arithmetic device or array-type processor) has to not only issue an operation instruction about processing to each of a plurality of processor elements, which correspond to the sets of data, but also issue an operation instruction about the relation of coupling to each of a plurality of switch elements that determine the relation of coupling between the processor elements. This also holds true for the case where a circuit corresponding to an unrolled part of a loop description is dynamically configured. Hence, the related-art dynamically reconfigurable processor has to store in memory an enormous number of instructions about the processing of each of the processor elements and about the relation of coupling therebetween. Consequently, the related-art dynamically reconfigurable processor cannot efficiently use circuit resources.
Other problems and novel features will become apparent from the following description and from the accompanying drawings.
According to an aspect of the present invention, there is provided a parallel arithmetic device including a status management section, a plurality of processor elements, and a plurality of switch elements that determine the relation of coupling of each of the processor elements. The processor elements each include an instruction memory and a plurality of arithmetic units. The instruction memory memorizes a plurality of operation instructions corresponding respectively to a plurality of contexts so that an operation instruction corresponding to a context selected by the status management section is read out. The arithmetic units perform parallel arithmetic processes on a plurality of sets of input data in a manner compliant with the operation instruction read out from the instruction memory.
According to another aspect of the present invention, there is provided a computer-readable medium storing a data processing program for supplying circuit data to a parallel arithmetic device that includes a status management section, a plurality of processor elements, and a plurality of switch elements that determine the relation of coupling of each of the processor elements. The processor elements each include an instruction memory and a plurality of arithmetic units. The instruction memory memorizes a plurality of operation instructions corresponding to each of a plurality of contexts so that an operation instruction corresponding to a context selected by the status management section is read out. The arithmetic units perform parallel arithmetic processing on each of a plurality of sets of input data in a manner compliant with the operation instruction read out from the instruction memory. The data processing program causes a computer to perform a behavioral synthesis process and a layout process. The behavioral synthesis process is performed to generate a structural description by unrolling, for behavioral synthesis purposes, a loop description that is included in an operation description and with no data dependency between iterations. The layout process is performed to generate the circuit data by subjecting the structural description to logic synthesis and performing a place and route.
The above aspects of the present invention make it possible to provide a parallel arithmetic device capable of efficiently using circuit resources.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the present invention will be described in detail based on the following figures, in which:
FIG. 1 is a diagram illustrating an exemplary configuration of an array-type processor according to a first embodiment of the present invention;
FIG. 2 is a diagram illustrating an exemplary configuration of processor elements and switch elements according to the first embodiment;
FIG. 3 is a block diagram illustrating an exemplary logical configuration of a data processing device according to a second embodiment of the present invention;
FIG. 4 is a conceptual diagram illustrating a behavioral synthesis section according to the second embodiment;
FIG. 5A is a conceptual diagram illustrating a pipelining scheme;
FIG. 5B is a conceptual diagram illustrating a pipelining scheme;
FIG. 5C is a conceptual diagram illustrating a pipelining scheme;
FIG. 6 is a conceptual diagram illustrating a data hazard;
FIG. 7 is a diagram illustrating a SIMD computation process that is performed by a processor element according to the second embodiment;
FIG. 8 is a diagram illustrating how a conditional branch is executed by the array-type processor according to the second embodiment;
FIG. 9 is a flowchart illustrating an operation of the behavioral synthesis section according to the second embodiment;
FIG. 10 is a block diagram illustrating an exemplary hardware configuration of the data processing device according to the second embodiment;
FIG. 11 is a block diagram illustrating an exemplary configuration of a data processing system according to the second embodiment;
FIG. 12 is a diagram illustrating an example of a loop description;
FIG. 13 is a diagram illustrating how a SIMD instruction is assigned by the behavioral synthesis section according to a fourth embodiment of the present invention;
FIG. 14 is a diagram illustrating how the SIMD instruction is assigned by the behavioral synthesis section according to the fourth embodiment;
FIG. 15 is a diagram illustrating how the SIMD instruction is assigned by the behavioral synthesis section according to the fourth embodiment;
FIG. 16 is a flowchart illustrating an operation of the behavioral synthesis section according to the fourth embodiment;
FIG. 17 is a flowchart illustrating an operation of the behavioral synthesis section according to the fourth embodiment;
FIG. 18 is a diagram illustrating how a conditional branch is executed by a related-art SIMD processor; and
FIG. 19 is a diagram illustrating how a conditional branch is executed by a related-art dynamically reconfigurable processor.
DETAILED DESCRIPTION
Embodiments of the present invention will now be described with reference to the accompanying drawings. As the drawings are simplified, they should not be used to narrowly interpret the technical scope of each embodiment. Like elements are designated by like reference numerals and will not be redundantly described.
In the following description, the present invention may, where convenient, be described in plural sections or embodiments. Unless otherwise stated, these are not unrelated to each other: one is a modification, application, detailed explanation, or supplementary explanation of part or all of another. Further, when the number of elements (including pieces, numerical values, amounts, and ranges) is mentioned below, that number is not limiting unless otherwise stated or unless the number is obviously limited to a specific number in principle; numbers larger or smaller than the stated number are also applicable.
Likewise, in the embodiments described below, the components (including operating steps) are not necessarily indispensable unless otherwise stated or unless they are obviously indispensable in principle. Similarly, when the shapes of components, their positional relationships, and the like are mentioned, substantially approximate or similar shapes are included unless otherwise stated or unless it is conceivable that they are obviously excluded in principle. The same applies to the aforementioned numbers of elements (including pieces, numerical values, amounts, and ranges).
First Embodiment
FIG. 1 is a diagram illustrating an exemplary configuration of an array-type processor (parallel arithmetic device) 20 according to a first embodiment of the present invention. The array-type processor 20 according to the first embodiment includes a plurality of processor elements that are capable of performing a plurality of arithmetic processes in parallel upon receipt of one operation instruction. Thus, the array-type processor 20 according to the present embodiment permits a dynamic configuration of a circuit (parallel arithmetic processing circuit) for performing arithmetic processes in parallel upon receipt of a smaller number of instructions than in the past. This makes it possible to efficiently use circuit resources. Details are given below.
The array-type processor 20 shown in FIG. 1 is a processor (dynamically reconfigurable processor) in which a circuit suitable for the situation is dynamically configured. The array-type processor 20 includes an interface section 201, a code memory 202, a status management section 203, a matrix circuit section 205, and a data memory section 206. The matrix circuit section 205 is configured so that a plurality of processor elements (PEs) 207 and a plurality of switch elements (SWEs) 208 are disposed in a matrix form. The data memory section 206 includes a plurality of memory units 210. The memory units 210 are disposed, for example, to surround the matrix circuit section 205.
An object code (circuit data) 15 is supplied from the outside to the interface section 201. The code memory 202 includes an information storage medium such as RAM to memorize the object code 15 supplied to the interface section 201.
The object code 15 includes a plurality of contexts (corresponding to a plurality of later-described data paths) and a state transition condition (corresponding to a later-described state transition machine). As each context, operation instructions for the processor elements 207 and for the switch elements 208 are set. As the state transition condition, an operation instruction for the status management section 203, which selects one of the contexts depending on the situation, is set.
The status management section 203 selects one of the contexts in accordance with the status of the state transition machine and outputs a plurality of instruction pointers (IPs) corresponding to the selected context to the processor elements 207.
FIG. 2 is a diagram illustrating an exemplary configuration of a pair of a processor element 207 and a switch element 208. The processor element 207 includes an instruction memory 211, a plurality of arithmetic units 212, and a plurality of registers 213. The switch element 208 includes wiring coupling switches 214 to 218. Although not shown in the figure, each element in the processor element 207 exchanges data through data wiring and exchanges flags through flag wiring.
The present embodiment will be described on the assumption that the processor element 207 includes eight arithmetic units 212, each performing an arithmetic process on 16-bit data, and eight registers 213, each retaining 16-bit data. Accordingly, each data wiring is 128 bits wide (16 bits × 8 units) and each flag wiring is 8 bits wide (1 bit × 8 units).
The processor element 207 performs an arithmetic process on data that is supplied from another processor element 207 through data wiring, and outputs a computation result (data) to another processor element 207 through data wiring. Further, a flag is supplied to the processor element 207 from another processor element 207 through flag wiring, and the processor element 207 outputs a flag to another processor element 207 through flag wiring. For example, the processor element 207 determines in accordance with a flag supplied from another processor element 207 whether or not to start an arithmetic process, and outputs a flag corresponding to the result of the arithmetic process to another processor element 207.
The instruction memory 211 stores an operation instruction corresponding to each of the contexts. One of the operation instructions is read out from the instruction memory 211 in accordance with an instruction pointer (IP) from the status management section 203. The processor element 207 and the switch element 208 perform operations in compliance with the operation instruction read out from the instruction memory 211.
Each arithmetic unit 212 performs an arithmetic process on input data in compliance with the operation instruction read out from the instruction memory 211. In this instance, each of the eight arithmetic units 212 performs an arithmetic process in parallel on each of a plurality of sets of input data in compliance with one operation instruction (SIMD instruction) read out from the instruction memory 211.
Each register 213 temporarily stores, for example, the data input into a corresponding arithmetic unit 212, the result of computation performed by the corresponding arithmetic unit 212, and intermediate data derived from the arithmetic process performed by the corresponding arithmetic unit 212. The result of computation performed by each arithmetic unit 212 may be directly output to the outside of a processor unit while bypassing the register 213.
In compliance with an operation instruction read out from the instruction memory 211, the wiring coupling switches 214 to 216 couple a data wiring between the corresponding processor element 207 (a processor element 207 having the instruction memory 211 storing the operation instruction) and another processor element 207 (e.g., a neighboring processor element 207).
In compliance with an operation instruction read out from the instruction memory 211, the wiring coupling switches 216 to 218 couple a flag wiring between the corresponding processor element 207 (a processor element 207 having the instruction memory 211 storing the operation instruction) and another processor element 207 (e.g., a neighboring processor element 207).
The wiring coupling switches 214 to 218 thus couple wirings in compliance with an operation instruction read out from the instruction memory 211. The wiring coupling switch 216 is disposed at the intersection of a data wiring and a flag wiring.
As described above, each processor element 207 can perform a plurality of arithmetic processes in parallel upon receipt of one operation instruction (SIMD computation). In other words, unlike in the past, the array-type processor 20 according to the present embodiment can perform two or more arithmetic processes in parallel by using one processor element.
As such being the case, the array-type processor 20 according to the present embodiment permits the dynamic configuration of a circuit (parallel arithmetic processing circuit) for performing arithmetic processes in parallel upon receipt of a smaller number of instructions than in the past. This makes it possible to efficiently use circuit resources.
Obviously, each processor element 207 can individually perform an arithmetic process upon receipt of one operation instruction (scalar computation) in the same manner as in the past when one of the arithmetic units 212 is activated.
Processor elements developed in the past can perform only one arithmetic process upon receipt of one operation instruction. In other words, related-art array-type processors can perform a plurality of arithmetic processes in parallel only when they use a plurality of processor elements. Thus, the related-art array-type processors need to dynamically configure a parallel arithmetic processing circuit in response to many operation instructions. Therefore, the related-art array-type processors cannot efficiently use circuit resources.
Second Embodiment
A second embodiment of the present invention will now be described in relation to a data processing device 10 that generates the object code 15 to be supplied to the array-type processor 20. FIG. 3 is a block diagram illustrating an exemplary logical configuration of the data processing device 10 according to the second embodiment.
The data processing device 10 shown in FIG. 3 includes a behavioral synthesis section (behavioral synthesizer) 100 and an object code generation section (layout section) 109. The behavioral synthesis section 100 includes a loop processing section 108, a DFG generation section 101, a scheduling section 102, an allocation section 103, an FSM generation section 104, a data path generation section 105, a pipeline configuration generation section 106, and an RTL description generation section 107.
As shown in a conceptual diagram of FIG. 4, the behavioral synthesis section 100 generates a state transition machine (FSM or finite state machine) and a plurality of data paths corresponding to a plurality of states in the state transition machine from a description in C language or the like of a circuit operation (operation description; hereinafter referred to as the source code) 11, and outputs the generated information as a description of a circuit structure (structural description; hereinafter referred to as the RTL description) 14.
The loop processing section 108 analyzes the syntax of the source code 11 and unrolls a predefined loop description among the plurality of loop descriptions included in the source code 11. In the present embodiment, the loop description to be unrolled is user-specified; alternatively, the loop processing section 108 may be configured to select and unroll a predefined loop description automatically.
For example, the loop processing section 108 unrolls a loop description with no data dependency between iterations. In other words, the loop processing section 108 unrolls a loop description with no data dependency between a plurality of loop processes. The unrolled part is synthesized by a behavioral synthesizer as a circuit to be eventually subjected to parallel arithmetic processing (parallel arithmetic processing circuit). This parallel arithmetic processing circuit is dynamically configured by combining one or more switch elements 208 with one or more processor elements 207 that can perform a plurality of arithmetic processes in parallel upon receipt of one operation instruction.
The DFG generation section 101 creates a DFG (data flow graph) in accordance with the result of analysis of the source code 11 and the result of processing by the loop processing section 108. The DFG includes nodes, which represent various processing functions such as a computational function, and branches, which represent the flows of data.
The scheduling section 102 performs scheduling in accordance with a synthesis constraint 12 and circuit information 13 to determine when to execute a plurality of nodes, and outputs the result of scheduling as a CDFG (control data flow graph). The allocation section 103 determines, in accordance with the synthesis constraint 12 and circuit information 13, a register and a memory unit that are to be used to temporarily store data represented by branches in the CDFG. The allocation section 103 also determines which arithmetic unit is to be used to perform computations represented by nodes in the CDFG.
The synthesis constraint 12 includes preselected information such as a circuit scale, a resource amount, a delay constraint (timing constraint; clock frequency), and a pipelining target. The circuit information 13 includes preselected information such as the scale and delay of each of later-described resources (arithmetic unit 212, register 213, memory unit 210, etc.) included in the array-type processor 20.
The scheduling section 102 and the allocation section 103 perform scheduling and allocation, respectively, by giving a dedicated synthesis constraint and circuit information to or performing a forwarding (bypassing) process (described later) on a loop description to be pipelined, which is included in the loop descriptions excluding the loop descriptions to be unrolled. In the present embodiment, the loop description to be pipelined is user-specified. Alternatively, the loop description to be pipelined may be specified automatically by the data processing device 10.
Loop description pipelining will now be briefly described with reference to FIGS. 5A to 5C. FIG. 5A is a conceptual diagram illustrating a process that is performed when a loop description (the number of states=4) is not to be pipelined. FIG. 5B is a conceptual diagram illustrating a process that is performed when four states of a loop description are to be folded into two states for pipelining purposes. FIG. 5C is a conceptual diagram illustrating a process that is performed when four states of a loop description are to be folded into one state for pipelining purposes. In the current example, it is assumed that the number of pipelining stages is four, and that the number of loops (the number of iterations) is ten. In the current example, it is also assumed that one execution cycle (clock cycle) is necessary for executing one stage (a series of processing steps).
When, as shown in FIG. 5A, a loop description (the number of states=4) is not to be pipelined, four stages A1, B1, C1, D1 forming a first loop process are sequentially executed. Next, four stages A2, B2, C2, D2 forming a second loop process are sequentially executed. These processing steps are repeated until the tenth loop process is completed. Consequently, a total of forty execution cycles are required for loop process execution.
When, as shown in FIG. 5B, four states of a loop description are folded into two states for pipelining purposes, four stages A1, B1, C1, D1 forming the first loop process are sequentially executed. Further, four stages A2, B2, C2, D2 forming the second loop process are sequentially executed with a delay of two steps (two execution cycles) from the start of the first loop process. Similarly, the four stages of each of the third to tenth loop processes are sequentially executed with a delay of two steps (two execution cycles) from the start of the immediately preceding loop process. Hence, for example, the two stages C1, A2 and the two stages D1, B2 are executed in parallel, respectively. Further, for example, the two stages C2, A3 and the two stages D2, B3 are executed in parallel, respectively. As a result, the number of execution cycles required for loop process execution is equal to eighteen execution cycles plus the number of execution cycles required for initialization (prologue) and post-processing (epilogue).
When, as shown in FIG. 5C, four states of a loop description are folded into one state for pipelining purposes, four stages A1, B1, C1, D1 forming the first loop process are sequentially executed. Further, four stages A2, B2, C2, D2 forming the second loop process are sequentially executed with a delay of one step (one execution cycle) from the start of the first loop process. Similarly, the four stages of the third to tenth loop processes are sequentially executed with a delay of one step (one execution cycle) from the start of the immediately preceding loop process. Hence, for example, the four stages D1, C2, B3, A4 and the four stages D2, C3, B4, A5 are executed in parallel, respectively. As a result, the number of execution cycles required for loop process execution is equal to seven execution cycles plus the number of execution cycles required for initialization (prologue) and post-processing (epilogue). Further, when the states of a loop description are folded into one state and there is no description other than the loop description, no state transition machine is generated except for initialization and post-processing.
When a loop description is pipelined as described above, the number of execution cycles is reduced. This provides an increased throughput (processing capacity).
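The cycle counts in FIGS. 5A to 5C follow from the number of stages and the initiation interval (the number of states after folding, i.e., the number of cycles between successive loop starts). A minimal C sketch of the arithmetic, assuming the prologue and epilogue each take (stages − interval) cycles while the pipeline fills and drains:

```c
#include <assert.h>

/* Total cycles for n iterations of an s-stage loop started every
   ii cycles: the last iteration starts at (n-1)*ii and runs s cycles. */
static int total_cycles(int n, int s, int ii) {
    return (n - 1) * ii + s;
}

/* Steady-state cycles, excluding the prologue and epilogue, each of
   which lasts s - ii cycles. */
static int steady_cycles(int n, int s, int ii) {
    return total_cycles(n, s, ii) - 2 * (s - ii);
}
```

With ten iterations of a four-stage loop this reproduces the figures: an interval of four (no pipelining, FIG. 5A) gives forty cycles, an interval of two (FIG. 5B) gives eighteen steady-state cycles, and an interval of one (FIG. 5C) gives seven.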
Details of loop description pipelining are also disclosed in “High-level Synthesis Challenges for Mapping a Complete Program on a Dynamically Reconfigurable Processor” (Takao Toi, Noritsugu Nakamura, Yoshinosuke Kato, Toru Awashima, Kazutoshi Wakabayashi, IPSJ Transaction on System LSI Design Methodology, February 2010, vol. 3, pp. 91-104), published by the inventors of the present invention.
However, when a loop description is pipelined, a data hazard may occur. Therefore, it is necessary to avoid such a data hazard. The data hazard will now be described briefly with reference to FIG. 6. In the example used for the description of the data hazard, it is assumed that the prevailing conditions are the same as those indicated in FIG. 5C.
At first, the four stages A1 (read), B1 (read), C1 (write), D1 (read) of the first loop process are sequentially executed. Further, the four stages A2 (read), B2 (read), C2 (write), D2 (read) of the second loop process are sequentially executed with a delay of one step from the start of the first loop process. In this instance, the data read process at the stage A2 is performed earlier than the data write process at the stage C1. Therefore, stale data, that is, data not yet updated by the stage C1, may be read out. This type of problem is referred to as a data hazard.
Such a data hazard can be avoided, for example, by performing a forwarding (bypassing) process during scheduling for behavioral synthesis. This ensures that the data read process at the stage A2 is not executed earlier than the data write process at the stage C1.
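The dependency shape behind the hazard can be mimicked in C. This is a hypothetical sketch (the stage arithmetic is invented for illustration, and sequential C cannot itself exhibit the reordering): a loop-carried value read back through shared storage is the value that pipelined execution could fetch too early, whereas forwarding hands the value from producer to consumer in a register so the read never races the write.

```c
#include <assert.h>

/* Loop-carried read-after-write: stage A of iteration i+1 reads the
   value that stage C of iteration i writes into shared storage. Under
   pipelined execution, the read could precede the write (data hazard). */
static int run_with_memory(const int *in, int n) {
    int mem = 0;                 /* shared storage written by stage C */
    for (int i = 0; i < n; i++) {
        int a = mem + in[i];     /* stage A: read the stored value */
        int c = a * 2;           /* stage C: compute */
        mem = c;                 /* stage C: write back */
    }
    return mem;
}

/* Forwarding (bypassing): the producer's result is carried directly in
   a register to the next iteration's consumer. */
static int run_with_forwarding(const int *in, int n) {
    int fwd = 0;                 /* bypass register instead of memory */
    for (int i = 0; i < n; i++)
        fwd = (fwd + in[i]) * 2; /* consume and produce in one step */
    return fwd;
}
```

Both routines compute the same values; forwarding only changes where the loop-carried value travels, which is what lets the scheduler overlap iterations safely.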
Returning to FIG. 3, the FSM generation section 104 generates a state transition machine in accordance with the results produced by the scheduling section 102 and the allocation section 103. Further, the data path generation section 105 generates a plurality of data paths corresponding respectively to a plurality of states included in the state transition machine in accordance with the results produced by the scheduling section 102 and the allocation section 103. Furthermore, the pipeline configuration generation section 106 achieves pipelining by collapsing a plurality of states included in the loop description to be pipelined.
The RTL description generation section 107 outputs the above-mentioned state transition machine and the data paths corresponding respectively to the states included in the state transition machine as the RTL description 14.
Subsequently, the object code generation section 109 reads the RTL description 14, generates a net list by performing, for example, technology mapping and a place and route, subjects the net list to binary conversion, and outputs the result of binary conversion as the object code 15.
As described above, the behavioral synthesis section 100 unrolls a loop description with no data dependency between iterations and performs behavioral synthesis to generate a parallel arithmetic processing circuit. The array-type processor 20 then dynamically configures the parallel arithmetic processing circuit by combining one or more switch elements 208 with one or more processor elements 207 that can perform a plurality of arithmetic processes in parallel upon receipt of one operation instruction. Therefore, the array-type processor 20 can dynamically configure the parallel arithmetic processing circuit upon receipt of a smaller number of operation instructions than in the past. This makes it possible to efficiently use circuit resources.
FIG. 7 is a diagram illustrating a SIMD computation process that is performed by the processor element 207. As shown in FIG. 7, the eight arithmetic units 212 in the processor element 207 add 16-bit data A0 to A7 to 16-bit data B0 to B7 and output data X0 to X7 as the results of additions in compliance with one operation instruction (SIMD instruction).
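The SIMD addition of FIG. 7 can be modeled as one operation over eight independent 16-bit lanes. The sketch below is illustrative, not the array-type processor's actual instruction encoding; the per-lane wraparound with no carry across lanes is a modeling assumption typical of packed 16-bit arithmetic.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8

/* One VADD-style operation: eight 16-bit additions carried out under a
   single instruction. Each lane wraps modulo 2^16 independently; no
   carry propagates between lanes. */
static void vadd16(const uint16_t *a, const uint16_t *b, uint16_t *x) {
    for (int i = 0; i < LANES; i++)
        x[i] = (uint16_t)(a[i] + b[i]);
}
```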
FIG. 8 is a diagram illustrating how a conditional branch is executed by the array-type processor 20. As shown in FIG. 8, a conditional branch is synthesized as a data path containing a multiplexer (marked ▴ or ▾ in the figure) during later-described behavioral synthesis. This ensures that even if there are two or more conditional branches for a certain set of data, the conditional branches are executed in parallel and the multiplexer then selects one of them. Hence, an increase in the number of execution cycles is suppressed. When a loop description containing a conditional branch is unrolled, the above operation is performed in parallel for the unrolled data sets. The selection to be made by the multiplexer is controlled by a flag or the like.
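This branch-to-multiplexer synthesis corresponds to what compilers call if-conversion. A hypothetical C sketch: both arms of the branch are computed unconditionally, and a flag-controlled select (the multiplexer) picks one result, so no control-flow jump remains.

```c
#include <assert.h>

/* Branchy form: control flow decides which computation runs. */
static int branchy(int d) {
    if (d >= 0)
        return d + 1;      /* "if" path */
    else
        return -d;         /* "else" path */
}

/* Multiplexer form, as synthesized into the data path: both arms are
   evaluated in parallel, then a flag selects one result. */
static int muxed(int d) {
    int t = d + 1;         /* "if" arm, always computed */
    int f = -d;            /* "else" arm, always computed */
    int flag = (d >= 0);   /* condition becomes a select flag */
    return flag ? t : f;   /* the multiplexer */
}
```

Because both arms are short data-path operations rather than jump targets, evaluating them in parallel costs no extra cycles on hardware with enough arithmetic units.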
(Flowchart)
An operation of the behavioral synthesis section 100 in the data processing device 10 will now be described with reference to FIG. 9. FIG. 9 is a flowchart illustrating an operation of the behavioral synthesis section 100.
First of all, the behavioral synthesis section 100 performs syntax analysis (step S101) upon receipt of the source code 11, and then performs optimization at the operation description language level (step S102).
In this instance, the behavioral synthesis section 100 selects a predefined loop description from a plurality of loop descriptions included in the source code 11 and unrolls the selected loop description (step S103).
For example, the behavioral synthesis section 100 unrolls a loop description with no data dependency between iterations. In other words, the behavioral synthesis section 100 unrolls a loop description with no data dependency between a plurality of loop processes. The unrolled part is synthesized by a behavioral synthesizer as a circuit to be eventually subjected to parallel arithmetic processing (parallel arithmetic processing circuit). This parallel arithmetic processing circuit is dynamically configured by combining one or more switch elements 208 with one or more processor elements 207 that can perform a plurality of arithmetic processes in parallel upon receipt of one operation instruction.
Subsequently, the behavioral synthesis section 100 assigns nodes, which represent various processing functions, and branches, which represent the flows of data (step S104), and prepares a DFG (step S105).
Next, the behavioral synthesis section 100 performs scheduling (step S106) and allocation (step S107) in accordance with the synthesis constraint 12 and with the circuit information 13.
Next, in accordance with the results of scheduling and allocation, the behavioral synthesis section 100 generates a state transition machine and a plurality of data paths corresponding respectively to the states included in the state transition machine (steps S108 and S109). Further, the behavioral synthesis section 100 achieves pipelining by collapsing a plurality of states included in the loop description to be pipelined (step S110). Subsequently, the behavioral synthesis section 100 performs RT-level and logic-level optimization on the state transition machine and data paths (step S111), and outputs the result of optimization as the RTL description 14 (step S112).
As described above, the behavioral synthesis section 100 unrolls a loop description with no data dependency between iterations and performs behavioral synthesis to generate a parallel arithmetic processing circuit. Subsequently, the array-type processor 20 dynamically configures the parallel arithmetic processing circuit by combining one or more switch elements 208 with one or more processor elements 207 that can perform a plurality of arithmetic processes in parallel upon receipt of one operation instruction. Therefore, the array-type processor 20 can dynamically configure the parallel arithmetic processing circuit upon receipt of a smaller number of operation instructions than in the past. This makes it possible to efficiently use circuit resources.
(Exemplary Hardware Configuration of Data Processing Device 10)
The behavioral synthesis section 100 according to the present embodiment and the data processing device 10 having the behavioral synthesis section 100 can be implemented, for instance, with a general-purpose computer system. A brief description is given below with reference to FIG. 10.
FIG. 10 is a block diagram illustrating an exemplary hardware configuration of the data processing device 10 according to the present embodiment. A computer 110 includes, for example, a CPU (central processing unit) 111, a RAM (random-access memory) 112, a ROM (read-only memory) 113, an interface 114, and a HDD (hard disk drive) 115. The CPU 111 acts as a control device. The interface 114 acts as an interface with the outside world. The HDD 115 is an example of a nonvolatile storage device. The computer 110 may also include devices not shown, such as an input device (e.g., a keyboard or a mouse) and a display device.
The HDD 115 stores an OS (operating system) (not shown), operation description information 116, circuit information 117, and a data processing program 118. The operation description information 116 relates to a circuit operation and corresponds, for instance, to the source code (operation description) 11 shown in FIG. 3. The circuit information 117 relates to a circuit structure and corresponds to the object code 15 shown in FIG. 3. The data processing program 118 is a computer program in which a behavioral synthesis process according to the present embodiment is implemented.
The CPU 111 controls, for example, various processes performed in the computer 110 and access to the RAM 112, ROM 113, interface 114, and HDD 115. The computer 110 is configured so that the CPU 111 reads and executes the OS and the data processing program 118, which are stored on the HDD 115. This enables the computer 110 to implement the behavioral synthesis section 100 according to the present embodiment and the data processing device 10 having the behavioral synthesis section 100.
(Data Processing System 1)
FIG. 11 is a block diagram illustrating an exemplary configuration of a data processing system 1 having the data processing device 10 and the array-type processor 20.
In the data processing system 1 shown in FIG. 11, the data processing device 10 reads the source code 11, the synthesis constraint 12, and the circuit information 13 to generate the object code 15. The array-type processor 20 performs an arithmetic process on processing data supplied from the outside and outputs the result of processing as result data while dynamically changing a circuit configuration for each state in accordance with the object code 15 output from the data processing device 10.
Third Embodiment
A third embodiment of the present invention will now be described in relation to a concrete example of a loop description that is to be unrolled. FIG. 12 is a diagram illustrating an example of a loop description included in the source code 11. In FIG. 12, a multiple loop is depicted as an example of a loop description. In the following description, an inner loop description is referred to as the inner loop, whereas an outer loop description is referred to as the outer loop.
The behavioral synthesis section 100 not only pipelines the whole or part of a loop description with data dependency between iterations and performs behavioral synthesis, but also unrolls the whole or part of a loop description with no data dependency between iterations and performs behavioral synthesis.
In the example shown in FIG. 12, the inner loop has data dependency between iterations and is therefore suitable for being pipelined. On the other hand, the outer loop has no data dependency between iterations and is suitable for being unrolled and processed independently. In the example shown in FIG. 12, therefore, the behavioral synthesis section 100 not only pipelines the inner loop and performs behavioral synthesis, but also unrolls the outer loop and performs behavioral synthesis.
As a result, the array-type processor 20 can not only perform SIMD computations on a circuit corresponding to the outer loop for parallel processing purposes, but also pipeline a circuit corresponding to the inner loop for parallel processing purposes. Consequently, the array-type processor 20 can perform a wider range of parallel processing than a related-art SIMD processor.
As regards a multiple loop, the inner loop is suitable for being pipelined because it has data dependency between iterations, whereas the outer loop is suitable for being unrolled and processed independently because it has no data dependency between iterations. Concrete examples are given below.
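The dependency pattern described above can be sketched as a nested C loop. This is a hypothetical reconstruction of the FIG. 12 shape, not the patent's actual source: the outer loop visits independent blocks (no data dependency between its iterations, so it is unrollable), while the inner loop carries an accumulator from one iteration to the next (so it is pipelinable).

```c
#include <assert.h>

#define BLOCKS 4
#define LEN    8

/* Outer loop: independent per-block work, suitable for unrolling.
   Inner loop: acc is carried across iterations, suitable for
   pipelining. */
static void block_sums(const int in[BLOCKS][LEN], int out[BLOCKS]) {
    for (int b = 0; b < BLOCKS; b++) {   /* no dependency between b's */
        int acc = 0;
        for (int i = 0; i < LEN; i++)    /* acc: loop-carried dependency */
            acc += in[b][i];
        out[b] = acc;
    }
}
```

Unrolling the outer loop yields BLOCKS independent copies of the inner accumulation, each of which can then be pipelined.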
First Concrete Example
As for JPEG and MPEG, the loop processes for the outer and inner loops are, for example, as indicated below.
Outer loop: The loop process is performed on each macro-block of 8-row×8-column pixels or 16-row×16-column pixels.
Inner loop: The loop process is a DCT conversion process that is performed on each macro-block of a plurality of pixels.
Second Concrete Example
As for voice signal FFT conversion, the loop processes for the outer and inner loops are, for example, as indicated below.
Outer loop: The loop process is performed on each block of 1024 point signals.
Inner loop: The loop process is an FFT process that is performed on each block of 1024 point signals.
Third Concrete Example
As for an FIR image filter, the loop processes for the outer and inner loops are, for example, as indicated below.
Outer loop: The loop process is performed on each block that is obtained when an image frame or an image is divided into regions.
Inner loop: The loop process is a filter process that is performed on each block of pixels.
Fourth Concrete Example
When the moving average of a plurality of stock prices is to be calculated, the loop processes for the outer and inner loops are, for example, as indicated below.
Outer loop: Stock name.
Inner loop: The loop process is a process of calculating the moving average of each stock.
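The fourth concrete example can be sketched in C. The window size, scaling, and handling of the first few samples below are illustrative assumptions; the outer loop over stock names (independent, unrollable) would simply call this routine once per stock, while the routine itself is the loop-carried inner process (pipelinable).

```c
#include <assert.h>

#define WINDOW 3

/* Simple moving average for one stock: the running sum is carried
   across iterations (loop-carried dependency). Until the window is
   full, the raw price is emitted as a placeholder. */
static void moving_average(const int *price, int n, int *avg) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += price[i];
        if (i >= WINDOW)
            sum -= price[i - WINDOW];             /* slide the window */
        avg[i] = (i >= WINDOW - 1) ? sum / WINDOW : price[i];
    }
}
```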
Fourth Embodiment
A fourth embodiment of the present invention will now be described in relation to a detailed method that the behavioral synthesis section 100 uses to automatically assign a SIMD instruction to an unrolled part of a loop described with a scalar variable (a detailed method of automatically rewriting the unrolled part with a vector variable). The fourth embodiment will be described on the assumption that the behavioral synthesis section 100 unrolls a part of the outer loop of a multiple loop shown in FIG. 12 and pipelines the remaining loop description.
For example, the behavioral synthesis section 100 divides the outer loop having four hundred iterations into a loop description (first loop description) A having eight iterations and a loop description (second loop description) B having fifty iterations in accordance with the number of arithmetic units (eight units) in each processor element 207.
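The division described here is the transformation often called strip-mining: the iteration space is split so that one factor matches the hardware width. A minimal C sketch, with `body()` as an invented stand-in for the actual loop body:

```c
#include <assert.h>

#define TOTAL 400
#define UNITS 8            /* arithmetic units per processor element */

static int counter;        /* observable side effect for the stand-in body */
static void body(int i) { counter += i; }

/* Original outer loop: 400 iterations. */
static void original(void) {
    for (int i = 0; i < TOTAL; i++)
        body(i);
}

/* After splitting: loop B keeps 50 iterations rolled, while loop A's
   8 iterations match the arithmetic-unit count and can be unrolled. */
static void split(void) {
    for (int b = 0; b < TOTAL / UNITS; b++)   /* loop B: 50 iterations */
        for (int a = 0; a < UNITS; a++)       /* loop A: 8, unrollable */
            body(b * UNITS + a);
}
```

Both forms visit every index exactly once; only the loop structure changes, which is what makes the eight-way unroll of loop A mechanical.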
Next, the behavioral synthesis section 100 unrolls the loop description A having eight iterations as shown in FIG. 14. More specifically, as shown in FIG. 14, the behavioral synthesis section 100 unrolls the process in the inner loop into eight parallel sections in accordance with the number of iterations of the loop description A.
Next, as shown in FIG. 15, the behavioral synthesis section 100 consolidates the eight scalar data items of the parallel-unrolled inner-loop process into one vector data item and substitutes one vector addition instruction VADD for eight scalar addition instructions (i.e., assigns one SIMD addition instruction VADD to eight scalar addition instructions).
As described above, the behavioral synthesis section 100 unrolls a loop described with a scalar variable and then assigns a SIMD instruction to the unrolled part.
Exemplary methods that the behavioral synthesis section 100 uses to assign the SIMD instruction will now be described with reference to FIGS. 16 and 17. FIG. 16 is a flowchart illustrating an operation that is performed by the behavioral synthesis section 100 when a first SIMD instruction assignment method is used. FIG. 17 is a flowchart illustrating an operation that is performed by the behavioral synthesis section 100 when a second SIMD instruction assignment method is used.
(First SIMD Instruction Assignment Method)
Referring to the example shown in FIG. 16, the behavioral synthesis section 100 selects a predefined loop description from a plurality of loop descriptions included in the source code 11 to unroll the selected loop description (step S103), assigns the above-mentioned SIMD instruction to the unrolled part (step S104B), and assigns a related-art scalar instruction to other loop descriptions (step S104A).
The other operations of the behavioral synthesis section 100 shown in FIG. 16 are the same as those indicated in FIG. 9 and will not be redundantly described.
The first SIMD instruction assignment method keeps the scheduling process from becoming complicated and is therefore advantageous in that the total processing time for behavioral synthesis is reduced.
(Second SIMD Instruction Assignment Method)
Referring to the example shown in FIG. 17, the behavioral synthesis section 100 selects a predefined loop description from the loop descriptions included in the source code 11 to unroll the selected loop description (step S103), and assigns the related-art scalar instruction to both the unrolled part and the other loop descriptions (step S104A).
In the subsequent processes ranging from DFG generation (step S105) to pipelining (step S110), all data is processed as scalar quantities.
Subsequently, the behavioral synthesis section 100 assigns the above-mentioned SIMD instruction to the unrolled part of the loop description (step S1101), optimizes the RTL (step S111), and outputs the RTL description (step S112).
The second SIMD instruction assignment method is advantageous in that it is practically unnecessary to change an existing behavioral synthesis flow.
As described above, the behavioral synthesis section 100 according to the present embodiment unrolls a predefined loop description included in an operation description and automatically assigns a SIMD instruction for the array-type processor 20 to the unrolled part. This eliminates the necessity of learning a dedicated language including, for example, the SIMD instruction. Consequently, the length of design time can be reduced.
As described above, the array-type processor (parallel arithmetic device) 20 according to the foregoing embodiments includes a plurality of processor elements 207 that are capable of performing a plurality of arithmetic processes in parallel upon receipt of one operation instruction. Therefore, the array-type processor 20 according to the foregoing embodiments can dynamically configure a parallel arithmetic processing circuit that performs arithmetic processes in parallel upon receipt of a smaller number of operation instructions than in the past. This makes it possible to efficiently use circuit resources.
Further, the behavioral synthesis section (behavioral synthesizer) 100 according to the foregoing embodiments unrolls a loop description with no data dependency between iterations and performs behavioral synthesis to generate a parallel arithmetic processing circuit. The array-type processor 20 then dynamically configures the parallel arithmetic processing circuit by combining one or more switch elements 208 with one or more processor elements 207 that can perform a plurality of arithmetic processes in parallel upon receipt of one operation instruction. Consequently, the array-type processor 20 according to the foregoing embodiments can dynamically configure the parallel arithmetic processing circuit upon receipt of a smaller number of operation instructions than in the past. This makes it possible to efficiently use circuit resources.
Furthermore, the behavioral synthesis section (behavioral synthesizer) 100 according to the foregoing embodiments unrolls a loop description included in an operation description and automatically assigns a SIMD instruction to the unrolled part. This eliminates the necessity of learning a dedicated language including, for example, the SIMD instruction. Consequently, the length of design time can be reduced.
Moreover, the array-type processor 20 according to the foregoing embodiments can not only perform SIMD computations for parallel processing purposes, but also perform pipelining for parallel processing purposes. Consequently, the array-type processor 20 can perform a wider range of parallel processing than a related-art SIMD processor. As the amount of computation processible by a single operation instruction increases with an increase in the degree of parallelism, the performance per unit area increases. Further, as the same amount of computation can be provided at a lower frequency, the power consumption per unit performance is suppressed.
The foregoing embodiments have been described on the assumption that each processor element 207 includes eight arithmetic units capable of performing an arithmetic process on 16-bit data. However, the present invention is not limited to such a configuration. The employed configuration can be changed as needed so that each processor element 207 includes two or more arithmetic units. Further, the employed configuration can also be changed as needed so that each processor element 207 includes arithmetic units capable of performing an arithmetic process on data with a bit width other than 16 bits. In this instance, however, the bit widths of data wiring, flag wiring, and the like need to be changed as well.
The foregoing embodiments have been described on the assumption that the behavioral synthesis section 100 unrolls an outer loop of a multiple loop having an inner loop as well as the outer loop. However, the present invention is not limited to the use of such an unrolling scheme. The employed configuration can be changed as needed so that the behavioral synthesis section 100 unrolls a loop description with no data dependency between iterations.
Providing each processor element 207 with two or more arithmetic units is also useful in terms of functional safety. In this case, during assignment of a SIMD instruction, the behavioral synthesis section 100 copies a scalar instruction and assigns the identical copies to multiple arithmetic units in each processor element 207. Failure of some of the arithmetic units in a processor element can then be handled by detecting whether the results produced by the arithmetic units executing the identical instructions match, or by providing an additional correction circuit. When combined with the foregoing embodiments, for instance, copied scalar instructions may be assigned to pairs of arithmetic units in the above-described processor element having eight arithmetic units, which can then be used for SIMD computation executing four instructions in parallel. For correction, three or more arithmetic units execute identical instructions so that the correct value can be determined by majority vote. In addition, the behavioral synthesis section 100 according to the foregoing embodiments and the data processing device having the behavioral synthesis section 100 can implement an arbitrary process by having a CPU (central processing unit) execute a computer program.
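The correction scheme mentioned above amounts to a majority vote over redundant results. A minimal sketch, assuming three copies of the operation are executed on three arithmetic units (two copies would permit only detection by comparison, not correction):

```c
#include <assert.h>

/* Majority vote over three redundant results: if at least two agree,
   their common value is taken as correct, masking a single faulty
   arithmetic unit. */
static int vote3(int a, int b, int c) {
    if (a == b || a == c)
        return a;
    return b;   /* a disagrees with both, so b == c (single-fault case) */
}
```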
In the above example, the program can be stored on various types of non-transitory computer-readable media and supplied to the computer. The non-transitory computer-readable media include various types of tangible storage media. More specifically, the non-transitory computer-readable media include a magnetic recording medium (e.g., flexible disk, magnetic tape, or hard disk drive), a magneto-optical recording medium (e.g., magneto-optical disk), a CD-ROM (read-only memory), a CD-R, a CD-R/W, a DVD (digital versatile disc), a BD (Blu-ray (registered trademark) disc), and a semiconductor memory (e.g., mask ROM, PROM (programmable ROM), EPROM (erasable PROM), flash ROM, or RAM (random-access memory)). The program may be supplied to the computer by using various types of transitory computer-readable media. The transitory computer-readable media include an electrical signal, an optical signal, and an electromagnetic wave. The transitory computer-readable media can supply the program to the computer through an electric wire, optical fiber, or other wired communication path or a wireless communication path.
(Differences from Related-Art Technologies)
The SIMD processor disclosed in Japanese Patents Nos. 4292197 and 4699002 and in Japanese Unexamined Patent Application Publication (Translation of PCT Application) No. 2010-539582 successively interprets operation instructions in the same manner as a CPU and the like and performs a plurality of arithmetic processes in parallel. In the SIMD processor, a conditional branch is implemented as a jump instruction.
Consequently, even when a conditional branch depends on only one of a plurality of sets of data to be processed in parallel, the SIMD processor cannot execute the conditional branch for that one set of data alone. It must instead execute the conditional branch for all the sets of data to be processed in parallel. This results in an increase in the number of execution cycles.
FIG. 18 is a diagram illustrating how a conditional branch is executed by the SIMD processor. In the example shown in FIG. 18, the numerals 0 to 7 represent a plurality of sets of data to be processed in parallel, “EXECUTION” represents the execution of an arithmetic process, “TRUE” indicates that the conditional branch is true, and “FALSE” indicates that the conditional branch is false. As is obvious from FIG. 18, the “if” and “else” conditional branches are both executed for all the sets of data 0 to 7 to be processed in parallel, which is why the number of execution cycles increases.
Meanwhile, the array-type processor according to the foregoing embodiments includes a plurality of processor elements that can perform a plurality of arithmetic processes in parallel upon receipt of one operation instruction (SIMD computations). Therefore, the array-type processor according to the foregoing embodiments can dynamically configure a circuit for performing arithmetic processes in parallel upon receipt of a small number of operation instructions. In other words, the array-type processor according to the foregoing embodiments can perform SIMD computations as is the case with the SIMD processor.
Here, the conditional branch is not implemented as a jump instruction, but is synthesized as a data path containing a multiplexer during behavioral synthesis. Therefore, unlike the SIMD processor, the array-type processor according to the foregoing embodiments suppresses an increase in the number of execution cycles even when a plurality of conditional branches exist. In other words, the degradation in computational performance is suppressed.
Further, the related-art SIMD processor is often designed by using an operation description given in a dedicated language extended to handle a vector as a variable instead of using an operation description given, for instance, in C language.
Meanwhile, the behavioral synthesis section (behavioral synthesizer) according to the foregoing embodiments unrolls a predefined loop description included in an operation description given in C language or the like and automatically assigns a SIMD instruction to the unrolled part. This eliminates the necessity of learning a dedicated language, thereby making it possible to reduce the length of design time.
The dynamically reconfigurable processor disclosed in Japanese Patents Nos. 3921367 and 3861898 forms a circuit by dynamically changing the processing of each of a plurality of processor elements and the relation of coupling therebetween as described earlier. Hence, the dynamically reconfigurable processor can reuse circuit resources to perform a complicated arithmetic process with a small-scale circuit. Here, as shown in FIG. 19, a conditional branch is synthesized as a data path containing a multiplexer (marked ▴ or ▾ in the figure) during behavioral synthesis. Therefore, even when a plurality of conditional branches exist, the related-art dynamically reconfigurable processor suppresses an increase in the number of execution cycles.
However, as mentioned earlier, the related-art dynamically reconfigurable processor has to store in memory an extremely large number of instructions about the processing of each of a plurality of processor elements and the relation of coupling therebetween. Hence, the related-art dynamically reconfigurable processor cannot efficiently use circuit resources.
Meanwhile, the array-type processor according to the foregoing embodiments includes a plurality of processor elements that are capable of performing a plurality of arithmetic processes in parallel (SIMD computations) upon receipt of one operation instruction. Hence, the array-type processor according to the foregoing embodiments can dynamically configure a circuit for performing arithmetic processes in parallel upon receipt of a small number of instructions. Consequently, circuit resources can be efficiently used.
While the invention made by the present inventors has been described in detail in terms of preferred embodiments, it is to be understood that the invention is not limited to those embodiments and extends to various modifications that fall within the spirit and scope of the appended claims.

Claims (12)

What is claimed is:
1. A parallel arithmetic device comprising:
a status management section that selects one of a plurality of contexts depending on the situation;
a plurality of processor elements that determine, in accordance with the context selected by the status management section, arithmetic processing to be performed; and
a plurality of switch elements that determine the relation of coupling of each of the processor elements in accordance with the context selected by the status management section;
wherein each of the processor elements includes:
an instruction memory that memorizes a plurality of operation instructions corresponding respectively to the contexts so that an operation instruction corresponding to the context selected by the status management section is read out; and
a plurality of arithmetic units that perform parallel arithmetic processes on a plurality of sets of input data in a manner compliant with the operation instruction read out from the instruction memory.
2. The parallel arithmetic device according to claim 1, wherein each of the processor elements further includes a plurality of registers that temporarily store the sets of input data, the results of computations performed by the arithmetic units, or intermediate data derived from the arithmetic processes performed by the arithmetic units.
3. The parallel arithmetic device according to claim 1, wherein, when a single set of input data is supplied, one of the arithmetic units included in each of the processor elements performs an arithmetic process on the input data in a manner compliant with the operation instruction read out from the instruction memory.
4. A data processing system comprising:
a data processing device; and
the parallel arithmetic device according to claim 1 in which a circuit is dynamically configured, depending on the situation, in accordance with a state based on the result of output from the data processing device;
wherein the data processing device includes:
a behavioral synthesis section that generates a structural description by unrolling, for behavioral synthesis purposes, a loop description that is included in an operation description and has no data dependency between iterations; and
a layout section that subjects the structural description to logic synthesis and performs a place and route.
5. The data processing system according to claim 4, wherein the parallel arithmetic device dynamically configures a circuit corresponding to an unrolled part of the loop description by using the arithmetic units included in at least one of the processor elements.
6. The data processing system according to claim 4, wherein the behavioral synthesis section divides the loop description with no data dependency between iterations into a first loop description having a number of iterations according to the number of arithmetic units included in each of the processor elements and a second loop description adapted to perform a loop process on the first loop description, and unrolls the first loop description.
7. The data processing system according to claim 4, wherein the behavioral synthesis section performs behavioral synthesis by unrolling an outer loop of a multiple loop having an inner loop as well as the outer loop.
8. The data processing system according to claim 4, wherein the behavioral synthesis section replaces the unrolled part of the loop description with a vector variable and outputs the vector variable as the structural description.
9. A computer-readable medium storing a data processing program for supplying circuit data to a parallel arithmetic device that includes a status management section for selecting one of a plurality of contexts depending on the situation, a plurality of processor elements for determining, in accordance with the context selected by the status management section, arithmetic processing to be performed, and a plurality of switch elements for determining the relation of coupling of each of the processor elements in accordance with the context selected by the status management section, and wherein each of the processor elements includes an instruction memory for memorizing a plurality of operation instructions corresponding respectively to the contexts so that an operation instruction corresponding to the context selected by the status management section is read out, and a plurality of arithmetic units for performing arithmetic processes in parallel on a plurality of sets of input data in a manner compliant with the operation instruction read out from the instruction memory, the computer-readable storage medium storing the data processing program for causing a computer to perform a process comprising the steps of:
performing a behavioral synthesis process to generate a structural description by unrolling, for behavioral synthesis purposes, a loop description that is included in an operation description and has no data dependency between iterations; and
performing a layout process to subject the structural description to logic synthesis, perform a place and route, and generate the circuit data.
10. The computer-readable medium storing the data processing program according to claim 9, wherein the behavioral synthesis process divides the loop description with no data dependency between iterations into a first loop description having a number of iterations according to the number of arithmetic units included in each of the processor elements and a second loop description adapted to perform a loop process on the first loop description, and unrolls the first loop description.
11. The computer-readable medium storing the data processing program according to claim 9, wherein the behavioral synthesis process performs behavioral synthesis by unrolling an outer loop of a multiple loop having an inner loop as well as the outer loop.
12. The computer-readable medium storing the data processing program according to claim 9, wherein the behavioral synthesis process replaces the unrolled part of the loop description with a vector variable and outputs the vector variable as the structural description.
US13/935,790 2012-07-10 2013-07-05 Parallel arithmetic device, data processing system with parallel arithmetic device, and data processing program Active 2034-08-14 US9292284B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/042,527 US20160162291A1 (en) 2012-07-10 2016-02-12 Parallel arithmetic device, data processing system with parallel arithmetic device, and data processing program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2012154903A JP2014016894A (en) 2012-07-10 2012-07-10 Parallel arithmetic device, data processing system with parallel arithmetic device, and data processing program
JP2012-154903 2012-07-10

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/042,527 Continuation US20160162291A1 (en) 2012-07-10 2016-02-12 Parallel arithmetic device, data processing system with parallel arithmetic device, and data processing program

Publications (2)

Publication Number Publication Date
US20140019726A1 US20140019726A1 (en) 2014-01-16
US9292284B2 true US9292284B2 (en) 2016-03-22

Family

ID=49915018

Family Applications (2)

Application Number Title Priority Date Filing Date
US13/935,790 Active 2034-08-14 US9292284B2 (en) 2012-07-10 2013-07-05 Parallel arithmetic device, data processing system with parallel arithmetic device, and data processing program
US15/042,527 Abandoned US20160162291A1 (en) 2012-07-10 2016-02-12 Parallel arithmetic device, data processing system with parallel arithmetic device, and data processing program

Family Applications After (1)

Application Number Title Priority Date Filing Date
US15/042,527 Abandoned US20160162291A1 (en) 2012-07-10 2016-02-12 Parallel arithmetic device, data processing system with parallel arithmetic device, and data processing program

Country Status (2)

Country Link
US (2) US9292284B2 (en)
JP (1) JP2014016894A (en)


Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101076869B1 (en) * 2010-03-16 2011-10-25 광운대학교 산학협력단 Memory centric communication apparatus in coarse grained reconfigurable array
JP6056509B2 (en) * 2013-01-30 2017-01-11 富士通株式会社 Information processing apparatus and information processing apparatus control method
US10203960B2 (en) * 2014-02-20 2019-02-12 Tsinghua University Reconfigurable processor and conditional execution method for the same
US9665370B2 (en) 2014-08-19 2017-05-30 Qualcomm Incorporated Skipping of data storage
US9348595B1 (en) * 2014-12-22 2016-05-24 Centipede Semi Ltd. Run-time code parallelization with continuous monitoring of repetitive instruction sequences
US9135015B1 (en) 2014-12-25 2015-09-15 Centipede Semi Ltd. Run-time code parallelization with monitoring of repetitive instruction sequences during branch mis-prediction
US9208066B1 (en) 2015-03-04 2015-12-08 Centipede Semi Ltd. Run-time code parallelization with approximate monitoring of instruction sequences
US10296350B2 (en) 2015-03-31 2019-05-21 Centipede Semi Ltd. Parallelized execution of instruction sequences
US10296346B2 (en) 2015-03-31 2019-05-21 Centipede Semi Ltd. Parallelized execution of instruction sequences based on pre-monitoring
US9715390B2 (en) 2015-04-19 2017-07-25 Centipede Semi Ltd. Run-time parallelization of code execution based on an approximate register-access specification
WO2017033336A1 (en) * 2015-08-27 2017-03-02 三菱電機株式会社 Circuit design assistance device and circuit design assistance program
WO2017132385A1 (en) * 2016-01-26 2017-08-03 Icat Llc Processor with reconfigurable algorithmic pipelined core and algorithmic matching pipelined compiler
US10963265B2 (en) * 2017-04-21 2021-03-30 Micron Technology, Inc. Apparatus and method to switch configurable logic units
US10331445B2 (en) * 2017-05-24 2019-06-25 Microsoft Technology Licensing, Llc Multifunction vector processor circuits
JP6553694B2 (en) * 2017-09-25 2019-07-31 Necスペーステクノロジー株式会社 Processor element, programmable device and control method of processor element
WO2019171428A1 (en) * 2018-03-05 2019-09-12 株式会社日立製作所 Circuit generation device and software generation device
JP7038608B2 (en) * 2018-06-15 2022-03-18 ルネサスエレクトロニクス株式会社 Semiconductor device


Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05158895A (en) * 1991-12-05 1993-06-25 Fujitsu Ltd System for improving loop calculation efficiency in parallel computer system
JP2001338492A (en) * 2000-05-26 2001-12-07 Matsushita Electric Ind Co Ltd Semiconductor device and control method
US7461234B2 (en) * 2002-07-01 2008-12-02 Panasonic Corporation Loosely-biased heterogeneous reconfigurable arrays
US7471643B2 (en) * 2002-07-01 2008-12-30 Panasonic Corporation Loosely-biased heterogeneous reconfigurable arrays
GB2417586B (en) * 2002-07-19 2007-03-28 Picochip Designs Ltd Processor array
US7168070B2 (en) * 2004-05-25 2007-01-23 International Business Machines Corporation Aggregate bandwidth through management using insertion of reset instructions for cache-to-cache data transfer
JP2006350907A (en) * 2005-06-20 2006-12-28 Ricoh Co Ltd Simd type microprocessor, data transfer unit, and data conversion unit
JP2007073010A (en) * 2005-09-09 2007-03-22 Ricoh Co Ltd SIMD processor, and image processing method and image processor using the SIMD processor
JP2008071130A (en) * 2006-09-14 2008-03-27 Ricoh Co Ltd Simd type microprocessor
JP4801605B2 (en) * 2007-02-28 2011-10-26 株式会社リコー SIMD type microprocessor
JP4690362B2 (en) * 2007-07-04 2011-06-01 株式会社リコー SIMD type microprocessor and data transfer method for SIMD type microprocessor
JP4913685B2 (en) * 2007-07-04 2012-04-11 株式会社リコー SIMD type microprocessor and control method of SIMD type microprocessor

Patent Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH04293150A (en) 1991-03-20 1992-10-16 Hitachi Ltd Compiling method
US6574724B1 (en) * 2000-02-18 2003-06-03 Texas Instruments Incorporated Microprocessor with non-aligned scaled and unscaled addressing
JP2001249818A (en) 2000-03-03 2001-09-14 Matsushita Electric Ind Co Ltd Optimizing device and computer readable recording medium in which optimization program is recorded
US6993756B2 (en) 2000-03-03 2006-01-31 Matsushita Electric Industrial Co., Ltd. Optimization apparatus that decreases delays in pipeline processing of loop and computer-readable storage medium storing optimization program
US20030046513A1 (en) * 2001-08-31 2003-03-06 Nec Corporation Arrayed processor of array of processing elements whose individual operations and mutual connections are variable
JP3921367B2 (en) 2001-09-26 2007-05-30 日本電気株式会社 Data processing apparatus and method, computer program, information storage medium, parallel processing apparatus, data processing system
US7120903B2 (en) 2001-09-26 2006-10-10 Nec Corporation Data processing apparatus and method for generating the data of an object program for a parallel operation apparatus
US20030126404A1 (en) 2001-12-26 2003-07-03 Nec Corporation Data processing system, array-type processor, data processor, and information storage medium
US7103720B1 (en) 2003-10-29 2006-09-05 Nvidia Corporation Shader cache using a coherency protocol
JP4699002B2 (en) 2003-12-09 2011-06-08 アーム・リミテッド Data processing apparatus and method for performing arithmetic operations in SIMD data processing
US7761693B2 (en) 2003-12-09 2010-07-20 Arm Limited Data processing apparatus and method for performing arithmetic operations in SIMD data processing
US20050141784A1 (en) * 2003-12-31 2005-06-30 Ferriz Rafael M. Image scaling using an array of processing elements
JP3861898B2 (en) 2004-11-08 2006-12-27 日本電気株式会社 Data processing system, array type processor, data processing apparatus, computer program, information storage medium
US20060236075A1 (en) 2005-03-18 2006-10-19 Kazuhiko Hara SIMD microprocessor and data processing method
JP2006260479A (en) 2005-03-18 2006-09-28 Ricoh Co Ltd Simd type microprocessor and data processing method
JP4292197B2 (en) 2005-12-02 2009-07-08 エヌヴィディア コーポレイション Method for processing a thread group within a SIMD architecture
US20080127065A1 (en) * 2006-08-24 2008-05-29 Bryant William K Devices, systems, and methods for configuring a programmable logic controller
JP2010539582A (en) 2007-09-11 2010-12-16 コア ロジック,インコーポレイテッド Reconfigurable array processor for floating point operations
US8078835B2 (en) 2007-09-11 2011-12-13 Core Logic, Inc. Reconfigurable array processor for floating-point operations
US20150310311A1 (en) * 2012-12-04 2015-10-29 Institute Of Semiconductors, Chinese Academy Of Sciences Dynamically reconstructable multistage parallel single instruction multiple data array processing system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Office Action issued by Japanese Patent Office in Japanese Application No. 2012-154903 mailed Dec. 1, 2015.
Toru Awashima et al., "C Compiler for Dynamically Reconfigurable Processor: DRP", The Institute of Electronics, Information and Communication Engineers, Technical Report of IEICE, vol. 103 No. 580, Jan. 2004.

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170123794A1 (en) * 2015-11-04 2017-05-04 International Business Machines Corporation Tightly coupled processor arrays using coarse grained reconfigurable architecture with iteration level commits
US10120685B2 (en) 2015-11-04 2018-11-06 International Business Machines Corporation Tightly coupled processor arrays using coarse grained reconfigurable architecture with iteration level commits
US10528356B2 (en) * 2015-11-04 2020-01-07 International Business Machines Corporation Tightly coupled processor arrays using coarse grained reconfigurable architecture with iteration level commits
US10846260B2 (en) 2018-07-05 2020-11-24 Qualcomm Incorporated Providing reconfigurable fusion of processing elements (PEs) in vector-processor-based devices

Also Published As

Publication number Publication date
US20140019726A1 (en) 2014-01-16
JP2014016894A (en) 2014-01-30
US20160162291A1 (en) 2016-06-09

Similar Documents

Publication Publication Date Title
US9292284B2 (en) Parallel arithmetic device, data processing system with parallel arithmetic device, and data processing program
KR102413832B1 (en) vector multiply add instruction
EP2951681B1 (en) Solution to divergent branches in a simd core using hardware pointers
JP5646737B2 (en) Conditional comparison instructions
JP4042604B2 (en) Program parallelization apparatus, program parallelization method, and program parallelization program
JP4489102B2 (en) Profiler for optimization of processor structure and applications
TWI728068B (en) Complex multiply instruction
KR102256188B1 (en) Data processing apparatus and method for processing vector operands
JP6245031B2 (en) Compilation program, compilation method, and compilation apparatus
EP2951682B1 (en) Hardware and software solutions to divergent branches in a parallel pipeline
JP4973101B2 (en) Automatic synthesizer
JP2004021553A (en) Processor, device and method for program conversion, and computer program
JP2002323982A (en) Method for processing instruction
US9354893B2 (en) Device for offloading instructions and data from primary to secondary data path
JP2010271755A (en) Simulation system, method, and program
JP2019517060A (en) Apparatus and method for managing address conflicts in performing vector operations
US9354850B2 (en) Method and apparatus for instruction scheduling using software pipelining
US11262989B2 (en) Automatic generation of efficient vector code with low overhead in a time-efficient manner independent of vector width
US9383981B2 (en) Method and apparatus of instruction scheduling using software pipelining
CN117313595B (en) Random instruction generation method, equipment and system for function verification
KR101711388B1 (en) Device and method to compile for scheduling block at pipeline
US20240053970A1 (en) Processor and compiler
JP5598114B2 (en) Arithmetic unit
JP2021196637A (en) Compiler program, compilation method, and information processing device
Teich et al. Compact Code Generation and Throughput Optimization for Coarse-Grained Reconfigurable Arrays

Legal Events

Date Code Title Description
AS Assignment

Owner name: RENESAS ELECTRONICS CORPORATION, JAPAN

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:TOI, TAKAO;FUJII, TARO;KATO, YOSHINOSUKE;AND OTHERS;REEL/FRAME:030743/0423

Effective date: 20130524

STCF Information on status: patent grant

Free format text: PATENTED CASE

CC Certificate of correction
AS Assignment

Owner name: RENESAS ELECTRONICS CORPORATION, JAPAN

Free format text: CHANGE OF ADDRESS;ASSIGNOR:RENESAS ELECTRONICS CORPORATION;REEL/FRAME:044928/0001

Effective date: 20150806

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 4

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 8