WO2012068475A2 - Method and apparatus for moving data from a simd register file to general purpose register file - Google Patents

Info

Publication number
WO2012068475A2
Authority
WO
WIPO (PCT)
Prior art keywords
risc
lead
address
processor
register file
Prior art date
Application number
PCT/US2011/061428
Other languages
French (fr)
Other versions
WO2012068475A3 (en)
Inventor
William Johnson
John W. Glotzbach
Hamid Sheikh
Ajay Jayaraj
Stephen Busch
Murali Chinnakonda
Jeffrey L. Nye
Toshio Nagata
Shalini Gupta
Robert J. Nychka
David H. Bartley
Ganesh Sundararajan
Original Assignee
Texas Instruments Incorporated
Texas Instruments Japan Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Texas Instruments Incorporated, Texas Instruments Japan Limited filed Critical Texas Instruments Incorporated
Priority to JP2013540058A priority Critical patent/JP2014505916A/en
Priority to CN201180055771.5A priority patent/CN103221935B/en
Publication of WO2012068475A2 publication Critical patent/WO2012068475A2/en
Publication of WO2012068475A3 publication Critical patent/WO2012068475A3/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30076Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/80Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/80Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8053Vector processors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/3005Arrangements for executing specific machine instructions to perform operations for flow control
    • G06F9/30054Unconditional branch instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/30101Special purpose registers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/3012Organisation of register space, e.g. banked or distributed register file
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/34Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes
    • G06F9/355Indexed addressing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/34Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes
    • G06F9/355Indexed addressing
    • G06F9/3552Indexed addressing using wraparound, e.g. modulo or circular addressing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3853Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution of compound instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • G06F9/3887Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • G06F9/3889Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute
    • G06F9/3891Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute organised in groups of units sharing resources, e.g. clusters

Definitions

  • the disclosure relates generally to a processor and, more particularly, to a processing cluster.
  • FIG. 1 is a graph that depicts speed-up in execution rate versus parallel overhead for multi-core systems (ranging from 2 to 16 cores), where speed-up is the single-processor execution time divided by the parallel-processor execution time.
  • the parallel overhead has to be close to zero to obtain a significant benefit from a large number of cores.
  • because the overhead tends to be very high if there is any interaction between parallel programs, it is normally very difficult to efficiently use more than one or two processors for anything but completely decoupled programs.
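  • The relationship between speed-up and parallel overhead can be illustrated with a minimal sketch. The cost model below is an assumption made for illustration (the work divides evenly across the cores, and the parallel run pays an extra `overhead` fraction of the single-processor time); it is not the model behind FIG. 1, and the function name `speedup` is hypothetical.

```python
def speedup(cores: int, overhead: float) -> float:
    """Speed-up = single-processor time / parallel-processor time.

    Assumed model (not taken from the source): the work splits evenly
    across `cores`, and the parallel run additionally pays `overhead`,
    expressed as a fraction of the single-processor execution time.
    """
    if cores < 1 or overhead < 0:
        raise ValueError("cores must be >= 1 and overhead must be >= 0")
    single_time = 1.0
    parallel_time = single_time / cores + overhead * single_time
    return single_time / parallel_time


if __name__ == "__main__":
    # Tabulate speed-up for 2 to 16 cores at a few overhead fractions.
    for cores in (2, 4, 8, 16):
        row = [round(speedup(cores, o), 2) for o in (0.0, 0.05, 0.25)]
        print(cores, row)
```

  • Under this assumed model, the speed-up only approaches the core count when the overhead fraction is close to zero, which matches the qualitative point made above.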
  • Accordingly, an embodiment of the present disclosure provides a method.
  • the method is characterized by: changing the state of a signal on a data movement lead (risc is mfwr) to indicate the data movement instruction from a first register file (4358-1 to 4358-8, 7902) in a computational unit (4308-1 to 4308-M, 7607-1 to 7607-P) to a second register file (5206) in a processor (4322, 7614); providing a lane address from the processor (4322, 7614) to the computational unit (4308-1 to 4308-M, 7607-1 to 7607-P) over a first address lead (risc is ua); providing a read address from the processor (4322, 7614) to the computational unit (4308-1 to 4308-M, 7607-1 to 7607-P) over a second address lead (risc is ra); and transferring data from the first register file (4358-1 to 4358-8, 7902
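  • The steps recited above can be sketched as a small behavioral model. This is an illustration under stated assumptions, not the claimed hardware: the class names, the `data_move` flag, the destination index `dest`, and the register-file sizes (eight lanes of 32 registers, a 16-entry processor register file) are hypothetical choices for the example.

```python
from dataclasses import dataclass, field

NUM_LANES = 8        # assumed: one SIMD register file per smaller functional unit
REGS_PER_LANE = 32   # assumed example register count per SIMD register file
NUM_GPRS = 16        # assumed size of the processor's general purpose register file


@dataclass
class ComputationalUnit:
    """Holds the first (SIMD) register files, one per lane."""
    register_files: list = field(
        default_factory=lambda: [[0] * REGS_PER_LANE for _ in range(NUM_LANES)]
    )

    def read(self, data_move: bool, lane_address: int, read_address: int):
        # The unit only responds when the data movement signal is asserted.
        if not data_move:
            return None
        # The lane address selects a register file; the read address selects
        # a register inside that file.
        return self.register_files[lane_address][read_address]


@dataclass
class Processor:
    """Holds the second (general purpose) register file."""
    gprs: list = field(default_factory=lambda: [0] * NUM_GPRS)

    def move_from_simd(self, unit: ComputationalUnit, lane_address: int,
                       read_address: int, dest: int) -> int:
        # Assert the data movement signal, drive both addresses, and
        # capture the returned data in the destination register.
        value = unit.read(True, lane_address, read_address)
        self.gprs[dest] = value
        return value
```

  • For example, after `unit.register_files[3][7] = 0x1234`, calling `Processor().move_from_simd(unit, 3, 7, 2)` copies that value into general purpose register 2 of the model.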
  • FIG. 1 is a graph of multicore speed-up parameters.
  • FIG. 2 is a diagram of a system in accordance with an embodiment of the present disclosure.
  • FIG. 3 is a diagram of the SOC in accordance with an embodiment of the present disclosure.
  • FIG. 4 is a diagram of a parallel processing cluster in accordance with an embodiment of the present disclosure.
  • FIGS. 5 and 6 are diagrams of a portion of a node or computing element in the processing cluster.
  • FIG. 7 is a block diagram of shared function-memory.
  • FIG. 8 is a diagram of the SIMD data paths for the shared function-memory.
  • FIG. 9 is a diagram of a portion of one SIMD data path.
  • FIG. 10 is a more detailed diagram of a node processor or RISC processor.
  • FIGS. 11 and 12 are diagrams of examples of portions of a pipeline for a node processor or RISC processor.
  • an imaging device 1250 (which can, for example, be a mobile phone or camera) generally comprises an image sensor 1252, an SOC 1300, a dynamic random access memory (DRAM) 1254, a flash memory 1256, a display 1258, and a power management integrated circuit (PMIC) 1260.
  • the image sensor 1252 is able to capture image information (which can be a still image or video) that can be processed by the SOC 1300 and DRAM 1254 and stored in a nonvolatile memory (namely, the flash memory 1256).
  • image information stored in the flash memory 1256 can be displayed to the user over the display 1258 by use of the SOC 1300 and DRAM 1254.
  • imaging devices 1250 are oftentimes portable and include a battery as a power supply; the PMIC 1260 (which can be controlled by the SOC 1300) can assist in regulating power use to extend battery life.
  • In FIG. 3, an example of a system-on-chip or SOC 1300 is depicted in accordance with an embodiment of the present disclosure.
  • This SOC 1300 (which is typically an integrated circuit or IC, such as an OMAP™) generally comprises a processing cluster 1400 (which generally performs the parallel processing described above) and a host processor 1316 that provides the hosted environment (described and referenced above).
  • the host processor 1316 can be a wide (i.e., 32 bits, 64 bits, etc.) RISC processor (such as an ARM Cortex-A9) that communicates with the bus arbitrator 1310, buffer 1306, bus bridge 1320 (which allows the host processor 1316 to access the peripheral interface 1324 over interface bus or Ibus 1330), hardware application programming interface (API) 1308, and interrupt controller 1322 over the host processor bus or HP bus 1328.
  • Processing cluster 1400 typically communicates with functional circuitry 1302 (which can, for example, be a charge-coupled device or CCD interface and which can communicate with off-chip devices), buffer 1306, bus arbitrator 1310, and peripheral interface 1324 over the processing cluster bus or PC bus 1326.
  • the host processor 1316 is able to provide information (i.e., configure the processing cluster 1400 to conform to a desired parallel implementation) through API 1308, while both the processing cluster 1400 and host processor 1316 can directly access the flash memory 1256 (through flash interface 1312) and DRAM 1254 (through memory controller 1304). Additionally, test and boundary scan can be performed through Joint Test Action Group (JTAG) interface 1318.
  • processing cluster 1400 corresponds to hardware 722.
  • Processing cluster 1400 generally comprises partitions 1402-1 to 1402-R, which include nodes 808-1 to 808-N, node wrappers 810-1 to 810-N, instruction memories 1404-1 to 1404-R, and bus interface units (BIUs) 4710-1 to 4710-R (which are discussed in detail below).
  • Nodes 808-1 to 808-N are each coupled to data interconnect 814 (through their respective BIUs 4710-1 to 4710-R and the data bus 1422), and the controls or messages for the partitions 1402-1 to 1402-R are provided from the control node 1406 through the message bus 1420.
  • the global load/store (GLS) unit 1408 and shared function-memory 1410 also provide additional functionality for data movement (as described below).
  • a level 3 or L3 cache 1412, peripherals 1414 (which are generally not included within the IC), memory 1416 (which is typically flash memory 1256 and/or DRAM 1254 as well as other memory that is not included within the SOC 1300), and hardware accelerators (HWA) unit 1418 are used with processing cluster 1400.
  • An interface 1405 is also provided so as to communicate data and addresses to control node 1406.
  • Processing cluster 1400 generally uses a "push" model for data transfers.
  • the transfers generally appear as posted writes, rather than request-response types of accesses.
  • This has the benefit of reducing occupancy on the global interconnect (i.e., data interconnect 814) by a factor of two compared to request-response accesses because data transfer is one-way.
  • the push model generates a single transfer. This is important for scalability because network latency increases as network size increases, and this invariably reduces the performance of request-response transactions.
  • the push model, along with the dataflow protocol (i.e., 812-1 to 812-N), generally minimizes global data traffic to that used for correctness, while also generally minimizing the effect of global dataflow on local node utilization. There is normally little to no impact on node (i.e., 808-i) performance even with a large amount of global traffic.
  • Sources write data into global output buffers (discussed below) and continue without requiring an acknowledgement of transfer success.
  • the dataflow protocol i.e., 812-1 to 812-N generally ensures that the transfer succeeds on the first attempt to move data to the destination, with a single transfer over interconnect 814.
  • the global output buffers (which are discussed below) can hold up to 16 outputs (for example), making it very unlikely that a node (i.e., 808-i) stalls because of insufficient instantaneous global bandwidth for output. Furthermore, the instantaneous bandwidth is not impacted by request-response transactions or replaying of unsuccessful transfers.
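  • The posted-write behavior of the global output buffer can be sketched as a bounded queue. This is a minimal illustration, assuming the 16-entry example capacity mentioned above; the class and method names are hypothetical, and the real buffer and dataflow protocol are not modeled.

```python
from collections import deque


class GlobalOutputBuffer:
    """Bounded queue of posted writes (hypothetical model)."""

    def __init__(self, capacity: int = 16):
        self.capacity = capacity
        self.entries = deque()

    def post_write(self, destination, data) -> bool:
        """Queue a posted write without waiting for an acknowledgement.

        Returns False when the buffer is full, which is the only case in
        this model where the source node would have to stall.
        """
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append((destination, data))
        return True

    def drain_one(self):
        """The interconnect removes one entry: a single one-way transfer."""
        return self.entries.popleft() if self.entries else None
```

  • In this model the source keeps executing after every successful `post_write`, and a slot is freed each time the interconnect drains an entry, so a stall requires sixteen outputs to be pending at once.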
  • the push model more closely matches the programming model, namely programs do not "fetch" their own data. Instead, their input variables and/or parameters are written before being invoked.
  • initialization of input variables appears as writes into memory by the source program.
  • these writes are converted into posted writes that populate the values of variables in node contexts.
  • the global input buffers are used to receive data from source nodes. Since the data memory for each node 808-1 to 808-N is single-ported, the write of input data might conflict with a read by the local single instruction, multiple data (SIMD) unit. This contention is avoided by accepting input data into the global input buffer, where it can wait for an open data memory cycle (that is, there is no bank conflict with the SIMD access).
  • the data memory can have 32 banks (for example), so it is very likely that the buffer is freed quickly. However, the node (i.e., 808-i) should have a free buffer entry because there is no handshaking to acknowledge the transfer.
  • the global input buffer can stall the local node (i.e., 808-i) and force a write into the data memory to free a buffer location, but this event should be extremely rare.
  • the global input buffer is implemented as two separate random access memories (RAMs), so that one can be in a state to write global data while the other is in a state to be read into the data memory.
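  • A minimal sketch of the two-RAM arrangement follows. It assumes a simple model in which one RAM accepts incoming global data while the other is drained into the data memory, and an entry is written only in a cycle where its bank does not conflict with the SIMD access. All names are hypothetical, and the swap policy is left to the caller.

```python
class PingPongInputBuffer:
    """Two RAMs: one accepts global data while the other is drained."""

    def __init__(self):
        self.rams = [[], []]
        self.write_side = 0  # index of the RAM currently accepting input

    def accept(self, bank: int, data) -> None:
        # Incoming global data lands in the write-side RAM.
        self.rams[self.write_side].append((bank, data))

    def swap(self) -> None:
        # Exchange the roles of the two RAMs.
        self.write_side ^= 1

    def drain(self, simd_bank, data_memory: dict) -> bool:
        """Write one buffered entry into the data memory, if possible.

        `simd_bank` is the bank the SIMD accesses this cycle (None for a
        spare cycle). Returns True if an entry was written.
        """
        read_side = self.rams[self.write_side ^ 1]
        for i, (bank, data) in enumerate(read_side):
            if bank != simd_bank:  # no bank conflict with the SIMD access
                data_memory.setdefault(bank, []).append(data)
                del read_side[i]
                return True
        return False
```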
  • the messaging interconnect is separate from the global data interconnect but also uses a push model.
  • nodes 808-1 to 808-N are replicated in processing cluster 1400 analogous to SMP or symmetric multi-processing with the number of nodes scaled to the desired throughput.
  • the processing cluster 1400 can scale to a very large number of nodes.
  • Nodes 808-1 to 808-N are grouped into partitions 1402-1 to 1402-R, with each having one or more nodes.
  • Partitions 1402-1 to 1402-R assist scalability by increasing local communication between nodes, and by allowing larger programs to compute larger amounts of output data, making it more likely to meet desired throughput requirements.
  • nodes communicate using local interconnect, and do not require global resources.
  • the nodes within a partition also can share instruction memory (i.e., 1404-i), with any granularity: from each node using an exclusive instruction memory to all nodes using common instruction memory. For example, three nodes can share three banks of instruction memory, with a fourth node having an exclusive bank of instruction memory.
  • the nodes generally execute the same program synchronously.
  • the processing cluster 1400 also can support a very large number of nodes (i.e., 808-i) and partitions (i.e., 1402-i).
  • the number of nodes per partition is usually limited to 4 because having more than 4 nodes per partition generally resembles a non-uniform memory access (NUMA) architecture.
  • partitions are connected through one (or more) crossbars (which are described below with respect to interconnect 814) that have a generally constant cross-sectional bandwidth.
  • Processing cluster 1400 is currently architected to transfer one node's width of data (for example, 64, 16-bit pixels) every cycle, segmented into 4 transfers of 16 pixels per cycle over 4 cycles.
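  • The transfer sizes quoted above can be checked with a few lines of arithmetic. The constant names are hypothetical; the values are the example figures from the text.

```python
PIXELS_PER_NODE_WIDTH = 64   # example node width, in pixels
BITS_PER_PIXEL = 16          # example pixel size
TRANSFERS_PER_NODE_WIDTH = 4 # node width is segmented into 4 transfers

# Total bits in one node's width of data.
node_width_bits = PIXELS_PER_NODE_WIDTH * BITS_PER_PIXEL

# Pixels and bits carried by each of the 4 segmented transfers.
pixels_per_transfer = PIXELS_PER_NODE_WIDTH // TRANSFERS_PER_NODE_WIDTH
bits_per_transfer = pixels_per_transfer * BITS_PER_PIXEL

if __name__ == "__main__":
    print(node_width_bits, pixels_per_transfer, bits_per_transfer)
```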
  • the processing cluster 1400 is generally latency-tolerant, and node buffering generally prevents node stalls even when the interconnect 814 is nearly saturated (note that this condition is very difficult to achieve except by synthetic programs).
  • processing cluster 1400 includes global resources that are shared between partitions: (1) Control Node 1406, which implements the system-wide messaging interconnect (over message bus 1420), event processing and scheduling, and the interface to the host processor and debugger (all of which are described in detail below).
  • GLS unit 1408, which contains a programmable reduced instruction set computer (RISC) processor, enabling system data movement that can be described by C++ programs that can be compiled directly as GLS data-movement threads.
  • This enables system code to execute in cross-hosted environments without modifying source code, and is much more general than direct memory access because it can move from any set of addresses (variables) in the system or SIMD data memory (described below) to any other set of addresses (variables). It is multi-threaded, supporting (for example) up to 16 threads with a 0-cycle context switch.
  • Shared Function-Memory 1410 which is a large shared memory that provides a general lookup table (LUT) and statistics-collection facility (histogram). It also can support pixel processing using the large shared memory that is not well supported by the node SIMD (for cost reasons), such as resampling and distortion correction.
  • This processing uses (for example) a six-issue RISC processor (i.e., SFM processor 7614, which is described in detail below), implementing scalar, vector, and 2D arrays as native types.
  • Hardware Accelerators 1418 which can be incorporated for functions that do not require programmability, or to optimize power and/or area. Accelerators appear to the subsystem as other nodes in the system, participate in the control and data flow, can create events and be scheduled, and are visible to the debugger. (Hardware accelerators can have dedicated LUT and statistics gathering, where applicable.)
  • Data Interconnect 814 and System Open Core Protocol (OCP) L3 connection 1412. These manage the movement of data between node partitions, hardware accelerators, and system memories and peripherals on the data bus 1422. (Hardware accelerators can have private connections to L3 also.)
  • Node 808-i is the computing element in processing cluster 1400, while the basic element for addressing and program flow-control is RISC processor or node processor 4322.
  • this node processor 4322 can have a 32-bit data path with 20-bit instructions (with the possibility of a 20-bit immediate field in a 40-bit instruction).
  • Pixel operations, for example, are performed in a set of 32 pixel functional units, in a SIMD organization, in parallel with four loads (for example) to, and two stores (for example) from, SIMD registers from/to SIMD data memory (the instruction-set architecture of node processor 4322 is described in section 7 below).
  • An instruction packet describes (for example) one RISC processor core instruction, four SIMD loads, and two SIMD stores, in parallel with a 3-issue SIMD instruction that is executed by all SIMD functional units 4308-1 to 4308-M.
  • loads and stores move data between SIMD data-memory locations and SIMD local registers, which can, for example, represent up to 64, 16-bit pixels.
  • SIMD loads and stores use shared registers 4320-i for indirect addressing (direct addressing is also supported), but SIMD addressing operations read these registers: addressing context is managed by the core 4320.
  • the core 4320 has a local memory 4328 for register spill/fill, addressing context, and input parameters.
  • a partition instruction memory 1404-i is provided per node, and multiple nodes can share a partition instruction memory 1404-i to execute larger programs on datasets that span multiple nodes.
  • Node 808-i also incorporates several features to support parallelism.
  • the global input buffer 4316-i and global output buffer 4310-i (which in conjunction with Lf and Rt buffers 4314-i and 4312-i generally comprise input/output (IO) circuitry for node 808-i) decouple node 808-i input and output from instruction execution, making it very unlikely that the node stalls because of system IO.
  • Inputs are normally received well in advance of processing (by SIMD data memory 4306-1 to 4306-M and functional units 4308-1 to 4308-M), and are stored in SIMD data memory 4306-1 to 4306-M using spare cycles (which are very common).
  • SIMD output data is written to the global output buffer 4310-i and routed through the processing cluster 1400 from there, making it unlikely that a node (i.e., 808-i) stalls even if the system bandwidth approaches its limit (which is also unlikely).
  • each SIMD data memory 4306-1 to 4306-M and its corresponding SIMD functional unit 4308-1 to 4308-M are collectively referred to as a "SIMD unit".
  • SIMD data memory 4306-1 to 4306-M is organized into non-overlapping contexts, of variable size, allocated either to related or unrelated tasks. Contexts are fully shareable in both horizontal and vertical directions. Sharing in the horizontal direction uses read-only memories 4330-i and 4332-i, which are typically read-only for the program but writeable by the write buffers 4302-i and 4304-i, load/store (LS) unit 4318-i, or other hardware. These memories 4330-i and 4332-i can also be about 512x2 bits in size. Generally, these memories 4330-i and 4332-i correspond to pixel locations to the left and right relative to the central pixel locations operated on.
  • These memories 4330-i and 4332-i use a write-buffering mechanism (i.e., write buffers 4302-i and 4304-i) to schedule writes, where side-context writes are usually not synchronized with local access.
  • the buffer 4302-i generally maintains coherence with adjacent pixel (for example) contexts that operate concurrently. Sharing in the vertical direction uses circular buffers within the SIMD data memory 4306-1 to 4306-M; circular addressing is a mode supported by the load and store instructions applied by the LS unit 4318-i. Shared data is generally kept coherent using system-level dependency protocols described above.
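  • Circular addressing of the kind used for vertical sharing can be sketched as a modulo wrap within a buffer region. This is a generic illustration; the function name and parameters are hypothetical, and the actual addressing mode of the load and store instructions is not specified here.

```python
def circular_address(base: int, size: int, index: int) -> int:
    """Map a logical index onto a circular buffer of `size` entries.

    The buffer occupies addresses base .. base + size - 1. Indices that
    run past either end wrap around instead of leaving the buffer.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    # Python's % always yields a non-negative result for a positive
    # modulus, so negative indices wrap to the end of the buffer.
    return base + (index % size)
```

  • For instance, with `base=100` and `size=8`, index 9 maps to address 101 and index -1 maps to address 107, so older lines are overwritten in place as new lines arrive.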
  • Context allocation and sharing is specified by SIMD data memory 4306-1 to 4306-M context descriptors, in context-state memory 4326, which is associated with the node processor 4322.
  • This memory 4326 can be, for example, a 16x16x32-bit or 2x16x256-bit RAM.
  • These descriptors also specify how data is shared between contexts in a fully general manner, and retain information to handle data dependencies between contexts.
  • the Context Save/Restore memory 4324 is used to support 0-cycle task switching (which is described above), by permitting registers 4320-i to be saved and restored in parallel.
  • SIMD data memory 4306-1 to 4306-M and processor data memory 4328 contexts are preserved using independent context areas for each task.
  • SIMD data memory 4306-1 to 4306-M and processor data memory 4328 are partitioned into a variable number of contexts, of variable size. Data in the vertical frame direction is retained and re-used within the context itself. Data in the horizontal frame direction is shared by linking contexts together into a horizontal group. It is important to note that the context organization is mostly independent of the number of nodes involved in a computation and how they interact with each other. The primary purpose of contexts is to retain, share, and re-use image data, regardless of the organization of nodes that operate on this data.
  • SIMD data memory 4306-1 to 4306-M contains (for example) pixel and intermediate context operated on by the functional units 4308-1 to 4308-M.
  • SIMD data memory 4306-1 to 4306-M is generally partitioned into (for example) up to 16 disjoint context areas, each with a programmable base address, with a common area accessible from all contexts that is used by the compiler for register spill/fill.
  • the processor data memory 4328 contains input parameters, addressing context, and a spill/fill area for registers 4320-i.
  • Processor data memory 4328 can have (for example) up to 16 disjoint local context areas that correspond to SIMD data memory 4306-1 to 4306-M contexts, each with a programmable base address.
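  • The idea of disjoint context areas with programmable base addresses can be sketched as a small table that rejects overlapping areas and translates context-relative offsets. The class name, method names, and the bounds check are assumptions for illustration; the 16-context limit is the example figure from the text.

```python
MAX_CONTEXTS = 16  # example limit on context areas


class ContextTable:
    """Maps a context id to a (base, size) area and keeps areas disjoint."""

    def __init__(self):
        self.areas = {}  # context id -> (base address, size)

    def allocate(self, ctx: int, base: int, size: int) -> None:
        if not 0 <= ctx < MAX_CONTEXTS:
            raise ValueError("context id out of range")
        if base < 0 or size <= 0:
            raise ValueError("invalid base or size")
        # Reject any area that overlaps an existing context area.
        for other, (b, s) in self.areas.items():
            if other != ctx and base < b + s and b < base + size:
                raise ValueError("context areas must be disjoint")
        self.areas[ctx] = (base, size)

    def translate(self, ctx: int, offset: int) -> int:
        # A context-relative offset is added to the programmable base.
        base, size = self.areas[ctx]
        if not 0 <= offset < size:
            raise IndexError("offset outside context area")
        return base + offset
```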
  • the nodes (i.e., node 808-i) have three configurations: 8 SIMD registers (first configuration); 32 SIMD registers (second configuration); and 32 SIMD registers plus three extra execution units in each of the smaller functional units (third configuration).
  • In FIG. 6, an example of a SIMD unit (namely, SIMD data memory 4306-1 and SIMD functional unit 4308-1), node processor 4322, and LS unit 4318-i can be seen in greater detail.
  • SIMD functional unit 4308-i generally comprises eight smaller functional units 4338-1 to 4338-8 and uses the third configuration.
  • the node processor 4322 generally executes all the control related instructions and holds all the address register values and special register values for SIMD units shown in register files 4340 and 4342 (respectively). Up to six (for example) memory instructions can be calculated in a cycle. For address register values, the address source operands are sent to node processor 4322 from the SIMD unit shown, and the node processor 4322 sends back the register values, which are then used by SIMD unit for address calculation. Similarly, for special register values, the special register source operands are sent to node processor 4322 from the SIMD unit shown, and the node processor 4322 sends back the register values.
  • Node processor 4322 can have (for example) 15 read ports and six write ports for SIMD.
  • the 15 read ports include (for example) 12 read ports that accommodate two operands (i.e., lssrc and lssrc2) for each of six memory instructions and three ports for special register file 4342.
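  • The read-port count follows directly from the example figures above, as this short check shows. The constant names are hypothetical.

```python
MEMORY_INSTRUCTIONS_PER_CYCLE = 6    # up to six memory instructions per cycle
OPERANDS_PER_MEMORY_INSTRUCTION = 2  # two address source operands each
SPECIAL_REGISTER_PORTS = 3           # ports for the special register file

# Ports needed to read address register operands in one cycle.
address_read_ports = (
    MEMORY_INSTRUCTIONS_PER_CYCLE * OPERANDS_PER_MEMORY_INSTRUCTION
)

# Total read ports serving the SIMD.
total_read_ports = address_read_ports + SPECIAL_REGISTER_PORTS

if __name__ == "__main__":
    print(address_read_ports, total_read_ports)
```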
  • special register file 4342 includes two registers named RCLIPMIN and RCLIPMAX, which should be provided together and which are generally restricted to the lower four registers of the 16 entry register file 4342.
  • RCLIPMAX and RCLIPMIN registers are then specified directly in the instruction.
  • the other special registers RND and SCL are specified by a 4-bit register identifier and can be located anywhere in the 16 entry register file 4342.
  • node processor 4322 includes a program counter execution unit 4344, which can update the instruction memory 1404-i.
  • the LS unit 4318-i generally comprises LS decoder 4334, LS execution unit 4336, logic unit 4346, multiply unit 4348, right execution unit 4350, and LS data memory 4339; however the details regarding the data path for LS unit 4318-i are provided below.
  • Each of the smaller functional units 4338-1 through 4338-8 generally (and respectively) comprises SIMD register files 4358-1 to 4358-8 (which can each include 32 registers, for example), left logic units 4352-1 to 4352-8, multiply units 4354-1 to 4354-8, and right logic units 4356-1 to 4356-8.
  • left logic units 4352-1 to 4352-8, multiply units 4354-1 to 4354-8, and right logic units 4356-1 to 4356-8 are generally duplications of left, middle, and right units 4346, 4348, and 4350, respectively. Additionally, similar to the LS unit 4318-i, the data path for each functional unit 4338-1 to 4338-8 is described below.
  • the sizes of some components (i.e., logic unit 4352-1) or the corresponding instruction may vary, while others may remain the same.
  • the LS data memory 4339, lookup table, and histogram remain relatively the same.
  • the LS data memory 4339 can be about 512*32 bits with the first 16 locations holding the context base addresses and the remaining locations being accessible by the contexts.
  • the lookup table or LUT (which is generally within the PC execution unit 4344) can have up to 12 tables with a memory size of 16Kb, wherein four bits can be used to select a table and 14 bits can be used for addressing.
  • Histograms (which are also generally located in the PC execution unit 4344) can have 4 tables, where the histogram shares the 4-bit ID with LUT to select a table and uses 8 bits for addressing.
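By way of illustration only, the table-select and address fields described for the LUT and histograms can be sketched in Python as follows. The source gives only the field widths (a 4-bit table ID, 14 address bits for a LUT, 8 address bits for a histogram); the field ordering (ID in the upper bits) and the function names `pack_index` and `unpack_index` are assumptions made for this sketch.

```python
LUT_ID_BITS = 4      # four bits select a table (shared by LUT and histogram)
LUT_ADDR_BITS = 14   # 14 bits address an entry within a LUT
HIST_ADDR_BITS = 8   # 8 bits address an entry within a histogram table


def pack_index(table_id, addr, addr_bits):
    """Combine a table ID and an in-table address into one index.

    Assumed layout (not stated in the source): table ID in the upper bits.
    """
    if not 0 <= table_id < (1 << LUT_ID_BITS):
        raise ValueError("table ID must fit in 4 bits")
    if not 0 <= addr < (1 << addr_bits):
        raise ValueError("address does not fit in the address field")
    return (table_id << addr_bits) | addr


def unpack_index(index, addr_bits):
    """Split an index back into (table_id, addr)."""
    return index >> addr_bits, index & ((1 << addr_bits) - 1)
```

For example, `pack_index(3, 0x1234, LUT_ADDR_BITS)` places table 3 above a 14-bit address, and `unpack_index` recovers both fields.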
  • In Table 1 below, the instruction sizes for each of the three example configurations can be seen, which can correspond to the sizes of various components.
  • Logic unit (i.e., 4346) 16 bits 24 bits 24 bits
  • Node processor 4322 O bits 20 bits for 20 bits
  • the shared function-memory 1410 is generally a large, centralized memory supporting operations that are not well- supported by the nodes (i.e., for cost reasons).
  • the main components of the shared function-memory 1410 are the two large memories: the function-memory 7602 and the vector-memory 7603.
  • This function-memory 7602 implements a synchronous, instruction-driven implementation of high-bandwidth, vector-based lookup-tables (LUTs) and histograms.
  • the vector-memory 7603 can support operations by (for example) a 6-issue processor (i.e., SFM processor 7614) that implies vector instructions (as detailed in section 8 above), which can, for example, be used for block-based pixel processing.
  • SFM processor 7614 can be accessed using the messaging interface 1420 and data bus 1422.
  • the SFM processor 7614 can, for example, operate on wide pixel contexts (64 pixels) that can have a much more general organization and total memory size than SIMD data memory in the nodes, with much more general processing applied to the data. It supports scalar, vector, and array operations on standard C++ integer datatypes as well as operations on packed pixels that are compatible with various datatypes.
  • the SIMD data paths associated with the vector memory 7603 and function-memory 7602 generally include ports 7605-1 to 7605-Q and functional units 7607-1 to 7607-P.
  • the function-memory 7602 and vector-memory 7603 are generally "shared" in the sense that all processing nodes (i.e., 808-i) can access function-memory 7602 and vector-memory 7603. Data provided to the function-memory 7602 can be accessed via the SFM wrapper (typically in a write-only manner). This sharing is also generally consistent with the context management described above for processing nodes (i.e., 808-i). Data I/O between processing nodes and shared function-memory 1410 also uses the dataflow protocol, and processing nodes, typically, cannot directly access vector-memory 7603.
  • the shared function-memory 1410 can also write to the function-memory 7602, but not while it is being accessed by processing nodes.
  • Processing nodes i.e., 808-i
  • eight SIMD data paths (which can be partitioned into two, 16-bit halves because it can operate on 16-bit packed data) can be used.
  • these SIMD data paths generally comprise sets of banks 7802-1 to 7802-L, associated registers 7804-1 to 7804-L, and associated sets of functional units 7806-1 to 7806-L.
  • this SIMD data path can include a 16-entry, 32-bit register file 7902, two 16-bit multipliers 7904 and 7906, and a single, 32-bit arithmetic/logical unit 7908 that can also perform two, 16-bit packed operations in a cycle.
  • each SIMD data path can perform two, independent 16-bit operations, or a combined, 32-bit operation.
  • this can form a 32-bit multiply using the 16-bit multipliers combined with 32-bit adds.
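As a non-limiting sketch, the way a 32-bit multiply can be composed from 16-bit multiplies and 32-bit adds can be modeled in Python as follows. The decomposition into partial products shown here is the standard schoolbook method and is an assumption; the source does not state how multipliers 7904 and 7906 are sequenced. Operands are treated as unsigned, and the function names are hypothetical.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def mul16(a, b):
    """Model of one 16-bit x 16-bit unsigned multiplier (32-bit product)."""
    assert 0 <= a <= MASK16 and 0 <= b <= MASK16
    return a * b


def mul32_low(x, y):
    """Low 32 bits of x * y, built from 16-bit multiplies and 32-bit adds."""
    x_lo, x_hi = x & MASK16, (x >> 16) & MASK16
    y_lo, y_hi = y & MASK16, (y >> 16) & MASK16
    low = mul16(x_lo, y_lo)
    # The cross terms only affect the upper half of the 32-bit result,
    # so only their low 16 bits are kept before shifting.
    cross = (mul16(x_lo, y_hi) + mul16(x_hi, y_lo)) & MASK16
    # x_hi * y_hi would land entirely above bit 31 and is dropped.
    return (low + (cross << 16)) & MASK32
```

The result matches a direct multiply truncated to 32 bits for any pair of unsigned 32-bit operands.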
  • the arithmetic/logical unit 7908 can be capable of performing addition, subtraction, logical operations (i.e., AND), comparisons, and conditional moves.
  • the SIMD data path registers 7804-1 to 7804-L can use a load/store interface to the vector memory 7603. These loads and stores can use features of the vector memory 7603 that are provided for parallel LUT and histogram access by nodes (i.e., 808-i): for nodes, each SIMD data path half can provide an index into function-memory 7602; and, similarly, each SIMD data path half in SFM processor 7614 can provide an independent vector memory 7603 address.
  • Addressing is generally organized so that adjacent data paths can perform the same operation on multiple instances of datatypes such as scalars, vectors, and arrays of 8-, 16-, or 32-bit (for example) data: these are called vector-implied addressing modes (the vector is implied by the SIMD with linear vector memory 7603 addressing).
  • each data path can operate on packed pixels from regions of a frame within banks 7608-1 to 7608-J: these are called vector-packed addressing modes (vectors of packed pixels are implied by the SIMD, with two-dimensional vector memory 7603 addressing).
  • the programming model can hide the width of the SIMD, and programs are written as if they operate on a single pixel or element of other datatype.
  • Vector-implied datatypes are generally SIMD-implemented vectors of either 8-bit chars, 16-bit halfwords, or 32-bit ints, operated on individually by each SIMD data path (i.e., FIG. 9). These vectors are not generally explicit in the program, but rather implied by hardware operation. These datatypes can also be structured as elements within explicit program vectors or arrays: the SIMD effectively adds a hidden second or third dimension to these program vectors or arrays.
  • the programming view can be a single SIMD data path with a dedicated, 32-bit data memory, and this memory is accessed using conventional addressing modes. In the hardware, this view is mapped in a way that each of the 32 SIMD data paths has the appearance of a private data memory, but the implementation takes advantage of the wide, banked organization of vector memory 7603 to implement this functionality in the shared function-memory 1410.
  • the SFM processor 7614 SIMD generally operates within vector memory 7603 contexts similar to node processor 4322 contexts, with descriptors having a base address aligned to the sets of banks 7802-1, and sufficiently large to address the entire vector memory 7603 (i.e., 13 bits for the size of 1024 kB).
  • Each half of a SIMD data path is numbered with a 6-bit identifier (POSN), starting at 0 for the left-most data path.
  • PSN 6-bit identifier
  • the LSB of this value is generally ignored, and the remaining five bits are used to align the vector memory 7603 addresses generated by the data path to the respective words in the vector memory 7603.
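For illustration, the handling of the 6-bit POSN identifier can be sketched in Python as follows. The source states only that the LSB is generally ignored and that the remaining five bits align addresses to the respective words; treating those five bits directly as a word index in the range 0 to 31, and the function name itself, are assumptions.

```python
def posn_to_word_index(posn):
    """Map a 6-bit data-path-half identifier (POSN) to a word index.

    The LSB (which distinguishes the two halves of one data path) is
    dropped; the remaining five bits select one of 32 words.
    """
    if not 0 <= posn < 64:
        raise ValueError("POSN is a 6-bit value")
    return posn >> 1
```

Under this reading, both halves of the left-most data path (POSN 0 and 1) map to word 0, and the right-most pair (POSN 62 and 63) maps to word 31.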
  • general-purpose RISC processors serve various purposes.
  • node processor 4322 (which can be a RISC processor) can be used for program flow control. Examples of RISC architectures are described below.
  • processor 5200 i.e., node processor 4322
  • the pipeline used by processor 5200 generally provides support for general high level language (i.e., C/C++) execution in processing cluster 1400.
  • processor 5200 employs a three stage pipeline of fetch, decode, and execute.
  • context interface 5214 and LS port 5212 provide instructions to the program cache 5208, and the instructions can be fetched from the program cache 5208 by instruction fetch 5204.
  • the bus between the instruction fetch 5204 and the program cache 5208 can, for example, be 40 bits wide, allowing the processor 5200 to support dual issue instructions (i.e., instructions can be 40 bits or 20 bits wide).
  • processing unit 5202 executes the smaller instructions (i.e., 20-bit instructions), while the “B-side” functional units execute the larger instructions (i.e., 40-bit instructions).
  • processing unit can use register file 5206 as a "scratch pad"; this register file 5206 can be (for example) a 16-entry, 32-bit register file that is shared between the "A-side" and "B-side."
  • processor 5200 includes a control register file 5216 and a program counter 5218. Processor 5200 can also be accessed through boundary pins or leads; an example of each is described in Table 2 (with "z" denoting active low pins).
  • processor 5200 is executing the second half of a non-parallel 20-bit instruction pair.
  • This bus represents the vector unit source register for vector implied stores, or the vector unit destination register for vector implied loads.
  • risc_regf_ra[1:0] 4b:2 Input Register file read address ports. There are two ports. These pins are driven by the lane 0 (left-most) vector unit. Allows the vector unit to read one of the lower 4 registers in the GPR file.
  • These pins are driven by the lane 0 (left-most) vector unit. These are the read data buses associated with risc_regf_ra.
  • the instruction fetch 5204 (which corresponds to the fetch stage 5306) is divided into an A-side and B-side, where the A-side receives the first 20 bits (i.e., [19:0]) of a "fetch packet" (which can be a 40-bit wide instruction word having one 40-bit instruction or two 20-bit instructions) and the B-side receives the last 20 bits (i.e., [39:20]) of a fetch packet.
  • the instruction fetch 5204 determines the structure and size of the instruction(s) in the fetch packet and dispatches the instruction(s) accordingly (which is discussed in section 7.3 below).
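The split of a 40-bit fetch packet between the A-side and B-side described above can be sketched in Python as follows. The bit ranges ([19:0] for the A-side, [39:20] for the B-side) come from the text; the function name and the representation of a fetch packet as a plain integer are assumptions for illustration.

```python
FIELD_MASK = (1 << 20) - 1  # 20-bit field


def split_fetch_packet(packet):
    """Return (a_side, b_side) halves of a 40-bit fetch packet."""
    if not 0 <= packet < (1 << 40):
        raise ValueError("fetch packet must be a 40-bit value")
    a_side = packet & FIELD_MASK          # bits [19:0]
    b_side = (packet >> 20) & FIELD_MASK  # bits [39:20]
    return a_side, b_side
```

For a packet that holds two 20-bit instructions, each half is one instruction; for a single 40-bit instruction, the two halves together form the instruction word.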
  • a decoder 5221 (which is part of the decode stage 5308 and processing unit 5202) decodes the instruction(s) from the instruction fetch 5204.
  • the decoder 5221 generally includes an operator format circuit 5223-1 and 5223-2 (to generate intermediates) and a decode circuit 5225-1 and 5225-2 for the B-side and A-side, respectively.
  • the output from the decoder 5221 is then received by the decode-to-execution unit 5220 (which is also part of the decode stage 5308 and processing unit 5202).
  • the decode-to-execution unit 5220 generates command(s) for the execution unit 5227 that correspond to the instruction(s) received through the fetch packet.
  • the A-side and B-side of the execution unit 5227 are also subdivided.
  • Each of the B-side and A-side of the execution unit 5227 respectively includes a multiply unit 5222-1/5222-2, a Boolean unit 5226-1/5226-2, an add/subtract unit 5228-1/5228-2, and a move unit 5330-1/5330-2.
  • the B-side of the execution unit 5227 also includes a load/store unit 5224 and a branches unit 5232.
  • the multiply unit 5222-1/5222-2, Boolean unit 5226-1/5226-2, add/subtract unit 5228-1/5228-2, and move unit 5330-1/5330-2 can then, respectively, perform a multiply operation, a logical Boolean operation, an add/subtract operation, and a data movement operation on data loaded into the general purpose register file 5206 (which also includes read addresses for each of the A-side and B-side). Move operations can also be performed in the control register file 5216.
  • a RISC processor with a vector processing module is generally used with shared function-memory 1410.
  • This RISC processor is largely the same as the RISC processor used for processor 5200 but it includes a vector processing module to extend the computation and load/store bandwidth.
  • This module can contain 16 vector units that are each capable of executing a 4-operation execute packet per cycle.
  • a typical execute packet generally includes a data load from the vector memory array, two register-to-register operations, and a result store to the vector memory array.
  • This type of RISC processor generally uses an instruction word that is 80 bits wide or 120 bits wide, which generally constitutes a "fetch packet" and which may include unaligned instructions.
  • a fetch packet can contain a mixture of 40 bit and 20 bit instructions, which can include vector unit instructions and scalar instructions similar to those used by processor 5200.
  • vector unit instructions can be 20 bits wide, while other instructions can be 20 bits or 40 bits wide (similar to processor 5200).
  • Vector instructions can also be presented on all lanes of the instruction fetch bus, but, if the fetch packet contains both scalar and vector unit instructions the vector instructions are presented (for example) on instruction fetch bus bits [39:0] and the scalar instruction(s) are presented (for example) on instruction fetch bus bits [79:40]. Additionally, unused instruction fetch bus lanes are padded with NOPs.
  • An "execute packet" can then be formed from one or more fetch packets. Partial execute packets are held in the instruction queue until completed. Typically, complete execute packets are submitted to the execute stage (i.e., 5310).
  • Four vector unit instructions (for example), two scalar instructions (for example), or a combination of 20-bit and 40-bit instructions (for example) may execute in a single cycle.
  • Back-to-back 20-bit instructions may also be executed serially. If bit 19 of the current 20 bit instruction is set, this indicates that the current instruction, and the subsequent 20-bit instruction form an execute packet. Bit 19 can be generally referred to as the P-bit or parallel bit. If the P-bit is not set this indicates the end of an execute packet.
  • Back-to-back 20 bit instructions with the P-bit not set cause serial execution of the 20 bit instructions. It should also be noted that this RISC processor (with a vector processing module) may include any of the following constraints:
  • Load or store instructions should appear on the B-side of the instruction fetch bus (i.e., bits 79:40 for 40 bit loads and stores or on bits 79:60 of the fetch bus for 20 bit loads or stores);
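As an illustrative sketch of the P-bit rule described above (bit 19 set means the current 20-bit instruction and the next one belong to the same execute packet; bit 19 clear ends the packet), the grouping can be modeled in Python as follows. The function name and the list-based instruction queue are assumptions; only the meaning of bit 19 and the holding of partial packets come from the text.

```python
P_BIT = 1 << 19  # parallel bit of a 20-bit instruction


def group_execute_packets(instructions):
    """Group 20-bit instructions into execute packets using the P-bit.

    Returns (complete_packets, pending), where `pending` is a partial
    execute packet that would be held until more instructions arrive.
    """
    packets, current = [], []
    for ins in instructions:
        current.append(ins)
        if not ins & P_BIT:  # P-bit clear: this instruction ends the packet
            packets.append(current)
            current = []
    return packets, current
```

With this model, back-to-back instructions whose P-bit is clear each form their own single-instruction packet, which corresponds to serial execution.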
  • the vector module includes a vector decoder 5246, decode-to-execution unit 5250, and an execution unit 5251.
  • the vector decoder includes slot decoders 5248-1 to 5248-4 that receive instructions from the instruction fetch 5204.
  • slot decoders 5248-1 and 5248-2 operate in a similar manner to one another, while slot decoders 5248-3 and 5248-4 include load/store decoding circuitry.
  • the decode-to-execution unit 5250 can then generate instructions for the execution unit 5251 based on the decoded output of vector decoder 5246.
  • Each of the slot decoders can generate instructions that can be used by the multiply unit 5252, add/subtract unit 5254, move unit 5256, and Boolean unit 5258 (that each use data and addresses in the general purpose register file 5206). Additionally, slot decoders 5248-3 and 5248-4 can generate load and store instructions for load/store units 5260 and 5262.
  • the general purpose register file 5206 can be a 16-entry by 32-bit general purpose register file.
  • the widths of the general purpose registers (GPRs) can be parameterized.
  • when processor 5200 is used for nodes (i.e., 808-i), there are 4+15 (15 are controlled by boundary pins) read ports and 4+6 (6 are controlled by boundary pins) write ports, while processor 5200 used for GLS unit 1408 has 4 read ports and 4 write ports.
  • Table 4 below illustrates an example of an instruction set architecture for processor 5200, where: Unit designations .SA and .SB are used to distinguish in which issue slot a 20 bit instruction executes;
  • VUNIT/VREG s3Save s3.address()
  • vec_risc_wd gets value of Vreg(risc_vec_ra);

Abstract

A method for moving data from a first register file in a computational unit (808i) to a second register file in a processor (1410) is provided. The state of a signal on a data movement lead (risc is mfwr) is changed to indicate the data movement instruction from a first register file in a computational unit to a second register file in a processor (1410). A lane address from the processor to the computational unit is provided over a first address lead (risc is ua). A read address from the processor to the computational unit is provided over a second address lead (risc is ra), and data is transferred from the first register file in the computational unit to the second register file in the processor over a data interface lead (node regf rd).

Description

METHOD AND APPARATUS FOR MOVING DATA FROM A SIMD
REGISTER FILE TO GENERAL PURPOSE REGISTER FILE
[0001] The disclosure relates generally to a processor and, more particularly, to a processing cluster.
BACKGROUND
[0002] FIG. 1 is a graph that depicts speed-up in execution rate versus parallel overhead for multi-core systems (ranging from 2 to 16 cores), where speed-up is the single-processor execution time divided by the parallel-processor execution time. As can be seen, the parallel overhead has to be close to zero to obtain a significant benefit from a large number of cores. But, since the overhead tends to be very high if there is any interaction between parallel programs, it is normally very difficult to efficiently use more than one or two processors for anything but completely decoupled programs. Thus, there is a need for an improved processing cluster.
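As a hedged numerical illustration of the relationship described above, speed-up can be computed in Python under a simple assumed model in which the parallel execution time is the single-processor time divided by the number of cores, plus an overhead expressed as a fraction of the single-processor time. This model is an assumption made for the sketch and is not necessarily the model plotted in FIG. 1; only the definition of speed-up as a ratio of execution times comes from the text.

```python
def speed_up(cores, overhead):
    """Speed-up = single-processor time / parallel-processor time.

    Assumed model: T_parallel = T_single / cores + overhead * T_single,
    with `overhead` given as a fraction of T_single.
    """
    if cores < 1 or overhead < 0:
        raise ValueError("cores must be >= 1 and overhead must be >= 0")
    return 1.0 / (1.0 / cores + overhead)
```

Under this model, 16 cores with zero overhead give a speed-up of 16, while an overhead of one quarter of the single-processor time limits the speed-up to well under 4 regardless of the core count.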
SUMMARY
[0003] An embodiment of the present disclosure, accordingly, provides a method. The method is characterized by: changing the state of a signal on a data movement lead (risc is mfwr) to indicate the data movement instruction from a first register file (4358-1 to 4358-8, 7902) in a computational unit (4308-1 to 4308-M, 7607-1 to 7607-P) to a second register file (5206) in a processor (4322, 7614); providing a lane address from the processor (4322, 7614) to the computational unit (4308-1 to 4308-M, 7607-1 to 7607-P) over a first address lead (risc is ua); providing a read address from the processor (4322, 7614) to the computational unit (4308-1 to 4308-M, 7607-1 to 7607-P) over a second address lead (risc is ra); and transferring data from the first register file (4358-1 to 4358-8, 7902) in the computational unit (4308-1 to 4308-M, 7607-1 to 7607-P) to the second register file (5206) in the processor (4322, 7614) over a data interface lead (node regf rd).
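By way of example only, the sequence of steps in the method can be modeled behaviorally in Python as follows. The class and method names, the lane and register counts, and the use of a Boolean argument to stand for the state of the data movement lead are all assumptions made for this sketch; it is not a description of the actual circuit.

```python
class ComputationalUnit:
    """Toy model of a SIMD unit with one register file per lane."""

    def __init__(self, lanes=8, regs_per_lane=32):
        self.regf = [[0] * regs_per_lane for _ in range(lanes)]

    def read(self, move_asserted, lane_addr, read_addr):
        """Return data for the data interface lead.

        Data is driven only while the data movement lead is asserted;
        otherwise None is returned to represent an undriven lead.
        """
        if not move_asserted:
            return None
        return self.regf[lane_addr][read_addr]


class Processor:
    """Toy model of the processor with a general purpose register file."""

    def __init__(self, num_gprs=16):
        self.gpr = [0] * num_gprs

    def move_from_simd(self, unit, lane_addr, read_addr, dest):
        # Step 1: change the state of the data movement lead (modeled as True).
        # Steps 2 and 3: provide the lane address and the read address.
        data = unit.read(True, lane_addr, read_addr)
        # Step 4: transfer the data into the second register file.
        self.gpr[dest] = data & 0xFFFFFFFF
        return self.gpr[dest]
```

In use, a value previously written to a lane's register (for example, lane 3, register 5) is copied into the chosen general purpose register by a single call to `move_from_simd`.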
BRIEF DESCRIPTION OF THE DRAWINGS
[0004] FIG. 1 is a graph of multicore speed-up parameters;
[0005] FIG. 2 is a diagram of a system in accordance with an embodiment of the present disclosure;
[0006] FIG. 3 is a diagram of the SOC in accordance with an embodiment of the present disclosure;
[0007] FIG. 4 is a diagram of a parallel processing cluster in accordance with an embodiment of the present disclosure;
[0008] FIGS. 5 and 6 are diagrams of a portion of a node or computing element in the processing cluster;
[0009] FIG. 7 is a block diagram of shared function-memory;
[0010] FIG. 8 is a diagram of the SIMD data paths for the shared function-memory;
[0011] FIG. 9 is a diagram of a portion of one SIMD data path;
[0012] FIG. 10 is a more detailed diagram of a node processor or RISC processor; and
[0013] FIGS. 11 and 12 are diagrams of examples of portions of a pipeline for a node processor or RISC processor.
DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS
[0014] An example of an application for an SOC that performs parallel processing can be seen in FIG. 2. In this example, an imaging device 1250 is shown, and this imaging device 1250 (which can, for example, be a mobile phone or camera) generally comprises an image sensor 1252, an SOC 1300, a dynamic random access memory (DRAM) 1254, a flash memory 1256, display 1258, and power management integrated circuit (PMIC) 1260. In operation, the image sensor 1252 is able to capture image information (which can be a still image or video) that can be processed by the SOC 1300 and DRAM 1254 and stored in a nonvolatile memory (namely, the flash memory 1256). Additionally, image information stored in the flash memory 1256 can be displayed to the user over the display 1258 by use of the SOC 1300 and DRAM 1254. Also, imaging devices 1250 are oftentimes portable and include a battery as a power supply; the PMIC 1260 (which can be controlled by the SOC 1300) can assist in regulating power use to extend battery life.
[0015] In FIG. 3, an example of a system-on-chip or SOC 1300 is depicted in accordance with an embodiment of the present disclosure. This SOC 1300 (which is typically an integrated circuit or IC, such as an OMAP™) generally comprises a processing cluster 1400 (which generally performs the parallel processing described above) and a host processor 1316 that provides the hosted environment (described and referenced above). The host processor 1316 can be a wide (i.e., 32 bits, 64 bits, etc.) RISC processor (such as an ARM Cortex-A9) that communicates with the bus arbitrator 1310, buffer 1306, bus bridge 1320 (which allows the host processor 1316 to access the peripheral interface 1324 over interface bus or Ibus 1330), hardware application programming interface (API) 1308, and interrupt controller 1322 over the host processor bus or HP bus 1328. Processing cluster 1400 typically communicates with functional circuitry 1302 (which can, for example, be a charge-coupled device or CCD interface and which can communicate with off-chip devices), buffer 1306, bus arbitrator 1310, and peripheral interface 1324 over the processing cluster bus or PC bus 1326. With this configuration, the host processor 1316 is able to provide information (i.e., configure the processing cluster 1400 to conform to a desired parallel implementation) through API 1308, while both the processing cluster 1400 and host processor 1316 can directly access the flash memory 1256 (through flash interface 1312) and DRAM 1254 (through memory controller 1304). Additionally, test and boundary scan can be performed through Joint Test Action Group (JTAG) interface 1318.
[0016] Turning to FIG. 4, an example of the parallel processing cluster 1400 is depicted in accordance with an embodiment of the present disclosure. Typically, processing cluster 1400 corresponds to hardware 722. Processing cluster 1400 generally comprises partitions 1402-1 to 1402-R which include nodes 808-1 to 808-N, node wrappers 810-1 to 810-N, instruction memories 1404-1 to 1404-R, and bus interface units (BIUs) 4710-1 to 4710-R (which are discussed in detail below). Nodes 808-1 to 808-N are each coupled to data interconnect 814 (through its respective BIU 4710-1 to 4710-R and the data bus 1422), and the controls or messages for the partitions 1402-1 to 1402-R are provided from the control node 1406 through the message bus 1420. The global load/store (GLS) unit 1408 and shared function-memory 1410 also provide additional functionality for data movement (as described below). Additionally, a level 3 or L3 cache 1412, peripherals 1414 (which are generally not included within the IC), memory 1416 (which is typically flash memory 1256 and/or DRAM 1254 as well as other memory that is not included within the SOC 1300), and hardware accelerators (HWA) unit 1418 are used with processing cluster 1400. An interface 1405 is also provided so as to communicate data and addresses to control node 1406.
[0017] Processing cluster 1400 generally uses a "push" model for data transfers. The transfers generally appear as posted writes, rather than request-response types of accesses. This has the benefit of reducing occupation on global interconnect (i.e., data interconnect 814) by a factor of two compared to request-response accesses because data transfer is one-way. There is generally no desire to route a request through the interconnect 814, followed by routing the response to the requestor, resulting in two transitions over the interconnect 814. The push model generates a single transfer. This is important for scalability because network latency increases as network size increases, and this invariably reduces the performance of request-response transactions.
[0018] The push model, along with the dataflow protocol (i.e., 812-1 to 812-N), generally minimizes global data traffic to that used for correctness, while also generally minimizing the effect of global dataflow on local node utilization. There is normally little to no impact on node (i.e., 808-i) performance even with a large amount of global traffic. Sources write data into global output buffers (discussed below) and continue without requiring an acknowledgement of transfer success. The dataflow protocol (i.e., 812-1 to 812-N) generally ensures that the transfer succeeds on the first attempt to move data to the destination, with a single transfer over interconnect 814. The global output buffers (which are discussed below) can hold up to 16 outputs (for example), making it very unlikely that a node (i.e., 808-i) stalls because of insufficient instantaneous global bandwidth for output. Furthermore, the instantaneous bandwidth is not impacted by request-response transactions or replaying of unsuccessful transfers.
[0019] Finally, the push model more closely matches the programming model, namely programs do not "fetch" their own data. Instead, their input variables and/or parameters are written before being invoked. In the programming environment, initialization of input variables appears as writes into memory by the source program. In the processing cluster 1400, these writes are converted into posted writes that populate the values of variables in node contexts.
[0020] The global input buffers (which are discussed below) are used to receive data from source nodes. Since the data memory for each node 808-1 to 808-N is single-ported, the write of input data might conflict with a read by the local Single Instruction Multiple Data (SIMD). This contention is avoided by accepting input data into the global input buffer, where it can wait for an open data memory cycle (that is, there is no bank conflict with the SIMD access). The data memory can have 32 banks (for example), so it is very likely that the buffer is freed quickly. However, the node (i.e., 808-i) should have a free buffer entry because there is no handshaking to acknowledge the transfer. If desired, the global input buffer can stall the local node (i.e., 808-i) and force a write into the data memory to free a buffer location, but this event should be extremely rare. Typically, the global input buffer is implemented as two separate random access memories (RAMs), so that one can be in a state to write global data while the other is in a state to be read into the data memory. The messaging interconnect is separate from the global data interconnect but also uses a push model.
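As a simplified illustration of a global input buffer built from two separate RAMs, one accepting global writes while the other is read out into the data memory, a ping-pong arrangement can be sketched in Python as follows. The class name, the explicit `swap` step, and the use of Python lists for the RAMs are assumptions; the source does not state when or how the two RAMs exchange roles.

```python
class PingPongInputBuffer:
    """Two RAMs: one receives global data while the other is read out."""

    def __init__(self):
        self.rams = [[], []]
        self.write_sel = 0  # index of the RAM currently accepting writes

    def write(self, item):
        """Accept incoming global data into the write-side RAM."""
        self.rams[self.write_sel].append(item)

    def swap(self):
        """Exchange the roles of the two RAMs (assumed policy)."""
        self.write_sel ^= 1

    def drain(self):
        """Read the read-side RAM out (toward data memory) and empty it."""
        read_ram = self.rams[self.write_sel ^ 1]
        items = list(read_ram)
        read_ram.clear()
        return items
```

In this model, new writes can continue to land in one RAM while the contents of the other are drained during open data memory cycles.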
[0021] At the system level, nodes 808-1 to 808-N are replicated in processing cluster 1400 analogous to SMP or symmetric multi-processing with the number of nodes scaled to the desired throughput. The processing cluster 1400 can scale to a very large number of nodes. Nodes 808-1 to 808-N are grouped into partitions 1402-1 to 1402-R, with each having one or more nodes. Partitions 1402-1 to 1402-R assist scalability by increasing local communication between nodes, and by allowing larger programs to compute larger amounts of output data, making it more likely to meet desired throughput requirements. Within a partition (i.e., 1402-i), nodes communicate using local interconnect, and do not require global resources. The nodes within a partition (i.e., 1402-i) also can share instruction memory (i.e., 1404-i), with any granularity: from each node using an exclusive instruction memory to all nodes using common instruction memory. For example, three nodes can share three banks of instruction memory, with a fourth node having an exclusive bank of instruction memory. When nodes share instruction memory (i.e., 1404-i), the nodes generally execute the same program synchronously.
[0022] The processing cluster 1400 also can support a very large number of nodes (i.e., 808-i) and partitions (i.e., 1402-i). The number of nodes per partition, however, is usually limited to 4 because having more than 4 nodes per partition generally resembles a non-uniform memory access (NUMA) architecture. In this case, partitions are connected through one (or more) crossbars (which are described below with respect to interconnect 814) that have a generally constant cross-sectional bandwidth. Processing cluster 1400 is currently architected to transfer one node's width of data (for example, 64, 16-bit pixels) every cycle, segmented into 4 transfers of 16 pixels per cycle over 4 cycles. The processing cluster 1400 is generally latency-tolerant, and node buffering generally prevents node stalls even when the interconnect 814 is nearly saturated (note that this condition is very difficult to achieve except by synthetic programs).
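The segmentation of one node's width of data (for example, 64 pixels) into 4 transfers of 16 pixels can be sketched in Python as follows. The function name and the list representation of pixel data are assumptions for illustration; the 64-pixel width and the 4-by-16 segmentation are the example figures given above.

```python
def segment_node_transfer(pixels, beats=4):
    """Split one node's width of pixel data into equal-size transfers."""
    if len(pixels) % beats != 0:
        raise ValueError("pixel count must divide evenly into the transfers")
    size = len(pixels) // beats
    return [pixels[i * size:(i + 1) * size] for i in range(beats)]
```

For a 64-pixel input, this yields four consecutive groups of 16 pixels, one group per cycle over 4 cycles.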
[0023] Typically, processing cluster 1400 includes global resources that are shared between partitions:
(1) Control Node 1406, which implements the system-wide messaging interconnect (over message bus 1420), event processing and scheduling, and interface to the host processor and debugger (all of which is described in detail below).
(2) GLS unit 1408, which contains a programmable reduced instruction set (RISC) processor, enabling system data movement that can be described by C++ programs that can be compiled directly as GLS data-movement threads. This enables system code to execute in cross-hosted environments without modifying source code, and is much more general than direct memory access because it can move from any set of addresses (variables) in the system or SIMD data memory (described below) to any other set of addresses (variables). It is multi-threaded, supporting (for example) a 0-cycle context switch and up to 16 threads.
(3) Shared Function-Memory 1410, which is a large shared memory that provides a general lookup table (LUT) and statistics-collection facility (histogram). It also can support pixel processing using the large shared memory that is not well supported by the node SIMD (for cost reasons), such as resampling and distortion correction. This processing uses (for example) a six-issue RISC processor (i.e., SFM processor 7614, which is described in detail below), implementing scalar, vector, and 2D arrays as native types.
(4) Hardware Accelerators 1418, which can be incorporated for functions that do not require programmability, or to optimize power and/or area. Accelerators appear to the subsystem as other nodes in the system, participate in the control and data flow, can create events and be scheduled, and are visible to the debugger. (Hardware accelerators can have dedicated LUT and statistics gathering, where applicable.)
(5) Data Interconnect 814 and System Open Core Protocol (OCP) L3 connection 1412. These manage the movement of data between node partitions, hardware accelerators, and system memories and peripherals on the data bus 1422. (Hardware accelerators can have private connections to L3 also.)
(6) Debug interfaces. These are not shown on the diagram but are described in this document.
[0024] Turning to FIG. 5, an example of a node 808-i can be seen in greater detail. Node 808-i is the computing element in processing cluster 1400, while the basic element for addressing and program flow-control is RISC processor or node processor 4322. Typically, this node processor 4322 can have a 32-bit data path with 20-bit instructions (with the possibility of a 20-bit immediate field in a 40-bit instruction). Pixel operations, for example, are performed in a set of 32 pixel functional units, in a SIMD organization, in parallel with four loads (for example) to, and two stores (for example) from, SIMD registers from/to SIMD data memory (the instruction-set architecture of node processor 4322 is described in section 7 below). An instruction packet describes (for example) one RISC processor core instruction, four SIMD loads, and two SIMD stores, in parallel with a 3-issue SIMD instruction that is executed by all SIMD functional units 4308-1 to 4308-M.
[0025] Typically, loads and stores (from load store unit 4318-i) move data between SIMD data-memory locations and SIMD local registers, which can, for example, represent up to 64, 16-bit pixels. SIMD loads and stores use shared registers 4320-i for indirect addressing (direct addressing is also supported), but SIMD addressing operations read these registers: addressing context is managed by the core 4320. The core 4320 has a local memory 4328 for register spill/fill, addressing context, and input parameters. There is a partition instruction memory 1404-i provided per node, where it is possible for multiple nodes to share partition instruction memory 1404-i, to execute larger programs on datasets that span multiple nodes.
[0026] Node 808-i also incorporates several features to support parallelism. The global input buffer 4316-i and global output buffer 4310-i (which in conjunction with Lf and Rt buffers 4314-i and 4312-i generally comprise input/output (IO) circuitry for node 808-i) decouple node 808-i input and output from instruction execution, making it very unlikely that the node stalls because of system IO. Inputs are normally received well in advance of processing (by SIMD data memory 4306-1 to 4306-M and functional units 4308-1 to 4308-M), and are stored in SIMD data memory 4306-1 to 4306-M using spare cycles (which are very common). SIMD output data is written to the global output buffer 4310-i and routed through the processing cluster 1400 from there, making it unlikely that a node (i.e., 808-i) stalls even if the system bandwidth approaches its limit (which is also unlikely). SIMD data memories 4306-1 to 4306-M and the corresponding SIMD functional units 4308-1 to 4308-M are each collectively referred to as "SIMD units."
[0027] SIMD data memory 4306-1 to 4306-M is organized into non-overlapping contexts, of variable size, allocated either to related or unrelated tasks. Contexts are fully shareable in both horizontal and vertical directions. Sharing in the horizontal direction uses read-only memories 4330-i and 4332-i, which are typically read-only for the program but writeable by the write buffers 4302-i and 4304-i, load/store (LS) unit 4318-i, or other hardware. These memories 4330-i and 4332-i can also be about 512x2 bits in size. Generally, these memories 4330-i and 4332-i correspond to pixel locations to the left and right relative to the central pixel locations operated on. These memories 4330-i and 4332-i use a write-buffering mechanism (i.e., write buffers 4302-i and 4304-i) to schedule writes, where side-context writes are usually not synchronized with local access. The buffer 4302-i generally maintains coherence with adjacent pixel (for example) contexts that operate concurrently. Sharing in the vertical direction uses circular buffers within the SIMD data memory 4306-1 to 4306-M; circular addressing is a mode supported by the load and store instructions applied by the LS unit 4318-i. Shared data is generally kept coherent using system-level dependency protocols described above.
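As an illustration of the circular addressing mode mentioned for vertical sharing, the following is a minimal sketch and not the hardware's actual address generator. The type CircularRegion, its fields, and the function circularAddress are hypothetical names introduced here; the source does not specify how the wrap is computed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of one circular buffer inside SIMD data memory.
// Field names and widths are assumptions made for illustration only.
struct CircularRegion {
    std::uint32_t base;  // address of the first line in the buffer
    std::uint32_t size;  // number of lines in the buffer (must be > 0)
};

// Map a line offset (which may be negative) relative to the current line
// onto an address inside the region, wrapping at both ends.
constexpr std::uint32_t circularAddress(CircularRegion r,
                                        std::uint32_t current,
                                        std::int32_t offset) {
    // Signed 64-bit arithmetic keeps negative offsets well defined.
    const std::int64_t n = static_cast<std::int64_t>(r.size);
    std::int64_t idx = (static_cast<std::int64_t>(current) + offset) % n;
    if (idx < 0) {
        idx += n;  // C++ remainder can be negative; fold it back into range
    }
    return r.base + static_cast<std::uint32_t>(idx);
}
```

With a four-line region, stepping one line past the last line lands on the first line again, which is the behavior a vertically re-used line buffer would need.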
[0028] Context allocation and sharing is specified by SIMD data memory 4306-1 to 4306-M context descriptors, in context-state memory 4326, which is associated with the node processor 4322. This memory 4326 can, for example, be a 16x16x32 bit or 2x16x256 bit RAM. These descriptors also specify how data is shared between contexts in a fully general manner, and retain information to handle data dependencies between contexts. The Context Save/Restore memory 4324 is used to support 0-cycle task switching (which is described above), by permitting registers 4320-i to be saved and restored in parallel. SIMD data memory 4306-1 to 4306-M and processor data memory 4328 contexts are preserved using independent context areas for each task.
[0029] SIMD data memory 4306-1 to 4306-M and processor data memory 4328 are partitioned into a variable number of contexts, of variable size. Data in the vertical frame direction is retained and re-used within the context itself. Data in the horizontal frame direction is shared by linking contexts together into a horizontal group. It is important to note that the context organization is mostly independent of the number of nodes involved in a computation and how they interact with each other. The primary purpose of contexts is to retain, share, and re-use image data, regardless of the organization of nodes that operate on this data.
[0030] Typically, SIMD data memory 4306-1 to 4306-M contains (for example) pixel and intermediate context operated on by the functional units 4308-1 to 4308-M. SIMD data memory 4306-1 to 4306-M is generally partitioned into (for example) up to 16 disjoint context areas, each with a programmable base address, with a common area accessible from all contexts that is used by the compiler for register spill/fill. The processor data memory 4328 contains input parameters, addressing context, and a spill/fill area for registers 4320-i. Processor data memory 4328 can have (for example) up to 16 disjoint local context areas that correspond to SIMD data memory 4306-1 to 4306-M contexts, each with a programmable base address.
[0031] Typically, the nodes (i.e., node 808-i), for example, have three configurations: 8 SIMD registers (first configuration); 32 SIMD registers (second configuration); and 32 SIMD registers plus three extra execution units in each of the smaller functional units (third configuration).
[0032] As an example, in FIG. 6, a SIMD unit (namely, SIMD data memory 4306-1 and SIMD functional unit 4308-1), node processor 4322, and LS unit 4318-i can be seen in greater detail. As shown in this example, SIMD functional unit 4308-i is generally comprised of eight smaller functional units 4338-1 to 4338-8 and uses the third configuration.
[0033] Looking first to the processor core, the node processor 4322 generally executes all the control related instructions and holds all the address register values and special register values for SIMD units shown in register files 4340 and 4342 (respectively). Up to six (for example) memory instructions can be calculated in a cycle. For address register values, the address source operands are sent to node processor 4322 from the SIMD unit shown, and the node processor 4322 sends back the register values, which are then used by SIMD unit for address calculation. Similarly, for special register values, the special register source operands are sent to node processor 4322 from the SIMD unit shown, and the node processor 4322 sends back the register values.
[0034] Node processor 4322 can have (for example) 15 read ports and six write ports for SIMD. Typically, the 15 read ports include (for example) 12 read ports that accommodate two operands (i.e., lssrc and lssrc2) for each of six memory instructions and three ports for special register file 4342. Typically, special register file 4342 includes two registers named RCLIPMIN and RCLIPMAX, which should be provided together and which are generally restricted to the lower four registers of the 16 entry register file 4342. RCLIPMAX and RCLIPMIN registers are then specified directly in the instruction. The other special registers RND and SCL are specified by a 4-bit register identifier and can be located anywhere in the 16 entry register file 4342. Additionally, node processor 4322 includes a program counter execution unit 4344, which can update the instruction memory 1404-i.
[0035] Turning now to the LS unit 4318-i and SIMD unit, the general structure for each can be seen in FIG. 6. As shown, the LS unit 4318-i generally comprises LS decoder 4334, LS execution unit 4336, logic unit 4346, multiply unit 4348, right execution unit 4350, and LS data memory 4339; however the details regarding the data path for LS unit 4318-i are provided below. Each of the smaller functional units 4338-1 through 4338-8 generally (and respectively) comprises SIMD register files 4358-1 to 4358-8 (which can each include 32 registers, for example), left logic units 4352-1 to 4352-8, multiply units 4354-1 to 4354-8, and right logic units 4356-1 to 4356-8. These left logic units 4352-1 to 4352-8, multiply units 4354-1 to 4354-8, and right logic units 4356-1 to 4356-8 are generally duplications of left, middle, and right units 4346, 4348, and 4350, respectively. Additionally, similar to the LS unit 4318-i, the data path for each functional unit 4338-1 to 4338-8 is described below.
[0036] Additionally, for the three example configurations for a node (i.e., node 808-i), the sizes of some components (i.e., logic unit 4352-1) or the corresponding instruction may vary, while others may remain the same. The LS data memory 4339, lookup table, and histogram remain relatively the same. Preferably, the LS data memory 4339 can be about 512*32 bits with the first 16 locations holding the context base addresses and the remaining locations being accessible by the contexts. The lookup table or LUT (which is generally within the PC execution unit 4344) can have up to 12 tables with a memory size of 16Kb, wherein four bits can be used to select a table and 14 bits can be used for addressing. Histograms (which are also generally located in the PC execution unit 4344) can have 4 tables, where the histogram shares the 4-bit ID with LUT to select a table and uses 8 bits for addressing. In Table 1 below, the instruction sizes for each of the three example configurations can be seen, which can correspond to the sizes of various components.
Table 1
Field | First configuration | Second configuration | Third configuration
Multiply unit (i.e., 4348) instruction | | |
Logic unit (i.e., 4346) instruction | 16 bits | 24 bits | 24 bits
LS unit instructions | 132 bits | 160 bits | 156 bits
Node processor 4322 instruction | 0 bits | 20 bits | 20 bits
Context switch indication | 2 bits | 2 bits | 2 bits
Arrangement of instruction line (Instruction Packet Format) | Context : C : LS1 : LS2 : LS3 : LS4 : LS5 : LS6 : LU : MU : RU | Context : C : LS1 : T20 : LS2 : LS3 : LS4 : LS5 : LS6 : LU : MU : RU | Context : C : LS1 : T20 : LS2 : LS3 : LS4 : LS5 : LS6 : LU : MU : RU
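The LUT and histogram field widths given in paragraph [0036] (a 4-bit table ID shared by both, 14 address bits for a LUT, 8 address bits for a histogram) can be sketched as bit-field packing. This is only an illustration: placing the table ID in the bits directly above the index is an assumption, and every function name below is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Field widths taken from the description; the field ordering is assumed.
constexpr std::uint32_t kTableIdBits = 4;
constexpr std::uint32_t kLutIndexBits = 14;
constexpr std::uint32_t kHistIndexBits = 8;

// Build a mask with the low `bits` bits set.
constexpr std::uint32_t lowMask(std::uint32_t bits) {
    return (1u << bits) - 1u;
}

// Pack a LUT access as [table ID : 14-bit index].
constexpr std::uint32_t packLutAddress(std::uint32_t tableId, std::uint32_t index) {
    return ((tableId & lowMask(kTableIdBits)) << kLutIndexBits) |
           (index & lowMask(kLutIndexBits));
}

// Pack a histogram access as [table ID : 8-bit bin index].
constexpr std::uint32_t packHistAddress(std::uint32_t tableId, std::uint32_t bin) {
    return ((tableId & lowMask(kTableIdBits)) << kHistIndexBits) |
           (bin & lowMask(kHistIndexBits));
}

// Recover the two LUT fields from a packed address.
constexpr std::uint32_t lutTableId(std::uint32_t addr) {
    return (addr >> kLutIndexBits) & lowMask(kTableIdBits);
}
constexpr std::uint32_t lutIndex(std::uint32_t addr) {
    return addr & lowMask(kLutIndexBits);
}
```

Out-of-range indices are masked rather than rejected in this sketch; real hardware or tooling might instead treat them as errors.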
[0037] Turning to FIG. 7, the shared function-memory 1410 can be seen. The shared function-memory 1410 is generally a large, centralized memory supporting operations that are not well-supported by the nodes (i.e., for cost reasons). The main components of the shared function-memory 1410 are the two large memories: the function-memory 7602 and the vector-memory 7603 (each of which has a configurable size, for example between 48 and 1024 Kbytes, and a configurable organization). This function-memory 7602 provides a synchronous, instruction-driven implementation of high-bandwidth, vector-based lookup-tables (LUTs) and histograms. The vector-memory 7603 can support operations by (for example) a 6-issue processor (i.e., SFM processor 7614) that implies vector instructions (as detailed in section 8 above), which can, for example, be used for block-based pixel processing. Typically, this SFM processor 7614 can be accessed using the messaging interface 1420 and data bus 1422. The SFM processor 7614 can, for example, operate on wide pixel contexts (64 pixels) that can have a much more general organization and total memory size than SIMD data memory in the nodes, with much more general processing applied to the data. It supports scalar, vector, and array operations on standard C++ integer datatypes as well as operations on packed pixels that are compatible with various datatypes. For example and as shown, the SIMD data paths associated with the vector memory 7603 and function-memory 7602 generally include ports 7605-1 to 7605-Q and functional units 7607-1 to 7607-P.
[0038] The function-memory 7602 and vector-memory 7603 are generally "shared" in the sense that all processing nodes (i.e., 808-i) can access function-memory 7602 and vector-memory 7603. Data provided to the function-memory 7602 can be accessed via the SFM wrapper (typically in a write-only manner). This sharing is also generally consistent with the context management described above for processing nodes (i.e., 808-i). Data I/O between processing nodes and shared function-memory 1410 also uses the dataflow protocol, and processing nodes, typically, cannot directly access vector-memory 7603. The shared function-memory 1410 can also write to the function-memory 7602, but not while it is being accessed by processing nodes. Processing nodes (i.e., 808-i) can read and write common locations in function-memory 7602, but (usually) either as read-only LUT operations or write-only histogram operations. It is also possible for a processing node to have read-write access to a function-memory 7602 region, but this should be exclusive for access by a given program.
[0039] Turning to FIG. 8, an example of the SIMD data paths 7800 for the shared function-memory 1410 can be seen. For example, eight SIMD data paths (each of which can be partitioned into two 16-bit halves because it can operate on 16-bit packed data) can be used. As shown, these SIMD data paths generally comprise sets of banks 7802-1 to 7802-L, associated registers 7804-1 to 7804-L, and associated sets of functional units 7806-1 to 7806-L.
[0040] In FIG. 9, an example of a portion of one SIMD data path (namely and for example, a portion of one of the registers 7804-1 to 7804-L and a portion of one of the functional units 7806-1 to 7806-L) can be seen. As shown and for example, this SIMD data path can include a 16-entry, 32-bit register file 7902, two 16-bit multipliers 7904 and 7906, and a single 32-bit arithmetic/logical unit 7908 that can also perform two 16-bit packed operations in a cycle. Also, as an example, each SIMD data path can perform two independent 16-bit operations or a combined 32-bit operation. For example, this can form a 32-bit multiply using the 16-bit multipliers combined with 32-bit adds. Additionally, the arithmetic/logical unit 7908 can be capable of performing addition, subtraction, logical operations (i.e., AND), comparisons, and conditional moves.
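The two modes described above (a 32-bit multiply built from 16-bit multiplies and 32-bit adds, and two independent 16-bit operations in one 32-bit word) can be illustrated with the arithmetic below. This is a sketch of the general technique under the assumption that only the low 32 bits of the product are kept; the source does not state how the hardware sequences the partial products, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// One 16x16 -> 32-bit multiply, standing in for a single 16-bit multiplier.
constexpr std::uint32_t mul16(std::uint16_t x, std::uint16_t y) {
    return static_cast<std::uint32_t>(x) * static_cast<std::uint32_t>(y);
}

// Low 32 bits of a 32x32 multiply, using only 16-bit multiplies and
// 32-bit adds. The high*high partial product falls entirely above bit 31,
// so it is not needed for a 32-bit result.
constexpr std::uint32_t mul32Low(std::uint32_t a, std::uint32_t b) {
    const std::uint16_t al = static_cast<std::uint16_t>(a & 0xFFFFu);
    const std::uint16_t ah = static_cast<std::uint16_t>(a >> 16);
    const std::uint16_t bl = static_cast<std::uint16_t>(b & 0xFFFFu);
    const std::uint16_t bh = static_cast<std::uint16_t>(b >> 16);
    const std::uint32_t low = mul16(al, bl);
    // Unsigned wrap-around here is harmless: the wrapped bits would be
    // shifted out of the 32-bit result anyway.
    const std::uint32_t cross = mul16(al, bh) + mul16(ah, bl);
    return low + (cross << 16);
}

// Two independent 16-bit additions packed in one 32-bit word; a carry out
// of the lower half must not reach the upper half.
constexpr std::uint32_t addPacked16(std::uint32_t a, std::uint32_t b) {
    const std::uint32_t lo = (a + b) & 0x0000FFFFu;
    const std::uint32_t hi = ((a >> 16) + (b >> 16)) << 16;
    return hi | lo;
}
```

The sketch uses three 16-bit multiplies per 32-bit product, whereas the data path described has two multipliers; how the partial products are scheduled across cycles is not covered by the text.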
[0041] Turning back to FIG. 8, the SIMD data path registers 7804-1 to 7804-L can use a load/store interface to the vector memory 7603. These loads and stores can use features of the vector memory 7603 that are provided for parallel LUT and histogram access by nodes (i.e., 808-i): for nodes, each SIMD data path half can provide an index into function-memory 7602; and, similarly, each SIMD data path half in SFM processor 7614 can provide an independent vector memory 7603 address. Addressing is generally organized so that adjacent data paths can perform the same operation on multiple instances of datatypes such as scalars, vectors, and arrays of 8-, 16-, or 32-bit (for example) data: these are called vector-implied addressing modes (the vector is implied by the SIMD with linear vector memory 7603 addressing). Alternatively, each data path can operate on packed pixels from regions of a frame within banks 7608-1 to 7608-J: these are called vector-packed addressing modes (vectors of packed pixels are implied by the SIMD, with two-dimensional vector memory 7603 addressing). In both cases, as with the node processor 4322, the programming model can hide the width of the SIMD, and programs are written as if they operate on a single pixel or element of other datatype.
[0042] Vector-implied datatypes are generally SIMD-implemented vectors of either 8-bit chars, 16-bit halfwords, or 32-bit ints, operated on individually by each SIMD data path (i.e., FIG. 9). These vectors are not generally explicit in the program, but rather implied by hardware operation. These datatypes can also be structured as elements within explicit program vectors or arrays: the SIMD effectively adds a hidden second or third dimension to these program vectors or arrays. In effect, the programming view can be a single SIMD data path with a dedicated, 32-bit data memory, and this memory is accessed using conventional addressing modes. In the hardware, this view is mapped in a way that each of the 32 SIMD data paths has the appearance of a private data memory, but the implementation takes advantage of the wide, banked organization of vector memory 7603 to implement this functionality in the shared function-memory 1410.
[0043] The SFM processor 7614 SIMD generally operates within vector memory 7603 contexts similar to node processor 4322 contexts, with descriptors having a base address aligned to the sets of banks 7802-1, and sufficiently large to address the entire vector memory 7603 (i.e., 13 bits for the size of 1024 kB). Each half of a SIMD data path is numbered with a 6-bit identifier (POSN), starting at 0 for the left-most data path. For vector-implied addressing, the LSB of this value is generally ignored, and the remaining five bits are used to align the vector memory 7603 addresses generated by the data path to the respective words in the vector memory 7603.
[0044] Within processing cluster 1400, general-purpose RISC processors serve various purposes. For example, node processor 4322 (which can be a RISC processor) can be used for program flow control. Examples of RISC architectures are described below.
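The POSN numbering in paragraph [0043] can be illustrated with a small sketch: a 6-bit identifier per data-path half, where vector-implied addressing drops the LSB and uses the remaining five bits. The function names and the simple "row base plus data-path index" address are assumptions introduced for illustration; the actual alignment logic is not given in the text.

```cpp
#include <cassert>
#include <cstdint>

// POSN is described as a 6-bit identifier, one per data-path half.
constexpr std::uint32_t kPosnMask = 0x3Fu;

// Five-bit data-path index used for vector-implied addressing (LSB ignored).
constexpr std::uint32_t dataPathOf(std::uint32_t posn) {
    return (posn & kPosnMask) >> 1;
}

// The ignored LSB distinguishes the two halves of one data path.
constexpr std::uint32_t halfOf(std::uint32_t posn) {
    return posn & 1u;
}

// Hypothetical word address: each data path addresses its own word within
// a row of vector memory that starts at rowBaseWord.
constexpr std::uint32_t impliedWordAddress(std::uint32_t rowBaseWord,
                                           std::uint32_t posn) {
    return rowBaseWord + dataPathOf(posn);
}
```

Both halves of a data path (for example POSN 6 and POSN 7) map to the same word under this model, which matches the statement that the LSB is ignored in vector-implied modes.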
[0045] Turning to FIG. 10, a more detailed example of RISC processor 5200 (i.e., node processor 4322) can be seen. The pipeline used by processor 5200 generally provides support for general high level language (i.e., C/C++) execution in processing cluster 1400. In operation, processor 5200 employs a three stage pipeline of fetch, decode, and execute. Typically, context interface 5214 and LS port 5212 provide instructions to the program cache 5208, and the instructions can be fetched from the program cache 5208 by instruction fetch 5204. The bus between the instruction fetch 5204 and the program cache 5208 can, for example, be 40 bits wide, allowing the processor 5200 to support dual issue instructions (i.e., instructions can be 40 bits or 20 bits wide). Generally, "A-side" and "B-side" functional units (within processing unit 5202) execute the smaller instructions (i.e., 20-bit instructions), while the "B-side" functional units execute the larger instructions (i.e., 40-bit instructions). To execute the instructions provided, processing unit 5202 can use register file 5206 as a "scratch pad"; this register file 5206 can be (for example) a 16-entry, 32-bit register file that is shared between the "A-side" and "B-side." Additionally, processor 5200 includes a control register file 5216 and a program counter 5218. Processor 5200 can also be accessed through boundary pins or leads; an example of each is described in Table 2 (with "z" denoting active low pins).
Table 2
(Earlier rows of Table 2 appear only as images imgf000016_0001 to imgf000018_0001 in the source; one surviving note reads "Gated with vec_regf_enz assertion.")
Pin Name | Width | Dir | Purpose
(continued) | | | processor 5200 is executing the second half of a non-parallel 20-bit instruction pair.
risc_fmem_addr | 20 | Output | Vector implied load/store address bus
risc_fmem_bez | 4 | Output | Vector implied load/store byte enables
risc_vec_opr | 4 | Output | This bus represents the vector unit source register for vector implied stores, or the vector unit destination register for vector implied loads.
risc_is_vild | 1 | Output | Vector implied signed load flag.
risc_is_vildu | 1 | Output | Vector implied unsigned load flag.
risc_is_vist | 1 | Output | Vector implied store flag
risc_hg_posn | 8 | Output | Reflects the current contents of the processor 5200 HG_POSN control register
risc_regf_ra[1:0] | 4b x 2 | Input | Register file read address ports. There are two ports. These pins are driven by lane 0 (left most) vector unit. Allows the vector unit to read one of the lower 4 registers in the GPR file.
risc_regf_rd[1:0]z | 1b x 2 | Input | When de-asserted gates off switching on the risc_regf_rdata0/1 buses. Should be driven low to read valid data on risc_regf_rdata.
risc_regf_rdata[1:0] | 32b x 2 | Output | Register file read data ports. There are two ports. These pins are driven by lane 0 (left most) vector unit. These are the read data buses associated with risc_regf_ra.
risc_inc_hg_posn | 1 | Output | Asserted in D0 when a BHGNE instruction is decoded.
wrp_hgposn_ne_hgsize | 1 | Input | Asserted by the SFM wrapper. Indicates whether the wrapper's copy of HG_POSN and HG_SIZE are not equal.
[0046] Turning to FIG. 11, the processor 5200 can be seen in greater detail shown with the pipeline 5300. Here, the instruction fetch 5204 (which corresponds to the fetch stage 5306) is divided into an A-side and B-side, where the A-side receives the first 20 bits (i.e., [19:0]) of a "fetch packet" (which can be a 40-bit wide instruction word having one 40-bit instruction or two 20-bit instructions) and the B-side receives the last 20 bits (i.e., [39:20]) of a fetch packet. Typically, the instruction fetch 5204 determines the structure and size of the instruction(s) in the fetch packet and dispatches the instruction(s) accordingly (which is discussed in section 7.3 below).
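The A-side/B-side split of a 40-bit fetch packet described in paragraph [0046] amounts to extracting bit ranges [19:0] and [39:20]. The following minimal sketch shows that extraction; the struct and function names are hypothetical, and holding the packet in a 64-bit integer is an assumption of this model only.

```cpp
#include <cassert>
#include <cstdint>

// The two 20-bit halves of one 40-bit fetch packet.
struct FetchHalves {
    std::uint32_t aSide;  // bits [19:0]
    std::uint32_t bSide;  // bits [39:20]
};

// Split a fetch packet held in the low 40 bits of a 64-bit value.
constexpr FetchHalves splitFetchPacket(std::uint64_t packet) {
    return FetchHalves{
        static_cast<std::uint32_t>(packet & 0xFFFFFu),
        static_cast<std::uint32_t>((packet >> 20) & 0xFFFFFu)};
}
```

Deciding whether the two halves form one 40-bit instruction or two 20-bit instructions is a separate step performed by the instruction fetch and is not modeled here.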
[0047] A decoder 5221 (which is part of the decode stage 5308 and processing unit 5202) decodes the instruction(s) from the instruction fetch 5204. The decoder 5221 generally includes an operator format circuit 5223-1 and 5223-2 (to generate intermediates) and a decode circuit 5225-1 and 5225-2 for the B-side and A-side, respectively. The output from the decoder 5221 is then received by the decode-to-execution unit 5220 (which is also part of the decode stage 5308 and processing unit 5202). The decode-to-execution unit 5220 generates command(s) for the execution unit 5227 that correspond to the instruction(s) received through the fetch packet.
[0048] The A-side and B-side of the execution unit 5227 are also subdivided. Each of the B-side and A-side of the execution unit 5227 respectively includes a multiply unit 5222-1/5222-2, a Boolean unit 5226-1/5226-2, an add/subtract unit 5228-1/5228-2, and a move unit 5330-1/5330-2. The B-side of the execution unit 5227 also includes a load/store unit 5224 and a branches unit 5232. The multiply unit 5222-1/5222-2, Boolean unit 5226-1/5226-2, add/subtract unit 5228-1/5228-2, and move unit 5330-1/5330-2 can then, respectively, perform a multiply operation, a logical Boolean operation, an add/subtract operation, and a data movement operation on data loaded into the general purpose register file 5206 (which also includes read addresses for each of the A-side and B-side). Move operations can also be performed in the control register file 5216.
[0049] A RISC processor with a vector processing module is generally used with shared function-memory 1410. This RISC processor is largely the same as the RISC processor used for processor 5200 but it includes a vector processing module to extend the computation and load/store bandwidth. This module can contain 16 vector units that are each capable of executing a 4-operation execute packet per cycle. A typical execute packet generally includes a data load from the vector memory array, two register-to-register operations, and a result store to the vector memory array. This type of RISC processor generally uses an instruction word that is 80 bits wide or 120 bits wide, which generally constitutes a "fetch packet" and which may include unaligned instructions. A fetch packet can contain a mixture of 40 bit and 20 bit instructions, which can include vector unit instructions and scalar instructions similar to those used by processor 5200. Typically, vector unit instructions can be 20 bits wide, while other instructions can be 20 bits or 40 bits wide (similar to processor 5200). Vector instructions can also be presented on all lanes of the instruction fetch bus, but, if the fetch packet contains both scalar and vector unit instructions the vector instructions are presented (for example) on instruction fetch bus bits [39:0] and the scalar instruction(s) are presented (for example) on instruction fetch bus bits [79:40]. Additionally, unused instruction fetch bus lanes are padded with NOPs.
[0050] An "execute packet" can then be formed from one or more fetch packets. Partial execute packets are held in the instruction queue until completed. Typically, complete execute packets are submitted to the execute stage (i.e., 5310). Four vector unit instructions (for example), two scalar instructions (for example), or a combination of 20-bit and 40-bit instructions (for example) may execute in a single cycle. Back-to-back 20-bit instructions may also be executed serially. If bit 19 of the current 20-bit instruction is set, this indicates that the current instruction and the subsequent 20-bit instruction form an execute packet. Bit 19 can be generally referred to as the P-bit or parallel bit. If the P-bit is not set, this indicates the end of an execute packet. Back-to-back 20-bit instructions with the P-bit not set cause serial execution of the 20-bit instructions. It should also be noted that this RISC processor (with a vector processing module) may include any of the following constraints:
(1) It is illegal for the P-bit to be set to 1 in a 40 bit instruction (for example);
(2) Load or store instructions should appear on the B-side of the instruction fetch bus (i.e., bits 79:40 for 40 bit loads and stores or on bits 79:60 of the fetch bus for 20 bit loads or stores);
(3) A single scalar load or store is legal;
(4) For the vector units both a single load and a single store can exist in a fetch packet;
(5) It is illegal for a 40 bit instruction to be preceded by a 20 bit instruction with a P-bit equal to 1; and
(6) No hardware is in place to detect these illegal conditions. These restrictions are expected to be enforced by the system programming tool 718.
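The P-bit rule in paragraph [0050] can be sketched as a simple grouping pass over a stream of 20-bit instruction words: a set bit 19 chains the current instruction to the next one, and a clear bit 19 ends the execute packet. This is a simplified model that handles only 20-bit instructions and does not check the constraints listed above; the names PacketizeResult and formExecutePackets are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint32_t kPBit = 1u << 19;        // parallel bit of a 20-bit word
constexpr std::uint32_t kWordMask = 0xFFFFFu;    // keep only the low 20 bits

struct PacketizeResult {
    std::vector<std::vector<std::uint32_t>> packets;  // completed execute packets
    std::vector<std::uint32_t> pending;               // partial packet still held
};

// Group 20-bit instruction words into execute packets using the P-bit.
PacketizeResult formExecutePackets(const std::vector<std::uint32_t>& words) {
    PacketizeResult result;
    for (std::uint32_t word : words) {
        result.pending.push_back(word & kWordMask);
        if ((word & kPBit) == 0) {
            // P-bit clear: this instruction closes the current packet.
            result.packets.push_back(result.pending);
            result.pending.clear();
        }
    }
    // Anything left in `pending` models a partial execute packet that would
    // stay in the instruction queue until more instructions arrive.
    return result;
}
```

Two consecutive words with the P-bit clear each become a one-instruction packet, which corresponds to the serial execution case described in the text.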
[0051] Turning to FIG. 12, an example of a vector module can be seen. The vector module includes a vector decoder 5246, decode-to-execution unit 5250, and an execution unit 5251. The vector decoder 5246 includes slot decoders 5248-1 to 5248-4 that receive instructions from the instruction fetch 5204. Typically, slot decoders 5248-1 and 5248-2 operate in a similar manner to one another, while slot decoders 5248-3 and 5248-4 include load/store decoding circuitry. The decode-to-execution unit 5250 can then generate instructions for the execution unit 5251 based on the decoded output of vector decoder 5246. Each of the slot decoders can generate instructions that can be used by the multiply unit 5252, add/subtract unit 5254, move unit 5256, and Boolean unit 5258 (that each use data and addresses in the general purpose register file 5206). Additionally, slot decoders 5248-3 and 5248-4 can generate load and store instructions for load/store units 5260 and 5262.
[0052] The general purpose register file 5206 can be a 16-entry by 32-bit general purpose register file. The widths of the general purpose registers (GPRs) can be parameterized. Generally, when processor 5200 is used for nodes (i.e., 808-i), there are 4+15 (15 are controlled by boundary pins) read ports and 4+6 (6 are controlled by boundary pins) write ports, while processor 5200 used for GLS unit 1408 has 4 read ports and 4 write ports.
[0053] Instructions that can move data between node processor 4322 and SIMD (i.e., SIMD unit including SIMD data memory 4306-1 and functional unit 4308-1) are indicated in Table 3.
Figure imgf000022_0001
[0054] Table 4 below illustrates an example of an instruction set architecture for processor 5200, where:
(1) Unit designations .SA and .SB are used to distinguish in which issue slot a 20 bit instruction executes;
(2) 40 bit instructions are executed on the B-side (.SB) by convention;
(3) The basic form is <mnemonic> <unit> <comma separated operand list>; and
(4) Pseudo code has a C++ syntax and with the proper libraries can be directly included in
Table 4
(The opening rows of Table 4 appear only as image imgf000023_0001 in the source; the text resumes inside the pseudo code of a preceding entry.)
    vec_regf_enz._assert(0);
    vec_regf_ra._assert(s2);
    s3Save = s3.address();
    initiate.live(true);
    complete.live(vec_wdata_wrz.is(0));
}
MFVVR .SB s1(R5), s2(R5), s3(R4)
MOVE VUNIT/VREG GPR
void ISA::OPC_MFVVR_40b_264(Vunit &s1, Vreg &s2, Gpr &s3)
{
    Reg s3Save;
    risc_is_mfwr._assert(1);
    risc_vec_ua._assert(s1);
    risc_vec_ra._assert(s2);
    s3Save = s3.address();
    initiate.live(true);
    vec_risc_wa._assert(s3);
    vec_risc_wd gets value of Vreg(risc_vec_ra);
    complete.live(vec_risc_wrz.is(0)); //ditto
}
MTV .(SA,SB) s1(R4), s2(R5)
MOVE GPR VREG, REPLICATED (LOW VREG)
void ISA::OPC_MTV_20b_164(Gpr &s1, Vreg &s2)
{
    Result r1;
    r1.clear();
    r1 = s1.range(0,15);
    risc_is_mtv._assert(1);
(The remainder of Table 4 appears only as images imgf000025_0001 to imgf000027_0001 in the source.)
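The MFVVR entry of Table 4 selects a vector unit (the lane address on risc_vec_ua), selects a register within that unit (risc_vec_ra), and writes the value into a general purpose register. A purely behavioral sketch of that data movement is shown below. It ignores the signal handshake and timing entirely, and the types VectorUnit and ScalarCore, the 16-entry register file sizes, and the function mfvvr are assumptions made for illustration rather than the ISA library referred to in the source.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed sizes: 16-entry, 32-bit register files on both sides.
constexpr std::size_t kRegCount = 16;

struct VectorUnit {
    std::array<std::uint32_t, kRegCount> regs{};  // one lane's register file
};

struct ScalarCore {
    std::array<std::uint32_t, kRegCount> gpr{};   // general purpose registers
};

// Behavioral model of MFVVR: copy register `regAddr` of vector unit
// `unitAddr` into GPR `gprAddr`. Returns false if any address is out of
// range, in which case the core is left unchanged.
bool mfvvr(const std::vector<VectorUnit>& units, std::size_t unitAddr,
           std::size_t regAddr, ScalarCore& core, std::size_t gprAddr) {
    if (unitAddr >= units.size() || regAddr >= kRegCount || gprAddr >= kRegCount) {
        return false;
    }
    core.gpr[gprAddr] = units[unitAddr].regs[regAddr];
    return true;
}
```

The range check is a convenience of the model; the source notes elsewhere that illegal conditions are expected to be caught by the programming tools rather than by hardware.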
[0055] Those skilled in the art to which the invention relates will appreciate that modifications may be made to the described embodiments and additional embodiments realized, without departing from the scope of the claimed invention.

Claims

What is claimed is:
1. An apparatus characterized by:
a computational unit (4308-1 to 4308-M, 7607-1 to 7607-P) having a first register file (4358-1 to 4358-8, 7902); and
a processor (4322, 7614) that is coupled to the computational unit (4308-1 to 4308-M, 7607-1 to 7607-P), wherein the processor (4322, 7614) includes an instruction set having a data movement instruction (MFVVR) from the first register file (4358-1 to 4358-8, 7902), wherein the processor includes:
a second register file (5206);
a first address lead (risc_is_ua) for indicating a lane address for the first register file (4358-1 to 4358-8, 7902);
a second address lead (risc_is_ra) for indicating a read address for the first register file (4358-1 to 4358-8, 7902);
a data interface lead (node_regf_rd) for transferring data; and
a data movement lead (risc_is_mfwr) for indicating the data movement instruction (MFVVR) from the first register file (4358-1 to 4358-8, 7902) to the second register file (5206) when the state of a signal on the data movement lead is changed.
2. The apparatus of Claim 1, wherein the first address lead (risc_is_ua) is further characterized by a plurality of first address leads (risc_is_ua), and wherein the second address lead (risc_is_ra) is further characterized by a plurality of second address leads (risc_is_ra).
3. The apparatus of Claim 2, wherein the plurality of first address leads (risc_is_ua) and the plurality of second address leads (risc_is_ra) are each 5 bits wide.
4. The apparatus of Claims 1, 2, or 3, wherein the processor includes a halfword lead (risc_is_hwz) for indicating whether to perform an upper half write, a lower half write, a full write, or a read.
5. The apparatus of Claims 1, 2, 3, or 4, wherein the halfword lead (risc_is_hwz) is further characterized by a plurality of halfword leads (risc_is_hwz).
6. The apparatus of Claim 5, wherein the plurality of halfword leads (risc_is_hwz) is 2 bits wide.
7. The apparatus of Claims 1, 2, 3, 4, 5, or 6, wherein the data interface lead (node_regf_rd) is further characterized by a plurality of data interface leads (node_regf_rd).
8. The apparatus of Claims 1, 2, 3, 4, 5, 6, or 7, wherein the computational unit (4308-1 to 4308-M, 7607-1 to 7607-P) is further characterized by a plurality of single input multiple data (SIMD) functional units (4308-1 to 4308-M).
9. The apparatus of Claims 1, 2, 3, 4, 5, 6, or 7, wherein the computational unit (4308-1 to 4308-M, 7607-1 to 7607-P) is further characterized by a plurality of vector units (7607-1 to 7607-P).
10. A method characterized by:
changing the state of a signal on a data movement lead (risc_is_mfwr) to indicate a data movement instruction (MFVVR) from a first register file (4358-1 to 4358-8, 7902) in a computational unit (4308-1 to 4308-M, 7607-1 to 7607-P) to a second register file (5206) in a processor (4322, 7614);
providing a lane address from the processor (4322, 7614) to the computational unit (4308-1 to 4308-M, 7607-1 to 7607-P) over a first address lead (risc_is_ua);
providing a read address from the processor (4322, 7614) to the computational unit (4308-1 to 4308-M, 7607-1 to 7607-P) over a second address lead (risc_is_ra); and
transferring data from the first register file (4358-1 to 4358-8, 7902) in the computational unit (4308-1 to 4308-M, 7607-1 to 7607-P) to the second register file (5206) in the processor (4322, 7614) over a data interface lead (node_regf_rd).
11. The method of Claim 10, wherein the first address lead (risc_is_ua) is further characterized by a plurality of first address leads (risc_is_ua), and wherein the second address lead (risc_is_ra) is further characterized by a plurality of second address leads (risc_is_ra).
12. The method of Claims 10 or 11, wherein the method is further characterized by indicating whether to perform an upper half write, a lower half write, a full write, or a read over a halfword lead (risc_is_hwz).
13. The method of Claims 10, 11, or 12, wherein the halfword lead (risc_is_hwz) is further characterized by a plurality of halfword leads (risc_is_hwz).
14. The method of Claims 10, 11, 12, or 13, wherein the data interface lead (node_regf_rd) is further characterized by a plurality of data interface leads (node_regf_rd).
15. A system characterized by:
means for changing the state of a signal on a data movement lead (risc_is_mfwr) to indicate a data movement instruction (MFVVR) from a first register file (4358-1 to 4358-8, 7902) in a computational unit (4308-1 to 4308-M, 7607-1 to 7607-P) to a second register file (5206) in a processor (4322, 7614);
means for providing a lane address from the processor (4322, 7614) to the computational unit (4308-1 to 4308-M, 7607-1 to 7607-P) over a first address lead (risc_is_ua);
means for providing a read address from the processor (4322, 7614) to the computational unit (4308-1 to 4308-M, 7607-1 to 7607-P) over a second address lead (risc_is_ra); and
means for transferring data from the first register file (4358-1 to 4358-8, 7902) in the computational unit (4308-1 to 4308-M, 7607-1 to 7607-P) to the second register file (5206) in the processor (4322, 7614) over a data interface lead (node_regf_rd).
16. The system of Claim 15, wherein the first address lead (risc_is_ua) is further characterized by a plurality of first address leads (risc_is_ua), and wherein the second address lead (risc_is_ra) is further characterized by a plurality of second address leads (risc_is_ra).
17. The system of Claims 15 or 16, wherein the system is further characterized by means for indicating whether to perform an upper half write, a lower half write, a full write, or a read over a halfword lead (risc_is_hwz).
18. The system of Claims 15, 16, or 17, wherein the halfword lead (risc_is_hwz) is further characterized by a plurality of halfword leads (risc_is_hwz).
19. The system of Claims 15, 16, 17, or 18, wherein the data interface lead (node_regf_rd) is further characterized by a plurality of data interface leads (node_regf_rd).
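
The signalling recited in the method claims can be illustrated with a small behavioral model. The following is a minimal sketch only, not the claimed circuitry: the class names, the 2-bit halfword encoding, the register counts, and the use of the read code during the move are all assumptions made for illustration. The only details taken from the claims are the lead names, the 5-bit address widths, and the 2-bit halfword width.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional

ADDR_BITS = 5  # the claims recite 5-bit lane and read addresses


class HalfwordOp(IntEnum):
    """Hypothetical 2-bit encoding; the claims list the four operations
    but do not assign bit values to them."""
    READ = 0
    LOWER_HALF_WRITE = 1
    UPPER_HALF_WRITE = 2
    FULL_WRITE = 3


@dataclass
class MoveRequest:
    """Signals driven by the processor toward the computational unit."""
    risc_is_mfwr: bool       # data movement lead (asserted for the move)
    risc_is_ua: int          # lane address
    risc_is_ra: int          # read address within the lane's register file
    risc_is_hwz: HalfwordOp  # halfword lead

    def __post_init__(self) -> None:
        # Reject addresses that do not fit on the 5-bit address leads.
        for name in ("risc_is_ua", "risc_is_ra"):
            value = getattr(self, name)
            if not 0 <= value < (1 << ADDR_BITS):
                raise ValueError(f"{name}={value} does not fit in {ADDR_BITS} bits")


class ComputationalUnit:
    """Holds one register file per lane (sizes here are illustrative)."""

    def __init__(self, lanes: int, regs_per_lane: int) -> None:
        self.regfile: List[List[int]] = [[0] * regs_per_lane for _ in range(lanes)]

    def serve(self, req: MoveRequest) -> Optional[int]:
        # Respond only while the data movement lead is asserted.
        if not req.risc_is_mfwr:
            return None
        return self.regfile[req.risc_is_ua][req.risc_is_ra]


class Processor:
    """Holds the general purpose register file that receives the data."""

    def __init__(self, num_gprs: int = 32) -> None:
        self.gpr: List[int] = [0] * num_gprs

    def move_from_vector(self, unit: ComputationalUnit, lane: int, src: int, dst: int) -> int:
        # Step 1: assert the data movement lead; steps 2 and 3: drive the addresses.
        req = MoveRequest(True, lane, src, HalfwordOp.READ)
        # Step 4: the data comes back over the data interface lead.
        node_regf_rd = unit.serve(req)
        if node_regf_rd is None:
            raise RuntimeError("computational unit did not respond")
        self.gpr[dst] = node_regf_rd
        return node_regf_rd


unit = ComputationalUnit(lanes=8, regs_per_lane=32)  # arbitrary sizes
unit.regfile[3][7] = 0xBEEF
cpu = Processor()
cpu.move_from_vector(unit, lane=3, src=7, dst=2)
```

In this model the four steps of the method map onto one call: the request object carries the data movement lead and both addresses, and the returned value stands in for the data interface lead.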
PCT/US2011/061428 2010-11-18 2011-11-18 Method and apparatus for moving data from a simd register file to general purpose register file WO2012068475A2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2013540058A JP2014505916A (en) 2010-11-18 2011-11-18 Method and apparatus for moving data from a SIMD register file to a general purpose register file
CN201180055771.5A CN103221935B (en) 2010-11-18 2011-11-18 The method and apparatus moving data to general-purpose register file from simd register file

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US41520510P 2010-11-18 2010-11-18
US41521010P 2010-11-18 2010-11-18
US61/415,210 2010-11-18
US61/415,205 2010-11-18
US13/232,774 US9552206B2 (en) 2010-11-18 2011-09-14 Integrated circuit with control node circuitry and processing circuitry
US13/232,774 2011-09-14

Publications (2)

Publication Number Publication Date
WO2012068475A2 true WO2012068475A2 (en) 2012-05-24
WO2012068475A3 WO2012068475A3 (en) 2012-07-12

Family

ID=46065497

Family Applications (8)

Application Number Title Priority Date Filing Date
PCT/US2011/061431 WO2012068478A2 (en) 2010-11-18 2011-11-18 Shared function-memory circuitry for a processing cluster
PCT/US2011/061444 WO2012068486A2 (en) 2010-11-18 2011-11-18 Load/store circuitry for a processing cluster
PCT/US2011/061428 WO2012068475A2 (en) 2010-11-18 2011-11-18 Method and apparatus for moving data from a simd register file to general purpose register file
PCT/US2011/061456 WO2012068494A2 (en) 2010-11-18 2011-11-18 Context switch method and apparatus
PCT/US2011/061474 WO2012068504A2 (en) 2010-11-18 2011-11-18 Method and apparatus for moving data
PCT/US2011/061369 WO2012068449A2 (en) 2010-11-18 2011-11-18 Control node for a processing cluster
PCT/US2011/061487 WO2012068513A2 (en) 2010-11-18 2011-11-18 Method and apparatus for moving data
PCT/US2011/061461 WO2012068498A2 (en) 2010-11-18 2011-11-18 Method and apparatus for moving data to a simd register file from a general purpose register file

Family Applications Before (2)

Application Number Title Priority Date Filing Date
PCT/US2011/061431 WO2012068478A2 (en) 2010-11-18 2011-11-18 Shared function-memory circuitry for a processing cluster
PCT/US2011/061444 WO2012068486A2 (en) 2010-11-18 2011-11-18 Load/store circuitry for a processing cluster

Family Applications After (5)

Application Number Title Priority Date Filing Date
PCT/US2011/061456 WO2012068494A2 (en) 2010-11-18 2011-11-18 Context switch method and apparatus
PCT/US2011/061474 WO2012068504A2 (en) 2010-11-18 2011-11-18 Method and apparatus for moving data
PCT/US2011/061369 WO2012068449A2 (en) 2010-11-18 2011-11-18 Control node for a processing cluster
PCT/US2011/061487 WO2012068513A2 (en) 2010-11-18 2011-11-18 Method and apparatus for moving data
PCT/US2011/061461 WO2012068498A2 (en) 2010-11-18 2011-11-18 Method and apparatus for moving data to a simd register file from a general purpose register file

Country Status (4)

Country Link
US (1) US9552206B2 (en)
JP (9) JP2014501007A (en)
CN (8) CN103221934B (en)
WO (8) WO2012068478A2 (en)

US11556378B2 (en) 2020-12-14 2023-01-17 Mellanox Technologies, Ltd. Offloading execution of a multi-task parameter-dependent operation to a network device
US11243773B1 (en) 2020-12-14 2022-02-08 International Business Machines Corporation Area and power efficient mechanism to wakeup store-dependent loads according to store drain merges
TWI768592B (en) * 2020-12-14 2022-06-21 瑞昱半導體股份有限公司 Central processing unit
CN112924962B (en) * 2021-01-29 2023-02-21 上海匀羿电磁科技有限公司 Underground pipeline lateral deviation filtering detection and positioning method
CN113112393B (en) * 2021-03-04 2022-05-31 浙江欣奕华智能科技有限公司 Marginalizing device in visual navigation system
CN113438171B (en) * 2021-05-08 2022-11-15 清华大学 Multi-chip connection method of low-power-consumption storage and calculation integrated system
CN113553266A (en) * 2021-07-23 2021-10-26 湖南大学 Parallelism detection method, system, terminal and readable storage medium of serial program based on parallelism detection model
US20230086827A1 (en) * 2021-09-23 2023-03-23 Oracle International Corporation Analyzing performance of resource systems that process requests for particular datasets
US11770345B2 (en) * 2021-09-30 2023-09-26 US Technology International Pvt. Ltd. Data transfer device for receiving data from a host device and method therefor
JP2023082571A (en) * 2021-12-02 2023-06-14 富士通株式会社 Calculation processing unit and calculation processing method
US20230289189A1 (en) * 2022-03-10 2023-09-14 Nvidia Corporation Distributed Shared Memory
WO2023214915A1 (en) * 2022-05-06 2023-11-09 IntuiCell AB A data processing system for processing pixel data to be indicative of contrast.
US11922237B1 (en) 2022-09-12 2024-03-05 Mellanox Technologies, Ltd. Single-step collective operations

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020052868A1 (en) * 2000-10-04 2002-05-02 Sanjeev Mohindra SIMD system and method
US20050055534A1 (en) * 2003-09-08 2005-03-10 Moyer William C. Data processing system having instruction specifiers for SIMD operations and method thereof
US20050125635A1 (en) * 2003-12-09 2005-06-09 Arm Limited Moving data between registers of different register data stores
US20080133874A1 (en) * 2006-03-02 2008-06-05 International Business Machines Corporation Method, system and program product for simd-oriented management of register maps for map-based indirect register-file access

Family Cites Families (77)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4862350A (en) * 1984-08-03 1989-08-29 International Business Machines Corp. Architecture for a distributive microprocessing system
GB2211638A (en) * 1987-10-27 1989-07-05 Ibm Simd array processor
US5218709A (en) * 1989-12-28 1993-06-08 The United States Of America As Represented By The Administrator Of The National Aeronautics And Space Administration Special purpose parallel computer architecture for real-time control and simulation in robotic applications
IL97315A (en) * 1990-02-28 1994-10-07 Hughes Aircraft Co Multiple cluster signal processor
US5815723A (en) * 1990-11-13 1998-09-29 International Business Machines Corporation Picket autonomy on a SIMD machine
CA2073516A1 (en) * 1991-11-27 1993-05-28 Peter Michael Kogge Dynamic multi-mode parallel processor array architecture computer system
US5315700A (en) * 1992-02-18 1994-05-24 Neopath, Inc. Method and apparatus for rapidly processing data sequences
JPH07287700A (en) * 1992-05-22 1995-10-31 Internatl Business Mach Corp <Ibm> Computer system
US5315701A (en) * 1992-08-07 1994-05-24 International Business Machines Corporation Method and system for processing graphics data streams utilizing scalable processing nodes
US5560034A (en) * 1993-07-06 1996-09-24 Intel Corporation Shared command list
JPH07210545A (en) * 1994-01-24 1995-08-11 Matsushita Electric Ind Co Ltd Parallel processing processors
US6002411A (en) * 1994-11-16 1999-12-14 Interactive Silicon, Inc. Integrated video and memory controller with data processing and graphical processing capabilities
JPH1049368A (en) * 1996-07-30 1998-02-20 Mitsubishi Electric Corp Microprocessor having condition execution instruction
JP3778573B2 (en) * 1996-09-27 2006-05-24 株式会社ルネサステクノロジ Data processor and data processing system
US6108775A (en) * 1996-12-30 2000-08-22 Texas Instruments Incorporated Dynamically loadable pattern history tables in a multi-task microprocessor
US6243499B1 (en) * 1998-03-23 2001-06-05 Xerox Corporation Tagging of antialiased images
JP2000207202A (en) * 1998-10-29 2000-07-28 Pacific Design Kk Controller and data processor
JP5285828B2 (en) * 1999-04-09 2013-09-11 ラムバス・インコーポレーテッド Parallel data processor
US8171263B2 (en) * 1999-04-09 2012-05-01 Rambus Inc. Data processing apparatus comprising an array controller for separating an instruction stream processing instructions and data transfer instructions
US6751698B1 (en) * 1999-09-29 2004-06-15 Silicon Graphics, Inc. Multiprocessor node controller circuit and method
EP1102163A3 (en) * 1999-11-15 2005-06-29 Texas Instruments Incorporated Microprocessor with improved instruction set architecture
JP2001167069A (en) * 1999-12-13 2001-06-22 Fujitsu Ltd Multiprocessor system and data transfer method
JP2002073329A (en) * 2000-08-29 2002-03-12 Canon Inc Processor
US6959346B2 (en) * 2000-12-22 2005-10-25 Mosaid Technologies, Inc. Method and system for packet encryption
JP5372307B2 (en) * 2001-06-25 2013-12-18 株式会社ガイア・システム・ソリューション Data processing apparatus and control method thereof
GB0119145D0 (en) * 2001-08-06 2001-09-26 Nokia Corp Controlling processing networks
JP2003099252A (en) * 2001-09-26 2003-04-04 Pacific Design Kk Data processor and its control method
JP3840966B2 (en) * 2001-12-12 2006-11-01 ソニー株式会社 Image processing apparatus and method
US7853778B2 (en) * 2001-12-20 2010-12-14 Intel Corporation Load/move and duplicate instructions for a processor
US7548586B1 (en) * 2002-02-04 2009-06-16 Mimar Tibet Audio and video processing apparatus
US7506135B1 (en) * 2002-06-03 2009-03-17 Mimar Tibet Histogram generation with vector operations in SIMD and VLIW processor by consolidating LUTs storing parallel update incremented count values for vector data elements
AU2003256870A1 (en) * 2002-08-09 2004-02-25 Intel Corporation Multimedia coprocessor control mechanism including alignment or broadcast instructions
JP2004295494A (en) * 2003-03-27 2004-10-21 Fujitsu Ltd Multiple processing node system having versatility and real time property
US7836276B2 (en) * 2005-12-02 2010-11-16 Nvidia Corporation System and method for processing thread groups in a SIMD architecture
DE10353267B3 (en) * 2003-11-14 2005-07-28 Infineon Technologies Ag Multithread processor architecture for triggered thread switching without cycle time loss and without switching program command
US8566828B2 (en) * 2003-12-19 2013-10-22 Stmicroelectronics, Inc. Accelerator for multi-processing system and method
US7206922B1 (en) * 2003-12-30 2007-04-17 Cisco Systems, Inc. Instruction memory hierarchy for an embedded processor
US7412587B2 (en) * 2004-02-16 2008-08-12 Matsushita Electric Industrial Co., Ltd. Parallel operation processor utilizing SIMD data transfers
JP4698242B2 (en) * 2004-02-16 2011-06-08 パナソニック株式会社 Parallel processing processor, control program and control method for controlling operation of parallel processing processor, and image processing apparatus equipped with parallel processing processor
JP2005352568A (en) * 2004-06-08 2005-12-22 Hitachi-Lg Data Storage Inc Analog signal processing circuit, rewriting method for its data register, and its data communication method
US7681199B2 (en) * 2004-08-31 2010-03-16 Hewlett-Packard Development Company, L.P. Time measurement using a context switch count, an offset, and a scale factor, received from the operating system
US7565469B2 (en) * 2004-11-17 2009-07-21 Nokia Corporation Multimedia card interface method, computer program product and apparatus
US7257695B2 (en) * 2004-12-28 2007-08-14 Intel Corporation Register file regions for a processing system
US20060155955A1 (en) * 2005-01-10 2006-07-13 Gschwind Michael K SIMD-RISC processor module
GB2437836B (en) * 2005-02-25 2009-01-14 Clearspeed Technology Plc Microprocessor architectures
GB2423840A (en) * 2005-03-03 2006-09-06 Clearspeed Technology Plc Reconfigurable logic in processors
US7992144B1 (en) * 2005-04-04 2011-08-02 Oracle America, Inc. Method and apparatus for separating and isolating control of processing entities in a network interface
CN101322111A (en) * 2005-04-07 2008-12-10 杉桥技术公司 Multithreading processor with each threading having multiple concurrent assembly line
US20060259737A1 (en) * 2005-05-10 2006-11-16 Telairity Semiconductor, Inc. Vector processor with special purpose registers and high speed memory access
WO2006123822A1 (en) * 2005-05-20 2006-11-23 Sony Corporation Signal processor
JP2006343872A (en) * 2005-06-07 2006-12-21 Keio Gijuku Multithreaded central operating unit and simultaneous multithreading control method
US20060294344A1 (en) * 2005-06-28 2006-12-28 Universal Network Machines, Inc. Computer processor pipeline with shadow registers for context switching, and method
US8275976B2 (en) * 2005-08-29 2012-09-25 The Invention Science Fund I, Llc Hierarchical instruction scheduler facilitating instruction replay
US7617363B2 (en) * 2005-09-26 2009-11-10 Intel Corporation Low latency message passing mechanism
US7421529B2 (en) * 2005-10-20 2008-09-02 Qualcomm Incorporated Method and apparatus to clear semaphore reservation for exclusive access to shared memory
CN101366004A (en) * 2005-12-06 2009-02-11 波士顿电路公司 Methods and apparatus for multi-core processing with dedicated thread management
US7788468B1 (en) * 2005-12-15 2010-08-31 Nvidia Corporation Synchronization of threads in a cooperative thread array
CN2862511Y (en) * 2005-12-15 2007-01-24 李志刚 Multifunctional interface panel for GJB-289A bus
US8560863B2 (en) * 2006-06-27 2013-10-15 Intel Corporation Systems and techniques for datapath security in a system-on-a-chip device
JP2008059455A (en) * 2006-09-01 2008-03-13 Kawasaki Microelectronics Kk Multiprocessor
EP2527972A3 (en) * 2006-11-14 2014-08-06 Soft Machines, Inc. Apparatus and method for processing complex instruction formats in a multi- threaded architecture supporting various context switch modes and virtualization schemes
US7870400B2 (en) * 2007-01-02 2011-01-11 Freescale Semiconductor, Inc. System having a memory voltage controller which varies an operating voltage of a memory and method therefor
JP5079342B2 (en) * 2007-01-22 2012-11-21 ルネサスエレクトロニクス株式会社 Multiprocessor device
US20080270363A1 (en) * 2007-01-26 2008-10-30 Herbert Dennis Hunt Cluster processing of a core information matrix
US8250550B2 (en) * 2007-02-14 2012-08-21 The Mathworks, Inc. Parallel processing of distributed arrays and optimum data distribution
CN101021832A (en) * 2007-03-19 2007-08-22 中国人民解放军国防科学技术大学 64 bit floating-point integer amalgamated arithmetic group capable of supporting local register and conditional execution
US8132172B2 (en) * 2007-03-26 2012-03-06 Intel Corporation Thread scheduling on multiprocessor systems
US7627744B2 (en) * 2007-05-10 2009-12-01 Nvidia Corporation External memory accessing DMA request scheduling in IC of parallel processing engines according to completion notification queue occupancy level
CN100461095C (en) * 2007-11-20 2009-02-11 浙江大学 Medium reinforced pipelined multiplication unit design method supporting multiple mode
FR2925187B1 (en) * 2007-12-14 2011-04-08 Commissariat Energie Atomique System comprising a plurality of processing units for executing tasks in parallel by mixing the control-type execution mode and the data-flow-type execution mode
CN101471810B (en) * 2007-12-28 2011-09-14 华为技术有限公司 Method, device and system for implementing task in cluster circumstance
US20090183035A1 (en) * 2008-01-10 2009-07-16 Butler Michael G Processor including hybrid redundancy for logic error protection
US9619428B2 (en) * 2008-05-30 2017-04-11 Advanced Micro Devices, Inc. SIMD processing unit with local data share and access to a global data share of a GPU
CN101739235A (en) * 2008-11-26 2010-06-16 中国科学院微电子研究所 Processor unit for seamless connection between 32-bit DSP and universal RISC CPU
CN101799750B (en) * 2009-02-11 2015-05-06 上海芯豪微电子有限公司 Data processing method and device
CN101593164B (en) * 2009-07-13 2012-05-09 中国船舶重工集团公司第七○九研究所 Slave USB HID device and firmware implementation method based on embedded Linux
US9552206B2 (en) * 2010-11-18 2017-01-24 Texas Instruments Incorporated Integrated circuit with control node circuitry and processing circuitry

Also Published As

Publication number Publication date
CN103221918B (en) 2017-06-09
JP5989656B2 (en) 2016-09-07
CN103221935A (en) 2013-07-24
CN103221933A (en) 2013-07-24
WO2012068475A3 (en) 2012-07-12
US9552206B2 (en) 2017-01-24
WO2012068513A3 (en) 2012-09-20
JP2014505916A (en) 2014-03-06
CN103221937B (en) 2016-10-12
WO2012068494A2 (en) 2012-05-24
JP2014501009A (en) 2014-01-16
WO2012068449A8 (en) 2013-01-03
JP2014500549A (en) 2014-01-09
WO2012068486A2 (en) 2012-05-24
CN103221938B (en) 2016-01-13
CN103221936A (en) 2013-07-24
JP2016129039A (en) 2016-07-14
CN103221918A (en) 2013-07-24
WO2012068513A2 (en) 2012-05-24
WO2012068498A2 (en) 2012-05-24
US20120131309A1 (en) 2012-05-24
WO2012068498A3 (en) 2012-12-13
JP6096120B2 (en) 2017-03-15
JP2014501969A (en) 2014-01-23
CN103221934B (en) 2016-08-03
WO2012068504A2 (en) 2012-05-24
CN103221933B (en) 2016-12-21
JP5859017B2 (en) 2016-02-10
JP2014501007A (en) 2014-01-16
WO2012068449A2 (en) 2012-05-24
CN103221937A (en) 2013-07-24
JP2014501008A (en) 2014-01-16
WO2012068494A3 (en) 2012-07-19
WO2012068504A3 (en) 2012-10-04
WO2012068478A2 (en) 2012-05-24
JP2014503876A (en) 2014-02-13
CN103221938A (en) 2013-07-24
CN103221935B (en) 2016-08-10
JP2013544411A (en) 2013-12-12
WO2012068449A3 (en) 2012-08-02
WO2012068486A3 (en) 2012-07-12
CN103221936B (en) 2016-07-20
WO2012068478A3 (en) 2012-07-12
CN103221939B (en) 2016-11-02
CN103221939A (en) 2013-07-24
CN103221934A (en) 2013-07-24
JP6243935B2 (en) 2017-12-06

Similar Documents

Publication Publication Date Title
WO2012068475A2 (en) Method and apparatus for moving data from a simd register file to general purpose register file
EP1137984B1 (en) A multiple-thread processor for threaded software applications
US7114056B2 (en) Local and global register partitioning in a VLIW processor
US6343348B1 (en) Apparatus and method for optimizing die utilization and speed performance by register file splitting
US20150205324A1 (en) Clock routing techniques
US6785743B1 (en) Template data transfer coprocessor
US20220014202A1 (en) Three-Dimensional Stacked Programmable Logic Fabric and Processor Design Architecture
US6625634B1 (en) Efficient implementation of multiprecision arithmetic
US10620958B1 (en) Crossbar between clients and a cache
US20230153114A1 (en) Data processing system having distributed registers
Duric Specialization and reconfiguration of lightweight mobile processors for data-parallel applications

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 11841878; Country of ref document: EP; Kind code of ref document: A2)
ENP Entry into the national phase in: (Ref document number: 2013540058; Country of ref document: JP; Kind code of ref document: A)
NENP Non-entry into the national phase in: (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 11841878; Country of ref document: EP; Kind code of ref document: A2)