WO2009001368A2

WO2009001368A2 - A method and system-on-chip fabric

Info

Publication number: WO2009001368A2
Application number: PCT/IN2007/000262
Authority: WO
Inventors: Soumitra Kumar Nandy; Ranjani Narayan; Keshavan Varadarajan; Mythri Alle; Amar Nath Satrawala; Shimoga Janakiram Adarsha Rao
Original assignee: Indian Institute Of Science
Priority date: 2007-06-28
Filing date: 2007-06-28
Publication date: 2008-12-31
Also published as: WO2009001368A3

Abstract

This invention provides a fabric within a SoC framework, along with a system and method, in which resources can be composed as computational structures that best match the application's needs. The fabric, disclosed herein, contains compute, storage and communication resources that can be aggregated at runtime to perform specific application tasks. The system comprises of a scheduler, a cluster configuration store, an execution fabric containing a plurality of computational resources, a resource binding agent, a Load Store Unit, and Store Destination Decision Logic (SDDL). The method of the present invention comprises the steps of developing High Level Language (HLL) descriptions of applications modules; converting the HLL description of the modules of the application to an intermediate representation; compiling into clusters using the dataflow graph of the application; performing binding operations; and Performing controlled dataflow execution wherein a set of clusters are scheduled and executed on the fabric.

Description

A METHOD AND SYSTEM-ON-CHIP-FABRIC FIELD OF THE INVENTION

This invention relates to a System on Chip fabric in which compute, storage and communication resources can be aggregated at runtime to perform specific application tasks, along with a method and system to complement this.

DISCUSSION OF PRIORART

SoC platforms are programmable platforms comprising a variety of processing cores (RISC, DSP, application engines, coprocessors etc.), for which applications are developed in a High Level Language (HLL). Application development in HLL relies largely on the compiler infrastructure. The micro architectural resources are exposed to the compiler, so that functionalities/modules (of the application) are expressed as an execution sequence of instructions such that impositions of structural and execution semantics of the architecture are adhered to. Traditionally this is achieved by following an execution pipeline in terms of instruction fetch/decode, execute and memory writes back stages. Consequently, it becomes necessary to follow the process of application decomposition into modules that can be implemented on cores. The diktat of the given platform and execution paradigm determines the application decomposition and execution. The shortcomings manifest as compromised performance, and inefficient utilization of micro-architectural resources, since the application is tailored to fit the given platform.

With regard to Multiprocessor System-on-Chip (MP-SoCs), the key to performance is to arrive at an optimal partition of the application into modules, which can be assigned to the heterogeneous processing cores. While software solutions offer flexibility in realization of applications within certain performance levels, they cannot ensure scalability both in terms of performance and application enhancements. Hardware solutions, as in ASICs on the other hand guarantee performance at the cost of flexibility and large NRE cost associated with silicon fabrication of the device. Field Programmable Gate Arrays (FPGAs), offer a platform in which computational structures can be composed in terms of Look-up Tables (LUTs), which serve as logic elements. LUTs in the FPGA are used as universal logic gates that can emulate any logical operation. The composition of computational structures (i.e. sequential/combinational circuit) is achieved by the process of placement and routing. The disadvantages with regard to FPGAs are primarily due to the fine-grained nature of the elementary units that are meant to emulate logic elements (viz. AND, OR and NOT). The use of fine-grained elementary units (Le. LUTs/LEs) in FPGAs necessitates the use of RTL for specifying the application. The non-availability of efficient tools to translate from high level programming language to RTL aggravates the problem, since it would require users to write RTL equivalent models for the software code. In the current design flow followed by the embedded industry, RTL is not available until very late in the design cycle. This makes FPGA unsuitable for early functional verification and application engine synthesis from High Level Languages, hi addition, these logic elements incur latency equivalent to that of memory accesses, which is both slow and more power hungry when compared to their ASIC equivalents.

Automated Application synthesis has been touted as the next tipping point for the embedded industry, similar to EDA a decade ago . Research in this field has focused on reducing Engineering costs and reducing design time. Automatic Application Engine

Synthesis is a general methodology and several flavors have been proposed in various literature (academic and commercial solutions). Automatic Application Engine Synthesis can be classified based on the target generated as a result of application synthesis. These are

• ASIC targeted application synthesis

• Application retargeting to a generic platform

ASIC targeted application synthesis: This involves taking a high-level language description (typically C) of the application and transforming the same into RTL to make it hardware synthesizable. hi the currently available solutions, the application synthesis is a process of customizing template processor architectures for the given application and addition of new instructions to the Instruction Set Architecture of the template processor. The process of customizing involves determining the number of processors connected together, the interconnection bandwidth between these processors, the buffer sizes used hi communication, the instructions to be included/excluded in the processor etc. This technique of ASIC targeted application synthesis reduces Non Recurring Engineering (NRE) costs associated design time and number of designers required to accomplish the task. However, this does not address the NRE costs associated with back end process of design (i.e. RTL to die) and does not try to reduce the manufacturing costs i.e. die to chip. hi addition, this technique is not the preferred route if the number of instances of the final chip to be produced is a very small number, since it is not cost effective to manufacture ASICs in small numbers. The biggest advantage of ASICs are its power and performance characteristics, which are unmatched by general-purpose processors. Some commercial products operating in this space include Synfora's PICO Flex [14], Lisa from CoWare, Poseidon's Triton [9] and SiliconHive [4]. The technique of ASIC targeted application synthesis does not address the NRE costs associated with back end process of design (i.e. RTL to die). It does not even try to reduce the manufacturing costs i.e. die to chip. In addition, this technique is not the preferred route if the number of instances of the final chip to be produced is a very small number, since it is not cost effective to manufacture ASICs in small numbers. Even though the end product may be technically superior (w.r.t performance and power characteristics) it does not make economic sense in the age of "short lifetime" gadgets.

Application Retargeting to a generic platform: In this technique a generic platform is designed by a vendor. A compiler irifrastructure is provided that retargets the application to the said hardware platform. The effectiveness of this technique depends on the effectiveness of the compiler irifrastructure to retarget the given application for efficient execution on the given platform and depends on the ability of the platform to execute the application in a manner that makes it comparable to an ASIC. However, it is possible that the hardware platform designed is resource constrained, making it impossible to map an application completely, hi order to address this problem, the hardware vendor may choose to implement runtime reconfigurability to be able to partition the problem and overlay different partitions onto the same hardware. This solution definitely brings down the total NRE costs (i.e. design to RTL and RTL to die) and the manufacturing cost (i.e. die to chip), since only the generic platform is manufactured. The economic success of such a solution is highly dependent on the compiler infrastructure and the capability to dynamically reconfigure with minimum overhead. Commercial products that have been attempted in this space include DAPDNA from IPFLex [7] and Stretch from Stretch Inc. [13] MOLEN [9] from TU Delft also falls under this category. Mapping application logic for predetermined configurable platforms requires a completely different process of compilation. The compilation process needs to keep in mind the granularity of allocation of hardware units, the list of supported operations, the available bisection bandwidth and available throughput of the interconnection network and the total number of such units available on a given platform. The application substructures chosen for this hardware platform must be optimal, for the given hardware platform to improve performance of the application. In this case, the application substructures are chosen to match the given hardware. Solutions in this space include DAPDNA from IPFlex [7], Stretch from Stretch Inc and MOLEN [9].

The DAPDNA platform contains 376 ALUs packed in 6 segments. The application is expressed in a language called Dataflow C and then converted into a hardware configuration through the process of compilation. The greatest limitation of this solution was the design entry point. The language dataflow C is very restrictive. Yet, another limitation of the DAPDNA approach is the time required to reconfigure the fabric. It takes about 200 cycles to load a new configuration. MOLEN [9] and the solution from Stretch Inc. [13] are identical in their approaches. There is a core processor in both cases.

Li the case of Stretch, a custom VLIW core from Tensilica is used. There is a reconfigurable fabric that has been provided to obtain additional functionality only. These platforms aid adding a new instruction to the ISA based on the application, without having the user to ask for a refabrication of the chip. This helps reduce the manufacturing costs for small changes in the design, post manufacturing. The amount of functionality that can be added post manufacturing is limited by the size of the reconfigurable fabric. The entire philosophy of adding new instructions is to reduce the data transfer latency between two closely interacting instructions. Hence providing a hardware implementation for this new combined instruction helps make applications faster. As per Amadahl's law, the total performance gain due to addition of an instruction is limited by the portions of the program unaffected by the addition of the new instruction.

In summary, there are several different hardware paradigms employed by the industry for realizing SoCs. These range from costly to produce custom ASICs, to domain specific MP-SoCs. Some solutions deliver the right performance and power at an increased design cost. The other solutions are low cost alternatives but don't provide the required performance and power, since they are modeled along the lines of general-purpose processors. A hardware platform whose performance and power is comparable to those provided by ASIC solutions along with increased programmability in order to support multiple applications from the same domain is the focus of this invention. Such a platform must support an execution paradigm that closely reflects an ASIC implementation.

An improved design must address the following design constraints, in order to overcome the limitations of prior art: • Any solution that needs to extract as much performance as possible from a given application attempts to place closer, communicating entities, and tries to reduce the overhead of communication. This is one of the optimization criteria for the floor planning/placement step during back end processing to generate a die from RTL (for ASIC manufacture). This optimization criteria is also used on FPGAs, when performing placement and routing.

• The use of a non-restrictive high level language as a design entry point is crucial, since it helps reduce the non-recurring engineering costs with respect to design. Further, this high level language needs to be automatically converted into a hardware solution. • ASICs are the most efficient platforms with regard to performance and power efficiency. However, the Non Recurring Engineering (NRE) costs associated with the design of each ASIC makes it prohibitive to design ASICs for every new application and application standard. A generic platform that can come close to performance and power efficiency of an ASIC while still retaining a generic/configurable flavor would be most favorable one.

SUMMARY OF THE INVENTION

It is an object of this invention to provide a System on Chip fabric in which compute, storage and communication resources can be aggregated at runtime to carry out specific application tasks while maximizing power and performance efficiency in a scalable fashion that also ameliorates the NRE costs associated with back end process of design, comprising: (a) a homogeneous structure whose basic units are called Building Blocks satisfying universality and regularity criteria; (b) means to map the modules and functions of the application to a hardware platform, independent of RTL; and (c) scalable means, which support the addition of building blocks to increase the capacity of the fabric.

Homogeneous Structure: The platform must comprise homogeneous building blocks (BBs), similar to an FPGA. The granularity of the BBs, unlike FPGAs¹, must be amenable to structural reorganization (runtime reconfiguration) so that application specific combinational/sequential circuits can be realized. The BBs must satisfy the universality criteria i.e. it must support all possible elementary operations. The universal nature of BBs will help reduce the design and development time for any new application and help support application scalability. Further a regular interconnect connecting the BBs would maintain regularity in their access pattern. The platform must have high fault resilience to support fault-free operation, with the working subset of resources. These characteristics make the job of composition tractable, since there are no additional restrictions placed by hardware (other than the constraints imposed by the application due to computation characteristics).

¹ Theoretically, FPGAs can support ran time reconfiguration, but in practice, the very large configuration data makes it infeasible. RTL independence: The identification of application modules/ftinctionalities and their mapping onto the platform must be independent of RTL, to enable early prototyping and to achieve application synthesis from HLL specification.

Application development in HLL: The platform must support a synthesis methodology by which applications developed hi HLL can be directly realized on the platform.

Scalability: The platform must support hardware scalability Le. increasing capacity/capability of the platform by increasing the number of building blocks, hi this invention, use of run time configurable hardware as a potential method to satisfy the above-cited requirements, is proposed.

It is another object of the present invention to provide a system relating to System on Chip fabric comprising, (a) A scheduler, (b) A cluster configuration store that contains the configuration for all possible clusters (which are defined as partitions that are disjoint sub-graphs of the dataflow graph or application graph) of the application, similar to an instruction store in a traditional architecture. The cluster configuration store cannot be overwritten during the course of execution, (c) an execution fabric containing a plurality of computational resources, referred to as Operation Service Units (OSUs), storage units and switches, which are connected through a regular interconnection wherein an additional overlay network is available to facilitate communication between two resources which are not directly connected by the interconnection, (d) a resource binding agent, which is the logic that maps virtually bound clusters (a group of instructions that have strong producer-consumer relationship) to the execution fabric. The binding determines unoccupied OSUs onto which the operations are mapped, the cluster configuration for the will fire clusters being obtained from the Cluster Configuration Store, (e) a Load Store Unit that handles all memory operations generated by the execution fabric, wherein a Controlled Dataflow paradigm is used wherein the memory is primarily used to store global variables, non-scalar variables and for pointer based manipulations, and (f) Store Destination Decision Logic (SDDL) that is responsible for determining where the output of a given cluster must be written to, wherein if the output data is meant for a cluster for which no input data is yet available then a new line is allocated within the scheduler and if the output data is meant for a cluster for which some of the inputs have arrived, then new data operand is written in the line already allocated to the cluster instance.

It is another object of the present invention to provide a method in which compute, storage and communication resources can be aggregated at runtime to carry out specific application tasks while maximizing power and performance efficiency in a scalable fashion that help reduce the NRE costs associated with back end process of design, comprising the steps of (a) developing High Level Language (HLL) descriptions of applications modules; (b) converting the HLL description of the modules of the application to an intermediate representation in terms of an orthogonal instruction set in Static Single Assignment (SSA) Form, wherein the orthogonal set of instructions is referred to as the Virtual Instruction Set Architecture (VISA), from which dataflow graphs corresponding to these modules are generated and executed on computational structures composed at runtime; (c) Compiling into clusters using the dataflow graph of the application, which closely mimics the flow of signals in hardware wherein the dataflow graph or application graph, is partitioned into disjoint sub-graphs called clusters; (d) Performing binding operations wherein clusters are issued and assigned to hardware; and (e) Performing controlled dataflow execution wherein a set of clusters are scheduled and executed on the fabric.

It is an object of this invention to provide a fabric in which resources can be composed as computational structures that best match the application's needs are proposed. This SoC fabric in which compute, storage and communication resources can be aggregated at runtime to perform specific application tasks, is the focus of this invention. This presents the advantage of provisioning, on demand, the optimal set of resources for every application module in a way that it meets the guaranteed performance level of the application. Further, by adopting dataflow execution paradigm, it is possible to closely relate to hardware execution and hence offer power-performance solution close to that of ASICs while retaining the programmability of processing cores. Traditional compiler infrastructures are not capable of realizing application modules as execution sequence on . composed resources. This invention proposes a methodology in which modules are compiled into a Virtual Instruction Set Architecture (VISA) from which data flow graphs ccoorrrreessppoonnddiinngg ttoo tthheessee mmoodduulleess aarree ggeenneerraatteedd a and executed on computational structures² composed at runtime on the REDEFINE fabric.

BRIEF DESCRIPTION OF DRAWINGS

Fig. 1 shows a self-addressed active storage unit, which determines the next token to be issued to the OSU based on which it can fire.

Fig. 2 shows a regular tessellation formed using equilateral triangles. Fig. 3 shows rectangular and hexagonal tessellations meeting specified constraints.

Fig. 4 shows the system of the present invention.

Fig. 5 shows the method of the present invention.

Fig. 5a shows the step of controlled dataflow execution, in the method of the present invention, in more detail. Fig. 5b shows a cluster that has a fixed number of inputs and outputs.

Fig. 6a describes the templates showing mapping of a monadic operation.

Fig. 6b describes the templates showing mapping of a dyadic operation.

Fig. 7 shows the C language description of matrix vector multiply function.

Fig. 8 shows the Data Dependence graph of a matrix vector multiply kernel. Fig 9 shows the mapping of the data dependence graph onto a hexagonal fabric.

Fig. 10 shows the Data flow graph of a Fibonacci kernel.

Fig.11 shows an unoptimized mapping of the data flow graph on the hexagonal fabric.

Fig.12 shows the optimized Mapping of Fibonacci function on the hexagonal fabric.

² A Computational Structure is the subset of the hardware resources provisioned for execution of a subgraph from the dataflow graph of the application. DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

The fabric proposed in the present invention is also referred to as the REDEFINE fabric is a regular interconnection of resources comprising computation units, storage units and switches. In this section we examine various choices of resources and their interconnects that are suited for runtime composition of computational structures on the fabric.

Computation Units

RTL independence is achieved by choosing an appropriate granularity of computation units. The choices of granularity of computation units are:

• Function Units (FUs): Each FU is capable of running a single operation. Examples of typical operations include add, multiply, AND, OR, NOT etc. In this case, the granularity of the application graph is in terms of these primitive operations.

• Arithmetic Logic Units (ALUs): Unlike FUs, each ALU is capable of executing several operations. The use of ALUs instead of FUs makes the process of mapping the application graph onto the fabric simpler, since each ALU supports more than one operation.

The choice between FUs and ALUs is a tradeoff between higher utilization per unit area of silicon versus ease of composition. A FU has only the necessary and sufficient logic required to execute a particular operation. The use of FUs in a fabric increases utilization, since it is not overloaded with logic to executed different operations. However, identification of subgraphs of the application graph to match the level of granularity imposed by the FUs and their interconnection is complex. ALUs on the other hand have the logic for generic computations and hence make the problem of identification and mapping of subgraphs simpler. The choice of computation unit is dependent on the domain of applications that the fabric targets, parameters of optimization (viz. power, utilization) etc. We refer to the chosen computation unit as Operation Service Unit (OSU). Storage Units

The storage units serve as placeholders for the input data, the control information for the FU and the intermediate results from OSUs. Any traditional control driven computational paradigm can be supported with a simple passive storage. In order to support dataflow execution paradigm, a distributed token matching unit is maintained in the storage units, necessitating active storage units. Active storage units are small SRAMs/Register Files. Each line accommodates the operands and predicates of an operation. A bitmap, called operand availability bitmap, maintains which operations have all inputs ready. All operations whose inputs are ready "can fire". One among the "Can Fire" operations, called the "Will Fire" operation, is chosen for execution on the OSU. The choice of "Will Fire" operations is made by the priority encoder. Each line may additionally contain the control information for the OSU (viz. opcode in case ALUs are used, destination storage unit, tag for output generated). The storage unit serves as a wait and match unit of the dataflow architecture.

Fig. 1 shows a self-addressed active storage unit, which determines the next operation to be issued to the OSU based on which it can fire. The self-addressed storage unit is used inside the execution fabric for scheduling operations on the OSU (42 in Fig. 4). The operand availability bitmap 1 is connected to a priority encoder 2, which in turn is connected to a row decoder 3 to compute the exact row in which the data 4 is available. 4 is capable of holding all the inputs for an operation. For each operand present in 4, there is a bit in 1. This bit indicates whether the operand has arrived or not. Once all the bits are set to 1, a input along that line to 2 (priority encoder) is held high. 2 selects one line, which is passed as input to 3 (row decoder). 3 enables this line, which causes the operands present in that line to be read out. Reading out a line is equivalent to choosing a cluster for execution. Since data flow semantics are followed, an operation can be chosen for execution only when all its inputs are available. Interconnection network

A regular interconnection network is defined in terms of switches such that data communication can be effected among OSUs and storage units and also serve as forwarding agents for data. Of the several possible regular interconnection networks, we consider only those which can yield a planar interconnection of switches to enable easy realization on VLSI.

Regular fabric is a tessellation of a single unit (or the geometric transformations of this single unit viz. its rotation, its reflection) repeated in 2D space³. Fig 2 is tessellation formed using equilateral triangles.

Not all types of polygons can be used to form tessellations. For unbounded repetition, the following property must be satisfied:

p,q <≡ I n

I is the set of integers and ⁱ⁼¹ represents the sum of all internal angles of the polygon. Only three regular polygons triangle, square and hexagon satisfy this property. Fig. 2 shows a regular tessellation formed using equilateral triangles 10, 11. Tessellations using squares and hexagons are shown in Fig. 3 wherein the corners 21a and 21b represent an OSU and a storage unit, respectively. A pentagon or any other polygon does not satisfy this property.

Topology of Regular Fabric

The fabric needs to contain a good mix of OSU and Storage Units placed optimally and interconnected in a way that mimics execution flow in hardware. The interconnection between OSUs and storage units is akin to a graph having edges between nodes, where

³3D space is not considered, since 2D space offers lesser complexity with respect to wiring in VLSI circuits. nodes represent either OSUs or storage units. There can only be three kinds of edges in this graph, namely type 1: between an OSU and Storage type 2: between two OSUs type 3: between two Storage Units (type 3).

Of these three types, type 2 and type 3 edges are not desired because

• Two storage units placed adjacent to each other is equivalent to placing one large storage unit.

• Two OSUs placed adjacent to each other will increase the complexity of the FSM of the combined unit. Two OSUs placed adjacent to each other is tantamount to having two synchronous OSUs, which may have to be maintained as a 2-stage pipeline or a vector unit. In either case, a storage unit separating the two OSUs is imminent.

It can be shown that any regular structure with only type 1 edges is bipartite. A triangular tessellation is not bipartite, whereas both square and hexagonal structures are, as shown in Fig 3. The square tessellation i.e. mesh structure, is more prevalent in designs. We however do not rule of the possibility of using hexagonal tessellations since the hexagon has not only lower degree per node, but also has a lower wiring overhead when compared to a mesh. The hexagon and square tessellations are chosen based on the nature of the application and its communication characteristics. Triangular tessellation is not bipartite; however, these networks can be employed with appropriately designed OSU, which has storage units integrated into it.

Application Synthesis and Execution Orchestration

In the present invention, an execution paradigm in which the application expressed in a

HLL is compiled into computational structures that directly execute on the fabric as either combinational/sequential circuits. The details of the various steps of the execution orchestration are given below:

Architecture independent Intermediate Form: The HLL description is translated into an intermediate representation (ER.) with an orthogonal instruction set in Static Single Assignment (SSA) Form. The orthogonal set of instructions is referred to as the Virtual Instruction Set Architecture (VISA). The use of orthogonal instruction set makes the dataflow graph architecture independent (i.e. ISA independent)

Compilation into Clusters: The Dataflow Graph of the application closely mimics the flow of signals in hardware. The dataflow graph, referred to henceforth as application graph, is partitioned into disjoint subgraphs called clusters, hi the initial state, every node of the dataflow graph may be considered as an independent cluster. The criteria for merging two clusters to obtain bigger aggregations are: o Two communicating clusters are candidates for merger into a single cluster o The number of nodes in a cluster may not exceed a pre-determined threshold. o The number of inputs and outputs to the cluster may not exceed a pre- determined threshold o Two communicating nodes that are separated by several levels⁴ in the dataflow graph cannot be merged if the absolute difference between their levels exceeds a certain threshold. o Two clusters cannot be merged, if there exists more than certain predefined maximum number of nodes guarded by complementary predicates.

There is a strong producer consumer relationship within a cluster. The communication within the cluster is maximized and the communication across clusters is minimized. The clustering algorithm is independent of the high level language syntaxes viz. loops as subgraphs, functions as subgraphs. The application graph is compiled into a set of clusters. The optimal computational structure for the cluster is determined based on the interactions of nodes within the cluster. The communication patterns between instructions are mapped onto the fabric to compose computational structures at runtime. It casts the communication pattern of the application as a subset of the communication capability available in the fabric. This mimics the construction of a circuit at a grosser level. The composition of the cluster

* Levels are assigned by performing a topological sort and its mapping on the fabric are determined at compile time. The actual binding of clusters to resources on fabric takes place at runtime.

Controlled Dataflow Execution: The Controlled Dataflow Execution Paradigm is used to schedule and execute clusters of operations onto REDEFINE. Even though the overall schedule of the clusters is guided by the application, different execution paradigms use different scheduling strategies in order to maximize resource utilization and maximize instruction throughput, hence application performance, hi Controlled Dataflow, the scheduling of clusters is based on a Dataflow schedule akin to the scheduling of instructions in a traditional dataflow machine. The cluster is treated as a "hyper" operation with multiple inputs and outputs. A scheduler identifies the clusters to be launched for execution. The Scheduler identifies as many clusters as possible for scheduling to maximally utilize resources on the fabric. Depending on the availability of resources on the fabric, a subset of clusters ready for execution is selected. Dataflow execution semantics identifies clusters, which "Can Fire". The scheduler determines which of these clusters "Will Fire". The primary difference over a traditional Dataflow architecture is the use of hierarchical scheduling. At the higher level clusters are scheduled using dataflow semantics. Once a cluster is chosen for being scheduled, the instructions contained in a cluster are also chosen using a dataflow schedule. The cluster scheduling follows the dynamic dataflow scheme allowing multiple cluster instances to execute simultaneously, while at the instruction level a static dataflow scheme is used. The use of static dataflow scheduling at the level of instructions, does not have the disadvantages of the traditional static dataflow machine. Static dataflow machines cannot support execution of multiple instances of the same instruction simultaneously, which prevents them from exploiting inter-iteration parallelism that may exist in a loop. The use of hierarchical scheduling with dynamic dataflow based cluster scheduling, helps keep the number of operands from exploding during high ILP periods. The use of clusters also helps in reducing the communication overheads. This is because a data produced and consumed within a cluster has no visibility outside the cluster instance. Due to reduced number of "visible" outputs (at cluster level), complexity of the Wait Match Unit (used in traditional dataflow architectures) is reduced. Only data writes to global load/stores and writes to other clusters are made visible. The entire cluster executes as an atomic operation.

Issue logic: The "Will Fire" clusters are issued to the fabric by the issue logic. In order to issue a cluster, the issue logic needs to identify the resources on the fabric where the given cluster will be mapped. Once the resources on the fabric are identified, the issue logic does the following: o Writes the cluster inputs into the identified storage locations. o Writes the configuration information (viz. opcode) for the OSU into the related storage unit. The process of issuing clusters and its assignment to hardware is called binding. The decision to bind clusters at runtime increases the complexity of the hardware (when compared to a VLIW architecture), but the hardware is no longer as complex as the instruction execution pipeline in a Superscalar processor. On the other hand, it helps increasing utilization, by allocating free OSUs to clusters that can potentially run in parallel. Further, in our execution paradigm, we do not have to maintain an execution pipeline, and hence save on the cycles expended on decode and writeback stages. Unlike in VLIW schedules where NOPs must be introduced if a functional unit cannot be used, we ensure that all functional units are maximally used without having to resolve structural hazards at runtime.

Design of a processor for the Controlled Data Flow paradigm

The high-level block diagram of the REDEFINE platform is shown in Figure 4. 41, 42, 43 and 49 form a part of the issue logic described previously. 45 is built of computation units and storage units interconnected through a regular network (as described above).

As shown in the figure 4, the various blocks comprising REDEFINE are

• Scheduler (42): The Scheduler is responsible for scheduling clusters. The scheduler determines which clusters "Can Fire" and which cluster "Will Fire". Any cluster that has all inputs available "Can Fire". In order to choose one cluster to fire, priority is used. The compiler infrastructure suggests a priority of a cluster, based on whether cluster appears on the critical path. Several other factors are also considered for determining the priority. The design of the scheduler is very similar to the storage unit described previously (Figure 1). Unlike the operation storage unit, it takes in more operands. Hence, more operands slots and operand availability bits are required. • Cluster Configuration Store (41): The cluster configuration store contains the configuration for all possible clusters of the application. This is similar to an instruction store in a traditional architecture. This region cannot be overwritten during the course of execution. The clusters as described previously are generated by the compiler. • Execution Fabric (45): The execution fabric contains OSU and storage units connected through a regular interconnection. An additional overlay network is available to facilitate communication between two resources, which are not directly connected by the interconnection.

• " Resource Binding Agent (43): This logic maps the virtually bound clusters to the execution fabric. The binding determines unoccupied OSUs onto which the operations are mapped. The cluster configuration for the "Will Fire" cluster is obtained from the Cluster Configuration Store.

• Load Store Unit (46): The load store unit handles all memory operations generated by the execution fabric. In the Controlled Dataflow paradigm the memory is primarily used to store global variables, non-scalar variables and for performing pointer based memory operations.

• Store Destination Decision Logic (SDDL; 49): The SDDL is responsible for determining where the output of a given cluster must be written to. There are two cases to be considered. If the output data is meant for a cluster for which no input data is yet available then a new line is allocated within the scheduler. If the output data is meant for a cluster for which some inputs are already available, then this data is also stored along with the already available inputs.

42 selects a cluster for execution. 42 provides the cluster number to be scheduled to 41 and 43. 41 contains the instructions included in a cluster. It also includes the resource requirement specification for the cluster along with the mapping of cluster input data to instruction input data. Resource requirement specification is a kxk matrix that indicates how many OSUs is needed for the execution of the cluster and their positions in the fabric where they are needed. 42 also provides the input operands for the cluster instance, which is chosen for execution. 41 supplies the resource requirement specification. 43 has data structures, which indicates the regions of 45 that are not being used. 43 matches the cluster requirement specification with the available resources and tries to find a match. If the resource requirement can be satisfied then the instructions are mapped on to the respective OSUs. 45 is a collection of OSUs, storage units and switches which are interconnected together in a predetermined manner. Once the region within 45 is identified, the cluster input is mapped to the instruction operands, and they are forwarded to the appropriate OSUs. On 45, the OSUs execute all the instructions that have been mapped to it. Once the OSU completes execution of all the instructions, a message (44) is sent back to 43 indicating that the OSU is free. The results (48) of computation that need to be sent to other cluster instances are relayed to 49. 49 looks at the destination cluster identifier (which is compiler generated number associated with a cluster). Several instances of that cluster may be created during the execution time of that program. 49 determines the right cluster instance for which the data is destined to and then forwards the data to the right location within 42.

The method of the present invention is depicted in Fig. 5 wherein compute, storage and communication resources can be aggregated at runtime to carry out specific application tasks for maximizing power and performance efficiency in a scalable fashion. This one time design followed by subsequent runtime aggregation ameliorates the NRE costs associated with back end process of designing an ASIC. This comprises translating the High Level Language (HLL) applications 152 for SoC platforms to an intermediate representation 153. The intermediate form is an orthogonal instruction set in Static Single Assignment (SSA) Form 154, wherein the orthogonal set of instructions forms the Virtual Instruction Set Architecture (VISA) 154. The SSA VISA form is then converted into dataflow graphs 155. The dataflow graphs 156 obtained closely mimics the flow of signals in hardware. The data flow graphs are then compiled into clusters 157. Clusters include closely interacting group of nodes of the dataflow graphs. Several such disjoint sub graphs are grouped to form clusters. Each cluster has certain number of inputs and outputs 180. The clustered data flow graph 158 is then translated into an executable 159 that can be executed on the computational structures composed at runtime. The executable 160 is now available for execution.

The executable thus created is executed using controlled dataflow execution paradigm. In this paradigm, clusters ready for execution are first identified 170. Such clusters are called Can Fire clusters 171. Among these Can Fire clusters, one cluster is chosen for execution 172. This cluster is called the Will Fire Cluster 173. The issue logic checks for availability of resources as specified in the resource requirements for this cluster 174. When the resources are available, the cluster configuration data is transferred to the storage units identified for the execution of this cluster 175. The control information too is transferred 176. The execution of the operations in the cluster ensues 1771

EXAMPLES

In this section, we present examples of how applications are mapped onto the fabric (corresponding to a hexagonal tesselation) and we also present a cycle count comparison between the execution time, for some simple examples, on REDEFINE and HPL-PD architecture. The following are the steps involved in mapping the application: 1. Derive the application graph: The HLL description of the application is passed through LLVM [2] to obtain the SSA representation of the application. This is then converted into a dataflow graph. LLVM supports a virtual instruction set architecture (VISA) composed of 28 orthogonal instructions.

2. Mapping the data flow graph: The dataflow graph contains only dyadic and monadic operations. This can be easily mapped to the hexagonal structure as shown in the templates in Figure 6. Figure A is a mapping of a monadic operation on to the fabric. 60 is the producer and 61 is the consumer. Figure B is a mapping of a dyadic operation on to the fabric. 62 and 63 are producers. 64 is the consumer. Matrix Vector Multiply kernel

Figure 7 shows the kernel of the matrix vector multiply function. This code is passed through LLVM with optimization level set to 3. The resulting bytecode is then hand coded into a dataflow graph (refer Figure 8). Figure 8 shows the data dependence graph of a matrix vector multiplication kernel. Nodes 80 and 81 are multiplies whose result gets added in 82 and the sum of operations 82 and 83 is done at 84. This dataflow graph is then mapped onto the hexagonal fabric as shown in Figure 9. The figure 9 shows the mapping of the data dependence graph on a fabric (as in 45) with hexagonal interconnection network. Node 80 maps to 90, 81 to 91, 82 to 92, 83 to 93 and 84 to 94. The mapping is achieved using the templates shown in Figure 6 and 6 A. int i, j; for ( i = 0; i < n; i++)

{ temp = 0; for

; r[ i] = temp; }

Figure 7: The C language description of matrix vector multiply function.

Fibonacci Sequence

Figure 10 shows the dataflow graph for iterative Fibonacci sequence generator. The basic blocks are marked by dotted ellipse encircling the operation nodes. The basic blocks shown in this example are 110, 111, 112 and 113. The mapping of the basic blocks 111 and 112, to the execution fabric with a hexagonal interconnection is shown in Figure 11. The nodes 120-126 indicate the unoptimized mapping of the data dependence on to a fabric with hexagonal interconnection is shown in Figure 10. The mapping of the dataflow graph (refer Figure 10) onto the hexagonal fabric is shown in Figure 11. It is important to note in Figure 11, there are non-neighbor producer consumers. Presence of these requires special communication mechanisms to transfer data from one portion of the fabric to another. In order to achieve this, the data is routed through the hexagonal fabric. These communications are called Long Latency accesses. In Figure 12 the optimized mapping of the same data, flow graph is depicted. The graph forming the nodes 130-134 shows the optimized mapping of the same shown in Figure 12. REFERENCES

[1] Alberto Vincentelli, "Reasoning about the Trends and Challenges of Engineering

Design Automation" Plenary Session, 20th International Conference on VLSI

Design, VLSI 2007, January 2007. [2] Chris Lattner and Vikram Adve, "LLVM: A Compilation Framework for Lifelong

Program Analysis & Transformation", in the Proceedings of the 2004

International Symposium on Code Generation and Optimization (CGO '04),

March 2004.

[3] Engin Ipek, Meyrem Kirman, Nevin Kirman and Jose F. Martinez, "Core Fusion: Accommodating Software Diversity in Chip Multiprocessors", to appear in the

Proceedings of the 2007 International Symposium on Computer Architecture

(ISCA-07), June 2007. [4] Geoffrey F. Burns et al., "Enabling Software-Programmable Multi-Core Systems- on-Chip for Consumer Applications", http://www.siliconhive.com/uploads/GSPx2005 Paperl.pdf

[5] Henk Corporaal, "Transport Triggered Architecture", PhD. Thesis, TU Delft,

Netherlands. [6] International Technology Roadmap for Semiconductors (ITRS), 2006 Update,

"Design", pp 3-5, 2006. [7] IPFlex, "DAPDNA Architecture", www.ipflex.com

[8] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger, S. Keckler, and C. Moore, "Exploiting JXP, TLP, and DLP with the polymorphous trips architecture," in the Proceedings of the International Symposium on Computer

Architecture (ISCA-03), June 2003. [9] Poseidon Systems Inc, www.poseidon-svstems.com

[10] Stamatis Vassiliadis, Stephan Wong, Georgi Gaydadjiev, Koen Bertels,

Georgi Kuzmanov, and Elena Moscu Panainte, "The MOLEN Polymorphic

Processor", in the IEEE Transactions on Computers, VoI 53, No. 11, pp 1363-

1375, November 2004. [11] Steven Swanson, Ken Michelson and Mark Oskin, "WaveScalar",

Technical Report UW-CSE-03-01-01, Deptartment of Computer Science and Engineering, University of Washington, Jan 2003.

[12] Steven Swanson, Ken Michelson, Andrew Schwerin and Mark Oskin, "Wavescalar", in the Proceedings of 36th International Symposium on

Microarchitecture, MICRO-03, Dec 2003.

[13] Stretch Inc., "The S6000 Family of Processors", http://www.stretchinc.com/ files/sόArcMtectureOverview.pdf. [14] Synfora Inc., "PICO Technology White Paper", www.svnfora.com [15] Vinod Kathail, Michael S. Schlansker, B. Ramakrishna Rau, "HPL-PD

Architecture Specification: Version 1.1", http://www.trrmaran.org/docs/hpl-pd.pdf.

Claims

1. A System on Chip fabric in which compute, storage and communication resources can be aggregated at runtime to carry out specific application tasks while maximizing power and performance efficiency in a scalable fashion that also ameliorates the NRE costs associated with back end process of design, comprising: a. A homogeneous structure whose basic units are called Building Blocks satisfying universality and regularity criteria; b. Means to map the modules and functions to a hardware platform, independent of RTL; and c. Scalable means, which support the addition of building blocks to increase the capacity of the fabric.

2. A fabric of claim 1 wherein the homogenous structure composed of building blocks includes computation units called Operation Service Units comprising either of: a. Function Units, capable of executing a single operation; or b. Arithmetic Logic Units, capable of executing a combination of operations.

3. A fabric of claim 2 wherein the operations executed by the function and arithmetic logic units include: a. Boolean functions, such as AND, OR, NOT; and b. Arithmetic functions, such as ADD, MULTIPLY.

4. A fabric of claim 1 wherein the storage resources are comprised of units that serve as placeholders for input data, control information for the functional units and the intermediate results from the operation service units, turning into active storage units by means of having an associated token matching unit.

5. A fabric of claim 4 wherein the active storage units are small static random access memories that store one or more lines of operands associated with an instruction.

6. A fabric of claim 5 wherein each line includes the operand and the predicates of an operation.

7. A fabric of claim 5 wherein each line includes control information for the operation service units.

8. A fabric of claim 5 wherein a bitmap, called an operand availability bitmap, maintains information about which instructions (associated with tags) have all their inputs ready wherein: a. Operations whose inputs are ready have a status of can fir ; and b. Operations which have been issued have a status of will fire, as determined by the priority encoder.

9. A fabric of claim 1 wherein the communication resources are comprised of interconnection networks that are constructed using switches such that: a. Data communication is effected amongst the Operation Service Units and the storage units; and[ki] b. Data can be forwarded.

10. A fabric of claim 9 wherein the interconnection networks yield a planar interconnection of switches to enable easy realization on VLSI.

11. A fabric of claim 9 wherein the interconnection networks are a tessellation of a single unit (or the geometric transformations of this single unit viz. its rotation, its reflection) repeated in 2D space.

12. A fabric of claim 1 wherein the Operation Service Units and the storage units are optimally located at switching ports to cast a communication flow that rnimics execution in hardware.

13. A fabric of claim 9 wherein the interconnection networks are tessellations with planar realizations including triangular, square and hexagonal tessellations.

14. A system relating to System on Chip fabric in which compute, storage and communication resources can be aggregated at runtime to carry out specific application tasks while maximizing power and performance efficiency in a scalable fashion that also ameliorates the NElE costs associated with back end process of design, comprising: a. A Scheduler that is responsible for scheduling clusters by determining which clusters can fire and which cluster will fire, wherein any cluster that has all inputs available can fire and the firing of clusters uses a certain priority for a cluster, suggested by the compiler infrastructure, based on whether cluster appears on the critical path, besides considering other criteria; b. A Cluster Configuration Store that contains the configuration for all clusters of the application, similar to an instruction store in a traditional architecture, which is a region that cannot be overwritten during the course of execution; c. An Execution Fabric containing a plurality of OSUs, storage units and switches, which are connected through a regular interconnection wherein an additional overlay network is available to facilitate communication between two resources which are not directly connected by the interconnection; d. A Resource Binding Agent, which is the logic that maps the clusters to the execution fabric wherein the binding determines unoccupied OSUs onto which the operations are mapped, the cluster configuration for the will fire clusters being obtained from the Cluster Configuration Store; e. A Load Store Unit that handles all memory operations generated by the execution fabric, wherein a Controlled Dataflow paradigm is used wherein the memory is primarily used to store global variables, non-scalar variables and for pointer based manipulations; f. Store Destination Decision Logic (SDDL) that is responsible for determining where the output of a given cluster must be written to wherein: i. If the output data is meant for a cluster for which no input data is yet available then a new line is allocated within the scheduler; ϋ. If the output data is meant for a cluster for which some of the inputs have arrived, then new data operand is written in the line already allocated to the cluster instance; and

15. A system of claim 14 wherein the criteria for the scheduler includes throughput requirements and resource utilization considerations.

16. A method in which compute, storage and communication resources can be aggregated at runtime to carry out specific application tasks while maximizing power and performance efficiency in a scalable fashion that also ameliorates the NRE costs associated with back end process of design, comprising the steps of: a. Developing High Level Language (HLL) applications for SoC platforms comprised of one or more modules; b. Converting the HLL description of the modules of the application to an intermediate representation that is an orthogonal instruction set in Static Single Assignment (SSA) Form, wherein the orthogonal set of instructions is referred to as the Virtual Instruction Set Architecture (VISA), from which dataflow graphs corresponding to these modules are generated and executed on computational structures composed at runtime; c. Compiling into clusters using the dataflow graph of the application, which closely mimics the flow of signals in hardware wherein the dataflow graph or application graph, is partitioned into disjoint subgraphs called clusters; d. Performing controlled dataflow execution wherein a set of clusters are scheduled and executed on the fabric; and e. Performing binding operations wherein clusters are issued and assigned to hardware.

17. A method of claim 16 wherein the step of performing controlled dataflow execution and binding operations comprises the steps of: a. Selecting a one or more clusters to be launched for execution, wherein the selection is optimized to maximize utilization of the resources by performing hierarchical scheduling to determine which clusters can fire and which clusters will fire; b. Using issue logic to check for the availability of resources to execute the clusters and further: i. Writing the cluster inputs into identified storage locations; and ii. Writing the configuration information for the OSU into the related storage unit.