US20130290919A1 - Selective execution for partitioned parallel simulations


Info

Publication number
US20130290919A1
Authority
US
United States
Prior art keywords
simulation
sub
graphs
graph
design
Prior art date
Legal status
Abandoned
Application number
US13/646,669
Inventor
Ramesh Narayanaswamy
Paraminder S. Sahai
Chiahon Chien
Current Assignee
Synopsys Inc
Original Assignee
Synopsys Inc
Priority date
Filing date
Publication date
Application filed by Synopsys Inc filed Critical Synopsys Inc
Priority to US 13/646,669
Assigned to SYNOPSYS, INC. Assignors: CHIEN, CHIAHON; NARAYANASWAMY, RAMESH; SAHAI, PARAMINDER S
Publication of US 2013/0290919 A1
Priority to US 15/457,507 (US 10,853,544 B2)


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 - Computer-aided design [CAD]
    • G06F 30/30 - Circuit design
    • G06F 30/32 - Circuit design at the digital level
    • G06F 30/33 - Design verification, e.g. functional simulation or model checking
    • G06F 30/3323 - Design verification, e.g. functional simulation or model checking using formal methods, e.g. equivalence checking or property checking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 - Computer-aided design [CAD]
    • G06F 30/20 - Design optimisation, verification or simulation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 - Computer-aided design [CAD]
    • G06F 30/30 - Circuit design
    • G06F 30/36 - Circuit design at the analogue level
    • G06F 30/367 - Design verification, e.g. using simulation, simulation program with integrated circuit emphasis [SPICE], direct methods or relaxation methods

Definitions

  • This application relates generally to semiconductor circuit simulation and more particularly to selective execution of partitioned parallel simulation.
  • Modern electronic system designs contain numerous components, including digital, analog, and high-frequency components, all of which can be problematic to design.
  • The design process may comprise top-down decomposition and/or bottom-up assembly. Feature sizes of the components making up the electronic systems are now routinely smaller than the wavelength of visible light.
  • Logic systems are routinely constructed from tens or even hundreds of millions of transistors. System designers are required to balance system performance, physical size, architectural complexity, power consumption, heat dissipation, fabrication complexity, and cost, to name only a few. Each of these design decisions has a profound impact on the resulting design.
  • Developers create specifications around which to design their systems. The specifications attempt to balance the many disparate demands being made of the logic system and to rein in what can easily become runaway design complexity.
  • Developers typically describe such systems using a hardware description language (HDL) such as VHDL.
  • the high-level HDL is easier for developers to comprehend, especially for a vast system, and may describe highly complex concepts that are difficult to grasp using a lower level of abstraction.
  • the HDL description may be converted into other levels of abstraction as is helpful to the developers. For example, a high-level description may be converted to a logic-level register transfer level (RTL) description, a gate-level (GL) description, a layout-level description, or a mask-level description.
  • Each lower abstraction level introduces more detail into the design description.
  • The lower levels of abstraction may be generated automatically by computer, derived from a design library, or created by another design automation technique. Therefore, it is critical to ensure that the performance of the resulting lower-level designs still matches the requirements of the system's specification.
  • the process of comparing a system design to a design specification (or one level of abstraction to another) can be called verification. Verification of modern electronic systems can ensure that a device under test (DUT) is simulated so that the behavior of the DUT is shown to match a system specification for the electronic design.
  • a computer-implemented method for design simulation comprising: obtaining a high-level design for simulation; determining a graph representation for the high-level design; partitioning the graph representation into sub-graphs; selecting a subset of the sub-graphs for simulation based on input-change bits; and evaluating the subset of the sub-graphs to produce a simulation result for the high-level design.
  • the method may further comprise propagating the simulation result, based on the evaluating of the subset of the sub-graphs, to a remainder of the graph representation for further simulation.
  • the subset of the sub-graphs for simulation may be selected based on an input-change bit, for each sub-graph in the subset of the sub-graphs, being set to true.
  • the partitioning may further comprise determining sub-graphs based on levels of logic.
  • the partitioning into sub-graphs may be based on reducing a number of signals crossing sub-graph boundaries.
  • the method may further comprise using one or more of the input-change bits on a level as part of the selecting of the subset for simulation.
  • the evaluating may be based on an oblivious simulation model.
  • the method may further comprise allocating processes from the oblivious simulation model to a plurality of processors.
  • the method may further comprise selectively evaluating the processes based on an input change bit set being set to valid.
  • the graph representation may include a control data flow graph.
  • the control data flow graph may include a graph of a combinational region of the high-level design and a state region of the high-level design.
  • the partitioning may include creating value locality.
  • the partitioning may include creating event change locality.
  • the partitioning may include balancing of levels.
  • the partitioning may include collecting of readers of a simulation value.
  • the partitioning may include separating primitives evenly across clusters within a level.
  • the partitioning may include clustering sibling primitives.
  • the method may further comprise modifying clock gating.
  • the modifying clock gating may include moving gating to storage elements.
  • the modifying clock gating may include eliminating a clock gate to a combinational logic portion.
  • the modifying clock gating may be only for simulation purposes.
  • the clock gating may be restructured to combine phases and to occur on an active edge of clock.
  • the method may further comprise determining that an output of one of the sub-graphs has a change of state and copying that change of state to a processor where a process for a second sub-graph uses that change of state as input to the second sub-graph.
  • the method may further comprise copying a sequence of changes of state for that output of one of the sub-graphs and using the sequence as a series of inputs to the second sub-graph.
  • the method may further comprise copying the high-level design and simulating the high-level design as well as its copy on at least two different processors.
  • the copying may be performed to accomplish the simulating.
  • the method may further comprise maintaining primitives with the same function in a single cluster within the copy of the high-level design.
  • a computer system for design simulation may comprise: a memory which stores instructions; one or more processors coupled to the memory wherein the one or more processors are configured to: obtain a high-level design for simulation; determine a graph representation for the high-level design; partition the graph representation into sub-graphs; select a subset of the sub-graphs for simulation based on input-change bits; and evaluate the subset of the sub-graphs to produce a simulation result for the high-level design.
  • a computer program product embodied in a non-transitory computer readable medium for design simulation may comprise: code for obtaining a high-level design for simulation; code for determining a graph representation for the high-level design; code for partitioning the graph representation into sub-graphs; code for selecting a subset of the sub-graphs for simulation based on input-change bits; and code for evaluating the subset of the sub-graphs to produce a simulation result for the high-level design
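  • As a concrete illustration of the claimed flow (obtain a design, determine a graph, partition into sub-graphs, select a subset using input-change bits, evaluate), the sketch below walks a toy two-primitive design through one selective-evaluation pass. It is a minimal sketch under assumed data structures; the names (Primitive, levelize, simulate) and the one-sub-graph-per-level partitioning are illustrative choices, not the patent's implementation.

```python
# Minimal sketch of the claimed flow under assumed data structures: obtain a
# design -> determine a graph -> partition into sub-graphs (here, one per
# level) -> select sub-graphs whose input-change bit is set -> evaluate them
# and propagate results.  Names are illustrative, not the patent's API.
from collections import defaultdict

class Primitive:
    def __init__(self, name, op, inputs):
        self.name, self.op, self.inputs = name, op, inputs  # inputs: signal names

def levelize(prims):
    """Assign each primitive a level: 1 + the deepest driving primitive."""
    driver = {p.name: p for p in prims}
    level = {}
    def lvl(p):
        if p.name not in level:
            depths = [lvl(driver[s]) for s in p.inputs if s in driver]
            level[p.name] = 1 + max(depths, default=0)
        return level[p.name]
    for p in prims:
        lvl(p)
    return level

def simulate(prims, values, changed_signals):
    level = levelize(prims)
    sub_graphs = defaultdict(list)               # partition: one sub-graph per level
    for p in prims:
        sub_graphs[level[p.name]].append(p)
    change_bit = defaultdict(bool)               # one input-change bit per sub-graph
    for p in prims:
        if any(s in changed_signals for s in p.inputs):
            change_bit[level[p.name]] = True
    for lv in sorted(sub_graphs):
        if not change_bit[lv]:
            continue                             # selection: skip unchanged sub-graphs
        for p in sub_graphs[lv]:
            old = values.get(p.name)
            values[p.name] = p.op(*(values[s] for s in p.inputs))
            if values[p.name] != old:            # propagate to downstream sub-graphs
                for q in prims:
                    if p.name in q.inputs:
                        change_bit[level[q.name]] = True
    return values

# Toy design: OUT = (A & B) | C.  Only B changed, and since A & B is unchanged,
# the second sub-graph is never evaluated.
prims = [Primitive("n1", lambda a, b: a & b, ["A", "B"]),
         Primitive("OUT", lambda x, c: x | c, ["n1", "C"])]
values = {"A": 1, "B": 0, "C": 0, "n1": 0, "OUT": 0}
print(simulate(prims, values, changed_signals={"B"}))
```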
  • FIG. 1 is a flow diagram for partitioned simulation.
  • FIG. 2 is a flow diagram for partitioning.
  • FIG. 3 is a flow diagram for sub-graph usage.
  • FIG. 4 is an example logic block diagram.
  • FIG. 5 is an example flow graph of the logic design.
  • FIG. 6 is an example flow graph showing example partitions.
  • FIG. 7 is an example levelized hypergraph.
  • FIG. 8 is an example logic block for selective evaluation.
  • FIG. 9 is an example logic combination to phase reduction.
  • FIG. 10 is a system diagram for partitioned simulation.
  • Event driven simulation, as opposed to oblivious simulation, can maintain a time-ordered queue of waiting processes. Events are added to the list at various times, and are typically processed in the order in which they arrive. Processing an event may generate more events or alter the order of the list of pending events. The components of the electronic system which experience value changes (i.e. the one or more inputs that have changed) are added to the queue of pending processes. Thus, simulation computation is limited to design components which undergo input changes.
  • oblivious simulation evaluates all components of the design regardless of whether or not a given component has experienced a value change. Changes to input values are not monitored, and no queue insertion is performed. However, in this type of simulation, a computation of every component is performed for every simulation process; this often leads to redundant and unnecessary computations. Thus, the choice of a simulation approach and how that approach is implemented is critical to minimizing simulation time while maximizing effectiveness.
  • Parallel event driven simulation employs multiple time-ordered queues and processes, and assigns sections of the design to a time-ordered queue running on a processor. Since the time-ordered queues must be chronologically synchronized, this approach is often limited to a few processors. It does not scale to large numbers of processors because synchronization costs are prohibitively high and processor work assignments are uneven, thus saturating one processor with tasks while others remain idle.
  • parallel oblivious simulation simplifies queue synchronization because only a single synchronization is required per level of the simulation model. Parallel oblivious simulation does perform redundant (i.e. unnecessary) operations. However, in a simulation with many value changes per design clock cycle or simulation cycle, parallel oblivious simulation is more efficient than parallel event driven simulation because of lower computational and synchronization demands.
  • the oblivious simulation model presents a static simulation view which both allows for efficient partitioning under multiple constraints and efficient allocation of tasks to clusters of processors.
  • the oblivious model may be instrumented for selective execution of processes, thus reducing computational overhead by removing from a simulation cycle static portions of the model which do not change in a given cycle.
  • a control data flow graph (CDFG) representing the description of the electronic system can be folded into a graph of the combinational and state regions. The CDFG may then be partitioned statically in order to create a locality of read and write values, balance computations by level, determine event change locality, and calculate the number of unique node (primitive) type reductions per partition.
  • Such partitioning is accomplished by collecting most or all readers of a simulation value into the same cluster, partitioning primitives evenly across clusters, clustering “sibling” primitive instances together, limiting the number of unique primitives generated, and clustering for a small number of unique primitives.
  • the statically partitioned CDFG can be instrumented with a single input change bit per cluster of nodes in a given level. Thus, such selective evaluation can skip the evaluation of a cluster for which the change bit is false.
  • a collection of change bits is gathered in a bit set that is both updated and traversed in parallel by allocating each bit to a processing thread.
  • A control data flow graph (CDFG) may be determined for an RTL representation of a high-level design and partitioning performed. Each partitioning constraint is satisfied by collecting most or all readers of a particular simulation value into the same cluster. This reduces the number of necessary writes by the result writer, as well as evenly partitioning primitives in a level across clusters.
  • Sibling primitive instances from multiple instances of large user blocks, e.g., four program counter (PC) incrementer primitives from a four-CPU-core model, may be clustered together.
  • Selective evaluation can then evaluate a cluster of nodes if the input change bit is true and bypass evaluation of a cluster of nodes if the input change bit is false.
  • selective evaluation minimizes unnecessary evaluation by skipping the evaluation of clusters for which the input change bit is false.
  • the use of an input change bit set per level can be executed efficiently on a parallel processor.
  • An input change bit may have many writers, where all writers are able to make a lock-free write of the value “true” to the input change bit. (e.g. multiple writes of “true” to a single bit).
  • the disclosed concept is more efficient than creating an event queue which requires a lock and/or the use of atomics. The disclosed concept thus provides a reduction in computational complexity.
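  • The lock-free property described above can be sketched as follows: because every writer stores the same value ("true") into an input change bit, concurrent writes cannot conflict and no lock or atomic read-modify-write is needed, unlike insertion into a shared event queue. This is an illustrative sketch (CPython's interpreter lock also serializes these stores); the cluster count and names are assumptions.

```python
# Sketch of the lock-free "write true" idea: many producers may flag the same
# input-change bit concurrently, and because every write stores the same value
# (1 / "true"), no lock or atomic read-modify-write is required.  Cluster
# count, thread layout, and names are illustrative assumptions.
import threading

NUM_CLUSTERS = 64
change_bits = bytearray(NUM_CLUSTERS)      # one input-change bit per cluster

def flag_clusters(cluster_ids):
    for cid in cluster_ids:
        change_bits[cid] = 1               # plain store of "true": no lock needed

writers = [threading.Thread(target=flag_clusters, args=([3, 7, 7, 42],)),
           threading.Thread(target=flag_clusters, args=([7, 42, 63],))]
for t in writers:
    t.start()
for t in writers:
    t.join()

# Per-level traversal: evaluate only the flagged clusters, then clear the bits.
selected = [cid for cid, bit in enumerate(change_bits) if bit]
print("clusters selected for evaluation:", selected)   # [3, 7, 42, 63]
for cid in selected:
    change_bits[cid] = 0                   # reset for the next simulation cycle
```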
  • clock gating check circuits are moved to storage elements that are controlled by the clock or clocks.
  • gated clock circuits which required more than one clock cycle for evaluation may be evaluated in one clock cycle.
  • The result is a reduced number of evaluation steps, and thus an improvement in computational efficiency.
  • Safe clock gating captures the clock enable signal on the inactive edge of the clock.
  • the registers implemented in the designs of various cores are sensitive only to the active edge of the clock. Since there is little simulation activity on the inactive edge of the clock, high overhead associated with parallel simulation may be eliminated for the inactive clock edge.
  • the topology of the safe clock gating and the clock enable signals may be identified.
  • the clock and clock enable signals can be moved into the process or processes which consume the gated clock.
  • Clock gating processes which are sensitive to the inactive edge of the clock, and the gated clock net, may be removed.
  • the design's sensitivity to the inactive edge of the clock is removed.
  • Very low simulation activity is thereby eliminated.
  • the parallel simulation overhead of many synchronization barriers required to execute a simulation time step for the inactive edge of the clock is removed.
  • Gated clock logic which is on the inactive edge of the clock is removed, and clock gating is fused into the process that is on the active edge of the clock.
  • Improved simulation efficiency takes advantage of parallel computing architectures such as a graphics processing unit (GPU).
  • FIG. 1 is a flow diagram for partitioned simulation.
  • a flow 100 is described for selective execution of partitioned parallel simulation.
  • the flow 100 comprises a computer-implemented method for design simulation.
  • Effective parallel simulation is critical to design verification.
  • Parallel simulation can be performed on parallel architectures such as parallel processors, grid computers, and graphics processor units (GPUs for example).
  • the purpose of verification is to ensure that an electronic system design matches a predetermined specification.
  • Various scheduling algorithms exist, such as event driven simulation and oblivious simulation.
  • In event driven simulation, a time-ordered process queue is maintained. Components of a design which undergo value changes are inserted into the queue. As a result, computation is limited to the evaluation of parts of the design that have to be updated.
  • a gated clock may be derived from a main clock and may be implemented to disable regions of a design under control of a clock. Each gated clock in turn may be used to derive sub-gated clocks.
  • Such a hierarchy of gated clocks may serialize the evaluation of a simulation model.
  • the enable of a gated clock may be computed in one phase, which may trigger a change on a gated clock signal, which in turn may be evaluated in a second phase. If the clock gating check is moved into a storage element that may be controlled by the clock, then the evaluation may be done in one phase instead of two. Parallel evaluation may be increased, and the overhead of starting a second phase of evaluation may be removed.
  • the flow 100 includes obtaining a high-level design 110 for simulation.
  • Design simulation is a crucial step in the design, analysis, and verification of an electronic system.
  • The obtained design may be a high-level design written in any of a variety of languages such as Verilog™, VHDL™, SystemVerilog™, SystemC™, or other design language.
  • a high-level design 110 may be represented by a graph.
  • the graph may be manipulated in a variety of ways for simulation purposes.
  • the graph may include a control data flow graph.
  • a control data flow graph (CDFG) may be determined for an RTL or behavioral representation of a high-level design.
  • a control data flow graph may include a graph of a combinational region of the high-level design and a state region of the high-level design.
  • Other graphical representations may be determined for a representation of a high-level design including a directed graph, an acyclic directed graph, a Petri Net, or other graph appropriate to the design problem.
  • the flow 100 continues with partitioning the graph 130 into sub-graphs.
  • a graph may be partitioned into sub-graphs 130 for simulation purposes.
  • a graph can be partitioned into sub-graphs by a number of means, including static means, arbitrary means, and the like.
  • Static partitioning, for example, may be based on creating value read/write locality, balancing computation by level, creating event-change locality, reducing unique node (or primitive) types in a given partition, and the like.
  • the partitioning may include determining sub-graphs based on reducing a number of signals crossing sub-graph boundaries. Arbitrary partitioning may be based on random, ad hoc, heuristic, or other means.
  • the flow 100 may include constructing input-change bit sets 132 .
  • An input-change bit may be a memory element that is added to a design to record when an input to a portion of the design changes state. When an input to the portion has changed, that portion of the design can be re-evaluated, and an output from that portion may change.
  • the input-change bit may only be inserted into a design for simulation purposes.
  • The input-change bit can be a simulation artifact: in practice it is a memory element located somewhere other than the design itself which aids in simulation.
  • a set of input-change bits may be assembled to track multiple portions of the design under simulation. The set may be optimized to cover proper portions of the design as would be appropriate to partition for simulation in separate sections. The partitioning may include determining sub-graphs based on levels of logic where the levels of logic may be successive stages.
  • the flow 100 continues with selecting a subset of the sub-graphs for simulation 140 .
  • the selection may be based on input-change bits.
  • a partition or subset of a graph may be instrumented with an input change bit.
  • An input change bit may be associated with a vertex (node) of a graph, or may be associated with a cluster of nodes of a graph.
  • the flow 100 may further comprise using one or more of the input change bits on a level as part of the selecting of the subset for simulation.
  • An input change bit may indicate that one or more inputs to a partition or subset of a graph (e.g. a sub-graph) have changed over some time period.
  • The selection of a subset of sub-graphs may further comprise selectively evaluating the processes based on an input change bit set being set to true (or valid). Selection of a subset of sub-graphs may be made based on a change in status of one or more input change bits. A subset of the sub-graphs may be selected for execution based on level, on an active edge of a clock, and the like.
  • The flow 100 may include allocating the subset of the sub-graphs for simulation 140 to one or more processors 154. The subset may then in turn be simulated. In some embodiments, the flow 100 includes copying the high-level design 144 and simulating the high-level design as well as its copy on at least two different processors. The flow 100 continues with evaluating the subset of the sub-graphs 156 to produce a simulation result for the high-level design. A subset of sub-graphs may be selected based on using one or more input-change bits 162 to determine a change of status. A change in status may cause re-evaluation of a subset of the sub-graphs 156.
  • the subset of the sub-graphs for simulation are selected based on the input-change bit for each sub-graph in the subset of the sub-graphs being set to true.
  • the processors used in such simulations can include one or more copies of a core, a multicore processor, a graphics processing unit (GPU), a parallel processor, a grid computer, and the like.
  • Some subsets of sub-graphs may indicate an unchanged status of one or more of the input change bits, providing for selective simulation of only those subsets of sub-graphs whose input change bits changed status. This strategy reduces computation requirements by ignoring subsets of sub-graphs without a change of status in their input change bits.
  • Various approaches can be used for the evaluation process. Event driven simulation may be used.
  • the evaluating may be based on an oblivious simulation approach with the simulations being executed on one or more processors.
  • the allocating of processes from the oblivious simulation approach can be to a plurality of processors.
  • Such allocation of processes to a plurality of processors may be based on the input-change bit or bits, thereby confining evaluation to a subset of the sub-graphs.
  • the flow 100 continues with propagating the simulation result 160 , based on the evaluating of the subset of the sub-graphs 156 , to a remainder of the graph for further simulation.
  • a verification process may involve iterating the simulation process. Such iteration may occur for both verification processes based on event driven simulation and for verification based on Oblivious Simulation. Iterating the simulation process may cause a change in the order in which a process or processes are evaluated.
  • the flow 100 may further comprise modifying the clock gating 170 .
  • a portion of the design may be identified with clock gating which is inefficient from a simulation perspective.
  • the clock gating may be modified 170 , perhaps only for simulation purposes.
  • a clock signal may be gated for a variety of reasons including to reduce power consumption. With oblivious simulation an inefficient simulation may result if inactive segments of a design are nonetheless simulated. Safe clock gating may capture a clock enable signal on an inactive edge of a clock as can be seen with certain register designs. Using parallel simulation in this case may incur a high overhead and thus be inefficient. Modification of the clock signal may be undertaken in order to reduce simulation during low activity.
  • The modifying of the clock gating may include moving gating to storage elements. For example, a clock signal and a clock enable signal may be recognized. These signals may be moved to the process which consumes a gated clock. Further, removing clock gating processes that are sensitive to an inactive edge of a clock may further reduce simulation requirements.
  • the modifying may include changing the design, while in other embodiments the modifying of the clock gating may only be for simulation purposes.
  • the modifying of the clock gating may include eliminating a clock gate to a combinational logic portion 172 .
  • a gated clock may be derived from a main clock and, for example, may be employed to disable regions of a design under a clock control. Each gated clock may in turn be used to derive sub-gated clocks. Such a hierarchy of gated clocks may serialize the evaluation of a simulation model.
  • the clock gating may be restructured to combine phases and to occur on the active edge of clock. For example, the enable signal of a gated clock may be computed in one phase. This single phase computation may trigger a change on a gated clock signal, which may in turn necessitate evaluation using a second phase.
  • If the clock gating check were to be moved to a storage element controlled, in embodiments, by a clock, then an evaluation may be done in one clock phase.
  • a combination of clock phases may increase parallel evaluation efficiency, and may remove the overhead of starting a second phase of evaluation.
  • the updated clock gating can be reflected in the graph which was determined.
  • steps in the flow 100 may be changed in order, repeated, omitted, or the like without departing from the disclosed inventive concepts.
  • Various embodiments of the flow 100 may be included in a computer program product embodied in a non-transitory computer readable medium that includes code executable by one or more processors.
  • FIG. 2 is a flow diagram for partitioning.
  • a flow 200 may continue or be part of the previous flow 100 .
  • the flow 200 stands on its own and works from pre-existing system designs, graphs, and the like.
  • a graph representing a system design is obtained.
  • a graph may be partitioned 210 to create a set of sub-graphs.
  • the graph may be any of a variety of graphs including a control data flow graph (CDFG), a directed graph, an acyclic directed graph, a Petri Net, or other graph appropriate to the design problem.
  • The graph may be partitioned into various sub-graphs by a variety of means including arbitrary means, static means, ad hoc means, heuristic means, and the like.
  • partitioning of a graph into sub-graphs may be determined based on logic levels 220 .
  • a design may comprise one or more levels of logic.
  • Simulation models may cause a simulation of a system to be performed in one or more levels. However, depending on the simulation model, the computation required at a given level might be significantly different from that required at another level.
  • the use of an event driven simulation model may cause an uneven work assignment from level to level as a result of partitioning of a graph.
  • the evaluating may be based on an oblivious simulation model.
  • Use of an oblivious model may further comprise allocating processes from the oblivious simulation model to a plurality of processors. Oblivious simulation may not suffer from work starvation on one or more processors.
  • Synchronization may be simpler since a single synchronization per level of the model is sufficient.
  • parallel oblivious simulation may perform redundant computation since all logic blocks are evaluated each simulation cycle. When the number of value changes per design clock is low, parallel oblivious simulation may be slower than serial event driven simulation.
  • a graph may be partitioned into sub-graphs based on a reduction of sub-graph crossings 222 .
  • Sub-graph crossings may refer to the number of signals which flow among sub-graphs.
  • An evaluation region may be partitioned into sub-regions subject to constraints of a given design. Signal crossings among sub-regions may be minimized if most values are produced and consumed within a sub-region; many consumers of a signal may be moved to a receiving sub-region when a signal crosses sub-regions; and the number of primitives in different sub-regions may be kept roughly equal.
  • Partitioning of a graph into sub-graphs may be based on a variety of parameters.
  • the partitioning may include creating value locality 230 .
  • To create value locality signal producers and related signal consumers may be moved to the same partition. Producing and consuming signals within a given partition may reduce sub-graph crossings.
  • creating value read/write locality may permit data production and data consumption to remain local to a given processor assigned, in embodiments, to execute the sub-graph.
  • the partitioning may include creating event change locality 232 .
  • Aggregate selective evaluation (ASE) may evaluate all primitives in a block even if only one input to the block has changed.
  • One may choose to cluster many primitives that change at a given simulation time into a single block to improve the efficiency of ASE and to improve computational efficiency.
  • a variety of primitives may be clustered together so that evaluation of all primitives within a partition remains relevant to a given simulation rather than simply being wasted cycles.
  • the partitioning includes balancing of levels 234 . Such a partitioning may balance a given workload in three phases, such as fetching input values, evaluating primitives, and checking and writing output changes, for example.
  • The partitioning includes collecting of readers 236 of a simulation value. Simulation values may be created by producers and consumed by readers. Computational efficiencies may result if a graph is divided into sub-graphs based on clustering of producers and readers, and further efficiencies may result from assigning such sub-graphs to the same processors, because data read/write overhead is reduced.
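  • A minimal sketch of this reader-collection heuristic appears below: each signal's readers are greedily placed in the cluster of the signal's producer, subject to a capacity limit. The greedy policy, the capacity parameter, and the names are assumptions made for illustration; the patent does not specify this particular algorithm.

```python
# Sketch of collecting the readers of a simulation value into the cluster of
# its producer, so most values are produced and consumed locally.  The greedy
# policy, the capacity limit, and all names are illustrative assumptions.
def cluster_by_readers(producers, readers, capacity):
    """producers: {signal: producing primitive}; readers: {signal: [primitives]}."""
    cluster_of, clusters = {}, []
    for sig, prod in producers.items():
        members = [prod] + [r for r in readers.get(sig, []) if r not in cluster_of]
        if prod in cluster_of and len(clusters[cluster_of[prod]]) + len(members) <= capacity:
            cid = cluster_of[prod]                 # reuse the producer's cluster
        else:
            clusters.append([])
            cid = len(clusters) - 1                # start a new cluster
        for prim in members:
            if prim not in cluster_of:
                clusters[cid].append(prim)
                cluster_of[prim] = cid
    return clusters

producers = {"s1": "mul0", "s2": "sub0"}
readers = {"s1": ["sel0"], "s2": ["sel0", "ff0"]}
print(cluster_by_readers(producers, readers, capacity=4))
# [['mul0', 'sel0'], ['sub0', 'ff0']]  -- sel0 stays with the first value it reads
```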
  • the partitioning includes separating primitives 238 evenly across clusters within a level.
  • A processor cluster or a sub-cluster may support single instruction, multiple data (SIMD) instructions.
  • a SIMD instruction may efficiently compute, for example, 8, 16, 32, or more operations of the same primitive type, such as 16 additions in a single clock cycle of the CPU.
  • To exploit SIMD instructions, a single primitive type or a small set of primitive types may be allocated to a cluster or sub-cluster.
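  • The following sketch illustrates this SIMD-oriented clustering by grouping the primitives of one level by type into fixed-width batches, so that each batch can be evaluated by one vector-style operation. Plain Python loops stand in for SIMD lanes; the lane width and function names are illustrative assumptions.

```python
# Sketch of SIMD-oriented clustering: within one level, group primitives of the
# same type into fixed-width batches so a single vector-style operation can
# evaluate a whole batch (e.g. 16 additions at once).  Plain Python loops stand
# in for SIMD lanes; lane width and names are illustrative assumptions.
from collections import defaultdict

def batch_by_type(level_prims, lane_width=16):
    """level_prims: list of (op_type, operand_a, operand_b) tuples for one level."""
    by_type = defaultdict(list)
    for prim in level_prims:
        by_type[prim[0]].append(prim)
    batches = []
    for op_type, prims in by_type.items():
        for i in range(0, len(prims), lane_width):
            batches.append((op_type, prims[i:i + lane_width]))
    return batches

def evaluate_batch(op_type, prims):
    # One "vector" operation per batch; a real simulator would use SIMD here.
    if op_type == "add":
        return [a + b for (_, a, b) in prims]
    if op_type == "and":
        return [a & b for (_, a, b) in prims]
    raise ValueError(op_type)

level = [("add", 1, 2), ("add", 3, 4), ("and", 1, 1), ("add", 5, 6)]
for op_type, prims in batch_by_type(level, lane_width=2):
    print(op_type, evaluate_batch(op_type, prims))
```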
  • the partitioning includes clustering sibling primitives 240 .
  • Clustering of sibling processes may be an embodiment of change locality.
  • multiple component instantiations can occur.
  • an 8-core CPU may comprise eight instantiations of a component Core.
  • a component which may perform an operation in Core may appear eight times in the CPU.
  • An incrementer which increments a program counter (PC) by 1 in a core appears eight times in the CPU.
  • These multiple instances are referred to as "siblings." Since sibling primitives have a high probability of simultaneous input changes, those sibling primitives can be clustered together into a given partition of a graph, that is, a sub-graph.
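  • Sibling clustering can be sketched by grouping primitives whose hierarchical paths differ only in the instance prefix, as below. The dotted path naming scheme (core0.fetch.pc_inc, and so on) is an assumption used for illustration.

```python
# Sketch of sibling clustering: group primitives whose hierarchical paths are
# identical except for the top-level instance name (e.g. the PC incrementer in
# each of eight cores).  The dotted path naming scheme is an assumption.
from collections import defaultdict

def cluster_siblings(hierarchical_names):
    clusters = defaultdict(list)
    for path in hierarchical_names:
        _, _, local_path = path.partition(".")   # drop the instance prefix
        clusters[local_path].append(path)
    return dict(clusters)

prims = [f"core{i}.fetch.pc_inc" for i in range(8)] + \
        [f"core{i}.alu.add32" for i in range(8)]
for local_path, members in cluster_siblings(prims).items():
    print(local_path, "->", len(members), "siblings")
# fetch.pc_inc -> 8 siblings
# alu.add32 -> 8 siblings
```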
  • a user may run thousands of tests on a single simulation model. Each of these tests may form an execution sequence.
  • By building a simulation model which runs 2, 4, or more copies of the simulation model at a time, one may increase parallel efficiency and achieve better utilization of the model.
  • Such simulation may follow a process such as: obtain a simulation model, then obtain 2, 4, or more copies of the simulation model data, one copy for each test to be supported by a multi-test simulation model.
  • design representations can be shared across the copies.
  • circuit structures can be shared across the copies.
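  • A minimal sketch of such a multi-test arrangement keeps one shared copy of the circuit structure and a separate value-state copy per test, as shown below. The SharedModel and TestCopy names and the toy netlist format are assumptions.

```python
# Sketch of a multi-test simulation model: one shared copy of the circuit
# structure, and one copy of the value state per test.  The SharedModel and
# TestCopy classes and the toy netlist format are illustrative assumptions.
class SharedModel:
    def __init__(self, prims, num_signals):
        self.prims = prims                    # shared: (out, op, in_a, in_b) tuples
        self.num_signals = num_signals

class TestCopy:
    def __init__(self, model, stimulus):
        self.values = [0] * model.num_signals # per-test state, not shared
        self.stimulus = list(stimulus)        # per-test input sequence

def step(model, copy):
    for sig, val in copy.stimulus.pop(0).items():
        copy.values[sig] = val
    for out, op, a, b in model.prims:         # the same structure drives every copy
        copy.values[out] = op(copy.values[a], copy.values[b])

# Signals: 0 = A, 1 = B, 2 = OUT, with OUT = A & B.
model = SharedModel(prims=[(2, lambda x, y: x & y, 0, 1)], num_signals=3)
tests = [TestCopy(model, [{0: 1, 1: 1}]), TestCopy(model, [{0: 1, 1: 0}])]
for t in tests:
    step(model, t)
print([t.values[2] for t in tests])           # [1, 0]: two tests, one shared structure
```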
  • a simulation model may comprise an interconnection of simple primitives.
  • Simple primitives may have roughly equal computational and communication requirements.
  • a description of a logic design at a gate level may be in terms of simple primitives such as AND, OR, and NOT gates, flip-flops, latches, and the like.
  • An RTL description in a high-level language such as Verilog™, VHDL™, SystemVerilog™, SystemC™, or other design language, may be decomposed into primitives.
  • Higher-level functions of arbitrary width, such as multipliers, adders, and the like, may be decomposed into components comprising fixed widths such as 32 or 64 bits, for example.
  • Selectors comprising an arbitrary number of inputs may be decomposed into components comprising a fixed number of inputs such as 3, 5, and the like.
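  • As an illustration of this fixed-width decomposition, the sketch below splits an arbitrary-width addition into chained 32-bit add-with-carry components. The chunking scheme and function names are assumptions; they are not taken from the patent.

```python
# Sketch of decomposing an arbitrary-width adder into fixed-width (here 32-bit)
# add-with-carry components, as described for mapping wide RTL operators onto
# simple primitives.  The chunking scheme and names are illustrative assumptions.
WIDTH = 32
MASK = (1 << WIDTH) - 1

def decompose_add(total_width):
    """List the fixed-width component 'primitives' covering total_width bits."""
    return [("add32", lo, min(lo + WIDTH, total_width))
            for lo in range(0, total_width, WIDTH)]

def evaluate_wide_add(a, b, total_width):
    result, carry = 0, 0
    for _, lo, _ in decompose_add(total_width):
        chunk = ((a >> lo) & MASK) + ((b >> lo) & MASK) + carry
        result |= (chunk & MASK) << lo
        carry = chunk >> WIDTH
    return result & ((1 << total_width) - 1)

a, b = (1 << 100) - 1, 12345
assert evaluate_wide_add(a, b, 128) == (a + b) & ((1 << 128) - 1)
print(decompose_add(100))   # four fixed-width components covering a 100-bit add
```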
  • Primitives may be clustered together for a variety of reasons that may be related to design and/or simulation. Such primitives can comprise a significant portion of a graph or sub-graph for evaluation.
  • steps in the flow 200 may be changed in order, repeated, omitted, or the like without departing from the disclosed inventive concepts.
  • Various embodiments of the flow 200 may be included in a computer program product embodied in a non-transitory computer readable medium that includes code executable by one or more processors.
  • FIG. 3 is a flow diagram for sub-graph usage.
  • a flow 300 may continue or be part of the previous flow 100 .
  • the flow 300 stands on its own and works from pre-existing sub-graphs.
  • a graph representing a system design is obtained.
  • A graph may be partitioned into sub-graphs.
  • Sub-graphs may be used for various aspects of evaluation and simulation processes.
  • a simulation sequence may trigger evaluation of one or more primitives. Evaluation may be based on input changes to one or more sub-graphs. Evaluation may result in updating the values at the outputs of the primitives.
  • Simulation of a graph may comprise one or more optimization criteria, including use of massively parallel architectures, selective execution of a parallel simulation, aggregated selective evaluation optimization for memory architecture, exploitation of SIMD instructions, and the like.
  • Simulation sequences for a design may be formed from sub-graphs. For example, simulation sequences may create regions large enough to keep a parallel machine busy, but the regions may not be so large that computation becomes inefficient due to unnecessary checking or execution. In embodiments, if a design were to have multiple clock regions corresponding to each clock's flip-flops/latches, then the primitives that produce the inputs to the flip-flops/latches may be formed into two regions.
  • a simulation sequence may trigger evaluation of one or more primitives based on input changes.
  • the evaluation may result in updating (e.g. changes to) the values at the outputs of the primitives.
  • the output changes may result in one or more input changes to downstream logic.
  • The flow 300 may further comprise determining that an output of one of the sub-graphs has a change of state 320 and copying that change of state 322 to a processor.
  • a process for a second sub-graph may use that change of state as input to the second sub-graph.
  • the second sub-graph may represent another portion of the design being simulated. Because of the change of input state of a second sub-graph, a simulation sequence may be triggered.
  • the flow 300 may further comprise copying a sequence of changes of state 324 for that output of one of the sub-graphs and using this sequence as a series of inputs to the second sub-graph 326 .
  • a sequence of changes of state may result from one or more simulation steps of a sub-graph.
  • the series of changes of state may represent one or more evaluation steps.
  • Copying a sequence of changes of state from an output of one sub-graph, to be used as a sequence of inputs to a second sub-graph, may execute more efficiently if the first sub-graph and the second sub-graph are executed on the same processor, for example.
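  • The copying of a sequence of state changes between sub-graphs can be sketched as below: the first sub-graph records only its output changes, and the second sub-graph replays that copied sequence as a series of inputs. The queue-based buffering and the toy primitives are illustrative assumptions.

```python
# Sketch of copying a sequence of output state changes from one sub-graph and
# replaying it as a series of inputs to a second sub-graph.  The queue-based
# buffering and the toy primitives are illustrative assumptions.
from collections import deque

def run_subgraph_a(input_sequence):
    """First sub-graph: an inverter whose output changes of state are recorded."""
    changes, prev = deque(), None
    for value in input_sequence:
        out = 1 - value
        if out != prev:                # only changes of state are copied
            changes.append(out)
            prev = out
    return changes

def run_subgraph_b(change_sequence):
    """Second sub-graph: consumes the copied changes as a series of inputs."""
    state = 0
    for value in change_sequence:      # replay the sequence in order
        state ^= value                 # toy state element toggled by its input
    return state

copied = run_subgraph_a([0, 0, 1, 1, 0])     # recorded output changes: 1, 0, 1
print(list(copied), run_subgraph_b(copied))  # [1, 0, 1] and the resulting state
```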
  • Various steps in the flow 300 may be changed in order, repeated, omitted, or the like without departing from the disclosed inventive concepts.
  • Various embodiments of the flow 300 may be included in a computer program product embodied in a non-transitory computer readable medium that includes code executable by one or more processors.
  • FIG. 4 is an example logic block diagram 400 .
  • Simple primitives may have roughly equal computational and communication requirements.
  • a gate level description of a logic design may be in terms of simple primitives such as AND, OR, and NOR gates, flip flops, latches, and the like.
  • Register transfer level (RTL) and higher-level descriptions may be decomposed into simple primitives such as adders, multipliers, multi-bit AND/OR gates, selectors, multi-bit flip-flops, latches, state holding processes with arbitrary controls, loops with controls, etc.
  • a description may be evaluated at various levels of abstraction. In embodiments, the evaluating may be based on an oblivious simulation model.
  • a high-level or system-level description of a design may be partitioned for ease of evaluation.
  • A given RTL description in Verilog™, VHDL™, SystemVerilog™, SystemC™, or other design language may be decomposed into primitives.
  • a system design may be composed of hundreds of thousands of statements. The statements may form primitives.
  • Primitives such as multipliers, adders, etc. of an arbitrary width may be decomposed into components that have a fixed width such as 32 or 64 bits.
  • Primitives such as Selectors that have an arbitrary number of inputs may be decomposed into components that have a fixed number of inputs, such as 3 or 5.
  • The description may be decomposed into simple primitives with a suitable number of inputs and widths, as well as sufficient complexity of operation.
  • various inputs may be applied to the logic.
  • the logic may have inputs such as IN_A 412 and IN_B 414 .
  • the inputs IN_A and IN_B may be applied to one or more gates comprising the logic block 400 .
  • Control signals may also be applied to a logic block.
  • A control signal Control 416 and a clock signal Clock 418 may be among a plurality of control signals which may be applied to the logic block.
  • a multiplier 410 and a subtractor 420 may operate upon inputs 412 and 414 . In embodiments, any number of operations may be performed on inputs.
  • Outputs of a multiplier 410 and a subtractor 420 , or other logic function or functions, may be connected to another logic block such as a selector 430 .
  • a selector 430 may be controlled by a control signal Control 416 .
  • an output of a selector 430 may be captured into a flip-flop 440 .
  • a flip-flop 440 may be controlled by a clock signal Clock 418 .
  • the clock signal Clock 418 applied to a flip-flop, may generate an output OUT 442 .
  • the output OUT 442 may be connected to one or more logic blocks for design and/or simulation purposes.
  • FIG. 5 is an example flow graph 500 of the logic design.
  • the flow graph 500 may represent the logic block diagram 400 .
  • A high-level design may be mapped onto a graph. By determining a graph representation, operators of a logic design may be mapped to nodes of a graph, and interconnections of a logic design to arcs of a graph.
  • a graph may define inputs and outputs to the graph.
  • a graph may include a control data flow graph, directed graph, an acyclic directed graph, a Petri Net, or other graph appropriate to the design problem.
  • a control data flow graph may include a graph of a combinational region of the high-level design and a state region of the high-level design.
  • a graph of a combinational region and a state region may be suitable to various verification and simulation purposes.
  • the graph 500 comprises inputs IN_A 512 , IN_B 514 , CONTROL 516 , and CLOCK 518 .
  • One or more inputs may be connected by arcs to one or more nodes.
  • the node 510 and the node 520 may be connected by the arc 560 and the arc 562 to a node SEL 530 .
  • the node 530 may be controlled by a control signal Control 516 .
  • the node 530 may be connected by an arc 564 to a node FF 540 .
  • the node FF 540 may represent a flip-flop.
  • a node 540 may be controlled by a control signal Clock 518 .
  • a node 540 may be connected by an arc 566 to an output OUT 542 .
  • a clock signal Clock 518 controlling a node 540 may also control an output 542 .
  • Logic blocks may be of varying sizes, ranging from a single input, operator, and output, for example, to many inputs, many operators, and many outputs. Large graphs, for example, may be partitioned into a series of sub-graphs for simulation purposes and/or design purposes.
  • FIG. 6 is an example flow graph showing example partitions.
  • the example flow graph with partitions 600 may be derived from the flow graph 500 .
  • a graph may be generated as part of a computer-implemented method for design simulation.
  • a graph may be represented by any of a variety of graphical methods such as a directed graph, an acyclic directed graph, a Petri Net, or other graph appropriate to the design problem.
  • the control data flow graph may include a graph of a combinational region of the high-level design and a state region of the high-level design.
  • a graph may be partitioned into simulation sequences.
  • a simulation sequence may trigger evaluation of one or more primitives based on changes to an input or inputs. Evaluation may result in updating the values at the outputs of the primitives.
  • a simulation sequence may be constructed for the flow graph with partitions 600 .
  • a simulation sequence may trigger evaluation of one or more primitives based on input changes.
  • a simulation sequence may result in updating values at an output or outputs of one or more primitives.
  • the following simulation sequence may be constructed. If a change occurs on IN_A 612 , IN_B 614 , or Control 616 , then Region 1 652 is evaluated. If a change occurs on, for example, a positive edge of Clock 618 , then Region 2 654 is evaluated. More than one simulation sequence may be generated. For example, one of many alternative sequences may comprise the following. If a change occurs on any of IN_A 612 and IN_B 614 , then Region 11 650 is evaluated.
  • Simulation sequences for a given design may be formed to create regions large enough to keep a parallel machine busy but not so large that unnecessary checking or execution is performed. Unnecessary checking or execution may result in excess computational utilization and thus may reduce efficiency. If a design comprises multiple clock regions that may correspond to flip-flops/latches connected to each clock, then primitives which may produce inputs to a flip-flop/latch may be formed into regions.
  • FIG. 7 is an example of a levelized hypergraph 700 .
  • the levelized hypergraph 700 may represent a portion of the flow graph with partitions 600 .
  • a region as discussed above may comprise an interconnection of primitives which may form a hypergraph.
  • a hypergraph may comprise a series of sub-graphs.
  • a typical region may be acyclic.
  • a region which may have a combinational cycle may be partitioned such that a cycle may be cut at an arbitrary point.
  • a special primitive in a special region may be inserted at the point of the cut or at another location. By inserting a special primitive, all regions may be turned into acyclic hypergraphs.
  • An acyclic hypergraph for simulation may comprise a subset of sub-graphs. The subset of the sub-graphs for simulation may be selected based on the input-change bit for each sub-graph in the subset of the sub-graphs being set to true.
  • An acyclic hypergraph may be levelized in such a manner that each level may have a set of primitives which may not have value dependencies.
  • operations 710 and 720 may be evaluated in the same level Level K 750 .
  • a SEL 730 primitive may have value dependencies on operations 710 and 720 , and control signal Control 716 , and may be evaluated in a level Level K+1 752 .
  • Such a dependency may be permitted if the operations 710 and 720 are evaluated in a previous level 750 prior to the evaluation of level 752 .
  • Such a procedure may avoid time/event ordered queues, and synchronization of queues. There may be a single synchronization for each level if an oblivious simulation model is used.
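  • A sketch of such levelized evaluation with one synchronization per level follows: primitives within a level have no value dependencies and are evaluated concurrently, and the simulator waits once per level before proceeding. The thread pool is a stand-in for a parallel processor; the level contents mirror FIG. 7, but the code itself is an assumption.

```python
# Sketch of levelized evaluation: primitives in one level have no value
# dependencies on each other, so they may be evaluated concurrently, with one
# synchronization (here, collecting the pool's results) between levels.  The
# thread pool and the level contents are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor

def evaluate_levels(levels, values):
    """levels: list of levels; each level is a list of (output, op, inputs)."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        for level in levels:
            def eval_prim(prim):
                out, op, ins = prim
                return out, op(*(values[s] for s in ins))
            # Single synchronization per level: wait for every primitive in the
            # level before any primitive of the next level is evaluated.
            for out, val in pool.map(eval_prim, level):
                values[out] = val
    return values

values = {"IN_A": 3, "IN_B": 2, "Control": 1}
levels = [
    [("mul", lambda a, b: a * b, ["IN_A", "IN_B"]),                      # Level K
     ("sub", lambda a, b: a - b, ["IN_A", "IN_B"])],
    [("sel", lambda m, s, c: m if c else s, ["mul", "sub", "Control"])], # Level K+1
]
print(evaluate_levels(levels, values)["sel"])   # 6
```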
  • a subset of sub-graphs may be selected by other means.
  • the selection may further comprise using one or more of the input change bits on a level as part of the selecting of the subset for simulation.
  • Input change bits may be used to indicate that one or more of the inputs, for example inputs 712 and 714, have changed.
  • the partitioning of a graph into sub-graphs may include the balancing of levels. Balancing of levels may permit more efficient simulation and evaluation on GPUs or other parallel processing chips and systems.
  • the partitioning may include separating primitives evenly across clusters within a level. Again, such separation may serve to improve computation efficiency of simulation and evaluation processes.
  • FIG. 8 is an example logic block 800 for selective evaluation.
  • the logic block 800 is shown which may include input change bits.
  • Input change bits may indicate changes of input values for one or more inputs, and may refer to changes for a single gate or a plurality of gates.
  • the input change bits may be included for the purposes of simulation, and/or may reflect alterations to a hardware design.
  • oblivious simulation may have an advantage over event driven simulation in that oblivious simulation proceeds without creating and maintaining an event queue required by event driven simulation. However, oblivious simulation may tend to simulate all elements rather than just those which experience an input change.
  • the input change bits may be added in order to simulate only those elements which experience input changes (e.g. selective simulation), and may render simulation more computationally efficient.
  • a selective evaluation procedure may require that input changes be maintained for each primitive of a block of primitives.
  • the logic block 800 may show two primitives, 810 and 840 , each with their own Input Change Bit, 830 and 860 respectively.
  • a single input change bit may be assigned to a logic block comprising primitives, in a given simulation level, grouped together to reduce the overhead of maintaining individual input change bits.
  • a single input change bit may indicate a change of one or more inputs into a block. Simulation of a block may only proceed if an input change bit indicates a change.
  • a single common input change value 830 may be maintained for a primitive 810 .
  • Selectively evaluating the processes may be based on an input change bit set being set to true. If one or more of the inputs Data1 820, Data2 822, and Data3 824 changes, a common input change bit 830 may be set. Setting the input change bit 830 may result in evaluation of all primitives, Adder1 812 and Adder2 814, in block 810. For example, if Data1 820 transitions while Data2 822 and Data3 824 remain constant, then the input change bit 830 would be set to indicate that an input had changed, and primitives 812 and 814 would be evaluated.
  • a change on Data1 820 which may cause evaluation of primitives 812 and 814 , may cause the output of Adder1 812 to change. Such a change may be reflected on input 816 to storage 850 . Also, a change may result in an update 818 of Input Change Bit 860 .
  • The primitives of block 840 may be evaluated as a result of an input change 816, which is stored in storage 850 and indicated by input change bit 860.
  • an example procedure may include the steps in Table 1.
  • Writing of an output change to an Input Change Bit 860 of block 840 may result in evaluation of primitives Adder1 842 and Adder2 844 of block 840 .
  • a procedure such as the one indicated in the example may balance the workload in three phases.
  • the three phases may comprise fetching of input values, evaluation of primitives, and checking and writing of output changes.
  • a procedure such as the one indicated in the example may optimize memory access by fetching input values for a given logic block as a single contiguous region of memory. Since only blocks which undergo input changes may be evaluated, redundant evaluation may be reduced in comparison to a conventional Oblivious simulation.
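  • Since Table 1 itself is not reproduced here, the sketch below follows the surrounding description of the procedure: for each block whose input change bit is set, fetch its input values, evaluate its primitives, and on any output change set the input change bit of the consuming block. The block dictionaries and signal names loosely follow FIG. 8 but are assumptions.

```python
# Sketch of the selective evaluation procedure described around FIG. 8 (the
# referenced Table 1 is not reproduced here): for each block whose input change
# bit is set, fetch its input values, evaluate its primitives, and on any output
# change set the input change bit of the consuming block.  The block layout and
# signal names loosely follow the figure but are illustrative assumptions.
def simulate_cycle(blocks, values, change_bits):
    for name, block in blocks.items():
        if not change_bits.get(name):
            continue                                       # skip unchanged blocks
        change_bits[name] = False
        fetched = {s: values[s] for s in block["inputs"]}  # phase 1: fetch inputs
        for out, op, ins in block["prims"]:                # phase 2: evaluate
            new = op(*(fetched.get(s, values[s]) for s in ins))
            if new != values.get(out):                     # phase 3: check and write
                values[out] = new
                for reader in block["readers"].get(out, []):
                    change_bits[reader] = True             # flag the consuming block

# Block "810" (Adder1/Adder2) feeds block "840" through signal "a1_out".
blocks = {
    "810": {"inputs": ["Data1", "Data2", "Data3"],
            "prims": [("a1_out", lambda x, y: x + y, ["Data1", "Data2"]),
                      ("a2_out", lambda x, y: x + y, ["Data2", "Data3"])],
            "readers": {"a1_out": ["840"]}},
    "840": {"inputs": ["a1_out"],
            "prims": [("out", lambda x: 2 * x, ["a1_out"])],
            "readers": {}},
}
values = {"Data1": 1, "Data2": 2, "Data3": 3, "a1_out": 0, "a2_out": 0, "out": 0}
change_bits = {"810": True, "840": False}          # Data1 transitioned this cycle
simulate_cycle(blocks, values, change_bits)
print(values["a1_out"], values["out"], change_bits)  # 3 6 {'810': False, '840': False}
```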
  • FIG. 9 is an example logic combination to create a phase reduction.
  • a logic block 900 may be obtained for simulation by a computer-implemented method for design simulation.
  • Gated clock designs such as logic block 900 , may result from a variety of design decisions.
  • a gated clock 934 may be derived from a Main Clock 922 (e.g. system clock) and may be implemented to disable regions of a design under a clock control.
  • the main clock 922 may also permit capture of Data 920 into a storage element 912 .
  • each gated clock may in turn be used to derive sub-gated clocks, and so on. Such a hierarchy of gated clocks may permit serialization of evaluation of a given simulation model.
  • a gated clock may be enabled by an enable signal 924 .
  • An enable 924 may control the Clock Gate 914 of a gated clock 934 .
  • a gated clock 934 may be computed in one phase 910 .
  • An enable 924 of a gated clock 934 may be computed in one phase 910 , and may trigger a change on gated clock 934 which may be evaluated in a second phase. For example, a change on gated clock 934 may cause evaluation of Storage B 930 .
  • An example logic combination may include modifying clock gating.
  • the modifying of the clock gating may include moving gating to storage elements. For example, if a clock gating check were moved to a storage element that is controlled by a clock, then evaluation may be accomplished in one phase. For example, consider a logic block 902 . A clock enable gate has been moved into a storage element 950 .
  • the modifying of the clock gating may include eliminating a clock gate to a combinational logic portion. In embodiments, the modifying of the clock gating may only be for simulation purposes.
  • logic block 902 retains inputs Data 960 , Main Clock 962 , and Clock Gate Enable 964 .
  • a Main Clock 962 may be connected to two storage elements 942 and 950 .
  • the clock gating may be restructured to combine phases and to occur on the active edge of clock. Since a clock enable gate may be moved into a storage element 950 , then two primitives 942 and 950 may be combined into a single phase 940 , thus reducing the number of phases required for simulation to one. Reducing the number of simulation phases required may increase parallel evaluation and may remove computational overhead of starting additional phases.
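  • The phase reduction described for FIG. 9 can be sketched by comparing a two-phase gated-clock register (phase one derives the gated clock, phase two evaluates the register it drives) with the restructured form in which the enable is folded into the storage element and evaluated in a single phase on the active edge. The function names and phase counting below are illustrative assumptions.

```python
# Sketch of the FIG. 9 restructuring: a gated-clock register takes two phases
# per active edge (phase one derives the gated clock, phase two evaluates the
# register it drives), while folding the enable into the storage element lets
# the same behavior be evaluated in one phase.  Names and the phase counting
# are illustrative assumptions.
def gated_clock_two_phase(state, data, clock_edge, enable):
    # Phase 1: the clock gate primitive derives the gated clock.
    gated_clock_edge = clock_edge and enable
    # Phase 2: the storage element sensitive to the derived gated clock.
    if gated_clock_edge:
        state = data
    return state, 2            # two evaluation phases used

def enable_flop_one_phase(state, data, clock_edge, enable):
    # Single phase: the storage element captures data on the active edge
    # only when the enable is asserted.
    if clock_edge and enable:
        state = data
    return state, 1            # one evaluation phase used

for edge, en in [(True, True), (True, False)]:
    before = gated_clock_two_phase(0, 7, edge, en)
    after = enable_flop_one_phase(0, 7, edge, en)
    assert before[0] == after[0]        # identical stored value, fewer phases
    print(edge, en, before, after)
```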
  • FIG. 10 is a system diagram for partitioned simulation.
  • the computer system 1000 for partitioned design simulation may comprise one or more processors 1010 coupled to a memory 1012 and a display 1014 .
  • the one or more processors 1010 may be coupled to a Design Library 1020 , a Graph Module 1030 , a Partition Module 1040 , a Selector Module 1050 , and an Evaluation Module 1060 .
  • The one or more processors 1010 may accomplish one or more of the graph, partition, selector, and evaluation functions. In at least one embodiment, all of these functions will be accomplished by the one or more processors 1010.
  • the one or more processors 1010 are coupled to the memory 1012 which stores instructions, system support data, intermediate data, analysis, and the like.
  • the one or more processors 1010 may be coupled to an electronic display 1014 .
  • the display 1014 may be any electronic display, including but not limited to, a computer display, a laptop screen, a net-book screen, a tablet computer screen, a cell phone display, a mobile device display, a remote with a display, a television, a projector, or the like.
  • the one or more processors 1010 may be configured to obtain a high-level design 1020 for simulation.
  • the high-level design may comprise various aspects of a design including a system to be simulated, a simulation test bench, simulation models, simulation vectors, and the like.
  • the system 1000 may be configured to determine a graph representation for the high-level design 1030 .
  • a control data flow graph (CDFG) may be determined for an RTL representation of a high-level design.
  • the CDFG may be folded into a graph of a combinational region and a state region.
  • Other graphical representations may be determined for a representation of a high-level design including a directed graph, an acyclic directed graph, a Petri Net, or other graph appropriate to the design problem.
  • the system 1000 may be configured to partition the graph into sub-graphs.
  • a graph may be partitioned into sub-graphs by a variety of means including static means, arbitrary means, and the like.
  • Static partitioning, for example, may be based on creating value read/write locality, balancing of computation by level, creating event-change locality, reducing unique node (or primitive) types in a given partition, and the like.
  • Arbitrary partitioning may be based on random, ad hoc, heuristic, and other means, for example.
  • the system 1000 may be configured to select a subset of the sub-graphs for simulation based on input-change bits.
  • a partition or subset of a graph may be instrumented with an input change bit.
  • An input change bit may be associated with a vertex (node) of a graph, or may be associated with a cluster of nodes of a graph.
  • An input change bit may indicate that one or more inputs to a partition or subset of a graph (e.g. a sub-graph) have changed over some time period. Selection of a subset of sub-graphs may be made based on a change in status of one or more input change bits.
  • The system 1000 may be configured to evaluate the subset of the sub-graphs to produce a simulation result for the high-level design.
  • a subset of sub-graphs may be simulated based on a change of status of one or more input change bits.
  • Other subsets of sub-graphs may indicate that there may have been no change of status of one or more input change bits.
  • selective simulation of, for example, only subsets of sub-graphs that had a change of status of one or more input change bits may reduce computational complexity by ignoring subsets of sub-graphs that did not have a change of status of one or more input change bits.
  • the system 1000 may include computer program product comprising code for obtaining a high-level design for simulation; code for determining a graph representation for the high-level design; code for partitioning the graph into sub-graphs; code for selecting a subset of the sub-graphs for simulation based on input-change bits; and code for evaluating the subset of the sub-graphs to produce a simulation result for the high-level design.
  • Embodiments may include various forms of distributed computing, client/server computing, and cloud based computing. Further, it will be understood that the depicted steps or boxes contained in this disclosure's flow charts are solely illustrative and explanatory. The steps may be modified, omitted, repeated, or re-ordered without departing from the scope of this disclosure. Further, each step may contain one or more sub-steps. While the foregoing drawings and description set forth functional aspects of the disclosed systems, no particular implementation or arrangement of software and/or hardware should be inferred from these descriptions unless explicitly stated or otherwise clear from the context. All such arrangements of software and/or hardware are intended to fall within the scope of this disclosure.
  • the block diagrams and flowchart illustrations depict methods, apparatus, systems, and computer program products.
  • the elements and combinations of elements in the block diagrams and flow diagrams show functions, steps, or groups of steps of the methods, apparatus, systems, computer program products and/or computer-implemented methods. Any and all such functions—generally referred to herein as a “circuit,” “module,” or “system”—may be implemented by computer program instructions, by special-purpose hardware-based computer systems, by combinations of special purpose hardware and computer instructions, by combinations of general purpose hardware and computer instructions, and so on.
  • a programmable apparatus which executes any of the above mentioned computer program products or computer-implemented methods may include one or more microprocessors, microcontrollers, embedded microcontrollers, programmable digital signal processors, programmable devices, programmable gate arrays, programmable array logic, memory devices, application specific integrated circuits, or the like. Each may be suitably employed or configured to process computer program instructions, execute computer logic, store computer data, and so on.
  • a computer may include a computer program product from a computer-readable storage medium and that this medium may be internal or external, removable and replaceable, or fixed.
  • a computer may include a Basic Input/Output System (BIOS), firmware, an operating system, a database, or the like that may include, interface with, or support the software and hardware described herein.
  • Embodiments of the present invention are neither limited to conventional computer applications nor the programmable apparatus that run them.
  • the embodiments of the presently claimed invention could include an optical computer, quantum computer, analog computer, or the like.
  • a computer program may be loaded onto a computer to produce a particular machine that may perform any and all of the depicted functions. This particular machine provides a means for carrying out any and all of the depicted functions.
  • any combination of one or more computer readable media may be utilized including but not limited to: a non-transitory computer readable medium for storage; an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor computer readable storage medium or any suitable combination of the foregoing; a portable computer diskette; a hard disk; a random access memory (RAM); a read-only memory (ROM), an erasable programmable read-only memory (EPROM, Flash, MRAM, FeRAM, or phase change memory); an optical fiber; a portable compact disc; an optical storage device; a magnetic storage device; or any suitable combination of the foregoing.
  • a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • computer program instructions may include computer executable code.
  • languages for expressing computer program instructions may include without limitation C, C++, Java, JavaScript™, ActionScript™, assembly language, Lisp, Perl, Tcl, Python, Ruby, hardware description languages, database programming languages, functional programming languages, imperative programming languages, and so on.
  • computer program instructions may be stored, compiled, or interpreted to run on a computer, a programmable data processing apparatus, a heterogeneous combination of processors or processor architectures, and so on.
  • embodiments of the present invention may take the form of web-based computer software, which includes client/server software, software-as-a-service, peer-to-peer software, or the like.
  • a computer may enable execution of computer program instructions including multiple programs or threads.
  • the multiple programs or threads may be processed approximately simultaneously to enhance utilization of the processor and to facilitate substantially simultaneous functions.
  • any and all methods, program codes, program instructions, and the like described herein may be implemented in one or more threads which may in turn spawn other threads, which may themselves have priorities associated with them.
  • a computer may process these threads based on priority or other order.
  • the verbs “execute” and “process” may be used interchangeably to indicate execute, process, interpret, compile, assemble, link, load, or a combination of the foregoing. Therefore, embodiments that execute or process computer program instructions, computer-executable code, or the like may act upon the instructions or code in any and all of the ways described.
  • the method steps shown are intended to include any suitable method of causing one or more parties or entities to perform the steps. The parties performing a step, or portion of a step, need not be located within a particular geographic location or country boundary. For instance, if an entity located within the United States causes a method step, or portion thereof, to be performed outside of the United States then the method is considered to be performed in the United States by virtue of the causal entity.

Abstract

Computer implemented techniques for the partitioned simulation of parallel architectures are disclosed. A high-level design for simulation is obtained. A graph representation for the high-level design is determined. The graph for the high-level design is partitioned into sub-graphs. A subset of the sub-graphs is selected for simulation based on input-change bits of the sub-graphs. The subset of the sub-graphs is subsequently evaluated on parallel architectures in order to produce a simulation result for the high-level design.

Description

    RELATED APPLICATIONS
  • This application claims the benefit of U.S. provisional patent application “Simulation on Massively Parallel Architectures” Ser. No. 61/639,799, filed Apr. 27, 2012. The foregoing application is hereby incorporated by reference in its entirety.
  • FIELD OF ART
  • This application relates generally to semiconductor circuit simulation and more particularly to selective execution of partitioned parallel simulation.
  • BACKGROUND
  • Modern electronic systems designs contain numerous components including digital, analog, and high frequency components, all of which can be problematic to design. The design process may comprise top-down decomposition and/or bottom-up assembly. Feature sizes of the components making up the electronic systems now are routinely smaller than the wavelength of visible light. In addition, the rapid change of the market and consumer demands drives ever-increasing performance, feature sets, versatility, and various other system factors which inject contradictory design requirements into design processes. Logic systems are routinely constructed from tens or even hundreds of millions of transistors. System designers are required to balance system performance, physical size, architectural complexity, power consumption, heat dissipation, fabrication complexity, and cost, to name only a few. Each of the related design decisions drive profound impacts on the resulting design. To handle the design complexity, developers create specifications around which to design their systems. The specifications attempt to balance the many disparate demands being made of the logic system and to contain what can easily be exploding design complexity.
  • Systems may be described at a variety of levels of abstraction ranging from low-level transistor layouts to high-level description languages. Most designers describe and design their electronic systems at a high-level of abstraction using an IEEE standard hardware description language (HDL) such as Verilog, SystemVerilog™, or VHDL. The high-level HDL is easier for developers to comprehend, especially for a vast system, and may describe highly complex concepts that are difficult to grasp using a lower level of abstraction. The HDL description may be converted into other levels of abstraction as is helpful to the developers. For example, a high-level description may be converted to a logic-level register transfer level (RTL) description, a gate-level (GL) description, a layout-level description, or a mask-level description. Each lower abstraction level introduces more detail into the design description. The lower-levels of abstraction may be generated automatically by computer, derived from a design library, or created by another design automation technique. Therefore, it is critical to ensure that the performance of the resulting lower-level designs is still capable of matching the requirements of the system's specification. The process of comparing a system design to a design specification (or one level of abstraction to another) can be called verification. Verification of modern electronic systems can ensure that a device under test (DUT) is simulated so that the behavior of the DUT is shown to match a system specification for the electronic design.
  • SUMMARY
  • Techniques implemented for the verification of integrated circuits are required to stimulate the device under test (DUT) to a sufficient extent to ensure that the device matches a design specification. Further, the verification process, which is by necessity computationally intensive, must be undertaken in such a way as to minimize both test duration and computer resource utilization. A computer-implemented method for design simulation is disclosed comprising: obtaining a high-level design for simulation; determining a graph representation for the high-level design; partitioning the graph representation into sub-graphs; selecting a subset of the sub-graphs for simulation based on input-change bits; and evaluating the subset of the sub-graphs to produce a simulation result for the high-level design.
  • The method may further comprise propagating the simulation result, based on the evaluating of the subset of the sub-graphs, to a remainder of the graph representation for further simulation. The subset of the sub-graphs for simulation may be selected based on an input-change bit, for each sub-graph in the subset of the sub-graphs, being set to true. The partitioning may further comprise determining sub-graphs based on levels of logic. The partitioning into sub-graphs may be based on reducing a number of signals crossing sub-graph boundaries. The method may further comprise using one or more of the input-change bits on a level as part of the selecting of the subset for simulation. The evaluating may be based on an oblivious simulation model. The method may further comprise allocating processes from the oblivious simulation model to a plurality of processors. The method may further comprise selectively evaluating the processes based on an input change bit set being set to valid. The graph representation may include a control data flow graph. The control data flow graph may include a graph of a combinational region of the high-level design and a state region of the high-level design. The partitioning may include creating value locality. The partitioning may include creating event change locality. The partitioning may include balancing of levels. The partitioning may include collecting of readers of a simulation value. The partitioning may include separating primitives evenly across clusters within a level. The partitioning may include clustering sibling primitives. The method may further comprise modifying clock gating. The modifying clock gating may include moving gating to storage elements. The modifying clock gating may include eliminating a clock gate to a combinational logic portion. The modifying clock gating may be only for simulation purposes. The clock gating may be restructured to combine phases and to occur on an active edge of clock. The method may further comprise determining that an output of one of the sub-graphs has a change of state and copying that change of state to a processor where a process for a second sub-graph uses that change of state as input to the second sub-graph. The method may further comprise copying a sequence of changes of state for that output of one of the sub-graphs and using the sequence as a series of inputs to the second sub-graph. The method may further comprise copying the high-level design and simulating the high-level design as well as its copy on at least two different processors. The copying may be performed to accomplish the simulating. The method may further comprise maintaining primitives with the same function in a single cluster within the copy of the high-level design.
  • In embodiments, a computer system for design simulation may comprise: a memory which stores instructions; one or more processors coupled to the memory wherein the one or more processors are configured to: obtain a high-level design for simulation; determine a graph representation for the high-level design; partition the graph representation into sub-graphs; select a subset of the sub-graphs for simulation based on input-change bits; and evaluate the subset of the sub-graphs to produce a simulation result for the high-level design. In some embodiments, a computer program product embodied in a non-transitory computer readable medium for design simulation may comprise: code for obtaining a high-level design for simulation; code for determining a graph representation for the high-level design; code for partitioning the graph representation into sub-graphs; code for selecting a subset of the sub-graphs for simulation based on input-change bits; and code for evaluating the subset of the sub-graphs to produce a simulation result for the high-level design
  • Various features, aspects, and advantages of various embodiments will become more apparent from the following further description.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The following detailed description of certain embodiments may be understood by reference to the following figures wherein:
  • FIG. 1 is a flow diagram for partitioned simulation.
  • FIG. 2 is a flow diagram for partitioning.
  • FIG. 3 is a flow diagram for sub-graph usage.
  • FIG. 4 is an example logic block diagram.
  • FIG. 5 is an example flow graph of the logic design.
  • FIG. 6 is an example flow graph showing example partitions.
  • FIG. 7 is an example levelized hypergraph.
  • FIG. 8 is an example logic block for selective evaluation.
  • FIG. 9 is an example logic combination to create a phase reduction.
  • FIG. 10 is a system diagram for partitioned simulation.
  • DETAILED DESCRIPTION
  • Electronic circuit designs are vastly complex systems. Verification of these systems, and portions thereof, is critical to produce properly functioning logic designs. Many simulation techniques, including event driven simulation and oblivious simulation, have been proposed to aid in this verification process. Event driven simulation, as opposed to oblivious simulation, can maintain a time-ordered queue of waiting processes. Events are added to the list at various times, and are typically processed in the order in which they arrive. Processing an event may generate more events or alter the order of the list of pending events. The components of the electronic system which experience value changes (i.e. the one or more inputs that have changed) are added to the queue of pending processes. Thus, simulation computation is limited to design components which undergo input changes. In contrast, oblivious simulation evaluates all components of the design regardless of whether or not a given component has experienced a value change. Changes to input values are not monitored, and no queue insertion is performed. However, in this type of simulation, a computation of every component is performed for every simulation process; this often leads to redundant and unnecessary computations. Thus, the choice of a simulation approach and how that approach is implemented is critical to minimizing simulation time while maximizing effectiveness.
  • Both the event driven and oblivious simulation approaches may be parallelized, though the process is different in each approach. Parallel event driven simulation employs multiple time-ordered queues and processes, and assigns sections of the design to a time-ordered queue running on a processor. Since the time-ordered queues must be chronologically synchronized, this approach is often limited to few processors. It does not scale to large numbers of processors because synchronization costs are prohibitively high and processor work assignments are uneven, thus saturating one processor with tasks while others remain idle. On the other hand, parallel oblivious simulation simplifies queue synchronization because only a single synchronization is required per level of the simulation model. Parallel oblivious simulation does perform redundant (i.e. unnecessary) operations. However, in a simulation with many value changes per design clock cycle or simulation cycle, parallel oblivious simulation is more efficient than parallel event driven simulation because of lower computational and synchronization demands.
  • The oblivious simulation model presents a static simulation view which both allows for efficient partitioning under multiple constraints and efficient allocation of tasks to clusters of processors. The oblivious model may be instrumented for selective execution of processes, thus reducing computational overhead by removing from a simulation cycle static portions of the model which do not change in a given cycle. A control data flow graph (CDFG) representing the description of the electronic system can be folded into a graph of the combinational and state regions. The CDFG may then be partitioned statically in order to create a locality of read and write values, balance computations by level, determine event change locality, and calculate the number of unique node (primitive) type reductions per partition. Such partitioning is accomplished by collecting most or all readers of a simulation value into the same cluster, partitioning primitives evenly across clusters, clustering “sibling” primitive instances together, limiting the number of unique primitives generated, and clustering for a small number of unique primitives. The statically partitioned CDFG can be instrumented with a single input change bit per cluster of nodes in a given level. Thus, such selective evaluation can skip the evaluation of a cluster for which the change bit is false. In this system, a collection of change bits is gathered in a bit set that is both updated and traversed in parallel by allocating each bit to a processing thread.
  • In the disclosed concept, a control data flow graph (CDFG) may be assigned to an RTL representation of a high-level design and partitioning may be performed. Each partitioning constraint is satisfied by collecting most or all readers of a particular simulation value into the same cluster. This reduces the number of necessary writes by the result writer, as well as evenly partitioning primitives in a level across clusters. Clustering of sibling primitive instances from multiple instances of large user blocks (e.g. four program counter (PC) incrementer primitives from a four CPU core model may be clustered together) may be performed, as well as limiting the number of unique primitives being generated and clustered to a small number. Instrumenting the CDFG with an input change bit per cluster of nodes in a level can be helpful. Selective evaluation can then evaluate a cluster of nodes if the input change bit is true and bypass evaluation of a cluster of nodes if the input change bit is false. Thus, selective evaluation minimizes unnecessary evaluation by skipping the evaluation of clusters for which the input change bit is false. In addition, the use of an input change bit set per level can be executed efficiently on a parallel processor. An input change bit may have many writers, where all writers are able to make a lock-free write of the value "true" to the input change bit (e.g. multiple writes of "true" to a single bit). The disclosed concept is more efficient than creating an event queue, which requires a lock and/or the use of atomics. The disclosed concept thus provides a reduction in computational complexity.
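  • The per-level change-bit mechanism described above might be sketched as follows; this is a simplified, single-threaded model offered only for illustration, and the cluster callables and their return values are assumptions rather than the patented implementation:

    # Sketch: one input change bit per cluster of nodes in a level. Writers only
    # ever write True, so the writes need no lock or atomic operation.
    def evaluate_level(clusters, change_bits, num_next_clusters):
        """clusters: callables, one per cluster in this level; each returns the
        indices of next-level clusters whose inputs it changed."""
        next_bits = [False] * num_next_clusters
        for cluster, changed in zip(clusters, change_bits):
            if not changed:
                continue                        # skip clusters with no input change
            for reader_index in cluster():      # evaluate every primitive in the cluster
                next_bits[reader_index] = True  # many writers may all write True
        return next_bits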
  • In some embodiments, clock gating check circuits are moved to storage elements that are controlled by the clock or clocks. As a result, gated clock circuits which required more than one clock cycle for evaluation may be evaluated in one clock cycle. The result is a reduced number of evaluation steps, and thus an improvement in computational efficiency. Safe clock gating captures the clock enable signal on the inactive edge of the clock. The registers implemented in the designs of various cores are sensitive only to the active edge of the clock. Since there is little simulation activity on the inactive edge of the clock, the high overhead associated with parallel simulation may be eliminated for the inactive clock edge.
  • In the disclosed concept, the topology of the safe clock gating and the clock enable signals may be identified. The clock and clock enable signals can be moved into the process or processes which consume the gated clock. Clock gating processes which are sensitive to the inactive edge of the clock, and the gated clock net, may be removed. Thus, the design's sensitivity to the inactive edge of the clock is removed. Very low simulation activity is thereby eliminated. As a result, the parallel simulation overhead of many synchronization barriers required to execute a simulation time step for the inactive edge of the clock is removed. Gated clock logic which is on the inactive edge of the clock is removed, and clock gating is fused into the process that is on the active edge of the clock. Improved simulation efficiency takes advantage of parallel computing architectures such as a graphics processing unit (GPU).
  • FIG. 1 is a flow diagram for partitioned simulation. A flow 100 is described for selective execution of partitioned parallel simulation. The flow 100 comprises a computer-implemented method for design simulation. Effective parallel simulation is critical to design verification. Parallel simulation can be performed on parallel architectures such as parallel processors, grid computers, and graphics processing units (GPUs), for example. The purpose of verification is to ensure that an electronic system design matches a predetermined specification. Various scheduling algorithms exist, such as event driven simulation and oblivious simulation. In event driven simulation, a time-ordered process queue is maintained. Components of a design which undergo value changes are inserted into the queue. As a result, computation is limited to the evaluation of parts of the design that have to be updated.
  • In oblivious simulation, all components of the design are evaluated. Component evaluation takes place whether or not the component undergoes a value change. Computation of the various components is simpler because value changes are not checked, and there is no queue insertion process that must be handled. However, redundant computation results when components which did not undergo value changes are evaluated.
  • In the disclosed concept, clock gating plays a role in simulation efficiency. A gated clock may be derived from a main clock and may be implemented to disable regions of a design under control of a clock. Each gated clock in turn may be used to derive sub-gated clocks. Such a hierarchy of gated clocks may serialize the evaluation of a simulation model. The enable of a gated clock may be computed in one phase, which may trigger a change on a gated clock signal, which in turn may be evaluated in a second phase. If the clock gating check is moved into a storage element that may be controlled by the clock, then the evaluation may be done in one phase instead of two. Parallel evaluation may be increased, and the overhead of starting a second phase of evaluation may be removed.
  • The flow 100 includes obtaining a high-level design 110 for simulation. Design simulation is a crucial step in the design, analysis, and verification of an electronic system. In embodiments, the obtained design may be a high level design written in any of a variety of languages such as Verilog™, VHDL™, SystemVerilog™, SystemC™, or other design language.
  • The flow 100 continues with determining a graph representation 120 for the high-level design. A high-level design 110 may be represented by a graph. The graph may be manipulated in a variety of ways for simulation purposes. The graph may include a control data flow graph. In embodiments, a control data flow graph (CDFG) may be determined for an RTL or behavioral representation of a high-level design. A control data flow graph may include a graph of a combinational region of the high-level design and a state region of the high-level design. Other graphical representations may be determined for a representation of a high-level design including a directed graph, an acyclic directed graph, a Petri Net, or other graph appropriate to the design problem.
  • The flow 100 continues with partitioning the graph 130 into sub-graphs. A graph may be partitioned into sub-graphs 130 for simulation purposes. A graph can be partitioned into sub-graphs by a number of means, including static means, arbitrary means, and the like. Static partitioning, for example, may be based on creating value read/write locality, balancing computation by level, creating event-change locality, reducing unique node (or primitive) types in a given partition, and the like. The partitioning may include determining sub-graphs based on reducing a number of signals crossing sub-graph boundaries. Arbitrary partitioning may be based on random, ad hoc, heuristic, or other means. The flow 100 may include constructing input-change bit sets 132. An input-change bit may be a memory element that is added to a design to record when an input to a portion of the design changes state. When an input to the portion is known to have changed, that portion of the design can be re-evaluated and an output from that portion may change. The input-change bit may only be inserted into a design for simulation purposes. In some embodiments, the input-change bit can be an artifact held in a memory element somewhere other than the design, which aids in simulation. A set of input-change bits may be assembled to track multiple portions of the design under simulation. The set may be optimized to cover proper portions of the design as would be appropriate to partition for simulation in separate sections. The partitioning may include determining sub-graphs based on levels of logic, where the levels of logic may be successive stages.
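  • A minimal sketch of constructing such an input-change bit set is given below; the sub_graphs structure (a list of dictionaries with an 'inputs' set of signal names) and the note_change helper are assumptions made only for illustration:

    # Sketch: one bit per sub-graph, plus a map from each signal to the
    # sub-graphs that read it, so that a signal change sets the right bits.
    def build_change_bit_set(sub_graphs):
        bits = [False] * len(sub_graphs)
        readers = {}                              # signal name -> indices of reading sub-graphs
        for index, sg in enumerate(sub_graphs):
            for signal in sg["inputs"]:
                readers.setdefault(signal, []).append(index)

        def note_change(signal):
            for index in readers.get(signal, []):
                bits[index] = True                # mark the partition for re-evaluation

        return bits, note_change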
  • The flow 100 continues with selecting a subset of the sub-graphs for simulation 140. The selection may be based on input-change bits. A partition or subset of a graph may be instrumented with an input change bit. An input change bit may be associated with a vertex (node) of a graph, or may be associated with a cluster of nodes of a graph. The flow 100 may further comprise using one or more of the input change bits on a level as part of the selecting of the subset for simulation. An input change bit may indicate that one or more inputs to a partition or subset of a graph (e.g. a sub-graph) have changed over some time period. The selection of a subset of sub-graphs may further comprise selectively evaluating the processes based on an input change bit set being set to true (or valid). Selection of a subset of sub-graphs may be made based on a change in status of one or more input change bits. A subset of the sub-graphs may be selected for execution based on level, on an active edge of a clock, and the like.
  • The flow 100 may include allocating the subset of the sub-graphs for simulation 140 to one or more processors 154. The subset may then in turn be simulated. In some embodiments, the flow 100 includes copying the high-level design 144 and simulating the high-level design as well as its copy on at least two different processors. The flow 100 continues with evaluating the subset of the sub-graphs 156 to produce a simulation result for the high-level design. A subset of sub-graphs may be selected based on using one or more input-change bits 162 to determine a change of status. A change in status may cause re-evaluation of a subset of the sub-graphs 156. In embodiments, the subset of the sub-graphs for simulation is selected based on the input-change bit for each sub-graph in the subset of the sub-graphs being set to true. The processors used in such simulations can include one or more copies of a core, a multicore processor, a graphics processing unit (GPU), a parallel processor, a grid computer, and the like. Subsets of sub-graphs may indicate an unchanged status of one or more of the input change bits, providing selective simulation of only those subsets of sub-graphs whose input change bits changed status. This strategy thereby reduces computation requirements by ignoring subsets of sub-graphs without a change of status in their input change bits. Various approaches can be used for the evaluation process. Event driven simulation may be used. In other embodiments, the evaluating may be based on an oblivious simulation approach with the simulations being executed on one or more processors. The allocating of processes from the oblivious simulation approach can be to a plurality of processors. Such allocation of processes to a plurality of processors may be based on the input-change bit or bits, thereby including evaluation of a subset of the sub-graphs.
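  • The allocation of selected sub-graphs to processors might be sketched as follows, assuming each sub-graph is a callable and using a thread pool purely as a stand-in for the parallel architecture; this is not the claimed scheduling scheme:

    # Sketch: evaluate only the selected sub-graphs, spread across worker threads,
    # with a single synchronization point when the pool completes the level.
    from concurrent.futures import ThreadPoolExecutor

    def evaluate_selected(sub_graphs, change_bits, max_workers=4):
        selected = [sg for sg, bit in zip(sub_graphs, change_bits) if bit]
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            results = list(pool.map(lambda sg: sg(), selected))
        return results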
  • The flow 100 continues with propagating the simulation result 160, based on the evaluating of the subset of the sub-graphs 156, to a remainder of the graph for further simulation. A verification process may involve iterating the simulation process. Such iteration may occur both for verification processes based on event driven simulation and for verification based on oblivious simulation. Iterating the simulation process may cause a change in the order in which a process or processes are evaluated.
  • The flow 100 may further comprise modifying the clock gating 170. While determining a graph 120 for evaluation, a portion of the design may be identified with clock gating which is inefficient from a simulation perspective. In such a case, the clock gating may be modified 170, perhaps only for simulation purposes. In some designs, a clock signal may be gated for a variety of reasons, including to reduce power consumption. With oblivious simulation, an inefficient simulation may result if inactive segments of a design are nonetheless simulated. Safe clock gating may capture a clock enable signal on an inactive edge of a clock, as can be seen with certain register designs. Using parallel simulation in this case may incur a high overhead and thus be inefficient. Modification of the clock signal may be undertaken in order to reduce simulation during low activity. In embodiments, the modifying of the clock gating may include moving gating to storage elements. For example, a clock signal and a clock enable signal may be recognized. These signals may be moved to the process which consumes a gated clock. Further, removing clock gating processes that are sensitive to an inactive edge of a clock may further reduce simulation requirements. In embodiments, the modifying may include changing the design, while in other embodiments the modifying of the clock gating may only be for simulation purposes.
  • The modifying of the clock gating may include eliminating a clock gate to a combinational logic portion 172. A gated clock may be derived from a main clock and, for example, may be employed to disable regions of a design under a clock control. Each gated clock may in turn be used to derive sub-gated clocks. Such a hierarchy of gated clocks may serialize the evaluation of a simulation model. The clock gating may be restructured to combine phases and to occur on the active edge of clock. For example, the enable signal of a gated clock may be computed in one phase. This single phase computation may trigger a change on a gated clock signal, which may in turn necessitate evaluation using a second phase. If the clock gating check were to be moved to a storage element controlled, in embodiments, by a clock, then an evaluation may be done in one clock phase. A combination of clock phases may increase parallel evaluation efficiency, and may remove the overhead of starting a second phase of evaluation. The updated clock gating can be reflected in the graph which was determined. Various steps in the flow 100 may be changed in order, repeated, omitted, or the like without departing from the disclosed inventive concepts. Various embodiments of the flow 100 may be included in a computer program product embodied in a non-transitory computer readable medium that includes code executable by one or more processors.
  • FIG. 2 is a flow diagram for partitioning. A flow 200 may continue or be part of the previous flow 100. In some embodiments, the flow 200 stands on its own and works from pre-existing system designs, graphs, and the like. A graph representing a system design is obtained. A graph may be partitioned 210 to create a set of sub-graphs. The graph may be any of a variety of graphs including a control data flow graph (CDFG), a directed graph, an acyclic directed graph, a Petri Net, or other graph appropriate to the design problem. The graph may be partitioned into various sub-graphs by a variety of means, including arbitrary means, static means, ad hoc means, heuristic means, and the like.
  • In embodiments, partitioning of a graph into sub-graphs may be determined based on logic levels 220. A design may comprise one or more levels of logic. In addition, simulation models may cause a simulation of a system to be performed in one or more levels. However, depending on the simulation model, the computation required at a given level might be significantly different from that required at another level. For example, the use of an event driven simulation model may cause an uneven work assignment from level to level as a result of partitioning of a graph. As a further example, the evaluating may be based on an oblivious simulation model. Use of an oblivious model may further comprise allocating processes from the oblivious simulation model to a plurality of processors. Oblivious simulation may not suffer from work starvation on one or more processors. Synchronization may be simpler since a single synchronization per level of the model is sufficient. However, parallel oblivious simulation may perform redundant computation since all logic blocks are evaluated each simulation cycle. When the number of value changes per design clock is low, parallel oblivious simulation may be slower than serial event driven simulation.
  • In embodiments, a graph may be partitioned into sub-graphs based on a reduction of sub-graph crossings 222. Sub-graph crossings may refer to the number of signals which flow among sub-graphs. To minimize the number of sub-graph crossings, an evaluation region may be partitioned into sub-regions subject to constraints of a given design. Signal crossings among sub-regions may be minimized if most values are produced and consumed within a sub-region, if consumers of a signal are moved to the receiving sub-region when a signal crosses sub-regions, and if the number of primitives in different sub-regions is kept roughly equal.
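  • One simple and deliberately naive way to reduce sub-graph crossings is a greedy pass that moves a consumer into its producer's sub-region whenever that lowers the cut; the edge list and partition map below are assumed inputs, and balance constraints are ignored for brevity:

    # Sketch: count boundary crossings and greedily move consumers to reduce them.
    def count_crossings(edges, partition_of):
        return sum(1 for producer, consumer in edges
                   if partition_of[producer] != partition_of[consumer])

    def greedy_reduce_crossings(edges, partition_of):
        improved = True
        while improved:
            improved = False
            for producer, consumer in edges:
                if partition_of[producer] == partition_of[consumer]:
                    continue
                before = count_crossings(edges, partition_of)
                original = partition_of[consumer]
                partition_of[consumer] = partition_of[producer]   # tentative move
                if count_crossings(edges, partition_of) < before:
                    improved = True                               # keep the move
                else:
                    partition_of[consumer] = original             # revert
        return partition_of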
  • Partitioning of a graph into sub-graphs may be based on a variety of parameters. For example, the partitioning may include creating value locality 230. To create value locality, signal producers and related signal consumers may be moved to the same partition. Producing and consuming signals within a given partition may reduce sub-graph crossings. For example, creating value read/write locality may permit data production and data consumption to remain local to a given processor assigned, in embodiments, to execute the sub-graph.
  • In embodiments, the partitioning may include creating event change locality 232. Aggregate selective evaluation (ASE) may evaluate all primitives in a block even if only one input to the block may have changed. One may choose to cluster many primitives that change at a given simulation time into a single block to improve the efficiency of ASE and to improve computational efficiency. In such a partitioning scheme, a variety of primitives may be clustered together so that evaluation of all primitives within a partition remains relevant to a given simulation rather than simply being wasted cycles.
  • In embodiments, the partitioning includes balancing of levels 234. Such a partitioning may balance a given workload in three phases, such as fetching input values, evaluating primitives, and checking and writing output changes, for example. In embodiments, the partitioning includes collecting of readers 236 of a simulation value. Simulation values may be created by producers, and consumed by readers. Computational efficiencies may result if a graph is divided into sub-graphs based on clustering of producers and readers. Efficiencies may result from assigning such sub-graphs to the same processors, because computational (e.g. data read/write) efficiencies from reduced processing overhead may result.
  • In embodiments, the partitioning includes separating primitives 238 evenly across clusters within a level. A processor cluster or a sub-cluster may support single instruction, multiple data (SIMD) instructions. A SIMD instruction may efficiently compute, for example, 8, 16, 32, or more operations of the same primitive type, such as 16 additions in a single clock cycle of the CPU. To exploit SIMD instructions, a single primitive type or a small set of primitive types may be allocated to a cluster or sub-cluster.
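  • A rough sketch of separating primitives across clusters while keeping same-type primitives together (so that SIMD lanes perform identical work) is shown below; the (name, type) primitive tuples are an assumed representation chosen only for illustration:

    # Sketch: place the largest type groups first into the least-full cluster,
    # balancing cluster sizes while limiting unique primitive types per cluster.
    from collections import defaultdict

    def cluster_level(primitives, num_clusters):
        by_type = defaultdict(list)
        for name, ptype in primitives:
            by_type[ptype].append(name)
        clusters = [[] for _ in range(num_clusters)]
        for ptype in sorted(by_type, key=lambda t: len(by_type[t]), reverse=True):
            target = min(range(num_clusters), key=lambda i: len(clusters[i]))
            clusters[target].extend((name, ptype) for name in by_type[ptype])
        return clusters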
  • In embodiments, the partitioning includes clustering sibling primitives 240. Clustering of sibling processes may be an embodiment of change locality. In a given design model, multiple component instantiations can occur. For example, an 8-core CPU may comprise eight instantiations of a component Core. A component which may perform an operation in Core may appear eight times in the CPU. For example, an incrementer—which may increment a program counter (PC) by 1 in a core—appears eight times in a CPU. In this example, these multiple instances of PC incrementers are referred to as "siblings." Since sibling primitives have a high probability of simultaneous input changes, those sibling primitives can be clustered together into a given partition of a graph, that is, a sub-graph.
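  • A hypothetical way to find such siblings is to group primitives whose hierarchical instance paths differ only in the per-instance prefix, as sketched below; the path format and function name are assumptions for illustration:

    # Sketch: primitives that share a local name across instances (e.g.
    # 'cpu.core0.pc_incr' and 'cpu.core1.pc_incr') are treated as siblings.
    from collections import defaultdict

    def cluster_siblings(instance_paths):
        siblings = defaultdict(list)
        for path in instance_paths:
            _, _, local_name = path.rpartition('.')
            siblings[local_name].append(path)
        return [group for group in siblings.values() if len(group) > 1]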
  • A user may run thousands of tests on a single simulation model. Each of these tests may form an execution sequence. By creating a simulation model which runs 2, 4, or more copies of the simulation model at a time, one may increase parallel efficiency and achieve better utilization of a model. Such simulation may follow a process such as: obtain a simulation model, then obtain 2, 4, or more copies of the simulation model data, one copy for each test to be supported by a multi-test simulation model. By having multiple copies, different stimuli can be applied to the copies to evaluate different tests. In some embodiments, design representations can be shared across the copies. Likewise, circuit structures can be shared across the copies.
  • A simulation model may comprise an interconnection of simple primitives. Simple primitives may have roughly equal computational and communication requirements. For example, a description of a logic design at a gate level may be in terms of simple primitives such as AND, OR, and NOT gates, flip-flops, latches, and the like. An RTL description in a high-level language such as Verilog™, VHDL™, SystemVerilog™, SystemC™, or other design language, may be decomposed into primitives. Higher-level functions of arbitrary width, such as multipliers, adders, and the like, may be decomposed into components comprising fixed widths such as 32 or 64 bits, for example. Selectors comprising an arbitrary number of inputs may be decomposed into components comprising a fixed number of inputs such as 3, 5, and the like. Primitives may be clustered together for a variety of reasons that may be related to design and/or simulation. Such primitives can comprise a significant portion of a graph or sub-graph for evaluation. Various steps in the flow 200 may be changed in order, repeated, omitted, or the like without departing from the disclosed inventive concepts. Various embodiments of the flow 200 may be included in a computer program product embodied in a non-transitory computer readable medium that includes code executable by one or more processors.
  • FIG. 3 is a flow diagram for sub-graph usage. A flow 300 may continue or be part of the previous flow 100. In some embodiments, the flow 300 stands on its own and works from pre-existing sub-graphs. A graph representing a system design is obtained. A graph may be partitioned into sub-graphs. Sub-graphs may be used for various aspects of evaluation and simulation processes. A simulation sequence may trigger evaluation of one or more primitives. Evaluation may be based on input changes to one or more sub-graphs. Evaluation may result in updating the values at the outputs of the primitives.
  • The sub-graphs are evaluated 310 as part of simulation of a high-level design. Simulation of a graph may comprise one or more optimization criteria, including use of massively parallel architectures, selective execution of a parallel simulation, aggregated selective evaluation optimization for memory architecture, exploitation of SIMD instructions, and the like.
  • Simulation sequences for a design may be formed from sub-graphs. For example, simulation sequences may create regions large enough to keep a parallel machine busy, but the regions may not be so large that computation becomes inefficient due to unnecessary checking or execution. In embodiments, if a design were to have multiple clock regions corresponding to each clock's flip-flops/latches, then the primitives that produce the inputs to the flip-flops/latches may be formed into two regions.
  • A simulation sequence may trigger evaluation of one or more primitives based on input changes. The evaluation may result in updating (e.g. changes to) the values at the outputs of the primitives. The output changes may result in one or more input changes to downstream logic. The flow 300 may further comprise determining that an output of one of the sub-graphs has changed 320 state and copying that change of state 322 to a processor. A process for a second sub-graph may use that change of state as input to the second sub-graph. The second sub-graph may represent another portion of the design being simulated. Because of the change of input state of a second sub-graph, a simulation sequence may be triggered.
  • Similar to the copying which may take place between an output change and an input change, the flow 300 may further comprise copying a sequence of changes of state 324 for that output of one of the sub-graphs and using this sequence as a series of inputs to the second sub-graph 326. A sequence of changes of state may result from one or more simulation steps of a sub-graph. The series of changes of state may represent one or more evaluation steps. The copying of a sequence of changes of state from an output of a sub-graph that may be used as a sequence of inputs to a second sub-graph may execute more efficiently if a first sub-graph and a second sub-graph are executed on the same processor, for example. Various steps in the flow 300 may be changed in order, repeated, omitted, or the like without departing from the disclosed inventive concepts. Various embodiments of the flow 300 may be included in a computer program product embodied in a non-transitory computer readable medium that includes code executable by one or more processors.
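  • The copying of a sequence of output changes for use as a series of inputs might look like the following sketch; the producer_outputs iterable and consumer_step callable are assumed stand-ins for the two sub-graphs' processes:

    # Sketch: replay a copied sequence of output changes from one sub-graph as a
    # series of inputs to a second sub-graph.
    from collections import deque

    def relay_changes(producer_outputs, consumer_step):
        pending = deque(producer_outputs)       # copied sequence of state changes
        results = []
        while pending:
            results.append(consumer_step(pending.popleft()))
        return results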
  • FIG. 4 is an example logic block diagram 400. Simple primitives may have roughly equal computational and communication requirements. A gate level description of a logic design may be in terms of simple primitives such as AND, OR, and NOR gates, flip-flops, latches, and the like. Register transfer level (RTL) and higher-level descriptions may be decomposed into simple primitives such as adders, multipliers, multi-bit AND/OR, selectors, multi-bit flip-flops, latches, state holding processes with arbitrary controls, loops with controls, etc. A description may be evaluated at various levels of abstraction. In embodiments, the evaluating may be based on an oblivious simulation model.
  • A high-level or system-level description of a design may be partitioned for ease of evaluation. For example, a given RTL description in Verilog™, VHDL™, SystemVerilog™, SystemC™, or other design language, may be decomposed into primitives. A system design may be composed of hundreds of thousands of statements. The statements may form primitives. Primitives such as multipliers, adders, etc. of an arbitrary width may be decomposed into components that have a fixed width such as 32 or 64 bits. Primitives such as selectors that have an arbitrary number of inputs may be decomposed into components that have a fixed number of inputs, such as 3 or 5. For example, in a given HDL description, the description may be decomposed into simple primitives with a suitable number of inputs and widths, as well as sufficient complexity of operation.
  • In the example logic block 400, various inputs may be applied to the logic. For example, the logic may have inputs such as IN_A 412 and IN_B 414. The inputs IN_A and IN_B may be applied to one or more gates comprising the logic block 400. Control signals may also be applied to a logic block. For example, a control signal Control 416, and a clock signal Clock 418 may be among a plurality of control signals which may be applied to the control block.
  • Various operations may be performed on the one or more inputs applied to a logic block. For example, a multiplier 410 and a subtractor 420 may operate upon inputs 412 and 414. In embodiments, any number of operations may be performed on inputs. Outputs of a multiplier 410 and a subtractor 420, or other logic function or functions, may be connected to another logic block such as a selector 430. A selector 430 may be controlled by a control signal Control 416. Continuing with the example, an output of a selector 430 may be captured into a flip-flop 440. A flip-flop 440 may be controlled by a clock signal Clock 418. The clock signal Clock 418, applied to a flip-flop, may generate an output OUT 442. The output OUT 442 may be connected to one or more logic blocks for design and/or simulation purposes.
  • FIG. 5 is an example flow graph 500 of the logic design. The flow graph 500 may represent the logic block diagram 400. A high-level design may be mapped onto a graph. By determining a map representation, operators of a logic design may be mapped to nodes of a graph, and interconnections of a logic design to arcs of a graph. A graph may define inputs and outputs to the graph. A graph may include a control data flow graph, directed graph, an acyclic directed graph, a Petri Net, or other graph appropriate to the design problem. A control data flow graph may include a graph of a combinational region of the high-level design and a state region of the high-level design. A graph of a combinational region and a state region may be suitable to various verification and simulation purposes.
  • The graph 500 comprises inputs IN_A 512, IN_B 514, CONTROL 516, and CLOCK 518. The input 512 and the input 514 may be connected by an arc 550 and an arc 552 to node M=A*B 510, and by an arc 554 and an arc 556 to node S=A−B 520. In embodiments, arcs to one or more nodes may connect one or more inputs. Continuing, the node 510 and the node 520 may be connected by the arc 560 and the arc 562 to a node SEL 530. The node 530 may be controlled by a control signal Control 516. Continuing, the node 530 may be connected by an arc 564 to a node FF 540. In embodiments, the node FF 540 may represent a flip-flop. A node 540 may be controlled by a control signal Clock 518. A node 540 may be connected by an arc 566 to an output OUT 542. In embodiments, a clock signal Clock 518 controlling a node 540 may also control an output 542. Logic blocks may be of varying sizes, ranging from a single input, operator, and output, for example, to many inputs, many operators, and many outputs. Large graphs, for example, may be partitioned into a series of sub-graphs for simulation purposes and/or design purposes.
  • FIG. 6 is an example flow graph showing example partitions. The example flow graph with partitions 600 may be derived from the flow graph 500. A graph may be generated as part of a computer-implemented method for design simulation. A graph may be represented by any of a variety of graphical methods such as a directed graph, an acyclic directed graph, a Petri Net, or other graph appropriate to the design problem. The control data flow graph may include a graph of a combinational region of the high-level design and a state region of the high-level design. A graph may be partitioned into simulation sequences. A simulation sequence may trigger evaluation of one or more primitives based on changes to an input or inputs. Evaluation may result in updating the values at the outputs of the primitives. A simulation sequence may be constructed for the flow graph with partitions 600.
  • A simulation sequence may trigger evaluation of one or more primitives based on input changes. A simulation sequence may result in updating values at an output or outputs of one or more primitives. For example, for the graph 600, the following simulation sequence may be constructed. If a change occurs on IN_A 612, IN_B 614, or Control 616, then Region 1 652 is evaluated. If a change occurs on, for example, a positive edge of Clock 618, then Region 2 654 is evaluated. More than one simulation sequence may be generated. For example, one of many alternative sequences may comprise the following. If a change occurs on any of IN_A 612 and IN_B 614, then Region 11 650 is evaluated. Evaluating 650 may result in evaluating M=A*B 610 and S=A−B 620. If a change occurs on arc 622, arc 624 or Control 616, then SEL 630 is evaluated. If a change occurs on, for example, a positive edge of Clock 618, then Region 2 654 is evaluated. Evaluating 654 may result in evaluating a flip-flop FF 640. Evaluating 640 may result in evaluating output OUT 642.
  • Simulation sequences for a given design may be formed to create regions large enough to keep a parallel machine busy but not so large that unnecessary checking or execution is performed. Unnecessary checking or execution may result in excess computational utilization and thus may reduce efficiency. If a design comprises multiple clock regions that may correspond to flip-flops/latches connected to each clock, then primitives which may produce inputs to a flip-flop/latch may be formed into regions.
  • FIG. 7 is an example of a levelized hypergraph 700. The levelized hypergraph 700 may represent a portion of the flow graph with partitions 600. A region as discussed above may comprise an interconnection of primitives which may form a hypergraph. A hypergraph may comprise a series of sub-graphs. A typical region may be acyclic. In embodiments, a region which may have a combinational cycle may be partitioned such that a cycle may be cut at an arbitrary point. A special primitive in a special region may be inserted at the point of the cut or at another location. By inserting a special primitive, all regions may be turned into acyclic hypergraphs.
  • An acyclic hypergraph for simulation may comprise a subset of sub-graphs. The subset of the sub-graphs for simulation may be selected based on the input-change bit for each sub-graph in the subset of the sub-graphs being set to true. An acyclic hypergraph may be levelized in such a manner that each level may have a set of primitives which may not have value dependencies. In a levelized hypergraph 700, an operation M=A*B 710 and an operation S=A−B 720 may not have value dependencies on each other. For example, operations 710 and 720 may depend on inputs IN_A 712 and IN_B 714, but not on each other. In the example given, operations 710 and 720 may be evaluated in the same level Level K 750. A SEL 730 primitive may have value dependencies on operations 710 and 720 and on control signal Control 716, and may be evaluated in a level Level K+1 752. Such a dependency may be permitted if the operations 710 and 720 are evaluated in a previous level 750 prior to the evaluation of level 752. Such a procedure may avoid time/event ordered queues and synchronization of queues. There may be a single synchronization for each level if an oblivious simulation model is used.
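  • Levelization of an acyclic graph can be sketched as follows: a node's level is one more than the maximum level of its predecessors, so nodes within a level carry no value dependencies on each other. The deps mapping (node to list of predecessors) is an assumed representation used only for illustration:

    # Sketch: assign longest-path levels to the nodes of an acyclic graph.
    def levelize(deps):
        levels = {}
        def level_of(node):
            if node not in levels:
                preds = deps.get(node, [])
                levels[node] = 0 if not preds else 1 + max(level_of(p) for p in preds)
            return levels[node]
        for node in deps:
            level_of(node)
        by_level = {}
        for node, level in levels.items():
            by_level.setdefault(level, []).append(node)
        return by_level

  • For a dependency map resembling FIG. 7, such as levelize({'M': ['IN_A', 'IN_B'], 'S': ['IN_A', 'IN_B'], 'SEL': ['M', 'S', 'Control']}), the inputs land in level 0, M and S in level 1, and SEL in level 2, matching the Level K and Level K+1 arrangement described above.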
  • A subset of sub-graphs may be selected by other means. In embodiments, the selection may further comprise using one or more of the input change bits on a level as part of the selecting of the subset for simulation. Input change bits may be used to indicate that one or more of the inputs, for example inputs 712 and 714, have changed. In other embodiments, the partitioning of a graph into sub-graphs may include the balancing of levels. Balancing of levels may permit more efficient simulation and evaluation on GPUs or other parallel processing chips and systems. In another embodiment, the partitioning may include separating primitives evenly across clusters within a level. Again, such separation may serve to improve computation efficiency of simulation and evaluation processes.
  • FIG. 8 is an example logic block 800 for selective evaluation. The logic block 800 is shown which may include input change bits. Input change bits may indicate changes of input values for one or more inputs, and may refer to changes for a single gate or a plurality of gates. The input change bits may be included for the purposes of simulation, and/or may reflect alterations to a hardware design. Recall that oblivious simulation may have an advantage over event driven simulation in that oblivious simulation proceeds without creating and maintaining an event queue required by event driven simulation. However, oblivious simulation may tend to simulate all elements rather than just those which experience an input change. The input change bits may be added in order to simulate only those elements which experience input changes (e.g. selective simulation), and may render simulation more computationally efficient.
  • A selective evaluation procedure may require that input changes be maintained for each primitive of a block of primitives. For example, the logic block 800 may show two primitives, 810 and 840, each with their own Input Change Bit, 830 and 860 respectively. A single input change bit may be assigned to a logic block comprising primitives, in a given simulation level, grouped together to reduce the overhead of maintaining individual input change bits. A single input change bit may indicate a change of one or more inputs into a block. Simulation of a block may only proceed if an input change bit indicates a change.
  • In the given example logic block 800, a single common input change value 830 may be maintained for a primitive 810. Selectively evaluating the processes may be based on an input change bit set being set to true. If one or more of the inputs Data1 820, Data2 822, and Data3 824 changes, a common input change bit 830 may be set. Setting an input change bit 830 may result in evaluation of all primitives, Adder1 812 and Adder2 814, in block 810. For example, if Data1 820 transitions while Data2 822 and Data3 824 remain constant, then an input change bit 830 would be set to indicate that an input had changed, and primitives 812 and 814 would be evaluated.
  • Continuing the example, a change on Data1 820, which may cause evaluation of primitives 812 and 814, may cause the output of Adder1 812 to change. Such a change may be reflected on input 816 to storage 850. The change may also result in an update 818 of the input change bit 860. During a subsequent simulation step, the primitives of block 840 may be evaluated as a result of the input change 816, which is stored in storage 850 and indicated by input change bit 860.
  • In embodiments, an example procedure may include the steps in Table 1.
  • TABLE 1
    Example handling steps for circuit with input change bits.
    1. Check the input change bit value.
    2. If the input change bit value is "1", evaluate the block of primitives.
    3. Set the input change bit for the block to "0" (reset).
    4. Evaluate the primitives of the block; this may cause an output change and may trigger a process.
    5. Write the output value to receivers.
    6. Write the output change to the receiving block or blocks.
  • Writing of an output change to an Input Change Bit 860 of block 840 may result in evaluation of primitives Adder1 842 and Adder2 844 of block 840.
  • A procedure such as the one indicated in the example may balance the workload in three phases. In the example, the three phases may comprise fetching of input values, evaluation of primitives, and checking and writing of output changes. Such a procedure may optimize memory access by fetching the input values for a given logic block as a single contiguous region of memory. Since only blocks which undergo input changes may be evaluated, redundant evaluation may be reduced in comparison to conventional oblivious simulation.
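  • The Python sketch below walks through the Table 1 handling steps at block granularity. The Block structure, the block contents, the receiver lists, and the simulate_step helper are illustrative assumptions; the sketch only shows how an input change bit may gate evaluation and how output changes may set the bits of receiving blocks.

    from dataclasses import dataclass, field
    from typing import Callable, Dict, List, Tuple

    @dataclass
    class Block:
        name: str
        inputs: List[str]                                    # fetched as one contiguous group
        primitives: List[Tuple[str, Callable[..., int]]]     # (output name, function) pairs
        receivers: List[str] = field(default_factory=list)   # blocks reading this block's outputs
        input_change_bit: bool = False

    def simulate_step(blocks: Dict[str, Block], values: Dict[str, int]) -> None:
        ready = [b for b in blocks.values() if b.input_change_bit]   # step 1: check bits
        for block in ready:
            block.input_change_bit = False                           # step 3: reset the bit
            fetched = [values[name] for name in block.inputs]        # phase 1: fetch inputs
            for out_name, func in block.primitives:                  # phase 2: evaluate
                new_value = func(*fetched)
                if values.get(out_name) != new_value:                # phase 3: check/write
                    values[out_name] = new_value                     # step 5: write to receivers
                    for receiver in block.receivers:
                        blocks[receiver].input_change_bit = True     # step 6: flag receiving block

    # Two blocks loosely modeled on FIG. 8: block_810 feeds block_840.
    blocks = {
        "block_810": Block("block_810", ["Data1", "Data2", "Data3"],
                           [("Sum1", lambda d1, d2, d3: d1 + d2),
                            ("Sum2", lambda d1, d2, d3: d2 + d3)],
                           receivers=["block_840"], input_change_bit=True),
        "block_840": Block("block_840", ["Sum1", "Sum2"],
                           [("Sum3", lambda s1, s2: s1 + s2)]),
    }
    values = {"Data1": 1, "Data2": 2, "Data3": 3}
    simulate_step(blocks, values)   # evaluates block_810 and flags block_840
    simulate_step(blocks, values)   # evaluates block_840 because its inputs changed
    print(values)                   # Sum1=3, Sum2=5, Sum3=8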
  • FIG. 9 is an example logic combination to create a phase reduction. A logic block 900 may be obtained for simulation by a computer-implemented method for design simulation. Gated clock designs, such as logic block 900, may result from a variety of design decisions. For example, a gated clock 934 may be derived from a Main Clock 922 (e.g. a system clock) and may be implemented to disable regions of a design under clock control. The main clock 922 may also permit capture of Data 920 into a storage element 912. In some designs, each gated clock may in turn be used to derive sub-gated clocks, and so on. Such a hierarchy of gated clocks may lead to serialization of the evaluation of a given simulation model. In some designs, a gated clock may be enabled by an enable signal 924.
  • An enable 924 may control the Clock Gate 914 of the gated clock 934. The gated clock 934 and its enable 924 may be computed in one phase 910; a change on the gated clock 934 may then be evaluated in a second phase. For example, a change on the gated clock 934 may cause evaluation of Storage B 930.
  • An example logic combination may include modifying clock gating. The modifying of the clock gating may include moving the gating to storage elements. For example, if a clock gating check were moved to a storage element that is controlled by a clock, then evaluation may be accomplished in one phase. For example, consider a logic block 902, in which a clock enable gate has been moved into a storage element 950. The modifying of the clock gating may include eliminating a clock gate to a combinational logic portion. In embodiments, the modifying of the clock gating may only be for simulation purposes. As with logic block 900, logic block 902 retains inputs Data 960, Main Clock 962, and Clock Gate Enable 964. However, in the example, the Main Clock 962 may be connected to two storage elements 942 and 950. The clock gating may be restructured to combine phases and to occur on the active edge of the clock. Since the clock enable gate may be moved into the storage element 950, the two primitives 942 and 950 may be combined into a single phase 940, thus reducing the number of phases required for simulation to one. Reducing the number of simulation phases required may increase parallel evaluation and may remove the computational overhead of starting additional phases.
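  • A minimal Python sketch of this phase reduction follows. The flip-flop models, the simulate_two_phase and simulate_one_phase helpers, and the signal values are illustrative assumptions; the sketch only contrasts evaluating the gated clock and its downstream storage in two phases with folding the enable check into the storage element so that everything evaluates on the main clock's active edge.

    class Storage:
        """Simple positive-edge storage element (flip-flop) model."""
        def __init__(self) -> None:
            self.q = 0

        def clock_edge(self, data: int) -> None:
            self.q = data

    class GatedStorage(Storage):
        """Storage element with the clock-gate enable folded into the element."""
        def clock_edge(self, data: int, enable: int = 1) -> None:
            if enable:                    # gating check performed on the main clock edge
                self.q = data

    def simulate_two_phase(data_a: int, data_b: int, enable: int,
                           storage_a: Storage, storage_b: Storage) -> None:
        # Phase 1: main-clock edge; capture data_a and compute the gated clock.
        storage_a.clock_edge(data_a)
        gated_clock_rises = bool(enable)
        # Phase 2: gated-clock edge; evaluate the storage element behind the gate.
        if gated_clock_rises:
            storage_b.clock_edge(data_b)

    def simulate_one_phase(data_a: int, data_b: int, enable: int,
                           storage_a: Storage, storage_b: GatedStorage) -> None:
        # Single phase: both storage elements evaluate on the main clock's active edge.
        storage_a.clock_edge(data_a)
        storage_b.clock_edge(data_b, enable)

    a1, b1 = Storage(), Storage()
    simulate_two_phase(5, 7, 1, a1, b1)
    a2, b2 = Storage(), GatedStorage()
    simulate_one_phase(5, 7, 1, a2, b2)
    print(b1.q, b2.q)   # both forms capture the same value: 7 7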
  • FIG. 10 is a system diagram for partitioned simulation. In embodiments, the computer system 1000 for partitioned design simulation may comprise one or more processors 1010 coupled to a memory 1012 and a display 1014. The one or more processors 1010 may be coupled to a Design Library 1020, a Graph Module 1030, a Partition Module 1040, a Selector Module 1050, and an Evaluation Module 1060. In at least one embodiment, the one or more processors 1010 may accomplish one or more of the graph, partition, selector, and evaluation functions. In at least one embodiment, all of these functions will be accomplished by the one or more processors 1010.
  • The one or more processors 1010 are coupled to the memory 1012 which stores instructions, system support data, intermediate data, analysis, and the like. The one or more processors 1010 may be coupled to an electronic display 1014. The display 1014 may be any electronic display, including but not limited to, a computer display, a laptop screen, a net-book screen, a tablet computer screen, a cell phone display, a mobile device display, a remote with a display, a television, a projector, or the like.
  • The one or more processors 1010 may be configured to obtain a high-level design 1020 for simulation. The high-level design may comprise various aspects of a design including a system to be simulated, a simulation test bench, simulation models, simulation vectors, and the like.
  • The system 1000 may be configured to determine a graph representation for the high-level design 1030. A control data flow graph (CDFG) may be determined for an RTL representation of a high-level design. The CDFG may be folded into a graph of a combinational region and a state region. Other graphical representations may be determined for a representation of a high-level design, including a directed graph, an acyclic directed graph, a Petri net, or another graph appropriate to the design problem.
  • The system 1000 may be configured to partition the graph into sub-graphs. A graph may be partitioned into sub-graphs by a variety of means including static means, arbitrary means, and the like. Static partitioning, for example, may be based on creating value read/write locality, balancing of computation by level, creating event-change locality, reducing unique node (or primitive) types in a given partition, and the like. Arbitrary partitioning may be based on random, ad hoc, heuristic, and other means, for example.
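  • One static partitioning heuristic noted above, creating value read locality by collecting the readers of a simulation value, might be sketched in Python as follows. The collect_readers and partition_by_readers helpers, the size limit, and the net and primitive names are illustrative assumptions rather than a disclosed algorithm.

    from collections import defaultdict
    from typing import Dict, List, Set

    def collect_readers(reads: Dict[str, List[str]]) -> Dict[str, Set[str]]:
        """Map each simulation value to the set of primitives that read it."""
        readers: Dict[str, Set[str]] = defaultdict(set)
        for node, node_inputs in reads.items():
            for value in node_inputs:
                readers[value].add(node)
        return readers

    def partition_by_readers(reads: Dict[str, List[str]], max_size: int) -> List[Set[str]]:
        """Greedily keep the readers of each value together to create read locality."""
        partitions: List[Set[str]] = []
        placed: Set[str] = set()
        for value, nodes in collect_readers(reads).items():
            group = nodes - placed
            if not group:
                continue
            # Put the whole reader group into the current sub-graph if it fits.
            if partitions and len(partitions[-1]) + len(group) <= max_size:
                partitions[-1] |= group
            else:
                partitions.append(set(group))
            placed |= group
        return partitions

    # Four primitives reading three shared nets; the readers of net_a stay together.
    reads = {"and1": ["net_a", "net_b"], "or1": ["net_a", "net_c"],
             "xor1": ["net_b", "net_c"], "mux1": ["net_c", "net_a"]}
    print(partition_by_readers(reads, max_size=3))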
  • The system 1000 may be configured to select a subset of the sub-graphs for simulation based on input-change bits. A partition or subset of a graph may be instrumented with an input change bit. An input change bit may be associated with a vertex (node) of a graph, or may be associated with a cluster of nodes of a graph. An input change bit may indicate that one or more inputs to a partition or subset of a graph (e.g. a sub-graph) have changed over some time period. Selection of a subset of sub-graphs may be made based on a change in status of one or more input change bits.
  • The system 1000 may be configured to evaluate the subset of the sub-graphs to produce a simulation result for the high-level design. A subset of sub-graphs may be simulated based on a change of status of one or more input change bits. For other subsets of sub-graphs, the input change bits may indicate that there has been no change of status. Thus, selectively simulating only the subsets of sub-graphs whose input change bits changed status may reduce computational cost by ignoring the subsets of sub-graphs whose input change bits did not change.
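  • At sub-graph granularity, the selection and evaluation steps might reduce to a filter over the input-change bits, as in the brief Python sketch below; the SubGraph structure, the simulate_selected helper, and the evaluate callables are illustrative assumptions.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class SubGraph:
        name: str
        input_change_bit: bool
        evaluate: Callable[[], None]    # evaluates the primitives of this sub-graph

    def simulate_selected(sub_graphs: List[SubGraph]) -> List[str]:
        selected = [sg for sg in sub_graphs if sg.input_change_bit]  # select on the bits
        for sg in selected:
            sg.input_change_bit = False
            sg.evaluate()                                            # evaluate only the subset
        return [sg.name for sg in selected]

    sub_graphs = [SubGraph("sg0", True, lambda: print("evaluating sg0")),
                  SubGraph("sg1", False, lambda: None),
                  SubGraph("sg2", True, lambda: print("evaluating sg2"))]
    print("evaluated:", simulate_selected(sub_graphs))   # sg1 is skipped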
  • The system 1000 may include a computer program product comprising code for obtaining a high-level design for simulation; code for determining a graph representation for the high-level design; code for partitioning the graph into sub-graphs; code for selecting a subset of the sub-graphs for simulation based on input-change bits; and code for evaluating the subset of the sub-graphs to produce a simulation result for the high-level design.
  • Each of the above methods may be executed on one or more processors on one or more computer systems. Embodiments may include various forms of distributed computing, client/server computing, and cloud based computing. Further, it will be understood that the depicted steps or boxes contained in this disclosure's flow charts are solely illustrative and explanatory. The steps may be modified, omitted, repeated, or re-ordered without departing from the scope of this disclosure. Further, each step may contain one or more sub-steps. While the foregoing drawings and description set forth functional aspects of the disclosed systems, no particular implementation or arrangement of software and/or hardware should be inferred from these descriptions unless explicitly stated or otherwise clear from the context. All such arrangements of software and/or hardware are intended to fall within the scope of this disclosure.
  • The block diagrams and flowchart illustrations depict methods, apparatus, systems, and computer program products. The elements and combinations of elements in the block diagrams and flow diagrams show functions, steps, or groups of steps of the methods, apparatus, systems, computer program products and/or computer-implemented methods. Any and all such functions, generally referred to herein as a "circuit," "module," or "system," may be implemented by computer program instructions, by special-purpose hardware-based computer systems, by combinations of special purpose hardware and computer instructions, by combinations of general purpose hardware and computer instructions, and so on.
  • A programmable apparatus which executes any of the above mentioned computer program products or computer-implemented methods may include one or more microprocessors, microcontrollers, embedded microcontrollers, programmable digital signal processors, programmable devices, programmable gate arrays, programmable array logic, memory devices, application specific integrated circuits, or the like. Each may be suitably employed or configured to process computer program instructions, execute computer logic, store computer data, and so on.
  • It will be understood that a computer may include a computer program product from a computer-readable storage medium and that this medium may be internal or external, removable and replaceable, or fixed. In addition, a computer may include a Basic Input/Output System (BIOS), firmware, an operating system, a database, or the like that may include, interface with, or support the software and hardware described herein.
  • Embodiments of the present invention are neither limited to conventional computer applications nor the programmable apparatus that run them. To illustrate: the embodiments of the presently claimed invention could include an optical computer, quantum computer, analog computer, or the like. A computer program may be loaded onto a computer to produce a particular machine that may perform any and all of the depicted functions. This particular machine provides a means for carrying out any and all of the depicted functions.
  • Any combination of one or more computer readable media may be utilized including but not limited to: a non-transitory computer readable medium for storage; an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor computer readable storage medium or any suitable combination of the foregoing; a portable computer diskette; a hard disk; a random access memory (RAM); a read-only memory (ROM), an erasable programmable read-only memory (EPROM, Flash, MRAM, FeRAM, or phase change memory); an optical fiber; a portable compact disc; an optical storage device; a magnetic storage device; or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • It will be appreciated that computer program instructions may include computer executable code. A variety of languages for expressing computer program instructions may include without limitation C, C++, Java, JavaScript™, ActionScript™, assembly language, Lisp, Perl, Tcl, Python, Ruby, hardware description languages, database programming languages, functional programming languages, imperative programming languages, and so on. In embodiments, computer program instructions may be stored, compiled, or interpreted to run on a computer, a programmable data processing apparatus, a heterogeneous combination of processors or processor architectures, and so on. Without limitation, embodiments of the present invention may take the form of web-based computer software, which includes client/server software, software-as-a-service, peer-to-peer software, or the like.
  • In embodiments, a computer may enable execution of computer program instructions including multiple programs or threads. The multiple programs or threads may be processed approximately simultaneously to enhance utilization of the processor and to facilitate substantially simultaneous functions. By way of implementation, any and all methods, program codes, program instructions, and the like described herein may be implemented in one or more threads which may in turn spawn other threads, which may themselves have priorities associated with them. In some embodiments, a computer may process these threads based on priority or other order.
  • Unless explicitly stated or otherwise clear from the context, the verbs “execute” and “process” may be used interchangeably to indicate execute, process, interpret, compile, assemble, link, load, or a combination of the foregoing. Therefore, embodiments that execute or process computer program instructions, computer-executable code, or the like may act upon the instructions or code in any and all of the ways described. Further, the method steps shown are intended to include any suitable method of causing one or more parties or entities to perform the steps. The parties performing a step, or portion of a step, need not be located within a particular geographic location or country boundary. For instance, if an entity located within the United States causes a method step, or portion thereof, to be performed outside of the United States then the method is considered to be performed in the United States by virtue of the causal entity.
  • While the invention has been disclosed in connection with preferred embodiments shown and described in detail, various modifications and improvements thereon will become apparent to those skilled in the art. Accordingly, the foregoing examples should not limit the spirit and scope of the present invention; rather it should be understood in the broadest sense allowable by law.

Claims (29)

What is claimed is:
1. A computer-implemented method for design simulation comprising:
obtaining a high-level design for simulation;
determining a graph representation for the high-level design;
partitioning the graph representation into sub-graphs;
selecting a subset of the sub-graphs for simulation based on input-change bits; and
evaluating the subset of the sub-graphs to produce a simulation result for the high-level design.
2. The method of claim 1 further comprising propagating the simulation result, based on the evaluating of the subset of the sub-graphs, to a remainder of the graph representation for further simulation.
3. The method of claim 1 wherein the subset of the sub-graphs for simulation is selected based on an input-change bit, for each sub-graph in the subset of the sub-graphs, being set to true.
4. The method of claim 1 wherein the partitioning further comprises determining sub-graphs based on levels of logic.
5. The method of claim 1 wherein the partitioning into sub-graphs is based on reducing a number of signals crossing sub-graph boundaries.
6. The method of claim 1 further comprising using one or more of the input-change bits on a level as part of the selecting of the subset for simulation.
7. The method of claim 1 wherein the evaluating is based on an oblivious simulation model.
8. The method of claim 7 further comprising allocating processes from the oblivious simulation model to a plurality of processors.
9. The method of claim 8 further comprising selectively evaluating the processes based on an input change bit set being set to valid.
10. The method of claim 1 wherein the graph representation includes a control data flow graph.
11. The method of claim 10 wherein the control data flow graph includes a graph of a combinational region of the high-level design and a state region of the high-level design.
12. The method of claim 1 wherein the partitioning includes creating value locality.
13. The method of claim 1 wherein the partitioning includes creating event change locality.
14. The method of claim 1 wherein the partitioning includes balancing of levels.
15. The method of claim 1 wherein the partitioning includes collecting of readers of a simulation value.
16. The method of claim 1 wherein the partitioning includes separating primitives evenly across clusters within a level.
17. The method of claim 1 wherein the partitioning includes clustering sibling primitives.
18. The method of claim 1 further comprising modifying clock gating.
19. The method of claim 18 wherein the modifying clock gating includes moving gating to storage elements.
20. The method of claim 19 wherein the modifying clock gating includes eliminating a clock gate to a combinational logic portion.
21. The method of claim 19 wherein the modifying clock gating is only for simulation purposes.
22. The method of claim 18 wherein clock gating is restructured to combine phases and to occur on an active edge of clock.
23. The method of claim 1 further comprising determining that an output of one of the sub-graphs has a change of state and copying that change of state to a processor where a process for a second sub-graph uses that change of state as input to the second sub-graph.
24. The method of claim 23 further comprising copying a sequence of changes of state for that output of one of the sub-graphs and using the sequence as a series of inputs to the second sub-graph.
25. The method of claim 1 further comprising copying the high-level design and simulating the high-level design as well as its copy on at least two different processors.
26. The method of claim 25 wherein the copying is performed to accomplish the simulating.
27. The method of claim 25 further comprising maintaining primitives with the same function in a single cluster within the copy of the high-level design.
28. A computer system for design simulation comprising:
a memory which stores instructions;
one or more processors coupled to the memory wherein the one or more processors are configured to:
obtain a high-level design for simulation;
determine a graph representation for the high-level design;
partition the graph representation into sub-graphs;
select a subset of the sub-graphs for simulation based on input-change bits; and
evaluate the subset of the sub-graphs to produce a simulation result for the high-level design.
29. A computer program product embodied in a non-transitory computer readable medium for design simulation comprising:
code for obtaining a high-level design for simulation;
code for determining a graph representation for the high-level design;
code for partitioning the graph representation into sub-graphs;
code for selecting a subset of the sub-graphs for simulation based on input-change bits; and
code for evaluating the subset of the sub-graphs to produce a simulation result for the high-level design.
US13/646,669 2012-04-27 2012-10-06 Selective execution for partitioned parallel simulations Abandoned US20130290919A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US13/646,669 US20130290919A1 (en) 2012-04-27 2012-10-06 Selective execution for partitioned parallel simulations
US15/457,507 US10853544B2 (en) 2012-04-27 2017-03-13 Selective execution for partitioned parallel simulations

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201261639799P 2012-04-27 2012-04-27
US13/646,669 US20130290919A1 (en) 2012-04-27 2012-10-06 Selective execution for partitioned parallel simulations

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/457,507 Continuation US10853544B2 (en) 2012-04-27 2017-03-13 Selective execution for partitioned parallel simulations

Publications (1)

Publication Number Publication Date
US20130290919A1 true US20130290919A1 (en) 2013-10-31

Family

ID=49478511

Family Applications (2)

Application Number Title Priority Date Filing Date
US13/646,669 Abandoned US20130290919A1 (en) 2012-04-27 2012-10-06 Selective execution for partitioned parallel simulations
US15/457,507 Active 2032-11-12 US10853544B2 (en) 2012-04-27 2017-03-13 Selective execution for partitioned parallel simulations

Family Applications After (1)

Application Number Title Priority Date Filing Date
US15/457,507 Active 2032-11-12 US10853544B2 (en) 2012-04-27 2017-03-13 Selective execution for partitioned parallel simulations

Country Status (1)

Country Link
US (2) US20130290919A1 (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150117216A1 (en) * 2013-10-31 2015-04-30 Telefonaktiebolaget L M Ericsson (Publ) Method and system for load balancing at a data network
US20160125118A1 (en) * 2014-10-31 2016-05-05 Wave Semiconductor, Inc. Computing resource allocation based on flow graph translation
US9529963B1 (en) * 2015-03-27 2016-12-27 Microsemi Storage Solutions (U.S.), Inc. Method and system for partitioning a verification testbench
CN106484941A (en) * 2015-08-28 2017-03-08 三星电子株式会社 Method of designing integrated circuit and the integrated clock gating device integrated with trigger
US9916405B2 (en) * 2016-02-22 2018-03-13 International Business Machines Corporation Distributed timing analysis of a partitioned integrated circuit design
CN109196529A (en) * 2016-04-25 2019-01-11 谷歌有限责任公司 Quantum auxiliary optimization
US10346573B1 (en) * 2015-09-30 2019-07-09 Cadence Design Systems, Inc. Method and system for performing incremental post layout simulation with layout edits
US10430539B1 (en) * 2016-12-16 2019-10-01 Xilinx, Inc. Method and apparatus for enhancing performance by moving or adding a pipelined register stage in a cascaded chain
US10853544B2 (en) 2012-04-27 2020-12-01 Synopsys, Inc. Selective execution for partitioned parallel simulations
US11010516B2 (en) * 2018-11-09 2021-05-18 Nvidia Corp. Deep learning based identification of difficult to test nodes
CN114638184A (en) * 2022-05-23 2022-06-17 南昌大学 Gate-level circuit simulation method, system, storage medium and equipment

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107180117B (en) * 2017-06-30 2020-12-18 东软集团股份有限公司 Chart recommendation method and device and computer equipment
US11093679B2 (en) 2018-03-14 2021-08-17 International Business Machines Corporation Quantum circuit decomposition by integer programming
CN112213745A (en) * 2019-11-27 2021-01-12 中国科学院微小卫星创新研究院 Satellite upper note receiving processor simulator based on GPU
CN112287622B (en) * 2020-12-28 2021-03-09 中国人民解放军国防科技大学 Quick turbulence numerical simulation method and device based on link direction manual compression

Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5541849A (en) * 1990-04-06 1996-07-30 Lsi Logic Corporation Method and system for creating and validating low level description of electronic design from higher level, behavior-oriented description, including estimation and comparison of timing parameters
US5544066A (en) * 1990-04-06 1996-08-06 Lsi Logic Corporation Method and system for creating and validating low level description of electronic design from higher level, behavior-oriented description, including estimation and comparison of low-level design constraints
US5553002A (en) * 1990-04-06 1996-09-03 Lsi Logic Corporation Method and system for creating and validating low level description of electronic design from higher level, behavior-oriented description, using milestone matrix incorporated into user-interface
US5555201A (en) * 1990-04-06 1996-09-10 Lsi Logic Corporation Method and system for creating and validating low level description of electronic design from higher level, behavior-oriented description, including interactive system for hierarchical display of control and dataflow information
US5557531A (en) * 1990-04-06 1996-09-17 Lsi Logic Corporation Method and system for creating and validating low level structural description of electronic design from higher level, behavior-oriented description, including estimating power dissipation of physical implementation
US5572436A (en) * 1990-04-06 1996-11-05 Lsi Logic Corporation Method and system for creating and validating low level description of electronic design
US5696942A (en) * 1995-03-24 1997-12-09 Sun Microsystems, Inc. Cycle-based event-driven simulator for hardware designs
US6112023A (en) * 1997-02-24 2000-08-29 Lucent Technologies Inc. Scheduling-based hardware-software co-synthesis of heterogeneous distributed embedded systems
US6212668B1 (en) * 1996-05-28 2001-04-03 Altera Corporation Gain matrix for hierarchical circuit partitioning
US6216252B1 (en) * 1990-04-06 2001-04-10 Lsi Logic Corporation Method and system for creating, validating, and scaling structural description of electronic device
US6230303B1 (en) * 1997-02-24 2001-05-08 Lucent Technologies Inc. Proximity-based cluster allocation for hardware-software co-synthesis of heterogeneous distributed embedded systems
US20050091025A1 (en) * 2003-08-26 2005-04-28 Wilson James C. Methods and systems for improved integrated circuit functional simulation
US7143021B1 (en) * 2000-10-03 2006-11-28 Cadence Design Systems, Inc. Systems and methods for efficiently simulating analog behavior of designs having hierarchical structure
US20070044079A1 (en) * 2005-06-02 2007-02-22 Tharas Systems Inc. A system and method for compiling a description of an electronic circuit to instructions adapted to execute on a plurality of processors
US7673263B2 (en) * 1997-11-05 2010-03-02 Fujitsu Limited Method for verifying and representing hardware by decomposition and partitioning
US8418115B1 (en) * 2010-05-11 2013-04-09 Xilinx, Inc. Routability based placement for multi-die integrated circuits
US8738349B2 (en) * 2010-04-20 2014-05-27 The Regents Of The University Of Michigan Gate-level logic simulator using multiple processor architectures
US8738559B2 (en) * 2011-01-24 2014-05-27 Microsoft Corporation Graph partitioning with natural cuts

Family Cites Families (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5678028A (en) 1994-10-25 1997-10-14 Mitsubishi Electric Information Technology Center America, Inc. Hardware-software debugger using simulation speed enhancing techniques including skipping unnecessary bus cycles, avoiding instruction fetch simulation, eliminating the need for explicit clock pulse generation and caching results of instruction decoding
US5649176A (en) * 1995-08-10 1997-07-15 Virtual Machine Works, Inc. Transition analysis and circuit resynthesis method and device for digital circuit modeling
US6487704B1 (en) 1997-08-07 2002-11-26 Verisity Design, Inc. System and method for identifying finite state machines and verifying circuit designs
WO1999032628A2 (en) 1997-12-23 1999-07-01 Children's Medical Center Corp. Wip, a wasp-associated protein
US6927318B2 (en) 1997-12-23 2005-08-09 The Children's Medical Center Corporation WIP, a WASP-associated protein
US6629297B2 (en) 2000-12-14 2003-09-30 Tharas Systems Inc. Tracing the change of state of a signal in a functional verification system
US6480988B2 (en) 2000-12-14 2002-11-12 Tharas Systems, Inc. Functional verification of both cycle-based and non-cycle based designs
US6470480B2 (en) 2000-12-14 2002-10-22 Tharas Systems, Inc. Tracing different states reached by a signal in a functional verification system
US6625786B2 (en) 2000-12-14 2003-09-23 Tharas Systems, Inc. Run-time controller in a functional verification system
US6691287B2 (en) 2000-12-14 2004-02-10 Tharas Systems Inc. Functional verification system
US7805697B2 (en) * 2002-12-06 2010-09-28 Multigig Inc. Rotary clock synchronous fabric
JP2006155533A (en) * 2004-12-01 2006-06-15 Matsushita Electric Ind Co Ltd Circuit information generation device and circuit information generation method
US7370312B1 (en) 2005-01-31 2008-05-06 Bluespec, Inc. System and method for controlling simulation of hardware in a hardware development process
US7509602B2 (en) 2005-06-02 2009-03-24 Eve S.A. Compact processor element for a scalable digital logic verification and emulation system
US7548842B2 (en) 2005-06-02 2009-06-16 Eve S.A. Scalable system for simulation and emulation of electronic circuits using asymmetrical evaluation and canvassing instruction processors
US20060277428A1 (en) 2005-06-02 2006-12-07 Tharas Systems Inc. A system and method for simulation of electronic circuits generating clocks and delaying the execution of instructions in a plurality of processors
US20060277020A1 (en) 2005-06-02 2006-12-07 Tharas Systems A reconfigurable system for verification of electronic circuits using high-speed serial links to connect asymmetrical evaluation and canvassing instruction processors
US7761275B2 (en) * 2005-12-19 2010-07-20 International Business Machines Corporation Synthesizing current source driver model for analysis of cell characteristics
US8359186B2 (en) 2006-01-26 2013-01-22 Subbu Ganesan Method for delay immune and accelerated evaluation of digital circuits by compiling asynchronous completion handshaking means
US8849644B2 (en) 2007-12-20 2014-09-30 Mentor Graphics Corporation Parallel simulation using an ordered priority of event regions
US7458050B1 (en) * 2008-03-21 2008-11-25 International Business Machines Corporation Methods to cluster boolean functions for clock gating
US10423740B2 (en) 2009-04-29 2019-09-24 Synopsys, Inc. Logic simulation and/or emulation which follows hardware semantics
GB2486003B (en) * 2010-12-01 2016-09-14 Advanced Risc Mach Ltd Integrated circuit, clock gating circuit, and method
US20130290919A1 (en) 2012-04-27 2013-10-31 Synopsys, Inc. Selective execution for partitioned parallel simulations

Patent Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6216252B1 (en) * 1990-04-06 2001-04-10 Lsi Logic Corporation Method and system for creating, validating, and scaling structural description of electronic device
US5544066A (en) * 1990-04-06 1996-08-06 Lsi Logic Corporation Method and system for creating and validating low level description of electronic design from higher level, behavior-oriented description, including estimation and comparison of low-level design constraints
US5553002A (en) * 1990-04-06 1996-09-03 Lsi Logic Corporation Method and system for creating and validating low level description of electronic design from higher level, behavior-oriented description, using milestone matrix incorporated into user-interface
US5555201A (en) * 1990-04-06 1996-09-10 Lsi Logic Corporation Method and system for creating and validating low level description of electronic design from higher level, behavior-oriented description, including interactive system for hierarchical display of control and dataflow information
US5557531A (en) * 1990-04-06 1996-09-17 Lsi Logic Corporation Method and system for creating and validating low level structural description of electronic design from higher level, behavior-oriented description, including estimating power dissipation of physical implementation
US5572436A (en) * 1990-04-06 1996-11-05 Lsi Logic Corporation Method and system for creating and validating low level description of electronic design
US5541849A (en) * 1990-04-06 1996-07-30 Lsi Logic Corporation Method and system for creating and validating low level description of electronic design from higher level, behavior-oriented description, including estimation and comparison of timing parameters
US5696942A (en) * 1995-03-24 1997-12-09 Sun Microsystems, Inc. Cycle-based event-driven simulator for hardware designs
US6212668B1 (en) * 1996-05-28 2001-04-03 Altera Corporation Gain matrix for hierarchical circuit partitioning
US6230303B1 (en) * 1997-02-24 2001-05-08 Lucent Technologies Inc. Proximity-based cluster allocation for hardware-software co-synthesis of heterogeneous distributed embedded systems
US6112023A (en) * 1997-02-24 2000-08-29 Lucent Technologies Inc. Scheduling-based hardware-software co-synthesis of heterogeneous distributed embedded systems
US7673263B2 (en) * 1997-11-05 2010-03-02 Fujitsu Limited Method for verifying and representing hardware by decomposition and partitioning
US7143021B1 (en) * 2000-10-03 2006-11-28 Cadence Design Systems, Inc. Systems and methods for efficiently simulating analog behavior of designs having hierarchical structure
US7257525B2 (en) * 2000-10-03 2007-08-14 Cadence Design Systems, Inc. Systems and methods for efficiently simulating analog behavior of designs having hierarchical structure
US7328143B2 (en) * 2000-10-03 2008-02-05 Cadence Design Systems, Inc. Systems and methods for efficiently simulating analog behavior of designs having hierarchical structure
US7415403B2 (en) * 2000-10-03 2008-08-19 Cadence Design Systems, Inc. Systems and methods for efficiently simulating analog behavior of designs having hierarchical structure
US20050091025A1 (en) * 2003-08-26 2005-04-28 Wilson James C. Methods and systems for improved integrated circuit functional simulation
US20070044079A1 (en) * 2005-06-02 2007-02-22 Tharas Systems Inc. A system and method for compiling a description of an electronic circuit to instructions adapted to execute on a plurality of processors
US8738349B2 (en) * 2010-04-20 2014-05-27 The Regents Of The University Of Michigan Gate-level logic simulator using multiple processor architectures
US8418115B1 (en) * 2010-05-11 2013-04-09 Xilinx, Inc. Routability based placement for multi-die integrated circuits
US8738559B2 (en) * 2011-01-24 2014-05-27 Microsoft Corporation Graph partitioning with natural cuts

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Chatterjee et al., GCS: High-Performance Gate-Level Simulation with GPGPUs; 20-24 April 2009; Design, Automation & Test in Europe Conference & Exhibition; DATE '09; Page(s): 1332 - 1337 *
Chatterjee et al.; Gate-level simulation with GPU computing; June 2011; ACM Transactions on Design Automation of Electronic Systems, Vol. 16, No. 3, Article 30; pgs. 1-26 *
Cong et al., Instruction Set Extension with Shadow Registers for Configurable Processors, 2005, Proceedings of the 2005 ACM/SIGDA 13th international, FPGA '05, pgs. 99-106 *
David M. Lewis, A hierarchical compiled code event-driven logic simulator, June 1991, IEEE Transactions On Computer-Aided Design, Vol. 10, No. 6, pages 726-737 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10853544B2 (en) 2012-04-27 2020-12-01 Synopsys, Inc. Selective execution for partitioned parallel simulations
US20150117216A1 (en) * 2013-10-31 2015-04-30 Telefonaktiebolaget L M Ericsson (Publ) Method and system for load balancing at a data network
US9338097B2 (en) * 2013-10-31 2016-05-10 Telefonaktiebolaget L M Ericsson (Publ) Method and system for load balancing at a data network
US20160125118A1 (en) * 2014-10-31 2016-05-05 Wave Semiconductor, Inc. Computing resource allocation based on flow graph translation
US10042966B2 (en) * 2014-10-31 2018-08-07 Wave Computing, Inc. Computing resource allocation based on flow graph translation
US9529963B1 (en) * 2015-03-27 2016-12-27 Microsemi Storage Solutions (U.S.), Inc. Method and system for partitioning a verification testbench
CN106484941A (en) * 2015-08-28 2017-03-08 三星电子株式会社 Method of designing integrated circuit and the integrated clock gating device integrated with trigger
US10346573B1 (en) * 2015-09-30 2019-07-09 Cadence Design Systems, Inc. Method and system for performing incremental post layout simulation with layout edits
US9916405B2 (en) * 2016-02-22 2018-03-13 International Business Machines Corporation Distributed timing analysis of a partitioned integrated circuit design
CN109196529A (en) * 2016-04-25 2019-01-11 谷歌有限责任公司 Quantum auxiliary optimization
US11449760B2 (en) * 2016-04-25 2022-09-20 Google Llc Quantum assisted optimization
CN109196529B (en) * 2016-04-25 2022-12-20 谷歌有限责任公司 Quantum-assisted optimization
US10430539B1 (en) * 2016-12-16 2019-10-01 Xilinx, Inc. Method and apparatus for enhancing performance by moving or adding a pipelined register stage in a cascaded chain
US11010516B2 (en) * 2018-11-09 2021-05-18 Nvidia Corp. Deep learning based identification of difficult to test nodes
US20210295169A1 (en) * 2018-11-09 2021-09-23 Nvidia Corp. Deep learning based identification of difficult to test nodes
US11574097B2 (en) * 2018-11-09 2023-02-07 Nvidia Corp. Deep learning based identification of difficult to test nodes
CN114638184A (en) * 2022-05-23 2022-06-17 南昌大学 Gate-level circuit simulation method, system, storage medium and equipment

Also Published As

Publication number Publication date
US20170185700A1 (en) 2017-06-29
US10853544B2 (en) 2020-12-01

Similar Documents

Publication Publication Date Title
US10853544B2 (en) Selective execution for partitioned parallel simulations
US11023360B2 (en) Systems and methods for configuring programmable logic devices for deep learning networks
Huang et al. Taskflow: A lightweight parallel and heterogeneous task graph computing system
US10025566B1 (en) Scheduling technique to transform dataflow graph into efficient schedule
Xiao et al. A survey on agent-based simulation using hardware accelerators
US8738349B2 (en) Gate-level logic simulator using multiple processor architectures
US20130226535A1 (en) Concurrent simulation system using graphic processing units (gpu) and method thereof
US9558306B2 (en) Retiming a design for efficient parallel simulation
Holst et al. High-throughput logic timing simulation on GPGPUs
Sen et al. Parallel cycle based logic simulation using graphics processing units
Schmidt et al. Exploiting thread and data level parallelism for ultimate parallel SystemC simulation
Wernsing et al. Elastic computing: A portable optimization framework for hybrid computers
He et al. ISBA: An independent set-based algorithm for automated partial reconfiguration module generation
WO2018076979A1 (en) Detection method and apparatus for data dependency between instructions
Bertacco et al. On the use of GP-GPUs for accelerating compute-intensive EDA applications
Goetschalckx et al. Depfin: a 12-nm depth-first, high-resolution CNN processor for IO-efficient inference
Ducroux et al. Fast and accurate power annotated simulation: Application to a many-core architecture
US9507896B2 (en) Quasi-dynamic scheduling and dynamic scheduling for efficient parallel simulation
Chen et al. Sparsity-oriented sparse solver design for circuit simulation
US20230004430A1 (en) Estimation of power profiles for neural network models running on ai accelerators
US11842132B1 (en) Multi-cycle power analysis of integrated circuit designs
Lin Advances in parallel programming for electronic design automation
Lu Effectively parallelizing electronic design automation algorithms using the operator formulation
Shah et al. Efficient Execution of Irregular Dataflow Graphs: Hardware/Software Co-optimization for Probabilistic AI and Sparse Linear Algebra
Huang et al. Parallel and Heterogeneous Timing Analysis: Partition, Algorithm, and System

Legal Events

Date Code Title Description
AS Assignment

Owner name: SYNOPSYS, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:NARAYANASWAMY, RAMESH;SAHAI, PARAMINDER S;CHIEN, CHIAHON;REEL/FRAME:029121/0198

Effective date: 20121002

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION