CA2152110C - Interrupt-based hardware support for profiling system performance - Google Patents

Interrupt-based hardware support for profiling system performance

Info

Publication number
CA2152110C
CA2152110C CA002152110A CA2152110A CA2152110C CA 2152110 C CA2152110 C CA 2152110C CA 002152110 A CA002152110 A CA 002152110A CA 2152110 A CA2152110 A CA 2152110A CA 2152110 C CA2152110 C CA 2152110C
Authority
CA
Canada
Prior art keywords
counter
interrupt
signal
profiling
memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CA002152110A
Other languages
French (fr)
Other versions
CA2152110A1 (en
Inventor
Prathima Agrawal
Aaron Jay Goldberg
John Andrew Trotter
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AT&T Corp
Original Assignee
AT&T Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AT&T Corp filed Critical AT&T Corp
Publication of CA2152110A1 publication Critical patent/CA2152110A1/en
Application granted granted Critical
Publication of CA2152110C publication Critical patent/CA2152110C/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466Performance evaluation by tracing or monitoring
    • G06F11/348Circuit details, i.e. tracer hardware
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3447Performance evaluation by modeling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/805Real-time
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/86Event-based monitoring
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/88Monitoring involving counting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/885Monitoring specific for caches

Abstract

Fueled by higher clock rates and superscalar technologies, growth in processor speed continues to outpace improvement in memory system performance. Reflecting this trend, architects are developing increasingly complex memory hierarchies to mask the speed gap, compiler writers are adding locality enhancing transformations to better utilize complex memory hierarchies, and applications programmers are re-coding their algorithms to exploit memory systems. All of these groups need empirical data on memory behavior to guide their optimizations. This paper describes how to combine simple hardware support and sampling techniques to obtain such data without appreciably perturbing system performance. By augmenting a cache miss counter with a compare register and interrupt line such that the processor is interrupted when the counter matches the compare value, we can sample system state and develop cache miss profiles that associate cache misses with specific processes, procedures, call stacks, addresses, or user defined aspects of system state. This idea is implemented in the Mprof prototype that profiles data stall cycles, first level cache misses, and second level misses on the sun Sparc 10/41. Simple case studies are provided to illustrate Mprof's features.

Description

21~2~.10 1 a - 1 -Interrupt-Based Hardware Support for Profiling System Performance Background of the Invention Field of the Invention The present invention relates generally to computer performance optimization and, more particularly, to performance profiling of computer systems.
Related Art Fueled by higher clock rates and superscalar technologies, growth in processor speed continues to outpace improvement in memory system performance.
Reflecting this trend, various techniques have been implemented to compensate for this lag in performance. One technique, implemented by computer system architects, is the development of increasingly complex memory hierarchies to mask the lag in memory system performance. Another technique, implemented by compiler writers, is the addition of locality-enhancing transformations to better utilize complex memory hierarchies. Still another technique, implemented by applications programmers, is the re-coding of algorithms to exploit memory systems.
A11 of the above techniques require empirical data on memory system behavior to guide the particular optimization. Presently, there are a number of microprocessor and workstation vendors who provide hardware support for profiling memory system performance to meet this need for empirical data. For example, the Intel Pentium microprocessor as well as the Sun ECache (external cache) controller in the Sun Sparc 10's and SparcCenter 1000's, feature cache miss count registers.
While these simple counters provide a first step toward understanding memory system utilization, there are a number of drawbacks in utilizing cache miss count registers. In particular, simple cache miss counters do not directly relate cache behavior to the specific portion of the software which is incurring cache misses.
Instead, they only indicate the aggregate number of misses. This information, however, will not show which process, procedure, or data structure was responsible for the misses.
What is needed, therefore, is a means for obtaining memory system performance data to develop cache miss profiles that associate cache misses with specific processes, procedures, procedure call stacks, and user defined aspects of the system state without interfering with system performance.
Summary of the Invention The present invention is a system and method for profiling computer systems.
The present invention is an implementation of a system profiler that samples system state using interrupts generated by a cache miss counter supplemented with a compare register and interrupt line.
One advantage of the present invention is that it has the ability to interrupt the processor when the cache miss count matches a predetermined value stored in the compare register. This improvement over conventional approaches enables the present invention to sample the state of the processor at a predetermined number of cache misses or cache stall cycles. The present invention is then able to develop a cache miss profile that associates cache misses with specific processes, procedures, procedure call stacks, or even user defined aspects of system state.
Another advantage of the present invention is its ability to provide significant assistance in isolating performance bottlenecks and guiding. optimization of architectures, operating systems, compilers, and applications.
Still another advantage of the present invention is the implementation of a sampling approach that offers a better compromise between speed, accuracy, flexibility, _ 21~~1~.~
_3_ portability, and extensibility than previous hardware and software methods for understanding memory system performance.
Further features and advantages of the present invention, as well as the structure and operation of various embodiments of the present invention, are described in detail below with reference to the accompanying drawings.
Brief Description of the Figures In the drawings, like reference numbers indicate identical or functionally similarly elements. Additionally, the left-most digit of a reference number identifies the drawing in which the reference number first appears. The present invention will be described with reference to the accompanying drawings, wherein:
Figure 1 is a block diagram of the preferred embodiment of the interrupt-based hardware support system of the present invention;
Figure 2 is a block diagram of the alternative embodiment of the interrupt-based hardware support system of the present invention; and Figure 3 is a per process cache profile for system-wide sampling of second level cache of a SUN Sparc l0/41.
Detailed Description of the Preferred Embodiments A. Introduction The present invention is a system and method for profiling computer system performance. The present invention may be implemented as a memory system profiler, referred to as Mprof, that samples system state using interrupts generated by a loadable _ 21521~.~

decrementer cache miss counter supplemented with a compare register and interrupt line.
First, the use of memory system profiling and the reasons why conventional approaches for understanding memory system performance are inadequate are discussed. Then, the application of the present invention to profiling memory systems is given, with the preferred hardware implementation and alternative embodiments.
Last, case studies are provided to illustrate how the profiles can be used to optimize applications for the cache hierarchy of a Sparc l0/41.
B. Uses of Memory Profiles Memory system profiles can provide significant assistance in isolating performance bottlenecks and guiding optimization of architectures, operating systems, compilers, and applications. Memory systems with multilevel cache hierarchies, write buffers, instruction prefetch units, banked memories, and sophisticated interconnection networks are becoming commonplace.
Commodity personal computers feature primary and secondary caches while cache-coherent multiprocessors are marketed as "desktop" workstations. The complex memory hierarchies in these systems are designed by architects who identify the memory reference characteristics of important benchmark applications and optimize the memory system to deliver performance for these key programs.
Once the architect has optimized the memory hierarchy for specific reference patterns, it is up to software developers to try to coerce applications into generating these patterns. Recent work has demonstrated that compiler transformations can increase data cache utilization by reordering and blocking loop nests, hide memory latency by carefully prefetching data before it is needed, and improve instruction cache performance by reordering basic blocks. For a discussion of compiler transformations ;~

which can increase data cache utilization by reordering and blocking loop nests see Wolf et al., "A Data Locality Optimizing Algorithm," in Proceedings of the ACM
'91 Conference on Programming Language Design and Implementation (199l), pp. 30-44.
For a discussion of compiler transformations which can hide memory latency by carefully prefetching data before it is needed see Mowry, et al., "Design and Evaluation of a Compiler Algorithm for Prefetching," in Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (1992), pp. 62-73. For a discussion of compiler transformations which can improve instruction cache performance by reordering basic blocks see McFarling, S., "Program Optimization for Instruction Caches," in Proceedings of the Third ASPLOS
(l989), pp. 183-93; Pettis et al., "Profile Guided Code Positioning," in Proceedings of the ACM '90 Conference on Programming Language Design and Implementation (1990), pp. 16-27.
Other work has shown that cache coherence tragic in shared memory multiprocessors can be reduced by supporting multiple coherence protocols and letting the compiler or operating system choose the most efficient protocol on a per page basis. See, for example, Veenstra et al., "A Performance Evaluation of Optimal Hybrid Cache Coherency Protocols," in Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (1992), pp. 149-60. Moreover, many of the researchers developing software techniques for enhancing memory system utilization have explicitly noted that memory overhead profiles could be productively used to guide compilers in automatically applying their transformation. See, Mowry, et al., "Design and Evaluation of a Compiler Algorithm for Prefetching," in Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (1992), pp. 62-73; and Veenstra et al., "A Performance Evaluation of Optimal Hybrid Cache Coherency Protocols," in Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (1992), pp. l49-60.

Finally, there is a large literature on rewriting applications to enhance locality and improve memory system performance. See, for example, Cheriton et al., "Restructuring a Parallel Simulation to Improve Cache Behavior in a Shared Memory Multiprocessor: A First Experience," in Proceedings of the International Symposium on Shared Memory Multiprocessing, ( I 99 I ) pp. 109- I I 8. See also, Rothberg et al., "Techniques for Improving the Performance of Sparse Matrix Factorization on Multiprocessor Workstations," in Proceedings of Supercomputing '90 (1990), pp. 232-4I. Given the existence of locality enhancing transformations, programmers need profiling tools to identify where a bottleneck is in the memory system and to quantify the extent of the bottleneck and to measure how transformations actually impact performance.
C. Memory Profiling Techniques The extensive set of applications for memory profiles described above has driven the development of several memory system profilers. These conventional memory profiles may be divided into three techniques or methods: (1) simulation-based techniques, (2) counter-based systems, and (3) software methods.
In simulation memory profiles, a memory reference trace of a program or set of programs is obtained, either by software instrumentation, microcode monitoring, or hardware reference tracing. An example of the software instrumentation approach may be found in Callahan et al., "Analyzing and Visualizing Performance of Memory Hierarchies," in Performance Instrumentation and Visualization, Simmons et al., Addison Wesley (1990), pp. I-26. An example of the microcode monitoring approach may be found in Agarwal et al., " New Technique for Capturing Address Traces Using Microcode," in Proceedings of Thirteenth Annual International Symposium on Computer Architecture, (l986), pp. 119-27. An example of the hardware reference tracing approach may be found in Lenoski et al., "The DASH Prototype: Logic Overhead and Performance," in IEEE Transactions on Parallel and Distributed Systems (1993), vol. 4, no. l, pp. 41-61.

Once the memory reference trace of a program or set of programs is obtained in simulation memory profiles, the trace is then fed into an instrumented memory system simulator. The primary problem with simulation memory profiles is that it is slow, often increasing execution time by factors of l00-I000. The speed problem is exacerbated by the increasing complexity and size of modern memory systems.
For example, the number of memory references needed to simulate the performance of a typical shared bus multiprocessors with 1-30 processors and 1-16 megabyte second level caches on each processor is large enough to make this approach impractical.
Because simulation is so slow, it is used primarily by architects to predict the performance of systems that cannot be measured because they have not yet been built.
Yet, precisely because simulation is expensive, it is crucial that before simulating architects narrow the design space as much as possible by fully understanding the performance of today's systems. As mentioned above, to develop such an understanding, designers are including hardware for measuring cache miss rates and bus utilizations. The drawback of this counting hardware is that it does not give any direct insight into which interactions between the software and memory hierarchy lead to poor performance.
In order to establish a correspondence between hardware event counters and program events, software instrumentation must be added to access the counters.
For example, the number of cache misses that occur can be measured in a procedure and on its behalf by reading the cache miss counter on entry and exit to the procedure and recording the difference in the counter value. However, this software technique has two important shortcomings. First, the instrumentation may perturb the system which is being measured. For a short procedure, the overhead of reading several hardware counters (e.g., read miss, write miss, bus utilization) on entry and exit and storing the z' _g-totals may exceed the work actually done in the procedure. Second, if a context switch occurs between procedure entry and exit, the counter totals will include misses created by other processes.
Progress has been made in addressing some aspects of these shortcomings using model-based techniques to compensate for perturbation, adaptive instrumentation methods to reduce monitoring overhead, and operating systems support to create virtual per-process counters. For a discussion of the shortcomings in using model-based techniques to compensate for perturbation, see Malony et al., "Performance Measurement Intrusion and Perturbation Analysis," in IEEE Transactions on Parallel and Distributed Systems (1992), vol. 3, no. 4, pp. 433-50. For a discussion of adaptive instrumentation methods to reduce monitoring overhead, see Reed et al., "An Overview of the Pablo Performance Analysis Environment," Dept. of Computer Science, U.
of Illinois, Unpublished Manuscript, (1992).
However, there remains a fundamental intrusiveness problem whenever software 1 S instrumentation is inserted into frequently accessed regions of a program to gather fine-grained profiles using hardware counters. To get a sense of the magnitude of the resulting overhead of such a technique, the SPEC integer benchmarks are instrumented to access counters on procedure entry and exit. The overhead is measured in terrr~s of impact on execution time. In practice, both a memory overhead profile and an execution time profile in a single run of the program, so perturbing execution time directly impacts the profile data. Also, it is expected that the profiling techniques will be applied to multithreaded programs where program behavior is highly dependent on the relative execution times of the threads. Therefore, monitoring techniques that 215~11~

impact execution time significantly will be suspect.
In order to measure overhead, the integer SPEC benchmarks are instrumented to record the dynamic number of procedure calls and the user mode execution time of each program. Given these statistics, and assuming the counter code at procedure entry and exit takes a fixed amount of time to execute, counter overhead can be estimated by:
2*(dynamic # of procedure calls) * (time for a counter execution) It should be noted that this formula is conservative because it ignores several potential sources of overhead. For example, inserting counter code will expand the size of the executable, potentially degrading instruction cache performance.
In addition, if the counters are inserted by the compiler, they will interact in an unpredictable way with the optimizer, affecting register allocation, instruction scheduling, and other optimizations.
Subject to these caveats, the above formula can be used to estimate overhead given a value for counter execution time. A generic counter update on procedure entry is represented by the pseudo-code:
Count Array[Procedure Index]-=Read Stall Cycle Counter The procedure exit code is similar, with the subtraction replaced by addition.
The time to execute the pseudo-code can then be divided into counter update bookkeeping time and counter access time. On a RISC architecture, the update to the array involves at least a load, an add, and a store. In addition, to avoid multiple counter updates within a single recursive call, the update code must be protected with a counting semaphore. We refer to the cost of the counting semaphore and the counter update code as counter bookkeeping. On the Sparc 10/4l where the measurements 2~15211~

were made, counter bookkeeping takes about 10 cycles at 40 Mhz or 0.25 microseconds. The cost of counter access is implementation dependent. If the counter is integrated on-chip with the microprocessor, it may cost only a cycle to access. If, on the other hand, the counter is implemented externally using a system bus (SBus) card, it takes around 30 cycles to read on the Sparc 10/41. Table 1 below illustrates the overhead results under these two assumptions about access time.
T abte ru>EUentaflan I: ~verhe$d I~st Book-Program I7ynamie keepingAccess Run %a Overheadf')verhead Qverheacl (sec.) Name Catls (sec.} Time 1 Cycle31# Cycle' FastSlaw (~e.j xlisp 16l20652280.6 8.1 241.8 l95.445 l65 gcc (cc 58477688 29.2 2.9 87.7 129.625 90 1 ) espresso17668I63 8.8 0.9 26.5 57.1 17 62 sc 1749875 0.9 0.1 2.6 12.5 8 28 eqntott 4363768 2.2 0.2 6.5 32.4 7 27 compress25l472 O.l 0.0 0.4 2.4 6 21 Referring to Table l, it can be seen that even with single cycle counter access, some programs experience 45% overhead; with slow counter access programs execute at 1.21 to 2.7 time their uninstrumented speed. Also, it should be noted that the problem becomes more acute when we move from procedure level profiling to finer grained profiling of individual basic blocks or load instructions. In general, hardware counters do not seem to support non-intrusiveness, fine grained profiling.
There is an alternative approach to profiling, referred to as Mtool, that retains the speed of run-time monitoring without requiring hardware instrumentation on the memory system. Mtool isolates memory overhead by comparing the actual execution time of a program segment against the ideal execution time predicted assuming that all memory references are cache hits. For a further discussion of Mtool, see Goldberg - II -et al., "Mtool: An Integrated System for Performance Debugging Shared Memory Multiprocessor Applications," in IEEE Transactions on Parallel and Distributed Systems (l993), vol. 4, no. 1, pp. 28-40. While this approach allows Mtool to quantify memory overhead on a per second basis, it does not provide information on what aspect of the memory hierarchy is being poorly utilized. For example, Mtool does not distinguish overhead due to primary instruction cache misses from overhead due to second level cache misses from write buffer stalls. From an optimizer's perspective, distinguishing the individual sources of performance loss is critical to identifying appropriate performance improving transformations.
Thus, previous memory system profilers have significant shortcomings that discourage their widespread use: simulation is too slow, hardware counters must be accessed via potentially intrusive software instrumentation, and software techniques like Mtool do not identify what part of the memory hierarchy is a bottleneck. These shortcomings have been alleviated with the sampling-based profiler of the present I S invention.
D. Sampling-Based Profilers Sampling techniques are routinely used to build execution time profilers and axe well known to those skilled in the relevant art. Time profilers rely on an external timer to generate a clock interrupt. The interrupt is fielded by a handler that samples the system state, building up a statistical profile of where the system spends its time. For example, most UMX implementations use the system timer interrupt to support a per process, program counter sampling service that is accessed via the profile system call and the prof tool. When the timer interrupt is serviced, if the interrupted process has requested profiling (with the profile call), then a counter associated with the process' program counter is incremented. The program counter counts are then post-processed by prof to produce a table of per procedure execution times.

Assuming that the sampling is independent of the program execution, the sampled profile will provide a good estimate of the actual execution time spent in each procedure. A detailed discussion of the statistical significance of sampled profiles can be found in Ferrari, D., "Computer Systems Performance Evaluation," Prentice-Hall Inc., Englewood Cliffs, NJ (l978), Chapter 2. A description of a robust UNIX
implementation of code profiling is given in McCanne et al., "A Randomized Sampling Clock for CPU Utilization Estimation and Code Profiling," in Proceedings of Winter USENIX (1993).
Moreover, because the sampling rate is relatively low (usually about 60 samples per second for execution time profilers), sampling is less intrusive than inline software instrumentation. And finally, sampling techniques are not limited to execution time profiling. Hall et al., "Call Path Profiling of Monotonic Program Resources in UNIX,"
in Proceedings of 1993 Summer USENIX (1993), describes a sampling-based profiler, Cpprof, which uses information provided by the UNIX System V/proc interface to profile with respect to a wide variety of system events including page faults, floating point exceptions, and bytes read or written. Some previous hardware instrumentation has also supported profiling via sampling. In Cargill et al., "Cheap Hardware Support for Software Debugging and Profiling," in Proceedings of the Second International Conference on Architectural Support for Programming Languages and Operating Systems, (1987), pp. 82-3, for example, it is suggested that interrupts generated by a unit that counts instruction fetches to drive code profiling be used.
Another potential instance of interrupt-based profiling can be found in DEC's 21064 implementation of the Alpha architecture. This microprocessor includes two flexible performance counting registers that can be programmed to increment on a variety of events, generating an interrupt every 28, 2'Z, or 2'6 events. These events include misses in the on-chip instruction and data caches. However, the performance counters are apparently intended as a means of counting events over a long period of time relative to the event frequency. See, for example, Digital Equipment Corporation, DECChip 21064-AA Microprocessor Preliminary Data Sheet. Maynard, MA ( 1992) pp. 3.20-3.22. While the chip architects may have had simple aggregate counts in mind, the same interrupting mechanism can be used to build powerful sampling based profiling environments. Note that the 21064 mechanism is not ideal because it lacks fully programmable compare registers. This limitation prevents one from randomizing the sampling interval to insure statistical significance. Nevertheless, the profiling support in DEC's 21064 is evidence that vendors can implement sampling-based profiling in production systems.
Below, the requirements for the present invention are made explicit in the description of the preferred embodiment of the present invention. This embodiment counts primary and secondary cache misses and CPU data stall cycles.
Alternative embodiments could easily be used to monitor other events like cache service time, TLB
misses and write buffer stalls.
1 S E. Hardware Sampling Implementation 1. Introduction A basic problem in both debugging and performance profiling of systems such as those with multiple processors is monitoring the system without disturbing its operation. The present invention solves this problem by providing specialized hardware for monitoring certain operations of a processor and then interrupting the processor to record its state when the specialized hardware detects a predetermined condition. For example, a hardware timer interrupts the processor when waiting time at a particular address exceeds a preset threshold. In another example, multiple counters accumulate occurrences of waits in specified time ranges and interrupt the processor when one of the counts reaches a specified value.

_ ~ .~ ~ 21 ~. 0 Hardware monitors can track many aspects of memory system behavior. For example, by monitoring the external pins of a microprocessor that has an on-chip cache, it is usually straight-forward to detect misses in that cache. And, if the monitor is fabricated as part of the microprocessor (as in the Pentium and Alpha), misses in split on-chip can be identified as instruction or data misses. Regardless of the implementation of the event detector, the detection signal can be fed to a counter that triggers an interrupt when the count reaches a particular value. An interrupt handler can then record the state of the running system. If enough samples are recorded, this information will highlight the parts of a program or programs that are causing cache misses.
Referring to Figure l, the performance monitor of the present invention is now described. The performance monitor l00 of the present invention consists of three main components: an event detection module 102, a loadable down-counter l04, and an interrupt generator l08. The event detection module 102 generates a pulse every time an event of interest occurs. The down-counter 104 counts the pulses generated by event detector 102 and causes interrupt circuitry l08 to generate an interrupt once the counter reaches zero. The initial counter value can be loaded so that an interrupt will be generated after the occurrence of a programmable number of events;
this allows randomized sampling intervals to be used. In addition, by selecting an appropriate value for the loadable counter, relatively infrequent events will still interrupt the CPU
a statistically significant number of times. The interrupt generator 108 implements the correct interface to generate CPU interrupts so the interrupt handler can record appropriate system state. In the preferred embodiment, performance monitor 100 utilizes program logic arrays (PALs) on an SBus card 110 to implement the event detector 102, down counter 104, and interrupt generator 108.

2. The Implementation of the Present invention Referring to Figure 1, the present invention is implemented to monitor the behavior of memory system 112 of a Sun SPARC 10/41 workstation. The SPARC
10/4l offers an on-chip cache (split IOKB instruction and 16KB data) backed by an integrated 1MB second level cache which is connected to the main memory via a cache controller and a memory bus (MBus).
For primary cache misses, event detection circuit 102 detects a cache line read by monitoring several pins on the SRAM's of the second level cache that service the primary cache misses. This is represented by memory system signals 114 output by memory system 1l2. To detect second level cache misses, the event detector 102 monitors the memory bus (MBus) and outputs detection signal l16 on main memory reads received over memory system signal line 114. To properly monitor these main memory read signals, the present invention requires state information since MBus memory accesses use several bus cycles. The event detector l02 may also detect data cache stall time by exploiting an external monitoring pin on the SuperSparc microprocessor that is active when the CPU pipeline is waiting on the data cache.
This is also received via memory signal system line 114. In the preferred embodiment of the present invention, event detection circuit 102 is implemented in a PAL.
However as one skilled in the relevant art would find apparent, detector l02 may be implemented in other types of hardware configurations.
Down counter 104 receives a detection signal generated by event detector l02 via detection signal line 116. Down counter 104 outputs a pulse on signal line 124 to interrupt generator l08 when a predetermined number of events has occurred. In the preferred embodiment of the present invention, down counter 104 is implemented using two PALS and is 20 bits deep. This enables performance monitor 100 to count up to 2z~ events before a pulse is sent via pulse line l24 to interrupt generator 108.
The interrupt generator l08 is implemented on an SBus card 110 and causes an SBus interrupt to occur on interrupt line l30 when it receives a pulse from down counter l04 via signal line 124. Thus, loadable down counter 104 and interrupt generator l08 of the present invention replace the simple cache miss counter used in conventional systems. Downcounter 104 includes a bit which indicates that the performance monitor system 100 generated the interrupt 130 and provides a register for clearing the interrupt. The SBus interface 132 is also used to reset and enable counter 104 via reset line 118 and load enable line l20, respectively. In addition, the SBus interface is used to load counter 104 with an predetermined value via an SBUS data line l28 which is passed through SBus board 110 and provided to the down counter 104 via load counter line 122. Down counter 104 generates a pulse on line 124 when the loaded number of detection signals received via detection signal line 116 is reached.
In other words, the processor is interrupted when the cache miss count reaches zero.
Referring to Figure 2, an architectural block diagram of another preferred embodiment of the present invention is illustrated. Note that in the performance monitor 200 down counter l04 of Figure 1 has been replaced with a counter 204A
and loadable compare register 205A to make the circuit more flexible.
This embodiment supports simultaneous profiling of primary and second level cache misses. This embodiment is an extension of the embodiment illustrated in Figure 1. The components of performance monitor 200 are implemented in field programmable gate arrays (FPGA) rather than PALS. By using an FPGA, it is possible to reconfigure the monitor circuit from software, allowing us to easily reprogram the monitor to trigger on different events or sets of events. The FPGA is integrated on an SBus board 210 with external probes, thereby providing a flexible monitoring architecture that can trigger on conditions involving state or detect accesses to specific address ranges (for example, I/O devices). The FPGA implements several separate counters 204A and 204B so that several distinct event classes can be monitored simultaneously, including primary and secondary level cache misses and data cache stall time. When any of the counters 204A,204B matches its compare value stored in the respective comparator 205A,205B, an interrupt is raised. That is, an interrupt is generated by the respective comparator 205A,205B over interrupt lines 224A,224B, respectively. The interrupt status bits indicate which of the monitored events caused the interrupt.
F. Recording System State Upon receiving a profiling interrupt from the performance monitor of the present invention via SBus l32 or 232, the operating system and user application cooperate with the present invention to record profile data. Ideally, the present invention would sample enough system state to precisely identify the nature of every memory system bottleneck. In practice, the precision with which the sampled data describes the bottleneck must be balanced against the overhead of collecting large amounts of data. For example, while it may be desirable to know the state of a complex data structure when state is sampled, it is prohibitively expensive for a general purpose profiler to make a full dump of a process' data segment on each profiling interrupt.
The performance monitor of the present invention supports two data collection mechanisms that represent alternative trade-offs between flexibility, precision of information, and overhead. The most flexible and least expensive mechanism is system wide sampling. When system-wide profiling is enabled, each time a sampling interrupt occurs, the handler appends a brief sampled state record to a kernel profiling buffer. The record typically contains information such as:
(Timestamp, Current Process Id, Processor Status Word, Current Program Counter Value).
A user-level daemon periodically copies the profiling buffer to user space and saves its contents to a file. Post mortem tools can then synthesize cache miss profiles on a per-process basis, reporting user and system mode cache misses and plotting (using the timestamp data) cache miss samples versus time. In addition, if a mapping between Process IDs and corresponding executable files is maintained, the executables symbol tables may be used to synthesize UNIX prof style profiles that give cache miss counts on a per procedure basis.
The drawback of system wide sampling is that it provides only process and procedure-level information about the conditions under which cache misses are occurring. A preferred approach is the more flexible user level profiling implemented with software signals. In user level profiling, an application opens a special sampling device to indicate to the interrupt handler located in the processor that it wishes to receive memory system sampling signals. When a sampling interrupt occurs, the interrupt handler checks whether the currently executing process has opened the sampling device, and if so, it posts a signal to the process. When the kernel goes to resume the user process, it notices the posted signal and arranges to restart the process in the signal handler that the user process has registered for the sampling signal. This handler records relevant aspects of the process state and then control returns to the process' previously executing thread of execution. Note that this technique is not as non-intrusive as system-wide sampling because invoking the signal handler and resuming the previous thread of control involves extra overhead. However, as long as the sampling rate is not too high, the extra overhead will still be negligible.
The default interface to user level sampling in Mprof requires an application to be linked with a special library that replaces the default start and exit routines. The new start routine installs a sampling signal handler and opens the sampling device while exit writes the profile out to a file. (This library interface resembles linking with the -p flag to enable profiling under UNIX.) The Mprof sampling signal handler records the current procedure call stack so that cache misses can be associated with specific "contextualized" procedure call paths. Mprof will, for example, report that foot incurs cache misses when called from foo 1 called from main. Call path profiles are synthesized by feeding the call stack sample file produced by the handler into the Cpprof system which has facilities to highlight performance critical call paths and to interactively explore the data. Cpprof is discussed at length in Hall et al., "Call Path Profiling of Monotonic Program Resources in UNIX," in Proceedings of 1993 Summer USENIX (1993). An example is provided below which illustrates how user level Mprof profiles are viewed in Cpprof environment.
The inventors have found that Mprof's default call stack sampling signal handler 234 adequately isolates memory system bottlenecks. However, in more complex applications, the performance tuner may wish to replace the default handler with a specialized handler that records application specific information like arguments to a particular procedure or the state of some global state variable. Because the signal handler is part of the application program, it is considered to be well known in the relevant art to replace it with a user defined handler.
Below are examples of profiles produced using Mprof's system wide and user I S level sampling.
G. Case Studies The case studies given below are intended to illustrate the basic functionality of the present invention implemented to profile memory system performance. To emphasize the key features of Mprof, stylized synthetic benchmarks have been used when performing these case studies rather than full application programs.
1. System Wide Sampling This study examines the impact of multiprogramming on the effectiveness of the second level cache of the Sparc 10/4l . In particular, the behavior of a particular application, cache hog, which repeatedly reads all the elements of a 0.9 MB
data structure is profiled. A script that spawns a single instance of cache hog, waits for it to complete, and then starts up two concurrent invocations of the program is shown below.
cache hog sleep 1 cache hog &
cache hog;
sleep I S
The actual data is collected by enabling the hardware sampling and record logging in the interrupt handler, running the test script, and then disabling sampling the commands are:
start~rofd; gen workload; stop~rofd.
In this example, Mprof has been configured to generate a profiling interrupt after 16384 read misses in the second level cache. On each profiling interrupt in system wide sampling mode, the handler logs a timestamp, the process id (PID), and the program counter (PC). Mprof uses the timestamps and PIDs to plot cache miss samples versus time for each process. Figure 3 displays a plot for our test workload.
In Figure 4, the line 302 corresponds to the misses generated by the single invocation of cache hog while lines 304 and 306 represent the misses of a pair of hogs sharing the machine. From Figure 3, it is clear that the miss rates of the multiprogrammed jobs substantially exceed that of their uniprogrammed counterpart. In particular, there are 10 samples taken when the program runs alone but when two executions share the machine there are 75+77=152 samples. The additional misses are generated because the processes interfere in the cache, with each process flushing the other's data.
This phenomenon is not surprising, as each instance of cache hog wants to use 90~l0 of the lines in the second level cache. Mprof does, however, help us to visualize and quantify the effect. Moreover, the example illustrates how, over time, sampling correctly attributes misses to the processes that are incurring them. Note, however, that for multiprogrammed jobs, the samples seem to cluster into groups of four and five. This clustering is an artifact of poor sampling-for the purpose of illustration we have used a fixed sampling interval of 16384 misses. In any given 33 ms time slice, a process takes about 8000 second level cache misses. Thus, it is often the case that for a given time slice, we increment the miss counter by 8000 and then context switch so these misses are eventually charged to another process. In randomized sampling, such errors will tend to cancel, but in fixed interval sampling, correlations can arise producing the sort of clustering shown in Figure 3. In the pathological case, the sampling rate and miss count per-time slice can be chosen such that all samples will hit in one of two processes that are in fact missing equally often. This possibility argues strongly for randomizing the sampling by varying the inter-sample interval.
Returning to the profile results, the sampled data not only identifies which processes are missing in the cache, but also which program counters are associated with these misses. By periodically executing the processor status command (ps) during the profiling run, Mprof records a mapping from PIDs to executable file names.
During the post mortem data viewing phase, when the user clicks on one of the lines of the summary graph given in Figure 3, Mprof exploits this record to map the PID
corresponding to the line into an executable file name, extracts the symbol table from the file, and then synthesizes a procedure level cache miss profile from its sampled program counter information.
2. User Level Sampling in Cpprof This case study illustrates Mprof's ability to relate data stall cycles to call paths, identifying regions of a program that map poorly onto the memory hierarchy.
The case study compares several approaches to sorting a large array of records using the qsort library routine. This routine sorts an array of fixed size records given a pointer to the array, the size of a record, the number of records, and a function for comparing two records. On average, qsort performs NlogN compares when sorting N
records; when two records are out of order they are swapped in place.
In this case study, the present invention operates on a 6 MB data set containing 200,000 32-byte records, each with a 4 byte integer key. As a base case, qsort is invoked on the full array of 32 byte records. This implementation is somewhat inefficient because qsort is comparing integer keys but swapping full 32 byte records.
This unnecessary copying can be avoided by sorting pointers to records rather than the records themselves. Pseudo-code to initialize the pointer array and compare elements is shown below:
init_ptr array(Record *rec, Record **record_ptr, int cnt) {
for (i=0; i < cnt; i+=1 ) record_ptr[i] = rec + i;
compare record_ptr(Record **rpl, Record **rp2) {
return ( (*rpl)->key - (*rp2)->key );
The performance of the base case and record pointer sorts are compared by profiling both real time and data stall time via user-level sampling (real time signals are generating using the UNIX setitimer function). An excerpt from the Cpprof profile of the base case is shown below. Each line in the profile corresponds to a call path rooted at the qsort routine. Entries ending in %body%
represent total time spent in a procedure body and entries ending in Data Stall represent time spent stalled waiting for data. For example, the final call path in the profile below tells us that 0.62 seconds were spent waiting for data in the procedure compare records when it was called from qsort.
format: Call Path [Time in Seconds]
(qsort %body%) [6.86]
(qsort Data Stall) [ I .98]

_ 2.~ 5211 a (qsort compare records %body%) [1.40]
(qsort compare records Data Stall) [0.62]
Using the information in this profile, the first row of Table 2 is generated, which reports time the processor is active (CPU), time the processor is stalled waiting for data (DS) and total time. Note that the CPU figure is derived by subtracting DS from total time. Times are reported both for the qsort and compare routines individually and for qsort+compare. Looking at the qsort+compare columns, it is shown that eliminating the unnecessary copying by using the pointer sort saves 3.1 seconds of CPU time, but leaves data stall time unchanged. The likely explanation for this phenomenon is that the amount of data actually passing through the cache hierarchy is the same in both implementations.
In the pointer based implementation, the comparison function dereferences a pointer to access the key field of a record. But since the primary cache has byte lines, each reference to a key field causes the cache to load the entire 32 byte record.
The data bandwidth requirements can be reduced by building "little"
records (LRecs) where each record contains only a pointer to a full database record and a copy of that record's key:
init little records(Record *rec, LRec *lrec, int cnt) {
for (i=0; i < cnt;l i+=1 ) {
lrec[i].key = rec[i].key;
lrec[i].record_ptr = rec + i;
]
These 8 byte "little" records can then be directly sorted using qsort. In this case, qsort will be swapping 8 byte records and comparing 4 byte keys. The extra 4 bytes of copying can be eliminated by sorting pointers to the little records rather than the records themselves, as was done above for full records. The results for an LRec and LRec pointer implementation appear in the bottom two rows of Table 2. Examining the time spent in qsort, it is shown that reducing the copying from 8 to 4 bytes saves nearly a factor of two in CPU time there ( 1.2 vs. 2.5 sec).
However, it increases the CPU time in compare (there is an extra pointer dereference) and it increases data stall time in both qsort and compare. This increase in data stall time is explained by the fact that the array of pointers to records increases the active data set size from 8 bytes per little record to 12 bytes.
In the end, sorting the little records themselves is the superior approach, running about 2.5 times faster than sorting full records.
Table Versionqsort compare qsort+compare time time (sec) (sec) (sec) CPU DS TotatCPU DS TotalCPU DS Total Full 4.9 2.0 6.9 0.8 0.6 1.4 5.7 2.6 8.3 Sort Rec.
Pointers1.1 1.0 2.1 1.5 1.6 3.1 2.6 2.6 5.2 LRec 2.5 0.1 2.6 0.6 0.1 0.7 0.2 0.2 3.3 LRec 1.2 0.7 1.9 1.5 1.0 2.5 2.7 1.7 4.4 Ptr The availability of data stall cycle times significantly simplified the above optimization process. In particular, the data access stall time information given above focused the attention of the inventors on reducing data movement in the full record pointer implementation. Also, in the LRec and LRec pointer implementations, there was a trade-off between CPU time and data stall time, and explicit measures of both quantities allowed us to understand this trade-off.
It is extremely difficult to understand such trade-offs without the sort of empirical data that Mprof provides. Note also that this example only touches on the combined functionality of Mprof/Cpprof. In the full system, the user currently has access ~..

to Mprof derived first and level cache miss information and data pipeline stall cycles as well as to Cpprof's real time, user time, and page fault profiles. In addition, Cpprof provides an interactive shell for synthesizing different types of profile (function body, body+descendants, upward or downward call path) and relating program data to various subsets of the nodes in a program's dynamic call graph. In summary, the stack samples produced by Mprof fit naturally into an advanced profiling environment.
H. Alternative Embodiments The above case studies identify the type of information Mprof can extract.
There are several features of Mprof which can be implemented to improve and extend its performance. For example, currently, the system state which is sampled is of a "code-centric" nature. That is, in both system-wide sampling and user level call path sampling, a program counter or set of program counters is being recorded. This data allows one to identify where in the code the memory system performance is poor, but it does not directly record what data structures are at fault. By also recording the virtual address of the cache miss that triggers the sampling interrupt, one could get a crude profile of hot spots in the data. To create user level profiles from this data, however, the virtual addresses must be related back to user level data structures.
Techniques for accomplishing such mappings are discussed in Martonosi et al., "MemSpy: Analyzing Memory System Bottlenecks in Programs," in Proceedings of ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (1992), pp. 1-12.
Sampling works well for long running programs where we can sample relatively infrequently and still accumulate enough samples to develop an accurate histogram of where performance is poor. However, in some circumstances a more fine-grained view may be required. For example, after obtaining a coarse sampled a J

21~~~.~

profile of a multithreaded application, a programmer may want to focus on a specific procedure in a single thread. To better support such fine-grained profiling, we can add an enable bit to each Mprof counter such that the counter only increments when the enable bit is set. To determine the number of cache misses (or miss stall time if stall detectors are implemented) in a procedure, loop, or basic block, the program is as follows:
When program execution begins, InitialValue=read(counter0,enable=0.
On entering the code segment, enable=I.
On exiting the code segment, enable=0.
On program exit, read the counter and subtract off InitialValue.
Counter overflow is handled by recording the number of overflow interrupts that the counter generates. Note, this technique still has the problem that when a context switch occurs while executing an instrumented code segment, cache misses in other processes will be attributed to the instrumented code.
As mentioned above, the operating system must maintain the state of the enable bit on a per process basis to avoid this problem.
I. Other Monitoring Applications There are several natural extensions of the performance monitor l00 of the present invention as described above. One important extension is to instrument more of the memory hierarchy to provide better insight into where bottlenecks arise. In particular, this approach would distinguish between the sources of memory overhead in today's processors, including translation look-aside buffers, on-chip instruction and data caches, and write buffers. Since this information lies within the microprocessor, hidden from external probes, chip vendor support is required for such profiling.

It should be noted that the sampling monitor technique of the present invention may also be applied to multiprocessor systems. This is particularly significant because parallel programs with many interacting threads are easily perturbed by intrusive instrumentation techniques. Because sampling interrupts are relatively infrequent, the present invention is well-suited to the multiprocessor domain.
1. Network Monitoring a. Introduction Another application of the present invention is in distributed systems where multiple processors connected by a communication network interact to solve a problem such as workstations connected by a high speed network. The most important feature which must be considered when optimizing system performance in this environment is understanding the interactions between the cooperating processors. The present invention is a network monitoring system that addresses this profiling problem in an inexpensive, non-intrusive, and general purpose manner. This approach offers significant advantages over previous software instrumentation methods (that perturb program performance) and hardware monitors (that lack flexibility and generality).
When used as a network monitor, the user to specify the conditions under which an interrupt is generated. By appropriately programming the monitor we can non-intrusively determine where and how a system is waiting for data from the network, when network utilization is high or low, and where a program is generating significant network traffic.

w 2152~.~~

b) Detecting Synchronization Overhead In any distributed system, one process may be waiting for another process to produce data. The problem is particularly serious because the effects of waiting may be cumulative. Processor A waits for data from B, which waits for data from C which may eventually slow down A. The conventional software approach to measuring synchronization overhead would require a clock before and after every network read to accumulate how long the program spends waiting for data to arrive. In fine-grained networks where network packets can be as small as a single word, the overhead of these software clock accesses can dominate the actual data reading time, perturbing the system and eliminating the waiting times we are trying to monitor.
The conventional hardware approach often used to circumvent this problem is to maintain a counter on the network board that accumulates read wait time.
This approach allows us to determine the total time a program spends waiting on network reads, but it does not tell us when during the program's execution it was waiting so it is difficult to isolate the problem.
The approach of the present invention combines the read wait time counter with an interrupt generator. In the simplest case, the user can specify the value of a compare register and an interrupt is generated every time the read wait time counter reaches the compare value. The interrupt is then used by system software to record aspects of the program's state. By default the interrupt handler can record the current procedure call stack, though we may customize the handler to record application specific state. This interrupt-based sampling technique has been used in the past to relate consumption of CPU time to program state, but it has not been applied more generally to resource classes like network read wait time or cache miss service time.

21~211~

c) Understanding Synchronization Overhead The present invention may also be used to gain information about the distribution of wait times on the network. For example, it would be useful to know how many long waits there were compared to the number of shorter waits.
To perform this function a configuration similar to that shown in Figure 2 can be used to accumulate individual wait times in specific ranges. The user programs the ranges and hardware interrupts whenever one of the ranger counters 204A,204B reaches its compare value in comparator 205A,205B, respectively.
Note, even a single extent detector/counter pair is of substantial use in practice as the programmer often wants to distinguish whether synchronization overhead derives from a few long message waits or many short waits.
d) Resource Utilization The present invention can easily support traditional measurements like number of network transactions and network utilization. But, due to the triggered interrupting capability, the present invention can relate these measurements to system state with little overhead. The present invention simply interrupts the program every time N units of resource are accumulated and record the program state. It is preferred that the present invention be configured to sample at random intervals with mean N units to avoid correlation problems. Obtaining similar data by incrementing software counters on every network transaction would perturb the system, potentially changing the relative execution time of events on different processors, and affecting total system performance in an unpredictable way.
Traditional hardware counters without an interrupting capability would also be inadequate because they do not provide a mechanism for relating resource usage to specific resource users.

_ ~'.~~~1~Q

2. Additional Applications Finally, the monitor has applications beyond profiling and optimization.
Debug operations: The counters and decoder structures can be used to aid in debugging code that uses the communication system. For example, a count can be kept of the number of transmissions to and receives from each of the other processors in the system and this can be used to detect when and if the communication system becomes unsynchronized or deadlocked.
Sanity and error checking: The monitor can be used to detect error conditions. For example, if it is known that no more than ten successive writes are done to a particular channel in a given time, it is possible to program the monitor to alert the processor if this rule is ever violated.
3. Network Interface Example In this example, a network interface is considered which exists as a card in a workstation and allows the workstation to communicate using messages with the network. The waiting time, number of messages and the rate of communication can be monitored easily. Waiting time is monitored by counting the number of clock ticks that occur when the network is blocked and message rate is monitored by measuring the number of messages that are communicated in a given number of clock ticks. To illustrate how these measures are useful, an analog circuit simulation is used as an example program. Circuit simulation has two main phases, the construction of a set of equations representing a circuit (model evaluation), and their solution The equations describing each element of the circuit (model equations) can be evaluated to determine the coefficients of the equations describing the entire circuit. Such evaluations can be performed independently and their results accumulated to derive the coefficients of the circuit equations. These coefficients are stored in matrix form. The resulting matrix and ~m~ilo a vector of variables define the set of equations representing the circuit.
These equations can be solved using well known Lower Upper Decomposition (LUD).
The LUD algorithm can be implemented on a parallel machine by first representing the LUD operations of row normalization, row update and element back-substitution, as tasks in a task graph. These tasks can then be scheduled on a machine with several processors. The results of a task may be needed by tasks on other processors. The tasks and communications must be sequenced to insure that when a task is executed on a processor it has all the pertinent information the task needs to complete. The total LUD time is determined by the waiting time to access the network, time for communicating the data needed and the task execution time. A good schedule minimizes this total time.
The waiting time for the communication network is important during the LUD phase because the total compute time is dependent on minimizing the total waiting time. A poorly sequenced task list will have a large waiting time because tasks that could otherwise execute have to wait for communications from other processors to communicate their results. Because our monitor counts events without perturbing the operation of the program it is possible to get an accurate value of the wait time. If the monitor is set to interrupt when the wait time exceeds a predetermined threshold the parts of the program that are causing the most wait time can be identified easily.
The number of communications is important during model evaluation if the model equations are distributed to processors dynamically, depending on the computation load of a given processor. An unfavorable situation may arise if many communication intensive models are assigned to one processor. If conventional software monitoring techniques are used then the execution of the monitoring software may change the assignment of model equations to processors and therefore would perturb the communications that occur.

~~.521I(~

The rate of communication is also important in model evaluation. In a multiprocessor environment model equations need to communicate their results with the processor that holds the matrix. Such communications can be interleaved with the model computation. If the amount of communication produced during device model evaluations exceeds the capacity of the network then the processors may stall waiting for the network to clear. By monitoring the rate of communications it is possible to identify the processors and parts of the programs sending data that cause these high rate.
4. Output queue examples In this example we consider several workstations communicating with each other using a network interface card with a finite queue. When the queue is full communications cannot take place and the processor will stall. A monitor can be used to detect when the queue is full. The problem might be solved by interleaving other operations with the communications in this part of the program so that the communications occur less frequently.
a) Network blocking detection It is important to know how much time an application is spending waiting to use a communication resource. This situation can occur when the communications of several processes running on a processor conflict with each other. Communication conflicts can be detected by monitoring the total wait time and comparing this with the wait time when the program is run using a network without any other load. A software monitor would perturb the execution time of the processes and therefore give an inaccurate time. The problem is exacerbated if the communications are frequent and small, then the monitoring software overhead has a greater effect as its execution time becomes a greater fraction of the total execution time of the original program. A hardware monitor will be able ~~~~~io to monitor this blocking time without being intrusive, and only interrupt the processor when a significant wait time has elapsed, reducing the percentage of time spent in the monitoring software.
b) Platforms We have described how the monitor would be used in an environment with many processors communicating using a communication network. The monitor can be applied in many other cases as well. For instance, it can be used in a uniprocessor to unobtrusively monitor many aspects of the system's performance such as disk access. It can also be applied in multiprocessor systems for any type of interconnection network. The idea is not just restricted to workstations it can be used in PCs and in mainframes alike.
While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example, and not limitation. Thus, the breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, which should be defined only in accordance with the following claims and their equivalents.

Claims (18)

1. A system for profiling the behavior of memory systems in a processor, comprising:
an event detector configured to generate a detection signal upon an occurrence of a predetermined event;
a counter, coupled to said event detector, configured to receive and count occurrences of said detection signal received from said event detector, said counter further configured to generate a pulse signal when a predetermined count value of said detection signal occurrences is reached;
an interrupt generator, coupled to said counter, configured to generate a profiling interrupt signal upon receipt of said pulse signal; and an interrupt handler residing in the processor, configured to record a state of the processor upon receipt of said profiling interrupt signal generated by said interrupt generator.
2. The system of claim 1, wherein said event detector, said interrupt generator, and said counter are implemented in one or more PALs.
3. The system of claim 1, wherein said counter is a down counter, said down counter generating said pulse signal when said count value is zero.
4. The system of claim 1, wherein said counter is a loadable counter, said loadable counter configured to receive said predetermined count value.
5. The system of claim 1, wherein the memory system resides in a Sun SPARC 10/41 workstation.
6. The system of claim 1, wherein said event decoder monitors a memory bus and outputs said detection signal when a main memory read signal is received to detect second level cache misses.
7. The system of claim 1, wherein said counter comprises a reset input configured to enable an external reset signal to reset said down counter.
8. The system of claim 1, wherein said counter comprises an enable input configured to enable an external enable signal to enable said down counter.
9. The system of claim 1, wherein said event detector detects data cache stall time, said counter counting a number of clock cycles during which said event detector receives a cache stall signal, said cache stall signal having a predetermined value when the processor pipeline is waiting on a data cache.
10. The system of claim 1, wherein said event decoder detects primary cache misses and second level cache misses.
11. The system of claim 1, wherein said counter further comprises an enable input, said counter configured to count said event when said enable bit is set.
12. A performance monitor system for simultaneous profiling of multiple conditions in a processor memory system, comprising:
a plurality of condition detectors, each of said plurality of condition detectors configured to monitor a predetermined condition and to generate a detection signal upon an occurrence of said predetermined condition;
a plurality of counters, each of said plurality of counters coupled to a respective one of said plurality of condition detectors, for receiving said detection signal and for generating a count signal indicating the number of said detection signals received;

a plurality of loadable compare registers, each of said plurality of loadable compare registers coupled to a respective one of said plurality of counters, configured to receive said count signal from said respective counter and for comparing said count signal to a predetermined value loaded in said each of said plurality of compare registers;
an interrupt generator, coupled to each of said plurality of comparators, configured to generate a profiling interrupt signal based upon a predetermined relationship between said pulse signals received from said plurality of comparators; and an interrupt handler, residing in the processor, configured to record a state of the processor upon receipt of said profiling interrupt signal generated by said interrupt generator.
13. The system of claim 12, wherein the performance monitor is implemented in a field programmable gate array.
14. The system of claim 12, wherein each of said plurality of counters comprises an enable input, said enable input enabling each of said plurality of counters to count said event when said enable bit is set.
15. A network monitoring system for monitoring the interactions between the cooperating processors in a distributed system having multiple processors connected by a communication network, comprising:
an event detector configured to generate a detection signal upon an occurrence of a predetermined event;
a counter, coupled to said event detector, configured to receive and count occurrences of said detection signal received from said event detector, said counter further configured to generate a pulse signal when a predetermined count value of said detection signal occurrences is reached;

an interrupt generator, coupled to said counter, configured to generate a profiling interrupt signal upon receipt of said pulse signal; and an interrupt handler residing in the processor, configured to record a state of the processor upon receipt of said profiling interrupt signal generated by said interrupt generator.
16, The system of claim 15, wherein said multiple processors are workstations.
17. The system of claim 15, wherein the performance monitor monitors the distribution of wait times on the network.
18. The system of claim 15, wherein said predetermined value is a random number, thereby causing the system to sample the network at random intervals.
CA002152110A 1994-06-20 1995-06-19 Interrupt-based hardware support for profiling system performance Expired - Fee Related CA2152110C (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US26240494A 1994-06-20 1994-06-20
US262,404 1994-06-20

Publications (2)

Publication Number Publication Date
CA2152110A1 CA2152110A1 (en) 1995-12-21
CA2152110C true CA2152110C (en) 1999-07-27

Family

ID=22997354

Family Applications (1)

Application Number Title Priority Date Filing Date
CA002152110A Expired - Fee Related CA2152110C (en) 1994-06-20 1995-06-19 Interrupt-based hardware support for profiling system performance

Country Status (4)

Country Link
US (1) US5768500A (en)
EP (1) EP0689141A3 (en)
JP (1) JPH0830494A (en)
CA (1) CA2152110C (en)

Families Citing this family (210)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1997030402A1 (en) * 1996-02-16 1997-08-21 Philips Electronics N.V. A platform subsystem for executing a multimedia application title and a method for storing it
JPH10124327A (en) * 1996-10-16 1998-05-15 Nec Corp Instruction cache miss rate decreasing method
US6085338A (en) * 1996-12-17 2000-07-04 International Business Machines Corporation CPI infinite and finite analysis
US5960198A (en) * 1997-03-19 1999-09-28 International Business Machines Corporation Software profiler with runtime control to enable and disable instrumented executable
US5933643A (en) * 1997-04-17 1999-08-03 Hewlett Packard Company Profiler driven data prefetching optimization where code generation not performed for loops
FR2764097B1 (en) * 1997-06-02 1999-07-02 Bull Sa HOT POINT DETECTION IN A MACHINE WITH NON-UNIFORM ACCESS MEMORY
US6332212B1 (en) * 1997-10-02 2001-12-18 Ltx Corporation Capturing and displaying computer program execution timing
US6341357B1 (en) * 1997-10-08 2002-01-22 Sun Microsystems, Inc. Apparatus and method for processor performance monitoring
US6035306A (en) * 1997-11-24 2000-03-07 Terascape Software Inc. Method for improving performance of large databases
US6163840A (en) * 1997-11-26 2000-12-19 Compaq Computer Corporation Method and apparatus for sampling multiple potentially concurrent instructions in a processor pipeline
US6237059B1 (en) 1997-11-26 2001-05-22 Compaq Computer Corporation Method for estimating statistics of properties of memory system interactions among contexts in a computer system
US6442585B1 (en) 1997-11-26 2002-08-27 Compaq Computer Corporation Method for scheduling contexts based on statistics of memory system interactions in a computer system
US6237073B1 (en) 1997-11-26 2001-05-22 Compaq Computer Corporation Method for providing virtual memory to physical memory page mapping in a computer operating system that randomly samples state information
US6332178B1 (en) * 1997-11-26 2001-12-18 Compaq Computer Corporation Method for estimating statistics of properties of memory system transactions
US6175814B1 (en) 1997-11-26 2001-01-16 Compaq Computer Corporation Apparatus for determining the instantaneous average number of instructions processed
US6374367B1 (en) 1997-11-26 2002-04-16 Compaq Computer Corporation Apparatus and method for monitoring a computer system to guide optimization
US6195748B1 (en) 1997-11-26 2001-02-27 Compaq Computer Corporation Apparatus for sampling instruction execution information in a processor pipeline
US6549930B1 (en) 1997-11-26 2003-04-15 Compaq Computer Corporation Method for scheduling threads in a multithreaded processor
US6202127B1 (en) * 1997-11-26 2001-03-13 Compaq Computer Corporation Apparatus for spatial and temporal sampling in a computer memory system
US6539339B1 (en) 1997-12-12 2003-03-25 International Business Machines Corporation Method and system for maintaining thread-relative metrics for trace data adjusted for thread switches
US6338159B1 (en) 1997-12-12 2002-01-08 International Business Machines Corporation System and method for providing trace information
US6349406B1 (en) 1997-12-12 2002-02-19 International Business Machines Coporation Method and system for compensating for instrumentation overhead in trace data by computing average minimum event times
US6732357B1 (en) 1997-12-12 2004-05-04 International Business Machines Corporation Determining and compensating for temporal overhead in trace record generation and processing
US6055492A (en) 1997-12-12 2000-04-25 International Business Machines Corporation System and method for providing trace information data reduction
US6513155B1 (en) 1997-12-12 2003-01-28 International Business Machines Corporation Method and system for merging event-based data and sampled data into postprocessed trace output
US6560773B1 (en) 1997-12-12 2003-05-06 International Business Machines Corporation Method and system for memory leak detection in an object-oriented environment during real-time trace processing
US6311325B1 (en) 1998-10-22 2001-10-30 International Business Machines Corporation Method and apparatus for profiling processes in a data processing system background of the invention
US6651243B1 (en) 1997-12-12 2003-11-18 International Business Machines Corporation Method and system for periodic trace sampling for real-time generation of segments of call stack trees
US6546548B1 (en) 1997-12-12 2003-04-08 International Business Machines Corporation Method and system for compensating for output overhead in trace data using initial calibration information
US6751789B1 (en) 1997-12-12 2004-06-15 International Business Machines Corporation Method and system for periodic trace sampling for real-time generation of segments of call stack trees augmented with call stack position determination
US6662358B1 (en) 1997-12-12 2003-12-09 International Business Machines Corporation Minimizing profiling-related perturbation using periodic contextual information
US6507805B1 (en) 1997-12-12 2003-01-14 International Business Machines Corporation Method and system for compensating for instrumentation overhead in trace data by detecting minimum event times
US6954922B2 (en) * 1998-04-29 2005-10-11 Sun Microsystems, Inc. Method apparatus and article of manufacture for time profiling multi-threaded programs
US6275782B1 (en) * 1998-05-05 2001-08-14 Advanced Micro Devices, Inc. Non-intrusive performance monitoring
FR2778761B1 (en) * 1998-05-12 2001-08-03 Canon Kk DEVICE AND METHOD FOR PROVIDING COMPUTER PROGRAM EXECUTION TO EVENTS INTERNAL TO A PROCESSOR
US6243785B1 (en) * 1998-05-20 2001-06-05 3Com Corporation Hardware assisted polling for software drivers
US6134710A (en) * 1998-06-26 2000-10-17 International Business Machines Corp. Adaptive method and system to minimize the effect of long cache misses
US6263298B1 (en) * 1998-08-13 2001-07-17 Precise Software Solutions Ltd. Method for analyzing system performance
US6332117B1 (en) 1998-10-22 2001-12-18 International Business Machines Corporation General event stamping scheme
US7065633B1 (en) 1999-01-28 2006-06-20 Ati International Srl System for delivering exception raised in first architecture to operating system coded in second architecture in dual architecture CPU
US8074055B1 (en) 1999-01-28 2011-12-06 Ati Technologies Ulc Altering data storage conventions of a processor when execution flows from first architecture code to second architecture code
US7111290B1 (en) 1999-01-28 2006-09-19 Ati International Srl Profiling program execution to identify frequently-executed portions and to assist binary translation
US8121828B2 (en) 1999-01-28 2012-02-21 Ati Technologies Ulc Detecting conditions for transfer of execution from one computer instruction stream to another and executing transfer on satisfaction of the conditions
US7941647B2 (en) 1999-01-28 2011-05-10 Ati Technologies Ulc Computer for executing two instruction sets and adds a macroinstruction end marker for performing iterations after loop termination
US7275246B1 (en) 1999-01-28 2007-09-25 Ati International Srl Executing programs for a first computer architecture on a computer of a second architecture
US7013456B1 (en) 1999-01-28 2006-03-14 Ati International Srl Profiling execution of computer programs
US6826748B1 (en) 1999-01-28 2004-11-30 Ati International Srl Profiling program execution into registers of a computer
US6954923B1 (en) 1999-01-28 2005-10-11 Ati International Srl Recording classification of instructions executed by a computer
US8127121B2 (en) 1999-01-28 2012-02-28 Ati Technologies Ulc Apparatus for executing programs for a first computer architechture on a computer of a second architechture
US6574665B1 (en) * 1999-02-26 2003-06-03 Lucent Technologies Inc. Hierarchical vector clock
US6397382B1 (en) * 1999-05-12 2002-05-28 Wind River Systems, Inc. Dynamic software code instrumentation with cache disabling feature
US6374369B1 (en) * 1999-05-21 2002-04-16 Philips Electronics North America Corporation Stochastic performance analysis method and apparatus therefor
US6816274B1 (en) * 1999-05-25 2004-11-09 Silverbrook Research Pty Ltd Method and system for composition and delivery of electronic mail
US6830196B1 (en) * 1999-05-25 2004-12-14 Silverbrook Research Pty Ltd Identity-coded surface region
US6779107B1 (en) 1999-05-28 2004-08-17 Ati International Srl Computer execution by opportunistic adaptation
US6549959B1 (en) 1999-08-30 2003-04-15 Ati International Srl Detecting modification to computer memory by a DMA device
US6604210B1 (en) 1999-09-09 2003-08-05 International Business Machines Corporation Method and system for detecting and recovering from in trace data
US6598012B1 (en) 1999-10-07 2003-07-22 International Business Machines Corporation Method and system for compensating for output overhead in trace date using trace record information
US6728955B1 (en) 1999-11-05 2004-04-27 International Business Machines Corporation Processing events during profiling of an instrumented program
US6421826B1 (en) * 1999-11-05 2002-07-16 Sun Microsystems, Inc. Method and apparatus for performing prefetching at the function level
US7143197B1 (en) * 1999-11-08 2006-11-28 Agere Systems Inc. Method and system for monitoring a telecommunications signal transmission link
US6934832B1 (en) 2000-01-18 2005-08-23 Ati International Srl Exception mechanism for a computer
US6952798B2 (en) * 2000-03-30 2005-10-04 Barksdale Jr William W Method for determining the performance of a communications system
US6556952B1 (en) 2000-05-04 2003-04-29 Advanced Micro Devices, Inc. Performance monitoring and optimizing of controller parameters
US6658652B1 (en) 2000-06-08 2003-12-02 International Business Machines Corporation Method and system for shadow heap memory leak detection and other heap analysis in an object-oriented environment during real-time trace processing
US6904594B1 (en) 2000-07-06 2005-06-07 International Business Machines Corporation Method and system for apportioning changes in metric variables in an symmetric multiprocessor (SMP) environment
US6735758B1 (en) 2000-07-06 2004-05-11 International Business Machines Corporation Method and system for SMP profiling using synchronized or nonsynchronized metric variables with support across multiple systems
US7389497B1 (en) 2000-07-06 2008-06-17 International Business Machines Corporation Method and system for tracing profiling information using per thread metric variables with reused kernel threads
US6742178B1 (en) 2000-07-20 2004-05-25 International Business Machines Corporation System and method for instrumenting application class files with correlation information to the instrumentation
US6662359B1 (en) 2000-07-20 2003-12-09 International Business Machines Corporation System and method for injecting hooks into Java classes to handle exception and finalization processing
US7735072B1 (en) * 2000-08-11 2010-06-08 International Business Machines Corporation Method and apparatus for profiling computer program execution
US6748583B2 (en) * 2000-12-27 2004-06-08 International Business Machines Corporation Monitoring execution of an hierarchical visual program such as for debugging a message flow
US20020124085A1 (en) * 2000-12-28 2002-09-05 Fujitsu Limited Method of simulating operation of logical unit, and computer-readable recording medium retaining program for simulating operation of logical unit
US7093236B2 (en) 2001-02-01 2006-08-15 Arm Limited Tracing out-of-order data
US7093108B2 (en) 2001-02-01 2006-08-15 Arm Limited Apparatus and method for efficiently incorporating instruction set information with instruction addresses
US7016972B2 (en) * 2001-04-23 2006-03-21 International Business Machines Corporation Method and system for providing and viewing performance analysis of resource groups
US7111282B2 (en) * 2001-06-12 2006-09-19 Hewlett-Packard Development Company, L.P. Instrumenting a software program and collecting data from the instrumented software program by type
US6947052B2 (en) * 2001-07-13 2005-09-20 Texas Instruments Incorporated Visual program memory hierarchy optimization
US7487339B2 (en) * 2001-10-12 2009-02-03 Mips Technologies, Inc. Method and apparatus for binding shadow registers to vectored interrupts
US7552261B2 (en) * 2001-10-12 2009-06-23 Mips Technologies, Inc. Configurable prioritization of core generated interrupts
US20030083849A1 (en) * 2001-10-31 2003-05-01 Cabot Mason B. Statistical sampling process
US7360207B2 (en) * 2001-12-13 2008-04-15 Hewlett-Packard Development Company, L.P. Method and system to analyze inlined functions
EP1495417A4 (en) * 2002-03-14 2007-10-03 Unisys Corp System, and method for resource usage estimation
US7539608B1 (en) * 2002-05-10 2009-05-26 Oracle International Corporation Techniques for determining effects on system performance of a memory management parameter
US7020765B2 (en) * 2002-09-27 2006-03-28 Lsi Logic Corporation Marking queue for simultaneous execution of instructions in code block specified by conditional execution instruction
US20040064685A1 (en) * 2002-09-27 2004-04-01 Hung Nguyen System and method for real-time tracing and profiling of a superscalar processor implementing conditional execution
US7299343B2 (en) * 2002-09-27 2007-11-20 Verisilicon Holdings (Cayman Islands) Co. Ltd. System and method for cooperative execution of multiple branching instructions in a processor
US20040064684A1 (en) * 2002-09-30 2004-04-01 Kalluri Seshagiri P. System and method for selectively updating pointers used in conditionally executed load/store with update instructions
CA2408481C (en) 2002-10-17 2009-10-06 Ibm Canada Limited-Ibm Canada Limitee Autonomic reallocation of memory among buffer pools
US7159144B2 (en) * 2003-06-20 2007-01-02 Broadcom Corporation Firmware code profiling
GB0314874D0 (en) * 2003-06-26 2003-07-30 Ibm Generating a model of software in execution
US20050010908A1 (en) * 2003-07-10 2005-01-13 International Business Machines Corporation Method, apparatus and computer program product for implementing breakpoint based performance measurement
US7496907B2 (en) * 2003-08-06 2009-02-24 International Business Machines Corporation Method and system for profile normalization in an autonomic software system
US6925928B2 (en) * 2003-09-18 2005-08-09 Anthony Fox Trash compactor for fast food restaurant waste
US20050071564A1 (en) * 2003-09-25 2005-03-31 International Business Machines Corporation Reduction of cache miss rates using shared private caches
US7395527B2 (en) * 2003-09-30 2008-07-01 International Business Machines Corporation Method and apparatus for counting instruction execution and data accesses
US7373637B2 (en) * 2003-09-30 2008-05-13 International Business Machines Corporation Method and apparatus for counting instruction and memory location ranges
US7937691B2 (en) 2003-09-30 2011-05-03 International Business Machines Corporation Method and apparatus for counting execution of specific instructions and accesses to specific data locations
US20050071821A1 (en) * 2003-09-30 2005-03-31 International Business Machines Corporation Method and apparatus to autonomically select instructions for selective counting
US20050071816A1 (en) * 2003-09-30 2005-03-31 International Business Machines Corporation Method and apparatus to autonomically count instruction execution for applications
US20050071516A1 (en) * 2003-09-30 2005-03-31 International Business Machines Corporation Method and apparatus to autonomically profile applications
US8381037B2 (en) 2003-10-09 2013-02-19 International Business Machines Corporation Method and system for autonomic execution path selection in an application
US7421681B2 (en) * 2003-10-09 2008-09-02 International Business Machines Corporation Method and system for autonomic monitoring of semaphore operation in an application
US20050091456A1 (en) * 2003-10-23 2005-04-28 Huck Jerome C. Determining an arrangement of data in a memory for cache efficiency
US7308683B2 (en) * 2003-10-30 2007-12-11 International Business Machines Corporation Ordering of high use program code segments using simulated annealing
US7318219B2 (en) * 2003-11-28 2008-01-08 International Business Machines Corporation System and method for performance monitoring
US7496908B2 (en) * 2004-01-14 2009-02-24 International Business Machines Corporation Method and apparatus for optimizing code execution using annotated trace information having performance indicator and counter information
US7290255B2 (en) * 2004-01-14 2007-10-30 International Business Machines Corporation Autonomic method and apparatus for local program code reorganization using branch count per instruction hardware
US7526757B2 (en) * 2004-01-14 2009-04-28 International Business Machines Corporation Method and apparatus for maintaining performance monitoring structures in a page table for use in monitoring performance of a computer program
US20050155022A1 (en) * 2004-01-14 2005-07-14 International Business Machines Corporation Method and apparatus for counting instruction execution and data accesses to identify hot spots
US7197586B2 (en) * 2004-01-14 2007-03-27 International Business Machines Corporation Method and system for recording events of an interrupt using pre-interrupt handler and post-interrupt handler
US7293164B2 (en) * 2004-01-14 2007-11-06 International Business Machines Corporation Autonomic method and apparatus for counting branch instructions to generate branch statistics meant to improve branch predictions
US7392370B2 (en) * 2004-01-14 2008-06-24 International Business Machines Corporation Method and apparatus for autonomically initiating measurement of secondary metrics based on hardware counter values for primary metrics
US7415705B2 (en) 2004-01-14 2008-08-19 International Business Machines Corporation Autonomic method and apparatus for hardware assist for patching code
US20050155018A1 (en) * 2004-01-14 2005-07-14 International Business Machines Corporation Method and apparatus for generating interrupts based on arithmetic combinations of performance counter values
US7895382B2 (en) * 2004-01-14 2011-02-22 International Business Machines Corporation Method and apparatus for qualifying collection of performance monitoring events by types of interrupt when interrupt occurs
US7181599B2 (en) * 2004-01-14 2007-02-20 International Business Machines Corporation Method and apparatus for autonomic detection of cache “chase tail” conditions and storage of instructions/data in “chase tail” data structure
JP3937341B2 (en) 2004-01-30 2007-06-27 日本電気株式会社 Transaction profile generation system for computer system performance measurement analysis, its generation method and program
US20050210450A1 (en) * 2004-03-22 2005-09-22 Dimpsey Robert T Method and appartus for hardware assistance for data access coverage
US8135915B2 (en) * 2004-03-22 2012-03-13 International Business Machines Corporation Method and apparatus for hardware assistance for prefetching a pointer to a data structure identified by a prefetch indicator
US7526616B2 (en) * 2004-03-22 2009-04-28 International Business Machines Corporation Method and apparatus for prefetching data from a data structure
US7421684B2 (en) * 2004-03-22 2008-09-02 International Business Machines Corporation Method and apparatus for autonomic test case feedback using hardware assistance for data coverage
US7480899B2 (en) * 2004-03-22 2009-01-20 International Business Machines Corporation Method and apparatus for autonomic test case feedback using hardware assistance for code coverage
US7296130B2 (en) * 2004-03-22 2007-11-13 International Business Machines Corporation Method and apparatus for providing hardware assistance for data access coverage on dynamically allocated data
US7299319B2 (en) * 2004-03-22 2007-11-20 International Business Machines Corporation Method and apparatus for providing hardware assistance for code coverage
US7707554B1 (en) * 2004-04-21 2010-04-27 Oracle America, Inc. Associating data source information with runtime events
JP4599902B2 (en) * 2004-06-18 2010-12-15 株式会社日立製作所 Performance analysis method using hardware monitor
US8214819B2 (en) * 2004-07-09 2012-07-03 Hewlett-Packard Development Company, L.P. Determining call counts in a program
JP4262174B2 (en) * 2004-09-09 2009-05-13 キヤノン株式会社 Simulation time acquisition method
US20060130001A1 (en) * 2004-11-30 2006-06-15 International Business Machines Corporation Apparatus and method for call stack profiling for a software application
US20060143384A1 (en) * 2004-12-27 2006-06-29 Hughes Christopher J System and method for non-uniform cache in a multi-core processor
US7788240B2 (en) 2004-12-29 2010-08-31 Sap Ag Hash mapping with secondary table having linear probing
US7752354B2 (en) * 2005-02-11 2010-07-06 International Business Machines Corporation Auxiliary mechanism to manage instruction restart and restart coming in a lookahead processor
US7640539B2 (en) * 2005-04-12 2009-12-29 International Business Machines Corporation Instruction profiling using multiple metrics
US7657875B2 (en) * 2005-04-12 2010-02-02 International Business Machines Corporation System and method for collecting a plurality of metrics in a single profiling run of computer code
US7703079B1 (en) * 2005-05-03 2010-04-20 Oracle America, Inc. System performance prediction
US7886194B2 (en) * 2005-05-16 2011-02-08 Texas Instruments Incorporated Event-generating instructions
US7373482B1 (en) * 2005-05-26 2008-05-13 Sun Microsystems, Inc. Software-based technique for improving the effectiveness of prefetching during scout mode
US20070079294A1 (en) * 2005-09-30 2007-04-05 Robert Knight Profiling using a user-level control mechanism
US20070150881A1 (en) * 2005-12-22 2007-06-28 Motorola, Inc. Method and system for run-time cache logging
WO2007076634A1 (en) * 2005-12-30 2007-07-12 Intel Corporation Virtual event interface to support platform-wide performance optimization
US7861068B2 (en) * 2006-03-07 2010-12-28 Intel Corporation Method and apparatus for using dynamic workload characteristics to control CPU frequency and voltage scaling
US8255876B2 (en) * 2006-07-28 2012-08-28 Apple Inc. Execution difference identification tool
DE102006041444B4 (en) 2006-09-04 2014-10-30 Infineon Technologies Ag Circuit arrangement and method for detecting an execution time of a command in a computer system
US7519797B1 (en) * 2006-11-02 2009-04-14 Nividia Corporation Hierarchical multi-precision pipeline counters
US7913022B1 (en) 2007-02-14 2011-03-22 Xilinx, Inc. Port interface modules (PIMs) in a multi-port memory controller (MPMC)
US8479124B1 (en) 2007-02-14 2013-07-02 Xilinx, Inc. Graphical user interface (GUI) including input files with information that determines representation of subsequent content displayed by the GUI
US7711907B1 (en) 2007-02-14 2010-05-04 Xilinx, Inc. Self aligning state machine
US7720636B1 (en) * 2007-02-14 2010-05-18 Xilinx, Inc. Performance monitors (PMs) for measuring performance in a system and providing a record of transactions performed
US20080301717A1 (en) * 2007-05-31 2008-12-04 Microsoft Corporation Visualizing a memory footprint of an application program
US7962924B2 (en) 2007-06-07 2011-06-14 International Business Machines Corporation System and method for call stack sampling combined with node and instruction tracing
JP5029245B2 (en) * 2007-09-20 2012-09-19 富士通セミコンダクター株式会社 Profiling method and program
US9354890B1 (en) * 2007-10-23 2016-05-31 Marvell International Ltd. Call stack structure for enabling execution of code outside of a subroutine and between call stack frames
US8095775B1 (en) 2007-11-21 2012-01-10 Marvell International Ltd. Instruction pointers in very long instruction words
US7841436B2 (en) 2008-01-21 2010-11-30 Amigo Mobility International Personal mobility vehicle
US8261245B2 (en) * 2008-01-22 2012-09-04 International Business Machines Corporation Method and system for associating profiler data with a reference clock
US20090210740A1 (en) * 2008-02-14 2009-08-20 Song Huang Off-chip access workload characterization methodology for optimizing computing efficiency
US9418005B2 (en) 2008-07-15 2016-08-16 International Business Machines Corporation Managing garbage collection in a data processing system
US20100017583A1 (en) * 2008-07-15 2010-01-21 International Business Machines Corporation Call Stack Sampling for a Multi-Processor System
US8286134B2 (en) * 2008-07-15 2012-10-09 International Business Machines Corporation Call stack sampling for a multi-processor system
US7917677B2 (en) * 2008-09-15 2011-03-29 International Business Machines Corporation Smart profiler
JP5326708B2 (en) 2009-03-18 2013-10-30 富士通株式会社 Arithmetic processing device and control method of arithmetic processing device
US8850402B2 (en) * 2009-05-22 2014-09-30 International Business Machines Corporation Determining performance of a software entity
US8250331B2 (en) 2009-06-26 2012-08-21 Microsoft Corporation Operating system virtual memory management for hardware transactional memory
US8489864B2 (en) * 2009-06-26 2013-07-16 Microsoft Corporation Performing escape actions in transactions
US8370577B2 (en) 2009-06-26 2013-02-05 Microsoft Corporation Metaphysically addressed cache metadata
US8161247B2 (en) * 2009-06-26 2012-04-17 Microsoft Corporation Wait loss synchronization
US8356166B2 (en) * 2009-06-26 2013-01-15 Microsoft Corporation Minimizing code duplication in an unbounded transactional memory system by using mode agnostic transactional read and write barriers
US8812796B2 (en) 2009-06-26 2014-08-19 Microsoft Corporation Private memory regions and coherence optimizations
US8229907B2 (en) * 2009-06-30 2012-07-24 Microsoft Corporation Hardware accelerated transactional memory system with open nested transactions
US20100333071A1 (en) * 2009-06-30 2010-12-30 International Business Machines Corporation Time Based Context Sampling of Trace Data with Support for Multiple Virtual Machines
US8533440B2 (en) * 2009-12-15 2013-09-10 Microsoft Corporation Accelerating parallel transactions using cache resident transactions
US8539465B2 (en) 2009-12-15 2013-09-17 Microsoft Corporation Accelerating unbounded memory transactions using nested cache resident transactions
US9092253B2 (en) * 2009-12-15 2015-07-28 Microsoft Technology Licensing, Llc Instrumentation of hardware assisted transactional memory system
US8402218B2 (en) * 2009-12-15 2013-03-19 Microsoft Corporation Efficient garbage collection and exception handling in a hardware accelerated transactional memory system
CN102231130B (en) * 2010-01-11 2015-06-17 国际商业机器公司 Method and device for analyzing computer system performances
US9582443B1 (en) 2010-02-12 2017-02-28 Marvell International Ltd. Serial control channel processor for executing time-based instructions
US9176783B2 (en) 2010-05-24 2015-11-03 International Business Machines Corporation Idle transitions sampling with execution context
US8843684B2 (en) 2010-06-11 2014-09-23 International Business Machines Corporation Performing call stack sampling by setting affinity of target thread to a current process to prevent target thread migration
US8799872B2 (en) 2010-06-27 2014-08-05 International Business Machines Corporation Sampling with sample pacing
US8615742B2 (en) * 2010-11-16 2013-12-24 International Business Machines Corporation Autonomic hotspot profiling using paired performance sampling
US8799904B2 (en) * 2011-01-21 2014-08-05 International Business Machines Corporation Scalable system call stack sampling
US8884920B1 (en) 2011-05-25 2014-11-11 Marvell International Ltd. Programmatic sensing of capacitive sensors
US9098694B1 (en) 2011-07-06 2015-08-04 Marvell International Ltd. Clone-resistant logic
US9069553B2 (en) 2011-09-06 2015-06-30 Marvell World Trade Ltd. Switching tasks between heterogeneous cores
US20120095607A1 (en) * 2011-12-22 2012-04-19 Wells Ryan D Method, Apparatus, and System for Energy Efficiency and Energy Conservation Through Dynamic Management of Memory and Input/Output Subsystems
US9483268B2 (en) 2012-03-16 2016-11-01 International Business Machines Corporation Hardware based run-time instrumentation facility for managed run-times
US9442824B2 (en) 2012-03-16 2016-09-13 International Business Machines Corporation Transformation of a program-event-recording event into a run-time instrumentation event
US9471315B2 (en) 2012-03-16 2016-10-18 International Business Machines Corporation Run-time instrumentation reporting
US9405541B2 (en) 2012-03-16 2016-08-02 International Business Machines Corporation Run-time instrumentation indirect sampling by address
US9411591B2 (en) 2012-03-16 2016-08-09 International Business Machines Corporation Run-time instrumentation sampling in transactional-execution mode
US9158660B2 (en) * 2012-03-16 2015-10-13 International Business Machines Corporation Controlling operation of a run-time instrumentation facility
US9430238B2 (en) 2012-03-16 2016-08-30 International Business Machines Corporation Run-time-instrumentation controls emit instruction
US9280447B2 (en) 2012-03-16 2016-03-08 International Business Machines Corporation Modifying run-time-instrumentation controls from a lesser-privileged state
US9454462B2 (en) 2012-03-16 2016-09-27 International Business Machines Corporation Run-time instrumentation monitoring for processor characteristic changes
US9367316B2 (en) 2012-03-16 2016-06-14 International Business Machines Corporation Run-time instrumentation indirect sampling by instruction operation code
US9465716B2 (en) 2012-03-16 2016-10-11 International Business Machines Corporation Run-time instrumentation directed sampling
US9250902B2 (en) 2012-03-16 2016-02-02 International Business Machines Corporation Determining the status of run-time-instrumentation controls
US9021172B2 (en) * 2012-07-06 2015-04-28 Arm Limited Data processing apparatus and method and method for generating performance monitoring interrupt signal based on first event counter and second event counter
KR101991687B1 (en) * 2012-11-23 2019-06-24 삼성전자 주식회사 Dynamic library profiling method, computer readable recording medium storing thereof and dynamic library profiling system
JP2014182478A (en) 2013-03-18 2014-09-29 Fujitsu Ltd Performance profiling device, and performance profiling method
CN104111871B (en) * 2013-04-16 2017-05-03 济南概伦电子科技有限公司 Method and device used for executing dynamic load balancing in circuit simulation
US11018988B2 (en) 2013-08-26 2021-05-25 Vmware, Inc. Translating high level requirements policies to distributed configurations
US10346437B1 (en) * 2014-06-18 2019-07-09 EMC IP Holding Company LLC Event triggered data collection
US11138086B2 (en) 2015-01-28 2021-10-05 Hewlett-Packard Development Company, L.P. Collecting hardware performance data
TWI518585B (en) * 2015-05-18 2016-01-21 國立成功大學 Electronic device having scratchpad memory and management method for scratchpad memory
US11003428B2 (en) 2016-05-25 2021-05-11 Microsoft Technolgy Licensing, Llc. Sample driven profile guided optimization with precise correlation
US10705843B2 (en) * 2017-12-21 2020-07-07 International Business Machines Corporation Method and system for detection of thread stall
US11579859B2 (en) * 2020-12-29 2023-02-14 Imperva, Inc. Dynamic invocation of partner product exit routine in an active information management system
US11698797B2 (en) * 2021-07-01 2023-07-11 Micro Focus Llc Application tuning based on performance characteristics

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3868480A (en) * 1974-01-24 1975-02-25 Telesciences Inc Event monitoring transceiver
DE3328540A1 (en) * 1983-08-06 1985-02-14 Robert Bosch Gmbh, 7000 Stuttgart DEVICE FOR MEASURING TIME BETWEEN TWO EVENTS
US4528553A (en) * 1983-08-16 1985-07-09 Conoco Inc. Event detection apparatus
US4636967A (en) * 1983-10-24 1987-01-13 Honeywell Inc. Monitor circuit
JPS63123218A (en) * 1986-11-12 1988-05-27 Nec Corp Timer/counter circuit
JP2777496B2 (en) * 1991-02-28 1998-07-16 インターナショナル・ビジネス・マシーンズ・コーポレイション Uses when profiling multi-processes in computer systems
US5446876A (en) * 1994-04-15 1995-08-29 International Business Machines Corporation Hardware mechanism for instruction/data address tracing

Also Published As

Publication number Publication date
EP0689141A3 (en) 1997-10-15
EP0689141A2 (en) 1995-12-27
US5768500A (en) 1998-06-16
CA2152110A1 (en) 1995-12-21
JPH0830494A (en) 1996-02-02

Similar Documents

Publication Publication Date Title
CA2152110C (en) Interrupt-based hardware support for profiling system performance
Berg et al. Fast data-locality profiling of native execution
Sweeney et al. Using Hardware Performance Monitors to Understand the Behavior of Java Applications.
Anderson et al. Continuous profiling: Where have all the cycles gone?
Zagha et al. Performance analysis using the MIPS R10000 performance counters
Clark et al. Performance of the VAX-11/780 translation buffer: Simulation and measurement
Gebai et al. Survey and analysis of kernel and userspace tracers on linux: Design, implementation, and overhead
Rosenblum et al. Using the SimOS machine simulator to study complex computer systems
Bedichek Talisman: Fast and accurate multicomputer simulation
Herrod Using complete machine simulation to understand computer system behavior
Baghsorkhi et al. Efficient performance evaluation of memory hierarchy for highly multithreaded graphics processors
Eranian What can performance counters do for memory subsystem analysis?
Martonosi et al. Integrating performance monitoring and communication in parallel computers
Yang et al. Computer performance microscopy with shim
Reed Performance instrumentation techniques for parallel systems
Huang et al. Dynamic simplescalar: Simulating java virtual machines
Lebeck et al. Active memory: A new abstraction for memory system simulation
Chen Software methods for system address tracing
Schaelicke et al. Profiling I/O interrupts in modern architectures
Jaleel et al. Cmp $ im: A binary instrumentation approach to modeling memory behavior of workloads on cmps
Singhal et al. Architectural support for performance tuning: A case study on the SPARCcenter 2000
Marathe et al. Source-code-correlated cache coherence characterization of openmp benchmarks
Moore et al. PAPI deployment, evaluation, and extensions
Marathe et al. Analysis of cache-coherence bottlenecks with hybrid hardware/software techniques
Horowitz et al. Informing loads: Enabling software to observe and react to memory behavior

Legal Events

Date Code Title Description
EEER Examination request
MKLA Lapsed