WO2002045385A2 - Methods and devices for caching method frame segments in a low-power stack-based processor - Google Patents

Methods and devices for caching method frame segments in a low-power stack-based processor

Info

Publication number
WO2002045385A2
WO2002045385A2 PCT/US2001/043829
Authority
WO
WIPO (PCT)
Prior art keywords
stack
cache
caching
frame
local
Prior art date
Application number
PCT/US2001/043829
Other languages
French (fr)
Other versions
WO2002045385A3 (en)
Inventor
Michael Majid
Zohair Sahraoui
Thomas Bottomley
Guillaume Comeau
Original Assignee
Zucotto Wireless, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zucotto Wireless, Inc.
Priority to PCT/US2001/043829 priority Critical patent/WO2002045385A2/en
Priority to AU2002230445A priority patent/AU2002230445A1/en
Priority to PCT/US2001/044031 priority patent/WO2002071211A2/en
Priority to AU2002226968A priority patent/AU2002226968A1/en
Priority to PCT/US2001/043444 priority patent/WO2002042898A2/en
Priority to AU2002241505A priority patent/AU2002241505A1/en
Priority claimed from PCT/US2001/043957 external-priority patent/WO2002048864A2/en
Priority claimed from PCT/US2001/043444 external-priority patent/WO2002042898A2/en
Priority to AU4150502A priority patent/AU4150502A/en
Publication of WO2002045385A2 publication Critical patent/WO2002045385A2/en
Publication of WO2002045385A3 publication Critical patent/WO2002045385A3/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45504Abstract machines for programme code execution, e.g. Java virtual machine [JVM], interpreters, emulators
    • G06F9/45508Runtime interpretation or emulation, e.g. emulator loops, bytecode interpretation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30181Instruction operation extension or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/445Program loading or initiating
    • G06F9/44521Dynamic linking or loading; Link editing at or after load time, e.g. Java class loading
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/445Program loading or initiating
    • G06F9/44557Code layout in executable memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/448Execution paradigms, e.g. implementations of programming paradigms
    • G06F9/4482Procedural
    • G06F9/4484Executing subprograms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/448Execution paradigms, e.g. implementations of programming paradigms
    • G06F9/4488Object-oriented
    • G06F9/449Object-oriented method invocation or resolution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/48Indexing scheme relating to G06F9/48
    • G06F2209/481Exception handling

Definitions

  • the present invention generally relates to caching data. More specifically, the present invention relates to caching a method frame and segments thereof in a stack-based processor on a low-power device.

Background
  • A traditional software Java Virtual Machine (JVM), which interprets Java bytecode into the machine language of a given computing platform, is slow.
  • To improve the performance of a software JVM, the clock rate must be increased.
  • This performance is achieved at the expense of high power dissipation due to the increased clock rate.
  • Additional memory is required to store the software implementing the software JVM, which further increases the power consumption, size and cost of the mobile device.
  • Alternative approaches include Java just-in-time compilers, and hardware accelerators or mere extensions to an existing processor's architecture.
  • The most complete Java-enabling technology by far is the provision of a Java Native Processor in a mobile device.
  • the Java programming language is stack-based, meaning that many operations are conducted on data stored in a "stack" data structure stored in off-chip (with respect to a processor) memory (hereafter referred to as "stack memory").
  • Each method context in Java is represented by a "method frame". Method frames are pushed on and popped off the run-time stack as method invokes and returns are executed. Each method frame includes parameters received from the method's caller, new local variables, the return execution context frame (which may include the return PC, frame pointer, local variable pointer, status flags, etc.), and the local stack. The local stack is used to hold temporary results, pass parameters to methods, and pass return values to callers.
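The frame layout just described can be modeled in a short sketch. This is purely illustrative; the class and field names are ours, not the patent's, and a real frame lives in stack memory rather than in Python objects.

```python
from dataclasses import dataclass, field

# Illustrative model of a method frame: incoming parameters, new local
# variables, a return execution context, and a local operand stack.
@dataclass
class ReturnContext:
    return_pc: int
    frame_pointer: int
    local_variable_pointer: int
    status_flags: int

@dataclass
class MethodFrame:
    parameters: list
    local_variables: list
    return_context: ReturnContext
    local_stack: list = field(default_factory=list)

# Frames are pushed on invoke and popped on return.
call_stack = []
frame = MethodFrame([1, 2], [0, 0], ReturnContext(0x40, 0, 0, 0))
call_stack.append(frame)        # invoke: push a new frame
frame.local_stack.append(42)    # hold a temporary result on the local stack
call_stack.pop()                # return: pop the frame
```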
  • stack caches are typically used to cache the top elements of the stack in memory (stack memory), storing the elements on-chip with the processor. In this way, the processor may access the top elements of stack memory faster than if the top elements were to be fetched from the slower off-chip stack memory, thereby speeding-up overall execution performance.
  • While stack caches for stack-based processors provide satisfactory execution performance, they fail to address the requirements of low-power devices, for example, mobile wireless devices. Such devices are typically battery powered and must operate with little power if users are to be provided with reasonable battery lifetime and a minimal performance level before having to replace or recharge the battery.
  • Off- chip memory 110 is provided having a portion thereof allocated as stack memory 120 for use by the stack-based processor core 100. Accesses to stack memory 120 typically take several processor clock cycles.
  • FIG. 2-1 illustrates a prior-art stack-based processor 100 provided with an on-chip stack cache 200.
  • On-chip stack cache 200 serves to maintain a copy of data in stack memory 120 for single-cycle access.
  • Multiple method frames 210 are stored in stack memory 120, and copies of multiple method frames are maintained in the stack cache 200.
  • method frames may comprise a local variable segment 220, a return execution context segment 230, and a local stack operand segment 240.
  • the method frame located at the top of the stack is the "current" method - i.e. the method that the processor core 100 is executing at a given point in time.
  • a stack pointer is typically maintained by processor core 100 to indicate the top of stack element in stack memory 120.
  • the number of copies of the top elements of stack memory 120 that are maintained in stack cache 200 can be large.
  • the picoJava II Native Java Processor provides a 32-bit wide, 64-entry stack cache, implemented as a single circular buffer of memory locations such as a register file.
  • a top pointer 330 and a bottom pointer 320 are maintained to implement the register file as a circular register file.
  • Top pointer 330 may point to the stack cache entry that contains the next available entry for data to be stored (pushed) on the stack cache.
  • the data is written to stack cache 200 at the location indicated by top pointer 330 and the top pointer 330 is modified to point to the next available stack cache entry.
  • the top pointer 330 is modified to point to the top valid entry and the data from that entry is placed on a read bus in processor core 100.
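The top/bottom-pointer mechanics described above can be sketched as a small circular register file. This is a hedged illustration of a picoJava-style circular-buffer stack cache, with our own naming; spill and fill to stack memory on overflow/underflow are deliberately left to the caller here.

```python
# Minimal circular-buffer stack cache with top and bottom pointers,
# modeling the push/pop behavior described above (illustrative only).
class CircularStackCache:
    def __init__(self, entries=8):
        self.regs = [0] * entries
        self.top = 0      # next free entry for a push
        self.bottom = 0   # oldest valid entry
        self.count = 0    # number of valid entries

    def push(self, value):
        # Caller must spill to stack memory first if count == len(regs).
        self.regs[self.top] = value
        self.top = (self.top + 1) % len(self.regs)
        self.count += 1

    def pop(self):
        # Caller must fill from stack memory first if count == 0.
        self.top = (self.top - 1) % len(self.regs)
        self.count -= 1
        return self.regs[self.top]
```

The modular arithmetic is what makes the register file "circular": as operands are pushed and popped, the valid window slides around the physical registers rather than shifting data.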
  • US Patent 6,021,469 teaches a method for caching method frames on a stack cache by caching so-called "modified" method frames 250 on the circular buffer stack cache.
  • the '469 Patent proposes caching only the local variables and the local stack operand segments of one or more methods to reduce the size of the stack cache.
  • a significant number of registers are still required in this implementation as the size of these two segments may be large because more than one method frame may need to be cached.
  • Implementations of large stack caches suffer from high latency on thread switching, as all the stack elements, for example, stack operands and local variables of one or more methods of an old thread, must be spilled back to memory prior to executing in the context of a new thread.
  • a register file or SRAM large enough to implement a large stack cache as taught in Figures 2-1, 2-2, and 2-3 contributes significantly to the power dissipation and the footprint of the processor.
  • To reduce the number of stack elements to be spilled back to memory it is known in the art to maintain "dirty bits" - bits that indicate if a cached copy is coherent with the corresponding memory copy.
  • a large stack cache, if implemented with a memory technology such as SRAM, cannot maintain dirty bits associated with each word in the stack cache. If a large conventional stack cache is implemented using latches or registers, dirty bits may be provided; however, the high gate count required to implement such a cache would be detrimental to both the footprint of the processor die and its power dissipation.
  • the local variables 220 of a method are at risk of being relegated to stack memory 120 whenever the same method pushes a large number of local stack operands.
  • a local variable cache is provided logically and/or physically separated from a stack operand cache, such that a higher degree of control may be provided with respect to the design parameters of one or more method frame segments.
  • a stack operand cache may be provided having an optimal size for stack operands and implemented as a circular register file to provide a "sliding window" of the stack operands of a method.
  • a separate local variable cache may be optimized for local variables to have an optimized minimal size and of a design that is best suited for local variables such as a fully-associative or direct mapped cache.
  • a caching mechanism for caching the stack segments of a method.
  • the mechanism comprises a first cache for a first segment and a second cache for a second segment.
  • a caching mechanism for caching the stack segments of a method comprising a first cache for caching a local variable segment and a second cache for caching a stack operand segment, wherein the first cache comprises at most 16 entries, and the second cache comprises 2 to about 8 entries.
  • a caching mechanism for caching the stack segments of a method comprising a first cache for caching a local variable segment and a second cache for caching a stack operand segment
  • the second cache comprises a circular buffer.
  • a device capable of executing stack-based instructions, the instructions provided by a method, the method including at least one method frame, the method frame comprising one or more frame segments may comprise means for caching one or more frame segments of a current method frame.
  • the one or more method frame segments may comprise a local variable segment and a stack operand segment.
  • the frame segments may comprise a local variable segment that includes a number of local variables, and a stack operand segment that includes a number of stack operands, and the means for caching may cache less than the number of local variables and less than the number of stack operands.
  • the device may further comprise a processor core; and a memory operatively coupled to the processor core, the memory containing a stack data structure for storing a call chain of method frames, wherein the means for caching is coupled to both the processor core and the memory, and wherein the means for caching provides the processor core with faster access to the current method frame than may be provided by the memory.
  • the means for caching may be integrated onto a semiconductor chip.
  • the means for caching may comprise faster memory technology than the memory.
  • the means for caching may comprise cache memory, and the cache memory may be selected from the group consisting of registers, latches, RAM, SRAM, and DDR.
  • the one or more frame segments may comprise segment elements, and the means for caching may accommodate no more than 16 segment elements.
  • the one or more frame segments may comprise stack operand segment elements, wherein the means for caching accommodates storage of no more than 8 stack operand segment elements.
  • the one or more frame segments may comprise stack operand segment elements, wherein the means for caching accommodates storage of 8 stack operand segment elements.
  • the one or more frame segments may comprise local variable segment elements, and the means for caching may accommodate no more than 16 local variable segment elements.
  • the means for caching may comprise a circular register file; a top pointer for indicating a first entry in a first register of the register file; and a bottom pointer for indicating a last entry in a second register of the register file.
  • the device may comprise a mobile device.
  • the device may comprise a Java native processor.
  • the device may comprise a Java accelerator.
  • the device may comprise an instruction-path Java accelerator.
  • the device may comprise an electronic circuit.
  • a device capable of executing stack-based instructions may comprise a first cache memory for caching stack operands of the instructions, and a second cache memory for caching local variables of the instructions.
  • the first cache memory may comprise a circular register file.
  • the second cache memory may comprise a direct mapped cache memory.
  • the second cache memory may comprise a fully associative cache memory.
  • the stack operands and the local variables may be both of a single current method.
  • a circuit for caching a method frame may comprise means for caching local variables of the method frame and means for caching stack operands of the method frame.
  • the means for caching local variables and the means for caching stack operands may be separated by a logical boundary.
  • a device capable for executing stack-based instructions may comprise a local variable cache, the local variable cache for caching only local variables.
  • the device may further comprise a stack operand cache, the stack operand cache for caching only stack operands.
  • the local variable cache may cache local variables of the current method frame.
  • the method frame may comprise a Java method frame.
  • a device for executing stack based instructions may comprise a cache, the device having a Java mode and a non-Java mode, the cache for storing local variables of the stack based instructions when operating in the Java mode, and the cache used as a general purpose register file in the non-Java mode.
  • a device capable of executing stack-based instructions may comprise a stack operand cache, the stack operand cache for caching only stack operands.
  • the stack operand cache may cache stack operands of one method frame.
  • the one method frame may comprise a Java method frame.
  • the method frame may comprise a current method frame.
  • a device capable of executing stack-based instructions of a method may comprise a stack operand cache, the stack operand cache for caching one or more stack operands of the current method; and a local variable cache, the local variable cache for caching one or more local variables of the current method.
  • the current method may include a return execution context, wherein one or more element of the return execution context is cached on the stack operand cache.
  • the current method may include a return execution context, wherein one or more elements of the return execution context frame segment is cached on the local variable cache.
  • a method for caching a method frame may comprise the steps of providing a method frame; providing a local variable cache and a stack operand cache, the local variable cache and the stack operand cache separated by a logical boundary; caching one or more local variable of the method frame in the local variable cache; and caching one or more stack operand of the method frame in the stack operand cache.
  • the method may further comprise the step of providing the method frame as a current method frame.
  • the method may further comprise the step of caching one or more local variable and one or more stack operand of only one method frame at a time.
  • the step of caching may comprise caching less than all of the stack operands of the method frame.
  • the step of caching may comprise caching less than all of the local variables of the method frame.
  • Figure 2 illustrates a prior art stack-based processor core having an on-chip stack cache that caches entire or modified method frames.
  • Figure 3 illustrates a processor having separate on-chip stack operand and local variable caches.
  • Figures 4A and 4B illustrate two alternative embodiments for an on-chip local variable cache.
  • FIG. 5 illustrates a processor having stack operand and local variable caches provided on a single register file.

Description

  • Referring to Figure 3, a separate stack operand cache 310 and local variable cache 300 are provided.
  • the local variable cache 300 and the stack operand cache 310 may be implemented on a processor core 100.
  • the local variable cache 300 and the stack operand cache 310 may be implemented in cache memory, for example, as a register, a latch, a RAM, a SRAM, a DDR, as is practiced by those skilled in the art.
  • While a processor core comprising a memory is described herein, it is understood that the present invention is not limited thereby, as other memories and other devices are within the scope of the claims that follow, for example, other memories separate from or integral with a processor core that provide faster access to local variables and stack operands of a current method frame than a conventional stack memory 120 can provide.
  • stack operand cache 310 may be implemented as a circular register file, having top pointer 330 and bottom pointer 320.
  • stack operand cache 310 is provided to cache one or more elements of the local stack segment of the current method frame. It is understood that stack operand cache 310 may, however, contain various data of other method frame segments of more than one method in transient states such as during invokes and returns. As stack operands of the current method's local stack segment are pushed and popped, the operands are written to or read from stack operand cache 310 and top pointer 330 is modified to point to the new top of stack in the stack cache.
  • the caches 300, 310 may be coupled to an instruction execution unit, an arithmetic logic unit (ALU), an address calculation unit, operand processor, an on-chip data or unified instruction and data cache, and other components well known to those skilled in the art and as necessary to implement the invention described herein.
  • further levels of caching may be provided between stack and local variable caches and stack memory.
  • stack operand cache 310 may comprise a 32-bit, 8-entry register file. In one embodiment, stack operand cache 310 may comprise a 32-bit register file having no more than 4 entries.
  • an optimal stack operand cache efficiency (measured as the hit rate) may be obtained by providing a stack operand cache 310 having at most 8 entries. Other embodiments, with other numbers of entries are also contemplated herein.
  • the respective sizes of the local variable 300 and local stack 310 caches may be selected by profiling CLDC and MIDP methods.
  • an examination of all methods in all classes in Sun Microsystems's KVM shows that 99.7% of all methods require 8 stack operand elements or less, and 97% of all methods require 8 local variables or less; an examination of all methods in all classes of the MIDP profile shows that 98% of all methods require 8 stack operand elements or less and 96% of all methods require 8 local variables or less.
  • a so-called "dribble manager unit" for handling stack operand cache underflow/overflow in the background need not be provided. While the prior art teaches the use of a dribble manager to increase stack cache efficiency, the optimized stack operand cache 310 of the present invention would benefit little from such additional hardware. Dribble managers require dedicated read and write ports on the stack operand cache to permit background access thereto. The additional ports as well as the actual dribble manager hardware substantially increase power dissipation. As the stack operand cache hit rate of a 4-entry stack cache is already about 87%, the power consumption of a dribble manager may not be justified by the marginal potential for improvement in the stack cache hit rate.
  • the stack operand cache 310 may be optimally implemented with a circular register file. As stack operands are pushed and popped from the stack, stack operand cache may "slide" along the top of the stack as the top of stack is redefined with every stack operation. Furthermore, certain optimizations may be implemented such as returning return values to a caller on the stack operand cache.
  • stack-based processor core 100 may also be provided with a local variable cache 300 for caching local variables of a current method. While the prior art teaches caching both local variables and stack operands of one or more methods in one stack cache implemented as a circular buffer, the present invention provides separate caches, with one dedicated to caching one or more local variable of a current method and one dedicated to caching one or more stack operand of the same current method. By separating the local variable cache 300 from stack operand cache 310, the risk of relegating frequently accessed local variables to slow stack memory may be eliminated. By caching the elements of a method frame into separate caches, the run-time nature of each element may be optimally addressed.
  • stack-based processor performance may be significantly reduced if local variables, particularly local variable 0, are not available in the local variable cache. Because local variables are no longer cached with stack operands and because, unlike stack operands, local variables may be randomly accessed, the local variable cache 300 may be designed to address the nature of local variables. Thus, a circular register file is only one possible design for caching local variables in the absence of stack operands.
  • cache designs may be considered that maximize local variable caching efficiency, for example, designs in which the additional hardware for implementing a circular register file is not required, thus simplifying design, reducing power dissipation, and reducing the footprint of a stack-based processor core 100.
  • the local variable cache 300 may be as small as a 32-bit, 16-entry cache and still provide a high local variable cache hit rate.
  • local variable cache 300 comprises a 32-bit, 16-entry register file.
  • local variable cache 300 may comprise a 32-bit, 12-entry register file or a 32-bit, 8-entry register file.
  • Other embodiments, with other numbers of entries are also contemplated herein.
  • local variable cache 300 may be implemented as a direct- mapped cache. In one embodiment, local variable cache 300 may be implemented as a fully associative cache. One skilled in the art will understand that other cache designs may be considered for the local variable cache without departing from the scope of the present invention.
  • Figure 4 A illustrates an embodiment of a fully associative local variable cache 300.
  • Tag memory may be provided as a means for determining which local variables are currently stored in the local variable cache 300. Such tags and usage thereof are known to those skilled in the art.
  • Dirty bits 400 may be maintained to keep track of which local variables in cache 300 differ from their copies in stack memory 120. When a flush is performed, for instance when the current method invokes another method, the local variable cache entries marked as dirty by dirty bits 400 may be written back to their corresponding locations in stack memory 120.
  • a local variable pointer may be maintained to track the corresponding local variable locations in stack memory 120.
  • the local variable pointer may be stored on-chip in a register and it may contain the address of the first local variable of the current method (i.e. local variable 0).
  • All of the current method's local variables may be accessed via this pointer by using offsets.
  • the address of the local variable, usually generated by adding an offset to the local variable pointer, may be compared against each tag 420. If no match is found, the local variable may be fetched from stack memory 120.
  • an eviction policy may be invoked to dictate which cache entry is overwritten by the next new local variable.
  • the policy may comprise maintaining a hit count for each cache entry and selecting the entry having the fewest hits for eviction, or maintaining a pointer to local variable cache entries that is incremented on every local variable cache miss, the pointer indicating the cache entry to be evicted next.
  • Eviction of a cached local variable typically involves first writing back the cached local variable to its corresponding location in stack memory 120 if it is dirty (as indicated by its dirty bit in dirty field 400).
  • if the evicted entry is not dirty, the new local variable may be written to the local variable cache entry without first writing the evicted local variable back to stack memory 120 (as the copy in stack memory is valid).
  • Other eviction policies may be employed without departing from the scope of the present invention.
  • the eviction policy, the tags 420, dirty bits 400, and used bits 410 may all be provided by a local cache controller, located in core 100.
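The fully associative design of Figure 4A — tags, dirty bits, used bits, and a simple eviction policy — can be sketched as follows. This is a hedged illustration: the round-robin victim pointer is one of the two eviction policies the text mentions, stack memory is modeled as a dictionary, and all names are ours.

```python
# Illustrative fully associative local variable cache: tags hold stack-memory
# addresses, dirty bits track divergence from stack memory, used bits mark
# valid entries, and a round-robin pointer selects the next eviction victim.
class LocalVariableCache:
    def __init__(self, stack_memory, entries=16):
        self.mem = stack_memory          # dict: address -> value (stack memory 120)
        self.tags = [None] * entries     # tags 420
        self.data = [0] * entries
        self.dirty = [False] * entries   # dirty bits 400
        self.used = [False] * entries    # used bits 410
        self.victim = 0                  # incremented on every miss

    def _lookup(self, addr):
        # Compare the address against each tag (associative search).
        for i, tag in enumerate(self.tags):
            if self.used[i] and tag == addr:
                return i
        return None

    def _evict(self):
        i = self.victim
        self.victim = (self.victim + 1) % len(self.tags)
        if self.used[i] and self.dirty[i]:        # write back only if dirty
            self.mem[self.tags[i]] = self.data[i]
        return i

    def read(self, addr):
        i = self._lookup(addr)
        if i is None:                    # miss: fetch from stack memory
            i = self._evict()
            self.tags[i], self.data[i] = addr, self.mem[addr]
            self.used[i], self.dirty[i] = True, False
        return self.data[i]

    def write(self, addr, value):
        i = self._lookup(addr)
        if i is None:
            i = self._evict()
            self.tags[i], self.used[i] = addr, True
        self.data[i], self.dirty[i] = value, True

    def flush(self):
        # e.g. on an invoke: write back all entries marked dirty.
        for i in range(len(self.tags)):
            if self.used[i] and self.dirty[i]:
                self.mem[self.tags[i]] = self.data[i]
                self.dirty[i] = False
```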
  • Figure 4B illustrates an embodiment of a direct-mapped local variable cache 300.
  • a direct-mapped local variable cache 300 may be preferred because a direct-mapped cache requires less hardware for its implementation. Specifically, tag memory and eviction logic are not required. Tags are not required because local variable addressing is implicit in a direct-mapped local variable cache.
  • the first "n" local variables of a current method may be stored therein, where "n" represents the number of entries provided in the stack cache register file 300. In this way, the current object pointer, which is local variable 0 (the first local variable), is guaranteed to remain on-chip while the current method is executed by an instruction execution unit (not shown) in core 100. Furthermore, the lower numbered local variables will tend to be the most frequently accessed local variables in the current method. Thus, by using a simplified local variable cache 300 design, local variable access efficiency may be increased without a great power dissipation penalty.
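The direct-mapped scheme of Figure 4B can be sketched as follows: local variable k of the current method maps directly to cache entry k, so no tags or eviction logic are needed, and locals beyond the first n simply go to stack memory. This is an illustration under our own naming; dirty-bit write-back is omitted for brevity (writes simply stay on-chip here, as a real design would hold the current method's working copies).

```python
# Illustrative direct-mapped local variable cache: entry k caches local
# variable k of the current method; locals beyond n fall through to
# stack memory (modeled as a dict of 4-byte-addressed words).
class DirectMappedLVCache:
    def __init__(self, stack_memory, lv_pointer, entries=16):
        self.mem = stack_memory      # dict: address -> value (stack memory 120)
        self.lvp = lv_pointer        # address of local variable 0
        self.data = [0] * entries
        self.used = [False] * entries

    def read(self, index):
        if index < len(self.data):
            if not self.used[index]:             # first touch: fill from memory
                self.data[index] = self.mem[self.lvp + 4 * index]
                self.used[index] = True
            return self.data[index]
        return self.mem[self.lvp + 4 * index]    # beyond n: stack memory access

    def write(self, index, value):
        if index < len(self.data):
            self.data[index] = value
            self.used[index] = True
        else:
            self.mem[self.lvp + 4 * index] = value
```

Because entry 0 always holds local variable 0, the current object pointer is guaranteed to stay on-chip for the duration of the method, as the text notes.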
  • used bits 410 may be provided to indicate which cache entries contain valid data.
  • the used bits 410 are employed in eviction, flushing, and spilling of local variable cache entries.
  • Referring to FIG. 5, a single cache register file 500 is illustrated, wherein both local variables and stack operands are cached.
  • a logical boundary 530 exists across cache register file 500, indicating a split between a local variable portion 560 and a stack operand portion 570.
  • Multiplexers 510 and 520 are provided to control reads from the cache register file 500 such that registers of the local variable portion of cache register file 500 are fed to multiplexer 510, and registers of the stack operand portion of the cache register file 500 are fed to multiplexer 520.
  • Multiplexer 510 is under the control, via selection input 550, of a local variable controller (not shown).
  • Multiplexer 520 is under the control, via selection input 540, of a local stack cache controller (not shown). Similar write logic may be provided.
  • This example illustrates that the concept of "splitting" stack and local variable caches into separate caches is not restricted to “physically” splitting the two caches.
  • a single multiplexer may be provided, wherein a selection input thereof is controlled so that the local variable controller is forbidden to read or write in the local stack cache portion of the cache register file 500, and wherein the local stack cache controller is forbidden to read or write in the local variable portion of the cache register file 500.
  • the local variable registers may advantageously be used as general purpose registers in the C mode.
  • On invokes where a caller passes parameters to a method, the caller typically pushes the parameters onto its local stack immediately prior to the execution of the invoke instruction.
  • the parameters may be located on-chip in stack operand cache 310.
  • instruction execution unit may issue a command to copy the parameters in stack operand cache 310 into the bottom elements of local variable cache 300.
  • Invokes, which are known to consume 20-40% of processing time when executing Java bytecode, may be accelerated by avoiding the two interactions with slow stack memory 120 for every parameter that is passed to a method from a caller.
  • otherwise, the processor core would be required to a) write back each parameter from the stack operand cache 310 in the caller's context to stack memory 120, and b) fetch each parameter from stack memory 120 into local variable cache 300.
  • because accesses to memory 110 are costly in terms of processor cycles and power, and because invoke instructions are frequently executed in a Java Native Processor, the present invention provides a balance of high performance and low power dissipation.
  • the instruction execution unit of a processor core having local variable 300 and operand stack 310 caches may be microprogrammed to provide interaction with each cache.
  • microinstructions may control the various hardware blocks in the processor core to implement the instructions.
  • the microinstructions may conceptually be divided into various fields, wherein each field provides a control input to a specified hardware block in the processor. Accordingly, separate microinstruction fields may be provided for both the local variable cache 300 and the operand stack cache 310.
  • the corresponding microinstruction field may be provided as an input to the blocks.
  • the operand stack cache block may include a stack cache controller (not shown), the stack cache controller having a finite state machine implementation and the ability to freeze the pipeline in the processor core to handle states such as local stack underflow on a stack read and local stack overflow on a stack write.
  • a local stack underflow state may occur when the requested data is not available in the stack cache and must be fetched from memory.
  • a local stack overflow occurs when there is insufficient room in the stack cache for the data to be written and one or more stack elements must be written back to memory. Accordingly, on a local stack underflow or overflow, the local stack cache controller may freeze the pipeline of the processor core while the stack cache completes a read or write to stack memory.
  • the microinstruction field may be simplified as fewer states need to be specified and the overall width of the microinstruction fields may be reduced.
  • the reduction in microinstruction field width results in a smaller, lower power microprogram ROM in the microinstruction sequencer unit (MSU) of the processor core. This results in an overall reduction of power consumption and physical size of the processor core.
  • the local variable cache block may include a local variable cache controller having a finite state machine implementation and the ability to freeze the pipeline in the processor core.
  • the local variable cache controller may handle cache misses (when the requested local variable is not available in the cache) and flushes (when cache data is to be written back to stack memory). Accordingly, the width of the local variable cache microinstruction field may be reduced, further contributing to reducing the size of the microprogram ROM in the microsequencer unit.
  • local variable cache 300 and the stack operand cache 310 may be implemented other than as discussed herein, for example, using memories such as SRAM, latches, flip-flops, and the like.
  • J2SE Java 2 Standard Edition
  • J2ME Java 2 Micro Edition
  • CLDC Connected Limited Device Configuration
  • Sun Microsystems, Inc. Palo Alto, California
  • CLI Common Language Infrastructure
  • IL Intermediate Language
  • CLR Common Language Run-time
  • C# programming language, which is part of the .NET and .NET Compact Framework, available from Microsoft Corporation, Redmond, Washington
  • BREW Binary Run-time Environment for Wireless
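The invoke-time parameter copy described in the list above (moving the caller's top-of-stack operands directly into the bottom entries of the callee's local variable cache, with no round trip through slow stack memory 120) can be sketched as a small software model. This is an illustrative sketch only; the function name and the list-based stack are assumptions, not part of the disclosed hardware.

```python
def invoke_copy_params(operand_stack, num_params):
    """Pop the caller's top num_params operands and return them ordered as
    local variables 0..num_params-1 of the callee. In hardware this copy
    happens cache-to-cache, avoiding two slow stack-memory accesses per
    parameter."""
    params = [operand_stack.pop() for _ in range(num_params)]
    params.reverse()  # the first parameter pushed becomes local variable 0
    return params

# The caller pushed parameters 1, 2, 3 onto its local stack before the invoke.
stack = [7, 1, 2, 3]
callee_locals = invoke_copy_params(stack, 3)  # -> [1, 2, 3]; stack is now [7]
```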

Abstract

An apparatus and method are provided to cache local variables and stack operands of a current method frame.

Description

Methods and Devices for Caching Method Frame Segments in a Low-Power Stack-Based Processor
Related Applications
This application claims priority to and is related to the following commonly assigned applications:
PCT Application S.N. Docket number 1065PCT, filed on this day concurrently;
PCT Application S.N. Docket number 1123PCT, filed on this day concurrently;
PCT Application S.N. Docket number 1071PCT, filed on this day concurrently;
US Provisional Application S.N. 60/276,375 Docket Number 1065.1; US Provisional Application S.N. 60/252,170 Docket Number 1065; US Provisional Application S.N. 60/323,022 Docket Number 1123; US Provisional Application S.N. 60/290,520 Docket Number 1024.1; US Provisional Application S.N. 60/270,696 Docket Number 1024; US Provisional Application S.N. 60/256,550 Docket Number 1089; and US Patent Application S.N. 09/941619 Docket Number 1078US; all of which are incorporated herein by reference.
Field of the Invention
The present invention generally relates to caching data. More specifically, the present invention relates to caching a method frame and segments thereof in a stack-based processor on a low-power device.
Background
Mobile devices, such as cellular phones, are quickly becoming commodities, forcing manufacturers and designers of these devices to battle for market share in a price war, and squeezing profit margins. Accordingly, many have identified the need to enhance these devices with additional and meaningful capability by enabling the devices with Java[TM] or Java-like processing capability. These enhanced devices, combined with wireless high data transfer rate capability, provided by emerging 3G wireless technologies, will be central to enabling the vision of so-called "ubiquitous" computing. The Java platform promises better security for transactions, permits users to download new applications including video games, and promises a wealth of interactive content including streaming audio and video.
Traditional software Java Virtual Machines (JVMs), which interpret Java bytecode into the machine language of a given computing platform, are slow. To obtain satisfactory performance from a software JVM on a standard embedded processor, the clock rate must be increased. Unfortunately, in the prior art this performance is achieved at the expense of high power dissipation due to the increased clock rate. Also, additional memory is required to store the software implementing the JVM, which further increases the power consumption, size, and cost of the mobile device.
Other proposals include the use of just-in-time compilers (JIT), hardware accelerators, or mere extensions to an existing processor's architecture. However, the most promising Java-enabling technology, by far, is the provision of a Java Native Processor into a mobile device.
Unfortunately, prior art Java Native Processors, such as the picoJava[TM] II designed by Sun Microsystems, have yet to be incorporated into a consumer device. One shortcoming of prior art Java Native Processors is their high power dissipation; another is their large size. Some of the most compelling applications expected to be exploited by future Java-enabled mobile devices, such as games and other graphics-intensive applications, will make use of multithreading, one of the key features of the Java[TM] programming language.
The Java programming language is stack-based, meaning that many operations are conducted on data stored in a "stack" data structure held in off-chip (with respect to a processor) memory (hereafter referred to as "stack memory").
Each method context in Java is represented by a "method frame". Method frames are pushed on and popped off the run-time stack as method invokes and returns are executed. Each method frame includes parameters received from the method's caller, new local variables, the return execution context (which may include the return PC, frame pointer, local variable pointer, status flags, etc.), and the local stack. The local stack is used to hold temporary results, pass parameters to methods, and pass return values to callers.
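The frame layout just described can be modeled as a simple data structure. The field names below are illustrative assumptions for exposition; the actual layout is defined by the processor and runtime, not by this sketch.

```python
from dataclasses import dataclass, field

@dataclass
class MethodFrame:
    # Parameters received from the caller plus the method's new local variables
    local_variables: list
    # Saved caller state: return PC, frame pointer, local variable pointer, flags
    return_context: dict
    # Local stack: temporaries, outgoing parameters, values returned to callers
    local_stack: list = field(default_factory=list)
```

On an invoke, a frame like this is pushed onto the run-time stack; on a return, it is popped and the caller's frame becomes current again.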
As methods are invoked, method frames are maintained in stack memory to preserve the state of the processor. Consequently, optimizations directed to improving the speed of stack operations are highly effective in increasing the overall performance of Java Native Processors.
One known stack optimization is the provision of a stack cache. Stack caches are typically used to cache the top elements of the stack in memory (stack memory), storing the elements on-chip with the processor. In this way, the processor may access the top elements of stack memory faster than if the top elements were to be fetched from the slower off-chip stack memory, thereby speeding up overall execution performance. While prior art stack caches for stack-based processors provide satisfactory execution performance, they fail to address the requirements of low-power devices, for example, mobile wireless devices. Such devices are typically battery powered and must operate with little power if users are to be provided with reasonable battery lifetime and a minimal performance level before having to replace or recharge the battery. Furthermore, as industry strives to miniaturize mobile devices, limited real estate is available on a printed circuit board (PCB) for the processor. Power management also promises to be a key issue for mobile devices as color displays become more common. Thus, although it is known in the art to provide a stack cache to increase the performance of stack-based processors, a need exists in the art for a stack cache directed to increasing the performance of mobile devices while at the same time respecting the power and footprint design constraints.
Referring to Figure 1, a prior art stack-based processor core 100 is illustrated. Off- chip memory 110 is provided having a portion thereof allocated as stack memory 120 for use by the stack-based processor core 100. Accesses to stack memory 120 typically take several processor clock cycles.
Figure 2-1 illustrates a prior-art stack-based processor 100 provided with an on-chip stack cache 200. On-chip stack cache 200 serves to maintain a copy of data in stack memory 120 for single-cycle access. Multiple method frames 210 are stored in stack memory 120, and copies of multiple method frames are maintained in the stack cache 200. Referring to Figure 2-2, method frames may comprise a local variable segment 220, a return execution context segment 230, and a local stack operand segment 240. The method frame located at the top of the stack is the "current" method - i.e. the method that the processor core 100 is executing at a given point in time. A stack pointer is typically maintained by processor core 100 to indicate the top of stack element in stack memory 120. Depending on the size of stack cache 200, the number of copies of the top elements of stack memory 120 that are maintained in stack cache 200 can be large.
The picoJava II Native Java Processor provides a 32-bit wide, 64-entry stack cache, implemented as a single circular buffer of memory locations such as a register file. A top pointer 330 and a bottom pointer 320 are maintained to implement the register file as a circular register file. Top pointer 330 may point to the stack cache entry that contains the next available entry for data to be stored (pushed) on the stack cache. When data is pushed on the stack, the data is written to stack cache 200 at the location indicated by top pointer 330 and the top pointer 330 is modified to point to the next available stack cache entry. When data is popped from the stack, the top pointer 330 is modified to point to the top valid entry and the data from that entry is placed on a read bus in processor core 100. With the popped entry no longer valid, a subsequent push instruction will overwrite the entry with new data. When the circular register file is full, as determined by the positions of top pointer 330 and bottom pointer 320, action may be taken to write back one or more elements of the stack cache 200 to stack memory, updating the stack pointer accordingly.
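The circular-buffer behavior described above (top pointer for pushes and pops, bottom pointer for spills to stack memory when the buffer fills) can be sketched in software. This is a simplified illustrative model; the class name and spill-one-entry policy are assumptions, and picoJava II's actual 64-entry register file includes additional machinery not shown here.

```python
class CircularStackCache:
    """Software model of a circular-buffer stack cache: `top` indicates the
    next free entry; when the buffer fills, the bottom entry is written back
    ("spilled") to the slower stack memory, modeled here as a list."""

    def __init__(self, entries, stack_memory):
        self.entries = entries
        self.regs = [None] * entries
        self.top = 0                      # next available cache entry
        self.count = 0                    # number of valid cached elements
        self.stack_memory = stack_memory  # models off-chip stack memory 120

    def push(self, value):
        if self.count == self.entries:    # cache full: spill the bottom entry
            bottom = (self.top - self.count) % self.entries
            self.stack_memory.append(self.regs[bottom])
            self.count -= 1
        self.regs[self.top] = value
        self.top = (self.top + 1) % self.entries
        self.count += 1

    def pop(self):
        if self.count == 0:               # underflow: refill from stack memory
            return self.stack_memory.pop()
        self.top = (self.top - 1) % self.entries
        self.count -= 1
        return self.regs[self.top]
```

With a 4-entry cache, pushing six values spills the two oldest to memory, yet six pops still return the values in correct LIFO order, transparently refilling from memory on underflow.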
US Patent 6,021,469 teaches a method for caching method frames on a stack cache by caching so-called "modified" method frames 250 on the circular buffer stack cache. The '469 Patent proposes caching only the local variable and local stack operand segments of one or more methods to reduce the size of the stack cache. However, a significant number of registers are still required in this implementation, as the size of these two segments may be large because more than one method frame may need to be cached.
Implementations of large stack caches suffer from high latency on thread switching, as all the stack elements (for example, the stack operands and local variables of one or more methods of an old thread) must be spilled back to memory prior to executing in the context of a new thread. Furthermore, a register file or SRAM large enough to implement a large stack cache as taught in Figures 2-1, 2-2, and 2-3 contributes significantly to the power dissipation and footprint of the processor. To reduce the number of stack elements to be spilled back to memory, it is known in the art to maintain "dirty bits" - bits that indicate whether a cached copy is coherent with the corresponding memory copy. Unfortunately, a large stack cache, if implemented with a memory technology such as SRAM, cannot maintain dirty bits associated with each word in the stack cache. If a large conventional stack cache is implemented using latches or registers, dirty bits may be provided; however, the high gate count needed to implement such a cache would be detrimental to both the footprint of the processor die and its power dissipation.
Furthermore, with both the local variable 220 and local stack operand segments 240 provided in a single stack cache, the local variables 220 of a method are at risk of being relegated to stack memory 120 whenever the same method pushes a large number of local stack operands.
Accordingly, a need exists for a cache system that overcomes these as well as other shortcomings of the prior art.
Summary of the Invention
Although a stack cache may be used to improve the performance of a stack-based machine, conventional designs are inadequate for devices such as cellular phones, PDAs, and the like, which offer limited power and real estate. Therefore, in one embodiment, a local variable cache is provided logically and/or physically separated from a stack operand cache, such that a higher degree of control may be provided with respect to the design parameters of one or more method frame segments. By splitting the local variable segment from the stack operand segment, the stack operands of a method do not relegate frequently accessed local variables to stack memory. Furthermore, an optimized design may be provided for each segment. A stack operand cache may be provided having an optimal size for stack operands and implemented as a circular register file to provide a "sliding window" of the stack operands of a method. A separate local variable cache may be optimized for local variables to have a minimal size and a design that is best suited for local variables, such as a fully associative or direct-mapped cache.
Although a smaller stack cache may be provided with fewer cached elements than the prior art (thereby necessitating more accesses to stack memory), a performance increase may nevertheless be achieved. Processing in a multithreaded system is enhanced because fewer stack cache elements need to be written back (flushed) to the stack memory of a first thread when execution switches from the first thread to a second thread.
In one embodiment, a caching mechanism for caching the stack segments of a method is provided. The mechanism comprises a first cache for a first segment and a second cache for a second segment.
In one embodiment, a caching mechanism for caching the stack segments of a method comprising a first cache for caching a local variable segment and a second cache for caching a stack operand segment is provided, wherein the first cache comprises at most 16 entries, and the second cache comprises 2 to about 8 entries.
In one embodiment, a caching mechanism for caching the stack segments of a method comprising a first cache for caching a local variable segment and a second cache for caching a stack operand segment are provided, wherein the second cache comprises a circular buffer. In one embodiment, a device capable of executing stack-based instructions, the instructions provided by a method, the method including at least one method frame, the method frame comprising one or more frame segments, may comprise means for caching one or more frame segments of a current method frame. The one or more method frame segments may comprise a local variable segment and a stack operand segment. The frame segments may comprise a local variable segment that includes a number of local variables, and a stack operand segment that includes a number of stack operands, and the means for caching may cache less than the number of local variables and less than the number of stack operands. The device may further comprise a processor core; and a memory operatively coupled to the processor core, the memory containing a stack data structure for storing a call chain of method frames, wherein the means for caching is coupled to both the processor core and the memory, and wherein the means for caching provides the processor core with faster access to the current method frame than may be provided by the memory. The means for caching may be integrated onto a semiconductor chip. The means for caching may comprise faster memory technology than the memory. The means for caching may comprise cache memory, and the cache memory may be selected from the group consisting of registers, latches, RAM, SRAM, and DDR. The one or more frame segments may comprise segment elements, and the means for caching may accommodate no more than 16 segment elements. 
The one or more frame segments may comprise stack operand segment elements, wherein the means for caching accommodates storage of no more than 8 stack operand segment elements. The one or more frame segments may comprise stack operand segment elements, wherein the means for caching accommodates storage of 8 stack operand segment elements. The one or more frame segments may comprise local variable segment elements, and the means for caching may accommodate no more than 16 local variable segment elements. The means for caching may comprise a circular register file; a top pointer for indicating a first entry in a first register of the register file; and a bottom pointer for indicating a last entry in a second register of the register file. The device may comprise a mobile device. The device may comprise a Java native processor. The device may comprise a Java accelerator. The device may comprise an instruction-path Java accelerator. The device may comprise an electronic circuit.
In one embodiment, a device capable of executing stack-based instructions may comprise a first cache memory for caching stack operands of the instructions, and a second cache memory for caching local variables of the instructions. The first cache memory may comprise a circular register file. The second cache memory may comprise a direct mapped cache memory. The second cache memory may comprise a fully associative cache memory. The stack operands and the local variables may be both of a single current method. In one embodiment, a circuit for caching a method frame may comprise means for caching local variables of the method frame and means for caching stack operands of the method frame. The means for caching local variables and the means for caching stack operands may be separated by a logical boundary.
In one embodiment, a device capable of executing stack-based instructions may comprise a local variable cache, the local variable cache for caching only local variables. The device may further comprise a stack operand cache, the stack operand cache for caching only stack operands. The local variable cache may cache local variables of the current method frame. The method frame may comprise a Java method frame.
In one embodiment, a device for executing stack based instructions may comprise a cache, the device having a Java mode and a non-Java mode, the cache for storing local variables of the stack based instructions when operating in the Java mode, and the cache used as a general purpose register file in the non-Java mode.
In one embodiment, a device capable of executing stack-based instructions may comprise a stack operand cache, the stack operand cache for caching only stack operands. The stack operand cache may cache stack operands of one method frame. The one method frame may comprise a Java method frame. The method frame may comprise a current method frame.
In one embodiment, a device capable of executing stack-based instructions of a method, the method including a current method frame stored on a stack, the current method frame including one or more stack operands and one or more local variables, may comprise a stack operand cache, the stack operand cache for caching one or more stack operands of the current method; and a local variable cache, the local variable cache for caching one or more local variables of the current method. The current method may include a return execution context, wherein one or more element of the return execution context is cached on the stack operand cache. The current method may include a return execution context, wherein one or more elements of the return execution context frame segment is cached on the local variable cache. In one embodiment, a method for caching a method frame may comprise the steps of providing a method frame; providing a local variable cache and a stack operand cache, the local variable cache and the stack operand cache separated by a logical boundary; caching one or more local variable of the method frame in the local variable cache; and caching one or more stack operand of the method frame in the stack operand cache. The method may further comprise the step of providing the method frame as a current method frame. The method may further comprise the step of caching one or more local variable and one or more stack operand of only one method frame at a time. The step of caching may comprise caching less than all of the stack operands of the method frame. The step of caching may comprise caching less than all of the local variables of the method frame.
Further aspects, benefits, and embodiments will be apparent to those skilled in the art when viewed in light of the description and claims provided herein.
Description of Drawings
Figure 1 illustrates a prior art stack-based processor core having stack memory connected thereto.
Figure 2 illustrates a prior art stack-based processor core having an on-chip stack cache that caches entire or modified method frames.
Figure 3 illustrates a processor having separate on-chip stack operand and local variable caches.
Figures 4A and 4B illustrate two alternative embodiments for an on-chip local variable cache.
Figure 5 illustrates a processor having stack operand and local variable caches provided on a single register file.
Description
Referring to Figure 3, a separate stack operand cache 310 and local variable cache
300 are illustrated. In one embodiment, the local variable cache 300 and the stack operand cache 310 may be implemented on a processor core 100. In one embodiment, the local variable cache 300 and the stack operand cache 310 may be implemented in cache memory, for example, as registers, latches, RAM, SRAM, or DDR, as is practiced by those skilled in the art. Although a processor core comprising a memory is described herein, it is understood that the present invention is not limited thereby, as other memories and other devices are within the scope of the claims that follow, for example, other memories separate from or integral with a processor core that provide faster access to the local variables and stack operands of a current method frame than a conventional stack memory 120 can provide.
In one embodiment, stack operand cache 310 may be implemented as a circular register file, having top pointer 330 and bottom pointer 320. In one embodiment, stack operand cache 310 is provided to cache one or more elements of the local stack segment of the current method frame. It is understood that stack operand cache 310 may, however, contain various data of other method frame segments of more than one method in transient states, such as during invokes and returns. As stack operands of the current method's local stack segment are pushed and popped, the operands are written to or read from stack operand cache 310 and top pointer 330 is modified to point to the new top of stack in the stack cache. In stack-based processor core 100, the caches 300, 310 may be coupled to an instruction execution unit, an arithmetic logic unit (ALU), an address calculation unit, an operand processor, an on-chip data or unified instruction and data cache, and other components well known to those skilled in the art and as necessary to implement the invention described herein. Those skilled in the art will understand that, in one embodiment, further levels of caching may be provided between the stack and local variable caches and stack memory.
In one embodiment, statistical profiling of MIDlets written to be compliant with J2ME (available from Sun Microsystems Inc., Palo Alto, CA) indicates that a 13% stack operand cache miss rate (i.e., the fraction of accesses for which the processor must read or write the memory) may be obtained for both a 4-entry stack operand cache and an 8-entry stack operand cache. The stack operand cache miss rate increases only marginally as the number of stack operand cache entries is reduced to three (19% stack cache miss rate) or two entries (27% stack cache miss rate). Accordingly, in one embodiment, stack operand cache 310 may comprise a 32-bit, 8-entry register file. In one embodiment, stack operand cache 310 may comprise a 32-bit register file having no more than 4 entries. In one embodiment, an optimal stack operand cache efficiency (measured as the hit rate) may be obtained by providing a stack operand cache 310 having at most 8 entries. Other embodiments, with other numbers of entries, are also contemplated herein. In one embodiment, with Java or Java-like stack-based instructions, the respective sizes of the local variable 300 and local stack 310 caches may be selected by profiling CLDC and MIDP methods. For example, an examination of all methods in all classes in Sun Microsystems's KVM (available from Sun Microsystems Inc., Palo Alto, CA) shows that 99.7% of all methods require 8 stack operand elements or fewer and 97% of all methods require 8 local variables or fewer; and an examination of all methods in all classes of the MIDP profile shows that 98% of all methods require 8 stack operand elements or fewer and 96% of all methods require 8 local variables or fewer.
In one embodiment, a so-called "dribble manager unit" for handling stack operand cache underflow/overflow in the background need not be provided. While the prior art teaches the use of a dribble manager to increase stack cache efficiency, the optimized stack operand cache 310 of the present invention would benefit little from such additional hardware. Dribble managers require dedicated read and write ports on the stack operand cache to permit background access thereto. The additional ports as well as the actual dribble manager hardware substantially increase power dissipation. As the stack operand cache hit rate of a 4-entry stack cache is already about 87%, the power consumption of a dribble manager may not be justified by the marginal potential for improvement in the stack cache hit rate.
In one embodiment, the stack operand cache 310 may be optimally implemented with a circular register file. As stack operands are pushed and popped from the stack, stack operand cache may "slide" along the top of the stack as the top of stack is redefined with every stack operation. Furthermore, certain optimizations may be implemented such as returning return values to a caller on the stack operand cache.
Referring further to Figure 3, in one embodiment, stack-based processor core 100 may also be provided with a local variable cache 300 for caching local variables of a current method. While the prior art teaches caching both local variables and stack operands of one or more methods in one stack cache implemented as a circular buffer, the present invention provides separate caches, one dedicated to caching one or more local variables of a current method and one dedicated to caching one or more stack operands of the same current method. By separating the local variable cache 300 from stack operand cache 310, the risk of relegating frequently accessed local variables to slow stack memory may be eliminated. By caching the elements of a method frame in separate caches, the run-time nature of each element may be optimally addressed. The prior art "one-cache-fits-all-elements" approach fails to satisfy such an optimal design, particularly in a low-power, portable device. In one embodiment, with separate caches, if a current method pushes more stack operands than the number of stack cache entries, local variables will not be spilled to stack memory. This is a significant consideration because local variables, particularly the current object pointer (known as the "this" pointer), which tends to be local variable 0, are frequently and randomly accessed. This contrasts with the nature of stack operands, in that stack operands are generally accessed only from the top of the stack (as indicated by top pointer 330). Thus, while it suffices to keep only the top local stack elements in stack operand cache 310, stack-based processor performance may be significantly reduced if local variables, particularly local variable 0, are not available in the local variable cache.
Because local variables are no longer cached with stack operands and because, unlike stack operands, local variables may be randomly accessed, the local variable cache 300 may be designed to address the nature of local variables. Thus, in one embodiment, a circular register file provides only one possible design for caching local variables in the absence of stack operands. Other cache designs may be considered that are directed to maximizing local variable caching efficiency, for example, designs wherein the additional hardware for implementing a circular register file is not required, thus simplifying the design, reducing power dissipation, and reducing the footprint of a stack-based processor core 100.
Based on statistical profiling of J2ME MIDlets, it has been discovered that the local variable cache 300 may be as small as a 32-bit, 16-entry cache and still provide a high local variable cache hit rate. Thus, in one embodiment, local variable cache 300 comprises a 32-bit, 16-entry register file. However, in other embodiments, local variable cache 300 may comprise a 32-bit, 12-entry register file or a 32-bit, 8-entry register file. Other embodiments, with other numbers of entries, are also contemplated herein.
In one embodiment, local variable cache 300 may be implemented as a direct- mapped cache. In one embodiment, local variable cache 300 may be implemented as a fully associative cache. One skilled in the art will understand that other cache designs may be considered for the local variable cache without departing from the scope of the present invention.
Figure 4A illustrates an embodiment of a fully associative local variable cache 300. Tag memory may be provided as a means for determining which local variables are currently stored in the local variable cache 300. Such tags and the usage thereof are known to those skilled in the art. Dirty bits 400 may be maintained to keep track of which local variables in cache 300 differ from their copies in stack memory 120. When a flush is performed, for instance when the current method invokes another method, the local variable cache entries marked as dirty by dirty bits 400 may be written back to their corresponding locations in stack memory 120. A local variable pointer may be maintained to track the corresponding local variable locations in stack memory 120. The local variable pointer may be stored on-chip in a register, and it may contain the address of the first local variable of the current method (i.e., local variable 0). All of the current method's local variables may be accessed via this pointer by using offsets. When a local variable is accessed, the address of the local variable, usually generated by adding an offset to the local variable pointer, may be compared against each tag 420. If no match is found, the local variable may be fetched from stack memory 120.
Many schemes may be employed to manage the storage of local variables in local variable cache 300. In one embodiment, when local variables are accessed, they may be written to available entries in local variable cache 300. Once the cache is full, an eviction policy may be invoked to dictate which cache entry is overwritten by the next new local variable. The policy may comprise maintaining hit counts and selecting for eviction the cache entry having the fewest hits, or maintaining a pointer to local variable cache entries that is incremented on every local variable cache miss, the pointer indicating the cache entry to be evicted next. Eviction of a cached local variable typically involves first writing back the cached local variable to its corresponding location in stack memory 120 if it is dirty (as indicated by its dirty bit in dirty field 400). If the cached local variable to be evicted is not dirty, then the new local variable may be written to the local variable cache entry without first writing the evicted local variable to stack memory 120 (as the copy in stack memory is valid). Other eviction policies may be employed without departing from the scope of the present invention. In one embodiment, the eviction policy, the tags 420, dirty bits 400, and used bits 410 may all be provided by a local cache controller located in core 100.
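The pointer-based eviction policy described above can be sketched as follows. This is an illustrative model under assumed structures: a round-robin victim pointer advanced on each miss, with a dirty victim written back before being overwritten.

```python
class EvictingCache:
    """Behavioral model of a cache with a round-robin eviction pointer."""

    def __init__(self, size, stack_memory):
        self.size = size
        self.stack = stack_memory          # models stack memory 120
        self.tags = [None] * size
        self.data = [0] * size
        self.dirty = [False] * size
        self.used = [False] * size
        self.victim = 0                    # pointer to the next entry to evict

    def fill(self, addr):
        """Bring 'addr' into the cache on a miss, evicting if full."""
        if all(self.used):
            slot = self.victim
            if self.dirty[slot]:
                # Write back the dirty victim to its stack-memory location first.
                self.stack[self.tags[slot]] = self.data[slot]
            # Advance the pointer on every miss, as described above.
            self.victim = (self.victim + 1) % self.size
        else:
            slot = self.used.index(False)  # a free entry is still available
        self.tags[slot], self.used[slot] = addr, True
        self.data[slot], self.dirty[slot] = self.stack[addr], False
        return slot
```

A clean victim is simply overwritten, since its stack-memory copy is valid; only a dirty victim costs a write-back.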
Figure 4B illustrates an embodiment of a direct-mapped local variable cache 300. A direct-mapped local variable cache 300 may be preferred because it requires less hardware for its implementation. Specifically, tag memory and eviction logic are not required. Tags are not required because local variable addressing is implicit in a direct-mapped local variable cache. In embodiments practicing direct-mapped local variable caches 300, the first "n" local variables of a current method may be stored therein, where "n" represents the number of entries provided in the stack cache register file 300. In this way, the current object pointer, which is local variable 0 (the first local variable), is guaranteed to remain on-chip while the current method is executed by an instruction execution unit (not shown) in core 100. Furthermore, the lower numbered local variables will tend to be the most frequently accessed local variables in the current method. Thus, by using a simplified local variable cache 300 design, local variable access efficiency may be increased without a great power dissipation penalty.
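The direct-mapped variant can be sketched as a minimal access function: the first "n" local variables map implicitly to fixed slots, so no tag comparison is needed. Names and sizes below are illustrative assumptions, not the patented hardware.

```python
N = 16  # illustrative number of entries in the register file

def read_local_direct(index, regfile, used, stack_memory, lv_pointer):
    """Local variables 0..N-1 map implicitly to slots 0..N-1 (no tags)."""
    if index < N and used[index]:
        # On-chip hit: includes local variable 0, the current object pointer,
        # which is guaranteed to stay cached while the method executes.
        return regfile[index]
    # Higher-numbered (or not-yet-loaded) locals come from stack memory.
    return stack_memory[lv_pointer + index]
```

Because the slot is the variable number itself, the hit/miss decision is a single bounds check plus a used bit, illustrating why tag memory and eviction logic drop out of this design.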
In both fully associative and direct mapped embodiments, used bits 410 may be provided to indicate which cache entries contain valid data. The used bits 410 are employed in eviction, flushing, and spilling of local variable cache entries.
Referring now to Figure 5, a single cache register file 500 is illustrated, wherein both local variables and stack operands are cached. A logical boundary 530 exists across cache register file 500, indicating a split between a local variable portion 560 and a stack operand portion 570. Multiplexers 510 and 520 are provided to control reads from the cache register file 500 such that registers of the local variable portion of cache register file 500 are fed to multiplexer 510, and registers of the stack operand portion of the cache register file 500 are fed to multiplexer 520. Multiplexer 510 is under the control, via selection input 550, of a local variable controller (not shown). Multiplexer 520 is under the control, via selection input 540, of a local stack cache controller (not shown). Similar write logic may be provided. This example illustrates that the concept of "splitting" stack and local variable caches into separate caches is not restricted to "physically" splitting the two caches.
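The logical split of Figure 5 can be modeled as a boundary check on a single register file: each controller may only select registers on its own side of the boundary, mirroring the separate selection inputs of multiplexers 510 and 520. The boundary position and function names below are illustrative assumptions.

```python
BOUNDARY = 8  # illustrative split point of a 16-entry register file

def select(regfile, index, controller):
    """Read one register, enforcing the logical boundary between portions."""
    if controller == "local_variable":
        # Local variable controller confined to the local variable portion.
        assert 0 <= index < BOUNDARY, "LV controller outside LV portion"
    elif controller == "stack":
        # Stack cache controller confined to the stack operand portion.
        assert BOUNDARY <= index < len(regfile), "stack controller outside stack portion"
    else:
        raise ValueError("unknown controller")
    return regfile[index]
```

A request from the wrong controller fails the boundary check, showing that the split is enforced by selection logic rather than by physically separate memories.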
In one embodiment, a single multiplexer may be provided, wherein a selection input thereof is controlled so that the local variable controller is forbidden to read or write in the local stack cache portion of the cache register file 500, and wherein the local stack cache controller is forbidden to read or write in the local variable portion of the cache register file 500. In one embodiment, where stack-based processor core 100 is capable of operating in more than one mode, for example a Java mode and a C mode, the local variable registers may advantageously be used as general purpose registers in the C mode.
On invokes where a caller passes parameters to a method, the caller typically pushes the parameters onto its local stack immediately prior to the execution of the invoke instruction. Thus, when the instruction execution unit on a stack-based processor core executes the invoke instruction, the parameters may be located on-chip in stack operand cache 310.
In one embodiment, the instruction execution unit may issue a command to copy the parameters in stack operand cache 310 into the bottom elements of local variable cache 300. Invokes, which are known to consume 20-40% of processing when executing Java bytecode, may be accelerated by avoiding the two interactions with slow stack memory 120 for every parameter that is passed to a method from a caller. Without the capability of transferring the parameters from stack operand cache 310 to local variable cache 300, the processor core would be required to a) write back each parameter from the stack operand cache 310 in the caller's context to stack memory 120, and b) fetch each parameter from stack memory 120 into local variable cache 300. As accesses to memory 110 are costly in terms of processor cycles and power, combined with the fact that invoke instructions are frequently executed in a Java Native Processor, the present invention provides a balance of high performance while providing low power dissipation.
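The on-chip parameter hand-off described above can be sketched as a copy from the top of the operand stack cache into the low entries of the local variable cache, with no round trip through stack memory. The function name and list-based structures are illustrative assumptions.

```python
def invoke_copy_params(operand_cache, lv_cache, num_params):
    """On an invoke, the caller's top num_params operands become the
    callee's local variables 0..num_params-1, entirely on-chip."""
    params = operand_cache[-num_params:]   # the caller pushed these last
    del operand_cache[-num_params:]        # they leave the caller's operand stack
    for i, value in enumerate(params):
        lv_cache[i] = value                # bottom entries of the LV cache
    return lv_cache
```

Each parameter thus avoids one write to and one read from stack memory 120, which is the saving the paragraph above attributes to this transfer.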
In one embodiment, the instruction execution unit of a processor core having local variable 300 and operand stack 310 caches may be microprogrammed to provide interaction with each cache. With processor core 100 executing instructions, microinstructions may control the various hardware blocks in the processor core to implement the instructions. The microinstructions may conceptually be divided into various fields, wherein each field provides a control input to a specified hardware block in the processor. Accordingly, separate microinstruction fields may be provided for both the local variable cache 300 and the operand stack cache 310. When an instruction requiring use of either of the local variable or local stack cache blocks executes, the corresponding microinstruction field may be provided as an input to the blocks.
The operand stack cache block may include a stack cache controller (not shown), the stack controller having a finite state machine implementation and the ability to freeze the pipeline in the processor core to handle states such as local stack underflow on a stack read and local stack overflow on a stack write. A local stack underflow state may occur when the requested data is not available in the stack cache and must be fetched from memory. A local stack overflow occurs when there is insufficient room in the stack cache for the data to be written and one or more stack entries must be written back to memory. Accordingly, on a local stack underflow or overflow, the local stack cache controller may freeze the pipeline of the processor core while the stack cache completes a read or write to stack memory. In this way, handling of overflow and underflow states may be delegated to the local stack block and the microprogram is not required to control the local stack cache through these states. The microinstruction field may be simplified as fewer states need to be specified and the overall width of the microinstruction fields may be reduced. Ultimately, the reduction in microinstruction field width results in a smaller, lower power microprogram ROM in the microinstruction sequencer unit (MSU) of the processor core. This results in an overall reduction of power consumption and physical size of the processor core. Similarly, the local variable cache block may include a local variable cache controller having a finite state machine implementation and the ability to freeze the pipeline in the processor core. The local variable cache controller may handle cache misses (when the requested local variable is not available in the cache) and flushes (when cache data is to be written back to stack memory). Accordingly, the width of the local variable cache microinstruction field may be reduced, further contributing to reducing the size of the microprogram ROM in the microsequencer unit.
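The overflow and underflow handling described above can be sketched behaviorally: a push into a full cache spills the bottom entry to stack memory, and a pop from an empty cache fills from stack memory. The pipeline freeze is modeled simply by these spill/fill calls; the class and its structures are illustrative assumptions, not the finite state machine itself.

```python
from collections import deque

class StackCacheController:
    """Behavioral model of operand stack cache overflow/underflow handling."""

    def __init__(self, capacity, stack_memory):
        self.capacity = capacity
        self.cache = deque()               # top of stack is the right end
        self.memory = stack_memory         # models stack memory 120 (a list)

    def push(self, value):
        if len(self.cache) == self.capacity:
            # Overflow: spill the bottom entry to memory
            # (the controller would freeze the pipeline during this write).
            self.memory.append(self.cache.popleft())
        self.cache.append(value)

    def pop(self):
        if not self.cache:
            # Underflow: fill from memory
            # (the controller would freeze the pipeline during this read).
            self.cache.append(self.memory.pop())
        return self.cache.pop()
```

Because spill and fill are handled inside this block, the surrounding microprogram never sees the overflow or underflow states, which is the microinstruction-width saving described above.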
Numerous modifications and variations of the present invention are possible in light of the above teachings, for example, those skilled in the art will understand that the local variable cache 300 and the stack operand cache 310 may be implemented other than as discussed herein, for example, using memories such as SRAM, latches, flip-flops, and the like. Furthermore, other languages are within the scope of those described herein, including Java 2 Standard Edition (J2SE), Java 2 Micro Edition (J2ME), and configurations such as Connected Limited Device Configuration (CLDC) available from Sun Microsystems, Inc., Palo Alto, California; and other Java or Java-like languages, for example, Common Language Interchange (CLI), Intermediate Language (IL) and Common Language Run-time (CLR) environments, and C# programming language which are part of the .NET and .NET compact framework available from Microsoft Corporation Redmond, Washington; and Binary Run-time Environment for Wireless (BREW) from Qualcomm Inc., San Diego, California. It is therefore to be understood that within the scope of the appended claims, the invention may be practiced otherwise than as specifically described herein.

Claims

What is Claimed is:
1. A device capable of executing stack-based instructions, the instructions provided by a method, the method including at least one method frame, the method frame comprising one or more frame segments, comprising: means for caching one or more frame segments of a current method frame.
2. The device of Claim 1, wherein the one or more method frame segments comprise a local variable segment and a stack operand segment.
3. The device of Claim 1, wherein the frame segments comprise a local variable segment that includes a number of local variables, and a stack operand segment that includes a number of stack operands; and wherein the means for caching caches less than the number of local variables and less than the number of stack operands.
4. The device of claim 1 further comprising: a processor core; and a memory operatively coupled to the processor core, the memory containing a stack data structure for storing a call chain of method frames, wherein the means for caching is coupled to both the processor core and the memory, and wherein the means for caching provides the processor core with faster access to the current method frame than may be provided by the memory.
5. The device of claim 4, wherein the means for caching is integrated onto a semiconductor chip.
6. The device of claim 4, wherein the means for caching comprises faster memory technology than the memory.
7. The device of claim 6, wherein the means for caching comprises cache memory, and wherein the cache memory is selected from the group consisting of registers, latches, RAM, SRAM, and DDR.
8. The device of Claim 1, wherein the one or more frame segments comprise segment elements, and wherein the means for caching accommodates no more than 16 segment elements.
9. The device of Claim 1, wherein the one or more frame segments comprise stack operand segment elements, wherein the means for caching accommodates storage of no more than 8 stack operand segment elements.
10. The device of Claim 1, wherein the one or more frame segments comprise stack operand segment elements, wherein the means for caching accommodates storage of 8 stack operand segment elements.
11. The device of claim 1, wherein the one or more frame segments comprise local variable segment elements, and wherein the means for caching accommodates no more than 16 local variable segment elements.
12. The device of Claim 1, wherein the means for caching comprises a circular register file; a top pointer for indicating a first entry in a first register of the register file; and a bottom pointer for indicating a last entry in a second register of the register file.
13. The device of Claim 1, wherein the device comprises a mobile device.
14. The device of claim 1, wherein the device comprises a Java native processor.
15. The device of claim 1, wherein the device comprises a Java accelerator.
16. The device of claim 15, wherein the device comprises an instruction-path Java accelerator.
17. The device of claim 1, wherein the device comprises an electronic circuit.
18. A device capable of executing stack-based instructions, the device comprising: a first cache memory for caching stack operands of the instructions; and a second cache memory for caching local variables of the instructions.
19. The device of Claim 18, wherein the first cache memory comprises a circular register file.
20. The device of Claim 18, wherein the second cache memory comprises a direct mapped cache memory.
21. The device of Claim 18, wherein the second cache memory comprises a fully associative cache memory.
22. The device of Claim 18, wherein the stack operands and the local variables are both of a single current method.
23. A circuit for caching a method frame, comprising: means for caching local variables of the method frame; and means for caching stack operands of the method frame.
24. The circuit of claim 23, wherein the means for caching local variables and the means for caching stack operands are separated by a logical boundary.
25. A device capable of executing stack-based instructions, comprising: a local variable cache, the local variable cache for caching only local variables.
26. The device of claim 25, further comprising a stack operand cache, the stack operand cache for caching only stack operands.
27. The device of claim 25, the local variable cache for caching local variables of only the current method frame.
28. The device of claim 27, the method frame comprising a Java method frame.
29. A device for executing stack based instructions, comprising: a cache, the device having a Java mode and a non-Java mode, the cache for storing local variables of the stack based instructions when operating in the Java mode, and the cache used as a general purpose register file in the non-Java mode.
30. A device capable of executing stack-based instructions, comprising: a stack operand cache, the stack operand cache for caching only stack operands.
31. The device of claim 30, the stack operand cache for caching stack operands of only one method frame.
32. The device of claim 31, the one method frame comprising a Java method frame.
33. The device of claim 31, the method frame comprising a current method frame.
34. A device capable of executing stack-based instructions of a method, the method including a current method frame stored on a stack, the current method frame including one or more stack operands and one or more local variables, comprising: a stack operand cache, the stack operand cache for caching one or more stack operands of the current method; and a local variable cache, the local variable cache for caching one or more local variables of the current method.
35. The device of claim 34, the current method including a return execution context, wherein one or more elements of the return execution context are cached on the stack operand cache.
36. The device of claim 34, the current method including a return execution context, wherein one or more elements of the return execution context frame segment are cached on the local variable cache.
37. A method for caching a method frame, comprising the steps of: providing a method frame; providing a local variable cache and a stack operand cache, the local variable cache and the stack operand cache separated by a logical boundary; caching one or more local variable of the method frame in the local variable cache; and caching one or more stack operand of the method frame in the stack operand cache.
38. The method of claim 37, further comprising the step of: providing the method frame as a current method frame.
39. The method of claim 37, further comprising the step of caching one or more local variable and one or more stack operand of only one method frame at a time.
40. The method of claim 37, wherein the step of caching comprises caching less than all of the stack operands of the method frame.
41. The method of claim 37, wherein the step of caching comprises caching less than all of the local variables of the method frame.
PCT/US2001/043829 2000-11-20 2001-11-20 Methods and devices for caching method frame segments in a low-power stack-based processor WO2002045385A2 (en)

Priority Applications (7)

Application Number Priority Date Filing Date Title
PCT/US2001/043829 WO2002045385A2 (en) 2000-11-20 2001-11-20 Methods and devices for caching method frame segments in a low-power stack-based processor
AU2002230445A AU2002230445A1 (en) 2000-11-20 2001-11-20 Interpretation loop for object oriented processor
PCT/US2001/044031 WO2002071211A2 (en) 2000-11-20 2001-11-20 Data processor having multiple operating modes
AU2002226968A AU2002226968A1 (en) 2000-11-20 2001-11-20 Data processor having multiple operating modes
PCT/US2001/043444 WO2002042898A2 (en) 2000-11-20 2001-11-20 Interpretation loop for object oriented processor
AU2002241505A AU2002241505A1 (en) 2000-11-20 2001-11-20 Methods and devices for caching method frame segments in a low-power stack-based processor
AU4150502A AU4150502A (en) 2000-11-20 2001-11-21 Methods and devices for caching method frame segments in a low-power stack-based processor

Applications Claiming Priority (16)

Application Number Priority Date Filing Date Title
US25217000P 2000-11-20 2000-11-20
US60/252,170 2000-11-20
US25655000P 2000-12-18 2000-12-18
US60/256,550 2000-12-18
US27069601P 2001-02-22 2001-02-22
US60/270,696 2001-02-22
US27637501P 2001-03-16 2001-03-16
US60/276,375 2001-03-16
US29052001P 2001-05-11 2001-05-11
US60/290,520 2001-05-11
US32302201P 2001-09-14 2001-09-14
US60/323,022 2001-09-14
US09/956,130 2001-09-20
PCT/US2001/043829 WO2002045385A2 (en) 2000-11-20 2001-11-20 Methods and devices for caching method frame segments in a low-power stack-based processor
PCT/US2001/043957 WO2002048864A2 (en) 2000-11-20 2001-11-20 System registers for an object-oriented processor
PCT/US2001/043444 WO2002042898A2 (en) 2000-11-20 2001-11-20 Interpretation loop for object oriented processor

Publications (2)

Publication Number Publication Date
WO2002045385A2 true WO2002045385A2 (en) 2002-06-06
WO2002045385A3 WO2002045385A3 (en) 2003-09-12

Family

ID=27792424

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2001/043829 WO2002045385A2 (en) 2000-11-20 2001-11-20 Methods and devices for caching method frame segments in a low-power stack-based processor

Country Status (1)

Country Link
WO (1) WO2002045385A2 (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1997027539A1 (en) * 1996-01-24 1997-07-31 Sun Microsystems, Inc. Methods and apparatuses for stack caching
US6138210A (en) * 1997-06-23 2000-10-24 Sun Microsystems, Inc. Multi-stack memory architecture


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1387247A2 (en) * 2002-07-31 2004-02-04 Texas Instruments Inc. System and method to automatically stack and unstack java local variables
EP1387247A3 (en) * 2002-07-31 2007-12-12 Texas Instruments Inc. System and method to automatically stack and unstack java local variables
US20140143499A1 (en) * 2012-11-21 2014-05-22 Advanced Micro Devices, Inc. Methods and apparatus for data cache way prediction based on classification as stack data
US9734059B2 (en) * 2012-11-21 2017-08-15 Advanced Micro Devices, Inc. Methods and apparatus for data cache way prediction based on classification as stack data
GB2518022A (en) * 2014-01-17 2015-03-11 Imagination Tech Ltd Stack saved variable value prediction
GB2518022B (en) * 2014-01-17 2015-09-23 Imagination Tech Ltd Stack saved variable value prediction
US9934039B2 (en) 2014-01-17 2018-04-03 Mips Tech Limited Stack saved variable pointer value prediction

Also Published As

Publication number Publication date
WO2002045385A3 (en) 2003-09-12


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PH PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: COMMUNICATION PURSUANT TO RULE 69 EPC (EPO FORM 1205A OF 280803)

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase in:

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP