CN100458687C - Shared code caching method and apparatus for program code conversion - Google Patents

Shared code caching method and apparatus for program code conversion

Info

Publication number
CN100458687C
Authority
CN
China
Prior art keywords
code
translator
translation
source
register
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CNB200480020101XA
Other languages
Chinese (zh)
Other versions
CN1823322A (en)
Inventor
Geraint North
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
IBM United Kingdom Ltd
International Business Machines Corp
Original Assignee
Transitive Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Transitive Ltd
Publication of CN1823322A
Application granted
Publication of CN100458687C
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 - Arrangements for executing specific programs
    • G06F 9/455 - Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F 9/45504 - Abstract machines for programme code execution, e.g. Java virtual machine [JVM], interpreters, emulators
    • G06F 9/45516 - Runtime code conversion or optimisation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 - Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F 9/3802 - Instruction prefetching
    • G06F 9/3812 - Instruction prefetching with instruction modification, e.g. store into instruction stream

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

Subject program code is translated to target code (21) in basic block units at run-time in a process wherein translation of basic blocks is interleaved with execution of those translations. A shared code cache mechanism is added to persistently store subject code translations, such that a translator may reuse translations that were generated and/or optimized by earlier translator instances.

Description

Shared code caching method and apparatus for program code conversion
Technical field
The present invention relates generally to the fields of computers and computer software and, more particularly, to program code conversion methods and apparatus such as are useful, for example, in code translators, emulators and accelerators.
Background
In both embedded and non-embedded CPUs, one finds predominant instruction set architectures (ISAs) for which large bodies of software exist that could be "accelerated" for performance, or "translated" to a myriad of capable processors that could present better cost/performance benefits, provided that they could transparently access the relevant software. One also finds mainstream CPU architectures that are locked in time to their ISA and cannot evolve in performance or market reach. Such architectures would benefit from a "synthetic CPU" co-architecture.
Program code conversion methods and apparatus facilitate such acceleration, translation and co-architecture capabilities, and are addressed, for example, in WO 00/22521, entitled "Program Code Conversion".
Summary of the invention
According to the present invention there is provided an apparatus and method as set forth in the appended claims. Preferred features of the invention will be apparent from the dependent claims and the description which follows.
To realize the present invention, a shared code caching method for use in program code conversion is provided, comprising: (a) providing a first translator instance (19A), wherein said first translator instance translates a first subject code portion (CS1) into a target code portion (TC1); (b) caching said target code portion (TC1); and (c) providing a second translator instance (19B), wherein said second translator instance translates a second subject code portion (CS2) into target code, including retrieving the cached target code portion (TC1) when compatibility is detected between the cached target code portion (TC1) and the second subject code portion (CS2).
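A minimal C++ sketch of this shared cache interaction is given below; the class and function names (SharedCodeCache, store, lookup) and the use of an entry-conditions string as the compatibility key are assumptions made for illustration, not details taken from the patent.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// A cached translation: the generated target code (TC1) plus the compatibility
// information recorded when it was produced.
struct CachedTranslation {
    std::vector<unsigned char> target_code;
    std::string entry_conditions;
};

class SharedCodeCache {
    std::map<std::uint64_t, CachedTranslation> entries_;  // keyed by subject address
public:
    // Step (b): a first translator instance caches the translation of CS1.
    void store(std::uint64_t subject_addr, const CachedTranslation& t) {
        entries_[subject_addr] = t;
    }
    // Step (c): a second translator instance retrieves the cached translation
    // only when it is detected to be compatible with its own subject code.
    const CachedTranslation* lookup(std::uint64_t subject_addr,
                                    const std::string& current_conditions) const {
        auto it = entries_.find(subject_addr);
        if (it != entries_.end() && it->second.entry_conditions == current_conditions)
            return &it->second;   // reuse the cached target code (TC1)
        return nullptr;           // otherwise translate afresh
    }
};
```

A first translator instance would call store() after translating the first subject code portion; a later instance calls lookup() before translating a second portion and reuses the cached target code on a compatible hit.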
The following is a summary of the various aspects and advantages realizable according to embodiments of the invention. It is provided as an introduction to help those skilled in the art more quickly assimilate the detailed discussion that follows, and is not intended, and should not be taken, to limit in any way the scope of the claims appended hereto.
In particular, the inventor discloses below techniques directed at expediting program code conversion, which are particularly useful in conjunction with a run-time translator that translates subject program code into target code. A shared code cache mechanism is provided for storing subject code translations for reuse. In one embodiment, the translated code generated by one translator instance is cached so that it may be reused by subsequent translator instances. Various other embodiments, implementations and refinements of this mechanism are also provided.
Description of drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate presently preferred implementations and are described as follows:
Figure 1 is a block diagram of apparatus wherein embodiments of the invention find application;
Figure 2 is a schematic diagram illustrating a run-time translation process and the corresponding intermediate representation (IR) generated during that process;
Figure 3 is a schematic diagram illustrating a basic block data structure and cache according to an illustrative embodiment of the invention;
Figure 4 is a flow diagram illustrating an extended basic block process;
Figure 5 is a flow diagram illustrating isoblocking;
Figure 6 is a flow diagram illustrating group blocking and attendant optimizations;
Figure 7 is a schematic diagram illustrating an example of group block optimization;
Figure 8 is a flow diagram illustrating run-time translation, including extended basic blocking, isoblocking and group blocking;
Figure 9 is a flow diagram illustrating principal aspects of shared code caching;
Figure 10 is a flow diagram further illustrating shared code caching;
Figure 11 is a schematic diagram illustrating an example cache unit;
Figure 12 is a schematic diagram illustrating translator instances together with a local code cache and server;
Figure 13 is a schematic diagram illustrating translator instances together with a remote code cache and server;
Figure 14 is a schematic diagram illustrating a cache server running on a separate system which does not run the translator code;
Figure 15 is a schematic diagram of a system wherein a cache server shares a plurality of caches among a network of connected processes;
Figure 16 is a flow diagram illustrating cache evolution;
Figure 17 is a schematic diagram illustrating a system wherein multiple translator instances use the same cache unit structure; and
Figures 18 and 19 are schematic diagrams respectively illustrating implementations of a cache insertion policy and a cache lookup policy.
Detailed description
Figures 1 to 8, discussed next, illustrate methods, apparatus and program code useful in program code conversion. Figures 9 onward illustrate various aspects of a shared code caching technique which may be employed, for example, in a program code conversion system such as that illustrated in Figures 1 to 8.
Figure 1 illustrates a target processor 13 including target registers 15, together with memory storing a number of software components 19, 20, 21 and providing working storage that includes a basic block cache 23, a global register store 27, and the subject code 17 to be translated. The software components include an operating system 20, the translator code 19, and the translated code 21. The translator code 19 may function, for example, as an emulator translating subject code of one ISA into translated code of another ISA, or as an accelerator translating subject code into translated code of the same ISA.
The translator 19, i.e. the compiled version of the source code implementing the translator, and the translated code 21, i.e. the translation of the subject code 17 produced by the translator 19, run in conjunction with the operating system 20, such as UNIX, running on the target processor 13, which is typically a microprocessor or other suitable computer. It will be appreciated that the structure illustrated in Figure 1 is exemplary only and that, for example, software, methods and processes according to the invention may be implemented in code residing within or beneath the operating system. The subject code, translator code, operating system, and storage mechanisms may be any of a wide variety of types known to those skilled in the art.
In apparatus according to Figure 1, program code conversion is preferably performed dynamically, at run-time, while the translated code 21 is running. The translator 19 runs inline with the translated program 21. The execution path of the translation process is a control loop comprising the steps of: executing translator code 19, which translates a block of the subject code 17 into translated code 21, and then executing that block of translated code; the end of each block of translated code contains instructions to return control back to the translator code 19. In other words, the steps of translating and executing the subject code are interlaced, such that only portions of the subject program 17 are translated at a time, and the translated code of a first basic block is executed before translation of subsequent basic blocks. The translator's fundamental unit of translation is the basic block, meaning that the translator 19 translates the subject code 17 one basic block at a time. A basic block is formally defined as a section of code with exactly one entry point and exactly one exit point, which limits the block code to a single control path. For this reason, basic blocks are the fundamental unit of control flow.
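A minimal sketch of this control loop in C++ follows; the helper translate_basic_block() and the function-pointer calling convention are assumptions made for illustration, not details taken from the patent.

```cpp
typedef unsigned long SubjectAddr;

// A translated block is invoked through a function pointer; its epilogue
// returns control here together with the subject address of its successor.
typedef SubjectAddr (*TranslatedBlockFn)();

// Hypothetical helper: translates (or fetches) one basic block of subject
// code starting at 'addr' and returns a pointer to its target code.
TranslatedBlockFn translate_basic_block(SubjectAddr addr);

// Run-time control loop: translate one basic block, execute it, and use the
// returned successor address to select the next block to translate.
void translator_main_loop(SubjectAddr start) {
    SubjectAddr next = start;
    for (;;) {
        TranslatedBlockFn block = translate_basic_block(next);
        next = block();
    }
}
```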
In the process of generating the translated code 21, intermediate representation ("IR") trees are generated based on the subject instruction sequence. IR trees are abstract representations of the expressions calculated and the operations performed by the subject program. Later, the translated code 21 is generated based on the IR trees.
Collections of IR nodes described herein are colloquially referred to as "trees". We note that, formally, such structures are in fact directed acyclic graphs (DAGs), not trees. The formal definition of a tree requires that each node have at most one parent. Because the described embodiments use common subexpression elimination during IR generation, nodes will often have multiple parents. For example, the IR of a flag-affecting instruction result may be referred to by two abstract registers, those corresponding to the destination subject register and the flag result parameter.
For example, the subject instruction "add %r1, %r2, %r3" performs the addition of the contents of subject registers %r2 and %r3 and stores the result in subject register %r1. Thus, this instruction corresponds to the abstract expression "%r1 = %r2 + %r3". This example contains a definition of the abstract register %r1 with an add expression containing two subexpressions representing the instruction operands %r2 and %r3. In the context of the subject program 17, these subexpressions may correspond to other, prior subject instructions, or they may represent details of the current instruction, such as immediate constant values.
When the "add" instruction is parsed, a new "+" IR node is generated, corresponding to the abstract mathematical operator for addition. The "+" IR node stores references to other IR nodes that represent the operands (represented in the IR as subexpression trees, often held in subject registers). The "+" node is itself referenced by the subject register whose value it defines (the abstract register for %r1, the destination register of the instruction). For example, the centre-right portion of Figure 2 shows the IR tree corresponding to the X86 instruction "add %ecx, %edx".
As those skilled in the art may appreciate, in one embodiment the translator 19 is implemented using an object-oriented programming language such as C++. For example, an IR node is implemented as a C++ object, and references to other nodes are implemented as C++ references to the C++ objects corresponding to those other nodes. An IR tree is therefore implemented as a collection of IR node objects, containing various references to each other.
Further, in the embodiment under discussion, IR generation uses a set of abstract registers. These abstract registers correspond to specific features of the subject architecture. For example, there is a unique abstract register for each physical register on the subject architecture ("subject register"). Similarly, there is a unique abstract register for each condition code flag present on the subject architecture. Abstract registers serve as placeholders for IR trees during IR generation. For example, the value of subject register %r2 at a given point in the subject instruction sequence is represented by a particular IR expression tree which is associated with the abstract register for subject register %r2. In one embodiment, an abstract register is implemented as a C++ object which is associated with a particular IR tree via a C++ reference to the root node object of that tree.
In the example instruction sequence described above, the translator has already generated IR trees corresponding to the values of %r2 and %r3 while parsing the subject instructions that precede the "add" instruction. In other words, the subexpressions that calculate the values of %r2 and %r3 are already represented as IR trees. When the IR tree for the "add %r1, %r2, %r3" instruction is generated, the new "+" node contains references to the IR subtrees for %r2 and %r3.
The implementation of abstract registers is divided between components in both the translator code 19 and the translated code 21. Within the translator 19, an "abstract register" is a placeholder used in the course of IR generation, such that the abstract register is associated with the IR tree that calculates the value of the subject register to which that abstract register corresponds. As such, abstract registers in the translator may be implemented as C++ objects which contain a reference to an IR node object (i.e., an IR tree). The aggregate of all IR trees referred to by the set of abstract registers is referred to as the working IR forest ("forest" because it contains multiple abstract register roots, each of which refers to an IR tree). The working IR forest represents a snapshot of the abstract operations of the subject program at a particular point in the subject code.
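The following C++ fragment sketches one way the IR node, abstract register and working IR forest objects just described might be shaped; the class names and the use of shared_ptr are assumptions for illustration (shared ownership reflects that common subexpression elimination gives nodes multiple parents, so the structure is really a DAG).

```cpp
#include <memory>
#include <vector>

struct IRNode {
    enum Kind { Add, SubjectRegisterRef, Constant } kind;
    std::vector<std::shared_ptr<IRNode>> operands;  // references to operand subtrees
    long constant_value = 0;                        // used when kind == Constant
};

struct AbstractRegister {
    // Placeholder for one subject register (or condition flag): refers to the
    // IR tree computing its current value, or null if not yet defined.
    std::shared_ptr<IRNode> root;
};

struct WorkingIRForest {
    // One abstract register per subject register, e.g. %r0..%r31.
    std::vector<AbstractRegister> subject_registers =
        std::vector<AbstractRegister>(32);
};

// "add %r1, %r2, %r3": a new "+" node references the existing %r2 and %r3
// subtrees and becomes the new root of abstract register %r1.
void gen_add(WorkingIRForest& f, int r1, int r2, int r3) {
    auto plus = std::make_shared<IRNode>();
    plus->kind = IRNode::Add;
    plus->operands = { f.subject_registers[r2].root,
                       f.subject_registers[r3].root };
    f.subject_registers[r1].root = plus;
}
```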
Within the translated code 21, an "abstract register" is a specific location within the global register store, to and from which subject register values are synchronized with the actual target registers. Alternatively, once a value has been loaded from the global register store, an abstract register in the translated code 21 can be understood to be the target register 15 which temporarily holds the subject register value during execution of the translated code 21, before it is saved back to the register store.
An example of the program translation described above is illustrated in Figure 2. Figure 2 shows the translation of two basic blocks of X86 instructions, and the corresponding IR trees that are generated in the process of translation. The left side of Figure 2 shows the execution path of the translator 19 during translation. In step 151, the translator 19 translates a first basic block 153 of subject code into target code 21, and then, in step 155, executes that target code 21. When the target code 21 finishes execution, control is returned to the translator 19, step 157, wherein the translator translates the next basic block 159 of the subject code 17 into target code 21 and then executes that target code 21, step 161, and so on.
In the course of translating the first basic block 153 of subject code into target code, the translator 19 generates an IR tree 163 based on that basic block 153. In this case, the IR tree 163 is generated from the subject instruction "add %ecx, %edx", which is a flag-affecting instruction. In the course of generating the IR tree 163, four abstract registers are defined by this instruction: the destination abstract register %ecx 167, the first flag-affecting instruction parameter 169, the second flag-affecting instruction parameter 171, and the flag-affecting instruction result 173. The IR tree corresponding to the "add" instruction is a "+" operator 175 (i.e., arithmetic addition), whose operands are the subject registers %ecx 177 and %edx 179.
Thus, emulation of the first basic block 153 puts the flags in a pending state by storing the parameters and the result of the flag-affecting instruction. The flag-affecting instruction is "add %ecx, %edx". The parameters of the instruction are the current values of the emulated subject registers %ecx 177 and %edx 179. The "@" symbol preceding the subject register uses 177, 179 indicates that the values of these subject registers are retrieved from the global register store, from the locations corresponding to %ecx and %edx respectively, as these particular subject registers have not previously been loaded by the current basic block. These parameter values are then stored in the first and second flag parameter abstract registers 169, 171. The result of the addition operation 175 is stored in the flag result abstract register 173.
After the IR tree is generated, the corresponding target code 21 is generated based on the IR. The process of generating target code 21 from a generic IR is well understood in the art. Target code is inserted at the end of the translated block to save the abstract registers, including those for the flag result 173 and the flag parameters 169, 171, to the global register store 27. After the target code is generated, it is executed, step 155.
Figure 2 shows an example of interleaved translation and execution. The translator 19 first generates translated code 21 based on the subject instructions 17 of a first basic block 153, and the translated code for basic block 153 is then executed. At the end of the first basic block 153, the translated code 21 returns control to the translator 19, which then translates a second basic block 159. The translated code 21 for the second basic block 159 is then executed 161. At the end of execution of the second basic block 159, the translated code returns control to the translator 19, which then translates the next basic block, and so forth.
Thus, a subject program running under the translator 19 has two different types of code that execute in an interleaved manner: the translator code 19 and the translated code 21. The translator code 19 is generated by a compiler, prior to run-time, based on the high-level source code implementation of the translator 19. The translated code 21 is generated by the translator code 19, throughout run-time, based on the subject code 17 of the program being translated.
The representation of the subject processor state is likewise divided between the translator 19 and the translated code 21. The translator 19 stores subject processor state in a variety of explicit programming language devices, such as variables and/or objects; the compiler used to compile the translator determines how that state and those operations are implemented in target code. The translated code 21, by comparison, stores subject processor state implicitly in target registers and memory locations, which are manipulated directly by the target instructions of the translated code 21.
For example, the low-level representation of the global register store 27 is simply a region of allocated memory. This is how the translated code 21 sees and interacts with the abstract registers, by saving to and restoring from the defined memory region and the various target registers. In the source code of the translator 19, however, the global register store 27 is a data array or an object which can be accessed and manipulated at a higher level. With respect to the translated code 21, there simply is no high-level representation.
In some cases, subject processor state which is static, or statically determinable, in the translator 19 is encoded directly into the translated code 21 rather than being calculated dynamically. For example, the translator 19 may generate translated code 21 that is specialized on the instruction type of the last flag-affecting instruction, meaning that the translator would generate different target code for the same basic block if the instruction type of the last flag-affecting instruction were to change.
The translator 19 contains a data structure corresponding to each basic block translation, which is particularly useful for the extended basic block, isoblock, group block and cached translation state optimizations described below. Figure 3 illustrates such a basic block data structure 30, which includes a subject address 31, a target code pointer 33 (i.e., the target address of the translated code), translation hints 34, entry and exit conditions 35, a profiling metric 37, references 38, 39 to the data structures of predecessor and successor basic blocks, and an entry register map 40. Figure 3 further illustrates the basic block cache 23, which is a collection of basic block data structures, e.g. 30, 41, 42, 43, 44 . . . , indexed by subject address. In one embodiment, the data corresponding to a particular translated basic block may be stored in a C++ object. The translator creates a new basic block object as each basic block is translated.
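A sketch of such a per-translation record in C++ is given below; the field names are paraphrases of the reference numerals in Figure 3, and the concrete types (strings for conditions, a multimap for the cache) are assumptions made only to keep the example self-contained.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct BasicBlockTranslation {
    std::uint64_t subject_address = 0;            // 31: subject starting address
    void (*target_code)() = nullptr;              // 33: target code pointer
    std::string translation_hints;                // 34: block-specific static data
    std::string entry_conditions;                 // 35
    std::string exit_conditions;                  // 36
    std::uint64_t profiling_metric = 0;           // 37: e.g. execution count
    std::vector<BasicBlockTranslation*> predecessors;  // 38
    std::vector<BasicBlockTranslation*> successors;    // 39
    std::map<int, int> entry_register_map;        // 40: abstract -> target register
};

// 23: the basic block cache, indexed by subject address; a multimap allows
// several isoblocks (different entry conditions) per subject address.
using BasicBlockCache = std::multimap<std::uint64_t, BasicBlockTranslation*>;
```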
The subject address 31 of a basic block is the starting address of that basic block in the memory space of the subject program 17, that is, the memory location where the basic block would reside if the subject program 17 were running on the subject architecture. This is also referred to as the subject starting address. While each basic block corresponds to a range of subject addresses (one for each subject instruction), the subject starting address is the subject address of the first instruction in the basic block.
The target address 33 of a basic block is the memory location (starting address) of the translated code 21 within the target program. The target address 33 is also referred to as the target code pointer, or the target starting address. To execute a translated block, the translator 19 treats the target address as a function pointer which is dereferenced to invoke (transfer control to) the translated code.
The basic block data structures 30, 41, 42, 43, . . . are stored in the basic block cache 23, which is a repository of basic block objects organized by subject address. When the translated code of a basic block finishes executing, it returns control to the translator 19 and also returns to the translator the value of the basic block's destination (successor) subject address 31. To determine whether the successor basic block has already been translated, the translator 19 compares the destination subject address 31 against the subject addresses 31 of the basic blocks in the basic block cache 23 (i.e., those blocks that have already been translated). Basic blocks which have not yet been translated are translated and then executed. Basic blocks which have already been translated (and which have compatible entry conditions, as discussed below) are simply executed. Over time, many of the basic blocks encountered will already have been translated, which causes the incremental translation cost to decrease. As such, the translator 19 gets faster over time, as fewer and fewer blocks require translation.
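Continuing the sketch above, the lookup-or-translate step might look as follows; translate() is a hypothetical helper standing in for the translator's code generation path.

```cpp
// Hypothetical helper: translates the block starting at 'subject_addr' and
// returns its freshly created basic block record.
BasicBlockTranslation* translate(std::uint64_t subject_addr);

// On return from a translated block, look the successor address up in the
// basic block cache and translate it only if no translation exists yet.
BasicBlockTranslation* find_or_translate(BasicBlockCache& cache,
                                         std::uint64_t successor_addr) {
    auto range = cache.equal_range(successor_addr);
    if (range.first != range.second)
        return range.first->second;            // already translated: reuse it
    BasicBlockTranslation* fresh = translate(successor_addr);
    cache.insert({successor_addr, fresh});
    return fresh;
}

// The block is then executed by dereferencing its target code pointer:
//   find_or_translate(cache, addr)->target_code();
```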
Extended basic blocks
One optimization employed in the illustrative embodiment is to increase the scope of code generation through a technique referred to as "extended basic blocks." In cases where a basic block A has only one successor block (e.g., basic block B), the translator may be able to determine (when A is decoded) the subject address of B statically. In such cases, basic blocks A and B are combined into a single block (A'), referred to as an extended basic block. Put differently, the extended basic block mechanism can be applied to unconditional jumps whose destination is statically determinable; if the jump is conditional, or if the destination cannot be determined statically, then a separate basic block must be formed. The extended basic block may still formally be a basic block because, after the intervening jump from A to B is removed, the code of block A' has only a single flow of control, and therefore no synchronization is necessary at the A-B boundary.
Even where A has multiple possible successors that include B, extended basic blocks may be used to extend A into B for a particular execution in which B is the actual successor and the address of B is statically determinable.
A statically determinable address is one that the translator can determine at decode time. During construction of a block's IR forest, an IR tree is constructed for the destination subject address, associated with the destination address abstract register. If the value of the destination address IR tree is statically determinable (i.e., does not depend on dynamic or run-time subject register values), then the successor block is statically determinable. For example, in the case of an unconditional jump instruction, the destination address (i.e., the subject starting address of the successor block) is implicit in the jump instruction itself: the subject address of the jump instruction plus the offset encoded in the jump instruction equals the destination address. Likewise, optimizations such as constant folding (e.g., X + (2 + 3) => X + 5) and expression folding (e.g., (X * 5) * 10 => X * 50) may cause an otherwise "dynamic" destination address to become statically determinable. The calculation of the destination address thus consists of extracting the constant value from the destination address IR.
When the extended basic block A' is created, the translator subsequently treats block A' the same as any other basic block when performing IR generation, optimizations, and code generation. Because the code generation algorithms operate over a larger scope (i.e., the code of the combined basic blocks A and B), the translator 19 generates more optimal code.
As one skilled in the art will appreciate, decoding is the process of extracting individual subject instructions from the subject code. The subject code is stored as an unformatted byte stream (i.e., a collection of bytes in memory). In the case of subject architectures with variable-length instructions (e.g., X86), decoding first requires identification of instruction boundaries; in the case of fixed-length instruction architectures, identifying instruction boundaries is trivial (e.g., on MIPS, every four bytes is an instruction). The subject instruction format is then applied to the bytes that constitute a given instruction, to extract the instruction data (i.e., the instruction type, operand register numbers, immediate field values, and any other information encoded in the instruction). The process of decoding machine instructions of a known architecture from an unformatted byte stream, using that architecture's instruction format, is well understood in the art.
Figure 4 illustrates the creation of an extended basic block. A set of constituent basic blocks that can become an extended basic block is detected when the earliest eligible basic block (A) is decoded. If the translator 19 detects that the successor of A (block B) is statically determinable 51, it calculates the starting address of B 53 and then resumes the decoding process at the starting address of B. If the successor of B (block C) is determined to be statically determinable 55, the decoding process proceeds to the starting address of C, and so forth. Of course, if a successor block is not statically determinable, then normal translation and execution proceed 61, 63, 65.
Throughout the decoding of each basic block, the working IR forest includes an IR tree to calculate the subject address 31 of the current block's successor (i.e., the destination subject address; the translator has a dedicated abstract register for the destination address). In the case of an extended basic block, to compensate for the fact that the intervening jumps have been eliminated, the IR tree for the calculation of the successor address is pruned 54 (Figure 4) as each new constituent basic block is assimilated into the decoding process. In other words, when the translator 19 statically calculates the address of B and decoding resumes at B's starting address, the IR tree corresponding to the dynamic calculation of B's subject address 31 (which was constructed in the course of decoding A) is pruned; when decoding proceeds to the starting address of C, the IR tree corresponding to the subject address of C is pruned 59; and so forth. "Pruning" an IR tree means removing any IR nodes which are depended on by the destination address abstract register and which are not depended on by any other abstract register. Put differently, pruning breaks the link between the IR tree and the destination address abstract register; any other links to the same IR tree remain unaffected. In some cases a pruned IR tree may also be depended on by another abstract register, in which case the IR tree remains, to preserve the execution semantics of the subject program.
To prevent code explosion (traditionally a mitigating factor against such code specialization techniques), the translator limits extended basic blocks to some maximum number of subject instructions. In one embodiment, extended basic blocks are limited to a maximum of 200 subject instructions.
Isoblocks
Another optimization implemented in the illustrated embodiment is so-called "isoblocking." Under this technique, translations of basic blocks are parameterized, or specialized, on a compatibility list, which is a set of variable conditions describing the subject processor state and the translator state. The compatibility list is different for each subject architecture, to take different architectural features into account. The actual values of the compatibility conditions at the entry and exit of a particular basic block translation are referred to as entry conditions and exit conditions, respectively.
If execution reaches a basic block which has already been translated, but whose entry conditions at the previous translation differ from the current working conditions (i.e., the exit conditions of the preceding block), then the basic block must be translated again, this time specialized on the current working conditions. The result is that the same subject code basic block is now represented by multiple target code translations. These different translations of the same basic block are referred to as isoblocks.
To support isoblocks, the data associated with each basic block translation include one set of entry conditions 35 and one set of exit conditions 36 (Figure 3). In one embodiment, the basic block cache 23 is organized first by subject address 31 and then by entry conditions 35, 36 (Figure 3). In another embodiment, when the translator queries the basic block cache 23 for a subject address 31, the query may return multiple translated basic blocks (isoblocks).
Figure 5 illustrates the use of isoblocks. At the end of execution of a first translated block, the translated code 21 calculates and returns the subject address of the next block (i.e., the successor) 71. Control is then returned to the translator 19, as demarcated by dashed line 73. In the translator 19, the basic block cache 23 is queried using the returned subject address 31, step 75. The basic block cache may return zero, one, or more than one basic block data structures with the same subject address 31. If the basic block cache 23 returns zero data structures (meaning that this block has not yet been translated), the block must be translated by the translator 19, step 77. Each data structure returned by the basic block cache 23 corresponds to a different translation (isoblock) of the same basic block of subject code. As illustrated at decision diamond 79, if the current exit conditions (of the first translated block) do not match the entry conditions of any data structure returned by the basic block cache 23, then the basic block must be translated again, step 81, this time parameterized on those exit conditions. If the current exit conditions match the entry conditions of one of the returned data structures, then that translation is compatible and can be executed without retranslation, step 83. In the illustrated embodiment, the translator 19 executes the compatible translated block by dereferencing the target address as a function pointer.
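The isoblock-aware lookup of Figure 5 might be sketched as follows, again building on the earlier BasicBlockCache example; translate_with_conditions() is a hypothetical helper that produces a translation specialized on the given conditions.

```cpp
// Hypothetical helper: retranslates the block, specialized on the given
// working conditions (steps 77/81 of Figure 5).
BasicBlockTranslation* translate_with_conditions(std::uint64_t subject_addr,
                                                 const std::string& conditions);

// Several translations (isoblocks) may share one subject address; one is
// reused only if its entry conditions match the current exit conditions
// (decision 79), otherwise a new specialized translation is made (81).
BasicBlockTranslation* lookup_isoblock(BasicBlockCache& cache,
                                       std::uint64_t subject_addr,
                                       const std::string& current_conditions) {
    auto range = cache.equal_range(subject_addr);
    for (auto it = range.first; it != range.second; ++it)
        if (it->second->entry_conditions == current_conditions)
            return it->second;                       // compatible isoblock (83)
    BasicBlockTranslation* specialized =
        translate_with_conditions(subject_addr, current_conditions);
    cache.insert({subject_addr, specialized});
    return specialized;
}
```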
As noted above, basic block translations are preferably parameterized on a compatibility list. Illustrative compatibility lists for the X86 and PowerPC architectures will now be described.
An illustrative compatibility list for the X86 architecture includes representations of: (1) lazy propagation of subject registers; (2) overlapping abstract registers; (3) type of pending condition code flag-affecting instruction; (4) lazy propagation of condition code flag-affecting instruction parameters; (5) direction of string copy operations; (6) floating point unit (FPU) mode of the subject processor; and (7) modification of segment registers.
The compatibility list for the X86 architecture includes a representation of any lazy propagation of subject registers by the translator, also referred to as register aliasing. Register aliasing occurs when the translator knows that two subject registers contain the same value at a basic block boundary. As long as the subject register values remain the same, only one of the corresponding abstract registers is synchronized, by saving it to the global register store. Until the saved register is overwritten, references to the unsaved register simply use or copy (via a move) the saved register. This avoids two memory accesses (save + restore) in the translated code.
The compatibility list for the X86 architecture includes a representation of which overlapping abstract registers are currently defined. In some cases the subject architecture contains multiple overlapping subject registers, which the translator represents using multiple overlapping abstract registers. For example, variable-width subject registers are represented using multiple overlapping abstract registers, one for each access size. For example, the X86 "EAX" register can be accessed using any of the following subject registers, each of which has a corresponding abstract register: EAX (bits 31..0), AX (bits 15..0), AH (bits 15..8) and AL (bits 7..0).
For each integer and floating-point condition code flag, the compatibility list for the X86 architecture includes a representation of whether the flag value is normalized or pending and, if the flag value is pending, the type of the pending flag-affecting instruction.
The compatibility list for the X86 architecture includes a representation of register aliasing for condition code flag-affecting instruction parameters (i.e., whether some subject register still holds the value of a flag-affecting instruction parameter, or whether the value of the second parameter is the same as that of the first). The compatibility list also includes a representation of whether the second parameter is a small constant (i.e., a candidate for an immediate-form instruction) and, if so, its value.
The compatibility list for the X86 architecture includes a representation of the current direction of string copy operations in the subject program. This condition field indicates whether string copy operations move upward or downward in memory. This supports code specialization of "strcpy()" function calls, by parameterizing the translation on the function's direction argument.
The compatibility list for the X86 architecture includes a representation of the FPU mode of the subject processor. The FPU mode indicates whether subject floating-point instructions are operating in 32-bit or 64-bit mode.
The compatibility list for the X86 architecture includes a representation of modifications to the segment registers. All X86 instruction memory references are based on one of six memory segment registers: CS (code segment), DS (data segment), SS (stack segment), ES (extra data segment), FS (general purpose segment) and GS (general purpose segment). Under normal circumstances an application does not modify the segment registers, so by default code is generated on the specialized assumption that the segment register values remain constant. It is possible, however, for a program to modify its segment registers, in which case the corresponding segment register compatibility bit is set, causing the translator to generate code for general memory accesses that uses the dynamic value of the appropriate segment register.
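One possible C++ encoding of the X86 compatibility list items (1)-(7) above is sketched below; the field layout and names are illustrative assumptions only.

```cpp
#include <map>

struct X86CompatibilityList {
    std::map<int, int> register_aliases;   // (1) lazy propagation: aliased reg -> holder reg
    unsigned overlapping_regs_defined = 0; // (2) bitmask of defined sub-registers (EAX/AX/AH/AL, ...)
    int  pending_flag_instr_type = -1;     // (3) -1 may stand for "flags normalized"
    bool flag_param_aliased = false;       // (4) flag parameter live in a subject register
    bool second_flag_param_is_constant = false;  // (4) immediate-form candidate
    long second_flag_param_value = 0;      // (4) its value, when constant
    bool string_copy_upwards = true;       // (5) direction of string copy operations
    bool fpu_64bit_mode = false;           // (6) subject FPU mode
    unsigned modified_segment_registers = 0;     // (7) bitmask over CS, DS, SS, ES, FS, GS
};
```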
An illustrative compatibility list for the PowerPC architecture includes representations of: (1) mangled registers; (2) link value propagation; (3) type of pending condition code flag-affecting instruction; (4) lazy propagation of condition code flag-affecting instruction parameters; (5) condition code flag value aliasing; and (6) summary overflow flag synchronization state.
The compatibility list for the PowerPC architecture includes a representation of mangled registers. In cases where the subject code contains multiple consecutive memory accesses that use a subject register for the base address, the translator may translate those memory accesses using a mangled target register. In cases where the subject program's data reside at a different address in target memory than they would in subject memory, the translator must include, in every memory address, a target offset calculated from the subject code. While the subject register contains the subject base address, a mangled target register contains the target address corresponding to that subject base address (i.e., subject base address + target offset). With register mangling, memory accesses can be translated more efficiently by applying the subject code offsets directly to the target base address held in the mangled register. By comparison, without the mangled register mechanism this scenario would require additional manipulation of the target code for every memory access, at a cost in both space and execution time. The compatibility list indicates which abstract registers, if any, are mangled.
The compatibility list for the PowerPC architecture includes a representation of link value propagation. For leaf functions (i.e., functions that call no other functions), the function body may be extended into the call/return site (using the extended basic block mechanism discussed above). The function body is thus translated together with the code that follows the function's return. This is also referred to as function return specialization, because such a translation includes code from the function's return site and is therefore specialized on that return location. Whether a particular block translation used link value propagation is reflected in the exit conditions. Accordingly, when the translator encounters a block whose translation used link value propagation, it must evaluate whether the current return location is the same as the previous return location. Functions return to the location from which they were called, so the call site and the return site are effectively the same (offset by one or two instructions). The translator can therefore determine whether the return locations are the same by comparing the respective call sites, which is equivalent to comparing the subject addresses of the respective predecessor blocks (of the previous and current executions of the function block). As such, in embodiments that support link value propagation, the data associated with each basic block translation include a reference to the predecessor block translation (or some other representation of the predecessor block's subject address).
For each integer and floating-point condition code flag, the compatibility list for the PowerPC architecture includes a representation of whether the flag value is normalized or pending and, if the flag value is pending, the type of the pending flag-affecting instruction.
The compatibility list for the PowerPC architecture includes a representation of register aliasing for flag-affecting instruction parameters (i.e., whether a flag-affecting instruction parameter value happens to be live in a subject register, or whether the value of the second parameter is the same as that of the first). The compatibility list also includes a representation of whether the second parameter is a small constant (i.e., a candidate for an immediate-form instruction) and, if so, its value.
The compatibility list for the PowerPC architecture includes a representation of register aliasing for the PowerPC condition code flag values. The PowerPC architecture includes instructions for explicitly loading the entire set of PowerPC flags into a general purpose (subject) register. This explicit representation of the subject flag values in a subject register interferes with the translator's condition code flag emulation optimizations. The compatibility list contains a representation of whether the flag values are live in a subject register and, if so, which register. During IR generation, references to such a subject register while it holds the flag values are translated into references to the corresponding abstract registers. This mechanism eliminates the need to explicitly calculate and store the subject flag values in a target register, which in turn allows the translator to apply its standard condition code flag optimizations.
The compatibility list for the PowerPC architecture includes a representation of summary overflow synchronization. This field indicates which of the eight summary overflow condition bits are current with the global summary overflow bit. When one of the eight condition fields of the PowerPC is updated, if the global summary overflow bit is set, it is copied to the corresponding summary overflow bit of the particular condition code field.
Translation hints
A further optimization implemented in the illustrated embodiment makes use of the translation hints 34 of the basic block data structure of Figure 3. This optimization proceeds from the recognition that there are static data specific to a particular basic block which are the same for every translation of that block. For some types of static data which are expensive to calculate, it is more efficient for the translator to calculate those data during the first translation of the corresponding block and then store the result for use in future translations of the same block. Because these data are the same for every translation of the same block, they do not parameterize the translation and are therefore not formally part of the block's compatibility list (discussed above). The expensive static data are nonetheless stored in the data associated with each basic block translation, since it is cheaper to save the data than to recalculate them. In later retranslations of the same block, even if the translator 19 cannot reuse a previous translation, the translator 19 can take advantage of these "translation hints" (i.e., the cached static data) to reduce the cost of the second and subsequent translations.
In one embodiment, the data associated with each basic block translation include translation hints, which are calculated during the first translation of that block and then copied (or referred to) by each subsequent translation.
For example, in a translator 19 implemented in C++, translation hints may be implemented as a C++ object, in which case the basic block objects corresponding to different translations of the same block would each store a reference to the same translation hints object. Alternatively, in a translator implemented in C++, the basic block cache 23 may contain one basic block object per subject basic block (rather than per translation), with each such object containing or holding a reference to the corresponding translation hints; such a basic block object also contains multiple references, organized by entry conditions, to translation objects corresponding to the different translations of that block.
Illustrative translation hints for the X86 architecture include representations of: (1) initial instruction prefixes; and (2) initial repeat prefixes. The first of these translation hints for the X86 architecture is a representation of how many prefixes the first instruction in the block has. Some X86 instructions have prefixes which modify the operation of the instruction. This architectural feature makes it difficult (i.e., computationally expensive) to decode an X86 instruction stream. Once the number of initial prefixes has been determined during the first decoding of the block, that value is stored by the translator 19 as a translation hint, so that subsequent translations of the same block need not determine it again.
The translation hints for the X86 architecture further include a representation of whether the first instruction in the block has a repeat prefix. Some X86 instructions, such as string operations, have a repeat prefix which tells the processor to execute the instruction multiple times. The translation hint indicates whether this prefix is present and, if so, its value.
In one embodiment, the translation hints associated with each basic block additionally include the entire IR forest corresponding to that basic block. This effectively caches all of the decoding and IR generation performed by the frontend. In another embodiment, the translation hints include the IR forest as it exists prior to being optimized. In yet another embodiment, the IR forest is not cached as a translation hint, in order to conserve the memory resources of the translated program.
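A minimal sketch of an X86 translation-hints object, shared by all translations of one subject block, might look as follows; the field names are assumptions for illustration.

```cpp
struct X86TranslationHints {
    int  initial_prefix_count = -1;   // (1) prefixes on the first instruction; -1 = not yet decoded
    bool has_repeat_prefix = false;   // (2) whether the first instruction has a repeat prefix
    int  repeat_prefix_value = 0;     //     the prefix value, when present
    // Optionally (per one embodiment above) the decoded IR forest for the
    // block could also be cached here, trading memory for decode time.
};
```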
Group blocks
Another optimization implemented in the illustrative translator embodiment is aimed at eliminating the program overhead incurred by synchronizing all abstract registers at the end of execution of each translated basic block. This optimization is referred to as group block optimization.
As discussed above, in basic block mode (e.g., Figure 2), state is passed from one basic block to the next using a memory region accessible to all translated code sequences, namely the global register store 27. The global register store 27 is a repository of abstract registers, each of which corresponds to, and emulates the value of, a particular subject register or other subject architectural feature. During execution of the translated code 21, abstract registers are held in target registers so that they may participate in instructions. During execution of the translator code 19, abstract register values are stored in the global register store 27 or in the target registers 15.
Thus, in the basic block mode illustrated in Figure 2, all abstract registers must be synchronized at the end of each basic block for two reasons: (1) control returns to the translator code 19, which potentially overwrites all target registers; and (2) because code generation sees only one basic block at a time, the translator 19 must assume that all abstract register values are live (i.e., will be used in subsequent basic blocks) and must therefore be saved. The goal of the group block optimization mechanism is to reduce synchronization across frequently crossed basic block boundaries by translating multiple basic blocks as a contiguous whole. By translating multiple basic blocks together, the synchronization at block boundaries can be minimized, if not eliminated.
Group block construction is triggered when the profiling metric of the current block reaches a trigger threshold. This block is referred to as the trigger block. Construction can be separated into the following steps (Figure 6): (1) member block selection 71; (2) member block ordering 73; (3) global dead code elimination 75; (4) graph coloring register allocation 77; and (5) code generation 79. The first step 71 identifies the set of blocks to be included in the group block, by a depth-first search (DFS) traversal of the program's control flow graph beginning at the trigger block and tempered by an inclusion threshold and a maximum member limit. The second step 73 orders the set of blocks and identifies the critical path through the group block, to enable an efficient code layout that minimizes synchronization code and reduces branches. The third and fourth steps 75, 77 perform optimizations. The final step 79 generates target code for all member blocks in turn, producing an efficient code layout with efficient register allocation.
In constructing a group block and generating target code from it, the translator 19 performs the steps illustrated in Figure 6. When the translator 19 encounters a previously translated basic block, before executing that block it checks the block's profiling metric 37 (Figure 3) against the trigger threshold. The translator 19 begins group block creation when a basic block's profiling metric 37 exceeds the trigger threshold. The translator 19 identifies the members of the group block by traversing the control flow graph, beginning at the trigger block and tempered by the inclusion threshold and the maximum member limit. Next, the translator 19 creates an ordering of the member blocks, which identifies the critical path through the group block. The translator 19 then performs global dead code elimination; the translator 19 gathers register liveness information for each member block, using the IR corresponding to each block. Next, the translator 19 performs graph coloring register allocation according to an architecture-specific policy, which defines a partial set of uniform register mappings for all member blocks. Finally, the translator 19 generates target code for each member block in order, consistent with the register allocation constraints and using the register liveness analysis.
As noted above, the data associated with each basic block include a profiling metric 37. In one embodiment, the profiling metric 37 is an execution count, meaning that the translator 19 counts the number of times a particular basic block has been executed; in this embodiment, the profiling metric 37 is represented as an integer count field (a counter). In another embodiment, the profiling metric 37 is execution time, meaning that the translator 19 keeps a running total of the execution time of all executions of a particular basic block, for example by planting code at the beginning and end of the basic block to start and stop, respectively, a hardware or software timer; in this embodiment, the profiling metric 37 uses some representation of the total execution time (a timer). In another embodiment, the translator 19 stores multiple types of profiling metric 37 for each basic block. In yet another embodiment, the translator 19 stores multiple sets of profiling metrics 37 for each basic block, corresponding to each predecessor basic block and/or each successor basic block, so that distinct profiling data are maintained for different control paths. On each translator cycle (i.e., the execution of translator code 19 between executions of translated code 21), the profiling metric 37 of the appropriate basic block is updated.
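For instance, the execution-count flavour of the profiling metric and the trigger check performed on each translator cycle could be sketched as follows, reusing the earlier BasicBlockTranslation record; the threshold value and the helper begin_group_block_construction() are assumptions.

```cpp
const std::uint64_t TRIGGER_THRESHOLD = 1000;   // assumed tuning constant

// Hypothetical entry point into group block construction (Figure 6).
void begin_group_block_construction(BasicBlockTranslation* trigger_block);

// Called on each translator cycle for the block about to be executed.
void update_profiling(BasicBlockTranslation& block) {
    ++block.profiling_metric;                   // execution-count embodiment
    if (block.profiling_metric >= TRIGGER_THRESHOLD)
        begin_group_block_construction(&block); // block becomes the trigger block
}
```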
In embodiments that support group blocks, the data associated with each basic block additionally includes references 38, 39 to the basic block objects of known predecessor and successor basic blocks. Taken together, these references constitute a control flow graph of all previously executed basic blocks. During group block formation, the translator 19 traverses this control flow graph to determine which basic blocks are to be included in the group block being formed.
Group block formation in the illustrative embodiment is based on three thresholds: a trigger threshold, an inclusion threshold, and a maximum member limit. The trigger threshold and the inclusion threshold both refer to the profiling metric 37 of each basic block. On each translator cycle, the profiling metric 37 of the next basic block is compared against the trigger threshold. If the metric 37 meets the trigger threshold, group block formation begins. The inclusion threshold is then used to determine the extent of the group block, by identifying which successor basic blocks are to be included in it. The maximum member limit defines an upper bound on the number of basic blocks that may be included in any one group block.
When the trigger threshold is reached for a basic block A, a new group block is formed with A as the trigger block. The translator 19 then begins the definition traversal, a traversal of the successors of A in the control flow graph, to identify the other member blocks to be included. When the traversal reaches a given basic block, its profiling metric 37 is compared against the inclusion threshold. If the metric 37 meets the inclusion threshold, the basic block is marked for inclusion and the traversal continues to that block's successors. If the block's metric 37 falls below the inclusion threshold, that block is excluded and its successors are not traversed. When the traversal ends (i.e., every path either reaches an excluded block, returns to an already included block, or reaches the maximum member limit), the translator 19 constructs the new group block from all of the included basic blocks.
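A minimal sketch of the definition traversal just described, reusing the hypothetical BasicBlock record above; the INCLUSION_THRESHOLD and MAX_MEMBERS values are illustrative (the inclusion figure again comes from the Figure 7 example), and the function name is an assumption rather than the patent's own terminology.

    INCLUSION_THRESHOLD = 1000       # example value; see the Figure 7 discussion
    MAX_MEMBERS = 20                 # assumed maximum member limit

    def select_members(trigger_block):
        # Depth-first search from the trigger block, governed by the inclusion
        # threshold and the maximum member limit.
        members = []

        def visit(block):
            if block in members or len(members) >= MAX_MEMBERS:
                return                                     # already included, or limit hit
            if block.profiling_metric < INCLUSION_THRESHOLD:
                return                                     # excluded; successors not traversed
            members.append(block)
            for successor in block.successors:
                visit(successor)

        visit(trigger_block)
        return members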
In embodiments that use both isoblocks and group blocks, the control flow graph is a graph of isoblocks, meaning that, for the purposes of group block creation, different isoblocks of the same subject block are treated as distinct blocks. Accordingly, the profiling metrics of different isoblocks of the same subject block are not aggregated.
In another embodiment, isoblocks are not used for basic block translation but are used for group block translation, meaning that non-group basic block translations are generalized (not specialized on entry conditions). In this embodiment, the profiling metric of a basic block is disaggregated by the entry conditions of each execution, such that distinct profiling information is maintained for each theoretical isoblock (i.e., for each distinct set of entry conditions). In this embodiment, the data associated with each basic block includes a profile list, each member of which is a three-item set containing: (1) a set of entry conditions; (2) a corresponding profiling metric; and (3) a list of corresponding successor blocks. These data maintain profiling and control path information for each set of entry conditions to the basic block, even though no actual basic block translations are specialized on those entry conditions. In this embodiment, the trigger threshold is compared against each profiling metric in a basic block's profile list. When the control flow graph is traversed, each element of a given basic block's profile list is treated as a separate node in the graph, so the inclusion threshold is likewise compared against each profiling metric in the block's profile list. In this embodiment, group blocks are created for the particular hot isoblocks of hot subject blocks (i.e., specialized on particular entry conditions), while executions of those same subject blocks under other entry conditions use the general (unspecialized) translations.
Following the definition traversal, the translator 19 performs an ordering traversal (step 73 of Figure 6) to determine the order in which the member blocks are translated. The order of the member blocks affects both the instruction cache behaviour of the translated code 21 (hot paths should be contiguous) and the synchronization required on member block boundaries (synchronization along hot paths should be minimal). In one embodiment, the translator 19 performs the ordering traversal using an ordered depth-first search (DFS) algorithm, ordered by execution count: the traversal starts at the member block with the highest execution count, and when a traversed member block has multiple successors, the successor with the higher execution count is traversed first.
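A correspondingly minimal sketch of the ordering traversal: an ordered depth-first search over the selected members, starting at the hottest block and visiting hotter successors first. Again this is an illustrative assumption of how such a traversal could be coded, not the translator's own routine.

    def order_members(members):
        member_set = set(members)
        ordered, seen = [], set()

        def visit(block):
            if block in seen or block not in member_set:
                return
            seen.add(block)
            ordered.append(block)
            for successor in sorted(block.successors,
                                    key=lambda b: b.profiling_metric,
                                    reverse=True):
                visit(successor)

        visit(max(members, key=lambda b: b.profiling_metric))   # hottest block first
        for block in members:        # pick up members unreachable from the hottest block
            visit(block)
        return ordered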
Those skilled in the art will appreciate that a group block is not a formal basic block, as it may contain internal control branches, multiple entry points, and/or multiple exit points.
Once a group block has been formed, it may be further optimized using what is referred to herein as "global dead code elimination." Global dead code elimination employs the technique of liveness analysis, and is the process of removing redundant work from the IR of a group of basic blocks. In general, the subject processor state must be synchronized on translation scope boundaries. A value, such as a subject register, is said to be "live" over the range of code that begins with its definition and ends with its last use prior to being redefined (overwritten); the analysis of the uses and definitions of values (for example, temporary values in the context of IR generation, target registers in the context of code generation, or subject registers in the context of translation) is accordingly known in the art as liveness analysis. Whatever knowledge the translator has of the uses (reads) and definitions (writes) of data and state is limited to its translation scope; the rest of the program is unknown. More specifically, because the translator does not know which subject registers will be used outside the scope of translation (for example, in a later basic block), it must assume that all registers will be used. Accordingly, the values (definitions) of any subject registers modified within a given basic block must be saved (stored into the global register store 27) at the end of that basic block, against the possibility of their future use. Likewise, every subject register whose value will be used within a given basic block must first be restored (loaded) from the global register store 27 at the start of that block; that is, the translated code for a basic block must restore a given subject register before its first use within that block.
The general mechanism of IR generation includes an implicit form of "local" dead code elimination, whose scope at any one time is limited to only a small group of IR nodes. For example, a common subexpression A in the subject code is represented by a single IR tree for A with multiple parent nodes, rather than by multiple instances of the expression tree A itself. The "elimination" is implicit in the fact that an IR node may have links to multiple parent nodes. Likewise, the use of abstract registers as IR placeholders is an implicit form of dead code elimination: if the subject code of a given basic block never defines a particular subject register, then at the end of IR generation for that block the abstract register corresponding to that subject register refers to an empty IR tree. In such cases, the code generation phase recognizes that the abstract register in question does not need to be synchronized with the global register store. Local dead code elimination is thus implicit in the IR generation phase, occurring incrementally as the IR nodes are created.
In contrast to local dead code elimination, "global" dead code elimination algorithms are applied to a basic block's entire IR expression forest. Global dead code elimination according to the illustrated embodiment requires liveness analysis, i.e., analysis of the subject register uses (reads) and subject register definitions (writes) within the scope of each basic block of the group block, to identify live and dead regions. The IR is transformed to remove the dead regions and thereby reduce the amount of work that the target code must perform. For example, if at a given point in the subject code the translator 19 recognizes or detects that a particular subject register will be defined (overwritten) before its next use, then that subject register is said to be dead at all points in the code up to that preempting definition. In terms of IR, subject registers which are defined but never used before being redefined are dead code, which can be eliminated at the IR stage without ever generating target code. In terms of target code generation, dead target registers can be used for other temporary or subject register values without spilling.
In group block global dead code elimination, liveness analysis is performed over all member blocks. Liveness analysis generates an IR forest for each member block, which is then used to derive the subject register liveness information for that block. The IR forest of each member block is also needed during the code generation phase of group block creation. Once the IR for each member block has been generated during liveness analysis, it may either be saved for later use in code generation, or be deleted and regenerated during code generation.
Group block global dead code elimination can effectively "transform" the IR in two ways. First, the IR forest generated for each member block during liveness analysis may itself be modified, and the entire IR forest then propagated to the subsequent code generation phase (i.e., saved and reused); in this case, the IR transformations are carried through to code generation by applying them directly to the IR forest and saving the transformed forest. Here, the data associated with each member block includes the liveness information (used additionally for global register allocation) and the transformed IR forest of that block.
Alternatively, the global dead code elimination step which transforms a member block's IR may be performed during the final code generation phase of group block creation, using the previously created liveness information. In this embodiment, the global dead code transformations can be recorded as a list of "dead" subject registers, which is then encoded in the liveness information associated with each member block. The actual transformation of the IR forest is thereby carried out by the later code generation phase, which uses the dead register list to prune the IR forest. This scenario allows the translator to generate the IR once during liveness analysis, discard it, and then regenerate identical IR during code generation, at which point the liveness analysis is used to transform the IR (i.e., global dead code elimination is applied to the IR itself). In this case, the data associated with each member block includes the liveness information, which contains the list of dead subject registers; the IR forest is not saved. In particular, when the IR forest is (re)generated in the code generation phase, the IR trees of the dead subject registers (those listed in the dead subject register list of the liveness information) are pruned.
In one embodiment, the IR created during liveness analysis is discarded once the liveness information has been extracted, in order to conserve memory resources. The IR forests (one per member block) are recreated during code generation, one member block at a time. In this embodiment, the IR forests of all member blocks never coexist at any point in the translation. Nevertheless, the two versions of an IR forest, created respectively during liveness analysis and code generation, are identical, since they are generated from the subject code using the same IR generation process.
In another embodiment, the translator creates an IR forest for each member block during liveness analysis and saves the IR forest in the data associated with that member block, to be reused later during code generation. In this embodiment, the IR forests of all member blocks coexist from the end of liveness analysis (the global dead code elimination step) to code generation. In one version of this embodiment, no transformation or optimization is performed on the IR between its initial creation (during liveness analysis) and its final use (code generation).
In another embodiment, the IR forests of all member blocks are preserved between the liveness analysis and code generation steps, and inter-block optimizations are performed on the IR forests before code generation. In this embodiment, the translator takes advantage of the fact that the IR forests of all member blocks coexist at the same point in the translation, and performs optimizations across the IR forests of different member blocks which transform those forests. In this case, the IR forests used in code generation may differ from the IR forests used in liveness analysis (as described in the two embodiments above), because the forests have since been transformed by the inter-block optimizations; in other words, the IR forests used during code generation may differ from the IR forests that would be obtained by regenerating them one member block at a time.
In group block global dead code elimination, the scope of dead code detection is increased because liveness analysis is applied to multiple blocks at once. Thus, if a subject register is defined in the first member block and then redefined in a third member block, with no intervening uses or exit points, the IR tree for the first definition can be deleted from the first member block. By comparison, in basic block code generation the translator 19 would have been unable to detect that this subject register was dead.
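To make the cross-block effect concrete, the following hypothetical sketch flattens the ordered member blocks into a single event stream of subject register definitions and uses, and flags definitions that are overwritten before any use or group exit. It deliberately ignores internal branches (a real implementation performs the analysis over the control flow graph and operates on the IR forests), so it illustrates the idea rather than the described algorithm.

    def find_dead_definitions(ordered_members, events, may_exit_after):
        # events[block]: list of ("use" | "def", subject_register) in program order
        # may_exit_after[block]: True if control can leave the group after the block
        flat = []
        for block in ordered_members:
            flat.extend(events[block])
            if may_exit_after[block]:
                flat.append(("exit", None))
        dead = []
        for i, (op, reg) in enumerate(flat):
            if op != "def":
                continue
            for later_op, later_reg in flat[i + 1:]:
                if later_op == "exit":
                    break              # value may be read outside the group: keep it
                if later_reg != reg:
                    continue
                if later_op == "use":
                    break              # read before being overwritten: keep it
                dead.append(i)         # redefined with no intervening use: dead
                break
        return dead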
As noted above, the goal of group block optimization is to reduce or eliminate the need for register synchronization at basic block boundaries. It is therefore now discussed how the translator 19 performs register allocation and synchronization in the group block process.
Register allocation is the process of associating abstract (subject) registers with target registers. It is a necessary part of code generation, because abstract register values must reside in target registers in order to participate in target instructions. The representation of these allocations (i.e., mappings) between target registers and abstract registers is referred to as a register map. During code generation the translator 19 maintains a working register map, which reflects the current state of register allocation (i.e., the target-to-abstract register mappings actually in existence at a given point in the target code). Reference will be made below to an exit register map, which is, abstractly, a snapshot of the working register map on exit from a member block; however, since the exit register map is not needed for synchronization, it is not recorded, and is therefore purely abstract. The entry register map 40 (Figure 3) is a snapshot of the working register map on entry to a member block, and it must be recorded for use in synchronization.
As noted above, a group block contains several member blocks, and code is generated for each member block separately. Accordingly, each member block has its own entry register map 40 and exit register map, which reflect the allocation of particular target registers to particular subject registers at the beginning and end, respectively, of the translated code for that block.
Code generation for a member block is parameterized by its entry register map 40 (the working register map on entry), but code generation also modifies the register map. The exit register map of a member block reflects the working register map, as modified by the code generator, at the end of that block. When the first member block is translated, the working register map is empty (subject to global register allocation, discussed below). At the end of translation of the first member block, the working register map contains the register mappings created by the code generation process. The working register map is then copied into the entry register maps 40 of all successor member blocks.
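The propagation of the working register map into successors' entry register maps can be pictured as follows; this is a hypothetical sketch reusing the BasicBlock record above, with propagate_working_map an invented name.

    def propagate_working_map(member_block, working_map, group_members):
        # Called when code generation for a member block finishes.  The working
        # register map at this point plays the role of the (unrecorded) exit map.
        for successor in member_block.successors:
            if successor in group_members and successor.entry_register_map is None:
                # Successor not yet translated: fix its entry register map to the
                # current working map, so no synchronization is needed on this edge.
                successor.entry_register_map = dict(working_map)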
When code generation for a member block ends, some abstract registers may not require synchronization. The register maps allow the translator 19 to minimize synchronization on member block boundaries, by identifying which registers actually need it. By comparison, in the case of (non-group) basic blocks, all abstract registers must be synchronized at the end of every basic block.
At the end of a member block there are three synchronization scenarios, depending on the successor block. First, if the successor is a member block that has not yet been translated, its entry register map 40 is defined to be the same as the working register map, with the result that no synchronization is needed. Second, if the successor block lies outside the group, then all abstract registers must be synchronized (i.e., a full synchronization), because control will return to the translator code 19 before the successor is executed. Third, if the successor is a member block whose register map has already been fixed, then synchronization code must be inserted to reconcile the working map with the successor's entry map.
Some of the cost of register map synchronization is reduced by the group block ordering traversal, which minimizes, or entirely eliminates, register synchronization along hot paths. Member blocks are translated in the order produced by the ordering traversal. As each member block is translated, its exit register map is propagated into the entry register maps 40 of all successor member blocks whose entry register maps have not yet been fixed. In effect, the hot path through the group block is translated first, and most, if not all, member blocks along that path require no synchronization, because their corresponding register maps are all consistent.
For example, the boundary between the first and second member blocks will never require synchronization, because the second member block will always have its entry register map 40 fixed to be identical to the exit register map 41 of the first member block. Some synchronization between member blocks may nevertheless be unavoidable, because group blocks can contain internal control branches and multiple entry points. This means that the same successor member block can be reached from different predecessors, at different times, with different working register maps. Such cases require the translator 19 to synchronize the working register map with the appropriate member block's entry register map.
Where required, register map synchronization occurs at member block boundaries: the translator 19 inserts code at the end of a member block to synchronize the working register map with the successor's entry register map 40. In register map synchronization, each abstract register falls into one of ten synchronization cases. Table 1 expresses these ten register synchronization cases as a function of the translator's working register map and the entry register map 40 of the successor block. Table 2 describes the register synchronization algorithm, enumerating the ten formal synchronization cases with a textual description and a pseudo-code description of the corresponding synchronization action (the pseudo-code is explained below). Thus, at every member block boundary, every abstract register is synchronized using the ten-case algorithm. This detailed articulation of the synchronization cases and actions allows the translator to generate efficient synchronization code, which minimizes the synchronization cost of each abstract register.
The synchronization action functions listed in Table 2 are described below. "Spill(E(a))" saves abstract register a from target register E(a) into the subject register bank (a component of the global register store). "Fill(t,a)" loads abstract register a from the subject register bank into target register t. "Reallocate()" moves and reallocates an abstract register to a new target register if one is available (i.e., changes the abstract register's mapping), or spills the abstract register if no target register is available. "FreeNoSpill(t)" marks a target register as free without spilling the associated abstract subject register; the FreeNoSpill() function is necessary to avoid superfluous spills on multiple applications of the algorithm at the same synchronization point. Note that, for cases whose synchronization action is "Nil", no synchronization code is required for the corresponding abstract register.
[Table 1: Enumeration of the ten register synchronization cases (reproduced as figures in the original publication).]
[Table 2: The register synchronization algorithm, giving textual and pseudo-code descriptions of the synchronization action for each of the ten cases (reproduced as figures in the original publication).]
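Because Tables 1 and 2 are reproduced only as figures, the sketch below does not reconstruct the ten cases; it merely illustrates the general shape of boundary synchronization using the actions named above, under the assumption that register maps are dictionaries from abstract registers to target registers and that emit stands in for planting target code. A real implementation would follow the full ten-case analysis, including Reallocate() and FreeNoSpill().

    def synchronize_register_maps(working_map, entry_map, emit):
        # working_map / entry_map: dicts of abstract register -> target register
        for a in set(working_map) | set(entry_map):
            w, e = working_map.get(a), entry_map.get(a)
            if w == e:
                continue                           # "Nil": no synchronization code needed
            if e is None:
                emit(f"Spill({a}) from {w}")       # successor expects a in memory
            elif w is None:
                emit(f"Fill({e}, {a})")            # successor expects a in register e
            else:
                emit(f"Spill({a}) from {w}")       # crude move via memory; a real
                emit(f"Fill({e}, {a})")            # implementation would swap or
                                                   # Reallocate() where possible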
The translator 19 performs two levels of register allocation within a group block: global, and local (or temporary). Global register allocation is the definition of particular register mappings, before code generation, which persist throughout an entire group block (i.e., across all member blocks). Local register allocation consists of the register mappings created in the course of code generation. Global register allocation defines particular register allocation constraints which parameterize the code generation of the member blocks by constraining local register allocation.
Abstract registers that are allocated globally do not require synchronization on member block boundaries, because they are guaranteed to be allocated to the same respective target register in every member block. This approach has the advantage that, for globally allocated abstract registers, no synchronization code (which compensates for differences between the register maps of different blocks) is needed on member block boundaries. Its disadvantage is that a group-wide register mapping hinders local register allocation, because the globally allocated target registers are not immediately available for new mappings. To compensate, the number of global register mappings may be limited for a particular group block.
The number and selection of the actual global register allocations are defined by a global register allocation policy. The global register allocation policy is configurable according to the subject architecture, the target architecture, and the application being translated. The optimal number of globally allocated registers is derived empirically, and is a function of the number of target registers, the number of subject registers, the type of application being translated, and application usage patterns. The number is generally the total number of target registers less some small number, so as to ensure that enough target registers remain for temporary values.
In cases where there are many subject registers but few target registers, such as MIPS-to-X86 and PowerPC-to-X86 translators, the number of globally allocated registers is zero. This is because the X86 architecture has so few target registers that using any fixed register allocation has been found to produce worse target code than using none at all.
In cases where there are many subject registers and many target registers, such as an X86-to-MIPS translator, the number of globally allocated registers (n) is three-quarters of the number of target registers (T). Hence:
X86-MIPS: n = 3/4 * T
Although the X86 architecture has few general-purpose registers, it is treated as having many subject registers, because many abstract registers are needed to emulate the complex X86 processor state (including, for example, the condition code flags).
In cases where the numbers of subject registers and target registers are approximately equal, such as a MIPS-to-MIPS accelerator, most target registers are globally allocated, with only a few reserved for temporary values. Hence:
MIPS-MIPS: n = T - 3
In cases where the total number (s) of subject registers used across the entire group block is less than or equal to the number (T) of target registers, all subject registers are globally mapped. This means that the entire register map is constant across all member blocks. In the special case where s = T, i.e., the number of live subject registers equals the number of target registers, no target registers remain for temporary calculations; in this case, temporary values are locally allocated to target registers that are globally allocated to subject registers which have no further uses within the same expression tree (this information being obtained through liveness analysis).
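The policy figures quoted above can be summarized as a small selection function; the sketch below is illustrative only, and the branch for unlisted architecture pairs is an assumption not taken from the text.

    def globally_allocated_count(pair, s, T):
        # pair: subject-target architecture pair; s: subject registers live across
        # the group block; T: number of target registers.
        if s <= T:
            return s                     # whole register map constant across members
        if pair in ("MIPS-X86", "PowerPC-X86"):
            return 0                     # too few target registers for fixed mappings
        if pair == "X86-MIPS":
            return (3 * T) // 4          # n = 3/4 * T
        if pair == "MIPS-MIPS":
            return T - 3                 # n = T - 3
        return 0                         # assumed conservative default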
At the end of group block creation, code is generated for each member block in traversal order. During code generation, the IR forest of each member block is (re)generated, and the list of dead subject registers (contained in the block's liveness information) is used to prune the IR forest before target code is produced. As each member block is translated, its exit register map is propagated into the entry register maps 40 of all successor member blocks (except those which have already been fixed). Because the blocks are translated in traversal order, this minimizes register map synchronization along hot paths, and at the same time makes the hot path translations contiguous in the target memory space. As with basic block translations, group member block translations are specialized on a set of entry conditions, namely the current working conditions at the time the group block is created.
Figure 7 provides an example of group block generation by the translator code 19 according to the illustrated embodiment. The example group block has five members ("A" to "E"), initially one entry point ("Entry 1"; Entry 2 is created later by aggregation, as discussed below) and three exit points ("Exit 1", "Exit 2", and "Exit 3"). In this example, the trigger threshold for group block creation is an execution count of 45000, and the inclusion threshold for member blocks is an execution count of 1000. Construction of this group block was triggered when the execution count of block A reached the trigger threshold of 45000 (now 45074), at which point a search of the control flow graph was carried out to identify the group block members. In this example, five blocks were found whose execution counts exceeded the inclusion threshold of 1000. Once the member blocks have been identified, an ordered depth-first search (ordered by profiling metric) is performed, so that the hotter blocks and their successors are processed first; this produces a group block ordered along its critical paths.
Global dead code elimination is performed at this stage. Each member block is analysed for register uses and definitions (i.e., liveness analysis). This makes code generation more efficient in two ways. First, local register allocation can take into account which subject registers are live in the group block (i.e., which registers will be used in the current or successor member blocks), which helps to minimize the cost of spills; dead registers are spilled first, because they do not need to be restored. In addition, if liveness analysis shows that a particular subject register is defined, used, and then redefined (overwritten), the value can be discarded at any time after its last use (i.e., its target register can be freed). If liveness analysis shows that a particular subject register value is defined and then redefined with no intervening use (which is unlikely, as it would mean that the subject compiler generated dead code), then the corresponding IR tree for that value can be discarded, so that no target code is ever generated for it.
Global register allocation is performed next. The translator 19 assigns frequently accessed subject registers a fixed target register mapping which is constant across all member blocks. Globally allocated registers cannot be spilled, meaning that those target registers are unavailable to local register allocation. When there are more subject registers than target registers, a proportion of the target registers must be kept for temporary subject register mappings. In the special case where the entire set of subject registers used within the group block fits into the target registers, spills and fills are avoided entirely. As shown in Figure 7, the translator plants code ("Pr1") prior to the head of the group block ("A") to load these registers from the global register store 27; such code is referred to as prologue loads.
The group block is now ready for target code generation. During code generation, the translator 19 uses the working register map (the mapping between abstract registers and target registers) to keep track of register allocation. The value of the working register map at the start of each member block is recorded in that block's associated entry register map 40.
First, the prologue block Pr1 is generated, which loads the globally allocated abstract registers. At this point, the working register map at the end of Pr1 is copied into the entry register map 40 of block A.
Block A is then translated, and its target code is planted directly after the target code of Pr1. Control flow code is planted to handle the exit condition of Exit 1, consisting of a dummy branch (to be patched later) to the epilogue block Ep1 (to be planted later). At the end of block A, the working register map is copied into the entry register map 40 of block B. Fixing B's entry register map 40 in this way has two consequences: first, no synchronization is needed on the path from A to B; second, entry to B from any other block (i.e., a member block of this group block, or a member block of another group block using aggregation) requires synchronization of that block's exit register map with B's entry register map.
Block B is the next block on the hot path. Its target code is planted immediately after that of block A, and code is then planted to handle its two successors, C and A. The first successor, block C, has not yet had its entry register map 40 fixed, so the working register map is simply copied into C's entry register map. The second successor, block A, has already had its entry register map 40 fixed, so the working register map at the end of block B may differ from the entry register map 40 of block A. Any difference between the register maps requires some synchronization ("B-A") along the path from block B to block A, to bring the working register map into line with the entry register map 40. This synchronization takes the form of register spills, fills, and swaps, and is detailed in the ten register map synchronization cases above.
Block C is translated next, and its target code is planted in turn, immediately after the code already generated. Blocks D and E are likewise translated and planted in sequence. The path from E to A again requires register map synchronization, from the exit register map of E (i.e., the working register map at the end of E's translation) to the entry register map 40 of A; this synchronization is planted as block "E-A".
Before the group block is exited and control returned to the translator 19, the globally allocated registers must be synchronized to the global register store; this code is referred to as epilogue saves. After the member blocks have been translated, code generation plants the epilogue blocks for all of the exit points (Ep1, Ep2 and Ep3), and patches the branch targets throughout the member blocks. In embodiments that use both isoblocks and group blocks, the control flow graph traversal is made in terms of unique subject blocks (i.e., particular basic blocks of subject code) rather than the isoblocks of those blocks. As such, isoblocks are transparent to group block creation; no particular distinction is made between a subject block that has one translation and one that has several.
In the illustrated embodiment, the group block and isoblock optimizations may advantageously be used together. However, the fact that the isoblock mechanism may create several different basic block translations of the same subject code sequence complicates the decision as to which blocks to include in a group block, because the blocks to be included may not exist until the group block is formed. The information collected using the unspecialized blocks that existed before this optimization may need to be adapted before it is used in the selection and layout process.
The illustrated embodiment also employs a technique for accommodating nested loops in group block generation. Group blocks are initially created with only one entry point, namely the start of the trigger block. Nested loops in a program cause the inner loop to become hot first, creating a group block that represents the inner loop. Later, the outer loop becomes hot, creating a new group block that includes all of the blocks of both the inner and the outer loop. If the group block generation algorithm takes no account of the work done for the inner loop, but simply repeats all of that work, then programs containing deeply nested loops will generate progressively larger group blocks, each requiring more storage and more work to generate. In addition, the old (inner) group blocks may become unreachable, and thus provide little or no benefit.
According to the illustrated embodiment, group block aggregation is therefore used to combine previously built group blocks with additional optimized blocks. During the phase in which blocks are selected for inclusion in a new group block, those candidate blocks that are already included in a previous group block are identified.
Rather than planting target code for such blocks, aggregation is performed, whereby the translator 19 creates a link to the appropriate location in the existing group block. Because these links may jump into the middle of an existing group block, the working register map corresponding to that location must be enforced; the code planted for the link therefore includes the register map synchronization code required.
Group block aggregation is supported by the entry register maps 40 stored in the basic block data structures 30. Aggregation allows other translated code to jump into the middle of a group block, using the start of a member block as an entry point. Such an entry point requires that the current working register map be synchronized to the entry register map 40 of that member block; the translator 19 achieves this by planting synchronization code (i.e., spills and fills) between the exit point of the predecessor block and the entry point of the member block.
In one embodiment, the register maps of some member blocks are selectively deleted in order to conserve resources. Initially, the entry register maps of all of the member blocks in a group are stored indefinitely, to facilitate entering the group block (from an aggregate group block) at the start of any member block. As group blocks become large, some register maps may be deleted to conserve memory. When a register map is deleted, aggregation effectively divides the group block into regions, some of which (i.e., the member blocks whose register maps have been deleted) are inaccessible to aggregate entry. Different policies may be used to decide which register maps to store: one policy is to store the register maps of all member blocks (i.e., never delete any); another is to store register maps only for the hottest member blocks; a further policy is to store register maps only for member blocks that are the targets of backward branches (i.e., the starts of loops).
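The retention policies just listed could be expressed as a simple predicate; this sketch is hypothetical, and the policy names, the hot_threshold value, and the is_backward_branch_target attribute are all invented for illustration.

    def keep_entry_register_map(block, policy, hot_threshold=10000):
        if policy == "all":
            return True                                   # never delete any map
        if policy == "hottest":
            return block.profiling_metric >= hot_threshold
        if policy == "loop_heads":                        # targets of backward branches
            return getattr(block, "is_backward_branch_target", False)
        return False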
In another embodiment, the data associated with each group member block includes a register map recorded for every subject instruction location. This allows other translated code to jump into the middle of a group block at any point, not just at the start of a member block, since in some cases a group member block may contain entry points that went undetected when the group block was formed. This technique consumes large amounts of memory, and is therefore appropriate only when memory conservation is not a concern.
Group blocks provide a mechanism for identifying frequently executed blocks, or sets of blocks, and applying additional optimizations to them. Because more computationally expensive optimizations are applied to group blocks, their formation is preferably confined to basic blocks that are known to execute frequently. In the case of group blocks, the additional computation is justified by frequent execution; contiguous blocks that are executed frequently are referred to as a "hot path."
Embodiments may be configured in which multiple levels of frequency and optimization are used, such that the translator 19 detects several tiers of frequently executed basic blocks and applies progressively more complex optimizations. Alternatively, and as described above, only two levels of optimization are used: basic optimizations are applied to all basic blocks, and a single set of further optimizations is applied to group blocks using the group block creation mechanism described above.
Block translation overview
Figure 8 illustrates the steps performed by the translator at run-time, between executions of translated code. When a first basic block (BB_N-1) finishes execution 1201, it returns control to the translator 1202. The translator increments the profiling metric of the first basic block 1203. The translator then queries the basic block cache 1205, using the subject address returned by the execution of the first basic block, for previously translated isoblocks of the current basic block (BB_N, i.e., the successor of BB_N-1). If the successor block has already been translated, the basic block cache returns one or more basic block data structures. The translator then compares the successor's profiling metric against the group block trigger threshold 1207 (this may involve aggregating the profiling metrics of several isoblocks). If the threshold is not met, the translator then checks whether any of the isoblocks returned by the basic block cache is compatible with the current working conditions (i.e., an isoblock whose entry conditions are identical to the exit conditions of BB_N-1). If a compatible isoblock is found, that translation is executed 1211.
If the successor block's profiling metric exceeds the group block trigger threshold, then a new group block is created 1213 and executed 1211, as described above, even if a compatible isoblock exists.
If the basic block cache returns no isoblocks, or if none of the isoblocks returned is compatible, then the current block is translated 1217 into an isoblock specialized on the current working conditions, as described above. At the end of decoding BB_N, if the successor of BB_N (BB_N+1) is statically determinable 1219, an extended basic block is created 1215. If an extended basic block is created, BB_N+1 is then translated 1217, and so on. When translation is complete, the new isoblock is stored in the basic block cache 1221 and is then executed 1211.
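The run loop of Figure 8 can be paraphrased as follows. This is a hypothetical sketch: block_cache is assumed to be a dictionary from subject address to a list of isoblocks carrying profiling_metric and entry_conditions attributes, and translate and build_group are stand-ins for the translator's own translation and group block construction routines (extended basic block creation is folded into translate).

    def translator_cycle(prev_block, working_conditions, next_address,
                         block_cache, translate, build_group, trigger=45000):
        prev_block.profiling_metric += 1                         # step 1203
        candidates = block_cache.get(next_address, [])           # step 1205
        if candidates and sum(b.profiling_metric for b in candidates) >= trigger:
            return build_group(candidates)                       # steps 1213, 1211
        for block in candidates:                                 # step 1209
            if block.entry_conditions == working_conditions:
                return block                                     # compatible isoblock: reuse
        new_block = translate(next_address, working_conditions)  # steps 1217, 1219, 1215
        block_cache.setdefault(next_address, []).append(new_block)   # step 1221
        return new_block                                         # executed next, step 1211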
Shared code caching
In another preferred embodiment, the translator 19 may include a shared code caching mechanism which, for example, allows the target code 21 and translation structures corresponding to a particular subject program to be shared between different executions, or instances, of the translator 19. A translator "instance" is a particular execution of the translator, i.e., one translated execution of a subject program. As discussed in greater detail below, shared code caching may be enhanced by a dedicated code cache server, which interacts with the translator 19 at the beginning and end of a translator execution, and at any time during execution when the subject code is modified (such as when a subject library is loaded).
Figure 9 sets out the main aspects of the shared code caching process according to the illustrated embodiment. In a first step 101, the translator 19 translates a portion of subject code S1 into target code T1. In order that the target code T1 may be reused, the translator 19 caches the target code T1, step 103. At decision diamond 105, the translator 19 determines the compatibility between the next portion of subject code, S2, and the target code T1 previously cached in step 103. As set out in connection with decision diamond 105, if the cached target code T1 is compatible with the new subject code portion S2, the cached target code T1 is retrieved and executed, step 109, thereby removing the burden and necessity of translating the new subject code portion. If there is no compatibility, the next (new) portion of subject code is translated into target code and processed further, as shown at step 111.
In an illustrative, advantageous application of the process of Figure 9, the translator 19 keeps all of the target code produced during a first subject program execution in temporary storage, and then caches all of that target code when the execution ends. The translator 19 subsequently makes the compatibility determination during subject code translation for the second subject program execution.
The compatibility determination between a new portion of subject code and the cached target code, shown at step 105, may be made in a number of different ways. In the illustrated embodiment, the translator uses a cache key data structure to determine whether a particular cache unit is compatible with the current subject code sequence, on the basis of whether the current subject code sequence is identical to the previously translated subject code sequence. The translator 19 checks whether a new subject code sequence can use previously cached target code by comparing the cache key data structure of the new sequence with the cache key data structures of all of the target code sequences which have previously been generated and cached. An exact match indicates that the translation (target code) is reusable.
In one embodiment which has been implemented, the cache key data structure comprises: (1) the name or other identifier of the file containing the subject code sequence; (2) the position (i.e., offset and length) of the subject code sequence within that file; (3) the last modification time of the file; (4) the version number of the translator which generated the cached translation structures; and (5) the address at which the subject code was loaded in the subject memory. In this embodiment, the translator 19 determines compatibility by comparing each of the constituents of the respective cache keys in turn; any inequality of values indicates incompatibility.
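The five-constituent key and its component-wise comparison might look as follows; the field names are illustrative, not the patent's.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class CacheKey:
        file_name: str            # (1) file containing the subject code sequence
        offset: int               # (2) position within the file ...
        length: int               #     ... as an offset and a length
        mtime: float              # (3) last modification time of the file
        translator_version: str   # (4) translator version that produced the unit
        load_address: int         # (5) address at which the subject code was loaded

    def compatible(new_key, cached_key):
        # compare each constituent in turn; any inequality means incompatible
        return new_key == cached_key

    def find_reusable(cache, new_key):
        # cache: dict mapping CacheKey -> cached translation structures;
        # an exact key match means the cached target code is reusable
        return cache.get(new_key)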
In another or alternative embodiment, the cache key data structure 39 comprises a complete copy of all of the subject instructions represented by the cache unit 37; in this embodiment, the translator determines the compatibility of a cache unit 37 by comparing the entire subject instruction sequence of the cache unit 37 against the subject code sequence to be translated, checking that every subject instruction is identical.
In another embodiment, the cache compatibility determination made by the translator is facilitated by the use of a hash function. In this case, the cache key data structure includes a numeric hash of the subject code sequence. The translator applies a constant, repeatable hash function to the entire subject code sequence; the hash function produces a number for the sequence, referred to as a hash number. In such an embodiment, the translator determines compatibility simply by performing an arithmetic comparison of the respective hash numbers of the current subject code sequence and the previously translated one.
This hashing technique may also be used to determine the compatibility between the translator version previously used by a translator instance and the version currently in use, or, for example, the compatibility of different translator instances residing on two different processors of a more complex system. In this case, a numeric hash of the executable translator file assists the translator in determining translator version compatibility, and this hash number is stored in the cache key. In such an embodiment, the translator hash number is generated by applying the hash function to the sequence of bytes constituting the actual binary executable file of each translator version.
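A hash-based variant of the same check is sketched below, using SHA-256 purely as an example of a constant, repeatable hash function (the text does not name one).

    import hashlib

    def sequence_hash(subject_bytes):
        # constant, repeatable hash over the entire subject code sequence
        return hashlib.sha256(subject_bytes).hexdigest()

    def translator_hash(translator_executable_path):
        # hash over the bytes of the translator's binary executable file
        with open(translator_executable_path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def hash_compatible(current_subject_bytes, cached_hash):
        return sequence_hash(current_subject_bytes) == cached_hash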
According to various embodiments, the "portion of subject code" being translated is a code sequence which, as discussed further below, may comprise a basic block or a sequence of instructions larger than a basic block. Figure 10 sets out a shared code caching process in which each cache unit represents a particular code sequence.
Referring to the process set out in Figure 10, the translator 19 translates a first code sequence CS1 into target code TC1, as set out at step 121. The translator 19 then produces a cache key K1 which indexes the target code block TC1 corresponding to code sequence CS1, as described at step 123. In step 124, the translator 19 stores the target block TC1 and the associated key K1 in the cache. In step 125, the translator 19 begins to process a second code sequence CS2, first producing a cache key K2 for that sequence. In comparison step 127, the translator 19 then compares cache key K2 with the keys previously stored in the cache 29 in association with code sequences, including key K1. If cache key K1 matches cache key K2, as set out at step 129, then the target code block TC1 corresponding to cache key K1 is retrieved from the cache 29 by the translator 19 and executed. As shown at step 131, flow then proceeds to step 133, where the translator 19 begins to process the next code sequence, CS3: first generating a cache key K3 for that sequence, and then checking the index of cache keys K1, K2, ... for a match. If K1 does not match K2 at step 127, then the second code sequence is translated into target code TC2, which is subsequently cached, as shown at steps 135 and 136.
In the illustrated embodiment, each cache unit contains all of the translation structures necessary to represent a subject code sequence. A code sequence may be a basic block, as defined earlier, or a larger sequence of instructions for which code is generated. In the case of such a larger instruction sequence, all of the associated data for all of the basic blocks within the code sequence is stored in the same cache unit.
In this illustrative embodiment, the data structures stored in a cache unit include the following (an illustrative sketch is given after the list):
(a) Basic block objects: each basic block object associates a subject address with certain data. These data include:
- profiling information for that subject address (for example, an execution count);
- a pointer to the equivalent target code (if it exists);
- an indication of whether the target code pointed to is basic block target code or group block target code;
- "successor information," whose actual content depends on how the block ends:
- if the code sequence ends with an unconditional jump, the successor information points to the next basic block object to be executed;
- if the code sequence ends with a computed jump (for example, "branch to link register"), the successor information points to a successor block cache, which maps subject addresses to basic block addresses; each block that ends with a computed jump has its own successor block cache;
- if the code sequence ends with a branch, the successor information points to the basic blocks representing the next subject addresses to be executed if the branch is taken and if it is not taken.
(b) Target code: both group block and basic block target code.
(c) Group block information: retained after group block generation, to allow group blocks to grow or change over time.
(d) Block directory: a mapping from subject addresses to basic blocks; every basic block in the cache unit has an entry in the block directory.
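A hypothetical sketch of the cache unit contents listed above follows; all class and field names are invented for illustration, and the successor kinds correspond to the three block endings described.

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class SuccessorInfo:                     # content depends on how the block ends
        kind: str                            # "jump", "computed" or "branch"
        next_block: Optional["BlockObject"] = None             # unconditional jump
        successor_cache: dict = field(default_factory=dict)    # computed jump:
                                                                #   subject address -> block
        taken: Optional["BlockObject"] = None                   # branch taken
        not_taken: Optional["BlockObject"] = None               # branch not taken

    @dataclass
    class BlockObject:                       # (a) basic block object
        subject_address: int
        execution_count: int = 0             # profiling information
        target_code: Optional[bytes] = None  # equivalent target code, if any
        is_group_block_code: bool = False    # basic block vs group block target code
        successor_info: Optional[SuccessorInfo] = None

    @dataclass
    class CacheUnit:
        target_code: bytes = b""                               # (b) target code
        group_block_info: dict = field(default_factory=dict)   # (c) group block information
        block_directory: dict = field(default_factory=dict)    # (d) subject address -> BlockObject

In this sketch every reference stays inside the unit; a reference that would leave the unit would instead be represented by the special marker discussed next.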
Within the successor information, if the next basic block to be executed lies outside the current cache unit, some special marker (for example, a null pointer) is used to indicate that a particular action must take place. Here, the address of the next basic block cannot be hard-wired, because the destination block may not be available; the appropriate successor must instead be found, and the destination obtained, by searching all of the cache units available on the system. The partitioning mechanism disclosed in co-pending application ______, incorporated herein by reference, is one mechanism that can be used to divide cache units by subject address, such that all basic blocks representing code within a particular range of subject addresses are placed in the same cache unit (together with the corresponding successor block caches, target code, group block information, and block directory). However, other alternative mechanisms may also be used.
In the particular example of cache unit structure discussed here, each data structure may contain pointers only to data within the same cache unit. Referencing objects across cache units therefore requires special handling, because the destination cache unit cannot be relied upon to be available. For this reason, a group block is contained entirely within a single cache unit.
Figure 11 sets out an example of a cache unit 37. The cache unit 37 of Figure 11 notably includes one or more block translations 41 and the successor block lists 43 associated with those blocks 41. In this way each cache unit 37 is self-contained, meaning that the translation structures within a cache unit 37 do not depend on the existence of any translation structures outside that cache unit, because each cache unit can be loaded from, or unloaded to, the cache server 35 independently. A cache unit 37 contains all of the translation structures necessary to represent a particular subject code sequence, which need not include successor subject code sequences. In this description, a "subject code sequence" may comprise a number of subject instructions which are contiguous in terms of control flow, but which need not be contiguous in terms of subject addresses. In other words, a cache unit 37 contains at least one translated block (i.e., a target code sequence representing a particular subject code sequence) and all of the translation structures on which that translated block depends. For example, in an embodiment in which a cache unit 37 contains translated blocks 41A, 41B and 41C, the successor block lists 43 of those blocks are necessary translation structures, but the successor blocks 49 themselves are not.
In different embodiments, the translator 19 may define additional cache unit data structures of different scopes. For example, when the translator 19 knows that the subject program is not modified, the subject program and all of its associated libraries can be grouped into a single cache unit. The translator 19 may further establish cache units of one or more of the following types: (a) each individual subject instruction may be a separate cache unit; (b) each basic block may be a separate cache unit; (c) all blocks with the same starting subject address may be grouped into a single cache unit; (d) a cache unit may represent a discrete range of subject code addresses; (e) each subject library may be a separate cache unit; and (f) each subject application may be represented in a single cache unit containing all of the target code for that application (the executable and all of its libraries). The translator 19 may further vary the granularity of cache units according to particular translation requirements and the target environment.
As another advantageous example, the translator 19 may find that the subject operating system 20 under which the application runs has a memory region reserved for unmodifiable libraries, each of which is always loaded at the same subject address. Such a memory region is treated as a single cache unit. For example, the MacOS operating system has a reserved memory range (0x90000000-0xA0000000) which is set aside for unmodifiable shared libraries; a translator translating from the MacOS architecture is configured with a single cache unit representing the entire MacOS shared library region. Where it is appropriate for a cache unit to contain multiple subject libraries, the cache key of the cache unit includes the file modification times of all of the libraries loaded into that region. Modification of any of the subject libraries included in such a cache unit renders the translation structures contained in it unusable by future instances of the translator: if one of the constituent libraries has been modified, a subsequent translator instance must regenerate the translation structures for that cache unit (i.e., for the MacOS shared library region). The new translation structures will have a corresponding cache key reflecting the new configuration of the libraries (i.e., the new subject code contained in them).
The shared code caching techniques described thus far may be implemented in many different system designs. In the various embodiments which implement shared code caching, such as those illustrated in Figures 12-16, the shared code cache 29 allows the target code 21 and translation structures (cache units) corresponding to a particular subject program to be shared between different executions, or instances, of the translator 19. A translator "instance" is a particular execution of the translator, i.e. one translated execution of a subject program.
For example, as shown in Figure 12, the shared code cache 29 is served by a dedicated code cache server 35 which interacts with the translator 19 at the beginning and end of the translator's execution, and at any point during execution where subject code is modified (such as when a subject library is loaded). In the embodiment of Figure 12, the cache server 35 is located on the same system, or target architecture 36, as the translator 19. In other embodiments, the cache server 35 may be a subsystem of the translator.
Figure 13 illustrates an embodiment in which the cache server 35 is located on a different system 32 from the translator instance 19. In this case, the architecture of the server system 32 may differ from the target architecture 36.
Figure 14 illustrates an embodiment in which the translation system of Figure 1 cooperates with a cache server 35 running on a different system 32. In this case, the cache server 35 runs on a processor and under an operating system 33 which differ from the translator's processor 31 and operating system.
In an alternative embodiment of the shared caching technique, shown in Figure 15, the cache server 35 is a networked process which shares the translated code stored in the respective caches 29A, 29B and 29C between translator instances 19a, 19b running on different systems 63, 65, where the target architectures of the systems 63, 65 are the same. The systems 63, 65 may be, for example, a pair of networked desktop computers. A single cache server may provide caching for any number of different configurations of the translator 19, but a particular cache 29 is only shared between compatible configurations of the translator 19.
In one embodiment of the shared caching technique of Figure 14, the cache server 35 is a dedicated process which actively responds to queries from the translator. In an alternative embodiment, the cache server 35 is a passive storage system, such as a file directory or a database of cache units.
Furthermore, in the illustrative embodiment of Figure 14, the translator 19 keeps the cached translation structures in persistent storage by storing cache units as files on disk at the end of the subject program's execution, and by further maintaining an index file of all relevant cache key structures identifying the subject code sequences that have been cached. For example, in one such embodiment, the code cache comprises a directory of files in the file system of the cache server's operating system 33, where each cache unit 37 is stored as a single file within the cache directory structure. In another embodiment of the system of Figure 14, the persistent store of cached translation structures is realized by a persistent server process which "owns" the cache units and hands them out to translator instances in response to requests from the translator 19.
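A minimal file-backed version of such a cache directory might store each serialized cache unit as one file named after its cache key, with a separate index file listing the keys that are present. The helpers below sketch that layout; the directory layout, file naming scheme and serialization format are assumptions made purely for illustration.

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Store one serialized cache unit as a single file in the cache directory and
// append its key to a plain-text index file ("one unit per file" layout).
void storeCacheUnit(const fs::path& cacheDir, uint64_t cacheKey,
                    const std::vector<uint8_t>& serializedUnit) {
    fs::create_directories(cacheDir);
    fs::path unitFile = cacheDir / (std::to_string(cacheKey) + ".unit");
    std::ofstream out(unitFile, std::ios::binary);
    out.write(reinterpret_cast<const char*>(serializedUnit.data()),
              static_cast<std::streamsize>(serializedUnit.size()));

    std::ofstream index(cacheDir / "index.txt", std::ios::app);
    index << cacheKey << '\n';
}

// Retrieve a previously stored unit, returning an empty vector on a cache miss.
std::vector<uint8_t> loadCacheUnit(const fs::path& cacheDir, uint64_t cacheKey) {
    fs::path unitFile = cacheDir / (std::to_string(cacheKey) + ".unit");
    std::ifstream in(unitFile, std::ios::binary);
    if (!in) return {};
    return std::vector<uint8_t>(std::istreambuf_iterator<char>(in),
                                std::istreambuf_iterator<char>());
}
```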
Thus, in one implementation of the illustrative shared code caching technique, when translation of the subject program reaches a subject code sequence which the translator 19 has not yet translated, the translator 19 checks the cache server 35 for a compatible code cache. If a compatible code cache is found, the translator 19 loads the cache, which contains target code 21 and translation structures. The code cache implicitly contains all of the translated code 21 generated in previous translated executions of the subject program, including optimized target code such as the block groupings (chunks) described above in connection with Figures 6 and 7. This allows later executions of the translator to build on earlier executions; much of the subject code may already be translated, and may already be optimized, thereby reducing start-up time, reducing translation cost and improving performance.
Shared caching allows different instances of the translator 19, such as the instances 19a, 19b of Figure 12, to benefit from each other's results. In particular, shared caching allows translation structures created in one translator instance to be reused by another translator instance. "Translation structures" in this description means the generated target code (i.e. the translations of particular subject code sequences) and the other data which the translator 19 uses to represent, manage and execute subject code. Examples of such translation structures are those described in connection with Figure 11, where the translation structure 37 includes basic block translations and successor lists.
Shared caching allows translation results to be reused when a later translator instance executes the same subject program, or a different subject program which has subject code in common (such as a system library). Where a later translator instance runs the same subject code sequences, shared caching allows the translation structures created in an earlier translator instance to be reused. If an earlier translator instance encountered a particular subject code sequence, and a later translator instance encounters the same subject code sequence, shared caching allows the later instance to use the translation structures created by the earlier instance. In other words, shared caching allows the translation of a particular subject code sequence (i.e. a cache unit) to persist beyond the lifetime of the translator instance which created it.
For the purposes of cached translation, subject code falls into two classes: (1) executable code loaded into memory from disk, which includes the subject binary, statically linked libraries, the linker and any libraries loaded at runtime; and (2) subject code generated at runtime (self-modifying code), subject code compiled at runtime, trampoline code, or some other form of dynamically generated code. The shared caching techniques are particularly applicable to subject instruction sequences of the first class, referred to herein as "static" code. Static subject code is likely to contain subject instruction sequences that are identical across (a) multiple executions of the same application and (b) multiple applications (for example, system libraries). Cache units corresponding to static subject code are referred to herein as static cache units. By contrast, in the context of the shared caching techniques, a program whose subject code is changed at runtime is referred to as "dynamic".
In one optimization of shared caching, when a translator instance finishes execution, the translator 19 determines which parts of the subject program contained static subject code, and limits the application of the shared caching techniques to those static parts. In such an embodiment, when the translator instance finishes execution, only the translation structures corresponding to static code are cached, while the translation structures corresponding to dynamic code are discarded.
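A straightforward way to apply this optimization at instance exit is to tag each translation with whether its subject code came from a file-backed (static) region or was generated at run time, and to persist only the static ones. The sketch below is schematic; the TranslationRecord fields are invented for illustration.

```cpp
#include <vector>

// Minimal record for one translation produced during this translator instance.
struct TranslationRecord {
    bool fromStaticSubjectCode;   // true if the subject code was loaded from disk
    // ... target code, successor lists, profiling data would live here ...
};

// At the end of the instance, keep only translations of static subject code;
// translations of dynamically generated subject code are simply dropped.
std::vector<TranslationRecord> selectForCaching(const std::vector<TranslationRecord>& all) {
    std::vector<TranslationRecord> toCache;
    for (const auto& rec : all) {
        if (rec.fromStaticSubjectCode) {
            toCache.push_back(rec);
        }
    }
    return toCache;
}
```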
Cache evolution
In embodiments where the cache 29 is updated when the translator finishes execution (i.e. when the translated program terminates), the cache server 35 may be configured to compare the body of translations from the current execution with the code cache currently stored by the server 35. In this configuration, if the code cache of the current execution is "better" than the previously stored version, the server stores the cache units from the current execution for future use. The process of copying translations from a translator instance into the cache 29 (i.e. when the instance's translations are better than the server's) is referred to as "publication". In this way, the quality of the code stored in the cache 29 improves over time. This technique, and its results, may be referred to as "cache evolution".
Even if a translator instance initially obtains structures from the cache, the execution of that instance may cause new subject code sequences to be translated, and hence new translation structures to be created. When that translator instance finishes, its set of translation structures may be more complete than the corresponding set stored in the cache.
Referring to the illustrative process of Figure 16, a first translator execution is performed, step 201, and subsequently, in step 203, its cache unit C1 is cached. Subsequently, a second instance of the translator 19 performs a second program execution and generates a cache unit C2, step 205. The second translator instance 19 then compares the translation structures C2 at the end of its execution with the translation structures C1 stored in the cache 29, and determines according to some suitable criterion whether the translations just generated are better than those available in the cache 29. For example, whether one code cache is better than another may be determined on the basis of the number of subject instructions translated, the number of optimizations applied, or by running an algorithm that assesses the quality of the generated code. As illustrated in step 209, if C2 is better than C1, then the cached structures C2 are loaded into the cache 29 to replace the structures C1.
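The "better than" decision applied before step 209 can be any metric that orders two caches; one plausible choice, mentioned above, is the number of subject instructions covered, with the number of applied optimizations as a tie-breaker. The comparison below is a sketch under that assumption, not a criterion mandated by the method.

```cpp
#include <cstdint>

// Summary statistics kept for a body of translations (a candidate cache C2 or
// the currently stored cache C1).
struct CacheQuality {
    uint64_t subjectInstructionsTranslated;  // coverage of the subject program
    uint64_t optimizationsApplied;           // e.g. number of optimized block groupings
};

// Returns true if the candidate cache should be published, replacing the stored one.
bool shouldPublish(const CacheQuality& candidate, const CacheQuality& stored) {
    if (candidate.subjectInstructionsTranslated != stored.subjectInstructionsTranslated)
        return candidate.subjectInstructionsTranslated > stored.subjectInstructionsTranslated;
    return candidate.optimizationsApplied > stored.optimizationsApplied;
}
```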
In embodiments where the translator 19 does not employ cache evolution, the translator instance discards all new translation structures (i.e. all translation structures which were not originally retrieved from the cache) when execution finishes.
In alternative embodiments, the system may be configured such that translator instances publish their translation structures to the cache server 35 at selected times during execution, rather than only when execution finishes. This allows translation structures to become available to other translator instances before the translated program terminates, for example in a system in which multiple translator instances T1, T2 ... Tn execute in parallel, as in Figure 17. Selected publication times, or "cache synchronization points", may include: (1) "idle" periods during which the translated application is not processing many transactions; (2) after a threshold number of translation structures has been generated (for example, publishing whenever the translator instance has generated a certain amount of target code, such as 1 megabyte); and (3) when a new translator instance requests a translation structure which is not in the shared cache but which is known to exist in a currently running translator instance.
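The three trigger conditions listed above can be checked with a small predicate evaluated periodically by a running instance. The sketch below is illustrative only; the threshold value and field names are assumptions.

```cpp
#include <cstdint>

// State a running translator instance might consult when deciding whether it
// has reached a cache synchronization point.
struct SyncState {
    bool translatedProgramIdle;        // (1) the translated application is idle
    uint64_t bytesSincePublication;    // target code generated since the last publication
    bool structureRequestedByPeer;     // (3) another instance asked for a structure we hold
};

constexpr uint64_t kPublishThresholdBytes = 1u << 20;  // e.g. 1 megabyte of target code

// Returns true when any of the publication triggers described above has fired.
bool atSyncPoint(const SyncState& s) {
    return s.translatedProgramIdle ||
           s.bytesSincePublication >= kPublishThresholdBytes ||   // (2) threshold reached
           s.structureRequestedByPeer;
}
```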
Parallel translation
In various embodiments of the cache server 35, the server 35 may be further configured to optimize the code cache 29 during idle periods, when it is not busy sending code caches to, or accepting code caches from, translator instances. In such embodiments, the cache server 35 performs one or more of the following optimizations to transform the code cache: (a) reorganizing the cache directory structure so that searches for particular cache units are more efficient; (b) deleting translations which have been superseded by later, more optimized translations of the same subject code; (c) rearranging the code cache so that frequently requested cache units are placed adjacent to one another, thereby improving the hardware cache performance of the cache server 35 (i.e. reducing the number of hardware cache misses incurred by the cache server 35 in the hardware cache of the server system 32); (d) applying expensive optimizations to the cached translations (offline optimization by the cache server incurs no translation cost or performance penalty in a translator instance); and (e) translating subject code which has not yet been translated by a translator instance but which a translator instance is expected to encounter (i.e. offline anticipatory translation).
Shared memory
Another optimization of the shared caching technique is to use shared memory to access cached translation structures which are essentially read-only (i.e. whose contents rarely, if ever, change). A significant portion of the translation structures stored in a cache unit, such as the generated target code sequences, may be read-only for the entire lifetime of a translator instance: once generated, a target code sequence is rarely discarded or changed (even though it may later be subsumed by an optimized translation). Other cache unit components, such as execution counts and branch destination profiling data, are expected to change frequently, because they are updated regularly while the translator runs. Where multiple translator instances use the same cache unit structures at the same time, as in Figure 17 for example, the read-only portions may be accessed by those translators as shared memory. This optimization can reduce the total physical memory used by multiple translations running on a single target system.
As an illustrative application of shared memory, the translator 19 loads the code cache file into a shared memory area. The cache file is preferably shared using a copy-on-write policy. Under a copy-on-write policy, the cache file is initially shared by all running translator processes; "copy-on-write" means that when a particular translator instance modifies a cached structure in any way (for example, incrementing a block's execution count), the modified portion of the cache becomes private to that particular execution from that point on, so that the memory region containing the modified area is no longer shared.
In an illustrative application, the cache 29 contains profiling data structures, which are updated frequently while the translator 19 runs, and other data structures (such as target code) which remain unchanged once generated. The operating system on which the translator 19 runs provides memory pages, of for example 4kB, as the unit of sharing. A large cache (1MB) can be shared by multiple processes, and any modification of the cache causes the page containing the modification to become private to the modifying process (the page is copied and only the private copy is modified). This allows the bulk of the cache 29 to remain shared, while the mutable profiling data becomes private to each process. Preferably, the cache 29 is deliberately laid out so that a limited range of pages is dedicated to the mutable profiling data, rather than spreading the profiling data throughout the cache 29. This reduces the amount of memory that becomes private through local modifications.
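On a POSIX target, the page-granular copy-on-write sharing described here falls out of mapping the cache file with MAP_PRIVATE: all processes share the same physical pages until one of them writes, at which point only the touched page is duplicated for that process. The fragment below is a minimal sketch of that mechanism, not the translator's actual cache loader.

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a cache file copy-on-write: the mapping shares physical pages across
// processes, and any page a process writes to (for example to bump an
// execution count) is silently copied and becomes private to that process.
void* mapCacheCopyOnWrite(const char* path, size_t* lengthOut) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }

    void* base = mmap(nullptr, static_cast<size_t>(st.st_size),
                      PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);                      // the mapping keeps its own reference
    if (base == MAP_FAILED) return nullptr;

    *lengthOut = static_cast<size_t>(st.st_size);
    return base;
}
```

Grouping the mutable profiling data into a dedicated range of pages, as the paragraph above suggests, keeps the number of pages that become private in this way small.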
Distributed caching
In the embodiments described below, the shared caching technique is implemented as a caching system comprising one or more translator instances and one or more server processes which interact with each other. In addition to the foregoing embodiments, in other embodiments the cache servers are organized into a distributed caching system according to any of the several methods known in the fields of hierarchical caching and distributed caching, with the corresponding known lookup and storage techniques.
In a caching system comprising two or more caches, such as that illustrated in Figure 15, different techniques may be used in the organization of each cache. These techniques include scoped caches, ranged caches and cache policies. These techniques may be used in combination.
In embodiments of a caching system which uses scoped caches, each cache has a different cache scope. A scoped cache is accessible only to a particular set of translator instances. The cache scope of a particular cache defines which translator instances may access that cache. For example, in one embodiment each cache has either a "private" or a "global" cache scope. A private cache can only be accessed by the translator instance which created it, and its contents are not retained after that translator instance exits. A global cache can be accessed by any translator instance, meaning that more than one translator can retrieve cache units from, or store cache units into, the cache. The contents of a global cache may persist after a particular translator instance finishes.
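Scope checking reduces to a small predicate answering "may this translator instance see this cache?". The sketch below models only the private/global scopes introduced here; the enum and field names are illustrative assumptions.

```cpp
#include <cstdint>

enum class CacheScope { Private, Global };

struct CacheDescriptor {
    CacheScope scope;
    uint64_t creatorInstanceId;   // instance that created the cache (relevant for Private)
};

// A private cache is visible only to the instance that created it; a global
// cache is visible to every translator instance.
bool mayAccess(const CacheDescriptor& cache, uint64_t instanceId) {
    switch (cache.scope) {
        case CacheScope::Private: return cache.creatorInstanceId == instanceId;
        case CacheScope::Global:  return true;
    }
    return false;   // unreachable; keeps compilers happy
}
```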
Embodiments of a scoped caching system may include other possible cache scope values, including (a) application, (b) application type, (c) application vendor, or others. A cache with an "application-specific" cache scope is only accessed by translator instances executing the same subject application. A cache with an "application type" cache scope is only accessed by translator instances executing applications of the same type (for example interactive, server, graphical). A cache with an "application vendor" cache scope is only accessed by translator instances executing applications created by the same vendor.
In embodiments of a caching system which uses ranged caches, each cache is associated with a range of subject addresses, such that the cache only stores cache units containing translations whose starting subject addresses fall within that range.
In a caching system comprising two or more caches, different cache policies may be implemented, which alter the structure of, and the constraints on, how each cache is used. A cache policy comprises an insertion policy and a corresponding search policy. As shown in Figure 17, when the caching system stores a cache unit 37, the insertion policy 51 defines which of the caches A, B the cache unit is stored in. As shown in Figure 18, when the caching system attempts to retrieve a cache unit 37, the search policy 53 determines the order in which the multiple caches A, B are queried.
For example, Table 3 illustrates three example cache policies in terms of their insertion policies, their search policies, and their effect on the caching system.
Policy 1
Insertion policy: Add all translation structures to the shared cache A until it reaches a certain size, then use the private cache B.
Search policy: Largest cache first.
Effect: This policy imposes a hard limit on the size of the shared cache.

Policy 2
Insertion policy: If no global cache exists, create the shared cache A and store all translation structures into A. If the shared cache A already exists, create the private cache B and store all new translation structures into B.
Search policy: Largest cache first.
Effect: This effectively lets the first translator instance add all of its translation structures to the shared cache. This is advantageous, for example, when the cached cache units are shared between applications which use them in a similar manner (i.e. with similar control flow), such as the same application or applications from the same vendor.

Policy 3
Insertion policy: Store all translations which have been optimized for a particular control flow (e.g. block groupings) into the more narrowly scoped cache A; store all other translations into the more widely scoped cache B. That is, the scope of cache B is wider than that of cache A.
Search policy: Narrowest cache scope first.
Effect: This allows translator instances to benefit from the general optimizations performed by other instances, while still allowing each instance to create its own optimized code.

Table 3: Cache policies
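Policy 1 of Table 3, for instance, amounts to an insertion routine that fills the shared cache A until a size cap is hit and then falls back to the private cache B, paired with a largest-cache-first search order. The sketch below shows one way such a policy could look; the Cache stand-in type and the size-limit handling are assumptions for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Minimal stand-in for a cache: just enough to express policy 1 from Table 3.
struct Cache {
    std::size_t currentSize = 0;
    std::size_t sizeLimit = 0;        // limit for the shared cache A
    bool shared = false;
};

// Insertion policy 1: add to the shared cache A until it reaches its size limit,
// then use the private cache B.
Cache& chooseInsertionCache(Cache& sharedA, Cache& privateB, std::size_t unitSize) {
    if (sharedA.currentSize + unitSize <= sharedA.sizeLimit) return sharedA;
    return privateB;
}

// Search policy "largest cache first": query caches in descending size order.
std::vector<Cache*> searchOrder(std::vector<Cache*> caches) {
    std::sort(caches.begin(), caches.end(),
              [](const Cache* a, const Cache* b) { return a->currentSize > b->currentSize; });
    return caches;
}
```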
Where the search policy produces only a partial ordering of the caches, additional heuristics may be used to create a total ordering. In other words, the search policy may produce a group of caches which have equal priority relative to other caches; the additional heuristics act as a tie-breaker within that group. For example, the caching system may use the following heuristics: (1) largest first, (2) most recent hit, or others. Under the "largest first" heuristic, the largest cache is queried first, then the second largest, and so on. Under the "most recent hit" heuristic, the cache in the group which returned the most recent cache hit is queried first, followed by the cache with the next most recent hit, and so on.
In a caching system with two or more caches, a query for a particular cache key may return multiple hits, each from a cache containing a translation structure matching the cache key. Other factors may be considered when choosing between multiple cache unit hits. These factors effectively interact with the cache policies to determine the structure and performance of the caching system. Such factors may include (a) the set of all possible cache scope values (i.e. how many distinct scope levels exist), (b) the memory or disk space constraints of the translator instance or the cache server, (c) the subject application being executed, or other factors.
Aggressive optimization
A dynamic binary translator can perform a wide range of optimizations, and the translator can be configured to apply particular optimizations more aggressively or more frequently, at the price of additional translation cost. In embodiments of the translator which do not use the shared caching technique, the translator must balance translation cost against execution cost so as to optimize a particular translator instance (i.e. a single execution).
In embodiments of the translator which use the shared caching technique, translation cost can be amortized over multiple translator instances rather than a single translator instance. As such, aggressive optimization of the translated code becomes more attractive when a cache is available. Although the initial translation cost under an aggressive optimization scheme is higher, the existence of multiple subsequent translated executions amortizes that cost, because every subsequent translation enjoys the benefit of the earlier optimization effort. The cost of aggressive translation becomes a "one-off" hit incurred by the first translator instance (for a particular subject code sequence or subject program), while the benefit of the optimized translations is subsequently enjoyed by all later instances of the translator which use that cache.
Hence, in a translator which uses the shared caching technique, there are situations in which more costly optimizations are applied during the first execution of a particular subject program. This may result in slightly lower performance for that first execution, but the translation structures obtained will yield better performance for future executions of the application; those future executions incur no translation cost, because they can use the cached translation structures immediately at start-up. One refinement of the aggressive optimization strategy is that a future translator instance which encounters untranslated subject code in the same subject program may choose not to apply aggressive optimization initially, so as to reduce the marginal translation cost (and the resulting latency) when new code paths are detected.
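One concrete reading of this strategy: aggressive optimization is worth paying for on a cold cache (a first execution), while an instance that is already running mostly from cached translations may prefer cheap baseline translation for the few new code paths it meets. The decision function below is a sketch of that trade-off, with an invented threshold, not a rule stated by the method.

```cpp
// Choose how hard to optimize a newly encountered subject code sequence.
enum class OptLevel { Baseline, Aggressive };

// coldStart:      no compatible cache was found at start-up (first execution).
// cachedFraction: fraction of subject code executed so far that came from the cache.
OptLevel chooseOptimizationLevel(bool coldStart, double cachedFraction) {
    if (coldStart) {
        // First execution: pay the higher translation cost once so that every
        // later instance can reuse the optimized translations from the cache.
        return OptLevel::Aggressive;
    }
    // Mostly running from the cache: keep the marginal cost (and latency) of
    // translating the occasional new code path low.
    return cachedFraction > 0.9 ? OptLevel::Baseline : OptLevel::Aggressive;
}
```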
While certain preferred embodiments have been shown and described, it will be appreciated by those skilled in the art that various changes and modifications may be made thereto without departing from the scope of the invention, as defined in the appended claims.
Attention is directed to all papers and documents which are filed concurrently with or prior to this specification in connection with this application and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference.
All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and/or all of the steps of any method or process so disclosed, may be combined in any combination, except combinations where at least some of such features and/or steps are mutually exclusive.
Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.
The invention is not restricted to the details of the foregoing embodiment(s). The invention extends to any novel one, or any novel combination, of the features disclosed in this specification (including any accompanying claims, abstract and drawings), or to any novel one, or any novel combination, of the steps of any method or process so disclosed.

Claims (33)

1. A shared code caching method for program code conversion, comprising:
(a) providing a first translator instance (19A), wherein the first translator instance translates a first subject code portion (CS1) into a target code portion (TC1);
(b) caching the target code portion (TC1); and
(c) providing a second translator instance (19B), wherein the second translator instance translates a second subject code portion (CS2) into target code, including retrieving the cached target code portion (TC1) upon detecting compatibility between the cached target code portion (TC1) and the second subject code portion (CS2).
2. The method of claim 1, wherein the first subject code portion (CS1) is part of a first program and the second subject code portion (CS2) is part of a second program.
3. The method of claim 2, wherein the target code portion (TC1) is cached at the end of translation of the first program.
4. The method of claim 1, wherein the first translator instance (19A) translates a first subject program comprising the first subject code portion (CS1) into the target code portion (TC1), and translation of a second subject program by the second translator instance (19B) comprises reusing the target code portion (TC1) generated by the first translator instance (19A).
5. The method of claim 1, further comprising:
in step (b), copying the target code portion (TC1) from the first translator instance (19A) to a shared code cache device (29); and
in step (c), retrieving the target code portion (TC1) from the shared code cache device (29) for reuse by the second translator instance (19B).
6. The method of claim 5, wherein step (b) further comprises:
selectively identifying one or more static target code portions among a plurality of target code portions generated by the first translator instance (19A), wherein the static target code portions are derived from static subject code; and
caching the identified static target code portions.
7. The method of claim 6, wherein step (b) further comprises:
identifying one or more dynamic target code portions among the plurality of target code portions generated by the first translator instance (19A), wherein the dynamic target code portions are derived from dynamically generated subject code; and
discarding the dynamic target code portions.
8. The method of claim 6 or 7, wherein step (c) comprises:
performing the identifying and/or discarding steps after execution of the first translator instance (19A) has finished.
9. The method of claim 1, wherein:
step (b) comprises providing a shared code cache device (29) for caching the target code portion (TC1); and
step (c) comprises selectively replacing the cached target code portion (TC1) in the shared code cache device (29), wherein the second translator instance (19B) provides an updated translation of the target code portion (TC1).
10. The method of claim 9, wherein:
step (b) further comprises publishing the target code portion (TC1) from the first translator instance (19A) to the shared code cache device (29) during execution of the first translator instance (19A); and
step (c) further comprises retrieving the target code portion (TC1) from the shared code cache device (29) for reuse by the second translator instance (19B);
wherein the first translator instance (19A) and the second translator instance (19B) run concurrently.
11. The method of claim 10, wherein the publishing step comprises:
publishing at a cache synchronization point having a predetermined trigger condition.
12. The method of claim 11, wherein the predetermined trigger condition is any one or more of:
an idle period during which the first translator instance (19A) is inactive;
after a threshold number of translation structures has been generated; and
upon request by the second translator instance (19B).
13. The method of claim 1, wherein step (b) comprises:
providing a shared code cache device (29) for caching the target code portion (TC1); and
selectively performing optimization of the shared code cache device (29).
14. The method of claim 13, wherein the optimization comprises any one or more of the following operations:
reorganizing the cache directory structure of the shared code cache device (29) so that searches for target code portions are more efficient;
deleting translations which have been superseded by later, more optimized translations of the same subject code;
rearranging the shared code cache device (29) so that frequently requested target code portions are placed adjacent to one another;
performing offline optimization of the cached translations; and
performing offline anticipatory translation, so as to translate subject code which has not yet been translated by a translator instance but which a translator instance is expected to encounter.
15. The method of claim 1, wherein step (b) comprises:
loading the target code portion (TC1) into a shared code cache device (29) in a memory portion shared between at least the first and second translator instances (19A, 19B); and
when the second translator instance (19B) modifies at least a part of the shared code cache device (29), copying said at least a part of the shared code cache device (29) into a private memory portion associated with the second translator instance (19B).
16. The method of claim 1, wherein step (b) comprises:
providing a shared code cache device (29) for caching the target code portion (TC1); and
distributing the shared code cache device (29) across two or more caches.
17. The method of claim 16, wherein the distributing step comprises providing scoped caches, ranged caches, or cache policies.
18. The method of claim 1, wherein:
step (b) comprises optimizing the translation in the first translator instance (19A) to provide a first optimized target code portion (TC1); and
step (c) comprises reusing the optimized target code portion (TC1) in the second translator instance (19B).
19. The method of claim 18, wherein step (c) comprises:
performing a second optimized translation in the second translator instance (19B) when translating a second subject code portion (CS2) for which no target code portion (TC1) has been cached.
20. The method of claim 1, further comprising a step (d) of executing the target code.
21. The method of claim 1, wherein translations of self-modifying code are not cached.
22. The method of claim 1, wherein the cached target code portion (TC1) comprises a translation structure comprising a basic block.
23. The method of claim 1, wherein the cached target code portion (TC1) comprises one or more block translations (41) and their respective successor lists (43).
24. The method of claim 1, wherein the target code portion (TC1) is converted into a single cache unit (37) comprising a subject program and its associated libraries.
25. The method of claim 1, wherein the cached target code portion (TC1) comprises a single instruction.
26. The method of claim 1, wherein the cached target code portion (TC1) comprises all code blocks corresponding to the same starting subject address.
27. The method of claim 1, wherein the target code portion (TC1) comprises a cache unit representing a discrete range of subject addresses.
28. The method of claim 1, wherein the target code portion (TC1) comprises a subject library.
29. The method of claim 1, wherein compatibility between a cached translation and the subject code to be translated is determined by a cache key comparison.
30. The method of claim 29, wherein the cache key comparison compares byte sequences encoding the respective subject code instruction sequences.
31. The method of claim 29, wherein the cache key comparison compares hashes of the respective subject code instruction sequences.
32. The method of claim 29, wherein the cache key comparison compares the executable file name; the offset and length of the subject code sequence; the last modification time of the file; the version number of the translator; and the subject memory address of the subject code sequence.
33. The method of any one of claims 29 to 32, wherein compatibility is determined by comparing a cache key data structure corresponding to the second subject code portion (CS2) with a plurality of second data structures, wherein each second data structure corresponds to a different set of cached target code instructions.
CNB200480020101XA 2003-07-15 2004-07-13 Shared code caching method and apparatus for program code conversion Active CN100458687C (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
GBGB0316532.1A GB0316532D0 (en) 2003-07-15 2003-07-15 Method and apparatus for partitioning code in program code conversion
GB0316532.1 2003-07-15
GB0328119.3 2003-12-04

Publications (2)

Publication Number Publication Date
CN1823322A CN1823322A (en) 2006-08-23
CN100458687C true CN100458687C (en) 2009-02-04

Family

ID=27763853

Family Applications (2)

Application Number Title Priority Date Filing Date
CNB2004800232770A Active CN100362475C (en) 2003-07-15 2004-07-13 Partitioning code in program code conversion to account for self-modifying code
CNB200480020101XA Active CN100458687C (en) 2003-07-15 2004-07-13 Shared code caching method and apparatus for program code conversion

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CNB2004800232770A Active CN100362475C (en) 2003-07-15 2004-07-13 Partitioning code in program code conversion to account for self-modifying code

Country Status (5)

Country Link
CN (2) CN100362475C (en)
GB (3) GB0316532D0 (en)
HK (2) HK1068698A1 (en)
IL (1) IL172830A0 (en)
TW (2) TWI362614B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102934082A (en) * 2010-06-14 2013-02-13 英特尔公司 Register mapping techniques for efficient dynamic binary translation

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070294675A1 (en) * 2006-06-20 2007-12-20 Transitive Limited Method and apparatus for handling exceptions during binding to native code
GB2442497B (en) * 2006-10-02 2010-03-31 Transitive Ltd Method and apparatus for administering a process filesystem with respect to program code conversion
US9015727B2 (en) 2008-04-02 2015-04-21 Qualcomm Incorporated Sharing operating system sub-processes across tasks
CN101458630B (en) * 2008-12-30 2011-07-27 中国科学院软件研究所 Self-modifying code identification method based on hardware emulator
US8069339B2 (en) * 2009-05-20 2011-11-29 Via Technologies, Inc. Microprocessor with microinstruction-specifiable non-architectural condition code flag register
US8578357B2 (en) * 2009-12-21 2013-11-05 Intel Corporation Endian conversion tool
CN102043659A (en) * 2010-12-08 2011-05-04 上海交通大学 Compiling device for eliminating memory access conflict and implementation method thereof
US10146545B2 (en) 2012-03-13 2018-12-04 Nvidia Corporation Translation address cache for a microprocessor
US9880846B2 (en) 2012-04-11 2018-01-30 Nvidia Corporation Improving hit rate of code translation redirection table with replacement strategy based on usage history table of evicted entries
US10241810B2 (en) 2012-05-18 2019-03-26 Nvidia Corporation Instruction-optimizing processor with branch-count table in hardware
US20140189310A1 (en) 2012-12-27 2014-07-03 Nvidia Corporation Fault detection in instruction translations
US10108424B2 (en) 2013-03-14 2018-10-23 Nvidia Corporation Profiling code portions to generate translations
US9684607B2 (en) * 2015-02-25 2017-06-20 Microsoft Technology Licensing, Llc Automatic recovery of application cache warmth
CN105700932B (en) * 2014-11-25 2019-02-05 财团法人资讯工业策进会 For the variable inference system and method for software program
CN104375879B (en) * 2014-11-26 2018-02-09 康烁 Based on the binary translation method and device for performing tree depth
CN105893031B (en) * 2016-03-28 2019-12-24 广州华多网络科技有限公司 Cache operation implementation method, service layer method calling method and device
US20180210734A1 (en) * 2017-01-26 2018-07-26 Alibaba Group Holding Limited Methods and apparatus for processing self-modifying codes
US10613844B2 (en) 2017-11-10 2020-04-07 International Business Machines Corporation Using comments of a program to provide optimizations
CN107902507B (en) * 2017-11-11 2021-05-04 林光琴 Control software field debugging system and debugging method
US11442740B2 (en) * 2020-09-29 2022-09-13 Rockwell Automation Technologies, Inc. Supporting instruction set architecture components across releases
CN112416338A (en) * 2020-11-26 2021-02-26 上海睿成软件有限公司 Code warehouse system based on label
CN117348889B (en) * 2023-12-05 2024-02-02 飞腾信息技术有限公司 Code translation processing method, system, computer system and computer equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1039374A2 (en) * 1999-03-24 2000-09-27 International Computers Ltd. Instruction execution mechanism
CN1308287A (en) * 2000-01-27 2001-08-15 国际商业机器公司 Instantly selected soft document sharing among computer equipments of different types
US20010049818A1 (en) * 2000-02-09 2001-12-06 Sanjeev Banerjia Partitioned code cache organization to exploit program locallity
US20020059268A1 (en) * 1999-02-17 2002-05-16 Babaian Boris A. Method for fast execution of translated binary code utilizing database cache for low-level code correspondence
US6397242B1 (en) * 1998-05-15 2002-05-28 Vmware, Inc. Virtualization system including a virtual machine monitor for a computer with a segmented architecture

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5313614A (en) * 1988-12-06 1994-05-17 At&T Bell Laboratories Method and apparatus for direct conversion of programs in object code form between different hardware architecture computer systems
US5619665A (en) * 1995-04-13 1997-04-08 International Business Machines Corporation Method and apparatus for the transparent emulation of an existing instruction-set architecture by an arbitrary underlying instruction-set architecture
US5875318A (en) * 1996-04-12 1999-02-23 International Business Machines Corporation Apparatus and method of minimizing performance degradation of an instruction set translator due to self-modifying code
US6711667B1 (en) * 1996-06-28 2004-03-23 Legerity, Inc. Microprocessor configured to translate instructions from one instruction set to another, and to store the translated instructions
US6112280A (en) * 1998-01-06 2000-08-29 Hewlett-Packard Company Method and apparatus for distinct instruction pointer storage in a partitioned cache memory
US6205545B1 (en) * 1998-04-30 2001-03-20 Hewlett-Packard Company Method and apparatus for using static branch predictions hints with dynamically translated code traces to improve performance
US6339822B1 (en) * 1998-10-02 2002-01-15 Advanced Micro Devices, Inc. Using padded instructions in a block-oriented cache
US6529862B1 (en) * 1999-06-30 2003-03-04 Bull Hn Information Systems Inc. Method and apparatus for dynamic management of translated code blocks in dynamic object code translation
US6516295B1 (en) * 1999-06-30 2003-02-04 Bull Hn Information Systems Inc. Method and apparatus for emulating self-modifying code
US6615300B1 (en) * 2000-06-19 2003-09-02 Transmeta Corporation Fast look-up of indirect branch destination in a dynamic translation system
US6980946B2 (en) * 2001-03-15 2005-12-27 Microsoft Corporation Method for hybrid processing of software instructions of an emulated computer system
US20030093775A1 (en) * 2001-11-14 2003-05-15 Ronald Hilton Processing of self-modifying code under emulation
GB2393274B (en) * 2002-09-20 2006-03-15 Advanced Risc Mach Ltd Data processing system having an external instruction set and an internal instruction set
GB2400937B (en) * 2003-04-22 2005-05-18 Transitive Ltd Method and apparatus for performing interpreter optimizations during program code conversion

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6397242B1 (en) * 1998-05-15 2002-05-28 Vmware, Inc. Virtualization system including a virtual machine monitor for a computer with a segmented architecture
US20020059268A1 (en) * 1999-02-17 2002-05-16 Babaian Boris A. Method for fast execution of translated binary code utilizing database cache for low-level code correspondence
EP1039374A2 (en) * 1999-03-24 2000-09-27 International Computers Ltd. Instruction execution mechanism
CN1308287A (en) * 2000-01-27 2001-08-15 国际商业机器公司 Instantly selected soft document sharing among computer equipments of different types
US20010049818A1 (en) * 2000-02-09 2001-12-06 Sanjeev Banerjia Partitioned code cache organization to exploit program locallity

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102934082A (en) * 2010-06-14 2013-02-13 英特尔公司 Register mapping techniques for efficient dynamic binary translation
CN102934082B (en) * 2010-06-14 2017-06-09 英特尔公司 For the methods, devices and systems of binary translation

Also Published As

Publication number Publication date
CN1823322A (en) 2006-08-23
GB2404043A (en) 2005-01-19
HK1068699A1 (en) 2005-04-29
TW200515280A (en) 2005-05-01
CN1836210A (en) 2006-09-20
GB0316532D0 (en) 2003-08-20
HK1068698A1 (en) 2005-04-29
TWI362614B (en) 2012-04-21
GB2404044B (en) 2006-07-26
GB2404044A (en) 2005-01-19
GB0328119D0 (en) 2004-01-07
GB2404043B (en) 2006-04-12
TW200516497A (en) 2005-05-16
TWI365406B (en) 2012-06-01
GB0328121D0 (en) 2004-01-07
CN100362475C (en) 2008-01-16
IL172830A0 (en) 2006-06-11

Similar Documents

Publication Publication Date Title
CN100458687C (en) Shared code caching method and apparatus for program code conversion
CN1802632B (en) Method and apparatus for performing interpreter optimizations during program code conversion
Tang et al. XIndex: a scalable learned index for multicore data storage
Kumar et al. A review on big data based parallel and distributed approaches of pattern mining
US7805710B2 (en) Shared code caching for program code conversion
Taylor An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics
Kennedy et al. Automatic data layout for distributed-memory machines
Brown et al. Have abstraction and eat performance, too: Optimized heterogeneous computing with parallel patterns
US4571678A (en) Register allocation and spilling via graph coloring
Turcu et al. Automated data partitioning for highly scalable and strongly consistent transactions
WO2017091425A1 (en) Expression tree interning
JP2007531075A5 (en)
US20090024586A1 (en) System and method for parallel graph search utilizing parallel structured duplicate detection
CN108475266B (en) Matching fixes to remove matching documents
Crotty et al. Tupleware: Distributed Machine Learning on Small Clusters.
Crotty et al. Tupleware: Redefining modern analytics
Dujardin et al. Fast algorithms for compressed multimethod dispatch table generation
Dash et al. Integrating caching and prefetching mechanisms in a distributed transactional memory
Bruno et al. Compiler-assisted object inlining with value fields
Shmeis et al. A rewrite-based optimizer for spark
Babur et al. Towards Distributed Model Analytics with Apache Spark.
Aksenov et al. Parallel-Batched Interpolation Search Tree
US7552137B2 (en) Method for generating a choose tree for a range partitioned database table
Fuentes et al. SIMD-node Transformations for Non-blocking Data Structures
Myers et al. Compiling queries for high-performance computing

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: INTERNATIONAL BUSINESS MACHINES CORP.

Free format text: FORMER OWNER: IBM UNITED KINGDOM LTD.

Effective date: 20090731

Owner name: IBM UNITED KINGDOM LTD.

Free format text: FORMER OWNER: TRANSITIVE LTD.

Effective date: 20090731

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20090731

Address after: New York, USA

Patentee after: International Business Machines Corp.

Address before: Hampshire, England

Patentee before: IBM UK Ltd.

Effective date of registration: 20090731

Address after: Hampshire, England

Patentee after: IBM UK Ltd.

Address before: London, England

Patentee before: Transitive Ltd.