Prediction mechanism for subroutine returns in binary translation sub-systems of computers
09836292 · 2017-12-05
Assignee
Inventors
CPC classification
International classification
Abstract
A sequence of input language (IL) instructions of a guest system is converted, for example by binary translation, into a corresponding sequence of output language (OL) instructions of a host system, which executes the OL instructions. In order to determine the return address after any IL call to a subroutine at a target entry address P, the corresponding OL return address is stored in an array at a location determined by an index calculated as a function of P. After completion of execution of the OL translation of the IL subroutine, execution is transferred to the address stored in the array at the location where the OL return address was previously stored. A confirm instruction block is included in each OL call site to determine whether the transfer was to the correct or incorrect call site, and a back-up routine is included to handle the cases of incorrect call sites.
Claims
1. A system comprising: one or more processors; a host configured to execute instructions in an output language (OL); a guest communicatively coupled to the host, the guest configured to issue a sequence of instructions in an input language (IL); and a binary translator configured to convert the sequence of IL instructions of the guest into a corresponding sequence of OL instructions of the host, the binary translator comprising computer-executable instructions directing the one or more processors to: calculate a first index by evaluating a function with P as an argument, where P is a procedure entry address of a called IL subroutine that is to be translated into an OL subroutine; store a corresponding OL return address in a return target cache at a location indicated by the first index; execute an OL subroutine translation of the called IL subroutine; and upon completion of execution of the OL subroutine translation: in a launch block of instructions, request to retrieve an OL target address from the return target cache at a location indicated by a second index; generate a fault when the launch block is unbound, wherein the launch block is unbound when the first index is different from the second index; scan the return target cache for a third index that points to a confirm block with a predicted IL return address that matches an IL return address from a stack for the called IL subroutine; and replace the second index with the third index.
2. The system of claim 1, wherein the predicted IL return address is Rpred, and wherein the computer-executable instructions further direct one or more processors to: store the IL return address as a call site IL return address Rcall on the stack for the called IL subroutine; determine whether the predicted IL return address Rpred is the same as an actual IL return address Ractual fetched from the stack and, if it is not, transfer execution to a back-up OL return address recovery module; and in the back-up OL return address recovery module, establish the OL return address using a predetermined, secondary address recovery routine.
3. The system of claim 2, wherein the return target cache is an array having a plurality of elements, and wherein the computer-executable instructions further direct one or more processors to initialize the return target cache by storing in each element a beginning address of the back-up return address recovery module.
4. The system of claim 2, wherein a call to the IL subroutine is made from one of a plurality of IL call sites, and wherein the computer-executable instructions further direct one or more processors to: translate each IL call site into a corresponding OL call site; generate a confirm block of instructions corresponding to each OL call site; and upon execution of any confirm block of instructions: compare the actual IL return address Ractual with the predicted IL return address Rpred; if Ractual is equal to Rpred, continue execution of the OL instructions following the OL call site; and if Ractual is not equal to Rpred, transfer execution to the back-up return address recovery module.
5. The system of claim 4, wherein only a single scratch register is used for the launch and confirm blocks of instructions.
6. The system of claim 1, wherein the computer-executable instructions further direct one or more processors to bind a translation of a return within the OL subroutine translation to the third index in the return target cache.
7. The system of claim 1, wherein the scan is performed a predetermined number of times.
8. The system of claim 7, wherein, after performing the scan the predetermined number of times, the OL return address is routed to a miss handler.
9. A method for converting a sequence of input language (IL) instructions into a corresponding sequence of output language (OL) instructions, the method comprising: calculating a first index by evaluating a function with P as an argument, where P is a procedure entry address of a called IL subroutine that is to be translated into an OL subroutine; storing a corresponding OL return address in a return target cache at a location indicated by the first index; executing an OL subroutine translation of the called IL subroutine; and upon completion of execution of the OL subroutine translation: in a launch block of instructions, requesting to retrieve an OL target address from the return target cache at a location indicated by a second index; generating a fault when the launch block is unbound, wherein the launch block is unbound when the first index is different from the second index; scanning the return target cache for a third index that points to a confirm block with a predicted IL return address that matches an IL return address from a stack for the called IL subroutine; and replacing the second index with the third index.
10. The method of claim 9, wherein the predicted IL return address is Rpred, and the method further comprising: storing the IL return address as a call site IL return address Rcall on the stack for the called IL subroutine; determining whether the predicted IL return address Rpred is the same as an actual IL return address Ractual fetched from the stack and, if it is not, transferring execution to a back-up OL return address recovery module; and in the back-up OL return address recovery module, establishing the OL return address using a predetermined, secondary address recovery routine.
11. The method of claim 10, wherein the return target cache is an array having a plurality of elements, and wherein the method further comprises initializing the return target cache by storing in each element a beginning address of the back-up return address recovery module.
12. The method of claim 10, wherein a call to the IL subroutine is made from one of a plurality of IL call sites, and wherein the method further comprises: translating each IL call site into a corresponding OL call site; generating a confirm block of instructions corresponding to each OL call site; and upon execution of any confirm block of instructions: comparing the actual IL return address Ractual with the predicted IL return address Rpred; if Ractual is equal to Rpred, continuing execution of the OL instructions following the OL call site; and if Ractual is not equal to Rpred, transferring execution to the back-up return address recovery module.
13. The method of claim 12, wherein only a single scratch register is used for the launch and confirm blocks of instructions.
14. The method of claim 9, further comprising binding a translation of a return within the OL subroutine translation to the third index in the return target cache.
15. One or more non-transitory computer-readable media comprising computer-executable instructions for converting a sequence of input language (IL) instructions into a corresponding sequence of output language (OL) instructions, the computer-executable instructions directing one or more processors to: calculate a first index by evaluating a function with P as an argument, where P is a procedure entry address of a called IL subroutine that is to be translated into an OL subroutine; store a corresponding OL return address in a return target cache at a location indicated by the first index; execute an OL subroutine translation of the called IL subroutine; and upon completion of execution of the OL subroutine translation: in a launch block of instructions, request to retrieve an OL target address from the return target cache at a location indicated by a second index; generate a fault when the launch block is unbound, wherein the launch block is unbound when the first index is different from the second index; scan the return target cache for a third index that points to a confirm block with a predicted IL return address that matches an IL return address from a stack for the called IL subroutine; and replace the second index with the third index.
16. The non-transitory computer-readable media of claim 15, wherein the predicted IL return address is Rpred, and wherein the computer-executable instructions further direct the one or more processors to: store the IL return address as a call site IL return address Rcall on the stack for the called IL subroutine; determine whether the predicted IL return address Rpred is the same as an actual IL return address Ractual fetched from the stack and, if it is not, transfer execution to a back-up OL return address recovery module; and in the back-up OL return address recovery module, establish the OL return address using a predetermined, secondary address recovery routine.
17. The non-transitory computer-readable media of claim 16, wherein the return target cache is an array having a plurality of elements, and wherein the computer-executable instructions further direct the one or more processors to initialize the return target cache by storing in each element a beginning address of the back-up return address recovery module.
18. The non-transitory computer-readable media of claim 16, wherein a call to the IL subroutine is made from one of a plurality of IL call sites, and wherein the computer-executable instructions further direct the one or more processors to: translate each IL call site into a corresponding OL call site; generate a confirm block of instructions corresponding to each OL call site; and upon execution of any confirm block of instructions: compare the actual IL return address Ractual with the predicted IL return address Rpred; if Ractual is equal to Rpred, continue execution of the OL instructions following the OL call site; and if Ractual is not equal to Rpred, transfer execution to the back-up return address recovery module.
19. The non-transitory computer-readable media of claim 18, wherein only a single scratch register is used for the launch and confirm blocks of instructions.
20. The non-transitory computer-readable media of claim 15, wherein the computer-executable instructions further direct the one or more processors to bind a translation of a return within the OL subroutine translation to the third index in the return target cache.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
(5) In the context of binary translation of an IL stream to an OL instruction stream, the software mechanisms described herein provide a way to return very quickly and accurately from a called subroutine P. This ability to return accurately works at any level of nesting. This invention applies to all kinds of binary translators, both cross-platform and non-cross-platform, including the unrestricted binary translators that make no assumptions about the behavior of the translated program. However, while the presently described embodiments may be applied in the context of any binary translator, they provide the greatest benefits for the unrestricted class of binary translators.
(6) General System Structure
(8) The hardware platform will also include a number of registers 110, which may be included in other hardware components, especially the processor 102 itself.
(9) A software host 200 runs on the hardware platform 100. The host will include some form of operating system (OS) (either a general-purpose OS and/or some specially designed kernel), which will include drivers as needed for various connected devices, as well as other well known conventional components. The OS and its standard components are not shown because they are so well known and may be assumed to be present, without reducing the ability of a skilled programmer to understand this invention. The host issues instructions in an output language (OL), which will typically be the instruction set of the hardware processor(s) 102.
(10) One or more guest programs/systems 400 (which issue IL instructions for execution) are functionally connected via the host 200 system to run on the underlying hardware platform 100. Examples of the many types of guest programs that may use the invention range from a simple, conventional application program to a full-scale virtual machine (as in the products of VMware, Inc. of Palo Alto, Calif.) that includes its own virtual operating system, virtual memory, etc. It is assumed here that the guest is the component that issues an IL call to a subroutine P, which is translated into an equivalent OL instruction sequence that can run on the hardware platform.
(11) Recall that the input and output languages may (though usually will not) be the same, or one may be a subset of the other, with binary translation between the guest and hardware nonetheless used, in whole or in part, for other reasons, such as to provide a virtualized platform for guests to run on. Moreover, it may not be necessary to have an actual, physical hardware system at all; rather, the guest (or, indeed, the host) may itself be running on a virtual machine or, for example, an emulation of the hardware platform, in which case the hardware platform 100 is a software construct.
(12) The host system 200 includes a binary translator 210, which will typically be a software module, as well as a binary translation cache 220, which stores the output of the binary translator, that is, the IL-to-OL translated instructions. The host system also includes a return target cache RTC 230, which is discussed below. The general design and function of a binary translator and its connected translation cache are well known and are not described in detail below; of course, the specific improvements to the binary translator provided by this invention are.
(13) In actual operation, all the illustrated software components will be sets of processor-executable instructions and related data either stored on the disk 106 or loaded in memory 104. They are shown as separate components by way of illustration only.
(14) Actual, complete systems will contain many other hardware and software components that are well known. These are consequently not illustrated or described here, since the invention does not depend on any particular implementation of these components. For example, as mentioned above, the host system 200 will include, be connected to, function as, or replace the conventional host operating system found in most modern computers. In some implementations of the invention, for example, the guest 400 might be a virtual machine, in which case the binary translator and related components may be part of a virtual machine monitor that in turn is running on a host system software layer. The invention can be used in all such configurations.
(15) Return Target Cache (RTC)
(17) Of course, other data structures may also be used, although such structures will typically slow the invention down, because they will in general require more bookkeeping and complexity than a simple contiguous array. Moreover, it would also be possible to implement the return target cache using a dedicated hardware memory component, although this will in general also not be necessary, and would reduce the general applicability of the invention. Similarly, the size of the array may be chosen using normal design and experimental methods. The elements in the RTC array 230 will be OL addresses.
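As an illustrative sketch only (not part of the claimed subject matter), the RTC described above can be modeled as a fixed-size contiguous array whose elements are OL addresses. The names `M`, `MISS_HANDLER`, `rtc_store`, and `rtc_load`, and the numeric addresses, are assumptions introduced for illustration:

```python
# Minimal model of the return target cache (RTC): a fixed-size,
# contiguous array whose elements are OL addresses.
M = 256                      # number of RTC cells (illustrative size)
MISS_HANDLER = 0xDEAD0000    # illustrative default OL address (back-up handler)

# Every cell starts out pointing at the back-up handler, so a jump
# through an unset cell never lands at an undefined address.
rtc = [MISS_HANDLER] * M

def rtc_store(index, ol_return_address):
    """A call site stores its OL return address R' at the given index."""
    rtc[index] = ol_return_address

def rtc_load(index):
    """The launch block jumps through the address held in cell `index`."""
    return rtc[index]
```

This models only the data structure; how the index is computed and how hits and misses are detected are described in the sections that follow.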
(18) Returning from Subroutine
(19) The translation mechanism according to the invention is similar to the one described in U.S. Pat. No. 6,711,672 (outlined above), but avoids the need to compute a hash of any address in the launch block (sequence of instructions); it also reduces the path length, in part by avoiding the need to hash and in part by making do with a single scratch register, here designated %ecx, to carry the return address from a launch block to a confirm code block. The assumption is that at least one call to a subroutine, whose entry address (or “call target address”) is here designated as P, is made in the IL sequence. Below, referring to “subroutine P” is to be understood as meaning “the subroutine whose entry address is P.”
(20) To understand the new mechanism, assume the following notation: a “%” prefix designates a register; “%esp” indicates the stack pointer register; “lea” is the “load effective address” instruction (opcode); “jecx” is the opcode for “jump if %ecx=0”; and destination operands are to the left of the comma in the representation of a multi-operand instruction.
(21) Launch Block
(22) Assume now that a call site that invokes procedure P is executed. As part of the invocation, the call site will push an IL return address Rcall onto the stack, which will typically be the address of the next executable instruction following the call to P. (Note that if a subroutine fails to include a guaranteed return, then failure to return is a flaw of the original IL subroutine itself and will be faithfully reproduced in the OL sequence.) Eventually, execution of P (including execution of further, nested subroutine calls, which are handled in the same manner as the highest-level call) will reach a return instruction, which the binary translator then translates into an OL launch block.
(23) According to the invention, the launch block (sequence of instructions) used by the binary translator 210 saves the contents of a scratch register (for example, %ecx) for later restoration, pops an actual IL return address Ractual from the stack into this register, then jumps to a target OL address R′ indicated by the contents of a location k in the return target cache. Thus, using the same notation as above, see Table 12 below. How k is determined according to the invention is explained below.
(24) TABLE 12

  launch:  save %ecx            ; save %ecx contents to fixed memory
                                ;   location (= mov M, %ecx)
           pop %ecx             ; fetch Ractual (pull off stack into %ecx)
           lea %esp, imm(%esp)  ; optional instruction (see below)
           jmp rtc[k]           ; jump to address in rtc cell k
(25) In most cases, Rcall will be the same as Ractual. This is not guaranteed to be true, however, so the invention includes mechanisms (described below) to handle the cases in which this is false. The subscript “actual” is used here because it represents what is actually popped from the stack into %ecx.
(26) Now consider each instruction of this launch block in turn:
(27) Instruction “save %ecx”: This well-known instruction in the OL return sequence saves the contents of scratch register %ecx so that the register can be used for performing the return. More specifically, because the return and “back-up” (the Miss/Failure handler described below) routines in this illustrative implementation use the register %ecx, the system needs to save its contents beforehand in order to be able to restore them, and thus restore the complete system state upon return.
(28) Instruction “pop %ecx”: This well-known, conventional stack operation fetches the information (Ractual) at the top of the stack into %ecx. Unless the code after the original subroutine call in some way modifies this information (for example, by some stack manipulation operations; see below), the top of the stack should contain the IL return address Rcall. As explained above, however, this cannot be assumed to be true. The system then automatically (as part of execution of the “pop” instruction) also updates the stack pointer. Regardless of what may have happened to the stack since the call site first pushed Rcall onto it, the “pop %ecx” instruction will place the current top-of-stack value in %ecx and treat it as the IL return address Ractual.
(29) Instruction “lea %esp, imm(%esp)”: Note that “lea” abbreviates “load effective address” and is a form of add instruction that does not modify flags. Instruction “lea x, ±n(y)” performs the operation x:=y±n. Whether this “lea” instruction should be included will depend on how parameters are handled in subroutine calls and returns.
(30) In some languages, the number of parameters that a caller must pass to a subroutine must exactly match the number of parameters that the subroutine is declared to take. When translating such a subroutine, it is possible for the translated code in the subroutine to remove this known number of parameters from the stack before returning to the caller. In these cases, on x86 hardware, it is possible to implement this with a single return-with-immediate instruction. In the OL instruction sequence resulting from translating such a return, the “lea” instruction is used to accomplish this argument removal by setting %esp to point above the location of the argument(s).
(31) In other languages, the number of parameters passed to a given subroutine need not be the same for all call sites. In compiling such languages, it is usually the caller's responsibility to remove arguments from the stack once the subroutine has returned. The subroutine will therefore terminate with a “plain” return-without-immediate instruction. The binary translation of such returns can omit the “lea” instruction since no additional stack pointer adjustment is necessary.
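The stack effect of these two conventions can be sketched in a few lines of Python. This is a behavioral model only; the function name `translated_return` and the parameter `callee_pops` are illustrative assumptions, not part of the patented mechanism:

```python
def translated_return(stack, callee_pops=0):
    """Model of the translated return's effect on the stack.

    callee_pops > 0 models a return-with-immediate: the callee removes
    its own arguments, which the translation implements with the
    optional "lea %esp, imm(%esp)" stack adjustment.  callee_pops == 0
    models a plain return: no "lea" is emitted, and the caller removes
    the arguments afterwards.
    """
    r_actual = stack.pop()            # pop %ecx: fetch the return address
    for _ in range(callee_pops):      # lea %esp, imm(%esp): skip arguments
        stack.pop()
    return r_actual
```

For example, with two arguments and the return address on top of the stack, a callee-pops return yields an empty stack, while a plain return leaves the arguments for the caller to discard.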
(32) Instruction “jmp rtc[k]”: This instruction as a whole performs a memory-indirect jump through the return target cache's cell at index k.
(33) “Confirm block”: Upon completed execution of the launch block, the following will apply: %ecx will contain the value Ractual found at the top of the stack, and execution will proceed to the instruction whose address R′ is located in cell k of the RTC 230. It is necessary for the system to determine, however, whether execution has, in fact, arrived at a point that corresponds to the proper OL return address.
(34) The binary translator 210 according to the invention therefore inserts a block of code (instructions) referred to as the confirm block or sequence—preferably beginning at each return destination (here: R′) in the OL instruction stream. This confirm block determines whether the return address that was found at rtc[k], and that was used for the return, is such that execution has returned to the right place in the OL instruction stream.
(35) The confirm block follows logically as shown in Table 13 below from the launch block given above.
(36) TABLE 13

  confirm:  lea %ecx, −Rpred(%ecx)  ; %ecx := %ecx − Rpred
            jecx hit                ; jump to “hit” if %ecx contains 0
            lea %ecx, Rpred(%ecx)   ; reestablish actual IL return address
                                    ;   in %ecx by adding back Rpred
            jmp miss                ; jump to conventional miss handler
  hit:      load %ecx               ; restore %ecx original contents from a
                                    ;   fixed memory location (equivalent
                                    ;   to mov %ecx, M)
(37) The first line of this code includes Rpred, which is a predicted IL return address. Note that each confirm block will normally be associated with some call site. Here it is assumed that the IL return address for this site is Rpred. In other words, within each confirm block is an embedded assumption that it has been reached via a jump through the RTC from a launch block that in turn was invoked because of a call site whose IL return address is Rpred. The jump to a given confirm block is thus a “prediction” about which call site is involved. Determining whether this prediction is true is the main purpose of the confirm block.
(38) If Rpred=Ractual, then execution of the return is on the right path. A fast way to determine this is simply to subtract Rpred from the contents of %ecx and see if the result is zero. If it is zero, then a “hit” has occurred, the register's original contents are restored, and execution continues normally with the OL instructions that follow the confirm block. If the result of the subtraction is non-zero, then any conventional miss handler (see below) must be invoked after the actual IL return address Ractual has been recovered in %ecx by adding back the Rpred that was earlier subtracted.
(39) To summarize the main aspect of the mechanism illustrated above: A caller pushes a return address Rcall when calling a subroutine P. The launch block loads Ractual into the scratch register %ecx and then jumps through the OL address R′ found in rtc[k] to some confirm block (or to a default address, as explained below). The confirm block compares Ractual, found in the scratch register %ecx, with the predicted IL return address Rpred encoded within it (in particular, as one “lea” operand). If these values are equal, then execution may proceed as normal. If not, then the actual return address Ractual is reestablished in %ecx and another mechanism must be invoked to attempt to find the proper return point.
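The call/launch/confirm round trip just summarized can be modeled in Python. The sketch below is purely illustrative: the function names, the `confirm_rpred` mapping (which stands in for the Rpred operand encoded at each confirm block), and the index function are assumptions made for the model, not the claimed implementation:

```python
def h(p, m=256):
    # Index function assumed for this model: the low bits of the
    # procedure entry address P (P mod m, m a power of two).
    return p % m

def simulate_call(rtc, stack, P, R_call, R_prime):
    """Model the call block: push Rcall, scatter R' into rtc[h(P)]."""
    stack.append(R_call)              # push R
    rtc[h(P)] = R_prime               # rtc[h(P)] := R'

def simulate_return(rtc, stack, P, confirm_rpred):
    """Model the launch block followed by the confirm block.

    `confirm_rpred` maps each OL confirm-block address R' to the
    predicted IL return address Rpred encoded inside that block.
    """
    R_actual = stack.pop()            # pop %ecx: fetch Ractual
    target = rtc[h(P)]                # jmp rtc[k]
    R_pred = confirm_rpred[target]    # prediction embedded at R'
    if R_actual - R_pred == 0:        # lea/jecx: subtract and test zero
        return ("hit", target)        # continue at the correct call site
    return ("miss", R_actual)         # fall through to the miss handler
```

In the common case the confirm block reached is the one for the call site that made the call, and the comparison yields a hit; a stale RTC entry yields a miss instead.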
(40) The instructions “lea” and “jecx” neither modify flags nor base jumps on them, and are used as an optimization specifically for x86 architectures. If the invention is implemented in some other architecture, then analogous instructions, which do not modify and/or test flags, should preferably be used. Of course, instructions that do modify flags may also be used, although these will typically result in longer and slower OL instruction sequences, since flags must be preserved across the launch and confirm blocks.
(41) Value for “k”: What remains to be determined is the value k, which is the index into the return target cache RTC used in the launch block. The value k cannot depend on the IL return address Ractual, since it is not constant at the return site (the procedure might have more than one caller, for example). According to the invention, k is therefore preferably computed as a function of the procedure entry address P. To see how this works, consider again the translation into the OL of the IL call to the procedure (subroutine) P as shown below in Table 14, where “rtc” is the return target cache, h( ) is a function described below, and, as before, an apostrophe indicates an address in the output language (translated) sequence. This translation constitutes a call block of instructions.
(42) TABLE 14

  call P   →   push R
  R:           rtc[h(P)] := R′
               jmp P′
(43) As one optimization, the binary translator 210 preferably stores the code for P′ immediately after this block of call instructions. The “jump” to P′ will then simply be a continuation of execution with the next following instruction.
(44) A fast but still adequate hash function h( ) may be as simple as extracting a certain number of bits b of P, for example, the lower b bits of P, which is equivalent to saying that h(P)=P mod m, where m=2^b. In one prototype of the invention, b=8, so that m=256; the lower byte of P was extracted and h(P)=k=P mod 256. Many other hash functions are known in the art, however. The hash function h( ) may be chosen from among these using normal design considerations. Note that the system can compute h(P) at compile time, so a translated call is not slowed down by a need to calculate hash values; moreover, unlike in the '672 patent's mechanism, the hash value need not be computed as part of the launch block.
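The bit-extraction index just described (b=8, m=256) can be written directly. This is a sketch of that computation only; the example entry addresses are illustrative:

```python
def h(P, b=8):
    # Extract the lower b bits of the procedure entry address P.
    # For b = 8 this is P mod 256, i.e. the low byte of P.
    m = 1 << b              # m = 2**b, the RTC length
    return P & (m - 1)      # equivalent to P % m when m is a power of two

# Two (hypothetical) entry addresses that differ only above the low
# byte map to the same RTC index; such collisions are the source of
# the "miss" case discussed later.
k1 = h(0x08048A10)          # low byte 0x10 -> index 16
k2 = h(0x08051F10)          # low byte 0x10 -> index 16 as well
```

Because h(P) depends only on the constant entry address, the translator can evaluate it once, at translation time, and emit the literal index into the call block.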
(45) When the target of a call is not a constant, such as P above, as may occur with virtual methods in object-oriented programs and function pointers in C, the system cannot compute h(P)=k at compile time. Instead, the translator emits OL instructions to compute the RTC index dynamically from the given target. Such emitted instructions will be part of the translation of the call. This still avoids the need to compute a hash value in the launch block, however.
(46) The term “hash function” is used here merely because using the lower order (or any other set of) bits of P as an index will usually provide a good distribution of indices over the RTC array 230, as is desirable for a hash function. One could also say that h( ) is a “bit extraction” or “mapping” function, or give this function some other name. The purpose would be the same, however. The term “hash function” is used because it is well known in the art and is general. The invention may use any function h of P whose range is preferably the same as (or smaller than) the range of indices of the RTC, that is, [0,m) where m is the length of the RTC array; to minimize the probability of “collisions,” that is, two different procedure entry addresses hashing to the same value, the function h preferably maps P as uniformly as possible over the RTC index range.
(47) Assume now that there is more than one call site. According to the invention, each call site will store its translated return target address R′ into a cell rtc[k] whose index depends not on the return address R, but rather on the address of the called procedure P. In the scheme disclosed in the '672 patent, callers scatter the OL return target addresses R′ using the IL return addresses Rcall. In the present invention, they scatter the OL return target addresses R′ using the call target address P. To descramble the rtc[ ] at the return launch, it is necessary to find the procedure entry point for the return being translated. In other words, given a return, the system must determine which procedure it “belongs” to. Once the system determines the procedure P, it will also know the RTC index to use, since k=h(P).
(48) Hit
(49) Execution of the launch block transfers execution out of an OL procedure to a confirm block (or the miss handler) at address R′=rtc[k]. Commonly, the activated confirm block will be the one corresponding to the call site that invoked the procedure from which the launch block is returning. A “hit” is deemed to occur when the IL return address Rpred embedded in the activated confirm block is the same as the actual return address Ractual provided in %ecx by the launch block. This means that, after the “lea %ecx, −Rpred(%ecx)” instruction, %ecx will hold the value zero, so that the “jecx hit” instruction will cause a direct jump to the “load %ecx” instruction, which restores the contents of %ecx to whatever they were before the original return instruction was encountered.
(50) Because of the return target cache and the nature of the hash function, a hit will be by far the most common case in the great majority of implementations and uses of the invention. In this case, execution may simply continue from R′ as normal, with no further involvement of the special mechanisms provided by this invention, until the next subroutine call or return is encountered.
(51) Miss
(52) It is possible, however, that the OL return address R′ will be overwritten by another call/return pair that executed inside the routine P. For example, within the instruction sequence of P there may be one or more calls to other subroutines, for example Q, each causing calculation of a return address. Computation of the hash function might then cause one or more of these “nested” return addresses to be found at the same location in the return target cache as the higher-level OL return address R′. Upon return, execution will then transfer to the beginning of the confirm block for some OL call sequence (since each RTC entry except for default entries described below will point to “some” confirm block), but one that is incorrect. This case is referred to as a “miss.”
(53) By way of example, let PR.sub.call and PR′ be the IL and OL return addresses, respectively, for a call to subroutine P and let QR.sub.call and QR′ be the IL and OL return addresses, respectively, for a call to subroutine Q. Whenever P is called from this call site, PR.sub.call will be pushed onto the stack; similarly, whenever Q is called from this call site, QR.sub.call is pushed onto the stack. If, however, P calls Q and the hash function causes QR′ to be stored at the same location in the return target cache as PR′, in other words h(P)=h(Q), then upon return to the original call site, execution will transfer to QR′, since this value will have overwritten PR′ in the return target cache. In such a case, the return sequence would be directed to the wrong destination (the confirm block at QR′ instead of the confirm block at PR′).
(54) In other words, a “miss” is when the predicted IL return address R.sub.pred is not the same as the actual return address R.sub.actual in % ecx, in short, an incorrect prediction. In this case, the “jecx” instruction will not jump. Execution falls through to the second lea instruction to recover R.sub.actual in % ecx before jumping to the miss handler described below.
(55) Failure
(56) One other possible return “error” (or “non-hit”) is also possible: If a procedure directly manipulates the return address provided by its caller (e.g., by adding or subtracting from the value stored on the stack), then an attempt to return from the procedure, that is, to execute a launch block, may lead to a jump through an index in the rtc[.] that has never been set by a call site. This situation—which indicates that the RTC value did not provide any prediction at all—is referred to here as a “failure.”
(57) In order to prevent the system from attempting to return through an undefined OL “address” and thus branching to some undefined point in memory, the return target cache 230 is therefore preferably initialized by putting into each of the RTC array elements a default value, specifically, the address of the beginning of the Miss/Failure handler 240 described below. This initialization should also be performed any other time the entries in the return target cache are invalidated. For example, certain events, such as address space mapping changes in a system-level binary translator, may require invalidation of the entries in the return target cache. Filling the return target cache with the default address of the Miss/Failure handler will cause the next few translated returns to execute the back-up sequence, but soon the return target cache will be filled with useful entries again, and execution will proceed at the faster speed provided by cache hits, that is, successful returns through the return target cache.
(58) Miss/Failure Handler 240
(59) The “back-up” code component, referred to here as the Miss/Failure handler 240, is the routine according to the invention that handles the cases in which the system determines either that the predicted IL return address R.sub.pred is not the same as the actual IL return address R.sub.actual (a miss) or that the fetched RTC value does not point to any confirm block at all (a failure).
(60) The Miss/Failure routine may be any conventional method that maps the actual IL return address R.sub.actual to an OL return address R′. (Recall that the actual IL return address R.sub.actual is still available in % ecx when the “jmp miss” instruction in the confirm block invokes this routine.) For this “back-up” mapping, it can use any of several well-known data structures such as a conventional hash table. Since the back-up routine, that is, the Miss/Failure handler, executes only when the return target cache misses or fails—events which are relatively rare—even a costly hash table lookup may be used to perform the return. The back-up code will thus use this conventional hash table to compute the correct OL return target R′, restore the scratch register % ecx, and finish with a jump to the correct OL return target R′. The slow back-up procedure used in this invention is thus the same as the only return procedure found in prior art systems—most of the time, however, the invention is able to use the return target cache and the much faster return procedure described above.
(61) In very rare cases, it is possible that even a conventional hash table will not determine the correct OL return address R′. This problem is of course also encountered in the prior art systems that use the hash table-based return scheme. If this happens, it may be because there is no known OL return address that corresponds to the current value of % ecx—there is no way “back.” In this case, as in prior art systems, the system according to the invention may perform a callout to the binary translator and cause it to translate the code starting at % ecx and onwards until some predetermined criterion is met.
(62) Nested Calls
(63) Assume again that the IL instruction sequence includes a call to P and a call to Q (which may be within P) such that the first call's return address is PR and the second call's return address is QR. Assume furthermore that P (Q) is well behaved so that when P (Q) returns, the actual return address is the address that was placed on the stack by P's (Q's) caller. The corresponding OL subroutines and return addresses are thus P′, PR′, Q′ and QR′. To summarize the discussion above, after completing the call to P (P′), the memory indirect jump “jmp” rtc[k] of the return launch block is executed and there will be the following three possible actions:
(64) 1) a return to the correct beginning address PR′ of the confirmation sequence following the correct subroutine call, in short, a “successful,” correct return—a “hit”—that is, a correct prediction;
(65) 2) a return to the beginning address QR′ of the confirmation sequence of the wrong subroutine—a “miss,” which corresponds to an incorrect prediction—which will have happened because a later IL call target Q was converted by the hash function to the same position in the return target cache and thus overwrote the correct (that is, earlier) entry PR′. This can also happen if the IL program changed the return address on the stack. In this case, the confirmation sequence will direct execution to the Miss/Failure handler; or
(66) 3) a jump directly to the Miss/Failure handler 240 in the case that the entry in the return target cache contains the initial, default address. Note that any entry other than the default value will be the address of the beginning of the confirmation sequence of some translated subroutine, since the only time a non-default entry is stored in the return target cache is when the binary translator has translated an IL subroutine call and has generated code that puts the translated return address, that is, the address R′ of its confirm block, into the return target cache.
(68) Locality
(69) The reason why the return target cache achieves a high hit rate is that, at any given time, it will tend to contain return target addresses for the most recently executed calls. These are the calls that are likely to return in the near future. In other words, the return target cache exploits a locality phenomenon: the depth fluctuations on the return stack as calls and returns execute are relatively modest in most programs. This allows a small return target cache to capture the active calls and returns with a high probability.
(70) Alternative Embodiments and Optimizations
(71) Above is described the preferred embodiment of the invention, for using the return target cache upon translation of calls and returns. There are, however, alternatives. For example, in one variation, the code—the confirm block—shown above at the translated call site to confirm the correctness of the return target addresses may be moved in whole or in part to the return site. This would necessitate slight changes, which skilled programmers will realize and be able to implement. Moreover, as an optimization, the confirm block is preferably emitted so as to be located immediately after the launch block when the return is translated; this improves both instruction cache and branch performance.
(72) Other variations might involve reorganizing the return code to enable multiple return sites to share some of the code. This would provide a code space savings, but would not change the fundamental idea of the invention. The method according to the invention may also be generalized to handle other forms of returns, such as far returns; necessary changes to the steps described above will be obvious to those skilled in the art.
(73) Comparison with the Mechanism for Hashing Return Destination Addresses
(74) In the Background section above, the mechanism for returning from subroutines disclosed in U.S. Pat. No. 6,711,672 is summarized using the instruction sequences used in that application itself. The essential mechanism disclosed in the '672 patent (hashing IL return destination addresses) can be expressed using instructions similar to those used above to describe the present invention. This way of expressing the mechanism for hashing return destination addresses allows for more direct and revealing comparison between the system described in the '672 patent and the present invention. Moreover, the expression of the '672 patent's scheme below has the added advantage (compared with how it is written in the '672 patent) that it does not affect flags.
(75) The launch and confirm blocks in the '672 patent can be expressed as shown below in Table 15 and Table 16, where % eax, % ecx are separate scratch registers. Here, merely for the sake of compatibility with the use of movzx for hashing, it is assumed that the RTC has 256 entries instead of the 64 used by way of example in the '672 patent.
(76) TABLE 15

ret .fwdarw. launch:
	save %eax
	save %ecx
	pop %eax             ; fetch R.sub.actual
	lea %esp, imm(%esp)  ; optional (see above)
	movzx %ecx, %al      ; h(R.sub.actual)
	jmp rtc[%ecx]
(77) The instruction “movzx” is the “move with zero-extend” opcode and the entire instruction creates in % ecx a 32-bit data word from the lower byte (2.sup.8=256 possible values) of % eax (% al is the lowest byte of the extended accumulator register % eax). Of course, the hash calculation may be performed using other opcodes, for example in architectures with other instruction sets, but if more than one instruction is needed the result will typically be less efficient.
(78) The instruction “jmp” is the conventional “jump” opcode.
(79) Note that this launch block includes the “movzx % ecx, % al” instruction, which replaces the “and % ebx, 63” instruction shown in the example code listing for the '672 patent found above. In either case, the mechanism in the '672 patent therefore requires a recomputation of the hash function in the launch block, as represented in Table 16, where “jecx” is an instruction to skip to hit if % ecx=0.
(80) TABLE 16

confirm:
	lea %ecx, −R.sub.pred(%eax)
	jecx hit
	jmp miss
hit:
	load %eax
	load %ecx
(81) If the address in % eax matches R.sub.pred, then “lea” will set % ecx to zero and “jecx” will cause a skip, over the “jmp” instruction, to the instructions to be executed for a hit. If they do not match, then execution will proceed to the following instruction, that is, to “jmp miss”, which causes a branch to the miss-handling routine.
(82) The key difference between the present invention and the invention described in the '672 patent is that, in the present invention, there is no need to compute the hash function dynamically in the launch block. In contrast, in the '672 patent, code must be emitted to pull R.sub.actual from the stack and perform the hash computation.
(83) Another difference is that the launch block according to the present invention has one fewer scratch register save. Moreover, the confirm block in this invention has fewer executed instructions on the common path. Together, the launch and confirm blocks can be implemented (in x86, at least) so as to use only a single scratch register. Note that this difference may be negated in certain architectures, although it will in general be present in x86 systems.
(84) Binding
(85) To maximize the hit rate, the mapping from return instruction to procedure entry should be as precise as possible. The miss rate will then be determined by the “random” collisions in the RTC; these should be few, however, since stacks rarely move up or down by more than a few frames. Assuming an unambiguous mapping, the mechanism according to the invention should be approximately as efficient as the scheme based on hashing return addresses R instead of P.
(86) On the other hand, in the scheme disclosed in the '672 patent, computation of the RTC index k was entirely dynamic in the launch block. In this invention, however, the RTC may assume a static role as well: it supports binding of launch blocks to RTC indices k.
(87) A given launch block may be “unbound,” that is, not yet associated with a valid return address. Another way to state this is that the RTC index k calculated initially from the function h(P) might not be the same as the RTC index later used when the system tries to retrieve the OL return target address R′ from the RTC. Upon a return, some mechanism should therefore preferably be included to find the corresponding procedure entry. In other words, there should preferably be some way to bind the unbound launch block. Such a binding function should preferably maximize the probability that the two indices are the same.
(88) One way to accomplish this is to include in the system an auxiliary data structure that records procedure entry points. The binary translator will add to this data structure when processing call instructions. Return translations will then look in the data structure to find, say, the nearest preceding procedure entry point.
(89) An entirely different approach is to use the rtc[ ] itself for this computation. According to this scheme, initially, when a return is translated, a special index, for example, 256 (or any other index outside the range of the hash function h( )), is assigned for the purpose, and the value stored in rtc[256] is some value that causes the “jmp” instruction to generate a fault or in some other way indicate that the launch block is unbound. In x86 architectures, this value could, for example, be the binary value for −1, which in some cases (known to skilled programmers) can be used to generate a General Protection (GP) fault. Regardless of what value is used to produce the fault or indication, when the system detects this condition, it scans rtc[0 . . . 255] to try to find a suitable index k to use in place of 256. For each cell rtc[.], the system then determines the return EIP (extended instruction pointer) that the cell services.
(90) Two different situations may occur: (1) the cell may point to the miss handler, in which case this cell cannot be used; and (2) the cell may point to a confirm block, in which case the system can extract the EIP from one of the “lea” instructions in the block, which can be readily found given the confirm block's entry address.
(91) To bind a launch block, the system can scan the RTC entries (at most, for example, 256 entries). With a high probability, it will find an index k that points to a confirm block whose IL address R.sub.pred matches the actual IL return address R.sub.actual. If such a k is found, then the launch block's jump is patched to use k, e.g., “jmp rtc[k]”. If no suitable k value is found, then the system can try again one or more times, for example, the next time(s) the given launch block is executed. After a predetermined number of failing tries, the system can route the return directly to the miss handler, which should happen only rarely.
(92) The launch block will be unbound until k is computed. The value k may be computed and patched in by the fault handler the first time the launch block is executed.
(93) If the RTC according to the invention is used to implement binding, RTC entries should not be lost unnecessarily, as this could prevent binding of important return sites and lead to costly invocations of the conventional miss handler. A failure to accurately bind even a single launch block could cause an unbounded number of miss handler executions. In any binary translator that implements the invention, the system designer should therefore examine carefully all code sections that cause a flush of the RTC.
(94) Dynamic RTC Array Adjustment
(95) One factor that affects the efficiency of the system according to the invention is the frequency of misses, since it takes a relatively long time to determine the correct OL return address in the Miss/Failure handler. A high miss rate might arise, for example, because of an IL program with many deeply nested subroutines.
(96) The system according to the invention may therefore also include a mechanism that dynamically adapts the size of the RTC 230 to current needs. Using this adaptation mechanism, if the miss rate exceeds some experimentally or arbitrarily determined expansion threshold, then the system will often be able to reduce the likelihood of misses by increasing the size of the RTC. In addition to increasing the memory allocation for the RTC array 230, the function h( ) should then be adjusted accordingly. For example, assuming that h(P)=P mod m, if the array is increased from 256 to 512 elements, then the parameter m should also be changed from 256 to 512 in order to extract the nine least significant bits of the call target address P instead of only eight. An appropriate time to resize the RTC will be immediately after a flush of the translation cache.
(97) Of course, the problem is how to calculate the miss rate. One way is to include incrementing instructions in the Miss/Failure handler to count misses. The miss rate can then be defined as the number of misses that have occurred during some predetermined interval. One problem with this approach, however, is that a very high rate of subroutine calls might lead to a high miss count, even though the ratio of misses to total calls is acceptable.
(98) It would therefore be better to adjust the RTC size based on the relative frequency of misses (for example, the ratio of misses to total calls, or the ratio of misses to hits) rather than on the absolute number of misses in a given interval. In doing so, one should avoid including any additional instructions in the launch and confirm blocks, because these blocks will usually be executed so often that the time needed to execute the additional instructions will in almost all cases be more than the time saved by implementing the dynamic RTC array adjustment feature.
(99) One way to determine the relative miss rate, and to adjust the RTC size accordingly, is to use a sampling technique. First, note that the system can determine, for any given value of the instruction pointer, whether execution is currently in the launch block, in a confirmation block, or in the Miss/Failure handler. A module can therefore be included within the binary translator, or elsewhere in the host system, to periodically interrupt execution of the OL instructions and determine whether execution is in a confirm block (indicating a hit or miss), in the Miss/Failure handler (indicating a miss or failure) and/or in the launch block (indicating some return).
(100) Let M be the number of times execution is found to be in the Miss/Failure handler; C be the number of times execution is found to be in a confirmation block; and L be the number of times execution is found to be in the launch block. The quotient M/C will then be a reasonable estimate of the ratio of misses to total non-failure returns. (Note that adjusting the size of the RTC array will usually not affect the rate of failures.) Similarly, the quotient M/L will be a reasonable estimate of the ratio of misses to total returns, including failures. Either M/C or M/L can therefore be used as the miss rate and compared with the expansion and contraction thresholds. As skilled programmers will realize, all such quotients may need to be scaled to account for differences in the execution times of the different components.
(101) Tail Calls
(102) There is one case where the new scheme according to the invention is inferior to the mechanism for hashing based on return addresses, namely, a situation known in the art as a tail call. If a procedure P is called in the normal way, the system will map its return to the index calculated from “P.” If P is later invoked with a tail call from some other procedure Q, however, the system will probably miss on the return since it should jump through rtc[h(Q)] instead of rtc[h(P)]. Assuming the conventional miss handler (which implements the back-up path) is relatively fast, this is not a serious shortcoming of the invention. Furthermore, tests run by the inventor indicate that misses of this type do not occur so often that they offset the performance gains won through the invention.