Target injection safe method for dynamically inlining branch predictions

Abstract

A method for redirecting an indirect call in an operating system kernel to a direct call is disclosed. The direct calls are contained in trampoline code called an inline jump switch (IJS) or an outline jump switch (OJS). The IJS and OJS can operate in a use mode, in which an indirect call is redirected to a direct call, a learning and update mode, or a fallback mode. In the learning and update mode, target addresses in a trampoline code template are learned and updated by a jump switch worker thread that periodically runs as a kernel process. When building the kernel binary, a plug-in is integrated into the kernel. The plug-in replaces call sites with a trampoline code template containing a direct call so that the template can be later updated by the jump switch worker thread.

Claims

1. A method of redirecting an indirect call in an operating system kernel to a direct call, the method comprising: while in a learning mode, recording, for each of one or more indirect calls, whose code has been replaced with a trampoline code template containing trampoline code for a direct call, an entry in a hash table having a plurality of keys and entries, wherein the recorded entry includes a source branch address and a destination branch address for an indirect call and a count of a number of occurrences of the source branch address and the destination branch address, and the hash table key is a bit-wise combination of the source branch address and the destination branch address; while in an update mode, updating, based on an entry in the hash table, a destination branch address in the direct call code of the trampoline code template associated with the indirect call; and while in a use mode, executing the trampoline code to redirect the indirect call to the updated direct call.

2. The method of claim 1, wherein the trampoline code template is installed in the operating system kernel at compile time.

3. The method of claim 1, wherein the learning mode is operative in response to an epoch event that occurs once every sixty seconds.

4. The method of claim 1, wherein the update mode is operative in response to an epoch event that occurs once every second.

5. The method of claim 1, wherein the trampoline code has fallback code which allows the trampoline code to operate as an indirect call.

6. The method of claim 1, wherein the trampoline code has fallback code which allows the trampoline code to operate as a retpoline, which is a return trampoline that prevents speculative execution until the destination branch address of the indirect call is determined.

7. The method of claim 1, wherein the trampoline code has an expansion mode which allows the trampoline code to access a list of destination branch addresses as possible branch addresses.

8. The method of claim 1, wherein updating the direct call code in the trampoline code template includes updating the trampoline code template while the operating system kernel is running.

9. A system for redirecting an indirect call in an operating system kernel to a direct call, the system comprising: memory containing the operating system kernel and one or more user processes in a user space of the kernel; one or more CPUs coupled to the memory, the one or more CPUs running the operating system kernel and the one or more user processes, wherein the operating system kernel is configured to: while in a learning mode for a trampoline, record, for each of one or more indirect calls, whose code has been replaced with a trampoline code template containing trampoline code for a direct call, an entry in a hash table having a plurality of keys and entries, wherein the recorded entry is a source branch address and a destination branch address for an indirect call and a count of a number of occurrences of the source branch address and the destination branch address, and the hash table key is a bit-wise combination of the source branch address and the destination branch address; while in an update mode for the trampoline, update, based on an entry in the hash table, a destination branch address in the direct call code of the trampoline code template associated with the indirect call; and while in a use mode for the trampoline, execute the trampoline code to redirect the indirect call to the updated direct call.

10. The system of claim 9, wherein the trampoline code template is installed in the operating system kernel at compile time.

11. The system of claim 9, wherein the trampoline code has fallback code which allows the trampoline code to operate as an indirect call.

12. The system of claim 9, wherein the trampoline code has fallback code which allows the trampoline code to operate as a retpoline, which is a return trampoline that prevents speculative execution until the destination branch address of the indirect call is determined.

13. The system of claim 9, wherein the trampoline code has an expansion mode which allows the trampoline code to access a list of destination branch addresses as possible branch addresses.

14. A non-transitory computer-readable medium comprising instructions executable in a computer system, wherein the instructions when executed in the computer system cause the computer system to carry out a method of redirecting an indirect call in an operating system kernel to a direct call, the method comprising: while in a learning mode, recording, for each of one or more indirect calls, whose code has been replaced with a trampoline code template containing trampoline code for a direct call, an entry in a hash table having a plurality of keys and entries, wherein the recorded entry is a source branch address and a destination branch address for an indirect call and a count of a number of occurrences of the source branch address and the destination branch address, and the hash table key is a bit-wise combination of the source branch address and the destination branch address; while in an update mode, updating, based on an entry in the hash table, a destination branch address in the direct call code of the trampoline code template associated with the indirect call; and while in a use mode, executing the trampoline code to redirect the indirect call to the updated direct call.

15. The non-transitory computer-readable medium of claim 14, wherein the trampoline code template is installed in the operating system kernel at compile time.

16. The non-transitory computer-readable medium of claim 14, wherein the learning mode is operative in response to an epoch event that occurs once every sixty seconds.

17. The non-transitory computer-readable medium of claim 14, wherein the update mode is operative in response to an epoch event that occurs once every second.

18. The non-transitory computer-readable medium of claim 14, wherein the trampoline code has fallback code which allows the trampoline code to operate as an indirect call.

19. The non-transitory computer-readable medium of claim 14, wherein the trampoline code has fallback code which allows the trampoline code to operate as a retpoline, which is a return trampoline that prevents speculative execution until the destination branch address of the indirect call is determined.

20. The non-transitory computer-readable medium of claim 14, wherein the trampoline code has an expansion mode which allows the trampoline code to access a list of destination branch addresses as possible branch addresses.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1A depicts an example system whose CPUs may have these vulnerabilities.

(2) FIG. 1B depicts the architecture of the CPUs in more detail.

(3) FIG. 2 depicts a flow of operations for a call to a retpoline.

(4) FIG. 3 depicts a flow of operations for indirect call promotion.

(5) FIG. 4A depicts a flow of operations for an inline jump switch (IJS), in an embodiment.

(6) FIG. 4B depicts the target(mode) function, in an embodiment.

(7) FIG. 5 depicts a flow of operations for an outline jump switch (OJS), in an embodiment.

(8) FIG. 6A depicts an example hash table, in an embodiment.

(9) FIG. 6B depicts a flow of operations for IJS and OJS switch type learning, in an embodiment.

(10) FIG. 7 depicts a flow of operations for IJS and OJS switch type learning policy, in an embodiment.

(11) FIG. 8 depicts a flow of operations for an IJS and OJS update, in an embodiment.

(12) FIG. 9 depicts a flow of operations for the update function used in FIG. 8, in an embodiment.

(13) FIG. 10A depicts a flow of operations for patching the operating system kernel, in an embodiment.

(14) FIG. 10B depicts a flow of operations for phase 1 of patching the operating system kernel, in an embodiment.

(15) FIG. 10C depicts a flow of operations for phase 2 of patching the operating system kernel, in an embodiment.

(16) FIG. 10D depicts a flow of operations for phase 3 of patching the operating system kernel, in an embodiment.

(17) FIG. 10E depicts a flow of operations for checking an interruption while patching the operating system kernel, in an embodiment.

(18) FIG. 11 depicts a flow of operations for a plug-in for an operating system compiler, in an embodiment.

DETAILED DESCRIPTION

(19) One or more embodiments described below provide jump switches, which avoid the problems with retpolines, indirect call promotion, and other mitigation measures. Jump switches are code fragments that serve as trampolines for indirect calls, where trampolines are code fragments that redirect the CPU to a different code path. Jump switches are Spectre-aware in that if a jump switch cannot promote an indirect call, then the jump switch falls back to a mitigated indirect call, such as a retpoline or hardware or microcode that provides protection.

(20) Embodiments of jump switches include an inline jump switch (IJS) and an outline jump switch (OJS). The IJS is optimized for code size and covers most of the use cases. The OJS is used when the indirect branch has multiple target addresses, thus extending the capabilities of the IJS.

(21) FIGS. 4A, 4B and 5 describe a flow of operations for both an inline jump switch (IJS) and an outline jump switch (OJS).

(22) FIG. 4A depicts a flow of operations for an inline jump switch (IJS), in an embodiment. The IJS is a trampoline that replaces an indirect call. The trampoline includes steps 402 through 408. In step 402, the CPU compares a learned target with the contents of the %eax register. If the result is zero, as determined in step 404, then the CPU performs a call to the learned target in step 406. If the result is not zero, as determined in step 404, then in step 408, the CPU performs a call to a target that depends on a mode of the IJS (target(mode)), which is further described in reference to FIG. 4B. In an embodiment, the steps are implemented in x86 assembly language according to Table 1 below.

(23) TABLE 1

Line no.   Label   Code
1                  cmp learnedTarget, %eax
2                  jnz miss
3                  call learnedTarget
4                  jmp done
5          miss    call target(mode)
6          done

IJSs are short, upgradable and updatable at runtime by a jump switch worker thread 112a in FIG. 1A (described in relation to FIGS. 6A-9). The learnedTarget represents a branch target address that the IJS has learned and that is promoted to avoid an indirect jump. If a miss occurs (the "no" branch of step 404 in FIG. 4A), then the target address depends on the mode that the IJS is in.
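By way of illustration only, the control flow of Table 1 can be modeled in a few lines of Python. This is a sketch, not the kernel implementation: the names `InlineJumpSwitch`, `learned_target`, `miss_target` and `fallback` are hypothetical, and Python callables stand in for branch target addresses.

```python
from typing import Callable


class InlineJumpSwitch:
    """Toy model of Table 1; Python callables stand in for branch targets."""

    def __init__(self, learned_target: Callable, miss_target: Callable):
        self.learned_target = learned_target  # promoted target (lines 1 and 3)
        self.miss_target = miss_target        # target(mode), used on a miss (line 5)

    def call(self, requested: Callable, *args):
        # Lines 1-2: compare the requested target with the learned target.
        if requested is self.learned_target:
            # Line 3: direct call to the learned target.
            return self.learned_target(*args)
        # Line 5: miss, so hand the call to the mode-dependent target.
        return self.miss_target(requested, *args)


def fallback(requested: Callable, *args):
    """Hypothetical miss handler that performs the plain indirect call."""
    return requested(*args)


def read_op(x):
    return ("read", x)


def write_op(x):
    return ("write", x)


ijs = InlineJumpSwitch(learned_target=read_op, miss_target=fallback)
```

In this model, `ijs.call(read_op, 1)` takes the direct path, while `ijs.call(write_op, 2)` misses and is routed through `fallback`.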

(24) FIG. 4B depicts the target(mode) function, in an embodiment. In step 452, the mode is matched to one of three possibilities. If the mode is learning, then in step 454, the target address points to learning code. If the mode is OJS, then in step 456 the target address points to an OJS leading to more target addresses. If the mode is fallback, then in step 458, the target address points to either a retpoline or a normal indirect call, depending on whether the system is Spectre-vulnerable.
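The selection made in FIG. 4B might be sketched as follows. The `Mode` enumeration, the string labels returned, and the `spectre_vulnerable` parameter are assumptions introduced for illustration; an actual jump switch would hold addresses rather than labels.

```python
from enum import Enum


class Mode(Enum):
    LEARNING = "learning"
    OJS = "ojs"
    FALLBACK = "fallback"


def target(mode: Mode, spectre_vulnerable: bool) -> str:
    """Return a label for the miss target chosen in FIG. 4B (step 452)."""
    if mode is Mode.LEARNING:
        return "learning_code"          # step 454
    if mode is Mode.OJS:
        return "outline_jump_switch"    # step 456
    # Step 458: the fallback depends on whether mitigation is needed.
    return "retpoline" if spectre_vulnerable else "indirect_call"
```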

(25) Initially, after compilation, the IJS is set to the fallback target by having the target address in steps 456 and 458 set to a retpoline. At runtime, worker thread 112a may patch the target addresses depending on the mode the switch is in and what target addresses have been learned by worker thread 112a.

(26) FIG. 5 depicts a flow of operations for an outline jump switch (OJS), in an embodiment. As mentioned above, the OJS handles the case of multiple target addresses as an extension of the IJS. The multiple target addresses are learned in real-time and provided by worker thread 112a. In an embodiment, the OJS is called by the IJS, when the mode of the IJS is changed to OJS. In an embodiment, the OJS is limited to a small number of target addresses, for example, six (6) target addresses.

(27) In step 502 of FIG. 5, the CPU determines whether a list of learned target addresses is empty or not. If not, then in step 504, the CPU obtains an item, lta, from the list and executes a comparison in step 506 of the item with the contents of the %eax register. If the result of the comparison is zero, as determined in step 508, then in step 510, the CPU jumps to the item, lta. In step 512, the CPU updates the list. The CPU repeats steps 502 to 512 until the list is empty. If the list is originally empty or when the list becomes empty, the CPU executes, in step 514, a jump to an address of learning code, which is a fallback to the learning code.
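The loop of FIG. 5 can be illustrated with the following Python sketch, under the same simplification that callables stand in for target addresses. The function names and the `seen` log are hypothetical; the limit of six targets is the example value given above.

```python
MAX_OJS_TARGETS = 6  # example limit given in the text


def outline_jump_switch(requested, learned_targets, learning_code, *args):
    """Toy model of FIG. 5; callables stand in for target addresses."""
    for lta in learned_targets[:MAX_OJS_TARGETS]:   # steps 502, 504, 512
        if requested is lta:                        # steps 506, 508
            return lta(*args)                       # step 510
    return learning_code(requested, *args)          # step 514


seen = []  # hypothetical log of targets that reached the learning code


def learning_code(requested, *args):
    seen.append(requested.__name__)
    return requested(*args)


def open_op():
    return "open"


def close_op():
    return "close"


def ioctl_op():
    return "ioctl"
```

A call whose target is in the learned list is dispatched directly; any other target falls through to the learning code.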

(28) In one embodiment, the steps of FIG. 5 are implemented in x86 code according to Table 2 below.

(29) TABLE 2

Line no.   Label   Code
1                  cmp $lta0, %eax
2                  jz relative lta0
3                  cmp $lta1, %eax
4                  jz relative lta1
5                  . . .
6                  jmp learning relative

To update the various switches, such as IJS and OJS, with learned target addresses, a worker thread 112a is employed. Worker thread 112a is a kernel process 110a that runs periodically. When worker thread 112a runs, it performs two major functions, learning new target addresses and updating the jump switches, using a hash table. The hash table is described with reference to FIG. 6A. The learning routine is described with reference to FIG. 6B. The learning routine is governed by a policy which is described with reference to FIG. 7. The switch updating is described with reference to FIGS. 8 and 9.

(30) Referring now to FIG. 6A, hash table 620 is a representative one of a plurality of tables, each table being associated with one of the CPU cores 120a-n, 122a-n. In hash table 620, keys 622, 624, 626 are formed by performing a bit-wise combination of the branch source address with the branch target address and then taking the lower 8 bits of the combination. Using the lower 8 bits allows for 256 entries. In one embodiment, the bit-wise combination is a bit-wise exclusive-OR. Each entry in hash table 620 includes three items, the branch source address 622a, 624a, 626a, the branch target address 622b, 624b, 626b, and the count 622c, 624c, 626c.
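As a worked example of the key computation in the exclusive-OR embodiment, the following Python sketch uses two made-up 64-bit addresses; the function name `hash_key` and the addresses are illustrative assumptions.

```python
def hash_key(source: int, target: int) -> int:
    """Lower 8 bits of the XOR of the branch source and target addresses."""
    return (source ^ target) & 0xFF


# Hypothetical addresses: 0x34 XOR 0x78 == 0x4C in the low byte.
key = hash_key(0xFFFFFFFF81001234, 0xFFFFFFFF81205678)
```

Because only the low byte is kept, every key falls in the range 0 to 255, which matches the 256-entry table.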

(31) FIG. 6B depicts a flow of operations for IJS and OJS learning, in an embodiment. Learning occurs periodically, and when active, a learning flag is set. In one embodiment, learning occurs once every 60 seconds. In the figure, if the learning flag is true as determined in step 602, worker thread 112a in step 604 computes a key for a hash table 620 (depicted in FIG. 6A). The key is the lower eight bits of an XOR of the branch source address and the branch destination address. In step 606, worker thread 112a computes an entry for hash table 620 corresponding to the key. In one embodiment, the entry is the source instruction pointer (IP), the destination IP and a count of the number of invocations. In step 608, worker thread 112a adds the entry to hash table 620 at the computed key. When the learning is completed, the learning flag is made false, and worker thread 112a executes a fallback code function in the IJS in step 610. The fallback code may be either a retpoline if Spectre-vulnerable hardware is present or a normal indirect call.
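Steps 604 through 608 might look like the following Python sketch for a single per-core table. The `Entry` class and `record` function are hypothetical names, and the handling of key collisions is an assumption, since the text does not specify it.

```python
from dataclasses import dataclass

TABLE_SIZE = 256  # the lower 8 bits of the key allow 256 entries


@dataclass
class Entry:
    source: int   # source instruction pointer
    target: int   # destination instruction pointer
    count: int    # number of invocations


def record(table: list, source: int, target: int) -> None:
    key = (source ^ target) & 0xFF                 # step 604
    entry = table[key]
    if entry is not None and (entry.source, entry.target) == (source, target):
        entry.count += 1                           # same pair seen again
    else:
        # Assumption: a colliding pair simply replaces the old entry.
        table[key] = Entry(source, target, 1)      # steps 606-608


table = [None] * TABLE_SIZE
for _ in range(3):
    record(table, 0x1000, 0x2040)
```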

(32) FIG. 7 depicts a flow of operations for an IJS and OJS learning policy, in an embodiment. To implement the learning policy, worker thread 112a keeps track of three lists. The first list is a list of jump switches that are in learning mode. The second list is a list of stable jump switches, i.e., those having a single target. Switches in this list need not be disabled for learning because their fallback paths are to the learning routine. The third list is a list of unstable jump switches, which includes switches with an outlined block and those that have too many target addresses and were set not to have an outlined block.

(33) In step 702 of FIG. 7, if during an epoch (say every 60 seconds) no jump switches were updated, worker thread 112a selects a sublist of jump switches from the unstable list in step 704. In step 706, worker thread 112a converts the switches in the selected sublist to learning switches, i.e., disabling them and setting their fallback target to the learning routine.
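The policy of FIG. 7 could be sketched as below. The function name `relearn_if_idle` and the `batch_size` parameter are assumptions; the text says only that a sublist of the unstable switches is selected, not how large it is.

```python
def relearn_if_idle(updated_this_epoch: int, learning: list,
                    unstable: list, batch_size: int = 2) -> list:
    """Move a few unstable switches back to learning after an idle epoch."""
    if updated_this_epoch > 0:          # step 702: something was updated
        return []
    selected = unstable[:batch_size]    # step 704: pick a sublist
    del unstable[:batch_size]
    learning.extend(selected)           # step 706: convert to learning switches
    return selected


learning = ["js_a"]
unstable = ["js_b", "js_c", "js_d"]
moved = relearn_if_idle(0, learning, unstable)
```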

(34) FIG. 8 depicts a flow of operations for an IJS and OJS update, in an embodiment. In step 802, worker thread 112a receives an epoch or relearning event, where a relearning event is an event triggered by the user such as entering a steady system state after booting the kernel or changing the workload by starting a new process or container. In one embodiment, the epoch is one (1) second. In step 804, worker thread 112a creates a list of items, each of which is a call total and a source and destination pair over all of the CPU cores 120a-n, 122a-n. The list is created by summing calls in the hash table for each CPU core 120a-n, 122a-n. In step 806, worker thread 112a starts an iterator that runs through each source in the list. In step 808, worker thread 112a sorts the list of destinations for each source based on their hits, where a hit is a case in which the jump switch uses the target branch address that is in the hash table. In step 810, worker thread 112a starts an iterator that runs through each destination of the current source selected in step 806. If, as determined in step 812, the destination has not been promoted, then in step 814, worker thread 112a performs an action on the destination. The actions on the destination are described in more detail in reference to FIG. 9. After iterating through each source and destination in the list, worker thread 112a in step 816, clears all of the hash tables. Jump switches that are not in an update mode are in a usable mode, i.e., able to be executed.
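Steps 804 through 808 amount to summing the per-core counts and ranking the destinations of each source. A minimal Python sketch, assuming each per-core table is reduced to a list of (source, destination, count) tuples and using the hypothetical name `aggregate`:

```python
from collections import defaultdict


def aggregate(per_cpu_tables):
    """Sum counts across cores and sort each source's destinations by hits."""
    totals = defaultdict(lambda: defaultdict(int))
    for table in per_cpu_tables:                    # step 804
        for source, dest, count in table:
            totals[source][dest] += count
    return {
        source: sorted(dests.items(), key=lambda kv: kv[1], reverse=True)
        for source, dests in totals.items()        # steps 806-808
    }


cpu0 = [(0x10, 0xA, 5), (0x10, 0xB, 1)]
cpu1 = [(0x10, 0xB, 7)]
ranked = aggregate([cpu0, cpu1])
```

Here destination `0xB` ranks first for source `0x10` because its combined count across both cores (8) exceeds that of `0xA` (5).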

(35) FIG. 9 depicts a flow of operations for the update function used in FIG. 8, in an embodiment. Worker thread 112a executes the update function for each jump switch by performing an action on the jump switch, which may involve a set of targets for the switch. In step 902, the function starts an iterator over the jump switches in the set passed when the function is invoked. In step 904, the function matches the action for the current switch to one of four different actions. If the action is update, the function executes step 906, updating the IJS with one or more targets. If the action is switch, the function executes step 908 and, if the IJS is in learning mode, changes the mode to outline mode in step 910. If the action is add, the function executes step 912, adding or creating one or more targets for the OJS. If the action is max, the function executes step 914, determining whether the capacity of the IJS is at maximum. If so, then the function, in step 916, switches the mode of the IJS to fallback mode.
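The four actions of FIG. 9 can be illustrated with the Python sketch below. The `JumpSwitch` class, the string action names, and the reading of "capacity" as the number of outlined targets (capped at the example value of six) are all assumptions made for the sake of a runnable model.

```python
from dataclasses import dataclass, field
from typing import Optional

MAX_TARGETS = 6  # assumed capacity, taken from the OJS example limit


@dataclass
class JumpSwitch:
    mode: str = "learning"
    inline_target: Optional[int] = None
    outline_targets: list = field(default_factory=list)


def apply_action(js: JumpSwitch, action: str, targets=()) -> None:
    targets = list(targets)
    if action == "update":                          # step 906
        js.inline_target = targets[0]
    elif action == "switch":                        # steps 908-910
        if js.mode == "learning":
            js.mode = "outline"
    elif action == "add":                           # step 912
        js.outline_targets.extend(targets)
    elif action == "max":                           # steps 914-916
        if len(js.outline_targets) >= MAX_TARGETS:
            js.mode = "fallback"
    else:
        raise ValueError(f"unknown action: {action}")


js = JumpSwitch()
apply_action(js, "update", [0x2040])
apply_action(js, "switch")
```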

(36) FIG. 10A depicts a flow of operations for patching the operating system kernel 108 code, in an embodiment. To update a jump switch, worker thread 112a performs steps to ensure that the jump switch is safely updated. In one embodiment, the patching uses the text_poke function of the Linux kernel, which allows the safe modification of running code.

(37) The case of updating an IJS jump switch is depicted in FIG. 10A. As shown in the figure, the patch is performed in three phases, phase 1 (step 1002), phase 2 (step 1004), and phase 3 (step 1006), followed by a final step 1008, in which a check(interruption) function determines whether the kernel was preempted with a context switch during the patching.

(38) FIG. 10B depicts a flow of operations for phase 1 of patching the operating system kernel, in an embodiment. In step 1020, the worker thread 112a sets a breakpoint at Line 1 (L1) of the IJS code in Table 1. The breakpoint is set by writing a breakpoint opcode into the first byte of the instruction at L1. In step 1022, the worker thread 112a sets the instruction pointer to the return address on the stack. If the breakpoint is hit, as determined in step 1024, the CPU jumps to the retpoline code in step 1026. If the breakpoint is not hit, then in step 1028 the phase 1 function returns.

(39) FIG. 10C depicts a flow of operations for phase 2 of patching the operating system kernel, in an embodiment. In step 1030, the worker thread 112a waits for a quiescent period of time to ensure that no thread runs the instructions in lines 2-5. In an embodiment in which the operating system kernel is the Linux kernel, this is performed by calling the synchronize_sched function. In step 1032, the worker thread 112a writes lines 2-5 with replacement code. In step 1034, the function returns.

(40) FIG. 10D depicts a flow of operations for phase 3 of patching the operating system kernel, in an embodiment. In step 1040, the worker thread 112a sets a breakpoint at L1 and in step 1042 sets the instruction pointer to the return address on the stack. If the breakpoint is hit, as determined in step 1044, the CPU 118a-n jumps to the retpoline code. If not, then the worker thread 112a restores the CMP opcode in L1 and returns in step 1050.

(41) FIG. 10E depicts a flow of operations for checking an interruption while patching the operating system kernel, in an embodiment. If, as determined in step 1060, the operating system kernel 108 performed a context switch, then in step 1062, the saved instruction pointer (IP) is set to L1 of the code in Table 1. Setting the IP to L1 ensures that the code will be executed again when the worker thread 112a is re-scheduled.
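The ordering of the three phases can be illustrated on a plain byte buffer. The following Python sketch is a simplified model under stated assumptions: the byte values are placeholders rather than real instruction encodings (apart from 0xCC, the x86 breakpoint opcode), `l1_len` is a hypothetical length for line 1, and neither the quiescent wait nor the context-switch check of FIG. 10E is modeled.

```python
INT3 = 0xCC  # x86 breakpoint opcode


def patch(code: bytearray, l1_len: int, replacement: bytes) -> list:
    """Rewrite everything after line 1; return a snapshot after each phase."""
    assert len(replacement) == len(code) - l1_len
    snapshots = []
    saved_opcode = code[0]
    code[0] = INT3                    # phase 1: breakpoint on first byte of L1
    snapshots.append(bytes(code))
    # Phase 2: real code would first wait for a quiescent period here.
    code[l1_len:] = replacement       # rewrite lines 2-5
    snapshots.append(bytes(code))
    code[0] = saved_opcode            # phase 3: restore the opcode in L1
    snapshots.append(bytes(code))
    return snapshots


code = bytearray(b"\x3d\x01\x02AAAA")   # placeholder bytes for the IJS
snaps = patch(code, 3, b"BBBB")
```

While the body is being rewritten, the first byte of line 1 holds the breakpoint, so in this model a thread entering the switch would trap instead of running partially written code.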

(42) FIG. 11 depicts a flow of operations for a plug-in for an operating system compiler, in an embodiment. In one embodiment, the compiler is the GNU compiler when the operating system is the Linux operating system. The plug-in is built during a kernel build and assists in the operation of worker thread 112a.

(43) Referring to FIG. 11, if a compiler build-option flag, CONFIG, is true, as determined in step 1102, then the compiler compiles the operating system to use jump switches according to the following steps. In step 1106, the plug-in starts an iterator over each indirect call. In step 1108, the plug-in replaces each indirect call with a jump switch code template, which contains the basic jump switch code, such as the code in Table 1, but with the jump switch set to execute only fallback code. In step 1110, the plug-in writes the instruction pointer (IP) and register used by the call to a new file section of a standard file format, such as an executable and linkable format (ELF) file, used by the compiler. The new section of the ELF file contains information that is read during boot of operating system kernel 108 to compose a list of calls so that worker thread 112a can easily recognize which register is used in each jump switch. The information also serves as a precaution to prevent worker thread 112a from patching the wrong code.
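Steps 1106 through 1110 can be pictured with the Python sketch below, which operates on a made-up instruction list rather than on real compiler internals. The tuple layout, the opcode labels, and the `instrument` function are hypothetical; no actual compiler plug-in API is shown.

```python
def instrument(instructions):
    """instructions: list of (ip, opcode, operand) tuples (toy representation)."""
    rewritten, section = [], []
    for ip, opcode, operand in instructions:                 # step 1106
        if opcode == "call_indirect":
            # Step 1108: substitute the template, initially fallback-only.
            rewritten.append((ip, "jump_switch_template", operand))
            # Step 1110: record the call site for the dedicated section.
            section.append({"ip": ip, "reg": operand})
        else:
            rewritten.append((ip, opcode, operand))
    return rewritten, section


program = [
    (0x100, "mov", "%eax"),
    (0x104, "call_indirect", "%eax"),
    (0x10A, "ret", None),
]
rewritten, section = instrument(program)
```

The `section` list plays the role of the extra file section: it tells a later patching pass where each jump switch is and which register it uses.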

(44) Worker thread 112a is integrated into the operating system kernel in a manner similar to other periodic tasks which patch code, such as the static-keys, jump-label and alternatives infrastructure in the Linux operating system.

(45) Thus, jump switches are able to dynamically adapt to changing workloads and to take advantage of information only available at runtime. Jump switches are integrated into the operating system kernel, requiring no source code changes to the kernel, and designed for minimal overhead as they only operate to protect indirect calls rather than the entire binary of the operating system kernel.

(46) The various embodiments described herein may employ various computer-implemented operations involving data stored in computer systems. For example, these operations may require physical manipulation of physical quantities; usually, though not necessarily, these quantities may take the form of electrical or magnetic signals, where they or representations of them are capable of being stored, transferred, combined, compared, or otherwise manipulated. Further, such manipulations are often referred to in terms, such as producing, identifying, determining, or comparing. Any operations described herein that form part of one or more embodiments of the invention may be useful machine operations. In addition, one or more embodiments of the invention also relate to a device or an apparatus for performing these operations. The apparatus may be specially constructed for specific required purposes, or it may be a general purpose computer selectively activated or configured by a computer program stored in the computer. In particular, various general purpose machines may be used with computer programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required operations.

(47) The various embodiments described herein may be practiced with other computer system configurations including hand-held devices, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like.

(48) One or more embodiments of the present invention may be implemented as one or more computer programs or as one or more computer program modules embodied in one or more computer readable media. The term computer readable medium refers to any data storage device that can store data which can thereafter be input to a computer system. Computer readable media may be based on any existing or subsequently developed technology for embodying computer programs in a manner that enables them to be read by a computer. Examples of a computer readable medium include a hard drive, network attached storage (NAS), read-only memory, random-access memory (e.g., a flash memory device), a CD (Compact Disc) such as a CD-ROM, a CD-R, or a CD-RW, a DVD (Digital Versatile Disc), a magnetic tape, and other optical and non-optical data storage devices. The computer readable medium can also be distributed over a network coupled computer system so that the computer readable code is stored and executed in a distributed fashion.

(49) Although one or more embodiments of the present invention have been described in some detail for clarity of understanding, it will be apparent that certain changes and modifications may be made within the scope of the claims. Accordingly, the described embodiments are to be considered as illustrative and not restrictive, and the scope of the claims is not to be limited to details given herein, but may be modified within the scope and equivalents of the claims. In the claims, elements and/or steps do not imply any particular order of operation, unless explicitly stated in the claims.

(50) Plural instances may be provided for components, operations or structures described herein as a single instance. Finally, boundaries between various components, operations and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the invention(s). In general, structures and functionality presented as separate components in exemplary configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements may fall within the scope of the appended claim(s).


CPC classification

H04M3/5175 (ELECTRICITY)
G06F21/54 (PHYSICS)
G06F2221/033 (PHYSICS)
G06N20/00 (PHYSICS)
H04M3/42221 (ELECTRICITY)
G06F21/52 (PHYSICS)
H04M3/5191 (ELECTRICITY)
H04M3/42059 (ELECTRICITY)
H04M3/523 (ELECTRICITY)
G06F2212/65 (PHYSICS)
G06F9/4486 (PHYSICS)
G06F9/30058 (PHYSICS)
G06F9/30101 (PHYSICS)
G06F9/35 (PHYSICS)
G06F9/3836 (PHYSICS)
G06F12/10 (PHYSICS)
G06F9/3806 (PHYSICS)
G06F9/30065 (PHYSICS)

International classification

G06F9/38 (PHYSICS)
G06F9/30 (PHYSICS)
