Uniquified FPGA virtualization approach to hardware security
11017125 · 2021-05-25
Assignee
Inventors
- Greg M. Stitt (Gainesville, FL, US)
- Kai Yang (Gainesville, FL, US)
- Swarup Bhunia (Gainesville, FL, US)
- Robert A. Karam (Temple Terrace, FL, US)
Cpc classification
H04L63/0428
ELECTRICITY
G09C1/00
PHYSICS
H04L9/003
ELECTRICITY
G06F21/76
PHYSICS
G06F30/34
PHYSICS
International classification
G06F21/76
PHYSICS
G09C1/00
PHYSICS
H04L9/00
ELECTRICITY
G06F12/14
PHYSICS
Abstract
Novel methods of virtualization with unique virtual architectures on field-programmable gate arrays (FPGAs) are provided. A hardware security method can include providing one or more field-programmable gate arrays (FPGAs), and creating an application specialized virtual architecture (or overlay) over the one or more FPGAs (for example, by providing an overlay generator). Unique bitfiles that configure the overlays implemented on the FPGAs can be provided for each deployed FPGA. The application specialized virtual architecture can be constructed using application code, or functions from a domain, to create an overlay represented by one or more hardware description languages (e.g., VHDL).
Claims
1. A method of programming a plurality of field-programmable gate arrays (FPGAs), the method comprising: generating a first overlay from an application code adapted to program the plurality of FPGAs; generating, from the first overlay, a plurality of unique overlays each associated with a different one of the FPGAs and each using a configuration key associated with the FPGA; generating a first set of bitfiles each associated with a different one of the unique overlays; and programming the plurality of the FPGAs using the first set of bitfiles.
2. The method of claim 1, wherein the first overlay is constructed using the application code, or functions from a domain, to create an overlay represented by one or more hardware description languages.
3. The method of claim 2, further comprising compiling an overlay hardware description language using FPGA CAD tools, resulting in an FPGA bitfile that programs the FPGA with the overlay.
4. The method of claim 1, wherein, after generating each of the unique overlays, an application compilation tool flow follows a set of steps including optimization, scheduling, resource allocation, binding, and mapping to generate the corresponding bitfile that programs the corresponding FPGA.
5. The method of claim 4, wherein the application compilation tool flow repeats if an application is changed.
6. The method of claim 1, wherein an overlay tool flow executes only once to create an initial overlay.
7. The method of claim 1, wherein an overlay tool flow executes more than once if application requirements change.
8. The method of claim 1, wherein generating the plurality of unique overlays from the first overlay is achieved using one or more of the following modifications: Instruction Set Randomization, Instruction Order Randomization, and Interconnect Randomization.
9. The method of claim 1 further comprising: compiling a hardware description language representative of the first overlay to generate a second set of bitfiles; and programming the plurality of the FPGAs using the second set of bitfiles.
Description
DETAILED DESCRIPTION
(7) Embodiments of the present invention include uniquified methods for hardware security. More specifically, embodiments of the present invention include novel usages of virtualization with unique virtual architectures on field-programmable gate arrays.
(8) Unlike the typical FPGA flow, embodiments of the present invention do not have to implement an application directly on a field-programmable gate array (FPGA). Instead, an application-specialized virtual architecture overlay can first be created that is implemented atop the FPGA.
(9) Embodiments of the present invention extend these characteristics to include improved security, as shown in
(10)
(11) After generating an overlay, the application compilation tool flow can follow a set of steps similar to high-level synthesis and compilation, but specialized for each overlay. Specifically, optimization, scheduling, resource allocation, binding, and mapping can depend on the overlay architecture and the uniquifications. These steps can output a separate overlay bitfile that configures the overlay that is already implemented on the FPGA. The application compilation tool flow repeats every time the application is changed, but generally the overlay tool flow will only execute once to create the initial overlay, or possibly several times if application requirements change significantly.
(12) Unique overlays do not support portable overlay bitstreams across devices by design, which requires compilation or high-level synthesis to generate a unique bitstream for every overlay. To identify the uniquifications for a particular device, the compiler uses the configuration key for each FPGA. This key/device ID pair is only known to the device and/or application manufacturer. The compiler can then guarantee by construction the correctness of the mapping because it is aware of the modifications made to the overlay for that device. Similarly, during an update, the ID can be retrieved from the device, and the corresponding key can be looked up in the manufacturer's database.
(13) Although the presented security advantages can potentially be gained from any overlay, a modified Malleable Hardware (MAHA) overlay can be used, as shown in
(14)
(15) The basic building block of MAHA is the Memory Logic Block (MLB), as shown in
(16) During program execution, MAHA fetches instructions from the schedule table. Depending on the instructions, MAHA will either select a datapath result or lookup table result for a register write back. The lookup table can enable energy-efficient evaluation of complex functions, while the lightweight datapath can enable rapid execution of common functions for a domain. MAHA can use a number of MLBs interconnected in a multi-level hierarchy, the structure of which depends on the particular application. During compilation, instructions, data, and lookup table memory are spatially distributed among the MLBs, while local instructions within a given MLB execute temporally. The architecture is fully customizable for the target application, and can be implemented as an FPGA overlay using standard FPGA CAD tool flows.
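The fetch, select, and write-back behavior described above can be sketched as a toy Python model. The instruction fields (`use_lut`, `op`, `src_a`, `src_b`, `dst`), the two-operation datapath, the 4-bit operand width, and the squaring lookup table are all assumptions made for illustration; the actual MLB instruction format is not given in this description.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    use_lut: bool   # True: write back the lookup-table result; False: the datapath result
    op: str         # datapath operation name (hypothetical: "add" or "xor")
    src_a: int      # source register indices
    src_b: int
    dst: int        # destination register index

def run_mlb(schedule_table, lut, regs):
    """Execute every schedule-table entry once, in order, on a copy of regs."""
    regs = list(regs)
    for instr in schedule_table:                      # fetch from the schedule table
        a, b = regs[instr.src_a], regs[instr.src_b]
        datapath = (a + b) & 0xF if instr.op == "add" else a ^ b
        lut_result = lut[a]                           # LUT is indexed by operand a only
        regs[instr.dst] = lut_result if instr.use_lut else datapath  # write-back select
    return regs

# Hypothetical 16-entry LUT holding x*x mod 16, a function the small datapath lacks.
SQUARE_LUT = [(x * x) & 0xF for x in range(16)]

program = [
    Instr(use_lut=False, op="add", src_a=0, src_b=1, dst=2),  # r2 = r0 + r1
    Instr(use_lut=True,  op="add", src_a=2, src_b=2, dst=3),  # r3 = LUT[r2]
]
final_regs = run_mlb(program, SQUARE_LUT, [3, 2, 0, 0])
```

In this sketch a single bit per instruction chooses between the two result sources, which mirrors the select step described above without claiming to match the real encoding.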
(17) The major security advantages of MAHA can be obtained via uniquification using the following modifications:
(18) Instruction Set Randomization (ISR): modification of the instruction encoding
(19) Instruction Order Randomization (IOR): permuting of the schedule table and program counter sequence
(20) Interconnect Randomization (ICR): modifying the inter-MLB communication network
(21) Of these, ISR can be applied to standard microprocessors for the purpose of software security, whereas IOR is more difficult to implement on a standard microprocessor due to the requirement of instruction caching. Instruction caching is not an issue with the MLB for the following reasons: 1) the schedule table size can be modified to match application requirements, and 2) applications can be mapped spatially, using additional MLBs as necessary up to the FPGA's physical resource limit. Finally, the ICR implementation depends on the MLB interconnect structure, which can vary between implementations of the overlay, and has a direct impact on the possible application mappings for that instance.
(22) These uniquification approaches can be implemented within each MLB by making small modifications to the MLB structure. Specifically, ISR requires that either 1) a permutation network be used to permute the order of inputs to certain functional units, which does not require resynthesis, but does increase area and delay, or 2) the encoding be uniquified at the register transfer language (RTL) level, which does not affect timing, but does require resynthesis. The time required for resynthesis can vary depending on the size of the design, but in general this can be mitigated by leveraging incremental compilation to only resynthesize the small fraction of the design performing instruction decoding in each MLB. An analysis of the recompilation time with and without partitioning MAHA's instruction encoding module demonstrated a 24% average decrease in compilation time, which can make resynthesis more practical.
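As a rough illustration of the second ISR option (a uniquified encoding), the Python sketch below derives a per-device opcode map from a key and shows that a program encoded for one device decodes correctly only with that device's map. The opcode names, the 4-bit opcode width, and the use of Python's `random.Random` as the keyed generator are assumptions for illustration only; `random.Random` is not a cryptographic generator, and in the described approach the encoding would be fixed in the overlay RTL rather than in software.

```python
import random

BASE_OPCODES = {"nop": 0, "add": 1, "xor": 2, "lut": 3}  # hypothetical baseline encoding
OPCODE_BITS = 4                                          # hypothetical opcode width

def uniquified_encoding(key):
    """Derive a per-device opcode map by sampling distinct codes with a keyed PRNG."""
    rng = random.Random(key)  # illustration only; not a cryptographic generator
    codes = rng.sample(range(2 ** OPCODE_BITS), len(BASE_OPCODES))
    return dict(zip(BASE_OPCODES, codes))

def encode(program, key):
    """Translate mnemonics into the opcode values of the device identified by key."""
    enc = uniquified_encoding(key)
    return [enc[mnemonic] for mnemonic in program]

def decode(words, key):
    """Translate opcode values back to mnemonics; unknown values decode as 'illegal'."""
    dec = {code: mnemonic for mnemonic, code in uniquified_encoding(key).items()}
    return [dec.get(word, "illegal") for word in words]

program = ["add", "lut", "xor", "nop"]
words = encode(program, key=0x1234)
round_trip = decode(words, key=0x1234)
```

Decoding `words` with a different key will generally yield a different or partly illegal instruction stream, which is the property ISR relies on.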
(23) IOR can be implemented more simply using a cryptographically secure sequence generator, namely a maximal period nonlinear feedback shift register. The seed for this generator can be changed in the bitstream without resynthesizing the overlay.
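The mechanism can be illustrated with a toy 4-bit nonlinear feedback shift register in Python. This is only a sketch under stated assumptions: the register width, the feedback function, and the helper names are chosen for illustration and are not taken from this description, and a 4-bit register is far too small to be cryptographically secure. The linear part is a maximal-period LFSR; the extra nonlinear term adds the all-zero state to the cycle so that all 16 schedule-table addresses are visited.

```python
def nlfsr4_step(state):
    """Advance a 4-bit NLFSR by one step.

    The linear part is the maximal-period LFSR a[k+4] = a[k] ^ a[k+1]; the extra
    term splices the all-zero state into the cycle, giving a period of 16.
    """
    linear = (state ^ (state >> 1)) & 1
    upper_zero = 1 if (state >> 1) == 0 else 0   # nonlinear term: bits 3..1 all zero
    feedback = linear ^ upper_zero
    return (state >> 1) | (feedback << 3)

def pc_sequence(seed, length=16):
    """Program-counter order for a 16-entry schedule table, starting from seed."""
    seq, state = [], seed & 0xF
    for _ in range(length):
        seq.append(state)
        state = nlfsr4_step(state)
    return seq

def place_instructions(program, seed):
    """Store program[i] at the schedule-table address the PC visits at step i."""
    table = [None] * 16
    for instr, addr in zip(program, pc_sequence(seed)):
        table[addr] = instr
    return table

table = place_instructions(["i0", "i1", "i2", "i3"], seed=1)
```

Because this toy generator has a single cycle, changing the seed only rotates the visiting order; it nevertheless shows how a seed stored in the bitstream can change where each instruction must be placed without touching the overlay logic.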
(24) Finally, ICR can be implemented by beginning with a fully-connected network of MLBs, and cutting specific connections between MLBs by modifying connection and switch box configuration bits in the FPGA. As long as the overlay bitfile mapping tool, which generates the instructions for the MLBs, is cognizant of the modifications (ISR, IOR, and ICR), it can create a functionally correct and latency-aware mapping with little variation in performance across all uniquified overlays.
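A minimal Python sketch of this idea follows. It assumes, purely for illustration, that the ICR portion of a configuration key supplies one bit per directed inter-MLB link (a 0 bit cuts the link), and that a mapping tool estimates communication latency as the number of hops between MLBs. Neither the key layout nor the latency model is specified in this description.

```python
from collections import deque
from itertools import permutations

def uniquified_links(k, key_bits):
    """Start fully connected over k MLBs; key bit i == 0 cuts the i-th directed link."""
    all_links = list(permutations(range(k), 2))   # k*(k-1) directed links
    return {link for link, bit in zip(all_links, key_bits) if bit}

def hop_latency(links, src, dst):
    """Fewest inter-MLB hops from src to dst, or None if unreachable (BFS)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for a, b in links:
            if a == node and b not in dist:
                dist[b] = dist[node] + 1
                queue.append(b)
    return None

# Three MLBs; this hypothetical key keeps only the ring 0 -> 1 -> 2 -> 0.
ring = uniquified_links(3, [1, 0, 0, 1, 1, 0])
full = uniquified_links(3, [1] * 6)
```

A mapping tool that knows the surviving links can route an intermediate result over several hops when the direct link has been cut, at the cost of extra cycles.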
(25) Because a uniquified bitstream is not portable among devices, each FPGA may need to provide a device ID. Each device can have a corresponding randomly-generated configuration key that specifies all ISR, IOR, and ICR modifications. The key/device ID association is known only to the manufacturer. During an update, the ID can be retrieved from the target device, and the corresponding key can be looked up in the manufacturer's database. The compiler can then guarantee by construction the correctness of the mapped application because it is aware of the modifications made to the overlay for that device.
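The update flow described above can be sketched in Python as follows. The class and function names, the 64-bit key width, and the way the key is sliced into ISR, IOR, and ICR fields are hypothetical choices for illustration; the description states only that a randomly generated key per device specifies all three kinds of modification and that the association is held by the manufacturer.

```python
import secrets

KEY_BITS = 64  # hypothetical key width

class ManufacturerDB:
    """In-memory stand-in for the manufacturer's private key/device-ID table."""

    def __init__(self):
        self._keys = {}

    def provision(self, device_id):
        """Assign a fresh random configuration key to a newly deployed device."""
        if device_id in self._keys:
            raise ValueError(f"device {device_id!r} already provisioned")
        self._keys[device_id] = secrets.randbits(KEY_BITS)
        return self._keys[device_id]

    def lookup(self, device_id):
        """Return the key for a device ID reported during an update."""
        return self._keys[device_id]

def split_key(key):
    """Slice one configuration key into hypothetical ISR, IOR, and ICR fields."""
    return {
        "isr": key & 0xFFFF_FFFF,      # low 32 bits: instruction-encoding choice
        "ior": (key >> 32) & 0xFFFF,   # next 16 bits: sequence-generator seed
        "icr": (key >> 48) & 0xFFFF,   # top 16 bits: interconnect cut mask
    }

def build_update(db, device_id, app_name):
    """Look up the key for the reported ID and tag the update with its fields."""
    fields = split_key(db.lookup(device_id))
    return {"device_id": device_id, "app": app_name, **fields}

db = ManufacturerDB()
key = db.provision("dev-001")
update = build_update(db, "dev-001", "fir")
```

A compiler given the three fields in `update` would then have what it needs to map the application for that one device.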
(26) The methods and processes described herein can be embodied as code and/or data. The software code and data described herein can be stored on one or more machine-readable media (e.g., computer-readable media), which may include any device or medium that can store code and/or data for use by a computer system. When a computer system and/or processor reads and executes the code and/or data stored on a computer-readable medium, the computer system and/or processor performs the methods and processes embodied as data structures and code stored within the computer-readable storage medium.
(27) It should be appreciated by those skilled in the art that computer-readable media include removable and non-removable structures/devices that can be used for storage of information, such as computer-readable instructions, data structures, program modules, and other data used by a computing system/environment. A computer-readable medium includes, but is not limited to, volatile memory such as random access memories (RAM, DRAM, SRAM); and non-volatile memory such as flash memory, various read-only-memories (ROM, PROM, EPROM, EEPROM), magnetic and ferromagnetic/ferroelectric memories (MRAM, FeRAM), and magnetic and optical storage devices (hard drives, magnetic tape, CDs, DVDs); network devices; or other media now known or later developed that are capable of storing computer-readable information/data. Computer-readable media should not be construed or interpreted to include any propagating signals. A computer-readable medium of the subject invention can be, for example, a compact disc (CD), digital video disc (DVD), flash memory device, volatile memory, or a hard disk drive (HDD), such as an external HDD or the HDD of a computing device, though embodiments are not limited thereto. A computing device can be, for example, a laptop computer, desktop computer, server, cell phone, or tablet, though embodiments are not limited thereto.
(28) The subject invention includes, but is not limited to, the following exemplified embodiments.
(29) Embodiment 1. A computer or hardware security method comprising:
(30) providing one or more field-programmable gate arrays (FPGAs); and
(31) creating an application specialized virtual architecture (or overlay) over the one or more FPGAs (e.g., by providing an overlay generator).
(32) Embodiment 2. The method of embodiment 1, wherein unique bitfiles (that configure the overlays implemented on the FPGA) are provided for each deployed FPGA.
(33) Embodiment 3. The method of any of embodiments 1 to 2, wherein the application specialized virtual architecture is constructed using application code, or functions from a domain, to create an overlay represented by one or more hardware description languages (e.g., VHDL).
(34) Embodiment 4. The method of any of embodiments 1 to 3, further comprising compiling the overlay hardware description language using FPGA CAD tools (e.g., Vivado, Quartus), resulting in an FPGA bitfile that programs the FPGA with the overlay.
(35) Embodiment 5. The method of any of embodiments 1 to 4, wherein, after generating the overlay, an application compilation tool flow follows a set of steps including optimization, scheduling, resource allocation, binding, and mapping, as well as compiling the unique or application-specialized overlay (depending on the overlay architecture and uniquifications).
(36) Embodiment 6. The method of any of embodiments 1 to 5, wherein the application compilation tool flow repeats when the application is changed.
(37) Embodiment 7. The method of any of embodiments 1 to 6, wherein the overlay tool flow executes only once to create an initial overlay.
(38) Embodiment 8. The method of any of embodiments 1 to 7, wherein the overlay tool flow executes multiple times (e.g., if application requirements change).
(39) Embodiment 9. The method of any of embodiments 1 to 8, further comprising providing a compiler that uses a configuration key for each FPGA (a key/device ID pair) to identify uniquifications for a particular device.
(40) Embodiment 10. The method of any of embodiments 1 to 9, wherein the key/device ID pair is only known to the device manufacturer, or the application manufacturer, or both.
(41) Embodiment 11. The method of any of embodiments 1 to 10, wherein the compiler ensures correctness of mapped applications by construction (by being aware of the modifications made to the overlay for that device).
(42) Embodiment 12. The method of any of embodiments 1 to 11, wherein the ID can be retrieved from the device during an update and the corresponding key is referenced in the manufacturer's database.
(43) Embodiment 13. The method of any of embodiments 1 to 12, wherein the overlay is a Malleable Hardware (MAHA) overlay or equivalent overlay (or alternative overlay).
(44) Embodiment 14. The method of any of embodiments 1 to 13, wherein the overlay (e.g. MAHA overlay) incorporates Memory Logic Blocks (MLBs).
(45) Embodiment 15. The method of any of embodiments 1 to 14, wherein uniquification is achieved using one or more of the following modifications: Instruction Set Randomization (ISR)—modification of the instruction encoding; Instruction Order Randomization (IOR)—permuting of the schedule table and program counter sequence; and Interconnect Randomization (ICR)—modifying the inter-MLB communication network.
(46) Embodiment 16. The method of any of embodiments 1 to 15, wherein uniquification is implemented within each MLB.
(47) Embodiment 17. The method of any of embodiments 1 to 16, wherein ISR includes permuting the order of inputs to functional units and/or uniquifying the encoding at the register transfer language (RTL) level.
(48) Embodiment 18. The method of any of embodiments 1 to 17, wherein incremental compilation is applied to only resynthesize the fraction of the design (or overlay) performing instruction decoding in each MLB.
(49) Embodiment 19. The method of any of embodiments 1 to 18, wherein IOR is implemented using a cryptographically secure sequence generator (e.g., a maximal period nonlinear feedback shift register).
(50) Embodiment 20. The method of any of embodiments 1 to 19, wherein the seed for the cryptographically secure sequence generator is changed in the bitstream without resynthesizing the overlay.
(51) Embodiment 21. The method of any of embodiments 1 to 20, wherein the ICR is implemented by beginning with a fully-connected network of MLBs, and cutting specific connections between MLBs by modifying connection and switch box configuration bits in the FPGA. (ICR, IOR, and ISR are fairly specific to MAHA. If the overlay is an alternative overlay, then these three may not apply, but alternative methods for uniquification would be available which are specific to that overlay.)
(52) Embodiment 22. The method of any of embodiments 1 to 21, wherein the overlay bitfile mapping tool, which generates the instructions for the MLBs, is cognizant of the modifications (ISR, IOR, and ICR) and creates a functionally correct and latency-aware mapping (allowing for little variation in performance across all uniquified overlays).
(53) Embodiment 23. The method of any of embodiments 1 to 22, wherein each FPGA provides a device ID.
(54) Embodiment 24. The method of any of embodiments 1 to 23, wherein each device has a corresponding randomly-generated configuration key that specifies all ISR, IOR, and ICR modifications.
(55) Embodiment 25. The method of any of embodiments 1 to 24, wherein the method includes generating an application-specialized overlay, and then compiling the application-specialized overlay using FPGA CAD tools.
(56) A greater understanding of the present invention and of its many advantages may be had from the following examples, given by way of illustration. The following examples are illustrative of some of the methods, applications, embodiments and variants of the present invention. They are, of course, not to be considered as limiting the invention. Numerous changes and modifications can be made with respect to the invention.
EXAMPLE 1
(57) To gain a better understanding of and prove the concepts underlying embodiments of the present invention, an overhead analysis was conducted. Specifically, the energy, performance, and area were evaluated on a 40 nm Stratix IV FPGA, both with and without a MAHA overlay. Vector-based dynamic/total power and clock frequency were measured using Altera Quartus II. To evaluate the overlay, 10 benchmarks from image/signal processing and security domains were chosen, then functional correctness of each benchmark was verified using Altera ModelSim with 10,000 input vectors. For each benchmark, specialized overlays were generated by 1) adding/removing MLBs (up to 8), 2) optimizing the bitstream size by customizing the lookup table and schedule table sizes, and 3) specializing the datapath block. Our compiler generated all instructions for each specialized instruction set. The execution time of each benchmark was calculated based on the number of cycles needed, multiplied by the clock period as determined by TimeQuest Timing Analyzer.
(58) Table I compares the MAHA overlay with direct FPGA implementations. On average, the MAHA overlay obtained its security advantages at a cost of 1.5× execution time, 1.9× energy, and 1.7× power. The energy-delay-product increase from the MAHA overlay was 2.8×. MAHA reduced bitstream sizes by an average of 2,190×, while using 855 LUTs on average, compared to 255 for the FPGA—an average LUT increase of 3.4×. Although this ratio may seem significant, it is inflated due to the FPGA circuits being extremely small. Furthermore, the FPGA circuits are application specific, whereas the overlay has flexibility to implement other applications, which suggests not all these extra LUTs were overhead.
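The derived figures quoted above can be checked against the average row of Table I with a few lines of Python. The sketch assumes that the energy-delay product uses the dynamic energy ratio, which this description does not state explicitly, and it treats the rounded table entry 10.9M as 10.9 million bits.

```python
# Average-row values copied from Table I.
avg_exec_ratio = 1.47         # MAHA / FPGA execution time
avg_dyn_energy_ratio = 1.92   # MAHA / FPGA dynamic energy
maha_bits, fpga_bits = 4976, 10.9e6
maha_aluts, fpga_aluts = 855, 255

# Energy-delay product ratio, assuming dynamic energy is the energy term.
edp_ratio = avg_exec_ratio * avg_dyn_energy_ratio

# How many times smaller the overlay bitfile is than the FPGA bitstream.
bitstream_reduction = fpga_bits / maha_bits

# How many times more combinational ALUTs the overlay uses.
alut_increase = maha_aluts / fpga_aluts
```

Under these assumptions the three values come out at about 2.8, about 2,190, and about 3.4, in line with the text.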
EXAMPLE 2
(59) To further explain and prove the concepts underlying embodiments of the present invention, a security analysis was conducted.
(60) Qualitatively, the overlay acts as an obfuscation layer which makes it more difficult for an attacker to reverse engineer the design. Namely, an adversary needs the following information to attack the system:
(61) 1) access to the uniquified FPGA bitstream, and knowledge of its format (e.g. location and meaning of LUT content bits and routing information)
(62) 2) access to the overlay bitfile, and knowledge of its format (e.g. instruction format, location of LUT content), which varies among all devices
(63) The potential impact from the diversification approaches of embodiments of the present invention was then analyzed. Quantitatively, the level of security against brute force attacks, defined as the number of brute force attempts required to reverse engineer the IP mapped to the overlay, is fairly straightforward for ISR and IOR:
(64) ISR: 2^n, where n is the instruction bitwidth
(65) IOR: m!, where m is the size of the schedule table
(66) This is because there are 2^n possible binary encodings for an n-bit instruction, and the order of the m instructions in a given MLB can be permuted in m! ways. Furthermore, the particular implementations of ISR and IOR used can differ among all K MLBs in the system. During application mapping, nodes from the application's control/data flow graph (CDFG) are distributed among all available MLBs, so a functionally correct mapping can only be realized with proper execution within each MLB. Therefore, the overall security from these two approaches is an exponential function of the number of MLBs, as shown in Equation 1:
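(67) S_1(n, m, K) = \prod_{i=1}^{K} 2^{n} \cdot m_i! (1)
where m_i is the size of the schedule table in the i-th MLB.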
(68) This expression assumes that the size of the schedule table may differ between MLBs, but the instruction width is constant.
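As a numeric check, the following Python sketch evaluates the ISR and IOR search space in bits. It assumes the per-MLB contributions multiply, that is, each MLB contributes 2^n encodings times m_i! orderings, as the preceding paragraphs describe. The function name and argument layout are illustrative.

```python
import math

def log2_s1(n, schedule_sizes):
    """log2 of the ISR+IOR search space: sum over MLBs of n + log2(m_i!)."""
    return sum(n + math.log2(math.factorial(m)) for m in schedule_sizes)

one_mlb = log2_s1(32, [56])        # n = 32, one MLB with a 56-entry schedule table
two_mlbs = log2_s1(32, [30, 30])   # n = 32, two MLBs with 30 entries each
```

Rounded to whole bits, `one_mlb` is 281 and `two_mlbs` is 279, matching the two figures quoted later in this example for a single 56-instruction MLB and for two 30-instruction MLBs.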
(69) The security provided by ICR depends not only on the number of MLBs, but also on their particular interconnect configuration; varying that configuration results in different mappings and different power/performance/area tradeoffs. For demonstration, consider K identical, fully connected MLBs, and an application which is perfectly parallelizable. For each application mapping, the total number of possible placements is K! because a particular subgraph of the original CDFG may be placed into any MLB, and only the routing, encoded within the instructions, needs to be updated to match. In other words, all placements on a fully connected network of identical MLBs are isomorphic, and therefore the placement algorithm is free to assign any subgraph to any MLB. For security purposes, this isomorphism is not ideal, because the overlay bitfiles will not differ significantly.
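(70) TABLE I: ENERGY, PERFORMANCE, AND SIZE OF MAPPED APPLICATIONS, WITH AND WITHOUT MAHA OVERLAY

| Benchmark | Exec. Time MAHA (μs) | Exec. Time FPGA (μs) | Exec. Time Ratio | Dyn. Energy MAHA (nJ) | Dyn. Energy FPGA (nJ) | Dyn. Energy Ratio | Total Energy MAHA (nJ) | Total Energy FPGA (nJ) | Total Energy Ratio | Bitstr. Size MAHA (b) | Bitstr. Size FPGA (b) | Comb. ALUTs MAHA | Comb. ALUTs FPGA | Total Mem. MAHA (b) | Total Mem. FPGA (b) |
|------|-------|------|------|--------|-------|------|--------|--------|------|-------|-------|------|-----|-------|------|
| CI   | 0.24  | 0.29 | 0.83 | 3.98   | 2.24  | 1.78 | 8.04   | 6.76   | 1.19 | 4368  | 10.9M | 1227 | 87  | 36216 | 1920 |
| FIR  | 0.21  | 0.20 | 1.05 | 3.16   | 1.58  | 2.00 | 6.96   | 4.47   | 1.56 | 3440  | 10.9M | 887  | 101 | 29384 | 1920 |
| AES  | 5.23  | 2.56 | 2.04 | 96.67  | 41.82 | 2.31 | 210.01 | 102.08 | 2.06 | 7456  | 10.9M | 824  | 771 | 60016 | 1920 |
| DES  | 4.51  | 2.50 | 1.80 | 77.32  | 37.45 | 2.06 | 182.81 | 98.43  | 1.86 | 13392 | 10.9M | 831  | 104 | 97280 | 1920 |
| DCT  | 0.78  | 0.44 | 1.77 | 11.05  | 5.67  | 1.95 | 25.92  | 13.06  | 1.98 | 3728  | 10.9M | 1690 | 90  | 39108 | 1920 |
| SHA  | 12.26 | 5.13 | 2.39 | 179.43 | 80.0  | 2.24 | 351.37 | 200.75 | 1.75 | 5828  | 10.9M | 1073 | 959 | 20480 | 1920 |
| TM   | 0.11  | 0.10 | 1.10 | 1.53   | 0.98  | 1.56 | 3.66   | 2.48   | 1.48 | 2288  | 10.9M | 754  | 60  | 19208 | 1920 |
| LFSR | 0.12  | 0.09 | 1.33 | 1.43   | 0.74  | 1.93 | 2.99   | 2.34   | 1.28 | 1136  | 10.9M | 378  | 90  | 10056 | 1920 |
| BF   | 3.10  | 2.11 | 1.47 | 56.39  | 31.39 | 1.80 | 140.06 | 75.14  | 1.86 | 4784  | 10.9M | 800  | 199 | 38952 | 1920 |
| THR  | 0.19  | 0.20 | 0.95 | 3.06   | 1.91  | 1.60 | 67.99  | 47.05  | 1.45 | 3344  | 10.9M | 89   | 89  | 27640 | 1920 |
| Avg  | 2.66  | 1.36 | 1.47 | 43.4   | 20.39 | 1.92 | 99.98  | 55.26  | 1.65 | 4976  | 10.9M | 855  | 255 | 37843 | 1920 |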
(71) Other mappings based on different interconnections are possible. These other mappings will change not just the routing portion of the instructions, but also the application mapping itself. For example, in
(72) Given that different interconnect configurations will result in different mappings, ICR has profound implications for system security through overlay diversification, as long as there is at least one functionally correct mapping for those interconnect configurations made possible through ICR. If this is true, then the total number of possible mappings would be equal to the number of interconnect configurations for K MLBs. Computing this is nontrivial, and is given by
S_2(K) = A[K] (2)
where A is the OEIS sequence A035512, the 13th term of which is roughly 2^123. The digraph is assumed to be unlabeled because, as with the example of K fully connected MLBs, isomorphic configurations do not contribute significantly to security.
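For very small K, the quantity counted by this sequence can be reproduced by brute force. The Python sketch below enumerates every directed interconnect over K nodes, keeps the strongly connected ones, and counts them up to relabeling of the nodes. It is an illustration only: the function names are hypothetical, and the exhaustive search is practical only for a handful of MLBs.

```python
from itertools import permutations, product

def strongly_connected(k, edges):
    """True if every node reaches every other node along directed edges."""
    def reach(start):
        seen, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for a, b in edges:
                if a == u and b not in seen:
                    seen.add(b)
                    stack.append(b)
        return seen
    return all(len(reach(s)) == k for s in range(k))

def count_interconnects(k):
    """Count strongly connected digraphs on k nodes, up to relabeling."""
    links = list(permutations(range(k), 2))
    canonical_forms = set()
    for mask in product([0, 1], repeat=len(links)):
        edges = [link for link, bit in zip(links, mask) if bit]
        if not strongly_connected(k, edges):
            continue
        # Canonical form: lexicographically smallest relabeled edge list.
        canonical_forms.add(min(
            tuple(sorted((p[a], p[b]) for a, b in edges))
            for p in permutations(range(k))
        ))
    return len(canonical_forms)

small_counts = [count_interconnects(k) for k in (1, 2, 3)]
```

For K = 3 the search finds five distinct configurations: the directed ring, the ring with one added reverse link, two bidirectional links sharing an MLB, the full network minus one link, and the full network.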
(73) In fact, it can be shown that for every interconnect configuration, there is at least one functionally correct application mapping, given that the particular configuration satisfies the requirements for a strongly connected digraph, and the per-MLB schedule and LUT memory size constraints are relaxed. By extension, if the interconnect is only weakly connected, this holds as long as there is a path from the MLB processing the CDFG's primary input (PI) to the MLB processing its primary output (PO), as shown in
(74) To prove this, first consider the case of one MLB (K=1). By definition, a single MLB is a connected graph, and with sufficient schedule table and lookup table memory, the entire application can be mapped into the single MLB. For K>1, it follows that either 1) a portion of the application can be parallelized, or 2) that the application is implemented in a pipeline fashion. In the first case, the particular subgraphs can be mapped to any available MLB, as long as partial or intermediate results may be communicated between any two MLBs (even over multiple cycles), which is true if the network is strongly connected. If instead multiple MLBs are used for pipelining, then the application can be divided into sequential subgraphs, each of which can be placed in adjacent MLBs along the direction of the given edge. Thus, pipelining requires only a weakly connected network of MLBs with an extant path from PI to PO. Therefore, regardless of the application properties, there is at least one functionally correct mapping for every interconnect configuration, giving us Equation 2.
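The two cases in this argument can be expressed as a small feasibility check in Python. The function names and the returned labels are hypothetical, and the sketch assumes, as the argument does, that per-MLB schedule table and lookup table memory limits are relaxed.

```python
def reachable(links, start):
    """Set of MLBs reachable from start along directed links."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for a, b in links:
            if a == u and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def mapping_style(k, links, pi_mlb, po_mlb):
    """Classify which kind of mapping the interconnect admits.

    Returns "parallel" if the network is strongly connected (any subgraph can go
    to any MLB), "pipeline" if there is only a directed path from the PI MLB to
    the PO MLB, and None if neither condition holds.
    """
    if all(len(reachable(links, s)) == k for s in range(k)):
        return "parallel"
    if po_mlb in reachable(links, pi_mlb):
        return "pipeline"
    return None

ring = {(0, 1), (1, 2), (2, 0)}    # strongly connected
chain = {(0, 1), (1, 2)}           # weakly connected, path from MLB 0 to MLB 2
```

With these inputs, `mapping_style(3, ring, 0, 2)` reports the parallel case and `mapping_style(3, chain, 0, 2)` the pipeline case, while a chain pointing away from the PO MLB admits neither.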
(75) There can be security tradeoffs in design mapping. From Equation 1, it follows that, for highly parallelizable applications, the overall level of brute force security may decrease as more MLBs are added when only ISR and IOR are used. This is because an increase in K usually reduces m, since a smaller schedule table is needed when instructions are distributed among a greater number of MLBs, which reduces brute force security for certain values of m and K. For example, with n=32 and a single MLB holding m=56 instructions, S_1(32, 56, 1) ≈ 2^281. Assuming that the application is not perfectly parallelizable, it can be divided between two MLBs, each of which has 30 (instead of 28) instructions. This gives S_1(32, 30, 2) ≈ 2^279. Similarly, ICR is effective against brute force only when K≥14; below that, the number of brute force attempts will be below 2^128. Also, when an application is mapped to a network which is a subset of another, without ISR and IOR, the overlay bitfile will function on both devices. For example, the application mapped to
S_3(n, m, K) = S_1(n, m, K) × S_2(K) (3)
(76) In short, the system security for small values of K requires both ISR and IOR, and the potentially reduced security from ISR and IOR for small values of m is partially compensated by the presence of ICR. Therefore, the architectural modifications of embodiments of the subject invention provide a powerful and versatile tool for security through diversification for the FPGA overlay.
(77) The goal of a typical side channel attack is to obtain secret information, such as an encryption key, by carefully observing certain time-varying system properties, such as power consumption or electromagnetic radiation. Such attacks are effective because these side channels inadvertently leak information: certain operations take more or less power depending on whether the bits involved are 1 or 0. By comparison, the overlay does not rely on operations with secret keys; instead, the particular modifications are encoded into the architecture of the overlay itself. In other words, there is no secret key to leak, making the overlay approach highly effective against side-channel attacks.
(78) It should be understood that the examples and embodiments described herein are for illustrative purposes only and that various modifications or changes in light thereof will be suggested to persons skilled in the art and are to be included within the spirit and purview of this application.
(79) All patents, patent applications, provisional applications, and publications referred to or cited herein (including those in the “References” section) are incorporated by reference in their entirety, including all figures and tables, to the extent they are not inconsistent with the explicit teachings of this specification.
REFERENCES
(80)
[1] J. Coole and G. Stitt, “Intermediate fabrics: Virtual architectures for circuit portability and fast placement and routing,” in IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, ser. CODES/ISSS '10. ACM, 2010, pp. 13-22.
[2] W. Luis, G. R. Newell, and K. Alexander, “Differential power analysis countermeasures for the configuration of SRAM FPGAs,” in Military Communications Conference, MILCOM 2015-2015 IEEE, October 2015, pp. 1276-1283.
[3] J. Coole and G. Stitt, “Fast, flexible high-level synthesis from OpenCL using reconfiguration contexts,” IEEE Micro, vol. 34, no. 1, pp. 42-53, January 2014.
[4] J. Coole, “Adjustable-cost overlays for runtime compilation,” in Field-Programmable Custom Computing Machines (FCCM), 2015 IEEE 23rd Annual International Symposium on, May 2015, pp. 21-24.
[5] S. Paul, A. Krishna, W. Qian, R. Karam, and S. Bhunia, “MAHA: An energy-efficient malleable hardware accelerator for data-intensive applications,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, September 2014.
[6] A. K. Jain, S. A. Fahmy, and D. L. Maskell, “Efficient overlay architecture based on DSP blocks,” in Field-Programmable Custom Computing Machines (FCCM), 2015 IEEE 23rd Annual International Symposium on. IEEE, 2015, pp. 25-28.
[7] D. Capalija and T. S. Abdelrahman, “A high-performance overlay architecture for pipelined execution of data flow graphs,” in 2013 23rd International Conference on Field Programmable Logic and Applications. IEEE, 2013, pp. 1-8.
[8] G. S. Kc, A. D. Keromytis, and V. Prevelakis, “Countering code-injection attacks with instruction-set randomization,” in Proceedings of the 10th ACM Conference on Computer and Communications Security. ACM, 2003, pp. 272-280.
[9] “Number of unlabeled strongly connected digraphs with n nodes.” [Online]. The On-Line Encyclopedia of Integer Sequences (OEIS), Sequence number: A035512.