FPGA virtualization
11120188 · 2021-09-14
Assignee
Inventors
Cpc classification
H04L67/34
ELECTRICITY
G06F30/34
PHYSICS
G06F9/455
PHYSICS
International classification
G06F9/455
PHYSICS
G06F30/34
PHYSICS
Abstract
An FPGA virtualization platform including a network controller configured to provide an interface to an external network; a static logic section coupled to the network controller; and one or more reconfigurable regions each having a virtualized field programmable gate array (vFPGA) that includes a wrapper and a user design.
Claims
1. A field programmable gate array virtualization platform comprising: a network controller configured to provide an interface to an external network; a plurality of reconfigurable virtualized field programmable gate arrays (vFPGA), wherein each vFPGA is configured to be directly attached to a network via the network controller so as to appear on the network as an independent compute resource to other resources on the network; a static logic section coupled to the network controller, wherein the static logic section includes a clock management section, a routing arbiter, and a reconfiguration management unit, wherein the clock management section generates controllable clock domains for each vFPGA, wherein the routing arbiter routes data from the network controller to a vFPGA through two AXI interconnects, wherein a first AXI interconnect reads from an rx-Async buffer and routes to the vFPGA, wherein a second AXI interconnect reads from the vFPGA and forwards to a tx-Async buffer; wherein the reconfiguration management unit includes an internal reconfiguration access port (ICAP), has a dedicated MAC/IP address and is configured to download a user design at runtime to one or more vFPGAs, receive partial bitstreams over a network to reconfigure one or more vFPGA, and freeze vFPGA I/O interfaces during configuration or reconfiguration; each vFPGA including a wrapper, wherein the wrapper includes a custom interface for the user design that provides data, control, clocking signals and logic for coupling the user design to the static logic section, wherein the custom interface includes a description file, wherein the description file is an extensible mark-up language (XML) file; and wherein each vFPGA includes a plurality of physical connections which are reconfigured when the user design is received.
2. The field programmable gate array virtualization platform of claim 1, further comprising: a wrapper generator, wherein each wrapper is generated by the wrapper generator based on a description file of inputs and outputs of a respective user design.
3. The field programmable gate array virtualization platform of claim 1, wherein the static logic section presents one or more vFPGAs as a server, each with a separate MAC/IP.
4. The field programmable gate array virtualization platform of claim 1, wherein the static logic section presents the reconfiguration management unit as a separate server.
5. The field programmable gate array virtualization platform of claim 1, wherein the network controller includes a transmission control protocol data link, network, and session layers of an OSI network stack.
6. The field programmable gate array virtualization platform of claim 1, wherein the network controller establishes sessions between vFPGAs and respective clients, and wherein the network controller ensures data ordering and correctness.
7. The field programmable gate array virtualization platform of claim 1, wherein the network controller is configured to receive user data, deliver the user data to the static logic section, and transmit results back to a user associated with the user data.
8. The field programmable gate array virtualization platform of claim 1, wherein the network controller is integrated with the static logic section.
9. The field programmable gate array virtualization platform of claim 1, wherein the network controller is external to the field programmable gate array virtualization platform.
10. The field programmable gate array virtualization platform of claim 1, wherein a single network controller is shared among a plurality of vFPGAs.
11. The field programmable gate array virtualization platform of claim 1, wherein a network controller is provided per vFPGA.
12. The field programmable gate array virtualization platform of claim 1, wherein the static logic section includes data routers configured to route data between the network controller and one or more vFPGA.
13. The field programmable gate array virtualization platform of claim 1, wherein the reconfiguration management unit includes a wrapper.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) A more complete appreciation of this disclosure and many of the attendant advantages thereof will be readily obtained as the same becomes better understood by reference to the following detailed description when considered in connection with the accompanying drawings, wherein:
DETAILED DESCRIPTION
(22) In the drawings, like reference numerals designate identical or corresponding parts throughout the several views. Further, as used herein, the words “a,” “an” and the like generally carry a meaning of “one or more,” unless stated otherwise. The drawings are generally drawn to scale unless specified otherwise or illustrating schematic structures or flowcharts.
(23) Aspects of this disclosure are directed to methods, systems, and computer readable media for FPGA virtualization.
(24) Though FPGAs have achieved significant performance gains for many application domains, implementing applications on FPGAs still remains a non-trivial task. This can be especially true for FPGAs in a data center or a cloud computing environment. Some implementations include a platform for virtualizing FPGAs. Some implementations permit the rapid deployment (or porting) of applications on cloud-based or data-center attached FPGAs. Some implementations of the platform disclosed herein can provide a general abstract interface to any design (not domain specific) and can support dynamic partial reconfiguration (e.g., so designs can be added to an FPGA that has other applications running) at comparable overhead to other notable platforms. Experimental results using a streamed application in a cloud-like environment have shown that the disclosed platform is a viable computing option (in terms of throughput, among other things) for suitable applications compared to conventional server-based or virtual-machine based software implementations.
(25) In one aspect the subject matter disclosed herein provides a methodology and platform for virtualizing FPGAs such that application developers can seamlessly deploy applications, as custom circuits, on a data center-attached FPGA in a similar manner to deploying a software application on a virtual machine. Physical FPGAs are partitioned into several regions, called vFPGAs, having a common infrastructure, or static logic, for communication and re-configuration. The static logic remains constant and does not need re-configuration. A user design deployed on the present FPGA virtualization platform can be configured and accessed remotely over a TCP/IP network. Based on dynamic reconfiguration, a flexible wrapper architecture is disclosed. An abstraction layer, the wrapper generator, bridges the gap between the user's custom interface and the fixed FPGA interface in the disclosed FPGA virtualization platform. Wrappers are automatically generated for any application's circuitry and synthesized to generate a partial bitstream for virtual FPGAs (vFPGAs).
(26) Some implementations can include an FPGA virtualization methodology and platform based on dynamic partial reconfiguration that is suitable for any hardware design (e.g., not domain specific), with a complete interface abstraction layer. The interface abstraction layers can include platform management logic (or static logic) that is fixed and pre-configured on the FPGA. The static logic abstracts the whole platform (with several vFPGAs) as a group of servers, each with a separate MAC/IP (including the reconfiguration management unit within the static logic). This permits the platform to be seamlessly integrated with any Data Center (DC)/cloud management tools (e.g., there may be no need for custom drivers). Also, users can use their vFPGA-based designs in a standard client-server mode without having to write any special drivers.
(27) Some implementations can include a wrapper generator for user designs based on the proposed vFPGA platform. The wrapper generator receives an XML-based description that contains a list of inputs and outputs and their description. The wrapper generator also receives a data description file written using a verification language such as Vera (e.g., format of the incoming/outgoing data to/from the design).
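The description file format is not reproduced in the text; as a rough sketch (assuming a hypothetical schema with <input>/<output> elements carrying name, width, and group attributes), a wrapper generator front end might collect the design's I/O list like this:

```python
import xml.etree.ElementTree as ET

# Hypothetical description file for a user design; the actual schema
# used by the wrapper generator is not given in the text.
DESC = """
<design name="aes128">
  <input  name="key"       width="128" group="0"/>
  <input  name="block_in"  width="128" group="1"/>
  <output name="block_out" width="128" group="0" mask="True"/>
</design>
"""

def parse_description(xml_text):
    """Collect input/output ports into numbered groups, as a wrapper
    generator would before emitting the wrapper logic."""
    root = ET.fromstring(xml_text)
    ports = {"input": {}, "output": {}}
    for kind in ("input", "output"):
        for p in root.findall(kind):
            g = int(p.get("group", 0))
            ports[kind].setdefault(g, []).append(
                (p.get("name"), int(p.get("width"))))
    return ports

ports = parse_description(DESC)
print(len(ports["input"]), len(ports["output"]))  # two input groups, one output group
```

The group numbers matter because, as described later, one input/output group is applied/captured per clock cycle.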
(28) Described below is a detailed test-case prototype implementation and several experimental results to evaluate the overhead of the disclosed FPGA virtualization platform compared with a “bare metal” custom FPGA implementation having direct inputs (e.g., no LAN interface) in terms of area (e.g., FPGA resources), latency, time, power, and throughput. Also described below is a performance comparison showing that the disclosed virtualization platform can outperform a software-based virtualization for a streamed application (e.g., an AES encryption application).
(29) Some implementations of the virtualization platform are based on partial dynamic reconfiguration. The physical FPGA is divided into a static region (e.g., a region that is kept as is with no reconfiguration), several dynamically reconfigurable regions, and a communication controller. Each dynamically reconfigurable region corresponds to one vFPGA where a user design can be placed (along with the wrapper). The wrapper controls clocking the user design according to data arrival. An overview of the disclosed virtualization platform is shown in
(31) Data movement between the four layers of the platform follows the standard two-way handshaking mechanism defined in the AXI4-Stream specification. (ARM AMBA AXI4-Stream Protocol Specification, 2010. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ihi0051a/index.html, which is incorporated herein by reference). This enables both reader and writer to control the data transmission rate and to communicate without losing any cycles. For example, communication between the static and dynamic regions can be done through fixed-width read and write AXI channels. For cross-clock-domain data movement between clock regions with unrelated frequencies, asynchronous FIFOs are used.
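The ready/valid handshake can be illustrated with a small cycle-level model (a behavioral sketch in Python, not the platform's RTL): a word moves only on cycles where the writer asserts valid and the reader asserts ready, so either side can throttle the stream without losing data.

```python
def stream_transfer(src_words, ready_pattern):
    """Cycle-level sketch of an AXI4-Stream style handshake.

    src_words:     words the writer wants to send (valid whenever data remains)
    ready_pattern: per-cycle reader readiness (True/False), repeated cyclically
    Returns (received_words, cycles_used).
    """
    received, i, cycle = [], 0, 0
    while i < len(src_words):
        valid = True                                  # writer has data pending
        ready = ready_pattern[cycle % len(ready_pattern)]
        if valid and ready:                           # transfer only when both agree
            received.append(src_words[i])
            i += 1
        cycle += 1
    return received, cycle

# A reader that is ready only every other cycle halves the rate
# but loses nothing.
data, cycles = stream_transfer([1, 2, 3, 4], [True, False])
print(data, cycles)  # [1, 2, 3, 4] in 7 cycles
```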
(32) In some implementations, the network controller can implement the data link, network, and session layers of the OSI network stack for TCP. It establishes sessions between vFPGAs and their users and ensures data ordering and correctness. It receives users' data, delivers it to the static logic, and transmits the results back to the user. More precisely, the network controller performs the following tasks:
(33) 1) establishes and terminates TCP sessions between vFPGAs and their users, storing source addresses and other session data;
(34) 2) forwards the payload of the received TCP packets to the static logic associated with the target vFPGA index;
(35) 3) constructs TCP packets for the received results from vFPGAs and transmits them to their users;
(36) 4) stores and manages MAC/IP addresses for all associated vFPGAs and negotiates for dynamic network addressing using the DHCP protocol; and
(37) 5) announces the existence of associated vFPGAs over the network and replies to network queries about vFPGAs such as ARP and ping requests.
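Tasks 2 and 4 can be sketched with assumed data structures (the address values and table layout here are hypothetical, not from the platform): the controller keeps a per-vFPGA MAC/IP table, with the reconfiguration manager registered as just another endpoint, and maps each packet's destination address to a vFPGA index.

```python
# Hypothetical address table: each vFPGA (and the reconfiguration
# manager) appears on the network with its own MAC/IP pair.
VFPGA_TABLE = {
    0: {"mac": "02:00:00:00:00:10", "ip": "10.0.0.10"},
    1: {"mac": "02:00:00:00:00:11", "ip": "10.0.0.11"},
    "reconfig_mgr": {"mac": "02:00:00:00:00:1f", "ip": "10.0.0.31"},
}

def route_payload(dst_ip, payload, table=VFPGA_TABLE):
    """Map a packet's destination IP to a vFPGA index (task 2) and
    hand the TCP payload to the static logic for that vFPGA."""
    for index, addr in table.items():
        if addr["ip"] == dst_ip:
            return index, payload
    raise KeyError(f"no vFPGA registered for {dst_ip}")

index, data = route_payload("10.0.0.11", b"\x01\x02")
print(index)  # 1
```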
(38) The network controller can either be integrated with the static logic in the physical FPGA or it can be an off-the-shelf device external to the FPGA. It is also possible to share a network controller among several vFPGAs using a single Ethernet cable connected to the physical FPGA or associate one network controller per vFPGA such that each vFPGA will have a dedicated Ethernet cable connected to the physical FPGA.
(39) The fixed interface between the static logic and the network controller can include, for example, an AXI read-data channel, write-data channel, and the vFPGA indices. In some implementations, the data width can be 8/64 bits for the 1 GE/10 GE Ethernet interface, respectively. Asynchronous FIFOs can be used to move the data across the three clock domains of the static logic, Ethernet transmitter, and Ethernet receiver.
(41) The static logic includes data routers, a reconfiguration management unit, and a clock management unit. Data routing is needed when the network controller is shared among several virtual FPGAs. Routing data between the network controller and vFPGAs is done through two AXI interconnects. The first reads from the rx-Async buffer and routes to the corresponding vFPGA. The second AXI interconnect reads results from one vFPGA at a time and forwards them to the tx-Async buffer. The results of each vFPGA are collected separately to guarantee no interference with other vFPGAs' outputs.
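The second interconnect's one-vFPGA-at-a-time collection can be sketched behaviorally (a simplified Python model, not the AXI interconnect itself; lowest-index-first order is assumed for illustration):

```python
def collect_results(vfpga_fifos):
    """Sketch of the second interconnect: drain results from one vFPGA
    at a time so outputs from different vFPGAs never interleave."""
    for index, fifo in enumerate(vfpga_fifos):
        while fifo:
            yield index, fifo.pop(0)      # forward to the tx-Async buffer

fifos = [[b"a1", b"a2"], [b"b1"]]
print(list(collect_results(fifos)))  # [(0, b'a1'), (0, b'a2'), (1, b'b1')]
```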
(42) The reconfiguration manager (RM) receives partial bitstreams over the Ethernet to reconfigure any of the vFPGAs. It has its own MAC/IP addresses, and the network controller deals with it as another vFPGA. It consists mainly of an internal reconfiguration access port (ICAP) surrounded by a wrapper. It is also responsible for freezing the partial region I/O interfaces during configuration.
(43) The clock management unit produces several clocks for the different domains as shown in
(44) Standard clock buffers and clock management units (CMUs) available on commercial FPGAs have several properties that are utilized in the wrapper design. First, they are controllable (e.g., stoppable). The wrapper uses this property to stall and release the user design clock according to the availability of input data and other conditions. Second, they are run-time reconfigurable, allowing the wrapper to set the user design's clock frequency at run-time. Third, their clock phases can be shifted by 180° to provide negative-edge clocking for the user design.
(45) The wrapper allows users to fit, communicate, and control their designs in any partially reconfigurable region (vFPGA) through a fixed interface with the fixed logic.
(48) Outputs of the serializer are stored in an asynchronous FIFO input buffer. Asynchronous FIFOs are used as buffers to allow the movement of data across the different clock domains (the packing/unpacking circuits use the static logic's clock while the rest of the components use the wrapper's clock). The inputs to the user's design are registered in one register or groups of registers to allow separate control of different inputs. Hence, inputs that do not change frequently (e.g., reset, enable, and other control signals) can be set in a separate group so their values are sent (or set by the serializer) only once at the beginning instead of with each input. The cycles register stores the number of clock cycles that should be applied after each input and is updated at run-time by the serializer. The cycles register is also useful because it gives the user the ability to flush their pipeline while waiting for inputs.
(49) Table 1 shows the details of the wrapper's fixed interface, which includes the wrapper's clock, the static logic's clock, the user design's clock and clock-enable, the wrapper's reset, and AXI read/write channels with fixed data width (8 or 64 bits depending on the Ethernet interface being used). The wrapper is automatically generated for each design according to a user-provided XML input/output specification. The designer also prepares a description of the data format and application/capture rules using a subset of the verification language OpenVera (SystemVerilog). (F. Haque, J. Michelson, and K. Khan. The Art of Verification with VERA (1 ed.), Verification Central, 2001, which is incorporated herein by reference). In the XML description, the designer can divide the design inputs and outputs into groups such that one input/output group is applied/captured at each clock cycle. If there is more than one output group, the design is stalled until all groups are captured. The "Mask register" width is equal to the number of output groups and determines which output group should be transmitted back to the user each cycle. Moreover, the designer can set a property "Mask=True" for any wire in an output group to define it as a "valid out" signal (e.g., to act as a strobe for capturing that output group). The output arbiter receives the mask register value ANDed with the "valid out" signals, if there are any, and produces a selection index that determines which output should be captured. The output arbiter stalls if the output FIFO is full.
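The arbiter's selection rule can be sketched as a small bit-twiddling function (a behavioral illustration; the real arbiter is hardware, and since its priority scheme is not specified, lowest-index priority is assumed here):

```python
def select_output_group(mask, valid):
    """Sketch of the output arbiter: AND the mask register with the
    per-group 'valid out' strobes and pick the lowest eligible group.
    Returns the group index to capture, or None if nothing is eligible."""
    eligible = mask & valid
    if eligible == 0:
        return None                                  # nothing to capture this cycle
    return (eligible & -eligible).bit_length() - 1   # index of lowest set bit

# Groups 1 and 2 are enabled by the mask; groups 0 and 1 strobe valid;
# group 1 is the lowest eligible group.
print(select_output_group(0b110, 0b011))  # 1
```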
(50) An FSM controls clocking the user's design according to data arrivals and user specifications, reading data from the input FIFO, applying it to the inputs, reading the output results, and capturing outputs and storing them in the output FIFO. When a new input arrives, if its clock-control bit is on, the FSM clocks the user design for the number of cycles indicated in the cycles register.
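A behavioral sketch of this FSM follows (a Python model, not the RTL; the "design" here is a hypothetical two-stage pipelined doubler used only for illustration):

```python
def run_wrapper(design_step, inputs, cycles_per_input):
    """Behavioral sketch of the wrapper FSM: for each input word read
    from the input FIFO, apply it to the design and clock the design
    for the number of cycles held in the cycles register, capturing
    any outputs into the output FIFO."""
    outputs = []
    for word in inputs:                    # new input arrives
        pending = word
        for _ in range(cycles_per_input):  # clock the design N cycles
            out = design_step(pending)
            pending = None                 # input applied on first cycle only
            if out is not None:
                outputs.append(out)        # capture into output FIFO
    return outputs

def make_pipeline(depth=2):
    """Hypothetical user design: a doubler with 'depth' pipeline stages."""
    stages = [None] * depth
    def step(x):
        stages.insert(0, None if x is None else 2 * x)
        return stages.pop()
    return step

print(run_wrapper(make_pipeline(2), [1, 2, 3], cycles_per_input=2))  # [2, 4]
```

Note that with two cycles per input the last result (6) is still inside the pipeline when the inputs stop; supplying extra clocks through the cycles register recovers it, which is the pipeline-flushing use mentioned in paragraph (48).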
(52) TABLE 1 - The fixed interface between the static logic and the user's design in the vFPGA.

  Signal                 Direction   Description
  CLK_static_logic       In          Clocking
  CLK_wrapper            In
  CLK_user_design        In
  CLK_enable             Out
  Ready_in               Out         AXI input interface
  Valid_in               In
  Data_in (8/64 bits)    In
  Ready_out              In          AXI output interface
  Valid_out              Out
  Data_out (8/64 bits)   Out
(53) To verify the effectiveness of the disclosed FPGA virtualization platform and evaluate its area, power, and speed overhead, a complete test platform was implemented and used to host four different designs placed in its vFPGAs. Four different open IP cores were used as benchmarks (see, E. Villar and J. Villar, "High Performance RSA 512 bit IPCore," 2010; D. Lundgren, "JPEG Encoder Verilog", 2010; S. T. Eid, "DCT—Discrete Cosine Transformer," 2001; and H. Hsing, "AES core specifications," 2013. All from: https://opencores.org/, each of which is incorporated herein by reference): an RSA512 encryption engine, a JPEG encoder (JPEGEnc), a fast discrete cosine transform (FDCT) engine, and an AES encryption (AES128) engine. A Virtex-6 Xilinx FPGA with a 1/10 Gigabit Ethernet port (XC6VLX550T) was used to host the virtualization platform with the four vFPGAs. The four IP cores were synthesized with the generated wrappers, and a partial configuration bitstream was generated for each IP core targeting one of the created vFPGAs. Xilinx's PlanAhead tool was used to create four reconfigurable regions (vFPGAs) on the FPGA beside the static logic and network controller regions. The four IP core circuits were then configured on the FPGA via the static logic's configuration controller using the internal configuration access port (ICAP). Using Xilinx's ChipScope, a technology that allows real-time monitoring of internal FPGA signals, the proper operation of the wrappers was verified.
(54) As an example,
(56) As shown in
(57) To evaluate the overhead of the disclosed FPGA virtualization technique, the technique was compared to a direct implementation of the four IP cores on the same FPGA (e.g., bare-metal with no virtualization) without any design modifications to the IP cores. Also, to eliminate the effect of frequency on performance, all IP cores for both implementations were operated at 156.25 MHz, the 10 GE Ethernet controller frequency. Though the direct implementation with inputs/outputs applied/captured directly to/from the IP cores through the FPGA I/O pins may not be practical or even realizable, it constitutes the theoretical best-case in terms of area, power, and speed, which is why this approach was used as a baseline for evaluating the area/power/speed overhead of the disclosed FPGA virtualization platform.
(58) Table 2 summarizes the virtualization overhead of the disclosed FPGA virtualization platform compared to the direct implementation in terms of area, latency, power, and throughput. For these results, in order to obtain the overhead for each IP core separately, four copies of each IP core were placed on the virtualization platform, since the static logic is actually shared between the four vFPGAs. The results in Table 2 are based on post-place-and-route simulations. This is for two reasons: 1) there is no practical way to inject/read out inputs/outputs for the direct FPGA implementations, and 2) a 10 GE switch was not available to send packets to the vFPGA platform. The total computation times are measured from sending the first Ethernet packet of the user's input data until receiving the last Ethernet packet of the results. In the case of the AES128, the computation time overhead is dominated by the communication overhead. The total computation time overhead for the other three IP cores is acceptable because computation is more prominent than communication for these benchmarks.
(59) Latency was measured as the time from receiving the first input until producing the first output. For the vFPGA, the latency increase is attributed to the initialization of the mask register and the clocking counter, which consumes 150-200 nanoseconds. AES128 latency increased more than the others because its input size is 128 bits, double the data bus width of the system, which in turn made the wrapper halve the IP's clock frequency. For such IP cores (with extra-wide input/output widths), a larger bus width would reduce the latency overhead.
(60) TABLE 2 - Virtualization overhead compared to direct implementation on an FPGA for 4 benchmarks. For the vFPGAs, the wrapper's I/O widths are 64/64 bits for all designs.

  Benchmark                      RSA512       DCT       JPEG Encoder   AES128
  Inputs/Outputs Widths (bits)   64/16        14/13     28/39          128/128

  Total Computation Time (ns) @156.25 MHz
  FPGA                           18,750,265   175,811   73,164         131,176
  vFPGA                          18,764,874   198,860   80,377         356,403
  Overhead                       0.08%        13.1%     9.9%           171.7%

  Latency (ns)
  FPGA                           1,249,974    439       790            131
  vFPGA                          1,250,151    577       941            416
  Overhead                       0.01%        31.4%     19.1%          217.6%

  Average Throughput (MB/s) @156.25 MHz
  FPGA                           0.20         139.6     360.4          2,381.0
  vFPGA                          0.20         123.4     328.0          876.6
  Overhead                       -0.08%       -11.6%    -9.0%          -63.2%

  Dynamic power (mW)
  FPGA                           138.9        248.6     433.3          717.7
  vFPGA                          362.8        347.0     822.2          1,239.8
  Overhead                       161.2%       39.6%     89.8%          72.7%

  Area (Slices)
  FPGA                           2,676        726       9,693          919
  vFPGA                          3,083        1,263     14,470         1,820
  Overhead                       15.2%        74.0%     49.3%          98.0%
Average throughput was measured as the ratio of the total data over the total time for both virtual and physical FPGAs in the table. The throughput overhead of the vFPGA platform was around 10% except for the AES128 circuit. The overhead depends on how much packing/unpacking is required and how much of the control bits are consumed with the data per cycle. For the RSA512 benchmark, the IP core's input width matches the wrapper's very well, hence unpacking incurs very little overhead. For the DCT and JPEG Encoder, unpacking becomes more significant (the DCT is slightly better matched with the wrapper's data width). As mentioned before, due to the large mismatch between the AES128 input width and the wrapper's, the effective frequency of this IP core's clock was half that of the wrapper (and the physical FPGA version), yielding the largest throughput overhead. Again, a wider data bus would have reduced this overhead significantly.
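The halved effective clock for AES128 follows directly from the bus arithmetic: an input wider than the 64-bit bus needs multiple bus beats before the design can be clocked once. A one-line check (input widths taken from Table 2):

```python
def effective_clock_divisor(input_width_bits, bus_width_bits=64):
    """Each input needs ceil(width/bus) bus beats, so the wrapper can
    clock the design at most once per that many beats."""
    return -(-input_width_bits // bus_width_bits)   # ceiling division

for name, width in [("RSA512", 64), ("DCT", 14), ("JPEGEnc", 28), ("AES128", 128)]:
    print(name, effective_clock_divisor(width))
# Only AES128 needs 2 beats per input, halving its effective clock.
```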
(61) Area overhead is measured in FPGA slices and is due to the wrapper and static logic. The static logic's total area is constant at 2,377 slices (˜3%), or ˜600 slices per vFPGA. The wrapper's area dominates the area overhead and varies for each benchmark depending on its input and output size because of the packing/unpacking circuitry.
(63) Power overhead is incurred due to the additional circuitry of the wrapper and static logic. The effect of the wrapper and static logic on power is more prominent for IPs that have less time overhead (e.g., the RSA512), since the total energy per computation (independent of the frequency) is spent over less time, which increases the average power. The results reported in Table 2 are based on active power (i.e., during operation of vFPGA-based designs). Also, the overhead depends on the size of the vFPGA circuits relative to the wrapper and static logic. In this case, the IP cores are relatively small, increasing the relative overhead.
(64) Table 3 below shows a comparison of the disclosed FPGA virtualization platform with other notable platforms for attaching FPGAs to DCs. The table summarizes the type of each platform and its interface (to the user's design), area overhead in terms of FPGA resources (for the overlay, reported as a ratio to the bare-metal design), the platform components (i.e., static logic), and whether partial reconfiguration is supported. The VirtualRC overlay architecture was included because it provides an abstracted application-specific interface that can be used to attach an FPGA design to a DC. (see, R. Kirchgessner, G. Stitt, A. George, and H. Lam, "VirtualRC: A Virtual FPGA Platform for Applications and Tools Portability," Proc. ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA'12), pp. 205-208, 2012, which is incorporated herein by reference). As this table shows, the disclosed FPGA virtualization platform provides a complete interface abstraction and partial reconfiguration support at a comparable or lower area overhead than other techniques.
(65) Cloud-based applications usually run on virtual machines or within containers, which introduces considerable overhead compared to running the same application on a physical machine. To show the viability of FPGA-based computing in clouds with the disclosed vFPGA platform, the performance of an actual streamed application (not simulated) was evaluated when run on a virtual machine, a physical machine, and a vFPGA, all in an environment similar to that of a cloud computing system. For this experiment, we designed a custom streamed application that we believe is a good representation of applications suited to both a cloud environment and FPGA implementation. The application involves three main sequential tasks performed on streamed blocks of data: decrypt-compute-encrypt, e.g., receive encrypted data, decrypt it, perform some relatively simple computation on the plain text, then encrypt the results and send them back to the user. Symmetric key encryption (AES) was used for the encryption and decryption tasks. For the three application platforms, a client application (running on a typical workstation) streams the data over a 1 GE LAN to the three different platforms and receives the streamed results back as illustrated in
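The decrypt-compute-encrypt flow can be sketched as a generator pipeline. The stand-ins below are toys chosen only so the sketch is self-contained (XOR "encryption" and an uppercasing "compute" stage); the actual application used AES for both directions:

```python
def decrypt_compute_encrypt(blocks, decrypt, compute, encrypt):
    """Behavioral sketch of the streamed benchmark application: each
    received block is decrypted, transformed, re-encrypted, and
    streamed back to the client."""
    for block in blocks:
        yield encrypt(compute(decrypt(block)))

# Toy stand-ins (NOT the real cores): XOR cipher and uppercase compute.
KEY = 0x5A
xor_cipher = lambda b: bytes(x ^ KEY for x in b)   # symmetric: same op both ways
compute = lambda b: b.upper()

# Client encrypts, server pipeline processes, client decrypts the results.
ciphertext_in = [xor_cipher(b"hello"), xor_cipher(b"world")]
results = [xor_cipher(r) for r in
           decrypt_compute_encrypt(ciphertext_in, xor_cipher, compute, xor_cipher)]
print(results)  # [b'HELLO', b'WORLD']
```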
(66) In the client-to-physical-server scenario, the application was run (as a server) on a Xeon machine with 8 cores running at 3.00 GHz, 16 GB of RAM, and 64-bit Linux Ubuntu 16.04 LTS. In the client-to-virtual-machine scenario, VirtualBox was used to build a virtual machine with 4 GB RAM and 64-bit Linux Ubuntu 16.04 LTS on another Xeon® machine with the same specifications as the first one. The application was written in Python using Python stream socket programming (STREAM socket programming on python, https://docs.python.org/2/howto/sockets.html, which is incorporated herein by reference) and the Python Cryptography Toolkit (PyCrypto®) (D. C. Litzenberger, "Pycrypto—the python cryptography toolkit," URL: https://www.dlitz.net/software/pycrypto (2016), which is incorporated herein by reference). The measured stream socket throughput between two machines using the code was 113 Mbytes/sec, which represents 90% of the 1 GE link's theoretical speed.
(67) TABLE 3 - Comparison with notable platforms for attaching FPGAs to DCs.

  MS Catapult
    Type: PCI attached, Torus network among FPGAs, SLite II Specific Interface
    Area overhead: ≈39,560 ALMs‡
    Static logic components: two DRAM controllers, four PCIe DMA (to connect over Ethernet), router, PCIe core, reconfiguration management
    PR+ support: no

  Disaggregated
    Type: Network attached, Specific Interface (similar to OS sockets)
    Area overhead: ≈58,128 LUTs + 116,256 FFs
    Static logic components: DRAM controller, mem virt. module for each vFPGA, network controller, management
    PR+ support: yes

  RIFFA2.1
    Type: PCIe DMA Interface
    Area overhead: 15,862 LUTs + 14,875 FFs (Xilinx); 15,182 ALUTs + 13,418 FFs (Altera); for 4 vFPGAs (without PCI logic)
    Static logic components: PCIe core, tx-rx engines
    PR+ support: no

  DyRACT
    Type: PCIe DMA Interface
    Area overhead: 16,157 LUTs + 19,453 FFs
    Static logic components: PCIe core, tx-rx engines, reconfiguration man., clock man., DMAs
    PR+ support: yes

  Byma
    Type: Network attached, DPR*, Specific Interface for Packet Processing Applications
    Area overhead: 28,711 LUTs + 29,327 FFs
    Static logic components: Soft processor (reconfiguration management), DRAM controller, MAC Regs., Mem mapping Regs.
    PR+ support: yes

  VirtualRC
    Type: Domain Specific Overlay with Specific Interface
    Area overhead: 2,300 LUTs + 4,550 FFs
    Static logic components: N/A
    PR+ support: no

  Disclosed FPGA Virtualization Platform
    Type: DPR, General
    Area overhead: 17,504 LUTs + 20,514 FFs
    Static logic components: Network controller (complete TCP stack), clock management, reconfiguration management
    PR+ support: yes

  + Partial Reconfiguration
  ‡ ALM = Adaptive Logic Module (Altera), equivalent to Xilinx's Slice (6-input LUT + 4 FFs)
  * DPR = Dynamic Partial Reconfiguration (for vFPGAs)
(68) The hardware version of the application was built using Hsing's AES core (H. Hsing, “AES core specifications,” 2013, from: https://opencores.org/, which is incorporated herein by reference). Since Hsing's core only provides AES-ECB mode encryption, it was modified to implement AES-CTR (for encryption and decryption) which provides stronger security. Two separate instances of the AES-CTR core are used to decrypt and encrypt the streamed data. The three platforms utilized TCP streams to/from the client over the 1 GE LAN switch with a measured sustainable throughput of ˜113 MBytes/s.
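The CTR construction used in this modification can be sketched as follows. The block cipher below is a toy stand-in (a truncated hash, not AES and not Hsing's core); the sketch shows only how CTR turns an ECB-style block encryptor into a stream cipher in which encryption and decryption are the same operation, which is why one core design serves both directions:

```python
import hashlib

BLOCK = 16  # bytes, as in AES-128

def toy_block_encrypt(key, block):
    """Stand-in for the ECB core: any fixed-key pseudorandom function
    works to illustrate the CTR construction (this is NOT AES)."""
    return hashlib.sha256(key + block).digest()[:BLOCK]

def ctr_keystream_xor(key, nonce, data):
    """CTR mode: encrypt an incrementing counter block and XOR the
    result with the data. Applying the same function again decrypts."""
    out = bytearray()
    for i in range(0, len(data), BLOCK):
        counter_block = nonce + (i // BLOCK).to_bytes(8, "big")
        keystream = toy_block_encrypt(key, counter_block)
        out.extend(b ^ k for b, k in zip(data[i:i + BLOCK], keystream))
    return bytes(out)

key, nonce = b"k" * 16, b"n" * 8
msg = b"streamed cleartext blocks"
ct = ctr_keystream_xor(key, nonce, msg)
assert ctr_keystream_xor(key, nonce, ct) == msg   # same operation decrypts
```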
(69) The application's performance was evaluated using the measured throughput as a function of the streamed block size for the three implementations as shown in
(71) The disclosed FPGA virtualization platform for attaching FPGAs to DCs and clouds can include dynamic partial reconfiguration. A physical FPGA is partitioned into static logic and partially reconfigurable regions representing vFPGAs. An abstract interface between the static logic and the vFPGAs has been developed in the form of an automatically generated wrapper. This allows users to place any circuit IP in a vFPGA and send and receive data from their IP through standard Ethernet communication. An evaluation implementation of the disclosed FPGA virtualization platform was built, and its virtualization overhead (e.g., compared to direct implementation on FPGAs) was evaluated in terms of performance, area, throughput, latency, and dynamic power. Experiments showed that the disclosed virtualization platform is both feasible and practical. Also, comparison with other platforms for attaching FPGAs to DCs showed that the area overhead of the disclosed FPGA virtualization platform is within the same range as others but with the added advantages of having an abstract interface, support for partial reconfiguration, and not being domain specific. Comparison with software-based cloud implementations showed that the disclosed FPGA virtualization platform is a very viable computing option in the cloud for suitable applications. Some implementations can provide support for external RAM (e.g., DDR3); hence, IP cores that require a large amount of RAM can be accommodated by such implementations.
(72) Next, a hardware description of a computing device that can host a virtualization platform or is operable to provide reconfiguration information to a virtualization platform according to exemplary embodiments is described with reference to
(73) Further, the claimed advancements are not limited by the form of the computer-readable media on which the instructions of the inventive process are stored. For example, the instructions may be stored on CDs, DVDs, in FLASH memory, RAM, ROM, PROM, EPROM, EEPROM, hard disk or any other information processing device with which the computing device communicates, such as a server or computer.
(74) Further, the claimed advancements may be provided as a utility application, background daemon, or component of an operating system, or combination thereof, executing in conjunction with CPU 1500 and an operating system such as Microsoft Windows 7, UNIX, Solaris, LINUX, Apple MAC-OS and other systems known to those skilled in the art.
(75) The hardware elements required to achieve the computing device may be realized by various circuitry elements known to those skilled in the art. For example, CPU 1500 may be a Xeon or Core processor from Intel of America or an Opteron processor from AMD of America, or may be other processor types that would be recognized by one of ordinary skill in the art. Alternatively, the CPU 1500 may be implemented on an FPGA, ASIC, PLD or using discrete logic circuits, as one of ordinary skill in the art would recognize. Further, CPU 1500 may be implemented as multiple processors cooperatively working in parallel to perform the instructions of the inventive processes described above.
(76) The computing device in
(77) The computing device further includes a display controller 1508, such as an NVIDIA GeForce GTX or Quadro graphics adaptor from NVIDIA Corporation of America, for interfacing with display 1510, such as a Hewlett Packard HPL2445w LCD monitor. A general purpose I/O interface 1512 interfaces with a keyboard and/or mouse 1514 as well as a touch screen panel 1516 on or separate from display 1510. The general purpose I/O interface 1512 also connects to a variety of peripherals 1518, including printers and scanners, such as an OfficeJet or DeskJet from Hewlett Packard.
(78) A sound controller 1520, such as a Sound Blaster X-Fi Titanium from Creative, is also provided in the computing device to interface with speakers/microphone 1522, thereby providing sounds and/or music.
(79) The general purpose storage controller 1524 connects the storage medium disk 1504 with communication bus 1526, which may be an ISA, EISA, VESA, PCI, or similar, for interconnecting all of the components of the computing device. A description of the general features and functionality of the display 1510, keyboard and/or mouse 1514, as well as the display controller 1508, storage controller 1524, network controller 1506, sound controller 1520, and general purpose I/O interface 1512 is omitted herein for brevity as these features are known.
(80) The exemplary circuit elements described in the context of the present disclosure may be replaced with other elements and structured differently than the examples provided herein. Moreover, circuitry configured to perform features described herein may be implemented in multiple circuit units (e.g., chips), or the features may be combined in circuitry on a single chipset, as shown on
(81)
(82) In
(83) For example,
(84) Referring again to
(85) The PCI devices may include, for example, Ethernet adapters, add-in cards, and PC cards for notebook computers. The hard disk drive 1660 and CD-ROM 1666 can use, for example, an integrated drive electronics (IDE) or serial advanced technology attachment (SATA) interface. In one implementation, the I/O bus can include a super I/O (SIO) device.
(86) Further, the hard disk drive (HDD) 1660 and optical drive 1666 can also be coupled to the SB/ICH 1620 through a system bus. In one implementation, a keyboard 1670, a mouse 1672, a parallel port 1678, and a serial port 1676 can be connected to the system bus through the I/O bus. Other peripherals and devices can be connected to the SB/ICH 1620 using a mass storage controller such as SATA or PATA, an Ethernet port, an ISA bus, an LPC bridge, an SMBus, a DMA controller, and an audio codec.
(87) Moreover, the present disclosure is not limited to the specific circuit elements described herein, nor is the present disclosure limited to the specific sizing and classification of these elements. For example, the skilled artisan will appreciate that the circuitry described herein may be adapted based on changes on battery sizing and chemistry, or based on the requirements of the intended back-up load to be powered.
(88) The functions and features described herein may also be executed by various distributed components of a system. For example, one or more processors may execute these system functions, wherein the processors are distributed across multiple components communicating in a network. The distributed components may include one or more client and server machines, which may share processing, as shown on
(89) The above-described hardware description is a non-limiting example of corresponding structure for performing the functionality described herein.
(90) A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of this disclosure. For example, preferable results may be achieved if the steps of the disclosed techniques were performed in a different sequence, if components in the disclosed systems were combined in a different manner, or if the components were replaced or supplemented by other components. The functions, processes and algorithms described herein may be performed in hardware or software executed by hardware, including computer processors and/or programmable circuits configured to execute program code and/or computer instructions to execute the functions, processes and algorithms described herein. Additionally, an implementation may be performed on modules or hardware not identical to those described. Accordingly, other implementations are within the scope that may be claimed.