RECONFIGURABLE INTEGRATED CIRCUIT WITH ON-CHIP CONFIGURATION GENERATION
20170244414 · 2017-08-24
Inventors
Cpc classification
International classification
Abstract
Reconfigurable Integrated Circuit with On-Chip Configuration Generation A circuit and method are provided in which reconfiguration is achieved through the modification of a dynamic data path using configuration information generated on the basis of run-time variables. Rather than storing a plurality of pre-set configurations, this can allow configurations optimised to processing tasks to be implemented during operation.
Claims
1. A reconfigurable integrated circuit for carrying out a plurality of processing tasks, comprising: a configuration generator; and a dynamic data path comprising one or more control units comprising configuration memory, wherein the configuration memory is adapted to store a modifiable configuration for the dynamic data path, the configuration generator is adapted to receive one or more run-time variables and to generate configuration information to modify the stored configuration in dependence on the one or more run-time variables, and the one or more control units are adapted to receive and apply configuration information from the configuration generator to modify the stored configuration, thereby optimising the dynamic data path for a required processing task.
2. A circuit according to claim 1, further comprising one or more adaptive data routers arranged to adaptively route data from a data memory through the dynamic data path in dependence on the stored configuration.
3. A circuit according to claim 2, comprising a plurality of adaptive data routers.
4. A circuit according to claim 3, wherein the plurality of adaptive data routers are provided in parallel.
5. A circuit according to of claim 1, wherein each adaptive data router is coupled to a control unit adapted to receive configuration information from the configuration generator, wherein the adaptive data router acts in accordance with the configuration stored in the control unit.
6. A circuit according to claim 1, further comprising a signal distribution network for distributing configuration information from the configuration generator to the control units, wherein the signal distribution network may apply the same configuration information to multiple control units.
7. A circuit according to claim 1, wherein the run-time variables are dependent upon run-time data access patterns.
8. A circuit according to claim 1, wherein the dynamic data path further comprises one or more data path operators.
9. A circuit according to claim 1, wherein the circuit is a field programmable gate array.
10. A method for reconfiguring an integrated circuit comprising: storing a configuration in a configuration memory which defines the configuration of a dynamic data path; receiving run-time variables at a configuration generator; generating configuration information at the configuration generator to modify the stored configuration in response to the received run-time variables; and applying the configuration information from the configuration generator to modify the stored configuration, thereby optimising the dynamic data path for a required processing task.
11. A method according to claim 10, wherein the step of applying the modified configuration comprises adapting one or more adaptive data routers arranged to route data from a data memory through the dynamic data path.
12. A method according to claim 11, comprising adapting a plurality of adaptive data routers.
13. A method according to claim 12, wherein the plurality of adaptive data routers are provided in parallel.
14. A method according to claim 11, wherein each adaptive data router is coupled to the configuration memory and is adapted to receive configuration information from the configuration generator, wherein the adaptive data router acts in accordance with the stored configuration in the absence of configuration information, and acts according to the modified configuration when configuration information is received from the configuration generator.
15. A method according to claim 10, wherein the run-time variables are dependent upon run-time data access patterns.
16. A method according to claim 10, wherein the circuit is a field programmable gate array.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0035] Preferred embodiment of the present invention will now be described with reference to the accompanying drawings, in which:
[0036]
[0037]
[0038]
[0039]
[0040]
[0041]
[0042]
DETAILED DESCRIPTION
[0043] Referring to
[0044] The dynamic data path 130 is illustrated in more detail in
[0045] The configuration generator 120 receives as input one or more run-time variables, which may include information from the dynamic data path 130 and/or from the configuration generator memory 110, and generates configuration information for modifying the configuration stored in the control units 132. The configuration generator 120 transfers configuration information through the signal distribution network 138 to the control units 132. When in a dynamic mode, the control units 132 modify the configuration stored in the configuration memory 132a on receipt of the configuration information. Preferably in this example, the control units 132 only update the configuration memory of the adaptive data routers 134, while the configuration memory of the data path operators 136 remains the same. As the adaptive data routers 134 operate according to the configuration provided in their associated control unit 132, this enables the operation of the adaptive data routers 134 to be modified dynamically according to configuration information. This dynamically modifies the coupling of particular areas of data memory 140 to the data path operators 136. In particular, the adaptive data routers 134 can map different areas of the data memory to the available data path operators 136 during operation of the circuit 100.
[0046] Each control unit 132 may be placed in a dynamic mode or a static mode. The selection of dynamic or static mode may be effected by selection element 132b also provided in the control unit. When in dynamic mode, the configuration stored in configuration memory 132a can be modified by configuration information from the configuration generator 120 as outlined above. When in static mode, the control unit 132 ignores configuration information from the configuration generator and retains the same configuration in the configuration memory 132a.
[0047] The operation of the integrated circuit can be further understood with reference to the following exemplary algorithm:
TABLE-US-00001 1. int* x, int* y 2. for i ε 1 .fwdarw. N − 1 do 3. a = y[i-1]; 4. y[i] = (x[a+1] + x[a+2]) * x[a+3]; 5. end for
[0048] In this example, x and y are respectively input and output arrays. Data accesses of x depend on the result of previous operations y[i-1]. With reference to
[0049] This dynamic data routing can be achieved by the integrated circuit 100 because the result y[i-1] is returned to the configuration generator 120 as a run-time variable, which may then adaptively provide configuration information via a signal distribution network 138 to the control units 132 in dependence on this variable to modify the configuration stored in the configuration memory 132a of each control unit. This in turn modifies the behaviour of the adaptive data routers 134 associated with the control units in order to selectively route the correct data from x stored in data memory 140 to the data path operators 136 to carry out further processing. As the configuration generator 120 is able to dynamically generate configuration information during operation, there is no need for multiple distinct configurations to be stored on chip when not required. Moreover, the architecture of the integrated circuit 100 allows dynamic configuration to be applied to the dynamic data path 130 within one cycle, rather than requiring multiple cycles for reconfiguration.
[0050] Accordingly, the configuration generator 120 takes run-time variable y[i-1] and generates configurations for the dynamic data path 130 cycle-by-cycle. At each cycle, three values of x data are retrieved from data memory 140 and delivered to the data path operators 136, and one result is generated. Routing resources (i.e. adaptive data routers 134) in the circuit 100 are dynamically re-used to implement cycle-by-cycle connections where necessary under each run-time condition.
[0051] The benefits of the architecture of the circuit 100 according to the embodiment described in
[0052] The architecture reconfigurability R is defined as the minimum of r.sub.cap (routing capacity for dynamic data access), r.sub.cfg (number of distinct configurations supported) and D (number of possible connections for dynamic data access). In the embodiment described above with reference to
[0053] The architecture overhead O is defined as O=o.sub.a.Math.o.sub.t, where o.sub.a is the area overhead and o.sub.t is the time overhead. The area overhead o.sub.a is the ratio of the area of the runtime configurable circuit compared with an initial statically configurable architecture. The time overhead o.sub.t is defined as (t.sub.r+t.sub.p)/t.sub.p where t.sub.r is the reconfiguration time in terms of cycles and t.sub.p is the number of cycles between two reconfigurations. The overhead O thus accounts for the overall impact of implemented design on area and time.
[0054] An optimum system supports unlimited reconfigurability with no area overhead and no time overhead (o.sub.a and o.sub.t would equal 1) giving an efficiency E directly proportional to the desired number of possible connections D. This ideal scenario is illustrated as line 301 in
[0055]
[0056] In off-chip reconfiguration as reflected in line 302, configurations are independently established and then loaded into a configuration memory on the chip.
[0057] Such an approach should provide unlimited r.sub.cfg but suffers significantly from time overhead in implementing any reconfiguration. For example, consider a device with an addressable configuration size of 3232 bits and a maximum configuration throughput of 400 MB/s. Such a device is typical of current devices. The minimum reconfiguration time by these figures is at least 1.01 us. If we assume an operation frequency of 100 MHz then at least 101 clock cycles are consumed for a reconfiguration operation. In the example algorithm given above, configurations need to be updated cycle by cycle, meaning that for such a device 1 result is generated every 102 cycles (o.sub.t=102). As a result, we can see from line 302 in
[0058] Another proposed prior art approach is to provide a plurality of configuration memories on-chip, each storing a different configuration. A selection is made during run-time of which configuration to apply to the dynamic data path. In this scenario, it may be possible for configurations to be updated within a cycle (as they are stored on-chip and do not need to be retrieved from an external source). However, the provision of a plurality of separate configurations increases the area overhead o.sub.a. When the required dynamic connections D increases, there is a need for a corresponding increase in stored configurations to ensure that the required configurations can be supported (i.e. r.sub.dep=D). However, this means expanding the configuration memories in connection blocks and significantly increases the overall architecture area. As shown by line 303 in
[0059] Line 304 shows the efficiency E that may be achieved using an optimised approach based on the principles of the present invention, in which a configuration generator 120 is used to dynamically modify the configuration of a dynamic data path 130. As explained above, this solution does not introduce time overhead o.sub.t since configuration can be modified dynamically within a cycle.
[0060] Furthermore, the area overhead o.sub.a is also low, since the configuration generator 120 and the configuration generator memory 110 can be respectively implemented in existing configurable logic and configuration memory of a current FPGA architecture. Some area overhead arises from additional routing area provided on-chip. However, the impact of this can be kept relatively small by suitably optimised architectural design, and as a result line 304 may be relatively close to the ideal line 301 in
[0061] In practice, it is found that an optimised architecture may take advantage of a number of factors relevant to the tasks that the circuit 100 is applied to. For example, in many tasks, not all adaptive data routers 134 need to adapt during operation. As such, a plurality of the control units 132 can be set to the static mode. Furthermore, it is not always necessary or appropriate to provide bit-level granularity for control of adaptive data routers 134. For example, the required degree of granularity may be at byte level or above (i.e. shifts in the addressable memory carried out by the adaptive data routers 134 may be in steps of a byte or more). Moreover, in many tasks, common shifts may be applied to multiple adaptive data routers 134, meaning that shared configuration information may be routed from the configuration generator 120 to multiple control units 132.
[0062] For example, consider four 8 bit data sets, labelled DATA0, DATA1, DATA2 and DATA3. It may be desirable to choose which data set to retrieve data from, and to retrieve all 8 bits of that particular set in parallel. This can be achieved using 8 independent adaptive data routers, each configured to retrieve a given bit from the data set to which they are coupled. This is illustrated in
[0063]
[0064] As such, depending on operational circumstances, the number of bits of configuration memory 132a in the control units 132 which share configuration information from the configuration generator 120 will vary. Thus, the distribution of configuration information from the configuration generator 120 is preferably configurable. Similarly, the selection of dynamic mode or static mode for various control units 132 can vary.
[0065] An optimised architecture is illustrated in
[0066]
[0067] The signal distribution network 138, the control units 132, the distribution control units 138a, the row decoder 170, and the adaptive data routers 134 are usually combined to form an integrated adaptive module 150.
[0068] The physical layout of an embodiment such as that illustrated in
[0069] The above-referenced embodiments of the present invention find particular utility in a range of processing tasks requiring dynamic data access. In order to understand this more fully, four examples will be presented in more detail below. These are web service processing (Memcached), embedded systems (H.264), graph traversal (breadth-first search), and data management (sorting).
[0070] First, Memcached is widely used in large scale web servers. Frequently accessed data are stored in distributed memory systems, and accessed through hash tables. In software solutions, the hash tables are managed in off-chip memories. A key pointer it and a key size nkey are used to search target keys in the hash tables. By utilising the processes described above, the hash table may be stored in a byte addressable data memory 140. At each cycle, proper connections can be configured using it and nkey as the run-time variables input to the configuration generator 120. In this way, only active data access patterns are implemented.
[0071] Second, H.264 is a widely used video compression standard. In a H.264 software design, dynamic data access operations exist in various modules.
[0072] Processing in H.264 is based on 16×16 macro-blocks. Each macro-block contains 16 4×4 blocks. Intra-prediction algorithms predict values in a 4×4 block, based on its neighbouring blocks. Nine prediction modes are supported. By using a reconfigurable circuit 100 as described above, neighbouring data can be dynamically accessed in parallel, and the output results can be dynamically distributed into a target array.
[0073] Third, Breadth-First Search (BFS) is a graph exploration algorithm that is used in graph-based applications, such as neural simulation, data routing, and network security. BFS traversals label graph vertices with BFS levels, which indicates the distance between the traversed vertex and the source vertex. As an example, to find vertices in BFS level 3, vertices in BFS level 2 are traversed. Adjacent vertices of these traversed vertices are labelled as BFS level 3, if they have not been labelled before. Design challenges for a BFS algorithm arise from its sparse and random data accesses. Assume we have M vertices in level 2, and each vertex has N adjacent vertices, M×N off-chip random data accesses are issued, to find vertices in BFS level 3. By using the dynamic design of the circuits described in the above-mentioned embodiments, dynamic data access patterns can be efficiently implemented. For example, configurations can be generated based on the index of each vertex. Connections between input addresses and memory banks can be dynamically reconfigured.
[0074] Fourth, in parallelised data sorting problems, small sorted arrays can be merged in parallel, to enable parallel sorting of large-scale data. However, an approach with sorted data stored in FIFOs can generate errors when data are retrieved in subsequent operations, since without intervention these will not be retrieved in the correct order. Dynamic data access using the above-described embodiments can be used to ensure data that have been stored are retrieved in the correct order.
[0075] Other variations and modifications will be apparent to the skilled person. Such variations and modifications may involve equivalent and other features which are already known and which may be used instead of, or in addition to, features described herein. Features that are described in the context of separate embodiments may be provided in combination in a single embodiment. Conversely, features which are described in the context of a single embodiment may also be provided separately or in any suitable sub-combination.
[0076] It should be noted that the term “comprising” does not exclude other elements or steps, the term “a” or “an” does not exclude a plurality, a single feature may fulfill the functions of several features recited in the claims and reference signs in the claims shall not be construed as limiting the scope of the claims. It should also be noted that the Figures are not necessarily to scale; emphasis instead generally being placed upon illustrating the principles of the present invention.