FPGA-based dynamic graph processing method

11609787 · 2023-03-21

Abstract

The present disclosure relates to an FPGA-based dynamic graph processing method, comprising: where graph mirrors of a dynamic graph that have successive timestamps define an increment therebetween, a pre-processing module dividing the graph mirror having the latter timestamp into at least one path unit in a manner that incremental computing for any vertex only depends on a preorder vertex of that vertex; an FPGA processing module storing at least two said path units into an on-chip memory directly linked to threads in a manner that every thread unit is able to process the path unit independently; the thread unit determining an increment value between the successive timestamps of the preorder vertex while updating a state value of the preorder vertex, and transferring the increment value to a succeeding vertex adjacent to the preorder vertex in a transfer direction determined by the path unit, so as to update the state value of the succeeding vertex.

Claims

1. An FPGA-based dynamic graph processing method, comprising the steps of: pre-processing a dynamic graph, where graph mirrors of the dynamic graph that have successive timestamps define an increment therebetween, and a pre-processing module divides a graph mirror of said graph mirrors of the dynamic graph, said graph mirror having a latter timestamp of said successive timestamps into at least one path unit, wherein the dividing occurs in a manner that incremental computing for any vertex only depends on a preorder vertex of that vertex; where there are two or more path units, storing via an FPGA processing module the at least two path units into an on-chip memory directly linked to threads in a manner that every thread unit is able to process the path unit independently; and an incremental computing, wherein the thread unit determines an increment value between the successive timestamps of the preorder vertex while updating a state value of the preorder vertex, and transferring the increment value to a succeeding vertex adjacent to the preorder vertex in a transfer direction determined by the path unit, so as to update the state value of the succeeding vertex, writing the updated vertex state value back into the on-chip memory, until all said vertices of the path unit have been updated, and when all the thread units have updated the vertices of the path units, a merging-computing-driving module merging and calculating node graph data, and then uploading the data to a main graph-processing module, so as to complete graph processing; wherein the step of pre-processing further comprises: performing traversal on the graph mirror of the dynamic graph having the latter timestamp, and accessing every vertex of the graph mirror during the traversal, in which a sub-edge is formed between adjacent two vertices; based on a preset path length, dividing the path into said path units in a manner that any two said path units do not have any common sub-edges existing 
therebetween; and performing an increment analysis on the path units, marking a said path unit that has an increment as an active path unit and sending it to the on-chip memory, and marking a said path unit that does not have an increment as a to-be-activated path unit and sending it to the merging-computing-driving module, where it waits to be activated by the thread unit, so as to complete incremental computing; and wherein the pre-processing module is able to read updated vertex data and/or edge data of the path units from the on-chip memory, so that the pre-processing module is able to use the updated vertex data and/or edge data as reference data when determining graph data increment of the adjacent succeeding timestamp.

2. The processing method of claim 1, wherein for the pre-processing step: the preset path length includes a first path length threshold and a second path length threshold, and the path unit has a length that is between the first path length threshold and the second path length threshold, so that the thread units are able to perform parallel updating on the path units when a load imbalance rate is lower than a load imbalance threshold.

3. The processing method of claim 2, wherein the pre-processing step further comprises: the pre-processing module being able to select at least one common node as a start point for the traversal, and record a traversal path while marking nodes that have increment; according to a graph structure of the graph processing, defining the path units within the preset path length taking the nodes that have increment as breakpoints, and marking them as the active path units; and storing the graph mirrors outside the path units into the main graph-processing module by means of creating mirror data of the breakpoints, so as to wait for the path units that need to receive increment updating, thereby completing the graph processing.

4. The processing method of claim 3, wherein the incremental computing step at least comprises computing sub-steps of: the thread unit reading the path units and the increment on the on-chip memory it directly links to in a synchronous, time-sequence manner; based on the increment, according to the graph structure, performing increment updating on the state values of the vertices one by one, and saving the state value of each said vertex to the on-chip memory as soon as it is updated, until the state value of the last vertex is updated; and the merging-computing-driving module, according to the graph structure, uploading the state values of the last vertices of the path units to the main graph-processing module one by one through the on-chip memory and/or through the thread units.

5. The processing method of claim 4, wherein the pre-processing module generates the path units and feeds them back as a work queue to the merging-computing-driving module, and only when the merging-computing-driving module determines, based on the work queue, that the thread unit has completed the incremental computing for the active path units and the activated path units does it send the increment result generated by the thread unit to the main graph-processing module.

6. The processing method of claim 5, wherein if the inactive path units and the active path units share common vertices, the pre-processing module creates the mirror data of the common vertices, so that the main graph-processing module is able to establish incremental relationship with the mirror data based on the state values of the common vertices in the active path units during the graph processing.

7. An FPGA-based dynamic graph processing system, comprising a pre-processing module and an FPGA processing module, characterized in that, where graph mirrors of a dynamic graph that have successive timestamps define an increment therebetween, a pre-processing module configured to divide a graph mirror of said graph mirrors of the dynamic graph, said graph mirror having a latter timestamp of said successive timestamps in a manner that incremental computing for any vertex only depends on a preorder vertex of that vertex; the pre-processing module further configured to: perform traversal on the graph mirror of the dynamic graph having the latter timestamp, and accessing every vertex of the graph mirror during the traversal, in which a sub-edge is formed between adjacent two vertices; based on a preset path length, divide the path into said path units in a manner that any two said path units do not have any common sub-edges existing therebetween; and perform an increment analysis on the path units, marking a said path unit that has an increment as an active path unit and sending it to the on-chip memory, and marking a said path unit that does not have an increment as a to-be-activated path unit and sending it to the merging-computing-driving module, where it waits to be activated by the thread unit, so as to complete incremental computing; and wherein the pre-processing module is able to read updated vertex data and/or edge data of the path units from the on-chip memory, so that the pre-processing module is able to use the updated vertex data and/or edge data as reference data when determining graph data increment of the adjacent succeeding timestamp; where there are two or more said path units, the FPGA processing module configured to store at least two said path units into an on-chip memory directly linked to threads in a manner that every thread unit is able to process the path unit independently; the thread unit determining an increment value between the 
successive timestamps of the preorder vertex while updating a state value of the preorder vertex, and transferring the increment value to a succeeding vertex adjacent to the preorder vertex in a transfer direction determined by the path unit, so as to update the state value of the succeeding vertex, writing the updated vertex state value back into the on-chip memory, until all said vertices of the path unit have been updated; and a merging-computing-driving module configured to, when all the thread units have updated the vertices of the path units, merge and calculate node graph data, and then upload the data to a main graph computing thread, so as to complete graph processing.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 is a schematic modular diagram of an FPGA-based dynamic graph computing system according to the present invention;

(2) FIG. 2 is a schematic diagram of incremental computing module of the preferred dynamic graph computing system of the present invention;

(3) FIG. 3 shows a graph data structure to be processed according to the present invention; and

(4) FIG. 4 is a flowchart of an FPGA-based dynamic graph computing method according to the present invention.

DETAILED DESCRIPTION OF THE INVENTION

(5) The following description is to be read with reference to FIG. 1 through FIG. 4.

(6) For clarity, some technical terms used in this document are defined as below:

(7) FPGA: Field Programmable Gate Array.

(8) CPU: Central Processing Unit.

(9) Graph computing: a process that uses a graph as a data model to express and solve problems.

(10) Graph: an abstract data structure that represents relationships among objects, describing them with vertices and edges, wherein a vertex denotes an object and an edge denotes a relationship between two objects.

(11) Graph data: data that can be abstracted into a graph for description.

Embodiment 1

(12) The present embodiment discloses an FPGA-based dynamic graph processing system. As shown in FIG. 1, the system at least comprises a pre-processing module 100a and an FPGA processing module 200. The pre-processing module 100a divides a graph mirror into at least one path unit. Before dividing the graph mirror into the path unit(s), the pre-processing module 100a determines whether there is increment between graph mirrors of successive timestamps. If the graph mirrors of successive timestamps do have increment therebetween, the pre-processing module 100a divides the graph mirror having the latter timestamp into at least one path unit. Preferably, in the path unit, incremental computing for any vertex only depends on a preorder vertex of that vertex.

(13) Preferably, the pre-processing module 100a is configured to perform the following sub-steps to divide the graph mirror:

(14) S1: performing traversal on the graph mirror of the dynamic graph corresponding to the latter timestamp; accessing every said vertex of the graph mirror during the traversal; a sub-edge is formed between two adjacent said vertices. For example, FIG. 3 shows a graph structure of a graph mirror. The graph structure comprises 10 vertices. The traversal may begin from Vertex 1 and proceed to Vertex 10 according to the graph structure. As shown in FIG. 3, one of the sub-edges may be 1→2. The path may be 1→2→5; of course, it may also be 1→2, or a single vertex.

(15) S2: based on a preset path length, dividing the path into said path units in a manner that there is not any common sub-edge existing between any two said path units. The preset path length may be, for example, 0 units (a single vertex), 1 unit (two vertices), and so on. For instance, the path unit 1→2→5→8→10 and the path unit 1→3→6→8→10 have the same sub-edge 8→10, and the path division is so made that no common sub-edges exist between any two path units. Thus, in this case, the graph should be divided into 1→2→5→8, 1→3→6→8 and 8→10.
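The no-common-sub-edge rule of sub-step S2 can be sketched as follows. The function name and the greedy cut-at-first-duplicate strategy are illustrative assumptions, not part of the disclosure; this sketch enforces only the disjoint-sub-edge constraint, whereas the embodiment's example additionally cuts at the shared vertex 8.

```python
# Sketch: assign every sub-edge (u, v) to exactly one path unit, cutting a
# raw path wherever an edge already belongs to an earlier unit.

def split_shared_edges(paths):
    seen = set()          # sub-edges already assigned to some unit
    units = []
    for path in paths:
        current = [path[0]]
        for u, v in zip(path, path[1:]):
            if (u, v) in seen:
                # edge already owned by another unit: cut the path here
                if len(current) > 1:
                    units.append(current)
                current = [v]
            else:
                seen.add((u, v))
                current.append(v)
        if len(current) > 1:
            units.append(current)
    return units

units = split_shared_edges([[1, 2, 5, 8, 10], [1, 3, 6, 8, 10]])
# the second raw path is cut before the duplicate sub-edge 8→10
```

Applied to the two raw paths of the example, the duplicate sub-edge 8→10 is kept only in the first unit, so the resulting units share no sub-edge.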

(16) S3: performing an increment analysis on the path units. The increment may be about changes in the vertex data, such as a change in the data of Vertex 1. Alternatively, the increment may be about structural changes of the vertex objects. For example, a path unit in the previous timestamp is 1→2→5→8 and becomes 1→2→8 in the next timestamp. Further alternatively, the increment may be about changes in the sub-edge data of the vertex objects. For example, with respect to a path unit 1→2, the sub-edge data corresponding to the previous timestamp is A, and for the next timestamp is B. If a path unit shows increment, the path unit is marked as an active path unit and sent to the on-chip memory 200a. Alternatively, if no increment is shown, the path unit is marked as a to-be-activated one and sent to the merging-computing-driving module, where it waits to be activated by the thread unit, so as to complete incremental computing.

(17) Preferably, the pre-processing sub-step S2 further comprises the following steps.

(18) S21: the preset path length comprising a first path length threshold and a second path length threshold. The thread unit 200b is a processing unit that performs parallel acceleration on path units, and load should be balanced to improve the consistency of parallel processing. However, graphs have complex structures: some paths are single-link paths, e.g., the path 1→4→7→9→10 in FIG. 3, and some paths are bifurcate-link paths, e.g., the path unit 1→2→5→8→10 and the path unit 1→3→6→8→10 share the same sub-edge 8→10. For ensuring computing accuracy for such a bifurcate-link structure, it is necessary to further divide it into 1→2→5→8, 1→3→6→8 and 8→10. This causes differences in length among the path units and also in the processing time required by the thread units 200b, thereby making the main graph-processing module 100b of the graph computing wait too long. To address this, the length of each path unit shall be set between the first path length threshold and the second path length threshold, so that the thread units 200b can perform parallel updating on the path units while the load imbalance ratio stays below the load imbalance threshold. If the length of a path unit is not between the first path length threshold and the second path length threshold, link addition or link breakage should be performed. Herein, link addition means adding inactive path units into an active path unit, and link breakage means breaking an active path unit into at least two segments.
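Link breakage and the load-imbalance check described in sub-step S21 can be sketched as follows; the function names and the convention of measuring unit length in sub-edges are assumptions made for this example.

```python
# Sketch of S21: split over-long path units and measure load imbalance.

def break_long_units(units, max_edges):
    """Link breakage: split any path unit holding more than max_edges
    sub-edges into consecutive segments that share only their breakpoints."""
    out = []
    for unit in units:
        for i in range(0, len(unit) - 1, max_edges):
            out.append(unit[i:i + max_edges + 1])
    return out

def load_imbalance(units):
    """Ratio of the longest unit to the mean unit length (in sub-edges)."""
    lengths = [len(u) - 1 for u in units]
    return max(lengths) * len(lengths) / sum(lengths)

segments = break_long_units([[1, 4, 7, 9, 10]], 2)
# the single-link path 1→4→7→9→10 becomes 1→4→7 and 7→9→10
ratio = load_imbalance(segments)
```

Note that consecutive segments share their breakpoint vertex (here, vertex 7), which corresponds to the vertex for which the pre-processing module would create mirror data.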

(19) Preferably, after dividing graph mirrors into active path units and inactive path units, the pre-processing module 100a creates mirror data of the shared node. For example, the active paths are 1→2→5→8 and 1→3→6→8, while the inactive path is 8→10. Therefore, the pre-processing module 100a creates mirror data of the shared node 8, so that the graph structure can be rebuilt in the merging-computing-driving module 300 based on the mirror data. The merging-computing-driving module 300 is able to establish incremental relationship with the mirror data of the shared node based on the state values of the shared vertices in the active path units during the graph computing.

(20) Preferably, the pre-processing module 100a is a part of the CPU processing module 100. Where there are two or more path units (owing to its inherent complexity, a dynamic graph typically has hundreds or thousands of path units), the FPGA processing module 200 includes hundreds or thousands of thread units 200b, and each thread unit 200b may be directly linked to an on-chip memory 200a. The FPGA processing module 200 thus can store the at least two path units into the on-chip memories 200a directly linked with the corresponding threads in a manner that every said thread unit is able to process the path unit independently.

(21) Preferably, the thread unit 200b determines an increment value between the successive timestamps of the preorder vertex while updating a state value of the preorder vertex. As shown in FIG. 2, for the path unit 1→2→5→8, the updating sub-unit 200b-1 updates the state value of the preorder vertex 1 so that the preorder vertex 1 has an increment value. Then the transferring sub-unit 200b-2 transfers the increment value to the following vertex 2. The increment value is transferred onward to the node 8 according to the logic determined by the graph structure, until the state value of the vertex 8 is updated. When all of the thread units 200b have updated the vertices of the path units, the merging-computing-driving module 300 merges and calculates node graph data and loads them onto a main graph-processing module 100b, so as to complete graph computing.

(22) Preferably, the thread unit 200b performs incremental computing through the following incremental computing sub-steps: F1: the thread unit 200b reads the path units and the increment on the on-chip memory 200a it directly links to by means of a scheduled, synchronous process. Since the load is balanced to the greatest extent possible in the pre-processing module 100a, the processing times of the thread units 200b can be basically synchronous; F2: based on the increment, according to the graph structure, performing increment updating on the state values of the vertices, wherein the state value of each vertex is stored into the on-chip memory 200a as soon as it is updated, until the state value of the last vertex is updated. For example, after the state value of Vertex 1 is updated, it is immediately stored into the memory 200a; F3: the merging-computing-driving module 300, according to the graph structure, uploads the state values of the last vertices of the path units to the main graph-processing module 100b one by one through the on-chip memory 200a and/or through the thread units 200b. For example, the state value of the last vertex can be acquired from the on-chip memory 200a, and can be transferred by the thread unit 200b. For example, incremental computing for 1→2→5→8 and 1→3→6→8 is completed at the vertex 8, so the merging-computing-driving module 300 reads the state values of the vertex 8 of the two paths after the incremental computing from the on-chip memory 200a, and performs a union operation thereon before sending the operational result to the main graph-processing module 100b, for graph computing with the inactive path 8→10.
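A minimal sketch of sub-steps F1–F3 follows, assuming an additive accumulation operator and plain dictionaries standing in for the per-thread on-chip memories; the function name and the combine operator at the merge step are likewise assumptions.

```python
# Sketch of F1-F3: one thread unit propagates an increment along its path
# unit, writing each vertex's state back as soon as it is updated.

def run_thread_unit(unit, delta, local_mem):
    for v in unit:
        # F2: update the state value and write it back immediately
        local_mem[v] = local_mem.get(v, 0) + delta
    # F3: the last vertex and its state are what get uploaded
    return unit[-1], local_mem[unit[-1]]

# Two thread units process 1→2→5→8 and 1→3→6→8 independently; the
# merging-computing-driving step then combines the two states of vertex 8.
m1, m2 = {}, {}
last1 = run_thread_unit([1, 2, 5, 8], 1, m1)
last2 = run_thread_unit([1, 3, 6, 8], 1, m2)
merged = {8: m1[8] + m2[8]}   # stand-in for the union operation
```

The key property illustrated is that each unit needs only its own memory space until the final merge at the shared vertex.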

(23) Preferably, the pre-processing module 100a arranges all the active path units to generate a work queue. For example, a work queue of the active path units 1→2→5→8, 1→3→6→8, and 1→3→6→8→10 is generated and fed back to the merging-computing-driving module 300. The merging-computing-driving module 300 only sends the result of its incremental computing to the main graph-processing module 100b after receiving the results of all three active paths, so as to ensure the data accuracy of graph computing. Preferably, the main graph-processing module 100b is another computing part of the CPU processing module.

(24) Preferably, the pre-processing module 100a can read the updated vertex data and/or edge data of the path units from the on-chip memory 200a, so that the pre-processing module 100a can use the updated vertex data and/or edge data as reference data when determining the graph data increment for the next timestamp. Assuming that at Timestamp t₀ the data relationship of 1→2→5→8 is A→B→C→D, at Timestamp t₁ it is A+Δa→B(+Δa)→C(+Δa)→D(+Δa), and at Timestamp t₂ it is E→F→G→H, then the pre-processing module 100a will calculate the increment for Timestamp t₂ using the graph data of Timestamp t₁.

Embodiment 2

(25) It is to be noted that the embodiment described in detail below is merely intended to illustrate but not limit the present invention. Besides, all the technical features in the embodiments described herein may be combined as long as no conflict or contradiction is caused.

(26) Preferably, the pre-processing module 100a may perform division through the following pre-processing sub-steps: P1: the pre-processing module 100a is able to select at least one common node as a start point for the traversal, and record a traversal path while marking nodes that show the increment. For example, the common nodes may be 1, 8 and 10. When the current traversal has passed one common node and reaches another common node, the current traversal ends. This approach has the following advantages. First, parallel traversal, by which the pre-processing time required by the pre-processing module 100a can be saved. Second, faster path recognition, since common nodes usually have common sub-edges, so the pre-processing module 100a can finish the division of path units faster. Third, increment nodes are marked during traversal, and then the increment nodes and the common nodes are divided into path units, so as to accomplish marking and dividing the graph mirrors during traversal, thereby facilitating the division of the path units; P2: according to a graph structure of the graph computing, using the nodes showing the increment as breakpoints to divide the path units not exceeding the preset path length, and marking them as the active path units. The preset path length is also for ensuring load balance among the thread units 200b; and P3: saving the graph mirrors outside the path units into the main graph-processing module 100b by means of creating mirror data of the breakpoints, so as to wait for the path units that receive increment updating, thereby completing the graph computing.
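Sub-steps P1/P2 can be sketched as below. The function name, the single-successor walk between common nodes, and the concrete adjacency data are all simplifying assumptions for this example, not the disclosed implementation.

```python
# Hypothetical sketch of P1/P2: start a traversal at each common node, stop
# on reaching another common node, and mark a unit active when it contains
# a vertex carrying an increment.

def divide_from_common_nodes(adj, common, increments):
    units = []
    for start in common:
        for v in adj.get(start, []):
            path = [start, v]
            # walk until the next common node (single successor assumed)
            while v not in common and adj.get(v):
                v = adj[v][0]
                path.append(v)
            units.append((path, any(x in increments for x in path)))
    return units

graph = {1: [2, 3, 4], 2: [5], 3: [6], 4: [7], 5: [8], 6: [8],
         7: [9], 8: [10], 9: [10]}   # the FIG. 3 structure
units = divide_from_common_nodes(graph, [1, 8, 10], {2})
# with Vertex 2 carrying an increment, 1→2→5→8 comes out active and
# 8→10 comes out inactive (to-be-activated)
```

The boolean flag models the active / to-be-activated marking that decides whether a unit is sent to the on-chip memory or to the merging-computing-driving module.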

Embodiment 3

(27) The invention, as well as a preferred mode of use and further objectives and advantages thereof, will be best understood by reference to the following detailed description of illustrative embodiments when read in conjunction with the accompanying drawings. The disclosed FPGA-based dynamic graph processing method comprises a pre-processing step and an incremental computing step, with the following details:

(28) (1) The pre-processing module 100a serves to acquire the path set in the latest graph mirror according to the increment existing between two adjacent graph mirrors of a dynamic graph by performing the following sub-steps:

(29) (1.1) if the dynamic graph is processed for the first time, dividing the graph mirrors into sets of paths using the graph division method and storing them in the memory, or otherwise, proceeding with Step (1.2);

(30) (1.2) uploading every edge in the increment to the on-chip memory, and determining the path location in the old graph mirror where the increment exists according to the location of the two vertices of the edge;

(31) (1.3) updating all edge information stored in the paths affected by the increment so as to acquire the path set in the latest graph mirror; if the length difference between different paths exceeds the threshold, re-dividing the graph mirror so as to acquire the path set;

(32) (1.4) initializing a work queue using information of all paths affected by the increment and using it in incremental computing. The pre-processing part may be designed based on various graph dividing methods, as long as paths can be generated. For example, according to the DFS graph dividing method, an arbitrary vertex in a graph mirror of a dynamic graph is used as the start point for depth-first-search traversal, and the accessed edges are stored in order, until the number of edges reaches the preset path length threshold. In this way, all the edges in the dynamic graph are accessed once, and the path units do not share any common sub-edge, meeting the requirement of the disclosed design.
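The DFS-based division just described can be sketched as follows. The function name is illustrative, and closing a unit on backtracking (in addition to the edge-count threshold) is an assumption made here so that every unit stays a contiguous path; what matters is the stated property that every edge is visited exactly once, so no two units share a sub-edge.

```python
# Sketch of DFS-based path division: emit edges in depth-first order and
# close the current unit at the preset edge-count threshold or on backtrack.

def dfs_divide(adj, start, max_edges):
    visited, units, current = set(), [], []

    def dfs(u):
        nonlocal current
        for v in adj.get(u, []):
            if (u, v) in visited:
                continue
            visited.add((u, v))
            if current and current[-1][1] != u:
                units.append(current)        # backtracked: close the unit
                current = []
            current.append((u, v))
            if len(current) == max_edges:    # threshold reached: close it
                units.append(current)
                current = []
            dfs(v)

    dfs(start)
    if current:
        units.append(current)
    return units

graph = {1: [2, 3, 4], 2: [5], 3: [6], 4: [7], 5: [8], 6: [8],
         7: [9], 8: [10], 9: [10]}   # the FIG. 3 structure
units = dfs_divide(graph, 1, 3)
# → [[(1, 2), (2, 5), (5, 8)], [(8, 10)], [(1, 3), (3, 6), (6, 8)],
#    [(1, 4), (4, 7), (7, 9)], [(9, 10)]]
```

On the FIG. 3 structure this reproduces the earlier example's split: 1→2→5→8 and 1→3→6→8 become separate units, with 8→10 in a unit of its own.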

(33) (2) The thread unit 200b is configured to process the affected paths in the work queue using the hardware pipeline until the work queue becomes empty, and to output the final result. This comprises the following sub-steps:

(34) (2.1) uploading plural paths to the on-chip memory according to the work queue;

(35) (2.2) distributing each path to a different custom hardware pipeline, with every pipeline accessing an independent space on the on-chip memory; and

(36) (2.3) every pipeline including a vertex updating module and an increment transferring module, so that the graph vertices in the path flow through these two modules successively: first, a hardware operator updates the state value of every vertex and writes it back to the designated on-chip memory space; then the increment of the vertices is transferred to the next vertex in the path;

(37) The hardware operator in Step (2.3) is an algorithm corresponding to actual needs (such as the SCC algorithm or the PageRank algorithm), which is customized through a programmable logic unit of the FPGA and updates the graph-vertex state values with the increment value.

(38) (2.4) after the graph-vertex state values in plural paths have been updated, synchronizing the graph-vertex state values through the on-chip memory and the FIFO queue, and then adding the paths in which the graph-vertex state values change after the synchronization is completed to the work queue; and

(39) (2.5) checking whether the work queue is empty, and if yes, outputting the current graph data as the operational result of the accelerator, or, otherwise, returning to Step (2.1).
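The work-queue loop of sub-steps (2.1) through (2.5) can be sketched as below. This is a software analogue only: a min-distance relaxation stands in for the customized hardware operator, and Python containers stand in for the pipelines and the FIFO synchronization; the function name and data layout are assumptions.

```python
from collections import deque

# Sketch of the accelerator loop (2.1)-(2.5): drain the work queue, update
# vertex states along each path, and re-enqueue affected paths.

def drain_work_queue(units, state, in_edges):
    queue = deque(units)
    while queue:                               # (2.5): run until empty
        unit = queue.popleft()                 # (2.1)/(2.2): fetch a path
        changed = set()
        for v in unit:                         # (2.3): update, then transfer
            best = min((state[u] + 1 for u in in_edges.get(v, [])),
                       default=state[v])
            if best < state[v]:
                state[v] = best
                changed.add(v)
        if changed:                            # (2.4): re-enqueue paths whose
            for other in units:                # vertices changed state
                if other is not unit and changed & set(other):
                    queue.append(other)
    return state

inf = float("inf")
result = drain_work_queue(
    [[1, 2, 3], [3, 4, 5]],                    # two path units sharing vertex 3
    {1: 0, 2: inf, 3: inf, 4: inf, 5: inf},    # initial state values
    {2: [1], 3: [2], 4: [3], 5: [4]},          # in-edges per vertex
)
# → {1: 0, 2: 1, 3: 2, 4: 3, 5: 4}
```

The loop terminates because a path is re-enqueued only when some vertex state actually changed, so once the relaxation reaches its fixed point the queue drains and the current graph data is the result.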

Embodiment 4

(40) It is to be noted that the embodiment described in detail below is merely intended to illustrate but not limit the present invention. Besides, all the technical features in the embodiments described herein may be combined as long as no conflict or contradiction is caused.

(41) The present embodiment discloses an FPGA-based dynamic graph processing method, which comprises:

(42) Pre-processing step: where graph mirrors of a dynamic graph that have successive timestamps define an increment therebetween, a pre-processing module 100a dividing the graph mirror having the latter timestamp into at least one path unit in a manner that incremental computing for any vertex only depends on a preorder vertex of that vertex;

(43) Storing step: where there are two or more said path units, an FPGA processing module 200 storing at least two path units into the on-chip memories 200a directly linked to the threads in a manner that every said thread unit is able to process the path unit independently;

(44) Increment step: the thread unit 200b determining an increment value between the successive timestamps of the preorder vertex while updating a state value of the preorder vertex, and transferring the increment value to an adjacent succeeding vertex with respect to the preorder vertex in a transfer direction determined by the path unit, so as to update the state value of the succeeding vertex, writing the updated graph-vertex state value into the on-chip memory 200a, until all said vertices of the path unit have been updated; and

(45) Return step: when all the thread units 200b have updated the vertices of the path units, a merging-computing-driving module 300 merging and calculating node graph data before uploading the data to a main graph computing main memory 400, so as to complete graph computing.

(46) Preferably, the pre-processing module 100a performs traversal on the graph mirror of the dynamic graph corresponding to the next timestamp, and accesses every said vertex of the graph mirror during the traversal, in which a sub-edge is formed between two adjacent said vertices. Based on a preset path length, the paths are divided into path units so that no two of the path units intersect.

(47) The increment analysis is performed on the path units. If there is increment, the path unit is marked as an active path unit and sent to the on-chip memory 200a. If there is not increment, the path unit is marked as an inactive path unit and sent to the main graph-processing module 100b, so as to wait for the path units that receive increment updating, thereby completing graph computing.

(48) The present invention has been described with reference to the preferred embodiments and it is understood that the embodiments are not intended to limit the scope of the present invention. Moreover, as the contents disclosed herein should be readily understood and can be implemented by a person skilled in the art, all equivalent changes or modifications which do not depart from the concept of the present invention should be encompassed by the appended claims.