Automatic clustering for self-organizing grids

Abstract

A cluster of nodes, comprising: a plurality of nodes, each having a security policy, and being associated task processing resources; a registration agent configured to register a node and issue a node certificate to the respective node; a communication network configured to communicate certificates to authorize access to computing resources, in accordance with the respective security policy; and a processor configured to automatically dynamically partition the plurality of nodes into subnets, based on at least a distance function of at least one node characteristic, each subnet designating a communication node for communicating control information and task data with other communication nodes, and to communicate control information between each node within the subnet and the communication node of the other subnets.

Claims

1. A non-transitory computer-readable medium storing executable instructions that, in response to execution, cause a processor of a first node device within a first subnet to perform operations comprising: receiving, by the first node device, a node device certificate in response to a successful registration by a registration agent; using the node device certificate to retrieve role information; generating an access token from the node device certificate and retrieved role information; communicating, by the first node device, the access token to a second node device within a second subnet to authorize access to computing resources of the second node device in accordance with a security policy of the second node device provided that the access token has not expired, wherein the first subnet comprises a plurality of node devices based on a distance function of a node device characteristic, and wherein the second subnet comprises a plurality of node devices different from the node devices comprising the first subnet based on the distance function of the node device characteristic; and communicating, by the first node device, control information and task data to the second node device.

2. The non-transitory computer-readable medium of claim 1, further comprising instructions that, in response to execution, cause the processor of the first node device to perform operations further comprising: designating a set of preferred node devices for allocation of portions of a task, wherein the second node device is included in the preferred node devices.

3. The non-transitory computer-readable medium of claim 1, further comprising instructions that, in response to execution, cause the processor of the first node device to perform operations further comprising: designating a set of preferred node devices for allocating portions of a task, wherein the designated set is based on both the task and a partitioning algorithm based on the distance function of the node device characteristic.

4. The non-transitory computer-readable medium of claim 3, wherein the node device characteristic includes a pairwise communication latency between respective node devices.

5. The non-transitory computer-readable medium of claim 1, wherein the second node device controls each node device within the second subnet.

6. The non-transitory computer-readable medium of claim 1, wherein the second node device communicates control information between each node device within the second subnet and the plurality of node devices of the plurality of subnets.

7. The non-transitory computer-readable medium of claim 1, wherein the node device characteristic comprises a link delay metric.

8. The non-transitory computer-readable medium of claim 7, wherein the first subnet and the second subnet are dynamically control led based on current conditions that are determined at least in part by proactive communications that include a heartbeat message.

9. The non-transitory computer-readable medium of claim 1, further comprising instructions that, in response to execution, cause the processor of the first node device to perform operations further comprising: partitioning the plurality of node devices in the first subnet into two new subnets in response to a failure of one or more of the plurality of node devices to respond to a predetermined number of consecutive heartbeat messages.

10. A method for clustering node devices for accomplishing a task, comprising: receiving, by a first node device within a first subnet, a node device certificate in response to a successful registration by a registration agent; using the node device certificate to retrieve role information; generating an access token from the node device certificate and retrieved role information; communicating, by the first node device, the access token to a second node device within a second subnet to authorize access to computing resources of the second node device in accordance with a security policy of the second node device provided that the access token has not expired, wherein the first subnet comprises a plurality of node devices based on a distance function of a node device characteristic, and wherein the second subnet comprises a plurality of node devices different from the node devices comprising the first subnet based on the distance function of the node device characteristic; and communicating, by the first node device, control information and task data to the second node device of the second subnet; and designating a set of preferred node devices for allocating portions of a task, wherein the designated set is based on the task and a partitioning algorithm based on the distance function of the node device characteristic.

11. The method of claim 10, wherein the second node device is included in the set of preferred node devices.

12. The method of claim 10, wherein the node device characteristic includes a pairwise communication latency between respective node devices.

13. The method of claim 10, wherein the second node device controls each node device within the second subnet.

14. The method of claim 10, wherein the second node device communicates control information between each node device within the second subnet and the plurality of node devices of the plurality of subnets.

15. The method of claim 10, wherein the node device characteristic comprises a link delay metric.

16. The method of claim 10, wherein the first subnet and the second subnet are dynamically controlled based on current conditions that are determined at least in part by proactive communications that include a heartbeat message.

17. The method of claim 10, wherein the heartbeat message includes merged update messages.

18. The method of claim 10, further comprising: partitioning the plurality of node devices in the first subnet into two new subnets in response to a failure of one or more of the plurality of node devices to respond to a predetermined number of consecutive heartbeat messages.

19. A system comprising: a memory; and a processor configured to: receive, by the first node device, a node device certificate in response to a successful registration by a registration agent; use the node device certificate to retrieve role information; generate an access token from the node device certificate and retrieved role information; communicate, by the first node device, the access token to a second node device within a second subnet to authorize access to computing resources of the second node device in accordance with a security policy of the second node device provided that the access token has not expired, wherein the first subnet comprises a plurality of node devices based on a distance function of a node device characteristic, and wherein the second subnet comprises a plurality of node devices different from the node devices comprising the first subnet based on the distance function of the node device characteristic; and communicate, by the first node device, control information and task data to the second node device.

20. The system of claim 19, wherein the processor is further configured to: designate a set of preferred node devices for allocation of portions of a task, wherein the second node device is included in the preferred node devices.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 shows an MDTree of 16 nodes with maximum neighborhood size of K=4. Each super-node is shown in bold, and is labeled with the number of nodes it controls.

(2) FIG. 2 shows the average Link Delay in a cluster.

(3) FIG. 3 shows a Maximum Link Delay to the cluster requester.

(4) FIG. 4 shows a Cluster Diameter.

(5) FIG. 5 shows Cluster Requesting Overhead, messages include requests, responses, pings, and cluster confirms.

(6) FIG. 6 shows a comparison of Genetic Split and Random Split.

EXPERIMENTAL EVALUATION

(7) Simulation experiments were conducted to evaluate the present approach using the GPS simulation framework [10, 11] and Transit-Stub networks generated from the GT-ITM topology generator [12]. The GPS was extended to model MDTrees, and to support the cluster formation algorithm discussed herein. The topology studied consists of 600 nodes (due to run-time and memory usage considerations). Link delay within a stub is 5 milliseconds (ms), between stubs and transits it is 10 ms, and between transits is set to 30 ms. Cluster requests of sizes 8, 16, 32, 64, 128, and 256 nodes were evaluated. Pings are used to determine the link delay between node pairs. The value of Kwas set at 25 for the MDTree, and 180% as candidate scale factorS. The following metrics were used to measure the quality of the cluster that an MDTree helps discover: Average link delay among nodes within the cluster: The average link delay is likely to be the most important criterion for the quality of the clustering, especially for fine-grained applications. Such applications require frequent communication among nodes within the cluster and their performance is bound by the latency of communication. Maximum link delay to the cluster requester: This criterion is important for clusters in which the most frequent communication is between the cluster requester and the other nodes. Cluster diameter: The largest link delay between any pair of nodes in the cluster. Cluster Formation Overhead: The overlay performance is measured by the number of messages sent during the process of requesting a cluster. These messages include cluster request messages, cluster responses, pings, and cluster confirmation. Since the MDTree is constructed pro-actively, the cost is amortized over all the requests generated for clusters; it can be considered as a fixed cost. New node joining only cost approximately O(Log.sub.k N) messages, which includes attach queries and pings, where, again, N is the number of nodes in the SOG, and K is the maximum number of nodes in any neighborhood. Maintenance Overhead: The overlay performance is measured by the number of messages transmitted in the MDTree.

(8) FIG. 2 shows the average link delay in the extracted cluster, compared to the optimal cluster for the topology (found through exhaustive search). The average delay, in general, is quite good compared to the optimal available. However, especially at small size clusters (smaller than the layer size), the quality of the solution can be improved. This argues for supporting mechanisms to allow nodes to change their location in the tree if they are not placed well. That is, it is preferred that the system and method support a determination of placement quality, and the communication protocol between nodes support communications which both support the determination of quality of placement and restructuring of the network in case of poor placement, even if this imposes some inefficiency on the operation of well-placed nodes. At 256 nodes, the large size of the cluster relative to the topology size may contribute to the two graphs converging.

(9) FIGS. 3 and 4 show the maximum link delay to the cluster requester, and the cluster diameter respectively. These figures contain a similar result to that of FIG. 2. In general, the results show that the present approach performs well with respect to the optimal solution according to all three metrics. The complexity of the clustering stage (i.e. the messages that are exchanged after a cluster is requested, as opposed to the pro-active MDTree setup costs associated with join messages) depends on the options used in representing the clusters and the MDTree structure (e.g. the values of Sand % described earlier).

(10) FIG. 5 shows that the overhead for requesting a cluster appears to be linear with the size of the requested cluster. Building and maintaining the MDTree structure also requires overhead. Node joining costs approximately O(Log.sub.kN) messages and pings. However, the main overhead comes from the periodic heartbeat messages, since it broadcasts to each node in the neighborhood. This overhead can be reduced by piggybacking and merging update messages. Therefore, it is preferred than an independent heartbeat message only be sent if no other communication conveying similar or corresponding information is not sent within a predetermined period. Of course, the heartbeat may also be adaptive, in which case the frequency of heartbeat messages is dependent on a predicted dynamic change of the network. If the network is generally stable, the heartbeat messages may be infrequent, while if instability is predicted, the heartbeats may be sufficiently frequent to optimize network availability. Instability may be predicted, for example, based on a past history of the communications network or SOG performance, or based on an explicit message.

(11) In some cases, the communication network may be shared with other tasks, in which case the overhead of the heartbeat messages may impact other systems, and an increase in heartbeat messages will not only reduce efficiency of the SOG, but also consume limited bandwidth and adversely impact other systems, which in turn may themselves respond by increasing overhead and network utilization. Therefore, in such a case, it may be desired to determine existence of such a condition, and back off from unnecessary network utilization. For example, a genetic algorithm or other testing protocol may be used to test the communication network, to determine its characteristics.

(12) Clearly, super-nodes and regular peer-nodes have different levels of responsibilities in MDTrees. A super-node is a leader on all layers from 1 to the second highest layer it joins. Each super-node must participate in query and information exchange on all the neighborhoods it joins, which can make it heavily burdened. However, if higher layer super-nodes did not appear within neighborhoods at lower layer neighborhoods, it would be inefficient to pass information down to neighborhoods at lower layers.

(13) Graph hi-partitioning is known to be NP-complete [7]. In an MDTree, genetic algorithms may be used for neighborhood splitting. A preferred algorithm generates approximately optimal partitioning results within hundreds or thousands of generations, which is a small number of computations compared to the NP-complete optimal solution (and these computations take place locally within a super-node, requiring no internode messages). Various other known heuristics may be used to bi-partition the nodes. Since MDTree tries to sort close nodes into the same branch, a genetic algorithm is preferable to a random split algorithm, especially for transit-stub topology.

(14) FIG. 6 shows that the genetic algorithm (or any other effective hi-partitioning heuristic) has a significant impact on the quality of the solution when compared to random partitioning for neighborhood splitting.

(15) The present invention provides an efficient data structure and algorithm for implementing automatic node clustering for self-organizing grids, which will contain clusters of high performance “permanent” machines alongside individual intermittently available computing nodes. Users can ask for an “ad hoc” cluster of size N, and the preferred algorithm will return one whose latency characteristics (or other performance characteristic) come close to those of the optimal such cluster. Automatic clustering is an important service for SOGs, but is also of interest for more traditional grids, whose resource states and network characteristics are dynamic (limiting the effectiveness of static cluster information), and whose applications may require node sets that must span multiple organizations.

(16) The MDTree structure organizes nodes based on the link delay between node pairs. The preferred approach is distributed, scalable, efficient, and effective. A genetic algorithm is used for neighborhood splitting to improve the efficiency and effectiveness of partitioning.

(17) In addition, the system and method according to the present invention may provide tree optimization to revisit placement decisions. Likewise, the invention may determine the effect of node departure on clustering. Further, the invention may provide re-balancing to recover from incorrect placement decisions. As discussed above, the minimum link delay criterion is but one possible metric, and the method may employ multiple criteria to identify candidate cluster nodes, instead of just inter-node delay. For example, computing capabilities and current load, and the measured bandwidth (total and/or available) between nodes may be employed.

(18) Tiered SOG resources, ranging from conventional clusters that are stable and constantly available, to user desktops that may be donated when they are not in use may be implemented.

(19) This variation in the nature of these resources can be accounted for, both in the construction of the MDTrees (e.g., by associating super-nodes with stable nodes) and during the extraction of clusters (e.g., by taking advantage of known structure information like the presence of clusters, instead of trying to automatically derive all structure).

(20) The present invention may also provide resource monitoring for co-scheduling in SOGs. Resource monitoring and co-scheduling have significant overlap with automatic clustering, and therefore a joint optimization may be employed. Effective SOG operation also requires service and application deployment, fault tolerance, and security.

REFERENCES

(21) [1] Enabling grids fore-science (EGEE). http://public.euegee. org.

(22) [2] Seti @home. http://setiathome.berkeley.edu.

(23) [3] Teragrid. http://www.teragrid.org.

(24) [4] N. Abu-Ghazaleh and M. J. Lewis. Short paper: Toward self-organizing grids. In Proceedings of the IEEE International Symposium on High Performance Distributed Computing (HPDC-15), pages 324-327, June 2006. Hot Topics Session.

(25) [5] A. Agrawal and H. Casanova. Clustering hosts in p2p and global computing platforms. In The Workshop on Global and Peer-to-Peer Computing on Large Scale Distributed Systems, Tokyo, Japan, April 2003.

(26) [6] S. Banerjee, C. Kommareddy, and B. Bhattacharjee. Scalable peer finding on the internet. In Global Telecommunications Conference, 2002. GLOBECOM '02, volume 3, pages 2205-2209, November 2002.

(27) [7] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, N.Y., USA, 1979.

(28) [8] E. K. Lua, J. Crowcroft, and M. Pias. Highways: Proximity clustering for scalable peer-to-peer network. In 4th International Conference on Peer-to-Peer Computing (P2P 2004), Zurich, Switzerland, 2004. IEEE Computer Society.

(29) [9] Q. Xu and J. Subhlok. Automatic clustering of grid nodes. In 6th IEEEIACM International Workshop on Grid Computing, Seattle, Wash., November 2005.

(30) [10] W. Yang. General p2p simulator. http://www.cs.binghamton.edurwyang/gps.

(31) [11] W. Yang and N. Abu-Ghazaleh. GPS: A general peer-to-peer simulator and its use for modeling bittorrent. In Proceedings of 13th Annual Meeting of the IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS '05, pages 425-432, Atlanta, Ga., September 2005.

(32) [12] E. W. Zegura, K. L. Calvert, and S. Bhattacharjee. How to model an internetwork. In IEEE Infocom, volume 2, pages 594-602, San Francisco, Calif., March 1996. IEEE.

(33) [13] W. Zheng, S. Zhang, Y. Ouyang, F. Makedon, and J. Ford. Node clustering based on link delay in p2p networks. In 2005 ACM Symposium on Applied Computing, 2005.

Automatic clustering for self-organizing grids

Assignee

Inventors

Cpc classification

Classification Explorer

H04L45/121

ELECTRICITY

Classification Explorer

H04L43/10

ELECTRICITY

Classification Explorer

H04L67/1044

ELECTRICITY

Classification Explorer

G06F15/16

PHYSICS

Classification Explorer

H04L67/02

ELECTRICITY

Classification Explorer

G06Q10/06

PHYSICS

Classification Explorer

H04L67/51

ELECTRICITY

Classification Explorer

H04L67/10

ELECTRICITY

Classification Explorer

H04L41/12

ELECTRICITY

Classification Explorer

H04L47/70

ELECTRICITY

Classification Explorer

H04L45/12

ELECTRICITY

Classification Explorer

H04L47/783

ELECTRICITY

Classification Explorer

H04L45/122

ELECTRICITY

International classification

Classification Explorer

G06F15/16

PHYSICS

Classification Explorer

H04L67/104

ELECTRICITY

Classification Explorer

H04L67/10

ELECTRICITY

Classification Explorer

H04L67/02

ELECTRICITY

Classification Explorer

H04L47/70

ELECTRICITY

Classification Explorer

H04L43/10

ELECTRICITY

Classification Explorer

H04L41/12

ELECTRICITY

Abstract

Claims

Description