Group alert in server systems
11171851 · 2021-11-09
Assignee
Inventors
- Liang-Chin Kao (Taipei, TW)
- Shang-Ching Hung (Taipei, TW)
- Shih-Chiang Chung (Taiwan, TW)
- An Sheng Huang (Taipei, TW)
- Yi-Hsun Chen (Taipei, TW)
Cpc classification
G06F11/0748
PHYSICS
G06F11/3058
PHYSICS
G06F11/0709
PHYSICS
H04L41/0686
ELECTRICITY
G06F11/0781
PHYSICS
International classification
Abstract
A server system having functionality of group alerting is disclosed. Said server system comprises: a plurality of server computers having alert notification capabilities, the plurality of server computers being divided into at least one group; and a management console node managing and monitoring the plurality of server computers; wherein the alert notification is issued by a group of the at least one group of the plurality of server computers when a health problem of a server computer in said group of the at least one group of the plurality of server computers occurs.
Claims
1. A server system comprising: a plurality of server computers having alert notification capabilities, the plurality of server computers being divided into at least a first group according to a location and a second group according to a function, wherein the plurality of server computers comprise a first server computer that belongs to both the first group and the second group, wherein the first server computer is to: responsive to a private event of the first server computer, selectively send a notification of the private event to a domain node of a selected group of the first group and the second group, the first server computer to select the first group as the selected group responsive to the private event being related to the location, and the first server computer to select the second group as the selected group responsive to the private event being related to the function, and wherein the selected group is to issue a group event responsive to the notification of the private event of the first server computer, wherein the domain node is to: accumulate the private event with another private event notified by another server computer of the plurality of server computers, to produce accumulated private events, and determine whether to issue the group event to a management console node responsive to whether the accumulated private events satisfy a group threshold of a predefined policy.
2. The server system of claim 1, wherein the selected group is to send the group event to the management console node that manages and monitors the plurality of server computers.
3. The server system of claim 1, wherein the first group comprises a first domain node, and when the selected group is the first group, the first domain node is to: receive the notification of the private event from the first server computer, and issue the group event to the management console node that manages and monitors the plurality of server computers.
4. The server system of claim 3, wherein the second group comprises a second domain node, and when the selected group is the second group, the second domain node is to: receive the notification of the private event from the first server computer, and issue the group event to the management console node.
5. The server system of claim 1, wherein the group event is selected from among a group health alert, a group utilization alert, and a group hardware failure alert.
6. The server system of claim 1, wherein each of the plurality of server computers has a built-in Intelligent Platform Management Interface (IPMI) supported Baseboard Management Controller (BMC).
7. The server system of claim 1, wherein the domain node is to: determine that the accumulated private events satisfy the group threshold responsive to at least a specified amount of server computers in the selected group violating an individual threshold, wherein the group threshold comprises the specified amount.
8. The server system of claim 1, wherein the first group or the second group is further divided into subgroups, and each subgroup of the subgroups includes a respective domain node and a respective member server computer.
9. The server system of claim 7, wherein at least the specified amount of the server computers in the selected group violating the individual threshold comprises a specified percentage of the server computers in the selected group violating the individual threshold, and wherein the group threshold comprises the specified percentage.
10. A method of group alerting in a server system, comprising: managing and monitoring a plurality of server computers by a management console node, the plurality of server computers divided into at least a first group according to a location and a second group according to a function, wherein the plurality of server computers comprise a first server computer that belongs to both the first group and the second group; responsive to a private event of the first server computer, selectively sending, by the first server computer, a notification of the private event to a domain node of a selected group of the first group and the second group, the first server computer selecting the first group as the selected group responsive to the first server computer determining that the private event is related to the location, and the first server computer selecting the second group as the selected group responsive to the first server computer determining that the private event is related to the function; accumulating, by the domain node, the private event with another private event notified by another server computer of the plurality of server computers, to produce accumulated private events; determining, by the domain node, whether to issue a group event to the management console node responsive to whether the accumulated private events satisfy a group threshold of a predefined policy; and issuing, by the selected group, the group event responsive to the notification of the private event of the first server computer.
11. The method of claim 10, comprising: determining, by the domain node, that the accumulated private events satisfy the group threshold responsive to at least a specified amount of server computers in the selected group violating an individual threshold, wherein the group threshold comprises the specified amount.
12. The method of claim 10, wherein the first group or the second group is further divided into subgroups, and each subgroup of the subgroups includes a respective domain node and a respective member server computer.
13. The method of claim 10, wherein a given server computer of the plurality of server computers is the domain node of the selected group, and a member server computer of another group of server computers that is part of the plurality of server computers.
14. A non-transitory computer-readable storage medium comprising instructions that upon execution cause a domain computer node of a given group of server computers to: receive, from a first server computer responsive to a private event of the first server computer, a notification of the private event, wherein the first server computer is part of a plurality of server computers divided into at least a first group according to a location and a second group according to a function, wherein the first server computer belongs to both the first group and the second group, wherein the notification of the private event is issued by the first server computer to the given group that is one of the first group and the second group, the given group to which the notification of the private event is issued being the first group responsive to the private event being related to the location, and the given group to which the notification of the private event is issued being the second group responsive to the private event being related to the function; accumulate the private event with another private event notified by another server computer of the plurality of server computers, to produce accumulated private events; determine whether to issue a group event responsive to whether the accumulated private events satisfy a group threshold of a predefined policy; and issue the group event to a management console node responsive to the notification of the private event from the first server computer.
15. The non-transitory computer-readable storage medium of claim 14, wherein the instructions upon execution cause the domain computer node to: determine that the accumulated private events satisfy the group threshold responsive to at least a specified percentage of server computers in the given group violating an individual threshold, wherein the group threshold comprises the specified percentage.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) Various features and advantages of the disclosed embodiments will be apparent from the detailed description which follows, taken in conjunction with the accompanying drawings, which together illustrate, by way of example, features of the disclosed embodiments.
(2)
(3)
(4)
(5)
(6)
DETAILED DESCRIPTION
(7)
(8) The management console node (110) manages and monitors the operation and the health/utilization status of the plurality of server computers. When an event occurs on a server computer in the plurality of server computers, that server computer does not issue an alert notification itself. Instead, the alert notification is issued by the group, of the at least one group of the plurality of server computers, to which said server computer belongs. Hence, the issued alert notification reports a group event of a specific group in the server system rather than an individual event of a specific server computer in the server system. For example, if a server computer (131) in server group 1 (130) has an event, the alert notification will be issued by server group 1 (130); if a server computer (141) in server group 2 (140) has an event, the alert notification will be issued by server group 2 (140); and if a server computer in server group N (150) has an event, the alert notification will be issued by server group N (150), instead of by said server computer itself. The alert notification is sent to the management console node (110), such that the management console node (110) can locate the group that includes the server computer having the event.
(9) Each of the plurality of server computers in the server system (100) may have a built-in IPMI supported Baseboard Management Controller (BMC) and does not need any extra hardware to support the group alerting of the server system of the present invention.
(10)
(11)
(12) Based on different criteria of dividing the plurality of server computers into different groups, a domain node in one group of the server system may also be a member server computer in another group of the server system. For example, as illustrated in
(13) When an event occurs on a client server computer (320), the client server computer (320) cannot send a normal alert to the management console node. Instead, the client server computer (320) sends a private event (350), which can only be received by the domain node (310). After receiving the private event (350) issued by the client server computer (320), the domain node (310) decides, according to a predefined policy, whether the private event (350) has reached the predefined threshold values and therefore needs to be sent to the management console node as a group event (360). The issued group event (360) may comprise, for example, a group health alert, a group utilization alert, a group hardware failure alert, or another group alert indicating any event that impacts the normal operations of the server computers. Examples of group alerting policies are listed in the following table. However, the illustrated examples of group alerting policies do not limit the claimed scope of the present invention. A person skilled in the art may readily conceive modifications of the alerting policies based on the actual requirements of managing and monitoring a server system.
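The accumulate-and-decide behavior of the domain node described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `DomainNode`, its fields, the dictionary used for the group event, and the choice to count each server at most once are all assumptions made for the example. The group threshold is treated as "at least the given percentage of members", following the wording of the claims.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class DomainNode:
    """Hypothetical domain node that collects private events for one group."""
    group_name: str
    member_ids: Set[str]
    threshold_percent: int                      # group threshold, e.g. 10 for 10%
    reporting: Set[str] = field(default_factory=set)

    def receive_private_event(self, server_id: str) -> Optional[dict]:
        """Record a private event; return a group event once the policy is met."""
        if server_id not in self.member_ids:
            raise ValueError(f"{server_id} is not a member of {self.group_name}")
        self.reporting.add(server_id)           # assumption: one count per server
        # Integer comparison avoids floating-point rounding at the boundary.
        if len(self.reporting) * 100 >= self.threshold_percent * len(self.member_ids):
            return {"group": self.group_name,
                    "servers_reporting": len(self.reporting)}
        return None                             # keep accumulating, nothing sent


# Ten members and a 30% group threshold: the third distinct server triggers.
node = DomainNode("group-1", {f"srv-{i}" for i in range(10)}, threshold_percent=30)
results = [node.receive_private_event(s)
           for s in ("srv-0", "srv-0", "srv-1", "srv-2")]
```

In this sketch the repeated event from `srv-0` does not move the group closer to its threshold, so only the event from `srv-2` produces a group event for the management console node.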
(14) Group Alert (Health/Utilization/Hardware Failure)
(15) TABLE 1 Examples of Group Alerting Policies

| Alert Type | Event | STH (single system threshold) | GTH (percentage in group threshold) | When to trigger event |
| --- | --- | --- | --- | --- |
| Group Health Critical Alert | Over CPU temperature | CPU temperature over limit (100° C.) | 10% (In a group of 100 servers, that is 10) | STH = true and GTH = true |
| Group Health Critical Alert | System unexpected shutdown | System unexpected shutdown | 3% (In a group of 100 servers, that is 3) | STH = true and GTH = true |
| Group Health Warning Alert | Over CPU temperature | CPU temperature over limit (100° C.) | 1% (In a group of 100 servers, that is 1) | STH = true and GTH = true |
| Group Utilization Alert | Over CPU utilization | CPU average utilization over 90% in last one minute | 80% (In a group of 100 servers, that is 80) | STH = true and GTH = true |
| Group Utilization Alert | Over network throughput utilization | Average network throughput over 80 Mb/s in last five minutes | 20% (In a group of 100 servers, that is 20) | STH = true and GTH = true |
| Group Hardware Failure Warning Alert | Disk failure | Disk failure (Recoverable) | 3% (In a group of 100 servers, that is 3) | STH = true and GTH = true |
| Group Hardware Failure Critical Alert | Disk failure | Disk failure (Unrecoverable) | 1% (In a group of 100 servers, that is 1) | STH = true and GTH = true |
| Group Hardware Failure Warning Alert | Power failure | Redundant power supply failure | 3% (In a group of 100 servers, that is 3) | STH = true and GTH = true |
| Group Hardware Failure Critical Alert | Power failure | System shutdown due to power supply failure | 1% (In a group of 100 servers, that is 1) | STH = true and GTH = true |
| Group Hardware Failure Critical Alert | Unexpected shutdown failure | Unexpected system shutdown | 1% (In a group of 100 servers, that is 1) | STH = true and GTH = true |
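One way to read Table 1 is as a list of policy records, each pairing an alert type with a percentage-in-group threshold (GTH). The sketch below encodes three of the rows and works through the "in a group of 100 servers" arithmetic for other group sizes. The names `GroupPolicy`, `min_servers`, `triggers`, and `alerts_for` are hypothetical, and rounding the server count up for group sizes that do not divide evenly is an assumption, since the table only gives examples for groups of 100.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GroupPolicy:
    alert_type: str
    event: str
    gth_percent: int        # GTH column of Table 1

    def min_servers(self, group_size: int) -> int:
        """Smallest server count whose share of the group reaches GTH."""
        # Integer ceiling division; rounding up is an assumption of this sketch.
        return max(1, (group_size * self.gth_percent + 99) // 100)

    def triggers(self, servers_over_sth: int, group_size: int) -> bool:
        """'STH = true and GTH = true' from the last column of Table 1."""
        sth = servers_over_sth > 0
        gth = servers_over_sth >= self.min_servers(group_size)
        return sth and gth


# Three rows taken from Table 1.
POLICIES = [
    GroupPolicy("Group Health Critical Alert", "Over CPU temperature", 10),
    GroupPolicy("Group Health Warning Alert", "Over CPU temperature", 1),
    GroupPolicy("Group Utilization Alert", "Over CPU utilization", 80),
]


def alerts_for(event: str, servers_over_sth: int, group_size: int) -> List[str]:
    """Alert types whose policy for this event is satisfied."""
    return [p.alert_type for p in POLICIES
            if p.event == event and p.triggers(servers_over_sth, group_size)]


warning_only = alerts_for("Over CPU temperature", 4, 100)
both = alerts_for("Over CPU temperature", 10, 100)
```

With 4 of 100 servers over the CPU temperature limit, only the warning policy (1%) is satisfied; with 10 of 100, the critical policy (10%) is satisfied as well.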
(16) The administrator may set different predefined group alerting policies for each domain node in each server group of the plurality of server computers (as listed in Table 1). For example, the administrator may set an over temperature threshold value of 100° C. for the CPU of each member server computer in server group 1 (130). A member server computer will send a private event (350) to the domain node of server group 1 (130) when its CPU temperature exceeds 100° C. The domain node of server group 1 (130) will accumulate and count the private events sent by the member server computers. The administrator may further set, for the domain node, a group over temperature threshold of 10% of the member server computers reporting the event of over CPU temperature. The domain node will send a group event (360) to the management console node (110) once the accumulated private events show that over 10% of the member server computers have issued private events of over CPU temperature. After the management console node (110) receives the group event from the domain node, the administrator may immediately take action to prevent the member server computers from crashing. For example, the administrator may immediately lower the room temperature for server group 1 (130), shift workloads of server group 1 (130) to other server groups, or take any other action that may reduce the CPU temperature of the member server computers in server group 1 (130). For server group 2 (140), because the CPUs of its server computers have a lower temperature endurance, the administrator may set a lower over temperature threshold value of, for example, 80° C. for the CPU of each member server computer and a group over temperature threshold value of 5% of the member server computers reporting the event of over CPU temperature for the domain node, such that the administrator is informed of CPU temperature issues in server group 2 (140) at a lower threshold.
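The per-group settings in this example (100° C. and 10% for server group 1, 80° C. and 5% for server group 2) can be expressed as a small configuration and checked against a set of CPU temperature readings. The sketch below is illustrative only: the dictionary layout, the function name `over_temperature_group_event`, and the sample readings are assumptions. An individual server counts as over temperature when its reading is strictly above the limit, and the group threshold is treated as "at least" the given percentage, following the claims.

```python
from typing import Dict, List

# Per-group policies taken from the example in the text.
GROUP_POLICIES: Dict[str, Dict[str, float]] = {
    "server group 1": {"sth_celsius": 100.0, "gth_percent": 10},
    "server group 2": {"sth_celsius": 80.0, "gth_percent": 5},
}


def over_temperature_group_event(group: str, cpu_temps_celsius: List[float]) -> bool:
    """Return True if the group's over temperature policy is satisfied."""
    policy = GROUP_POLICIES[group]
    # Servers that would have sent a private event to the domain node.
    hot = sum(1 for t in cpu_temps_celsius if t > policy["sth_celsius"])
    # Assumption: "at least gth_percent of the members" triggers the group event.
    return hot > 0 and hot * 100 >= policy["gth_percent"] * len(cpu_temps_celsius)


# Hypothetical readings: 5 of 100 servers at 85 degrees, the rest at 60.
readings = [85.0] * 5 + [60.0] * 95
group1_event = over_temperature_group_event("server group 1", readings)
group2_event = over_temperature_group_event("server group 2", readings)
```

The same readings produce no group event under the server group 1 policy, because no server exceeds 100° C., but do produce one under the stricter server group 2 policy, because 5% of the servers exceed 80° C.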
(17) As another example, the administrator may set, for each member server computer, an average network throughput threshold value of over 80 Mb/s in the last five minutes, and may set an over network throughput utilization group event that is triggered when over 20% of the member server computers reach the threshold value of an average throughput of 80 Mb/s in the last five minutes. The administrator may thereby monitor whether any of the groups of the server system incurs abnormally large data transmissions on the network, which may be caused by a virus attack or by simultaneous file downloading in a specific area.
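The individual threshold in this example depends on an average taken over the last five minutes. A minimal sketch of such a sliding-window check for a single member server computer follows. The class name `ThroughputWindow`, the use of timestamped samples in seconds, and the simple unweighted mean are assumptions, since the text does not say how the average is computed.

```python
from collections import deque
from typing import Deque, Tuple


class ThroughputWindow:
    """Hypothetical five-minute sliding window of throughput samples (Mb/s)."""

    def __init__(self, window_s: float = 300.0) -> None:
        self.window_s = window_s
        self.samples: Deque[Tuple[float, float]] = deque()  # (timestamp_s, mbps)

    def add(self, timestamp_s: float, mbps: float) -> None:
        """Store a sample and drop samples that fell out of the window."""
        self.samples.append((timestamp_s, mbps))
        while self.samples and self.samples[0][0] <= timestamp_s - self.window_s:
            self.samples.popleft()

    def average(self) -> float:
        """Unweighted mean of the samples still inside the window."""
        if not self.samples:
            return 0.0
        return sum(mbps for _, mbps in self.samples) / len(self.samples)

    def violates(self, limit_mbps: float = 80.0) -> bool:
        """True when the windowed average is over the individual threshold."""
        return self.average() > limit_mbps


window = ThroughputWindow()
for t, mbps in [(0.0, 10.0), (100.0, 90.0), (200.0, 90.0)]:
    window.add(t, mbps)
before = window.violates()      # the early low sample keeps the mean below 80
window.add(300.0, 100.0)        # the sample from t=0 now leaves the window
after = window.violates()
```

A member server computer whose window reports a violation would send a private event to its domain node, which then applies the 20% group threshold across the members.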
(18)
(19) According to the illustrated server system of the present invention, a method (500) for group alerting in a server system is also illustrated in
(20) The claimed server system and method of the present application have many advantages over the conventional server system platform. For example, group alerting allows an administrator to monitor the server system by group instead of by individual server computer, such that identifying and locating group problems in a very large data center becomes easier than locating each of the individual server computers. Only the deployment of a new domain node in each of the groups is required, without adding any new hardware or changing any hardware configuration on the existing server computers. The group alerting for the server system of the present application may be implemented merely by a firmware update, without any further effort.
(21) The aforesaid detailed descriptions illustrate the preferred embodiments of the present application. However, the scope of the claimed invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.