TEMPERATURE PREDICTION SYSTEM AND METHOD FOR PREDICTING A TEMPERATURE OF A CHIP OF A PCIE CARD OF A SERVER

20220156171 · 2022-05-19

    Abstract

    To predict a temperature of a chip of a PCIe card of a server, use a gated recurrent unit of a recurrent neural network to define a temperature prediction model for the chip, collect training data of the temperature prediction model according to mutual response changes of control variables, use the training data to train the temperature prediction model to obtain a training result close to a measured temperature of the chip and evaluate the training result to obtain features that best reflect the temperature change of the chip, perform an error analysis on the training result to obtain a set of key features from the features, form a temperature predictor according to the set of key features and the temperature prediction model, and generate a predicted temperature of the chip by the temperature predictor.

    Claims

    1. A method for predicting a temperature of a chip of a PCIe card of a server comprising: using a gated recurrent unit of a recurrent neural network to define a temperature prediction model for the chip, the temperature prediction model comprising an input terminal and an output terminal; collecting training data of the temperature prediction model according to mutual response changes of a plurality of control variables; using the training data to train the temperature prediction model at the input terminal to obtain a training result close to a measured temperature of the chip from the output terminal, and evaluate the training result to obtain a plurality of features that best reflect the temperature change of the chip; performing an error analysis on the training result to obtain a set of key features from the plurality of features; forming a temperature predictor according to the set of key features and the temperature prediction model; and generating a predicted temperature of the chip by the temperature predictor.

    2. The method of claim 1 wherein the plurality of control variables comprise: chip power of the PCIe card being in an on state or an off state; a utilization rate of a processor being in an idle state, 25% utilization rate, 50% utilization rate, 75% utilization rate or 100% utilization rate; a fan speed of the server being 30% of full speed, 40% of full speed, 50% of full speed, 60% of full speed, 70% of full speed, 80% of full speed, 90% of full speed or 100% of full speed; and an intake air temperature of the server being between 18° C. and 25° C.

    3. The method of claim 2 wherein the training data comprises the utilization rate of the processor, the fan speed of the server, the chip power of the PCIe card and the measured temperature of the chip.

    4. The method of claim 3 wherein the measured temperature is obtained from a thermocouple sensor disposed on the chip.

    5. The method of claim 3 wherein the plurality of features comprise any combination of a group consisting of the utilization rate of the processor, the fan speed of the server, the chip power of the PCIe card, the measured temperature of the chip and the intake air temperature of the server, and the set of key features comprises the chip power of the PCIe card, the fan speed of the server, the temperature of the processor and the intake air temperature of the server.

    6. The method of claim 1 wherein the error analysis is a root mean square error analysis.

    7. The method of claim 1 further comprising controlling a fan speed of the server according to the predicted temperature of the chip.

    8. A temperature prediction system comprising: a server comprising a PCIe card and a fan; a temperature predictor comprising: a temperature prediction model defined by a gated recurrent unit (GRU) of a recurrent neural network (RNN) for a chip of the PCIe card; and a set of key features that best reflect a temperature change of the chip; and a baseboard management controller configured to control a temperature prediction model to generate a predicted temperature of the chip of the PCIe card according to the set of key features, and control a fan speed of the server according to the predicted temperature.

    9. The temperature prediction system of claim 8 wherein the set of key features comprises the chip power of the PCIe card, the fan speed of the server, the temperature of the processor and the intake air temperature of the server.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0009] FIG. 1 is a schematic diagram of a temperature prediction system in an embodiment of the present invention.

    [0010] FIG. 2 is a schematic diagram of a temperature prediction model in an embodiment of the present invention.

    [0011] FIG. 3 is another schematic diagram of the temperature prediction system in FIG. 1.

    [0012] FIG. 4 is a training diagram of the temperature prediction model in FIG. 2.

    [0013] FIG. 5 is a prediction diagram of the temperature prediction model in FIG. 2.

    DETAILED DESCRIPTION

    [0014] FIG. 1 is a schematic diagram of a temperature prediction system 100 in an embodiment of the present invention. The temperature prediction system 100 comprises a server 30 and a baseboard management controller 20. The server 30 comprises a central processing unit (CPU) 2, a memory 4, a hard disk module 6, a fan module 8, a power supply 10, and a PCIe (PCI express) card 12. The baseboard management controller 20 is used to control the temperature prediction model to generate the predicted temperature of the chip of the PCIe card 12 according to the key features, and control the speed of the server fan according to the predicted temperature.

    [0015] The temperature prediction system 100 further comprises a temperature predictor. The temperature predictor comprises a temperature prediction model defined by a gated recurrent unit (GRU) of a recurrent neural network (RNN) for the chip of the PCIe card 12, and a set of key features that best reflect the temperature change of the chip of the PCIe card 12. The temperature prediction model and the set of key features can be stored in the memory 4 and executed by the central processing unit 2. The memory 4 and central processing unit 2 can be in any form.

    [0016] Please refer to FIGS. 2 and 3. FIG. 2 is a schematic diagram of a temperature prediction model 200 in an embodiment of the present invention. FIG. 3 is another schematic diagram of the temperature prediction system 100. In this embodiment, a gated recurrent unit (GRU) in a recurrent neural network (RNN) is used as the architecture of the temperature prediction model 200. Because a recurrent neural network retains information from past inputs, this deep learning framework can predict future trends from historical data. The goal of the temperature prediction model 200 is to infer output data y(k), y(k+1), y(k+2) . . . from the training data x1, x2 . . . at the known input terminals, where k is the sampling point. The sampling period is 1 second in this embodiment, but is not limited thereto. The choice of training data strongly affects the accuracy of the prediction system. The embodiment selects the following training data: the intake air temperature T.sub.amb of the server 30, the fan speed of the fan module 8, the temperature T.sub.CPU of the central processing unit 2, the power P of the PCIe card 12, and the inlet temperature T.sub.in of the PCIe card 12. The output data of the temperature prediction model 200 is the chip temperature T.sub.PCIE of the PCIe card 12. The generation of training data, the storage and processing of data, and the training and evaluation of the temperature prediction model 200 can be automated through programs.
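
    As an illustration only, the following minimal sketch shows one common formulation of a GRU cell stepping through 1-second samples of the five inputs named above. The class name GRUCell, the function predict_sequence, the hidden size, the random placeholder weights, the linear readout, and the normalized sample values are all assumptions made for this sketch; the embodiment does not disclose its layer sizes, weights, or framework.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class GRUCell:
    """Hypothetical GRU cell; weights are random placeholders, not trained values."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.n_hidden = n_hidden
        # One weight set per gate: update (z), reset (r), candidate (n).
        self.W = {g: mat(n_hidden, n_in) for g in "zrn"}
        self.U = {g: mat(n_hidden, n_hidden) for g in "zrn"}
        self.b = {g: [0.0] * n_hidden for g in "zrn"}

    def _pre(self, gate, x, h):
        wx = matvec(self.W[gate], x)
        uh = matvec(self.U[gate], h)
        return [a + c + d for a, c, d in zip(wx, uh, self.b[gate])]

    def step(self, x, h):
        """Advance the hidden state h by one sample x."""
        z = [sigmoid(v) for v in self._pre("z", x, h)]
        r = [sigmoid(v) for v in self._pre("r", x, h)]
        gated_h = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(v) for v in self._pre("n", x, gated_h)]
        # Blend the previous state and the candidate state with the update gate.
        return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


def predict_sequence(cell, readout_w, readout_b, samples):
    """Return one scalar output y(k) per input sample via a linear readout."""
    h = [0.0] * cell.n_hidden
    outputs = []
    for x in samples:
        h = cell.step(x, h)
        outputs.append(sum(w * hi for w, hi in zip(readout_w, h)) + readout_b)
    return outputs


# Assumed input order: [T_amb, fan speed U, T_CPU, P, T_in], each scaled to 0..1.
cell = GRUCell(n_in=5, n_hidden=4)
samples = [[0.3, 0.8, 0.5, 1.0, 0.4], [0.3, 0.7, 0.5, 0.0, 0.4]]
outputs = predict_sequence(cell, [0.5] * 4, 0.0, samples)
print(len(outputs))  # one output per input sample
```

    Because the hidden state is carried from one sample to the next, each output depends on the history of the inputs and not only on the current sample, which is the property the paragraph above relies on.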

    TABLE 1

    Control variables          Control range   Control range adjustment
    Chip of PCIe card          ON/OFF          ON, OFF
    CPU utilization rate       0-100%          Idle, 25%, 50%, 75%, 100%
    Fan speed                  30-100%         30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%
    Server inlet temperature   18-25° C.       18-25° C.

    [0017] With reference to the control variables in Table 1, the control range adjustment is only for illustration and is not used to limit the present invention. The control variables are used to generate input data for the prediction model. The chip power P of the PCIe card 12 may be in one of two states: ON and OFF. The control signal of the fan speed U is a pulse-width modulation (PWM) signal which may correspond to one of eight states: 30% speed, 40% speed, 50% speed, 60% speed, 70% speed, 80% speed, 90% speed and 100% speed. The central processing unit 2 is the main heat source affecting the downstream PCIe card 12, and its utilization rate may be in one of five states: idle state, 25% utilization rate, 50% utilization rate, 75% utilization rate and 100% utilization rate. In the embodiment, the fan speed, the chip power P of the PCIe card 12, and the utilization rate of the CPU 2 can be controlled by the program, and the intake air temperature T.sub.amb of the server 30, the temperature T.sub.CPU of the CPU 2 and the chip temperature T.sub.PCIE of the PCIe card 12 can be detected to train the temperature prediction model 200. In the design stage of the server 30, a thermocouple sensor can be placed on the chip of the PCIe card 12 to measure the temperature of the chip. After the training is completed, the chip of the PCIe card 12 is not fitted with a thermocouple sensor; instead, the temperature prediction model 200 in the embodiment is used to predict the change of the chip temperature T.sub.PCIE.
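
    As an illustration only, the program-controlled variables of Table 1 can be enumerated as a grid of test conditions. The names CHIP_POWER, CPU_UTILIZATION, FAN_SPEED and control_grid are hypothetical, and representing the idle state as 0% is an assumption of this sketch. The intake air temperature is measured rather than set by the program, so it is not part of the grid.

```python
from itertools import product

CHIP_POWER = ["ON", "OFF"]
CPU_UTILIZATION = [0, 25, 50, 75, 100]  # percent; 0 stands in for the idle state (assumption)
FAN_SPEED = [30, 40, 50, 60, 70, 80, 90, 100]  # percent of full speed (PWM duty)


def control_grid():
    """Return every combination of the three program-controlled variables."""
    return [
        {"chip_power": p, "cpu_util_pct": c, "fan_pct": f}
        for p, c, f in product(CHIP_POWER, CPU_UTILIZATION, FAN_SPEED)
    ]


grid = control_grid()
print(len(grid))  # 2 * 5 * 8 = 80 combinations
```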

    TABLE 2

            Input features                               Errors
    Group   T.sub.amb  T.sub.CPU  T.sub.in  P    U       RMSE     Greatest error
    1       x          x          ∘         ∘    ∘       1.107    5.478
    2       x          ∘          x         ∘    ∘       0.737    6.356
    3       ∘          x          x         ∘    ∘       5.706    13.666
    4       x          ∘          ∘         ∘    ∘       0.371    2.548
    5       ∘          x          ∘         ∘    ∘       1.020    4.69
    6       x          ∘          x         ∘    ∘       0.487    2.95
    7       ∘          ∘          ∘         ∘    ∘       0.395    2.684

    [0018] Table 2 is an error analysis of the results after training under various input features. The error data illustrates the experimental results according to the present invention, and is not used to limit the present invention. In Table 2, ∘ indicates that a feature is used, and x indicates that it is not used. The chip power P of the PCIe card 12 and the fan speed U are key features in every group. From the root mean square error (RMSE) analysis, adding T.sub.CPU and T.sub.in produces the smallest errors in Table 2 (the fourth group of input features, with an RMSE of 0.371 and a greatest error of 2.548), while also adding T.sub.amb (the seventh group) gives a comparable RMSE of 0.395. Therefore, the embodiment selects the chip power P of the PCIe card 12, the fan speed U, the temperature T.sub.CPU of the central processing unit 2, and the inlet temperature T.sub.in of the PCIe card 12 as the key features of the temperature predictor. However, the present invention is not limited to this. In another embodiment, the key features can include any combination of the features in Table 2.
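
    As an illustration only, the two error measures reported in Table 2 and the selection of the lowest-error group can be sketched as follows. The function names rmse and greatest_error and the dictionary TABLE_2 are hypothetical; the numeric values are copied from Table 2, and choosing the group by lowest RMSE (with greatest error as a tie-breaker) is an assumption about how the comparison might be automated.

```python
import math


def rmse(predicted, measured):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))


def greatest_error(predicted, measured):
    """Largest absolute deviation between prediction and measurement."""
    return max(abs(p - m) for p, m in zip(predicted, measured))


# (RMSE, greatest error) for each input-feature group, copied from Table 2.
TABLE_2 = {
    1: (1.107, 5.478),
    2: (0.737, 6.356),
    3: (5.706, 13.666),
    4: (0.371, 2.548),
    5: (1.020, 4.69),
    6: (0.487, 2.95),
    7: (0.395, 2.684),
}

# Tuples compare by RMSE first, then by greatest error.
best_group = min(TABLE_2, key=lambda group: TABLE_2[group])
print(best_group)  # 4
```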

    [0019] FIG. 4 is a training schematic diagram of the temperature prediction model 200 in an embodiment of the present invention. In the embodiment, the central processing unit 2 is in an idle state, and the control variables of Table 1 are used to train the temperature prediction model 200. When the chip of the PCIe card 12 is in the ON state, the chip power of the PCIe card 12 is 100% and the chip temperature T.sub.PCIE increases. When the chip of the PCIe card 12 is in the OFF state, the chip power of the PCIe card 12 is 0% and the chip temperature T.sub.PCIE drops. The temperature T.sub.CPU of the central processing unit 2 changes with the switching of the chip of the PCIe card 12. The fan speed U during training has two modes: 80% and 70%. The training data generated by the control variables and other parameters in this embodiment can be used to train the temperature prediction model 200 so that the output data of the temperature prediction model 200, that is, the chip temperature T.sub.PCIE, is close to the measured temperature.
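
    As an illustration only, logged samples can be arranged into fixed-length input windows paired with the measured chip temperature before training a recurrent model. The function make_windows, the window length, and the sample values below are hypothetical placeholders and are not taken from FIG. 4; the embodiment does not state how its training sequences are segmented.

```python
def make_windows(features, targets, window):
    """Pair each run of `window` consecutive feature vectors with the
    measured chip temperature at the last sample of that run."""
    if len(features) != len(targets):
        raise ValueError("features and targets must have equal length")
    if window < 1:
        raise ValueError("window must be at least 1")
    pairs = []
    for end in range(window, len(features) + 1):
        pairs.append((features[end - window:end], targets[end - 1]))
    return pairs


# Placeholder samples: [chip power P (%), fan speed U (%)], one per second.
features = [[100, 80], [100, 80], [0, 80], [0, 70], [100, 70]]
# Placeholder thermocouple readings in degrees C, one per sample.
targets = [60.0, 62.0, 61.0, 59.5, 61.5]

pairs = make_windows(features, targets, window=3)
print(len(pairs))  # 5 samples with a window of 3 give 3 training pairs
```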

    [0020] FIG. 5 is a schematic diagram of the prediction of the temperature predictor in an embodiment of the present invention. The temperature predictor is formed from the key features and the temperature prediction model 200. In FIG. 5, when the fan speed of the fan module 8 gradually increases from 40% to 80%, the temperature T.sub.CPU of the CPU 2 and the intake air temperature T.sub.amb of the server 30 do not change much. However, the chip temperature T.sub.PCIE of the PCIe card 12 is lowered as the chip of the PCIe card 12 is turned on and the fan speed increases. Moreover, the actual value of the chip temperature T.sub.PCIE of the PCIe card 12 is quite close to the predicted value, which shows that the temperature predictor can predict the chip temperature T.sub.PCIE of the PCIe card 12.

    [0021] In summary, the embodiment discloses a temperature prediction system and method for the PCIe chip of the server. The method includes defining the training data and output data of the temperature prediction model of the PCIe chip of the server, using the training data to train and test the temperature prediction model, adjusting the temperature prediction model so that its output data is close to the measured value, and using the temperature predictor formed by the temperature prediction model and the key features to predict the temperature of the chip of the PCIe card. In this way, the temperature change of the chip of the PCIe card can be predicted, solving the time delay problem of the fan speed response.
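
    As an illustration only, a fan speed could be derived from the predicted chip temperature with a simple clamped mapping onto the 30% to 100% PWM range of Table 1. The function fan_speed_for and the thresholds low_c and high_c are hypothetical; the embodiment does not disclose a specific control law, so this is a sketch under stated assumptions rather than the disclosed method.

```python
import math


def fan_speed_for(predicted_temp_c, low_c=60.0, high_c=90.0):
    """Map a predicted chip temperature to a fan duty cycle in percent.

    The thresholds are placeholder values. The result is limited to the
    30-100% range and rounded up to the next 10% step, so that the fan
    errs toward more cooling.
    """
    if predicted_temp_c <= low_c:
        return 30
    if predicted_temp_c >= high_c:
        return 100
    fraction = (predicted_temp_c - low_c) / (high_c - low_c)
    duty = 30 + fraction * 70
    return min(100, int(math.ceil(duty / 10.0)) * 10)


for temp in (50.0, 75.0, 95.0):
    print(temp, fan_speed_for(temp))
```

    Acting on the predicted temperature rather than on a later measured temperature lets the fan respond before the chip heats up, which is how prediction addresses the fan response delay described above.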

    [0022] In an embodiment of the present invention, the temperature predictor and method for the PCIe chip can be applied to a server. The server can be used in artificial intelligence (AI) operations and edge computing. The server can also be a 5G server, cloud server or car networking server.

    [0023] Those skilled in the art will readily observe that numerous modifications and alterations of the device and method may be made while retaining the teachings of the invention. Accordingly, the above disclosure should be construed as limited only by the metes and bounds of the appended claims.