HIGH PERFORMANCE COMPUTING MACHINE AND METHOD IMPLEMENTED IN SUCH A HPC MACHINE
20230076890 · 2023-03-09
Assignee
Inventors
Cpc classification
G06F2221/2143
PHYSICS
H04L63/0428
ELECTRICITY
G06F21/79
PHYSICS
G06F21/53
PHYSICS
G06F21/57
PHYSICS
International classification
G06F9/50
PHYSICS
G06F21/79
PHYSICS
Abstract
A High Performance Computing (HPC) machine comprising several computing processors interconnected through at least one network, and at least one primary management unit, in a vicinity of at least one computing processor. The at least one primary management unit powers on the at least one processor. The at least one primary management unit comprises a random data item generator, and a secure storage memory for storing a secret data item, common to all computing processors of the HPC machine, and used for authentication of each computing processor of the HPC machine for data exchange in the HPC machine.
Claims
1. A High Performance Computing (HPC) machine comprising: several computing processors interconnected through at least one network, and at least one primary management unit, in a vicinity of at least one computing processor of said several computing processors, and provided for powering on said at least one computing processor; wherein the at least one primary management unit comprises a random data item generator, a secure storage memory, for storing a secret data item, common to all computing processors of said at least one computing processor of said HPC machine, and used for authentication of each computing processor of said at least one computing processor of said HPC machine for data exchange in said HPC machine.
2. The HPC machine according to claim 1, wherein the secret data item is a random number, the random data item generator comprising a random number generator as a function of at least one physical quantity measured by at least one sensor.
3. The HPC machine according to claim 1, wherein said at least one primary management unit comprises a primary management unit dedicated to a single computing processor of said at least one computing processor.
4. The HPC machine according to claim 1, wherein said at least one primary management unit comprises a primary management unit common to said several computing processors.
5. The HPC machine according to claim 1, wherein said at least one primary management unit comprises a primary management unit common to several groups of computing processors of said at least one computing processor, wherein at least one group of said several groups of computing processors comprises said several computing processors.
6. The HPC machine according to claim 5, further comprising for said at least one group of the several groups of computing processors, a secondary management unit in a vicinity of said at least one group provided for powering on said at least one computing processor of said at least one group, wherein said secondary management unit comprises a second secure storage memory for storing the secret data item.
7. The HPC machine according to claim 1, wherein each primary management unit of said at least one primary management unit comprises respectively a secondary management unit, wherein the secure storage memory is a RAM memory that is erased when said each primary management unit and said secondary management unit are powered off.
8. The HPC machine according to claim 1, wherein each primary management unit of said at least one primary management unit respectively comprises a secondary management unit, and further comprises a trusted execution environment (TEE) that executes a trusted client controlling access to the secure storage memory of said each primary management unit and said secondary management unit.
9. The HPC machine according to claim 1, wherein each primary management unit of said at least one primary management unit respectively comprises a secondary management unit, and further comprises a one-time programmable (OTP) memory that stores an identity data item and that checks and attests an identity of a component comprising said at least one primary management unit and said secondary management unit, at a time said component is added to said HPC machine.
10. The HPC machine according to claim 1, further comprising several computing racks, each computing rack of said several computing racks comprising several computing blades, each computing blade of said several computing blades comprising said several computing processors, wherein for at least one computing rack of said several computing racks, a primary management unit of said at least one primary management unit is integrated in a Rack Management Controller (RMC) of said at least one computing rack in a processor of said RMC; and for at least one computing blade of said several computing blades, a secondary management unit is integrated in a Baseboard Management Controller (BMC) of said at least one computing blade in a processor of said BMC.
11. A management method for a High Performance Computing (HPC) machine, said HPC machine comprising several computing processors interconnected through at least one network, and at least one primary management unit, in a vicinity of at least one computing processor of said several computing processors, and provided for powering on said at least one computing processor; wherein the at least one primary management unit comprises a random data item generator, a secure storage memory, for storing a secret data item, common to all computing processors of said at least one computing processor of said HPC machine, and used for authentication of each computing processor of said at least one computing processor of said HPC machine for data exchange in said HPC machine; said management method comprising: a phase for powering on said HPC machine, said phase for powering on said HPC machine comprising powering on a first primary management unit of said at least one primary management unit, generating, by said first primary management unit, the secret data item, and storing said secret data item in the secure storage memory.
12. The management method according to claim 11, wherein, when several primary management units of said at least one primary management unit are powered-on at a same time, said phase for powering on said HPC machine further comprises generating, by each primary management unit of said several primary management units, the secret data item, negotiating between said several primary management units of a common secret data item, according to a predetermined negotiation protocol, and storing said common secret data item in the secure storage memory of said each primary management unit.
13. A communication method in a High Performance Computing (HPC) machine, said HPC machine comprising several computing processors interconnected through at least one network, and at least one primary management unit, in a vicinity of at least one computing processor of said several computing processors, and provided for powering on said at least one computing processor; wherein the at least one primary management unit comprises a random data item generator, a secure storage memory, for storing a secret data item, common to all computing processors of said at least one computing processor of said HPC machine, and used for authentication of each computing processor of said at least one computing processor of said HPC machine for data exchange in said HPC machine; said communication method comprising: sending data to, or receiving data from, said at least one computing processor according to a key agreement protocol using the secret data item stored in the secure storage memory of the at least one primary management unit associated to said at least one computing processor.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0103] Other advantages and characteristics will become apparent on examination of the detailed description of at least one embodiment which is in no way limitative, and the attached figures, where:
[0104]
[0105]
[0106]
[0107]
DETAILED DESCRIPTION OF THE INVENTION
[0108] It is well understood that the one or more embodiments that will be described below are in no way limitative. In particular, it is possible to imagine variants of the invention comprising only a selection of the characteristics described hereinafter, in isolation from the other characteristics described, if this selection of characteristics is sufficient to confer a technical advantage or to differentiate the invention with respect to the state of the prior art. Such a selection comprises at least one, preferably functional, characteristic without structural details, or with only a part of the structural details if this part alone is sufficient to confer a technical advantage or to differentiate the one or more embodiments of the invention with respect to the prior art.
[0109] In the FIGURES, elements common to several figures retain the same reference.
[0110]
[0111] The primary management unit 100, shown in
as long as the at least one CP and the said management unit are able to communicate directly through hardware without relying on a network, such as for example: [0115] through an internal bus, or [0116] through a SPI (Serial Peripheral Interface) bus, or [0117] through a PCI (Peripheral Component Interconnect) bus, or [0118] through direct pin-to-pin connection.
Thus, the primary management unit 100 is readily and locally accessible by each CP, without contacting a centralized third party.
[0119] The primary management unit 100, shown in
[0120] The primary management unit, PMU, 100 comprises a random data item generator 102. The random data generator 102 may be any device or any function that generates random data that differs every time it is generated. For example, the random data generator 102 may include a hash function or similar. The random data generator 102 may be a random number generator.
[0121] Preferably, in at least one embodiment, the random data generator 102 may generate a random data as a function of at least one physical quantity measured by at least one physical sensor 104, such as a temperature sensor, a frequency sensor, a noise sensor, etc. Thus, in at least one embodiment, the random data generated by the random data generator 102 depends on the value of the at least one physical quantity measured, by the at least one sensor, at the time said random data item is generated, in the environment of the PMU.
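By way of a purely illustrative and non-limiting sketch, the generation of a random data item as a function of measured physical quantities could look as follows. The function name `generate_random_item`, the choice of the BLAKE2b hash, and the mixing-in of operating-system entropy are assumptions made for this example and are not taken from the description.

```python
import hashlib
import secrets
import struct


def generate_random_item(sensor_readings, num_bytes=32):
    """Derive a random data item from physical measurements (hypothetical helper).

    sensor_readings: iterable of numbers (e.g. temperature, frequency, noise).
    Assumption: the readings are mixed with operating-system entropy so that
    the result stays unpredictable even if the measurements vary little.
    """
    if not 1 <= num_bytes <= 64:
        raise ValueError("num_bytes must be between 1 and 64")
    hasher = hashlib.blake2b(digest_size=num_bytes)
    for value in sensor_readings:
        # Encode each reading as an 8-byte little-endian IEEE 754 double.
        hasher.update(struct.pack("<d", float(value)))
    # Assumption: the platform provides a cryptographic entropy source.
    hasher.update(secrets.token_bytes(32))
    return hasher.digest()
```

For example, `generate_random_item([21.5, 50.01, 0.003])` returns 32 bytes that differ from one call to the next, even when the readings are identical.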
[0122] The PMU 100, shown in
[0123] The secret data item 108 stored in the secure storage memory 106 is used by each CP managed, directly or indirectly, by the PMU 100 for authentication in the HPC machine, for example for data exchange in said HPC machine.
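As a hedged illustration of how a common secret data item may serve for authentication, the sketch below uses an HMAC-based challenge-response exchange. This particular mechanism, and the names `make_challenge`, `respond` and `verify`, are assumptions chosen for the example; the description does not prescribe a specific authentication scheme.

```python
import hashlib
import hmac
import secrets


def make_challenge():
    """Verifier side: produce a fresh random challenge (hypothetical helper)."""
    return secrets.token_bytes(16)


def respond(secret_item, challenge):
    """Prover side: prove knowledge of the secret without revealing it."""
    return hmac.new(secret_item, challenge, hashlib.sha256).digest()


def verify(secret_item, challenge, response):
    """Verifier side: recompute the expected response and compare it
    in constant time."""
    expected = hmac.new(secret_item, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)
```

A computing processor holding the common secret data item answers the challenge correctly, whereas a component holding a different value is rejected.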
[0124] The secure data storage memory 106 may be any type of memory, as long as the content of such memory is protected such that access to said memory is regulated. The security of the memory 106 may comprise hardware security means and/or software security means.
[0125] Preferably, in at least one embodiment, the secure data memory 106 may be a memory that is, partially or totally, erased when said memory is powered off. For example, in at least one embodiment, the secure data memory may be a RAM memory such that the power-off of the PMU, or of the component comprising said PMU, results in the obliteration of the RAM memory, preventing a third party from accessing the secret data item.
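The behaviour of such a volatile memory can be modelled in software, as in the minimal sketch below. The class `VolatileSecretStore` and its method names are hypothetical; in an actual embodiment the erasure results from the physical loss of power of the RAM rather than from a software routine.

```python
class VolatileSecretStore:
    """Software model of a memory whose content is lost at power-off
    (hypothetical class, for illustration only)."""

    def __init__(self):
        self._buffer = None

    def store(self, secret_item):
        # Keep a private, mutable copy so that it can be overwritten later.
        self._buffer = bytearray(secret_item)

    def load(self):
        if self._buffer is None:
            raise LookupError("no secret data item is stored")
        return bytes(self._buffer)

    def power_off(self):
        # Overwrite the content before dropping the reference, mimicking
        # the obliteration of the RAM when power is removed.
        if self._buffer is not None:
            for index in range(len(self._buffer)):
                self._buffer[index] = 0
            self._buffer = None
```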
[0126] Preferably, in one or more embodiments, the PMU 100 may also comprise a trusted execution environment, TEE, 110 for executing a trusted client 112 controlling access to the secure storage memory 106. Thus, in at least one embodiment, the access to the secret data item 108 is controlled by a client 112. The TEE 110 guarantees, thanks to the client 112, that the secret data item 108, stored in the secure storage memory 106, is protected with respect to confidentiality and integrity. The TEE 110 may be a secure area of a processor.
[0127] For example, in at least one embodiment, an ARM CPU provides a built-in TEE, known as ARM TrustZone. An Intel CPU also provides a built-in TEE, known as Intel SGX. Thus, for example, the primary management unit may be integrated in an ARM CPU that may be a computing processor, or preferably a management processor.
[0128] For a component, access to the secure storage memory 106 may be authorized by the trusted client 112 only if the component provides a valid and confirmed identity data, proving that said component is a component of the HPC machine or a trusted component. For example, in at least one embodiment, for another PMU or a secondary management unit (as will be described below), access to the secure storage memory 106 may be authorized by the trusted client 112 only if said management unit provides a valid and confirmed identity data, proving that said management unit, or the component comprising said management unit, is a trusted component.
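The access rule described above may be sketched as follows, under simplifying assumptions. The class `TrustedClient` is hypothetical, and the identity check is reduced to a lookup in a list of trusted identity values; a real trusted client running in a TEE would rely on a cryptographic attestation of the requesting component.

```python
import hmac


class TrustedClient:
    """Gatekeeper for the secret data item (hypothetical class).

    Assumption: an identity is a byte string, and it is 'valid and confirmed'
    when it matches one of the trusted identity values given at construction.
    """

    def __init__(self, secret_item, trusted_identities):
        self._secret_item = bytes(secret_item)
        self._trusted = [bytes(identity) for identity in trusted_identities]

    def read_secret(self, identity):
        for trusted in self._trusted:
            # Constant-time comparison of the presented identity.
            if hmac.compare_digest(trusted, identity):
                return self._secret_item
        raise PermissionError("identity is not that of a trusted component")
```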
[0129] The primary management unit 100 may preferably further comprise a one-time programmable, OTP, memory 114 for storing an identity data item 116. This identity data item is used for checking and attesting the identity of said PMU 100, or of a component comprising said PMU, in particular at the time said component is added to the HPC machine, or powered on. For example, in at least one embodiment, when the PMU is arranged in/on a rack management controller, the OTP memory of the PMU may comprise an identity data for attesting the identity of said rack management controller.
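The write-once property of an OTP memory can be illustrated with the short model below. The class `OneTimeProgrammableMemory` and its methods are hypothetical names; in hardware the property is enforced physically, not by software.

```python
class OneTimeProgrammableMemory:
    """Software model of a memory that can be programmed only once
    (hypothetical class, for illustration only)."""

    def __init__(self):
        self._value = None

    def program(self, identity_item):
        # A second programming attempt is refused, as in a real OTP memory.
        if self._value is not None:
            raise PermissionError("OTP memory is already programmed")
        self._value = bytes(identity_item)

    def read(self):
        if self._value is None:
            raise LookupError("OTP memory has not been programmed")
        return self._value
```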
[0130] The identity data 116 may be a public key, injected in the OTP memory 114 at factory level, and whose private corresponding part is owned by the vendor or the supplier. This key pair may be used to attest, remotely, the authenticity of the new component, by using TPM-like standards. This way, a newly inserted/powered on component can be checked remotely for sanity before sharing the common secret data item with said component.
[0131] Of course, by way of one or more embodiments, the PMU 100 may comprise other component(s) or software(s), that will not be described here since those components are not relevant for the understanding of the one or more embodiments of the invention. For example, in at least one embodiment, the PMU 100 may comprise a central unit, or a central software, for managing the PMU, or for communicating with other components, for data encryption/decryption, etc.
[0132]
[0133] The secondary management unit, SMU, 200, shown in
Thus, the SMU 200 is readily and locally accessible by each CP, without using a centralized third party.
[0136] The SMU 200, shown in
[0137] The SMU 200 comprises a secure storage memory 206, similar to the secure data storage 106 of the PMU 100 of
[0138] The secure data storage memory 206 may be any type of memory, as long as the content of such memory is protected such that access to said memory is regulated. The security of the memory 206 may comprise hardware security means and/or software security means.
[0139] Preferably, in one or more embodiments, the secure data memory 206 may be a memory that is, partially or totally, erased when said memory is powered off. For example, in at least one embodiment, the secure data memory 206 may be a RAM memory such that the power-off of the SMU 200, or of the component comprising said SMU 200, results in the obliteration of the RAM memory, preventing a third party from accessing the secret data item 108.
[0140] Preferably, in one or more embodiments, the SMU 200 may also comprise a TEE 210 for executing a trusted client 212 controlling access to the secure storage memory 206, similar to the PMU 100. Thus, in at least one embodiment, the access to the secret data item 108 is controlled by a client 212. The TEE 210 guarantees, thanks to the client 212, that the secret data item 108, stored in the secure storage memory 206, is protected with respect to confidentiality and integrity. The TEE 210 may be a secure area of a processor.
[0141] For example, in one or more embodiments, an ARM CPU provides a built-in TEE, known as ARM TrustZone. An Intel CPU also provides a built-in TEE, known as Intel SGX. Thus, for example, the SMU may be integrated in an ARM CPU that may be a computing processor, or preferably a management processor.
[0142] Similar to the PMU 100, in one or more embodiments, the SMU 200 may preferably further comprise a one-time programmable, OTP, memory 214 for storing an identity data item 216. This identity data item 216 is used for checking and attesting the identity of said SMU 200, or of a component comprising said SMU 200, in particular at the time said component is added to the HPC machine, or powered on. For example, when the SMU 200 is arranged in/on a baseboard management controller of a computing blade, the OTP memory 214 of the SMU 200 may comprise an identity data 216 for attesting the identity of said baseboard management controller, or the identity of said computing blade.
[0143] The identity data 216 may be a public key, injected in the OTP memory 214 at factory level, and whose private corresponding part is owned by the vendor or the supplier. This key pair may be used to attest, remotely, the authenticity of the new component, by using TPM-like standards. This way, a newly inserted/powered on component can be checked remotely for sanity before sharing the common secret data item with said component.
[0144] Of course, by way of one or more embodiments, the SMU 200 may comprise other component(s) or software(s), that will not be described here since those components are not relevant to the understanding of the one or more embodiments of the invention. For example, in one or more embodiments, the SMU 200 may comprise a central unit, or a central software, for managing the SMU, or for communicating with other components, for data encryption/decryption, etc.
[0145]
[0146] The HPC machine 300, shown in
[0147] The HPC machine 300 comprises a central device 302 that coordinates and manages the computations in the HPC machine 300. The central device may be a server, a computer, a CPU, etc.
[0148] The HPC machine 300 also comprises several computing processors 304.sub.1-304.sub.n, for performing an overall computing task. Each computing processor (CP) may perform an individual computing task, in parallel or in series with at least one other CP, depending on the overall computing task.
[0149] The CPs 304.sub.1-304.sub.n are interconnected with each other and with the central device 302, through: [0150] an interconnection network 306 for computational data exchange between the CPs or with the central device 302, [0151] a management network 308 for management data exchange, especially with the central device.
The networks 306 and 308 may be different physical networks, such that each network is a different physical LAN (Local Area Network). In some embodiments, the networks 306 and 308 may share a same physical network: in this case, networks 306 and 308 may be different virtual networks, such as VLANs (Virtual Local Area Networks) sharing said same physical network.
[0152] Each CP 304.sub.1-304.sub.n comprises a primary management unit 100.sub.1-100.sub.n according to one or more embodiments of the invention. For example, primary management unit 100.sub.1-100.sub.n may be the PMU 100 of the
[0153] For example, each CP 304.sub.i may be an ARM CPU comprising built in TEE and OTP for the PMU 100.sub.i.
[0154] In this example, the secret data item common to all CP 304.sub.1-304.sub.n is generated at the level of the CPs 304.sub.1-304.sub.n.
[0155]
[0156] The HPC machine 400, shown in
[0157] In the HPC machine the CPs 304 are organized as groups 402.sub.1-402.sub.m. Each group 402.sub.j of CP(s) may comprise one or several CPs 304. At least two groups may comprise the same number of CPs 304. Alternatively, the number of CP in at least two groups may be different.
[0158] Each group 402.sub.1-402.sub.m may be a computing blade also comprising a Baseboard Management controller (BMC), respectively 404.sub.1-404.sub.m.
[0159] Each BMC 404.sub.j of a computing blade 402.sub.j is provided to power on the CPs of the said computing blade, for example individually. Each BMC 404.sub.j of a computing blade 402.sub.j may also provide other functions to the CPs of said computing blade 402.sub.j, such as a communication interface or a communication gateway, cooling, data encryption, etc.
[0160] Each BMC 404.sub.j comprises a processor (not shown) that is not a computing processor and that is not used for computing tasks in the HPC machine 400, contrary to the CPs 304. The processor of the BMC 404.sub.j may be called Managing Processor (MP) of the BMC. The MP of at least one BMC 404.sub.1-404.sub.m may be an ARM CPU.
[0161] In the example shown in
[0162] In the HPC 400 of
[0163] In this example, the secret data item common to all CPs 304 is generated at the level of the BMCs and not at the level of the CPs. Each CP gets the secret data item from the BMC associated to said CP.
[0164]
[0165] The HPC machine 500, shown in
[0166] Furthermore, by way of at least one embodiment, in the HPC machine 500 of
[0167] Each computing rack 502.sub.1 also comprises a Rack Management controller (RMC), respectively 504.sub.1-504.sub.m. Each RMC 504.sub.1 of a computing rack 502.sub.1 is provided to power on each computing blade 402 of said rack 502.sub.1, for example individually. More particularly, each RMC 504.sub.1 of a computing rack 502.sub.1 is provided to power on the BMC 404 of each computing blade 402 of said rack 502.sub.1, for example individually. Each RMC 504.sub.1 of a computing rack 502.sub.1 may also provide other functions to the computing blades 402 of said computing rack 502.sub.1, such as a communication gateway, cooling, data encryption, etc.
[0168] Each RMC 504.sub.1 comprises a processor (not shown) that is not a computing processor and that is not used for computing tasks in the HPC machine 500, contrary to the CPs 304. The processor of the RMC 504.sub.1 may be called Managing Processor (MP) of the RMC. The MP of at least one RMC 504.sub.1 may be an ARM CPU.
[0169] In the HPC machine 500, each RMC 504.sub.1-504.sub.k is provided with a primary management unit, respectively 100.sub.1-100.sub.k. The primary management unit 100.sub.1 of a RMC 504.sub.1 may preferably, with no loss of generality, be integrated in the processor of said RMC.
[0170] Moreover, in the HPC machine 500, each BMC 404 is provided with a secondary management unit, respectively 200.sub.1-200.sub.k. The secondary management unit 200 of a BMC may preferably, with no loss of generality, be integrated in the processor of said BMC. The secondary management unit 200 of a BMC may be the secondary management unit 200 of
[0171] In this example, in at least one embodiment, the secret data item common to all CPs 304 is generated at the level of the RMCs 504.sub.1-504.sub.l, and not at the level of the CPs 304 or BMCs 404. The secret data item is communicated to each BMC 404 and stored in the secure storage memory of said BMC, at the time the BMC is powered on. When a CP is powered on the secret data item is subsequently communicated to the CP from the BMC associated to said CP.
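The top-down distribution described in this example (RMC, then BMC, then CP) may be sketched as below. The class `ManagementNode` and its attributes are hypothetical, and the transfer of the secret data item is modelled as a plain attribute copy; identity verification and protection of the transfer are left out of this sketch.

```python
class ManagementNode:
    """One level of the RMC -> BMC -> CP hierarchy (hypothetical class)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.secret_item = None

    def power_on(self, generated_secret=None):
        if self.parent is None:
            # Top level (RMC in this example): the secret is generated here.
            if generated_secret is None:
                raise ValueError("the top-level unit must generate a secret")
            self.secret_item = generated_secret
        else:
            # Lower levels receive the secret from the unit that powers them on.
            if self.parent.secret_item is None:
                raise RuntimeError(f"{self.parent.name} is not powered on yet")
            self.secret_item = self.parent.secret_item


# Usage: power on the rack controller first, then the blade, then the processor.
rmc = ManagementNode("RMC")
bmc = ManagementNode("BMC", parent=rmc)
cp = ManagementNode("CP", parent=bmc)
rmc.power_on(generated_secret=b"\x01" * 32)
bmc.power_on()
cp.power_on()
```

After these three calls, the CP holds the same secret data item as the RMC that generated it.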
[0172] Of course, the HPC machine according to one or more embodiments of the invention is not limited to the examples shown in
[0173]
[0174] The method 600, shown in
[0175] The method 600 according to one or more embodiments of the invention comprises a phase 602, called a powering on phase, carried out when the HPC machine is powered on. The powering on phase 602 comprises a step 604 powering on a first primary management unit. The powering on of a first primary management unit may be carried out by powering on: [0176] a first CP 304.sub.i in the HPC machine 300 of
The powering on may be carried out manually or by the central device 302, or by another device local to or remote from the HPC machine.
[0179] At a step 606, a first secret data item is generated by the first powered on primary management unit, i.e. the one powered on at the step 604.
[0180] The secret data item is stored, at step 608, in the secure storage memory of the primary management unit. It will be shared with any primary management unit and, where applicable, any secondary management unit that is powered on thereafter, optionally after verification of the identity of the component integrating said management unit.
[0181] In some cases, in at least one embodiment, at the powering on phase 602, several primary management units are powered on at the same time. In this case, the secret data generating step 606 is carried out by each of said primary management units, such that each primary management unit generates a secret data item.
[0182] Then, before the storing step 608, the powering-on phase comprises a step 610 during which the powered-on primary management units negotiate together for choosing a common secret data item to be used thereafter in the HPC machine. The negotiation may be carried out according to a predetermined protocol.
[0183] The negotiating protocol may result in choosing one of the generated secret data items as the common secret data item, for example based on timestamp information or any other predetermined rule. In one or more embodiments, the negotiating protocol may instead calculate the common secret data item as a function of all of the secret data items generated at step 606.
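The two negotiation outcomes mentioned above may be sketched as follows. The rule "earliest timestamp wins", the use of SHA-256 as the combining function, and the names `choose_by_timestamp` and `combine_all` are assumptions made for this example; the description leaves the predetermined rule open.

```python
import hashlib


def choose_by_timestamp(candidates):
    """Select one generated secret as the common secret.

    candidates: list of (timestamp, unit_id, secret_item) tuples.
    Assumption: the earliest timestamp wins, and the unit identifier
    breaks ties so that every unit reaches the same decision.
    """
    return min(candidates, key=lambda c: (c[0], c[1]))[2]


def combine_all(candidates):
    """Compute the common secret as a function of all generated secrets.

    The candidates are processed in unit-identifier order so that the
    result does not depend on the order in which they were received.
    """
    hasher = hashlib.sha256()
    for _, _, secret_item in sorted(candidates, key=lambda c: c[1]):
        # Length prefix avoids ambiguity between adjacent secrets.
        hasher.update(len(secret_item).to_bytes(4, "big"))
        hasher.update(secret_item)
    return hasher.digest()
```

Both functions are deterministic, so each primary management unit that sees the same set of candidates obtains the same common secret data item.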
[0184] Regardless of the negotiation approach, the common secret data item obtained at the negotiating step may optionally be stored in a first block of a blockchain, in particular with a timestamp. The secret data item generated by each of the primary management units may also be added in a new block in said blockchain.
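A minimal hash-chained structure of timestamped blocks is sketched below for illustration. The block layout, the JSON encoding and the names `make_block` and `verify_chain` are assumptions; the description does not specify the blockchain format, nor how the payload of a block is protected.

```python
import hashlib
import json
import time

GENESIS_HASH = "0" * 64  # assumption: conventional value for the first block


def _digest(body):
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def make_block(previous_hash, payload_hex, timestamp=None):
    """Build a timestamped block linked to its predecessor by hash."""
    body = {
        "previous_hash": previous_hash,
        "timestamp": time.time() if timestamp is None else timestamp,
        "payload": payload_hex,  # hex string, e.g. an encoded data item
    }
    return {**body, "hash": _digest(body)}


def verify_chain(chain):
    """Check every block hash and every link to the previous block."""
    previous_hash = GENESIS_HASH
    for block in chain:
        body = {key: value for key, value in block.items() if key != "hash"}
        if block["previous_hash"] != previous_hash or _digest(body) != block["hash"]:
            return False
        previous_hash = block["hash"]
    return True
```

Altering the payload or the timestamp of any block changes its hash, which `verify_chain` then detects.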
[0185] After the power-on phase, the method 600 may comprise a phase 620 powering on an additional primary management unit, according to one or more embodiments.
[0186] The phase 620 comprises a step 622 powering on the additional primary management unit, for example by powering on: [0187] an additional CP 304.sub.i in the HPC machine 300 of
The powering on may be carried out manually or by the central device 302, or by another device local to or remote from the HPC machine.
[0190] At a step 624, in at least one embodiment, the common secret data item is communicated to said additional primary management unit, optionally after verification of its identity, or the identity of the component comprising said additional primary management unit.
[0191] The common secret data item is stored in the secure storage memory of said additional primary management unit, at step 626.
[0192] Optionally, in at least one embodiment, the additional primary management unit may also generate a secret data item that is timestamped and added to the blockchain in a new block.
[0193] The phase 620 may be repeated every time an additional primary management unit is powered on in the HPC machine.
[0194] The method 600, by way of one or more embodiments, may comprise a phase 630 powering on a new primary management unit, when a new component, such as a CP or a BMC or a RMC, is added to the HPC. Such a component may be an extension component, or a replacement component in case of maintenance for example.
[0195] The phase 630 comprises a step 632 powering on the new primary management unit, for example by powering on: [0196] a new CP 304 added to the HPC machine 300 of
The powering on may be carried out manually or by the central device 302, or by another device local to or remote from the HPC machine.
[0199] After the powering on step 632, at a step 638, the identity of the new component is checked. The identity of the new component may be checked as described above, thanks to the identity data stored in an OTP memory of the new primary management unit.
[0200] If the identity check is satisfactory, the common secret data item is communicated to said new primary management unit, at step 634.
[0201] The common secret data item is stored in the secure storage memory of said new primary management unit, at step 636.
[0202] Optionally, in at least one embodiment, the new primary management unit may also generate a secret data item that is timestamped and added to the blockchain in a new block.
[0203] The phase 630 may be repeated every time a new component comprising a primary management unit is added to the HPC machine.
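The sequence of steps of the phase 630 may be summarised by the sketch below. The function `admit_new_unit`, the dictionary representation of a management unit, and the `is_identity_valid` callback are hypothetical; the identity check itself (for example a remote attestation based on the identity data stored in the OTP memory) is abstracted behind that callback.

```python
def admit_new_unit(unit, common_secret, is_identity_valid):
    """Run the steps of phase 630 for a newly added management unit.

    unit: dict with an "identity" entry (hypothetical representation).
    is_identity_valid: callable returning True for a trusted identity.
    Returns True if the common secret was shared with the unit.
    """
    unit["powered_on"] = True  # step 632: power on the new unit
    if not is_identity_valid(unit["identity"]):  # step 638: identity check
        unit["secure_memory"] = None  # nothing is shared on failure
        return False
    # Steps 634 and 636: communicate the common secret, then store it.
    unit["secure_memory"] = common_secret
    return True
```

For example, with a set of trusted identities `trusted`, the call `admit_new_unit(unit, secret, trusted.__contains__)` shares the secret only with a unit whose identity belongs to that set.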
[0204] Of course, one or more embodiments of the invention are not limited to the examples detailed above.