COMPUTER SYSTEM INSTALLED ON BOARD A CARRIER IMPLEMENTING AT LEAST ONE SERVICE CRITICAL FOR THE OPERATING SAFETY OF THE CARRIER
20230012925 · 2023-01-19
CPC classification
G06F11/1629; G06F11/0739; G06F11/183; G06F11/3013; G06F11/2048
Abstract
A computer system installed on board a carrier, communicating in a network with a data concentrator and with a monitor, and implementing at least one service that is critical for the operating safety of the carrier, the critical service being redundant in at least two instances (δ.sub.1, . . . , δ.sub.m) on different respective computers (C.sub.1, . . . , C.sub.m) connected to the network, each computer (C.sub.k) implementing at least one software task implementing an instance (δ.sub.k) of the critical service and being configured to implement the critical service by way of time control.
Claims
1. A computer system installed on board a carrier, communicating in a network with a data concentrator and with a monitor (M), and implementing at least one service that is critical for the operating safety of the carrier, the critical service being redundant in at least two instances (δ.sub.1, . . . , δ.sub.m) on different respective computers (C.sub.1, . . . , C.sub.m) connected to said network, each computer (C.sub.k) implementing at least one software task implementing an instance (δ.sub.k) of the critical service and being configured to implement the critical service by way of time control by using: an increasing sequence of task activation dates (R.sub.n) and a sequence of corresponding task latest end dates (D.sub.n), relating to the starting (0) of the system, with a gap between an end date and the corresponding activation date greater than or equal to a threshold corresponding to an estimate of the execution time or of the response time of the task (WCET); a backup of an internal state (s.sub.n) of the computer between two successive activations of the service by way of modeling by recording the memory states of the task; an update of the internal state (s.sub.n+1) of the computer on each activation (n) of the service, starting after the corresponding activation date (R.sub.n), which reads the input data (i.sub.n) of the service, computes the output data (o.sub.n) of the service and provides them to the data concentrator, the dependency between firstly the updated internal state and the computed output data (s.sub.n+1, o.sub.n) and secondly the previous internal state and the read input data (s.sub.n, i.sub.n) being represented by a transfer function (f); and a relay server (SR.sub.k) configured to compute a signature (h.sub.n+1.sup.k), which is characteristic of the execution of the instance (δ.sub.k) of the service from the initial activation (0) of the system to the current latest end date (D.sub.n), by way of a hash chain dependent on a hash function (H) on nb bits, and to transmit the signature (h.sub.n+1.sup.k) to the monitor (M); the monitor (M) detecting a fault by analyzing the signatures (h.sub.n+1.sup.1, . . . , h.sub.n+1.sup.m) of the instances (δ.sub.1, . . . , δ.sub.m).
2. The system as claimed in claim 1, wherein a relay server is configured to compute the signature by way of a hash chain using a cryptographic hash function H, computed recursively for each instance k, in each period n, by way of the following relationship: h.sub.n+1.sup.k=H(h.sub.n.sup.k, i.sub.n.sup.k, o.sub.n.sup.k) in which: h.sub.n+1.sup.k represents the signature of the instance k in the period n+1; h.sub.n.sup.k represents the signature of the instance k in the period n; i.sub.n.sup.k represents the input data of the service of the instance k in the period n; and o.sub.n.sup.k represents the output data of the service of the instance k in the period n.
3. The system as claimed in claim 1, wherein the monitor (M) is configured to detect a temporal fault when a signature (h.sub.n+1.sup.k) of the instances (δ.sub.1, . . . , δ.sub.m) has not been received before the latest end date (D.sub.n) of the current period.
4. The system as claimed in claim 1, wherein the monitor (M) is configured to compare the signatures received from the relay servers (SR.sub.1, . . . , SR.sub.m) in order to detect an operational fault when one signature (h.sub.n+1.sup.k) of the instances (δ.sub.1, . . . , δ.sub.m) is different from the other signatures (h.sub.n+1.sup.k).
5. The system as claimed in claim 1, wherein, when the number of instances of the service is equal to at least three, and fewer than half of these instances are faulty, the monitor (M) is configured to take a majority vote among the signatures received in time providing a majority signature denoting the operational instances, the signatures that are different from the majority signature denoting the faulty instances, and configured to signal to the remainder of the system the operational instances and the faulty instances so that the transmission of the faulty instances is interrupted on the data concentrator.
6. The system as claimed in claim 5, wherein, when an instance is detected as faulty in a period n.sub.d, the computer hosting the faulty instance is configured to retrieve a copy of a correct internal memory state of another operational instance corresponding to the period n.sub.d, to restart the faulty instance from the correct memory state, and to feed back to the faulty instance the input data from the period n.sub.d to the current period, by applying the transfer function, possibly behind schedule in relation to the corresponding latest end dates, and the monitor (M) is configured so as, when the faulty instance has caught up with the operational instances, that is to say has applied, before the n.sup.th latest end date (D.sub.n), the transfer function to the inputs up to the period n, the signatures being equal again, to report the faulty instance as operational again.
7. The system as claimed in claim 1, wherein a relay server (SR.sub.k) is a software server implemented on the corresponding computer (C.sub.k).
8. The system as claimed in claim 1, wherein a relay server (SR.sub.k) is a hardware server implemented at the data concentrator.
9. The system as claimed in claim 1, comprising a network (RI) that is independent of the data concentrator for transmitting the signatures by way of the relay servers (SR.sub.k), with a lower bandwidth and higher reliability than those of the data concentrator.
10. A method for managing at least one service that is critical for the operating safety of a computer system as claimed in claim 1, installed on board a carrier, the critical service being redundant in at least two instances (δ.sub.1, . . . , δ.sub.m) on different respective computers (C.sub.1, . . . , C.sub.m) connected to said network, each implementation of an instance (δ.sub.k) of the critical service using: an increasing sequence of task activation dates (R.sub.n) and a sequence of corresponding task latest end dates (D.sub.n), relating to the starting (0) of the system, with a gap between an end date and the corresponding activation date greater than or equal to a threshold corresponding to an estimate of the execution time or of the response time of the task (WCET); a backup of an internal state (s.sub.n) of the computer between two successive activations of the service by way of modeling by recording the memory states of the task; an update of the internal state (s.sub.n+1) of the computer on each activation (n) of the service, starting after the corresponding activation date (R.sub.n), which reads the input data (i.sub.n) of the service, computes the output data (o.sub.n) of the service and provides them to the data concentrator, the dependency between firstly the updated internal state and the computed output data (s.sub.n+1, o.sub.n) and secondly the previous internal state and the read input data (s.sub.n, i.sub.n) being represented by a transfer function (f); and a computation of a signature (h.sub.n+1.sup.k) by a relay server, which is characteristic of the execution of the instance (δ.sub.k) of the service from the initial activation (0) of the system to the current latest end date (D.sub.n), by way of a hash chain dependent on a hash function (H) on nb bits, and a transmission of the signature (h.sub.n+1.sup.k) to the monitor (M); a detection of a fault by the monitor (M) being performed by analyzing the signatures (h.sub.n+1.sup.k) of the instances (δ.sub.1, . . . , δ.sub.m).
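As an illustration of the signature mechanism recited in claims 1, 2 and 4, the hash chain can be sketched in Python. This is a minimal model, not the claimed implementation: SHA-256 as H, the all-zero seed and the example data are assumptions for the sketch.

```python
import hashlib

def next_signature(h_prev: bytes, i_n: bytes, o_n: bytes) -> bytes:
    """Hash chain of claim 2: h_{n+1} = H(h_n, i_n, o_n)."""
    return hashlib.sha256(h_prev + i_n + o_n).digest()

h0 = b"\x00" * 32  # assumed common seed at system start-up (0)

# Two fault-free instances processing the same period from the same seed
# produce equal signatures.
h1_a = next_signature(h0, b"sensor=42", b"cmd=7")
h1_b = next_signature(h0, b"sensor=42", b"cmd=7")
assert h1_a == h1_b

# An instance whose output deviates yields a different signature, which is
# what the monitor exploits in claim 4.
h1_c = next_signature(h0, b"sensor=42", b"cmd=8")
assert h1_c != h1_a
```

Because each signature folds in the previous one, a single deviation in any period changes every subsequent signature of that instance.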
Description
[0045] The invention will be better understood on studying a few embodiments that are described using nonlimiting examples and are illustrated by the appended drawings.
[0049] Throughout the figures, elements that have identical references are similar.
[0051] A computer system 1 installed on board a carrier communicates in a network with a data concentrator 2 and with a monitor M and implements at least one service that is critical for the operating safety of the carrier, or safety critical, the critical service being redundant, i.e. executed in at least two instances δ.sub.1, . . . δ.sub.m on different respective computers C.sub.1, . . . , C.sub.m connected to said network, in this case two replicas on two respective computers.
[0052] Each computer C.sub.1, . . . , C.sub.m implements an instance δ.sub.k of the critical service and is configured to implement the critical service by using: [0053] an increasing sequence of activation dates R.sub.n of the system and a sequence of corresponding latest end dates D.sub.n, relating to the starting of the system, with a gap between an end date and the corresponding activation date greater than or equal to a threshold corresponding to an estimate of the execution time of the service WCET. The dates R.sub.n and D.sub.n comply with the following inequalities: ∀n, 0<R.sub.n<R.sub.n+1, 0<D.sub.n<D.sub.n+1, and D.sub.n−R.sub.n≥WCET; [0054] a backup of an internal state s.sub.n of the computer (modeling the memory states, registers and variables of the code) between two successive activations of the service by way of modeling by recording the memory states of the computer; [0055] an update of the internal state s.sub.n+1 of the computer on each activation n of the service, starting at the corresponding activation date R.sub.n, which reads the input data i.sub.n of the service, computes the output data o.sub.n of the service and provides them to the data concentrator 2, the dependency between firstly the updated internal state and the computed output data s.sub.n+1, o.sub.n and secondly the previous internal state and the read input data s.sub.n, i.sub.n being represented by a transfer function f; and [0056] a relay server SR.sub.k configured to compute a signature h.sub.n+1.sup.k, which is characteristic of the execution of the instance δ.sub.k of the service from the initial activation 0 of the system to the current latest end date D.sub.n, by way of a hash chain dependent on a hash function H on nb bits, and to transmit the signature h.sub.n+1.sup.k to the monitor M.
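The date constraints of paragraph [0053] can be checked mechanically. The following Python sketch (function name and example values are illustrative, not from the description) verifies ∀n, 0<R.sub.n<R.sub.n+1, 0<D.sub.n<D.sub.n+1 and D.sub.n−R.sub.n≥WCET:

```python
def valid_schedule(R, D, wcet):
    """Check the constraints of paragraph [0053]:
    0 < R_n < R_{n+1}, 0 < D_n < D_{n+1}, D_n - R_n >= WCET."""
    return (all(r > 0 for r in R) and all(d > 0 for d in D)
            and all(a < b for a, b in zip(R, R[1:]))
            and all(a < b for a, b in zip(D, D[1:]))
            and all(d - r >= wcet for r, d in zip(R, D)))

# Hypothetical 10 ms period: activations at 1, 11, 21, ... ms and
# latest end dates at 9, 19, 29, ... ms, leaving an 8 ms window per job.
R = [10 * n + 1 for n in range(5)]
D = [10 * n + 9 for n in range(5)]
assert valid_schedule(R, D, wcet=3)       # WCET of 3 ms fits the window
assert not valid_schedule(R, D, wcet=9)   # a 9 ms WCET would not
```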
[0057] The monitor M detects a fault by analyzing the signatures h.sub.n+1.sup.k of the instances δ.sub.1, . . . δ.sub.m.
[0059] The hash function H is a cryptographic hash function, that is to say that, for a message of arbitrary size, it computes a fixed-size fingerprint h that is fast to compute and resistant to preimage attacks (given a fingerprint h, it is impossible in practice to construct a message m such that H(m)=h), to second preimage attacks (knowing m1, it is impossible in practice to construct a message m2 such that H(m2)=H(m1)) and to collisions (it is impossible in practice to construct two different messages m1 and m2 such that H(m1)=H(m2)).
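These properties can be observed with a standard cryptographic hash such as SHA-256 (an assumption for illustration; the description does not impose a particular H):

```python
import hashlib

m1 = b"job 17: i=3, o=9"
m2 = b"job 17: i=3, o=8"  # a one-character deviation

d1 = hashlib.sha256(m1).hexdigest()
d2 = hashlib.sha256(m2).hexdigest()

# Fixed 256-bit (64 hex digit) fingerprint, fast to compute; any change to
# the message yields an unrelated digest, and constructing a message for a
# chosen digest is infeasible in practice.
assert len(d1) == 64 and len(d2) == 64
assert d1 != d2
```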
[0060] When a critical service is redundant, multiple instances δ.sub.k (k=1, 2 . . . m) implement the same transfer function f, but are susceptible to faults. The variables modeling the operation of the instance k are denoted X.sup.k, and those describing a theoretical fault-free instance are denoted X.
[0061] Each instance δ.sub.k satisfies the same time constraints R.sub.n and D.sub.n, is started in the same initial state s.sub.0.sup.1=s.sub.0.sup.2= . . . =s.sub.0 and receives the same inputs (i.sub.n) before the date R.sub.n (as a result of a multicast message being sent, or of multiple sends). Therefore, in nominal mode, all instances compute exactly the same internal state values s.sub.n.sup.1=s.sub.n.sup.2= . . . =s.sub.n, and produce the same outputs o.sub.n.sup.1=o.sub.n.sup.2= . . . =o.sub.n, before the latest date D.sub.n of the current period.
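The determinism argument of paragraph [0061] can be sketched as follows; the transfer function f below is purely illustrative, standing in for any deterministic service:

```python
def f(state, inp):
    """Illustrative transfer function: (s_{n+1}, o_n) = f(s_n, i_n)."""
    new_state = state + inp
    output = 2 * new_state
    return new_state, output

def run(inputs, s0=0):
    """Execute successive jobs from initial state s0, as one instance would."""
    s, outs = s0, []
    for i in inputs:
        s, o = f(s, i)
        outs.append(o)
    return s, outs

# Two replicas with the same initial state and the same multicast inputs
# reach identical internal states and produce identical outputs.
inputs = [3, 1, 4]
assert run(inputs) == run(inputs) == (8, [6, 8, 16])
```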
[0062] The instances are not necessarily executed simultaneously; they may be executed on computers having different frequencies or may be pre-empted by other tasks. The only necessary assumption is that the n.sup.th execution, or n.sup.th job, is effectively executed between the activation date R.sub.n and the current latest end date D.sub.n.
[0063] Let us suppose that the instance δ.sub.1 has an error, which is activated during the n.sup.th job: internal fault s.sub.n+1.sup.1≠s.sub.n+1 or external fault o.sub.n.sup.1≠o.sub.n. The invention allows these faults to be detected as soon as possible, for the purpose of signaling, and, if necessary, for the purpose of triggering a failsoft mode of operation.
[0064] The present invention uses the computation of a signature h.sub.n.sup.k, which is characteristic of the execution of each instance δ.sub.k from when it is started to the current latest date D.sub.n, and then transmission of these signatures to a monitor M, which compares them in order to detect an error.
[0065] The signature h.sub.n.sup.k is computed by way of a hash chain h.sub.n+1.sup.k=H(h.sub.n.sup.k, i.sub.n.sup.k, o.sub.n.sup.k, s.sub.n+1.sup.k), in which H is a hash function on nb bits. This computation may be performed by the instance δ.sub.k. The signature h.sub.n+1.sup.k is transmitted to the monitor M via the data concentrator 2 before the current latest date D.sub.n. After the current latest date D.sub.n, the monitor M compares all signatures h.sub.n+1.sup.k. In nominal mode, all signatures are equal.
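A minimal sketch of this chained computation, assuming SHA-256 for H and illustrative byte strings for i.sub.n, o.sub.n and s.sub.n+1 (this variant folds in the internal state, as in paragraph [0065]):

```python
import hashlib

def sign(h_prev: bytes, i_n: bytes, o_n: bytes, s_next: bytes) -> bytes:
    """Chain step of [0065]: h_{n+1} = H(h_n, i_n, o_n, s_{n+1})."""
    return hashlib.sha256(h_prev + i_n + o_n + s_next).digest()

# Chain two periods: the final signature characterizes the whole execution
# history, so any earlier deviation changes every later signature.
h = b"\x00" * 32
for i_n, o_n, s_next in [(b"i0", b"o0", b"s1"), (b"i1", b"o1", b"s2")]:
    h = sign(h, i_n, o_n, s_next)

# A replica that deviated in the first period cannot reconverge silently.
h_dev = sign(b"\x00" * 32, b"i0", b"o0", b"sX")
h_dev = sign(h_dev, b"i1", b"o1", b"s2")
assert h != h_dev
```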
[0066] Supposing that the system has remained in a nominal mode up to the latest date D.sub.n−1, the monitor M detects a fault in the following cases: [0067] if one of the signatures h.sub.n+1.sup.k has not been received before the latest date D.sub.n, which may be due to: [0068] a temporal fault in the course of the n.sup.th job of the instance δ.sub.k (computation not finished in time, and violation of the latest date D.sub.n), or [0069] a temporal fault in the transmission of a message comprising a signature h.sub.n+1.sup.k by the network (or by its controllers or software); [0070] if one of the signatures h.sub.n+1.sup.k is different from the others, which may be due to: a loss of integrity of the implementation of the transfer function (f.sub.k≠f); activation of this error leads to a fault in the internal state s.sub.n+1.sup.k≠s.sub.n+1, or in the outputs o.sub.n.sup.k≠o.sub.n, [0071] a loss of integrity, or of availability, of the inputs i.sub.n received by the instance k on the activation date R.sub.n: i.sub.n.sup.k≠i.sub.n, or [0072] an error in computing the signature by the relay server SR.sub.k, a loss of integrity, or a delay in transmitting the message containing the signature h.sub.n+1.sup.k to the monitor M: only these last cases correspond to a false positive (the monitor M then signals a fault on the instance δ.sub.k, whereas it does not have a fault).
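The two detection cases of paragraphs [0067] and [0070] can be modeled as a small monitor routine. This is a sketch; the instance names and the representation of "not received" as None are assumptions:

```python
def detect_fault(signatures, deadline_passed):
    """Return (temporal_fault, operational_fault) per [0066]-[0070].
    `signatures` maps an instance id to its signature bytes, or to None
    if no signature has been received."""
    # Temporal fault: a signature still missing once D_n has passed.
    temporal = deadline_passed and any(h is None for h in signatures.values())
    # Operational fault: received signatures are not all equal.
    received = [h for h in signatures.values() if h is not None]
    operational = len(set(received)) > 1
    return temporal, operational

assert detect_fault({"d1": b"A", "d2": b"A"}, True) == (False, False)
assert detect_fault({"d1": b"A", "d2": None}, True) == (True, False)
assert detect_fault({"d1": b"A", "d2": b"B"}, True) == (False, True)
```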
[0073] A false negative may occur when: [0074] the monitor M itself is faulty; this risk may be reduced by various means, including redundancy of the monitor so that there would be at least two monitors M, [0075] a majority of the instances δ.sub.k of the service are faulty and produce the same signature: [0076] this risk is conventionally considered to be sufficiently improbable to be tolerated in the event of faults of a random, independent or transitory nature (if each instance has a probability p of suffering a random fault, and the faults of the instances are believed to be independent, then the probability of K replicas being faulty is p.sup.K), [0077] the risk of constant faults or common modes is conventionally reduced by rigorous design, analysis and test processes, [0078] the signatures are all equal, and yet a replica deviates from its specification, that is to say: ∃j≠k, (o.sub.n.sup.j, s.sub.n+1.sup.j)≠(o.sub.n.sup.k, s.sub.n+1.sup.k) and nevertheless h.sub.n+1.sup.j=h.sub.n+1.sup.k, that is to say a collision for the hash function: [0079] for a cryptographic hash function or redundancy check of CRC type on nb bits, this collision has a probability of 2.sup.−nb. Selecting nb≥α log.sub.2(10) reduces this risk to an acceptable probability of ≤10.sup.−α (for example, for a tolerated fault probability of 10.sup.−12 per hour of operation, with comparison of signatures in the period Π=10 ms, or approximately 2.8×10.sup.−18 per 10 ms, nb=64 bits is sufficient).
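The dimensioning of nb in paragraph [0079] can be reproduced numerically. The helper below is an illustrative sketch of the arithmetic, not part of the description:

```python
from math import ceil, log2

def min_signature_bits(tolerated_prob_per_hour, period_s):
    """Smallest nb such that the per-period collision probability 2^-nb
    stays below the tolerated fault probability, spread over the
    comparison periods of one hour ([0079])."""
    periods_per_hour = 3600 / period_s
    per_period = tolerated_prob_per_hour / periods_per_hour
    # 2^-nb <= per_period  <=>  nb >= -log2(per_period) = alpha*log2(10)
    return ceil(-log2(per_period))

# 10^-12 per hour, signatures compared every 10 ms: 59 bits are needed,
# so a 64-bit signature is sufficient.
assert min_signature_bits(1e-12, 0.010) == 59
assert min_signature_bits(1e-12, 0.010) <= 64
```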
[0080] When the monitor M detects a deviation in the received signatures, it may signal this to an operating state management, or health management, device that will be responsible for deactivating the replicas, switching over to a failsoft mode, called FT (fault-tolerant) mode, or restarting all replicas in a reference state.
[0081] Additionally, if more than two replicas of the service are instantiated, the monitor M may determine, by way of a majority vote, the faulty instance(s) and selectively deactivate or restart them.
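The majority vote of claim 5 and paragraph [0081] can be sketched as follows (instance names are hypothetical):

```python
from collections import Counter

def majority_vote(signatures):
    """With at least three instances and a strict minority faulty, the
    majority signature denotes the operational instances (claim 5)."""
    counts = Counter(signatures.values())
    majority, n = counts.most_common(1)[0]
    if n <= len(signatures) // 2:
        return None, []  # no strict majority: cannot arbitrate
    faulty = [k for k, h in signatures.items() if h != majority]
    return majority, faulty

# Three replicas, one deviating: the deviating instance is identified and
# can be selectively deactivated or restarted.
sig, faulty = majority_vote({"d1": b"A", "d2": b"A", "d3": b"B"})
assert sig == b"A" and faulty == ["d3"]
```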
[0083] The invention allows very effective implementation of the redundancy principle, with fewer constraints than conventional lockstep, and with limited network and computational overhead, since the signature may be transmitted in a very short message. This load is further reduced if the signature computation is performed by a hardware accelerator.
[0084] As such, the signature generation device described in the patent FR2989488B1 provides an effective implementation of the signature of the execution. In the case of stateless functions (s.sub.n=Ø), the signature may be computed by the data concentrator itself.