Maintainable distributed fail-safe real-time computer system
11481012 · 2022-10-25
Cpc classification
H04L9/30
ELECTRICITY
G06F1/3206
PHYSICS
G06F11/3003
PHYSICS
G06F11/0757
PHYSICS
G06F11/3055
PHYSICS
G06F21/50
PHYSICS
International classification
H02J50/00
ELECTRICITY
G06F1/3206
PHYSICS
G06F21/50
PHYSICS
Abstract
A distributed maintainable real-time computer system is provided, wherein the real-time computer system includes at least two central computers and one, two or a plurality of peripheral computers. The central computers have access to a sparse global time, have identical hardware and identical software, but different startup data, wherein each functional central computer periodically sends time-triggered multi-cast life-sign messages to the other central computers according to a time plan a priori defined in its startup data, and wherein the peripheral computers (151, 152, 153, 154) can exchange messages (135) with the central computers (110, 120), and wherein at all times one central computer is in the active state and the other central computers are in the non-active state, and wherein after the apparent absence of a life-sign message of the active central computer expected at a planned reception time, that non-active functioning central computer which has the shortest start-up timeout takes over the function of the active central computer, and wherein each central computer (110, 120; 200) consists of three independent subsystems, an application computer (210), a storage medium having the startup data (230) characteristic of the central computer (200) and an internal monitor (220), wherein the internal monitor (220) periodically checks the correct functioning of the application computer (210), and wherein upon detection of an error the monitor (220) initiates a hardware reset and a restart of the application computer (210), and wherein preferably the active central computer initiates a maintenance action after an apparent absence of the life-sign messages expected at the planned reception times from a non-active central computer, which action can lead to the repair or replacement of a permanently failed central computer.
Claims
1. A distributed maintainable real-time computer system for controlling and/or monitoring an installation, the real-time computer system comprising: at least two central computers (110, 120; 200); and one or more peripheral computers (151, 152, 153, 154), wherein the installation is controlled and/or monitored with the one or more peripheral computers, wherein each peripheral computer controls and/or monitors a part of the installation, wherein the central computers have access to a sparse global time and the central computers have identical hardware and identical software, but use different startup data, wherein each functional central computer periodically sends time-triggered multi-cast life-sign messages to the other central computers according to a predetermined time plan defined a priori in its startup data, wherein the peripheral computers (151, 152, 153, 154) can exchange messages (135) with the central computers (110, 120), and wherein at any given time one central computer is in the active state and the other central computers are in the non-active state, and wherein, in particular immediately after the apparent absence of a life-sign message of the active central computer expected at a planned reception time, that non-active functioning central computer which has the shortest start-up timeout of all non-active functioning central computers takes over the function of the active central computer, and wherein each central computer (110, 120; 200) consists of three independent subsystems, an application computer (210), a storage medium having the startup data (230) characteristic of the central computer (200) and an internal monitor (220), wherein the internal monitor (220) periodically checks the correct functioning of the application computer (210), and wherein upon detection of an error, the internal monitor (220) initiates a hardware reset and a restart of the application computer (210), wherein the active central computer sends a start state message of a time-limited state message sequence to the peripheral computers after an occurrence of a significant event.
2. The real-time computer system according to claim 1, wherein the central computers (110, 120; 200) have a fail-silent characteristic.
3. The real-time computer system according to claim 1, wherein the application computer (210) of a central computer (200) periodically sends a life-sign message to the internal monitor (220) of the central computer (200).
4. The real-time computer system according to claim 1, wherein the internal monitor (220) of the central computer (200) periodically executes a challenge-response protocol to check the functional capability of the application computer (210) of the central computer (200).
5. The real-time computer system according to claim 1, wherein a peripheral computer has sensors to observe a physical state of an environment.
6. The real-time computer system according to claim 1, wherein the identical software for all central computers (110, 120) is cryptographically secured by means of a public key method.
7. The real-time computer system according to claim 1, wherein a value derived from an indicator determines which of the different startup data sets contained in the software of a central computer is used in this central computer.
8. The real-time computer system according to claim 1, wherein the time-triggered life-sign messages of a central computer contain the value of the indicator characterizing the startup data set currently used in that central computer.
9. The real-time computer system according to claim 1, wherein from a newly added central computer, after loading its software having different startup data sets, the life-sign messages of all functioning central computers are received and from these life-sign messages it is derived which startup data sets are already in use, and the newly added central computer sets its indicator in such a manner that the first startup data set not used at the present time is used in the newly added central computer.
10. The real-time computer system according to claim 1, wherein after the apparent absence of the state messages from a non-active central computer expected at the planned reception times, the active central computer orders a replacement or replacement parts for the failed central computer via the Internet.
11. The real-time computer system according to claim 1, wherein the central computers are equipped with a battery or other independent energy supply.
12. The real-time computer system according to claim 1, wherein the central computers are supplied with energy via a wireless charging station.
13. The real-time computer system according to claim 1, wherein a central computer has redundant wired or wireless communication channels for communication with the other central computers and/or the peripheral computers.
14. The real-time computer system according to claim 13, wherein the redundant wired or wireless communication channels are based on different transmission technologies.
15. The real-time computer system according to claim 1, wherein the central computer has an Internet connection via which human-machine communication with users can be conducted.
16. The real-time computer system of claim 15, wherein the human-machine communication with users can be conducted using a smart phone, tablet, or other mobile device.
17. The real-time computer system according to claim 1, wherein the active central computer initiates a maintenance action after an apparent absence of the life-sign messages expected at the planned reception times from a non-active central computer, which can lead to the repair or replacement of a permanently failed central computer.
Description
(1) In the following, the invention is explained in detail using the example shown in the drawings, in which:
(2)
(3)
(4)
(5) The two central computers 110, 120 exchange periodic time-triggered state messages via a communication channel 115. These state messages also have the function of life-sign messages.
(6) The message exchange can occur via a wired or wireless communication channel (e.g. via Wi-Fi or Bluetooth). It is advantageous if the communication channel 115 is designed redundantly and the redundant communication channels are based on different data transmission technologies, in such a manner that an error in one of the two redundant communication channels can be detected and masked.
(7) The communication between the central computers 110, 120 and the peripheral computers 151, 152, 153, 154 is preferably performed via a time-limited state message sequence.
(8) This message exchange or communication can occur via a wired or wireless communication channel 135 (e.g. via Wi-Fi or Bluetooth). It is advantageous if a communication channel 135 provided for this purpose is designed redundantly and the redundant communication channels are based on different data transmission technologies, in such a manner that an error in one of the two redundant communication channels can be detected and masked.
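One way to picture how an error in one of two redundant channels can be detected and masked is sketched below. The sketch is illustrative only: the patent does not prescribe a frame format or a checksum, so the CRC-32 trailer and the helper names `frame`, `check`, and `receive_redundant` are assumptions made for this example.

```python
import zlib
from typing import Optional, Sequence


def frame(payload: bytes) -> bytes:
    """Append a CRC-32 trailer so a receiver can detect a corrupted copy (assumed format)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check(raw: Optional[bytes]) -> Optional[bytes]:
    """Return the payload if the frame arrived and its CRC matches, else None."""
    if raw is None or len(raw) < 4:
        return None
    payload, crc = raw[:-4], raw[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") == crc:
        return payload
    return None


def receive_redundant(copies: Sequence[Optional[bytes]]) -> Optional[bytes]:
    """Mask a single-channel error: accept the first copy that passes the check.

    `copies` holds what each redundant channel delivered (None = nothing received).
    """
    for raw in copies:
        payload = check(raw)
        if payload is not None:
            return payload
    return None  # both channels failed: the error is detected but cannot be masked
```

With one corrupted or missing copy the receiver still obtains the payload from the other channel; only if every copy fails does the error become visible to the application.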
(9) A time-limited state message sequence is a sequence of state messages that is started by the active central computer, e.g., the central computer 110, upon detection of a significant event by sending a start state message to the peripheral computers and that is terminated after the last expected response state messages from the peripheral computers have arrived.
(10) A significant event is either the occurrence of an a priori predetermined time event (i.e. a predetermined time is reached) or a state change in the installation observed by the active central computer or a request by a user to make a state change in the installation.
(11) A data field of the start state message contains an intended future state of the process periphery and the connected installation part of one or a plurality of peripheral computers.
(12) The addressed peripheral computers perform the intended state change and respond with one or a plurality of multi-cast response state messages, which contain the current state of the process periphery and the installation. These multi-cast response state messages are received by all functioning central computers. Preferably, in order to ensure that the intended effect has actually occurred in the physical environment of the peripheral computer, the peripheral computer has sensors (e.g. a camera) with which the intended effect (or its absence) can be observed in the physical environment of the peripheral computer (e.g. opening state of a window).
(13) If the expected response state messages do not arrive at the active central computer within an a priori predetermined response timeout, the active central computer can repeat the state message sequence several times. If the several repetitions are unsuccessful, or if an error is observed in the physical effect, the active central computer detects an error in the peripheral computer or in the installation and issues a corresponding error notice to the user. Since state messages are idempotent, repeating identical state messages has no impact on the state.
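The repetition logic described above can be sketched as follows. This is a minimal simulation under stated assumptions: `run_state_sequence`, `Peripheral`, the dictionary-based state, and the default of three attempts are hypothetical names and values chosen for illustration, and a response timeout is modeled simply as a `None` return value.

```python
from typing import Callable, Dict, Optional

State = Dict[str, str]


def run_state_sequence(
    send_start: Callable[[State], Optional[State]],
    intended_state: State,
    max_attempts: int = 3,  # assumed value; the source only says "several times"
) -> dict:
    """Send the start state message, repeating it when the response times out."""
    for attempt in range(1, max_attempts + 1):
        response = send_start(intended_state)  # None models a response timeout
        if response is None:
            continue  # repeat the identical state message
        if response == intended_state:
            return {"ok": True, "attempts": attempt}
        # The observed state differs from the intended one: error in the physical effect.
        return {"ok": False, "attempts": attempt, "error": "state mismatch"}
    return {"ok": False, "attempts": max_attempts, "error": "response timeout"}


class Peripheral:
    """Simulated peripheral computer whose first `drop_first` responses are lost."""

    def __init__(self, drop_first: int = 0) -> None:
        self.state: State = {}
        self.drop = drop_first

    def handle(self, intended: State) -> Optional[State]:
        # Idempotent: the message carries the target state, not a delta,
        # so applying it repeatedly leaves the same state.
        self.state = dict(intended)
        if self.drop > 0:
            self.drop -= 1
            return None
        return dict(self.state)
```

Because the start state message names the target state rather than a change, the repeated sends in the loop cannot drive the peripheral past the intended state.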
(14)
(15) The software for the central computer can be loaded from a USB storage or from a cloud via the Internet.
(16) It is advantageous if the software is cryptographically secured by means of a public key method. The central computer 200 is then able to check the integrity of the software by means of a known public key before restarting the software. The corresponding private key for creating the software is preferably only known to the authorized creator of the software.
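The integrity check can be illustrated with a deliberately simplified sketch. The patent does not name a signature algorithm; the code below uses textbook RSA over a SHA-256 digest with a toy key built from two Mersenne primes, purely to show the sign-with-private-key, verify-with-public-key flow. It has no padding and a far too short key, so it is not secure; a real system would presumably rely on a vetted cryptographic library. All names (`sign`, `verify`, `digest`, the key constants) are assumptions for this example.

```python
import hashlib

# TOY KEY, ILLUSTRATION ONLY: two Mersenne primes give a modulus of about 150 bits.
P, Q = 2**61 - 1, 2**89 - 1
N = P * Q                           # public modulus
E = 65537                           # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent (Python 3.8+ modular inverse)


def digest(image: bytes) -> int:
    """Hash the software image and reduce it into the range of the toy modulus."""
    return int.from_bytes(hashlib.sha256(image).digest(), "big") % N


def sign(image: bytes) -> int:
    """Performed by the authorized creator, who alone knows D."""
    return pow(digest(image), D, N)


def verify(image: bytes, signature: int) -> bool:
    """Performed by the central computer before restarting the software; needs only (N, E)."""
    return pow(signature, E, N) == digest(image)
```

A central computer would call `verify` on the loaded image and refuse to start it when the result is `False`.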
(17) The startup data 230 can be loaded e.g. from an exchangeable USB storage.
(18) There is also the possibility that different startup data sets for all central computers are included in the software for the central computers and it depends on the value of an indicator which set of startup data is to be used in the central computer 200.
(19) An indicator is a value that specifies which alternative is to be selected from a given set of alternatives, here the different startup data sets.
(20) The current value of the indicator is included in every life-sign message of a functioning central computer.
(21) One possibility is to derive the value of the indicator from the location of a mechanical switch on the respective central computer 200.
(22) Another possibility is to derive the value of the indicator from a contact strip of a charging station for the respective central computer 200.
(23) A charging station is a device, advantageously with a battery, which realizes the energy supply of the central computer. In the concrete example, the two charging stations for the central computers 110, 120 have different contact strips.
(24) A further possibility is to set the value of the indicator in the cold start phase of the distributed computer system (immediately after power up of the whole system) by an algorithm using random numbers.
(25) After loading the software with all different startup data sets (e.g. from the cloud or a USB storage), a newly added central computer will first receive the life-sign messages from all functioning central computers. From the life-sign messages it can be derived which startup data sets are already in use. The indicator is now set in the newly added central computer in such a manner that the first currently unused startup data set is used by the newly added central computer.
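The selection rule for a newly added central computer can be sketched as below. The function name `choose_indicator` and the convention that indicators are the integers 0 to n-1 are assumptions for illustration; the source only states that the first currently unused startup data set is chosen.

```python
from typing import Iterable, Optional


def choose_indicator(indicators_in_use: Iterable[int], num_data_sets: int) -> Optional[int]:
    """Pick the first startup data set not announced in any received life-sign message.

    `indicators_in_use` are the indicator values read from the life-sign messages
    of all functioning central computers (indicators assumed to be 0..num_data_sets-1).
    """
    used = set(indicators_in_use)
    for indicator in range(num_data_sets):
        if indicator not in used:
            return indicator
    return None  # every startup data set is already in use
```

For a two-computer system in which the surviving computer announces indicator 0, the replacement computer would select indicator 1.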
(26) Each functioning central computer periodically sends time-triggered state messages to the other central computers by multicast, according to the selected time plan created a priori, which is preferably part of the startup data. Such a state message is interpreted as a life-sign message from the corresponding sending central computer. Time-triggered life-sign messages enable a very short error detection latency.
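A time plan of this kind can be pictured as a fixed send offset inside a repeating round. The sketch below is an assumption-laden illustration: the `TimePlan` class, its fields, and the use of integer ticks to stand for the sparse global time are not taken from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimePlan:
    """Per-computer entry of an a priori time plan (hypothetical representation)."""

    period: int       # length of one round, in ticks of the global time base
    send_offset: int  # tick within each round at which this computer sends

    def next_send(self, now: int) -> int:
        """Return the first planned send instant at or after `now`."""
        rounds, phase = divmod(now, self.period)
        if phase > self.send_offset:
            rounds += 1  # this round's slot has passed; use the next round
        return rounds * self.period + self.send_offset
```

Because every receiver knows the same plan, it also knows the instant at which each life-sign message is due, which is what keeps the error detection latency short.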
(27) If the functioning inactive central computer having the shortest start-up timeout has not received a life-sign message from the active central computer immediately after the specified reception time (which is included in the startup data), it assumes the role of the active central computer and sends a multicast life-sign message marked "active central computer" to all other central computers. After power-up, every other central computer enters the inactive state once it has received this active-central-computer message.
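The takeover rule can be sketched as a small simulation. Note the simplification: in the described system each central computer decides locally on the basis of its own timeout, whereas the code below takes a global view of all computers for brevity. The class `CentralComputer`, the function `take_over`, and the assumption that start-up timeouts are unique are introduced only for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CentralComputer:
    name: str
    startup_timeout: int      # taken from the startup data; assumed unique per computer
    functioning: bool = True
    active: bool = False


def take_over(computers: List[CentralComputer]) -> Optional[CentralComputer]:
    """Resolve a missing life-sign message of the active central computer.

    The functioning computer with the shortest start-up timeout times out first
    and therefore becomes active; all others stay or become inactive.
    """
    for c in computers:
        c.active = False
    candidates = [c for c in computers if c.functioning]
    if not candidates:
        return None
    successor = min(candidates, key=lambda c: c.startup_timeout)
    successor.active = True
    return successor
```

Giving each central computer a different start-up timeout in its startup data is what prevents two inactive computers from claiming the active role at the same instant.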
(28) The internal monitor 210 periodically checks the correct functioning of the application computer 220. This check can be performed either by the reception of a periodic life-sign of the application computer 220 by the monitor 210 or by the periodic initiation of a challenge-response protocol by the monitor 210.
(29) A life-sign is a periodic signal that is sent from the application computer 220, for example via a data line 215, to the monitor 210. If the life-sign is absent, the monitor assumes that the application computer 220 is failed and initiates a reset and a restart of the application computer 220.
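The life-sign check behaves like a watchdog timer. The following sketch assumes a simple timestamp comparison; the class name `LifeSignMonitor`, the `timeout` parameter, and the `reset` callback are hypothetical and stand in for the hardware reset line.

```python
from typing import Callable


class LifeSignMonitor:
    """Watchdog sketch: trigger a reset when the periodic life-sign stays away."""

    def __init__(self, timeout: float, reset: Callable[[], None]) -> None:
        self.timeout = timeout  # longest tolerated gap between life-signs (assumed parameter)
        self.reset = reset      # callback standing in for hardware reset and restart
        self.last_seen = 0.0

    def life_sign(self, now: float) -> None:
        """Called whenever a life-sign arrives from the application computer."""
        self.last_seen = now

    def check(self, now: float) -> bool:
        """Periodic check; returns True if a reset was initiated."""
        if now - self.last_seen > self.timeout:
            self.reset()
            self.last_seen = now  # give the restarted application computer a fresh interval
            return True
        return False
```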
(30) Challenge-response protocols for authenticating the correct behavior of a computer are described in detail in the specialized literature [WikCR]. The monitor 210 periodically sends a challenge message with a variable start value of a task to the application computer 220 e.g. via the data line 215. The application computer 220 has to respond to the task with the correct answer within a predetermined time interval. In case the monitor detects a faulty behavior of the application computer, the monitor 210 initiates a reset and a restart of the application computer 220.
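One round of such a challenge-response check might look like the sketch below. The task itself is not specified in the source, so computing a SHA-256 digest of a random start value is an assumption, as are the names `expected_answer` and `monitor_round` and the convention that `ask` returns the answer together with the elapsed time.

```python
import hashlib
import secrets
from typing import Callable, Tuple


def expected_answer(challenge: bytes) -> bytes:
    """The task known to both sides (assumed here: SHA-256 of the start value)."""
    return hashlib.sha256(challenge).digest()


def monitor_round(
    ask: Callable[[bytes], Tuple[bytes, float]],
    deadline: float,
    reset: Callable[[], None],
) -> bool:
    """Run one challenge-response round; initiate a reset on a wrong or late answer.

    `ask` sends the challenge to the application computer and returns
    (answer, elapsed_seconds).
    """
    challenge = secrets.token_bytes(16)  # variable start value, fresh for every round
    answer, elapsed = ask(challenge)
    ok = elapsed <= deadline and answer == expected_answer(challenge)
    if not ok:
        reset()
    return ok
```

Using a fresh start value for every round means a hung application computer cannot pass the check by replaying an earlier answer.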
(31) It is advantageous if the central computer 200 has a fail-silent characteristic, i.e. it produces only correct or recognizably wrong output messages. A recognizably wrong output message is rejected by the recipient. The state of the art teaches how to build a computer having a fail-silent characteristic (see [Kop11, p.130]).
(32) A standard operating system, e.g. LINUX, or a proprietary operating system can be used in the application computer 220. The application computer 220 has a wired or wireless communication channel (e.g. via Wi-Fi) to the Internet and on to a cloud for processing the collected data. The software for the central computer can also be loaded via this communication channel and replacement parts for defective components can be ordered. The human-machine interface of the distributed computer system can also be handled by means of an app (application software) via the Internet with a smart phone or tablet of the user.
(33) The application computer 220 provides a platform for executing a variety of application programs (apps) for controlling the process peripherals. These application programs are developed e.g. in coordination with or by the supplier of the existing peripheral computers and process peripherals.
(34) It is advantageous if the energy supplies of the central computers 110, 120 are independent of each other. For example, the central computers may each have a battery to buffer the energy supply.
(35) It is advantageous if the energy supply of the central computers 110, 120 is effected via wireless charging stations.
(36) It is advantageous if the entire data transfer is handled via wireless communication channels and the software is loaded from the cloud.
(37) It is advantageous if, in a safety-relevant application (e.g. in the field of medical technology), the peripheral computers and the corresponding installation parts are also designed redundantly.
(38) Troubleshooting a permanently failed central computer can proceed as follows:
1. After the active central computer detects a permanently failed central computer, it automatically orders a new central computer via the Internet.
2. The arrived package is unwrapped and the new central computer is positioned at the designated location of the charging station.
3. The software is loaded automatically from the cloud, the startup data are selected automatically, and the new central computer restarts automatically into the state of an inactive functional central computer.
(39) The only manual action for troubleshooting is to unwrap the arrived package and position the new central computer in the designated location of the existing charging station.
(40) Such simple troubleshooting does not require specially trained maintenance personnel, which leads to considerable cost savings.
(41) Since a central computer is still functioning, continuous operation during maintenance is ensured.
(42) An error in a peripheral computer and the connected installation is detected and diagnosed by the active central computer. If redundant peripheral computers and redundant installation parts are present in a safety-critical application, a failure of a peripheral computer or an installation part can be tolerated without interrupting the operation of the safety-critical application.
(43) In view of the currently high maintenance costs for electronic systems, the invention disclosed here is of great economic importance.
LITERATURE CITED
(44) [Kop11] Kopetz, H., Real-Time Systems. Springer Verlag. 2011
[WikCR] Wikipedia: Challenge-Response Authentication. Accessed on May 21, 2019
[WikSE] Wikipedia: Single Event Upset. Accessed on May 21, 2019