Method and Arrangement for Operating Two Redundant Systems

Abstract

A method and an arrangement having two redundant systems that operate in parallel in a cyclic mode and reciprocally check a result of the task of the other system on a regular basis, wherein one system is selected or confirmed for the productive mode when a fault situation is found, where a characteristic variable concerning an operating parameter is picked up for each of the systems in multiple or all cycles and used for updating statistical parameters, where, at least when a disparity between results of the two systems is found, a current operating parameter is correlated with the statistical parameter for each system, and where the system for which the current operating parameter differs less from the statistical parameter is detected as the correctly operating system and used for the productive mode, such that the degree of fault coverage, and hence the availability of the overall system, can be increased.

Claims

1. A method for operating an arrangement having two redundant systems each operating in parallel in a cyclic mode, one system of the two systems operating in a productive mode each time and another system of the two systems executing the same task for checking purposes, the method comprising: checking reciprocally by the two systems at least one result of a task of a respective other system on a regular basis, each system of the two systems comparing a result of the task of the other system of the two systems with its own result, wherein a detected fault comprising a detected disparity among the results leads to a fault situation being found, with one of the systems being selected or confirmed for the productive mode in the fault situation found; picking up at least one respective characteristic variable concerning an operating parameter for each of the systems in multiple or all cycles and using the picked up at least one respective characteristic variable for updating at least one statistical parameter each time; correlating a current operating parameter with the associated statistical parameter for each system at least when a disparity between the results of the two systems is found; and utilizing, for the productive mode, the system for which the respective current operating parameter differs less from the associated statistical parameter, said system being detected as the correctly operating system.

2. The method as claimed in claim 1, wherein the operating parameter used comprises a program runtime comprising a runtime for executing a complete cycle or a program part executed in a cycle each time.

3. The method as claimed in claim 1, wherein the operating parameter used is at least one performance counter for specifying a performance index of the respective system.

4. The method as claimed in claim 1, wherein a multiplicity of characteristic variables are picked up as operating parameters and combined to form one of (i) a set of statistical parameters and (ii) a single overall statistical parameter.

5. The method as claimed in claim 1, wherein commands for picking up the operating parameters are inserted in an application program of the systems that is executed in each cycle.

6. The method as claimed in claim 5, wherein the commands are inserted in a number of program blocks that are executed independently of branches or conditions in each cycle.

7. The method as claimed in claim 1, wherein the two redundant systems comprise industrial automation components.

8. An arrangement comprising: two redundant systems which operate in parallel in a cyclic mode, one of the two systems being switched to a productive mode each time and another of the two systems executing the same task for checking purposes; wherein the two systems reciprocally check results of the respective other system on a regular basis, in particular by virtue of each of the two systems comparing a respective result of a task of the respective other system with its own result of the task; wherein one of (i) a detected fault and (ii) a detected disparity among the results leads to a fault situation being found; wherein one system of the two systems is one of (i) selected for the productive mode and (ii) confirmed for the productive mode in the fault situation; wherein at least one respective characteristic variable concerning an operating parameter is picked up for each of the two systems in multiple or all cycles and used to update at least one statistical parameter each time; wherein, at least when a disparity between the results of the two systems is found, a current operating parameter is correlated with the associated statistical parameter for each system; and wherein the system for which the respective current operating parameter differs less from the associated statistical parameter is detected as the correctly operating system and used for the productive mode.

9. The arrangement as claimed in claim 8, wherein the operating parameter used forms a program runtime comprising a runtime for executing a complete cycle or a program part executed in a cycle each time.

10. The arrangement as claimed in claim 8, wherein the operating parameter used is at least one performance counter for specifying a performance index of the respective system.

11. The arrangement as claimed in claim 9, wherein the operating parameter used is at least one performance counter for specifying a performance index of the respective system.

12. The arrangement as claimed in claim 8, wherein a multiplicity of characteristic variables to be picked up as operating parameters are combined to form one of (i) a set of statistical parameters and (ii) a single overall statistical parameter.

13. The arrangement as claimed in claim 9, wherein a multiplicity of characteristic variables to be picked up as operating parameters are combined to form one of (i) a set of statistical parameters and (ii) a single overall statistical parameter.

14. The arrangement as claimed in claim 10, wherein a multiplicity of characteristic variables to be picked up as operating parameters are combined to form one of (i) a set of statistical parameters and (ii) a single overall statistical parameter.

15. The arrangement as claimed in claim 8, wherein commands for picking up operating parameters are inserted in an application program of the systems which is executed in each cycle.

16. The arrangement as claimed in claim 15, wherein the commands are inserted in a number of program blocks that are executed independently of branches or conditions in each cycle.

17. The arrangement as claimed in claim 15, wherein the two redundant systems comprise industrial automation components.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0021] The method according to the invention is explained below on the basis of an exemplary embodiment; the exemplary embodiment is used at the same time to explain an arrangement according to the invention, in which:

[0022] FIG. 1 shows a schematic depiction of two redundantly operated systems linked via a network having two production means;

[0024] FIG. 2 is a graphical plot showing the dependency of the availability of an arrangement comprising two systems in comparison with a single system based on a degree of fault coverage; and

[0025] FIG. 3 is a flowchart of the method in accordance with the invention.

DETAILED DESCRIPTION OF THE EXEMPLARY EMBODIMENTS

[0026] FIG. 1 depicts two systems S1, S2 (also called nodes) operating in parallel in a redundant mode. A network NW, such as an automation network, connects the systems S1, S2 to production units P1, P2 and also to one another for a data interchange DA, the production units P1, P2 being controlled by the systems S1, S2. It is assumed that one of the systems S1, S2 operates in the productive mode (master), i.e., actually controls the production units P1, P2, while the other system S1, S2 executes the same software (operating system, application program, automation program) in a shadow mode (slave) using the same input data (e.g., process parameters, measured values), its results being used only for checking the respective other system S1, S2. If one of the systems S1, S2 fails, or is detected as faulty, the respective other system S1, S2 takes over the productive mode or continues it while the faulty system S1, S2 is repaired, such as via a restart.

[0027] FIG. 2 depicts the ratio of the mean time between failures (MTBF) of a redundant system (MTBF_system) and a single system (MTBF_single) based on the degree of fault coverage (diagnostic coverage, DC). As already described at the outset, the degree of fault coverage (DC), and therefore the certainty of the diagnosis of which of the differing systems S1, S2 is faulty, is essential for increasing the dependability and availability of the overall system.

[0028] The method in accordance with the invention is based on the measurement of the performance of individual parts of the operating system, or of the firmware (system programs) and/or the user program. The user program is normally the software component most susceptible to faults and can also most easily be provided with diagnosis instructions. As a result, the application program is for the most part the focus of the examinations under consideration here. Operating parameters are picked up in this case. To this end, the runtime can be measured, and/or what are known as performance counters of the respective (individual) system S1, S2, which a modern CPU usually provides, can be ascertained.

[0029] The exemplary embodiment is based on programmable logic controllers as considered systems S1, S2, these executing an automation task (e.g., production control or process automation) in cycles. For each cycle, approximately 10-1000 of these measured values are ascertained for operating parameters and, normally, at the end of the cycle, variables derived on each of the two nodes, i.e., statistical parameters, are computed therefrom. In a simple case, these are the mean value and variance for each measured value of an operating parameter.

[0030] In the event of a fault, i.e., if the reciprocal comparison of the two systems S1, S2 or nodes fails or exhibits discrepancies, each of the two systems S1, S2 or each of the nodes uses the previously computed derived variables, i.e., the statistical parameters, to ascertain whether the current measured values of the operating parameters allow an anomaly, i.e., a fault, to be inferred. The disparity in the current measured value of an operating parameter from its continually updated statistical parameter is what is known as an anomaly value in this case. If one of the systems S1, S2 or nodes has computed a very much higher anomaly value than the other, it makes sense to shut down this system or to remove it from the productive mode, and to allow the other system to continue to run in the productive mode or to transfer it from the shadow mode to the productive mode. For the comparison, a data interchange DA between the systems S1, S2 (if the systems S1, S2 monitor one another) or between each system S1, S2 and an evaluating entity (not depicted in the figures) can be provided.

[0031] The use of the comparison of the anomaly values as a selection criterion can be justified in that many faults, in particular hardware faults, can have an influence on the performance of one or more program parts. A few examples may be cited here in this regard:

[0032] A fault in the memory management unit (MMU) results in faulty addresses being accessed. There is a high probability of these not being in the cache. As a result, the cache miss rate and hence the program runtime increase.

[0033] A fault in the arithmetic and logic unit terminates computing operations too early, resulting in an altered runtime response.

[0034] A fault in the control unit or distortion of loop counters means that the correct number of loop passes is not executed, resulting in an altered runtime response.

[0035] The distortion of a process value means that rarely executed program parts are executed, resulting in an unusual runtime response.

[0036] Distortion of the program also usually results in alterations in the runtime.

[0037] Together with the certainty that at least one of the two systems S1, S2/nodes must be faulty, a high anomaly in the runtime response is thus a strong indication of there being an abnormal response. Here, the current anomaly values of the two systems S1, S2 are compared in order to determine, in the event of a fault, the system that has the higher probability of being the faulty system. Thus, regular fluctuations in the operating parameters have no effect on the decision because they arise on both systems S1, S2 in equal measure during correct operation.

[0038] The proposed measures allow the probability of the wrong node being shut down, which is sometimes as high as 50%, to be reduced significantly. It should be noted that lowering it to just 30% would already lead to a significant increase in the MTBF (and hence to a reduction in failure-related costs for an operator); in this regard, see also FIG. 2.

[0039] A specific exemplary embodiment assumes that the cyclically executed program can be broken down at the topmost level into a suitable number (approximately 10-1000) of sequentially executed blocks. This means that loops and case distinctions occur only inside these blocks. The blocks can contain system functions (e.g. driver calls) or user-programmed functions (e.g. reading and processing of sensor data, comparison of data with one another and against constant desired values, Boolean combination of the comparison results or calculation of control values).

[0040] Each program block is instrumented by the generating chain (e.g. engineering system, in particular compiler) to the effect that one or more measured values (runtime, or number of cache hits) are produced for this block in each cycle. Overall, N measured values (x_1 to x_N) are generated in a cycle for all blocks.
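The per-block instrumentation described above might be sketched as follows. This is only an illustration under stated assumptions: the runtime is taken with Python's wall-clock timer rather than with CPU performance counters, the measurement is added at run time rather than by a compiler, and the names run_cycle, read_sensors and compute_control_values are hypothetical and do not come from the description.

```python
import time
from typing import Callable, List, Sequence


def run_cycle(blocks: Sequence[Callable[[], None]]) -> List[int]:
    """Execute every top-level block once, in order.

    Returns one measured value (runtime in nanoseconds) per block,
    i.e. the values x_1 .. x_N for this cycle.
    """
    measurements: List[int] = []
    for block in blocks:
        start = time.perf_counter_ns()
        block()  # loops and case distinctions live inside the block
        measurements.append(time.perf_counter_ns() - start)
    return measurements


# Hypothetical stand-ins for system functions and user-programmed functions.
def read_sensors() -> None:
    sum(range(1000))


def compute_control_values() -> None:
    sorted(range(500, 0, -1))


x = run_cycle([read_sensors, compute_control_values])
```

Because the blocks are executed sequentially and unconditionally at the topmost level, every cycle yields the same number N of measured values, which keeps the per-value statistics comparable from cycle to cycle.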

[0041] For each measured value x_i, the two variables M_i and S_i are furthermore created, which store the running mean value of the value and a running measure of its variance (the accumulated sum of squared deviations from the mean). These values are initialized (for all i from 1 . . . N) after the first cycle as follows:

TABLE-US-00001
init_i(in x_i, out M_i, out S_i) {
    M_i := x_i;
    S_i := 0;
}

[0042] From the second cycle onward, the values are updated (update function) as follows, the variable k being a global cycle counter:

TABLE-US-00002
update_i(in k, in x_i, inout M_i, inout S_i) {
    Mlast := M_i;
    M_i := Mlast + (x_i - Mlast) / k;
    S_i := S_i + (x_i - Mlast) * (x_i - M_i);
}

[0043] This update involves recurrence equations for the mean value and the variance, cf. D. Knuth: The Art of Computer Programming, Vol. 2, 3rd Ed., Section 4.2.2, page 232. It should be noted that S_i can never be negative in this case.
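The initialization and update functions can be sketched in Python as shown below. This is a minimal sketch rather than the implementation of the embodiment: the RunningStat class, the per-value cycle counter k (the description uses a single global counter) and the sample_variance helper are assumptions added for illustration. With this recurrence, S_i accumulates the sum of squared deviations from the mean, so a sample variance would be S_i / (k - 1).

```python
from dataclasses import dataclass


@dataclass
class RunningStat:
    mean: float      # M_i
    s: float = 0.0   # S_i, accumulated sum of squared deviations
    k: int = 1       # number of cycles seen (assumed per-value counter)


def init_stat(x: float) -> RunningStat:
    """After the first cycle: M_i := x_i, S_i := 0."""
    return RunningStat(mean=x, s=0.0, k=1)


def update_stat(stat: RunningStat, x: float) -> None:
    """From the second cycle onward: recurrence for mean and S_i."""
    stat.k += 1
    m_last = stat.mean
    stat.mean = m_last + (x - m_last) / stat.k
    stat.s += (x - m_last) * (x - stat.mean)


def sample_variance(stat: RunningStat) -> float:
    """Assumed helper: variance derived from S_i; 0.0 until two values exist."""
    return stat.s / (stat.k - 1) if stat.k > 1 else 0.0


# Usage: fold a short series of measured values into one statistic.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
stat = init_stat(values[0])
for v in values[1:]:
    update_stat(stat, v)
```

The recurrence needs only constant memory per measured value, which matters when up to about 1000 values are tracked in every cycle.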

[0044] If a fault occurs (i.e. the outputs from the master and the reserve or the productive system and the shadow system do not match), each of the two nodes can compute the anomaly values of the most recently measured values. Instead of the update function, the calcAnomalyValue function is then called (again for all i from 1 . . . N):

TABLE-US-00003
calcAnomalyValue(in x_i, M_i, S_i, out aValue_i) {
    squaredDiff := (M_i - x_i) * (M_i - x_i);
    if (S_i <= epsilon)
        aValue_i := MAX_A_VALUE;
    else
        aValue_i := squaredDiff / S_i;
}

[0045] The more the current value x_i differs from the average value M_i, the higher is its anomaly, where values with a high variance S_i are weighted less strongly.
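The weighting effect can be illustrated with a direct Python transcription of calcAnomalyValue. The concrete values chosen for EPSILON and MAX_A_VALUE are assumptions; the description does not specify them.

```python
EPSILON = 1e-9       # assumed threshold below which S_i counts as zero
MAX_A_VALUE = 1e12   # assumed cap used when S_i is (almost) zero


def calc_anomaly_value(x: float, mean: float, s: float) -> float:
    """Squared distance of x from the mean, weighted by 1 / S_i."""
    squared_diff = (mean - x) * (mean - x)
    if s <= EPSILON:
        return MAX_A_VALUE
    return squared_diff / s


# Same deviation from the mean, different spread in the history:
steady = calc_anomaly_value(x=12.0, mean=10.0, s=0.5)   # value that rarely varies
noisy = calc_anomaly_value(x=12.0, mean=10.0, s=8.0)    # value that varies a lot
```

Here the deviation of 2 counts sixteen times more for the steady value than for the noisy one, because its S_i is sixteen times smaller.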

[0046] By adding the values aValue_i, each of the two nodes can independently calculate a total anomaly value. If in doubt, the node having the higher value is shut down, this being able to be accomplished via a data interchange DA transmitting the node's own anomaly value to the neighboring node, and vice versa. In an advantageous embodiment, a node itself (provided it is still operational) decides whether it needs to be shut down/repaired, or can be operated further. In another embodiment, this decision is undertaken by a central entity, such as a central operational controller or a watchdog device based on the anomaly values found.
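A sketch of the summation and of the comparison between the two nodes follows. The function names and the optional margin parameter are assumptions for illustration; the margin is one possible way to express that a node is shut down only when its anomaly value is clearly higher than that of the other node.

```python
from typing import Optional, Sequence


def total_anomaly(a_values: Sequence[float]) -> float:
    """Total anomaly value of one node: sum over all aValue_i."""
    return float(sum(a_values))


def select_node_to_shut_down(
    total_s1: float, total_s2: float, margin: float = 0.0
) -> Optional[str]:
    """Return the node with the higher total anomaly value.

    Returns None when neither total exceeds the other by more than
    `margin` (assumed tie handling, not taken from the description).
    """
    if total_s1 > total_s2 + margin:
        return "S1"
    if total_s2 > total_s1 + margin:
        return "S2"
    return None


# Usage: each node sums its own values and receives the other node's total
# via the data interchange DA.
own_total = total_anomaly([0.5, 0.1])
neighbor_total = total_anomaly([12.0, 3.0])
faulty = select_node_to_shut_down(own_total, neighbor_total)
```

The same function could be evaluated on each node or by a central entity, since it depends only on the two totals.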

[0047] In the exemplary embodiment, the statistical evaluation of the operating parameters can be refined further, such as by taking into consideration the distribution function for individual x_i or taking into consideration dependencies between the x_i belonging to the same program section. Further, dependencies between x_i of different program sections can be taken into consideration. Further, the historic evolution of an x_i can be taken into consideration, where it is also possible to take into consideration dependencies pertaining to the current and historic process input values.

[0048] FIG. 3 is a flowchart of a method for operating an arrangement having two redundant systems S1, S2 each operating in parallel in a cyclic mode, one system of the two systems S1, S2 operating in a productive mode each time and another system of the two systems S1, S2 executing the same task for checking purposes. The method comprises checking reciprocally by the two systems S1, S2 at least one result of a task of a respective other system on a regular basis, as indicated in step 310. In accordance with the invention, each system of the two systems S1, S2 compares a result of the task of the other system of the two systems S1, S2 with its own result, where a detected fault comprising a detected disparity among the results leads to a fault situation being found, with one of the systems S1, S2 being selected or confirmed for the productive mode in the fault situation found.

[0049] Next, at least one respective characteristic variable concerning an operating parameter is picked up for each of the systems S1, S2 in multiple or all cycles, and the picked up at least one respective characteristic variable is used for updating at least one statistical parameter each time, as indicated in step 320.

[0050] Next, a current operating parameter is correlated with the associated statistical parameter for each system S1, S2 at least when a disparity between the results of the two systems S1, S2 is found, as indicated in step 330.

[0051] Next, the system S1, S2 for which the respective current operating parameter differs less from the associated statistical parameter is detected as the correctly operating system S1, S2 and is utilized for the productive mode, as indicated in step 340.
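Steps 310 to 340 can be tied together in a small simulation. The sketch below is illustrative only: the Node class, the run function, the constants and the hand-written measurement series are assumptions, and the results are compared by a single process rather than exchanged between two real systems.

```python
from typing import List, Optional

EPSILON = 1e-9       # assumed
MAX_A_VALUE = 1e12   # assumed


class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.k = 0
        self.mean: List[float] = []
        self.s: List[float] = []

    def update(self, x: List[float]) -> None:
        """Step 320: fold one cycle's measured values into the statistics."""
        self.k += 1
        if self.k == 1:
            self.mean, self.s = list(x), [0.0] * len(x)
            return
        for i, xi in enumerate(x):
            m_last = self.mean[i]
            self.mean[i] = m_last + (xi - m_last) / self.k
            self.s[i] += (xi - m_last) * (xi - self.mean[i])

    def total_anomaly(self, x: List[float]) -> float:
        """Step 330: correlate current values with the statistics."""
        total = 0.0
        for xi, m, s in zip(x, self.mean, self.s):
            total += MAX_A_VALUE if s <= EPSILON else (m - xi) ** 2 / s
        return total


def run(meas_s1, meas_s2, results_s1, results_s2) -> Optional[str]:
    """Return the node kept in the productive mode at the first disparity."""
    s1, s2 = Node("S1"), Node("S2")
    for x1, x2, r1, r2 in zip(meas_s1, meas_s2, results_s1, results_s2):
        if r1 != r2:  # step 310: reciprocal check found a disparity
            a1, a2 = s1.total_anomaly(x1), s2.total_anomaly(x2)
            return s1.name if a1 <= a2 else s2.name  # step 340
        s1.update(x1)
        s2.update(x2)
    return None  # no disparity occurred


# Hand-written example: in cycle 4 the results differ and S2's first
# measured value jumps far away from its history.
kept = run(
    meas_s1=[[10.0, 20.0], [11.0, 19.0], [9.0, 21.0], [10.0, 20.0]],
    meas_s2=[[10.0, 20.0], [9.0, 21.0], [11.0, 19.0], [30.0, 20.0]],
    results_s1=[1, 2, 3, 4],
    results_s2=[1, 2, 3, 5],
)
```

In this example both nodes fluctuate by the same amount in the first three cycles, so only the jump on S2 in the faulty cycle separates the two total anomaly values, and S1 is kept in the productive mode.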

[0052] Thus, while there have been shown, described and pointed out fundamental novel features of the invention as applied to a preferred embodiment thereof, it will be understood that various omissions and substitutions and changes in the form and details of the devices illustrated, and in their operation, may be made by those skilled in the art without departing from the spirit of the invention. For example, it is expressly intended that all combinations of those elements and/or method steps which perform substantially the same function in substantially the same way to achieve the same results are within the scope of the invention. Moreover, it should be recognized that structures and/or elements shown and/or described in connection with any disclosed form or embodiment of the invention may be incorporated in any other disclosed or described or suggested form or embodiment as a general matter of design choice. It is the intention, therefore, to be limited only as indicated by the scope of the claims appended hereto.

CPC classification

G05B2219/24195, G05B19/0428, G05B19/0425, G05B23/0221, G05B9/03, G05B2219/24182, G06F11/202 (PHYSICS)

International classification

G05B23/02, G05B9/03, G05B19/042, G06F11/20 (PHYSICS)
