Combined model-based approach and data driven prediction for troubleshooting faults in physical systems

Abstract

A method for diagnosing and troubleshooting failures of components of a physical system with low troubleshooting cost, according to which for each component in the system, a Model-Based Diagnosis (MBD) is used for computing the probability of causing a system failure, based on currently observed system behavior or on knowledge about the system's structure. Then the probability of causing a system failure is computed, based on its age and its survival curves. Then, it is determined whether a faulty component C should be fixed or replaced by minimizing future troubleshooting costs, being the costs of the process of diagnosing and repairing an observed failure.

Claims

1. A method for diagnosing and troubleshooting failures of components of a physical system with low troubleshooting cost, comprising: a) for each component C in said system: a.1) using a Model-Based Diagnosis (MBD) for computing a first fault likelihood estimate being the probability of causing a system failure, based on currently observed system behavior or on knowledge about the system's structure; a.2) computing a second fault likelihood estimate being the probability of causing a system failure, based on its age and on prior probability distributions and fault likelihood estimation given by its survival curves; a.3) using the fault likelihood estimation given by said survival curves as prior probability distributions within the likelihood estimation computation done by said MBD by combining said first and second fault likelihood estimates using a weighted linear combination, such that the weights are positive and sum up to one; b) choosing whether a faulty component C should be fixed or replaced by minimizing the future sum of troubleshooting costs, being the costs of the process of diagnosing and repairing an observed failure; and c) upon detecting that said physical system fails, initiating a troubleshooting process by performing sense and repair actions until the system is fixed.

2. The method according to claim 1, wherein troubleshooting is performed according to: d) diagnostic information about the relation between sensor data and faults; and e) the likelihood of each component to fail for a given age of said component, obtained from its corresponding survival curves.

3. The method according to claim 1, wherein troubleshooting is performed by a process that minimizes current troubleshooting costs and maintenance costs of future failing components.

4. The method according to claim 1, wherein troubleshooting is performed by a troubleshooting agent, being capable of performing sensing and repair actions.

5. The method according to claim 1, further comprising deploying one or more sensors in the system, for fault detection.

6. The method according to claim 1, wherein the troubleshooting agent performs a sequence of actions that results in a system state, in which all system components are healthy.

7. The method according to claim 1, wherein the MBD algorithm uses a system model that represents the relation between the system inputs (including sensors) and outputs, and the components behavior.

8. A method for diagnosing failures of components of a physical system consisting of a plurality of components, comprising: a) for each component C in said system: a.1) using a Model-Based Diagnosis (MBD) for computing a first fault likelihood estimate being the probability of causing a system failure, based on knowledge about the system's structure; a.2) computing a second fault likelihood estimate being the probability of causing a system failure, based on its age and on prior probability distributions and fault likelihood estimation given by its survival curves; a.3) using the fault likelihood estimation given by said survival curves as prior probability distributions within the likelihood estimation computation done by said MBD by combining said first and second fault likelihood estimates using a weighted linear combination, such that the weights are positive and sum up to one; b) continuously collecting data readings from one or more sensors deployed in said system; c) upon detecting data reading(s) indicative of system failure, computing for each component C, the probability that said component C caused said system failure; and d) determining that one or more components having probability higher than a predetermined threshold caused said system failure.

9. The method according to claim 1, further comprising: e) for each component C in said system, computing the probability of causing future system failures, based on its age and its survival curves; f) computing the troubleshooting costs of said future system failures; and g) providing indications which currently intact component C should be replaced to minimize said troubleshooting costs.

10. A system having diagnosing and troubleshooting capability of failures of components of a physical system with low troubleshooting cost, comprising: a) one or more processors for performing the following steps for each component C in said system: a.1) computing a first fault likelihood estimate being the probability of causing a system failure, based on a Model-Based Diagnosis (MBD) and on currently observed system behavior or on knowledge about the system's structure; a.2) computing a second fault likelihood estimate being the probability of causing a system failure, based on its age and on prior probability distributions and fault likelihood estimation given by its survival curves; and a.3) using the fault likelihood estimation given by said survival curves as prior probability distributions within the likelihood estimation computation done by said MBD by combining said first and second fault likelihood estimates using a weighted linear combination, such that the weights are positive and sum up to one; a.4) providing indication whether a faulty component C should be fixed or replaced by minimizing the future sum of troubleshooting costs, being the costs of the process of diagnosing and repairing an observed failure; and a.5) upon detecting that said physical system fails, initiating a troubleshooting process by performing sense and repair actions until the system is fixed.

11. The system according to claim 10, in which troubleshooting is performed according to: b) diagnostic information about the relation between sensor data and faults; and c) the likelihood of each component to fail for a given age of said component, obtained from its corresponding survival curves.

12. The system according to claim 10, in which troubleshooting is performed by a process that minimizes current troubleshooting costs and maintenance costs of future failing components.

13. The system according to claim 10, in which troubleshooting is performed by a troubleshooting agent, being capable of performing sensing and repair actions.

14. The system according to claim 10, further comprising one or more sensors deployed in the physical system, for fault detection.

15. The system according to claim 10, in which the troubleshooting agent performs a sequence of actions that results in a system state, in which all system components are healthy.

16. The system according to claim 10, in which the MBD algorithm uses a system model that represents the relation between the system inputs (including sensors) and outputs, and the components' behavior.

17. A system for diagnosing failures of components of a physical system consisting of a plurality of components and having one or more sensors deployed in said system, comprising: a) one or more processors for performing the following steps for each component C in said system: a.1) computing a first fault likelihood estimate being the probability of causing a system failure, based on using a Model-Based Diagnosis (MBD) and on knowledge about the system's structure; a.2) computing a second fault likelihood estimate being the probability of causing a system failure, based on its age and on prior probability distributions and fault likelihood estimation given by its survival curves; a.3) using the fault likelihood estimation given by said survival curves as prior probability distributions within the likelihood estimation computation done by said MBD by combining said first and second fault likelihood estimates using a weighted linear combination, such that the weights are positive and sum up to one; a.4) continuously collecting data readings from said one or more sensors; a.5) upon detecting data reading(s) indicative of system failure, computing for each component C, the probability that said component C caused said system failure; and a.6) determining that one or more components having probability higher than a predetermined threshold caused said system failure.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) The above and other characteristics and advantages of the invention will be better understood through the following illustrative and non-limitative detailed description of preferred embodiments thereof, with reference to the appended drawings, wherein:

(2) FIG. 1 illustrates an example of exponential survival curves;

(3) FIG. 2 depicts a possible Bayesian Network (BN) that represents an example of a car that does not start;

(4) FIG. 3 illustrates a graphical representation of a car diagnosis system;

(5) FIGS. 4A and 4B show the troubleshooting cost for each of the algorithms, for different values of the $\mathit{Age}_{\mathit{diff}}$ parameter, for a real world Electrical Power System and a car diagnosis system, respectively; and

(6) FIG. 5 shows the results of the long-term experiments, on a car diagnosis system.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

(7) The present invention uses prognosis tools, and in particular survival curves, to lower troubleshooting costs. The integration of prognosis and diagnosis reduces troubleshooting costs by using fault predictions from survival curves as priors in an MBD algorithm. It is also useful for developing an anticipatory troubleshooter that chooses whether a faulty component should be fixed or replaced by considering possible future troubleshooting costs.

(8) The present invention proposes an anticipatory troubleshooting model that uses an improved diagnosis process, which considers both diagnostic information about the relation between sensor data and faults and the likelihood of each component to fail given its age, obtained from the corresponding survival curves. The proposed model effectively integrates prognosis and diagnosis, and in particular survival curves and automated diagnosis algorithms.

(9) The integration of survival curves into the troubleshooting process also allows a more holistic form of troubleshooting referred to here as anticipatory troubleshooting and described below. Troubleshooting is the process of diagnosing and repairing an observed failure. Diagnostic and repair actions may incur costs, such as the time spent in observing internal components and the monetary cost of purchasing a new component to replace a faulty one. Troubleshooting algorithms aim to minimize the costs incurred until the system is fixed.

(10) The present invention uses prognosis tools, and in particular survival curves, to develop a troubleshooting algorithm that minimizes current troubleshooting costs and future maintenance costs. These maintenance costs include costs due to future failures, which would require additional troubleshooting and perhaps system downtime. This type of troubleshooting, where future costs are also considered, is defined as anticipatory troubleshooting, and the present invention proposes an effective anticipatory troubleshooting algorithm.

(11) In particular, the proposed troubleshooting algorithm addresses how to choose the most appropriate repair action, given a component that is identified as faulty. For example, repairing a faulty component may be cheaper than replacing it with a new one. On the other hand, a new component is less likely to fail in the near future. The proposed anticipatory troubleshooting algorithm leverages available survival curves to efficiently choose the appropriate repair action. We next describe the proposed anticipatory troubleshooting concept and algorithm formally.

(12) A system is composed of a set of components, denoted COMPS. A component $C \in \mathit{COMPS}$ is either healthy or faulty, denoted by the health predicate $h(C)$ or $\neg h(C)$, respectively. The state of a system, denoted $\xi$, is a conjunction of health literals (a literal is a notation for representing a fixed value), defining for every component whether it is healthy or not. A troubleshooting agent is an agent capable of performing sensing and repair actions. The agent's belief about the state of the system, denoted $B$, is a conjunction of health literals.

(13) It is assumed that the agent's knowledge is correct, i.e., $h(C) \in B \rightarrow h(C) \in \xi$. The agent's belief, however, may be incomplete, i.e., there may exist a $C \in \mathit{COMPS}$ such that neither $h(C)$ nor $\neg h(C)$ is in $B$. A troubleshooting problem arises if the system is identified as faulty, e.g., by some fault detection mechanism. It is assumed that such a mechanism exists, revealing to the agent whether the system is faulty or not.

(14) An action of the troubleshooting agent is a transition function, accepting and potentially modifying both the system state $\xi$ and the agent's belief $B$. Two types of actions are considered: sense and repair.

(15) Each action is parametrized by a single component, where $\mathit{Sense}_C$ checks if $C$ is healthy or not, and $\mathit{Repair}_C$ results in $C$ being healthy. Formally, applying $\mathit{Sense}_C$ does not modify $\xi$ and updates $B$ by adding $h(C)$ if $h(C) \in \xi$, or adding $\neg h(C)$ otherwise. Similarly, applying $\mathit{Repair}_C$ adds $h(C)$ to both $B$ and $\xi$, and removes $\neg h(C)$ from $B$ and $\xi$ if it was there.

(16) Definition 1 (Troubleshooting Problem (TP))

(17) A TP is defined by the tuple $P = \langle \mathit{COMPS}, \xi, B, A \rangle$, where

(18) (1) COMPS is the set of components in the system,

(19) (2) $\xi$ is the state of the system,

(20) (3) $B \subseteq \xi$ is the agent's belief about the system state, and

(21) (4) $A$ is the set of actions the troubleshooting agent is able to perform.

(22) A TP arises if $\exists C\, \neg h(C)$, i.e., if at least one component is faulty. A solution to a TP is a sequence of actions that results in a system state in which all components are healthy.

(23) A troubleshooting algorithm (TA) is an algorithm for guiding a troubleshooting agent faced with a TP. TAs are iterative: in every iteration the TA accepts the agent's current belief $B$ as input and outputs a sense or repair action for the troubleshooting agent to perform. A TA halts when the sequence of actions it has output forms a solution to the TP, i.e., when the system is fixed. The solution output by a TA $\Gamma$ to a TP $P$ is denoted by $\Gamma(P)$. Both sense and repair actions incur a cost. The cost of an action $a$ is denoted by $\mathit{cost}(a)$. The cost of solving $P$ using $\Gamma$, denoted by $\mathit{cost}(\Gamma, P)$, is the sum of the costs of all actions in $\Gamma(P)$: $\mathit{cost}(\Gamma, P) = \sum_{a \in \Gamma(P)} \mathit{cost}(a)$. TAs aim to minimize this cost.

(24) Consider again the car diagnosis example, in which there are three relevant components that may be faulty: the radiator ($C_1$), the ignition system ($C_2$) and the battery ($C_3$). Assuming that the radiator is the correct diagnosis (i.e., the radiator is really faulty) and the agent knows that the battery is not faulty, the corresponding system state $\xi$ and agent's belief $B$ are represented by:
$\xi = \{\neg h(C_1), h(C_2), h(C_3)\}$ and $B = \{h(C_3)\}$.

(25) Table 1 lists a solution to this TP, in which the agent first senses the ignition system, then the radiator, and finally repairs the radiator. Formally, $\Gamma(P) = \{\mathit{Sense}_{C_2}, \mathit{Sense}_{C_1}, \mathit{Repair}_{C_1}\}$. If the cost of a sense action is one and the cost of a repair action is five, then the troubleshooting cost of this solution is 1+1+5=7.
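
The car example can be traced with a minimal Python sketch. The class name, the dictionary encoding of the state and belief (True meaning healthy), and the default costs are assumptions made for illustration; they are not part of the described method.

```python
from dataclasses import dataclass, field


@dataclass
class TroubleshootingProblem:
    """Toy troubleshooting problem (hypothetical encoding: True = healthy)."""
    state: dict                                   # true health of every component
    belief: dict = field(default_factory=dict)    # what the agent knows so far
    total_cost: float = 0.0

    def sense(self, component, cost=1.0):
        # Sense: leaves the state unchanged and reveals the component's health.
        self.belief[component] = self.state[component]
        self.total_cost += cost
        return self.state[component]

    def repair(self, component, cost=5.0):
        # Repair: the component becomes healthy in both state and belief.
        self.state[component] = True
        self.belief[component] = True
        self.total_cost += cost

    def is_fixed(self):
        return all(self.state.values())


# Radiator (C1) is faulty; the agent initially knows only that the battery (C3) is healthy.
tp = TroubleshootingProblem(
    state={"C1": False, "C2": True, "C3": True},
    belief={"C3": True},
)
tp.sense("C2")    # ignition system: healthy
tp.sense("C1")    # radiator: faulty
tp.repair("C1")
print(tp.is_fixed(), tp.total_cost)  # True 7.0
```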

(26) Troubleshooting with Survival Functions

(27) If the cost of sense actions is much smaller than the cost of repair actions, then an intelligent troubleshooting algorithm would only repair components that were first identified as faulty as a result of a sense action. This simplifies the troubleshooting process: perform sense actions on components until a faulty component is found, and then repair it. The challenge is which component to sense first.

(28) To address this challenge, efficient troubleshooting algorithms use a Diagnosis Algorithm (DA). A DA outputs one or more diagnoses, where a diagnosis $\omega$ is a hypothesis regarding which components are faulty. Moreover, many DAs output for each diagnosis the likelihood that it is correct, denoted $p(\omega)$. These diagnosis likelihoods can be aggregated to provide an estimate of the likelihood that each component is faulty, denoted $p(C)$. A reasonable troubleshooter can then choose to sense first the component most likely to be faulty.
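
One simple way to aggregate diagnosis likelihoods into per-component likelihoods is sketched below. The aggregation rule (summing the normalized likelihoods of all diagnoses that contain the component) and the numeric values are assumptions for illustration; the description does not specify which aggregation is used.

```python
def component_fault_likelihoods(diagnoses):
    """Aggregate diagnosis likelihoods into per-component fault likelihoods.

    `diagnoses` is a list of (set_of_faulty_components, likelihood) pairs.
    Assumed rule: a component's likelihood is the normalized sum of the
    likelihoods of all diagnoses in which it appears.
    """
    total = sum(p for _, p in diagnoses)
    scores = {}
    for faulty, p in diagnoses:
        for component in faulty:
            scores[component] = scores.get(component, 0.0) + p / total
    return scores


# Hypothetical DA output for the car example.
diagnoses = [
    ({"radiator"}, 0.6),
    ({"battery"}, 0.3),
    ({"battery", "ignition"}, 0.1),
]
likelihoods = component_fault_likelihoods(diagnoses)
sense_first = max(likelihoods, key=likelihoods.get)
print(sense_first)  # radiator
```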

(29) Most effective existing DAs use some prior knowledge about the diagnosed system to provide accurate diagnoses. Model-Based Diagnosis (MBD) is a classical approach to diagnosis, in which an existing model of the system, along with observations of the system behavior, is used to infer diagnoses. Some MBD algorithms assume a system model that represents the system behavior using propositional logic and use logical reasoning to infer diagnoses that are consistent with system model and observations.

(30) Generally, most MBD algorithms implicitly assume that the system model represents the relation between the system inputs (including sensors) and outputs, and the components' behavior. In the example of the present invention, the DA that has been used is based on a Bayesian Network (a BN is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a Directed Acyclic Graph, DAG) that represents the probabilistic dependency between observations and the system health state. In addition, techniques from survival analysis are used to augment such models with information about the age of each component and its implication on the likelihood of components to be faulty.

(31) Integrating Survival Analysis into a DA

(32) Every component $C$ is associated with an age, denoted $\mathit{Age}_C$. If $T_C$ is a random variable representing the age at which $C$ will fail, a survival function for $C$, denoted $S_C(t)$, is the probability that $C$ will survive until age $t$ (i.e., component $C$ will not fail before age $t$). Formally: $S_C(t) = \Pr(T_C \geq t)$. Survival functions can be obtained by analysis of the physics of the corresponding system or learned from past data (see, for example, "Survival analysis of automobile components using mutually exclusive forests", Eyal et al., IEEE T. Systems, Man, and Cybernetics: Systems, 44(2):246-253, 2014).

(33) FIG. 1 illustrates an example of exponential survival curves. The three survival curves are generated by an exponential decay function $e^{-\lambda \cdot t}$, where $\lambda$ is a parameter and $t$ is the age (the x-axis). The y-axis represents the probability that a component will survive (i.e., will not fail) $t$ time units (e.g., months). The three curves plotted in FIG. 1 correspond to three values of the $\lambda$ parameter.

(34) It is desired to compute the probability that a component $C$ caused a system failure, given its age and survival function. In most systems, faulty components may fail intermittently, meaning that a component may be faulty but still not cause a system failure. Thus, the faulty component that caused the system to fail may have been faulty even before time $t$. To account for this, the probability that a component $C$ of age $\mathit{Age}_C$ caused the system failure is estimated by the probability that it has failed at any time before the current time. This probability is directly given by $1 - S_C(\mathit{Age}_C)$, denoted by $F_C(\mathit{Age}_C)$.

(35) Therefore, for a given component $C$, two estimates are available of the likelihood that it is faulty: one from the MBD algorithm ($p(C)$) and one from its survival curve ($F_C(\mathit{Age}_C)$). The MBD algorithm's estimate is derived from the currently observed system behavior or knowledge about the system's structure. The survival curve estimate is derived from knowledge about how such components tend to fail over time.

(36) The present invention proposes to combine these estimates to provide a more accurate and more informed diagnostic report. One approach to combine these fault likelihood estimates is by using some weighted linear combination, such that the weights are positive and sum up to one. However, these estimates are fundamentally different: $F_C(\mathit{Age}_C)$ is an estimate given a priori to the actual fault, while $p(C)$ is computed by the MBD algorithm for the specific fault at hand, taking into consideration the currently observed system behavior.
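
The weighted linear combination mentioned above can be sketched as follows. The function name, the default weight, and the numeric estimates are hypothetical and chosen only for illustration.

```python
def combine_estimates(p_mbd, f_survival, w_mbd=0.5):
    """Weighted linear combination of two fault likelihood estimates.

    `p_mbd` and `f_survival` map each component to its MBD-based and
    survival-based estimate. The weights are positive and sum to one.
    """
    if not 0.0 < w_mbd < 1.0:
        raise ValueError("w_mbd must lie strictly between 0 and 1")
    w_survival = 1.0 - w_mbd
    return {c: w_mbd * p_mbd[c] + w_survival * f_survival[c] for c in p_mbd}


# Hypothetical estimates for two components.
combined = combine_estimates(
    p_mbd={"A": 0.6, "B": 0.4},
    f_survival={"A": 0.2, "B": 0.8},
    w_mbd=0.5,
)
```

With equal weights, component B ends up with the higher combined estimate even though the MBD estimate alone favors A, which illustrates how the age information can change the sensing order.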

(37) MBD algorithms often require information about the prior probability distribution of each component to be faulty when computing their likelihood estimates. However, these prior probability distributions are often set to be uniform, although it has been shown that setting such distributions more efficiently can significantly improve diagnostic accuracy. Therefore, the present invention uses the fault likelihood estimation given by the survival curves as prior probability distributions within the likelihood estimation computation done by the MBD algorithm.

(38) Specifically, experiments were made with an MBD that computes diagnoses by applying inference on a Bayesian Network (BN). The BN contains both health variables and other variables such as sensor readings. The values of the observable variables are set, and then the marginal of each health variable is computed by applying an inference algorithm on the BN. The Bayesian reasoning (a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more information becomes available) is done by the inference algorithm that requires a prior probability.

(39) According to an embodiment of the invention, $S_C(\mathit{Age}_C)$ is used as the prior probability that $C$ is healthy (so that $F_C(\mathit{Age}_C) = 1 - S_C(\mathit{Age}_C)$ is the prior probability that it is faulty), while normalizing the fault probability over the remaining probability sum. Other ways to integrate survival curves in an MBD are also possible, and the key contribution is that doing so is beneficial.

(40) Returning to the example of a car that does not start, FIG. 2 depicts a possible BN that represents this example. Nodes Ig, B, and R correspond to the health variables for the ignition, battery, and radiator, respectively. W corresponds to the water level variable, and C corresponds to the observation of whether the car starts. The Conditional Probability Tables (CPTs) for all nodes except for C are also illustrated in FIG. 2. The value of C deterministically depends on Ig, B, and R: the car can start only if all components are healthy.

(41) Modeling such a dependency (a logical OR) in a BN is trivial. In this example, multiple faults are not allowed (these are mapped to an N/A value of C). Assuming that the car does not start (C=False) and the water level is low (W=Low), Bayesian reasoning is applied to obtain the likelihood of each component to be faulty. In this example, the likelihoods of Ig, B, and R being faulty are 0.16, 0.33, and 0.52, respectively. Thus, a troubleshooter would sense R first.

(42) It is assumed that the ages of the ignition (Ig), battery (B), and radiator (R) are 3, 12, and 5, respectively, and that they all follow an exponential survival curve of $e^{-0.09 \cdot t}$. Thus, according to the components' ages and survival curves, the probabilities of Ig, B, and R being faulty are 0.24, 0.66, and 0.36, respectively. Setting these probabilities in place of the original health nodes' prior probability distributions is shown in FIG. 2 in the S(X) columns of the CPTs. Setting these prior probability distributions dramatically affects the result of the Bayesian reasoning: the resulting probabilities of Ig, B, and R being faulty are 0.16, 0.56, and 0.28, respectively. As a result, a troubleshooter that is aware of both the BN and the survival curves would choose to sense the battery (rather than the radiator).
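
The survival-based fault probabilities quoted in this example can be reproduced with a short Python sketch. The function names are illustrative. The posterior values obtained after Bayesian reasoning are not reproduced here, because they depend on the CPTs of FIG. 2.

```python
import math


def survival(age, lam=0.09):
    """Exponential survival curve: probability of surviving until `age`."""
    return math.exp(-lam * age)


def fault_prior(age, lam=0.09):
    """Probability that the component has failed at any time before `age`."""
    return 1.0 - survival(age, lam)


ages = {"Ig": 3, "B": 12, "R": 5}
priors = {c: round(fault_prior(a), 2) for c, a in ages.items()}
print(priors)  # {'Ig': 0.24, 'B': 0.66, 'R': 0.36}
```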

(43) Anticipatory Troubleshooting

(44) The present invention proposes an anticipatory troubleshooting algorithm, which is able to reason about both current and future failures. To reason about failures over time, statistical tools from survival analysis are incorporated that allow predicting when a failure is likely to occur. Incorporating this prognostic information in a troubleshooting algorithm allows better fault isolation and more efficient decision making about which repair actions to employ to minimize troubleshooting costs over time.

(45) The main benefit of using survival functions in the context of troubleshooting is in the ability to reason about future failures, with the goal of minimizing troubleshooting costs over time.

(46) Let $[0, T_{\mathit{limit}}]$ be the time period over which troubleshooting costs are to be minimized. During this time period, components in the system may fail. When the system fails, a troubleshooting process is initiated, performing sense and repair actions until the system is fixed. The target function to be minimized is the sum of costs incurred due to actions performed by the troubleshooting agent within the time period $[0, T_{\mathit{limit}}]$. This sum of troubleshooting costs is referred to as the long-term troubleshooting cost. A troubleshooting algorithm that aims to minimize this cost is referred to as an anticipatory troubleshooting algorithm.

(47) When there is only a single sense action and a single repair action, there is no difference between an anticipatory troubleshooting algorithm and a troubleshooting algorithm that aims only to minimize the current troubleshooting costs. The difference between traditional troubleshooting and anticipatory troubleshooting is meaningful when there are multiple repair actions. In other words, after the troubleshooting algorithm identifies which component is faulty, the troubleshooter needs to decide which repair action to use to repair it.

(48) Fix Vs. Replace Actions

(49) A setting is considered in which there are two possible repair actions, called Fix and Replace. Applying a Replace(C) action means that the troubleshooting agent replaces C with a new one. Applying a Fix(C) action means that the troubleshooting agent fixes C without replacing it. Both Fix and Replace are repair actions, in the sense that after performing them, the component is healthy and the agent knows about it, i.e., $\neg h(C)$ is replaced with $h(C)$ in both the system state and the agent's belief.

(50) However, Fix is expected to be cheaper than Replace. Also, after replacing a component, its ability to survive is expected to be significantly higher than that after it has been fixed, since the replaced component is new.

(51) Let $S_C(t, \mathit{Age}_C)$ be the survival curve of $C$ after it was fixed at age $\mathit{Age}_C$, i.e., the probability that $C$ survives $t$ time units after it was fixed, given that it was fixed at age $\mathit{Age}_C$:
$S_C(t, \mathit{Age}_C) = \Pr(T_C \geq t + \mathit{Age}_C \mid C \text{ fixed at age } \mathit{Age}_C)$

(52) Such a survival function is called an after-fix survival function. The expected relations between fix and replace are:
$\forall C \in \mathit{COMPS}: \mathit{cost}(\mathit{Fix}(C)) < \mathit{cost}(\mathit{Replace}(C))$ (1)
$\forall t \in [0, T_{\mathit{limit}}]\ \forall C \in \mathit{COMPS}: S_C(t, \mathit{Age}_C) < S_C(t)$ (2)

(53) Fixing a faulty component seems to be cheaper, but may result in future faults being more frequent. This embodies the main dilemma in anticipatory troubleshooting: weighing current troubleshooting costs (where Fix is preferable) against potential future troubleshooting costs (where Replace is preferable).

(54) Choosing the Appropriate Repair Action

(55) A preferable approach for choosing which repair action to perform is to discretize the time period $[0, T_{\mathit{limit}}]$, model the problem as a Markov Decision Problem (MDP), and apply an off-the-shelf MDP solver, as described below.

(56) Discretization

(57) The time period $[0, T_{\mathit{limit}}]$ is partitioned into a non-overlapping set of equal-sized time ranges $T = \{T_0, \ldots, T_n\}$. Each $T_i$ is referred to as a time step, and $\Delta t$ is the size of each time step.
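
The discretization can be sketched in Python as follows. The function name and the representation of a time step as a (start, end) pair are assumptions for illustration.

```python
def discretize(t_limit, n):
    """Partition [0, t_limit] into n + 1 equal-sized steps T_0, ..., T_n.

    Returns a list of (start, end) pairs; the step size is t_limit / (n + 1).
    """
    dt = t_limit / (n + 1)
    return [(i * dt, (i + 1) * dt) for i in range(n + 1)]


steps = discretize(t_limit=12.0, n=3)
print(steps)  # [(0.0, 3.0), (3.0, 6.0), (6.0, 9.0), (9.0, 12.0)]
```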

(58) MDP Modeling

(59) An MDP is defined by a state space $S$, a set of actions $A$, a reward function $r(s, a)$, and a transition function $\mathit{tr}(s, a, s')$. A state in the state space is defined by a tuple $s = (T_i, C, \mathit{Curves}, \mathit{Ages})$, representing a state in which component $C$ was diagnosed as faulty at time step $T_i$, where $\mathit{Curves}$ and $\mathit{Ages}$ are vectors representing the survival curves and ages of all components in COMPS. $C$ can be null, representing a state in which no component was faulty at time $T_i$.

(60) Only single-fault scenarios are considered (i.e., at most one component is faulty at every time step). States for time $T_{n+1}$ are terminal states. The set of actions $A$ consists of three actions: Replace(C), Fix(C), and no-op (no-op represents not doing any action). The reward function $r(s, a)$ is minus the cost of the executed action, where the no-op action costs zero. The state transition function is as follows:

(61) After any action, a state for time step $T_i$ will transition to a state for time step $T_{i+1}$.

(62) The MDP transition function $\mathit{tr}(s, a, s')$, which returns the probability of reaching state $s'$ after performing action $a$ at state $s$, is defined as follows:

(63) Let $s = (T_i, C, \mathit{Curves}, \mathit{Ages})$ and $s' = (T_j, C', \mathit{Curves}', \mathit{Ages}')$. The values of $T_j$, $\mathit{Curves}'$, and $\mathit{Ages}'$ are set deterministically by $s$ and $a$: $T_j = T_{i+1}$, $\mathit{Curves}'$ is only updated after a Fix(C) action (replacing $C$'s survival function with its after-fix curve), and $\mathit{Ages}'$ consists of all components being older by one time step, except when $C$ is replaced (in which case the age of $C$ is set to zero). The uncertainty in the state transition is which component, if any, will be faulty in the next time step.
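
The deterministic part of this transition can be sketched in Python. The function name, the dictionary encoding of the curves and ages, and the "after_fix" label are assumptions for illustration; the probabilistic choice of the next faulty component is not modeled here.

```python
def next_deterministic(step, faulty, curves, ages, action, dt=1.0):
    """Deterministic part of the MDP transition (hypothetical encoding).

    `curves` maps each component to a label naming its survival curve and
    `ages` maps each component to its age. `action` is one of "replace",
    "fix", or "noop" and applies to the component `faulty` (may be None).
    Returns (next_step, next_curves, next_ages).
    """
    new_curves = dict(curves)
    new_ages = {c: age + dt for c, age in ages.items()}  # everything ages by one step
    if action == "replace":
        new_ages[faulty] = 0.0            # a replaced component is new
    elif action == "fix":
        new_curves[faulty] = "after_fix"  # switch to the after-fix curve
    return step + 1, new_curves, new_ages


step, curves, ages = next_deterministic(
    step=2,
    faulty="B",
    curves={"B": "original", "R": "original"},
    ages={"B": 12.0, "R": 5.0},
    action="replace",
)
print(step, curves, ages)
```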

(64) Let $S_C$ and $\mathit{Age}_C$ be the survival curve and age of $C$ according to Curves and Ages. Then the probability that $C$ will fail in a specific time range $T_j$, given its survival curve, is:
$\Pr(T_C \in T_j) = S_C(\mathit{Age}_C - \Delta t) - S_C(\mathit{Age}_C)$
which is a standard computation in survival analysis: the probability of surviving until the start of $T_j$ (when the age of $C$ was $\mathit{Age}_C - \Delta t$) minus the probability of surviving until the end of $T_j$ (when the age of $C$ is $\mathit{Age}_C$).
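
A minimal Python sketch of this computation follows. The exponential curve with parameter 0.09 is taken from the car example; the function names and the chosen age and step size are illustrative assumptions.

```python
import math


def exp_survival(age, lam=0.09):
    """Exponential survival curve used in the car example."""
    return math.exp(-lam * age)


def failure_prob_in_step(survival_fn, age, dt):
    """Probability that a component fails during the time step ending at `age`."""
    return survival_fn(age - dt) - survival_fn(age)


# Component of age 5 at the end of a step of size 1.
p_fail = failure_prob_in_step(exp_survival, age=5.0, dt=1.0)
print(round(p_fail, 3))  # 0.06
```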
Solving the MDP

(65) The state space of this MDP is exponential in the number of time steps reasoned about (n).

(66) A decision rule called Decision Rule 1 (DR1), which roughly corresponds to reasoning about a single level of this MDP state space, has been implemented. Let $C_{\mathit{replace}} = \mathit{cost}(\mathit{Replace}(C))$, $C_{\mathit{fix}} = \mathit{cost}(\mathit{Fix}(C))$, and let $T_{\mathit{left}}$ be the time left until $T_{\mathit{limit}}$. DR1 replaces a faulty component $C$ iff the following inequality holds:
$C_{\mathit{replace}} + (1 - S_C(T_{\mathit{left}})) \cdot C_{\mathit{replace}} \leq C_{\mathit{fix}} + (1 - S_C(T_{\mathit{left}}, \mathit{Age}_C)) \cdot C_{\mathit{replace}}$ (3)
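
DR1 can be sketched in Python as below, reading the relation in inequality (3) as "less than or equal", which is an assumption. The function name, argument names, and numeric values are hypothetical.

```python
def dr1_should_replace(c_replace, c_fix, s_new, s_after_fix):
    """Decision Rule 1: return True if the faulty component should be replaced.

    `s_new` is the probability that a new component survives the time left,
    and `s_after_fix` is the after-fix survival probability for the same period.
    A future failure is charged the cost of a replacement in both branches.
    """
    cost_if_replace = c_replace + (1.0 - s_new) * c_replace
    cost_if_fix = c_fix + (1.0 - s_after_fix) * c_replace
    return cost_if_replace <= cost_if_fix


# Fixed component unlikely to survive: 10 + 0.1*10 = 11 versus 4 + 0.8*10 = 12.
print(dr1_should_replace(10.0, 4.0, s_new=0.9, s_after_fix=0.2))  # True
# Fixed component fairly reliable: 11 versus 4 + 0.4*10 = 8.
print(dr1_should_replace(10.0, 4.0, s_new=0.9, s_after_fix=0.6))  # False
```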

(67) DR1 has the following property:

(68) Proposition 1

(69) DR1 is optimal if the following holds:

(70) (1) a component will not fail more than twice in the time range $[0, T_{\mathit{limit}}]$;

(71) (2) a component can be fixed at most once;

(72) (3) a replaced component will not be fixed in the future;

(73) (4) components fail independently.

(74) Experimental Results

(75) To evaluate the proposed algorithms, two sets of experiments have been performed: one-shot experiments, in which a single TP is solved, and long-term experiments, in which troubleshooting costs are accumulated.

(76) Experiments were performed over two systems, modeled using a Bayesian network (BN) following the standard use of BNs for diagnosis. The first system, denoted S1, represents a real world Electrical Power System. The BN was generated automatically from a formal design and is publicly available. It has 26 nodes, 6 of which are health nodes. The second system, denoted S2, is the CAR DIAGNOSIS 2 network from the library of benchmark BNs made available by Norsys (www.norsys.com/netlib/CarDiagnosis2.dnet). This second system represents a network for diagnosing a car that does not start, based on spark plugs, headlights, main fuse, etc. It contains 18 nodes, 7 of which are health nodes. A graphical representation of S2 is illustrated in FIG. 3.

(77) Survival Curves and Component Ages

(78) The survival function used for all components was a standard exponential curve (defined above and illustrated in FIG. 1) with λ=0.09. Exponential curves are fundamental parametric models used in survival analysis.

(79) The age of each component is set to be Age.sub.init plus a random number between zero and Age.sub.diff, where Age.sub.init is a constant, set arbitrarily to 0.3 and Age.sub.diff is a varied parameter in the experiments. The purpose of the Age.sub.diff parameter is to control the possible impact of considering the components' survival functions: a small Age.sub.diff results in all components having almost the same age, and thus the survival curves do not provide significant information to distinguish between which component is more likely to be faulty.
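The effect of Age.sub.diff described above can be illustrated with a minimal Python sketch. It assumes the exponential survival curve exp(-λt) with a rate of 0.09; the component names and the two Age.sub.diff values are hypothetical and are not taken from the experiments.

```python
import math
import random

LAMBDA = 0.09    # assumed rate of the exponential survival curve
AGE_INIT = 0.3   # constant initial age used in the experiments


def survival(age: float, lam: float = LAMBDA) -> float:
    """Exponential survival curve: probability that a component of this age is healthy."""
    return math.exp(-lam * age)


def sample_ages(components, age_diff: float, rng: random.Random) -> dict:
    """Age of each component is Age_init plus a uniform draw from [0, Age_diff]."""
    return {c: AGE_INIT + rng.uniform(0.0, age_diff) for c in components}


rng = random.Random(0)
for age_diff in (0.1, 25.0):  # hypothetical small and large settings
    ages = sample_ages(["C1", "C2", "C3"], age_diff, rng)
    fault_prior = {c: 1.0 - survival(a) for c, a in ages.items()}
    spread = max(fault_prior.values()) - min(fault_prior.values())
    print(f"Age_diff={age_diff}: spread of fault priors = {spread:.4f}")
```

With a small Age.sub.diff the fault priors are nearly identical, so the survival curves alone give little basis for ranking components.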

(80) One-Shot Experiments

(81) In this set of experiments, random TPs were generated (details below) and the performance of four TAs was compared:

(82) (1) Random, which chooses randomly which component to sense;

(83) (2) BN-based, which chooses to sense the component most likely to be faulty according to the BN;

(84) (3) Survival-based, which chooses to sense the component most likely to be faulty according to its survival curve and age;

(85) (4) Hybrid, which chooses to sense the component most likely to be faulty taking into consideration both BN and survival curve.
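The four selection strategies above can be sketched in Python as follows. This is a simplified illustration, not the implementation used in the experiments: the fault probabilities are given as plain dictionaries rather than computed from a BN, the Hybrid strategy is shown as the weighted linear combination described in the claims, and the weight `w_bn`, the function names, and the example values are assumptions.

```python
import random
from typing import Dict


def choose_random(components, rng: random.Random) -> str:
    """Random TA: sense an arbitrary component."""
    return rng.choice(list(components))


def choose_bn(p_bn: Dict[str, float]) -> str:
    """BN-based TA: sense the component most likely faulty according to the BN."""
    return max(p_bn, key=p_bn.get)


def choose_survival(p_surv: Dict[str, float]) -> str:
    """Survival-based TA: sense the component most likely faulty given its age."""
    return max(p_surv, key=p_surv.get)


def choose_hybrid(p_bn: Dict[str, float], p_surv: Dict[str, float], w_bn: float = 0.5) -> str:
    """Hybrid TA: combine both estimates with positive weights that sum to one."""
    if not 0.0 < w_bn < 1.0:
        raise ValueError("w_bn must lie strictly between 0 and 1")
    combined = {c: w_bn * p_bn[c] + (1.0 - w_bn) * p_surv[c] for c in p_bn}
    return max(combined, key=combined.get)


# Hypothetical fault likelihood estimates for three components.
p_bn = {"A": 0.5, "B": 0.4, "C": 0.1}
p_surv = {"A": 0.1, "B": 0.4, "C": 0.5}
print(choose_bn(p_bn), choose_survival(p_surv), choose_hybrid(p_bn, p_surv))
```

In this example the two single-source strategies disagree, and the combined estimate selects a component that both sources consider moderately likely to be faulty.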

(86) The performance of a TA was measured by the troubleshooting costs incurred until the system is fixed. Since only single-fault scenarios have been considered, the cost of the single repair action performed in each of these experiments was omitted, as all algorithms incur this cost.

(87) Each TP was generated with a single faulty health node as follows:

(88) The values of non-health nodes in the BN that do not depend on any other node were set randomly according to their priors. These nodes are referred to as control nodes. Then, the age of each component was set as mentioned above, i.e., by sampling uniformly within the range [Age.sub.init, Age.sub.init+Age.sub.diff]. Then, the CPT of every health node was modified to take into account the survival curve (i.e., the prior of being healthy was set to S.sub.C(Age.sub.C)). Next, the marginal probability of each component being faulty in this modified BN was computed, and a single component was chosen to be faulty according to these computed probabilities. Then, the BN values for all remaining nodes (nodes that are neither control nodes nor health nodes) were sampled, conditioned on the values of the already-set nodes. These nodes are called the sensor nodes, and a subset of them was revealed to the DA.
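The step of choosing the single faulty component can be sketched as below. This is a simplification made for illustration: instead of computing marginals in the modified BN, the sketch uses each component's prior fault probability 1 - S.sub.C(Age.sub.C) directly as the sampling weight, with an assumed exponential curve. The function name, the health-node names, and the ages are hypothetical.

```python
import math
import random

LAMBDA = 0.09  # assumed rate of the exponential survival curve


def sample_single_fault(ages: dict, rng: random.Random) -> str:
    """Pick one faulty component, weighted by its prior fault probability."""
    components = list(ages)
    # Prior of being faulty: 1 - S_C(Age_C), with S_C(t) = exp(-LAMBDA * t).
    fault_weights = [1.0 - math.exp(-LAMBDA * ages[c]) for c in components]
    return rng.choices(components, weights=fault_weights, k=1)[0]


rng = random.Random(42)
ages = {"H1": 0.4, "H2": 5.0, "H3": 20.0}  # hypothetical component ages
counts = {c: 0 for c in ages}
for _ in range(1000):
    counts[sample_single_fault(ages, rng)] += 1
print(counts)  # older components are drawn as the faulty one more often
```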

(89) FIGS. 4A and 4B show the troubleshooting cost for each of the algorithms, for different values of the Age.sub.diff parameter, for a real world Electrical Power System (S1) and a car diagnosis system (S2), respectively. All results are averaged over 50 instances. It can be seen that the proposed Hybrid TA outperforms all baseline TAs, thereby demonstrating the importance of considering both survival curves and MBD. It can also be seen that, as Age.sub.diff grows, the performance of Survival improves, since the components' ages differ more, and thus considering them is more valuable. When Age.sub.diff is minimal, the performance of Survival is similar to Random and worse than BN.

(90) BN performed better, since it was provided with evidence: the values of some sensor nodes (for S1, 9 sensor nodes were revealed, and for S2, 2 sensor nodes were revealed). Experiments were also made with different numbers of revealed nodes. As expected, revealing more nodes improves the performance of both BN and Hybrid.

(91) The results demonstrate that Hybrid is more robust than both Survival and BN, and either matches or outperforms them across all varied parameters.

(92) Long-Term Experiments

(93) In this set of experiments random TPs were generated over a period of 28 months (i.e., T.sub.limit=28), while choosing when each component fails according to its survival function. In each experiment one of the following TAs has been used to solve the TPs that arise:

(94) (1) Always Fix (AF), in which faulty components are repaired using the Fix action;

(95) (2) Always Replace (AR), in which faulty components are repaired using the Replace action;

(96) (3) Hybrid, in which DR1 has been used to choose the appropriate repair action.

(97) The performance of each algorithm is measured by the sum of troubleshooting costs incurred when solving all the TPs that arose. Since the focus of these experiments is to study the Fix vs. Replace dilemma, the costs incurred due to Sense actions were omitted, and only the cost of the repair action used in every troubleshooting session was measured (i.e., C.sub.replace or C.sub.fix).

(98) To sample when a component will fail after it was fixed, and to compute the Hybrid TA, an after-fix survival function S.sub.C(t, Age.sub.C) was required. Such functions can be given by domain experts or learned from past data. In these experiments, the following after-fix survival function was used:
$$S_C(t, \text{Age}_C) = \left(S_C(t)\right)^{P}$$
where P is a parameter called the fix punish factor. This after-fix survival curve satisfies the intuitive requirement that a replaced component is more likely to survive longer than a component that was fixed (Eq. 2). The punish-factor parameter P controls the difference between the after-fix and the regular survival function.
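A minimal Python sketch of this after-fix survival function is given below, assuming the exponential curve exp(-λt) with a rate of 0.09 as the regular survival function. The function names and the sampled time points are assumptions made for illustration.

```python
import math

LAMBDA = 0.09  # assumed rate of the regular exponential survival curve


def survival(t: float, lam: float = LAMBDA) -> float:
    """Regular survival curve S_C(t) of a new (or replaced) component."""
    return math.exp(-lam * t)


def survival_after_fix(t: float, punish_factor: float, lam: float = LAMBDA) -> float:
    """After-fix survival curve: the regular curve raised to the punish factor P."""
    return survival(t, lam) ** punish_factor


for p in (1, 2, 5):
    values = [round(survival_after_fix(t, p), 3) for t in (0, 5, 10, 20)]
    print(f"P={p}: {values}")
```

Because S.sub.C(t) lies between 0 and 1, raising it to a power P greater than 1 can only lower it, so a fixed component never has a higher survival probability than a replaced one; P=1 makes the two curves identical.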

(99) FIG. 1 shows the survival curves after a punish factor of 2 and 5. Another important parameter in this set of experiments is the ratio between C.sub.replace and C.sub.fix. This parameter is referred to as the cost ratio parameter.

(100) FIG. 5 shows the results of the long-term experiments on system S1. The x-axis shows different cost ratios, grouped by punish-factor value. The y-axis shows the long-term troubleshooting costs. All results are averaged over 50 instances. It can be seen that when the cost ratio is small, Fix is significantly cheaper than Replace, and thus the Always Fix (AF) algorithm performs best. Similarly, when the punish factor is very high, a fixed component is much more likely to fail than a replaced one, and thus the Always Replace (AR) algorithm performs best. The Hybrid algorithm is able to successfully choose when to replace or fix in most parameter combinations. The same trends were also observed for system S2. Thus, even though the assumptions under which DR1 (i.e., Hybrid) is optimal do not hold in the experiments made (e.g., a component may have more than two faults), it can be seen that using it allows an effective balance between AF and AR.
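The long-term setup can be approximated by the following Python sketch. It is a heavily simplified stand-in for the experiments, under stated assumptions: a single component, an exponential survival curve with rate 0.09, an after-fix curve equal to the regular curve raised to the punish factor (so a fixed component fails at rate λ·P, measured from the time of the fix, without compounding over repeated fixes), and hypothetical cost values. It does not reproduce the reported results.

```python
import math
import random

LAMBDA = 0.09    # assumed rate of the regular exponential survival curve
T_LIMIT = 28.0   # time horizon in months


def dr1_replace(c_replace: float, c_fix: float, t_left: float, punish: float) -> bool:
    """DR1: replace iff the expected cost of replacing does not exceed that of fixing."""
    s_new = math.exp(-LAMBDA * t_left)
    s_fixed = s_new ** punish  # after-fix survival curve
    return c_replace + (1 - s_new) * c_replace <= c_fix + (1 - s_fixed) * c_replace


def simulate(policy: str, c_replace: float, c_fix: float, punish: float,
             rng: random.Random) -> float:
    """Accumulate repair costs of one component until T_LIMIT under a given policy."""
    now, rate, total = 0.0, LAMBDA, 0.0
    while True:
        now += rng.expovariate(rate)  # time to next failure for exp(-rate * t)
        if now >= T_LIMIT:
            return total
        if policy == "AR":
            replace = True
        elif policy == "AF":
            replace = False
        else:  # "Hybrid": choose the repair action with DR1
            replace = dr1_replace(c_replace, c_fix, T_LIMIT - now, punish)
        if replace:
            total += c_replace
            rate = LAMBDA           # a replaced component follows the regular curve
        else:
            total += c_fix
            rate = LAMBDA * punish  # a fixed component follows the after-fix curve


# Hypothetical costs and punish factor; averages over 50 sampled instances.
for policy in ("AF", "AR", "Hybrid"):
    rng = random.Random(7)
    runs = [simulate(policy, 10.0, 5.0, 2.0, rng) for _ in range(50)]
    print(policy, sum(runs) / len(runs))
```

Varying the ratio between the two cost arguments and the punish factor in this sketch gives a rough feel for how the cost ratio and punish factor parameters interact.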

(101) The above examples and description have of course been provided only for the purpose of illustration, and are not intended to limit the invention in any way. As will be appreciated by the skilled person, the invention can be carried out in a great variety of ways, employing more than one technique from those described above, other than used in the description, all without exceeding the scope of the invention.

CPC classification (Section G, PHYSICS): G06N7/01; G06F11/2257; G06F11/008; G06F11/2268; G06F11/2273; G06F11/0793; G06F11/0766; G06F11/079

International classification (Section G, PHYSICS): G06F11/00; G06F11/07; G06N7/00; G06F11/22
