
Conference rusure::math

Title: Mathematics at DEC
Moderator: RUSURE::EDP
Created: Mon Feb 03 1986
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 2083
Total number of notes: 14613

1400.0. "Analyzing Error logs to Predict Failures and Isolate FRUs" by GRANE::HEINTZE () Thu Mar 21 1991 22:14


I am in search of some algorithms for thresholding and filtering the data
found in VMS error logs.  Let me first give you some history and then my
thoughts on some algorithms.  I invite you to (1) comment on my suggestion
of modeling the error stream from a device as an autoregressive process and
(2) suggest some other models.

Currently FM CSSE supports the product VAXsimPLUS, which attempts to predict
failures in various devices and isolate the FRU (field replaceable unit).

VAXsimPLUS is sort of a prototype or proof of concept, which means that many
aspects are suboptimal.  Consequently, we are investigating the implementation
of a more definitive product.

VAXsimPLUS, as you might expect, is layered on VAXsim.  VAXsim does straight
thresholding, no analysis.  It keeps a tally of hard, soft, and media errors
for each device based on what passes through the error log mailbox.

When a margin is exceeded in VAXsim, SPEAR (a pseudo expert system) is
spawned to analyze the error log.  Based on the results of the analysis and
the cluster configuration, VAXsimPLUS might initiate a shadow copy to save the
customer's data.  It will also send a theory number to the FIELD account and
a message to the system manager to call FS (Field service).

The term "margin" has a specific meaning in the context of VAXsimPLUS.  The
actual threshold which triggers analysis via SPEAR is an arithmatic function of
time, the error count and the margin.  Specifically, the margin is an integer
typically around 5 or 10.  There is a hard, soft and media margin for each
device type. Initially, the threshold is equal to the margin.  After each time
a device crosses threshold and SPEAR analysis is triggered, the threshold is
doubled. After 24 hours, the threshold is set back to the value of the margin.
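
To make sure I have that straight, here is a rough sketch (in Python, with
made-up names; whether the error count resets along with the threshold is my
guess, not something I dug out of the VAXsim sources) of the bookkeeping as I
read it:

    import time

    class MarginedCounter:
        """One counter per device per error class (hard, soft or media)."""

        def __init__(self, margin, window_seconds=24 * 3600):
            self.margin = margin              # typically around 5 or 10
            self.threshold = margin           # initially equal to the margin
            self.count = 0
            self.window = window_seconds
            self.window_start = time.time()

        def record_error(self):
            """Return True when SPEAR-style analysis should be triggered."""
            now = time.time()
            if now - self.window_start > self.window:
                # after 24 hours the threshold falls back to the margin
                self.threshold = self.margin
                self.count = 0
                self.window_start = now
            self.count += 1
            if self.count >= self.threshold:
                self.threshold *= 2           # doubled after each trigger
                return True
            return False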

Well, well...    What do you think of this two-step process where the second
stage is a (pseudo) expert system?  (I say pseudo because it's just a bunch
of compiled BLISS statements - we cannot use something like PROLOG because
we don't want the competition to reverse engineer our algorithms easily.)

We've been contemplating the notion of implementing the successor to SPEAR
analysis as an automaton.  (I wonder if we could use GALLILEO or YACC to
generate this automaton?)  I envision a state machine that just sits out there
indefinitely monitoring the stream of errors that are en route to the error log.
This would be considerably less compute (and I/O) intensive than occasionally
invoking an expert system that would read the last 20 megatons of error log
files to figure out what is wrong.

If we were to use an automaton, do we need some notion of time that the
VAXsim margining currently provides?  To answer this question, we need to
understand how devices fail.  I believe VAXsimPLUS was implemented around
the notions (1) that the error rate increases exponentially as a function of
time once a device starts to fail and (2) that it is sufficient to perform
notification (i.e., send mail to Field Service and the system manager) about
once a day.  A greater frequency is undesirable.

My thought is that if we use a finite state automaton, we only perform
notification when we enter a new failure state.  Since the cost of invoking
an expert system is no longer an issue, we won't have to worry about
performing notification too often.  The problem with using a finite state
automaton is with history:  we might have to preserve state across system
crashes.  We would also have to deal with the fact that any one VAX node
in a cluster might not see all the errors any given device generates.
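
Something like the following toy sketch is what I have in mind (Python, with
invented states and thresholds - a real automaton would encode actual device
failure knowledge, and snapshot() is only a nod to the crash/cluster problem,
not a solution):

    from collections import namedtuple
    from enum import Enum

    ErrorEntry = namedtuple("ErrorEntry", "device kind")  # stand-in for a log record

    class DeviceState(Enum):
        HEALTHY = 0
        SUSPECT = 1
        FAILING = 2

    class DeviceMonitor:
        """Watches one device's error stream; notifies only on a state change."""

        def __init__(self, suspect_after=5, failing_after=20):
            self.state = DeviceState.HEALTHY
            self.soft_errors = 0
            self.suspect_after = suspect_after
            self.failing_after = failing_after

        def on_error(self, entry):
            self.soft_errors += 1
            new_state = self.state
            if self.soft_errors >= self.failing_after:
                new_state = DeviceState.FAILING
            elif self.soft_errors >= self.suspect_after:
                new_state = DeviceState.SUSPECT
            if new_state is not self.state:
                self.state = new_state
                print(f"{entry.device}: entered {new_state.name}, notify FS")

        def snapshot(self):
            """The state that would have to survive a crash (and be merged
            with what the other cluster nodes have seen)."""
            return {"state": self.state.name, "soft_errors": self.soft_errors}

    mon = DeviceMonitor()
    for _ in range(25):
        mon.on_error(ErrorEntry(device="DUA0", kind="soft"))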

The other issue is (presumably noise) filtering, which is a related
(perhaps redundant) question:  how do we decide what information in the
error log we want to look at and what we want to ignore?

Currently I'm taking a class in adaptive filters.  I wonder if there are any
possible applications of adaptive filter theory here.  I wonder if you can
consider the error stream (log) wide-sense stationary?  Probably not, in light
of the failure patterns of devices.  I wonder if you could model the error
stream for any specific device as an AR (autoregressive) process.  AR processes
have the form:

        v(n) = a_0 u(n) + a_1 u(n-1) + a_2 u(n-2) + ...

where v(n) is the white noise process, u is some signal, and "a" is the
vector of AR coefficients.  You can take a white noise sequence as input
to an AR process and produce a certain signal "u", or run "u" through the
inverse (whitening) filter to get a white noise sequence back.  The latter is
extremely useful because you can subtract the noise sequence output by the
filter from the corrupted signal ("u") to get the original signal.  If we use
an adaptive filter (one that alters the "a" vector on the fly), we can better
accommodate non-stationary signals.
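
To make the idea concrete, here is a small Python sketch of an LMS adaptive
predictor: it adapts the "a" vector on the fly, and its prediction error plays
the role of v(n).  The error-count series, filter order and step size are all
made up for illustration:

    import numpy as np

    def lms_whiten(u, order=4, mu=0.01):
        """Predict u(n) from its past; return (coefficients, prediction errors)."""
        a = np.zeros(order)                 # the adaptive "a" vector
        v = np.zeros(len(u))                # prediction error ~ white noise
        for n in range(order, len(u)):
            past = u[n - order:n][::-1]     # u(n-1), u(n-2), ..., u(n-order)
            v[n] = u[n] - a @ past          # what the model could not predict
            a += mu * v[n] * past           # LMS update: tracks non-stationarity
        return a, v

    # u might be, say, hourly soft-error counts for one device
    u = np.random.poisson(lam=1.0, size=500).astype(float)
    coeffs, resid = lms_whiten(u)
    # a sustained jump in the residual power would mean the model no longer
    # fits, i.e. the device's error behaviour has changed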

If it is useful to model the stream of hard, soft and media errors for a
device as an AR process, how do we (mathematically) define noise and just what
would the "original signal" represent?

This is also posted in the algorithms conference.

1400.1. "I'd suggest Bayesian Analysis" by CSSE::NEILSEN (Wally Neilsen-Steinhardt) Fri Mar 22 1991 17:17

One preliminary word of warning: VAXsim and especially SPEAR were based on a
lot of real-world knowledge about how devices fail.  You should be sure that
any replacement product does not lose all that real-world knowledge.  What
follows here, for example, is rather abstract and would have to be modified
somewhat to make a better fit with the real world.  I seem to remember
discussing this years ago with the SPEAR folks, or perhaps it was in this
conference.  Anyway, ...

Look at this as a pair (at least) of decisions to be made by your tool: should
it ring an alarm bell or not?  What FRU (or FRUs) should it advise the FSE to
replace?  

A simple and fairly general decision algorithm is based on Bayesian probability
analysis.  What follows is a simplified application of this analysis to your
case.

Assume a device can be characterized by a single "quality" variable Q(t), which
represents its ability to perform at time t.  For convenience, scale it so that
Q=1 means perfect success, and Q=0 means perfect failure.  It would be nice to
measure Q directly, but we will assume that that is impossible, and we are
forced to infer Q from an error history.

Represent the error history E as a sequence of the n most recent successes and
failures, and call a specific error history Ei.  We want to compute Q from Ei.
First we compute P( Ei | Q ), the probability of seeing sequence Ei, given Q. 
Then we must assign P( Q ), the prior probability of Q.  This is our
estimate, before we look at Ei, of the probability that the device is in state
Q. We can get this from previous experience with devices of this type or from
the earlier history of this specific device.  Then we can compute the
probability of Q using Bayes' Theorem:

	P( Q | Ei ) = P( Ei | Q ) * P( Q ) / SUM over Q of ( P( Ei | Q ) * P( Q ) )

The denominator is a normalizing factor, and if Q is a continuous variable then
it becomes an integral.
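
In code the update is nothing more than multiply-and-renormalize.  A small
Python illustration, with Q restricted to a few candidate values and Ei treated
as independent Bernoulli trials (the grid and the prior are invented numbers):

    import numpy as np

    q_values = np.array([0.999, 0.99, 0.9, 0.5])    # candidate quality levels
    prior    = np.array([0.90, 0.07, 0.02, 0.01])   # P(Q): most devices healthy

    def posterior(q_values, prior, successes, failures):
        likelihood = q_values**successes * (1 - q_values)**failures  # P( Ei | Q )
        unnormalized = likelihood * prior
        return unnormalized / unnormalized.sum()    # the SUM in the denominator

    # e.g. an error history Ei with 3 failures in the last 200 operations
    print(posterior(q_values, prior, successes=197, failures=3))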

If we assume that Q is constant, that Ei is generated by a Bernoulli process,
and that our prior probability can be represented using one of the family of
beta functions, then the sum and quotient become very simple.  P( Q | Ei ) must
also be a beta function, and there is a simple relation between the parameters
of the prior (before Ei) beta function and the posterior (after Ei) beta function.
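
With the beta/Bernoulli assumptions the bookkeeping collapses to two numbers:
a Beta(alpha, beta) prior plus s successes and f failures in Ei gives a
Beta(alpha+s, beta+f) posterior.  A sketch, with invented prior parameters:

    def beta_update(alpha, beta, successes, failures):
        return alpha + successes, beta + failures

    alpha0, beta0 = 50.0, 1.0      # prior: device believed to be quite healthy
    alpha1, beta1 = beta_update(alpha0, beta0, successes=197, failures=3)
    mean_q = alpha1 / (alpha1 + beta1)     # posterior mean of Q
    print(alpha1, beta1, mean_q)           # posterior is Beta(247, 4)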

Once you have the beta function, you can compute its mean, variance and
suitable confidence intervals about the mean.  You can also combine it with a
utility function, which measures the utility of various decisions by your tool,
to generate a decision rule for the tool.  This decision rule will probably be
a simple counting rule, based on something like the count of failures in the
last n tries.  You will definitely need something like a count of tries or its
surrogate, a time interval, to scale the count of failures.
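
One plausible (not the only) way to turn that posterior into a decision rule,
using scipy's beta distribution; the quality floor and interval level are
arbitrary choices:

    from scipy.stats import beta as beta_dist

    def should_alarm(alpha, beta, q_floor=0.95, level=0.05):
        """Alarm if the 5th percentile of P( Q | Ei ) falls below a floor."""
        return beta_dist.ppf(level, alpha, beta) < q_floor

    # using the Beta(247, 4) posterior from the previous sketch
    print(should_alarm(247.0, 4.0))

For a fixed number of tries, a rule like this is monotone in the failure count,
so in practice it reduces to the simple counting rule mentioned above.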

Of course, your Q is by assumption not constant, so your Bayesian analysis
would be more complex.  One way to deal with this is to assume that Q changes
in step functions, with a low probability of changing during the sample Ei. 
Then you compute the current Qi based on the current Ei and the previous
Q(i-1), allowing for the small probability that Q has changed.  Another way to
deal with it is to put the empirical form of Q(t) into your calculations of 
P( Ei | Q ).
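
A cheap stand-in for the step-change calculation (not the full treatment
described above) is to discount the beta parameters each interval so that old
evidence fades and the estimate of Q can track a sudden change; the discount
factor here is an arbitrary choice:

    def discounted_update(alpha, beta, successes, failures, discount=0.9):
        """Exponential forgetting: older evidence counts for less."""
        return (discount * alpha + successes,
                discount * beta + failures)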

That was a very quick and sloppy overview of the direction I would recommend to
you.  Let me know if you want to follow up on it.  I think you will end up with 
an automaton, but perhaps a very simple one, and perhaps the same one used by 
VAXsim.