{"title": "Fault Diagnosis of Antenna Pointing Systems using Hybrid Neural Network and Signal Processing Models", "book": "Advances in Neural Information Processing Systems", "page_first": 667, "page_last": 674, "abstract": null, "full_text": "Fault Diagnosis of Antenna Pointing Systems \n\nusing Hybrid Neural Network and Signal \n\nProcessing Models \n\nPadhraic Smyth, J eft\" Mellstrom \nJet Propulsion Laboratory 238-420 \nCalifornia Institute of Technology \n\nPasadena, CA 91109 \n\nAbstract \n\nWe describe in this paper a novel application of neural networks to system \nhealth monitoring of a large antenna for deep space communications. The \npaper outlines our approach to building a monitoring system using hybrid \nsignal processing and neural network techniques, including autoregressive \nmodelling, pattern recognition, and Hidden Markov models. We discuss \nseveral problems which are somewhat generic in applications of this kind \n-\nin particular we address the problem of detecting classes which were \nnot present in the training data. Experimental results indicate that the \nproposed system is sufficiently reliable for practical implementation. \n\n1 Background: The Deep Space Network \n\nThe Deep Space Network (DSN) (designed and operated by the Jet Propulsion Lab(cid:173)\noratory (JPL) for the National Aeronautics and Space Administration (NASA)) is \nunique in terms of providing end-to-end telecommunication capabilities between \nearth and various interplanetary spacecraft throughout the solar system. The \nground component of the DSN consists of three ground station complexes located \nin California, Spain and Australia, giving full 24-hour coverage for deep space com(cid:173)\nmunications. 
Since spacecraft are always severely limited in terms of available transmitter power (for example, each of the Voyager spacecraft uses only 20 watts to transmit signals back to earth), all subsystems of the end-to-end communications link (radio telemetry, coding, receivers, amplifiers) tend to be pushed to the absolute limits of performance. The large steerable ground antennas (70m and 34m dishes) represent critical potential single points of failure in the network. In particular, there is only a single 70m antenna at each complex because of the large cost and calibration effort involved in constructing and operating a steerable antenna of that size - the entire structure (including pedestal support) weighs over 8,000 tons.

The antenna pointing systems consist of azimuth and elevation axis drives which respond to computer-generated trajectory commands to steer the antenna in real time. Pointing accuracy requirements for the antenna are such that there is little tolerance for component degradation. Achieving the necessary degree of positional accuracy is rendered difficult by various non-linearities in the gear and motor elements and by environmental disturbances such as gusts of wind affecting the antenna dish structure. Off-beam pointing can result in a rapid fall-off in signal-to-noise ratio and the consequent potential loss of irrecoverable scientific data from the spacecraft.

The pointing systems are a complex mix of electro-mechanical and hydraulic components. A faulty component will manifest itself indirectly via a change in the characteristics of observed sensor readings in the pointing control loop. 
Because of the non-linearity and feedback present, direct causal relationships between fault conditions and observed symptoms can be difficult to establish - this makes manual fault diagnosis a slow and expensive process. In addition, if a pointing problem occurs while a spacecraft is being tracked, the antenna is often shut down to prevent any potential damage to the structure, and the track is transferred to another antenna if possible. Hence, at present, diagnosis often occurs after the fact, when the original fault conditions may be difficult to replicate. An obvious strategy is to design an on-line automated monitoring system. Conventional control-theoretic models for fault detection are impractical due to the difficulties in constructing accurate models for such a non-linear system - an alternative is to learn the symptom-fault mapping directly from training data, the approach we follow here.

2 Fault Classification over Time

2.1 Data Collection and Feature Extraction

The observable data consist of various sensor readings (in the form of sampled time series) which can be monitored while the antenna is in tracking mode. The approach we take is to estimate the state of the system at discrete intervals in time. A feature vector x of dimension k is estimated from sets of successive windows of sensor data. A pattern recognition component then models the instantaneous estimate of the posterior class probability given the features, p(w_i | x), 1 <= i <= m. Finally, a hidden Markov model is used to take advantage of temporal context and estimate class probabilities conditioned on recent past history. 
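The window-based feature estimation described above is straightforward to sketch in code. In the sketch below, the particular features (a range, a variance, and least-squares autoregressive coefficients as a simplified stand-in for the ARX fit applied later in the paper) and all numerical settings are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of per-window feature estimation for one sensor channel.
# The 200-sample window length matches the paper's 4-second windows;
# the feature choices and AR order here are illustrative assumptions.
import numpy as np

def ar_coefficients(window, order=2):
    """Least-squares fit of y[t] = sum_k a_k * y[t-k] + e[t]
    (a simplified autoregressive stand-in for an ARX fit)."""
    X = np.column_stack([window[order - k - 1:len(window) - k - 1]
                         for k in range(order)])
    y = window[order:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a

def window_features(window, order=2):
    """Map one window to a small feature vector: an order statistic
    (the range), a moment (the variance), and AR coefficients."""
    return np.concatenate([[window.max() - window.min(), window.var()],
                           ar_coefficients(window, order)])

# Example: a 200-sample window from a simulated AR(2) sensor channel.
rng = np.random.default_rng(0)
w = np.zeros(200)
for t in range(2, 200):
    w[t] = 0.5 * w[t - 1] - 0.3 * w[t - 2] + 0.1 * rng.normal()
print(window_features(w))  # 4 numbers: range, variance, a_1, a_2
```

A sequence of such feature vectors, one per window, is what the pattern recognition component consumes.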
This hierarchical pattern of information flow, where the time series data are transformed and mapped into a categorical representation (the fault classes) and integrated over time to enable robust decision-making, is quite generic to systems which must passively sense and monitor their environment in real time.

Experimental data were gathered from a new antenna at a research ground station at the Goldstone DSN complex in California. We introduced hardware faults in a controlled manner by switching faulty components in and out of the control loop. Obtaining data in this manner is an expensive and time-consuming procedure, since the antenna is not currently instrumented for sensor data acquisition and is located in a remote part of the Mojave Desert in Southern California. The sensor variables monitored included wind speed, motor currents, tachometer voltages, estimated antenna position, and so forth, under three separate fault conditions (plus normal conditions).

The time series data was segmented into windows of 4 seconds duration (200 samples) to allow reasonably accurate estimates of the various features. The features consisted of order statistics (such as the range) and moments (such as the variance) of particular sensor channels. In addition, we also applied an autoregressive-exogenous (ARX) modelling technique to the motor current data, where the ARX coefficients are estimated on each individual 4-second window of data. The autoregressive representation is particularly useful for discriminative purposes (Eggers and Khuon, 1990).

2.2 State Estimation with a Hidden Markov Model

If one applies a simple feed-forward network model to estimate the class probabilities at each discrete time instant t, the fact that faults are typically correlated over time is ignored. 
Rather than modelling the temporal dependence of the features, p(x(t) | x(t-1), ..., x(0)), a simpler approach is to model temporal dependence via the class variable using a Hidden Markov Model (HMM). The m classes comprise the Markov model states. The components of the Markov transition matrix A (of dimension m x m) are specified subjectively rather than estimated from the data, since there is no reliable database of fault-transition information available at the component level from which to estimate these numbers. The hidden component of the HMM arises from the fact that one cannot observe the states directly, but only indirectly via a stochastic mapping from states to symptoms (the features). For the results reported in this paper, the state probability estimates at time t are calculated using all the information available up to that point in time. The probability state vector is denoted by p(s(t)). The probability estimate of state i at time t can be calculated recursively via the standard HMM equations:

    u(t) = A p(s(t-1))    and    p(s_i(t)) = u_i(t) y_i(t) / sum_{j=1}^{m} u_j(t) y_j(t)

where the estimates are initialised by a prior probability vector p(s(0)), the u_i(t) are the components of u(t), 1 <= i <= m, and the y_i(t) are the likelihoods p(x | w_i) produced by the particular classifier being used (which can be estimated to within a normalising constant by p(w_i | x) / p(w_i)).

2.3 Classification Results

We compare a feedforward multi-layer perceptron model (a single hidden layer with 12 sigmoidal units, trained using the squared-error objective function and a conjugate-gradient version of backpropagation) and a simple maximum-likelihood Gaussian classifier (with an assumed diagonal covariance matrix, variances estimated from the data), both with and without the HMM component. 
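The HMM recursion above is simple to implement. In the following sketch, the two-state transition matrix, the prior, and the per-window likelihood values are illustrative numbers, not values from the paper.

```python
# Sketch of the HMM filtering recursion: u(t) = A p(s(t-1)), then
# p(s_i(t)) proportional to u_i(t) y_i(t). All numbers below are
# illustrative; the paper's transition matrix was set subjectively.
import numpy as np

def hmm_step(A, p_prev, y):
    u = A @ p_prev          # propagate state probabilities one step
    p = u * y               # weight by the classifier likelihoods y_i(t)
    return p / p.sum()      # normalise over the m states

A = np.array([[0.95, 0.05],     # "sticky" transitions: states persist
              [0.05, 0.95]])
p = np.array([0.5, 0.5])        # prior p(s(0)) over (normal, fault)
for y in [np.array([0.9, 0.2]),  # per-window likelihoods p(x | w_i)
          np.array([0.8, 0.3]),
          np.array([0.2, 0.2])]:
    p = hmm_step(A, p, y)
print(p)  # state probabilities after three windows
```

Because the transition matrix is sticky, isolated noisy likelihoods perturb the state estimate far less than they would a memoryless classifier, which is the stabilising effect discussed below.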
Figure 1: Stabilising effect of Markov component. (Plot: estimated probability of the true class (Normal) versus time in seconds, for the neural model alone and for the neural+Markov model.)

Table 1 summarizes the overall classification accuracies obtained for each of the models - these results are for models trained on data collected in early 1991 (450 windows) which were then field-tested in real time at the antenna site in November 1991 (596 windows). There were 12 features used in this particular experiment, including both ARX and time-domain features. Clearly, the neural-Markov model is the best model in the sense that no samples at all were misclassified - it is significantly better than the simple Gaussian classifier. Without the Markov component, the neural model still classified quite well (0.84% error rate). However, all of its errors were false alarms (the classifier decision was a fault label when in reality conditions were normal), which are highly undesirable from an operational viewpoint - in this context, the Markov model has significant practical benefit. Figure 1 demonstrates the stabilising effect of the Markov model over time. The vertical axis corresponds to the probability estimate of the model for the true class. Note the large fluctuations and general uncertainty in the neural output (due to the inherent noise in the feature data) compared to the stability when temporal context is modelled. 
\n\nTable 1: Classification results for different models \n\nModel \n\n3 Detecting novel classes \n\nWhile the neural model described above exhibits excellent performance in terms \nof discrimination, there is another aspect to classifier performance which we must \nconsider for applications of this nature: how will the classifier respond if presented \nwith data from a class which was not included in the training set ? Ideally, one \nwould like the model to detect this situation. For fault diagnosis the chance that \none will encounter such novel classes under operational conditions is quite high since \nthere is little hope of having an exhaustive library of faults to train on. \n\nIn general, whether one uses a neural network, decision tree or other classification \n\n\fFault Diagnosis of Antenna Pointing Systems \n\n671 \n\nfeature 2 \n\nB \n\nB \n\nB \n\nB B \nB \n\nB \n\nc \n\nB \nB \n\nA \nA \n\nA \n\nA \nA \n\nA \n\n\" \n\nnovel input \n\n.... __ ---training data \n\nA \nA \n\nfeature 1 \n\nFigure 2: Data from a novel class C \n\nmethod, there are few guarantees about the extrapolation behaviour of the trained \nclassification model. Consider Figure 2, where point C is far away from the \"A\" s \nand \"B\"s on which the model is trained. The response of the trained model to \npoint C may be somewhat arbitrary, since it may lie on either side of a decision \nboundary depending on a variety of factors such as initial conditions for the training \nalgorithm, objective function used, particular training data, and so forth. One might \nhope that for a feedforward multi-layer perceptron, novel input vectors would lead \nto low response for all outputs. 
However, if units with non-local response functions are used in the model (such as the commonly used sigmoid function), the tendency of training algorithms such as backpropagation is to generate mappings which have a large response for at least one of the classes as the attributes take on values which extend well beyond the range of the training data values. Leonard and Kramer (1990) discuss this particular problem of poor extrapolation in the context of fault diagnosis of a chemical plant. The underlying problem lies in the basic nature of discriminative models, which focus on estimating decision boundaries based on the differences between classes. In contrast, if one wants to detect data from novel classes, one must have a generative model for each known class, namely one which specifies how the data is generated for these classes. Hence, in a probabilistic framework, one seeks estimates of the probability density function of the data given a particular class, f(x | w_i), from which one can in turn use Bayes' rule for prediction:

    p(w_i | x) = f(x | w_i) p(w_i) / sum_{j=1}^{m} f(x | w_j) p(w_j)    (1)

4 Kernel Density Estimation

Unless one assumes a particular parametric form for f(x | w_i), it must somehow be estimated from the data. Let us ignore the multi-class nature of the problem temporarily and simply look at a single-class case. We focus here on the use of kernel-based methods (Silverman, 1986). Consider the 1-dimensional case of estimating the density f(x) given samples {x_i}, 1 <= i <= N. The idea is simple enough: we obtain an estimate f^(x), where x is the point at which we wish to know the density, by summing the contributions of the kernel K((x - x_i)/h) (where h is the bandwidth of the estimator, and K(.) 
is the kernel function) over all the samples and normalizing such that the estimate is itself a density, i.e.,

    f^(x) = (1/(N h)) sum_{i=1}^{N} K((x - x_i)/h)    (2)

The estimate f^(x) directly inherits the properties of K(.), hence it is common to choose the kernel shape itself to be some well-known smooth function, such as a Gaussian. For the multi-dimensional case, the product kernel is commonly used:

    f^(x) = (1/(N h_1 ... h_d)) sum_{i=1}^{N} prod_{k=1}^{d} K((x^k - x_i^k)/h_k)    (3)

where x^k denotes the component in dimension k of the vector x, and the h_k represent different bandwidths in each dimension.

Various studies have shown that the quality of the estimate is typically much more sensitive to the choice of the bandwidth h than it is to the kernel shape K(.) (Izenman, 1991). Cross-validation techniques are usually the best method to estimate the bandwidths from the data, although this can be computationally intensive and the resulting estimates can have a high variance across particular data sets. A significant disadvantage of kernel models is the fact that all training data points must be stored, and a distance measure between a new point and each of the stored points must be calculated for each class prediction. Another less obvious disadvantage is the lack of empirical results and experience with using these models for real-world applications - in particular, there is a dearth of results for high-dimensional problems. In this context we now outline a kernel approximation model which is considerably simpler both to train and implement than the full kernel model.

5 Kernel Approximation using Mixture Densities

5.1 Generating a Kernel Approximation

An obvious simplification to the full kernel model is to replace clusters of data points by representative centroids, to be referred to as the centroid kernel model. 
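The product-kernel estimate of Equation (3) can be sketched directly. In the sketch below, the Gaussian kernel choice and the fixed bandwidths are illustrative assumptions; as noted above, bandwidths would normally be chosen by cross-validation.

```python
# Sketch of the product-kernel density estimate of Equation (3) with a
# Gaussian kernel. Bandwidths are fixed by hand here for illustration.
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x, samples, h):
    """Product-kernel estimate at point x (shape (d,)) from samples
    (shape (N, d)) with per-dimension bandwidths h (shape (d,)).
    With d = 1 this reduces to the 1-dimensional Equation (2)."""
    u = (x - samples) / h                        # (N, d) scaled offsets
    contrib = np.prod(gaussian_kernel(u), axis=1)
    return contrib.sum() / (len(samples) * np.prod(h))

rng = np.random.default_rng(0)
samples = rng.normal(size=(1000, 2))             # 2-D standard normal
h = np.array([0.3, 0.3])
print(kde(np.zeros(2), samples, h))  # density near the mode, ~ 1/(2 pi)
```

Note that every one of the N stored samples enters each evaluation, which is exactly the storage and computation cost that motivates the centroid approximation of Section 5.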
\nIntuitively, the sum of the responses from a number of kernels is approximated by \na single kernel of appropriate width. Omohundro (1992) has proposed algorithms \nfor bottom-up merging of data points for problems of this nature. Here, however, \nwe describe a top-down approach by observing that the kernel estimate is itself a \nspecial case of a mixture density. The underlying density is assumed to be a linear \ncombination of L mixture components, i.e., \n\nL \n\nf(x) = I>~ifi(X) \n\ni=1 \n\n(4) \n\nwhere the ai are the mixing proportions. The full kernel estimate is itself a special \ncase of a mixture model with ai = liN and Ji(x) = K(x). Hence, our centroid \nkernel model can also be treated as a mixture model but now the parameters of the \nmixture model (the mixing proportions or weights, and the widths and locations of \nthe centroid kernels) must be estimated from the data. There is a well-known and \n\n\fFault Diagnosis of Antenna Pointing Systems \n\n673 \n\n-Centroid Kernel Likelihood \n-----lower 1-s1gma boundary \n____________________________ :-:-:~r_1:.s~~a_~u~~ry \n\nKernel Model \n\n40 \n\n20 \n\nLog-likelihood of \n0 \nunknown ellis -20 \nunder normll \nhypothesis \non test dltl \n\n-40 \n\n-10 \n\n-10 \n\n-100 +-'----t.;;.;;...--+-':...:;....-~-__F~-_F-=---~F_=-_F=-___1 00 \n\nTime (seconds) \n\no~ ____ ~ _____________________________ ~ __ \n\nSigmoidal Model \n\n-0-4 \n\nLog-likelihood of \nunknown ellss -0.' \nunder normll \nhypothesis \non test dltl \n\n-1.2 \n\n-1.1 \n\n-likelihOOd.~eu~ \nun \n---\u00b7-loWer 1-81 \n_\u00b7--\u00b7upper 105 ma un~~ \n\n400 \n-2+----t---r--r---+--~-~~-~-___1 \n\n350 \n\n300 \n\n250 \n\nTime (seconds) \n\n150 \n\n200 \n\nFigure 3: Likelihoods of kernel versus sigmoidal model on novel data \n\nfast statistical procedure known as the EM (Expectation-Maximisation) algorithm \nfor iteratively calculating these parameters, given some initial estimates (e.g., Red(cid:173)\nner and Walker, 1984). 
Hence, the procedure for generating a centroid kernel model is straightforward: divide the training data into homogeneous subsets according to class labels and then fit a mixture model with L components to each class using the EM procedure (initialisation can be based on randomly selected prototypes). Prediction of class labels then follows directly from Bayes' rule (Equation (1)). Note that there is a strong similarity between mixture/kernel models and Radial Basis Function (RBF) networks. However, unlike the RBF models, we do not train the output layer of our network in order to improve discriminative performance, as this would potentially destroy the desired probability estimation properties of the model.

5.2 Experimental Results on Detecting Novel Classes

In Figure 3 we plot the log-likelihoods, log f(x | w_i), as a function of time, for both a centroid kernel model (Gaussian kernel, L = 5) and the single-hidden-layer sigmoidal network described in Section 2.2. Both of these models were trained on only 3 of the original 4 classes (the discriminative performance of the models was roughly equivalent), excluding one of the known classes. The inputs {x_i} to the models are data from this fourth class. The dashed lines indicate the mu +/- sigma boundaries on the log-likelihood for the normal class as calculated on the training data - this tells us the typical response of each model for the class \"normal\" (note that the absolute values are irrelevant since the likelihoods have not been normalised via Bayes' rule). For both models, the maximum response for the novel data came from the normal class. For the sigmoidal model, the novel response was actually greater than that on the training data - the network is very confident in its erroneous decision that the novel data belong to class normal. Hence, in practice, the presence of a novel class would be completely masked. 
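The centroid kernel procedure of Section 5.1 (fit a small mixture per class with EM, then compare the log-likelihood of new data against its range on training data) might be sketched as follows. The spherical components, the component count, the fixed iteration budget, and the 2-sigma rejection threshold are all illustrative assumptions, not the paper's settings.

```python
# Sketch: per-class Gaussian mixture fit by EM, with novelty flagged
# when log-likelihood falls well below its range on training data.
import numpy as np

def fit_mixture(X, L=3, iters=50, seed=0):
    """Fit an L-component spherical Gaussian mixture by EM (sketch)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mu = X[rng.choice(N, size=L, replace=False)]   # centroid locations
    var = np.full(L, X.var())                      # component widths
    w = np.full(L, 1.0 / L)                        # mixing proportions
    for _ in range(iters):
        # E-step: responsibilities r[n, j] under current parameters.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        logp = np.log(w) - 0.5 * d * np.log(2 * np.pi * var) - 0.5 * d2 / var
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, locations and widths.
        Nj = r.sum(axis=0) + 1e-12
        w, mu = Nj / N, (r.T @ X) / Nj[:, None]
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        var = (r * d2).sum(axis=0) / (d * Nj) + 1e-6
    return w, mu, var

def log_likelihood(X, params):
    """log f(x | class) under the fitted mixture, one value per row."""
    w, mu, var = params
    d = X.shape[1]
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
    logp = np.log(w) - 0.5 * d * np.log(2 * np.pi * var) - 0.5 * d2 / var
    m = logp.max(axis=1, keepdims=True)
    return m[:, 0] + np.log(np.exp(logp - m).sum(axis=1))

rng = np.random.default_rng(1)
train = rng.normal(0.0, 1.0, size=(500, 2))   # one known class
novel = rng.normal(8.0, 1.0, size=(200, 2))   # a class never trained on
params = fit_mixture(train)
ll = log_likelihood(train, params)
threshold = ll.mean() - 2.0 * ll.std()        # calibrated on training data
frac = (log_likelihood(novel, params) < threshold).mean()
print(frac)  # nearly all novel points fall below the threshold
```

The threshold plays the role of the 1-sigma boundaries in Figure 3: it calibrates what a "typical" in-class likelihood looks like, so that markedly lower values can be flagged as novel.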
On the other hand, for the kernel model, the measured response on the novel data is significantly lower than that obtained on the training data. The classifier can directly calculate that it is highly unlikely that this new data belongs to any of the 3 classes on which the model was trained. In practice, for a centroid kernel model, the training data will almost certainly fit the model better than a new set of test data, even data from the same class. Hence, it is a matter of calibration to determine appropriate levels at which new data is deemed sufficiently unlikely to come from any of the known classes. Nonetheless, the main point is that a local kernel representation facilitates such detection, in contrast to models with global response functions (such as sigmoids).

In general, one does not expect a generative model which is not trained discriminatively to be fully competitive in terms of classification performance with discriminative models - ongoing research involves developing hybrid discriminative-generative classifiers. In addition, on-line learning of novel classes once they have been detected is an interesting and important problem for applications of this nature. An initial version of the system we have described in this paper is currently undergoing test and evaluation for implementation at DSN antenna sites.

Acknowledgements

The research described in this paper was performed at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration, and was supported in part by DARPA under grant number AFOSR-90-0199.

References

M. Eggers and T. Khuon, 'Neural network data fusion concepts and application,' in Proceedings of 1990 IJCNN, San Diego, vol. II, pp. 7-16, 1990.

M. A. Kramer and J. A. 
Leonard, 'Diagnosis using backpropagation neural networks - analysis and criticism,' Computers and Chemical Engineering, vol. 14, no. 12, pp. 1323-1338, 1990.

B. Silverman, Density Estimation for Statistics and Data Analysis, New York: Chapman and Hall, 1986.

A. J. Izenman, 'Recent developments in nonparametric density estimation,' J. Amer. Stat. Assoc., vol. 86, pp. 205-224, March 1991.

S. Omohundro, 'Model-merging for improved generalization,' in this volume.

R. A. Redner and H. F. Walker, 'Mixture densities, maximum likelihood, and the EM algorithm,' SIAM Review, vol. 26, no. 2, pp. 195-239, April 1984.
", "award": [], "sourceid": 502, "authors": [{"given_name": "Padhraic", "family_name": "Smyth", "institution": null}, {"given_name": "Jeff", "family_name": "Mellstrom", "institution": null}]}