Optimal state selection and tuning parameters for a degradation model in bearings using Mel-Frequency Cepstral Coefficients and Hidden Markov Chains

Preventive maintenance is a philosophy of asset management that aims to maximize operation through routine inspections, with increasing frequency when no abnormalities are exhibited. This leads to an increase in the probability of failure due to repetitive intervention and the inherent human error. Recently, forecasting (or predictive) research has been addressed in order to obtain effective maintenance strategies and to evaluate and manage the residual risk in equipment susceptible to degradation. Predictive research concerns the estimation of an asset's Remaining Useful Life (RUL) by predicting its health state through the progression of its degradation. This article presents the development of an automated system that identifies types of faults in bearings, using Mel-Frequency Cepstral Coefficients (MFCC) as the feature set for description, and Hidden Markov Chains (HMC) with discrete observations as the classification method. Here we emphasize the optimal selection of the number of states in the HMC classifier using the Receiver Operating Characteristic (ROC) curve criterion. Features are discretized using k-means clustering. The signals in this study are vibration signals from bearings in electrical machinery. The two databases used here are labeled with four different operation scenarios: normal, inner ring fault, outer ring fault, and rolling element fault. One of the databases additionally differentiates severity levels for each scenario.

approaches; hence, the trained model only responds properly to fault processes whose features match the training data.
Consequently, these off-line models are not suitable for situations where the features change. Moreover, the computational cost of training prevents them from being used on-line in real-time applications [9]. Therefore, the prediction models to date remain theoretical and limited to a small number of models and fault modes [2].
Another traditional strategy for RUL prediction is to use approaches based on a physical model of the mechanism, but these require specific knowledge of the system and generally do not yield a general model covering all fault modes and the entire life cycle of the mechanism [5,10,11]. Such models normally carry uncertainty arising from assumptions and simplifications, given the complexity and stochastic nature of the systems involved [11-13]. Furthermore, it is impractical to create a prediction model for the entire life cycle [14].
General interest has focused on non-invasive techniques for fault detection and prediction, which basically implies signal-based techniques [15]. Parametric linear prediction techniques such as autoregressive moving-average (ARMA) models usually work only for short-term predictions, given the assumption of linearity of the process. Artificial Intelligence techniques such as Artificial Neural Networks have been used as well; however, since they are black boxes with slow convergence, they suffer from shortcomings such as difficulties of interpretation and structure.
Other prediction approaches, such as Support Vector Machines (SVM), fuzzy logic, etc., have difficulties

INTRODUCTION
American manufacturers invested around USD 600 billion in the maintenance of critical systems in the early eighties. These expenses doubled within just 20 years, and an alarming half of that money was spent on non-effective maintenance [1]. Nevertheless, the constant pressure to improve the reliability of assets remains [2]. So much so that preventive and predictive actions accounted for 65.9% of maintenance resources globally in 2008, and 72.4% in the industrialized world [3]. What is worrying is that industrial energy consumption keeps growing in global share, reaching 42.6% in 2013 [4].
With the purpose of developing efficient maintenance strategies, challenges related to predictive research have been addressed in order to manage the residual risk of failing equipment [2].
Predictive research in this context should be understood as the estimation of the Remaining Useful Life (RUL) of an asset by predicting the progression of a diagnosed anomaly [5]. Nonetheless, diagnosis is basically a classification problem addressed by many methodologies in the literature, as opposed to prognosis [6]. An efficient diagnosis, in order to support decisions about the structural health state, should deliver information regarding location and severity, following a procedure of event detection and statistical analysis of the observed characteristics [7]. Some common strategies for fault diagnosis are based on visual observation of the spectrum with the 2-D Fast Fourier Transform (FFT) and time-frequency contour graphs, which are unable to comply with the requirements of a prediction model [8]. Prediction models are usually trained off-line to deal with stochastic problems, because of the inherent randomness in most degradation processes [5,13,15]. Modal analysis has shown that modes can be insensitive to damage location and that changes in the mode shapes can be unidentifiable due to noise and the limited number of recognizable mode shapes [7]. Some applications where prediction is a key part of the work adopt a model-based perspective applied in the early design and development of the asset; the model is simulated in order to improve reliability and availability. As an example, [16,17] introduce the development of a model for a vehicle suspension, including diagnosis and RUL predictions with projections of the system under different operating conditions.
Whether model-based or data-based, general RUL inference methods rely on previous knowledge about how the system operates and on the relative frequency of occurrence of the different kinds of defects. This perspective resembles the structure of a doubly stochastic process that must find, by probabilistic means, the total degradation state of a system and the probabilities of transition between states [18,19].
To a great extent, the aforementioned model is similar to those used in speech processing, since both involve signals of quasi-stationary nature, and in both cases there are differences in the parameters of similar patterns coming from different systems. An example of such a case is the variability among similar phonemes generated by the same vocal tract, or the variability among vibration signals from presumably similar machines under identical operating conditions.
Several speech processing systems use Hidden Markov Chains (HMC), since they allow for the analysis of a dynamic random process [18,20,21]. Given that HMC offer easy interpretation and the ability to perform in competitive learning environments, they are introduced here as a method for classifying the residual life in a degradation process, with the goal of characterizing the health state of a mechanism [5,6,15]. Mel-Frequency Cepstral Coefficients (MFCC) provide signal information from both the time and the frequency domain, and they also enable the extraction of dynamic features, whether linear or non-linear. These features are commonly used in vibration signal and speech processing [22,23], and that is the reason we use MFCC jointly with HMC to perform the bearing diagnosis.

Database
We used two different databases for the training and validation of the proposed system. The first database was compiled by the Bearing Data Center at Case Western Reserve University [24]. The selected set corresponds to data acquired at 12k samples per second; faults were induced through electric discharge machining, with diameters of 0.1778, 0.3556, and 0.5334 millimeters, configuring faults labeled according to three severity levels (low, medium, and high) in three bearing elements (inner ring, outer ring, and rolling element). More details on other parameters of this database can be found in [24].
The second database was developed by the Mechanical Vibrations Laboratory at Universidad Tecnológica de Pereira, with data acquired at 20k samples per second. Faults were induced through machining on the rolling element, the inner ring, and the outer ring.

Mel-Frequency Cepstral Coefficients (MFCC)
This is a parametric representation method based on the frequency response of the human ear. MFCC use two filter distributions: linear for frequencies below 1 kHz and logarithmic for frequencies above 1 kHz [25]. This distribution is referred to as the Mel scale. The MFCC implementation process comprises seven steps, detailed in Figure 1 [25].
• Segmentation: the signal is divided into small frames of n samples each; frames usually overlap with their neighbors.
• Windowing: a Hamming window is usually applied to shape the frames and to integrate the closest frequency lines.
• Pre-Emphasis: filtering that emphasizes the higher frequencies by increasing their energy.
• FFT: the Fast Fourier Transform is applied to each windowed frame, Y_f = FFT[w · x_f], f = 1, …, F, where F is the number of frames, w is the Hamming window function, and x is the input signal.
• Mel filters: a bank of filters distributed according to the Mel scale, with triangular band-pass frequency responses. The magnitude response equals unity at the center frequency and decreases linearly to zero at the center frequencies of the two adjacent filters.
• DCT (Discrete Cosine Transform): the process that converts the log Mel spectrum back into the time (cepstral) domain. The result of this conversion is the set of Mel-Frequency Cepstral Coefficients.
• Delta energy and delta spectrum: given the quasi-stationary nature of the signals involved, the frames change as they overlap in their transitions. To represent this, velocity (delta) and acceleration (double-delta) coefficients are appended to the 13 static coefficients, for 39 features in total.
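As an illustration, the steps above can be sketched in compact form. This is a minimal sketch using the standard Mel-scale conversion formulas, not the exact implementation used in this work; the frame length, hop size, and pre-emphasis factor of 0.97 are hypothetical choices:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular band-pass filters spaced on the Mel scale; unity gain at the
    # center frequency, falling linearly to zero at the centers of the two
    # adjacent filters.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc(signal, fs, frame_len=256, hop=128, n_filters=24, n_coeffs=13):
    # Pre-emphasis: boost the energy of the higher frequencies
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    window = np.hamming(frame_len)                 # windowing
    fb = mel_filterbank(n_filters, frame_len, fs)  # Mel filter bank
    # DCT-II matrix: converts the log Mel spectrum to cepstral coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * n_filters))
    feats = []
    for start in range(0, len(sig) - frame_len + 1, hop):   # segmentation
        frame = sig[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2             # FFT
        log_mel = np.log(fb @ power + 1e-12)                # Mel filtering
        feats.append(dct @ log_mel)                         # DCT
    return np.array(feats)  # one row of cepstral coefficients per frame
```

Delta and double-delta coefficients would be obtained afterwards by differencing these rows across frames.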

Code Book
To use the MFCC feature set as discrete observations, a clustering technique such as k-means is required [26].
The original procedure follows the guideline of defining k centroids in correspondence with the existence of k clusters. The centroids are located so that they remain as far as possible from each other, and every continuous observation point is associated with its nearest discrete centroid. Subsequently, several iterations are run in order to relocate the centroids and thus minimize the distance between continuous observations and centroids. This algorithm admits variations aimed at improving computational efficiency [27]. The collection of centroids for the different discretization processes is known as the Code Book.
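The codebook construction just described can be sketched as follows; `kmeans_codebook` and `quantize` are hypothetical names, and this is a plain Lloyd's-iteration sketch rather than one of the optimized variants mentioned in [27]:

```python
import numpy as np

def kmeans_codebook(features, k, n_iter=50, seed=0):
    # features: (n_observations, n_dims) continuous MFCC vectors.
    # Initialize centroids from k distinct data points.
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each observation joins its nearest centroid
        d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each centroid moves to the mean of its partition
        new = np.array([features[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break  # converged: no significant change in the update step
        centroids = new
    return centroids, labels

def quantize(features, centroids):
    # Map continuous feature vectors to discrete codebook symbols
    d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)
```

The symbol sequence returned by `quantize` is what the discrete-observation HMC consumes.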

Hidden Markov Chains (HMC)
A Markov Chain is any system that, at a given instant, can be considered to be in one of a finite set of N discrete states, denoted {S_1, S_2, …, S_N}. At discrete, uniformly spaced times, the system undergoes state transitions according to a set of transition probabilities; q_t denotes the state at time t.
The chain is of first order if its transition probability depends only on the previous state: a_{ij} = P{q_{t+1} = S_j | q_t = S_i}, 1 ≤ i, j ≤ N, with a_{ij} ≥ 0 and Σ_j a_{ij} = 1. This model is known as an observable Markov Chain, since the process output is the set of states itself at any time instant. A Hidden Markov Chain, in turn, is the extension of the observable model in which the outputs are probabilistic functions of the state; the model is thus an embedded doubly stochastic process that is not directly observable, but only indirectly, through the set of output sequences [21,28].
An HMC is characterized by the following elements:
• N: the number of states of the model.
• M: the number of distinct observable symbols, denoted V = {v_1, …, v_M}.
A description of how to address each of the aforementioned problems can be obtained from [21,28].

Labeled Databases
For training and validation, two databases were used: one compiled by the Bearing Data Center at Case Western Reserve University [24], and the other from the Mechanical Vibrations Laboratory of Universidad Tecnológica de Pereira.
For the Bearing Data Center database, the data were acquired at 12k samples per second; faults were induced in three different parts of the bearing (inner ring, outer ring, and rolling element) with three severity levels, plus one normal-operation state (base signal), at four different speeds (1730, 1750, 1772, and 1797 RPM). To create a labeled database, labels were assigned as follows: "Inner" for inner ring faults, "Outer" for outer ring faults, "Ball" for rolling element faults, "Normal" for the base signal, and "N1", "N2", and "N3" for the three severity levels, N1 being the least severe and N3 the most severe. The resulting database comprises ten classes: Normal, BallN1, BallN2, BallN3, InnerN1, InnerN2, InnerN3, OuterN1, OuterN2, and OuterN3.

MFCC features and k-means extraction
Each continuous observation x_i is assigned to the cluster with the nearest centroid, c_i = argmin_j ||x_i − μ_j||², where 1 ≤ j ≤ k. Then, an update is performed by estimating a new centroid for each partition, μ_j = (1/|S_j|) Σ_{x_i ∈ S_j} x_i, where S_j is the set of observations assigned to centroid j. Finally, the algorithm converges when no significant changes occur in the update step.

HMC models training
To train the HMC models, the parameters λ = {A, B, π} must be adjusted given the observations, in order to maximize P{O|λ}.
The Expectation Maximization (EM) algorithm is used, as in [31,32]. Define ξ_t(i, j) as the probability of being in state i at time t and in state j at time t + 1, given the observations and the model: ξ_t(i, j) = P{q_t = S_i, q_{t+1} = S_j | O, λ}. ξ_t(i, j) can also be reformulated as a function of the forward and backward variables α_t(i) and β_t(i): ξ_t(i, j) = α_t(i) a_{ij} b_j(O_{t+1}) β_{t+1}(j) / P{O|λ}. With γ_t(i) = Σ_j ξ_t(i, j), the re-estimated parameters are a'_{ij} = Σ_{t=1}^{T−1} ξ_t(i, j) / Σ_{t=1}^{T−1} γ_t(i), b'_j(k) = Σ_{t : O_t = v_k} γ_t(j) / Σ_{t=1}^{T} γ_t(j), and π'_i = γ_1(i). Iterations are performed while P{O|λ'} > P{O|λ}, that is, while the new model explains the observation sequence with higher probability.
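A single EM (Baum-Welch) re-estimation step can be sketched as follows. This is an illustrative sketch, not the authors' implementation; it assumes the standard scaled forward-backward recursions to avoid numerical underflow, and `baum_welch_step` is a hypothetical name:

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    # A: (N, N) transition matrix, B: (N, M) emission matrix,
    # pi: (N,) initial distribution, obs: sequence of symbol indices.
    N, T = A.shape[0], len(obs)
    obs_arr = np.asarray(obs)
    # Scaled forward pass
    alpha = np.zeros((T, N)); scale = np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum(); alpha[t] /= scale[t]
    # Scaled backward pass
    beta = np.zeros((T, N)); beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
    # gamma_t(i) and xi_t(i, j) as in the re-estimation formulas above
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi[t] /= xi[t].sum()
    # Re-estimation of A, B, and pi
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = gamma[obs_arr == k].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]
    pi_new = gamma[0]
    log_lik = np.log(scale).sum()  # log P(O | lambda) of the *input* model
    return A_new, B_new, pi_new, log_lik
```

Repeating this step yields a nondecreasing sequence of log-likelihoods, matching the stopping rule P{O|λ'} > P{O|λ}.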

Model Evaluation
To evaluate which HMC models best represent the observations, we created Receiver Operating Characteristic (ROC) curves, which are broadly accepted as the de facto method for the analysis and comparison of diagnostic tests [33-35].
Sensitivity and specificity are the measures of the diagnostic accuracy of a test. Sensitivity is defined as the true positive rate and represents the proportion of positive observations that yield positive results on the test. Specificity, on the other hand, is defined as the true negative rate and represents the proportion of negative observations that yield negative results on the test.
An ROC curve is a plot of sensitivity against 1 − specificity. The curve is obtained after performing Monte Carlo sampling over the solution set of the tuning parameters of the HMC model. At one extreme, when the tuning declares every observation positive, sensitivity is 1 and specificity is 0 (i.e., 1 − specificity = 1); at the other extreme, when it declares none positive, sensitivity is 0 and specificity is 1 (i.e., 1 − specificity = 0). The area under the ROC curve is used as a measure of diagnostic accuracy and as the comparison criterion between models [33,35].
In the specific case of our trained HMC models, the number of states is the discriminating criterion, and the tuning parameters are the number of coefficients, the number of Mel filters, and the number of centroids. From the total number of parameter combinations, each ROC curve is constructed with 1100 points, each corresponding to the sensitivity and specificity values of a model trained with a particular set of tuning parameters. These points are plotted in a normalized coordinate frame ranging from (0,0) to (1,1).
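The construction of each ROC point, and of the area under the resulting curve, can be sketched as follows; the function names are hypothetical, binary labels per fault class are assumed, and the area is computed by simple trapezoidal integration over the sampled points:

```python
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    # y_true, y_pred: binary arrays (1 = fault present).
    y_true = np.asarray(y_true); y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
    fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
    tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
    return tp / (tp + fn), tn / (tn + fp)

def roc_auc_from_points(points):
    # points: list of (1 - specificity, sensitivity) pairs, e.g. one per
    # Monte Carlo-sampled tuning of the model; (0,0) and (1,1) anchor the curve.
    pts = sorted(points + [(0.0, 0.0), (1.0, 1.0)])
    x = np.array([p[0] for p in pts])
    y = np.array([p[1] for p in pts])
    # Trapezoidal integration of sensitivity over 1 - specificity
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))
```

A model whose points sit closer to the upper-left corner yields a larger area, matching the comparison criterion described above.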
Figures 2 and 3 show the ROC curves for HMC models with 1, 2, 3, 4, and 5 states over the Case Western Reserve University Bearing Data Center database, with 70% of the database used for training. Figure 2 shows the results when the class separation includes severity levels, whereas Figure 3 shows the separation by fault states only, without severity.
Figure 4 shows the ROC curves for HMC models with 1, 2, 3, 4, and 5 states for the Mechanical Vibrations Laboratory database, with 70% of the database used for training; fault states are included, but severity levels are not available.
To provide a measure of diagnostic accuracy, the ROC curves were smoothed using a moving-average method, and the area under each curve was then computed.

Figure 2. ROC curves for HMC trained with data from the Bearing Data Center. The database includes severity levels.
To determine the optimal number of states for the HMC model, we look for a trade-off between the area under the ROC curve and the number of states. For the Bearing Data Center database, Figure 5 shows area maxima at 3 and 6 states; when severity levels are included, local maxima are reached at 5 and 7 states. Accounting for computational complexity, 4 states already reach values near the maximum area. For the Mechanical Vibrations Laboratory (UTP) database, Figure 5 shows a monotonically increasing area that is maximal at 7 states.
For the models whose database does not include severity, the ROC area increases as the number of HMC states increases. However, when severity levels are included, the ROC area tends to decrease after reaching its maximum. The larger the area under the ROC curve, the greater the diagnostic accuracy and thus the discriminating power [36]. Although some authors [37,38] suggest using the non-parametric Wilcoxon statistical test to analyze the differences between the areas, it can be shown that the area under the ROC curve is closely related to the Wilcoxon statistic and is not affected by the probability distribution. This is especially convenient when working with probability distributions other than the normal.
From the results detailed above, and with the purpose of obtaining good diagnostic accuracy at a lower computational cost, the number of states selected for this case is 4.

Model training
In order to determine the global reliability of the bearing degradation identification system, a 4-state HMC model was implemented using 24 coefficients, 24 Mel filters, and 32 centroids.
Each model was trained independently using 100 Monte Carlo iterations, where the database was split into training and testing sets with uniform probability, following a heuristic cross-validation approach. The results for each of the three databases are shown in Tables 1, 2, and 3. Results were divided into three training/testing splits: 50%, 60%, and 70%. Each row contains the results for a specific type of fault, including reliability and standard deviation.
The last row consolidates the total reliability of the process for fault identification and/or fault severity levels.
It should be highlighted that the Bearing Data Center database was considered twice: first as a database without severity levels (Table 1), and second as a database with severity levels (Table 2).

CONCLUSIONS
The analysis of the areas under the ROC curves in Figures 2 to 5 leads to the conclusion that, for both selected databases, four is the advisable number of states with which an HMC model for identifying degradation should be trained. From the training described above, it is noticeable in Tables 1 to 3 that variations in the percentage of the database used for training do not significantly alter the total reliability of degradation identification; the tables report variations no larger than 0.007. For example, in Table 2, the difference between the lowest total reliability (when 70% of the database is assigned to training) and the highest (when 60% is assigned to training) is only 0.9390 − 0.9320 = 0.0070. Likewise, Tables 1 and 3 yield variations of 0.0039 and 0.0008, respectively. These small variations provide evidence that the HMC models generalize well and are not prone to overtraining due to variations in the training split.
The Bearing Data Center database achieves better results when fault recognition includes severity levels: as seen in Table 2, the best result, a reliability of 0.9390 ± 0.1328, is achieved with 60% of the database used for training. Conversely, when the same database does not include severity levels, the best result, reported in Table 1, is a reliability of 0.9160 ± 0.0984 with 50% of the database used for training. Nonetheless, the division into degradation levels entails a high number of models to be trained.
An important observation arises when comparing the two models built on the Bearing Data Center database: the results without severity levels in Table 1 against the averages over the severity levels of the same fault types in Table 2. In both scenarios the Normal class reaches 100%. The Ball fault type yields a better result when degradation is not included, reporting 0.9736, versus a minimum average of 0.8833 when it is. The tendency is reversed for the Inner and Outer fault types, whose results without degradation, 0.8082 and 0.7978 respectively, fall below the minimum averages of 0.9133 for Inner and 0.9522 for Outer obtained when degradation is included. According to this comparison, the separation of classes and the intra-class separation appear to be affected by whether severity levels are included.
MFCC proved to be an efficient tool for obtaining linear or non-linear dynamic characteristics that include both time and frequency information.
A final remark is how clearly recognizable the normal operation of the equipment is: it can easily be separated from the induced fault types, as can be seen in Tables 1, 2, and 3. This characteristic is encouraging for continuing the study of the classification and identification of degradation and severity levels in this kind of asset. In Tables 1 and 3, which represent databases without severity levels, the best results were reported for the fault induced on the rolling elements (referred to as "Ball" in the tables), whereas the fault induced on the outer ring yielded the lowest results.
• A = [a_{ij}]: probability distribution of state transitions.
• B = [b_j(k)]: probability distribution of symbol observation in state j, where b_j(k) = P{v_k at t | q_t = S_j}.
• π = [π_i]: initial state distribution, where π_i = P{q_1 = S_i}.
• O: observation sequence, where O = {O_1, O_2, …, O_T}, with O_t ∈ V.
• λ: set of parameters of an HMC model, where λ = {A, B, π}.
With the assistance of an HMC, three basic problems can be addressed, namely:
• The evaluation problem: given the observation sequence and a model, efficiently estimate P{O|λ}, i.e., the probability of the observation sequence given the model.
• The decoding problem: given the observation sequence and a model, select the state sequence that best explains the observations.
• The training problem: given the observations, adjust the model parameters to maximize P{O|λ}.
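The evaluation problem, which underlies classification (each fault class gets its own trained model, and the model with the highest P{O|λ} wins), can be sketched with the scaled forward algorithm. This is a minimal sketch with hypothetical function names and toy parameter values:

```python
import numpy as np

def forward_log_likelihood(A, B, pi, obs):
    # Evaluation problem: log P(O | lambda) via the scaled forward pass.
    # A: (N, N) transitions, B: (N, M) emissions, pi: (N,) initial
    # distribution, obs: list of discrete symbol indices.
    alpha = pi * B[:, obs[0]]
    log_p = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for t in range(1, len(obs)):
        alpha = (alpha @ A) * B[:, obs[t]]   # forward recursion
        s = alpha.sum()
        log_p += np.log(s)                   # accumulate log of scale factor
        alpha /= s
    return log_p

def classify(models, obs):
    # models: dict label -> (A, B, pi); pick the most likely model
    return max(models, key=lambda m: forward_log_likelihood(*models[m], obs))
```

A symbol sequence produced by the codebook would be scored against every trained fault model this way, and the winning label is the diagnosis.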

Figure 3. ROC curves for HMC trained with data from the Bearing Data Center. The database includes fault states, but severity is not available.

Figure 4. ROC curves for HMC trained with data from the Mechanical Vibrations Laboratory of Universidad Tecnológica de Pereira.

Figure 5. Areas under the ROC curves for each trained HMC model, based on their respective databases.

Figure 5 shows three curves denoting the areas under the ROC curves for the respective databases. Note that the curve labeled "Bearing Data Center with severity" is the only one whose separation includes fault severity, whereas the other two ("Bearing Data Center without severity" and "Mechanical Vibrations Lab. - UTP") only include separation by fault states.

Table 3. Training for the 4-state HMC using the Mechanical Vibrations Laboratory database without degradation levels.

Table 1. Training for the 4-state HMC using the Bearing Data Center database without degradation levels.
(Total reliability: 0.9160 ± 0.0984 for the 50% split, 0.9121 ± 0.1009 for 60%, and 0.9127 ± 0.1069 for 70%.)

Table 2. Training for the 4-state HMC using the Bearing Data Center database with degradation levels. The better performance of the Bearing Data Center database when including the degradation levels can be explained by the fact that each set of signals has greater variability when the signals are not divided into severity levels.