ERESULT TEAM DATA ENGINEERING

Fault Prediction with stateful LSTM

Maurizio Vivarelli
6 min read · Jan 10, 2022

A discussion in the context of time series with sequences of different lengths

With the goal of developing a real-world Fault Prediction model, I decided to test some concepts on the C-MAPSS dataset (Turbofan Engine).

In particular, this post will discuss and test the use of the stateful flag of the Keras LSTM layer, as well as the shuffle flag and the batch size.

The results will be compared with those obtained by Deepak Honakeri in his post.

The data

This is a sample of data taken from the C-MAPSS dataset:

where each id is an engine and each cycle is a data item (an array of features collected at each timestep).
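
To make the layout concrete, here is a minimal sketch of how such a table could be loaded with pandas; the file name and column names are assumptions based on the usual C-MAPSS text format (engine id, cycle, three operational settings, 21 sensors), not necessarily the ones used later in the article.

    import pandas as pd

    # Usual C-MAPSS text layout: engine id, cycle, 3 operational settings
    # and 21 sensor readings; file name and column names are assumptions.
    cols = (["id", "cycle"]
            + ["setting" + str(i) for i in range(1, 4)]
            + ["s" + str(i) for i in range(1, 22)])

    df = pd.read_csv("train_FD001.txt", sep=r"\s+", header=None, names=cols)
    print(df.head())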

Each engine's life is different, so the sequences have different lengths (the graph below shows engine life in days for each engine id):

If you imagine immediately replacing each engine after its fault, you can look at the data as a single flow:

where:

  • each sequence is an engine;
  • each sequence is connected to the next one (that is what I will have in my real-world fault prediction task);
  • the data are then split, independently of the sequence boundaries, into blocks of “timestep size” data items to create what is called a “sample” (see the sketch after this list);
  • a sample can then be fed to the RNN;
  • for each sample, backpropagation will compute the weight deltas;
  • deltas from all the samples in a batch are then summed up to compute the values that will change the network weights.
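
As a sketch of the splitting step above (function and variable names, and the decision to drop the leftover tail, are my own assumptions, not the article's exact preprocessing):

    import numpy as np

    def make_samples(flow, timesteps):
        """Cut the continuous flow of feature vectors into consecutive,
        non-overlapping samples of `timesteps` items each; the tail that
        does not fill a whole sample is simply dropped in this sketch."""
        n = (len(flow) // timesteps) * timesteps
        return flow[:n].reshape(-1, timesteps, flow.shape[1])

    # flow: (total_cycles, n_features) array built by concatenating all engines
    # samples = make_samples(flow, timesteps=50)   # (n_samples, 50, n_features)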

The context

Simply put, the ML system described in this article receives an array of values at every time interval (cycle) and at every time interval responds with one of the following (a labeling sketch follows the list):

  • the engine is working normally (more than 45 cycles to fault);
  • the engine is in warning state (between 45 and 15 cycles to fault);
  • the engine is in alarm state (less than 15 cycles to fault).
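
A minimal sketch of this labeling, assuming the remaining cycles to fault are known for each row of the training data (the 0/1/2 class encoding is my own choice for illustration):

    def label_from_remaining_cycles(remaining):
        """0 = normal (more than 45 cycles to fault), 1 = warning (15 to 45),
        2 = alarm (less than 15); thresholds as described above."""
        if remaining > 45:
            return 0
        elif remaining >= 15:
            return 1
        return 2

    # per engine: remaining cycles at step t = last cycle - current cycle
    # y = [label_from_remaining_cycles(r) for r in remaining_per_cycle]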

For more details you can read the article above or the original one by Marco Cerliani here.

Stateless networks

With a stateless network, the state is discarded after processing every sample of data, say for example 50 data items. While these 50 items are processed the state builds up; at the end the network outputs the result, but after that the state is discarded. So if you want to address the problem with a stateless network, you need to estimate the memory length (the number of data items) the network needs to do its prediction job. After this estimate you can define a timestep size equal to or greater than that memory length. You can then process the dataset with a sliding-window technique, so that at every cycle the network is fed an entire window of timestep-size data items:

and let the network learn the state each time from scratch.
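
A sketch of the sliding-window framing used by the stateless approach (one window per prediction, shifted by one cycle each time); function and variable names are placeholders:

    import numpy as np

    def sliding_windows(flow, timesteps):
        """Build one window of `timesteps` consecutive items for every
        prediction point, so each data item is re-processed up to
        `timesteps` times."""
        windows = [flow[i:i + timesteps]
                   for i in range(len(flow) - timesteps + 1)]
        return np.stack(windows)   # (n_windows, timesteps, n_features)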

The Stateful flag

According to Keras documentation:

  • stateful: Boolean (default False). If True, the last state for each sample at index i in a batch will be used as initial state for the sample of index i in the following batch.

Based on the description above, a stateful network is able to build a state (memory) that survives (is not reset at) the timestep-size boundary.

This way you can ask the network for a prediction each time you feed it a single data item.

This way you also let the network learn/decide how long a piece of information should be maintained, rather than relying on an estimate of the required memory length (which could even be greater than the maximum timestep size reasonably available).

Finally, this way you process approximately “timestep size” times less data, since each data item is fed to the network once instead of appearing in every overlapping window.
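
One practical detail worth noting: in Keras a stateful recurrent layer needs a fixed batch size declared up front, typically through batch_input_shape. A minimal sketch (the sizes below are placeholders):

    from tensorflow.keras.layers import LSTM
    from tensorflow.keras.models import Sequential

    # A stateful layer needs the full batch_input_shape =
    # (batch_size, timesteps, n_features) fixed up front, so that sample i
    # of one batch can be matched with sample i of the following batch.
    model = Sequential([
        LSTM(64, stateful=True, return_sequences=True,
             batch_input_shape=(1, 50, 24)),
    ])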

The Shuffle flag

According to Keras documentation:

  • shuffle: Boolean (whether to shuffle the training data before each epoch) or str (for ‘batch’).

If your sequences are long enough, you will probably have to choose a smaller timestep size. This is because the timestep size is the number of iterations the layer performs before backpropagation, so it cannot be arbitrarily large. For this reason a sequence will be made of a number of samples, each composed of timestep-size data items (and probably not an exact integer number of them). So, if you want to maintain the time ordering of events, you cannot shuffle the samples in the dataset.
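
In Keras this simply means passing shuffle=False to fit; a sketch, assuming model is a compiled stateful network and X_train / y_train hold the framed samples and labels from the previous steps:

    # Keep the samples in their original time order during training.
    history = model.fit(X_train, y_train,
                        epochs=15,
                        batch_size=1,    # see the next section on batch size
                        shuffle=False)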

The Batch Size

According to Keras documentation:

  • batch_size: Integer or None. Number of samples per gradient update. If unspecified, batch_size will default to 32.
  • mask: Binary tensor of shape [batch, timesteps] indicating whether a given timestep should be masked (optional, defaults to None). An individual True entry indicates that the corresponding timestep should be utilized, while a False entry indicates that the corresponding timestep should be ignored.

That said, since with the stateful flag enabled “the last state for each sample at index i in a batch will be used as initial state for the sample of index i in the following batch”, you can still take advantage of a batch size greater than one.
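
With a batch size greater than one, the samples must be arranged so that slot i of each batch is the temporal continuation of slot i of the previous batch. One way to do that (a sketch, not the article's code) is to cut the flow into batch_size parallel sub-streams and interleave them:

    import numpy as np

    def arrange_for_stateful(samples, batch_size):
        """Reorder (n_samples, timesteps, n_features) samples so that, with
        shuffle=False, sample i of batch k+1 is the temporal continuation
        of sample i of batch k."""
        n_batches = len(samples) // batch_size
        samples = samples[:n_batches * batch_size]
        streams = samples.reshape(batch_size, n_batches, *samples.shape[1:])
        return np.swapaxes(streams, 0, 1).reshape(-1, *samples.shape[1:])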

When to reset the state?

If you apply a system of this kind to production-line fault prediction, you can try to put in place a method to reset the state during training and during prediction, even if the sequences have different lengths. But why not let the network learn that, after a fault, the memory state is no longer worth maintaining?

For this reason the last choice I made was to never reset the state, just as a human would do if physically in charge of monitoring the production line.
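
For completeness, if you did want to reset the memory at some point (for example at the end of each epoch, or at each fault), a small callback would be enough; a sketch assuming the tf.keras API, and not used in this article since the state is deliberately never reset:

    from tensorflow.keras.callbacks import Callback

    class ResetStatesAtEpochEnd(Callback):
        """Clears the memory of all stateful layers at the end of each epoch;
        not used here, since the state is intentionally never reset."""
        def on_epoch_end(self, epoch, logs=None):
            self.model.reset_states()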

The model

Keeping in mind all of the above, this is the model:
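
The original post shows the model as an image; a minimal Keras sketch consistent with the description (stateful=True on every recurrent layer, return_sequences=True up to and including the last hidden layer) could look like this. Layer sizes, the number of layers and n_features are placeholders, not the article's values:

    from tensorflow.keras.layers import LSTM, Dense, TimeDistributed
    from tensorflow.keras.models import Sequential

    batch_size, timesteps, n_features, n_classes = 1, 50, 24, 3

    model = Sequential([
        LSTM(64, stateful=True, return_sequences=True,
             batch_input_shape=(batch_size, timesteps, n_features)),
        LSTM(32, stateful=True, return_sequences=True),
        # one normal/warning/alarm prediction for every timestep
        TimeDistributed(Dense(n_classes, activation="softmax")),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])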

Other than the stateful flag set to True on every layer, it is worth noting the return_sequences flag set to True on the last hidden layer. This changes the output from a single prediction per sample to one prediction per timestep (the output shape goes from (batch_size, units) to (batch_size, timesteps, units)).

And this also explains the difference between the non-stateful and stateful approaches.

In the first one you feed the network a timestep-size block of data and get one result. Then, with the sliding-window technique, you go forward by one step and feed another timestep-size block of data, getting another result, practically rebuilding the network state from the ground up each time.

In the second one you feed the network a single data item and get one result, extending but not rebuilding the network memory state.
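
A sketch of the stateful prediction loop, assuming a copy of the trained model rebuilt with batch_input_shape=(1, 1, n_features) and flow_test holding one feature vector per cycle (both names are placeholders):

    import numpy as np

    # x_t is the feature vector of a single cycle, fed one at a time;
    # the internal LSTM state carries over from one cycle to the next.
    for x_t in flow_test:
        p = model.predict(x_t.reshape(1, 1, -1), verbose=0)
        class_t = np.argmax(p[0, -1])    # 0 normal, 1 warning, 2 alarm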

The results

Trained over 15 epochs, the model reaches an accuracy of about 0.95:

Training time on a single GPU was less than a minute.

This is the accuracy graph:

and this is the confusion matrix over the test set:

With a weighted loss function:

the model paid a little in terms of accuracy, about 0.92:

but the confusion matrix looks a lot better:
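
The exact weighted loss is shown only as an image in the original post; a common way to obtain a similar effect is a categorical cross-entropy with per-class weights, for example (the weight values here are placeholders, not the article's):

    import tensorflow as tf

    def weighted_categorical_crossentropy(class_weights):
        """Cross-entropy where the error on each class is scaled by a weight,
        e.g. to penalize missed 'alarm' cycles more heavily."""
        w = tf.constant(class_weights, dtype=tf.float32)

        def loss(y_true, y_pred):
            y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0)
            weights = tf.reduce_sum(y_true * w, axis=-1)   # weight of the true class
            ce = -tf.reduce_sum(y_true * tf.math.log(y_pred), axis=-1)
            return weights * ce

        return loss

    # model.compile(optimizer="adam",
    #               loss=weighted_categorical_crossentropy([1.0, 2.0, 4.0]),
    #               metrics=["accuracy"])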

If you are interested, the code is in two Jupyter notebooks that you can find on GitHub.
