Data Visualization, Machine Learning

Time Series Forecasting — Building and Deploying Models

Forecasting a hydraulic oil test rig’s condition over time using ensemble learning and neural networks. Part 1 / 2

Ranganath Venkataraman
Published in Towards AI · 13 min read · Jan 27, 2021


Photo by Tekton on Unsplash

TL;DR: I built models to forecast hydraulic rig conditions using several tools, including tsfresh, ensemble learning, and recurrent neural networks (RNNs). The models are deployed with Flask behind simple HTML interfaces.

As someone interested in machine learning's applications in the energy industry (see my other posts), I'm mindful of time series forecasting. Data gathered within refineries and petrochemical plants through a vast network of sensors usually carries a time stamp, courtesy of the data historian, and many analyses benefit from considering time.

As I plan a machine learning application within my company that uses time-stamped data, I’d like to first hone my skills using a publicly available dataset. I’ll use the hydraulic systems dataset from the UCI Machine Learning repository.

Since my purpose is practicing various techniques, I will only point to further steps for optimizing model performance and will not repeat the analysis for every label. I have also noted observations on handling data that isn't pre-packaged in convenient CSV files, and on how feature engineering and selection differ from the approaches used for datasets without a time component.

The second and final part of this article will explore AzureML’s forecasting tools and ARIMA.

My approach is outlined in the table of contents below:

  1. Define the business problem
  2. Data review
  3. Loading data
  4. Feature engineering — interplay with #5 and #6
  5. Feature selection — interplay with #4 and #6
  6. Developing and evaluating the model — interplay with #4 and #5
  7. Deploying models
  8. Conclusion and take-aways for the business

Here is the supporting GitHub repo.

Define the business problem

Hydraulic rig experiments give insight into oil performance and machinery conditions by simulating industrial conditions in a lab environment. By performing this simulation, we learn whether the oil is a suitable candidate to permit safe and sustainable operations — and what to expect in a large scale industrial setup.

By predicting key measures of system performance, we will give an industrial customer confidence that they can plan and predict for large scale operations which have greater safety and financial repercussions than lab trials.

Data review

The repository contains 16 text files, each representing data from different sensors measuring pressure, temperature, flow, vibration, and cooling efficiency. See Figure 1 below for an example of a temperature sensor’s file.

Figure 1: temperature sensor (TS) 1

There are 6 pressure sensors, 4 temperature sensors, 1 motor power sensor, 2 volumetric flow sensors, 2 sensors measuring cooling efficiency and cooling power, and 1 measuring overall efficiency.

Each text file has 2205 instances, i.e. rows (one per load cycle), and 60 columns, since these sensors capture one reading per second over each 60-second cycle.

One text file contains labels i.e. the targets we’re trying to predict. These are measures of hydraulic accumulation, rig stability, valve / cooler condition, and internal pump leakage. Figure 2 below has a snapshot of such a text file.

Figure 2: label

One luxury of the UCI ML repository that I haven't had in my real-world projects: the documentation courteously informs me that there is no missing data.
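If you want to verify that yourself once the files are loaded (using the same pandas read_csv call shown in the next section), a quick check is enough; the path below is illustrative:

import pandas as pd

# Illustrative sanity check on one sensor file (adjust the path to your download location).
ts1 = pd.read_csv("TS1.txt", delim_whitespace=True, header=None)
print("Missing values in TS1:", ts1.isna().sum().sum())  # expect 0, per the repository's documentation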

Loading data

Without a curated csv file to apply the Pandas library's read_csv method to, I need a different approach. I could separately open every file and then read every line in each file, e.g.:

import numpy as np
import pandas as pd

cooleff = []
ce = open('CE.txt', 'r')              # Open one sensor file
for line in ce:
    cooleff.append(line.split())      # Read each line of the opened file, splitting on whitespace
cooleff = np.reshape(cooleff, (2205, 60))  # Reshape the array
cooleffDF = pd.DataFrame(cooleff)          # Convert array to dataframe

This is at best inelegant and at worst slow. Therefore I will use Python's glob module to first create a list of the text files to import and then loop through those files with pandas' read_csv:

import numpy as np
import pandas as pd
import glob

locn = ".Downloads\\hyddata\\*.txt"  # Pattern matching all the text files in the path
files = glob.glob(locn)              # Compile a list of those text files
features = {}                        # Dictionary to hold each sensor's data
for file in files:
    df = pd.read_csv(file, delim_whitespace=True, header=None)
    # (this loop body is extended in the Feature engineering section below)

Figure 3: dataframe of features

We will later gather the contents of each df and label the data. However, I first want to take a step back and consider my strategy for this problem, since it influences how I treat the features.

Before considering my approach, let’s quickly import the labels data.

label = pd.read_csv(".Downloads\\hyddata\\profile.txt", delim_whitespace=True, header=None)
label.columns = ['cooler_condition', 'valve_condition', 'pump_leak', 'hydraulic_accumulator', 'stable_flag']

Figure 4: dataframe of labels

Now for that evaluation of my approach: Figure 5 below is a flowchart of solution strategies.

Figure 5: range of approaches

Again, to maximize my learning, I will use a diverse range of strategies.

Labels only: I can go straight to developing and evaluating a model. This approach uses labels from a prior time period to predict labels in the future, a justifiable approach given the subplots I generated below, which show that the target metrics follow consistent patterns over the cycles.

Figure 6: trends of labels

I will use a Long Short-Term Memory (LSTM) model, a type of recurrent neural network (these account for dependencies between values in a sequence). To implement this with Keras, I'll first need to convert the time series sequence into a dataset suitable for supervised learning.

I’m going to use the series_to_supervised() function, which creates a dataframe where prior values form the feature set for future predictions of that same value. This function is courtesy of Jason Brownlee’s Machine Learning Mastery.
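For readers who don't want to follow the link right away, a simplified sketch of the univariate case used here (prior values shifted forward to become features, with column naming omitted) looks roughly like this; treat it as an illustration rather than the original code:

import pandas as pd

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """Frame a time series as a supervised learning dataset:
    n_in lagged values as features, n_out forward values as targets."""
    df = pd.DataFrame(data)
    cols = []
    for i in range(n_in, 0, -1):   # input sequence (t-n_in, ..., t-1)
        cols.append(df.shift(i))
    for i in range(0, n_out):      # forecast sequence (t, ..., t+n_out-1)
        cols.append(df.shift(-i))
    agg = pd.concat(cols, axis=1)
    if dropnan:
        agg.dropna(inplace=True)   # drop rows made incomplete by the shifting
    return agg

With n_in=2 and n_out=1, each row therefore contains the two preceding values followed by the current value, which is the 3-column layout used below.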

I’ll use this approach to predict the hydraulic_accumulator label.

# Creating a dataframe for use in a supervised learning problem: the first 2 columns hold the
# 2 preceding values in the sequence, while the last column holds the value at that time, i.e. the label.
univar = series_to_supervised(label[['hydraulic_accumulator']], n_in=2, n_out=1)
univar = univar.values

# Creating training and testing sets by simply splitting the data.
train, test = univar[:1201, :], univar[1202:, :]

# Final step below splits the training and testing sets into features and labels.
# Remember that only the last column holds the label, i.e. the y.
xtrain, ytrain = train[:, 0:2], train[:, -1]
xtest, ytest = test[:, 0:2], test[:, -1]

Here's the before and after: as you can see, the bottom dataframe now has 3 columns, 2 of which hold the preceding values.

Figure 7: series-to-supervised

LSTMs require 3-D feature inputs of shape (samples, timesteps, features): the code below therefore reshapes xtrain and xtest to the same number of rows (1201 for xtrain), 1 timestep, and 2 features.

xtrain = xtrain.reshape((xtrain.shape[0], 1, xtrain.shape[1]))
xtest = xtest.reshape((xtest.shape[0], 1, xtest.shape[1]))

I can now use Keras and its Sequential class to create a layered model. For the LSTM layer, I picked 60 units and stuck with the default tanh activation function; both are opportunities for optimization, as are the number of epochs and the batch size used for training the model.

%pip install keras
%pip install tensorflow
import tensorflow
import keras
from keras.models import Sequential
from keras.layers import Activation, Dense, LSTM
model = Sequential()
model.add(LSTM(60, input_shape=(xtrain.shape[1], xtrain.shape[2])))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')
# I've used mean absolute error as the loss metric for the LSTM and the Adam optimizer to minimize it.

# Fit the model with the pre-prepared training and validation data.
history = model.fit(xtrain, ytrain, epochs=50, batch_size=20, validation_data=(xtest, ytest), verbose=0, shuffle=False)

# Plot the training and validation loss.
from matplotlib import pyplot
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()

This article will not delve into the details of selecting hyperparameter values. I observe below that the model's performance improves over the 50 epochs of training, with a final error of ~10% (a validation MAE of 10.95 divided by the average of ytest).

Figure 8: LSTM for hydraulic accumulator, no features
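For reference, that ~10% figure is just the final validation MAE divided by the mean of the test targets; something along these lines reproduces it (variable names follow the code above):

import numpy as np

# Relative error of the labels-only LSTM: final validation MAE over the mean test target.
final_val_mae = history.history['val_loss'][-1]   # ~10.95 in my run
print("Relative error: {:.1%}".format(final_val_mae / np.mean(ytest)))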

Here are the trends for the stability flag (a categorical label) and pump leakage, which the UCI repository documentation describes as a tough metric to predict. The latter has a mean absolute error of 0.04, about 6% of the mean pump leakage value.

Figure 9: LSTM for stability flag and pump leakage, no other features

Feature engineering

Can feature engineering on the sensor data improve performance? To answer this question, let's first understand that feature engineering for time series focuses on extracting information about trends. A ready-made Python package for this is tsfresh, whose extract_features function calculates a comprehensive set of features. This function requires a dataframe with a clearly specified column of id numbers, one id for each time series, and a second column of sort values to order the time stamps within each series.
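To make the expected input concrete, here is a toy example in the shape tsfresh wants: an id column identifying each series (one per cycle), a sort column ordering the readings within it, and one column per sensor. The numbers below are made up purely for illustration:

import pandas as pd
from tsfresh import extract_features

# Toy long-format input: 2 cycles of 3 seconds each, one made-up sensor column.
toy = pd.DataFrame({
    "cycle":   [1, 1, 1, 2, 2, 2],     # id column: which time series a row belongs to
    "seconds": [0, 1, 2, 0, 1, 2],     # sort column: order within each series
    "TS1":     [35.2, 35.4, 35.9, 36.1, 36.0, 35.8],
})
toy_features = extract_features(toy, column_id="cycle", column_sort="seconds")
print(toy_features.shape)  # one row per cycle, hundreds of engineered columns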

However Figure 3 above isn’t sorted in this way. Even when I collect all features into a single dataframe, it will still need modifying. Continuing from the code snippet that produced the output in Figure 3 …

df.index.name = "cycle"
temp_df_transposed = df.T                     # Transpose so that time runs down the rows
temp_df_transposed.index.name = "time"
temp_df_transposed.reset_index(inplace=True)  # Setting index names for all rows and columns

# Adjusting the column names to add a prefix of "cycle". This prefix acts as a stub to guide
# the reorientation of the dataframe when using the pandas method wide_to_long below.
string = ' cycle'.join(str(e) for e in list(temp_df_transposed.columns))
temp_df_transposed.columns = string.split(" ")

temp_df_long = pd.wide_to_long(temp_df_transposed.iloc[1:, :], stubnames='cycle', i=['time'], j='c')
temp_df_long.reset_index(inplace=True)

Perhaps the best way to understand the impact of a wide_to_long application is looking at the before and after. As illustrated by Figure 10 below, the 60x2206 matrix with time and cycle on different axes becomes a matrix with both variables on the same axis. The total number of values is now 60 x 2206 minus 60 (the index column) = 132,300, all in one column.

Figure 10: impact of wide_to_long
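If the figure alone doesn't make it click, here is a stripped-down wide_to_long example with made-up data; the "cycle" prefix plays the same stub role as in the code above:

import pandas as pd

# Wide layout: one row per time step, one "cycleN" column per cycle (values are made up).
wide = pd.DataFrame({
    "time":   [0, 1],
    "cycle0": [10.0, 11.0],
    "cycle1": [12.0, 13.0],
    "cycle2": [14.0, 15.0],
})
# Long layout: the "cycleN" suffix becomes a new key column "c", and all values land in one column.
long = pd.wide_to_long(wide, stubnames="cycle", i=["time"], j="c").reset_index()
print(long)  # 2 time steps x 3 cycles = 6 rows, single "cycle" value column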

Each of these long dataframes is stored in the 'features' dictionary created earlier; the collection then needs to be merged back into a single dataframe.

features[file[9:-4]] = temp_df_long   # Store each sensor's long dataframe, keyed by file name (inside the loop over files)

for key in list(features.keys()):
    features[key].columns = ['seconds', 'cycle', key]

dfs = [features['...\\features\\CP'],
       features['..features\\CE'],
       .....
       features['rangy\\Downloads\\Hydraulics-main\\features\\VS1']]

from functools import reduce
features_join = reduce(lambda left, right: pd.merge(left, right, on=['seconds', 'cycle']), dfs)
features_join.head()

The last segment of code above uses Python's reduce function to apply a merge across the created dfs so that we finally have a column identifying each time series and another column sorting each series. The former is 'cycle' and the latter is 'seconds', as seen in Figure 11 below.

Figure 11: dataframe for use in extracting features
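For anyone unfamiliar with reduce, the call above simply folds pd.merge across the list pairwise; a minimal illustration with made-up frames:

from functools import reduce
import pandas as pd

# reduce(merge, [a, b, c]) is equivalent to merge(merge(a, b), c).
a = pd.DataFrame({"cycle": [1, 2], "seconds": [0, 0], "CE": [20.1, 19.8]})
b = pd.DataFrame({"cycle": [1, 2], "seconds": [0, 0], "CP": [1.1, 1.2]})
c = pd.DataFrame({"cycle": [1, 2], "seconds": [0, 0], "TS1": [35.2, 36.0]})
merged = reduce(lambda left, right: pd.merge(left, right, on=["cycle", "seconds"]), [a, b, c])
print(merged)  # one row per (cycle, seconds) pair with CE, CP and TS1 side by side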

We can now use tsfresh’s extract_features method to produce a dataframe that is the result of feature engineering and ready for use in modeling. See Figure 12 for a snapshot of this dataframe.

%pip install tsfresh
from tsfresh import extract_features
from tsfresh.utilities.dataframe_functions import impute

# Automatic feature extraction using the tsfresh package
extracted_features = extract_features(features_join, column_id="cycle", column_sort="seconds")
impute(extracted_features)

Figure 12: time series dataframe with feature engineering

Here’s an idea of the types of statistical measures resulting from feature engineering using tsfresh:

extracted_features.columns

Index(['.\features\CE__variance_larger_than_standard_deviation',
'.\features\CE__has_duplicate_max',
'.\features\CE__has_duplicate_min',
...
'.\features\CE__mean_abs_change',
'.\features\CE__mean_change',
'.\features\CE__mean_second_derivative_central',
'.\features\CE__median',
...
'.\features\VS1__fourier_entropy__bins_2',
'.\hyddata\features\VS1__fourier_entropy__bins_3',
'.\features\VS1__fourier_entropy__bins_100',
'.\features\VS1__permutation_entropy__dimension_3__tau_1',
...
'.\features\VS1__permutation_entropy__dimension_7__tau_1']
dtype='object', length=13243)

Does feature engineering using tsfresh offer an obvious improvement over the approach that only used labels? First I have to set up the Keras model similarly to the approach above; only now x represents the extracted features, not the two preceding entries in the time sequence.

See model training and testing performance in Figure 13 below.

Developing and Evaluating the model

quadvar_y = label['hydraulic_accumulator']
quadvar_x = extracted_features.values
xtrain_quad,ytrain_quad = quadvar_x[:1201,:], quadvar_y[:1201,]
xtest_quad,ytest_quad = quadvar_x[1202:,:],quadvar_y[1202:,]
xtrain_quad = xtrain_quad.reshape((xtrain_quad.shape[0], 1, xtrain_quad.shape[1]))
xtest_quad = xtest_quad.reshape((xtest_quad.shape[0], 1, xtest_quad.shape[1]))
quadmodel = Sequential()
quadmodel.add(LSTM(60, input_shape=(xtrain_quad.shape[1], xtrain_quad.shape[2])))
quadmodel.add(Dense(5))
quadmodel.add(Dense(1))
quadmodel.compile(loss='mae', optimizer='adam')
# Fit the model
quadhistory = quadmodel.fit(xtrain_quad, ytrain_quad, epochs=50, batch_size=20, validation_data=(xtest_quad, ytest_quad), verbose=0, shuffle=False)

Figure 13: LSTM for hydraulic accumulator, with features

There is no measurable gain from using all the features; perhaps there is some incremental benefit to be had from a sub-selection of them. It also appears that we could achieve optimal scores with less training, which is good to know going forward.
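One way to act on that last observation, sketched here rather than something used in the runs above, is Keras's EarlyStopping callback, which halts training once the validation loss stops improving:

from keras.callbacks import EarlyStopping

# Illustrative only: stop when validation loss hasn't improved for 5 epochs,
# keeping the best weights seen so far.
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
quadhistory = quadmodel.fit(xtrain_quad, ytrain_quad, epochs=50, batch_size=20,
                            validation_data=(xtest_quad, ytest_quad),
                            verbose=0, shuffle=False, callbacks=[early_stop])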

Feature selection

I'm now going to take another step and assess its benefit using ensemble learning; the performance I note below should also carry over to an LSTM.

That step is finding the relevant features for our given target using tsfresh’s select_features function. The helpful documentation details how features are selected for their relevance to the selected target / label.

from tsfresh import select_features

global features_filtered_accum
features_filtered_accum = select_features(extracted_features, label['hydraulic_accumulator'])
global features_filtered_leak
features_filtered_leak = select_features(extracted_features, label['pump_leak'])

With the selected features, I next use cross-validation to gauge the impact of an ensemble algorithm in predicting hydraulic accumulation and pump leakage. Now I use the extracted and selected features.

import xgboost
from xgboost import XGBClassifier, XGBRegressor
xgr = XGBRegressor()
xgc = XGBClassifier()
features_filtered_accum = features_filtered_accum.values
label['hydraulic_accumulator'] = label['hydraulic_accumulator'].values
features_filtered_leak = features_filtered_leak.values
label['pump_leak'] = label['pump_leak'].values
from sklearn.model_selection import KFold, cross_validate
cv = KFold(n_splits=7,shuffle=True)
cross_validate(xgr, features_filtered_accum, label['hydraulic_accumulator'], cv=cv, scoring='neg_mean_absolute_error')
cross_validate(xgr, features_filtered_leak, label['pump_leak'], cv=cv, scoring='neg_mean_absolute_error')

Developing and Evaluating the model

The engineered and selected features make a significant improvement: the mean absolute error below (test_score) is orders of magnitude lower than what the LSTM achieved without feature selection!

Figure 14: ensemble learning with feature engineering and feature selection
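cross_validate returns the per-fold scores in a dictionary, with error metrics negated by scikit-learn's convention; to condense them into the single MAE figure quoted above, something like this works:

import numpy as np

# Summarize the per-fold scores into one mean absolute error per target.
cv_accum = cross_validate(xgr, features_filtered_accum, label['hydraulic_accumulator'],
                          cv=cv, scoring='neg_mean_absolute_error')
cv_leak = cross_validate(xgr, features_filtered_leak, label['pump_leak'],
                         cv=cv, scoring='neg_mean_absolute_error')
print("Hydraulic accumulator MAE:", -np.mean(cv_accum['test_score']))
print("Pump leakage MAE:", -np.mean(cv_leak['test_score']))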

Deploying models

Consulting tsfresh’s resources on creating a scikit-learn pipeline with their functions gave me the necessary insight for this step.

The code below creates scikit-learn pipelines for two different labels — hydraulic accumulator and stability flag — and then dumps the pipelines into a saved model.

from tsfresh.transformers import RelevantFeatureAugmenter
from sklearn.pipeline import Pipeline
pipeline_flag = Pipeline([('augmenter', RelevantFeatureAugmenter(column_id="cycle", column_sort="seconds")),
                          ('xgc', XGBClassifier())])
pipeline_accum = Pipeline([('augmenter', RelevantFeatureAugmenter(column_id="cycle", column_sort="seconds")),
                           ('xgr', XGBRegressor())])
y_stable_flag = label['stable_flag']
y_hydraulic_accumulator = label['hydraulic_accumulator']
X = pd.DataFrame(index = y_stable_flag.index)
pipeline_flag.set_params(augmenter__timeseries_container=features_join)
pipeline_accum.set_params(augmenter__timeseries_container=features_join)
pipeline_flag.fit(X,y_stable_flag)
pipeline_accum.fit(X,y_hydraulic_accumulator)
import pickle
pickle.dump(pipeline_accum,open('pipeline_accum.pkl','wb'))
pickle.dump(pipeline_flag,open('pipeline_flag.pkl','wb'))

My models are now saved and I can move to deployment. For now I'll stick with local deployment on my computer using Flask; the text below goes through the high-level steps, and there are many good resources out there for more in-depth coverage.

  1. Build an interface for users to input a timestamp for predicting stability flag and hydraulic accumulation. See HTML code below — I was going for functionality, so there is opportunity to make things look nicer.

My #comments below are for reader benefit and not part of the HTML script. Figure 15 after this script has a snapshot of what the user sees.

<html><body><h3>Prediction of Stability Flag and Hydraulic Accumulation </h3>
# Comment: this is the title of the webpage.
<div><form action="/predict" method="POST"><label for="timstmp">Cycle number</label>
# Comment: this is the label and variable that users will be asked to enter
<input type="number" step="1" id="timstmp" name="timstmp">
# Comment: this is the actual box into which a user enters the above variable.
<br><input type="submit" value="Submit">
# Comment: creating a submit button so the user can run the query.
</form></div></body></html>
Figure 15: hyindex.html

2. Build an interface that gives users the results of the machine learning model. In this case, simply showing the predictions.

<!doctype html><html><body><h1> {{ prediction_text}}</h1></body></html>

If I type a particular cycle into the index page, here's the outcome when I click "Submit" (again, I'm not aiming for aesthetics in this article, only functionality).

Figure 16: hypredict.html

How did the Submit button get me from one page to the other? Through the use of:

3. A script written to load the saved models with pickle and then draw in the inputs specified via the hyindex page. The loaded models make predictions using those inputs, and the predictions are returned to the hypredict page. This time the #comments are valid Python and are included in the script for reader benefit.

import pandas as pd, numpy as np
import pickle
import flask
from flask import Flask, request, jsonify, render_template

app = Flask(__name__)  # Creating an instance of the Flask class for the web app

@app.route('/')
def Hm():
    return render_template('hyindex.html')
# Code above routes any user from localhost:5000 to the hyindex.html webpage through the Hm() function.

# Loading the models that were saved earlier.
model_accum = pickle.load(open('pipeline_accum.pkl', 'rb'))
model_flag = pickle.load(open('pipeline_flag.pkl', 'rb'))

# Code below reads the number specified in the request form and converts it into a dataframe
# for use in the saved models' predict method. The dataframe's index must be set equal to the
# values for the code to run.
@app.route('/predict', methods=['POST'])
def predict():
    inputt = [int(x) for x in request.form.values()]
    xtest = np.array(inputt)
    xtest_df = pd.DataFrame(xtest)
    xtest_df.set_index(xtest, inplace=True)
    Xtest_df = pd.DataFrame(index=xtest_df.index)
    prediction_accum = model_accum.predict(Xtest_df)
    prediction_flag = model_flag.predict(Xtest_df)
    # Having made my predictions above, I'll now send these over to hypredict with the variable name prediction_text.
    return render_template('hypredict.html',
                           prediction_text='Hydraulic accumulation prediction is ' + format(prediction_accum) +
                                           ' and stability flag prediction is ' + format(prediction_flag))

# Python assigns the name "__main__" to this script when it is run directly; hence below, we tell the app to run in that case.
if __name__ == "__main__":
    app.run(debug=True)

Therefore, to actually run this deployed model, I simply run the script above, which I called hyapp.py, from the terminal as seen in Figure 17 below.

Please note that the index and predict HTML files should be in a folder called templates.

Figure 17: running the deployed model
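Once the app is running, you can also exercise the /predict route without the browser; a quick sanity check with the requests library (assuming Flask's default host and port) might look like:

import requests

# Illustrative POST to the locally running app; "timstmp" matches the form field in hyindex.html,
# and the cycle number 42 is arbitrary.
response = requests.post("http://127.0.0.1:5000/predict", data={"timstmp": 42})
print(response.text)  # the rendered hypredict.html containing both predictions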

Conclusion and take-aways for the business

By developing and deploying this machine learning model, an industrial setup can predict the conditions of their hydraulic system using experiment data for various oils. This work can help them avoid repeating experiments, and also drive safe and efficient large-scale construction.

This refresher sharpens my toolkit as I tackle a new venture at work that uses time series forecasting.

The next and final article in this series will apply ARIMA and AzureML to this problem. I welcome any comments.


Chemical Engineer, Data and Machine Learning Enthusiast who’s exploring the energy industry with new tools