Preparing data for an LSTM network

In this post I’m going to work through preparing data for LSTM networks, particularly data with several features. LSTM networks can deal with the concept of ‘time’, so each element of our sequence (each step in the time series) will have several features.

With this example I’ve included a csv file that can be downloaded here.

If this file is opened in Notepad++ it looks like the following.

We have 5 points in time that we want to train our LSTM on, to hopefully make predictions on the following unit of time. (Obviously we would need our training data to be much bigger than this, but I’m just using a simple example to understand the ‘shaping’ of data.)

To put this data into context, it could be considered like this. Imagine we’re training a robot to walk, and the values represent various stepper motors on that robot.

What we are trying to do is predict the best motor settings for time 0.

i.e. the sequence is 5->4->3->2->1->??

We can see there are 8 columns: the first is just the ‘step number’ and won’t be used in training the LSTM network, and the remaining 7 represent the 7 motors on the robot.

The very first thing we want to do to our data is remove the first column.

To do this easily, we can use pandas’ iloc indexer.

To explain this, let’s do some code.

# import required to process csv files with pandas
import pandas as pd
# import numpy to create arrays
import numpy as np   

# import the multi feature csv
multifeature_csv = pd.read_csv(r'C:\Users\james\Anaconda3JamesData\AI_Multifeature_LSTM_series.csv', header=None) 

# display the contents of the csv file with NO processing
myData_processed = multifeature_csv.iloc[:,].values
print (myData_processed)

# process the data, take all data except the first column
myData_processed = multifeature_csv.iloc[:, 1:8].values

# this is added simply to put a space 
# in-between the two print outputs for clarity
print (" ") 

print (myData_processed)

Pandas iloc is the key to doing this easily (there are many other ways it can be done though). More information about Pandas iloc can be found here.

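So now we’ve managed to remove the step number of each row (the first column) by using ‘myData_processed = multifeature_csv.iloc[:, 1:8].values’ in the example above. The ‘:’ before the comma keeps every row, and ‘1:8’ after the comma keeps columns 1 to 7 (the end of the slice is not included).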
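To see the column slicing in isolation, here is a minimal sketch that doesn't need the csv file. The `toy` DataFrame and its values are made up purely for illustration; only the layout (a step number followed by 7 values) mirrors the real file.

```python
import pandas as pd

# Hypothetical stand-in for the csv: column 0 is the step number,
# columns 1-7 are made-up motor values.
toy = pd.DataFrame([
    [5, 10, 11, 12, 13, 14, 15, 16],
    [4, 20, 21, 22, 23, 24, 25, 26],
    [3, 30, 31, 32, 33, 34, 35, 36],
])

# iloc[rows, columns]: ':' keeps every row, '1:' keeps column 1 onwards
motors = toy.iloc[:, 1:].values

print(toy.shape)     # (3, 8)
print(motors.shape)  # (3, 7)
print(motors[0])     # [10 11 12 13 14 15 16]
```

Writing ‘1:’ instead of ‘1:8’ gives the same result here, and keeps working if more feature columns are added later.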

An important point here: with pandas read_csv, we must add the ‘, header=None’ argument as above. This tells pandas that row 1 of the dataset is not a series of names for the columns (as is our case here). If we do not add this argument, pandas will treat the first row as column names, so it will be missing from the data.
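The effect of ‘header=None’ is easy to check with a small in-memory csv. This is only a sketch: the `csv_text` string is made up, and `io.StringIO` is used so that no file on disk is needed.

```python
import io
import pandas as pd

# Hypothetical 3-row csv with no header row
csv_text = "5,1.0,2.0\n4,3.0,4.0\n3,5.0,6.0\n"

# Default behaviour: the first row is consumed as column names
with_header_guess = pd.read_csv(io.StringIO(csv_text))

# header=None: every row is treated as data
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(with_header_guess.shape)  # (2, 3)
print(no_header.shape)          # (3, 3)
```

With only 5 rows in our example file, losing one of them to the header would remove a whole time step from the sequence.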

Let’s now take a look at the ‘shape’ of the data at the moment.

# myData_processed is referring to the data that has been
# processed (the 1st column has been removed) as above
print (myData_processed.shape[0]) # x-axis
print (myData_processed.shape[1]) # y-axis
print (myData_processed.shape)
5
7
(5, 7)

We can see our data currently has:
– an x-axis of 5 which is the number of rows
– a y-axis of 7, which is the number of features in each row

Finally we see the shape is (x, y) = (5, 7)

The (5, 7) shows us we have a 2 dimensional array at the moment.
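NumPy can also report the number of dimensions directly. A small sketch, using a placeholder array of zeros (`placeholder` is a made-up name) in place of the real data:

```python
import numpy as np

# Placeholder with the same shape as our processed data
placeholder = np.zeros((5, 7))

print(placeholder.shape)       # (5, 7)
print(len(placeholder.shape))  # 2
print(placeholder.ndim)        # 2
```

The ‘ndim’ attribute is simply the length of the shape tuple, so a shape with 2 values means a 2 dimensional array.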

So how do we create a 3 dimensional array?

reshape()

By using the reshape() function we can change the dimensions of the array above with the following code.

# the data is 1 sample, 5 time steps, and 7 features
data = myData_processed.reshape(1, 5, 7)
print (data)
[[[124.709      124.926      124.598      124.693       11.64208106  20.94393565  18.99666547]
  [124.694      124.829      124.651      124.717       10.79624784  19.99622329  18.13706702]
  [124.707      124.854      124.62       124.658        9.98039535  19.43243205  17.84106688]
  [124.657      124.794      124.582      124.789        8.90443606  18.34251187  17.84574804]
  [124.79       124.888      124.371      124.516        8.58611711  18.32177354  20.80970792]]]

This has now created a 3 dimensional array from the 2 dimensional data. We can verify this with the following.

data.shape
 (1, 5, 7) 

We have verified that our data is now 1 sample, with 5 time steps, with 7 features. We can see the shape attribute returns 3 values, which means our data is now in a 3-dimensional format.
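As a final check, here is a sketch showing that reshape() only re-labels the dimensions and leaves the values and their order untouched. The `flat` array is made-up data (the numbers 0 to 34), and reading the sizes from the shape instead of hard-coding 5 and 7 is my own variation, not something the code above depends on.

```python
import numpy as np

# Hypothetical 2D data: 5 time steps x 7 features
flat = np.arange(35, dtype=float).reshape(5, 7)

# Read the sizes from the array rather than hard-coding them
n_steps, n_features = flat.shape
data3d = flat.reshape(1, n_steps, n_features)

print(data3d.shape)                     # (1, 5, 7)
print(np.array_equal(data3d[0], flat))  # True

# The total number of values must stay the same (5 * 7 = 35),
# otherwise NumPy raises a ValueError.
try:
    flat.reshape(1, 5, 8)
except ValueError as err:
    print("reshape failed:", err)
```

The single sample at index 0 is exactly the original 2 dimensional array, so nothing has been lost or reordered.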
