Data preparation for time forecasting using LSTM

Hi, I am trying to solve a time forecasting problem using LSTM in Matlab. The questions still remain after going through
(Q1) The problem I am facing is in the data preparation stage. Specifically, I have 5000 samples of time responses of the same response quantity, and the number of time steps is 1001. I want to train on 90% of the data (5000 x 901) and keep 10% for prediction (5000 x 100). At present, I am storing the complete data as a matrix:
data is [5000 x 1001]
dataTrain = data(:,1:901);
dataTest = data(:,902:end);
Then, standardizing the data
XTrain = dataTrainStandardized(:,1:end-1);
YTrain = dataTrainStandardized(:,2:end);
XTest = dataTestStandardized(:,1:end-1);
Now, what should be the LSTM network architecture as per my data set and problem definition?
numFeatures = ? % I guess number of features should be 1 as it is univariate.
numResponses = ? % I guess this should be the number of training time steps (=901)
However, this gives an error “The training sequences are of feature dimension 5000 but the input layer expects sequences of feature dimension 1.” So, should I store the dataset in a cell (each cell representing 1 feature) and inside the cell a matrix of dimension (no of samples x no of time steps)?
numHiddenUnits = 100;
layers = [ ...
sequenceInputLayer(numFeatures)
lstmLayer(numHiddenUnits)
fullyConnectedLayer(numResponses)
regressionLayer];
(Q2) What does the 'MiniBatchSize' do? Does it divide the time steps (columns) into smaller batches or the number of samples (rows) into smaller batches?
(Q3) The last question is related to the ‘predictAndUpdateState’. Is the following formatting okay?
net = predictAndUpdateState(net,XTrain);
[net,YPred] = predictAndUpdateState(net,YTrain(:,end));
numTimeStepsTest = size(XTest,2); %numel(XTest);
for i = 2:numTimeStepsTest
    [net,YPred(:,i)] = predictAndUpdateState(net,YPred(:,i-1),...
        'MiniBatchSize',25,'ExecutionEnvironment','auto');
end
This question is somewhat related to Q1.

Accepted Answer

Conor Daly on 3 Aug 2021
Hi Tanmoy
Q1: When training a network with sequence data, the data must be presented to trainNetwork as cell arrays of size numObs-by-1. Each entry of the cell array corresponds to a single time series with dimensions, for example, numFeatures-by-numTimesteps. So for your data, I'm interpreting 5000 samples to mean 5000 independent observations. For example, it could be that I'm recording a process for 900 time steps, and I make 5000 independent recordings. This means we need to create a 5000-by-1 cell array, where each entry contains a 1-by-900 (training) time series. Of course, I could be wrong in how I've interpreted your data, and the data could instead be a single observation of a 5000-channel time series. For example, it could be that I make one recording of 5000 quantities over 900 time steps. In this case, your data corresponds to a 5000-by-900 (training) array.
You can manipulate your data into a 5000 observation cell array as follows:
numObs = 5000;
numTrainTimesteps = 900;
dataTrain = data(:, 1:numTrainTimesteps);
dataTest = data(:, (numTrainTimesteps+1):end);
XTrain = cell(numObs, 1);
TTrain = cell(numObs, 1);
XTest = cell(numObs, 1);
TTest = cell(numObs, 1);
for n = 1:numObs
    XTrain{n} = dataTrain(n, 1:end-1);
    TTrain{n} = dataTrain(n, 2:end);
    XTest{n} = dataTest(n, 1:end-1);
    TTest{n} = dataTest(n, 2:end);
end
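The same split can also be written without an explicit loop. The following is a minimal sketch, not part of the original answer: it assumes random stand-in values for `data`, applies the standardization step mentioned in the question using statistics from the training portion only (the single global `mu` and `sigma` are an assumption), and uses `num2cell(A, 2)` to place each row in its own cell.

```matlab
% Stand-in for the real 5000-by-1001 data matrix (assumption).
rng(0);
data = randn(5000, 1001);

numTrainTimesteps = 900;
dataTrain = data(:, 1:numTrainTimesteps);
dataTest  = data(:, (numTrainTimesteps+1):end);

% Standardize with training statistics only, so that no information
% from the test portion leaks into the scaling (assumed scheme).
mu    = mean(dataTrain(:));
sigma = std(dataTrain(:));
dataTrain = (dataTrain - mu) / sigma;
dataTest  = (dataTest  - mu) / sigma;

% num2cell(A, 2) puts each row of A into its own cell, which gives a
% numObs-by-1 cell array of 1-by-numTimesteps sequences.
XTrain = num2cell(dataTrain(:, 1:end-1), 2);
TTrain = num2cell(dataTrain(:, 2:end),   2);
XTest  = num2cell(dataTest(:, 1:end-1),  2);
TTest  = num2cell(dataTest(:, 2:end),    2);

disp(size(XTrain));     % numObs-by-1
disp(size(XTrain{1}));  % 1-by-(numTrainTimesteps-1)
```

Each target sequence is the corresponding input sequence shifted by one time step, which is what the loop version produces as well.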
With this set up, the number of input features and number of output features (or regression responses) are both equal to one. So we can build LSTM layer arrays as follows:
numFeatures = 1;
numResponses = 1;
numHiddenUnits = 32;
layers = [ sequenceInputLayer(numFeatures)
lstmLayer(numHiddenUnits)
fullyConnectedLayer(numResponses)
regressionLayer() ];
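To show how this layer array and the cell-array data fit together, here is a minimal training sketch. It is an illustration under stated assumptions, not the original poster's setup: the data are small synthetic sine waves, and the `trainingOptions` values (solver, epochs, mini-batch size) are arbitrary placeholders. It requires Deep Learning Toolbox.

```matlab
% Synthetic stand-in data (assumption): 20 noisy sine observations,
% each a 1-by-50 univariate time series with a random phase.
rng(0);
numObs = 20;
numSteps = 50;
t = linspace(0, 4*pi, numSteps);
data = sin(t + 2*pi*rand(numObs, 1)) + 0.05*randn(numObs, numSteps);

% One cell per observation; targets are the inputs shifted by one step.
XTrain = num2cell(data(:, 1:end-1), 2);
TTrain = num2cell(data(:, 2:end),   2);

numFeatures = 1;
numResponses = 1;
numHiddenUnits = 32;
layers = [ sequenceInputLayer(numFeatures)
           lstmLayer(numHiddenUnits)
           fullyConnectedLayer(numResponses)
           regressionLayer() ];

% Placeholder training settings (assumptions, not tuned).
options = trainingOptions('adam', ...
    'MaxEpochs', 5, ...
    'MiniBatchSize', 10, ...
    'Verbose', false, ...
    'Plots', 'none');

net = trainNetwork(XTrain, TTrain, layers, options);

% Predicting on a cell array of sequences returns a cell array of
% sequences, one per observation.
YPred = predict(net, XTrain, 'MiniBatchSize', 10);
disp(size(YPred{1}));
```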
Q2: The mini-batch size name-value option in trainingOptions and the inference functions (e.g. predict) controls the number of observations that are passed through the network in a single iteration. So for example, if we have 5000 observations and we choose a mini-batch size of 500, it'll take us 10 iterations to work through the entire data set.
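The arithmetic can be sketched as follows. This is only an illustration: it ignores shuffling, and it uses `floor` on the understanding that trainNetwork drops observations that do not fill a final complete mini-batch. The variable names are placeholders.

```matlab
numObs = 5000;         % number of observations (rows of the cell array)
miniBatchSize = 500;   % observations processed per iteration

% Iterations needed to pass over the data set once (one epoch).
itersPerEpoch = floor(numObs / miniBatchSize);
fprintf('%d iterations per epoch\n', itersPerEpoch);

% Which observations each iteration would see, ignoring shuffling.
% The time steps of each observation are never split across batches.
firstIdx = (0:itersPerEpoch-1) * miniBatchSize + 1;
lastIdx  = firstIdx + miniBatchSize - 1;
disp([firstIdx(:), lastIdx(:)]);
```

So the mini-batch size partitions the observations (the 5000 samples), not the time steps within a sequence.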
Q3: It is only recommended to use the predictAndUpdateState function one observation at a time. You can do this via an outer loop which runs over the number of observations in your test set. Since we're looping over observations, we need to be careful to reset the state after each independent observation -- we can do this with the resetState method of SeriesNetwork and DAGNetwork. For example:
% Initialize the prediction variable YTest. It should be of corresponding
% size to TTest.
YTest = cell(size(TTest));
% Determine the number of time steps for which we want to generate a
% response. Note that if our test data is ragged -- i.e. contains a
% different number of time steps for each observation -- then we need to be
% more careful in how we determine numSteps.
numSteps = size(TTest{1}, 2);
for n = 1:numObs
    % Create a network with state corresponding to the training time steps.
    [net, Y] = predictAndUpdateState(net, XTrain{n});
    % Initialize the prediction input.
    Y = Y(:, end);
    % Initialize the prediction for this observation.
    Yseq = [];
    for t = 1:numSteps
        [net, Y] = predictAndUpdateState(net, Y);
        Yseq = cat(2, Yseq, Y);
    end
    % Assign the generated prediction into the prediction variable.
    YTest{n} = Yseq;
    % Reset the network state so the network is ready for the next
    % observation.
    net = resetState(net);
end
  3 Comments
Tanmoy Chatterjee on 6 Aug 2021
Hi Conor,
Thanks once again. The code is running fine now.
(Q4) But one thing in regard to your explanation for Q3 in part of the code below,
for t = 1:numSteps
[net, Y] = predictAndUpdateState(net, Y);
Yseq = cat(2, Yseq, Y);
end
I suppose that the input for predictAndUpdateState should be XTest instead of Y. Can you please clarify whether this was a typo? I am getting excellent predictions using XTest.
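For context, the two choices correspond to two different forecasting modes. Feeding the previous prediction Y back in is closed-loop forecasting, which needs no measured test values. Feeding XTest is open-loop forecasting, where every step sees the true previous value; this usually tracks the targets more closely but only works when those values are available. The sketch below compares the two on one observation. It is an illustration under assumptions, not a reply from the thread: the data are synthetic sine waves, the training settings are arbitrary, and the split is chosen so that the first test input is the last training target. It requires Deep Learning Toolbox.

```matlab
% Synthetic stand-in data (assumption): 20 sine waves with random phase.
rng(0);
numObs = 20; numSteps = 60; numTrain = 50;
t = linspace(0, 6*pi, numSteps);
data = sin(t + 2*pi*rand(numObs, 1));

XTrain = num2cell(data(:, 1:numTrain-1),   2);
TTrain = num2cell(data(:, 2:numTrain),     2);
XTest  = num2cell(data(:, numTrain:end-1), 2);
TTest  = num2cell(data(:, numTrain+1:end), 2);
numTest = size(TTest{1}, 2);

layers = [ sequenceInputLayer(1)
           lstmLayer(32)
           fullyConnectedLayer(1)
           regressionLayer() ];
options = trainingOptions('adam', 'MaxEpochs', 50, ...
    'MiniBatchSize', numObs, 'Verbose', false, 'Plots', 'none');
net = trainNetwork(XTrain, TTrain, layers, options);

n = 1;  % observation to forecast

% Closed loop: warm up on the training steps, then feed predictions back.
net = resetState(net);
[net, Y] = predictAndUpdateState(net, XTrain{n});
Y = Y(:, end);
YClosed = zeros(1, numTest);
for k = 1:numTest
    [net, Y] = predictAndUpdateState(net, Y);
    YClosed(:, k) = Y;
end

% Open loop: warm up the same way, then feed the measured test inputs.
net = resetState(net);
[net, ~] = predictAndUpdateState(net, XTrain{n});
YOpen = zeros(1, numTest);
for k = 1:numTest
    [net, YOpen(:, k)] = predictAndUpdateState(net, XTest{n}(:, k));
end

rmseClosed = sqrt(mean((YClosed - TTest{n}).^2));
rmseOpen   = sqrt(mean((YOpen   - TTest{n}).^2));
fprintf('closed-loop RMSE: %.4f, open-loop RMSE: %.4f\n', rmseClosed, rmseOpen);
```

Which mode is appropriate depends on whether the true values for the forecast horizon will actually be known at prediction time.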
(Q5) I am now trying to split the data (along the time axis) into training, validation and test sets. Can you give me some guidance on how to compute the model's accuracy on the validation set, apart from the 'training-progress' plots? I want to plot the validation error in a separate plot.
Nanxin on 29 Oct 2022
Hi Conor,
Your explanation and code helped me a lot, because I had been looking for an answer to the question of whether I can use a batch of data for prediction with predictAndUpdateState. Now I know it is better to feed the trained net one observation at a time when predicting. But the for loop may take a long time when the number of observations is large. So I would like to know whether there are ways to speed this up on the software or hardware side.
Thanks again.


More Answers (1)

Patrick Stettler on 19 Sep 2023
Hi Conor
Your answer was indeed very helpful. I'm still struggling, however, with the data-structuring issue. The experiment I'm trying to solve is akin to the setup you indicated above ("...for example, it could be that I make one recording of 5000 quantities over 900 time steps. In this case, your data corresponds to a 5000-by-900 (training) array").
I have time-series data (S&P 500 returns and three indicators), that is, feature dimension 4 and 1000 time steps. I'm trying to predict the direction of the next close (up->1, unchanged->0, down->-1) with an LSTM model in the Deep Network Designer app.
I've tried to structure data in several ways but couldn't make Designer work. My approach:
XTrain:
1) 4-by-1000 array (doubles)
2) convert to arraydatastore (as Designer only accepts arraydatastore type);
YTrain:
1) 1-by-1000 array (doubles, i.e. +1, 0, -1)
2) convert to arraydatastore (as Designer only accepts arraydatastore type);
XYTrain:
1) combine XTrain and YTrain with combine(XTrain, YTrain)
In Designer:
1) in InputLayer: InputSize = 4
2) last layer is a classificationLayer
Result:
This leads to several errors. Designer complains, for example, about a) input-data <> InputSize mismatch and b) categorization mismatch.
I couldn't find the answers in the documentation, some hints would be much appreciated, thanks.
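One way to structure a single multi-channel observation for sequence classification is sketched below. This is an assumption-laden illustration, not the reply given in the thread: the returns and indicators are random stand-ins, the category names are hypothetical, and whether Deep Network Designer accepts this exact datastore has not been verified here. The points it shows are that classification responses generally need to be categorical rather than double, and that wrapping each array in a cell makes the datastore return one whole 4-by-N sequence per read.

```matlab
% Random stand-in data (assumption): returns plus three indicators.
rng(0);
numSteps = 1000;
returns    = 0.01 * randn(1, numSteps);
indicators = randn(3, numSteps);
X = [returns; indicators];               % 4-by-1000, features-by-time

% Label at step t is the direction of the NEXT return, so the inputs use
% steps 1..N-1 and the labels come from steps 2..N.
direction = sign(returns(2:end));        % values in {-1, 0, 1}
labels = categorical(direction, [-1 0 1], {'down', 'unchanged', 'up'});

% One observation: wrap each array in a 1-by-1 cell.
XTrain = {X(:, 1:end-1)};                % {4-by-999 double}
YTrain = {labels};                       % {1-by-999 categorical}

% 'OutputType','same' makes read return the cell contents unchanged,
% so each read of the combined datastore yields {sequence, labels}.
dsX = arrayDatastore(XTrain, 'OutputType', 'same');
dsY = arrayDatastore(YTrain, 'OutputType', 'same');
ds  = combine(dsX, dsY);

sample = read(ds);
disp(size(sample{1}));     % 4-by-999 predictors
disp(categories(sample{2}));
```

With three categories, a matching network would end in fullyConnectedLayer(3), softmaxLayer and classificationLayer, with sequenceInputLayer(4) at the input.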
  3 Comments
Patrick Stettler on 20 Sep 2023
Many thanks Conor, much appreciated, this makes things clearer now (I was definitely wrong on the IterationDimension (=3)).
I've tried to replicate your (programmatic) setup directly in the Designer-app as follows:
  • network type sequence-to-label
  • sequenceInputLayer:
      • inputSize = 4
  • fullyConnectedLayer:
      • OutputSize = 3
  • classificationLayer:
      • outputSize = 'auto'
  ---
  • data: using 'ds' combined datastore as constructed above
  • Solver: 'adam'
Result:
The problem seems to be that, when using Designer, the responses also need to be structured as 3x1000. Alternatively, one would need to tell Designer's classificationLayer to set outputSize=1 (my hypothesis), making it fit the 'ds' datastore as is. Or how and where else would one instruct Designer to work with the 'ds' datastore as is?
Thanks for the enlightenment, Patrick
Conor Daly on 1 Oct 2023
Thanks Patrick! I'm sorry it's still not working. It's not really clear to me what's going on -- would you be able to share your code (with dummied/randomized data)?
