Question on Regression Learner App

lauzof
lauzof on 19 Dec 2022
Commented: lauzof on 10 Jan 2023
Hi guys,
I trained a model using the Regression Learner app in MATLAB R2021b. When training my model, I got some "Training Results" (such as RMSE, R-squared, etc.) and a scatter plot of Predicted Response vs. True Response. Then, I saved the model with "Export Model for Deployment". Now, I'd like to obtain the predicted response values from this training instance that produced those results. Do you know how I can get them?
thanks a lot for your answer!
best,
Laura

Accepted Answer

Drew
Drew on 5 Jan 2023
To get the RMSE results on validation data, a set of k-fold cross-validation models is needed. In the example provided, 50-fold cross-validation was used in Regression Learner. When running this model training in Regression Learner, 51 models were trained: one model for each cross-validation fold, plus a final model trained on all of the training data. When a model is exported from Regression Learner in R2021b, only the final model is exported. This is highlighted in a note at the top of this page: https://www.mathworks.com/help/stats/export-regression-model-to-predict-new-data.html
At a high level, there are two approaches:
(1) Use the "Export Model" option from the Regression Learner, then write code to calculate the validation RMSE.
(2) Use the "Generate Function" option of Regression Learner. This generates a MATLAB function which trains the final model and calculates the validation RMSE.
(1) Use the "Export Model" option from the Regression Learner, then write code to calculate the validation RMSE
For approach (1): After exporting the final model from the Regression Learner app as "trainedModel", one can get the validation RMSE with the code shown below.
% Do 50-fold cross-validation.
CVMdl = crossval(trainedModel.RegressionGP, 'KFold', 50);
% Do prediction on the validation data using the set of 50 cross-validation models
Y_validation = kfoldPredict(CVMdl);
% Calculate RMSE on the validation data
rmse_on_validation_data = sqrt(mean((Y_validation - tbl_training.Y).^2));
Note that the "crossval" function will do 50-fold cross-validation, since we specified 'KFold' of 50. This means that 50 models will be trained and stored in the resulting object. The "crossval" function randomly partitions the training data into 50 parts, then trains 50 models, one for each fold. For example, the first model could be trained on folds 2-50, so it can be tested on fold 1; the second model could be trained on folds 1 and 3-50, so it can be tested on fold 2; and so on. The crossval function accesses the original training data from inside the trainedModel.RegressionGP object. For more info, see https://www.mathworks.com/help/stats/classreg.learning.partition.regressionpartitionedmodel-class.html
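Because crossval draws a new random partition on each run, the validation RMSE will vary slightly from run to run (see note (1) further below). If you want a repeatable result at the command line, here is a minimal sketch; the seed value 0 and the explicit cvpartition alternative are my assumptions, not something the app requires:
% Minimal sketch: fix the random seed so the 50-fold partition, and hence
% the validation RMSE, is the same on every run (seed 0 is an arbitrary choice)
rng(0);
CVMdl = crossval(trainedModel.RegressionGP, 'KFold', 50);
% Alternatively, build an explicit partition object and reuse it:
% cvp = cvpartition(height(tbl_training), 'KFold', 50);
% CVMdl = crossval(trainedModel.RegressionGP, 'CVPartition', cvp);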
Here is some code to plot the validation predictions versus the True response:
% Plot predicted vs actual for validation data
scatter(tbl_training.Y, Y_validation, 15, [0 0.4470 0.7410], 'filled');
line([-1.75, 1.75], [-1.75, 1.75], 'Color', 'k');
axis([-1.75 1.75 -1.75 1.75]);
xlabel('True response'); ylabel('Predicted response using kfold validation models');
title('On validation data, Predicted response vs True Response');
subtitle(sprintf('RMSE of kfold validation models on validation data: %0.5f', rmse_on_validation_data));
legend("Observations","Perfect prediction","Location","southeast");
This leads to the following figure:
[Figure: On validation data, predicted response vs. true response; RMSE 0.29623]
So, the above plot is what you are looking for. A few notes:
(1) The RMSE on validation data (0.29623) is slightly different from what you see in Regression Learner (0.29645) because the data was randomly re-partitioned into 50 folds at the command line by crossval, so the 50 cross-validation models are slightly different from those used inside Regression Learner.
(2) The RMSE on the training data is much lower (0.18386) because testing the final model on the training data is "cheating": the model has already seen the observations being predicted, since the same data is used for both training and testing. A similar calculation and plot can be done using the final model on the training data:
% Do prediction on the training set, using the final model
Y_training = trainedModel.predictFcn(tbl_training);
% Calculate RMSE on training set using final model
rmse_on_training_data = sqrt(mean((Y_training-tbl_training.Y).^2))
% Plot predicted vs actual for training data using final model
scatter(tbl_training.Y, Y_training, 15, [0 0.4470 0.7410], 'filled');
line([-1.75, 1.75], [-1.75, 1.75], 'Color', 'k');
axis([-1.75 1.75 -1.75 1.75]);
xlabel('True response'); ylabel('Predicted response using final model');
title('On training data, Predicted response vs True Response');
subtitle(sprintf('RMSE of final model on training data: %0.5f', rmse_on_training_data));
legend("Observations","Perfect prediction","Location","southeast");
This leads to the following plot for the training data:
[Figure: On training data, predicted response vs. true response; RMSE 0.18386]
(2) Use the "Generate Function" option of Regression Learner. This generates a MATLAB function which trains the final model and calculates the validation RMSE.
Another way to reproduce the validation RMSE result is to use the "Generate Function" option from the Regression Learner app. The data tip indicates that this option will "Generate MATLAB code for training the currently selected model in the Models pane, including validation predictions."
So, just select the "Generate Function" option in the export area:
[Screenshot: Regression Learner Export section with "Generate Function" highlighted]
This outputs the following code. Notice that the last 3 lines of code calculate the validationRMSE in a way similar to that provided in the first part of this answer. For more info, see https://www.mathworks.com/help/stats/export-regression-model-to-predict-new-data.html#bvi2d8a-49. (Note, if you use PCA or feature selection in the Regression Learner app, then the generated code for calculating the validation RMSE will be much longer, and so in that case it is especially helpful to have this code auto-generated by the Regression Learner app.)
function [trainedModel, validationRMSE] = trainRegressionModel(trainingData)
% [trainedModel, validationRMSE] = trainRegressionModel(trainingData)
% Returns a trained regression model and its RMSE. This code recreates the
% model trained in Regression Learner app. Use the generated code to
% automate training the same model with new data, or to learn how to
% programmatically train models.
%
% Input:
% trainingData: A table containing the same predictor and response
% columns as those imported into the app.
%
% Output:
% trainedModel: A struct containing the trained regression model. The
% struct contains various fields with information about the trained
% model.
%
% trainedModel.predictFcn: A function to make predictions on new data.
%
% validationRMSE: A double containing the RMSE. In the app, the Models
% pane displays the RMSE for each model.
%
% Use the code to train the model with new data. To retrain your model,
% call the function from the command line with your original data or new
% data as the input argument trainingData.
%
% For example, to retrain a regression model trained with the original data
% set T, enter:
% [trainedModel, validationRMSE] = trainRegressionModel(T)
%
% To make predictions with the returned 'trainedModel' on new data T2, use
% yfit = trainedModel.predictFcn(T2)
%
% T2 must be a table containing at least the same predictor columns as used
% during training. For details, enter:
% trainedModel.HowToPredict
% Auto-generated by MATLAB on 04-Jan-2023 17:51:50
% Extract predictors and response
% This code processes the data into the right shape for training the
% model.
inputTable = trainingData;
predictorNames = {'X1', 'X2', 'X3', 'X4', 'X5', 'X6'};
predictors = inputTable(:, predictorNames);
response = inputTable.Y;
isCategoricalPredictor = [false, false, false, false, false, false];
% Train a regression model
% This code specifies all the model options and trains the model.
regressionGP = fitrgp(...
    predictors, ...
    response, ...
    'BasisFunction', 'constant', ...
    'KernelFunction', 'exponential', ...
    'Standardize', true);
% Create the result struct with predict function
predictorExtractionFcn = @(t) t(:, predictorNames);
gpPredictFcn = @(x) predict(regressionGP, x);
trainedModel.predictFcn = @(x) gpPredictFcn(predictorExtractionFcn(x));
% Add additional fields to the result struct
trainedModel.RequiredVariables = {'X1', 'X2', 'X3', 'X4', 'X5', 'X6'};
trainedModel.RegressionGP = regressionGP;
trainedModel.About = 'This struct is a trained model exported from Regression Learner R2021b.';
trainedModel.HowToPredict = sprintf('To make predictions on a new table, T, use: \n yfit = c.predictFcn(T) \nreplacing ''c'' with the name of the variable that is this struct, e.g. ''trainedModel''. \n \nThe table, T, must contain the variables returned by: \n c.RequiredVariables \nVariable formats (e.g. matrix/vector, datatype) must match the original training data. \nAdditional variables are ignored. \n \nFor more information, see <a href="matlab:helpview(fullfile(docroot, ''stats'', ''stats.map''), ''appregression_exportmodeltoworkspace'')">How to predict using an exported model</a>.');
% Extract predictors and response
% This code processes the data into the right shape for training the
% model.
inputTable = trainingData;
predictorNames = {'X1', 'X2', 'X3', 'X4', 'X5', 'X6'};
predictors = inputTable(:, predictorNames);
response = inputTable.Y;
isCategoricalPredictor = [false, false, false, false, false, false];
% Perform cross-validation
partitionedModel = crossval(trainedModel.RegressionGP, 'KFold', 50);
% Compute validation predictions
validationPredictions = kfoldPredict(partitionedModel);
% Compute validation RMSE
validationRMSE = sqrt(kfoldLoss(partitionedModel, 'LossFun', 'mse'));
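As a quick usage sketch (assuming tbl_training, the training table from this thread, is in the workspace), the generated function can be called as its help text describes:
% Retrain the model and get the validation RMSE via the generated function
[trainedModel, validationRMSE] = trainRegressionModel(tbl_training);
% Predict with the returned struct, as described in trainedModel.HowToPredict
yfit = trainedModel.predictFcn(tbl_training);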
  3 Comments
Drew
Drew on 9 Jan 2023
If my latest answer has been helpful to you, it would be great if you could accept it. I thought I would mention this, since it looks like you are new to MATLAB Answers.
Your latest comment asks about wanting to "show statistics on model calibration". If this means you want to show statistics about the predicted versus actual response of your Gaussian Process regression model, then the answer I have given enables you to do exactly that on validation data, training data, or new test data.
So, to recap, after training a Regression model, here are some common actions that are done:
(1) Use the final model to get regression (prediction) results on new data. This data could be thought of as "new test data". https://www.mathworks.com/help/stats/export-regression-model-to-predict-new-data.html
(2) Use k-fold cross-validation models to get regression (prediction) results on the validation data, and calculate the RMSE (or other metric) on the validation data. If you want to estimate the expected RMSE on future new test data, then the RMSE on the validation data can be used for that purpose.
(3) Use the final model to get regression (prediction) results on the training data, and calculate the RMSE (or other metric) on the training data. Note that this RMSE on the training data, using the final model, is not a good predictor of RMSE on future new test data, because the RMSE on training data will be lower due to the overlap between training and test data.
I think the above three options are probably all you need; a combined sketch follows.
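For reference, here is a minimal sketch of those three actions in one place (tbl_new is a hypothetical table of new observations with the same predictor columns as the training data):
% (1) Predict on new test data with the final model (tbl_new is hypothetical)
yfit_new = trainedModel.predictFcn(tbl_new);
% (2) Estimate expected RMSE on future data via 50-fold cross-validation
CVMdl = crossval(trainedModel.RegressionGP, 'KFold', 50);
validationRMSE = sqrt(kfoldLoss(CVMdl, 'LossFun', 'mse'));
% (3) RMSE of the final model on its own training data (optimistically low)
Y_training = trainedModel.predictFcn(tbl_training);
trainingRMSE = sqrt(mean((Y_training - tbl_training.Y).^2));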
Less common actions: In your comment posted Jan 4, 2023 14:23, you used the fitlm command to train a linear regression model on the (x,y) data points formed from the (true, predicted) values from your Gaussian Process regression model. That creates a best-fit line through the "predicted vs true" data points. It is a second regression, performed after the Gaussian Process regression, so it gives different results (a different error rate, a different RMSE, etc.). Based on your questions and comments, it looked like you wanted to reproduce what you were seeing in the Regression Learner app, so I indicated how to do that. The Regression Learner app does not train a linear regression model on the "predicted vs true" data points from the Gaussian Process regression model. If you want to look at that second regression, built with fitlm, here is how it looks:
% Do prediction on the training set, using the final model
Y_training = trainedModel.predictFcn(tbl_training);
% Calculate RMSE on training set using final model
rmse_on_training_data = sqrt(mean((Y_training-tbl_training.Y).^2))
% Plot predicted vs actual for training data using final model
scatter(tbl_training.Y, Y_training, 15, [0 0.4470 0.7410], 'filled');
line([-1.75, 1.75], [-1.75, 1.75], 'Color', 'k');
axis([-1.75 1.75 -1.75 1.75]);
xlabel('True response'); ylabel('Predicted response using final GPR model');
title('On training data, Predicted response vs True Response');
subtitle(sprintf('RMSE of final GPR model on training data: %0.5f\nRed line shows best linear fit through (true, predicted) datapoints', rmse_on_training_data));
hold on;
% Do another regression: draw a best-fit line through the (true, predicted) points from the GPR model
Mdl = fitlm(tbl_training.Y, Y_training)   % no semicolon, so the model summary is displayed
x = linspace(-2, 2, 100);
y = Mdl.Coefficients{1,1} + x*Mdl.Coefficients{2,1};
plot(x, y, 'r');
legend("Observations", "Perfect GPR prediction if blue points were on this line", "Best fit line through (true, predicted) points from GPR", "Location", "southeast");
legend('Position', [0.30476, 0.13651, 0.61786, 0.11905]);
hold off;
That yields this plot:
[Figure: On training data, predicted vs. true response with the red fitlm best-fit line]
The simple linear model also has a built-in plot method that yields a similar plot (without the black diagonal line) with much less code. We will set the axis limits to be the same, so the plots look more similar:
plot(Mdl)
axis([-1.75 1.75 -1.75 1.75]);
lauzof
lauzof on 10 Jan 2023
Dear Drew,
thanks a lot for your answer and help. Your code works perfectly!
best,
Laura


More Answers (1)

Drew
Drew on 3 Jan 2023
To work with a model from the Regression Learner app at the MATLAB commandline, it is recommended to use the "Export model" or "Export Compact Model" options, rather than "Export Model for Deployment". For example, if your model was exported to the MATLAB workspace as a trainedModel using the "Export Model" or "Export Compact Model" option, and if X contains the training data, then perform regression on the training data with:
trainedModel.predictFcn(X)
If "Export Model" is selected rather than "Export Compact Model", then the training data is inside the model object in the trainedModel structure. You can see the model object type by examining the tranedModel structure. For example, if the trained model is a RegressionTree, then perform regression on the training data with:
trainedModel.predictFcn(trainedModel.RegressionTree.X)
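For the Gaussian Process model discussed in this thread, the same idea looks like the sketch below. This is a minimal sketch assuming the exported struct has a RegressionGP field, and it relies on the fact that full (non-compact) regression model objects also store the training response in their Y property:
% Sketch for the GPR case (requires "Export Model", so training data is stored)
X_train = trainedModel.RegressionGP.X;   % stored training predictors (a table here)
Y_train = trainedModel.RegressionGP.Y;   % stored training response
Y_fit = trainedModel.predictFcn(X_train);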
  2 Comments
lauzof
lauzof on 4 Jan 2023
Hi Drew,
thanks for your answer. I performed the steps you suggested but I cannot retrieve the same values that I get as "Training Results". Below, I give you more details and I share the dataset I'm using.
I'm using the tbl_training table in the test.mat file to predict the response Y as a function of predictors X1, X2, X3, X4, X5 and X6, with 50-fold cross-validation and an Exponential GPR model. I get a model with the following fit, whose "Predicted response" values I need to obtain:
[Figure: Regression Learner predicted vs. true response plot for the Exponential GPR model]
I exported the model. When I run the command:
Y_training = trainedModel.predictFcn(tbl_training);
I obtain a different fit:
fitlm(tbl_training.Y,Y_training)
Linear regression model:
    y ~ 1 + x1

Estimated Coefficients:
                    Estimate        SE       tStat      pValue
                   _________    _________    ______    __________
    (Intercept)    -0.052502    0.0052629    -9.976    2.9559e-22
    x1               0.74189     0.010531    70.451             0

Number of observations: 864, Error degrees of freedom: 862
Root Mean Squared Error: 0.141
R-squared: 0.852, Adjusted R-Squared: 0.852
F-statistic vs. constant model: 4.96e+03, p-value = 0
see the plot here:
figure;scatter(tbl_training.Y,Y_training)
So, I still don't understand why I get a much better fit than the one reported in the "Training results", and how I could get the predicted response values from the first plot I shared, which come from the model training.
Both the test.mat and trainedModel.mat files can be found at https://drive.google.com/file/d/17gAcg2eJtPzDLM-x4CoYh9wsoH5f5o3X/view?usp=share_link
Thanks again for your help!
best,
Laura
Drew
Drew on 4 Jan 2023
Edited: Drew on 5 Jan 2023
see answer below

