Out-of-sample normalization problem

Hi. I’m working on a binary classification system with 21 financial ratios and variables as inputs; the output is a financial criterion that can be 0 or 1. Before feeding data into my classification model (MLP, SVM, or ELM) I normalize it (min/max mapping or whitening). My financial ratios come from companies’ statements, so the companies in our data vary widely in size.
I'm also using 5-fold cross-validation to design the model. Now that the model is designed, I want to apply it to new data, so I must normalize that data too. I've found that for min-max mapping I must use the maximum and minimum of the design-phase data set, and for whitening I must use its mean and variance.
Suppose that, in (x - min)/(max - min), my new data set has a sample whose value of x for some feature is lower than the design-phase minimum, so the normalized feature (for that specific sample) is negative. Is this a problem? Is the output (1 or 0) still valid for this specific sample? The whitening method can have the same problem.
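To make the situation concrete, here is a minimal MATLAB sketch (the numbers and variable names are illustrative) showing how a new sample outside the design-phase range maps outside [0,1]:

```matlab
% Design-phase data for one feature (illustrative numbers)
xdesign = [2 5 9 14 20];
xmin = min(xdesign);              % 2
xmax = max(xdesign);              % 20

% Min-max mapping learned in the design phase
normalize = @(x) (x - xmin) / (xmax - xmin);

normalize(9)    % inside the design range  -> 0.3889, within [0,1]
normalize(1)    % below the design minimum -> -0.0556, slightly negative
normalize(25)   % above the design maximum -> 1.2778, slightly above 1
```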
Thanks.

 Accepted Answer

Greg Heath
Greg Heath on 3 Apr 2014
Edited: Greg Heath on 3 Apr 2014
Regardless of what you use in the model, I always standardize pre-modelling using zscore or mapstd to identify outliers for removal or modification.
Warning: Each dimension should be normalized separately.
P.S. If you use neural nets, the default input/target normalization is mapminmax to [-1,1], and the hidden-layer transfer function is the odd function tanh (tansig in MATLAB).
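A minimal sketch of the pre-modelling standardization described above (the threshold of 4 standard deviations is an illustrative choice, not a rule from this thread):

```matlab
% x is N-by-21: one row per sample, one column per financial ratio.
% zscore standardizes each column (i.e., each dimension) separately.
z = zscore(x);                     % (x - mean(x)) ./ std(x), column-wise

% Flag candidate outliers: any sample with |z| beyond a chosen threshold
thr = 4;                           % illustrative threshold, in standard deviations
outlierRows = any(abs(z) > thr, 2);
fprintf('%d of %d samples flagged as outliers\n', sum(outlierRows), size(x,1));
```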
Hope this helps
Thank you for formally accepting my answer
Greg

6 Comments

Jack
Jack on 4 Apr 2014
Edited: Jack on 4 Apr 2014
Thank you for the answer, Greg.
I did not understand your first sentence. We normalize data before feeding it in for training and testing (creating the model). Now we are going to use the final model, so we normalize the new data with the normalization parameters from the first phase (max/min, or mean and variance). Some samples of a specific feature now fall outside the min/max bounds (or the mean/variance range) of the design data, so, for example, with mapminmax we get negative values.
What can we do about this problem? Must we remove such out-of-range normalized data in the final phase, or are there other approaches?
Thanks.
The MATLAB training functions automatically apply the mapminmax [-1,1] normalization to both inputs and targets and use the transfer function tansig for both hidden and output layers.
The sim or net function automatically applies the stored normalization, calculates the normalized output, and then unnormalizes it.
I prefer using mapstd on inputs and outputs with {tansig,purelin} for regression, whereas for classification I use unit-vector targets with {tansig,logsig} or {tansig,softmax}. However, after overriding the defaults a million times, I decided to just use the defaults.
Now, before training, I always standardize using zscore (easier than mapstd), then plot and use minmax to find and delete or modify outliers.
I then use the standardized data as input to configure and train. I NO LONGER OVERWRITE the normalization/denormalization defaults.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
As far as your "problem" is concerned, THERE IS NO PROBLEM (unless, of course, you have outliers that are many standard deviations away from the data mean).
So, nets trained with [0,1] data can handle data outside that interval. How much outside depends on the problem. I guess it would bother me greatly if a lot of nondesign data were in the [0.5, 1.5] range.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Again, my recommendation:
1. Standardize input data and regression output data.
2. Remove or modify outliers.
3. Accept the default training-function normalizations.
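The recommendation above can be sketched end to end in MATLAB (the hidden-layer size and outlier threshold are illustrative assumptions; x is samples-in-rows, t the matching targets):

```matlab
% 1. Standardize the inputs (zscore works column-wise on an N-by-21 matrix)
z = zscore(x);

% 2. Remove outlier samples (threshold chosen by inspection)
thr  = 4;
keep = all(abs(z) <= thr, 2);
z = z(keep, :);   t = t(keep, :);

% 3. Train patternnet on the standardized data; its default
%    normalization/denormalization is left in place.
net = patternnet(10);           % 10 hidden units, illustrative
net = train(net, z', t');       % the toolbox expects columns = samples
y = net(z');                    % sim/net apply the stored normalization
```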
Jack's "Answer" moved here:
Thank you so much Greg. Your detailed assistance is a big help for me.
As I mentioned, my output is binary (0 and 1) so I think I don’t need Unit vector in a binary classification.
1. Regarding transfer functions: I’m using tansig for the hidden layer(s) and purelin for the output layer. Based on your experience, is {tansig,logsig} or {tansig,softmax} better for the binary classification case?
2. After normalizing the data with “zscore”, how do you find outliers with plotting and “minmax”? (Considering all features together, or analyzing each separately?) How can I modify outliers instead of deleting the outlier samples? Currently I’m using k-means clustering: first I create a cluster for the 0-class and the 1-class separately, and then remove, for instance, the 10% of samples farthest from the center of each cluster. Is this a good approach?
3. You mentioned:
1. Standardize input data and regression output data.
2. Remove or modify outliers.
3. Accept the default training-function normalizations.
So first I must manually standardize the data and leave normalization active in the neural network options? I think that since I normalize the data separately before feeding it in for training, and since my problem is binary classification (so I don’t need output normalization), in the end I don’t need to activate normalization (mapminmax or mapstd) in the neural network properties. Is this true?
Thank you again.
Greg's comment to Jack's "Answer" moved here:
% As I mentioned, my output is binary (0 and 1) so I think I don’t need
% unit vector in a binary classification.
You don't need it, but it is more convenient because of special functions like IND2VEC and VEC2IND.
% 1. About transfer functions I’m using tansig for hidden layer(s) and
% purelin for output layer. So based on your experience {tansig,logsig} or
% {tansig,softmax} is better in binary classification case?
SOFTMAX for mutually exclusive classes (e.g., penny, nickel, dime).
LOGSIG for non-mutually exclusive classes (e.g., tall, dark, handsome).
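For reference, switching patternnet's output transfer function is a one-line override (a sketch; the default output transfer function varies by MATLAB release, and the hidden-layer size here is illustrative):

```matlab
net = patternnet(10);                     % 10 hidden units, illustrative
net.layers{2}.transferFcn = 'softmax';    % mutually exclusive classes
% net.layers{2}.transferFcn = 'logsig';   % non-mutually-exclusive class labels
```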
% 2. After normalizing data with “zscore”, how you find outlier with % plotting and “minmax”?
(x - mean(x))/std(x) > threshold of your choice
%(Considering all features together or analyze it separately?)
It only takes one of the many features to make a sample an outlier.
% How can I modify outliers instead of deleting outlier samples?
Replace xout with xmean ± std*threshold, matching the sign of the excursion (see above).
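A sketch of that replacement ("winsorizing" the outliers rather than deleting them; the threshold is an illustrative assumption, and x is samples-in-rows):

```matlab
% x is N-by-21; clip each feature's outliers to mean +/- thr*std.
thr  = 4;                               % illustrative threshold
xmod = x;
for j = 1:size(x, 2)
    mu = mean(x(:, j));  sd = std(x(:, j));
    xmod(x(:, j) > mu + thr*sd, j) = mu + thr*sd;   % clip high outliers
    xmod(x(:, j) < mu - thr*sd, j) = mu - thr*sd;   % clip low outliers
end
```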
% Now I’m using k-means clustering, first create a cluster for
% 0-class and 1-class separately and after that remove for instance 10% of
% samples that are beyond the center in every cluster. Is this a good
% approach?
No. Once outliers are removed, just use patternnet. Multiple designs will reveal points that are consistently misclassified.
% 3. You mentioned:
%
% Standardize input data and regression output data 2. Remove or modify
% outliers 3. Accept the default training function normalizations.
%
% So first I must manually standardize data and active normalization in
% neural network options. I think when I normalize data separately before
% inserting it for training network, in addition when my problem is a
% binary classification (I don’t need output normalizing) so finally I
% don’t need active normalization (mapminmax or mapstd) in neural network
% properties. Is this true? ....
If you are going to check for outliers, I find this the least complicated:
1. Standardize inputs and remove or modify outliers.
2. Accept the additional default normalization/denormalization of patternnet.
Jack's second so-called "Answer" moved here:
Thank you again Greg.
I wouldn't use k-means clustering on top of other outlier detection techniques; k-means is simply one option for outlier detection in my system, besides your proposed technique, so I can choose either of the two. Given the above discussion, what is your opinion of k-means clustering?
You mentioned that I can use ‘(x - mean(x))/std(x) > threshold of your choice’, so your proposed technique does not consider all inputs (in my case, 21 variables) simultaneously, and I can analyze one feature at a time. Is this true?
Thanks.
No. You consider all of them at once using matrix coding. I consider a 21-dimensional vector an outlier if one or more of its components is an outlier.
All MATLAB code is matrix based. So if you find one or more outlying components in a column of an input or target matrix, either modify or delete that column. Any target column corresponding to a deleted input column must also be deleted, and vice versa.
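The matrix coding described above might look like this (a sketch; the threshold is an illustrative assumption, and the columns-are-samples convention matches the toolbox):

```matlab
% x is 21-by-N (columns = samples); t is the matching target matrix.
z = zscore(x, 0, 2);              % standardize each feature (row) separately
thr = 4;                          % illustrative threshold
bad = any(abs(z) > thr, 1);       % a column is an outlier if ANY component is
x(:, bad) = [];                   % delete outlier input columns...
t(:, bad) = [];                   % ...and the corresponding target columns
```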


Asked: 3 Apr 2014 · Last commented: 11 Apr 2014
