Out-of-sample normalization problem

Hi. I’m working on a binary classification system with 21 financial ratios and variables as inputs; the output is a financial criterion that can be 0 or 1. Before feeding data into my classification model (MLP, SVM, or ELM) I normalize it (min/max mapping or whitening). My financial ratios come from companies’ statements, so the companies in our data vary widely in size.
I'm also using 5-fold cross-validation to design the model. Now that the model is designed, I want to apply it to new data, so I must normalize that data too. I've found that for min-max mapping I must use the maximum and minimum of the design-phase data set, and for whitening I must use its mean and variance.
Suppose that, in (x - min)/(max - min), my new data set has a sample whose value of x for some feature is lower than the design-phase minimum, so the normalized feature (for that specific sample) is negative. Is this a problem? Is the output (1 or 0) still valid for this specific sample? The whitening method can have the same problem.
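To make the situation concrete, here is a minimal MATLAB sketch (the numbers and variable names are illustrative) showing how a new sample outside the design-phase range maps outside [0,1]:

```matlab
% Design-phase data for one feature (illustrative numbers)
xdesign = [2 5 9 14 20];
xmin = min(xdesign);              % 2
xmax = max(xdesign);              % 20

% Min-max mapping learned in the design phase
normalize = @(x) (x - xmin) / (xmax - xmin);

normalize(9)    % inside the design range  -> 0.3889, within [0,1]
normalize(1)    % below the design minimum -> -0.0556, slightly negative
normalize(25)   % above the design maximum -> 1.2778, slightly above 1
```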
Thanks.

 Accepted Answer

Greg Heath
Greg Heath on 3 Apr 2014
Edited: Greg Heath on 3 Apr 2014
Regardless of what you use in the model, I always standardize pre-modelling using zscore or mapstd to identify outliers for removal or modification.
Warning: Each dimension should be normalized separately.
P.S. If you use neural nets, the default input/target normalization is mapminmax to [-1,1], and the hidden-layer transfer function is the odd function tanh (tansig in MATLAB).
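A minimal sketch of the pre-modelling standardization described above (the threshold of 4 standard deviations is an illustrative choice, not a rule from this thread):

```matlab
% x is N-by-21: one row per sample, one column per financial ratio.
% zscore standardizes each column (i.e., each dimension) separately.
z = zscore(x);                     % (x - mean(x)) ./ std(x), column-wise

% Flag candidate outliers: any sample with |z| beyond a chosen threshold
thr = 4;                           % illustrative threshold, in standard deviations
outlierRows = any(abs(z) > thr, 2);
fprintf('%d of %d samples flagged as outliers\n', sum(outlierRows), size(x,1));
```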
Hope this helps
Thank you for formally accepting my answer
Greg

6 Comments

Jack
Jack on 4 Apr 2014
Edited: Jack on 4 Apr 2014
Thank you for the answer, Greg.
I did not understand your first sentence. We normalize data before feeding it in for training and testing (creating the model). Now we are going to use the final model, so we normalize the new data with the normalization parameters from the first phase (max/min, or mean and variance). Some samples of a specific feature now fall outside the min/max bounds (or the mean/variance range) of the design data, so, for example, with mapminmax we get negative values.
What can we do about this problem? Must we remove such out-of-range normalized data in the final phase, or are there other approaches?
Thanks.
The MATLAB training functions automatically apply the mapminmax [-1,1] normalization to both inputs and targets and use the transfer function tansig for both hidden and output layers.
The sim or net function automatically applies the stored normalization, calculates the normalized output, and then unnormalizes it.
I prefer using mapstd on inputs and outputs with {tansig,purelin} for regression, whereas for classification I use unit-vector targets with {tansig,logsig} or {tansig,softmax}. However, after overriding the defaults a million times, I decided to just use the defaults.
Now, before training, I always standardize using zscore (easier than mapstd), then plot and use minmax to find and delete or modify outliers.
I then use the standardized data as input to configure and train. I NO LONGER OVERWRITE the normalization/denormalization defaults.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
As far as your "problem" is concerned, THERE IS NO PROBLEM (unless, of course, you have outliers that are many standard deviations away from the data mean).
So, nets trained with [0,1] data can handle data outside that interval. How much outside depends on the problem. I guess it would bother me greatly if a lot of nondesign data were in the [0.5, 1.5] range.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Again, my recommendation:
1. Standardize input data and regression output data.
2. Remove or modify outliers.
3. Accept the default training-function normalizations.
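The recommendation above can be sketched end to end in MATLAB (the hidden-layer size and outlier threshold are illustrative assumptions; x is samples-in-rows, t the matching targets):

```matlab
% 1. Standardize the inputs (zscore works column-wise on an N-by-21 matrix)
z = zscore(x);

% 2. Remove outlier samples (threshold chosen by inspection)
thr  = 4;
keep = all(abs(z) <= thr, 2);
z = z(keep, :);   t = t(keep, :);

% 3. Train patternnet on the standardized data; its default
%    normalization/denormalization is left in place.
net = patternnet(10);           % 10 hidden units, illustrative
net = train(net, z', t');       % the toolbox expects columns = samples
y = net(z');                    % sim/net apply the stored normalization
```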
Jack's "Answer" moved here:
Thank you so much Greg. Your detailed assistance is a big help for me.
As I mentioned, my output is binary (0 and 1) so I think I don’t need Unit vector in a binary classification.
1. Regarding transfer functions: I’m using tansig for the hidden layer(s) and purelin for the output layer. Based on your experience, is {tansig,logsig} or {tansig,softmax} better for the binary classification case?
2. After normalizing the data with “zscore”, how do you find outliers with plotting and “minmax”? (Considering all features together, or analyzing each separately?) How can I modify outliers instead of deleting the outlier samples? Currently I’m using k-means clustering: first I create a cluster for the 0-class and the 1-class separately, and then remove, for instance, the 10% of samples farthest from the center of each cluster. Is this a good approach?
3. You mentioned:
1. Standardize input data and regression output data.
2. Remove or modify outliers.
3. Accept the default training-function normalizations.
So first I must manually standardize the data and leave normalization active in the neural network options? I think that since I normalize the data separately before feeding it in for training, and since my problem is binary classification (so I don’t need output normalization), in the end I don’t need to activate normalization (mapminmax or mapstd) in the neural network properties. Is this true?
Thank you again.
Greg's comment to Jack's "Answer" moved here:
% As I mentioned, my output is binary (0 and 1) so I think I don’t need
% unit vector in a binary classification.
You don't need it, but it is more convenient because of special functions like IND2VEC and VEC2IND.
% 1. About transfer functions I’m using tansig for hidden layer(s) and
% purelin for output layer. So based on your experience {tansig,logsig} or
% {tansig,softmax} is better in binary classification case?
SOFTMAX for mutually exclusive classes (e.g., penny, nickel, dime).
LOGSIG for non-mutually exclusive classes (e.g., tall, dark, handsome).
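For reference, switching patternnet's output transfer function is a one-line override (a sketch; the default output transfer function varies by MATLAB release, and the hidden-layer size here is illustrative):

```matlab
net = patternnet(10);                     % 10 hidden units, illustrative
net.layers{2}.transferFcn = 'softmax';    % mutually exclusive classes
% net.layers{2}.transferFcn = 'logsig';   % non-mutually-exclusive class labels
```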
% 2. After normalizing data with “zscore”, how you find outlier with % plotting and “minmax”?
(x - mean(x))/std(x) > threshold of your choice
%(Considering all features together or analyze it separately?)
It only takes one of the many features to make a sample an outlier.
% How can I modify outliers instead of deleting outlier samples?
Replace xout with xmean ± std*threshold, matching the sign of the excursion (see above).
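A sketch of that replacement ("winsorizing" the outliers rather than deleting them; the threshold is an illustrative assumption, and x is samples-in-rows):

```matlab
% x is N-by-21; clip each feature's outliers to mean +/- thr*std.
thr  = 4;                               % illustrative threshold
xmod = x;
for j = 1:size(x, 2)
    mu = mean(x(:, j));  sd = std(x(:, j));
    xmod(x(:, j) > mu + thr*sd, j) = mu + thr*sd;   % clip high outliers
    xmod(x(:, j) < mu - thr*sd, j) = mu - thr*sd;   % clip low outliers
end
```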
% Now I’m using k-means clustering, first create a cluster for
% 0-class and 1-class separately and after that remove for instance 10% of
% samples that are beyond the center in every cluster. Is this a good
% approach?
No. Once outliers are removed, just use patternnet. Multiple designs will reveal points that are consistently misclassified.
% 3. You mentioned:
%
% Standardize input data and regression output data 2. Remove or modify
% outliers 3. Accept the default training function normalizations.
%
% So first I must manually standardize data and active normalization in
% neural network options. I think when I normalize data separately before
% inserting it for training network, in addition when my problem is a
% binary classification (I don’t need output normalizing) so finally I
% don’t need active normalization (mapminmax or mapstd) in neural network
% properties. Is this true? ....
If you are going to check for outliers, I find this the least complicated:
1. Standardize inputs and remove or modify outliers.
2. Accept the additional default normalization/denormalization of patternnet.
Jack's second so-called "Answer" moved here:
Thank you again Greg.
I wouldn't use k-means clustering on top of other outlier detection techniques; k-means is simply one option for outlier detection in my system, besides your proposed technique, so I can choose either of the two. Given the above discussion, what is your opinion of k-means clustering?
You mentioned that I can use ‘(x - mean(x))/std(x) > threshold of your choice’, so your proposed technique does not consider all inputs (in my case, 21 variables) simultaneously, and I can analyze one feature at a time. Is this true?
Thanks.
No. You consider all of them at once using matrix coding. I consider a 21-dimensional vector an outlier if one or more of its components is an outlier.
All MATLAB code is matrix based. So if you find one or more outlying components in a column of an input or target matrix, either modify or delete that column. Any target column corresponding to a deleted input column must also be deleted, and vice versa.
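The matrix coding described above might look like this (a sketch; the threshold is an illustrative assumption, and the columns-are-samples convention matches the toolbox):

```matlab
% x is 21-by-N (columns = samples); t is the matching target matrix.
z = zscore(x, 0, 2);              % standardize each feature (row) separately
thr = 4;                          % illustrative threshold
bad = any(abs(z) > thr, 1);       % a column is an outlier if ANY component is
x(:, bad) = [];                   % delete outlier input columns...
t(:, bad) = [];                   % ...and the corresponding target columns
```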


Asked: 3 Apr 2014 · Last commented: 11 Apr 2014
