Your performance numbers are scale dependent. They should be normalized by the performance of a model so NAIVE that it outputs a constant value, REGARDLESS of the input. The constant that minimizes MSE is the mean of the targets, and the resulting minimum MSE is the mean target variance.
Then, if e = t - y is the error between target and output, the normalized mean-square error and the resulting R-squared are given by

NMSE = mse(e)/vart
Rsq = 1 - NMSE
Rsq is the fraction of the mean target variance that is modeled by the net (see the Wikipedia article on R-squared).
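A minimal sketch of these formulas in Python/NumPy (the discussion above is MATLAB-flavored; mapping mse(e) to np.mean(e**2) and vart to np.var(t) is my assumption here):

```python
import numpy as np

def nmse(t, y):
    """Normalized MSE: MSE of the error divided by the mean target variance.
    A naive model that always outputs mean(t) scores NMSE = 1 exactly,
    because np.var uses the same biased (1/N) estimate as the numerator."""
    e = t - y
    return np.mean(e**2) / np.var(t)

def rsq(t, y):
    """Fraction of the target variance modeled by the net."""
    return 1.0 - nmse(t, y)
```

For example, a constant-mean predictor gives nmse = 1 (so Rsq = 0), and a perfect predictor gives Rsq = 1.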
My training set design goal for regression and classification is
NMSEtrn = MSEtrn/varttrn <= 0.01
(i.e., 99% of the training target variance is modeled), whereas for open-loop feedback time-series training it is 0.005, with the crossed-finger hope that closing the feedback loop doesn't raise it above 0.01.
Multiple random-initial-weight designs are ranked by NMSEval, the slightly biased NMSE estimate for the validation set. The completely unbiased estimate for current and unseen nontraining data is given by NMSEtst. Finally, a design is usually chosen w.r.t. NONTRAINING NMSE.
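The multiple-random-initialization ranking could be sketched like this in Python/NumPy; the random-feature train_net stand-in, the data split sizes, and H = 10 are illustrative assumptions, not the actual training procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def nmse(t, y):
    e = t - y
    return np.mean(e**2) / np.var(t)

# Toy data and trn/val/tst split (sizes are assumptions for the sketch)
x = rng.uniform(-1, 1, 300)
t = np.sin(np.pi * x) + 0.05 * rng.standard_normal(300)
idx = rng.permutation(300)
trn, val, tst = idx[:210], idx[210:255], idx[255:]

def train_net(H, seed):
    """Stand-in for training one net with H hidden nodes from one random
    weight initialization: a fixed random tanh hidden layer plus a
    least-squares output layer."""
    r = np.random.default_rng(seed)
    W = r.standard_normal((H, 1))
    b = r.standard_normal(H)
    def hidden(xx):
        return np.tanh(xx[:, None] @ W.T + b)
    beta, *_ = np.linalg.lstsq(hidden(x[trn]), t[trn], rcond=None)
    return lambda xx: hidden(xx) @ beta

# 10 random-initial-weight designs for one H, ranked by validation NMSE;
# the test-set NMSE of the winner is the unbiased performance estimate.
candidates = [train_net(H=10, seed=s) for s in range(10)]
best = min(candidates, key=lambda f: nmse(t[val], f(x[val])))
print("NMSEtrn:", nmse(t[trn], best(x[trn])))
print("NMSEval:", nmse(t[val], best(x[val])))
print("NMSEtst:", nmse(t[tst], best(x[tst])))
```

Only NMSEval is used to pick the winner, so NMSEval is slightly optimistic, while NMSEtst remains an unbiased estimate for unseen data.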
Now I usually design 10 nets for each value of H (the number of hidden nodes) considered. My selection is based on the nontraining values Rsqval and Rsqtst (regardless of Rsqtrn), provided Nval and Ntrn are sufficiently large (I prefer
However, sometimes I'm forced to accept
Hope this helps.
Thank you for formally accepting my answer
Greg
P.S. Many of my earlier posts use the notations MSE00 and MSE00a for the biased and unbiased estimates of the average target variance.