Deep Learning Toolbox - How to replicate training/testing results for reproducibility purposes
Dear All,
I'm a newbie to MATLAB but interested in some aspects of the Deep Learning Toolbox.
For example, regarding reproducibility of the training/testing results of a classification/regression network built with the Deep Learning Toolbox: once a DAGNetwork (or a LayerGraph), XData, YData, and a set of trainingOptions have been defined, is there a way (or an available feature) to perform the training so that, if I repeat the whole process from the beginning, I get exactly the same results?
This matters because, if you obtain a set of results and want to make them available to others, together with the code you used, you should be able to guarantee that those results can be reproduced in a way that does not depend on the computing environment (e.g. CPU, GPU, multi-GPU, parallel, and so forth).
For other applications, it is well known that, in MATLAB, before reproducing a random process you must call rng('default') or a similar instruction to reset the pseudo-random number generator, so that every subsequent random sequence is replicated exactly.
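Just to make the point concrete, here is a minimal illustration of what I mean by resetting the generator (the seed and array size are arbitrary):
rng('default');      % reset the pseudo-random number generator to its default state
a = rand(1, 3);      % first random draw
rng('default');      % reset again to the same state
b = rand(1, 3);      % second random draw
isequal(a, b)        % returns logical 1: the sequences are identical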
Maybe I'm wrong, but it seems to me that there is no straightforward way to ensure the same reproducibility of results when using the Deep Learning Toolbox.
This is of paramount importance in parametric studies, where we change a few parameters (e.g. learning rate, learn-rate drop period, or solver type in the training options) and want to evaluate how those changes, and only those changes, affect the overall performance.
I believe it is very important. I've read many threads and messages, and it seems that the overall performance changes (slightly, but not negligibly) if you train a given network (for example a CNN with/without dropout layers and/or batch normalization layers) for the expected number of epochs and then repeat the whole process from the beginning, as many times as you want. This difference in final training performance is clearly visible in trainingInfo, and it is still present even if you call rng('default') before the trainNetwork command. There also seems to be a difference between using the CPU (no noticeable changes between runs) and the GPU as the execution environment, but it is equally clear that we can't talk about 'real' deep learning applications without using GPUs. A minimal sketch of the kind of experiment I mean is shown below.
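The following sketch only illustrates the setup; the layer array, option values, and the XData/YData variables are placeholders standing in for my actual data and network:
rng('default');                          % reset the CPU generator before training
layers = [ ...
    imageInputLayer([28 28 1])
    convolution2dLayer(3, 8, 'Padding', 'same')
    batchNormalizationLayer
    reluLayer
    fullyConnectedLayer(10)
    softmaxLayer
    classificationLayer];
options = trainingOptions('sgdm', ...
    'MaxEpochs', 10, ...
    'ExecutionEnvironment', 'gpu');      % on the GPU, repeated runs still differ
[net, trainingInfo] = trainNetwork(XData, YData, layers, options);
% Repeating this whole block from the beginning still gives slightly
% different values in trainingInfo from one run to the next.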
It has been explained that the problem occurs in some networks when training is accelerated on the GPU (which is also the reason we buy GPUs) and that it is due to the backward-convolution algorithms used via cuDNN, whose implementation is non-deterministic for some algorithms, whereas the cuBLAS alternative provides deterministic behaviour.
Question: Is there a way, in MATLAB, to invoke a different (deterministic) backward-convolution algorithm when using cuDNN, in order to honor reproducibility?
Or, alternatively, is it possible to use cuBLAS instead of cuDNN to perform the backward convolution?
For example, Theano (see the link below) also uses cuDNN for the backward convolution (like many other deep learning frameworks), but it offers the option to invoke either a deterministic or a non-deterministic version of the cuDNN algorithm. The deterministic version is obviously slower, but it ensures reproducibility and is still orders of magnitude faster than a CPU-only solution!
http://deeplearning.net/software/theano/library/sandbox/cuda/dnn.html
Why not use the same approach?
Similar threads (for MATLAB) with similar responses and no effective solution (other than using the CPU, which is simply not a serious solution to the problem):
https://www.mathworks.com/matlabcentral/answers/449775-matlab-demo-merchdata-reproducibility-problem
I hope I have made my question clear.
Answers (1)
Gabija Marsalkaite
on 9 Apr 2019
As you already mentioned, reproducibility strongly depends on the environment. At the moment there is no way to choose which GPU algorithm CUDA invokes, and that introduces some differences between runs. However, the variation can be reduced by setting both the CPU and GPU random number generator states (e.g. rng(0, "threefry") and gpurng(0, "threefry") respectively). For parallel environments, rng has to be set on each worker. A sketch of this is given below.
In addition to that, I have created an enhancement request which will be considered by the developers.
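A minimal sketch of this setup, assuming Parallel Computing Toolbox is available; the seed value 0 is arbitrary, and the spmd block is just one possible way to set the generator state on every worker of an open pool:
% Seed both generators before training to reduce run-to-run variation
rng(0, 'threefry');       % CPU random number generator
gpurng(0, 'threefry');    % GPU random number generator

% For parallel training, also set the generator state on each worker
% (this assumes a parallel pool is already open, or will be opened here)
spmd
    rng(0, 'threefry');
end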