How to control which GPUs and CPUs get which tasks during multiple calls to trainNetwork?

I am working on a machine with 40 CPU cores and 4 GPUs. I need to train a large number of shallow LSTM neural networks (~500,000), and I would like to use my compute resources as efficiently as possible.
Here are the options I've come up with:
1) parpool('local') gives 40 workers max, which matches the number of CPU cores available. Apparently parpool('local') does not provide access to the GPUs - is this correct? I can then use spmd to launch separate instances of trainNetwork across individual CPUs on my machine, and this runs 40 such instances at a time (a rough sketch of this pattern is below, after my questions).
I have three questions about this:
First, is there a way to use both the GPUs and CPUs as separate labs (i.e., with different labindex values) in my spmd loop? Why do I not have a total of 44 available workers from parpool?
Second, is there a way to assign more than one CPU to a particular lab? For example, could I divide my 40 cores into 8 groups of 5 and deploy a separate instance of trainNetwork to each of the 8 groups?
Third, given that I am using LSTMs, my 'ExecutionEnvironment' options are 'gpu', 'cpu', and 'auto', but it appears that the 'cpu' option uses more than one CPU core at a time: the timing for each task increases by a factor of about 6 when I use spmd versus running only one instance of trainNetwork (with 'ExecutionEnvironment' = 'cpu') at a time. This leads me to believe that a single instance of trainNetwork with 'ExecutionEnvironment' = 'cpu' uses more than one CPU core. Is this correct?
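For reference, the spmd pattern I'm describing in option 1 looks roughly like this (XTrain, YTrain, and layers are placeholders for my per-network training data and LSTM layer array, not actual variable names from my code):

% Sketch of option 1: one trainNetwork instance per CPU worker (lab).
parpool('local');                        % default: one worker per physical core

opts = trainingOptions('adam', ...
    'ExecutionEnvironment', 'cpu', ...
    'Verbose', false);

spmd
    % Each lab trains its own subset of the ~500,000 networks.
    for k = labindex:numlabs:numel(XTrain)
        net = trainNetwork(XTrain{k}, YTrain{k}, layers, opts);
        % ... save or otherwise collect the trained network for index k
    end
end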
2) I can access the GPUs individually using gpuDevice, and I can run 4 instances of trainNetwork simultaneously on my 4 GPUs. This works well, with effectively linear speedup compared to using only one GPU at a time, but apparently it does not take advantage of my CPUs.
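That pattern looks roughly like this (again with XTrain, YTrain, and layers as placeholders); each worker binds itself to one GPU and a parfor loop farms the networks out:

% Sketch of option 2: one worker per GPU, each bound to its own device.
pool = parpool('local', gpuDeviceCount);
spmd
    gpuDevice(labindex);                 % worker n uses GPU n
end

gpuOpts = trainingOptions('adam', ...
    'ExecutionEnvironment', 'gpu', ...
    'Verbose', false);

parfor k = 1:numel(XTrain)
    net = trainNetwork(XTrain{k}, YTrain{k}, layers, gpuOpts);
    % ... save the trained network for index k
end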
Ideally, I'd like a way to (1) test scaling across multiple CPUs for my particular trainNetwork problem, and (2) run multiple parallel instances of trainNetwork that use all of my hardware. The best option seems to be to let the GPUs each take a number of the trainNetwork instances in parallel, and then to deploy groups of CPUs (with the optimal group size currently unknown) to handle a number of the trainNetwork instances.
Is there a way to do this?
Thank you,
Grey

Accepted Answer

Joss Knight on 28 Jan 2019
The computation on the GPU is so much faster than on the CPU for a typical Deep Learning example that there are only disadvantages to getting the CPU cores involved for the most intensive parts of the computation. Of course the CPU is being used, for all the MATLAB business logic, but that is generally low overhead and not suitable for GPU execution.
When you train on the CPU only, the heavy computation is heavily vectorized and multithreaded, so there is a good chance that moving to parallel execution won't give much of an additional advantage. Parallel execution for multi-CPU comes more into its own when you go multi-node, i.e. when you have a cluster of multiple machines.
You can control how much multi-threading MATLAB does using maxNumCompThreads. You can run this inside parfevalOnAll, perhaps, to set the multi-threading level on your pool before training. That way you may find a good balance between numbers of MATLABs and number of threads for your particular network. You may indeed find that for your network, there is an ideal pool size for which training in parallel is effective even on a single machine.
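For example, a minimal sketch of that idea (the 8-worker / 5-thread split of 40 cores is purely illustrative, and XTrain, YTrain, layers, and cpuOpts are hypothetical stand-ins for your training data, network, and CPU training options):

% Sketch: trade worker count against per-worker compute threads and time it.
nWorkers = 8;                            % illustrative split of 40 cores
nThreads = 5;
pool = parpool('local', nWorkers);

% Set the multi-threading level on every worker before training.
f = parfevalOnAll(pool, @maxNumCompThreads, 0, nThreads);
wait(f);

parfor k = 1:numel(XTrain)
    net = trainNetwork(XTrain{k}, YTrain{k}, layers, cpuOpts);
    % ... save the trained network for index k
end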
  5 Comments
Joss Knight on 31 Jan 2019
The answers to these questions are in the documentation for the features you are using. I dare say you may benefit from reading around a bit to get the feel for things like parpool, local clusters, Parallel Preferences and the Cluster Profile Manager.
The default local cluster settings are to give you one worker per physical core and not allow you more; apparently you have 40 physical cores (lucky you!). You also get one compute thread per worker. We do this to prevent users getting a terrible experience with parallel language and blaming us. However, it's perfectly possible for you to create a new local cluster profile (for which I suggest you consult the documentation) with no limits on the maximum number of workers or number of compute threads.
The GPU is not like a CPU; think of it as a maths coprocessor. Under the hood, MATLAB operates on gpuArrays by launching GPU kernels, one command at a time. You can't use the GPU without also using the CPU (to launch kernels, run the MATLAB desktop and interpreter, do business logic, etc.), so it makes sense not to oversubscribe your CPU cores by pretending that your CPU is entirely idle while the GPU does work. If you want to anyway, create a new Cluster Profile and give it a go.
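For example, one programmatic way to do what the Cluster Profile Manager does (the profile name and the worker count of 44 are purely illustrative):

% Sketch: clone the default local profile and lift the worker limit.
c = parcluster('local');
c.NumWorkers = 44;                       % illustrative: oversubscribe the 40 cores
saveAsProfile(c, 'localOversubscribed');

pool = parpool('localOversubscribed', 44);
% Per-worker threading can then be tuned with maxNumCompThreads via
% parfevalOnAll, as described in the answer above.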
Grey Nearing on 31 Jan 2019
Here is the part that answers my question:
"You can't use the GPU without also using the CPU ..."
Thanks. In hindsight it's obvious.


