Making use of multiple hard drives to avoid I/O bottlenecks?
Science Machine
on 20 Jun 2022
Commented: Science Machine
on 21 Jun 2022
I am reading in a lot of data (1.5 TB), so I would like to minimize the time spent on disk I/O.
- I have 4 NVMe drives (2 TB each)
- 'a lot' of RAM (okay, a lot = 128 GB, which may not be that much in fact)
- I have data that I would like to postprocess in MATLAB
- I am using parfor loops to read the data
Typically, I would put all the data on one drive. Even though NVMe drive I/O is quite quick (~4000 MB/s), my question is:
- Would it make sense to distribute the data to be postprocessed across all 4 drives, and have MATLAB read it from there, in order to minimize I/O bottlenecks?
Accepted Answer
Walter Roberson
on 20 Jun 2022
You should ideally distribute the data to different drives and distribute the drives to different controllers.
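As a rough sketch of what that could look like with parfor: the mount points, the `*.bin` pattern, and the byte count used as a stand-in for real post-processing are all assumptions for illustration, not anything from the original setup. The file list is interleaved across drives so that neighbouring iterations tend to touch different drives; parfor does not promise any particular scheduling, so this only makes concurrent reads from several drives more likely.

```matlab
% Hypothetical mount points for the four NVMe drives; adjust to your system.
driveRoots = {'/mnt/nvme0', '/mnt/nvme1', '/mnt/nvme2', '/mnt/nvme3'};

% List the data files on each drive (assumes *.bin files in each root).
filesPerDrive = cell(size(driveRoots));
for d = 1:numel(driveRoots)
    listing = dir(fullfile(driveRoots{d}, '*.bin'));
    filesPerDrive{d} = arrayfun(@(f) fullfile(f.folder, f.name), ...
                                listing, 'UniformOutput', false);
end

% Interleave round-robin: drive 1 file 1, drive 2 file 1, ..., drive 1 file 2, ...
maxCount = max(cellfun(@numel, filesPerDrive));
files = {};
for k = 1:maxCount
    for d = 1:numel(filesPerDrive)
        if k <= numel(filesPerDrive{d})
            files{end+1} = filesPerDrive{d}{k}; %#ok<SAGROW>
        end
    end
end

% Read the files in parallel; each iteration reads one whole file.
nBytes = zeros(numel(files), 1);
parfor i = 1:numel(files)
    fid = fopen(files{i}, 'r');
    assert(fid ~= -1, 'Could not open %s', files{i});
    raw = fread(fid, Inf, '*uint8');
    fclose(fid);
    nBytes(i) = numel(raw);   % placeholder for the real post-processing
end
```

Whether this actually beats a single drive depends on how the drives are attached, so it is worth timing both layouts on a subset of the data first.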
However, you might be constrained by your architecture. I seem to recall having read about some architectures that could only handle three full-width PCIe links, so the fourth one had to run at half speed. You also need to take into account that the other drives on your system will need some lanes. PCIe cannot allocate (for example) 12 lanes to one device and 2 to each of two other devices for a total of 16: if I recall correctly, you can only allocate powers of 2, so the first device could get 8 and the other two 2 each, with the remaining 4 unused.
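The arithmetic in that example can be checked in a couple of lines. This follows the power-of-two rule as recalled above, which is an assumption here rather than a statement of what any particular chipset does; `requested` and `totalLanes` are just the numbers from the example.

```matlab
requested  = [12 2 2];   % lanes each device would ideally get (example above)
totalLanes = 16;         % lanes available in the example

% Round each request down to the nearest power of two.
granted = 2 .^ floor(log2(requested));
unused  = totalLanes - sum(granted);

fprintf('granted: %s, unused: %d\n', mat2str(granted), unused);
% prints: granted: [8 2 2], unused: 4
```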
You might be interested in some of the Linus Tech Tips videos, as some of them show how difficult it can be to max out drives.
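If you want to see how close your own setup gets to the rated figure, a minimal timing sketch along these lines may help. The file path and chunk size are assumptions; point `testFile` at a large file on the drive you want to test. Note that the operating system's file cache can make repeated runs on the same file look much faster than the drive really is.

```matlab
testFile   = '/mnt/nvme0/sample.bin';  % hypothetical path; use a real large file
chunkBytes = 64 * 2^20;                % read in 64 MiB chunks (arbitrary choice)

fid = fopen(testFile, 'r');
if fid == -1
    fprintf('Cannot open %s; point testFile at a real file.\n', testFile);
else
    totalBytes = 0;
    t = tic;
    while true
        chunk = fread(fid, chunkBytes, '*uint8');
        if isempty(chunk), break; end      % end of file reached
        totalBytes = totalBytes + numel(chunk);
    end
    elapsed = toc(t);
    fclose(fid);
    fprintf('Read %.1f MB in %.2f s (%.1f MB/s)\n', ...
            totalBytes / 1e6, elapsed, totalBytes / 1e6 / elapsed);
end
```

Running the same measurement inside a parfor over files on one drive versus files spread over four drives would show whether the split helps on your hardware.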
The reviews seem to say that in the mass-market pro segment these days (as opposed to very low-volume specialty manufacturers), the Samsung 9x0 drives are close to the best for read rates (not always the best for write rates compared with some of the smaller manufacturers).
While I am on the topic: anyone using external enclosures and needing high performance should look seriously at some of the Thunderbolt 4 NAS or DAS enclosures. The performance ratings for the well-designed enclosures are sometimes several times what you would get from the low-cost mass-market drives.
3 Comments
Walter Roberson
on 21 Jun 2022
If the cluster is cloud computing that is emulating drives over some internal layer, then that is probably something that would require getting a specific service agreement for separate hardware.
If the cluster can give you multiple drives each on separate controllers, you would typically prefer that. If you are using spinning platter drives, then two drives per controller is commonly the most efficient.