How can I optimize my code and system configuration?

Ive J
Ive J on 15 Jul 2019
Commented: Ive J on 15 Jul 2019
Hi
I am working with big datasets (> 3 GB each) and mostly use built-in functions from the Statistics and Machine Learning Toolbox (e.g. supervised learning methods). To handle the datasets better, I usually store them (my raw data are in CSV or TXT format) as chunks of MAT files (mainly using datastore); a rough sketch of this pipeline follows the list below. In summary, I have realized that the most time-consuming parts of my scripts are the following:
  1. Loading the data (either new raw datasets or chunks of data).
  2. Passing the data structures to different in-house functions.
  3. Converting between data types (e.g. converting a ~500,000 x 1 character array to numeric using str2double).
  4. Performing the required statistical analyses (e.g. Regression analysis).
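
For reference, this is roughly the kind of pipeline I mean (a minimal sketch; the file names, variable names, and chunk size are placeholders, not my actual data):

% Read the raw CSV in chunks and save each chunk as a MAT file
ds = tabularTextDatastore('rawdata.csv');    % placeholder file name
ds.ReadSize = 500000;                        % rows per chunk

k = 0;
while hasdata(ds)
    k = k + 1;
    T = read(ds);                            % each chunk arrives as a table
    vals = str2double(T.someColumn);         % assuming this column is read as text; this conversion is one of the slow steps
    save(sprintf('chunk_%03d.mat', k), 'vals');
end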
For hardware, I am using an Intel Core i9-7980XE processor with 128 GB of DDR4 RAM, and a 2 TB SSD (560 MB/s).
However, I have several doubts and questions, which I hope anyone can help me with:
  1. How can I improve the loading time of my data? For instance, is it better to save the chunks in a file format other than MAT? Currently, each variable in the raw data is stored as a vector.
  2. Since I need to convert between data types (I originally saved all data as character arrays), is it better to save doubles as double, characters as character, and so on? Saving character arrays turned out to be faster, which is why I saved everything as character arrays; see the timing sketch after this list.
  3. While my code/functions run, neither CPU nor RAM usage appears to be the limiting factor, so I wonder: what is the main rate-limiting step in handling big data? Should I upgrade my SSD?
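
For question 2, the kind of comparison I have in mind looks roughly like this (a sketch with made-up data, not my real variables):

x = rand(1e6, 1);                    % numeric vector (made-up data)
c = cellstr(num2str(x));             % the same values stored as character data

tic; save('as_double.mat', 'x'); tSaveD = toc;
tic; save('as_char.mat',   'c'); tSaveC = toc;

tic; load('as_double.mat'); tLoadD = toc;
tic; load('as_char.mat');   tLoadC = toc;

fprintf('save: %.2f s (double) vs %.2f s (char)\n', tSaveD, tSaveC);
fprintf('load: %.2f s (double) vs %.2f s (char)\n', tLoadD, tLoadC);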
Many thanks in advance for your help.

Answers (1)

Le Vu Bao
Le Vu Bao on 15 Jul 2019
Edited: Le Vu Bao on 15 Jul 2019
I'm interested in the same problem.
In my experience, I used to store my data in *.mat (-v7) files. I don't know whether this is a general rule, but in my case, storing and loading any big struct or cell array in a *.mat (-v7) file took a long time (even though that format supports loading individual variables). So I split my data into smaller files, stored them as *.mat (-v6), tried to avoid storing any struct or cell arrays (although I had to use them in my case), and wrote a function that loads only the part I need.
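Something like this (a rough sketch; the file and variable names are made up):

% Save plain numeric vectors as small -v6 files, avoiding struct and cell:
price = rand(1e6, 1);                        % made-up data
save('price_chunk01.mat', 'price', '-v6');

% load() with a variable name reads only that variable from the file:
S = load('price_chunk01.mat', 'price');
price = S.price;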
  1 Comment
Ive J
Ive J on 15 Jul 2019
Hi Le,
Thanks for your response. I do not use the '-v7' format because, as you mentioned, it is time-consuming. I store each variable as a separate vector and put multiple vectors in separate MAT files, so that each vector can be loaded more efficiently. Still, because of the sheer volume of data, it remains time-consuming.
I have now replaced str2double with str2doubleq (from the File Exchange) to accelerate string-to-numeric conversion; however, that solves only one part of my problem!
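The conversion step now looks roughly like this (a sketch; the file and variable names are placeholders, and I am assuming str2doubleq takes the same input as str2double does):

S = load('chunk_001.mat', 'rawText');            % rawText: cell array of character vectors

tic; v1 = str2double(S.rawText);  tBase = toc;   % built-in conversion
tic; v2 = str2doubleq(S.rawText); tFast = toc;   % File Exchange MEX replacement

fprintf('str2double: %.2f s, str2doubleq: %.2f s\n', tBase, tFast);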

