How to train an LSTM net on a very large dataset?

Davey Gregg on 2 Mar 2021
Answered: Udit06 on 22 Feb 2024
I am trying to train a standard LSTM net and I have about 225 GB of data that I want to feed it. The data comes from a binary neural recording file containing 32 channels. I pull the 60 seconds of data before each timestamp that I am trying to predict and slice it into 1-second chunks, giving 32x1000 arrays since my sampling frequency is 1 kHz. My plan is to train the network with 60 classes, one for each second, and have the net output its confidence on when my event of interest may occur. I got halfway decent results with this method using a small subset of the data that fits into RAM, but I really want to give it the whole thing so it can see more possible autocorrelations hidden in the full data.
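For reference, a minimal sketch of the slicing described above, assuming the 60-second window before one event is already in memory as a 32-by-60000 matrix named win sampled at 1 kHz (win and the label convention are placeholders, not taken from the original post):
fs = 1000;                                          % sampling rate in Hz
XTrain = cell(1, 60);
for k = 1:60
    XTrain{k} = double(win(:, (k-1)*fs+1 : k*fs));  % one 32x1000 sequence per second
end
YTrain = categorical(60:-1:1);                      % class = seconds until the event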
What I am during currently is reading the data from the binary file, transforming the data into cell arrays and storing them into a matObj using matfile(). The matObj variables are structured in the format needed as inputs for the training function as a 1xn cell array of sequences and a 1xn catagorial array of labels. Calling net = trainNetwork(matObj.XTrain,matObj.YTrain,layers,options) works well for smaller datasets but MatLab still loads the data into ram and if I try to use my 225Gb matObj file it throws "Out of memory". So I don't know a good way to pass this data to the training function. Filedatastore only seems to work with a large collection of smaller files and I really don't want to have each sequence saved as its own file to use with file datastore. It takes long enough to save all the sequences into one file and that is much better than having a folder with 10K+ files.
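A sketch of that matfile approach, with guessed file and variable names (bigTrainingData.mat, XTrain, YTrain), which also shows where the memory problem comes from:
save('bigTrainingData.mat', 'XTrain', 'YTrain', '-v7.3');   % -v7.3 is required for variables over 2 GB and for partial reads later
matObj = matfile('bigTrainingData.mat');
% Referencing matObj.XTrain without indices loads the entire variable into RAM,
% which is what triggers "Out of memory" at 225 GB:
net = trainNetwork(matObj.XTrain, matObj.YTrain, layers, options);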

Answers (1)

Udit06 on 22 Feb 2024
Hi Davey,
If you don't want to create multiple smaller files, you can write a custom datastore in MATLAB that reads the data directly from the large file in chunks small enough to fit in memory. By iterating over these smaller chunks, the network can learn from a large dataset without all of the data being loaded into memory at once. You can also leverage Parallel Computing Toolbox to scale up the network training.
Refer to the MathWorks documentation on developing a custom datastore and on scaling up deep learning in parallel. You can also refer to the documentation on training deep learning models with big data.
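A minimal sketch of such a custom datastore is below. It assumes the sequences were saved into a single version 7.3 MAT-file as a 1xN cell array XTrain and a 1xN categorical array YTrain, as in the question, and it reads a fixed number of sequences per call to read so that only one chunk is held in memory at a time (the class name, file name, and chunk size are placeholders):
classdef ChunkedSequenceDatastore < matlab.io.Datastore
    % Reads sequences and labels in chunks from a single -v7.3 MAT-file
    % so that only one chunk is ever held in memory.
    properties
        MatObj        % matfile handle to the large MAT-file
        Labels        % full 1xN categorical label vector (small, kept in RAM)
        NumSequences  % total number of stored sequences
        ChunkSize     % number of sequences returned per call to read
        NextIndex     % index of the next sequence to read
    end
    methods
        function ds = ChunkedSequenceDatastore(matFileName, chunkSize)
            ds.MatObj = matfile(matFileName);
            ds.Labels = ds.MatObj.YTrain;      % labels are small enough to load once
            ds.NumSequences = numel(ds.Labels);
            ds.ChunkSize = chunkSize;
            reset(ds);
        end
        function tf = hasdata(ds)
            tf = ds.NextIndex <= ds.NumSequences;
        end
        function data = read(ds)
            last = min(ds.NextIndex + ds.ChunkSize - 1, ds.NumSequences);
            idx = ds.NextIndex : last;
            X = ds.MatObj.XTrain(1, idx);      % load only this chunk of sequences from disk
            Y = ds.Labels(idx);
            % trainNetwork expects each read to return predictors in the
            % first column and responses in the second.
            data = [X(:), num2cell(Y(:))];
            ds.NextIndex = last + 1;
        end
        function reset(ds)
            ds.NextIndex = 1;
        end
    end
    methods (Hidden = true)
        function frac = progress(ds)
            frac = (ds.NextIndex - 1) / ds.NumSequences;
        end
    end
end
It could then be passed straight to the training function:
ds = ChunkedSequenceDatastore('bigTrainingData.mat', 128);
net = trainNetwork(ds, layers, options);
Two caveats on this sketch: if your training options request shuffling, you may also need to implement the matlab.io.datastore.Shuffleable mixin (or set 'Shuffle' to 'never'); and if partial reads of the stored cell array turn out to be slow, storing the sequences as one 32x1000xN numeric array and indexing the third dimension is a layout that matfile can definitely read partially and efficiently. Check the "Datastores for Deep Learning" page for the exact output format trainNetwork expects for sequence input.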
I hope it helps.
