Pre-indexing a tall datastore

5 views (last 30 days)
Michael
Michael on 30 Aug 2019
Answered: Rajani Mishra on 13 Feb 2020
Hello,
I'd like to generate a set of indicies for my tall datastore in advance so, when I want to select rows, it will be fast.
Here is what I came up with.
I expect there is a much better way. Please help me improve it.
FYI, the datastore will be several million rows by four columns. I ran the following test on a very small file.
Thanks
Michael
Code:
% CREATE TALL DATASTORE
ds = datastore(fname);
ds.VariableNames = {'DATE','TICKER','FIELD','VALUE'};
ds.SelectedFormats(1:3) = {'%{MM/dd/uuuu}D','%C','%C'};
tds = tall(ds);
gDATE = unique(tds.DATE);
gTICKER = unique(tds.TICKER);
gFIELD = unique(tds.FIELD);
% FIND UNIQUE LABELS
[uniqueDates, uniqueTickers, uniqueFields] = gather(gDATE,gTICKER,gFIELD);
% FIND THE INDEX OF EACH UNIQUE LABEL
for iTicker = 1:length(uniqueTickers)
gTicker = find(tds.TICKER==uniqueTickers(iTicker));
idxTicker{iTicker} = gather(gTicker);
end
for iField = 1:length(uniqueFields)
gField = find(tds.FIELD==uniqueFields(iField));
idxField{iField} = gather(gField);
end
% TABULATE
tTickerIndex = array2table([cellstr(uniqueTickers) idxTicker.'], ...
'VariableNames',{'TICKER','INDEX'});
tTickerIndex.TICKER= categorical(tTickerIndex.TICKER)
tFieldIndex = array2table([cellstr(uniqueFields) idxField.'], ...
'VariableNames',{'FIELD','INDEX'})
tFieldIndex.FIELD = categorical(tFieldIndex.FIELD)
% DISPLAY FIRST 10 VALUES FOR EQY_DVD_YLD_IND
out = tds(tFieldIndex.INDEX{tFieldIndex.FIELD=='EQY_DVD_YLD_IND'}(1:10),:);
gather(out)
Output
tTickerIndex = 1×2 table
TICKER INDEX
_______________ ________________
DFIVX US EQUITY [34112×1 double]
tFieldIndex = 6×2 table
FIELD INDEX
________________________________ _______________
DAY_TO_DAY_TOT_RETURN_GROSS_DVDS [6423×1 double]
DAY_TO_DAY_TOT_RETURN_NET_DVDS [6423×1 double]
DIVIDEND_INDICATED_YIELD [6242×1 double]
EQY_DVD_YLD_IND [6242×1 double]
EQY_DVD_YLD_IND_NET [6242×1 double]
FUND_TOTAL_ASSETS [2540×1 double]
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 2: Completed in 1.1 sec
- Pass 2 of 2: Completed in 0.98 sec
Evaluation completed in 2.8 sec
ans = 10×4 table
DATE TICKER FIELD VALUE
__________ _______________ _______________ ______
06/01/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.1484
06/02/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.1611
06/03/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.1718
06/06/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.1654
06/07/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.159
06/08/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.1442
06/09/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.1484
06/10/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.138
06/13/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.1277
06/14/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.1277

Answers (2)

Michael
Michael on 1 Sep 2019
Apparently, indexing is completely ineffective. My data is about 950 million rows by 4 columns
ds = datastore('tallFinal.csv');
ds.SelectedFormats={'%q','%C','%C','%s'};
ds.VariableNames = {'DATE','TICKER','FIELD','VALUE'};
tds = tall(ds);
i = int16(intersect(idxTicker,idxField));
size(i)
ans = 225 1
temp = tds(i,:);
temp = gather(temp);
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 3: Completed in 23 min 0 sec
- Pass 2 of 3: 33% complete

Rajani Mishra
Rajani Mishra on 13 Feb 2020
I found below link related to indexing in tall array, find it below:
You can also consider creating a custom datastore and process data in smaller groups. For creating custom datastore please refer below:
Hope this helps!

Products


Release

R2019a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!