Access to data without reading all of it

Hi all,
I have two data sets. One is 10 GB and the other is 2 TB. They are both txt files.
Say in the small data set I have the variables timestamp, ID and x, and in the big one I have timestamp, ID and y. The numbers of unique IDs and timestamps are much higher in the big data set.
For each observation in the small data set, I want to find the row with the same millisecond and ID in the big data set and then copy the value of y into the small data set.
Is it possible to find the corresponding rows without reading all 2 TB of data?
Thanks
  5 Comments
Filip on 20 May 2019
Edited: per isakson on 21 May 2019
Responses to Questions:
  • No.
  • No.
  • Yes.
Looks like I will follow this path: creating a database and using SQL queries.
I am slightly confused about the comments about the search algorithm. Currently I have a trial version of the code for a subset of the data sets, and I use MATLAB's find function together with arrayfun. Is the find function in MATLAB not very efficient? Say my small data is d1 and the big one is d2. I have:
d1.y = arrayfun(@(z) d2.y(find(z >= d2.Tsec, 1, 'last')), d1.Tsec);   % errors if some z precedes all of d2.Tsec
so that I get the y value of the last row in d2 whose timestamp is less than or equal to the timestamp in d1. I have to do this after I take subsets of the data with the same ID.
But I do not want to read all 2 TB of data, since this is not a "use once and forget" code, as you asked. It looks like my best bet is to create a database at this point.
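For each ID subset that fits in memory, the same "last timestamp less than or equal to z" lookup can also be done in one vectorized call, for example with discretize. A sketch, assuming d2.Tsec is strictly increasing and every d1.Tsec is at least d2.Tsec(1):
% Bin i means d2.Tsec(i) <= z < d2.Tsec(i+1), i.e. the last d2 row with
% timestamp <= z. Values of d1.Tsec below d2.Tsec(1) yield NaN indices
% and would need special handling before the assignment.
idx  = discretize(d1.Tsec, [d2.Tsec(:); inf]);
d1.y = d2.y(idx);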
Thank you all for help.
Jan on 22 May 2019
The find function is very efficient, but it does not take advantage of sorted data.
A B-tree (https://en.wikipedia.org/wiki/B-tree) would be an efficient method to store sorted data which do not fit into RAM.


Accepted Answer

per isakson on 22 May 2019
Edited: per isakson on 22 May 2019
"I am slightly confused about the comments about search algorithm. [...] Is find function in Matlab not very efficient?"
I'm convinced that the Matlab function, find(), is efficient given the requirements. However, for the special case of a large sorted vector there are faster search algorithms. I made a little demo, which shows that simple m-code implementations of interpolation_search and binary_search are more than an order of magnitude faster than find() when searching a large sorted vector.
%%
A = sort( floor( cumsum( 1+9*rand(1e9,1) ) ) ); % sample data: 1e9 sorted doubles (~8 GB)
key = A( 2E8 );                                 % pick an existing element as search key
%%
tic; [ixIS,nIS] = interpolation_search( A, key ); toc
tic; [ixBS,nBS] = binary_search( A, key ); toc
tic; ixFL = find( A==key, 1, 'last' ); toc
tic; ixFF = find( A==key, 1, 'first' ); toc
%%
[ key ; A([ixIS,ixBS,ixFL,ixFF]) ]
outputs in the command window
Elapsed time is 0.002190 seconds.
Elapsed time is 0.001413 seconds.
Elapsed time is 0.916372 seconds.
Elapsed time is 0.608332 seconds.
ans =
1.0e+09 *
1.100021924000000
1.100021924000000
1.100021924000000
1.100021924000000
1.100021924000000
>>
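interpolation_search and binary_search are plain m-code functions, not part of MATLAB. The implementations used for the timings are not shown here; a minimal sketch of a binary search with the same call signature (index of the last match plus the number of iterations) could look like this:
function [ix, n] = binary_search( A, key )
% Binary search for the last index ix with A(ix) == key in the sorted
% vector A. Returns ix = [] if key is not present. n counts the loop
% iterations, about log2(numel(A)).
    lo = 1;  hi = numel(A);
    ix = [];  n  = 0;
    while lo <= hi
        n   = n + 1;
        mid = floor( (lo + hi)/2 );
        if A(mid) <= key
            if A(mid) == key
                ix = mid;        % remember the hit, keep searching right
            end
            lo = mid + 1;
        else
            hi = mid - 1;
        end
    end
end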

More Answers (1)

Jan on 19 May 2019
Edited: per isakson on 20 May 2019
It is a very bad idea to create a huge data set with a size of 2 TB as text files. Do the files have a fixed width for all lines? That would at least allow a binary search in each file. If you search linearly through 2 TB of text files millions of times, you might wait for years to get all results. With a modern database, a processing time on the order of minutes is more likely.
Please blame the person who decided to use text files.
But now you have these files. Of course you can use a binary search: find the central element of each file. If the row width is fixed, this is trivial; if not, start at the central byte and search for the next (or previous) line break. If the found value is higher than the searched one, use the first half of the file for the next search, otherwise the second half. This requires only log2(nData) file accesses, where nData is the number of records.
If the input data is sorted as well, you can use this information to refine the search by starting each lookup from the previously found position.
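A sketch of this idea in m-code, assuming the file is sorted by its first numeric field and all lines have a fixed width of lineLen bytes including the line break (file_binary_search, fileName and lineLen are illustrative names, not an existing function):
function ix = file_binary_search( fileName, key, lineLen )
% Binary search in a text file that is sorted by its first numeric field
% and has fixed-width lines of lineLen bytes (including the line break).
% Returns the one-based line number of a match, or [] if not found.
    fid = fopen( fileName, 'r' );
    cleaner = onCleanup( @() fclose(fid) );   % close the file on exit or error
    fseek( fid, 0, 'eof' );
    nLines = ftell( fid ) / lineLen;          % number of rows in the file
    lo = 0;  hi = nLines - 1;                 % zero-based line indices
    ix = [];
    while lo <= hi
        mid = floor( (lo + hi)/2 );
        fseek( fid, mid*lineLen, 'bof' );     % jump directly to line mid
        value = sscanf( fgetl(fid), '%f', 1 );
        if value == key
            ix = mid + 1;
            return
        elseif value < key
            lo = mid + 1;                     % continue in the second half
        else
            hi = mid - 1;                     % continue in the first half
        end
    end
end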
