How to quickly identify a list of nearby stations?

4 views (last 30 days)
I have a total of 1 million stations numbered from 1 to 1 million. For each of them, how do I quickly identify the rest of the stations (Station numbers) that are within 2 km of distance? I wrote the below program, but it was extremely slow. Any recommendations would be greatly appreciated. Many thanks!
%% calculate the distance between the stations:
m2 = numel(Sta); % The list of profile numbers
B0 = zeros(m2*m2, 6);
n0 = 0;
for i=1:m2
for j=1:m2
if Sta{i}.lon <= 180
Lon1 = Sta{i}.lon;
else
Lon1 = Sta{i}.lon-360;
end
if Sta{j}.lon <= 180
Lon2 = Sta{j}.lon;
else
Lon2 = Sta{j}.lon-360;
end
n0 = n0+1;
B0(n0,:) = [i, j, Sta{i}.lat, Lon1, Sta{j}.lat, Lon2];
end
end
Dis = distance(B0(:,3), B0(:,4), B0(:,5), B0(:,6), 6371); % The radius of the earth is 6371km.
%units: km.
D = []; % cell array storing the list of stations that are close to this profile:
for i=1:m2
Ind1 = B0(:,1)==i; % Index of the distance values for Station i.
B = [B0(Ind1,2), Dis(Ind1)]; % Two columns: List of profiles, Distance.
Ind = B(:,2) <= 2; % distance is less than 2km.
if sum(Ind) >= 2 % at least two profiles identified, one of them is itself.
C0 = setdiff( B(Ind,1), i ); % list of profiles that are close to this station (excluding itself)
D{i} = C0;
else
D{i} = [];
end
end

Accepted Answer

Torsten
Torsten on 2 Oct 2022
Edited: Torsten on 2 Oct 2022
And for your double loop, you should replace
for i=1:m2
for j=1:m2
by
for i=1:m2
for j=i+1:m2
to save computation time.
  7 Comments
Leon
Leon on 2 Oct 2022
Many thanks! I forgot everything I learned in my Advanced Mathematics class :-)
Image Analyst
Image Analyst on 2 Oct 2022
You accepted this answer but we don't know how you solved the problem of needing 8000 GB of memory to hold your B0 array for your trillion distances. How did you do it?

Sign in to comment.

More Answers (3)

Image Analyst
Image Analyst on 2 Oct 2022
Well let's see why it's slow. You have a million elements and you create an array that is a million by a million by 8 bytes. So that is 8000 gigabytes! Now maybe you have an 8 terabyte drive in your computer for swap space, but it will still be slow. I don't even know if MATLAB can handle an array that big even if you had infinite memory.
The second reason it is slow is the use of cell arrays. Cell arrays are very slow and memory inefficient. Put all your locations into standard double arrays, not cell arrays.
Now I don't know what this distance function you're using is (there are several) but it might be slow because it is general purpose with lots of overhead or maybe it takes the square root or even some more complicated formula. That could be avoided by using the simple Euclidean distance formula but not taking the square root. If you want the distance to be less than 2 km, that is the same as wanting the squared distance to be less than 4. Hence just square the distances and compare to 4 but don't take the square root. That is just an extra, unnecessary operation that will slow it down.
So why do I think the square distance will probably be better? Well if you're trying to find stations that are within 2 km of each other you're probably not considering stations placed all over the globe. You're probably not worried about whether a station in Beijing is within 2 km of one in Miami. You wouldn't need that kind of precision. You're probably considering a much smaller area, like just stations in Belgium or some country or state. Though why you'd have a million stations in a small area, I don't know. What is a station anyway? A radio station? A train station? An antenna location? So if you're using a small area of the globe you can convert the lats and lons to linear (x,y) locations. But you'll say "oh but it won't be accurate" and I'll say "who cares?" Certainly not us/you. It will be accurate enough to find stations within 2 km since there is no earth curvature effect over such a small distance, and if it's inaccurate for stations separated by 500 km, who cares? You don't want to identify a pair of stations like that (very distant) anyway so who cares if it says the distance is 480 km instead of 500 km? It doesn't matter. In the end all you care about are stations where the squared distance is less than 4, not the absolutely precise distance down to the nearest millimeter.
So just have a 2-D array where you compute the squared distance of every station to every other station using the Pythagorean theorem and your (x,y) coordinates (not cell arrays). Try meshgrid. But you still have the problem of having a trillion distances to compute. You're just going to somehow reduce that down to something more manageable - a trillion distances is not realistic. So now once you have the square distances in your array, finding the ones closer than 4 is trivial
mask = (squaredDistances > 0) & (squaredDistances <= 4);
% Find station indexes
[stationID1, stationID2] = find(mask);
  2 Comments
Image Analyst
Image Analyst on 2 Oct 2022
You can speed this up even further by not computing the squared distance. Just compute the absolute value of the delta X and delta Y. Then you can throw out any that are not within 2. That will narrow it down immensely. However you're still going to have a trillion delta x distances and a trillion delta y distances initially. Plus you're avoiding the two squaring operations and a summing operation - that's a total of 3 trillion mathematical operations you're avoiding.
Now, on that smaller subset (maybe a few million or so), you can compute the more accurate distances using a more precise method if you wish.
Leon
Leon on 2 Oct 2022
Wow, thank you so much for the very insightful comments! I find it very helpful.

Sign in to comment.


Walter Roberson
Walter Roberson on 2 Oct 2022
First, divide the stations into subsets:
  • stations close enough to the North Pole that the smaller distance between degrees of longitude could start substantially affecting the calculation of which stations are close to each other. For the moment, call this NorthStations
  • A subset of NorthStations that is potentially within 2 km of the sourthern boundary of the NorthStations. Call this NorthBorderStations for the moment
  • Likewise, create SouthStations and SouthBorderStations
  • Take the remaining stations that are not in NorthStations or in SouthStations, and add copies of NorthBorderStations and SouthBorderStations to that. Call this OtherStations at the moment
  • Take a subset of OtherStations that are potentially within 2km of -180 degrees (remember that OtherStations is clipped at 2km North/South of the boundary you decided earlier so the curve at the North/South boundary will be limited.) Add 360 to those longitudes and add that subset back into OtherStations
Now NorthStations and SouthStations are substantially reduced subsets of the original data. You can use the Haversine distance formula within each subset, perhaps using pdist(). This will compare every point in each of the subsets to every other point within the subsets, but the subsets should only be including the locations where the curve introduces notable risk of unexpected connections. The distances calculated within each of the subsets will be exact on the first pass for each of the two subsets.
Now take OtherStations and use knnsearch and Euclidean distance, with the radius set to the largest euclidean distance that maps to 2km right at the border you used for dividing into the North / South subsets. For example if you choose to put the boundary at acosd(1/3) (70.5 degrees) then use a radius corresponding to 1/(1/3)*2 = 6 km. knnsearch with these settings with create a quadtree and search it for nearby points. This will, if I recall, take roughly N * log(N) time where N is the number of points in the subset.
The list of neighbours you get will be an overestimate -- but now you can take the points and their neighbours and do pairwise havershine formulas to get exact distances.
The higher/lower you put the latitude boundaries the more you reduce the size of the NorthStations and SouthStations subsets that have to be compared pairwise, but the more you overpopulate the search in OtherStations (because the more extreme the boundary, the larger you need to make the euclidean search radius so as to be able to find the matches right at the boundary.
I guess perhaps this search strategy could end up with some duplicates, if you happen to have pairs that lie exactly within the NorthBorderStations or SouthBorderStations, or entirely within the band that is duplicated at -180, so be sure to do a final filtering pass.

Walter Roberson
Walter Roberson on 2 Oct 2022
If you have the Mapping Toolbox then use distance
  3 Comments
Leon
Leon on 2 Oct 2022
Edited: Leon on 2 Oct 2022
The stations can be anywhere in the global ocean. Some of them will be close to the poles.

Sign in to comment.

Products


Release

R2022a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!