Randomly select n elements from n arrays

I need to randomly select 50 elements from 50 lists (one element from each list) to build a new array; each list contains 100 elements. I then apply some sorting processing on all possible new arrays to find the indexes of the 50 elements of the best array.
The question is that there will be 100^50 possible combinations for the new array. How can I realize this work in MATLAB? Can we try MapReduce?
This problem feels like having 50 classes, each with 100 students, and I need to find the 'best' team of 50 students, one from each class. Each student has a unique score/contribution to the team, so I scan all possible teams and sort them to find the one I want, as well as the student ID in each class.
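To make the setup concrete, a candidate team can be represented as a 1x50 vector of student indexes; the scoring below is only a stand-in for my real function:
% Illustration of the setup (the scoring here is a stand-in, not the real one):
% 50 classes x 100 students; a team picks one student index per class.
nClasses = 50; nStudents = 100;
scores = randn(nStudents, nClasses);                 % score of each student
team = randi(nStudents, 1, nClasses);                % one candidate team
teamScore = sum(scores(sub2ind(size(scores), team, 1:nClasses)))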
  1 Comment
Feng Cheng on 4 Aug 2021
Edited: Feng Cheng on 4 Aug 2021
The sorting function is somewhat complex but cheap to compute; it only depends on two parameters: student scores and class indexes.
I don't need to store all the possible solutions in memory. I have done a MapReduce test to sort huge numbers stored separately in 100 GB of text files, and that test did not fit traditional in-memory calculation either. I think my case is similar, but I have no idea how to handle much bigger combination counts. I know this number is beyond calculation; it's an extreme example. I can simplify my problem to a smaller number, which might be about 19 GB of total data size. Then it comes back to the same question.
I can set up 8 nested for-loops to save all combination indexes into multiple files on disk and apply mapreduce to solve the problem, although it feels stupid.
But it's hard to extend the workflow to bigger data that is too big to compute, like the one I mentioned above. I was thinking there might be some way to get rid of this big number. That's why I came here to find better tools. I have no idea about the Optimal Stopping Strategy, but I can check it anyway.
I am sorry to have taken so much of your energy. Let's stop here.


Accepted Answer

Image Analyst on 4 Aug 2021
I believe what they may be after is the (what should be more famous) "Optimal Stopping Strategy", or the "37% rule". They might have discussed it in your course. Basically, you review the first 37% of the possibilities without selecting any of them. Then you keep reviewing, with the rule that the first possibility you encounter that is better than any you have seen before is the one you pick. If you do that, there is a 37% chance you have picked the optimal possibility.
Pretty weird, right? However, there is actually some theory behind it.
Here is a web page with more information, including a table and more complicated twists on the problem.
However, even 37% of 100^50 is too big. If your course did not have some particular strategy in mind, Wikipedia has some you can try: https://en.wikipedia.org/wiki/Optimal_stopping
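For a concrete feel, here is a minimal sketch of the rule on a stream of candidates. The random scores are stand-ins; in your problem each "candidate" would be one combination of indexes:
% Minimal sketch of the 37% rule (secretary problem) on a candidate stream.
nCandidates = 1000;
scores = rand(1, nCandidates);            % stand-in candidate scores
nLook = round(0.37 * nCandidates);        % look-only phase: first 37%
threshold = max(scores(1:nLook));         % best score seen while only looking
pick = nCandidates;                       % fall back to the last candidate
for k = nLook+1 : nCandidates
    if scores(k) > threshold              % first one better than all seen before
        pick = k;
        break
    end
end
fprintf('Picked candidate %d (score %.3f)\n', pick, scores(pick))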

More Answers (4)

Image Analyst on 4 Aug 2021
How do you figure 100^50? You have 50 lists of 100 elements, from which you are supposed to select half the elements randomly. So you will have 50 output vectors. You can use randperm().
For one vector, let's call it vec1, you can do
randomIndexes = randperm(length(vec1), 50); % Get 50 random indexes.
% Now extract those into vec1Output:
vec1Output = vec1(randomIndexes);
% Apply "some sorting processing":
vec1Output = sort(vec1Output, 'ascend'); % Or whatever you want to do.
There, just do that also for vec2 through vec50 and you'll have 50 output vectors.
  7 Comments
Image Analyst on 4 Aug 2021
@Chunru, originally, when I answered, the question did not say to get the "best" array or to try multiple sets of random indexes, as the original poster admitted.
Peter O on 4 Aug 2021
@Feng Cheng, are you able to give us a little more information on what the sorting function needs to achieve and what the base list parameters are? Are they students with scores, or are they actually just a big parameter grid?
As @Walter Roberson brings up, this is sounding a lot more like an optimization problem than a sorting problem, and dynamic programming (DP) looks like a good candidate. If the cost function can't be solved through a dynamic-programming approach, then a stochastic optimizer (genetic algorithms, particle swarm, simulated annealing, differential evolution) might also be suited to the task; a rough sketch follows below. In either case the algorithms find the "best" solution without needing to keep the entire possible solution space in memory.
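For instance, a rough ga() sketch (Global Optimization Toolbox) with integer variables, assuming a hypothetical additive scoring function:
% Rough sketch: search for a good team with a genetic algorithm.
% teamScore below is a placeholder; swap in the real objective.
nClasses = 50; nStudents = 100;
scores = randn(nStudents, nClasses);                        % stand-in data
teamScore = @(idx) sum(scores(sub2ind(size(scores), idx, 1:nClasses)));
fitness = @(idx) -teamScore(idx);                           % ga minimizes
lb = ones(1, nClasses);  ub = nStudents * ones(1, nClasses);
intcon = 1:nClasses;                                        % all variables are integer indexes
bestIdx = ga(fitness, nClasses, [], [], [], [], lb, ub, [], intcon);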



Chunru on 4 Aug 2021
Edited: Chunru on 4 Aug 2021
nele = 100; nlist = 50;
data = randn(nele, nlist); % data in the list
idx = randi([1 nele], 1 ,nlist); % row index for each list
idx = (0:nele:nele*nlist-1) + idx; % linear index
x = data(idx); % data selected
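Equivalently, sub2ind makes the linear-index construction explicit:
nele = 100; nlist = 50;
data = randn(nele, nlist);                       % data in the list
rowIdx = randi([1 nele], 1, nlist);              % one random row per list
x = data(sub2ind(size(data), rowIdx, 1:nlist));  % same selection via sub2ind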
  7 Comments
Chunru on 4 Aug 2021
Try to appreciate the big number:
Let's say you have a 10^100-by-50 array of double. The total size in memory is 10^100 × 50 × 8 = 4×10^102 bytes.
Let's say you save the data on hard disks (not RAM). Assume each hard disk holds 4 TB = 4×10^12 bytes. You would need (4×10^102)/(4×10^12) = 10^90 such hard disks. Assume each hard disk weighs 100 g; the total weight would be 10^89 kg. Compare that to the mass of the Sun, 1.989×10^30 kg: all the hard disks together would be roughly 5×10^58 times heavier than the Sun.
Does the whole universe weigh more than 5×10^58 Suns?
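In MATLAB terms, working in log10 so the numbers stay readable:
% Back-of-envelope check of the figures above, in log10 units.
log10Bytes = 100 + log10(50*8);           % 10^100 rows x 50 doubles x 8 bytes
log10Disks = log10Bytes - log10(4e12);    % 4 TB per disk
log10Kg    = log10Disks + log10(0.1);     % 100 g per disk
log10Suns  = log10Kg - log10(1.989e30);   % in solar masses
fprintf('10^%.1f bytes, 10^%.1f disks, 10^%.1f kg, 10^%.1f Suns\n', ...
    log10Bytes, log10Disks, log10Kg, log10Suns)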
Walter Roberson on 4 Aug 2021
Mass of the observable universe is apparently around 10^53 kg.



Peter O on 4 Aug 2021
With a little programming and a for loop you can handle all 50 lists.
If you want to be able to draw repeated elements from the lists (e.g. the new list could include index 10 multiple times), use randi. Otherwise, use randperm.
% Generate some fake lists as a single array, each having 100 entries.
n_lists = 50;
n_entries = 100;
Lists = rand(n_entries, n_lists);
n_pick = 50;
% We know the size of the draw. Preallocate.
NewLists = zeros(n_pick, n_lists);
% Generate the indices to draw at random.
for ix = 1:n_lists
    Indexes = randperm(size(Lists,1), n_pick);
    % or if you want to allow repeats:
    % Indexes = randi(size(Lists,1), n_pick, 1);
    NewLists(1:n_pick, ix) = Lists(Indexes, ix);
end
% Sort as needed here. Since the lists are ordered as a single matrix,
% you can handle it with a single call. sort called without additional
% arguments sorts each column individually.
NewLists_sorted = sort(NewLists)
NewLists_sorted = 3×5
0.5339 0.0871 0.2878 0.4678 0.2783
0.5875 0.4006 0.3817 0.5071 0.8016
0.7901 0.7552 0.9092 0.9484 0.8142
(output shown from a smaller 3×5 example run)
  3 Comments
Chunru on 4 Aug 2021
Edited: Chunru on 4 Aug 2021
You most likely cannot get the 100^50 array, which is too big. You need to generate one sample vector at a time (or some manageable number of samples). You will never go through all the possibilities.
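For example, a minimal random-search sketch that samples one candidate at a time and keeps the best seen so far (the score function is a stand-in):
% Random search: sample index vectors one at a time, keep the best.
nele = 100; nlist = 50; nTrials = 1e5;
data = randn(nele, nlist);
score = @(x) sum(x);                             % stand-in objective
bestScore = -inf;
for t = 1:nTrials
    rowIdx = randi(nele, 1, nlist);              % one random pick per list
    x = data(sub2ind(size(data), rowIdx, 1:nlist));
    if score(x) > bestScore
        bestScore = score(x);
        bestRowIdx = rowIdx;                     % remember the winning indexes
    end
end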
Peter O on 4 Aug 2021
Yes, @Chunru, I agree. The solution above is for a different problem, which a few of us thought the OP was trying to solve. The OP has updated the question to help us better understand what they are trying to do.



Walter Roberson on 4 Aug 2021
Edited: Walter Roberson on 4 Aug 2021
Lists = arrayfun(@(idx) randi(99,1,100), 1:50, 'uniform', 0);
one_random_selection = cellfun(@(L) L(randi(100)), Lists)
one_random_selection = 1×50
16 51 65 2 32 74 79 48 82 58 14 42 17 5 8 32 19 33 99 32 64 58 35 96 60 72 5 73 30 27 … (display truncated)
sort(one_random_selection)
ans = 1×50
2 3 5 5 6 8 12 14 16 17 19 19 21 27 28 30 32 32 32 33 35 37 37 42 43 46 47 48 51 51 … (display truncated)
If you want to find all of the unique sort()'s, that would be a different matter entirely.
  7 Comments
Feng Cheng on 4 Aug 2021
To be honest, your code is useless for me. We're talking about totally different questions. I was just asking for some possible solutions to this kind of problem. I know there might be no answer to this question, but I appreciate everyone's positive discussion. Please make sure you understand my question before mocking.
Walter Roberson on 4 Aug 2021
You are mistaken. I have provided a link to code that can produce EVERY combination from the arrays, which none of the other people have given you. You run the initialization routine that I provided there, then you run the loop that I provided there. Each time, the next combination is returned; you evaluate the combination according to your scoring system, and if it is the best combination so far, you keep a record of it. You keep looping until the code says there are no more possibilities available; at that point you return the best value.
I can set up 8 nested for-loops to save all combination indexes into multiple files on disk and apply mapreduce to solve the problem, although it feels stupid.
My code provides a mechanism to have a single for loop that will evaluate every possibility.
But it's hard to extend the workflow to bigger data that is too big to compute
My code extends to any dataset that in itself takes less than approximately 1/4 of your available memory. 5000 items (50 classes of 100 students) is not a problem for my code, given enough time.
Mapreduce
Using mapreduce implies that you want to evaluate all possibilities. My code evaluates all possibilities given enough time.
If mapreduce is even conceptually in your mental toolbox to solve this problem, then my code should be a serious contender.
For problems like this, there are two possibilities:
  1. That the results from one combination cannot be used to make predictions about the results from other combinations, so all of the combinations need to be tried; in such a case, my code can handle problems too big for mapreduce given enough time; or
  2. That the results from one combination can be used to make predictions about the results from other combinations; in such a case, mapreduce is not typically a useful strategy, but techniques such as Dynamic Programming could potentially be very effective
You also have to ask whether you need the best combination, or just a "reasonably good" combination. For example, it would be entirely reasonable to ask "If I put a limit of 10^9 combinations evaluated (to keep the computation time limited), then what techniques should I use to get the best combination I am likely to get?" . And typically mapreduce() is not one of the techniques used to get the best combination within a budget.
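For concreteness, here is a minimal sketch of such a single-loop enumerator, written as a mixed-radix counter (the linked code may differ; the scoring function is a stand-in):
% Enumerate every combination of indexes with one loop, storing nothing
% but the current and best combinations. Shown small so it finishes quickly;
% scale nLists/nElems as needed.
nLists = 5; nElems = 4;
myScore = @(idx) sum(idx);               % stand-in scoring function
idx = ones(1, nLists);                   % current combination of indexes
bestScore = -inf; bestIdx = idx;
done = false;
while ~done
    s = myScore(idx);                    % evaluate this combination
    if s > bestScore
        bestScore = s; bestIdx = idx;    % best combination so far
    end
    k = nLists;                          % advance the mixed-radix counter
    while k >= 1 && idx(k) == nElems
        idx(k) = 1; k = k - 1;
    end
    if k < 1
        done = true;                     % wrapped around: all combinations seen
    else
        idx(k) = idx(k) + 1;
    end
end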

