How can I delete a random 5% of the elements of an array and pad zeros at the end in MATLAB?
Hello everyone, I hope you are doing well.
I have a dataset of shape 1x1000. I implemented the following code to delete 5% of the samples at random,
but only the first value is saved in the output.
How can I do this in MATLAB?
Please help.
The output should be 1x1000, but the values are not being saved in it:
output matrix = [200 0 0 0 .......]
load('dataset1')
N = numel(dataset1) ;
percentageMP=5;
size_MP=round(percentageMP/100*N);
MPV=zeros(size(dataset1));
for i=1:length(size_MP)
MP = randsample(N,size_MP) ;
sortvalue=sort(MP);
end
Temp_series1=zeros(size(dataset1));
index=1
totallength=length(dataset1)-length(MP)
for j=1:length(totallength)
for k=1:length(MP)
if j==MPV(k)
index=index+1;
end
end
Temp_series1(j)=dataset1(index)
end
Accepted Answer
Matt J
on 1 Mar 2022
Edited: Matt J
on 1 Mar 2022
load('dataset1')
N = numel(dataset1) ;
percentageMP=5;
size_MP=round(percentageMP/100*N);
discard=randperm(N,size_MP);
dataset1(discard)=[];
dataset1(end+1:N)=0;
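The same delete-then-pad steps can be traced on a small vector. This is only an illustrative sketch: x is a hypothetical stand-in for dataset1, and removed is an extra variable added here to record the deleted values.

```matlab
% Small worked example; x is a hypothetical stand-in for dataset1
x = 1:20;
N = numel(x);
k = round(5/100*N);        % number of elements to drop (1 of 20 here)
discard = randperm(N, k);  % k distinct random positions in 1:N
removed = x(discard);      % remember the values about to be deleted
x(discard) = [];           % delete them: x now has N-k elements
x(end+1:N) = 0;            % pad with zeros back to length N
```

After this runs, x still has 20 elements, the surviving values keep their original order, and the zeros sit at the end.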
34 Comments
Med Future
on 1 Mar 2022
@Matt J The data in the file is 250x1000.
The data I shared before was 1x1000.
Now I have 250 rows. Can you modify the above code to run on each row and produce the same kind of output?
Matt J
on 1 Mar 2022
dataset1=load('dataset').dataset;
N = size(dataset1,2) ;
percentageMP=5;
size_MP=round(percentageMP/100*N);
discard=randperm(N,size_MP);
dataset1(:,discard)=[];
dataset1(:,end+1:N)=0;
Walter Roberson
on 1 Mar 2022
That would delete the same columns for every row. I think they want different columns to be chosen for each row.
Med Future
on 1 Mar 2022
Edited: Med Future
on 1 Mar 2022
@Matt J Thanks. Can we also run discard=randperm(N,size_MP); for every row, so that every row has different random indices?
I want different columns to be chosen for each row.
Med Future
on 1 Mar 2022
@Matt J Can you please modify it? I want different columns to be chosen for each row.
Walter Roberson
on 1 Mar 2022
[~, colidx] = sort(rand(size(dataset)), 2) ;
keep = floor(0.95*size(dataset, 2));
colidx = sort(colidx(:, 1:keep), 2);
rowidx = repmat((1:size(dataset, 1)).', 1, keep) ;
newds = dataset(sub2ind(size(dataset), rowidx, colidx));
newds(end, size(dataset, 2)) = 0;
Matt J
on 1 Mar 2022
Edited: Matt J
on 1 Mar 2022
Since you only have 250 rows, I would just loop.
dataset=load('datasetvalue').dataset;
N = size(dataset,2) ;
percentageMP=5;
size_MP=round(percentageMP/100*N);
dataset=num2cell(dataset,2);
for i=1:numel(dataset)
discard=randperm(N,size_MP);
dataset{i}(discard)=[];
dataset{i}(:,end+1:N)=0;
end
dataset=cell2mat(dataset);
Med Future
on 1 Mar 2022
@Matt J It is not working. It converts only the first row and is not applied to the other rows.
Med Future
on 1 Mar 2022
@Matt J Okay, are different randperm columns chosen for each row? The discard output shows only one row.
Matt J
on 1 Mar 2022
I don't know which code you're looking at. In my later version of the code which loops over the rows, the randomization is done separately for every row.
Med Future
on 1 Mar 2022
@Matt J I am looking at the last code.
The discard array has only a single row at the output. Is it possible to see the random indices of every row?
Med Future
on 1 Mar 2022
@Matt J discard=randperm(N,size_MP); gives only a single row. What can I do to see the random indices for every row?
Med Future
on 1 Mar 2022
@Matt J When I run the commands
discard=randperm(N,size_MP);
Discards(i,:)=discard;
I get this error:
Unable to perform assignment because the size of the left side is 1-by-1 and the size of the right side is 1-by-0.
Matt J
on 1 Mar 2022
Edited: Matt J
on 1 Mar 2022
I'm not seeing that error.
dataset=load('datasetvalue').dataset;
[M,N] = size(dataset) ;
percentageMP=5;
size_MP=round(percentageMP/100*N);
Discards=nan(M,size_MP);
for i=1:M
row=dataset(i,:);
discard=randperm(N,size_MP);
row(discard)=[];
row(:,end+1:N)=0;
dataset(i,:)=row;
Discards(i,:)=discard;
end
whos dataset Discards
  Name          Size              Bytes  Class     Attributes
  Discards      250x50           100000  double
  dataset       250x1000        2000000  double
spy(dataset)
Discards(1:10,1:20)
ans = 10×20
263 538 503 893 687 752 115 823 130 587 350 617 367 189 626 971 45 480 870 750
430 353 17 869 295 559 945 319 904 12 742 484 508 125 739 174 262 838 855 824
363 161 985 154 901 264 755 634 374 600 962 940 297 6 57 585 471 146 423 350
923 780 355 945 122 12 102 906 345 304 723 66 43 467 804 716 548 776 118 240
719 925 172 464 41 51 525 606 437 608 410 733 479 885 379 507 692 32 990 385
553 684 8 72 122 851 502 544 588 214 52 99 71 419 753 296 564 216 328 379
879 812 273 797 416 545 54 564 246 855 493 613 904 440 459 873 915 573 197 342
599 84 864 429 177 789 610 465 668 186 94 188 912 747 790 687 294 119 923 537
90 265 628 598 233 922 382 12 774 584 786 87 717 345 990 133 833 961 699 491
190 892 644 48 653 629 213 885 802 165 778 706 636 102 773 386 100 118 359 792
Walter Roberson
on 1 Mar 2022
dataset = randi(9, 250, 1000);
[~, colidx] = sort(rand(size(dataset)), 2) ;
keep = floor(0.95*size(dataset, 2));
colidx = sort(colidx(:, 1:keep), 2);
rowidx = repmat((1:size(dataset, 1)).', 1, keep) ;
newds = dataset(sub2ind(size(dataset), rowidx, colidx));
newds(end, size(dataset, 2)) = 0;
newds(1:10,940:960)
ans = 10×21
7 2 4 2 6 2 4 5 4 9 6 0 0 0 0 0 0 0 0 0 0
8 9 6 7 1 7 4 6 9 2 2 0 0 0 0 0 0 0 0 0 0
1 7 5 8 5 1 8 2 8 6 9 0 0 0 0 0 0 0 0 0 0
2 4 9 7 1 8 2 5 4 5 5 0 0 0 0 0 0 0 0 0 0
3 6 9 2 5 2 9 3 6 9 6 0 0 0 0 0 0 0 0 0 0
8 4 7 3 1 7 3 1 8 5 7 0 0 0 0 0 0 0 0 0 0
9 8 3 6 6 3 1 4 9 3 1 0 0 0 0 0 0 0 0 0 0
9 5 5 7 4 2 5 1 2 1 1 0 0 0 0 0 0 0 0 0 0
9 5 1 8 2 5 5 2 1 5 4 0 0 0 0 0 0 0 0 0 0
2 9 8 7 8 8 3 3 8 2 9 0 0 0 0 0 0 0 0 0 0
Med Future
on 1 Mar 2022
@Matt J The Discards index is not correct. For example, if the 5th value is listed in Discards, the 5th value of the dataset is still not deleted.
Matt J
on 1 Mar 2022
Edited: Matt J
on 1 Mar 2022
It is in my tests. Here are the results I see in a smaller example with percentageMP=50.
dataset = %before
1 2 3 4 5 6 7 8 9 10
1 2 3 4 5 6 7 8 9 10
1 2 3 4 5 6 7 8 9 10
1 2 3 4 5 6 7 8 9 10
dataset = %after
2 3 8 9 10 0 0 0 0 0
1 2 6 9 10 0 0 0 0 0
1 2 5 6 10 0 0 0 0 0
3 4 5 6 10 0 0 0 0 0
Discards =
6 7 4 5 1
5 8 3 4 7
8 3 4 9 7
7 1 8 2 9
Med Future
on 1 Mar 2022
I have added sortedvalue=sort(discard):
[M,N] = size(dataset) ;
percentageMP=5;
size_MP=round(percentageMP/100*N);
Discards=nan(M,size_MP);
for i=1:M
row=dataset(i,:);
discard=randperm(N,size_MP);
sortedvalue=sort(discard)
row(sortedvalue)=[];
row(:,end+1:N)=0;
dataset(i,:)=row;
Discards(i,:)=sortedvalue;
end
Matt J
on 1 Mar 2022
Edited: Matt J
on 1 Mar 2022
You can sort if you wish, but it only slows things down. The modification of dataset is not affected by the order of the indices in discard.
Also, if you wish to sort, it is probably more efficient to do so after the loop:
for i=1:M
row=dataset(i,:);
discard=randperm(N,size_MP);
row(discard)=[];
row(:,end+1:N)=0;
dataset(i,:)=row;
Discards(i,:)=discard;
end
Discards=sort(Discards,2);
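A quick check of the claim that index order does not matter for deletion. This is a minimal sketch with made-up values; x and idx are hypothetical.

```matlab
% Deleting by index gives the same result whether or not the indices are sorted
x   = 10:10:100;          % hypothetical data: 10 20 ... 100
idx = [7 2 9];            % unsorted positions to delete

a = x;  a(idx) = [];          % delete using the unsorted indices
b = x;  b(sort(idx)) = [];    % delete using the sorted indices

sameResult = isequal(a, b);   % true: both are 10 30 40 50 60 80 100
```

All listed positions are removed in one operation against the original array, which is why sorting beforehand changes nothing.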
Walter Roberson
on 1 Mar 2022
[~, colidx] = sort(rand(size(dataset)), 2) ;
Create an array of random numbers the same size as the dataset. Sort it along the rows, discarding the actual sorted values, but keeping the sort indices. So colidx will be an array the same size as dataset, in which each row is a list of indices into the row, with the indices reflecting the sorting order of a list of random numbers.
Why would you do that? Well, because each index now appears exactly once in each row, and the order of indices is random. In other words, you have produced a random permutation of the column indices, and you have created a different such random permutation for each row.
If you were to try to use randperm() you would find that it is restricted to outputting a single vector, not a 2D array in which each row or column is different.
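The sort(rand(...)) trick can be checked on a small case. This is a sketch only; the sizes M and N below are hypothetical and chosen to keep the output readable.

```matlab
% One random permutation of 1:N per row, without a loop
M = 4;  N = 6;                        % hypothetical small sizes
[~, colidx] = sort(rand(M, N), 2);    % sort indices of random numbers, row by row

% If every row is a permutation of 1:N, sorting each row gives back 1:N
isPerm = isequal(sort(colidx, 2), repmat(1:N, M, 1));
```

isPerm comes out true on every run, while the rows of colidx themselves differ from run to run and, in general, from one another.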
keep = floor(0.95*size(dataset, 2));
In other words, calculate the number of entries to leave untouched per row, which is 950 here.
colidx = sort(colidx(:, 1:keep), 2);
Take that matrix of random permutations of indices and keep only its first 950 columns, then sort along the second dimension. What you get out is an ordered vector for each row, 950 elements long, consisting of entries from 1:1000 with 50 random entries omitted, like [1, 2, 3, 5, 6, 8, ...]. These are the column indices to keep for that particular row: 950 of the 1000 possible.
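If the discarded indices are also wanted, as asked earlier in the thread, one possible variation is to hold on to the tail of each permutation before it is thrown away. The sketch below is an assumption rather than part of the code above: perm, keptIdx and discardIdx are hypothetical names, the data is a stand-in, and a 50% keep fraction is used so the small example has something to discard.

```matlab
% Split each row's random permutation into kept and discarded column indices
dataset = magic(6);                  % hypothetical 6x6 stand-in data
[M, N] = size(dataset);
[~, perm] = sort(rand(M, N), 2);     % one random permutation of 1:N per row

keep = floor(0.5*N);                 % assumption: keep 3 of 6 for illustration
keptIdx    = sort(perm(:, 1:keep), 2);       % columns to keep, in order
discardIdx = sort(perm(:, keep+1:end), 2);   % columns that get dropped

% Together the two parts cover 1:N exactly once in every row
covers = isequal(sort([keptIdx, discardIdx], 2), repmat(1:N, M, 1));
```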
rowidx = repmat((1:size(dataset, 1)).', 1, keep) ;
newds = dataset(sub2ind(size(dataset), rowidx, colidx));
Would you believe... magic?
Not actually magic, but certainly arcane, in the sense of obscure "hidden" knowledge.
The rowidx line is constructing
1 1 1 1 1 ... 1 (950 times)
2 2 2 2 2 ... 2 (950 times)
3 3 3 3 3
4 4 4 4 4
up to the number of rows.
colidx is, remember, things like
1 2 3 5 6 8 .... 950 entries
1 3 4 6 8 9 .... 950 entries
5 13 14 15 18 ... 950 entries
After you create rowidx, rowidx and colidx are arrays of the same size, and you can read off corresponding elements of the two as the row index and column index of an element you want to keep. In this example data, you want to keep (1,1), (1,2), (1,3), (1,5), (1,6), (1,8), (2,1), (2,3), (2,4), (2,6), (2,8), (2,9), (3,5), (3,13), (3,14), (3,15), (3,18).
Now sub2ind() takes those pairs of row index and column index and calculates the linear index each one corresponds to in an array the size() of the dataset. Linear indices run down the columns, so for a dataset with 250 rows, element (r,c) has linear index r + (c-1)*250. The sub2ind() call would therefore return an array of indices that might look like
1 251 501 1001 1251 1751
2 502 752 1252 1752 2002
1003 3003 3253 3503 4253
These are linear indices into dataset.
Then indexing dataset() with those indices extracts the corresponding values. So you would get an array like
[d(1,1), d(1,2), d(1,3), d(1,5), d(1,6), d(1,8) ...
d(2,1), d(2,3), d(2,4), d(2,6), d(2,8), d(2,9) ...
Each row would have 950 elements, and each row has values extracted from exactly one row of input.
This is a somewhat obscure way to do mass extraction of data from an array when the data might not be regularly spaced.
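The rowidx / colidx / sub2ind steps can be followed on a matrix small enough to check by hand. This is a sketch with hypothetical data: A is built so that each value equals its own linear index, and colidx is fixed rather than random so the result is reproducible.

```matlab
% Per-row extraction of different columns using linear indices
A = reshape(1:12, 3, 4);            % 3x4, with A(r,c) = r + (c-1)*3
colidx = [1 2 4;                    % columns to keep in row 1
          1 3 4;                    % columns to keep in row 2
          2 3 4];                   % columns to keep in row 3
keep = size(colidx, 2);

rowidx = repmat((1:size(A,1)).', 1, keep);   % [1 1 1; 2 2 2; 3 3 3]
lin = sub2ind(size(A), rowidx, colidx);      % linear index of each (row, col) pair
B = A(lin);                                  % B = [1 4 10; 2 8 11; 6 9 12]
```

Because A holds its own linear indices, B and lin are identical here, which makes it easy to see that each row of B draws only from the matching row of A.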
newds is now 250 rows and 950 columns
newds(end, size(dataset, 2)) = 0;
size(dataset,2) is the original number of columns in dataset, which is 1000. newds(250,1000) is beyond the end of newds as newds is 250 by 950, so by assigning a 0 at newds(250,1000) you are implicitly asking to expand the matrix to be 250 x 1000 by adding extra columns of zeros.
That line of code has a bug in the situation where none of the original data was dropped: if keep were the same as the number of columns, the line would overwrite the last entry of the matrix with zero instead of padding. There are other ways of padding with 0 that are more robust for the case where the old and new matrices end up the same size.
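One padding form that avoids this edge case is the end+1:N assignment used elsewhere in this thread. The sketch below compares the two on hypothetical all-ones matrices; full, short, buggy and padded are names made up for the illustration.

```matlab
% Compare two ways of zero-padding back to N columns
N = 4;                         % hypothetical original column count
full  = ones(3, 4);            % case 1: nothing was dropped
short = ones(3, 2);            % case 2: two columns were dropped

buggy = full;
buggy(end, N) = 0;             % overwrites the real value at (3,4)

padded = full;
padded(:, end+1:N) = 0;        % end+1:N is empty here, so nothing changes

short(:, end+1:N) = 0;         % grows to 3x4 with zeros in columns 3 and 4
```

In the first case buggy loses a data value while padded is left untouched. In the second case both forms would give the same 3x4 result.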
Walter Roberson
on 2 Mar 2022
"what the effect of randomsample instead of randperm?"
The code in randsample() was designed before Mathworks upgraded the internal randperm algorithm for the two-input case. Because of that, it has an internal "optimization" for the case where less than 1/4 of the values are being selected, with the "optimization" being based on using randi() until enough distinct random values have been generated and then randomizing their order using randperm. This is guaranteed to require at least twice as many random number generations as would be used for a Fisher-Yates shuffle, which is what randperm would use for this configuration.
Also, randsample requires the Statistics and Machine Learning Toolbox, but randperm does not.