Remove rows in an array containing a non-matching element

I have a datafile data.txt:
gene12 489 483 838
gene82 488 763 920
gene31 974 837 198
gene45 489 101 378
gene59 89 827 138
I have another data file, genelist.txt, that lists just the genes I'm interested in for my study:
gene45
gene59
gene61
I want to modify the first dataset by removing all rows where the gene isn't found in the second list, so that I end up with this array:
gene45 489 101 378
gene59 89 827 138
How do I go about doing this?

 Accepted Answer

Probably the easiest:
geneswithdata = readtable('data.txt'); %load file as a table
geneswithdata.Properties.VariableNames{1} = 'genes'; %rename first column for clarity (optional).
%I would also rename all the other columns
genesonly = readtable('genelist.txt'); %load as a table
genesonly.Properties.VariableNames = {'genes'}; %rename columns. Common columns must have the same name
filteredgenes = innerjoin(genesonly, geneswithdata);
Done.
Using ismember, that last line could be done as:
found = ismember(geneswithdata.genes, genesonly.genes); %compare the gene columns, not the whole tables
filteredgenes = geneswithdata(found, :);
Using intersect (rather than setdiff), it could be done as:
[~, tokeep] = intersect(geneswithdata.genes, genesonly.genes); %tokeep indexes rows of geneswithdata
filteredgenes = geneswithdata(tokeep, :);
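For reference, the ismember approach can be sketched end to end without any files. The tables below are built in memory from the question's sample data; the variable names v1 to v3 are placeholders chosen for illustration, not names from the original files.

```matlab
% Sample data from the question, built in memory
% (the column names v1..v3 are placeholders, not from the original files)
genes = {'gene12'; 'gene82'; 'gene31'; 'gene45'; 'gene59'};
v1 = [489; 488; 974; 489; 89];
v2 = [483; 763; 837; 101; 827];
v3 = [838; 920; 198; 378; 138];
geneswithdata = table(genes, v1, v2, v3);

% Gene list of interest, as a one-column table
genesonly = table({'gene45'; 'gene59'; 'gene61'}, 'VariableNames', {'genes'});

% Logical mask: true where the gene in a data row appears in the list
found = ismember(geneswithdata.genes, genesonly.genes);

% Keep only the matching rows; the original row order is preserved
filteredgenes = geneswithdata(found, :);
disp(filteredgenes)
```

gene61 is in the list but has no data, so it simply does not appear in the result.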

3 Comments

This almost got me there. The only issue is that I lose the first line/first gene of genelist.txt. This is an easy fix if I simply edit genelist.txt to:
genes
gene45
gene59
gene61
Is there another way I can prevent losing that first line if I'm loading files as tables?
Thank you!
By default, readtable considers the first line as a header line that is to be used to name the variables. To tell it to not do that:
readtable(___, 'ReadVariableNames', false)
readtable is extremely flexible. Look at its documentation to see all the options available.
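As a small sketch of that option applied to the gene list: the snippet below writes a throwaway copy of the list to a temporary file (the temporary path is purely for illustration) and reads it back with header detection turned off, so the first gene stays in the data.

```matlab
% Write a throwaway gene list to a temporary file (illustrative path only)
listfile = [tempname '.txt'];
fid = fopen(listfile, 'w');
fprintf(fid, '%s\n', 'gene45', 'gene59', 'gene61');
fclose(fid);

% Read it without treating the first line as variable names
genesonly = readtable(listfile, 'ReadVariableNames', false);

% The single column gets a default name; rename it so it can be used as a join key
genesonly.Properties.VariableNames = {'genes'};

delete(listfile); % clean up the temporary file
disp(genesonly)
```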
Thank you very much for the help!


More Answers (1)

Look into ismember() or setdiff()

1 Comment

I don't know how to use either for this purpose. Isn't setdiff() going to give me the genes they don't have in common? I want the genes they have in common. ismember() gives me a logical array, and I run into the same issue: how do I use that array to pull out only the rows that are "true"? I am also having difficulty manipulating the datasets (which format to load the txt files into: structure, table, etc.).
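To make the difference between these functions concrete, here is a minimal sketch on plain cell arrays, using the gene names from the question. The variable names are made up for illustration.

```matlab
% Gene names from the question (variable names are illustrative)
datagenes = {'gene12'; 'gene82'; 'gene31'; 'gene45'; 'gene59'};
listgenes = {'gene45'; 'gene59'; 'gene61'};

% setdiff: genes in the data that are NOT in the list (returned sorted)
notlisted = setdiff(datagenes, listgenes);

% intersect: genes present in both (returned sorted)
common = intersect(datagenes, listgenes);

% ismember: one logical value per data row, true where the gene is in the list
mask = ismember(datagenes, listgenes);

% Logical indexing keeps only the rows where mask is true
kept = datagenes(mask);
```

The same mask can index the rows of a table or matrix, for example data(mask, :), which is how the logical array pulls out only the "true" rows.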


Asked: 11 Apr 2017

Commented: 16 Apr 2017
