I have some data that includes both numbers and text. The text is inconsistent in which column the information is in. Is it possible to sort this table in some way please?

Here is some example data. How for example would I plot the Speed & Time of Apples and of Oranges please?

----------------------------------------

ID / Speed / Time / Fruit 1 / Fruit 2 / Fruit 3

----------------------------------------

1 3 9 Apple Jam Cake

2 6 5 Cake Orange Jam

3 3 4 Jam Cake Apple

4 5 2 Jam Orange Cake

5 9 4 Jam Jam Orange

6 5 3 Pear Jam Cake

7 3 5 Cake Cake Cake

8 2 3 Jam Cake Apple

9 5 3 Jam Orange Jam

10 7 9 Jam Jam Apple

22 Comments

As a follow up question, my data is also inconsistent in how it is described. For example items 10 and 11 in my new table:
11×6 table
ID Speed Time Fruit1 Fruit2 Fruit3
__ _____ ____ _______ ________ ____________
1 3 9 'Apple' 'Jam' 'Cake'
2 6 5 'Cake' 'Orange' 'Jam'
3 3 4 'Jam' 'Cake' 'Apple'
4 5 2 'Jam' 'Orange' 'Cake'
5 9 4 'Jam' 'Jam' 'Orange'
6 5 3 'Pear' 'Jam' 'Cake'
7 3 5 'Cake' 'Cake' 'Cake'
8 2 3 'Jam' 'Cake' 'Apple'
9 5 3 'Jam' 'Orange' 'Jam'
10 7 1 'Jam' 'Jam' 'JamApple'
11 1 4 'Jam' 'Jam' 'OrangeCake'
Is it possible to include the JamApple as a type of Apple please?
dpb
dpb on 3 May 2017
Edited: dpb on 3 May 2017
It'll just be another categorical level as above so add it to the set to match and previous logic will work. But, in that formulation it is a unique value, the substring internally is not discernible.
To do a search for the string you would have to leave the variables as cellstr (or, I presume, if you have later release, perhaps the new string class is also supported in table I don't know, can't test.
But, this clearly illustrates that you're going to have more and more difficulties if you do not normalize the database first...every case is going to require special code to handle this way and you'll have to look manually at every combination.
Thanks,
Sorry I'm not sure that I understand. I think that what I am trying to do is to use MATLAB to normalise my data. Are you saying that it cannot be done? That is for example to identify JamApple as a type of Apple.
If I can find out how "CTRL F" works in Calc then could I try doing it that way?
Should I try using Python instead of MATLAB?
>> version
ans =
'9.2.0.556344 (R2017a)'
dpb
dpb on 4 May 2017
Edited: dpb on 4 May 2017
So far, you're just trying to look up data in an unnormalized database. The mushing-together of two different characteristics into a single field is the problem; 'JamApple' should be recoded in the database rather than trying to write special code to search all these different storage forms. Same thing in putting a given characteristic in different fields (columns); each field should be of a specific type and only of that type. The process of rearranging the structure of a database itself and then correcting entries in a database to follow those rules is known as "normalization".
"... how "CTRL F" works in Calc..."
I have absolutely no idea what that means??? If you're talking about a text search in some application, then that's the "solution" I spoke of before of storing everything as cellstr or string and doing character searches instead, but it's still going to be problematic if there's no consistency in the underlying database.
You'll have the same or similar problems no matter what programming language you use as long as the data are inconsistent.
OK, R2017a will support the string class; you can 'spearmint and see how well it works in a table with string searching; I'm limited to R2012b so can't try it out...see the doc for it and what string-searching functions support the new class so far and if can use it in a table.
ADDENDUM
Indeed, table does support strings and there's even a new search function for them contains that looks like would allow you to finesse your way past the dilemma by retaining the data as strings. I still wouldn't recommend this direction in general if this is going to grow into something that is really used and going to have any significant amount of effort put into it going forward; it's just going to get more and more convoluted as it evolves if so. If it's just a toy app, that's something else again, certainly, but the idea of "worth doing, worth doing well" comes to mind even there, at least to me...
Thanks,
Yes, this is going to be used over the next 3 years for my PhD. I am working with a large volume of data that has been supplied by a third party. They intend to further increase my supply of data. I'm open to any advice on any approach to it. Hopefully I can use "contains" to normalise the data.
I just tried "contains" and it does what I think I want, thanks that's great!
IA = any(contains(cox{:,:},{'Apple'}),2);
I'm now away till Monday.
dpb
dpb on 4 May 2017
Edited: dpb on 4 May 2017
Well, if you have any control at all over this third party data it would behoove you to get it cleaned up on that end first. It's one thing to have all this jumbled-up stuff while there's still such a small amount you can visually look at each and every member and recognize where there are inconsistencies and heuristically code your way around them. When it gets large enough, that will no longer be practical at best and problematic or impossible at worst.
In that case, you could theoretically even be jeopardizing your degree by not solving the data storage issues up front; at a minimum I foresee making your task much more difficult and losing much sleep over detailed coding issues that have nothing whatsoever to do with the dissertation topic but are simply tedious and time-wasting exercises rather than research.
Perhaps I'm being overly cautious here, but I'd surely advise spending some serious time considering these data and what your research is going to require to be able to reliably extract from them to accomplish the objective in some depth rather than just finding some "tricks" to work around the issues this little sample illustrate. Where there's one nit here, there are bound to be a bunch more you've not yet seem when get the whole thing.
I'd suggest doing a search on the phrase "database normalization" and reading some of the articles to at least get a general idea of what is meant and some of the issues that occur when data are unnormalized.
It may not be needed to build a full relational database but having that understanding of what one is and how having one solves many problems such as this one of having to write specialty searching expressions because the data aren't symmetric will help you understand what will be the minimum amount of such rules you'll have to impose in order to be able to retrieve the data needed.
In the above research, you'll probably also be led to "NoSQL databases" which are those which are not relational but what are "key value stores"; iow, data are stored with a key/lookup relation. With what you have here, if it is at all representative of the actual data, that may be more in the direction you want to consider in your storage scheme. As of latest releases, Matlab now supports a map class which allows for such kind of organization; you should probably look some at it and compare the difference between it and the table.
I still think, however, that the basic data as you've shown it here needs some pretty heavy manipulation to make fields and values far more consistent and less overlapping or you're likely to run into real difficulties as you proceed.
Hope that helps outline some of the issues that you can look into before getting too deeply invested in any one technique that turns out to be a dead end.
Roger Cox
Roger Cox on 8 May 2017
Edited: Roger Cox on 8 May 2017
Thanks that's very helpful
dpb
dpb on 8 May 2017
You don't seem to be listening...
"NORMALIZE THE DATA FIRST!!!!"
You can keep making ad hoc fixes but as I've noted, this way leads to continuing to make for such issues every time some new issue arises -- and they will continue to arise.
This particular one could be solved by use of coding tricks, the real answer is to fix the data.
This
IA = any(contains(cox{:,:},{'Apple'}),2);
seems like kind of a broad brush. I would think you'd want
IA = any(contains(cox{:,{'Fruit1' 'Fruit2' 'Fruit3'}},'Apple'),2);
In any case, you are probably better off using categoricals for those variables instead of text. Indeed (modulo dpd's comments about normalization), with one command you can combine JamApple with Apple if you so desire:
t.Fruit3 = mergecats(t.Fruit3,{'JamApple','Apple'})
and to do it across all categorical variables in the table:
t = varfun(@(x) mergecats(x,{'JamApple','Apple'}), t, ...
'DataVariables',vartype('categorical'));
dpb
dpb on 8 May 2017
..."with one command you can combine JamApple with Apple if you so desire"
But what happens when JamApple turns into PreservesApple and who knows how many other perturbations? And, when Fruit1 is needed to be separated from Fruit2 and they've both got the same fields and yet there's something to be looked up that's dependent upon that...if, as OP indicates this is input data for a PhD dissertation topic, he's got a lot riding on the outcome and it surely looks to be a knot wrapped in an enigma to unsort what looks to be a sizable dataset if it has no order whatsoever imposed upon it.
I've said my piece; ol' D Quixote'll retire from the field now...battered, but undaunted in the quest. :)
I agree: "modulo dpd's comments about normalization"
dpb
dpb on 9 May 2017
I grok, Peter, saw/appreciated the comment...just wanted to reemphasize one last time.
Roger Cox
Roger Cox on 10 May 2017
These coding tools are helping me to normalise my data.
Roger Cox
Roger Cox on 10 May 2017
As another follow up question, what about where I need to sort what the 5th character is in this example please?:
1234
1235
1236A
1237A
1238A
1239O
1240O
1241AP
1242AQ
Roger Cox
Roger Cox on 10 May 2017
Edited: Roger Cox on 10 May 2017
What is the "any" for please?
t1 = readtable('cox1.dat');
PAD = '00000';
tP = strcat(t1, PAD);
t5 = cellfun(@(x) x(5),tP(cellfun('length',tP) > 1),'un',0);
IA = contains(t5, 'A');
dpb
dpb on 10 May 2017
Edited: dpb on 10 May 2017
any reduces the logical array to vector T/F condition. Here, note the "_,2" optional trailing argument to work by row (2nd dimension) instead of default column; hence the result is T for each row matching the condition to select from the overall table those rows matching whatever the condition was.
Thus you get every row that has whatever 'twas were looking for in any column in the searched area. The alternative here is all which would reguire the match occur in every column of a row to return T.
Look up "logical addressing" in "Getting Started" documentation for further details and understanding of how important it is in Matlab in general and particularly for such activities as you're undertaking.
Hope the ability to rewrite the data into something more closely approximating a database is coming along so these convoluted searches can go away.
If you don't mind my asking, are these examples really representative of actual data and if so, just what is it they come from and what meaning/knowledge is to be derived therefrom? Looks like random word generator with restricted vocabulary so far... :)
The any was to look across the three table variables, any one of which might have contained the category you were looking for. The code you're showing now seems to be for a completely different purpose. This seems to have wandered a bit far from the original question perhaps make a new post.
Also, it looks like you're trying to call strcat on a table, which won't work.
Roger Cox
Roger Cox on 11 May 2017
As you guessed, the examples are not representative of the actual data, which is confidential. My PhD is about Wind Turbine Reliability.
The searches are being used to normnalise the data. I feel that, thanks to your help, I am making good progress but that it is quite a job.
Thanks Peter, so the Any is for searching multiple columns; that makes sense.
I did start a new post but then I couldn't find it so I went back to this one.
dpb
dpb on 11 May 2017
"...so the Any is for searching multiple columns"
Well, yes, but somewhat indirectly. The actual "looking" was done by contains or ismember depending on either my or Peter's example search.
See my more involved note preceding Peter's for what part any more specifically plays in the process.
dpb
dpb on 11 May 2017
What definition for "reliability" for wind turbine, pray tell? Simply the mechanical apparatus or some greater measure?
I've monitored output of a local wind farm near here (which is SW KS, one of the most favorable locations there is) and over some 10 years average capacity factor has been only on the order of 30-35% of installed nameplate capacity. Months in midsummer/midwinter "doldrums" are sometimes in low 20% range. Thinking this might indicate demand limitation rather than actual generation ability I correlated with average windspeed and there's a >80% consonance between the two from which it surely appears it is wind-limited rather than demand.
Seems like major investment that mostly does nothing besides which we can't eliminate ready-reserve base generation since that would effect grid reliability.
Roger Cox
Roger Cox on 18 May 2017
I'm interested in Condition Based Maitenance. This could increase the capacity factor achieved for a constant ammount of maintenance work.
As you suggest, energy storage is another interesting area.
dpb
dpb on 18 May 2017
Interesting...spent last 10 or so year of employment career with CSI (Computational Systems, Inc.)for whom predictive maintenance is core business...they were just sold to Emerson Electric when I left to return to family farm; they now are pretty-much folded into Emerson so don't show up as CSI externally any longer ... <Emerson-embeds-prediction-data> might lead to some useful white papers, etc., etc., etc., ... <machinery-health-management>
Last piece I worked on was the wireless accelerometer, an industry first at the time...

Sign in to comment.

 Accepted Answer

dpb
dpb on 3 May 2017
Using categorical for the string columns has some advantages. I fixed up your listing above to be able to read as a table via
> t=readtable('cox.dat');
>> whos t
Name Size Bytes Class Attributes
t 10x6 2848 table
>> t.Properties
ans =
Description: ''
VariableDescriptions: {}
VariableUnits: {}
DimensionNames: {'Row' 'Variable'}
UserData: []
RowNames: {}
VariableNames: {'ID' 'Speed' 'Time' 'Fruit1' 'Fruit2' 'Fruit3'}
>>
With that it's easy enough to write
>> ix=any(ismember(t{:,4:6},{'Apple';'Orange'}),2); % find any row with apple or orange
>> t(ix,:) % display those found
ans =
ID Speed Time Fruit1 Fruit2 Fruit3
__ _____ ____ ______ ______ ______
1 3 9 Apple Jam Cake
2 6 5 Cake Orange Jam
3 3 4 Jam Cake Apple
4 5 2 Jam Orange Cake
5 9 4 Jam Jam Orange
8 2 3 Jam Cake Apple
9 5 3 Jam Orange Jam
10 7 9 Jam Jam Apple
>>
Use the logical index vector to access any other variable of interest.
It would be better going forward, of course, to normalize the database to put the three disparate properties in their own variable.

5 Comments

Roger Cox
Roger Cox on 3 May 2017
Edited: dpb on 3 May 2017
Thanks that find function is what I need. So I can plot the Speed & Time of Apples and of Oranges using:
% Data
TT3 =readtable('cox3.dat');
% Find Fruit
AP = any(ismember(TT3{:,4:6},{'Apple'}),2);
OR = any(ismember(TT3{:,4:6},{'Orange'}),2);
% Sort Fruit
TT3AP = TT3(AP,:); % Apple
TT3OR = TT3(OR,:); % Orange
APS = TT3AP(:,2); % Apple Speed
APT = TT3AP(:,3); % Apple Time
ORS = TT3OR(:,2); % Orange Speed
ORT = TT3OR(:,3); % Orange Time
APS1 = table2array(APS);
APT1 = table2array(APT);
ORS1 = table2array(ORS);
ORT1 = table2array(ORT);
% Graph
hold off
scatter(APS1,APT1)
hold on
scatter(ORS1,ORT1)
legend('Apple','Orange')
xlabel('Speed')
ylabel('Time')
dpb
dpb on 3 May 2017
Edited: dpb on 3 May 2017
You don't need so many intermediary variables here...
TT3 =readtable('cox3.dat');
AP = any(ismember(TT3{:,4:6},{'Apple'}),2);
OR = any(ismember(TT3{:,4:6},{'Orange'}),2);
plot(TT3{OR,2},TT3{OR,3},'o',TT3{AP,2},TT3{AP,3},'o')
xlim([0 10]),ylim([0 10])
legend('Apple','Orange')
xlabel('Speed')
ylabel('Time')
"Use the curlies, Luke!" to dereference the table variables similarly as to cell array addressing and unless using the special features of scatter such as variable-size/color markers, plot is much more efficient.
Thanks that is both more efficient and more effective than my attempt.
Am I allowed a follow up question please (see JamApple above)?
When accessing one variable at a time, Obi Wan might also suggest using the dot:
plot(TT3.Speed(OR),TT3.Time(OR),'o',TT3.Speed(AP),TT3.Time(AP),'o')
But even with curlies, names are nice:
plot(TT3{OR,'Speed'},TT3{OR,'Time'},'o',TT3{AP,'Speed'},TT3{AP,'Speed'},'o')
dpb
dpb on 9 May 2017
Ever so, may be it can... :)

Sign in to comment.

More Answers (0)

Products

Asked:

on 2 May 2017

Commented:

dpb
on 18 May 2017

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!