What's the difference between a table (new in R2013b) and a dataset (stats toolbox)?

As an enthusiast for the dataset class, I notice with interest a new class table in the latest MATLAB release (in the promo video). This sounds very similar to the existing dataset class in the Statistics Toolbox which I have been using since release.
When I search the documentation/help for "table dataset" all I find are the converter functions dataset2table and table2dataset, but the question I have is what is the difference in intention between these? When is it appropriate to use a dataset and when to use a table? What is the difference between the design of these two classes?
What about the "new" categorical class? Has this moved from stats toolbox into base MATLAB?
Should we expect the dataset and categorical classes in the Statistics Toolbox to be deprecated in the future?

 Accepted Answer

Julian, as you noticed, MATLAB R2013b includes two new array types known as tables and categorical arrays. These are very similar to the dataset, nominal, and ordinal array types that have been part of the Statistics Toolbox for about six years. Like a dataset array, a table is a container that holds mixed-type tabular data, the sort of column-oriented data you would often import from a CSV file or a spreadsheet. And like nominal and ordinal arrays, a categorical array represents discrete non-numeric data, the sort of data you might otherwise have used strings or "coded integers" to store.
Generally speaking, these new data types should look and feel very familiar to anyone who has used the ones in the Statistics Toolbox. One obvious difference is that they are included as part of core MATLAB, and you don't need to install the Statistics Toolbox to use them. In addition, their design and terminology make them a bit more accessible for non-statistical uses, though they remain just as useful for statistics.
Tables and categorical arrays are ultimately intended as replacements for dataset, nominal, and ordinal arrays, and we recommend that MATLAB users adopt them for new work. We also recommend that, over time, users update any of their existing code that uses dataset/nominal/ordinal, but we don't expect that that changeover can happen immediately. Upcoming releases will provide more details and strategies for making the transition.
In R2013b, all of the Statistics Toolbox functionality that uses nominal and ordinal arrays also supports the new categorical arrays. In R2013b, you'll still need to use dataset arrays in the Statistics Toolbox for things like LinearModel and (new in R2013b) LinearMixedModel, but you might consider creating tables and converting to dataset only when needed, using table2dataset.
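As a rough illustration of the "create tables, convert only when needed" suggestion above, here is a minimal sketch. The variable names and data are hypothetical, chosen only for the example; table and categorical are core MATLAB functions in R2013b, while table2dataset and dataset2table assume the Statistics Toolbox is installed.

```matlab
% Hypothetical example data; names and values are illustrative only.
Name   = {'Smith'; 'Jones'; 'Brown'};
Age    = [38; 43; 29];
Gender = categorical({'M'; 'F'; 'F'});   % core-MATLAB categorical array

% Build a table in core MATLAB (no Statistics Toolbox needed).
% The table variables take their names from the workspace variables.
T = table(Name, Age, Gender);

% Convert only at the point where a dataset array is required
% (table2dataset is provided by the Statistics Toolbox).
ds = table2dataset(T);

% Convert back to a table once the dataset-only step is done.
T2 = dataset2table(ds);
```

Keeping the data in a table and converting at the boundary means only the code that calls dataset-only functionality needs to change later.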

5 Comments

Great explanation Peter. I'll add that there's a blog post that introduces these new data types in detail: http://blogs.mathworks.com/loren/2013/09/10/introduction-to-the-new-matlab-data-types-in-r2013b/
Just to say I am a big fan of the dataset & categorical classes and have been making heavy use of them since their launch. Before today I saw pleas from users here at MATLAB Central for dataset to be included in base MATLAB, so tables should do the trick. However, in my opinion it is disingenuous of TMW to launch tables as "new", rather than as a rebranding of datasets and an (entirely welcome) license change, while omitting to mention datasets in any of the release notes, videos or blogs that accompany the marketing.
As one example, the main doc page for dataset http://www.mathworks.co.uk/help/stats/dataset-arrays.html doesn't have any reference to the equivalent page for table. There is no discussion of when it is appropriate to use a dataset and when to use a table, or an equivalence / migration guide, or a mention of the future intention to deprecate datasets, which has a big impact on my code.
In future I would hope to see:
  • better cross-referencing between tables and datasets in the doc
  • forward guidance for Stats Toolbox users about the road-map for datasets now that we also have tables
  • no undocumented changes made to the class design for tables without remark in the release notes (as happened with datasets)
  • more functions in the Statistics Toolbox that work natively with datasets or tables, e.g. boxplot and parallelcoords, to name two off the top of my head.
The only other point to ask is in these days of "Big Data" whether a handle version of table/dataset would be useful? My datasets tend to be large, but making a small change to one variable in a dataset, or just changing the metadata, means copying the entire table and passing large frames on the stack.
PS I have also posted this comment as a blog response
Julian, I was a bit surprised that you think we were disingenuous, which means lacking in frankness, candor, or sincerity; insincere or calculating. To our way of thinking, it is not necessary for a MATLAB user to understand datasets in the Statistics Toolbox or to know about their existence in order to learn about and successfully use the new table type in MATLAB. In fact, we think it would be mostly a distraction, introducing added complexity into documentation and demonstrations that would not be helpful to most people. That said, it is probably true that some people could have used more information about the connection between dataset and table than we provided. I might suggest "oversight" or "ran out of time" as alternative explanations instead of insincerity.
On your point about large datasets, MATLAB uses reference-counting heavily under the hood in order to avoid actual memory copies whenever possible. Changing one variable, or changing the table's metadata, wouldn't normally result in a memory copy of the entire table.
Steve, sure, I agree that not mentioning the existing classes in the main MATLAB doc is clearer for a new MATLAB user (or one without the Stats Toolbox), and the doc is cleaner that way. But release notes (and videos) speak mainly to existing users rather than new ones, and the new converter functions dataset2table and table2dataset should be mentioned in the release notes for the Statistics Toolbox. TMW modified the head doc page for Dataset Arrays http://www.mathworks.com/help/stats/dataset-arrays.html to reference table2dataset & dataset2table, but there is no remark at all about the relation between dataset and table, and the future implication for Statistics Toolbox users. The head page for Categorical Arrays http://www.mathworks.co.uk/help/stats/categorical-arrays.html fails even to mention its new non-abstract namesake.
I am sure the design of table and categorical leaned heavily on experience with dataset and categorical. TMW didn't forget about datasets or categoricals when you launched their replacements with a big fanfare, but you did forget about their users when you updated the Statistics Toolbox documentation and release notes.
Thank you for your other remark regarding efficiency. BTW It's great to see "datasets" and "categoricals" get a wider audience, I really like them. It will be quite a while before I get to try the new ones (my company just upgraded to R2013a from R2011a). I hope a migration guide will be published by then?
BTW, I just found out changes made to the dataset() constructor between R2011a and R2013a broke my code...
