Explore1

Version 1.0.0 (1.69 MB) by ArchNW
Uses CSV-based data to generate a suite of data exploration statistics and graphs, then automatically writes the new data back to a CSV file.
Updated 26 Sep 2023

Explore1
Explore1 imports CSV data, calculates a complete set of summary statistics, runs basic hypothesis tests, and generates the associated graphs. Optionally, Explore1 can write the generated result tables back to a CSV file.
Example 1: Output1 = explore1(test_Choice, input_Data)
Example 2: Output1 = explore1()
test_Choice – a selection from the explore1() function user input menu (see below)
input_Data – a CSV file name (see below)
Output1 – a MATLAB structure containing all tables generated during a run, each with a brief identification
Introduction
The Explore1 functions were created with the overall goal of allowing researchers to quickly begin to understand a given dataset. Data, once collected, is often stored in one of a few basic file types. Probably the most common of these are comma-separated values (CSV) files and Excel files. Similarly, in scientific research there are a number of basic statistical procedures that one should complete to describe a dataset. These basic procedures (often referred to as summary statistics) form the basis for further investigation of the data. Further, there are a number of basic hypothesis tests that can be used to begin to understand the underlying structure of a dataset. Explore1 was created with these ideas in mind. Explore1 takes data stored in CSV files and calls a series of related functions that carry out these basic first steps of analysis. It contains functions and scripts that accomplish several tasks. These include:
1) Import data from a CSV (Excel) file.
2) Calculate a selection of basic summary and normality testing statistics.
3) Perform basic hypothesis tests.
4) Write and save resulting tables to a new Excel file.
To further the overall goal of reaching a firmer understanding of a dataset, Explore1 gives the user options to:
5) Generate graphs for associated statistics.
6) Calculate bootstrap confidence intervals.
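The tasks listed above can be illustrated with a short base-MATLAB sketch. It is not Explore1's own code: the variable name Length, the sample values, the temporary file names, and the choice of a percentile bootstrap with 2000 resamples are all assumptions made for illustration.

```matlab
% Hypothetical data written to a temporary CSV so the sketch is self-contained.
csvIn = [tempname '.csv'];
T = table([4.1; 5.3; 3.8; 6.0; 5.5; 4.7], 'VariableNames', {'Length'});
writetable(T, csvIn);

% Task 1: import data from the CSV file.
data = readtable(csvIn);
x = data.Length;

% Task 2: a few basic summary statistics.
n  = numel(x);
mu = mean(x);
sd = std(x);            % sample standard deviation (n - 1 denominator)
se = sd / sqrt(n);      % standard error of the mean
cv = sd / mu;           % coefficient of variation

% Task 6: percentile bootstrap 95% confidence interval for the mean
% (an assumed method; Explore1 may use a different bootstrap variant).
rng(1);                 % fixed seed so the run is repeatable
nBoot = 2000;
bootMeans = zeros(nBoot, 1);
for b = 1:nBoot
    resample = x(randi(n, n, 1));   % draw n values with replacement
    bootMeans(b) = mean(resample);
end
sortedMeans = sort(bootMeans);
ci = [sortedMeans(round(0.025 * nBoot)), sortedMeans(round(0.975 * nBoot))];

% Task 4: write the resulting table to a new file (CSV here for portability).
results = table(n, mu, sd, se, cv, ci(1), ci(2), ...
    'VariableNames', {'N', 'Mean', 'SD', 'SE', 'CV', 'CI_Low', 'CI_High'});
csvOut = [tempname '.csv'];
writetable(results, csvOut);
```

The sketch writes its results to CSV so that it runs without Excel; Explore1 itself handles output through its xlwrite2 and xlgrphwrite2 functions.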
In general, Explore1 will calculate summary statistics on a complete dataset imported from a CSV file. This data can be interval (measurement) data or counts of observations (which can be generated automatically from a list of observations). All statistics are organized into tables.
Additionally, Explore1 will perform analysis (summary and hypothesis testing) and generate associated graphs focusing on one Grouping Variable and (optionally) one Splitting Variable. These are the variables that determine how the data is summarized and tested. For example, if “Yet Another Nominal Variable” from Example 1 (see Data Structure below) were selected as the Grouping Variable, summary statistics for Code1 and Code2 would be generated. For hypothesis testing, a t-test, among other tests, would be calculated comparing the two subdivisions. For interval-based analysis, the calculations work with the values of the selected Interval Variable X across all observations in each Code. For count-based analysis, the observations are first transformed into count data: the original dataset is automatically binned based on an organizing variable, summary statistics for each Code’s counts are generated, and hypothesis tests are then conducted comparing counts for each Code across bins.
Finally, if both Grouping and Splitting Variables have been identified, Explore1 will combine these into a new variable with all combinations of the two parent variables’ subdivisions. Analysis will then be carried out on this new hybrid variable. Once all statistics have been calculated, Explore1 can write tables containing test results and associated figures to an Excel file.
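As a rough illustration of the two ideas described above (the hybrid Group-Split variable and the binning of observations into counts), here is a minimal base-MATLAB sketch. The data values, the variable names, the organizing variable Depth, and the bin edges are hypothetical, and the code is not taken from Explore1.

```matlab
% Hypothetical observations: an interval measure, two nominal variables,
% and an organizing variable (Depth) used for binning.
X         = [4.1; 5.3; 3.8; 6.0; 5.5; 4.7; 6.2; 3.9];
Grouping  = categorical({'Code1';'Code1';'Code2';'Code2';'Code1';'Code2';'Code1';'Code2'});
Splitting = categorical({'A';'B';'A';'B';'A';'B';'B';'A'});
Depth     = [5; 12; 7; 18; 25; 14; 22; 3];

% Interval-based analysis: combine the two parent variables into one
% hybrid variable, then summarize X within each combined subdivision.
GroupSplit = categorical(strcat(cellstr(Grouping), '_', cellstr(Splitting)));
[idx, levels] = findgroups(GroupSplit);
N    = splitapply(@numel, X, idx);
Mean = splitapply(@mean,  X, idx);
intervalSummary = table(levels, N, Mean);
disp(intervalSummary)

% Count-based analysis: bin the observations on the organizing variable
% and count how many observations of each Code fall into each bin.
edges  = [0 10 20 30];                      % assumed bin edges
codes  = categories(Grouping);
counts = zeros(numel(codes), numel(edges) - 1);
for k = 1:numel(codes)
    counts(k, :) = histcounts(Depth(Grouping == codes{k}), edges);
end
disp(array2table(counts, 'RowNames', codes))
```

Each row of counts holds one Code and each column one bin, which is the layout a contingency-table test would expect.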
**********************************************************************************************
Statistics
Explore1 calculates statistics and generates graphs for the overall dataset, for each interval measure selected, and/or for binned count data. Additionally, summary statistics are calculated for each subdivision of the Grouping, Splitting, and combined Group-Split variables. Explore1 calculates the following statistics:
1) General Summary Statistics:
   a. Number of observations
   b. Sum
   c. Mean
   d. Standard Deviation
   e. Minimum
   f. 25th Percentile
   g. Median
   h. 75th Percentile
   i. Maximum
   j. Standard Error
   k. Variance
   l. Skewness
   m. Standard Error of Skewness
   n. Kurtosis
   o. Standard Error of Kurtosis
   p. Coefficient of Variation
2) Absolute Deviation
   a. Mean Absolute Deviation
   b. Median Absolute Deviation
3) Normality
   a. Shapiro-Wilk W
   b. Shapiro-Francia W’
   c. Anderson-Darling
   d. Kolmogorov-Smirnov (one sample)
   e. Jarque-Bera
4) Hypothesis Testing
   a. T-test
   b. Permutation t-test
   c. Mann-Whitney U
   d. Two-Sample Kolmogorov-Smirnov
   e. Fligner-Killeen
      i. Conover Variation
      ii. Donnelly-Kramer Variation
   f. ANOVA
   g. Robust ANOVA Alternatives:
      i. Bartlett’s
      ii. Levene Absolute
      iii. Brown-Forsythe
      iv. O’Brien
   h. ANOVA Post-hoc
      i. Tukey-Kramer
      ii. Bonferroni
      iii. Dunn-Sidak
      iv. Scheffé
   i. Kruskal-Wallis
      i. Mann-Whitney Post-Hoc
5) Effect Size Measures
   a. Cohen’s d
   b. Glass’ Delta 1
   c. Glass’ Delta 2
   d. Hedges’ g
   e. R-effect
   f. Eta
   g. Eta Squared
   h. Omega Squared
   i. Epsilon Squared
6) Chi-Square
   a. Contingency Table
      i. Observed
      ii. Expected
      iii. Residual
      iv. Standardized Residual
      v. Adjusted Residual
   b. Chi-Square
   c. Likelihood Ratio
   d. Fisher Test
   e. Cramér’s V
   f. Phi
   g. Contingency Coefficient
   h. Nominal Measure of Association
      i. Lambda Test
      ii. Goodman and Kruskal tau
      iii. Uncertainty Coefficient
7) Graphs
   a. Histogram
   b. Histogram with fitted distributions shown:
      i. Normal
      ii. Kernel Density
      iii. Poisson
   c. Normal Distribution Probability Plot
   d. Quantile-Quantile Plot
   e. Probability Plot for Lognormal Distribution
   f. Group KDE
   g. Group Bar Graph with Error Bars
   h. Boxplot
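To make a few of the listed measures concrete, the sketch below computes several effect sizes and chi-square quantities from their textbook formulas in base MATLAB. The data are hypothetical, and the formulas are the common textbook versions rather than a copy of Explore1's code. In particular, taking Glass' Delta 1 and 2 to mean scaling by the first and second group's standard deviation is an assumption.

```matlab
% Hypothetical interval data for two subdivisions of a Grouping variable.
g1 = [4.1 5.3 5.5 6.2 4.9];
g2 = [3.8 6.0 4.7 3.9 4.2];
n1 = numel(g1);
n2 = numel(g2);
meanDiff = mean(g1) - mean(g2);

% Pooled standard deviation and the equal-variance t statistic.
sp    = sqrt(((n1 - 1) * var(g1) + (n2 - 1) * var(g2)) / (n1 + n2 - 2));
tStat = meanDiff / (sp * sqrt(1/n1 + 1/n2));

% Standardized mean-difference effect sizes.
cohenD  = meanDiff / sp;
hedgesG = cohenD * (1 - 3 / (4 * (n1 + n2) - 9));   % small-sample correction
glass1  = meanDiff / std(g1);    % assumed: scaled by group 1 SD
glass2  = meanDiff / std(g2);    % assumed: scaled by group 2 SD

% Hypothetical observed counts: rows are Codes, columns are bins.
observed = [12 8 5; 6 10 9];
N        = sum(observed(:));
expected = sum(observed, 2) * sum(observed, 1) / N;   % row total * column total / N
residual = observed - expected;
stdResid = residual ./ sqrt(expected);                % standardized residuals
chi2Stat = sum(stdResid(:).^2);                       % Pearson chi-square
dof      = (size(observed, 1) - 1) * (size(observed, 2) - 1);
cramerV  = sqrt(chi2Stat / (N * (min(size(observed)) - 1)));

fprintf('t = %.3f, d = %.3f, g = %.3f\n', tStat, cohenD, hedgesG);
fprintf('chi2 = %.3f (df = %d), Cramer''s V = %.3f\n', chi2Stat, dof, cramerV);
```

P-values are left out so the sketch stays in base MATLAB; obtaining them would need the t and chi-square distribution functions, which are typically provided by the Statistics and Machine Learning Toolbox.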
Included Functions
amg2 – alternate multicompare graphs
basic_numeric_stats_v2 – summary statistics for interval based data
basic_sum_stats_v2 – summary statistics for count based data
cbs() – calls analysis and writing functions for complete dataset - counts
chi2 – calculates chi-square and related tables, figures, and statistics
chi2_2 – calculates chi-square and related tables, figures, and statistics
cnt_data_1_nomv_stats – stand-alone control script for count data with one nominal Grouping variable
cnt_data_2_nomvs_stats – stand-alone control script for count data with nominal Grouping and Splitting variables
Count_Statistics – stand-alone control script that takes observation data, bins it into count data, and calls summary and hypothesis testing functions
count_sum_stats_cbs – summary statistics called from cbs() function
countstats2 – main control function between explore1() and summary and hypothesis functions
explore1 – control function – select analysis variation and define input data name
FK_DK_Con – Fligner-Killeen test
group_sum_stats_v2 – summary statistics - counts
groupsplit_num_sum_v2 – summary statistics – interval
groupsplit_sum_stats_v2 – summary statistics – counts
hypoth_measures_n2_V2 – hypothesis testing – 2 subdivisions of Grouping Variable - interval
hypoth_measures_n3plus_V2 – hypothesis testing – more than 2 subdivisions – interval
hypoth_n2_V2 – hypothesis testing – 2 subdivisions of Grouping Variable – counts
hypoth_n3plus_V2 – hypothesis testing – more than 2 subdivisions – counts
ibs – calls analysis and writing functions for complete dataset - interval
inputsBoth – user input questions
Live_Counts – achieves same results as selecting “2” when prompted while using explore1()
Live_Measures – achieves same results as selecting “1” when prompted while using explore1()
Measure_Statistics – stand-alone control script – intervals - achieves same results as selecting “1” when prompted while using explore1()
measurestats2 – control function
swft – Shapiro-Wilk and Shapiro-Francia normality tests
t_perm_test – permutation test
xlgrphwrite2 – writes figures to an Excel file created by the xlwrite2 function
xlwrite2 – writes table output to an Excel file
Testing and Algorithm Selection
Throughout the programming process, test results were compared to results produced by several statistical software packages. These included SPSS, Stata, PAST, and R. In some cases, statistical software packages use slightly different algorithms to achieve the same basic ends. In a portion of those cases, the results could be slightly different. When differences presented themselves, I have generally selected the version of the test in question that seemed to be used across the most platforms. Failing this, a literature review was carried out.
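Sample skewness is one well-known case where packages can legitimately disagree. The sketch below computes two common estimators, the moment-based g1 and the bias-adjusted G1 (the notation follows Joanes and Gill, 1998, listed in the bibliography), on hypothetical data. It shows only how such differences can arise and does not indicate which version Explore1 reports.

```matlab
% Hypothetical right-skewed sample.
x = [2 3 3 4 5 5 6 12];
n = numel(x);
d = x - mean(x);

% Central moments using the 1/n (population-style) denominator.
m2 = sum(d.^2) / n;
m3 = sum(d.^3) / n;

% Two common skewness estimators computed from the same data.
g1 = m3 / m2^1.5;                          % moment-based, uncorrected
G1 = g1 * sqrt(n * (n - 1)) / (n - 2);     % bias-adjusted version

% Standard error of skewness usually reported alongside G1.
seSkew = sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)));

fprintf('g1 = %.4f, G1 = %.4f, SE = %.4f\n', g1, G1, seSkew);
```

On small samples the two estimators differ noticeably, and the gap shrinks as the sample size grows.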
Bibliography, Abridged
Most of the procedures calculated by Explore1 are fairly well documented in most basic to mid-level statistics books. Presented below is an abbreviated list of references used in selecting and refining the algorithms used throughout the functions.
Ahmad, F., & Sherwani, R. A. K. (2015). Power Comparison of Various Normality Tests. Pakistan Journal of Statistics and Operation Research, 11(3), 331-345.
Anderson, M. J. (2001). Permutation Tests for Univariate or Multivariate Analysis of Variance and Regression. Canadian Journal of Fisheries and Aquatic Sciences, 58, 626-639.
Baxter, M. J., & Beardah, C. C. (1996). Beyond Histograms - Improved Approaches to Simple Data Display in Archaeology Using Kernel Density Estimates. Department of Mathematics, Statistics, and Operational Research. The Nottingham Trent University. Nottingham.
Bohn, L. L., & Wolfe, D. A. (1992). Nonparametric Two-Sample Procedures for Ranked-Set Samples Data. Journal of the American Statistical Association, 87(418), 552-561.
Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Introduction to Meta-Analysis. Chichester: John Wiley and Sons, Ltd.
Brown, M. B., & Forsythe, A. B. (1974). Robust Tests for the Equality of Variances. Journal of the American Statistical Association, 69(346), 364-367.
Cameron, A. C. (2004). Kurtosis. In M. S. Lewis-Beck, A. Bryman, & T. F. Liao (Eds.), Encyclopedia of Social Science Research Methods (pp. 544-545). Thousand Oaks: SAGE Publications, Inc.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences, Second Edition. United States of America: Lawrence Erlbaum Associates.
Cohen, J. (1992a). A Power Primer. Psychological Bulletin, 112(1), 155-159.
Cohen, J. (1992b). Statistical Power Analysis. Current Directions in Psychological Science, 1(3), 98-101.
Conover, W. J., Johnson, M. E., & Johnson, C. D. (1981). A Comparative Study of Tests for Homogeneity of Variances, with Applications to the Outer Continental Shelf Bidding Data. Technometrics, 23(4), 351-361.
DeCarlo, L. T. (1997). On the Meaning and Use of Kurtosis. Psychological Methods, 2(3), 292-307.
Donnelly, S. M., & Kramer, A. (1999). Testing for Multiple Species in Fossil Samples: An Evaluation and Comparison of Tests for Equal Relative Variation. American Journal of Physical Anthropology, 108, 507-529.
Drennan, R. D. (2009). Statistics for Archaeologists: A Commonsense Approach, Second Edition. Dordrecht: Springer.
Fletcher, M., & Lock, G. R. (2005). Digging Numbers: Elementary Statistics for Archaeologists (2nd ed.). Oxford: Oxford University Committee for Archaeology.
Fritz, C. O., Morris, P. E., & Richler, J. J. (2012). Effect size estimates: current use, calculations, and interpretation. J Exp Psychol Gen, 141(1), 2-18. doi:10.1037/a0024338
Gan, F. F., Koehler, K. J., & Thompson, J. C. (1991). Probability Plots and Distribution Curves for Assessing the Fit of Probability Models. The American Statistician, 45(1), 14-21.
Joanes, D. N., & Gill, C. A. (1998). Comparing Measures of Sample Skewness and Kurtosis. Journal of the Royal Statistical Society. Series D (The Statistician), 47(1), 183-189.
Liebetrau, A. M. (2011). Measures of Association: SAGE Publications, Inc.
Nakagawa, S., & Cuthill, I. C. (2007). Effect size, confidence interval and statistical significance: a practical guide for biologists. Biol Rev Camb Philos Soc, 82(4), 591-605. doi:10.1111/j.1469-185X.2007.00027.x
Ramenofsky, A. F., & Steffen, A. (1998a). Units as Tools of Measurement. In A. F. Ramenofsky & A. Steffen (Eds.), Unit Issues in Archaeology (pp. 3-18). Salt Lake: The University of Utah Press.
Ramenofsky, A. F., & Steffen, A. (Eds.). (1998b). Unit Issues in Archaeology. Salt Lake: The University of Utah Press.
Rogan, J. C., & Keselman, H. J. (1977). Is the ANOVA F-Test Robust to Variance Heterogeneity When Sample Sizes are Equal?: An Investigation via a Coefficient of Variation. American Educational Research Journal, 14, 493-498.
Rosenthal, R., & Rubin, D. B. (2003). r equivalent: A simple effect size indicator. Psychol Methods, 8(4), 492-496. doi:10.1037/1082-989X.8.4.492
Royston, J. P. (1982a). Algorithm AS 181: The W Test for Normality. Applied Statistics, 31(2), 176-180.
Royston, J. P. (1982b). An Extension of Shapiro and Wilk's W Test for Normality to Large Samples. Journal of the Royal Statistical Society. Series C (Applied Statistics), 31(2), 115-124.
Royston, J. P. (1983). A Simple Method for Evaluating the Shapiro-Francia W' Test of Non-Normality. Journal of the Royal Statistical Society. Series D (The Statistician), 32(3), 297-300.
Royston, J. P. (1991). Tests for departure from normality. Stata Technical Bulletin, 2(July), 16-17.
Ruxton, G. (2006). The unequal variance t-test is an underused alternative to Student’s t-test and the Mann–Whitney U test. Behavioral Ecology, 17(4), 688-690.
Shennan, S. (1997). Quantifying Archaeology. Iowa City: University of Iowa Press.
Stephens, M. A. (1970). Use of the Kolmogorov-Smirnov, Cramer-Von Mises and Related Statistics Without Extensive Tables. Journal of the Royal Statistical Society. Series B (Methodological), 32(1), 115-122.
Sullivan, A. P., III, Mink, P. B., II, & Uphus, P. M. (2007). Archaeological Survey Design, Units of Observation, and the Characterization of Regional Variability. American Antiquity, 72(2), 322-333.
Tomarken, A. J., & Serlin, R. C. (1986). Comparison of ANOVA Alternatives Under Variance Heterogeneity and Specific Noncentrality Structures. Quantitative Methods in Psychology, 99(1), 90-99.
Vargha, A., & Delaney, H. D. (1998). The Kruskal-Wallis Test and Stochastic Homogeneity. Journal of Educational and Behavioral Statistics, 23(2), 170-192.
Wilcox, R. R. (1992). Why Can Methods for Comparing Means Have Relatively Low Power, and What Can You Do to Correct the Problem? Current Directions in Psychological Science, 1(3), 101-105.
Wilk, M. B., & Gnanadesikan, R. (1968). Probability Plotting Methods for the Analysis of Data. Biometrika, 55(1), 1-17.
Wilk, M. B., & Shapiro, S. S. (1965). An Analysis of Variance Test for Normality (Complete Samples). Biometrika, 52(3/4), 591-611.
Yazici, B., & Yolacan, S. (2007). A Comparison of Various Tests of Normality. Journal of Statistical Computation and Simulation, 77(2), 175–183.
Zimmerman, D. W. (1987). Comparative Power of Student T Test and Mann-Whitney U Test for Unequal Sample Sizes and Variances. The Journal of Experimental Education, 55(3), 171-174.
Copyright William Gardner-O'Kearny 2023

Cite As

Gardner-O'Kearny, William (2023). Explore1 (https://www.mathworks.com/matlabcentral/fileexchange/<...>), MATLAB Central File Exchange. Retrieved September 26, 2023.

MATLAB Release Compatibility
Created with R2023b
Compatible with R2020a to R2023b
Platform Compatibility
Windows macOS Linux
Version Published Release Notes
1.0.0