Need advice for coding dummyvar vectors - Regression
How should I properly code dummy-variable vectors for use in regression analysis in MATLAB? I have attached a sample table of data (.xlsx file) that I wish to import into MATLAB and then run regressions on (the outcome is the last column in the table). Some of the categorical variables, such as WindAirport and WindRail, are simply coded as 1 or 0 (stored as 'double'); others are logical true/false. For my model output I need to show both groups, such as Site_0 and Site_1, with their regression slope coefficients, as well as the model intercept term and its coefficient. Should all dummy variables be categorical, logical, or double to achieve the desired model output? I will use fitglm as the model function. Any advice is welcome. Thank you.
T = readtable('chels_sample.xlsx'); % readtable keeps the column names used in the formula; xlsread would return a plain numeric matrix instead
modelspec = 'lnUFP~ 1 + Day_0 + Day_1 + WindAirport + WindRail'; % just a few binary terms, for example
mdl = fitglm(T,modelspec,'Distribution','normal')
Answers (1)
the cyclist
on 14 Jun 2023
Edited: the cyclist
on 14 Jun 2023
The fitglm documentation states: "If data is in a table or dataset array tbl, then, by default, fitglm treats all categorical values, logical values, character arrays, string arrays, and cell arrays of character vectors as categorical variables."
It looks like Day_0 and Day_1 were read in as logical:
T=readtable('chels_sample.xlsx');
class(T.Day_0)
class(T.Day_1)
but that WindAirport and WindRail were not:
class(T.WindAirport)
class(T.WindRail)
Therefore, I would explicitly convert those
T.WindAirport = categorical(T.WindAirport);
T.WindRail = categorical(T.WindRail);
before calling the model
modelspec = 'lnUFP~ 1 + Day_0 + Day_1 + WindAirport + WindRail'; % just a few binary terms, for example
mdl = fitglm(T,modelspec,'Distribution','normal')
The coefficient labeled WindAirport_1 is the estimated effect when WindAirport takes the (categorical) value 1. WindAirport = 0 is the reference level, so it gets no coefficient of its own.
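As a minimal sketch of the steps above that runs without the attached spreadsheet, the following builds a small synthetic table and fits the same kind of model. The data values, the sample size n, and the effect sizes are all made up for illustration; only the variable names WindAirport, WindRail, and lnUFP come from the question. It assumes the Statistics and Machine Learning Toolbox is installed.

```matlab
% Synthetic stand-in for the spreadsheet (all values are invented for illustration)
rng(0);                                    % make the random numbers reproducible
n = 200;                                   % hypothetical number of rows
WindAirport = double(rand(n,1) > 0.5);     % 0/1 stored as double, as in the question
WindRail    = double(rand(n,1) > 0.5);
lnUFP = 9 + 0.5*WindAirport - 0.3*WindRail + 0.1*randn(n,1);
T = table(WindAirport, WindRail, lnUFP);

% Convert the 0/1 doubles so that fitglm treats them as categorical
T.WindAirport = categorical(T.WindAirport);
T.WindRail    = categorical(T.WindRail);

% Fit the model and list the coefficient names
mdl = fitglm(T, 'lnUFP ~ 1 + WindAirport + WindRail', 'Distribution', 'normal');
disp(mdl.CoefficientNames)
```

The names listed should include WindAirport_1 and WindRail_1 but no WindAirport_0 or WindRail_0, because level 0 is absorbed into the intercept. With only two levels coded 0 and 1, leaving a column as double should give the same estimate, just labeled WindAirport rather than WindAirport_1.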
3 Comments
the cyclist
on 15 Jun 2023
The overall model intercept term is in the output: Intercept = 9.2949. The intercept is the value of the response when
- all categorical explanatory variables are at their reference level, and
- all continuous explanatory variables are zero
I notice that Day_0 and Day_1 are constant in your data, which I expect is why there are no estimated coefficients for them. (Perhaps you only uploaded a subset of the data?) If they are constant, they should not be in the model. The same seems to be true for Site_0 and Site_1, and many of your other variables, so I don't understand why they are included.
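One way to spot constant columns before fitting is to count the distinct values in each table variable. This is only a sketch on a tiny hand-made table; the values are invented, and the names isConstant and constantVars are just illustrative.

```matlab
% Tiny made-up table: Day_1 is constant, WindRail and lnUFP vary
T = table([true; true; true; true], [0; 1; 0; 1], [9.1; 9.4; 9.0; 9.6], ...
    'VariableNames', {'Day_1', 'WindRail', 'lnUFP'});

% Flag every variable that has exactly one distinct value
isConstant = varfun(@(x) numel(unique(x)) == 1, T, 'OutputFormat', 'uniform');

% Names of the constant variables, which carry no information for the model
constantVars = T.Properties.VariableNames(isConstant);
disp(constantVars)
```

Any variable listed in constantVars is a candidate for dropping from the model formula.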
For the categorical variables that do have different values (e.g. WindRail), the estimate reported is the change in response for the different levels (e.g. WindRail=1), relative to the reference level (WindRail=0). I would not call that a slope, which would only be calculated for a continuous variable.
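To make the reference-level interpretation concrete, here is a small sketch with a single two-level predictor and synthetic data (the numbers are invented, not taken from your file). In this simplest case the intercept should match the mean response in the reference group, and the WindRail_1 estimate should match the difference between the two group means.

```matlab
% Synthetic data with one binary predictor (values invented for illustration)
rng(1);
n = 100;                                    % hypothetical number of rows
WindRail = double(rand(n,1) > 0.5);
lnUFP = 9 + 0.4*WindRail + 0.2*randn(n,1);
T = table(categorical(WindRail), lnUFP, 'VariableNames', {'WindRail', 'lnUFP'});

mdl = fitglm(T, 'lnUFP ~ 1 + WindRail', 'Distribution', 'normal');
est = mdl.Coefficients.Estimate;            % [intercept; WindRail_1]

% Group means computed directly from the data
mean0 = mean(lnUFP(WindRail == 0));         % reference level
mean1 = mean(lnUFP(WindRail == 1));

fprintf('Intercept  = %.4f, mean at WindRail=0    = %.4f\n', est(1), mean0);
fprintf('WindRail_1 = %.4f, difference in means = %.4f\n', est(2), mean1 - mean0);
```

With more than one predictor the estimates are no longer plain group means, but the reading is the same: each categorical coefficient is a shift relative to its reference level, with the other terms held fixed.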