How does fitlm set reference level with categorical variables?

I am running a linear regression using fitlm with categorical predictors:
model = fitlm(DataTable ,'Score ~ Industry + Rating + Liquid')
fitlm set the reference level for Industry and Rating to the values in the first row, but for the "Liquid" variable it set "Q1" as the reference level. I am a little confused by this choice. I thought fitlm would always use the first row as the reference for all 3 variables. Could you please explain why it chose a different reference level for the "Liquid" variable?

Answers (1)

Cris LaPierre on 11 Oct 2024
See this note in the Algorithms section of the fitlm documentation:
fitlm treats a categorical predictor as follows:
  • A model with a categorical predictor that has L levels (categories) includes L – 1 indicator variables. The model uses the first category as a reference level, so it does not include the indicator variable for the reference level. If the data type of the categorical predictor is categorical, then you can check the order of categories by using categories and reorder the categories by using reordercats to customize the reference level. For more details about creating indicator variables, see Automatic Creation of Dummy Variables.
  • fitlm treats the group of L – 1 indicator variables as a single variable. If you want to treat the indicator variables as distinct predictor variables, create indicator variables manually by using dummyvar. Then use the indicator variables, except the one corresponding to the reference level of the categorical variable, when you fit a model. For the categorical predictor X, if you specify all columns of dummyvar(X) and an intercept term as predictors, then the design matrix becomes rank deficient.
  • Interaction terms between a continuous predictor and a categorical predictor with L levels consist of the element-wise product of the L – 1 indicator variables with the continuous predictor.
  • Interaction terms between two categorical predictors with L and M levels consist of the (L – 1)*(M – 1) indicator variables to include all possible combinations of the two categorical predictor levels.
  • You cannot specify higher-order terms for a categorical predictor because the square of an indicator is equal to itself.
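The rank-deficiency remark in the second bullet can be checked with a minimal sketch. The predictor X below is a hypothetical three-level variable, not data from this thread, and the variable names are illustrative only.

```matlab
% Hypothetical categorical predictor with L = 3 levels (not from the thread's data).
X = categorical(["a"; "b"; "c"; "a"; "b"; "c"]);

% dummyvar returns one indicator column per category (6-by-3 here).
D = dummyvar(X);

% Intercept plus ALL L indicator columns: the indicators sum to the
% intercept column, so the 4 columns are linearly dependent.
Xfull = [ones(6,1) D];

% Intercept plus L - 1 indicators (reference level dropped).
Xreduced = [ones(6,1) D(:,2:end)];

rankFull = rank(Xfull);        % smaller than size(Xfull,2)
rankReduced = rank(Xreduced);  % equal to size(Xreduced,2)
fprintf('full: rank %d of %d columns\n', rankFull, size(Xfull,2));
fprintf('reduced: rank %d of %d columns\n', rankReduced, size(Xreduced,2));
```

Dropping the first column of D mirrors what fitlm does when it omits the indicator for the reference level.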
  9 Comments
Cris LaPierre on 15 Oct 2024
You are seeing this behavior because Industry and Rating are not categorical variables. They are cell arrays, and only Liquid is categorical:
load matlab_datasample2.mat
varfun(@class,DataSample)
ans = 1x4 table
    class_Score    class_Industry    class_Rating    class_Liquid
    ___________    ______________    ____________    ____________
    double         cell              cell            categorical
If you want row 1 to be the reference values, then either don't use categorical data types, or use reordercats to ensure the row 1 categorical values are the first category.
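As a rough sketch of the reordercats option, the following moves the row-1 value to the front of the category list. The Liquid vector here is a small hypothetical stand-in for the real column, and firstValue, cats, and newOrder are illustrative names.

```matlab
% Hypothetical stand-in for the Liquid column (not the thread's data).
Liquid = categorical(["2"; "2"; "1"; "1"; "4"; "5"; "3"]);

% By default the categories are sorted, so the first category need not
% match the value in row 1.
disp(categories(Liquid))

% Build a new order that puts the row-1 value first and keeps the rest.
firstValue = string(Liquid(1));
cats = string(categories(Liquid));
newOrder = [firstValue; cats(cats ~= firstValue)];

% Reorder the categories; the data values themselves are unchanged.
Liquid = reordercats(Liquid, newOrder);
disp(categories(Liquid))
```

After this step, a model fit on a table containing the reordered variable should use the row-1 value as the reference level, per the documentation note quoted in the answer.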
Here, I'm converting Liquid to string.
DataSample = convertvars(DataSample, "Liquid","string")
DataSample = 1406x4 table
     Score                Industry                 Rating     Liquid
    _______    _____________________________    ________    ______
    -92.102    {'METALS_AND_MINING'        }    {'BA1' }     "2"
    -125.94    {'AIRLINES'                 }    {'BA2' }     "2"
    -90.965    {'AIRLINES'                 }    {'BA1' }     "1"
    -56.942    {'TECHNOLOGY'               }    {'AA1' }     "1"
    -127.78    {'RETAIL_&_SUPERMARKETS'    }    {'BA1' }     "2"
     9.7511    {'OTHER_REITS'              }    {'BAA3'}     "4"
     4.5882    {'PHARMACEUTICALS'          }    {'A3'  }     "1"
    -112.25    {'MEDIA_ENTERTAINMENT'      }    {'B3'  }     "5"
    -84.497    {'AUTOMOTIVE_AUTO_SUPPLIERS'}    {'BA2' }     "2"
    -53.485    {'HEALTHCARE'               }    {'AA3' }     "1"
    0.51723    {'METALS_AND_MINING'        }    {'BAA1'}     "2"
    -3.3194    {'AIRLINES'                 }    {'BA1' }     "3"
    -62.494    {'RETAILERS'                }    {'BA2' }     "5"
     32.613    {'INDUSTRIAL_OTHER'         }    {'B1'  }     "3"
     8.5647    {'CONSUMER_PRODUCTS'        }    {'BA3' }     "4"
    -4.5917    {'P&C'                      }    {'BAA1'}     "2"
varfun(@class,DataSample)
ans = 1x4 table
    class_Score    class_Industry    class_Rating    class_Liquid
    ___________    ______________    ____________    ____________
    double         cell              cell            string
model = fitlm(DataSample,'Score ~ Industry + Rating + Liquid')
model =
Linear regression model:
    Score ~ 1 + Industry + Rating + Liquid

Estimated Coefficients:
                                                  Estimate    SE        tStat         pValue
                                                  ________    ______    __________    ___________
    (Intercept)                                   -38.212     31.515    -1.2125       0.22554
    Industry_AIRLINES                             74.075      48.623    1.5235        0.12788
    Industry_TECHNOLOGY                           46.33       28.915    1.6023        0.10933
    Industry_RETAIL_&_SUPERMARKETS                58.743      32.97     1.7817        0.075031
    Industry_OTHER_REITS                          3.1752      50.35     0.063062      0.94973
    Industry_PHARMACEUTICALS                      23.979      39.108    0.61313       0.5399
    Industry_MEDIA_ENTERTAINMENT                  -73.778     33.416    -2.2079       0.027427
    Industry_AUTOMOTIVE_AUTO_SUPPLIERS            26.169      38.719    0.67587       0.49924
    Industry_HEALTHCARE                           27.374      31.002    0.88296       0.37742
    Industry_RETAILERS                            -16.989     69.658    -0.24389      0.80736
    Industry_INDUSTRIAL_OTHER                     -6.4796     38.917    -0.1665       0.86779
    Industry_CONSUMER_PRODUCTS                    23.455      36.496    0.64267       0.52055
    Industry_P&C                                  58.788      32.565    1.8052        0.071269
    Industry_REIT                                 85.54       32.967    2.5947        0.0095745
    Industry_PKGED_FOOD_FOODSVCS_REST             30.374      33.14     0.91654       0.35955
    Industry_CONSUMER_CYCLICAL_SERVICES           10.205      40.119    0.25436       0.79925
    Industry_Utilities_OpCo_FMB                   84.52       34.752    2.4321        0.015146
    Industry_Utilities_Holdco                     86.896      31.922    2.7222        0.0065725
    Industry_Utilities_OpCo_Uns                   71.542      35.941    1.9905        0.046742
    Industry_LIFE                                 68.912      37.124    1.8563        0.063641
    Industry_CONSTRUCTION_MACHINERY               64.929      43.476    1.4934        0.13556
    Industry_AEROSPACE/DEFENSE                    25.573      37.21     0.68726       0.49204
    Industry_AIRCRAFT_LEASE                       26.407      78.714    0.33549       0.73731
    Industry_CHEMICALS                            79.924      33.1      2.4146        0.015888
    Industry_BANKING_US_SUB                       67.288      35.675    1.8862        0.059495
    Industry_BANKING_US_SR                        78.517      33.661    2.3326        0.019822
    Industry_MIDSTREAM                            57.774      31.627    1.8267        0.067968
    Industry_BROKERAGE_ASSETMANAGERS_EXCHANGES    61.685      36.962    1.6689        0.095383
    Industry_CABLE_TELCO                          -15.638     32.765    -0.47727      0.63325
    Industry_INDEPENDENT                          21.359      34.483    0.61941       0.53576
    Industry_OIL_FIELD_SERVICES                   -35.168     36.037    -0.97587      0.32931
    Industry_FINANCE_COMPANIES                    23.547      34.786    0.67691       0.49858
    Industry_BANKING                              39.035      44.918    0.86904       0.38498
    Industry_LIFE_FA_BACKED_NOTES                 70.792      95.979    0.73757       0.46091
    Industry_DIVERSIFIED_MANUFACTURING            37.67       32.393    1.1629        0.24508
    Industry_PACKAGING                            24.327      42.538    0.5719        0.56749
    Industry_TRANSPORTATION_SERVICES              26.77       42.432    0.63089       0.52822
    Industry_ELECTRIC                             79.585      59.145    1.3456        0.17867
    Industry_GAMING                               31.264      39.646    0.78857       0.43051
    Industry_PAPER                                35.469      48.39     0.73299       0.4637
    Industry_BUILDING_MATERIALS                   17.926      37.124    0.48287       0.62927
    Industry_BEVERAGE                             64.112      52.879    1.2124        0.22557
    Industry_NO_INDUSTRY                          -61.964     71.257    -0.86958      0.38469
    Industry_RAILROADS_ENVIRONMENTAL              42.524      44.653    0.95232       0.34111
    Industry_HOME_CONSTRUCTION                    7.4579      40.197    0.18553       0.85284
    Industry_FINANCIAL_OTHER                      5.8455      63.208    0.092481      0.92633
    Industry_CABLE_SATELLITE                      -1.1275     131.6     -0.0085676    0.99317
    Industry_LODGING_LEISURE                      4.4847      35.959    0.12472       0.90077
    Industry_Utilities_Genco                      20.092      48.759    0.41207       0.68036
    Industry_REFINING                             -14.143     50.302    -0.28117      0.77862
    Industry_REITS_HEALTHCARE                     63.194      50.417    1.2534        0.21027
    Industry_INTEGRATED                           79.255      71.394    1.1101        0.26716
    Industry_HEALTHCARE_REITS                     -597.22     100.19    -5.9606       3.2334e-09
    Industry_RETAIL_REITS                         33.916      94.718    0.35808       0.72034
    Industry_AUTOMOTIVE_AUTO_FINCO                53.97       69.608    0.77535       0.43828
    Industry_HIGHER_ED_TXCRP                      19.481      131.76    0.14785       0.88248
    Industry_ENVIRONMENTAL                        -58.522     131.98    -0.44342      0.65753
    Industry_BANKING_US_PFD                       32.933      78.824    0.4178        0.67616
    Industry_AIRLINES_EETC_A                      33.805      95.25     0.3549        0.72272
    Industry_INSURANCE_US_SUBORDINATED            53.352      78.697    0.67795       0.49793
    Industry_TOBACCO                              41.967      69.422    0.60451       0.54561
    Industry_BANKING_GLOBAL_TLAC_SR               60.837      131.21    0.46367       0.64296
    Industry_UTILITY_OTHER                        5.1819      131.24    0.039483      0.96851
    Rating_BA2                                    30.788      20.998    1.4662        0.14282
    Rating_AA1                                    -72.611     129.89    -0.55901      0.57625
    Rating_BAA3                                   -6.6247     19.214    -0.34478      0.73032
    Rating_A3                                     -46.588     22.728    -2.0498       0.040584
    Rating_B3                                     78.003      24.971    3.1238        0.0018249
    Rating_AA3                                    -42.946     38.267    -1.1223       0.26196
    Rating_BAA1                                   -30.175     20.288    -1.4874       0.13716
    Rating_B1                                     33.131      21.115    1.569         0.11688
    Rating_BA3                                    14.299      20.243    0.70638       0.48008
    Rating_B2                                     20.609      22.265    0.92559       0.35483
    Rating_A1                                     -71.461     28.924    -2.4707       0.013613
    Rating_A2                                     -54.844     24.399    -2.2478       0.024757
    Rating_BAA2                                   -24.753     19.031    -1.3007       0.1936
    Rating_CAA3                                   534.91      43.266    12.363        2.8422e-33
    Rating_AA2                                    -113.84     52.126    -2.1839       0.029147
    Rating_CAA1                                   169.91      27.988    6.0709        1.667e-09
    Rating_CAA2                                   413.13      39.372    10.493        8.8261e-25
    Rating_CA                                     788.56      57.97     13.603        1.7066e-39
    Rating_NR                                     -98.999     131.24    -0.75431      0.4508
    Rating_AAA                                    -48.279     93.631    -0.51563      0.6062
    Rating_C                                      3773.2      94.223    40.045        5.4623e-229
    Liquid_1                                      7.5509      10.665    0.70804       0.47905
    Liquid_4                                      18.731      11.687    1.6027        0.10925
    Liquid_5                                      28.48       13.186    2.1599        0.030965
    Liquid_3                                      30.728      11.029    2.7862        0.0054111

Number of observations: 1386, Error degrees of freedom: 1298
Root Mean Squared Error: 128
R-squared: 0.643,  Adjusted R-Squared: 0.619
F-statistic vs. constant model: 26.9, p-value = 7.06e-231
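The same effect can likely be reproduced on a small synthetic table. This is a sketch under assumptions: the table T, its variables y and g, and the random response are all hypothetical. It assumes, as the output above suggests, that levels of a cell-array predictor are ordered by first appearance, while a categorical predictor uses its (sorted by default) category order.

```matlab
% Hypothetical data: 'b' appears in row 1, but 'a' sorts first.
rng(0);                                   % reproducible random response
g = {'b';'a';'c';'b';'a';'c';'b';'a';'c';'b'};
y = randn(10,1);
T = table(y, g);

% Cell array predictor: the row-1 level ('b') is expected to be the
% reference, so no g_b coefficient should appear.
m1 = fitlm(T, 'y ~ g');
disp(m1.CoefficientNames)

% Categorical predictor: the first category ('a' after sorting) is the
% reference, so no g_a coefficient should appear.
T.g = categorical(T.g);
m2 = fitlm(T, 'y ~ g');
disp(m2.CoefficientNames)
```

Comparing the two CoefficientNames lists shows which level each model leaves out as its reference.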
