Loss Function and Model Quality Metrics
What is a Loss Function?
The System Identification Toolbox™ software estimates model parameters by minimizing the error between the model output and the measured response. This error, called the loss function or cost function, is a positive function of the prediction errors e(t). In general, this function is a weighted sum of squares of the errors. For a model with n_{y} outputs, the loss function V(θ) has the following general form:
$V(\theta )=\frac{1}{N}{\displaystyle \sum}_{t=1}^{N}{e}^{T}\left(t,\theta \right)W\left(\theta \right)e\left(t,\theta \right)$
where:
N is the number of data samples.
e(t,θ) is the n_{y}-by-1 error vector at a given time t, parameterized by the parameter vector θ.
W(θ) is the weighting matrix, specified as a positive semidefinite matrix. If W is a diagonal matrix, you can think of it as a way to control the relative importance of outputs during multioutput estimations. When W is a fixed or known weight, it does not depend on θ.
The software determines the parameter values by minimizing V(θ) with respect to θ.
For notational convenience, V(θ) is expressed in its matrix form:
$V\left(\theta \right)=\frac{1}{N}trace\left({E}^{T}\left(\theta \right)E\left(\theta \right)W(\theta )\right)$
E(θ) is the error matrix of size N-by-n_{y}. The ith row of E(θ) represents the error value at time t = i.
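As a minimal sketch of the two equivalent forms above, the following Python snippet evaluates V for a small made-up error matrix and a fixed diagonal weight. The function names and sample values are illustrative assumptions and are not part of the toolbox.

```python
def loss_sum(E, W):
    """V = (1/N) * sum over t of e(t)^T W e(t); E holds one error vector per row."""
    N = len(E)
    total = 0.0
    for e in E:
        ny = len(e)
        total += sum(e[i] * W[i][j] * e[j] for i in range(ny) for j in range(ny))
    return total / N


def loss_trace(E, W):
    """V = (1/N) * trace(E^T E W), the matrix form of the same quantity."""
    N = len(E)
    ny = len(W)
    # S = E^T E is an ny-by-ny matrix of summed error products.
    S = [[sum(row[i] * row[j] for row in E) for j in range(ny)] for i in range(ny)]
    # trace(S W) = sum_i sum_j S[i][j] * W[j][i]
    return sum(S[i][j] * W[j][i] for i in range(ny) for j in range(ny)) / N


# Hypothetical data: N = 3 samples, ny = 2 outputs.
E = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
# Diagonal W: errors on the second output count four times as much.
W = [[1.0, 0.0], [0.0, 4.0]]

v_sum = loss_sum(E, W)
v_trace = loss_trace(E, W)
print(v_sum, v_trace)
```

Both functions return the same value for any symmetric W, which is why the trace form can be used as a notational shortcut.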
The exact form of V(θ) depends on the following factors:
Options to Configure the Loss Function
You can configure the loss function for your application needs. The following estimation options, when available for the estimator, configure the loss function:
Estimation Option  Description  Notes 


Note: For models whose noise component is trivial (H(q) = 1), e_{p}(t) and e_{s}(t) are equivalent. The

When you specify a weighting filter, prefiltered prediction or simulation error is minimized:
$${e}_{f}(t)=\mathcal{L}(e(t))$$
where $\mathcal{L}(.)$ is a linear filter. The 


When 





$V(\theta )=\frac{1}{N}\left({\displaystyle \sum}_{t\in I}{e}^{T}\left(t,\theta \right)W\left(\theta \right)e\left(t,\theta \right)+{\displaystyle \sum}_{t\in J}{v}^{T}\left(t,\theta \right)W\left(\theta \right)v\left(t,\theta \right)\right)$
where:
The error v(t,θ) is defined as:
$v\left(t,\theta \right)=e\left(t,\theta \right)*\sigma \frac{\rho}{\sqrt{\left|e\left(t,\theta \right)\right|}}$



The loss function is set up with the goal of minimizing the prediction errors. It does not include specific constraints on the variance (a measure of reliability) of estimated parameters. This can sometimes lead to models with large uncertainty in estimated model parameters, especially when the model has many parameters.
$V\left(\theta \right)=\frac{1}{N}{\displaystyle \sum}_{t=1}^{N}{e}^{T}\left(t,\theta \right)W\left(\theta \right)e\left(t,\theta \right)+\frac{1}{N}\lambda {\left(\theta -{\theta}^{*}\right)}^{T}R\left(\theta -{\theta}^{*}\right)$
The second term is a weighted (R) and scaled (λ) variance of the estimated parameter set θ about its nominal value θ*. 
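The regularized loss can be sketched in Python for a single-output model with W = 1. The function names and the numbers below are hypothetical and only illustrate how the penalty term grows as θ moves away from its nominal value θ*.

```python
def quad_form(x, M):
    """Return x^T M x for a vector x and a square matrix M (lists of floats)."""
    n = len(x)
    return sum(x[i] * M[i][j] * x[j] for i in range(n) for j in range(n))


def regularized_loss(errors, theta, theta_nominal, R, lam):
    """Scalar-output loss with W = 1 plus the penalty (1/N)*lam*(theta-theta*)^T R (theta-theta*)."""
    N = len(errors)
    fit_term = sum(e * e for e in errors) / N
    delta = [t - t0 for t, t0 in zip(theta, theta_nominal)]
    penalty = lam * quad_form(delta, R) / N
    return fit_term + penalty


# Hypothetical values: four residuals, two parameters, identity R.
errors = [0.5, -0.5, 1.0, -1.0]
theta = [2.0, -1.0]
theta_nominal = [0.0, 0.0]
R = [[1.0, 0.0], [0.0, 1.0]]

v_plain = regularized_loss(errors, theta, theta_nominal, R, lam=0.0)
v_reg = regularized_loss(errors, theta, theta_nominal, R, lam=0.5)
print(v_plain, v_reg)
```

With λ = 0 the expression reduces to the unregularized loss. A larger λ or a larger entry in R penalizes deviation of the corresponding parameter more strongly.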

Effect of Focus and WeightingFilter Options on the Loss Function
The Focus option can be interpreted as a weighting filter in the loss function. The WeightingFilter option is an additional custom weighting filter that is applied to the loss function.
To understand the effect of Focus and WeightingFilter, consider a linear single-input single-output model:
$$y(t)=G(q,\theta )\text{}u(t)+H(q,\theta )\text{}e(t)$$
where G(q,θ) is the measured transfer function, H(q,θ) is the noise model, and e(t) represents the additive disturbances modeled as white Gaussian noise. q is the time-shift operator.
In the frequency domain, the linear model can be represented as:
$$Y(\omega )=G(\omega ,\theta )U(\omega )+H(\omega ,\theta )E(\omega )$$
where Y(ω), U(ω), and E(ω) are the Fourier transforms of the output, input, and output error, respectively. G(ω,θ) and H(ω,θ) represent the frequency response of the input-output and noise transfer functions, respectively.
The loss function to be minimized for the SISO model is given by:
$V(\theta )=\frac{1}{N}{\displaystyle \sum}_{t=1}^{N}{e}^{T}\left(t,\theta \right)e\left(t,\theta \right)$
Using Parseval’s identity, the loss function in the frequency domain is:
$$V(\theta ,\omega )=\frac{1}{N}{\Vert E(\omega )\Vert}^{2}$$
Substituting for E(ω) gives:
$$V(\theta ,\omega )=\frac{1}{N}{\Vert \frac{Y(\omega )}{U(\omega )}-G(\theta ,\omega )\Vert}^{2}\frac{{\Vert U(\omega )\Vert}^{2}}{{\Vert H(\theta ,\omega )\Vert}^{2}}$$
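The substitution step can be spelled out as follows, as a sketch that assumes U(ω) and H(θ,ω) are nonzero at the frequencies considered. Solving the frequency-domain model for E(ω) gives:
$$E(\omega )=\frac{Y(\omega )-G(\theta ,\omega )U(\omega )}{H(\theta ,\omega )}=\frac{U(\omega )}{H(\theta ,\omega )}\left(\frac{Y(\omega )}{U(\omega )}-G(\theta ,\omega )\right)$$
Taking the squared magnitude of both sides and scaling by 1/N yields the weighted expression for V(θ,ω).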
Thus, you can interpret minimizing the loss function V as fitting G(θ,ω) to the empirical transfer function $$Y(\omega )/U(\omega )$$, using $$\frac{{\Vert U(\omega )\Vert}^{2}}{{\Vert H(\theta ,\omega )\Vert}^{2}}$$ as a weighting filter. This corresponds to specifying Focus as 'prediction'. The estimation emphasizes frequencies where the input has more power ($${\Vert U(\omega )\Vert}^{2}$$ is greater) and de-emphasizes frequencies where noise is significant ($${\Vert H(\theta ,\omega )\Vert}^{2}$$ is large).
When Focus is specified as 'simulation', the inverse weighting with $${\Vert H(\theta ,\omega )\Vert}^{2}$$ is not used. That is, only the input spectrum is used to weigh the relative importance of the estimation fit in a specific frequency range.
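The difference between the two Focus settings can be illustrated with a small Python sketch. It assumes a hypothetical first-order noise model H(q) = 1/(1 - 0.9 q^{-1}) and a flat input spectrum. The function names are made up for this example and do not correspond to toolbox commands.

```python
import cmath
import math


def noise_gain_sq(w, a=0.9):
    """|H|^2 at frequency w (rad/sample) for the assumed model H(q) = 1/(1 - a*q^-1)."""
    H = 1.0 / (1.0 - a * cmath.exp(-1j * w))
    return abs(H) ** 2


def focus_weight(u_power, w, focus):
    """Frequency weighting implied by the Focus option, following the text above."""
    if focus == "prediction":
        # Input power divided by noise-model power.
        return u_power / noise_gain_sq(w)
    if focus == "simulation":
        # Only the input spectrum weighs the fit.
        return u_power
    raise ValueError("focus must be 'prediction' or 'simulation'")


# Flat input spectrum (|U|^2 = 1) evaluated at three frequencies.
for w in (0.0, math.pi / 2, math.pi):
    print(w, focus_weight(1.0, w, "prediction"), focus_weight(1.0, w, "simulation"))
```

For this low-pass noise model, the prediction weighting is small at low frequencies, where the noise is significant, and larger near the Nyquist frequency. The simulation weighting stays equal to the input power throughout.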
When you specify a linear filter $\mathcal{L}$ as WeightingFilter, it is used as an additional custom weighting in the loss function.
$$V(\theta )=\frac{1}{{N}^{2}}{\Vert \frac{Y(\omega )}{U(\omega )}-G(\theta )\Vert}^{2}\frac{{\Vert U(\omega )\Vert}^{2}}{{\Vert H(\theta )\Vert}^{2}}{\Vert \mathcal{L}(\omega )\Vert}^{2}$$
Here $$\mathcal{L}(\omega )$$ is the frequency response of the filter. Use $$\mathcal{L}(\omega )$$ to enhance the fit of the model response to observed data in certain frequencies, such as to emphasize the fit close to system resonant frequencies.
The estimated value of the input-output transfer function G is the same as what you get if you instead first prefilter the estimation data with $\mathcal{L}(.)$ using idfilt, and then estimate the model without specifying WeightingFilter. However, the effect of $\mathcal{L}(.)$ on the estimated noise model H depends on the choice of Focus:
Focus is 'prediction': The software minimizes the weighted prediction error $${e}_{f}(t)=\mathcal{L}({e}_{p}(t))$$, and the estimated model has the form:
$$y(t)=G(q)u(t)+{H}_{1}(q)e(t)$$
where $${H}_{1}(q)=H(q)/\mathcal{L}(q)$$. Thus, the estimation with prediction focus creates a biased estimate of H. This is the same estimated noise model you get if you instead first prefilter the estimation data with $\mathcal{L}(.)$ using idfilt, and then estimate the model.
When H is parameterized independently of G, you can treat the filter $\mathcal{L}(.)$ as a way of affecting the estimation bias distribution. That is, you can shape the tradeoff between fitting G to the system frequency response and fitting $$H/\mathcal{L}$$ to the disturbance spectrum when minimizing the loss function. For more details, see Section 14.4 in System Identification: Theory for the User, Second Edition, by Lennart Ljung, Prentice Hall PTR, 1999.
Focus is 'simulation': The software first estimates G by minimizing the weighted simulation error $${e}_{f}(t)=\mathcal{L}({e}_{s}(t))$$, where ${e}_{s}\left(t\right)={y}_{measured}\left(t\right)-G(q){u}_{measured}\left(t\right)$. Once G is estimated, the software fixes it and computes H by minimizing pure prediction errors e(t) using unfiltered data. The estimated model has the form:
$$y(t)=G(q)u(t)+H(q)e(t)$$
If you prefilter the data first, and then estimate the model, you get the same estimate for G but get a biased noise model $$H/\mathcal{L}$$.
Thus, the WeightingFilter has the same effect as prefiltering the estimation data for estimation of G. For estimation of H, the effect of WeightingFilter depends upon the choice of Focus. A prediction focus estimates a biased version of the noise model $$H/\mathcal{L}$$, while a simulation focus estimates H. Prefiltering the estimation data, and then estimating the model, always gives $$H/\mathcal{L}$$ as the noise model.
Model Quality Metrics
After you estimate a model, use model quality metrics to assess the quality of identified models, compare different models, and pick the best one. The Report.Fit property of an identified model stores various metrics such as FitPercent, LossFcn, FPE, MSE, AIC, nAIC, AICc, and BIC values.
FitPercent, LossFcn, and MSE are measures of the actual quantity that is minimized during the estimation. For example, if Focus is 'simulation', these quantities are computed for the simulation error e_{s}(t). Similarly, if you specify the WeightingFilter option, then LossFcn, FPE, and MSE are computed using filtered residuals e_{f}(t).
FPE, AIC, nAIC, AICc, and BIC measures are computed as properties of the output disturbance according to the relationship:
$y\left(t\right)=G\left(q\right)u\left(t\right)+H\left(q\right)e(t)$
G(q) and H(q) represent the measured and noise components of the estimated model.
Regardless of how the loss function is configured, the error vector e(t) is computed as the 1-step-ahead prediction error using a given model and a given dataset. This implies that even when the model is obtained by minimizing the simulation error e_{s}(t), the FPE and various AIC values are still computed using the prediction error e_{p}(t). The actual value of e_{p}(t) is determined using the pe command with a prediction horizon of 1 and using the initial conditions specified for the estimation.
These metrics contain two terms: one describing the model accuracy and another describing its complexity. For example, in FPE, $det\left(\frac{1}{N}{E}^{T}E\right)$ describes the model accuracy and $\frac{1+\frac{{n}_{p}}{N}}{1-\frac{{n}_{p}}{N}}$ describes the model complexity.
By comparing models using these criteria, you can pick a model that gives the best (smallest criterion value) tradeoff between accuracy and complexity.
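The accuracy and complexity tradeoff in FPE can be sketched in Python for a single-output model, where det(E^T E / N) reduces to the mean squared residual. The residual sequences and parameter counts below are invented for illustration, and the function name fpe here is a local helper rather than the toolbox command.

```python
def fpe(residuals, n_params):
    """Final Prediction Error for scalar residuals: accuracy term times complexity term."""
    N = len(residuals)
    if n_params >= N:
        raise ValueError("n_params must be smaller than the number of samples")
    accuracy = sum(e * e for e in residuals) / N  # det(E^T E / N) when ny = 1
    ratio = n_params / N
    complexity = (1.0 + ratio) / (1.0 - ratio)
    return accuracy * complexity


# Hypothetical comparison with N = 20 samples:
# model A has 2 parameters, model B has 8 parameters and slightly smaller residuals.
residuals_a = [1.0, -1.0] * 10
residuals_b = [0.95, -0.95] * 10

fpe_a = fpe(residuals_a, n_params=2)
fpe_b = fpe(residuals_b, n_params=8)
print(fpe_a, fpe_b)
```

In this made-up case, model B fits the data slightly better, but its larger complexity factor gives it the higher FPE, so model A would be preferred under this criterion.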
Quality Metric  Description 

FitPercent  Normalized Root Mean Squared Error (NRMSE) expressed as a percentage, defined as:
$FitPercent=100\left(1-\frac{\Vert {y}_{measured}-{y}_{model}\Vert}{\Vert {y}_{measured}-\overline{{y}_{measured}}\Vert}\right)$
where:
For input or output data,


LossFcn  Value of the loss function when the estimation completes. It contains effects of error thresholds, output weight, and regularization used for estimation.

MSE  Mean Squared Error measure, defined as:
$MSE=\frac{1}{N}{\displaystyle \sum}_{t=1}^{N}{e}^{T}\left(t\right)e\left(t\right)$
where:


FPE  Akaike’s Final Prediction Error (FPE), defined as:
$FPE=det\left(\frac{1}{N}{E}^{T}E\right)\left(\frac{1+\frac{{n}_{p}}{N}}{1-\frac{{n}_{p}}{N}}\right)$
where:


AIC  A raw measure of Akaike's Information Criterion, defined as:
$AIC=N\ast log\left(det\left(\frac{1}{N}{E}^{T}E\right)\right)+2\ast {n}_{p}+N\left({n}_{y}\ast \mathrm{log}\left(2\pi \right)+1\right)$


AICc  Small-sample-size corrected Akaike's Information Criterion, defined as:
$$AICc=AIC+2\ast {n}_{p}\ast \frac{({n}_{p}+1)}{(N-{n}_{p}-1)}$$
This metric is often more reliable for picking a model of optimal complexity from a list of candidate models when the data size N is small. 

nAIC  Normalized measure of Akaike's Information Criterion, defined as:
$nAIC=log\left(det\left(\frac{1}{N}{E}^{T}E\right)\right)+\frac{2\ast {n}_{p}}{N}$


BIC  Bayesian Information Criterion, defined as:
$BIC=N\ast log\left(det\left(\frac{1}{N}{E}^{T}E\right)\right)+N\ast \left({n}_{y}\ast \mathrm{log}\left(2\pi \right)+1\right)+{n}_{p}\ast \text{log}(N)$

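The formulas in the table can be checked by hand with a short Python sketch for the single-output case (n_{y} = 1), where det(E^T E / N) is the mean squared residual. This is an illustration under stated assumptions, not the toolbox implementation: the function names are hypothetical, the data is invented, and log is taken to be the natural logarithm.

```python
import math


def fit_percent(y_measured, y_model):
    """NRMSE fit in percent: 100 * (1 - ||y - y_model|| / ||y - mean(y)||)."""
    mean_y = sum(y_measured) / len(y_measured)
    num = math.sqrt(sum((ym - yh) ** 2 for ym, yh in zip(y_measured, y_model)))
    den = math.sqrt(sum((ym - mean_y) ** 2 for ym in y_measured))
    return 100.0 * (1.0 - num / den)


def information_criteria(residuals, n_params):
    """MSE, AIC, AICc, nAIC and BIC for scalar prediction errors (ny = 1)."""
    N = len(residuals)
    mse = sum(e * e for e in residuals) / N
    log_det = math.log(mse)  # natural log assumed
    const = N * (math.log(2.0 * math.pi) + 1.0)  # N*(ny*log(2*pi) + 1) with ny = 1
    aic = N * log_det + 2.0 * n_params + const
    aicc = aic + 2.0 * n_params * (n_params + 1) / (N - n_params - 1)
    naic = log_det + 2.0 * n_params / N
    bic = N * log_det + const + n_params * math.log(N)
    return {"MSE": mse, "AIC": aic, "AICc": aicc, "nAIC": naic, "BIC": bic}


# Invented data: a perfect model gives 100%, predicting the mean gives 0%.
y = [1.0, 2.0, 3.0, 4.0]
fit_perfect = fit_percent(y, y)
fit_mean = fit_percent(y, [2.5, 2.5, 2.5, 2.5])

# Invented residuals with unit mean square, N = 10, two estimated parameters.
crit = information_criteria([1.0, -1.0] * 5, n_params=2)
print(fit_perfect, fit_mean, crit)
```

AIC and BIC share the same accuracy term and differ only in the complexity penalty: 2 n_{p} for AIC and n_{p} log(N) for BIC, so BIC penalizes extra parameters more heavily once N exceeds about 7.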
See Also
aic
 fpe
 pe
 goodnessOfFit
 sim
 predict
 nparams