Linear regression on data with asymmetric measurement error

I am looking to perform a linear regression on measured data that takes into account an asymmetric error in the data. I've created some dummy data to illustrate what I mean:
The blue curve represents the measured data, while the red curve is the lower bound and is notably closer to the measured data than the orange curve, which represents the upper bound.
Snippet of code to create dummy data:
xdata = linspace(0,10, 20);
ydata = 2*xdata+1.5*rand(1,length(xdata));
y_err_low = 0.3*xdata+1.5*rand(1,length(xdata));
y_err_high = 0.6*xdata+1.5*rand(1,length(xdata));
ylowbnd = ydata - y_err_low;
yupbnd = ydata + y_err_high;
plot(xdata, ydata,'o-', 'LineWidth', 2, 'DisplayName', 'measured data')
hold on
plot(xdata, ylowbnd, 'x--', 'LineWidth', 2, 'DisplayName', 'lower bound')
plot(xdata, yupbnd, 's--', 'LineWidth', 2, 'DisplayName', 'upper bound')
xlabel('x')
ylabel('y')
legend('Location','northwest')
I have linear regression approaches that rely on the error in y being symmetric about the measured datapoint, but am struggling to find a way to weight my regression based on an asymmetric error.
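For reference, a symmetric weighted fit of the kind described above can be sketched as follows. This is only a baseline, assuming a single one-sigma error per point (y_err is a hypothetical variable, not part of the dummy data) and inverse-variance weights passed to lscov.

```matlab
% Dummy data in the spirit of the snippet above
xdata = linspace(0, 10, 20);
ydata = 2*xdata + 1.5*rand(1, numel(xdata));
y_err = 0.3*xdata + 1.5*rand(1, numel(xdata));   % hypothetical symmetric one-sigma error

A = [xdata(:), ones(numel(xdata), 1)];   % design matrix for y = slope*x + intercept
w = 1 ./ y_err(:).^2;                    % inverse-variance weights
p = lscov(A, ydata(:), w);               % p(1) = slope, p(2) = intercept

fprintf('slope = %.3f, intercept = %.3f\n', p(1), p(2));
```

Each point pulls on the line in proportion to 1/y_err^2, equally in both directions, which is exactly the symmetry that does not hold for the data in question.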
Things I've been digging into:
  • fmincon (for both fmincon and lsqcurvefit, the bounds, equalities, and inequalities do not appear to accept a bound/etc. given as vectors, e.g., , where the anonymous function to fit the data would be and the objective for fmincon would be )
  • lsqcurvefit
  • Method of Maximum Likelihood (here the examples I've been seeing rely on Gaussian distribution around each ydata point, so not asymmetric)
I would appreciate any help in how I can go about giving the fit more (or less) freedom to roam as matches with the asymmetric error associated with each data point.
Thanks!

Answers (2)

Mathieu NOE on 10 Nov 2023
Hello Katrina,
maybe this? You can push the mean curve closer to either the upper or the lower bound by adjusting the coefficient a:
a = 0.7; % a = 1 is equivalent to standard linear averaging (no weighting)
% a < 1 shifts the mean towards the lower bound, a > 1 towards the upper bound
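As a quick check of what the coefficient does, here is a minimal sketch that applies the same window weights used in myspecialavg (linspace(1/a, a, n), normalised to sum to 1) to a small sorted sample. The helper wavg and the sample values are made up for illustration.

```matlab
tmp = [1 2 3 4 5];   % an already sorted window (made-up values)

% Weighted mean with weights running linearly from 1/a (smallest sample) to a (largest)
wavg = @(a) sum(linspace(1/a, a, numel(tmp)) .* tmp) / sum(linspace(1/a, a, numel(tmp)));

m_low  = wavg(0.7);     % more weight on the small values
m_mid  = wavg(1);       % all weights equal: plain mean
m_high = wavg(1/0.7);   % more weight on the large values

fprintf('a = 0.7: %.3f, a = 1: %.3f, a = 1/0.7: %.3f\n', m_low, m_mid, m_high);
```

Because the window is sorted before weighting, a < 1 favours the smallest samples in the window and a > 1 the largest, whatever order they had in the original data.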
Full code (the dummy data is slightly different from your version, sorry!):
% "true" data
x2 = (0:30);
y2 = 2*x2+1.5*rand(1,length(x2));
dx = mean(diff(x2));
% upper bound
x1 = x2 + dx/3;
y1 = 2.6*x1+1.5*rand(1,length(x1));
% lower bound
x3 = x2 + dx*2/3;
y3 = 1.7*x3-1.5*rand(1,length(x3));
% measurement = all data (concatenated)
x = [x1 x2 x3];
[x,ind] = sort(x);
y = [y1 y2 y3];
y = y(ind);
%%%% main loop %%%%
n = 15; % buffer size
a = 0.7; % a = 1 is equivalent to standard linear averaging (no weighting)
% a < 1 shifts the mean towards the lower bound, a > 1 towards the upper bound
yy = myspecialavg(y, n, a);
plot(x2, y2, 'b', x, y, '*-c', x, yy, 'r', 'LineWidth', 2)
legend('"true" data', 'noisy data', 'my solution', 'Location', 'northwest')
xlabel('x')
ylabel('y')
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function out = myspecialavg(in, N, a)
% OUT = MYSPECIALAVG(IN, N, A)
%
% Sliding-window weighted average of a one-dimensional sequence. For each
% point, the samples in a window centered on it are sorted in ascending
% order and averaged with weights that vary linearly from 1/A (smallest
% sample) to A (largest sample). Near the start or end of IN, the window
% shrinks symmetrically to the samples that are available.
%
% * IN is the numerical data array to be processed.
% * N is the number of neighboring data points to average over for each point of IN.
% * A is the weighting coefficient: A = 1 gives a plain moving average,
%   A < 1 shifts the result towards the lower values, A > 1 towards the upper values.
%
% * OUT is the output data array, the same size as IN.
%
% Note: for N >= 2*(nx - 1), with nx = length(IN), every output value is
% set to mean(IN) and A is ignored.
if isempty(in) || N <= 0
    % Raise an error instead of returning with OUT unassigned
    error('myspecialavg:invalidInput', 'Empty input data or non-positive N.');
end
if N == 1
    out = in; % a one-point window leaves the data unchanged
    return
end
nx = length(in);
if N >= 2*(nx-1)
    out = mean(in)*ones(size(in)); % window covers everything: plain mean
    return
end
out = zeros(size(in));
m = floor(N/2); % window half-width: N/2 for even N, (N-1)/2 for odd N
for i = 1:nx
    dist2start = i-1; % index distance from current index to start index (1)
    dist2end = nx-i; % index distance from current index to end index (nx)
    dd = min([m, dist2start, dist2end]); % shrink the window near both ends of the data
    tmp = sort(in(i-dd:i+dd)); % sorted window samples
    win = linspace(1/a, a, numel(tmp)); % weights from 1/a (smallest) to a (largest)
    win = win/sum(win); % normalise the weights to sum to 1
    out(i) = sum(win(:).*tmp(:)); % weighted mean; (:) handles row and column input
end
end
  4 Comments
Mathieu NOE on 14 Nov 2023
Hello Katrina,
sorry, but for the time being I have no other solution to suggest.

Jeff Miller on 14 Nov 2023
If you have separate measures of the lower and upper directional error associated with each X value (either empirical or derived from some model), then you can probably use least-squares.
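One way to read this suggestion, as a rough sketch rather than a definitive method: scale each residual by the upper error when the fitted line lies above the measurement and by the lower error when it lies below, then minimise the sum of squared scaled residuals. The code below reuses the variable names from the question; model, resid, sigma, and cost are illustrative names, and fminsearch is used only because it ships with base MATLAB.

```matlab
% Dummy data as in the question
xdata      = linspace(0, 10, 20);
ydata      = 2*xdata + 1.5*rand(1, numel(xdata));
y_err_low  = 0.3*xdata + 1.5*rand(1, numel(xdata));
y_err_high = 0.6*xdata + 1.5*rand(1, numel(xdata));

% Straight-line model, p = [slope, intercept]
model = @(p, x) p(1)*x + p(2);
resid = @(p) model(p, xdata) - ydata;   % > 0 where the line lies above the measurement

% Direction-dependent scale: upper error above the point, lower error below it
sigma = @(r) y_err_low .* (r <= 0) + y_err_high .* (r > 0);

% Asymmetric least-squares cost (assumption: errors act as one-sigma-like scales)
cost = @(p) sum((resid(p) ./ sigma(resid(p))).^2);

p0   = polyfit(xdata, ydata, 1);   % start from the ordinary least-squares line
pfit = fminsearch(cost, p0);

fprintf('OLS:        slope = %.3f, intercept = %.3f\n', p0(1), p0(2));
fprintf('asymmetric: slope = %.3f, intercept = %.3f\n', pfit(1), pfit(2));
```

With this cost the line has more freedom to move towards whichever bound is further from each point, so for data like this, where the upper error is mostly the larger one, the fit tends to sit somewhat above the ordinary least-squares line.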
