Removing outliers from the data creates gaps. Filling these gaps with missing values or the median of surrounding values does not address the issue. Why?
I am analyzing EMG data in windows. In each window, I apply z-score normalization to identify and remove outliers. To address the gaps created by removing these outliers, I attempt to fill the empty spaces with the median of the surrounding values. Additionally, I have experimented with MATLAB built-in functions such as 'movmedian' for this purpose.
Here is my function:
function data_clean = remove_outliers_and_fill(data)
    % Calculate z-scores for each column
    z_scores = zscore(data);
    % Define outlier threshold
    threshold = 3;
    % Identify outliers
    outliers = abs(z_scores) > threshold;
    % Copy data to preserve original shape
    data_clean = data;
    % Loop through each column
    [num_rows, num_cols] = size(data);
    for col = 1:num_cols
        for row = 1:num_rows
            if outliers(row, col)
                range_start = max(1, row-10);
                range_end = min(num_rows, row+10);
                neighbors = data(range_start:range_end, col);
                % Exclude the outlier from median calculation
                filtered_neighbors = neighbors(neighbors ~= data(row, col));
                median_value = median(filtered_neighbors);
                data_clean(row, col) = median_value;
            end
        end
    end
end
Here is the plot showing the gaps that appear after applying the above function.
Accepted Answer
Star Strider
on 17 Jun 2024
Your version/release is not stated; however, the filloutliers function has been available since R2017a. With 'mean' as the ‘findmethod’, it automatically treats as outliers anything more than 3 standard deviations from the mean (equivalent to your ‘zscore’ reference). With 'median' (which I use here), it treats as outliers anything more than 3 scaled median absolute deviations from the median, which is a more robust version of the same idea. See the filloutliers documentation for details.
If you have R2017a or a later version/release, try this:
V1 = readmatrix('data.csv');
L = numel(V1);
X1 = linspace(0, L-1, L);
figure
plot(X1, V1)
grid
xlim([4300 5000])
title('Original')
[B,TF,Lower,Upper,Center] = filloutliers(V1, 'linear', 'median'); % fill outliers by linear interpolation
figure
plot(X1, V1, 'DisplayName','Original Data')
hold on
plot(X1, B, '-r', 'DisplayName','Outliers Filled (Linear Interpolation)')
hold off
grid
xlim([4300 5000])
legend('Location','best')
title('Filled Outliers')
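To see how the choice of detection method changes what gets flagged, here is a minimal sketch on synthetic data. The signal, the spike positions, and the 201-sample window are assumptions chosen for illustration, not values taken from the poster's file. The 'movmedian' option applies the median rule inside a sliding window, which may suit windowed EMG analysis better than a single global threshold.

```matlab
% Synthetic stand-in for one EMG channel (assumption: not the poster's data)
rng(1);
x = 1e-5 * randn(5000, 1);
spikes = [1200 2500 4700];              % hypothetical spike positions
x(spikes) = 1e-4;                       % isolated spikes, about 10 standard deviations

% Global z-score-style rule: more than 3 standard deviations from the mean
[B_mean, TF_mean] = filloutliers(x, 'linear', 'mean');

% Global robust rule: more than 3 scaled MAD from the median
[B_med, TF_med] = filloutliers(x, 'linear', 'median');

% Local rule: more than 3 local scaled MAD from the local median,
% computed over a 201-sample sliding window (window length is an assumption)
[B_mov, TF_mov] = filloutliers(x, 'linear', 'movmedian', 201);

fprintf('Flagged samples: mean %d, median %d, movmedian %d\n', ...
    nnz(TF_mean), nnz(TF_med), nnz(TF_mov));
```

All three outputs have the same length as the input, so the time axis stays intact whichever rule is used.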
2 Comments
Star Strider
on 18 Jun 2024
As always, my pleasure!
I am not certain what you intend by ‘I am looking for something which can remove the outliers from both time and frequency.’ If you want to remove the outliers rather than fill them by interpolating them, you can use the rmoutliers function. I do not usually suggest that because it disrupts the integrity of the data.
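The difference between removing and filling can be sketched as follows. The synthetic signal and the 1000 Hz sampling rate are assumptions for illustration. Removing samples shortens the vector and leaves uneven steps in the matching time vector, which is the loss of integrity mentioned above; filling keeps every sample position.

```matlab
% Synthetic signal with a matching time vector (assumed values)
rng(2);
fs = 1000;                               % assumed sampling rate, Hz
t = (0:4999)' / fs;
x = 1e-5 * randn(5000, 1);
x([1200 2500 4700]) = 1e-4;              % hypothetical spikes

% rmoutliers deletes the flagged samples; TF marks which ones were removed
[x_removed, TF] = rmoutliers(x, 'mean');
t_removed = t(~TF);                      % time stamps of the surviving samples

% filloutliers replaces the flagged samples and keeps the length
x_filled = filloutliers(x, 'linear', 'mean');

fprintf('Original %d, after removal %d, after filling %d samples\n', ...
    numel(x), numel(x_removed), numel(x_filled));
fprintf('Largest time step after removal: %.4f s (nominal %.4f s)\n', ...
    max(diff(t_removed)), 1/fs);
```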
If you want to remove specific frequencies from your data, use the Signal Processing Toolbox to create frequency-selective filters. There are several filtering options, and I can help you design and implement the filters.
One caution, however, is that it will be necessary to have a matching vector of sampling times for each dependent variable data element before you do any processing of the data. The reason is that the sampling times provide the frequency information and the regularity of the samples themselves. For optimal performance, the sampling frequency must be constant, and the sampling intervals consistent from sample to sample. If that is not the situation for your data, there is a function (resample) that can regularise the sampling frequency (and interpolate the dependent variable data) to provide that. At that point, you can use various filters. Again, I can help you design and implement them.
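As a rough sketch of that workflow (requires the Signal Processing Toolbox): the jittered time vector, the 1000 Hz target rate, the 50 Hz test tone, and the 20 to 450 Hz passband are all assumptions for illustration and should be replaced with values appropriate to the actual recording.

```matlab
% Hypothetical irregularly sampled signal (all values are assumptions)
rng(3);
fs = 1000;                                        % target sampling rate, Hz
N = 2000;
tx = 0.01 + (0:N-1)'/fs + 1e-4*(rand(N,1) - 0.5); % jittered, strictly increasing times
x = sin(2*pi*50*tx) + 0.1*randn(N, 1);            % 50 Hz tone plus noise

% Resample onto a uniform grid at fs; ty holds the new, regular time stamps
[y, ty] = resample(x, tx, fs);

% Check that the sampling interval is now constant
max_dev = max(abs(diff(ty) - 1/fs));
fprintf('Largest deviation from 1/fs after resampling: %g s\n', max_dev);

% With uniform sampling, a frequency-selective filter can be applied
% (passband chosen for illustration only; it must lie below fs/2)
y_filt = bandpass(y, [20 450], fs);
```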
More Answers (1)
Nipun
on 17 Jun 2024
Hi Seemab,
I understand that you want to remove outliers from your EMG data, fill the gaps with the median of the surrounding values, and avoid gaps in the resulting data. The gaps might be due to not considering edge cases correctly or the outlier removal leaving isolated data points.
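One way to test that idea, offered as a sketch rather than a diagnosis: plot draws nothing at NaN values, so visible gaps usually mean NaN in the output. In the posted function a NaN can arise when every neighbor equals the flagged sample (for example a clipped run), because the filtered neighbor list is then empty and the median of an empty array is NaN. It can also arise when the window already contains a NaN. Whether either case occurs in the actual data is an assumption that the NaN count below can confirm or rule out.

```matlab
% Case 1: every neighbor equals the flagged value (hypothetical clipped run)
neighbors = [5; 5; 5; 5; 5];
current = 5;
filtered_neighbors = neighbors(neighbors ~= current);  % empty
median_empty = median(filtered_neighbors);             % NaN

% Case 2: the window already contains a NaN
median_with_nan = median([1 NaN 3]);                   % NaN (NaNs are included by default)

fprintf('median of empty: %g, median with NaN: %g\n', median_empty, median_with_nan);

% To check real output, count the NaNs after running the function:
% num_nan = nnz(isnan(data_clean));
```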
Here's an improved version of your function to address the gaps:
- Use movmedian to smooth the data after outlier removal.
- Ensure the median replacement does not create new outliers.
function data_clean = remove_outliers_and_fill(data)
    % Calculate z-scores for each column
    z_scores = zscore(data);
    % Define outlier threshold
    threshold = 3;
    % Identify outliers
    outliers = abs(z_scores) > threshold;
    % Copy data to preserve original shape
    data_clean = data;
    % Loop through each column
    [num_rows, num_cols] = size(data);
    for col = 1:num_cols
        for row = 1:num_rows
            if outliers(row, col)
                range_start = max(1, row-10);
                range_end = min(num_rows, row+10);
                neighbors = data(range_start:range_end, col);
                % Exclude every flagged outlier in the window, not only
                % values equal to the current sample
                is_inlier = ~outliers(range_start:range_end, col);
                filtered_neighbors = neighbors(is_inlier);
                % median of an empty array is NaN, which plots as a gap,
                % so leave the sample unchanged if no inliers remain
                if ~isempty(filtered_neighbors)
                    data_clean(row, col) = median(filtered_neighbors);
                end
            end
        end
    end
    % Use movmedian to smooth the data after filling
    window_size = 5; % Adjust window size as needed
    for col = 1:num_cols
        data_clean(:, col) = movmedian(data_clean(:, col), window_size);
    end
end
Example Usage
% Sample data (replace with actual EMG data)
data = randn(5000, 1) * 1e-5;
% Add some artificial outliers for testing: isolated spikes at about
% 10 standard deviations, well above the z-score threshold of 3
data([1200 2500 4700]) = 1e-4;
% Clean the data
data_clean = remove_outliers_and_fill(data);
% Plot original and cleaned data
figure;
subplot(2,1,1);
plot(data);
title('Original Data');
xlabel('Time (windows)');
ylabel('Amplitude');
subplot(2,1,2);
plot(data_clean);
title('Cleaned Data');
xlabel('Time (windows)');
ylabel('Amplitude');
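A loop-free alternative, offered as a sketch rather than a drop-in replacement, is to mark the outliers as NaN and let fillmissing compute the moving median of the remaining neighbors. The synthetic data and spike positions are assumptions, and the 21-sample window is chosen to match the ±10 range used in the loop above.

```matlab
% Synthetic data with isolated spikes (assumed values)
rng(4);
data = 1e-5 * randn(5000, 1);
spikes = [1200 2500 4700];               % hypothetical spike positions
data(spikes) = 1e-4;

% Flag samples more than 3 standard deviations from the column mean
is_out = isoutlier(data, 'mean');

% Mark outliers as missing, then fill each one with the median of the
% non-missing values in a 21-sample window centered on it
data_nan = data;
data_nan(is_out) = NaN;
data_filled = fillmissing(data_nan, 'movmedian', 21);

% Any NaN left over would show up as a gap in the plot
remaining = nnz(isnan(data_filled));
fprintf('Outliers flagged: %d, NaNs remaining after fill: %d\n', nnz(is_out), remaining);
```

If a run of outliers is longer than the window, the window contains no valid neighbors and those samples stay NaN, so the remaining count is worth checking on real data.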
For more information on the movmedian function, refer to the MathWorks documentation: https://www.mathworks.com/help/matlab/ref/movmedian.html
Hope this helps.
Regards,
Nipun