rmoutliers
Detect and remove outliers in data
Syntax
B = rmoutliers(A)
B = rmoutliers(A,method)
B = rmoutliers(A,'percentiles',threshold)
B = rmoutliers(A,movmethod,window)
B = rmoutliers(___,dim)
B = rmoutliers(___,Name,Value)
[B,TF] = rmoutliers(___)
Description
B = rmoutliers(A) detects and removes outliers from the data in a vector, matrix, table, or timetable.
If A is a row or column vector, rmoutliers detects outliers and removes them.
If A is a matrix, table, or timetable, rmoutliers detects outliers in each column or variable of A separately and removes the entire row.
By default, an outlier is a value that is more than three scaled median absolute deviations (MAD) away from the median.
B = rmoutliers(___,Name,Value) specifies additional parameters for detecting and removing outliers using one or more name-value pair arguments. For example, rmoutliers(A,'SamplePoints',t) detects outliers in A relative to the corresponding elements of a time vector t.
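As a quick illustration of the matrix behavior described above, the following sketch plants one outlier in a magic square and removes the row that contains it. The test matrix and the variable name removedRows are assumptions chosen for illustration.

```matlab
% Plant a single outlier in column 4 of a 5-by-5 magic square.
A = magic(5);
A(4,4) = 500;

% Outliers are detected column by column; any row containing one is removed.
[B,TF] = rmoutliers(A);

removedRows = find(TF);   % indices of the rows that were dropped
size(B)                   % B has one row fewer than A
```

Because TF marks removed rows rather than individual elements, A(~TF,:) reproduces B.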
Examples
Remove Outliers in Vector
Create a vector containing two outliers, and remove them. TF allows you to identify which elements of the input vector were detected as outliers and removed.
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];
[B,TF] = rmoutliers(A)
B = 1×13
57 59 60 59 58 57 58 61 62 60 62 58 57
TF = 1×15 logical array
0 0 0 1 0 0 0 0 1 0 0 0 0 0 0
A(TF)
ans = 1×2
100 300
Detect Outliers using Mean
Remove outliers of a vector where an outlier is defined as a point more than three standard deviations from the mean of the data.
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];
[B,TF] = rmoutliers(A,'mean')
B = 1×14
57 59 60 100 59 58 57 58 61 62 60 62 58 57
TF = 1×15 logical array
0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
A(TF)
ans = 300
Detect Outliers with Sliding Window
Create a vector of data containing a local outlier.
x = -2*pi:0.1:2*pi;
A = sin(x);
A(47) = 0;
Create a time vector that corresponds to the data in A.
t = datetime(2017,1,1,0,0,0) + hours(0:length(x)-1);
Define outliers as points more than three local scaled MAD away from the local median within a sliding window. Find the locations of the outliers in A relative to the points in t with a window size of 5 hours, and remove them.
[B,TF] = rmoutliers(A,'movmedian',hours(5),'SamplePoints',t);
Plot the input data and the data with the outlier removed.
plot(t,A,'b.',t(~TF),B,'r')
legend('Input Data','Output Data')
Remove Columns Containing Outliers
Create a matrix containing two outliers, and remove the columns containing them.
A = magic(5);
A(4,4) = 500;
A(5,5) = 500;
A
A = 5×5
17 24 1 8 15
23 5 7 14 16
4 6 13 20 22
10 12 19 500 3
11 18 25 2 500
B = rmoutliers(A,2)
B = 5×3
17 24 1
23 5 7
4 6 13
10 12 19
11 18 25
Input Arguments
A
— Input data
vector | matrix | table | timetable
Input data, specified as a vector, matrix, table, or timetable.
Data Types: double | single
method
— Method for detecting outliers
'median' (default) | 'mean' | 'quartiles' | 'grubbs' | 'gesd'
Method for detecting outliers, specified as one of the following:
Method | Description
'median' | Outliers are defined as elements more than three scaled MAD from the median. The scaled MAD is defined as c*median(abs(A-median(A))), where c=-1/(sqrt(2)*erfcinv(3/2)).
'mean' | Outliers are defined as elements more than three standard deviations from the mean. This method is faster but less robust than 'median'.
'quartiles' | Outliers are defined as elements more than 1.5 interquartile ranges above the upper quartile (75 percent) or below the lower quartile (25 percent). This method is useful when the data in A is not normally distributed.
'grubbs' | Outliers are detected using Grubbs's test for outliers, which removes one outlier per iteration based on hypothesis testing. This method assumes that the data in A is normally distributed.
'gesd' | Outliers are detected using the generalized extreme Studentized deviate test for outliers. This iterative method is similar to 'grubbs', but can perform better when there are multiple outliers masking each other.
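To make the 'median' rule concrete, here is a minimal sketch that reproduces it by hand on the vector from the first example. The variable names (scaledMAD, TFmanual, Bmanual, sameResult) are illustrative, and the comparison with rmoutliers assumes the default threshold of three scaled MAD.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Scaling constant from the 'median' definition (approximately 1.4826).
c = -1/(sqrt(2)*erfcinv(3/2));

% Scaled MAD, and the elements lying more than three scaled MAD from the median.
scaledMAD = c*median(abs(A - median(A)));
TFmanual  = abs(A - median(A)) > 3*scaledMAD;
Bmanual   = A(~TFmanual);

% Compare the hand computation with the built-in default.
[B,TF] = rmoutliers(A);
sameResult = isequal(TFmanual,TF) && isequal(Bmanual,B);
```

For this vector the median is 59 and the unscaled MAD is 2, so the cutoff is roughly 8.9 and only the values 100 and 300 fall outside it.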
threshold
— Percentile thresholds
two-element row vector
Percentile thresholds, specified as a two-element row vector whose elements are in the interval [0,100]. The first element indicates the lower percentile threshold and the second element indicates the upper percentile threshold. For example, a threshold of [10 90] defines outliers as points below the 10th percentile or above the 90th percentile. The first element of threshold must be less than the second element.
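A short sketch of the percentile syntax, reusing the vector from the first example. The [10 90] thresholds and the variable name removed are chosen only for illustration.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Treat values below the 10th or above the 90th percentile as outliers.
[B,TF] = rmoutliers(A,'percentiles',[10 90]);

removed = A(TF);   % the values that were dropped
```

Unlike the MAD-based default, a percentile rule trims both tails whether or not the extreme values are unusual, so it typically removes some points even from clean data.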
movmethod
— Moving method
'movmedian' | 'movmean'
Moving method for determining outliers, specified as one of the following:
Method | Description
'movmedian' | Outliers are defined as elements more than three local scaled MAD from the local median over a window length specified by window. This method is also known as a Hampel filter.
'movmean' | Outliers are defined as elements more than three local standard deviations from the local mean over a window length specified by window.
window
— Window length
scalar | two-element vector
Window length, specified as a scalar or two-element vector.
When window is a positive integer scalar, the window is centered about the current element and contains window-1 neighboring elements. If window is even, then the window is centered about the current and previous elements.
When window is a two-element vector of positive integers [b f], the window contains the current element, b elements backward, and f elements forward.
When A is a timetable or 'SamplePoints' is specified as a datetime or duration vector, window must be of type duration, and the windows are computed relative to the sample points.
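The two ways of specifying window can be compared directly. In this sketch, which reuses the sine data from the sliding-window example, a scalar window of 5 and the two-element window [2 2] both cover the current element plus two neighbors on each side, so they are expected to flag the same points. The variable names are illustrative.

```matlab
% Sine wave with one local outlier, as in the sliding-window example.
x = -2*pi:0.1:2*pi;
A = sin(x);
A(47) = 0;

% Scalar window: the current element plus two neighbors on each side.
[B1,TF1] = rmoutliers(A,'movmedian',5);

% Two-element window: two elements backward and two elements forward.
[B2,TF2] = rmoutliers(A,'movmedian',[2 2]);

sameWindows = isequal(TF1,TF2);   % the two window specifications should agree
```

An asymmetric window such as [4 0] would instead use only the current element and the four before it, which can be useful when future samples are not yet available.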
dim
— Operating dimension
1 (default) | 2
Operating dimension, specified as 1 or 2. By default, rmoutliers operates along the first dimension whose size does not equal 1.
Name-Value Arguments
Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.
Example: rmoutliers(A,'ThresholdFactor',4)
SamplePoints
— Sample points
vector | table variable name | scalar | function handle | table vartype subscript
Sample points, specified as the comma-separated pair consisting of 'SamplePoints' and either a vector of sample point values or one of the options in the following table when the input data is a table. The sample points represent the x-axis locations of the data, and must be sorted and contain unique elements. Sample points do not need to be uniformly sampled. The vector [1 2 3 ...] is the default.
When the input data is a table, you can specify the sample points as a table variable using one of the following options.
Option for Table Input | Description
Variable name | A character vector or scalar string specifying a single table variable name
Scalar variable index | A scalar table variable index
Logical vector | A logical vector whose elements each correspond to a table variable, where true selects the corresponding variable and false excludes all other variables
Function handle | A function handle that takes a table variable as input and returns a logical scalar, which must be true for only one table variable
vartype subscript | A table subscript generated by the vartype function
Note
This name-value pair is not supported when the input data is a timetable. Timetables always use the vector of row times as the sample points. To use different sample points, you must edit the timetable so that the row times contain the desired sample points.
Moving windows are defined relative to the sample points. For example, if t is a vector of times corresponding to the input data, then rmoutliers(rand(1,10),'movmean',3,'SamplePoints',t) has a window that represents the time interval between t(i)-1.5 and t(i)+1.5.
When the sample points vector has data type datetime or duration, then the moving window length must have type duration.
Example: rmoutliers(A,'SamplePoints',0:0.1:10)
Example: rmoutliers(T,'SamplePoints',"Var1")
Data Types: single | double | datetime | duration
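The following sketch shows sample points that are sorted and unique but not uniformly spaced. The data, the sample points, and the variable name tKept are hypothetical values chosen for illustration.

```matlab
% Hypothetical measurements taken at non-uniform locations.
t = [0 1 2 3 4 10 11 12 13 14];    % sorted, unique, not evenly spaced
A = [2 3 2 30 3 2 3 2 3 2];

% A window of length 3 spans t(i)-1.5 to t(i)+1.5 in the units of t,
% not a fixed number of elements.
[B,TF] = rmoutliers(A,'movmedian',3,'SamplePoints',t);

tKept = t(~TF);   % sample points that correspond to the retained data
```

Indexing the sample points with ~TF keeps them aligned with B, which is the same pattern the sliding-window example uses when plotting.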
DataVariables
— Table variables to operate on
table variable name | scalar | vector | cell array | function handle | table vartype subscript
Table variables to operate on, specified as the comma-separated pair consisting of 'DataVariables' and one of the options in this table. The 'DataVariables' value indicates which variables of the input table to examine for outliers. Other variables in the table not specified by 'DataVariables' pass through to the output without being examined for outliers. When operating on the rows of A, rmoutliers removes any row that has outliers in the columns corresponding to the variables specified. When operating on the columns of A, rmoutliers removes the specified variables from the table.
Option | Description
Variable name | A character vector or scalar string specifying a single table variable name
Vector of variable names | A cell array of character vectors or string array where each element is a table variable name
Scalar or vector of variable indices | A scalar or vector of table variable indices
Logical vector | A logical vector whose elements each correspond to a table variable, where true selects the corresponding variable and false excludes it
Function handle | A function handle that takes a table variable as input and returns a logical scalar
vartype subscript | A table subscript generated by the vartype function
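A minimal sketch of restricting detection to one table variable. The table contents and the names T and removedRows are hypothetical.

```matlab
% Hypothetical table: Var1 contains one extreme value, Var2 does not.
T = table([1;2;3;4;100],[10;11;12;13;14],'VariableNames',{'Var1','Var2'});

% Examine only Var1 for outliers; Var2 passes through unexamined.
[B,TF] = rmoutliers(T,'DataVariables',"Var1");

removedRows = T(TF,:);   % rows dropped because Var1 held an outlier
```

The whole row is removed, including its Var2 entry, even though Var2 was never examined.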

Example: rmoutliers(T,'DataVariables',["Var1" "Var2" "Var4"])
ThresholdFactor
— Detection threshold factor
nonnegative scalar
Detection threshold factor, specified as the comma-separated pair consisting of 'ThresholdFactor' and a nonnegative scalar.
For methods 'median' and 'movmedian', the detection threshold factor replaces the number of scaled MAD, which is 3 by default.
For methods 'mean' and 'movmean', the detection threshold factor replaces the number of standard deviations from the mean, which is 3 by default.
For methods 'grubbs' and 'gesd', the detection threshold factor is a scalar ranging from 0 to 1. Values close to 0 result in a smaller number of outliers and values close to 1 result in a larger number of outliers. The default detection threshold factor is 0.05.
For the 'quartiles' method, the detection threshold factor replaces the number of interquartile ranges, which is 1.5 by default.
This name-value pair is not supported when the specified method is 'percentiles'.
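The effect of the threshold factor on the default 'median' method can be sketched as follows, reusing the vector from the first example. The factor of 20 and the variable names are arbitrary choices for illustration.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Default 'median' method with the default factor of 3 scaled MAD.
[~,TFdefault] = rmoutliers(A);

% A larger factor widens the accepted band, so fewer points are flagged.
[~,TFloose] = rmoutliers(A,'ThresholdFactor',20);

[nnz(TFdefault) nnz(TFloose)]   % number of flagged points under each setting
```

Note that the direction is reversed for 'grubbs' and 'gesd', where a larger factor flags more points rather than fewer.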
MaxNumOutliers
— Maximum outlier count
positive scalar
Maximum outlier count, for the 'gesd' method only, specified as the comma-separated pair consisting of 'MaxNumOutliers' and a positive scalar. The 'MaxNumOutliers' value specifies the maximum number of outliers returned by the 'gesd' method. For example, rmoutliers(A,'gesd','MaxNumOutliers',5) returns no more than five outliers.
The default value for 'MaxNumOutliers' is the integer nearest to 10 percent of the number of elements in A. Setting a larger value for the maximum number of outliers can ensure that all outliers are detected, but at the cost of reduced computational efficiency.
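A brief sketch of capping the 'gesd' method, again using the vector from the first example. The cap of 1 and the variable name numRemoved are illustrative choices.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Cap the GESD test at a single outlier.
[B,TF] = rmoutliers(A,'gesd','MaxNumOutliers',1);

numRemoved = nnz(TF);   % should not exceed the cap of 1
```

Because this vector contains two extreme values, a cap of 1 may leave one of them in B; raising the cap trades extra computation for a more complete search.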
MinNumOutliers
— Minimum outlier count
1 (default) | positive integer scalar
Minimum outlier count, specified as the comma-separated pair consisting of 'MinNumOutliers' and a positive integer scalar. The 'MinNumOutliers' value specifies the minimum number of outliers required to remove a row or column. For example, rmoutliers(A,'MinNumOutliers',3) removes a row of a matrix A when there are 3 or more outliers detected in that row.
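The following sketch contrasts the default behavior with a minimum count of 2. The matrix is a hypothetical variation on the magic-square example in which one row holds two planted outliers and another row holds one.

```matlab
% Row 4 contains one planted outlier; row 5 contains two.
A = magic(5);
A(4,4) = 500;
A(5,4) = 500;
A(5,5) = 500;

% Remove a row only when at least two of its elements are outliers.
[B,TF] = rmoutliers(A,'MinNumOutliers',2);

% For comparison, the default removes any row with at least one outlier.
[Bdefault,TFdefault] = rmoutliers(A);
```

Raising the minimum count can only keep more rows, so every row flagged in TF is also flagged in TFdefault.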
Output Arguments
B
— Data with outliers removed
vector | matrix | table | timetable
Data with outliers removed, returned as a vector, matrix, table, or timetable. The size of B depends on the number of removed rows or columns.
TF
— Removed data indicator
logical vector
Removed data indicator, returned as a logical vector. The value 1 (true) corresponds to rows or columns in A that were removed. The value 0 (false) corresponds to unchanged rows or columns. The orientation and size of TF depend on A and the dimension of operation.
Extended Capabilities
Tall Arrays
Calculate with arrays that have more rows than fit in memory.
Usage notes and limitations:
The 'percentiles', 'grubbs', and 'gesd' methods are not supported.
The 'movmedian' and 'movmean' methods do not support tall timetables.
The 'SamplePoints' and 'MaxNumOutliers' name-value pairs are not supported.
The value of 'DataVariables' cannot be a function handle.
Computation of rmoutliers(A), rmoutliers(A,'median',...), or rmoutliers(A,'quartiles',...) along the first dimension is only supported for tall column vectors A.
rmoutliers(A,2) is not supported for tall tables.
For more information, see Tall Arrays.
C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.
Usage notes and limitations:
The 'movmean' and 'movmedian' methods for detecting outliers do not support timetable input data, datetime 'SamplePoints' values, or duration 'SamplePoints' values.
For table input, dim must equal 1.
Thread-Based Environment
Run code in the background using MATLAB® backgroundPool or accelerate code with Parallel Computing Toolbox™ ThreadPool.
This function fully supports thread-based environments. For more information, see Run MATLAB Functions in Thread-Based Environment.
GPU Arrays
Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.
Usage notes and limitations:
The 'movmedian' moving method is not supported.
The 'SamplePoints' and 'DataVariables' name-value pairs are not supported.
For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).
See Also
isoutlier | filloutliers | ismissing | fillmissing | rmmissing | Clean Outlier Data