filloutliers
Detect and replace outliers in data
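Syntax
B = filloutliers(A,fillmethod)
B = filloutliers(A,fillmethod,findmethod)
B = filloutliers(A,fillmethod,"percentiles",threshold)
B = filloutliers(A,fillmethod,movmethod,window)
B = filloutliers(___,dim)
B = filloutliers(___,Name,Value)
[B,TF,L,U,C] = filloutliers(___)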
Description
B = filloutliers(A,fillmethod) finds outliers in A and replaces them according to fillmethod. For example, filloutliers(A,"previous") replaces outliers with the previous nonoutlier element.
- If A is a matrix, then filloutliers operates on each column of A separately.
- If A is a multidimensional array, then filloutliers operates along the first dimension of A whose size does not equal 1.
- If A is a table or timetable, then filloutliers operates on each variable of A separately.
By default, an outlier is a value that is more than three scaled median absolute deviations (MAD) from the median.
You can use filloutliers functionality interactively by adding the Clean Outlier Data task to a live script.
B = filloutliers(A,fillmethod,findmethod) specifies a method for detecting outliers. For example, filloutliers(A,"previous","mean") defines an outlier as an element of A more than three standard deviations from the mean.
B = filloutliers(A,fillmethod,"percentiles",threshold) defines outliers as points outside of the percentiles specified in threshold. The threshold argument is a two-element row vector containing the lower and upper percentile thresholds, such as [10 90].
B = filloutliers(A,fillmethod,movmethod,window) detects local outliers using a moving window mean or median with window length window. For example, filloutliers(A,"previous","movmean",5) identifies outliers as elements more than three local standard deviations from the local mean within a five-element window.
B = filloutliers(___,Name,Value) specifies additional parameters for detecting and replacing outliers using one or more name-value arguments. For example, filloutliers(A,"previous","SamplePoints",t) detects outliers in A relative to the corresponding elements of a time vector t.
Examples
Interpolate Outliers in Vector
Fill outliers in a vector of data using the "linear" method, and visualize the filled data.
Create a vector of data containing two outliers.
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];
Replace the outliers using linear interpolation.
B = filloutliers(A,"linear");
Plot the original data and the data with the outliers filled.
plot(A)
hold on
plot(B,"o-")
legend("Original Data","Filled Data")
Use Mean Detection and Nearest Fill Methods
Identify potential outliers in a timetable of data, fill any outliers using the "nearest" fill method, and visualize the cleaned data.
Create a timetable of data, and visualize the data to detect potential outliers.
T = hours(1:15);
V = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];
A = timetable(T',V');
plot(A.Time,A.Var1)
Fill outliers in the data, where an outlier is defined as a point more than three standard deviations from the mean. Replace the outlier with the nearest element that is not an outlier.
B = filloutliers(A,"nearest","mean")
B=15×1 timetable
Time Var1
_____ ____
1 hr 57
2 hr 59
3 hr 60
4 hr 100
5 hr 59
6 hr 58
7 hr 57
8 hr 58
9 hr 61
10 hr 61
11 hr 62
12 hr 60
13 hr 62
14 hr 58
15 hr 57
In the same graph, plot the original data and the data with the outlier filled.
hold on
plot(B.Time,B.Var1,"o-")
legend("Original Data","Filled Data")
Use Moving Detection Method
Use a moving median to detect and fill local outliers within a sine wave that corresponds to a time vector.
Create a vector of data containing a local outlier.
x = -2*pi:0.1:2*pi;
A = sin(x);
A(47) = 0;
Create a time vector that corresponds to the data in A.
t = datetime(2017,1,1,0,0,0) + hours(0:length(x)-1);
Define outliers as points more than three local scaled MAD from the local median within a sliding window. Find the location of the outlier in A relative to the points in t with a window size of 5 hours. Fill the outlier with the computed threshold value using the method "clip".
[B,TF,L,U,C] = filloutliers(A,"clip","movmedian",hours(5),"SamplePoints",t);
Plot the original data and the data with the outlier filled.
plot(t,A)
hold on
plot(t,B,"o-")
legend("Original Data","Filled Data")
Fill Outliers in Matrix Rows
Create a matrix of data containing outliers along the diagonal.
A = randn(5,5) + diag(1000*ones(1,5))
A = 5×5
10^3 ×
1.0005 -0.0013 -0.0013 -0.0002 0.0007
0.0018 0.9996 0.0030 -0.0001 -0.0012
-0.0023 0.0003 1.0007 0.0015 0.0007
0.0009 0.0036 -0.0001 1.0014 0.0016
0.0003 0.0028 0.0007 0.0014 1.0005
Fill outliers with zeros based on the data in each row, and display the new values.
[B,TF] = filloutliers(A,0,2);
B
B = 5×5
0 -1.3077 -1.3499 -0.2050 0.6715
1.8339 0 3.0349 -0.1241 -1.2075
-2.2588 0.3426 0 1.4897 0.7172
0.8622 3.5784 -0.0631 0 1.6302
0.3188 2.7694 0.7147 1.4172 0
You can access the detected outlier values and their filled values using TF as an index vector.
[A(TF) B(TF)]
ans = 5×2
10^3 ×
1.0005 0
0.9996 0
1.0007 0
1.0014 0
1.0005 0
Specify Outlier Locations
Create a vector containing two outliers and detect their locations.
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];
detect = isoutlier(A)
detect = 1x15 logical array
0 0 0 1 0 0 0 0 1 0 0 0 0 0 0
Fill the outliers using the "nearest" method. Instead of using a detection method, provide the outlier locations detected by isoutlier.
B = filloutliers(A,"nearest","OutlierLocations",detect)
B = 1×15
57 59 60 59 59 58 57 58 61 61 62 60 62 58 57
Return Outlier Thresholds
Replace the outlier in a vector of data using the "clip" fill method.
Create a vector of data with an outlier.
A = [60 59 49 49 58 100 61 57 48 58];
Detect outliers with the default method "median", and replace the outlier with the upper threshold value by using the "clip" fill method.
[B,TF,L,U,C] = filloutliers(A,"clip");
Plot the original data, the data with the outlier filled, and the thresholds and center value determined by the outlier detection method. The center value is the median of the data, and the upper and lower thresholds are three scaled MAD above and below the median.
plot(A)
hold on
plot(B,"o-")
yline([L U C],":",["Lower Threshold","Upper Threshold","Center Value"])
legend("Original Data","Filled Data")
Fill Values Above Scalar Threshold
Since R2024a
Create a table and fill outliers defined as values greater than 10. Create a table of logical variables loc that indicates the locations of outliers to fill. Then, specify the known outlier locations for filloutliers using the OutlierLocations name-value argument.
A = [1; 4; 9; 12; 3];
B = [9; 0; 6; 2; 1];
C = [14; 4; 2; 3; 8];
T = table(A,B,C)
T=5×3 table
A B C
__ _ __
1 9 14
4 0 4
9 6 2
12 2 3
3 1 8
loc = T>10
loc=5×3 table
A B C
_____ _____ _____
false false true
false false false
false false false
true false false
false false false
T = filloutliers(T,10,OutlierLocations=loc)
T=5×3 table
A B C
__ _ __
1 9 10
4 0 4
9 6 2
10 2 3
3 1 8
Input Arguments
A
— Input data
vector | matrix | multidimensional array | table | timetable
Input data, specified as a vector, matrix, multidimensional array, table, or timetable.
- If A is a table, then its variables must be of type double or single, or you can use the DataVariables argument to list double or single variables explicitly. Specifying variables is useful when you are working with a table that contains variables with data types other than double or single.
- If A is a timetable, then filloutliers operates only on the table elements. If row times are used as sample points, then they must be unique and listed in ascending order.
Data Types: double | single | table | timetable
fillmethod
— Fill method
numeric scalar | "center" | "clip" | "previous" | "next" | "nearest" | "linear" | "spline" | "pchip" | "makima"
Fill method for replacing outliers, specified as one of these values.
Fill Method | Description |
---|---|
Numeric scalar | Specified scalar value |
"center" | Center value determined by findmethod |
"clip" | Lower threshold value for elements smaller than the lower threshold determined by findmethod; upper threshold value for elements larger than the upper threshold determined by findmethod |
"previous" | Previous nonoutlier value |
"next" | Next nonoutlier value |
"nearest" | Nearest nonoutlier value |
"linear" | Linear interpolation of neighboring, nonoutlier values |
"spline" | Piecewise cubic spline interpolation |
"pchip" | Shape-preserving piecewise cubic spline interpolation |
"makima" | Modified Akima cubic Hermite interpolation (numeric, duration, and datetime data types only) |
Data Types: double | single | char | string
findmethod
— Method for detecting outliers
"median"
(default) | "mean" | "quartiles" | "grubbs" | "gesd"
Method for detecting outliers, specified as one of these values.
Method | Description |
---|---|
"median" | Outliers are defined as elements more than three scaled MAD from the median. The scaled MAD is defined as c*median(abs(A-median(A))), where c=-1/(sqrt(2)*erfcinv(3/2)). |
"mean" | Outliers are defined as elements more than three standard deviations from the mean. This method is faster but less robust than "median". |
"quartiles" | Outliers are defined as elements more than 1.5 interquartile ranges above the upper quartile (75 percent) or below the lower quartile (25 percent). This method is useful when the data in A is not normally distributed. |
"grubbs" | Outliers are detected using Grubbs' test, which removes one outlier per iteration based on hypothesis testing. This method assumes that the data in A is normally distributed. |
"gesd" | Outliers are detected using the generalized extreme Studentized deviate test for outliers. This iterative method is similar to "grubbs" but can perform better when multiple outliers are masking each other. |
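As a rough illustration of how the choice of detection method changes what gets flagged, the following sketch runs the "median" and "mean" methods on the sample vector used in the examples on this page. The variable names tfMedian, tfMean, idxMedian, and idxMean are illustrative and not part of the function's interface.

```matlab
% Sample vector reused from the examples on this page.
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Default "median" detection: more than three scaled MAD from the median.
[~,tfMedian] = filloutliers(A,"linear","median");

% "mean" detection: more than three standard deviations from the mean.
[~,tfMean] = filloutliers(A,"linear","mean");

idxMedian = find(tfMedian)   % positions flagged by "median"
idxMean = find(tfMean)       % positions flagged by "mean"
```

For this vector, the "median" method flags both 100 and 300 (positions 4 and 9), while the "mean" method flags only 300, because the extreme value inflates the standard deviation enough to hide the smaller outlier.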
threshold
— Percentile thresholds
two-element row vector
Percentile thresholds, specified as a two-element row vector whose elements are in the interval [0,100]. The first element indicates the lower percentile threshold, and the second element indicates the upper percentile threshold. The first element of threshold must be less than the second element.
For example, a threshold of [10 90] defines outliers as points below the 10th percentile and above the 90th percentile.
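A minimal sketch of the percentile syntax, assuming the sample vector from the examples on this page. The name Bmanual is illustrative. It rebuilds the clipped result from the returned thresholds to show what the "clip" fill method is expected to do with values outside the percentile range.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Flag values below the 10th or above the 90th percentile and clip them
% to the corresponding threshold.
[B,TF,L,U] = filloutliers(A,"clip","percentiles",[10 90]);

% Manual clipping with the returned thresholds, for comparison with B.
Bmanual = min(max(A,L),U);
matches = isequal(B,Bmanual)
```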
movmethod
— Moving method
"movmedian"
| "movmean"
Moving method for detecting outliers, specified as one of these values.
Method | Description |
---|---|
"movmedian" | Outliers are defined as elements more than three local scaled MAD from the local median over a window length specified by window. This method is also known as a Hampel filter. |
"movmean" | Outliers are defined as elements more than three local standard deviations from the local mean over a window length specified by window. |
window
— Window length
positive integer scalar | two-element vector of positive integers | positive duration scalar | two-element vector of positive durations
Window length, specified as a positive integer scalar, a two-element vector of positive integers, a positive duration scalar, or a two-element vector of positive durations.
When window is a positive integer scalar, the window is centered about the current element and contains window-1 neighboring elements. If window is even, then the window is centered about the current and previous elements.
When window is a two-element vector of positive integers [b f], the window contains the current element, b elements backward, and f elements forward.
When A is a timetable or SamplePoints is specified as a datetime or duration vector, window must be of type duration, and the windows are computed relative to the sample points.
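The two integer forms of window can describe the same neighborhood. The sketch below, using a made-up vector with one local spike, compares a centered five-element window with the equivalent [b f] form of two elements backward and two forward. The names B5, Bbf, and same are illustrative.

```matlab
% Made-up data: a steady ramp with one local spike at position 4.
A = [1 2 3 50 5 6 7 8 9 10];

% Centered window of length 5: the current element plus 4 neighbors.
B5 = filloutliers(A,"previous","movmedian",5);

% Same neighborhood written as [b f]: 2 elements backward, 2 forward.
Bbf = filloutliers(A,"previous","movmedian",[2 2]);

same = isequal(B5,Bbf)
```

With the "previous" fill method, the spike at position 4 is replaced by the value before it, so B5(4) is 3.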
dim
— Operating dimension
positive integer scalar
Operating dimension, specified as a positive integer scalar. If no value is specified, then the default is the first array dimension whose size does not equal 1.
Consider an m-by-n input matrix, A:
- filloutliers(A,fillmethod,1) fills outliers according to the data in each column of A and returns an m-by-n matrix.
- filloutliers(A,fillmethod,2) fills outliers according to the data in each row of A and returns an m-by-n matrix.
For table or timetable input data, dim is not supported and operation is along each table or timetable variable separately.
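As a small sketch of the operating dimension, the following code fills outliers row by row in a made-up 3-by-4 matrix and checks that the result agrees with transposing, operating along the columns, and transposing back. The names Brow, Bt, and agree are illustrative.

```matlab
% Made-up matrix with one large value in row 1 and one in row 3.
A = [1 2 3 100; 5 6 7 8; 9 200 11 12];

% Operate along dimension 2: each row is checked against its own median.
Brow = filloutliers(A,"center",2)

% Equivalent formulation: transpose, operate along dimension 1, transpose back.
Bt = filloutliers(A.',"center",1).';
agree = isequal(Brow,Bt)
```

Here the "center" fill method replaces 100 with 2.5, the median of row 1, and 200 with 11.5, the median of row 3. Row 2 has no outliers and is unchanged.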
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Example: filloutliers(A,"center","mean",ThresholdFactor=4)
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: filloutliers(A,"center","mean","ThresholdFactor",4)
SamplePoints
— Sample points
vector | table variable name | scalar | function handle | table vartype subscript
Sample points, specified as a vector of sample point values or, when the input data is a table, as one of the table variable options listed below. The sample points represent the x-axis locations of the data, and must be sorted and contain unique elements. Sample points do not need to be uniformly sampled. The vector [1 2 3 ...] is the default.
When the input data is a table, you can specify the sample points as a table variable using one of these indexing schemes:
- Variable name
- Variable index
- Function handle
- Variable type
Note
This name-value argument is not supported when the input data is a timetable. Timetables use the vector of row times as the sample points. To use different sample points, you must edit the timetable so that the row times contain the desired sample points.
Moving windows are defined relative to the sample points. For example, if t is a vector of times corresponding to the input data, then filloutliers(rand(1,10),"previous","movmean",3,"SamplePoints",t) has a window that represents the time interval between t(i)-1.5 and t(i)+1.5.
When the sample points vector has data type datetime or duration, the moving window length must have type duration.
Example: filloutliers([1 100 3 4],"nearest","SamplePoints",[1 2.5 3 4])
Example: filloutliers(T,"nearest","SamplePoints","Var1")
Data Types: single | double | datetime | duration
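Sample points also affect interpolating fill methods, because the replacement value depends on where the outlier sits on the x-axis. The sketch below reuses the four-element vector from the example above with the "linear" fill method. The names Bdefault and Bspaced are illustrative.

```matlab
A = [1 100 3 4];

% Default sample points 1:4. The outlier sits at x = 2, halfway between
% the neighbors at x = 1 and x = 3.
Bdefault = filloutliers(A,"linear")

% Nonuniform sample points. The outlier now sits at x = 2.5, closer to
% the neighbor at x = 3.
Bspaced = filloutliers(A,"linear",SamplePoints=[1 2.5 3 4])
```

With the default spacing, the outlier is replaced by 2. With the nonuniform sample points, it is replaced by 2.5.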
DataVariables
— Table variables to operate on
table variable name | scalar | vector | cell array | pattern | function handle | table vartype subscript
Table variables to operate on, specified using one of the indexing schemes listed below. The DataVariables value indicates which variables of the input table to fill. The data type associated with the indicated variables must be double or single.
Other variables in the table not specified by DataVariables pass through to the output without being filled.
You can specify the variables using one of these indexing schemes:
- Variable names
- Variable index
- Function handle
- Variable type
Example: filloutliers(A,"previous","DataVariables",["Var1" "Var2" "Var4"])
ReplaceValues
— Replace values indicator
true or 1 (default) | false or 0
Replace values indicator, specified as one of these logical or numeric values when A is a table or timetable:
- true or 1: Replace input table variables containing outliers with filled table variables.
- false or 0: Append the input table with all table variables that were checked for outliers. The outliers in the appended variables are filled.
For vector, matrix, or multidimensional array input data, ReplaceValues is not supported.
Example: filloutliers(T,"previous","ReplaceValues",false)
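The following sketch combines DataVariables and ReplaceValues on a small made-up table that mixes a numeric variable with a string variable. The variable names Reading and Label and the output name Bappend are illustrative. It checks only the table width and that the original variable is left untouched, and it does not rely on how the appended variable is named.

```matlab
% Made-up table: one numeric variable to clean and one string variable.
Reading = [57; 59; 60; 100; 59; 58; 57; 58; 300; 61];
Label = repmat("sensor1",10,1);
T = table(Reading,Label);

% Check only Reading for outliers, and append the filled copy instead of
% overwriting the original variable.
Bappend = filloutliers(T,"linear",DataVariables="Reading",ReplaceValues=false);

widthIn = width(T)
widthOut = width(Bappend)
unchanged = isequal(Bappend.Reading,Reading)
```

Because one data variable is specified, the output table is one variable wider than the input.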
ThresholdFactor
— Detection threshold factor
nonnegative scalar
Detection threshold factor, specified as a nonnegative scalar.
For methods "median" and "movmedian", the detection threshold factor replaces the number of scaled MAD, which is 3 by default.
For methods "mean" and "movmean", the detection threshold factor replaces the number of standard deviations from the mean, which is 3 by default.
For methods "grubbs" and "gesd", the detection threshold factor is a scalar ranging from 0 to 1. Values close to 0 result in a smaller number of outliers, and values close to 1 result in a larger number of outliers. The default detection threshold factor is 0.05.
For the "quartiles" method, the detection threshold factor replaces the number of interquartile ranges, which is 1.5 by default.
This name-value argument is not supported when the specified method is "percentiles".
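To make the effect of the threshold factor concrete, this sketch widens the "median" detection band on the sample vector from the examples on this page. The factor of 20 is an arbitrary illustrative choice, and the names tfDefault, tfWide, and Bwide are not part of the function's interface.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Default factor of 3 scaled MAD.
[~,tfDefault] = filloutliers(A,"center","median");

% Much wider band of 20 scaled MAD.
[Bwide,tfWide] = filloutliers(A,"center","median",ThresholdFactor=20);

idxDefault = find(tfDefault)   % positions flagged with the default factor
idxWide = find(tfWide)         % positions flagged with the wider band
```

With the default factor, positions 4 and 9 are flagged. With a factor of 20, only the value 300 at position 9 is flagged and replaced by the median, 59, while the value 100 is kept.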
MaxNumOutliers
— Maximum filled outliers by GESD
positive integer scalar
Maximum filled outliers by GESD, specified as a positive integer scalar. The MaxNumOutliers value specifies the maximum number of outliers that are filled by the "gesd" method. For example, filloutliers(A,"linear","gesd","MaxNumOutliers",5) fills no more than five outliers.
The default value for MaxNumOutliers is the integer nearest to 10 percent of the number of elements in A. Setting a larger value for the maximum number of outliers makes it more likely that all outliers are detected but at the cost of reduced computational efficiency.
The "gesd" method assumes the nonoutlier input data is sampled from an approximate normal distribution. When the data is not sampled in this way, the number of filled outliers might exceed the MaxNumOutliers value.
OutlierLocations
— Known outlier indicator
vector | matrix | multidimensional array | table | timetable
Known outlier indicator, specified as a logical vector, matrix, or multidimensional array, or a table or timetable with logical variables (since R2024a).
If OutlierLocations is an array, it must be the same size as A. If OutlierLocations is a table or timetable, it must contain logical variables with the same sizes and names as the input table variables to operate on.
Elements with a value of 1 (true) indicate the locations of outliers in A. Elements with a value of 0 (false) indicate nonoutliers. When you specify OutlierLocations, filloutliers does not use an outlier detection method. Instead, it uses the elements of the known outlier indicator to define outliers.
You cannot specify OutlierLocations if you specify findmethod.
Data Types: logical | table | timetable
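For array input, the known outlier indicator can come from any rule that produces a logical array of the same size as the data. The sketch below uses an arbitrary cutoff of 90 on the sample vector from the examples on this page. The cutoff and the name loc are illustrative.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Custom rule: treat every value above 90 as an outlier.
loc = A > 90;

% No detection method runs. The marked elements are filled with the
% previous nonoutlier value.
B = filloutliers(A,"previous",OutlierLocations=loc)
```

The values 100 and 300 are replaced by the elements that precede them, 60 and 58.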
Output Arguments
B
— Filled data
vector | matrix | multidimensional array | table | timetable
Filled data, returned as a vector, matrix, multidimensional array, table, or timetable.
B is the same size as A unless the value of ReplaceValues is false. If the value of ReplaceValues is false, then the width of B is the sum of the input data width and the number of data variables specified.
TF
— Filled data indicator
vector | matrix | multidimensional array
Filled data indicator, returned as a logical vector, matrix, or multidimensional array. Elements with a value of 1 (true) correspond to filled elements of B that were previously outliers. Elements with a value of 0 (false) correspond to unchanged elements.
TF is the same size as B.
Data Types: logical
L
— Lower threshold
scalar | vector | matrix | multidimensional array | table | timetable
Lower threshold used by the outlier detection method, returned as a scalar, vector, matrix, multidimensional array, table, or timetable. For example, the lower threshold value of the default outlier detection method is three scaled MAD below the median of the input data.
If findmethod is used for outlier detection, then L has the same size as A in all dimensions except for the operating dimension where the length is 1. If movmethod is used, then L has the same size as A.
U
— Upper threshold
scalar | vector | matrix | multidimensional array | table | timetable
Upper threshold used by the outlier detection method, returned as a scalar, vector, matrix, multidimensional array, table, or timetable. For example, the upper threshold value of the default outlier detection method is three scaled MAD above the median of the input data.
If findmethod is used for outlier detection, then U has the same size as A in all dimensions except for the operating dimension where the length is 1. If movmethod is used, then U has the same size as A.
C
— Center value
scalar | vector | matrix | multidimensional array | table | timetable
Center value used by the outlier detection method, returned as a scalar, vector, matrix, multidimensional array, table, or timetable. For example, the center value of the default outlier detection method is the median of the input data.
If findmethod is used for outlier detection, then C has the same size as A in all dimensions except for the operating dimension where the length is 1. If movmethod is used, then C has the same size as A.
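The size rules for L, U, and C can be checked with a short sketch. It assumes a 5-by-3 matrix built from magic(5), chosen only because it is deterministic. The numbered output names are illustrative.

```matlab
% Deterministic 5-by-3 test matrix.
A = magic(5);
A = A(:,1:3);

% Global detection: one threshold per column, so the outputs are 1-by-3.
[~,~,L1,U1,C1] = filloutliers(A,"clip","median");

% Moving detection: one threshold per element, so the outputs are 5-by-3.
[~,~,L2,U2,C2] = filloutliers(A,"clip","movmedian",3);

sizeGlobal = size(L1)
sizeMoving = size(L2)
```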
More About
Median Absolute Deviation
For a finite-length vector A made up of N scalar observations, the median absolute deviation (MAD) is defined as
$$\mathrm{MAD} = \operatorname{median}\left(\left|A_i - \operatorname{median}(A)\right|\right)$$
for i = 1,2,...,N.
The scaled MAD is defined as c*median(abs(A-median(A))), where c=-1/(sqrt(2)*erfcinv(3/2)).
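The definition above can be worked through by hand and compared with the thresholds that filloutliers returns for the default detection method. This is a sketch using the vector from the Return Outlier Thresholds example. The names center, scaledMAD, lowerManual, and upperManual are illustrative.

```matlab
A = [60 59 49 49 58 100 61 57 48 58];

% Scaling constant from the definition above (approximately 1.4826).
c = -1/(sqrt(2)*erfcinv(3/2));

% Center value and scaled MAD computed directly from the definition.
center = median(A);
scaledMAD = c*median(abs(A - center));

% Default detection band: three scaled MAD on either side of the median.
lowerManual = center - 3*scaledMAD
upperManual = center + 3*scaledMAD

% Thresholds and center reported by filloutliers for comparison.
[~,TF,L,U,C] = filloutliers(A,"clip");
```

For this vector the median is 58 and the scaled MAD is roughly 3.71, which gives thresholds of roughly 46.88 and 69.12. Only the value 100 falls outside that band.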
Alternative Functionality
Live Editor Task
You can use filloutliers functionality interactively by adding the Clean Outlier Data task to a live script.
Extended Capabilities
Tall Arrays
Calculate with arrays that have more rows than fit in memory.
The filloutliers function supports tall arrays with the following usage notes and limitations:
- The "percentiles", "grubbs", and "gesd" methods are not supported.
- The "movmedian" and "movmean" methods do not support tall timetables.
- The SamplePoints and MaxNumOutliers name-value arguments are not supported.
- The value of DataVariables cannot be a function handle.
- The OutlierLocations name-value argument cannot specify a table or timetable.
- Computation of filloutliers(A,fillmethod), filloutliers(A,fillmethod,"median",…), or filloutliers(A,fillmethod,"quartiles",…) along the first dimension is supported only when A is a tall column vector.
- The syntaxes filloutliers(A,"spline",…) and filloutliers(A,"makima",…) are not supported.
For more information, see Tall Arrays.
C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.
Usage notes and limitations:
- The "movmean" and "movmedian" methods for detecting outliers do not support timetable input data, datetime SamplePoints values, or duration SamplePoints values.
- Only the "center", "clip", and numeric scalar methods for filling outliers are supported when the input data is a timetable or when the SamplePoints value has type datetime or duration.
- To use the "spline" and "pchip" fill methods, you must enable support for variable-size arrays.
- String and character array inputs must be constant.
- The OutlierLocations name-value argument cannot specify a table or timetable.
Thread-Based Environment
Run code in the background using MATLAB® backgroundPool or accelerate code with Parallel Computing Toolbox™ ThreadPool.
This function fully supports thread-based environments. For more information, see Run MATLAB Functions in Thread-Based Environment.
Version History
Introduced in R2017a
R2024b: Support "makima" as input value to fill method
The fill method now supports "makima" as an input value for C/C++ code generation.
R2024a: Define outlier locations as table
Define the locations of outliers by specifying the OutlierLocations name-value argument as a table containing logical variables with names present in the input table. Previously, you could specify OutlierLocations only as a vector, matrix, or multidimensional array.
R2022a: Append filled values
For table or timetable input data, append the input table with all table variables that were checked for outliers. The outliers in the appended variables are filled. Append, rather than replace, table variables by setting the ReplaceValues name-value argument to false.
R2021b: Specify sample points as table variable
For table input data, specify the sample points as a table variable using the SamplePoints name-value argument.
See Also
Functions
rmoutliers | isoutlier | clip | ismissing | fillmissing | fillmissing2