Data Cleaning

 

What Is Data Cleaning?

3 things you need to know

Data cleaning, also known as data cleansing or data wrangling, is the process of identifying and addressing anomalies in a given data set. Various techniques can be employed to cleanse data, including managing outliers, estimating missing data, or filtering out noise.

By cleaning the data, engineers and data scientists can improve the quality of their results and avoid making erroneous conclusions based on flawed or incomplete data.

MATLAB® provides functions and apps that simplify data cleaning, allowing you to focus on your analysis and problem-solving.

Why It Matters

Why Does Data Cleaning Matter?

Data cleaning techniques are used across various domains:

  • Signal Processing
  • Signals obtained from analog sensors are exposed to a variety of noise sources resulting in data inconsistencies. Preprocessing involves using data cleaning techniques such as removing outliers and smoothing to ensure that the analysis is performed on clean and meaningful data.

    Advanced denoising techniques such as adaptive filters and wavelet transforms help remove noise while preventing the loss of underlying signal information in systems with wideband signals affected by white noise.

    Figure 1 shows a plot of the daily weight of a person—with missing values—recorded for 365 days. The plot shows original and interpolated readings. The plot, zoomed in on days 200 through 250, shows identified missing values and their replacements derived using interpolation.

MATLAB plot with day on the x-axis and weight in kg on the y-axis. The original weight data is plotted in blue, with circles representing missing weight values filled using an interpolation technique.

Figure 1. MATLAB plot of daily weight data of a person, with its missed entries filled using interpolation technique.

  • Artificial Intelligence (AI)
  • The performance and reliability of AI models depends directly on the quality of input data. Data cleaning eliminates noise, inconsistencies, missing values, and outliers that can mislead AI models, leading to biased or erroneous outcomes. Clean input data ensures that AI algorithms learn valid patterns and provide accurate predictions.

    Figure 2 shows a MATLAB plot of load consumption data cleansed of missing and outlier data and noise using the fillmissing, filloutliers, and smooothdata functions, which is then inputted into an AI model to produce an accurate prediction of load consumption. Figure 3 plots the results of a prediction model alongside the actual data and the amount of error between them in MATLAB. Without using a cleansed data set to train the model, the prediction error could be higher.

MATLAB plot showing power consumption raw data and clean data after using data cleaning techniques.

Figure 2. Plot of load consumption pre- and post-data cleaning using MATLAB.

Load forecasting plot of power consumption data created using a bagged tree–based regression AI model in MATLAB and a MATLAB plot showing the difference between the actual data and the predicted data with months on the x-axis and error in megawatts on the y-axis.

Figure 3. Validation results of an AI model that is predicting load consumption plotted using MATLAB. a) Bagged tree–based regression model with the predicted data plotted against actual consumption data. b) Error between predicted data and actual consumption data.

  • Perform Data Cleaning Using Deep Learning Models
  • Traditional data cleaning methods work well with data that can be modeled with commonly known statistical and mathematical models. But for complex data sets that do not fit standard models well, like human speech, EEG signals, etc., we can leverage deep learning models to perform data cleaning.

    In this example shown in Figure 4, speech signals are riddled with noise from a washing machine running in the background. Data cleaning methods, such as smoothing or outlier removal, cannot effectively remove the noise from the washing machine data as it has an audio spectrum that overlaps with the speech signal. Deep learning networks, such as fully connected and convolutional, are able to clean or denoise the speech signal, thus removing the noise and leaving the underlying signal.

Graphs showing a clean speech signal and a version of it contaminated by washing machine noise in the background and graphs comparing the output of fully connected and convolutional networks used to denoise the speech signal plotted in MATLAB.

Figure 4. MATLAB plots of clean and noisy speech signals and denoised output from two deep learning networks—fully connected and convolutional.

  • Computational Finance
  • Accurate financial models, risk assessments, and investment strategies rely on clean data. Financial analysts spend a significant amount of their time on data cleaning instead of actual modeling due to challenges in their data. Missing data, large data, or varying data sources can lead to biased predictions and flawed decision-making. Data cleaning is an effective technique for improving accuracy and reliability of financial analysis. Figure 5 shows the missing values in the average annual stock prices of a company. The fillmissing function in MATLAB is used to fill the missing data using linear interpolation.

MATLAB plot of raw data of a company’s average annual stock price containing missing values alongside the plot of its cleaned version where the missing values were filled using the fill missing data cleaning technique by linear interpolation.

Figure 5. MATLAB plot shows the filling of missing data in a company’s average annual stock price using linear interpolation.

How It Works

How Does Data Cleaning Work?

Data cleaning is an iterative process that involves different techniques depending on your data set, the objectives of the final analysis, and the available tools and software. Data cleaning typically involves one or more of following steps:

Fill Missing Data

Missing data refers to the absence of values or information in a data set resulting in NULL, 0, empty strings, or invalid (NaN) data points. Values can be missing because of several reasons such as data acquisition, data transmission, and data conversion. Missing data can have a significant impact on the quality and validity of data analysis and modeling; hence, it is important to address it appropriately during the data cleaning process.

Missing data can be classified into three categories and identifying the right category can help you select an appropriate fill method:

  1. Missing at random (MAR) — In this category, the variable with missing values is dependent on other variables in the data set. For instance, a rooftop solar installation relaying telemetry data such as irradiance level, grid voltage, frequency, etc., would have missing values at night or during rainy days because there isn’t enough solar irradiance to power up the system and thus the missing values of grid voltage or frequency are caused by poor irradiance levels.
  2. Missing completely at random (MCAR) — In this category, the underlying cause of missing values is completely unrelated to any other variable in the data set. For example, missing packets in weather telemetry could result from malfunctioning sensors or high channel noise.
  3. Missing not at random (MNAR) — This scenario applies to variables where the underlying cause of missing data is related to the variable itself. For example, if a sensor relaying temperature information has reached its measurement limits, it would result in missing values in the form of its saturated thresholds.

Identifying missing data sounds straightforward, but replacing it with a suitable estimate is an involved process. You can start by spotting missing values using visualization or searching for invalid values. Replacing the missing values involves generating values that are likely to be close to the actual values. Based on the nature of the data, the technique of filling these missing values can vary. For example:

  • A slow varying data like temperature could simply use the nearest valid value.
  • Data sets exhibiting seasonality and reduced randomness like weather could use statistical methods like moving average, median, or K-nearest neighbors.
  • Data sets exhibiting strong dependencies on its previous values like stock prices or economic indicators are well-suited for interpolation-based techniques to generate missing data.

The plot below shows raw solar irradiance data with its missing values filled using the fillmissing function. In this instance, a moving median window-based technique is used to fill in the missing values.

Figure 6. Time-series plot of a solar irradiance raw data set, with its missing values filled using the fillmissing function in MATLAB.

Managing Outliers

Outliers are data points that deviate significantly from most observations within a data set. They can be unusually high or low values that do not seem to follow the general pattern of the data. Outliers can distort the statistical analysis and interpretation of a data set, potentially leading to misleading results. Outliers can arise due to various reasons, including measurement errors, data entry mistakes, natural variability, or genuine anomalies in the underlying process being studied.

Managing outliers involves two configurable steps:

  1. Detection
  2. Detecting outliers involves defining a valid operating range outside of which any data point is identified as an outlier. Methods used in defining the valid operating range are related to the attribute, source, and purpose of the data set. These methods range from simple techniques like visualization-based or fixed threshold–based outlier detection to statistical methods like median absolute deviation, to distance-based methods, such as Euclidean and Mahalanobis.

  3. Filling outliers
  4. After identifying the outliers, they can be replaced with generated values. Generating techniques used in replacing outliers are similar to the ones used for filling missing data.

Figure 7 shows input data with two outliers that are detected and filled using the linear interpolation median detection method.

Graph shows two outliers detected using median thresholding and filled by a linear interpolation method interactively using the Clean Outlier Data Live Editor task in MATLAB.

Figure 7. Clean Outlier Live Editor task used to detect and fill outliers using median thresholding and linear interpolation respectively.

Smoothing

Smoothing is a data analysis technique used to reduce noise, variability, or irregularities in a data set to reveal underlying patterns or trends more clearly. It is commonly applied in various fields including statistics, signal processing, time-series analysis, and image processing.

Like other data cleaning methods, the smoothing technique is also highly dependent on the nature and the domain of the data. You can use simple statistical methods like moving average filter, weighted moving average filter, or moving median-based filter to more complex techniques like splines, Fourier transform smoothing, and Kalman filtering. The smoothing function requires the data set to be ordered and sampled at a fixed interval.

A plot showing raw noisy input data before and after applying the data cleaning technique smoothdata function in MATLAB to remove the noise from the input signal.

Figure 8. MATLAB plot of a noisy data set smoothed using a moving average filter with the smoothdata function.

Data Cleaning with MATLAB

Data Cleaning with MATLAB

Data cleaning is an important first step in many engineering and scientific workflows. MATLAB provides several functions and interactive low-code techniques (apps and Live Editor tasks) to perform data cleaning efficiently.

Interactive Capabilities

Writing code using high-level functions can be very efficient, particularly when you’re familiar with the language syntax and different function options. However, identifying the right data cleaning technique is typically iterative, and spending time getting the code just right for each iteration can slow things down. With the interactive tools in MATLAB, you can experiment with different data cleaning methods quickly, visualizing the impact of each approach on the data set, and then determining the optimal cleaning approach. Live Editor tasks and apps, such as the Data Cleaner app, make it easy to explore, clean, prepare, and complete data analysis tasks with less coding. The tools can also automatically generate MATLAB code for reuse.

Live Editor Tasks

Live Editor tasks are simple point-and-click interfaces or small apps that can be embedded into a live script. Tasks represent a series of MATLAB commands that automatically run as you explore the parameters interactively; results are displayed immediately. Tasks are useful because they can help reduce development time, errors, and time spent in plotting. You can interactively perform different data cleaning operations using a sequence of Live Editor tasks such as Clean Missing Data, Clean Outlier Data, Smooth Data, and so on.

Screenshot of the Data Preprocessing toolbar in MATLAB, providing a selection of interactive tasks to choose from to perform data cleaning.

Figure 9. Collection of Live Tasks available in MATLAB for data preprocessing.

Figure 10. Fill missing data interactively by using the Clean Missing Data Live Editor task.

Data Cleaner App

The Data Cleaner app allows you to process and clean column-oriented raw data. You can interactively iterate and visualize the impact of various data cleaning choices on your raw data.

Using this app, you can:

  • Access column-oriented data from the MATLAB workspace or from a file.
  • Explore data by using visualization, data, and summary views.
  • Sort by a variable and rename or remove variables.
  • Retime data in a timetable, stack or unstack table variables, clean missing data, clean outlier data, smooth data, or normalize data.
  • Edit and reorder previously performed data cleaning steps based on how you want to analyze your data.
  • Export cleaned data to the MATLAB workspace, or export MATLAB code for data cleaning as a script or function.

Figure 11. Explore and clean time-series data using the Data Cleaner app in MATLAB.

Programmatic Approach

You can use a combination of command line functions available in MATLAB to efficiently perform data cleaning of your data set and prepare it for analysis. By using functions like readmatrix, readtable, readtimetable, and importdata, raw data can be imported into MATLAB. You can then visualize your data set to identify anomalies. Once you have established the inconsistencies with your data set, you can use data cleaning functions in MATLAB to fill missing values (fillmissing, ismissing, rmmissing), remove outliers (filloutliers, rmoutliers), and filter any noise (smoothdata, movmean, movmedian) in the data set.

Data cleaning is an important first step in data analysis to make your data suitable for further analysis. For more information, check the resources below.