
Chapter 2

Creating Annotated Data to Train and Validate Models



Training, Validation, and Testing Sets: What’s the Difference?

Most of your data should be reserved for training. The training set is what the backpropagation algorithm uses to optimize the network's many weights, fitting the input data to the output annotations you provide so the model learns what it should consider important. Training sets tend to be so large that they often include preexisting or simulated data.

Validation data is also used while you train your model. It continuously checks how well your model generalizes to new data during training and helps you choose between models. The validation set isn't consumed by the data-hungry optimization algorithm, so it can generally be much smaller than the training set. Validation data is usually made as realistic as possible, which often means acquiring new real-world signals and annotating them afresh.
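
In MATLAB, you can supply validation data when configuring training so that validation accuracy is tracked alongside training accuracy. Here is a minimal sketch, assuming Deep Learning Toolbox and preexisting training arrays, validation arrays, and a layers array (all variable names are illustrative):

options = trainingOptions("adam", ...
    "ValidationData", {XVal, TVal}, ...  % checked periodically during training
    "ValidationFrequency", 50, ...       % iterations between validation passes
    "Plots", "training-progress");       % produces a plot like the one below
net = trainNetwork(XTrain, TTrain, layers, options);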

You use the testing set to measure the performance of the model after training is complete. Like the validation set, the testing set should be as realistic as possible, which again often means acquiring new real-world signals and annotating them afresh.
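
A minimal sketch of that final measurement, assuming a trained classification network net and held-out test data (the variable names are illustrative):

YPred = classify(net, XTest);          % predict labels for the test set
testAccuracy = mean(YPred == YTest)    % fraction of correct predictions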

Screenshot showing training progress with accuracy on the y-axis and iteration on the x-axis. A blue line starts at 60% accuracy, quickly rises, and continues at 90 to 98%. A dashed line starts at 30% accuracy, rises, and stays at 90%

The solid blue line represents the accuracy on the training set. The dashed black line represents the accuracy on the validation set.

To create a working deep learning model, you typically need at least three different types of data: data to train it, data to validate that it is genuinely learning, and data to test its final performance.
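
One simple way to carve out those three sets is to shuffle the example indices and split them by proportion. This is a minimal sketch; the 80/10/10 split and the labels variable (one label per example) are illustrative:

rng(0)                                 % make the shuffle reproducible
N = numel(labels);                     % labels: one label per example
p = randperm(N);                       % random permutation of indices
nTrain = round(0.8*N);
nVal   = round(0.1*N);
trainIdx = p(1:nTrain);                % 80% for training
valIdx   = p(nTrain+1:nTrain+nVal);    % 10% for validation
testIdx  = p(nTrain+nVal+1:end);       % remaining ~10% for testing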

Importing Data Sets

Given the amount of data necessary to train a deep learning model, it's important to consider memory constraints and data management. If you can't fit all your data in memory, you need a way to represent your stored data without reading it all in one go. One way to do this in MATLAB is to use datastores like audioDatastore (requires Audio Toolbox™) or signalDatastore (requires Signal Processing Toolbox™). These datastores manage in-memory and out-of-memory signals and can process signals in parallel to extract features.
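
As a minimal sketch, assuming Audio Toolbox and a folder of WAV files named recordings, organized into one subfolder per label (the folder name and label scheme are hypothetical):

ads = audioDatastore("recordings", ...
    "IncludeSubfolders", true, ...
    "LabelSource", "foldernames");     % one label per subfolder
while hasdata(ads)                     % stream files one at a time
    [x, info] = read(ads);             % x: samples; info.SampleRate: rate
    % ... extract features from x here instead of loading everything ...
end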


Labeling

Labeling, or annotating, data correctly is necessary for your model to learn the right behavior. Accurate labels are especially important for your validation and testing data sets, because these sets are how you judge your model's performance, both during training and once training is complete. Your training data labels matter too, but given the size of the training set, they are often handled with a different set of techniques.

This section starts with labels for validation and testing data sets, and then looks at training data sets.

Validation and Testing Data

Validation data needs to accurately represent the data that the network will see in the final application. First, it must include signals that closely represent the problem you are trying to solve. For an audio application, this might mean recording signals with the same microphone, first in quiet environments and then with varying levels of noise, echo, and reverberation.

Second, the validation data should include high-quality labels, possibly added manually, that capture exactly what you want the network to learn. In the example below, that is the red mask plotted on top of the signal.

Screenshot labeled ‘validation signal example’ shows a voice signal with a red line along the center and rising to outline each peak of the signal.

A sample validation signal for keyword spotting. The blue signal is the speech; the red line indicates the keyword mask.

If you played back just the regions covered by the keyword mask, you would hear only the isolated keywords.

How Can You Achieve Good-Quality Labels for Your Validation and Test Data?

You need a system with proven accuracy at carrying out a similar task. In practice, this means either labeling the data manually or using a pretrained machine learning model.

You can make manual labeling easier with an interactive app like Signal Labeler. Interactive apps provide an interface to select regions in signals, assign labels, adjust selected segments, and perform other labeling tasks.

Plot the signal using a different color for each word identified by an external API, as shown in this example.

Another option is to use a working model developed by someone else. Here's an example using Google's well-known speech-to-text service through its cloud API.

To create a mask label for trigger words, you can export the word labels to the MATLAB command line; after a few lines of code, you have the labels you need. See the code for this example.
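
As a sketch of those few lines, assuming the exported labels arrive as a table wordLabels with ROILimits (region start and end times in seconds) and Value (the recognized word) variables; the file name, table layout, and keyword are hypothetical:

[x, fs] = audioread("recording.wav");              % signal and sample rate
mask = false(size(x,1), 1);                        % one flag per sample
keyword = "yes";                                   % trigger word to mask
for k = 1:height(wordLabels)
    if strcmpi(string(wordLabels.Value(k)), keyword)
        n = max(1, round(wordLabels.ROILimits(k,:)*fs));
        mask(n(1):n(2)) = true;                    % mark keyword samples
    end
end
sound(x.*mask, fs)                                 % play only masked regions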

You can then play back just the annotated segments, as the last line of the sketch above does.

Screenshot of signal with orange outlines above several peaks, representing a mask of trigger words in the audio.

Training Data

When working with signal data such as audio recordings, it is unrealistic to record terabytes of good-quality data and label it all accurately by hand. One way around this is to use existing labeled recordings, possibly created for a slightly different problem. A research data set that you can license is another good option.

Label Spoken Words in Audio Signals Using External API

Test the Signal Labeler app for yourself with the IBM® Watson Speech to Text API.

It’s okay if the training set is not tailor-made for your application; however, the bigger the difference between the training and validation data, the larger the gap between training and validation accuracy will be.

It is also good to have a few extra techniques handy, such as automated labeling algorithms, to get you started.

Audio Toolbox in particular has many automatic labeling functions, including detectSpeech and speech2text, that can help with labeling. Similarly, Signal Processing Toolbox supports bulk automatic labeling in Signal Labeler through Peak Labeler and custom labeling functions.
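
For instance, detectSpeech returns the boundaries of speech regions, which you can use as automatic region labels. A minimal sketch, assuming Audio Toolbox and a speech recording (the file name is hypothetical):

[x, fs] = audioread("recording.wav");  % load the speech signal
idx = detectSpeech(x, fs);             % N-by-2 start/end sample indices
detectSpeech(x, fs)                    % with no output, plots the regions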

The next chapter covers more techniques to improve the quality and grow the size of your training data, such as data augmentation and synthesis.

Test Your Knowledge