splitlabels
Syntax
Description
Use this function when you are working on a machine or deep learning classification problem and you want to split a dataset into training, testing, and validation sets that hold the same proportion of label values.
idxs = splitlabels(___,Name,Value) specifies additional input arguments using name-value pairs. For example, 'UnderlyingDatastoreIndex',3 splits the labels only in the third underlying datastore of a combined datastore.
Examples
Split Vowels
Read William Shakespeare's sonnets with the fileread function. Extract all the vowels from the text and convert them to lowercase.
sonnets = fileread("sonnets.txt");
vowels = lower(sonnets(regexp(sonnets,"[AEIOUaeiou]")))';
Count the number of instances of each vowel.
cnts = countlabels(vowels)
cnts=5×3 table
Label Count Percent
_____ _____ _______
a 4940 18.368
e 9028 33.569
i 4895 18.201
o 5710 21.232
u 2321 8.6302
Split the vowels into a training set containing 500 instances of each vowel, a validation set containing 300, and a testing set with the rest. All vowels are represented with equal weights in the first two sets but not in the third.
spltn = splitlabels(vowels,[500 300]);
for kj = 1:length(spltn)
    cntsn{kj} = countlabels(vowels(spltn{kj}));
end
cntsn{:}
ans=5×3 table
Label Count Percent
_____ _____ _______
a 500 20
e 500 20
i 500 20
o 500 20
u 500 20
ans=5×3 table
Label Count Percent
_____ _____ _______
a 300 20
e 300 20
i 300 20
o 300 20
u 300 20
ans=5×3 table
Label Count Percent
_____ _____ _______
a 4140 18.083
e 8228 35.94
i 4095 17.887
o 4910 21.447
u 1521 6.6437
Split the vowels into a training set containing 50% of the instances, a validation set containing another 30%, and a testing set with the rest. All vowels are represented with the same weight across all three sets.
spltp = splitlabels(vowels,[0.5 0.3]);
for kj = 1:length(spltp)
    cntsp{kj} = countlabels(vowels(spltp{kj}));
end
cntsp{:}
ans=5×3 table
Label Count Percent
_____ _____ _______
a 2470 18.367
e 4514 33.566
i 2448 18.203
o 2855 21.23
u 1161 8.6333
ans=5×3 table
Label Count Percent
_____ _____ _______
a 1482 18.371
e 2708 33.569
i 1468 18.198
o 1713 21.235
u 696 8.6277
ans=5×3 table
Label Count Percent
_____ _____ _______
a 988 18.368
e 1806 33.575
i 979 18.2
o 1142 21.231
u 464 8.6261
Split Vowels and Consonants
Read William Shakespeare's sonnets with the fileread function. Remove all nonalphabetic characters from the text and convert to lowercase.
sonnets = fileread("sonnets.txt");
letters = lower(sonnets(regexp(sonnets,"[A-z]")))';
Classify the letters as consonants or vowels and create a table with the results. Show the first few rows of the table.
type = repmat("consonant",size(letters));
type(regexp(letters',"[aeiou]")) = "vowel";
T = table(letters,type,'VariableNames',["Letter" "Type"]);
head(T)
Letter Type
______ ___________
t "consonant"
h "consonant"
e "vowel"
s "consonant"
o "vowel"
n "consonant"
n "consonant"
e "vowel"
Display the number of instances of each category.
cnt = countlabels(T,'TableVariable',"Type")
cnt=2×3 table
Type Count Percent
_________ _____ _______
consonant 46516 63.365
vowel 26894 36.635
Split the table into two sets, one containing 60% of the consonants and vowels and the other containing 40%. Display the number of instances of each category.
splt = splitlabels(T,0.6,'TableVariable',"Type");
sixty = countlabels(T(splt{1},:),'TableVariable',"Type")
sixty=2×3 table
Type Count Percent
_________ _____ _______
consonant 27910 63.366
vowel 16136 36.634
forty = countlabels(T(splt{2},:),'TableVariable',"Type")
forty=2×3 table
Type Count Percent
_________ _____ _______
consonant 18606 63.363
vowel 10758 36.637
Split the table into two sets, one containing 60% of each particular letter and the other containing 40%. Exclude the letter y, which sometimes acts as a consonant and sometimes as a vowel. Display the number of instances of each category.
splt = splitlabels(T,0.6,'Exclude',"y");
sixti = countlabels(T(splt{1},:),'TableVariable',"Type")
sixti=2×3 table
Type Count Percent
_________ _____ _______
consonant 26719 62.346
vowel 16137 37.654
forti = countlabels(T(splt{2},:),'TableVariable',"Type")
forti=2×3 table
Type Count Percent
_________ _____ _______
consonant 17813 62.349
vowel 10757 37.651
Split the table into two sets of the same size. Include only the letters e and s. Randomize the sets.
halves = splitlabels(T,0.5,'randomized','Include',["e" "s"]);
cnt = countlabels(T(halves{1},:))
cnt=2×3 table
Letter Count Percent
______ _____ _______
e 4514 64.385
s 2497 35.615
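To illustrate what the 'randomized' option changes, here is a minimal sketch that compares an ordered split with a randomized one. The labels are hypothetical toy data, not part of the sonnets example. It assumes that 'randomized' changes only which members of each category are selected, so both splits should report the same per-category counts.

```matlab
% Hypothetical toy labels: 40 "A" followed by 60 "B".
lbls = categorical([repmat("A",40,1); repmat("B",60,1)]);

% Ordered split: the first half of each category goes to the first set.
idxOrdered = splitlabels(lbls,0.5);

% Randomized split: half of each category, selected at random.
idxRandom = splitlabels(lbls,0.5,'randomized');

% Compare the per-category counts of the two first sets.
cntOrdered = countlabels(lbls(idxOrdered{1}))
cntRandom = countlabels(lbls(idxRandom{1}))
```

The index values in idxRandom generally differ from run to run, while the counts per category stay the same.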
Split Data in Datastore
Create a dataset that consists of 100 Gaussian random numbers. Label 40 of the numbers as A, 30 as B, and 30 as C. Store the data in a combined datastore containing two datastores. The first datastore has the data and the second datastore contains the labels.
dsData = arrayDatastore(randn(100,1));
dsLabels = arrayDatastore([repmat("A",40,1); ...
    repmat("B",30,1); repmat("C",30,1)]);
dsDataset = combine(dsData,dsLabels);
cnt = countlabels(dsDataset,UnderlyingDatastoreIndex=2)
cnt=3×3 table
Label Count Percent
_____ _____ _______
A 40 40
B 30 30
C 30 30
Split the data set into two sets, one containing 60% of the numbers and the other with the rest.
splitIndices = splitlabels(dsDataset,0.6,UnderlyingDatastoreIndex=2);
dsDataset1 = subset(dsDataset,splitIndices{1});
cnt1 = countlabels(dsDataset1,UnderlyingDatastoreIndex=2)
cnt1=3×3 table
Label Count Percent
_____ _____ _______
A 24 40
B 18 30
C 18 30
dsDataset2 = subset(dsDataset,splitIndices{2});
cnt2 = countlabels(dsDataset2,UnderlyingDatastoreIndex=2)
cnt2=3×3 table
Label Count Percent
_____ _____ _______
A 16 40
B 12 30
C 12 30
Input Arguments
lblsrc
— Input label source
categorical vector | string vector | logical vector | numeric vector | cell array | table | datastore | CombinedDatastore object
Input label source, specified as one of these:
A categorical vector.
A string vector or a cell array of character vectors.
A numeric vector or a cell array of numeric scalars.
A logical vector or a cell array of logical scalars.
A table with variables containing any of the previous data types.
A datastore whose readall function returns any of the previous data types.
A CombinedDatastore object containing an underlying datastore whose readall function returns any of the previous data types. In this case, you must specify the index of the underlying datastore that has the label values.
lblsrc must contain labels that can be converted to a vector with a discrete set of categories.
Example: lblsrc = categorical(["B" "C" "A" "E" "B" "A" "A" "B" "C" "A"],["A" "B" "C" "D"]) creates the label source as a ten-sample categorical vector with four categories: A, B, C, and D.
Example: lblsrc = [0 7 2 5 11 17 15 7 7 11] creates the label source as a ten-sample numeric vector.
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | logical | char | string | table | cell | categorical
p
— Proportions or numbers of labels
integer scalar | scalar in (0, 1) | vector of integers | vector of fractions
Proportions or numbers of labels, specified as an integer scalar, a scalar in the range (0, 1), a vector of integers, or a vector of fractions.
If p is a scalar, splitlabels finds two splitting index sets and returns a two-element cell array in idxs.
    If p is an integer, the first element of idxs contains a vector of indices pointing to the first p values of each label category. The second element of idxs contains indices pointing to the remaining values of each label category.
    If p is a value in the range (0, 1) and lblsrc has Ki elements in the ith category, the first element of idxs contains a vector of indices pointing to the first p × Ki values of each label category. The second element of idxs contains the indices of the remaining values of each label category.
If p is a vector with N elements of the form p1, p2, …, pN, splitlabels finds N + 1 splitting index sets and returns an (N + 1)-element cell array in idxs.
    If p is a vector of integers, the first element of idxs is a vector of indices pointing to the first p1 values of each label category, the next element of idxs contains the next p2 values of each label category, and so on. The last element in idxs contains the remaining indices of each label category.
    If p is a vector of fractions and lblsrc has Ki elements of the ith category, the first element of idxs is a vector of indices concatenating the first p1 × Ki values of each category, the next element of idxs contains the next p2 × Ki values of each label category, and so on. The last element in idxs contains the remaining indices of each label category.
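The cases above can be sketched on a small label vector. The labels below are hypothetical toy data, chosen so that every fraction yields a whole number of labels per category. This is an illustration of the described behavior, not output copied from the documentation.

```matlab
% Hypothetical toy labels: six "a" followed by four "b".
lbls = categorical([repmat("a",6,1); repmat("b",4,1)]);

% Integer scalar: the first 2 values of each category form the first set.
idxInt = splitlabels(lbls,2);
cntInt1 = countlabels(lbls(idxInt{1}))   % 2 of each category
cntInt2 = countlabels(lbls(idxInt{2}))   % the remaining 4 "a" and 2 "b"

% Fractional scalar: the first 50% of each category forms the first set.
idxFrac = splitlabels(lbls,0.5);
cntFrac1 = countlabels(lbls(idxFrac{1})) % 3 "a" and 2 "b"

% Two-element vector: three index sets, the last one holding the remainder.
idxVec = splitlabels(lbls,[2 1]);
numSets = numel(idxVec)                  % 3
```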
Note
If p contains fractions, then the sum of its elements must not be greater than one.
If p contains numbers of label values, then the sum of its elements must not be greater than the smallest number of labels available for any of the label categories.
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
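When p holds numbers of labels, the limit in the note can be checked before splitting. The sketch below is one possible guard, not part of splitlabels itself. The labels, the variable names, and the error message are hypothetical. It relies on countlabels returning a table with a Count variable, as in the examples on this page.

```matlab
% Hypothetical toy labels: five "x" and four "y".
lbls = categorical([repmat("x",5,1); repmat("y",4,1)]);
p = [2 1];                      % requested numbers of labels per category

% Find the size of the smallest category.
cnt = countlabels(lbls);
smallest = min(cnt.Count);

% Split only if the request fits into the smallest category.
if sum(p) <= smallest
    idxs = splitlabels(lbls,p);
else
    error('Requested %d labels per category, but the smallest category has %d.', ...
        sum(p),smallest);
end
```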
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: 'TableVariable',"AreaCode",'Exclude',["617" "508"] specifies that the function splits labels based on telephone area code and excludes numbers from Boston and Natick.
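The two name-value syntaxes can be compared directly. The following sketch uses a hypothetical one-variable table of area codes. Because the split is not randomized, both calls are expected to return the same index sets regardless of the order of the pairs.

```matlab
% Hypothetical table with a single variable of area codes.
phone = table(["617";"617";"508";"508";"212";"212";"212";"212"], ...
    'VariableNames',"AreaCode");

% Comma-separated syntax with quoted names.
idxA = splitlabels(phone,0.5,'TableVariable',"AreaCode",'Exclude',["617" "508"]);

% Name=Value syntax with the pairs in the opposite order.
idxB = splitlabels(phone,0.5,Exclude=["617" "508"],TableVariable="AreaCode");

% Compare the two results.
sameSplit = isequal(idxA,idxB)
```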
Include
— Labels to include in index sets
vector of label categories | cell array of label categories
Labels to include in the index sets, specified as a vector or cell array of label categories. The categories specified with this argument must be of the same type as the labels in lblsrc. Each category in the vector or cell array must match one of the label categories in lblsrc.
Exclude
— Labels to exclude from index sets
vector of label categories | cell array of label categories
Labels to exclude from the index sets, specified as a vector or cell array of label categories. The categories specified with this argument must be of the same type as the labels in lblsrc. Each category in the vector or cell array must match one of the label categories in lblsrc.
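As a brief illustration of Include and Exclude, the sketch below splits a hypothetical string vector of animal labels and checks that the filtered category is absent from every index set. The labels and variable names are made up for this example.

```matlab
% Hypothetical toy labels: four "cat", four "dog", two "bird".
lbls = ["cat";"dog";"bird";"cat";"dog";"bird";"cat";"dog";"cat";"dog"];

% Leave "bird" out of every index set.
idxEx = splitlabels(lbls,0.5,Exclude="bird");

% Keep only "cat" and "dog" in the index sets.
idxIn = splitlabels(lbls,0.5,Include=["cat" "dog"]);

% Check each index set for any remaining "bird" labels.
birdInEx = cellfun(@(k) any(lbls(k) == "bird"),idxEx)
birdInIn = cellfun(@(k) any(lbls(k) == "bird"),idxIn)
```

The category names are passed as strings because the labels are strings, in line with the requirement that the categories have the same type as the labels in lblsrc.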
TableVariable
— Table variable to read
first table variable (default) | character vector | string scalar
Table variable to read, specified as a character vector or string scalar. If this argument is not specified, then splitlabels uses the first table variable.
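The next sketch shows TableVariable selecting a label column that is not the first table variable. The table contents and variable names are hypothetical.

```matlab
% Hypothetical table: measurements first, labels second.
T = table([1.2;0.7;3.1;2.4;0.3;1.9],["lo";"lo";"lo";"hi";"hi";"hi"], ...
    'VariableNames',["Value" "Level"]);

% Split on the Level variable, taking 2 rows of each category first.
idxs = splitlabels(T,2,TableVariable="Level");

% Extract the rows of the first set.
firstSet = T(idxs{1},:)
numRows = height(firstSet)   % 2 "lo" rows and 2 "hi" rows
```

Without TableVariable, splitlabels would read the first variable, Value, and treat each distinct number as its own category.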
UnderlyingDatastoreIndex
— Underlying datastore index
integer scalar
Underlying datastore index, specified as an integer scalar. This argument applies when lblsrc is a CombinedDatastore object. splitlabels splits the labels in the datastore obtained using the UnderlyingDatastores property of lblsrc.
Output Arguments
idxs
— Splitting indices
cell array
Splitting indices, returned as a cell array.
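Each element of idxs can index the labels and any data stored alongside them. The following minimal sketch assumes a hypothetical feature matrix X with one row per label. The variable names are illustrative only.

```matlab
% Hypothetical features (one row per observation) and matching labels.
X = randn(10,3);
y = categorical([repmat("A",6,1); repmat("B",4,1)]);

% Half of each category for training, the rest for testing.
idxs = splitlabels(y,0.5);

% Apply the same indices to the features and to the labels.
XTrain = X(idxs{1},:);
yTrain = y(idxs{1});
XTest = X(idxs{2},:);
yTest = y(idxs{2});

% Every observation lands in exactly one of the two sets.
totalRows = size(XTrain,1) + size(XTest,1)
```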
Version History
Introduced in R2021a
See Also
countlabels (Signal Processing Toolbox) | filenames2labels (Signal Processing Toolbox) | folders2labels (Signal Processing Toolbox)