Work with Deep Learning Data in AWS

This example shows how to upload data to an Amazon S3 bucket.

Before you can train deep learning networks in the cloud, you must upload your data there. This example shows how to download the CIFAR-10 data set to your computer and then upload it to an Amazon S3 bucket for later use in MATLAB. CIFAR-10 is a labeled image data set commonly used for benchmarking image classification algorithms. To run this example, you need access to an Amazon Web Services (AWS) account. After you upload the data set to Amazon S3, you can try any of the examples in Parallel and Cloud.

Download CIFAR-10 to Local Machine

Specify a local directory in which to download the data set. The following code creates a folder in your current directory containing all the images in the data set.

directory = pwd;
[trainDirectory,testDirectory] = downloadCIFARToFolders(directory);

Downloading CIFAR-10 data set...done.
Copying CIFAR-10 to folders...done.
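
Before uploading, it can help to confirm what the download produced. The following shell function is a minimal sketch (the name count_files_per_folder is hypothetical): it prints the number of files in each immediate subfolder of a directory. It assumes the images are organized into one subfolder per class, which is the layout that the 'LabelSource','foldernames' option used later in this example relies on.

```shell
# Hypothetical helper: print "<count> <subfolder>" for each immediate
# subfolder of the given directory, counting regular files recursively.
count_files_per_folder() {
    for dir in "$1"/*/; do
        # Skip the unexpanded pattern when the directory has no subfolders.
        [ -d "$dir" ] || continue
        count=$(find "$dir" -type f | wc -l | tr -d ' ')
        echo "$count $(basename "$dir")"
    done
}
```

To use it, pass the training folder, that is, the path stored in trainDirectory.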

Upload Local Data Set to Amazon S3 Bucket

To work with data in the cloud, you can upload to Amazon S3 and then use datastores to access the data in S3 from the workers in your cluster. The following steps describe how to upload the CIFAR-10 data set from your local machine to an Amazon S3 bucket.

1. For efficient file transfers to and from Amazon S3, download and install the AWS Command Line Interface tool from https://aws.amazon.com/cli/.

2. Specify your AWS Access Key ID, your Secret Access Key, and the Region of the bucket as system environment variables. Contact your AWS account administrator to obtain your keys.

For example, on Linux, macOS, or Unix, specify these variables:

export AWS_ACCESS_KEY_ID="YOUR_AWS_ACCESS_KEY_ID"
export AWS_SECRET_ACCESS_KEY="YOUR_AWS_SECRET_ACCESS_KEY" 
export AWS_DEFAULT_REGION="us-east-1" 

On Windows, specify these variables at the command prompt. Omit the quotation marks, because Command Prompt stores them as part of the value:

set AWS_ACCESS_KEY_ID=YOUR_AWS_ACCESS_KEY_ID
set AWS_SECRET_ACCESS_KEY=YOUR_AWS_SECRET_ACCESS_KEY
set AWS_DEFAULT_REGION=us-east-1

These commands set the variables only for the current shell session. To specify the environment variables permanently, set them in your user or system environment.

3. Create a bucket for your data by using either the AWS S3 web page or a command such as the following:

aws s3 mb s3://mynewbucket

4. Upload your data using a command such as the following:

aws s3 cp mylocaldatapath s3://mynewbucket --recursive

For example:

aws s3 cp path/to/CIFAR10/in/the/local/machine s3://MyExampleCloudData/cifar10/ --recursive

5. Copy your AWS credentials to your cluster workers by completing these steps in MATLAB:

a. In the Environment section on the Home tab, select Parallel > Create and Manage Clusters.

b. In the Cluster Profile pane of the Cluster Profile Manager, select your cloud cluster profile.

c. In the Properties tab, select the EnvironmentVariables property, scrolling as necessary to find the property.

d. At the bottom right of the window, click Edit.

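e. Click in the box to the right of EnvironmentVariables, and then type the names of these three variables, each on its own line: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION. Type only the names; MATLAB copies their values from your client session to the workers.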

f. At the bottom right of the window, click Done.

For information on how to create a cloud cluster, see Create Cloud Cluster (Parallel Computing Toolbox).
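
The command-line steps above (steps 2 through 4) can be sketched as a few shell helper functions. This is a minimal sketch, not part of the AWS CLI or MATLAB: the function names check_aws_env, ensure_bucket, preview_upload, and summarize_upload are hypothetical, while the aws subcommands and flags they call (s3api head-bucket, s3 mb, s3 cp --recursive --dryrun, and s3 ls --recursive --summarize) are standard AWS CLI features. The sketch assumes a POSIX shell on Linux, macOS, or Unix.

```shell
# Hypothetical helpers; only the aws subcommands and flags are AWS CLI features.

# Step 2: report any required variable that is unset or empty.
# Prints variable names only, never their values.
check_aws_env() {
    missing=0
    for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_DEFAULT_REGION; do
        eval "value=\${$name:-}"   # indirect lookup of the variable named by $name
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Step 3: create the bucket only if head-bucket cannot reach it.
ensure_bucket() {
    if aws s3api head-bucket --bucket "$1" >/dev/null 2>&1; then
        echo "bucket $1 already exists and is accessible"
    else
        aws s3 mb "s3://$1"
    fi
}

# Step 4: --dryrun lists the planned copy operations without transferring data.
preview_upload() {
    aws s3 cp "$1" "$2" --recursive --dryrun
}

# After the real upload: list objects under the prefix with total count and size.
summarize_upload() {
    aws s3 ls "$1" --recursive --summarize
}
```

For example, check_aws_env && ensure_bucket mynewbucket runs the bucket step only when all three variables are set, and preview_upload mylocaldatapath s3://mynewbucket shows what step 4 would copy without transferring anything.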

Use Data Set in MATLAB

After you store your data in Amazon S3, you can use datastores to access the data from your cluster workers. Create a datastore that points to the URL of the data in the S3 bucket. The following sample code shows how to use an imageDatastore to access an S3 bucket; the 'LabelSource','foldernames' option labels each image with the name of the folder that contains it. Replace 's3://MyExampleCloudData/cifar10/train' with the URL of your S3 bucket.

imds = imageDatastore('s3://MyExampleCloudData/cifar10/train', ...
 'IncludeSubfolders',true, ...
 'LabelSource','foldernames');

With the CIFAR-10 data set now stored in Amazon S3, you can try any of the examples in Parallel and Cloud that show how to use CIFAR-10 in different use cases.
