Machine Learning Datasets in R (10 datasets you can use right now)

https://machinelearningmastery.com/machine-learning-datasets-in-r/

 

Last Updated on August 15, 2020

You need standard datasets to practice machine learning.

In this short post you will discover how you can load standard classification and regression datasets in R.

This post will show you 3 R libraries that you can use to load standard datasets and 10 specific datasets that you can use for machine learning in R.

It is invaluable to load standard datasets in R so that you can test, practice and experiment with machine learning techniques and improve your skill with the platform.

Let’s get started.

Practice On Small Well-Understood Datasets

There are hundreds of standard test datasets that you can use to practice and get better at machine learning.

Most of them are hosted for free on the UCI Machine Learning Repository. These datasets are useful because they are well understood, they are well behaved and they are small.

This last point is critical when practicing machine learning because:

  • You can download them fast.
  • You can fit them into memory easily.
  • You can run algorithms on them quickly.

Access Standard Datasets in R

You can load the standard datasets into R as CSV files.
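As a minimal sketch of the CSV route, the snippet below writes R's built-in iris data to a temporary file and reads it back with read.csv(). The temporary file is only a stand-in for a CSV you have downloaded yourself, and the variable names csv_path and dataset are illustrative.

```r
# Stand-in for a downloaded CSV file: write the built-in iris data to a temp file.
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

# Load the CSV into a data frame. stringsAsFactors = TRUE converts the
# text column (the class label) into a factor.
dataset <- read.csv(csv_path, header = TRUE, stringsAsFactors = TRUE)

# Check the shape and the column types.
dim(dataset)
str(dataset)
```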

There is a more convenient approach: many of the standard datasets have been packaged in third-party R libraries that you can download from the Comprehensive R Archive Network (CRAN).

Which libraries should you use, and which datasets are good to start with?

 

How To Load Standard Datasets in R

In this section you will discover the libraries that you can use to get access to standard machine learning datasets.

You will also discover specific classification and regression datasets that you can load and use to practice machine learning in R.

Library: datasets

Figure: Iris Flowers Dataset. Photo by Rick Ligthelm, some rights reserved.

The datasets library comes with base R and is attached by default, which means you do not need to load it explicitly. It includes a large number of datasets that you can use.

You can load a dataset from this library by passing its name to the data() function.

For example, data(iris) loads the very commonly used iris dataset.
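A short sketch of loading iris this way, followed by a few base R calls to confirm what arrived in the workspace:

```r
# Load the iris data frame from the datasets library into the workspace.
data(iris)

# Confirm the object type, its size and its column names.
class(iris)
dim(iris)
names(iris)
```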

To see a list of the datasets available in this library, you can type library(help = "datasets").

Some highlight datasets from this package that you could use are below.
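If you would rather search the list than read it, one option is data(package = "datasets"), which returns the catalogue as an object. The sketch below assumes nothing beyond base R; the variable names catalogue and dataset_names are illustrative.

```r
# Fetch the catalogue as a character matrix of dataset names and titles.
catalogue <- data(package = "datasets")$results
dataset_names <- catalogue[, "Item"]

# How many entries are listed, and the first few names.
length(dataset_names)
head(dataset_names)

# Check whether a particular dataset is available.
"iris" %in% dataset_names
```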

Iris Flowers Dataset

  • Description: Predict iris flower species from flower measurements.
  • Type: Multi-class classification
  • Dimensions: 150 instances, 5 attributes
  • Inputs: Numeric
  • Output: Categorical, 3 class labels
  • UCI Machine Learning Repository: Description
  • Published accuracy results: Summary
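A minimal sketch for loading the dataset, previewing it and checking how the instances are spread across the three species:

```r
# Load the dataset and preview the first six rows.
data(iris)
head(iris)

# Count the instances in each species (the output variable).
table(iris$Species)
```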

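Previewing the data with head(iris) shows four numeric measurement columns (Sepal.Length, Sepal.Width, Petal.Length and Petal.Width) followed by the Species label.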

Longley’s Economic Regression Data

  • Description: Predict number of people employed from economic variables
  • Type: Regression
  • Dimensions: 16 instances, 7 attributes
  • Inputs: Numeric
  • Output: Numeric
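A minimal sketch for loading the data and taking a first look. The correlation matrix is included because the inputs in this dataset are strongly related to one another, which is worth knowing before fitting a linear model.

```r
# Load the dataset and preview the first six rows.
data(longley)
head(longley)

# Employed is the output variable; the other six columns are inputs.
summary(longley$Employed)

# Pairwise correlations between all seven columns, rounded for readability.
round(cor(longley), 2)
```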

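Previewing the data with head(longley) shows seven numeric columns (GNP.deflator, GNP, Unemployed, Armed.Forces, Population, Year and Employed), with each row labelled by its year.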

Library: mlbench

Figure: Soybean Dataset. Photo by United Soybean Board, some rights reserved.

Direct from the manual for the library:

A collection of artificial and real-world machine learning benchmark problems, including, e.g., several data sets from the UCI repository.

You can learn more about the mlbench library on the mlbench CRAN page.

If not installed, you can install this library with install.packages("mlbench").

You can load the library with library(mlbench).

To see a list of the datasets available in this library, you can type library(help = "mlbench").

Some highlight datasets from this library that you could use are:
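The three steps can be sketched together as below. The requireNamespace() guard is an addition of this sketch so that the install only runs when the library is missing; installing assumes a configured CRAN mirror and network access, and mlbench_datasets is an illustrative variable name.

```r
# Install mlbench only if it is not already available.
if (!requireNamespace("mlbench", quietly = TRUE)) {
  install.packages("mlbench")
}

# Attach the library.
library(mlbench)

# Collect and print the names of the datasets it provides.
mlbench_datasets <- data(package = "mlbench")$results[, "Item"]
print(mlbench_datasets)
```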

Boston Housing Data

  • Description: Predict the house price in Boston from house details
  • Type: Regression
  • Dimensions: 506 instances, 14 attributes
  • Inputs: Numeric
  • Output: Numeric
  • UCI Machine Learning Repository: Description
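A minimal sketch for loading the data, assuming mlbench is installed:

```r
library(mlbench)

# Load the dataset and preview the first six rows.
data(BostonHousing)
head(BostonHousing)

# medv (median home value, in thousands of dollars) is the output variable.
summary(BostonHousing$medv)
```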

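Previewing the data with head(BostonHousing) shows 13 input columns (crim through lstat) followed by medv, the median house value to predict.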

Wisconsin Breast Cancer Database

  • Description: Predict whether a cancer is malignant or benign from biopsy details.
  • Type: Binary Classification
  • Dimensions: 699 instances, 11 attributes
  • Inputs: Integer (Nominal)
  • Output: Categorical, 2 class labels
  • UCI Machine Learning Repository: Description
  • Published accuracy results: Summary
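A minimal sketch for loading the data, assuming mlbench is installed. It also counts missing values per column, since this dataset is known to contain some, and note that the Id column is a sample identifier rather than a predictor.

```r
library(mlbench)

# Load the dataset and preview the first six rows.
data(BreastCancer)
head(BreastCancer)

# Class distribution of the output variable.
table(BreastCancer$Class)

# Number of missing values in each column.
colSums(is.na(BreastCancer))
```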

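Previewing the data with head(BreastCancer) shows an Id column, nine biopsy measurements (Cl.thickness through Mitoses) and the Class label (benign or malignant).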

Glass Identification Database

  • Description: Predict the glass type from chemical properties.
  • Type: Classification
  • Dimensions: 214 instances, 10 attributes
  • Inputs: Numeric
  • Output: Categorical, 7 class labels (6 of which appear in the data)
  • UCI Machine Learning Repository: Description
  • Published accuracy results: Summary
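A minimal sketch for loading the data and tabulating the glass types, assuming mlbench is installed:

```r
library(mlbench)

# Load the dataset and preview the first six rows.
data(Glass)
head(Glass)

# Type is the output variable; count the samples of each glass type.
table(Glass$Type)
```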

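Previewing the data with head(Glass) shows the refractive index (RI), eight oxide content columns (Na through Fe) and the Type label.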

Johns Hopkins University Ionosphere database

  • Description: Predict high-energy structures in the atmosphere from antenna data.
  • Type: Classification
  • Dimensions: 351 instances, 35 attributes
  • Inputs: Numeric
  • Output: Categorical, 2 class labels
  • UCI Machine Learning Repository: Description
  • Published accuracy results: Summary
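A minimal sketch for loading the data, assuming mlbench is installed. Because the data frame is wide, the preview is limited to the first five inputs and the label; that column selection is just a convenience of this sketch.

```r
library(mlbench)

# Load the dataset and confirm its size.
data(Ionosphere)
dim(Ionosphere)

# Preview the first five input columns together with the Class label.
head(Ionosphere[, c(1:5, 35)])

# Class distribution of the output variable.
table(Ionosphere$Class)
```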

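Previewing the data with head(Ionosphere) shows 34 input columns (V1 through V34) followed by the Class label (bad or good).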

Pima Indians Diabetes Database

  • Description: Predict the onset of diabetes in female Pima Indians from medical record data.
  • Type: Binary Classification
  • Dimensions: 768 instances, 9 attributes
  • Inputs: Numeric
  • Output: Categorical, 2 class labels
  • Dataset Details: Description
  • Published accuracy results: Summary
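A minimal sketch for loading the data, assuming mlbench is installed. The last step counts zeros in a handful of columns, because zero is physically implausible for measurements such as glucose or body mass index and is commonly treated as a missing value in this dataset; the choice of columns is an assumption of this sketch.

```r
library(mlbench)

# Load the dataset and preview the first six rows.
data(PimaIndiansDiabetes)
head(PimaIndiansDiabetes)

# diabetes is the output variable (neg or pos).
table(PimaIndiansDiabetes$diabetes)

# Count zero values in columns where zero is not a plausible measurement.
suspect_cols <- c("glucose", "pressure", "triceps", "insulin", "mass")
colSums(PimaIndiansDiabetes[, suspect_cols] == 0)
```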

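Previewing the data with head(PimaIndiansDiabetes) shows eight input columns (pregnant, glucose, pressure, triceps, insulin, mass, pedigree and age) followed by the diabetes label (neg or pos).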

Sonar, Mines vs. Rocks

  • Description: Predict metal or rock returns from sonar return data.
  • Type: Binary Classification
  • Dimensions: 208 instances, 61 attributes
  • Inputs: Numeric
  • Output: Categorical, 2 class labels
  • UCI Machine Learning Repository: Description
  • Published accuracy results: Summary
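A minimal sketch for loading the data, assuming mlbench is installed. As with other wide datasets, the preview shows only the first five inputs and the label.

```r
library(mlbench)

# Load the dataset and confirm its size.
data(Sonar)
dim(Sonar)

# Preview the first five input columns together with the Class label.
head(Sonar[, c(1:5, 61)])

# Class distribution: M (mine, a metal cylinder) versus R (rock).
table(Sonar$Class)

# Overall range of the 60 numeric inputs.
range(as.matrix(Sonar[, 1:60]))
```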

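Previewing the data with head(Sonar) shows 60 numeric input columns (V1 through V60) followed by the Class label (M for mine, R for rock).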

Soybean Database

  • Description: Predict problems with soybean crops from crop data.
  • Type: Multi-Class Classification
  • Dimensions: 683 instances, 36 attributes
  • Inputs: Integer (Nominal)
  • Output: Categorical, 19 class labels
  • UCI Machine Learning Repository: Description
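A minimal sketch for loading the data, assuming mlbench is installed. It checks the size of the data frame, the number of class labels and how many rows have at least one missing value, since this dataset contains incomplete records.

```r
library(mlbench)

# Load the dataset and confirm its size.
data(Soybean)
dim(Soybean)

# Preview the Class label and the first few crop attributes.
head(Soybean[, 1:6])

# Number of distinct class labels in the output variable.
nlevels(Soybean$Class)

# Number of rows with at least one missing value.
sum(!complete.cases(Soybean))
```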

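Previewing the data with head(Soybean) shows the Class label in the first column followed by the nominal crop attributes.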

Library: AppliedPredictiveModeling

Figure: Abalone Dataset. Photo by MAURO CATEB, some rights reserved.

Many books that use R also include their own R library that provides all of the code and datasets used in the book.

The excellent book Applied Predictive Modeling has its own library called AppliedPredictiveModeling.

If not installed, you can install this library with install.packages("AppliedPredictiveModeling").

You can load the library with library(AppliedPredictiveModeling).

To see a list of the datasets available in this library, you can type library(help = "AppliedPredictiveModeling").

One highlight dataset from this library that you could use is:
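The install, load and list steps can be sketched together as below. The requireNamespace() guard is an addition of this sketch; installing assumes a configured CRAN mirror and network access, and apm_datasets is an illustrative variable name.

```r
# Install the library only if it is not already available.
if (!requireNamespace("AppliedPredictiveModeling", quietly = TRUE)) {
  install.packages("AppliedPredictiveModeling")
}

# Attach the library.
library(AppliedPredictiveModeling)

# Collect and print the names of the datasets it provides.
apm_datasets <- data(package = "AppliedPredictiveModeling")$results[, "Item"]
print(apm_datasets)
```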

Abalone Data

  • Description: Predict abalone age from abalone measurement data.
  • Type: Regression or Classification
  • Dimensions: 4177 instances, 9 attributes
  • Inputs: Numerical and categorical
  • Output: Integer
  • UCI Machine Learning Repository: Description
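A minimal sketch for loading the data, assuming AppliedPredictiveModeling is installed:

```r
library(AppliedPredictiveModeling)

# Load the dataset and preview the first six rows.
data(abalone)
head(abalone)

# Rings is the output variable; age is estimated from the ring count.
summary(abalone$Rings)

# Type is the one categorical input.
table(abalone$Type)
```

Treating Rings as a number gives a regression problem, while treating each ring count as a label gives a classification problem.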

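Previewing the data with head(abalone) shows the Type column, seven numeric size and weight measurements (LongestShell through ShellWeight) and the Rings count to predict.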

Summary

In this post you discovered that you do not need to collect or load your own data in order to practice machine learning in R.

You learned about 3 different libraries that provide sample machine learning datasets that you can use:

  • datasets library
  • mlbench library
  • AppliedPredictiveModeling library

You also discovered 10 specific standard machine learning datasets that you can use to practice classification and regression machine learning techniques.

  • Iris Flowers Dataset (multi-class classification)
  • Longley’s Economic Regression Data (regression)
  • Boston Housing Data (regression)
  • Wisconsin Breast Cancer Database (binary classification)
  • Glass Identification Database (multi-class classification)
  • Johns Hopkins University Ionosphere database (binary classification)
  • Pima Indians Diabetes Database (binary classification)
  • Sonar, Mines vs. Rocks (binary classification)
  • Soybean Database (multi-class classification)
  • Abalone Data (regression or classification)

Next Step

Did you try out these recipes?

  1. Start your R interactive environment.
  2. Type or copy-and-paste the recipes above and try them out.
  3. Use the built-in help in R to learn more about the functions used.

Do you have a question? Ask it in the comments and I will do my best to answer it.