When Can You Trust Your Computer to Learn On Its Own?

Mariann Beagrie
4 min read · Mar 9, 2022
A robot sitting on a bench looking at a tablet.
Photo by Andrea De Santis on Unsplash

Computers can help you learn things about data that you would never be able to figure out by yourself. All data mining algorithms aim to find useful patterns or trends in data that would be difficult or impossible to find using simpler methods. However, some algorithms need a bit more “supervision” than others. Keep reading to learn more about the differences between Supervised and Unsupervised Learning.

Supervised Learning

With Supervised Learning techniques, you “supervise” the learning process by providing the algorithms with data that contains a target variable. The aim is for the algorithm to learn how to predict the value for this target variable. This value or “label” is the “supervision”.

First, the algorithm divides the data you give it into training and validation data. Then it uses the training data to train a model that can predict the value of the target variable. Once the algorithm has finished “training” its model, it gives the model a practice test on the validation data. The model predicts the target values in the validation set. Then the algorithm checks the answers to see how well the model did.
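To make these steps concrete, here is a minimal Python sketch of the split, train and validate loop. The data (hours studied and whether the student passed), the function names and the very simple "threshold" model are all invented for illustration. A real project would normally rely on a library and a more capable algorithm.

```python
import random


def split_data(rows, validation_fraction=0.25, seed=0):
    """Shuffle the labeled rows, then split them into training and validation sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


def train_threshold_model(training_rows):
    """'Train' a tiny model: the midpoint between the two classes' average values."""
    passed = [x for x, label in training_rows if label == 1]
    failed = [x for x, label in training_rows if label == 0]
    return (sum(passed) / len(passed) + sum(failed) / len(failed)) / 2


def predict(threshold, x):
    """Predict the target: 1 (pass) at or above the threshold, otherwise 0 (fail)."""
    return 1 if x >= threshold else 0


# Hypothetical labeled data: (hours studied, passed the exam? 1 = yes, 0 = no)
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]

train, validation = split_data(data)
threshold = train_threshold_model(train)

# The "practice test": predict the validation targets and check the answers.
correct = sum(predict(threshold, x) == label for x, label in validation)
accuracy = correct / len(validation)
print(f"threshold={threshold:.2f}, validation accuracy={accuracy:.2f}")
```

The label in each row is the "supervision": without it, there would be no answers to check the practice test against.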

The score is provided in the form of evaluation measures. These are used to judge how well a particular model did at predicting the target. If it performed well, it is usually given a final “test” on data kept aside for this purpose. If it did poorly, you would probably want to adjust some things and try again. You might decide to use different variables, try new pre-processing methods, change the parameters or choose a new algorithm altogether. Once a model passes the final test, it should do a good job predicting the target variable with any data similar to the data it was trained with.
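As a rough illustration of what such a score can look like, the sketch below computes three commonly used evaluation measures for a yes/no target: accuracy, precision and recall. These particular measures are examples chosen here, not the only options, and the two lists of answers are invented.

```python
def evaluation_measures(actual, predicted):
    """Compare predicted labels with the true labels for a yes/no (1/0) target."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(a == 1 and p == 1 for a, p in pairs)
    false_pos = sum(a == 0 and p == 1 for a, p in pairs)
    false_neg = sum(a == 1 and p == 0 for a, p in pairs)
    true_neg = sum(a == 0 and p == 0 for a, p in pairs)

    return {
        # Share of all answers the model got right.
        "accuracy": (true_pos + true_neg) / len(pairs),
        # Of the rows predicted "yes", how many really were "yes"?
        "precision": true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0,
        # Of the rows that really were "yes", how many did the model find?
        "recall": true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0,
    }


# Hypothetical validation results: the true labels and one model's predictions.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

scores = evaluation_measures(actual, predicted)
print(scores)
```

Running the same function on the predictions of two different models gives you numbers you can compare side by side.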

Advantages:

  • It can indicate how variables are correlated with the target variable.
  • It can be used to make predictions.
  • It provides objective measures that can be used to compare the goodness of different models.

Disadvantages:

  • Often requires lots of labeled data. It is usually time-consuming and/or expensive to obtain enough labeled data to get meaningful results. This isn’t an issue if high-quality data is already available that applies to your investigation. However, if new information needs to be collected, it might not be feasible to use Supervised learning.

Types of Supervised Learning

  • Classification: These algorithms are used to predict which category a target variable falls into. For example, will this person make a successful employee (yes or no)?
  • Regression: These algorithms are used to predict numerical values. For example, how much money is this person likely to make at age 30?
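Here is a small sketch of the regression case, where the target is a number rather than a category. It fits a straight line using ordinary least squares, which is just one of many regression methods. The data (years of experience and salary in thousands) and all names are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single input: returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: the target (salary) is a number, not a category.
years_experience = [1, 2, 3, 4]
salary_thousands = [30, 35, 40, 45]

slope, intercept = fit_line(years_experience, salary_thousands)

# Predict the target for a value the model has not seen.
predicted_salary = slope * 6 + intercept
print(f"salary = {slope:.1f} * years + {intercept:.1f}; at 6 years: {predicted_salary:.1f}")
```

A classification model would answer "yes" or "no" for the same person. The regression model answers with an amount.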

Unsupervised Learning

Unsupervised learning is done without the support of labeled data. It’s basically like giving an algorithm a giant pile of data and saying, “See what you can make of this.” The algorithm performs a bunch of calculations on the data and reports back, “These are the connections I found.” It’s up to the person looking at the results to determine how meaningful they are.

Some evaluation measures are usually provided along with the patterns found; however, patterns with seemingly good evaluation measures might still be meaningless or not that informative. For example, discovering that diapers and baby formula are usually bought by the same people doesn’t provide any new information. Diapers and beer often being bought together is a more interesting pattern that might invite further investigation, or maybe not…

Unsupervised methods can give different results each time they are run, because there is often more than one valid way to group the same data. Imagine you had a bunch of cats, dogs, wolves and lions. You could legitimately sort them into “felines” and “canines” or “pets” and “wild animals”. Both are valid ways to sort the animals. However, one method would be better for deciding which to put in a pet shop, and another would be better for looking at similarities amongst species.
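The animal example can be tried out with a tiny version of k-means, a widely used clustering algorithm. In this sketch each animal is described by two invented yes/no features (is it a feline, is it wild), and the starting points are passed in by hand so the result is reproducible. In practice the starting points are often chosen at random, which is one reason two runs can disagree.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its group, and repeat."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a group ends up empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels


# Hypothetical features for each animal: (is_feline, is_wild)
animals = {"cat": (1, 0), "dog": (0, 0), "wolf": (0, 1), "lion": (1, 1)}
points = list(animals.values())


def group_names(labels):
    """Turn cluster labels back into sets of animal names."""
    return [
        {name for name, label in zip(animals, labels) if label == c}
        for c in sorted(set(labels))
    ]


# Same data, same algorithm, two different starting points.
run_a = group_names(kmeans(points, [animals["cat"], animals["dog"]]))
run_b = group_names(kmeans(points, [animals["cat"], animals["lion"]]))

print(run_a)  # felines vs. canines
print(run_b)  # pets vs. wild animals
```

Neither run is wrong. The algorithm has no label telling it which grouping you care about, so that judgment is left to you.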

Advantages:

  • Does not require data to be labeled, so it will probably be easier to obtain the data you need for the task you are trying to do.
  • Can find interesting and useful patterns that would not be found otherwise.

Disadvantages:

  • Can find lots of meaningless or not very interesting patterns, so it often requires a domain expert to interpret the results and determine how useful they are.

Types of Unsupervised Learning

  • Clustering: Looks for similarities in the data and tries to form groups based on these. For example, these customers are very similar, and many have bought this item. Maybe we should recommend that same item to other customers like them.
  • Association Rules Mining: Looks for items that typically occur together. For example, when chips and beer are bought, hotdogs are also bought 60% of the time.
  • Dimensionality Reduction: Looks for the best way to combine variables so that there are fewer of them, but they still retain as much of the vital information as possible.
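To show where a figure like "60% of the time" comes from, here is a short sketch that computes two standard association rule measures, support and confidence, for the chips, beer and hotdogs rule. The ten shopping baskets are invented so that the numbers work out neatly.

```python
def count_containing(transactions, items):
    """Number of baskets that contain every item in `items`."""
    return sum(items <= basket for basket in transactions)


def support(transactions, items):
    """Fraction of all baskets that contain every item in `items`."""
    return count_containing(transactions, items) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    both = count_containing(transactions, antecedent | consequent)
    return both / count_containing(transactions, antecedent)


# Hypothetical shopping baskets.
transactions = [
    {"chips", "beer", "hotdogs"},
    {"chips", "beer", "hotdogs"},
    {"chips", "beer", "hotdogs"},
    {"chips", "beer"},
    {"chips", "beer", "salsa"},
    {"bread", "milk"},
    {"beer", "hotdogs"},
    {"chips", "milk"},
    {"bread", "hotdogs"},
    {"milk", "beer"},
]

rule_support = support(transactions, {"chips", "beer", "hotdogs"})
rule_confidence = confidence(transactions, {"chips", "beer"}, {"hotdogs"})

print(f"support={rule_support:.1f}, confidence={rule_confidence:.1f}")
```

With this data, five baskets contain chips and beer, and three of those also contain hotdogs, so the confidence is 3/5, or 60%. Whether that pattern is worth acting on is still a question for someone who knows the business.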

In addition to Supervised and Unsupervised Learning, there is also Semi-Supervised Learning, a combination of the two. It is a good solution for when you only have a little labeled data, but lots of unlabeled data, and would really like to use Supervised Learning.

Click here or here if you are interested in reading more about the connection between diapers and beer.

Mariann Beagrie

I have taught in the US, Germany, South Korea and China. I recently completed a degree in Computer Science. I love traveling, reading and learning.