A Quick Guide to Supervised Learning
Defining supervised learning - a beginner's walkthrough
What is Supervised Learning?
Machine learning (ML) is a subset of artificial intelligence (AI) in which algorithms and statistical models learn to accomplish tasks from data rather than from explicit step-by-step programming. Data scientists can train an ML algorithm through two main approaches: supervised learning and unsupervised learning.
A learning model is a mathematical representation of a real-world process. If the learning model is supervised, the algorithm's inputs are labeled: it receives a paired dataset containing each sample and the correct label for that sample. These inputs are also called observations, examples, or instances. If the dataset is unlabeled, the learning is unsupervised, and the algorithm must categorize, compute, and deliver outputs on its own, with no predefined answers to learn from.
For example, in a supervised learning model where the goal is to categorize animals as cats or dogs, the samples are labeled pictures of cats and dogs. If the training data is of sufficient quality and quantity, the ML algorithm can learn from the labeled examples and categorize new pictures of cats and dogs quickly and accurately.
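The idea of learning from labeled pairs can be sketched in a few lines. This is a minimal, hypothetical illustration, not a production classifier: the two numeric features (whisker length, ear pointiness) and their values are invented, and the "model" simply copies the label of the nearest training sample.

```python
# Each training input is a pair: (feature vector, human-supplied label).
# The features here are hypothetical stand-ins for measurable traits.
labeled_data = [
    ((4.0, 0.9), "cat"),   # (whisker length in cm, ear pointiness score)
    ((3.5, 0.8), "cat"),
    ((1.0, 0.3), "dog"),
    ((1.5, 0.2), "dog"),
]

def classify(sample):
    """Predict a label by copying the label of the nearest training sample."""
    def distance(features):
        return sum((a - b) ** 2 for a, b in zip(features, sample))
    _, label = min(labeled_data, key=lambda pair: distance(pair[0]))
    return label

print(classify((3.8, 0.85)))  # a new, unlabeled input -> "cat"
```

Even this toy version shows the supervised pattern: the labels come from a human, and the algorithm uses them to decide the answer for unseen inputs.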
If the dataset is unlabeled, the learning is unsupervised. In this case, the ML algorithm generates observations on its own. It may sort by length of hair, color, shape of ears, and a number of other characteristics. Eventually, by overlapping these outputs and learning from them, the unsupervised learning model will have taught itself the difference between a cat and a dog and can categorize them, if not label them, appropriately.
To determine the type of algorithm that best fits a supervised learning model, a proposed task must be defined as either a classification or regression task.
A classification model groups inputs into categories, or discrete values. Discrete data is either “this” or “that.” It is not somewhere in between. These data categories are further classified as binary or multi-class.
When a prediction involves just two possible outputs, or classes, the classification model is binary. One class is assigned the value 0 and the other the value 1. The closer the model's output is to 0 or 1, the more confident the prediction.
Spam filters are an example of binary classification. A spam filter is able to scan an email and predict the likelihood that it is spam through certain indicators. A numerical value is given to the email, which is then classified as spam or not spam.
Consider the phrase “weight loss.” It may trigger a spam label. The beauty of machine learning is its ability to estimate the likelihood that unseen phrases are spam based on what the supervised algorithm has already learned. For example, perhaps the phrase “lose weight fast” was part of the training data. Given the weights and parameters the algorithm has learned, the new “weight loss” phrase is accurately identified as spam.
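A stripped-down sketch of this idea: suppose a trained filter has learned a weight for each word it saw during training (the words and weights below are invented for illustration). Summing the weights of the words in a new email and squashing the total through a sigmoid yields a spam score between 0 and 1, so an unseen phrase like “weight loss” still scores above 0.5 because “weight” carried spam weight in training.

```python
import math

# Hypothetical word weights a spam filter might have learned from training
# emails: words common in spam get positive weight, others negative.
weights = {"lose": 1.2, "weight": 1.5, "fast": 0.8, "meeting": -1.0}
bias = -1.0

def spam_probability(text):
    """Sum the learned word weights, then map the total to a 0-1 score."""
    score = bias + sum(weights.get(word, 0.0) for word in text.lower().split())
    return 1 / (1 + math.exp(-score))

print(round(spam_probability("lose weight fast"), 2))    # near 1 -> spam
print(round(spam_probability("weight loss"), 2))         # unseen phrase, still > 0.5
print(round(spam_probability("team meeting today"), 2))  # near 0 -> not spam
```

Real filters learn thousands of such weights automatically; the mechanism of scoring new text with learned weights is the same.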
Classifying tumors as benign or malignant is another example of binary classification. However, if there are several different types of possible cancers in a particular tissue, classifying that case is more involved. The same supervised learning process applies, but the values now go beyond just 0 and 1: each possible output has its own numerical value.
For example, in an attempt to diagnose breast cancer, the three main types might be labeled 1, 2, and 3. The trained algorithm receives a new input and classifies the type of breast cancer. If the output is closest to 2.8, the cancer is likely the third type; if the output is 1.5, it could be either type 1 or type 2.
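In practice, most multi-class models output one score (often a probability) per class rather than a single number, and the prediction is simply the highest-scoring class. A minimal sketch, with invented scores:

```python
# Hypothetical per-class scores from a multi-class model. The prediction
# is the class with the highest score.
scores = {"type 1": 0.10, "type 2": 0.25, "type 3": 0.65}

predicted_type = max(scores, key=scores.get)
print(predicted_type)  # "type 3"
```

A score of 0.65 versus 0.25 also conveys confidence: when two classes score closely, as with the ambiguous 1.5 case above, the model is effectively saying it cannot tell them apart.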
A regression task involves an output variable that is not a “this” or “that.” The output is continuous, meaning it is not confined to a fixed set of labels. Regression tasks produce numerical outputs, such as a person's weight or the number of points by which the stock market will rise or fall. Grocery stores, for instance, use regression models to predict how much food to purchase based on past sales data; the prediction is flexible, changing with the inputs the model receives.
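The grocery example can be sketched with the simplest regression model there is, a least-squares line fit to past sales (the sales figures below are invented). The output is a continuous number, not a category:

```python
# Fit a least-squares line y = intercept + slope * x to hypothetical
# weekly sales, then forecast the next week's demand.
weeks = [1, 2, 3, 4, 5]
sales = [120.0, 135.0, 128.0, 150.0, 160.0]

n = len(weeks)
mean_x = sum(weeks) / n
mean_y = sum(sales) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(weeks, sales))
         / sum((x - mean_x) ** 2 for x in weeks))
intercept = mean_y - slope * mean_x

def predict(week):
    """Forecast sales for a given week: a continuous value, not a class."""
    return intercept + slope * week

print(round(predict(6), 1))  # forecast for week 6
```

Real demand forecasting uses many more features (seasonality, promotions, weather), but the supervised structure is identical: past labeled observations in, a numeric prediction out.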
There are several different algorithms that are used for classification models and regression models, but regardless of the algorithm that best fits the scenario, it will still go through a similar training process using supervised datasets.
There are three types of datasets used in creating a supervised learning algorithm: training, validation, and test sets. Since the goal of a supervised model is to deliver correct outputs from new data, it is important to ensure that the input data is robust, cleansed, and free from bias. For example, if only long-tailed cats are used in a dataset to train a model, it is possible that the algorithm will incorrectly determine that a Manx cat is a dog.
The first step to creating a functional machine learning model is to input labeled data that the algorithm can learn from. Training data must be predefined in a supervised learning model. The right answers are labeled, and the features of each sample are broken down into a data table, or matrix, which is a collection of data organized into rows and columns. Different weights can be given to different features so that the most important ones contribute more to the algorithm’s learning and output, while features with less relevance drop off or are given less weight.
Each row of this matrix, the set of feature values describing one sample, forms a feature vector. For example, if an ML algorithm is classifying fruit, features may include color, shape, size, and texture. Since the ML model is supervised, feature vectors are prelabeled. This enables the algorithm to determine a specific output even if some of the features are similar between fruits; the algorithm knows which path to follow in order to reach a conclusion.
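Concretely, a labeled training table for the fruit example might look like the sketch below. The feature names and values are invented for illustration; the point is the shape of the data: one feature vector per row, plus a human-supplied label.

```python
# A toy labeled training table for fruit classification.
columns = ["color", "roundness", "diameter_cm"]
feature_vectors = [
    ["red",    0.90, 8.0],   # one feature vector per sample
    ["yellow", 0.30, 18.0],
    ["orange", 0.95, 7.5],
]
labels = ["apple", "banana", "orange"]  # the prelabeled answers

for vector, label in zip(feature_vectors, labels):
    print(dict(zip(columns, vector)), "->", label)
```

Libraries like pandas or NumPy are normally used for tables of this kind, but the row-per-sample, label-per-row structure is the same.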
Depending on the desired outcome, a training set must contain the key historical and/or factual labeled information needed to deliver accurate outputs. In order for the algorithm to produce the most accurate output, or inferred function, the data must be organized and cleansed going in. Data cleaning involves tasks such as fixing errors in the data, filling in incomplete values, removing empty rows, or reducing the amount of data. One of the biggest problems in big data and data mining with supervised learning models is inaccuracy due to underprepared data.
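Two of the cleaning tasks mentioned above, dropping empty rows and filling in incomplete values, can be sketched directly. This toy example uses invented rows and one common, simple strategy (filling a missing number with the column mean); real pipelines choose fill strategies carefully per column.

```python
# Toy data-cleaning pass: remove empty rows, then fill missing numeric
# fields with the column mean.
rows = [
    {"weight_g": 150.0, "label": "apple"},
    {"weight_g": None,  "label": "apple"},   # incomplete: fill it in
    {},                                      # empty: remove it
    {"weight_g": 170.0, "label": "apple"},
]

rows = [r for r in rows if r]  # drop rows with no data at all
known = [r["weight_g"] for r in rows if r["weight_g"] is not None]
mean_weight = sum(known) / len(known)
for r in rows:
    if r["weight_g"] is None:
        r["weight_g"] = mean_weight  # simple mean imputation

print(rows)
```
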
Once the algorithm has received a training set, it moves on to validation. There are many different models that can be applied to machine learning. A validation set will run a training set through several of these models to see which one produces the best fit for the task at hand. Validation sets are controlled, labeled, and cleansed in the same way training sets are. A validation model performs a controlled run-through for the sake of understanding how the algorithm “thinks.”
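Model selection on a validation set can be sketched with two deliberately simple candidate “models”: threshold rules on a single invented feature. Both are compared on held-out validation accuracy, and the better one wins, which is exactly the role the validation set plays for real models.

```python
# Hypothetical labeled data: (feature value, label) pairs.
train = [(0.2, "dog"), (0.4, "dog"), (1.1, "cat"), (1.4, "cat")]
validation = [(0.3, "dog"), (1.2, "cat"), (0.9, "cat")]

def model_a(x):  # candidate 1: cut-off at 1.0
    return "cat" if x >= 1.0 else "dog"

def model_b(x):  # candidate 2: cut-off at 0.5
    return "cat" if x >= 0.5 else "dog"

def accuracy(model, data):
    """Fraction of samples the model labels correctly."""
    return sum(model(x) == label for x, label in data) / len(data)

# Pick whichever candidate scores best on the held-out validation set.
best = max([model_a, model_b], key=lambda m: accuracy(m, validation))
print(best.__name__, accuracy(best, validation))
```

With real algorithms the candidates would be, say, a decision tree versus a neural network, but the selection logic is the same: fit on training data, compare on validation data.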
Validation spots overfitting, which occurs when the algorithm learns the training set too well. The extra information, or noise, in the training set becomes part of the algorithm's identifying parameters, so when new data is introduced to the model, its accuracy drops. Overfitting happens when there isn’t enough data or when there are too many data features in the algorithm. The model learns to home in on the nuances of that specific data, leaving it unable to generalize when a new piece of data is fed into the algorithm. Validation sets help identify when a training set needs to be tweaked.
During validation, the parameters of a classifier can be adjusted, such as the number of hidden layers in a neural network. The work of weighing datasets is done during the training phase. Thus, the training and validation phase work in conjunction.
After the model has been trained and validated, a test set supplies brand-new samples to test the accuracy of the algorithm. In order to ensure the model is unbiased, a test set must be completely separate from the training and validation sets. Test sets model real-world scenarios, and test runs should only be performed once or twice after the supervised model is complete.
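The three-way separation described above is usually produced by shuffling the labeled data once and carving it into disjoint slices. A minimal sketch, using integer stand-ins for labeled samples and an illustrative 60/20/20 split:

```python
import random

# Shuffle once, then carve off the test set first so it is never touched
# during training or validation.
random.seed(0)                 # fixed seed so the split is reproducible
samples = list(range(100))     # stand-ins for 100 labeled samples
random.shuffle(samples)

test = samples[:20]            # held out until the very end
validation = samples[20:40]    # used to compare and tune models
train = samples[40:]           # used to fit the model

print(len(train), len(validation), len(test))
```

Because the slices are disjoint, a sample the model trained on can never leak into the final accuracy measurement.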
The purpose of a test set is to verify the accuracy of the algorithm. If test outcomes are undesirable and then used to retrain the algorithm, the machine learning model will not perform well in the real world, because the algorithm essentially cheated by receiving and adapting to data that was supposed to be unseen and organic.
Supervised learning works within the context of AI and ML to learn how to process future data. Its power is limited: it is prone to the biases of the humans who label the datasets, and it can struggle with data points outside its defined parameters. However, implemented correctly, supervised learning is a powerful tool in predictive analytics, where past data forecasts future events, and in classification problems, such as accurately identifying and diagnosing diseases.