Choosing the best model
One of the most important steps in machine learning involves choosing the so-called “best model”.
Whether it's logistic regression, Random Forest, Bayesian methods, Support Vector Machine (SVM), or neural networks, no single model can be labeled as the best. Instead, there is a "more adequate model" suited to the specific data and context in which it will be used.
Before we dive into this crucial topic, let's briefly introduce Machine Learning.
Machine Learning (ML) is a part of the broader field of Artificial Intelligence, and its goal is to enable machines to learn tasks automatically, similar to how humans do.
The term Machine Learning was first introduced in 1959 by Arthur Samuel and later defined more formally by Tom Mitchell.
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
There is a fundamental difference between traditional algorithms and machine learning algorithms.
In the first scenario, the programmer sets the parameters and data needed to solve the task; in the second scenario, when facing problems without predefined strategies or unknown models, the computer learns by performing the task and improving over time.
For instance, a computer program can solve the Tic-Tac-Toe game and beat you because it's programmed with a winning strategy; but if it didn't know the basic rules of the game, it would need to use an algorithm that learns these rules by playing until it can win.
In the latter case, the computer doesn't just make a move. Instead, it tries to figure out the best one by learning from the examples it gets while playing. It creates rules based on these examples and can independently determine if a new situation fits a rule it has learned. This way, it decides on the next move.
Therefore, the goal of Machine Learning is to create models that enable the development of learning algorithms to solve specific problems.
The learning model outlines the goal of the analysis, meaning it specifies how you want the algorithm to learn.
There are various learning models:
- Supervised Learning
In this first case, think of the algorithm learning from the training dataset as if a teacher is overseeing the learning process. The learning stops when the algorithm achieves a satisfactory level of performance.
Supervised learning happens when you have input data (X) and output data (Y), and you use an algorithm to learn the function that produces the output from the input. The aim is to closely estimate the function so that when there's new input data (X), the algorithm can predict the output value (Y) for that data.
Supervised learning problems can be divided into:
Classification: the process by which a machine recognizes the objects in a dataset and assigns each of them to a discrete category.
Regression: the machine predicts a continuous value for what it is analyzing based on the available data. In other words, it studies the relationship between a dependent variable and one or more independent variables.
For example, given the size of a house, predict its price, or study the relationship between a driver's car trips and the number of accidents they have.
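To make the idea concrete, here is a minimal supervised-learning sketch using scikit-learn, fitting a linear regression that predicts house prices from size; the sizes and prices below are invented purely for illustration.

```python
# Minimal supervised-learning sketch: learn a function f(X) -> Y from labeled pairs.
# House sizes (m²) and prices (in thousands) are invented purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[50], [80], [100], [120], [150]])  # input data X: size of each house
y = np.array([110, 165, 200, 235, 290])          # output data Y: known prices

model = LinearRegression()
model.fit(X, y)                  # the "supervised" step: learn from (X, Y) examples

print(model.predict([[90]]))     # estimate the price of an unseen 90 m² house
```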
- Unsupervised Learning
Unsupervised learning is when you have an input variable (X), represented by data, and no corresponding output variable.
It seeks to identify relationships or patterns among the analyzed data without relying on categorization, unlike what is observed with Supervised Learning algorithms.
Learning is termed "unsupervised" because it lacks correct answers and a teacher to guide the process.
The algorithms work independently to uncover and highlight interesting data structures.
Unsupervised learning problems can be divided into:
Grouping: also known as “clustering”, it is used to group data that share similar characteristics.
In this scenario, the algorithm learns by identifying relationships between the data on its own.
The program doesn't rely on pre-categorized data but instead, it creates a rule to group the presented cases based on characteristics it identifies directly from the data itself.
The program doesn't define what the data represents, making it more challenging to assess the reliability of the results.
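As a concrete sketch of clustering, the snippet below runs scikit-learn's KMeans on synthetic two-dimensional points; the data and the choice of three clusters are assumptions made for the example.

```python
# Minimal clustering sketch: group unlabeled points by similarity.
# The synthetic data and the choice of k=3 clusters are assumptions for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(seed=0)
# Three synthetic blobs of 2-D points, with no labels attached.
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # the algorithm invents the grouping on its own
print(labels[:10])
print(kmeans.cluster_centers_)
```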
Association: a problem where you aim to find rules that explain significant parts of the data. For instance, people who purchase product A often also buy product B.
It seeks to identify frequent patterns, associations, correlations, or causal structures among sets of items or objects in a relational database.
Given a set of transactions, this approach aims to find rules that predict the occurrence of an item based on the occurrences of other items in the transaction. It is closely related to Data Mining.
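Rather than a full Apriori implementation, here is a minimal sketch that computes the support and confidence of a single candidate rule over a toy set of transactions; all items and transactions are invented for illustration.

```python
# Toy association-rule sketch: measure how often "buys A" also implies "buys B".
# All transactions are invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

a, b = "bread", "butter"
n = len(transactions)
count_a = sum(1 for t in transactions if a in t)
count_ab = sum(1 for t in transactions if a in t and b in t)

support = count_ab / n           # how frequent the pair is overall
confidence = count_ab / count_a  # how often buyers of A also buy B
print(f"support({a} -> {b}) = {support:.2f}, confidence = {confidence:.2f}")
```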
- Reinforcement Learning
Reinforcement learning is a machine learning technique that mimics human learning through trial and error.
The algorithm is designed to learn and adapt to changes in the environment through a system of evaluation. This system rewards correct actions and penalizes incorrect ones.
The aim is to maximize the rewards without specifying the exact path to follow.
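As a minimal sketch of this reward-and-penalty loop, the snippet below runs tabular Q-learning on an invented one-dimensional corridor, where reaching the rightmost cell is rewarded and every other step is penalized; the environment and hyperparameters are assumptions made for the example.

```python
# Minimal tabular Q-learning sketch: learn by trial and error from rewards.
# Environment: a 1-D corridor of 5 cells; reaching the rightmost cell pays +10,
# every other step costs -1. All details are invented for illustration.
import numpy as np

n_states, n_actions = 5, 2           # actions: 0 = move left, 1 = move right
q = np.zeros((n_states, n_actions))  # the value table the agent learns
alpha, gamma, epsilon = 0.1, 0.9, 0.2
rng = np.random.default_rng(0)

for _ in range(500):                 # training episodes
    state = 0
    while state != n_states - 1:
        # Explore sometimes; otherwise exploit the current best estimate.
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(q[state]))
        next_state = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
        reward = 10 if next_state == n_states - 1 else -1
        # Q-learning update: nudge the estimate toward reward + discounted future value.
        q[state, action] += alpha * (reward + gamma * np.max(q[next_state]) - q[state, action])
        state = next_state

print(q)  # after training, "right" should dominate in every state
```

Note that the agent is never told the path to the goal; the preference for moving right emerges purely from the accumulated rewards.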
Following this introduction to Machine Learning and its different training models, we now encounter the challenge of selecting the appropriate model based on the data and knowledge at hand.
Over time, each learning paradigm has grown to encompass a range of algorithms that vary in complexity and sophistication.
Creating an effective model heavily relies on choosing and fine-tuning the features, as well as picking the right model.
Finding the best model is a complex and iterative process.
As the diagram shows, we begin by examining the data's features using histograms, scatter plots, and other visual aids. This crucial step is known as EDA (Exploratory Data Analysis).
Next, we process the data's features, performing tasks like normalization, rescaling, and feature extraction.
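A minimal sketch of these two steps, assuming a tabular dataset loaded with pandas; the file name "data.csv" and the use of standardization are illustrative choices, not prescriptions.

```python
# Minimal EDA + preprocessing sketch. "data.csv" and its columns are
# placeholder assumptions, not a dataset from this article.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("data.csv")     # hypothetical tabular dataset

# Exploratory Data Analysis: inspect the data before modeling.
print(df.describe())             # summary statistics per numeric column
df.hist(figsize=(10, 6))         # quick histograms of the numeric features
plt.show()

# Normalization / rescaling: zero mean, unit variance for each numeric feature.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
```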
After this stage, which helps us understand the data and its context, we determine which category of machine learning models best fits our goals and the issues we're addressing. This often involves testing how well different models can predict and adapt.
We repeatedly refine our choice through evaluation and tuning, using both numerical and visual tools such as ROC curves, residual plots, heat maps, and validation curves.
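As a sketch of this evaluation loop, assuming a binary classification task, the snippet below combines a numerical check (cross-validated accuracy) with a visual one (a ROC curve on a held-out split); the dataset and classifier are arbitrary stand-ins.

```python
# Evaluation sketch: cross-validation scores plus a ROC curve.
# The dataset and classifier are arbitrary stand-ins for illustration.
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import RocCurveDisplay
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=5000)

# Numerical view: mean accuracy over 5 folds.
scores = cross_val_score(clf, X, y, cv=5)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Visual view: ROC curve on a held-out split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf.fit(X_tr, y_tr)
RocCurveDisplay.from_estimator(clf, X_te, y_te)
plt.show()
```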
During this model selection phase, one approach we use is outlined by the Scikit-Learn library, as shown in the following diagram.
This diagram serves as a good starting point, guiding you through a simplified decision-making process to choose the most suitable machine learning algorithm for your dataset. The Scikit-Learn flowchart is helpful because it provides a roadmap, but it doesn't explain how the different models function. For deeper understanding, two images are widely recognized in the Scikit-Learn community: the classifier comparison and the clustering comparison graphs. The "clustering comparison" chart is especially valuable for comparing various clustering algorithms across different datasets, as unsupervised learning does not benefit from having labeled data.
Similarly, the classifier comparison chart below offers a useful visual comparison of the performance of nine different classifiers across three different datasets:
Typically, these images are used to show significant differences in how various models perform across different datasets.
But what should you do when you've explored all those options? Scikit-Learn offers many more models, and the flowchart we've seen is just the beginning. You can take a comprehensive approach and test the entire Scikit-Learn model catalog to find the best fit for your dataset.
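One rough way to run such a sweep is with scikit-learn's all_estimators utility, which lists the built-in classifiers; in this sketch, estimators that cannot run with default settings are simply skipped.

```python
# Sketch of a brute-force sweep over scikit-learn's classifier catalog.
# Estimators that need mandatory arguments or fail on this data are skipped.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.utils import all_estimators

X, y = load_iris(return_X_y=True)
results = {}
for name, Estimator in all_estimators(type_filter="classifier"):
    try:
        results[name] = cross_val_score(Estimator(), X, y, cv=5).mean()
    except Exception:
        continue  # skip models that cannot run with default settings

# Print the five best-scoring classifiers on this particular dataset.
for name, score in sorted(results.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{name}: {score:.3f}")
```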
However, if our goal is to become more knowledgeable "Machine Learning professionals," then we're interested not just in how our models perform, but also in understanding why they work or don't work.
For our purposes, experimenting with model families and their hyperparameters is likely the best way to find the optimal model. One tool for exploring models is Dr. Saed Sayad’s interactive data mining map.
https://www.saedsayad.com/data_mining_map.htm
This resource is more detailed than the Scikit-Learn flowchart because it includes additional models. Besides predictive methods, it also covers statistical methods, exploratory data analysis (EDA), and data normalization.
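To make the hyperparameter experimentation concrete, here is a minimal grid-search sketch with scikit-learn; the SVC model and its parameter grid are illustrative assumptions, not recommendations.

```python
# Hyperparameter search sketch: try parameter combinations via cross-validation.
# The model and grid values are illustrative choices, not recommendations.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
grid = GridSearchCV(
    SVC(),
    param_grid={
        "C": [0.1, 1, 10],               # regularization strength
        "kernel": ["linear", "rbf"],     # shape of the decision boundary
        "gamma": ["scale", "auto"],      # kernel coefficient for "rbf"
    },
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```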
Below is a graph we developed in the Humanativa R&D Department. It aims to offer a comprehensive view of Sayad’s predictive methods, including some, like reinforcement learning, not shown in the original. It also incorporates the Scikit-Learn diagram.
The graph uses color and hierarchy to distinguish between different model types and families.
While this map isn't complete, our aim is for it to serve as a tool that, drawing from our experience, helps us quickly choose which models to test based on the data we have.
Summary
This article explores the crucial process of selecting the best model in machine learning, emphasizing that there's no one-size-fits-all model but rather a more adequate one depending on the specific data and context. It briefly introduces machine learning, its history, and the difference between traditional algorithms and machine learning algorithms. It delves into various learning models such as supervised learning, unsupervised learning, and reinforcement learning, explaining their distinctions, applications, and challenges. The piece also guides on how to approach model selection, utilizing tools like EDA, the Scikit-Learn library, and Dr. Saed Sayad’s interactive data mining map, ultimately aiming to enhance understanding of model performance and suitability for different datasets and problems.