What Types of Questions Can Data Science Answer?
Data science has enabled us to solve complex and diverse problems by using machine learning and statistical algorithms. Here we enumerate the common applications of supervised, unsupervised, and reinforcement learning techniques.
Machine learning (ML) is the motor that drives data science. Each ML method (also called an algorithm) takes in data, turns it over, and spits out an answer. ML algorithms do the part of data science that is the trickiest to explain and the most fun to work with. That’s where the mathematical magic happens.
ML algorithms can be grouped into families based on the type of question they answer. These families can help guide your thinking as you formulate your razor-sharp question.
Is this A or B?
This family is formally known as two-class classification. It’s useful for any question that has just two possible answers: yes or no, on or off, smoking or non-smoking, purchased or not. Lots of data science questions sound like this or can be re-phrased to fit this form. It’s the simplest and most commonly asked data science question. Here are a few typical examples.
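As a rough illustration of an A-or-B question, the sketch below labels a new example by checking which class average it sits closer to. The data (minutes spent on a product page) and the names `train` and `predict` are made up for this example; real two-class classifiers are usually more sophisticated.

```python
# Toy two-class classifier: nearest class mean on a single feature.
# The data and function names are hypothetical, for illustration only.

def train(examples):
    """examples: list of (feature_value, label) pairs using two labels."""
    sums, counts = {}, {}
    for x, label in examples:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    # One average feature value per label.
    return {label: sums[label] / counts[label] for label in sums}

def predict(means, x):
    """Return the label whose training mean is closest to x."""
    return min(means, key=lambda label: abs(x - means[label]))

# Assumed data: minutes on the product page -> purchased or not.
labeled = [(5.0, "not purchased"), (8.0, "not purchased"),
           (40.0, "purchased"), (55.0, "purchased")]
means = train(labeled)
print(predict(means, 10.0))  # closer to the "not purchased" average
print(predict(means, 45.0))  # closer to the "purchased" average
```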
Is this A or B or C or D?
This algorithm family is called multi-class classification. As its name implies, it answers a question that has several (or even many) possible answers: which flavor, which person, which part, which company, which candidate. Most multi-class classification algorithms are just extensions of two-class classification algorithms. Here are a few typical examples.
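One common way to extend a two-class method to many classes is "one-vs-rest": ask a separate yes/no question for each class ("is it this class, or any of the others?") and pick the class whose question gets the strongest yes. The sketch below assumes a very simple two-class scorer based on class averages. The fruit weights and all function names are hypothetical.

```python
# One-vs-rest sketch: build one two-class scorer per label, then pick
# the label whose scorer is most confident. Data and names are made up.

def train_binary(examples, positive):
    """Two-class model: mean of the positive class and mean of the rest."""
    pos = [x for x, y in examples if y == positive]
    neg = [x for x, y in examples if y != positive]
    return sum(pos) / len(pos), sum(neg) / len(neg)

def binary_score(model, x):
    """Positive when x is closer to the positive mean than to the rest."""
    pos_mean, neg_mean = model
    return abs(x - neg_mean) - abs(x - pos_mean)

def train_one_vs_rest(examples):
    labels = sorted({y for _, y in examples})
    return {label: train_binary(examples, label) for label in labels}

def predict(models, x):
    return max(models, key=lambda label: binary_score(models[label], x))

# Assumed data: weight in grams -> which fruit.
labeled = [(8.0, "cherry"), (10.0, "cherry"),
           (150.0, "apple"), (170.0, "apple"),
           (1500.0, "melon"), (1800.0, "melon")]
models = train_one_vs_rest(labeled)
print(predict(models, 155.0))
```

With a single feature and such a crude scorer this approach is fragile, so treat it as a picture of the idea and not as a recommended method.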
Is this Weird?
This family of algorithms performs anomaly detection. They identify data points that are not normal. If you are paying close attention, you will have noticed that this looks like a binary classification question. It can be answered yes or no. The difference is that binary classification assumes you have a collection of examples of both yes and no cases. Anomaly detection doesn’t. This is particularly useful when what you are looking for occurs so rarely that you haven’t had a chance to collect many examples of it, like equipment failures. It’s also very helpful when there is a lot of variety in what constitutes “not normal,” as there is in credit card fraud detection. Here are some typical anomaly detection questions.
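The key point, that only normal examples are needed, can be sketched with a simple rule: learn the typical range from normal readings, then flag anything far outside it. The three-standard-deviation cutoff, the sensor readings, and the function names below are assumptions chosen for illustration.

```python
import statistics

def fit_normal(readings):
    """Summarize normal behavior from examples of normal readings only."""
    return statistics.mean(readings), statistics.stdev(readings)

def is_weird(model, x, threshold=3.0):
    """Flag x if it lies more than `threshold` standard deviations from the mean."""
    mean, std = model
    return abs(x - mean) > threshold * std

# Assumed data: temperature readings from equipment that is working fine.
normal_readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2]
model = fit_normal(normal_readings)
print(is_weird(model, 20.4))  # within the usual range
print(is_weird(model, 25.0))  # far outside anything seen so far
```

Note that no example of a failure was ever shown to the model. It only knows what normal looks like.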
How Much / How Many?
When you are looking for a number instead of a class or category, the algorithm family to use is regression.
Usually, regression algorithms give a real-valued answer; the answers can have lots of decimal places or even be negative. For some questions, especially questions beginning “How many…”, negative answers may have to be re-interpreted as zero and fractional values re-interpreted as the nearest whole number.
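A minimal sketch of this, assuming a straight-line fit by ordinary least squares: the raw prediction is a real number, and a small helper re-interprets it as a count. The data and the names `fit_line`, `predict`, and `to_count` are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def predict(model, x):
    """Real-valued answer: may be fractional or negative."""
    intercept, slope = model
    return intercept + slope * x

def to_count(value):
    """Re-interpret a real-valued answer as a 'How many' answer."""
    return max(0, round(value))

# Assumed data: ads shown -> units sold.
model = fit_line([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(predict(model, 2.3), to_count(predict(model, 2.3)))
print(predict(model, -1.0), to_count(predict(model, -1.0)))
```

One detail to be aware of: Python’s built-in `round` sends exact halves to the nearest even integer, which may or may not be the rounding rule you want for counts.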
Multi-Class Classification as Regression
Sometimes questions that look like multi-class classification questions are actually better suited to regression. For instance, “Which news story is the most interesting to this reader?” appears to ask for a category: a single item from the list of news stories. However, you can reformulate it to “How interesting is each story on this list to this reader?” and give each article a numerical score. Then it is a simple thing to identify the highest-scoring article. Questions of this type often occur as rankings or comparisons.
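The reformulation can be sketched as "score every item, then sort." In practice the scores would come from a trained regression model. Here a hand-written scoring function stands in for that model, and the stories, topics, and interest weights are all invented for the example.

```python
def interest_score(story_topics, reader_interests):
    """Stand-in for a regression model: sum the reader's interest in each topic."""
    return sum(reader_interests.get(topic, 0.0) for topic in story_topics)

# Assumed data: stories with topics, and one reader's interest weights.
stories = {
    "Election recap": ["politics"],
    "Cup final": ["sports", "football"],
    "New phone": ["tech"],
}
reader = {"sports": 0.9, "football": 0.5, "tech": 0.4}

scores = {title: interest_score(topics, reader)
          for title, topics in stories.items()}

# Highest score first: the "which story?" answer is simply the top of the ranking.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```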
Two-Class Classification as Regression
It may not come as a surprise that binary classification problems can also be reformulated as regression. (In fact, under the hood some algorithms reformulate every binary classification as regression.) This is especially helpful when an example can belong partly to A and partly to B, or have a chance of going either way. When an answer can be partly yes and partly no, probably on but possibly off, then regression can reflect that. Questions of this type often begin “How likely…” or “What fraction…”.
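One widely used way to do this is the logistic function, which squeezes any real-valued score into a number between 0 and 1 that can be read as "how likely." The weight and bias below are assumed values, not fitted ones, and the function names are hypothetical.

```python
import math

def how_likely(x, weight=1.0, bias=-3.0):
    """Logistic function of a linear score: returns a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(weight * x + bias)))

def is_a(x, cutoff=0.5):
    """Recover the yes/no answer by thresholding the real-valued one."""
    return how_likely(x) >= cutoff

for x in (1.0, 3.0, 5.0):
    print(x, round(how_likely(x), 3), is_a(x))
```

The real-valued answer keeps information that a plain yes/no would discard: an example scored near 0.5 could go either way, while one scored near 1 is a confident yes.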
As you may have gathered, the families of two-class classification, multi-class classification, anomaly detection, and regression are all closely related. They all belong to the same extended family, supervised learning. They have a lot in common, and often questions can be modified and posed in more than one of them. What they all share is that they are built using a set of labeled examples (a process called training), after which they can assign a value or category to unlabeled examples (a process called scoring).
Entirely different sets of data science questions belong in the extended algorithm families of unsupervised and reinforcement learning.
How is this Data Organized?
Questions about how data is organized belong to unsupervised learning. There is a wide variety of techniques that try to tease out the structure of data. One family of these performs clustering, a.k.a. chunking, grouping, bunching, or segmentation. They seek to separate out a data set into intuitive chunks. What makes clustering different from supervised learning is that there is no number or name that tells you what group each point belongs to, what the groups represent, or even how many groups there should be. If supervised learning is picking out planets from among the stars in the night sky, then clustering is inventing constellations. Clustering tries to separate out data into natural “clumps,” so that a human analyst can more easily interpret it and explain it to others.
Clustering always relies on a definition of closeness or similarity, called a distance metric. The distance metric can be any measurable quantity, such as difference in IQ, number of shared genetic base pairs, or miles-as-the-crow-flies. Clustering questions all try to break data into more nearly uniform groups.
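To make the role of the distance metric concrete, here is a minimal sketch of k-means, one well-known clustering method (the text above does not name a specific algorithm, so this choice is an assumption). The number of groups `k`, the naive starting centers, and the sample points are all picked for illustration.

```python
import math

def k_means(points, k, distance=math.dist, max_iters=100):
    """Group points into k clusters; returns (labels, centers).

    `distance` is the distance metric. Averaging to update the centers
    matches Euclidean distance; other metrics may need a different update.
    """
    centers = [tuple(p) for p in points[:k]]  # naive start: the first k points
    labels = None
    for _ in range(max_iters):
        # Assign each point to its nearest center under the chosen metric.
        new_labels = [min(range(k), key=lambda c: distance(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Move each center to the average of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(coord) / len(members)
                                   for coord in zip(*members))
    return labels, centers

# Assumed data: two visually obvious clumps in two dimensions.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = k_means(points, k=2)
print(labels)
print(centers)
```

Nothing in the input says which group a point belongs to. The grouping comes entirely from the distance metric, and the caller still has to decide how many groups to ask for and what they mean.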
Another family of unsupervised learning algorithms is called dimensionality reduction techniques. Dimensionality reduction is another way to simplify the data, making it easier to communicate, faster to compute with, and easier to store.
At its core, dimensionality reduction is all about creating a shorthand for describing data points. A simple example is GPA. A college student’s academic strength is measured in dozens of classes by hundreds of exams and thousands of assignments. Each assignment says something about how well that student understands the course material, but a full listing of them would be way too much for any recruiter to digest. Luckily, you can create a shorthand just by averaging all the scores together. You can get away with this massive simplification because students who do very well on one assignment or in one class typically do well in others. By using GPA rather than the full portfolio, you do lose richness. For instance, you wouldn’t know whether the student is stronger in math than English, or whether she scored better on take-home programming assignments than on in-class quizzes. But what you gain is simplicity, which makes it a lot easier to talk about and compare students’ strengths.
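The GPA shorthand is easy to sketch: many numbers per student go in, one number comes out. The students and grades below are invented, and a plain average is the simplest possible reduction, used here only to show the trade-off.

```python
def gpa(grades):
    """Collapse many grades into a single number by averaging."""
    return sum(grades.values()) / len(grades)

# Assumed data: two students with opposite strengths.
student_a = {"math": 4.0, "english": 3.0}
student_b = {"math": 3.0, "english": 4.0}

print(gpa(student_a), gpa(student_b))
```

Both students end up with the same shorthand, which is exactly the loss of richness described above: the single number is easy to compare, but it can no longer tell you who is stronger in math.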
Dimensionality reduction-related questions are usually about factors that tend to vary together.
What Should I Do Now?
A third extended family of ML algorithms focuses on taking actions. These are called reinforcement learning (RL) algorithms. They are a little different from the supervised and unsupervised learning algorithms. A regression algorithm might predict that the high temperature will be 98 degrees tomorrow, but it doesn’t decide what to do about it. An RL algorithm goes the next step and chooses an action, such as pre-refrigerating the upper floors of the office building while the day is still cool.
RL algorithms were originally inspired by how the brains of rats and humans respond to punishment and rewards. They choose actions, trying very hard to choose the action that will earn the greatest reward. You have to provide them with a set of possible actions, and they need to get feedback after each action on whether it was good, neutral, or a huge mistake.
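The loop described here (choose an action, get feedback, adjust) can be sketched with an epsilon-greedy agent, one of the simplest RL strategies. This is a simplified sketch that ignores the state of the world, and the action names, reward values, and class name are all assumptions made for illustration.

```python
import random

class EpsilonGreedyAgent:
    """Keeps a running average reward per action and mostly picks the best."""

    def __init__(self, actions, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.epsilon = epsilon            # how often to explore at random
        self.rng = random.Random(seed)
        self.counts = {a: 0 for a in self.actions}
        self.values = {a: 0.0 for a in self.actions}

    def best_action(self):
        return max(self.actions, key=lambda a: self.values[a])

    def choose(self):
        # Try every action once before trusting the estimates.
        untried = [a for a in self.actions if self.counts[a] == 0]
        if untried:
            return untried[0]
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)  # explore
        return self.best_action()                 # exploit

    def learn(self, action, reward):
        # Incremental update of the average reward seen for this action.
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]

def feedback(action):
    """Hypothetical rewards on a hot day: good, neutral, or a huge mistake."""
    return {"pre_cool": 1.0, "do_nothing": 0.0, "heat": -10.0}[action]

agent = EpsilonGreedyAgent(["do_nothing", "heat", "pre_cool"])
for _ in range(50):
    action = agent.choose()
    agent.learn(action, feedback(action))
print(agent.best_action())
```

The agent starts with no data at all. Everything it knows about the actions comes from trying them and recording the feedback.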
Typically RL algorithms are a good fit for automated systems that have to make a lot of small decisions without a human’s guidance. Elevators, heating, cooling, and lighting systems are excellent candidates. RL was originally developed to control robots, so anything that moves on its own, from inspection drones to vacuum cleaners, is fair game. Questions that RL answers are always about what action should be taken, although the action is usually taken by a machine.
RL usually requires more effort to get working than other algorithm types because it’s so tightly integrated with the rest of the system. The upside is that most RL algorithms can start working without any data. They gather data as they go, learning from trial and error.
The next and final post in this series will give lots of specific examples of sharp data science questions and the algorithm family best suited to each. Stay tuned.