Machine Learning is a discipline where computer programs make predictions or draw insights based on patterns they identify in data and are able to improve those insights with experience — without humans explicitly telling them how to do so.
More formally, in Tom Mitchell's widely quoted definition: a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Key themes that Machine Learning makes possible:
- Mass customization of a user’s environment, experience and system responses.
- The ability to visually identify objects and automate or tailor experiences accordingly.
- Automatic retrieval, generation or processing of content.
- Predictions, estimates and trends at scale.
- Detection of unusual activity or system failures.
- Enhanced experience and functionality for your customers.
- Internal functions, processes and business logic.
- Expansion to new verticals and new products.
AI involves machines that can perform tasks that are characteristic of human intelligence. Machine learning is simply a way of achieving AI. Deep learning is one of many approaches to machine learning. Reinforcement learning is another approach to machine learning.
At a conceptual level, we’re building a machine that given a certain set of inputs will produce a certain desired output by finding patterns in data and learning from it.
While the possibilities with ML are endless, there are certain questions you could ask to figure out how the technology could apply to your organization. Here are some examples:
Internal processes
- Where do people in my company today apply knowledge to make decisions that could be automated, so their skills could be better leveraged elsewhere?
- What is the data that people in my company normally search for, collect or extract manually from certain repositories of information and how can this be automated?
- What is the set of decisions people at my company make? Can those decisions conceivably be made by a machine if it magically ingested all the data my people have?
Products and experience for existing customers
- What parts of my customer interactions are customized by people and could potentially be customized by machines?
- Do I have a clear segmentation of my customers based on their preferences, behaviors and needs? Is my product / experience customized for each segment?
- Can I customize the experience for each individual customer based on what I know about them or their interaction with my site / app / product? How could I create a better, faster or otherwise more delightful experience for them?
- Specifically, what are the decisions and choices I’m asking my customers to make today? Can those decisions be automated based on some knowledge I already have or could have?
- How can I better identify good vs. bad customer experiences? Can I detect issues that will negatively impact customer experience or satisfaction before they happen or spread?
New verticals or customers
- Do I have any data that could be useful to other stakeholders in the industry or in adjacent industries? What sort of decisions can it help these stakeholders make?
All the above
- What are the metrics or trends that if I could correctly predict would have a meaningful impact on my ability to serve my customers or otherwise compete in the industry, e.g. forecast demand for certain categories of products, cost fluctuations etc.?
- What are the key entities about which I gather data (people, companies, products etc.)? Can I marry that data with any outside data (from public sources, partners etc.) in a way that tells me something new or useful about those entities? Useful to whom and how? For example: Identify potential customers when they are on the verge of looking for your product, understand how external factors affect demand in your industry and react accordingly, etc.
The data scientist’s role is to find the optimal machine to use given the inputs and the expected output. She has multiple templates — called algorithms — for machines. The machines she produces from those templates to solve a specific problem are called models. Templates have different options and settings that she can tweak to produce different models from the same template. She can use different templates and/or tweak the settings for the same template to generate many models that she can test to see which gives the best results.
Note that the model output is correct / useful for decision making at some degree of probability. Models are not 100% correct, but are rather “best guesses” given the amount of data the model has seen. The more data the model has seen, the more likely it is to give useful output.
The set of known inputs and outputs the data scientist uses to “train” the machine (i.e. let the model identify patterns in the data and create rules) is the “training set”. After she has a few of these “trained” models, she has to check how well they work and which one works best. She does that using a fresh set of data called the “validation set”. She runs the models on the validation set inputs to see which one gives results that are closest to the validation set outputs. Once she has validated which model performs best and picked the winner, our data scientist needs to determine the actual performance of that model, i.e. how good the best model she could produce really is at solving the problem. She measures that on a final set of data that played no part in training or in picking the winner, called the “test set”.
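The train / validate / test flow described above can be sketched in a few lines of Python. This is a toy illustration, not a recipe: the data is made up, the "template" is a hypothetical one-setting threshold rule, and the 60/20/20 split is an assumption. Real projects would normally shuffle the data randomly before splitting; the split here is deterministic only to keep the example reproducible.

```python
def fit_threshold(train_rows, candidate_thresholds):
    """'Train' a model: keep the candidate threshold with the best training accuracy."""
    def accuracy_at(t):
        return sum((x >= t) == bool(y) for x, y in train_rows) / len(train_rows)
    best = max(candidate_thresholds, key=accuracy_at)
    return lambda x: 1 if x >= best else 0

def accuracy(model, rows):
    """Share of rows (x, y) where the model's prediction equals the known output y."""
    return sum(model(x) == y for x, y in rows) / len(rows)

# Toy labeled data (made up): the true rule is y = 1 when x >= 50.
data = [(x, 1 if x >= 50 else 0) for x in range(100)]

# Deterministic 60% / 20% / 20% split into training, validation and test sets.
train = [row for i, row in enumerate(data) if i % 5 in (0, 1, 2)]
validation = [row for i, row in enumerate(data) if i % 5 == 3]
test = [row for i, row in enumerate(data) if i % 5 == 4]

# Same template, two different settings (a coarse and a fine grid of thresholds).
models = {
    "coarse": fit_threshold(train, range(0, 100, 40)),
    "fine": fit_threshold(train, range(0, 100, 10)),
}

# Pick the winner on the validation set, then measure it once on the test set.
validation_scores = {name: accuracy(m, validation) for name, m in models.items()}
best_name = max(validation_scores, key=validation_scores.get)
test_score = accuracy(models[best_name], test)
print(best_name, validation_scores, test_score)
```

The test set is touched only once, after the winner is chosen, so its score is not flattered by the selection process.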
Four types of Learning:
- Supervised learning: This is a type of learning where an algorithm needs to see a lot of labeled data examples (data that comprises both inputs and the corresponding output) in order to work. The “labeled” part refers to tagging the inputs with the outcome the model is trying to predict. Supervised learning algorithms see the labeled data (aka “ground truth” data), learn from it and make predictions based on those examples.
  - Regression: A statistical method that determines the relationship between the independent variables (the data you already have) and the dependent variable (the value you’re looking to predict).
  - Classification: Identifying which category an entity belongs to out of a given set of categories. This could be a binary classification, e.g. determining whether a post will go viral (yes / no), or a multi-class classification, e.g. labeling product photos with the appropriate category the product belongs to (out of possibly hundreds of categories).
- Unsupervised learning: The algorithm tries to identify patterns in the data without the need to tag the data set with the desired outcome. The data is “unlabeled”: it just “is”, without any meaningful label attached to it.
  - Clustering: Given a certain similarity criterion, find which items are more similar to one another.
  - Association: Categorize objects into buckets based on some relationship, so that the presence of one object in a bucket predicts the presence of another.
  - Anomaly detection: Identifying unexpected patterns in data that need to be flagged and handled.
- Semi-supervised learning: This is a hybrid between supervised and unsupervised learning, where the algorithm requires some labeled training data, but a lot less than in the case of supervised learning (possibly an order of magnitude less). Algorithms could be extensions of methods used in either supervised or unsupervised learning: classification, regression, clustering, anomaly detection etc.
- Reinforcement learning: Here the algorithm starts with a limited set of data and learns as it gets more feedback about its predictions over time.
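To make the difference between supervised and unsupervised learning concrete, here is a minimal sketch with made-up numbers. `fit_line` is a least-squares regression that needs labeled pairs (inputs and known outputs); `two_means` is a very small clustering routine that receives only unlabeled values. Both function names and the data are hypothetical, chosen for illustration.

```python
def fit_line(xs, ys):
    """Supervised (regression): least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

def two_means(values, iterations=10):
    """Unsupervised (clustering): split 1-D values into two groups.

    Assumes the values contain at least two distinct numbers.
    """
    low, high = min(values), max(values)  # initial guesses for the two centers
    for _ in range(iterations):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        # Move each center to the mean of the values assigned to it.
        low, high = sum(near_low) / len(near_low), sum(near_high) / len(near_high)
    return near_low, near_high

# Labeled data: each input x comes with its known output y.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)

# Unlabeled data: no outputs are given, the algorithm only groups similar values.
print(two_means([1, 2, 3, 10, 11, 12]))
```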
Deep learning is orthogonal to the above definitions. It is simply the application of a specific type of system to solve learning problems — the solution could be supervised, unsupervised etc. An Artificial Neural Network (ANN) is a learning system which tries to simulate the way our brain works — through a network of “neurons” that are organized in layers. A neural network has at a minimum an input layer — the set of neurons through which data is ingested into the network, an output layer — the neurons through which results are communicated out, and one or more layers in between, called “hidden layers”, which are the layers that do the computational work. Deep learning is simply the use of neural networks with more than one hidden layer to accomplish a learning task.
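The layered structure described above can be sketched as a forward pass through a tiny network. This is only an illustration under stated assumptions: the weights are hand-picked and untrained, the sigmoid activation is one common choice that the text does not specify, and the layer sizes (2 inputs, hidden layers of 3 and 2 neurons, 1 output) are arbitrary.

```python
import math

def sigmoid(z):
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One layer: each neuron takes a weighted sum of its inputs plus a bias, then applies sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

def forward(inputs, network):
    """Pass the inputs through every layer in order and return the last layer's activations."""
    activations = inputs
    for weights, biases in network:
        activations = layer(activations, weights, biases)
    return activations

# Each entry is (weights, biases) for one layer; the numbers are arbitrary.
network = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),  # hidden layer 1
    ([[0.2, -0.5, 0.3], [0.7, 0.1, -0.4]], [0.05, -0.05]),       # hidden layer 2
    ([[1.0, -1.0]], [0.0]),                                      # output layer
]
output = forward([1.0, 2.0], network)
print(output)
```

With two hidden layers this network already counts as "deep" under the definition above; training (adjusting the weights from data) is left out of the sketch.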
Ensemble methods or ensemble learning is the use of multiple models to get a result that is better than what each model could achieve individually. The models could be based on different algorithms or on the same algorithm with different parameters. The idea is that instead of having one model that takes input and generates output — say a prediction of some kind, you have a set of models that each generate a prediction, and some process to weigh the different results and decide what the output of the combined group should be. Ensemble methods are frequently used in supervised learning (they’re very useful in prediction problems) but can also apply in unsupervised learning. Your data science team will likely test such methods and apply them when appropriate.
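One simple way to combine the models' results is a majority vote. The sketch below assumes three hypothetical rule-based spam "models" and made-up message fields (`links`, `caps_ratio`, `length`); real ensembles usually combine trained models and may weigh their votes differently.

```python
from collections import Counter

def majority_vote(models, x):
    """Ask every model for a prediction and return the most common answer."""
    votes = [model(x) for model in models]
    return Counter(votes).most_common(1)[0][0]

# Three hypothetical models, each looking at a different signal.
models = [
    lambda m: "spam" if m["links"] > 3 else "ok",
    lambda m: "spam" if m["caps_ratio"] > 0.5 else "ok",
    lambda m: "spam" if m["length"] < 20 else "ok",
]

message = {"links": 5, "caps_ratio": 0.1, "length": 10}
print(majority_vote(models, message))  # two of the three models vote "spam"
```

Using an odd number of models for a two-way decision avoids tied votes.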
Natural language processing (NLP) is the field of computer science dealing with understanding language by machines. Not all types of NLP use machine learning. For example, if we generate a “tag cloud” — a visual representation of the number of times a word appears in a text — there is no learning involved. More sophisticated analysis and understanding of language and text often requires ML. Some examples:
- Keyword generation. Understanding the topic of a body of text and automatically creating keywords for it
- Language disambiguation. Determining the relevant meaning out of multiple possible interpretations of a word or a sentence
- Sentiment analysis. Understanding where on the scale of negative to positive the sentiment expressed in a text lies
- Named entity extraction. Identifying companies, people, places, brands etc. in a text; this is particularly difficult when the names are not distinctive (e.g. the company “Microsoft” is easier to identify than the company “Target”, which is also a word in the English language)
NLP is not only used for language-oriented applications of ML such as chatbots. It is also used extensively to prepare and pre-process data before it can be a useful input into many ML models.
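The "tag cloud" case mentioned earlier shows what NLP without learning looks like: the counts behind a tag cloud are plain word frequencies. A minimal sketch, assuming a made-up sentence and a deliberately naive tokenizer (lowercase letters and apostrophes only):

```python
import re
from collections import Counter

def word_counts(text, top_n=3):
    """Count how often each word appears; nothing is learned from the data."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)

text = "The model learns from data. More data helps the model; data is key."
print(word_counts(text))
```

The same kind of counting is also a typical pre-processing step before text is fed into an ML model.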
A small change in the problem definition could mean a completely different algorithm is required to solve it, or at a minimum a different model will be built with different data inputs.
ML models identify patterns in data. The data you feed into the models is organized into features (also called variables or attributes): These are relevant, largely independent pieces of data that describe some aspect of the phenomenon you’re trying to predict or identify.
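As an illustration of what features look like in practice, the sketch below turns one raw customer record into a small set of numeric features. The record fields and feature names are hypothetical; which features are actually relevant depends on the problem you are solving.

```python
from datetime import date

def extract_features(customer, today):
    """Turn a raw customer record into numeric features a model could consume."""
    return {
        "days_since_signup": (today - customer["signup_date"]).days,
        # max(..., 1) avoids dividing by zero for brand-new customers.
        "orders_per_month": customer["orders"] / max(customer["months_active"], 1),
        "is_premium": 1 if customer["plan"] == "premium" else 0,
    }

customer = {
    "signup_date": date(2023, 1, 1),
    "orders": 12,
    "months_active": 6,
    "plan": "premium",
}
features = extract_features(customer, today=date(2023, 7, 1))
print(features)
```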
Objective Function Selection
The objective function is the goal you’re optimizing for or the outcome the model is trying to predict. For example, if you’re trying to suggest products a user may be interested in, the output of a model could be the probability that a user will click on the product if they saw it. It may also be the probability that the user will buy the product. The choice of objective function depends primarily on your business goal.
The output of ML models is often a number: a probability, i.e. a prediction of the likelihood that something will happen or is true. Two considerations are critical to your approach to modeling, selecting features and presenting results:
- Explainability: to what degree the end user needs to be able to understand how the result was achieved.
- Interpretability: to what degree the user needs to draw certain conclusions based on the results.
A model is said to be “overfitted” when it follows the data so closely that it ends up describing too much of the noise rather than the true underlying relationship within the data. Broadly speaking, if the accuracy of the model on the data you train it with (the data the model “learns from”) is significantly better than its accuracy on the data with which you validate and test it, you may have a case of overfitting.
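The symptom described above, training accuracy far ahead of validation accuracy, can be reproduced with a toy example. The data is made up: the true rule is y = 1 when x >= 5, and two training labels are deliberately wrong to play the role of noise. The "memorizer" is a nearest-neighbor lookup that reproduces the training data exactly; the simple threshold rule is a hypothetical alternative that ignores the noise.

```python
def nearest_neighbor_model(train_rows):
    """A model that memorizes the training data: predict the label of the closest training point."""
    def predict(x):
        _, nearest_label = min(train_rows, key=lambda row: abs(row[0] - x))
        return nearest_label
    return predict

def accuracy(model, rows):
    return sum(model(x) == y for x, y in rows) / len(rows)

# Training data with two noisy labels: (2, 1) and (7, 0) contradict the true rule.
train = [(0, 0), (1, 0), (2, 1), (3, 0), (4, 0), (5, 1), (6, 1), (7, 0), (8, 1), (9, 1)]
# Fresh validation data with clean labels.
validation = [(0.4, 0), (1.6, 0), (2.2, 0), (3.4, 0), (4.4, 0),
              (5.6, 1), (6.6, 1), (7.2, 1), (8.4, 1), (9.4, 1)]

memorizer = nearest_neighbor_model(train)
simple_rule = lambda x: 1 if x >= 5 else 0

for name, model in [("memorizer", memorizer), ("simple rule", simple_rule)]:
    print(name, accuracy(model, train), accuracy(model, validation))
```

The memorizer is perfect on the data it has seen and much worse on fresh data, which is the gap to watch for; the simple rule scores lower on the noisy training set but generalizes better.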
Precision, Recall and the Tradeoff Between Them
There are two terms that are very confusing the first time you hear them but are important to fully understand since they have clear business implications.
The accuracy of classification (and other commonly used ML techniques such as document retrieval) is often measured by two key metrics: precision and recall. Precision measures the share of true positive predictions out of all the positive predictions the algorithm generated, i.e. the % of positive predictions that are correct. If the precision is X%, X% of the algorithm’s positive predictions are true positives and (100-X)% are false positives. In other words, the higher the precision, the fewer false positives you’ll have.
Recall is the share of the actual positives in the data that the algorithm correctly identified as positives. If the recall is X%, X% of the actual positives in the data were identified by the algorithm as positives, while (100-X)% were wrongly labeled as negatives (false negatives). In other words, the higher the recall, the fewer false negatives you’ll have.
There is typically a tradeoff between precision and recall. If you want to avoid false positives, i.e. you need higher precision, the algorithm will have more false negatives, i.e. lower recall, because it would “prefer” to label something as a negative than to wrongly label it as a positive, and vice versa. This tradeoff is a business decision. Take the loan application example: Would you rather play it safe and only accept applicants you’re very sure deserve to be accepted, thereby increasing the chances of rejecting some good customers (higher precision, lower recall = fewer false positives, more false negatives), or accept more loan applicants that should be rejected but not risk missing out on good customers (higher recall but lower precision = fewer false negatives, more false positives)? While you can simplistically say this is an optimization problem, there are often factors to consider that are not easily quantifiable, such as customer sentiment (e.g. unjustly rejected customers will be angry and vocal), brand risk (e.g. your reputation as an underwriter depends on a low loan default rate), legal obligations etc., making this very much a business decision rather than a data science one.
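The tradeoff can be seen by scoring the same set of loan applicants at two different approval thresholds. The scores and labels below are made up for illustration (label 1 means the applicant would have repaid, and "positive" means "approve"); `precision_recall` is a hypothetical helper, not a library function.

```python
def precision_recall(scores, labels, threshold):
    """Approve (predict positive) when the score is at or above the threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y == 1 for p, y in zip(predicted, labels))        # approved, would repay
    fp = sum(p and y == 0 for p, y in zip(predicted, labels))        # approved, would default
    fn = sum((not p) and y == 1 for p, y in zip(predicted, labels))  # rejected, would repay
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Model scores for eight applicants and whether each one actually repaid.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 1, 0]

strict = precision_recall(scores, labels, threshold=0.85)   # play it safe
lenient = precision_recall(scores, labels, threshold=0.25)  # approve almost everyone
print("strict:", strict)
print("lenient:", lenient)
```

The strict threshold approves only applicants who repay (precision 1.0) but turns away three of the five good customers (recall 0.4); the lenient one catches every good customer (recall 1.0) at the cost of approving two who default (precision 5/7).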
The Often Misleading Model Accuracy Metric
Model accuracy alone is rarely a good measure of a model. Imagine a disease with an incidence rate of 0.1% in the population. A model that says no patient has the disease regardless of the input is 99.9% accurate, but completely useless. It’s important to always consider both precision and recall and balance them according to business needs. Accuracy is a good metric only when the distribution of possible outcomes is quite uniform and the importance of false positives and false negatives is also about equal, which is rarely the case.
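The disease example can be checked with simple arithmetic. The sketch below assumes a made-up population of 10,000 patients, 10 of whom (0.1%) have the disease, and a "model" that always predicts "no disease".

```python
# 1 = has the disease, 0 = does not; 10 out of 10,000 is a 0.1% incidence rate.
labels = [1] * 10 + [0] * 9990
predictions = [0] * len(labels)  # the model says "no disease" for everyone

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print(accuracy)  # 0.999
print(recall)    # 0.0
```

Precision is not even defined here, because the model never makes a positive prediction; recall of zero is what exposes it as useless despite the 99.9% accuracy.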
Developing a Machine Learning Model
- Ideation. Align on the key problem to solve, and the potential data inputs to consider for the solution.
  - Align on the problem
  - Choose an objective function
  - Define quality metrics
  - Brainstorm potential inputs
- Data preparation. Collect and get the data in a useful format for a model to digest and learn from.
  - Collect data for your prototype in the fastest way possible
  - Data cleanup and normalization
- Prototyping and testing. Build a model or set of models to solve the problem, test how well they perform and iterate until you have a model that gives satisfactory results.
  - Build prototype
  - Validate and test prototype
- Productization. Stabilize and scale your model as well as your data collection and processing to produce useful outputs in your production environment.
  - Increase data coverage
  - Scale data collection
  - Refresh data
  - Scale models
  - Check for outliers
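Two of the steps above, data normalization and checking for outliers, are easy to sketch. The example below is one possible approach, not the only one: it assumes min-max scaling for normalization and a z-score rule for outliers, and the default threshold of 2.0 standard deviations is an arbitrary choice for this small made-up sample.

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Rescale values to the range [0, 1] so features are on a comparable scale."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # all values identical, nothing to scale
    return [(v - low) / (high - low) for v in values]

def flag_outliers(values, z_threshold=2.0):
    """Return the values that lie more than z_threshold standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > z_threshold]

print(min_max_normalize([0, 5, 10]))

daily_orders = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]
print(flag_outliers(daily_orders))  # the value 100 stands out
```

Whether a flagged value is a data error to clean up or a real event worth investigating is a judgment call that depends on the product.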
In a company where data science is becoming part of the DNA, it is essential to make data scientists an integral part of the product team, rather than treating them as a separate entity.
Do not underestimate the amount of thought and effort you need to put into the user experience side of the problem — even the best model is useless if users can’t understand, trust or act upon its output.