Frequently Asked Questions on Machine Learning

What is Machine Learning best suited for?

Machine Learning is good at replacing labor-intensive decision-making systems that are predicated on hand-coded decision rules or manual analysis. Five types of analysis that Machine Learning is well suited for are:

  • classification (predicting the class/group membership of items)
  • regression (predicting real-valued attributes)
  • clustering (finding natural groupings in data)
  • multi-label classification (tagging items with labels)
  • recommendation engines (connecting users to items)

How is Machine Learning different from predictive analytics?

You can think of Machine Learning as “predictive analytics on steroids.” Whereas predictive analytics has historically relied on restrictive, so-called parametric models (such as linear regression and logistic regression), Machine Learning employs data-driven models that can extract more knowledge from your data. As a result, Machine Learning is typically more accurate than classical predictive analytics, especially for large and heterogeneous datasets.

Further, Machine Learning is more expansive than predictive analytics, encompassing problems ranging from supervised (such as classification and regression) to unsupervised (such as clustering and dimensionality reduction) and everything in between (so-called semi-supervised problems). Predictive analytics, on the other hand, focuses solely on supervised learning problems.

What do I need to get started with Machine Learning?

A training set of data and an interesting business problem are all that is needed. The training set should consist of a set of events or objects for which the outcome of interest is known, plus any relevant input data that may be predictive of that outcome. The amount of training data required varies from problem to problem, but typically a couple thousand instances suffice.

For example, if you want to build a Machine Learning fraud detector for credit card transactions, all you need to get started is historical data about transactions that are known to have been fraudulent or not fraudulent. In this case, relevant input features could be data surrounding the transaction (location, time, amount) and the customer (credit score, credit history, employment status).
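
As a rough illustration of what such a training set might look like in code, the sketch below stores each historical transaction as a record with input features and a known outcome. The field names and all of the values are made up for illustration; a real data set would have its own schema.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    # Input features about the transaction (hypothetical field names)
    location: str
    hour_of_day: int
    amount: float
    # Input features about the customer (hypothetical field names)
    credit_score: int
    employed: bool
    # Known outcome: True if the transaction turned out to be fraudulent
    is_fraud: bool


# Made-up rows, for illustration only
training_set = [
    Transaction("US", 14, 42.50, 710, True, False),
    Transaction("US", 3, 980.00, 640, False, True),
    Transaction("FR", 19, 15.25, 780, True, False),
]


def split_features_and_outcome(rows):
    """Separate the input features (X) from the known outcome (y)."""
    X = [(r.location, r.hour_of_day, r.amount, r.credit_score, r.employed)
         for r in rows]
    y = [r.is_fraud for r in rows]
    return X, y


X, y = split_features_and_outcome(training_set)
```

Here X holds what the model is allowed to look at, and y holds the outcome it should learn to predict.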

What is a training data set?

A set of data whose outcome variable is known. Training data are used to build a supervised ML model. Training data can further be split into separate training and testing data sets, where the latter is used to validate the performance of the model. This step allows us to estimate what the accuracy of the model will be on future data and thus ensure that production-grade standards will be met.
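
The split described above can be sketched with the Python standard library alone. The helper names (train_test_split, accuracy), the 25% test fraction, and the toy threshold "model" are assumptions made for this example, not a prescribed method.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and hold out a fraction of them for testing."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, rows):
    """Fraction of (feature, outcome) rows where predict(feature) == outcome."""
    correct = sum(1 for feature, outcome in rows if predict(feature) == outcome)
    return correct / len(rows)


# Toy labeled data: the feature is a number, the outcome is whether it is >= 50
labeled = [(x, x >= 50) for x in range(100)]
train, test = train_test_split(labeled)

# Stand-in "model": a threshold learned from the training rows only
negatives = [x for x, outcome in train if not outcome]
positives = [x for x, outcome in train if outcome]
threshold = (max(negatives) + min(positives)) / 2

# Performance is estimated on rows the model never saw during training
test_accuracy = accuracy(lambda x: x >= threshold, test)
```

Because the test rows play no part in choosing the threshold, test_accuracy serves as an estimate of how the model would behave on future data.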

What is feature extraction?

Feature extraction is the process of extracting additional information out of your raw data. A simple example of feature extraction is taking a datetime input and expanding it into an enriched vector of information such as second, minute, hour, week, month, quarter and year. A more sophisticated example is processing input text data into natural language features that capture the essence of the meaning of the text. In this way, feature extraction allows you to turn unstructured data (text, images, time series, video, graphs, etc.) into structured representations of the salient information encoded in those data, which can in turn be used directly in a Machine Learning algorithm.
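
The datetime example can be sketched as follows. The function name expand_datetime and the choice of the ISO week number for "week" are assumptions made for this illustration.

```python
from datetime import datetime


def expand_datetime(ts: datetime) -> dict:
    """Expand one timestamp into several structured features."""
    return {
        "second": ts.second,
        "minute": ts.minute,
        "hour": ts.hour,
        "week": ts.isocalendar()[1],         # ISO week number (assumed convention)
        "month": ts.month,
        "quarter": (ts.month - 1) // 3 + 1,  # 1 through 4
        "year": ts.year,
    }


features = expand_datetime(datetime(2021, 8, 15, 13, 45, 30))
```

Each entry in the resulting dictionary can be passed to a learning algorithm as a separate input feature.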

What is the difference between supervised and unsupervised Machine Learning?

Supervised ML produces predictive models. These algorithms require a training set of data with known outcomes, and yield an optimized decision-making process. Unsupervised ML is mainly used for exploratory analysis (finding natural clusters, patterns, or outliers in the data) or visualization (through dimensionality reduction).
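
To make the contrast concrete, here is a minimal sketch on made-up one-dimensional data. The supervised part uses a nearest-centroid classifier and the unsupervised part uses a basic two-cluster k-means; both were chosen only for brevity, and all function names and values are hypothetical.

```python
from statistics import mean


def fit_nearest_centroid(values, labels):
    """Supervised: learn one centroid per known label."""
    return {
        label: mean(v for v, l in zip(values, labels) if l == label)
        for label in set(labels)
    }


def predict(centroids, value):
    """Predict the label whose centroid is closest to value."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))


def two_means(values, iterations=10):
    """Unsupervised: split unlabeled values into two clusters (k-means, k=2)."""
    lo, hi = min(values), max(values)
    if lo == hi:  # nothing to separate
        return list(values), []
    for _ in range(iterations):
        left = [v for v in values if abs(v - lo) <= abs(v - hi)]
        right = [v for v in values if abs(v - lo) > abs(v - hi)]
        lo, hi = mean(left), mean(right)
    return left, right


amounts = [5, 7, 6, 95, 102, 98]  # made-up values

# Supervised: known outcomes are supplied alongside the data
labels = ["small", "small", "small", "large", "large", "large"]
model = fit_nearest_centroid(amounts, labels)
prediction = predict(model, 80)

# Unsupervised: the same data without labels, grouped by similarity alone
left, right = two_means(amounts)
```

The supervised model can name an outcome for a new value because it was given labels; the clustering step only reports which values belong together.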