Machine Learning: Is it really a Black Box?

May 7, 2018

Machine Learning isn’t the “black box” that many perceive it to be. On complex data sets, the use of Machine Learning with a rigorous process and supporting visualizations can yield far more transparency than other methods.

What is a “Black Box”?

Machine Learning models are sometimes characterized as Black Boxes because of their powerful ability to model complex relationships between inputs and outputs without offering a tidy, intuitive description of how exactly they do so. A “Black Box” is “a device, system or object which can be viewed in terms of inputs and outputs without any knowledge of its internal workings” (Source: Wikipedia).

Black Boxes (and Machine Learning models) exist everywhere

We tend to label things as “Black Boxes” more because we don’t trust them than because we don’t understand them. Machine Learning models aren’t unique in having an element of “mystery” in how they work – there are all sorts of things we trust all around us without fully understanding their inner workings. GPS, search engines, car engines, step counters, even the curve-fitting algorithms in Excel are all examples where we trust what’s happening inside because we can see, and with experience come to have confidence in, the results they produce.

Machine Learning itself is everywhere, and is already “trusted” widely by nearly everyone, whether we realize it or not. Google Maps, airline autopilots, spam filters, optical character recognition (OCR), shopping recommendations, depositing a cheque at an ATM, fraud protection, voice-to-text on our mobile phones, and even some best-in-class medical diagnostic techniques all rely on Machine Learning to be as effective as they are.

(Supervised) Machine Learning is just another fitting algorithm

Most of the time, when people talk about Machine Learning in oil and gas, they are referring to Supervised Machine Learning. This is where an algorithm learns by example after observing many cases of “given data X1, X2, X3 and X4, the outcome was Y”. With enough examples, an algorithm can learn how to predict the outcome from this type of data, assuming a relationship exists.
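
As a minimal sketch of that learn-by-example loop (in Python with scikit-learn, on synthetic data we invented purely for illustration – not from any real well data set), the snippet below shows a model many cases of X1–X4 together with the outcome Y, then asks it to predict Y for cases it has never seen:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic example data: four inputs (X1..X4) and an outcome Y.
# In practice these columns would come from your own well or field data.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(500, 4))                      # 500 cases of X1..X4
y = 3 * X[:, 0] + np.sin(6 * X[:, 1]) + 0.5 * X[:, 2] + rng.normal(0, 0.1, 500)

# Hold some cases back so we can test the model on examples it never saw.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)                               # "learn by example"

print("Predicted Y for a new case:", model.predict(X_test[:1])[0])
print("Hold-out R^2:", round(model.score(X_test, y_test), 3))
```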

This is basically the same process as fitting a linear model or calculating a linear correlation coefficient in Excel or other tools, but with the ability to handle more complex relationships. Linear regression is included as part of the “Machine Learning” algorithm toolkit by all prominent open source tools. Sometimes, that’s all we need. Other times, it’s useful to have more powerful tools that can better handle many inputs and the complex, nonlinear relationships they may have with both the outcome we’re trying to predict and with each other.
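
Here is one way to see the “same process, more power” point in code (again scikit-learn on made-up data; the nonlinear relationship is baked in by us for illustration): a plain linear regression and a gradient-boosted model are trained and scored through exactly the same interface, and the more flexible model only earns its keep because the underlying relationship isn’t a straight line.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data with a deliberately nonlinear relationship between X and y.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = X[:, 0] ** 2 + np.sin(2 * X[:, 1]) + rng.normal(0, 0.2, 400)

# Linear regression is just one more algorithm in the same toolkit --
# both models share the same fit/predict/score interface.
for model in (LinearRegression(), GradientBoostingRegressor(random_state=0)):
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{type(model).__name__}: mean cross-validated R^2 = {r2:.2f}")
```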

Machine Learning affords us unforeseen insights

The most obvious way Machine Learning differentiates itself from a Black Box is its powerful ability to describe feature importance. Feature importance is a measure of how useful a particular input (or “feature”) is in predicting the outcome. The power of Machine Learning can make it difficult to grasp all that’s going on under the hood, but that same power gives us a very high level of confidence that if there is a meaningful relationship to be found, Machine Learning will find it. However, this power should be used carefully, as Machine Learning can also produce misleading results (see our blog on Finding the Signal or Fitting the Noise?).
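
As a rough illustration of what feature importance looks like in practice (a sketch only – synthetic data, and permutation importance is just one of several ways to measure it), the example below builds a data set where only two of four candidate features actually drive the outcome, and the importance scores reflect that:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only X1 and X2 actually influence the outcome; X3 and X4 are noise.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(600, 4))
y = 4 * X[:, 0] + 2 * X[:, 1] ** 2 + rng.normal(0, 0.2, 600)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_train, y_train)

# Permutation importance: how much does hold-out accuracy drop when each
# feature is scrambled in turn?  Noise features should score near zero.
result = permutation_importance(model, X_test, y_test, n_repeats=20, random_state=0)
for name, score in zip(["X1", "X2", "X3", "X4"], result.importances_mean):
    print(f"{name}: {score:.3f}")
```

Scores like these are what a tornado plot summarizes visually.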

Conversely, if no predictive relationship is found between a particular feature and the outcome, we can be reasonably confident that none exists in the data at hand. Gaining an understanding of what matters and what doesn’t in making predictions, and to what extent, is very valuable in better understanding the problem.

Test hypotheses quickly

The potent ability of Machine Learning to discover and communicate relationships in the data makes it an excellent tool for hypothesis testing. Without Machine Learning, investigating a hunch like “I think horizontal well inclination and frac intensity together can tell me a lot about how much gas a well will produce in the first year” can be time-consuming. It can even yield a false conclusion if the relationship is complex or other influencing factors aren’t properly accounted for. Sometimes relationships are nonlinear, threshold-driven, or arise from interactions between different types of data; Machine Learning excels at uncovering this kind of complexity. An underappreciated benefit of Machine Learning is that it lets us focus further attention on only the most promising leads and avoid going down dead-end rabbit holes.
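
As a sketch of what that hypothesis test might look like (the well table, column names and relationships below are entirely made up – in practice you would load your own data), we simply ask how well those two features alone predict first-year gas on wells the model has never seen:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Stand-in well table -- replace with your own data. Column names and the
# relationship baked in below are purely illustrative.
rng = np.random.default_rng(7)
n = 400
wells = pd.DataFrame({
    "inclination": rng.uniform(85, 95, n),        # degrees
    "frac_intensity": rng.uniform(0.5, 3.0, n),   # e.g. tonnes of proppant per metre
})
wells["first_year_gas"] = (
    50 * wells["frac_intensity"]
    - 5 * (wells["inclination"] - 90) ** 2
    + rng.normal(0, 20, n)
)

# The hypothesis test: how much of the variation in first-year gas can these
# two features explain on wells the model has never seen?
X = wells[["inclination", "frac_intensity"]]
y = wells["first_year_gas"]
scores = cross_val_score(GradientBoostingRegressor(random_state=0), X, y,
                         cv=5, scoring="r2")
print(f"Mean cross-validated R^2: {scores.mean():.2f}")
```

A high score supports the hunch and is worth digging into; a score near zero suggests the lead is a dead end.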

Visualization brings transparency to Machine Learning

We use interactive visualizations in VERDAZO throughout our Machine Learning process to add transparency, identify data opportunities and steer the process toward better results. Visualizations that we typically use include:

  • Statistical and map views of the data to help select sample data
  • Feature importance (tornado plots)
  • Comparing model results in cross-plots and cumulative probability distributions
  • Fitness assessment and error characterization using maps, distributions and probit plots, slicing and dicing by categories, vintages and quartiles
  • Sensitivity profiles (applying variations to one value while holding all others constant – see the sketch after this list)
  • Parallel coordinates distributions for validation
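
Sensitivity profiles in particular are easy to sketch in code. Assuming a fitted scikit-learn model and a pandas table of features (everything below, including the feature names, is a made-up stand-in), one simple way to build such a profile is to sweep a single feature across its observed range while pinning every other feature at its median:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Stand-in feature table and outcome -- replace with your own data.
rng = np.random.default_rng(3)
features = pd.DataFrame({
    "frac_intensity": rng.uniform(0.5, 3.0, 500),
    "stage_count": rng.integers(10, 40, 500),
    "lateral_length": rng.uniform(1000, 3000, 500),
})
outcome = (40 * features["frac_intensity"]
           + 0.02 * features["lateral_length"]
           + rng.normal(0, 15, 500))

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(features, outcome)

def sensitivity_profile(model, features, name, n_points=20):
    """Vary one feature across its observed range; hold all others at their medians."""
    grid = np.linspace(features[name].min(), features[name].max(), n_points)
    base = features.median().to_frame().T                 # one row of median values
    profile = pd.concat([base] * n_points, ignore_index=True)
    profile[name] = grid
    return grid, model.predict(profile[features.columns])

grid, preds = sensitivity_profile(model, features, "frac_intensity")
for x, p in zip(grid, preds):
    print(f"frac_intensity = {x:.2f} -> predicted outcome {p:.1f}")
```

Plotting the resulting curve, one per feature, gives the sensitivity profile view described above.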

In the end, Machine Learning isn’t the Black Box that many perceive it to be, especially when it involves a rigorous process and is supported by a robust set of visualizations. Our Machine Learning services are purpose-built to help our clients get the greatest clarity, understanding and insight from their data – and build trust in the reliability of the results.