
How To Restore Credibility Of Machine Learning Pipeline Output Challenged In Study Of Real World Deployments

All domains are going to be turned upside down by machine learning (ML). This is the consistent story we have been hearing for the past few years. Except for practitioners and some geeks, most people are not aware of the nuances of ML. ML is definitely related to Artificial Intelligence; whether it is a pure subset or a closely related area depends on who you ask. The dream of general Artificial Intelligence, machines that solve previously unseen problems in any domain using cognitive skills, turned into an AI winter when those approaches failed to yield results for forty or fifty years. The resurgence of ML turned the field around. ML became tractable as the horsepower of computers increased and much more data about different domains became available for training models. ML turned the focus away from trying to model the whole world with data and symbolic logic in order to make predictions. Instead, ML relies on statistical methods and limits the field of prediction, as discussed below.

There are three separate approaches in ML: supervised learning, semi-supervised learning, and unsupervised learning. Deep learning is characterized by multiple layers of such methods. The success of ML comes from the ability of models to be trained on large amounts of data from a particular domain, called training sets, and then make predictions. In any ML pipeline, a number of candidate models are trained using this data. At the end of the training, an essential amount of the basic structure of the domain is encoded in the model. This allows the ML model to generalize and produce predictions in the real world. For example, a large number of cat videos and non-cat videos can be fed into a model to train it to recognize cat videos. At the end of the training, a certain amount of cat-videoness is encoded in the successful models.
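To make the pipeline concrete, here is a minimal sketch of the training stage: several candidate models fitted to the same training set, differing only in their random initialization. The dataset, model type, and scikit-learn usage are illustrative assumptions of mine, not anything prescribed by the article or the paper.

```python
# Hypothetical sketch: train several candidate models on one training set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Stand-in for a domain training set (think cat vs. non-cat features).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Candidates differ only in the random seed used for initialization.
candidates = []
for seed in range(5):
    model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                          random_state=seed)
    model.fit(X_train, y_train)
    candidates.append(model)
```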

ML is used in many familiar systems, including movie recommendations based on viewing data and market basket analysis, which suggests new products based on the current contents of shopping carts. Facial recognition, skin cancer prediction from clinical images, identifying diabetic retinopathy from retinal scans, and predicting cancer from MRI scans are all in the domain of ML. Of course, recommending movies and predicting skin cancer or the beginnings of retinopathy and blindness are vastly different in scope and importance.

The key idea after this training is to use an independent and identically distributed (iid) evaluation procedure, using data drawn from the same distribution as the training set but not yet encountered by the model. This evaluation is used to choose the candidate for deployment in the real world. Many candidates can perform similarly during this phase, even though there are subtle differences between them due to starting assumptions, the number of training runs, the data they were trained on, and so on.
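Continuing the hypothetical sketch above, the iid evaluation scores each candidate on the held-out split (drawn from the same distribution as the training data) and keeps the top scorer for deployment:

```python
# Score every candidate on held-out iid data and select the best.
# In practice the candidates often score nearly identically here,
# which is the crux of the problem discussed below.
scores = [model.score(X_eval, y_eval) for model in candidates]
best_model = candidates[scores.index(max(scores))]
print("iid accuracies:", [round(s, 3) for s in scores])
```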

Ideally, the iid evaluation predicts the expected real-world performance of the model. This helps separate the wheat from the chaff, the duds from the iid-optimal models. That there would be some structural misalignment between the training sets and the real world is obvious. The real world is messy and chaotic: images are blurry, operators are not trained to capture pristine images, and equipment breaks down. Predictors deemed equivalent at the evaluation phase should therefore have shown similar defects in the real world. A paper written by three principals and backed by fifty researchers, all from Google, probes this theory to explain many high-profile failures of ML models in the real world. The paper notes that predictors that performed similarly during the evaluation phase did not perform equally in the real world. Uh oh: this means that the duds and the good performers could not be distinguished at the end of the pipeline. The paper is a sledgehammer taken to the process of choosing a predictor and to the current construction of the ML pipeline.

The paper identifies the root cause of this behavior as underspecification in ML pipelines. Underspecification is a well-understood and well-documented phenomenon in ML; it arises when there are more unknowns than independent linear equations, so many different solutions fit the training data equally well. It is often accepted deliberately because it makes reaching a candidate model faster. The paper's first claim is that underspecification in ML pipelines is a key obstacle to reliably training models that behave as expected in deployment. The second claim is that underspecification is ubiquitous in modern applications of ML and has substantial practical implications. There is no easy cure for underspecification. Further, all ML predictors deployed via the old pipeline are suspect.
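A tiny numerical illustration of that definition (a toy example of my own, not taken from the paper): one equation in two unknowns admits many solutions that fit the data perfectly yet disagree on inputs the data never constrains.

```python
import numpy as np

# One training constraint, two unknown weights: w1 + 2*w2 = 3.
# Both weight vectors below satisfy it exactly (zero training error).
x_train = np.array([1.0, 2.0])
w_a = np.array([1.0, 1.0])   # 1 + 2*1 = 3
w_b = np.array([3.0, 0.0])   # 3 + 2*0 = 3
print(x_train @ w_a, x_train @ w_b)  # both print 3.0

# Yet they disagree on an input the training data never pins down.
x_new = np.array([0.0, 1.0])
print(x_new @ w_a, x_new @ w_b)      # 1.0 vs 0.0
```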

The solution is to be aware of the perils of underspecification, train multiple predictors, subject them to stress tests, and choose the best performer; in other words, expand the testing regime. All of this points to the need for better quality data in both the training and evaluation sets, which brings us to the use of blockchains and smart contracts to implement healthcare systems. Access to higher quality and more varied training data may reduce underspecification and hence create a pathway to better ML models, faster.
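One way to read that recommendation in code, again as a hypothetical sketch continuing the example above: keep all the near-equivalent candidates and re-rank them on a stress-test set that simulates the messiness of deployment, here crudely modeled as noisy inputs.

```python
import numpy as np

# Hypothetical stress test: corrupt the evaluation inputs with noise
# to mimic blurry images, untrained operators, equipment faults, etc.
rng = np.random.default_rng(0)
X_stress = X_eval + rng.normal(scale=1.0, size=X_eval.shape)

stress_scores = [model.score(X_stress, y_eval) for model in candidates]
# Candidates that looked interchangeable on iid data can separate here.
print("stress accuracies:", [round(s, 3) for s in stress_scores])
deploy_model = candidates[stress_scores.index(max(stress_scores))]
```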


