DataRobot, Google AutoML, H2O.ai… Automated machine learning solutions have multiplied in recent years. Their ambition? To automate the creation of machine learning models.
A new AI buzzword, automated machine learning (auto ML) promises, as its name suggests, to automate the creation of machine learning models. In recent years, many players have staked claims in this new Wild West. In 2012, the pure players DataRobot and H2O.ai led the way. Google followed suit in 2018 by launching its AutoML cloud service. In 2019, it was Microsoft’s turn with Azure ML, and AWS’s with SageMaker Autopilot. At the same time, data science studios are getting involved by integrating this dimension into their offerings. This is the case for France’s Dataiku, Germany’s Knime, and America’s RapidMiner (read the comparison of automated machine learning tools). Can these solutions claim to replace the work of the data scientist? Partly, yes, but for the moment they are above all seen as tools that assist the data scientist in implementing projects.
“Automated machine learning allows a data analyst to quickly create a simple model, for example an image classifier, by training it on a data set of labelled photos,” explains Didier Gaultier, director of data science & AI at Business & Decision, an Orange subsidiary specializing in data. How does an auto ML environment operate? Given a problem to solve (a financial prediction, predictive maintenance, image recognition, etc.), it starts by selecting several candidate algorithms. It then trains them, as described above, on a predefined training data set. Via a scoring layer, it then compares their results across several combinations of hyperparameters. In the case of deep learning, these correspond, for example, to the number of layers in the neural network and the number of nodes in each layer. Given the target objective to be reached, the best-performing model is ultimately retained.
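The selection loop described above can be sketched in a few lines of Python. This is an illustrative toy, not the API of any tool named in the article: the candidate model, its hyperparameter grid, and the scoring function are all assumptions made for the example.

```python
import itertools

class ThresholdClassifier:
    """Toy candidate model: predicts 1 when x exceeds a learned threshold."""
    def __init__(self, scale=1.0):
        self.scale = scale          # hyperparameter to be tuned
        self.threshold = 0.0
    def fit(self, data):
        xs = [x for x, _ in data]
        self.threshold = self.scale * sum(xs) / len(xs)
    def predict(self, x):
        return 1 if x > self.threshold else 0

def accuracy(model, data):
    return sum(model.predict(x) == y for x, y in data) / len(data)

def automl_select(candidates, train, holdout, score):
    """Train every (algorithm, hyperparameter) combination and keep
    the combination that scores best on the held-out set."""
    best = None
    for name, factory, grid in candidates:
        keys = list(grid)
        for values in itertools.product(*(grid[k] for k in keys)):
            params = dict(zip(keys, values))
            model = factory(**params)     # build one candidate
            model.fit(train)              # train it
            s = score(model, holdout)     # score it on held-out data
            if best is None or s > best[0]:
                best = (s, name, params)
    return best

# Synthetic labelled data: label is 1 for the larger values of x.
data = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 1), (5, 1)]
best_score, best_name, best_params = automl_select(
    [("threshold", ThresholdClassifier, {"scale": [0.5, 1.0, 1.5]})],
    train=data, holdout=data, score=accuracy)
```

Real auto ML platforms add cross-validation, early stopping, and smarter search strategies than this exhaustive grid, but the select–train–score–retain skeleton is the same.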
If the problem to be solved becomes more complex, automated machine learning quickly shows its limits. “These solutions are very efficient for benchmarking classic supervised learning models: linear regression, decision tree, random forest, support vector machine,” notes Aymen Chakhari, director of AI at the French IT services company Devoteam. “They provide the data scientist with reliability scores on this perimeter, saving time and reducing time to value. But to achieve a satisfactory result on complex predictions, for example in econometrics, pharmaceutical research or fraud detection, the models must be customized or even combined. Take the example of computer attack detection systems. Some attacks change signatures in real time, so they can no longer be detected via classic algorithms. To spot them, we have to turn to semi-supervised or unsupervised approaches that require the intervention of a data scientist.”
Another textbook case: in early June, Devoteam published a model of the evolution of Covid-19 in France. “The curve of the epidemic will flatten, with the number of deaths reaching 30,293 in France on July 15, against 29,021 on June 4,” the company predicted. The result: by mid-July, the number of deaths from Covid-19 had risen to 30,120 in France, i.e., an accuracy of 99.42% compared to the initial projection. To achieve this precision, the Levallois-Perret company first tried several automated machine learning technologies (Azure ML, Google AutoML, H2O.ai, Knime, and RapidMiner). “With these tools, we obtained predictions on the order of 80,000 to 120,000 deaths for that date, with estimated accuracies of around 96%, which misses the point. But we knew a big gap was likely,” Aymen Chakhari admits.
Why? Because two key ingredients were missing to solve the equation. The first: the need to handle a wide variety of data (rate of spread of the virus, compliance with barrier gestures, the behaviour of French people in public places, etc.). “Automated machine learning did show that random forest provided good precision, but this algorithm could not cope with the diversity of content to be processed,” explains Aymen Chakhari. “We had to combine it with multi-armed bandit reinforcement learning.” Auto ML is incapable of reaching such a degree of engineering, in particular of assembling models.
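The multi-armed bandit technique Chakhari mentions can be illustrated with a minimal epsilon-greedy sketch. This is the generic textbook version, not Devoteam’s actual model: the two Bernoulli arms and their success rates are invented for the example.

```python
import random

def epsilon_greedy_bandit(reward_fns, steps=1000, epsilon=0.1, seed=42):
    """Minimal epsilon-greedy multi-armed bandit: with probability epsilon
    pull a random arm (explore), otherwise pull the arm with the best
    running average reward so far (exploit)."""
    rng = random.Random(seed)
    counts = [0] * len(reward_fns)     # pulls per arm
    values = [0.0] * len(reward_fns)   # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(reward_fns))                       # explore
        else:
            arm = max(range(len(reward_fns)), key=values.__getitem__)  # exploit
        reward = reward_fns[arm](rng)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    return counts, values

# Two arms with hypothetical success rates of 0.3 and 0.7: the bandit
# should concentrate its pulls on the second, better arm.
arms = [lambda rng: 1.0 if rng.random() < 0.3 else 0.0,
        lambda rng: 1.0 if rng.random() < 0.7 else 0.0]
counts, values = epsilon_greedy_bandit(arms)
```

In an ensemble setting, each “arm” would be a candidate model rather than a Bernoulli payout, and the reward would be its prediction quality on incoming data.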
The second added value provided by Devoteam’s data scientists: feature engineering. This step consists of selecting the variables fed into the algorithm and their respective weights. “Feature engineering is not a science; it’s more of an art, a matter of feel,” says Aymen Chakhari.
Automated machine learning can detect the types of variables in a data set: timestamp, text, discrete or continuous numeric variable, URL, date, and so on. “The question remains how to code them. One can, for example, define a speed value in ranges: between 0 and 20 km/h, 21 and 60 km/h, 61 and 80 km/h, etc. But if the ranges are coded differently, the model’s result will not be the same,” points out Didier Gaultier at Business & Decision. “Auto ML tools are unable to do this job. They just do brute-force recoding, computing all the coding possibilities of a variable with itself, which is not very efficient and requires considerable computing resources.” However, auto ML does make it possible to pinpoint the 20% of significant variables and give clues as to the weights to associate with them. “Variables that the data scientist will then have to recode by hand,” adds Didier Gaultier.
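Gaultier’s speed example can be made concrete with a short sketch. The first set of bin edges comes from the ranges he quotes; the alternative edges are hypothetical, chosen only to show that the same raw values produce different model inputs under a different coding.

```python
def bin_speed(speed_kmh, edges=(20, 60, 80)):
    """Map a speed in km/h to the index of its range.
    With the default edges: 0 -> 0-20, 1 -> 21-60, 2 -> 61-80, 3 -> above 80."""
    for i, edge in enumerate(edges):
        if speed_kmh <= edge:
            return i
    return len(edges)

speeds = [12, 45, 75, 130]
coded_a = [bin_speed(s) for s in speeds]                       # ranges from the article
coded_b = [bin_speed(s, edges=(40, 90, 120)) for s in speeds]  # alternative coding
# The same raw speeds now carry different codes, so a model trained on
# coded_a and one trained on coded_b will not produce the same result.
```

The hard part auto ML cannot automate is choosing edges that are meaningful for the business problem, which is precisely the hand-recoding Gaultier describes.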
The weight to be allocated to variables is another dimension that automated machine learning struggles to grasp. “Take the modeling we carried out to estimate the evolution of French GDP following the Covid-19 crisis. The algorithm was trained on data from INSEE, Statista, the Bank of France, and the OECD, covering the years 2017, 2018, and 2019,” says Aymen Chakhari. Iterative feature engineering operations were then carried out to accurately weight the impact of the chosen parameters according to the context: job losses (with a deliberately higher score), job creation, household consumption, the contribution of economic sectors, and so on.
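The hand-weighting described above can be sketched as a simple weighted linear score. The feature names follow the article, but the weight values are hypothetical placeholders, not Devoteam’s actual numbers.

```python
# Hypothetical analyst-chosen weights (not the real model's values);
# job_losses is deliberately weighted higher, as the article describes.
FEATURE_WEIGHTS = {
    "job_losses": 0.45,
    "job_creation": 0.15,
    "household_consumption": 0.25,
    "sector_contribution": 0.15,
}

def weighted_score(features, weights=FEATURE_WEIGHTS):
    """Linear combination of normalized feature values and their weights."""
    return sum(weights[name] * value for name, value in features.items())

# Usage: score one hypothetical scenario (feature values normalized to [0, 1]).
score = weighted_score({
    "job_losses": 0.8,
    "job_creation": 0.2,
    "household_consumption": 0.5,
    "sector_contribution": 0.4,
})
```

In practice, the weighting is iterated: the data scientist adjusts the weights, re-scores, and compares against known outcomes, which is exactly the context-dependent judgment auto ML tools cannot yet automate.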
This last example shows how far automated machine learning still has to go before it can take a business context into account and provide a coherent, adapted response, however complex the problem. “We will have to wait another ten years before auto ML enters the era of business understanding. To get there, it will have to go through techniques such as reinforcement learning or generative adversarial networks,” anticipates Aymen Chakhari.