Use Amazon’s AutoGluon to predict loan default
Published in · 10 min read · Jan 18, 2020
On January 9, 2020, Amazon introduced AutoGluon, an open-source library that empowers developers to easily build Automatic Machine Learning (AutoML) models. AutoML has recently become a hot topic in machine learning, and many high-tech companies, including Microsoft, Google, and Facebook, have introduced their own AutoML toolkits. AutoML applies machine learning to the model-building process itself, which allows data scientists to automate ML tasks, and it offers high accuracy, simple deployment, and time efficiency. AutoML is also very versatile, handling tasks such as image classification, language processing, and tabular prediction. According to Gartner, over 40% of data science tasks were expected to be automated by 2020.
Amazon has a unique vision and high expectations for AutoGluon. As Amazon puts it, the library's purpose is to “democratise machine learning, and make the power of deep learning available to all developers.” Several features suggest that Amazon is indeed “democratizing” the formidable task of machine learning and making it accessible to anyone:
- AutoGluon can easily train and deploy high-accuracy models.
- It only requires a few lines of code.
- AutoGluon can be customized toward specific use cases.
- The library utilizes automatic hyperparameter tuning, model selection, architecture search, and data processing.
The last feature on this list warrants special attention, as it foreshadows a future in which data science is largely automated and models no longer need to be built and tuned manually. It is understandable that some data science experts have started to worry whether they will be replaced by AutoML in the near future.
In this article, my colleague Yiyang Zhang and I will test the potential of AutoGluon and compare its performance with other popular ML methods. Our benchmarks will be the classic random forest and XGBoost (extreme gradient boosting), which has been a prevailing technique for supervised learning. XGBoost features optimal allocation of computing resources, which contributes to its fast execution speed and very high accuracy, and it is widely considered the go-to algorithm among winners of Kaggle data competitions.
The main task to compare model performance will be loan default prediction, which involves predicting whether a person with given features would default on a bank loan. This task has been one of the most popular data science topics for a long time.
Our tabular dataset was obtained from Kaggle. It consists of 30,000 observations of individuals currently in debt and their basic information, including ID, sex, education, marriage, age, and given credit. The dataset also contains the repayment status, bill statement amount, and previous payment amount for six consecutive months, as well as a default indicator for the following month. The original dataset was already very clean. We dummy coded the categorical variables (which XGBoost requires), i.e., converted each categorical variable into a series of dichotomous variables with levels of only 0 and 1.
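As an illustration of the dummy-coding step, pandas provides `pd.get_dummies` (the column names below are hypothetical stand-ins, not the exact columns of the credit-card dataset):

```python
import pandas as pd

# Toy frame standing in for the credit-card dataset (illustrative names)
df = pd.DataFrame({
    "EDUCATION": ["graduate", "university", "high_school", "university"],
    "AGE": [24, 35, 41, 29],
})

# Convert the categorical column into 0/1 indicator columns;
# drop_first avoids perfect collinearity among the dummies
dummied = pd.get_dummies(df, columns=["EDUCATION"], drop_first=True)
print(sorted(dummied.columns))
```

Each level of `EDUCATION` (minus the dropped baseline) becomes its own 0/1 column, which is the representation XGBoost expects.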
Our team decided to use random forest as the first benchmark. It is an ensemble learning method for classification. The model randomly samples training data and creates subsets of features in order to build hundreds or thousands of decision trees, then aggregates all the trees together. Random forest has been one of the most popular ML methods because of its high accuracy.
In our test, we set the number of trees to be 100 and the criterion to be “gini”, which stands for gini impurity:
# Random Forest
rm = RandomForestClassifier(n_estimators=100, criterion='gini')
rm.fit(X_train, y_train)
preds = rm.predict(X_test)
The confusion matrix and feature importance are summarized below. We have achieved an accuracy score of 81.77% using random forest.
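The confusion matrix and accuracy score reported here can be computed with scikit-learn's metrics module; a minimal sketch (the labels below are made up for illustration, not our actual predictions):

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical true labels and model predictions (1 = default)
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 0, 1, 0, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)  # rows: actual class, cols: predicted class
acc = accuracy_score(y_true, y_pred)
print(cm)
print(f"accuracy = {acc:.2%}")
```

The diagonal of the matrix counts correct predictions; accuracy is simply their share of all predictions.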
XGBoost is a successor of random forest and is considered the crown jewel of decision-tree-based ensemble ML algorithms. It was originally developed at the University of Washington in 2016 by Tianqi Chen and Carlos Guestrin, and it quickly gained enormous popularity in the data science industry. XGBoost improves upon gradient boosting machines through systems optimization and algorithmic enhancements.
In a test by data scientist Vishal Morde, XGBoost performed significantly better than traditional algorithms in terms of both accuracy and computing efficiency. The result is shown in the following graph:
Since XGBoost excels at prediction problems with tabular data, we believe it can be a strong contender against AutoGluon in this case. We ran XGBoost to build the model for loan default prediction:
# XGBoost Classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
model = xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                          colsample_bynode=1, colsample_bytree=1, gamma=0,
                          learning_rate=0.3, max_delta_step=0, max_depth=3,
                          min_child_weight=1, missing=None, n_estimators=10, n_jobs=1,
                          nthread=None, objective='binary:logistic', random_state=0,
                          reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=123,
                          verbosity=1)
model.fit(X_train, y_train)
preds = model.predict(X_test)
XGBoost helped us achieve an accuracy score of 82.00%. The following graphs represent the confusion matrix and feature importance.
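The feature-importance chart comes from the `feature_importances_` attribute that tree-ensemble classifiers expose. A sketch using scikit-learn's GradientBoostingClassifier on synthetic data as a stand-in (XGBoost's sklearn wrapper exposes the same attribute):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the loan data: 6 features, binary default label
X, y = make_classification(n_samples=500, n_features=6, random_state=123)
model = GradientBoostingClassifier(n_estimators=50, random_state=123).fit(X, y)

# Importances are normalized to sum to 1; rank features from most to least informative
importances = model.feature_importances_
ranked = np.argsort(importances)[::-1]
print(ranked, importances.round(3))
```

Plotting these values as a bar chart produces the kind of feature-importance graph shown for our models.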
As previously mentioned, Amazon claims that AutoGluon performs exceptionally well in various data tasks that involve image, text, and tabular data. In our study, we focused on predicting with tabular data. Unlike random forest and XGBoost, AutoGluon takes care of model selection and hyperparameter tuning. The first step it takes is to use ML to determine several models that would be appropriate for the given task. After the models are determined, AutoGluon utilizes ML again to optimize the hyperparameters of each model. For example, if AutoGluon “thinks” random forest is a good model for the task, it will then decide the number of decision trees in the forest, as well as the number of features considered when splitting a node. The following is our code for AutoGluon:
# AutoGluon Classifier
predictor = task.fit(train_data=train_data, label=label_column, output_directory=dir)
y_pred = predictor.predict(test_data_nolab)
In our test, AutoGluon utilized the following models. The performance of each model and the corresponding computing time are summarized in the following two graphs:
The final accuracy score of AutoGluon is 82.96%, an encouraging result. We have also created the confusion matrix for the model. Note that the accuracy can be improved further by configuring the model, which requires more advanced hardware.
However, the computing time is much longer, compared with the previous two models. It is also worth mentioning that when we executed the command, we kept receiving the following warning:
Warning: Model is expected to require 25.288658239618677 percent of available memory...
The computer on which we ran the code has above-average computing power in terms of both memory and CPU, so AutoGluon appears to be extremely compute-intensive.
To evaluate the three models, we consider not just model performance based on accuracy score and AUC score, but also computing time and difficulty of model building. We believe all of them are essential in a standard data science project. The result is summarized below:
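The AUC scores in our comparison can be computed from each model's predicted default probabilities with scikit-learn's `roc_auc_score`; a minimal sketch with illustrative values:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted default probabilities
y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]

# AUC = probability that a random defaulter is scored higher than a random non-defaulter
auc = roc_auc_score(y_true, y_prob)
print(f"AUC = {auc:.3f}")
```

Unlike accuracy, AUC is insensitive to the classification threshold, which makes it a useful complement when comparing models on imbalanced data such as loan defaults.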
AutoGluon has the best model accuracy in terms of both the AUC score and the accuracy score. Besides, we found that AutoGluon was extremely easy to build with, as it only took a few lines of code and did not require any hyperparameter tuning. However, one of its main disadvantages is that it takes much longer than XGBoost to compute. Depending on specific demands, a trade-off between time and accuracy has to be made when choosing models. In spite of that, we are very confident in AutoGluon’s potential for automated ML tasks. It represents a turning point for the data science industry: not just data experts, but also people without coding skills, can now build effective models to meet their various needs.
So will AutoGluon and other AutoML eventually replace data scientists? The answer is mixed: some tasks will be fully automated for sure, leading to less demand for data scientists in the future. Specifically, job functions that focus on model tuning and optimizing will more likely be replaced by AutoML. According to Justin Ho, a famous Chinese data scientist, those who only possess the skills of model tuning will lose their competitive advantage as AutoML reaches a more mature stage in the future. However, we still doubt that it can fully replace data scientists for five major reasons:
- In unsupervised learning, there is no clear measure to assess the quality of results. Unsupervised learning is considered to be successful as long as it provides information that can be used in further analysis. Besides, the process requires significant domain knowledge. AutoML cannot provide sufficient help in this step.
- No effective AutoML has been developed for reinforcement learning as of today, according to Marcia Oliveira, a Senior Data Scientist at Skim Technologies.
- AutoML cannot deal with complex data such as network data and web data very well.
- Very few AutoML can handle feature engineering.
- AutoML cannot fully automate the process of conceiving business insights and implementing decision making.
The following graph illustrates our prediction of what AutoML can accomplish in a standard project workflow in the future. We believe only a limited fraction of the tasks can and will be automated. Please note that the deeper the blue is, the more likely the task will be automated.
After all, data science is an art that requires the combination of business insights and data techniques to address problems in the real world. While AutoML can replace humans in specific processes, there are many other areas that need data scientists’ creativity and domain knowledge. No matter how “smart” AutoML is, it is only a tool to help achieve greater model performance. Besides, it aims to relieve data scientists from tedious “chores” such as hyperparameter tuning and data cleaning, allowing them to focus on other, more important tasks. For this reason, AutoML should be considered a friend of data scientists instead of an enemy. As data scientists, we should be aware of the trend of AutoML’s development and reinforce the abilities of ours that cannot be fully replaced by it. Most importantly, we should embrace AutoML’s power by continuing to unveil its true potential and making it better serve humanity.
Although Amazon’s AutoGluon seems to be the best choice in our test, there are still several caveats that need to be noticed:
- Our test has only shown that AutoGluon is better than XGBoost for this specific task on the given data. That is, AutoGluon may not be the best choice for another dataset or a different kind of task.
- While AutoGluon is slightly better than XGBoost in terms of accuracy, it is considerably more compute-intensive most of the time.
- The impressive performance of AutoGluon may be the result of pure luck.
Shivani. (Aug 2, 2019). The Growth Of Automated Machine Learning. https://www.cisin.com/coffee-break/technology/the-growth-of-automated-machine-learning-automl.html
Ambika Choudhury. (Jan 14, 2019). How Amazon AutoGluon Can Automate Deep Learning Models with Just A Few Lines of Codes. https://analyticsindiamag.com/how-amazons-autogluon-can-automate-deep-learning-models-with-just-a-few-lines-of-codes/
Jason Brownlee. (Aug 21, 2019). A Gentle Introduction to XGBoost for Applied Machine Learning. https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/
Will Koehrsen. (Aug 30, 2018). An Implementation and Explanation of the Random Forest in Python. https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76
Vishal Morde. (Apr 7, 2019). XGBoost Algorithm: Long May She Reign! https://towardsdatascience.com/https-medium-com-vishalmorde-xgboost-algorithm-long-she-may-rein-edd9f99be63d
Marcia Oliveira. (Mar 2019). 3 Reasons Why AutoML Won’t Replace Data Scientists Yet. https://www.kdnuggets.com/2019/03/why-automl-wont-replace-data-scientists.html
UCI Machine Learning. Default of Credit Card Clients Dataset. https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset