Analysing the overfit of the auto-sklearn automated machine learning tool.

Fabris, Fabio and Freitas, Alex A. (2020) Analysing the overfit of the auto-sklearn automated machine learning tool. In: Machine Learning, Optimization, and Data Science 5th International Conference. Lecture Notes in Computer Science. Springer, Cham, Switzerland, pp. 508-520. ISBN 978-3-030-37598-0. E-ISBN 978-3-030-37599-7. (doi:10.1007/978-3-030-37599-7_42) (The full text of this publication is not currently available from this repository. You may be able to access a copy if URLs are provided) (KAR id:79931)

Official URL: https://dx.doi.org/10.1007/978-3-030-37599-7_42

Abstract

With the ever-increasing number of pre-processing and classification algorithms, manually selecting the best algorithms and their best hyper-parameter settings (i.e. the best classification workflow) is a daunting task. Automated Machine Learning (Auto-ML) methods have recently been proposed to tackle this issue. Auto-ML tools aim to automatically choose the best classification workflow for a given dataset. In this work we analyse the predictive accuracy and overfit of the state-of-the-art auto-sklearn tool, which iteratively builds a classification ensemble optimised for the user’s dataset. This work has three contributions. First, we measure three types of overfit in auto-sklearn, involving differences between predictive accuracies measured on different data subsets: two parts of the training set (for learning and internal validation of the model) and the hold-out test set used for final evaluation. Second, we analyse the distribution of types of classification models selected by auto-sklearn across all 17 datasets. Third, we measure correlations between predictive accuracies on different data subsets and different types of overfitting. Overall, substantial degrees of overfitting were found in several datasets, and decision tree ensembles were the most frequently selected types of models.
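
The abstract describes overfit as differences between predictive accuracies on three data subsets (learning, internal validation, and hold-out test). The sketch below illustrates one plausible reading of that idea: the three pairwise accuracy gaps. The exact definitions used in the paper are not given in this record, so the pairing, the names `accuracy`, `OverfitGaps`, and `overfit_gaps`, and the example accuracies are all assumptions made for illustration. It uses only the Python standard library and does not call auto-sklearn.

```python
from dataclasses import dataclass


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


@dataclass
class OverfitGaps:
    """Pairwise accuracy gaps (assumed definitions, not taken from the paper)."""
    learn_minus_valid: float  # learning set vs. internal validation set
    valid_minus_test: float   # internal validation set vs. hold-out test set
    learn_minus_test: float   # learning set vs. hold-out test set


def overfit_gaps(acc_learn, acc_valid, acc_test):
    """Return the three pairwise gaps; larger positive values suggest more overfit."""
    return OverfitGaps(
        learn_minus_valid=acc_learn - acc_valid,
        valid_minus_test=acc_valid - acc_test,
        learn_minus_test=acc_learn - acc_test,
    )


if __name__ == "__main__":
    # Hypothetical accuracies for a single dataset, chosen only to show the call.
    gaps = overfit_gaps(acc_learn=0.98, acc_valid=0.90, acc_test=0.85)
    print(gaps)
```

Under these assumed definitions, a large gap between validation and test accuracy would indicate that the model selection step itself has fitted the internal validation data, which is the kind of effect the abstract reports for several datasets.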

Item Type:	Book section
DOI/Identification number:	10.1007/978-3-030-37599-7_42
Uncontrolled keywords:	data mining, machine learning, classification, Auto-ML
Subjects:	Q Science > Q Science (General) > Q335 Artificial intelligence
Divisions:	Divisions > Division of Computing, Engineering and Mathematical Sciences > School of Computing
Depositing User:	Alex Freitas
Date Deposited:	03 Feb 2020 16:59 UTC
Last Modified:	05 Nov 2024 12:45 UTC
Resource URI:	https://kar.kent.ac.uk/id/eprint/79931 (The current URI for this page, for reference purposes)

University of Kent Author Information

Freitas, Alex A.

Creator's ORCID:	https://orcid.org/0000-0001-9825-4700