Interpreting machine learning pipelines produced by evolutionary AutoML for biochemical property prediction

de Sá, Alex G. C., Pappa, Gisele L., Freitas, Alex A., Ascher, D.B. (2025) Interpreting machine learning pipelines produced by evolutionary AutoML for biochemical property prediction. In: GECCO'25 Companion: Proceedings of the 2025 Genetic and Evolutionary Computation Conference Companion. ACM ISBN 979-8-4007-1464-1. (doi:10.1145/3712255.3734339) (KAR id:110941)

PDF: Author's Accepted Manuscript. Language: English
Official URL: https://doi.org/10.1145/3712255.3734339

Abstract

Machine learning (ML) has been playing a crucial role in drug discovery, mainly through quantitative structure-activity relationship models that relate molecular structures to properties such as absorption, distribution, metabolism, excretion, and toxicity (ADMET). However, traditional ML approaches often lack customisation to a particular biochemical task and fail to generalise to new biochemical spaces, resulting in reduced predictive performance. Automated machine learning (AutoML) has emerged to address these limitations by automatically selecting suitable ML pipelines for a given input dataset. Despite its potential, AutoML is underutilised in cheminformatics, and its decisions often lack interpretability, reducing user trust, especially among non-experts. Accordingly, this paper proposes an evolutionary AutoML method for biochemical property prediction that outputs an interpretable model for understanding the evolved ML pipelines. It combines grammar-based genetic programming with Bayesian networks to guide the search and enhance the interpretability of the searched pipelines. The evaluation on 12 benchmark ADMET datasets showed that the proposed AutoML method obtained similar or better results than three existing methods. Additionally, the interpretable Bayesian network identified which of the components in the ML pipelines generated by the AutoML method (such as biochemical feature extraction methods, preprocessing techniques and ML algorithms) affect the pipelines' predictive performance.

Item Type:	Conference proceeding
DOI/Identification number:	10.1145/3712255.3734339
Uncontrolled keywords:	supervised machine learning, classification, evolutionary algorithms, estimation of distribution algorithms, bioinformatics
Subjects:	Q Science > Q Science (General) > Q335 Artificial intelligence
Institutional Unit:	Schools > School of Computing
Former Institutional Unit:	There are no former institutional units.
Funders:	University of Kent (https://ror.org/00xkeyj56)
Depositing User:	Alex Freitas
Date Deposited:	13 Aug 2025 09:04 UTC
Last Modified:	14 Aug 2025 14:54 UTC
Resource URI:	https://kar.kent.ac.uk/id/eprint/110941 (The current URI for this page, for reference purposes)

University of Kent Author Information

Freitas, Alex A.

Creator's ORCID:	https://orcid.org/0000-0001-9825-4700