Kent Academic Repository

Pre-processing feature selection for improved C&RT models for oral absorption

Newby, Danielle, Freitas, Alex A., Ghafourian, Taravat (2013) Pre-processing feature selection for improved C&RT models for oral absorption. Journal of Chemical Information and Modeling, 53 (10). pp. 2730-2742. ISSN 1549-9596. (doi:10.1021/ci400378j) (The full text of this publication is not currently available from this repository. You may be able to access a copy if URLs are provided) (KAR id:38778)

There are currently thousands of molecular descriptors that can be calculated to represent a chemical compound. Utilizing all molecular descriptors in Quantitative Structure–Activity Relationships (QSAR) modeling can result in overfitting, decreased interpretability, and thus reduced model performance. Feature selection methods can overcome some of these problems by drastically reducing the number of molecular descriptors and selecting the molecular descriptors relevant to the property being predicted. In particular, decision trees such as C&RT, although they have an embedded feature selection algorithm, can be inadequate since further down the tree there are fewer compounds available for descriptor selection, and therefore descriptors may be selected which are not optimal. In this work we compare two broad approaches for feature selection: (1) a “two-stage” feature selection procedure, where a pre-processing feature selection method selects a subset of descriptors, and then classification and regression trees (C&RT) selects descriptors from this subset to build a decision tree; (2) a “one-stage” approach where C&RT is used as the only feature selection technique. These methods were applied in order to improve prediction accuracy of QSAR models for oral absorption. Additionally, this work utilizes misclassification costs in model building to overcome the problem of the biased oral absorption data sets with more highly absorbed than poorly absorbed compounds. In most cases the two-stage feature selection with pre-processing approach had higher model accuracy compared with the one-stage approach. Using the top 20 molecular descriptors from the random forest predictor importance method gave the most accurate C&RT classification model. The molecular descriptors selected by the five filter feature selection methods have been compared in relation to oral absorption. 
In conclusion, the use of filter pre-processing feature selection methods together with misclassification costs produces models of oral absorption with better interpretability and predictive ability.
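The two-stage procedure described in the abstract can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the univariate filter in stage 1 is a simple stand-in (it is not the random forest predictor importance method or any of the five filter methods from the paper), the small tree builder is a simplified CART-style learner rather than the C&RT software used in the study, misclassification costs are approximated by class weights inside the Gini impurity, and the descriptor values, class labels, and cost values are all hypothetical.

```python
def weighted_gini(labels, costs):
    """Gini impurity in which each compound counts with the cost of its class."""
    totals = {}
    for label in labels:
        totals[label] = totals.get(label, 0.0) + costs[label]
    total = sum(totals.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((w / total) ** 2 for w in totals.values())


def best_split(X, y, features, costs):
    """Return (impurity, feature, threshold) of the best split, or None."""
    best = None
    all_weight = sum(costs[label] for label in y)
    for f in features:
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [lab for row, lab in zip(X, y) if row[f] <= thr]
            right = [lab for row, lab in zip(X, y) if row[f] > thr]
            wl = sum(costs[lab] for lab in left)
            wr = sum(costs[lab] for lab in right)
            imp = (wl * weighted_gini(left, costs)
                   + wr * weighted_gini(right, costs)) / all_weight
            if best is None or imp < best[0]:
                best = (imp, f, thr)
    return best


def filter_rank(X, y, costs, k):
    """Stage 1: score every descriptor on the full data set, keep the top k."""
    scores = []
    for f in range(len(X[0])):
        split = best_split(X, y, [f], costs)
        scores.append((split[0] if split else float("inf"), f))
    scores.sort()
    return [f for _, f in scores[:k]]


def build_tree(X, y, features, costs, depth=0, max_depth=2):
    """Stage 2: grow a small tree using only the allowed descriptors."""
    weights = {}
    for label in y:
        weights[label] = weights.get(label, 0.0) + costs[label]
    # Leaf prediction: class with the largest cost-weighted count.
    prediction = max(sorted(weights), key=lambda c: weights[c])
    split = None
    if depth < max_depth and len(weights) > 1:
        split = best_split(X, y, features, costs)
    if split is None:
        return {"predict": prediction}
    _, f, thr = split
    left = [i for i, row in enumerate(X) if row[f] <= thr]
    right = [i for i, row in enumerate(X) if row[f] > thr]
    return {
        "feature": f,
        "threshold": thr,
        "left": build_tree([X[i] for i in left], [y[i] for i in left],
                           features, costs, depth + 1, max_depth),
        "right": build_tree([X[i] for i in right], [y[i] for i in right],
                            features, costs, depth + 1, max_depth),
    }


def predict(tree, row):
    while "predict" not in tree:
        side = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree["predict"]


# Hypothetical toy data: three descriptors, six highly and two poorly absorbed compounds.
X = [[1.0, 5.0, 0.3], [1.2, 3.0, 0.9], [0.8, 4.0, 0.1], [1.5, 1.0, 0.7],
     [1.1, 2.0, 0.5], [0.9, 6.0, 0.2], [3.0, 2.5, 0.4], [3.4, 4.5, 0.6]]
y = ["high", "high", "high", "high", "high", "high", "low", "low"]
costs = {"high": 1.0, "low": 3.0}  # assumed cost ratio favouring the minority class

selected = filter_rank(X, y, costs, k=2)     # stage 1: pre-processing filter
tree = build_tree(X, y, selected, costs)     # stage 2: tree restricted to the subset
one_stage = build_tree(X, y, range(3), costs)  # one-stage baseline: all descriptors
```

The design point is that stage 1 ranks descriptors using every compound, whereas splits deeper in a tree see only the compounds that reached that node. On this toy data the two approaches yield the same tree; any difference would only show up on larger descriptor sets and deeper trees.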

Item Type: Article
DOI/Identification number: 10.1021/ci400378j
Uncontrolled keywords: data mining, machine learning, classification, decision tree, pharmacokinetics, drug oral absorption
Subjects: Q Science > Q Science (General) > Q335 Artificial intelligence
Q Science > QD Chemistry
Divisions: Divisions > Division of Computing, Engineering and Mathematical Sciences > School of Computing
Divisions > Division of Natural Sciences > Medway School of Pharmacy
Depositing User: Alex Freitas
Date Deposited: 14 Mar 2014 17:43 UTC
Last Modified: 16 Feb 2021 12:52 UTC

University of Kent Author Information

Freitas, Alex A.

Ghafourian, Taravat.
