
Open Access

Peer-reviewed

Research Article

Feature engineering with clinical expert knowledge: A case study assessment of machine learning model complexity and performance

Roles Conceptualization, Formal analysis, Methodology, Writing – review & editing

Affiliations Johns Hopkins Malone Center for Engineering in Healthcare, Johns Hopkins University, Baltimore, MD, United States of America, The Institute of Clinical and Translational Research, Johns Hopkins University, Baltimore, MD, United States of America


Roles Formal analysis, Validation, Writing – review & editing

Affiliations Johns Hopkins Malone Center for Engineering in Healthcare, Johns Hopkins University, Baltimore, MD, United States of America, Department of Computer Science, Johns Hopkins University Whiting School of Engineering, Baltimore, MD, United States of America

Roles Formal analysis, Writing – review & editing

Affiliation Division of Health Sciences Informatics, Johns Hopkins University School of Medicine, Baltimore, MD, United States of America

Roles Validation, Writing – review & editing

Affiliations Johns Hopkins Malone Center for Engineering in Healthcare, Johns Hopkins University, Baltimore, MD, United States of America, The Institute of Clinical and Translational Research, Johns Hopkins University, Baltimore, MD, United States of America, Division of Health Sciences Informatics, Johns Hopkins University School of Medicine, Baltimore, MD, United States of America, Division of General Internal Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, United States of America

Affiliation Division of General Internal Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, United States of America

Roles Writing – review & editing

Affiliation Johns Hopkins University Applied Physics Laboratory, Laurel, MD, United States of America

Roles Methodology, Writing – review & editing

Roles Conceptualization, Methodology, Project administration, Supervision, Validation, Writing – original draft, Writing – review & editing

* E-mail: [email protected]

Affiliations Johns Hopkins Malone Center for Engineering in Healthcare, Johns Hopkins University, Baltimore, MD, United States of America, The Institute of Clinical and Translational Research, Johns Hopkins University, Baltimore, MD, United States of America, Division of Health Sciences Informatics, Johns Hopkins University School of Medicine, Baltimore, MD, United States of America, Division of General Internal Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, United States of America, Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, United States of America

  • Kenneth D. Roe, 
  • Vibhu Jawa, 
  • Xiaohan Zhang, 
  • Christopher G. Chute, 
  • Jeremy A. Epstein, 
  • Jordan Matelsky, 
  • Ilya Shpitser, 
  • Casey Overby Taylor


  • Published: April 23, 2020
  • https://doi.org/10.1371/journal.pone.0231300


Incorporating expert knowledge at the time machine learning models are trained holds promise for producing models that are easier to interpret. The main objectives of this study were to use a feature engineering approach to incorporate clinical expert knowledge prior to applying machine learning techniques, and to assess the impact of the approach on model complexity and performance. Four machine learning models were trained to predict mortality with a severe asthma case study. Experiments to select fewer input features based on a discriminative score showed low to moderate precision for discovering clinically meaningful triplets, indicating that discriminative score alone cannot replace clinical input. When compared to baseline machine learning models, we found a decrease in model complexity with use of fewer features informed by discriminative score and filtering of laboratory features with clinical input. We also found a small difference in performance for the mortality prediction task when comparing baseline ML models to models that used filtered features. Encoding demographic and triplet information in ML models with filtered features appeared to show performance improvements from the baseline. These findings indicated that the use of filtered features may reduce model complexity with little impact on performance.

Citation: Roe KD, Jawa V, Zhang X, Chute CG, Epstein JA, Matelsky J, et al. (2020) Feature engineering with clinical expert knowledge: A case study assessment of machine learning model complexity and performance. PLoS ONE 15(4): e0231300. https://doi.org/10.1371/journal.pone.0231300

Editor: Ozlem Uzuner, George Mason University, UNITED STATES

Received: June 25, 2018; Accepted: March 20, 2020; Published: April 23, 2020

Copyright: © 2020 Roe et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Data underlying the study belong to a third party, MIMIC III. Data are available on request from https://mimic.physionet.org for researchers who meet the criteria for access to confidential data. The authors confirm that they did not have any special access to this data.

Funding: This work was funded in part by the Biomedical Translator Program initiated and funded by NCATS (NIH awards 1OT3TR002019, 1OT3TR002020, 1OT3TR002025, 1OT3TR002026, 1OT3TR002027, 1OT2TR002514, 1OT2TR002515, 1OT2TR002517, 1OT2TR002520, 1OT2TR002584). Any opinions expressed in this manuscript are those of co-authors who are members of the Translator community and do not necessarily reflect the views of NCATS, individual Translator team members, or affiliated organizations and institutions.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Improved access to large longitudinal electronic health record (EHR) datasets through secure open data platforms [ 1 ] and the use of high-performance infrastructure [ 2 ] are enabling applications of sophisticated machine learning (ML) models in decision support systems for major health care practice areas. Areas with recent successes include early detection and diagnosis [ 3 , 4 ], treatment [ 5 , 6 ], and outcome prediction and prognosis evaluation [ 7 , 8 ]. Relying on ML models trained on large EHR datasets, however, may lead to implementing decision-support systems as black boxes—systems that hide their internal logic from the user [ 9 ]. A recent survey of methods for explaining black box models highlights two main inherent risks [ 10 ]: (1) using decision support systems that we do not understand, thus impacting health care provider and institution liability; and (2) a risk of inadvertently making wrong decisions, learned from spurious correlations in the training data. This work takes a feature engineering approach that incorporates clinical expert knowledge in order to bias the ML algorithms away from spurious correlations and towards meaningful relationships.

Severe asthma as a case study

We explored severe asthma as a case study given the multiple limitations of current computational methods to optimize asthma care management. Documented limitations include: the low prediction accuracy of existing approaches to project outcomes for asthma patients, limitations with communicating the reasons why patients are at high risk, difficulty explaining the rules and logic inside an approach, and a lack of causal inference capability to provide clear guidance on which patients could safely be moved off care management [ 11 ]. Incorporating clinical expert knowledge at the time that computational models are trained may help to overcome these limitations.

Expert clinical knowledge and model performance

Incorporating expert knowledge into the computational model-building process has the potential to produce ML models that show performance improvements. One previous study, for example, found that including known risk factors of heart failure (HF) as features during training yielded the greatest improvement in the performance of models to predict HF onset [ 12 ]. Different from that approach, we use a feature engineering approach to incorporate clinical expert knowledge.

Our feature engineering approach involved first extracting triplets from a longitudinal clinical data set, ranking those triplets according to a discriminative score, and then filtering those triplets with input from clinical experts. Triplets explored in this work were laboratory results and their relationship to clinical events such as medical prescriptions (i.e., lab-event-lab triples).

The goal of this research was to apply the feature engineering approach with a severe asthma case study and to assess model performance for a range of ML approaches: gradient boosting [ 13 ], neural network [ 14 ], logistic regression and k-nearest neighbor. Non-zero coefficients were assessed as a metric of model complexity for two ML approaches: logistic regression and gradient boosting.

For each ML model, we conducted several experiments to understand the impact of ranking features based upon discriminative score and of filtering features with clinical input on model complexity and performance. To assess performance, we used measures of model accuracy and fidelity. Experiments were completed with a case study of patients with severe asthma in the MIMIC-III [ 1 ] dataset for a mortality prediction task.

Discovering triplets from longitudinal clinical data

First, we discovered triplets, defined as a lab-event-lab sequence where the value of a laboratory result is captured before and after a clinical event. These triplets occur within the context of an ICU stay. Clinical events captured in this study were medication prescriptions and clinical procedures. The ranking step used an information theoretic approach to calculate and associate a discriminative score for triplets. The filtering step involved input from clinical experts who filtered out triplets that were not considered relevant to asthma. The final list of ranked and filtered laboratory results were used to select or weight features in a range of machine learning models.

In order to discover triplets, laboratory results were pre-processed as follows:

  • Laboratory values were cleaned by merging laboratory result names according to the approach described in ref [ 17 ]. That work provided a file outlining bundled laboratory names (e.g., heart rate) that grouped name variations (e.g., pulse rate), abbreviations (e.g., HR) and misspellings (e.g., heat rate) of the same concept [ 18 ]. In addition, there were circumstances where laboratory values consisted of both numerical and textual representations. In those cases, we converted the textual values to numbers according to simple rules (e.g., “≤1” converted to “1”). Many laboratory result entries had values such as “error” which could not be converted. In those instances, entries were ignored.
  • Laboratory values were divided into a finite number of bins. Bin boundaries were defined by a clinical expert familiar with the normal ranges of each laboratory test. For tests where normal ranges were unknown, six dividers were defined based upon mean and standard deviation (i.e., μ − 2 σ , μ − σ , μ − σ /2, μ + σ /2, μ + σ , μ + 2 σ ). A minimal sketch of this binning step follows the list.
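As a rough illustration of the binning step, the sketch below (Python with NumPy) bins a handful of hypothetical laboratory values. The function names and example values are ours, not from the study, and expert-defined boundaries would be supplied wherever normal ranges are known.

```python
import numpy as np

def default_bin_edges(values):
    # Fallback dividers when normal ranges are unknown:
    # mu - 2s, mu - s, mu - s/2, mu + s/2, mu + s, mu + 2s
    mu, sigma = np.mean(values), np.std(values)
    return [mu - 2 * sigma, mu - sigma, mu - sigma / 2,
            mu + sigma / 2, mu + sigma, mu + 2 * sigma]

def bin_lab_values(values, expert_edges=None):
    # Expert-defined boundaries take precedence; otherwise fall back to the
    # mean/standard-deviation dividers described above.
    edges = expert_edges if expert_edges is not None else default_bin_edges(values)
    return np.digitize(values, bins=edges)

# Hypothetical lab values with no expert-defined boundaries.
labs = np.array([82.0, 95.0, 110.0, 250.0, 60.0])
print(bin_lab_values(labs))
```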

Next, triplets were discovered according to the following steps:

  • Laboratory value bins before and after a clinical event (i.e., a lab-event-lab triplet) were captured. A laboratory result could involve different clinical events—resulting in multiple triplets. In addition, each patient in our dataset could have multiple triplets. The amount of time between the clinical event and lab measurement also varies depending upon the lab, so the lab test time durations before and after each event were calculated. For each lab test, the time duration before an event was defined as the time immediately after the prior lab test until (and including) the time of the event. The duration after an event was defined as the time immediately after the event until the next lab measure occurred. The start time for the first recorded lab measure for an individual was defined as the start time of the ICU stay. Similarly, the end time for the last recorded lab measure was defined as the end time of the ICU stay.
  • Lab-event-lab triplets were categorized as no change, decreasing or increasing by assessing the laboratory value bin before and after the anchoring clinical event.
  • Cross tabulations were then performed for each triplet category (no change, decreasing, increasing) for two patient sub-groups (patients who died and patients who did not die).
  • Triplets with cross tabulation values of 10 or fewer were excluded from further analysis. We did this because small counts cannot reliably determine whether there is a statistically meaningful relationship. A sketch of these steps follows this list.
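The sketch below illustrates the categorization and cross-tabulation steps on a few hypothetical triplet instances using pandas; the column names, data, and record structure are illustrative assumptions, not the study's actual data model.

```python
import pandas as pd

# Toy triplet instances (hypothetical): one row per patient per lab-event-lab occurrence,
# with the lab's bin index before and after the anchoring clinical event.
triplets = pd.DataFrame({
    "lab":        ["glucose", "glucose", "glucose", "glucose"],
    "event":      ["albuterol"] * 4,
    "bin_before": [2, 3, 1, 4],
    "bin_after":  [2, 1, 3, 4],
    "died":       [0, 1, 1, 0],
})

# Categorize each instance as no change / decreasing / increasing.
def categorize(row):
    if row["bin_after"] == row["bin_before"]:
        return "no change"
    return "decreasing" if row["bin_after"] < row["bin_before"] else "increasing"

triplets["category"] = triplets.apply(categorize, axis=1)

# 2 x 3 cross tabulation (died vs. survived by category) for each lab-event pair.
for (lab, event), grp in triplets.groupby(["lab", "event"]):
    table = pd.crosstab(grp["died"], grp["category"])
    # Triplets with 10 or fewer instances would be excluded in the real pipeline;
    # the toy data above is far below that threshold and only shows the table shape.
    print(lab, event, "n =", len(grp))
    print(table)
```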

Ranking and filtering laboratory result features

We used an information theoretic approach to calculate discriminative scores for triplets and used those scores to rank and filter laboratory result features. In particular, we calculated a mutual information [ 19 ] score, MI score (Eq 1), to rank triplets that may be estimators of mortality. The MI score was chosen as a simple and fast measure that can be used to highlight triplets that may be relevant for clinical experts. The MI score is a measure of the marginal association of each triplet category with the patient mortality sub-group; thus it is based on a table with three columns (no change, increase, decrease) and two rows (died or survived). An illustration of MI score calculations for triplets is shown in Table 1.

https://doi.org/10.1371/journal.pone.0231300.t001

Due to the table dimensions (two by three), the MI score will always be between 0 and log(2) ≈ 0.6931 (natural logarithm). For this reason, we did not use normalized mutual information. While more sophisticated measures of association between features and outcomes conditional on other features, such as conditional mutual information, are potentially informative, they are also very challenging to evaluate in our setting. Given that our models include high dimensional feature sets, we chose this simpler measure.
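For concreteness, the following is a minimal sketch of how such an MI score could be computed from a 2 × 3 contingency table using NumPy (in nats, so the value is bounded by log(2) ≈ 0.6931); the table values are hypothetical, and Eq 1 in the paper remains the authoritative definition.

```python
import numpy as np

def mi_score(table):
    # Mutual information (in nats) between the row variable (died/survived)
    # and the column variable (no change / decreasing / increasing).
    p = np.asarray(table, dtype=float)
    p /= p.sum()
    px = p.sum(axis=1, keepdims=True)   # marginal over mortality sub-group
    py = p.sum(axis=0, keepdims=True)   # marginal over triplet category
    nz = p > 0                          # zero cells contribute nothing
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())

# Hypothetical 2 x 3 table: rows = died / survived, columns = no change / decrease / increase.
table = [[30, 5, 15],
         [10, 40, 20]]
print(mi_score(table))   # always between 0 and log(2) ~= 0.6931 for a two-row table
```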

After MI scores were calculated for triplets, clinical experts hand-picked the subset that was clinically relevant to asthma. The selected triplets were then used to filter laboratory result features. Filtered features were those laboratory tests that were represented among the clinically meaningful triplets. The filtered laboratory result features were used in experiments described in the “Evaluation” section.

We also calculated a composite discriminative score that was used to rank laboratory result features. For each laboratory result represented among all triplets, a discriminative score was calculated by taking the sum of the MI scores from each triplet in which it appeared. The discriminative scores were used to rank the laboratory result features used in experiments described in the “Evaluation” section.
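A minimal sketch of the composite score, assuming a table of per-triplet MI scores (the lab names, events, and values below are hypothetical):

```python
import pandas as pd

# Hypothetical per-triplet MI scores (lab, event, MI score).
triplet_scores = pd.DataFrame({
    "lab":   ["glucose", "glucose", "lactate", "pO2"],
    "event": ["albuterol", "intubation", "intubation", "albuterol"],
    "mi":    [0.12, 0.05, 0.31, 0.08],
})

# Composite discriminative score per laboratory test: sum of the MI scores of
# the triplets in which that lab appears, then rank labs from highest to lowest.
lab_ranking = (triplet_scores.groupby("lab")["mi"]
               .sum()
               .sort_values(ascending=False))
print(lab_ranking)
```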

Machine learning models for longitudinal clinical datasets

For all machine learning models explored in this study (gradient boosting, neural network, logistic regression and k-nearest neighbors), time series data was used. Of these, only KNN allows us to specify feature importance directly; the other models do not support the input of feature weights. For those models, we instead performed experiments selecting subsets of the most important features.

For all four models, we normalized the training data by removing the mean of each feature and scaling to unit variance. Normalization is done to prevent biasing the algorithms. For example, many algorithms tend to sum together features, which would cause bias towards features with a wide range of values.

This same pipeline was used on the test dataset. 10-fold cross-validation with a limited hyperparameter search was used to predict mortality. Baseline ML models and clinical input-informed ML models were created. Baseline ML models included 42 top ranked features according to discriminative score. Different from baseline ML models, clinical input-informed ML models included filtered features (i.e., a subset of laboratory result features determined to be clinically relevant). We used both feature selection and weighting approaches, depending on the modeling approach. For logistic regression, gradient boosting (using shallow decision trees as weak learners), and neural network models, we used feature weights to perform feature selection. For KNN, we used weighted distance as described by others [ 20 ] for prediction.
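A sketch of what this training setup could look like for one of the models (logistic regression) with scikit-learn is shown below; the synthetic data, the hyperparameter grid, and the scoring choice are placeholders rather than the study's actual configuration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical training data: rows are patients, columns are flattened lab features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 42))
y = rng.integers(0, 2, size=200)

# Scale features to zero mean / unit variance, then fit with 10-fold CV
# over a small (illustrative) hyperparameter grid.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0, 10.0]},
                      cv=10, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```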

We evaluated model complexity and model performance with two characteristics of the feature engineering approach: ranking features by discriminative score, and filtering features according to input on which triplets are clinically meaningful. We used one measure of model complexity and two measures of performance (model accuracy and model fidelity). Our specific research questions were:

  • What is the impact of our approach to rank laboratory result features according to discriminative score on model complexity? and on model performance (model accuracy and model fidelity)?
  • What is the impact of our approach to filter features on model complexity? and on model performance (model accuracy and model fidelity)?
  • What is the impact of encoding triplet information directly into the model on model complexity? and on model performance (model accuracy and model fidelity)?
  • What is the model complexity and performance trade off of ranking laboratory result features according to discriminative score? of filtering features? and of encoding triplet information directly into the model?

A summary of the data acquisition and data pre-processing steps that enabled these analyses, as well as a description of what was analyzed, is provided in Fig 1 and below.


https://doi.org/10.1371/journal.pone.0231300.g001

Data source, study population and machine learning models

Previous work indicates that admission to the intensive care unit for asthma is a marker for severe disease [ 21 ]. Thus we chose to use the MIMIC-III (‘Medical Information Mart for Intensive Care’) public dataset [ 1 ] that includes de-identified health care data collected from patients admitted to critical care units at Beth Israel Deaconess Medical Center from 2001 to 2012.

Patients included in our analyses had an asthma diagnosis or medication to treat asthma according to criteria proposed by the eMERGE (Electronic Medical Records and Genomics) consortium [ 22 ]. We also used the in-hospital mortality labels defined in MIMIC-III for our case study task to predict whether a patient dies in the hospital. This task was treated as a binary classification problem. For the task to predict patient mortality, we narrowed our cohort to include only those patients included in a MIMIC III benchmark dataset [ 17 ] and with an admission period of ≥48 hours. We selected logistic regression, gradient boosting, neural network and KNN models, trained as implemented in the scikit-learn Python package [ 14 ].

We considered gradient boosting models to be “black-box” due to the lack of direct interpretation with the use of boosted classifiers with multiple trees. Neural network models are also considered “black-box” given that there are often many layers of neurons and it is difficult to relate connection weights to specific concepts. For logistic regression, the coefficients have an interpretation in terms of log odds. KNN also offers some interpretability because we know the closest data points to the current query point being used for making a decision. Note, however, that these notions of what is considered black-box may be appropriate in different contexts.

Data subsets and machine learning experiments

To conduct our study, we performed experiments with laboratory results and data subsets that incrementally added encoded information on patient demographics, on clinical events from triplets, and on laboratory results from triplets. These four data subsets were used to train the ML models. The first Labs data subset comes directly from the MIMIC III benchmark dataset. It contained, for each patient, the sequence of laboratory results collected during their first ICU admission. The values at each hour over a 48 hour period were included. For each one hour period, we could have zero values, one value or more than one value. For each of the f × 48 slots (where f is the number of labs or features used in the machine learning experiment), we computed mean, min, max, standard deviation and skew. This created a total of f × 48 × 5 inputs to the ML models. We also normalized our datasets so that the mean is zero and variance is one. This was done using sklearn’s StandardScaler [ 14 ].
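A minimal sketch of this flattening step is shown below. The handling of empty hourly slots is not specified in the text, so zeros are used here purely as an assumption, and the toy example uses 3 slots instead of 48.

```python
import numpy as np
from scipy.stats import skew

def summarize_slot(values):
    # Mean, min, max, standard deviation and skew for one lab in one hourly slot.
    # Empty slots are filled with zeros here (an assumption, not from the paper).
    if len(values) == 0:
        return [0.0] * 5
    return [np.mean(values), np.min(values), np.max(values), np.std(values),
            float(skew(values)) if len(values) > 2 else 0.0]

def build_feature_vector(labs_by_slot):
    # labs_by_slot: list of f labs, each a list of hourly slots of raw values.
    # Returns a flat vector of length f * (number of slots) * 5.
    features = []
    for lab in labs_by_slot:
        for slot in lab:
            features.extend(summarize_slot(slot))
    return np.array(features)

# Toy example: 2 labs over 3 slots (48 in the real setup).
example = [[[5.0, 5.2], [], [4.8]],
           [[100.0], [98.0, 101.0, 97.0], []]]
print(build_feature_vector(example).shape)   # (2 * 3 * 5,)
```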

The second Labs+demo data subset added demographic information: age group at the time of admission to the ICU, race/ethnicity and sex. Age groups included: <2, 2−17, 18−34, 35−49, 50−69 and 70+ years old. Race and ethnicity categories included: white, black, asian, hispanic, multi and other. The other category was used when we could not determine the group based on the MIMIC III entry. For sex, groups included male and female. For each patient, these values were repeated for all time slots.

The third Labs+demo+events data subset added a column for each clinical event (i.e., drug prescriptions and procedures) from triplets considered clinically relevant. Each column includes a zero or a one, with one indicating that the clinical event was recorded during the time slot being considered, and zero otherwise.

The fourth Labs+demo+events+triples data subset duplicates columns from the Labs data subset and, for a given time slot, replaces laboratory values with a zero if all clinical event values are zero. Otherwise, the laboratory values were left as-is.
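A sketch of how the event and triplet encodings could be assembled for a single patient is shown below; demographic columns are omitted for brevity, and the array shapes, event times, and values are hypothetical.

```python
import numpy as np

# Hypothetical arrays for one patient: 48 hourly slots.
rng = np.random.default_rng(1)
lab_values = rng.normal(size=(48, 3))          # 3 lab feature columns per slot
event_flags = np.zeros((48, 2), dtype=int)     # 2 clinically relevant events
event_flags[10, 0] = 1                         # e.g., a prescription in hour 10
event_flags[30, 1] = 1                         # e.g., a procedure in hour 30

# Labs+demo+events: concatenate lab features with 0/1 event indicator columns.
labs_demo_events = np.hstack([lab_values, event_flags])

# Labs+demo+events+triples: duplicate the lab columns, zeroing slots with no event,
# so the extra columns carry lab values only around clinical events.
has_event = np.any(event_flags, axis=1, keepdims=True)
triplet_labs = np.where(has_event, lab_values, 0.0)
labs_demo_events_triples = np.hstack([labs_demo_events, triplet_labs])
print(labs_demo_events_triples.shape)   # (48, 3 + 2 + 3)
```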

The Labs+demo+events and Labs+demo+events+triples data subsets allowed us to examine the extent to which model complexity and performance were impacted by encoding triplet information. See “Machine learning experiments and analyses of model complexity and performance trade-off” for details on experiments with these data subsets.

Analysis of triplet ranking

The ranking of triplets according to discriminative score was assessed by three co-authors (XZ, CGC, JAE) who manually reviewed the clinical relevance of all features in our severe asthma case study. A two-step strategy was applied. The first step was to evaluate whether the medication/procedure was generally known to or could conceivably have an impact (directly or indirectly) on a laboratory result (e.g., we consider arterial blood gas and pulmonary function test results to have medical relevance in asthma patients). The second step was to evaluate whether the combination is considered relevant to the asthma case study (e.g., ‘Gauge Ordering’ does not indicate specific clinical uses). This process allowed us to filter out lab-event-lab triplets that were not relevant to our case study. For top ranked triplets according to MI score, we calculated precision at k, i.e., the number of triplets within the top k ranked patterns that were selected by our experts, divided by k. This set was used in the ML predictors explored in this study. All of the steps to analyze the model complexity and performance trade-off were computational.
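A minimal sketch of this precision-at-k calculation (with hypothetical triplet identifiers and expert selections):

```python
def precision_at_k(ranked_triplets, expert_selected, k):
    # Fraction of the top-k triplets (ranked by MI score) that clinical experts
    # marked as relevant to the case study.
    top_k = ranked_triplets[:k]
    return sum(1 for t in top_k if t in expert_selected) / k

# Hypothetical ranking and expert selection.
ranked = ["t1", "t2", "t3", "t4", "t5"]
relevant = {"t2", "t5"}
print(precision_at_k(ranked, relevant, k=3))   # 1/3 ~= 0.33
```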

Machine learning experiments and analyses of model complexity and performance

We conducted several experiments to assess the model complexity and performance of ML models. We used one measure of model complexity and two measures of performance (model accuracy and model fidelity). Complexity, accuracy and fidelity are three characteristics used to describe ML algorithms that have been summarized by others [ 9 , 23 ].

Measuring ML model complexity.

To assess the impact of using ranked laboratory result features on model complexity, we conducted experiments analyzing the number of non-zero coefficients with use of fewer than the baseline 42 features ( k = 32, 16, 8, 4, and 2). In order to assess the impact of using filtered features on model complexity, we conducted experiments comparing the number of non-zero coefficients in the baseline ML model and in the ML models based on filtered features. In order to assess the impact of encoding triplet information into the model on model complexity, we compared the number of non-zero coefficients in ML model subsets that included triplet information (i.e., Labs+demo+events and Labs+demo+events+triples ) to the data subset that includes laboratory results and demographic information (i.e., Labs+demo ). This assessment was conducted for logistic regression and gradient boosting. We did not assess non-zero coefficients for neural network or KNN because the number of non-zero coefficients is not related to the complexity for these models. For all comparisons, we assessed the degree of difference.
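A sketch of how such counts could be obtained with scikit-learn is shown below. The paper does not report the regularization used, so an L1 penalty is assumed here only so that some logistic regression coefficients become exactly zero, and counting non-zero feature importances is used as a rough analogue for gradient boosting; the data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 42))
y = rng.integers(0, 2, size=300)

# Logistic regression: count coefficients that are not (numerically) zero.
lr = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)
lr_nonzero = int(np.sum(np.abs(lr.coef_) > 1e-8))

# Gradient boosting: count features with non-zero importance as an analogous measure.
gb = GradientBoostingClassifier(n_estimators=50, max_depth=2).fit(X, y)
gb_nonzero = int(np.sum(gb.feature_importances_ > 0))

print(lr_nonzero, gb_nonzero)
```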

Measuring ML model accuracy.

Model accuracy was assessed by examining the extent to which ranking laboratory result features according to discriminative score and filtering features in ML models can accurately predict mortality. For models with ranked features and with filtered features, we also examined the extent to which encoding demographic information and triplet information influences model accuracy. Model fidelity was assessed by observing the extent to which ML models with filtered features (with and without encoding triplet information directly) are able to imitate the baseline ML predictors.

In order to assess the model accuracy of ranking according to discriminative score, we performed feature selection experiments with logistic regression, gradient boosting, and neural network models. For these experiments, we selected the top k = 2, 4, 8, 16, 32, and 42 laboratory results ranked according to model weights for gradient boosting and neural network models, and according to the sum of MI score for logistic regression models. For KNN we did not assess performance changes when considering ranking according to discriminative score. For each experiment, receiver operating characteristic (ROC) curves and area under the ROC curve (AUC) were reported. The range of AUC values was also reported for feature selection experiments.
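A sketch of this kind of top-k feature-selection experiment with scikit-learn is given below; the data are synthetic and a simple correlation score stands in for the discriminative score, so the numbers it prints are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 42))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=400) > 0).astype(int)
scores = np.abs(np.corrcoef(X.T, y)[:-1, -1])   # stand-in for the discriminative score

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
for k in [2, 4, 8, 16, 32, 42]:
    top_k = np.argsort(scores)[::-1][:k]        # indices of the k highest-ranked features
    model = LogisticRegression(max_iter=1000).fit(X_tr[:, top_k], y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te[:, top_k])[:, 1])
    print(k, round(auc, 3))
```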

In order to assess the model accuracy of baseline and clinical input-informed ML models, for all three models we performed experiments with data subsets. Baseline ML models included models with 42 features and their data subsets. Filtered features were laboratory results represented among triplets determined to be clinically relevant. For logistic regression, gradient boosting, and neural network models, the Labs+demo data subset enabled assessing the influence of encoding demographic information on model accuracy. The Labs+demo+events and Labs+demo+events+triples data subsets enabled assessing the influence of encoding triplet information on model accuracy. KNN experiments were conducted with the Labs data subset only, so we did not assess the influence of encoding triplet information. For each experiment, ROC curves and AUC were reported.

Measuring ML model fidelity.

Fidelity was assessed by examining the extent to which the clinical input-informed ML features (with and without triplet information) are able to accurately imitate our baseline ML predictors. We conducted experiments that enabled comparing clinical input-informed ML model performance to baseline ML model performance. For logistic regression, gradient boosting, and neural network models, clinical input-informed ML models were compared to baseline ML models for three data subsets: Labs, Labs+demo and Labs+demo+events. For KNN, we used the Labs data subset and compared the performance of models with filtered features (with and without weights) to the baseline ML model. We report the difference in AUC for clinical input-informed ML models with and without encoding triplets (i.e., the Labs+demo+events and Labs+demo+events+triples data subsets) compared to the baseline for logistic regression, gradient boosting, and neural network ML models.
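A minimal sketch of the fidelity comparison, assuming predicted probabilities from a baseline model and a filtered-feature model on the same held-out patients (all values below are hypothetical):

```python
from sklearn.metrics import roc_auc_score

def delta_auc(y_true, baseline_probs, filtered_probs):
    # Difference in AUC between a clinical input-informed (filtered-feature) model
    # and the corresponding baseline model on the same test set.
    return roc_auc_score(y_true, filtered_probs) - roc_auc_score(y_true, baseline_probs)

# Hypothetical predicted probabilities from two models on five test patients.
y_true = [1, 0, 1, 1, 0]
baseline = [0.8, 0.3, 0.6, 0.7, 0.4]
filtered = [0.7, 0.2, 0.65, 0.75, 0.35]
print(round(delta_auc(y_true, baseline, filtered), 3))
```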

Triplet identification and ranking

We discovered 218 prescription and 535 procedure triplets with more than 10 instances. These two lists of triplets were sorted by MI score prior to manual clinical review. Upon clinical review, we found that 82 triplets (27 prescription and 55 procedure triplets, see S1 and S2 Tables) were meaningful for our case study. Precision at k for prescription and procedure events according to MI score is shown in Table 2. Triplets used in this calculation are summarized in S3 and S4 Tables. For prescription triplets, precision at k = 3, 5, 10, and 20 in a top-k ranked list ranged from 20% to 67%, i.e., the percentage of the triplets that were relevant to the case study. For procedure triplets, precision at k = 3, 5, 10, and 20 in a top-k ranked list ranged from 0% to 20%.


https://doi.org/10.1371/journal.pone.0231300.t002

The triplets were used to select and weight features, enabling more interpretable ML models. Eleven laboratory tests were represented among the 82 clinically meaningful triplets discovered at this step (i.e., filtered laboratory results, see S5 Table). The “Performance result: ML model accuracy” section shows how feature selection using the 11 laboratory tests (i.e., filtered features) impacted ML model accuracy.

Machine learning results

The ML experiments are based on a subset of 7777 patient records from the MIMIC III database that are also included in a benchmark dataset used by others [ 17 ]. This dataset of 7777 patients was divided into 6222 training cases (death rate 0.489) and 1555 testing cases (death rate 0.494). An overview of the final dataset, data pre-processing and data analysis steps is shown in Fig 1 and Table 3. The top 42 ranked laboratory tests used in logistic regression and gradient boosting model experiments are shown in S6 and S7 Tables.


https://doi.org/10.1371/journal.pone.0231300.t003

ML model complexity results

The ML models for logistic regression and gradient boosting for k ≤ 42 laboratory result features are summarized in Tables 4 and 5. For both, model complexity decreased with use of fewer features informed by discriminative score (Tables 4 and 5). For logistic regression models, there was a -1.8 fold change in non-zero coefficients between models with 16 and 8 features, and a -2.8 fold change between models with 4 and 2 features. For gradient boosting models, there was a -1.7 fold change in non-zero coefficients between models with 32 and 16 features, a -1.7 fold change between models with 8 and 4 features, and a -2.5 fold change between models with 4 and 2 features. All other model fold changes were 0 to 0.5 and were interpreted as no difference.

The use of clinical input-informed (filtered) features decreased model complexity for logistic regression and gradient boosting ML models. Both had a -1.7 fold change in non-zero coefficients between the models with filtered features for the Labs data subset and the baseline model. There was no difference between models with filtered features for the Labs+demo and Labs+demo+events data subsets and the baseline model (i.e., fold changes were 0 to 0.5).

When examining the influence of encoding demographic and triplet information for models with non-filtered features, we found that model complexity decreased. Among logistic regression models, there was a -1.7 fold change in non-zero coefficients between models that encode demographic information ( Labs+demo ) compared to the baseline model ( Labs ), and a -1.6 fold change between models that encode triplet information ( Labs+demo+events ) compared to the baseline model. Among gradient boosting models, there was a -1.9 fold change in non-zero coefficients between models that encode demographic information ( Labs+demo ) compared to the baseline model, and a -2.0 fold change between models that encode triplet information ( Labs+demo+events ) compared to the baseline model.

When examining the influence of encoding demographic and triplet information for models with filtered features, the model complexity increased for one model that encoded triplet information. There was a 2.3 fold change for the gradient boosting model with filtered features for the Labs+demo+events+triples data subset compared to the baseline model. There were no differences between other models encoding demographic and triplet information with filtered features and the baseline model (i.e., fold changes were 0 to 0.5 for Labs+demo and Labs+demo+events data subsets).


https://doi.org/10.1371/journal.pone.0231300.t004


https://doi.org/10.1371/journal.pone.0231300.t005

Performance result: ML model accuracy and fidelity

Model accuracy and fidelity for logistic regression, gradient boosting, and neural network ML models (top k = 2, 4, 8, 16, 32, 42 features, and 11 filtered features) are summarized in Table 6. The top performing models yielded AUCs of 0.73 for logistic regression, 0.75 for gradient boosting, and 0.68 for neural network models. For KNN, the baseline model accuracy (with 42 features) was AUC = 0.57.


https://doi.org/10.1371/journal.pone.0231300.t006

For logistic regression, gradient boosting, and neural network ML models, we found model accuracy to be robust to feature removal informed by discriminative score (Figs 2, 3 and 4). We observed slightly lower AUCs between models with 32 and 16 features (ΔAUC = 0.04, 0.02, and 0.03 for logistic regression, gradient boosting, and neural network models, respectively). Step-wise feature removal otherwise yielded negligible differences (i.e., ΔAUC ≤ 0.01). Across all data subsets, the maximum difference from baseline was ΔAUC = 0.1 for logistic regression, ΔAUC = 0.09 for gradient boosting, and ΔAUC = 0.06 for neural network models.


https://doi.org/10.1371/journal.pone.0231300.g002


https://doi.org/10.1371/journal.pone.0231300.g003


https://doi.org/10.1371/journal.pone.0231300.g004

For all ML models, the clinical input-informed ML features (i.e., filtered features) yielded comparable model accuracy to the baseline (Figs 5, 6 and 7). The top performing models with filtered features yielded AUCs of 0.73 for logistic regression, 0.74 for gradient boosting, 0.68 for neural network, 0.54 for unweighted KNN, and 0.56 for weighted KNN models. The magnitude of the difference in AUC from baseline models was >0.01 for the unweighted KNN model, which achieved lower but comparable accuracy (ΔAUC = 0.03), and for the neural network model for the Labs+demo+events data subset, which achieved higher but comparable accuracy (ΔAUC = 0.03). Differences from the baseline were negligible for all other models.


https://doi.org/10.1371/journal.pone.0231300.g005


https://doi.org/10.1371/journal.pone.0231300.g006


https://doi.org/10.1371/journal.pone.0231300.g007

Findings from feature engineering experiments used to assess the influence of encoding demographic and triplet information on model accuracy are also illustrated in Figs 4, 5 and 6. ML models encoding demographic information (i.e., Labs+demo data subsets) showed higher accuracy than baseline ML models. Among ML models without filtered features, the performance of those encoding demographic information (i.e., Labs+demo data subsets) was AUC 0.73 for logistic regression, 0.74 for gradient boosting, and 0.67 for neural network models. The magnitude of the performance improvement of Labs+demo data subsets from the baseline was ΔAUC = 0.09 for logistic regression, ΔAUC = 0.05 for gradient boosting, and ΔAUC = 0.04 for neural network models. Among ML models with filtered features, the performance of models encoding demographic information was AUC 0.72 for logistic regression, 0.73 for gradient boosting, and 0.67 for neural network models. The magnitude of the performance improvements from the baseline was ΔAUC = 0.08 for logistic regression, ΔAUC = 0.04 for gradient boosting, and ΔAUC = 0.04 for neural network models.

ML models encoding triplet information (i.e., Labs+demo+events and Labs+demo+events+triples data subsets) showed higher accuracy than baseline ML models. Among ML models encoding triplet information, the top performing models yielded AUCs of 0.73 for logistic regression, 0.75 for gradient boosting, and 0.68 for neural network models. For ML models without filtered features, when compared to baseline models, the performance improvements were ΔAUC = 0.07 for logistic regression, ΔAUC = 0.05 for gradient boosting, and ΔAUC = 0.02 for neural network models for Labs+demo+events data subsets. The same differences were observed for logistic regression and gradient boosting ML models with filtered features for Labs+demo+events data subsets. For the neural network model with filtered features for the Labs+demo+events data subset, ΔAUC = 0.05. For models with filtered features for Labs+demo+events+triples data subsets, when compared to baseline models, the performance improvements were ΔAUC = 0.08 for logistic regression, ΔAUC = 0.05 for gradient boosting, and ΔAUC = 0.02 for neural network models.

Clinically meaningful triplets and utility of discriminative score

Eighty-two clinically meaningful triplets were discovered for a severe asthma case study (S1 and S2 Tables). We found that the precision of MI score rankings to discover those triplets was low to moderate for prescriptions (20% to 67%) and low for procedures (0% to 20%). This finding indicates that more work is needed in order to move from manual clinical review to an automated process. Through further assessment of MI score rankings, we found that similar model performance for the mortality prediction task can be achieved with decreased model complexity (i.e., fewer features selected according to discriminative score, Figs 2, 3 and 4). These findings suggest that MI score alone cannot replace clinical input, but that the use of MI score-informed rankings has potential to decrease model complexity at little cost to model performance.

Model complexity and performance of machine learning models

We assessed model complexity for logistic regression and gradient boosting. We found decreases in model complexity with the use of fewer features informed by discriminative score and with clinical input ( Table 4 and Table 5 ). We also found decreases in model complexity when demographic information was encoded. Results were mixed for models that encoded triplet information, with decreases in model complexity observed with Labs+demo+events data subsets, and an increase in complexity observed with one Labs+demo+events+triples data subset.

When considering model performance for approaches that decreased model complexity, we found that logistic regression, gradient boosting, and neural network ML models were robust to feature removal informed by discriminative score. We did however observe diminishing performance with fewer features. Among the machine learning models we explored, the neural network model was the most robust to feature removal.

For logistic regression, gradient boosting, neural network, and KNN models with filtered features, we found small differences in performance when compared to baseline ML models (ΔAUC was ≤0.03 across all modeling approaches). For logistic regression, gradient boosting and neural network models with filtered features, we also found that encoding demographic and triplet information showed both decreases in model complexity and improvements in performance for the Labs+demo and Labs+demo+events subsets. The results were mixed for the Labs+demo+events+triples subsets: decreases in model complexity and improvements in performance were observed for the logistic regression and gradient boosting models with filtered features. For the neural network model with filtered features for the Labs+demo+events+triples subset, model complexity increased and the improvement in performance was very small.

Limitations and implications for future work

Our use of the MIMIC-III dataset in this case study may influence generalizability of our approach to other clinical datasets given that the dataset derives exclusively from patients in intensive care settings. Because of this, information typically collected in routine outpatient settings, such as pulmonary function tests, would not be included in our model. This limitation was mitigated in part through our filtering approach that excluded routine features of intensive care.

There are also some limitations due to our approach to define triplets. First, when considering prescriptions as the anchor event for triplets, the half-life of the medications may be relevant. Our approach to discover triplets as lab-event-lab triples (i.e., laboratory results prior to and immediately following a prescription) may exclude some triplets relevant to medications with a longer half-life. Second, under circumstances where we used a default binning approach for laboratory results, extreme outliers skewed the boundaries. There are other approaches to discretize time series data such as SAX [ 24 ] and Temporal Discretization for Classification (TD4C) [ 25 ] methods that might be considered to overcome this limitation.

In addition, the performance of our ML approaches may be influenced by our cohort description. First, severe asthma patients were selected according to asthma medications and diagnoses listed for those patients. Our inclusion criteria may be improved by processing the free text “admission reason” to determine if asthma or asthma-like terms are mentioned. Second, we did not consider the presence of co-morbid conditions in our predictive modeling. Future work may draw from others to improve prediction or treatment guidelines for severe asthma. For example, there are many studies such as [ 26 ] that aim to detect factors that can predict asthma. We anticipate that such considerations can provide improved predictions over approaches explored here.

A major contribution of this work to the ML literature is our approach to incorporate clinical expertise. When reviewing ML and temporal data mining research broadly, we found that this was a gap [ 4 , 6 , 8 , 27 – 33 ]. Unlike previous efforts, our experiments tested the impact of using a feature engineering approach to incorporate clinical input on model complexity and performance. We also provided a simplified framework to encode triplets (lab-event-lab patterns) for direct inclusion into our models that warrants further exploration. Furthermore, well-known feature selection methods such as Lasso [ 34 ] do not incorporate expert knowledge and rely on statistical methods to choose features. Findings from our work motivate future efforts to explore ways to use statistical methods together with approaches such as ours that incorporate clinician-informed information on the relationship between clinical events and laboratory measurements into the machine learning process.

This work explored a feature engineering approach with longitudinal data that enabled incorporating clinical input in ML models. We assessed the impact of two characteristics of the approach on the complexity and performance of ML models for a mortality prediction task for a severe asthma case study: ranking features by discriminative score (e.g., MI score sum), and filtering laboratory features according to input on which lab-event-lab triplets are clinically meaningful. We found that ML models using fewer input features, selected either based on discriminative score or according to which triplets are clinically meaningful, can decrease model complexity with little cost to performance. Furthermore, for models with lower model complexity through the use of filtered features, the performance of ML models showed improvements from the baseline for data subsets that encode demographic and triplet information. Such approaches to reduce ML model complexity at little cost to performance warrant further comparison and consideration to combine with state-of-the-art feature selection methods, as well as exploration beyond the severe asthma case study.

Supporting information

S1 Table. Clinically meaningful prescription triplets ranked by discriminative score.

https://doi.org/10.1371/journal.pone.0231300.s001

S2 Table. Clinically meaningful procedure triplets ranked by discriminative score.

https://doi.org/10.1371/journal.pone.0231300.s002

S3 Table. Top 20 prescription triplets ranked by discriminative score.

https://doi.org/10.1371/journal.pone.0231300.s003

S4 Table. Top 20 procedure triplets ranked by discriminative score.

https://doi.org/10.1371/journal.pone.0231300.s004

S5 Table. Laboratory tests found in list of clinically meaningful triplets.

https://doi.org/10.1371/journal.pone.0231300.s005

S6 Table. Forty-two laboratory tests used in gradient boosting experiments, sorted by weight.

https://doi.org/10.1371/journal.pone.0231300.s006

S7 Table. Forty-two laboratory tests used in logistic regression experiments, sorted by weight.

https://doi.org/10.1371/journal.pone.0231300.s007

Acknowledgments

The authors would like to thank the anonymous reviewers for their comments and suggestions for improving this manuscript. We also want to thank Brant Chee (Johns Hopkins University Applied Physics Laboratory), Matt Kinsey (Johns Hopkins University Applied Physics Laboratory), and Richard Zhu (Institute for Clinical and Translational Research, Johns Hopkins University) for their valuable input.

  • 2. Chennamsetty H, Chalasani S, Riley D. Predictive analytics on Electronic Health Records (EHRs) using Hadoop and Hive. In: 2015 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT); 2015. p. 1–5.
  • 7. Schulam P, Saria S. A Framework for Individualizing Predictions of Disease Trajectories by Exploiting Multi-resolution Structure. In: Proceedings of the 28th International Conference on Neural Information Processing Systems—Volume 1. NIPS’15. Cambridge, MA, USA: MIT Press; 2015. p. 748–756. Available from: http://dl.acm.org/citation.cfm?id=2969239.2969323 .
  • 12. Sun J, Hu J, Luo D, Markatou M, Wang F, Edabollahi S, et al. Combining knowledge and data driven insights for identifying risk factors using electronic health records. In: 2012 AMIA Annual Symposium; 2012. p. 901.
  • 13. Chen T, Guestrin C. Xgboost: A scalable tree boosting system. In: Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. ACM; 2016. p. 785–794.
  • 14. SciKit. Available from: https://www.scipy.org/scikits.html.
  • 18. YerevaNN. MIMIC III Benchmark Resources; 2018. Available from: https://github.com/YerevaNN/mimic3-benchmarks/blob/master/mimic3benchmark/resources/itemid_to_variable_map.csv .
  • 19. Wikipedia. Mutual information; 2004. Available from: https://en.wikipedia.org/wiki/Mutual_information .
  • 22. Vazquez L, Connolly J. CHOP. Asthma. PheKB; 2013. Available from: https://phekb.org/phenotype/146.
  • 24. Lin J, Keogh E, Lonardi S, Chiu B. A symbolic representation of time series, with implications for streaming algorithms. In: Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery. ACM; 2003. p. 2–11.

U.S. flag

An official website of the United States government

The .gov means it’s official. Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

The site is secure. The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

  • Publications
  • Account settings

Preview improvements coming to the PMC website in October 2024. Learn More or Try it out now .

  • Advanced Search
  • Journal List

Logo of plosone

Feature engineering with clinical expert knowledge: A case study assessment of machine learning model complexity and performance

Kenneth d. roe.

1 Johns Hopkins Malone Center for Engineering in Healthcare, Johns Hopkins University, Baltimore, MD, United States of America

2 The Institute of Clinical and Translational Research, Johns Hopkins University, Baltimore, MD, United States of America

3 Department of Computer Science, Johns Hopkins University Whiting School of Engineering, Baltimore, MD, United States of America

Xiaohan Zhang

4 Division of Health Sciences Informatics, Johns Hopkins University School of Medicine, Baltimore, MD, United States of America

Christopher G. Chute

5 Division of General Internal Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, United States of America

Jeremy A. Epstein

Jordan matelsky.

6 Johns Hopkins University Applied Physics Laboratory, Laurel, MD, United States of America

Ilya Shpitser

Casey overby taylor.

7 Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, United States of America

Associated Data

Data underlying the study belong to a third party, MIMIC III. Data are available on request from https://mimic.physionet.org for researchers who meet the criteria for access to confidential data. The authors confirm that they did not have any special access to this data.

Incorporating expert knowledge at the time machine learning models are trained holds promise for producing models that are easier to interpret. The main objectives of this study were to use a feature engineering approach to incorporate clinical expert knowledge prior to applying machine learning techniques, and to assess the impact of the approach on model complexity and performance. Four machine learning models were trained to predict mortality with a severe asthma case study. Experiments to select fewer input features based on a discriminative score showed low to moderate precision for discovering clinically meaningful triplets, indicating that discriminative score alone cannot replace clinical input. When compared to baseline machine learning models, we found a decrease in model complexity with use of fewer features informed by discriminative score and filtering of laboratory features with clinical input. We also found a small difference in performance for the mortality prediction task when comparing baseline ML models to models that used filtered features. Encoding demographic and triplet information in ML models with filtered features appeared to show performance improvements from the baseline. These findings indicated that the use of filtered features may reduce model complexity, and with little impact on performance.

Introduction

Improved access to large longitudinal electronic health record (EHR) datasets through secure open data platforms [ 1 ] and the use of high-performance infrastructure [ 2 ] are enabling applications of sophisticated machine learning (ML) models in decision support systems for major health care practice areas. Areas with recent successes include early detection and diagnosis [ 3 , 4 ] treatment [ 5 , 6 ] and outcome prediction and prognosis evaluation [ 7 , 8 ]. Relying on ML models trained on large EHR datasets, however, may lead to implementing decision-support systems as black boxes—systems that hide their internal logic to the user [ 9 ]. A recent survey of methods for explaining black box models highlights two main inherent risks [ 10 ]: (1) using decision support systems that we do not understand, thus impacting health care provider and institution liability; and (2) a risk of inadvertently making wrong decisions, learned from spurious correlations in the training data. This work takes a feature engineering approach that incorporates clinical expert knowledge in order to bias the ML algorithms away from the spurious correlations and towards meaningful relationships.

Severe asthma as a case study

We explored severe asthma as a case study given the multiple limitations of current computational methods to optimize asthma care management. Documented limitations include: the low prediction accuracy of existing approaches to project outcomes for asthma patients, limitations with communicating the reasons why patients are at high risk, difficulty explaining the rules and logic inside an approach and a lack of causal inference capability to provide clear guidance on what patients could safely be moved off care management [ 11 ]. Incorporating clinical expert knowledge at the time that computational models are trained may help to overcome these limitations.

Expert clinical knowledge and model performance

Incorporate expert knowledge into the computational model building process has potential to produce ML models that show performance improvements. One previous study, for example, found that including known risk factors of heart failure (HF) as features during training yielded the greatest improvement in the performance of models to predict HF onset [ 12 ]. Different from that approach, we use a feature engineering approach to incorporate clinical expert knowledge.

Our feature engineering approach involved first extracting triplets from a longitudinal clinical data set, ranking those triplets according to a discriminative score, and then filtering those triplets with input from clinical experts. Triplets explored in this work were laboratory results and their relationship to clinical events such as medical prescriptions (i.e., lab-event-lab triples).

The goal of this research was to apply the feature engineering approach with a severe asthma case study and to assess model performance for a range of ML approaches: gradient boosting [ 13 ], neural network [ 14 ], logistic regression and k-nearest neighbor. Non-zero coefficients were assessed as a metric of model complexity for two ML approaches: logistic regression and gradient boosting.

For each ML model, we conducted several experiments to understand the impact of ranking features based upon discriminative score and of filtering features with clinical input on model complexity and performance. To assess performance, we used measures of model accuracy and fidelity. Experiments were completed with a case study of patients with severe asthma in the MIMIC-III [ 1 ] dataset for a mortality prediction task.

Discovering triplets from longitudinal clinical data

First, we discovered triplets, defined as a lab-event-lab sequence where the value of a laboratory result is captured before and after a clinical event. These triplets occur within the context of an ICU stay. Clinical events captured in this study were medication prescriptions and clinical procedures. The ranking step used an information theoretic approach to calculate and associate a discriminative score for triplets. The filtering step involved input from clinical experts who filtered out triplets that were not considered relevant to asthma. The final list of ranked and filtered laboratory results were used to select or weight features in a range of machine learning models.

In order to discover triplets, laboratory results were pre-processed as follows:

  • Laboratory values were cleaned by merging laboratory result names according to the approach described in ref [ 17 ]. That work provided a file outlining bundled laboratory names (e.g., heart rate) that grouped name variations (e.g., pulse rate), abbreviations (e.g., HR) and, misspellings (e.g., heat rate) of the same concept [ 18 ]. In addition, there were circumstances where laboratory values consisted of both numerical and textual representations. In those cases, we converted the textual values to numbers according to simple rules (e.g., “≤1” converted “1”). Many laboratory result entries had values such as “error” which could not be converted. In those instances, entries were ignored.
  • Laboratory values were divided into a finite number of bins. Bin boundaries were defined by a clinical expert familiar with the normal ranges of each laboratory test. For tests where normal ranges were unknown, six dividers were defined based upon mean and standard deviation (i.e., μ − 2 σ , μ − σ , μ − σ /2, μ + σ /2, μ + σ , μ + 2 σ ).

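As a minimal sketch of the default binning rule, assuming numeric laboratory values held in a NumPy array (an illustration, not the study's code):

```python
import numpy as np

def default_bin_edges(values):
    """Dividers used when a normal range is unknown: mu-2s, mu-s, mu-s/2, mu+s/2, mu+s, mu+2s."""
    mu, sigma = np.nanmean(values), np.nanstd(values)
    return [mu - 2 * sigma, mu - sigma, mu - sigma / 2,
            mu + sigma / 2, mu + sigma, mu + 2 * sigma]

def bin_lab_value(value, edges):
    """Map a numeric laboratory value to a bin index (0 .. len(edges))."""
    return int(np.searchsorted(edges, value))
```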
Next, triplets were discovered according to the following steps:

  • Laboratory value bins before and after a clinical event (i.e., a lab-event-lab triplet) were captured. A laboratory result could involve different clinical events, resulting in multiple triplets. In addition, each patient in our dataset could have multiple triplets. The amount of time between the clinical event and the lab measurement also varies depending upon the lab. For each event, the lab test time durations before and after the event were calculated. For each lab test, the duration before an event was defined as the time immediately after the prior lab test until (and including) the time of the event. The duration after an event was defined as the time immediately after the event until the next lab measure occurred. The start time for the first recorded lab measure for an individual was defined as the start time of the ICU stay. Similarly, the end time for the last recorded lab measure was defined as the end time of the ICU stay.
  • Lab-event-lab triplets were categorized as no change, decreasing or increasing by assessing the laboratory value bin before and after the anchoring clinical event.
  • Cross tabulations were then performed for each triplet category (no change, decreasing, increasing) for two patient sub-groups (patients who died and patients who did not die).
  • Triplets with cross tabulation values of 10 or fewer were excluded from further analysis, because such small counts cannot reliably indicate whether there is a statistically meaningful relationship. A minimal sketch of these steps follows this list.
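The sketch below illustrates the categorization and cross-tabulation steps with pandas, assuming a hypothetical DataFrame with one row per lab-event-lab occurrence and columns bin_before, bin_after, and died (our reading of the paper's exclusion rule is noted in a comment):

```python
import pandas as pd

def triplet_contingency(occurrences: pd.DataFrame):
    """Cross-tabulate one triplet's direction categories against mortality."""
    direction = pd.Series("no change", index=occurrences.index)
    direction[occurrences["bin_after"] > occurrences["bin_before"]] = "increasing"
    direction[occurrences["bin_after"] < occurrences["bin_before"]] = "decreasing"

    table = pd.crosstab(occurrences["died"], direction)   # 2 x 3 contingency table
    # We read the paper's rule as dropping triplets whose counts are 10 or fewer.
    return table if table.to_numpy().sum() > 10 else None
```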

Ranking and filtering laboratory result features

We used an information theoretic approach to calculate discriminative scores for triplets and used those scores to rank and filter laboratory result features. In particular, we calculated a mutual information [ 19 ] score, MI score ( Eq 1 ), to rank triplets that may be estimators of mortality. The MI score was chosen as a simple and fast measure that can be used to highlight triplets that may be relevant for clinical experts. The MI score measures the marginal association of each triplet category with the patient mortality sub-group; it is therefore based on a table with three columns (no change, increase, decrease) and two rows (died or survived). An illustration of MI score calculations for triplets is shown in Table 1.

Due to the table dimensions (two by three), the MI score will always be between 0 and log 2 ≈ 0.6931. For this reason, we did not use normalized mutual information. While more sophisticated measures of association between features and outcomes, conditional on other features, such as conditional mutual information, are potentially informative, they are also very challenging to evaluate in our setting. Given that our models include high dimensional feature sets, we chose this simpler measure.
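As a sketch of how such a score can be computed from the 2 × 3 table (the counts below are illustrative only, not taken from the study):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Rows: died / survived; columns: no change / increase / decrease (illustrative counts).
table = np.array([[30, 12, 8],
                  [25, 40, 35]])

mi = mutual_info_score(None, None, contingency=table)  # MI in nats, computed from the table
assert 0.0 <= mi <= np.log(2)                          # bounded by log 2 for a two-row table
```

The composite discriminative score described below is then simply the sum of such MI values over all triplets in which a given laboratory test appears.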

After MI scores were calculated for triplets, clinical experts hand-picked the subset that was clinically relevant to asthma. The selected triplets were then used to filter laboratory result features. Filtered features were those laboratory tests that were represented among the clinically meaningful triplets. The filtered laboratory result features were used in the experiments described in the “Evaluation” section.

We also calculated a composite discriminative score that was used to rank laboratory result features. For each laboratory result represented among all triplets, a discriminative score was calculated by taking the sum of the MI scores from each triplet in which it appeared. These discriminative scores were used to rank the laboratory result features used in the experiments described in the “Evaluation” section.

Machine learning models for longitudinal clinical datasets

Time series data were used for all machine learning models explored in this study (gradient boosting, neural network, logistic regression, and k-nearest neighbors). KNN allows us to specify feature importance directly; the other models do not support the input of feature weights, so for those models we instead performed experiments selecting subsets of the most important features.

For all four models, we normalized the training data features by removing the mean and scaling to unit variance. Normalization is done to prevent biasing the algorithms; for example, many algorithms sum features together, which would bias results toward features with a wide range of values.
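A minimal sketch of this normalization step with scikit-learn's StandardScaler (the arrays below are random stand-ins, not the study's data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 4))  # stand-in for training features
X_test = rng.normal(loc=5.0, scale=2.0, size=(25, 4))    # stand-in for test features

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean/variance on training data only
X_test_std = scaler.transform(X_test)        # apply the same statistics to the test set
```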

This same pipeline was used on the test dataset. A 10-fold cross validation with a limited hyperparameter search was used to predict mortality. Baseline ML models and clinical input-informed ML models were created. Baseline ML models included the 42 top ranked features according to discriminative score. Different from baseline ML models, clinical input-informed ML models included filtered features (i.e., a subset of laboratory result features determined to be clinically relevant). We used both feature selection and weighting approaches, depending on the modeling approach. For logistic regression, gradient boosting (which uses shallow decision trees as weak learners), and neural network models, we used feature weights to perform feature selection. For KNN, we used a weighted distance, as described by others [ 20 ], to make predictions.
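One common way to fold feature weights into a KNN distance is to rescale each column by the square root of its weight, which makes the ordinary Euclidean distance equal to the weighted distance; the sketch below illustrates that idea and is not necessarily the exact scheme of [ 20 ]:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def weighted_knn_predict(X_train, y_train, X_test, feature_weights, k=5):
    """KNN with per-feature weights folded into the Euclidean distance."""
    w = np.sqrt(np.asarray(feature_weights, dtype=float))
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train * w, y_train)          # scaling columns by sqrt(w_j) weights the distance
    return knn.predict_proba(X_test * w)[:, 1]
```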

We evaluated model complexity and model performance with respect to two characteristics of the feature engineering approach: ranking features by discriminative score, and filtering features according to input on which triplets are clinically meaningful. We used one measure of model complexity and two measures of performance (model accuracy and model fidelity). Our specific research questions were:

  • What is the impact of our approach to rank laboratory result features according to discriminative score on model complexity and on model performance (model accuracy and model fidelity)?
  • What is the impact of our approach to filter features on model complexity and on model performance (model accuracy and model fidelity)?
  • What is the impact of encoding triplet information directly into the model on model complexity and on model performance (model accuracy and model fidelity)?
  • What is the model complexity and performance trade-off of ranking laboratory result features according to discriminative score, of filtering features, and of encoding triplet information directly into the model?

A summary of the data acquisition and data pre-processing steps that enabled these analyses, as well as a description of what was analyzed, is provided in Fig 1 and below.

[Fig 1 (pone.0231300.g001): overview of data acquisition, pre-processing, and analysis steps.]

Data source, study population and machine learning models

Previous work indicates that admission to the intensive care unit for asthma is a marker for severe disease [ 21 ]. Thus we chose to use the MIMIC-III (‘Medical Information Mart for Intensive Care’) public dataset [ 1 ] that includes de-identified health care data collected from patients admitted to critical care units at Beth Israel Deaconess Medical Center from 2001 to 2012.

Patients included in our analyses had an asthma diagnosis or medication to treat asthma according to criteria proposed by the eMERGE (Electronic Medical Records and Genomics) consortium [ 22 ]. We also use the in-hospital mortality labels defined in MIMIC-III for our case study task to predict whether a patient dies in the hospital. This task was treated as a binary classification problem. For the task to predict patient mortality, we narrowed our cohort to include only those patients included in a MIMIC III benchmark dataset [ 17 ] and with an admission period of ≥48 hours. We selected logistic regression, gradient boosting, neural network and KNN ML models that were trained as implemented by the scikit-learn Python package [ 14 ].

We considered gradient boosting models to be “black-box” due to the lack of direct interpretability when boosted classifiers combine multiple trees. Neural network models are also considered “black-box” given that there are often many layers of neurons and it is difficult to relate connection weights to specific concepts. For logistic regression, the coefficients have an interpretation in terms of log odds. KNN also offers some interpretability because we know the closest data points to the query point being used for making a decision. Note, however, that these notions of what is considered black-box may be appropriate in different contexts.

Data subsets and machine learning experiments

To conduct our study, we performed experiments with laboratory results and data subsets that incrementally added encoded information on patient demographics, on clinical events from triplets, and on laboratory results from triplets. These four data subsets were used to train the ML models. The first Labs data subset comes directly from the MIMIC III benchmark dataset. It contained, for each patient, the sequence of laboratory results collected during their first ICU admission. Values at each hour over a 48 hour period were included; for each one hour period, there could be zero, one, or more than one value. For each of the f × 48 slots (where f is the number of labs or features used in the machine learning experiment), we computed the mean, min, max, standard deviation, and skew. This created a total of f × 48 × 5 inputs to the ML models. We also normalized our datasets so that the mean is zero and the variance is one, using sklearn’s StandardScaler [ 14 ].
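A sketch of this per-slot feature construction, assuming a hypothetical long-format DataFrame with columns hour, lab, and value for a single ICU stay (how empty slots are filled is a separate choice not shown here):

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

def hourly_slot_features(lab_df: pd.DataFrame, n_hours: int = 48) -> np.ndarray:
    """Build the f x 48 x 5 block (mean, min, max, std, skew) for one ICU stay."""
    labs = sorted(lab_df["lab"].unique())
    block = np.full((len(labs), n_hours, 5), np.nan)   # empty slots stay NaN
    for i, lab in enumerate(labs):
        for h in range(n_hours):
            v = lab_df.loc[(lab_df["lab"] == lab) & (lab_df["hour"] == h), "value"].to_numpy()
            if v.size:
                s = skew(v) if v.size > 1 else 0.0
                block[i, h] = [v.mean(), v.min(), v.max(), v.std(), s]
    return block.reshape(-1)   # flattened f*48*5 input vector
```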

The second Labs+demo data subset added demographic information: age group at the time of admission to the ICU, race/ethnicity, and sex. Age groups included: <2, 2–17, 18–34, 35–49, 50–69, and 70+ years old. Race/ethnicity groups included: white, black, asian, hispanic, multi, and other. The other category was used when we could not determine the group based on the MIMIC III entry. For sex, groups included male and female. For each patient, these values were repeated for all time slots.

The third Labs+demo+events data subset added a column for each clinical event (i.e., drug prescriptions and procedures) from triplets considered clinically relevant. Each column contains a zero or a one, with one indicating that the clinical event was recorded during the time slot being considered, and zero otherwise.

The fourth Labs+demo+events+triples data subset duplicates columns from the Labs data subset and for a given time slot, replaces laboratory values with a zero if all clinical event values are zero. Otherwise, the laboratory values were left as-is.

The Labs+demo+events and Labs+demo+events+triples data subsets allowed us to examine the extent to which model complexity and performance was impacted by encoding triplet information. See “Machine learning experiments and analyses of model complexity and performance trade-off” for details on experiments with these data subsets.

Analysis of triplet ranking

The ranking of triplets according to discriminative score was assessed by three co-authors (XZ, CGC, JAE) who manually reviewed the clinical relevance of all features in our severe asthma case study. A two-step strategy was applied. The first step was to evaluate whether the medication/procedure was generally known to have, or could conceivably have, an impact (directly or indirectly) on a laboratory result (e.g., we consider arterial blood gas and pulmonary function test results to have medical relevance in asthma patients). The second step was to evaluate whether the combination is relevant to the asthma case study (e.g., ‘Gauge Ordering’ does not indicate specific clinical uses). This process allowed us to filter out lab-event-lab triplets that were not relevant to our case study. For top ranked triplets according to MI score, we calculated precision at k, i.e., the fraction of the top k ranked triplets that were selected by our experts. This set was used in the ML predictors explored in this study. All of the steps to analyze the model complexity and performance trade-off were computational.
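For concreteness, this measure can be written as (our notation, not the paper's): precision@k = |T_k ∩ E| / k, where T_k is the set of the k triplets ranked highest by MI score and E is the set of triplets retained by clinical review.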

Machine learning experiments and analyses of model complexity and performance

We conducted several experiments to assess the model complexity and performance of ML models. We used one measure of model complexity and two measures of performance (model accuracy and model fidelity). Complexity, accuracy and fidelity are three characteristics used to describe ML algorithms that have been summarized by others [ 9 , 23 ].

Measuring ML model complexity

To assess the impact of using ranked laboratory result features on model complexity, we conducted experiments analyzing the number of non-zero coefficients with use of fewer than the baseline 42 features ( k = 32, 16, 8, 4, and 2). In order to assess the impact of using filtered features on model complexity, we conducted experiments comparing the number of non-zero coefficients in the baseline ML model and in the ML models based on filtered features. In order to assess the impact of encoding triplet information into the model on model complexity, we compared the number of non-zero coefficients in ML model subsets that included triplet information (i.e., Labs+demo+events and Labs+demo+events+triples ) to the data subset that includes laboratory results and demographic information (i.e., Labs+demo ). This assessment was conducted for logistic regression and gradient boosting. We did not assess non-zero coefficients for neural network or KNN because the number of non-zero coefficients is not related to the complexity for these models. For all comparisons, we assessed the degree of difference.
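As an illustration of this complexity measure, the sketch below counts non-zero coefficients for a fitted scikit-learn model; treating non-zero feature importances as the gradient boosting analogue of a non-zero coefficient is our reading of the metric:

```python
import numpy as np

def count_nonzero_coefficients(model):
    """Rough model-complexity proxy (a sketch; attribute names are scikit-learn's)."""
    if hasattr(model, "coef_"):                    # e.g. LogisticRegression
        return int(np.count_nonzero(model.coef_))
    if hasattr(model, "feature_importances_"):     # e.g. GradientBoostingClassifier
        return int(np.count_nonzero(model.feature_importances_))
    raise ValueError("no coefficient-based complexity proxy for this model type")
```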

Measuring ML model accuracy

Model accuracy was assessed by examining the extent to which ranking laboratory result features according to discriminative score and filtering features in ML models can accurately predict mortality. For models with ranked features and with filtered features, we also examined the extent to which encoding demographic information and triplet information influences model accuracy. Model fidelity was assessed by observing the extent to which ML models with filtered features (with and without encoding triplet information directly) are able to imitate the baseline ML predictors.

In order to assess the model accuracy of ranking according to discriminative score, we performed feature selection experiments with logistic regression, gradient boosting, and neural network models. For these experiments, we selected the top k = 2, 4, 8, 16, 32, and 42 laboratory results ranked according to model weights for gradient boosting and neural network models, and according to the sum of MI score for logistic regression models. For KNN we did not assess performance changes when considering ranking according to discriminative score. For each experiment, receiver operating characteristic (ROC) curves and area under the ROC curve (AUC) were reported. The range of AUC values was also reported for feature selection experiments.
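A minimal sketch of one such feature selection experiment, assuming hypothetical X_train/X_test arrays, binary mortality labels, and a ranked_idx array holding feature indices in rank order:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def auc_for_top_k(X_train, y_train, X_test, y_test, ranked_idx, k):
    """Fit on the k highest-ranked features and report the test AUC."""
    cols = ranked_idx[:k]
    clf = LogisticRegression(max_iter=1000).fit(X_train[:, cols], y_train)
    return roc_auc_score(y_test, clf.predict_proba(X_test[:, cols])[:, 1])

# Example: aucs = {k: auc_for_top_k(X_tr, y_tr, X_te, y_te, ranked_idx, k)
#                  for k in (2, 4, 8, 16, 32, 42)}
```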

In order to assess the model accuracy of baseline and clinical input-informed ML models, for all three models we performed experiments with data subsets. Baseline ML models included models with 42 features and its data subsets. Filtered features were laboratory results represented among triplets determined to be clinically relevant. For logistic regression, gradient boosting, and neural network models, the Labs+demo data subset enabled assessing the influence of encoding demographic information on model accuracy. The Labs+demo+events and Labs+demo+events+triples data subsets enabled assessing the influence of encoding triplet information on model accuracy. KNN experiments were conducted with the Labs data subset only, so we did not assess the influence of encoding triplet information. For each experiment, ROC curves and AUC were reported.

Measuring ML model fidelity

Fidelity was assessed by examining the extent to which the clinical input-informed ML features (with and without triplet information) are able to accurately imitate our baseline ML predictors. We conducted experiments that enabled comparing clinical input-informed ML model performance to baseline ML model performance. For logistic regression, gradient boosting, and neural network models, clinical input-informed ML models were compared to baseline ML models for three data subsets: Labs, Labs+demo and Labs+demo+events. For KNN, we used the Labs data subset and compared the performance of models with filtered features (with and without weights) to the baseline ML model. We report the difference in AUC for clinical input-informed ML models with and without encoding triplets (i.e., the Labs+demo+events and Labs+demo+events+triples data subsets) compared to the baseline for logistic regression, gradient boosting, and neural network ML models.

Triplet identification and ranking

We discovered 218 prescription and 535 procedure triplets with more than 10 instances. These two lists of triplets were sorted by MI score prior to manual clinical review. Upon clinical review, we found that 82 triplets (27 prescription and 55 procedure triplets; see S1 and S2 Tables) were meaningful for our case study. Precision at k for prescription and procedure events according to MI score is shown in Table 2. Triplets used in this calculation are summarized in S3 and S4 Tables. For prescription triplets, precision at k = 3, 5, 10, and 20 ranged from 20% to 67%, i.e., the percentage of the top-ranked triplets that were relevant to the case study. For procedure triplets, precision at k = 3, 5, 10, and 20 ranged from 0% to 20%.

The triplets were used to enable more interpretable ML models through their use to select and weight features. Eleven laboratory tests were represented among the 82 clinically meaningful triplets discovered at this step (i.e., filtered laboratory results; see S5 Table). The “Performance result: ML model accuracy and fidelity” section shows how feature selection using the 11 laboratory tests (i.e., filtered features) impacted ML model accuracy.

Machine learning results

The ML experiments are based on a subset of 7777 patient records from the MIMIC III database that are also included in a benchmark dataset used by others [ 17 ]. This dataset of 7777 patients was divided into 6222 training cases (death rate 0.489) and 1555 testing cases (death rate 0.494). An overview of the final dataset, data pre-processing, and data analysis steps is shown in Fig 1 and Table 3. The top 42 ranked laboratory tests used in the logistic regression and gradient boosting model experiments are shown in S6 and S7 Tables.

ML model complexity results

The ML models for logistic regression and gradient boosting for k ≤ 42 laboratory result features are illustrated in Tables 4 and 5. For both, model complexity decreased with the use of fewer features informed by discriminative score (Tables 4 and 5). For logistic regression models, there was a -1.8 fold change in non-zero coefficients between models with 16 and 8 features, and a -2.8 fold change between models with 4 and 2 features. For gradient boosting models, there was a -1.7 fold change in non-zero coefficients between models with 32 and 16 features, a -1.7 fold change between models with 8 and 4 features, and a -2.5 fold change between models with 4 and 2 features. All other model fold changes were 0 to 0.5 and were interpreted as no difference.

The use of clinical input-informed (filtered) features decreased model complexity for logistic regression and gradient boosting ML models. Both the logistic regression and gradient boosting models had a -1.7 fold change in non-zero coefficients between the models with filtered features for the Labs data subset and the baseline model. There was no difference between the models with filtered features for the Labs+demo and Labs+demo+events data subsets and the baseline model (i.e., fold changes were 0 to 0.5).

When examining the influence of encoding demographic and triplet information for models with non-filtered features, we found that model complexity decreased. Among logistic regression models, there was a -1.7 fold change in non-zero coefficients between models that encode demographic information ( Labs+demo ) compared to the baseline model ( Labs ), and a -1.6 fold change between models that encode triplet information ( Labs+demo+events ) compared to the baseline model. Among gradient boosting models, there was a -1.9 fold change in non-zero coefficients between models that encode demographic information ( Labs+demo ) compared to the baseline model, and a -2.0 fold change between models that encode triplet information ( Labs+demo+events ) compared to the baseline model.

When examining the influence of encoding demographic and triplet information for models with filtered features, the model complexity increased for one model that encoded triplet information. There was a 2.3 fold change for the gradient boosting model with filtered features for the Labs+demo+events+triples data subset compared to the baseline model. There were no differences between other models encoding demographic and triplet information with filtered features and the baseline model (i.e., fold changes were 0 to 0.5 for Labs+demo and Labs+demo+events data subsets).

Performance result: ML model accuracy and fidelity

Model accuracy and fidelity for logistic regression, gradient boosting, and neural network ML models (top k = 2, 4, 8, 16, 32, and 42 features, and 11 filtered features) are summarized in Table 6. The top performing models yielded AUCs of 0.73 for logistic regression, 0.75 for gradient boosting, and 0.68 for neural network models. For KNN, the baseline model accuracy was AUC 42 = 0.57.

AUC = area under the receiver operating characteristic curve

For logistic regression, gradient boosting, and neural network ML models, we found model accuracy to be robust to feature removal informed by discriminative score (Figs 2, 3 and 4). We observed slightly lower AUCs between models with 32 and 16 features (ΔAUC 32‖16 = 0.04, 0.02, and 0.03 for logistic regression, gradient boosting, and neural network models, respectively). Step-wise feature removal otherwise yielded negligible differences (i.e., ΔAUC ≤ 0.01). Across all data subsets, the maximum difference from baseline was ΔAUC max = 0.1 for logistic regression, ΔAUC max = 0.09 for gradient boosting, and ΔAUC max = 0.06 for neural network models.

[Fig 2 (pone.0231300.g002)]

For all ML models, the clinical input-informed ML features (i.e., filtered features) yielded comparable model accuracy to the baseline (Figs 5, 6 and 7). The top performing models with filtered features yielded AUCs of 0.73 for logistic regression, 0.74 for gradient boosting, 0.68 for neural network, 0.54 for KNN unweighted, and 0.56 for KNN weighted models. The magnitude of the difference in AUC from baseline models was >0.01 for the KNN unweighted model, which achieved lower but comparable accuracy (ΔAUC = 0.03), and for the neural network model for the Labs+demo+events data subset, which achieved higher but comparable accuracy (ΔAUC = 0.03). Differences from the baseline were negligible for all other models.

[Fig 5 (pone.0231300.g005)]

Findings from feature engineering experiments used to assess the influence of encoding demographic and triplet information on model accuracy are also illustrated in Figs 4, 5 and 6. ML models encoding demographic information (i.e., Labs+demo data subsets) showed higher accuracy than baseline ML models. Among ML models without filtered features, the performance of those encoding demographic information was 0.73 for logistic regression, 0.74 for gradient boosting, and 0.67 for neural network models. The magnitude of the performance improvement of the Labs+demo data subsets over the baseline was ΔAUC = 0.09 for logistic regression, ΔAUC = 0.05 for gradient boosting, and ΔAUC = 0.04 for neural network models. Among ML models with filtered features, the performance of models encoding demographic information was 0.72 for logistic regression, 0.73 for gradient boosting, and 0.67 for neural network models. The magnitude of the performance improvements over the baseline was ΔAUC = 0.08 for logistic regression, ΔAUC = 0.04 for gradient boosting, and ΔAUC = 0.04 for neural network models.

ML models encoding triplet information (i.e., Labs+demo+events and Labs+demo+events+triples data subsets) showed higher accuracy than baseline ML models. Among ML models encoding triplet information, the top performing models yielded AUCs of 0.73 for logistic regression, 0.75 for gradient boosting, and 0.68 for neural network models. For ML models without filtered features, when compared to baseline models, the performance improvements for the Labs+demo+events data subsets were ΔAUC = 0.07 for logistic regression, ΔAUC = 0.05 for gradient boosting, and ΔAUC = 0.02 for neural network models. The same differences were observed for logistic regression and gradient boosting ML models with filtered features for the Labs+demo+events data subsets. For the neural network model with filtered features for the Labs+demo+events data subset, ΔAUC = 0.05. For models with filtered features for the Labs+demo+events+triples data subsets, when compared to baseline models, the performance improvements were ΔAUC = 0.08 for logistic regression, ΔAUC = 0.05 for gradient boosting, and ΔAUC = 0.02 for neural network models.

Clinically meaningful triplets and utility of discriminative score

Eighty-two clinically meaningful triplets were discovered for a severe asthma case study (S1 and S2 Tables). We found that the precision of MI score rankings to discover those triplets was low to moderate for prescriptions (20% to 67%) and low for procedures (0% to 20%). This finding indicates that more work is needed in order to move from manual clinical review to an automated process. Through further assessment of MI score rankings, we found that similar model performance for the mortality prediction task can be achieved with decreased model complexity (i.e., fewer features selected according to discriminative score; Figs 2, 3 and 4). These findings suggest that the MI score alone cannot replace clinical input, but that the use of MI score-informed rankings has potential to decrease model complexity at little cost to model performance.

Model complexity and performance of machine learning models

We assessed model complexity for logistic regression and gradient boosting. We found decreases in model complexity with the use of fewer features informed by discriminative score and with clinical input ( Table 4 and Table 5 ). We also found decreases in model complexity when demographic information was encoded. Results were mixed for models that encoded triplet information, with decreases in model complexity observed with Labs+demo+events data subsets, and an increase in complexity observed with one Labs+demo+events+triples data subset.

When considering model performance for approaches that decreased model complexity, we found that logistic regression, gradient boosting, and neural network ML models were robust to feature removal informed by discriminative score. We did however observe diminishing performance with fewer features. Among the machine learning models we explored, the neural network model was the most robust to feature removal.

For logistic regression, gradient boosting, neural network, and KNN models with filtered features, we found small differences in performance when compared to baseline ML models (ΔAUC ≤ 0.03 across all modeling approaches). For logistic regression, gradient boosting, and neural network models with filtered features, we also found that encoding demographic and triplet information showed both decreases in model complexity and improvements in performance for the Labs+demo and Labs+demo+events subsets. The results were mixed for the Labs+demo+events+triples subsets: decreases in model complexity and improvements in performance were observed for the logistic regression and gradient boosting models with filtered features, whereas for the neural network model with filtered features, model complexity increased and the improvement in performance was very small.

Limitations and implications for future work

Our use of the MIMIC-III dataset in this case study may influence generalizability of our approach to other clinical datasets given that the dataset derives exclusively from patients in intensive care settings. Because of this, information typically collected in routine outpatient settings, such as pulmonary function tests, would not be included in our model. This limitation was mitigated in part through our filtering approach that excluded routine features of intensive care.

There are also some limitations due to our approach to define triplets. First, when considering prescriptions as the anchor event for triplets, the half-life of the medications may be relevant. Our approach to discover triplets as lab-event-lab triples (i.e., laboratory results prior to and immediately following a prescription) may exclude some triplets relevant to medications with a longer half-life. Second, under circumstances where we used a default binning approach for laboratory results, extreme outliers skewed the boundaries. There are other approaches to discretize time series data such as SAX [ 24 ] and Temporal Discretization for Classification (TD4C) [ 25 ] methods that might be considered to overcome this limitation.

In addition, the performance of our ML approaches may be influenced by our cohort description. First, severe asthma patients were selected according to asthma medications and diagnoses listed for those patients. Our inclusion criteria may be improved by processing the free text “admission reason” to determine if asthma or asthma-like terms are mentioned. Second, we did not consider the presence of co-morbid conditions in our predictive modeling. Future work may draw from others to improve prediction or treatment guidelines for severe asthma. For example, there are many studies such as [ 26 ] that aim to detect factors that can predict asthma. We anticipate that such considerations can provide improved predictions over approaches explored here.

A major contribution of this work to the ML literature is our approach to incorporate clinical expertise. When reviewing ML and temporal data mining research broadly, we found that this was a gap [ 4 , 6 , 8 , 27 – 33 ]. Unlike previous efforts, our experiments tested the impact of using a feature engineering approach to incorporate clinical input on model complexity and performance. We also provided a simplified framework to encode triplets (lab-event-lab patterns) for direct inclusion into our models that warrants further exploration. Furthermore, well-known feature selection methods such as Lasso [ 34 ] do not incorporate expert knowledge and rely on statistical methods to choose features. Findings from our work motivate future efforts to explore ways to use statistical methods together with approaches such as ours, which incorporate clinician-informed information on the relationship between clinical events and laboratory measurements into the machine learning process.

This work explored a feature engineering approach with longitudinal data that enabled incorporating clinical input in ML models. We assessed the impact of two characteristics of the approach on the complexity and performance of ML models for a mortality prediction task for a severe asthma case study: ranking features by discriminative score (e.g., MI score sum), and filtering laboratory features according to input on lab-event-lab triplets that are clinically meaningful. We found that ML models that use fewer input features selected based on discriminative score or according to which triplets are clinically meaningful, can both decrease model complexity with little cost to performance. Furthermore, for models with lower model complexity through the use of filtered features, the performance of ML models showed improvements from the baseline for data subsets that encode demographic and triplet information. Such approaches to reduce ML model complexity at little cost to performance warrant further comparison and consideration to combine with state-of-the-art feature selection methods, as well as exploration beyond the severe asthma case study.

Supporting information

Acknowledgments.

The authors would like to thank the anonymous reviewers for their comments and suggestions for improving this manuscript. We also want to thank Brant Chee (Johns Hopkins University Applied Physics Laboratory), Matt Kinsey (Johns Hopkins University Applied Physics Laboratory), and Richard Zhu (Institute for Clinical and Translational Research, Johns Hopkins University) for their valuable input.

Funding Statement

This work was funded in part by the Biomedical Translator Program initiated and funded by NCATS (NIH awards 1OT3TR002019, 1OT3TR002020, 1OT3TR002025, 1OT3TR002026, 1OT3TR002027, 1OT2TR002514, 1OT2TR002515, 1OT2TR002517, 1OT2TR002520, 1OT2TR002584). Any opinions expressed in this manuscript are those of co-authors who are members of the Translator community and do not necessarily reflect the views of NCATS, individual Translator team members, or affiliated organizations and institutions.

Data Availability

Feature Engineering for Machine Learning

Inna Logunova

Despite being vital to machine learning, feature engineering is not always given due attention. Feature engineering is a supporting step in machine learning modeling, but with a smart approach to data selection, it can increase a model’s efficiency and lead to more accurate results. It involves extracting meaningful features from raw data, sorting features, dismissing duplicate records, and modifying some data columns to obtain new features more relevant to your goals.

From this article, you will learn what feature engineering is and how it can be used to improve your machine learning models. We’ll also discuss different types and techniques of feature engineering and what each type is used for.

This article was updated on December 21, 2023 (new libraries added to the section “Tools for feature engineering”).

  • Why is feature engineering essential?

Feature engineering is necessary for designing accurate predictive models to solve problems while decreasing the time and computation resources needed.

The features of your data have a direct impact on the predictive models you train and the results you can get with them. Even if your data for analysis is not ideal, you can still get the outcomes you are looking for with a good set of features.

But what is a feature? Features are measurable data elements used for analysis. In datasets, they appear as columns. So, by choosing the most relevant pieces of data, we achieve more accurate predictions for the model.

Another important reason for using feature engineering is that it enables you to cut time spent on data analysis.

  • What is feature engineering?


Feature engineering is a machine learning technique that transforms available datasets into feature sets better suited to a specific task. This process involves:

  • Performing data analysis and correcting inconsistencies (like incomplete, incorrect data or anomalies).
  • Deleting variables that do not influence model behavior.
  • Dismissing duplicate and correlated records and, sometimes, carrying out data normalization.

This technique is equally applicable to supervised and unsupervised learning. With the modified, more relevant data, we can enhance the model accuracy and response time with a smaller number of variables rather than increasing the size of the dataset.

Feature engineering steps


  • Preliminary stage: Data preparation

To start the feature engineering process, you first need to convert raw data collected from various sources into a format that the ML model can use. For this, you perform data cleansing, fusion, ingestion, loading, and other operations. Now you’re ready for feature engineering.

  • Exploratory data analysis

This step consists of computing descriptive statistics on datasets and creating visualizations to explore the nature of your data. Next, look for correlated variables and their properties in the dataset columns and clean the data, if necessary.

  • Feature improvement

This step involves the modification of data records by filling in missing values, transforming, normalizing, or scaling data, as well as adding dummy variables. We’ll explain all these methods in detail in the next section.

  • Feature construction

You can construct features automatically and manually. In the first case, algorithms like PCA, t-SNE, or MDS (linear and nonlinear) will be helpful. When it comes to manual feature construction, the options are virtually endless, and the choice of method depends on the problem to be solved. One of the most well-known solutions is convolution matrices. For example, they have been widely used to create new features in computer vision problems.
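As a quick illustration of automatic feature construction, PCA in scikit-learn builds new features as linear combinations of the originals (the data here is random and for illustration only):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(200, 30))   # stand-in feature matrix

pca = PCA(n_components=10)
X_constructed = pca.fit_transform(X)                   # 10 new features, linear mixes of the originals
print(pca.explained_variance_ratio_.sum())             # share of variance the new features retain
```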

  • Feature selection

Feature selection, also known as variable selection or attribute selection, is a process of reducing the number of input variables (feature columns) by selecting the most important ones that correlate best with the variable you’re trying to predict while eliminating unnecessary information.

There are many techniques you can use for feature selection:

  • filter-based, where you filter out the irrelevant features;
  • wrapper-based, where you train ML models with different combinations of features;
  • hybrid, which implements both of the techniques above.

In the case of filter-based methods, statistical tests are used to determine the strength of correlation of a feature with the target variable. The choice of test depends on the data types of the input and output variables (i.e., whether they are categorical or numerical); for example, chi-squared tests suit categorical inputs and outputs, while ANOVA F-tests suit numerical inputs with a categorical output. A minimal selection sketch follows below.

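A minimal filter-based selection sketch with scikit-learn's SelectKBest and the ANOVA F-test (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Keep the 8 features with the strongest ANOVA F-test association with the target.
selector = SelectKBest(score_func=f_classif, k=8)
X_selected = selector.fit_transform(X, y)
print(selector.get_support(indices=True))   # indices of the retained columns
```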

  • Model evaluation and verification

Evaluate the model’s accuracy on training data with the set of selected features. If you have achieved the desired accuracy of the model, proceed to model verification. If not, go back to feature selection and revisit your set of attributes.

Feature engineering methods


Imputation is the process of managing missing values, which is one of the most common problems when it comes to preparing data for machine learning. By missing values, we mean cells in a row where information is absent.

There may be different causes of missing values, including human error, data flow interruptions, cross-dataset errors, etc. Since data completeness impacts how well machine learning models perform, imputation is quite important.

Here are some ways how you can solve the issue of missing values:

  • If a row is less than 20-30% complete, it’s recommended to dismiss such a record.
  • A standard approach to assigning values to the missing cells is to calculate a mode, mean, or median for a column and replace the missing values with it.
  • In other cases, there are possibilities to reconstruct the value based on other entries. For example, we can find out the name of a country if we have the name of a city and an administrative unit. Conversely, we can often determine the country/city by a postal code.

You can find more sophisticated approaches to imputation in this post .
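A minimal sketch of the mean/median strategy with scikit-learn's SimpleImputer (toy data for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [34, np.nan, 29, 41],
                   "income": [52_000, 61_000, np.nan, 58_000]})

# Replace missing numeric cells with the column median (mean or most_frequent also work).
imputer = SimpleImputer(strategy="median")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```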

  • Outlier handling

Outlier handling is another way to increase the accuracy of data representation. Outliers are data points that are significantly different from other observations.

This graph shows how outliers can influence the ML model. By dismissing the outliers, we can achieve more accurate results.


Outlier handling can be done by removing or replacing outliers. Check out this post for an overview of the five most popular approaches to handling outliers.
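One common removal rule is the interquartile-range (IQR) fence; a minimal pandas sketch:

```python
import pandas as pd

def drop_iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Keep only points inside [Q1 - k*IQR, Q3 + k*IQR] (a common rule of thumb)."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return values[(values >= q1 - k * iqr) & (values <= q3 + k * iqr)]
```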

  • One-hot encoding

Categorical values (often referred to as nominal ) such as gender, seasons, pets, brand names, or age groups often require transformation, depending on the ML algorithm used. For example, decision trees can work with categorical data. However, many others need the introduction of additional artificial categories with a binary representation.

Binary representation means you assign a value of 1 if the feature is valid and 0 if it is not.

One-hot encoding is a technique of preprocessing categorical features for machine learning models. For each category, it creates a new binary feature, often called a “dummy variable.”
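A minimal sketch with pandas' get_dummies (scikit-learn's OneHotEncoder is the pipeline-friendly alternative):

```python
import pandas as pd

df = pd.DataFrame({"season": ["winter", "summer", "summer", "spring"]})

# Each category becomes its own 0/1 dummy-variable column.
dummies = pd.get_dummies(df["season"], prefix="season")
```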

  • Log transformation

This method can bring a skewed distribution closer to a normal one. Logarithm transformation (or log transformation) replaces each variable x with log(x).
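A minimal NumPy sketch; log1p is a common zero-safe variant of the plain log transform:

```python
import numpy as np

values = np.array([10.0, 20.0, 60.0, 70.0, 4000.0])   # skewed toy data

log_values = np.log(values)       # plain log transform
log1p_values = np.log1p(values)   # log(1 + x): safe when values can be zero
```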


The benefits of log transform:

  • Data magnitude within a range often varies. For example, magnitude between ages 10 and 20 is not the same as that between ages 60 and 70. Differences in this type of data are normalized by log transformation.
  • It also normalizes magnitude differences, which increases the robustness of the model and reduces the negative effect of outliers.


Normal distribution has undeniable advantages, but note that in some cases it can affect the model’s robustness and accuracy of results.

Scaling is a data calibration technique that facilitates the comparison of different types of data. It is useful for correcting the way the model handles small and large numbers.

For example, despite its small value, the floor number in a building is as important as the square footage.


The most popular scaling techniques include:

  • min-max scaling
  • absolute maximum
  • standardization
  • normalization

Min-max scaling is represented by the following formula: x_scaled = (x − x_min) / (x_max − x_min).

The absolute maximum scaling technique consists of dividing all figures in the data set by its maximum absolute value.

Standardization is done by subtracting the mean from each individual value and dividing by the standard deviation (sigma). The following equation describes the process: z = (x − μ) / σ.

Normalization is quite similar, except that we work with the difference of each value from the mean, divided by the difference between maximum and minimum values in the dataset.

To implement scaling, you can use Python frameworks such as scikit-learn, pandas, or RasgoQL. Check out this Kaggle guide for some practical scaling tips.
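A minimal scikit-learn sketch contrasting min-max scaling and standardization (toy data for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 800.0]])     # e.g. floor number vs. square footage

X_minmax = MinMaxScaler().fit_transform(X)      # rescales each column to [0, 1]
X_standard = StandardScaler().fit_transform(X)  # zero mean, unit variance per column
```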

Tools for feature engineering

Below, you will find an overview of some of the best libraries and frameworks you can use for automating feature engineering.

  • Scikit-learn

This machine learning library for Python is known for its simplicity and efficiency in data analysis and modeling. Scikit-learn offers many robust tools, including classification, regression, clustering, and dimensionality reduction. What sets Scikit-learn apart is its user-friendly interface, which allows even those with minimal experience in ML to easily implement powerful algorithms. Additionally, it provides extensive documentation and plenty of examples.

  • Feature-engine

Feature-engine is a Python library designed to engineer and select features for machine learning models, compatible with Scikit-learn’s fit() and transform() methods. It offers a range of transformers for tasks, such as missing data imputation, categorical encoding, outlier handling, and more, allowing for targeted transformations on selected variable subsets. Feature-engine transformers can be integrated into scikit-learn pipelines, enabling the creation and deployment of comprehensive ML pipelines in a single object.

  • TSFresh

TSFresh is a free Python library containing well-known algorithms from statistics, time series analysis, signal processing, and nonlinear dynamics. It is used for the systematic extraction of time-series features. These include the number of peaks, the average value, the largest value, or more sophisticated properties, such as the statistics of time reversal symmetry.

  • Feature Selector

As the name suggests, Feature Selector is a Python library for choosing features. It determines attribute significance based on missing data, single unique values, collinear or insignificant features. For that, it uses “lightgbm” tree-based learning methods. The package also includes a set of visualization techniques that can provide more information about the dataset.

  • PyCaret

PyCaret is a Python-based open-source library. Although it is not a dedicated tool for automated feature engineering, it does allow for the automatic generation of features before model training. Its advantage is that it lets you replace hundreds of code lines with just a handful, thus increasing productivity and exponentially speeding up the experimentation cycle.

Some other useful tools for feature engineering include:

  • the NumPy library with numeric and matrix operations;
  • Pandas where you can find the DataFrame, one of the most important elements of data science in Python;
  • Matplotlib and Seaborn that will help you with plotting and visualization.
  • Advantages and drawbacks of feature engineering

Practical examples of feature engineering

Serokell uses feature engineering in a variety of custom ML services we provide. In this section, we share some industry examples from our experience.

  • Gaining more insights from the same data

Many datasets contain variables such as date, distance, age, weight, etc. However, often it would be best to transform them into other formats to obtain answers to your specific questions. For example, weight per se might not be helpful for your analysis. But if you convert your data into BMI (body mass index, a measure of body fat based on height and weight), you get a different picture, enabling you to make conclusions about a person’s overall health.
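A one-line pandas sketch of this kind of constructed feature (toy data):

```python
import pandas as pd

people = pd.DataFrame({"height_m": [1.70, 1.82], "weight_kg": [68, 95]})

# Constructed feature: BMI = weight / height^2
people["bmi"] = people["weight_kg"] / people["height_m"] ** 2
```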

Values like date and duration can predict repetitive actions, such as repeated user visits to an online store over a month or year, or reveal correlations between sales volumes and seasonal trends.

Watch the video below to see what real-life feature engineering looks like in Python.

  • Building predictive models

By selecting relevant features, you can build predictive models for various industries, for example, public transportation. This case study describes how to design a model predicting ridership on Chicago “L” trains. It should be noted that each constructed feature needs verification; in this case, the hypothesis that weather conditions impact the number of people entering a station turned out to be invalid.

This academic paper demonstrates how feature engineering can improve prediction for heart failure readmission or death.

Another case study shows how to predict people’s profession based on discrete data. Predictor clusters included variables such as geographic location, religious affiliation, astrological sign, children, pets, income, education.

As an additional example, watch this video explaining how feature engineering helps construct a model in TensorFlow to predict income based on age.

  • Overcoming the “black box” problem

One of the most significant drawbacks of neural networks, especially critical for healthcare, is that it’s impossible to understand the logic behind their predictions. This “black box” effect decreases trust in the analysis as physicians can’t explain why the algorithm came up with a particular conclusion.

The authors of this research paper suggest incorporating expert knowledge in ML models with the help of feature engineering. This way, the model created can be simpler and easier to interpret.

  • Detecting malware

Malware is hard to detect manually, and neural networks are not always effective either. But you can use a combined approach that includes feature engineering as the first step. With it, you can highlight specific classes and structures that the ML model should look for at the next stage. Find out more here.

  • Conclusions and further learning

Books on feature engineering

As we have seen, feature engineering is an essential and extremely helpful approach for data scientists that can enhance ML model efficiency exponentially. If you are ready to dive deeper into this promising area, we recommend reading the following books.

Feature Engineering Bookcamp by Sinan Ozdemir

The book takes you through a series of projects that give you hands-on experience with basic FE techniques. In this helpful book, you’ll learn about feature engineering through engaging, code-based case studies such as classifying tweets, detecting stock price fluctuations or predicting pandemic development, and much more.

Python Feature Engineering Cookbook: Over 70 recipes for creating, engineering, and transforming features to build machine learning models by Soledad Galli

With this cookbook, you will learn to simplify and improve the quality of your code and apply feature engineering techniques in machine learning. The author explains how to work with continuous and discrete datasets and modify features from unstructured datasets using Python packages like Pandas, Scikit-learn, Featuretools, and Feature-engine. You will master methods for selecting the best features and the most appropriate extraction techniques.

Feature Engineering and Selection: A Practical Approach for Predictive Models by Max Kuhn and Kjell Johnson

The development of predictive models is a multi-step process. Most materials focus on the modeling algorithms; this book explains how to select the optimal predictors to improve model performance. The authors illustrate the narrative with examples from various data sets and provide R programs for replicating the results.

Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists by Alice Zheng, Amanda Casari

Each chapter guides you through a particular data problem, such as representing text or image data. Rather than just teaching specific topics (FE for numeric data, natural text analysis techniques, model stacking, etc.), the authors provide a series of exercises throughout the book. The final chapter rounds out the discussion by applying various feature engineering methods to an actual, structured data set. Code examples use NumPy, Pandas, Scikit-learn, and Matplotlib.

You will find more relevant materials on ML-related topics on our blog .



A Two-Dimensional Feature Engineering Method for Relation Extraction

Transforming a sentence into a two-dimensional (2D) representation (e.g., the table filling) has the ability to unfold a semantic plane, where an element of the plane is a word-pair representation of a sentence which may denote a possible relation representation composed of two named entities. The 2D representation is effective in resolving overlapped relation instances. However, in related works, the representation is directly transformed from a raw input. It is weak to utilize prior knowledge, which is important to support the relation extraction task. In this paper, we propose a two-dimensional feature engineering method in the 2D sentence representation for relation extraction. Our proposed method is evaluated on three public datasets (ACE05 Chinese, ACE05 English, and SanWen) and achieves the state-of-the-art performance. The results indicate that two-dimensional feature engineering can take advantage of a two-dimensional sentence representation and make full use of prior knowledge in traditional feature engineering. Our code is publicly available at https://github.com/Wang-ck123/A-Two-Dimensional-Feature-Engineering-Method-for-Entity-Relation-Extraction

Dr Puxiang Lai’s research on “High-security learning-based optical encryption assisted by disordered metasurface” published in Nature Communications

10 Apr 2024

Puxiang_nature communication_1

Dr Puxiang Lai’s research on “High-security learning-based optical encryption assisted by disordered metasurface” is published in Nature Communications

The research paper titled “ High-security learning-based optical encryption assisted by disordered metasurface ”, with Associate Professor Dr Puxiang Lai as one of the co-authors, is published in Nature Communications , an open access, multidisciplinary journal dedicated to publishing high-quality research in all areas of the biological, health, physical, chemical, Earth, social, mathematical, applied, and engineering sciences. 

“ High-security learning-based optical encryption assisted by disordered metasurface ” Zhipeng Yu, Huanhao Li, Wannian Zhao, Po-Sheng Huang, Yu-Tsung Lin, Jing Yao, Wenzhao Li, Qi Zhao, Pin Chieh Wu, Bo Li, Patrice Genevet, Qinghua Song & Puxiang Lai Nature Communications volume 15, Article number: 2607 (2024). doi: 10.1038/s41467-024-46946-w

Abstract Artificial intelligence has gained significant attention for exploiting optical scattering for optical encryption. Conventional scattering media are inevitably influenced by instability or perturbations, and hence unsuitable for long-term scenarios. Additionally, the plaintext can be easily compromised due to the single channel within the medium and one-to-one mapping between input and output. To mitigate these issues, a stable spin-multiplexing disordered metasurface (DM) with numerous polarized transmission channels serves as the scattering medium, and a double-secure procedure with superposition of plaintext and security key achieves two-to-one mapping between input and output. In attack analysis, when the ciphertext, security key, and incident polarization are all correct, the plaintext can be decrypted. This system demonstrates excellent decryption efficiency over extended periods in noisy environments. The DM, functioning as an ultra-stable and active speckle generator, coupled with the double-secure approach, creates a highly secure speckle-based cryptosystem with immense potentials for practical applications.



Soft Computing: Theories and Applications, pp 769–789

Understanding the Role of Feature Engineering in Fake News Detection

  • Ajay Agarwal,
  • Basant Agarwal &
  • Priyanka Harjule
  • Conference paper
  • First Online: 02 June 2022


Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 425)

The nature of a pandemic often seizes its capacity to adjust its duration. Whilst biological pandemics might ebb within a few years, cognitive pandemics such as the misinfodemic, which attack the cognitive possessions of the human brain, fail to converge to a point where they ebb. Consequently, it is natural that the world has been constantly in the grasp of misinformation, evolving from simple linguistic deception manifested as lies to the heavy-lettered words of misinformation and disinformation. As the academic research community attempts to circumnavigate this phenomenon from interdisciplinary perspectives, it becomes crucial to take a step back and put this phenomenon under the microscope of temporal analysis. This allows us to see that, as with any biological threat, the cue to understanding misinformation lies in its evolution from a simple linguistic lie to the state it occupies today. Psycholinguistically speaking, the cure for the misinfodemic lies in attending to the devil in the fine print, from the deceptive lies published years ago to the misinformation disseminated on social media platforms today. Our study fulfils this task and expands to present how these fine linguistic cues of deception, even today, play a significant role in drafting feature-engineering-augmented NLP models that occupy state-of-the-art positions in the ML-based task of fake news classification and identification.
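
To make the abstract's claim concrete, the sketch below shows one way "feature engineering with linguistic deception cues" can look in code: a handful of hand-crafted cues (document length, first-person pronoun rate, negation rate, exclamation rate, mean word length) are computed per document and fed to an ordinary classifier. The cue list and the tiny corpus are illustrative assumptions, not the feature set or data used in this chapter.

```python
# Illustrative only: a few hand-crafted linguistic deception cues used as
# features for a standard classifier; the cues and corpus are assumptions.
import re
import numpy as np
from sklearn.linear_model import LogisticRegression

FIRST_PERSON = {"i", "me", "my", "mine", "we", "our", "us"}
NEGATIONS = {"no", "not", "never", "none", "nobody", "nothing"}

def linguistic_cues(text):
    words = re.findall(r"[a-z']+", text.lower())
    n = max(len(words), 1)
    return [
        len(words),                                          # document length
        sum(w in FIRST_PERSON for w in words) / n,           # first-person rate
        sum(w in NEGATIONS for w in words) / n,              # negation rate
        text.count("!") / n,                                 # exclamation rate
        np.mean([len(w) for w in words]) if words else 0.0,  # mean word length
    ]

# Tiny toy corpus (1 = fake, 0 = genuine), purely for demonstration.
texts = [
    "SHOCKING!!! You will never believe what they are hiding from us!",
    "The council approved the budget for road maintenance on Tuesday.",
    "I swear this miracle cure works, doctors hate it, share it now!!!",
    "Researchers reported a modest improvement in battery efficiency.",
]
labels = [1, 0, 1, 0]

X = np.array([linguistic_cues(t) for t in texts])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))  # sanity check on the training texts themselves
```

Hand-crafted cues like these are typically combined with, or compared against, learned text representations in the feature-engineering-augmented models the chapter surveys.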



Author information

Authors and Affiliations

DIT University, Dehradun, India

Ajay Agarwal

Indian Institute of Information Technology, Kota, India

Basant Agarwal

MNIT Jaipur, Jaipur, India

Priyanka Harjule


Corresponding author

Correspondence to Ajay Agarwal.

Editor information

Editors and Affiliations

Department of Electrical Engineering, Malaviya National Institute of Technology, Jaipur, Rajasthan, India

Rajesh Kumar

Gwangju Institute of Science and Technology, Gwangju, Korea (Republic of)

Chang Wook Ahn

Department of Computer Science, Shobhit University, Gangoh, India

Tarun K. Sharma

Department of Instrumentation and Control Engineering, Dr. B. R. Ambedkar National Institute of Technology, Jalandhar, India

Om Prakash Verma

Indian Institute of Information Technology Kota, Jaipur, Rajasthan, India

Anand Agarwal


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Agarwal, A., Agarwal, B., Harjule, P. (2022). Understanding the Role of Feature Engineering in Fake News Detection. In: Kumar, R., Ahn, C.W., Sharma, T.K., Verma, O.P., Agarwal, A. (eds) Soft Computing: Theories and Applications. Lecture Notes in Networks and Systems, vol 425. Springer, Singapore. https://doi.org/10.1007/978-981-19-0707-4_70


DOI: https://doi.org/10.1007/978-981-19-0707-4_70

Published: 02 June 2022

Publisher Name: Springer, Singapore

Print ISBN: 978-981-19-0706-7

Online ISBN: 978-981-19-0707-4

eBook Packages: Intelligent Technologies and Robotics (R0)


Te Kura Mata-Ao | School of Engineering

Innovation that guides our future.


Eight engineering programmes.

At Waikato, we offer a full range of engineering programmes including civil, mechanical, chemical, and electrical engineering. These programmes are internationally accredited through the Washington Accord and recognised around the world.


Why study engineering at Waikato

  • Study with us
  • Bachelor of Engineering (Hons)
  • Master of Engineering
  • Master of Engineering Practice
  • Engineering research groups
  • Engineering Design Show

The annual Waikato Engineering Design Show celebrates the fine achievements of our undergraduate professional engineering students. The event is open to the public and a great opportunity to see some cutting-edge tech in action.


Engineering jump start programme

If you're passionate about engineering but do not quite meet the BE(Hons) entry requirements or want a refresh of NCEA Level 3 maths and physics, join us for the Engineering Jump Start programme from late January on the Hamilton campus.

Add a business edge

Waikato offers a unique opportunity for students to gain engineering expertise and business acumen through the Diploma in Engineering Management alongside their BE(Hons), preparing them for top management roles in industry.

Engineering Scholarships and Awards

Find out more about the scholarships we have available.

What our students are saying

Meet the staff of the School of Engineering.

Family footsteps to engineering excellence: Shermi Perera graduates with honours

Bachelor of Engineering graduate, Shermi Perera, is working at Beca in Dunedin, engaging in diverse engineering projects aimed at enhancing water resource management.

Waikato's engineering degrees receive international accreditation

Read about the international accreditation the University of Waikato has received for all eight of its Bachelor of Engineering (Honours) programmes.

Waikato mechatronics connecting with the world

Dr. Shen Hin Lim, Senior Lecturer of Mechatronics and Programme Leader at the University's School of Engineering, eagerly anticipates a peaceful summer after frequent trips to Korea.

Significant funding boost for University of Waikato research

The University of Waikato has achieved significant results in the latest funding round from the Ministry of Business, Innovation and Employment's Endeavour Fund.

Contact the School of Engineering



Computer Science > Machine Learning

Title: Toward Efficient Automated Feature Engineering

Abstract: Automated Feature Engineering (AFE) refers to automatically generating and selecting optimal feature sets for downstream tasks, and it has achieved great success in real-world applications. Current AFE methods mainly focus on improving the effectiveness of the produced features but ignore the low-efficiency issue for large-scale deployment. Therefore, in this work, we propose a generic framework to improve the efficiency of AFE. Specifically, we construct the AFE pipeline in a reinforcement learning setting, where each feature is assigned an agent to perform feature transformation and selection, and the evaluation score of the produced features on downstream tasks serves as the reward to update the policy. We improve the efficiency of AFE from two perspectives. On the one hand, we develop a Feature Pre-Evaluation (FPE) Model to reduce the sample size and feature size, the two main factors that undermine the efficiency of feature evaluation. On the other hand, we devise a two-stage policy training strategy, running FPE on the pre-evaluation task as the initialization of the policy to avoid training the policy from scratch. We conduct comprehensive experiments on 36 datasets covering both classification and regression tasks. The results show 2.9% higher performance on average and 2x higher computational efficiency compared to state-of-the-art AFE methods.
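
As a rough sketch of the kind of loop such a framework automates, the code below scores candidate feature transformations by the cross-validated performance of a downstream model and, loosely echoing the pre-evaluation idea, screens each candidate on a small subsample before confirming it on the full data. It is a simplified, non-RL stand-in written for illustration; it is not the authors' framework, and the dataset and transformations are arbitrary choices.

```python
# Simplified stand-in for automated feature engineering (not the paper's method):
# candidate transformations are scored by downstream cross-validation, with a
# cheap screening pass on a subsample before confirming on the full data.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

TRANSFORMS = {
    "log1p": lambda c: np.log1p(c - c.min()),
    "square": lambda c: c ** 2,
    "sqrt": lambda c: np.sqrt(c - c.min()),
}

def score(features, labels, cv=3):
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
    return cross_val_score(model, features, labels, cv=cv).mean()

baseline = score(X, y)
sub = rng.choice(len(X), size=len(X) // 4, replace=False)  # screening subsample
sub_baseline = score(X[sub], y[sub])

best_gain, best_candidate = 0.0, None
for col in range(min(10, X.shape[1])):            # limit the search for brevity
    for name, fn in TRANSFORMS.items():
        candidate = np.hstack([X, fn(X[:, col]).reshape(-1, 1)])
        if score(candidate[sub], y[sub]) < sub_baseline:
            continue                              # cheap rejection on the subsample
        gain = score(candidate, y) - baseline     # confirm on the full dataset
        if gain > best_gain:
            best_gain, best_candidate = gain, (col, name)

print(f"baseline {baseline:.3f}; best candidate {best_candidate}; gain {best_gain:.3f}")
```

The paper replaces this brute-force search with per-feature agents trained by reinforcement learning, with the pre-evaluation model supplying both the cheap evaluation signal and the initialization for policy training.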


Lina Liu receives NSF Graduate Research Fellowship

Headshot photographs of four Math graduate students over a maroon banner.

MINNEAPOLIS / ST. PAUL (4/16/2024) – Four School of Mathematics graduate students were recently honored with recognition by the National Science Foundation Graduate Research Fellowship Program (NSF GRFP). Lina Liu was awarded a fellowship, and Connor Bass, Daniel Miao, and Ian Ruohoniemi received honorable mentions.

Lina Liu joined the School of Mathematics in 2022 after completing her undergraduate studies at the University of Wisconsin - Madison. Advised by Prof. Erkao Bao, she is interested in studying symplectic geometry. “I think of this field as a melting pot of differential geometry, functional analysis, topology, algebraic topology, and beyond,” she says. “I am specifically interested in studying Morse homology, a classical topic in symplectic geometry that uses analysis on manifolds and algebraic topology.”

Lina serves the Mathematics community through Math Club, a student group organized by graduate students to support mathematics undergraduates. The club hosts weekly meetings that feature practical workshops, informal math competitions and trivia events, and opportunities to build connections with other math students. Lina says “I am grateful for the added time I have to continue to support undergraduate students through the Math Club! I know I'm only where I am today in my academic studies because of the gracious support of my mentors and friends, and I hope to pass that forward.”

The NSF GRFP recognizes and supports outstanding graduate students in NSF-supported science, technology, engineering, and mathematics disciplines who are pursuing research-based master’s and doctoral degrees at accredited United States institutions. The program also seeks to support the participation of underrepresented groups in STEM graduate studies. Connor Bass, Daniel Miao and Ian Ruohoniemi received honorable mentions for their applications. 

We congratulate all four mathematicians on this significant national academic achievement!

Related news releases

  • Professor Jasmine Foo receives Distinguished McKnight University Professorship
  • UMN student team wins national data analytics competition
  • Professor Bernardo Cockburn receives 2024 Frontiers of Science Award in Mathematics
  • Professor Michelle Chu selected for McKnight Land-Grant Professorship award
  • Duggal, Gomes, Houssou, Kenney, Manivel, and Zhang named Outstanding TAs

UCL Institute for Environmental Design and Engineering


Virtual Decision Rooms for Water Neutral Urban Planning (VENTURA)

Ventura is a collaborative research project between Imperial College (ICL), University College London (UCL) and the British Geological Survey (BGS).


15 April 2024

The VENTURA project ran from October 2021 to April 2024 and was funded by UK Research and Innovation (UKRI) under the Engineering and Physical Sciences Research Council (EPSRC) as part of a programme entitled 'Digital Economy: Sustainable Digital Society'.

The project team for this collaborative project includes academics from ICL, UCL and the BGS; the UCL VENTURA team is led by Nici Zimmermann, Ke (Koko) Zhou, Irene Pluchinotta and Pepe Puchol-Salort. The broader aim of VENTURA is to support water neutrality decisions through digital tools.

The UCL team led the participatory systems thinking activities in case studies in Greater Manchester and Enfield, London. They engaged with a range of stakeholders including the trilateral group, namely Greater Manchester Combined Authority, United Utilities, and the Environment Agency. They also engaged with local boroughs, the Greater London Authority, and broader environmental stakeholders. The aim of the project's stakeholder engagement activities is to understand stakeholders’ experience and perspectives of the water neutrality governance challenges and complexities, supporting the exploration of the interconnections between hydrological and institutional dynamics towards sustainable growth.

The team employed comprehensive participatory approaches: identifying challenges with stakeholders and mapping the complex core problem. They developed connection circles and causal loop diagrams (CLDs) representing the system's complexity. The team drew from systems thinking, behavioural operational research, and organisational management studies (institutions, attention and emotions) to inform their reflections on the CLDs. They also analysed the decision-making process and compared stakeholders’ perceived boundaries of the system, specifically on how different groups of stakeholders perceive the system differently, and their reflections to achieve integrated decision-making.

Figure: Manchester workshop, causal loop diagrams developed by stakeholder groups

In collaboration with Imperial College London and the British Geological Survey, the UCL team integrated systems modelling and the CLDs into a digital tool, called a virtual decision room (VDR), for the Manchester case study. Combining systems thinking with GIS-based modelling in a virtual decision room is intended to enhance practical decision-making by offering an easily accessible interface. This integration facilitates a holistic understanding of specific policy issues and broader governance challenges.

The VENTURA project operated in a highly interdisciplinary setting and followed principles of transdisciplinarity. The team also used it as a basis to reflect on the process of collaborating across disciplines and working with stakeholders, with the aim of better understanding the development of digital tools in participatory, inter- and transdisciplinary contexts.

For more information about the project, please visit the following resources. If you have specific questions about the project, you are welcome to contact the team directly.

Figure: the VENTURA virtual decision room

Publications and conference papers:

VENTURA - virtual decision rooms for water neutral planning  https://nora.nerc.ac.uk/id/eprint/535476/

Comparisons of Emotional Boundaries: A case study of water neutrality and sustainable urban development in London: https://discovery.ucl.ac.uk/id/eprint/10180093/

Other documents:

Ventura VDR and the CLD:  https://ventura.bgs.ac.uk

Ventura project summary and more details about the development of CLD:  https://www.imperial.ac.uk/systems-engineering-innovation/research/ventura/

Ventura Github:  https://github.com/ventura-water

Researchers (with profile links):

Ke (Koko) Zhou

Irene Pluchinotta  

Nici Zimmermann

Pepe Puchol-Salort

Image credit: pexels.com

IMAGES

  1. What Is Feature Engineering and why should it be automated?

  2. (PDF) Published Research on Engineering Work

  3. What is Feature Engineering and its main goals?

  4. (PDF) Research paper on E-Learning application design features: Using

  5. (PDF) Comparing Feature Engineering Approaches to Predict Complex

  6. How to Write an Engineering Research Paper: Tips from Experts

VIDEO

  1. MFML 008

  2. Research Topics On Environmental Engineering

  3. Challenges and Opportunities for Educational Data Mining ! Research Paper review

  4. Feature engineering as part of Weidmüller Industrial Analytics

  5. Why Feature Engineering is an important skill for data science and machine learning

  6. Overall feature selection process-Machine Learning-FEATURE SUBSET SELECTION-Unit-2-CSE-R20-JNTUA

COMMENTS

  1. (PDF) Feature Engineering (FE) Tools and Techniques for Better

    Step 3: Feature Engineering (Feature Transformation and Creation of additional features): Clean data is still raw, and much of it may be unhelpful, so the data is filtered, sliced and transformed to ...

  2. Feature Engineering

    384 papers with code • 1 benchmark • 5 datasets. Feature engineering is the process of taking a dataset and constructing explanatory variables — features — that can be used to train a machine learning model for a prediction problem. Often, data is spread across multiple tables and must be gathered into a single table with rows ...

  3. Principles, research status, and prospects of feature engineering for

    This paper systematically reviewed the state of the art in feature engineering as used in research on data-driven building energy prediction. We first summarized the concept of feature engineering and the operating principles of its main methods, including feature construction, selection, and extraction (a generic sketch of these three operations appears after this list).

  4. Special issue on feature engineering editorial

    The idea of feature engineering for unstructured data is to extract features such that these can be fed into a classical machine learning technique (e.g., decision tree, neural network, XGBoost) for pattern recognition. For image data, various featurization techniques exist, depending on the particular goal or task at hand.

  5. PDF An Empirical Analysis of Feature Engineering for Predictive Modeling

    Engineering such features is primarily a manual, time-consuming task. Additionally, each type of model will respond differently to different kinds of engineered features. This paper reports empirical research to demonstrate what kinds of engineered features are best suited to various machine learning model types.

  6. [1709.07150] Feature Engineering for Predictive Modeling using

    View a PDF of the paper titled Feature Engineering for Predictive Modeling using Reinforcement Learning, by Udayan Khurana, Horst Samulowitz and Deepak Turaga. Abstract: Feature engineering is a crucial step in the process of predictive modeling. It involves the transformation of a given feature space, typically using mathematical ...

  7. Principles, research status, and prospects of feature engineering for

    The paper reviews the concept and methods of feature engineering. ... Although data-driven building energy prediction research adopted feature engineering as early as the 1990s [120,121], it received little attention in this field until recent years. It is worth reflecting on why feature engineering had been ...

  8. An empirical analysis of feature engineering for predictive modeling

    Engineering such features is primarily a manual, time-consuming task. Additionally, each type of model will respond differently to different types of engineered features. This paper reports on empirical research to demonstrate what types of engineered features are best suited to which machine learning model type. This is accomplished by ...

  9. Feature Engineering and Selection: A Practical Approach for Predictive

    However, it is quite common to encounter datasets in biomedical research in which repeated measurements have been taken over time on the same individual, for example. Presented in Chapters 10-12 are strategies for performing feature selection, the process of determining which engineered features should be in the final predictive model. Two ...

  10. Feature Engineering

    1 Introduction. As the amount of data generated and collected grows, analyzing and modeling so many input variables becomes more difficult. So, it is important to reduce model complexity and establish simple, accurate and robust models. Feature engineering is the process of using domain knowledge to extract input variables from raw data, prioritize ...

  11. Feature Engineering

    Feature Engineering (FE) is a set of techniques that allows human knowledge and intuitions to be added to an ML solution by controlling the input of raw data during the ML process. There are a number of well-understood methods and transformations that can be applied to the features. This process is better done iteratively, starting from EDA and ...

  12. Deep learning-based feature engineering methods for improved building

    As shown in Fig. 1, the research methodology consists of two main parts, i.e., feature engineering and predictive modeling.The first step performs data-driven feature extraction based on different feature engineering methods. In total, five feature engineering methods are adopted, including two conventional data-driven feature engineering methods and three deep-learning based methods.

  13. Feature engineering with clinical expert knowledge: A case study ...

    The goal of this research was to apply the feature engineering approach with a severe asthma case study and to assess model performance for a range of ML approaches: gradient boosting, neural network, logistic regression and k-nearest neighbor. Non-zero coefficients were assessed as a metric of model complexity for two ML approaches: logistic ...

  14. Feature engineering with clinical expert knowledge: A case study

    The goal of this research was to apply the feature engineering approach with a severe asthma case study and to assess model performance for a range of ML approaches: gradient boosting, neural network, logistic regression and k-nearest neighbor. Non-zero coefficients were assessed as a metric of model complexity for two ML approaches: logistic ...

  15. [1701.07852] An Empirical Analysis of Feature Engineering for

    Engineering such features is primarily a manual, time-consuming task. Additionally, each type of model will respond differently to different kinds of engineered features. This paper reports empirical research to demonstrate what kinds of engineered features are best suited to various machine learning model types. We provide this recommendation ...

  16. Feature Engineering for ML: Tools, Tips, FAQ, Reference Sources

    Feature engineering is the process of designing predictive models based on a carefully selected set of data. Read our step-by-step guide on how to introduce feature engineering into your model. ... The authors of this research paper suggest incorporating expert knowledge in ML models with the help of feature engineering. This way, the model ...

  17. Feature Engineering Research Papers

    Feature engineering (FE) is one of the most important steps in data science research. FE provides useful features to be used later in the study. ... This paper proposed feature engineering algorithms that can efficiently estimate hourly traffic volume and generate features from the existing dataset for all traffic census stations in Malaysia ...

  18. A Two Dimensional Feature Engineering Method for ...

    A Two Dimensional Feature Engineering Method for Relation Extraction . Transforming a sentence into a two-dimensional (2D) representation (e.g., the table filling) has the ability to unfold a semantic plane, where an element of the plane is a word-pair representation of a sentence which may denote a possible relation representation composed of two named entities.

  19. Deep Learning Bubble Segmentation on a Shoestring

    Image segmentation in bubble plumes is notoriously difficult, with individual bubbles having ill-defined shapes overlapping each other in images. In this paper, we present a cheap and robust segmentation procedure to identify bubbles from bubble swarm images. This is done in three steps. First, individual, nonoverlapping bubbles are detected and isolated from true experimental images. In the ...

  20. Professor Lei Sun's research published in Nature Communications and

    The research papers of Professor Lei Sun have been published in the world-renowned scientific journals Nature Communications and Proceedings of the National Academy of Sciences of the United States of America (PNAS) respectively. Below are the details of this outstanding research: "Nanobubble-actuated ultrasound neuromodulation for selectively shaping behavior in mice"

  21. Feature selection in machine learning: A new perspective

    Abstract. High-dimensional data analysis is a challenge for researchers and engineers in the fields of machine learning and data mining. Feature selection provides an effective way to solve this problem by removing irrelevant and redundant data, which can reduce computation time, improve learning accuracy, and facilitate a better understanding ...

  22. Dr Puxiang Lai's research on "High-security learning-based optical

    The research paper titled "High-security learning-based optical encryption assisted by disordered metasurface", with Associate Professor Dr Puxiang Lai as one of the co-authors, is published in Nature Communications, an open access, multidisciplinary journal dedicated to publishing high-quality research in all areas of the biological, health, physical, chemical, Earth, social, mathematical ...

  23. Understanding the Role of Feature Engineering in Fake News ...

    In this paper, we analyzed how the deceptive linguistic cues evolved to become prominent features in NLP feature engineering augmented ML-based classification for fake news identification. Our main factor of analysis was the linguistic features that often hide in the fine print of fake news published online and disseminated across social media ...

  24. Te Kura Mata-Ao

    Family footsteps to engineering excellence: Shermi Perera graduates with honours. Bachelor of Engineering graduate, Shermi Perera, is working at Beca in Dunedin, engaging in diverse engineering projects aimed at enhancing water resource management.

  25. [2212.13152] Toward Efficient Automated Feature Engineering

    Automated Feature Engineering (AFE) refers to automatically generating and selecting optimal feature sets for downstream tasks, and it has achieved great success in real-world applications. Current AFE methods mainly focus on improving the effectiveness of the produced features but ignore the low-efficiency issue for large-scale deployment. Therefore, in this work, we propose a generic framework ...

  26. Lina Liu receives NSF Graduate Research Fellowship

    MINNEAPOLIS / ST. PAUL (4/16/2024) - Four School of Mathematics graduate students were recently honored with recognition by the National Science Foundation Graduate Research Fellowship Program (NSF GRFP). Lina Liu was awarded a fellowship, and Connor Bass, Daniel Miao, and Ian Ruohoniemi received honorable mentions. Lina Liu joined the School of Mathematics in 2022 after completing her ...

  27. Virtual Decision Rooms for Water Neutral Urban Planning (VENTURA)

    Ventura is a collaborative research project between Imperial College (ICL), UCL and the British Geological Survey (BGS). ... (October 2021 to April 2024) was funded by the UK Research and Innovation (UKRI) under the Engineering and Physical Sciences Research Council (EPSRC) as part of a programme titled Digital Economy: Sustainable Digital ...
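
Several items above distinguish feature construction, selection, and extraction as the main feature engineering operations. The sketch below is a minimal, generic illustration of those three steps with scikit-learn on a standard toy dataset; it is not taken from any of the papers cited in this list.

```python
# Generic illustration of the three operations named above (construction,
# selection, extraction); not drawn from any specific paper cited in this list.
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import PolynomialFeatures

X, y = load_diabetes(return_X_y=True)
print("raw features:", X.shape[1])

# 1. Feature construction: derive new candidate features from the raw ones,
#    here pairwise interaction terms.
constructed = PolynomialFeatures(degree=2, interaction_only=True,
                                 include_bias=False).fit_transform(X)
print("after construction:", constructed.shape[1])

# 2. Feature selection: keep only the candidates most related to the target.
selected = SelectKBest(score_func=f_regression, k=15).fit_transform(constructed, y)
print("after selection:", selected.shape[1])

# 3. Feature extraction: compress the selected features into a few components.
extracted = PCA(n_components=5).fit_transform(selected)
print("after extraction:", extracted.shape[1])
```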