Data Sciences and Analytics Center

Data Driven Feature Learning

Student Name: Ambika Kaul

"The importance of feature engineering has been established by the exceptional performance of deep learning techniques, which automate this task for some applications. For other applications, generally, feature engineering requires substantial manual effort in designing and selecting features and is often tedious and non scalable. In this work, we introduce a scalable regression-based feature learning algorithm. It requires no domain knowledge, being data driven, and is applicable to any dataset having numeric attributes. Such a generic representation is learnt by mining pairwise feature associations, identifying the linear or non-linear relationship between each pair, applying regularized regression and selecting those relationships that are stable. Our experimental evaluation on 25 benchmark UC Irvine and Gene Expression datasets across different domains provides evidence that the features generated through our learning model can improve the prediction accuracy significantly for different classifiers without using any domain knowledge. We attribute this improvement to the ability of regression to extract behavioral trends in addition to the patterns usually present in a dataset."