This infocell decides whether a feature is nominal, ordinal or numeric when the feature type is unknown (blind datasets).
It also explores each variable's distribution, detecting singular values that may be errors or encodings of missing values.
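As a minimal sketch of how such blind type detection and sentinel detection can work (the thresholds, the `SUSPECT_CODES` list and the heuristics below are illustrative assumptions, not the infocell's actual rules):

```python
from collections import Counter

def infer_feature_type(values, max_levels=10):
    """Guess whether a feature is nominal, ordinal or numeric (toy heuristic)."""
    numeric = []
    for v in values:
        if v is None:
            continue
        try:
            numeric.append(float(v))
        except (TypeError, ValueError):
            return "nominal"              # non-parsable values -> categorical
    distinct = set(numeric)
    if len(distinct) > max_levels:
        return "numeric"                  # many distinct values -> continuous
    if all(v.is_integer() for v in distinct):
        return "ordinal"                  # few integer levels -> ordered codes
    return "numeric"

# Hypothetical sentinel codes that often encode missing values in real data.
SUSPECT_CODES = {-1, 99, 999, -999, 9999, "?", "NA", ""}

def suspicious_values(values, min_share=0.3):
    """Flag sentinel-like values frequent enough to be missing-value codes."""
    counts = Counter(values)
    n = len(values)
    return [v for v, c in counts.items()
            if v in SUSPECT_CODES and c / n >= min_share]
```

A real implementation would also look at value spacing, string patterns and per-value target statistics, but the skeleton is the same.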
This infocell answers questions like:
Depending on the model being fitted and the nature of the variables involved, it may be necessary to transform some of them (such as counters, quantities or elapsed times) to improve model performance.
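As one generic example of such a transformation (a common technique, not the document's specific transformation catalogue): right-skewed, non-negative variables like counters or elapsed times are often compressed with log(1 + x) before fitting scale-sensitive models.

```python
import math

def log1p_transform(values):
    """Compress right-skewed non-negative values with log(1 + x)."""
    return [math.log1p(v) for v in values]

raw = [0, 9, 99, 999]              # e.g. page-view counters
transformed = log1p_transform(raw)  # roughly evenly spaced after the transform
```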
The results obtained are substantially better than those of standard approaches to the problem (Recursive Feature Elimination, mRMR, Boruta, Lasso, ...).
We usually see double-digit improvements in several metrics (AUC, RMSE, log loss, normalized discounted cumulative gain, mean average precision, ...) on a fresh test set.
This step is critical for most existing models and interacts with other infocells, such as Variable Representation or Feature Selection.
Most out-of-the-box procedures for assessing the relative importance of variables are biased and prone to overfitting.
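One standard way to reduce that bias, sketched here with a toy placeholder model and data (not the infocell's actual method), is permutation importance measured on a held-out set: shuffle one feature's column and record the drop in held-out accuracy.

```python
import random

def accuracy(model, X, y):
    return sum(model(row) == t for row, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, feature_idx, seed=0):
    """Drop in held-out accuracy after shuffling one feature's column."""
    base = accuracy(model, X, y)
    col = [row[feature_idx] for row in X]
    random.Random(seed).shuffle(col)
    X_perm = [row[:feature_idx] + [v] + row[feature_idx + 1:]
              for row, v in zip(X, col)]
    return base - accuracy(model, X_perm, y)

# Toy hold-out set: feature 0 determines the target, feature 1 is pure noise.
X_test = [[0.0, float(i)] for i in range(5)] + [[1.0, float(i)] for i in range(5)]
y_test = [0] * 5 + [1] * 5
model = lambda row: int(row[0] > 0.5)
```

Because the shuffle is done on data the model never trained on, a feature the model does not actually use gets an importance of exactly zero.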
This is very useful for understanding the relationship between variables and the response in black-box models.
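A standard tool for exploring such relationships, shown here as a minimal sketch with a placeholder model (not the infocell's implementation), is a one-way partial-dependence curve: the average prediction as one feature is swept over a grid while the others keep their observed values.

```python
def partial_dependence(model, X, feature_idx, grid):
    """Average prediction with feature `feature_idx` forced to each grid value."""
    curve = []
    for g in grid:
        preds = [model(row[:feature_idx] + [g] + row[feature_idx + 1:])
                 for row in X]
        curve.append(sum(preds) / len(preds))
    return curve

# Toy "black box" that in fact depends only on feature 0.
black_box = lambda row: 2 * row[0] + 0 * row[1]
X_sample = [[0.3, 5.0], [0.9, -2.0]]
```

A flat curve for a feature indicates the model ignores it; the slope and shape for the others reveal the direction and form of their effect on the response.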
This provides a gold-standard model against which the relative performance of each candidate model can be compared.
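One common way to express such a comparison, sketched here with a trivial predict-the-mean reference standing in for whatever gold standard the platform builds, is a relative skill score: 1 means perfect, 0 means no better than the reference, negative means worse.

```python
import math

def rmse(pred, actual):
    """Root mean squared error between predictions and actual values."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))

def relative_skill(model_pred, reference_pred, actual):
    """1.0 = perfect, 0.0 = matches the reference, < 0 = worse than it."""
    return 1.0 - rmse(model_pred, actual) / rmse(reference_pred, actual)

y = [1.0, 2.0, 3.0]
reference = [2.0, 2.0, 2.0]   # hypothetical reference: always predict the mean
candidate = [1.0, 2.0, 3.0]   # a perfect candidate model, for illustration
```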
We currently use a rule-based system to select initial values, options and parameter ranges for each model. In the near future this will change:
We are currently working on a meta-learning model that uses topological descriptors of the dataset to select the most suitable model family, the parameter ranges where the model will be optimal, and the variable transformations that will perform best. This makes it possible to obtain a near-optimal solution without computing a huge number of models.
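The topological descriptors themselves are not specified in this document; as a hedged sketch, the following shows generic dataset descriptors ("meta-features") of the simple kind such a meta-learning model typically consumes.

```python
import math
from collections import Counter

def dataset_descriptors(X, y):
    """Generic meta-features of a dataset; illustrative stand-ins only."""
    n, p = len(X), len(X[0])
    counts = Counter(y)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "n_rows": n,
        "n_features": p,
        "dimensionality": p / n,    # features-to-rows ratio
        "target_entropy": entropy,  # class balance of a classification target
    }
```

A meta-learner trained on many past datasets can then map such descriptors to a recommended model family and parameter ranges, avoiding an exhaustive model search.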