
The function dummyVars can be used to generate a complete (less than full rank parameterized) set of dummy variables from one or more factors. The function takes a formula and a data set and outputs an object that can be used to create the dummy variables using the predict method.

For example, the etitanic data set in the earth package includes two factors: pclass (passenger class, with levels 1st, 2nd, 3rd) and sex (with levels female, male). The base R function model.matrix would generate the following variables:

library(earth)
data(etitanic)
head(model.matrix(survived ~ ., data = etitanic))
# (Intercept) pclass2nd pclass3rd sexmale age sibsp parch

Using dummyVars:

dummies <- dummyVars(survived ~ ., data = etitanic)
head(predict(dummies, newdata = etitanic))
# pclass.1st pclass.2nd pclass.3rd sex.female sex.male age sibsp parch

Note there is no intercept and each factor has a dummy variable for each level, so this parameterization may not be useful for some model functions, such as lm.

3.2 Zero- and Near Zero-Variance Predictors

In some situations, the data generating mechanism can create predictors that have only a single unique value (i.e., a "zero-variance predictor"). For many models (excluding tree-based models), this may cause the model to crash or the fit to be unstable.

Similarly, predictors might have only a handful of unique values that occur with very low frequencies. For example, in the drug resistance data, the nR11 descriptor (number of 11-membered rings) has a few unique numeric values that are highly unbalanced:

data(mdrr)
data.frame(table(mdrrDescr$nR11))
# Var1 Freq

The concern here is that these predictors may become zero-variance predictors when the data are split into cross-validation/bootstrap sub-samples, or that a few samples may have an undue influence on the model. These "near-zero-variance" predictors may need to be identified and eliminated prior to modeling.

To identify these types of predictors, the following two metrics can be calculated:

- the frequency of the most prevalent value over the second most frequent value (called the "frequency ratio"), which would be near one for well-behaved predictors and very large for highly unbalanced data, and
- the "percent of unique values," the number of unique values divided by the total number of samples (times 100), which approaches zero as the granularity of the data increases.

If the frequency ratio is greater than a pre-specified threshold and the unique value percentage is less than a threshold, we might consider a predictor to be near zero-variance. We would not want to falsely identify data that have low granularity but are evenly distributed, such as data from a discrete uniform distribution.
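The two metrics above can be sketched in a few lines of plain Python (a minimal illustration only, not caret's implementation; in R this check is done by caret's nearZeroVar function, whose default cutoffs of 95/5 for the frequency ratio and 10 for the unique-value percentage are borrowed here):

```python
from collections import Counter

def near_zero_variance(values, freq_cut=95 / 5, unique_cut=10):
    """Flag a predictor whose most common value dominates (high frequency
    ratio) and whose percent of unique values is small."""
    counts = Counter(values).most_common()
    if len(counts) == 1:
        return True  # a single unique value: zero-variance predictor
    # ratio of the most prevalent value's count to the second most frequent
    freq_ratio = counts[0][1] / counts[1][1]
    # unique values as a percentage of the number of samples
    pct_unique = 100 * len(counts) / len(values)
    return freq_ratio > freq_cut and pct_unique < unique_cut

# Highly unbalanced, low-granularity predictor (like nR11): flagged.
print(near_zero_variance([0] * 98 + [1, 2]))   # True

# Discrete uniform data: frequency ratio is near one, so not flagged.
print(near_zero_variance(list(range(10)) * 10))  # False
```

The second call shows why both thresholds are needed: evenly distributed low-granularity data fails the frequency-ratio test and is correctly left alone.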