10 Modern Statistical Concepts Discovered by Data Scientists
Here’s the list:
- Clustering using tagging or indexation methods, allowing you to cluster text (articles, websites) much faster than any traditional statistical technique, with a scalable algorithm that is very easy to implement (sketch 1 below)
- Bucketization – the science and art of identifying the right homogeneous data buckets (millions of buckets among billions of observations) to provide highly localized (or segment-targeted) predictions, or to smooth regression parameters across similar buckets, with strong statistical significance. It is equivalent to joint (not sequential) binning in multiple dimensions, which is a combinatorial optimization problem. While decision trees also produce some bucketization, the data science approach is more robust, simple, scalable, and model-free. It does not directly produce decision trees, and it leads to easy interpretation (in a fraud detection problem, each data bucket corresponds to a specific type of fraud). A related problem is bucket clustering, via standard hierarchical clustering techniques (sketch 2 below)
- Random number generation, a 3,000-year-old problem that has benefited from data science advances: for instance, using the digits of irrational numbers such as Pi or SQRT(2), produced with very fast algorithms, to simulate randomness (sketch 3 below)
- Model-free confidence intervals, getting rid of p-values, hypothesis testing, asymptotic analysis, errors due to poor model fitting or outliers, and a host of obscure, old-fashioned statistical concepts (sketch 4 below)
- Variable / feature selection and data reduction without L2-based, model-based techniques such as PCA, which are potentially numerically unstable, sensitive to outliers, and difficult to interpret (sketch 5 below)
- Hidden decision trees, a hybrid technique combining a form of averaged decision trees with Jackknife regression; more accurate, and far easier to code, implement, and interpret than either logistic regression or traditional decision trees, and not subject to over-fitting, unlike its statistical ancestors (sketch 6 below)
- Jackknife regression, a universal, simplified regression technique that is easy to code and to integrate into black-box analytical products. Traditional statistical science offers hundreds of regression techniques; nobody but statisticians knows which one to use, and when, which is obviously a nightmare in production environments (sketch 7 below)
- Predictive power and other synthetic metrics designed for robustness rather than for mathematical elegance
- Identification of true signal in data subject to the curse of big data, where spurious correlations arise purely by chance once huge numbers of variables are compared (sketch 8 below)
- New data visualization techniques – in particular, using data videos to display insights (sketch 9 below)
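
The sketches below are minimal, hedged illustrations of the ideas above, written in plain Python; all function names, thresholds, and toy datasets are assumptions made for the examples, not implementations taken from the referenced techniques.

Sketch 1: tag-based text clustering via an inverted keyword index. Documents sharing a sufficiently rare keyword are merged with a union-find structure, so the whole procedure is two linear passes plus near-constant-time merges; no pairwise distance matrix is ever built, which is what makes the approach scale.

```python
# Hypothetical sketch: fast text clustering via an inverted keyword index.
# One pass builds the index, one pass merges documents that share a keyword.
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "and", "to", "in", "for", "on", "with"}

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOPWORDS]

def cluster_by_keywords(docs, max_doc_freq=0.5):
    # Inverted index: keyword -> list of document ids.
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for word in set(tokenize(text)):
            index[word].append(doc_id)

    # Union-find over documents, with path halving.
    parent = list(range(len(docs)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    def union(i, j):
        parent[find(i)] = find(j)

    # Merge documents sharing a keyword, ignoring near-ubiquitous words.
    for word, doc_ids in index.items():
        if len(doc_ids) <= max_doc_freq * len(docs):
            for other in doc_ids[1:]:
                union(doc_ids[0], other)

    clusters = defaultdict(list)
    for doc_id in range(len(docs)):
        clusters[find(doc_id)].append(doc_id)
    return list(clusters.values())

docs = [
    "python machine learning tutorial",
    "deep learning with python",
    "stock market crash analysis",
    "market volatility and stocks",
]
print(cluster_by_keywords(docs))   # [[0, 1], [2, 3]]
```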
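Sketch 2: bucketization as joint (not sequential) binning. Observations are keyed by the tuple of their binned feature values; dense buckets receive a localized prediction, and sparse buckets fall back to a smoothed estimate (here simply the global mean, standing in for smoothing across similar buckets). The cut points and min_count threshold are assumptions for the example.

```python
# Hypothetical sketch: joint multi-dimensional binning into data buckets.
from collections import defaultdict

def bucketize(rows, targets, bins_per_feature):
    """rows: feature tuples; bins_per_feature: sorted cut points per feature."""
    def bin_value(x, cuts):
        return sum(x >= c for c in cuts)   # index of the bin containing x

    buckets = defaultdict(list)
    for row, y in zip(rows, targets):
        key = tuple(bin_value(x, cuts) for x, cuts in zip(row, bins_per_feature))
        buckets[key].append(y)
    return buckets

def bucket_predictions(buckets, min_count=5):
    all_y = [y for ys in buckets.values() for y in ys]
    global_mean = sum(all_y) / len(all_y)
    preds = {}
    for key, ys in buckets.items():
        if len(ys) >= min_count:
            preds[key] = sum(ys) / len(ys)   # localized prediction
        else:
            preds[key] = global_mean         # smoothing for sparse buckets
    return preds

rows = [(23, 1200), (25, 1300), (24, 900), (55, 4000), (60, 4200), (58, 100)]
targets = [0, 0, 1, 1, 1, 0]
buckets = bucketize(rows, targets, bins_per_feature=[[30, 50], [1000, 3000]])
print(bucket_predictions(buckets, min_count=2))
```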
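Sketch 3: randomness from the digits of an irrational number. The fractional binary digits of SQRT(2) are computed exactly with integer arithmetic (math.isqrt) and packed into uniform deviates. This is a toy generator illustrating the idea, not a vetted one, and the very fast digit algorithms mentioned above are not reproduced here.

```python
# Hypothetical sketch: pseudo-random bits from the binary digits of SQRT(2).
import math

def sqrt2_bits(n_bits):
    # floor(sqrt(2) * 2**n_bits); dropping the leading integer bit leaves
    # the first n_bits fractional binary digits of sqrt(2).
    x = math.isqrt(2 << (2 * n_bits))
    return [(x >> (n_bits - 1 - i)) & 1 for i in range(n_bits)]

def bits_to_uniform(bits, word=32):
    # Pack consecutive 32-bit words into floats in [0, 1).
    out = []
    for i in range(0, len(bits) - word + 1, word):
        v = 0
        for b in bits[i:i + word]:
            v = (v << 1) | b
        out.append(v / 2 ** word)
    return out

u = bits_to_uniform(sqrt2_bits(3200))
print(f"{len(u)} deviates, mean = {sum(u) / len(u):.4f}")  # close to 0.5
```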
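Sketch 4: a model-free confidence interval, here implemented as a plain percentile bootstrap: the interval comes straight from the empirical percentiles of resampled estimates, with no p-value, no distributional model, and no asymptotics. The resample count and the 90% level are arbitrary choices for the example.

```python
# Hypothetical sketch: percentile-based, model-free confidence interval.
import random

def model_free_ci(data, stat=lambda xs: sum(xs) / len(xs),
                  n_resamples=1000, level=0.90, seed=42):
    rng = random.Random(seed)
    # Estimate the statistic on many resampled versions of the data.
    estimates = sorted(
        stat([rng.choice(data) for _ in range(len(data))])
        for _ in range(n_resamples)
    )
    # Read the interval directly off the empirical percentiles.
    lo = estimates[int((1 - level) / 2 * n_resamples)]
    hi = estimates[int((1 + level) / 2 * n_resamples) - 1]
    return lo, hi

data = [5.1, 4.8, 5.5, 5.0, 4.9, 5.3, 5.2, 4.7, 9.0]  # one outlier
print(model_free_ci(data))
```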
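Sketch 5: feature selection without PCA, using a rank-based (Spearman-style) correlation against the target. Ranks are largely insensitive to outliers, and the selected variables keep their original names and meaning, so interpretation stays straightforward. Tie handling is deliberately crude to keep the sketch short.

```python
# Hypothetical sketch: outlier-resistant feature selection via rank correlation.
def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):   # ties broken arbitrarily, for brevity
        r[i] = float(rank)
    return r

def spearman(xs, ys):
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

def select_features(features, target, k=2):
    scored = sorted(features.items(),
                    key=lambda kv: abs(spearman(kv[1], target)),
                    reverse=True)
    return [name for name, _ in scored[:k]]

features = {
    "income": [30, 45, 50, 80, 120, 9999],   # outlier barely matters in ranks
    "age":    [22, 35, 41, 52, 60, 33],
    "noise":  [5, 1, 4, 2, 3, 6],
}
target = [0, 1, 1, 2, 3, 3]
print(select_features(features, target))   # ['income', 'age']
```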
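Sketch 6: the flavor of hidden decision trees, as far as the description above allows one to reconstruct it. Frequent feature-value combinations ("nodes") are scored from a lookup table of node averages, which behaves like a collection of small, averaged trees; everything else falls back to a regression-style score (here just the global mean, standing in for the Jackknife regression step). The min_node_size threshold is an assumption.

```python
# Hypothetical sketch: node lookup table plus fallback, in the spirit of
# hidden decision trees. Not the canonical algorithm.
from collections import defaultdict

def fit_hdt(rows, targets, min_node_size=2):
    nodes = defaultdict(list)
    for row, y in zip(rows, targets):
        nodes[tuple(row)].append(y)
    # Dense nodes: direct averaged score, like a table of small trees.
    table = {key: sum(ys) / len(ys)
             for key, ys in nodes.items() if len(ys) >= min_node_size}
    fallback = sum(targets) / len(targets)  # stand-in for the regression step
    return table, fallback

def predict_hdt(model, row):
    table, fallback = model
    return table.get(tuple(row), fallback)

rows = [("US", "mobile"), ("US", "mobile"), ("UK", "desktop"),
        ("UK", "desktop"), ("FR", "tablet")]
targets = [1, 1, 0, 0, 1]
model = fit_hdt(rows, targets)
print(predict_hdt(model, ("US", "mobile")))   # scored by its node: 1.0
print(predict_hdt(model, ("DE", "mobile")))   # unseen node: fallback 0.6
```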
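Sketch 7: one plausible reading of a simplified, jackknife-style regression. Predictors are standardized, each coefficient is taken proportional to the predictor's correlation with the response (so there is no matrix inversion anywhere), and a single univariate fit rescales the combined score. This is a sketch under those assumptions, not the canonical recipe.

```python
# Hypothetical sketch: correlation-weighted regression, no matrix inversion.
def mean(xs): return sum(xs) / len(xs)

def std(xs):
    m = mean(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

def corr(xs, ys):
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return num / (len(xs) * std(xs) * std(ys))

def fit_jackknife(columns, y):
    stats = [(mean(col), std(col)) for col in columns]
    # Per-feature weights straight from correlations; no (X'X)^-1 anywhere.
    weights = [corr(col, y) for col in columns]
    raw = [sum(w * (x - m) / s
               for w, (m, s), x in zip(weights, stats, row))
           for row in zip(*columns)]
    # One univariate fit of y on the combined score.
    b = corr(raw, y) * std(y) / std(raw)
    a = mean(y) - b * mean(raw)
    return weights, a, b

columns = [[1, 2, 3, 4, 5], [2, 1, 4, 3, 6]]   # two predictors
y = [3.1, 3.9, 7.2, 7.8, 11.0]
print(fit_jackknife(columns, y))
```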
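Sketch 8: the curse of big data in action. With only 20 observations and 2,000 purely random variables, an exhaustive search reliably finds a strong-looking correlation; any claim of true signal has to beat this pure-noise baseline.

```python
# Hypothetical sketch: spurious correlations arise by chance at scale.
import random

rng = random.Random(1)
n_obs, n_vars = 20, 2000
target = [rng.gauss(0, 1) for _ in range(n_obs)]

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    dx = sum((x - mx) ** 2 for x in xs) ** 0.5
    dy = sum((y - my) ** 2 for y in ys) ** 0.5
    return num / (dx * dy)

best = max(
    abs(corr([rng.gauss(0, 1) for _ in range(n_obs)], target))
    for _ in range(n_vars)
)
print(f"best |correlation| among {n_vars} pure-noise variables: {best:.3f}")
# Typically around 0.7: strong-looking, yet entirely spurious.
```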
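Sketch 9: a minimal data video, one rendered frame per time step, so that a drifting distribution becomes visible over time. The frames are plain PNG files that any external tool (for example ffmpeg) can stitch into a video; matplotlib is assumed to be installed, and the drifting-mean data is invented for the example.

```python
# Hypothetical sketch: render per-timestep frames for a "data video".
import random
import matplotlib
matplotlib.use("Agg")            # render off-screen, no display needed
import matplotlib.pyplot as plt

rng = random.Random(0)
for frame in range(30):
    # A distribution whose mean drifts over time: the insight the video shows.
    sample = [rng.gauss(frame / 10, 1.0) for _ in range(500)]
    plt.figure(figsize=(5, 3))
    plt.hist(sample, bins=30, range=(-4, 7), color="steelblue")
    plt.title(f"t = {frame}")
    plt.ylim(0, 80)
    plt.savefig(f"frame_{frame:03d}.png")
    plt.close()
print("wrote 30 frames; stitch with: ffmpeg -i frame_%03d.png video.mp4")
```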