Four very useful papers

Perhaps it’s an overstatement to call the following four papers “classics,” but I believe they’ll withstand the test of time.  I wouldn’t be surprised if people cite them as primary sources 100 years from now.

Pretty much everyone who has analyzed data has had occasion to use linear regression analysis.  For analysis of multivariate data, if the signal sources aren’t known beforehand then you’d typically use – or at least explore – Principal Components Analysis (PCA), Singular Value Decomposition (SVD), or maybe Independent Components Analysis (ICA) to identify what to use as regressors/predictor variables.  If you go that route then you need a rational basis for choosing how many regressors (PCs or ICs) are statistically significant, as you only want to use the statistically significant ones in your analysis.  (If you include too many, you’re fitting noise.)  Choosing the number of regressors is an act of model order selection.  This paper describes a very sensible basis for choosing the model order for a PCA.  Their method is general, however, not limited to PCA:

“Cross-Validatory Choice of the Number of Components from a Principal Component Analysis,” H. T. Eastment and W. J. Krzanowski, Technometrics, vol. 24, no. 1 (Feb. 1982), pp. 73-77.
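
In case a concrete illustration helps, here is a minimal Python sketch of choosing the number of components by cross-validation.  It uses the generic hold-out-and-reconstruct idea only (rows held out, PRESS computed from rank-k reconstructions), not the element-wise deletion scheme Eastment and Krzanowski actually propose, and the function name and arguments are my own invention:

```python
import numpy as np

def pca_cv_press(X, max_k, n_folds=5, seed=0):
    """Cross-validated reconstruction error (PRESS) vs. number of PCs.

    Hypothetical helper: rows are held out, principal axes are fit on the
    remaining rows, and each held-out row is reconstructed from its
    projection onto the first k axes.  (Eastment and Krzanowski delete
    individual matrix elements instead; this is only the generic idea.)
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    folds = np.array_split(rng.permutation(n), n_folds)
    press = np.zeros(max_k)
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        Xtr = X[train_idx]
        mu = Xtr.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xtr - mu, full_matrices=False)
        Xte = X[test_idx] - mu
        for k in range(1, max_k + 1):
            Vk = Vt[:k].T                    # p x k matrix of loadings
            resid = Xte - Xte @ Vk @ Vk.T    # rank-k reconstruction error
            press[k - 1] += np.sum(resid ** 2)
    return press  # keep the smallest k beyond which PRESS stops improving
```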

If you’ve done much multivariate regression you’ve probably encountered situations where unconstrained least-squares estimation of regression coefficients yields ‘unphysical’ coefficient values, i.e., values that yield a decent fit to the data but, for one reason or another, don’t make a lot of sense.  Ridge regression is a mathematically convenient method of pulling parameter values back into a reasonable range.  It balances deviations between the data and the model against deviations of the associated model parameter values from their nominal values.  More specifically, ridge regression introduces a penalty term which biases parameter values towards a more reasonable range – reasonable as defined by the analyst.  (Alternatively stated, ridge regression introduces a weakly informative Bayesian prior into the parameter estimation process.)  The trick to effective ridge regression is giving sensible weight to the parameter penalty term relative to the fit residuals (data-minus-model) term.  Golub et al. describe a method for determining that weight:

“Generalized Cross-Validation as a Method for Choosing a Good Ridge Parameter,” G. H. Golub, M. Heath, and G. Wahba, Technometrics, vol. 21, no. 2 (May 1979), pp. 215-223.
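
As a rough illustration (not code from the paper), the GCV criterion can be evaluated over a grid of candidate ridge parameters directly from the SVD of the design matrix; the function and its interface below are hypothetical:

```python
import numpy as np

def gcv_ridge(X, y, lambdas):
    """GCV score V(lam) for a grid of candidate ridge parameters.

    A sketch (not code from the paper) of the standard criterion
    V(lam) = (1/n)||(I - A(lam))y||^2 / [(1/n) tr(I - A(lam))]^2,
    with A(lam) = X (X'X + lam*I)^{-1} X', evaluated via the SVD of X.
    """
    n = X.shape[0]
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    Uty = U.T @ y
    scores = []
    for lam in lambdas:
        shrink = s**2 / (s**2 + lam)   # eigenvalues of the "hat" matrix A(lam)
        resid2 = np.sum(((1.0 - shrink) * Uty) ** 2) + (y @ y - Uty @ Uty)
        trace = n - np.sum(shrink)     # tr(I - A(lam))
        scores.append((resid2 / n) / (trace / n) ** 2)
    return np.array(scores)

# Usage sketch: lambdas = np.logspace(-6, 2, 50)
#               lam_best = lambdas[np.argmin(gcv_ridge(X, y, lambdas))]
```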

The parameter value penalty term in conventional ridge regression is quadratic.  Quadratic cost functions tend to over-penalize rare but legitimate anomalous parameter values, e.g., values which are five, six, or ten standard deviations from their nominal value.  Tibshirani finds a sensible workaround to this issue.  He formulates a regression method which is essentially ridge regression but with parameter values constrained so that the sum of their absolute values is less than an analyst-specified constant.  (That constraint is what defines the “lasso” in the title of the paper.)  His formulation has the effect of making the estimation method itself sparsity-promoting, i.e., defining the parameter value constraint as he does automatically zeroes out regressor coefficients which do little to reduce the deviation between the data and the best fit model:

“Regression Shrinkage and Selection via the Lasso,” R. Tibshirani, J. Royal Stat. Soc., Series B, vol. 58, no. 1 (1996), pp. 267-288.
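
For a quick taste of the lasso in practice, here is a sketch using scikit-learn’s penalized (Lagrangian) formulation, which is equivalent to Tibshirani’s explicit bound on the sum of absolute coefficient values for a suitable correspondence between the penalty weight and the bound.  The data are synthetic and purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic, purely illustrative data: only 3 of 20 candidate regressors matter.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 20))
beta_true = np.zeros(20)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.5 * rng.standard_normal(200)

# Cross-validated choice of the penalty weight; coefficients that do little
# to reduce the misfit are driven exactly to zero.
model = LassoCV(cv=5).fit(X, y)
print("chosen penalty weight:", model.alpha_)
print("indices of nonzero coefficients:", np.flatnonzero(model.coef_))
```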

Finally, another challenge with regression analysis is that parameter estimation methods based on minimization of the sum of squared residuals are sensitive to anomalous data, a.k.a. outliers.  (NB:  The title of this blog, Robust Analysis, follows in part from my interest in estimation methods which are insensitive to anomalous data.  Such methods are known as statistically robust estimation methods, so “robust” has both a colloquial meaning and a technical meaning.)  Rousseeuw and van Zomeren describe how to identify outliers and leverage points in your data.  Outliers are anomalous deviations between the data and the best fit model.  Leverage points are data which have a particularly large influence on the calculated regression coefficient values.  Everyone who does regression analysis should read this paper.  In fact, I believe it should be mandatory reading for undergraduates:

“Unmasking Multivariate Outliers and Leverage Points,” P. J. Rousseeuw and B. C. van Zomeren, J. Am. Stat. Assoc., vol. 85, no. 411 (Sept. 1990), pp. 633-639.
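
If you want to experiment with the paper’s diagnostic plot, the sketch below assembles its two ingredients: robust distances in regressor space (leverage) and robustly standardized residuals (outliers).  Caveat: I’ve swapped in the MCD estimator and a Huber regression fit, which are readily available in scikit-learn, for the paper’s minimum volume ellipsoid and least-median-of-squares estimators:

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet
from sklearn.linear_model import HuberRegressor

def outlier_leverage_diagnostics(X, y):
    """Ingredients of a Rousseeuw/van Zomeren-style diagnostic plot.

    Substitutions for convenience (not fidelity to the paper): the MCD
    estimator stands in for the minimum volume ellipsoid, and a Huber fit
    stands in for least median of squares.
    """
    # Robust distances in regressor space flag leverage points.
    mcd = MinCovDet().fit(X)
    robust_dist = np.sqrt(mcd.mahalanobis(X))
    leverage_cutoff = np.sqrt(chi2.ppf(0.975, df=X.shape[1]))

    # Robustly standardized residuals flag outliers in the response.
    resid = y - HuberRegressor().fit(X, y).predict(X)
    mad_scale = 1.4826 * np.median(np.abs(resid - np.median(resid)))
    std_resid = resid / mad_scale

    # Plot std_resid against robust_dist; points beyond leverage_cutoff
    # and/or with |std_resid| > 2.5 are the ones to scrutinize.
    return robust_dist, std_resid, leverage_cutoff
```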

 

[What happened in Connecticut today is incomprehensible to me.  I cannot imagine how someone’s brain is wired such that it permits them to commit that kind of violence.  I try to fathom what it must be like to have lost someone or to have been a first responder or to be a survivor, and I go numb.  I got in the car after work and cried most of the way home.]