Thursday, November 24, 2011

Studentized residual for detecting outliers

Last time, I discussed outliers and a simple approach, Dixon’s Q test, for detecting a single outlier. When there are multiple outliers, we can detect them using the standard deviation (for data that are normally distributed) or percentiles (for skewed data). A box plot may be useful for visually checking the data for potential outliers.

In the regression setting, there are several approaches to detecting outliers. One of them is to use the ‘standardized residual’ or ‘studentized residual’. In linear regression, an outlier is an observation with a large residual. In other words, it is an observation whose dependent-variable value is unusual given its values on the predictor variables.

The studentized residual is the quotient obtained by dividing a residual by an estimate of its standard deviation. Just like the standard deviation, it is very useful for detecting outliers. For ordinary data, values more than 3, 4, or 5 standard deviations from the mean give us reasonable grounds to suspect that they are outliers. In the regression setting, observations whose studentized residual exceeds 3, 4, or 5 in absolute value are the candidates for outliers.
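
The definition above can be written out as a formula. This is a sketch using the usual textbook notation, which is my assumption and not taken from the post: e_i is the residual of observation i, h_ii is its leverage (the i-th diagonal element of the hat matrix), s is the root mean squared error from the full fit, and s_(i) is the same quantity with observation i left out.

```latex
% Internally studentized residual: s is estimated from all observations
t_i = \frac{e_i}{s \sqrt{1 - h_{ii}}}

% Externally studentized (deleted) residual: s_{(i)} is estimated without observation i
t_i^{*} = \frac{e_i}{s_{(i)} \sqrt{1 - h_{ii}}}
```

The externally studentized version is usually preferred for outlier screening, because an extreme observation cannot inflate the very error estimate it is being compared against.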

In SAS, two regression procedures, PROC REG and PROC GLM, can easily be used to compute the studentized residual for detecting outliers. In both, the studentized residual is requested with the RSTUDENT= keyword in the OUTPUT statement. Other regression procedures (such as PROC MIXED) also compute studentized residuals as part of their influence diagnostics.

               output out=newdata rstudent=xxx;
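
To show where the OUTPUT statement fits, here is a minimal sketch of a complete run. It assumes the SASHELP.CLASS sample data set that ships with SAS. The model (weight on height and age), the variable name rstud, and the cutoff of 3 are illustrative choices only.

```sas
/* Fit a linear regression and save the studentized residuals. */
proc reg data=sashelp.class;
   model weight = height age;
   /* rstudent= names the new variable; "rstud" is an arbitrary name */
   output out=newdata rstudent=rstud;
run;
quit;

/* List the observations whose studentized residual exceeds the cutoff. */
/* The cutoff of 3 is the smallest of the 3, 4, or 5 thresholds.        */
proc print data=newdata;
   where abs(rstud) > 3;
   var name weight height age rstud;
run;
```

Depending on the data, the PROC PRINT step may list no observations at all, which simply means that nothing exceeded the chosen cutoff. In PROC REG and PROC GLM the RSTUDENT= keyword gives the residual studentized with the current observation deleted, while the STUDENT= keyword gives the version based on the full-data error estimate.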
