## Saturday, November 11, 2017

### PCA

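In PCA, we are interested in finding the directions (components) that maximize the variance in our dataset. In a dataset, some features differ a lot from one sample X to the next, and those features tend to have a more visible effect on classifying Y. If a feature is about the same for every X, it is not a good choice for partitioning the data.

PCA(n_components=2) automatically finds the first 2 principal components. Each component is a linear combination of the original features, not one of the original features picked out. The figure below is the 2D result, plotted by class (0, 1, 2).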
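A minimal sketch of the `PCA(n_components=2)` call with scikit-learn. The notes do not say which dataset was used, so the Iris dataset (which happens to have classes 0, 1, 2) is an assumption here, and plotting is left out to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Assumption: Iris stands in for the dataset in the notes (4 features, classes 0/1/2).
X, y = load_iris(return_X_y=True)

# Keep the two directions of largest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Each row of X is now described by 2 component scores instead of 4 features.
print(X_2d.shape)

# Share of the total variance captured by each of the two components.
print(pca.explained_variance_ratio_)

# Mean position of each class in the 2D component space.
for label in sorted(set(y.tolist())):
    centroid = X_2d[y == label].mean(axis=0)
    print(label, centroid)
```

`X_2d[:, 0]` and `X_2d[:, 1]` are the coordinates one would scatter-plot, colored by `y`, to get a figure like the one described.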

==============

Residual variance?

Start with the variance of the actual outcome scores: that is the total variance. Then you fit a regression model. You use the regression equation to calculate a predicted score for each person.

Then you find the difference between the predicted scores and the actual scores. These differences are the residuals. You calculate the variance of the residuals. That is the residual variance. The residual variance will be less than the total variance (or, if your predictors are completely useless, they will be equal).

How much variance did you explain?
Explained variance = (total variance - residual variance)

The proportion of variance explained is therefore:

explained variance / total variance

If your predicted scores exactly match the outcome scores, you've perfectly predicted the scores, and you've explained all of the variance. The residuals are all zero.
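The steps above can be sketched in plain Python. The data points are made up for illustration, and the variable names (`ss_total`, `ss_residual`, `r_squared`) are my own. The calculation uses sums of squares and a simple one-predictor least-squares fit.

```python
# Hypothetical data: one predictor x and one outcome y per person.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares fit of y = intercept + slope * x.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Predicted score for each person, and the residual (actual - predicted).
predicted = [intercept + slope * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

# Total, residual, and explained sums of squares.
ss_total = sum((yi - mean_y) ** 2 for yi in y)
ss_residual = sum(r ** 2 for r in residuals)
ss_explained = ss_total - ss_residual

# Proportion of variance explained.
r_squared = ss_explained / ss_total
print(round(r_squared, 4))
```

For this made-up data the total sum of squares is 6 and the residual sum of squares is 2.4, so the proportion explained comes out to about 0.6.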

(Note: The calculations are done with sums of squares. Working with variances can give a very slightly different answer: variances are usually calculated as SS/(N-1), and the residual variance is often divided by its own degrees of freedom instead. But if the sample is large, this difference is trivial.)