
PCA — Brief Explanation

Jeffrey Ng
2 min read · Dec 5, 2020


PCA, or Principal Component Analysis, is an exploratory data analysis tool that reduces the dimensionality of the data while preserving as much of the data’s variation as possible; it is performed only on continuous data. PCA applies an orthogonal transformation to the coordinate system, producing the eigenvalues and eigenvectors of the data’s covariance matrix, and keeping only the leading components is what reduces the number of dimensions. A little background reading in linear algebra may demystify these two terms. Eigenvalues and eigenvectors are important measures, as we will see later, for uncovering our data’s latent connections. In this sense, PCA gives us an abstract view of the relationships within our data.
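As a concrete sketch, here is what this looks like with scikit-learn’s PCA on a synthetic dataset (the random data and shapes are illustrative assumptions, not real data):

```python
# Minimal PCA sketch: reduce 5 continuous features down to 2 components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 samples, 5 continuous features
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)              # keep only the top 2 components
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                 # (100, 2): 5 dimensions reduced to 2
print(pca.explained_variance_)         # eigenvalues of the covariance matrix
print(pca.components_)                 # eigenvectors (component directions), one per row
```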

There are several reasons to perform dimensionality reduction. It reduces noise in our data and helps prevent overfitting. It can simplify models while maintaining data integrity, and, more importantly, it can remove multicollinearity from our models: because the principal components are orthogonal, they are uncorrelated with one another, unlike the collinear features they replace. PCA is an unsupervised learning technique, since it uses no target labels.
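A quick sketch of that multicollinearity point, using two deliberately collinear synthetic features (again, made-up data for illustration):

```python
# Two highly correlated features go in; uncorrelated component scores come out.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)       # x2 is nearly collinear with x1
X = np.column_stack([x1, x2])

scores = PCA(n_components=2).fit_transform(X)
print(np.corrcoef(X, rowvar=False)[0, 1])        # close to 1: strong multicollinearity
print(np.corrcoef(scores, rowvar=False)[0, 1])   # close to 0: components are uncorrelated
```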

The first concept to introduce is loadings = eigenvectors * sqrt(eigenvalues). The eigenvalue is the variance explained by a component, whereas the eigenvector gives that component’s direction. Each component can be thought of as a dimension in our data, and also as a derived feature of our model. When we discuss loadings, we are discussing the latent relationships in our data: how strongly each original feature weighs on each component. One way to interpret the loadings is to find a common theme among the features that load heavily on the same component. Observing the explained variance tells us how much weight to give that interpretation.
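The loadings formula translates directly into scikit-learn attributes. A minimal sketch, assuming scaled synthetic data:

```python
# loadings = eigenvectors * sqrt(eigenvalues), via scikit-learn's PCA attributes.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = StandardScaler().fit_transform(rng.normal(size=(100, 4)))

pca = PCA().fit(X)
# components_ rows are eigenvectors; explained_variance_ holds the eigenvalues
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
# loadings[i, j]: how strongly original feature i weighs on component j
print(loadings.round(2))
```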

It is important to scale our data, with MinMaxScaler(), RobustScaler(), or StandardScaler(), before we apply PCA, because PCA is sensitive to the relative scales of the features. A scree plot plots the explained variance (eigenvalues) against the number of components. The elbow, the point where the explained variance levels off significantly, tells us how many components are important in our model. The loadings then give us our interpretation of those components.
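A scree plot takes only a few lines with matplotlib. A sketch on synthetic data (a real dataset would show a clearer elbow than random noise does):

```python
# Scree plot sketch: explained variance per component, used to spot the elbow.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = StandardScaler().fit_transform(rng.normal(size=(200, 8)))

pca = PCA().fit(X)
components = np.arange(1, pca.n_components_ + 1)

plt.plot(components, pca.explained_variance_, marker="o")
plt.xlabel("Component")
plt.ylabel("Explained variance (eigenvalue)")
plt.title("Scree plot")
plt.show()
```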

Some useful applications of PCA are in facial recognition software (the classic eigenfaces technique), clinical psychology, general dimensionality reduction, and data visualization.
