What is PCA ?
PCA is an algorithm capable of finding patterns in data, it is used to reduce the dimension of the data.
X is a matrix of size
(m, n). We want to find an encoding fonction
f such as
f(X) = C where
C is a matrix of size
l < m and a decoding fonction
g that can approximately reconstruct
X such as
g(C) ≈ X
C is a representation of
X in a lower dimension, we want to find
f so that the loss of information in minimal.
PCA implementation steps
This article requires to know what is SVD and eigen decomposition if you want to understand each step. However if you don’t you can still read it to use the implementation !
We suppose that
X is a Numpy array containing the data.
k is the number of components we want after the transformation.
k = 3 X = torch.from_numpy(iris.data)
We need to standardize the data :
X_mean = torch.mean(X,0) X = X - X_mean.expand_as(X)
Perform Singular Value Decomposition
torch.SVD() we obtain the singular value decomposition: V the eigenvectors of X and S the eigenvalues in decreasing order. So
U[:,:k] corresponds to the k largest eigenvalues.
U,S,V = torch.svd(torch.t(X)) C = torch.mm(X,U[:,:k])
We will use our
def PCA(data, k=2): # preprocess the data X = torch.from_numpy(data) X_mean = torch.mean(X,0) X = X - X_mean.expand_as(X) # svd U,S,V = torch.svd(torch.t(X)) return torch.mm(X,U[:,:k])
Now we will visualize the PCA on the IRIS dataset from
iris = datasets.load_iris() X = iris.data y = iris.target X_PCA = my_PCA(X) plt.figure() for i, target_name in enumerate(iris.target_names): plt.scatter(X_PCA[y == i, 0], X_PCA[y == i, 1], label=target_name) plt.legend() plt.title('PCA of IRIS dataset') plt.show()
The PCA allowed us to visualize the iris dataset on a two dimensions visualization and to find combinations of attributes to identify each type of iris.