Talk:Principal component analysis/Archive 1

This page is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.

Separate articles on arg max and arg min notations

We probably need a small article on the arg max and arg min notations.

— Preceding unsigned comment added by The Anome (talk • contribs) 09:11, 21 July 2003

Missing crucial details

The article seems to be missing crucial details. I can't see where the actual dimension reduction is happening. Is the idea that you have several samples of the measurement vector x and you use these to estimate the expectations? 130.188.8.9 16:49, 20 Aug 2003 (UTC)

There should now be a clue. However, the article still needs work — Preceding unsigned comment added by Sboehringer (talk • contribs) 18:47, 22 March 2004‎

Wrong arithmetic in Table 2.2 on page 7 of the 4th external link (A tutorial on PCA by Lindsay I. Smith). The sum of the numbers in the second table is 1352.46, not 1149.89. Jeanbrincat (talk) 12:29, 29 August 2008 (UTC)

Requested move 2004

Principal components analysis is better known as principal component analysis (singular). The singular should be the main title, with the plural form a synonym referring to this page (unfortunately I do not know how to do that). — Preceding unsigned comment added by Sboehringer (talk • contribs) 18:47, 22 March 2004

I've always heard it with the plural. I have a PhD in statistics. I'm not saying the singular could never be used, but the plural is certainly the one that's frequently heard. Michael Hardy 21:18, 22 Mar 2004 (UTC)

The only monograph solely dedicated to PCA is, to my knowledge, the one by Jolliffe, titled "Principal component analysis". The naming issue is discussed in its introduction, differently from what you indicate. Then again, naming issues are conventions and vary across the globe. Sboehringer

Google says: "Principal component analysis": 103,000 hits, "Principal components analysis": 46,300 hits. MH 13:48, 25 Mar 2004 (UTC)

I have that monograph and you are correct. It seems, however, that the analysis elucidates the principal components, plural, and so unless one is only interested in one principal component at a time, the plural appears to be more appropriate. — Preceding unsigned comment added by 24.10.224.158 (talk • contribs) 18:07, 21 August 2004‎

In various scientific papers and books I have seen it spelled "Principal Component Analysis". But as long as it refers to the same content, I won't lose any sleep over it.

I have until now never seen it in the plural, not in scientific papers either. Do you ever write a plural before "analysis"? "Houses analysis", "cars analysis", "components analysis"... I think the plural is wrong, but I'm no native English speaker. Anoko moonlight 13:21, 30 July 2007 (UTC)

House prices analysis? Car sales analysis? Jheald 14:49, 30 July 2007 (UTC)

Okay, reasons to change it to singular: Primary PCA journal uses singular, more Google hits with singular, more Web of Science hits with singular. Reasons to keep it as plural: some wikipedia user says "the plural appears to be more appropriate." Well, I think that's enough for a change! 76.69.33.144 (talk) 15:20, 29 February 2008 (UTC)

There's a simple reason why it's usually called "principal component analysis". It's not that we use the singular, it's that in English when a plural noun is used attributively, you don't put an "s" on it. Actually it goes back to the Old English genitive plural, which ended in "-um" rather than "-s", and later the "-um" dropped off. Eric Kvaalen (talk) 17:44, 1 November 2010 (UTC)

Article needs serious improvement

Moving Michael Hardy's comments to Talk:

This article needs some serious revamping, to say the least. One cannot assume without loss of generality that the expectation is zero. If the expectation were observable, one could subtract it from x and get something with zero expectation, and so no generality would be lost by this assumption. In practice the expectation is never observable, and one must consider the probability distribution of the difference between x and an estimate, based on data, of the expectation of x.

Excuse me, but that is absurd. If the mean were observable, then one could simply subtract the mean from X, getting something with zero mean, and then indeed no generality would be lost by assuming that. In practice, one must use a data-based and therefore uncertain estimate of the mean, and one must therefore consider the probability distribution of the difference between X and the estimate of the mean of X.

If I may respond --- PCA is a technique that is applied to empirical data sets. PCA eigendecomposes the maximum likelihood covariance matrix. Indeed, there is a distribution of PCA decompositions about the "true" decomposition that you would get in the infinite data limit. But, that does not make it absurd. Or rather, no more absurd than any other maximum likelihood estimate. Any ML technique will have a variance around the estimate from infinite data.

Are you objecting because ML is not mentioned in the article? Or is it something else? -- hike395 04:39, 5 May 2004 (UTC)

Something else. Several something elses. It doesn't seem like that good an article. I'll probably drastically edit it within a few months; it's on my list. Michael Hardy 16:31, 5 May 2004 (UTC)

PCR and PLS?

Would it be redundant to include some discussion of principal components regression? I don't think so, but I don't feel qualified to explain it. — Preceding unsigned comment added by Robotica (talk • contribs) 17:00, 9 August 2004

It would also be nice to have a piece on Partial Least Squares. Geladi and Kowalski Analytica Chimica Acta 185 (1986) 1-17 may serve as a starting point. — Preceding unsigned comment added by 12.18.36.40 (talk • contribs) 04:03, 22 March 2005‎

I disagree --- PLS and PCR are both forms of linear regression, which is supervised learning. PCA is density estimation, which is unsupervised learning. Very different sorts of algorithms --- hike395 04:35, 22 Mar 2005 (UTC)

Principal components regression is used when the predictor variables are correlated, that is, cov(x_i, x_j) ≠ 0 for some i ≠ j. When this happens we have multicollinearity, which reduces the power of the inference. PCA is applied to the independent variables, and a regression model is then fitted to the chosen principal components. The new estimated parameters are biased, but uncorrelated, and the variance of the new model is smaller.
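
The procedure described above can be sketched roughly as follows. This is a minimal illustration assuming NumPy; the synthetic data, the variable names, and the choice of keeping k = 1 component are assumptions made for the example, not anything taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic predictors: x2 is almost a copy of x1, so the two are strongly collinear.
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
X = np.column_stack([x1, x2])            # n observations in rows, 2 predictors in columns
y = 3.0 * x1 + rng.normal(scale=0.1, size=n)

# Centre the predictors and the response.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# PCA of the predictors via SVD: the rows of Vt are the principal directions.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 1                                     # number of components kept (assumed)
scores = Xc @ Vt[:k].T                    # principal component scores, shape (n, k)

# Ordinary least squares on the retained scores.
gamma, *_ = np.linalg.lstsq(scores, yc, rcond=None)

# Map the coefficients back to the original predictors.
beta_pcr = Vt[:k].T @ gamma

# The full set of scores is uncorrelated: their covariance matrix is diagonal.
score_cov = np.cov(Xc @ Vt.T, rowvar=False)
```

Dropping the low-variance component is what introduces the bias mentioned above, in exchange for a more stable fit when the predictors are nearly collinear.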

I think that at least a discussion of the NIPALS algorithm would be useful here, since it is the main method by which PCs are calculated for very large datasets in the 'omics sciences. The Geladi reference is seminal. --Amaher (talk) 01:47, 15 January 2010 (UTC)

I've added the beginnings of a section on the NIPALS method - will expand over time. —Preceding unsigned comment added by Amaher (talk • contribs) 02:02, 15 January 2010 (UTC)
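
For readers who have not met NIPALS, the core iteration for the first component can be sketched as below. This is a hedged illustration assuming NumPy; the function name, tolerance, and test data are hypothetical, and it is not the text that was added to the article.

```python
import numpy as np

def nipals_first_component(X, tol=1e-12, max_iter=1000):
    """Return (scores t, loadings p) of the first principal component of centred X."""
    t = X[:, np.argmax(X.var(axis=0))].copy()   # start from the highest-variance column
    for _ in range(max_iter):
        p = X.T @ t / (t @ t)                   # regress the columns of X on the scores
        p /= np.linalg.norm(p)                  # normalise the loading vector
        t_new = X @ p                           # updated scores
        converged = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
        t = t_new
        if converged:
            break
    return t, p

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4)) @ np.diag([5.0, 2.0, 1.0, 0.5])
X -= X.mean(axis=0)                             # observations in rows, centred columns

t, p = nipals_first_component(X)
_, _, Vt = np.linalg.svd(X, full_matrices=False)  # reference direction for comparison
```

Further components would be obtained by deflation, that is, repeating the iteration on `X - np.outer(t, p)`. Only the leading components are ever computed, which is why the method is attractive for very large data sets.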

PCA & Least Squares

Is PCA the same as a least squares fit? (Furthermore, is either the same as finding the principal moment of inertia of an n-dimensional body?) —BenFrantzDale 23:53, August 3, 2005 (UTC)

No. A least-squares fit minimizes (the squares of) the residuals, the vertical distances from the fit line (hyperplane) to the data. PCA minimizes the orthogonal distances to the hyperplane. (Or something like that; I don't really know what I'm talking about.) As for moments of inertia, well, physics isn't exactly my area of expertise. —Caesura ^(t) 18:44, 14 December 2005 (UTC)

Yes. PCA is equivalent to finding the principal axes of inertia for N point masses in m dimensions, and then throwing all but l of the new transformed co-ordinates away. It's also mathematically the same problem as Total Least Squares (errors in all variables), rather than Ordinary Least Squares (errors only in y, not x), if you can scale it so the errors in all the variables are uncorrelated and the same size. You're then finding the best l dimensional hyperplane that your data ought to sit on through the m dimensional space. The real power tool behind all of this to get a feel for is Singular Value Decomposition. PCA is just SVD applied to your data. -- Jheald 19:40, 12 January 2006 (UTC).
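
The difference between the two kinds of fit can be seen numerically. The following is a small sketch assuming NumPy and made-up two-dimensional data; it compares the ordinary least squares slope with the slope of the first principal direction.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(scale=0.5, size=500)
pts = np.column_stack([x, y])
pts -= pts.mean(axis=0)                  # centre both coordinates

# Ordinary least squares: minimise vertical residuals (errors in y only).
slope_ols = (pts[:, 0] @ pts[:, 1]) / (pts[:, 0] @ pts[:, 0])

# PCA / total least squares: direction of the first right singular vector,
# which minimises the orthogonal distances to the line.
_, _, Vt = np.linalg.svd(pts, full_matrices=False)
slope_pca = Vt[0, 1] / Vt[0, 0]
```

For positively correlated data the PCA line comes out steeper than the OLS line, because OLS attributes all of the scatter to y.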

I understand it better now. The article should clarify more about SVD of the (zero-mean) data matrix versus eigendecomposition of the covariance matrix. The latter approach seems most intuitive, but both are valid. Somewhere (here or on another page) we should have the fact that the squared singular values divided by n-1 give the principal variances... —Ben FrantzDale (talk) 23:30, 27 April 2010 (UTC)
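
The agreement between the two routes can be checked directly. A minimal sketch, assuming NumPy and synthetic data with observations in rows: the eigenvalues of the sample covariance matrix should equal the squared singular values of the centred data matrix divided by n-1.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
X = rng.normal(size=(n, 3)) @ np.array([[2.0, 0.0, 0.0],
                                        [0.5, 1.0, 0.0],
                                        [0.0, 0.3, 0.2]])
Xc = X - X.mean(axis=0)                  # zero-mean data, observations in rows

# Route 1: eigendecomposition of the sample covariance matrix.
C = Xc.T @ Xc / (n - 1)
eigvals = np.linalg.eigvalsh(C)[::-1]    # eigvalsh returns ascending order, so reverse

# Route 2: SVD of the centred data matrix itself.
s = np.linalg.svd(Xc, compute_uv=False)  # singular values, largest first
variances = s**2 / (n - 1)
```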

Derivation of PCA

Shouldn't the constraint that we are looking for the maximum variance appear somewhere in that derivation ? I cannot understand it clearly as it is right now. --Raistlin 12:49, 24 August 2005 (UTC)

It is my understanding that the first principal component is the least squares fit to a multidimensional configuration of points, which happens to also be the axis of maximum variance. The second principal component is also a least squares fit to the configuration, with the additional constraint that it must be orthogonal to the first principal component. The third, fourth, fifth, etc, principal components are also least squares fits, except that they are each constrained to be orthogonal to all of the principal components before them. 24.221.60.71 05:03, 21 May 2007 (UTC)

Exactly right. The more of the variance that can be put into the first n components, ie the n-component subspace fitted, the less is the variance (sum of squares) of the points' residuals orthogonal to that subspace. Jheald 15:46, 21 May 2007 (UTC)
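
A sketch of the standard argument, in notation that is assumed here rather than taken from the article: x_i are the centred data points, C is their sample covariance matrix, and w is a unit vector. Maximizing the variance along w under the unit-length constraint gives an eigenvector equation, and the Pythagorean identity in the last line shows why maximum variance and least squares residuals pick out the same direction.

```latex
\begin{aligned}
&\text{maximize } \operatorname{Var}(w^{T}x) = w^{T} C\, w
  \quad \text{subject to } w^{T}w = 1,\\
&\mathcal{L}(w,\lambda) = w^{T} C\, w - \lambda\,(w^{T}w - 1),\\
&\frac{\partial \mathcal{L}}{\partial w} = 2Cw - 2\lambda w = 0
  \;\Longrightarrow\; Cw = \lambda w,
  \qquad w^{T} C\, w = \lambda,\\
&\sum_i \lVert x_i \rVert^{2}
  = \sum_i (w^{T}x_i)^{2} + \sum_i \lVert x_i - w\,w^{T}x_i \rVert^{2}.
\end{aligned}
```

Since the left-hand side of the last line is fixed by the data, choosing w to make the first sum (the variance along w) as large as possible makes the second sum (the squared orthogonal residuals) as small as possible. The later components repeat the argument with the extra constraint of orthogonality to the earlier ones.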

Conjugate transpose

and * T represents the conjugate transpose operation.

Why conjugate transpose instead of a normal transpose ? Does it even work with complex numbers ? Taw 04:18, 31 December 2005 (UTC)

As you probably know, conjugate transpose is a generalization of plain old transpose that allows these operations to work on complex numbers instead of just real numbers. If the source data X consists entirely of real numbers, then the conjugate operation is completely transparent, since the conjugate of a real number is the number itself. But if the source data includes complex numbers, then the conjugate operation is absolutely essential for the matrix operations to yield meaningful results. As far as I can tell, it does work on complex numbers. As an example where you might have complex numbers as source data, you might want to use PCA on the Fourier components of a real, discrete-time signal, which are in general complex. -- Metacomet 18:59, 1 January 2006 (UTC)
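
A small numerical illustration of why the conjugate matters, assuming NumPy and hypothetical complex data with variables in rows: with the conjugate transpose the covariance matrix is Hermitian with real, non-negative variances on the diagonal, whereas the plain transpose generally gives complex "variances".

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical complex-valued data: 3 variables (rows), 50 observations (columns).
B = rng.normal(size=(3, 50)) + 1j * rng.normal(size=(3, 50))
B -= B.mean(axis=1, keepdims=True)           # centre each variable

C_conj = B @ B.conj().T / (B.shape[1] - 1)   # conjugate transpose
C_plain = B @ B.T / (B.shape[1] - 1)         # plain transpose, no conjugation

is_hermitian = np.allclose(C_conj, C_conj.conj().T)
diag_is_real = np.allclose(C_conj.diagonal().imag, 0.0)
plain_diag_is_real = np.allclose(C_plain.diagonal().imag, 0.0)
eigvals = np.linalg.eigvalsh(C_conj)         # real eigenvalues of the Hermitian matrix
```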

I have added a motivation paragraph at Conjugate_transpose#Motivation to try to show why it is so natural for the conjugate transpose to turn up, whenever the matrix you're transposing includes complex numbers. Hope it's helpful. -- Jheald 20:14, 12 January 2006 (UTC).

Computation -- surely this is not the right way to go ?

The section on computation looks to make a real meal of things, IMO; and to be pretty dubious too, as regards its numerical analysis. As soon as you square the data matrix, you're going to reduce the accuracy of your SVD from double precision to single precision.

Is there any reason to prefer either of the methods in the text, compared to choosing which bits of the SVD you actually want to keep, and then just wheeling out R-SVD ? (Which I imagine is quicker, too). -- Jheald 19:05, 12 January 2006 (UTC).

I agree that this article is unreadable. The lengthy "PCA algorithm" section is one of the main reasons - it is too long, and it doesn't agree with the equations in the introduction (where did we divide by N-1? why? what about the empirical standard deviations?). It doesn't even say what the output of the algorithm is, AFAICT. A5 13:32, 6 March 2006 (UTC)

I am working on improving the algorithm section to make it more readable. In the end, the section will still be quite long, because the algorithm is rather complicated and I think it is important to include enough detail so that people can actually implement it in software. After I have completed this upgrade, please make specific suggestions for further improvements. -- Metacomet 21:37, 9 March 2006 (UTC)

I am done for now. There is still more work to do, but it's a good start. Please provide comments and suggestions for improvement. Thanks. -- Metacomet 23:12, 9 March 2006 (UTC)

The improvement I would suggest is to delete the whole entire section completely, starting from the table, and then everything following it; and instead tell people to use SVD.

A standard SVD routine will be better written, better tested, faster, and more numerically stable.

IMO it is totally irresponsible for the article to be suggesting inefficient homespun routines, actually leading people away from the standard SVD routines. -- Jheald 00:03, 10 March 2006 (UTC).

I'm no expert Jheald, but I don't see what you're so worried about. Algorithms for SVD that I have seen on the WWW basically consist of the same algorithm that is listed on this PCA page, only done twice, once for left handed eigenvalues, once for right. Is there some other algorithm for SVD that is much preferable? --Chinasaur 08:40, 25 May 2006 (UTC)

I'm sort of an expert - I have a PhD in computer science, in algorithms, although numerical algorithms are not my specific thing - and yeah, actually, the algorithms for SVD that you'll find in a package like LAPACK, R or Mathematica actually are different from the one described here. They avoid computing the covariance matrix for the reason Jheald suggested. ProfessorSpice 01:07, 29 June 2007 (UTC)

I am really glad that you took some time to carefully review the work that I did and make some thoughtful recommendations. Thanks for the constructive feedback. Oh yes, that is sarcasm, in case you were wondering. -- Metacomet 00:48, 10 March 2006 (UTC)

"...totally irresponsible..." Don't you think that is just a wee bit of hyperbole?

"...homespun routines..." Are you referring to calculating the mean, the standard deviation, or the covariance? No, that can't be right, those are well-known and well-established procedures from statistics. Or perhaps eigenvectors and eigenvalues? Hmmmm, those are standard routines in linear algebra. Sorting the basis vectors by energy content and keeping only the ones with the highest contribution? No, that's also a standard concept called the 80-20 rule (or Pareto's principle). I guess I just don't understand what you mean by homespun routines....

-- Metacomet 01:36, 10 March 2006 (UTC)

BTW, I am pretty sure that dividing by N-1 is correct, which means the introduction needs to be fixed, not the algorithm. The reason the algorithm needs to divide by N-1 is that it is computing the expected value of the product, not the product itself. -- Metacomet 21:50, 9 March 2006 (UTC)

I don't know anything about maths, but all the pages about the covariance matrix use N, so maybe N-1 is not so correct...? -- IC 18:48, 18 November 2006 (GMT+1)

A mathematical derivation with eigenvalues and eigenvectors is OK but such methods should not be called algorithms. The practical computation should be SVD. Squaring the matrix to get the covariance is harmful. By the way, homegrown SVD is harmful as well and I must support Jheald on both counts. A professional implementation should use SVD from some efficient and stable library, just as one should never write matrix-matrix multiplication except in a college homework. LAPACK is a de-facto standard for all that. ~~(There might be justified exceptions such as programming something exotic with small memory.)~~ Even if some engineering textbooks might have algorithms like here with the covariance created explicitly, their authors are obviously not professional numerical analysts or developers or they do not care about numerical aspects. Jmath666 07:28, 16 March 2007 (UTC)

I would like to support the suggestion of using an SVD function rather than a generic eigensolver. While the algorithm described is a mathematically correct way of doing PCA, not all mathematically correct algorithms are equally good. Two big issues for numerical algorithms are accuracy (how much does it lose in round-off error?) and efficiency. People spend their lives worrying about these issues, and those are the people who write programs like LAPACK, SciPy, R, MATLAB or Mathematica. Since the person implementing PCA is going to install some package like this to compute the eigenvalues anyway, s/he might as well use the SVD function in the package instead of computing the covariance matrix and then using the eigensolver function. SVD is more "special purpose": it takes B and computes the decomposition directly, without going through the step of computing BB^T. The SVD function will certainly be more accurate (as Jheald said, computing the covariance matrix loses digits to round-off error) and I believe almost always more efficient. And it makes the PCA implementation shorter and easier to do! Just to drag in a real big-gun reference, Golub and Van Loan's textbook Matrix Computations gets 14,000+ citations on Google Scholar, and it recommends against computing the covariance matrix when computing SVD, in Section 8.3. ProfessorSpice 01:07, 29 June 2007 (UTC)
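
The loss of digits can be made concrete with a tiny ill-conditioned example. This is a sketch assuming NumPy and double precision; the matrix and the value of `eps` are chosen for illustration only. The exact singular values are sqrt(2 + eps^2) and eps, but once B^T B is formed explicitly the information about eps is rounded away.

```python
import numpy as np

eps = 1e-9
# Small test matrix whose exact singular values are sqrt(2 + eps**2) and eps.
B = np.array([[1.0, 1.0],
              [eps, 0.0],
              [0.0, eps]])

# Route 1: form the Gram matrix explicitly, then take its eigenvalues.
G = B.T @ B                       # 1 + eps**2 rounds to exactly 1.0 in double precision
eig_small = np.linalg.eigvalsh(G)[0]
sigma_from_eig = np.sqrt(abs(eig_small))   # whatever comes out here carries no trace of eps

# Route 2: SVD applied directly to B.
sigma_from_svd = np.linalg.svd(B, compute_uv=False)[-1]
```

Here `G` is stored as a matrix of exact ones, so no eigensolver can recover the small singular value from it, while the SVD of `B` returns a value close to `eps`.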

You all make a very strong case for SVD; however, I hope we can all agree that the SVD article lacks any sort of decent step-by-step explanation of an algorithm to produce it. Until there is, I will use the method outlined here, as LAPACK is not an option for me. Please don't complain about what's up here until you know there is something better elsewhere on Wikipedia; right now that's just not the case. Jordyhoyt 10:46, 1 August 2007 (UTC)

Question on reduced-space data matrix

The article states: and then obtaining the reduced-space data matrix Y by projecting X down into the reduced space defined by only the first L singular vectors, W_L:

$\mathbf{Y} = \mathbf{W}_{L}^{T}\mathbf{X} = \mathbf{\Sigma}_{L}\mathbf{V}_{L}^{T}$

I believe that the correct formula is:

$\mathbf{Y} = \mathbf{X}\mathbf{V}_{L} = \mathbf{W}_{L}\mathbf{\Sigma}_{L}$

Can anyone verify this? — Preceding unsigned comment added by 216.113.168.141 (talk • contribs) 20:37, 16 June 2006‎

Afraid not. The way things are set up in the article, the data matrix X, of size M x N, consists of N column vectors, each representing a different sampling event; with each sampling made up of measurements of M different variables, so giving the matrix M different rows.

With the reduced space, we want to find a smaller set of L new variables, which for each sampling preserves as much of the information as possible out of the original M variables.

So we're looking for an L x N matrix, with the same number of columns (the same number of samples), but a smaller number of rows (so each sample is described by fewer variables).

Matrix W_L is an M x L matrix, so W_L Σ_L is also an M x L matrix - not the shape we're looking for. But Σ_L V_L^T is the desired L x N shape.

Hope this helps. -- Jheald 11:02, 17 June 2006 (UTC).
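
The shape argument can be checked with a few lines of code. A minimal sketch assuming NumPy, with the data matrix X laid out as described above (M variables in rows, N samples in columns); the sizes M, N and L are arbitrary values picked for the example.

```python
import numpy as np

rng = np.random.default_rng(5)
M, N, L = 4, 30, 2                          # variables, samples, kept components (assumed sizes)
X = rng.normal(size=(M, N))
X -= X.mean(axis=1, keepdims=True)          # centre each variable (row)

W, sigma, Vt = np.linalg.svd(X, full_matrices=False)   # X = W @ diag(sigma) @ Vt
W_L = W[:, :L]                              # M x L
Sigma_L = np.diag(sigma[:L])                # L x L
Vt_L = Vt[:L, :]                            # L x N

Y = W_L.T @ X                               # reduced data, L x N
Y_alt = Sigma_L @ Vt_L                      # the same matrix built from the SVD factors
```

If the data were instead stored with samples in rows, the transposed formula Y = X V_L = W_L Σ_L would be the appropriate one, which is the source of the confusion in this thread.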

Yes, that clarifies. Thanks Jheald! I thought that X row vectors were the sampling events and the column vectors were the variables -- since the definition of X is in fact the transpose of what I thought, then everything makes sense. -- 12:33, 26 June 2006

On a related note, am I misunderstanding, or is this inconsistent with the article text? It says,

"each row represents a different repetition of the experiment, and each column gives the results from a particular probe"

how is that compatible with what was just said,

"the data matrix X, of size M x N, consists of N column vectors, each representing a different sampling event; with each sampling made up of measurements of M different variables, so giving the matrix M different rows."? 140.163.0.5 (talk) 16:03, 14 August 2009 (UTC)

That sounds contradictory to me. FWIW, Matlab's cov function returns the covariance matrix for row vectors. That is, cov(mxn) is n-by-n. Obviously, you can define any of these either way, but consistency would be nice, particularly between this page and covariance matrix. —Ben FrantzDale (talk) 20:01, 16 August 2009 (UTC)

Eigenvector/eigenvalue ordering

Under "Find the eigenvectors and eigenvalues of the covariance matrix", the article says "The eigenvalues and eigenvectors are ordered and paired." But then the next section says to order the columns by decreasing eigenvalue. Maybe I'm misunderstanding the previous section but this seems contradictory. 71.199.186.28 (talk) 00:06, 30 March 2008 (UTC)

I've used PCA for classification successfully without reordering at this stage. It may be that the article is simply trying to preempt the later stage discussed. 134.225.217.52 (talk) —Preceding comment was added at 02:20, 25 April 2008 (UTC)

Simplification

Could someone put one sentence at the top explaining this in layman's terms? It looks to me like a very fancy and statistically smart way to average a whole heap of data into some sort of dataset common to all of the data -- is this at all a correct impression? --Fastfission 04:07, 28 January 2006 (UTC)

Agreed! Understanding the introductory paragraphs requires one to first read a half dozen other articles. It ought to be possible to provide a simple explanation of PCA, perhaps with an illustrative example of its application, before diving headlong into statistician's jargon. I'm reasonably well educated and after reading this article, I have almost no clue what PCA is. --Anonymous —Preceding unsigned comment added by 192.55.52.9 (talk) 19:51, 6 November 2008 (UTC)

I agree with this! I have a science PhD, but unfortunately, minimal requisite statistics to process biological data, and this article simply confounded me. I request starting with the objective of PCA, then moving to what it provides, then moving to how it's done followed by theory. This is a classical teaching flow that many lessons follow.

Best of Luck! —Preceding unsigned comment added by 129.59.121.18 (talk) 15:01, 14 June 2010 (UTC)

How about this

Principal Component Analysis is a statistical method based on concepts from linear algebra. It seeks to summarize as much of the variation in a set of observations as possible with a small number of dimensions. Given a set of n observations (or cases), each of which contains a numerical value from each of a set of m variables, we regard each observation as a point in an m-dimensional vector space. Principal component analysis constructs an alternative orthogonal basis for the space, with dimensions chosen so that the first captures as much of the variance in the original data as possible, the second captures as much of the residual variance as possible after removing the first, and so on. Because the new dimensions are ordered from most to least "important" one can take a reduced basis consisting of the first k dimensions -- these are the "principal components" of the original multidimensional distribution. —Preceding unsigned comment added by 98.178.173.192 (talk) 00:13, 5 November 2010 (UTC)
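
To make the proposed description concrete, here is a short sketch assuming NumPy and synthetic data with n observations in rows and m variables in columns; the sizes and the choice of k are assumptions for the example. It computes the share of variance captured by each new dimension and the coordinates of the data in the reduced basis.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m, k = 300, 5, 2                       # observations, variables, kept dimensions (assumed)
latent = rng.normal(size=(n, k)) * np.array([3.0, 1.5])
data = latent @ rng.normal(size=(k, m)) + 0.1 * rng.normal(size=(n, m))

centred = data - data.mean(axis=0)
_, s, Vt = np.linalg.svd(centred, full_matrices=False)

explained = s**2 / np.sum(s**2)           # share of total variance per new dimension
reduced = centred @ Vt[:k].T              # n x k coordinates in the reduced basis
```

The entries of `explained` are non-increasing and sum to one, which is the sense in which the new dimensions are ordered from most to least "important".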

Cov Matrix (Contradiction)

If one is dealing with an MxN data set, i.e. N factors and M observations of each, the resulting covariance matrix will be NxN, not MxM.

It seems like everything from the mean vector subtraction to the covariance matrix calculation is done as if the data are organized as M rows of variables and N columns of observations. This is not properly explained in the "organizing the data" section, and is kind of opposite what most people would expect. I'm inclined to reverse everything. --Chinasaur 22:39, 19 May 2006 (UTC)

Yeah, this whole covariance matrix thing seems completely wrong. It states:

$\mathbf{C} = {1 \over N}\mathbf{B}\cdot\mathbf{B}^{*}$

And this is inconsistent on two levels. First of all, the covariance matrix of B is NxN, not MxM, as others have stated above. This is in direct contradiction to what this section of the article states, and to what the "organizing data" section states. Secondly, assuming that each data set is in a column (so 3 datasets of 5 points each are organized into a 5x3 [M=5, N=3] matrix), the covariance matrix is NOT

${1 \over N}\mathbf{B}\cdot\mathbf{B}^{*}$

but is actually

${1 \over N}\mathbf{B}^{*}\cdot\mathbf{B}$.

So the equation given above is for the transpose of B. And in any case, that's not even the covariance matrix of the transpose of B: the 1/N is wrong, it should be 1/(N-1). Unfortunately I don't know enough about it to make the correction, and the Wikipedia article that I came to in order to learn about it is quite inadequate. Anyway, I will be putting a contradiction tag on this article because of this. This is a very, very poorly written article, and the original author really deserves a sound spanking. Once I actually do find a correct, concise source of information regarding PCA, I'll be redoing it. --JCipriani 22:50, 13 April 2007 (UTC)

This article seems taken from some confused engineering textbooks that make it too complicated because they try to be elementary and try to teach other things at the same time. I had to wade through the mess myself trying to learn about PCA not too long ago but I never found an acceptable source. In fact, it is very simple: PCA is the spectral decomposition of the sample covariance matrix. It is best computed by SVD. It can be proved that the eigenvectors have certain optimal properties regarding the variance. It is very short, really. The Karhunen-Loeve decomposition is something a bit else (see Loeve Probability theory ISBN 0-387-90262-7) and it is done in advanced graduate courses in probability theory; but once you know that you can just say that "PCA is KL decomposition with covariance replaced by the sample covariance". All else is crud. I may write it up one day if I have the time. If you want to give it a shot these are on the clearer side: Holmes et al ISBN 0-521-55142-0, and Liang, Y. C. et al Proper orthogonal decomposition and its applications. I Theory, Journal of Sound and Vibration, 252 (2002) 527--544. Jmath666 04:51, 22 April 2007 (UTC)

Should the sample covariance matrix be used here instead of the population covariance? It seems the calculation should be;

$\mathbf{C} = {1 \over N-1}\mathbf{B}\cdot\mathbf{B}^{T}$

p.484 of David Lay's Linear Algebra and its Applications, 3rd ed ISBN 0-201-70970-8 and p.5 of the paper "A Tutorial on Principal Component Analysis" by Lindsay Smith support this. When you are performing PCA it's typically on a sample of the population, right? Zhroth (talk) 16:21, 19 February 2008 (UTC)
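
Both the size of C and the choice of divisor can be checked against a library routine. A minimal sketch assuming NumPy, with B holding M variables in rows and N observations in columns as in the formula above; `np.cov` treats rows as variables and divides by N-1 by default, and divides by N when `bias=True` is passed.

```python
import numpy as np

rng = np.random.default_rng(7)
M, N = 3, 10                                   # M variables (rows), N observations (columns)
X = rng.normal(size=(M, N))
B = X - X.mean(axis=1, keepdims=True)          # subtract each variable's mean

C_sample = B @ B.T / (N - 1)                   # sample covariance, M x M
C_population = B @ B.T / N                     # divides by N instead

matches_numpy_default = np.allclose(C_sample, np.cov(X))
matches_numpy_biased = np.allclose(C_population, np.cov(X, bias=True))
```

The two versions differ only by the scalar factor N/(N-1), so they have the same eigenvectors; the choice affects the reported variances, not the principal directions.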

Comments: "Outer product" between matrices is not a commonly used mathematical term. An outer product is usually understood as an operation between two vectors. In any case, using the outer product here only makes the description seem more formal and complicated.

 —Preceding unsigned comment added by 68.147.165.202 (talk) 18:46, 8 March 2008 (UTC)

Cov Matrix size

The size of the cov matrix C is still unclear. From the section "Find the eigenvectors and eigenvalues of the covariance matrix" on, it is considered to be NxN, while in the section "Find the covariance matrix" it is MxM, which I think is the right size, since the matrix B is MxN. 133.6.156.71 12:07, 6 June 2006 (UTC)

Shouldn't it read "inner product" instead of "outer product", since taking C as the outer product $B\cdot B^{*}$ would make it an $M\times N\times N\times M$ tensor?

"Outer product" is right; it is just that they switched the meaning of the dimensions, as some of the previous comments have indicated. —Preceding unsigned comment added by 74.192.1.156 (talk) 11:46, 18 October 2007 (UTC)

This isn't really working!

The first point I wondered about is "Calculate the empirical mean". I think the mean is not calculated in the right way. The mean is calculated over each dimension M. Isn't that distorting the data? I think you have to take the mean over each observation (N-vector).
The second point is the size, first of the covariance matrix and then of the eigenvalue matrix. By calculating the eigenvalues you get one for each variable in the data set. So the size of this matrix should be MxM. And before that, to reach this result, the covariance matrix must have the same size.
... Does anybody have an idea how it really works?

Subtracting the mean of the observation is nonsense. If I have features X1 and X2, where X1 is on the order of 10^20 and X2 is on the order of 10^-20, subtracting the mean of the observation will just make X2 a hugely negative value and cut X1 roughly in half. Subtracting the mean of the dimension makes sense because you are trying to shift the problem back to the origin (if it were plotted). —Preceding unsigned comment added by 74.192.1.156 (talk) 11:50, 18 October 2007 (UTC)

What's the difference between PCA and ICA

Just wondering. This isn't clear to me from these articles. --137.215.6.53 12:18, 3 August 2006 (UTC)

Principal Components analysis versus Exploratory factor analysis

I suggest including a subsection discussing the differences between PCA and exploratory factor analysis. In my experience working in a Stat Lab, students and clients get them confused. Perhaps a description of the differences between PCA and EFA may be included. This can be added to common factor analyses. Below is my understanding of the differences. I did not want to use "Greek" symbols, so that it may perhaps be more accessible to non-mathematicians. What do you think?

Exploratory factor analysis (EFA) and principal component analysis (PCA) may differ in their utility. The goal in using EFA is factor structure interpretation and also data reduction (reducing a large set of variables to a smaller set of new variables), whereas the goal for PCA is usually only data reduction.

EFA is used to determine the number and the nature of latent factors which may account for a large part of the correlations among a large number of measured variables. On the other hand, PCA is used to reduce scores on a large set of observed (or measured) variables to a smaller set of linear composites of the original (or observed) variables that retain as much information as possible from the original (or observed) variables. That is, the components (linear combinations of the observed items) serve as reduced set of the observed variables.

Moreover, the core theoretical assumptions are different for both methods. EFA is based on the common factor model (FA), whereas, PCA is not.

1. Common and unique variances

Common Factor Model (FA): Factors are latent variables that explain the covariances (or correlations) among the observed variables (items). That is, each observed item is a linear equation of the common factors (i.e., single or multiple latent factors) and one unique factor (latent construct affiliated with the observed variable). The latent factors are viewed as the causes of the observed variables.

Note: Total variance of variable = common variance + unique variance (in which, unique variance = specific + error variance).
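
For readers who do prefer symbols, the common factor model and the variance split in the note can be written as below. The notation is an assumption introduced here (x_j an observed variable, f_k the common factors, λ_jk the loadings, u_j the unique factor with variance ψ_j), and the second line additionally assumes uncorrelated common factors with unit variance that are uncorrelated with the unique factors.

```latex
\begin{aligned}
x_j &= \sum_{k=1}^{K} \lambda_{jk}\, f_k + u_j,\\
\operatorname{Var}(x_j) &= \underbrace{\sum_{k=1}^{K} \lambda_{jk}^{2}}_{\text{common variance (communality)}}
  \;+\; \underbrace{\psi_j}_{\text{unique variance}},
\qquad \psi_j = \text{specific variance} + \text{error variance}.
\end{aligned}
```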

Principal Components (PCA): In contrast, PCA does not distinguish between common and unique variances. The components are estimated to represent the variances of the observed variables in as economical a fashion as possible (i.e., in as small a number of dimensions as possible), and no latent (or common) variables underlying the observed variables need to be invoked. Instead, the principal components are optimally weighted sums of the observed variables (i.e., components are linear combinations of the observed items). So, in a sense, the observed variables are the causes of the composite variables.

2. Reproduction of observed variables

FA: Underlying factor structure tries to reproduce the correlations among the items

PCA: Composites reproduce the variances of observed variables

3. Assumption concerning communalities & the matrix type.

FA: Assumes that a variable's variance is composed of common variance and unique variance. For this reason, we analyze the matrix of correlations among measured variables with communality estimates (i.e., proportion of variance accounted for in each variable by the rest of the variables) on the main diagonal. This matrix is called Rreduced.

Note: Principal Axis factoring (PAF) = principal component analysis on Rreduced.

PCA: There is no place for unique variance and all variance is common. Hence, we analyze the matrix of correlations (Rxx) among measured variables with 1.0s (representing all of the variance of the observed variables) on the main diagonal. The variance of each measured variable is entirely accounted for by the linear combination of principal components.
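
The distinction drawn in point 1 can be written compactly. The following is only a sketch in common textbook notation; the symbols (λ for factor loadings, f for common factors, u for unique factors, ψ for unique variance, w for component weights) do not appear in the comment above and are assumptions made for illustration. The variance decomposition shown assumes standardized, mutually uncorrelated factors.

```latex
% Common factor model: each observed variable x_i is a linear function of
% k common factors f_j plus one unique factor u_i.
x_i = \sum_{j=1}^{k} \lambda_{ij} f_j + u_i

% With uncorrelated, unit-variance factors, the total variance splits into
% common variance (the communality h_i^2) and unique variance psi_i.
\operatorname{Var}(x_i) = \underbrace{\sum_{j=1}^{k} \lambda_{ij}^{2}}_{h_i^{2}} + \psi_i

% PCA: each component y_j is a weighted sum of the p observed variables,
% with no unique term.
y_j = \sum_{i=1}^{p} w_{ij} x_i
```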

Also See factor analysis

(please bear with me, I am new to using Wikipedia).

RicoStatGuy 15:53, Sept 30, 2006(UTC)

Orthogonality of components

According to this PDF, the eigenvectors of a covariance matrix are orthogonal. The eigenvectors of an arbitrary matrix are not necessarily orthogonal, as seen in the leading picture on the eigenvector page. So what gives? Why are these eigenvectors necessarily orthogonal? —Ben FrantzDale 14:44, 7 September 2006 (UTC)

According to Symmetric matrix, "Another way of stating the spectral theorem is that the eigenvectors of a symmetric matrix are orthogonal." That explains that. 128.113.54.151 20:00, 7 September 2006 (UTC)

If the multiplicity of every eigenvalue of the covariance matrix is 1, then the eigenvectors will by necessity be orthogonal.

If there exists an eigenvalue of the covariance matrix with multiplicity greater than 1, say of dimension r, then this corresponds to an r-dimensional subspace of Rⁿ (n being the dimension of the covariance matrix). Then the corresponding eigenvectors can be in principle any basis of this subspace. But generally speaking, the basis is chosen to be orthogonal.

So to answer the question, in some cases they must be orthogonal, and in some cases they do not all have to be, but are usually chosen to be so.

On a side note, all software packages I am aware of will return orthogonal eigenvectors in the multiplicity case. I suspect that this is because the algorithms implicitly force this by recursively projecting Rⁿ into the nullspace of the most recent eigenvector, or something equivalent. Baccyak4H (talk) 17:56, 20 November 2006 (UTC)

Actually, 128.113.54.151 is exactly correct. Because covariance matrices are symmetric, they are necessarily normal. The complex spectral theorem tells us that in ALL cases a normal operator on a complex vector space has an orthonormal basis of eigenvectors. In fact, the theorem tells us that such an orthonormal basis exists if and only if the operator is normal. If we restrict ourselves to the reals, the Real Spectral Theorem tells us that a matrix has an orthonormal collection of eigenvectors if and only if it is self-adjoint. Covariance matrices are self-adjoint, so again the theorem holds. The statement above by Baccyak4H that in some cases the eigenvectors do not have to be orthogonal is incorrect when we are talking about covariance matrices. His assertion that an arbitrary collection of eigenvectors can be reshaped into an orthogonal collection of eigenvectors is also incorrect. 167.206.189.3 20:44, 19 June 2007 (UTC)

No, Baccyak4H is correct. As you say, it is always possible to find a basis of orthonormal eigenvectors for a real symmetric matrix. However if two eigenvectors u₁ and u₂ share the same common eigenvalue λ, then any arbitrary linear combinations v₁=α₁u₁+β₁u₂ and v₂=α₂u₁+β₂u₂ are also eigenvectors with the same eigenvalue. So yes, it is possible to find vectors u₁ and u₂ which are orthogonal, but if they share an eigenvalue then one can also find an infinite number of pairs of valid eigenvectors v₁ and v₂ which are not orthogonal. Jheald 21:06, 19 June 2007 (UTC)
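
The point about repeated eigenvalues can be checked numerically. The sketch below uses a made-up diagonal matrix (a valid covariance matrix, chosen only for illustration) whose eigenvalue 2 has multiplicity two; the helper names are hypothetical.

```python
# Symmetric matrix in which the eigenvalue 2 has multiplicity 2.
C = [[2.0, 0.0, 0.0],
     [0.0, 2.0, 0.0],
     [0.0, 0.0, 1.0]]

def matvec(M, v):
    """Multiply matrix M by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(a, b):
    """Ordinary dot product."""
    return sum(x * y for x, y in zip(a, b))

def is_eigenvector(M, v, lam, tol=1e-12):
    """Check that M v equals lam * v component by component."""
    return all(abs(mv - lam * x) <= tol for mv, x in zip(matvec(M, v), v))

# Two orthogonal eigenvectors sharing the eigenvalue 2.
u1 = [1.0, 0.0, 0.0]
u2 = [0.0, 1.0, 0.0]

# Linear combinations of u1 and u2 (an arbitrary choice of coefficients).
v1 = [a + b for a, b in zip(u1, u2)]        # u1 + u2
v2 = [a + 2.0 * b for a, b in zip(u1, u2)]  # u1 + 2*u2

print(is_eigenvector(C, v1, 2.0), is_eigenvector(C, v2, 2.0))
print(dot(u1, u2), dot(v1, v2))
```

Both v1 and v2 pass the eigenvector check, yet their dot product is nonzero, whereas u1 and u2 are orthogonal. An orthonormal basis always exists for a symmetric matrix, but not every valid choice of eigenvectors within a repeated eigenvalue's subspace is orthogonal.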

Cluttered

The article seems terribly cluttered. In particular, I dislike the table of symbols. Sboehringer 18:17, 14 December 2006‎

Rows and columns

I think our convention for the Data matrix is probably the wrong way round. At the end of the day, it would probably be more natural if our "principal components" vector was a column vector.

I also think that confusion between the two conventions is one of the things that has been making the article more difficult than it needs to be.

I propose to go ahead and make this change, unless anyone thinks it's a bad idea ? Jheald 16:53, 31 January 2007 (UTC).

Percent variance??

Presumably one wants to compare the sum of the leading eigenvalues to the sum of all eigenvalues. The example of comparing to a threshold of 90% doesn't make much sense otherwise.
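
A minimal sketch of that comparison, assuming the eigenvalues are already available: divide the running sum of the sorted eigenvalues by their total and pick the first k that reaches the threshold. The function name and the eigenvalues are made up for illustration.

```python
from itertools import accumulate

def components_for_threshold(eigenvalues, threshold=0.90):
    """Return (k, cumulative fractions), where k is the smallest number of
    leading eigenvalues whose sum reaches `threshold` of the total."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = [s / total for s in accumulate(ordered)]
    for k, fraction in enumerate(cumulative, start=1):
        if fraction >= threshold:
            return k, cumulative
    return len(ordered), cumulative

# Hypothetical eigenvalues of a covariance matrix (any order).
eigenvalues = [1.5, 5.0, 0.5, 3.0]
k, cumulative = components_for_threshold(eigenvalues, threshold=0.90)
print(k, cumulative)
```

Here the first two components account for 80% of the total and the first three for 95%, so a 90% threshold retains three components.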

Cumulative energy

This term for the contributions of components seems to be from some other field. Is there a better general term for this ? Shyamal 08:25, 12 March 2007 (UTC)

Terminology used

Many users of PCA expect certain terminology such as the decomposition into "loadings" and "scores". The term loading itself is never used in the article and this can be confusing. The following is a mechanical statement for PCA in Matlab.

For a square symmetric matrix A (such as a covariance matrix) we can use eigenvalue decomposition to produce
1) an eigenvector matrix V whose columns are eigenvectors and
2) a diagonal eigenvalue matrix D such that
A*V - V*D = 0
and A = V*D*inv(V) = V*D*V'

Depending on the algorithm, the elements of D may be in ascending, descending or unsorted order, but the elements of D and the columns
of V may be suitably sorted without change in the identities. Matlab, for instance, puts the D values in ascending order in the eig()
function, but descending is often preferred

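If X consists of samples in rows and variables in columns, then
X'X is proportional to the covariance matrix if X is mean centered (dividing by n-1, for n samples, gives the sample covariance; the scaling does not change the eigenvectors).

PCA can be done on the covariance matrix or using X'X even without mean centering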

Cov=X'X

Cov now is a square matrix with the dimension being the number of variables or columns

[V,D]=eig(Cov) in Matlab will now give the Loadings in V

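The scores can be obtained with

Scores = X * V(:, m-k+1:m)

for k components, where m is the number of variables (with the eigenvalues in ascending order, the last k columns of V are the leading loadings)

It can be verified that

X ≈ Scores * Loadings', where Loadings = V(:, m-k+1:m)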

Instead of using the Covariance matrix X'X, one can also compute PCA using just the X matrix. Here the Singular Value Decomposition
algorithm (SVD) may be used. In Matlab

[U,S,V]=svd(X)

The V here is the same as the V (loadings) obtained by eigenvalue decomposition, up to the order and sign of its columns, and the Scores are now equal to U*S

Hope someone can use the above suitably formatted in the article with explanation of the terms scores and loadings. Shyamal 09:15, 12 March 2007 (UTC)
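
The link between the eigendecomposition route and the SVD route described above can be written out as a short derivation. This is a sketch under the assumption that X is mean-centred with samples in rows; T for the score matrix and the subscript k for "first k columns" are notation introduced here, not taken from the comment.

```latex
% X: mean-centred data matrix, n samples (rows) by m variables (columns).
X = U S V^{\mathsf T}
\quad\Longrightarrow\quad
X^{\mathsf T} X = V S^{\mathsf T} U^{\mathsf T} U S V^{\mathsf T}
                = V \,(S^{\mathsf T} S)\, V^{\mathsf T}

% So the columns of V are eigenvectors of X^T X with eigenvalues s_i^2,
% and the scores follow directly from the SVD factors:
T = X V = U S V^{\mathsf T} V = U S

% Keeping only the k leading columns gives the rank-k approximation:
X \approx T_k V_k^{\mathsf T}
```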

I'm starting to wonder if this terminology is common in some fields and not so in others. Usually when I see PCA mentioned in computer science (not that I see it all that much there), people talk about it using more obvious terms, such as eigenvectors, basis vectors, component vectors, etc. Someone added a reference for "results are usually discussed in terms of component scores and loadings" (some stats for environmental sciences book), though. Maybe these are more common in the softer sciences?

The terminology is also rather opaque, so explaining its origins would also be good. (what is being loaded and where? is "score" used because some people normalize PCA results often enough to almost consider normalization as part of the PCA itself, and score then comes from standard score, or something?). -- Coffee2theorems (talk) 09:35, 13 February 2009 (UTC)

Merge POD and PCA

These seem to be just different terms used in different circles/applications for the same thing. Jmath666 01:53, 16 March 2007 (UTC)

Agree. Shyamal 08:48, 16 March 2007 (UTC)

Agree. Algorithms 16:57, 7 June 2007 (UTC)

Disagree. - I would find it very confusing. Though I wouldn't object the other way around if POD is really the same thing. --MatthewKarlsen 17:21, 16 July 2007 (UTC)

Request information on how to choose how many components to retain

Could someone include information on the proper method for choosing the number of components to retain? I've done some searching and haven't found any 'rules'. Part of my interest is that in MBH98 they retain only the first PC, but they apparently did incorrect centering. If it is correctly centered then a similar result is achieved if including the first 4 PCs. http://en.wikipedia.org/wiki/Hockey_stick_controversy It might be useful for individuals familiar with PCA to add some details to the above link.

LetterRip 10:49, 22 March 2007 (UTC)

In the past, we usually used PCA to try to represent 95% of the variability involved. I was involved in software metrics for a while and we'd collect a number of metrics that measured similar, but not quite the same, items. Using PCA, we could reduce the measures from 20ish to maybe 4 and account for 95% or more of the variability. This way we could have a 95% confidence saying things like "modules with this level of complexity have a far higher rate of bugs than modules with that level of complexity." Tangurena 04:15, 18 September 2007 (UTC)

Of course after I post the question I find a good reference :)

"Component retention in principal component analysis with application to cDNA microarray data"

"Many methods, both heuristic and statistically based, have been proposed to determine the number k, that is, the number of "meaningful" components. Some methods can be easily computed while others are computationally intensive. Methods include (among others): the broken stick model, the Kaiser-Guttman test, Log-Eigenvalue (LEV) diagram, Velicer's Partial Correlation Procedure, Cattell's SCREE test, cross-validation, bootstrapping techniques, cumulative percentage of total of variance, and Bartlett's test for equality of eigenvalues. For a description of these and other methods see [[7], Section 2.8] and [[9], Section 6.1]. For convenience, a brief overview of the techniques considered in this paper is given in the appendices.

Most techniques either suffer from an inherent subjectivity or have a tendency to under estimate or over estimate the true dimension of the data [20]. Ferré [21] concludes that there is no ideal solution to the problem of dimensionality in a PCA, while Jolliffe [9] notes "... it remains true that attempts to construct rules having more sound statistical foundations seem, at present, to offer little advantage over simpler rules in most circumstances." A comparison of the accuracy of certain methods based on real and simulated data can be found in [20-24]."

http://www.biology-direct.com/content/2/1/2

LetterRip 11:11, 22 March 2007 (UTC)

Question concerning subsection "Convert the source data to z-scores"

Is it correct to transform normalized source data using a PCA which is based on the covariance matrix? Would it not be necessary to use a PCA based on correlation matrix instead (which corresponds to the covariance matrix of normalized source data)?

The covariance matrix based on z scores is the correlation matrix. Technically, one might need to worry about whether the covariance matrix is the empirical moments or "unbiased estimators", which differ by a factor of n/(n-1). There is a related chapter in Jolliffe. —Preceding unsigned comment added by Dfarrar (talk • contribs) 14:16, 6 March 2008 (UTC)
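
Both statements in the reply above can be checked on a toy example. The sketch below uses made-up data and hypothetical helper names: the covariance of z-scores equals the correlation when the same n-1 divisor is used throughout, and mixing a population standard deviation with an n-1 covariance inflates the diagonal by n/(n-1).

```python
from statistics import mean, pstdev, stdev

def covariance(x, y):
    """Sample covariance with the n-1 divisor."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def zscores(x, sd=stdev):
    """Centre x and scale by a standard deviation (sample sd by default)."""
    m, s = mean(x), sd(x)
    return [(a - m) / s for a in x]

def correlation(x, y):
    """Pearson correlation coefficient."""
    return covariance(x, y) / (stdev(x) * stdev(y))

# Made-up measurements of two variables.
x = [2.0, 4.0, 6.0, 9.0]
y = [1.0, 3.0, 2.0, 7.0]

cov_of_z = covariance(zscores(x), zscores(y))   # off-diagonal entry
var_of_z = covariance(zscores(x), zscores(x))   # diagonal entry
r = correlation(x, y)

# Mixing conventions: z-scores from the population sd, covariance with n-1.
zx_pop = zscores(x, sd=pstdev)
var_mixed = covariance(zx_pop, zx_pop)          # n/(n-1) instead of 1

print(cov_of_z, r, var_of_z, var_mixed)
```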

What is h in the z-scores section? —Preceding unsigned comment added by 216.184.13.6 (talk) 17:45, 30 September 2007 (UTC)

Not necessarily orthogonal

I removed the bit about 'assumption that the principal components are orthogonal' I believe that two things got mixed up:

If the noise is not white, principal components are not orthogonal, such that PCA is not optimal, or canonical, or anything. If the distribution of the noise is known, you may apply a linear transform to whiten the noise or (equivalently) apply a Generalized SVD or Restricted SVD.

I believe ICA applies to nonnormal variables. If the noise is jointly normally distributed, then a covariance of zero implies independence (See http://en.wikipedia.org/wiki/Normally_distributed_and_uncorrelated_does_not_imply_independent). Even though PCA is not optimal or canonical or anything, I believe this means that PCA does find uncorrelated, and hence independent variables in this case. Hence, I think it should be concluded that PCA is a form of ICA for jointly normal variables. —Preceding unsigned comment added by 130.89.67.57 (talk) 14:41, 17 March 2008 (UTC)

Fixed basis?

What does this sentence from the Details section mean: "Unlike other linear transforms, PCA does not have a fixed set of basis vectors. Its basis vectors depend on the data set." A linear transformation does not possess a basis at all. Does it mean that there is no standard choice of basis, with respect to which to compute the coefficients (matrix) of the linear transformation? 137.22.3.172 (talk) 18:46, 20 March 2008 (UTC)

Good point. I removed that sentence. —Ben FrantzDale (talk) 00:36, 21 March 2008 (UTC)

Presumably what was meant was that PCA cannot be represented by a particular fixed matrix operator.

If it did have a fixed matrix operator, eg like a Fourier transform, you could use SVD to identify a particular characteristic set of "input" basis directions, a set of "output" basis directions, and a corresponding set of scalings.

But a PCA is not that kind of a transformation. (It is not linear in the data). Jheald (talk) 09:09, 21 March 2008 (UTC)

But PCA might be considered as approximately linear in the data if the sample size is thought large enough for the covariance matrix to be essentially fixed, and if the effects on only a limited number of data points are considered. Melcombe (talk) 17:43, 22 April 2008 (UTC)

Split for readability?

Would it be worth putting the algorithm in a separate article, and maintaining this article as a discussion of PCA theory? 134.225.217.52 (talk) —Preceding comment was added at 02:33, 25 April 2008 (UTC)

You mean the section "Computing PCA using the Covariance Method". I'm not sure if it should be in Wikipedia at all (at least in its current form). It's rather cookbookish and Wikipedia is not a cookbook. It is also absurdly long for the little content in it. For instance, it should take at most a couple of lines to describe computation of the covariance matrix, not a whole screenful with three headings. Similarly for postprocessing. The only involved part, the eigendecomposition, tells the reader to use ready-made software. Maybe this should really be a short code listing that works in octave/matlab or numpy (at least it would be far clearer and shorter, and the actual mathematics is adequately explained elsewhere in the article). If it can't be shortened to less than a screenful, then I guess it'd be better off in its own article (or in an open source project somewhere), instead of keeping this article unreadable.

Incidentally, the section "Table of symbols and abbreviations" is also an eyesore. If an encyclopedia article requires a full-screen table describing the notation used in the article, something is very badly wrong. -- Coffee2theorems (talk) 10:38, 13 February 2009 (UTC)

Agree that the article needs to be split into two: one for the technical/mathematical details of how it's calculated and another for the more practical/applied aspects. Tayste (edits) 21:05, 24 August 2011 (UTC)

Diagram

To whoever made this request, what kind of diagram do you want? --pfctdayelise (talk) 17:04, 2 August 2008 (UTC)

Karhunen-Loève transform?

The Karhunen-Loève transform is referred to several times, in a way that implies the reader knows what it is. It is not hyperlinked, and indeed the Wiki topic for Karhunen-Loève transform redirects to this page. It makes those areas very unclear. For example:

The Karhunen-Loève transform is therefore equivalent to finding the singular value decomposition of the data matrix X,...

Personally I think it deserves its own short Wiki page. I still don't know what it is. From the definition-like equation in the subsection Project the z-scores of the data onto the new basis, it sounds like KLT(X) is defined as being the projection of the data X onto the PCA basis obtained by thresholding cumulative variance at 90%. Apparently if I threshold at 91%, I'm no longer doing the KLT? Or should it be parameterized, e.g., KLT(X,90%)? Also, I take it from the article that the z-score conversion/normalization is also required to qualify as a KLT?

Other comments while I'm at it:

The Discussion section is in pretty horrible shape. It reads in a very disconnected, jumpy fashion. It tries to pursue a derivation of sorts, but lacks any description of what is being shown, and then jumps unexpectedly into connections with other topic areas. All the connections to other areas (EOFs, ANNs, LDA) should be separated from the derivation, and made clear that the purpose of the section is to identify connections to other topics.

In the subsection Compute the covariance matrix, shouldn't the equation be the sample covariance, i.e.:

\mathbf{C} = \dots = \frac{1}{N-1} \mathbf{B} \cdot \mathbf{B}^{*}

Why the subtitle: Compute the cumulative energy content for each eigenvector? What is with the use of the term energy? Why not variance and cumulative variance? —Preceding unsigned comment added by 98.207.54.162 (talk) 22:52, 8 January 2009 (UTC)

Requested move

Regarding the above attempted counter-examples, "prices" and "sales" are things being analysed; Principal Component Analysis is the name of a method. The naming of statistical methods is very frequently in the singular like this. Surely even Dr Hardy would not refer to the method of "Factors Analysis (sic)"? Ged.R (talk) 19:52, 18 March 2009 (UTC)

Moved to new section. See also Talk:Principal components analysis#Requested move 2004 above, for earlier discussion on this topic. 199.125.109.126 (talk) 20:58, 18 March 2009 (UTC)

As the move was done in March, I have removed the move-request tag. Melcombe (talk) 09:14, 23 April 2009 (UTC)

Regularized PCA

Several authors discuss regularized principal component analysis, or regularized singular value decomposition (which presumably could help with regularized PCA). Does anyone have something they could put in this article: theory, computation, refs?

dfrankow (talk) 19:19, 23 January 2009 (UTC)

Section on "Computing Principal Components with Expectation Maximization"

I believe what is discussed in this section is more commonly referred to as the "power method" for calculating eigenvectors. This is what it is called in Golub and van Loan and other standard numerical linear algebra books. There may be some connection to the EM algorithm, but it is not apparent to me. In any case, I don't think this is an advisable way to carry out PCA. There are a number of good algorithms for computing the SVD, including Lanczos-type methods if only a few of the PC's are needed. However I don't think this page should get into SVD algorithms at all, focusing instead on how PCA is motivated, how it is used, strengths/limitations, and alternative/related techniques. Skbkekas (talk) 03:40, 11 May 2009 (UTC)

I agree that the caption is a bit hyped. The algorithm shown is however more likely to be NIPALS since it seems to be working on X rather than X'X which the power method uses. No reliable reference for this, but the difference is mentioned here. And yes I think the whole slew of methods should be just listed out and exact SVD and Eigenvalue computation algorithms moved to the SVD and Eigenvalue articles. Shyamal (talk) 05:20, 11 May 2009 (UTC)

Thanks for pointing this out, I wasn't aware of the distinction. Skbkekas (talk) 14:17, 12 May 2009 (UTC)
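
For reference, the power method mentioned in this thread can be sketched in a few lines. This is an illustration only, not the algorithm from the article section under discussion: the small symmetric matrix stands in for X'X, and the function name, starting vector and iteration count are arbitrary assumptions.

```python
import math

def power_iteration(A, iterations=200):
    """Approximate the leading eigenvalue and eigenvector of a symmetric
    matrix A by repeated multiplication and normalisation."""
    n = len(A)
    # Starting vector: must have a nonzero component along the leading eigenvector.
    v = [1.0] + [0.0] * (n - 1)
    for _ in range(iterations):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v'Av (v has unit length) estimates the eigenvalue.
    Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(x * y for x, y in zip(v, Av))
    return eigenvalue, v

# Stand-in for X'X: symmetric, with eigenvalues 3 and 1.
A = [[2.0, 1.0],
     [1.0, 2.0]]

eigenvalue, vector = power_iteration(A)
print(eigenvalue, vector)
```

The iteration converges at a rate set by the ratio of the two largest eigenvalues, which is one reason library SVD or Lanczos-type routines are usually preferred in practice.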

Can you have negative weightings (even if unphysical)?

Is it automatically unphysical to have a PCA reconstruction that has some stations negatively weighted? Would think that it could occur for both degeneracy and anticorrelation with the average (actual physical effects). Of course the summation must be positive, but is it automatically wrong if some of the stations have negative weights?

This is being debated on these blog threads. Unfortunately, the debate has muddled particular examination of the Stieg Antarctic PCA-based recon with general absolute claims that negative weightings are bad, bad, bad.

Could you please adjudicate?

See here:

http://noconsensus.wordpress.com/2009/06/07/antarctic-warming-the-final-straw/#comment-6727

http://noconsensus.wordpress.com/2009/06/09/tired-and-wrong-again/#comment-6726

http://wattsupwiththat.com/2009/06/10/quote-of-the-week-9-negative-thermometers/#more-8362

http://www.climate-skeptic.com/2009/06/forgetting-about-physical-reality.html

Also, if you can point to an academic expert who could answer this question, (negative weightings allowed?) would appreciate it.

P.s. Wiki police: this might have relevance to the article. —Preceding unsigned comment added by 69.250.46.136 (talk) 16:54, 19 June 2009 (UTC)

Negative weightings are not only allowed, they are routine.

For example, the most significant axis might be the difference between two stations.

Remember, the PCA is identifying signals that are correlated *deviations from the average*. (You have to centre the data first, by subtracting off the average for each station). Now if one of the most systematic deviations is that whenever station A reports high, then station B reports low, and vice-versa, then (Station A - Station B) could well be the kind of signal that PCA would extract. Jheald (talk) 17:02, 19 June 2009 (UTC)
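
A toy illustration of that situation, with made-up readings for two anti-correlated stations (none of this is real station data, and the function name is hypothetical). For two variables the leading eigenvector of the covariance matrix has a closed form, so no linear algebra library is needed.

```python
import math

def leading_direction_2x2(a, b):
    """Unit eigenvector for the largest eigenvalue of the 2x2 sample
    covariance matrix of series a and b (data are centred first)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    s_aa = sum(x * x for x in da) / (n - 1)
    s_bb = sum(y * y for y in db) / (n - 1)
    s_ab = sum(x * y for x, y in zip(da, db)) / (n - 1)
    # Angle of the major axis of a symmetric 2x2 matrix [[s_aa, s_ab], [s_ab, s_bb]].
    theta = 0.5 * math.atan2(2.0 * s_ab, s_aa - s_bb)
    return math.cos(theta), math.sin(theta)

# Hypothetical readings: station B tends to be low whenever station A is high.
station_a = [1.0, -1.0, 2.0, -2.0, 0.5, -0.5]
station_b = [-0.8, 1.1, -2.1, 1.9, -0.4, 0.3]

weight_a, weight_b = leading_direction_2x2(station_a, station_b)
print(weight_a, weight_b)
```

The two weights come out with opposite signs, roughly proportional to (Station A - Station B). The overall sign of an eigenvector is arbitrary, so only the relative signs carry meaning.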

Appreciate a little more detail (citations and/or more looking at the actual details of the debate here, which is more than just a PCA, but use of the factors afterwards). I like that you agree with me, but think I need more. BTW, we can take this conversation to your talk page if the wiki police feel this is too forum-like. 69.250.46.136 (talk) 17:35, 19 June 2009 (UTC)

Okay, so having looked at the blog posts, I think the thing to remember is that PCA isn't setting out to produce an "average" temperature. Instead, what PC1 represents is the direction of strongest correlation in the data. So if it systematically occurs that whenever there are higher temperatures than average at Station A, that there are lower temperatures than average at station B, then if station A has got a positive weight, station B will get a negative weight.

So the PCA-weighted sum isn't giving you an "average" temperature. Rather, it's trying to give you an indication of the strength of the deviation in the correlation direction specified by PC1. So a lower temperature than usual, measured at a station which tends to record unusually low temperatures when other stations record unusually high temperatures, will add to the strength being indicated for the effect.

Which seems to be what's happened here. The mistake then is to interpret the PCA-weighted sum as an averaged temperature, rather than an indication of the strength of a particular trend, which may have +ve effects in some places, -ve effects in others.

It should be remembered, however, that PCA is quite a rough and ready tool. In particular it can mix together the effects of different causes. So for example, the strongest driver correlating most stations together might be the long term secular trend. But for some pairs of stations the most significant correlation might be their response to the effects of say short term storm-fronts. The #1 PCA vector just records the direction of strongest correlation overall, which can mix together the responses to different causes. Other, more sophisticated techniques, for example Independent Component Analysis (ICA) try to produce vector directions which maximally separate out the effects of different causes. It might be of interest to run the raw station data through ICA to see how different the vectors are from those produced by PCA. Looking at the plot of the PCA station weights, it's clear that for the most part the whole continent is moving in step, though with the temperatures in some parts apparently systematically moving faster than others. The "rogue" stations may be actually ones where the temperature reacts the other way to the same overall stimulus; or they may be anti-correlated for other reasons. You'd probably need to have a look at the detailed data (or know about the typical weather patterns of those parts) to say for sure.

Finally, ridge regression doesn't really make any difference to the overall picture. The idea is that some of the variation in PC1 is just "noise" power. You can get an estimate by computing the power (ie average mean-square variation) for each of the subsequent principal component directions. The idea is that the later Principal Component weightings become increasingly random and geographically meaningless, so the degree of power in the component from a real geophysical cause falls off as the PC number goes up, until the later PCs in effect only reflect power due to noise. You can therefore use them to estimate the "noise floor" -- ie the background noise power in each of the components; and then use maths in effect identical to a Wiener filter to scale back each of the PCs by an amount appropriate to remove the amount of noise and leave only the amount of genuine signal.

Anyhow, I hope that helps a little bit. The main point is that the PCA-weighted sum is being used to estimate the size of a signal, (without a-priori knowing what that signal is or means), and not an 'average' temperature. Jheald (talk) 01:38, 20 June 2009 (UTC)

Cool. So, though. Am I right, that we could have theoretically recons with some stations negative weighted. Or the other guys who jump up and down and make pictures of negative thermometers and tell me it is shit, shit, shit, to have a station negative weighted. That it is wrong in all situations.

Oh, and are you a stats expert? Can you come down and just rule on this by way of credentials or something? 69.250.46.136 (talk) 02:48, 20 June 2009 (UTC)

Sorry if my last comment was a bit rambling. It was knocking on 3am here local time, and I really should have been asleep! And no, I'm not going to put credentials on the line for you.

As to who is right, the answer is somewhere in the middle. As I tried to clarify last night, PCA is trying to pull out a signal from the data, by seeing what measured deviations from the average systematically correlate together.

And it has done this, quite successfully. It thinks there is a signal in the multi-station temperature data, expressed particularly strongly in one category of stations, less strongly in another. ((i.e. one area is warming up more quickly, and PCA has noticed this))

But the potential problem comes if we then make a statement something like, "the clearest signal which comes out of the data (according to PCA) is a secular long-term warming trend in the typical temperatures".

This is pretty much true -- the signal is pretty much going up in a straight line with time, so it is appropriate to identify it as relating to a "secular long-term trend". And the strong majority of stations are weighted with the same sign -- so for most of them a positive value of the signal does represent warming.

But it's not quite the whole story, because for some of the stations a positive value of the signal represents cooling. Now, there are two suggestions I've made as to why this might have been found: (i) these stations may actually have seen a long-term temperature fall, which correlates to the temperature rise elsewhere, and has some shared physical driver; or (ii) the PCA is telling us that for these stations, the most important correlation with their neighbours is not the long term picture at all, but (say) a marked anti-correlation in deviations in the short term.

Either way, that therefore makes it too simple to talk about the signal being plotted up as time just as an "average temperature" -- the combination, with its negative weights for some stations, is more complex than that; and PCA wasn't being told to find an average temperature.

But it may be true that a common average long-term temperature trend (albeit going up faster in some places than others) is, for at least most of the stations, the most powerful correlation between different stations that PCA has found. Jheald (talk) 07:25, 20 June 2009 (UTC)

(Sorry if that may be a bit more nuanced than you were looking for!) Jheald (talk) 07:25, 20 June 2009 (UTC)

But my question is if it is just wrong, like violating a law of physics if a recon has some negative weights. I'm NOT asking if Steig is a good recon. But just this basic and GENERAL question. If you say "in the middle", aren't you really saying it's WRONG to make the general statement "no negative thermometers"? 69.250.46.136 (talk) 14:03, 20 June 2009 (UTC)

What do you mean by "recon"? My understanding is that what Steig has produced is not what I would think of as a recon, but perhaps you can clarify what the word means. Jheald (talk) 18:27, 20 June 2009 (UTC)

(Unindenting) He is doing a reconstruction of Antarctic temperature. He uses the correlation of satellite areas (spanning the entire continent) to fixed stations, during the period of overlap (post 1982) to figure out predictors for the interior surfaces (based on the stations) during the period 1957-1982. This allows him to reconstruct what the temp was in those interior areas (and he posits that this is more accurate than doing simple distance weighting because it captures patterns and the like). Here is a link to a description of the algorithm (sorry, by a denialist, but still it's a helpful explanation). http://noconsensus.wordpress.com/2009/04/06/updated-flow-chart-for-antarctic-paper/ 69.250.46.136 (talk) 21:57, 20 June 2009 (UTC)

I don't know who is writing this but I believe the meaning of 'negative weights' has been confused. In the antarctic reconstruction being discussed at noconsensus above, the negative weights are unrelated to PCA. They are the final result of the reconstruction where 34 thermometer signals are added together to create a weighted average. This is different from the eigenvalue weights used to reconstruct the PCA. Therefore: The noconsensus links above relate to the weighting of a group of thermometers rather than the weighting of PCA. I am an engineer and I can say to a great deal of certainty that an upside-down thermometer signal in a weighted average is not good. JeffId1 (talk) 12:41, 21 June 2009 (UTC)Jeff Id

You're quite right. My apologies. I hadn't looked closely enough at the blog posts to see what the argument was about. This is a controversy about negative weights appearing in a quadrature, completely unrelated to positive or negative weights being in the PCA.

On the other hand, negative weights in quadrature may not be quite so beyond the pale as you seem to be propounding.

Consider a straight line, y = m x + c, and suppose we want the area under it between x=0 and x=5, but we only have measurements of y at x=0 and x=1.

Straightforwardly, by integration, the area A = (25/2)m + 5c.

Now putting in our measurements, we find

A = (25/2)(y₁ - y₀) + 5y₀

= (25/2) y₁ - (15/2) y₀

So the presence of a negative weight in a quadrature estimate is not necessarily a red stop light; at most, it's a warning light that the result may be quite dependent on the accuracy of some extrapolation. Jheald (talk) 09:02, 22 June 2009 (UTC)
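
The arithmetic in the example above can be verified exactly. In this sketch the slope and intercept are arbitrary values chosen for the check, and the function names are hypothetical.

```python
from fractions import Fraction

def exact_area(m, c, upper=5):
    """Integral of y = m*x + c from x = 0 to x = upper."""
    return Fraction(m) * upper ** 2 / 2 + Fraction(c) * upper

def quadrature_estimate(y0, y1):
    """Area on [0, 5] from measurements at x = 0 and x = 1 only,
    using the weights derived above: +25/2 on y1 and -15/2 on y0."""
    return Fraction(25, 2) * y1 - Fraction(15, 2) * y0

# Arbitrary line y = 3x + 2, measured at x = 0 and x = 1.
m, c = 3, 2
y0 = c          # value at x = 0
y1 = m + c      # value at x = 1

print(exact_area(m, c), quadrature_estimate(y0, y1))
```

The two values agree for any straight line, even though the measurement at x = 0 carries a negative weight. The size of the weights relative to the interval also shows how strongly the result leans on extrapolating from x = 1 out to x = 5.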

I'm unfamiliar with the reply syntax of wiki - so I apologize for that.

The problem as I've pointed out on my thread to TCO is not in the fact that a negative value is not mathematically explanatory. In fact I fully believe we get a better regression fit. The problem is in that we are talking about temperature data, and in a weighted average of thermometer data, the negative thermometer is a non-physical value. This likely results in improper overweighting of similar positive thermometers and improper distribution of trend in a location critical reconstruction.

I'm actually sorry that TCO has taken your time with this as it is off topic for the thread it started on.

To make a simple point, suppose you have three thermometers recording temperature in your back yard for 50 years. How would you determine the average temperature over 50 years? If someone told you that one of the thermometers' entire record should be flipped and arbitrarily (without explanation) given a negative sign, would you agree, or would you question the rationale for the negative sign? Consider that there are only 34 thermometers in the entire Antarctic during this reconstruction, and under 20 lead back to the beginning, and also consider that they are thermometers and fairly good at recording temperature.

Would you consider flipping 5 of 34 to determine temp reasonable?

This is an amazingly simple slam dunk issue which has been deliberately expanded by a truly famous troll (thousands of people know TCO). He went searching for experts to find who would take the bait contacting at least a few different people. I'm sorry again that it wasted your time. - -Jeff

Jeff: Let the guy engage, Jeff. It's a privilege to have his attention. Don't try to cut him off. You can still speak your piece on the content. I welcome that. As I think it will help the expert engage and correct you. Oh and ixnay on teh OTC-ay abel-ay. I am either blocked for several months of wieght loss (according to one admin) or perma-blocked according to another. So...be weeery quiet with the naming.

Jheald: I worked through your example. Thanks. I understand that negative weighting in the recon is not axiomatically wrong, despite "thermometers" indicating positive above absolute zero. I'm not familiar with quadrature as a term. Have not seen people using that term in climate science papers or blog posts. 72.82.44.253 (talk) 15:31, 24 June 2009 (UTC)

Starting with Correlation vs. Covariance

Though you can normalize a covariance matrix, it seems easier to just start with a correlation matrix instead. This is discussed in the PCA book by George Dunteman (and likely others)

To recap, if your data contains values that are on radically different scales, the covariance has to be carefully corrected, so that one number doesn't swamp the other simply because it's naturally larger.

A quick example: Suppose you're dealing with national financial statistics. One of your columns represents interest rates, measured in fractions of a percent, so let's say it varies between 2 samples by 1/4 of a percent, 0.0025, or 2.5e-3. But let's say you've also got data points measuring imports and exports in dollars, and let's say the same two samples differ by $25 billion, or 2.5e10. The native scales of these variables happen to differ by 13 orders of magnitude. And we'll assume you've got 20 other variables at different scales, perhaps some other numbers in the millions and billions, so this difference isn't obvious at first glance.

As I understand it, starting with covariance is dangerous. Without special care, even a change of interest rate of 10% (quite dramatic) would be overwhelmed by the other numbers in the millions and billions.
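A minimal sketch of the effect described above, using synthetic data. The two columns, their means and spreads, and the helper `first_pc` are made up for illustration; only the difference in scale (roughly 2.5e-3 versus 2.5e10) follows the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
rate = rng.normal(0.03, 0.0025, size=n)   # interest rate, as a fraction
trade = rng.normal(5e11, 2.5e10, size=n)  # imports/exports, in dollars
X = np.column_stack([rate, trade])


def first_pc(M):
    """Eigenvector of the symmetric matrix M with the largest eigenvalue."""
    vals, vecs = np.linalg.eigh(M)  # eigh returns eigenvalues in ascending order
    return vecs[:, -1]


pc_cov = first_pc(np.cov(X, rowvar=False))        # PCA on the covariance matrix
pc_corr = first_pc(np.corrcoef(X, rowvar=False))  # PCA on the correlation matrix

print("covariance PC1: ", np.abs(pc_cov))
print("correlation PC1:", np.abs(pc_corr))
```

With the covariance matrix, the first component points almost entirely along the dollar-valued column. With the correlation matrix, both variables enter with equal magnitude, which is always the case for two standardized variables.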

The article DOES talk about normalization, but I didn't notice this issue pointed out by that name. And perhaps one of those formulas is equivalent to converting covariance to correlation? If so, maybe point that out? I do see the mention of the R matrix in the table of terms.

Ttennebkram (talk) 18:06, 26 July 2009 (UTC)

Derivation of PCA using the covariance method

This section was edited 2011-02-23. See the old version versus the new version.

I think the new version has some improvements; it's much shorter, for one. Also, the transposing of P makes things simpler and neater. However, I also think it's too dense and uses unnecessary terminology (and in particular, mixes terminology unnecessarily), something the old version did too, but there it was less problematic since the formulae were more explicit. One example is the usage of unitary vs. orthonormal transformation matrix.

I think it might be clearer to reduce the usage of terminology (use orthogonal matrix rather than unitary+orthonormal transformation matrix), and to reintroduce the explicit diagonalization rather than just mention it. AFAIK the WLOG assumption that mean(X) = 0 isn't necessary, or am I missing something?

Thoughts?

Eamon Nerbonne (talk) 10:06, 4 April 2011 (UTC)

"Software/source code" woes

The "software/source code" section seems unwieldy and unhelpful, and is riddled with unhelpful external links. Any stats package worth its salt will have support for principal component analysis; do we really need all these disorganized links? 188.74.104.106 removed the external links tag, but I'm reinstating it because this section leaves a lot to be desired in that respect. Thoughts? Statisfactions (talk) 00:55, 28 November 2011 (UTC)

Looks like it could be trimmed down a bit. Just we should be careful not to be too draconian. Some of the links are quite useful/illustrative/unique. Kevin Baas^talk 14:23, 30 November 2011 (UTC)

Agree. I think it's a very good article. The subject is non-trivial. Perfection is unattainable. Dratman (talk) 02:43, 31 December 2011 (UTC)

Mistake in details section

The Details section is inconsistent. It states that X has n rows and m columns, i.e. X is an n x m matrix. It then goes on to say that $\mathbf {X} =\mathbf {W\Sigma V} ^{\top }$ , where W is m x m, Sigma is m x n and V is n x n, which would make X an m x n matrix. I think the latter is the correct form as opposed to the former. It is also consistent with the Table of Symbols and Abbreviations, and with the reference http://www.snl.salk.edu/~shlens/pca.pdf. Can someone confirm this? — Preceding unsigned comment added by 131.111.185.68 (talk • contribs) 11:50, 15 April 2011 ‎

It says that X^T is n-by-m, which makes X an m-by-n matrix, so there's no mistake. -- 195.178.200.68 (talk) 09:44, 16 April 2011 (UTC)

I agree that this part is formally correct. But I strongly recommend pointing out in the text that X^T is defined, not X. I also overlooked that, and it took me 2 hours (and a look at this discussion page) to understand what I got wrong. If you read X instead of X^T, everything becomes inconsistent with other literature.

I just added a note on the topic. — Preceding unsigned comment added by Ga29sic (talk • contribs) 14:13, 4 December 2011 (UTC)

Mistake in Find a covariance section

How can we start with the population covariance, i.e. the mathematical expectation operator, and then move to the empirical measure, i.e. the sample mean, in the same line? That does not look right to me. — Preceding unsigned comment added by 78.145.18.55 (talk • contribs) 15:33, 13 February 2012

This is indeed wrong and I will change this now. We cannot move from one to the other without invoking asymptotic theorems. — Preceding unsigned comment added by 192.193.116.137 (talk • contribs) 11:21, 7 March 2013

Wrong Indices at X?

At the beginning it is written that X^T has n rows, but later in the table that X has N rows. I think that the later description is correct. — Preceding unsigned comment added by 193.174.63.68 (talk) 13:07, 12 July 2012 (UTC)

Correct terminology of PCA

The correct term is principal component analysis, not principle component analysis. The dictionary definition of “principal” is “first in order of importance”. PCA indicates that "large variances have important meaning".

--Sangdon Lee (talk) 19:47, 6 June 2013 (UTC)

Image in lead

The caption of the image in the lead states: "...with a standard deviation of 3 in roughly the (0.878, 0.478) direction and of 1 in the orthogonal direction". I'm not sure how to understand the "(0.878, 0.478) direction" bit. Does this refer to the arrow pointing to the upper right of the graph? If so, could we make it clearer? Regards. Gaba ^(talk) 01:39, 16 June 2013 (UTC)

"Details" re-written

I have re-written the "Details" section to (I hope) be more accessible (diff). It now takes, I hope, more of a constructive approach, starting with the preservation of variance on a vector-by-vector basis, then the eigendecomposition of X^TX, and only then the SVD (which is likely to be less familiar to many of our readers).

It's possibly a little long-winded, and people might like to streamline it, but I think this is probably a better sequence than what was there before.

Material that wasn't directly duplicated I have moved down to a "Further considerations" section, which still needs some fairly heavy editing (as does much of the rest of the article) but I wanted to put the first section into place first, to give people a chance to say what they think of it. Jheald (talk) 14:12, 19 June 2013 (UTC)

PCA in simple explanation

PCA reduces/projects/transforms/decomposes/regularizes a matrix X (m x n, m samples by n variables) into a thinner matrix T (m x f, m samples by f principal components, f << n) by defining new variables called principal components (PC), which are linear combinations of the n variables (i.e., weighted sums of the n variables). In other words, X ≈ T_f. The PC is also called latent structure or loading in statistics and variation mode in engineering (e.g., structural dynamics). The weights are the eigenvectors of X'X (i.e., X'X can be scaled to be the same as a covariance or correlation matrix). ( ' ) stands for matrix transpose. The number of PCs (i.e., f) is customarily selected to account for more than 80% of the total variance in statistics. Note that the n variables are reduced to f but not the m samples.

PC1 = W11*X1+W12*X2+W13*X3+…..…+W1n*Xn.
In other words, PC1 is a linear combination of variables in X.
Let’s suppose that (1) the exam scores of 4 classes for 1000 high school students (Math, Science, History & English, 1000 x 4) are collected and (2) the following is the PCA outcome with the eigenvalues of 2.5, 1.3, 0.15, and 0.05:
PC1 = 0.7*Math score + 0.7*Science score + 0.6*History score + 0.5*English score,
PC2 = 0.7*Math score + 0.6*Science score - 0.6*History score - 0.7*English score,

Note that the sum of the eigenvalues is always equal to the number of variables (i.e., 4) if the original variables are standardized. The first and the second PC (PC1 and PC2) account for 62.5% and 32.5% of the total variance (i.e., 2.5 divided by 4, 1.3 divided by 4), respectively. In total, the first two PCs explain 95% of the total variance in X. Thus the 1000 x 4 matrix can be reduced to a 1000 x 2 matrix without much loss of information. PC1 can be interpreted as the overall score (i.e., the coefficients (or eigenvectors) are of similar magnitude with the same signs) and PC2 shows the difference between analytical and non-analytical scores (i.e., the coefficients show opposite signs). Eigenvector entries with the same (or opposite) signs indicate that the corresponding variables move in the same (or opposite) direction.

Some weights are large and others are small, so they show how each variable is “loaded” in defining the PC (thus the word “loading”). Thus, the original variables with higher weights provide a clue to interpret the meaning of each PC. The weights are called eigenvectors of X'X or the singular vectors of X. The value of a PC is called the PC score.
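The percentages quoted for the exam-score example can be reproduced directly from the four eigenvalues. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
# Eigenvalues from the exam-score example (four standardized variables).
eigenvalues = [2.5, 1.3, 0.15, 0.05]

total = sum(eigenvalues)  # equals the number of variables when standardized
ratios = [ev / total for ev in eigenvalues]

# Running total of explained variance, one entry per principal component.
cumulative = []
running = 0.0
for r in ratios:
    running += r
    cumulative.append(running)

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {r:.1%} of variance, {c:.1%} cumulative")
```

The first two components together reach 95%, which is why a 1000 x 2 score matrix is taken as an adequate summary in the example.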

PCA in one sentence:

(1) PCA is the same as SVD (Singular Value Decomposition) of X ;

X = U * S * V' = T * V'

SVD indicates that a matrix X is decomposed into three matrices: the first matrix shows the similarity of samples, the second matrix shows the degree of similarity (i.e., correlation) which is scaled to be equal to the number of variables, and the third matrix shows the similarity of variables.

X = similarity of samples * degree of similarity * similarity of variables = PC scores * eigenvectors'
( T = U * S, U' * U = I, V' * V = I)

(2) PCA is the same as EVD (Eigenvalue Decomposition) of X'X ;

X'X = (U*S*V')' * U*S*V'= V *S² * V',
( S² = eigenvalues = square of singular values)
Note that the squares of singular values are equal to the eigenvalues.
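The two statements above (PCA as the SVD of X and as the EVD of X'X) can be compared numerically. The following is a sketch on random, mean-centered data; the data and variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
X = X - X.mean(axis=0)  # mean-center each column

# (1) SVD of X: X = U * S * V'
U, s, Vt = np.linalg.svd(X, full_matrices=False)
T = U * s  # PC scores, T = U * S

# (2) EVD of X'X: X'X = V * S^2 * V'
evals, evecs = np.linalg.eigh(X.T @ X)
evals, evecs = evals[::-1], evecs[:, ::-1]  # sort in descending order

print(np.allclose(s ** 2, evals))                 # squared singular values = eigenvalues
print(np.allclose(np.abs(Vt.T), np.abs(evecs)))   # same vectors, up to sign
print(np.allclose(T, X @ Vt.T))                   # T = X * V
```

The eigenvectors from the two routes agree only up to sign, which is why the comparison uses absolute values.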

Visualization of PCA results

Score plot: scatter plot of PC scores (e.g., PC Score 1 vs. PC Score 2) to visualize the m samples. PC score 1 is the value of PC 1. PC 1 is a linear combination of the n variables, in which the weights are the eigenvector entries. The interpretation of the score plot is that samples that are close to (or far away from) each other are similar (or dissimilar).
Loading plot: scatter plot of eigenvectors (e.g., PC 1 vs. PC 2) to visualize the n variables. The interpretation of the loading plot is that original variables that are close to each other are well correlated and thus move in the same direction (e.g., bending mode).
Biplot: the score plot and the loading plot are overlaid to show how the samples and the original variables are related.
Scree plot: line plot of eigenvalues vs. corresponding PC to determine the number of PCs. The number of PCs is usually determined by eigenvalue >= 1 or by a considerable change in the eigenvalues. Scree is the debris at the bottom of a cliff.

How to compute PC scores for new samples

T = X*V
(PC score = original variables*eigenvector)

X is always mean-centered (i.e., the average of each column is zero). X is usually also variance-scaled or standardized (i.e., the average of each column is zero and the standard deviation is one) if the units of measurement of the n variables differ. X'X/(m-1), with m the number of samples, is the covariance matrix if X is mean-centered and the correlation matrix if X is standardized. The constant 1/(m-1) factor is often dropped in PCA because it rescales the eigenvalues without changing the eigenvectors.
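A sketch of scoring new samples with T = X*V. It assumes, as is common practice but not stated above, that new samples are centered and scaled with the mean and standard deviation of the original (training) data, and that the sample covariance uses the divisor (number of samples minus one). The data and the helper `scores` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X_train = rng.normal(size=(200, 3)) @ mixing  # correlated training data

# Standardize with training statistics.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0, ddof=1)
Z = (X_train - mean) / std

# Correlation matrix and its eigenvectors, sorted by decreasing eigenvalue.
C = Z.T @ Z / (len(Z) - 1)
evals, V = np.linalg.eigh(C)
order = np.argsort(evals)[::-1]
evals, V = evals[order], V[:, order]


def scores(X_new, n_components=2):
    """PC scores of new samples, reusing the training mean, std and V."""
    return ((X_new - mean) / std) @ V[:, :n_components]


X_new = rng.normal(size=(5, 3))
T_new = scores(X_new)
print(T_new.shape)  # (5, 2)
```

Applied back to the training data, the variance of each score column equals the corresponding eigenvalue, which is a quick sanity check on the scaling.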

Important facts about PCA

The total variance of a matrix is scaled to be equal to the number of variables which is equal to the sum of eigenvalues.

The following are 'always' true for 'any' matrix if the variables are standardized or variance-scaled.

Let's say X is 1000 samples by 10 variables and variance-scaled;

Total variance in X (i.e., 10)

≡ Number of variables in X (i.e., 10)

≡ Sum of the diagonal elements of correlation matrix (i.e., 10)

≡ Sum of the eigenvalues of X'X (i.e., 10)

For example, if the first and second eigenvalues are 6 and 2, respectively, then the first and the second PC account for 60% and 20% of the total variance in X, respectively (i.e., 6/10, 2/10). The largest eigenvalue explains the largest share of the variation in X, and so on in decreasing order. Thus X (1000 x 10) is reduced/transformed/projected to T (1000 x 2). Further analysis (clustering, visualization, etc.) can be performed on T with ease because of the reduced number of variables. The number of PCs retained (i.e., 2) indicates the “true” rank of X (i.e., the number of orthogonal/independent variables in X).
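The chain of identities above (total variance = number of variables = trace of the correlation matrix = sum of eigenvalues) can be checked on synthetic data. The 1000 x 10 shape follows the example; the random data itself is an assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
# 1000 samples of 10 correlated variables (random mixing of independent columns).
X = rng.normal(size=(1000, 10)) @ rng.normal(size=(10, 10))

R = np.corrcoef(X, rowvar=False)           # correlation matrix, 10 x 10
eigenvalues = np.linalg.eigvalsh(R)[::-1]  # descending order
explained = eigenvalues / eigenvalues.sum()

print(np.trace(R))        # 10, up to rounding
print(eigenvalues.sum())  # 10, up to rounding
print(explained[:2])      # share of variance for PC1 and PC2
```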

Various names of PCA (Sangdon Lee’s contribution to Wiki)

Depending on the field of application, PCA is also named the discrete Karhunen–Loève transform (KLT) in electrical engineering, the Hotelling transform in multivariate quality control, proper orthogonal decomposition (POD) in mechanical engineering, singular value decomposition (SVD) of X (Golub and Van Loan, 1983), eigenvalue decomposition (EVD) of X'X in linear algebra, factor analysis, Eckart–Young theorem, or Schmidt–Mirsky theorem in psychometrics (Harman, 1960), empirical orthogonal functions (EOF) in meteorological science, empirical eigenfunction decomposition (Sirovich, 1987), empirical component analysis (Lorenz, 1956), quasiharmonic modes (Brooks et al., 1988), spectral decomposition in noise and vibration, and empirical modal analysis in structural dynamics. SVD is mathematically related to Tikhonov regularization, which is known as ridge regression. Dr. Gilbert Strang at MIT called the SVD the fundamental theorem of linear algebra.

Various multivariate statistical analyses

The core idea of various exploratory multivariate statistical analyses (such as PCA, MDS, CA, etc.) is to combine/decompose variables if they are similar. Note that SVD or EVD is a type of similarity transformation in linear algebra. Depending on the definition of similarity, various exploratory methods have been developed including multidimensional scaling (MDS), correspondence analysis (CA), PCA, etc. Then, SVD is applied to this similarity matrix. The measure of similarity for PCA is covariance/correlation among variables (i.e., variables with high correlation (and thus similar) are combined/decomposed), while distance is used for MDS (i.e., variables with short distance (and thus similar) are combined), profile for correspondence analysis (i.e., variables with similar histograms are combined; a histogram, whether left-skewed, right-skewed, or bell-shaped, is an example of a profile), and kurtosis for independent component analysis.

Important connection to engineering with Ax = b

Ax = b,

This equation is known as a system of linear equations and is “THE” core of many engineering analyses such as statics (e.g., whether a bridge can endure a certain load, such as how many trucks), dynamics (e.g., noise and vibration characteristics of a bridge or a car), differential equations, feedback control, etc.

Three types of solutions are available;

(1) A system with fewer equations than unknowns. It has many solutions, (also known as, an underdetermined system).
(2) A system with the same number of equations and unknowns. It has a unique solution; x = A^-1*b.
(3) A system with more equations than unknowns (aka., over-determined system). The least square estimation is used to find a solution, x = (A' A)^-1 A'b, which is known as normal equation in statistics and used to find the regression coefficients.

In statistics, the Ax = b is usually written as y = Xb, which is confusing but the important fact is that they (Ax = b or y = Xb) refer to the same thing. The least square estimations are x = (A'A)^-1 A' b or b = (X'X)^-1 X'*y.
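A sketch of the over-determined case (3) in the y = Xb notation. The synthetic data and the coefficient values in `b_true` are made up for illustration. The normal equations are solved with `numpy.linalg.solve` rather than by forming an explicit inverse, and the result is compared with NumPy's least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(4)
# 50 equations, 3 unknowns: an intercept column plus two predictors.
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
b_true = np.array([1.0, 2.0, -0.5])  # illustrative coefficients
y = X @ b_true + 0.01 * rng.normal(size=50)

# Normal equations: (X'X) b = X'y, solved without computing (X'X)^-1.
b_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Library least-squares solver (SVD-based), for comparison.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(b_normal)
print(b_lstsq)
```

For a well-conditioned X the two answers agree closely; they can drift apart when X'X is nearly singular, since forming X'X squares the condition number.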

A dynamical system (linear time-invariant, LTI) is usually modelled as dx/dt = A*x, and the eigenvalues of A show whether the system is stable. The system dx/dt = Ax is asymptotically stable (i.e., it eventually reaches steady state after an initial transient fluctuation) if and only if all the eigenvalues of A have negative real parts. The eigenvalues of A are often called the “modes” of the system dx/dt = Ax. The corresponding eigenvectors are the "mode-shapes".

Ill-conditioning in solving Ax = b
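The stability criterion above translates into a one-line check on the eigenvalues. The function name and the two example matrices are illustrative assumptions.

```python
import numpy as np


def is_asymptotically_stable(A):
    """True if every eigenvalue of A has a strictly negative real part."""
    return bool(np.all(np.linalg.eigvals(A).real < 0))


A_stable = np.array([[-1.0, 2.0],
                     [0.0, -3.0]])   # eigenvalues -1 and -3
A_unstable = np.array([[0.5, 0.0],
                       [0.0, -1.0]])  # eigenvalue 0.5 has a positive real part

print(is_asymptotically_stable(A_stable))    # True
print(is_asymptotically_stable(A_unstable))  # False
```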

Note that the computation of inverse matrix is involved in solving the Ax = b with the solution of x = (A'A)^-1 A' b, or y = Xb with the regression coefficients of b = (X'X)^-1 X'*y. Computation of inverse matrix assumes that the variables are orthogonal/independent or at least well-conditioned or well-posed; otherwise, the inverse matrix is unstable (also known as "ill-conditioned" or "near-singular" in linear algebra or engineering, "confounded"/"aliased" in design of experiments (DOE), or (multi) "collinearity" in regression). Collinearity or ill-conditioning indicates that a small change in A from Ax = b or X from y=Xb causes huge change in the inverse matrix.

The ill-effects of collinearity are:
(1) the interpretation of regression coefficients is often unwarranted,
(2) the regression coefficients show opposite signs compared to common observation (known as wrong sign in regression), and
(3) statistically significant predictors are often declared as statistically insignificant.

Ill-conditioning is a universal issue whenever an inverse matrix has to be computed. Artificial neural networks suffer from ill-conditioning as well. In numerical software such as Matlab, an explicit inverse is generally avoided because of numerical issues (rounding error, ill-conditioning, memory usage, etc.). Instead, decomposition methods such as QR, LU, EVD, or SVD are used. SVD is generally regarded as the most reliable of these. For the same reason, it is recommended not to use the inv command in Matlab; inv is suitable for simple demonstrations. Use backslash (\) or apply SVD instead. Backslash ('\') is the Matlab operator that solves a linear system without explicitly forming the inverse. For example in Matlab, for A*x = b, x = A^-1*b = A\b, x = (A'*A)^-1 A'*b = (A'*A)\A'*b, or b = (X'*X)^-1 X'*y = (X'*X)\X'*y.

Two measures are widely used: condition number and variance inflation factor (VIF). The condition number is the ratio of the largest eigenvalue to the smallest eigenvalue. Note that an eigenvalue close to zero inflates the condition number; such a small eigenvalue explains a trivial amount of variance in X, which could be considered noise. Ill-conditioning is a matter of degree (from weak to severe), and unfortunately there is no clear cutoff for how near is near-singular. A large condition number indicates that the inverse matrix is unstable. Orthogonal variables have a value of one for both the condition number and the VIF.
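A sketch of the condition number as an eigenvalue ratio, here taken over the eigenvalues of X'X (an interpretation of the definition above, stated as an assumption). The nearly collinear and orthonormal test matrices and the helper name are illustrative. Note that `numpy.linalg.cond` applied to X itself uses singular values, so it returns the square root of this ratio.

```python
import numpy as np

rng = np.random.default_rng(5)

# Two nearly collinear columns: x2 is x1 plus a tiny perturbation.
x1 = rng.normal(size=200)
x2 = x1 + 1e-3 * rng.normal(size=200)
X_bad = np.column_stack([x1, x2])

# Two orthonormal columns, obtained from a QR factorization.
X_good, _ = np.linalg.qr(rng.normal(size=(200, 2)))


def eig_condition_number(X):
    """Largest eigenvalue of X'X divided by the smallest."""
    evals = np.linalg.eigvalsh(X.T @ X)  # ascending order
    return evals[-1] / evals[0]


print(eig_condition_number(X_good))  # close to 1
print(eig_condition_number(X_bad))   # very large
```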

--Sangdon Lee (talk) 13:30, 13 June 2013 (UTC)

Hi Sangdon!

I haven't had time to read through all of what you've written in detail, but one thing that might be useful to add to your first picture is to give the eigenvalue equation

\mathbf {X} \mathbf {W} =\mathbf {W} \mathbf {\Lambda }

with the numerical values for your example substituted in, to show where the ± 0.707 come from.

Just a suggestion,

All best, Jheald (talk) 08:00, 20 June 2013 (UTC)

Hi Jheald,

Thanks for the great suggestion! I will make a change. I'm not familiar with how to edit Wiki pages, including uploading, so it will take some time.

Sangdon --Sangdon Lee (talk) 13:11, 21 June 2013 (UTC)

Software

Does anyone remember "Arthur", a FORTRAN program for chemistry written by Kowalski in the 1980s? It did all sorts of multivariate analysis including PCA. The manual was awful but the software was great and at the time it seemed ground-breaking. Does it need a mention here?

Also I recall SPSS and later SPSSx could do PCA but I could not see them in the list.

194.176.105.145 (talk) 12:08, 26 September 2013 (UTC)

Historical note

I do not think that it is totally true that Karl Pearson was the inventor of PCA in 1901. He certainly created the basic concept, but the formulation of the method is usually credited to Harold Hotelling:

Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24, 417-441, and 498-520.

Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 27, 321-77. — Preceding unsigned comment added by 79.191.254.158 (talk) 22:09, 5 April 2013 (UTC)

Historical Notes

I also do not think that Karl Pearson was the inventor of PCA in 1901:

G. W. Stewart, On the early history of the singular value decomposition. SIAM Review, vol. 35, No. 4, pp. 551-566, December, 1993.

From the paper above, Stewart wrote .....
"...five mathematicians who were responsible for establishing the existence of the SVD and developing its theory"- Eugenio Beltrami (1835-1899), Camille Jordan (1838-1921), James Joseph Sylvester (1814-1897) , Erhard Schmidt (1876-1959), and Hermann Weyl (1885-1955)......".

Stewart wrote.....
"The subsequent history is one of extensions, new discoveries, and applications.......An alternative to factor analysis is the PCA of Hotelling (1933).... " .

— Preceding unsigned comment added by Sangdon Lee (talk • contribs) 13:04, 11 October 2013 (UTC)

Multidimensional Scaling as another name

Isn't multidimensional scaling yet another name for PCA? Tedtoal (talk) 00:37, 16 December 2013 (UTC)

Sangdon Lee wrote:

Multidimensional scaling (MDS) is mathematically related to PCA but MDS is not the same as PCA.

Many multivariate statistical methods have been developed, such as PCA, MDS, CA (correspondence analysis), etc. All of them apply SVD or EVD. The difference among them is what kind of measure is used for "similarity". For example, if two variables are correlated, it can be said that they are similar. Other similarity measures are profile (e.g., histogram), distance, etc. These methods apply SVD after the similarity matrix is developed.

Let's say, there is a data matrix, "X", (n by m).

- PCA: The "columns" of "X" are standardized or normalized and then SVD is applied. Note that X'X is the same as correlation matrix.

- CA: Both columns AND rows of "X" are standardized and then the "profile" matrix is computed. Then, SVD is applied to the profile matrix.

- MDS: a new matrix is computed based on various "distance" measures. Then SVD is applied to this distance matrix. Many different measures for distance are available (e.g., city-block distance, Euclidean distance, p-norm distance, etc) so there are many different techniques from MDS

- Independent component analysis (ICA): Kurtosis (i.e., spikeness) is computed and then SVD is applied. (Actually, ICA involves more than this but conceptually, I believe that SVD on Kurtosis is good enough)

Hope this helps. 198.208.251.48 (talk) 16:08, 9 January 2014 (UTC)

Sources of "Compute the cumulative energy content for each eigenvector"

This and above matters can be found in a paper from:

International Conference on Advancements of Medicine and Health Care through Technology, Springer Science & Business Media, 2010, http://books.google.ro/books?id=65fivMyBwtYC&pg=PA200&lpg=PA200&dq=%22The+eigenvalues+represent+the+distribution+of+the+source+data%27s+energy%22&source=bl&ots=qcwYxEoXrU&sig=SQJNd4NdJKf41bCaRVweyWlieRg&hl=ro&sa=X&ei=6XnGU4-UBIi9ygOY3YHgBg&ved=0CDYQ6AEwAg#v=onepage&q=%22The%20eigenvalues%20represent%20the%20distribution%20of%20the%20source%20data%27s%20energy%22&f=false

at page 200. Either the paper is the source of the Wikipedia article or the Wikipedia article is the source of the paper. Maybe someone can figure out. Ferred (talk) 13:19, 16 July 2014 (UTC)

"Explain like I'm 13"

The average person understands a 3D world. I'm hoping we can write a gentle introduction to PCA accessible to a smart person who is not in a math related career.

SUMMARY: PCA can be thought of as a projection, much like a "2D" shadow of a 3D person. PCA is highly generalizable to data inputs having thousands of dimensions (variables). PCA works by reducing the number of dimensions (variables) so that a smaller number of dimensions can be visualized by humans and efficiently computed. Mathematically, PCA correlates (covariance matrix) and transforms variables from the input matrix into a new coordinate system called "principal components". Remarkably, a very small number of Principal Components can often be used to approximate highly complex systems. For this reason, PCA is used in countless applications like 3D modeling, face recognition, Graph matching and the Human Genome Project.

STRAW MAN EXAMPLE ( please help edit and condense ! )

Imagine you are looking at a person's shadow. The shadow is basically in 2D, because the shadow is a flat plane with no depth. The shadow is a 2D projection of the person in a 3D world. But how approximate is this 2D observation, and how does it represent the real 3D coordinates of the person?

PCA helps address such questions, by reducing the number of dimensions we observe. The shadow of the person is in 2D. The first two principal components are also in 2 dimensions. The 3rd principal component is in 3 dimensions, but there are never more principal components than original dimensions (variables) in the input matrix. Because each principal component is orthogonal -- meaning that each "component" is at a right angle to the previous component -- we can imagine a 2D shadow as a new "X and Y" coordinate system. Importantly, this new PCA coordinate system is a projection -- meaning that the first two dimensions "X" and "Y" are transformations of the original 3D space (X,Y,Z).

This technique is highly generalizable: we could have looked at 4D (X, Y, Z, and time) or even thousands of dimensions. This is really useful for data analysis where the number of variables often exceeds human abilities to reason about complex systems.

PCA dimensionality reduction typically causes a minor loss of precision from the original input matrix. Nonetheless, principal components have been shown to approximate highly complex systems with very few PCs. In DNA studies involving thousands of gene variables, a standard way to visualize the groups of genes is to plot the Principal Components on a 2D "X","Y" coordinate plane. Visual inspection can then reveal differences in test groups. PCA is far more than a visualization technique. Studies have shown that with fewer than 50 principal components you can classify the differences between all human cancers [refs].

PCA is more often used to extract the "dimensionally reduced embedding" of complex input such as images, graphs, and matrices with many variables. This "embedding" is the set of Principal Components that describe the magnitude and direction of the Eigenvectors and eigenvalues in a new coordinate system. The result is a smaller number of variables that often capture nearly all the information present in the original dataset. It is remarkable how often complex systems can be approximated with fewer than 50 principal components. In fact, for many data analysis tasks only a few Principal Components are used because the components so accurately "explain" the original data.

PCA is the most popular dimensionality reduction method. In PCA, the number of principal components is picked by calculating variable correlations (covariance matrix). Principal components that provide no additional "explanation" of input correlations are then ignored. Conversely, Kernel Methods increase the number of variable dimensions. By adding one "free" dimension, kernel methods enable simpler linear equations modeling. Choosing the right number of dimensions depends on the task. Generally, PCA is for reducing variable dimensions. Kernel Methods are for labeling samples and predicting outcomes. — Preceding unsigned comment added by 98.207.93.61 (talk) 13:56, 12 June 2013 (UTC)

Confusing Notation

It says that PCA minimizes "the total squared reconstruction error $\|\mathbf {T} \mathbf {W} ^{T}-\mathbf {T} _{L}\mathbf {W} _{L}^{T}\|_{2}^{2}$ or $\|\mathbf {X} -\mathbf {X} _{L}\|_{2}^{2}$ ." That can be confused with the spectral norm. Wouldn't it be better to use $\|\mathbf {T} \mathbf {W} ^{T}-\mathbf {T} _{L}\mathbf {W} _{L}^{T}\|_{F}^{2}$ or $\|\mathbf {X} -\mathbf {X} _{L}\|_{F}^{2}$ ? --Nicolamr (talk) 05:37, 3 August 2014 (UTC)

I'm sorry - what I probably want to say is that it should be made more explicit that PCA minimizes both the Frobenius and the spectral norm. --Nicolamr (talk) 05:43, 3 August 2014 (UTC)
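The two norms discussed above can be compared on a small example. This sketch only evaluates the reconstruction error of the rank-L truncation in both norms; it does not demonstrate optimality, which is the content of the Eckart–Young theorem. The data, the choice L = 2, and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 6))
X -= X.mean(axis=0)  # mean-center the columns

# X = T W^T, obtained here via the SVD X = U S W^T with T = U S.
U, s, Wt = np.linalg.svd(X, full_matrices=False)

L = 2
X_L = (U[:, :L] * s[:L]) @ Wt[:L]  # rank-L reconstruction T_L W_L^T

err_fro = np.linalg.norm(X - X_L, "fro")  # Frobenius norm
err_spec = np.linalg.norm(X - X_L, 2)     # spectral norm

print(err_fro ** 2, np.sum(s[L:] ** 2))  # sum of the discarded squared singular values
print(err_spec, s[L])                    # largest discarded singular value
```

The two errors differ whenever more than one singular value is discarded, which is why the subscript on the norm matters for the stated quantity.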

I am a layman , I didn't know the Principle components are zero mean.

The article is good. But for a layman, some details are needed.

The Details part only mentions how to maximize the norm of the principal components. At first glance, I didn't see the relationship between maximizing the norm and obtaining the greatest variance.

After a few hours of searching on Google, I finally found this. It details the reason why the principal components are also zero mean, like the column vectors of $X$. Alexyangfox (talk) 09:00, 4 August 2014 (UTC)
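The missing step can be written out in a few lines. This is a sketch under the assumption that $X$ has $n$ rows (samples) and that each column has been mean-centered, so that $\mathbf{1}^{\top}X = \mathbf{0}^{\top}$, where $\mathbf{1}$ is the vector of $n$ ones; $w$ denotes a weight vector and $t = Xw$ the resulting scores.

$$
\bar{t} \;=\; \frac{1}{n}\,\mathbf{1}^{\top} t \;=\; \frac{1}{n}\,\mathbf{1}^{\top} X w \;=\; \frac{1}{n}\,\mathbf{0}^{\top} w \;=\; 0,
\qquad
\operatorname{Var}(t) \;=\; \frac{1}{n-1}\sum_{i=1}^{n} (t_i - \bar{t})^2 \;=\; \frac{1}{n-1}\,\|Xw\|^2 .
$$

Because the scores have zero mean, their sample variance is proportional to the squared norm $\|Xw\|^2$, so maximizing the norm over unit vectors $w$ is the same as maximizing the variance.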

Should mention and address the fact that PCA is dimensionally incoherent

This article neither addresses nor even mentions the observation (made, e.g., by David MacKay) that principal component analysis appears to be incoherent: if you try to write down the equation defining the 'eigenvectors' of a covariance matrix, the expression you get is dimensionally inconsistent. Consider data relating the heights (L) and the masses (M) of a sample of adult males. We can build a covariance matrix A for this data. This matrix has types [LL, LM; LM, MM]. The 'eigenvectors' would be the solutions x to A x = l x, where l is a dimensionless number. This 'equation' is obviously badly typed.

Comments (arguments why PCA is nevertheless well-founded would be interesting to read) on this observation would be useful. — Preceding unsigned comment added by 87.234.240.61 (talk) 08:08, 13 October 2014 (UTC)

On the other hand we do mention scaling. (Though perhaps we don't say enough about it, or to motivate why it's appropriate, and the dimensional inconsistency without it might be good to point up, because it highlights the issue well).

A very classic scaling approach is to scale each column of the data by its standard deviation. So then one isn't building a covariance matrix for L and M (which as you note is dimensionally incoherent), but for (L / ∆L) and (M / ∆M) -- which leads to an equation which is dimensionally coherent.

∆L, ie the spread of L, might either be estimated eg using the empirical standard deviation (the most frequent approach); or with wider knowledge of the system, it may be possible to make an up-front theoretical prediction for it.

Alternatively, if L and M have the same units (ie not heights and masses), one might not scale the data, if it was total variance in that set of units that one is looking to preserve. Alternatively again sometimes one might scale according to the precision of the measurements, if one is trying to preserve the most 'meaningfulness' in the dimensional reduction. But yes, scaling is important, and it can be a bit of a black art. Jheald (talk) 09:02, 13 October 2014 (UTC)
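
To make the unit dependence concrete, here is a small sketch assuming synthetic height and mass data; the numbers, the helper names `first_pc` and `standardise`, and the choice of metres versus centimetres are all illustrative assumptions. It compares the first principal direction of the raw covariance matrix when height is recorded in metres and in centimetres, and then repeats the comparison after dividing each column by its empirical standard deviation.

```python
import numpy as np

def first_pc(data):
    """Unit eigenvector of the sample covariance with the largest eigenvalue."""
    centred = data - data.mean(axis=0)
    cov = centred.T @ centred / (len(data) - 1)
    _, eigvecs = np.linalg.eigh(cov)           # ascending eigenvalues
    v = eigvecs[:, -1]
    # Fix the arbitrary sign so that results are comparable.
    return v if v[np.argmax(np.abs(v))] > 0 else -v

def standardise(data):
    """Centre each column and divide by its empirical standard deviation."""
    return (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

rng = np.random.default_rng(2)
height_m = rng.normal(1.75, 0.07, size=300)                        # metres
mass_kg = 70 + 60 * (height_m - 1.75) + rng.normal(0, 8, size=300)  # kilograms

data_m = np.column_stack([height_m, mass_kg])          # height in metres
data_cm = np.column_stack([height_m * 100, mass_kg])   # same heights in cm

pc_m, pc_cm = first_pc(data_m), first_pc(data_cm)
pc_std_m = first_pc(standardise(data_m))
pc_std_cm = first_pc(standardise(data_cm))

print("raw, metres      :", pc_m)
print("raw, centimetres :", pc_cm)
print("scaled, metres   :", pc_std_m)
print("scaled, cm       :", pc_std_cm)
```

With the raw data the leading direction changes when the height unit changes, whereas the standardised columns are dimensionless and give the same direction either way.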

Computing PCA using the covariance method -> Calculate the deviations from the mean

Maybe I'm missing something really crucial, but why is that h-vector there? It seems to have no purpose, considering it's made up only of 1s. — Preceding unsigned comment added by 93.218.89.92 (talk) 17:18, 15 October 2016 (UTC)

edit: I'm an idiot. It expands the mean vector into a matrix (I'm so used to shorthand programming where the subtraction is applied automatically to each row/col). Maybe that explanation should be added there. — Preceding unsigned comment added by 93.218.89.92 (talk) 17:19, 15 October 2016 (UTC)
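
A short sketch of what the vector of ones does, assuming the notation B = X - h u^T with h an n-by-1 column of ones and u the vector of column means; the small example matrix is made up for illustration. The outer product h u^T stacks n copies of the mean row, which is exactly what NumPy broadcasting does implicitly.

```python
import numpy as np

X = np.array([[2.0, 10.0],
              [4.0, 14.0],
              [6.0, 18.0]])
n = X.shape[0]

u = X.mean(axis=0)                 # column means, here [4, 14]
h = np.ones((n, 1))                # n-by-1 column vector of ones

# Explicit form: h @ u^T is an n-by-p matrix whose rows all equal u.
B_explicit = X - h @ u[np.newaxis, :]

# Shorthand form: broadcasting subtracts u from every row automatically.
B_broadcast = X - u

print(B_explicit)
```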

Merger proposal

I propose that Empirical orthogonal functions be merged into Principal component analysis. The two are effectively the same, and the article on PCA says as much: Depending on the field of application, it is also named (... ) empirical orthogonal functions (EOF) in meteorological science (...). Therefore, it makes sense to have a single article describing both. --Gerrit ^CUTEDH 17:28, 4 December 2015 (UTC)

In theory, theory and practice are the same. But in practice... Therefore, Empirical orthogonal functions could be used to put the emphasis on the practical side of the story. Moreover, EOF seems to consider time as a special parameter that should be dealt with in a special manner. Pinging User:Kevin Baas, the creator of the page. Pldx1 (talk) 13:13, 12 December 2015 (UTC)

I agree that the context for EOF may differ from the one for PCA, but I think this can pretty well be covered within a single article. --Gerrit ^CUTEDH 16:22, 14 December 2015 (UTC)

I believe a merge & redirect would be fine, possibly combined with adding another subheading for meteorology under "Applications" and noting the differences in context and parameterization there. --Elmidae 16:23, 25 January 2016 (UTC)

Thanks for the ping. Wow, that was a long time ago; I barely recognize my own writing. PCA is certainly at least similar, and certainly in the same boat. But EOF, I would think, is broader. EOF can also be done with ICA (independent component analysis). They are both forms of "blind signal separation". "Factor analysis" is also a related term.

It may take a bit to weed out the exact relationships of these terms, but it seems to me that EOF would be higher in the hierarchy. That is to say, while PCA is a form of EOF, something that is an EOF is not necessarily a PCA (or produced by PCA).

This is not to imply that there isn't an opportunity for a merge one place or another, just that this might not be it. (And I'm reminded of WP:NOT: Wikipedia is not paper, but on the flip side too much sprawl / redundancy can be non-ideal, hence the existence of mergers in the first place) - Kevin Baas^talk 20:51, 4 March 2016 (UTC)

Short version: the two are not effectively the same, despite what the article might state. They definitely are closely related, though. Kevin Baas^talk 20:54, 4 March 2016 (UTC)

On further reading of the article, it may in fact be ripe for a merger here, as the EOF article seems to suggest that at least in practice it is "typically" done by eigendecomposition. Perhaps the name implies something broader than it is. Or then again, perhaps the article is misleading about how narrow it is. It does in any case sound like PCA is a special case of EOF, but not the other way around. For instance, PCA does not use kernel functions. Kevin Baas^talk 21:34, 4 March 2016 (UTC)

Oppose merge: I agree with emerging consensus above that PCA is a subset of EOF, but argue that PCA is a distinctly notable and important subset, deserving a distinct article. Klbrain (talk) 11:21, 27 January 2018 (UTC)

Oppose merge: PCA is evolving to be its own branch, particularly under the popularity of "Big Data", and the stand-alone article will be able to reflect the specific growth of techniques in that branch. Limit-theorem (talk) 12:32, 27 January 2018 (UTC)

Closed.

Resolved

Klbrain (talk) 12:16, 5 February 2018 (UTC)

Details section is incomprehensible

I just came back to this article to refresh my memory, and it currently reads very poorly. The meaning of the subscripts is not explained before they are used, but after (and not even clearly at that). If you don't know what a component score is, then the semantic meaning of the paragraph is not clear. Even variables such as l are used before it's explained what they mean. This opening paragraph really needs to be rewritten by actually stating the problem mathematically. — Preceding unsigned comment added by 128.30.27.204 (talk) 15:08, 11 January 2019 (UTC)

Figure 6a

Under the "Limitations" section, the phrase "(see Figure 6a in the reference)" exists. The wiki article does not label figures. What is "the" reference? There is no link or citation. --Cowlinator (talk) 20:12, 23 September 2019 (UTC)