PCA reveals trends

This story takes us back to a time when James Bond was sent undercover as an MI-6 agent to a highly classified school. His mission was to observe the participants closely.

Composing an average from a number of data matrices

Bond meticulously tracked the grades of all the students and created tables where each row represented a student, and each column represented their scores in specific subjects such as math, sports, and geography. Soon, Bond found himself overwhelmed by a massive amount of data. To simplify it, he decided to calculate the average grade for a certain period, thinking it would provide a more representative picture.

However, the results turned out to be inconclusive. Bond then calculated the average grade in each subject across all students. Because this measure does not depend on any individual student’s abilities, it mostly reflected the quality of teaching in each subject. In the next step, Bond realized that each student’s deviations from this average would be more informative.
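Bond’s centering step can be sketched in a few lines of NumPy. The grades below are invented for illustration; only the idea of subtracting the per-subject average comes from the story.

```python
import numpy as np

# Hypothetical grades: rows are students, columns are math, sports, geography.
grades = np.array([
    [9.0, 4.0, 5.0],
    [5.0, 8.0, 7.0],
    [7.0, 6.0, 6.0],
    [3.0, 10.0, 8.0],
])

# Average over students (axis 0): one number per subject.
subject_mean = grades.mean(axis=0)

# Individual deviations from that average: the table that gets analyzed further.
deviations = grades - subject_mean

print(subject_mean)
print(deviations)
```

By construction, every column of the deviation table sums to zero, so only the differences between students remain.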

Nevertheless, analyzing the data proved to be challenging. Headquarters advised him to employ some linear algebra, specifically a rotation of the basis. You see, the scores in the three subjects can be likened to coordinates in a three-dimensional space, and they can be transformed into a rotated basis.

Recasting the data matrix in a new, rotated basis

James experimented with this rotation and made an intriguing discovery. It was possible to rotate the basis in such a way that most of the columns in the transformed table became zero or close to zero. In other words, when the data was projected onto the new y and z coordinates, it resulted in little useful information and mostly represented noise. These y and z columns were deemed unimportant and could be removed.

Now, what about the remaining new x coordinate? James Bond found that it exhibited a distinct pattern: positive numbers for math and negative numbers for sports and geography, in a certain proportion. Aha, thought James, students with positive values in this specific score tended to have analytical thinking skills, while those with negative values were more inclined towards activities such as traveling, acting, and shooting.
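Here is a rough sketch of such a rotation, using the singular value decomposition as one possible way to find the new axes. The hidden trait, the (2, -1, -1) pattern, and the noise level are all assumptions made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hidden trait per student and an assumed subject pattern:
# math goes up, sports and geography go down.
trait = rng.normal(size=200)
pattern = np.array([2.0, -1.0, -1.0])
grades = 6.0 + np.outer(trait, pattern) + 0.1 * rng.normal(size=(200, 3))

deviations = grades - grades.mean(axis=0)

# Rows of vt are the new orthonormal axes (the rotated basis).
_, s, vt = np.linalg.svd(deviations, full_matrices=False)
rotated = deviations @ vt.T  # coordinates of each student in the new basis

# The sign of an axis is arbitrary; flip it so that math gets a positive weight.
first_axis = vt[0] * np.sign(vt[0, 0])

spread = rotated.std(axis=0)
print(spread)      # spread of the data along each new axis
print(first_axis)  # weights of math, sports, geography in the new x coordinate
```

In this toy data set the spread along the new y and z axes stays at the noise level, while the new x axis carries the math-versus-the-rest pattern.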

Bond now had a powerful tool to characterize the individual profiles of each student, providing crucial results to report back to headquarters.

Technical example

How can we apply James Bond’s experience to our own endeavors? Let’s consider a vast collection of spectra comprising 1000 energy channels, all of which are affected by noise. Behind the scenes, the only variation lies between compound A and compound B, whose ideal spectra are depicted in the figure.

Spectra of two compounds

However, significant noise badly degrades the measured spectra, and it is not clear what is going on in the data set:

Typical spectra corrupted by noise

Don’t panic! Following Bond’s strategy, we calculate the mean spectrum and treat all the data as deviations from this mean. Still, analyzing data with 1000 channels proves challenging. Let’s go further.

Apply Principal Component Analysis (PCA), which is akin to the rotation technique employed by James Bond. Miraculously, PCA reveals that essentially one parameter varies across the data: the proportion of compounds A and B. To emphasize, instead of dealing with 1000 independent counts across 1000 channels, we find that there is only one parameter that predominantly governs all counts. This parameter is the strength of the deviation from the mean while the shape of deviation is characterized by a certain curve revealed by the PCA basis.
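A minimal sketch of this experiment on synthetic data is shown below. The Gaussian peak shapes, the 500 spectra, and the noise level are assumptions chosen for illustration; they are not the spectra from the figures.

```python
import numpy as np

rng = np.random.default_rng(1)
channels = np.arange(1000)

def peak(center, width):
    """Gaussian peak over the energy channels (an assumed line shape)."""
    return np.exp(-0.5 * ((channels - center) / width) ** 2)

# Assumed ideal spectra of compounds A and B.
spectrum_a = peak(300, 40) + 0.5 * peak(700, 60)
spectrum_b = 0.5 * peak(300, 40) + peak(700, 60)

# 500 noisy mixtures; fraction is the share of compound A in each one.
fraction = rng.uniform(0.0, 1.0, size=500)
ideal = np.outer(fraction, spectrum_a) + np.outer(1.0 - fraction, spectrum_b)
spectra = ideal + 0.2 * rng.normal(size=ideal.shape)

# Centered PCA via SVD: subtract the mean spectrum, then decompose.
mean_spectrum = spectra.mean(axis=0)
_, s, vt = np.linalg.svd(spectra - mean_spectrum, full_matrices=False)

# Compare the first eigenvector with the true direction of change, A minus B.
difference = spectrum_a - spectrum_b
alignment = abs(vt[0] @ difference) / np.linalg.norm(difference)

print(s[:4])      # the first singular value stands out from the rest
print(alignment)  # |cosine| between the first eigenvector and A minus B
```

The first singular value stands well above the others, which belong to noise, and the first eigenvector points almost exactly along the difference between the two ideal spectra.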

Decomposing spectra into mean and deviation

To distance ourselves from spy terminology, let’s refer to this curve as an “eigenvector” rather than a “signature.” This eigenvector shows a trend.

Let’s also clarify our usage of the term “trend,” as it deviates from the commonly intuitive definition of collective behavior driven by a specific impulse, such as the buying of Tesla stocks. Instead, we refer to a statistical linear trend, which can be extrapolated in two directions. Following the positive direction signifies the strengthening of a particular feature, while following the negative direction indicates its weakening.

Now, all spectra with a parameter close to +1.0 would exhibit the spectrum of compound A, while those close to -1.0 would display the spectrum of B. Naturally, there are numerous spectra that fall in between A and B, representing a mixture of the two compounds. Their parameter is in the range (-1.0 : +1.0). Our comprehension of the data set has now reached a state of clarity and coherence.
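To see how such a parameter could be obtained, here is a sketch on the same kind of synthetic data as before (Gaussian peaks and the noise level are again assumptions). The rescaling to the -1.0 to +1.0 range uses the known ideal spectra, which real data would not provide, so treat that step as an illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
channels = np.arange(1000)

def peak(center, width):
    return np.exp(-0.5 * ((channels - center) / width) ** 2)

# Assumed ideal spectra and 500 noisy mixtures of them.
spectrum_a = peak(300, 40) + 0.5 * peak(700, 60)
spectrum_b = 0.5 * peak(300, 40) + peak(700, 60)
fraction = rng.uniform(0.0, 1.0, size=500)
ideal = np.outer(fraction, spectrum_a) + np.outer(1.0 - fraction, spectrum_b)
spectra = ideal + 0.2 * rng.normal(size=ideal.shape)

mean_spectrum = spectra.mean(axis=0)
_, _, vt = np.linalg.svd(spectra - mean_spectrum, full_matrices=False)
eigenvector = vt[0]

# Score: projection of each centered spectrum onto the eigenvector.
scores = (spectra - mean_spectrum) @ eigenvector

# Rescale so that pure A maps to about +1 and pure B to about -1.
# Dividing by half_span also removes the arbitrary sign of the eigenvector.
half_span = 0.5 * (spectrum_a - spectrum_b) @ eigenvector
parameter = scores / half_span

expected = 2.0 * fraction - 1.0  # +1 for pure A, -1 for pure B
correlation = np.corrcoef(parameter, expected)[0, 1]
print(correlation)
```

The single number per spectrum tracks the true mixing proportion closely, even though each individual spectrum is noisy.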

In conclusion, PCA not only enables us to compress and denoise spectroscopic data but also allows us to extract clear variation trends that may go unnoticed by the naked eye. By reducing the complexity of the data and identifying the dominant trends, PCA provides valuable insights and reveals patterns that might otherwise remain hidden.
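The compression and denoising claim can be checked on synthetic data as well. In this sketch (assumed Gaussian peaks and noise level, as above) each spectrum is rebuilt from the mean spectrum plus one score times one eigenvector, and the result is compared with the noise-free spectra that only a simulation can provide.

```python
import numpy as np

rng = np.random.default_rng(3)
channels = np.arange(1000)

def peak(center, width):
    return np.exp(-0.5 * ((channels - center) / width) ** 2)

# Assumed ideal spectra and 500 noisy mixtures of them.
spectrum_a = peak(300, 40) + 0.5 * peak(700, 60)
spectrum_b = 0.5 * peak(300, 40) + peak(700, 60)
fraction = rng.uniform(0.0, 1.0, size=500)
ideal = np.outer(fraction, spectrum_a) + np.outer(1.0 - fraction, spectrum_b)
spectra = ideal + 0.2 * rng.normal(size=ideal.shape)

mean_spectrum = spectra.mean(axis=0)
centered = spectra - mean_spectrum
_, _, vt = np.linalg.svd(centered, full_matrices=False)

# Keep one component: 500 scores plus two curves instead of 500 x 1000 counts.
scores = centered @ vt[0]
denoised = mean_spectrum + np.outer(scores, vt[0])

rms_noisy = np.sqrt(np.mean((spectra - ideal) ** 2))
rms_denoised = np.sqrt(np.mean((denoised - ideal) ** 2))
print(rms_noisy, rms_denoised)
```

In this simulation the one-component reconstruction sits much closer to the ideal spectra than the raw measurements do.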

The Python code can be found in the PDF version of this document: Full Text with Codes.

Comments

4 responses to “PCA reveals trends”

  1. Abdul

    That’s all wrong! People’s characters can’t be described by one parameter.

    1. pavel.temdm

      You are absolutely correct; people are indeed complex beings. I must admit that I oversimplified the story that Bond shared with me. In reality, there were approximately 40 subjects and 4 distinctive parameters involved. From what I recall, these parameters included: 1) “analytical vs acting,” 2) “artistic vs technical,” 3) “lazy vs hardworking,” and 4) “communicative vs reserved.” Even with these parameters, one can begin to construct a reasonably accurate profile of an individual. I use the term “in reality,” but it remains uncertain what information Bond may have shared. A significant portion of this story still remains top-secret, even to this day….

  2. Thomas

    I checked several papers on the application of PCA. They generally do not subtract the mean value before the decomposition. If you count everything from the mean, the eigenspectra may be negative, as in your last picture. That is hard to understand. Counting from zero, not from the mean, is more logical.

    1. pavel.temdm

      Hmm… I doubt that most people do not subtract the mean prior to PCA. I believe the most common approach is subtracting the mean, which is called centered PCA. However, you are right, sometimes people ignore centering. This is because they do not attempt to find a trend in the data but simply want to denoise it. I will try to clarify this with a simple example.
      centered vs uncentered PCA
      Suppose there is a dataset with only two energy channels or features. There is a clear trend: the features change in a 2:1 proportion, as shown in the figure. Centered PCA will immediately find the direction of this trend, while uncentered PCA will first find an eigenvector pointing more or less to the center of the data distribution. Moreover, the second eigenvector will not coincide with the true trend either, because it is constrained to be orthogonal to the first. Therefore, you end up with two basis vectors, neither of which coincides with the true trend. However, it is easy to see that the ‘true’ trend direction is just a linear combination of these two vectors. Thus, the denoising reconstruction still works, although you need one more component than in the centered case.
      I should mention, however, that if there is more than one trend in the data, the situation becomes more complicated and the differences between centered and uncentered PCA are not significant.
      To summarize: for the denoising task, centered and uncentered PCAs are almost equivalent, but if you want to retrieve the trend, centered PCA is needed.
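The two-feature example from this reply can be reproduced with a short sketch. The position of the data center, the number of points, and the noise level are assumptions picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Two features following a 2:1 trend around an assumed center at (3, 10).
trend = np.array([2.0, 1.0]) / np.sqrt(5.0)
center = np.array([3.0, 10.0])
t = rng.normal(size=300)
data = center + np.outer(t, trend) + 0.05 * rng.normal(size=(300, 2))

def first_axis(matrix):
    """First right singular vector, i.e. the first PCA eigenvector."""
    _, _, vt = np.linalg.svd(matrix, full_matrices=False)
    return vt[0]

centered_axis = first_axis(data - data.mean(axis=0))
uncentered_axis = first_axis(data)

# |cosine| of the angle between each first eigenvector and a reference direction.
center_dir = data.mean(axis=0) / np.linalg.norm(data.mean(axis=0))
cos_centered = abs(centered_axis @ trend)
cos_uncentered = abs(uncentered_axis @ trend)
cos_to_center = abs(uncentered_axis @ center_dir)

print(cos_centered, cos_uncentered, cos_to_center)
```

With these numbers the centered eigenvector lines up with the 2:1 trend, while the uncentered one points toward the center of the data cloud instead.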
