How much noise can we remove by PCA?

You’ve probably heard about Principal Component Analysis (PCA) and how it can be used to clean up noisy datasets. This can be done with our software, for instance. But have you ever wondered how it actually works? And more importantly, can it eliminate all the noise or just a fraction of it? Well, this post is here to shed some light on those questions. Let’s dive in!

Let’s break down the concept of Principal Component Analysis (PCA) in simple terms. Imagine we have a spy named Mr. Bond, who is tasked with sending reports from a top-secret school. These reports contain student grades in different subjects. Each week, Mr. Bond creates a table (shown in blue in the figure) where each row represents a student and each column represents their scores in a specific subject like math, sport, or geography. Now, here’s the catch: Mr. Bond can’t transmit the table as is, because it needs to be encoded to maintain secrecy. To do this, he has a set of predefined keys (shown in yellow). He simply multiplies the blue matrix (the grade table) by the yellow matrix (the keys) to obtain a new matrix, shown in green. This encoded matrix is then transmitted via radio under cover of night.

The smart folks at headquarters know their linear algebra. They receive the encoded matrix and use the reverse key matrix to decipher it and restore the original table with the students’ grades.

After a few weeks, Mr. Bond starts to notice a peculiar pattern in the keys he receives from headquarters. It turns out that these keys aren’t just random combinations: whenever he multiplies (as a dot product) any column of the key matrix with another column, the result is always zero. It dawns on him that his lazy boss, who designed the keys, took a rather simplistic approach.

You see, his boss considered the three subjects as coordinates in a 3D space, and all he did was rotate these coordinates to mix up the results. Each week, he came up with three new basis vectors and defined them in terms of the original coordinates, as shown in the figure. In his old-fashioned ways, the boss made sure that the basis vectors were always orthogonal to each other. That’s why the dot product of any two of these basis vectors always yields zero. Now, here’s the funny part: for one week, his boss provided the identity matrix (the leftmost matrix in the first figure). This meant that there was no encoding at all that week!
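For readers who like to see the scheme in code, here is a minimal numpy sketch of the boss’s encoding; the table sizes and grade values are invented for illustration, and the orthogonal key is just a random rotation:

```python
import numpy as np

rng = np.random.default_rng(7)

# Blue matrix: 4 students x 3 subjects (math, sport, geography)
grades = rng.integers(1, 6, size=(4, 3)).astype(float)

# Yellow matrix: a random rotation of the 3D "subject" axes,
# obtained by orthogonalizing a random matrix via a QR decomposition
key, _ = np.linalg.qr(rng.normal(size=(3, 3)))

print(np.round(key.T @ key, 12))      # identity: any two different columns have zero dot product

encoded = grades @ key                # green matrix: what Mr. Bond transmits at night
decoded = encoded @ key.T             # the "reverse key" of an orthogonal matrix is its transpose
print(np.allclose(decoded, grades))   # True: headquarters recover the original grades
```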

As the story unfolds, Mr. Bond is now tasked with transmitting a number of spectra obtained from a highly classified material. Each spectrum consists of 2048 channels, which means Mr. Bond receives 2048×2048 key matrices for encoding. Just like before, his boss continues to construct the encoding matrices by rotating the basis vectors, but this time operating in a vast 2048-dimensional space.

As Mr. Bond faithfully transmits the encoded spectra, he begins to notice something interesting. Certain columns in the key matrices seem to be more efficient at encoding than others. These columns, presumably manually defined by his boss, result in smooth variations when applied to the spectra. However, there are other columns (probably appearing due to the orthogonality constraints) that produce noisy columns in the encoded matrices. Mr. Bond becomes skeptical about whether these noisy columns carry any valuable information at all.

Driven by his intuition, Mr. Bond decides to skip these seemingly useless coding columns. This choice speeds up his transmission work at night and reduces the risk he faces. He informs the MI-6 headquarters that they should retain only a few specified rows in the reverse key matrix when decoding. To his surprise, the headquarters manages to successfully decode the spectra using this reduced set of rows. In fact, they even admit that the quality of the decoded spectra has improved significantly. It appears that much of the data Mr. Bond had been transmitting before was nothing more than noise.

On that fateful day, James Bond made a life-altering decision. He stopped stealing, prying, and eavesdropping, and embarked on a completely different path. No longer would he receive key matrices from headquarters; instead, he took matters into his own hands and constructed the keys himself for each dataset. His new mission was clear: to discover basis vectors that would enable him to express the data with the fewest possible coordinates, ultimately compressing the data and improving its quality by reducing noise.

As he delved deeper into this new endeavor, Bond came across a remarkable rule: the key columns that produced the highest variance in the corresponding columns of the encoded matrix were the most efficient for encoding. This criterion of maximizing the data variance isn’t the only possible one (we can explore other criteria in future discussions), but it proved astonishingly successful. Bond meticulously constructed the rows of the reverse key matrix, sorting them by their efficiency in producing maximal data variance. He then employed only a few of the top-ranked rows to reconstruct the complete datasets.

With this transition, James Bond inadvertently invented Principal Component Analysis (PCA). He now refers to the red reverse key matrix as the “loadings” matrix, and to the green encoded matrix as the “scores” matrix.
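For completeness, here is roughly what Bond’s recipe looks like in numpy. The function and variable names below are my own placeholders, not anyone’s official API: the loadings come out of an SVD of the mean-centered data and are already sorted by the variance they capture.

```python
import numpy as np

def pca(data):
    """data: (n_spectra, n_channels) matrix, e.g. a spectrum image flattened to 2D."""
    mean = data.mean(axis=0)
    centered = data - mean                        # PCA operates on mean-centered data
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u * s                                # green "scores" matrix
    loadings = vt                                 # red "reverse key" / loadings matrix
    variances = s**2 / (data.shape[0] - 1)        # data variance captured by each component
    return scores, loadings, variances, mean
```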

James Bond’s role has changed dramatically since that time. He is still employed by the secret service, but now as a data scientist. We are not going to name his employer; to conceal the real name, let’s call it by some meaningless abbreviation, for instance, ‘MI-6’. And James Bond’s career at MI-6 is developing quite successfully.

Now closer to the technical topic

Those familiar with PCA can skip the essay above and delve right into this paragraph. Is Bond’s invention really as magical as it seems? How much noise can we remove with PCA? All of it or just a fraction? The short answer is that PCA cannot completely eliminate all the noise in a dataset.

Consider the plot of the variances of the scores columns, which is called a scree plot (although some journal technical editors always tend to ‘correct’ it to screen plot…). Each column in the scores matrix together with its companion row in the loadings matrix forms a principal component. The scree plot for a typical EELS dataset is shown below. Note that the variances are displayed on a logarithmic scale, so negative values simply denote variances below 1. When constructing a scree plot, it is common practice to calculate and plot the variances for a limited number of principal components, typically the first 20 to 50, as shown in the left figure. However, for a better understanding, I also calculated the variances for all 2048 principal components (shown on the right). Yes, the total number of principal components equals the number of energy channels, i.e. 2048.
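As a rough sketch, such a scree plot can be produced from the pca() helper above; data stands for the flattened EELS dataset and is, again, just a placeholder name:

```python
import numpy as np
import matplotlib.pyplot as plt

scores, loadings, variances, mean = pca(data)   # data: (n_pixels, 2048) EELS set

plt.semilogy(np.arange(1, variances.size + 1), variances, 'o-')
plt.xlabel('Principal component')
plt.ylabel('Variance of the score column (log scale)')
plt.show()
```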

The scree plot helps researchers strike a balance between retaining enough principal components to capture the meaningful data variations and removing the noise contained mostly in the less significant components. By selecting a cut-off point, such as the 5th component, we declare: all components to the left (i.e. components 1-5) are useful, while all components to the right (components 6-2048) are “noise components” and should be removed. Does this mean that we get rid of all the noise by removing components 6-2048? No way!
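In code, dropping the ‘noise components’ amounts to a truncated product of the scores and loadings from the sketch above, something like:

```python
k = 5                                               # the cut-off discussed above
denoised = scores[:, :k] @ loadings[:k, :] + mean   # add the subtracted mean back
```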

I believe Edmund Malinowski was the first to show clearly that the so-called ‘meaningful components’ also contain noise. Using a simple assumption of an equal distribution of noise across the green ‘scores’ matrix above, he calculated how much noise is removed. There is always some ‘embedded’ noise in the major principal components, although typically not much. For the example shown, I estimated that 99.5% of the total noise is incorporated in components 6-2048, while only 0.5% is embedded in components 1-5. That is not surprising, as we compressed the data 2048/5 ≈ 400 times!

Let’s explore the question: what happens if we use more than 5 components for the PCA reconstruction? Would the results worsen significantly? To answer this, let’s delve into the calculations. The noise variance is additive among the components, thus we can safely sum it over any required range. Do not forget to take the square root of this sum in order to rescale it from quadratic deviations back to the linear scale! The results are shown in the figure above. As we increase the number of retained principal components, we observe a gradual decrease in the amount of noise removed. Initially, this reduction follows an almost linear pattern, but it becomes slower as more components are included. If we utilize 50 components instead of 5 for the reconstruction, the amount of removed noise decreases from 99.5% to 95%. The question arises: is this reduction substantial or not? It depends…
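Here is a minimal sketch of that bookkeeping. It assumes we already have a per-component estimate of the noise variance (for example, the flat tail of the scree plot extrapolated into the leading components, as discussed in the comments below); PCA itself does not hand us such an array:

```python
import numpy as np

def noise_split(noise_var, k):
    """Noise removed and retained (on the linear scale) when keeping the first k components."""
    removed = np.sqrt(noise_var[k:].sum())    # variances are additive, so sum first...
    retained = np.sqrt(noise_var[:k].sum())   # ...and only then take the square root
    return removed, retained
```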

The considered EELS spectra exhibit distinct statistics across different energy regions. When examining the Ti L edge region, the reconstructions using both 5 and 50 components yield identical results. However, in the case of the region near the Mg K edge, where the data are heavily affected by noise, the reconstructions using 5 and 50 components display noticeable differences.

Correspondingly, when we examine the spectrum-image slice at 1880 eV with a width of 1 eV, we observe significant differences in the reconstruction results obtained using 2048, 50, 20, 10, and 5 components. Note that reconstructing with 2048 components means preserving all possible principal components, which is equivalent to not applying PCA at all.

In summary, keeping an excessive number of principal components during PCA reconstruction gradually diminishes the denoising effectiveness of PCA. However, the question arises: why do people sometimes still use too many components? This topic will be further explored and discussed in upcoming posts.

Comments

6 responses to “How much noise can we remove by PCA?”

  1. Juan

    Thank you for this other wonderful article. Is there any place else where
    folks can get that kind of info in such an ideal manner of
    writing? I have a presentation next week, and I am looking for more info on this topic.

    1. pavel.temdm

      Unless specifically noted otherwise, all materials of this blog are free to use in your own presentations. You may use them in full, in part, or in modified form. The only thing I would kindly ask: please add a reference “based on materials posted at temdm.com” in the corner of your presentation slides. Please let me know if you need any pictures in better quality.

  2. Lars

    Hello Pavel. Thank you for this contribution, very well written! However, I can’t really figure out from your explanation how you came up with 99.5% noise reduction. So, you compressed your set by 2048:5 = 410 times. That means the noise should also be compressed 410 times. 100% divided by 410 equals 0.24%. So, shouldn’t it be 99.8% noise reduction?! – Am I missing something? Thank you in advance

    1. pavel.temdm

      I’m afraid it’s more complicated. First, the noise level itself cannot be additively summed up among the components; the noise variance may be summed, and then we should take the square root of the summed variance. Second, Malinowski (Anal. Chem. 49 (1977) 606) assumed an equal distribution of the noise variance among the components, which is not true. In a later article (J. Chemometrics 1 (1987) 33) he introduced some dependence, which still typically underestimates the noise. You can check the article (Chemometrics Intell. Lab. Syst. 94 (2008) 19) to get a feeling for how complicated the distribution of noise among the components might be. I just linearly extrapolated the noise variance from components 10-20 to components 1-5. This way I got 0.5% of the noise still remaining in the meaningful components, which is, of course, a very rough estimation.

  3. Juan

    Does it have something to do with the percentage of the explained variance?

    1. pavel.temdm

      No. The explained_variance_ratio is easy to calculate, but its applicability is limited: it is useful only in situations with little noise. Then you can say “the first 5 principal components explain 99% of the signal variance, so I may compress the data to 5 components and not lose much.” However, the explained_variance_ratio can be misleading for very noisy data. Imagine a dataset with only noise and no signal variation. Still, PCA will retrieve some noise components that are a bit more variable than the others. You could then calculate the explained_variance_ratio and probably claim: “the first 50 components explain 99% of the variance”. But still, these 99 percent are nothing but noise because there is no signal variation in this set. Estimating the signal-to-noise proportion in data is a much more complicated task, and the starting point here is the theory of Malinowski (Anal. Chem. 49 (1977) 606).
