Getting Started
Top-down proteomics informatics is a complicated subject that, like most things, builds on many simpler concepts that evolved over time. This guide looks at that evolution through the lens of sample complexity; more complicated samples require more experimental preparation and additional informatics considerations. Let us begin with understanding what a proteoform looks like by mass spectrometry.
Single known proteoform
Imagine the simplest proteomics sample possible: a single proteoform all by itself. It's not quite clear why you would be doing this (see possible examples), but you have nonetheless directly infused this sample and acquired a single spectrum showing many copies of the same intact proteoform. Because of electrospray ionization (ESI), the proteoform molecules are distributed among many different charge states. Your intact proteoform spectrum (aka MS1) might look something like this:
This spectrum shows an isotopically resolved example of ubiquitin with no impurities. We can confirm this by estimating the observed proteoform mass and comparing it to the theoretical proteoform mass. Wait, you think a mass alone is flimsy evidence? You don't believe me? Fine, let's use fragmentation to break apart all the copies of ubiquitin in our sample and confirm the sequence we are seeing. A fragment map (below) shows where the different molecules broke apart and which pieces we were able to observe.
Naïvely, one could simply count the number of matching fragments to get a sense of how good this match is. However, we can also use a variety of metrics and scores that take more factors into account (e.g. percent of fragments explained, percent of backbone bonds cleaved, etc.).
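To make those metrics a bit more concrete, here is a minimal sketch (not any particular tool's scoring). It assumes the fragment masses have already been deconvoluted to neutral monoisotopic masses, considers only unmodified b- and y-type fragments, and uses an arbitrary 10 ppm tolerance. The function names `fragment_masses` and `score_match` are made up for this illustration.

```python
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # Da


def fragment_masses(sequence):
    """Neutral b- and y-type fragment masses, keyed by backbone bond (1..n-1)."""
    masses = [RESIDUE_MASS[aa] for aa in sequence]
    total = sum(masses)
    b, y = {}, {}
    running = 0.0
    for bond in range(1, len(masses)):
        running += masses[bond - 1]
        b[bond] = running                  # N-terminal piece
        y[bond] = total - running + WATER  # C-terminal piece
    return b, y


def score_match(sequence, observed_masses, tol_ppm=10.0):
    """Count matched fragments and compute two simple percentages."""
    b, y = fragment_masses(sequence)
    theoretical = list(b.items()) + list(y.items())
    explained = 0
    cleaved_bonds = set()
    for obs in observed_masses:
        hits = [bond for bond, mass in theoretical
                if abs(obs - mass) / mass * 1e6 <= tol_ppm]
        if hits:
            explained += 1
            cleaved_bonds.update(hits)
    n_obs = len(observed_masses)
    return {
        "matched_fragments": explained,
        "percent_fragments_explained": 100 * explained / n_obs if n_obs else 0.0,
        "percent_bonds_cleaved": 100 * len(cleaved_bonds) / (len(sequence) - 1),
    }
```

Calling `score_match("PEPTIDE", observed)` with a list of deconvoluted fragment masses returns all three numbers at once. Real scores also weigh things like intensity, mass error, and modifications, which this sketch ignores.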
Single known gene
OK, let's up the ante: we still know the gene, but now there are multiple proteoforms involved. This could be an actual experiment, where you'd like to understand the proteoform family membership and perhaps some notion of quantitation. In many cases, the intact proteoform spectrum will be enough to give us a good idea of the proteoform landscape. In the example below, we have 2 charge state distributions that correspond to 2 proteoforms.
Wow, that's starting to look a bit muddled with all those peaks! Let's run a deconvolution algorithm on the spectrum to both decharge and deisotope those proteoforms.
Much better. It is now clear that we have 2 proteoforms that are ~80 Daltons apart. Given prior knowledge about this gene and our sample, we have a pretty good idea that this is a phosphorylation. Wait, what? You don't believe me again? Fear not, fragmentation can again provide confidence in our assignment. With a mass spectrometer, it is possible to isolate a portion of a spectrum and only pass it through for fragmentation. After isolating the heavier proteoform, fragmenting, and running deconvolution on the fragmentation spectrum (aka MS2), we are able to see the modification clearly.
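The arithmetic behind "decharging" and the ~80 Da difference can be sketched in a few lines. This is only the charge-state math, not a real deconvolution algorithm (no deisotoping, no noise handling). It assumes positive-mode protonated ions and monoisotopic masses; the function names and the short modification table are illustrative choices, not a complete list.

```python
PROTON = 1.007276  # mass of a proton in Da

# A few common monoisotopic modification mass shifts in Da (not exhaustive).
KNOWN_SHIFTS = {
    "phosphorylation": 79.96633,
    "acetylation": 42.01057,
    "methylation": 14.01565,
    "oxidation": 15.99491,
}


def mz_from_mass(neutral_mass, charge):
    """m/z of a proteoform carrying `charge` protons."""
    return (neutral_mass + charge * PROTON) / charge


def mass_from_mz(mz, charge):
    """Neutral mass recovered from one charge-state peak (decharging)."""
    return charge * (mz - PROTON)


def charge_from_adjacent(mz_high, mz_low):
    """Charge of the peak at mz_high, given its neighbour at mz_low,
    which carries exactly one more proton."""
    return round((mz_low - PROTON) / (mz_high - mz_low))


def explain_shift(mass_a, mass_b, tol_da=0.02):
    """Names of known modifications whose mass matches the difference."""
    delta = abs(mass_b - mass_a)
    return [name for name, shift in KNOWN_SHIFTS.items()
            if abs(delta - shift) <= tol_da]
```

Two neighbouring peaks in a charge state distribution are enough to work out the charge, and from there the neutral mass. Comparing two neutral masses with `explain_shift` only suggests a candidate modification; as the text says, fragmentation is what provides the confidence.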
At this point, you can also start to think about how the abundances of these proteoforms relate to each other (aka relative quantitation or quantification). While there are some complications to be aware of, typically the abundance (or intensity) of the proteoforms after deconvolution can be used for comparisons. In our example above, the unmodified proteoform is about twice as abundant as the modified one (meaning that our sample had roughly double the number of unmodified proteoform molecules).
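As a tiny sketch of that comparison, the deconvoluted intensities can be turned into fractions of the total signal. The intensity values in the usage note are invented, and the function name is hypothetical.

```python
def relative_abundance(intensities):
    """Convert deconvoluted intensities into fractions of the total signal."""
    total = sum(intensities.values())
    if total == 0:
        return {name: 0.0 for name in intensities}
    return {name: value / total for name, value in intensities.items()}
```

For example, `relative_abundance({"unmodified": 2.0e6, "phosphorylated": 1.0e6})` gives roughly two thirds and one third, which matches the "about twice as abundant" reading. This ignores the complications mentioned above, such as proteoforms that ionize with different efficiency.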
Multiple known genes
- Compare proteoform families
- Complex standard (e.g. Pierce Intact Protein Standard Mix)
Time to move on to bigger (and possibly better) things! In this sample, we have a mixture of genes that we know, but we aren't sure what proteoforms are around. In addition, we suspect that some proteoforms are endogenously processed and, consequently, are smaller than the "base" proteoform. Below, we show examples of m/z and deconvoluted intact spectra that contain 3 hemoglobin genes (alpha, beta, and delta).
Clearly, the intact masses alone are not enough to determine which proteoforms are present. So, we must isolate and fragment each of these targets in turn and generate fragment maps using all of the known gene sequences. We are able to figure out 4 out of the 5 peaks because they match well to one of the three sequences, but the smallest peak (around 10 kDa) only matches on one end (aka terminus).
In this case we are lucky (thank the demonstration gods!). Because the fragmentation supports only one terminus, we correctly guess that this is a subsequence and start removing amino acids from the opposite terminus until we get good agreement with the intact mass from deconvolution and the fragment map improves (the initiator methionine is another good indicator that the subsequence is correct).
But, I hear you ask ... what if we aren't that lucky? ... more on that later.
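The trimming strategy described above can be sketched as a simple loop: keep the terminus that the fragments support and remove residues from the other end until the calculated mass agrees with the deconvoluted intact mass. This is a sketch under loud assumptions: unmodified residues only, monoisotopic masses, and an arbitrary 1 Da tolerance. The function name `trim_to_mass` is hypothetical.

```python
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # Da


def trim_to_mass(sequence, observed_mass, keep="N", tol_da=1.0):
    """Remove residues from the unsupported terminus until the mass fits.

    keep="N" keeps the N-terminus and trims from the C-terminus;
    keep="C" does the opposite. Returns the subsequence, or None if
    no truncation matches within tol_da.
    """
    residues = list(sequence)
    mass = sum(RESIDUE_MASS[aa] for aa in residues) + WATER
    while residues:
        if abs(mass - observed_mass) <= tol_da:
            return "".join(residues)
        if mass < observed_mass - tol_da:
            return None  # already too light, trimming further cannot help
        removed = residues.pop() if keep == "N" else residues.pop(0)
        mass -= RESIDUE_MASS[removed]
    return None
```

A returned subsequence is only a candidate. It still needs to be checked against the fragment map, as described above.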
Single unknown proteoform
Thus far we have been careful to maintain a high level of certainty about our samples. However, life isn't always that easy! Let's add some uncertainty to things and explore how we might face an unknown single proteoform.
Example 1 - We know the sample's organism and it's a common organism
Got a mystery proteoform that came out of a mouse? You're in luck! The typical workflow involves creating a database from a trusted source and running a search algorithm to find the best couple of matches. Given the small scale of a single proteoform, you can manually validate the fragment maps and pick the best result.
TopPIC result with a couple hits.
Example 2 - We know the sample's organism and it's an uncommon organism
Got a mystery proteoform that came out of a woolly mammoth? This might be a bit harder. You might be able to use a database from a similar species (maybe African Elephant?) and find some matches. Do you happen to have a set of DNA or RNA sequences? These can be turned into a proteomic search space without much difficulty. Is your fragmentation fantastic? Try the next example ...
Something to show RNA -> AA?
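One way to picture the nucleotide-to-amino-acid step is a plain codon-table translation. The sketch below uses the standard genetic code and translates a single reading frame until the first stop codon. Real pipelines also have to deal with reading frames, strands, splicing, and ambiguous bases, none of which are handled here. The function name `translate` is just for illustration.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides):
    """Translate DNA or RNA (first reading frame) up to the first stop codon."""
    seq = nucleotides.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)
```

For instance, `translate("AUGAAAUGGUAA")` reads the codons AUG, AAA, UGG and then stops at UAA. Running something like this over every transcript gives a list of candidate sequences to search against.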
Example 3 - We can't get a database, but the fragmentation is amazing!
This can happen in very novel cases or perhaps in the case of antibodies, where the proteoform's sequence isn't determined by the genome. While it is difficult for top-down data, if you have very rich fragmentation, you can give de novo sequencing a try.
Something from a de novo paper?
Multiple unknown proteoforms
Let's bring a couple more unknown proteoforms to the party!
Although we skipped right past it before, the hardest part about dealing with multiple unknowns is formatting your observation data into something that the search algorithms can handle. This typically means that a single instrument acquisition will have both intact and fragmentation scans that are brought together by the software. If your data are in multiple files, you might be out of luck.
Maybe some snake venom?
- Isolation of interesting peaks and manually put together features with fragments
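To illustrate what "bringing scans together" can mean, here is a deliberately simple sketch that groups each fragmentation scan with the most recent intact scan from the same acquisition. The scan representation (a list of dicts with `scan` and `ms_level` keys, in acquisition order) is an assumption made for this example. Real software typically also uses the isolation window and precursor mass to make the link.

```python
def pair_scans(scans):
    """Group MS2 scan numbers under the MS1 scan that preceded them.

    `scans` is a list of dicts with 'scan' and 'ms_level' keys,
    given in acquisition order (a hypothetical format).
    """
    groups = {}
    current_ms1 = None
    for entry in scans:
        if entry["ms_level"] == 1:
            current_ms1 = entry["scan"]
            groups[current_ms1] = []
        elif current_ms1 is not None:
            groups[current_ms1].append(entry["scan"])
    return groups
```

This also shows why data split across multiple files are awkward: with no shared acquisition order, there is nothing obvious to group on.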
Complex mixture of unknown proteoforms
At this point, your sample contains hundreds or thousands of proteoforms and is WAY too complex to directly introduce to a mass spectrometer. At the very least, you will need to space the proteoforms out over time so they don't hit the instrument in one big glob. If the complexity is high enough, you should consider splitting your sample into multiple simpler mixtures (by mass, collisional cross-section, etc.) before you get onto the instrument. Below is an example of a chromatogram that shows scan intensity over time.
Example of chromatogram
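A chromatogram like this is, at its simplest, the summed intensity of each scan plotted against retention time (often called a total ion chromatogram). A minimal sketch, assuming each scan is given as a `(retention_time, intensities)` pair, which is a made-up representation for this example:

```python
def total_ion_chromatogram(scans):
    """Return (retention_time, summed_intensity) points sorted by time.

    `scans` is an iterable of (retention_time, list_of_peak_intensities).
    """
    return sorted((rt, sum(intensities)) for rt, intensities in scans)
```

Plotting the returned points gives the intensity-over-time trace. Peaks in that trace are where proteoforms come off the column.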
This is also the first time that we have more data than we can manually validate. Feature Detection algorithms will scan your instrument data file and find proteoform signals within the chromatographic peaks. Features and their corresponding deconvoluted fragment masses are searched in an automated fashion against a database to produce a list of potential proteoforms in your sample.
Example of proteins and proteoforms in TDviewer
This last point is crucial: there is no way anyone has time to properly validate the thousands of potential proteoform results from these searches! To address this, most search algorithms include additional scores that take the broader context of the search into account (database size, search parameters, etc.). The best approach attempts to estimate a false discovery rate (FDR) by comparing the proteoform results to the scores from fake (or decoy) proteoforms.
Target/Decoy distribution plot
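The target/decoy idea can be sketched with a few lines of counting. This is a simplified estimate, not the method of any specific search tool. It assumes that higher scores are better, that the decoy database is the same size as the target database, and that the number of decoys above a cutoff approximates the number of false targets above it. Function names are hypothetical.

```python
def fdr_at_threshold(target_scores, decoy_scores, threshold):
    """Estimated FDR among target hits scoring at or above `threshold`."""
    targets = sum(score >= threshold for score in target_scores)
    decoys = sum(score >= threshold for score in decoy_scores)
    return decoys / targets if targets else 0.0


def threshold_for_fdr(target_scores, decoy_scores, max_fdr=0.01):
    """Lowest target score cutoff whose estimated FDR is at most `max_fdr`.

    Returns None if no cutoff satisfies the limit.
    """
    best = None
    for cutoff in sorted(set(target_scores), reverse=True):
        if fdr_at_threshold(target_scores, decoy_scores, cutoff) <= max_fdr:
            best = cutoff
    return best
```

With a cutoff from `threshold_for_fdr`, everything scoring at or above it is reported, and the expected fraction of wrong answers in that list is roughly `max_fdr`.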
Multiple samples of known proteoforms
Wow, let's stop to catch our breath, this is a lot of stuff! OK, that's enough resting ... keep moving. Let's do multiple simple samples in a single experiment. This kind of experiment is usually involved in the development and execution of proteoform assays.
Example of PfRM
Multiple samples of complex mixtures
The final stop on this journey is the most complex: multiple complex mixtures. This is wading into the world of full proteome quantitation and more advanced statistics. The workflow should look something like the following.