Contaminants, meaning the unintended proteinaceous parts of your sample, are important to think about, especially in mass spectrometry-based proteomics. This writeup began as a series of questions that came up while I was analyzing cells grown in FBS, when Ron Beavis pointed me to this FBS FASTA: 38 curated bovine sequences that worked beautifully on my mass spec data. (I later learned of this cRFP paper as well.)
That moved me from FBS to a more general question about what contaminant lists folks were using. TL;DR: Check out the efforts to update and confirm contaminant resources by Ling Hao et al. (Pre-Print and GitHub).
It's late on a Friday, but I was hoping for an assist from the masses. What are the different contaminant databases (fasta not spectral libraries)? There is cRAP from @NorSivaeb , there is something in Philosopher (@leprevostfv) and in MetaMorpheus (@Smith_Chem_Wisc). What else?
— Ben Neely 🇺🇦 (@neely615) April 22, 2022
After the excellent feedback above, I ended up somewhat summarizing the contaminant discussions in something else I was in the midst of writing (with input and feedback from Phil Wilmarth and Jesse Meyer).
Samples are rarely composed of only proteins from the species of interest. Protein contamination can be introduced during sample collection or processing. This may include proteins from human skin, wool from clothing, particles from latex, or even porcine trypsin itself, all of which contain proteins that can be digested along with the intended sample and analyzed in the mass spectrometer. As early as 2004, The Global Proteome Machine was providing a protein sequence collection of these contaminants, the common Repository of Adventitious Proteins (cRAP), while another contaminant list was published in 2008 (Keller et al.). The current cRAP version (v1.0) was described in 2012 and is still widely used today. cRAP is the contaminant protein list used in nearly all modern database searching software, though the documentation, versioning, and updating of many of these “built-in” contaminant sequence collections are difficult to follow. There is also another contaminant sequence collection distributed with MaxQuant. Together, the cRAP and MaxQuant contaminant protein sequence collections are found in some form across most software, including MetaMorpheus and Philosopher (available in FragPipe; Leprevost et al., 2020). This list of known frequently contaminating proteins can either be automatically included by the software or be retrieved as a FASTA to be used along with the primary search FASTA(s). Recently the Hao Lab has revisited these common contaminant sequences in an effort to update the protein sequences, test their utility on experimental data, and add or remove entries (Frankenfield et al., 2022).
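For the case where the contaminant list is retrieved as a FASTA and searched alongside the primary FASTA, the combining step can be sketched roughly as below. This is only a minimal illustration, not how any particular search tool does it: the function names and the `CONTAM_` header tag are my own assumptions, and the choice to keep the target entry when a contaminant has an identical sequence (e.g., human keratin in a human sample) is just one possible design.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without the leading '>', sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_with_contaminants(
    target: Iterable[Record],
    contaminants: Iterable[Record],
    tag: str = "CONTAM_",  # hypothetical tag; use whatever your software expects
) -> List[Record]:
    """Append contaminant entries to the target entries.

    Contaminant headers are prefixed with `tag` so they can be recognized
    later. A contaminant whose sequence is already present is skipped, so
    the first (target) copy of an identical sequence is the one kept.
    """
    merged = list(target)
    seen = {seq for _, seq in merged}
    for header, seq in contaminants:
        if seq in seen:
            continue
        seen.add(seq)
        merged.append((tag + header, seq))
    return merged


def format_fasta(records: Iterable[Record]) -> str:
    """Render records back to FASTA text, one sequence line per entry."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)
```

In practice you would pass open file handles to `parse_fasta` (one for the species FASTA, one for the contaminant FASTA) and write the output of `format_fasta` to a new file for searching.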
In addition to these environmentally unintended contaminants, there are known contaminants that also have available protein sequence collections (or can be generated using the steps above) and should be included in the search space. These can include the media cells were grown in (e.g., fetal bovine serum), food fed to cells/animals (e.g., Caenorhabditis elegans grown on Escherichia coli), or known non-specific binders in affinity purification (i.e., CRAPome; side Twitter conversation by Ed Huttlin about how specific the CRAPome seems). Accurately defining the search space is essential for accurate proteomic results and, especially in the case of contaminants, requires knowledge of the experiment and sample processing to adequately define possible background proteins.
Of course, much of our thinking on the need for contaminants is based on their utility in data-dependent searching (aka spectrum-centric approaches), and a question I have always wanted to ask the experts is why you would use contaminants in DIA. Why are they being used in an analyte-centric search approach (see Ting et al., 2015)?
Before SciTwitter ends (kidding!), a request for lively public discourse on a not-hot hot proteomics topic.
— Ben Neely 🇺🇦 (@neely615) April 27, 2022
If DIA search statics are largely analyte-centric, then why would you include contaminants? Bonus: point to some papers/blogs/interpretive dances on the subject.
My take on these responses is that current DIA search software needs to identify as many co-eluting peptides as possible to help with modeling (I am poorly paraphrasing). So it isn’t that you need contaminants to avoid mis-identification (i.e., a contaminant being mis-identified as an analyte), but that the software performs better when more ions are identified. Many also pointed out that knowing the contaminant levels is important from a QC standpoint (if I have a ton of trypsin and not a lot of peptides, that is informative).
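As a rough illustration of the QC idea, here is one way the contaminant share of a run could be computed. It is only a sketch under my own assumptions: that quantified results can be exported as (protein ID, intensity) pairs, and that contaminant entries carry a recognizable prefix in the search FASTA (the `CONTAM_` tag here is hypothetical, as is the function name).

```python
from typing import Iterable, Tuple


def contaminant_fraction(
    rows: Iterable[Tuple[str, float]],
    tag: str = "CONTAM_",  # hypothetical prefix marking contaminant entries
) -> float:
    """Return the fraction of summed intensity assigned to contaminants.

    `rows` holds (protein_id, intensity) pairs, e.g. taken from a search
    tool's quantitative report. Returns 0.0 when there is no signal at all.
    """
    total = 0.0
    contaminant = 0.0
    for protein_id, intensity in rows:
        total += intensity
        if protein_id.startswith(tag):
            contaminant += intensity
    if total == 0.0:
        return 0.0
    return contaminant / total
```

A run where most of the signal comes from tagged entries (say, trypsin) would return a value close to 1, which is the "ton of trypsin and not a lot of peptides" situation.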
A later point was that using an inappropriate database leads to unintended conclusions (honey bee paper), but I would counter that in DIA searching the answer is more complex and simply requires additional filtering of rare peptide identifications (I actually discuss that in this paper, with much credit to discussions with Vadim Demichev, @DemichevLab).