ISSN: 0974-276X
Carsten Henneges, Georg Hinselmann, Stephan Jung, Johannes Madlung, Wolfgang Schütz, Alfred Nordheim and Andreas Zell
Proteomics facilities accumulate large amounts of proteomics data that are archived for documentation purposes. Proteomics search engines such as Mascot or Sequest sequence peptides and return peptide hits ranked by a score, so we apply ranking algorithms to combine archived search results into predictive models. In this way, peptide sequences that frequently achieve high scores can be identified. With our approach, such peptides can be predicted directly from their molecular structure and then used to support protein identification or to perform experiments that require reliable peptide identification. For training, we prepared all peptide sequences and Mascot scores from a four-year period of proteomics experiments on Homo sapiens at the Proteome Center Tuebingen. To encode the peptides, MacroModel and DragonX were used for molecular descriptor computation. All features were ranked by ranking-specific feature selection using the Greedy Search Algorithm, which significantly improved the performance of RankNet and FRank. Model evaluation on hold-out test data resulted in a Mean Average Precision of up to 0.59 and a Normalized Discounted Cumulative Gain of up to 0.81. We therefore demonstrate that ranking algorithms can be used to analyze long-term proteomics data and identify frequently top-scoring peptides.
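For readers unfamiliar with the two evaluation measures named in the abstract, the following is a minimal sketch of how Mean Average Precision and Normalized Discounted Cumulative Gain are commonly computed for ranked lists. The abstract does not state which gain function, cutoff, or relevance grading was used, so the exponential gain `2**g - 1`, the `log2` discount, and all function names here are assumptions for illustration, not the authors' implementation.

```python
import math


def average_precision(relevance):
    """Average precision for one ranked list of binary labels (1 = relevant)."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(ranked_lists):
    """Mean of the average precision over several ranked lists (queries)."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)


def dcg(gains):
    """Discounted cumulative gain with exponential gain and log2 discount (assumed form)."""
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(gains, start=1))


def ndcg(gains):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical toy data: each list is one ranking, ordered by predicted score.
    binary_rankings = [[1, 0, 1, 0], [0, 1, 0, 0]]
    graded_ranking = [2, 3, 0, 1]
    print("MAP :", mean_average_precision(binary_rankings))
    print("NDCG:", ndcg(graded_ranking))
```

Both measures lie between 0 and 1 and reach 1 only when every relevant (or highest-graded) item is ranked ahead of all others, which is the scale on which the reported values of 0.59 and 0.81 are to be read.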