Step-by-step LC-MS/MS Identification Wokflow in R

Proteomics in Biomedicine

Gorka Prieto <gorka.prieto@ehu.eus>

University of the Basque Country (UPV/EHU)

October 3, 2024

1 Introduction

1.1 Outline

1.2 Target-decoy approach

Many MS/MS spectrum matches are incorrect!! How to discriminate them?
- Manual inspection … of thousands?
- Using a statistical approach

Commonly we use a decoy database:
- See Elias and Gygi (2007) and Käll et al. (2008)
- Derived from the original (target) database to be similar
- Matches (incorrect) can be used to estimate p-values but … we use false discovery rate (FDR) or q-values to also account for multiple testing

1.3 Sample dataset

We will use the dataset from Ramirez et al. (2018):

Organism: Drosophila melanogaster (heads).
Two conditions: control and experimental (Ube3a over-expression).
Three replicates (biological) for each condition.
Four gel slices in each replicate.

Search results:

Fasta with SwissProt (reviewed) and TrEMBL (unreviewed) entries, including isoforms and common contaminants.
Decoy entries generated by SearchGUI with “_REVERSED” suffix.
PSMs by MS-GF+ into the HUPO PSI standard mzIdentML format (*.mzid).

2 Peptide-spectrum matches (PSMs)

2.1 Loading PSMs from mzIdentML files

Set CONDITION and REPLICATE variables to the desired dataset.
We will use functions from the b10prot R package (Prieto 2024) and from the companion helper.R file.

CONDITION <- "ctrl"
REPLICATE <- "rep1"

psms <- 
  # Load PSMs from mzIdentML files
  iwf_load_psms(
    path = paste("data", CONDITION, REPLICATE, sep = "/"), pattern = ".mzid",
    # We will use MS-GF+ spectral E-value for the target-decoy approach
    psm_score = "MS.GF.SpecEValue") %>%
  # Parse fasta headers
  parse_psms()

psms %>% glimpse()

2.1 Loading PSMs from mzIdentML files

Rows: 343,768
Columns: 29
$ spectrumID               <chr> "index=2234", "index=2234", "index=2234", "in…
$ chargeState              <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ rank                     <int> 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, …
$ passThreshold            <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
$ experimentalMassToCharge <dbl> 1052.482, 1052.482, 1052.482, 1052.482, 1052.…
$ calculatedMassToCharge   <dbl> 1052.484, 1052.484, 1052.484, 1052.486, 1052.…
$ sequence                 <chr> "GQTGGDVNVEMDAAPGVDLSR", "GQTGGDVNVEMDAAPGVDL…
$ peptideRef               <chr> "Pep_GQTGGDVNVEM+16DAAPGVDLSR", "Pep_GQTGGDVN…
$ modNum                   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, …
$ isDecoy                  <lgl> FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, …
$ post                     <chr> "I", "I", "I", "Y", "Y", "Y", "Y", "I", "I", …
$ pre                      <chr> "R", "R", "R", "K", "K", "K", "K", "R", "R", …
$ start                    <int> 264, 260, 260, 59, 59, 59, 59, 992, 992, 992,…
$ end                      <int> 284, 280, 280, 75, 75, 75, 75, 1009, 1009, 10…
$ DatabaseAccess           <chr> "P08779", "Q3ZAW8", "Q9Z2K1", "tr|Q9VUV4_REVE…
$ DBseqLength              <int> 473, 474, 469, 861, 851, 845, 701, 3843, 3842…
$ DatabaseSeq              <chr> "", "", "", "", "", "", "", "", "", "", "", "…
$ DatabaseDescription      <chr> "P08779 SWISS-PROT:P08779 Tax_Id=9606 GN=KRT1…
$ spectrum.title           <chr> "File: \"C:\\Users\\ProteomeDiscoverer\\Deskt…
$ scan.number.s.           <dbl> 5045, 5045, 5045, 5045, 5045, 5045, 5045, 504…
$ scan.start.time          <chr> "1278.0", "1278.0", "1278.0", "1278.0", "1278…
$ acquisitionNum           <dbl> 2234, 2234, 2234, 2234, 2234, 2234, 2234, 223…
$ MS.GF.RawScore           <dbl> 203, 203, 203, 31, 31, 31, 31, 25, 25, 25, 23…
$ MS.GF.DeNovoScore        <dbl> 204, 204, 204, 204, 204, 204, 204, 204, 204, …
$ MS.GF.SpecEValue         <dbl> 1.204841e-26, 1.204841e-26, 1.204841e-26, 2.3…
$ MS.GF.EValue             <dbl> 2.142592e-19, 2.142592e-19, 2.142592e-19, 4.0…
$ psmScore                 <dbl> 1.204841e-26, 1.204841e-26, 1.204841e-26, 2.3…
$ proteinRef               <chr> "P08779", "Q3ZAW8", "Q9Z2K1", "Q9VUV4_REVERSE…
$ geneRef                  <chr> "KRT16", "Krt16", "Krt16", "sff_REVERSED", "s…

2.2 Counting data

How many spectra, peptides and PSMs are there?

psms %>% count_psms()

subset	psms	spectra	peptides
decoy	52953	12359	41617
target	61289	12410	47007
total	114242	12447	88602

2.3 Target-decoy competition

We select the best PSM per spectrum (rank=1)
How should have changed the counts? which should have more entries? and less?

psms %>% 
  filter(rank == 1) %>% 
  count_psms()

subset	psms	spectra	peptides
decoy	4510	4261	3884
target	8934	8623	6392
total	13444	12447	10271

3 Peptide level

3.1 Peptide-level score

Compute a simple peptide-level score using the score of the best PSM
Remove peptide sequences present in both target and decoy databases

peptides <- 
  psms %>%
  # Best PSM per spectrum
  filter(rank == 1) %>% 
  # Best PSM per peptide
  iwf_psm2pep(lower_better = TRUE)

peptides %>% 
  count_peptides()

subset	n
decoy	3879
target	6387
total	10266

Is the score useful to discriminate between target and decoy peptides?

peptides %>%
  mutate(target = !isDecoy) %>% 
  ggplot(aes(x=-log(pepScore), fill=target)) +
  geom_histogram(position = "dodge") +
  geom_vline(xintercept = -log(4.84e-9), linetype="dashed") # This value will be detailed soon

3.2 Peptide-level FDR

What’s the peptide-level global FDR without using any score threshold?

peptides %>% 
  global_fdr()

Target	Decoy	Global FDR (%)
6387	3879	60.73274

Compute a peptide-level FDR using the peptide-level score
How many peptides pass a 1% FDR threshold?

scored_peptides <- 
  peptides %>% 
  target_decoy_approach(score = pepScore)

scored_peptides %>% 
  filter(qval <= 0.01) %>% 
  global_fdr()

Target	Decoy	Global FDR (%)
2239	22	0.9825815

Which is the corresponding score threshold?

scored_peptides %>% 
  filter(qval <= 0.01 & qval > 0.0095 )

# A tibble: 4 × 9
  peptideRef       pepScore isDecoy decoys targets    pval    LP     FDR    qval
  <chr>               <dbl> <lgl>    <int>   <int>   <dbl> <dbl>   <dbl>   <dbl>
1 Pep_VNQDANDKSPDK  4.75e-9 TRUE        22    2236 0.00554  2.26 0.00984 0.00983
2 Pep_HLKEDTLFK     4.76e-9 FALSE       22    2237 0.00580  2.24 0.00983 0.00983
3 Pep_QSK+114TK+1…  4.80e-9 FALSE       22    2238 0.00580  2.24 0.00983 0.00983
4 Pep_YREFENILR     4.84e-9 FALSE       22    2239 0.00580  2.24 0.00983 0.00983

Why do we use q-value for threshold instead of FDR?

4 Protein level

4.1 Peptide-to-protein relations

First, we obtain the peptide-to-protein relations from the initial PSMs.
Then, merge the peptide-level scores from the previous section:

pep2prot <- 
  # Peptide-to-protein relations
  iwf_pep2level(psms, levelRef = proteinRef) %>% 
  # Include peptide scores
  inner_join(scored_peptides, by = "peptideRef")

pep2prot %>% glimpse()

4.1 Peptide-to-protein relations

Rows: 32,209
Columns: 11
$ peptideRef <chr> "Pep_AAAALFTR", "Pep_AAAALFTR", "Pep_AAAALFTR", "Pep_AAAART…
$ proteinRef <chr> "Q6NKM1_REVERSED", "Q960D5_REVERSED", "Q9VVA9_REVERSED", "Q…
$ shared     <int> 3, 3, 3, 2, 2, 2, 2, 5, 5, 5, 5, 5, 1, 2, 2, 1, 1, 5, 5, 5,…
$ pepScore   <dbl> 2.128236e-06, 2.128236e-06, 2.128236e-06, 9.431675e-05, 9.4…
$ isDecoy    <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
$ decoys     <int> 2383, 2383, 2383, 3735, 3735, 1538, 1538, 3588, 3588, 3588,…
$ targets    <int> 4777, 4777, 4777, 6227, 6227, 3949, 3949, 6074, 6074, 6074,…
$ pval       <dbl> 0.6142046919, 0.6142046919, 0.6142046919, 0.9627481310, 0.9…
$ LP         <dbl> 0.21168687, 0.21168687, 0.21168687, 0.01648732, 0.01648732,…
$ FDR        <dbl> 0.4988486, 0.4988486, 0.4988486, 0.5998073, 0.5998073, 0.38…
$ qval       <dbl> 0.4986399, 0.4986399, 0.4986399, 0.5998073, 0.5998073, 0.38…

4.2 Counting data

How many proteins are potentially identified with a 1% peptide-level FDR?
How will it be the protein-level FDR?

pep2prot %>% 
  filter(qval <= 0.01) %>% # Peptide-level FDR threshold
  group_by(proteinRef) %>% 
  summarise(isDecoy = any(isDecoy)) %>% 
  global_fdr() # Protein-level FDR

Target	Decoy	Global FDR (%)
2082	58	2.785783

4.3 Dealing with ambiguities

The same peptide can match different proteins
How do these shared peptides affect the protein count?
Now count the number of proteins identified only with unique (i.e. non-shared) peptides

pep2prot %>% 
  filter(shared == 1, qval <= 0.01) %>% # Peptide-level filter
  group_by(proteinRef) %>% 
  summarise(isDecoy = any(isDecoy)) %>% 
  global_fdr() # Protein-level FDR

Target	Decoy	Global FDR (%)
153	6	3.921569

Could we use shared peptides without inflating protein numbers?

4.4 Protein-level FDR

For the moment we will use only unique peptides
Steps:
1. Calculate a protein-level score:
- See Prieto and Vázquez (2020)
1. With this score calculate the protein-level FDR

scored_proteins <- 
  pep2prot %>% 
  # Only consider unique (i.e. not shared) peptides
  filter(shared==1) %>% 
  # Calculate protein-level scores
  lpg(proteinRef) %>% 
  # Calculate target-decoy approach metrics using the LPGF score
  target_decoy_approach(LPGF, lower_better = FALSE)

plot_rank(scored_proteins)

Steps (continuation):
1. Filter by 1% protein-level FDR

scored_proteins %>% 
  filter(qval <= 0.01) %>% 
  global_fdr()

Target	Decoy	Global FDR (%)
140	1	0.7142857

But we have very small proteins counts because:
- We have removed shared peptides
- There are many protein isoforms in the fasta database
What could we do?

5 Gene level

Using peptide-to-gene relations we could follow the same strategy used at protein-level
Peptides only shared between protein isoforms within the same gene will be preserved

5.1 Peptide-to-gene relations

Again, we obtain the peptide-to-gene relations from the initial PSMs and then merge the peptide-level scores:

pep2gene <- 
  # Peptide-to-gene relations
  iwf_pep2level(psms, levelRef = geneRef) %>% 
  # Include peptide scores
  inner_join(scored_peptides, by = "peptideRef")

pep2gene %>% glimpse()

5.1 Peptide-to-gene relations

Rows: 14,583
Columns: 11
$ peptideRef <chr> "Pep_AAAALFTR", "Pep_AAAALFTR", "Pep_AAAARTNK", "Pep_AAADEW…
$ geneRef    <chr> "CG9715_REVERSED", "NA_REVERSED", "CG6650_REVERSED", "Ssrp-…
$ shared     <int> 2, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1,…
$ pepScore   <dbl> 2.128236e-06, 2.128236e-06, 9.431675e-05, 8.691764e-07, 8.6…
$ isDecoy    <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TR…
$ decoys     <int> 2383, 2383, 3735, 1538, 1538, 3588, 0, 2462, 0, 1389, 3141,…
$ targets    <int> 4777, 4777, 6227, 3949, 3949, 6074, 110, 4862, 168, 3826, 5…
$ pval       <dbl> 0.6142046919, 0.6142046919, 0.9627481310, 0.3963650425, 0.3…
$ LP         <dbl> 0.211686870, 0.211686870, 0.016487316, 0.401904655, 0.40190…
$ FDR        <dbl> 0.4988486, 0.4988486, 0.5998073, 0.3894657, 0.3894657, 0.59…
$ qval       <dbl> 0.4986399, 0.4986399, 0.5998073, 0.3893246, 0.3893246, 0.59…

5.2 Counting data

How many genes are potentially identified with a 1% peptide-level FDR?
How will it be the gene-level FDR?

pep2gene %>% 
  filter(qval <= 0.01) %>% # Peptide level
  group_by(geneRef) %>% 
  summarise(isDecoy = any(isDecoy)) %>% 
  global_fdr() # Gene level

Target	Decoy	Global FDR (%)
867	28	3.229527

And using only unique peptides?

pep2gene %>% 
  filter(shared == 1, qval <= 0.01) %>%  # Peptide level
  group_by(geneRef) %>% 
  summarise(isDecoy = any(isDecoy)) %>% 
  global_fdr() # Gene level

Target	Decoy	Global FDR (%)
395	16	4.050633

5.3 Gene-level FDR

Calculate a gene-level score
With this score calculate the gene-level FDR

scored_genes <- 
  pep2gene %>% 
  # Only consider unique (i.e. not shared) peptides
  filter(shared==1) %>% 
  # Calculate gene-level scores
  lpg(geneRef) %>% 
  # Calculate target-decoy approach metrics using the LPGF score
  target_decoy_approach(LPGF, lower_better = FALSE)

plot_rank(scored_genes)

Filter by 1% gene-level FDR

scored_genes %>% 
  filter(qval <= 0.01) %>% 
  global_fdr()

Target	Decoy	Global FDR (%)
372	3	0.8064516

We have increased the identifications, but enough?

6 Protein ambiguity groups

Instead of using only unique peptides we can exploit shared peptides
PAnalyzer tool from Prieto et al. (2012):

6.1 Peptide-to-group relations

We let PAnalyzer transform peptide-to-protein relations into peptide-to-protein-to-group relations:

pep2prot2group <- 
  pep2prot %>% 
  iwf_grouping()

pep2prot2group %>% 
  count_groups()

Type	TargetProteins	DecoyProteins	TargetGroups	DecoyGroups
ambiguous	553	270	59	27
conclusive	1505	1272	1505	1272
indistinguishable	7181	6188	1961	1727
non-conclusive	882	665	882	665

Display an indistinguishable group:

pep2prot2group %>% plot_groups(
  pep2prot2group %>% filter(!isDecoy, proteinType == "indistinguishable") %>% summarise(first(groupRef)))

Display an ambiguous group:

pep2prot2group %>% plot_groups(
  pep2prot2group %>% filter(!isDecoy, proteinType == "ambiguous") %>% summarise(first(groupRef)))

6.2 Protein group-level FDR

Calculate a protein group-level score
With this score calculate the protein group-level FDR

scored_groups <- 
  pep2prot2group %>% 
  # Instead of iwf_pep2level() we use iwf_pep2group() to retain the list of proteins within each group
  iwf_pep2group() %>% 
  # Only consider peptides unique to one group (this also removes non-conclusive proteins) 
  filter(shared==1) %>% 
  # Calculate protein group-level scores
  lpg(groupRef) %>% 
  # Calculate target-decoy approach metrics using the LPGF score
  target_decoy_approach(LPGF, lower_better = FALSE)

plot_rank(scored_groups)

Filter by 1% protein group-level FDR

scored_groups %>% 
  filter(qval <= 0.01) %>% 
  global_fdr()

Target	Decoy	Global FDR (%)
507	5	0.9861933

7 Exercises

7.1 Use another dataset

Another replicate from the same condition:
- ~~"data/ctrl/rep1"~~, "data/ctrl/rep2", "data/ctrl/rep3"
A replicate from another condition:
- "data/exp/rep1", "data/exp/rep2", "data/exp/rep3"
What is the minimum number of protein groups using a 1% protein-group-level FDR?
Save identified genes to a local xlsx file by executing:

scored_genes %>% 
  filter(qval <= 0.01) %>% 
  select(geneRef, LPGF) %>% 
  mutate(condition=CONDITION, replicate=REPLICATE) %>% 
  write_xlsx(paste0(paste("genes", CONDITION, REPLICATE, sep = "_"), ".xlsx"))

7.2 Compare the identifications

From your colleages obtain the xlsx files with the identifications for the remaining replicates and conditions.
Combine the results into a comparison.xlsx by executing:

list.files(pattern = "genes_.*xlsx") %>% 
  map(read_excel) %>% 
  bind_rows() %>% 
  pivot_wider(names_from = c(condition, replicate), values_from = LPGF) %>% 
  rowwise() %>% 
  mutate(num_ctrl = sum(!is.na(across(starts_with("ctrl"))))) %>% 
  mutate(num_exp = sum(!is.na(across(starts_with("exp"))))) %>% 
  write_xlsx("comparison.xlsx")

Open comparison.xlsx with LibreOffice:
- Which genes are reported in at least two replicates of the experimental condition and no replicates in the control condition?

7.3 Test

Answer the questions in eGela:
- 1..3: theoretical
- 4..7: results for your dataset
- 8..10: comparison between datasets

References

Elias, Joshua E., and Steven P. Gygi. 2007. “Target-Decoy Search Strategy for Increased Confidence in Large-Scale Protein Identifications by Mass Spectrometry.” Nature Methods 4 (3): 207–14. https://doi.org/10.1038/nmeth1019.

Käll, Lukas, John D. Storey, Michael J. MacCoss, and William Stafford Noble. 2008. “Assigning Significance to Peptides Identified by Tandem Mass Spectrometry Using Decoy Databases.” Journal of Proteome Research 7 (1): 29–34. https://doi.org/10.1021/pr700600n.

Prieto, Gorka. 2024. B10prot: Protein Identification for Shotgun Proteomics Data. https://akrogp.github.io/b10prot/.

Prieto, Gorka, Kerman Aloria, Nerea Osinalde, Asier Fullaondo, Jesus M. Arizmendi, and Rune Matthiesen. 2012. “PAnalyzer: A Software Tool for Protein Inference in Shotgun Proteomics.” BMC Bioinformatics 13 (1): 288. https://doi.org/10.1186/1471-2105-13-288.

Prieto, Gorka, and Jesús Vázquez. 2020. “Protein Probability Model for High-Throughput Protein Identification by Mass Spectrometry-Based Proteomics.” Journal of Proteome Research 19 (3): 1285–97. https://doi.org/10.1021/acs.jproteome.9b00819.

Ramirez, Juanma, Benoit Lectez, Nerea Osinalde, Monika Sivá, Nagore Elu, Kerman Aloria, Michaela Procházková, et al. 2018. “Quantitative proteomics reveals neuronal ubiquitination of Rngo/Ddi1 and several proteasomal subunits by Ube3a, accounting for the complexity of Angelman syndrome.” Human Molecular Genetics 27 (11): 1955–71. https://doi.org/10.1093/hmg/ddy103.