Preprocessing

library(PRONE)

Load Data (TMT)

Here, we are directly working with the SummarizedExperiment data. For more information on how to create the SummarizedExperiment from a proteomics data set, please refer to the “Get Started” vignette.

The example TMT data set originates from Biadglegne et al. (2022).

data("tuberculosis_TMT_se")
se <- tuberculosis_TMT_se

Overview of the Data

To get an overview of the number of missing values (NAs), use the function get_NA_overview():

get_NA_overview(se, ain = "log2")
#>    Total.Values NA.Values NA.Percentage
#>           <int>     <int>         <num>
#> 1:         6020      1945      32.30897
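The three columns of this summary can be reproduced by hand. The following is a minimal base R sketch on a made-up toy matrix (the matrix and variable names are illustrative assumptions, not PRONE internals); it only shows how the total count, the NA count, and the NA percentage relate to each other.

```r
# Toy log2 intensity matrix: 3 proteins (rows) x 4 samples (columns).
# All values are invented for illustration.
log2_mat <- matrix(
  c(10.2, NA,  11.1, 9.8,
    NA,   NA,  NA,   12.4,
    8.7,  9.1, 8.9,  9.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("P1", "P2", "P3"), c("S1", "S2", "S3", "S4"))
)

total_values <- length(log2_mat)    # number of cells in the matrix
na_values <- sum(is.na(log2_mat))   # number of missing cells

na_overview <- data.frame(
  Total.Values = total_values,
  NA.Values = na_values,
  NA.Percentage = 100 * na_values / total_values
)
na_overview
```

Applied to the numbers above, the same formula gives 100 * 1945 / 6020, which matches the reported 32.30897.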

To get an overview of the number of samples per sample group or batch, use the function plot_condition_overview() and specify the metadata column that should be used for coloring. By default (condition = NULL), the column specified in load_data() is used.

plot_condition_overview(se, condition = NULL)
#> Condition of SummarizedExperiment used!


plot_condition_overview(se, condition = "Pool")

A general overview of the protein intensities across the different samples is provided by the function plot_heatmap(). The parameter “ain” specifies the data to plot; currently, only “raw” and “log2” are available (see names(assays(se))). If multiple normalization methods are executed later, their results are saved as additional assays, and the normalized data can be visualized in the same way.

available_ains <- names(SummarizedExperiment::assays(se))

plot_heatmap(se, ain = "log2", color_by = c("Pool", "Group"), label_by = NULL, only_refs = FALSE)
#> Label of SummarizedExperiment used!
#> $log2

Similarly, an upset plot can be generated to visualize the overlaps between sets defined by a specific column in the metadata. The sets are generated by using non-NA values.

plot_upset(se, color_by = NULL, label_by = NULL, mb.ratio = c(0.7,0.3), only_refs = FALSE)
#> Condition of SummarizedExperiment used!
#> Label of SummarizedExperiment used!
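To make the idea of "sets generated from non-NA values" concrete, here is a small base R sketch. It assumes one plausible reading, namely that a protein belongs to a group's set if it has at least one non-NA value in that group; the toy matrix, the group labels, and this membership rule are assumptions for illustration and may differ from what plot_upset() does internally.

```r
# Toy intensity matrix: 4 proteins x 4 samples, invented values.
mat <- matrix(
  c(1,  2,  NA, NA,
    NA, NA, 3,  4,
    5,  NA, 6,  NA,
    NA, NA, NA, 7),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("P", 1:4), paste0("S", 1:4))
)
condition <- c("HC", "HC", "TB", "TB")  # hypothetical group per sample

# For each group, collect proteins with at least one non-NA value.
detected_in <- lapply(split(seq_along(condition), condition), function(cols) {
  rownames(mat)[rowSums(!is.na(mat[, cols, drop = FALSE])) > 0]
})

# Proteins detected in every group (one intersection of an upset plot).
shared <- Reduce(intersect, detected_in)
```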

If you are interested in the intensities of specific biomarkers, you can use the plot_markers_boxplots() function to compare the distribution of intensities per group. The plot can be generated per marker and faceted by normalization method (facet_norm = TRUE), or per normalization method and faceted by marker (facet_marker = TRUE).

p <- plot_markers_boxplots(se, markers = c("Q92954;J3KP74;E9PLR3", "Q9Y6Z7", "Q68CQ4"), ain = "log2", id_column = "Protein.IDs", facet_norm = FALSE, facet_marker = TRUE)
#> Condition of SummarizedExperiment used!
#> No shaping done.
p[[1]] + ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, vjust = 0.5))

Filter Proteins

Remove Proteins With Missing Values in ALL Samples

se <- filter_out_complete_NA_proteins(se)
#> 13 proteins were removed.
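Conceptually, this step drops every row of the intensity matrix that contains no measured value at all. A minimal base R sketch on a toy matrix (names and values are illustrative assumptions, not the PRONE implementation):

```r
# Toy matrix: P2 is missing in all samples.
mat <- matrix(
  c(1,  NA, 3,
    NA, NA, NA,
    4,  5,  NA),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("P1", "P2", "P3"), c("S1", "S2", "S3"))
)

# A protein is "completely NA" if it has zero non-missing values.
all_na <- rowSums(!is.na(mat)) == 0
filtered <- mat[!all_na, , drop = FALSE]

message(sum(all_na), " proteins were removed.")
```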

Remove Proteins With a Specific Value in a Specific Column

For a MaxQuant proteinGroups.txt output file, proteins with “+” in the columns “Reverse”, “Only.identified.by.site”, and “Potential.contaminant” are typically removed.

se <- filter_out_proteins_by_value(se, "Reverse", "+")
#> 17 proteins were removed.
se <- filter_out_proteins_by_value(se, "Only.identified.by.site", "+")
#> 1 proteins were removed.
#se <- filter_out_proteins_by_value(se, "Potential.contaminant", "+")
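The logic of filtering by an annotation value can be sketched in base R on a plain data frame of protein annotations. The helper drop_by_value() below is hypothetical and only mirrors the idea; it is not a PRONE function, and the toy annotation table is invented.

```r
# Toy protein annotation table with MaxQuant-style flag columns.
row_data <- data.frame(
  Protein.IDs = c("P1", "P2", "P3", "P4"),
  Reverse = c("", "+", NA, ""),
  Only.identified.by.site = c("", "", "", "+"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop rows where `column` equals `value`.
# Rows with NA in that column are kept.
drop_by_value <- function(df, column, value) {
  hit <- !is.na(df[[column]]) & df[[column]] == value
  df[!hit, , drop = FALSE]
}

kept <- drop_by_value(row_data, "Reverse", "+")
kept <- drop_by_value(kept, "Only.identified.by.site", "+")
```

Applying the filters one after another, as in the vignette code above, makes it easy to see how many proteins each criterion removes.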

Remove Proteins by ID

If you do not want to remove, for instance, all proteins with “+” in “Potential.contaminant”, you can first retrieve the IDs of the proteins with that value, check them in UniProt, and then remove only selected ones with the function filter_out_proteins_by_ID().

pot_contaminants <- get_proteins_by_value(se, "Potential.contaminant", "+")
#> 24 proteins were identified.
se <- filter_out_proteins_by_ID(se, pot_contaminants)
#> 24 proteins were removed.
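The call above removes all flagged proteins. If a manual review suggests keeping some of them, the ID vector can be reduced before filtering. The following base R sketch shows that selection step on a toy annotation table; the IDs, the keep_after_review vector, and the data frame are invented for illustration.

```r
# Toy annotation table; IDs are placeholders, not real accessions.
row_data <- data.frame(
  Protein.IDs = c("ID_A", "ID_B", "ID_C", "ID_D"),
  Potential.contaminant = c("+", "", "+", ""),
  stringsAsFactors = FALSE
)

# Step 1: collect the IDs of all flagged proteins.
flagged <- row_data$Protein.IDs[row_data$Potential.contaminant == "+"]

# Step 2: after manual review, decide which flagged proteins to keep.
keep_after_review <- c("ID_C")
to_remove <- setdiff(flagged, keep_after_review)

# Step 3: remove only the remaining IDs.
filtered <- row_data[!row_data$Protein.IDs %in% to_remove, , drop = FALSE]
```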

Explore Missing Value Pattern

Due to the large number of missing values in MS-based proteomics data, it is important to explore the missing value pattern in the data. The function plot_NA_heatmap() provides a heatmap of the proteins with at least one missing value across all samples.

plot_NA_heatmap(se, color_by = NULL, label_by = NULL, cluster_samples = TRUE, cluster_proteins = TRUE)
#> Condition of SummarizedExperiment used!
#> Label of SummarizedExperiment used!

Another way to explore the missing value pattern is to use the functions plot_NA_density() and plot_NA_frequency().

plot_NA_density(se)

plot_NA_frequency(se)

Filter Proteins By Applying a Missing Value Threshold

To reduce the number of missing values, you can filter proteins by applying a missing value threshold. The function filter_out_NA_proteins_by_threshold() removes proteins whose fraction of valid (non-missing) values falls below the specified threshold. The threshold is a value between 0 and 1; thr = 0.7, for instance, means that proteins with valid values in less than 70% of the samples are removed, i.e., proteins with more than 30% missing values.

se <- filter_out_NA_proteins_by_threshold(se, thr = 0.7) 
#> 99 proteins were removed.
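The threshold rule can be illustrated with a few lines of base R. This is a sketch on a toy matrix, assuming the rule "keep a protein if its fraction of non-missing values is at least thr" as described above; how PRONE treats proteins exactly at the threshold is not verified here.

```r
# Toy matrix: 3 proteins x 10 samples with 100%, 80%, and 60% valid values.
mat <- matrix(
  c(1, 2, 3, 4, 5, 6, 7, 8,  9,  10,
    1, 2, 3, 4, 5, 6, 7, 8,  NA, NA,
    1, 2, 3, 4, 5, 6, NA, NA, NA, NA),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("P1", "P2", "P3"), paste0("S", 1:10))
)

thr <- 0.7

# Fraction of non-missing values per protein.
valid_fraction <- rowMeans(!is.na(mat))

# Keep proteins that reach the threshold; P3 (60% valid) is dropped.
keep <- valid_fraction >= thr
filtered <- mat[keep, , drop = FALSE]
```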

plot_NA_heatmap(se)
#> Condition of SummarizedExperiment used!
#> Label of SummarizedExperiment used!

Filter Samples

After filtering proteins by different criteria, the samples can be examined in more detail. PRONE provides functions such as plot_nr_prot_samples() and plot_tot_int_samples() to get an overview of the number of proteins and the total intensity per sample, and it also offers the automatic outlier detection method of POMA.

Quality Control

plot_nr_prot_samples(se, color_by = NULL, label_by = NULL)
#> Condition of SummarizedExperiment used!
#> Label of SummarizedExperiment used!


plot_tot_int_samples(se, color_by = NULL, label_by = NULL)
#> Condition of SummarizedExperiment used!
#> Label of SummarizedExperiment used!
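The two quantities behind these plots are simple column summaries. The base R sketch below computes them on a toy log2 matrix; summing intensities on the raw (back-transformed) scale is an assumption made here for illustration and may differ from what plot_tot_int_samples() uses.

```r
# Toy log2 intensity matrix: 3 proteins x 3 samples.
log2_mat <- matrix(
  c(10, NA, 12,
    11, 9,  NA,
    NA, NA, 8),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("P1", "P2", "P3"), c("S1", "S2", "S3"))
)

# Number of quantified (non-missing) proteins per sample.
n_proteins <- colSums(!is.na(log2_mat))

# Total intensity per sample on the raw scale, ignoring missing values.
total_intensity <- colSums(2^log2_mat, na.rm = TRUE)
```

A sample with a markedly lower protein count or total intensity than the rest (S2 in this toy example) is a candidate for closer inspection.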

Remove Samples Manually

Based on these plots, samples “1.HC_Pool1” and “1.HC_Pool2” seem to be outliers. You can remove samples manually by using the remove_samples_manually() function.

se <- remove_samples_manually(se, "Label", c("1.HC_Pool1", "1.HC_Pool2"))
#> 2 samples removed.
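Removing samples by the value of a metadata column amounts to subsetting the columns of the intensity matrix and the matching rows of the sample table. A base R sketch follows; apart from the two labels taken from the call above, all names and values are invented.

```r
# Toy sample metadata; the last two labels are hypothetical.
col_data <- data.frame(
  Column = c("c1", "c2", "c3", "c4"),
  Label = c("1.HC_Pool1", "1.HC_Pool2", "sample_3", "sample_4"),
  stringsAsFactors = FALSE
)

# Toy intensity matrix whose columns follow the order of col_data.
mat <- matrix(seq_len(8), nrow = 2,
              dimnames = list(c("P1", "P2"), col_data$Column))

# Flag samples whose Label matches one of the values to remove.
drop <- col_data$Label %in% c("1.HC_Pool1", "1.HC_Pool2")

# Subset data and metadata together so they stay aligned.
mat_kept <- mat[, !drop, drop = FALSE]
col_kept <- col_data[!drop, , drop = FALSE]

message(sum(drop), " samples removed.")
```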

Remove Reference Samples

You can also remove the reference samples directly using the function remove_reference_samples(). Be careful, though: you may still need them for normalization, which is why we keep them in se for now and store the result in a separate object.

se_no_refs <- remove_reference_samples(se)
#> 2 reference samples removed from the SummarizedExperiment object.

Outlier Detection via POMA R Package

The POMA R package provides a method to detect outliers in proteomics data. The function detect_outliers_POMA() detects outliers in the data based on the POMA algorithm. The function returns a list with the following elements: a polygon plot, a distance boxplot, and the outliers. For further information on the POMA algorithm, please refer to the original publication (Castellano-Escuder et al. 2021).

poma_res <- detect_outliers_POMA(se, ain = "log2")
#> Condition of SummarizedExperiment used!
#> Scale for fill is already present.
#> Adding another scale for fill, which will replace the existing scale.
#> Scale for colour is already present.
#> Adding another scale for colour, which will replace the existing scale.
#> Scale for fill is already present.
#> Adding another scale for fill, which will replace the existing scale.

poma_res$polygon_plot


poma_res$distance_boxplot


DT::datatable(poma_res$outliers, options = list(scrollX = TRUE))
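To give an intuition for distance-based outlier detection, the sketch below implements a simplified rule in base R: compute each sample's Euclidean distance to the median centroid of its group and flag samples whose distance exceeds Q3 + 1.5 * IQR within that group. This is not the POMA implementation; the function flag_outliers(), the choice of centroid and distance, the cutoff, and the toy data are all assumptions for illustration.

```r
# Toy complete matrix: 2 proteins x 10 samples in two groups of five.
mat <- rbind(
  P1 = c(0, 1, 2, 3, 100, 5, 6, 7, 8, 9),
  P2 = rep(0, 10)
)
colnames(mat) <- paste0("S", 1:10)
group <- rep(c("A", "B"), each = 5)

# Hypothetical helper: flag samples far from their group centroid.
flag_outliers <- function(mat, group, coef = 1.5) {
  flagged <- character(0)
  for (g in unique(group)) {
    sub <- mat[, group == g, drop = FALSE]
    # Per-protein median across the samples of this group.
    centroid <- apply(sub, 1, median)
    # Euclidean distance of each sample (column) to the centroid.
    dist_to_centroid <- sqrt(colSums((sub - centroid)^2))
    # Upper fence of the classical boxplot rule.
    q <- quantile(dist_to_centroid, c(0.25, 0.75), names = FALSE)
    upper <- q[2] + coef * (q[2] - q[1])
    flagged <- c(flagged, names(dist_to_centroid)[dist_to_centroid > upper])
  }
  flagged
}

outlier_samples <- flag_outliers(mat, group)
```

In this toy data set only S5 lies far from the other samples of its group, so it is the only sample flagged.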

To remove the outliers detected via the POMA algorithm, pass the outliers table returned by detect_outliers_POMA() to the remove_POMA_outliers() function.

se <- remove_POMA_outliers(se, poma_res$outliers)
#> 1 outlier samples removed.

Session Info

utils::sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Sonoma 14.4
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: Europe/Berlin
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices datasets  utils     methods   base     
#> 
#> other attached packages:
#> [1] PRONE_0.99.6
#> 
#> loaded via a namespace (and not attached):
#>   [1] RColorBrewer_1.1-3          rstudioapi_0.16.0          
#>   [3] jsonlite_1.8.8              shape_1.4.6.1              
#>   [5] MultiAssayExperiment_1.30.3 magrittr_2.0.3             
#>   [7] farver_2.1.2                MALDIquant_1.22.2          
#>   [9] rmarkdown_2.27              GlobalOptions_0.1.2        
#>  [11] fs_1.6.4                    zlibbioc_1.50.0            
#>  [13] ragg_1.3.2                  vctrs_0.6.5                
#>  [15] janitor_2.2.0               htmltools_0.5.8.1          
#>  [17] S4Arrays_1.4.1              SparseArray_1.4.8          
#>  [19] mzID_1.42.0                 sass_0.4.9                 
#>  [21] bslib_0.7.0                 htmlwidgets_1.6.4          
#>  [23] desc_1.4.3                  plyr_1.8.9                 
#>  [25] lubridate_1.9.3             impute_1.78.0              
#>  [27] cachem_1.1.0                igraph_2.0.3               
#>  [29] lifecycle_1.0.4             iterators_1.0.14           
#>  [31] pkgconfig_2.0.3             Matrix_1.7-0               
#>  [33] R6_2.5.1                    fastmap_1.2.0              
#>  [35] snakecase_0.11.1            GenomeInfoDbData_1.2.12    
#>  [37] MatrixGenerics_1.16.0       clue_0.3-65                
#>  [39] digest_0.6.36               pcaMethods_1.96.0          
#>  [41] colorspace_2.1-0            S4Vectors_0.42.1           
#>  [43] crosstalk_1.2.1             textshaping_0.4.0          
#>  [45] GenomicRanges_1.56.1        vegan_2.6-6.1              
#>  [47] labeling_0.4.3              timechange_0.3.0           
#>  [49] fansi_1.0.6                 httr_1.4.7                 
#>  [51] abind_1.4-5                 mgcv_1.9-1                 
#>  [53] compiler_4.4.1              withr_3.0.0                
#>  [55] doParallel_1.0.17           BiocParallel_1.38.0        
#>  [57] UpSetR_1.4.0                highr_0.11                 
#>  [59] MASS_7.3-61                 DelayedArray_0.30.1        
#>  [61] rjson_0.2.21                permute_0.9-7              
#>  [63] mzR_2.38.0                  tools_4.4.1                
#>  [65] PSMatch_1.8.0               glue_1.7.0                 
#>  [67] nlme_3.1-164                QFeatures_1.14.2           
#>  [69] gridtext_0.1.5              grid_4.4.1                 
#>  [71] cluster_2.1.6               reshape2_1.4.4             
#>  [73] generics_0.1.3              gtable_0.3.5               
#>  [75] preprocessCore_1.66.0       tidyr_1.3.1                
#>  [77] data.table_1.15.4           xml2_1.3.6                 
#>  [79] utf8_1.2.4                  XVector_0.44.0             
#>  [81] BiocGenerics_0.50.0         foreach_1.5.2              
#>  [83] pillar_1.9.0                stringr_1.5.1              
#>  [85] limma_3.60.4                circlize_0.4.16            
#>  [87] splines_4.4.1               dplyr_1.1.4                
#>  [89] ggtext_0.1.2                lattice_0.22-6             
#>  [91] renv_1.0.7                  tidyselect_1.2.1           
#>  [93] ComplexHeatmap_2.20.0       knitr_1.48                 
#>  [95] gridExtra_2.3               IRanges_2.38.1             
#>  [97] ProtGenerics_1.36.0         SummarizedExperiment_1.34.0
#>  [99] stats4_4.4.1                xfun_0.46                  
#> [101] Biobase_2.64.0              statmod_1.5.0              
#> [103] MSnbase_2.30.1              matrixStats_1.3.0          
#> [105] DT_0.33                     stringi_1.8.4              
#> [107] UCSC.utils_1.0.0            lazyeval_0.2.2             
#> [109] yaml_2.3.10                 evaluate_0.24.0            
#> [111] codetools_0.2-20            MsCoreUtils_1.16.0         
#> [113] tibble_3.2.1                BiocManager_1.30.23        
#> [115] cli_3.6.3                   affyio_1.74.0              
#> [117] systemfonts_1.1.0           munsell_0.5.1              
#> [119] jquerylib_0.1.4             Rcpp_1.0.13                
#> [121] GenomeInfoDb_1.40.1         png_0.1-8                  
#> [123] XML_3.99-0.17               parallel_4.4.1             
#> [125] pkgdown_2.1.0               ggplot2_3.5.1              
#> [127] dendsort_0.3.4              AnnotationFilter_1.28.0    
#> [129] scales_1.3.0                affy_1.82.0                
#> [131] ncdf4_1.22                  purrr_1.0.2                
#> [133] crayon_1.5.3                POMA_1.14.0                
#> [135] GetoptLong_1.0.5            rlang_1.1.4                
#> [137] vsn_3.72.0

References

Biadglegne, Fantahun, Johannes R. Schmidt, Kathrin M. Engel, Jörg Lehmann, Robert T. Lehmann, Anja Reinert, Brigitte König, Jürgen Schiller, Stefan Kalkhof, and Ulrich Sack. 2022. “Mycobacterium Tuberculosis Affects Protein and Lipid Content of Circulating Exosomes in Infected Patients Depending on Tuberculosis Disease State.” Biomedicines 10 (4): 783. https://doi.org/10.3390/biomedicines10040783.

Castellano-Escuder, Pol, Raúl González-Domínguez, Francesc Carmona-Pontaque, Cristina Andrés-Lacueva, and Alex Sánchez-Pla. 2021. “POMAShiny: A User-Friendly Web-Based Workflow for Metabolomics and Proteomics Data Analysis.” Edited by Manja Marz. PLOS Computational Biology 17 (7): e1009148. https://doi.org/10.1371/journal.pcbi.1009148.

Arend Lis