Introduction
This guide introduces the PRONE R package and shows how to prepare your data set for use with its functions. It first describes the data structure that the package relies on, then briefly shows how to apply different normalization techniques to your data, and finally shows how to export the normalized data.
Beyond the scope of this introductory tutorial, PRONE encompasses a broad range of functionalities, from preprocessing, imputation, normalization, and evaluation of the performance of different normalization techniques to the identification of differentially expressed proteins. These functionalities are described in dedicated vignettes, which give detailed instructions for using the full capabilities of the PRONE package.
Furthermore, PRONE provides additional functionalities for the analysis of spike-in data sets, which are described in a separate vignette.
Installation
# Install PRONE.R from GitHub and build vignettes
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("lisiarend/PRONE.R", build_vignettes = TRUE)
Load Data
PRONE uses the SummarizedExperiment class to store the protein intensities and the meta-data information of the proteomics data set. Hence, before you can use the functionalities of PRONE, the data needs to be stored in this format. The load_data() function does this and takes the following parameters:
- data: refers to the data.frame containing the protein intensities
- md: refers to the data.frame containing the meta-data information
- protein_column: refers to the column in the data frame that contains the protein IDs
- gene_column (optional): refers to the column in the data frame that contains the gene IDs
- condition_column (optional): refers to the column in the meta-data table that contains the condition information - this can also be specified later
- label_column (optional): refers to the column in the meta-data table that contains the label information - sometimes
If you have a TMT data set with samples measured in different batches, then you have to specify the batch information. If reference samples were included in each batch, then additionally specify the sample names of the reference samples.
- batch_column (optional): refers to the column in the meta-data table that contains the batch information
- ref_samples (optional): refers to the samples that should be used as reference samples for normalization
Attention: You need to make sure that the sample names are saved in a column named “Column” in the meta-data table and are named accordingly in the protein intensity table.
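The naming requirement above can be checked before calling load_data(). The following is a minimal base R sketch with made-up toy tables (the sample names, conditions, and intensities are hypothetical), assuming the intensity table was read with the read.csv() default check.names = TRUE, which passes the headers through make.names():

```r
# Toy intensity table: one row per protein, one column per sample (hypothetical values)
data <- data.frame(
  Protein.IDs = c("P1", "P2", "P3"),
  Sample.A = c(1200, 850, NA),
  Sample.B = c(1100, 900, 430),
  stringsAsFactors = FALSE
)

# Toy meta-data table: the sample names must be stored in a column named "Column"
md <- data.frame(
  Column = c("Sample A", "Sample B"),
  Condition = c("control", "treated"),
  stringsAsFactors = FALSE
)

# Apply the same name sanitization that read.csv() applied to the intensity headers
md$Column <- make.names(md$Column)

# Every sample listed in the meta-data must exist as a column in the intensity table
missing_samples <- setdiff(md$Column, colnames(data))
if (length(missing_samples) > 0) {
  stop("Samples missing from the intensity table: ",
       paste(missing_samples, collapse = ", "))
}
```

Note that make.names() also prefixes names that start with a digit with "X", so reading the intensity table with check.names = FALSE is an alternative that keeps the original sample names unchanged.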
Example 1: TMT Data Set
The example TMT data set originates from (Biadglegne et al. 2022).
data_path <- readPRONE_example("tuberculosis_protein_intensities.csv")
md_path <- readPRONE_example("tuberculosis_metadata.csv")
data <- read.csv(data_path)
md <- read.csv(md_path)
md$Column <- stringr::str_replace_all(md$Column, " ", ".")
ref_samples <- md[md$Group == "ref",]$Column
se <- load_data(data, md, protein_column = "Protein.IDs", gene_column = "Gene.names", ref_samples = ref_samples, batch_column = "Pool", condition_column = "Group", label_column = "Label")
Example 2: LFQ Data Set
The example data set originates from (Vehmas et al. 2016). This data set is used for the subsequent examples in this tutorial.
data_path <- readPRONE_example("mouse_liver_cytochrome_P450_protein_intensities.csv")
md_path <- readPRONE_example("mouse_liver_cytochrome_P450_metadata.csv")
data <- read.csv(data_path, check.names = FALSE)
md <- read.csv(md_path)
se <- load_data(data, md, protein_column = "Accession", gene_column = "Gene names", ref_samples = NULL, batch_column = NULL, condition_column = "Condition", label_column = NULL)
Data Structure
The SummarizedExperiment object contains the protein intensities as “assay”, the meta-data table as “colData”, and additional columns, for instance those produced by MaxQuant, as “rowData”. Furthermore, information on the different columns, for instance which column contains the batch information, can be found in the “metadata” slot.
se
#> class: SummarizedExperiment
#> dim: 1499 12
#> metadata(4): condition batch refs label
#> assays(2): raw log2
#> rownames(1499): 1 2 ... 1498 1499
#> rowData names(4): Gene.Names Protein.IDs Peptides used for quantitation
#> IDs
#> colnames(12): 2206_WT 2208_WT ... 2285_Arom 2253_Arom
#> colData names(3): Column Animal Condition
The different data types can be accessed by using the assays() function. Currently, only the raw data and log2-transformed data are stored in the SummarizedExperiment object.
SummarizedExperiment::assays(se)
#> List of length 2
#> names(2): raw log2
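The slots described in this section can be reached with the standard SummarizedExperiment accessors. The sketch below builds a small toy object by hand (the sample names, protein IDs, and intensities are hypothetical, and the object is not created by load_data()), so it only illustrates the general layout rather than the exact content PRONE stores:

```r
library(SummarizedExperiment)

# Toy raw intensities: 2 proteins (rows) x 2 samples (columns)
raw <- matrix(
  c(1024, 2048, 512, 4096),
  nrow = 2,
  dimnames = list(NULL, c("S1", "S2"))
)
md <- data.frame(Column = c("S1", "S2"), Condition = c("WT", "Arom"))

toy_se <- SummarizedExperiment(
  assays = list(raw = raw, log2 = log2(raw)),
  rowData = S4Vectors::DataFrame(Protein.IDs = c("P1", "P2")),
  colData = S4Vectors::DataFrame(md, row.names = md$Column),
  metadata = list(condition = "Condition")  # records which colData column holds the condition
)

log2_mat <- assay(toy_se, "log2")              # intensity matrix of a single assay
sample_info <- colData(toy_se)                 # meta-data table (one row per sample)
protein_info <- rowData(toy_se)                # per-protein annotation columns
column_roles <- S4Vectors::metadata(toy_se)    # list describing the special columns
```

The same accessor calls should work on the `se` object created above, for example `assay(se, "log2")`.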
Preprocessing, Imputation, Normalization, Evaluation, and Differential Expression
As already mentioned in the introduction section, many functionalities are available in PRONE. All these functionalities are mainly based on the SummarizedExperiment object.
In this tutorial, we will only perform simple normalization of the data using median and LoessF normalization.
se <- normalize_se(se, c("Median", "LoessF"))
#> Median completed.
#> LoessF completed.
The normalized intensities will be saved as additional assays in the SummarizedExperiment object.
SummarizedExperiment::assays(se)
#> List of length 4
#> names(4): raw log2 Median LoessF
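To give a feel for what a median-based normalization does, here is a minimal base R sketch on a toy log2 matrix (hypothetical values). It shifts every sample so that all sample medians coincide with the mean of the original medians. This is a generic illustration and not necessarily how the "Median" method is implemented in PRONE:

```r
# Toy log2 intensities: 3 proteins (rows) x 3 samples (columns), with one missing value
log2_mat <- matrix(
  c(10, 12, 14,   # sample S1
    11, 13, 15,   # sample S2
     9, 11, NA),  # sample S3
  nrow = 3,
  dimnames = list(c("P1", "P2", "P3"), c("S1", "S2", "S3"))
)

# Median of each sample, ignoring missing values
sample_medians <- apply(log2_mat, 2, median, na.rm = TRUE)

# Common target level: the mean of the sample medians
target <- mean(sample_medians)

# Subtract each sample's offset from the target from all of its values
normalized <- sweep(log2_mat, 2, sample_medians - target, FUN = "-")

# After the shift, all samples share the same median
apply(normalized, 2, median, na.rm = TRUE)
```

Because the shift is applied on the log2 scale, it corresponds to multiplying each sample's raw intensities by a constant factor.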
Again, more information on the individual processes can be found in the dedicated vignettes.
Download Data
Finally, you can easily download the normalized data by using the export_data() function. This function will save the specified assays as CSV files and the SummarizedExperiment object as an RDS file in a specified output directory. Make sure that the output directory exists.
if(!dir.exists("output/")) dir.create("output/")
export_data(se, out_dir = "output/", ain = c("log2", "Median", "LoessF"))
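As a rough illustration of what exporting to CSV and RDS involves, the following base R sketch writes a toy matrix to both formats and reads the CSV back. The file names and the toy values are hypothetical; this is not the code behind export_data(), and the files that export_data() writes may be named and laid out differently:

```r
# Write into a temporary directory so that the sketch leaves no files behind
out_dir <- file.path(tempdir(), "output")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Toy normalized intensities: 2 proteins (rows) x 2 samples (columns)
mat <- matrix(
  c(10.5, 11.25, 9.75, 12.5),
  nrow = 2,
  dimnames = list(c("P1", "P2"), c("S1", "S2"))
)

# One CSV file per assay (hypothetical file name)
csv_path <- file.path(out_dir, "Median_example.csv")
write.csv(mat, csv_path)

# The complete R object goes into a single RDS file (hypothetical file name)
rds_path <- file.path(out_dir, "example_object.rds")
saveRDS(list(Median = mat), rds_path)

# Reading the files back recovers the same values
restored_csv <- as.matrix(read.csv(csv_path, row.names = 1, check.names = FALSE))
restored_rds <- readRDS(rds_path)
```

The CSV files are convenient for use outside of R, while the RDS file lets you reload the complete object with readRDS() in a later R session.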
Session Info
utils::sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Sonoma 14.4
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: Europe/Berlin
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices datasets utils methods base
#>
#> other attached packages:
#> [1] PRONE_0.99.6
#>
#> loaded via a namespace (and not attached):
#> [1] rlang_1.1.4 magrittr_2.0.3
#> [3] clue_0.3-65 matrixStats_1.3.0
#> [5] compiler_4.4.1 systemfonts_1.1.0
#> [7] vctrs_0.6.5 reshape2_1.4.4
#> [9] stringr_1.5.1 ProtGenerics_1.36.0
#> [11] pkgconfig_2.0.3 crayon_1.5.3
#> [13] fastmap_1.2.0 XVector_0.44.0
#> [15] utf8_1.2.4 rmarkdown_2.27
#> [17] UCSC.utils_1.0.0 preprocessCore_1.66.0
#> [19] ragg_1.3.2 purrr_1.0.2
#> [21] xfun_0.46 MultiAssayExperiment_1.30.3
#> [23] zlibbioc_1.50.0 cachem_1.1.0
#> [25] GenomeInfoDb_1.40.1 jsonlite_1.8.8
#> [27] DelayedArray_0.30.1 BiocParallel_1.38.0
#> [29] parallel_4.4.1 cluster_2.1.6
#> [31] R6_2.5.1 bslib_0.7.0
#> [33] stringi_1.8.4 limma_3.60.4
#> [35] GenomicRanges_1.56.1 jquerylib_0.1.4
#> [37] iterators_1.0.14 Rcpp_1.0.13
#> [39] SummarizedExperiment_1.34.0 knitr_1.48
#> [41] IRanges_2.38.1 Matrix_1.7-0
#> [43] igraph_2.0.3 tidyselect_1.2.1
#> [45] rstudioapi_0.16.0 abind_1.4-5
#> [47] yaml_2.3.10 ggtext_0.1.2
#> [49] doParallel_1.0.17 codetools_0.2-20
#> [51] affy_1.82.0 lattice_0.22-6
#> [53] tibble_3.2.1 plyr_1.8.9
#> [55] withr_3.0.0 Biobase_2.64.0
#> [57] evaluate_0.24.0 desc_1.4.3
#> [59] xml2_1.3.6 pillar_1.9.0
#> [61] affyio_1.74.0 BiocManager_1.30.23
#> [63] MatrixGenerics_1.16.0 renv_1.0.7
#> [65] foreach_1.5.2 stats4_4.4.1
#> [67] MSnbase_2.30.1 MALDIquant_1.22.2
#> [69] ncdf4_1.22 generics_0.1.3
#> [71] S4Vectors_0.42.1 ggplot2_3.5.1
#> [73] munsell_0.5.1 scales_1.3.0
#> [75] glue_1.7.0 lazyeval_0.2.2
#> [77] tools_4.4.1 data.table_1.15.4
#> [79] mzID_1.42.0 QFeatures_1.14.2
#> [81] vsn_3.72.0 mzR_2.38.0
#> [83] fs_1.6.4 XML_3.99-0.17
#> [85] grid_4.4.1 impute_1.78.0
#> [87] tidyr_1.3.1 MsCoreUtils_1.16.0
#> [89] colorspace_2.1-0 GenomeInfoDbData_1.2.12
#> [91] PSMatch_1.8.0 cli_3.6.3
#> [93] textshaping_0.4.0 fansi_1.0.6
#> [95] S4Arrays_1.4.1 dplyr_1.1.4
#> [97] AnnotationFilter_1.28.0 pcaMethods_1.96.0
#> [99] gtable_0.3.5 sass_0.4.9
#> [101] digest_0.6.36 BiocGenerics_0.50.0
#> [103] SparseArray_1.4.8 htmlwidgets_1.6.4
#> [105] htmltools_0.5.8.1 pkgdown_2.1.0
#> [107] lifecycle_1.0.4 httr_1.4.7
#> [109] statmod_1.5.0 gridtext_0.1.5
#> [111] MASS_7.3-61