CPSM is an R package that provides a computational pipeline for predicting the survival probability of cancer patients. It performs the following steps: data processing, splitting the data into training and test subsets, data normalization, selection of significant features based on univariate survival analysis, generation of a LASSO prognostic index (PI) score, development of prediction models for survival probability based on different feature sets, and visualization of the results as survival curves (from the predicted survival probabilities) and bar plots (of the predicted mean and median survival times of patients).
#Load Packages
# Load CPSM packages
library(CPSM)
# Load other required packages
library(preprocessCore)
library(ggfortify)
library(survival)
library(survminer)
library(dplyr)
library(ggplot2)
library(MASS)
library(MTLR)
library(SurvMetrics)
library(pec)
library(glmnet)
library(reshape2)
library(rms)
library(Matrix)
library(Hmisc)
library(survivalROC)
library(ROCR)
Example input data: “Example_TCGA_LGG_FPKM_data” is a tab-separated file with samples (184 LGG cancer samples) in the rows and features in the columns. Gene expression is given as FPKM values.
Features: the data contain 11 clinical and demographic features, 4 types of survival with time and event information, and 19,978 protein-coding genes.
Clinical and demographic features: Age, subtype, gender, race, ajcc_pathologic_tumor_stage, histological_type, histological_grade, treatment_outcome_first_course, radiation_treatment_adjuvant, sample_type, and type.
Types of survival: OS (overall survival), PFS (progression-free survival), DSS (disease-specific survival), and DFS (disease-free survival). The columns OS, PFS, DSS, and DFS hold the event information, while OS.time, PFS.time, DSS.time, and DFS.time hold the survival time in days.
library(SummarizedExperiment)
data(Example_TCGA_LGG_FPKM_data, package = "CPSM")
Example_TCGA_LGG_FPKM_data
#> class: SummarizedExperiment
#> dim: 184 2024
#> metadata(0):
#> assays(1): expression
#> rownames(184): TCGA-TM-A7CA-01 TCGA-DU-A6S3-01 ... TCGA-E1-A7YM-01
#> TCGA-DH-5143-01
#> rowData names(1): rownames.LGG_FPKM_data.
#> colnames(2024): Age subtype ... BAZ1B BAZ2A
#> colData names(1): colnames.LGG_FPKM_data.
The data_process_f function converts OS time (in days) into months and
removes samples for which OS or OS.time information is missing. The
input data should be in tsv or txt format. We also need to define
col_num (the column number at which the clinical/demographic and
survival information ends, e.g. 20), surv_time (the name of the column
that contains survival time in days, e.g. OS.time), and an output name,
e.g. “New_data.txt”. In the call below, the processed data are returned
as a data frame and stored in New_data.
data(Example_TCGA_LGG_FPKM_data, package = "CPSM")
New_data <- data_process_f(assays(Example_TCGA_LGG_FPKM_data)$expression,
col_num = 20, surv_time = "OS.time"
)
str(New_data[1:10])
#> 'data.frame': 176 obs. of 10 variables:
#> $ Age : num 44.9 60.3 57.9 45.7 70.7 ...
#> $ subtype : chr "PN" "PN" NA "PN" ...
#> $ gender : chr "Male" "Male" "Female" "Male" ...
#> $ race : chr "WHITE" "WHITE" "WHITE" "WHITE" ...
#> $ ajcc_pathologic_tumor_stage : logi NA NA NA NA NA NA ...
#> $ histological_type : chr "Astrocytoma" "Oligodendroglioma" "Astrocytoma" "Oligodendroglioma" ...
#> $ histological_grade : chr "G2" "G2" "G3" "G3" ...
#> $ treatment_outcome_first_course: chr "Complete Remission/Response" "Stable Disease" NA NA ...
#> $ radiation_treatment_adjuvant : chr "NO" "NO" NA "YES" ...
#> $ sample_type : chr "Primary" "Primary" "Primary" "Primary" ...
After data processing, we obtain a new output, “New_data”, which
contains 176 samples. Thus, the data_process_f function removed 8
samples for which OS/OS.time information was missing. In addition, the
data now have a new 21st column named “OS_month”, which holds the OS
time in months.
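The two operations described above, dropping samples with missing survival information and converting days to months, can be illustrated on a toy data frame. This is only a sketch: the toy values, the variable names, and the conversion factor (365.25 / 12 days per month, rounded to whole months) are assumptions made for illustration and are not taken from the data_process_f source.

```r
# Toy data (hypothetical values) mimicking the OS and OS.time columns
toy <- data.frame(
  OS      = c(1, 0, NA, 1),
  OS.time = c(365, NA, 200, 913)
)

# Keep only samples that have both event and time information
keep  <- !is.na(toy$OS) & !is.na(toy$OS.time)
clean <- toy[keep, ]

# Convert days to months (assumed average month length: 365.25 / 12 days)
days_per_month <- 365.25 / 12
clean$OS_month <- round(clean$OS.time / days_per_month)

print(clean)
```

Two of the four toy samples are dropped because one of the two survival fields is missing; the remaining rows gain an OS_month column.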
Before proceeding further, we need to split the data into training and test subsets for feature selection and model development. The input is the output of the previous step (“New_data.txt”). Next, we need to define the fraction (e.g. 0.9) by which to split the data: fraction=0.9 splits the data into 90% training and 10% test sets. We also need to provide training and test set output names (e.g. train_FPKM.txt, test_FPKM.txt).
data(New_data, package = "CPSM")
# Call the function
result <- tr_test_f(data = assays(New_data)$expression, fraction = 0.9)
# Access the train and test data
train_FPKM <- result$train_data
str(train_FPKM[1:10])
#> 'data.frame': 158 obs. of 10 variables:
#> $ Age : num 53.1 54.3 38.1 25.8 46.2 ...
#> $ subtype : chr "PN" "ME" "NE" "NE" ...
#> $ gender : chr "Female" "Male" "Female" "Female" ...
#> $ race : chr "WHITE" "WHITE" "WHITE" "NOT AVAILABLE" ...
#> $ ajcc_pathologic_tumor_stage : logi NA NA NA NA NA NA ...
#> $ histological_type : chr "Oligodendroglioma" "Oligodendroglioma" "Oligoastrocytoma" "Astrocytoma" ...
#> $ histological_grade : chr "G3" "G3" "G3" "G2" ...
#> $ treatment_outcome_first_course: chr NA "Progressive Disease" NA "Complete Remission/Response" ...
#> $ radiation_treatment_adjuvant : chr "YES" NA "NO" "NO" ...
#> $ sample_type : chr "Primary" "Primary" "Primary" "Primary" ...
test_FPKM <- result$test_data
str(test_FPKM[1:10])
#> 'data.frame': 18 obs. of 10 variables:
#> $ Age : num 70.7 34.6 32.4 61 34.4 ...
#> $ subtype : chr "PN" NA "PN" "PN" ...
#> $ gender : chr "Male" "Female" "Male" "Male" ...
#> $ race : chr "WHITE" "WHITE" "WHITE" "WHITE" ...
#> $ ajcc_pathologic_tumor_stage : logi NA NA NA NA NA NA ...
#> $ histological_type : chr "Oligodendroglioma" "Astrocytoma" "Oligodendroglioma" "Oligodendroglioma" ...
#> $ histological_grade : chr "G3" "G2" "G2" "G2" ...
#> $ treatment_outcome_first_course: chr "Stable Disease" "Stable Disease" "Partial Remission/Response" "Complete Remission/Response" ...
#> $ radiation_treatment_adjuvant : chr NA NA "YES" "YES" ...
#> $ sample_type : chr "Recurrent" "Primary" "Primary" "Primary" ...
After the train-test split, we obtain two new outputs, “train_FPKM” and “test_FPKM”, where train_FPKM contains 158 samples and test_FPKM contains 18 samples. Thus, the tr_test_f function splits the data in a 90:10 ratio.
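For readers who want to see what a fraction-based split amounts to, here is a minimal base R sketch on toy data. The data frame, the seed, and the use of round() to obtain the training size are assumptions for illustration; tr_test_f may sample differently internally.

```r
set.seed(7)

# Toy data (hypothetical): 20 samples with survival columns
toy <- data.frame(
  sample_id = paste0("S", 1:20),
  OS        = rbinom(20, 1, 0.5),
  OS_month  = sample(1:100, 20)
)

fraction <- 0.9
n_train  <- round(fraction * nrow(toy))

# Randomly pick the training rows; the remaining rows form the test set
train_idx  <- sample(seq_len(nrow(toy)), size = n_train)
train_data <- toy[train_idx, ]
test_data  <- toy[-train_idx, ]

c(train = nrow(train_data), test = nrow(test_data))
```

With the same rule, 176 samples and fraction = 0.9 give 158 training and 18 test samples, which matches the sizes reported above.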
Next, to select features and develop ML models, the data must be
normalized. Since expression is available as FPKM values, the
train_test_normalization_f function first converts the FPKM values to
log scale [log2(FPKM+1)] and then applies quantile normalization using
the “preprocessCore” package. The training data are used as the target
matrix for quantile normalization. We need to provide the training and
test datasets obtained from the previous train/test split step, and the
column number at which the clinical information ends (e.g. 21) in the
input datasets. We also need to provide output names:
train_clin_data (only the clinical information of the training data),
test_clin_data (only the clinical information of the test data),
train_Normalized_data_clin_data (clinical information and normalized
gene values of the training samples), and
test_Normalized_data_clin_data (clinical information and normalized
gene values of the test samples).
# Step 3 - Data Normalization
# Normalize the training and test data sets
data(train_FPKM, package = "CPSM")
data(test_FPKM, package = "CPSM")
Result_N_data <- train_test_normalization_f(
train_data = train_FPKM,
test_data = test_FPKM,
col_num = 21
)
# Access the Normalized train and test data
Train_Clin <- Result_N_data$Train_Clin
Test_Clin <- Result_N_data$Test_Clin
Train_Norm_data <- Result_N_data$Train_Norm_data
Test_Norm_data <- Result_N_data$Test_Norm_data
str(Train_Clin[1:10])
#> 'data.frame': 158 obs. of 10 variables:
#> $ Age : num 30.8 36.5 38.9 65.1 32.3 ...
#> $ subtype : chr "PN" NA "NE" "CL" ...
#> $ gender : chr "Male" "Male" "Male" "Female" ...
#> $ race : chr "WHITE" "WHITE" "WHITE" "WHITE" ...
#> $ ajcc_pathologic_tumor_stage : logi NA NA NA NA NA NA ...
#> $ histological_type : chr "Oligodendroglioma" "Oligoastrocytoma" "Oligoastrocytoma" "Astrocytoma" ...
#> $ histological_grade : chr "G3" "G2" "G2" "G3" ...
#> $ treatment_outcome_first_course: chr NA "Complete Remission/Response" NA "Partial Remission/Response" ...
#> $ radiation_treatment_adjuvant : chr "YES" NA "NO" "YES" ...
#> $ sample_type : chr "Primary" "Primary" "Primary" "Primary" ...
str(Train_Norm_data[1:10])
#> 'data.frame': 158 obs. of 10 variables:
#> $ Age : num 30.8 36.5 38.9 65.1 32.3 ...
#> $ subtype : chr "PN" NA "NE" "CL" ...
#> $ gender : chr "Male" "Male" "Male" "Female" ...
#> $ race : chr "WHITE" "WHITE" "WHITE" "WHITE" ...
#> $ ajcc_pathologic_tumor_stage : logi NA NA NA NA NA NA ...
#> $ histological_type : chr "Oligodendroglioma" "Oligoastrocytoma" "Oligoastrocytoma" "Astrocytoma" ...
#> $ histological_grade : chr "G3" "G2" "G2" "G3" ...
#> $ treatment_outcome_first_course: chr NA "Complete Remission/Response" NA "Partial Remission/Response" ...
#> $ radiation_treatment_adjuvant : chr "YES" NA "NO" "YES" ...
#> $ sample_type : chr "Primary" "Primary" "Primary" "Primary" ...
After running the function, we obtain 4 outputs: Train_Clin, which contains only the clinical features of the training samples; Test_Clin, which contains only the clinical features of the test samples; Train_Norm_data, which contains the clinical features with normalized gene values for the training samples; and Test_Norm_data, which contains the clinical features with normalized gene values for the test samples.
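The idea behind this step, a log2(FPKM+1) transform followed by quantile normalization against a target derived from the training data, can be sketched in base R. This is a simplified illustration and not the preprocessCore implementation (which, for example, treats ties differently); the toy matrices and the helper name to_target are hypothetical.

```r
set.seed(1)

# Toy FPKM matrices (hypothetical): genes in rows, samples in columns
train_fpkm <- matrix(rexp(5 * 4, rate = 0.1), nrow = 5)
test_fpkm  <- matrix(rexp(5 * 2, rate = 0.1), nrow = 5)

# Log-transform: log2(FPKM + 1)
train_log <- log2(train_fpkm + 1)
test_log  <- log2(test_fpkm + 1)

# Target distribution: mean of the sorted training columns
target <- rowMeans(apply(train_log, 2, sort))

# Map every sample onto the target distribution by rank
to_target <- function(x, target) {
  apply(x, 2, function(col) target[rank(col, ties.method = "first")])
}

train_norm <- to_target(train_log, target)
test_norm  <- to_target(test_log, target)
```

Because the target is built only from the training samples, the test samples are forced onto the training distribution without contributing to it.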
Next, to create a survival model, we compute a Prognostic Index (PI)
score. The PI score is calculated from the expression of the features
selected by the LASSO regression model and their beta coefficients. For
instance, suppose 5 features (G1, G2, G3, G4, and G5, with coefficient
values B1, B2, B3, B4, and B5, respectively) are selected by the LASSO
method. The PI score is then computed as follows:
PI score = G1 * B1 + G2 * B2 + G3 * B3 + G4 * B4 + G5 * B5
Here, we need to provide the normalized training (Train_Norm_data) and
test (Test_Norm_data) data obtained from the previous function,
“train_test_normalization_f”, as input. We also need to provide col_num
(the column number at which the clinical features end, e.g. 21) and
nfolds (the number of cross-validation folds, e.g. 5) for the LASSO
regression method used to select features. LASSO is implemented using
the “glmnet” package. In addition, we need to provide surv_time (the
name of the column containing survival time in months, e.g. OS_month)
and surv_event (the name of the column containing survival event
information, e.g. OS). We also need to provide training and test output
names to store the data containing the LASSO genes and PI values.
# Step 4 - Lasso PI Score
data(Train_Norm_data, package = "CPSM")
data(Test_Norm_data, package = "CPSM")
Result_PI <- Lasso_PI_scores_f(
train_data = Train_Norm_data,
test_data = Test_Norm_data,
nfolds = 5,
col_num = 21,
surv_time = "OS_month",
surv_event = "OS"
)
Train_Lasso_key_variables <- Result_PI$Train_Lasso_key_variables
Train_PI_data <- Result_PI$Train_PI_data
Test_PI_data <- Result_PI$Test_PI_data
str(Train_PI_data[1:10])
#> 'data.frame': 158 obs. of 10 variables:
#> $ OS : int 1 1 0 1 0 0 0 0 0 0 ...
#> $ OS_month : int 78 27 48 4 56 39 34 15 43 44 ...
#> $ AADACL4 : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ ABCA12 : num 0.212 0.297 0.053 0.358 0.064 0.032 0.09 0.051 0.1 0.014 ...
#> $ ABCC3 : num 0.1 5.476 0.029 0.108 0.037 ...
#> $ ABI1 : num 44.7 15.3 43.7 27.7 31.6 ...
#> $ ABRA : num 0 0.035 0 0 0.015 0.034 0.006 0 0.009 0.005 ...
#> $ AC006059.2: num 0 0 0 0 0 0 0 0 0 0 ...
#> $ AC008676.3: num 0.032 0.034 0.032 0 0.014 0.04 0.011 0.01 0.017 0.042 ...
#> $ AC008764.4: num 0.038 0 0 0 0 0 0.008 0 0 0.015 ...
str(Test_PI_data[1:10])
#> 'data.frame': 18 obs. of 10 variables:
#> $ OS : int 1 1 0 0 0 0 0 0 0 1 ...
#> $ OS_month : int 32 27 2 46 18 26 17 28 64 114 ...
#> $ AADACL4 : num 0.011 0 0.016 0 0 0 0 0 0 0 ...
#> $ ABCA12 : num 0.048 0.165 0.044 0.033 0.053 0.02 0.09 0.038 0.058 0.096 ...
#> $ ABCC3 : num 0.679 1.358 0.184 0.321 0.024 ...
#> $ ABI1 : num 39.4 19.9 31 27.8 56.4 ...
#> $ ABRA : num 0 0.001 0.029 0 0 0 0.026 0.004 0 0.014 ...
#> $ AC006059.2: int 0 0 0 0 0 0 0 0 0 0 ...
#> $ AC008676.3: num 0.002 0.054 0.012 0.092 0.031 0.003 0.004 0.078 0.046 0 ...
#> $ AC008764.4: num 0.036 0.009 0 0 0.015 0 0 0 0 0 ...
plot(Result_PI$cvfit)
Thus, Lasso_PI_scores_f gives us the following outputs:
1. Train_Lasso_key_variables: list of features selected by LASSO and their beta coefficient values.
2. Train_Cox_Lasso_Regression_lamda_plot: LASSO regression lambda plot.
3. Train_PI_data: expression of the genes selected by LASSO, with the PI score in the last column, for the training samples.
4. Test_PI_data: expression of the genes selected by LASSO, with the PI score in the last column, for the test samples.
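The PI formula given in this step is a weighted sum of expression values, which can be written as a matrix product. The sketch below uses three hypothetical genes with made-up expression values and coefficients; it illustrates the arithmetic only and is not the code used inside Lasso_PI_scores_f.

```r
# Toy normalized expression (hypothetical): samples in rows, genes in columns
expr <- matrix(
  c(1.0, 2.0, 0.5,
    0.0, 1.5, 2.0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("S1", "S2"), c("G1", "G2", "G3"))
)

# Toy beta coefficients (hypothetical), one per selected gene
beta <- c(G1 = 0.4, G2 = -0.2, G3 = 0.1)

# PI = G1 * B1 + G2 * B2 + G3 * B3, computed for every sample at once
PI <- as.vector(expr[, names(beta)] %*% beta)
names(PI) <- rownames(expr)

print(PI)
```

For sample S1 this evaluates to 1.0 * 0.4 + 2.0 * (-0.2) + 0.5 * 0.1 = 0.05. Indexing the columns by names(beta) keeps the genes and coefficients aligned even if the column order differs.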
Besides the PI score, the “Univariate_sig_features_f” function of the
CPSM package lets us select significant (p-value < 0.05) features
based on univariate survival analysis. These features are selected for
their ability to stratify patients into high-risk and low-risk survival
groups, using their median expression as the cut-off value.
Here, we need to provide the normalized training (Train_Norm_data.txt)
and test (Test_Norm_data.txt) data obtained from the previous function,
“train_test_normalization_f”, as input. We also need to provide
“col_num” (e.g. 21), the column number at which the clinical features
end, as well as surv_time (the name of the column containing survival
time in months, e.g. OS_month) and surv_event (the name of the column
containing survival event information, e.g. OS). We also need to provide
training and test output names to store the data containing the
expression of the selected genes.
# Step 4b - Univariate Survival Significant Feature Selection.
data(Train_Norm_data, package = "CPSM")
data(Test_Norm_data, package = "CPSM")
Result_Uni <- Univariate_sig_features_f(
train_data = Train_Norm_data,
test_data = Test_Norm_data,
col_num = 21,
surv_time = "OS_month",
surv_event = "OS"
)
Univariate_Suv_Sig_G_L <- Result_Uni$Univariate_Survival_Significant_genes_List
Train_Uni_sig_data <- Result_Uni$Train_Uni_sig_data
Test_Uni_sig_data <- Result_Uni$Test_Uni_sig_data
Uni_Sur_Sig_clin_List <- Result_Uni$Univariate_Survival_Significant_clin_List
Train_Uni_sig_clin_data <- Result_Uni$Train_Uni_sig_clin_data
Test_Uni_sig_clin_data <- Result_Uni$Test_Uni_sig_clin_data
str(Univariate_Suv_Sig_G_L[1:10])
#> chr [1:10] "A2ML1" "AADACL4" "AAMDC" "AAR2" "ABCA12" "ABCB4" "ABCB5" ...
Thus, Univariate_sig_features_f gives us the following outputs:
Univariate_Suv_Sig_G_L: a table of univariate significant genes along with their corresponding coefficient values, HR values, p-values, and C-index values.
Train_Uni_sig_data: expression of the significant genes selected by univariate survival analysis for the training samples.
Test_Uni_sig_data: expression of the significant genes selected by univariate survival analysis for the test samples.
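To make the median cut-off idea concrete, the following sketch tests a single simulated gene with the survival package. It is an assumption-laden illustration: the simulated data, the variable names, and the use of a Cox model on the high/low grouping are choices made here, and Univariate_sig_features_f may compute its statistics differently.

```r
library(survival)

set.seed(42)
n <- 100

# Simulated data (hypothetical): higher expression shortens survival
expr        <- rnorm(n)
surv_months <- rexp(n, rate = 0.05 * exp(0.8 * expr))
event       <- rbinom(n, 1, 0.8)

# Stratify samples at the median expression value
group <- factor(ifelse(expr > median(expr), "high", "low"),
                levels = c("low", "high"))

# Univariate Cox model comparing the high group against the low group
fit <- coxph(Surv(surv_months, event) ~ group)
s   <- summary(fit)

hr      <- s$coefficients[1, "exp(coef)"]   # hazard ratio, high vs low
p_value <- s$coefficients[1, "Pr(>|z|)"]    # Wald test p-value
c_index <- unname(s$concordance[1])         # concordance index

# The gene would count as significant under the p < 0.05 rule if TRUE
p_value < 0.05
```

Repeating this for every gene and keeping those with p < 0.05 gives a list of candidate features of the kind described above.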
After selecting significant or key features using LASSO or univariate
survival analysis, we next develop an ML prediction model to predict the
survival probability of patients. The MTLR_pred_model_f function of
CPSM offers multiple options for developing models with the MTLR
package: only clinical features (Model_type=1), PI score
(Model_type=2), PI score + clinical features (Model_type=3), significant
univariate features (Model_type=4), and significant univariate
features + clinical features (Model_type=5). For example, to develop a
model based on the PI score, we need to provide the following inputs:
(1) training data with only clinical features, (2) test data with only
clinical features, (3) model type (e.g. 2, since we want a model based
on the PI score), (4) training data with the PI score, (5) test data
with the PI score, and (6) Clin_Feature_List (e.g. Key_PI_list.txt), a
list of features used to build the model. Furthermore, we need to
provide surv_time (the name of the column containing survival time in
months, e.g. OS_month) and surv_event (the name of the column containing
survival event information, e.g. OS) for the clinical data.
#Model for only Clinical features
data(Train_Clin, package = "CPSM")
data(Test_Clin, package = "CPSM")
data(Key_Clin_feature_list, package = "CPSM")
Result_Model_Type1 <- MTLR_pred_model_f(
train_clin_data = Train_Clin,
test_clin_data = Test_Clin,
Model_type = 1,
train_features_data = Train_Clin,
test_features_data = Test_Clin,
Clin_Feature_List = Key_Clin_feature_list,
surv_time = "OS_month",
surv_event = "OS"
)
survCurves_data <- Result_Model_Type1$survCurves_data
mean_median_survival_tim_d <- Result_Model_Type1$mean_median_survival_time_data
survival_result_bas_on_MTLR <- Result_Model_Type1$survival_result_based_on_MTLR
Error_mat_for_Model <- Result_Model_Type1$Error_mat_for_Model
#Model for PI score
data(Train_Clin, package = "CPSM")
data(Test_Clin, package = "CPSM")
data(Train_PI_data, package = "CPSM")
data(Test_PI_data, package = "CPSM")
data(Key_PI_list, package = "CPSM")
Result_Model_Type2 <- MTLR_pred_model_f(
train_clin_data = Train_Clin,
test_clin_data = Test_Clin,
Model_type = 2,
train_features_data = Train_PI_data,
test_features_data = Test_PI_data,
Clin_Feature_List = Key_PI_list,
surv_time = "OS_month",
surv_event = "OS"
)
survCurves_data <- Result_Model_Type2$survCurves_data
mean_median_surviv_tim_da <- Result_Model_Type2$mean_median_survival_time_data
survival_result_b_on_MTLR <- Result_Model_Type2$survival_result_based_on_MTLR
Error_mat_for_Model <- Result_Model_Type2$Error_mat_for_Model
#Model for PI score with Clinical features
data(Train_Clin, package = "CPSM")
data(Test_Clin, package = "CPSM")
data(Train_PI_data, package = "CPSM")
data(Test_PI_data, package = "CPSM")
data(Key_Clin_features_with_PI_list, package = "CPSM")
Result_Model_Type3 <- MTLR_pred_model_f(
train_clin_data = Train_Clin,
test_clin_data = Test_Clin,
Model_type = 3,
train_features_data = Train_PI_data,
test_features_data = Test_PI_data,
Clin_Feature_List = Key_Clin_features_with_PI_list,
surv_time = "OS_month",
surv_event = "OS"
)
survCurves_data <- Result_Model_Type3$survCurves_data
mean_median_surv_tim_da <- Result_Model_Type3$mean_median_survival_time_data
survival_result_b_on_MTLR <- Result_Model_Type3$survival_result_based_on_MTLR
Error_mat_for_Model <- Result_Model_Type3$Error_mat_for_Model
#Model for significant univariate features with Clinical features
data(Train_Clin, package = "CPSM")
data(Test_Clin, package = "CPSM")
data(Train_Uni_sig_data, package = "CPSM")
data(Test_Uni_sig_data, package = "CPSM")
data(Key_univariate_features_with_Clin_list, package = "CPSM")
Result_Model_Type5 <- MTLR_pred_model_f(
train_clin_data = Train_Clin,
test_clin_data = Test_Clin,
Model_type = 5,
train_features_data = Train_Uni_sig_data,
test_features_data = Test_Uni_sig_data,
Clin_Feature_List = Key_univariate_features_with_Clin_list,
surv_time = "OS_month",
surv_event = "OS"
)
survCurves_data <- Result_Model_Type5$survCurves_data
mean_median_surv_tim_da <- Result_Model_Type5$mean_median_survival_time_data
survival_result_b_on_MTLR <- Result_Model_Type5$survival_result_based_on_MTLR
Error_mat_for_Model <- Result_Model_Type5$Error_mat_for_Model
After running the MTLR_pred_model_f function, we obtain the following outputs:
1. Model_with_PI.RData: the model fitted on the training data.
2. survCurves_data: table containing the predicted survival probability of each patient at different time points. These data can be used to plot the survival curves of patients.
3. mean_median_survival_time_data: table containing the predicted mean and median survival time of each patient in the test data. These data can be used for bar plots.
4. Error_mat_for_Model: table containing the performance parameters obtained on the test data with the prediction model. It reports the IBS (Integrated Brier Score) = 0.192 and the C-index = 0.81.
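To show how mean and median survival times relate to a predicted survival curve, here is a small base R sketch for one hypothetical patient. It uses generic definitions (the median as the time at which the survival probability reaches 0.5, and the mean as the area under the curve up to the last time point); the numbers are invented and the MTLR package may use a different interpolation or extrapolation scheme.

```r
# Toy predicted survival curve for one patient (hypothetical values)
times <- c(0, 12, 24, 36, 48, 60)          # months
surv  <- c(1.0, 0.9, 0.7, 0.5, 0.3, 0.1)   # predicted S(t)

# Median survival: time at which S(t) crosses 0.5 (linear interpolation)
median_time <- approx(x = surv, y = times, xout = 0.5)$y

# Restricted mean survival: trapezoidal area under S(t) up to 60 months
mean_time <- sum(diff(times) * (head(surv, -1) + tail(surv, -1)) / 2)

c(median = median_time, mean = mean_time)
```

Here the mean is "restricted" because the curve is not followed beyond the last time point, so it underestimates the true mean whenever S(t) is still above zero there.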
Next, to visualize the survival of patients, we plot survival curves using the surv_curve_plots_f function, based on the “survCurves_data” obtained in the previous step from the MTLR_pred_model_f function. The surv_curve_plots_f function also allows highlighting a specific patient on the curve. Thus, the function needs only two inputs: (1) Surv_curve_data and (2) the sample ID of the specific patient (e.g. TCGA-TQ-A7RQ-01) to be highlighted.
# Create Survival curves/plots for individual patients
data(survCurves_data, package = "CPSM")
plots <- surv_curve_plots_f(
Surv_curve_data = survCurves_data,
selected_sample = "TCGA-TQ-A7RQ-01"
)
# Print the plots
print(plots$all_patients_plot)
Here, we obtain two output plots:
1. Survival curves for all patients in the test data, in different colors.
2. Survival curves for all patients (in black) with the highlighted patient (in yellow) in the test data.
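The second kind of plot, all curves in black with one patient highlighted, can be approximated with base graphics as follows. This is a stand-alone sketch on invented data; the patient labels, colors, and line widths are assumptions, and surv_curve_plots_f produces its own styled plots.

```r
# Toy predicted survival probabilities (hypothetical): one column per patient
times <- c(0, 12, 24, 36, 48)
surv_mat <- cbind(
  P1 = c(1, 0.95, 0.85, 0.70, 0.60),
  P2 = c(1, 0.80, 0.55, 0.35, 0.20),
  P3 = c(1, 0.90, 0.75, 0.55, 0.40)
)
selected <- "P2"

# Highlight the selected patient; draw everyone else in thin black lines
is_selected <- colnames(surv_mat) == selected
line_cols   <- ifelse(is_selected, "gold", "black")
line_widths <- ifelse(is_selected, 3, 1)

matplot(times, surv_mat, type = "l", lty = 1,
        col = line_cols, lwd = line_widths, ylim = c(0, 1),
        xlab = "Time (months)", ylab = "Predicted survival probability")
legend("topright", legend = selected, col = "gold", lwd = 3, bty = "n")
```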
Next, to visualize the predicted survival time of patients, we draw bar plots of the mean/median survival time using the “mean_median_surv_barplot_f” function, based on the data obtained in step 5 from the MTLR_pred_model_f function. The mean_median_surv_barplot_f function also allows highlighting a specific patient in the bar plot. Thus, the function needs only two inputs: (1) surv_mean_med_data and (2) the sample ID of the specific patient (e.g. TCGA-TQ-A7RQ-01) to be highlighted.
data(mean_median_survival_time_data, package = "CPSM")
plots_2 <- mean_median_surv_barplot_f(
surv_mean_med_data =
mean_median_survival_time_data,
selected_sample = "TCGA-TQ-A7RQ-01"
)
# Print the plots
print(plots_2$mean_med_all_pat)
Here, we obtain two output plots:
1. Bar plot for all patients in the test data, where the red bars represent the mean survival time and the cyan/green bars represent the median survival time.
2. Bar plot for all patients in the test data with the highlighted patient (dashed black outline). It shows that this patient has predicted mean and median survival times of 81.58 and 75.50 months, respectively.
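A comparable grouped bar plot with one outlined patient can be sketched in base graphics. The values, patient labels, and colors below are invented for illustration and do not come from the example data; mean_median_surv_barplot_f draws its own version.

```r
# Toy predicted survival times in months (hypothetical values)
surv_time <- rbind(
  Mean   = c(P1 = 80, P2 = 42, P3 = 63),
  Median = c(P1 = 75, P2 = 38, P3 = 60)
)
selected <- "P1"

# Grouped bars: one (mean, median) pair per patient
bar_mid <- barplot(surv_time, beside = TRUE, col = c("red", "cyan"),
                   ylab = "Predicted survival time (months)",
                   legend.text = rownames(surv_time))

# Outline the selected patient's pair of bars with a dashed black rectangle
idx <- which(colnames(surv_time) == selected)
rect(xleft = min(bar_mid[, idx]) - 0.5, ybottom = 0,
     xright = max(bar_mid[, idx]) + 0.5, ytop = max(surv_time[, idx]),
     border = "black", lty = 2)
```

barplot() returns the bar midpoints, which is what makes it possible to place the outline exactly around the selected patient's bars.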
Next, the Nomogram_generate_f function of CPSM provides an option to generate a nomogram plot based on user-defined clinical and other features in the data. For instance, we will generate a nomogram based on 6 features (Age, gender, race, histological_type, sample_type, PI). Here, we provide data containing all the features, with samples in rows and features in columns (e.g. Train_Data_Nomogram_input), and a list of features (feature_list_for_Nomogram) on which the nomogram should be based. We also need to provide surv_time (the name of the column containing survival time in months, e.g. OS_month) and surv_event (the name of the column containing survival event information, e.g. OS).
data(Train_Data_Nomogram_input, package = "CPSM")
data(feature_list_for_Nomogram, package = "CPSM")
Result_Nomogram <- Nomogram_generate_f(
data = Train_Data_Nomogram_input,
Feature_List = feature_list_for_Nomogram,
surv_time = "OS_month",
surv_event = "OS"
)
Here, we obtain a nomogram based on the features that we provided. This nomogram can predict the risk (event risk, e.g. death) and the 1-year, 3-year, 5-year, and 10-year survival of patients.
As the last part of this document, we call the function “sessionInfo()”, which reports the version numbers of R and all the packages used in this session. It is good practice to always keep such a record, as it helps to trace what has happened if an R script stops working because functions have changed in a newer version of a package.
sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] SummarizedExperiment_1.37.0 Biobase_2.67.0
#> [3] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
#> [5] IRanges_2.41.2 S4Vectors_0.45.2
#> [7] BiocGenerics_0.53.3 generics_0.1.3
#> [9] MatrixGenerics_1.19.0 matrixStats_1.4.1
#> [11] ROCR_1.0-11 survivalROC_1.0.3.1
#> [13] rms_6.9-0 Hmisc_5.2-1
#> [15] reshape2_1.4.4 glmnet_4.1-8
#> [17] Matrix_1.7-1 pec_2023.04.12
#> [19] prodlim_2024.06.25 SurvMetrics_0.5.0
#> [21] MTLR_0.2.1 MASS_7.3-63
#> [23] dplyr_1.1.4 survminer_0.5.0
#> [25] ggpubr_0.6.0 survival_3.8-3
#> [27] ggfortify_0.4.17 ggplot2_3.5.1
#> [29] preprocessCore_1.69.0 CPSM_0.99.3
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 shape_1.4.6.1 rstudioapi_0.17.1
#> [4] jsonlite_1.8.9 magrittr_2.0.3 TH.data_1.1-2
#> [7] farver_2.1.2 rmarkdown_2.29 vctrs_0.6.5
#> [10] base64enc_0.1-3 rstatix_0.7.2 htmltools_0.5.8.1
#> [13] S4Arrays_1.7.1 polspline_1.1.25 broom_1.0.7
#> [16] SparseArray_1.7.2 Formula_1.2-5 sass_0.4.9
#> [19] parallelly_1.41.0 bslib_0.8.0 htmlwidgets_1.6.4
#> [22] plyr_1.8.9 sandwich_3.1-1 zoo_1.8-12
#> [25] cachem_1.1.0 lifecycle_1.0.4 iterators_1.0.14
#> [28] pkgconfig_2.0.3 R6_2.5.1 fastmap_1.2.0
#> [31] GenomeInfoDbData_1.2.13 future_1.34.0 digest_0.6.37
#> [34] numDeriv_2016.8-1.1 colorspace_2.1-1 labeling_0.4.3
#> [37] km.ci_0.5-6 httr_1.4.7 abind_1.4-8
#> [40] compiler_4.5.0 withr_3.0.2 htmlTable_2.4.3
#> [43] backports_1.5.0 carData_3.0-5 ggsignif_0.6.4
#> [46] lava_1.8.0 quantreg_5.99.1 DelayedArray_0.33.3
#> [49] tools_4.5.0 foreign_0.8-87 future.apply_1.11.3
#> [52] nnet_7.3-20 glue_1.8.0 DiagrammeR_1.0.11
#> [55] nlme_3.1-166 grid_4.5.0 checkmate_2.3.2
#> [58] cluster_2.1.8 gtable_0.3.6 KMsurv_0.1-5
#> [61] tidyr_1.3.1 data.table_1.16.4 car_3.1-3
#> [64] XVector_0.47.1 foreach_1.5.2 pillar_1.10.0
#> [67] stringr_1.5.1 splines_4.5.0 lattice_0.22-6
#> [70] SparseM_1.84-2 tidyselect_1.2.1 knitr_1.49
#> [73] gridExtra_2.3 svglite_2.1.3 xfun_0.49
#> [76] visNetwork_2.1.2 stringi_1.8.4 UCSC.utils_1.3.0
#> [79] yaml_2.3.10 evaluate_1.0.1 codetools_0.2-20
#> [82] data.tree_1.1.0 tibble_3.2.1 cli_3.6.3
#> [85] rpart_4.1.23 systemfonts_1.1.0 xtable_1.8-4
#> [88] randomForestSRC_3.3.1 munsell_0.5.1 jquerylib_0.1.4
#> [91] survMisc_0.5.6 Rcpp_1.0.13-1 globals_0.16.3
#> [94] parallel_4.5.0 MatrixModels_0.5-3 listenv_0.9.1
#> [97] mvtnorm_1.3-2 timereg_2.0.6 scales_1.3.0
#> [100] purrr_1.0.2 crayon_1.5.3 rlang_1.1.4
#> [103] multcomp_1.4-26