CPSM is an R package that provides a computational pipeline for predicting survival probabilities and risk groups in cancer patients. It includes dedicated modules for key steps such as data preprocessing, training/test splitting, and normalization. CPSM supports feature selection through univariate Cox regression survival analysis and through the LASSO method, and it calculates a LASSO-based Prognostic Index (PI) score. It supports the development of predictive models using different feature sets and offers a suite of visualization tools, including survival curves based on predicted probabilities, bar plots of predicted mean and median survival times, Kaplan-Meier (KM) plots overlaid with individual survival predictions, and nomograms for estimating 1-, 3-, 5-, and 10-year survival probabilities. Together, these functions cover the main steps of a survival analysis workflow in cancer research.
To install this package, start R (version 4.4) and enter the following code:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("CPSM")
The example input data object, `Example_TCGA_LGG_FPKM_data`, contains data for 184 LGG cancer samples and is stored as a SummarizedExperiment object. In the tabular form used by the CPSM functions, samples are rows and features are columns. Gene expression data is represented in FPKM values. The dataset includes 11 clinical and demographic features, 4 types of survival data (each with time and event information), and 19,978 protein-coding genes. The clinical and demographic features are `Age`, `subtype`, `gender`, `race`, `ajcc_pathologic_tumor_stage`, `histological_type`, `histological_grade`, `treatment_outcome_first_course`, `radiation_treatment_adjuvant`, `sample_type`, and `type`. The four types of survival data are Overall Survival (OS), Progression-Free Interval (PFI), Disease-Specific Survival (DSS), and Disease-Free Interval (DFI). In the dataset, the columns labeled OS, PFI, DSS, and DFI represent event occurrences, while the columns OS.time, PFI.time, DSS.time, and DFI.time provide the corresponding times (in days).
library(CPSM)
library(SummarizedExperiment)
set.seed(7) # set seed
#load data (from the package)
data(Example_TCGA_LGG_FPKM_data, package = "CPSM")
# view data
str(Example_TCGA_LGG_FPKM_data[1:10])
#> Formal class 'SummarizedExperiment' [package "SummarizedExperiment"] with 5 slots
#> ..@ colData :Formal class 'DFrame' [package "S4Vectors"] with 6 slots
#> .. .. ..@ rownames : chr [1:184] "TCGA-TM-A7CA-01" "TCGA-DU-A6S3-01" "TCGA-CS-5390-01" "TCGA-DU-8158-01" ...
#> .. .. ..@ nrows : int 184
#> .. .. ..@ elementType : chr "ANY"
#> .. .. ..@ elementMetadata: NULL
#> .. .. ..@ metadata : list()
#> .. .. ..@ listData :List of 20
#> .. .. .. ..$ Age : num [1:184] 44.9 60.3 47.8 57.9 45.7 ...
#> .. .. .. ..$ subtype : chr [1:184] "PN" "PN" "PN" NA ...
#> .. .. .. ..$ gender : chr [1:184] "Male" "Male" "Female" "Female" ...
#> .. .. .. ..$ race : chr [1:184] "WHITE" "WHITE" "WHITE" "WHITE" ...
#> .. .. .. ..$ ajcc_pathologic_tumor_stage : logi [1:184] NA NA NA NA NA NA ...
#> .. .. .. ..$ histological_type : chr [1:184] "Astrocytoma" "Oligodendroglioma" "Oligodendroglioma" "Astrocytoma" ...
#> .. .. .. ..$ histological_grade : chr [1:184] "G2" "G2" "G2" "G3" ...
#> .. .. .. ..$ treatment_outcome_first_course: chr [1:184] "Complete Remission/Response" "Stable Disease" NA NA ...
#> .. .. .. ..$ radiation_treatment_adjuvant : chr [1:184] "NO" "NO" "YES" NA ...
#> .. .. .. ..$ sample_type : chr [1:184] "Primary" "Primary" "Primary" "Primary" ...
#> .. .. .. ..$ type : chr [1:184] "LGG" "LGG" "LGG" "LGG" ...
#> .. .. .. ..$ OS : int [1:184] 0 0 0 1 1 1 0 0 0 0 ...
#> .. .. .. ..$ OS.time : int [1:184] 1058 656 NA 155 1401 919 993 3 77 395 ...
#> .. .. .. ..$ DSS : int [1:184] 0 0 0 1 1 0 0 0 0 0 ...
#> .. .. .. ..$ DSS.time : int [1:184] 1058 656 NA 155 1401 919 993 3 77 395 ...
#> .. .. .. ..$ DFI : int [1:184] 0 NA 0 NA NA NA NA NA NA 1 ...
#> .. .. .. ..$ DFI.time : int [1:184] 1058 NA NA NA NA NA NA NA NA 335 ...
#> .. .. .. ..$ PFI : int [1:184] 0 0 0 1 1 1 0 0 0 1 ...
#> .. .. .. ..$ PFI.time : int [1:184] 1058 656 NA 155 837 260 993 3 77 335 ...
#> .. .. .. ..$ sample : chr [1:184] "TCGA-TM-A7CA-01" "TCGA-DU-A6S3-01" "TCGA-CS-5390-01" "TCGA-DU-8158-01" ...
#> ..@ assays :Formal class 'SimpleAssays' [package "SummarizedExperiment"] with 1 slot
#> .. .. ..@ data:Formal class 'SimpleList' [package "S4Vectors"] with 4 slots
#> .. .. .. .. ..@ listData :List of 1
#> .. .. .. .. .. ..$ expression: num [1:10, 1:184] 0.1166 0.0073 65.8414 0.6225 0.2736 ...
#> .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
#> .. .. .. .. .. .. .. ..$ : chr [1:10] "A1BG" "A1CF" "A2M" "A2ML1" ...
#> .. .. .. .. .. .. .. ..$ : chr [1:184] "TCGA-TM-A7CA-01" "TCGA-DU-A6S3-01" "TCGA-CS-5390-01" "TCGA-DU-8158-01" ...
#> .. .. .. .. ..@ elementType : chr "ANY"
#> .. .. .. .. ..@ elementMetadata: NULL
#> .. .. .. .. ..@ metadata : list()
#> ..@ NAMES : chr [1:10] "A1BG" "A1CF" "A2M" "A2ML1" ...
#> ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
#> .. .. ..@ rownames : NULL
#> .. .. ..@ nrows : int 10
#> .. .. ..@ elementType : chr "ANY"
#> .. .. ..@ elementMetadata: NULL
#> .. .. ..@ metadata : list()
#> .. .. ..@ listData :List of 1
#> .. .. .. ..$ gene: chr [1:10] "A1BG" "A1CF" "A2M" "A2ML1" ...
#> ..@ metadata : list()
The example above demonstrates how to load the example data shipped with the CPSM package. If you have your own dataset in tab-separated (.txt, .tsv) or comma-separated (.csv) format, with samples in rows and features in columns (for instance, a file named TCGA-LGG_FPKM_data_with_clin_data.txt), you can load it as follows:
# Step1 - Specify the file path to your data
#file_path <- "path/to/your/TCGA-LGG_FPKM_data_with_clin_data.txt"
# Step2 - load/read data
#data <- read.table(file = file_path, header = TRUE, sep = "\t",stringsAsFactors = FALSE, check.names = FALSE)
# Step3 - View/Inspect the first few rows
#head(data[1:30])
Make sure your clinical and survival columns are named consistently (e.g., OS, OS.time).
Gene expression features should start after the clinical and survival columns. You can adjust the column indices in CPSM functions accordingly.
For CSV files, change sep = "\t" to sep = ",".
The code above only demonstrates how users can load their own data; the analysis steps in the examples below use the data shipped with the CPSM package.
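As a rough sketch of these loading notes for a comma-separated file, the following base R snippet writes a tiny made-up table to a temporary file, reads it back with sep = ",", and checks that the survival columns are present and that the expression columns follow them. The table contents, the gene names GeneA and GeneB, and the helper variable required_cols are illustrative assumptions and are not part of CPSM.

```r
# Hypothetical toy table: samples in rows, clinical/survival columns first, genes after
toy <- data.frame(
  Age = c(44.9, 60.3, 47.8),
  OS = c(0L, 1L, 0L),
  OS.time = c(1058L, 656L, NA),
  GeneA = c(0.12, 0.08, 0.28),
  GeneB = c(65.8, 42.8, 91.2),
  row.names = c("S1", "S2", "S3")
)
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path)  # sample IDs are written as the first column

# Read the CSV: row.names = 1 restores the sample IDs,
# check.names = FALSE keeps column names exactly as written in the file
dat <- read.table(csv_path, header = TRUE, sep = ",", row.names = 1,
                  stringsAsFactors = FALSE, check.names = FALSE)

# Survival columns expected by the later steps (assumed naming: OS, OS.time)
required_cols <- c("OS", "OS.time")
stopifnot(all(required_cols %in% colnames(dat)))

# In this toy table the clinical/survival block ends at the last required column
col_num <- max(match(required_cols, colnames(dat)))
gene_cols <- colnames(dat)[(col_num + 1):ncol(dat)]
print(gene_cols)
```

Checking the column layout right after loading makes it easier to choose the column index that the CPSM functions ask for later.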
The data_process_f function converts OS time (in days) into months and removes samples where OS/OS.time information is missing.
## Required inputs
To use this function, the input data should be provided in TSV format. Additionally, you need to define `col_num` (the column number at which clinical, demographic, and survival information ends, e.g., 20), `surv_time` (the name of the column that contains survival time information, e.g., `OS.time`), and `output` (the desired name for the output, e.g., "New_data").
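data(Example_TCGA_LGG_FPKM_data, package = "CPSM")
# Combine the clinical/survival columns (all colData columns except the last one)
# with the transposed expression matrix, so that samples are rows
combined_df <- cbind(
    as.data.frame(colData(Example_TCGA_LGG_FPKM_data))
    [, -ncol(colData(Example_TCGA_LGG_FPKM_data))],
    t(as.data.frame(assay(
        Example_TCGA_LGG_FPKM_data,
        "expression"
    )))
)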
# View the structure of the first 30 genes (rows) of the example object
print(str(Example_TCGA_LGG_FPKM_data[1:30]),2)
#> Formal class 'SummarizedExperiment' [package "SummarizedExperiment"] with 5 slots
#> ..@ colData :Formal class 'DFrame' [package "S4Vectors"] with 6 slots
#> .. .. ..@ rownames : chr [1:184] "TCGA-TM-A7CA-01" "TCGA-DU-A6S3-01" "TCGA-CS-5390-01" "TCGA-DU-8158-01" ...
#> .. .. ..@ nrows : int 184
#> .. .. ..@ elementType : chr "ANY"
#> .. .. ..@ elementMetadata: NULL
#> .. .. ..@ metadata : list()
#> .. .. ..@ listData :List of 20
#> .. .. .. ..$ Age : num [1:184] 44.9 60.3 47.8 57.9 45.7 ...
#> .. .. .. ..$ subtype : chr [1:184] "PN" "PN" "PN" NA ...
#> .. .. .. ..$ gender : chr [1:184] "Male" "Male" "Female" "Female" ...
#> .. .. .. ..$ race : chr [1:184] "WHITE" "WHITE" "WHITE" "WHITE" ...
#> .. .. .. ..$ ajcc_pathologic_tumor_stage : logi [1:184] NA NA NA NA NA NA ...
#> .. .. .. ..$ histological_type : chr [1:184] "Astrocytoma" "Oligodendroglioma" "Oligodendroglioma" "Astrocytoma" ...
#> .. .. .. ..$ histological_grade : chr [1:184] "G2" "G2" "G2" "G3" ...
#> .. .. .. ..$ treatment_outcome_first_course: chr [1:184] "Complete Remission/Response" "Stable Disease" NA NA ...
#> .. .. .. ..$ radiation_treatment_adjuvant : chr [1:184] "NO" "NO" "YES" NA ...
#> .. .. .. ..$ sample_type : chr [1:184] "Primary" "Primary" "Primary" "Primary" ...
#> .. .. .. ..$ type : chr [1:184] "LGG" "LGG" "LGG" "LGG" ...
#> .. .. .. ..$ OS : int [1:184] 0 0 0 1 1 1 0 0 0 0 ...
#> .. .. .. ..$ OS.time : int [1:184] 1058 656 NA 155 1401 919 993 3 77 395 ...
#> .. .. .. ..$ DSS : int [1:184] 0 0 0 1 1 0 0 0 0 0 ...
#> .. .. .. ..$ DSS.time : int [1:184] 1058 656 NA 155 1401 919 993 3 77 395 ...
#> .. .. .. ..$ DFI : int [1:184] 0 NA 0 NA NA NA NA NA NA 1 ...
#> .. .. .. ..$ DFI.time : int [1:184] 1058 NA NA NA NA NA NA NA NA 335 ...
#> .. .. .. ..$ PFI : int [1:184] 0 0 0 1 1 1 0 0 0 1 ...
#> .. .. .. ..$ PFI.time : int [1:184] 1058 656 NA 155 837 260 993 3 77 335 ...
#> .. .. .. ..$ sample : chr [1:184] "TCGA-TM-A7CA-01" "TCGA-DU-A6S3-01" "TCGA-CS-5390-01" "TCGA-DU-8158-01" ...
#> ..@ assays :Formal class 'SimpleAssays' [package "SummarizedExperiment"] with 1 slot
#> .. .. ..@ data:Formal class 'SimpleList' [package "S4Vectors"] with 4 slots
#> .. .. .. .. ..@ listData :List of 1
#> .. .. .. .. .. ..$ expression: num [1:30, 1:184] 0.1166 0.0073 65.8414 0.6225 0.2736 ...
#> .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
#> .. .. .. .. .. .. .. ..$ : chr [1:30] "A1BG" "A1CF" "A2M" "A2ML1" ...
#> .. .. .. .. .. .. .. ..$ : chr [1:184] "TCGA-TM-A7CA-01" "TCGA-DU-A6S3-01" "TCGA-CS-5390-01" "TCGA-DU-8158-01" ...
#> .. .. .. .. ..@ elementType : chr "ANY"
#> .. .. .. .. ..@ elementMetadata: NULL
#> .. .. .. .. ..@ metadata : list()
#> ..@ NAMES : chr [1:30] "A1BG" "A1CF" "A2M" "A2ML1" ...
#> ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
#> .. .. ..@ rownames : NULL
#> .. .. ..@ nrows : int 30
#> .. .. ..@ elementType : chr "ANY"
#> .. .. ..@ elementMetadata: NULL
#> .. .. ..@ metadata : list()
#> .. .. ..@ listData :List of 1
#> .. .. .. ..$ gene: chr [1:30] "A1BG" "A1CF" "A2M" "A2ML1" ...
#> ..@ metadata : list()
#> NULL
#------------------------ OUTPUTS ---------------------#
# Access the output of function
New_data <- data_process_f(combined_df, col_num = 20, surv_time = "OS.time")
# View/Inspect the first few rows of output data after data pre-processing
str(New_data[1:10])
#> 'data.frame': 176 obs. of 10 variables:
#> $ Age : num 44.9 60.3 57.9 45.7 70.7 ...
#> $ subtype : chr "PN" "PN" NA "PN" ...
#> $ gender : chr "Male" "Male" "Female" "Male" ...
#> $ race : chr "WHITE" "WHITE" "WHITE" "WHITE" ...
#> $ ajcc_pathologic_tumor_stage : logi NA NA NA NA NA NA ...
#> $ histological_type : chr "Astrocytoma" "Oligodendroglioma" "Astrocytoma" "Oligodendroglioma" ...
#> $ histological_grade : chr "G2" "G2" "G3" "G3" ...
#> $ treatment_outcome_first_course: chr "Complete Remission/Response" "Stable Disease" NA NA ...
#> $ radiation_treatment_adjuvant : chr "NO" "NO" NA "YES" ...
#> $ sample_type : chr "Primary" "Primary" "Primary" "Primary" ...
After data processing, the output object `New_data` is generated, which contains 176 samples. This indicates that the function has removed 8 samples where OS/OS.time information was missing. Moreover, a new 21st column, `OS_month`, is added to the data, containing OS time values in months.
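The effect of this preprocessing step can be illustrated on a toy data frame. This is a minimal sketch and not the package's implementation: it assumes an average month length of 365.25 / 12 days and rounding to whole months, so the exact conversion inside data_process_f may differ. The data frame toy and the variable days_per_month are hypothetical.

```r
# Hypothetical toy data: one sample (S3) has no survival time
toy <- data.frame(
  OS = c(0L, 0L, 0L, 1L),
  OS.time = c(1058L, 656L, NA, 155L),
  row.names = c("S1", "S2", "S3", "S4")
)
surv_time <- "OS.time"

# Keep only samples where both the event and the survival time are available
complete <- !is.na(toy$OS) & !is.na(toy[[surv_time]])
kept <- toy[complete, , drop = FALSE]

# Convert days to months (assumed factor: 365.25 / 12 days per month)
days_per_month <- 365.25 / 12
kept$OS_month <- round(kept[[surv_time]] / days_per_month)

print(kept)
```

With these assumptions, 1058, 656, and 155 days become 35, 22, and 5 months, which matches the OS_month values shown for the same survival times in the example data.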
Before proceeding further, we need to split the data into training and test subsets for feature selection and model development.
## Required inputs
The output from the previous step, `New_data`, serves as the input for this process. Next, you need to define the fraction (e.g., 0.9) by which to split the data into training and test sets. For example, setting `fraction = 0.9` will divide the data into 90% for training and 10% for testing. Additionally, you should specify names for the training and test outputs (e.g., `train_FPKM` and `test_FPKM`).
#load data
data(New_data, package = "CPSM")
# View top rows and first 30 columns of data
print(head(New_data[1:30]),3)
#> Age subtype gender race ajcc_pathologic_tumor_stage
#> TCGA-TM-A7CA-01 44.94 PN Male WHITE NA
#> TCGA-DU-A6S3-01 60.34 PN Male WHITE NA
#> TCGA-DU-8158-01 57.85 <NA> Female WHITE NA
#> TCGA-DU-6397-01 45.71 PN Male WHITE NA
#> TCGA-DH-A669-02 70.66 PN Male WHITE NA
#> TCGA-R8-A6MO-01 53.20 PN Female ASIAN NA
#> histological_type histological_grade
#> TCGA-TM-A7CA-01 Astrocytoma G2
#> TCGA-DU-A6S3-01 Oligodendroglioma G2
#> TCGA-DU-8158-01 Astrocytoma G3
#> TCGA-DU-6397-01 Oligodendroglioma G3
#> TCGA-DH-A669-02 Oligodendroglioma G3
#> TCGA-R8-A6MO-01 Oligodendroglioma G2
#> treatment_outcome_first_course radiation_treatment_adjuvant
#> TCGA-TM-A7CA-01 Complete Remission/Response NO
#> TCGA-DU-A6S3-01 Stable Disease NO
#> TCGA-DU-8158-01 <NA> <NA>
#> TCGA-DU-6397-01 <NA> YES
#> TCGA-DH-A669-02 Stable Disease <NA>
#> TCGA-R8-A6MO-01 <NA> <NA>
#> sample_type type OS OS.time DSS DSS.time DFI DFI.time PFI
#> TCGA-TM-A7CA-01 Primary LGG 0 1058 0 1058 0 1058 0
#> TCGA-DU-A6S3-01 Primary LGG 0 656 0 656 NA NA 0
#> TCGA-DU-8158-01 Primary LGG 1 155 1 155 NA NA 1
#> TCGA-DU-6397-01 Primary LGG 1 1401 1 1401 NA NA 1
#> TCGA-DH-A669-02 Recurrent LGG 1 919 0 919 NA NA 1
#> TCGA-R8-A6MO-01 Primary LGG 0 993 0 993 NA NA 0
#> PFI.time OS_month A1BG A1CF A2M A2ML1 A3GALT2 A4GALT
#> TCGA-TM-A7CA-01 1058 35 0.1166 0.0073 65.8414 0.6225 0.2736 0.4515
#> TCGA-DU-A6S3-01 656 22 0.0782 0.0070 42.7621 0.6587 0.1310 1.7230
#> TCGA-DU-8158-01 155 5 0.2789 0.0052 91.1555 0.8302 0.0613 2.7992
#> TCGA-DU-6397-01 837 46 0.0547 0.0053 72.1564 1.1607 0.0493 2.1882
#> TCGA-DH-A669-02 260 30 0.2119 0.0022 84.0821 0.3020 0.0000 1.9116
#> TCGA-R8-A6MO-01 993 33 0.0677 0.0094 72.4583 0.4161 0.0735 2.3783
#> A4GNT AAAS AACS AADAC
#> TCGA-TM-A7CA-01 0.0393 11.4270 1.6621 0.0000
#> TCGA-DU-A6S3-01 0.0251 13.0340 2.1700 0.0000
#> TCGA-DU-8158-01 0.1550 6.3739 1.5613 0.0384
#> TCGA-DU-6397-01 0.1040 10.2621 1.2255 0.0000
#> TCGA-DH-A669-02 0.0238 18.8499 2.8438 0.0000
#> TCGA-R8-A6MO-01 0.0254 14.1970 0.9921 0.0000
# Call/Run the function for "New_data"
result <- tr_test_f(data = New_data, fraction = 0.9)
#------------------------ OUTPUTS ---------------------#
# Access output from the function: training data
train_FPKM <- result$train_data
# View top rows and first 30 columns of data
print(head(train_FPKM[1:30]),2)
#> Age subtype gender race ajcc_pathologic_tumor_stage
#> TCGA-CS-5396-01 53.11 PN Female WHITE NA
#> TCGA-DU-A76L-01 54.27 ME Male WHITE NA
#> TCGA-DB-5270-01 38.07 NE Female WHITE NA
#> TCGA-DB-A75P-01 25.77 NE Female NOT AVAILABLE NA
#> TCGA-S9-A6U0-01 46.24 ME Male WHITE NA
#> TCGA-E1-5307-01 62.52 PN Female WHITE NA
#> histological_type histological_grade
#> TCGA-CS-5396-01 Oligodendroglioma G3
#> TCGA-DU-A76L-01 Oligodendroglioma G3
#> TCGA-DB-5270-01 Oligoastrocytoma G3
#> TCGA-DB-A75P-01 Astrocytoma G2
#> TCGA-S9-A6U0-01 Astrocytoma G3
#> TCGA-E1-5307-01 Astrocytoma G3
#> treatment_outcome_first_course radiation_treatment_adjuvant
#> TCGA-CS-5396-01 <NA> YES
#> TCGA-DU-A76L-01 Progressive Disease <NA>
#> TCGA-DB-5270-01 <NA> NO
#> TCGA-DB-A75P-01 Complete Remission/Response NO
#> TCGA-S9-A6U0-01 Partial Remission/Response YES
#> TCGA-E1-5307-01 <NA> YES
#> sample_type type OS OS.time DSS DSS.time DFI DFI.time PFI
#> TCGA-CS-5396-01 Primary LGG 0 1631 0 1631 NA NA 0
#> TCGA-DU-A76L-01 Primary LGG 1 814 1 814 NA NA 1
#> TCGA-DB-5270-01 Primary LGG 0 3733 0 3733 0 3733 0
#> TCGA-DB-A75P-01 Primary LGG 0 492 0 492 0 492 0
#> TCGA-S9-A6U0-01 Primary LGG 1 742 1 742 NA NA 1
#> TCGA-E1-5307-01 Primary LGG 1 1762 1 1762 NA NA 1
#> PFI.time OS_month A1BG A1CF A2M A2ML1 A3GALT2 A4GALT
#> TCGA-CS-5396-01 1631 54 0.1885 0 71.3633 0.1814 0.0682 1.7497
#> TCGA-DU-A76L-01 410 27 0.2011 0 239.7874 0.1535 0.3259 2.1433
#> TCGA-DB-5270-01 3733 123 0.0582 0 65.8589 1.0568 0.0304 1.7527
#> TCGA-DB-A75P-01 492 16 0.0977 0 70.3146 2.3686 0.0764 3.3844
#> TCGA-S9-A6U0-01 692 24 0.0649 0 288.3351 1.5200 0.1269 6.4862
#> TCGA-E1-5307-01 1452 58 0.1815 0 78.8000 2.6014 0.0546 0.4213
#> A4GNT AAAS AACS AADAC
#> TCGA-CS-5396-01 0.0628 7.5286 1.4166 0.0000
#> TCGA-DU-A76L-01 0.0441 9.1813 1.4905 0.0120
#> TCGA-DB-5270-01 0.0000 6.5046 3.1825 0.0000
#> TCGA-DB-A75P-01 0.0329 6.3535 2.2791 0.0120
#> TCGA-S9-A6U0-01 0.0730 7.1669 1.8484 0.0318
#> TCGA-E1-5307-01 0.0105 8.0024 1.8582 0.0000
# Access output - test data
test_FPKM <- result$test_data
# View top rows and first 30 columns of data
print(head(test_FPKM[1:30]),3)
#> Age subtype gender race ajcc_pathologic_tumor_stage
#> TCGA-DH-A669-02 70.66 PN Male WHITE NA
#> TCGA-WY-A859-01 34.58 <NA> Female WHITE NA
#> TCGA-TQ-A7RN-01 32.37 PN Male WHITE NA
#> TCGA-FG-A6IZ-01 60.99 PN Male WHITE NA
#> TCGA-HW-8319-01 34.38 PN Female WHITE NA
#> TCGA-TM-A7CF-02 41.56 NE Female WHITE NA
#> histological_type histological_grade
#> TCGA-DH-A669-02 Oligodendroglioma G3
#> TCGA-WY-A859-01 Astrocytoma G2
#> TCGA-TQ-A7RN-01 Oligodendroglioma G2
#> TCGA-FG-A6IZ-01 Oligodendroglioma G2
#> TCGA-HW-8319-01 Astrocytoma G3
#> TCGA-TM-A7CF-02 Astrocytoma G2
#> treatment_outcome_first_course radiation_treatment_adjuvant
#> TCGA-DH-A669-02 Stable Disease <NA>
#> TCGA-WY-A859-01 Stable Disease <NA>
#> TCGA-TQ-A7RN-01 Partial Remission/Response YES
#> TCGA-FG-A6IZ-01 Complete Remission/Response YES
#> TCGA-HW-8319-01 Partial Remission/Response YES
#> TCGA-TM-A7CF-02 Complete Remission/Response NO
#> sample_type type OS OS.time DSS DSS.time DFI DFI.time PFI
#> TCGA-DH-A669-02 Recurrent LGG 1 919 0 919 NA NA 1
#> TCGA-WY-A859-01 Primary LGG 0 1213 0 1213 NA NA 0
#> TCGA-TQ-A7RN-01 Primary LGG 0 1026 0 1026 NA NA 0
#> TCGA-FG-A6IZ-01 Primary LGG 0 457 0 457 0 457 0
#> TCGA-HW-8319-01 Primary LGG 1 1209 1 1209 NA NA 1
#> TCGA-TM-A7CF-02 Recurrent LGG 0 1989 0 1989 1 924 1
#> PFI.time OS_month A1BG A1CF A2M A2ML1 A3GALT2 A4GALT
#> TCGA-DH-A669-02 260 30 0.2119 0.0022 84.0821 0.3020 0.0000 1.9116
#> TCGA-WY-A859-01 1213 40 0.1112 0.0017 35.4291 1.9501 0.1288 0.6084
#> TCGA-TQ-A7RN-01 1026 34 0.0474 0.0000 164.7839 11.4102 0.0712 0.8503
#> TCGA-FG-A6IZ-01 457 15 0.0160 0.0000 44.3524 0.4603 0.1045 0.9300
#> TCGA-HW-8319-01 997 40 0.1476 0.0113 69.7726 2.5899 0.3037 1.1103
#> TCGA-TM-A7CF-02 924 65 0.3037 0.0067 26.4753 0.9142 0.1875 1.2120
#> A4GNT AAAS AACS AADAC
#> TCGA-DH-A669-02 0.0238 18.8499 2.8438 0
#> TCGA-WY-A859-01 0.0278 7.3509 6.0648 0
#> TCGA-TQ-A7RN-01 0.1310 7.1035 1.2761 0
#> TCGA-FG-A6IZ-01 0.0000 19.4177 2.2619 0
#> TCGA-HW-8319-01 0.0524 9.9054 1.9720 0
#> TCGA-TM-A7CF-02 0.0479 13.9021 1.4295 0
After the train-test split, two new output objects are generated: `train_FPKM` and `test_FPKM`. The `train_FPKM` object contains 158 samples, while `test_FPKM` contains 18 samples. This indicates that the `tr_test_f` function splits the data in a 90:10 ratio.
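The sample counts follow directly from the fraction: 0.9 × 176 = 158.4, which gives 158 training samples and 18 test samples. The sketch below reproduces that arithmetic with a simple random split in base R. It assumes the training size is rounded down and that samples are drawn uniformly at random; tr_test_f may select samples differently, and the data frame toy is hypothetical.

```r
set.seed(7)

n <- 176          # number of samples after preprocessing
fraction <- 0.9   # share of samples assigned to the training set

# Hypothetical stand-in for New_data: one row per sample
toy <- data.frame(id = seq_len(n))

# Draw the training rows at random; everything else becomes the test set
train_idx <- sample(seq_len(n), size = floor(fraction * n))
train <- toy[train_idx, , drop = FALSE]
test <- toy[-train_idx, , drop = FALSE]

print(c(train = nrow(train), test = nrow(test)))
```

Setting the seed before the split keeps the partition reproducible across runs.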
To select features and develop machine learning (ML) models, the data must first be normalized. Since the expression data are available as FPKM values, the `train_test_normalization_f` function first converts the FPKM values to a log scale using the formula log2(FPKM + 1) and then applies quantile normalization. The training data are used as the target matrix for the quantile normalization process.
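The two normalization steps can be sketched in base R on toy matrices. This is an illustration of the general technique (log2 transformation followed by quantile normalization against a training-derived target), not the package's own code: the way the target distribution is built here, as the mean of the sorted training values, is an assumption, and the objects train_fpkm, test_fpkm, and to_target are hypothetical.

```r
set.seed(1)
# Toy FPKM matrices (hypothetical values): rows are samples, columns are genes
train_fpkm <- matrix(rexp(5 * 8, rate = 0.2), nrow = 5,
                     dimnames = list(paste0("train", 1:5), paste0("gene", 1:8)))
test_fpkm <- matrix(rexp(2 * 8, rate = 0.2), nrow = 2,
                    dimnames = list(paste0("test", 1:2), paste0("gene", 1:8)))

# Step 1: log2(FPKM + 1) transformation
train_log <- log2(train_fpkm + 1)
test_log <- log2(test_fpkm + 1)

# Step 2: target distribution learned from the training samples only
# (mean of the k-th smallest value across training samples)
target <- unname(rowMeans(apply(train_log, 1, sort)))

# Step 3: replace each value by the target quantile of its within-sample rank
to_target <- function(x, target) target[rank(x, ties.method = "first")]
train_norm <- t(apply(train_log, 1, to_target, target = target))
test_norm <- t(apply(test_log, 1, to_target, target = target))
colnames(train_norm) <- colnames(train_log)
colnames(test_norm) <- colnames(test_log)

# Every sample now shares the same set of values, in gene-specific order
print(round(train_norm[1:2, 1:4], 3))
```

Deriving the target from the training samples alone means the test samples are mapped onto a distribution that was fixed without looking at them.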
## Required inputs
For this function, you need to provide the training and test datasets obtained from the previous step (Train/Test Split). Additionally, you must specify the column number where clinical information ends (e.g., 21) in the input datasets. Finally, you need to define output names for the resulting datasets: `train_clin_data` (which contains only clinical information from the training data), `test_clin_data` (which contains only clinical information from the test data), `train_Normalized_data_clin_data` (which contains both clinical information and normalized gene expression values for the training samples), and `test_Normalized_data_clin_data` (which contains both clinical information and normalized gene expression values for the test samples).
# Step 3 - Data Normalization
#load train and test data from package
data(train_FPKM, package = "CPSM")
# View top rows and first 30 columns of data
print(head(train_FPKM[1:30]),3)
#> Age subtype gender race ajcc_pathologic_tumor_stage
#> TCGA-FG-8191-01 30.81 PN Male WHITE NA
#> TCGA-HT-7611-01 36.45 <NA> Male WHITE NA
#> TCGA-HW-7489-01 38.88 NE Male WHITE NA
#> TCGA-HT-A5RA-01 65.08 CL Female WHITE NA
#> TCGA-TQ-A7RW-01 32.32 PN Male WHITE NA
#> TCGA-DH-5141-01 32.43 PN Male WHITE NA
#> histological_type histological_grade
#> TCGA-FG-8191-01 Oligodendroglioma G3
#> TCGA-HT-7611-01 Oligoastrocytoma G2
#> TCGA-HW-7489-01 Oligoastrocytoma G2
#> TCGA-HT-A5RA-01 Astrocytoma G3
#> TCGA-TQ-A7RW-01 Oligodendroglioma G2
#> TCGA-DH-5141-01 Oligodendroglioma G3
#> treatment_outcome_first_course radiation_treatment_adjuvant
#> TCGA-FG-8191-01 <NA> YES
#> TCGA-HT-7611-01 Complete Remission/Response <NA>
#> TCGA-HW-7489-01 <NA> NO
#> TCGA-HT-A5RA-01 Partial Remission/Response YES
#> TCGA-TQ-A7RW-01 Progressive Disease YES
#> TCGA-DH-5141-01 <NA> YES
#> sample_type type OS OS.time DSS DSS.time DFI DFI.time PFI
#> TCGA-FG-8191-01 Primary LGG 0 992 0 992 NA NA 0
#> TCGA-HT-7611-01 Primary LGG 0 1752 0 1752 0 1752 0
#> TCGA-HW-7489-01 Primary LGG 1 1262 1 1262 NA NA 1
#> TCGA-HT-A5RA-01 Primary LGG 0 832 0 832 NA NA 0
#> TCGA-TQ-A7RW-01 Primary LGG 1 821 1 821 NA NA 1
#> TCGA-DH-5141-01 Primary LGG 0 968 0 968 NA NA 0
#> PFI.time OS_month A1BG A1CF A2M A2ML1 A3GALT2 A4GALT
#> TCGA-FG-8191-01 992 33 0.2225 0.0017 65.0561 1.3771 0.0985 0.6899
#> TCGA-HT-7611-01 1752 58 0.1776 0.0016 121.2081 2.8675 0.0302 0.8552
#> TCGA-HW-7489-01 1262 41 0.0793 0.0017 47.4765 2.9089 0.0490 1.6156
#> TCGA-HT-A5RA-01 832 27 0.0464 0.0021 73.5076 4.0962 0.0805 1.2023
#> TCGA-TQ-A7RW-01 511 27 0.1385 0.0026 154.0140 1.2802 0.3198 1.1617
#> TCGA-DH-5141-01 968 32 0.0202 0.0017 72.8135 0.6277 0.0000 2.7282
#> A4GNT AAAS AACS AADAC
#> TCGA-FG-8191-01 0.0094 7.7563 1.0195 0.0103
#> TCGA-HT-7611-01 0.0087 8.8128 0.6825 0.0000
#> TCGA-HW-7489-01 0.0188 7.5007 3.6948 0.0000
#> TCGA-HT-A5RA-01 0.0116 8.8330 1.9868 0.0000
#> TCGA-TQ-A7RW-01 0.0141 10.6275 1.3589 0.0154
#> TCGA-DH-5141-01 0.0454 9.4101 0.5631 0.0000
data(test_FPKM, package = "CPSM")
# View top rows and first 30 columns of data
print(head(test_FPKM[1:30]),3)
#> Age subtype gender race ajcc_pathologic_tumor_stage
#> TCGA-P5-A5EY-01 21.13 NE Male WHITE NA
#> TCGA-TM-A84H-01 44.56 <NA> Female WHITE NA
#> TCGA-DU-8161-01 63.15 CL Female WHITE NA
#> TCGA-HT-8114-01 36.13 <NA> Male WHITE NA
#> TCGA-E1-A7YY-01 27.01 NE Female WHITE NA
#> TCGA-S9-A7R4-01 46.91 PN Male WHITE NA
#> histological_type histological_grade
#> TCGA-P5-A5EY-01 Astrocytoma G2
#> TCGA-TM-A84H-01 Oligoastrocytoma G3
#> TCGA-DU-8161-01 Oligoastrocytoma G3
#> TCGA-HT-8114-01 Oligoastrocytoma G3
#> TCGA-E1-A7YY-01 Oligodendroglioma G2
#> TCGA-S9-A7R4-01 Astrocytoma G3
#> treatment_outcome_first_course radiation_treatment_adjuvant
#> TCGA-P5-A5EY-01 <NA> <NA>
#> TCGA-TM-A84H-01 Complete Remission/Response YES
#> TCGA-DU-8161-01 <NA> YES
#> TCGA-HT-8114-01 <NA> YES
#> TCGA-E1-A7YY-01 Stable Disease YES
#> TCGA-S9-A7R4-01 Partial Remission/Response YES
#> sample_type type OS OS.time DSS DSS.time DFI DFI.time PFI
#> TCGA-P5-A5EY-01 Primary LGG 0 72 0 72 NA NA 0
#> TCGA-TM-A84H-01 Primary LGG 0 926 0 926 0 926 0
#> TCGA-DU-8161-01 Primary LGG 1 722 1 722 NA NA 1
#> TCGA-HT-8114-01 Primary LGG 0 1040 0 1040 0 1040 0
#> TCGA-E1-A7YY-01 Primary LGG 1 4445 1 4445 NA NA 1
#> TCGA-S9-A7R4-01 Primary LGG 0 914 0 914 NA NA 0
#> PFI.time OS_month A1BG A1CF A2M A2ML1 A3GALT2 A4GALT
#> TCGA-P5-A5EY-01 72 2 0.0470 0.0000 73.9497 2.5386 0.0612 1.7536
#> TCGA-TM-A84H-01 926 30 0.2326 0.0035 108.8116 0.0441 0.0812 9.2614
#> TCGA-DU-8161-01 111 24 0.3725 0.0112 52.7906 2.0212 0.2807 1.1864
#> TCGA-HT-8114-01 1040 34 0.1101 0.0000 118.9994 1.2509 0.0615 1.6700
#> TCGA-E1-A7YY-01 958 146 0.0906 0.0036 51.2459 0.8483 0.0506 1.0117
#> TCGA-S9-A7R4-01 914 30 0.0383 0.0046 231.7222 0.7090 0.1069 3.7380
#> A4GNT AAAS AACS AADAC
#> TCGA-P5-A5EY-01 0.0235 4.3158 1.8799 0.0000
#> TCGA-TM-A84H-01 0.1588 11.1481 1.9256 0.0713
#> TCGA-DU-8161-01 0.0706 9.9388 1.6917 0.0000
#> TCGA-HT-8114-01 0.0530 10.0073 0.8450 0.0064
#> TCGA-E1-A7YY-01 0.0291 5.2719 8.4546 0.0317
#> TCGA-S9-A7R4-01 0.1106 8.3687 0.9448 0.0000
# Call function to Normalize the training and test data sets
Result_N_data <- train_test_normalization_f(
train_data = train_FPKM,
test_data = test_FPKM,
col_num = 21
)
#------------------------ OUTPUTS ---------------------#
# Access the Normalized train and test data
Train_Clin <- Result_N_data$Train_Clin
Test_Clin <- Result_N_data$Test_Clin
Train_Norm_data <- Result_N_data$Train_Norm_data
Test_Norm_data <- Result_N_data$Test_Norm_data
# view output - train clinical data
str(Train_Clin[1:10])
#> 'data.frame': 158 obs. of 10 variables:
#> $ Age : num 30.8 36.5 38.9 65.1 32.3 ...
#> $ subtype : chr "PN" NA "NE" "CL" ...
#> $ gender : chr "Male" "Male" "Male" "Female" ...
#> $ race : chr "WHITE" "WHITE" "WHITE" "WHITE" ...
#> $ ajcc_pathologic_tumor_stage : logi NA NA NA NA NA NA ...
#> $ histological_type : chr "Oligodendroglioma" "Oligoastrocytoma" "Oligoastrocytoma" "Astrocytoma" ...
#> $ histological_grade : chr "G3" "G2" "G2" "G3" ...
#> $ treatment_outcome_first_course: chr NA "Complete Remission/Response" NA "Partial Remission/Response" ...
#> $ radiation_treatment_adjuvant : chr "YES" NA "NO" "YES" ...
#> $ sample_type : chr "Primary" "Primary" "Primary" "Primary" ...
# view output - Train normalized data
str(Train_Norm_data[1:10])
#> 'data.frame': 158 obs. of 10 variables:
#> $ Age : num 30.8 36.5 38.9 65.1 32.3 ...
#> $ subtype : chr "PN" NA "NE" "CL" ...
#> $ gender : chr "Male" "Male" "Male" "Female" ...
#> $ race : chr "WHITE" "WHITE" "WHITE" "WHITE" ...
#> $ ajcc_pathologic_tumor_stage : logi NA NA NA NA NA NA ...
#> $ histological_type : chr "Oligodendroglioma" "Oligoastrocytoma" "Oligoastrocytoma" "Astrocytoma" ...
#> $ histological_grade : chr "G3" "G2" "G2" "G3" ...
#> $ treatment_outcome_first_course: chr NA "Complete Remission/Response" NA "Partial Remission/Response" ...
#> $ radiation_treatment_adjuvant : chr "YES" NA "NO" "YES" ...
#> $ sample_type : chr "Primary" "Primary" "Primary" "Primary" ...
# view output - test clinical data
str(Test_Clin[1:10])
#> 'data.frame': 18 obs. of 10 variables:
#> $ Age : num 21.1 44.6 63.1 36.1 27 ...
#> $ subtype : chr "NE" NA "CL" NA ...
#> $ gender : chr "Male" "Female" "Female" "Male" ...
#> $ race : chr "WHITE" "WHITE" "WHITE" "WHITE" ...
#> $ ajcc_pathologic_tumor_stage : logi NA NA NA NA NA NA ...
#> $ histological_type : chr "Astrocytoma" "Oligoastrocytoma" "Oligoastrocytoma" "Oligoastrocytoma" ...
#> $ histological_grade : chr "G2" "G3" "G3" "G3" ...
#> $ treatment_outcome_first_course: chr NA "Complete Remission/Response" NA NA ...
#> $ radiation_treatment_adjuvant : chr NA "YES" "YES" "YES" ...
#> $ sample_type : chr "Primary" "Primary" "Primary" "Primary" ...
# view output - Test normalized data
str(Test_Norm_data[1:10])
#> 'data.frame': 18 obs. of 10 variables:
#> $ Age : num 21.1 44.6 63.1 36.1 27 ...
#> $ subtype : chr "NE" NA "CL" NA ...
#> $ gender : chr "Male" "Female" "Female" "Male" ...
#> $ race : chr "WHITE" "WHITE" "WHITE" "WHITE" ...
#> $ ajcc_pathologic_tumor_stage : logi NA NA NA NA NA NA ...
#> $ histological_type : chr "Astrocytoma" "Oligoastrocytoma" "Oligoastrocytoma" "Oligoastrocytoma" ...
#> $ histological_grade : chr "G2" "G3" "G3" "G3" ...
#> $ treatment_outcome_first_course: chr NA "Complete Remission/Response" NA NA ...
#> $ radiation_treatment_adjuvant : chr NA "YES" "YES" "YES" ...
#> $ sample_type : chr "Primary" "Primary" "Primary" "Primary" ...
After running the function, four output objects are generated: `Train_Clin` (which contains only clinical features from the training data), `Test_Clin` (which contains only clinical features from the test data), `Train_Norm_data` (which includes clinical features and normalized gene expression values for the training samples), and `Test_Norm_data` (which includes clinical features and normalized gene expression values for the test samples).
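How a column index separates the clinical block from the expression block can be pictured with a toy data frame. The sketch assumes that columns 1 to col_num hold clinical and survival information and that all later columns hold gene expression; the data frame toy and its gene names are hypothetical, and the package functions may handle the index differently.

```r
# Hypothetical toy table: three clinical/survival columns followed by two genes
toy <- data.frame(
  Age = c(30.8, 36.5),
  OS = c(0L, 1L),
  OS_month = c(33L, 58L),
  GeneA = c(0.22, 0.18),
  GeneB = c(65.1, 121.2),
  row.names = c("S1", "S2")
)

col_num <- 3  # last clinical/survival column in this toy table

# Clinical-only view and expression-only view of the same samples
clin <- toy[, seq_len(col_num), drop = FALSE]
expr <- toy[, (col_num + 1):ncol(toy), drop = FALSE]

print(colnames(clin))
print(colnames(expr))
```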
To create a survival model, the next step is to calculate the Prognostic Index (PI) score. The PI score is based on the expression levels of features selected by the LASSO regression model and their corresponding beta coefficients. For example, suppose five features (G1, G2, G3, G4, G5) are selected by the LASSO method, and their associated coefficients are B1, B2, B3, B4, and B5, respectively. The PI score is then computed using the following formula:
PI score = G1 * B1 + G2 * B2 + G3 * B3 + G4 * B4 + G5 * B5
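As a small worked example of this formula, the snippet below computes PI scores for two samples as the product of an expression matrix and a coefficient vector. The expression values and beta coefficients are made up for illustration and do not come from a fitted LASSO model.

```r
# Hypothetical normalized expression values: 2 samples x 5 selected genes
expr <- matrix(c(1, 2, 0, 1, 3,
                 0, 1, 2, 2, 1),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("S1", "S2"), paste0("G", 1:5)))

# Hypothetical beta coefficients B1..B5 for the selected genes
beta <- c(G1 = 0.5, G2 = -0.2, G3 = 0.1, G4 = 0.3, G5 = -0.4)

# PI = G1 * B1 + G2 * B2 + G3 * B3 + G4 * B4 + G5 * B5, one score per sample
PI <- drop(expr[, names(beta)] %*% beta)
print(PI)
```

For sample S1 the sum is 0.5 - 0.4 + 0 + 0.3 - 1.2 = -0.8, and for S2 it is 0 - 0.2 + 0.2 + 0.6 - 0.4 = 0.2. Indexing the matrix by names(beta) keeps genes and coefficients aligned even if the column order differs.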
For this function, you need to provide the normalized training data object (Train_Norm_data) and test data object (Test_Norm_data) obtained from the previous step (train_test_normalization_f). Additionally, you must specify the column number (`col_num`) where clinical features end (e.g., 21), the number of cross-validation folds (`nfolds`) for the LASSO regression method (e.g., 5), and the survival time (`surv_time`) and survival event (`surv_event`) columns in the data (e.g., `OS_month` and `OS`, respectively). The LASSO regression is implemented using the `glmnet` package. Finally, you need to define the name of the output object that stores the results, which include the selected LASSO features and their corresponding PI values.
# Step 4 - Lasso PI Score
#load data - Normalized train data
data(Train_Norm_data, package = "CPSM")
# View top rows and first 30 columns of data
print(str(Train_Norm_data[1:30]),2)
#> 'data.frame': 158 obs. of 30 variables:
#> $ Age : num 42.1 54.3 39.4 33.5 25.9 ...
#> $ subtype : chr "PN" "ME" NA "PN" ...
#> $ gender : chr "Female" "Male" "Female" "Female" ...
#> $ race : chr "WHITE" "WHITE" "WHITE" "WHITE" ...
#> $ ajcc_pathologic_tumor_stage : logi NA NA NA NA NA NA ...
#> $ histological_type : chr "Oligodendroglioma" "Oligodendroglioma" "Astrocytoma" "Astrocytoma" ...
#> $ histological_grade : chr "G2" "G3" "G2" "G3" ...
#> $ treatment_outcome_first_course: chr NA "Progressive Disease" NA NA ...
#> $ radiation_treatment_adjuvant : chr "YES" NA "YES" NA ...
#> $ sample_type : chr "Primary" "Primary" "Primary" "Primary" ...
#> $ type : chr "LGG" "LGG" "LGG" "LGG" ...
#> $ OS : int 1 1 0 1 0 0 0 0 0 0 ...
#> $ OS.time : int 2379 814 1469 111 1706 1189 1040 457 1294 1341 ...
#> $ DSS : int 1 1 0 1 0 0 0 0 0 0 ...
#> $ DSS.time : int 2379 814 1469 111 1706 1189 1040 457 1294 1341 ...
#> $ DFI : int NA NA NA NA NA 0 0 0 NA 1 ...
#> $ DFI.time : int NA NA NA NA NA 1189 1040 457 NA 858 ...
#> $ PFI : int 1 1 0 1 1 0 0 0 0 1 ...
#> $ PFI.time : int 362 410 1469 56 1205 1189 1040 457 1294 858 ...
#> $ OS_month : int 78 27 48 4 56 39 34 15 43 44 ...
#> $ A1BG : num 0.034 0.136 0.074 0.325 0.049 0.051 0.111 0.027 0.006 0.21 ...
#> $ A1CF : num 0 0 0.003 0.04 0.013 0.001 0 0 0 0.014 ...
#> $ A2M : num 218.9 244 126.2 193.3 58.3 ...
#> $ A2ML1 : num 0.414 0.103 1.309 0.047 1.388 ...
#> $ A3GALT2 : num 0.078 0.229 0.04 0.154 0.122 0.048 0.066 0.147 0.194 0.079 ...
#> $ A4GALT : num 2.818 1.691 5.026 2.166 0.558 ...
#> $ A4GNT : num 0.049 0.037 0.055 0.011 0.018 0 0.06 0 0 0.03 ...
#> $ AAAS : num 12.01 8.58 12.41 15.38 6.77 ...
#> $ AACS : num 1.999 1.146 0.936 1.373 3.612 ...
#> $ AADAC : num 0 0.009 0 0.027 0.011 0 0.005 0 0 0.142 ...
#> NULL
#load data - Normalized test data
data(Test_Norm_data, package = "CPSM")
# View top rows and first 30 columns of data
print(str(Test_Norm_data[1:30]),2)
#> 'data.frame': 18 obs. of 30 variables:
#> $ Age : num 41 57.8 64.8 50.6 34.5 ...
#> $ subtype : chr NA "CL" "NE" NA ...
#> $ gender : chr "Female" "Male" "Female" "Female" ...
#> $ race : chr "WHITE" "WHITE" "WHITE" "WHITE" ...
#> $ ajcc_pathologic_tumor_stage : logi NA NA NA NA NA NA ...
#> $ histological_type : chr "Astrocytoma" "Astrocytoma" "Oligodendroglioma" "Oligodendroglioma" ...
#> $ histological_grade : chr "G2" "G3" "G2" "G2" ...
#> $ treatment_outcome_first_course: chr "Progressive Disease" "Progressive Disease" NA NA ...
#> $ radiation_treatment_adjuvant : chr "YES" "YES" "NO" "YES" ...
#> $ sample_type : chr "Primary" "Primary" "Primary" "Primary" ...
#> $ type : chr "LGG" "LGG" "LGG" "LGG" ...
#> $ OS : int 1 1 0 0 0 0 0 0 0 1 ...
#> $ OS.time : int 984 819 50 1397 548 795 526 863 1943 3470 ...
#> $ DSS : int 1 1 0 0 0 0 0 0 0 NA ...
#> $ DSS.time : int 984 819 50 1397 548 795 526 863 1943 3470 ...
#> $ DFI : int NA NA NA NA NA 0 0 NA NA NA ...
#> $ DFI.time : int NA NA NA NA NA 795 526 NA NA NA ...
#> $ PFI : int 1 1 0 1 0 0 0 0 1 1 ...
#> $ PFI.time : int 294 350 50 456 548 795 526 863 1627 2097 ...
#> $ OS_month : int 32 27 2 46 18 26 17 28 64 114 ...
#> $ A1BG : num 0.223 0.112 0.068 0.173 0.159 0.189 0.532 0.091 0.249 0.21 ...
#> $ A1CF : num 0.001 0 0.001 0.011 0 0 0.007 0.001 0.004 0 ...
#> $ A2M : num 39.8 218.9 87.8 371.1 136 ...
#> $ A2ML1 : num 4.826 1.396 2.443 0.168 2.54 ...
#> $ A3GALT2 : num 0.144 0.049 0.106 0.214 0.137 0.036 0.3 0.06 0.147 0.08 ...
#> $ A4GALT : num 0.986 1.183 1.8 2.483 1.026 ...
#> $ A4GNT : num 0.085 0.033 0.042 0.061 0.028 0.028 0.052 0.034 0.046 0.022 ...
#> $ AAAS : num 8.06 7.93 8.44 7.98 10.67 ...
#> $ AACS : num 1.04 1.995 1.772 0.675 1.222 ...
#> $ AADAC : num 0 0.005 0 0 0.012 0 0 0 0 0 ...
#> NULL
# Call/Run function to select features and generate PI score using LASSO-Regression
Result_PI <- Lasso_PI_scores_f(
train_data = Train_Norm_data,
test_data = Test_Norm_data,
nfolds = 5,
col_num = 21,
surv_time = "OS_month",
surv_event = "OS"
)
#------------------------ OUTPUTS ---------------------#
# Access output - features selected by LASSO (with beta coefficient values)
Train_Lasso_key_variables <- Result_PI$Train_Lasso_key_variables
# View top features (from the selected set) with beta coefficient values
print(head(Train_Lasso_key_variables))
#> coeff
#> AADACL4 2.324
#> ABCA12 1.089
#> ABCC3 0.067
#> ABI1 -0.004
#> ABRA -0.712
#> AC006059.2 -372.393
# Access output - Train set with PI values
Train_PI_data <- Result_PI$Train_PI_data
# view Train set with PI values
str(Train_PI_data[1:10])
#> 'data.frame': 158 obs. of 10 variables:
#> $ OS : int 1 1 0 1 0 0 0 0 0 0 ...
#> $ OS_month : int 78 27 48 4 56 39 34 15 43 44 ...
#> $ AADACL4 : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ ABCA12 : num 0.212 0.297 0.053 0.358 0.064 0.032 0.09 0.051 0.1 0.014 ...
#> $ ABCC3 : num 0.1 5.476 0.029 0.108 0.037 ...
#> $ ABI1 : num 44.7 15.3 43.7 27.7 31.6 ...
#> $ ABRA : num 0 0.035 0 0 0.015 0.034 0.006 0 0.009 0.005 ...
#> $ AC006059.2: num 0 0 0 0 0 0 0 0 0 0 ...
#> $ AC008676.3: num 0.032 0.034 0.032 0 0.014 0.04 0.011 0.01 0.017 0.042 ...
#> $ AC008764.4: num 0.038 0 0 0 0 0 0.008 0 0 0.015 ...
# Access output - Test set with PI values
Test_PI_data <- Result_PI$Test_PI_data
# view Test set with PI values
str(Test_PI_data[1:10])
#> 'data.frame': 18 obs. of 10 variables:
#> $ OS : int 1 1 0 0 0 0 0 0 0 1 ...
#> $ OS_month : int 32 27 2 46 18 26 17 28 64 114 ...
#> $ AADACL4 : num 0.011 0 0.016 0 0 0 0 0 0 0 ...
#> $ ABCA12 : num 0.048 0.165 0.044 0.033 0.053 0.02 0.09 0.038 0.058 0.096 ...
#> $ ABCC3 : num 0.679 1.358 0.184 0.321 0.024 ...
#> $ ABI1 : num 39.4 19.9 31 27.8 56.4 ...
#> $ ABRA : num 0 0.001 0.029 0 0 0 0.026 0.004 0 0.014 ...
#> $ AC006059.2: int 0 0 0 0 0 0 0 0 0 0 ...
#> $ AC008676.3: num 0.002 0.054 0.012 0.092 0.031 0.003 0.004 0.078 0.046 0 ...
#> $ AC008764.4: num 0.036 0.009 0 0 0.015 0 0 0 0 0 ...
# View the LASSO regression cross-validation (lambda) plot
plot(Result_PI$cvfit)
## Outputs
The Lasso_PI_scores_f function generates the following output objects:
1. Train_Lasso_key_variables: A list of features selected by LASSO along with their beta coefficient values.
2. Train_Cox_Lasso_Regression_lambda_plot: The LASSO regression lambda plot.
3. Train_PI_data: This dataset contains the expression values of genes selected by LASSO along with the PI score in the last column for the training samples.
4. Test_PI_data: This dataset contains the expression values of genes selected by LASSO along with the PI score in the last column for the test samples.
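As a rough illustration of how a prognostic index of this kind is commonly formed, the sketch below computes a score as the sum of each selected gene's expression value multiplied by its beta coefficient. This weighted-sum form is an assumption about LASSO-based PI scores in general, not a reproduction of the internal code of Lasso_PI_scores_f. It uses only five of the selected genes shown above, so the resulting values are partial and will not match the PI column of Train_PI_data.

```r
# Beta coefficients for five LASSO-selected genes (copied from the output above)
beta <- c(AADACL4 = 2.324, ABCA12 = 1.089, ABCC3 = 0.067,
          ABI1 = -0.004, ABRA = -0.712)

# Expression values of the same genes for two training samples
expr <- rbind(
  "TCGA-E1-5318-01" = c(0, 0.212, 0.100, 44.705, 0.000),
  "TCGA-DU-A76L-01" = c(0, 0.297, 5.476, 15.317, 0.035)
)
colnames(expr) <- names(beta)

# Partial PI (assumed form): sum over genes of beta * expression
partial_PI <- drop(expr %*% beta)
print(round(partial_PI, 3))
```

A higher score corresponds to a larger predicted hazard under this convention, because genes with positive coefficients push the score up as their expression increases.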
In addition to the Prognostic Index (PI) score, the Univariate_sig_features_f function in the CPSM package allows for the selection of significant features based on univariate Cox regression survival analysis. This function identifies features with a p-value less than 0.05, which are able to stratify high-risk and low-risk survival groups. The stratification is done by using the median expression value of each feature as a cutoff.
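The idea of median-based stratification followed by a univariate Cox model can be sketched for a single feature as follows. This is a minimal example on simulated data using coxph from the survival package; the variable names and the simulated values are illustrative assumptions, and the code is not the implementation used inside Univariate_sig_features_f.

```r
library(survival)  # recommended package distributed with R

set.seed(42)
n <- 80
expr <- rexp(n)                            # simulated expression of one gene
high <- as.integer(expr > median(expr))    # 1 = above-median group, 0 = otherwise

# Simulated survival: higher event rate (shorter survival) in the high group
time <- rexp(n, rate = ifelse(high == 1, 0.10, 0.04))
event <- rbinom(n, size = 1, prob = 0.8)   # 1 = event observed, 0 = censored

# Univariate Cox regression on the dichotomized feature
fit <- coxph(Surv(time, event) ~ high)
s <- summary(fit)

result <- c(
  Beta = s$coefficients[1, "coef"],
  HR = s$coefficients[1, "exp(coef)"],
  P_value = s$coefficients[1, "Pr(>|z|)"],
  Concordance = unname(s$concordance[1])
)
print(round(result, 4))

# Keep the feature only if it passes the p < 0.05 threshold
keep_feature <- result[["P_value"]] < 0.05
```

Repeating this for every feature and retaining those with keep_feature equal to TRUE gives a table with the same kind of columns (Beta, HR, P-value, Concordance) as the output shown later in this section.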
## Required inputs
To use this function, you need to provide the normalized training (Train_Norm_data) and test (Test_Norm_data) dataset objects, which were obtained from the previous step (train_test_normalization_f). Additionally, you must specify the column number (col_num) where the clinical features end (e.g., 21), as well as the names of the columns containing survival time (surv_time, e.g., OS_month) and survival event information (surv_event, e.g., OS). The function returns its results as a list, which includes training and test datasets containing the expression values of the significant genes identified through univariate survival analysis.
# Step 4b Univariate Survival Significant Feature Selection.
# load normalized train data
data(Train_Norm_data, package = "CPSM")
# View top rows and first 30 columns of data
print(head(Train_Norm_data[1:30]),3)
#> Age subtype gender race ajcc_pathologic_tumor_stage
#> TCGA-E1-5318-01 42.09 PN Female WHITE NA
#> TCGA-DU-A76L-01 54.27 ME Male WHITE NA
#> TCGA-CS-6667-01 39.36 <NA> Female WHITE NA
#> TCGA-E1-A7YI-01 33.48 PN Female WHITE NA
#> TCGA-HT-7610-01 25.86 NE Female WHITE NA
#> TCGA-HT-7856-01 35.99 NE Male WHITE NA
#> histological_type histological_grade
#> TCGA-E1-5318-01 Oligodendroglioma G2
#> TCGA-DU-A76L-01 Oligodendroglioma G3
#> TCGA-CS-6667-01 Astrocytoma G2
#> TCGA-E1-A7YI-01 Astrocytoma G3
#> TCGA-HT-7610-01 Oligoastrocytoma G2
#> TCGA-HT-7856-01 Oligodendroglioma G3
#> treatment_outcome_first_course radiation_treatment_adjuvant
#> TCGA-E1-5318-01 <NA> YES
#> TCGA-DU-A76L-01 Progressive Disease <NA>
#> TCGA-CS-6667-01 <NA> YES
#> TCGA-E1-A7YI-01 <NA> <NA>
#> TCGA-HT-7610-01 <NA> NO
#> TCGA-HT-7856-01 <NA> YES
#> sample_type type OS OS.time DSS DSS.time DFI DFI.time PFI
#> TCGA-E1-5318-01 Primary LGG 1 2379 1 2379 NA NA 1
#> TCGA-DU-A76L-01 Primary LGG 1 814 1 814 NA NA 1
#> TCGA-CS-6667-01 Primary LGG 0 1469 0 1469 NA NA 0
#> TCGA-E1-A7YI-01 Primary LGG 1 111 1 111 NA NA 1
#> TCGA-HT-7610-01 Primary LGG 0 1706 0 1706 NA NA 1
#> TCGA-HT-7856-01 Primary LGG 0 1189 0 1189 0 1189 0
#> PFI.time OS_month A1BG A1CF A2M A2ML1 A3GALT2 A4GALT
#> TCGA-E1-5318-01 362 78 0.034 0.000 218.915 0.414 0.078 2.818
#> TCGA-DU-A76L-01 410 27 0.136 0.000 243.966 0.103 0.229 1.691
#> TCGA-CS-6667-01 1469 48 0.074 0.003 126.158 1.309 0.040 5.026
#> TCGA-E1-A7YI-01 56 4 0.325 0.040 193.297 0.047 0.154 2.166
#> TCGA-HT-7610-01 1205 56 0.049 0.013 58.336 1.388 0.122 0.558
#> TCGA-HT-7856-01 1189 39 0.051 0.001 52.523 1.139 0.048 1.691
#> A4GNT AAAS AACS AADAC
#> TCGA-E1-5318-01 0.049 12.010 1.999 0.000
#> TCGA-DU-A76L-01 0.037 8.581 1.146 0.009
#> TCGA-CS-6667-01 0.055 12.411 0.936 0.000
#> TCGA-E1-A7YI-01 0.011 15.385 1.373 0.027
#> TCGA-HT-7610-01 0.018 6.767 3.612 0.011
#> TCGA-HT-7856-01 0.000 6.568 3.026 0.000
# load normalized test data
data(Test_Norm_data, package = "CPSM")
# View top rows and first 30 columns of data
print(head(Test_Norm_data[1:30]),3)
#> Age subtype gender race ajcc_pathologic_tumor_stage
#> TCGA-E1-A7Z6-01 41.03 <NA> Female WHITE NA
#> TCGA-S9-A7IX-01 57.75 CL Male WHITE NA
#> TCGA-HT-8010-01 64.83 NE Female WHITE NA
#> TCGA-VM-A8C8-01 50.63 <NA> Female WHITE NA
#> TCGA-DU-5847-01 34.51 ME Female WHITE NA
#> TCGA-TQ-A7RQ-01 38.84 PN Female WHITE NA
#> histological_type histological_grade
#> TCGA-E1-A7Z6-01 Astrocytoma G2
#> TCGA-S9-A7IX-01 Astrocytoma G3
#> TCGA-HT-8010-01 Oligodendroglioma G2
#> TCGA-VM-A8C8-01 Oligodendroglioma G2
#> TCGA-DU-5847-01 Astrocytoma G3
#> TCGA-TQ-A7RQ-01 Oligodendroglioma G2
#> treatment_outcome_first_course radiation_treatment_adjuvant
#> TCGA-E1-A7Z6-01 Progressive Disease YES
#> TCGA-S9-A7IX-01 Progressive Disease YES
#> TCGA-HT-8010-01 <NA> NO
#> TCGA-VM-A8C8-01 <NA> YES
#> TCGA-DU-5847-01 <NA> YES
#> TCGA-TQ-A7RQ-01 Complete Remission/Response NO
#> sample_type type OS OS.time DSS DSS.time DFI DFI.time PFI
#> TCGA-E1-A7Z6-01 Primary LGG 1 984 1 984 NA NA 1
#> TCGA-S9-A7IX-01 Primary LGG 1 819 1 819 NA NA 1
#> TCGA-HT-8010-01 Primary LGG 0 50 0 50 NA NA 0
#> TCGA-VM-A8C8-01 Primary LGG 0 1397 0 1397 NA NA 1
#> TCGA-DU-5847-01 Primary LGG 0 548 0 548 NA NA 0
#> TCGA-TQ-A7RQ-01 Primary LGG 0 795 0 795 0 795 0
#> PFI.time OS_month A1BG A1CF A2M A2ML1 A3GALT2 A4GALT
#> TCGA-E1-A7Z6-01 294 32 0.223 0.001 39.777 4.826 0.144 0.986
#> TCGA-S9-A7IX-01 350 27 0.112 0.000 218.915 1.396 0.049 1.183
#> TCGA-HT-8010-01 50 2 0.068 0.001 87.826 2.443 0.106 1.800
#> TCGA-VM-A8C8-01 456 46 0.173 0.011 371.100 0.168 0.214 2.483
#> TCGA-DU-5847-01 548 18 0.159 0.000 136.028 2.540 0.137 1.026
#> TCGA-TQ-A7RQ-01 795 26 0.189 0.000 30.120 1.463 0.036 0.842
#> A4GNT AAAS AACS AADAC
#> TCGA-E1-A7Z6-01 0.085 8.056 1.040 0.000
#> TCGA-S9-A7IX-01 0.033 7.931 1.995 0.005
#> TCGA-HT-8010-01 0.042 8.441 1.772 0.000
#> TCGA-VM-A8C8-01 0.061 7.977 0.675 0.000
#> TCGA-DU-5847-01 0.028 10.669 1.222 0.012
#> TCGA-TQ-A7RQ-01 0.028 12.269 1.549 0.000
# Call/Run function to select features using Univariate COX method
Result_Uni <- Univariate_sig_features_f(
train_data = Train_Norm_data,
test_data = Test_Norm_data,
col_num = 21,
surv_time = "OS_month",
surv_event = "OS"
)
#------------------------ OUTPUTS ---------------------#
# Access output - A table of univariate significant genes
Univariate_Suv_Sig_G_L <- Result_Uni$Univariate_Survival_Significant_genes_List
# view top features (from selected set) using Univariate Cox regression
print(head(Univariate_Suv_Sig_G_L))
#> ID Beta HR P-value
#> [1,] "A2ML1" "-0.866856742823893" "0.420270493515833" "0.0106145057604715"
#> [2,] "AADACL4" "0.873027884562316" "2.39414909680176" "0.0175881269132526"
#> [3,] "AAMDC" "-0.998086698141534" "0.368583979372125" "0.0110714841848022"
#> [4,] "AAR2" "0.881686359818125" "2.41496878080511" "0.0164162899035616"
#> [5,] "ABCA12" "1.64057170506327" "5.15811759138245" "0.00304881327508366"
#> [6,] "ABCB4" "2.95478458786445" "19.1975868812067" "0.000116035546113627"
#> GP1 GP2 Hr-Inv-lst Concordance Std_Error
#> [1,] "29" "129" "2.37941995792842" "0.56482982171799" "0.0396590237209888"
#> [2,] "135" "23" "0.417684930874129" "0.562722852512156" "0.0357392244341072"
#> [3,] "109" "49" "2.71308590705293" "0.604051863857374" "0.0323332776194887"
#> [4,] "136" "22" "0.414084027896467" "0.582333873581848" "0.038819253941559"
#> [5,] "150" "8" "0.193869174613366" "0.549108589951378" "0.0289413274861312"
#> [6,] "156" "2" "0.052089880159831" "0.537925445705024" "0.0262359794323798"
# Access output - Train data with only sig features
Train_Uni_sig_data <- Result_Uni$Train_Uni_sig_data
# view Train data with only sig features
str(Train_Uni_sig_data[1:10])
#> 'data.frame': 158 obs. of 10 variables:
#> $ A2ML1 : num 0.414 0.103 1.309 0.047 1.388 ...
#> $ AADACL4: num 0 0 0 0 0 0 0 0 0 0 ...
#> $ AAMDC : num 3.2 1.98 3.48 2.46 1.41 ...
#> $ AAR2 : num 19.3 20.4 16.6 19.3 11.7 ...
#> $ ABCA12 : num 0.212 0.297 0.053 0.358 0.064 0.032 0.09 0.051 0.1 0.014 ...
#> $ ABCB4 : num 2.08 0.417 0.669 3.152 0.121 ...
#> $ ABCB5 : num 0 0.004 0.002 0 0 0.001 0.034 0 0 0 ...
#> $ ABCB6 : num 0.207 0.197 0.541 0.622 0.309 0.207 0.262 0.37 0.135 0.917 ...
#> $ ABCC3 : num 0.1 5.476 0.029 0.108 0.037 ...
#> $ ABCC9 : num 1.02 0.39 4.535 0.626 1.895 ...
# Access output - Test data with only sig features
Test_Uni_sig_data <- Result_Uni$Test_Uni_sig_data
# view Test data with only sig features
str(Test_Uni_sig_data[1:10])
#> 'data.frame': 18 obs. of 10 variables:
#> $ A2ML1 : num 4.826 1.396 2.443 0.168 2.54 ...
#> $ AADACL4: num 0.011 0 0.016 0 0 0 0 0 0 0 ...
#> $ AAMDC : num 3.27 1.58 1.76 1.67 3.28 ...
#> $ AAR2 : num 11.4 11.6 10 21.6 17.7 ...
#> $ ABCA12 : num 0.048 0.165 0.044 0.033 0.053 0.02 0.09 0.038 0.058 0.096 ...
#> $ ABCB4 : num 0.16 0.318 0.189 0.929 0.414 0.24 0.651 0.15 0.603 0.305 ...
#> $ ABCB5 : num 0 0 0 0.004 0.002 0 0 0 0.006 0.001 ...
#> $ ABCB6 : num 0.364 0.121 0.151 0.118 0.445 0.911 0.387 0.758 0.358 0.258 ...
#> $ ABCC3 : num 0.679 1.358 0.184 0.321 0.024 ...
#> $ ABCC9 : num 7.311 0.813 4.246 0.917 0.825 ...
# Access output - A table of univariate significant clinical features
Uni_Sur_Sig_clin_List <- Result_Uni$Univariate_Survival_Significant_clin_List
# View the table of univariate significant clinical features
print(head(Uni_Sur_Sig_clin_List))
#> ID Beta HR
#> [1,] "Age" "0.0531572486440569" "1.0545954657885"
#> [2,] "histological_grade" "1.12802828640997" "3.08955876560186"
#> [3,] "treatment_outcome_first_course" "-13.8983014939401" "0.282236870220404"
#> [4,] "OS.time" "-0.07615081340589" "0.926676440790178"
#> [5,] "DSS" "3.39293135972313" "29.7530414304684"
#> [6,] "DSS.time" "-0.07615081340589" "0.926676440790178"
#> P-value GP1 GP2 Hr-Inv-lst Concordance
#> [1,] "2.48069300679025e-05" "1" "1" "0.948230892736031" "0.73452188006483"
#> [2,] "0.00185360938997885" "72" "86" "0.323670813817712" "0.646353322528363"
#> [3,] "9.20543579649292e-07" "29" "1" "3.54312319017385" "0.764795144157815"
#> [4,] "9.63794925582338e-07" "1" "1" "1.07912530844887" "0.989789303079416"
#> [5,] "1.52430917638486e-08" "118" "40" "0.0336100093275156" "0.811831442463533"
#> [6,] "9.63794925582338e-07" "1" "1" "1.07912530844887" "0.989789303079416"
#> Std_Error
#> [1,] "0.0462323166675435"
#> [2,] "0.0354801995297574"
#> [3,] "0.0675486348293551"
#> [4,] "0.00340008416737385"
#> [5,] "0.0325148651509674"
#> [6,] "0.00340008416737385"
# Access output - ZPH test results
ZPH_test <- Result_Uni$ZPH_Genes
# View ZPH (proportional hazards diagnostic test) results for the first genes
head(ZPH_test[1:10])
#> $A1BG
#> chisq df p
#> (tr_data1[, i]) > (median(tr_data1[1, i])) 2.15 1 0.14
#> GLOBAL 2.15 1 0.14
#>
#> $A1CF
#> chisq df p
#> (tr_data1[, i]) > (median(tr_data1[1, i])) 2.72 1 0.099
#> GLOBAL 2.72 1 0.099
#>
#> $A2M
#> chisq df p
#> (tr_data1[, i]) > (median(tr_data1[1, i])) 0.00433 1 0.95
#> GLOBAL 0.00433 1 0.95
#>
#> $A2ML1
#> chisq df p
#> (tr_data1[, i]) > (median(tr_data1[1, i])) 1.45 1 0.23
#> GLOBAL 1.45 1 0.23
#>
#> $A3GALT2
#> chisq df p
#> (tr_data1[, i]) > (median(tr_data1[1, i])) 1.09 1 0.3
#> GLOBAL 1.09 1 0.3
#>
#> $A4GALT
#> chisq df p
#> (tr_data1[, i]) > (median(tr_data1[1, i])) 0.485 1 0.49
#> GLOBAL 0.485 1 0.49
The Univariate_sig_features_f function generates the following output objects:
1. Univariate_Survival_Significant_genes_List: A table of univariate significant genes, along with their corresponding coefficient values, hazard ratio (HR) values, p-values, and C-Index values.
2. Train_Uni_sig_data: This dataset contains the expression values of the significant genes selected by univariate survival analysis for the training samples.
3. Test_Uni_sig_data: This dataset contains the expression values of the significant genes selected by univariate survival analysis for the test samples.
4. Univariate_Survival_Significant_clin_List: A table of univariate significant clinical features.
5. ZPH_Genes: The ZPH (proportional hazards diagnostic) test results for each gene.
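The table of significant genes is printed above as a character matrix, so the numeric columns are stored as strings. The following sketch, which rebuilds three rows of that table by hand, shows one way to convert such a matrix into a data frame with numeric columns and rank the genes by p-value. The object names sig_genes and sig_df are illustrative and are not objects created by CPSM.

```r
# Three rows copied from the printed table of significant genes
sig_genes <- matrix(
  c("A2ML1",  "-0.866856742823893", "0.420270493515833", "0.0106145057604715",
    "ABCA12", "1.64057170506327",   "5.15811759138245",  "0.00304881327508366",
    "ABCB4",  "2.95478458786445",   "19.1975868812067",  "0.000116035546113627"),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("ID", "Beta", "HR", "P-value"))
)

# Convert to a data frame and make every column except ID numeric
sig_df <- as.data.frame(sig_genes, stringsAsFactors = FALSE)
num_cols <- setdiff(names(sig_df), "ID")
sig_df[num_cols] <- lapply(sig_df[num_cols], as.numeric)

# Rank genes from most to least significant
sig_df <- sig_df[order(sig_df[["P-value"]]), ]
print(sig_df)
```

With numeric columns in place, the table can be filtered further, for example by keeping only genes whose HR is greater than 1.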
After selecting significant features using LASSO or univariate survival analysis, the next step is to develop a machine learning (ML) prediction model to estimate the survival probability of patients. The MTLR_pred_model_f function in the CPSM package provides several options for building prediction models based on different feature sets. These options include:
- Model_type = 1: Model based on only clinical features
- Model_type = 2: Model based on PI score
- Model_type = 3: Model based on PI score + clinical features
- Model_type = 4: Model based on significant univariate features
- Model_type = 5: Model based on significant univariate features + clinical features
For this analysis, we are interested in developing a model based on the PI score (i.e., Model_type = 2).
## Required inputs
To use this function, the following inputs are required:
1. Training data with only clinical features
2. Test data with only clinical features
3. Model type (e.g., 2 for a model based on PI score)
4. Training data with PI score
5. Test data with PI score
6. Clin_Feature_List (e.g., Key_PI_list): A list of features to be used for building the model
7. surv_time: The name of the column containing survival time in months (e.g., OS_month)
8. surv_event: The name of the column containing survival event information (e.g., OS)
9. nfolds: An integer specifying the number of folds for cross-validation.
These inputs will allow the MTLR_pred_model_f function to generate a prediction model for the survival probability of patients based on the provided data.
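Before calling the modelling function, it can help to confirm that the feature data actually contain the columns named in the inputs above. The helper below, check_model_inputs, is a hypothetical convenience function written for this illustration and is not part of CPSM. It assumes the feature list is a data frame with an ID column, as in the Key_PI_list and Key_Clin_feature_list objects shown in this section, and it runs here on a small hand-made data frame.

```r
# Hypothetical helper (not part of CPSM): basic checks on model inputs
check_model_inputs <- function(features_data, feature_list, surv_time, surv_event) {
  needed <- c(feature_list$ID, surv_time, surv_event)
  missing_cols <- setdiff(needed, colnames(features_data))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  # The event column is expected to be coded as 0 (censored) or 1 (event)
  events <- features_data[[surv_event]]
  if (!all(events[!is.na(events)] %in% c(0, 1))) {
    stop("Survival event column must be coded as 0/1")
  }
  invisible(TRUE)
}

# Toy data frame with survival columns and a PI column
toy <- data.frame(
  OS = c(1, 0, 1),
  OS_month = c(32, 27, 2),
  PI = c(-2.48, 0.80, -1.49)
)
feature_list <- data.frame(ID = "PI")

check_model_inputs(toy, feature_list, surv_time = "OS_month", surv_event = "OS")
```

The call returns invisibly when all checks pass and stops with an error message naming the missing columns otherwise.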
# load data for model building/development
# Load Training data
data(Train_Clin, package = "CPSM")
# View structure of data
print(str(Train_Clin),2)
#> 'data.frame': 158 obs. of 20 variables:
#> $ Age : num 42.1 54.3 39.4 33.5 25.9 ...
#> $ subtype : chr "PN" "ME" NA "PN" ...
#> $ gender : chr "Female" "Male" "Female" "Female" ...
#> $ race : chr "WHITE" "WHITE" "WHITE" "WHITE" ...
#> $ ajcc_pathologic_tumor_stage : logi NA NA NA NA NA NA ...
#> $ histological_type : chr "Oligodendroglioma" "Oligodendroglioma" "Astrocytoma" "Astrocytoma" ...
#> $ histological_grade : chr "G2" "G3" "G2" "G3" ...
#> $ treatment_outcome_first_course: chr NA "Progressive Disease" NA NA ...
#> $ radiation_treatment_adjuvant : chr "YES" NA "YES" NA ...
#> $ sample_type : chr "Primary" "Primary" "Primary" "Primary" ...
#> $ type : chr "LGG" "LGG" "LGG" "LGG" ...
#> $ OS : int 1 1 0 1 0 0 0 0 0 0 ...
#> $ OS.time : int 2379 814 1469 111 1706 1189 1040 457 1294 1341 ...
#> $ DSS : int 1 1 0 1 0 0 0 0 0 0 ...
#> $ DSS.time : int 2379 814 1469 111 1706 1189 1040 457 1294 1341 ...
#> $ DFI : int NA NA NA NA NA 0 0 0 NA 1 ...
#> $ DFI.time : int NA NA NA NA NA 1189 1040 457 NA 858 ...
#> $ PFI : int 1 1 0 1 1 0 0 0 0 1 ...
#> $ PFI.time : int 362 410 1469 56 1205 1189 1040 457 1294 858 ...
#> $ OS_month : int 78 27 48 4 56 39 34 15 43 44 ...
#> NULL
# Load Test data
data(Test_Clin, package = "CPSM")
# View structure of data
print(str(Test_Clin),2)
#> 'data.frame': 18 obs. of 20 variables:
#> $ Age : num 41 57.8 64.8 50.6 34.5 ...
#> $ subtype : chr NA "CL" "NE" NA ...
#> $ gender : chr "Female" "Male" "Female" "Female" ...
#> $ race : chr "WHITE" "WHITE" "WHITE" "WHITE" ...
#> $ ajcc_pathologic_tumor_stage : logi NA NA NA NA NA NA ...
#> $ histological_type : chr "Astrocytoma" "Astrocytoma" "Oligodendroglioma" "Oligodendroglioma" ...
#> $ histological_grade : chr "G2" "G3" "G2" "G2" ...
#> $ treatment_outcome_first_course: chr "Progressive Disease" "Progressive Disease" NA NA ...
#> $ radiation_treatment_adjuvant : chr "YES" "YES" "NO" "YES" ...
#> $ sample_type : chr "Primary" "Primary" "Primary" "Primary" ...
#> $ type : chr "LGG" "LGG" "LGG" "LGG" ...
#> $ OS : int 1 1 0 0 0 0 0 0 0 1 ...
#> $ OS.time : int 984 819 50 1397 548 795 526 863 1943 3470 ...
#> $ DSS : int 1 1 0 0 0 0 0 0 0 NA ...
#> $ DSS.time : int 984 819 50 1397 548 795 526 863 1943 3470 ...
#> $ DFI : int NA NA NA NA NA 0 0 NA NA NA ...
#> $ DFI.time : int NA NA NA NA NA 795 526 NA NA NA ...
#> $ PFI : int 1 1 0 1 0 0 0 0 1 1 ...
#> $ PFI.time : int 294 350 50 456 548 795 526 863 1627 2097 ...
#> $ OS_month : int 32 27 2 46 18 26 17 28 64 114 ...
#> NULL
# Load a list of selected features
data(Key_Clin_feature_list, package = "CPSM")
print(Key_Clin_feature_list)
#> ID
#> 1 Age
#> 2 gender
#> 3 subtype
#> 4 radiation_treatment_adjuvant
# Call/Run function to develop MTLR model
Result_Model_Type1 <- MTLR_pred_model_f(
train_clin_data = Train_Clin,
test_clin_data = Test_Clin,
Model_type = 1,
train_features_data = Train_Clin,
test_features_data = Test_Clin,
Clin_Feature_List = Key_Clin_feature_list,
surv_time = "OS_month",
surv_event = "OS",
nfolds = 5
)
#------------------------ OUTPUTS ---------------------#
# Access output - Predicted survival probability across different time points
survCurves_data <- Result_Model_Type1$survCurves_data
# View predicted survival probability for test samples
str(survCurves_data)
#> 'data.frame': 13 obs. of 14 variables:
#> $ time_point : num 0 8.31 14 19 24 ...
#> $ TCGA-S9-A7IX-01: num 1 0.963 0.944 0.867 0.826 ...
#> $ TCGA-HT-8010-01: num 1 0.97 0.954 0.895 0.865 ...
#> $ TCGA-DU-5847-01: num 1 0.982 0.972 0.927 0.899 ...
#> $ TCGA-TQ-A7RQ-01: num 1 0.991 0.986 0.964 0.952 ...
#> $ TCGA-HT-7606-01: num 1 0.99 0.985 0.962 0.947 ...
#> $ TCGA-S9-A7QY-01: num 1 0.992 0.988 0.969 0.958 ...
#> $ TCGA-DH-5142-01: num 1 0.99 0.985 0.961 0.947 ...
#> $ TCGA-DU-6408-01: num 1 0.993 0.989 0.971 0.96 ...
#> $ TCGA-FG-8191-01: num 1 0.99 0.984 0.959 0.943 ...
#> $ TCGA-DU-6542-01: num 1 0.992 0.987 0.967 0.953 ...
#> $ TCGA-DB-A4XF-01: num 1 0.985 0.977 0.942 0.922 ...
#> $ TCGA-S9-A6WN-01: num 1 0.979 0.967 0.916 0.885 ...
#> $ TCGA-P5-A5EX-01: num 1 0.976 0.963 0.907 0.872 ...
# Access output - Predicted mean and median survival of test data
mean_median_survival_tim_d <- Result_Model_Type1$mean_median_survival_time_data
# View Predicted mean and median survival time
str(mean_median_survival_tim_d)
#> 'data.frame': 13 obs. of 4 variables:
#> $ ID : chr "TCGA-S9-A7IX-01" "TCGA-HT-8010-01" "TCGA-DU-5847-01" "TCGA-TQ-A7RQ-01" ...
#> $ Mean : num 59.9 69.9 73.1 82.8 80.3 ...
#> $ Median : num 50 59 76.1 83.4 82.1 ...
#> $ OS_month: num 27 2 18 26 17 28 64 114 33 2 ...
# Access output - all results for test data
survival_result_based_on_MTLR <- Result_Model_Type1$survival_result_based_on_MTLR
# Access output - Final Evaluation parameters of results
Error_mat_for_Model <- Result_Model_Type1$Error_mat_for_Model
str(Error_mat_for_Model)
#> num [1:2, 1:4] 0.76 1 41.68 49.55 37.81 ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : chr [1:2] "Training_set" "Test_set"
#> ..$ : chr [1:4] "C_index" "Mean_MAE" "Median_MAE" "IBS"
# view Evaluation parameters of the model
head(Error_mat_for_Model)
#> C_index Mean_MAE Median_MAE IBS
#> Training_set 0.76 41.68 37.81 0.197
#> Test_set 1.00 49.55 47.66 0.174
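To make the evaluation metrics above more concrete, the sketch below computes a mean absolute error and a concordance index by hand. The MAE uses the predicted mean survival times and observed OS_month values of the first five test samples printed above and ignores censoring, which is a simplifying assumption; CPSM may treat censored samples differently. The function harrell_c is an illustrative implementation of a Harrell-style C-index on toy data and is not a CPSM function.

```r
# Predicted mean survival (months) and observed OS_month for the first five
# test samples, copied from the output above
pred_mean <- c(59.9, 69.9, 73.1, 82.8, 80.3)
observed <- c(27, 2, 18, 26, 17)

# Naive MAE that ignores censoring (simplifying assumption)
mae <- mean(abs(pred_mean - observed))
print(mae)

# Harrell-style concordance index: among comparable pairs, the fraction in
# which the sample with the earlier event has the higher predicted risk
harrell_c <- function(time, event, risk) {
  concordant <- 0
  comparable <- 0
  n <- length(time)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # A pair is comparable if sample i had an event before time j
      if (event[i] == 1 && time[i] < time[j]) {
        comparable <- comparable + 1
        if (risk[i] > risk[j]) {
          concordant <- concordant + 1
        } else if (risk[i] == risk[j]) {
          concordant <- concordant + 0.5
        }
      }
    }
  }
  concordant / comparable
}

# Toy data: risk decreases as survival time increases, so ranking is perfect
toy_time <- c(5, 10, 15, 20)
toy_event <- c(1, 1, 0, 1)
toy_risk <- c(4, 3, 2, 1)
c_index <- harrell_c(toy_time, toy_event, toy_risk)
print(c_index)
```

A C-index of 1 means every comparable pair is ranked correctly, while 0.5 corresponds to random ranking. On a very small test set, values at or near 1 can occur by chance and should be read with caution.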
# Load training data with clinical features
data(Train_Clin, package = "CPSM")
# View structure of data
print(str(Train_Clin))
#> 'data.frame': 158 obs. of 20 variables:
#> $ Age : num 42.1 54.3 39.4 33.5 25.9 ...
#> $ subtype : chr "PN" "ME" NA "PN" ...
#> $ gender : chr "Female" "Male" "Female" "Female" ...
#> $ race : chr "WHITE" "WHITE" "WHITE" "WHITE" ...
#> $ ajcc_pathologic_tumor_stage : logi NA NA NA NA NA NA ...
#> $ histological_type : chr "Oligodendroglioma" "Oligodendroglioma" "Astrocytoma" "Astrocytoma" ...
#> $ histological_grade : chr "G2" "G3" "G2" "G3" ...
#> $ treatment_outcome_first_course: chr NA "Progressive Disease" NA NA ...
#> $ radiation_treatment_adjuvant : chr "YES" NA "YES" NA ...
#> $ sample_type : chr "Primary" "Primary" "Primary" "Primary" ...
#> $ type : chr "LGG" "LGG" "LGG" "LGG" ...
#> $ OS : int 1 1 0 1 0 0 0 0 0 0 ...
#> $ OS.time : int 2379 814 1469 111 1706 1189 1040 457 1294 1341 ...
#> $ DSS : int 1 1 0 1 0 0 0 0 0 0 ...
#> $ DSS.time : int 2379 814 1469 111 1706 1189 1040 457 1294 1341 ...
#> $ DFI : int NA NA NA NA NA 0 0 0 NA 1 ...
#> $ DFI.time : int NA NA NA NA NA 1189 1040 457 NA 858 ...
#> $ PFI : int 1 1 0 1 1 0 0 0 0 1 ...
#> $ PFI.time : int 362 410 1469 56 1205 1189 1040 457 1294 858 ...
#> $ OS_month : int 78 27 48 4 56 39 34 15 43 44 ...
#> NULL
# Load test data with clinical features
data(Test_Clin, package = "CPSM")
# View structure of data
print(str(Test_Clin))
#> 'data.frame': 18 obs. of 20 variables:
#> $ Age : num 41 57.8 64.8 50.6 34.5 ...
#> $ subtype : chr NA "CL" "NE" NA ...
#> $ gender : chr "Female" "Male" "Female" "Female" ...
#> $ race : chr "WHITE" "WHITE" "WHITE" "WHITE" ...
#> $ ajcc_pathologic_tumor_stage : logi NA NA NA NA NA NA ...
#> $ histological_type : chr "Astrocytoma" "Astrocytoma" "Oligodendroglioma" "Oligodendroglioma" ...
#> $ histological_grade : chr "G2" "G3" "G2" "G2" ...
#> $ treatment_outcome_first_course: chr "Progressive Disease" "Progressive Disease" NA NA ...
#> $ radiation_treatment_adjuvant : chr "YES" "YES" "NO" "YES" ...
#> $ sample_type : chr "Primary" "Primary" "Primary" "Primary" ...
#> $ type : chr "LGG" "LGG" "LGG" "LGG" ...
#> $ OS : int 1 1 0 0 0 0 0 0 0 1 ...
#> $ OS.time : int 984 819 50 1397 548 795 526 863 1943 3470 ...
#> $ DSS : int 1 1 0 0 0 0 0 0 0 NA ...
#> $ DSS.time : int 984 819 50 1397 548 795 526 863 1943 3470 ...
#> $ DFI : int NA NA NA NA NA 0 0 NA NA NA ...
#> $ DFI.time : int NA NA NA NA NA 795 526 NA NA NA ...
#> $ PFI : int 1 1 0 1 0 0 0 0 1 1 ...
#> $ PFI.time : int 294 350 50 456 548 795 526 863 1627 2097 ...
#> $ OS_month : int 32 27 2 46 18 26 17 28 64 114 ...
#> NULL
# load training data with PI score
data(Train_PI_data, package = "CPSM")
# View top rows of data
print(head(Train_PI_data),3)
#> OS OS_month AADACL4 ABCA12 ABCC3 ABI1 ABRA AC006059.2
#> TCGA-E1-5318-01 1 78 0 0.212 0.100 44.705 0.000 0
#> TCGA-DU-A76L-01 1 27 0 0.297 5.476 15.317 0.035 0
#> TCGA-CS-6667-01 0 48 0 0.053 0.029 43.732 0.000 0
#> TCGA-E1-A7YI-01 1 4 0 0.358 0.108 27.662 0.000 0
#> TCGA-HT-7610-01 0 56 0 0.064 0.037 31.640 0.015 0
#> TCGA-HT-7856-01 0 39 0 0.032 0.198 22.709 0.034 0
#> AC008676.3 AC008764.4 AC010132.3 AC012651.1 AC013477.1
#> TCGA-E1-5318-01 0.032 0.038 0.000 0.021 0.000
#> TCGA-DU-A76L-01 0.034 0.000 0.000 0.010 0.007
#> TCGA-CS-6667-01 0.032 0.000 0.000 0.009 0.000
#> TCGA-E1-A7YI-01 0.000 0.000 0.000 0.023 0.007
#> TCGA-HT-7610-01 0.014 0.000 0.000 0.010 0.006
#> TCGA-HT-7856-01 0.040 0.000 0.002 0.007 0.008
#> AC109583.1 AC113348.2 AC117457.1 AC131160.1 AC144573.1
#> TCGA-E1-5318-01 0.000 0.000 0 0.017 0
#> TCGA-DU-A76L-01 0.010 0.000 0 0.016 0
#> TCGA-CS-6667-01 0.007 0.013 0 0.000 0
#> TCGA-E1-A7YI-01 0.075 0.012 0 0.043 0
#> TCGA-HT-7610-01 0.018 0.005 0 0.002 0
#> TCGA-HT-7856-01 0.014 0.000 0 0.024 0
#> AC235565.2 ACO2 ACTC1 ACTL9 ACTRT2 ADD3 ADGRL3 ADH7
#> TCGA-E1-5318-01 0 18.305 0.039 0.000 0 25.731 28.388 0.000
#> TCGA-DU-A76L-01 0 5.218 1.808 0.000 0 13.483 7.859 0.010
#> TCGA-CS-6667-01 0 15.122 0.009 0.000 0 44.266 10.953 0.006
#> TCGA-E1-A7YI-01 0 6.822 0.520 0.016 0 6.670 26.863 0.000
#> TCGA-HT-7610-01 0 15.385 0.790 0.000 0 57.390 19.435 0.006
#> TCGA-HT-7856-01 0 21.867 0.289 0.000 0 48.499 6.707 0.000
#> ADPRHL1 ADRA2A ADRA2B AFMID AKAP12 AKAP14 AKAP3 AL139260.3
#> TCGA-E1-5318-01 0.196 0.562 0.082 2.853 8.264 0.153 1.455 0.000
#> TCGA-DU-A76L-01 0.139 0.180 0.524 4.107 136.028 0.051 0.719 0.030
#> TCGA-CS-6667-01 0.309 0.228 0.154 2.628 9.123 0.067 1.745 0.006
#> TCGA-E1-A7YI-01 0.340 0.187 0.457 5.320 30.573 0.069 1.540 0.013
#> TCGA-HT-7610-01 0.411 2.123 0.068 2.818 16.100 0.102 1.277 0.002
#> TCGA-HT-7856-01 1.191 2.166 0.158 6.175 18.849 0.116 0.730 0.008
#> ALG6 ANGPTL6 ANKRD20A1 ANXA6 ANXA8L1 AOC2 APOBEC1 APOE
#> TCGA-E1-5318-01 1.388 0.183 0.067 65.301 0.00 1.013 0.000 690.584
#> TCGA-DU-A76L-01 3.489 0.898 0.072 56.409 0.00 0.153 0.000 371.100
#> TCGA-CS-6667-01 2.123 0.151 0.142 85.268 0.00 0.953 0.000 1506.178
#> TCGA-E1-A7YI-01 1.772 0.567 0.603 37.475 0.00 0.805 0.025 690.584
#> TCGA-HT-7610-01 1.463 0.145 0.062 63.921 0.01 0.659 0.000 1506.178
#> TCGA-HT-7856-01 1.146 0.098 0.018 45.204 0.00 0.516 0.000 1506.178
#> ARHGAP11B ARHGAP12 ARHGEF19 ARL2.SNX15 ARL6IP1 ARPC4.TTLL3
#> TCGA-E1-5318-01 0.031 28.563 0.581 0.000 93.080 0.000
#> TCGA-DU-A76L-01 0.071 11.792 0.590 0.005 85.268 0.000
#> TCGA-CS-6667-01 0.032 27.989 0.708 0.009 67.615 0.015
#> TCGA-E1-A7YI-01 0.452 24.567 1.621 0.000 275.465 0.000
#> TCGA-HT-7610-01 0.014 16.323 0.197 0.004 95.574 0.000
#> TCGA-HT-7856-01 0.027 12.551 0.178 0.003 95.574 0.000
#> ASB4 ASB5 ASB6 ASPM ATP1A2 ATP2B4 AZGP1 B3GAT2 PI
#> TCGA-E1-5318-01 0.156 0.042 6.080 0.255 21.294 33.770 26.363 5.186 -1.634461
#> TCGA-DU-A76L-01 0.123 0.119 7.764 1.745 18.212 18.568 2.597 0.516 0.961367
#> TCGA-CS-6667-01 0.140 0.028 5.476 0.146 371.100 39.388 14.003 4.639 -3.111244
#> TCGA-E1-A7YI-01 0.499 0.113 11.032 6.689 23.747 21.755 2.574 0.741 4.830642
#> TCGA-HT-7610-01 0.210 0.093 7.154 0.207 243.966 49.878 1.764 7.564 -2.217197
#> TCGA-HT-7856-01 0.631 0.109 7.787 0.101 318.823 25.125 10.915 3.012 -1.824024
# Load test data with PI score
data(Test_PI_data, package = "CPSM")
# View top rows of data
print(head(Test_PI_data,3))
#> OS OS_month AADACL4 ABCA12 ABCC3 ABI1 ABRA AC006059.2
#> TCGA-E1-A7Z6-01 1 32 0.011 0.048 0.679 39.388 0.000 0
#> TCGA-S9-A7IX-01 1 27 0.000 0.165 1.358 19.936 0.001 0
#> TCGA-HT-8010-01 0 2 0.016 0.044 0.184 30.971 0.029 0
#> AC008676.3 AC008764.4 AC010132.3 AC012651.1 AC013477.1
#> TCGA-E1-A7Z6-01 0.002 0.036 0 0.036 0.008
#> TCGA-S9-A7IX-01 0.054 0.009 0 0.009 0.006
#> TCGA-HT-8010-01 0.012 0.000 0 0.022 0.011
#> AC109583.1 AC113348.2 AC117457.1 AC131160.1 AC144573.1
#> TCGA-E1-A7Z6-01 0.060 0.000 0 0.000 0.000
#> TCGA-S9-A7IX-01 0.011 0.021 0 0.006 0.005
#> TCGA-HT-8010-01 0.013 0.008 0 0.002 0.000
#> AC235565.2 ACO2 ACTC1 ACTL9 ACTRT2 ADD3 ADGRL3 ADH7
#> TCGA-E1-A7Z6-01 0 15.317 0.016 0.000 0 78.952 9.914 0.004
#> TCGA-S9-A7IX-01 0 10.114 0.333 0.000 0 23.873 6.906 0.006
#> TCGA-HT-8010-01 0 19.743 1.206 0.009 0 73.436 11.533 0.000
#> ADPRHL1 ADRA2A ADRA2B AFMID AKAP12 AKAP14 AKAP3 AL139260.3
#> TCGA-E1-A7Z6-01 0.358 1.396 0.039 0.665 6.024 0.093 1.381 0.003
#> TCGA-S9-A7IX-01 0.197 0.626 0.148 2.552 15.872 0.100 0.420 0.016
#> TCGA-HT-8010-01 0.708 1.480 0.135 3.159 10.517 0.037 1.229 0.000
#> ALG6 ANGPTL6 ANKRD20A1 ANXA6 ANXA8L1 AOC2 APOBEC1 APOE
#> TCGA-E1-A7Z6-01 1.846 0.228 0.078 47.912 0.000 0.405 0 1506.178
#> TCGA-S9-A7IX-01 2.586 0.651 0.030 53.270 0.001 0.375 0 690.584
#> TCGA-HT-8010-01 1.106 0.411 0.002 26.525 0.009 0.358 0 1506.178
#> ARHGAP11B ARHGAP12 ARHGEF19 ARL2.SNX15 ARL6IP1 ARPC4.TTLL3
#> TCGA-E1-A7Z6-01 0.002 20.833 0.439 0.026 75.157 0
#> TCGA-S9-A7IX-01 0.033 11.968 0.396 0.000 62.682 0
#> TCGA-HT-8010-01 0.018 16.323 0.499 0.000 85.268 0
#> ASB4 ASB5 ASB6 ASPM ATP1A2 ATP2B4 AZGP1 B3GAT2 PI
#> TCGA-E1-A7Z6-01 0.399 0.023 5.710 0.003 275.465 25.594 1.878 12.169 -2.478895
#> TCGA-S9-A7IX-01 0.135 0.488 7.447 0.848 275.465 19.435 7.424 3.978 0.795975
#> TCGA-HT-8010-01 0.659 0.028 6.007 0.017 886.982 29.705 3.529 15.256 -1.494746
# Load the list of features (PI)
data(Key_PI_list, package = "CPSM")
# View the list of features used to build the model
print(str(Key_PI_list))
#> 'data.frame': 1 obs. of 1 variable:
#> $ ID: chr "PI"
#> NULL
# Call/Run function to develop MTLR prediction model based on PI score
Result_Model_Type2 <- MTLR_pred_model_f(
train_clin_data = Train_Clin,
test_clin_data = Test_Clin,
Model_type = 2,
train_features_data = Train_PI_data,
test_features_data = Test_PI_data,
Clin_Feature_List = Key_PI_list,
surv_time = "OS_month",
surv_event = "OS",
nfolds = 5
)
#------------------------ OUTPUTS ---------------------#
# Access output - Predicted survival probability across different time points
survCurves_data <- Result_Model_Type2$survCurves_data
# View predicted survival probability for test samples
str(survCurves_data)
#> 'data.frame': 15 obs. of 19 variables:
#> $ time_point : num 0 4 10.9 14.4 18.9 ...
#> $ TCGA-E1-A7Z6-01: num 1 1 1 1 1 ...
#> $ TCGA-S9-A7IX-01: num 1 0.987 0.969 0.945 0.854 ...
#> $ TCGA-HT-8010-01: num 1 1 1 1 1 ...
#> $ TCGA-VM-A8C8-01: num 1 0.998 0.997 0.994 0.98 ...
#> $ TCGA-DU-5847-01: num 1 1 1 1 1 ...
#> $ TCGA-TQ-A7RQ-01: num 1 1 1 1 1 ...
#> $ TCGA-HT-7606-01: num 1 0.986 0.966 0.939 0.841 ...
#> $ TCGA-S9-A7QY-01: num 1 1 1 1 1 ...
#> $ TCGA-DH-5142-01: num 1 1 1 1 1 ...
#> $ TCGA-DU-6408-01: num 1 1 1 1 1 ...
#> $ TCGA-FG-8191-01: num 1 1 1 1 1 ...
#> $ TCGA-TM-A84S-01: num 1 1 1 1 1 ...
#> $ TCGA-DB-A64S-01: num 1 1 1 1 1 ...
#> $ TCGA-DU-6542-01: num 1 1 1 1 1 ...
#> $ TCGA-DU-A7T6-01: num 1 0.992 0.983 0.969 0.911 ...
#> $ TCGA-DB-A4XF-01: num 1 1 1 1 1 ...
#> $ TCGA-S9-A6WN-01: num 1 1 1 1 0.999 ...
#> $ TCGA-P5-A5EX-01: num 1 0.999 0.999 0.998 0.991 ...
# Access output - Predicted mean and median survival of test data
mean_median_surviv_tim_da <- Result_Model_Type2$mean_median_survival_time_data
# View Predicted mean and median survival of test data
str(mean_median_surviv_tim_da)
#> 'data.frame': 18 obs. of 4 variables:
#> $ ID : chr "TCGA-E1-A7Z6-01" "TCGA-S9-A7IX-01" "TCGA-HT-8010-01" "TCGA-VM-A8C8-01" ...
#> $ Mean : num 153 30 79.9 44.4 74.6 ...
#> $ Median : num 145.8 30.7 75.4 43.1 69.6 ...
#> $ OS_month: num 32 27 2 46 18 26 17 28 64 114 ...
# Access output - all results for test data
survival_result_b_on_MTLR <- Result_Model_Type2$survival_result_based_on_MTLR
# Access output - Final Evaluation parameters of results
Error_mat_for_Model <- Result_Model_Type2$Error_mat_for_Model
# view Evaluation parameters of the model
str(Error_mat_for_Model)
#> num [1:2, 1:4] 0.97 0.8 78.75 155.48 43.23 ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : chr [1:2] "Training_set" "Test_set"
#> ..$ : chr [1:4] "C_index" "Mean_MAE" "Median_MAE" "IBS"
# view Evaluation parameters of the model
head(Error_mat_for_Model)
#> C_index Mean_MAE Median_MAE IBS
#> Training_set 0.97 78.75 43.23 0.089
#> Test_set 0.80 155.48 56.88 0.238
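The Mean_MAE and Median_MAE columns summarize how far the predicted survival times fall from the observed ones. The sketch below illustrates the idea on the first three test samples shown above. It assumes that the error is the plain mean of absolute differences between OS_month and the predicted time, which ignores censoring; the computation inside MTLR_pred_model_f may differ, and the helper name mae is hypothetical.

```r
# First three rows of the predicted mean/median survival table shown above
pred <- data.frame(
  ID       = c("TCGA-E1-A7Z6-01", "TCGA-S9-A7IX-01", "TCGA-HT-8010-01"),
  Mean     = c(153, 30, 79.9),
  Median   = c(145.8, 30.7, 75.4),
  OS_month = c(32, 27, 2)
)

# Hypothetical helper: mean absolute error between observed and predicted times
# (assumption: censoring is ignored, unlike a full survival-aware metric)
mae <- function(observed, predicted) {
  mean(abs(observed - predicted), na.rm = TRUE)
}

mean_mae   <- mae(pred$OS_month, pred$Mean)
median_mae <- mae(pred$OS_month, pred$Median)
print(c(Mean_MAE = mean_mae, Median_MAE = median_mae))
```

Because only three samples are used here, the values will not match the Test_set row of Error_mat_for_Model.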
# Load training data with clinical feature
data(Train_Clin, package = "CPSM")
# View the structure of the data
print(str(Train_Clin),2)
#> 'data.frame': 158 obs. of 20 variables:
#> $ Age : num 42.1 54.3 39.4 33.5 25.9 ...
#> $ subtype : chr "PN" "ME" NA "PN" ...
#> $ gender : chr "Female" "Male" "Female" "Female" ...
#> $ race : chr "WHITE" "WHITE" "WHITE" "WHITE" ...
#> $ ajcc_pathologic_tumor_stage : logi NA NA NA NA NA NA ...
#> $ histological_type : chr "Oligodendroglioma" "Oligodendroglioma" "Astrocytoma" "Astrocytoma" ...
#> $ histological_grade : chr "G2" "G3" "G2" "G3" ...
#> $ treatment_outcome_first_course: chr NA "Progressive Disease" NA NA ...
#> $ radiation_treatment_adjuvant : chr "YES" NA "YES" NA ...
#> $ sample_type : chr "Primary" "Primary" "Primary" "Primary" ...
#> $ type : chr "LGG" "LGG" "LGG" "LGG" ...
#> $ OS : int 1 1 0 1 0 0 0 0 0 0 ...
#> $ OS.time : int 2379 814 1469 111 1706 1189 1040 457 1294 1341 ...
#> $ DSS : int 1 1 0 1 0 0 0 0 0 0 ...
#> $ DSS.time : int 2379 814 1469 111 1706 1189 1040 457 1294 1341 ...
#> $ DFI : int NA NA NA NA NA 0 0 0 NA 1 ...
#> $ DFI.time : int NA NA NA NA NA 1189 1040 457 NA 858 ...
#> $ PFI : int 1 1 0 1 1 0 0 0 0 1 ...
#> $ PFI.time : int 362 410 1469 56 1205 1189 1040 457 1294 858 ...
#> $ OS_month : int 78 27 48 4 56 39 34 15 43 44 ...
#> NULL
# Load test data with clinical feature
data(Test_Clin, package = "CPSM")
# View the structure of the data
print(str(Test_Clin),2)
#> 'data.frame': 18 obs. of 20 variables:
#> $ Age : num 41 57.8 64.8 50.6 34.5 ...
#> $ subtype : chr NA "CL" "NE" NA ...
#> $ gender : chr "Female" "Male" "Female" "Female" ...
#> $ race : chr "WHITE" "WHITE" "WHITE" "WHITE" ...
#> $ ajcc_pathologic_tumor_stage : logi NA NA NA NA NA NA ...
#> $ histological_type : chr "Astrocytoma" "Astrocytoma" "Oligodendroglioma" "Oligodendroglioma" ...
#> $ histological_grade : chr "G2" "G3" "G2" "G2" ...
#> $ treatment_outcome_first_course: chr "Progressive Disease" "Progressive Disease" NA NA ...
#> $ radiation_treatment_adjuvant : chr "YES" "YES" "NO" "YES" ...
#> $ sample_type : chr "Primary" "Primary" "Primary" "Primary" ...
#> $ type : chr "LGG" "LGG" "LGG" "LGG" ...
#> $ OS : int 1 1 0 0 0 0 0 0 0 1 ...
#> $ OS.time : int 984 819 50 1397 548 795 526 863 1943 3470 ...
#> $ DSS : int 1 1 0 0 0 0 0 0 0 NA ...
#> $ DSS.time : int 984 819 50 1397 548 795 526 863 1943 3470 ...
#> $ DFI : int NA NA NA NA NA 0 0 NA NA NA ...
#> $ DFI.time : int NA NA NA NA NA 795 526 NA NA NA ...
#> $ PFI : int 1 1 0 1 0 0 0 0 1 1 ...
#> $ PFI.time : int 294 350 50 456 548 795 526 863 1627 2097 ...
#> $ OS_month : int 32 27 2 46 18 26 17 28 64 114 ...
#> NULL
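The models in this section use OS_month as the survival time, while the clinical data also carry OS.time in days. The OS_month values printed above are consistent with dividing OS.time by 30.44 and rounding to the nearest integer. The sketch below checks this on the first ten training samples; the factor 30.44 is an assumption inferred from the printed values, not a documented package constant.

```r
# OS.time (days) and OS_month for the first ten training samples shown above
os_days  <- c(2379, 814, 1469, 111, 1706, 1189, 1040, 457, 1294, 1341)
os_month <- c(78, 27, 48, 4, 56, 39, 34, 15, 43, 44)

# Assumed conversion: average month length of 30.44 days, rounded to whole months
days_per_month <- 30.44
converted <- round(os_days / days_per_month)

# Compare the converted values with the OS_month column
print(data.frame(os_days, os_month, converted))
all_match <- all(converted == os_month)
print(all_match)
```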
# Load training data with PI score value
data(Train_PI_data, package = "CPSM")
# View top rows of data
print(head(Train_PI_data),3)
#> OS OS_month AADACL4 ABCA12 ABCC3 ABI1 ABRA AC006059.2
#> TCGA-E1-5318-01 1 78 0 0.212 0.100 44.705 0.000 0
#> TCGA-DU-A76L-01 1 27 0 0.297 5.476 15.317 0.035 0
#> TCGA-CS-6667-01 0 48 0 0.053 0.029 43.732 0.000 0
#> TCGA-E1-A7YI-01 1 4 0 0.358 0.108 27.662 0.000 0
#> TCGA-HT-7610-01 0 56 0 0.064 0.037 31.640 0.015 0
#> TCGA-HT-7856-01 0 39 0 0.032 0.198 22.709 0.034 0
#> AC008676.3 AC008764.4 AC010132.3 AC012651.1 AC013477.1
#> TCGA-E1-5318-01 0.032 0.038 0.000 0.021 0.000
#> TCGA-DU-A76L-01 0.034 0.000 0.000 0.010 0.007
#> TCGA-CS-6667-01 0.032 0.000 0.000 0.009 0.000
#> TCGA-E1-A7YI-01 0.000 0.000 0.000 0.023 0.007
#> TCGA-HT-7610-01 0.014 0.000 0.000 0.010 0.006
#> TCGA-HT-7856-01 0.040 0.000 0.002 0.007 0.008
#> AC109583.1 AC113348.2 AC117457.1 AC131160.1 AC144573.1
#> TCGA-E1-5318-01 0.000 0.000 0 0.017 0
#> TCGA-DU-A76L-01 0.010 0.000 0 0.016 0
#> TCGA-CS-6667-01 0.007 0.013 0 0.000 0
#> TCGA-E1-A7YI-01 0.075 0.012 0 0.043 0
#> TCGA-HT-7610-01 0.018 0.005 0 0.002 0
#> TCGA-HT-7856-01 0.014 0.000 0 0.024 0
#> AC235565.2 ACO2 ACTC1 ACTL9 ACTRT2 ADD3 ADGRL3 ADH7
#> TCGA-E1-5318-01 0 18.305 0.039 0.000 0 25.731 28.388 0.000
#> TCGA-DU-A76L-01 0 5.218 1.808 0.000 0 13.483 7.859 0.010
#> TCGA-CS-6667-01 0 15.122 0.009 0.000 0 44.266 10.953 0.006
#> TCGA-E1-A7YI-01 0 6.822 0.520 0.016 0 6.670 26.863 0.000
#> TCGA-HT-7610-01 0 15.385 0.790 0.000 0 57.390 19.435 0.006
#> TCGA-HT-7856-01 0 21.867 0.289 0.000 0 48.499 6.707 0.000
#> ADPRHL1 ADRA2A ADRA2B AFMID AKAP12 AKAP14 AKAP3 AL139260.3
#> TCGA-E1-5318-01 0.196 0.562 0.082 2.853 8.264 0.153 1.455 0.000
#> TCGA-DU-A76L-01 0.139 0.180 0.524 4.107 136.028 0.051 0.719 0.030
#> TCGA-CS-6667-01 0.309 0.228 0.154 2.628 9.123 0.067 1.745 0.006
#> TCGA-E1-A7YI-01 0.340 0.187 0.457 5.320 30.573 0.069 1.540 0.013
#> TCGA-HT-7610-01 0.411 2.123 0.068 2.818 16.100 0.102 1.277 0.002
#> TCGA-HT-7856-01 1.191 2.166 0.158 6.175 18.849 0.116 0.730 0.008
#> ALG6 ANGPTL6 ANKRD20A1 ANXA6 ANXA8L1 AOC2 APOBEC1 APOE
#> TCGA-E1-5318-01 1.388 0.183 0.067 65.301 0.00 1.013 0.000 690.584
#> TCGA-DU-A76L-01 3.489 0.898 0.072 56.409 0.00 0.153 0.000 371.100
#> TCGA-CS-6667-01 2.123 0.151 0.142 85.268 0.00 0.953 0.000 1506.178
#> TCGA-E1-A7YI-01 1.772 0.567 0.603 37.475 0.00 0.805 0.025 690.584
#> TCGA-HT-7610-01 1.463 0.145 0.062 63.921 0.01 0.659 0.000 1506.178
#> TCGA-HT-7856-01 1.146 0.098 0.018 45.204 0.00 0.516 0.000 1506.178
#> ARHGAP11B ARHGAP12 ARHGEF19 ARL2.SNX15 ARL6IP1 ARPC4.TTLL3
#> TCGA-E1-5318-01 0.031 28.563 0.581 0.000 93.080 0.000
#> TCGA-DU-A76L-01 0.071 11.792 0.590 0.005 85.268 0.000
#> TCGA-CS-6667-01 0.032 27.989 0.708 0.009 67.615 0.015
#> TCGA-E1-A7YI-01 0.452 24.567 1.621 0.000 275.465 0.000
#> TCGA-HT-7610-01 0.014 16.323 0.197 0.004 95.574 0.000
#> TCGA-HT-7856-01 0.027 12.551 0.178 0.003 95.574 0.000
#> ASB4 ASB5 ASB6 ASPM ATP1A2 ATP2B4 AZGP1 B3GAT2 PI
#> TCGA-E1-5318-01 0.156 0.042 6.080 0.255 21.294 33.770 26.363 5.186 -1.634461
#> TCGA-DU-A76L-01 0.123 0.119 7.764 1.745 18.212 18.568 2.597 0.516 0.961367
#> TCGA-CS-6667-01 0.140 0.028 5.476 0.146 371.100 39.388 14.003 4.639 -3.111244
#> TCGA-E1-A7YI-01 0.499 0.113 11.032 6.689 23.747 21.755 2.574 0.741 4.830642
#> TCGA-HT-7610-01 0.210 0.093 7.154 0.207 243.966 49.878 1.764 7.564 -2.217197
#> TCGA-HT-7856-01 0.631 0.109 7.787 0.101 318.823 25.125 10.915 3.012 -1.824024
# Load test data with PI score value
data(Test_PI_data, package = "CPSM")
# View top rows of data
print(head(Test_PI_data),3)
#> OS OS_month AADACL4 ABCA12 ABCC3 ABI1 ABRA AC006059.2
#> TCGA-E1-A7Z6-01 1 32 0.011 0.048 0.679 39.388 0.000 0
#> TCGA-S9-A7IX-01 1 27 0.000 0.165 1.358 19.936 0.001 0
#> TCGA-HT-8010-01 0 2 0.016 0.044 0.184 30.971 0.029 0
#> TCGA-VM-A8C8-01 0 46 0.000 0.033 0.321 27.818 0.000 0
#> TCGA-DU-5847-01 0 18 0.000 0.053 0.024 56.409 0.000 0
#> TCGA-TQ-A7RQ-01 0 26 0.000 0.020 0.078 29.119 0.000 0
#> AC008676.3 AC008764.4 AC010132.3 AC012651.1 AC013477.1
#> TCGA-E1-A7Z6-01 0.002 0.036 0.000 0.036 0.008
#> TCGA-S9-A7IX-01 0.054 0.009 0.000 0.009 0.006
#> TCGA-HT-8010-01 0.012 0.000 0.000 0.022 0.011
#> TCGA-VM-A8C8-01 0.092 0.000 0.000 0.004 0.000
#> TCGA-DU-5847-01 0.031 0.015 0.004 0.003 0.017
#> TCGA-TQ-A7RQ-01 0.003 0.000 0.000 0.082 0.007
#> AC109583.1 AC113348.2 AC117457.1 AC131160.1 AC144573.1
#> TCGA-E1-A7Z6-01 0.060 0.000 0 0.000 0.000
#> TCGA-S9-A7IX-01 0.011 0.021 0 0.006 0.005
#> TCGA-HT-8010-01 0.013 0.008 0 0.002 0.000
#> TCGA-VM-A8C8-01 0.014 0.013 0 0.014 0.000
#> TCGA-DU-5847-01 0.000 0.011 0 0.038 0.000
#> TCGA-TQ-A7RQ-01 0.055 0.000 0 0.012 0.006
#> AC235565.2 ACO2 ACTC1 ACTL9 ACTRT2 ADD3 ADGRL3 ADH7
#> TCGA-E1-A7Z6-01 0 15.317 0.016 0.000 0 78.952 9.914 0.004
#> TCGA-S9-A7IX-01 0 10.114 0.333 0.000 0 23.873 6.906 0.006
#> TCGA-HT-8010-01 0 19.743 1.206 0.009 0 73.436 11.533 0.000
#> TCGA-VM-A8C8-01 0 13.595 0.226 0.000 0 20.959 14.586 0.000
#> TCGA-DU-5847-01 0 11.574 0.000 0.000 0 40.463 8.081 0.000
#> TCGA-TQ-A7RQ-01 0 15.256 0.119 0.000 0 35.183 15.385 0.000
#> ADPRHL1 ADRA2A ADRA2B AFMID AKAP12 AKAP14 AKAP3 AL139260.3
#> TCGA-E1-A7Z6-01 0.358 1.396 0.039 0.665 6.024 0.093 1.381 0.003
#> TCGA-S9-A7IX-01 0.197 0.626 0.148 2.552 15.872 0.100 0.420 0.016
#> TCGA-HT-8010-01 0.708 1.480 0.135 3.159 10.517 0.037 1.229 0.000
#> TCGA-VM-A8C8-01 0.177 0.518 0.626 7.882 8.264 0.071 1.923 0.045
#> TCGA-DU-5847-01 0.554 1.325 0.802 4.535 10.594 0.058 4.432 0.016
#> TCGA-TQ-A7RQ-01 0.314 0.719 0.115 4.006 9.571 0.110 1.034 0.000
#> ALG6 ANGPTL6 ANKRD20A1 ANXA6 ANXA8L1 AOC2 APOBEC1 APOE
#> TCGA-E1-A7Z6-01 1.846 0.228 0.078 47.912 0.000 0.405 0 1506.178
#> TCGA-S9-A7IX-01 2.586 0.651 0.030 53.270 0.001 0.375 0 690.584
#> TCGA-HT-8010-01 1.106 0.411 0.002 26.525 0.009 0.358 0 1506.178
#> TCGA-VM-A8C8-01 1.791 1.079 0.016 23.477 0.000 0.253 0 690.584
#> TCGA-DU-5847-01 2.060 0.452 0.038 37.475 0.000 0.384 0 1506.178
#> TCGA-TQ-A7RQ-01 1.887 0.235 0.033 45.204 0.000 0.492 0 886.982
#> ARHGAP11B ARHGAP12 ARHGEF19 ARL2.SNX15 ARL6IP1 ARPC4.TTLL3
#> TCGA-E1-A7Z6-01 0.002 20.833 0.439 0.026 75.157 0.000
#> TCGA-S9-A7IX-01 0.033 11.968 0.396 0.000 62.682 0.000
#> TCGA-HT-8010-01 0.018 16.323 0.499 0.000 85.268 0.000
#> TCGA-VM-A8C8-01 0.015 17.289 0.390 0.000 51.760 0.015
#> TCGA-DU-5847-01 0.023 43.226 0.214 0.000 83.035 0.006
#> TCGA-TQ-A7RQ-01 0.018 21.755 0.785 0.003 75.157 0.002
#> ASB4 ASB5 ASB6 ASPM ATP1A2 ATP2B4 AZGP1 B3GAT2 PI
#> TCGA-E1-A7Z6-01 0.399 0.023 5.710 0.003 275.465 25.594 1.878 12.169 -2.478895
#> TCGA-S9-A7IX-01 0.135 0.488 7.447 0.848 275.465 19.435 7.424 3.978 0.795975
#> TCGA-HT-8010-01 0.659 0.028 6.007 0.017 886.982 29.705 3.529 15.256 -1.494746
#> TCGA-VM-A8C8-01 0.348 0.112 7.007 0.825 158.854 22.473 1.674 0.481 -0.108948
#> TCGA-DU-5847-01 0.859 0.039 6.024 0.099 117.817 34.058 2.090 2.574 -1.358111
#> TCGA-TQ-A7RQ-01 0.291 0.039 6.231 0.056 78.952 24.023 25.440 3.679 -1.654919
# Load the feature list containing clinical features along with PI
data(Key_Clin_features_with_PI_list, package = "CPSM")
# View the feature list on which the model will be built
print(head(Key_Clin_features_with_PI_list))
#> ID
#> 1 Age
#> 2 gender
#> 3 radiation_treatment_adjuvant
#> 4 PI
# Call/Run function to develop MTLR prediction model based on PI score with Clinical features
Result_Model_Type3 <- MTLR_pred_model_f(
train_clin_data = Train_Clin,
test_clin_data = Test_Clin,
Model_type = 3,
train_features_data = Train_PI_data,
test_features_data = Test_PI_data,
Clin_Feature_List = Key_Clin_features_with_PI_list,
surv_time = "OS_month",
surv_event = "OS",
nfolds = 5
)
#> Error in survival_summary_tr3$OS_month - survival_summary_tr3$Median: non-numeric argument to binary operator
#------------------------ OUTPUTS ---------------------#
# Access output - Predicted survival probability across different time points
survCurves_data <- Result_Model_Type3$survCurves_data
#> Error: object 'Result_Model_Type3' not found
# View predicted survival probability for test samples
# (the call above failed, so survCurves_data still holds the Model_type = 2 result)
str(survCurves_data)
#> 'data.frame': 15 obs. of 19 variables:
#> $ time_point : num 0 4 10.9 14.4 18.9 ...
#> $ TCGA-E1-A7Z6-01: num 1 1 1 1 1 ...
#> $ TCGA-S9-A7IX-01: num 1 0.987 0.969 0.945 0.854 ...
#> $ TCGA-HT-8010-01: num 1 1 1 1 1 ...
#> $ TCGA-VM-A8C8-01: num 1 0.998 0.997 0.994 0.98 ...
#> $ TCGA-DU-5847-01: num 1 1 1 1 1 ...
#> $ TCGA-TQ-A7RQ-01: num 1 1 1 1 1 ...
#> $ TCGA-HT-7606-01: num 1 0.986 0.966 0.939 0.841 ...
#> $ TCGA-S9-A7QY-01: num 1 1 1 1 1 ...
#> $ TCGA-DH-5142-01: num 1 1 1 1 1 ...
#> $ TCGA-DU-6408-01: num 1 1 1 1 1 ...
#> $ TCGA-FG-8191-01: num 1 1 1 1 1 ...
#> $ TCGA-TM-A84S-01: num 1 1 1 1 1 ...
#> $ TCGA-DB-A64S-01: num 1 1 1 1 1 ...
#> $ TCGA-DU-6542-01: num 1 1 1 1 1 ...
#> $ TCGA-DU-A7T6-01: num 1 0.992 0.983 0.969 0.911 ...
#> $ TCGA-DB-A4XF-01: num 1 1 1 1 1 ...
#> $ TCGA-S9-A6WN-01: num 1 1 1 1 0.999 ...
#> $ TCGA-P5-A5EX-01: num 1 0.999 0.999 0.998 0.991 ...
# Access output - Predicted mean and median survival of test data
mean_median_surv_tim_da <- Result_Model_Type3$mean_median_survival_time_data
#> Error: object 'Result_Model_Type3' not found
# View Predicted mean and median survival of test data
str(mean_median_surv_tim_da)
#> Error: object 'mean_median_surv_tim_da' not found
# Access output - all results for test data
survival_result_b_on_MTLR <- Result_Model_Type3$survival_result_based_on_MTLR
#> Error: object 'Result_Model_Type3' not found
# Access output - Final Evaluation parameters of results
Error_mat_for_Model <- Result_Model_Type3$Error_mat_for_Model
#> Error: object 'Result_Model_Type3' not found
# View evaluation parameters of the model
# (the call above failed, so Error_mat_for_Model still holds the Model_type = 2 result)
str(Error_mat_for_Model)
#> num [1:2, 1:4] 0.97 0.8 78.75 155.48 43.23 ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : chr [1:2] "Training_set" "Test_set"
#> ..$ : chr [1:4] "C_index" "Mean_MAE" "Median_MAE" "IBS"
# view Evaluation parameters of the model
head(Error_mat_for_Model)
#> C_index Mean_MAE Median_MAE IBS
#> Training_set 0.97 78.75 43.23 0.089
#> Test_set 0.80 155.48 56.88 0.238
# Load training data with clinical features
data(Train_Clin, package = "CPSM")
# View top rows of data
print(head(Train_Clin,3))
#> Age subtype gender race ajcc_pathologic_tumor_stage
#> TCGA-E1-5318-01 42.09 PN Female WHITE NA
#> TCGA-DU-A76L-01 54.27 ME Male WHITE NA
#> TCGA-CS-6667-01 39.36 <NA> Female WHITE NA
#> histological_type histological_grade
#> TCGA-E1-5318-01 Oligodendroglioma G2
#> TCGA-DU-A76L-01 Oligodendroglioma G3
#> TCGA-CS-6667-01 Astrocytoma G2
#> treatment_outcome_first_course radiation_treatment_adjuvant
#> TCGA-E1-5318-01 <NA> YES
#> TCGA-DU-A76L-01 Progressive Disease <NA>
#> TCGA-CS-6667-01 <NA> YES
#> sample_type type OS OS.time DSS DSS.time DFI DFI.time PFI
#> TCGA-E1-5318-01 Primary LGG 1 2379 1 2379 NA NA 1
#> TCGA-DU-A76L-01 Primary LGG 1 814 1 814 NA NA 1
#> TCGA-CS-6667-01 Primary LGG 0 1469 0 1469 NA NA 0
#> PFI.time OS_month
#> TCGA-E1-5318-01 362 78
#> TCGA-DU-A76L-01 410 27
#> TCGA-CS-6667-01 1469 48
# Load test data with clinical features
data(Test_Clin, package = "CPSM")
# View top rows of data
print(head(Test_Clin,3))
#> Age subtype gender race ajcc_pathologic_tumor_stage
#> TCGA-E1-A7Z6-01 41.03 <NA> Female WHITE NA
#> TCGA-S9-A7IX-01 57.75 CL Male WHITE NA
#> TCGA-HT-8010-01 64.83 NE Female WHITE NA
#> histological_type histological_grade
#> TCGA-E1-A7Z6-01 Astrocytoma G2
#> TCGA-S9-A7IX-01 Astrocytoma G3
#> TCGA-HT-8010-01 Oligodendroglioma G2
#> treatment_outcome_first_course radiation_treatment_adjuvant
#> TCGA-E1-A7Z6-01 Progressive Disease YES
#> TCGA-S9-A7IX-01 Progressive Disease YES
#> TCGA-HT-8010-01 <NA> NO
#> sample_type type OS OS.time DSS DSS.time DFI DFI.time PFI
#> TCGA-E1-A7Z6-01 Primary LGG 1 984 1 984 NA NA 1
#> TCGA-S9-A7IX-01 Primary LGG 1 819 1 819 NA NA 1
#> TCGA-HT-8010-01 Primary LGG 0 50 0 50 NA NA 0
#> PFI.time OS_month
#> TCGA-E1-A7Z6-01 294 32
#> TCGA-S9-A7IX-01 350 27
#> TCGA-HT-8010-01 50 2
# Load normalized training data
data(Train_Uni_sig_data, package = "CPSM")
# View top rows of data with first 10 columns
print(head(Train_Uni_sig_data[1:10],3))
#> A2ML1 AADACL4 AAMDC AAR2 ABCA12 ABCB4 ABCB5 ABCB6 ABCC3 ABCC9
#> TCGA-E1-5318-01 0.414 0 3.204 19.338 0.212 2.080 0.000 0.207 0.100 1.020
#> TCGA-DU-A76L-01 0.103 0 1.980 20.384 0.297 0.417 0.004 0.197 5.476 0.390
#> TCGA-CS-6667-01 1.309 0 3.476 16.567 0.053 0.669 0.002 0.541 0.029 4.535
# Load normalized test data
data(Test_Uni_sig_data, package = "CPSM")
# View top rows of data (first 10 columns)
print(head(Test_Uni_sig_data[1:10],3))
#> A2ML1 AADACL4 AAMDC AAR2 ABCA12 ABCB4 ABCB5 ABCB6 ABCC3 ABCC9
#> TCGA-E1-A7Z6-01 4.826 0.011 3.266 11.409 0.048 0.160 0 0.364 0.679 7.311
#> TCGA-S9-A7IX-01 1.396 0.000 1.576 11.574 0.165 0.318 0 0.121 1.358 0.813
#> TCGA-HT-8010-01 2.443 0.016 1.764 10.016 0.044 0.189 0 0.151 0.184 4.246
# Load the list of clinical features along with the top features from univariate survival analysis
data(Key_univariate_features_with_Clin_list, package = "CPSM")
# View the feature list on which the model will be built
print(head(Key_univariate_features_with_Clin_list))
#> ID
#> 1 Age
#> 2 gender
#> 3 radiation_treatment_adjuvant
#> 4 A2ML1
#> 5 AADACL2
#> 6 AARS1
# Call/Run function to develop MTLR model based on Clinical and selected univariate features
Result_Model_Type5 <- MTLR_pred_model_f(
train_clin_data = Train_Clin,
test_clin_data = Test_Clin,
Model_type = 4,
train_features_data = Train_Uni_sig_data,
test_features_data = Test_Uni_sig_data,
Clin_Feature_List = Key_univariate_features_with_Clin_list,
surv_time = "OS_month",
surv_event = "OS",
nfolds = 5
)
#------------------------ OUTPUTS ---------------------#
# Access output - Predicted survival probability across different time points
survCurves_data <- Result_Model_Type5$survCurves_data
# View predicted survival probability for test samples
str(survCurves_data)
#> 'data.frame': 14 obs. of 18 variables:
#> $ time_point : num 0 8.86 14 18 20.43 ...
#> $ TCGA-E1-A7Z6-01: num 1 0.999 0.998 0.996 0.995 ...
#> $ TCGA-S9-A7IX-01: num 1 0.931 0.892 0.803 0.732 ...
#> $ TCGA-HT-8010-01: num 1 0.995 0.992 0.986 0.983 ...
#> $ TCGA-VM-A8C8-01: num 1 0.982 0.972 0.947 0.929 ...
#> $ TCGA-DU-5847-01: num 1 0.996 0.994 0.99 0.987 ...
#> $ TCGA-TQ-A7RQ-01: num 1 0.986 0.979 0.963 0.953 ...
#> $ TCGA-HT-7606-01: num 1 0.982 0.97 0.922 0.873 ...
#> $ TCGA-S9-A7QY-01: num 1 0.996 0.994 0.989 0.985 ...
#> $ TCGA-DH-5142-01: num 1 0.998 0.997 0.995 0.993 ...
#> $ TCGA-DU-6408-01: num 1 0.998 0.996 0.993 0.99 ...
#> $ TCGA-FG-8191-01: num 1 0.995 0.992 0.985 0.979 ...
#> $ TCGA-TM-A84S-01: num 1 0.996 0.994 0.989 0.985 ...
#> $ TCGA-DB-A64S-01: num 1 0.998 0.997 0.995 0.993 ...
#> $ TCGA-DU-6542-01: num 1 0.991 0.986 0.975 0.965 ...
#> $ TCGA-DB-A4XF-01: num 1 0.995 0.992 0.985 0.981 ...
#> $ TCGA-S9-A6WN-01: num 1 0.987 0.978 0.963 0.952 ...
#> $ TCGA-P5-A5EX-01: num 1 0.987 0.979 0.964 0.949 ...
# Access output - Predicted mean and median survival of test data
mean_median_surv_tim_da <- Result_Model_Type5$mean_median_survival_time_data
# View Predicted mean and median survival of test data
str(mean_median_surv_tim_da)
#> 'data.frame': 17 obs. of 4 variables:
#> $ ID : chr "TCGA-E1-A7Z6-01" "TCGA-S9-A7IX-01" "TCGA-HT-8010-01" "TCGA-VM-A8C8-01" ...
#> $ Mean : num 105.1 42.5 95.9 73.2 99 ...
#> $ Median : num 99.9 32.5 91 68.1 93.3 ...
#> $ OS_month: num 32 27 2 46 18 26 17 28 64 114 ...
# Access output - all results for test data
survival_result_b_on_MTLR <- Result_Model_Type5$survival_result_based_on_MTLR
# Access output - Final Evaluation parameters of results
Error_mat_for_Model <- Result_Model_Type5$Error_mat_for_Model
# view Evaluation parameters of the model
str(Error_mat_for_Model)
#> num [1:2, 1:4] 0.9 0.77 48.35 61.2 43.85 ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : chr [1:2] "Training_set" "Test_set"
#> ..$ : chr [1:4] "C_index" "Mean_MAE" "Median_MAE" "IBS"
# view Evaluation parameters of the model
head(Error_mat_for_Model)
#> C_index Mean_MAE Median_MAE IBS
#> Training_set 0.90 48.35 43.85 0.152
#> Test_set 0.77 61.20 64.96 0.194
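After running the MTLR_pred_model_f function, the following outputs are generated:
1. survCurves_data: Predicted survival probabilities of the test samples across different time points.
2. mean_median_survival_time_data: Predicted mean and median survival times of the test samples.
3. survival_result_based_on_MTLR: All prediction results for the test data.
4. Error_mat_for_Model: Evaluation parameters of the model (C_index, Mean_MAE, Median_MAE, and IBS) for the training and test sets.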
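The evaluation matrices returned for different feature sets can be placed side by side to compare models. The sketch below copies the Test_set rows printed above for the Model_type = 2 and Model_type = 4 runs into one matrix. The row labels and object names are chosen here for illustration; a higher C-index and a lower IBS indicate better performance.

```r
# Test_set rows of Error_mat_for_Model as printed above for two runs
test_metrics <- rbind(
  Model_type_2 = c(C_index = 0.80, Mean_MAE = 155.48, Median_MAE = 56.88, IBS = 0.238),
  Model_type_4 = c(C_index = 0.77, Mean_MAE = 61.20,  Median_MAE = 64.96, IBS = 0.194)
)
print(test_metrics)

# Higher C-index is better; lower IBS is better
best_c_index <- rownames(test_metrics)[which.max(test_metrics[, "C_index"])]
best_ibs     <- rownames(test_metrics)[which.min(test_metrics[, "IBS"])]
print(c(best_by_C_index = best_c_index, best_by_IBS = best_ibs))
```

The two criteria need not agree, as in this example, so the choice of model depends on which metric matters most for the application.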
To visualize the survival of patients, we use the surv_curve_plots_f function, which generates survival curve plots based on the survCurves_data obtained from the previous step (after running the MTLR_pred_model_f function). This function also provides the option to highlight the survival curve of a specific patient.
The function requires two inputs:
1. Surv_curve_data: The data object containing predicted survival probabilities for all patients.
2. selected_sample: The ID of the specific patient (e.g., TCGA-TQ-A7RQ-01) whose survival curve you want to highlight.
# Create survival curve plots for individual patients
# Load predicted survival probabilities of test samples (at different time points)
data(survCurves_data, package = "CPSM")
# View top rows of data
print(head(survCurves_data,3))
#> time_point TCGA-E1-A7Z6-01 TCGA-S9-A7IX-01 TCGA-HT-8010-01 TCGA-VM-A8C8-01
#> 1 0.00000 1.0000000 1.0000000 1.0000000 1.0000000
#> 2 4.00000 0.9979268 0.9866125 0.9961509 0.9914884
#> 3 10.93333 0.9915719 0.9455033 0.9843460 0.9653644
#> TCGA-DU-5847-01 TCGA-TQ-A7RQ-01 TCGA-HT-7606-01 TCGA-S9-A7QY-01
#> 1 1.0000000 1.0000000 1.0000000 1.0000000
#> 2 0.9958189 0.9965105 0.9862409 0.9979053
#> 3 0.9829948 0.9858092 0.9439892 0.9914842
#> TCGA-DH-5142-01 TCGA-DU-6408-01 TCGA-FG-8191-01 TCGA-TM-A84S-01
#> 1 1.0000000 1.0000000 1.0000000 1.0000000
#> 2 0.9994770 0.9978670 0.9975332 0.9960577
#> 3 0.9978757 0.9913285 0.9899704 0.9839665
#> TCGA-DB-A64S-01 TCGA-DU-6542-01 TCGA-DU-A7T6-01 TCGA-DB-A4XF-01
#> 1 1.0000000 1.0000000 1.0000000 1.0000000
#> 2 0.9995280 0.9956397 0.9883252 0.9970554
#> 3 0.9980829 0.9822654 0.9524809 0.9880265
#> TCGA-S9-A6WN-01 TCGA-P5-A5EX-01
#> 1 1.0000000 1.0000000
#> 2 0.9949919 0.9925523
#> 3 0.9796288 0.9696968
# Call/Run functions to plot survival curve plot
plots <- surv_curve_plots_f(
Surv_curve_data = survCurves_data,
selected_sample = "TCGA-TQ-A7RQ-01",
font_size = 12,
line_size = 0.5,
all_line_col = "grey70",
highlight_col = "red"
)
#------------------------ OUTPUTS ---------------------#
# Access output - print survival plot with all patients in test data
print(plots$all_patients_plot)
# Access output - print survival plot with highlighting the survival curve of selected patient
print(plots$highlighted_patient_plot)
## Outputs
After running the function, two output plots are generated:
1. Survival curves for all patients in the test data, displayed with different colors for each patient.
2. Survival curves for all patients (in black) with the selected patient highlighted in red.
These plots allow for easy visualization of individual patient survival in the context of the overall test data.
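The same survCurves_data table can also be read numerically. The sketch below estimates one patient's survival probability at a time that lies between two tabulated time points, using linear interpolation with base R's approx(). It uses the first three rows printed above for TCGA-TQ-A7RQ-01; the helper surv_at is hypothetical and not part of CPSM, and linear interpolation is only one reasonable choice.

```r
# First three rows of survCurves_data for one patient, as printed above
curves <- data.frame(
  time_point        = c(0, 4, 10.93333),
  `TCGA-TQ-A7RQ-01` = c(1, 0.9965105, 0.9858092),
  check.names = FALSE
)

# Hypothetical helper: linearly interpolate the survival probability at time t
surv_at <- function(curves, sample_id, t) {
  approx(x = curves$time_point, y = curves[[sample_id]], xout = t)$y
}

# Estimated probability of surviving beyond 6 months
p6 <- surv_at(curves, "TCGA-TQ-A7RQ-01", 6)
print(p6)
```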
To visualize the predicted survival times for patients, we use the mean_median_surv_barplot_f function, which generates bar plots for the mean and median survival times based on the data obtained from Step 5 after running the MTLR_pred_model_f function. This function also provides the option to highlight a specific patient on the bar plot.
This function requires two inputs:
1. surv_mean_med_data: The data containing the predicted mean and median survival times for all patients.
2. selected_sample: The ID of the specific patient (e.g., TCGA-TQ-A7RQ-01) whose bars should be highlighted.
# Load data of predicted mean/median survival time for test samples
data(mean_median_survival_time_data, package = "CPSM")
# View top rows of data
print(head(mean_median_survival_time_data),3)
#> IDs Mean Median
#> 1 TCGA-E1-A7Z6-01 92.93574 88.06892
#> 2 TCGA-S9-A7IX-01 56.12662 47.65407
#> 3 TCGA-HT-8010-01 81.62288 79.30814
#> 4 TCGA-VM-A8C8-01 65.90511 57.91935
#> 5 TCGA-DU-5847-01 80.05510 77.80689
#> 6 TCGA-TQ-A7RQ-01 83.46231 80.89266
custom_colors <- c(
"TRUE.Mean" = "cyan",
"TRUE.Median" = "pink",
"FALSE.Mean" = "gray90",
"FALSE.Median"= "gray70"
)
# Call/Run functions to plot barplot
plots_2 <- mean_median_surv_barplot_f(
surv_mean_med_data =
mean_median_survival_time_data,
selected_sample = "TCGA-TQ-A7RQ-01",
font_size = 8,
font_color = "black",
bar_colors = custom_colors
)
#------------------------ OUTPUTS ---------------------#
# Access output - Print barplots representing predicted mean/median survival time of patients
print(plots_2$mean_med_all_pat)
# Access output - Print barplots representing predicted mean/median survival time of patients with highlighting selected patient
print(plots_2$highlighted_selected_pat)
After running the function, two output bar plots are generated:
1. Bar plot for all patients in the test data, where the red-colored bars represent the mean survival time, and the cyan/green-colored bars represent the median survival time.
2. Bar plot for all patients with a highlighted patient (indicated by a dashed black outline). In the example data loaded above, the highlighted patient (TCGA-TQ-A7RQ-01) has predicted mean and median survival times of 83.46 and 80.89 months, respectively.
These plots provide a clear comparison of the predicted survival times for all patients and the highlighted individual patient.
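The names used in the custom_colors vector ("TRUE.Mean", "FALSE.Median", and so on) combine a highlight flag with the statistic type. The sketch below shows one way such keys can be derived from the mean/median table and mapped to colors. It illustrates the naming scheme only and is an assumption about how the colors are matched, not the internal code of mean_median_surv_barplot_f.

```r
# Two rows of the mean/median survival table printed above
surv <- data.frame(
  IDs    = c("TCGA-DU-5847-01", "TCGA-TQ-A7RQ-01"),
  Mean   = c(80.05510, 83.46231),
  Median = c(77.80689, 80.89266)
)
selected_sample <- "TCGA-TQ-A7RQ-01"
custom_colors <- c(
  "TRUE.Mean"    = "cyan",
  "TRUE.Median"  = "pink",
  "FALSE.Mean"   = "gray90",
  "FALSE.Median" = "gray70"
)

# Reshape to long format: one row per patient and statistic
long <- data.frame(
  IDs    = rep(surv$IDs, times = 2),
  type   = rep(c("Mean", "Median"), each = nrow(surv)),
  months = c(surv$Mean, surv$Median)
)

# Build "<highlight>.<type>" keys and look up the fill color for each bar
long$highlight <- long$IDs == selected_sample
long$key  <- paste(long$highlight, long$type, sep = ".")
long$fill <- unname(custom_colors[long$key])
print(long)
```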
To predict the survival-based risk group of test samples (i.e., high-risk with shorter survival or low-risk with longer survival), we use the predict_survival_risk_group_f() function provided in the CPSM package. This function implements a randomForestSRC-based prediction approach for survival risk classification. It first defines actual risk groups in the training data using the median overall survival time as the cutoff: samples with shorter survival form the high-risk group, and samples with longer survival form the low-risk group.
Multiple Random Survival Forest (RSF) models are then trained using different values for ntree: 10, 20, 50, 100, 250, 500, 750, and 1000.
The model with the best performance (e.g., highest accuracy) is selected automatically. This best-performing model is used to predict the risk group of test samples, along with prediction probabilities.
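The median-based definition of the actual risk groups can be sketched in a few lines of base R. The example uses the OS_month values of the first six training samples printed earlier. The label strings, the handling of samples exactly at the median, and the fact that censoring is ignored are all assumptions made for illustration; predict_survival_risk_group_f may treat these cases differently.

```r
# OS_month and event status of the first six training samples
train <- data.frame(
  ID       = c("TCGA-E1-5318-01", "TCGA-DU-A76L-01", "TCGA-CS-6667-01",
               "TCGA-E1-A7YI-01", "TCGA-HT-7610-01", "TCGA-HT-7856-01"),
  OS       = c(1, 1, 0, 1, 0, 0),
  OS_month = c(78, 27, 48, 4, 56, 39)
)

# Cutoff: median overall survival time of the training samples
cutoff <- median(train$OS_month)

# Assumed rule: shorter than the median is high risk, otherwise low risk
train$risk_group <- ifelse(train$OS_month < cutoff, "High_Risk", "Low_Risk")
print(cutoff)
print(train)
```

With only six samples the cutoff is 43.5 months; on the full training set the cutoff will be different.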
The function requires three inputs:
1. selected_train_data: A data frame with normalized expression values for selected features and survival information (OS_month, OS_event) for the training set.
2. selected_test_data: A data frame with normalized expression values for the same features for the test set.
3. Feature_List: A character vector containing the names of selected features to be used in the model.
# Load example data from CPSM package
# Load training data with selected features (e.g. PI values)
data(Train_PI_data, package = "CPSM")
# View top rows of data
print(head(Train_PI_data),3)
#> OS OS_month AADACL4 ABCA12 ABCC3 ABI1 ABRA AC006059.2
#> TCGA-E1-5318-01 1 78 0 0.212 0.100 44.705 0.000 0
#> TCGA-DU-A76L-01 1 27 0 0.297 5.476 15.317 0.035 0
#> TCGA-CS-6667-01 0 48 0 0.053 0.029 43.732 0.000 0
#> TCGA-E1-A7YI-01 1 4 0 0.358 0.108 27.662 0.000 0
#> TCGA-HT-7610-01 0 56 0 0.064 0.037 31.640 0.015 0
#> TCGA-HT-7856-01 0 39 0 0.032 0.198 22.709 0.034 0
#> AC008676.3 AC008764.4 AC010132.3 AC012651.1 AC013477.1
#> TCGA-E1-5318-01 0.032 0.038 0.000 0.021 0.000
#> TCGA-DU-A76L-01 0.034 0.000 0.000 0.010 0.007
#> TCGA-CS-6667-01 0.032 0.000 0.000 0.009 0.000
#> TCGA-E1-A7YI-01 0.000 0.000 0.000 0.023 0.007
#> TCGA-HT-7610-01 0.014 0.000 0.000 0.010 0.006
#> TCGA-HT-7856-01 0.040 0.000 0.002 0.007 0.008
#> AC109583.1 AC113348.2 AC117457.1 AC131160.1 AC144573.1
#> TCGA-E1-5318-01 0.000 0.000 0 0.017 0
#> TCGA-DU-A76L-01 0.010 0.000 0 0.016 0
#> TCGA-CS-6667-01 0.007 0.013 0 0.000 0
#> TCGA-E1-A7YI-01 0.075 0.012 0 0.043 0
#> TCGA-HT-7610-01 0.018 0.005 0 0.002 0
#> TCGA-HT-7856-01 0.014 0.000 0 0.024 0
#> AC235565.2 ACO2 ACTC1 ACTL9 ACTRT2 ADD3 ADGRL3 ADH7
#> TCGA-E1-5318-01 0 18.305 0.039 0.000 0 25.731 28.388 0.000
#> TCGA-DU-A76L-01 0 5.218 1.808 0.000 0 13.483 7.859 0.010
#> TCGA-CS-6667-01 0 15.122 0.009 0.000 0 44.266 10.953 0.006
#> TCGA-E1-A7YI-01 0 6.822 0.520 0.016 0 6.670 26.863 0.000
#> TCGA-HT-7610-01 0 15.385 0.790 0.000 0 57.390 19.435 0.006
#> TCGA-HT-7856-01 0 21.867 0.289 0.000 0 48.499 6.707 0.000
#> ADPRHL1 ADRA2A ADRA2B AFMID AKAP12 AKAP14 AKAP3 AL139260.3
#> TCGA-E1-5318-01 0.196 0.562 0.082 2.853 8.264 0.153 1.455 0.000
#> TCGA-DU-A76L-01 0.139 0.180 0.524 4.107 136.028 0.051 0.719 0.030
#> TCGA-CS-6667-01 0.309 0.228 0.154 2.628 9.123 0.067 1.745 0.006
#> TCGA-E1-A7YI-01 0.340 0.187 0.457 5.320 30.573 0.069 1.540 0.013
#> TCGA-HT-7610-01 0.411 2.123 0.068 2.818 16.100 0.102 1.277 0.002
#> TCGA-HT-7856-01 1.191 2.166 0.158 6.175 18.849 0.116 0.730 0.008
#> ALG6 ANGPTL6 ANKRD20A1 ANXA6 ANXA8L1 AOC2 APOBEC1 APOE
#> TCGA-E1-5318-01 1.388 0.183 0.067 65.301 0.00 1.013 0.000 690.584
#> TCGA-DU-A76L-01 3.489 0.898 0.072 56.409 0.00 0.153 0.000 371.100
#> TCGA-CS-6667-01 2.123 0.151 0.142 85.268 0.00 0.953 0.000 1506.178
#> TCGA-E1-A7YI-01 1.772 0.567 0.603 37.475 0.00 0.805 0.025 690.584
#> TCGA-HT-7610-01 1.463 0.145 0.062 63.921 0.01 0.659 0.000 1506.178
#> TCGA-HT-7856-01 1.146 0.098 0.018 45.204 0.00 0.516 0.000 1506.178
#> ARHGAP11B ARHGAP12 ARHGEF19 ARL2.SNX15 ARL6IP1 ARPC4.TTLL3
#> TCGA-E1-5318-01 0.031 28.563 0.581 0.000 93.080 0.000
#> TCGA-DU-A76L-01 0.071 11.792 0.590 0.005 85.268 0.000
#> TCGA-CS-6667-01 0.032 27.989 0.708 0.009 67.615 0.015
#> TCGA-E1-A7YI-01 0.452 24.567 1.621 0.000 275.465 0.000
#> TCGA-HT-7610-01 0.014 16.323 0.197 0.004 95.574 0.000
#> TCGA-HT-7856-01 0.027 12.551 0.178 0.003 95.574 0.000
#> ASB4 ASB5 ASB6 ASPM ATP1A2 ATP2B4 AZGP1 B3GAT2 PI
#> TCGA-E1-5318-01 0.156 0.042 6.080 0.255 21.294 33.770 26.363 5.186 -1.634461
#> TCGA-DU-A76L-01 0.123 0.119 7.764 1.745 18.212 18.568 2.597 0.516 0.961367
#> TCGA-CS-6667-01 0.140 0.028 5.476 0.146 371.100 39.388 14.003 4.639 -3.111244
#> TCGA-E1-A7YI-01 0.499 0.113 11.032 6.689 23.747 21.755 2.574 0.741 4.830642
#> TCGA-HT-7610-01 0.210 0.093 7.154 0.207 243.966 49.878 1.764 7.564 -2.217197
#> TCGA-HT-7856-01 0.631 0.109 7.787 0.101 318.823 25.125 10.915 3.012 -1.824024
# Load test data with selected features (e.g. PI values)
data(Test_PI_data, package = "CPSM")
# View top rows of data
print(head(Test_PI_data),3)
#> OS OS_month AADACL4 ABCA12 ABCC3 ABI1 ABRA AC006059.2
#> TCGA-E1-A7Z6-01 1 32 0.011 0.048 0.679 39.388 0.000 0
#> TCGA-S9-A7IX-01 1 27 0.000 0.165 1.358 19.936 0.001 0
#> TCGA-HT-8010-01 0 2 0.016 0.044 0.184 30.971 0.029 0
#> TCGA-VM-A8C8-01 0 46 0.000 0.033 0.321 27.818 0.000 0
#> TCGA-DU-5847-01 0 18 0.000 0.053 0.024 56.409 0.000 0
#> TCGA-TQ-A7RQ-01 0 26 0.000 0.020 0.078 29.119 0.000 0
#> AC008676.3 AC008764.4 AC010132.3 AC012651.1 AC013477.1
#> TCGA-E1-A7Z6-01 0.002 0.036 0.000 0.036 0.008
#> TCGA-S9-A7IX-01 0.054 0.009 0.000 0.009 0.006
#> TCGA-HT-8010-01 0.012 0.000 0.000 0.022 0.011
#> TCGA-VM-A8C8-01 0.092 0.000 0.000 0.004 0.000
#> TCGA-DU-5847-01 0.031 0.015 0.004 0.003 0.017
#> TCGA-TQ-A7RQ-01 0.003 0.000 0.000 0.082 0.007
#> AC109583.1 AC113348.2 AC117457.1 AC131160.1 AC144573.1
#> TCGA-E1-A7Z6-01 0.060 0.000 0 0.000 0.000
#> TCGA-S9-A7IX-01 0.011 0.021 0 0.006 0.005
#> TCGA-HT-8010-01 0.013 0.008 0 0.002 0.000
#> TCGA-VM-A8C8-01 0.014 0.013 0 0.014 0.000
#> TCGA-DU-5847-01 0.000 0.011 0 0.038 0.000
#> TCGA-TQ-A7RQ-01 0.055 0.000 0 0.012 0.006
#> AC235565.2 ACO2 ACTC1 ACTL9 ACTRT2 ADD3 ADGRL3 ADH7
#> TCGA-E1-A7Z6-01 0 15.317 0.016 0.000 0 78.952 9.914 0.004
#> TCGA-S9-A7IX-01 0 10.114 0.333 0.000 0 23.873 6.906 0.006
#> TCGA-HT-8010-01 0 19.743 1.206 0.009 0 73.436 11.533 0.000
#> TCGA-VM-A8C8-01 0 13.595 0.226 0.000 0 20.959 14.586 0.000
#> TCGA-DU-5847-01 0 11.574 0.000 0.000 0 40.463 8.081 0.000
#> TCGA-TQ-A7RQ-01 0 15.256 0.119 0.000 0 35.183 15.385 0.000
#> ADPRHL1 ADRA2A ADRA2B AFMID AKAP12 AKAP14 AKAP3 AL139260.3
#> TCGA-E1-A7Z6-01 0.358 1.396 0.039 0.665 6.024 0.093 1.381 0.003
#> TCGA-S9-A7IX-01 0.197 0.626 0.148 2.552 15.872 0.100 0.420 0.016
#> TCGA-HT-8010-01 0.708 1.480 0.135 3.159 10.517 0.037 1.229 0.000
#> TCGA-VM-A8C8-01 0.177 0.518 0.626 7.882 8.264 0.071 1.923 0.045
#> TCGA-DU-5847-01 0.554 1.325 0.802 4.535 10.594 0.058 4.432 0.016
#> TCGA-TQ-A7RQ-01 0.314 0.719 0.115 4.006 9.571 0.110 1.034 0.000
#> ALG6 ANGPTL6 ANKRD20A1 ANXA6 ANXA8L1 AOC2 APOBEC1 APOE
#> TCGA-E1-A7Z6-01 1.846 0.228 0.078 47.912 0.000 0.405 0 1506.178
#> TCGA-S9-A7IX-01 2.586 0.651 0.030 53.270 0.001 0.375 0 690.584
#> TCGA-HT-8010-01 1.106 0.411 0.002 26.525 0.009 0.358 0 1506.178
#> TCGA-VM-A8C8-01 1.791 1.079 0.016 23.477 0.000 0.253 0 690.584
#> TCGA-DU-5847-01 2.060 0.452 0.038 37.475 0.000 0.384 0 1506.178
#> TCGA-TQ-A7RQ-01 1.887 0.235 0.033 45.204 0.000 0.492 0 886.982
#> ARHGAP11B ARHGAP12 ARHGEF19 ARL2.SNX15 ARL6IP1 ARPC4.TTLL3
#> TCGA-E1-A7Z6-01 0.002 20.833 0.439 0.026 75.157 0.000
#> TCGA-S9-A7IX-01 0.033 11.968 0.396 0.000 62.682 0.000
#> TCGA-HT-8010-01 0.018 16.323 0.499 0.000 85.268 0.000
#> TCGA-VM-A8C8-01 0.015 17.289 0.390 0.000 51.760 0.015
#> TCGA-DU-5847-01 0.023 43.226 0.214 0.000 83.035 0.006
#> TCGA-TQ-A7RQ-01 0.018 21.755 0.785 0.003 75.157 0.002
#> ASB4 ASB5 ASB6 ASPM ATP1A2 ATP2B4 AZGP1 B3GAT2 PI
#> TCGA-E1-A7Z6-01 0.399 0.023 5.710 0.003 275.465 25.594 1.878 12.169 -2.478895
#> TCGA-S9-A7IX-01 0.135 0.488 7.447 0.848 275.465 19.435 7.424 3.978 0.795975
#> TCGA-HT-8010-01 0.659 0.028 6.007 0.017 886.982 29.705 3.529 15.256 -1.494746
#> TCGA-VM-A8C8-01 0.348 0.112 7.007 0.825 158.854 22.473 1.674 0.481 -0.108948
#> TCGA-DU-5847-01 0.859 0.039 6.024 0.099 117.817 34.058 2.090 2.574 -1.358111
#> TCGA-TQ-A7RQ-01 0.291 0.039 6.231 0.056 78.952 24.023 25.440 3.679 -1.654919
# Load feature list
data(Key_PI_list, package = "CPSM")
# View feature list
str(Key_PI_list)
#> 'data.frame': 1 obs. of 1 variable:
#> $ ID: chr "PI"
# Call/Run function to Predict survival-based risk groups for test samples
Results_Risk_group_Prediction <- predict_survival_risk_group_f(
selected_train_data = Train_PI_data,
selected_test_data = Test_PI_data,
Feature_List = Key_PI_list
)
#> Training _with ntree = 10
#> Training _with ntree = 20
#> Training _with ntree = 50
#> Training _with ntree = 100
#> Training _with ntree = 250
#> Training _with ntree = 500
#> Training _with ntree = 750
#> Training _with ntree = 1000
#------------------------ OUTPUTS ---------------------#
# Access output - Performance of the best model on Training and Test data
Best_model_Prediction_results <- Results_Risk_group_Prediction$misclassification_results
# View Performance of the best model on Training and Test data
print(head(Best_model_Prediction_results))
#> Best_ntree OOB_Misclassification High_Risk_Error Low_Risk_Error
#> all 10 0.406 0.443 0.392
#> Train_Misclassification_Error Train_Accuracy Train_Sensitivity
#> all 0.038 96.2 96.2
#> Train_Specificity Test_Misclassification_Error Test_Accuracy
#> all 96.2 0.5 50
#> Test_Sensitivity Test_Specificity
#> all 41.67 66.67
# View Prediction results of the best model on Test set
Test_results <- Results_Risk_group_Prediction$Test_results # Prediction results on test data
print(head(Test_results))
#> Sample_ID Actual Predicted_Risk_Group High_Risk_Prob
#> TCGA-E1-A7Z6-01 TCGA-E1-A7Z6-01 Low_Risk Low_Risk 0.3
#> TCGA-S9-A7IX-01 TCGA-S9-A7IX-01 High_Risk Low_Risk 0.2
#> TCGA-HT-8010-01 TCGA-HT-8010-01 High_Risk High_Risk 1.0
#> TCGA-VM-A8C8-01 TCGA-VM-A8C8-01 Low_Risk High_Risk 1.0
#> TCGA-DU-5847-01 TCGA-DU-5847-01 High_Risk Low_Risk 0.0
#> TCGA-TQ-A7RQ-01 TCGA-TQ-A7RQ-01 High_Risk Low_Risk 0.0
#> Low_Risk_Prob Prediction_Prob OS_month OS_event
#> TCGA-E1-A7Z6-01 0.7 0.7 32 1
#> TCGA-S9-A7IX-01 0.8 0.8 27 1
#> TCGA-HT-8010-01 0.0 1.0 2 0
#> TCGA-VM-A8C8-01 0.0 1.0 46 0
#> TCGA-DU-5847-01 1.0 1.0 18 0
#> TCGA-TQ-A7RQ-01 1.0 1.0 26 0
The output is a list that includes:
1. The best prediction model
2. Performance metrics (accuracy, sensitivity, specificity, error rate, etc.) for training and test data
3. Predicted risk groups with prediction probability values for training samples
4. Predicted risk groups with prediction probability values for test samples
Users can use these results for further validation and visualization, such as overlaying test sample survival curves on the training KM plot (see next step).
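As a rough illustration of how the performance metrics relate to the Actual and Predicted_Risk_Group columns, the sketch below rebuilds a confusion matrix from the six test rows printed above. It assumes that High_Risk is treated as the positive class; the package may use a different convention, and the object names (toy_test, cm) are hypothetical.

```r
# Toy stand-in for Test_results: the six rows printed by head(Test_results)
toy_test <- data.frame(
  Actual = c("Low_Risk", "High_Risk", "High_Risk",
             "Low_Risk", "High_Risk", "High_Risk"),
  Predicted_Risk_Group = c("Low_Risk", "Low_Risk", "High_Risk",
                           "High_Risk", "Low_Risk", "Low_Risk"),
  stringsAsFactors = FALSE
)

# Fix the level order so the table always has both classes
lv <- c("High_Risk", "Low_Risk")
cm <- table(
  Actual = factor(toy_test$Actual, levels = lv),
  Predicted = factor(toy_test$Predicted_Risk_Group, levels = lv)
)

# Assumption: High_Risk is the positive class
accuracy <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["High_Risk", "High_Risk"] / sum(cm["High_Risk", ])
specificity <- cm["Low_Risk", "Low_Risk"] / sum(cm["Low_Risk", ])

print(cm)
```

These six rows are only a subset of the test set, so the resulting values will not match the metrics reported in misclassification_results.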
To visually evaluate how a specific test sample compares to the survival risk groups defined in the training dataset, we use the km_overlay_plot_f() function.
This function overlays the predicted survival curve of a selected test sample onto the Kaplan-Meier (KM) survival plot derived from the training data. This visual comparison helps determine how closely the test sample aligns with population-level survival trends.
## Required Inputs
It requires the following inputs:
- Train_results: A data frame containing predicted risk groups, survival times (OS_month), event status (OS_event), and additional training data. Row names must correspond to sample IDs.
- Test_results: A data frame with predicted risk groups and prediction probabilities for the test dataset. Row names must correspond to sample IDs.
- survcurve_te_data: A data frame with predicted survival probabilities at multiple time points for test samples (obtained in Step 5).
- selected_sample: The sample ID (matching a row in Test_results) for which the test survival curve should be plotted.
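Before plotting, it can help to confirm that the chosen sample ID is present in both test inputs. The following is a minimal sketch on toy data, assuming (as in the printed example output) that the survival-curve data frame stores one column per sample alongside a time_point column. The helper check_sample() is hypothetical and not part of CPSM.

```r
# Toy stand-ins for Test_results and the survival-curve data
toy_test_results <- data.frame(
  Predicted_Risk_Group = c("Low_Risk", "High_Risk"),
  row.names = c("TCGA-E1-A7Z6-01", "TCGA-HT-8010-01")
)
toy_curves <- data.frame(
  time_point = c(0, 4, 10.93333),
  `TCGA-E1-A7Z6-01` = c(1, 0.9979268, 0.9915719),
  `TCGA-HT-8010-01` = c(1, 0.9961509, 0.9843460),
  check.names = FALSE  # keep the hyphenated sample IDs as column names
)

# Hypothetical helper: TRUE only if the sample is a row in the results
# table and a column in the survival-curve table
check_sample <- function(sample_id, test_results, curves) {
  sample_id %in% rownames(test_results) && sample_id %in% colnames(curves)
}

ok <- check_sample("TCGA-E1-A7Z6-01", toy_test_results, toy_curves)
bad <- check_sample("TCGA-XX-0000-01", toy_test_results, toy_curves)
```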
# Load example data
# Load predicted risk-group results for training samples
data(Train_results, package = "CPSM")
# View top rows of data
print(head(Train_results),3)
#> Sample_ID Actual Predicted_Risk_Group High_Risk_Prob
#> TCGA-E1-5318-01 TCGA-E1-5318-01 Low_Risk Low_Risk 0.3
#> TCGA-DU-A76L-01 TCGA-DU-A76L-01 High_Risk High_Risk 1.0
#> TCGA-CS-6667-01 TCGA-CS-6667-01 Low_Risk Low_Risk 0.0
#> TCGA-E1-A7YI-01 TCGA-E1-A7YI-01 High_Risk High_Risk 1.0
#> TCGA-HT-7610-01 TCGA-HT-7610-01 Low_Risk Low_Risk 0.0
#> TCGA-HT-7856-01 TCGA-HT-7856-01 Low_Risk Low_Risk 0.2
#> Low_Risk_Prob Prediction_Prob OS_month OS_event
#> TCGA-E1-5318-01 0.7 0.7 78 1
#> TCGA-DU-A76L-01 0.0 1.0 27 1
#> TCGA-CS-6667-01 1.0 1.0 48 0
#> TCGA-E1-A7YI-01 0.0 1.0 4 1
#> TCGA-HT-7610-01 1.0 1.0 56 0
#> TCGA-HT-7856-01 0.8 0.8 39 0
# Load predicted risk-group results for test samples
data(Test_results, package = "CPSM")
# View top rows of data
print(head(Test_results),3)
#> Sample_ID Actual Predicted_Risk_Group High_Risk_Prob
#> TCGA-E1-A7Z6-01 TCGA-E1-A7Z6-01 Low_Risk Low_Risk 0.1
#> TCGA-S9-A7IX-01 TCGA-S9-A7IX-01 High_Risk Low_Risk 0.3
#> TCGA-HT-8010-01 TCGA-HT-8010-01 High_Risk High_Risk 0.9
#> TCGA-VM-A8C8-01 TCGA-VM-A8C8-01 Low_Risk High_Risk 1.0
#> TCGA-DU-5847-01 TCGA-DU-5847-01 High_Risk Low_Risk 0.0
#> TCGA-TQ-A7RQ-01 TCGA-TQ-A7RQ-01 High_Risk Low_Risk 0.2
#> Low_Risk_Prob Prediction_Prob OS_month OS_event
#> TCGA-E1-A7Z6-01 0.9 0.9 32 1
#> TCGA-S9-A7IX-01 0.7 0.7 27 1
#> TCGA-HT-8010-01 0.1 0.9 2 0
#> TCGA-VM-A8C8-01 0.0 1.0 46 0
#> TCGA-DU-5847-01 1.0 1.0 18 0
#> TCGA-TQ-A7RQ-01 0.8 0.8 26 0
# Load predicted survival probability data (at multiple time points) for test samples
data(survCurves_data, package = "CPSM")
# View top rows of data
print(head(survCurves_data),3)
#> time_point TCGA-E1-A7Z6-01 TCGA-S9-A7IX-01 TCGA-HT-8010-01 TCGA-VM-A8C8-01
#> 1 0.00000 1.0000000 1.0000000 1.0000000 1.0000000
#> 2 4.00000 0.9979268 0.9866125 0.9961509 0.9914884
#> 3 10.93333 0.9915719 0.9455033 0.9843460 0.9653644
#> 4 14.40000 0.9866364 0.9158244 0.9753768 0.9461198
#> 5 18.86667 0.9753052 0.8542952 0.9554066 0.9050815
#> 6 21.00000 0.9749286 0.8525218 0.9547707 0.9038511
#> TCGA-DU-5847-01 TCGA-TQ-A7RQ-01 TCGA-HT-7606-01 TCGA-S9-A7QY-01
#> 1 1.0000000 1.0000000 1.0000000 1.0000000
#> 2 0.9958189 0.9965105 0.9862409 0.9979053
#> 3 0.9829948 0.9858092 0.9439892 0.9914842
#> 4 0.9732808 0.9776494 0.9135256 0.9864991
#> 5 0.9517442 0.9593905 0.8504851 0.9750597
#> 6 0.9510625 0.9588050 0.8486727 0.9746797
#> TCGA-DH-5142-01 TCGA-DU-6408-01 TCGA-FG-8191-01 TCGA-TM-A84S-01
#> 1 1.0000000 1.0000000 1.0000000 1.0000000
#> 2 0.9994770 0.9978670 0.9975332 0.9960577
#> 3 0.9978757 0.9913285 0.9899704 0.9839665
#> 4 0.9965739 0.9862554 0.9841322 0.9747879
#> 5 0.9933885 0.9746242 0.9708411 0.9543764
#> 6 0.9932728 0.9742383 0.9704045 0.9537275
#> TCGA-DB-A64S-01 TCGA-DU-6542-01 TCGA-DU-A7T6-01 TCGA-DB-A4XF-01
#> 1 1.0000000 1.0000000 1.0000000 1.0000000
#> 2 0.9995280 0.9956397 0.9883252 0.9970554
#> 3 0.9980829 0.9822654 0.9524809 0.9880265
#> 4 0.9969043 0.9721505 0.9264372 0.9810999
#> 5 0.9940069 0.9497736 0.8719608 0.9654678
#> 6 0.9939009 0.9490674 0.8703710 0.9649606
#> TCGA-S9-A6WN-01 TCGA-P5-A5EX-01
#> 1 1.0000000 1.0000000
#> 2 0.9949919 0.9925523
#> 3 0.9796288 0.9696968
#> 4 0.9680701 0.9527668
#> 5 0.9426846 0.9163851
#> 6 0.9418915 0.9152825
# Select a test sample to visualize
sample_id <- "TCGA-TQ-A7RQ-01"
# Generate KM overlay plot
KM_plot <- km_overlay_plot_f(
Train_results = Train_results,
Test_results = Test_results,
survcurve_te_data = survCurves_data,
selected_sample = sample_id,
font_size = 12, #font size
train_palette = c("firebrick", "blue"), # custom risk group colors
test_curve_col = "darkgreen", # highlight test sample in darkgreen
test_curve_size = 1.2, # thicker test sample curve
test_curve_lty = "dotdash", # dashed-dot test curve
annotation_col = "black" # annotation text in black
)
#------------------------ OUTPUTS ---------------------#
# View KM plot representing the comparison of survival curve of selected test sample vs survival curves of risk-groups of training data
KM_plot
This visualization is useful for:
- Displaying individual patient survival patterns in the context of the training samples
- Verifying predicted risk classifications
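To complement the plot with a number, one can read a sample's predicted survival probability at a chosen time directly from the survival-curve table. The sketch below uses the first six time points printed above for TCGA-TQ-A7RQ-01 and treats the curve as a step function, which is an assumption about how the predictions should be interpolated. The helper surv_prob_at() is hypothetical, and the query time is in the same unit as the time_point column.

```r
# First six points of the predicted curve for TCGA-TQ-A7RQ-01
toy_curve <- data.frame(
  time_point = c(0, 4, 10.93333, 14.4, 18.86667, 21),
  surv_prob = c(1, 0.9965105, 0.9858092, 0.9776494, 0.9593905, 0.9588050)
)

# Hypothetical helper: step-function lookup that carries the last
# known probability forward (method = "constant", f = 0); rule = 2
# returns the boundary value for times outside the observed range
surv_prob_at <- function(t, curve) {
  approx(curve$time_point, curve$surv_prob, xout = t,
         method = "constant", f = 0, rule = 2)$y
}

p12 <- surv_prob_at(12, toy_curve)  # value carried forward from 10.93333
print(p12)
```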
The Nomogram_generate_f function in the CPSM package allows you to generate a nomogram plot based on user-defined clinical and other relevant features in the data. For example, we will generate a nomogram using six features: Age, Gender, Race, Histological Type, Sample Type, and PI score.
To create the nomogram, we need to provide the following inputs:
1. Train_Data_Nomogram_input: A dataset containing all the features, where samples are in the rows and features are in the columns.
2. feature_list_for_Nomogram: A list of features (e.g., Age, Gender, etc.) that will be used to generate the nomogram.
3. surv_time: The column name containing survival time in months (e.g., OS_month).
4. surv_event: The column name containing survival event information (e.g., OS).
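Since the feature list and the survival column names must all match columns of the training data, a quick pre-check can catch naming mismatches early. This is a minimal sketch on toy data (three rows and a two-feature list); the object names are hypothetical, and the check is not something the package is known to perform itself.

```r
# Toy stand-in for the nomogram training data
toy_train <- data.frame(
  Age = c(53.11, 54.27, 38.07),
  gender = c("Female", "Male", "Female"),
  OS = c(0, 1, 0),
  OS_month = c(54, 27, 123)
)

# Toy feature list with an ID column, as in the example feature lists
toy_features <- data.frame(ID = c("Age", "gender"))

# Every feature plus the survival time and event columns must be present
required <- c(toy_features$ID, "OS_month", "OS")
missing_cols <- setdiff(required, colnames(toy_train))

if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}
```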
# Load normalized training data containing the selected features (on which the nomogram will be built) along with survival information
data(Train_Data_Nomogram_input, package = "CPSM")
# View top rows of data
print(head(Train_Data_Nomogram_input[1:30]),3)
#> Age subtype gender race ajcc_pathologic_tumor_stage
#> TCGA-CS-5396-01 53.11 PN Female WHITE NA
#> TCGA-DU-A76L-01 54.27 ME Male WHITE NA
#> TCGA-DB-5270-01 38.07 NE Female WHITE NA
#> TCGA-DB-A75P-01 25.77 NE Female NOT AVAILABLE NA
#> TCGA-S9-A6U0-01 46.24 ME Male WHITE NA
#> TCGA-E1-5307-01 62.52 PN Female WHITE NA
#> histological_type histological_grade
#> TCGA-CS-5396-01 Oligodendroglioma G3
#> TCGA-DU-A76L-01 Oligodendroglioma G3
#> TCGA-DB-5270-01 Oligoastrocytoma G3
#> TCGA-DB-A75P-01 Astrocytoma G2
#> TCGA-S9-A6U0-01 Astrocytoma G3
#> TCGA-E1-5307-01 Astrocytoma G3
#> treatment_outcome_first_course radiation_treatment_adjuvant
#> TCGA-CS-5396-01 <NA> YES
#> TCGA-DU-A76L-01 Progressive Disease <NA>
#> TCGA-DB-5270-01 <NA> NO
#> TCGA-DB-A75P-01 Complete Remission/Response NO
#> TCGA-S9-A6U0-01 Partial Remission/Response YES
#> TCGA-E1-5307-01 <NA> YES
#> sample_type type OS OS.time DSS DSS.time DFI DFI.time PFI
#> TCGA-CS-5396-01 Primary LGG 0 1631 0 1631 NA NA 0
#> TCGA-DU-A76L-01 Primary LGG 1 814 1 814 NA NA 1
#> TCGA-DB-5270-01 Primary LGG 0 3733 0 3733 0 3733 0
#> TCGA-DB-A75P-01 Primary LGG 0 492 0 492 0 492 0
#> TCGA-S9-A6U0-01 Primary LGG 1 742 1 742 NA NA 1
#> TCGA-E1-5307-01 Primary LGG 1 1762 1 1762 NA NA 1
#> PFI.time OS_month OS_month.1 ALG6 ARHGAP11A DESI1 GALNT7
#> TCGA-CS-5396-01 1631 54 54 4.705 1.241 24.866 5.760
#> TCGA-DU-A76L-01 410 27 27 3.392 3.124 9.048 2.927
#> TCGA-DB-5270-01 3733 123 123 1.106 0.233 21.734 1.696
#> TCGA-DB-A75P-01 492 16 16 1.241 0.153 15.546 2.890
#> TCGA-S9-A6U0-01 692 24 24 2.223 2.091 15.385 2.788
#> TCGA-E1-5307-01 1452 58 58 2.314 0.454 17.887 2.681
#> GJD3 GPC1 H2BC5 HOXD12 RNF185
#> TCGA-CS-5396-01 0.074 20.434 4.004 0.019 19.033
#> TCGA-DU-A76L-01 2.521 61.482 5.335 0.087 9.067
#> TCGA-DB-5270-01 0.059 22.450 3.475 0.000 12.511
#> TCGA-DB-A75P-01 0.270 30.171 2.319 0.000 14.795
#> TCGA-S9-A6U0-01 0.990 112.796 12.242 0.108 11.498
#> TCGA-E1-5307-01 0.061 26.434 9.964 0.012 18.959
# Load the list of selected features (on which the nomogram will be built)
data(feature_list_for_Nomogram, package = "CPSM")
# View feature list to build nomogram
str(feature_list_for_Nomogram)
#> 'data.frame': 6 obs. of 1 variable:
#> $ ID: chr "Age" "gender" "race" "histological_type" ...
# Call/run function to generate nomogram
Result_Nomogram <- Nomogram_generate_f(
data = Train_Data_Nomogram_input,
Feature_List = feature_list_for_Nomogram,
surv_time = "OS_month",
surv_event = "OS",
font_size = 0.8,
axis_cex = 0.5,
tcl_len = 0.5,
label_margin = 0.5,
col_grid = gray(c(0.85, 0.95))
)
#------------------------ OUTPUTS ---------------------#
# Access output - C-index value and Nomogram plot
C_index_mat <- Result_Nomogram$C_index_mat
# Display C-index
print(C_index_mat)
#> Bias-corrected C-index C-index
#> [1,] 0.85 0.88
After running the function, the output is a nomogram that predicts the risk of the event (e.g., death), as well as the 1-year, 3-year, 5-year, and 10-year survival probabilities for patients based on the selected features. The nomogram provides a visual representation for estimating a patient's survival outcomes at multiple time points, helping clinicians make more informed decisions.
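The C-index stored in C_index_mat summarizes how well the model ranks patients: among comparable pairs, it is the fraction in which the patient with the higher predicted risk has the shorter survival time. The base R sketch below illustrates the standard (Harrell's) definition on four toy patients, with values loosely based on the training rows printed earlier. It is not the package's own computation, and it does not reproduce the bias correction; harrell_c() is a hypothetical helper.

```r
# Hypothetical helper: Harrell's C-index for right-censored data.
# A pair (i, j) is usable when patient i has an observed event
# strictly before patient j's follow-up time.
harrell_c <- function(time, event, risk) {
  concordant <- 0
  tied <- 0
  usable <- 0
  n <- length(time)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      if (event[i] == 1 && time[i] < time[j]) {
        usable <- usable + 1
        if (risk[i] > risk[j]) {
          concordant <- concordant + 1   # higher risk, earlier event
        } else if (risk[i] == risk[j]) {
          tied <- tied + 0.5             # ties in risk count as half
        }
      }
    }
  }
  (concordant + tied) / usable
}

# Toy data: survival time in months, event indicator, risk score
time <- c(78, 27, 48, 4)
event <- c(1, 1, 0, 1)
risk <- c(-1.63, 0.96, -3.11, 4.83)

c_toy <- harrell_c(time, event, risk)
print(c_toy)
```

A value of 0.5 corresponds to random ranking and 1 to perfect ranking, so the values reported above (0.85 bias-corrected, 0.88 apparent) indicate good discrimination on the training data.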
As the last part of this document, we call the function sessionInfo(), which reports the version numbers of R and all the packages used in this session. It is good practice to always keep such a record, as it helps to trace what happened if an R script stops working because functions have changed in a newer version of a package.
sessionInfo()
#> R version 4.5.1 Patched (2025-08-23 r88802)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] SummarizedExperiment_1.39.2 Biobase_2.69.1
#> [3] GenomicRanges_1.61.5 Seqinfo_0.99.2
#> [5] IRanges_2.43.2 S4Vectors_0.47.2
#> [7] BiocGenerics_0.55.1 generics_0.1.4
#> [9] MatrixGenerics_1.21.0 matrixStats_1.5.0
#> [11] CPSM_1.1.4 BiocStyle_2.37.1
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 SurvMetrics_0.5.1 rstudioapi_0.17.1
#> [4] jsonlite_2.0.0 shape_1.4.6.1 magrittr_2.0.4
#> [7] magick_2.9.0 TH.data_1.1-4 farver_2.1.2
#> [10] rmarkdown_2.29 vctrs_0.6.5 base64enc_0.1-3
#> [13] tinytex_0.57 rstatix_0.7.2 polspline_1.1.25
#> [16] htmltools_0.5.8.1 S4Arrays_1.9.1 broom_1.0.10
#> [19] SparseArray_1.9.1 Formula_1.2-5 pROC_1.19.0.1
#> [22] caret_7.0-1 sass_0.4.10 parallelly_1.45.1
#> [25] bslib_0.9.0 htmlwidgets_1.6.4 sandwich_3.1-1
#> [28] plyr_1.8.9 zoo_1.8-14 lubridate_1.9.4
#> [31] cachem_1.1.0 lifecycle_1.0.4 iterators_1.0.14
#> [34] pkgconfig_2.0.3 Matrix_1.7-4 R6_2.6.1
#> [37] fastmap_1.2.0 future_1.67.0 digest_0.6.37
#> [40] colorspace_2.1-2 Hmisc_5.2-3 ggpubr_0.6.1
#> [43] labeling_0.4.3 km.ci_0.5-6 timechange_0.3.0
#> [46] abind_1.4-8 compiler_4.5.1 proxy_0.4-27
#> [49] withr_3.0.2 htmlTable_2.4.3 S7_0.2.0
#> [52] backports_1.5.0 carData_3.0-5 ggsignif_0.6.4
#> [55] MASS_7.3-65 lava_1.8.1 quantreg_6.1
#> [58] DelayedArray_0.35.3 ModelMetrics_1.2.2.2 tools_4.5.1
#> [61] foreign_0.8-90 future.apply_1.20.0 nnet_7.3-20
#> [64] glue_1.8.0 DiagrammeR_1.0.11 nlme_3.1-168
#> [67] gridtext_0.1.5 grid_4.5.1 checkmate_2.3.3
#> [70] cluster_2.1.8.1 reshape2_1.4.4 recipes_1.3.1
#> [73] gtable_0.3.6 KMsurv_0.1-6 class_7.3-23
#> [76] preprocessCore_1.71.2 tidyr_1.3.1 survminer_0.5.1
#> [79] data.table_1.17.8 xml2_1.4.0 car_3.1-3
#> [82] XVector_0.49.1 foreach_1.5.2 pillar_1.11.1
#> [85] stringr_1.5.2 splines_4.5.1 ggtext_0.1.2
#> [88] dplyr_1.1.4 lattice_0.22-7 survival_3.8-3
#> [91] SparseM_1.84-2 tidyselect_1.2.1 rms_8.0-0
#> [94] knitr_1.50 gridExtra_2.3 bookdown_0.44
#> [97] xfun_0.53 hardhat_1.4.2 timeDate_4041.110
#> [100] visNetwork_2.1.4 stringi_1.8.7 yaml_2.3.10
#> [103] evaluate_1.0.5 codetools_0.2-20 data.tree_1.2.0
#> [106] tibble_3.3.0 BiocManager_1.30.26 cli_3.6.5
#> [109] rpart_4.1.24 xtable_1.8-4 randomForestSRC_3.4.1
#> [112] MTLR_0.2.1 jquerylib_0.1.4 survMisc_0.5.6
#> [115] dichromat_2.0-0.1 Rcpp_1.1.0 globals_0.18.0
#> [118] parallel_4.5.1 MatrixModels_0.5-4 ggfortify_0.4.19
#> [121] gower_1.0.2 ggplot2_4.0.0 listenv_0.9.1
#> [124] glmnet_4.1-10 mvtnorm_1.3-3 ipred_0.9-15
#> [127] e1071_1.7-16 scales_1.4.0 prodlim_2025.04.28
#> [130] purrr_1.1.0 crayon_1.5.3 rlang_1.1.6
#> [133] multcomp_1.4-28