ROC Curve

introduction

在信号检测理论中，接收者操作特征曲线（receiver operating characteristic curve）是一种坐标图式的分析工具，用于（1）选择最佳的信号侦测模型，舍弃次佳的模型（2）在同一模型中设定最佳阈值。

ROC分析的是二元分类模型，也就是输出结果只有两种类别的模型，例如：（阳性/阴性）（有病/没病）等。

术语

		真实值	真实值	总数
		p(positive)	n(negative)
预测值	p’	真阳性（TP）	伪阳性（FP）	P’
预测值	n’	伪阴性（FN）	真阴性（TN）	N’
	总数	P	N

真阳性（TP）:正确的肯定，诊断为有，实际上也有癌症。

伪阳性（FP）:错误的肯定，诊断为有，实际上却没有癌症。第一型错误。

真阴性（TN）:正确的否定，诊断为没有，实际上也没有癌症。

伪阴性（FN）:错误的否定，诊断为没有，实际上却有癌症。第二型错误。

真阳性率（TPR, true positive rate）,又称为命中率（hit rate），敏感度（sensitivity）

TPR = TP / P = TP / (TP+FN)

伪阳性率（FPR, false positive rate）,又称为命中率，假警报率 (false alarm rate)

FPR = FP / N = FP / (FP + TN)

真阴性率（TNR）, 又称为特异度（SPC, specificity）

SPC = TN / N = TN / (FP + TN) = 1 - FPR

准确度 (ACC, accuracy)

ACC = (TP + TN) / (P + N), 即：(真阳性+真阴性) / 总样本数

阳性预测值 (PPV)：PPV = TP / (TP + FP)

阴性预测值 (NPV)：NPV = TN / (TN + FN)

假发现率 (FDR)：FDR = FP / (FP + TP)

ROC曲线

x轴：伪阳性率（FPR）; Y轴：真阳性率（TPR）

给定一个二元分类模型和它的阈值，就能从所有样本的（阳性／阴性）真实值和预测值计算出一个 (X=FPR, Y=TPR) 座标点。

例如：假设有100个癌症病人和100个正常人，miR-X的表达值为0，0.1，…10.5的200个值。

我们假设一个阈值：0.5以上为预测有癌症，0.5以下为预测没有癌症的。

	p	n
p’	TP=63	FP=28	91
n’	FN=37	TN=72	109
	100	100

TPR = TP/(TP+FN) = 63/(63+37) = 0.63，FPR = FP/(FP+TN) = 28/(28+72) = 0.28

如果我们调整一个阈值：0.3以上为预测有癌症，0.3以下为预测没有癌症的。

	p	n
p’	TP=77	FP=45	122
n’	FN=23	TN=55	78
	100	100

TPR = TP/(TP+FN) = 77/(77+323) = 0.77，FPR = FP/(FP+TN) = 45/(45+55) = 0.45

完美的预测是一个在左上角的点，在ROC空间座标 (0,1)点，X=0 代表着没有伪阳性，Y=1 代表着没有伪阴性（所有的阳性都是真阳性）。

将同一模型每个阈值的 (FPR, TPR) 座标都画在ROC空间里，就成为特定模型的ROC曲线。

Draw ROC curve using single feature

When we use single feature to draw ROC curve, we need to install and include pROC and ggplot2 package. The data we will be using is displayed below:

SampleID	Expression_of_miR-1	Type
1	66	cancer
2	32	normal
3	73	normal
4	82	cancer
5	61	normal
6	71	normal
7	50	normal
8	41	cancer
9	67	normal
10	45	normal
11	91	cancer
12	72	normal
13	20	normal
14	85	cancer
15	10	normal
16	74	cancer
17	53	cancer
18	41	normal
19	22	normal
20	49	cancer

Here, we present a single feature model. In the data, the first column stands for the ID number for each sample. The second column stands for the expression value of a certain kind of miRNA. The last column represents whether this sample comes from a normal person or a cancer patient.

First we are going to input the data using data.frame. The code is as follows:

SampleID <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
Expression_of_miR-1 <- c(66, 32, 73, 82, 61, 71, 50, 41, 67, 45, 91, 72, 20, 85, 10, 74, 53, 41, 22, 49)
Type <- c("cancer", "normal", "normal", "cancer", "normal", "normal", "normal", "cancer", "normal", "normal", "cancer", "normal", "normal", "cancer", "normal", "cancer", "cancer", "normal", "normal", "cancer")
SampleData <- data.frame(SampleID, Expression_of_miR-1, Type)

Then using special function roc to construct ROC curve and use plot function to display the curve:

rocobj <- roc(SampleData$Type, SampleData$Expression_of_miR-1)
plot(rocobj)

The curve should look like the following. Notice that false positive rate (0 to 1) equals specificity (1 to 0), and true positive rate (0 to 1) equals sensitivity (0 to 1).

Draw ROC curve using multiple features

The following part discusses how to draw ROC curve using R with multiple features. To accomplish this, we should make sure that the ROCR package has been installed and included in the program. We use random forest to construct a model to predict the type. The data we will be using is displayed below:

SampleID	Expression_of_miR-1	Expression_of_miR-2	Expression_of_miR-3	Type
1	12	55	74	cancer
2	87	44	46	normal
3	70	23	56	normal
4	74	35	69	normal
5	46	43	59	cancer
6	58	31	90	cancer
7	55	40	30	normal
8	33	50	76	cancer
9	60	20	34	normal
10	50	22	11	normal
11	70	50	60	cancer
12	22	60	90	cancer
13	68	10	9	normal
14	90	30	20	normal
15	10	40	39	cancer
16	78	50	23	normal
17	50	60	82	cancer
18	70	33	51	normal
19	81	31	12	normal
20	44	11	5	normal
21	20	56	44	cancer
22	51	31	17	normal
23	40	11	4	normal
24	30	60	57	normal
25	81	13	24	normal

Here, the first column stands for the ID number for each sample. The second to the fourth column stands for the expression value of a certain kind of miRNA (1 to 3). The last column represents whether this sample comes from a normal person or a cancer patient. In random forest machine learning, we train the neuron network with 80% of the above data and use the other 20% to test this model and draw ROC curve.

First, we input the data into R using data.frame.

SampleID <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25)
Expression_of_miR-1 <- c(12, 87, 70, 74, 46, 58, 55, 33, 60, 50, 70, 22, 68, 90, 10, 78, 50, 70, 81, 44, 20, 51, 40, 30, 81)
Expression_of_miR-2 <- c(55, 44, 23, 35, 43, 31, 40, 50, 20, 22, 50, 60, 10, 30, 40, 50, 60, 33, 31, 11, 56, 31, 11, 60, 13)
Expression_of_miR-3 <- c(74, 46, 56, 69, 59, 90, 30, 76, 34, 11, 60, 90, 9, 20, 39, 23, 82, 51, 12, 5, 44, 17, 4, 57, 24)
Type <- c("cancer", "normal", "normal", "normal", "cancer", "cancer", "normal", "cancer", "normal", "normal", "cancer", "cancer", "normal", "normal", "cancer", "normal", "cancer", "normal", "normal", "normal", "cancer", "normal", "normal", "normal", "normal")
SampleData <- data.frame(SampleID, Expression_of_miR-1, Expression_of_miR-2, Expression_of_miR-3, Type)

Set random seed so that the repeatability is ensured.

set.seed(123)

As we are trying to make the model predict whether the sample comes from a healthy person or someone with cancer according to the expression value of miRNA, we should input our data without telling the type of each sample. In other words, we will remove the last column of the data. Meanwhile, we need to store all the types in another factor.

all_data <- SampleData[, 1:(ncol(SampleData) - 1)]
all_classes <- factor(SampleData$Type)

As mentioned above, we choose 80% data to train our model and other 20% as test. We can perform this task using the following code:

#Choose 80% of data randomly 
n_samples <- nrow(all_data)
n_train <- floor(n_samples * 0.8)
indices <- sample(1:n_samples)
indices <- indices[1:n_train]
#Transfer data from all_data to separate data frame
train_data <- all_data[indices, ]
train_classes <- all_classes[indices]
test_data <- all_data[-indices, ]
test_classes <- all_classes[-indices]

Then we can train our random forest model with 100 trees and predict the type of test sample. We can also set ‘cancer’ as our positive class.

rf_classifier = randomForest(train_data, train_classes, trees = 100)
predicted_classes <- predict(rf_classifier, test_data)
positive_class <- 'cancer'

In order to draw ROC curve, we should also include the real type of test sample and the probability predicted by the model. This can be generated by the following code:

#Calculate the probability
predicted_probs <- predict(rf_classifier, test_data, type = 'prob')
#Find the test data's real class, and compare with the predicted class
test_labels <- vector('integer', length(test_classes))
test_labels[test_classes != positive_class] <- 0
test_labels[test_classes == positive_class] <- 1
pred <- prediction(predicted_probs[, positive_class], test_labels)

Then, we can obtain the ROC curve by the following code:

roc <- performance(pred, 'tpr', 'fpr')
plot(roc, main = 'ROC Curve')

The curve should look like this:

Reference

[1] Fawcett, Tom. “An Introduction to ROC Analysis.” Pattern Recognition Letters 27.8 (2006): 861-74.

[2] Olivares-Morales, Andrés, Oliver Hatley, J. Turner, D. Galetin, David Aarons, and Aleksandra Rostami-Hodjegan. “The Use of ROC Analysis for the Qualitative Prediction of Human Oral Bioavailability from Animal Data.” Pharmaceutical Research 31.3 (2014): 720-30.

[3] Robin, Xavier, Turck, Natacha, Hainard, Alexandre, Tiberti, Natalia, Lisacek, Frederique, Sanchez, Jean-Charles, and Muller, Markus. “PROC: An Open-source Package for R and S to Analyze and Compare ROC Curves.” BMC Bioinformatics 12 (2011): 77.

[4] https://lulab2.gitbook.io/teaching/part-iii.-machine-learning-basics/8.1-machine-learning-with-r

training_class

ROC Curve

introduction

Draw ROC curve using single feature

Draw ROC curve using multiple features

Reference