Evaluation Metrics for Text Summarization

ROUGE-N

Here N refers to the n-gram size: n=1 gives ROUGE-1 (unigrams), n=2 gives ROUGE-2 (bigrams), and n=3 gives ROUGE-3 (trigrams).

For ROUGE-N, recall is the number of overlapping n-grams divided by the total number of n-grams in the reference summary, precision is the overlap divided by the total number of n-grams in the generated summary, and F1 is the harmonic mean of the two.
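To make the computation concrete, here is a minimal hand-rolled sketch of ROUGE-N (plain Python, whitespace tokenization, with clipped overlap counts; the rouge package used later implements the same idea):

from collections import Counter

def rouge_n(candidate, reference, n=1):
    # Collect n-gram counts from a whitespace-tokenized string.
    def ngrams(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Clipped overlap: each reference n-gram can be matched at most once.
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * recall * precision / (recall + precision) if overlap else 0.0
    return recall, precision, f1

print(rouge_n("the cat sat on the mat", "the cat is on the mat", n=1))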

ROUGE-L

Here L refers to the longest common subsequence (LCS): the score is based on the length of the longest common subsequence between the generated and reference summaries. Note that this is a subsequence, not a substring; the matched words must appear in the same order but need not be contiguous.
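As an illustration, a minimal sketch of sentence-level ROUGE-L (ROUGE-L proper also defines a weighted, summary-level variant; this version just uses the raw LCS length):

def lcs_len(a, b):
    # Classic dynamic-programming LCS length over two token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_len(cand, ref)
    recall, precision = lcs / len(ref), lcs / len(cand)
    f1 = 2 * recall * precision / (recall + precision) if lcs else 0.0
    return recall, precision, f1

print(rouge_l("the cat sat on the mat", "the cat is on the mat"))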

ROUGE-S

Here S refers to skip-bigrams: pairs of words that appear in sentence order but may be separated by other words. (Despite the similar name, this is a different concept from the skip-gram model seen in word2vec.)
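A simplified set-based sketch of the idea (the official ROUGE-S counts matches against combinatorial denominators, so treat this only as an illustration):

from itertools import combinations

def skip_bigrams(text):
    # All in-order word pairs, with any number of words allowed in between.
    return set(combinations(text.split(), 2))

cand = skip_bigrams("the cat sat on the mat")
ref = skip_bigrams("the cat is on the mat")
overlap = cand & ref
print(len(overlap) / len(ref), len(overlap) / len(cand))  # recall, precision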

Discussion: shortcomings of ROUGE

If the generated text is semantically identical to the reference but worded differently (for example, using synonyms), this kind of evaluation breaks down: ROUGE only measures surface n-gram overlap, so it cannot evaluate semantic similarity.
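A quick demonstration with the same rouge package used below: these two sentences mean essentially the same thing, yet they share only a couple of tokens, so every ROUGE variant scores them low (exact numbers depend on the package version):

from rouge import Rouge

rouge = Rouge()
# Semantically equivalent, lexically different.
print(rouge.get_scores("the movie was fantastic", "the film was great"))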

Hands-on code

from rouge import Rouge


# Evaluate a single hypothesis/reference pair
hypothesis = ("the #### transcript is a written version of each day 's cnn student news program use this transcript to help students with reading comprehension and vocabulary use the weekly newsquiz to test your knowledge of storie s you saw on cnn student news")

reference = ("this page includes the show transcript use the transcript to help students with reading comprehension and vocabulary at the bottom of the page , comment for a chance to be mentioned on cnn student news . you must be a teacher or a student age # # or older to request a mention on the cnn student news roll call . the weekly newsquiz tests students ' knowledge of even ts in the news")

rouge = Rouge()
scores = rouge.get_scores(hypothesis, reference)
print(scores)

[
{
'rouge-1': {'r': 0.42857142857142855, 'p': 0.5833333333333334,
'f': 0.49411764217577864},
'rouge-2': {'r': 0.18571428571428572, 'p': 0.3170731707317073,
'f': 0.23423422957552154},
'rouge-l': {'r': 0.3877551020408163, 'p': 0.5277777777777778,
'f': 0.44705881864636676}
}
]

Batch evaluation

# Batch evaluation: pass equal-length lists of hypotheses and references;
# avg=True averages the scores across all pairs.
model_outs = [
"this is a nlp lesson",
"my name is liutao",
"how are you ?"
]
reference = [
"this is a text",
"your name is liutao",
"what do you like ?"
]
scores = rouge.get_scores(model_outs, reference, avg=True)
print(scores)


{
'rouge-1': {'r': 0.6333333333333333, 'p': 0.6166666666666667,
'f': 0.6203703654115226},
'rouge-2': {'r': 0.4444444444444444, 'p': 0.38888888888888884,
'f': 0.4126984093990931},
'rouge-l': {'r': 0.6333333333333333, 'p': 0.6166666666666667,
'f': 0.6203703654115226}
}

The rouge package cannot score Chinese text directly, because it matches whitespace-separated tokens; Chinese input needs some extra preprocessing first.
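A minimal sketch of one common workaround, assuming the jieba package for word segmentation (splitting into single characters with " ".join(text) works as well):

import jieba
from rouge import Rouge

def to_tokens(text):
    # Segment the Chinese text and join the tokens with spaces,
    # so that rouge's whitespace-based matching applies.
    return " ".join(jieba.lcut(text))

rouge = Rouge()
print(rouge.get_scores(to_tokens("今天天气很好"), to_tokens("今天的天气不错")))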