Beyond ROUGE: A Comprehensive Evaluation Metric for Abstractive Summarization Leveraging Similarity, Entailment, and Acceptability

Mohammed Khalid Hilmi Briman

https://orcid.org/0009-0000-5785-6916

Computer Engineering Department, Atilim University, Kizilcasar Mahallesi, Incek Golbasi, Ankara 06830, Turkey

E-mail Address: mohammedbriman@gmail.com

and

Beytullah Yildiz

https://orcid.org/0000-0001-7664-5145

Software Engineering Department, Atilim University, Kizilcasar Mahallesi, Incek Golbasi, Ankara 06830, Turkey

E-mail Address: beytullah.yildiz@atilim.edu.tr

https://doi.org/10.1142/S0218213024500179

Abstract

The vast amount of textual information on the internet has amplified the importance of text summarization models. Abstractive summarization generates new words and sentences that may not appear in the source document. Such abstractive models can suffer from shortcomings such as poor linguistic acceptability and hallucinations. Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is a metric commonly used to evaluate abstractive summarization models. However, because it relies on n-gram overlap, it ignores several critical linguistic aspects. In this work, we propose the Similarity, Entailment, and Acceptability Score (SEAScore), an automatic metric for evaluating abstractive text summarization models that builds on state-of-the-art pre-trained language models. SEAScore comprises three language models (LMs) that extract meaningful linguistic features from candidate and reference summaries and a weighted sum aggregator that computes an evaluation score. Experimental results show that our LM-based SEAScore metric correlates better with human judgment than standard evaluation metrics such as ROUGE-N and BERTScore.
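To make the two ideas in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. The first function is a simplified ROUGE-N recall (whitespace tokenization, no stemming) that illustrates why pure n-gram overlap can miss linguistic acceptability. The second is a generic weighted sum over three feature scores. The function names, the feature values, the equal weights, and the weight normalization are all assumptions made for illustration; the abstract does not state the actual weights or how the language models produce their scores.

```python
from collections import Counter


def rouge_n_recall(candidate: str, reference: str, n: int = 1) -> float:
    """Simplified ROUGE-N recall: share of reference n-grams found in the candidate.

    Assumption: lowercase whitespace tokenization, no stemming or stopword handling.
    """
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each n-gram's match count by how often it occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def weighted_sum_score(features: dict, weights: dict) -> float:
    """Hypothetical aggregator: weighted sum of feature scores, each assumed in [0, 1].

    Weights are normalized here so the result also stays in [0, 1];
    that normalization is a design choice of this sketch.
    """
    missing = set(features) - set(weights)
    if missing:
        raise ValueError(f"no weight given for: {sorted(missing)}")
    total_weight = sum(weights[name] for name in features)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * features[name] for name in features) / total_weight


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    scrambled = "mat the on sat cat the"  # same words, ungrammatical order

    # Unigram recall cannot tell the scrambled candidate from the reference.
    print("ROUGE-1 recall:", rouge_n_recall(scrambled, reference, n=1))
    print("ROUGE-2 recall:", rouge_n_recall(scrambled, reference, n=2))

    # Placeholder feature scores; in SEAScore these would come from the three LMs.
    features = {"similarity": 0.9, "entailment": 0.8, "acceptability": 0.1}
    weights = {"similarity": 1.0, "entailment": 1.0, "acceptability": 1.0}
    print("aggregated score:", weighted_sum_score(features, weights))
```

In this toy case the scrambled candidate matches every reference unigram, so a unigram-only metric rates it as highly as the reference itself, while a low acceptability score pulls the aggregated value down. How much it does so depends entirely on the chosen weights.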
