Content Models for Survey Generation: A Factoid-Based Evaluation

Abstract: We present a new factoid-annotated dataset for evaluating content models for scientific survey article generation, containing 3,425 sentences from 7 topics in natural language processing. We also introduce a novel HITS-based content model for automated survey article generation called HITSUM that exploits the lexical network structure between sentences from citing and cited papers. Using the factoid-annotated data, we conduct a pyramid evaluation and compare HITSUM with two previous state-of-the-art content models: C-LexRank, a network-based content model, and TOPICSUM, a Bayesian content model. Our experiments show that our new content model captures useful survey-worthy information and outperforms C-LexRank by 4% and TOPICSUM by 7% in pyramid evaluation.
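To illustrate the general idea behind a HITS-based content model, the sketch below runs hub/authority updates on a bipartite lexical graph, treating citing sentences as hubs and cited-paper sentences as authorities. The graph construction (cosine similarity over bag-of-words vectors) and the toy sentences are illustrative assumptions, not HITSUM's actual pipeline.

```python
# Sketch only: HITS-style hub/authority scoring on a bipartite sentence graph.
# Hubs = citing sentences, authorities = cited-paper sentences.
# Edge weights here are bag-of-words cosine similarities (an assumption,
# not necessarily the graph construction used in the paper).
from collections import Counter
import math
import numpy as np

def bow(sentence):
    return Counter(sentence.lower().split())

def cosine(c1, c2):
    common = set(c1) & set(c2)
    num = sum(c1[w] * c2[w] for w in common)
    den = (math.sqrt(sum(v * v for v in c1.values()))
           * math.sqrt(sum(v * v for v in c2.values())))
    return num / den if den else 0.0

def hits(W, iters=50):
    """Iterate HITS updates on edge-weight matrix W (hubs x authorities)."""
    hubs = np.ones(W.shape[0])
    auths = np.ones(W.shape[1])
    for _ in range(iters):
        auths = W.T @ hubs
        auths /= np.linalg.norm(auths) or 1.0
        hubs = W @ auths
        hubs /= np.linalg.norm(hubs) or 1.0
    return hubs, auths

# Hypothetical toy data for illustration.
citing = ["word sense disambiguation uses context features",
          "supervised WSD relies on annotated corpora"]
cited = ["we train a classifier on labeled senses",
         "context windows provide disambiguation features",
         "the corpus is annotated with sense labels"]

W = np.array([[cosine(bow(c), bow(d)) for d in cited] for c in citing])
hub_scores, auth_scores = hits(W)

# Cited-paper sentences with high authority scores are candidates for
# inclusion in an extractive survey summary.
ranking = sorted(zip(auth_scores, cited), reverse=True)
```

In this framing, a citing sentence that points to many informative cited sentences earns a high hub score, and a cited sentence referenced by many good hubs earns a high authority score; the authority ranking then drives sentence selection.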

Recommended citation: Rahul Jha, Catherine Finegan-Dollak, Ben King, Reed Coke, Dragomir Radev, "Content Models for Survey Generation: A Factoid-Based Evaluation." Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2015.