
Authors:
(1) Rohit Saxena, Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh;
(2) Frank Keller, Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh.
Abstractive summarization for long-form narrative texts such as movie scripts is challenging due to the computational and memory constraints of current language models. A movie script typically comprises a large number of scenes; however, only a fraction of these scenes are salient, i.e., important for understanding the overall narrative. The salience of a scene can be operationalized by considering it salient if it is mentioned in the summary. Automatically identifying salient scenes is difficult due to the lack of suitable datasets. In this work, we introduce a scene saliency dataset that consists of human-annotated salient scenes for 100 movies. We propose a two-stage abstractive summarization approach which first identifies the salient scenes in a script and then generates a summary using only those scenes. Using QA-based evaluation, we show that our model outperforms previous state-of-the-art summarization methods and reflects the information content of a movie more accurately than a model that takes the whole movie script as input.[1]
Abstractive summarization is the process of reducing an information source to its most important content by generating a coherent summary. Previous work has primarily focused on news (Cheng and Lapata, 2016; Gehrmann et al., 2018), meetings (Zhong et al., 2021), and dialogues (Zhong et al., 2022; Zhu et al., 2021a), but there is limited prior work on summarizing long-form narrative texts such as movie scripts (Gorinski and Lapata, 2015; Chen et al., 2022).
Long-form narrative summarization poses challenges to large language models (Beltagy et al., 2020; Zhang et al., 2020a; Huang et al., 2021), both in terms of memory complexity and in terms of attending to salient information in the text. Large language models perform poorly on long inputs in zero-shot settings compared to fine-tuned models (Shaham et al., 2023). Recently, Liu et al. (2024) showed that the performance of these models degrades when the relevant information is located in the middle of a long document. With an average length of 110 pages, movie scripts are therefore challenging to summarize.
Several methods have previously relied on content selection to reduce the input size for summarization, performing the selection either implicitly using neural network attention (Chen and Bansal, 2018; You et al., 2019; Zhong et al., 2021) or explicitly (Ladhak et al., 2020; Manakul and Gales, 2021; Zhang et al., 2022) by aligning the source document with the summary using metrics such as ROUGE (Lin, 2004). Unlike for news articles, implicit attention-based methods are problematic for movie scripts, as current methods cannot reliably process text of such length. Current explicit methods, on the other hand, are neither optimized nor evaluated for content selection using gold-standard labels. In addition, given the large number of script sentences that contain repeated mentions of characters and locations, a method based on a lexical overlap metric such as ROUGE produces many false positives. Crucially, all these methods treat source–summary alignment as an auxiliary task without actually optimizing or evaluating it.
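To make the false-positive problem concrete, the following is a minimal sketch of explicit greedy alignment with ROUGE-1 F1, in the spirit of Chen and Bansal (2018) and Zhang et al. (2022); the function name and the use of the rouge-score package are our illustrative choices, not the setup of any specific prior system:

```python
# Sketch of explicit content selection via greedy lexical alignment.
# Illustrative only: greedy_align is a hypothetical helper, not code
# from the paper's released repository.
from rouge_score import rouge_scorer

def greedy_align(summary_sents, script_sents):
    """Align each summary sentence to the script sentence with the
    highest ROUGE-1 F1 overlap; returns (summary_idx, script_idx) pairs."""
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
    alignment = []
    for i, summ in enumerate(summary_sents):
        scores = [scorer.score(summ, src)["rouge1"].fmeasure
                  for src in script_sents]
        best = max(range(len(script_sents)), key=scores.__getitem__)
        alignment.append((i, best))
    return alignment
```

Because a script mentions the same characters and locations in hundreds of sentences, many script sentences overlap lexically with any summary sentence that names them, so the top-scoring sentence under such a metric is often not the narratively relevant one.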
For news summarization, Ernst et al. (2021) created crowd-sourced development and test sets for the evaluation of proposition-level alignment. However, news texts differ from movie scripts both in length and in terms of the rigid inverted pyramid structure that is typical for news articles. For movie scripts, Mirza et al. (2021) proposed a specialized alignment method which they evaluated on a set of 10 movies. However, they do not perform movie script summarization.
Movie scripts are structured in terms of scenes, where each scene describes a distinct plot element, happening at a fixed place and time and involving a fixed set of characters. It therefore makes sense to formalize movie summarization as the identification of the most salient scenes of a movie, followed by the generation of an abstractive summary of those scenes (Gorinski and Lapata, 2015). Hence we define movie scene saliency based on whether a scene is mentioned in the summary: if it is, we consider it salient. Using scene saliency for summarization is therefore a method of explicit content selection.
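Under this definition, sentence-level alignments translate directly into binary scene labels. Below is a minimal sketch, assuming a hypothetical sent_to_scene mapping from script-sentence index to scene index:

```python
# Derive binary scene saliency labels from a sentence-level alignment.
# sent_to_scene is a hypothetical precomputed mapping, assumed here.
def scene_saliency_labels(alignment, sent_to_scene, num_scenes):
    """A scene is salient iff at least one summary sentence aligns
    to a sentence inside it."""
    labels = [0] * num_scenes
    for _summary_idx, script_idx in alignment:
        labels[sent_to_scene[script_idx]] = 1
    return labels
```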
In this paper, we first introduce MENSA, a Movie ScENe SAliency dataset that includes human annotation of salient scenes in movie scripts. Our annotators manually align Wikipedia summary sentences with movie scenes for 100 movies. We use these gold-standard annotations to evaluate existing explicit alignment methods. We then propose a supervised scene saliency classification model to identify salient scenes given a movie script. Specifically, we use the alignment method that performs best on the gold-standard data to generate silver-standard labels on a larger dataset, on which we then train a sequence classification model over scene embeddings to identify salient scenes. We then fine-tune a pre-trained language model using only the salient scenes to generate movie summaries. This model achieves new state-of-the-art summarization results as measured by ROUGE and BERTScore (Zhang et al., 2020b). In addition, we evaluate the generated summaries using a question-answer-based metric (Deutsch et al., 2021) and show that summaries generated using only the salient scenes outperform those generated using the entire movie script or by baseline models.
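The following is a minimal sketch of this two-stage pipeline. All concrete choices are assumptions for illustration: the sentence encoder (all-mpnet-base-v2), the classifier architecture, the 0.5 threshold, and the BART checkpoint stand in for the paper's actual components, and the classifier would first be trained on the silver-standard saliency labels:

```python
# Two-stage sketch: (1) score scene saliency over scene embeddings with a
# small Transformer sequence classifier, (2) summarize only the salient
# scenes with a pre-trained seq2seq model. Untrained as written.
import torch
from sentence_transformers import SentenceTransformer
from transformers import BartForConditionalGeneration, BartTokenizer

class SaliencyClassifier(torch.nn.Module):
    def __init__(self, dim=768, num_layers=2, num_heads=8):
        super().__init__()
        layer = torch.nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = torch.nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = torch.nn.Linear(dim, 1)

    def forward(self, scene_embs):          # (batch, num_scenes, dim)
        h = self.encoder(scene_embs)        # contextualize scenes across the script
        return torch.sigmoid(self.head(h)).squeeze(-1)  # saliency probabilities

def summarize_movie(scenes, classifier, threshold=0.5):
    # Stage 1: embed each scene and select the salient ones.
    embedder = SentenceTransformer("all-mpnet-base-v2")  # 768-dim embeddings
    embs = torch.tensor(embedder.encode(scenes)).unsqueeze(0)
    with torch.no_grad():
        probs = classifier(embs)[0]
    salient = [s for s, p in zip(scenes, probs) if p >= threshold]
    # Stage 2: generate an abstractive summary from the salient scenes only.
    tok = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
    inputs = tok(" ".join(salient), return_tensors="pt",
                 truncation=True, max_length=1024)
    ids = model.generate(**inputs, num_beams=4, max_length=256)
    return tok.decode(ids[0], skip_special_tokens=True)
```

Selecting scenes before generation keeps the second-stage input within the sequence-length budget of a standard pre-trained model, which is the point of the two-stage design.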
Summarization of long-form documents has been studied across various domains, such as news articles (Zhu et al., 2021b), books (Kryscinski et al., 2022), dialogues (Zhong et al., 2022), meetings (Zhong et al., 2021), and scientific publications (Cohan et al., 2018). To process long documents, many efficient transformer variants have been proposed (Zaheer et al., 2020; Zhang et al., 2020a; Huang et al., 2021). Similarly, work such as Longformer (Beltagy et al., 2020) uses local and global attention in transformers (Vaswani et al., 2017) to process long inputs. However, given that movie scripts are particularly long (see Table 1), these models still have limited capacity due to memory and time complexity, and need to truncate movie scripts to the maximum sequence length the model supports.
Over the past decade, numerous approaches to movie summarization have been proposed. Gorinski and Lapata (2015, 2018) generate movie overviews using a graph-based model and create movie script summaries based on progression, diversity, and importance. In contrast, the aim of our work is to find salient scenes and use these for summarization. Papalampidi et al. (2019, 2021) summarize movie scripts by identifying turning points, i.e., important narrative events. In contrast, our approach is based on salient scenes and does not assume a rigid narrative structure. Recently, Agarwal et al. (2022) proposed a shared task for script summarization; the best-performing model (Pu et al., 2022) used a heuristic approach to truncate the script.
Several methods (Ladhak et al., 2020; Manakul and Gales, 2021; Liu et al., 2022) have leveraged content selection for summarization. Chen and Bansal (2018) and Zhang et al. (2022) generate silver-standard labels through greedy alignment of source document sentences with summary sentences. However, these methods do not explicitly evaluate the alignments. Moreover, movie scripts consist of a large number of sentences mentioning the same characters and location names, which can generate many false positives in greedy alignment. We collect gold-standard saliency labels to compare and evaluate alignment methods. Mirza et al. (2021) proposed a movie script alignment method for summaries but did not propose a summarization model. Recent work (Dou et al., 2021; Wang et al., 2022) has employed neural network attention for the summarization of short documents. However, movie scripts are challenging for attention-based methods, given their length.
This paper is available on arxiv under CC BY 4.0 DEED license.
[1] Our dataset and code are released at https://github.com/saxenarohit/select_summ.