
Authors:
(1) Rohit Saxena, Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh;
(2) Frank Keller, Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh.
We define the saliency of a movie scene based on whether the scene is mentioned in a user-written summary of the movie. If a scene appears in the summary, it is considered salient for understanding the narrative of the movie. By aligning summary sentences to movie scenes, we identify salient scenes and then use them for movie summarization.
The MENSA dataset consists of the scripts of 100 movies and their respective Wikipedia plot summaries, annotated with gold-standard sentence-to-scene alignments. We selected 80 movies randomly from ScriptBase (Gorinski and Lapata, 2015) and added 20 recently released movies with manually corrected scripts, all of which have Wikipedia plot summaries.
Both MENSA and ScriptBase are movie script datasets and differ from other dialogue/narrative datasets such as SummScreenFD (Chen et al., 2022), the ForeverDreaming subset of the SummScreen dataset as used in the SCROLLS benchmark (Shaham et al., 2022). SummScreenFD is a dataset of TV show episodes consisting of crowd-sourced transcripts and recaps. In contrast, the movie scripts in our dataset were written by screenwriters and the summaries were curated by Wikipedia editors. It is important to note that movies and TV shows differ in storytelling structure, number of acts, and length. As shown in Table 2, SummScreenFD has shorter input texts and summaries than movie scripts.
Formally, let M denote a movie script consisting of a sequence of scenes M = {S1, S2, ..., SN} and let D denote the Wikipedia plot summary consisting of a sequence of sentences D = {s1, s2, ..., sT}. The aim is to annotate and select a subset of salient scenes M′ such that M′ ⊂ M and |M′| ≪ |M|, where for every scene in M′ there exists at least one aligned sentence in D.
To manually align the summary sentences for all 100 movies, we recruited five in-house annotators. They received detailed annotation instructions and were trained by the authors until they were able to perform the alignment task reliably. To analyze inter-annotator agreement, 15 movies were selected randomly and triple-annotated. The remaining 85 movies were annotated by a single annotator to reduce annotation cost, similar to the annotation process of Papalampidi et al. (2019). As annotating and aligning a full-length movie script with its summary is a difficult task, we provided annotators with a default alignment generated by the alignment model of Mirza et al. (2021). For every summary sentence, annotators first verified the default alignment with movie script scenes. If the alignment was only partially correct or missing, they corrected it by adding or removing scenes for the given sentence using a web-based tool. We assume that each sentence can be aligned to one or more scenes and vice versa. Table 1 presents statistics of the scripts and summaries in the MENSA dataset.
To evaluate the quality of the collected annotations, we computed inter-annotator agreement on the triple-annotated movies using three metrics: (a) Exact Match Agreement (EMA), (b) Partial Agreement (PA), and (c) Mean Annotation Distance (D). These measures were used for a similar annotation task by Papalampidi et al. (2019).[2] EMA is the Jaccard similarity between the scene sets selected by the three annotators for a given summary sentence (i.e., the ratio of the scenes they exactly agree upon to all scenes any of them selected), averaged over all sentences in the summary, and is computed as follows:
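Concretely, if A_k(s) denotes the set of scenes annotator k aligned to summary sentence s and T is the number of summary sentences (notation assumed here, not necessarily the paper's), one plausible way to write this is:

```latex
\mathrm{EMA} = \frac{1}{T} \sum_{s \in D}
  \frac{|A_1(s) \cap A_2(s) \cap A_3(s)|}{|A_1(s) \cup A_2(s) \cup A_3(s)|}
```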
Partial agreement (PA) is the fraction of summary sentences for which the annotators' selected scene sets overlap in at least one scene, and is given as follows:
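Under the same assumed notation, and assuming the overlap is taken over all three annotators' scene sets, this can be written with an indicator function as:

```latex
\mathrm{PA} = \frac{1}{T} \sum_{s \in D}
  \mathbf{1}\!\left[\, A_1(s) \cap A_2(s) \cap A_3(s) \neq \emptyset \,\right]
```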
The annotation distance (d) for a summary sentence s between two annotators is defined as the minimum overlap distance between their selected scene sets and is computed as follows:
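With scenes indexed by their position in the script, one plausible reading of "minimum overlap distance" (an assumption, not the paper's exact definition) is the smallest index difference between the two annotators' selected scenes, averaged over sentences and annotator pairs to give the reported mean (written here as \bar{d} to avoid clashing with the summary D):

```latex
d(s; A_i, A_j) = \min_{a \in A_i(s),\; b \in A_j(s)} |a - b|,
\qquad
\bar{d} = \frac{1}{T} \sum_{s \in D} \frac{1}{3} \sum_{i < j} d(s; A_i, A_j)
```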
EMA and PA between our annotators were 52.80% and 81.63%, respectively. The high PA indicates that for most summary sentences, the annotators' selected scenes overlap in at least one scene. This is consistent with the low mean annotation distance of 1.21, which indicates that on average the annotations are about one scene apart. The EMA shows that for more than half of the sentences, the annotators agree exactly on the scene-to-sentence alignment.
Since it is too expensive and time-consuming to collect gold-standard scene saliency labels for the whole of ScriptBase (Gorinski and Lapata, 2015), we generate silver-standard labels to train a model for scene saliency classification. Based on our definition of scene saliency above, silver-standard labels can be generated by aligning movie scenes with summary sentences.
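As a minimal sketch of this labelling step (not the authors' implementation), the following assumes an aligner that returns, for each summary sentence, the set of scene indices it is aligned to; a scene is labelled salient if any summary sentence aligns to it:

```python
def saliency_labels(num_scenes, alignment):
    """Derive binary scene-saliency labels from a sentence-to-scene alignment.

    alignment: list over summary sentences, each entry a set of 0-based scene
               indices that the sentence is aligned to.
    Returns a list of 0/1 labels, one per scene: 1 if any summary sentence
    aligns to the scene, 0 otherwise.
    """
    labels = [0] * num_scenes
    for scene_indices in alignment:
        for idx in scene_indices:
            labels[idx] = 1
    return labels

# Example: a 6-scene script whose summary has 3 sentences.
alignment = [{0}, {2, 3}, {5}]
print(saliency_labels(6, alignment))  # [1, 0, 1, 1, 0, 1]
```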
Alignment between source document segments and summary sentences has previously been proposed for news summarization (Chen and Bansal, 2018; Zhang et al., 2022) and narrative text (Mirza et al., 2021). Using our gold-standard labels, we investigate which of these approaches yields better alignment between movie scripts and summaries and therefore should be used to generate silver-standard labels for scene saliency.
Chen and Bansal (2018) used ROUGE-L to align each summary sentence to the most similar source document sentence. In our case, we lifted these sentence-level alignments over the movie script to the scene level: if a scene contains the aligned script sentence, that scene is aligned to the summary sentence. Zhang et al. (2022) used a greedy algorithm for aligning document segments and summary sentences, where for each segment, sentences are aligned based on the gain in ROUGE-1 score; in our case, movie scenes serve as the source document segments. Mirza et al. (2021) proposed an alignment method specifically for movie scripts that combines semantic similarity with Integer Linear Programming (ILP) to align movie script scenes to summary sentences.
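The scene-level variant of the ROUGE-L approach can be sketched as follows. This is a simplified illustration, not the authors' exact implementation; it assumes whitespace tokenization and an LCS-based ROUGE-L F1:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(hyp, ref):
    """ROUGE-L F1 between two whitespace-tokenised strings."""
    h, r = hyp.lower().split(), ref.lower().split()
    lcs = lcs_len(h, r)
    if lcs == 0:
        return 0.0
    p, rec = lcs / len(h), lcs / len(r)
    return 2 * p * rec / (p + rec)

def align_summary_to_scenes(summary_sentences, scenes):
    """For each summary sentence, pick the scene containing the most similar
    script sentence (by ROUGE-L), i.e. the sentence-level alignment of
    Chen and Bansal (2018) lifted to the scene level.
    `scenes` is a list of scenes, each a list of script sentences."""
    alignment = []
    for summary_sentence in summary_sentences:
        best_scene, best_score = None, -1.0
        for scene_idx, scene in enumerate(scenes):
            for script_sentence in scene:
                score = rouge_l_f1(summary_sentence, script_sentence)
                if score > best_score:
                    best_scene, best_score = scene_idx, score
        alignment.append({best_scene})
    return alignment
```

The output format (one set of scene indices per summary sentence) matches the input expected by the saliency-label sketch above.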
Table 3 presents the results of applying these three approaches to our gold-standard MENSA dataset. We report macro-averaged precision (P), recall (R), and F1 score. The method of Mirza et al. (2021) performs significantly better than the ROUGE-based methods, possibly because it was specifically designed to align movie scenes with summary sentences.[3] We therefore used this alignment method to generate silver-standard scene saliency labels for the complete ScriptBase corpus.
Our dataset can be used in the future to evaluate content selection strategies in long documents. The gold-standard salient scenes can also be used to evaluate extractive summarization methods.
We now introduce our Select and Summarize (SELECT & SUMM) model, which first uses a classification model (Section 4) to predict the salient scenes and then utilizes only the salient scenes to generate a movie summary using a pre-trained abstractive summarization model (Section 5). These models are trained in a two-stage pipeline.
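The two-stage pipeline can be sketched as follows; the interfaces, fallback behaviour, and threshold are illustrative assumptions, not details from the paper:

```python
from typing import Callable, List

def select_and_summarize(
    scenes: List[str],
    saliency_model: Callable[[str], float],   # stage 1: returns P(salient) for a scene
    summarizer: Callable[[str], str],         # stage 2: abstractive summarization model
    threshold: float = 0.5,                   # illustrative cut-off, not from the paper
) -> str:
    """Keep the scenes the classifier deems salient, then summarize only those."""
    salient = [scene for scene in scenes if saliency_model(scene) >= threshold]
    if not salient:          # fall back to the full script if nothing is selected
        salient = scenes
    return summarizer("\n\n".join(salient))
```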
This paper is available on arxiv under CC BY 4.0 DEED license.
[2] We renamed total agreement in Papalampidi et al. (2019) to EMA for clarity.
[3] It was also used to generate the default alignment that our human annotators had to correct, which biases our evaluation towards the method of Mirza et al. (2021). However, our results are still a good measure of how many errors human annotators find in the alignment generated by this method.