
Authors:
(1) Rohit Saxena, Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh;
(2) Frank Keller, Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh.
All experiments were performed on an A100 GPU with 80GB memory. Fully fine-tuning the LED model took approximately 22 hours, and the Pegasus-X model took 30 hours. The LED-based models have 161M parameters, all of which were fine-tuned. Our scene saliency model has 60.2M parameters, for a total of 221.2M parameters. Pegasus-X has 568M parameters but performs worse than LED.
For evaluation, we used Benjamin Heinzerling’s implementation of ROUGE [5] and BERTScore with the microsoft/deberta-xlarge-mnli model.
We compared the performance of RoBERTa with that of BART (Lewis et al., 2020) and LED (encoder only) as the base models for computing scene embeddings in salient scene classification. For each model, we used the large variant and extracted the encoder’s last hidden state as scene embeddings. We report the results of scene saliency classification with different base models in Table 8. Among these models, RoBERTa’s embeddings performed marginally better while also having fewer parameters.
To study the robustness of the scene saliency classifier we performed k-fold cross-validation with k = 5. We report mean results with standard deviation across all folds in Table 9. The low standard deviation shows that the performance of the scene classifier is robust across different folds.
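The cross-validation protocol above can be sketched as follows. This is a minimal illustration, not the paper's code: `score_fold` is a hypothetical stand-in for training the scene saliency classifier on the remaining folds and scoring it on the held-out fold.

```python
import random
import statistics

def kfold_indices(n, k=5, seed=0):
    """Shuffle n example indices and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(score_fold, n, k=5):
    """Evaluate score_fold on each held-out fold; report mean and std dev.

    score_fold(held_out) is a placeholder for: train on all indices not in
    held_out, then return the evaluation metric on held_out.
    """
    folds = kfold_indices(n, k)
    scores = [score_fold(held_out) for held_out in folds]
    return statistics.mean(scores), statistics.stdev(scores)
```

A low standard deviation returned by `cross_validate` is what Table 9 summarizes across the five folds.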
All ROUGE scores reported in the paper are mean F1 scores computed with bootstrap resampling using 1,000 samples. To assess the significance of the results, we report 95% confidence intervals for our model and the closest baseline in Table 10.
This paper is available on arXiv under a CC BY 4.0 DEED license.
[5] https://github.com/bheinzerling/pyrouge