Sometimes we have to deal with huge volumes of data and handle long texts. Automatic text summarization shortens long texts into coherent, fluent summaries that keep only the important aspects of the original. Within text summarization we can distinguish two techniques:
- Abstractive text summarization condenses a long text into a human-readable form that conveys its most important aspects by paraphrasing (restating the meaning of the text in other words).
- By contrast, extractive text summarization, the most widely used type, only retrieves parts of the source text: it selects the most important sentences without modifying them.
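To make the extractive idea concrete, here is a minimal sketch that scores each sentence by the average frequency of its words and returns the top-scoring sentences unchanged. The frequency heuristic, the tiny stop-word list and the function names are our own assumptions for illustration; production extractive systems use much richer scoring.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real systems use larger lists).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are", "it", "on", "that"}


def split_sentences(text):
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase word tokens with stop words removed."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]


def extractive_summary(text, k=1):
    """Return the k highest-scoring sentences, verbatim and in source order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]


text = "Cats sleep a lot. Cats and dogs are pets. The weather is nice today."
print(extractive_summary(text, k=2))
```

Note that the output consists only of sentences copied from the input, which is exactly what separates extraction from abstraction.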
In general, abstraction can condense a text more strongly than extraction, but abstractive summarization is harder to develop because it requires natural language generation technology. Moreover, existing abstractive approaches produce little actual abstraction in the sense of novel phrases: they tend to copy long passages of the source document directly into the summary.
In the scientific article “Improving Abstraction in Text Summarization”, the authors propose two techniques to improve the level of abstraction of generated summaries:
1. The decoder is decomposed into a contextual network that retrieves and compacts the relevant parts of the source document.
   - The decoder uses temporal attention over the encoded sequence, which penalizes input tokens that previously had high attention scores.
   - The decoder also attends to its previous states via intra-attention over the decoded sequence.
2. Fusion with a pretrained language model, which generates concise paraphrases.
   - The language model is fused by means of an element-wise multiplication in a gating function.
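The two mechanisms in the list above can be sketched in a few lines. This is a simplified illustration under our own assumptions, not the paper's implementation: temporal attention is written in one common formulation (each exponentiated score is divided by the sum of the exponentiated scores the same token received at earlier steps), and the gate logits are passed in directly, whereas in a real model they would come from learned weights. All function names are hypothetical.

```python
import math


def temporal_attention(score_history):
    """Attention weights for the latest decoding step.

    score_history[t][i] is the raw attention score of input token i at
    decoding step t. Tokens that scored highly at earlier steps are penalized.
    """
    current = score_history[-1]
    weights = []
    for i, raw in enumerate(current):
        numerator = math.exp(raw)
        # Exponentiated scores this token received at all earlier steps.
        past = sum(math.exp(step[i]) for step in score_history[:-1])
        weights.append(numerator / past if past > 0 else numerator)
    total = sum(weights)
    return [w / total for w in weights]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_lm_state(gate_logits, lm_state):
    """Element-wise product of a sigmoid gate and the language-model state.

    gate_logits is an assumed input here; a trained model would compute it.
    """
    return [sigmoid(g) * h for g, h in zip(gate_logits, lm_state)]


first = temporal_attention([[2.0, 0.0, 0.0]])
second = temporal_attention([[2.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
print(first[0], second[0])  # the weight of token 0 drops at the second step
```

With identical raw scores at both steps, the first token dominates at step one but loses its advantage at step two, which discourages the decoder from attending to the same part of the input repeatedly.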
Another contribution of the paper is a mixed training objective: it optimizes the n-gram overlap with the ground-truth summary, measured with ROUGE (Recall-Oriented Understudy for Gisting Evaluation), while encouraging abstraction through a novelty metric that rewards summary n-grams not found in the source document.
The results show ROUGE scores and human evaluations comparable with state-of-the-art models, together with a higher level of abstraction as measured by n-gram overlap with the source document.
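The two measurements discussed above, n-gram overlap with the ground-truth summary and n-gram overlap with the source document, can be sketched as follows. This is a simplified illustration: `rouge_n_recall` is a bare-bones ROUGE-N recall on whitespace tokens, while `novel_ngram_fraction` and the weighted `mixed_reward` are our own hypothetical stand-ins, not the exact metric or reward used in the paper.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())


def novel_ngram_fraction(summary, source, n=2):
    """Fraction of distinct summary n-grams that never occur in the source."""
    summary_grams = set(ngrams(summary.split(), n))
    if not summary_grams:
        return 0.0
    source_grams = set(ngrams(source.split(), n))
    return len(summary_grams - source_grams) / len(summary_grams)


def mixed_reward(summary, source, reference, weight=0.5):
    """Hypothetical weighted sum of reference overlap and novelty."""
    return (weight * rouge_n_recall(summary, reference)
            + (1 - weight) * novel_ngram_fraction(summary, source))


source = "the cat sat on the mat in the sun"
reference = "a cat rested on a mat"
copied = "the cat sat on the mat"
paraphrase = "a cat rested on the mat"
print(novel_ngram_fraction(copied, source), novel_ngram_fraction(paraphrase, source))
print(mixed_reward(copied, source, reference), mixed_reward(paraphrase, source, reference))
```

A summary copied verbatim from the source gets a novelty of zero, so under this kind of combined reward a paraphrase that still matches the reference is preferred over a copy.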
Pangeanic is working on automatic summarization through its technology division, PangeaMT, which produces summaries of innovative news and creates fintech solutions. PangeaMT also collects parallel data for summarization tasks.