Big Data Full-Text Search Index Minimization Using Text Summarization
Keywords: Big Data, Indexing, Searching, Index Minimization, Text Summarization
An efficient full-text search is achieved by indexing the raw data, at an additional storage cost of 20 to 30
percent. In the context of Big Data, this additional storage is substantial and makes it challenging to serve
full-text search queries with good performance. It also incurs overhead to store, manage, and update the large
index. In this paper, we propose and evaluate a method to minimize the index size needed to offer full-text search
over Big Data using an automatic extractive text summarization method. To evaluate the effectiveness
of the proposed approach, we used two real-world datasets. We indexed the original and summarized datasets using
Apache Lucene and studied average simple overlapping, Spearman’s rho correlation, and average ranking
score measures of search results obtained using different search queries. Our experimental evaluation shows
that automatic text summarization is an effective method to reduce the index size significantly. With the
proposed solution, we obtained a maximum of 82% reduction in index size with 42% higher relevance of the
search results.
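
To make the comparison of search results concrete, the following is a minimal Python sketch of two of the measures named above, simple overlap and Spearman's rho, applied to the ranked result lists returned by a full index and a summarized index for the same query. The exact definitions used in the paper are not given in this abstract, so the formulas here are common textbook forms and should be read as assumptions; the function names and the document identifiers in the usage note are hypothetical.

```python
from typing import Sequence


def simple_overlap(full: Sequence[str], summarized: Sequence[str], k: int) -> float:
    """Share of the top-k results that both indexes return (assumed definition)."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_full = set(full[:k])
    top_summarized = set(summarized[:k])
    return len(top_full & top_summarized) / k


def spearman_rho(full: Sequence[str], summarized: Sequence[str]) -> float:
    """Spearman's rho over the documents that appear in both ranked lists.

    Assumes each list holds distinct document IDs (no ties), so the
    closed form 1 - 6 * sum(d^2) / (n * (n^2 - 1)) applies.
    """
    in_summarized = set(summarized)
    shared = [doc for doc in full if doc in in_summarized]
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared documents")
    # Re-rank the shared documents within each list, starting from 0.
    rank_full = {doc: i for i, doc in enumerate(shared)}
    shared_in_summarized = [doc for doc in summarized if doc in rank_full]
    rank_summarized = {doc: i for i, doc in enumerate(shared_in_summarized)}
    d_squared = sum((rank_full[doc] - rank_summarized[doc]) ** 2 for doc in shared)
    return 1 - 6 * d_squared / (n * (n * n - 1))
```

For example, with the hypothetical lists `["d1", "d2", "d3", "d4", "d5"]` from the full index and `["d2", "d1", "d3", "d6", "d4"]` from the summarized index, four of the top five results are shared, and the shared documents differ only in the order of the first two. Averaging such per-query values over a set of queries would give an "average" form of each measure.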