Big Data Full-Text Search Index Minimization Using Text Summarization

Authors

  • Waheed Iqbal, Punjab University College of Information Technology (PUCIT), University of the Punjab, Lahore, Pakistan
  • Waqas Ilyas Malik, Punjab University College of Information Technology (PUCIT), University of the Punjab, Lahore, Pakistan
  • Faisal Bukhari, Punjab University College of Information Technology (PUCIT), University of the Punjab, Lahore, Pakistan
  • Khaled Mohamad Almustafa, College of Computer and Information Sciences, Prince Sultan University, Riyadh, Saudi Arabia
  • Zubair Nawaz, Punjab University College of Information Technology (PUCIT), University of the Punjab, Lahore, Pakistan

DOI:

https://doi.org/10.5755/j01.itc.50.2.25470

Keywords:

Big Data, Indexing, Searching, Index Minimization, Text Summarization

Abstract

Efficient full-text search is achieved by indexing the raw data, at an additional storage cost of 20 to 30
percent. In the context of Big Data, this additional storage is huge and makes it challenging to serve
full-text search queries with good performance. It also incurs overhead to store, manage, and update the large
index. In this paper, we propose and evaluate a method that minimizes the index size needed to offer full-text
search over Big Data by using automatic extractive text summarization. To evaluate the effectiveness
of the proposed approach, we used two real-world datasets. We indexed the actual and the summarized datasets using
Apache Lucene and studied the average simple overlapping, Spearman’s rho correlation, and average ranking
score measures of the search results obtained for different search queries. Our experimental evaluation shows
that automatic text summarization is an effective method for reducing the index size significantly. With the
proposed solution, we obtained a maximum of 82% reduction in index size with 42% higher relevance of the
search results.
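
The comparison described in the abstract can be illustrated with a minimal Python sketch. It assumes that each query returns a ranked list of document IDs, once from the index built on the actual data and once from the index built on the summaries. The function names, the top-k cutoff, and the exact formulas below are illustrative assumptions and are not taken from the paper, which may define its measures differently.

```python
def simple_overlap(full_hits, summary_hits, k=10):
    """Fraction of the top-k results from the full index that also
    appear in the top-k results from the summarized index.
    (Assumed definition, for illustration only.)"""
    top_full = full_hits[:k]
    top_summary = set(summary_hits[:k])
    if not top_full:
        return 0.0
    shared = sum(1 for doc in top_full if doc in top_summary)
    return shared / len(top_full)


def spearman_rho(full_hits, summary_hits):
    """Spearman's rho between the two rankings, restricted to the
    documents that occur in both lists. Assumes no tied ranks.
    Returns None when fewer than two documents are shared."""
    in_summary = set(summary_hits)
    common_full = [doc for doc in full_hits if doc in in_summary]
    n = len(common_full)
    if n < 2:
        return None
    # Re-rank the shared documents within each list (0-based positions).
    rank_full = {doc: i for i, doc in enumerate(common_full)}
    common_summary = [doc for doc in summary_hits if doc in rank_full]
    rank_summary = {doc: i for i, doc in enumerate(common_summary)}
    d_squared = sum((rank_full[doc] - rank_summary[doc]) ** 2
                    for doc in common_full)
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))


# Hypothetical result lists for a single query.
full = ["d1", "d2", "d3", "d4", "d5"]
summary = ["d2", "d1", "d3", "d6", "d5"]
overlap = simple_overlap(full, summary, k=5)
rho = spearman_rho(full, summary)
```

Averaging such per-query values over a set of queries would give aggregate figures of the kind the abstract reports, under the assumptions stated above.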

Published

2021-06-17

Issue

Section

Articles