Big Data Full-Text Search Index Minimization Using Text Summarization

An efficient full-text search is achieved by indexing the raw data at an additional storage cost of 20 to 30 percent. In the context of Big Data, this additional storage space is huge and makes it challenging to serve full-text search queries with good performance. It also incurs overhead to store, manage


Introduction
Recent advancements in and adoption of technology are contributing to exponential growth of digital data. For example, inexpensive and readily available network-enabled electronic devices (smartphones, laptops, personal computers, and tablets), social networks, and electronic healthcare gadgets are widely used and generate enormous amounts of data. This increasing growth of data poses special challenges to process, store, and analyze it [27]. To address these challenges, a new research area, namely Big Data, has recently emerged. One of the important research problems in Big Data is to provide efficient full-text search services on large datasets. A common technique to provide text search is data indexing [5,6]. Many data structures are used in practice for data indexing. One of the most widely used is the inverted index [6], a data structure based on the hash table. Each entry in the inverted index is a key-value pair, where the key is a term and the value is a list of identifiers of the documents containing the term. The set of all terms from a dataset is known as the term dictionary, and each corresponding list of document identifiers is known as a posting list. An inverted index after compression is roughly 20-30% of the actual size of the dataset.
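As an illustration, the key-value structure described above can be sketched in a few lines of Python. This is a toy in-memory index, not Lucene's compressed on-disk format, and the document IDs and text are made up:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build an in-memory inverted index: each key is a term, each value a
    sorted posting list of the IDs of documents containing that term."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # The keys form the term dictionary; the values are the posting lists.
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "big data search", 2: "full text search index", 3: "big index"}
index = build_inverted_index(docs)
# index["search"] -> [1, 2]; index["big"] -> [1, 3]
```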
Apache Lucene is the most widely used library for indexing data [24,29,32] to provide full-text search. It uses the inverted index data structure to provide efficient search capabilities. Furthermore, Lucene uses compression techniques to reduce the size of the index; still, the index size remains around 20% to 30% of the actual size of the data. To serve search queries, Lucene loads the inverted index into physical memory as a hash table containing the term dictionary and posting lists. For each query consisting of multiple words, Lucene identifies the corresponding posting lists, merges them, and ranks the documents to return as the search query result.
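The merge step for a multi-word query can be sketched as a posting list intersection. This is a simplified illustration with made-up terms and IDs; Lucene's actual query evaluation and scoring are far more elaborate:

```python
# Toy inverted index (term -> sorted posting list of document IDs);
# the terms and IDs are illustrative, not taken from a real dataset.
index = {
    "big": [1, 3],
    "data": [1],
    "search": [1, 2],
    "index": [2, 3],
}

def search(index, query):
    """Answer a multi-word query by intersecting the posting lists of its terms."""
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    if not postings:
        return []
    matched = set.intersection(*postings)
    # A real engine such as Lucene would now rank the matches by relevance
    # score; here we simply return the matching IDs in sorted order.
    return sorted(matched)

# search(index, "big index") -> [3]
```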
Lucene index generation is a time-consuming task, specifically for large datasets [17,40]. Figure 1 shows the profiling of indexing time for different dataset sizes using Lucene. The figure shows the actual and expected indexing times. The actual line is plotted after profiling indexing time for dataset sizes varying from 1 GB to 200 GB, whereas the expected line is obtained by fitting a line on small datasets varying from 1 GB to 10 GB. This shows that with increasing dataset size, the performance of Lucene degrades significantly. We advocate that a large dataset can be reduced to a smaller representative dataset for indexing to offer full-text search with better performance.

Figure 1
Apache Lucene index generation time profiling using different sizes of datasets.
Traditionally, a data index is minimized using posting list compression techniques [50,51]. Compression algorithms provide an effective reduction in space but introduce decompression overhead when serving queries, which drastically reduces search speed for a large index. We advocate reducing the actual dataset using an automatic text summarization method; posting list compression methods can then be used to further reduce the index size. Automatic text summarization [35] is a process that creates a summary of a text document by significantly reducing its size while retaining the important points of the document. In this paper, we investigate minimizing the index size of Big Data using an automatic text summarization method. To evaluate the effectiveness of this approach, we performed four different experiments using two datasets to study the average overlapping, average ranking score, and Spearman's rho correlation of search results for different search queries in comparison with the actual datasets.
The main contributions of this paper include:

•
We propose automatic extractive-based text summarization for Big Data index minimization for the full-text search problem.

•
We evaluate the effectiveness of the proposed method by studying the relevance and overlapping of search query results with the baseline datasets.

•
We study the effect of different text summarization threshold levels on data index minimization and search results.
The rest of this paper is organized as follows. Related work is discussed in Section 2. Commonly used Big Data tools for full-text search are discussed in Section 3. The proposed solution for index generation using automatic text summarization is presented in Section 4. The experimental evaluation setup is discussed in Section 5. Experimental results are presented in Section 6. Finally, the conclusion is drawn and future work is discussed in Section 7.

Related Work
There have been excellent efforts to develop tools, methods, and programming models to store, process, and analyze Big Data [3,31,41]. Full-text search on Big Data is a challenging and interesting problem which has recently gained attention, and many applications and domains use full-text search. For example, Cuggia et al. [9] developed a full-text search engine that uses clinical notes for identifying different diseases. Garcelon et al. [15] use full-text search to detect the family history of patients from a biomedical data warehouse. Hanauer et al. [19] developed a search query recommendation system which exploits query semantics using synonym variants of the query text to obtain the most relevant data from an electronic health record system. Abe et al. [1] present a high-speed full-text search engine for system log files; their solution automatically converts system log files into an efficiently searchable index and provides good performance for end-user full-text search. Wang et al. [44] use full-text search for large-scale cloud data center monitoring. Their proposed solution is based on a tree index structure and correlation methods to index the data and obtain relevant results.
Full-text search on Big Data is commonly achieved using data indexing and hashing methods. A comprehensive survey on Big Data indexing methods is reported by Gani et al. [13]. Zhu et al. [51] introduced sparse hashing for effectively searching high-dimensional data by reducing its dimensionality dynamically. He et al. [20] proposed and evaluated a deep learning solution using two convolution-based networks to offer efficient image-text retrieval. Wang et al. [43] present a survey on learning-to-hash algorithms and categorize them. The learning-to-hash method finds, for a given query, the data elements in the database whose distance to the query text is minimal.
There have been several efforts to minimize the search index using posting list compression techniques. For example, Zhang et al. [50] discussed inverted index compression for high-performance information retrieval systems by compressing the posting lists. Their work explained various posting list compression algorithms and proposed a solution for selecting the method that uses disk speed, cache size, and memory most effectively to improve search engine performance. Yan et al. [47] proposed reordering document IDs in the posting list for higher index compression; they presented a method to optimize posting list compression and query processing by optimally reordering the documents. Wang et al. [42] propose a new inverted list compression method based on multiple techniques, including fixed-bit encoding, in-block compression, dynamic data partitioning, and cache-aware optimization, to improve query performance. Another work by Claude et al. [7] introduced a new method to compress inverted indexes for applications requiring a full-text facility on a large repository of repetitive documents, such as version control systems. Wu et al. [46] reduce the indexing space and time by indexing blocks instead of individual records, minimizing the query processing time.
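A classic building block behind many of these compression schemes is gap (delta) encoding of the sorted posting list followed by variable-byte encoding of the gaps. The following is a simplified sketch of that combination, not an implementation of any of the cited methods; the document IDs are made up:

```python
def delta_encode(postings):
    """Gap-encode a sorted posting list: store differences between consecutive IDs."""
    gaps, prev = [], 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def varint(n):
    """Variable-byte encode a non-negative integer, 7 payload bits per byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

# Small gaps need fewer bytes than the raw document IDs, which is why
# document ID reordering (smaller gaps) improves compression:
gaps = delta_encode([824, 829, 215406])   # -> [824, 5, 214577]
encoded = b"".join(varint(g) for g in gaps)
```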
The relevance of full-text search results is important, and many studies have observed users' behavior towards the ranking of full-text search results. Most of these studies show that only a few top-ranked results are important from the user's perspective [22,23,37,38]. Bar-Ilan et al. [4] discussed different techniques used to correlate the rankings of search engine results. Their work applied different methods of comparison of the top 10 results using a specific set of queries and compared the ranked results returned by major search engines. Ghose et al. [16] present a study of ranking results obtained from a search engine based on consumer behavior and search engine revenue using a hierarchical Bayesian model. Williams et al. [45] use pseudo-relevance feedback recursively to improve search results for a given text query.
Automatic text summarization is a well-established research area [2,12] which reduces the size of a text significantly while retaining the important sentences and central idea of the given text. Two different methods exist for automatic text summarization, namely Extractive and Abstractive. In the Extractive method [28], important sentences are picked from the given text to generate a compact summary: sentences are first ranked according to their importance and assigned a relevance score, and the sentences with the highest scores are selected as the summary of the document. In the Abstractive method [14,25], Natural Language Processing (NLP) methods are used to generate the summary of a given text document, possibly with new vocabulary and sentences, similar to a human-generated summary. Recently, advanced techniques have been used to improve automatic summary generation; for example, [48] introduced deep learning for automatic text summarization.
To our knowledge, no work has investigated the use of text summarization methods for Big Data search index minimization. We take the first step in introducing text summarization to reduce the search index size significantly while maintaining high relevance of the search results.

Apache Lucene
Apache Lucene [32] is a high-performance open-source information retrieval library written in the Java programming language. It is primarily used for indexing and searching text data and provides fast indexing and searching capabilities for very large datasets. It can process roughly 150 GB/hour of data on recent hardware [10] with a heap consumption of only 1 MB. The index generated by Lucene is 20-30% of the actual dataset size. Besides simple indexing and searching functionality, Lucene also ranks the search results to show the most relevant results in descending order of relevance. These features make it appealing to build Big Data solutions on top of Lucene.

Apache Solr
Apache Solr [18] is a highly scalable enterprise search engine that uses Lucene for its indexing and searching functionality. Solr extends Lucene with features like rich document processing (including PDF, XML, and HTML), database integration, index replication, and load balancing for fault tolerance. Solr also provides distributed searching through the concept of Shards. Solr exposes REST-based XML/JSON APIs that make it easy to integrate with most programming languages. Solr exploits the fast searching capability of Lucene and ensures that documents are available for searching immediately after they are added for indexing.
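As a small illustration of the REST-based API mentioned above, a Solr select query is just an HTTP GET with URL parameters. The host, port, and core name (`mycore`) below are placeholders for a hypothetical local installation:

```python
from urllib.parse import urlencode

# Build a query against Solr's /select endpoint; "mycore" is a
# placeholder core name for a local installation.
base = "http://localhost:8983/solr/mycore/select"
params = {"q": "text:summarization", "wt": "json", "rows": 10}
url = base + "?" + urlencode(params)
# urllib.request.urlopen(url).read() would then return the top 10
# matching documents as JSON.
```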

Elasticsearch
Elasticsearch [26] is also an enterprise search engine. Like Solr, it uses Lucene for indexing and searching, and it is very similar to Solr in terms of functionality. Like Solr, Elasticsearch distributes the index by dividing it into Shards and maintains replicas of every Shard. Elasticsearch also provides a Gateway feature that allows recovery in case of a server crash. Elasticsearch supports NoSQL solutions, which makes it attractive to use as a database with Big Data applications; however, it does not support distributed transactions.

Cloudera Search
Cloudera Search uses the Hadoop Distributed File System (HDFS) for storing data indices to provide a near-real-time full-text search facility. It is based on Apache Solr and provides fast individual and batch indexing of text data. It works by indexing events (streamed by Flume) while they are being stored in HDFS: it first maps all events to a Solr schema and then uses Lucene to index them. Cloudera Search offers fault tolerance by leveraging HDFS and is easy to integrate with HBase to provide full-text search.

Sphinx
Sphinx is an open-source search engine written in C++ which uses native protocols to communicate with any Database Management System (DBMS). This allows Sphinx to directly index the data of any DBMS. It also works with NoSQL databases and allows users to index raw text data for use in their applications. Sphinx supports an RDBMS-like query style (WHERE, GROUP BY, etc.) and offers aggregate functions such as sum, average, minimum, and maximum. It also supports distributed searching and is very easy to integrate with any application.

Xapian
Xapian is an open-source search engine library written in C++. It is fast and highly scalable for searching text documents and can scale up to hundreds of millions of documents. Xapian has built-in support for probabilistic IR models for ranking results and allows the use of Boolean operators like AND and OR. Xapian supports transactions with a guarantee of data consistency in case of failure. Some of its important features are data updates, automatic spell correction, probabilistic ranking algorithms, and intuitive use of synonyms for the given text query.
Among all of these, Apache Lucene is the most powerful and mature, and the most popular among Big Data application developers for offering a full-text search facility. Moreover, the flexibility of Apache Lucene's APIs and the easy customization of its source code greatly help to integrate and implement our proposed index minimization method.

Proposed Index Minimization Using Text Summarization
Our proposed solution is based on automatic text summarization for Big Data index minimization to offer efficient full-text search. We explain the automatic text summarization method and the proposed system in the following subsections in turn.

Automatic Text Summarization Methods
Automatic text summarization is a process that creates a summary of a text document by significantly reducing its size while retaining the important points of the document. Two main methods exist for automatic text summarization [35], namely Extractive and Abstractive. In the Extractive method, important keywords and sentences from the original text are selected to create the summary. In the Abstractive method, natural language processing techniques are used to create a summary which looks closer to a human-generated summary of the document, but which may not use sentences and keywords from the original document.
Searching text documents relies heavily on the keywords present in the documents; therefore, in our context, the Extractive method is appropriate to prepare the summary of a document. We then index the summary to significantly reduce the index size. Most Extractive methods generate a summary by finding the similarity between sentences, assigning each a similarity score, and selecting the sentences with the highest scores. There are two commonly used approaches to prepare an extractive text summary, known as TextRank-based [30] and centroid-based [33]. The TextRank-based algorithm prepares a complete graph of sentences, where each sentence represents a vertex and edges represent the intersection score between two sentences. In this paper, we chose the centroid-based [33] algorithm to prepare the summaries of text documents, as this method performs better than TextRank-based summarization [11]. We explain the centroid-based text summarization method in the following subsection.

Centroid-based Text Summarization
The centroid-based algorithm identifies a set of keywords, labels them as the centroid, and then computes the cosine similarity [33,39] of the remaining text to the centroid. Many techniques exist to identify the set of centroid keywords for a document. For example, Cohen [8] proposed using n-gram statistics to identify the keywords, and Ramos [34] proposed using the term frequency-inverse document frequency (tf-idf) of keywords to prepare a set of important keywords for a document. Once the list of keywords is prepared, the cosine similarity score of each sentence with the centroid set of keywords is computed, and the sentences with the highest cosine similarity are picked. The number of selected sentences is defined by the user as a percentage of the text (the summary threshold) required to be part of the summary.
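The whole pipeline above (tf-idf centroid keywords, cosine similarity, threshold-bounded selection) can be sketched as follows. This is a minimal illustration, not the implementation from [33]: whitespace tokenization, a smoothed log(n/df + 1) idf variant, and the default keyword count are all simplifying assumptions.

```python
import math
from collections import Counter

def centroid_summary(sentences, tau=0.4, n_keywords=5):
    """Extractive centroid-based summary sketch.

    1. Score terms by tf-idf over the sentences (each sentence as a "doc").
    2. Take the top-scoring terms as the centroid keyword set.
    3. Rank sentences by cosine similarity to the centroid vector.
    4. Keep top sentences while staying under tau * original word count.
    """
    docs = [Counter(s.lower().split()) for s in sentences]
    n = len(docs)
    df = Counter(t for d in docs for t in d)
    tfidf = Counter()
    for d in docs:
        for t, tf in d.items():
            tfidf[t] += tf * math.log(n / df[t] + 1)  # smoothed idf variant
    centroid = dict(tfidf.most_common(n_keywords))

    def cosine(vec):
        dot = sum(vec[t] * w for t, w in centroid.items() if t in vec)
        na = math.sqrt(sum(v * v for v in vec.values()))
        nb = math.sqrt(sum(w * w for w in centroid.values()))
        return dot / (na * nb) if na and nb else 0.0

    ranked = sorted(range(n), key=lambda i: cosine(docs[i]), reverse=True)
    budget = tau * sum(len(s.split()) for s in sentences)
    picked, used = [], 0
    for i in ranked:
        words = len(sentences[i].split())
        if used + words <= budget:
            picked.append(i)
            used += words
    # Emit the selected sentences in original document order.
    return [sentences[i] for i in sorted(picked)]
```

For example, with three toy sentences and tau = 0.5, only the sentence closest to the centroid keywords fits the budget and is selected.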

Figure 2
Proposed Big Data index generation and query serving system using text summarization.

More formally, let d_i be a given document for which to generate a summary, let τ be a user-defined maximum summary size as a percentage of the given document, and let C = {k_1, k_2, ..., k_n} be the set of centroid keywords identified using the tf-idf measure. A term vector s_j is identified for each sentence, and the cosine similarity between each sentence vector and the centroid vector c is computed as

sim(s_j, C) = (s_j · c) / (||s_j|| ||c||).    (1)

This yields a similarity score for each sentence in the document. We then sort the sentences by their similarity scores and select the top sentences whose combined size stays below τ. We call τ the summary threshold; it defines the user's choice of the size of the summary to generate for the given text document.

Proposed System
Figure 2 shows our proposed system to index data using text summarization and to serve user queries. The data indexing process works by aggregating data from different data sources like websites, social networks, server logs, and smart devices. The data from these sources are collected as text documents and passed to the text summarization method, which generates summaries for all given documents. Then a preprocessing step is performed which uses NLP methods like stopword removal and stemming to filter insignificant data. Stopword removal drops frequent words like a, the, their, we, etc., and stemming reduces words to their roots, which greatly helps to minimize the vocabulary of the given documents. Once the preprocessing is done, the extracted important keywords are passed to the indexing library (Apache Lucene), which prepares an inverted index using the given keywords and document IDs. Once the inverted index is ready, users can invoke queries using the methods exposed by the indexing library. For the given queries, the indexing library identifies the related documents, sorts them by ranking score, and returns a list of documents to the users.
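The preprocessing step of the pipeline can be sketched as below. The stopword list and the suffix-stripping "stemmer" are deliberately crude illustrations, not a production NLP toolkit such as the one a real deployment would use:

```python
# Minimal preprocessing sketch for the pipeline in Figure 2; the stop list
# and suffix rules are illustrative assumptions.
STOPWORDS = {"a", "an", "the", "their", "we", "is", "are", "of", "to"}

def crude_stem(word):
    """Strip a few common English suffixes to approximate stemming."""
    for suffix in ("ing", "ers", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, drop stopwords, and stem; the output feeds the indexing library."""
    return [crude_stem(t) for t in text.lower().split() if t not in STOPWORDS]

# preprocess("The servers are indexing their log files")
#   -> ["serv", "index", "log", "file"]
```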

Experimental Evaluation
In this section, we explain the datasets, evaluation criteria, and experiments performed to evaluate the proposed method to index Big Data for full-text search applications. We performed all experiments using a Core i7 computer system with 8 GB of physical memory.

Datasets and Search Queries
We used two publicly available datasets, namely Wikipedia and Project Gutenberg, to generate their summaries and then indexed them using the proposed system to study the effect on index generation time and size. We also studied the overlapping and relevance of search results obtained on the summarized datasets against the actual datasets. The experimental datasets (actual and summarized) are briefly described in Table 1. Figure 3 shows the effect of different summary thresholds on index generation time and size for both datasets.
The actual Wikipedia dataset we used is 77 GB in size, consisting of 14.25 million HTML pages, and its search index size is 12 GB. We used nine different summarized versions of the Wikipedia dataset with different values of the summary threshold.

Table 1
Datasets with different summary thresholds.

Figure 3
Effect of different summary thresholds on index generation time and size for both datasets.

Evaluation Criteria
We used different measures to compare the summarized and the actual datasets for overlapping and relevance of the results. We used simple overlapping, document ranking scores, and Spearman's rho correlation to study the impact of the summarized dataset on search results. In this section, we explain the evaluation measures used in our experimental evaluation.

Simple Overlapping
To compare search results of the summarized dataset with the actual dataset, we use a simple overlapping measure.
The simple overlapping counts the search results returned by the same queries on both datasets (actual and summarized) for the Top 1, Top 5, Top 10, Top 15, and Top 20 search results. We calculate the simple overlapping (O) using the following formula:

O = Rs / R, (2)

where R is the number of documents returned by a query from the actual dataset and Rs is the number of intersection documents between the results returned for the query from the actual and the summarized dataset.
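As a concrete illustration, Equation 2 can be computed directly from two result lists; the document identifiers below are hypothetical.

```python
# Simple overlapping (Equation 2): O = |Rs| / |R|, where R is the result
# set from the actual dataset and Rs is its intersection with the result
# set from the summarized dataset.

def simple_overlapping(actual_results, summarized_results):
    R = set(actual_results)
    Rs = R & set(summarized_results)
    return len(Rs) / len(R) if R else 0.0

actual = [101, 102, 103, 104, 105]      # Top 5 from the actual index
summarized = [101, 103, 106, 104, 108]  # Top 5 from a summarized index
print(simple_overlapping(actual, summarized))  # 0.6 (3 of 5 overlap)
```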

Ranking Score
To compare the relevance of search results with the queries for both the summarized and the actual datasets, we use the ranking score assigned by Apache Lucene to each document in the search result. We compute the average ranking score of the Top 1, Top 5, Top 10, Top 15, and Top 20 search results. The weight of a term t in document di is computed using a tf-idf scheme:

w(t, di) = tf(t, di) × log(N / df(t)), (3)

where tf(t, di) is the term frequency for the given term t in document di; the remaining part of Equation 3 represents the inverse document frequency, with N the total number of documents and df(t) the number of documents containing t. The ranking score ℜ(di, q) of document di for the given query q is then computed using:

ℜ(di, q) = Σ t∈q w(t, di). (4)

Finally, ℜ(di, q) is associated with each document di and returned to the user in sorted order as a response to the search expression.
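A minimal sketch of this tf-idf scoring (Equations 3 and 4) is shown below. Note that Lucene's actual scorer additionally applies length normalization and boosts, which are omitted here.

```python
import math

def tf_idf_score(query_terms, doc_terms, all_docs):
    """Minimal tf-idf ranking in the spirit of Equations 3-4.
    Lucene's real scorer adds length normalization and boosts."""
    N = len(all_docs)
    score = 0.0
    for t in query_terms:
        tf = doc_terms.count(t)                     # tf(t, d)
        df = sum(1 for d in all_docs if t in d)     # df(t)
        if tf and df:
            score += tf * math.log(N / df)          # Equation 3, summed per Eq. 4
    return score

docs = [["big", "data", "index"], ["data", "summary"], ["index", "size"]]
query = ["data", "index"]
ranked = sorted(docs, key=lambda d: tf_idf_score(query, d, docs), reverse=True)
# ranked[0] is the document matching both query terms
```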

To consider the effect of different search queries on both the actual and summarized datasets, we used 200 different queries. For the Wikipedia actual and summarized datasets, we used 200 search queries randomly selected from a set of the 5000 most frequent search queries of the Wikipedia website. For the Project Gutenberg actual and summarized datasets, we used 200 randomly selected nouns from the dataset.

Spearman's rho Correlation
We used Spearman's rho [49] to find the correlation between the summarized and actual datasets for the Top 1, Top 5, Top 10, Top 15, and Top 20 search results. Spearman's rho works by finding the overlapping between two given sets. It ignores non-overlapping members of the sets and gives a higher score to higher-ranked overlapped results, computing a measure ranging between −1 and 1. The sign of the Spearman's rho value shows the direction of overlapping. Since, in our experimental evaluation, we need the absolute overlapping between two search results, we take the absolute value of the Spearman's rho measure. Spearman's rho (Sr) is computed using the following formula:

Sr = 1 − (6 Σ Di²) / (N (N² − 1)), (5)

where Di represents the difference of document ranking between the two sets of documents returned by the actual and summarized datasets for the i-th query and N is the total number of overlapped documents in both sets.
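A sketch of computing Equation 5 over two ranked result lists, ignoring non-overlapping documents as described, might look like the following (re-ranking within the overlap is one reasonable reading of the procedure):

```python
def spearman_rho(actual_ranked, summarized_ranked):
    """Spearman's rho (Equation 5) over the overlapping documents only.
    Documents are re-ranked within the overlap, in each list's order."""
    overlap = [d for d in actual_ranked if d in summarized_ranked]
    N = len(overlap)
    if N < 2:
        # rho is undefined for fewer than two points; treat a single
        # shared document as perfect agreement.
        return 1.0 if N == 1 else 0.0
    rank_actual = {d: i for i, d in enumerate(overlap)}
    rank_summ = {d: i for i, d in
                 enumerate(d for d in summarized_ranked if d in rank_actual)}
    d_sq = sum((rank_actual[d] - rank_summ[d]) ** 2 for d in overlap)
    return 1 - (6 * d_sq) / (N * (N ** 2 - 1))

# Identical rankings give rho = 1.0; swapping the top two of three
# overlapped documents gives rho = 0.5.
```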

Experimental Details
We performed three experiments to evaluate the effectiveness of the summarization method for minimizing index size for full-text search. In all three experiments, we used 200 randomly selected search queries for the Wikipedia and Project Gutenberg datasets. In each experiment, we use the actual datasets (without summarization) as a baseline and compare their results with those of the summarized datasets for the Top 1, Top 5, Top 10, Top 15, and Top 20 search results.
In Experiment 01, we compare and evaluate average simple overlapping, explained in Section 5.2.1, for the search results obtained on actual and summarized versions for both datasets. In Experiment 02 and Experiment 03, we compare and evaluate the average ranking score, explained in Section 5.2.2, and Spearman's rho measures, described in Section 5.2.3, respectively on the search results.

Experiment 1: Overlapping Results
Figure 4(a) shows the average overlapping results of the summarized datasets using different summary thresholds against the baseline Wikipedia dataset for 200 search queries. The overlapping measure shows that the summarized version of the Wikipedia dataset returns a minimum of 18% similar documents using a 10% summary threshold and a maximum of 66% similar documents using a 90% summary threshold. We observe that, as the summary threshold increases, the overlapping results also increase. However, the growth of the overlapping results slows down after a 40% summary threshold. Figure 4(b) shows the average overlapping results of the summarized datasets using different summary thresholds against the actual Project Gutenberg dataset for 200 search queries. The overlapping measure shows that the summarized version of the Project Gutenberg dataset returns a minimum of 52% similar documents using a 10% summary threshold and a maximum of 89% similar documents using a 90% summary threshold. We observe that, as the summary threshold increases, the overlapping results also increase. However, the growth of the overlapping results slows down after a 50% summary threshold.
This experimental result shows a correlation between the summary threshold and the overlapping results. If we use a higher value of the summary threshold, we find more results similar to those of the actual datasets. However, a large summary threshold does not help to decrease the index size significantly.

Experiment 2: Relevance of Search Results
Figure 5(a) shows the average ranking score of the summarized datasets using different summary thresholds and the actual Wikipedia dataset for 200 search queries. The summarized versions of the Wikipedia dataset always provide a higher ranking score compared to the actual dataset results. Moreover, the ranking scores of the non-overlapping results, which are part of the actual dataset results but missing from the summarized dataset results, are low. This shows that the non-overlapping results are not highly relevant to the search queries. The maximum score is obtained using a 10% summary threshold. Overall, the ranking score decreases as the value of the summary threshold increases. Figure 5(b) shows the average ranking score of the summarized datasets using different summary thresholds with the actual Project Gutenberg dataset. The summarized versions of the Project Gutenberg dataset likewise always provide a higher ranking score compared to the actual dataset results, and again the non-overlapping results have low ranking scores, showing that they are not highly relevant to the search queries. The maximum score is obtained using a 10% summary threshold, and the ranking score decreases as the summary threshold increases. Experiment 2 shows that summarized datasets yield higher ranking scores compared to the actual datasets. Moreover, non-overlapping results always have lower ranking scores, which shows less relevance to the search queries.

Experiment 3: Spearman's Rho Correlation
Figure 6(a) shows the average Spearman's rho correlation of ranking scores between the summarized and actual datasets for 200 search queries on the Wikipedia dataset. The Spearman's rho measure varies between 0.5 and 0.7 for most of the summary thresholds. However, we observed the minimum Spearman's rho for the Top 1 results obtained using the actual dataset and a 10% summary threshold. We observed the best results for the Top 5 documents using the Spearman's rho measure. There is no specific relationship with an increasing summary threshold, and we observed similar behavior for the Project Gutenberg dataset. This experiment shows that a good correlation, higher than 0.5, exists between search results obtained using the actual dataset and the summarized results using different summary thresholds.

Precision and Recall Using Actual and Summarized Dataset
We consider the search results obtained from the actual dataset as ground truth to compute the precision and recall of search results obtained using summarized datasets with different summary thresholds.
We compute precision and recall using the following formulas:

Precision = Rs / Ra, (6)

Recall = Rs / R, (7)

where R is the number of documents returned by a query on the actual dataset, Ra is the total number of documents returned by a query from the summarized dataset, and Rs is the number of similar documents between the results returned by a query from the actual and the summarized datasets. Figure 7 shows the recall, precision, and precision-recall graph of all 200 search queries executed on the actual Project Gutenberg and summarized datasets. The recall increases as the summary threshold increases. However, the average precision remains very high and close to 1. The high precision is justified because we use an extractive summarization method, which selects sentences rather than building new ones. Therefore, the summarized datasets are a subset of the actual dataset and yield results that are also a subset of the ground truth. Due to this behavior, we also observe precision and recall behavior similar to Figure 7 for the Wikipedia dataset.
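Equations 6 and 7 can be computed directly from the two result sets; a minimal sketch:

```python
def precision_recall(actual_results, summarized_results):
    """Equations 6-7: precision = |Rs|/|Ra|, recall = |Rs|/|R|."""
    R = set(actual_results)       # ground truth (actual dataset)
    Ra = set(summarized_results)  # returned by the summarized dataset
    Rs = R & Ra                   # shared documents
    precision = len(Rs) / len(Ra) if Ra else 0.0
    recall = len(Rs) / len(R) if R else 0.0
    return precision, recall

# Extractive summaries are subsets of the originals, so summarized
# results tend to be a subset of the ground truth -> precision near 1.
p, r = precision_recall({1, 2, 3, 4}, {1, 2})
print(p, r)  # 1.0 0.5
```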

Figure 7
Average precision and recall for different summary thresholds for the Project Gutenberg dataset. We consider results obtained using the actual dataset as ground truth.

Experimental Summary
We summarize our experimental evaluation in Table 2. It shows the decrease in index size and generation time, the improvement in ranking, the overlapping, and the Spearman's rho correlation using different summary thresholds for both datasets in comparison with the baseline datasets (without summarization). The proposed text summarization method yields higher ranks for the returned documents on both datasets compared to the baseline when using a smaller summary threshold. The index size and generation time decrease significantly at smaller summary thresholds. However, the overlap with the baseline results also decreases at smaller summary thresholds. We observed a good correlation of search results between all summarized datasets and the baseline.
We recommend using the centroid-based text summarization method with a 10% summary threshold to significantly reduce the index size and generation time. It also improves the relevance of the search results and provides a good ranking correlation with the actual dataset.
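The centroid-based extractive method referred to above can be illustrated with a minimal term-frequency sketch; the paper's actual scoring details may differ, and this only shows the select-top-sentences idea behind the summary threshold.

```python
from collections import Counter

def centroid_summary(sentences, threshold=0.1):
    """Minimal centroid-based extractive summarization sketch: score each
    sentence by its average term weight against the document's
    term-frequency centroid and keep the top `threshold` fraction of
    sentences, in their original order."""
    tokenized = [s.lower().split() for s in sentences]
    centroid = Counter(t for toks in tokenized for t in toks)
    scores = [sum(centroid[t] for t in toks) / max(len(toks), 1)
              for toks in tokenized]
    k = max(1, round(threshold * len(sentences)))
    keep = sorted(sorted(range(len(sentences)), key=lambda i: -scores[i])[:k])
    return [sentences[i] for i in keep]
```

With a 10% threshold, roughly one sentence in ten survives, which is why the index built from the summaries shrinks so sharply.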
In the proposed system, the process of creating a summary for each document introduces CPU time overhead. Due to the linear execution time complexity of the extractive text summarization algorithm, this overhead is minimal [36].
An online system, which creates a summary for each new text document instantly, will not introduce any notable performance issue. However, for a batch processing system, in which summaries must be generated for a large dataset, we recommend using Hadoop or Spark [21] to process the dataset in parallel and reduce the overhead of creating the summaries.
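Per-document summarization is embarrassingly parallel, which is why a map-style framework such as Hadoop or Spark fits the batch case. The sketch below shows the same map shape with a plain Python thread pool; `summarize` here is a toy stand-in for the real summarization method, not the paper's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def summarize(doc):
    # Toy stand-in: keep the leading ~10% of sentences.
    sentences = doc.split(". ")
    k = max(1, len(sentences) // 10)
    return ". ".join(sentences[:k])

def summarize_batch(docs, workers=4):
    # Each document is summarized independently, so the batch maps
    # cleanly onto a worker pool (or a Spark/Hadoop map stage).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(summarize, docs))
```

In Spark, the same shape would be a single `map` over an RDD or DataFrame of documents.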

Conclusion and Future Work
Providing efficient full-text search services for a large dataset is an interesting research topic. In this paper, we used an automatic text summarization method based on an extractive approach to reduce the index size of large datasets. Our experimental evaluation shows a maximum reduction of 82% in index size and 80% in index generation time when using the text summarization method with a 10% summary threshold. The relevance of search results obtained from the summarized datasets is higher than that of the baseline datasets. Moreover, the correlation between search results is good. However, the best overlapping result is 54% using the Project Gutenberg dataset. Automatic text summarization is an effective method to reduce the index size significantly while improving the relevance of the search results.
We are currently identifying the best threshold for the extractive text summarization method to improve the overlapping results with the actual dataset. Moreover, we are planning to build a cloud-based application using Apache Lucene to provide full-text search index minimization services.