Domain Independent Automatic Labeling System for Large-scale Social Data using Lexicon and Web-based Augmentation


  • Shaheen Khatoon, King Faisal University
  • Lamis Abu Romman
  • Md Maruf Hasan



Information retrieval, Intelligent decision making, Unsupervised learning


With the large-scale adoption of social media, people increasingly express their opinions on these sites in the form of reviews. Potential consumers are often forced to wade through a huge number of reviews to make an informed decision. Sentiment analysis has become a rapid and effective way to automatically gauge consumers’ opinions. However, such analysis often requires the tedious process of manually tagging a large number of training examples, or of manually building a lexicon, in order to classify reviews as positive or negative. In this paper, we present a method to automate the labeling of large textual datasets in an unsupervised, domain-independent, and scalable manner. The proposed method combines lexicon-based and Web-based Pointwise Mutual Information (PMI) statistics to find the Semantic Orientation (SO) of the opinion expressed in a review. Based on the proposed method, a system called the Domain Independent Automatic Labeling System (DIALS) has been implemented, which takes a collection of text from any domain as input and generates a fully labeled dataset. The generated labels can be used to track and summarize online discussions and/or to train a classifier in a later stage of development. The effectiveness of the system is tested by comparing it with baseline machine learning and lexicon-based methods. Experiments on multi-domain datasets show that the proposed method consistently achieves better recall and accuracy than these baselines.
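
To make the idea of combining a lexicon with PMI statistics concrete, the following is a minimal sketch and not the DIALS implementation. The abstract does not say how the two signals are combined, so the sketch assumes a simple scheme: a word's orientation is taken from the lexicon when the word is listed there, and otherwise estimated with the widely used SO-PMI formulation, which compares how often the word co-occurs with a positive seed word and with a negative seed word. The seed words ("excellent", "poor"), the toy lexicon, the hit counts, and all function names are hypothetical; a real system would obtain the counts from a Web search engine or a large corpus.

```python
import math

# Hypothetical toy lexicon (word -> polarity); not taken from the paper.
LEXICON = {"great": 1.0, "terrible": -1.0}

# Hypothetical hit counts standing in for Web search statistics.
# Single-string keys are counts for a seed word on its own;
# tuple keys are co-occurrence counts of (word, seed word).
HITS = {
    "excellent": 1000, "poor": 1000,
    ("sturdy", "excellent"): 80, ("sturdy", "poor"): 20,
    ("flimsy", "excellent"): 10, ("flimsy", "poor"): 40,
}

def so_pmi(word, hits, pos="excellent", neg="poor", smooth=0.01):
    """SO(word) = PMI(word, pos) - PMI(word, neg), computed from hit counts.

    The count of `word` itself cancels out of the difference, so only the
    co-occurrence counts and the seed-word counts are needed. `smooth`
    avoids division by zero for unseen pairs (an assumed value).
    """
    near_pos = hits.get((word, pos), 0) + smooth
    near_neg = hits.get((word, neg), 0) + smooth
    return math.log2((near_pos * hits[neg]) / (near_neg * hits[pos]))

def word_orientation(word, lexicon, hits):
    """Assumed combination rule: lexicon first, PMI estimate as fallback."""
    if word in lexicon:
        return lexicon[word]
    return so_pmi(word, hits)

def label_review(text, lexicon=LEXICON, hits=HITS):
    """Label a review by the sign of its summed word orientations."""
    tokens = text.lower().split()
    score = sum(word_orientation(t, lexicon, hits) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

With these toy counts, label_review("great sturdy case") yields "positive" and label_review("terrible flimsy case") yields "negative". A word with no co-occurrence evidence, such as "case", contributes a score of zero. The lexicon-first rule is only one plausible design; a weighted sum of the two scores would be an equally reasonable reading of "combines".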