International Journal of Advances in Computer Science and Its Applications
Author(s): ADNAN RASHID HUSSAIN, MOHD ABDUL HAMEED, S. FOUZIA SAYEEDUNNISSA
With the rapid growth of reviews, ratings, recommendations and other forms of online expression, online opinion has turned into a kind of virtual currency for businesses looking to market their products, identify new opportunities and manage their reputations. Sentiment analysis extracts, identifies and measures the sentiment or opinion expressed in documents, as well as in the topics within those documents. The Naïve Bayes algorithm performs a binary classification, i.e., it classifies a document as either positive or negative according to its sentiment. Sayeedunnisa et al. have shown that Naïve Bayes trained on high-value features extracted from a bag-of-words model yields an accuracy of 89.2%. This paper studies the application of the Naïve Bayes technique to sentiment analysis, adding bigram features to the training set to improve the accuracy and overall performance of the classifier. We also evaluate the impact of selecting low- vs. high-value features, scored using the concept of Information Gain. Our dataset consists of tweets containing movie reviews retrieved from the Twitter social network, which were obtained and analyzed on a cloud computing platform. Our experiment is divided into three steps. In the first step, we select high-value features (words) from our bag-of-words model. In the second step, we identify and calculate the probability of co-occurrence of words within the bag-of-words to derive a set of bigrams, and then use this set together with the original features to re-train and test our classifier. In the final step, we select the most informative features (unigrams + bigrams) using a Chi-Square scoring function, which yields the best result, with accuracy at 98.2%, positive precision 98%, positive recall 98.4% and negative recall 98%.
The results show that Naïve Bayes performs best when trained only on the most informative (high-value) features, which consist of both unigrams and bigrams.
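The three steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the four-tweet toy corpus and the add-one smoothing are all assumptions made for illustration. As a simplification, every adjacent word pair is treated as a candidate bigram, and the Chi-Square score alone decides which unigrams and bigrams are kept.

```python
import math
from collections import Counter

def features(tokens):
    """Unigrams plus adjacent-word bigrams (as tuples) for one tokenized tweet."""
    return set(tokens) | set(zip(tokens, tokens[1:]))

def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 table.
    n11/n10: positive/negative docs containing the feature,
    n01/n00: positive/negative docs not containing it."""
    n = n11 + n10 + n01 + n00
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return n * (n11 * n00 - n10 * n01) ** 2 / den if den else 0.0

def select_features(docs, k):
    """Return the k features with the highest chi-square score.
    docs is a list of (tokens, label) pairs, label in {"pos", "neg"}."""
    n_pos = sum(1 for _, y in docs if y == "pos")
    n_neg = len(docs) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for tokens, y in docs:
        (pos_df if y == "pos" else neg_df).update(features(tokens))
    scores = {}
    for f in set(pos_df) | set(neg_df):
        a, b = pos_df[f], neg_df[f]
        scores[f] = chi_square(a, b, n_pos - a, n_neg - b)
    # Ties are broken by the feature's string form to keep the result deterministic.
    return set(sorted(scores, key=lambda f: (-scores[f], str(f)))[:k])

def train(docs, selected):
    """Count selected features per class (assumes both classes occur in docs)."""
    counts = {"pos": Counter(), "neg": Counter()}
    n_docs = Counter()
    for tokens, y in docs:
        n_docs[y] += 1
        counts[y].update(features(tokens) & selected)
    return counts, n_docs

def classify(tokens, model, selected):
    """Naive Bayes decision with add-one smoothing over the selected features."""
    counts, n_docs = model
    total = sum(n_docs.values())
    best_label, best_lp = None, -math.inf
    for y in ("pos", "neg"):
        lp = math.log(n_docs[y] / total)
        denom = sum(counts[y].values()) + len(selected)
        for f in features(tokens) & selected:
            lp += math.log((counts[y][f] + 1) / denom)
        if lp > best_lp:
            best_label, best_lp = y, lp
    return best_label

# Hypothetical toy corpus, not data from the paper.
docs = [
    ("loved this movie great acting".split(), "pos"),
    ("great story really loved it".split(), "pos"),
    ("not good waste of time".split(), "neg"),
    ("boring plot not good at all".split(), "neg"),
]
selected = select_features(docs, 5)
model = train(docs, selected)
print(classify("not good".split(), model, selected))
```

On this toy corpus the bigram ("not", "good") is among the five selected features, which shows how a bigram can carry sentiment that the unigram "good" alone would not. A real experiment would need tokenization, a held-out test set and a much larger corpus.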