Introduction
With the advancement of Internet technologies, platforms such as social media, e-commerce, and movie streaming services directly reach millions of individuals. On such platforms, people express and share their emotions, observations, and opinions as text about topics, products, services, etc. Sentiment analysis (El Alaoui et al., 2018; Fang & Zhan, 2015; Liu, 2012; Medhat, Hassan, & Korashy, 2014; Pang, Lee, & others, 2008; Pang, Lee, & Vaithyanathan, 2002; Pouransari & Ghili, 2014; Ravi & Ravi, 2015) is performed to elicit the sentiment orientation (i.e. positive or negative) of shared textual information, which can enhance the decision making of governments, product designers, political organizations, marketing organizations, etc. Thus, there is a strong need for efficient sentiment analysis approaches. In sentiment analysis, Machine Learning (ML) algorithms have been extensively exploited owing to their ability to achieve admirable performance (Dhanani, Mehta, Rana, & Tidke, 2018; Pang et al., 2002; Parikh, Palusa, Kasthuri, Mehta, & Rana, 2018; Xia, Wang, Hu, Li, & Zong, 2013; Yin, Wang, & Zheng, 2012; Zhang, Xu, Su, & Xu, 2015), where efficient feature extraction is a vital requirement.
Word embedding techniques extract textual features by transforming raw words into real-valued vectors. Word2vec is a prominent word embedding technique that learns deep and implicit semantic information among word vectors (also called feature vectors, Word2vec embeddings, or word embeddings) (Mikolov, Chen, Corrado, & Dean, 2013a; Mikolov, Sutskever, Chen, Corrado, & Dean, 2013b). However, Word2vec fails to encode sufficient sentiment information into the feature vectors (Dhanani et al., 2018; Parikh et al., 2018; Tang et al., 2016, 2014; Yu, Wang, Lai, & Zhang, 2017). As a consequence, semantically similar words such as “good” and “bad” are placed close to each other, even though their sentiment orientations are opposite: “good” is sentimentally positive while “bad” is sentimentally negative. Hence, feature vectors that capture only semantic information can degrade sentiment classification performance. In addition, real-life applications yield big sized textual data with a large vocabulary (i.e. many unique words in the text corpus) (Dhanani et al., 2018; Ordentlich et al., 2016). Word2vec suffers from scalability issues because the large vocabulary and its associated vectors must be accommodated and computed in memory (Dhanani et al., 2018; Ordentlich et al., 2016). For such big sized textual data, Word2vec demands huge memory and computing capability to achieve acceptable learning latency.
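The “good”/“bad” problem above can be illustrated with a toy cosine-similarity check. The vectors below are hypothetical, hand-picked values rather than trained embeddings; they merely mimic how co-occurrence-based training places two words that share contexts (e.g. “the movie was ___”) close together regardless of polarity.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings: "good" and "bad" occur in nearly the
# same contexts, so purely semantic training yields nearly
# parallel vectors despite opposite sentiment orientations.
good = [0.92, 0.31, 0.18]
bad = [0.90, 0.35, 0.20]

print(round(cosine_similarity(good, bad), 3))
```

A downstream sentiment classifier that relies on such vectors receives almost identical inputs for the two words, which is exactly the degradation the paragraph above describes.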
Many recent works have attempted to solve the scalability issues by implementing Word2vec in a distributed environment (Apache Spark based Word2vec, n.d.; Dhanani et al., 2018; Ji, Satish, Li, & Dubey, 2016; Ordentlich et al., 2016). However, they are limited to learning semantic-specific Word2vec feature vectors. In contrast, several recent studies have focused on encoding sentiment information into the Word2vec feature vectors using prior sentiment knowledge (Parikh et al., 2018; Rezaeinia, Ghodsi, & Rahmani, 2017; Yu et al., 2017; Zhang et al., 2015). However, learning sentiment-specific Word2vec feature vectors (i.e. vectors that preserve both semantic and sentiment information) is computationally expensive for big sized textual data with a large vocabulary. Existing sentiment analysis approaches that learn sentiment and semantic specific feature vectors for big sized textual data therefore face scalability issues (Dhanani et al., 2018; Parikh et al., 2018). To overcome these challenges, this research proposes a novel sentiment weighted word embedding approach using a sentiment dictionary and a distributed MapReduce environment.
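One simple way prior sentiment knowledge can be injected into word vectors is sketched below. This is an illustrative sketch only, not the paper's exact algorithm: the function name `sentiment_weight`, the appended sentiment dimension, and the `alpha` parameter are all assumptions made for the example. A dictionary polarity score in [-1, +1] is scaled and appended to each vector, so sentimentally opposite words separate even when their semantic dimensions nearly coincide.

```python
def sentiment_weight(vectors, sentiment_dict, alpha=0.3):
    """Refine word vectors with prior sentiment knowledge (sketch).

    Each word found in the sentiment dictionary gets an extra
    dimension proportional to its polarity score, pushing positive
    and negative words apart in the vector space. `alpha` controls
    how strongly sentiment modifies the purely semantic geometry.
    """
    refined = {}
    for word, vec in vectors.items():
        polarity = sentiment_dict.get(word, 0.0)  # -1..+1; 0 if unknown
        refined[word] = vec + [alpha * polarity]
    return refined

# Toy usage: semantically close but sentimentally opposite words.
vectors = {"good": [0.92, 0.31], "bad": [0.90, 0.35]}
lexicon = {"good": 1.0, "bad": -1.0}
refined = sentiment_weight(vectors, lexicon, alpha=0.5)
# "good" -> [0.92, 0.31, 0.5], "bad" -> [0.90, 0.35, -0.5]
```

Because the per-word update touches only one vector and one dictionary lookup, a refinement of this shape parallelizes naturally over vocabulary partitions, which is what makes a MapReduce-style distributed implementation attractive for large vocabularies.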