Introduction
The fields of opinion mining (Mohsen, Hassan, & Idrees, 2016; Mohsen, Hassan, & Idrees, 2016) and emotion analysis (Othman, Hassan, Moawad, & Idrees, 2018; Othman, Hassan, Moawad, & Idrees, 2016) have received considerable attention recently, as their impact has been demonstrated in different fields of business, including the educational field (Idrees & Hassan, 2018; Khedr, Kholeif, & Hessen, March 2015; Khedr & Idrees, 2017; Khedr, Kholeif, & Hessen, April 2015; Khedr & Idrees, 2017), the agricultural field (Hassan, Dahab, Bahnassy, Idrees, & Gamal, 2015; Hassan, Dahab, Bahnasy, Idrees, & Gamal, 2014), the health field (Hazman & Idrees, 2015), and the business intelligence field (Badawy, Abd El-Aziz, Idress, Hefny, & Hossam, 2016; Helmy, Khedr, Kolief, & Haggag, 2019; Idrees, 2015; Khedr, Abdel-Fattah, & Nagm-Aldeen, 2015). Successfully processing large volumes of text data that express opinions is not a trivial task. It requires complex analysis to identify positive or negative opinions and emotions about a specific topic or product. This section introduces the main definitions related to this field.
Natural language processing is a computer science domain whose goal is to process and understand human language (Dahab, Idrees, Hassan, & Rafea, 2010). This is done by extracting the text from documents and then parsing it automatically through a defined sequence of steps to determine its meaning. The first step, which is considered the main step and has a high impact on the accuracy of the results, is “input preprocessing”. Input preprocessing means preparing the input text for the understanding step (El Seddawy, Sultan, & Khedr, 2013; Mostafa, Khedr, & Abdo, 2017). This preparation includes the following tasks:
1. Tokenization: Parses the text by splitting it into a sequence of sentences and then splitting each sentence into a sequence of tokens. Punctuation and white space are removed in this step. The output of this step is a set of tokens;
2. Filtration: Removes stop words such as (the, an, a, etc.), which contribute little to the meaning of the text;
3. Lemmatization: Converts a token to its base form, for example (walks, walked) to (walk);
4. Stemming: Replaces a token with its stem, for example replacing the token “interesting” with “interest”;
5. Part of speech “POS” tagging: Determines the type of each word and tags it accordingly as a verb, noun, adverb, adjective, etc.
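The preprocessing tasks listed above can be illustrated with a minimal Python sketch that uses only the standard library. It is not the implementation used in the cited works. The stop-word list, lemma table, suffix list, and POS lexicon are tiny hypothetical stand-ins for the full resources a real system would need, and the stemmer is a naive suffix stripper, not an established algorithm such as Porter's. All function names are assumptions made for this illustration.

```python
import re

# Hypothetical, deliberately tiny resources used for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "and", "of", "to"}
LEMMA_TABLE = {"walks": "walk", "walked": "walk", "was": "be"}
SUFFIXES = ("ing", "ed", "s")  # naive suffix list, not the Porter algorithm
POS_LEXICON = {"walk": "VERB", "movie": "NOUN", "interesting": "ADJ"}


def split_sentences(text):
    """Split text into sentences after '.', '!' or '?' followed by white space."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic runs only,
    which drops punctuation and white space."""
    return re.findall(r"[a-z]+", sentence.lower())


def remove_stop_words(tokens):
    """Filtration: drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def lemmatize(token):
    """Look the token up in the lemma table; unknown tokens are unchanged."""
    return LEMMA_TABLE.get(token, token)


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def pos_tag(token):
    """Toy lexicon lookup; words missing from the lexicon are tagged 'UNK'."""
    return POS_LEXICON.get(token, "UNK")


def preprocess(text):
    """Return, per sentence, a list of (token, lemma, stem, tag) tuples."""
    result = []
    for sentence in split_sentences(text):
        tokens = remove_stop_words(tokenize(sentence))
        result.append([(t, lemmatize(t), stem(t), pos_tag(t)) for t in tokens])
    return result


if __name__ == "__main__":
    for row in preprocess("The movie was interesting. She walked to the cinema!"):
        print(row)
```

The sketch reports the lemma and the stem side by side so the two reductions can be compared. A real pipeline would normally rely on a full lexicon or a trained tagger instead of these lookup tables.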