Opinion target extraction, also called aspect extraction, is a key task of aspect-based sentiment analysis. It focuses on extracting the words or phrases about which a user has expressed an opinion. Although opinion target extraction has been studied extensively for English, with notable work in other languages such as Chinese and Arabic, many regional languages have been neglected. One reason is the lack of resources and available texts for these languages. Urdu is one such language, with millions of native and non-native speakers across the globe. This paper focuses on identifying opinion targets in written Urdu text. To accomplish this task, several syntactic rules are crafted to identify users’ opinions and the associated target words. These rules are crafted using the grammatical and linguistic context of the words in the sentence. To the best of our knowledge, there is no existing work on opinion target extraction in the Urdu domain. The proposed methodology is evaluated on an Urdu language dataset and compared with an existing approach for the English language by applying that technique to the same data. The experiments demonstrate that the proposed approach achieves promising performance compared with the applied English language approach.
Sentiment analysis, or opinion mining, deals with users’ sentiments or opinions, which portray their feelings towards a specific entity. This entity could be an object, an organization, a service, a person, etc. These opinions are expressed as comments on discussion forums, social media, and merchants’ or manufacturers’ websites (in the form of online reviews). The core task of sentiment analysis is to identify users’ opinions or sentiments expressed towards some entity and classify them as positive, negative or neutral. To do this, different granularity levels have been explored, including document level, sentence level, and aspect level. While the document and sentence levels emphasize overall sentiment scoring, the aspect level not only focuses on users’ opinions but also identifies the targets of these opinions [
Even though a large amount of opinionated information is uploaded every day in distinct local and regional languages on the World Wide Web (WWW), current research has focused more on the English, Arabic and Chinese languages. Meanwhile, other local and regional languages are considered resource-poor for sentiment analysis [
Urdu is an Indo-Aryan language that utilizes Arabic and Persian scripts. It is the world’s 21st most spoken language, with approximately 104 million speakers worldwide. Urdu is also known as the “Lashkari” language, a mixture of many languages, which makes the task of sentiment analysis more difficult and challenging. It is the most frequently used language in Pakistan and is also used in other countries where people from Pakistan live. Existing work for the text classification in Urdu language [
To the best of our knowledge, no work on the Urdu language has focused on extracting opinion targets from written text. Almost all existing approaches emphasize identifying users’ opinions and classifying the extracted opinions as positive, negative or neutral. This study focuses on identifying opinions and their targets in the Urdu language domain. A rule-based approach is proposed, which incorporates opinion lexicons to identify potential aspects. The proposed methodology starts with pre-processing of the dataset, which includes data cleaning, sentence boundary identification, tokenization and part-of-speech (POS) tagging. Since limited resources are available for the Urdu language, we have utilized available resources or defined our own methodology to accomplish the pre-processing. Once the data is clean, several linguistic rules are crafted to identify opinions and associated aspects. Finally, these rules are applied to the Urdu language dataset. Opinion words are identified with the help of Urdu opinion lexicons, and linguistic rules are used along with the distance between opinions and their target words to extract aspects. To evaluate the effectiveness of the proposed approach for aspect extraction, the results are compared with the technique proposed by Hu et al. [
The main contributions of this paper are as follows: (1) syntactic rules are crafted to identify opinion targets, and (2) opinion targets are extracted from Urdu text based on these syntactic rules.
The remainder of the paper is organized as follows. Section 2 reviews the related work. Section 3 describes the proposed methodology for opinion target extraction. Section 4 presents the results, and Section 5 concludes the paper.
Aspect extraction in sentiment analysis has been studied extensively in the domain of the English language. The initial efforts in this regard were made by Hu et al. [
Dependency parser-based approaches have also been studied for opinion target extraction. Qiu et al. [
Alongside the extensive effort in the English language domain, several studies have focused on the Chinese language domain. Wang et al. [
Rehman et al. [
Awais et al. [
Bilal et al. [
Mukhtar et al. [
Javed et al. [
Kanwal et al. [
Sentiment analysis in Roman Urdu has also been explored by researchers in recent years. However, this work has mainly addressed the normalization of text and the classification of sentiments. No study in Roman Urdu focuses on the extraction of opinion targets. Only a few techniques have tried to classify a sentence after normalization of words. Khan et al. [
The proposed research utilizes a lexicon-based strategy for the assessment of Urdu sentiments that operates at the phrase level to extract the targets of opinion words in a sentence. The proposed solution starts with pre-processing of the input dataset. The sentences are tagged using a part-of-speech (POS) tagger available for the Urdu language. The next step is to identify sentence boundaries, followed by tokenization of each sentence into words. Thereafter, manually crafted syntactic rules are utilized for the identification of opinion targets. For opinion identification, an Urdu opinion lexicon is used, which contains positive and negative opinion words. Subsequently, the candidate aspects are ranked by their distance from the opinion words, and the aspect with the minimum distance is selected as the target of the extracted opinion. The complete workflow of the proposed approach is elaborated in
The first task is to eliminate unnecessary elements from the raw data, such as punctuation marks, links, and special characters (e.g., !, ?, @, /, :, *, (, )) used in sentences. However, the full stop “.” is removed only if it appears within digits (e.g., 10.1). Full stops within the text are retained for sentence boundary identification.
The following is an example of an Urdu sentence:
! اس ٹیبلیٹ کی خاص@ بات یہ ہے کہ اس میں سکرین کا سائز 10.1 انچ سے بڑھا کر 12 انچ کر دیا گیا ہے
(The highlight of this tablet is that the screen size has been increased from 10.1 inches to 12 inches.)
After special character removal the sentence would be:
اس ٹیبلیٹ کی خاص بات یہ ہے کہ اس میں سکرین کا سائز انچ سے بڑھا کرانچ کر دیا گیا ہے
(The highlight of this tablet is that the screen size has been increased from 10.1 inches to 12 inches.)
Removing special characters is required because they can lead to an incorrect interpretation of the sentence. Although the size of the screen has been eliminated in the above example, this does not affect the overall structure of the sentence or the information expressed in it. Due to the lack of resources for text analysis in the Urdu language, the text must be in a clean form for better information extraction.
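The cleaning step can be illustrated with a minimal Python sketch. The function name `clean_text`, the exact character list, and the decision to drop whole numbers (as happens in the worked example above) are assumptions made for illustration, not the exact rules used in this work.

```python
import re

# Assumed patterns; the paper does not list the exact rules.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
# Numbers, including decimals such as 10.1, so that the full stop
# inside digits is never mistaken for a sentence boundary.
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)*")
# Special characters to strip; the full stop "." is deliberately kept.
SPECIAL_CHARS = "!?@/:*()[]{}#$%^&+=<>,;\"'،؟؛-"
SPECIAL_PATTERN = re.compile("[" + re.escape(SPECIAL_CHARS) + "]")


def clean_text(text: str) -> str:
    """Remove links, numbers and special characters, keeping full stops."""
    text = URL_PATTERN.sub(" ", text)
    text = NUMBER_PATTERN.sub(" ", text)
    text = SPECIAL_PATTERN.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())
```

Links are removed first so that the characters inside a URL are not partially stripped by the later patterns.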
Sentence boundary identification (also known as sentence breaking, sentence boundary detection, and phrase segmentation) is applied to identify each sentence. Since this research focuses on sentiment analysis of Urdu text, a separate unit “.” is placed at the end of each sentence (if not already there) to separate it from other sentences. After sentence boundary identification, the sentences are separated and a line number is assigned to each sentence.
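A minimal sketch of this step is given below, assuming the text has already been cleaned so that no decimal points remain. The function name `split_sentences` is hypothetical, and treating the Urdu full stop “۔” (U+06D4) as a boundary in addition to “.” is an assumption.

```python
import re
from typing import List, Tuple

# Assumed boundary characters: the ASCII full stop and the Urdu full stop.
BOUNDARY = re.compile(r"[.\u06D4]")


def split_sentences(text: str) -> List[Tuple[int, str]]:
    """Split text into sentences, end each with a separate "." unit,
    and assign a line number starting from 1."""
    parts = [part.strip() for part in BOUNDARY.split(text)]
    non_empty = [part for part in parts if part]
    return [(number, part + " .") for number, part in enumerate(non_empty, start=1)]
```

A final sentence without a closing full stop still receives the “.” unit, which matches the "if not already there" condition described above.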
Consider the example before sentence boundary identification in
After sentence boundary identification, the next task is tokenization. Tokenization is the process of splitting a string into units such as words, keywords, phrases, symbols and other elements (called tokens). In this work, tokenization is performed on the basis of the space between two terms. Unlike in English, letters in Urdu text are joined to one another to form a word. For example, “خوشی” (happiness) has four letters that are joined to form a single word, while the space between words marks where each word starts and ends. Considering only the letter positions in the sentence is therefore not appropriate; instead, the system looks for spaces across the whole sentence and generates a token for each word, with the tokens separated by commas.
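The space-based tokenization described above amounts to a whitespace split. The following sketch uses hypothetical function names and is only an illustration of that idea.

```python
from typing import List


def tokenize(sentence: str) -> List[str]:
    """Split a sentence into word tokens on whitespace."""
    return sentence.split()


def as_comma_separated(tokens: List[str]) -> str:
    """Render the tokens as a comma-separated string."""
    return ",".join(tokens)
```

For instance, applying `tokenize` to “کارخانے بند ہو رہے ہیں” yields five tokens, one per space-delimited word.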
In linguistics, the process of marking a word in the text (
Sentence before applying POS:
جتنی خوراک کھائی جاتی ہے اتنی ہی ضائع بھی کی جاتی ہے
(As much food is wasted as is eaten.)
Sentence after applying POS:
<SM>جتنی<ADJ>خوراک<NN>کھائی<VB>جاتی<AA>ہے<TA>اتنی<ADV>ہی<I>ضائع<ADJ>بھی<I>کی<VB> جاتی<AA>ہے<TA>.
In English, noun phrases are often combinations of more than one noun. For example, “Energy sector” is a noun phrase in which two nouns are combined to express a meaning. This is not the case in Urdu, where nouns can also be joined by other words such as the case markers “کا”, “کے” and “کی”. For example, the Urdu translation of “Energy sector” is “توانائی کے شعبے”. Similarly, the translation of “Cabinet stand” in Urdu is “کابینہ کا موقف”. Therefore, simply relying on adjacent nouns to generate noun phrases is not appropriate. Furthermore, no POS tagger is available for Urdu that can tag these words as a single noun phrase.
To cope with the aforementioned issue, we have manually crafted some linguistic rules to identify noun phrases associated with the opinion words. Noun phrase identification is similar to the approach proposed by Ali et al. [
Rule 1. After opinion extraction, if two nouns before or after the opinion word are closely associated with it and are separated by “کا”, “کے”, “کی” or “کو”, then extract both nouns along with the separating word as a noun phrase.
Rule 2. After opinion extraction, if the opinion is directly associated with an adjective, then treat it as its potential aspect.
The following are some examples of aspect extraction by generating noun phrases.
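Rule 1 can be sketched in Python as a search for a noun, linking word, noun pattern near the opinion word. The function name, the `NN` noun tag, and the `window` parameter (one possible reading of "closely associated") are assumptions for illustration and are not taken from the implementation used in this work.

```python
from typing import List, Optional, Tuple

# Linking words named in Rule 1.
LINKING_WORDS = {"کا", "کے", "کی", "کو"}


def noun_phrase_near_opinion(
    tagged: List[Tuple[str, str]],
    opinion_idx: int,
    window: int = 4,        # assumed limit for "closely associated"
    noun_tag: str = "NN",   # assumed noun tag of the POS tagger
) -> Optional[str]:
    """Return the noun + linking word + noun phrase nearest to the opinion."""
    best: Optional[Tuple[int, str]] = None
    for i in range(len(tagged) - 2):
        (w1, t1), (w2, _), (w3, t3) = tagged[i:i + 3]
        if t1 == noun_tag and w2 in LINKING_WORDS and t3 == noun_tag:
            # Distance from the opinion word to the nearer end of the phrase.
            dist = min(abs(opinion_idx - i), abs(opinion_idx - (i + 2)))
            if dist <= window and (best is None or dist < best[0]):
                best = (dist, " ".join((w1, w2, w3)))
    return best[1] if best else None
```

The input is a list of `(word, tag)` pairs and the index of the opinion word; the function returns `None` when no matching pattern lies within the window.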
1: پاکستان میں توانائی کے شعبے کی نئی جہت
(New dimension of the energy sector in Pakistan).
2: آباؤاجداد کی وراثت کی طرف جانے کی خوشی تھی
(It was a pleasure to go to the ancestral heritage).
In the following example as shown in
3: بھارت کی سپریم کورٹ نے بابری مسجد کا فیصلہ دے کر ثابت کیا
(The Supreme Court of India has proved the Babri Masjid decision).
In the above example as shown in
4: کارخانے بند ہو رہے ہیں
(Factories are closing).
In the above example as shown in
The previous section described the methodology adopted for the identification and extraction of noun phrases. Several sentences contain only a single noun or noun phrase, and the opinion word is directly associated with that noun or noun phrase, as shown in
In above equation,
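The distance-based selection of an opinion target, in which the candidate aspect closest to the opinion word is chosen, can be sketched as follows. Measuring distance as the absolute difference of token positions is an assumption, and the function name is hypothetical.

```python
from typing import List, Optional, Tuple


def nearest_aspect(
    candidates: List[Tuple[int, str]], opinion_idx: int
) -> Optional[str]:
    """Pick the candidate aspect with the minimum token distance.

    candidates holds (token_index, aspect) pairs; ties go to the
    candidate that appears first in the list.
    """
    if not candidates:
        return None
    return min(candidates, key=lambda c: abs(c[0] - opinion_idx))[1]
```

With candidates at positions 0 and 5 and an opinion word at position 4, the candidate at position 5 is selected.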
Although several Urdu blogs are available online, there is no benchmark dataset available in the Urdu language domain. Therefore, we have used the dataset developed by Mukhtar et al. [
Total Sentences | Positive Sentences | Negative Sentences | Neutral Sentences | # of Manually Tagged Aspects |
---|---|---|---|---|
4000 | 1350 | 1450 | 1400 | 3122 |
Two opinion lexicons, L1 and L2, have been utilized for our experimental evaluation. These lexicons contain positive and negative opinion words for the Urdu language. L1 is the larger of the two and contains not only single opinion words but also opinion terms consisting of more than one word. In Urdu text, both words and phrases can express an opinion; we have therefore utilized lexicons that can handle both variations. However, these lexicons contain only opinion terms, and no explicit polarity is assigned. The words are categorized into positive and negative opinions, and it was assumed that positive opinions hold positive polarity and negative opinions hold negative polarity. Negation terms are handled separately and reverse the polarity of any opinion. Similarly, intensifiers are treated as separate terms, which increase the polarity of positive opinions and reduce the polarity of negative opinions.
Lexicon | # of Positive Words | # of Negative Words |
---|---|---|
L1 | 9,578 | 11,739 |
L2 | 2,607 | 4,728 |
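The handling of polarity, negation and intensifiers described above can be illustrated with a small sketch. The unit scores of +1 and -1, the `boost` factor, and the choice to look only at directly adjacent tokens are assumptions; the sketch also handles single-word opinion terms only, whereas the lexicons contain multi-word terms as well.

```python
from typing import Iterable, List


def score_opinions(
    tokens: List[str],
    positive: Iterable[str],
    negative: Iterable[str],
    negations: Iterable[str],
    intensifiers: Iterable[str],
    boost: float = 2.0,  # assumed intensifier weight
) -> float:
    """Sum signed polarity scores over the opinion words in a sentence."""
    positive, negative = set(positive), set(negative)
    negations, intensifiers = set(negations), set(intensifiers)
    score = 0.0
    for i, token in enumerate(tokens):
        if token in positive:
            base = 1.0
        elif token in negative:
            base = -1.0
        else:
            continue
        prev_tok = tokens[i - 1] if i > 0 else None
        next_tok = tokens[i + 1] if i + 1 < len(tokens) else None
        # An intensifier pushes the score further from zero.
        if prev_tok in intensifiers:
            base *= boost
        # A neighbouring negation term reverses the polarity.
        if prev_tok in negations or next_tok in negations:
            base = -base
        score += base
    return score
```

Checking the token after the opinion word as well as the one before it is a design choice made here because a negation term may follow the word it negates.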
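Most existing work has used precision, recall, and F1-score to evaluate the performance of the proposed approaches. Therefore, we have used the same metrics to evaluate our proposed methodology. With TP, FP and FN denoting the numbers of correctly extracted, incorrectly extracted and missed aspects, respectively, these metrics take their standard form:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad \text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$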
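As an illustration, the standard definitions of precision, recall and F1-score can be computed from the set of extracted aspects and the set of manually tagged aspects. The function name is hypothetical, and exact string matching between extracted and tagged aspects is an assumption.

```python
from typing import Iterable, Tuple


def precision_recall_f1(
    extracted: Iterable[str], gold: Iterable[str]
) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 for extracted vs. gold aspects."""
    extracted_set, gold_set = set(extracted), set(gold)
    true_positives = len(extracted_set & gold_set)
    precision = true_positives / len(extracted_set) if extracted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For four extracted aspects of which two appear among three tagged aspects, this gives a precision of 0.5 and a recall of 2/3.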
The performance of the proposed approach has been evaluated using two opinion lexicons L1 and L2 as elaborated in
Opinion Lexicon | Precision | Recall | F1-score |
---|---|---|---|
L1 | 0.63 | 0.68 | 0.65 |
L2 | 0.53 | 0.47 | 0.57 |
Opinion Lexicon | Precision | Recall | F1-score |
---|---|---|---|
L1 | 0.78 | 0.76 | 0.76 |
L2 | 0.74 | 0.56 | 0.63 |
Comparing the results of the proposed methodology with the state-of-the-art approach shows that a methodology proposed for one language is not necessarily suitable for other languages. Urdu has a very different writing structure and grammatical rules from English, and therefore English language techniques cannot be applied directly to the Urdu language domain.
Sentiment analysis, or opinion mining, has gained considerable attention from researchers during the last two decades. The main focus of sentiment analysis is the extraction of user opinions from available text. Among the different granularity levels, aspect-level sentiment analysis aims to identify both opinions and their targets. Existing studies have largely focused on the English language domain, with considerable research in the Chinese and Arabic languages. However, regional languages like Urdu, Hindi, Persian, etc. have been neglected. There is no significant work in the literature that focuses on opinion target extraction in the Urdu language domain. Therefore, in this research we focused on Urdu, which is the 21st most spoken language in the world. Due to the unavailability of resources in the Urdu language, we have proposed a rule-based approach for the identification of opinion targets. We manually crafted several syntactic rules for the identification of opinion words and their targets. These rules utilize opinion lexicons to identify opinion words in the sentence and the associated target words or phrases. We also applied a state-of-the-art frequency-based approach from the English language domain to the Urdu text and compared the results with our proposed approach. The results show that our proposed approach performs better. In the future, we plan to extend the proposed approach to Roman Urdu text, which is the third most used language in the world. Future work also includes exploring Urdu language resources that could help improve the task of opinion and aspect identification.
Thanks to our families & colleagues who supported us morally.