New methodologies to evaluate the consistency of emoji sentiment lexica and alternatives to generate them in a fully automatic unsupervised way
Abstract
Sentiment analysis aims to detect sentiment polarity in unstructured Internet content. Emojis, whose use on Twitter has grown considerably in recent years, are a relevant part of this content and deserve attention. However, every time a new version of Unicode is released, determining the sentiment users wish to express with a new emoji is challenging. In [KNSSM15], an Emoji Sentiment Ranking lexicon built from manual annotations of messages in different languages was presented.
The quality of these annotations directly affects the quality of any emoji sentiment lexica generated from them (high quality corresponds to high self-agreement and inter-agreement). In many cases, the creators of the datasets do not provide any quality metrics, so another strategy is needed to detect poor annotation quality. We therefore propose an automatic approach to identify and manage inconsistent manual sentiment annotations. Then, relying on a new approach to generate emoji sentiment lexica of good quality, we compare two such lexica with lexica created from manually annotated datasets of poor and high quality.
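As an illustration of what self-agreement and inter-agreement measure, the sketch below computes Cohen's kappa between two label sequences. This is only a common chance-corrected agreement statistic chosen for the example; the abstract does not say which metric the annotated datasets report, and the function name `cohen_kappa` and the toy labels are assumptions made here. Inter-agreement compares two annotators on the same messages, while self-agreement compares two passes by the same annotator.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two equally long label sequences.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both sequences were labelled independently
    # with their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both sequences use a single identical label; agreement is total.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy sentiment labels (invented data): two annotators, six messages.
annotator_1 = ["pos", "neg", "neu", "pos", "neg", "pos"]
annotator_2 = ["pos", "neg", "pos", "pos", "neu", "pos"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"inter-agreement (Cohen's kappa): {kappa:.3f}")
```

Values near 1 indicate consistent annotations, and values near 0 indicate agreement no better than chance, which is the kind of inconsistency an automatic check would need to flag when no quality metrics are published.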