This is the main repository for our ACL 2020 paper Learning and Evaluating Emotion Lexicons for 91 Languages.
Data and code for this research project are distributed across different places. This github repository serves as landing page linking to other relevant sites. It also contains the code necessary to re-run our experiments and analyses. Releases of this repository are archived as Zenodo records under DOI 10.5281/zenodo.3779901. While this repository contains our codebase and experimental results, the generated lexicon is archived in an second Zenodo record under DOI 10.5281/zenodo.3756606 due to its size.
We created emotion lexicons for 91 languages, each one covers eight emotional variables and comprises over 100k word entries. There are several versions of the lexicons, the difference being the choice of the expansion model: There is a linear regression baseline and three versions of neural network models. The main version of our lexicons (the version we refer to in the main experiments of our paper and the one we would recommend to use) is referred to as as MTL_grouped (applying multi-task learning within two groups of our target variables). If you are mainly interested in our lexicons, download this zip file (2.2GB). It contains 91 tsv files which are named <iso language code>.tsv
. Please refer to the description of the Zenodo record for more details.
The analyses and results we present in the paper can be found in /memolon/analyses
in form of jupyter notebooks and csv / json files. The names of the notebooks follow the section names in the paper.
If you are interested in the implementation of our methodology, replicating the lexicon creation or re-running our analyses, this section describes how to work with our code.
Make sure you have conda
installed on your machine. We ran the code on Debian 9. Necessary steps may differ across operating systems.
Clone this repository, cd
into the project's root directory, and run the following commands.
conda create --name "memolon2020" python=3.8 pip
conda activate memolon2020
pip install -r requirements.txt
source activate.src
The last line configures your PYTHONTPATH
.
Recreating the lexicons from scratch requires the Source lexicon, data splits, and the translation tables for all 91 languages. The data split (word lists in train.txt
, dev.txt
, and test.txt
in /memolon/data/Source
) as well as the translation tables (see content of /memolon/data/TranslationTables
) are already included in this repository. So, you only have to download the source lexicon. There are two files:
- Get the file Ratings_Warriner_et_al.csv (commit b1ed97e from 11 Nov 2019) and place it in
/memolon/data/Source
. - Get the file Warriner_BE.tsv (commit dbfa3b9 from 15 Jun 2018) and place it in
/memolon/data/Source
.
The python scripts for creating the lexicons can be found in /memolon/src
. You can either cd
there and simply run run_all.sh
or follow the more detailed instructions below. Please take note that the whole process may take several hours. You do not have to have a GPU to run our code in a reasonable amount of time.
- To download the fastText embedding models run
download_embeddings.py
which will download the vec.gz files and place them into/memolon/data/Embeddings
. - To train and use our models to create all four different versions of the target lexicons (
TargetPred
) run the following scripts (or just the one you want to use). They will create the lexicons and place them into the respective subfolder of/memolon/data/TargetPred
:TargetPred_MTLgrouped.py
: Multi-task learning within the two groups (VAD and BE5) but not across both. This is the version we mainly refer to in the paper and recommend to use.TargetPred_MTLall.py
: Multi-task learning among all 8 target variables.TargetPred_STL.py
: Single-task learning all 8 variables separately.TargetPred_ridge.py
: The ridge regression baseline.
As stated above, analyses are organized as jupyter notebooks in the folder /memolon/analyses
. Please note that running some of the notebooks requires data files from other notebooks. The recommended order of running the notebooks is the following, although other orders are possible as well.
overview-gold-lexica.ipynb
silver-evaluation.ipynb
gold-evaluation.ipynb
comparison_against_human_reliability.ipynb
translation_vs_prediction.ipynb
gold_vs_silver_evaluation.ipynb
overview-generated-lexicons.ipynb
Running the silver evaluation is quite simple. You can either generate our lexicons from scratch (see above), or, much easier, download our lexicons from the Zenodo record (see above). Unzip all four versions of the lexicons and place the tsv files in the respective subfolders of /memolon/data/TargetPred
.
Running the gold evaluation and related analyses requires you to manually collect all the gold datasets listed in the paper. This is a tedious process because they all have different copyright and access restrictions. Please find more detailed instructions below.
- en1. This is our Source lexicon, see the above section on lexicon generation.
- en2. Either request the 1999-version of the Affective Norms for English Words (ANEW) from the Center for the Study of Emotion and Attention at the University of Florida, or copy-paste/parse the data from the Techreport Bradley, M. M., & Lang, P. J. (1999). Affective Norms for English Words (Anew): Stimuli, Instruction Manual and Affective Ratings (C–1). The Center for Research in Psychophysiology, University of Florida. Format the data as an tsv file with column headers
word
,valence
,arousal
,dominance
and save it under/memolon/data/TargetGold/ANEW1999.tsv
. - en3. Get the file
Stevenson(2007)-ANEW_emotional_categories.xls
from Stevenson et al. (2007) and place it in/memolon/data/TargetGold
. - es1. Get the file
Redondo(2007).xls
from Redondo et al. (2007) and place it/memolon/data/TargetGold
. - es2. Get the file
13428_2015_700_MOESM1_ESM.csv
from Stadthagen-Gonzalez et al. (2017) and save it as/memolon/data/TargetGold/Stadthagen_VA.csv
- es3. Get the file
Hinojosa et al_Supplementary materials.xlsx
from Hinojosa et al. (2015) and place it in/memolon/data/TargetGold
. - es4. Included in the download for es3.
- es5. Get the file
13428_2017_962_MOESM1_ESM.csv
from Stadthagen-Gonzalez et al. (2018) and save it as/memolon/data/TargetGold/Stadthagen_BE.csv
. - es6. Get the file
13428_2016_768_MOESM1_ESM.xls
from Ferré et al. (2017) ad save it as/memolon/data/TargetGold/Ferre.xlsx
. - de1. Get the file
13428_2013_426_MOESM1_ESM.xlsx
from Schmidtke et al. (2014) and save it as/memolon/data/TargetGold/Schmidtke.xlsx
- de2. Get the file
BAWL-R.xls
from Vo et al. (2009) which is currently available here. You will need to request a password from the authors. Save the file without password as/memolon/data/TargetGold/BAWL-R.xls
. We had to run an automatic file repair when oping it with Excel for the first time. - de3. Get the file
LANG_database.txt
from Kaske and Kotz (2010) and place it/memolon/data/TargetGold
. - de4. Get de2 (see above). Then, get the file
13428_2011_59_MOESM1_ESM.xls
from Briesemeister et al. (2011) and save it as/memolon/data/TargetGold/Briesemeister.xls
. - pl1. Get the file
data sheet 1.xlsx
from Imbir (2016) and save it as/memolon/data/TargetGold/Imbir.xlsx
. - pl2. Get the file
13428_2014_552_MOESM1_ESM.xlsx
from Riegel et al. (2015) and save it as/memolon/data/TargetGold/Riegel.xlsx
- pl3. Get pl2 (see above). Then, get the file
S1 Dataset
from Wierzba et al. (2015) and save it as/memolon/data/TargetGold/Wierzba.xlsx
. - zh1. Get CVAW 2.0 from Yu et al. (2016) which is distributed via
this website. Use Google Translate to 'translate' the words in
cvaw2.csv
from traditional to simplified Chinese characters (you can batch-translate by copy-pasting multiple words separated by newline directly from the file). Save the modified file as/memolon/data/TargetGold/cvaw2_simplied.csv
. - zh2. Get the file
13428_2016_793_MOESM2_ESM.pdf
from Yao et al. (2017). Convert PDF to Excel (there are online tools for that but check the results for correctness) and save as/memolon/data/TargetGold/Yao.xlsx
. - it. Get the data from Montefinese et al. (2014). The website offers a PDF version
of the ratings. However, the formatting makes it very difficult to process automatically. Instead, the first author Maria Montefinese provided us with an Excel version. Save the ratings as
/memolon/data/TargetGold/Montefinese.xls
. - pt. Get the file
13428_2011_131_MOESM1_ESM.xls
from Soares et al. (2012). Save it as/memolon/data/TargetGold/Soares.xls
. - nl. Get the file
13428_2012_243_MOESM1_ESM.xlsx
from Moors et al. (2013). Save it as/memolon/data/TargetGold/Moors.xlsx
. - id. Get the file
Data Sheet 1.XLSX
from Sianipar et al. (2016). Save it as/memolon/data/TargetGold/Sianipar.xlsx
- el. Get the data from Palogiannidi et al. (2016): We downloaded the ratings via the link
provided in the paper on March 13, 2018. The link pointed to zip containing a single file
greek_affective_lexicon.csv
which we saved under/memolon/data/TargetGold
. However, the original link does not work anymore (as of April 22, 2020). We recommend contacting the authors for a replacement. - tr1. Get the file
TurkishEmotionalWordNorms.csv
from Kapucu et al. (2018) which is available here. Place it under/memolon/data/TargetGold
. - tr2. Included in the download for tr1.
- hr. Get the file
Supplementary material_Ćoso et al.xlsx
from Coso et al. (2019) which is available here. Save it as/memolon/data/TargetGold/Coso.xlsx
.
If you find this work useful, please cite our paper:
@inproceedings{buechel-etal-2020-learning-evaluating,
title = "Learning and Evaluating Emotion Lexicons for 91 Languages",
author = {Buechel, Sven and
R{\"u}cker, Susanna and
Hahn, Udo},
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.112",
doi = "10.18653/v1/2020.acl-main.112",
pages = "1202--1217",
abstract = "Emotion lexicons describe the affective meaning of words and thus constitute a centerpiece for advanced sentiment and emotion analysis. Yet, manually curated lexicons are only available for a handful of languages, leaving most languages of the world without such a precious resource for downstream applications. Even worse, their coverage is often limited both in terms of the lexical units they contain and the emotional variables they feature. In order to break this bottleneck, we here introduce a methodology for creating almost arbitrarily large emotion lexicons for any target language. Our approach requires nothing but a source language emotion lexicon, a bilingual word translation model, and a target language embedding model. Fulfilling these requirements for 91 languages, we are able to generate representationally rich high-coverage lexicons comprising eight emotional variables with more than 100k lexical entries each. We evaluated the automatically generated lexicons against human judgment from 26 datasets, spanning 12 typologically diverse languages, and found that our approach produces results in line with state-of-the-art monolingual approaches to lexicon creation and even surpasses human reliability for some languages and variables. Code and data are available at https://github.com/JULIELab/MEmoLon archived under DOI 10.5281/zenodo.3779901.",
}
Please get in touch via sven dot buechel at uni-jena dot de.