extremeText is an extension of the fastText library for multi-label classification, including extreme cases with hundreds of thousands or millions of labels.
extremeText implements:
- Probabilistic Labels Tree (PLT) loss for extreme multi-label classification with top-down hierarchical clustering (k-means) for tree building,
- sigmoid loss for multi-label classification,
- L2 regularization for all losses,
- ensemble of loss layers with bagging,
- calculation of hidden (document) vector as a weighted average of the word vectors,
- calculation of TF-IDF weights for words.
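The last two features can be illustrated with a minimal sketch: compute a TF-IDF weight for each word and use the weights to average the word vectors into a document vector. The function names and the smoothed IDF formula below are assumptions made for illustration and are not taken from the extremeText sources.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """Return one {word: weight} dict per tokenized document.

    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1, so that every
    weight stays positive. extremeText may use a different formula.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        })
    return weights


def document_vector(word_weights, word_vectors):
    """Weighted average of the vectors of the words that have a vector."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, weight in word_weights.items():
        vector = word_vectors.get(word)
        if vector is None:
            continue  # unknown words are skipped in this sketch
        for i, value in enumerate(vector):
            total[i] += weight * value
        weight_sum += weight
    if weight_sum == 0.0:
        return total
    return [value / weight_sum for value in total]
```

Words that occur in fewer documents receive a larger weight and therefore pull the document vector further toward their own word vector.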
Like fastText, extremeText can be built as an executable using Make (recommended) or CMake:
$ git clone https://github.com/mwydmuch/extremeText.git
$ cd extremeText
(optional) $ cmake .
$ make
This will produce object files for all the classes as well as the main binary extremetext.
The easiest way to get extremeText is to use pip.
$ pip install extremetext
Installing on macOS may require setting MACOSX_DEPLOYMENT_TARGET=10.9 first:
$ export MACOSX_DEPLOYMENT_TARGET=10.9
$ pip install extremetext
The latest version of extremeText can be built from source using pip or, alternatively, setuptools.
$ git clone https://github.com/mwydmuch/extremeText.git
$ cd extremeText
$ pip install .
(or) $ python setup.py install
Now you can import this library with:
import extremeText
extremeText adds new options to the fastText supervised command:
$ ./extremetext supervised
New losses for multi-label classification:
-loss sigmoid
-loss plt (Probabilistic Labels Tree)
With the following optional arguments:
General:
-l2 L2 regularization (default = 0)
-tfidfWeights calculate TF-IDF weights for words
-wordsWeights read word weights from file (format: <word>:<weights>)
-weight document weight prefix (default = __weight__; format: <weight prefix>:<document weight>)
-tag tags prefix (default = __tag__), tags are ignored words, that are outputed with prediction
-eosWeight weight of EOS token (default = 1.0)
-freezeVectors freeze pretrained word vectors for supervised learning
PLT (Probabilistic Labels Tree):
-treeType type of PLT: complete, huffman, kmeans (default = kmeans)
-arity arity of PLT (default = 2)
-maxLeaves maximum number of leaves (labels) in one internal node of PLT (default = 100)
-kMeansEps stopping criteria for k-means clustering (default = 0.001)
Ensemble:
-ensemble size of the ensemble (default = 1)
-bagging bagging ratio (default = 1.0)
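To make the PLT options above more concrete, here is a minimal sketch of how a label probability is obtained from a label tree: every node holds a conditional probability given its parent, and the probability of a label (a leaf) is the product of these values along the path from the root. The tree layout, the dictionaries, and the function names are illustrative assumptions, not the extremeText implementation.

```python
def label_probability(leaf, parent, node_prob):
    """Multiply the conditional node probabilities from a leaf up to the root."""
    prob = 1.0
    node = leaf
    while node is not None:
        prob *= node_prob[node]
        node = parent[node]
    return prob


def labels_above_threshold(root, children, node_prob, threshold):
    """Collect all leaves whose path probability reaches the threshold.

    A subtree is pruned as soon as its path probability drops below the
    threshold, because the product can only decrease further down the tree.
    """
    found = {}
    stack = [(root, node_prob[root])]
    while stack:
        node, prob = stack.pop()
        if prob < threshold:
            continue
        kids = children.get(node, [])
        if not kids:
            found[node] = prob  # a node without children is a label
        for child in kids:
            stack.append((child, prob * node_prob[child]))
    return found


# Hypothetical binary tree (arity 2): node 0 is the root, nodes 3-6 are labels.
parent = {0: None, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}
children = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
node_prob = {0: 1.0, 1: 0.8, 2: 0.3, 3: 0.5, 4: 0.9, 5: 0.1, 6: 0.6}
```

With a larger arity or a larger maximum number of leaves per node the tree becomes shallower, so fewer factors are multiplied per label but more children have to be scored at each visited node.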
extremeText also adds new commands and makes others run in parallel:
$ ./extremetext predict[-prob] <model> <test-data> [<k>] [<th>] [<output>] [<thread>]
$ ./extremetext get-prob <model> <input> [<th>] [<output>] [<thread>]
Please cite the work below if you use this code for extreme classification.
M. Wydmuch, K. Jasinska, M. Kuznetsov, R. Busa-Fekete, K. Dembczyński. A no-regret generalization of hierarchical softmax to extreme multi-label classification. Advances in Neural Information Processing Systems 31, 2018.
- Merge with the latest changes from fastText.
- Rewrite vanilla fastText losses as extremeText loss layers to support all new features.
- Introduction
- Resources
- Requirements
- Building fastText
- Example use cases
- Full documentation
- References
- Join the fastText community
- License
fastText is a library for efficient learning of word representations and sentence classification.
- Recent state-of-the-art English word vectors.
- Word vectors for 157 languages trained on Wikipedia and Crawl.
- Models for language identification and various supervised tasks.
- The preprocessed YFCC100M data used in [2].
You can find answers to frequently asked questions on our website.
We also provide a cheatsheet full of useful one-liners.
We are continuously building and testing our library, CLI and Python bindings under various docker images using circleci.
Generally, fastText builds on modern macOS and Linux distributions. Since it uses some C++11 features, it requires a compiler with good C++11 support. These include:
- (g++-4.7.2 or newer) or (clang-3.3 or newer)
Compilation is carried out using a Makefile, so you will need to have a working make. If you want to use cmake you need at least version 2.8.9.
One of the oldest distributions we successfully built and tested the CLI under is Debian wheezy.
For the word-similarity evaluation script you will need:
- Python 2.6 or newer
- NumPy & SciPy
For the python bindings (see the subdirectory python) you will need:
- Python version 2.7 or >=3.4
- NumPy & SciPy
- pybind11
One of the oldest distributions we successfully built and tested the Python bindings under is Debian jessie.
If these requirements make it impossible for you to use fastText, please open an issue and we will try to accommodate you.
We discuss building the latest stable version of fastText.
You can find our latest stable release in the usual place.
There is also the master branch that contains all of our most recent work, but comes along with all the usual caveats of an unstable branch. You might want to use this if you are a developer or power-user.
$ wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
$ unzip v0.1.0.zip
$ cd fastText-0.1.0
$ make
This will produce object files for all the classes as well as the main binary fasttext.
If you do not plan on using the default system-wide compiler, update the two macros defined at the beginning of the Makefile (CC and INCLUDES).
For now this is not part of a release, so you will need to clone the master branch.
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ mkdir build && cd build && cmake ..
$ make && make install
This will create the fasttext binary and also all relevant libraries (shared, static, PIC).
For now this is not part of a release, so you will need to clone the master branch.
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ pip install .
For further information and introduction see python/README.md
This library has two main use cases: word representation learning and text classification. These were described in the two papers [1] and [2].
In order to learn word vectors, as described in [1], do:
$ ./fasttext skipgram -input data.txt -output model
where data.txt is a training file containing UTF-8 encoded text.
By default the word vectors will take into account character n-grams from 3 to 6 characters.
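As a rough illustration of what "character n-grams from 3 to 6 characters" means, the sketch below enumerates the n-grams of a single word. It assumes the word is wrapped in the boundary symbols `<` and `>` as described in [1]. The function name is hypothetical, and the hashing of n-grams into buckets is left out.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List all character n-grams of a word with lengths min_n..max_n.

    Assumption: the word is padded with "<" and ">" so that prefixes and
    suffixes can be told apart from n-grams inside the word.
    """
    token = f"<{word}>"
    return [
        token[start:start + n]
        for n in range(min_n, max_n + 1)
        for start in range(len(token) - n + 1)
    ]


# The 3-grams of "where" are: <wh, whe, her, ere, re>
trigrams = char_ngrams("where", min_n=3, max_n=3)
```

Because a word vector is built from its n-grams, a word that never occurred in the training data can still be assigned a vector as long as some of its n-grams were seen.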
At the end of optimization the program will save two files: model.bin and model.vec.
model.vec is a text file containing the word vectors, one per line.
model.bin is a binary file containing the parameters of the model along with the dictionary and all hyperparameters.
The binary file can be used later to compute word vectors or to restart the optimization.
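The text file can be read with a few lines of code. The sketch below assumes the usual layout of a .vec file: a header line with the number of words and the vector dimension, followed by one line per word holding the word and its space-separated values. The header is an assumption not stated above, and the function name is hypothetical.

```python
def parse_vec(lines):
    """Parse the lines of a .vec file into (dimension, {word: vector}).

    Assumption: the first line is "<number of words> <dimension>".
    """
    lines = iter(lines)
    header = next(lines).split()
    dim = int(header[1])
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return dim, vectors
```

A file object can be passed directly, for example `parse_vec(open("model.vec", encoding="utf-8"))`, since iterating over a file yields its lines.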
The previously trained model can be used to compute word vectors for out-of-vocabulary words.
Provided you have a text file queries.txt containing words for which you want to compute vectors, use the following command:
$ ./fasttext print-word-vectors model.bin < queries.txt
This will output word vectors to the standard output, one vector per line. This can also be used with pipes:
$ cat queries.txt | ./fasttext print-word-vectors model.bin
See the provided scripts for an example. For instance, running:
$ ./word-vector-example.sh
will compile the code, download data, compute word vectors and evaluate them on the rare words similarity dataset RW [Thang et al. 2013].
This library can also be used to train supervised text classifiers, for instance for sentiment analysis. In order to train a text classifier using the method described in [2], use:
$ ./fasttext supervised -input train.txt -output model
where train.txt is a text file containing a training sentence per line along with the labels.
By default, we assume that labels are words that are prefixed by the string __label__.
This will output two files: model.bin and model.vec.
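The training file format described above (one sentence per line, labels marked with the __label__ prefix) can be sketched as follows. The helper names and the example labels are made up for illustration.

```python
def format_example(labels, text, prefix="__label__"):
    """Build one training line: prefixed labels followed by the text."""
    return " ".join([prefix + label for label in labels] + [text])


def parse_example(line, prefix="__label__"):
    """Split a training line into its labels and its remaining words."""
    labels, words = [], []
    for token in line.split():
        if token.startswith(prefix):
            labels.append(token[len(prefix):])
        else:
            words.append(token)
    return labels, words


# A line with two labels, as used in multi-label classification.
line = format_example(["positive", "food"], "great pizza")
```

Writing one such line per example to train.txt gives a file that can be passed to the -input argument.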
Once the model is trained, you can evaluate it by computing the precision and recall at k (P@k and R@k) on a test set using:
$ ./fasttext test model.bin test.txt k
The argument k is optional, and is equal to 1 by default.
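For reference, precision and recall at k can be computed as in the sketch below. It assumes a micro-averaged definition (correct predictions divided by all predictions, and by all true labels, summed over the test set). This is one common convention and may differ in detail from what the test command reports.

```python
def precision_recall_at_k(true_labels, ranked_predictions, k=1):
    """Micro-averaged P@k and R@k.

    true_labels: one set of correct labels per example.
    ranked_predictions: one list of predicted labels per example,
    ordered from most to least likely.
    """
    correct = predicted = relevant = 0
    for truth, ranked in zip(true_labels, ranked_predictions):
        top_k = ranked[:k]
        correct += sum(1 for label in top_k if label in truth)
        predicted += len(top_k)
        relevant += len(truth)
    precision = correct / predicted if predicted else 0.0
    recall = correct / relevant if relevant else 0.0
    return precision, recall
```

Increasing k can only raise recall, while precision may drop once less likely labels are included.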
In order to obtain the k most likely labels for a piece of text, use:
$ ./fasttext predict model.bin test.txt k
or use predict-prob to also get the probability for each label:
$ ./fasttext predict-prob model.bin test.txt k
where test.txt contains a piece of text to classify per line.
Doing so will print to the standard output the k most likely labels for each line.
The argument k is optional, and equal to 1 by default.
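The output of predict-prob can be post-processed with a small parser. The sketch below assumes that each output line alternates between a label and its probability, separated by spaces. This layout is an assumption and should be checked against the output of your build. The function name and the example labels are hypothetical.

```python
def parse_predict_prob(line, prefix="__label__"):
    """Turn "<label> <prob> <label> <prob> ..." into (label, prob) pairs.

    Assumption: labels and probabilities alternate on a single line.
    """
    tokens = line.split()
    pairs = []
    for label, prob in zip(tokens[0::2], tokens[1::2]):
        name = label[len(prefix):] if label.startswith(prefix) else label
        pairs.append((name, float(prob)))
    return pairs


# Hypothetical output line for k = 2.
pairs = parse_predict_prob("__label__sports 0.9 __label__news 0.1")
```

The resulting pairs can then be filtered, for instance by keeping only labels whose probability exceeds a chosen threshold.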
See classification-example.sh for an example use case.
In order to reproduce results from the paper [2], run classification-results.sh. This will download all the datasets and reproduce the results from Table 1.
If you want to compute vector representations of sentences or paragraphs, please use:
$ ./fasttext print-sentence-vectors model.bin < text.txt
This assumes that the text.txt file contains the paragraphs that you want to get vectors for.
The program will output one vector representation per line in the file.
You can also quantize a supervised model to reduce its memory usage with the following command:
$ ./fasttext quantize -output model
This will create a .ftz file with a smaller memory footprint. All the standard functionality, like test or predict, works the same way on the quantized models:
$ ./fasttext test model.ftz test.txt
The quantization procedure follows the steps described in [3]. You can run the script quantization-example.sh for an example.
Invoke a command without arguments to list available arguments and their default values:
$ ./fasttext supervised
Empty input or output path.
The following arguments are mandatory:
-input training file path
-output output file path
The following arguments are optional:
-verbose verbosity level [2]
The following arguments for the dictionary are optional:
-minCount minimal number of word occurences [1]
-minCountLabel minimal number of label occurences [0]
-wordNgrams max length of word ngram [1]
-bucket number of buckets [2000000]
-minn min length of char ngram [0]
-maxn max length of char ngram [0]
-t sampling threshold [0.0001]
-label labels prefix [__label__]
The following arguments for training are optional:
-lr learning rate [0.1]
-lrUpdateRate change the rate of updates for the learning rate [100]
-dim size of word vectors [100]
-ws size of the context window [5]
-epoch number of epochs [5]
-neg number of negatives sampled [5]
-loss loss function {ns, hs, softmax} [softmax]
-thread number of threads [12]
-pretrainedVectors pretrained word vectors for supervised learning []
-saveOutput whether output params should be saved [0]
The following arguments for quantization are optional:
-cutoff number of words and ngrams to retain [0]
-retrain finetune embeddings if a cutoff is applied [0]
-qnorm quantizing the norm separately [0]
-qout quantizing the classifier [0]
-dsub size of each sub-vector [2]
Defaults may vary by mode. (Word-representation modes skipgram and cbow use a default -minCount of 5.)
Please cite [1] if using this code for learning word representations or [2] if using for text classification.
[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information
@article{bojanowski2017enriching,
title={Enriching Word Vectors with Subword Information},
author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
journal={Transactions of the Association for Computational Linguistics},
volume={5},
year={2017},
issn={2307-387X},
pages={135--146}
}
[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification
@InProceedings{joulin2017bag,
title={Bag of Tricks for Efficient Text Classification},
author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
booktitle={Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},
month={April},
year={2017},
publisher={Association for Computational Linguistics},
pages={427--431},
}
[3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models
@article{joulin2016fasttext,
title={FastText.zip: Compressing text classification models},
author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, H{\'e}rve and Mikolov, Tomas},
journal={arXiv preprint arXiv:1612.03651},
year={2016}
}
(* These authors contributed equally.)
- Facebook page: https://www.facebook.com/groups/1174547215919768
- Google group: https://groups.google.com/forum/#!forum/fasttext-library
- Contact: [email protected], [email protected], [email protected], [email protected]
See the CONTRIBUTING file for information about how to help out.
fastText is BSD-licensed. We also provide an additional patent grant.