You are welcome to contribute by training and packaging your own hebspacy model. Please follow the instructions below to ensure a seamless loading process.
You may choose the training codebase that best fits your requirements, as long as you save the following files:
- The pretrained transformer layers (post-fine-tuning), saved separately from the NER heads. The file should be named pytorch_model.bin
- All the files required to be loaded by transformers.AutoModel, including the standard config.json, special_tokens.json, tokenizer_config.json, vocab.txt.
- Each of the NER head weights should be saved as a separate bin file with a corresponding index-to-class mapping json file (see instructions below). Files should follow the ner_<name>.bin and ner_<name>.json naming convention.
All weight files should be trained using the Hugging Face and PyTorch libraries.
For example, the following directory contains all the required files for a model that was jointly trained against the BMC and NEMO corpora:
resources/
├── config.json
├── pytorch_model.bin
├── special_tokens.json
├── tokenizer_config.json
├── vocab.txt
├── ner_bmc.bin
├── ner_bmc.json
├── ner_nemo.bin
└── ner_nemo.json
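Before packaging, it can help to confirm that a resources directory matches the layout above. The following is a minimal sketch, not part of hebspacy: the function name check_resources is hypothetical, and it only checks file names taken from the list above (it does not open or validate the weights).

```python
from pathlib import Path

# File names taken from the list above; everything else here is illustrative.
REQUIRED_FILES = (
    "config.json",
    "pytorch_model.bin",
    "special_tokens.json",
    "tokenizer_config.json",
    "vocab.txt",
)


def check_resources(resources_dir):
    """Return a list of problems found in a resources directory (empty if none)."""
    root = Path(resources_dir)
    problems = [
        f"missing {name}" for name in REQUIRED_FILES if not (root / name).is_file()
    ]

    # Every ner_<name>.bin needs a ner_<name>.json next to it, and vice versa.
    heads = {p.stem for p in root.glob("ner_*.bin")}
    mappings = {p.stem for p in root.glob("ner_*.json")}
    if not heads:
        problems.append("no ner_<name>.bin head weights found")
    for stem in sorted(heads - mappings):
        problems.append(f"{stem}.bin has no matching {stem}.json")
    for stem in sorted(mappings - heads):
        problems.append(f"{stem}.json has no matching {stem}.bin")
    return problems
```

Calling check_resources("resources") on the directory shown above should return an empty list; any returned string names a file that is missing or unpaired.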
Each NER head should include a json file that maps each model class index to the corresponding token class name.
Note that indices 0 and 1 should always be associated with the [PAD] and O classes, respectively. Also, the token annotation schema should be IOB2.
Here is an example of the index-to-class mapping json file for ner_bmc:
{
"0": "[PAD]",
"1": "O",
"2": "B-PERS",
"3": "I-PERS",
"4": "B-LOC",
"5": "I-LOC",
"6": "B-ORG",
"7": "I-ORG",
"8": "B-TIME",
"9": "I-TIME",
"10": "B-DATE",
"11": "I-DATE",
"12": "B-MONEY",
"13": "I-MONEY",
"14": "B-PERCENT",
"15": "I-PERCENT",
"16": "B-MISC__AFF",
"17": "I-MISC__AFF",
"18": "B-MISC__ENT",
"19": "I-MISC__ENT",
"20": "B-MISC_EVENT",
"21": "I-MISC_EVENT"
}
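A mapping like the one above can be generated and checked programmatically. The sketch below is an illustration under stated assumptions: the helper names build_mapping, validate_mapping and save_mapping are hypothetical, and the ordering of B-/I- pairs simply mirrors the example (the only hard requirements taken from the text are [PAD] at index 0, O at index 1, and B-/I- prefixed tags).

```python
import json


def build_mapping(entity_types):
    """Build an index-to-class mapping: [PAD] at 0, O at 1, then B-/I- pairs."""
    labels = ["[PAD]", "O"]
    for entity in entity_types:
        labels += [f"B-{entity}", f"I-{entity}"]
    return {str(i): label for i, label in enumerate(labels)}


def validate_mapping(mapping):
    """Return a list of problems with an index-to-class mapping (empty if none)."""
    problems = []
    if mapping.get("0") != "[PAD]":
        problems.append('index 0 must be "[PAD]"')
    if mapping.get("1") != "O":
        problems.append('index 1 must be "O"')
    for index, label in mapping.items():
        if index in ("0", "1"):
            continue
        # Every remaining class must be a B- or I- tag with a non-empty type.
        if not label.startswith(("B-", "I-")) or len(label) <= 2:
            problems.append(f"index {index}: {label!r} is not a B-/I- tag")
    return problems


def save_mapping(mapping, path):
    """Write the mapping as a json file, e.g. ner_<name>.json."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(mapping, handle, indent=2, ensure_ascii=False)
```

For instance, build_mapping(["PERS", "LOC"]) yields the first six entries of the ner_bmc example, and validate_mapping can be run on a mapping loaded with json.load before the file is placed in the resources directory.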
Once you have prepared the directory with all the required files, please follow these steps:
- Fork this repo (in case you haven't already)
- Make sure that spacy is installed in your running python environment (make sure it is the same version as mentioned in requirements.txt)
- Navigate to the repo's root
- Run python setup.py develop, which should create a hebspacy.egg-info directory
- Navigate to scripts\model
- Update meta.json accordingly (make sure to follow the spaCy package naming conventions)
- Run python package.py <RESOURCES_DIR> <OUTPUT_DIR>, where <RESOURCES_DIR> should point to the directory with all the files from the previous section.
- Run python -m spacy package <OUTPUT_DIR> <WHEEL_DIR> --build wheel
- Your whl and tar.gz files are ready under <WHEEL_DIR>/<lang>_<name>-<version>/dist
- Install your files by running pip install XXXXX.whl
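To locate the built files after the last packaging step, the output path can be derived from meta.json. This is a rough sketch rather than part of the repo's scripts: the function names are hypothetical, and it assumes meta.json contains the lang, name and version keys that make up the <lang>_<name>-<version> directory name mentioned above.

```python
import json
from pathlib import Path


def expected_dist_dir(meta_path, wheel_dir):
    """Return <WHEEL_DIR>/<lang>_<name>-<version>/dist for a given meta.json."""
    meta = json.loads(Path(meta_path).read_text(encoding="utf-8"))
    package_dir = f"{meta['lang']}_{meta['name']}-{meta['version']}"
    return Path(wheel_dir) / package_dir / "dist"


def find_wheels(meta_path, wheel_dir):
    """List the .whl files in the expected dist directory, sorted by name."""
    return sorted(expected_dist_dir(meta_path, wheel_dir).glob("*.whl"))
```

Each path returned by find_wheels is a candidate argument for pip install; an empty list suggests that the packaging step did not produce a wheel in the expected location.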