Merge pull request #2 from dkarmon/main
Initial merge
Showing 18 changed files with 663 additions and 2 deletions.
@@ -0,0 +1,36 @@
```yaml
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: Python application

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  build:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v2
    - name: Set up Python 3.8
      uses: actions/setup-python@v2
      with:
        python-version: "3.8"
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install flake8 pytest
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
    - name: Lint with flake8
      run: |
        # stop the build if there are Python syntax errors or undefined names
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
    - name: Test with pytest
      run: |
        pytest
```
@@ -1,2 +1,80 @@
# HebSpaCy

A custom spaCy pipeline for Hebrew text including a transformer-based multitask NER model that recognizes 16 entity types in Hebrew, including `GPE`, `PER`, `LOC` and `ORG`.

----
[![MIT license](https://img.shields.io/badge/license-MIT-brightgreen.svg)](http://opensource.org/licenses/MIT) ![Release](https://img.shields.io/github/v/release/8400TheHealthNetwork/HebSpacy.svg) [![PyPI version](https://badge.fury.io/py/hebspacy.svg)](https://badge.fury.io/py/hebspacy) [![Pypi Downloads](https://img.shields.io/pypi/dm/hebspacy.svg)](https://img.shields.io/pypi/dm/hebspacy.svg)
## Installation

To run the package you will need to install both the package and a model, preferably in a virtual environment:

``` sh
# Create conda env (optional)
conda create --name hebspacy python=3.8
conda activate hebspacy

# Install hebspacy
pip install hebspacy

# Download and install the model (see available models below)
pip install </path/to/download>
```
#### Available Models

| Model | Description | Install URL |
| ----- | ----------- | ----------- |
| he_ner_news_trf | A full spaCy pipeline for Hebrew text including a multitask NER model trained against the BMC and NEMO corpora. Read more [here](#he_ner_news_trf). | [Download](https://github.com/8400TheHealthNetwork/HebSpacy/releases/download/he_ner_news_trf-3.2.1/he_ner_news_trf-3.2.1-py3-none-any.whl) |
## Getting started
```python
import spacy

nlp = spacy.load("he_ner_news_trf")
text = """מרגלית דהן
מספר זהות 11278904-5
2/12/2001
ביקור חוזר מ18.11.2001
במסגרת בירור פלפיטציות ואי סבילות למאמצים,מנורגיות קשות ע"ר שרירנים- ביצעה מעבדה שהדגימה:
המוגלובין 9, מיקרוציטי היפוכרומטי עם RDW 19,
פריטין 10, סטורציית טרנספרין 8%.
מבחינת עומס נגיפי HIV- undetectable ומקפידה על HAART
"""

doc = nlp(text)
for entity in doc.ents:
    print(f"{entity.text} \t {entity.label_}: {entity._.confidence_score:.4f} ({entity.start_char},{entity.end_char})")

>>> מרגלית דהן PERS: 0.9999 (0,10)
>>> 2/12/2001 DATE: 0.9897 (33,42)
>>> מ18.11.2001 DATE: 0.8282 (54,65)
>>> 8% PERCENT: 0.9932 (230,232)
```
---------------
### he_ner_news_trf
`he_ner_news_trf` is a multitask model constructed from [AlephBert](https://arxiv.org/pdf/2104.04052.pdf) and two NER-focused heads, each trained against a different NER-annotated Hebrew corpus:
1. [NEMO corpus](https://github.com/OnlpLab/NEMO-Corpus) - annotations of the Hebrew Treebank (Haaretz newspaper) for the widely-used OntoNotes entity categories: `GPE` (geo-political entity), `PER` (person), `LOC` (location), `ORG` (organization), `FAC` (facility), `EVE` (event), `WOA` (work-of-art), `ANG` (language), `DUC` (product).
2. [BMC corpus](https://www.cs.bgu.ac.il/~elhadad/nlpproj/naama/) - annotations of articles from Israeli newspapers and websites (Haaretz newspaper, Maariv newspaper, Channel 7) for the common entity categories: `PERS` (person), `LOC` (location), `ORG` (organization), `DATE` (date), `TIME` (time), `MONEY` (money), `PERCENT` (percent), `MISC__AFF` (misc affiliation), `MISC__ENT` (misc entity), `MISC_EVENT` (misc event).

The model was developed and trained using the Hugging Face and PyTorch libraries, and was later integrated into a spaCy pipeline.

#### Model integration
The output model was split into three weight files: _the transformer embeddings, the BMC head, and the NEMO head_.
Each component was packaged in a separate pipe and integrated into the custom pipeline.
Finally, a custom NER-head consolidation pipe was added at the end of the pipeline to resolve conflicting/overlapping predictions and set the `Doc.ents` property.

To access the entities recognized by each NER head, use the `Doc._.<ner_head>` property (e.g., `doc._.nemo_ents` and `doc._.bmc_ents`).
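The consolidation logic itself lives inside the pipeline, but the idea can be illustrated with a minimal, self-contained sketch. Note this is a hypothetical illustration, not the actual `consolidator` implementation: it assumes each head yields `(start, end, label, confidence)` tuples and keeps the higher-confidence span whenever two spans overlap.

```python
# Hypothetical sketch of NER-head consolidation (NOT the hebspacy
# `consolidator` pipe): greedily keep the highest-confidence span,
# discarding any later candidate that overlaps an already-chosen one.

def consolidate(head_a, head_b):
    """Each input: list of (start, end, label, confidence) tuples."""
    candidates = sorted(head_a + head_b, key=lambda e: -e[3])  # best first
    chosen = []
    for start, end, label, conf in candidates:
        overlaps = any(start < c_end and c_start < end
                       for c_start, c_end, _, _ in chosen)
        if not overlaps:
            chosen.append((start, end, label, conf))
    return sorted(chosen)


nemo = [(0, 10, "PER", 0.95), (20, 30, "ORG", 0.60)]
bmc = [(0, 10, "PERS", 0.90), (40, 45, "DATE", 0.99)]
print(consolidate(nemo, bmc))
# The overlapping PERS span (confidence 0.90) loses to PER (0.95).
```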
--------------
## Contribution
You are welcome to contribute to the `hebspacy` project and introduce new features/models.
Kindly follow the [pipeline codebase instructions](contribute/pipeline/README.md) and the [model training and packaging guidelines](contribute/model/README.md).

-----

HebSpaCy is an open-source project developed by [8400 The Health Network](https://www.8400thn.org/).
@@ -0,0 +1,70 @@
# Contribute your own `hebspacy` model
You are welcome to contribute by training and packaging your own `hebspacy` model. Please follow the instructions below to enable a seamless loading process.

## Model Training
You may choose the training codebase that best fits your requirements, as long as you save the following files:
1. The pretrained transformer layers (post-fine-tuning), saved **separately** from the NER heads. The file should be named `pytorch_model.bin`.
2. All the files required by `transformers.AutoModel`, including the standard `config.json`, `special_tokens.json`, `tokenizer_config.json`, and `vocab.txt`.
3. Each NER head's weights, saved as a separate `bin` file with a corresponding index-to-class mapping `json` file (see instructions below). Files should follow the `ner_<name>.bin` and `ner_<name>.json` naming convention.

**All weight files should be trained using the Hugging Face and PyTorch libraries.**

For example, the following directory contains all the required files for a model that was jointly trained against the BMC and NEMO corpora:
````
resources/
├── config.json
├── pytorch_model.bin
├── special_tokens.json
├── tokenizer_config.json
├── vocab.txt
├── ner_bmc.bin
├── ner_bmc.json
├── ner_nemo.bin
└── ner_nemo.json
````
### Index to class mapping file
Each NER head should include a `json` file that maps each model class index to the corresponding token class name.
Note that indices `0` and `1` should always be associated with the `[PAD]` and `O` classes, respectively. Also, the token annotation scheme should be [**IOB2**](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)).

Here is an example of the index to class mapping `json` file for `ner_bmc`:
````json
{
    "0": "[PAD]",
    "1": "O",
    "2": "B-PERS",
    "3": "I-PERS",
    "4": "B-LOC",
    "5": "I-LOC",
    "6": "B-ORG",
    "7": "I-ORG",
    "8": "B-TIME",
    "9": "I-TIME",
    "10": "B-DATE",
    "11": "I-DATE",
    "12": "B-MONEY",
    "13": "I-MONEY",
    "14": "B-PERCENT",
    "15": "I-PERCENT",
    "16": "B-MISC__AFF",
    "17": "I-MISC__AFF",
    "18": "B-MISC__ENT",
    "19": "I-MISC__ENT",
    "20": "B-MISC_EVENT",
    "21": "I-MISC_EVENT"
}
````
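To see how IOB2 tags and such a mapping combine, here is a small, self-contained sketch (not part of `hebspacy`; the truncated `MAPPING` and the `decode` helper are illustrative only) that turns a sequence of class indices into entity spans:

```python
# Hypothetical sketch: decode IOB2-tagged token indices into entity spans
# using an index-to-class mapping like the ner_bmc example above.

MAPPING = {"0": "[PAD]", "1": "O", "2": "B-PERS", "3": "I-PERS",
           "4": "B-LOC", "5": "I-LOC"}  # truncated for illustration

def decode(indices):
    """Turn class indices, e.g. [2, 3, 1], into (start, end, label) spans."""
    spans, current = [], None
    for i, idx in enumerate(indices):
        tag = MAPPING[str(idx)]
        if tag.startswith("B-"):          # a new entity begins here
            if current:
                spans.append(current)
            current = (i, i + 1, tag[2:])
        elif tag.startswith("I-") and current and current[2] == tag[2:]:
            current = (current[0], i + 1, current[2])  # extend open span
        else:                             # O, [PAD], or a dangling I- tag
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return spans


print(decode([2, 3, 1, 4, 5]))
# → [(0, 2, 'PERS'), (3, 5, 'LOC')]
```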
## Model Packaging
Once you have prepared the directory with all the required files, please follow these steps:
1. Fork this repo (in case you haven't already)
2. Make sure that `spacy` is installed in your Python environment (**make sure it is the same version as mentioned in requirements.txt**)
3. Navigate to the repo's root
4. Run `python setup.py develop`, which should create a `hebspacy.egg-info` directory
5. Navigate to `scripts/model`
6. Update `meta.json` accordingly (make sure to follow the [spaCy package naming conventions](https://spacy.io/models#conventions))
7. Run `python package.py <RESOURCES_DIR> <OUTPUT_DIR>`, where `<RESOURCES_DIR>` should point to the directory with all the files from the previous section
8. Run `python -m spacy package <OUTPUT_DIR> <WHEEL_DIR> --build wheel`
9. Your `whl` and `tar.gz` files are ready under `<WHEEL_DIR>/<lang>_<name>-<version>/dist`
10. Install your files by running `pip install XXXXX.whl`
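Before running the packaging script, it can save a round trip to check that the resources directory is complete. The helper below is a hypothetical convenience (`validate_resources` is not part of the repo); it just encodes the file-name requirements from the Model Training section above:

```python
from pathlib import Path

# Hypothetical pre-flight check (not part of hebspacy): verify a resources
# directory contains everything the packaging script expects.
REQUIRED = ["config.json", "pytorch_model.bin", "special_tokens.json",
            "tokenizer_config.json", "vocab.txt"]

def validate_resources(resource_dir: str) -> list:
    """Return a list of problems found; an empty list means ready to package."""
    root = Path(resource_dir)
    problems = [f"missing {name}" for name in REQUIRED
                if not (root / name).exists()]
    # every ner_<name>.bin head needs a matching ner_<name>.json mapping
    for head in root.glob("ner_*.bin"):
        if not head.with_suffix(".json").exists():
            problems.append(f"missing mapping for {head.name}")
    return problems
```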
@@ -0,0 +1,10 @@
```json
{
    "lang": "he",
    "name": "ner_news_trf",
    "version": "3.2.1",
    "description": "Hebrew pipeline. Components: tok2vec, senter, multihead ner (bmc and nemo).",
    "author": "hebspacy",
    "email": "[email protected]",
    "url": "https://github.com/8400TheHealthNetwork/HebSpacy",
    "license": "MIT License (MIT)"
}
```
@@ -0,0 +1,55 @@
```python
import json
import os
import shutil
from pathlib import Path
from spacy.lang.xx import MultiLanguage
from thinc.config import Config
import argparse


def package(resource_dir: str, output_dir: str):
    input_dir = Path(resource_dir)
    target_dir = Path(output_dir)

    if target_dir.exists():
        shutil.rmtree(target_dir)
    target_dir.mkdir()

    nlp = MultiLanguage()
    # update metadata
    with open("meta.json", "r") as f:
        meta = json.load(f)
    nlp.meta.update(meta)
    nlp.lang = "he"

    # Add sentence segmentation
    nlp.add_pipe("sentencizer")

    # Add transformer
    config = Config().from_disk("transformer.cfg")
    transformer_config = config["transformer"]
    transformer_config["model"]["name"] = str(input_dir)
    transformer = nlp.add_pipe("transformer", config=config["transformer"])
    transformer.model.initialize()

    # Add ner heads
    for ner_head in input_dir.glob("ner_*.bin"):
        pipe = nlp.add_pipe("ner_head", name=ner_head.stem, last=True)
        pipe.load(input_dir)
        os.mkdir(target_dir / ner_head.stem)

    # Add consolidator
    nlp.add_pipe("consolidator", last=True)

    nlp.to_disk(target_dir)


if __name__ == '__main__':
    # make sure to run `python setup.py develop` before running this script
    parser = argparse.ArgumentParser()
    parser.add_argument('resources_dir', help="directory containing the weights and configuration files")
    parser.add_argument('output_dir', default=".", help="pipeline output directory")

    args = parser.parse_args()

    package(args.resources_dir, args.output_dir)
```
@@ -0,0 +1,16 @@
```ini
[transformer]
max_batch_items = 4096

[transformer.set_extra_annotations]
@annotation_setters = "spacy-transformers.null_annotation_setter.v1"

[transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
# name = WILL BE AUTOMATICALLY ADDED BY SCRIPT
tokenizer_config = {"use_fast": true}
transformer_config = {}
mixed_precision = false
grad_scaler_config = {}

[transformer.model.get_spans]
@span_getters = "spacy-transformers.sent_spans.v1"
```
@@ -0,0 +1,10 @@
# Contribute to `hebspacy` pipeline
If you have added new features, fixed bugs, or made any other changes in the repo, please follow these instructions:

1. Run existing and new tests on your code. Make sure to place them under `/tests`
2. Navigate to `hebspacy/version.py` and promote the package version according to the [semantic versioning scheme](https://packaging.python.org/en/latest/guides/distributing-packages-using-setuptools/#semantic-versioning-preferred)
3. Run `python setup.py bdist_wheel` to create a new `whl` file
4. You can now install your version by running `pip install XXXXX.whl`
5. Once you are ready to upload your version to [PyPI](https://pypi.org/project/hebspacy/), please contact the package maintainers personally or via email to [[email protected]](mailto:[email protected])

Note that previous model files may need to be repackaged to work with the new version.
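The version bump in step 2 is mechanical for a backwards-compatible fix: increment the patch component of the `major.minor.patch` string. A tiny sketch (hypothetical helper, not part of the repo):

```python
# Hypothetical helper: bump the patch component of a semantic version
# string, e.g. when updating hebspacy/version.py for a bug-fix release.

def bump_patch(version: str) -> str:
    major, minor, patch = (int(part) for part in version.split("."))
    return f"{major}.{minor}.{patch + 1}"


print(bump_patch("0.0.1"))  # → 0.0.2
```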
@@ -0,0 +1 @@
```python
__version__ = '0.0.1'
```