Merge pull request #2 from dkarmon/main
Initial merge
Showing 18 changed files with 663 additions and 2 deletions.
@@ -0,0 +1,36 @@
```yaml
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: Python application

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  build:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v2
    - name: Set up Python 3.8
      uses: actions/setup-python@v2
      with:
        python-version: "3.8"
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install flake8 pytest
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
    - name: Lint with flake8
      run: |
        # stop the build if there are Python syntax errors or undefined names
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
    - name: Test with pytest
      run: |
        pytest
```
@@ -1,2 +1,80 @@
# HebSpaCy

A custom spaCy pipeline for Hebrew text including a transformer-based multitask NER model that recognizes 16 entity types in Hebrew, including `GPE`, `PER`, `LOC` and `ORG`.

----
[![MIT license](https://img.shields.io/badge/license-MIT-brightgreen.svg)](http://opensource.org/licenses/MIT) ![Release](https://img.shields.io/github/v/release/8400TheHealthNetwork/HebSpacy.svg) [![PyPI version](https://badge.fury.io/py/hebspacy.svg)](https://badge.fury.io/py/hebspacy) [![Pypi Downloads](https://img.shields.io/pypi/dm/hebspacy.svg)](https://img.shields.io/pypi/dm/hebspacy.svg)
## Installation

To run the package you will need to install both the package and a model, preferably in a virtual environment:

``` sh
# Create conda env (optional)
conda create --name hebspacy python=3.8
conda activate hebspacy

# Install hebspacy
pip install hebspacy

# Download and install the model (see available models below)
pip install </path/to/download>
```
#### Available Models

| Model | Description | Install URL |
| ----- | ----------- | ----------- |
| he_ner_news_trf | A full spaCy pipeline for Hebrew text including a multitask NER model trained against the BMC and NEMO corpora. Read more [here](#he_ner_news_trf). | [Download](https://github.com/8400TheHealthNetwork/HebSpacy/releases/download/he_ner_news_trf-3.2.1/he_ner_news_trf-3.2.1-py3-none-any.whl) |
## Getting started
```python
import spacy

nlp = spacy.load("he_ner_news_trf")
text = """מרגלית דהן
מספר זהות 11278904-5
2/12/2001
ביקור חוזר מ18.11.2001
במסגרת בירור פלפיטציות ואי סבילות למאמצים,מנורגיות קשות ע"ר שרירנים- ביצעה מעבדה שהדגימה:
המוגלובין 9, מיקרוציטי היפוכרומטי עם RDW 19,
פריטין 10, סטורציית טרנספרין 8%.
מבחינת עומס נגיפי HIV- undetectable ומקפידה על HAART
"""

doc = nlp(text)
for entity in doc.ents:
    print(f"{entity.text} \t {entity.label_}: {entity._.confidence_score:.4f} ({entity.start_char},{entity.end_char})")

>>> מרגלית דהן PERS: 0.9999 (0,10)
>>> 2/12/2001 DATE: 0.9897 (33,42)
>>> מ18.11.2001 DATE: 0.8282 (54,65)
>>> 8% PERCENT: 0.9932 (230,232)
```
---------------
### he_ner_news_trf
`he_ner_news_trf` is a multitask model constructed from [AlephBert](https://arxiv.org/pdf/2104.04052.pdf) and two NER-focused heads, each trained against a different NER-annotated Hebrew corpus:
1. [NEMO corpus](https://github.com/OnlpLab/NEMO-Corpus) - annotations of the Hebrew Treebank (Haaretz newspaper) for the widely-used OntoNotes entity categories: `GPE` (geo-political entity), `PER` (person), `LOC` (location), `ORG` (organization), `FAC` (facility), `EVE` (event), `WOA` (work-of-art), `ANG` (language), `DUC` (product).
2. [BMC corpus](https://www.cs.bgu.ac.il/~elhadad/nlpproj/naama/) - annotations of articles from Israeli newspapers and websites (Haaretz newspaper, Maariv newspaper, Channel 7) for the common entity categories: `PERS` (person), `LOC` (location), `ORG` (organization), `DATE` (date), `TIME` (time), `MONEY` (money), `PERCENT` (percent), `MISC__AFF` (misc affiliation), `MISC__ENT` (misc entity), `MISC_EVENT` (misc event).

The model was developed and trained using the Hugging Face and PyTorch libraries, and was later integrated into a spaCy pipeline.

#### Model integration
The output model was split into three weight files: _the transformer embeddings, the BMC head, and the NEMO head_.
Each component was packaged in a separate pipe and integrated into the custom pipeline.
Finally, a custom NER-head consolidation pipe was added at the end of the pipeline to resolve conflicting/overlapping predictions and set the `Doc.ents` property.

To access the entities recognized by each NER head, use the `Doc._.<ner_head>` property (e.g., `doc._.nemo_ents` and `doc._.bmc_ents`).
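The consolidation logic itself lives inside the pipeline, but the idea can be illustrated with a minimal, self-contained sketch. Note this is a hypothetical illustration, not the actual `consolidator` implementation: it assumes each head yields `(start, end, label, confidence)` tuples and keeps the higher-confidence span whenever two spans overlap.

```python
# Hypothetical sketch of NER-head consolidation (NOT the hebspacy
# `consolidator` pipe): greedily keep the highest-confidence span,
# discarding any later candidate that overlaps an already-chosen one.

def consolidate(head_a, head_b):
    """Each input: list of (start, end, label, confidence) tuples."""
    candidates = sorted(head_a + head_b, key=lambda e: -e[3])  # best first
    chosen = []
    for start, end, label, conf in candidates:
        overlaps = any(start < c_end and c_start < end
                       for c_start, c_end, _, _ in chosen)
        if not overlaps:
            chosen.append((start, end, label, conf))
    return sorted(chosen)


nemo = [(0, 10, "PER", 0.95), (20, 30, "ORG", 0.60)]
bmc = [(0, 10, "PERS", 0.90), (40, 45, "DATE", 0.99)]
print(consolidate(nemo, bmc))
# The overlapping PERS span (confidence 0.90) loses to PER (0.95).
```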
--------------
## Contribution
You are welcome to contribute to the `hebspacy` project and introduce new features/models.
Kindly follow the [pipeline codebase instructions](contribute/pipeline/README.md) and the [model training and packaging guidelines](contribute/model/README.md).

-----

HebSpaCy is an open-source project developed by [8400 The Health Network](https://www.8400thn.org/).
@@ -0,0 +1,70 @@
# Contribute your own `hebspacy` model
You are welcome to contribute by training and packaging your own `hebspacy` model. Please follow the instructions below to enable a seamless loading process.

## Model Training
You may choose the training codebase that best fits your requirements, as long as you save the following files:
1. The pretrained transformer layers (post-fine-tuning), saved **separately** from the NER heads. The file should be named `pytorch_model.bin`.
2. All the files required by `transformers.AutoModel`, including the standard `config.json`, `special_tokens.json`, `tokenizer_config.json`, and `vocab.txt`.
3. Each NER head's weights, saved as a separate `bin` file with a corresponding index-to-class mapping `json` file (see instructions below). Files should follow the `ner_<name>.bin` and `ner_<name>.json` naming convention.

**All weight files should be trained using the Hugging Face and PyTorch libraries.**

For example, the following directory contains all the required files for a model that was jointly trained against the BMC and NEMO corpora:
````
resources/
├── config.json
├── pytorch_model.bin
├── special_tokens.json
├── tokenizer_config.json
├── vocab.txt
├── ner_bmc.bin
├── ner_bmc.json
├── ner_nemo.bin
└── ner_nemo.json
````
### Index to class mapping file
Each NER head should include a `json` file that maps each model class index to the corresponding token class name.
Note that indices `0` and `1` should always be associated with the `[PAD]` and `O` classes, respectively. Also, the token annotation scheme should be [**IOB2**](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)).

Here is an example of the index to class mapping `json` file for `ner_bmc`:
````json
{
    "0": "[PAD]",
    "1": "O",
    "2": "B-PERS",
    "3": "I-PERS",
    "4": "B-LOC",
    "5": "I-LOC",
    "6": "B-ORG",
    "7": "I-ORG",
    "8": "B-TIME",
    "9": "I-TIME",
    "10": "B-DATE",
    "11": "I-DATE",
    "12": "B-MONEY",
    "13": "I-MONEY",
    "14": "B-PERCENT",
    "15": "I-PERCENT",
    "16": "B-MISC__AFF",
    "17": "I-MISC__AFF",
    "18": "B-MISC__ENT",
    "19": "I-MISC__ENT",
    "20": "B-MISC_EVENT",
    "21": "I-MISC_EVENT"
}
````
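To see how IOB2 tags and such a mapping combine, here is a small, self-contained sketch (not part of `hebspacy`; the truncated `MAPPING` and the `decode` helper are illustrative only) that turns a sequence of class indices into entity spans:

```python
# Hypothetical sketch: decode IOB2-tagged token indices into entity spans
# using an index-to-class mapping like the ner_bmc example above.

MAPPING = {"0": "[PAD]", "1": "O", "2": "B-PERS", "3": "I-PERS",
           "4": "B-LOC", "5": "I-LOC"}  # truncated for illustration

def decode(indices):
    """Turn class indices, e.g. [2, 3, 1], into (start, end, label) spans."""
    spans, current = [], None
    for i, idx in enumerate(indices):
        tag = MAPPING[str(idx)]
        if tag.startswith("B-"):          # a new entity begins here
            if current:
                spans.append(current)
            current = (i, i + 1, tag[2:])
        elif tag.startswith("I-") and current and current[2] == tag[2:]:
            current = (current[0], i + 1, current[2])  # extend open span
        else:                             # O, [PAD], or a dangling I- tag
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return spans


print(decode([2, 3, 1, 4, 5]))
# → [(0, 2, 'PERS'), (3, 5, 'LOC')]
```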
## Model Packaging
Once you have prepared the directory with all the required files, please follow these steps:
1. Fork this repo (in case you haven't already)
2. Make sure that `spacy` is installed in your Python environment (**make sure it is the same version as mentioned in requirements.txt**)
3. Navigate to the repo's root
4. Run `python setup.py develop`, which should create a `hebspacy.egg-info` directory
5. Navigate to `scripts/model`
6. Update `meta.json` accordingly (make sure to follow the [spaCy package naming conventions](https://spacy.io/models#conventions))
7. Run `python package.py <RESOURCES_DIR> <OUTPUT_DIR>`, where `<RESOURCES_DIR>` should point to the directory with all the files from the previous section
8. Run `python -m spacy package <OUTPUT_DIR> <WHEEL_DIR> --build wheel`
9. Your `whl` and `tar.gz` files are ready under `<WHEEL_DIR>/<lang>_<name>-<version>/dist`
10. Install your files by running `pip install XXXXX.whl`
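Before running the packaging script, it can save a round trip to check that the resources directory is complete. The helper below is a hypothetical convenience (`validate_resources` is not part of the repo); it just encodes the file-name requirements from the Model Training section above:

```python
from pathlib import Path

# Hypothetical pre-flight check (not part of hebspacy): verify a resources
# directory contains everything the packaging script expects.
REQUIRED = ["config.json", "pytorch_model.bin", "special_tokens.json",
            "tokenizer_config.json", "vocab.txt"]

def validate_resources(resource_dir: str) -> list:
    """Return a list of problems found; an empty list means ready to package."""
    root = Path(resource_dir)
    problems = [f"missing {name}" for name in REQUIRED
                if not (root / name).exists()]
    # every ner_<name>.bin head needs a matching ner_<name>.json mapping
    for head in root.glob("ner_*.bin"):
        if not head.with_suffix(".json").exists():
            problems.append(f"missing mapping for {head.name}")
    return problems
```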
@@ -0,0 +1,10 @@
```json
{
    "lang": "he",
    "name": "ner_news_trf",
    "version": "3.2.1",
    "description": "Hebrew pipeline. Components: tok2vec, senter, multihead ner (bmc and nemo).",
    "author": "hebspacy",
    "email": "[email protected]",
    "url": "https://github.com/8400TheHealthNetwork/HebSpacy",
    "license": "MIT License (MIT)"
}
```
@@ -0,0 +1,55 @@
```python
import json
import os
import shutil
from pathlib import Path
from spacy.lang.xx import MultiLanguage
from thinc.config import Config
import argparse


def package(resource_dir: str, output_dir: str):
    input_dir = Path(resource_dir)
    target_dir = Path(output_dir)

    if target_dir.exists():
        shutil.rmtree(target_dir)
    target_dir.mkdir()

    nlp = MultiLanguage()
    # update metadata
    with open("meta.json", "r") as f:
        meta = json.load(f)
    nlp.meta.update(meta)
    nlp.lang = "he"

    # Add sentence segmentation
    nlp.add_pipe("sentencizer")

    # Add transformer
    config = Config().from_disk("transformer.cfg")
    transformer_config = config["transformer"]
    transformer_config["model"]["name"] = str(input_dir)
    transformer = nlp.add_pipe("transformer", config=config["transformer"])
    transformer.model.initialize()

    # Add ner heads
    for ner_head in input_dir.glob("ner_*.bin"):
        pipe = nlp.add_pipe("ner_head", name=ner_head.stem, last=True)
        pipe.load(input_dir)
        os.mkdir(target_dir / ner_head.stem)

    # Add consolidator
    nlp.add_pipe("consolidator", last=True)

    nlp.to_disk(target_dir)


if __name__ == '__main__':
    # make sure to run `python setup.py develop` before running this script
    parser = argparse.ArgumentParser()
    parser.add_argument('resources_dir', help="directory containing the weights and configuration files")
    parser.add_argument('output_dir', default=".", help="pipeline output directory")

    args = parser.parse_args()

    package(args.resources_dir, args.output_dir)
```
@@ -0,0 +1,16 @@
```ini
[transformer]
max_batch_items = 4096

[transformer.set_extra_annotations]
@annotation_setters = "spacy-transformers.null_annotation_setter.v1"

[transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
# name = WILL BE AUTOMATICALLY ADDED BY SCRIPT
tokenizer_config = {"use_fast": true}
transformer_config = {}
mixed_precision = false
grad_scaler_config = {}

[transformer.model.get_spans]
@span_getters = "spacy-transformers.sent_spans.v1"
```
@@ -0,0 +1,10 @@
# Contribute to `hebspacy` pipeline
If you have added new features, fixed bugs, or made any other changes in the repo, please follow these instructions:

1. Run existing and new tests on your code. Make sure to place them under `/tests`
2. Navigate to `hebspacy/version.py` and promote the package version according to the [semantic versioning scheme](https://packaging.python.org/en/latest/guides/distributing-packages-using-setuptools/#semantic-versioning-preferred)
3. Run `python setup.py bdist_wheel` to create a new `whl` file
4. You can now install your version by running `pip install XXXXX.whl`
5. Once you are ready to upload your version to [PyPI](https://pypi.org/project/hebspacy/), please contact the package maintainers personally or via email to [[email protected]](mailto:[email protected])

Note that previous model files may need to be repackaged to work with the new version.
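The version bump in step 2 is mechanical for a backwards-compatible fix: increment the patch component of the `major.minor.patch` string. A tiny sketch (hypothetical helper, not part of the repo):

```python
# Hypothetical helper: bump the patch component of a semantic version
# string, e.g. when updating hebspacy/version.py for a bug-fix release.

def bump_patch(version: str) -> str:
    major, minor, patch = (int(part) for part in version.split("."))
    return f"{major}.{minor}.{patch + 1}"


print(bump_patch("0.0.1"))  # → 0.0.2
```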
@@ -0,0 +1 @@
```python
__version__ = '0.0.1'
```