Commit
Merge pull request #2 from dkarmon/main
Initial merge
roycboyc authored Feb 17, 2022
2 parents 98fe59e + 635ecc0 commit 98b3027
Showing 18 changed files with 663 additions and 2 deletions.
36 changes: 36 additions & 0 deletions .github/workflows/python-app.yml
@@ -0,0 +1,36 @@
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: Python application

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  build:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v2
    - name: Set up Python 3.8
      uses: actions/setup-python@v2
      with:
        python-version: "3.8"
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install flake8 pytest
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
    - name: Lint with flake8
      run: |
        # stop the build if there are Python syntax errors or undefined names
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
    - name: Test with pytest
      run: |
        pytest
82 changes: 80 additions & 2 deletions README.md
@@ -1,2 +1,80 @@
# HebSpaCy

A custom spaCy pipeline for Hebrew text, including a transformer-based multitask NER model that recognizes 16 entity types in Hebrew, among them `GPE`, `PER`, `LOC`, and `ORG`.

----
[![MIT license](https://img.shields.io/badge/license-MIT-brightgreen.svg)](http://opensource.org/licenses/MIT) ![Release](https://img.shields.io/github/v/release/8400TheHealthNetwork/HebSpacy.svg) [![PyPI version](https://badge.fury.io/py/hebspacy.svg)](https://badge.fury.io/py/hebspacy) [![Pypi Downloads](https://img.shields.io/pypi/dm/hebspacy.svg)](https://img.shields.io/pypi/dm/hebspacy.svg)

## Installation

To run the pipeline, install the package as well as a model, preferably in a virtual environment:

``` sh
# Create conda env (optional)
conda create --name hebspacy python=3.8
conda activate hebspacy

# Install hebspacy
pip install hebspacy

# Download and install the model (see the available models below)
pip install </path/to/download>
```

#### Available Models
| Model | Description | Install URL |
| ----- | ----------- | ----------- |
| he_ner_news_trf | A full spaCy pipeline for Hebrew text including a multitask NER model trained against the BMC and NEMO corpora. Read more [here](#he_ner_news_trf).| [Download](https://github.com/8400TheHealthNetwork/HebSpacy/releases/download/he_ner_news_trf-3.2.1/he_ner_news_trf-3.2.1-py3-none-any.whl)


## Getting started
```python
import spacy

nlp = spacy.load("he_ner_news_trf")
text = """מרגלית דהן
מספר זהות 11278904-5
2/12/2001
ביקור חוזר מ18.11.2001
במסגרת בירור פלפיטציות ואי סבילות למאמצים,מנורגיות קשות ע"ר שרירנים- ביצעה מעבדה שהדגימה:
המוגלובין 9, מיקרוציטי היפוכרומטי עם RDW 19,
פריטין 10, סטורציית טרנספרין 8%.
מבחינת עומס נגיפי HIV- undetectable ומקפידה על HAART
"""

doc = nlp(text)
for entity in doc.ents:
    print(f"{entity.text} \t {entity.label_}: {entity._.confidence_score:.4f} ({entity.start_char},{entity.end_char})")

>>> מרגלית דהן PERS: 0.9999 (0,10)
>>> 2/12/2001 DATE: 0.9897 (33,42)
>>> מ18.11.2001 DATE: 0.8282 (54,65)
>>> 8% PERCENT: 0.9932 (230,232)
```

---------------
### he_ner_news_trf
`he_ner_news_trf` is a multitask model built from [AlephBert](https://arxiv.org/pdf/2104.04052.pdf) and two NER-focused heads, each trained against a different NER-annotated Hebrew corpus:
1. [NEMO corpus](https://github.com/OnlpLab/NEMO-Corpus) - annotations of the Hebrew Treebank (Haaretz newspaper) for the widely-used OntoNotes entity category: `GPE` (geo-political entity), `PER` (person), `LOC` (location), `ORG` (organization), `FAC` (facility), `EVE` (event), `WOA` (work-of-art), `ANG` (language), `DUC` (product).
2. [BMC corpus](https://www.cs.bgu.ac.il/~elhadad/nlpproj/naama/) - annotations of articles from Israeli newspapers and websites (Haaretz newspaper, Maariv newspaper, Channel 7) for the common entity categories: `PERS` (person), `LOC` (location), `ORG` (organization), `DATE` (date), `TIME` (time), `MONEY` (money), `PERCENT` (percent), `MISC__AFF` (misc affiliation), `MISC__ENT` (misc entity), `MISC_EVENT` (misc event).

The model was developed and trained using the Hugging Face and PyTorch libraries, and was later integrated into a spaCy pipeline.

#### Model integration
The output model was split into three weight files: _the transformer embeddings, the BMC head, and the NEMO head_.
The components were each packaged in a separate pipe and integrated into the custom pipeline.
Finally, a custom NER head consolidation pipe was added at the end of the pipeline; it resolves conflicting and overlapping predictions between the heads and sets the `Doc.ents` property.
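
Per the packaging script in `contribute/model/package.py`, the components are assembled in a fixed order; a minimal sketch for inspecting the installed pipeline:

```python
import spacy

nlp = spacy.load("he_ner_news_trf")
print(nlp.pipe_names)
# Expected order, per the packaging script:
# ['sentencizer', 'transformer', 'ner_bmc', 'ner_nemo', 'consolidator']
```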

To access the entities recognized by each NER head, use the `Doc._.<ner_head>` property (e.g., `doc._.nemo_ents` and `doc._.bmc_ents`).
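
For example, a minimal sketch reusing the `doc` from the snippet above:

```python
# Entities recognized by each head individually, before consolidation
for ent in doc._.bmc_ents:
    print("bmc:", ent.text, ent.label_)
for ent in doc._.nemo_ents:
    print("nemo:", ent.text, ent.label_)
```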

--------------
## Contribution
You are welcome to contribute to the `hebspacy` project and introduce new features and models.
Kindly follow the [pipeline codebase instructions](contribute/pipeline/README.md) and the [model training and packaging guidelines](contribute/model/README.md).


-----

HebSpaCy is an open-source project developed by [8400 The Health Network](https://www.8400thn.org/).
70 changes: 70 additions & 0 deletions contribute/model/README.md
@@ -0,0 +1,70 @@
# Contribute your own `hebspacy` model
You are welcome to contribute by training and packaging your own `hebspacy` model. Please follow the instructions below to enable a seamless loading process.

## Model Training
You may choose the training codebase that best fits your requirements, as long as you save the following files:
1. The pretrained transformer layers (post fine-tuning), saved **separately** from the NER heads. This file should be named `pytorch_model.bin`.
2. All the files required by `transformers.AutoModel`, including the standard `config.json`, `special_tokens.json`, `tokenizer_config.json`, and `vocab.txt`.
3. The weights of each NER head, saved as a separate `bin` file with a corresponding index-to-class mapping `json` file (see instructions below). These files should follow the `ner_<name>.bin` and `ner_<name>.json` naming convention.

**All weight files should be trained using the Hugging Face and PyTorch libraries.**

For example, the following directory contains all the required files for a model that was jointly trained against the BMC and NEMO corpora:
````
resources/
├── config.json
├── pytorch_model.bin
├── special_tokens.json
├── tokenizer_config.json
├── vocab.txt
├── ner_bmc.bin
├── ner_bmc.json
├── ner_nemo.bin
└── ner_nemo.json
````
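
One way to produce this layout is sketched below, assuming a joint model whose transformer backbone and NER heads are separate `torch.nn.Module` objects; the `transformer`, `ner_heads`, and `index_to_class` attributes are hypothetical and stand in for whatever your training codebase uses:

````python
import json
import torch

def save_for_hebspacy(model, tokenizer, output_dir: str):
    # Save the fine-tuned transformer layers on their own, alongside the
    # standard Hugging Face files (config.json, vocab.txt, tokenizer_config.json, ...)
    model.transformer.save_pretrained(output_dir)  # writes pytorch_model.bin + config.json
    tokenizer.save_pretrained(output_dir)

    # Save each NER head separately, next to its index-to-class mapping
    for name, head in model.ner_heads.items():  # e.g., {"bmc": ..., "nemo": ...}
        torch.save(head.state_dict(), f"{output_dir}/ner_{name}.bin")
        with open(f"{output_dir}/ner_{name}.json", "w", encoding="utf-8") as f:
            json.dump(head.index_to_class, f, ensure_ascii=False, indent=2)
````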

### Index to class mapping file
Each NER head should include a `json` file that maps each model class index to the corresponding token class name.
Note that indices `0` and `1` should always be associated with the `[PAD]` and `O` classes, respectively. Also, the token annotations should follow the [**IOB2**](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) schema.


Here is an example of the index to class mapping `json` file for `ner_bmc`:
````json
{
  "0": "[PAD]",
  "1": "O",
  "2": "B-PERS",
  "3": "I-PERS",
  "4": "B-LOC",
  "5": "I-LOC",
  "6": "B-ORG",
  "7": "I-ORG",
  "8": "B-TIME",
  "9": "I-TIME",
  "10": "B-DATE",
  "11": "I-DATE",
  "12": "B-MONEY",
  "13": "I-MONEY",
  "14": "B-PERCENT",
  "15": "I-PERCENT",
  "16": "B-MISC__AFF",
  "17": "I-MISC__AFF",
  "18": "B-MISC__ENT",
  "19": "I-MISC__ENT",
  "20": "B-MISC_EVENT",
  "21": "I-MISC_EVENT"
}
````
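
A packaging or validation script can load this mapping and verify the reserved slots; a minimal sketch using the file above:

````python
import json

# Load the index-to-class mapping (ner_bmc.json from the example above)
with open("ner_bmc.json", encoding="utf-8") as f:
    idx2class = {int(k): v for k, v in json.load(f).items()}

# Indices 0 and 1 are reserved for [PAD] and O, respectively
assert idx2class[0] == "[PAD]" and idx2class[1] == "O"
labels = [idx2class[i] for i in range(len(idx2class))]
````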

## Model Packaging
Once you have prepared the directory with all the required files, please follow these steps:
1. Fork this repo (in case you haven't already)
2. Make sure that `spacy` is installed in your Python environment (**make sure it is the same version as specified in `requirements.txt`**)
3. Navigate to the repo's root
4. Run `python setup.py develop`, which should create a `hebspacy.egg-info` directory
5. Navigate to `contribute/model`
6. Update `meta.json` accordingly (make sure to follow the [spaCy package naming conventions](https://spacy.io/models#conventions))
7. Run `python package.py <RESOURCES_DIR> <OUTPUT_DIR>`, where `<RESOURCES_DIR>` should point to the directory with all the files from the previous section.
8. Run `python -m spacy package <OUTPUT_DIR> <WHEEL_DIR> --build wheel`
9. Your `whl` and `tar.gz` files will be ready under `<WHEEL_DIR>/<lang>_<name>-<version>/dist`
10. Install the model by running `pip install XXXXX.whl`
10 changes: 10 additions & 0 deletions contribute/model/meta.json
@@ -0,0 +1,10 @@
{
  "lang": "he",
  "name": "ner_news_trf",
  "version": "3.2.1",
  "description": "Hebrew pipeline. Components: tok2vec, senter, multihead ner (bmc and nemo).",
  "author": "hebspacy",
  "email": "[email protected]",
  "url": "https://github.com/8400TheHealthNetwork/HebSpacy",
  "license": "MIT License (MIT)"
}
55 changes: 55 additions & 0 deletions contribute/model/package.py
@@ -0,0 +1,55 @@
import json
import os
import shutil
from pathlib import Path
from spacy.lang.xx import MultiLanguage
from thinc.config import Config
import argparse


def package(resource_dir: str, output_dir: str):
    input_dir = Path(resource_dir)
    target_dir = Path(output_dir)

    # start from a clean output directory
    if target_dir.exists():
        shutil.rmtree(target_dir)
    target_dir.mkdir()

    nlp = MultiLanguage()
    # update metadata
    with open("meta.json", "r") as f:
        meta = json.load(f)
    nlp.meta.update(meta)
    nlp.lang = "he"

    # Add sentence segmentation
    nlp.add_pipe("sentencizer")

    # Add transformer
    config = Config().from_disk("transformer.cfg")
    transformer_config = config["transformer"]
    transformer_config["model"]["name"] = str(input_dir)
    transformer = nlp.add_pipe("transformer", config=config["transformer"])
    transformer.model.initialize()

    # Add ner heads
    for ner_head in input_dir.glob("ner_*.bin"):
        pipe = nlp.add_pipe("ner_head", name=ner_head.stem, last=True)
        pipe.load(input_dir)
        os.mkdir(target_dir / ner_head.stem)

    # Add consolidator
    nlp.add_pipe("consolidator", last=True)

    nlp.to_disk(target_dir)


if __name__ == '__main__':
    # make sure to run `python setup.py develop` before running this script
    parser = argparse.ArgumentParser()
    parser.add_argument('resources_dir', help="directory containing the weights and configuration files")
    parser.add_argument('output_dir', default=".", help="pipeline output directory")

    args = parser.parse_args()

    package(args.resources_dir, args.output_dir)
16 changes: 16 additions & 0 deletions contribute/model/transformer.cfg
@@ -0,0 +1,16 @@
[transformer]
max_batch_items = 4096

[transformer.set_extra_annotations]
@annotation_setters = "spacy-transformers.null_annotation_setter.v1"

[transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
# name = WILL BE AUTOMATICALLY ADDED BY SCRIPT
tokenizer_config = {"use_fast": true}
transformer_config = {}
mixed_precision = false
grad_scaler_config = {}

[transformer.model.get_spans]
@span_getters = "spacy-transformers.sent_spans.v1"
10 changes: 10 additions & 0 deletions contribute/pipeline/README.md
@@ -0,0 +1,10 @@
# Contribute to `hebspacy` pipeline
If you have added a new feature, fixed bugs, or made any other changes to the repo, please follow these instructions:

1. Run existing and new tests on your code. Make sure to place new tests under `/tests`
2. Navigate to `hebspacy/version.py` and bump the package version according to the [semantic versioning scheme](https://packaging.python.org/en/latest/guides/distributing-packages-using-setuptools/#semantic-versioning-preferred)
3. Run `python setup.py bdist_wheel` to create a new `whl` file
4. You can now install your version by running `pip install XXXXX.whl`
5. Once you are ready to upload your version to [PyPI](https://pypi.org/project/hebspacy/), please contact the package maintainers directly or via email at [[email protected]](mailto:[email protected])

Note that previous model files may need to be repackaged to work with the new version.
1 change: 1 addition & 0 deletions hebspacy/__init__.py
@@ -0,0 +1 @@
__version__ = '0.0.1'
