Added RemovePII documentation #1440

Merged
merged 1 commit into from
Sep 26, 2024
44 changes: 44 additions & 0 deletions docs/Developer-Guide.md
@@ -511,6 +511,50 @@ The defaults are:

SWIRL is configured to load English stopwords only. To change this, modify `SWIRL_DEFAULT_QUERY_LANGUAGE` in [swirl_server/settings.py](https://github.com/swirlai/swirl-search/blob/main/swirl_server/settings.py) and change it to another [NLTK stopword language](https://stackoverflow.com/questions/54573853/nltk-available-languages-for-stopwords).

## Redact or Remove Personally Identifiable Information (PII) From Queries and/or Results

SWIRL supports the removal or redaction of PII entities using [Microsoft Presidio](https://microsoft.github.io/presidio/). There are three options available:

### `RemovePIIQueryProcessor`

This QueryProcessor removes PII entities from queries.

To use it, install it in the QueryProcessing pipeline for a given SearchProvider:

```
"query_processors": [
    "AdaptiveQueryProcessor",
    "RemovePIIQueryProcessor"
]
```
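As a rough sketch of how the fragment above might be applied programmatically, the following Python appends the processor to a SearchProvider configuration held as a plain dictionary. The `add_processor` helper and the sample `provider` dictionary are hypothetical, introduced only for illustration; they are not part of SWIRL.

```python
import json


def add_processor(provider: dict, key: str, name: str) -> dict:
    """Return a copy of `provider` with `name` appended to the list at `key`.

    Hypothetical helper, not part of SWIRL. The input is left untouched, and
    nothing is added if the processor is already listed.
    """
    updated = dict(provider)
    processors = list(updated.get(key, []))
    if name not in processors:
        processors.append(name)
    updated[key] = processors
    return updated


# Illustrative SearchProvider fragment; real configurations carry many more fields.
provider = {
    "name": "Example Provider",
    "query_processors": ["AdaptiveQueryProcessor"],
}

updated = add_processor(provider, "query_processors", "RemovePIIQueryProcessor")
print(json.dumps(updated, indent=2))
```

Appending keeps the existing processors ahead of the new one, which matches the order shown in the fragment above.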

Or, install it in the PreQueryProcessing pipeline to remove PII from queries sent to all SearchProviders:

In `swirl/models.py`:
```
def getSearchPreQueryProcessorsDefault():
    return ["RemovePIIQueryProcessor"]
```

More information: [ResultProcessors](./Developer-Reference.md#result-processors)

### `RemovePIIResultProcessor`

This ResultProcessor redacts PII entities in results. For example, "James T. Kirk" is replaced by `<PERSON>`. To use it, install it in the ResultProcessing pipeline for a given SearchProvider:

```
"result_processors": [
    "MappingResultProcessor",
    "DateFinderResultProcessor",
    "CosineRelevancyResultProcessor",
    "RemovePIIResultProcessor"
]
```

More information: [ResultProcessors](./Developer-Reference.md#result-processors)
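
The difference between removing PII (queries) and redacting it (results) can be sketched in a few lines of Python. This is a toy illustration only: the two regular expressions stand in for Presidio's entity detection, which is far broader, and the function names and tag names are assumptions chosen to resemble the `<PERSON>` example, not SWIRL's implementation.

```python
import re

# Toy detectors standing in for Presidio's recognizers (assumption: two
# simple patterns are enough to show the idea; real detection covers more).
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def remove_pii(text: str) -> str:
    """Delete detected entities and collapse leftover whitespace (query style)."""
    for pattern in PATTERNS.values():
        text = pattern.sub("", text)
    return " ".join(text.split())


def redact_pii(text: str) -> str:
    """Replace each detected entity with a tag naming its type (result style)."""
    for entity_type, pattern in PATTERNS.items():
        text = pattern.sub(f"<{entity_type}>", text)
    return text


sample = "contact kirk@example.com or 555-123-4567 about warp drives"
removed = remove_pii(sample)
redacted = redact_pii(sample)
print(removed)
print(redacted)
```

Removal leaves no trace of the entity, which suits a query that is about to be sent to a source; redaction keeps a typed placeholder so a reader of the result can still see that something was there.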

### `RemovePIIPostResultProcessor`

## Understand the Explain Structure

The [CosineRelevancyProcessor](Developer-Reference.html#cosinerelevancypostresultprocessor) outputs a JSON structure that explains the `swirl_score` for each result. It is displayed by default; to hide it, add `&explain=False` to any mixer URL.
6 changes: 6 additions & 0 deletions docs/Developer-Reference.md
@@ -983,6 +983,7 @@ This table describes the query processors included in SWIRL:
| GenericQueryProcessor | Removes special characters from the query | |
| SpellcheckQueryProcessor | Uses [TextBlob](https://textblob.readthedocs.io/en/dev/quickstart.html#spelling-correction) to predict and fix spelling errors in `query_string` | Best deployed in a `SearchProvider.query_processor` for sources that need it; not recommended with Google PSEs |
| NoModQueryProcessor | Only removes leading SearchProvider Tags and does not modify the query terms in any way. | It is intended for repositories that allow non-search characters (such as brackets). |
| RemovePIIQueryProcessor | Removes PII entities from the query. It does not replace them. | |

## Result Processors

@@ -999,6 +1000,7 @@ The following table lists the Result Processors included with SWIRL:
| DateFinderResultProcessor | Looks for a date in any of a number of formats in the body field of each result item. Should it find one, and the `date_published` for that item is `'unknown'`, it replaces `date_published` with the date extracted from the body, and notes this in the `result.messages`. | This processor can detect the following date formats:<br/> `06/01/23`<br/>`06/01/2023`<br/>`06-01-23`<br/>`06-01-2023`<br/>`jun 1, 2023`<br/>`june 1, 2023` |
| AutomaticPayloadMapperResultProcessor | Profiles response data to find good strings for SWIRL's `title`, `body`, and `date_published` fields. It is intended for SearchProviders that would otherwise have few (or no) good `result_mappings` options. | It should be placed after the `MappingResultProcessor`. The `result_mappings` field should be blank, except for the optional DATASET directive, which will return only a single SWIRL response for each provider response, with the original response in the `payload` field under the `dataset` key. |
| RequireQueryStringInTitleResultProcessor | Drops results that do not contain the `query_string_to_provider` in the result `title` field. | It should be added after the `MappingResultProcessor` and is now included by default in the "LinkedIn - Google PSE" SearchProvider. |
| RemovePIIResultProcessor | Redacts PII entities in all result fields for configured SearchProviders, including payload string fields, with a generic tag showing the entity type. For example, "James T. Kirk" -> `<PERSON>`. | This processor may be installed before or after the CosineRelevancyResultProcessor. If it runs before, query terms that are PII entities will not be used in relevancy ranking, since they will already have been redacted. More information: [https://microsoft.github.io/presidio/](https://microsoft.github.io/presidio/) |
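
The ordering note for `RemovePIIResultProcessor` can be illustrated with a toy example. The sketch below assumes a deliberately naive term-overlap score in place of SWIRL's cosine relevancy and a single email pattern in place of Presidio; `redact` and `term_overlap` are hypothetical names used only here.

```python
import re

# Single toy pattern standing in for real PII detection (assumption).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact(text: str) -> str:
    """Replace email addresses with a generic entity tag."""
    return EMAIL.sub("<EMAIL_ADDRESS>", text)


def term_overlap(query: str, text: str) -> int:
    """Count distinct query terms present in text (naive stand-in for relevancy)."""
    terms = set(query.lower().split())
    words = set(text.lower().split())
    return len(terms & words)


query = "kirk@example.com enterprise"
body = "Reach kirk@example.com aboard the Enterprise"

# Scoring first sees the PII term; scoring after redaction cannot match it.
score_before_redaction = term_overlap(query, body)
score_after_redaction = term_overlap(query, redact(body))
print(score_before_redaction, score_after_redaction)
```

In this toy setup the PII term contributes to the score only when scoring happens before redaction, which is the trade-off the table entry describes.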

## Post Result Processors

@@ -1064,6 +1066,10 @@ The `DropIrrelevantPostResultProcessor` drops results with `swirl_score < settin
{: .highlight }
The Galaxy UI will not display the correct number of results if this ResultProcessor is deployed.

### `RemovePIIPostResultProcessor`

This processor is identical in most respects to the [RemovePIIResultProcessor](#result-processors), except that it operates on every result in a result set, not just the results from a single SearchProvider.
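
The scope difference can be sketched as follows. This is a simplified model under stated assumptions: results are plain dictionaries grouped by provider name, redaction is a fixed string replacement, and the function names are hypothetical rather than SWIRL's actual interfaces.

```python
Result = dict[str, str]


def redact(text: str) -> str:
    # Toy redaction: one hard-coded name stands in for real entity detection.
    return text.replace("James T. Kirk", "<PERSON>")


def process_provider_results(results: list[Result]) -> list[Result]:
    """Per-provider stage: redacts the results of one SearchProvider."""
    return [{**r, "body": redact(r["body"])} for r in results]


def process_result_set(by_provider: dict[str, list[Result]]) -> dict[str, list[Result]]:
    """Post-result stage: redacts every result in the set, whatever its provider."""
    return {
        name: process_provider_results(results)
        for name, results in by_provider.items()
    }


result_set = {
    "Provider A": [{"body": "Captain James T. Kirk"}],
    "Provider B": [{"body": "James T. Kirk, Starfleet"}],
}

# Result-processor scope: only the provider it is configured on is redacted.
only_a = {**result_set, "Provider A": process_provider_results(result_set["Provider A"])}

# Post-result-processor scope: the whole result set is redacted.
everything = process_result_set(result_set)
print(only_a)
print(everything)
```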

# Mixers

The following table details the Result Mixers included with SWIRL: