From d128de674c4dfd92a13402bbf8abde62e64105db Mon Sep 17 00:00:00 2001
From: Sid
Date: Thu, 26 Sep 2024 17:30:47 -0400
Subject: [PATCH] Added RemovePII documentation DS-2974

---
 docs/Developer-Guide.md     | 44 +++++++++++++++++++++++++++++++++++++
 docs/Developer-Reference.md |  6 +++++
 2 files changed, 50 insertions(+)

diff --git a/docs/Developer-Guide.md b/docs/Developer-Guide.md
index 4a17de85..31893a46 100644
--- a/docs/Developer-Guide.md
+++ b/docs/Developer-Guide.md
@@ -511,6 +511,50 @@ The defaults are:
 
 SWIRL is configured to load English stopwords only. To change this, modify `SWIRL_DEFAULT_QUERY_LANGUAGE` in [swirl_settings/settings.py](https://github.com/swirlai/swirl-search/blob/main/swirl_server/settings.py) and change it to another [NLTK stopword language](https://stackoverflow.com/questions/54573853/nltk-available-languages-for-stopwords).
 
+## Redact or Remove Personally Identifiable Information (PII) From Queries and/or Results
+
+SWIRL supports the removal or redaction of PII entities using [Microsoft Presidio](https://microsoft.github.io/presidio/). There are three options available:
+
+### `RemovePIIQueryProcessor`
+
+This QueryProcessor removes PII entities from queries.
+
+To use it, install it in the QueryProcessing pipeline for a given SearchProvider:
+
+```
+"query_processors": [
+    "AdaptiveQueryProcessor",
+    "RemovePIIQueryProcessor"
+]
+```
+
+Or, install it in the PreQueryProcessing pipeline to remove PII from queries sent to all SearchProviders:
+
+In `swirl/models.py`:
+```
+def getSearchPreQueryProcessorsDefault():
+    return ["RemovePIIQueryProcessor"]
+```
+
+More information: [QueryProcessors](./Developer-Reference.md#query-processors)
+
+### `RemovePIIResultProcessor`
+
+This ResultProcessor redacts PII entities in results. For example, "James T. Kirk" is replaced by `<PERSON>`. To use it, install it in the ResultProcessing pipeline for a given SearchProvider.
+
+```
+"result_processors": [
+    "MappingResultProcessor",
+    "DateFinderResultProcessor",
+    "CosineRelevancyResultProcessor",
+    "RemovePIIResultProcessor"
+]
+```
+
+More information: [ResultProcessors](./Developer-Reference.md#result-processors)
+
+### `RemovePIIPostResultProcessor`
+
 ## Understand the Explain Structure
 
 The [CosineRelevancyProcessor](Developer-Reference.html#cosinerelevancypostresultprocessor) outputs a JSON structure that explains the `swirl_score` for each result. It is displayed by default; to hide it add `&explain=False` to any mixer URL.
diff --git a/docs/Developer-Reference.md b/docs/Developer-Reference.md
index a21fe2b7..9b652aeb 100644
--- a/docs/Developer-Reference.md
+++ b/docs/Developer-Reference.md
@@ -983,6 +983,7 @@ This table describes the query processors included in SWIRL:
 | GenericQueryProcessor | Removes special characters from the query | |
 | SpellcheckQueryProcessor | Uses [TextBlob](https://textblob.readthedocs.io/en/dev/quickstart.html#spelling-correction) to predict and fix spelling errors in `query_string` | Best deployed in a `SearchProvider.query_processor` for sources that need it; not recommended with Google PSEs |
 | NoModQueryProcessor | Only removes leading SearchProvider Tags and does not modify the query terms in any way. | It is intended for repositories that allow non-search characters (such as brackets). |
+| RemovePIIQueryProcessor | Removes PII entities from the query. It does not replace them. | |
 
 ## Result Processors
 
@@ -999,6 +1000,7 @@ The following table lists the Result Processors included with SWIRL:
 | DateFinderResultProcessor | Looks for a date in any a number of formats in the body field of each result item. Should it find one, and the `date_published` for that item is `'unknown'`, it replaces `date_published` with the date extracted from the body, and notes this in the `result.messages`. | This processor can detect the following date formats:<br/>`06/01/23`<br/>`06/01/2023`<br/>`06-01-23`<br/>`06-01-2023`<br/>`jun 1, 2023`<br/>`june 1, 2023` |
 | AutomaticPayloadMapperResultProcessor | Profiles response data to find good strings for SWIRL's `title`, `body`, and `date_published` fields. It is intended for SearchProviders that would otherwise have few (or no) good `result_mappings` options. | It should be place after the `MappingResultProcessor`. The `result_mappings` field should be blank, except for the optional DATASET directive, which will return only a single SWIRL response for each provider response, with the original response in the `payload` field under the `dataset` key. |
 | RequireQueryStringInTitleResultProcessor | Drops results that do not contain the `query_string_to_provider` in the result `title` field. | It should be added after the `MappingResultProcessor` and is now included by default in the "LinkedIn - Google PSE" SearchProvider. |
+| RemovePIIResultProcessor | Redacts PII entities in all result fields for configured SearchProviders, including payload string fields, replacing each with a generic tag that shows the entity type. For example, "James T. Kirk" -> `<PERSON>`. | This processor may be installed before or after the CosineRelevancyResultProcessor. If it runs before, query terms that are PII entities will not be used in relevancy ranking, since they will already be redacted. More information: [https://microsoft.github.io/presidio/](https://microsoft.github.io/presidio/) |
 
 ## Post Result Processors
 
@@ -1064,6 +1066,10 @@ The `DropIrrelevantPostResultProcessor` drops results with `swirl_score < settin
 
 {: .highlight }
 The Galaxy UI will not display the correct number of results if this ResultProcessor is deployed.
+### `RemovePIIPostResultProcessor`
+
+This processor is identical in most respects to the [RemovePIIResultProcessor](#result-processors), except that it operates on all results in a result set, not just those from a single SearchProvider.
+
 # Mixers
 
 The following table details the Result Mixers included with SWIRL: