612 Introduced Percolator Recap Search Alerts #4200

albertisfu · 2024-07-13T03:30:59Z

This PR introduces the Percolator approach for RECAP Search Alerts as planned in #612.

It works as follows:

Dockets and RECAPDocuments are percolated into the RECAPPercolator index to match alerts. Documents are percolated via their indexing signals:
- Docket:
  - On creation/update
  - When related BankruptcyInformation is added/updated
  - When related parties are added/updated. In this case, it is not via signals since parties use their own method to update the DocketDocument in ES, index_docket_parties_in_es, so percolation is included within this method.
  - As discussed, for now, when related BankruptcyInformation or Parties are percolated, there will be a previous percolation for the same docket triggered when the Docket is saved. A future performance improvement would be to avoid percolating the Docket if we know we will add BankruptcyInformation or Parties, so it gets percolated only once this additional data is added.
- RECAPDocument creation/update:
  - Considering a RECAPDocument is always created/updated after its related DocketEntry is added/updated, we don't percolate on DocketEntry changes and wait for the RECAPDocument to be added/updated.
Docket documents use the same approach as OA percolation, referring to the document by its ID from the main index where it was indexed.
For RECAPDocuments, the approach is different. We need to percolate a plain version of the RECAPDocument that contains Parties and other docket fields required to match and render alerts. To build this plain document, the new ESRECAPDocumentPlain mapping is used. The resultant dict is percolated into RECAPPercolator instead of referring to a document.
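The two percolation styles described above correspond to the two forms of the Elasticsearch `percolate` query: one that references a document already stored in an index, and one that carries the document inline. The following is a minimal sketch of the request bodies only. The helper names, the `percolator_query` field name, and the `recap_vectors` index name are assumptions for illustration, not code from this PR.

```python
from typing import Any


def percolate_by_reference(
    index: str, doc_id: str, field: str = "percolator_query"
) -> dict[str, Any]:
    """Build a percolate query that points at a document already indexed in `index`.

    This mirrors the Docket case: the document is referred to by its ID.
    """
    return {"query": {"percolate": {"field": field, "index": index, "id": doc_id}}}


def percolate_inline(
    document: dict[str, Any], field: str = "percolator_query"
) -> dict[str, Any]:
    """Build a percolate query that carries the document itself.

    This mirrors the RECAPDocument case: a plain dict (RECAPDocument fields
    plus the docket fields and parties) is sent instead of an ID.
    """
    return {"query": {"percolate": {"field": field, "document": document}}}


# Hypothetical usage: index name and document values are made up.
docket_body = percolate_by_reference("recap_vectors", "663")
rd_body = percolate_inline({"id": 531, "docket_id": 663, "description": "MOTION for Leave"})
```

The inline form is what makes the plain document necessary: the percolator only sees the fields present in the dict it is given, so docket-level fields have to be copied into it.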

Other Auxiliary Indices:

To avoid triggering alerts when they shouldn't be triggered, such as when a RECAPDocument is ingested and it matches Docket-only query alerts, we use two auxiliary percolator indices (similar to the sweep approach):

RECAPDocumentPercolator: Only contains RECAPDocument fields
DocketDocumentPercolator: Only contains Docket fields

It works as follows:

When an alert is matched by the main percolator RECAPPercolator, it uses the auxiliary indices to avoid including a RECAPDocument hit in a Docket-only query alert according to the following boolean table:

| Alert matched in DocketDocumentPercolator | Alert matched in RECAPDocumentPercolator | Trigger Alert | Description |
| --- | --- | --- | --- |
| False | False | True | AND cross-object queries |
| False | True | True | RD-only queries |
| True | False | False | Docket-only queries |
| True | True | True | OR cross-object queries |
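The boolean table above reduces to a single rule: on RECAPDocument ingestion, an alert is suppressed only when it matched in DocketDocumentPercolator and did not match in RECAPDocumentPercolator. A small sketch of that rule, with a hypothetical function name:

```python
def should_trigger_alert(matched_docket_only: bool, matched_rd_only: bool) -> bool:
    """Decide whether a RECAPDocument hit should trigger an alert.

    matched_docket_only: the alert matched in DocketDocumentPercolator.
    matched_rd_only: the alert matched in RECAPDocumentPercolator.
    The only suppressed case is a Docket-only query (True, False).
    """
    return not (matched_docket_only and not matched_rd_only)


# Enumerate the four rows of the table.
truth_table = {
    (d, r): should_trigger_alert(d, r) for d in (False, True) for r in (False, True)
}
```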

Whenever a new RECAP Search Alert is created, it's indexed into the main RECAPPercolator index and, where applicable, into RECAPDocumentPercolator and DocketDocumentPercolator. Not all alerts are indexed into the auxiliary indices. If an alert contains a Docket field as a filter, it is not indexed into RECAPDocumentPercolator because that index doesn't contain docket fields, and vice versa for RECAPDocument field filters and DocketDocumentPercolator. This is OK because filters already discard documents that don't contain the field, so the auxiliary indices are only needed to classify matches on the text query.
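The indexing rule for auxiliary indices can be sketched as a function from an alert's filters to the set of indices it is stored in. The filter names in the two sets below are illustrative assumptions, and `percolator_indices_for` is a hypothetical helper, not a function from this PR.

```python
# Illustrative filter names only; the real lists live in the codebase.
DOCKET_FILTERS = {"docket_number", "case_name", "court", "assigned_to"}
RD_FILTERS = {"description", "document_number", "attachment_number"}


def percolator_indices_for(alert_filters: set[str]) -> set[str]:
    """Return the percolator indices an alert should be indexed into."""
    targets = {"RECAPPercolator"}  # every alert goes into the main index
    if not alert_filters & DOCKET_FILTERS:
        # No docket filter, so the RD-only index can evaluate the query.
        targets.add("RECAPDocumentPercolator")
    if not alert_filters & RD_FILTERS:
        # No RECAPDocument filter, so the Docket-only index can evaluate it.
        targets.add("DocketDocumentPercolator")
    return targets


text_only = percolator_indices_for(set())
docket_filtered = percolator_indices_for({"docket_number"})
rd_filtered = percolator_indices_for({"description"})
both_filtered = percolator_indices_for({"docket_number", "description"})
```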

Since the Percolator doesn't support parent-child queries, queries are stored in the percolator as plain. A new method, build_plain_percolator_query, was created to transform parent-child queries to plain.
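To illustrate the idea of turning a parent-child query into a plain one, the sketch below unwraps every `has_child` clause and keeps its inner query, so the result can be evaluated against a single flat document. This is not the PR's `build_plain_percolator_query`; the function name and the sample query are assumptions, and the real method likely handles more cases.

```python
from typing import Any


def flatten_has_child(query: Any) -> Any:
    """Recursively replace {"has_child": {..., "query": Q}} with Q."""
    if isinstance(query, list):
        return [flatten_has_child(item) for item in query]
    if not isinstance(query, dict):
        return query  # scalars are returned unchanged
    if set(query) == {"has_child"}:
        # Drop the join wrapper and keep only the child query.
        return flatten_has_child(query["has_child"]["query"])
    return {key: flatten_has_child(value) for key, value in query.items()}


parent_child_query = {
    "bool": {
        "should": [
            {"match": {"caseName": "subpoenas"}},
            {
                "has_child": {
                    "type": "recap_document",
                    "query": {"match": {"description": "amicus"}},
                }
            },
        ]
    }
}
plain_query = flatten_has_child(parent_child_query)
```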

To prevent a document from triggering the same alert more than once, similar to the sweep index approach, we use a Redis set:

alert_hits:id.d stores Dockets that have triggered an alert.
alert_hits:id.r stores RECAPDocuments that have triggered the alert.

These sets are checked before an alert is triggered for a document and updated once the document has matched the alert.
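The check-then-record pattern on those Redis sets can be sketched as follows. To keep the example self-contained, it uses a tiny in-memory stand-in that exposes the same `sismember` and `sadd` methods as a redis-py client; the helper names and the stand-in class are hypothetical.

```python
class InMemoryRedis:
    """Minimal stand-in exposing the two set commands used below."""

    def __init__(self) -> None:
        self._sets: dict[str, set] = {}

    def sismember(self, name: str, value) -> bool:
        return value in self._sets.get(name, set())

    def sadd(self, name: str, *values) -> int:
        members = self._sets.setdefault(name, set())
        before = len(members)
        members.update(values)
        return len(members) - before  # number of newly added members


def hits_key(alert_id: int, kind: str) -> str:
    """Key layout from the description: kind is "d" (Docket) or "r" (RECAPDocument)."""
    return f"alert_hits:{alert_id}.{kind}"


def record_hit_once(r, alert_id: int, kind: str, object_id: int) -> bool:
    """Return True the first time an object triggers an alert, False afterwards."""
    key = hits_key(alert_id, kind)
    if r.sismember(key, object_id):
        return False
    r.sadd(key, object_id)
    return True


r = InMemoryRedis()
first = record_hit_once(r, 866, "d", 663)
second = record_hit_once(r, 866, "d", 663)
rd_first = record_hit_once(r, 866, "r", 531)
```

With a real client, the return value of `sadd` alone (1 when the member is new, 0 otherwise) could replace the separate `sismember` check and make the operation atomic.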

Grouping Alert Emails:

To avoid triggering one email per alert matched by a document ingestion, as currently happens in OA, RECAP alert hits for all rates (including RT) are stored using the ScheduledAlertHit model. RT alerts are sent every 5 minutes using the new daemon cl_send_rt_percolator_alerts. Before sending them, hits are grouped per user and per matched alert, so if multiple Dockets or RECAPDocuments match an alert, they're grouped and nested within the same alert and Docket (in the case of RECAPDocuments). Other alert rates are sent according to their rate via the cl_send_scheduled_alerts command.
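The grouping step can be pictured as nesting hits by user, then alert, then Docket, with RECAPDocument IDs collected under their Docket. A minimal sketch with a hypothetical `Hit` tuple standing in for a stored hit (the actual model and field names differ):

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    user_id: int
    alert_id: int
    docket_id: int
    rd_id: Optional[int]  # None when the Docket itself matched


def group_hits(hits: list) -> dict:
    """Group hits as {user_id: {alert_id: {docket_id: [rd_id, ...]}}}."""
    grouped = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for hit in hits:
        # Touching the bucket registers the Docket even for Docket-only hits.
        rd_ids = grouped[hit.user_id][hit.alert_id][hit.docket_id]
        if hit.rd_id is not None:
            rd_ids.append(hit.rd_id)
    # Convert nested defaultdicts into plain dicts for the caller.
    return {
        user: {alert: dict(dockets) for alert, dockets in alerts.items()}
        for user, alerts in grouped.items()
    }


emails = group_hits(
    [
        Hit(344, 866, 663, 531),
        Hit(344, 866, 663, 532),
        Hit(344, 866, 700, None),
        Hit(12, 900, 663, 531),
    ]
)
```

Each top-level key then corresponds to one email, regardless of how many alerts or documents matched for that user during the window.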

To limit the number of ScheduledAlertHit that can be stored, I added a content_object field to the ScheduledAlertHit model. This allows easy querying and counting of scheduled alert hits by its model, limiting the number of hits an alert can have (20) and the number of nested RECAPDocuments a hit can have (5).
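A rough sketch of how the two limits might be applied before storing a new hit. The constants come from the description above; the function name, its arguments, and the exact counting rule are assumptions rather than the PR's implementation.

```python
from typing import Optional

MAX_HITS_PER_ALERT = 20  # hits an alert can accumulate
MAX_CHILD_HITS = 5       # nested RECAPDocuments per hit


def can_schedule_hit(
    alert_hit_count: int, child_hit_count: Optional[int] = None
) -> bool:
    """Return True if another hit may be stored for the alert.

    alert_hit_count: hits already stored for the alert.
    child_hit_count: RECAPDocuments already nested under the same Docket;
    None when the new hit is a Docket rather than a RECAPDocument.
    """
    if alert_hit_count >= MAX_HITS_PER_ALERT:
        return False
    if child_hit_count is not None and child_hit_count >= MAX_CHILD_HITS:
        return False
    return True
```

In the PR the counts would come from querying ScheduledAlertHit by content type, which is what the new content_object field makes straightforward.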

For webhooks, there is no limit; all matched hits will be sent. I'll open a different issue to add a rate-limit or throttling to webhooks as we discussed.

Webhooks:

Webhooks for all rates are always triggered in real-time as alerts are matched by the percolator. The serializer used in RECAP Search Alerts webhooks is RECAPESResultSerializer, the same used in V4 RECAP Search API, supporting nested RECAPDocuments into the Docket.

Highlights:

RECAP Search alerts, both in emails and webhooks, support the same fields highlighted in the front end, either for the Docket or nested RECAPDocuments.

Screenshots and Examples:

Email with multiple alerts and grouping applied:

Webhook with nested document and HL.

{
   "payload":{
      "alert":{
         "id":866,
         "name":"Test Alert Cross-object",
         "rate":"rt",
         "user":344,
         "query":"q=\"File Amicus Curiae\" AND \"Motion to File 1\" AND \"plain text lorem\" AND \"410 Civil\" AND id:531&docket_number=1:21-bk-123&case_name=\"SUBPOENAS SERVED CASE\"&type=r",
         "alert_type":"r",
         "secret_key":"sCN6rjYMVD6HChMhxvzL3lCIiuMFaWlUJm5x2Wth",
         "date_created":"2024-07-25T14:15:08.083431-07:00",
         "date_last_hit":"None",
         "date_modified":"2024-07-25T14:15:08.083442-07:00"
      },
      "results":[
         {
            "firm":[
               
            ],
            "meta":{
               "timestamp":"2024-07-25T21:15:08.274909Z",
               "date_created":"2024-07-25T21:15:07.839735Z"
            },
            "cause":"<strong>410 Civil</strong>",
            "court":"Superior court for the dragons",
            "party":[
               
            ],
            "chapter":"None",
            "firm_id":[
               
            ],
            "attorney":[
               
            ],
            "caseName":"<strong>SUBPOENAS SERVED CASE</strong>",
            "court_id":"canb",
            "party_id":[
               
            ],
            "dateFiled":"None",
            "docket_id":663,
            "assignedTo":"None",
            "dateArgued":"1972-05-21",
            "juryDemand":"",
            "referredTo":"None",
            "suitNature":"",
            "attorney_id":[
               
            ],
            "trustee_str":"None",
            "docketNumber":"<strong>1:21-bk-123</strong>",
            "pacer_case_id":"242568",
            "assigned_to_id":"None",
            "case_name_full":"Stephenson and Sons, Stephens, Lowery and Beck, Duke Ltd, Adkins, Price and Stevens, and Williams and Sons v. Mark Kelly, Michael Guzman, Anthony Hansen, Gerald Tate, and Maria Vazquez",
            "dateTerminated":"None",
            "referred_to_id":"None",
            "recap_documents":[
               {
                  "id":531,
                  "meta":{
                     "timestamp":"2024-07-25T21:15:08.274909Z",
                     "date_created":"2024-07-25T21:15:07.839735Z"
                  },
                  "cites":[
                     
                  ],
                  "snippet":"<strong>plain text lorem</strong>",
                  "page_count":"None",
                  "description":"MOTION for Leave to <strong>File Amicus Curiae</strong> Lorem Served",
                  "absolute_url":"/docket/663/1/subpoenas-served-case/",
                  "entry_number":1,
                  "is_available":false,
                  "pacer_doc_id":"01803665981",
                  "document_type":"PACER Document",
                  "filepath_local":"None",
                  "docket_entry_id":269,
                  "document_number":1,
                  "entry_date_filed":"2024-08-19",
                  "attachment_number":"None",
                  "short_description":"<strong>Motion to File 1</strong>"
               }
            ],
            "jurisdictionType":"",
            "docket_absolute_url":"/docket/663/subpoenas-served-case/",
            "court_citation_string":"SCOTUS"
         }
      ]
   },
   "webhook":{
      "version":1,
      "event_type":2,
      "date_created":"2024-07-25T21:15:06.968890+00:00",
      "deprecation_date":"None"
   }
}

Oral Arguments Alerts:

They'll continue as usual after this PR is merged. I'll open a different issue to apply grouping to OA Search Alerts.

Old alert tasks

I duplicated the Celery tasks and related methods used by OA so that alerts already scheduled don't fail once this is deployed. A couple of days after deployment, we can remove those tasks.

Finally, I added a new setting:
PERCOLATOR_SEARCH_ALERTS_ENABLED, which is useful to avoid conflicts in tests that create RECAP-related documents when the Percolator indices do not exist. This setting is only enabled in Alert tests. Once we are ready to start percolating RECAP documents, we should set it to True.

Additionally, we'll need to create the following indices manually:

RECAPPercolator.init()
RECAPDocumentPercolator.init()
DocketDocumentPercolator.init()

Let me know what you think.

semgrep-app · 2024-07-13T03:34:03Z

Semgrep found 12 baseclass-attribute-override findings:

cl/search/documents.py
- L1839 - Triage
- L1839 - Triage
- L1839 - Triage
- L1839 - Triage
- L1839 - Triage
- L1839 - Triage
- L1959 - Triage
- L1959 - Triage
- L1959 - Triage
- L1959 - Triage
- 2 more - Triage

Class RECAPPercolator inherits from both DocketDocument and ESRECAPDocument which both have a method named prepare_trustee_str; one of these methods will be overwritten.

_{Ignore this finding from baseclass-attribute-override.}

Semgrep found 5 template-unescaped-with-safe findings:

cl/alerts/templates/alert_email_es.html
- L39 - Triage
- L58 - Triage
- L61 - Triage
- L65 - Triage
- L108 - Triage

Detected a segment of a Flask template where autoescaping is explicitly disabled with '| safe' filter. This allows rendering of raw HTML in this segment. Ensure no user data is rendered here, otherwise this is a cross-site scripting (XSS) vulnerability.

_{Ignore this finding from template-unescaped-with-safe.}

Semgrep found 6 avoid-query-set-extra findings:

cl/lib/elasticsearch_utils.py
- L3054-3056 - Triage
- L3068-3070 - Triage
- L3111-3113 - Triage
- L3119-3121 - Triage
- L3127-3131 - Triage
cl/alerts/utils.py
- L134 - Triage

'QuerySet.extra' does not provide safeguards against SQL injection and requires very careful use. SQL injection can lead to critical data being stolen by attackers. Instead of using '.extra', use the Django ORM and parameterized queries such as People.objects.get(name='Bob').

_{Ignore this finding from avoid-query-set-extra.}

CLAassistant · 2024-07-19T02:05:01Z

All committers have signed the CLA.

ERosendo · 2024-10-17T20:36:17Z

cl/alerts/migrations/0010_add_schedule_alert_hit_content_type.py

+        migrations.AddIndex(
+            model_name="scheduledalerthit",
+            index=models.Index(
+                fields=["content_type", "object_id"],
+                name="alerts_sche_content_c5e627_idx",
+            ),
+        ),


Since both fields are new, the index creation process should be relatively quick. However, to make sure we don't slow down anything else while it's working, let's use AddIndexConcurrently.

Done! I split the migrations into two files: one for adding the new fields and another for the concurrent index, since it needs to be outside of a transaction (atomic = False)

ERosendo

@albertisfu, here are some suggestions for minor improvements to this pull request.

cl/search/types.py

ERosendo · 2024-10-18T14:07:56Z

cl/alerts/tasks.py

+    :param response: A two tuple, a list of Alerts triggered and the document
+    data that triggered the alert.


I think this docstring is incorrect. The PercolatorResponsesType is a tuple with 5 elements.

Yeah I updated the docstring.

cl/alerts/utils.py

ERosendo · 2024-10-18T15:08:51Z

cl/alerts/utils.py

+    main_search_after: int | None = None,
+    rd_search_after: int | None = None,
+    d_search_after: int | None = None,
+) -> tuple[Response, Response | None, Response | None]:


A data class would provide a more structured and readable way to represent these elements, preventing potential misunderstandings about their order.

Yeah added PercolatorResponses dataclass
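For context, a dataclass replacing the three-element return tuple might look like the sketch below. The field names are guesses based on the function parameters shown in the diff, and `Any` stands in for the Elasticsearch `Response` type so the example runs without dependencies.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class PercolatorResponses:
    """Named container for the main and auxiliary percolator responses."""

    main_response: Any                  # response from the main percolator
    rd_response: Optional[Any] = None   # RECAPDocument-only auxiliary index
    d_response: Optional[Any] = None    # Docket-only auxiliary index


# Callers access fields by name instead of relying on tuple position.
responses = PercolatorResponses(main_response={"hits": []})
```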

cl/alerts/utils.py

cl/lib/es_signal_processor.py

cl/search/documents.py

cl/alerts/utils.py

cl/alerts/management/commands/cl_send_rt_percolator_alerts.py

semgrep-app · 2024-10-19T01:28:42Z

Semgrep found 1 avoid-query-set-extra finding:

cl/alerts/utils.py
- L147 - Triage

'QuerySet.extra' does not provide safeguards against SQL injection and requires very careful use. SQL injection can lead to critical data being stolen by attackers. Instead of using '.extra', use the Django ORM and parameterized queries such as People.objects.get(name='Bob').

_{Ignore this finding from avoid-query-set-extra.}

albertisfu · 2024-10-19T02:37:57Z

@ERosendo thank you for your comments and suggestions, this is ready for another review!

albertisfu · 2024-10-19T03:54:00Z

I made an additional change. We had PERCOLATOR_SEARCH_ALERTS_ENABLED to disable the percolator via a setting that was False by default. However, this setting was also disabling Oral Arguments Search Alerts, which are currently enabled in production.
Therefore, I renamed the setting to PERCOLATOR_RECAP_SEARCH_ALERTS_ENABLED, so it now only disables percolation related to RECAP Alerts.

@mlissner when this is merged, the percolator will be disabled for RECAP Alerts. Before enabling it, the RECAPPercolator index must be created:

from cl.search.documents import RECAPPercolator
RECAPPercolator.init()

If, for any reason, you need to delete the Percolator index, you can run:

RECAPPercolator._index.delete(ignore=404)

Additionally, to enable the sending of RT alerts every 5 minutes, it's required to set up the daemon cl-send-rt-percolator-alerts, which is already configured in docker-entrypoint.sh

Once that's in place, you can enable the PERCOLATOR_RECAP_SEARCH_ALERTS_ENABLED setting to start the percolation of documents related to RECAP.

ERosendo

LGTM

albertisfu added 3 commits July 12, 2024 17:00

feat(alerts): Introduced RECAP Alerts percolator

39b96ec

fix(alerts): Refactored percolate_document method

bfeff5d

feat(alerts): Enabled RECAP Alerts signals for indexing and removal

5e83e72

albertisfu changed the base branch from main to 612-introduced-recap-search-alerts July 13, 2024 03:31

albertisfu added 2 commits July 15, 2024 20:11

feat(alerts): Trigger RECAP search alerts upon document ingestion

8229a90

feat(alerts): Trigger RECAP Alerts on related documents.

32f8ca9

- Group and send RT RECAP Search Alerts.

albertisfu force-pushed the 612-recap-search-alerts-percolator branch from 6570ca5 to 32f8ca9 Compare July 17, 2024 02:04

albertisfu added 4 commits July 17, 2024 15:09

fix(alerts): Merge RECAPDocuments alert hits into the main case hit.

15dabaa

fix(alerts): Fixed limit parent and child ScheduledAlertHit

e2f30b5

- Fixed parent highlighting for cross-object hits

fix(alerts): Fixed percolate docket after parties are up to date

d5d4046

fix(alerts): Fixed multiple tests errors related to the percolator ap…

8373409

…proach. - Disabled percolation for all tests except the specific ones for alerts

albertisfu force-pushed the 612-recap-search-alerts-percolator branch from 13a604d to 8373409 Compare July 19, 2024 15:43

albertisfu added 13 commits July 19, 2024 19:54

feat(alerts): Filter out percolator alert hits upon RECAPDocument ing…

652c2de

…estion

fix(alerts): Fixed SearchAlertsIndexingCommandTests

a980d6f

- Avoid indexing RECAP alerts in this test where the Percolator index doesn't exist.

fix(alerts): Fixed RECAP search alerts percolator webhooks

aa7e380

fix(alerts): Fixed percolator index conflicts in tests

1448417

fix(alerts): Optimize RECAP document indexing in alerts tests

1f38290

fix(alerts): Show View additional results button if child hits limit …

75c1444

…is reached

fix(alerts): Fix merge_documents type hint

90c7d9d

fix(alerts): Added additional percolator tests

7c1a934

- Including an integration sweep-percolator test

fix(alerts): Merge parent HL in percolator webhooks

8fdbfd7

fix(alerts): Fix bug when removing a non-existing alert from the perc…

dd866c8

…olator indices. - Improved merge_alert_child_documents logic - Added sql migration file

fix(alerts): Restore the original alert-related tasks and methods to …

63dd5a9

…avoid issues with OA scheduled tasks - These methods and tasks can be removed a few days after rolling them out.

Merge branch '612-introduced-recap-search-alerts' into 612-recap-sear…

20b6008

…ch-alerts-percolator

fix(alerts): Fixed search alerts tally stats

9c49e15

albertisfu marked this pull request as ready for review July 25, 2024 21:36

albertisfu requested a review from mlissner July 25, 2024 21:36

albertisfu mentioned this pull request Jul 26, 2024

Realtime alerts should group whenever possible #3102

Open

albertisfu added 2 commits September 27, 2024 09:47

fix(alerts): Fixed merged conflicts and resolved failing tests

b88ad4d

Merge branch '612-introduced-recap-search-alerts' into 612-recap-sear…

bad3f72

…ch-alerts-percolator

ERosendo reviewed Oct 17, 2024

ERosendo requested changes Oct 18, 2024

albertisfu added 4 commits October 18, 2024 13:09

Merge branch '612-introduced-recap-search-alerts' into 612-recap-sear…

e6bf25a

…ch-alerts-percolator

fix(alerts): Made PercolatorResponsesType a dataclass

7dd1e43

fix(alerts): Introduced SendAlertsResponse and PercolatorResponses

180727a

fix(alerts): Use partial instead of lambda to prevent late binding bu…

b1f2d13

…gs in update_es_documents

Base automatically changed from 612-introduced-recap-search-alerts to main October 19, 2024 00:17

fix(elasticsearch): Restored use of partial for update_es_document task

b043965

- Fixed failing tests.

semgrep-app bot reviewed Oct 19, 2024

cl/search/documents.py

semgrep-app bot reviewed Oct 19, 2024

cl/alerts/utils.py

semgrep-app bot reviewed Oct 19, 2024

cl/alerts/management/commands/cl_send_rt_percolator_alerts.py

albertisfu added 2 commits October 18, 2024 20:17

fix(alerts): Fixed hits_count query due to mypy error

e284ca8

Merge branch 'main' into 612-recap-search-alerts-percolator

5df7f58

albertisfu added 2 commits October 18, 2024 21:15

fix(alerts): Disable percolator alerts by setting only for RECAP

a78de51

fix(alerts): Fixed send_or_schedule_search_alerts argument in index_d…

f640add

…ocket_parties_in_es

ERosendo added 7 commits October 19, 2024 00:45

Merge branch 'main' into 612-recap-search-alerts-percolator

39da4bb

Merge branch 'main' into 612-recap-search-alerts-percolator

5c48895

Merge branch 'main' into 612-recap-search-alerts-percolator

e5935a6

docs(alerts): Updates docstring for send_or_schedule_search_alerts

ab292dc

docs(alerts): Update percolator response processing method docstring

9fa2cc1

docs(search): Updates update_es_document docstring

6b69de9

docs(search): Updates es_save_document docstring

363527c

ERosendo approved these changes Oct 23, 2024

mlissner merged commit 0b10713 into main Oct 24, 2024
13 checks passed

mlissner deleted the 612-recap-search-alerts-percolator branch October 24, 2024 21:18
