612 Introduced Percolator Recap Search Alerts #4200

Merged
mlissner merged 46 commits into main from 612-recap-search-alerts-percolator on Oct 24, 2024

Contributor

@albertisfu albertisfu commented Jul 13, 2024

This PR introduces the Percolator approach for RECAP Search Alerts as planned in #612

It works as follows:

  • Dockets and RECAPDocuments are percolated into the RECAPPercolator index to match alerts. Documents are percolated via their indexing signals:
    • Docket:
      • On creation/update
      • When related BankruptcyInformation is added/updated
      • When related parties are added/updated. In this case, it is not via signals since parties use their own method to update the DocketDocument in ES, index_docket_parties_in_es, so percolation is included within this method.
      • As discussed, for now, when related BankruptcyInformation or Parties are percolated, there will be a previous percolation for the same docket triggered when the Docket is saved. A future performance improvement would be to avoid percolating the Docket if we know we will add BankruptcyInformation or Parties, so it gets percolated only once this additional data is added.
    • RECAPDocument creation/update:
      • Considering a RECAPDocument is always created/updated after its related DocketEntry is added/updated, we don't percolate on DocketEntry changes and wait for the RECAPDocument to be added/updated.
  • Docket documents use the same approach as OA percolation, referring to the document by its ID from the main index where it was indexed.
  • For RECAPDocuments, the approach is different. We need to percolate a plain version of the RECAPDocument that contains Parties and other docket fields required to match and render alerts. To build this plain document, the new ESRECAPDocumentPlain mapping is used. The resultant dict is percolated into RECAPPercolator instead of referring to a document.
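The difference between the two percolation styles can be sketched as two Elasticsearch percolate request bodies. This is only an illustration: the percolator field name `percolator_query` and the helper names are assumptions, not taken from this PR. The `percolate` query itself accepts either an `index`/`id` pair pointing at a stored document or an inline `document`.

```python
from typing import Any

# Assumed name of the percolator-type field in the index mapping (hypothetical).
PERCOLATOR_FIELD = "percolator_query"


def percolate_by_reference(index: str, doc_id: str) -> dict[str, Any]:
    """Docket style: point the percolator at a document already indexed elsewhere."""
    return {
        "query": {
            "percolate": {
                "field": PERCOLATOR_FIELD,
                "index": index,  # index where the document was originally indexed
                "id": doc_id,
            }
        }
    }


def percolate_inline(document: dict[str, Any]) -> dict[str, Any]:
    """RECAPDocument style: send a plain dict that already includes docket fields."""
    return {
        "query": {
            "percolate": {
                "field": PERCOLATOR_FIELD,
                "document": document,
            }
        }
    }
```

For example, `percolate_inline({"id": 531, "caseName": "SUBPOENAS SERVED CASE"})` carries the document in the request itself, so nothing has to exist in a main index under that shape.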

Other Auxiliary Indices:

To avoid triggering alerts when they shouldn't be triggered, such as when a RECAPDocument is ingested and it matches Docket-only query alerts, we use two auxiliary percolator indices (similar to the sweep approach):

  • RECAPDocumentPercolator: Only contains RECAPDocument fields
  • DocketDocumentPercolator: Only contains Docket fields

It works as follows:

When an alert is matched by the main percolator RECAPPercolator, it uses the auxiliary indices to avoid including a RECAPDocument hit in a Docket-only query alert according to the following boolean table:

| Alert matched in DocketDocumentPercolator | Alert matched in RECAPDocumentPercolator | Trigger Alert | Description |
| --- | --- | --- | --- |
| False | False | True | AND Cross-object queries |
| False | True | True | RD-only queries. |
| True | False | False | Docket-only queries. |
| True | True | True | OR Cross-object queries |
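The table reduces to a single condition: a RECAPDocument hit is suppressed only when the alert matches the Docket-only index and not the RECAPDocument-only index. A minimal sketch, with a hypothetical function name:

```python
def should_trigger_alert(matched_in_docket_index: bool, matched_in_rd_index: bool) -> bool:
    """Decide whether a RECAPDocument hit should trigger an alert.

    Mirrors the boolean table: only the (True, False) row, a Docket-only
    query, is suppressed.
    """
    return not (matched_in_docket_index and not matched_in_rd_index)
```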

Whenever a new RECAP Search Alert is created, it's indexed into the main RECAPPercolator index and, where applicable, into the RECAPDocumentPercolator and DocketDocumentPercolator auxiliary indices. Not every alert goes into both auxiliary indices: if an alert contains a Docket field as a filter, it is not indexed into RECAPDocumentPercolator because that index doesn't contain Docket fields, and vice versa for RECAPDocument field filters and DocketDocumentPercolator. This is fine because filters already discard documents that don't contain the field, so the auxiliary indices are only needed to evaluate the text query.

Since the Percolator doesn't support parent-child queries, queries are stored in the percolator as plain. A new method, build_plain_percolator_query, was created to transform parent-child queries to plain.
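To illustrate the idea (not the actual build_plain_percolator_query implementation, which may differ), here is a minimal sketch that walks a query dict and replaces each `has_child` or `has_parent` wrapper with its inner query. The function name is hypothetical.

```python
from typing import Any

JOIN_KEYS = ("has_child", "has_parent")


def flatten_parent_child(query: Any) -> Any:
    """Recursively unwrap has_child / has_parent clauses into plain queries."""
    if isinstance(query, list):
        return [flatten_parent_child(item) for item in query]
    if not isinstance(query, dict):
        return query  # leaf value (string, number, bool, None)
    for join_key in JOIN_KEYS:
        if set(query) == {join_key}:
            # Keep only the wrapped query; the join "type" no longer applies.
            return flatten_parent_child(query[join_key]["query"])
    return {key: flatten_parent_child(value) for key, value in query.items()}
```

A `bool` query with one `has_child` clause and one plain `match` clause comes out as a `bool` query with two plain `match` clauses, which is a shape the percolator can store.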

To prevent a document from triggering the same alert more than once, similar to the sweep index approach, we use a Redis set:

  • alert_hits:id.d stores Dockets that have triggered an alert.
  • alert_hits:id.r stores RECAPDocuments that have triggered the alert.

These sets are checked before, and updated after, an alert is matched by the document.
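The check-then-record pattern can be sketched as follows. To keep the example runnable without a Redis server, it uses a small in-memory stand-in exposing the same `sismember` and `sadd` calls that redis-py provides. The helper names and the combined check-and-add step are simplifying assumptions.

```python
from collections import defaultdict


class InMemorySets:
    """Stand-in for a Redis client, limited to the two set commands used here."""

    def __init__(self) -> None:
        self._sets: dict[str, set[str]] = defaultdict(set)

    def sismember(self, key: str, member: str) -> bool:
        return member in self._sets[key]

    def sadd(self, key: str, member: str) -> int:
        added = 0 if member in self._sets[key] else 1
        self._sets[key].add(member)
        return added


def hits_key(alert_id: int, kind: str) -> str:
    """kind is 'd' for Dockets or 'r' for RECAPDocuments."""
    return f"alert_hits:{alert_id}.{kind}"


def record_hit_once(store: InMemorySets, alert_id: int, kind: str, object_id: int) -> bool:
    """Return True the first time an object hits an alert, False afterwards."""
    key = hits_key(alert_id, kind)
    if store.sismember(key, str(object_id)):
        return False  # this document already triggered the alert
    store.sadd(key, str(object_id))
    return True
```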

Grouping Alert Emails:

To avoid sending one email per alert matched by a document ingestion, as OA currently does, RECAP alert hits for all rates (including RT) are stored using the ScheduledAlertHit model. RT alerts are sent every 5 minutes by the new daemon cl_send_rt_percolator_alerts. Before sending, hits are grouped per user and matched alert, so if multiple Dockets or RECAPDocuments match an alert, they're grouped and nested within the same alert and Docket (in the case of RECAPDocuments). Other alert rates are sent according to their rate via the cl_send_scheduled_alerts command.

To limit the number of ScheduledAlertHit that can be stored, I added a content_object field to the ScheduledAlertHit model. This allows easy querying and counting of scheduled alert hits by its model, limiting the number of hits an alert can have (20) and the number of nested RECAPDocuments a hit can have (5).
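The grouping and the two limits can be sketched together. This is a simplification: in the PR the limits are enforced when hits are stored, while the sketch applies them while grouping, and the `Hit` type and function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

MAX_HITS_PER_ALERT = 20  # dockets per alert, per the description above
MAX_RDS_PER_HIT = 5      # nested RECAPDocuments per docket hit


@dataclass(frozen=True)
class Hit:
    user_id: int
    alert_id: int
    docket_id: int
    rd_id: Optional[int] = None  # None means a docket-level hit


def group_hits(hits: list[Hit]) -> dict[int, dict[int, dict[int, list[int]]]]:
    """Group hits as user -> alert -> docket -> [RECAPDocument ids]."""
    grouped: dict[int, dict[int, dict[int, list[int]]]] = {}
    for hit in hits:
        dockets = grouped.setdefault(hit.user_id, {}).setdefault(hit.alert_id, {})
        if hit.docket_id not in dockets:
            if len(dockets) >= MAX_HITS_PER_ALERT:
                continue  # alert already holds the maximum number of hits
            dockets[hit.docket_id] = []
        rd_ids = dockets[hit.docket_id]
        if hit.rd_id is not None and len(rd_ids) < MAX_RDS_PER_HIT:
            rd_ids.append(hit.rd_id)
    return grouped
```

Each user then receives one email in which every alert lists its dockets, and each docket lists at most five matching documents.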

For webhooks, there is no limit; all matched hits will be sent. I'll open a different issue to add a rate-limit or throttling to webhooks as we discussed.

Webhooks:

Webhooks for all rates are always triggered in real time as alerts are matched by the percolator. The serializer used in RECAP Search Alerts webhooks is RECAPESResultSerializer, the same one used in the V4 RECAP Search API, which supports RECAPDocuments nested within the Docket.

Highlights:

RECAP Search alerts, both in emails and webhooks, support the same fields highlighted in the front end, either for the Docket or nested RECAPDocuments.

Screenshots and Examples:

Email with multiple alerts and grouping applied:
[Screenshot, 2024-07-25 4:11:19 PM]
[Screenshot, 2024-07-25 4:11:31 PM]

Webhook with a nested document and highlighting (HL):

{
   "payload":{
      "alert":{
         "id":866,
         "name":"Test Alert Cross-object",
         "rate":"rt",
         "user":344,
         "query":"q=\"File Amicus Curiae\" AND \"Motion to File 1\" AND \"plain text lorem\" AND \"410 Civil\" AND id:531&docket_number=1:21-bk-123&case_name=\"SUBPOENAS SERVED CASE\"&type=r",
         "alert_type":"r",
         "secret_key":"sCN6rjYMVD6HChMhxvzL3lCIiuMFaWlUJm5x2Wth",
         "date_created":"2024-07-25T14:15:08.083431-07:00",
         "date_last_hit":"None",
         "date_modified":"2024-07-25T14:15:08.083442-07:00"
      },
      "results":[
         {
            "firm":[
               
            ],
            "meta":{
               "timestamp":"2024-07-25T21:15:08.274909Z",
               "date_created":"2024-07-25T21:15:07.839735Z"
            },
            "cause":"<strong>410 Civil</strong>",
            "court":"Superior court for the dragons",
            "party":[
               
            ],
            "chapter":"None",
            "firm_id":[
               
            ],
            "attorney":[
               
            ],
            "caseName":"<strong>SUBPOENAS SERVED CASE</strong>",
            "court_id":"canb",
            "party_id":[
               
            ],
            "dateFiled":"None",
            "docket_id":663,
            "assignedTo":"None",
            "dateArgued":"1972-05-21",
            "juryDemand":"",
            "referredTo":"None",
            "suitNature":"",
            "attorney_id":[
               
            ],
            "trustee_str":"None",
            "docketNumber":"<strong>1:21-bk-123</strong>",
            "pacer_case_id":"242568",
            "assigned_to_id":"None",
            "case_name_full":"Stephenson and Sons, Stephens, Lowery and Beck, Duke Ltd, Adkins, Price and Stevens, and Williams and Sons v. Mark Kelly, Michael Guzman, Anthony Hansen, Gerald Tate, and Maria Vazquez",
            "dateTerminated":"None",
            "referred_to_id":"None",
            "recap_documents":[
               {
                  "id":531,
                  "meta":{
                     "timestamp":"2024-07-25T21:15:08.274909Z",
                     "date_created":"2024-07-25T21:15:07.839735Z"
                  },
                  "cites":[
                     
                  ],
                  "snippet":"<strong>plain text lorem</strong>",
                  "page_count":"None",
                  "description":"MOTION for Leave to <strong>File Amicus Curiae</strong> Lorem Served",
                  "absolute_url":"/docket/663/1/subpoenas-served-case/",
                  "entry_number":1,
                  "is_available":false,
                  "pacer_doc_id":"01803665981",
                  "document_type":"PACER Document",
                  "filepath_local":"None",
                  "docket_entry_id":269,
                  "document_number":1,
                  "entry_date_filed":"2024-08-19",
                  "attachment_number":"None",
                  "short_description":"<strong>Motion to File 1</strong>"
               }
            ],
            "jurisdictionType":"",
            "docket_absolute_url":"/docket/663/subpoenas-served-case/",
            "court_citation_string":"SCOTUS"
         }
      ]
   },
   "webhook":{
      "version":1,
      "event_type":2,
      "date_created":"2024-07-25T21:15:06.968890+00:00",
      "deprecation_date":"None"
   }
}

Oral Arguments Alerts:

They'll continue as usual after this PR is merged. I'll open a different issue to apply grouping to OA Search Alerts.

Old alert tasks

I duplicated the Celery tasks and related methods used by OA so that alerts already scheduled don't fail once this is deployed. A couple of days after this is deployed, we could remove those tasks.

Finally, I added a new setting:
PERCOLATOR_SEARCH_ALERTS_ENABLED, which is useful to avoid conflicts in tests that create RECAP-related documents when the Percolator indices do not exist. This setting is only enabled in Alert tests. Once we are ready to start percolating RECAP documents, we should set it to True.

Additionally, we'll need to create the following indices manually:

  • RECAPPercolator.init()
  • RECAPDocumentPercolator.init()
  • DocketDocumentPercolator.init()

Let me know what you think.

@albertisfu albertisfu changed the base branch from main to 612-introduced-recap-search-alerts July 13, 2024 03:31
semgrep-app bot commented Jul 13, 2024

Semgrep found 12 baseclass-attribute-override findings:

Class RECAPPercolator inherits from both DocketDocument and ESRECAPDocument which both have a method named prepare_trustee_str; one of these methods will be overwritten.

Ignore this finding from baseclass-attribute-override.

Semgrep found 5 template-unescaped-with-safe findings:

Detected a segment of a Flask template where autoescaping is explicitly disabled with '| safe' filter. This allows rendering of raw HTML in this segment. Ensure no user data is rendered here, otherwise this is a cross-site scripting (XSS) vulnerability.

Ignore this finding from template-unescaped-with-safe.

Semgrep found 6 avoid-query-set-extra findings:

'QuerySet.extra' does not provide safeguards against SQL injection and requires very careful use. SQL injection can lead to critical data being stolen by attackers. Instead of using '.extra', use the Django ORM and parameterized queries such as People.objects.get(name='Bob').

Ignore this finding from avoid-query-set-extra.

@albertisfu albertisfu force-pushed the 612-recap-search-alerts-percolator branch from 6570ca5 to 32f8ca9 Compare July 17, 2024 02:04
CLAassistant commented Jul 19, 2024

CLA assistant check
All committers have signed the CLA.

@albertisfu albertisfu force-pushed the 612-recap-search-alerts-percolator branch from 13a604d to 8373409 Compare July 19, 2024 15:43
@albertisfu albertisfu marked this pull request as ready for review July 25, 2024 21:36
@albertisfu albertisfu requested a review from mlissner July 25, 2024 21:36
Comment on lines 30 to 36
migrations.AddIndex(
    model_name="scheduledalerthit",
    index=models.Index(
        fields=["content_type", "object_id"],
        name="alerts_sche_content_c5e627_idx",
    ),
),
Contributor

Since both fields are new, the index creation process should be relatively quick. However, to make sure we don't slow down anything else while it's working, let's use AddIndexConcurrently.

Contributor Author

Done! I split the migrations into two files: one for adding the new fields and another for the concurrent index, since it needs to be outside of a transaction (atomic = False)

Contributor

@ERosendo ERosendo left a comment

@albertisfu, here are some suggestions for minor improvements to this pull request.

cl/search/types.py (outdated, resolved)
Comment on lines 650 to 651
:param response: A two tuple, a list of Alerts triggered and the document
data that triggered the alert.
Contributor

I think this docstring is incorrect. The PercolatorResponsesType is a tuple with 5 elements.

Contributor Author

Yeah I updated the docstring.

cl/alerts/utils.py (outdated, resolved)
cl/alerts/utils.py (outdated, resolved)
    main_search_after: int | None = None,
    rd_search_after: int | None = None,
    d_search_after: int | None = None,
) -> tuple[Response, Response | None, Response | None]:
Contributor

A data class would provide a more structured and readable way to represent these elements, preventing potential misunderstandings about their order.

Contributor Author

Yeah added PercolatorResponses dataclass
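
For readers following the thread, a dataclass replacing the three-element tuple might look roughly like the sketch below. The field names are assumptions, not the ones from the PR, and `Any` stands in for the Elasticsearch response type.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class PercolatorResponses:
    """Named container for the main and auxiliary percolator responses."""

    main_response: Any                  # response from the main percolator index
    rd_response: Optional[Any] = None   # RECAPDocument-only auxiliary index
    d_response: Optional[Any] = None    # Docket-only auxiliary index
```

Callers then read `responses.rd_response` instead of relying on tuple position.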

cl/alerts/utils.py (outdated, resolved)
cl/lib/es_signal_processor.py (outdated, resolved)
Base automatically changed from 612-introduced-recap-search-alerts to main October 19, 2024 00:17
semgrep-app bot commented Oct 19, 2024

Semgrep found 1 avoid-query-set-extra finding:

'QuerySet.extra' does not provide safeguards against SQL injection and requires very careful use. SQL injection can lead to critical data being stolen by attackers. Instead of using '.extra', use the Django ORM and parameterized queries such as People.objects.get(name='Bob').

Ignore this finding from avoid-query-set-extra.

@albertisfu
Contributor Author

@ERosendo thank you for your comments and suggestions, this is ready for another review!

@albertisfu
Contributor Author

I made an additional change. We had PERCOLATOR_SEARCH_ALERTS_ENABLED to disable the percolator via a setting that was False by default. However, this setting was also disabling Oral Arguments Search Alerts, which are currently enabled in production.
Therefore, I renamed the setting to PERCOLATOR_RECAP_SEARCH_ALERTS_ENABLED, so it now only disables percolation related to RECAP Alerts.

@mlissner when this is merged, the percolator will be disabled for RECAP Alerts. Before enabling it, the RECAPPercolator index must be created:

from cl.search.documents import RECAPPercolator
RECAPPercolator.init()

If, for any reason, you need to delete the Percolator index, you can run:

RECAPPercolator._index.delete(ignore=404)

Additionally, to enable the sending of RT alerts every 5 minutes, it's required to set up the daemon cl-send-rt-percolator-alerts, which is already configured in docker-entrypoint.sh

Once that's in place, you can enable the PERCOLATOR_RECAP_SEARCH_ALERTS_ENABLED setting to start the percolation of documents related to RECAP.

Contributor

@ERosendo ERosendo left a comment

LGTM

@mlissner mlissner merged commit 0b10713 into main Oct 24, 2024
13 checks passed
@mlissner mlissner deleted the 612-recap-search-alerts-percolator branch October 24, 2024 21:18