-
-
Notifications
You must be signed in to change notification settings - Fork 153
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
612 Introduced Percolator Recap Search Alerts #4200
Conversation
Semgrep found 12
Class RECAPPercolator inherits from both Semgrep found 5
Detected a segment of a Flask template where autoescaping is explicitly disabled with '| safe' filter. This allows rendering of raw HTML in this segment. Ensure no user data is rendered here, otherwise this is a cross-site scripting (XSS) vulnerability. Ignore this finding from template-unescaped-with-safe.Semgrep found 6
QuerySet.extra' does not provide safeguards against SQL injection and requires very careful use. SQL injection can lead to critical data being stolen by attackers. Instead of using '.extra', use the Django ORM and parameterized queries such as |
- Group and send RT RECAP Search Alerts.
6570ca5
to
32f8ca9
Compare
- Fixed parent highlighting for cross-object hits
…proach. - Disabled percolation for all tests except the specific ones for alerts
13a604d
to
8373409
Compare
- Avoid indexing RECAP alerts in this test where the Percolator index doesn't exist.
- Including an integration sweep-percolator test
…olator indices. - Improved merge_alert_child_documents logic - Added sql migration file
…avoid issues with OA scheduled tasks - These methods and tasks can be removed a few days after rolling them out.
…ch-alerts-percolator
migrations.AddIndex( | ||
model_name="scheduledalerthit", | ||
index=models.Index( | ||
fields=["content_type", "object_id"], | ||
name="alerts_sche_content_c5e627_idx", | ||
), | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since both fields are new, the index creation process should be relatively quick. however, to make sure we don't slow down anything else while it's working, let's use AddIndexConcurrently
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! I split the migrations into two files: one for adding the new fields and another for the concurrent index, since it needs to be outside of a transaction (atomic = False)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@albertisfu, here are some suggestions for minor improvements to this pull request.
cl/alerts/tasks.py
Outdated
:param response: A two tuple, a list of Alerts triggered and the document | ||
data that triggered the alert. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this docstring is incorrect. The PercolatorResponsesType is a tuple with 5 elements
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I updated the docstring.
cl/alerts/utils.py
Outdated
main_search_after: int | None = None, | ||
rd_search_after: int | None = None, | ||
d_search_after: int | None = None, | ||
) -> tuple[Response, Response | None, Response | None]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A data class would provide a more structured and readable way to represent these elements, preventing potential misunderstandings about their order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah added PercolatorResponses
dataclass
…ch-alerts-percolator
…gs in update_es_documents
- Fixed failing tests.
Semgrep found 1 QuerySet.extra' does not provide safeguards against SQL injection and requires very careful use. SQL injection can lead to critical data being stolen by attackers. Instead of using '.extra', use the Django ORM and parameterized queries such as |
@ERosendo thank you for your comments and suggestions, this is ready for another review! |
I made an additional change. We had @mlissner when this is merged the percolator will be disabled for RECAP Alerts. then it's required to create the
If, for any reason, you need to delete the Percolator index, you can run:
Additionally, to enable the sending of RT alerts every 5 minutes, it's required to set up the daemon Once that's in place, you can enable the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This PR introduces the Percolator approach for RECAP Search Alerts as planed in #612
It works as follows:
index_docket_parties_in_es
, so percolation is included within this method.DocketEntry
is added/updated, we don't percolate on DocketEntry changes and wait for theRECAPDocument
to be added/updated.ESRECAPDocumentPlain
mapping is used. The resultant dict is percolated intoRECAPPercolator
instead of referring to a document.Other Auxiliary Indices:
To avoid triggering alerts when they shouldn't be triggered, such as when a RECAPDocument is ingested and it matches Docket-only query alerts, we use two auxiliary percolator indices (similar to the sweep approach):
RECAPDocumentPercolator
: Only contains RECAPDocument fieldsDocketDocumentPercolator
: Only contains Docket fieldsIt works as follows:
When an alert is matched by the main percolator RECAPPercolator, it uses the auxiliary indices to avoid including a RECAPDocument hit in a Docket-only query alert according to the following boolean table:
Whenever a new RECAP search Alert is created, it's indexed into the main
RECAPPercolator
index and also intoRECAPDocumentPercolator
andDocketDocumentPercolator
. However, not all alerts are indexed into the auxiliary indices. If an alert contains a Docket field as a filter, it is not indexed into the RECAPDocumentPercolator because this index doesn't contain docket fields, and vice versa for RECAPDocument field filters andDocketDocumentPercolator
. This is ok because filters already discard documents that don't contain the field, so auxiliary indices are required for filtering out the text query.Since the Percolator doesn't support parent-child queries, queries are stored in the percolator as plain. A new method,
build_plain_percolator_query
, was created to transform parent-child queries to plain.To prevent a document from triggering the same alert more than once, similar to the sweep index approach, we use a Redis set:
alert_hits:id.d
stores Dockets that have triggered an alert.alert_hits:id.r
stores RECAPDocuments that have triggered the alert.These sets are checked and updated before and after an alert is matched by the document.
Grouping Alert Emails:
To avoid triggering one email per each alert matched by a document ingestion, as it currently works in OA, RECAP alert hits for all rates (including RT) are stored using the
ScheduledAlertHit
. RT alerts are sent every 5 minutes using the new daemoncl_send_rt_percolator_alerts
. Before sending them, hits are grouped per user and alert matched, so if multiple dockets or RECAPDocuments match an alert, they're grouped and nested within the same alert and Docket (in the case of RECAPDocuments). Other alert rates are sent according to their rate via thecl_send_scheduled_alerts
command.To limit the number of
ScheduledAlertHit
that can be stored, I added acontent_object
field to theScheduledAlertHit
model. This allows easy querying and counting of scheduled alert hits by its model, limiting the number of hits an alert can have (20) and the number of nested RECAPDocuments a hit can have (5).For webhooks, there is no limit; all matched hits will be sent. I'll open a different issue to add a rate-limit or throttling to webhooks as we discussed.
Webhooks:
Webhooks for all rates are always triggered in real-time as alerts are matched by the percolator. The serializer used in RECAP Search Alerts webhooks is
RECAPESResultSerializer
, the same used in V4 RECAP Search API, supporting nested RECAPDocuments into the Docket.Highlights:
RECAP Search alerts, both in emails and webhooks, support the same fields highlighted in the front end, either for the Docket or nested RECAPDocuments.
Screenshots and Examples:
Email with multiple alerts and grouping applied:
Webhook with nested document and HL.
Oral Arguments Alerts:
They'll continue as usual after this PR is merged. I'll open a different issue to apply grouping to OA Search Alerts.
Old alert tasks
I duplicated celery tasks and related methods used by OA. So we can prevent that alerts scheduled fail once this is deployed. After a couple of days of this is deployed, we could remove those tasks.
Finally, I added a new setting:
PERCOLATOR_SEARCH_ALERTS_ENABLED
, which is useful to avoid conflicts in tests that create RECAP-related documents but the Percolator indices do not exist. This setting is only enabled in Alert tests. Once we are ready to start percolating RECAP documents, we should set this setting toTrue
.Additionally, we'll need to create the following indices manually:
RECAPPercolator.init()
RECAPDocumentPercolator.init()
DocketDocumentPercolator.init()
Let me know what do you think.