From bbda4d67b598ae45815f4f55904a3ba134d28acd Mon Sep 17 00:00:00 2001 From: Sid Date: Mon, 12 Feb 2024 11:57:53 -0500 Subject: [PATCH 1/4] Added tutorial, needs review/testing --- docs/Tutorials.md | 629 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 629 insertions(+) create mode 100644 docs/Tutorials.md diff --git a/docs/Tutorials.md b/docs/Tutorials.md new file mode 100644 index 000000000..80b46ee06 --- /dev/null +++ b/docs/Tutorials.md @@ -0,0 +1,629 @@ +--- +layout: default +title: Tutorial 1 +nav_order: 8 +--- +
+ + Table of Contents + + {: .text-delta } +- TOC +{:toc} +
+ +# Tutorial 1: Extending Swirl + +The following tutorial is intended for a python developer who wants to extend Swirl by adding SearchProviders, Connectors and Processors. Please note this guide assumes use of Swirl 3.x or later. + +
+ +## Assumptions + +* Latest python, 3.11 or later, installed locally +* Redis installed and running +* Swirl installed locally (not in Docker) and running + +``` +% python swirl.py status +__S_W_I_R_L__3_._3_._0__________________________________________________________ + +Service: django...RUNNING, pid:34720 +Service: celery-worker...RUNNING, pid:34734 + + PID TTY TIME CMD +34720 ttys000 0:05.04 /Library/Frameworks/Python.framework/Versions/3.12/Resources/Python.app/Contents/MacOS/Python /Library/Frameworks/Python.framework/Versions/3.12/bin/daphne -b 0.0.0.0 -p 8000 swirl_server.asgi:application +34734 ttys000 0:06.67 /Library/Frameworks/Python.framework/Versions/3.12/Resources/Python.app/Contents/MacOS/Python /Library/Frameworks/Python.framework/Versions/3.12/bin/celery -A swirl_server worker +``` +
+ +# Background: The Swirl Search Workflow + +In a nutshell: + +* User creates a query - example: [http://localhost:8000/swirl/search/?q=ai](http://localhost:8000/swirl/search/?q=ai) +* Pre-query processing - example: [SpellcheckQueryProcessor](https://github.com/swirlai/swirl-search/blob/main/swirl/processors/spellcheck_query.py) + +
+ +{: .highlight } +Each Search Provider executes in parallel + +
+ +* Query processing - example: [AdaptiveQueryProcessor](https://github.com/swirlai/swirl-search/blob/main/swirl/processors/adaptive.py) +* Connector - example: [RequestsGet](https://github.com/swirlai/swirl-search/blob/main/swirl/connectors/requestsget.py) +* Result processing - example: [MappingResultsProcessor](https://github.com/swirlai/swirl-search/blob/main/swirl/processors/mapping.py) + +
+ +{: .highlight } +Parallel processing ends here + +
+ +* Post-result processing - example: [CosineRelevancyPostResultProcessor](https://github.com/swirlai/swirl-search/blob/main/swirl/processors/relevancy.py) +* Ranked-results available via mixer - example: [http://localhost:8000/swirl/results/?search_id=1](http://localhost:8000/swirl/results/?search_id=1) + +For more information, consult the [Developer Guide Workflow Overview](Developer-Guide.md#workflow). + +
+ +# Creating a SearchProvider + +A SearchProvider is a configuration of a Connector. So, to connect to a given source, first, verify that it supports a Connector you already have. (See the next tutorial for information on creating new Connectors.) + +For example, if trying to query a website using a URL like `https://host.com/?q=my+query+here` that returns JSON or XML, create a new SearchProvider configuring the RequestsGet connector as follows: + +* Copy any of the [Google PSE SearchProviders](https://github.com/swirlai/swirl-search/blob/main/SearchProviders/google_pse.json) + +Modify the `url` and `query_template` to construct the query URL. Using the above example: + +``` +{ + "url": "https://my-host.com/", + "query_template": "{url}?q={query_string}", +} +``` + +To learn more about query and URL parameters, refer to the [Developer Guide](./Developer-Guide.md). + +* If the website offers the ability to page through results, or sort results by date (as well as relevancy), use the `PAGE=` and `DATE_SORT` query mappings to add support for these features through Swirl. + +For example: + +``` +TO DO +``` + +For more information refer to the [User Guide, Query Mappings section:](https://github.com/swirlai/swirl-search/wiki/2.-User-Guide#query-mappings) + +* Open the query URL in a browser and look through the JSON response. + +If using Visual Studio Code, right-click on the pasted JSON and select `Format Document`` to make it easier to read. + +* Identify the results list, and number of results found and retrieved. Put these JSON paths in the response_mappings. Then, identify the JSON paths to use to extract the Swirl default fields `title`, `body`, `url`, `date_published` and `author` from each item in the result lists in the result_mappings, with the Swirl field name on the left, and the source JSON path on the right. + +For example: + +``` + "response_mappings": "FOUND=searchInformation.totalResults,RETRIEVED=queries.request[0].count,RESULTS=items", + "result_mappings": "url=link,body=snippet,author=displayLink,cacheId,pagemap.metatags[*].['og:type'],pagemap.metatags[*].['og:site_name'],pagemap.metatags[*].['og:description'],NO_PAYLOAD", + +``` + +* Add credentials as required for the service. + +The format to use depends on the type of credential. Details are here: [User Guide Credentials Section](https://github.com/swirlai/swirl-search/wiki/2.-User-Guide#authentication--credentials) + +* Add a suitable tag that can be used to describe the source or what it knows about. + +Spaces are not permitted; good tags are clear and obvious when used in a query, like `company:tesla` or `news:openai`. + +For more about tags, see: [Organizing SearchProviders](https://github.com/swirlai/swirl-search/wiki/2.-User-Guide#organizing-searchproviders-with-active-default-and-tags) + +* Review the finished SearchProvider: + +``` +{ + "name": "My New SearchProvider", + "connector": "RequestsGet", + "url": "https://host.com/", + "query_template": "{url}?q={query_string}", + "query_processors": [ + "AdaptiveQueryProcessor" + ], + "query_mappings": "", + "result_processors": [ + "MappingResultProcessor", + "CosineRelevancyResultProcessor" + ], + "response_mappings": "FOUND=jsonpath.to.number.found,RETRIEVED=jsonpath.to.number.retrieved,RESULTS=jsonpath.to.result.list", + "result_mappings": "url=link,body=snippet,author=displayLink,NO_PAYLOAD", + "credentials": "bearer=your-bearer-token-here", + "tags": [ + "MyTag" + ] + } +``` + +* Go to Swirl `localhost:8000/swirl/searchproviders/`, logging in if necessary. Put the form at the bottom of the page into RAW mode, and paste the SearchProvider in. Then hit POST. The SearchProvider will reload. + +* Go to Galaxy `localhost:8000/galaxy/` and run a search using the tag you created earlier. Results should again appear in roughly the same period of time. + +
+ +# Creating a Connector + +In Swirl, Connectors are responsible for loading a SearchProvider, then constructing and transmitting queries to a particular type of service, then saving the response - typically a result list. + +:info: Consider using your favorite coding AI to generate a Connector by passing it the Connector base classes, and information about the API you are trying to query. + +:info: If you are trying to send an HTTP/S request to an endpoint that returns JSON or XML, you don't need to create a Connector. Instead, [Create a SearchProvider](#creating-a-searchprovider) that configures the RequestsGet connector included with Swirl. + +To create a new Connector: + +* Create a new file, e.g. `swirl/connectors/my_connector.py` + +* Copy the style of the `ChatGPT` connector as a starting point, or `BigQuery` it targeting a database. + +``` +class MyConnector(Connector): + + def __init__(self, provider_id, search_id, update, request_id=''): + self.system_guide = MODEL_DEFAULT_SYSTEM_GUIDE + super().__init__(provider_id, search_id, update, request_id) +``` + +In the __init__ class, load and persist anything that will be needed when connecting and querying the service. Use the ChatGPT Connector as a guide. + +* Import the python package(s) to connect to the service. The ChatGPT connector uses the openai package, for example: + +``` +import openai +``` + +* Modify the execute_search method to connect to the service. + +As you can see from the ChatGPT Connector, it first loads the OpenAI credentials, then constructs a prompt, sends the prompt via `openai.ChatCompletion.create()`, then stores the response. + +``` + def execute_search(self, session=None): + + logger.debug(f"{self}: execute_search()") + + if self.provider.credentials: + openai.api_key = self.provider.credentials + else: + if getattr(settings, 'OPENAI_API_KEY', None): + openai.api_key = settings.OPENAI_API_KEY + else: + self.status = "ERR_NO_CREDENTIALS" + return + + prompted_query = "" + if self.query_to_provider.endswith('?'): + prompted_query = self.query_to_provider + else: + if 'PROMPT' in self.query_mappings: + prompted_query = self.query_mappings['PROMPT'].format(query_to_provider=self.query_to_provider) + else: + prompted_query = self.query_to_provider + self.warning(f'PROMPT not found in query_mappings!') + + if 'CHAT_QUERY_REWRITE_GUIDE' in self.query_mappings: + self.system_guide = self.query_mappings['CHAT_QUERY_REWRITE_GUIDE'].format(query_to_provider=self.query_to_provider) + + if not prompted_query: + self.found = 0 + self.retrieved = 0 + self.response = [] + self.status = "ERR_PROMPT_FAILED" + return + logger.info(f'CGPT completion system guide:{self.system_guide} query to provider : {self.query_to_provider}') + self.query_to_provider = prompted_query + completions = openai.ChatCompletion.create( + model=MODEL, + messages=[ + {"role": "system", "content": self.system_guide}, + {"role": "user", "content": self.query_to_provider}, + ], + temperature=0, + ) + message = completions['choices'][0]['message']['content'] # FROM API Doc + + self.found = 1 + self.retrieved = 1 + self.response = message.replace("\n\n", "") + + return + +``` + +ChatGPT depends on the OpenAI API key, which is provided to Swirl via the .env file. To follow this pattern, create new values in .env then modify `swirl_server/settings.py`` to load them as Django settings, and set a reasonable default. + +* Modify the `normalize_response()` method to store the raw response. This is literally no more (or less) than writing the result objects out as a python list and storing that in `self.results`: + +``` + def normalize_response(self): + + logger.debug(f"{self}: normalize_response()") + + self.results = [ + { + 'title': self.query_string_to_provider, + 'body': f'{self.response}', + 'author': 'CHATGPT', + 'date_published': str(datetime.now()) + } + ] + + return +``` + +There's no need to do this if self.response is already a python list. + +* Add the new Connector to `swirl/connectors/__init__.py` + +``` +from swirl.connectors.my_connector import MyConnector +``` + +* Restart Swirl + +``` +% python swirl.py restart core +``` + +* Create a SearchProvider to configure the new Connector, then add it to the Swirl installation as noted in the [Create a SearchProvider tutorial](#creating-a-searchprovider). + +Don't forget a useful tag so you can easily target the new connector when ready to test. + +To learn more about developing Connectors, refer to the [Developer Guide](Developer-Guide.md#develop-new-connectors). + +
+ +# Creating a QueryProcessor + +A QueryProcessor is a stage executed either during Pre-Query or Query Processing. The difference between these is that the result of Pre-Query processing is applied to all SearchProviders, and Query Processing is executed by each individual SearchProviders. + +In both cases, the goal is to modify the query sent to some group of SearchProviders. + +Note: if you just want to rewrite the query using lookup tables or regular expressions, consider [using `QueryTransformations` instead](https://github.com/swirlai/swirl-search/wiki/5.-Developer-Guide#using-query-transformations) + +To create a new QueryProcessor: + +* Create a new module like `swirl/processors/my_query_processor.py`. You can also add your new class to an existing module like `swirl/processors/Generic.py`. + +* Copy the `GenericQueryProcessor` class as a starting point, and rename it: + +``` +class MyQueryProcessor(QueryProcessor): + + type = 'MyQueryProcessor' + + def process(self): + # TO DO: modify self.query_string, and return it + return self.query_string + ' modified' +``` + +A common use case for a new QueryProcessor is to use some function or external service to rewrite the query. Here's an example of how to do the former. + +Given some function in package MyPackage... + +``` +def my_query_modification_function(query): + # implementation + return modified_query +``` + +Create a new QueryProcessor that imports the function: + +``` +from MyPackage import my_query_modification_function + +class MyQueryProcessor(QueryProcessor): + type = 'MyQueryProcessor' + def process(self): + new_query_string = my_query_modification(self.query_string) + if new_query_string: + return new_query_string + else: + return self.query_string +``` + +Save the new QueryProcessor. + +* If you created a new module, add it to `swirl/processors/__init__.py` + +``` +from swirl.processors.my_processor import * +``` + +This can be skipped if you added the new QueryProcessor to an existing module. + +* Add the new Processor to the appropriate `swirl.models` CHOICES block. + +For pre-query processing, add it to the `Search` object here: + +``` + PRE_QUERY_PROCESSOR_CHOICES = [ + ('GenericQueryProcessor', 'GenericQueryProcessor'), + ('SpellcheckQueryProcessor', 'SpellcheckQueryProcessor'), + ('MyQueryProcessor','MyQueryProcessor') + ] + +``` + +This is required for security reasons. To make the new Processor a default for new Search objects, add it to the return list in this block: + +``` +def getSearchPreQueryProcessorsDefault(): + return [] +``` + +For query processing in one or more SearchProviders, add it here: + +``` + QUERY_PROCESSOR_CHOICES = [ + ('GenericQueryProcessor', 'GenericQueryProcessor'), + ('SpellcheckQueryProcessor', 'SpellcheckQueryProcessor'), + ('MyQueryProcessor','MyQueryProcessor') + ] + +``` + +To make the new Processor a default for new SearchProvider objects, add it to the return list in this block: + +``` +def getSearchProviderQueryProcessorsDefault(): + return ["AdaptiveQueryProcessor"] +``` + +* Add the new QueryProcessor to either a `Search.pre_query_processing` pipeline or at least one `SearchProvider.query_processing` pipeline: + +`SearchProvider`: +``` + "query_processors": [ + "AdaptiveQueryProcessor", + "MyQueryProcessor" + ], +``` + +`Search`: +``` + { + "query_string": "news:ai", + "pre_query_processors": [ + "MyQueryProcessor" + ], + } +``` + +* Restart Swirl + +``` +% python swirl.py restart core +``` + +* Go to Galaxy `http://localhost:8000/galaxy/` + +Run a search; if using a query processor be sure to [target that SearchProvider with a tag](https://docs.swirl.today/User-Guide.html#using-tags-to-target-searchproviders). + +For example if you added a QueryProcessor to a SearchProvider query_processing pipeline with tag "news", the query would be `http://localhost:8000/swirl/search/?q=news:some+query` instead.Results should appear in a just a few seconds. In the `messages` block a message indicating that the new QueryProcessor rewrote the query should appear: + +``` +MyQueryProcessor rewrote query to: +``` + +
+ +# Creating a ResultProcessor + +A ResultProcessor is a stage executed by each SearchProvider, after the Connector has retrieved results. ResultProcessors operate on results and transform them as needed for downstream consumption or presentation. + +The GenericResultProcessor and MappingResultProcessor stages are intended to normalize JSON results. GenericResultProcessor searches for exact matches to the Swirl schema (as noted in the SearchProvider example) and copies them over. MappingResultProcessor applies result_mappings to normalize the results, again as shown in the SearchProvider example above. In general adding stages after these is a good idea, unless the SearchProvider is expected to respond in a Swirl schema format. + +To create a new ResultProcessor: + +* Create a new file, e.g. `swirl/processors/my_result_processor.py` + +* Copy the `GenericResultProcessor` class as a starting point, and rename it. + +* Implement the `process()` method. This is the only one required. + +Process() operates on `self.results`, which will contain all the results from a given SearchProvider, in python list format. Modify items in the result list, and report the number updated. + +For example, here is a process() method that adds a field 'my_field1' to each result item, with the value "test": + +``` + def process(self): + + if not self.results: + return + + updated = 0 + for item in self.results: + # TO DO: operate on each item and count number updated + item['my_field1'] = 'test' + updated = updated + 1 + + # note: there is no need to save in this type of Processor + + # save modified self.results + self.processed_results = self.results + # save number of updated + self.modified = updated + + return self.modified +``` + +Save the module. + +* Add the new module to `swirl/processors/__init__.py` + +``` +from swirl.processors.my_processor import MyResultProcessor +``` + +* Add the new module to the following swirl.models CHOICES block: + +``` + RESULT_PROCESSOR_CHOICES = [ + ('GenericResultProcessor', 'GenericResultProcessor'), + ('DuplicateHalfResultProcessor', 'DuplicateHalfResultProcessor'), + ('TestResultProcessor', 'TestResultProcessor'), + ('MappingResultProcessor', 'MappingResultProcessor'), + ('DateFinderResultProcessor','DateFinderResultProcessor'), + ('DedupeByFieldResultProcessor', 'DedupeByFieldResultProcessor'), + ('LenLimitingResultProcessor', 'LenLimitingResultProcessor'), + ('CleanTextResultProcessor','CleanTextResultProcessor'), + ('RequireQueryStringInTitleResultProcessor','RequireQueryStringInTitleResultProcessor'), + ('AutomaticPayloadMapperResultProcessor', 'AutomaticPayloadMapperResultProcessor'), + ('CosineRelevancyResultProcessor','CosineRelevancyResultProcessor'), + ('MyResultProcessor','MyResultProcessor') + ] +``` + +This is required for security reasons. + +To make the new ResultProcessor a default for new SearchProvider objects, add it to the return list in this block: + +``` +def getSearchProviderResultProcessorsDefault(): + return ["MappingResultProcessor","DateFinderResultProcessor","CosineRelevancyResultProcessor"] +``` + +* Add the new module to at least one `SearchProvider.result_processing` pipeline: + +``` + "result_processors": [ + "MappingResultProcessor", + "MyResultProcessor", + "CosineRelevancyResultProcessor" + ], + ...etc... +``` + +* Restart Swirl + +``` +% python swirl.py restart core +``` + +* Go to Galaxy `http://localhost:8000/galaxy/` + +Run a search; be sure to target at least one SearchProvider that has the new ResultProcessor. + +For example if you added a ResultProcessor to a SearchProvider result_processing pipeline with tag "news", the query would need to be `http://localhost:8000/swirl/search/?q=news:some+query` instead of the above. + +Results should appear in a just a few seconds. In the `messages` block a message indicating that the new ResultProcessor updated a number of results should appear, and the content should be modified as expected. + +``` +MyResultProcessor updated 5 results from: MyConnector", +``` + +To learn more about writing Processors, refer to the [Developer Guide](Developer-Guide.md#develop-new-processors) + +
+ +# Creating a PostResultProcessor + +A PostResultProcessor is a stage executed after all SearchProviders have returned results. They operate on all the results for a given Search. + +To create a new ResultProcessor: + +* Create a new module like `swirl/processors/my_post_result_processor.py` + +* Copy the template below as a starting point, and rename it: + +``` +class MyPostResultProcessor(PostResultProcessor): + + type = 'MyPostResultProcessor' + + def process(self): + + updated = 0 + for results in self.results: + if not results.json_results: + continue + for item in results.json_results: + # TO DO: operate on each result item + item['my_field2'] = "test" + updated = updated + 1 + # end for + # call results.save() if any results were modified + if updated > 0: + results.save() + + # end for + ############################################ + + self.results_updated = updated + return self.results_updated +``` + +Modify the `process()` method, operating on the items and saving each result set as shown. + +* Add the new module to `swirl/processors/__init__.py` + +``` +from swirl.processors.my_post_result_processor import MyPostResultProcessor +``` + +* Add the new module to the following `swirl.models` CHOICES block: + +``` + POST_RESULT_PROCESSOR_CHOICES = [ + ('CosineRelevancyPostResultProcessor', 'CosineRelevancyPostResultProcessor'), + ('DropIrrelevantPostResultProcessor','DropIrrelevantPostResultProcessor'), + ('DedupeByFieldPostResultProcessor', 'DedupeByFieldPostResultProcessor'), + ('DedupeBySimilarityPostResultProcessor', 'DedupeBySimilarityPostResultProcessor'), + ('MyPostResultProcessor','MyPostResultProcessor') + ] +``` + +This is required for security reasons. + +To make the new ResultProcessor a default for new SearchProvider objects, add it to the return list in this block: + +``` +def getSearchPostResultProcessorsDefault(): + return ["DedupeByFieldPostResultProcessor","CosineRelevancyPostResultProcessor,"MyPostResultProcessor"] +``` + +* Add the new module to a `Search.post_result_processing` pipeline: + +``` + { + "query_string": "news:ai", + "post_result_processors": [ + "DedupeByFieldPostResultProcessor", + "CosineRelevancyPostResultProcessor", + "MyPostResultProcessor" + ], + ...etc... + } +``` + +* Restart Swirl + +``` +% python swirl.py restart core +``` + +* Go to Galaxy `http://localhost:8000/galaxy/` + +Run a search; be sure to target at least one SearchProvider that has the new PostResultProcessor. + +For example if you added a PostResultProcessor to a Search `post_result_processing` pipeline with tag "news", the query would need to be `http://localhost:8000/swirl/search/?q=news:some+query` instead of the above. + +Results should appear in a just a few seconds. In the `messages` block a message indicating that the new PostResultProcessor updated a number of results should appear, and the content should be modified as expected. + +``` +MyPostResultProcessor updated 10 results from: MySearchProvider +``` + +To learn more about writing Processors, refer to the [Developer Guide](Developer-Guide.md#develop-new-processors) + From 7253bac1d1fdde84e234ae9658cc3f2ff1731243 Mon Sep 17 00:00:00 2001 From: Erik Spears Date: Mon, 12 Feb 2024 14:38:01 -0500 Subject: [PATCH 2/4] update order of docs for new Tutorial; initial edits to Tutorial formatting and links, more to come --- docs/AI-Guide.md | 2 +- docs/Azure.md | 2 +- docs/Developer-Reference.md | 2 +- docs/Tutorials.md | 79 +++++++++++-------------------------- 4 files changed, 27 insertions(+), 58 deletions(-) diff --git a/docs/AI-Guide.md b/docs/AI-Guide.md index 666c2c212..83f2107d8 100644 --- a/docs/AI-Guide.md +++ b/docs/AI-Guide.md @@ -1,7 +1,7 @@ --- layout: default title: AI Guide -nav_order: 8 +nav_order: 9 ---
diff --git a/docs/Azure.md b/docs/Azure.md index 95120445e..3cff84ba7 100644 --- a/docs/Azure.md +++ b/docs/Azure.md @@ -1,7 +1,7 @@ --- layout: default title: Azure Marketplace Guide -nav_order: 9 +nav_order: 10 ---
diff --git a/docs/Developer-Reference.md b/docs/Developer-Reference.md index 608867e29..0bccdc26b 100644 --- a/docs/Developer-Reference.md +++ b/docs/Developer-Reference.md @@ -1,7 +1,7 @@ --- layout: default title: Developer Reference -nav_order: 7 +nav_order: 8 ---
diff --git a/docs/Tutorials.md b/docs/Tutorials.md index 80b46ee06..9c4168f61 100644 --- a/docs/Tutorials.md +++ b/docs/Tutorials.md @@ -1,7 +1,7 @@ --- layout: default -title: Tutorial 1 -nav_order: 8 +title: Tutorial - Extending Swirl +nav_order: 7 ---
@@ -12,15 +12,15 @@ nav_order: 8 {:toc}
-# Tutorial 1: Extending Swirl +# Tutorial: Extending Swirl -The following tutorial is intended for a python developer who wants to extend Swirl by adding SearchProviders, Connectors and Processors. Please note this guide assumes use of Swirl 3.x or later. +## Intended Audience -
+The following tutorial is intended for a Python developer who wants to extend Swirl by adding SearchProviders, Connectors and Processors. Please note this guide assumes use of Swirl 3.x or later. ## Assumptions -* Latest python, 3.11 or later, installed locally +* Python 3.12.1 (or latest stable) installed locally * Redis installed and running * Swirl installed locally (not in Docker) and running @@ -35,95 +35,66 @@ Service: celery-worker...RUNNING, pid:34734 34720 ttys000 0:05.04 /Library/Frameworks/Python.framework/Versions/3.12/Resources/Python.app/Contents/MacOS/Python /Library/Frameworks/Python.framework/Versions/3.12/bin/daphne -b 0.0.0.0 -p 8000 swirl_server.asgi:application 34734 ttys000 0:06.67 /Library/Frameworks/Python.framework/Versions/3.12/Resources/Python.app/Contents/MacOS/Python /Library/Frameworks/Python.framework/Versions/3.12/bin/celery -A swirl_server worker ``` -
-# Background: The Swirl Search Workflow +## Background: The Swirl Search Workflow In a nutshell: * User creates a query - example: [http://localhost:8000/swirl/search/?q=ai](http://localhost:8000/swirl/search/?q=ai) * Pre-query processing - example: [SpellcheckQueryProcessor](https://github.com/swirlai/swirl-search/blob/main/swirl/processors/spellcheck_query.py) -
- {: .highlight } Each Search Provider executes in parallel -
- * Query processing - example: [AdaptiveQueryProcessor](https://github.com/swirlai/swirl-search/blob/main/swirl/processors/adaptive.py) * Connector - example: [RequestsGet](https://github.com/swirlai/swirl-search/blob/main/swirl/connectors/requestsget.py) * Result processing - example: [MappingResultsProcessor](https://github.com/swirlai/swirl-search/blob/main/swirl/processors/mapping.py) -
- {: .highlight } Parallel processing ends here -
- * Post-result processing - example: [CosineRelevancyPostResultProcessor](https://github.com/swirlai/swirl-search/blob/main/swirl/processors/relevancy.py) * Ranked-results available via mixer - example: [http://localhost:8000/swirl/results/?search_id=1](http://localhost:8000/swirl/results/?search_id=1) -For more information, consult the [Developer Guide Workflow Overview](Developer-Guide.md#workflow). - -
+For more information, consult the [Developer Guide, Workflow Overview](Developer-Guide.md#workflow). # Creating a SearchProvider -A SearchProvider is a configuration of a Connector. So, to connect to a given source, first, verify that it supports a Connector you already have. (See the next tutorial for information on creating new Connectors.) +A SearchProvider is a configuration of a Connector. To connect to a given source, first verify that it supports a Connector that Swirl already has. (See the next tutorial for information on creating new Connectors.) -For example, if trying to query a website using a URL like `https://host.com/?q=my+query+here` that returns JSON or XML, create a new SearchProvider configuring the RequestsGet connector as follows: +For example, if trying to query a website using a URL like `https://host.com/?q=my+query+here` that returns JSON or XML, create a new SearchProvider configuring the `RequestsGet` Connector as follows: * Copy any of the [Google PSE SearchProviders](https://github.com/swirlai/swirl-search/blob/main/SearchProviders/google_pse.json) -Modify the `url` and `query_template` to construct the query URL. Using the above example: - +* Modify the `url` and `query_template` to construct the query URL. Using the above example: ``` { - "url": "https://my-host.com/", - "query_template": "{url}?q={query_string}", + "url": "https://my-host.com/", + "query_template": "{url}?q={query_string}", } ``` +To learn more about query and URL parameters, refer to the [User Guide, Using SearchProviders](User-Guide.html#using-searchproviders) section. -To learn more about query and URL parameters, refer to the [Developer Guide](./Developer-Guide.md). - -* If the website offers the ability to page through results, or sort results by date (as well as relevancy), use the `PAGE=` and `DATE_SORT` query mappings to add support for these features through Swirl. - -For example: - +* If the website offers the ability to page through results, or sort results by date (as well as relevancy), use the `PAGE=` and `DATE_SORT=` query mappings to add support for these features through Swirl. For example: ``` TO DO ``` - -For more information refer to the [User Guide, Query Mappings section:](https://github.com/swirlai/swirl-search/wiki/2.-User-Guide#query-mappings) +For more information, refer to the [User Guide, Query Mappings](User-Guide.html#query-mappings) section. * Open the query URL in a browser and look through the JSON response. +If using Visual Studio Code, right-click on the pasted JSON and select `Format Document` to make it easier to read. -If using Visual Studio Code, right-click on the pasted JSON and select `Format Document`` to make it easier to read. - -* Identify the results list, and number of results found and retrieved. Put these JSON paths in the response_mappings. Then, identify the JSON paths to use to extract the Swirl default fields `title`, `body`, `url`, `date_published` and `author` from each item in the result lists in the result_mappings, with the Swirl field name on the left, and the source JSON path on the right. - -For example: - +* Identify the results list, and number of results found and retrieved. Put these JSON paths in the `response_mappings`. Then, identify the JSON paths to use to extract the Swirl default fields `title`, `body`, `url`, `date_published` and `author` from each item in the result list and add those to in the `result_mappings`, with the Swirl field name on the left, and the source JSON path on the right. For example: ``` - "response_mappings": "FOUND=searchInformation.totalResults,RETRIEVED=queries.request[0].count,RESULTS=items", - "result_mappings": "url=link,body=snippet,author=displayLink,cacheId,pagemap.metatags[*].['og:type'],pagemap.metatags[*].['og:site_name'],pagemap.metatags[*].['og:description'],NO_PAYLOAD", - +"response_mappings": "FOUND=searchInformation.totalResults,RETRIEVED=queries.request[0].count,RESULTS=items", +"result_mappings": "url=link,body=snippet,author=displayLink,cacheId,pagemap.metatags[*].['og:type'],pagemap.metatags[*].['og:site_name'],pagemap.metatags[*].['og:description'],NO_PAYLOAD", ``` -* Add credentials as required for the service. - -The format to use depends on the type of credential. Details are here: [User Guide Credentials Section](https://github.com/swirlai/swirl-search/wiki/2.-User-Guide#authentication--credentials) - -* Add a suitable tag that can be used to describe the source or what it knows about. - -Spaces are not permitted; good tags are clear and obvious when used in a query, like `company:tesla` or `news:openai`. +* Add credentials as required for the service. The format to use depends on the type of credential. Details are here: [User Guide, Authentication & Credentials](User-Guide.html#authentication--credentials) section. -For more about tags, see: [Organizing SearchProviders](https://github.com/swirlai/swirl-search/wiki/2.-User-Guide#organizing-searchproviders-with-active-default-and-tags) +* Add a suitable tag that can be used to describe the source or what it knows about. Spaces are not permitted; good tags are clear and obvious when used in a query, like `company:tesla` or `news:openai`. For more about tags, see: [Organizing SearchProviders with Active, Default and Tags](User-Guide.html#organizing-searchproviders-with-active-default-and-tags) * Review the finished SearchProvider: - ``` { "name": "My New SearchProvider", @@ -147,11 +118,9 @@ For more about tags, see: [Organizing SearchProviders](https://github.com/swirla } ``` -* Go to Swirl `localhost:8000/swirl/searchproviders/`, logging in if necessary. Put the form at the bottom of the page into RAW mode, and paste the SearchProvider in. Then hit POST. The SearchProvider will reload. - -* Go to Galaxy `localhost:8000/galaxy/` and run a search using the tag you created earlier. Results should again appear in roughly the same period of time. +* Go to Swirl `localhost:8000/swirl/searchproviders/`, logging in if necessary. Put the form at the bottom of the page into `Raw data` mode, and paste the SearchProvider into the box. Click the `POST` button to add it. The SearchProvider page will reload. -
+* Go to the Galaxy UI at `localhost:8000/galaxy/` and run a search using the Tag you created for the new SearchProvider. Results should again appear in roughly the same period of time. # Creating a Connector From 9c26dbcdca4123cd4484930bef9c32fa775778ed Mon Sep 17 00:00:00 2001 From: Erik Spears Date: Mon, 12 Feb 2024 16:01:31 -0500 Subject: [PATCH 3/4] update additional sections of the new Tutorial --- docs/Tutorials.md | 113 ++++++++++++++-------------------------------- 1 file changed, 35 insertions(+), 78 deletions(-) diff --git a/docs/Tutorials.md b/docs/Tutorials.md index 9c4168f61..120a44de4 100644 --- a/docs/Tutorials.md +++ b/docs/Tutorials.md @@ -124,18 +124,19 @@ If using Visual Studio Code, right-click on the pasted JSON and select `Format D # Creating a Connector -In Swirl, Connectors are responsible for loading a SearchProvider, then constructing and transmitting queries to a particular type of service, then saving the response - typically a result list. +Swirl Connectors are responsible for loading a SearchProvider, constructing and transmitting queries to a particular type of service, then saving the response - typically a result list. -:info: Consider using your favorite coding AI to generate a Connector by passing it the Connector base classes, and information about the API you are trying to query. +{: .highlight } +Consider using your favorite coding AI to generate a Connector by passing it the Connector base classes, and information about the API you are trying to query. -:info: If you are trying to send an HTTP/S request to an endpoint that returns JSON or XML, you don't need to create a Connector. Instead, [Create a SearchProvider](#creating-a-searchprovider) that configures the RequestsGet connector included with Swirl. +{: .highlight } +If you are trying to send an HTTP/S request to an endpoint that returns JSON or XML, you don't need to create a Connector. Instead, [Create a SearchProvider](#creating-a-searchprovider) that configures the RequestsGet connector included with Swirl. To create a new Connector: -* Create a new file, e.g. `swirl/connectors/my_connector.py` - -* Copy the style of the `ChatGPT` connector as a starting point, or `BigQuery` it targeting a database. +* Create a new Connector file, e.g. `swirl/connectors/my_connector.py` +* Copy the style of the `ChatGPT` connector as a starting point, or `BigQuery` if targeting a database. ``` class MyConnector(Connector): @@ -144,17 +145,14 @@ class MyConnector(Connector): super().__init__(provider_id, search_id, update, request_id) ``` -In the __init__ class, load and persist anything that will be needed when connecting and querying the service. Use the ChatGPT Connector as a guide. - -* Import the python package(s) to connect to the service. The ChatGPT connector uses the openai package, for example: +* In the `__init__` class, load and persist anything that will be needed when connecting and querying the service. +* Import the Python package(s) to connect to the service. The ChatGPT connector uses the `openai` package, for example: ``` import openai ``` -* Modify the execute_search method to connect to the service. - -As you can see from the ChatGPT Connector, it first loads the OpenAI credentials, then constructs a prompt, sends the prompt via `openai.ChatCompletion.create()`, then stores the response. +* Modify the `execute_search` method to connect to the service. As you can see from the ChatGPT Connector, it first loads the OpenAI credentials, then constructs a prompt, sends the prompt via `openai.ChatCompletion.create()`, then stores the response. ``` def execute_search(self, session=None): @@ -209,10 +207,9 @@ As you can see from the ChatGPT Connector, it first loads the OpenAI credentials ``` -ChatGPT depends on the OpenAI API key, which is provided to Swirl via the .env file. To follow this pattern, create new values in .env then modify `swirl_server/settings.py`` to load them as Django settings, and set a reasonable default. +* ChatGPT depends on the OpenAI API key, which is provided to Swirl via the `.env` file. To follow this pattern, create new values in `.env` then modify `swirl_server/settings.py` to load them as Django settings, and set a reasonable default. * Modify the `normalize_response()` method to store the raw response. This is literally no more (or less) than writing the result objects out as a python list and storing that in `self.results`: - ``` def normalize_response(self): @@ -230,42 +227,36 @@ ChatGPT depends on the OpenAI API key, which is provided to Swirl via the .env f return ``` -There's no need to do this if self.response is already a python list. +{: .highlight } +There's no need to do this if `self.response` is already a Python list. * Add the new Connector to `swirl/connectors/__init__.py` - ``` from swirl.connectors.my_connector import MyConnector ``` * Restart Swirl - ``` % python swirl.py restart core ``` -* Create a SearchProvider to configure the new Connector, then add it to the Swirl installation as noted in the [Create a SearchProvider tutorial](#creating-a-searchprovider). - -Don't forget a useful tag so you can easily target the new connector when ready to test. - -To learn more about developing Connectors, refer to the [Developer Guide](Developer-Guide.md#develop-new-connectors). +* Create a SearchProvider to configure the new Connector, then add it to the Swirl installation as noted in the [Creating a SearchProvider](#creating-a-searchprovider) section above. Don't forget a useful Tag so that you can easily target the new connector when ready to test. -
+To learn more about developing Connectors, refer to the [Developer Guide, Developing New Connectors](Developer-Guide.md#develop-new-connectors) section. # Creating a QueryProcessor -A QueryProcessor is a stage executed either during Pre-Query or Query Processing. The difference between these is that the result of Pre-Query processing is applied to all SearchProviders, and Query Processing is executed by each individual SearchProviders. +A QueryProcessor is a stage executed either during Pre-Query or Query Processing. The difference between these is that the result of Pre-Query processing is applied to all SearchProviders, and Query Processing is executed by each individual SearchProvider. In both cases, the goal is to modify the query sent to some group of SearchProviders. -Note: if you just want to rewrite the query using lookup tables or regular expressions, consider [using `QueryTransformations` instead](https://github.com/swirlai/swirl-search/wiki/5.-Developer-Guide#using-query-transformations) +Note: if you just want to rewrite the query using lookup tables or regular expressions, consider [using `QueryTransformations` instead](Developer-Guide.html#query-transformation-rules). To create a new QueryProcessor: * Create a new module like `swirl/processors/my_query_processor.py`. You can also add your new class to an existing module like `swirl/processors/Generic.py`. * Copy the `GenericQueryProcessor` class as a starting point, and rename it: - ``` class MyQueryProcessor(QueryProcessor): @@ -276,17 +267,14 @@ class MyQueryProcessor(QueryProcessor): return self.query_string + ' modified' ``` -A common use case for a new QueryProcessor is to use some function or external service to rewrite the query. Here's an example of how to do the former. - -Given some function in package MyPackage... - +* A common use case for a new QueryProcessor is to use some function or external service to rewrite the query. Here's an example of how to do the former. Given some function in package MyPackage: ``` def my_query_modification_function(query): # implementation return modified_query ``` -Create a new QueryProcessor that imports the function: +* Create a new QueryProcessor that imports the function: ``` from MyPackage import my_query_modification_function @@ -301,19 +289,16 @@ class MyQueryProcessor(QueryProcessor): return self.query_string ``` -Save the new QueryProcessor. - -* If you created a new module, add it to `swirl/processors/__init__.py` +* Save the new QueryProcessor. +* If you created a new module, add it to `swirl/processors/__init__.py`. This can be skipped if you added the new QueryProcessor to an existing module. ``` from swirl.processors.my_processor import * ``` -This can be skipped if you added the new QueryProcessor to an existing module. - * Add the new Processor to the appropriate `swirl.models` CHOICES block. -For pre-query processing, add it to the `Search` object here: +* For Pre-Query processing, add it to the `Search` object; this is required for security reasons: ``` PRE_QUERY_PROCESSOR_CHOICES = [ @@ -321,17 +306,15 @@ For pre-query processing, add it to the `Search` object here: ('SpellcheckQueryProcessor', 'SpellcheckQueryProcessor'), ('MyQueryProcessor','MyQueryProcessor') ] - ``` -This is required for security reasons. To make the new Processor a default for new Search objects, add it to the return list in this block: - +* To make the new Processor a default for new Search objects, add it to the return list in this block: ``` def getSearchPreQueryProcessorsDefault(): return [] ``` -For query processing in one or more SearchProviders, add it here: +* For Query processing in one or more SearchProviders, add it here: ``` QUERY_PROCESSOR_CHOICES = [ @@ -342,8 +325,7 @@ For query processing in one or more SearchProviders, add it here: ``` -To make the new Processor a default for new SearchProvider objects, add it to the return list in this block: - +* To make the new Processor a default for new SearchProvider objects, add it to the return list in this block: ``` def getSearchProviderQueryProcessorsDefault(): return ["AdaptiveQueryProcessor"] @@ -370,28 +352,22 @@ def getSearchProviderQueryProcessorsDefault(): ``` * Restart Swirl - ``` % python swirl.py restart core ``` -* Go to Galaxy `http://localhost:8000/galaxy/` - -Run a search; if using a query processor be sure to [target that SearchProvider with a tag](https://docs.swirl.today/User-Guide.html#using-tags-to-target-searchproviders). - -For example if you added a QueryProcessor to a SearchProvider query_processing pipeline with tag "news", the query would be `http://localhost:8000/swirl/search/?q=news:some+query` instead.Results should appear in a just a few seconds. In the `messages` block a message indicating that the new QueryProcessor rewrote the query should appear: +* Go to the Galaxy UI (`http://localhost:8000/galaxy/`) and run a search; if using a query processor be sure to [target that SearchProvider with a tag](User-Guide.html#using-tags-to-target-searchproviders). +For example if you added a QueryProcessor to a SearchProvider `query_processing` pipeline with a Tag of "news", the query would be `http://localhost:8000/swirl/search/?q=news:some+query` instead. Results should appear in a just a few seconds. In the `messages` block, a message indicating that the new QueryProcessor rewrote the query should appear: ``` MyQueryProcessor rewrote query to: ``` -
- # Creating a ResultProcessor A ResultProcessor is a stage executed by each SearchProvider, after the Connector has retrieved results. ResultProcessors operate on results and transform them as needed for downstream consumption or presentation. -The GenericResultProcessor and MappingResultProcessor stages are intended to normalize JSON results. GenericResultProcessor searches for exact matches to the Swirl schema (as noted in the SearchProvider example) and copies them over. MappingResultProcessor applies result_mappings to normalize the results, again as shown in the SearchProvider example above. In general adding stages after these is a good idea, unless the SearchProvider is expected to respond in a Swirl schema format. +The `GenericResultProcessor` and `MappingResultProcessor` stages are intended to normalize JSON results. `GenericResultProcessor` searches for exact matches to the Swirl schema (as noted in the SearchProvider example) and copies them over. `MappingResultProcessor` applies `result_mappings` to normalize the results, again as shown in the SearchProvider example above. In general, adding stages *after* these is a good idea, unless the SearchProvider is expected to respond in a Swirl schema format. To create a new ResultProcessor: @@ -399,12 +375,9 @@ To create a new ResultProcessor: * Copy the `GenericResultProcessor` class as a starting point, and rename it. -* Implement the `process()` method. This is the only one required. - -Process() operates on `self.results`, which will contain all the results from a given SearchProvider, in python list format. Modify items in the result list, and report the number updated. - -For example, here is a process() method that adds a field 'my_field1' to each result item, with the value "test": +* Implement the `process()` method. This is the only one required. `Process()` operates on `self.results`, which will contain all the results from a given SearchProvider, in Python list format. Modify items in the result list, and report the number updated. +* For example, here is a `process()` method that adds a field 'my_field1' to each result item, with the value "test": ``` def process(self): @@ -427,16 +400,12 @@ For example, here is a process() method that adds a field 'my_field1' to each re return self.modified ``` -Save the module. - -* Add the new module to `swirl/processors/__init__.py` - +* Save the new module and add it to `swirl/processors/__init__.py` ``` from swirl.processors.my_processor import MyResultProcessor ``` -* Add the new module to the following swirl.models CHOICES block: - +* Add the new module to the following swirl.models CHOICES block (this is required for security reasons.) ``` RESULT_PROCESSOR_CHOICES = [ ('GenericResultProcessor', 'GenericResultProcessor'), @@ -454,17 +423,13 @@ from swirl.processors.my_processor import MyResultProcessor ] ``` -This is required for security reasons. - -To make the new ResultProcessor a default for new SearchProvider objects, add it to the return list in this block: - +* To make the new ResultProcessor a default for new SearchProvider objects, add it to the return list in this block: ``` def getSearchProviderResultProcessorsDefault(): return ["MappingResultProcessor","DateFinderResultProcessor","CosineRelevancyResultProcessor"] ``` * Add the new module to at least one `SearchProvider.result_processing` pipeline: - ``` "result_processors": [ "MappingResultProcessor", @@ -475,27 +440,19 @@ def getSearchProviderResultProcessorsDefault(): ``` * Restart Swirl - ``` % python swirl.py restart core ``` -* Go to Galaxy `http://localhost:8000/galaxy/` - -Run a search; be sure to target at least one SearchProvider that has the new ResultProcessor. - -For example if you added a ResultProcessor to a SearchProvider result_processing pipeline with tag "news", the query would need to be `http://localhost:8000/swirl/search/?q=news:some+query` instead of the above. - -Results should appear in a just a few seconds. In the `messages` block a message indicating that the new ResultProcessor updated a number of results should appear, and the content should be modified as expected. +* Go to the Galaxy UI (`http://localhost:8000/galaxy/`) and run a search; be sure to target at least one SearchProvider that has the new ResultProcessor. For example if you added a ResultProcessor to a SearchProvider `result_processing` pipeline with the Tag "news", the query would need to be `http://localhost:8000/swirl/search/?q=news:some+query` instead of the above. +* Results should appear in a just a few seconds. In the `messages` block a message indicating that the new ResultProcessor updated a number of results should appear, and the content should be modified as expected. ``` MyResultProcessor updated 5 results from: MyConnector", ``` To learn more about writing Processors, refer to the [Developer Guide](Developer-Guide.md#develop-new-processors) -
- # Creating a PostResultProcessor A PostResultProcessor is a stage executed after all SearchProviders have returned results. They operate on all the results for a given Search. From bc92c6d6f85f2f18d14a58a629a1b15aa5f9d0e9 Mon Sep 17 00:00:00 2001 From: Erik Spears Date: Mon, 12 Feb 2024 16:06:05 -0500 Subject: [PATCH 4/4] updated final section of new Tutorials doc --- docs/Tutorials.md | 27 +++++++-------------------- 1 file changed, 7 insertions(+), 20 deletions(-) diff --git a/docs/Tutorials.md b/docs/Tutorials.md index 120a44de4..ab66ea13a 100644 --- a/docs/Tutorials.md +++ b/docs/Tutorials.md @@ -209,7 +209,7 @@ import openai * ChatGPT depends on the OpenAI API key, which is provided to Swirl via the `.env` file. To follow this pattern, create new values in `.env` then modify `swirl_server/settings.py` to load them as Django settings, and set a reasonable default. -* Modify the `normalize_response()` method to store the raw response. This is literally no more (or less) than writing the result objects out as a python list and storing that in `self.results`: +* Modify the `normalize_response()` method to store the raw response. This is literally no more (or less) than writing the result objects out as a Python list and storing that in `self.results`: ``` def normalize_response(self): @@ -298,7 +298,7 @@ from swirl.processors.my_processor import * * Add the new Processor to the appropriate `swirl.models` CHOICES block. -* For Pre-Query processing, add it to the `Search` object; this is required for security reasons: +* For Pre-Query processing, add it to the `Search` object (this is required for security reasons): ``` PRE_QUERY_PROCESSOR_CHOICES = [ @@ -462,7 +462,6 @@ To create a new ResultProcessor: * Create a new module like `swirl/processors/my_post_result_processor.py` * Copy the template below as a starting point, and rename it: - ``` class MyPostResultProcessor(PostResultProcessor): @@ -490,16 +489,14 @@ class MyPostResultProcessor(PostResultProcessor): return self.results_updated ``` -Modify the `process()` method, operating on the items and saving each result set as shown. +* Modify the `process()` method operating on the items and saving each result set as shown. * Add the new module to `swirl/processors/__init__.py` - ``` from swirl.processors.my_post_result_processor import MyPostResultProcessor ``` -* Add the new module to the following `swirl.models` CHOICES block: - +* Add the new module to the following `swirl.models` CHOICES block (this is required for security reasons. ): ``` POST_RESULT_PROCESSOR_CHOICES = [ ('CosineRelevancyPostResultProcessor', 'CosineRelevancyPostResultProcessor'), @@ -510,17 +507,13 @@ from swirl.processors.my_post_result_processor import MyPostResultProcessor ] ``` -This is required for security reasons. - -To make the new ResultProcessor a default for new SearchProvider objects, add it to the return list in this block: - +* To make the new ResultProcessor a default for new SearchProvider objects, add it to the return list in this block: ``` def getSearchPostResultProcessorsDefault(): return ["DedupeByFieldPostResultProcessor","CosineRelevancyPostResultProcessor,"MyPostResultProcessor"] ``` * Add the new module to a `Search.post_result_processing` pipeline: - ``` { "query_string": "news:ai", @@ -534,19 +527,13 @@ def getSearchPostResultProcessorsDefault(): ``` * Restart Swirl - ``` % python swirl.py restart core ``` -* Go to Galaxy `http://localhost:8000/galaxy/` - -Run a search; be sure to target at least one SearchProvider that has the new PostResultProcessor. - -For example if you added a PostResultProcessor to a Search `post_result_processing` pipeline with tag "news", the query would need to be `http://localhost:8000/swirl/search/?q=news:some+query` instead of the above. - -Results should appear in a just a few seconds. In the `messages` block a message indicating that the new PostResultProcessor updated a number of results should appear, and the content should be modified as expected. +* Go to the Galaxy UI (`http://localhost:8000/galaxy/`) and run a search; be sure to target at least one SearchProvider that has the new PostResultProcessor. For example if you added a PostResultProcessor to a Search `post_result_processing` pipeline with the Tag "news", the query would need to be `http://localhost:8000/swirl/search/?q=news:some+query` instead of the above. +* Results should appear in a just a few seconds. In the `messages` block a message indicating that the new PostResultProcessor updated a number of results should appear, and the content should be modified as expected. ``` MyPostResultProcessor updated 10 results from: MySearchProvider ```