This is a quick start to use Unstructured to ingest any document into PostgreSQL and run semantic search on it.
With this setup, the document on your postgresql database will be vectorized and you can run semantic search on it.
The import.sh script imports the document into PostgreSQL and vectorizes the content of the document.
sequenceDiagram
participant U as User
participant S as Unstructured
participant T as PostgreSQL
participant P as pgai
participant O as pgai Vectorizer
U->>S: Input raw documents
S->>S: Extract and structure data
S->>T: Store structured data
P->>O: Configure pgai Vectorizer
T->>O: Process text data
O->>O: Generate vector embeddings
O->>T: Store vector embeddings
U->>T: Query data
T->>P: Generate embedding for query data
T->>T: Perform vector similarity search
T->>U: Return relevant results
U->>U: Analyze insights
The best part of unstructured is that it can ingest a lot of different file types and structure the content of the file in a way that is easy to query.
It supports csv, html, pdf, word, excel, image, markdown, and more.
This example covers a generic ingest of any data file and also a python script to ingest a web page.
Copy the .envrc.example
file to .envrc
and edit it to set the correct environment variables.
cp .envrc.example .envrc
Open .envrc
and set the correct environment variables. The file includes settings for:
- PostgreSQL connection details (for local setup or Timescale Cloud)
- OpenAI API configuration
- Unstructured API settings
- Other project-specific settings
For local setup, you can use a pre-built Docker container: https://github.com/timescale/pgai?tab=readme-ov-file#use-a-pre-built-docker-container
For production use, we recommend using Timescale Cloud: https://github.com/timescale/pgai?tab=readme-ov-file#use-a-timescale-cloud-service Sign up for a free trial and enable the pgai extension for your service.
Make sure to set the following key variables:
DB_HOST
,DB_PORT
,DB_NAME
,DB_USER
,DB_PASSWORD
: TimescaleDB connection detailsOPENAI_API_KEY
: Your OpenAI API key for text processing tasksUNSTRUCTURED_API_KEY
: API key for the Unstructured service (if using the hosted version)
You can adapt the openai model or the embedding model to your needs directly in the schema.sql file.
Adjust other settings as needed for your specific use case.
After setting up the environment variables, run install.sh
to install the necessary dependencies.
Run import.sh
to import the data into TimescaleDB.
Example:
./install.sh
./import.sh import/my/file.pdf
By default, the database schema is created using the schema.sql
file.
It creates a table elements
that includes all the fields that Unstructured can extract from a document.
You can use the import.sh script to import data from HTML files too.
./import.sh import/my/file.html
In this case, if you don't have data files to play, I recommend you to use the unstructured project that contains a lot of data files to play with.
git clone https://github.com/Unstructured-IO/unstructured
Then, you can run the following command to import the data from the HTML files:
./import.sh ../unstructured/example-docs/example-10k.html
Now, with a 10k file, it will have an output like this:
2024-08-21 15:05:34,322 MainProcess INFO Created index with configs: {"input_path": "../../unstructured/example-docs/example-10k.html", "recursive": false}, connection configs: {"access_config": "**********"}
2024-08-21 15:05:34,322 MainProcess INFO Created download with configs: {"download_dir": null}, connection configs: {"access_config": "**********"}
2024-08-21 15:05:34,322 MainProcess INFO Created filter with configs: {"file_glob": null, "max_file_size": null}
2024-08-21 15:05:34,323 MainProcess INFO Created partition with configs: {"strategy": "auto", "ocr_languages": null, "encoding": null, "additional_partition_args": {"split_pdf_page": "true", "split_pdf_allow_failed": "true", "split_pdf_concurrency_level": 15}, "skip_infer_table_types": null, "fields_include": ["element_id", "text", "type", "metadata", "embeddings"], "flatten_metadata": false, "metadata_exclude": [], "metadata_include": ["id", "element_id", "text", "embeddings", "type", "system", "layout_width", "layout_height", "points", "url", "version", "date_created", "date_modified", "date_processed", "permissions_data", "record_locator", "category_depth", "parent_id", "attached_filename", "filetype", "last_modified", "file_directory", "filename", "languages", "page_number", "links", "page_name", "link_urls", "link_texts", "sent_from", "sent_to", "subject", "section", "header_footer_type", "emphasized_text_contents", "emphasized_text_tags", "text_as_html", "regex_metadata", "detection_class_prob"], "partition_endpoint": "https://api.unstructured.io/general/v0/general", "partition_by_api": false, "api_key": null, "hi_res_model_name": null}
2024-08-21 15:05:34,323 MainProcess INFO Created upload_stage with configs: {}
2024-08-21 15:05:34,323 MainProcess INFO Created upload with configs: {"batch_size": 50}, connection configs: {"access_config": "**********", "db_type": "postgresql", "database": "vector_playground", "host": "localhost", "port": 5432, "connector_type": "sql"}
2024-08-21 15:05:34,391 MainProcess INFO Running local pipline: index (LocalIndexer) -> filter -> download (LocalDownloader) -> filter -> partition (auto) -> upload_stage (SQLUploadStager) -> upload (SQLUploader) with configs: {"reprocess": false, "verbose": true, "tqdm": false, "work_dir": "data", "num_processes": 2, "max_connections": null, "raise_on_error": false, "disable_parallelism": false, "preserve_downloads": false, "download_only": false, "max_docs": null, "re_download": false, "uncompress": false, "status": {}, "semaphore": null}
2024-08-21 15:05:35,819 MainProcess DEBUG Generated file data: {"identifier": "/Users/jonatasdp/code/unstructured/example-docs/example-10k.html", "connector_type": "local", "source_identifiers": {"filename": "example-10k.html", "fullpath": "/Users/jonatasdp/code/unstructured/example-docs/example-10k.html", "rel_path": "example-10k.html"}, "doc_type": "file", "metadata": {"url": null, "version": null, "record_locator": {"path": "/Users/jonatasdp/code/unstructured/example-docs/example-10k.html"}, "date_created": "1724179566.867809", "date_modified": "1724179566.8738236", "date_processed": "1724263535.8178291", "permissions_data": [{"mode": 33188}], "filesize_bytes": 2456707}, "additional_metadata": {}, "reprocess": false}
2024-08-21 15:05:35,822 MainProcess INFO Calling FilterStep with 1 docs
2024-08-21 15:05:35,822 MainProcess INFO processing content across processes
2024-08-21 15:05:35,822 MainProcess INFO processing content serially
2024-08-21 15:05:35,824 MainProcess INFO FilterStep [cls] took 0.002730131149291992 seconds
2024-08-21 15:05:35,824 MainProcess INFO Calling DownloadStep with 1 docs
2024-08-21 15:05:35,825 MainProcess INFO processing content async
2024-08-21 15:05:35,826 MainProcess DEBUG Skipping download, file already exists locally: /Users/jonatasdp/code/unstructured/example-docs/example-10k.html
2024-08-21 15:05:35,826 MainProcess INFO DownloadStep [cls] took 0.001313924789428711 seconds
2024-08-21 15:05:35,826 MainProcess INFO Calling FilterStep with 1 docs
2024-08-21 15:05:35,826 MainProcess INFO processing content across processes
2024-08-21 15:05:35,826 MainProcess INFO processing content serially
2024-08-21 15:05:35,827 MainProcess INFO FilterStep [cls] took 0.0011758804321289062 seconds
2024-08-21 15:05:35,827 MainProcess INFO Calling PartitionStep with 1 docs
2024-08-21 15:05:35,827 MainProcess INFO processing content across processes
2024-08-21 15:05:35,827 MainProcess INFO processing content serially
2024-08-21 15:05:35,828 MainProcess DEBUG Skipping partitioning, output already exists: /Users/jonatasdp/code/timescale/unstructured-timescaledb/data/partition/8323bca93f7c.json
2024-08-21 15:05:35,829 MainProcess INFO PartitionStep [cls] took 0.001569986343383789 seconds
2024-08-21 15:05:35,829 MainProcess INFO Calling UploadStageStep with 1 docs
2024-08-21 15:05:35,829 MainProcess INFO processing content across processes
2024-08-21 15:05:35,829 MainProcess INFO processing content serially
...
2024-08-21 15:05:35,874 MainProcess INFO UploadStageStep [cls] took 0.04475522041320801 seconds
2024-08-21 15:05:35,874 MainProcess INFO Calling UploadStep with 1 docs
2024-08-21 15:05:35,882 MainProcess DEBUG uploading 313 entries to vector_playground
2024-08-21 15:08:54,376 MainProcess INFO UploadStep [cls] took 198.50195574760437 seconds
2024-08-21 15:08:54,377 MainProcess INFO Finished ingest process in 200.0532009601593s
Now, let's take a look at the data filtering by the type
field:
select languages, type, text from elements where type = 'Title' ;
languages | type | text
-----------+-------+------------------------------------------------------------------------------------
{eng} | Title | UNITED STATES
{eng} | Title | SECURITIES AND EXCHANGE COMMISSION
{eng} | Title | Washington, D.C. 20549
{eng} | Title | FORM
{eng} | Title | For the transition period from to
{eng} | Title | Commission file number:
{eng} | Title | ITEM 1. BUSINESS
{eng} | Title | ANNUAL REPORT ON FORM 10-K FOR THE YEAR ENDED DECEMBER 31, 2021
{eng} | Title | TABLE OF CONTENTS
{eng} | Title | SPECIAL NOTE REGARDING FORWARD-LOOKING STATEMENTS
You can also find nearest elements by the embeddings using the <=>
operator:
vector_playground=# select type, text from elements order by embeddings <=> embedding('Tax') limit 10;
type | text
-------------------+--------------------------------------------------------------------------------------------------------
Title | FORM
Title | BUSINESS
NarrativeText | The income tax expense was $48,637 in 2021 based on an effective rate of 2.25 percent compared to the benefit of ($605,936) in 2020 based on an effective rate of 17.42 percent. The 2.25 percent effective tax rate for 2021 differed from the statutory federal income tax rate of 21.0 percent and was primarily attributable to (i) increased tax benefit from the exercise of stock options; (ii) the increased foreign rate differential and (iii) the Company maintaining a valuation allowance against its deferred tax assets.
UncategorizedText | 11
Title | GOVERNMENT REGULATION
UncategorizedText | 35
UncategorizedText | 41
Title | UNITED STATES
UncategorizedText | 45
The embeddings
field is the embedding of the text column of the element. And, you can use a semantic search to find the nearest elements to a given embedding.
vector_playground=# select text from elements where type = 'Title' and embeddings <=> embedding('first day of the year') < 0.2;
text
----------------------------------------
YEARS ENDED December 31, 2021 AND 2020
YEARS ENDED DECEMBER 31, 2021 AND 2020
YEARS ENDED DECEMBER 31, 2021 AND 2020
YEARS ENDED DECEMBER 31, 2021 AND 2020
(4 rows)
The other example fetches the TimescaleDB documentation and inserts into the database using the fetch_timescale_pages.py
script.
python fetch_timescale_pages.py
It will fetch the documentation from the TimescaleDB website and insert it into the database.
Then, you can run a semantic search using the following SQL:
SELECT url, text FROM elements
WHERE url ~ 'docs.timescale.com'
ORDER BY embeddings <=> embeddings('install timescaledb-ha via docker') LIMIT 1;
Unstructured has a lot of data files to play with. If you have a PDF file, you can import it into the database using the following command:
./import.sh ../unstructured/example-docs/embedded-images.pdf
And, you can run a semantic search using the following SQL:
SELECT text
FROM elements
WHERE filename ~ 'embedded-images.pdf'
ORDER BY embeddings <=> embedding('Hands Free') LIMIT 2 ;
text
---------------------------------------------------------------------------------------------------------
"HANDS-FREE" CARD: use (2/4)
Hands-free unlocking, when approaching the vehicle; With the card in access zone 3, the ve- hicle will unlock. Unlocking is indicated by one flash of the hazard warning lights and the indicator lights.
(2 rows)
- Unstructured is the company behind the Unstructured library.
- PGVector is the vector extension for postgresql.
- pgai is the ai extension for postgresql that can be used to run queries and create the vector embeddings.
This project is a simple example of how to use Unstructured with PGAI.
We welcome contributions to improve this project! Here are some ways you can contribute:
- Report bugs or suggest features by opening an issue.
- Submit pull requests to fix bugs or add new features.
- Improve documentation or add more examples.
- Share your experience using the project with others.
Before contributing, please read our contributing guidelines CONTRIBUTING.md for more information on our development process, coding standards, and how to submit pull requests.
Thank you for helping make this project better!