[Bug]: Incorrect deserialization for ImageDocument objects #15856

Nitro963 · 2024-09-05T13:23:52Z

Bug Description

Stored image documents can never be retrieved as ImageDocument from any document store, because of incorrect deserialization logic in the json_to_doc helper function in the llama_index.core.storage.docstore.utils module.

Version

0.11.2

Steps to Reproduce

from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.schema import ImageDocument

store = SimpleDocumentStore()
image_doc = ImageDocument(
    image_url="https://venturebeat.com/wp-content/uploads/2024/03/Screenshot-2024-03-04-at-12.49.41%E2%80%AFAM.png",
)

store.add_documents([
    image_doc
])

doc_from_store = store.get_document(image_doc.doc_id)

print(doc_from_store.image_url)

Because the document is retrieved as a plain Document object, image_url is not available on it, and accessing it raises an AttributeError.

Relevant Logs/Tracebacks

Traceback (most recent call last):
  File "/server_code/playground.py", line 15, in <module>
    print(doc_from_store .image_url)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pydantic/main.py", line 811, in __getattr__
    raise AttributeError(f'{type(self).__name__!r} object has no attribute {item!r}')
AttributeError: 'Document' object has no attribute 'image_url'


dosubot · 2024-09-05T13:26:33Z

The json_to_doc helper function in the llama_index.core.storage.docstore.utils module already includes the necessary logic to handle ImageDocument objects correctly. Here is the relevant part of the function:

def json_to_doc(doc_dict: dict) -> BaseNode:
    doc_type = doc_dict[TYPE_KEY]
    data_dict = doc_dict[DATA_KEY]
    doc: BaseNode

    if "extra_info" in data_dict:
        return legacy_json_to_doc(doc_dict)
    else:
        if doc_type == Document.get_type():
            doc = Document.parse_obj(data_dict)
        elif doc_type == ImageDocument.get_type():
            doc = ImageDocument.parse_obj(data_dict)
        elif doc_type == TextNode.get_type():
            doc = TextNode.parse_obj(data_dict)
        elif doc_type == ImageNode.get_type():
            doc = ImageNode.parse_obj(data_dict)
        elif doc_type == IndexNode.get_type():
            doc = IndexNode.parse_obj(data_dict)
        else:
            raise ValueError(f"Unknown doc type: {doc_type}")

        return doc

This function checks the doc_type and correctly calls ImageDocument.parse_obj(data_dict) when the doc_type matches ImageDocument.get_type(). This ensures that ImageDocument objects are deserialized correctly from the document store ^[1]^[2].

Therefore, the existing code should handle the deserialization of ImageDocument objects correctly. If you are still encountering issues, please ensure that the doc_type is being set correctly when storing the document.
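For illustration, here is one way an if/elif chain like the one above could still hand back a plain Document: if a subclass reported the same type tag as its parent, the parent's branch would match first and the subclass branch would never run. The sketch below uses hypothetical stand-in classes and helpers (Doc, ImageDoc, to_json, load_by_type_only, load_with_class_name), not llama_index code. The shared-tag premise is an assumption and is not confirmed anywhere in this thread.

```python
from dataclasses import dataclass, asdict, fields


# Hypothetical stand-ins for illustration only; these are NOT llama_index classes.
@dataclass
class Doc:
    text: str = ""

    @classmethod
    def get_type(cls) -> str:
        return "document"

    @classmethod
    def class_name(cls) -> str:
        return cls.__name__


@dataclass
class ImageDoc(Doc):
    # Assumption: the subclass does not override get_type(),
    # so it inherits the parent's "document" tag.
    image_url: str = ""


def to_json(doc: Doc) -> dict:
    """Serialize with a type tag plus the concrete class name."""
    data = {**asdict(doc), "class_name": doc.class_name()}
    return {"__type__": doc.get_type(), "__data__": data}


def _build(cls, data: dict):
    """Construct cls from data, silently dropping keys cls does not define."""
    known = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in data.items() if k in known})


def load_by_type_only(payload: dict) -> Doc:
    """Dispatch on the type tag alone; the second branch is shadowed."""
    data = payload["__data__"]
    if payload["__type__"] == Doc.get_type():
        return _build(Doc, data)
    elif payload["__type__"] == ImageDoc.get_type():
        return _build(ImageDoc, data)  # never reached when the tags are equal
    raise ValueError(f"Unknown doc type: {payload['__type__']}")


def load_with_class_name(payload: dict) -> Doc:
    """Disambiguate inside the shared-tag branch using the stored class name."""
    data = payload["__data__"]
    if payload["__type__"] == Doc.get_type():
        is_image = data.get("class_name") == ImageDoc.class_name()
        return _build(ImageDoc if is_image else Doc, data)
    raise ValueError(f"Unknown doc type: {payload['__type__']}")


payload = to_json(ImageDoc(image_url="https://example.com/a.png"))
print(type(load_by_type_only(payload)).__name__)
print(type(load_with_class_name(payload)).__name__)
```

Under that assumption, checking a second discriminator such as the serialized class name before falling back to the parent class is one possible way to recover the subclass. Whether the merged fix takes this approach is not stated here.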


Nitro963 added the bug (Something isn't working) and triage (Issue needs to be triaged/prioritized) labels Sep 5, 2024

Nitro963 mentioned this issue Sep 5, 2024: Fix image document deserialization issue #15857 (Merged)

run-llama deleted a comment Sep 5, 2024

logan-markewich closed this as completed in #15857 Sep 5, 2024
