[Bug]: Incorrect deserialization for ImageDocument objects #15856

Closed
Nitro963 opened this issue Sep 5, 2024 · 1 comment · Fixed by #15857
Labels: bug (Something isn't working), triage (Issue needs to be triaged/prioritized)

Nitro963 commented Sep 5, 2024

Bug Description

Stored image documents can never be retrieved as ImageDocument objects from any document store, because of incorrect deserialization logic in the json_to_doc helper function in the llama_index.core.storage.docstore.utils module.

Version

0.11.2

Steps to Reproduce

from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.schema import ImageDocument

store = SimpleDocumentStore()
image_doc = ImageDocument(
    image_url="https://venturebeat.com/wp-content/uploads/2024/03/Screenshot-2024-03-04-at-12.49.41%E2%80%AFAM.png",
)

store.add_documents([
    image_doc
])

doc_from_store = store.get_document(image_doc.doc_id)

print(doc_from_store.image_url)

Because the document is retrieved as a plain Document object rather than an ImageDocument, image_url is not accessible, and the final line raises an AttributeError.

Relevant Logs/Tracebacks

Traceback (most recent call last):
  File "/server_code/playground.py", line 15, in <module>
    print(doc_from_store .image_url)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pydantic/main.py", line 811, in __getattr__
    raise AttributeError(f'{type(self).__name__!r} object has no attribute {item!r}')
AttributeError: 'Document' object has no attribute 'image_url'

dosubot bot commented Sep 5, 2024

The json_to_doc helper function in the llama_index.core.storage.docstore.utils module already includes the necessary logic to handle ImageDocument objects correctly. Here is the relevant part of the function:

def json_to_doc(doc_dict: dict) -> BaseNode:
    doc_type = doc_dict[TYPE_KEY]
    data_dict = doc_dict[DATA_KEY]
    doc: BaseNode

    if "extra_info" in data_dict:
        return legacy_json_to_doc(doc_dict)
    else:
        if doc_type == Document.get_type():
            doc = Document.parse_obj(data_dict)
        elif doc_type == ImageDocument.get_type():
            doc = ImageDocument.parse_obj(data_dict)
        elif doc_type == TextNode.get_type():
            doc = TextNode.parse_obj(data_dict)
        elif doc_type == ImageNode.get_type():
            doc = ImageNode.parse_obj(data_dict)
        elif doc_type == IndexNode.get_type():
            doc = IndexNode.parse_obj(data_dict)
        else:
            raise ValueError(f"Unknown doc type: {doc_type}")

        return doc

This function checks the doc_type and correctly calls ImageDocument.parse_obj(data_dict) when the doc_type matches ImageDocument.get_type(). This ensures that ImageDocument objects are deserialized correctly from the document store [1][2].

Therefore, the existing code should handle the deserialization of ImageDocument objects correctly. If you are still encountering issues, please ensure that the doc_type is being set correctly when storing the document.
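A branch for ImageDocument can be present in json_to_doc and still never run. One way that could happen, offered here only as an assumption about 0.11.2 and not verified against its source, is that ImageDocument inherits get_type() from Document without overriding it, so both classes report the same type tag and the first comparison always wins. The sketch below models that situation with stand-in classes; it does not import llama_index, and CLASS_KEY and json_to_doc_by_class_name are hypothetical names introduced for illustration, not the library's API or the change made in #15857.

```python
from dataclasses import dataclass

TYPE_KEY = "__type__"
DATA_KEY = "__data__"
CLASS_KEY = "__class_name__"  # hypothetical extra key, used only by the alternative below


@dataclass
class Document:
    text: str = ""

    @classmethod
    def get_type(cls) -> str:
        return "document"

    @classmethod
    def class_name(cls) -> str:
        return "Document"


@dataclass
class ImageDocument(Document):
    image_url: str = ""

    # get_type() is deliberately NOT overridden: this is the assumption being modeled.
    @classmethod
    def class_name(cls) -> str:
        return "ImageDocument"


def doc_to_json(doc: Document) -> dict:
    """Serialize a document together with its type tag and class name."""
    return {
        TYPE_KEY: doc.get_type(),
        CLASS_KEY: doc.class_name(),
        DATA_KEY: dict(vars(doc)),
    }


def json_to_doc(doc_dict: dict) -> Document:
    """Dispatch on the type tag, in the same order as the function quoted above."""
    doc_type = doc_dict[TYPE_KEY]
    data = doc_dict[DATA_KEY]
    if doc_type == Document.get_type():
        # Stand-in for a parser that ignores fields the class does not declare.
        return Document(text=data.get("text", ""))
    elif doc_type == ImageDocument.get_type():
        # Unreachable while both classes share one type tag.
        return ImageDocument(**data)
    raise ValueError(f"Unknown doc type: {doc_type}")


def json_to_doc_by_class_name(doc_dict: dict) -> Document:
    """Hypothetical alternative: dispatch on the unique class name instead."""
    registry = {cls.class_name(): cls for cls in (Document, ImageDocument)}
    return registry[doc_dict[CLASS_KEY]](**doc_dict[DATA_KEY])


original = ImageDocument(image_url="https://example.com/a.png")
payload = doc_to_json(original)

restored = json_to_doc(payload)
fixed = json_to_doc_by_class_name(payload)

print(type(restored).__name__)  # Document
print(type(fixed).__name__)     # ImageDocument
```

In this model the round trip through json_to_doc loses the subclass, which matches the AttributeError reported above, while dispatching on a value that is unique per class preserves it. To check whether the same thing happens in an actual install, comparing Document.get_type() with ImageDocument.get_type() would show whether the two tags collide.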
