Feature Request: Filtering for Small and Invisible Text #274

nelzomal · 2024-11-18T10:52:49Z

Currently, there is filtering for small, invisible, or irrelevant images. However, implementing similar filtering for small or invisible text is equally important, as such text can significantly impact content quality by introducing noise or misleading information.
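To make the request concrete, here is a minimal sketch of the kind of check meant by "small or invisible text". It is not existing library code: the function names and the 6px threshold are assumptions chosen for illustration, and it only inspects inline `style` attributes, so it would miss styles applied through stylesheets.

```python
import re

# Hypothetical threshold, chosen only for illustration.
MIN_FONT_SIZE_PX = 6.0


def parse_inline_style(style: str) -> dict[str, str]:
    """Parse an inline CSS style string into a lowercase property -> value dict."""
    props = {}
    for declaration in style.split(";"):
        if ":" in declaration:
            name, value = declaration.split(":", 1)
            # Drop "!important" so the value can be compared directly.
            cleaned = value.lower().replace("!important", "").strip()
            props[name.strip().lower()] = cleaned
    return props


def is_hidden_or_tiny(style: str, min_font_px: float = MIN_FONT_SIZE_PX) -> bool:
    """Return True if the inline style suggests the text is invisible or very small."""
    props = parse_inline_style(style)
    if props.get("display") == "none":
        return True
    if props.get("visibility") in ("hidden", "collapse"):
        return True
    opacity = props.get("opacity")
    if opacity is not None:
        try:
            if float(opacity) == 0.0:
                return True
        except ValueError:
            pass  # Non-numeric opacity (e.g. a percentage) is ignored in this sketch.
    # Only absolute pixel sizes are handled; em/rem/% need a rendered page.
    match = re.fullmatch(r"(\d+(?:\.\d+)?)px", props.get("font-size", ""))
    if match and float(match.group(1)) < min_font_px:
        return True
    return False
```

A scraper could call `is_hidden_or_tiny(tag_style)` on each text-bearing element and skip those that return True.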

I would like to know if there is any plan to implement this feature. If not, I’d be happy to contribute by working on a pull request. Could someone provide pointers to the relevant parts of the codebase that would need modification to add this functionality?

unclecode · 2024-11-20T11:36:48Z

@nelzomal Thank you so much for the suggestion; I agree with it. Please go ahead and create the pull request, and share your email address with me so I can send you a Discord invitation. I would love to have your help with this. For the relevant part of the code, check content_scraping_strategy.py::WebScrapingStrategy.score_image_for_usefulness(). Please wait until I release the new version, then refer to that version on the main branch; as of now, this function is at line 244. I appreciate your collaboration.

I also need you to pay attention to one very important thing: for me, the processing time of scraping is crucial. Right now, the average is around 100 milliseconds, and I spent a lot of time making it this efficient. Any new step or process therefore comes at a cost in computation time. Please test the computation time on multiple websites before and after you apply this change, and make sure that we're not losing any time. Thank you so much.
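One way such a before/after comparison could be run is sketched below. This is only an illustration: `median_runtime_ms` and `compare` are hypothetical helpers, not part of the library, and the inputs are assumed to be HTML strings fetched ahead of time so that network latency does not affect the measurement.

```python
import statistics
import time
from typing import Callable, Iterable


def median_runtime_ms(
    func: Callable[[str], object], inputs: Iterable[str], repeats: int = 5
) -> float:
    """Return the median wall-clock time, in milliseconds, of one call to func."""
    pages = list(inputs)
    samples = []
    for _ in range(repeats):
        for page in pages:
            start = time.perf_counter()
            func(page)
            samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def compare(
    baseline: Callable[[str], object],
    candidate: Callable[[str], object],
    inputs: Iterable[str],
    repeats: int = 5,
) -> dict:
    """Time both implementations on the same pages and report the difference."""
    pages = list(inputs)
    before = median_runtime_ms(baseline, pages, repeats)
    after = median_runtime_ms(candidate, pages, repeats)
    return {"before_ms": before, "after_ms": after, "delta_ms": after - before}
```

Here `baseline` would wrap the scraping call without the new filter and `candidate` the same call with it; a `delta_ms` close to zero across many pages would suggest the change does not slow scraping down.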

nelzomal · 2024-12-09T09:32:07Z

@unclecode I have created PR #332.

As mentioned in the PR, I am open to discussing how to implement a better solution.

Currently, the layout-related logic is implemented within the async_crawler_strategy, as layout details are best retrieved during the crawling phase when the web driver renders the page. This allows for efficient detection since the page is already fully rendered.

However, I propose saving the layout information during the crawling phase and leveraging this data to implement more advanced heuristics during the subsequent scrape phase.
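A rough sketch of what this proposal could look like follows. It is not the code in the PR: `ElementLayout`, `build_layout_index`, `keep_text`, and the thresholds are all hypothetical names and values, and the records are assumed to have been collected from the rendered page during the crawl.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementLayout:
    """Layout facts captured for one element while the page is rendered."""
    selector: str
    width: float
    height: float
    font_size_px: float
    visible: bool


def build_layout_index(records: list[ElementLayout]) -> dict[str, ElementLayout]:
    """Crawl phase: store the captured records keyed by element selector."""
    return {record.selector: record for record in records}


def keep_text(
    selector: str,
    layout_index: dict[str, ElementLayout],
    min_font_px: float = 6.0,  # hypothetical threshold
    min_area: float = 1.0,     # hypothetical threshold, in square pixels
) -> bool:
    """Scrape phase: decide whether an element's text should be kept."""
    layout = layout_index.get(selector)
    if layout is None:
        return True  # No layout data was saved, so keep the text by default.
    if not layout.visible:
        return False
    if layout.font_size_px < min_font_px:
        return False
    return layout.width * layout.height >= min_area


def filter_text_blocks(
    blocks: list[tuple[str, str]], layout_index: dict[str, ElementLayout]
) -> list[str]:
    """Return the text of every (selector, text) block that passes keep_text."""
    return [text for selector, text in blocks if keep_text(selector, layout_index)]
```

Because the scrape phase only does dictionary lookups, the extra cost per element stays small; the expensive part (reading layout from the rendered page) happens once during the crawl.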

I’m open to discussing this approach further before proceeding with the extended implementation.

unclecode · 2024-12-09T13:04:48Z

@nelzomal Amazing, I will check it soon, appreciate the help.

nelzomal · 2024-12-13T03:37:33Z

@unclecode hi, any update?

unclecode · 2024-12-14T08:07:02Z

@nelzomal Hi, I am working on multiple things this weekend, plus documentation. There are 3 pull requests I am planning to focus on, including this invisible text one, so fingers crossed.

nelzomal · 2024-12-21T10:23:16Z

Hi @unclecode!
I hope the review is going well!

I’m really interested in contributing more to this repository, and I’d love the opportunity to join the Discord you mentioned earlier.

unclecode · 2024-12-21T13:10:01Z

@nelzomal Hi again. The past few years, I focused on updating website documents, which led to a new library feature: generating LLM.txt files. It’s much better than the standard I’ve seen and will be improved incrementally.

This goes beyond crawling: it creates LLM.txt content or markdown for websites, helping developers get better responses from LLM chats by attaching these markdown files. I’ll work on this today and tomorrow.

Next week, I’ll address pull requests, including yours. Please share your email so I can send you an invite.

Thanks for your help! The library is growing, and we need all the collaboration we can get.

nelzomal · 2024-12-21T13:43:17Z

@unclecode my email is [email protected]

I feel the LLM.txt feature is quite useful. I would like to contribute if possible.

unclecode · 2024-12-25T11:45:17Z

@nelzomal Sending the link. I will also handle the PR this week.

nelzomal · 2024-12-26T01:13:56Z

@unclecode great news!

However, I haven't received any invitation link in my email [email protected]

unclecode · 2024-12-26T08:24:21Z

@nelzomal I sent again, please check.

unclecode self-assigned this Nov 20, 2024

unclecode added the enhancement label Nov 20, 2024
