Feature Request: Filtering for Small and Invisible Text #274

Open
nelzomal opened this issue Nov 18, 2024 · 11 comments

Comments

@nelzomal
Contributor

Currently, there is filtering for small, invisible, or irrelevant images. However, implementing similar filtering for small or invisible text is equally important, as such text can significantly impact content quality by introducing noise or misleading information.

I would like to know if there is any plan to implement this feature. If not, I’d be happy to contribute by working on a pull request. Could someone provide pointers to the relevant parts of the codebase that would need modification to add this functionality?

@unclecode unclecode self-assigned this Nov 20, 2024
@unclecode unclecode added the enhancement New feature or request label Nov 20, 2024
@unclecode
Owner

unclecode commented Nov 20, 2024

@nelzomal Thank you so much for the suggestion; I agree with it. Please go ahead and create the pull request, and share your email address with me so I can send you a Discord invitation. I would love your help with this. For the relevant part of the code, check content_scraping_strategy.py::WebScrapingStrategy.score_image_for_usefulness(). Also, wait until I release the new version, then refer to that version from the main branch; as of now, this function is at line 244. I appreciate your collaboration.

I also need you to pay attention to one very important thing. For me, the processing time of scraping is crucial. Right now, the average is around 100 milliseconds, and I spent a lot of time making it that efficient. Any new step or process comes at a cost in computation time. Please measure the computation time on multiple websites before and after you apply this change and make sure that we're not losing any time. Thank you so much.
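For reference, a before/after timing check like the one requested here could be sketched as follows. This is only a minimal illustration, not the project's benchmark: `median_ms` and `compare` are hypothetical helpers, and `baseline` and `candidate` stand in for whatever callables wrap the scraping step without and with the new text filter.

```python
import statistics
import time
from typing import Callable, Dict, Iterable


def median_ms(scrape: Callable[[str], object], pages: Iterable[str], repeats: int = 5) -> float:
    """Median wall-clock time of one scrape call, in milliseconds."""
    samples = []
    for html in pages:
        for _ in range(repeats):
            start = time.perf_counter()
            scrape(html)
            samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def compare(
    baseline: Callable[[str], object],
    candidate: Callable[[str], object],
    pages: Iterable[str],
    repeats: int = 5,
) -> Dict[str, float]:
    """Time both callables on the same pages and report the difference."""
    pages = list(pages)  # reuse the same inputs for both runs
    before = median_ms(baseline, pages, repeats)
    after = median_ms(candidate, pages, repeats)
    return {"before_ms": before, "after_ms": after, "delta_ms": after - before}
```

Using the median rather than the mean keeps a single slow run from dominating the result; the pages should be saved HTML from several different websites so that network time is excluded.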

@nelzomal
Contributor Author

nelzomal commented Dec 9, 2024

@unclecode I have created a PR #332

As mentioned in the PR, I am open to discussing how to implement a better solution.

Currently, the layout-related logic is implemented within the async_crawler_strategy, as layout details are best retrieved during the crawling phase when the web driver renders the page. This allows for efficient detection since the page is already fully rendered.

However, I propose saving the layout information during the crawling phase and leveraging this data to implement more advanced heuristics during the subsequent scrape phase.
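A rough sketch of that idea, assuming the crawler can record a few computed-style values per text element: the `TextLayout` record, the `is_readable` check, the `filter_text` helper, and the 6 px threshold are all hypothetical and are not taken from PR #332 or the existing codebase.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TextLayout:
    """Hypothetical layout record saved per element while the page is rendered."""
    font_size_px: float
    opacity: float
    display: str
    visibility: str
    width: float
    height: float


def is_readable(layout: TextLayout, min_font_px: float = 6.0) -> bool:
    """Heuristic: reject hidden, fully transparent, zero-area, or tiny text."""
    if layout.display == "none" or layout.visibility in ("hidden", "collapse"):
        return False
    if layout.opacity == 0 or layout.width <= 0 or layout.height <= 0:
        return False
    return layout.font_size_px >= min_font_px  # threshold is an assumed default


def filter_text(blocks: Dict[str, str], layouts: Dict[str, TextLayout]) -> List[str]:
    """Scrape-phase filter: keep text with no layout record or a readable one."""
    return [
        text
        for key, text in blocks.items()
        if key not in layouts or is_readable(layouts[key])
    ]
```

Because the scrape phase only does dictionary lookups and comparisons here, the extra cost would sit mostly in collecting the records during crawling, which is where the timing measurements matter.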

I’m open to discussing this approach further before proceeding with the extended implementation.

@unclecode
Owner

@nelzomal Amazing, I will check it soon, appreciate the help.

@nelzomal
Contributor Author

@unclecode hi, any update?

@unclecode
Owner

@nelzomal Hi, I am working on multiple things this weekend, plus documentation. There are 3 pull requests I am planning to focus on, including this invisible text one, so fingers crossed.

@nelzomal
Contributor Author

Hi @unclecode!
I hope the review is going well!

I’m really interested in contributing more to this repository, and I’d love the opportunity to join the Discord you mentioned earlier.

@unclecode
Owner

@nelzomal Hi again. The past few years, I focused on updating website documents, which led to a new library feature: generating LLM.txt files. It's much better than the standard I've seen and will be improved incrementally.

This goes beyond crawling: it creates LLM.txt content or markdown for websites, helping developers get better responses from LLM chats by attaching these markdown files. I'll work on this today and tomorrow.

Next week, I’ll address pull requests, including yours. Please share your email so I can send you an invite.

Thanks for your help! The library is growing, and we need all the collaboration we can get.

@nelzomal
Contributor Author

@unclecode my email is [email protected]

I feel the LLM.txt feature is quite useful. I would like to contribute if possible.

@unclecode
Owner

@nelzomal Sending the link. I will also handle the PR this week.

@nelzomal
Contributor Author

@unclecode great news!

However, I haven't received any invitation link in my email [email protected]

@unclecode
Owner

@nelzomal I sent it again, please check.

2 participants