Feature Request: Filtering for Small and Invisible Text #274
Comments
@nelzomal Thank you so much for the suggestion; I agree with it. Please go ahead and create the pull request, and share your email address with me so I can send you a Discord invitation. I would love to have your help moving this suggestion forward. As for the relevant part of the code, check … One very important thing to pay attention to: for me, the processing time of scraping is crucial. The average is currently around 100 milliseconds, and I spent a lot of time making it that efficient, so any new step or process comes at a cost in computation time. Please measure the computation time on multiple websites before and after applying this change, and make sure we are not losing any time. Thank you so much.
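The before/after timing check requested above could be sketched roughly as follows. This is only an illustration, not the project's benchmark: `mean_ms` and `compare` are hypothetical helper names, and `baseline` and `candidate` stand for whatever scraping functions are being compared (for example, the scraper without and with the new text filter), run over a list of saved HTML pages.

```python
import statistics
import time


def mean_ms(func, inputs, repeats=5):
    """Mean wall-clock time of func(item), in milliseconds, over all inputs."""
    samples = []
    for _ in range(repeats):
        for item in inputs:
            start = time.perf_counter()
            func(item)
            samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples)


def compare(baseline, candidate, inputs, repeats=5):
    """Time both callables on the same inputs and report the difference."""
    before = mean_ms(baseline, inputs, repeats)
    after = mean_ms(candidate, inputs, repeats)
    return {"before_ms": before, "after_ms": after, "delta_ms": after - before}
```

Running both callables over the same saved pages, rather than live URLs, keeps network latency out of the measurement, so `delta_ms` reflects only the added processing cost.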
@unclecode I have created PR #332. As mentioned in the PR, I am open to discussing how to implement a better solution.
@nelzomal Amazing, I will check it soon. I appreciate the help.
@unclecode Hi, any update?
@nelzomal Hi, I am working on multiple things this weekend, plus documentation. There are 3 pull requests I am planning to focus on, including this invisible text one, so fingers crossed.
Hi @unclecode! I’m really interested in contributing more to this repository, and I’d love the opportunity to join the Discord you mentioned earlier.
@nelzomal Hi again. The past few years, I focused on updating website documents, which led to a new library feature: generating LLM.txt files. It’s much better than the standard I’ve seen and will be improved incrementally. It goes beyond crawling: it creates LLM.txt content or markdown for websites, helping developers get better responses from LLM chats by attaching these markdown files. I’ll work on this today and tomorrow. Next week, I’ll address pull requests, including yours. Please share your email so I can send you an invite. Thanks for your help! The library is growing, and we need all the collaboration we can get.
@unclecode My email is [email protected]. I feel the LLM.txt feature is quite useful, and I would like to contribute if possible.
@nelzomal Sending the link. I will also handle the PR this week.
@unclecode Great news! However, I haven't received any invitation link at my email, [email protected].
@nelzomal I sent it again, please check.
Currently, there is filtering for small, invisible, or irrelevant images. However, implementing similar filtering for small or invisible text is equally important, as such text can significantly impact content quality by introducing noise or misleading information.
I would like to know if there is any plan to implement this feature. If not, I’d be happy to contribute by working on a pull request. Could someone provide pointers to the relevant parts of the codebase that would need modification to add this functionality?
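To make the request concrete, here is a minimal sketch of what such a filter might look like. It is not the library's implementation and not the approach taken in any PR: `visible_text`, `VisibleTextExtractor`, and the `MIN_FONT_PX` threshold are hypothetical, it uses only the Python standard library, and it inspects inline `style` attributes and the `hidden` attribute only (no external stylesheets, no computed styles). It also assumes reasonably well-formed HTML with explicit closing tags.

```python
import re
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
MIN_FONT_PX = 6.0  # hypothetical threshold: smaller text is treated as noise


def _is_hidden(attrs):
    """Return True if inline attributes mark the element as hidden or tiny."""
    attr_map = dict(attrs)
    if "hidden" in attr_map:
        return True
    decls = {}
    for part in (attr_map.get("style") or "").lower().split(";"):
        if ":" in part:
            key, value = part.split(":", 1)
            decls[key.strip()] = value.strip()
    if decls.get("display") == "none" or decls.get("visibility") == "hidden":
        return True
    if decls.get("opacity") in {"0", "0.0"}:
        return True
    match = re.fullmatch(r"(\d+(?:\.\d+)?)px", decls.get("font-size", ""))
    return bool(match) and float(match.group(1)) < MIN_FONT_PX


class VisibleTextExtractor(HTMLParser):
    """Collect text that is not inside a hidden or tiny-font element."""

    def __init__(self):
        super().__init__()
        self._hidden_stack = []  # one flag per open element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        parent_hidden = bool(self._hidden_stack) and self._hidden_stack[-1]
        self._hidden_stack.append(parent_hidden or _is_hidden(attrs))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._hidden_stack:
            self._hidden_stack.pop()

    def handle_data(self, data):
        hidden = bool(self._hidden_stack) and self._hidden_stack[-1]
        if not hidden and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Hidden state is inherited from the parent, so text nested anywhere under a `display:none` element is dropped. A real implementation would also need to handle class-based CSS, `visibility: visible` overrides on children, and non-pixel font sizes, all of which this sketch ignores.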