Regarding huge performance boost from --no-unicode flag #2584
When running rg on folders (around 100k files) that mix binary files (some of them large) with text files, using the --no-unicode flag gives a huge performance boost. I would like to know whether this is correct and expected behavior (e.g. all files are scanned and none are skipped partway through). Our use case is to find all occurrences of some regex matches, even in binary files. We invoke rg with
It's completely impossible to answer your question with the lack of details given. You haven't even shared the regex you're using... Are your search time differences possible? Absolutely. Unicode can be extremely expensive. ripgrep does (a lot) better than most, but it isn't magic. Try passing
@BurntSushi One example is
then
Yes, the Unicode word boundary is what's killing you. It's easily confirmed by running this under a profiler:
(With perfr defined here.)
As you can see, the vast majority of the time is being spent in the PikeVM (the slowest engine):
But if I profile with --no-unicode:
Then most of the time is being spent in the lazy DFA, which is much, much faster:
You'll get similar results if you selectively use ASCII word boundaries instead of Unicode word boundaries:
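In ripgrep's regex syntax, an ASCII word boundary is written (?-u:\b), so a pattern can keep Unicode mode overall and opt out only at the boundaries. Switching boundary flavors can change which matches are reported, which matters if you need every occurrence. The sketch below illustrates that semantic difference with Python's re module, not ripgrep's engine; the sample text is made up for illustration, and re.ASCII is assumed here to behave like an ASCII word boundary for this purpose.

```python
import re

# Hypothetical sample text: "foo" directly preceded by a non-ASCII letter.
text = "\u00e9foo bar"

# Unicode-aware \b (Python's default for str patterns): the accented letter
# counts as a word character, so there is no boundary between it and 'f'.
unicode_hit = re.search(r"\bfoo\b", text)

# ASCII-only \b: the accented letter is outside [A-Za-z0-9_], so a boundary
# exists right before 'f' and the pattern matches.
ascii_hit = re.search(r"\bfoo\b", text, flags=re.ASCII)

print(unicode_hit is not None)  # False
print(ascii_hit is not None)    # True
```

So the ASCII variant is not just faster; on text with non-ASCII letters next to a match it can report hits that the Unicode variant would reject (and vice versa), so it is worth checking which behavior you actually want.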