Regarding huge performance boost from --no-unicode flag #2584

taoxinyi · 2023-08-15T23:37:31Z

When running rg with the -uuu flag on folders (around 100k files) containing a mix of binary files (some of them large) and text files, we noticed that the search takes about 30 minutes to finish in the default Unicode mode, while the time drops to less than 1 minute when --no-unicode is provided.

We would like to know whether this is correct and expected behavior (e.g. all files are still scanned and none are skipped partway through). Our use case is to find all occurrences of some regex matches, even in binary files.

We invoke rg with --file (one regex per line) and the following flags:

--json
--trim
--context=1
--no-messages
-uuu
--search-zip
--null-data
--text
--crlf
--ignore-case
--no-multiline
--no-multiline-dotall
--no-unicode #turn it on or off

BurntSushi · 2023-08-16T00:12:47Z

Maintainer

It's completely impossible to answer your question with the lack of details given. You haven't even shared the regex you're using.........

Are your search time differences possible? Absolutely. Unicode can be extremely expensive. ripgrep does (a lot) better than most, but it isn't magic.

Try passing --dfa-size-limit 99999999999. It isn't guaranteed to help, but there are cases where it does. Of course, I can't predict if it will help in your case because you've omitted almost every relevant detail.

taoxinyi Aug 16, 2023
Author

Hi @BurntSushi, unfortunately I cannot share the exact regexes, but they look something like:

\babc\b
\w{0,10}xyz
pqr

The search targets are Docker images (we decompress the layers after docker save -o and then run rg).

taoxinyi · 2023-08-16T04:34:24Z

Author

@BurntSushi One example is:

docker pull centos && docker save centos -o centos.tar
mkdir centos && tar -xf centos.tar -C centos && cd centos
mkdir extracted && find . -type f -name "*.tar" -exec tar -xf "{}" -C extracted \; && cd extracted

then

# 0.03s
rg -uuu --json  -e "Apple[\s_-]*Banana[\s_-]*\d*|\bAB[\s_-]*\d\b"  --no-unicode  
# 1.18s
rg -uuu --json  -e "Apple[\s_-]*Banana[\s_-]*\d*|\bAB[\s_-]*\d\b"  
# 1.23s
rg -uuu --json  -e "Apple[\s_-]*Banana[\s_-]*\d*" -e "\bAB[\s_-]*\d\b"
# 0.00s
rg -uuu --json  -e "Apple[\s_-]*Banana[\s_-]*\d*"
# 0.02s
rg -uuu --json  -e "\bAB[\s_-]*\d\b"

BurntSushi Aug 16, 2023
Maintainer

Yes, the Unicode word boundary is what's killing you. It's easily confirmed by running this under a profiler:

$ perfr rg-13.0.0 --json -auuu "Apple[\s_-]*Banana[\s_-]*[0-9]*|\bAB[\s_-]*[0-9]\b" ./

(With perfr defined here.)

As you can see, the vast majority of the time is being spent in the PikeVM (the slowest engine):

But if I profile with --no-unicode:

$ perfr --callgraph rg-13.0.0 --json -auuu --no-unicode "Apple[\s_-]*Banana[\s_-]*[0-9]*|\bAB[\s_-]*[0-9]\b" ./

Then most of the time is being spent in the lazy DFA, which is much much faster:

You'll get similar results if you selectively use ASCII word boundaries instead of Unicode word boundaries:

$ time rg-13.0.0 --json -auuu "Apple[\s_-]*Banana[\s_-]*[0-9]*|(?-u:\b)AB[\s_-]*[0-9](?-u:\b)" ./ &> /dev/zero

real    0.043
user    0.323
sys     0.044
maxmem  17 MB
faults  0
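
For a --file setup like the one in the question, one way to apply this without editing each regex by hand might be to rewrite the patterns file before handing it to rg. This is only a sketch under stated assumptions: the file names and sample patterns are hypothetical, and the substitution is purely textual, so it does not understand regex syntax (for example, an escaped backslash followed by b, written `\\b`, would be mangled, and running it twice nests the groups).

```shell
#!/bin/sh
# Sketch: turn every \b in a patterns file into the ASCII-only form (?-u:\b).
# File names and sample patterns below are hypothetical.
set -eu

workdir=$(mktemp -d)
patterns="$workdir/patterns.txt"
rewritten="$workdir/patterns.ascii-b.txt"

# Sample input, one regex per line (modelled on the patterns in this thread).
cat > "$patterns" <<'EOF'
\babc\b
\w{0,10}xyz
pqr
EOF

# Naive textual rewrite: each literal "\b" becomes "(?-u:\b)".
# Lines without \b are passed through unchanged.
sed 's/\\b/(?-u:\\b)/g' "$patterns" > "$rewritten"

# Show the result; the first line becomes (?-u:\b)abc(?-u:\b).
cat "$rewritten"
```

The rewritten file could then be passed via --file in place of the original. Only `\b` is changed here; `\w`, `\s` and `\d` stay Unicode-aware, so results should differ from the original only where a word boundary sits next to a non-ASCII word character.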

It is also worth noting that this particular example is fixed on master:

$ time rg-13.0.0 --json -auuu "Apple[\s_-]*Banana[\s_-]*[0-9]*|\bAB[\s_-]*[0-9]\b" ./ &> /dev/zero

real    0.956
user    8.365
sys     0.037
maxmem  13 MB
faults  0

$ time rg --json -auuu "Apple[\s_-]*Banana[\s_-]*[0-9]*|\bAB[\s_-]*[0-9]\b" ./ &> /dev/zero

real    0.032
user    0.160
sys     0.085
maxmem  18 MB
faults  0

But it's only fixed in the sense that heuristic literal optimizations have gotten better. rg 13.0.0 doesn't know how to pull the A prefix out of this particular regex pattern, but ripgrep on master does. ripgrep on master does still have to use the PikeVM for this (because the Unicode word boundary isn't something the faster engines can deal with), but it does so with the benefit of literal acceleration. So it still looks fast. But if your regexes don't have literal optimizations available to them, then the \b is still going to kill performance even on master:

$ time rg --json -auuu "Apple[\s_-]*Banana[\s_-]*[0-9]*|\b[A-Z][\s_-]*[0-9]\b" ./ &> /dev/zero

real    0.711
user    6.923
sys     0.070
maxmem  51 MB
faults  0

Unfortunately, this is expected overall, and there are no plans to ever fix it, because the faster DFA engines just can't handle Unicode word boundaries. If you want to read more about how ripgrep's regex engine internals work, then this blog might help. It specifically applies to ripgrep master, but generally applies to rg 13.0.0 too.
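
Since a Unicode-aware \b can push the whole search onto the PikeVM, it may be useful to check which lines of a patterns file still contain one. A rough sketch, assuming one regex per line as in the question; the file name and sample patterns are hypothetical, and the check is textual only, so a \b that is already made ASCII by an enclosing (?-u:...) group or a leading (?-u) flag would still be reported.

```shell
#!/bin/sh
# Sketch: list patterns that still contain a \b outside the exact
# ASCII-only form (?-u:\b). File name and patterns are hypothetical.
set -eu

workdir=$(mktemp -d)
patterns="$workdir/patterns.txt"

cat > "$patterns" <<'EOF'
\babc\b
\w{0,10}xyz
pqr
(?-u:\b)AB[\s_-]*[0-9](?-u:\b)
EOF

# 1. Strip every exact occurrence of (?-u:\b) so it is not counted.
# 2. Report (with line numbers) any line where a literal \b remains.
# "|| true" keeps the script going when grep finds nothing.
flagged=$(sed 's/(?-u:\\b)//g' "$patterns" | grep -n -F '\b' || true)

printf '%s\n' "$flagged"
```

With the sample input, only the first line is reported. The line numbers refer to the patterns file, so they can be used to locate the regexes worth rewriting or benchmarking separately.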

Answer selected by BurntSushi
