ripgrep: case insensitive search is faster than case sensitive search #2444

lfreist · 2023-03-04T14:57:36Z

Hi,

I was playing around with ripgrep and noticed that case insensitive search is faster than case sensitive search on a single file with a single regex pattern:

> rg --version
ripgrep 13.0.0
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)

> time rg "pat[t ]ern" 3-gb.txt -c
15362

real    0m1,221s
user    0m1,108s
sys     0m0,113s

> time rg "pat[t ]ern" 3-gb.txt -c -i
15364

real    0m0,485s
user    0m0,381s
sys     0m0,105s

(The file 3-gb.txt contains exactly 3G characters of random English words.)

As you can see, the search with the -i flag is about 2.5 times faster (0.485s versus 1.221s). The search results are correct when compared to GNU grep, and the timings are reproducible when the data is cached.
This is not only true for counting matches but also for searching lines or byte offsets - the case insensitive search remains faster.

Can someone explain why?

Thanks in advance!

BurntSushi · 2023-03-04T15:01:14Z

BurntSushi
Mar 4, 2023
Maintainer

Could you share a haystack where this occurs? I can try to make one, but an analysis might really depend on the exact haystack.

3GB is too big to share, but perhaps you can share a script to generate one from /usr/share/dict/words.

1 reply

lfreist Mar 4, 2023
Author

Sure, I have one here: https://github.com/lfreist/x-search/blob/main/scripts/createTestFile.py

python3 createTestFile.py -s 3 -o /tmp/3-gb.txt /usr/share/dict/words

However, I just realized that this most likely comes from the Rust regex engine: case insensitive search appears to be faster there too, which is impressive, since other regex engines (Google RE2, boost::regex) are slower when running case insensitive searches.

BurntSushi · 2023-03-04T17:50:36Z

BurntSushi
Mar 4, 2023
Maintainer

OK, so this is a somewhat intriguing example. The short answer is that ripgrep does pretty sophisticated black magic called "literal optimizations." It's black magic because these optimizations are basically a big bag of heuristics that usually work really well. In this case, both searches are pretty fast, but it is indeed somewhat unusual to see a case insensitive search run so much faster than the case sensitive version.

I'll also note that GNU grep demonstrates a similar phenomenon here:

$ time LC_ALL=C grep -E "pat[t ]ern" 3-gb.txt -c
12564

real    3.026
user    2.747
sys     0.277
maxmem  8 MB
faults  0

$ time LC_ALL=C grep -E "pat[t ]ern" 3-gb.txt -c -i
12565

real    3.323
user    2.964
sys     0.356
maxmem  8 MB
faults  0

$ time LC_ALL=en_US.UTF-8 grep -E "pat[t ]ern" 3-gb.txt -c
12564

real    2.997
user    2.716
sys     0.280
maxmem  8 MB
faults  0

$ time LC_ALL=en_US.UTF-8 grep -E "pat[t ]ern" 3-gb.txt -c -i
12565

real    1.675
user    1.414
sys     0.260
maxmem  8 MB
faults  0

Although for GNU grep, it is doubly weird because it only happens in the en_US.UTF-8 locale and not in the C locale, so with -i the en_US.UTF-8 locale ends up faster than the C locale. I can't recall ever seeing another example of that. In my experience, GNU grep has been either the same speed or slower in the en_US.UTF-8 locale. (GNU grep also engages in the aforementioned black magic, although it's not quite as sophisticated and robust as what ripgrep does.)

So basically, what it comes down to here is that the case sensitive version of pat[t ]ern gets translated to pattern|pat ern, which is then given to the aho-corasick library, and that in turn says, "hey! both patterns start with p, so let's just run memchr on p and then confirm the matches."
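
The "run memchr on p and then confirm" strategy can be sketched roughly as follows. This is a minimal Python illustration, not ripgrep's or aho-corasick's actual code: bytes.find stands in for memchr, the standard re module stands in for the confirmation step, and the function name and sample haystack are made up for the example.

```python
import re

def count_with_byte_prefilter(haystack: bytes, first_byte: bytes, pattern: re.Pattern):
    """Return (candidates, matches) for a 'find one byte, then confirm' search."""
    candidates = 0
    matches = 0
    pos = haystack.find(first_byte)  # plays the role of memchr
    while pos != -1:
        candidates += 1
        # Confirmation step: does the full pattern match starting at this byte?
        if pattern.match(haystack, pos):
            matches += 1
        pos = haystack.find(first_byte, pos + 1)
    return candidates, matches

# Hypothetical sample haystack, chosen so that "p" is common.
sample = b"apple pattern happy pat ern puppy paper pattern"
candidates, matches = count_with_byte_prefilter(sample, b"p", re.compile(rb"pat[t ]ern"))
print(f"{candidates} candidates for {matches} matches")
```

Every candidate that fails confirmation is a stop-and-restart of the fast byte scan, which is the cost discussed in the rest of this answer.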

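The case insensitive version, by contrast, expands into a set of mixed-case literal prefixes (the 48 five-byte literals shown in the command further down), and that set has no single common first byte. That search goes to Teddy, a vectorized multiple substring matcher. Generally speaking, Teddy is much slower than memchr, because memchr has to do far less work. It's just looking for a single byte. Teddy has to look for many substrings at once, so it needs to do a lot more work. So at first blush, the fact that memchr is used for the case sensitive search and Teddy is used for the case insensitive search all seems fine and good. It makes sense.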
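
To make the "multiple substring" part concrete: for the case insensitive search, the literals involved are the case variants of the leading bytes of pattern and pat ern, and the command further down in this answer lists 48 of them. The following is a minimal Python sketch that regenerates that list. The helper name case_variants is made up for illustration, and this is not how ripgrep derives its literals internally.

```python
from itertools import product

def case_variants(literal: str) -> list[str]:
    """Return every upper/lower case combination of an ASCII literal."""
    # Characters without case (such as a space) contribute a single choice.
    choices = [sorted({ch.lower(), ch.upper()}) for ch in literal]
    return ["".join(combo) for combo in product(*choices)]

# Five-byte prefixes of "pat ern" and "pattern".
prefixes = case_variants("pat e") + case_variants("patte")
print(len(prefixes))
print(sorted({p[0] for p in prefixes}))  # distinct first bytes in the set
```

Because the set contains literals starting with both P and p, a single-byte scan on a shared first byte is not available for it.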

But... the problem is that memchr is looking for a byte that occurs very frequently in your haystack:

$ rg -co p 3-gb.txt
67454281

That means that while memchr is quite fast, it doesn't really matter here because it is starting and stopping constantly. That is, it is said to have a high false positive rate. It produces lots of candidates that don't ultimately lead to a match. But Teddy, even though its actual algorithm is generally slower, is going to be spending more time in its vector code because it is more discriminatory in what it matches. It has a much lower false positive rate here:

$ rg -co '(?-u:PAT E|pAT E|PaT E|paT E|PAt E|pAt E|Pat E|pat E|PATTE|pATTE|PaTTE|paTTE|PAtTE|pAtTE|PatTE|patTE|PATtE|pATtE|PaTtE|paTtE|PAttE|pAttE|PattE|pattE|PAT e|pAT e|PaT e|paT e|PAt e|pAt e|Pat e|pat e|PATTe|pATTe|PaTTe|paTTe|PAtTe|pAtTe|PatTe|patTe|PATte|pATte|PaTte|paTte|PAtte|pAtte|Patte|patte)' 3-gb.txt
58855

That's only about 4.7 times the total count for pat[t ]ern, which is quite fantastic. It's very discriminatory, so it's going to spend most of its time in tight vectorized code.
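
The difference in false positive rates can be put into numbers using only the counts reported in this thread. The short Python calculation below treats the two rg -co counts as approximations of how many candidates each strategy has to confirm; that reading is an assumption made for illustration, and the variable names are made up.

```python
# Counts copied from the command outputs in this thread.
single_byte_candidates = 67_454_281   # occurrences of "p" (rg -co p)
literal_set_candidates = 58_855       # occurrences of the 48 prefix literals
case_sensitive_matches = 12_564       # rg "pat[t ]ern" -c
case_insensitive_matches = 12_565     # grep -E "pat[t ]ern" -c -i

def candidates_per_match(candidates: int, matches: int) -> float:
    """Average number of candidates inspected for every real match."""
    return candidates / matches

memchr_ratio = candidates_per_match(single_byte_candidates, case_sensitive_matches)
teddy_ratio = candidates_per_match(literal_set_candidates, case_insensitive_matches)

print(f"single byte scan: about {memchr_ratio:.0f} candidates per match")
print(f"literal set scan: about {teddy_ratio:.1f} candidates per match")
```

Under that reading, the single-byte scan confirms thousands of candidates per match, while the literal set confirms fewer than five.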

GNU grep is vulnerable to similar problems, because for a single substring search, it always feeds the last byte to memchr. It's really easy to provoke huge speed differences because of this:

$ time grep 'Sherlocka' 3-gb.txt -c
0

real    1.511
user    1.230
sys     0.280
maxmem  8 MB
faults  0

$ time grep 'SherlockZ' 3-gb.txt -c
0

real    0.357
user    0.083
sys     0.273
maxmem  8 MB
faults  0
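
The last-byte strategy described above can be sketched like this. It is a minimal Python illustration of the idea only, not GNU grep's implementation: bytes.find stands in for memchr, and the function name and sample haystack are made up.

```python
def last_byte_search(haystack: bytes, needle: bytes):
    """Return (candidates, matches): scan for the needle's last byte, then verify."""
    last = needle[-1:]
    n = len(needle)
    candidates = 0
    matches = 0
    # A full match cannot end before index n - 1, so start scanning there.
    pos = haystack.find(last, n - 1)
    while pos != -1:
        candidates += 1
        start = pos - n + 1
        if haystack[start:pos + 1] == needle:  # verification step
            matches += 1
        pos = haystack.find(last, pos + 1)
    return candidates, matches

# Hypothetical sample: "a" is common here, "Z" never occurs.
sample = b"a sample haystack has many a's; Sherlocka appears once"
print(last_byte_search(sample, b"Sherlocka"))
print(last_byte_search(sample, b"SherlockZ"))
```

A needle ending in a common byte produces a candidate at almost every word, while a needle ending in a rare byte lets the byte scan run uninterrupted, which matches the gap between the two timings above.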

ripgrep, by contrast, is just as fast in both cases:

$ time rg 'Sherlocka' 3-gb.txt -c

real    0.312
user    0.208
sys     0.104
maxmem  2865 MB
faults  0

$ time rg 'SherlockZ' 3-gb.txt -c

real    0.314
user    0.167
sys     0.147
maxmem  2865 MB
faults  0

That's because ripgrep uses different algorithms and tries to be robust against these sorts of things. But sometimes it guesses wrong, which leads to cases like this one. In fact, you don't even need the -i flag to provoke it:

$ time rg "pat[t ]ern" 3-gb.txt -c
12564

real    1.090
user    0.979
sys     0.110
maxmem  2865 MB
faults  0

$ time rg "[Pp]at[t ]ern" 3-gb.txt -c
12564

real    0.423
user    0.276
sys     0.146
maxmem  2866 MB
faults  0
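
One way to picture why adding [Pp] changes the outcome is a toy version of the "do all literals start with the same byte?" check described earlier in this answer. The Python sketch below is purely hypothetical: the function names, the returned strings, and the decision rule itself are simplifications for illustration, and ripgrep's real heuristics are more involved.

```python
from itertools import product

def expand(parts: list[str]) -> list[str]:
    """Expand a sequence of per-position alternatives into all literals."""
    return ["".join(combo) for combo in product(*parts)]

def choose_prefilter(literals: list[str]) -> str:
    """Toy rule: use a single-byte scan only if every literal shares a first byte."""
    first_bytes = {lit[0] for lit in literals}
    if len(first_bytes) == 1:
        return f"single-byte scan for {first_bytes.pop()!r}"
    return f"multi-substring search over {len(literals)} literals"

# pat[t ]ern: each string lists the alternatives at one position.
case_sensitive = expand(["p", "a", "t", "t ", "e", "r", "n"])
# [Pp]at[t ]ern: the first position now has two alternatives.
with_class = expand(["Pp", "a", "t", "t ", "e", "r", "n"])

print(case_sensitive, "->", choose_prefilter(case_sensitive))
print(with_class, "->", choose_prefilter(with_class))
```

With a shared first byte the toy rule picks the single-byte scan, and as soon as two different first bytes appear it falls back to the multi-substring search, which in this haystack happens to be the faster choice.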

Most regex engines don't have an algorithm like Teddy to deal with multiple substring searches like this, so this isn't really a pit that they can fall into.

1 reply

lfreist Mar 4, 2023
Author

Wow, thank you very much for this detailed answer! I really appreciate your effort.
