Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

consider providing \< and \> #469

Closed
BurntSushi opened this issue Apr 23, 2018 · 25 comments
Closed

consider providing \< and \> #469

BurntSushi opened this issue Apr 23, 2018 · 25 comments

Comments

@BurntSushi
Copy link
Member

The \< and \> escape sequences correspond to "left word boundary" and "right word boundary," respectively. They are more specific variants of \b.

There was a request to add them to ripgrep, but in reality, this is more a request for the regex engine itself.

It's not clear whether these are worthwhile things to add, but they are present in other regex engines. There's a secondary question of whether their negation should be support too, although I don't think that's a common thing.

Since unrecognized escape sequences produce a syntax error today, these can be added in a backwards compatible way at any time.

@mqudsi
Copy link

mqudsi commented Dec 30, 2018

Note that in PCRE2 (as discussed in BurntSushi/ripgrep#1151) \> is generally recognized as an escape (although as @BurntSushi notes, it is an error in the current dialect), which may lead to unexpected results.

PCRE2 does not support \> and \< as word boundaries, which as I understand it are GNU extensions to the original dialect. I'm not able to secure information about which regex engines do support it, but I think the biggest one people are familiar with is through vim (which doesn't support \b, if I'm remembering correctly).

Just some food for thought.

@jacob-keller
Copy link

Grep supports them, as well as other gnu tools like sed, and emacs. Vim obviously supports them.

The primary reason why I'd like them implemented isn't that \b doesn't work for most use cases but mostly because I habitually type < instead of \b due to use in vim which doesn't support \b.. I also find it easier to think about wrapping a word in < > mentally verses using the same boundary key for left and right side..

It's not a huge deal as much as it required retraining to use different sequences in different programs. It can be worked around well enough...

I understand that specific word boundaries aren't all that useful because you generally follow up with either a word or non word character on at least one side. So yea they don't necessarily hold their weight in terms of adding useful features... It's more about compatibility with other tools.

@BurntSushi
Copy link
Member Author

I think I'm just going to say no to this feature now. I'm happy to revisit it in the future, but I think I'd want something compelling. It is also worth nothing that even GNU grep supports \b:

$ echo 'foo bar' | egrep '\w\b' -o
o
r

So it kind of seems like vim is the oddball here that doesn't support \b.

@BurntSushi
Copy link
Member Author

See this thread on HN for a discussion with a user that feels strongly enough about using \< and \> that they patched this crate to support it: https://news.ycombinator.com/item?id=31713791

@KoviRobi
Copy link

KoviRobi commented Feb 2, 2023

I would like to argue for this, because there is a common workflow (for me at least) that this gets in the way:

  • In vim, you use * to search for a word, it puts \<word\> into your / register (search register)
  • If you want to then search for it in other files, you just type :grep then press <Ctrl-R> then / and it inserts the contents of the / register.
  • Now I have to navigate and patch up the regexp so that ripgrep doesn't choke on it, which is a small annoyance but this often happens a fair amount when I'm navigating so it pops up a lot.

Given there is a patch (the URL from HN points to hvdijk@511c648 to make it easier to find, I find HN to be a difficult to read wall of text), is there any reason why we couldn't merge that patch? Could I help?

But also, if you really don't want to add \< and \>, then even just ignoring it as per #93 would be nice -- even if you make ignoring escapes an optional behaviour in ripgrep that you pass a flag to, you can tell vim to pass custom flags to the grep program.

@BurntSushi
Copy link
Member Author

@KoviRobi I'm not sure what you mean by ignoring them. Ignoring them would mean that, e.g., \< would match <, which is probably not what you want given the workflow you described?

@BurntSushi
Copy link
Member Author

I'll re-open this for now. We can consider adding this once #656 is done.

@BurntSushi BurntSushi reopened this Feb 2, 2023
@KoviRobi
Copy link

KoviRobi commented Feb 2, 2023

@KoviRobi I'm not sure what you mean by ignoring them. Ignoring them would mean that, e.g., \< would match <, which is probably not what you want given the workflow you described?

Ah I misunderstood the issue I linked, I thought it ignored the escape characters at all, so \< would be the same as the empty string, not ignored the escape so \< is the same as <.

I'll re-open this for now. We can consider adding this once #656 is done.

Excellent, thank you -- let me know if I can be of any help

@BurntSushi
Copy link
Member Author

BurntSushi commented Oct 3, 2023

In addition to \< and \>, I've been noodling on also adding these:

  • \b{start} as an alias for \<.
  • \b{end} as an alias for \>.
  • \b{prev-non-word} for (?<!\w), or equivalently, (?<=\W|^).
  • \b{next-non-word} for (?!\w), or equivalently, (?=\W|$).

I like \b{start} and \b{end} because I find them less opaque and also leave the door open for \B{start} and \B{end}, although I'd like to avoid adding those for the time being.

The \b{prev-non-word} and \b{next-non-word} are somewhat more specialized. For example, \b{start} only matches at a position where the left side is \W and the right side is \w, but \b{prev-non-word} only matches whenever the left side is \W, even if the right side is also \W. It turns out that this is quite useful for implementing the -w/--word-regexp flag in grep programs, where standard behavior is not to do \b(?:pat)\b, but (?<!\w)(?:pat)(?!\w). Indeed, that's how -w/--word-regexp is implemented in ripgrep when using PCRE2. But in a regex engine without general look-around, implementing it is a tortured affair. Indeed, I've found some latent bugs in the implementation.

I've thought for some time about how to implement something like \b{prev-non-word}/\b{next-non-word} outside of the regex engine and I haven't been able to come up with something that is obviously correct. However, it seems to me that they (along with \b{start} and \b{end}) could slot right in as new look-around assertions in the regex engine. And then there would no longer need to be any work-arounds in ripgrep or any other routine that wants similar semantics.

One downside of this is that adding new look-around assertions will require breaking change releases of both regex-syntax and regex-automata. I was hoping to avoid doing a breaking change release so quickly after the last one, but I'd like to get the -w/--word-regexp stuff cleaned up in ripgrep. And I can bundle in a few other planned breaking changes while I'm at it: #1090, #1088, #1031 and maybe a couple others that slip my mind at the moment.

My plan at this point is to think a bit more about working around this in ripgrep without adding the new word boundary assertions to the regex engine. If I can come up with something that makes me happy, then I might forgo this for now.

@BurntSushi
Copy link
Member Author

I've been thinking about names because I really don't like \b{prev-non-word} and \b{next-non-word}. I'm thinking that I like \b{start-half} and \b{end-half} better. (Although I still don't like them.)

BurntSushi added a commit that referenced this issue Oct 7, 2023
This is in preparation for adding 8 new word boundary look-around
assertions: \b{start}, \b{end}, \b{start-half} and \b{end-half}, along
with Unicode and ASCII-only variants of each.

Ref #469
BurntSushi added a commit that referenced this issue Oct 7, 2023
This adds Ast and Hir support for the following new assertions:
\b{start}, \b{end}, \b{start-half}, \b{end-half}, \< and \>. The last
two, \< and \>, are aliases for \b{start} and \b{end}.

The parsing for this is a little suspect since there's a little
ambiguity between, e.g., \b{5} and \b{start}, but we handle it by
allowing the parser to look for one of the new special assertions, and
then back-up if it fails to find one so that it can try to parse a
counted repetition.

Ref #469
BurntSushi added a commit that referenced this issue Oct 7, 2023
This adds AST support for the following new assertions:
\b{start}, \b{end}, \b{start-half}, \b{end-half}, \< and \>. The last
two, \< and \>, are aliases for \b{start} and \b{end}.

The parsing for this is a little suspect since there's a little
ambiguity between, e.g., \b{5} and \b{start}, but we handle it by
allowing the parser to look for one of the new special assertions, and
then back-up if it fails to find one so that it can try to parse a
counted repetition.

Ref #469
BurntSushi added a commit that referenced this issue Oct 7, 2023
This builds on the previous commit to bring word boundary support to the
HIR, and updates AST->HIR translation to produce them from the
corresponding AST elements.

Ref #469
BurntSushi added a commit that referenced this issue Oct 8, 2023
This is in preparation for adding 8 new word boundary look-around
assertions: \b{start}, \b{end}, \b{start-half} and \b{end-half}, along
with Unicode and ASCII-only variants of each.

Ref #469
BurntSushi added a commit that referenced this issue Oct 8, 2023
This adds AST support for the following new assertions:
\b{start}, \b{end}, \b{start-half}, \b{end-half}, \< and \>. The last
two, \< and \>, are aliases for \b{start} and \b{end}.

The parsing for this is a little suspect since there's a little
ambiguity between, e.g., \b{5} and \b{start}, but we handle it by
allowing the parser to look for one of the new special assertions, and
then back-up if it fails to find one so that it can try to parse a
counted repetition.

Ref #469
BurntSushi added a commit that referenced this issue Oct 8, 2023
This builds on the previous commit to bring word boundary support to the
HIR, and updates AST->HIR translation to produce them from the
corresponding AST elements.

Ref #469
BurntSushi added a commit that referenced this issue Oct 8, 2023
BurntSushi added a commit that referenced this issue Oct 8, 2023
This was substantially easier. Coupling, private abstractions and slow
code are so much easier to deal with.

Ref #469
@BurntSushi
Copy link
Member Author

I have this implemented in #1098 and I'm currently planning to include it in the regex 1.10 release. Specifically, this is the new syntax as it will appear in the docs:

\b{start}, \<   a Unicode start-of-word boundary (\W|\A on the left, \w on the right)
\b{end}, \>     a Unicode end-of-word boundary (\w on the left, \W|\z on the right))
\b{start-half}  half of a Unicode start-of-word boundary (\W|\A on the left)
\b{end-half}    half of a Unicode end-of-word boundary (\W|\z on the right)

BurntSushi added a commit that referenced this issue Oct 8, 2023
This is in preparation for adding 8 new word boundary look-around
assertions: \b{start}, \b{end}, \b{start-half} and \b{end-half}, along
with Unicode and ASCII-only variants of each.

Ref #469
BurntSushi added a commit that referenced this issue Oct 8, 2023
This adds AST support for the following new assertions:
\b{start}, \b{end}, \b{start-half}, \b{end-half}, \< and \>. The last
two, \< and \>, are aliases for \b{start} and \b{end}.

The parsing for this is a little suspect since there's a little
ambiguity between, e.g., \b{5} and \b{start}, but we handle it by
allowing the parser to look for one of the new special assertions, and
then back-up if it fails to find one so that it can try to parse a
counted repetition.

Ref #469
BurntSushi added a commit that referenced this issue Oct 8, 2023
This builds on the previous commit to bring word boundary support to the
HIR, and updates AST->HIR translation to produce them from the
corresponding AST elements.

Ref #469
BurntSushi added a commit that referenced this issue Oct 8, 2023
BurntSushi added a commit that referenced this issue Oct 8, 2023
This was substantially easier. Coupling, private abstractions and slow
code are so much easier to deal with.

Ref #469
@ratmice
Copy link

ratmice commented Oct 8, 2023

I have this implemented in #1098 and I'm currently planning to include it in the regex 1.10 release. Specifically, this is the new syntax as it will appear in the docs:

\b{start}, \<   a Unicode start-of-word boundary (\W|\A on the left, \w on the right)
\b{end}, \>     a Unicode end-of-word boundary (\w on the left, \W|\z on the right))
\b{start-half}  half of a Unicode start-of-word boundary (\W|\A on the left)
\b{end-half}    half of a Unicode end-of-word boundary (\W|\z on the right)

I haven't looked through the patch series yet, but this caught my eye.
Does this add the < character to regex-syntax is_meta_character, there is a decently sized comment
here where I will need to test/update some downstream code when this gets included.

https://github.com/softdevteam/grmtools/blob/cf6af64db67fed12c7259902794b98c627daadf1/lrlex/src/lib/parser.rs#L511

@BurntSushi
Copy link
Member Author

Thanks for checking!

No, it doesn't. This change doesn't make < a meta character. That would break things so bad as to be untenable. It's only \< that is interpreted as a word boundary. If you try using \< in regex 1.9, you get an error:

$ regex-cli debug hir '\<'
failed to parse pattern 0 to AST: '\<': failed to parse pattern: regex parse error:
    \<
    ^^
error: unrecognized escape sequence

@ratmice
Copy link

ratmice commented Oct 8, 2023

I guess the problem then is that the code linked above is going to strip the \, off of \< unless it is a meta character i.e. currently we already need to escape < into \< to get it passed to regex as <, so I think to use this word boundary as-is without modifying the linked code above one would have to double escape \, via \\\<.

I would definitely recommend people rather use the \b{...} instead from lex, so thanks for adding those alternatives.
Anyhow I think it's not pretty but at least we can send both < via single-escape, and \< via double escape through the format.
So I think it should all work out.

But yeah, I concur that adding it as a meta-character would seem to be a mistake as it would become impossible to encode <. directly in our format.

So I don't believe there are any major issues or objections on my side of this.

@BurntSushi
Copy link
Member Author

BurntSushi commented Oct 8, 2023

currently we already need to escape < into \< to get it passed to regex as <

Eh? No. The regex < matches < literally. As I showed above, if you try to write \< today, then you get an error. So you can't be doing that today...

Is there some existing valid regex pattern whose behavior or interpretation changes as a result of this PR?

Also, where and how are you using this crate? (Out of curiosity.)

@ratmice
Copy link

ratmice commented Oct 8, 2023

Sorry, for not making things explicitly clear, I'm dealing with 2 layers of escaping:

quoting from 1

Both Lex and grmtools support escaping arbitrary characters, for all other characters
besides those listed above, when given an escaped character \c it will be passed to
the regex engine as the character c. This is useful when a character is used within
the lex format.

An example of this is when the character < is used at the beginning of a regex. Both Lex
and grmtools interpret this as the beginning of a start condition prefix. Which can be
escaped with \< to ensure it is treated as the start of a regular expression.

There is a simple testcase usage here2, and an example where we use < before a regex, but not as part of the regex itself 3.

I use the crates in my lex/yacc lsp implementation4, instead of codegen it builds parse tables at runtime and so you can modify grammars and modify/parse input sources at runtime from a traditional lex/yacc specification.

Anyhow, from what I can tell the existing code I linked should be acceptable without changes, with the escaping woes stemming from the layer of escaping on top of regex.

Footnotes

  1. https://github.com/softdevteam/grmtools/blob/cf6af64db67fed12c7259902794b98c627daadf1/doc/src/lexcompatibility.md?plain=1#L53C1-L56C83

  2. https://github.com/softdevteam/grmtools/blob/cf6af64db67fed12c7259902794b98c627daadf1/lrpar/cttests/src/quoting.test#L11

  3. https://github.com/softdevteam/grmtools/blob/master/lrpar/examples/start_states/src/comment.l

  4. https://github.com/ratmice/nimbleparse_lsp

@BurntSushi
Copy link
Member Author

BurntSushi commented Oct 8, 2023

Yes, you shouldn't have to change anything. Nothing in this PR should change the behavior or semantics of any existing valid regex. And since < was never and will never be a meta character on its own, there is no need to escape it for this crate. Whatever escaping you're already doing must necessarily be sufficient after this PR, for if it weren't, it would imply your escaping routine would lead to an invalid regex pattern in the status quo.

BurntSushi added a commit that referenced this issue Oct 9, 2023
This is in preparation for adding 8 new word boundary look-around
assertions: \b{start}, \b{end}, \b{start-half} and \b{end-half}, along
with Unicode and ASCII-only variants of each.

Ref #469
BurntSushi added a commit that referenced this issue Oct 9, 2023
This adds AST support for the following new assertions:
\b{start}, \b{end}, \b{start-half}, \b{end-half}, \< and \>. The last
two, \< and \>, are aliases for \b{start} and \b{end}.

The parsing for this is a little suspect since there's a little
ambiguity between, e.g., \b{5} and \b{start}, but we handle it by
allowing the parser to look for one of the new special assertions, and
then back-up if it fails to find one so that it can try to parse a
counted repetition.

Ref #469
BurntSushi added a commit that referenced this issue Oct 9, 2023
This builds on the previous commit to bring word boundary support to the
HIR, and updates AST->HIR translation to produce them from the
corresponding AST elements.

Ref #469
BurntSushi added a commit that referenced this issue Oct 9, 2023
BurntSushi added a commit that referenced this issue Oct 9, 2023
This was substantially easier. Coupling, private abstractions and slow
code are so much easier to deal with.

Ref #469
@KoviRobi
Copy link

KoviRobi commented Oct 9, 2023

Thank you for this, I was going to give it a try by patching Cargo.toml in ripgrep https://github.com/KoviRobi/ripgrep/tree/use-word-boundary-regex but seems like I did something wrong, when I try to test it out I get the following:

ripgrep/$ cargo r -- \<foo\>
    Finished dev [unoptimized + debuginfo] target(s) in 0.03s
     Running `target/debug/rg '\<foo\>'`
regex parse error:
    (?:\<foo\>)
       ^^
error: unrecognized escape sequence

possibly I have just misunderstood something (my understanding is ripgrep uses this engine, but maybe it also has some syntax checking that needs to be patched?)

@BurntSushi
Copy link
Member Author

@KoviRobi You might need to explicitly patch regex-automata too. ripgrep master is in a somewhat state of flux at the moment, as I'm working on the ripgrep 14 release. In particular, it should no longer depend on regex itself at all. (It only uses it in dev-dependencies.) Instead it uses regex-automata directly.

@BurntSushi
Copy link
Member Author

Note also that ripgrep's -w/--word-regexp flag will be using \b{start-half} and \b{end-half}, not \< and \>. (Although of course \< and \> will still be available for use explicitly.)

@KoviRobi
Copy link

KoviRobi commented Oct 9, 2023

@BurntSushi thank you, I have updated https://github.com/KoviRobi/ripgrep/tree/use-word-boundary-regex by updating the regex-automata and regex-syntax, then fixing the remaining errors to my best intuition, it seems to work!

ripgrep/$ cargo r -- \<oo\>
    Finished dev [unoptimized + debuginfo] target(s) in 0.03s
     Running `target/debug/rg '\<oo\>'`
crates/regex/src/word.rs
289:        assert_eq!(Some((3, 7)), find(r"f?oo!?", "##\nfoo!##"));

And being able to use \< and \> explicitly fits my use-case very well :)

@BurntSushi
Copy link
Member Author

Aye yeah, both regex-syntax and regex-automata have breaking changes (which means they get 0.8.0 and 0.4.0 releases, respectively). Once the releases are out, I'll update ripgrep master pretty quickly and you can drop your patch.

BurntSushi added a commit that referenced this issue Oct 9, 2023
This is in preparation for adding 8 new word boundary look-around
assertions: \b{start}, \b{end}, \b{start-half} and \b{end-half}, along
with Unicode and ASCII-only variants of each.

Ref #469
BurntSushi added a commit that referenced this issue Oct 9, 2023
This adds AST support for the following new assertions:
\b{start}, \b{end}, \b{start-half}, \b{end-half}, \< and \>. The last
two, \< and \>, are aliases for \b{start} and \b{end}.

The parsing for this is a little suspect since there's a little
ambiguity between, e.g., \b{5} and \b{start}, but we handle it by
allowing the parser to look for one of the new special assertions, and
then back-up if it fails to find one so that it can try to parse a
counted repetition.

Ref #469
BurntSushi added a commit that referenced this issue Oct 9, 2023
This builds on the previous commit to bring word boundary support to the
HIR, and updates AST->HIR translation to produce them from the
corresponding AST elements.

Ref #469
BurntSushi added a commit that referenced this issue Oct 9, 2023
This was substantially easier. Coupling, private abstractions and slow
code are so much easier to deal with.

Ref #469
@BurntSushi
Copy link
Member Author

@KoviRobi You should be good to build ripgrep from master now!

@KoviRobi
Copy link

@BurntSushi thanks! Also just saw your last commit message about fixing up past commits being an effort, have you come across https://github.com/tummychow/git-absorb ? It makes the fixup workflow extremely simple

@BurntSushi
Copy link
Member Author

Yes, I use it all the time. The issue there wasn't finding the right commit, but the conflicts that I guessed would result from inserting the change in the Cargo.lock file. (I didn't actually try it though, it was just my sense of things at the time.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants