perf(es/lexer): Use `logos` lexer as a sub-lexer #9807

kdy1 · 2024-12-19T23:17:11Z

Description:

Define RawToken.
- It should be declared in a standalone crate because it requires heavy code generation. logos generates a deterministic finite state machine that consists of lookup tables and jump tables.
- In the future, RawToken will be renamed to Token and used directly by the parser. I still need to investigate the API much more.
- For performance, RawToken should be one byte in size.
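A minimal sketch of the one-byte constraint, assuming a fieldless `#[repr(u8)]` enum. The variant names and `FatToken` below are hypothetical and only illustrate the size difference; the real RawToken has many more variants.

```rust
use std::mem::size_of;

/// Hypothetical subset of token kinds. No variant carries a payload,
/// so the whole value is just a `u8` discriminant.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RawToken {
    Ident,
    Num,
    Str,
    LParen,
    RParen,
    Semi,
}

/// Hypothetical counter-example: a token kind that owns its text.
/// The `String` payload makes every value much larger than one byte.
pub enum FatToken {
    Ident(String),
    Semi,
}

/// Size in bytes of the payload-free token kind.
pub const RAW_TOKEN_SIZE: usize = size_of::<RawToken>();

/// Size in bytes of the payload-carrying token kind.
pub const FAT_TOKEN_SIZE: usize = size_of::<FatToken>();
```

Keeping the text out of the token means the value (identifier name, string contents) has to be recovered from the source slice by span, which is the trade-off this design accepts.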
Adjust lexer to work on RawToken instead of char.
- logos takes &str and generates Iterator<Item = RawToken>.
- Some ECMAScript code cannot be lexed with an FSM alone. logos provides a callback API, and that is how we should handle ambiguous tokens.
- The regex syntax that logos supports is more limited than that of the regex crate. In other words, even if a regex is valid, the logos lexer may not produce the matching RawToken. This is a performance trade-off, and it's documented here.
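To make the shape concrete, here is a std-only stand-in (it does not use logos, and `MiniLexer`, `block_comment`, and the variant names are hypothetical). It takes a `&str`, yields token kinds with byte ranges, and hands control to a hand-written hook once a simple prefix has matched, which is roughly the role a logos callback would play.

```rust
use std::ops::Range;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RawToken {
    Ident,
    Num,
    Punct,
    BlockComment,
    Error,
}

/// Hypothetical stand-in for a generated lexer: `&str` in, tokens out.
pub struct MiniLexer<'a> {
    src: &'a str,
    pos: usize,
}

impl<'a> MiniLexer<'a> {
    pub fn new(src: &'a str) -> Self {
        Self { src, pos: 0 }
    }

    /// Callback-style hook, entered with `pos` just past `/*`.
    /// It scans ahead by hand and consumes through the closing `*/`.
    fn block_comment(&mut self) -> RawToken {
        match self.src[self.pos..].find("*/") {
            Some(offset) => {
                self.pos += offset + 2;
                RawToken::BlockComment
            }
            None => {
                // Unterminated comment: consume the rest and report an error.
                self.pos = self.src.len();
                RawToken::Error
            }
        }
    }
}

fn is_ident_part(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_' || b == b'$'
}

impl<'a> Iterator for MiniLexer<'a> {
    type Item = (RawToken, Range<usize>);

    fn next(&mut self) -> Option<Self::Item> {
        let src: &'a str = self.src;
        let bytes = src.as_bytes();
        // Skip ASCII whitespace between tokens.
        while self.pos < bytes.len() && bytes[self.pos].is_ascii_whitespace() {
            self.pos += 1;
        }
        let start = self.pos;
        let first = *bytes.get(start)?;
        let kind = if first.is_ascii_alphabetic() || first == b'_' || first == b'$' {
            while self.pos < bytes.len() && is_ident_part(bytes[self.pos]) {
                self.pos += 1;
            }
            RawToken::Ident
        } else if first.is_ascii_digit() {
            while self.pos < bytes.len() && bytes[self.pos].is_ascii_digit() {
                self.pos += 1;
            }
            RawToken::Num
        } else if src[start..].starts_with("/*") {
            self.pos += 2;
            self.block_comment()
        } else {
            // Any other character is one punctuator; step by its UTF-8 width.
            self.pos += src[start..].chars().next().map_or(1, char::len_utf8);
            RawToken::Punct
        };
        Some((kind, start..self.pos))
    }
}
```

For the input `a /* x */ 42;` this yields an identifier, a block comment, a number, and a punctuator, each with its byte range.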
Wrapper: logos::Lexer => RawLexer => Lexer => Parser
- The callback API of logos is not enough for lexing ECMAScript. We have to wrap the logos lexer with our own lexer and call the logos lexer only when it is valid to do so. It is valid most of the time, but currently we handle regex and template literals with a separate method.
- I introduced RawLexer as a sort of buffer (based on peek_nth from itertools), but it turned out to be a good place for various lexing methods, so I added read_regexp. read_regexp uses another logos token definition, so it should live in the swc_ecma_raw_lexer crate.
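The buffering role of RawLexer can be sketched with std only. `PeekBuffer` below is a hypothetical stand-in for the peek_nth-style adapter, not the actual RawLexer or the itertools type: it pulls items from the inner iterator on demand and keeps them queued until they are consumed.

```rust
use std::collections::VecDeque;

/// Hypothetical lookahead buffer over any token iterator.
pub struct PeekBuffer<I: Iterator> {
    inner: I,
    buf: VecDeque<I::Item>,
}

impl<I: Iterator> PeekBuffer<I> {
    pub fn new(inner: I) -> Self {
        Self {
            inner,
            buf: VecDeque::new(),
        }
    }

    /// Look at the item `n` positions ahead without consuming anything.
    /// Returns `None` if the inner iterator ends before reaching it.
    pub fn peek_nth(&mut self, n: usize) -> Option<&I::Item> {
        while self.buf.len() <= n {
            let item = self.inner.next()?;
            self.buf.push_back(item);
        }
        self.buf.get(n)
    }
}

impl<I: Iterator> Iterator for PeekBuffer<I> {
    type Item = I::Item;

    /// Drain buffered items first, then fall back to the inner iterator.
    fn next(&mut self) -> Option<I::Item> {
        self.buf.pop_front().or_else(|| self.inner.next())
    }
}
```

A wrapper like this is also a natural owner for context-dependent entry points such as read_regexp, since it already controls when the inner lexer is advanced.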
Fix tests
- Currently, some of the spans are wrong.
- Currently, the processed values in the AST are simply filled with the raw value from the lexer. These should hold the correct values. For example, Str has the value \\\\ and the raw value \\\\ for the input string \\\\, but the value field should be \\ instead.
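The raw/value distinction can be sketched as a small unescape step. `cook_str` is a hypothetical helper, not a function from this PR, and it covers only a few simple escapes; `\u`, `\x`, `\0`, `\b`, `\v`, `\f`, and line continuations are deliberately out of scope and would be handled incorrectly by the fallback arm.

```rust
/// Hypothetical helper: turn the raw body of a string literal (without the
/// surrounding quotes) into its processed value. Subset of escapes only.
pub fn cook_str(raw_body: &str) -> String {
    let mut out = String::with_capacity(raw_body.len());
    let mut chars = raw_body.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            Some('n') => out.push('\n'),
            Some('t') => out.push('\t'),
            Some('r') => out.push('\r'),
            // Escaped backslash or quote: keep the escaped character itself.
            // (Simplification: every other escape is treated the same way.)
            Some(other) => out.push(other),
            // Trailing lone backslash: keep it as-is in this sketch.
            None => out.push('\\'),
        }
    }
    out
}
```

With this split, `raw` keeps the source text untouched while `value` holds the result of `cook_str`, so an escaped backslash in the source becomes a single backslash in the value.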

Related issue (if exists):

changeset-bot · 2024-12-19T23:17:15Z

⚠️ No Changeset found

Latest commit: c862038

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.


GiveMe-A-Name · 2025-01-08T09:47:12Z

crates/swc_ecma_parser/src/lexer/state.rs

@@ -372,7 +353,7 @@ impl Iterator for Lexer<'_> {
            }

            self.state.update(start, token.kind());
-            self.state.prev_hi = self.last_pos();
+            self.state.prev_hi = self.input.cur_pos();


We do not need to maintain the cur_pos information in parser.lexer. How about getting the span from logos.lexer? That would be a better way to reduce the complexity of ecma.parser.lexer.

kdy1 · 2025-01-08

I agree that it can be a better design. You would only need to add start_pos to those.
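A rough sketch of what adding start_pos could look like, assuming the sub-lexer reports a byte range relative to the start of its input (as `logos::Lexer::span()` does). `BytePos`, `Span`, and `to_span` here are simplified, hypothetical stand-ins, not the swc_common types.

```rust
use std::ops::Range;

/// Simplified stand-in for an absolute byte position in the source map.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BytePos(pub u32);

/// Simplified stand-in for an absolute span.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub lo: BytePos,
    pub hi: BytePos,
}

/// Shift a range that is relative to the lexed `&str` by the absolute
/// position at which that `&str` starts.
/// (Assumes offsets fit in `u32`; this sketch does not check for overflow.)
pub fn to_span(start_pos: BytePos, range: Range<usize>) -> Span {
    Span {
        lo: BytePos(start_pos.0 + range.start as u32),
        hi: BytePos(start_pos.0 + range.end as u32),
    }
}
```

With this, the outer lexer would not need to track cur_pos itself; it would only keep the fixed start_pos of the input.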

kdy1 self-assigned this Dec 19, 2024

kdy1 force-pushed the parser-rewrite branch from 1c4dc31 to 33d2de6 Compare January 2, 2025 04:33

kdy1 mentioned this pull request Jan 2, 2025

JSON example fails to lex unicode escapes like "unicode\u2028escape" maciejhirsz/logos#458

Open

kdy1 changed the title ~~perf(es/lexer): Use logos for Text => RawToken => Token phase~~ perf(es/lexer): Use logos for sub-lexer Jan 2, 2025

kdy1 changed the title ~~perf(es/lexer): Use logos for sub-lexer~~ perf(es/lexer): Use logos lexer as a sub-lexer Jan 2, 2025

kdy1 added this to the Planned milestone Jan 2, 2025

kdy1 added 23 commits January 7, 2025 13:55

raw_lexer

cac6c27

edition

c283516

raw token

8296f04

Dep

6605c44

Dep

1f530f0

raw token

6c9e5b8

Fix string input

87a88dc

raw token

13291db

bump

91cbf62

mod

0d66e65

bump(1)

5a0254a

cargo lockfile

13563e4

bump

38b063b

LexError

ad86d37

self.input.bump(1)

3adc46d

RawBuffer

9200463

RawBuffer

68321b1

raw buffer work

b84a9e7

RawBuffer

52265b4

lexerror

28414ec

more lexer work

3861977

more lexer work

e9ea1cf

more lexer work

a7ee322

kdy1 added 27 commits January 7, 2025 13:56

Rename

1b95950

more escape

ef91c15

Remove dbg

bd8ffc2

Use next instead of bump

cf136e4

Optimize reset_peeked

dff3492

cleanup

df33a39

Reuse RawToken

ba267d8

dbg

5e8d34c

Optimize eat_ascii

c000ccb

Optimize cur_char

aa150e8

Revert "Optimize eat_ascii"

5df3fa1

This reverts commit 2cb094e.

Use bump

cefd09b

Optimize newline

b427d68

next() instead of bump

39f4f7a

whitespace_callback

6506bf1

Skip whitespace

d6be0d0

Add raw token types

92bd6ef

skip newlines

9fb8734

Change assertion

c95190a

update_cur_pos

16dbbaf

Rename

cc2d26b

Doc

644da8f

WIP: read_regexp

a34ced4

read_regexp

3df1326

Use it

7729f5f

Disable number tests

cd9fe06

Fix Lexer.span

c862038

kdy1 force-pushed the parser-rewrite branch from 0849e30 to c862038 Compare January 7, 2025 04:56

kdy1 removed their assignment Jan 7, 2025

GiveMe-A-Name reviewed Jan 8, 2025
