Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

simdjson #1892

Open
ekg opened this issue Apr 19, 2019 · 9 comments
Open

simdjson #1892

ekg opened this issue Apr 19, 2019 · 9 comments

Comments

@ekg
Copy link

ekg commented Apr 19, 2019

Would it be possible to leverage simdjson to improve parsing speed?

https://github.com/lemire/simdjson

Apologies if this is a duplicate.

@nicowilliams
Copy link
Contributor

It'd be nice to do this, yes, but there are issues.

First, jq's JSON parser is incremental, and so can consume JSON texts in chunks, whereas simdjson seems to need the entire text (and does two passes on the text). That doesn't mean that we couldn't use simdjson's SIMD techniques when we have the full text, but generally those won't be large texts. I'm not sure how easily we could adapt SIMD techniques to work incrementally, though at first glance it seems feasible.

Second, jq supports multiple JSON texts in sequence. This is less tricky.

@pkoppstein
Copy link
Contributor

... but there are issues.

Third, jq's parser is currently quite lenient -- perhaps too much so in some cases (e.g. allowing 00 for 0), but well within its rights in others (e.g. allowing duplicate keys within a JSON object).

Fourth, there is a plan to allow jq to preserve arbitrarily big integers (in the sense that jq . will not modify them), and it would be a shame to lose that functionality.

(If jq is to support stricter parsing, then for the sake of utility as well as backward compatibility, it should be possible to specify which mode is wanted.)

@lemire
Copy link

lemire commented Oct 18, 2019

well within its rights in others (e.g. allowing duplicate keys within a JSON object).

simdjson allows duplicate keys, the RFC does not require uniqueness: https://tools.ietf.org/html/rfc7159

there is a plan to allow jq to preserve arbitrarily big integers

Though simdjson is limited to 64-bit numbers at this time, there are plans to extend the support...
simdjson/simdjson#167

@lemire
Copy link

lemire commented Oct 18, 2019

First, jq's JSON parser is incremental, and so can consume JSON texts in chunks, whereas simdjson seems to need the entire text (and does two passes on the text). That doesn't mean that we couldn't use simdjson's SIMD techniques when we have the full text, but generally those won't be large texts. I'm not sure how easily we could adapt SIMD techniques to work incrementally, though at first glance it seems feasible.

This issue is being worked on in simdjson...

simdjson/simdjson#128

Second, jq supports multiple JSON texts in sequence. This is less tricky.

This issue is also being worked on in simdjson...

simdjson/simdjson#188

cc @piotte13

@lemire
Copy link

lemire commented Nov 25, 2019

simdjson now supports JSON documents in sequence. The JSON can be either line-separated or just in sequence with arbitrary white space between them. The input can be nearly infinite...

The performance is quite good... (gigabytes per second)

https://github.com/lemire/simdjson/blob/master/doc/JsonStream.md

cc @piotte13

@hitorilabs
Copy link

hitorilabs commented Jun 19, 2023

Is anyone working on this already? (or finished some form of it)

@liquidaty
Copy link

A few thoughts I would put forth for consideration:

  • can we come up with one or two particular clear "painful" use case(s) we can target-- including sample filters as well as sample inputs?
    • presumably, the vast majority of jq use cases are using a filter that is not simply '.', so this isn't the ideal filter use case to optimize for
    • optimizing for strict JSON might be very different from optimizing for non-strict JSON (or other e.g. YAML / XML), and input size might also impact
    • ultimately, the initial goal would be to devise the specific benchmark tests to target for optimization
  • we may find that the bottleneck in the most important use cases is not the JSON input parsing, but rather the filter calculations, or maybe the way that the parser hands off data to the filter calculator. That's not to say that optimizing the JSON input parsing is not valuable in its own right no matter what, but rather to recognize that we all have limited time and resources to contribute, and informing how we spend that can make a drastic difference in how much impact that contribution makes-- and deciding how this issue request should be prioritized relative to other performance-related ones such as Speed up parsing -- delay string and number parsing? #1857
  • there are several mentions of multiple contemplated parser modes in terms of strictness and input format (e.g. XML/yaml) (ER: add --strict command-line option for strict parsing of JSON input #2643, simdjson #1892 (comment), Allow, and strip, comments (#, // or /*) #2548 (comment), Maybe add fromyaml to parse yaml strings? #2530, support yaml input/output in addition to json #467). Given this, the most expeditious approach might be to first evaluate these as a whole, identify commonalities, and then refactor the current jq parser to separate those commonalities, before then broadening the supported input formats
    • not sure if this would turn out to be the case, but as an illustrative example, perhaps that collective analysis would then suggest refactoring the parser to an event-driven structure, where the event handlers could be reused by all of the different input parsers. Knowing that and performing that refactoring ahead of time might improve performance and almost certainly will make the subsequent work of supporting other input formats much easier, and lower maintenance learning curves and technical debt in the process

@nicowilliams
Copy link
Contributor

First, jq's JSON parser is incremental, and so can consume JSON texts in chunks, whereas simdjson seems to need the entire text (and does two passes on the text). That doesn't mean that we couldn't use simdjson's SIMD techniques when we have the full text, but generally those won't be large texts. I'm not sure how easily we could adapt SIMD techniques to work incrementally, though at first glance it seems feasible.

This issue is being worked on in simdjson...

simdjson/simdjson#128

That's still open. Is it likely we can get that?

Second, jq supports multiple JSON texts in sequence. This is less tricky.

This issue is also being worked on in simdjson...

simdjson/simdjson#188

Sweet!

Maybe we could use simdjson for jv_parse() and jv_parse_sized(), which is to say "for fromjson". Now, we'll still have to allocate a bunch of objects, so that's not terribly fun, and I wonder what we could do about that. Now in the streaming JSON parser we only ever need to allocate an array for the paths to scalars and for the scalars, and we can reuse the path array, so we should be able to get more bang for this effort there.

@nicowilliams
Copy link
Contributor

A few thoughts I would put forth for consideration:

Yes, this will be a lot of profiling and playing with options to find something that rocks perf-wise and isn't too hard to use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants