
Optimize parse_str_radix_10 #456

Closed
wants to merge 9 commits

Conversation

chris-cantor

Replacing #454, this PR resolves #453 and:

  1. Added some tests
  2. Fixed two edge cases
  3. Simplified the logic of parse_str_radix_10
  4. Optimized the performance with 64/128-bit integers
  5. Sped it up further with tail call optimization

Running `cargo bench --bench lib_benches -- decimal_from_str`:

On M1 Pro, the numbers are:

master:
test decimal_from_str        ... bench:         566 ns/iter (+/- 21)

7f18e0d:
test decimal_from_str        ... bench:         88  ns/iter (+/- 21)

On Ryzen 3990x, the numbers are:

master:
test decimal_from_str        ... bench:         557 ns/iter (+/- 13)

432f9d6:
decimal_from_str             ... bench:         129 ns/iter (+/- 2)

I know this is a lot of code change. No rush at all, and feel free to ask us questions if any pieces of the code seem unclear.

Have a happy new year too!

chris-cantor and others added 5 commits December 31, 2021 10:59
The culprit is that `maybe_round` prevented the number from overflowing and ended up returning `Ok` in the end. The fix is to round a
number only after we have passed the decimal point.
This commit flattens `add_by_internal`, since it is only ever called with a fixed-size array plus one u32.

In addition, instead of accumulating the coefficients in an `ArrayVec` and
processing them later, we compute the output number in one pass. This
also removes the manual overflow rounding on the `coeff` array, replacing
it with a plain arithmetic +1.

cargo bench --bench lib_benches -- decimal_from_str

Baseline:
decimal_from_str    ... bench:     566 ns/iter (+/- 21) [M1 Pro]
decimal_from_str    ... bench:     554 ns/iter (+/- 21) [Ryzen 3990x]

Current:
decimal_from_str    ... bench:     193 ns/iter (+/- 1) [M1 Pro]
decimal_from_str    ... bench:     320 ns/iter (+/- 3) [Ryzen 3990x]
Instead of manually doing 96-bit integer operations, use a 128-bit
integer and check for overflowing 96 bits worth of state.

cargo bench --bench lib_benches -- decimal_from_str

Baseline:
decimal_from_str    ... bench:     566 ns/iter (+/- 21) [M1 Pro]
decimal_from_str    ... bench:     554 ns/iter (+/- 21) [Ryzen 3990x]

Current:
decimal_from_str    ... bench:     182 ns/iter (+/- 3) [M1 Pro]
decimal_from_str    ... bench:     299 ns/iter (+/- 4) [Ryzen 3990x]
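As a rough sketch (not the PR's exact code) of what checking 96-bit overflow against a 128-bit accumulator can look like:

// Illustrative sketch: keep the coefficient in a u128 and flag anything that
// no longer fits in 96 bits as an overflow of the decimal's coefficient.
const MAX_96_BIT: u128 = (1u128 << 96) - 1;

fn push_digit_96(acc: u128, digit: u8) -> Option<u128> {
    // Since acc <= 2^96 - 1, acc * 10 + 9 is far below u128::MAX,
    // so plain arithmetic cannot wrap here.
    let next = acc * 10 + digit as u128;
    if next > MAX_96_BIT {
        None // would not fit in the 96-bit coefficient
    } else {
        Some(next)
    }
}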
Accumulate parse state into a 64-bit integer until we would observe an
overflow of this integer, and fall back to maintaining 128-bit state

cargo bench --bench lib_benches -- decimal_from_str

Baseline:
decimal_from_str    ... bench:     566 ns/iter (+/- 21) [M1 Pro]
decimal_from_str    ... bench:     554 ns/iter (+/- 21) [Ryzen 3990x]

Current:
decimal_from_str    ... bench:     133 ns/iter (+/- 4) [M1 Pro]
decimal_from_str    ... bench:     190 ns/iter (+/- 5) [Ryzen 3990x]
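A minimal sketch of the idea (illustrative only, not the PR's code): accumulate in a u64 and hand off to 128-bit state only when the next digit would overflow.

// Returns the widened value when the 64-bit accumulator would overflow,
// so the caller can continue with 128-bit state from that point on.
fn push_digit_64(acc: u64, digit: u8) -> Result<u64, u128> {
    match acc.checked_mul(10).and_then(|v| v.checked_add(digit as u64)) {
        Some(v) => Ok(v),                              // still fits in 64 bits
        None => Err(acc as u128 * 10 + digit as u128), // fall back to 128-bit state
    }
}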
This commit changes the parsing code to operate as a series of tail
calls. This allows loop and overflow conditions to be checked only
exactly when needed, and makes it very easy for the compiler to
generate quality code on the fast path.

The parsing functions are also parameterized by const generics to
generate optimized codepaths for various situations (i.e. has/hasn't
seen digits yet, has/hasn't seen the decimal point) and remove carried state.

cargo bench --bench lib_benches -- decimal_from_str

Baseline:
decimal_from_str    ... bench:     566 ns/iter (+/- 21) [M1 Pro]
decimal_from_str    ... bench:     554 ns/iter (+/- 21) [Ryzen 3990x]

Current:
decimal_from_str    ... bench:     85 ns/iter (+/- 2) [M1 Pro]
decimal_from_str    ... bench:     129 ns/iter (+/- 2) [Ryzen 3990x]
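A toy illustration of the shape of that approach (a plain digit accumulator, not the PR's functions): each handler ends in a direct call to the next one, so the optimizer can compile the chain into jumps rather than growing the stack.

fn parse(bytes: &[u8]) -> Option<u64> {
    digit_loop(bytes, 0, 0)
}

// Decide what to do with the next byte.
fn digit_loop(bytes: &[u8], idx: usize, acc: u64) -> Option<u64> {
    match bytes.get(idx).copied() {
        Some(b @ b'0'..=b'9') => handle_digit(bytes, idx, acc, b - b'0'), // tail call
        Some(_) => None,   // unexpected byte
        None => Some(acc), // end of input
    }
}

// Fold the digit into the accumulator, then continue the loop in tail position.
fn handle_digit(bytes: &[u8], idx: usize, acc: u64, d: u8) -> Option<u64> {
    let acc = acc.checked_mul(10)?.checked_add(d as u64)?;
    digit_loop(bytes, idx + 1, acc)
}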
@TradingTomatillo
Contributor

Pasting the interesting part of a comment here so I can respond in the new PR:

Since the change is quite large I want to spend a bit of time reviewing it over the next day or two. I want to get a better understanding of the #[cold] attributes in particular and make sure they make sense in this context.

Cold basically informs LLVM that we do not expect this function to be called. Practically speaking, that means LLVM might

  • Optimize for size instead of speed in the cold function (in practice, these two aren't that far apart anyways)
  • Use a calling convention meant to push costs to the callee instead of the caller (in the parser, the cold functions are just passed a few integers anyways)
  • Optimize branch layouts with the assumption cold codepaths won't get called.

Because these are all variations of "might", the impact varies. It has much less of an impact in the common case with small strings because we statically remove the various overflow checks, so it will matter more for larger strings. Before the size check (the BIG variable) was added, the cold annotations took the time from ~150 ns/iter to ~140, but after it they are less impactful. It doesn't look like large decimals are very well represented in the benchmarks, although it's probably reasonable to optimize for strings with fewer than 18 digits in practice anyways.
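A tiny illustration of the attribute (hypothetical functions, not the parser's code):

// Hypothetical example: marking the error path cold tells LLVM to optimize
// and lay out the hot loop first; the cold function may also be optimized
// for size rather than speed.
#[cold]
#[inline(never)]
fn overflow_error(len: usize) -> Result<u64, String> {
    Err(format!("input of {} digits overflows the accumulator", len))
}

// Assumes `digits` contains ASCII digits '0'..='9' only.
fn accumulate(digits: &[u8]) -> Result<u64, String> {
    let mut acc: u64 = 0;
    for &d in digits {
        acc = match acc
            .checked_mul(10)
            .and_then(|v| v.checked_add((d - b'0') as u64))
        {
            Some(v) => v,
            None => return overflow_error(digits.len()), // unlikely path
        };
    }
    Ok(acc)
}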

@TradingTomatillo
Contributor

TradingTomatillo commented Dec 31, 2021

You can play around with this godbolt link and see: https://godbolt.org/z/rnhb5anhP. You can see that test3 is the non-cold one, so the compiler optimizes for it (forward jumps are predicted not-taken by default on Intel, so the compiler makes test3 the fallthrough case).

@TradingTomatillo
Contributor

TradingTomatillo commented Dec 31, 2021

For what it's worth, the tail call approach won't be great on architectures that don't support tail calls or that want to pass parameters on the stack. I think the only relevant architecture here for Rust is WebAssembly (and maybe arm32). The options I see are:

  1. Just eat the recursion cost since the call depth is bounded (probably not great?)
  2. Retain an improved-but-tailless version for architectures outside of x86, arm64, risc-v. This would be a higher code maintenance burden since you have 3 nontrivial cases (radix-10-x64, radix-10-general, radix-any).
  3. Reuse the general radix parser for other architectures. Not that fast, but there's no reason some of the easier optimizations can't be applied there at a later date.

This change gives each of the possible operations for the u64
accumulation a dedicated dispatch site, giving each operation dedicated
branch prediction.

cargo bench --bench lib_benches -- decimal_from_str

Baseline:
decimal_from_str    ... bench:     566 ns/iter (+/- 21) [M1 Pro]
decimal_from_str    ... bench:     554 ns/iter (+/- 21) [Ryzen 3990x]

Current:
decimal_from_str    ... bench:     81 ns/iter (+/- 1) [M1 Pro]
decimal_from_str    ... bench:     120 ns/iter (+/- 2) [Ryzen 3990x]
Remove the up-front handling of characters like '+' and '-', and the inline
handling of '_', and put all of it into a function dedicated to uncommon characters.
This improves the up-front cost for numbers with no prefix, and has a startup
cost similar to the previous checking code when rare characters are present.

Baseline:
decimal_from_str    ... bench:     566 ns/iter (+/- 21) [M1 Pro]

Current:
decimal_from_str    ... bench:     71 ns/iter (+/- 2) [M1 Pro]
@TradingTomatillo
Contributor

Latest numbers are down to 70 ns/iter on M1; will benchmark on an x86 processor in the next day or two.

This makes sure that the parsing behaviour is the same as before.
Owner

@paupino paupino left a comment


This is quite an innovative approach to this!

I see how the tail recursion could be problematic if optimization isn't available/used. Would it be worthwhile to feature flag this in the meantime? That way it could be an opt-in (or opt-out) feature.

You're right that it does have the overhead of maintaining two different versions; however, we should be able to cater for that somewhat with CI. It's not without other downsides of course (e.g. discovery, deprecation) but for now it may make sense.

Thoughts? I can always help wrap up the PR with feature flags + CI if you like.

if str.is_empty() {
    return Err(Error::from("Invalid decimal: empty"));
}
let bytes = str.as_bytes();
// handle the sign
Owner


nit: looks like this is leftover

offset += 1;
len -= 1;
#[inline]
fn byte_dispatch_u64<const POINT: bool, const NEG: bool, const HAS: bool, const BIG: bool, const FIRST: bool>(
Owner


My one qualm with these const generics is knowing what each one does. Even though this is an internal method, it'd be great to add a little documentation just to cover what each is for.

  • POINT - a decimal point has been seen
  • NEG - we've encountered a - and the number is negative
  • HAS - a digit has been encountered (when HAS is false it's invalid)
  • BIG - a number that uses 96 bits instead of only 64 bits
  • FIRST - true if it is the first byte in the string
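For example, a sketch of such documentation (the parameter list and body here are placeholders; only the flag names and their meanings come from the list above):

/// Dispatch on the next byte while accumulating the coefficient in a u64.
///
/// Each const generic statically specializes the generated code:
/// * `POINT` - a decimal point has already been seen
/// * `NEG`   - a '-' was encountered, so the number is negative
/// * `HAS`   - at least one digit has been encountered
/// * `BIG`   - the number needs 96 bits instead of only 64 bits
/// * `FIRST` - this is the first byte of the string
#[inline]
fn byte_dispatch_u64<const POINT: bool, const NEG: bool, const HAS: bool, const BIG: bool, const FIRST: bool>(
    _bytes: &[u8],
    _acc: u64,
    _scale: u8,
) -> Result<u64, &'static str> {
    unimplemented!()
}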

@@ -92,6 +93,25 @@ pub(crate) fn add_by_internal(value: &mut [u32], by: &[u32]) -> u32 {
carry as u32
}

pub(crate) fn add_by_internal_flattened(value: &mut [u32; 3], by: u32) -> u32 {
Owner


I previously had something similar - `add_by_internal3`. At some stage I'd love to move away from these functions; however, it makes sense to introduce it for this PR.
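For reference, a minimal standalone sketch of what such a flattened add-with-carry can look like (illustrative; not necessarily the exact code in this diff):

// Add a single u32 into a little-endian [u32; 3] with carry propagation,
// returning whatever carries out of the top word.
fn add_by_internal_flattened(value: &mut [u32; 3], by: u32) -> u32 {
    let mut carry = by as u64;
    for word in value.iter_mut() {
        let sum = *word as u64 + carry;
        *word = sum as u32; // keep the low 32 bits in place
        carry = sum >> 32;  // propagate the high bits to the next word
        if carry == 0 {
            break;
        }
    }
    carry as u32
}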

@chris-cantor
Author

Sure! Let me work on the changes

@TradingTomatillo
Contributor

As for the tail-call-or-not question: Everything before the tail calls is still a significant improvement and shouldn't really introduce a lot of platform dependency.

For the tail calls, it might make sense to keep the later commits here as a WIP PR, add the non-tail optimizations to the generalized parser, and see what performance looks like there. If that performs acceptably, choice 3 (only using the optimized parser on x86/arm64/risc-v) might be fine?

@paupino
Owner

paupino commented Jan 8, 2022

That makes sense. I'm going to merge this as is for now, and we can iterate from there.

paupino added a commit that referenced this pull request Jan 8, 2022
* Fix `from_str` overflow rounding issue
* Remove coeff ArrayVec from `parse_str_radix_10`
* Use 128 bit integer to speed up `parse_str_radix_10`
* Use 64 bit integer with 128 bit fallback to speed up further
* Convert into series of tail-calls
* Give each operation a dedicated dispatch
* Move 'rare' digit handling into dedicated function
* Remove unimportant const generic from digit dispatch
* Verify test cases in unpacked representation

Co-authored-by: Tomatillo <[email protected]>
@paupino
Owner

paupino commented Jan 8, 2022

Ok, that's weird. This is merged now in fca96a5? I'm going to close this PR since it remained open.

Development

Successfully merging this pull request may close these issues.

from_str rounding issues when overflow from too many digits
3 participants