
Optimize parse_str_radix_10 #456

Closed
wants to merge 9 commits

Conversation

chris-cantor

Replacing #454, this PR resolves #453 and:

  1. Added some tests
  2. Fixed two edge cases
  3. Simplified the logic of parse_str_radix_10
  4. Optimized the performance with 64/128-bit integers
  5. Sped it up further with tail call optimization

Running `cargo bench --bench lib_benches -- decimal_from_str`:

On M1 Pro, the numbers are:

master:
test decimal_from_str        ... bench:         566 ns/iter (+/- 21)

7f18e0d:
test decimal_from_str        ... bench:         88  ns/iter (+/- 21)

On Ryzen 3990x, the numbers are:

master:
test decimal_from_str        ... bench:         557 ns/iter (+/- 13)

432f9d6:
decimal_from_str             ... bench:         129 ns/iter (+/- 2)

I know this is a lot of code change. No rush at all, and feel free to ask us questions if any pieces of the code seem unclear.

Have a happy new year too!

chris-cantor and others added 5 commits December 31, 2021 10:59
The culprit is that `maybe_round` prevented the number from overflowing and ended up returning `Ok` in the end. The fix is to round a
number only after we have passed the decimal point.
This commit flattens `add_by_internal`, since it is only ever called with a fixed-size array plus one u32.

In addition, instead of accumulating the coefficients in an `ArrayVec` and
processing them later, we compute the output number in one pass. This
also removes the manual overflow rounding on the `coeff` array, replacing
it with a plain arithmetic +1.

cargo bench --bench lib_benches -- decimal_from_str

Baseline:
decimal_from_str    ... bench:     566 ns/iter (+/- 21) [M1 Pro]
decimal_from_str    ... bench:     554 ns/iter (+/- 21) [Ryzen 3990x]

Current:
decimal_from_str    ... bench:     193 ns/iter (+/- 1) [M1 Pro]
decimal_from_str    ... bench:     320 ns/iter (+/- 3) [Ryzen 3990x]
Instead of manually doing 96-bit integer operations, use a 128-bit
integer and check for overflowing 96 bits worth of state.

cargo bench --bench lib_benches -- decimal_from_str

Baseline:
decimal_from_str    ... bench:     566 ns/iter (+/- 21) [M1 Pro]
decimal_from_str    ... bench:     554 ns/iter (+/- 21) [Ryzen 3990x]

Current:
decimal_from_str    ... bench:     182 ns/iter (+/- 3) [M1 Pro]
decimal_from_str    ... bench:     299 ns/iter (+/- 4) [Ryzen 3990x]
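As a rough sketch (not the PR's exact code) of what checking 96-bit overflow against a 128-bit accumulator can look like:

// Illustrative sketch: keep the coefficient in a u128 and flag anything that
// no longer fits in 96 bits as an overflow of the decimal's coefficient.
const MAX_96_BIT: u128 = (1u128 << 96) - 1;

fn push_digit_96(acc: u128, digit: u8) -> Option<u128> {
    // Since acc <= 2^96 - 1, acc * 10 + 9 is far below u128::MAX,
    // so plain arithmetic cannot wrap here.
    let next = acc * 10 + digit as u128;
    if next > MAX_96_BIT {
        None // would not fit in the 96-bit coefficient
    } else {
        Some(next)
    }
}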
Accumulate parse state into a 64-bit integer until we would observe an
overflow of this integer, and fall back to maintaining 128-bit state

cargo bench --bench lib_benches -- decimal_from_str

Baseline:
decimal_from_str    ... bench:     566 ns/iter (+/- 21) [M1 Pro]
decimal_from_str    ... bench:     554 ns/iter (+/- 21) [Ryzen 3990x]

Current:
decimal_from_str    ... bench:     133 ns/iter (+/- 4) [M1 Pro]
decimal_from_str    ... bench:     190 ns/iter (+/- 5) [Ryzen 3990x]
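A minimal sketch of the idea (illustrative only, not the PR's code): accumulate in a u64 and hand off to 128-bit state only when the next digit would overflow.

// Returns the widened value when the 64-bit accumulator would overflow,
// so the caller can continue with 128-bit state from that point on.
fn push_digit_64(acc: u64, digit: u8) -> Result<u64, u128> {
    match acc.checked_mul(10).and_then(|v| v.checked_add(digit as u64)) {
        Some(v) => Ok(v),                              // still fits in 64 bits
        None => Err(acc as u128 * 10 + digit as u128), // fall back to 128-bit state
    }
}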
This commit changes the parsing code to operate as a series of tail
calls. This allows loop and overflow conditions to be checked only
exactly when needed, and makes it very easy for the compiler to
generate quality code on the fast path.

The parsing functions are also parameterized by const generics to
generate optimized codepaths for various situations (i.e. has/hasn't
seen digits yet, has/hasn't seen the decimal point) and remove carried state.

cargo bench --bench lib_benches -- decimal_from_str

Baseline:
decimal_from_str    ... bench:     566 ns/iter (+/- 21) [M1 Pro]
decimal_from_str    ... bench:     554 ns/iter (+/- 21) [Ryzen 3990x]

Current:
decimal_from_str    ... bench:     85 ns/iter (+/- 2) [M1 Pro]
decimal_from_str    ... bench:     129 ns/iter (+/- 2) [Ryzen 3990x]
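A toy illustration of the shape of that approach (a plain digit accumulator, not the PR's functions): each handler ends in a direct call to the next one, so the optimizer can compile the chain into jumps rather than growing the stack.

fn parse(bytes: &[u8]) -> Option<u64> {
    digit_loop(bytes, 0, 0)
}

// Decide what to do with the next byte.
fn digit_loop(bytes: &[u8], idx: usize, acc: u64) -> Option<u64> {
    match bytes.get(idx).copied() {
        Some(b @ b'0'..=b'9') => handle_digit(bytes, idx, acc, b - b'0'), // tail call
        Some(_) => None,   // unexpected byte
        None => Some(acc), // end of input
    }
}

// Fold the digit into the accumulator, then continue the loop in tail position.
fn handle_digit(bytes: &[u8], idx: usize, acc: u64, d: u8) -> Option<u64> {
    let acc = acc.checked_mul(10)?.checked_add(d as u64)?;
    digit_loop(bytes, idx + 1, acc)
}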
@TradingTomatillo
Contributor

Pasting the interesting part of a comment here so I can respond in the new PR:

Since the change is quite large I want to spend a bit of time reviewing it over the next day or two. I want to get a better understanding of the #[cold] attributes in particular and make sure they make sense in this context.

Cold basically informs LLVM that we do not expect this function to be called. Practically speaking, that means LLVM might

  • Optimize for size instead of speed in the cold function (in practice, these two aren't that far apart anyways)
  • Use a calling convention meant to push costs to the callee instead of the caller (in the parser, the cold functions are just passed a few integers anyways)
  • Optimize branch layouts with the assumption cold codepaths won't get called.

Because these are all variations of "might", the impact varies. It has much less of an impact in the common case with small strings because we statically remove the various overflow checks, so it will matter more for larger strings. Before the size check (the BIG variable) was added, the cold annotations took the time from ~150 ns/iter to ~140, but after it they are less impactful. It doesn't look like large decimals are very well represented in the benchmarks, although it's probably reasonable to optimize for strings with fewer than 18 digits in practice anyways.
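A tiny illustration of the attribute (hypothetical functions, not the parser's code):

// Hypothetical example: marking the error path cold tells LLVM to optimize
// and lay out the hot loop first; the cold function may also be optimized
// for size rather than speed.
#[cold]
#[inline(never)]
fn overflow_error(len: usize) -> Result<u64, String> {
    Err(format!("input of {} digits overflows the accumulator", len))
}

// Assumes `digits` contains ASCII digits '0'..='9' only.
fn accumulate(digits: &[u8]) -> Result<u64, String> {
    let mut acc: u64 = 0;
    for &d in digits {
        acc = match acc
            .checked_mul(10)
            .and_then(|v| v.checked_add((d - b'0') as u64))
        {
            Some(v) => v,
            None => return overflow_error(digits.len()), // unlikely path
        };
    }
    Ok(acc)
}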

@TradingTomatillo
Contributor

TradingTomatillo commented Dec 31, 2021

You can play around with this godbolt link and see: https://godbolt.org/z/rnhb5anhP. You can see that test3 is the non-cold one, so the compiler optimizes for it (forward jumps are predicted not-taken by default on Intel, so the compiler makes test3 the fallthrough case).

@TradingTomatillo
Contributor

TradingTomatillo commented Dec 31, 2021

For what it's worth, the tail call approach won't be great on architectures that don't support tail calls or that want to pass parameters on the stack. I think the only relevant architecture here for Rust is WebAssembly (and maybe arm32). The options I see are:

  1. Just eat the recursion cost since the call depth is bounded (probably not great?)
  2. Retain an improved-but-tailless version for architectures outside of x86, arm64, risc-v. This would be a higher code maintenance burden since you have 3 nontrivial cases (radix-10-x64, radix-10-general, radix-any).
  3. Reuse the general radix parser for other architectures. Not that fast, but there's no reason some of the easier optimizations can't be applied there at a later date.

This change gives each of the possible operations for the u64
accumulation a dedicated dispatch site, giving each operation dedicated
branch prediction.

cargo bench --bench lib_benches -- decimal_from_str

Baseline:
decimal_from_str    ... bench:     566 ns/iter (+/- 21) [M1 Pro]
decimal_from_str    ... bench:     554 ns/iter (+/- 21) [Ryzen 3990x]

Current:
decimal_from_str    ... bench:     81 ns/iter (+/- 1) [M1 Pro]
decimal_from_str    ... bench:     120 ns/iter (+/- 2) [Ryzen 3990x]
Remove the up-front handling of characters like '+' and '-', and the inline
handling of '_', and put all of it into a function dedicated to uncommon characters.
This improves the up-front cost for numbers with no prefix, and has a startup
cost similar to the previous checking code when rare characters are present.

Baseline:
decimal_from_str    ... bench:     566 ns/iter (+/- 21) [M1 Pro]

Current:
decimal_from_str    ... bench:     71 ns/iter (+/- 2) [M1 Pro]
@TradingTomatillo
Contributor

Latest numbers are down to 70 ns/iter on M1; will benchmark on an x86 processor in the next day or two.

This makes sure that the parsing behaviour is the same as before.
Owner

@paupino paupino left a comment


This is quite an innovative approach to this!

I see how the tail recursion could be problematic if optimization isn't available/used. Would it be worthwhile to feature flag this in the meantime? That way it could be an opt-in (or opt-out) feature.

You're right that it does have the overhead of maintaining two different versions; however, we should be able to cater for that somewhat with CI. It's not without other downsides of course (e.g. discovery, deprecation) but for now it may make sense.

Thoughts? I can always help wrap up the PR with feature flags + CI if you like.

if str.is_empty() {
    return Err(Error::from("Invalid decimal: empty"));
}
let bytes = str.as_bytes();
// handle the sign
Owner


nit: looks like this is leftover

offset += 1;
len -= 1;
#[inline]
fn byte_dispatch_u64<const POINT: bool, const NEG: bool, const HAS: bool, const BIG: bool, const FIRST: bool>(
Owner


My one qualm with these const generics is knowing what each one does. Even though this is an internal method, it'd be great to add a little documentation just to cover what each is for.

  • POINT - a decimal point has been seen
  • NEG - we've encountered a - and the number is negative
  • HAS - a digit has been encountered (when HAS is false it's invalid)
  • BIG - a number that uses 96 bits instead of only 64 bits
  • FIRST - true if it is the first byte in the string
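For example, a sketch of such documentation (the parameter list and body here are placeholders; only the flag names and their meanings come from the list above):

/// Dispatch on the next byte while accumulating the coefficient in a u64.
///
/// Each const generic statically specializes the generated code:
/// * `POINT` - a decimal point has already been seen
/// * `NEG`   - a '-' was encountered, so the number is negative
/// * `HAS`   - at least one digit has been encountered
/// * `BIG`   - the number needs 96 bits instead of only 64 bits
/// * `FIRST` - this is the first byte of the string
#[inline]
fn byte_dispatch_u64<const POINT: bool, const NEG: bool, const HAS: bool, const BIG: bool, const FIRST: bool>(
    _bytes: &[u8],
    _acc: u64,
    _scale: u8,
) -> Result<u64, &'static str> {
    unimplemented!()
}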

@@ -92,6 +93,25 @@ pub(crate) fn add_by_internal(value: &mut [u32], by: &[u32]) -> u32 {
carry as u32
}

pub(crate) fn add_by_internal_flattened(value: &mut [u32; 3], by: u32) -> u32 {
Owner


I previously had something similar - `add_by_internal3`. At some stage I'd love to move away from these functions; however, it makes sense to introduce it for this PR.
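For reference, a minimal standalone sketch of what such a flattened add-with-carry can look like (illustrative; not necessarily the exact code in this diff):

// Add a single u32 into a little-endian [u32; 3] with carry propagation,
// returning whatever carries out of the top word.
fn add_by_internal_flattened(value: &mut [u32; 3], by: u32) -> u32 {
    let mut carry = by as u64;
    for word in value.iter_mut() {
        let sum = *word as u64 + carry;
        *word = sum as u32; // keep the low 32 bits in place
        carry = sum >> 32;  // propagate the high bits to the next word
        if carry == 0 {
            break;
        }
    }
    carry as u32
}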

@chris-cantor
Author

Sure! Let me work on the changes

@TradingTomatillo
Contributor

As for the tail-call-or-not question: Everything before the tail calls is still a significant improvement and shouldn't really introduce a lot of platform dependency.

For the tail calls, it might make sense to keep the later commits here as a WIP PR, add the non-tail optimizations to the generalized parser, and see what performance looks like there. If that performs acceptably, choice 3 (only using the optimized parser on x86/arm64/risc-v) might be fine?

@paupino
Owner

paupino commented Jan 8, 2022

That makes sense. I'm going to merge this as is for now, and we can iterate from there.

paupino added a commit that referenced this pull request Jan 8, 2022
* Fix `from_str` overflow rounding issue
* Remove coeff ArrayVec from `parse_str_radix_10`
* Use 128 bit integer to speed up `parse_str_radix_10`
* Use 64 bit integer with 128 bit fallback to speed up further
* Convert into series of tail-calls
* Give each operation a dedicated dispatch
* Move 'rare' digit handling into dedicated function
* Remove unimportant const generic from digit dispatch
* Verify test cases in unpacked representation

Co-authored-by: Tomatillo <[email protected]>
@paupino
Owner

paupino commented Jan 8, 2022

Ok, that's weird. This is merged now in fca96a5? I'm going to close this PR since it remained open.

Development

Successfully merging this pull request may close these issues.

from_str rounding issues when overflow from too many digits
3 participants