Speed up count_hyper and num_chars_hyper #35

Open
wants to merge 6 commits into
base: master

AdamNiederer

Hello! I've managed to simplify this library's code and features and get a pretty sizeable speedup by using faster instead of simd. We get a 5.5x speedup in the best case, and a 4% slowdown in the worst case. Additionally, this fixes a compilation error when using --features simd-accel on a target without SSE2.

faster kabylake
running 15 tests
test bench_count_30000_hyper           ... bench:   1,172 ns/iter (+/- 139)
test bench_count_big_0100000_hyper     ... bench:   3,892 ns/iter (+/- 9)
test bench_count_big_1000000_hyper     ... bench:  38,850 ns/iter (+/- 10,860)
test bench_num_chars_30000_hyper       ... bench:     805 ns/iter (+/- 133)
test bench_num_chars_big_0100000_hyper ... bench:   3,059 ns/iter (+/- 464)
test bench_num_chars_big_1000000_hyper ... bench:  34,522 ns/iter (+/- 2,631)

faster nehalem
test bench_count_30000_hyper           ... bench:   2,478 ns/iter (+/- 328)
test bench_count_big_0100000_hyper     ... bench:   8,079 ns/iter (+/- 3)
test bench_count_big_1000000_hyper     ... bench:  80,784 ns/iter (+/- 24,800)
test bench_num_chars_30000_hyper       ... bench:   1,380 ns/iter (+/- 10)
test bench_num_chars_big_0100000_hyper ... bench:   4,837 ns/iter (+/- 49)
test bench_num_chars_big_1000000_hyper ... bench:  50,925 ns/iter (+/- 7,802)

faster x86-64
test bench_count_30000_hyper           ... bench:   2,347 ns/iter (+/- 2)
test bench_count_big_0100000_hyper     ... bench:   8,325 ns/iter (+/- 2,010)
test bench_count_big_1000000_hyper     ... bench:  83,036 ns/iter (+/- 14,618)
test bench_num_chars_30000_hyper       ... bench:   1,474 ns/iter (+/- 350)
test bench_num_chars_big_0100000_hyper ... bench:   4,695 ns/iter (+/- 56)
test bench_num_chars_big_1000000_hyper ... bench:  51,214 ns/iter (+/- 22,773)

faster pentium
test bench_count_30000_hyper           ... bench:  50,619 ns/iter (+/- 27)
test bench_count_big_0100000_hyper     ... bench: 168,750 ns/iter (+/- 2,620)
test bench_count_big_1000000_hyper     ... bench: 1,793,224 ns/iter (+/- 184,043)
test bench_num_chars_30000_hyper       ... bench:   1,441 ns/iter (+/- 2)
test bench_num_chars_big_0100000_hyper ... bench:   5,045 ns/iter (+/- 79)
test bench_num_chars_big_1000000_hyper ... bench:  52,989 ns/iter (+/- 4,530)

simd kabylake
test bench_count_30000_hyper           ... bench:   1,658 ns/iter (+/- 71)
test bench_count_big_0100000_hyper     ... bench:   5,999 ns/iter (+/- 11)
test bench_count_big_1000000_hyper     ... bench:  61,536 ns/iter (+/- 2,047)
test bench_num_chars_30000_hyper       ... bench:   5,506 ns/iter (+/- 176)
test bench_num_chars_big_0100000_hyper ... bench:  19,045 ns/iter (+/- 2,317)
test bench_num_chars_big_1000000_hyper ... bench: 190,179 ns/iter (+/- 6,082)

simd nehalem
test bench_count_30000_hyper           ... bench:   2,011 ns/iter (+/- 33)
test bench_count_big_0100000_hyper     ... bench:   7,728 ns/iter (+/- 307)
test bench_count_big_1000000_hyper     ... bench:  77,853 ns/iter (+/- 5,531)
test bench_num_chars_30000_hyper       ... bench:     988 ns/iter (+/- 7)
test bench_num_chars_big_0100000_hyper ... bench:   4,137 ns/iter (+/- 528)
test bench_num_chars_big_1000000_hyper ... bench:  45,211 ns/iter (+/- 6,833)

simd x86-64
test bench_count_30000_hyper           ... bench:   2,286 ns/iter (+/- 313)
test bench_count_big_0100000_hyper     ... bench:   7,610 ns/iter (+/- 898)
test bench_count_big_1000000_hyper     ... bench:  79,711 ns/iter (+/- 5,352)
test bench_num_chars_30000_hyper       ... bench:     987 ns/iter (+/- 30)
test bench_num_chars_big_0100000_hyper ... bench:   3,985 ns/iter (+/- 248)
test bench_num_chars_big_1000000_hyper ... bench:  43,328 ns/iter (+/- 1,382)

simd pentium
error[E0432]: unresolved import `x86::sse2`
  --> /home/adam/.cargo/registry/src/github.conef.uk-1ecc6299db9ec823/simd-0.2.1/src/common.rs:16:10
   |
16 | use x86::sse2::common;
   |          ^^^^ Could not find `sse2` in `x86`

error: aborting due to previous error

error: Could not compile `simd`.
warning: build failed, waiting for other jobs to finish...
error: build failed
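
As a rough check on the 5.5x and 4% figures, the ratios can be recomputed from the numbers posted above. This is a minimal sketch: the helper names are made up for illustration, and the choice of rows (kabylake `num_chars_big_1000000` for the best case, x86-64 `count_big_1000000` for the worst case) is an assumption about which rows the summary refers to. Other rows give different ratios.

```rust
/// Ratio of baseline time to candidate time; above 1.0 means the candidate is faster.
fn speedup(baseline_ns: f64, candidate_ns: f64) -> f64 {
    baseline_ns / candidate_ns
}

/// Percentage by which the candidate is slower than the baseline
/// (negative when the candidate is faster).
fn slowdown_percent(baseline_ns: f64, candidate_ns: f64) -> f64 {
    (candidate_ns / baseline_ns - 1.0) * 100.0
}

/// kabylake, bench_num_chars_big_1000000_hyper: simd 190,179 ns vs faster 34,522 ns.
fn best_case_speedup() -> f64 {
    speedup(190_179.0, 34_522.0)
}

/// x86-64, bench_count_big_1000000_hyper: simd 79,711 ns vs faster 83,036 ns.
fn worst_case_slowdown_percent() -> f64 {
    slowdown_percent(79_711.0, 83_036.0)
}
```

With these rows, `best_case_speedup()` comes out at roughly 5.5 and `worst_case_slowdown_percent()` at roughly 4.2.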

@BurntSushi

Note that this will impose a copyleft dependency on all dependents of bytecount. I personally will not abide such a license change. I will do whatever is necessary to avoid it.

@llogiq
Owner

llogiq commented Jan 28, 2018

That's interesting. I'll do my own benchmarks. I'm not sure if I can allow the additional dependency, given its license.

@llogiq
Owner

llogiq commented Jan 28, 2018

Also it appears you removed the avx-accel feature, which broke the build. If it is no longer needed, we should remove it from the build matrix for both Travis & appveyor.

@AdamNiederer
Author

AdamNiederer commented Jan 28, 2018

> That's interesting. I'll do my own benchmarks. I'm not sure if I can allow the additional dependency, given its license.

Thanks for the interest!

(I am not a lawyer, but) the MPL isn't a viral copyleft, so your dependents should have responsibilities similar to those of an MIT-licensed work (the "license statement" required by the MIT license would also contain a link to my repo, but that's it) unless they decide to modify faster itself. You and your dependents can still license your project as you wish, as well.

> Also it appears you removed the avx-accel feature, which broke the build. If it is no longer needed, we should remove it from the build matrix for both Travis & appveyor.

Yes - once faster is turned on, it will determine what to emit based on the feature level of the CPU. This also works for non-x86 architectures (although I haven't added many SIMD intrinsics for those, yet). I'll remove the feature from the README and travis.
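
One way this kind of selection can work in Rust is through the compiler's `target_feature` configuration, which follows whatever `-C target-cpu` or `-C target-feature` was passed. The sketch below uses only the standard `cfg!` macro; the function names are hypothetical and this is not code from faster, just an illustration of the idea under the assumption that the choice is made at compile time.

```rust
/// Name of the widest x86 vector extension enabled for this build.
/// Evaluated at compile time from the target features rustc was given.
fn compiled_simd_level() -> &'static str {
    if cfg!(target_feature = "avx2") {
        "avx2"
    } else if cfg!(target_feature = "sse2") {
        "sse2"
    } else {
        "scalar"
    }
}

/// Number of bytes a single vector would hold at that level
/// (1 stands for the plain scalar fallback).
fn bytes_per_vector() -> usize {
    match compiled_simd_level() {
        "avx2" => 32,
        "sse2" => 16,
        _ => 1,
    }
}
```

Building with `RUSTFLAGS="-C target-cpu=native"` on an AVX2 machine would select the first branch, while a target without SSE2 would fall through to the scalar case instead of failing to compile.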

@AdamNiederer
Author

Good news! A contributor to faster found a way to speed up its core iteration algorithm by a ton, so we're now consistently 5% faster than simd using SSE (from ~5% slower)! I've re-benched on a machine which isn't thermally constrained (much higher consistency), but which doesn't support AVX2:

scalar ivybridge
test bench_count_00000_32              ... bench:           1 ns/iter (+/- 0)
test bench_count_00000_hyper           ... bench:           2 ns/iter (+/- 0)
test bench_count_00000_naive           ... bench:           1 ns/iter (+/- 0)
test bench_count_00010_32              ... bench:           6 ns/iter (+/- 0)
test bench_count_00010_hyper           ... bench:           5 ns/iter (+/- 0)
test bench_count_00010_naive           ... bench:           5 ns/iter (+/- 0)
test bench_count_00020_32              ... bench:           5 ns/iter (+/- 0)
test bench_count_00020_hyper           ... bench:           8 ns/iter (+/- 0)
test bench_count_00020_naive           ... bench:           8 ns/iter (+/- 0)
test bench_count_30000_32              ... bench:       2,740 ns/iter (+/- 79)
test bench_count_30000_hyper           ... bench:       2,691 ns/iter (+/- 104)
test bench_count_30000_naive           ... bench:       7,657 ns/iter (+/- 227)
test bench_count_big_0100000_32        ... bench:       9,113 ns/iter (+/- 285)
test bench_count_big_0100000_hyper     ... bench:       8,918 ns/iter (+/- 289)
test bench_count_big_0100000_naive     ... bench:      25,493 ns/iter (+/- 685)
test bench_count_big_1000000_32        ... bench:      91,150 ns/iter (+/- 2,807)
test bench_count_big_1000000_hyper     ... bench:      88,726 ns/iter (+/- 2,904)
test bench_count_big_1000000_naive     ... bench:     254,864 ns/iter (+/- 9,429)

simd ivybridge
test bench_count_00000_32              ... bench:           1 ns/iter (+/- 0)
test bench_count_00000_hyper           ... bench:           4 ns/iter (+/- 0)
test bench_count_00000_naive           ... bench:           1 ns/iter (+/- 0)
test bench_count_00010_32              ... bench:           6 ns/iter (+/- 0)
test bench_count_00010_hyper           ... bench:           7 ns/iter (+/- 1)
test bench_count_00010_naive           ... bench:           5 ns/iter (+/- 0)
test bench_count_00020_32              ... bench:           5 ns/iter (+/- 1)
test bench_count_00020_hyper           ... bench:          10 ns/iter (+/- 1)
test bench_count_00020_naive           ... bench:           8 ns/iter (+/- 0)
test bench_count_30000_32              ... bench:       2,738 ns/iter (+/- 13)
test bench_count_30000_hyper           ... bench:       1,578 ns/iter (+/- 9)
test bench_count_30000_naive           ... bench:       7,648 ns/iter (+/- 31)
test bench_count_big_0100000_32        ... bench:       9,109 ns/iter (+/- 32)
test bench_count_big_0100000_hyper     ... bench:       5,683 ns/iter (+/- 24)
test bench_count_big_0100000_naive     ... bench:      25,493 ns/iter (+/- 94)
test bench_count_big_1000000_32        ... bench:      91,110 ns/iter (+/- 758)
test bench_count_big_1000000_hyper     ... bench:      56,813 ns/iter (+/- 271)
test bench_count_big_1000000_naive     ... bench:     254,770 ns/iter (+/- 1,254)

faster 0.4.2 ivybridge
test bench_count_00000_32              ... bench:           1 ns/iter (+/- 0)
test bench_count_00000_hyper           ... bench:           2 ns/iter (+/- 0)
test bench_count_00000_naive           ... bench:           1 ns/iter (+/- 0)
test bench_count_00010_32              ... bench:           8 ns/iter (+/- 0)
test bench_count_00010_hyper           ... bench:           6 ns/iter (+/- 0)
test bench_count_00010_naive           ... bench:           5 ns/iter (+/- 0)
test bench_count_00020_32              ... bench:           6 ns/iter (+/- 0)
test bench_count_00020_hyper           ... bench:           9 ns/iter (+/- 1)
test bench_count_00020_naive           ... bench:           8 ns/iter (+/- 1)
test bench_count_30000_32              ... bench:       2,743 ns/iter (+/- 81)
test bench_count_30000_hyper           ... bench:       1,849 ns/iter (+/- 57)
test bench_count_30000_naive           ... bench:       7,652 ns/iter (+/- 48)
test bench_count_big_0100000_32        ... bench:       9,117 ns/iter (+/- 214)
test bench_count_big_0100000_hyper     ... bench:       6,142 ns/iter (+/- 166)
test bench_count_big_0100000_naive     ... bench:      25,499 ns/iter (+/- 749)
test bench_count_big_1000000_32        ... bench:      91,132 ns/iter (+/- 2,818)
test bench_count_big_1000000_hyper     ... bench:      61,477 ns/iter (+/- 1,779)
test bench_count_big_1000000_naive     ... bench:     255,213 ns/iter (+/- 9,373)

faster 0.4.3 ivybridge
test bench_count_00000_32              ... bench:           1 ns/iter (+/- 0)
test bench_count_00000_hyper           ... bench:           2 ns/iter (+/- 0)
test bench_count_00000_naive           ... bench:           1 ns/iter (+/- 0)
test bench_count_00010_32              ... bench:           6 ns/iter (+/- 0)
test bench_count_00010_hyper           ... bench:           6 ns/iter (+/- 0)
test bench_count_00010_naive           ... bench:           5 ns/iter (+/- 0)
test bench_count_00020_32              ... bench:           5 ns/iter (+/- 0)
test bench_count_00020_hyper           ... bench:           8 ns/iter (+/- 0)
test bench_count_00020_naive           ... bench:           8 ns/iter (+/- 1)
test bench_count_30000_32              ... bench:       2,736 ns/iter (+/- 18)
test bench_count_30000_hyper           ... bench:       1,592 ns/iter (+/- 11)
test bench_count_30000_naive           ... bench:       7,644 ns/iter (+/- 41)
test bench_count_big_0100000_32        ... bench:       9,101 ns/iter (+/- 28)
test bench_count_big_0100000_hyper     ... bench:       5,342 ns/iter (+/- 29)
test bench_count_big_0100000_naive     ... bench:      25,479 ns/iter (+/- 96)
test bench_count_big_1000000_32        ... bench:      91,074 ns/iter (+/- 457)
test bench_count_big_1000000_hyper     ... bench:      53,526 ns/iter (+/- 298)
test bench_count_big_1000000_naive     ... bench:     254,972 ns/iter (+/- 2,532)

And here's an updated AVX2 benchmark (same machine as the originals):

test bench_count_00000_hyper       ... bench:           2 ns/iter (+/- 1)
test bench_count_00010_hyper       ... bench:           7 ns/iter (+/- 0)
test bench_count_00020_hyper       ... bench:           7 ns/iter (+/- 0)
test bench_count_30000_hyper       ... bench:         807 ns/iter (+/- 3)
test bench_count_big_0100000_hyper ... bench:       3,023 ns/iter (+/- 1,145)
test bench_count_big_1000000_hyper ... bench:      30,959 ns/iter (+/- 27)

@llogiq
Owner

llogiq commented Jan 29, 2018

This is really awesome stuff! I like it. I'm still not going to merge it as is. @BurntSushi is one of bytecount's major "customers", and ripgrep is a tool I use quite often. I don't want any conflict arising from license issues.

So I see two possible solutions:

  1. We change this PR to make faster optional, restoring the current plain, simd-accel and avx-accel implementations and allowing users of the crate to choose whether to use faster or not (I personally think this would be a good first step in a transition anyway)
  2. We manage to convince all faster authors to relicense to MIT / Apache 2. I'm not sure how open you folks are to that, but it's happened in the past, so I won't rule anything out

@fitzgen

fitzgen commented Jan 29, 2018 via email

@Veedrac
Collaborator

Veedrac commented Jan 29, 2018

I don't think the license should be an issue, though @BurntSushi's reply makes me want to be cautious.

I don't actually think there are real speed gains to changing to faster per se; rather something seems to have broken really recently around code-gen.

$ RUSTFLAGS="-C target-cpu=native" cargo bench count_04000_hyper --features avx-accel
    Finished release [optimized] target(s) in 0.0 secs
     Running target/release/deps/bench-9f97338ce66e1df7

running 1 test
test bench_count_04000_hyper           ... bench:          50 ns/iter (+/- 4)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured

$ RUSTFLAGS="-C target-cpu=native" cargo bench count_05000_hyper --features avx-accel
    Finished release [optimized] target(s) in 0.0 secs
     Running target/release/deps/bench-9f97338ce66e1df7

running 1 test
test bench_count_05000_hyper           ... bench:         286 ns/iter (+/- 54)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured

This did not show up in late December last year when I made these charts. Given how much of a local minimum this is, and the entirely unprincipled way I arrived at it (I really should do this properly some day), swings aren't entirely surprising, but the huge size of the swing is unfortunate.

That is not to say runtime feature detection would not justify this on its own.

@Veedrac
Collaborator

Veedrac commented Jan 29, 2018

@AdamNiederer Would you mind testing against #36? I suspect that should fix the performance issue.

@BurntSushi

BurntSushi commented Jan 29, 2018

@Veedrac

> I don't think the license should be an issue, though @BurntSushi's reply makes me want to be cautious.

To clarify, I did not mean to imply that the MPL would impact the copyright of my code. I've never been under that misinterpretation. :-) I simply do not want to depend on any type of copyleft in my code.

(Debating the finer points of my position here doesn't seem appropriate. The intent of my initial comment was to make it clear that there is a line in the sand and I am firmly on one side of it. If folks would like to discuss this further, then please email me.)

@llogiq
Owner

llogiq commented Jan 30, 2018

On my low-powered skylake machine, which is admittedly not well-suited to benchmarks, with rustc 1.25.0-nightly (bacb5c58d 2018-01-26), your version is consistently slower than the current master with simd-accel (avx-accel is currently pessimized, see #36). The asymptote shows ~35% slowdown for num_chars and ~130% slowdown on count. I'll recheck on a beefier machine when I get around to it, but for now I certainly won't merge it without further checks.

@AdamNiederer
Author

AdamNiederer commented Jan 30, 2018

After looking at the disassembly, I realized LLVM wasn't eliding a branch as I had hoped. I've updated the core algorithm in my latest commit, which puts it ahead of simd on @Veedrac's branch.

[bytecount-veedrac]$ RUSTFLAGS="-C target-cpu=native" cargo bench --features avx-accel
test bench_count_30000_hyper       ... bench:         324 ns/iter (+/- 35)
test bench_count_big_0100000_hyper ... bench:       1,474 ns/iter (+/- 75)
test bench_count_big_1000000_hyper ... bench:      20,475 ns/iter (+/- 3,305)
[bytecount-faster]$ RUSTFLAGS="-C target-cpu=native" cargo bench --features simd-accel
test bench_count_30000_hyper       ... bench:         334 ns/iter (+/- 13)
test bench_count_big_0100000_hyper ... bench:       1,315 ns/iter (+/- 36)
test bench_count_big_1000000_hyper ... bench:      18,983 ns/iter (+/- 749)

EDIT: And SSE:

[bytecount-veedrac]$ RUSTFLAGS="-C target-cpu=nehalem" cargo bench --features simd-accel
test bench_count_30000_hyper       ... bench:         677 ns/iter (+/- 16)
test bench_count_big_0100000_hyper ... bench:       2,632 ns/iter (+/- 33)
test bench_count_big_1000000_hyper ... bench:      30,718 ns/iter (+/- 3,850)
[bytecount-faster]$ RUSTFLAGS="-C target-cpu=nehalem" cargo bench --features simd-accel
test bench_count_30000_hyper       ... bench:         639 ns/iter (+/- 14)
test bench_count_big_0100000_hyper ... bench:       2,409 ns/iter (+/- 16)
test bench_count_big_1000000_hyper ... bench:      28,354 ns/iter (+/- 273)

Again, this machine is thermally constrained so YMMV. rustc is the 01/29 nightly.
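
On the branch-elision point: the difference between a per-byte branch and a branch-free formulation can be illustrated with plain scalar code. This is only a sketch with hypothetical function names, not the change made in the commit, and an optimizer is free to compile both versions to the same machine code.

```rust
/// Count occurrences of `needle` using an explicit conditional per byte.
fn count_branchy(haystack: &[u8], needle: u8) -> usize {
    let mut total = 0;
    for &byte in haystack {
        if byte == needle {
            total += 1;
        }
    }
    total
}

/// Same result, written so that each comparison becomes 0 or 1 and is
/// added unconditionally, leaving no data-dependent branch in the source.
fn count_branchless(haystack: &[u8], needle: u8) -> usize {
    haystack.iter().map(|&byte| (byte == needle) as usize).sum()
}
```

Checking the disassembly, as described above, is the only reliable way to see which form the compiler actually produced.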

I haven't attempted to reimplement num_chars in faster yet, because simply ripping out the simd implementation was a huge speedup before the codegen fix. I'll see what I can do with that.
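
For reference, a scalar baseline for `num_chars` can be sketched as follows, assuming the function is meant to count UTF-8 code points in a byte slice. The name `num_chars_naive` is illustrative. It relies on the fact that every UTF-8 continuation byte has the bit pattern `10xxxxxx`, so counting the bytes that are not continuation bytes gives the number of code points in valid UTF-8.

```rust
/// Count UTF-8 code points by counting every byte that is not a
/// continuation byte (continuation bytes match 0b10xx_xxxx).
/// The input is assumed to be valid UTF-8.
fn num_chars_naive(utf8: &[u8]) -> usize {
    utf8.iter().filter(|&&byte| (byte & 0xC0) != 0x80).count()
}
```

For a valid string `s`, `num_chars_naive(s.as_bytes())` agrees with `s.chars().count()`, which makes it a convenient oracle when testing a vectorized version.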

Using `faster` yields a serious speedup over `simd`, and removes the need for
feature segmentation.

Even without faster, the "hyper" method of counting is slower for slices with a
small size.
Takes us from 30kns -> 18.9kns
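
The observation that the "hyper" method is slower for small slices suggests dispatching on length. The sketch below is one hedged way to do that: `SMALL_SLICE_THRESHOLD`, `count_naive`, `count_chunked`, and `count` are hypothetical names, the threshold value is a placeholder rather than a measured cutoff, and `count_chunked` is only a scalar stand-in for a real vectorized routine.

```rust
/// Placeholder cutoff; a real value would have to come from benchmarks.
const SMALL_SLICE_THRESHOLD: usize = 32;

/// Straightforward byte-by-byte count.
fn count_naive(haystack: &[u8], needle: u8) -> usize {
    haystack.iter().filter(|&&byte| byte == needle).count()
}

/// Stand-in for a vectorized routine: walks fixed 16-byte chunks,
/// then handles whatever bytes are left over.
fn count_chunked(haystack: &[u8], needle: u8) -> usize {
    let mut chunks = haystack.chunks_exact(16);
    let mut total = 0;
    for chunk in chunks.by_ref() {
        total += chunk.iter().filter(|&&byte| byte == needle).count();
    }
    total + count_naive(chunks.remainder(), needle)
}

/// Use the simple loop for short inputs and the chunked path otherwise.
fn count(haystack: &[u8], needle: u8) -> usize {
    if haystack.len() < SMALL_SLICE_THRESHOLD {
        count_naive(haystack, needle)
    } else {
        count_chunked(haystack, needle)
    }
}
```

The 10- and 20-byte rows in the ivybridge benchmarks, where the naive loop matches or beats the hyper variant, are the kind of data that would inform the cutoff.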