-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up count_hyper and num_chars_hyper #35
base: master
Are you sure you want to change the base?
Conversation
Note that this will impose a copyleft dependency on all dependents of |
That's interesting. I'll do my own benchmarks. I'm not sure if I can allow the additional dependency, given its license. |
Also it appears you removed the |
Thanks for the interest! (I am not a lawyer but) The MPL isn't a viral copyleft, so your dependents should have similar responsibilities to that of an MIT-licensed work (the "license statement" required by the MIT license would also contain a link to my repo, but that's it) unless they decide to modify faster itself. You and your dependents can still license your project as you wish, as well.
Yes - once faster is turned on, it will determine what to emit based on the feature level of the CPU. This also works for non-x86 architectures (although I haven't added many SIMD intrinsics for those, yet). I'll remove the feature from the README and travis. |
Good news! A contributor to faster found a way to speed up its core iteration algorithm by a ton, so we're now consistently 5% faster than
And here's an updated AVX2 benchmark (same machine as the originals):
|
This is really awesome stuff! I like it. I'm still not going to merge it as is. @BurntSushi is one of bytecount major "customers" and ripgrep is a tool I use quite often. I don't want any conflict coming from license issues. So I see two possible solutions:
|
FWIW, my understanding of the MPL is also that it is not viral. Only
changes to MPL'd source files are required to be releaaed under the MPL.
Just linking / depending on an MPL library does not mean that the dependent
code must also be MPL'd.
…On Jan 28, 2018 4:28 PM, "llogiq" ***@***.***> wrote:
This is really awesome stuff! I like it. I'm still not going to merge it
as is. @BurntSushi <https://github.com/burntsushi> is one of bytecount
major "customers" and ripgrep is a tool I use quite often. I don't want any
conflict coming from license issues.
So I see two possible solutions:
1. We change this PR to make faster optional, restoring the current
plain, simd-accel and avx-accel implementations and allowing users of the
crate to choose whether to use faster or not (I personally think this would
be a good first step in a transition anyway)
2. We manage to convince all faster authors to relicense to MIT /
Apache 2. I'm not sure how open you folks are for that, but it's happened
in the past, so I won't rule out anything
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#35 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAEjS38ikV0dD0lOUQLW6kGZHcZ_Yexxks5tPRCwgaJpZM4RvdKi>
.
|
I don't think the license should be an issue, though @BurntSushi's reply makes me want to be cautious. I don't actually think there are real speed gains to changing to
This did not show up late December last year when I made these charts. Given how much of a local minima this is, and the entirely unprincipled way I arrived at it (I really should do this properly some day), swings aren't entirely surprising, but the huge size of the swing is unfortunate. That is not to say runtime feature detection would not justify this on its own. |
@AdamNiederer Would you mind testing against #36? I suspect that should fix the performance issue. |
To clarify, I did not mean to imply that the MPL would impact the copyright of my code. I've never been under that misinterpretation. :-) I simply do not want to depend on any type of copyleft in my code. (Debating the finer points of my position here doesn't seem appropriate. The intent of my initial comment was to make it clear that there is a line in the sand and I am firmly on one side of it. If folks would like to discuss this further, then please email me.) |
On my low-powered skylake machine, which is admittedly not well-suited to benchmarks, with rustc 1.25.0-nightly (bacb5c58d 2018-01-26), your version is consistently slower than the current master with simd-accel (avx-accel is currently pessimized, see #36). The asymptote shows ~35% slowdown for |
After looking at the disassembly, I realized LLVM wasn't eliding a branch as I had hoped. I've updated the core algorithm in my latest commit, which puts it ahead of
EDIT: And SSE:
Again, this machine is thermally constrained so YMMV. I haven't attempted to reimplement |
Using `faster` yields a serious speedup over `simd`, and removes the need for feature segmentation. faster kabylake running 15 tests test bench_count_30000_hyper ... bench: 1,172 ns/iter (+/- 139) test bench_count_big_0100000_hyper ... bench: 3,892 ns/iter (+/- 9) test bench_count_big_1000000_hyper ... bench: 38,850 ns/iter (+/- 10,860) test bench_num_chars_30000_hyper ... bench: 805 ns/iter (+/- 133) test bench_num_chars_big_0100000_hyper ... bench: 3,059 ns/iter (+/- 464) test bench_num_chars_big_1000000_hyper ... bench: 34,522 ns/iter (+/- 2,631) faster nehalem test bench_count_30000_hyper ... bench: 2,478 ns/iter (+/- 328) test bench_count_big_0100000_hyper ... bench: 8,079 ns/iter (+/- 3) test bench_count_big_1000000_hyper ... bench: 80,784 ns/iter (+/- 24,800) test bench_num_chars_30000_hyper ... bench: 1,380 ns/iter (+/- 10) test bench_num_chars_big_0100000_hyper ... bench: 4,837 ns/iter (+/- 49) test bench_num_chars_big_1000000_hyper ... bench: 50,925 ns/iter (+/- 7,802) faster x86-64 test bench_count_30000_hyper ... bench: 2,347 ns/iter (+/- 2) test bench_count_big_0100000_hyper ... bench: 8,325 ns/iter (+/- 2,010) test bench_count_big_1000000_hyper ... bench: 83,036 ns/iter (+/- 14,618) test bench_num_chars_30000_hyper ... bench: 1,474 ns/iter (+/- 350) test bench_num_chars_big_0100000_hyper ... bench: 4,695 ns/iter (+/- 56) test bench_num_chars_big_1000000_hyper ... bench: 51,214 ns/iter (+/- 22,773) faster pentium test bench_count_30000_hyper ... bench: 50,619 ns/iter (+/- 27) test bench_count_big_0100000_hyper ... bench: 168,750 ns/iter (+/- 2,620) test bench_count_big_1000000_hyper ... bench: 1,793,224 ns/iter (+/- 184,043) test bench_num_chars_30000_hyper ... bench: 1,441 ns/iter (+/- 2) test bench_num_chars_big_0100000_hyper ... bench: 5,045 ns/iter (+/- 79) test bench_num_chars_big_1000000_hyper ... bench: 52,989 ns/iter (+/- 4,530) simd kabylake test bench_count_30000_hyper ... bench: 1,658 ns/iter (+/- 71) test bench_count_big_0100000_hyper ... bench: 5,999 ns/iter (+/- 11) test bench_count_big_1000000_hyper ... bench: 61,536 ns/iter (+/- 2,047) test bench_num_chars_30000_hyper ... bench: 5,506 ns/iter (+/- 176) test bench_num_chars_big_0100000_hyper ... bench: 19,045 ns/iter (+/- 2,317) test bench_num_chars_big_1000000_hyper ... bench: 190,179 ns/iter (+/- 6,082) simd nehalem test bench_count_30000_hyper ... bench: 2,011 ns/iter (+/- 33) test bench_count_big_0100000_hyper ... bench: 7,728 ns/iter (+/- 307) test bench_count_big_1000000_hyper ... bench: 77,853 ns/iter (+/- 5,531) test bench_num_chars_30000_hyper ... bench: 988 ns/iter (+/- 7) test bench_num_chars_big_0100000_hyper ... bench: 4,137 ns/iter (+/- 528) test bench_num_chars_big_1000000_hyper ... bench: 45,211 ns/iter (+/- 6,833) simd x86-64 test bench_count_30000_hyper ... bench: 2,286 ns/iter (+/- 313) test bench_count_big_0100000_hyper ... bench: 7,610 ns/iter (+/- 898) test bench_count_big_1000000_hyper ... bench: 79,711 ns/iter (+/- 5,352) test bench_num_chars_30000_hyper ... bench: 987 ns/iter (+/- 30) test bench_num_chars_big_0100000_hyper ... bench: 3,985 ns/iter (+/- 248) test bench_num_chars_big_1000000_hyper ... bench: 43,328 ns/iter (+/- 1,382) simd pentium error[E0432]: unresolved import `x86::sse2` --> /home/adam/.cargo/registry/src/github.conef.uk-1ecc6299db9ec823/simd-0.2.1/src/common.rs:16:10 | 16 | use x86::sse2::common; | ^^^^ Could not find `sse2` in `x86` error: aborting due to previous error error: Could not compile `simd`. warning: build failed, waiting for other jobs to finish... error: build failed
Even without faster, the "hyper" method of counting is slower for slices with a small size.
Takes us from 30kns -> 18.9kns
Hello! I've managed to simplify this library's code and features and get a pretty sizeable speedup by using
faster
instead ofsimd
. We get a 5.5x speedup in the best case, and a 4% slowdown in the worst case. Additionally, this fixes a compilation error when using--features simd-accel
on a target without SSE2.