Adding support for batch input in BERT Tokenizer with perf benchmark #1745
Conversation
Left a couple of comments. Regarding the benchmark code, I wonder if we should try a really large value for num_samples so we can see how our batched implementation performs for very small and very large batches.
I also wonder if you've looked into the timeit module, which is meant for profiling and allows us to run a snippet of code several times to measure execution speed more accurately. I noticed that the timeit module has a Timer class built in. Could we potentially use that implementation instead of the one we have in benchmark/utils.py?
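For reference, a minimal sketch of that suggestion, assuming a zero-argument callable wraps the tokenizer call; the `tokenizer` and `batch` names below are placeholders, not the PR's benchmark code:

```python
import timeit

def benchmark(fn, repeat=5, number=10):
    # timeit.Timer accepts a zero-argument callable; repeat() returns one
    # total time per repetition, each covering `number` calls of fn().
    totals = timeit.Timer(fn).repeat(repeat=repeat, number=number)
    return sum(totals) / (repeat * number)  # average per-call time

# Hypothetical usage with placeholder names:
# avg = benchmark(lambda: tokenizer(batch))
# print(f"average per-call time: {avg:.6f}s")
```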
torchtext/transforms.py
Outdated
"""Encode text into a list of tokens IDs | ||
|
||
Args: | ||
text: An input text string. | ||
|
||
Returns: | ||
A list of token ids represents each sub-word |
Can we update the docstring to reflect that we're operating on a list of strings and that the output is a nested list of tokens?
torchtext/transforms.py
Outdated
A list of token ids represents each sub-word

For example:
    --> "Hello world!" --> token ids: [707, 5927, 11, 707, 68]
Also need to update the example
For all the doc-related comments: since these are private functions, I updated the docs to the bare minimum, stating that they are the batched versions of _tokenize and _encode :)
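(For illustration only, a sketch of what such minimal docstrings could look like; the class stub and the _batch_encode return annotation are assumptions, not the merged code.)

```python
from typing import List

class BERTTokenizer:  # stand-in for the actual tokenizer class in transforms.py
    def _batch_tokenize(self, text: List[str]) -> List[List[str]]:
        """Batched version of _tokenize."""
        ...

    def _batch_encode(self, text: List[str]) -> List[List[int]]:  # return type illustrative
        """Batched version of _encode."""
        ...
```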
torchtext/transforms.py
Outdated
def _batch_tokenize(self, text: List[str]) -> List[List[str]]:
    """Tokenize text into a list of tokens

    Args:
        text: An input text string.

    Returns:
        A list of tokens (sub-words)

    For example:
        --> "Hello World!": ["Hello", "World", "!"]
    """
Update docstring to reflect list input and nested list output
Yup, that is a follow-up item in the summary :)
Hmm, I think it's a good idea. Currently the code in benchmark/utils.py only executes once. We should probably write a new Timer class that will do multiple executions before reporting the average results, as you already suggested. I think this can be a follow-up item as well.
Oops I missed that.
Sgtm. I think my suggestion here is mainly to reuse the existing implementation of timeit.Timer.
Ahh I see, my bad! Ya, I think it's a good idea. Let me see if I can incorporate your suggestion before landing the PR :)
One of the challenges with timeit is figuring out how to pass additional dynamic arguments, like batch size, to it. (Aside: let me close this PR to avoid blocking the release and follow up on the benchmark improvement in a separate PR.)
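One common pattern, sketched here with placeholder names rather than the PR's code, is to bind those arguments up front with functools.partial or a lambda, since timeit.Timer only takes a zero-argument callable:

```python
import timeit
from functools import partial

def run_tokenizer(tokenizer, samples, batch_size):
    # Hypothetical helper: tokenize the first `batch_size` samples in one call.
    return tokenizer(samples[:batch_size])

# Hypothetical usage; `tokenizer` and `samples` are placeholders:
# for batch_size in (100, 1000):
#     timer = timeit.Timer(partial(run_tokenizer, tokenizer, samples, batch_size))
#     print(batch_size, timer.timeit(number=10))
```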
This PR adds support for a batched tokenizer, implemented directly at the C++ kernel layer.
It also adds benchmarking code for the tokenizer.
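For context, a rough sketch of the batched call from Python; this assumes the BERTTokenizer transform in torchtext.transforms and a local vocab file, and the argument values are illustrative rather than taken from this PR:

```python
from torchtext.transforms import BERTTokenizer

# Illustrative setup; the vocab path and constructor arguments may differ.
tokenizer = BERTTokenizer(vocab_path="bert_vocab.txt", return_tokens=True)

tokens = tokenizer("Hello World!")                     # single string -> list of tokens
batched = tokenizer(["Hello World!", "How are you?"])  # list of strings -> nested list
```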
sample/batch size 100
Running TorchText BERT Tokenizer on non-batched input ... Total running time: 0.0014495829999998655
Running HF BERT Tokenizer (slow) on non-batched input ... Total running time: 0.01610425200000032
Running HF BERT Tokenizer (fast) on non-batched input ... Total running time: 0.0037178359999998634
Running TorchText BERT Tokenizer on batched input ... Total running time: 0.0007109650000001189
Running HF BERT Tokenizer (fast) on batched input ... Total running time: 0.0010531449999997555
sample/batch size 1000
Running TorchText BERT Tokenizer on non-batched input ... Total running time: 0.021309904000000213
Running HF BERT Tokenizer (slow) on non-batched input ... Total running time: 0.383976283
Running HF BERT Tokenizer (fast) on non-batched input ... Total running time: 0.07770123400000006
Running TorchText BERT Tokenizer on batched input ... Total running time: 0.01940807400000022
Running HF BERT Tokenizer (fast) on batched input ... Total running time: 0.015243869999999937
Observations
Performance next steps: