
Adding support for batch input in BERT Tokenizer with perf benchmark #1745

Merged: 4 commits merged into pytorch:main from batch_bert on Jun 1, 2022

Conversation

@parmeet (Contributor) commented May 26, 2022

This PR adds support for batch tokenization, implemented directly at the C++ kernel layer.

It also adds benchmarking code for the tokenizer.
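
For context, here is a minimal sketch of the non-batched and batched tokenizer calls being compared; the vocab path and constructor arguments are illustrative assumptions, not taken from the benchmark script:

    from torchtext.transforms import BERTTokenizer

    # Placeholder vocab file; the actual benchmark may obtain the vocab differently.
    tokenizer = BERTTokenizer(vocab_path="bert_base_uncased_vocab.txt", do_lower_case=True, return_tokens=True)

    # Non-batched: one string in, one list of tokens out (called once per sample).
    tokens = tokenizer("Hello World, How are you!")

    # Batched: a list of strings in, a nested list of tokens out,
    # dispatched to the C++ kernel in a single call.
    batch_tokens = tokenizer(["Hello World", "How are you!"])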

python benchmark/benchmark_bert_tokenizer.py --num-samples <num_samples>

sample/batch size 100
Running TorchText BERT Tokenizer on non-batched input ... Total running time: 0.0014495829999998655
Running HF BERT Tokenizer (slow) on non-batched input ... Total running time: 0.01610425200000032
Running HF BERT Tokenizer (fast) on non-batched input ... Total running time: 0.0037178359999998634
Running TorchText BERT Tokenizer on batched input ... Total running time: 0.0007109650000001189
Running HF BERT Tokenizer (fast) on batched input ... Total running time: 0.0010531449999997555

sample/batch size 1000
Running TorchText BERT Tokenizer on non-batched input ... Total running time: 0.021309904000000213
Running HF BERT Tokenizer (slow) on non-batched input ... Total running time: 0.383976283
Running HF BERT Tokenizer (fast) on non-batched input ... Total running time: 0.07770123400000006
Running TorchText BERT Tokenizer on batched input ... Total running time: 0.01940807400000022
Running HF BERT Tokenizer (fast) on batched input ... Total running time: 0.015243869999999937

Observations

  • ~15x faster on non-batched input compared to the HF slow tokenizer (Python based).
  • ~3x faster on non-batched input compared to the HF fast tokenizer (Rust based).
  • torchtext is faster for batch processing up to a certain batch size, but as the batch size increases, the HF implementation becomes significantly faster (20-25%). This is likely because HF's kernel implementation is faster than torchtext's, while the overhead of calling into HF's bindings is higher than torchtext's.

Performance next steps:

  • Perform analysis on the impact of batch size on batch processing.
  • Identify and close performance gaps in batch processing.
  • Make use of the timeit module in the benchmark code.

@parmeet requested a review from @Nayef211 on May 31, 2022 at 15:33
@Nayef211 (Contributor) left a comment

Left a couple of comments. In regards to the benchmark code, I wonder if we should try a really large value for num_samples so we can see how our batched implementation performs for very small and large batches.

I also wonder if you've looked into using the timeit module, which is meant for profiling and allows us to run a snippet of code several times to more accurately measure execution speed. I noticed that the timeit module has a Timer class built in. Could we potentially use this implementation instead of the one we have in benchmark/utils.py?
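
As a rough illustration of that idea (a sketch only, not the repo's benchmark code), timeit.Timer can run a statement many times over several repetitions:

    import timeit

    # Placeholder setup; the real benchmark constructs the tokenizer and inputs itself.
    SETUP = (
        "from torchtext.transforms import BERTTokenizer; "
        "tokenizer = BERTTokenizer(vocab_path='vocab.txt')"
    )
    timer = timeit.Timer(stmt="tokenizer('Hello World, How are you!')", setup=SETUP)

    # Run the statement 100 times per repetition, over 5 repetitions,
    # and report the best per-call time.
    times = timer.repeat(repeat=5, number=100)
    print(min(times) / 100)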

Comment on lines 602 to 608
"""Encode text into a list of tokens IDs

Args:
text: An input text string.

Returns:
A list of token ids represents each sub-word
Contributor:

Can we update the docstring to reflect that we're operating on a list of strings and the output is a nested list of tokens?

A list of token ids represents each sub-word

For example:
--> "Hello world!" --> token ids: [707, 5927, 11, 707, 68]
Contributor:

Also need to update the example

Contributor Author:

For all the doc-related comments: since these are private functions, I updated the docs to the bare minimum, stating they are the batched versions of _tokenize and _encode :)

Comment on lines 633 to 644
def _batch_tokenize(self, text: List[str]) -> List[List[str]]:
    """Tokenize text into a list of tokens

    Args:
        text: An input text string.

    Returns:
        A list of tokens (sub-words)

    For example:
        --> "Hello World!": ["Hello", "World", "!"]
    """
Contributor:

Update docstring to reflect list input and nested list output
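
For illustration, a batched docstring along the lines requested here might read as follows (hypothetical wording; as noted in the author's reply above, the merged docs were kept to a bare minimum):

    def _batch_tokenize(self, text: List[str]) -> List[List[str]]:
        """Tokenize a batch of texts into lists of tokens

        Args:
            text: A list of input text strings.

        Returns:
            A nested list of tokens (sub-words), one inner list per input string.

        For example:
            --> ["Hello World!"]: [["Hello", "World", "!"]]
        """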

torchtext/csrc/register_pybindings.cpp (review comment resolved)
@parmeet (Contributor, Author) commented Jun 1, 2022

> Left a couple of comments. In regards to the benchmark code, I wonder if we should try a really large value for num_samples so we can see how our batched implementation performs for very small and large batches.

Yupp, that is a follow-up item in the summary :)

> I also wonder if you've looked into using the timeit module, which is meant for profiling and allows us to run a snippet of code several times to more accurately measure execution speed. I noticed that the timeit module has a Timer class built in. Could we potentially use this implementation instead of the one we have in benchmark/utils.py?

hmm, I think it's a good idea. Currently the code in benchmark/utils.py only executes once. We should probably write a new Timer class that will do multiple executions before reporting the average results, as you already suggested. I think this can be a follow-up item as well.

@Nayef211 (Contributor) commented Jun 1, 2022

> Yupp, that is a follow-up item in the summary :)

Oops I missed that.

> hmm, I think it's a good idea. Currently the code in benchmark/utils.py only executes once. We should probably write a new Timer class that will do multiple executions before reporting the average results, as you already suggested. I think this can be a follow-up item as well.

Sgtm. I think my suggestion here is mainly to reuse an existing Timer implementation from the standard Python modules rather than creating our own (unless there is an explicit need to do so).

@parmeet (Contributor, Author) commented Jun 1, 2022

> I think my suggestion here is mainly to reuse an existing Timer implementation from the standard Python modules rather than creating our own (unless there is an explicit need to do so).

Ahh I see, my bad! Ya, I think it's a good idea. Let me see if I can incorporate your suggestion before landing the PR :)

@parmeet (Contributor, Author) commented Jun 1, 2022

> I think my suggestion here is mainly to reuse an existing Timer implementation from the standard Python modules rather than creating our own (unless there is an explicit need to do so).

One of the challenges with timeit is how to pass additional dynamic arguments, like batch size, to it. (Aside: let me land this PR to avoid blocking the release and follow up on the benchmark improvements in a separate PR.)
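
For what it's worth, one common way to handle dynamic arguments with timeit is to close over them with a lambda or functools.partial, since timeit.Timer also accepts a zero-argument callable. A sketch, assuming a hypothetical run_batch helper:

    import functools
    import timeit

    def run_batch(batch_size):
        # Hypothetical helper: build a batch of `batch_size` samples and tokenize it.
        batch = ["Hello World, How are you!"] * batch_size
        return tokenizer(batch)  # `tokenizer` assumed to be defined elsewhere

    for batch_size in (100, 1000):
        t = timeit.Timer(functools.partial(run_batch, batch_size))
        print(batch_size, min(t.repeat(repeat=3, number=10)) / 10)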

@parmeet merged commit 60d9d51 into pytorch:main on Jun 1, 2022
@parmeet deleted the batch_bert branch on June 1, 2022 at 19:38