
Adding support for batch input in BERT Tokenizer with perf benchmark #1745

Merged: 4 commits merged into pytorch:main from batch_bert on Jun 1, 2022

Conversation

@parmeet (Contributor) commented May 26, 2022

This PR adds support for batch tokenization, implemented directly at the C++ kernel layer.

It also adds benchmarking code for the tokenizer.
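
For context, here is a minimal sketch of the non-batched and batched tokenizer calls being compared; the vocab path and constructor arguments are illustrative assumptions, not taken from the benchmark script:

    from torchtext.transforms import BERTTokenizer

    # Placeholder vocab file; the actual benchmark may obtain the vocab differently.
    tokenizer = BERTTokenizer(vocab_path="bert_base_uncased_vocab.txt", do_lower_case=True, return_tokens=True)

    # Non-batched: one string in, one list of tokens out (called once per sample).
    tokens = tokenizer("Hello World, How are you!")

    # Batched: a list of strings in, a nested list of tokens out,
    # dispatched to the C++ kernel in a single call.
    batch_tokens = tokenizer(["Hello World", "How are you!"])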

python benchmark/benchmark_bert_tokenizer.py --num-samples <num_samples>

sample/batch size 100
Running TorchText BERT Tokenizer on non-batched input ... Total running time: 0.0014495829999998655
Running HF BERT Tokenizer (slow) on non-batched input ... Total running time: 0.01610425200000032
Running HF BERT Tokenizer (fast) on non-batched input ... Total running time: 0.0037178359999998634
Running TorchText BERT Tokenizer on batched input ... Total running time: 0.0007109650000001189
Running HF BERT Tokenizer (fast) on batched input ... Total running time: 0.0010531449999997555

sample/batch size 1000
Running TorchText BERT Tokenizer on non-batched input ... Total running time: 0.021309904000000213
Running HF BERT Tokenizer (slow) on non-batched input ... Total running time: 0.383976283
Running HF BERT Tokenizer (fast) on non-batched input ... Total running time: 0.07770123400000006
Running TorchText BERT Tokenizer on batched input ... Total running time: 0.01940807400000022
Running HF BERT Tokenizer (fast) on batched input ... Total running time: 0.015243869999999937

Observations

  • ~15x faster on non-batched input compared to the HF slow tokenizer (Python based).
  • ~3x faster on non-batched input compared to the HF fast tokenizer (Rust based).
  • torchtext is faster for batch processing up to a certain batch size, but as the batch size increases, the HF implementation becomes significantly faster (20-25%). This is likely because HF's kernel implementation is faster than torchtext's, while the overhead of calling into HF's bindings is higher than torchtext's.

Performance next steps:

  • Perform analysis on the impact of batch size on batch processing.
  • Identify and close performance gaps in batch processing.
  • Make use of the timeit module in the benchmark code.

@parmeet requested a review from @Nayef211 on May 31, 2022 at 15:33
@Nayef211 (Contributor) left a comment

Left a couple of comments. In regards to the benchmark code, I wonder if we should try a really large value for num_samples so we can see how our batched implementation performs for very small and large batches.

I also wonder if you've looked into using the timeit module, which is meant for profiling and allows us to run a snippet of code several times to more accurately measure execution speed. I noticed that the timeit module has a Timer class built in. Could we potentially use this implementation instead of the one we have in benchmark/utils.py?
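
As a rough illustration of that idea (a sketch only, not the repo's benchmark code), timeit.Timer can run a statement many times over several repetitions:

    import timeit

    # Placeholder setup; the real benchmark constructs the tokenizer and inputs itself.
    SETUP = (
        "from torchtext.transforms import BERTTokenizer; "
        "tokenizer = BERTTokenizer(vocab_path='vocab.txt')"
    )
    timer = timeit.Timer(stmt="tokenizer('Hello World, How are you!')", setup=SETUP)

    # Run the statement 100 times per repetition, over 5 repetitions,
    # and report the best per-call time.
    times = timer.repeat(repeat=5, number=100)
    print(min(times) / 100)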

Comment on lines 602 to 608
"""Encode text into a list of tokens IDs

Args:
text: An input text string.

Returns:
A list of token ids represents each sub-word
Contributor:

Can we update the docstring to reflect that we're operating on a list of strings and the output is a nested list of tokens?

A list of token ids represents each sub-word

For example:
--> "Hello world!" --> token ids: [707, 5927, 11, 707, 68]
Contributor:

Also need to update the example

Contributor Author:

For all the doc-related comments: since these are private functions, I updated the docs to the bare minimum, stating they are the batched versions of _tokenize and _encode :)

Comment on lines 633 to 644
def _batch_tokenize(self, text: List[str]) -> List[List[str]]:
    """Tokenize text into a list of tokens

    Args:
        text: An input text string.

    Returns:
        A list of tokens (sub-words)

    For example:
        --> "Hello World!": ["Hello", "World", "!"]
    """
Contributor:

Update docstring to reflect list input and nested list output
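
For illustration, a batched docstring along the lines requested here might read as follows (hypothetical wording; as noted in the author's reply above, the merged docs were kept to a bare minimum):

    def _batch_tokenize(self, text: List[str]) -> List[List[str]]:
        """Tokenize a batch of texts into lists of tokens

        Args:
            text: A list of input text strings.

        Returns:
            A nested list of tokens (sub-words), one inner list per input string.

        For example:
            --> ["Hello World!"]: [["Hello", "World", "!"]]
        """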

torchtext/csrc/register_pybindings.cpp (review comment resolved)
@parmeet (Contributor, Author) commented Jun 1, 2022

> Left a couple of comments. In regards to the benchmark code, I wonder if we should try a really large value for num_samples so we can see how our batched implementation performs for very small and large batches.

Yupp, that is a follow-up item in the summary :)

> I also wonder if you've looked into using the timeit module, which is meant for profiling and allows us to run a snippet of code several times to more accurately measure execution speed. I noticed that the timeit module has a Timer class built in. Could we potentially use this implementation instead of the one we have in benchmark/utils.py?

hmm, I think it's a good idea. Currently the code in benchmark/utils.py only executes once. We should probably write a new Timer class that will do multiple executions before reporting the average results, as you already suggested. I think this can be a follow-up item as well.

@Nayef211 (Contributor) commented Jun 1, 2022

> Yupp, that is a follow-up item in the summary :)

Oops I missed that.

> hmm, I think it's a good idea. Currently the code in benchmark/utils.py only executes once. We should probably write a new Timer class that will do multiple executions before reporting the average results, as you already suggested. I think this can be a follow-up item as well.

Sgtm. I think my suggestion here is mainly to reuse an existing Timer implementation from the standard Python modules rather than creating our own (unless there is an explicit need to do so).

@parmeet (Contributor, Author) commented Jun 1, 2022

> I think my suggestion here is mainly to reuse an existing Timer implementation from the standard Python modules rather than creating our own (unless there is an explicit need to do so).

Ahh I see, my bad! Ya, I think it's a good idea. Let me see if I can incorporate your suggestion before landing the PR :)

@parmeet (Contributor, Author) commented Jun 1, 2022

> I think my suggestion here is mainly to reuse an existing Timer implementation from the standard Python modules rather than creating our own (unless there is an explicit need to do so).

One of the challenges with timeit is how to pass additional dynamic arguments, like batch size, to it. (Aside: let me land this PR to avoid blocking the release and follow up on the benchmark improvements in a separate PR.)
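
For what it's worth, one common way to handle dynamic arguments with timeit is to close over them with a lambda or functools.partial, since timeit.Timer also accepts a zero-argument callable. A sketch, assuming a hypothetical run_batch helper:

    import functools
    import timeit

    def run_batch(batch_size):
        # Hypothetical helper: build a batch of `batch_size` samples and tokenize it.
        batch = ["Hello World, How are you!"] * batch_size
        return tokenizer(batch)  # `tokenizer` assumed to be defined elsewhere

    for batch_size in (100, 1000):
        t = timeit.Timer(functools.partial(run_batch, batch_size))
        print(batch_size, min(t.repeat(repeat=3, number=10)) / 10)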

@parmeet merged commit 60d9d51 into pytorch:main on Jun 1, 2022
@parmeet deleted the batch_bert branch on June 1, 2022 at 19:38