Another Implementation (faster and more efficient) of the BPE Training Algorithm #1400

Earlier this year, I wrote a new implementation of the BPE training algorithm in pure Python. In the benchmark below it is faster than the version in Tokenizer.

I hope this implementation can help tokenizers further improve its BPE training performance.

I have written a [blog post](http://74.48.25.55/post/2) in Chinese about this implementation, and I will try to translate it into English if there is a need. The code is also quite short, at only about 400 lines.

Here is the code: https://github.com/Yikai-Liao/efficient_bpe


| Implementation | User time | System time | Total time | CPU |
| -- | -- | -- | -- | -- |
| My version (Single Thread) | 2.70s | 0.05s | 2.761s | 99% |
| Tokenizer (Single Thread) | 5.51s | 1.60s | 5.411s | 131% |
| Tokenizer (Multi Threads) | 8.51s | 3.52s | 2.849s | 422% |
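
To make the table easier to read, here is a small sketch of how its columns relate. It assumes the figures come from a shell `time` builtin (the layout looks like zsh output), where the CPU percentage is (user + system) / total. That is an assumption, not something stated above, and the helper name `cpu_percent` is made up for illustration.

```python
# Timings copied from the table above: (user, system, total) in seconds.
timings = {
    "My version (Single Thread)": (2.70, 0.05, 2.761),
    "Tokenizer (Single Thread)": (5.51, 1.60, 5.411),
    "Tokenizer (Multi Threads)": (8.51, 3.52, 2.849),
}


def cpu_percent(user, system, total):
    """CPU utilisation as (user + system) / wall-clock time, in percent.

    Assumption: this is how the 'CPU' column was derived.
    """
    return 100.0 * (user + system) / total


# Wall-clock time of the pure-Python run, used as the comparison baseline.
baseline_total = timings["My version (Single Thread)"][2]

for name, (user, system, total) in timings.items():
    print(
        f"{name}: cpu={cpu_percent(user, system, total):.1f}% "
        f"cpu_seconds={user + system:.2f} "
        f"wall_ratio={total / baseline_total:.2f}x"
    )
```

Read this way, the multi-threaded Tokenizer run finishes in about the same wall-clock time as the single-threaded pure-Python run, but spends 12.03 s of CPU time (user + system) compared with 2.75 s.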
