Byte-pair encoding

Wikipedia

In computing, byte-pair encoding (BPE), or digram coding, is an algorithm, first described in 1994 by Philip Gage, for encoding strings of text into smaller strings by creating and using a translation table. A slightly modified version of the algorithm is used in large language model tokenizers.

The original version of the algorithm focused on compression. It replaces the highest-frequency pair of bytes with a new byte that was not contained in the initial dataset. A lookup table of the replacements is required to rebuild the initial dataset. The modified version builds "tokens" (units of recognition) that match varying amounts of source text, from single characters (including single digits or single punctuation marks) to whole words (even long compound words).
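To make the token-building variant more concrete, the following is a minimal sketch of how merges might be learned from a piece of text. It assumes the simplest possible setup: the text starts out as a list of single characters, and the most frequent adjacent pair is merged into one longer token on each round. The function names `learn_merges` and `merge_pair` and the parameter `num_merges` are hypothetical, and practical tokenizers add details that this sketch omits.

```python
from collections import Counter


def merge_pair(tokens: list[str], left: str, right: str) -> list[str]:
    """Replace every adjacent (left, right) occurrence with one joined token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(text: str, num_merges: int) -> tuple[list[tuple[str, str]], list[str]]:
    """Learn up to num_merges merge rules, starting from single characters."""
    tokens = list(text)  # assumption: initial tokens are single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break  # fewer than two tokens left, nothing to merge
        (left, right), _count = pairs.most_common(1)[0]
        merges.append((left, right))
        tokens = merge_pair(tokens, left, right)
    return merges, tokens
```

Unlike the compression variant, no unused placeholder byte is needed here: each merge simply creates a longer token, so the number of merges controls how large the vocabulary grows.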

The original BPE algorithm operates by iteratively replacing the most common pair of adjacent bytes in a target text with an unused 'placeholder' byte. The iteration ends when no pair occurs more than once (or no unused byte remains), leaving the target text compressed. Decompression reverses this process: each placeholder is looked up in a table and replaced by the pair it denotes. In the original paper, this lookup table is encoded and stored alongside the compressed text.
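The loop described above can be sketched in a few lines of Python. This is an illustrative sketch, not Gage's original implementation: the names `compress` and `decompress` are hypothetical, the table is returned as a separate Python list rather than encoded alongside the output, and ties between equally frequent pairs are broken arbitrarily.

```python
from collections import Counter


def compress(data: bytes) -> tuple[bytes, list[tuple[int, bytes]]]:
    """Repeatedly replace the most frequent adjacent byte pair with an unused byte."""
    # Byte values that never occur in the input can serve as placeholders.
    unused = [b for b in range(256) if b not in data]
    table = []  # (placeholder, pair it stands for), in order of creation
    while unused:
        pairs = Counter(zip(data, data[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair occurs more than once
        placeholder = unused.pop(0)
        pair = bytes([a, b])
        data = data.replace(pair, bytes([placeholder]))
        table.append((placeholder, pair))
    return data, table


def decompress(data: bytes, table: list[tuple[int, bytes]]) -> bytes:
    """Undo the replacements in reverse order using the lookup table."""
    for placeholder, pair in reversed(table):
        data = data.replace(bytes([placeholder]), pair)
    return data
```

Undoing the replacements in reverse order matters because later placeholders may stand for pairs that themselves contain earlier placeholders.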

Suppose the data to be encoded is:

    aaabdaaabac

The byte pair "aa" occurs most often, so it will be replaced by a byte that is not used in the data, such as "Z". Now there is the following data and replacement table:

    ZabdZabac
    Z=aa

Then the process is repeated with byte pair "ab", replacing it with "Y":

    ZYdZYac
    Y=ab
    Z=aa

The only literal byte pair left occurs only once, and the encoding might stop here. Alternatively, the process could continue with recursive byte-pair encoding, replacing "ZY" with "X":

    XdXac
    X=ZY
    Y=ab
    Z=aa

This data cannot be compressed further by byte-pair encoding because there are no pairs of bytes that occur more than once.
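The walkthrough can be replayed mechanically. The short sketch below takes the sample string `aaabdaaabac` as its input (an assumption of this sketch, chosen to be consistent with the replacements described), applies the three replacements in order, checks that no adjacent pair repeats in the result, and then reverses the steps to recover the input. The variable names are illustrative only.

```python
from collections import Counter

data = "aaabdaaabac"  # assumed sample input, consistent with the steps above
steps = [("aa", "Z"), ("ab", "Y"), ("ZY", "X")]  # (pair, placeholder)

history = [data]
for pair, placeholder in steps:
    # A placeholder must not already occur in the data it is inserted into.
    assert placeholder not in data
    data = data.replace(pair, placeholder)
    history.append(data)

# Adjacent pairs that still occur more than once (none are expected).
repeated = [p for p, n in Counter(zip(data, data[1:])).items() if n > 1]

# Decoding applies the table in reverse order.
restored = data
for pair, placeholder in reversed(steps):
    restored = restored.replace(placeholder, pair)

print(history)   # ['aaabdaaabac', 'ZabdZabac', 'ZYdZYac', 'XdXac']
print(repeated)  # []
print(restored)  # aaabdaaabac
```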
