Tokenization
Tokenization is the process of converting raw data into useful strings of data. Although tokenization is commonly known for its use in cybersecurity and in the construction of NFTs, it is also a significant part of NLP. In natural language processing, tokenization is applied to break paragraphs and sentences into smaller units that can more easily be assigned meaning [Tokenex].
Tokenization splits raw text into words, phrases, or sentences called tokens. These tokens help establish context and serve as the building blocks of NLP models. The aim is to capture the text's meaning by analyzing the sequence of tokens.
For instance, the text “It is getting dark” can be tokenized into ‘It,’ ‘is,’ ‘getting,’ and ‘dark.’ Tokenization can be performed at the word level or at the sentence level: breaking text into words with a particular separation technique is called word tokenization, and the same kind of separation applied to sentences is called sentence tokenization.
There are different tokenization techniques:
White Space Tokenization.
In this tokenization technique, a sentence or paragraph is split into words by breaking the input wherever whitespace is encountered, as the snippet below shows.
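A minimal illustration in plain Python, using no external libraries (the sample sentence is ours):

text = "It is getting dark"
# str.split() with no arguments splits on any run of whitespace
tokens = text.split()
print(tokens)  # ['It', 'is', 'getting', 'dark']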
Dictionary Based Tokenization.
In this tokenization technique, tokens are identified by matching the input against tokens that already exist in a dictionary.
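As a toy sketch of the idea (the greedy longest-match strategy and the tiny dictionary here are illustrative assumptions, not a standard implementation):

def dict_tokenize(text, vocab):
    # Greedy longest match: at each position, take the longest dictionary
    # entry that fits; fall back to a single character if nothing matches.
    tokens, i = [], 0
    while i < len(text):
        match = None
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                match = text[i:j]
                break
        if match is None:
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens

print(dict_tokenize("gettingdark", {"getting", "dark", "get"}))
# ['getting', 'dark']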
Rule Based Tokenization.
In this tokenization technique, rules are defined for a particular problem (for example, grammar rules), and tokenization is then carried out according to those rules.
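A sketch of one possible rule, peeling punctuation off whitespace-separated chunks (the rule, the regular expression, and the sample text are all illustrative assumptions):

import re

def rule_tokenize(text):
    tokens = []
    for chunk in text.split():
        # Rule: split leading and trailing punctuation into separate tokens.
        lead, core, trail = re.match(r"^(\W*)(.*?)(\W*)$", chunk).groups()
        for part in (lead, core, trail):
            if part:
                tokens.append(part)
    return tokens

print(rule_tokenize("Wait, it is getting dark!"))
# ['Wait', ',', 'it', 'is', 'getting', 'dark', '!']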
Regular Expression Tokenizer.
This tokenization technique applies a regular expression to convert text into tokens. A regular expression can be simple, but it can also become difficult to read and maintain.
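For example, with Python's re module (the pattern below, which keeps runs of word characters and drops punctuation, is just one possible choice):

import re

text = "It is getting dark, isn't it?"
# \w+ keeps runs of letters, digits, and underscores; punctuation is dropped
tokens = re.findall(r"\w+", text)
print(tokens)  # ['It', 'is', 'getting', 'dark', 'isn', 't', 'it']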
Penn TreeBank Tokenization.
This tokenization technique separates punctuation, clitics (for example, the ’ll in they’ll), and hyphenated words into distinct tokens.
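NLTK ships an implementation of this scheme; a short sketch (the sample sentence is ours, and the output is shown as we would expect it):

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize("They'll save and invest more."))
# ['They', "'ll", 'save', 'and', 'invest', 'more', '.']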
spaCy Tokenizer.
This tokenization technique offers the flexibility to identify special tokens that should not be segmented, or that should be segmented using particular rules.
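A minimal sketch using a blank English pipeline (spacy.blank avoids downloading a statistical model; the sample text is ours and the output is what we would expect, worth verifying locally):

import spacy

nlp = spacy.blank("en")  # tokenizer-only pipeline with English tokenizer rules
doc = nlp("Don't segment N.Y. wrongly.")
print([token.text for token in doc])
# Expected along the lines of: ['Do', "n't", 'segment', 'N.Y.', 'wrongly', '.']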
Moses Tokenizer.
This tokenizer bundles a set of fairly complex normalization and segmentation rules that work very well for structured languages (for example, English).
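In Python, the Moses tokenizer is available through the sacremoses package (a community port; the sample sentence is ours, and the exact token split is shown as we would expect it):

from sacremoses import MosesTokenizer

mt = MosesTokenizer(lang="en")
# escape=False keeps apostrophes as-is instead of XML-escaping them
print(mt.tokenize("It's getting dark, isn't it?", escape=False))
# Expected along the lines of: ['It', "'s", 'getting', 'dark', ',', 'isn', "'t", 'it', '?']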
Subword Tokenization.
In this technique, the most frequently used words are given unique IDs, while less frequent words are split into subwords that better represent their meaning (the BPE sketch below illustrates one way to do this).
Byte-Pair Encoding (BPE).
Starting from individual characters, BPE iteratively merges the most frequent pair of adjacent symbols into a new symbol. As a result, frequently used words end up represented by few symbols (or embeddings), while rarely used words are represented by more symbols.
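A toy sketch of the core merge loop (the corpus and the number of merges are illustrative; real implementations track symbol boundaries more carefully):

from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Naive string replace; fine for this toy corpus.
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

# Toy corpus: words pre-split into characters plus an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(3):  # learn 3 merges
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(best)  # ('e', 's'), then ('es', 't'), then ('est', '</w>')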
WordPiece.
WordPiece is similar to BPE. The difference is the merge criterion: BPE merges the pair of symbols that occurs most frequently, whereas WordPiece also takes the frequencies of the individual symbols into account, merging the pair whose count is highest relative to the counts of its parts.
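A hedged sketch of the scoring difference (the counts below are made-up numbers for illustration):

# BPE picks the pair with the highest raw count:
#   score_bpe(a, b) = count(a, b)
# WordPiece normalizes by the counts of the individual symbols:
#   score_wordpiece(a, b) = count(a, b) / (count(a) * count(b))

def wordpiece_score(pair_count, count_a, count_b):
    return pair_count / (count_a * count_b)

# A frequent pair made of very frequent symbols can lose to a rarer,
# more "surprising" pair:
print(wordpiece_score(pair_count=9, count_a=17, count_b=9))  # ~0.059
print(wordpiece_score(pair_count=3, count_a=3, count_b=3))   # ~0.333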
Tokenization with NLTK.
NLTK (Natural Language Toolkit) is a Python library, originally developed at the University of Pennsylvania, that aids in NLP.
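A minimal usage sketch (the sample text is ours; the Punkt models must be downloaded once, and newer NLTK versions may ask for "punkt_tab" instead):

import nltk
nltk.download("punkt", quiet=True)  # pretrained sentence tokenizer models

from nltk.tokenize import word_tokenize, sent_tokenize

text = "It is getting dark. We should head home."
print(sent_tokenize(text))  # ['It is getting dark.', 'We should head home.']
print(word_tokenize(text))  # ['It', 'is', 'getting', 'dark', '.', 'We', ...]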
Tokenization with Textblob.
TextBlob is a Python library for processing textual data. It is used for sentiment analysis, part-of-speech tagging, classification, translation, and more.
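A short sketch (TextBlob builds on NLTK, so the Punkt models must be available; the sample text is ours):

from textblob import TextBlob

blob = TextBlob("It is getting dark. We should head home.")
print(blob.words)      # word tokens, with punctuation dropped
print(blob.sentences)  # [Sentence("It is getting dark."), Sentence("We should head home.")]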
Tokenization with Gensim.
Gensim, a library best known for topic modeling, also provides utility functions for tokenization.
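For example, gensim.utils.tokenize yields alphabetic tokens, optionally lowercased (the sample text is ours):

from gensim.utils import tokenize

text = "It is getting dark, isn't it?"
# tokenize() returns a generator of alphabetic tokens
print(list(tokenize(text, lowercase=True)))
# ['it', 'is', 'getting', 'dark', 'isn', 't', 'it']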
Tokenization with Keras.
In this technique, tokenization is done with the text-preprocessing utilities of the Keras deep learning library [TowardsDataScience].
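A minimal sketch with the (now legacy) Keras preprocessing API (the toy corpus is ours; the exact word indices depend on frequency ordering):

from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["It is getting dark", "It is getting late"]
tok = Tokenizer()
tok.fit_on_texts(texts)               # build the word index from the corpus
print(tok.word_index)                 # e.g. {'it': 1, 'is': 2, 'getting': 3, 'dark': 4, 'late': 5}
print(tok.texts_to_sequences(texts))  # e.g. [[1, 2, 3, 4], [1, 2, 3, 5]]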
References
[TowardsDataScience] Tokenization for Natural Language Processing. Towards Data Science. Retrieved from: https://towardsdatascience.com/tokenization-for-natural-language-processing-a179a891bad4
[Tokenex] What is NLP (Natural Language Processing) Tokenization? Tokenex. Retrieved from: https://www.tokenex.com/blog/ab-what-is-nlp-natural-language-processing-tokenization