# tokenizing
Neural networks don't actually understand the things they work on. They don't know that the letter "a" exists, or the word "dog". All they see are numeric representations of that text, which are commonly produced by a combination of tokenizing and encoding.

Tokenizing is taking that text and splitting it up into repeatable, referenceable units called tokens. The simplest form of this is converting a sentence or word into its constituent characters:
input = "the quick brown fox jumped over the lazy dog."
chars = sorted(list(set(input)))
print(''.join(chars))
# .abcdefghijklmnopqrtuvwxyz
vocab_size = len(chars)
print(vocab_size)
# 27
stoi = dict([c, i] for i,c in enumerate(chars))
print(stoi)
# {' ': 0, '.': 1, 'a': 2, 'b': 3, 'c': 4, 'd': 5, 'e': 6, 'f': 7, 'g': 8, 'h': 9, 'i': 10, 'j': 11, 'k': 12, 'l': 13, 'm': 14, 'n': 15, 'o': 16, 'p': 17, 'q': 18, 'r': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26}
itos = dict([i, c] for i,c in enumerate(chars))
print(itos)
# {0: ' ', 1: '.', 2: 'a', 3: 'b', 4: 'c', 5: 'd', 6: 'e', 7: 'f', 8: 'g', 9: 'h', 10: 'i', 11: 'j', 12: 'k', 13: 'l', 14: 'm', 15: 'n', 16: 'o', 17: 'p', 18: 'q', 19: 'r', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z'}
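With the two lookup tables in place, encoding and decoding is just a matter of mapping back and forth. A minimal sketch using the stoi and itos tables above (the encode/decode helpers are my own names, not from any library):

```python
# Encoder: string -> list of integer ids; decoder: list of ids -> string.
def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return ''.join(itos[i] for i in ids)

tokens = encode("lazy dog.")
print(tokens)          # [13, 2, 26, 25, 0, 5, 16, 8, 1]
print(decode(tokens))  # lazy dog.
```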
# Options
There are many options for how to tokenize your inputs, and I'm still learning exactly what they all are. Examples include SentencePiece and tiktoken, which split text into subword pieces rather than single characters.
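To get a feel for what a subword tokenizer produces, here is a minimal sketch using tiktoken, assuming the package is installed and the pretrained GPT-2 encoding can be loaded; the exact ids you get back depend on the encoding you choose.

```python
import tiktoken

# Load a pretrained byte-pair-encoding (BPE) tokenizer by name.
enc = tiktoken.get_encoding("gpt2")

ids = enc.encode("the quick brown fox jumped over the lazy dog.")
print(ids)              # a short list of integer token ids, roughly one per word for common English
print(enc.decode(ids))  # the quick brown fox jumped over the lazy dog.
```

Notice the trade-off compared to the character-level approach above: the vocabulary is tens of thousands of entries instead of 27, but each sentence becomes far fewer tokens.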