πŸ”Ž Decoding the LLM pipeline – Step 1: Input processing and tokenization

Author(s): EM Karaman

Originally published on Towards AI.


πŸ”Ή From raw text to model input

In my previous post, I broke down the 8-stage LLM pipeline, decoding how large language models (LLMs) work behind the scenes. Now let's zoom in, starting with Step 1: Input processing.

In this post, I will examine exactly how raw text is transformed into the structured numerical inputs an LLM can understand, diving into text cleaning, tokenization methods, numerical encoding, and chat structuring. This step is often overlooked, but it is crucial: the quality of the input encoding directly affects the model's output.

🧩 1. Text cleaning and normalization (raw text β†’ preprocessed text)

Goal: Raw user input β†’ standardized, clean text for accurate tokenization.

πŸ“Œ Why text cleaning and normalization?

  • Raw input text is often messy (typos, casing, punctuation, emoji) β†’ normalization ensures consistency.
  • A necessary preparatory step β†’ it reduces tokenization errors, giving better downstream performance.
  • Normalization trade-off: GPT models preserve formatting and nuance (greater token complexity); BERT aggressively cleans the text β†’ simpler tokens, reduced nuance, ideal for structured tasks (see the sketch below).
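A quick way to see this trade-off is to compare how a GPT-style tokenizer and an uncased BERT tokenizer treat the same string. Below is a minimal sketch using the Hugging Face transformers library; the example string is my own, and the exact pieces may vary by tokenizer version.

from transformers import AutoTokenizer

gpt2_tok = AutoTokenizer.from_pretrained("gpt2")               # byte-level BPE: keeps casing and accents
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece: lowercases and strips accents

text = "CafΓ© ROCKS!"
print(gpt2_tok.tokenize(text))   # casing/accents preserved as byte-level pieces
print(bert_tok.tokenize(text))   # e.g. ['cafe', 'rocks', '!'] – aggressively normalized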

πŸ” Technical details (behind -the -scenes)

  • Unicode normalization (NFKC/NFC) β†’ standardizes characters (e.g., visually identical characters with different code points).
  • Case folding (lowercasing) β†’ reduces vocabulary size, standardizes representation.
  • Whitespace normalization β†’ removes unnecessary spaces, tabs, and line breaks.
  • Punctuation normalization (consistent use of punctuation).
  • Contraction handling (expanding contractions or leaving them intact, depending on the model's requirements). GPT usually keeps contractions; BERT-based models may split them.
  • Special character handling (emoji, accents, punctuation).
import unicodedata
import re

def clean_text(text):
    text = text.lower()                         # Lowercasing
    text = unicodedata.normalize("NFKC", text)  # Unicode normalization
    text = re.sub(r"\s+", " ", text).strip()    # Collapse extra whitespace
    return text

raw_text = "Hello! How’s it going? 😊"
cleaned_text = clean_text(raw_text)
print(cleaned_text)  # hello! how’s it going? 😊

πŸ”‘ 2. Tokenization (preprocessed text β†’ tokens)

Goal: Preprocessed text β†’ tokens (subwords, words, or characters).

Tokenization directly affects the quality and performance of the model.

πŸ“Œ Why tokenization?

  • Models cannot read raw text directly β†’ it must be converted into discrete units (tokens).
  • Tokens are the basic units that neural networks process.

Example: “Interesting” β†’ (“Interest”, “ing”)

πŸ” behind the scenes

Tokenization includes:

  • Mapping text β†’ tokens based on a predefined vocabulary.
  • Whitespace and punctuation handling (e.g., spaces β†’ special markers such as Δ ).
  • Segmenting unknown words into known subwords.
  • Balancing vocabulary size against computational efficiency.
  • Can be deterministic (fixed rules) or probabilistic (adaptive segmentation).

πŸ”Ή Tokenizer types and key differences

βœ… Subword tokenization (BPE, WordPiece, Unigram) is the most common choice in modern LLMs because it balances efficiency and accuracy.

Types of subword tokenizers:

  • Byte Pair Encoding (BPE): iteratively merges frequent character pairs (GPT models).
  • Byte-level BPE: BPE that operates on bytes, enabling better tokenization of non-English text (GPT-4, LLaMA-2/3).
  • WordPiece: optimizes splits based on likelihood over the training corpus (BERT).
  • Unigram: iteratively removes unlikely tokens to arrive at an optimal vocabulary (T5, LLaMA).
  • SentencePiece: operates directly on raw text; whitespace-aware (DeepSeek, multilingual models).
Different tokenizers produce different token splits depending on the algorithm, vocabulary size, and encoding rules.
  • GPT-4 and GPT-3.5 use BPE – a good balance of vocabulary size and performance.
  • BERT uses WordPiece – a more structured approach to subwords; slightly different handling of unknown words.

πŸ“Œ The core tokenizer types are public, but specific AI models may use their own tuned versions (e.g., BPE is the algorithm that decides how to split text, but GPT models use a customized version of BPE). Model-specific tokenizer configurations optimize performance.

# GPT-2 (BPE) Example
from transformers import AutoTokenizer
tokenizer_gpt2 = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer_gpt2.tokenize("Let's learn about LLMs!")
print(tokens)
# ('Let', "'s", 'Δ learn', 'Δ about', 'Δ LL', 'Ms', '!')
# Δ  prefix indicates whitespace preceding token
# OpenAI GPT-4 tokenizer example (via tiktoken library)
import tiktoken
encoding = tiktoken.encoding_for_model("gpt-4")
tokens = encoding.encode("Let's learn about LLMs!")
print(tokens) # Numeric IDs of tokens
print(encoding.decode(tokens)) # Decoded text
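For comparison, here is a small sketch of BERT's WordPiece tokenizer on a sentence of my own choosing: word-internal pieces are marked with a '##' prefix instead of GPT-2's Δ  whitespace marker, and rare words fall back to known subwords (exact splits depend on the tokenizer version).

# BERT (WordPiece) example
from transformers import AutoTokenizer
tokenizer_bert = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer_bert.tokenize("Tokenization helps LLMs!"))
# Continuation subwords carry a '##' prefix, e.g. ['token', '##ization', ...]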

πŸ”’ 3. Numerical encoding (tokens β†’ token IDs)

Goal: Convert tokens into unique numerical identifiers.

  • LLMs do not process text directly – they operate on numbers; tokens are still text units.
  • Every token has a unique integer representation in the model's vocabulary.
  • Token IDs (integers) enable efficient tensor operations and computations in the neural layers.

πŸ” behind the scenes

Vocabulary lookup tables efficiently map tokens β†’ unique integers (token IDs).

  • Vocabulary size defines model constraints (memory use and performance) (e.g., GPT-2: ~50k tokens, GPT-4: ~100k):

β†’ Small vocabulary: fewer parameters, less memory, but more tokens per text.

β†’ Large vocabulary: richer context, higher precision, but increased computational cost.

  • Lookup tables are hash maps: they enable constant-time token-to-ID conversion (O(1) complexity).
  • Special tokens (e.g., [PAD], [CLS]) have reserved IDs β†’ standardized input format.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

tokens = tokenizer.tokenize("LLMs decode text.")
print("Tokens:", tokens) # Tokens: ('LL', 'Ms', 'Δ decode', 'Δ text', '.')

token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Token IDs:", token_ids) # Token IDs: (28614, 12060, 35120, 1499, 13)

πŸ“œ 4. Input formatting for LLMs (token IDs β†’ chat templates)

Goal: Structure tokenized input for conversational models (multi-turn chat).

  • Why: LLMs such as GPT-4, Claude, and LLaMA expect structured input organized by roles (system, user, assistant).
  • How: models use specific formatting and special tokens β†’ this maintains conversational context and roles.

πŸ” behind the scenes

Chat templates ensure:

  • Role identification: clearly separates system instructions, user input, and assistant responses.
  • Context management: maintains multi-turn conversation history β†’ more coherent responses.
  • Structured input: each message is wrapped in special tokens or structured JSON β†’ helps clearly distinguish the inputs.
  • Metadata (optional): may include timestamps, speaker labels, or per-speaker token counts (for advanced models).
Chat template comparison: different styles directly affect how the model interprets context.
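As an illustration, Hugging Face tokenizers for chat-tuned models expose apply_chat_template, which turns a list of role-tagged messages into the model's expected prompt format. A minimal sketch – TinyLlama is just one example of a model that ships a chat template:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain tokenization in one sentence."},
]

# Render the conversation with the model's role markers, ready for generation
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)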

πŸ“ 5. Input coding of the model (structural text β†’ tensory)

Goal: Convert numerical token IDs β†’ structured numerical arrays (tensors) for GPU-based neural computation.

βœ… Why tensors?

  • Neural networks expect numerical arrays (tensors) with uniform dimensions (batch size Γ— sequence length), not plain lists of integers.
  • Token IDs by themselves are discrete integers; tensors add structure and context (padding, masks).
  • Correct padding, truncation, and batching β†’ directly affect model efficiency and output quality.

πŸ” Technical details (behind -the -scenes)

  • Padding: adds special [PAD] tokens to shorter sequences β†’ uniform tensor shapes.
  • Truncation: removes excess tokens from overly long inputs β†’ ensures compatibility with a fixed context window (e.g., GPT-2: 1024 tokens).
  • Attention masks: binary tensors distinguishing real tokens (1) from padding tokens (0) β†’ prevent the model from attending to padding during computation.
  • Batching: combines multiple inputs into batches β†’ optimized parallel computation on the GPU (see the sketch below).
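A minimal sketch of these steps with a Hugging Face tokenizer, batching two sentences of different lengths into PyTorch tensors; GPT-2 ships no padding token, so the EOS token is reused for padding here (the example sentences are mine).

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no [PAD] token; reuse EOS for padding

batch = tokenizer(
    ["LLMs decode text.", "A much longer sentence that needs quite a few more tokens."],
    padding=True,          # pad shorter sequences to a uniform length
    truncation=True,       # cut inputs that exceed the context window
    max_length=1024,       # GPT-2's fixed context size
    return_tensors="pt",   # PyTorch tensors of shape (batch size Γ— sequence length)
)

print(batch["input_ids"].shape)   # e.g. torch.Size([2, 13])
print(batch["attention_mask"])    # 1 = real token, 0 = padding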

πŸ” key results

βœ… Input processing is more than just tokenization – it covers text cleaning, tokenization, numerical encoding, chat structuring, and final model input formatting.

βœ… Tokenizer type shapes the model's trade-offs: BPE (GPT), WordPiece (BERT), Unigram (T5/LLaMA) – the choice affects vocabulary size, speed, and complexity.

βœ… Chat-based models rely on structured formatting (chat templates) β†’ direct impact on coherence, meaning, and conversation flow.

βœ… Token IDs β†’ tensors is the critical final step: it provides the numerical compatibility needed for efficient neural processing.

πŸ“– Next: Step 2 – Neural network processing

Now that we've discussed how raw text becomes structured model input, the next post will break down how the neural network processes this input to generate meaning – covering embedding layers, attention mechanisms, and more.

If you liked this article:

πŸ’» Check out my GitHub for projects on AI/ML, cybersecurity, and Python
πŸ”— Connect with me on LinkedIn to talk about all things AI

πŸ’‘ Thoughts? Questions? Let's talk! πŸš€

Published via Towards AI
