A
Autoregressive
An autoregressive model generates text token by token, where each next token is predicted based on all previously generated tokens. Many LLMs (e.g., GPT models) use this approach.
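A minimal, toy illustration of the idea: each step conditions on what has been generated so far, picks the next token, and feeds it back in. The bigram table below is invented purely for demonstration; a real LLM conditions on the entire context, not just the last token.

```python
# Toy autoregressive generation: choose the next token from a distribution
# conditioned on the tokens generated so far, append it, and repeat.
# The bigram lookup table is a made-up stand-in for a real model.
bigram_model = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, "<eos>": 0.1},
    "dog": {"sat": 0.7, "<eos>": 0.3},
    "sat": {"<eos>": 1.0},
}

def generate(prompt, max_new_tokens=10):
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        next_probs = bigram_model.get(tokens[-1], {"<eos>": 1.0})
        next_token = max(next_probs, key=next_probs.get)   # greedy decoding
        if next_token == "<eos>":
            break
        tokens.append(next_token)                          # fed back in as context
    return " ".join(tokens)

print(generate("the"))  # -> "the cat sat"
```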
Attention Mechanism
Attention allows a model to focus on different parts of the input when predicting each token, enabling it to learn long-range dependencies more effectively than traditional recurrent neural networks.
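The core computation is scaled dot-product attention. A small NumPy sketch follows; the shapes (4 tokens, 8 dimensions) are made up for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V: each output is a weighted mix of the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # how strongly each query attends to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over positions
    return weights @ V                                 # weighted sum of the values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))  # 4 tokens, 8 dims each
print(scaled_dot_product_attention(Q, K, V).shape)     # (4, 8)
```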
E
Embeddings
Embeddings are numeric vector representations of words, tokens, or sentences. They capture semantic or contextual meaning such that words with similar meanings have similar embeddings.
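Cosine similarity between embedding vectors is a common way to measure how close two meanings are. The three-dimensional vectors below are invented for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import numpy as np

# Toy embeddings: semantically similar words get nearby vectors (values made up).
embeddings = {
    "king":  np.array([0.80, 0.65, 0.10]),
    "queen": np.array([0.78, 0.70, 0.12]),
    "apple": np.array([0.10, 0.05, 0.90]),
}

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # close to 1.0
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # much lower
```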
Epoch
An epoch is a full pass through the entire training dataset. After one epoch, the model has seen every training sample at least once.
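A minimal sketch of what "one epoch" means inside a training loop, using a small synthetic dataset; the model, sizes, and hyperparameters are all illustrative.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic data: 100 samples with 4 features each (values made up).
X, y = torch.randn(100, 4), torch.randn(100, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=10, shuffle=True)

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

num_epochs = 3
for epoch in range(num_epochs):              # one epoch = every sample seen once
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch + 1}/{num_epochs} done")
```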
F
Fine-Tuning
Fine-tuning is the process of taking a pre-trained model (e.g., GPT, BERT) and continuing its training on a specific dataset or task. During fine-tuning, the model’s weights are adjusted to improve performance on that task.
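A hedged PyTorch-style sketch of the workflow: load previously trained weights, then continue training on task-specific data, typically with a small learning rate. The filename, tiny model, and data below are placeholders, not a specific recipe.

```python
import torch
import torch.nn as nn

# Start from pre-trained weights ("pretrained.pt" is a hypothetical file).
model = nn.Linear(4, 1)
model.load_state_dict(torch.load("pretrained.pt"))

# Small learning rate: adjust the existing weights rather than overwrite them.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss_fn = nn.MSELoss()

# Task-specific dataset (synthetic here).
task_X, task_y = torch.randn(50, 4), torch.randn(50, 1)
for _ in range(3):                      # a few passes of further training
    optimizer.zero_grad()
    loss = loss_fn(model(task_X), task_y)
    loss.backward()                     # the weights shift toward the new task
    optimizer.step()
```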
G
GGML / GGUF
GGML is a tensor library (with an associated file format) designed for efficient inference of LLMs on CPUs (and sometimes GPUs); GGUF is its successor file format. These formats often include built-in quantization to reduce model size.
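For example, the llama-cpp-python bindings can load and run a GGUF model on a CPU. This assumes the package is installed and that a quantized GGUF file exists at the path shown, which is purely hypothetical.

```python
from llama_cpp import Llama

# Load a quantized GGUF model (path and filename are illustrative).
llm = Llama(model_path="models/llama-2-7b.Q4_K_M.gguf")

out = llm("Q: What is a token? A:", max_tokens=32)
print(out["choices"][0]["text"])
```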
GPU (Graphics Processing Unit)
A GPU accelerates large-scale matrix operations, which are fundamental to training and inference in deep learning models.
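A PyTorch sketch of moving a matrix multiplication onto a GPU when one is available; the matrix sizes are arbitrary.

```python
import torch

# Matrix multiplication, the workhorse of deep learning, placed on a GPU if present.
device = "cuda" if torch.cuda.is_available() else "cpu"

a = torch.randn(2048, 2048, device=device)
b = torch.randn(2048, 2048, device=device)
c = a @ b                 # runs on the GPU when device == "cuda"
print(c.device)
```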
I
Inference
Inference is the process of using a trained model to make predictions or generate output given new input data. For an LLM chatbot, inference means producing text responses for user queries.
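For example, with the Hugging Face transformers library; "gpt2" is used here only because it is a small, widely available model, not because it is the chatbot's actual backend.

```python
from transformers import pipeline

# Inference: no weights are updated, the trained model only produces output.
generator = pipeline("text-generation", model="gpt2")
result = generator("The capital of France is", max_new_tokens=10)
print(result[0]["generated_text"])
```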
L
Large Language Model (LLM)
An LLM is a model trained on vast amounts of text data, capable of tasks like text generation, summarization, and translation, among many others.
LoRA (Low-Rank Adaptation)
LoRA is a fine-tuning technique that freezes the pre-trained weights and trains small low-rank update matrices added to selected weight matrices, significantly reducing the computational cost and memory needed to adapt large models.
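A sketch of the core idea, with illustrative dimensions and scaling: the frozen weight W is augmented by a trainable low-rank product B·A, so only r·(d_in + d_out) extra values are learned.

```python
import torch

# LoRA in miniature: freeze W, train only the low-rank factors A and B.
# Dimensions, rank r, and the scaling factor alpha are illustrative.
d_out, d_in, r, alpha = 768, 768, 8, 16

W = torch.randn(d_out, d_in)                   # pretrained weight, kept frozen
A = torch.randn(r, d_in, requires_grad=True)   # low-rank factor (trainable)
B = torch.zeros(d_out, r, requires_grad=True)  # starts at zero, so the update begins as a no-op

def lora_forward(x):
    # Effective weight is W + (alpha / r) * B @ A; W itself is never modified.
    return x @ (W + (alpha / r) * (B @ A)).T

x = torch.randn(4, d_in)
print(lora_forward(x).shape)                                      # torch.Size([4, 768])
print(sum(t.numel() for t in (A, B)), "trainable vs", W.numel(), "frozen values")
```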
M
Model Checkpoint
A checkpoint is a saved copy of a model’s weights at a specific training iteration or epoch. Large models may be split into multiple checkpoint files (shards).
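Saving and restoring a checkpoint in PyTorch; the filename is hypothetical.

```python
import torch
import torch.nn as nn

# Save the model's weights at some point during training...
model = nn.Linear(10, 2)
torch.save(model.state_dict(), "checkpoint_epoch_3.pt")

# ...and restore them later into a model with the same architecture.
restored = nn.Linear(10, 2)
restored.load_state_dict(torch.load("checkpoint_epoch_3.pt"))
```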
Multi-Head Attention
In Transformer-based models, attention is split into multiple “heads.” Each head learns different relationships in the data, providing richer context for the output.
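A shape-level sketch of the split: a 512-dimensional representation becomes 8 heads of 64 dimensions each, and every head then attends over the sequence independently. All numbers are illustrative.

```python
import torch

batch, seq_len, d_model, n_heads = 2, 10, 512, 8
head_dim = d_model // n_heads                    # 64 dimensions per head

x = torch.randn(batch, seq_len, d_model)
heads = x.view(batch, seq_len, n_heads, head_dim).transpose(1, 2)
print(heads.shape)  # torch.Size([2, 8, 10, 64]) -> (batch, heads, seq, head_dim)
```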
P
Parameter
A parameter is any learnable weight in a neural network (for instance, the entries in weight matrices or bias vectors). LLMs can contain billions (or even trillions) of parameters.
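Counting parameters in PyTorch: every entry of every weight matrix and bias vector counts. The layer sizes below are arbitrary.

```python
import torch.nn as nn

# A small feed-forward block; its parameters are the two weight matrices plus biases.
model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} parameters")  # 768*3072 + 3072 + 3072*768 + 768 = 4,722,432
```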
Prompt
A prompt is the text input given to an LLM to elicit a response. The practice of crafting prompts to yield the best responses is known as prompt engineering.
Q
Quantization
Quantization reduces the numerical precision of model weights (e.g., from 16-bit floats to 8-bit or 4-bit integers). This significantly reduces memory usage and can speed up inference, often with minimal loss in accuracy.
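A toy example of symmetric int8 quantization with NumPy: store one scale factor plus 8-bit integers instead of 32-bit floats, and dequantize when the weights are needed. Real schemes (per-channel scales, 4-bit groups, etc.) are more elaborate.

```python
import numpy as np

weights = np.random.randn(4, 4).astype(np.float32)   # original float32 weights

scale = np.abs(weights).max() / 127.0                 # map the largest magnitude to +/-127
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

dequantized = q.astype(np.float32) * scale            # approximate reconstruction
print(np.abs(weights - dequantized).max())            # small quantization error
```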
S
Safetensors
Safetensors is a secure, fast file format for storing model weights. Large model checkpoints may be split into multiple .safetensors shards plus an index file for easier handling and distribution.
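Writing and reading tensors with the safetensors library; the shard-style filename is only illustrative.

```python
import torch
from safetensors.torch import save_file, load_file

# A safetensors file is a flat mapping of tensor names to tensors.
tensors = {
    "embedding.weight": torch.randn(100, 64),
    "lm_head.weight": torch.randn(64, 100),
}

save_file(tensors, "model-00001-of-00002.safetensors")
loaded = load_file("model-00001-of-00002.safetensors")
print(loaded["embedding.weight"].shape)  # torch.Size([100, 64])
```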
Sharding
Sharding means dividing very large model weight files into smaller pieces to meet hosting limits or simplify distribution and parallelism.
T
Tensor
A tensor is a generalized multi-dimensional array (0D = scalar, 1D = vector, 2D = matrix, etc.). In LLMs, model weights and input/output data are stored and processed as tensors.
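Tensors of increasing rank in PyTorch; the shapes are arbitrary examples.

```python
import torch

scalar = torch.tensor(3.14)                 # 0-D
vector = torch.tensor([1.0, 2.0, 3.0])      # 1-D
matrix = torch.randn(3, 4)                  # 2-D
batch  = torch.randn(2, 3, 4)               # 3-D (e.g., batch x seq_len x hidden_size)

print(scalar.shape, vector.shape, matrix.shape, batch.shape)
```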
Token
A token is the smallest unit of text the model processes—this could be a word, subword, or even a character. An LLM predicts text one token at a time.
Tokenizer / Tokenization
The tokenizer splits text into tokens. This is a crucial step before feeding the text to an LLM for both training and inference.
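For example, with a Hugging Face tokenizer; "gpt2" is just a small, common choice of vocabulary here.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

ids = tokenizer.encode("Tokenization splits text into tokens.")
print(ids)                                    # integer token IDs fed to the model
print(tokenizer.convert_ids_to_tokens(ids))   # the subword pieces behind those IDs
print(tokenizer.decode(ids))                  # back to the original string
```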
Transformers
A Transformer is a neural network architecture built around the attention mechanism rather than recurrence. Most modern large language models (GPT, BERT, etc.) are variations of the Transformer.
W
Weight
A weight is a single learned value in a neural network’s parameter set. The collective set of weights in a neural network is what gets updated during training and stored in checkpoints.