Author(s): Utkarsh Mittal
Originally published on Towards AI.
Introduction
XGBoost (Extreme Gradient Boosting) has become the go-to algorithm for winning machine learning competitions and solving real-world prediction problems. But what makes it so powerful? In this comprehensive tutorial, we’ll unpack the mathematical foundations and practical mechanisms that make XGBoost superior to traditional gradient boosting methods.
This tutorial assumes you have basic knowledge of decision trees and machine learning concepts. We’ll walk through the algorithm step-by-step with visual examples and a concrete dataset that we’ll follow throughout the entire tutorial.
Our Tutorial Dataset
Throughout this tutorial, we’ll use a consistent dataset to demonstrate every concept. This helps you see how the algorithm works on real data, step by step.
Dataset Description
We have 20 samples (x₁ through x₂₀) with:
- 4 features: Column A, Column B, Column C, Column D
- 1 target variable: Target Y (binary: 0 or 1)
Understanding the Problem
This is a binary classification problem where Target Y is either 0 or 1. Our goal is to build a model that can distinguish between the two classes based on features A, B, C, and D.
Initial Observations:
- When Column B = 1, Target Y tends to be 1 (positive class)
- When Column B = 0, Target Y tends to be 0 (negative class)
- Column C values range from 0 to 6
- Column A shows some correlation with the target
Let’s see how XGBoost learns these patterns!
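The article's 20-row table appears as an image in the original post and isn't reproduced here, so the sketch below builds a stand-in DataFrame with the same shape and the patterns just described (Column B mirrors Target Y, Column C spans 0 to 6, Column A loosely tracks the target). The individual cell values are assumptions for illustration, not the article's actual numbers.

```python
import pandas as pd

# Stand-in for the tutorial's 20-sample table (cell values are assumed).
# Target Y and Column B follow the pattern described above (B = 1 -> Y = 1).
target_y = [0, 0, 1, 1, 0, 1, 0, 1, 0, 1,
            0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
data = pd.DataFrame({
    "Column A": [1, 0, 3, 2, 1, 4, 0, 5, 2, 3, 1, 4, 0, 5, 2, 3, 1, 4, 0, 5],
    "Column B": target_y,   # perfect predictor of the target in this dataset
    "Column C": [2, 3, 6, 0, 1, 4, 2, 5, 1, 3, 2, 6, 3, 5, 0, 4, 1, 6, 2, 3],
    "Column D": [0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0],
    "Target Y": target_y,
}, index=[f"x{i}" for i in range(1, 21)])
print(data)
```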
Part 1: How Decision Trees Learn from Data
The Recursive Splitting Process
Before diving into XGBoost, let’s understand how a single decision tree learns. Training a tree means recursively splitting the training data by examining different feature values.
Using our tutorial dataset with 20 samples (features A, B, C, D and target Y), let’s see how a tree is built.
Building the First Tree — Step by Step
Step 1: Root Node Decision
At the root node, we have all 20 samples. The algorithm examines all features looking for the best split. Let’s say it evaluates “Column B < 1” (i.e., Column B = 0):
Left Branch (Column B = 0):
- Samples: x₁, x₂, x₅, x₇, x₉, x₁₁, x₁₃, x₁₅, x₁₇, x₁₉ (10 samples)
- Target Y values: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
- All 10 samples have Target Y = 0!
Right Branch (Column B = 1):
- Samples: x₃, x₄, x₆, x₈, x₁₀, x₁₂, x₁₄, x₁₆, x₁₈, x₂₀ (10 samples)
- Target Y values: 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
- All 10 samples have Target Y = 1!
Amazing! This single split perfectly separates our classes. Column B is a perfect predictor in this dataset.
Step 2: Do We Need More Splits?
Let’s examine the left child (Column B = 0):
- All samples have the same target (Y = 0)
- No further splitting needed → This becomes a leaf node with prediction = 0
The right child (Column B = 1):
- All samples have the same target (Y = 1)
- No further splitting needed → This becomes a leaf node with prediction = 1
Our first tree is complete with just one split!
Root: All 20 samples
          |
   (Column B < 1?)
      /        \
    YES         NO
    /             \
Left Leaf:       Right Leaf:
Predict Y = 0    Predict Y = 1
(10 samples)     (10 samples)
The Key Question: Which Column and Value to Split On?
In our example, Column B was obvious, but typically the algorithm must evaluate:
- All columns: A, B, C, D
- All possible threshold values in each column
For Column A: try splits at 0.5, 1.5, 2.5, 3.5, 4.5, 5.5, …
For Column C: try splits at 0.5, 1.5, 2.5, 3.5, 4.5, 5.5, …
The algorithm calculates a score for each potential split and chooses the best one. This is where the loss function comes into play, which we’ll explore next.
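As a quick illustration of that search space, here is a minimal sketch that enumerates candidate thresholds as the midpoints between consecutive sorted unique values of a column; this is where the 0.5, 1.5, 2.5, … splits above come from. The Column C values are assumed for illustration.

```python
import numpy as np

def candidate_thresholds(values):
    """Midpoints between consecutive sorted unique values of one column."""
    v = np.unique(values)              # sorted unique values
    return (v[:-1] + v[1:]) / 2.0      # e.g. [0, 1, 2, ...] -> [0.5, 1.5, ...]

# Assumed integer values for Column C (range 0-6, as described earlier).
column_c = np.array([2, 3, 6, 0, 1, 4, 2, 5, 1, 3, 2, 6, 3, 5, 0, 4, 1, 6, 2, 3])
print(candidate_thresholds(column_c))  # [0.5 1.5 2.5 3.5 4.5 5.5]
```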
Realistic Scenario: What If Column B Had Noise?
Let’s imagine our dataset wasn’t perfect. Suppose x₁₉ (Column B = 0) had Target Y = 1 instead of 0.
Left Branch (Column B = 0):
- 9 samples with Y = 0, 1 sample with Y = 1
- Not perfect, but still predominantly class 0
- Prediction: Y = 0 (majority vote)
Right Branch (Column B = 1):
- 10 samples with Y = 1
- Perfect separation
- Prediction: Y = 1
This is where gradient boosting helps: the next tree will focus on correcting that one misclassified sample (x₁₉).
Part 2: The Ensemble Approach — Building Trees Sequentially
From Single Trees to Boosted Ensembles
XGBoost doesn’t build just one tree — it builds an ensemble of trees sequentially, where each new tree corrects the errors of the previous trees.
Continuing with Our Tutorial Dataset
Remember our first tree? It used Column B to perfectly classify:
- Prediction for Column B = 0: Y_pred = 0
- Prediction for Column B = 1: Y_pred = 1
But what if our dataset had some complexity that one tree couldn’t capture?
Building Tree 2: Correcting the Residuals
Let’s add realistic complexity. Suppose Tree 1 made these predictions:
(Predictions aren’t exactly 0 or 1 due to regularization, which we’ll explain later)
Tree 2’s Job: Predict these residuals (0.0, 0.0, 0.1, 0.1, …)
Tree 3’s Job: Predict the residuals from Tree 1 + Tree 2
This is the heart of gradient boosting: each new tree is trained to predict the residual errors from all previous trees combined.
The Sequential Learning Process
Final Prediction = Tree₁ + Tree₂ + Tree₃ + Tree₄ + Tree₅

For sample x₃:
Tree₁ predicts: 0.9
Tree₂ predicts: 0.08 (trying to correct the 0.1 error)
Tree₃ predicts: 0.015 (trying to correct remaining error)
...
Final prediction = 0.9 + 0.08 + 0.015 + ... ≈ 1.0
Key Question: How do we build Tree 2 optimally? We need to know:
- What errors did Tree 1 make?
- How should we split the data to best correct these errors?
- What should each leaf predict?
This is where the loss function and XGBoost’s mathematical framework come in!
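Before diving into that framework, here is a minimal sketch of the sequential recipe itself, using plain squared-error residuals and sklearn's DecisionTreeRegressor as a stand-in weak learner. XGBoost builds its trees with the gradient/Hessian machinery covered in the next parts and adds regularization on top; this sketch only shows the "each tree corrects the previous trees" loop.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # stand-in weak learner for illustration

def boosted_predict(trees, X, learning_rate=0.1):
    """Ensemble prediction = sum of every tree's (shrunken) output."""
    pred = np.zeros(X.shape[0])
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred

def fit_boosted_ensemble(X, y, n_trees=20, learning_rate=0.1):
    """Each new tree is fit to the residual errors of all previous trees."""
    trees = []
    for _ in range(n_trees):
        residuals = y - boosted_predict(trees, X, learning_rate)
        trees.append(DecisionTreeRegressor(max_depth=2).fit(X, residuals))
    return trees

# Tiny demo on assumed data: predictions creep toward the targets round by round.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 0.0, 1.0, 1.0])
trees = fit_boosted_ensemble(X, y)
print(boosted_predict(trees, X))   # approaches [0, 0, 1, 1] as trees accumulate
```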
Part 3: The Loss Function — Minimizing Bias and Variance Simultaneously
Engineering the Objective
XGBoost’s secret weapon is its carefully engineered loss function that simultaneously addresses two fundamental challenges in machine learning: bias and variance.
The total loss function in XGBoost has two main components:
Component 1: Minimizing Bias (Training Error)
The first term sums over all samples in the training data:
Σ l(ŷᵢ, yᵢ)
Where:
- ŷᵢ = prediction for sample i
- yᵢ = actual target value for sample i
- l() = loss function (such as mean squared error or cross-entropy)
This term ensures we fit the training data well, keeping underfitting (bias) in check.
Component 2: Controlling Variance (Regularization)
The second term sums over all trees:
Σ Ω(fₖ)
The regularization function Ω is decomposed into two parts:
1. Leaf Penalty: γT, where T is the number of leaves
 - Penalizes trees with too many leaves
 - Simpler trees → lower variance → better generalization
2. Weight Penalty: ½λ||w||², where w are the leaf predictions (weights)
 - Penalizes extreme predictions at each leaf
 - Smaller predictions from each tree → gradual corrections → less overfitting
Key Insight: By minimizing both terms simultaneously through hyperparameters γ and λ, XGBoost finds the sweet spot between bias and variance that traditional gradient boosting methods struggle to achieve.
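In the XGBoost Python package these two knobs map onto hyperparameters: λ is `reg_lambda` and γ is `gamma` (also exposed as `min_split_loss`). A minimal sketch, assuming the `xgboost` package is installed and using assumed toy data:

```python
import numpy as np
import xgboost as xgb

# Toy binary-classification data (assumed), just to show where the knobs go.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 1] > 0).astype(int)

model = xgb.XGBClassifier(
    n_estimators=50,      # number of boosted trees
    learning_rate=0.1,    # shrinkage applied to each tree's leaf weights
    max_depth=3,          # limits tree size (another way to control T)
    reg_lambda=1.0,       # lambda: L2 penalty on leaf weights (variance control)
    gamma=0.0,            # gamma: minimum loss reduction required to make a split
)
model.fit(X, y)
print(model.predict_proba(X[:3]))
```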
Part 4: Breaking Down the Sample-Level Loss Function
How Predictions Build on Each Other
When building tree t, we need to understand how predictions accumulate from previous trees.
The sample-level loss function shows:
l(yᵢ, ŷᵢ^(t)) = l(yᵢ, ŷᵢ^(t-1) + fₜ(xᵢ))
Breaking this down:
- ŷᵢ^(t-1): The prediction from all previous trees (1 through t-1)
- fₜ(xᵢ): The prediction from the new tree t we’re building
- xᵢ: The feature values for sample i
Example with Our Tutorial Dataset
Let’s work through a concrete example using our 20-sample dataset.
Scenario: Building Tree 2 (t=2)
For sample x₃ (Column A=3, Column B=1, Column C=6, Column D=1, True Y=1):
Step 1: What did Tree 1 predict?
ŷ₃^(1) = 0.85 (Tree 1’s prediction for x₃)
Step 2: What’s the actual target?
y₃ = 1.0
Step 3: What’s the error?
Error = y₃ - ŷ₃^(1) = 1.0 - 0.85 = 0.15
Step 4: What should Tree 2 predict?
Tree 2 should predict approximately 0.15 to correct the error. If we’re using log loss (for classification), the exact value depends on the gradient and Hessian, but intuitively:
New prediction: ŷ₃^(2) = ŷ₃^(1) + f₂(x₃) = 0.85 + 0.12 = 0.97
Getting closer to the true value of 1.0!
Full Example for Tree 2
Observation:
- Samples with negative errors (predicted too high): Tree 2 predicts negative values
- Samples with positive errors (predicted too low): Tree 2 predicts positive values
The loss function guides Tree 2 on how to split the data to best capture these error patterns.
For instance, Tree 2 might discover:
- When Column C < 3: Errors tend to be negative (over-predicted)
- When Column C ≥ 3: Errors tend to be positive (under-predicted)
And use this to make a split!
Part 5: The Taylor Approximation — Making the Math Tractable
Simplifying Complex Loss Functions
The challenge with the sample-level loss function is that its analytical form depends on which loss function we choose (MSE, cross-entropy, etc.). XGBoost uses a clever mathematical trick: the second-order Taylor expansion.
The Taylor approximation transforms the loss function into a more manageable form:
l(yᵢ, ŷᵢ^(t-1) + fₜ(xᵢ)) ≈ l(yᵢ, ŷᵢ^(t-1)) + gᵢfₜ(xᵢ) + ½hᵢfₜ²(xᵢ)
Computing Gradients and Hessians for Our Dataset
Let’s calculate actual values using log loss (binary cross-entropy), which is standard for binary classification:
l(y, ŷ) = -y·log(ŷ) - (1-y)·log(1-ŷ)
Gradient (First Derivative):
gᵢ = ∂l/∂ŷ = ŷᵢ^(t-1) - yᵢ
Hessian (Second Derivative):
hᵢ = ∂²l/∂ŷ² = ŷᵢ^(t-1)·(1 - ŷᵢ^(t-1))
Concrete Example: Sample x₃ from Our Dataset
Given:
- True target: y₃ = 1
- Tree 1 prediction: ŷ₃^(1) = 0.85
- We’re building Tree 2
Calculate Gradient:
g₃ = ŷ₃^(1) - y₃ = 0.85 - 1.0 = -0.15
Interpretation: Negative gradient means we predicted too low, need to increase prediction.
Calculate Hessian:
h₃ = ŷ₃^(1) · (1 - ŷ₃^(1)) = 0.85 · 0.15 = 0.1275
Interpretation: Hessian represents the curvature. Higher values mean we’re more confident about the gradient direction.
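These two one-liners are easy to check numerically. The sketch below reproduces g₃ and h₃ using the log-loss gradient/Hessian formulas above (0.85 is the Tree 1 probability assumed in this example):

```python
def logloss_grad_hess(p_prev, y_true):
    """Gradient and Hessian of log loss w.r.t. the raw score, at probability p_prev."""
    g = p_prev - y_true            # g_i = yhat^(t-1) - y_i
    h = p_prev * (1.0 - p_prev)    # h_i = yhat^(t-1) * (1 - yhat^(t-1))
    return g, h

g3, h3 = logloss_grad_hess(p_prev=0.85, y_true=1.0)
print(round(g3, 4), round(h3, 4))   # -0.15  0.1275
```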
Full Calculations for Our 20 Samples
Building Tree 2, here are gradients and Hessians:
Patterns Observed:
- Class 0 samples (Y=0): Positive gradients (we over-predicted, need to decrease)
- Class 1 samples (Y=1): Negative gradients (we under-predicted, need to increase)
- Hessians all positive, representing curvature of loss function
Understanding Each Term:
- Constant Term: l(yᵢ, ŷᵢ^(t-1)), the loss from the previous trees (fixed)
- Gradient Term: gᵢfₜ(xᵢ)
  - For x₃: g₃ = -0.15; if Tree 2 predicts f₂(x₃) = 0.12, this term = -0.15 × 0.12 = -0.018
  - Negative value → decreases loss → good!
- Hessian Term: ½hᵢfₜ²(xᵢ)
  - For x₃: h₃ = 0.128 and f₂(x₃) = 0.12, so this term = 0.5 × 0.128 × 0.12² ≈ 0.00092
  - Positive value → increases loss slightly
  - Acts as a penalty for large predictions (regularization effect)
Visual Interpretation:
The figure shows three scenarios:
- Left: High gradient (steep slope) — we need to move predictions significantly
- Center: The “bowl” shape shows how the Taylor approximation mirrors the true loss
- Right: The curvature (Hessian) tells us whether the loss surface is gentle or steep
Key Insight: This approximation is exact when the new tree’s predictions are small (fₜ → 0), which is exactly what regularization encourages! This is why XGBoost uses small learning rates and many trees.
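To see how tight the second-order approximation is, the sketch below compares the true change in log loss with the Taylor estimate g·f + ½h·f² for sample x₃. Note that the expansion is taken in raw-score (pre-sigmoid) space, so the 0.85 probability is first converted to its logit; the 0.12 step is the assumed Tree 2 contribution from the example above.

```python
import math

y, p_prev, f_step = 1.0, 0.85, 0.12            # assumed values for sample x3

def logloss(y, p):
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

z_prev = math.log(p_prev / (1 - p_prev))       # raw score (logit) behind the 0.85 probability
p_new = 1 / (1 + math.exp(-(z_prev + f_step))) # probability after adding Tree 2's step

true_change = logloss(y, p_new) - logloss(y, p_prev)
g = p_prev - y                                 # gradient
h = p_prev * (1 - p_prev)                      # Hessian
taylor_change = g * f_step + 0.5 * h * f_step ** 2

print(round(true_change, 5), round(taylor_change, 5))   # -0.01713 vs -0.01708
```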
Part 6: Focusing on What Matters
Discarding Constants and Focusing on the New Tree
When optimizing for the new tree, we can simplify our objective by removing terms that don’t depend on the new tree.
The Simplified Loss Function:
Looking at our expanded loss:
- Constant Term (past errors): l(yᵢ, ŷᵢ^(t-1)) doesn't depend on the new tree → Discard
- Gradient Term: gᵢfₜ(xᵢ) depends on the new tree → Keep
- Hessian Term: ½hᵢfₜ²(xᵢ) depends on the new tree → Keep
- Regularization Term: Ω(fₜ) depends on the new tree → Keep
This gives us the optimization objective for tree t:
L̃^(t) = Σᵢ₌₁ⁿ (gᵢfₜ(xᵢ) + ½hᵢfₜ²(xᵢ)) + Ω(fₜ)
Two Distinct Components:
- Optimization Term: Finds the best prediction step using gradients and Hessians
- Regularization Term: Penalizes complexity to prevent overfitting
The figure beautifully illustrates this separation of concerns.
Part 7: Regrouping by Leaves — A Game-Changing Perspective
From Sample Sums to Leaf Sums
Here’s where XGBoost’s brilliance truly shines. Instead of summing over samples, we regroup the sum to iterate over leaves in the tree.
The Mathematical Trick:
Original (sum over samples):
Σᵢ₌₁ⁿ (gᵢfₜ(xᵢ) + ½hᵢfₜ²(xᵢ))
Regrouped (sum over leaves):
Σⱼ₌₁ᵀ ((Σᵢ∈Iⱼ gᵢ)wⱼ + ½(Σᵢ∈Iⱼ hᵢ + λ)wⱼ²) + γT
Understanding the Notation:
- T: Total number of leaves in the tree
- Iⱼ: Set of all samples that land in leaf j
- wⱼ: The prediction value at leaf j (what we’re solving for)
- Gⱼ = Σᵢ∈Iⱼ gᵢ: Total gradient in leaf j
- Hⱼ = Σᵢ∈Iⱼ hᵢ: Total Hessian in leaf j
Key Insight: For all samples in the same leaf j, the prediction fₜ(xᵢ) is the same leaf weight wⱼ. This allows us to factor out wⱼ and regroup!
Why This Matters:
This regrouping transforms the problem from “How should each sample contribute?” to “What should each leaf predict?” This is computationally much more efficient and leads to the optimal weight formula.
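A short sketch of the regrouping: given each sample's gradient, Hessian, and assigned leaf, summing per leaf produces exactly the Gⱼ and Hⱼ that the next part plugs into the optimal-weight formula. The (gradient, Hessian, leaf) triples below are assumed values in the spirit of the running example.

```python
from collections import defaultdict

# Assumed per-sample values: (gradient, hessian, index of the leaf the sample falls into)
samples = [(-0.15, 0.128, 1), (0.10, 0.090, 0), (0.12, 0.106, 0), (-0.12, 0.106, 1)]

G = defaultdict(float)   # G_j = sum of gradients in leaf j
H = defaultdict(float)   # H_j = sum of Hessians in leaf j
for g, h, leaf in samples:
    G[leaf] += g
    H[leaf] += h

print(dict(G), dict(H))  # roughly {1: -0.27, 0: 0.22} and {1: 0.234, 0: 0.196}
```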
Part 8: Finding Optimal Leaf Weights
Calculus to the Rescue
To find the optimal leaf weights, we use a fundamental calculus principle: the minimum of a function occurs where its derivative equals zero.
Calculating Optimal Weights for Our Dataset
Let’s use our tutorial data and see how to calculate actual leaf weights when building Tree 2.
Part 1: Understanding the Formula
Taking the derivative with respect to wⱼ and setting it to zero:
∂L̃^(t)/∂wⱼ = Gⱼ + (Hⱼ + λ)wⱼ = 0
Solving this equation gives us:
wⱼ* = -Gⱼ/(Hⱼ + λ)
Where:
- Gⱼ = Σᵢ∈Iⱼ gᵢ: Sum of gradients for all samples in leaf j
- Hⱼ = Σᵢ∈Iⱼ hᵢ: Sum of Hessians for all samples in leaf j
- λ: Regularization parameter (let’s use λ = 1.0)
Concrete Example: Computing Leaf Weights
Suppose Tree 2 splits on Column B (like Tree 1), creating two leaves:
Leaf 1 (Column B = 0) — Contains samples where Y = 0:
Samples: x₁, x₂, x₅, x₇, x₉, x₁₁, x₁₃, x₁₅, x₁₇, x₁₉
Calculate Gⱼ:
G₁ = g₁ + g₂ + g₅ + g₇ + g₉ + g₁₁ + g₁₃ + g₁₅ + g₁₇ + g₁₉
G₁ = 0.10 + 0.12 + 0.08 + 0.11 + 0.09 + 0.10 + 0.13 + 0.07 + 0.09 + 0.12
G₁ = 1.01
Calculate Hⱼ:
H₁ = h₁ + h₂ + h₅ + h₇ + h₉ + h₁₁ + h₁₃ + h₁₅ + h₁₇ + h₁₉
H₁ = 0.090 + 0.106 + 0.074 + 0.098 + 0.082 + 0.090 + 0.113 + 0.065 + 0.082 + 0.106
H₁ = 0.906
Calculate optimal weight:
w₁* = -G₁/(H₁ + λ) = -1.01/(0.906 + 1.0) = -1.01/1.906 = -0.530
Interpretation: Tree 2 predicts -0.530 for samples in Leaf 1. This negative value reduces the overall prediction, correcting the slight over-predictions from Tree 1 for class 0 samples.
Leaf 2 (Column B = 1) — Contains samples where Y = 1:
Samples: x₃, x₄, x₆, x₈, x₁₀, x₁₂, x₁₄, x₁₆, x₁₈, x₂₀
Calculate Gⱼ:
G₂ = g₃ + g₄ + g₆ + g₈ + g₁₀ + g₁₂ + g₁₄ + g₁₆ + g₁₈ + g₂₀
G₂ = -0.15 + -0.12 + -0.13 + -0.11 + -0.16 + -0.10 + -0.14 + -0.17 + -0.12 + -0.15
G₂ = -1.35
Calculate Hⱼ:
H₂ = h₃ + h₄ + h₆ + h₈ + h₁₀ + h₁₂ + h₁₄ + h₁₆ + h₁₈ + h₂₀
H₂ = 0.128 + 0.106 + 0.113 + 0.098 + 0.134 + 0.090 + 0.120 + 0.141 + 0.106 + 0.128
H₂ = 1.164
Calculate optimal weight:
w₂* = -G₂/(H₂ + λ) = -(-1.35)/(1.164 + 1.0) = 1.35/2.164 = 0.624
Interpretation: Tree 2 predicts +0.624 for samples in Leaf 2. This positive value increases the overall prediction, correcting the slight under-predictions from Tree 1 for class 1 samples.
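Plugging the G and H sums above into wⱼ* = -Gⱼ/(Hⱼ + λ) is a two-line check (λ = 1.0, as in the example):

```python
def optimal_leaf_weight(G, H, lam=1.0):
    """XGBoost's closed-form leaf weight: w* = -G / (H + lambda)."""
    return -G / (H + lam)

print(optimal_leaf_weight(G=1.01, H=0.906))    # ~ -0.530 (Leaf 1)
print(optimal_leaf_weight(G=-1.35, H=1.164))   # ~  0.624 (Leaf 2)
```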
Updated Predictions After Tree 2
For sample x₃ (in Leaf 2):
Previous prediction (Tree 1): ŷ₃^(1) = 0.85
Tree 2 contribution: w₂* = 0.624
New prediction: ŷ₃^(2) = 0.85 + 0.624 = 1.474 (before sigmoid)
After sigmoid: σ(1.474) = 1/(1 + e^(-1.474)) ≈ 0.81
We’re getting closer to the true value of 1.0!
For sample x₁ (in Leaf 1):
Previous prediction (Tree 1): ŷ₁^(1) = 0.10
Tree 2 contribution: w₁* = -0.530
New prediction: ŷ₁^(2) = 0.10 + (-0.530) = -0.430 (before sigmoid)
After sigmoid: σ(-0.430) = 1/(1 + e^(0.430)) ≈ 0.39
Hmm, this actually moved us away from 0! This is where Tree 3 comes in to correct further.
Understanding the Formula Components
Intuitive Understanding:
If you have:
- High gradients (large errors) → Larger |w*| to correct more aggressively
- High Hessian (high certainty) → Smaller |w*| to be conservative
- High λ (strong regularization) → Smaller |w*| to avoid overfitting
Effect of λ:
If λ = 0 (no regularization):
w₁* = -1.01/0.906 = -1.115 (larger magnitude)
w₂* = 1.35/1.164 = 1.160 (larger magnitude)
If λ = 5.0 (strong regularization):
w₁* = -1.01/5.906 = -0.171 (much smaller)
w₂* = 1.35/6.164 = 0.219 (much smaller)
Higher λ creates smaller, more conservative corrections!
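A quick sweep over λ reproduces the numbers above (the helper is re-defined here so the snippet runs on its own):

```python
def optimal_leaf_weight(G, H, lam):
    return -G / (H + lam)

for lam in (0.0, 1.0, 5.0):
    w1 = optimal_leaf_weight(1.01, 0.906, lam)    # Leaf 1 (class 0 side)
    w2 = optimal_leaf_weight(-1.35, 1.164, lam)   # Leaf 2 (class 1 side)
    print(f"lambda={lam}: w1*={w1:.3f}, w2*={w2:.3f}")
```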
Part 2: Using Optimal Weights in Final Loss
We substitute wⱼ* back into the loss function to get the structure score for a tree with structure q:
L̃^(t)(q) = -½ Σⱼ₌₁ᵀ (Gⱼ²)/(Hⱼ + λ) + γT
For our Tree 2 with 2 leaves and γ = 0:
L̃^(2) = -½((G₁²)/(H₁ + λ) + (G₂²)/(H₂ + λ))
L̃^(2) = -½((1.01²)/(1.906) + (1.35²)/(2.164))
L̃^(2) = -½(0.535 + 0.842)
L̃^(2) = -½(1.377)
L̃^(2) = -0.689
This is the objective score for Tree 2’s structure. Lower (more negative) scores are better! We’ll use this to evaluate different split options.
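The structure score is just as direct to compute; with γ = 0 this reproduces the -0.689 above:

```python
def structure_score(leaf_stats, lam=1.0, gamma=0.0):
    """L-tilde = -1/2 * sum_j G_j^2 / (H_j + lambda) + gamma * T"""
    T = len(leaf_stats)
    return -0.5 * sum(G * G / (H + lam) for G, H in leaf_stats) + gamma * T

print(structure_score([(1.01, 0.906), (-1.35, 1.164)]))   # ~ -0.689
```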
Part 9: The Split-Finding Algorithm
Understanding the Objective and Splitting Criteria
Now we have all the pieces to understand how XGBoost finds the best splits!
The Complete Picture:
The figure shows the relationship between:
- Total Loss for the Tree Structure:
L̃^(t)(q) = -½ Σⱼ₌₁ᵀ (Gⱼ²)/(Hⱼ + λ) + γT
- Split Gain Formula:
L_split = ½((G_L²)/(H_L + λ) + (G_R²)/(H_R + λ) - (G_I²)/(H_I + λ)) - γ
Breaking Down the Split Gain:
Consider a parent node with samples, and a proposed split:
- G_I, H_I: Gradient and Hessian sums in the parent (current node)
- G_L, H_L: Gradient and Hessian sums in the left child after split
- G_R, H_R: Gradient and Hessian sums in the right child after split
The Negative Structured Score:
The term (Gⱼ²)/(Hⱼ + λ) appears three times with different signs:
- Positive for left child: Lower loss in left child is good
- Positive for right child: Lower loss in right child is good
- Negative for parent: We subtract the parent’s loss (opportunity cost)
The gain formula tells us: How much better is the split compared to not splitting?
Regularization Term: -γ
- Each split adds a new leaf (increases T by 1)
- We subtract γ to penalize unnecessary complexity
- Only split if the gain exceeds this penalty!
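Putting these pieces together in code, the gain of a candidate split is a short function of the left- and right-child gradient/Hessian sums (a minimal sketch; the example call uses assumed sums):

```python
def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    """Gain = 1/2 * [G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma"""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R)) - gamma

print(split_gain(G_L=0.5, H_L=0.8, G_R=-0.9, H_R=1.2))   # assumed sums; positive gain
```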
Part 10: Gradient vs. Curvature — The XGBoost Advantage
Why XGBoost Outperforms Traditional Gradient Boosting
This figure reveals the fundamental difference between XGBoost and traditional Gradient Boosting Machines (GBM).
The Objective Function Breakdown:
Gain = (GRAD_j²)/(CURV_j + λ)
Scenario Analysis:
Scenario 1: The XGBoost Choice
- Low loss decrease (modest gradient reduction)
- Low curvature (smooth, generalizable surface)
- Result: High gain score
- Why it’s better: Balances loss reduction with a smoother, more generalizable surface
Scenario 2: What GBM Would Choose
- High loss decrease (steep gradient reduction)
- High curvature (wiggly, complex surface)
- Result: Lower gain score (after accounting for curvature)
- Why it’s worse: Aggressive fitting leads to overfitting
The Mathematical Connection to Bias-Variance:
The figure makes an elegant connection:
- -GRAD_j ~ Bias: Maximizing gradient reduction minimizes bias (underfitting)
- -CURV_j ~ Variance: Minimizing curvature minimizes variance (overfitting)
Visual Comparison:
The second figure shows two split options:
- XGBoost’s Choice (left): Modest loss decrease but smooth curvature
- GBM’s Choice (right): Aggressive loss decrease but high curvature risk
The Bottom Line:
XGBoost tries to minimize both Bias and Variance
=> XGBoost overfits less than GBM
This is why XGBoost consistently outperforms traditional gradient boosting in practice!
Part 11: Finding the Right Split — The Practical Algorithm
Step-by-Step Split Search Process
Now let’s see how XGBoost actually finds the best split at each node using our tutorial dataset.
Real Calculation: Finding the Best Split for Tree 2
We’re building Tree 2. All 20 samples are at the root node with their gradients and Hessians calculated.
The Question: Which feature and threshold should we split on?
Step 1: Sort Column C Values
Our 20 samples sorted by Column C:
Observation: Notice how Target Y values change as Column C increases!
Step 2: Evaluate Split “Column C < 3”
Left child (Column C < 3): 9 samples
Samples: x₄, x₁₅, x₅, x₉, x₁₇, x₁, x₇, x₁₁, x₁₉
Calculate sums:
G_L = -0.12 + 0.07 + 0.08 + 0.09 + 0.09 + 0.10 + 0.11 + 0.10 + 0.12 = 0.64
H_L = 0.106 + 0.065 + 0.074 + 0.082 + 0.082 + 0.090 + 0.098 + 0.090 + 0.106 = 0.793
Right child (Column C ≥ 3): 11 samples
Samples: x₂, x₁₀, x₁₃, x₂₀, x₆, x₁₆, x₈, x₁₄, x₃, x₁₂, x₁₈
Calculate sums:
G_R = 0.12 + -0.16 + 0.13 + -0.15 + -0.13 + -0.17 + -0.11 + -0.14 + -0.15 + -0.10 + -0.12
G_R = -1.08
H_R = 0.106 + 0.134 + 0.113 + 0.128 + 0.113 + 0.141 + 0.098 + 0.120 + 0.128 + 0.090 + 0.106
H_R = 1.277
Parent node (all 20 samples):
G_parent = G_L + G_R = 0.64 + (-1.08) = -0.44
H_parent = H_L + H_R = 0.793 + 1.277 = 2.070
Calculate Gain (with λ=1.0, γ=0):
Gain = ½((G_L²)/(H_L + λ) + (G_R²)/(H_R + λ) - (G_parent²)/(H_parent + λ)) - γ
Gain = ½((0.64²)/(0.793 + 1.0) + (1.08²)/(1.277 + 1.0) - (0.44²)/(2.070 + 1.0)) - 0
Gain = ½((0.4096/1.793) + (1.1664/2.277) - (0.1936/3.070))
Gain = ½(0.228 + 0.512 - 0.063)
Gain = ½(0.677)
Gain = 0.339
Positive gain means this split improves the loss function!
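As a standalone numeric check, the same gain computed in code from the sums above (λ = 1.0, γ = 0):

```python
lam = 1.0
G_L, H_L = 0.64, 0.793      # Column C < 3
G_R, H_R = -1.08, 1.277     # Column C >= 3
G_P, H_P = G_L + G_R, H_L + H_R

gain = 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam) - G_P**2 / (H_P + lam))
print(round(gain, 3))        # 0.339
```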
Step 3: Evaluate Other Splits
Let’s try “Column C < 2”:
Left (C < 2): 5 samples (x₄, x₁₅, x₅, x₉, x₁₇)
G_L = -0.12 + 0.07 + 0.08 + 0.09 + 0.09 = 0.21
H_L = 0.106 + 0.065 + 0.074 + 0.082 + 0.082 = 0.409
Right (C ≥ 2): 15 samples (remaining)
G_R = -0.44 - 0.21 = -0.65
H_R = 2.070 - 0.409 = 1.661
Gain = ½((0.21²/1.409) + (0.65²/2.661) - (0.44²/3.070))
Gain = ½(0.031 + 0.159 - 0.063)
Gain = 0.064
Lower gain than “C < 3”!
Let’s try “Column C < 4”:
Left (C < 4): 13 samples
G_L = 0.64 + 0.12 + (-0.16) + 0.13 + (-0.15) = 0.58
H_L = 0.793 + 0.106 + 0.134 + 0.113 + 0.128 = 1.274
Right (C ≥ 4): 7 samples
G_R = -0.44 - 0.58 = -1.02
H_R = 2.070 - 1.274 = 0.796
Gain = ½((0.58²/2.274) + (1.02²/1.796) - (0.44²/3.070))
Gain = ½(0.148 + 0.580 - 0.063)
Gain = 0.333
Still lower than “C < 3”!
Step 4: Compare All Features
We would repeat this process for:
- Column A: Try splits at 0.5, 1.5, 2.5, 3.5, 4.5, 5.5
- Column B: Try split at 0.5 (only two values: 0 and 1)
- Column C: Already evaluated multiple splits
- Column D: Try split at 0.5 (only two values: 0 and 1)
Results Summary:
Winner: Column B < 0.5 has the highest gain!
This makes sense — Column B is a strong predictor in our dataset, clearly separating classes 0 and 1.
Tree 2 will split on Column B, just like Tree 1, but with different leaf weights to correct residual errors.
Part 12: Parallelization — Speed Through Smart Design
How XGBoost Achieves Fast Training
One of XGBoost’s practical advantages is its speed, achieved through clever parallelization.
The Parallel Split Search Strategy:
Since gradient boosting builds trees sequentially (can’t parallelize across trees), XGBoost parallelizes the split-finding process:
Thread Distribution:
- Thread 1: Searches for best split in Column A
- Thread 2: Searches for best split in Column B
- Thread 3: Searches for best split in Column C
- Thread 4: Searches for best split in Column D
Each thread independently:
- Sorts its column (if not pre-sorted)
- Evaluates all possible splits
- Computes the gain for each split
- Reports the best split for its column
Main Thread Coordination:
After all threads complete:
- Collect best split from each column
- Use ArgMax to find the overall best split
- Apply the winning split (e.g., “Column C < c”)
Why This Works:
The independence of column evaluations allows perfect parallelization:
- No thread dependencies
- No synchronization needed during search
- Only final aggregation requires coordination
Real-World Impact:
On a 4-core machine, you could see close to 4x speedup in split-finding, which is the most computationally intensive part of tree building!
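In the Python package, the number of parallel threads is controlled by `n_jobs` (or the booster's `nthread` parameter); a minimal sketch with assumed toy data:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 4))
y = (X[:, 1] + 0.1 * rng.normal(size=10_000) > 0).astype(int)

# n_jobs: number of threads XGBoost uses; split search is the hot loop it parallelizes.
model = xgb.XGBClassifier(n_estimators=100, n_jobs=4)
model.fit(X, y)
```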
Part 13: Handling Missing Values Intelligently
XGBoost’s Clever Approach to Missing Data
Unlike many algorithms that require preprocessing for missing values, XGBoost handles them naturally during training.
Adding Missing Values to Our Tutorial Dataset
Let’s modify our dataset to include some missing values and see how XGBoost handles them.
Modified Dataset (Column C with missing values):
We have 2 missing values in Column C (samples x₂ and x₅).
Traditional approaches would require:
- Imputation (filling with mean=2.8, median=3, etc.)
- Deletion (removing rows x₂ and x₅)
- Creating an indicator variable
XGBoost does something smarter!
The Missing Value Algorithm
Step 1: Sort and Separate
When evaluating split “Column C < 3”:
Non-missing samples sorted:
Col C: 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 6, 6, 6
Missing samples (set aside):
x₂: Target Y = 0, Gradient = 0.12
x₅: Target Y = 0, Gradient = 0.08
Step 2: Try Missing Values Going LEFT
Configuration 1: Missing → Left
Left child (C < 3 OR C is missing):
- Non-missing: x₄, x₁₅, x₅, x₉, x₁₇, x₁, x₇, x₁₁, x₁₉ (9 samples)
- Missing: x₂, x₅ (2 samples)
- Total: 11 samples
Calculate:
G_L = (sum of 9 non-missing gradients) + 0.12 + 0.08
G_L = 0.64 + 0.20 = 0.84
H_L = (sum of 9 non-missing Hessians) + h₂ + h₅
H_L = 0.793 + 0.106 + 0.074 = 0.973
Right child (C ≥ 3 and C not missing): 9 samples
G_R = -1.08
H_R = 1.277
Calculate gain:
Gain_left = ½((0.84²/1.973) + (1.08²/2.277) - (0.40²/3.250))
Gain_left = ½(0.357 + 0.512 - 0.049)
Gain_left = 0.410
Step 3: Try Missing Values Going RIGHT
Configuration 2: Missing → Right
Left child (C < 3 and C not missing): 9 samples
G_L = 0.64
H_L = 0.793
Right child (C ≥ 3 OR C is missing):
- Non-missing: x₂, x₁₀, x₁₃, x₂₀, x₆, x₁₆, x₈, x₁₄, x₃, x₁₂, x₁₈ (11 samples)
- Missing: x₂, x₅ (2 samples)
- Total: 13 samples
Calculate:
G_R = -1.08 + 0.12 + 0.08 = -0.88
H_R = 1.277 + 0.106 + 0.074 = 1.457
Calculate gain:
Gain_right = ½((0.64²/1.793) + (0.88²/2.457) - (0.40²/3.250))
Gain_right = ½(0.228 + 0.316 - 0.049)
Gain_right = 0.248
Step 4: Choose the Better Direction
Compare the two scenarios:
Gain_left = 0.410 (Missing → Left)
Gain_right = 0.248 (Missing → Right)
Since 0.410 > 0.248: Missing values go LEFT!
The split is:
IF (Column C < 3) OR (Column C is missing):
    Go LEFT
ELSE:
    Go RIGHT
Why Did Missing Go Left?
Let’s analyze our missing samples:
- x₂: Target Y = 0, Gradient = +0.12 (positive)
- x₅: Target Y = 0, Gradient = +0.08 (positive)
The left child predominantly contains:
- Samples with Target Y = 0
- Samples with positive gradients (need downward corrections)
The missing samples share these characteristics! They belong with the left group.
XGBoost discovered this pattern automatically without any manual feature engineering.
Real-World Interpretation
What does this mean in practice?
Imagine Column C represents “number of customer service calls”:
- Missing value might mean “no calls recorded” or “new customer”
- These samples might behave more like low-call customers (Y=0)
- XGBoost learns: missing → treat like “low calls”
Benefits:
- No imputation needed: Don’t have to guess values
- Preserves information: Missing might be meaningful
- Node-specific: Different nodes might route missing values differently
- Automatic: No manual decision required
Another Example: Different Node, Different Direction
At a different node in the tree, when evaluating “Column A < 3”, missing values might go RIGHT if that minimizes loss better. Each node makes its own decision based on the data at that node.
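In practice you simply leave the NaNs in place: the Python package treats `np.nan` as missing by default and learns a per-node default direction exactly as described above. A small sketch with assumed toy data:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 1] > 0).astype(int)

# Knock out a few entries in "Column C" (index 2); no imputation needed.
X[rng.choice(500, size=25, replace=False), 2] = np.nan

model = xgb.XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)              # NaNs are routed to each node's learned default side
print(model.predict(X[:5]))
```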
Part 14: Additional Performance Optimizations
Features That Make XGBoost Production-Ready
While we’ve covered the core algorithm, XGBoost includes several additional optimizations that make it fast and efficient in practice.
Four Key Performance Features
1. Pre-Sorted Column Blocks
What it does:
- Sorts all feature columns once at the beginning
- Keeps pointers to original indices
- Stores in compressed column format
Why it matters:
- Sorting has O(n log n) complexity
- Without pre-sorting, you’d sort at every node evaluation
- With pre-sorting, you sort once and reuse
- Massive time savings during split search
Trade-off: Higher initial memory usage for faster searches
2. Compressed Column Format
What it does:
- Stores data in a column-oriented format
- Compresses continuous features into bins
- Optimizes for CPU cache efficiency
Why it matters:
- Modern CPUs are fast but memory access is slow
- Column format allows better cache utilization
- When evaluating Column C splits, all Column C data is contiguous in memory
- Parallel threads can access different columns without cache conflicts
3. Cache-Aware Prefetching
What it does:
- Anticipates which memory locations will be needed next
- Loads data into CPU cache before it’s needed
- Minimizes cache misses from non-contiguous memory access
Why it matters:
- Pre-sorting changes data order, making memory access non-contiguous
- Without prefetching, each memory access might be a cache miss (slow!)
- With prefetching, data is ready when needed (fast!)
- Can provide 2–10x speedup depending on dataset size
Technical detail: Uses the CPU’s prefetch instructions to hint at future memory access patterns
4. Approximate Split-Finding (Histogram Method)
What it does:
- Instead of evaluating every possible split point
- Groups feature values into bins (histogram buckets)
- Only evaluates splits at bucket boundaries
Why it matters:
- For features with many unique values, evaluating all splits is expensive
- Histograms reduce the search space dramatically
- Example: 10,000 unique values → 256 histogram bins
- Massive speedup with minimal accuracy loss
How it works:
Original: 0.01, 0.02, 0.03, ..., 99.99, 100.00 (10,000 values)
Histogram: (0-1), (1-2), (2-3), ..., (99-100) (100 bins)
Only evaluate splits at bin boundaries, not at every individual value.
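In the Python package this is the `tree_method="hist"` option, with `max_bin` controlling the number of buckets; a minimal sketch with assumed toy data:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(50_000, 4))          # many unique values per column
y = (X[:, 0] + X[:, 1] > 100).astype(int)

# Histogram-based split finding: bucket each feature into at most max_bin bins.
model = xgb.XGBClassifier(tree_method="hist", max_bin=256, n_estimators=50)
model.fit(X, y)
```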
The Combined Impact
These optimizations work together:
- Pre-sorting eliminates repeated sorting overhead
- Column format + Cache prefetching optimize memory access
- Parallelization (from Part 12) uses multiple cores
- Histogram method reduces the search space
Real-World Performance:
On typical datasets, these optimizations combined can provide:
- 10–100x speedup over naive implementations
- Ability to handle datasets with millions of rows efficiently
- Sub-second training times for moderately sized datasets
This is why XGBoost became the standard tool for Kaggle competitions and production machine learning systems!
Published via Towards AI