Author(s): Utkarsh Mittal
Originally published on Towards AI.
Introduction
XGBoost (Extreme Gradient Boosting) has become the go-to algorithm for winning machine learning competitions and solving real-world prediction problems. But what makes it so powerful? In this comprehensive tutorial, we’ll unpack the mathematical foundations and practical mechanisms that make XGBoost superior to traditional gradient boosting methods.
This tutorial assumes you have basic knowledge of decision trees and machine learning concepts. We’ll walk through the algorithm step-by-step with visual examples and a concrete dataset that we’ll follow throughout the entire tutorial.
Our Tutorial Dataset
Throughout this tutorial, we’ll use a consistent dataset to demonstrate every concept. This helps you see how the algorithm works on real data, step by step.
Dataset Description
We have 20 samples (x₁ through x₂₀) with:
- 4 features: Column A, Column B, Column C, Column D
- 1 target variable: Target Y (binary: 0 or 1)
Understanding the Problem
This is a binary classification problem where Target Y is either 0 or 1. Our goal is to build a model that can distinguish between the two classes based on features A, B, C, and D.
Initial Observations:
- When Column B = 1, Target Y tends to be 1 (positive class)
- When Column B = 0, Target Y tends to be 0 (negative class)
- Column C values range from 0 to 6
- Column A shows some correlation with the target
Let’s see how XGBoost learns these patterns!
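The article's 20-row table appears as an image in the original post and isn't reproduced here, so the sketch below builds a stand-in DataFrame with the same shape and the patterns just described (Column B mirrors Target Y, Column C spans 0 to 6, Column A loosely tracks the target). The individual cell values are assumptions for illustration, not the article's actual numbers.

```python
import pandas as pd

# Stand-in for the tutorial's 20-sample table (cell values are assumed).
# Target Y and Column B follow the pattern described above (B = 1 -> Y = 1).
target_y = [0, 0, 1, 1, 0, 1, 0, 1, 0, 1,
            0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
data = pd.DataFrame({
    "Column A": [1, 0, 3, 2, 1, 4, 0, 5, 2, 3, 1, 4, 0, 5, 2, 3, 1, 4, 0, 5],
    "Column B": target_y,   # perfect predictor of the target in this dataset
    "Column C": [2, 3, 6, 0, 1, 4, 2, 5, 1, 3, 2, 6, 3, 5, 0, 4, 1, 6, 2, 3],
    "Column D": [0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0],
    "Target Y": target_y,
}, index=[f"x{i}" for i in range(1, 21)])
print(data)
```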
Part 1: How Decision Trees Learn from Data
The Recursive Splitting Process
Before diving into XGBoost, let’s understand how a single decision tree learns. Training a tree means recursively splitting the training data by examining different feature values.
Using our tutorial dataset with 20 samples (features A, B, C, D and target Y), let’s see how a tree is built.
Building the First Tree — Step by Step
Step 1: Root Node Decision
At the root node, we have all 20 samples. The algorithm examines all features looking for the best split. Let’s say it evaluates “Column B < 1” (i.e., Column B = 0):
Left Branch (Column B = 0):
- Samples: x₁, x₂, x₅, x₇, x₉, x₁₁, x₁₃, x₁₅, x₁₇, x₁₉ (10 samples)
- Target Y values: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
- All 10 samples have Target Y = 0!
Right Branch (Column B = 1):
- Samples: x₃, x₄, x₆, x₈, x₁₀, x₁₂, x₁₄, x₁₆, x₁₈, x₂₀ (10 samples)
- Target Y values: 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
- All 10 samples have Target Y = 1!
Amazing! This single split perfectly separates our classes. Column B is a perfect predictor in this dataset.
Step 2: Do We Need More Splits?
Let’s examine the left child (Column B = 0):
- All samples have the same target (Y = 0)
- No further splitting needed → This becomes a leaf node with prediction = 0
The right child (Column B = 1):
- All samples have the same target (Y = 1)
- No further splitting needed → This becomes a leaf node with prediction = 1
Our first tree is complete with just one split!
Root: All 20 samples
          |
   (Column B < 1?)
      /        \
    YES         NO
    /             \
Left Leaf:       Right Leaf:
Predict Y = 0    Predict Y = 1
(10 samples)     (10 samples)
The Key Question: Which Column and Value to Split On?
In our example, Column B was obvious, but typically the algorithm must evaluate:
- All columns: A, B, C, D
- All possible threshold values in each column
For Column A: try splits at 0.5, 1.5, 2.5, 3.5, 4.5, 5.5, …
For Column C: try splits at 0.5, 1.5, 2.5, 3.5, 4.5, 5.5, …
The algorithm calculates a score for each potential split and chooses the best one. This is where the loss function comes into play, which we’ll explore next.
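As a quick illustration of that search space, here is a minimal sketch that enumerates candidate thresholds as the midpoints between consecutive sorted unique values of a column; this is where the 0.5, 1.5, 2.5, … splits above come from. The Column C values are assumed for illustration.

```python
import numpy as np

def candidate_thresholds(values):
    """Midpoints between consecutive sorted unique values of one column."""
    v = np.unique(values)              # sorted unique values
    return (v[:-1] + v[1:]) / 2.0      # e.g. [0, 1, 2, ...] -> [0.5, 1.5, ...]

# Assumed integer values for Column C (range 0-6, as described earlier).
column_c = np.array([2, 3, 6, 0, 1, 4, 2, 5, 1, 3, 2, 6, 3, 5, 0, 4, 1, 6, 2, 3])
print(candidate_thresholds(column_c))  # [0.5 1.5 2.5 3.5 4.5 5.5]
```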
Realistic Scenario: What If Column B Had Noise?
Let’s imagine our dataset wasn’t perfect. Suppose x₁₉ (Column B = 0) had Target Y = 1 instead of 0.
Left Branch (Column B = 0):
- 9 samples with Y = 0, 1 sample with Y = 1
- Not perfect, but still predominantly class 0
- Prediction: Y = 0 (majority vote)
Right Branch (Column B = 1):
- 10 samples with Y = 1
- Perfect separation
- Prediction: Y = 1
This is where gradient boosting helps: the next tree will focus on correcting that one misclassified sample (x₁₉).
Part 2: The Ensemble Approach — Building Trees Sequentially
From Single Trees to Boosted Ensembles
XGBoost doesn’t build just one tree — it builds an ensemble of trees sequentially, where each new tree corrects the errors of the previous trees.
Continuing with Our Tutorial Dataset
Remember our first tree? It used Column B to perfectly classify:
- Prediction for Column B = 0: Y_pred = 0
- Prediction for Column B = 1: Y_pred = 1
But what if our dataset had some complexity that one tree couldn’t capture?
Building Tree 2: Correcting the Residuals
Let’s add realistic complexity. Suppose Tree 1 made these predictions:
(Predictions aren’t exactly 0 or 1 due to regularization, which we’ll explain later)
Tree 2’s Job: Predict these residuals (0.0, 0.0, 0.1, 0.1, …)
Tree 3’s Job: Predict the residuals from Tree 1 + Tree 2
This is the heart of gradient boosting: each new tree is trained to predict the residual errors from all previous trees combined.
The Sequential Learning Process
Final Prediction = Tree₁ + Tree₂ + Tree₃ + Tree₄ + Tree₅

For sample x₃:
Tree₁ predicts: 0.9
Tree₂ predicts: 0.08 (trying to correct the 0.1 error)
Tree₃ predicts: 0.015 (trying to correct remaining error)
...
Final prediction = 0.9 + 0.08 + 0.015 + ... ≈ 1.0
Key Question: How do we build Tree 2 optimally? We need to know:
- What errors did Tree 1 make?
- How should we split the data to best correct these errors?
- What should each leaf predict?
This is where the loss function and XGBoost’s mathematical framework come in!
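Before diving into that framework, here is a minimal sketch of the sequential recipe itself, using plain squared-error residuals and sklearn's DecisionTreeRegressor as a stand-in weak learner. XGBoost builds its trees with the gradient/Hessian machinery covered in the next parts and adds regularization on top; this sketch only shows the "each tree corrects the previous trees" loop.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # stand-in weak learner for illustration

def boosted_predict(trees, X, learning_rate=0.1):
    """Ensemble prediction = sum of every tree's (shrunken) output."""
    pred = np.zeros(X.shape[0])
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred

def fit_boosted_ensemble(X, y, n_trees=20, learning_rate=0.1):
    """Each new tree is fit to the residual errors of all previous trees."""
    trees = []
    for _ in range(n_trees):
        residuals = y - boosted_predict(trees, X, learning_rate)
        trees.append(DecisionTreeRegressor(max_depth=2).fit(X, residuals))
    return trees

# Tiny demo on assumed data: predictions creep toward the targets round by round.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 0.0, 1.0, 1.0])
trees = fit_boosted_ensemble(X, y)
print(boosted_predict(trees, X))   # approaches [0, 0, 1, 1] as trees accumulate
```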
Part 3: The Loss Function — Minimizing Bias and Variance Simultaneously
Engineering the Objective
XGBoost’s secret weapon is its carefully engineered loss function that simultaneously addresses two fundamental challenges in machine learning: bias and variance.
The total loss function in XGBoost has two main components:
Component 1: Minimizing Bias (Training Error)
The first term sums over all samples in the training data:
Σ l(ŷᵢ, yᵢ)
Where:
- ŷᵢ = prediction for sample i
- yᵢ = actual target value for sample i
- l() = loss function (such as mean squared error or cross-entropy)
This term ensures we fit the training data well, keeping underfitting (bias) in check.
Component 2: Controlling Variance (Regularization)
The second term sums over all trees:
Σ Ω(fₖ)
The regularization function Ω is decomposed into two parts:
1. Leaf Penalty: γT, where T is the number of leaves
 - Penalizes trees with too many leaves
 - Simpler trees → lower variance → better generalization
2. Weight Penalty: ½λ||w||², where w are the leaf predictions (weights)
 - Penalizes extreme predictions at each leaf
 - Smaller predictions from each tree → gradual corrections → less overfitting
Key Insight: By minimizing both terms simultaneously through hyperparameters γ and λ, XGBoost finds the sweet spot between bias and variance that traditional gradient boosting methods struggle to achieve.
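In the XGBoost Python package these two knobs map onto hyperparameters: λ is `reg_lambda` and γ is `gamma` (also exposed as `min_split_loss`). A minimal sketch, assuming the `xgboost` package is installed and using assumed toy data:

```python
import numpy as np
import xgboost as xgb

# Toy binary-classification data (assumed), just to show where the knobs go.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 1] > 0).astype(int)

model = xgb.XGBClassifier(
    n_estimators=50,      # number of boosted trees
    learning_rate=0.1,    # shrinkage applied to each tree's leaf weights
    max_depth=3,          # limits tree size (another way to control T)
    reg_lambda=1.0,       # lambda: L2 penalty on leaf weights (variance control)
    gamma=0.0,            # gamma: minimum loss reduction required to make a split
)
model.fit(X, y)
print(model.predict_proba(X[:3]))
```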
Part 4: Breaking Down the Sample-Level Loss Function
How Predictions Build on Each Other
When building tree t, we need to understand how predictions accumulate from previous trees.
The sample-level loss function shows:
l(yᵢ, ŷᵢ^(t)) = l(yᵢ, ŷᵢ^(t-1) + fₜ(xᵢ))
Breaking this down:
- ŷᵢ^(t-1): The prediction from all previous trees (1 through t-1)
- fₜ(xᵢ): The prediction from the new tree t we’re building
- xᵢ: The feature values for sample i
Example with Our Tutorial Dataset
Let’s work through a concrete example using our 20-sample dataset.
Scenario: Building Tree 2 (t=2)
For sample x₃ (Column A=3, Column B=1, Column C=6, Column D=1, True Y=1):
Step 1: What did Tree 1 predict?
ŷ₃^(1) = 0.85 (Tree 1’s prediction for x₃)
Step 2: What’s the actual target?
y₃ = 1.0
Step 3: What’s the error?
Error = y₃ - ŷ₃^(1) = 1.0 - 0.85 = 0.15
Step 4: What should Tree 2 predict?
Tree 2 should predict approximately 0.15 to correct the error. If we’re using log loss (for classification), the exact value depends on the gradient and Hessian, but intuitively:
New prediction: ŷ₃^(2) = ŷ₃^(1) + f₂(x₃) = 0.85 + 0.12 = 0.97
Getting closer to the true value of 1.0!
Full Example for Tree 2
Observation:
- Samples with negative errors (predicted too high): Tree 2 predicts negative values
- Samples with positive errors (predicted too low): Tree 2 predicts positive values
The loss function guides Tree 2 on how to split the data to best capture these error patterns.
For instance, Tree 2 might discover:
- When Column C < 3: Errors tend to be negative (over-predicted)
- When Column C ≥ 3: Errors tend to be positive (under-predicted)
And use this to make a split!
Part 5: The Taylor Approximation — Making the Math Tractable
Simplifying Complex Loss Functions
The challenge with the sample-level loss function is that its analytical form depends on which loss function we choose (MSE, cross-entropy, etc.). XGBoost uses a clever mathematical trick: the second-order Taylor expansion.
The Taylor approximation transforms the loss function into a more manageable form:
l(yᵢ, ŷᵢ^(t-1) + fₜ(xᵢ)) ≈ l(yᵢ, ŷᵢ^(t-1)) + gᵢfₜ(xᵢ) + ½hᵢfₜ²(xᵢ)
Computing Gradients and Hessians for Our Dataset
Let’s calculate actual values using log loss (binary cross-entropy), which is standard for binary classification:
l(y, ŷ) = -y·log(ŷ) - (1-y)·log(1-ŷ)
Gradient (First Derivative):
gᵢ = ∂l/∂ŷ = ŷᵢ^(t-1) - yᵢ
Hessian (Second Derivative):
hᵢ = ∂²l/∂ŷ² = ŷᵢ^(t-1)·(1 - ŷᵢ^(t-1))
Concrete Example: Sample x₃ from Our Dataset
Given:
- True target: y₃ = 1
- Tree 1 prediction: ŷ₃^(1) = 0.85
- We’re building Tree 2
Calculate Gradient:
g₃ = ŷ₃^(1) - y₃ = 0.85 - 1.0 = -0.15
Interpretation: Negative gradient means we predicted too low, need to increase prediction.
Calculate Hessian:
h₃ = ŷ₃^(1) · (1 - ŷ₃^(1)) = 0.85 · 0.15 = 0.1275
Interpretation: Hessian represents the curvature. Higher values mean we’re more confident about the gradient direction.
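These two one-liners are easy to check numerically. The sketch below reproduces g₃ and h₃ using the log-loss gradient/Hessian formulas above (0.85 is the Tree 1 probability assumed in this example):

```python
def logloss_grad_hess(p_prev, y_true):
    """Gradient and Hessian of log loss w.r.t. the raw score, at probability p_prev."""
    g = p_prev - y_true            # g_i = yhat^(t-1) - y_i
    h = p_prev * (1.0 - p_prev)    # h_i = yhat^(t-1) * (1 - yhat^(t-1))
    return g, h

g3, h3 = logloss_grad_hess(p_prev=0.85, y_true=1.0)
print(round(g3, 4), round(h3, 4))   # -0.15  0.1275
```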
Full Calculations for Our 20 Samples
Building Tree 2, here are gradients and Hessians:
Patterns Observed:
- Class 0 samples (Y=0): Positive gradients (we over-predicted, need to decrease)
- Class 1 samples (Y=1): Negative gradients (we under-predicted, need to increase)
- Hessians all positive, representing curvature of loss function
Understanding Each Term:
- Constant Term: l(yᵢ, ŷᵢ^(t-1)), the loss from the previous trees (fixed)
- Gradient Term: gᵢfₜ(xᵢ)
  - For x₃: g₃ = -0.15; if Tree 2 predicts f₂(x₃) = 0.12, this term = -0.15 × 0.12 = -0.018
  - Negative value → decreases loss → good!
- Hessian Term: ½hᵢfₜ²(xᵢ)
  - For x₃: h₃ = 0.128 and f₂(x₃) = 0.12, so this term = 0.5 × 0.128 × 0.12² ≈ 0.00092
  - Positive value → increases loss slightly
  - Acts as a penalty for large predictions (regularization effect)
Visual Interpretation:
The figure shows three scenarios:
- Left: High gradient (steep slope) — we need to move predictions significantly
- Center: The “bowl” shape shows how the Taylor approximation mirrors the true loss
- Right: The curvature (Hessian) tells us whether the loss surface is gentle or steep
Key Insight: This approximation is exact when the new tree’s predictions are small (fₜ → 0), which is exactly what regularization encourages! This is why XGBoost uses small learning rates and many trees.
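To see how tight the second-order approximation is, the sketch below compares the true change in log loss with the Taylor estimate g·f + ½h·f² for sample x₃. Note that the expansion is taken in raw-score (pre-sigmoid) space, so the 0.85 probability is first converted to its logit; the 0.12 step is the assumed Tree 2 contribution from the example above.

```python
import math

y, p_prev, f_step = 1.0, 0.85, 0.12            # assumed values for sample x3

def logloss(y, p):
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

z_prev = math.log(p_prev / (1 - p_prev))       # raw score (logit) behind the 0.85 probability
p_new = 1 / (1 + math.exp(-(z_prev + f_step))) # probability after adding Tree 2's step

true_change = logloss(y, p_new) - logloss(y, p_prev)
g = p_prev - y                                 # gradient
h = p_prev * (1 - p_prev)                      # Hessian
taylor_change = g * f_step + 0.5 * h * f_step ** 2

print(round(true_change, 5), round(taylor_change, 5))   # -0.01713 vs -0.01708
```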
Part 6: Focusing on What Matters
Discarding Constants and Focusing on the New Tree
When optimizing for the new tree, we can simplify our objective by removing terms that don’t depend on the new tree.
The Simplified Loss Function:
Looking at our expanded loss:
- Constant Term (past errors): l(yᵢ, ŷᵢ^(t-1)) doesn't depend on the new tree → Discard
- Gradient Term: gᵢfₜ(xᵢ) depends on the new tree → Keep
- Hessian Term: ½hᵢfₜ²(xᵢ) depends on the new tree → Keep
- Regularization Term: Ω(fₜ) depends on the new tree → Keep
This gives us the optimization objective for tree t:
L̃^(t) = Σᵢ₌₁ⁿ (gᵢfₜ(xᵢ) + ½hᵢfₜ²(xᵢ)) + Ω(fₜ)
Two Distinct Components:
- Optimization Term: Finds the best prediction step using gradients and Hessians
- Regularization Term: Penalizes complexity to prevent overfitting
The figure beautifully illustrates this separation of concerns.
Part 7: Regrouping by Leaves — A Game-Changing Perspective
From Sample Sums to Leaf Sums
Here’s where XGBoost’s brilliance truly shines. Instead of summing over samples, we regroup the sum to iterate over leaves in the tree.
The Mathematical Trick:
Original (sum over samples):
Σᵢ₌₁ⁿ (gᵢfₜ(xᵢ) + ½hᵢfₜ²(xᵢ))
Regrouped (sum over leaves):
Σⱼ₌₁ᵀ ((Σᵢ∈Iⱼ gᵢ)wⱼ + ½(Σᵢ∈Iⱼ hᵢ + λ)wⱼ²) + γT
Understanding the Notation:
- T: Total number of leaves in the tree
- Iⱼ: Set of all samples that land in leaf j
- wⱼ: The prediction value at leaf j (what we’re solving for)
- Gⱼ = Σᵢ∈Iⱼ gᵢ: Total gradient in leaf j
- Hⱼ = Σᵢ∈Iⱼ hᵢ: Total Hessian in leaf j
Key Insight: For all samples in the same leaf j, the prediction fₜ(xᵢ) is the same leaf weight wⱼ. This allows us to factor out wⱼ and regroup!
Why This Matters:
This regrouping transforms the problem from “How should each sample contribute?” to “What should each leaf predict?” This is computationally much more efficient and leads to the optimal weight formula.
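A short sketch of the regrouping: given each sample's gradient, Hessian, and assigned leaf, summing per leaf produces exactly the Gⱼ and Hⱼ that the next part plugs into the optimal-weight formula. The (gradient, Hessian, leaf) triples below are assumed values in the spirit of the running example.

```python
from collections import defaultdict

# Assumed per-sample values: (gradient, hessian, index of the leaf the sample falls into)
samples = [(-0.15, 0.128, 1), (0.10, 0.090, 0), (0.12, 0.106, 0), (-0.12, 0.106, 1)]

G = defaultdict(float)   # G_j = sum of gradients in leaf j
H = defaultdict(float)   # H_j = sum of Hessians in leaf j
for g, h, leaf in samples:
    G[leaf] += g
    H[leaf] += h

print(dict(G), dict(H))  # roughly {1: -0.27, 0: 0.22} and {1: 0.234, 0: 0.196}
```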
Part 8: Finding Optimal Leaf Weights
Calculus to the Rescue
To find the optimal leaf weights, we use a fundamental calculus principle: the minimum of a function occurs where its derivative equals zero.
Calculating Optimal Weights for Our Dataset
Let’s use our tutorial data and see how to calculate actual leaf weights when building Tree 2.
Part 1: Understanding the Formula
Taking the derivative with respect to wⱼ and setting it to zero:
∂L̃^(t)/∂wⱼ = Gⱼ + (Hⱼ + λ)wⱼ = 0
Solving this equation gives us:
wⱼ* = -Gⱼ/(Hⱼ + λ)
Where:
- Gⱼ = Σᵢ∈Iⱼ gᵢ: Sum of gradients for all samples in leaf j
- Hⱼ = Σᵢ∈Iⱼ hᵢ: Sum of Hessians for all samples in leaf j
- λ: Regularization parameter (let’s use λ = 1.0)
Concrete Example: Computing Leaf Weights
Suppose Tree 2 splits on Column B (like Tree 1), creating two leaves:
Leaf 1 (Column B = 0) — Contains samples where Y = 0:
Samples: x₁, x₂, x₅, x₇, x₉, x₁₁, x₁₃, x₁₅, x₁₇, x₁₉
Calculate Gⱼ:
G₁ = g₁ + g₂ + g₅ + g₇ + g₉ + g₁₁ + g₁₃ + g₁₅ + g₁₇ + g₁₉
G₁ = 0.10 + 0.12 + 0.08 + 0.11 + 0.09 + 0.10 + 0.13 + 0.07 + 0.09 + 0.12
G₁ = 1.01
Calculate Hⱼ:
H₁ = h₁ + h₂ + h₅ + h₇ + h₉ + h₁₁ + h₁₃ + h₁₅ + h₁₇ + h₁₉
H₁ = 0.090 + 0.106 + 0.074 + 0.098 + 0.082 + 0.090 + 0.113 + 0.065 + 0.082 + 0.106
H₁ = 0.906
Calculate optimal weight:
w₁* = -G₁/(H₁ + λ) = -1.01/(0.906 + 1.0) = -1.01/1.906 = -0.530
Interpretation: Tree 2 predicts -0.530 for samples in Leaf 1. This negative value reduces the overall prediction, correcting the slight over-predictions from Tree 1 for class 0 samples.
Leaf 2 (Column B = 1) — Contains samples where Y = 1:
Samples: x₃, x₄, x₆, x₈, x₁₀, x₁₂, x₁₄, x₁₆, x₁₈, x₂₀
Calculate Gⱼ:
G₂ = g₃ + g₄ + g₆ + g₈ + g₁₀ + g₁₂ + g₁₄ + g₁₆ + g₁₈ + g₂₀
G₂ = -0.15 + -0.12 + -0.13 + -0.11 + -0.16 + -0.10 + -0.14 + -0.17 + -0.12 + -0.15
G₂ = -1.35
Calculate Hⱼ:
H₂ = h₃ + h₄ + h₆ + h₈ + h₁₀ + h₁₂ + h₁₄ + h₁₆ + h₁₈ + h₂₀
H₂ = 0.128 + 0.106 + 0.113 + 0.098 + 0.134 + 0.090 + 0.120 + 0.141 + 0.106 + 0.128
H₂ = 1.164
Calculate optimal weight:
w₂* = -G₂/(H₂ + λ) = -(-1.35)/(1.164 + 1.0) = 1.35/2.164 = 0.624
Interpretation: Tree 2 predicts +0.624 for samples in Leaf 2. This positive value increases the overall prediction, correcting the slight under-predictions from Tree 1 for class 1 samples.
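Plugging the G and H sums above into wⱼ* = -Gⱼ/(Hⱼ + λ) is a two-line check (λ = 1.0, as in the example):

```python
def optimal_leaf_weight(G, H, lam=1.0):
    """XGBoost's closed-form leaf weight: w* = -G / (H + lambda)."""
    return -G / (H + lam)

print(optimal_leaf_weight(G=1.01, H=0.906))    # ~ -0.530 (Leaf 1)
print(optimal_leaf_weight(G=-1.35, H=1.164))   # ~  0.624 (Leaf 2)
```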
Updated Predictions After Tree 2
For sample x₃ (in Leaf 2):
Previous prediction (Tree 1): ŷ₃^(1) = 0.85
Tree 2 contribution: w₂* = 0.624
New prediction: ŷ₃^(2) = 0.85 + 0.624 = 1.474 (before sigmoid)
After sigmoid: σ(1.474) = 1/(1 + e^(-1.474)) ≈ 0.81
We’re getting closer to the true value of 1.0!
For sample x₁ (in Leaf 1):
Previous prediction (Tree 1): ŷ₁^(1) = 0.10
Tree 2 contribution: w₁* = -0.530
New prediction: ŷ₁^(2) = 0.10 + (-0.530) = -0.430 (before sigmoid)
After sigmoid: σ(-0.430) = 1/(1 + e^(0.430)) ≈ 0.39
Hmm, this actually moved us away from 0! This is where Tree 3 comes in to correct further.
Understanding the Formula Components
Intuitive Understanding:
If you have:
- High gradients (large errors) → Larger |w*| to correct more aggressively
- High Hessian (high certainty) → Smaller |w*| to be conservative
- High λ (strong regularization) → Smaller |w*| to avoid overfitting
Effect of λ:
If λ = 0 (no regularization):
w₁* = -1.01/0.906 = -1.115 (larger magnitude)
w₂* = 1.35/1.164 = 1.160 (larger magnitude)
If λ = 5.0 (strong regularization):
w₁* = -1.01/5.906 = -0.171 (much smaller)
w₂* = 1.35/6.164 = 0.219 (much smaller)
Higher λ creates smaller, more conservative corrections!
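A quick sweep over λ reproduces the numbers above (the helper is re-defined here so the snippet runs on its own):

```python
def optimal_leaf_weight(G, H, lam):
    return -G / (H + lam)

for lam in (0.0, 1.0, 5.0):
    w1 = optimal_leaf_weight(1.01, 0.906, lam)    # Leaf 1 (class 0 side)
    w2 = optimal_leaf_weight(-1.35, 1.164, lam)   # Leaf 2 (class 1 side)
    print(f"lambda={lam}: w1*={w1:.3f}, w2*={w2:.3f}")
```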
Part 2: Using Optimal Weights in Final Loss
We substitute wⱼ* back into the loss function to get the structure score for a tree with structure q:
L̃^(t)(q) = -½ Σⱼ₌₁ᵀ (Gⱼ²)/(Hⱼ + λ) + γT
For our Tree 2 with 2 leaves and γ = 0:
L̃^(2) = -½((G₁²)/(H₁ + λ) + (G₂²)/(H₂ + λ))
L̃^(2) = -½((1.01²)/(1.906) + (1.35²)/(2.164))
L̃^(2) = -½(0.535 + 0.842)
L̃^(2) = -½(1.377)
L̃^(2) = -0.689
This is the objective score for Tree 2’s structure. Lower (more negative) scores are better! We’ll use this to evaluate different split options.
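The structure score is just as direct to compute; with γ = 0 this reproduces the -0.689 above:

```python
def structure_score(leaf_stats, lam=1.0, gamma=0.0):
    """L-tilde = -1/2 * sum_j G_j^2 / (H_j + lambda) + gamma * T"""
    T = len(leaf_stats)
    return -0.5 * sum(G * G / (H + lam) for G, H in leaf_stats) + gamma * T

print(structure_score([(1.01, 0.906), (-1.35, 1.164)]))   # ~ -0.689
```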
Part 9: The Split-Finding Algorithm
Understanding the Objective and Splitting Criteria
Now we have all the pieces to understand how XGBoost finds the best splits!
The Complete Picture:
The figure shows the relationship between:
- Total Loss for the Tree Structure:
L̃^(t)(q) = -½ Σⱼ₌₁ᵀ (Gⱼ²)/(Hⱼ + λ) + γT
- Split Gain Formula:
L_split = ½((G_L²)/(H_L + λ) + (G_R²)/(H_R + λ) - (G_I²)/(H_I + λ)) - γ
Breaking Down the Split Gain:
Consider a parent node with samples, and a proposed split:
- G_I, H_I: Gradient and Hessian sums in the parent (current node)
- G_L, H_L: Gradient and Hessian sums in the left child after split
- G_R, H_R: Gradient and Hessian sums in the right child after split
The Negative Structured Score:
The term (Gⱼ²)/(Hⱼ + λ) appears three times with different signs:
- Positive for left child: Lower loss in left child is good
- Positive for right child: Lower loss in right child is good
- Negative for parent: We subtract the parent’s loss (opportunity cost)
The gain formula tells us: How much better is the split compared to not splitting?
Regularization Term: -γ
- Each split adds a new leaf (increases T by 1)
- We subtract γ to penalize unnecessary complexity
- Only split if the gain exceeds this penalty!
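Putting these pieces together in code, the gain of a candidate split is a short function of the left- and right-child gradient/Hessian sums (a minimal sketch; the example call uses assumed sums):

```python
def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    """Gain = 1/2 * [G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma"""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R)) - gamma

print(split_gain(G_L=0.5, H_L=0.8, G_R=-0.9, H_R=1.2))   # assumed sums; positive gain
```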
Part 10: Gradient vs. Curvature — The XGBoost Advantage
Why XGBoost Outperforms Traditional Gradient Boosting
This figure reveals the fundamental difference between XGBoost and traditional Gradient Boosting Machines (GBM).
The Objective Function Breakdown:
Gain = (GRAD_j²)/(CURV_j + λ)
Scenario Analysis:
Scenario 1: The XGBoost Choice
- Low loss decrease (modest gradient reduction)
- Low curvature (smooth, generalizable surface)
- Result: High gain score
- Why it’s better: Balances loss reduction with a smoother, more generalizable surface
Scenario 2: What GBM Would Choose
- High loss decrease (steep gradient reduction)
- High curvature (wiggly, complex surface)
- Result: Lower gain score (after accounting for curvature)
- Why it’s worse: Aggressive fitting leads to overfitting
The Mathematical Connection to Bias-Variance:
The figure makes an elegant connection:
- -GRAD_j ~ Bias: Maximizing gradient reduction minimizes bias (underfitting)
- -CURV_j ~ Variance: Minimizing curvature minimizes variance (overfitting)
Visual Comparison:
The second figure shows two split options:
- XGBoost’s Choice (left): Modest loss decrease but smooth curvature
- GBM’s Choice (right): Aggressive loss decrease but high curvature risk
The Bottom Line:
XGBoost tries to minimize both Bias and Variance
=> XGBoost overfits less than GBM
This is why XGBoost consistently outperforms traditional gradient boosting in practice!
Part 11: Finding the Right Split — The Practical Algorithm
Step-by-Step Split Search Process
Now let’s see how XGBoost actually finds the best split at each node using our tutorial dataset.
Real Calculation: Finding the Best Split for Tree 2
We’re building Tree 2. All 20 samples are at the root node with their gradients and Hessians calculated.
The Question: Which feature and threshold should we split on?
Step 1: Sort Column C Values
Our 20 samples sorted by Column C:
Observation: Notice how Target Y values change as Column C increases!
Step 2: Evaluate Split “Column C < 3”
Left child (Column C < 3): 9 samples
Samples: x₄, x₁₅, x₅, x₉, x₁₇, x₁, x₇, x₁₁, x₁₉
Calculate sums:
G_L = -0.12 + 0.07 + 0.08 + 0.09 + 0.09 + 0.10 + 0.11 + 0.10 + 0.12 = 0.64
H_L = 0.106 + 0.065 + 0.074 + 0.082 + 0.082 + 0.090 + 0.098 + 0.090 + 0.106 = 0.793
Right child (Column C ≥ 3): 11 samples
Samples: x₂, x₁₀, x₁₃, x₂₀, x₆, x₁₆, x₈, x₁₄, x₃, x₁₂, x₁₈
Calculate sums:
G_R = 0.12 + -0.16 + 0.13 + -0.15 + -0.13 + -0.17 + -0.11 + -0.14 + -0.15 + -0.10 + -0.12
G_R = -1.08
H_R = 0.106 + 0.134 + 0.113 + 0.128 + 0.113 + 0.141 + 0.098 + 0.120 + 0.128 + 0.090 + 0.106
H_R = 1.277
Parent node (all 20 samples):
G_parent = G_L + G_R = 0.64 + (-1.08) = -0.44
H_parent = H_L + H_R = 0.793 + 1.277 = 2.070
Calculate Gain (with λ=1.0, γ=0):
Gain = ½((G_L²)/(H_L + λ) + (G_R²)/(H_R + λ) - (G_parent²)/(H_parent + λ)) - γ
Gain = ½((0.64²)/(0.793 + 1.0) + (1.08²)/(1.277 + 1.0) - (0.44²)/(2.070 + 1.0)) - 0
Gain = ½((0.4096/1.793) + (1.1664/2.277) - (0.1936/3.070))
Gain = ½(0.228 + 0.512 - 0.063)
Gain = ½(0.677)
Gain = 0.339
Positive gain means this split improves the loss function!
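As a standalone numeric check, the same gain computed in code from the sums above (λ = 1.0, γ = 0):

```python
lam = 1.0
G_L, H_L = 0.64, 0.793      # Column C < 3
G_R, H_R = -1.08, 1.277     # Column C >= 3
G_P, H_P = G_L + G_R, H_L + H_R

gain = 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam) - G_P**2 / (H_P + lam))
print(round(gain, 3))        # 0.339
```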
Step 3: Evaluate Other Splits
Let’s try “Column C < 2”:
Left (C < 2): 5 samples (x₄, x₁₅, x₅, x₉, x₁₇)
G_L = -0.12 + 0.07 + 0.08 + 0.09 + 0.09 = 0.21
H_L = 0.106 + 0.065 + 0.074 + 0.082 + 0.082 = 0.409
Right (C ≥ 2): 15 samples (remaining)
G_R = -0.44 - 0.21 = -0.65
H_R = 2.070 - 0.409 = 1.661
Gain = ½((0.21²/1.409) + (0.65²/2.661) - (0.44²/3.070))
Gain = ½(0.031 + 0.159 - 0.063)
Gain = 0.064
Lower gain than “C < 3”!
Let’s try “Column C < 4”:
Left (C < 4): 13 samples
G_L = 0.64 + 0.12 + (-0.16) + 0.13 + (-0.15) = 0.58
H_L = 0.793 + 0.106 + 0.134 + 0.113 + 0.128 = 1.274
Right (C ≥ 4): 7 samples
G_R = -0.44 - 0.58 = -1.02
H_R = 2.070 - 1.274 = 0.796
Gain = ½((0.58²/2.274) + (1.02²/1.796) - (0.44²/3.070))
Gain = ½(0.148 + 0.580 - 0.063)
Gain = 0.333
Still lower than “C < 3”!
Step 4: Compare All Features
We would repeat this process for:
- Column A: Try splits at 0.5, 1.5, 2.5, 3.5, 4.5, 5.5
- Column B: Try split at 0.5 (only two values: 0 and 1)
- Column C: Already evaluated multiple splits
- Column D: Try split at 0.5 (only two values: 0 and 1)
Results Summary:
Winner: Column B < 0.5 has the highest gain!
This makes sense — Column B is a strong predictor in our dataset, clearly separating classes 0 and 1.
Tree 2 will split on Column B, just like Tree 1, but with different leaf weights to correct residual errors.
Part 12: Parallelization — Speed Through Smart Design
How XGBoost Achieves Fast Training
One of XGBoost’s practical advantages is its speed, achieved through clever parallelization.
The Parallel Split Search Strategy:
Since gradient boosting builds trees sequentially (can’t parallelize across trees), XGBoost parallelizes the split-finding process:
Thread Distribution:
- Thread 1: Searches for best split in Column A
- Thread 2: Searches for best split in Column B
- Thread 3: Searches for best split in Column C
- Thread 4: Searches for best split in Column D
Each thread independently:
- Sorts its column (if not pre-sorted)
- Evaluates all possible splits
- Computes the gain for each split
- Reports the best split for its column
Main Thread Coordination:
After all threads complete:
- Collect best split from each column
- Use ArgMax to find the overall best split
- Apply the winning split (e.g., “Column C < c”)
Why This Works:
The independence of column evaluations allows perfect parallelization:
- No thread dependencies
- No synchronization needed during search
- Only final aggregation requires coordination
Real-World Impact:
On a 4-core machine, you could see close to 4x speedup in split-finding, which is the most computationally intensive part of tree building!
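In the Python package, the number of parallel threads is controlled by `n_jobs` (or the booster's `nthread` parameter); a minimal sketch with assumed toy data:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 4))
y = (X[:, 1] + 0.1 * rng.normal(size=10_000) > 0).astype(int)

# n_jobs: number of threads XGBoost uses; split search is the hot loop it parallelizes.
model = xgb.XGBClassifier(n_estimators=100, n_jobs=4)
model.fit(X, y)
```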
Part 13: Handling Missing Values Intelligently
XGBoost’s Clever Approach to Missing Data
Unlike many algorithms that require preprocessing for missing values, XGBoost handles them naturally during training.
Adding Missing Values to Our Tutorial Dataset
Let’s modify our dataset to include some missing values and see how XGBoost handles them.
Modified Dataset (Column C with missing values):
We have 2 missing values in Column C (samples x₂ and x₅).
Traditional approaches would require:
- Imputation (filling with mean=2.8, median=3, etc.)
- Deletion (removing rows x₂ and x₅)
- Creating an indicator variable
XGBoost does something smarter!
The Missing Value Algorithm
Step 1: Sort and Separate
When evaluating split “Column C < 3”:
Non-missing samples sorted:
Col C: 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 6, 6, 6
Missing samples (set aside):
x₂: Target Y = 0, Gradient = 0.12
x₅: Target Y = 0, Gradient = 0.08
Step 2: Try Missing Values Going LEFT
Configuration 1: Missing → Left
Left child (C < 3 OR C is missing):
- Non-missing: x₄, x₁₅, x₅, x₉, x₁₇, x₁, x₇, x₁₁, x₁₉ (9 samples)
- Missing: x₂, x₅ (2 samples)
- Total: 11 samples
Calculate:
G_L = (sum of 9 non-missing gradients) + 0.12 + 0.08
G_L = 0.64 + 0.20 = 0.84
H_L = (sum of 9 non-missing Hessians) + h₂ + h₅
H_L = 0.793 + 0.106 + 0.074 = 0.973
Right child (C ≥ 3 and C not missing): 9 samples
G_R = -1.08
H_R = 1.277
Calculate gain:
Gain_left = ½((0.84²/1.973) + (1.08²/2.277) - (0.40²/3.250))
Gain_left = ½(0.357 + 0.512 - 0.049)
Gain_left = 0.410
Step 3: Try Missing Values Going RIGHT
Configuration 2: Missing → Right
Left child (C < 3 and C not missing): 9 samples
G_L = 0.64
H_L = 0.793
Right child (C ≥ 3 OR C is missing):
- Non-missing: x₂, x₁₀, x₁₃, x₂₀, x₆, x₁₆, x₈, x₁₄, x₃, x₁₂, x₁₈ (11 samples)
- Missing: x₂, x₅ (2 samples)
- Total: 13 samples
Calculate:
G_R = -1.08 + 0.12 + 0.08 = -0.88
H_R = 1.277 + 0.106 + 0.074 = 1.457
Calculate gain:
Gain_right = ½((0.64²/1.793) + (0.88²/2.457) - (0.40²/3.250))
Gain_right = ½(0.228 + 0.316 - 0.049)
Gain_right = 0.248
Step 4: Choose the Better Direction
Compare the two scenarios:
Gain_left = 0.410 (Missing → Left)
Gain_right = 0.248 (Missing → Right)
Since 0.410 > 0.248: Missing values go LEFT!
The split is:
IF (Column C < 3) OR (Column C is missing):
    Go LEFT
ELSE:
    Go RIGHT
Why Did Missing Go Left?
Let’s analyze our missing samples:
- x₂: Target Y = 0, Gradient = +0.12 (positive)
- x₅: Target Y = 0, Gradient = +0.08 (positive)
The left child predominantly contains:
- Samples with Target Y = 0
- Samples with positive gradients (need downward corrections)
The missing samples share these characteristics! They belong with the left group.
XGBoost discovered this pattern automatically without any manual feature engineering.
Real-World Interpretation
What does this mean in practice?
Imagine Column C represents “number of customer service calls”:
- Missing value might mean “no calls recorded” or “new customer”
- These samples might behave more like low-call customers (Y=0)
- XGBoost learns: missing → treat like “low calls”
Benefits:
- No imputation needed: Don’t have to guess values
- Preserves information: Missing might be meaningful
- Node-specific: Different nodes might route missing values differently
- Automatic: No manual decision required
Another Example: Different Node, Different Direction
At a different node in the tree, when evaluating “Column A < 3”, missing values might go RIGHT if that minimizes loss better. Each node makes its own decision based on the data at that node.
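In practice you simply leave the NaNs in place: the Python package treats `np.nan` as missing by default and learns a per-node default direction exactly as described above. A small sketch with assumed toy data:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 1] > 0).astype(int)

# Knock out a few entries in "Column C" (index 2); no imputation needed.
X[rng.choice(500, size=25, replace=False), 2] = np.nan

model = xgb.XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)              # NaNs are routed to each node's learned default side
print(model.predict(X[:5]))
```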
Part 14: Additional Performance Optimizations
Features That Make XGBoost Production-Ready
While we’ve covered the core algorithm, XGBoost includes several additional optimizations that make it fast and efficient in practice.
Four Key Performance Features
1. Pre-Sorted Column Blocks
What it does:
- Sorts all feature columns once at the beginning
- Keeps pointers to original indices
- Stores in compressed column format
Why it matters:
- Sorting has O(n log n) complexity
- Without pre-sorting, you’d sort at every node evaluation
- With pre-sorting, you sort once and reuse
- Massive time savings during split search
Trade-off: Higher initial memory usage for faster searches
2. Compressed Column Format
What it does:
- Stores data in a column-oriented format
- Compresses continuous features into bins
- Optimizes for CPU cache efficiency
Why it matters:
- Modern CPUs are fast but memory access is slow
- Column format allows better cache utilization
- When evaluating Column C splits, all Column C data is contiguous in memory
- Parallel threads can access different columns without cache conflicts
3. Cache-Aware Prefetching
What it does:
- Anticipates which memory locations will be needed next
- Loads data into CPU cache before it’s needed
- Minimizes cache misses from non-contiguous memory access
Why it matters:
- Pre-sorting changes data order, making memory access non-contiguous
- Without prefetching, each memory access might be a cache miss (slow!)
- With prefetching, data is ready when needed (fast!)
- Can provide 2–10x speedup depending on dataset size
Technical detail: Uses the CPU’s prefetch instructions to hint at future memory access patterns
4. Approximate Split-Finding (Histogram Method)
What it does:
- Instead of evaluating every possible split point
- Groups feature values into bins (histogram buckets)
- Only evaluates splits at bucket boundaries
Why it matters:
- For features with many unique values, evaluating all splits is expensive
- Histograms reduce the search space dramatically
- Example: 10,000 unique values → 256 histogram bins
- Massive speedup with minimal accuracy loss
How it works:
Original: 0.01, 0.02, 0.03, ..., 99.99, 100.00 (10,000 values)
Histogram: (0-1), (1-2), (2-3), ..., (99-100) (100 bins)
Only evaluate splits at bin boundaries, not at every individual value.
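In the Python package this is the `tree_method="hist"` option, with `max_bin` controlling the number of buckets; a minimal sketch with assumed toy data:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(50_000, 4))          # many unique values per column
y = (X[:, 0] + X[:, 1] > 100).astype(int)

# Histogram-based split finding: bucket each feature into at most max_bin bins.
model = xgb.XGBClassifier(tree_method="hist", max_bin=256, n_estimators=50)
model.fit(X, y)
```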
The Combined Impact
These optimizations work together:
- Pre-sorting eliminates repeated sorting overhead
- Column format + Cache prefetching optimize memory access
- Parallelization (from Part 12) uses multiple cores
- Histogram method reduces the search space
Real-World Performance:
On typical datasets, these optimizations combined can provide:
- 10–100x speedup over naive implementations
- Ability to handle datasets with millions of rows efficiently
- Sub-second training times for moderately sized datasets
This is why XGBoost became the standard tool for Kaggle competitions and production machine learning systems!
Published via Towards AI