Understanding XGBoost: A Deep Dive into the Algorithm

Author(s): Utkarsh Mittal

Originally published on Towards AI.

Introduction

XGBoost (Extreme Gradient Boosting) has become the go-to algorithm for winning machine learning competitions and solving real-world prediction problems. But what makes it so powerful? In this comprehensive tutorial, we’ll unpack the mathematical foundations and practical mechanisms that make XGBoost superior to traditional gradient boosting methods.

This tutorial assumes you have basic knowledge of decision trees and machine learning concepts. We’ll walk through the algorithm step-by-step with visual examples and a concrete dataset that we’ll follow throughout the entire tutorial.

Our Tutorial Dataset

Throughout this tutorial, we’ll use a consistent dataset to demonstrate every concept. This helps you see how the algorithm works on real data, step by step.

Training Example

Dataset Description

We have 20 samples (x₁ through x₂₀) with:

  • 4 features: Column A, Column B, Column C, Column D
  • 1 target variable: Target Y (binary: 0 or 1)

Understanding the Problem

This is a binary classification problem where Target Y is either 0 or 1. Our goal is to build a model that can distinguish between the two classes based on features A, B, C, and D.

Initial Observations:

  • When Column B = 1, Target Y tends to be 1 (positive class)
  • When Column B = 0, Target Y tends to be 0 (negative class)
  • Column C values range from 0 to 6
  • Column A shows some correlation with the target

Let’s see how XGBoost learns these patterns!

Part 1: How Decision Trees Learn from Data

The Recursive Splitting Process

Before diving into XGBoost, let’s understand how a single decision tree learns. Training a tree means recursively splitting the training data by examining different feature values.

Using our tutorial dataset with 20 samples (features A, B, C, D and target Y), let’s see how a tree is built.

Building the First Tree — Step by Step

Step 1: Root Node Decision

At the root node, we have all 20 samples. The algorithm examines all features looking for the best split. Let’s say it evaluates “Column B < 1” (i.e., Column B = 0):

Left Branch (Column B = 0):

  • Samples: x₁, x₂, x₅, x₇, x₉, x₁₁, x₁₃, x₁₅, x₁₇, x₁₉ (10 samples)
  • Target Y values: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
  • All 10 samples have Target Y = 0!

Right Branch (Column B = 1):

  • Samples: x₃, x₄, x₆, x₈, x₁₀, x₁₂, x₁₄, x₁₆, x₁₈, x₂₀ (10 samples)
  • Target Y values: 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
  • All 10 samples have Target Y = 1!

Amazing! This single split perfectly separates our classes. Column B is a perfect predictor in this dataset.

Step 2: Do We Need More Splits?

Let’s examine the left child (Column B = 0):

  • All samples have the same target (Y = 0)
  • No further splitting needed → This becomes a leaf node with prediction = 0

The right child (Column B = 1):

  • All samples have the same target (Y = 1)
  • No further splitting needed → This becomes a leaf node with prediction = 1

Our first tree is complete with just one split!

Root: All 20 samples
            |
     (Column B < 1?)
       /          \
     YES           NO
     /               \
Left Leaf:         Right Leaf:
Predict Y = 0      Predict Y = 1
(10 samples)       (10 samples)

The Key Question: Which Column and Value to Split On?

In our example, Column B was obvious, but typically the algorithm must evaluate:

  • All columns: A, B, C, D
  • All possible threshold values in each column

For Column A: try splits at 0.5, 1.5, 2.5, 3.5, 4.5, 5.5, …
For Column C: try splits at 0.5, 1.5, 2.5, 3.5, 4.5, 5.5, …

The algorithm calculates a score for each potential split and chooses the best one. This is where the loss function comes into play, which we’ll explore next.
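To make this concrete, here is a minimal Python sketch (our own illustration, not part of XGBoost) showing how candidate thresholds for a numeric column can be enumerated as midpoints between consecutive sorted unique values; the column values are illustrative.

import numpy as np

# Candidate thresholds for one column: midpoints between consecutive
# sorted unique values (illustrative values for Column C).
column_c = np.array([0, 3, 6, 1, 4, 2, 5, 3, 2, 1])
unique_vals = np.unique(column_c)
thresholds = (unique_vals[:-1] + unique_vals[1:]) / 2
print(thresholds)  # [0.5 1.5 2.5 3.5 4.5 5.5]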

Realistic Scenario: What If Column B Had Noise?

Let’s imagine our dataset wasn’t perfect. Suppose x₁₉ (Column B = 0) had Target Y = 1 instead of 0.

Left Branch (Column B = 0):

  • 9 samples with Y = 0, 1 sample with Y = 1
  • Not perfect, but still predominantly class 0
  • Prediction: Y = 0 (majority vote)

Right Branch (Column B = 1):

  • 10 samples with Y = 1
  • Perfect separation
  • Prediction: Y = 1

This is where gradient boosting helps: the next tree will focus on correcting that one misclassified sample (x₁₉).

Part 2: The Ensemble Approach — Building Trees Sequentially

From Single Trees to Boosted Ensembles

XGBoost doesn’t build just one tree — it builds an ensemble of trees sequentially, where each new tree corrects the errors of the previous trees.

Continuing with Our Tutorial Dataset

Remember our first tree? It used Column B to perfectly classify:

  • Prediction for Column B = 0: Y_pred = 0
  • Prediction for Column B = 1: Y_pred = 1

But what if our dataset had some complexity that one tree couldn’t capture?

Building Tree 2: Correcting the Residuals

Let’s add realistic complexity. Suppose Tree 1 made these predictions:

(Predictions aren’t exactly 0 or 1 due to regularization, which we’ll explain later)

Tree 2’s Job: Predict these residuals (0.0, 0.0, 0.1, 0.1, …)

Tree 3’s Job: Predict the residuals from Tree 1 + Tree 2

This is the heart of gradient boosting: each new tree is trained to predict the residual errors from all previous trees combined.

The Sequential Learning Process

Final Prediction = Tree₁ + Tree₂ + Tree₃ + Tree₄ + Tree₅

For sample x₃:
Tree₁ predicts: 0.9
Tree₂ predicts: 0.08 (trying to correct the 0.1 error)
Tree₃ predicts: 0.015 (trying to correct remaining error)
...

Final prediction = 0.9 + 0.08 + 0.015 + ... ≈ 1.0
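If it helps to see this as code, here is a tiny sketch of how the additive prediction builds up for one sample; the per-tree outputs are the illustrative numbers above, not values computed by a real model.

# Additive prediction for one sample (x₃), using the illustrative
# per-tree outputs from the text.
tree_outputs = [0.9, 0.08, 0.015, 0.004, 0.001]
prediction = 0.0
for t, output in enumerate(tree_outputs, start=1):
    prediction += output
    print(f"After tree {t}: prediction = {prediction:.3f}")
# The running total approaches 1.0 as each tree corrects the remaining error.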

Key Question: How do we build Tree 2 optimally? We need to know:

  1. What errors did Tree 1 make?
  2. How should we split the data to best correct these errors?
  3. What should each leaf predict?

This is where the loss function and XGBoost’s mathematical framework come in!

Part 3: The Loss Function — Minimizing Bias and Variance Simultaneously

Engineering the Objective

XGBoost’s secret weapon is its carefully engineered loss function that simultaneously addresses two fundamental challenges in machine learning: bias and variance.

The total loss function in XGBoost has two main components:

Component 1: Minimizing Bias (Training Error)

The first term sums over all samples in the training data:

Σ l(ŷᵢ, yᵢ)

Where:

  • ŷᵢ = prediction for sample i
  • yᵢ = actual target value for sample i
  • l() = loss function (like mean squared error or cross-entropy)

This term ensures we fit the training data well and minimizes underfitting (bias).

Component 2: Controlling Variance (Regularization)

The second term sums over all trees:

Σ Ω(fₖ)

The regularization function Ω is decomposed into two parts:

  1. Leaf Penalty: γT, where T is the number of leaves
  • Penalizes trees with too many leaves
  • Simpler trees → lower variance → better generalization

  2. Weight Penalty: ½λ||w||², where w are the leaf predictions
  • Penalizes extreme predictions at each leaf
  • Smaller predictions from each tree → gradual corrections → less overfitting

Key Insight: By minimizing both terms simultaneously through hyperparameters γ and λ, XGBoost finds the sweet spot between bias and variance that traditional gradient boosting methods struggle to achieve.
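As a rough sketch (assuming squared error for the loss, and made-up predictions and leaf weights), the two components can be written out like this:

import numpy as np

# Sketch of the two-part objective: training loss + (γT + ½λ||w||²).
# The data, predictions, and leaf weights below are illustrative.
def training_loss(y_true, y_pred):
    return np.sum((y_true - y_pred) ** 2)          # squared-error example

def regularization(leaf_weights, gamma_=1.0, lambda_=1.0):
    T = len(leaf_weights)                           # number of leaves
    return gamma_ * T + 0.5 * lambda_ * np.sum(np.square(leaf_weights))

y_true = np.array([0.0, 0.0, 1.0, 1.0])
y_pred = np.array([0.1, 0.2, 0.8, 0.9])
leaf_weights = np.array([-0.4, 0.5])

objective = training_loss(y_true, y_pred) + regularization(leaf_weights)
print(objective)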

Part 4: Breaking Down the Sample-Level Loss Function

How Predictions Build on Each Other

When building tree t, we need to understand how predictions accumulate from previous trees.

The sample-level loss function shows:

l(yᵢ, ŷᵢ^(t)) = l(yᵢ, ŷᵢ^(t-1) + fₜ(xᵢ))

Breaking this down:

  • ŷᵢ^(t-1): The prediction from all previous trees (1 through t-1)
  • fₜ(xᵢ): The prediction from the new tree t we’re building
  • xᵢ: The feature values for sample i

Example with Our Tutorial Dataset

Let’s work through a concrete example using our 20-sample dataset.

Scenario: Building Tree 2 (t=2)

For sample x₃ (Column A=3, Column B=1, Column C=6, Column D=1, True Y=1):

Step 1: What did Tree 1 predict?

ŷ₃^(1) = 0.85 (Tree 1’s prediction for x₃)

Step 2: What’s the actual target?

y₃ = 1.0

Step 3: What’s the error?

Error = y₃ - ŷ₃^(1) = 1.0 - 0.85 = 0.15

Step 4: What should Tree 2 predict?

Tree 2 should predict approximately 0.15 to correct the error. If we’re using log loss (for classification), the exact value depends on the gradient and Hessian, but intuitively:

New prediction: ŷ₃^(2) = ŷ₃^(1) + f₂(x₃) = 0.85 + 0.12 = 0.97

Getting closer to the true value of 1.0!

Full Example for Tree 2

Observation:

  • Samples with negative errors (predicted too high): Tree 2 predicts negative values
  • Samples with positive errors (predicted too low): Tree 2 predicts positive values

The loss function guides Tree 2 on how to split the data to best capture these error patterns.

For instance, Tree 2 might discover:

  • When Column C < 3: Errors tend to be negative (over-predicted)
  • When Column C ≥ 3: Errors tend to be positive (under-predicted)

And use this to make a split!

Part 5: The Taylor Approximation — Making the Math Tractable

Simplifying Complex Loss Functions

The challenge with the sample-level loss function is that its analytical form depends on which loss function we choose (MSE, cross-entropy, etc.). XGBoost uses a clever mathematical trick: the second-order Taylor expansion.

The Taylor approximation transforms the loss function into a more manageable form:

l(yᵢ, ŷᵢ^(t-1) + fₜ(xᵢ)) ≈ l(yᵢ, ŷᵢ^(t-1)) + gᵢfₜ(xᵢ) + ½hᵢfₜ²(xᵢ)

Computing Gradients and Hessians for Our Dataset

Let’s calculate actual values using log loss (binary cross-entropy), which is standard for binary classification:

l(y, ŷ) = -y·log(ŷ) - (1-y)·log(1-ŷ)

Gradient (First Derivative, taken with respect to the raw, pre-sigmoid score, since ŷ = σ(raw score)):

gᵢ = ŷᵢ^(t-1) - yᵢ

Hessian (Second Derivative, with respect to the same raw score):

hᵢ = ŷᵢ^(t-1)·(1 - ŷᵢ^(t-1))

Concrete Example: Sample x₃ from Our Dataset

Given:

  • True target: y₃ = 1
  • Tree 1 prediction: ŷ₃^(1) = 0.85
  • We’re building Tree 2

Calculate Gradient:

g₃ = ŷ₃^(1) - y₃ = 0.85 - 1.0 = -0.15

Interpretation: Negative gradient means we predicted too low, need to increase prediction.

Calculate Hessian:

h₃ = ŷ₃^(1) · (1 - ŷ₃^(1)) = 0.85 · 0.15 = 0.1275

Interpretation: Hessian represents the curvature. Higher values mean we’re more confident about the gradient direction.
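Here is a small sketch of these two formulas in code (the helper function is ours, not an XGBoost API), checked against sample x₃:

import numpy as np

# Gradient and Hessian of the log loss with respect to the raw
# (pre-sigmoid) score, matching the formulas above.
def logloss_grad_hess(y_true, y_prob):
    grad = y_prob - y_true           # gᵢ = ŷ - y
    hess = y_prob * (1.0 - y_prob)   # hᵢ = ŷ(1 - ŷ)
    return grad, hess

# Sample x₃: true label 1, Tree 1 predicted probability 0.85
g3, h3 = logloss_grad_hess(np.array([1.0]), np.array([0.85]))
print(g3, h3)  # [-0.15] [0.1275]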

Full Calculations for Our 20 Samples

Building Tree 2, here are gradients and Hessians:

Patterns Observed:

  • Class 0 samples (Y=0): Positive gradients (we over-predicted, need to decrease)
  • Class 1 samples (Y=1): Negative gradients (we under-predicted, need to increase)
  • Hessians all positive, representing curvature of loss function

Understanding Each Term:

  1. Constant Term: l(yᵢ, ŷᵢ^(t-1)), the loss from previous trees (fixed)
  2. Gradient Term: gᵢfₜ(xᵢ)
  • For x₃: g₃ = -0.15; if Tree 2 predicts f₂(x₃) = 0.12, this term = -0.15 × 0.12 = -0.018
  • Negative value → decreases loss → good!
  3. Hessian Term: ½hᵢfₜ²(xᵢ)
  • For x₃: h₃ = 0.128, f₂(x₃) = 0.12, so this term = 0.5 × 0.128 × 0.12² ≈ 0.00092
  • Positive value → increases loss slightly
  • Acts as a penalty for large predictions (regularization effect)

Visual Interpretation:

The figure shows three scenarios:

  • Left: High gradient (steep slope) — we need to move predictions significantly
  • Center: The “bowl” shape shows how the Taylor approximation mirrors the true loss
  • Right: The curvature (Hessian) tells us whether the loss surface is gentle or steep

Key Insight: This approximation is exact when the new tree’s predictions are small (fₜ → 0), which is exactly what regularization encourages! This is why XGBoost uses small learning rates and many trees.

Part 6: Focusing on What Matters

Discarding Constants and Focusing on the New Tree

When optimizing for the new tree, we can simplify our objective by removing terms that don’t depend on the new tree.

The Simplified Loss Function:

Looking at our expanded loss:

  • Constant Term (Past errors): Doesn’t depend on new tree → Discard
  • Gradient Term: gᵢfₜ(xᵢ) – Depends on new tree → Keep
  • Hessian Term: ½hᵢfₜ²(xᵢ) – Depends on new tree → Keep
  • Regularization Term: Ω(fₜ) – Depends on new tree → Keep

This gives us the optimization objective:

L̃^(t) = Σᵢ₌₁ⁿ [gᵢfₜ(xᵢ) + ½hᵢfₜ²(xᵢ)] + Ω(fₜ)

Two Distinct Components:

  1. Optimization Term: Finds the best prediction step using gradients and Hessians
  2. Regularization Term: Penalizes complexity to prevent overfitting

The figure beautifully illustrates this separation of concerns.

Part 7: Regrouping by Leaves — A Game-Changing Perspective

From Sample Sums to Leaf Sums

Here’s where XGBoost’s brilliance truly shines. Instead of summing over samples, we regroup the sum to iterate over leaves in the tree.

The Mathematical Trick:

Original (sum over samples):

Σᵢ₌₁ⁿ (gᵢfₜ(xᵢ) + ½hᵢfₜ²(xᵢ))

Regrouped (sum over leaves):

Σⱼ₌₁ᵀ ((Σᵢ∈Iⱼ gᵢ)wⱼ + ½(Σᵢ∈Iⱼ hᵢ + λ)wⱼ²) + γT

Understanding the Notation:

  • T: Total number of leaves in the tree
  • Iⱼ: Set of all samples that land in leaf j
  • wⱼ: The prediction value at leaf j (what we’re solving for)
  • Gⱼ = Σᵢ∈Iⱼ gᵢ: Total gradient in leaf j
  • Hⱼ = Σᵢ∈Iⱼ hᵢ: Total Hessian in leaf j

Key Insight: For all samples in the same leaf j, the prediction fₜ(xᵢ) is the same leaf weight wⱼ. This allows us to factor out wⱼ and regroup!

Why This Matters:

This regrouping transforms the problem from “How should each sample contribute?” to “What should each leaf predict?” This is computationally much more efficient and leads to the optimal weight formula.

Part 8: Finding Optimal Leaf Weights

Calculus to the Rescue

To find the optimal leaf weights, we use a fundamental calculus principle: the minimum of a function occurs where its derivative equals zero.

Calculating Optimal Weights for Our Dataset

Let’s use our tutorial data and see how to calculate actual leaf weights when building Tree 2.

Part 1: Understanding the Formula

Taking the derivative with respect to wⱼ and setting it to zero:

∂L̃^(t)/∂wⱼ = Gⱼ + (Hⱼ + λ)wⱼ = 0

Solving this equation gives us:

wⱼ* = -Gⱼ/(Hⱼ + λ)

Where:

  • Gⱼ = Σᵢ∈Iⱼ gᵢ: Sum of gradients for all samples in leaf j
  • Hⱼ = Σᵢ∈Iⱼ hᵢ: Sum of Hessians for all samples in leaf j
  • λ: Regularization parameter (let’s use λ = 1.0)
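As a quick sanity check, here is the formula as a small helper function (our own sketch), fed with the Leaf 1 gradients and Hessians used in the worked example below:

# Optimal leaf weight: w* = -G / (H + λ)
def optimal_leaf_weight(gradients, hessians, lambda_=1.0):
    G = sum(gradients)
    H = sum(hessians)
    return -G / (H + lambda_)

# Leaf 1 values from the worked example below (illustrative numbers)
g_leaf1 = [0.10, 0.12, 0.08, 0.11, 0.09, 0.10, 0.13, 0.07, 0.09, 0.12]
h_leaf1 = [0.090, 0.106, 0.074, 0.098, 0.082, 0.090, 0.113, 0.065, 0.082, 0.106]
print(optimal_leaf_weight(g_leaf1, h_leaf1))  # ≈ -0.530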

Concrete Example: Computing Leaf Weights

Suppose Tree 2 splits on Column B (like Tree 1), creating two leaves:

Leaf 1 (Column B = 0) — Contains samples where Y = 0:

Samples: x₁, x₂, x₅, x₇, x₉, x₁₁, x₁₃, x₁₅, x₁₇, x₁₉

Calculate Gⱼ:

G₁ = g₁ + g₂ + g₅ + g₇ + g₉ + g₁₁ + g₁₃ + g₁₅ + g₁₇ + g₁₉
G₁ = 0.10 + 0.12 + 0.08 + 0.11 + 0.09 + 0.10 + 0.13 + 0.07 + 0.09 + 0.12
G₁ = 1.01

Calculate Hⱼ:

H₁ = h₁ + h₂ + h₅ + h₇ + h₉ + h₁₁ + h₁₃ + h₁₅ + h₁₇ + h₁₉
H₁ = 0.090 + 0.106 + 0.074 + 0.098 + 0.082 + 0.090 + 0.113 + 0.065 + 0.082 + 0.106
H₁ = 0.906

Calculate optimal weight:

w₁* = -G₁/(H₁ + λ) = -1.01/(0.906 + 1.0) = -1.01/1.906 = -0.530

Interpretation: Tree 2 predicts -0.530 for samples in Leaf 1. This negative value reduces the overall prediction, correcting the slight over-predictions from Tree 1 for class 0 samples.

Leaf 2 (Column B = 1) — Contains samples where Y = 1:

Samples: x₃, x₄, x₆, x₈, x₁₀, x₁₂, x₁₄, x₁₆, x₁₈, x₂₀

Calculate Gⱼ:

G₂ = g₃ + g₄ + g₆ + g₈ + g₁₀ + g₁₂ + g₁₄ + g₁₆ + g₁₈ + g₂₀
G₂ = -0.15 + -0.12 + -0.13 + -0.11 + -0.16 + -0.10 + -0.14 + -0.17 + -0.12 + -0.15
G₂ = -1.35

Calculate Hⱼ:

H₂ = h₃ + h₄ + h₆ + h₈ + h₁₀ + h₁₂ + h₁₄ + h₁₆ + h₁₈ + h₂₀
H₂ = 0.128 + 0.106 + 0.113 + 0.098 + 0.134 + 0.090 + 0.120 + 0.141 + 0.106 + 0.128
H₂ = 1.164

Calculate optimal weight:

w₂* = -G₂/(H₂ + λ) = -(-1.35)/(1.164 + 1.0) = 1.35/2.164 = 0.624

Interpretation: Tree 2 predicts +0.624 for samples in Leaf 2. This positive value increases the overall prediction, correcting the slight under-predictions from Tree 1 for class 1 samples.

Updated Predictions After Tree 2

For sample x₃ (in Leaf 2):

Previous prediction (Tree 1): ŷ₃^(1) = 0.85
Tree 2 contribution: w₂* = 0.624
New prediction: ŷ₃^(2) = 0.85 + 0.624 = 1.474 (before sigmoid)

After sigmoid: σ(1.474) = 1/(1 + e^(-1.474)) ≈ 0.81


We’re getting closer to the true value of 1.0!

For sample x₁ (in Leaf 1):

Previous prediction (Tree 1): ŷ₁^(1) = 0.10
Tree 2 contribution: w₁* = -0.530
New prediction: ŷ₁^(2) = 0.10 + (-0.530) = -0.430 (before sigmoid)

After sigmoid: σ(-0.430) = 1/(1 + e^0.430) ≈ 0.39

That is closer to 0 than Tree 1 alone gave us (σ(0.10) ≈ 0.52), but still not confident. This is where Tree 3 comes in to correct further.

Understanding the Formula Components

Intuitive Understanding:

If you have:

  • High gradients (large errors) → Larger |w*| to correct more aggressively
  • High Hessian (high certainty) → Smaller |w*| to be conservative
  • High λ (strong regularization) → Smaller |w*| to avoid overfitting

Effect of λ:

If λ = 0 (no regularization):

w₁* = -1.01/0.906 = -1.115 (larger magnitude)
w₂* = 1.35/1.164 = 1.160 (larger magnitude)

If λ = 5.0 (strong regularization):

w₁* = -1.01/5.906 = -0.171 (much smaller)
w₂* = 1.35/6.164 = 0.219 (much smaller)

Higher λ creates smaller, more conservative corrections!

Part 2: Using Optimal Weights in Final Loss

We substitute wⱼ* back into the loss function to get:

L̃^(t) = -½ Σⱼ₌₁ᵀ (Gⱼ²)/(Hⱼ + λ) + γT

For our Tree 2 with 2 leaves and γ = 0:

L̃^(2) = -½((G₁²)/(H₁ + λ) + (G₂²)/(H₂ + λ))
L̃^(2) = -½((1.01²)/(1.906) + (1.35²)/(2.164))
L̃^(2) = -½(0.535 + 0.842)
L̃^(2) = -½(1.377)
L̃^(2) = -0.689

This is the objective score for Tree 2’s structure. Lower (more negative) scores are better! We’ll use this to evaluate different split options.
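The same calculation as a sketch in code (a hypothetical helper, not the library's API):

# Structure score: L̃ = -½ Σⱼ Gⱼ²/(Hⱼ + λ) + γT
def structure_score(G_per_leaf, H_per_leaf, lambda_=1.0, gamma_=0.0):
    score = -0.5 * sum(G * G / (H + lambda_)
                       for G, H in zip(G_per_leaf, H_per_leaf))
    return score + gamma_ * len(G_per_leaf)

# Tree 2's two leaves from the example above
print(structure_score([1.01, -1.35], [0.906, 1.164]))  # ≈ -0.689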

Part 9: The Split-Finding Algorithm

Understanding the Objective and Splitting Criteria

Now we have all the pieces to understand how XGBoost finds the best splits!

The Complete Picture:

The figure shows the relationship between:

  1. Total Loss for Tree T:

L̃^(t) = -½ Σⱼ₌₁ᵀ (Gⱼ²)/(Hⱼ + λ) + γT

2. Split Gain Formula:

L_split = ½((G_L²)/(H_L + λ) + (G_R²)/(H_R + λ) - (G_I²)/(H_I + λ)) - γ

Breaking Down the Split Gain:

Consider a parent node with samples, and a proposed split:

  • G_I, H_I: Gradient and Hessian sums in the parent (current node)
  • G_L, H_L: Gradient and Hessian sums in the left child after split
  • G_R, H_R: Gradient and Hessian sums in the right child after split

The Negative Structured Score:

The term (Gⱼ²)/(Hⱼ + λ) appears three times with different signs:

  • Positive for left child: Lower loss in left child is good
  • Positive for right child: Lower loss in right child is good
  • Negative for parent: We subtract the parent’s loss (opportunity cost)

The gain formula tells us: How much better is the split compared to not splitting?

Regularization Term:

  • Each split adds a new leaf (increases T by 1)
  • We subtract γ to penalize unnecessary complexity
  • Only split if the gain exceeds this penalty!
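Putting the gain formula into a short sketch (again, our own helper, not library code) makes the "children minus parent, minus γ" structure explicit; the numbers fed in below come from the "Column C < 3" split worked out in Part 11.

# Split gain: ½[G_L²/(H_L+λ) + G_R²/(H_R+λ) - G_P²/(H_P+λ)] - γ
def split_gain(G_L, H_L, G_R, H_R, lambda_=1.0, gamma_=0.0):
    def term(G, H):
        return G * G / (H + lambda_)
    G_P, H_P = G_L + G_R, H_L + H_R
    return 0.5 * (term(G_L, H_L) + term(G_R, H_R) - term(G_P, H_P)) - gamma_

# The "Column C < 3" split evaluated in Part 11
print(split_gain(0.64, 0.793, -0.98, 1.277))  # ≈ 0.306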

Part 10: Gradient vs. Curvature — The XGBoost Advantage

Why XGBoost Outperforms Traditional Gradient Boosting

This figure reveals the fundamental difference between XGBoost and traditional Gradient Boosting Machines (GBM).

The Objective Function Breakdown:

Gain = (GRAD_j²)/(CURV_j + λ)

Scenario Analysis:

Scenario 1: The XGBoost Choice

  • Low loss decrease (modest gradient reduction)
  • Low curvature (smooth, generalizable surface)
  • Result: High gain score
  • Why it’s better: Balances loss reduction with a smoother, more generalizable surface

Scenario 2: What GBM Would Choose

  • High loss decrease (steep gradient reduction)
  • High curvature (wiggly, complex surface)
  • Result: Lower gain score (after accounting for curvature)
  • Why it’s worse: Aggressive fitting leads to overfitting

The Mathematical Connection to Bias-Variance:

The figure makes an elegant connection:

  • -GRAD_j ~ Bias: Maximizing gradient reduction minimizes bias (underfitting)
  • -CURV_j ~ Variance: Minimizing curvature minimizes variance (overfitting)

Visual Comparison:

The second figure shows two split options:

  • XGBoost’s Choice (left): Modest loss decrease but smooth curvature
  • GBM’s Choice (right): Aggressive loss decrease but high curvature risk

The Bottom Line:

XGBoost tries to minimize both Bias and Variance
=> XGBoost overfits less than GBM

This is why XGBoost consistently outperforms traditional gradient boosting in practice!

Part 11: Finding the Right Split — The Practical Algorithm

Step-by-Step Split Search Process

Now let’s see how XGBoost actually finds the best split at each node using our tutorial dataset.

Real Calculation: Finding the Best Split for Tree 2

We’re building Tree 2. All 20 samples are at the root node with their gradients and Hessians calculated.

The Question: Which feature and threshold should we split on?

Step 1: Sort Column C Values

Our 20 samples sorted by Column C:

Observation: Notice how Target Y values change as Column C increases!

Step 2: Evaluate Split “Column C < 3”

Left child (Column C < 3): 9 samples

Samples: x₄, x₁₅, x₅, x₉, x₁₇, x₁, x₇, x₁₁, x₁₉

Calculate sums:

G_L = -0.12 + 0.07 + 0.08 + 0.09 + 0.09 + 0.10 + 0.11 + 0.10 + 0.12 = 0.64
H_L = 0.106 + 0.065 + 0.074 + 0.082 + 0.082 + 0.090 + 0.098 + 0.090 + 0.106 = 0.793

Right child (Column C ≥ 3): 11 samples

Samples: x₂, x₁₀, x₁₃, x₂₀, x₆, x₁₆, x₈, x₁₄, x₃, x₁₂, x₁₈

Calculate sums:

G_R = 0.12 + (-0.16) + 0.13 + (-0.15) + (-0.13) + (-0.17) + (-0.11) + (-0.14) + (-0.15) + (-0.10) + (-0.12)
G_R = -0.98
H_R = 0.106 + 0.134 + 0.113 + 0.128 + 0.113 + 0.141 + 0.098 + 0.120 + 0.128 + 0.090 + 0.106
H_R = 1.277

Parent node (all 20 samples):

G_parent = G_L + G_R = 0.64 + (-0.98) = -0.34
H_parent = H_L + H_R = 0.793 + 1.277 = 2.070

Calculate Gain (with λ=1.0, γ=0):

Gain = ½((G_L²)/(H_L + λ) + (G_R²)/(H_R + λ) - (G_parent²)/(H_parent + λ)) - γ

Gain = ½((0.64²)/(0.793 + 1.0) + (-0.98)²/(1.277 + 1.0) - (-0.34)²/(2.070 + 1.0)) - 0

Gain = ½((0.4096/1.793) + (0.9604/2.277) - (0.1156/3.070))

Gain = ½(0.229 + 0.422 - 0.038)

Gain = ½(0.613)

Gain = 0.306

Positive gain means this split improves the loss function!

Step 3: Evaluate Other Splits

Let’s try “Column C < 2”:

Left (C < 2): 5 samples (x₄, x₁₅, x₅, x₉, x₁₇)

G_L = -0.12 + 0.07 + 0.08 + 0.09 + 0.09 = 0.21
H_L = 0.106 + 0.065 + 0.074 + 0.082 + 0.082 = 0.409

Right (C ≥ 2): 15 samples (remaining)

G_R = -0.34 - 0.21 = -0.55
H_R = 2.070 - 0.409 = 1.661

Gain = ½((0.21²/1.409) + (0.55²/2.661) - (0.34²/3.070))
Gain = ½(0.031 + 0.114 - 0.038)
Gain = 0.054

Lower gain than “C < 3”!

Let’s try “Column C < 4”:

Left (C < 4): 13 samples

G_L = 0.64 + 0.12 + (-0.16) + 0.13 + (-0.15) = 0.58
H_L = 0.793 + 0.106 + 0.134 + 0.113 + 0.128 = 1.274

Right (C ≥ 4): 7 samples

G_R = -0.34 - 0.58 = -0.92
H_R = 2.070 - 1.274 = 0.796

Gain = ½((0.58²/2.274) + (0.92²/1.796) - (0.34²/3.070))
Gain = ½(0.148 + 0.471 - 0.038)
Gain = 0.291

Still lower than “C < 3”!

Step 4: Compare All Features

We would repeat this process for:

  • Column A: Try splits at 0.5, 1.5, 2.5, 3.5, 4.5, 5.5
  • Column B: Try split at 0.5 (only two values: 0 and 1)
  • Column C: Already evaluated multiple splits
  • Column D: Try split at 0.5 (only two values: 0 and 1)

Results Summary:

Winner: Column B < 0.5 has the highest gain!

This makes sense — Column B is a strong predictor in our dataset, clearly separating classes 0 and 1.

Tree 2 will split on Column B, just like Tree 1, but with different leaf weights to correct residual errors.
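If you want to see where λ and γ show up when using the actual library, here is a minimal sketch with the xgboost scikit-learn wrapper on a synthetic stand-in for our dataset (Column B drives the target); the data and parameter values are illustrative, not tuned.

import numpy as np
from xgboost import XGBClassifier  # assumes the xgboost package is installed

# Synthetic stand-in for the tutorial data: Column B determines Target Y.
rng = np.random.default_rng(0)
X = rng.integers(0, 7, size=(20, 4)).astype(float)  # Columns A, B, C, D
X[:, 1] = np.repeat([0.0, 1.0], 10)                 # Column B: 10 zeros, 10 ones
y = X[:, 1].astype(int)                             # Target Y follows Column B

model = XGBClassifier(
    n_estimators=5,     # number of boosting rounds (trees)
    max_depth=2,
    learning_rate=0.3,
    reg_lambda=1.0,     # λ: penalty on leaf weights
    gamma=0.0,          # γ: penalty per additional leaf
)
model.fit(X, y)
print(model.predict(X))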

Part 12: Parallelization — Speed Through Smart Design

How XGBoost Achieves Fast Training

One of XGBoost’s practical advantages is its speed, achieved through clever parallelization.

The Parallel Split Search Strategy:

Since gradient boosting builds trees sequentially (can’t parallelize across trees), XGBoost parallelizes the split-finding process:

Thread Distribution:

  • Thread 1: Searches for best split in Column A
  • Thread 2: Searches for best split in Column B
  • Thread 3: Searches for best split in Column C
  • Thread 4: Searches for best split in Column D

Each thread independently:

  1. Sorts its column (if not pre-sorted)
  2. Evaluates all possible splits
  3. Computes the gain for each split
  4. Reports the best split for its column

Main Thread Coordination:

After all threads complete:

  1. Collect best split from each column
  2. Use ArgMax to find the overall best split
  3. Apply the winning split (e.g., “Column C < c”)

Why This Works:

The independence of column evaluations allows perfect parallelization:

  • No thread dependencies
  • No synchronization needed during search
  • Only final aggregation requires coordination

Real-World Impact:

On a 4-core machine, you could see close to 4x speedup in split-finding, which is the most computationally intensive part of tree building!
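In the Python package, this thread count is exposed as a parameter; a minimal example (the value 4 is arbitrary, not a recommendation):

from xgboost import XGBClassifier

# Ask XGBoost to use four threads for its column-parallel split search.
model = XGBClassifier(n_estimators=100, n_jobs=4)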

Part 13: Handling Missing Values Intelligently

XGBoost’s Clever Approach to Missing Data

Unlike many algorithms that require preprocessing for missing values, XGBoost handles them naturally during training.

Adding Missing Values to Our Tutorial Dataset

Let’s modify our dataset to include some missing values and see how XGBoost handles them.

Modified Dataset (Column C with missing values):

We have 2 missing values in Column C (samples x₂ and x₅).

Traditional approaches would require:

  • Imputation (filling with mean=2.8, median=3, etc.)
  • Deletion (removing rows x₂ and x₅)
  • Creating an indicator variable

XGBoost does something smarter!

The Missing Value Algorithm

Step 1: Sort and Separate

When evaluating split “Column C < 3”:

Non-missing samples (18 of the 20) are sorted by Column C as before; the entries for x₂ and x₅ are set aside because their Column C values are now missing.

Missing samples (set aside):

x₂: Target Y = 0, Gradient = 0.12
x₅: Target Y = 0, Gradient = 0.08

Step 2: Try Missing Values Going LEFT

Configuration 1: Missing → Left

Left child (C < 3 OR C is missing):

  • Non-missing: x₄, x₁₅, x₉, x₁₇, x₁, x₇, x₁₁, x₁₉ (8 samples; x₅ no longer counts here because its Column C is missing)
  • Missing: x₂, x₅ (2 samples)
  • Total: 10 samples

Calculate:

G_L = (sum of 8 non-missing gradients) + g₂ + g₅
G_L = 0.56 + 0.12 + 0.08 = 0.76

H_L = (sum of 8 non-missing Hessians) + h₂ + h₅
H_L = 0.719 + 0.106 + 0.074 = 0.899

Right child (C ≥ 3 and C not missing): 10 samples (x₂ no longer counts here)

G_R = -1.10
H_R = 1.171

Calculate gain:

Gain_left = ½((0.76²/1.899) + (1.10²/2.171) - (0.34²/3.070))
Gain_left = ½(0.304 + 0.557 - 0.038)
Gain_left = 0.412

Step 3: Try Missing Values Going RIGHT

Configuration 2: Missing → Right

Left child (C < 3 and C not missing): 8 samples

G_L = 0.56
H_L = 0.719

Right child (C ≥ 3 OR C is missing):

  • Non-missing: x₁₀, x₁₃, x₂₀, x₆, x₁₆, x₈, x₁₄, x₃, x₁₂, x₁₈ (10 samples)
  • Missing: x₂, x₅ (2 samples)
  • Total: 12 samples

Calculate:

G_R = -1.10 + 0.12 + 0.08 = -0.90
H_R = 1.171 + 0.106 + 0.074 = 1.351

Calculate gain:

Gain_right = ½((0.56²/1.719) + (0.90²/2.351) - (0.34²/3.070))
Gain_right = ½(0.182 + 0.345 - 0.038)
Gain_right = 0.245

Step 4: Choose the Better Direction

Compare the two scenarios:

Gain_left = 0.412 (Missing → Left)
Gain_right = 0.245 (Missing → Right)

Since 0.412 > 0.245: Missing values go LEFT!

The split is:

IF (Column C < 3) OR (Column C is missing):
Go LEFT
ELSE:
Go RIGHT

Why Did Missing Go Left?

Let’s analyze our missing samples:

  • x₂: Target Y = 0, Gradient = +0.12 (positive)
  • x₅: Target Y = 0, Gradient = +0.08 (positive)

The left child predominantly contains:

  • Samples with Target Y = 0
  • Samples with positive gradients (need downward corrections)

The missing samples share these characteristics! They belong with the left group.

XGBoost discovered this pattern automatically without any manual feature engineering.

Real-World Interpretation

What does this mean in practice?

Imagine Column C represents “number of customer service calls”:

  • Missing value might mean “no calls recorded” or “new customer”
  • These samples might behave more like low-call customers (Y=0)
  • XGBoost learns: missing → treat like “low calls”

Benefits:

  1. No imputation needed: Don’t have to guess values
  2. Preserves information: Missing might be meaningful
  3. Node-specific: Different nodes might route missing values differently
  4. Automatic: No manual decision required

Another Example: Different Node, Different Direction

At a different node in the tree, when evaluating “Column A < 3”, missing values might go RIGHT if that minimizes loss better. Each node makes its own decision based on the data at that node.
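In practice, this means you can hand XGBoost data containing NaN directly; a small sketch with made-up rows (mirroring the idea that Column C is sometimes missing):

import numpy as np
from xgboost import XGBClassifier

# XGBoost accepts NaN as "missing"; each node learns a default direction
# for missing values during training. The rows below are illustrative.
X = np.array([
    [1.0, 0.0, np.nan, 1.0],   # Column C missing, like x₂ in the example
    [2.0, 0.0, 1.0,    0.0],
    [3.0, 1.0, 6.0,    1.0],
    [4.0, 1.0, np.nan, 0.0],
])
y = np.array([0, 0, 1, 1])

model = XGBClassifier(n_estimators=3, max_depth=2)
model.fit(X, y)                 # no imputation step required
print(model.predict(X))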

Part 14: Additional Performance Optimizations

Features That Make XGBoost Production-Ready

While we’ve covered the core algorithm, XGBoost includes several additional optimizations that make it fast and efficient in practice.

Four Key Performance Features

1. Pre-Sorted Column Blocks

What it does:

  • Sorts all feature columns once at the beginning
  • Keeps pointers to original indices
  • Stores in compressed column format

Why it matters:

  • Sorting has O(n log n) complexity
  • Without pre-sorting, you’d sort at every node evaluation
  • With pre-sorting, you sort once and reuse
  • Massive time savings during split search

Trade-off: Higher initial memory usage for faster searches

2. Compressed Column Format

What it does:

  • Stores data in a column-oriented format
  • Compresses continuous features into bins
  • Optimizes for CPU cache efficiency

Why it matters:

  • Modern CPUs are fast but memory access is slow
  • Column format allows better cache utilization
  • When evaluating Column C splits, all Column C data is contiguous in memory
  • Parallel threads can access different columns without cache conflicts

3. Cache-Aware Prefetching

What it does:

  • Anticipates which memory locations will be needed next
  • Loads data into CPU cache before it’s needed
  • Minimizes cache misses from non-contiguous memory access

Why it matters:

  • Pre-sorting changes data order, making memory access non-contiguous
  • Without prefetching, each memory access might be a cache miss (slow!)
  • With prefetching, data is ready when needed (fast!)
  • Can provide 2–10x speedup depending on dataset size

Technical detail: Uses the CPU’s prefetch instructions to hint at future memory access patterns

4. Approximate Split-Finding (Histogram Method)

What it does:

  • Instead of evaluating every possible split point
  • Groups feature values into bins (histogram buckets)
  • Only evaluates splits at bucket boundaries

Why it matters:

  • For features with many unique values, evaluating all splits is expensive
  • Histograms reduce the search space dramatically
  • Example: 10,000 unique values → 256 histogram bins
  • Massive speedup with minimal accuracy loss

How it works:

Original: 0.1, 0.2, 0.3, ..., 99.8, 99.9, 100.0 (1,000 unique values)
Histogram: (0-1), (1-2), (2-3), ..., (99-100) (100 bins)

Only evaluate splits at bin boundaries, not at every individual value.
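In the Python package, the histogram method is selected through the tree_method and max_bin parameters; a short sketch (the values shown are illustrative, not recommendations):

from xgboost import XGBClassifier

# Histogram-based (approximate) split finding: bucket each feature into
# at most `max_bin` bins and only consider splits at bin boundaries.
model = XGBClassifier(
    tree_method="hist",
    max_bin=256,
    n_estimators=100,
)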

The Combined Impact

These optimizations work together:

  1. Pre-sorting eliminates repeated sorting overhead
  2. Column format + Cache prefetching optimize memory access
  3. Parallelization (from Part 12) uses multiple cores
  4. Histogram method reduces the search space

Real-World Performance:

On typical datasets, these optimizations combined can provide:

  • 10–100x speedup over naive implementations
  • Ability to handle datasets with millions of rows efficiently
  • Sub-second training times for moderately sized datasets

This is why XGBoost became the standard tool for Kaggle competitions and production machine learning systems!

Published via Towards AI
