Log Link vs. Log Transformation in R: The Difference that Misleads Your Entire Data Analysis

Author: ngoc doan

Originally published on Towards AI.

Image by Unsplash

Although the normal distribution is the one most commonly assumed, much real-world data is, unfortunately, not normal. When faced with heavily skewed data, it is tempting to use a log transformation to normalize the distribution and stabilize the variance. I recently worked on a project analyzing the energy consumption of AI models, using data from Epoch AI (1). There is no official energy consumption data for each model, so I calculated it by multiplying each model's power draw by its training time. The new variable, energy (in kWh), was heavily right-skewed, with some extreme outliers (Figure 1).

Figure 1. Histogram of energy consumption (kWh)
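The derivation can be sketched as follows; the column names here (Power_draw_W in particular) are hypothetical stand-ins, not the Epoch AI dataset's actual schema:

```r
# Energy (kWh) = power draw (W) x training time (h) / 1000.
# Toy rows with hypothetical column names; the real Epoch AI fields differ.
df <- data.frame(
  Power_draw_W       = c(300, 400),
  Training_time_hour = c(100, 2000)
)
df$Energy_kWh <- df$Power_draw_W * df$Training_time_hour / 1000
df$Energy_kWh  # 30 and 800 kWh for these toy rows
```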

To address this skewness and heteroscedasticity, my first instinct was to apply a log transformation to the energy variable. The histogram of log(energy) looked much more normal (Figure 2), and a Shapiro-Wilk test confirmed borderline normality (p of 0.5).

Figure 2. Histogram of the log of energy consumption (kWh)
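As a minimal sketch of this check on simulated right-skewed data (not the actual Epoch AI numbers):

```r
# Simulate right-skewed, strictly positive data resembling Energy_kWh
set.seed(42)
energy <- rlnorm(100, meanlog = 8, sdlog = 2)

shapiro.test(energy)$p.value       # very small: raw data clearly non-normal
shapiro.test(log(energy))$p.value  # much larger: the log scale looks normal
hist(log(energy), main = "Histogram of log(energy)", xlab = "log(kWh)")
```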

The modeling dilemma: log transformation vs. log link

The visualization looked good, but when I moved on to modeling, I ran into a dilemma: should I model the log-transformed response variable (log(y) ~ x), or should I model the original response variable through a log link function (y ~ x, link = "log")? I also considered two distributions, Gaussian (normal) and gamma, and combined each distribution with both logarithmic approaches. That gave me four different models, shown below, all fitted using R's generalized linear models (GLMs):

all_gaussian_log_link <- glm(Energy_kWh ~ Parameters + Training_compute_FLOP +
                               Training_dataset_size + Training_time_hour +
                               Hardware_quantity + Training_hardware,
                             family = gaussian(link = "log"), data = df)

all_gaussian_log_transform <- glm(log(Energy_kWh) ~ Parameters + Training_compute_FLOP +
                                    Training_dataset_size + Training_time_hour +
                                    Hardware_quantity + Training_hardware,
                                  data = df)

all_gamma_log_link <- glm(Energy_kWh ~ Parameters + Training_compute_FLOP +
                            Training_dataset_size + Training_time_hour +
                            Hardware_quantity + Training_hardware + 0,
                          family = Gamma(link = "log"), data = df)

all_gamma_log_transform <- glm(log(Energy_kWh) ~ Parameters + Training_compute_FLOP +
                                 Training_dataset_size + Training_time_hour +
                                 Hardware_quantity + Training_hardware + 0,
                               family = Gamma(), data = df)

Model comparison: AIC and diagnostic plots

I compared the four models using the Akaike information criterion (AIC), which is an estimator of prediction error. Generally, the lower the AIC, the better the model fits.

AIC(all_gaussian_log_link, all_gaussian_log_transform, all_gamma_log_link, all_gamma_log_transform)

df AIC
all_gaussian_log_link 25 2005.8263
all_gaussian_log_transform 25 311.5963
all_gamma_log_link 25 1780.8524
all_gamma_log_transform 25 352.5450

Of the four models, the ones using the log-transformed response have much lower AIC values than the ones using a log link. Because the AIC gap between the log-link and log-transformed models was substantial (311 and 352 vs. 1780 and 2005), I also examined diagnostic plots to further confirm that the log-transformed models fit better:

Figure 4. Diagnostic plots for the log-link Gaussian model. The residuals-vs-fitted plot suggests linearity despite a few outliers. However, the Q-Q plot shows noticeable deviations from the theoretical line, suggesting non-normality.
Figure 5. Diagnostic plots for the log-transformed Gaussian model. The Q-Q plot fits much better, supporting normality. However, the residuals-vs-fitted plot dips down to -2, which may suggest non-linearity.
Figure 6. Diagnostic plots for the log-link gamma model. The Q-Q plot looks good, but the residuals-vs-fitted plot shows clear signs of non-linearity.
Figure 7. Diagnostic plots for the log-transformed gamma model. The residuals-vs-fitted plot looks good, with only a small dip to -0.25. However, the Q-Q plot shows some deviation at both tails.
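Panels like these come from base R's plot() method for fitted glm objects; here is a self-contained sketch on toy data (the model below is illustrative, not one of the four fitted above):

```r
# Toy data with a multiplicative relationship, then a log-transformed fit
set.seed(1)
toy <- data.frame(x = runif(80, 1, 10))
toy$y <- exp(0.4 * toy$x + rnorm(80, sd = 0.3))
fit <- glm(log(y) ~ x, data = toy)  # Gaussian family by default

# The four standard diagnostic panels: Residuals vs Fitted, Normal Q-Q,
# Scale-Location, and Residuals vs Leverage
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))
```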

Based on the AIC values and the diagnostic plots, I decided to move forward with the log-transformed gamma model: it had the second-lowest AIC value, and its residuals-vs-fitted plot looked better than that of the log-transformed Gaussian model.

I then began examining which explanatory variables were useful and which interactions might be significant. The final model I chose was:

glm(formula = log(Energy_kWh) ~ Training_time_hour * Hardware_quantity + 
Training_hardware + 0, family = Gamma(), data = df)

Interpretation of coefficients

However, when I started interpreting the model coefficients, something felt off. Because only the response variable was log-transformed, the effects of the predictors are multiplicative, and we need to exponentiate the coefficients to convert them back to the original scale. A one-unit increase in x multiplies the outcome y by exp(β); in other words, each additional unit of x leads to a (exp(β) − 1) × 100% change in y (2).
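For example, with a hypothetical coefficient of β = 0.05 on the log scale (an illustrative value, not one from the model above):

```r
# Back-transforming a log-scale coefficient to a multiplicative effect
beta <- 0.05
exp(beta)               # each unit of x multiplies y by ~1.051
(exp(beta) - 1) * 100   # i.e. a ~5.13% increase in y per unit of x
```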

Looking at the model's results table below, Training_time_hour, Hardware_quantity, and their interaction term Training_time_hour:Hardware_quantity are continuous variables, so their coefficients represent slopes. Meanwhile, because I specified +0 in the model formula, all levels of the categorical Training_hardware act as intercepts, meaning each hardware type serves as the intercept β₀ when its corresponding dummy variable is active.
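The effect of + 0 on a categorical predictor can be seen in a tiny example (hypothetical data, not the hardware table):

```r
# Without + 0, R drops a reference level and reports contrasts against it;
# with + 0, every factor level gets its own intercept (here, its group mean).
d <- data.frame(g = factor(c("A", "A", "B", "B", "C", "C")),
                y = c(1.0, 1.2, 2.0, 2.1, 3.0, 3.3))
coef(lm(y ~ g, data = d))      # (Intercept), gB, gC: contrasts vs level A
coef(lm(y ~ g + 0, data = d))  # gA, gB, gC: one intercept per level
```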

> glm(formula = log(Energy_kWh) ~ Training_time_hour * Hardware_quantity + 
Training_hardware + 0, family = Gamma(), data = df)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
Training_time_hour -1.587e-05 3.112e-06 -5.098 5.76e-06 ***
Hardware_quantity -5.121e-06 1.564e-06 -3.275 0.00196 **
Training_hardwareGoogle TPU v2 1.396e-01 2.297e-02 6.079 1.90e-07 ***
Training_hardwareGoogle TPU v3 1.106e-01 7.048e-03 15.696 < 2e-16 ***
Training_hardwareGoogle TPU v4 9.957e-02 7.939e-03 12.542 < 2e-16 ***
Training_hardwareHuawei Ascend 910 1.112e-01 1.862e-02 5.969 2.79e-07 ***
Training_hardwareNVIDIA A100 1.077e-01 6.993e-03 15.409 < 2e-16 ***
Training_hardwareNVIDIA A100 SXM4 40 GB 1.020e-01 1.072e-02 9.515 1.26e-12 ***
Training_hardwareNVIDIA A100 SXM4 80 GB 1.014e-01 1.018e-02 9.958 2.90e-13 ***
Training_hardwareNVIDIA GeForce GTX 285 3.202e-01 7.491e-02 4.275 9.03e-05 ***
Training_hardwareNVIDIA GeForce GTX TITAN X 1.601e-01 2.630e-02 6.088 1.84e-07 ***
Training_hardwareNVIDIA GTX Titan Black 1.498e-01 3.328e-02 4.501 4.31e-05 ***
Training_hardwareNVIDIA H100 SXM5 80GB 9.736e-02 9.840e-03 9.894 3.59e-13 ***
Training_hardwareNVIDIA P100 1.604e-01 1.922e-02 8.342 6.73e-11 ***
Training_hardwareNVIDIA Quadro P600 1.714e-01 3.756e-02 4.562 3.52e-05 ***
Training_hardwareNVIDIA Quadro RTX 4000 1.538e-01 3.263e-02 4.714 2.12e-05 ***
Training_hardwareNVIDIA Quadro RTX 5000 1.819e-01 4.021e-02 4.524 3.99e-05 ***
Training_hardwareNVIDIA Tesla K80 1.125e-01 1.608e-02 6.993 7.54e-09 ***
Training_hardwareNVIDIA Tesla V100 DGXS 32 GB 1.072e-01 1.353e-02 7.922 2.89e-10 ***
Training_hardwareNVIDIA Tesla V100S PCIe 32 GB 9.444e-02 2.030e-02 4.653 2.60e-05 ***
Training_hardwareNVIDIA V100 1.420e-01 1.201e-02 11.822 8.01e-16 ***
Training_time_hour:Hardware_quantity 2.296e-09 9.372e-10 2.450 0.01799 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Gamma family taken to be 0.05497984)

Null deviance: NaN on 70 degrees of freedom
Residual deviance: 3.0043 on 48 degrees of freedom
AIC: 345.39

When I converted the slopes to percentage changes in the response variable, the effect of each continuous variable was almost zero, and even slightly negative:

All the intercepts also converted back to around 1 kWh on the original scale. These results made no sense, because at least one of the slopes should grow with the enormous energy consumption values. I wondered whether a log-link model with the same predictors would give different results, so I fitted the model again:

glm(formula = Energy_kWh ~ Training_time_hour * Hardware_quantity + 
Training_hardware + 0, family = Gamma(link = "log"), data = df)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
Training_time_hour 1.818e-03 1.640e-04 11.088 7.74e-15 ***
Hardware_quantity 7.373e-04 1.008e-04 7.315 2.42e-09 ***
Training_hardwareGoogle TPU v2 7.136e+00 7.379e-01 9.670 7.51e-13 ***
Training_hardwareGoogle TPU v3 1.004e+01 3.156e-01 31.808 < 2e-16 ***
Training_hardwareGoogle TPU v4 1.014e+01 4.220e-01 24.035 < 2e-16 ***
Training_hardwareHuawei Ascend 910 9.231e+00 1.108e+00 8.331 6.98e-11 ***
Training_hardwareNVIDIA A100 1.028e+01 3.301e-01 31.144 < 2e-16 ***
Training_hardwareNVIDIA A100 SXM4 40 GB 1.057e+01 5.635e-01 18.761 < 2e-16 ***
Training_hardwareNVIDIA A100 SXM4 80 GB 1.093e+01 5.751e-01 19.005 < 2e-16 ***
Training_hardwareNVIDIA GeForce GTX 285 3.042e+00 1.043e+00 2.916 0.00538 **
Training_hardwareNVIDIA GeForce GTX TITAN X 6.322e+00 7.379e-01 8.568 3.09e-11 ***
Training_hardwareNVIDIA GTX Titan Black 6.135e+00 1.047e+00 5.862 4.07e-07 ***
Training_hardwareNVIDIA H100 SXM5 80GB 1.115e+01 6.614e-01 16.865 < 2e-16 ***
Training_hardwareNVIDIA P100 5.715e+00 6.864e-01 8.326 7.12e-11 ***
Training_hardwareNVIDIA Quadro P600 4.940e+00 1.050e+00 4.705 2.18e-05 ***
Training_hardwareNVIDIA Quadro RTX 4000 5.469e+00 1.055e+00 5.184 4.30e-06 ***
Training_hardwareNVIDIA Quadro RTX 5000 4.617e+00 1.049e+00 4.401 5.98e-05 ***
Training_hardwareNVIDIA Tesla K80 8.631e+00 7.587e-01 11.376 3.16e-15 ***
Training_hardwareNVIDIA Tesla V100 DGXS 32 GB 9.994e+00 6.920e-01 14.443 < 2e-16 ***
Training_hardwareNVIDIA Tesla V100S PCIe 32 GB 1.058e+01 1.047e+00 10.105 1.80e-13 ***
Training_hardwareNVIDIA V100 9.208e+00 3.998e-01 23.030 < 2e-16 ***
Training_time_hour:Hardware_quantity -2.651e-07 6.130e-08 -4.324 7.70e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Gamma family taken to be 1.088522)

Null deviance: 2.7045e+08 on 70 degrees of freedom
Residual deviance: 1.0593e+02 on 48 degrees of freedom
AIC: 1775

This time, Training_time_hour and Hardware_quantity would increase total energy consumption by 0.18% per additional hour and 0.07% per additional unit of hardware, respectively. Meanwhile, their interaction would decrease energy consumption by a negligible 2.7 × 10⁻⁵%. These results made much more sense, given that Training_time_hour can reach up to 7,000 hours and Hardware_quantity up to 16,000 units.
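These percentages follow directly from exponentiating the slopes reported in the summary output above:

```r
# Slopes from the log-link gamma model's summary output
b_time  <-  1.818e-03   # Training_time_hour
b_hw    <-  7.373e-04   # Hardware_quantity
b_inter <- -2.651e-07   # Training_time_hour:Hardware_quantity

(exp(b_time)  - 1) * 100  # ~0.18% more energy per additional training hour
(exp(b_hw)    - 1) * 100  # ~0.07% per additional hardware unit
(exp(b_inter) - 1) * 100  # ~ -2.7e-05%: a negligible interaction effect
```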

To better visualize the differences, I created two plots comparing the predictions (shown as dashed lines) from both models. In the left panel, which uses the log-transformed gamma model, the dashed lines are almost flat and near zero, nowhere near the fitted lines of the raw data. In the right panel, which uses the log-link gamma GLM, the dashed lines align much more closely with the actual fitted lines.

library(dplyr)
library(ggplot2)
library(patchwork)  # for combining p1 + p2

# glm3: the log-transformed gamma model; glm3_alt: the log-link gamma model
test_data <- df[, c("Training_time_hour", "Hardware_quantity", "Training_hardware")]
prediction_data <- df %>%
  mutate(
    # glm3 predicts log(Energy_kWh) on the response scale, so exp() maps it back to kWh
    pred_energy1 = exp(predict(glm3, newdata = test_data, type = "response")),
    pred_energy2 = predict(glm3_alt, newdata = test_data, type = "response")
  )
y_limits <- c(min(df$Energy_kWh, prediction_data$pred_energy1, prediction_data$pred_energy2),
              max(df$Energy_kWh, prediction_data$pred_energy1, prediction_data$pred_energy2))

p1 <- ggplot(df, aes(x = Hardware_quantity, y = Energy_kWh, color = Training_time_group)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(data = prediction_data, aes(y = pred_energy1), method = "lm", se = FALSE,
              linetype = "dashed", size = 1) +
  scale_y_log10(limits = y_limits) +
  labs(x = "Hardware Quantity", y = "log of Energy (kWh)") +
  theme_minimal() +
  theme(legend.position = "none")

p2 <- ggplot(df, aes(x = Hardware_quantity, y = Energy_kWh, color = Training_time_group)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(data = prediction_data, aes(y = pred_energy2), method = "lm", se = FALSE,
              linetype = "dashed", size = 1) +
  scale_y_log10(limits = y_limits) +
  labs(x = "Hardware Quantity", color = "Training Time Level") +
  theme_minimal() +
  theme(axis.title.y = element_blank())

p1 + p2  # combined side by side with patchwork

Figure 8. Relationship between hardware quantity and log energy consumption across training-time groups. In both panels, the raw data are shown as points, solid lines represent fitted values from linear models, and dashed lines represent predicted values from the generalized linear models. The left panel uses the log-transformed gamma GLM, and the right panel uses the log-link gamma GLM with the same predictors.

Why the log transformation fails

To understand why the log-transformed model cannot capture the underlying effects the way the log-link model can, let's walk through what happens when we apply a log transformation to the response variable:

Let's say Y equals some function of X plus an error term:

Y = f(X) + ε

When we apply the log transformation, we actually compress both f(X) and the error together:

log(Y) = log(f(X) + ε)

This means we are modeling a completely new response variable, log(Y). When we fit our own function g(X) (in my case, g(X) = Training_time_hour * Hardware_quantity + Training_hardware), it tries to capture the combined effects of both the "shrunken" f(X) and the shrunken error:

g(X) ≈ log(f(X) + ε)

In contrast, when we use the log link, we still model the original Y, not a transformed version of it. Instead, the model exponentiates our function g(X) to predict Y:

E[Y] = exp(g(X))

The model then minimizes the difference between the actual Y and the predicted Y, so the error term remains intact on the original scale:

Y = exp(g(X)) + ε
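A small simulation (illustrative, not the Epoch AI data) makes the distinction concrete. With multiplicative noise, the log-transformed model targets E[log Y] (the median on the raw scale), while the log-link model targets log E[Y], so their intercepts differ even though both recover the slope:

```r
set.seed(123)
n <- 2000
x <- runif(n, 0, 10)
y <- exp(0.5 * x + rnorm(n, sd = 1))  # true mean: E[Y|x] = exp(0.5 + 0.5x)

fit_transform <- glm(log(y) ~ x)                           # models E[log Y]
fit_link      <- glm(y ~ x, family = Gamma(link = "log"))  # models log E[Y]

coef(fit_transform)  # intercept near 0,   slope near 0.5
coef(fit_link)       # intercept near 0.5, slope near 0.5
# The intercepts differ by roughly sigma^2 / 2 = 0.5: back-transforming the
# log-transformed fit with a plain exp() would underestimate the mean by ~40%.
```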

Conclusion

Log-transforming the response variable is not the same as using a log link, and it does not always give reliable results. Under the hood, the log transformation changes the variable itself and distorts both the signal and the noise. Understanding this subtle mathematical difference behind a model is just as important as trying to find the best-fitting one.

(1) Epoch AI. Data on Notable AI Models. Retrieved from https://epoch.ai/data/notable-ai-models

(2) University of Virginia Library. Interpreting Log Transformations in a Linear Model. Retrieved from https://library.virginia.edu/data/articles/interpreting-log-transformations-in-a-linear-model
