Tree-based models like XGBoost, LightGBM, and CatBoost internally convert data to 32-bit floats during training. This means that split thresholds are chosen based on 32-bit precision values. However, R uses 64-bit doubles by default, and most databases also use higher precision floating-point numbers.
This precision mismatch can cause predictions to differ when a data point falls exactly on or very close to a split boundary. The 32-bit and 64-bit representations of the same number may round differently, causing the data point to go left in one system and right in another.
XGBoost and Cubist store everything as 32-bit floats, making them most susceptible to this issue. LightGBM and CatBoost use 64-bit doubles for leaf values, which reduces (but does not eliminate) the risk.
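You can reproduce the mismatch directly by round-tripping a 64-bit double through 32-bit storage. This sketch uses only base R — writeBin() with size = 4 stores a value as a 32-bit float — and the helper name as_float32() is our own, not part of any package:

```r
# Store a double as a 32-bit float and read it back as a double,
# mimicking what a tree library does to feature values internally.
as_float32 <- function(x) {
  readBin(writeBin(x, raw(), size = 4), "double", size = 4, n = length(x))
}

x64 <- 6.226
x32 <- as_float32(x64)

print(x32, digits = 15)  # 6.22599983215332
x32 == x64               # FALSE: the 32-bit representation is slightly smaller
```

The round-trip loses information because 6.226 has no exact binary representation, and the nearest 32-bit float is farther away than the nearest 64-bit double.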
Here is a real example from a Cubist model. When we extract the split values used in the model’s rules, we see values like:
```
variable  value
lstat     9.5299997
rm        6.2259998
rm        6.546
lstat     5.3899999
```
These split values should correspond to actual values in the training data. But when we check, only one of the four matches exactly:
```
# Exact matches
variable  value
rm        6.546

# Non-matches
variable  value
lstat     9.5299997
rm        6.2259998
lstat     5.3899999
```
If we look for nearby values in the training data:
```
variable  value_data  value_split
rm        6.226       6.2259998
rm        6.546       6.546
lstat     5.39        5.3899999
lstat     9.53        9.5299997
```
The original training values were 6.226,
5.39, and 9.53, but they were converted to
32-bit floats during model training, resulting in slightly different
stored thresholds.
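We can confirm this by converting the original training values to 32-bit precision ourselves. as_float32() below is a small helper of our own (a writeBin()/readBin() round-trip through 4-byte storage), not a package function:

```r
# Round-trip doubles through 32-bit float storage.
as_float32 <- function(x) {
  readBin(writeBin(x, raw(), size = 4), "double", size = 4, n = length(x))
}

train_values <- c(6.226, 5.39, 9.53)
print(as_float32(train_values), digits = 8)
# [1] 6.2259998 5.3899999 9.5299997  -- exactly the stored thresholds
```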
Why does this matter? Consider a model with two rules:
```
rule 1: rm > 6.2259998
rule 2: rm <= 6.2259998
```
If you pass in an observation where rm is
6.226, you might expect rule 1 to apply since
6.226 > 6.2259998. But the native model applies rule 2
because it internally converted 6.226 to
6.2259998 during training, making them equal.
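Using the same writeBin()/readBin() round-trip helper (redefined here so the snippet stands alone), the two rules play out like this:

```r
as_float32 <- function(x) {
  readBin(writeBin(x, raw(), size = 4), "double", size = 4, n = length(x))
}

threshold <- as_float32(6.226)  # the stored split value, 6.2259998...
rm_value  <- 6.226              # scored at 64-bit precision in R or SQL

rm_value > threshold              # TRUE: a 64-bit comparison picks rule 1
as_float32(rm_value) > threshold  # FALSE: the native model picks rule 2
```

The same observation is routed to different rules depending only on the precision at which the comparison is evaluated.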
tidypredict extracts split thresholds from the model and uses them in R formulas or SQL queries. Since R and databases typically use 64-bit floats, the comparisons are done at 64-bit precision against thresholds that were originally determined at 32-bit precision.
In most cases, this works fine because data points rarely fall
exactly on split boundaries. However, you should always verify
predictions match using tidypredict_test().
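A minimal check might look like the following sketch. tidypredict_test() compares the model's own predictions against those produced by the translated formula; the lm() model here is just a stand-in for whatever model you are deploying, and the df and threshold arguments (the data to score and the largest tolerated difference) reflect our reading of the tidypredict documentation:

```r
library(tidypredict)

# A simple stand-in model; substitute your own fitted model here.
model <- lm(mpg ~ wt + cyl, data = mtcars)

# Compares predict(model) with the tidypredict formula evaluated in R
# and flags rows whose difference exceeds the threshold.
tidypredict_test(model, df = mtcars, threshold = 1e-12)
```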
Whether to keep using tidypredict despite this issue comes down to weighing its convenience (scoring in SQL, with no model runtime in production) against the risk of occasional boundary mismatches.
When are values likely to hit boundaries?
Continuous, high-precision real-world measurements are unlikely to land exactly on a split boundary. The risk rises when features are rounded or discretized before training (prices, percentages, values recorded to a few decimals), because split thresholds are derived from those same repeated values.
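One way to gauge your exposure is to count how many training values coincide with a model's thresholds once both are compared at 32-bit precision. This sketch reuses the writeBin()/readBin() round-trip helper and the thresholds from the Cubist example above:

```r
as_float32 <- function(x) {
  readBin(writeBin(x, raw(), size = 4), "double", size = 4, n = length(x))
}

thresholds <- c(9.5299997, 6.2259998, 6.546, 5.3899999)  # stored split values
train_vals <- c(9.53, 6.226, 6.546, 5.39)                # original data

# At 64-bit precision, only one threshold matches a training value...
sum(train_vals %in% thresholds)                          # 1
# ...but at 32-bit precision, every training value sits on a boundary.
sum(as_float32(train_vals) %in% as_float32(thresholds))  # 4
```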
Considerations:

- Use tidypredict_test() to quantify the discrepancy rate on your data.
- Accept small differences: For production use, consider that a tiny fraction of predictions may differ at exact boundaries. Decide if this is acceptable for your use case.
- Use native predictions when possible: For applications where perfect agreement is critical, consider using the native model’s predict function rather than SQL translation.