Speed vs. Accuracy: Optimal Data Retention
When building machine learning models for time-series data, a crucial question is: how much data should you retain? Our experiment, using a simulated mean-reverting asset pair, reveals a practical rule of thumb: retain a window of 4-8x your data weighting half-life. This approach strikes the best balance between model accuracy and computational efficiency.
The Experiment Setup
Our experiment simulated two assets with a mean-reverting correlation, a scenario common in financial markets where relationships between assets evolve over time. An XGBoost model was tasked with predicting one asset's returns based on the other's recent performance.
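A minimal sketch of such a simulation. The exact process used in the experiment isn't specified, so this assumes an Ornstein-Uhlenbeck-style update for the correlation; all parameter values here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Correlation between the two assets mean-reverts around 0.5
# (Ornstein-Uhlenbeck-style update; parameters are illustrative).
rho = np.empty(n)
rho[0] = 0.5
for t in range(1, n):
    rho[t] = rho[t - 1] + 0.02 * (0.5 - rho[t - 1]) + 0.05 * rng.standard_normal()
rho = np.clip(rho, -0.99, 0.99)

# Asset A's returns are plain noise; asset B's returns share a
# time-varying correlation rho[t] with A.
ret_a = rng.standard_normal(n)
ret_b = rho * ret_a + np.sqrt(1 - rho**2) * rng.standard_normal(n)
```

The clip keeps the simulated correlation inside a valid range even when the noise pushes it toward ±1.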
To find the optimal data window, we ran a grid search on two key parameters:
Data Retention: The total number of historical data points used for training, from 100 to 5,000.
Data Weighting Half-Life: A decay factor that gives more importance to recent data. A shorter half-life means older data has less influence on the model.
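The half-life weighting above can be implemented as simple exponential decay, where an observation loses half its weight every `half_life` points. A sketch (the exact weighting scheme used in the experiment is an assumption):

```python
import numpy as np

def half_life_weights(n: int, half_life: float) -> np.ndarray:
    """Exponential-decay sample weights: the most recent point gets
    weight 1.0, and weight halves every `half_life` observations."""
    age = np.arange(n)[::-1]  # oldest point has the largest age
    return 0.5 ** (age / half_life)

w = half_life_weights(1000, half_life=250)
```

These weights can be passed to XGBoost via the `sample_weight` argument of `fit` so that older observations contribute less to the loss.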
The model's performance was measured by Root Mean Squared Error (RMSE), and we also tracked the computation time for each combination.
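The grid search itself can be sketched as a loop over (retention, half-life) pairs that records RMSE and wall-clock time for each fit. To keep the sketch dependency-free, XGBoost is swapped here for a weighted least-squares fit; the structure of the loop is what matters:

```python
import time
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(6000)
y = 0.5 * x + 0.1 * rng.standard_normal(6000)

def weighted_rmse(retention: int, half_life: float) -> float:
    """Fit y ~ x on the last `retention` points with exponential-decay
    weights, then report in-sample RMSE (weighted least squares stands
    in for the XGBoost model used in the experiment)."""
    xs, ys = x[-retention:], y[-retention:]
    w = 0.5 ** (np.arange(retention)[::-1] / half_life)
    beta = np.sum(w * xs * ys) / np.sum(w * xs * xs)
    return float(np.sqrt(np.mean((ys - beta * xs) ** 2)))

results = {}
for retention in (100, 500, 1000, 5000):
    for half_life in (50, 250, 500):
        t0 = time.perf_counter()
        rmse = weighted_rmse(retention, half_life)
        results[(retention, half_life)] = (rmse, time.perf_counter() - t0)
```

Tracking both values per cell is what makes the accuracy heatmap and the computation-time heatmap directly comparable.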
The Key Finding: A Clear Relationship
The results show that a significant investment in data retention is often necessary to achieve a high level of accuracy. The following table illustrates the minimum data retention required to get within a certain percentage of the optimal RMSE for each half-life.
As you can see, getting from an RMSE within 10% of the minimum to within 1% of the minimum often requires a substantial increase in the data retention window. For a half-life of 250, for example, the retention window must increase from 600 to 2000 data points to achieve this higher level of precision. This investment is often worth it, as it directly leads to a more accurate and reliable model.
The Point of Diminishing Returns: When the Cost Gets Really Bad
The real pitfall lies in continuing to add data after the model's performance has already reached its limit. The computation time heatmap reveals that while accuracy plateaus, the computational cost continues to rise dramatically.
For instance, looking at a half-life of 50, the model's RMSE hits its minimum at around 400 data points. However, the time required to train the model continues to climb as we increase the data retention all the way up to 5,000, with no corresponding benefit to accuracy. This is the critical moment when the trade-off becomes highly unfavorable. The added data points are too old to be relevant (due to the weighting half-life), but they still force the model to perform a much more intensive training process.
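The irrelevance of those old points is easy to quantify: under exponential half-life weighting, the fraction of total weight carried by data older than k half-lives is exactly 0.5**k, independent of the half-life itself (the tail sum of 0.5**(age/h) beyond age k*h divided by the full sum collapses to 0.5**k). A quick check in pure arithmetic:

```python
# Fraction of total weight carried by points older than k half-lives
# under exponential half-life weighting: exactly 0.5**k.
for k in (4, 8, 10):
    print(f"older than {k} half-lives: {0.5**k:.3%} of total weight")
```

So for a half-life of 50, everything beyond the first ~400 points (8 half-lives) collectively holds under half a percent of the weight, yet each of those points still costs training time.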
Conclusion: A Smarter Approach
Instead of thinking of data retention as a simple "more is better" or "less is better" problem, a smarter approach is to use a specific rule of thumb derived from our analysis. Our findings show that to achieve a high degree of accuracy—within 1% of the optimal RMSE—you should retain a data window that is 4 to 8 times your data weighting half-life.
For a short half-life (e.g., 50), a retention window of 400 data points (8x) is necessary to reach peak performance.
For a longer half-life (e.g., 500), a retention window of 2000 data points (4x) is required.
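The rule of thumb above can be packaged as a small helper. The 4x and 8x multipliers come from the article's findings; returning the range as a pair, rather than picking a single multiplier, is our choice:

```python
def retention_window(half_life: float) -> tuple[int, int]:
    """Rule-of-thumb data retention range: 4x to 8x the weighting
    half-life. In the experiment, shorter half-lives tended to need
    the upper end (8x) and longer half-lives the lower end (4x)."""
    return int(4 * half_life), int(8 * half_life)
```

For example, `retention_window(50)` gives (200, 400) and `retention_window(500)` gives (2000, 4000), bracketing the values observed in the experiment.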
Once you have enough data to achieve this level of accuracy, adding more yields diminishing returns. Our analysis of computation time shows that retaining data far beyond this point does not improve model performance, but it does significantly increase the time and resources needed for training: the oldest data points carry almost no weight, yet they still consume computational power on every training run.


