Posts

Showing posts from August, 2025

How Autoencoders Beat PCA for Non-Linear Signals

The financial markets produce vast amounts of data, and our ability to make sense of it often determines our success. Two powerful tools for this task are Principal Component Analysis (PCA) and autoencoders. While both are used to reduce the dimensions of our data, they do it in fundamentally different ways, which can have a big impact on a trading model's performance.

Our Simulated Market Data

To rigorously compare PCA and autoencoders, we generated synthetic data with specific underlying structures. Our dataset comprises several features, some exhibiting linear correlations with the target variable, while others are designed to have non-linear relationships. For instance, we introduced:

- Linear Dependencies: Certain features were directly and linearly related to the target, representing straightforward correlations one might find in simpler market dynamics.
- Distance-Based Relationships: The target variable was also influenced by the proximity of data points in a two-dimension...
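The linear-versus-non-linear contrast above can be sketched with scikit-learn alone. The circular feature layout and the tiny MLP standing in for an autoencoder are illustrative assumptions of mine, not the post's actual dataset or network:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Hypothetical non-linear structure: two features lie on a circle, so no
# single linear direction can reconstruct them well.
rng = np.random.default_rng(0)
t = rng.uniform(0.0, 2.0 * np.pi, 500)
X = np.column_stack([np.cos(t), np.sin(t), 0.1 * rng.standard_normal(500)])
X = StandardScaler().fit_transform(X)

# Linear baseline: project onto 1 principal component and reconstruct.
pca = PCA(n_components=1).fit(X)
X_pca = pca.inverse_transform(pca.transform(X))

# Autoencoder stand-in: an MLP trained to reproduce its own input
# through a 1-unit bottleneck (encoder / bottleneck / decoder layers).
ae = MLPRegressor(hidden_layer_sizes=(16, 1, 16), activation="tanh",
                  max_iter=3000, random_state=0)
ae.fit(X, X)
X_ae = ae.predict(X)

mse_pca = float(np.mean((X - X_pca) ** 2))
mse_ae = float(np.mean((X - X_ae) ** 2))
print(f"PCA reconstruction MSE: {mse_pca:.3f}, autoencoder MSE: {mse_ae:.3f}")
```

The design point is that both methods compress to one latent dimension, but only the non-linear decoder can bend that dimension around the circle.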

Speed vs. Accuracy: Optimal Data Retention

When building machine learning models for time-series data, a crucial question is: how much data should you retain? Our experiment, using a simulated mean-reverting asset pair, reveals a powerful rule of thumb: retain 10-20x your data weighting half-life. This approach strikes the optimal balance between model accuracy and computational efficiency.

The Experiment Setup

Our experiment simulated two assets with a mean-reverting correlation, a scenario common in financial markets where relationships between assets evolve over time. An XGBoost model was tasked with predicting one asset's returns based on the other's recent performance. To find the optimal data window, we ran a grid search on two key parameters:

- Data Retention: The total number of historical data points used for training, from 100 to 5,000.
- Data Weighting Half-Life: A decay factor that gives more importance to recent data. A shorter half-life means older data has less influence on the model.

The model's perfor...
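A minimal sketch of the weighting scheme described above, with an assumed half-life of 50 bars (the post's actual parameter values are not given in this excerpt):

```python
import numpy as np

def half_life_weights(n, half_life):
    """Exponential-decay sample weights: the newest observation gets
    weight 1.0; an observation `half_life` steps older gets 0.5."""
    ages = np.arange(n)[::-1]          # index 0 = oldest point, age n-1
    return 0.5 ** (ages / half_life)

half_life = 50                         # hypothetical half-life, in bars
retention = 15 * half_life             # middle of the 10-20x rule of thumb
w = half_life_weights(retention, half_life)

# At ~15 half-lives, the oldest retained point carries roughly 0.003% of
# the newest point's weight, so keeping data beyond this window adds
# computational cost but almost no additional signal.
print(w[0], w[-1])
```

Weights like these can be passed as `sample_weight` to an XGBoost model's `fit` call.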

A Deeper Dive: How Model Complexity and Prediction Horizon Shape Optimal Half-Life

In quantitative analysis, selecting the appropriate model "memory" is a fundamental challenge when working with time-series data. This memory, often controlled by a half-life parameter in exponentially weighted methods, dictates the influence of historical observations on a model's predictions. The following analysis explores how the optimal half-life is influenced by a model's structural complexity and the prediction horizon, using a simulation that incorporates fast-reverting "trade impact" noise.

Method: Simulation and Analysis

To investigate this relationship, our simulation was structured as follows:

- Synthetic Data Generation: Two assets were simulated with a mean-reverting correlation. The analytical half-life of this correlation was approximately 6.93 days. To create a realistic, noisy environment, a separate, faster mean-reverting process was added to the price ...
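The quoted analytical half-life follows directly from the standard mean-reversion formula; the reversion speed of 0.1/day below is my assumption, chosen because it reproduces the excerpt's 6.93-day figure:

```python
import numpy as np

# For a mean-reverting (Ornstein-Uhlenbeck-style) process with reversion
# speed theta, the half-life is ln(2) / theta. An assumed speed of
# 0.1/day is consistent with the ~6.93 days quoted in the post.
theta = 0.1
t_half = np.log(2.0) / theta
print(round(t_half, 2))  # → 6.93
```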

Half-Lives for Trading Correlated Assets? Unpacking Correlation Mean Reversion and Predictive Power

Today, we're diving into a fascinating corner of quantitative finance: the interplay between mean-reverting correlations and how we can best predict asset movements, particularly in the context of strategies like pairs trading. If you've ever thought about how the relationship between two assets evolves over time, and how to capture that evolution for better predictions, this post is for you.

Our central question for today's exploration is: what is the relationship between the mean reversion time of a correlation (how quickly it tends to return to its average) and the "best" look-back period (or half-life) to use when trying to predict one asset's returns based on another?

To answer this, I set up a simulation. Here's a quick rundown of my experiment:

- Mean-Reverting Correlation: I simulated 50 different "mean-reverting random walks" for the correlation between two hypothetical assets. This means the correlation itself isn't static; it jitter...
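A mean-reverting random walk for a correlation can be sketched as a discrete Ornstein-Uhlenbeck process clipped to the valid range. The parameter values (`mu`, `theta`, `sigma`) are illustrative assumptions, not the post's settings:

```python
import numpy as np

def simulate_mr_correlation(n_steps, mu=0.5, theta=0.1, sigma=0.05, seed=0):
    """One mean-reverting random walk for a correlation: each step pulls
    toward the long-run mean `mu` at speed `theta`, plus Gaussian noise,
    clipped so the value stays inside (-1, 1)."""
    rng = np.random.default_rng(seed)
    rho = np.empty(n_steps)
    rho[0] = mu
    for t in range(1, n_steps):
        step = theta * (mu - rho[t - 1]) + sigma * rng.standard_normal()
        rho[t] = np.clip(rho[t - 1] + step, -0.99, 0.99)
    return rho

# 50 independent paths, mirroring the experiment's setup
paths = np.array([simulate_mr_correlation(1_000, seed=s) for s in range(50)])
print(paths.shape)
```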

Navigating Colab's Compute: Choosing the Right Architecture for Your XGBoost Training

Hey everyone, and welcome back to "Signal and Noise: A Trader's Journey into Data and Decisions." Today, we're cutting straight to a crucial question for anyone working with machine learning on platforms like Google Colab: what kind of compute architecture should you choose for your XGBoost training?

It's not always as simple as "GPU is always faster." The optimal choice often depends critically on your dataset's characteristics, particularly its dimensionality. I conducted a controlled experiment within Google Colab, leveraging synthetic datasets to rigorously test XGBoost regression training across various CPU and GPU configurations. My focus was to understand when and why certain architectures provide an advantage, specifically comparing the standard T4, T4 High-RAM, L4, and A100 runtimes. Let's dive into the results and draw some practical conclusions to guide your Colab compute choices.

The Experiment Setup

My methodical approach involved:

- Synt...
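For reference, switching XGBoost between CPU and GPU comes down to a couple of parameters. The values below are a hypothetical starting point using the XGBoost >= 2.0 `device` API, not the exact settings from the experiment:

```python
# Hypothetical XGBoost parameter sets for a Colab CPU-vs-GPU comparison
# (XGBoost >= 2.0, where the accelerator is selected via "device").
cpu_params = {"tree_method": "hist", "device": "cpu",
              "n_estimators": 500, "max_depth": 6}
gpu_params = {"tree_method": "hist", "device": "cuda",
              "n_estimators": 500, "max_depth": 6}

# Benchmarking sketch: construct xgboost.XGBRegressor(**params) for each
# configuration and wrap .fit(X_train, y_train) in time.perf_counter()
# calls on each Colab runtime you want to compare.
```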

Unmasking Correlation in a Sea of Uncorrelated Trades

In the world of quantitative finance and market analysis, we often seek out relationships between different assets. One powerful concept is "pair trading," where you trade two historically correlated assets, betting that if their relationship temporarily diverges, it will eventually revert. But what happens when seemingly correlated assets start behaving independently, at least on the surface? This is where the distinction between underlying asset correlation and observed trade flow correlation becomes critical.

Today, we're diving into a fascinating simulation designed to illustrate how assets with strong underlying value correlations can have uncorrelated price moves, and more importantly, how we can cut through that noise to reveal the true relationships. Let's break down the key insights from our simulation, focusing on the plots we generated.

How the Data Was Simulated

To explore this concept, we've created a synthetic dataset that mimics key aspects of finan...
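A toy model of the effect (my own sketch, not the post's exact simulation): two prices share correlated latent "value" increments but carry large independent trade noise. Short-horizon returns then look nearly uncorrelated, while longer-horizon returns recover the underlying link, because the random-walk value variance grows with the horizon while the stationary noise contribution does not:

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho, noise_sd = 50_000, 0.8, 5.0   # assumed parameters for illustration

# Correlated latent "value" increments for the two assets
z1, z2 = rng.standard_normal(n), rng.standard_normal(n)
value_a = np.cumsum(z1)
value_b = np.cumsum(rho * z1 + np.sqrt(1 - rho**2) * z2)

# Observed prices = value + large independent (stationary) trade noise
price_a = value_a + noise_sd * rng.standard_normal(n)
price_b = value_b + noise_sd * rng.standard_normal(n)

def return_corr(pa, pb, k):
    """Correlation of non-overlapping k-step returns."""
    ra, rb = np.diff(pa[::k]), np.diff(pb[::k])
    return float(np.corrcoef(ra, rb)[0, 1])

print(return_corr(price_a, price_b, 1))    # near zero: noise dominates
print(return_corr(price_a, price_b, 100))  # value correlation re-emerges
```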

Welcome to Navigating Signal & Noise: A Trader's Journey into Data and Decisions

Hello, and welcome! I'm Scott Szatkowski, and I'm genuinely excited to launch this blog, Navigating Signal & Noise, as a space to explore the fascinating intersection of data science, predictive analytics, and the dynamic world of both traditional and decentralized financial assets.

For those of you delving into the complexities of financial markets, you know that the true challenge often lies in sifting through the constant barrage of information – the "noise" – to identify the meaningful "signals" that inform better decisions. That's precisely what this blog aims to do: dissect and discuss various approaches to uncovering those signals, including explorations in the rapidly evolving cryptocurrency markets. Think of this as my public notebook, where I'll share my ongoing explorations and what I'm learning along the way.

My journey into this space began with a strong foundation in quantitative analysis. I hold an M.A. in Statistics from Colu...