How Autoencoders Beat PCA for Non-Linear Signals

The financial markets produce vast amounts of data, and our ability to make sense of it often determines our success. Two powerful tools for this task are Principal Component Analysis (PCA) and Autoencoders. While both are used to reduce the dimensionality of our data, they do so in fundamentally different ways, which can have a big impact on a trading model's performance.

Our Simulated Market Data

To rigorously compare PCA and autoencoders, we generated synthetic data with specific underlying structures. Our dataset comprises several features, some exhibiting linear correlations with the target variable, while others are designed to have non-linear relationships. For instance, we introduced:

  • Linear Dependencies: Certain features were directly and linearly related to the target, representing straightforward correlations one might find in simpler market dynamics.

  • Distance-Based Relationships: The target variable was also influenced by the proximity of data points in a two-dimensional feature space (e.g., features x9 and x10). This creates a non-linear relationship, as the effect on the target depends on the combination of these two features.

  • Sinusoidal Relationships: We incorporated features (e.g., x13 and x14) where their combined effect on the target followed a sine wave pattern, a clearly non-linear dependency that linear methods like standard correlation would struggle to fully capture.
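The three kinds of dependency above can be sketched as follows. This is a minimal, illustrative generator, not the post's actual one: the coefficients, noise scale, and exact functional forms are assumptions, and column indices follow the x1…x15 naming (so x9 is column 8).

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# 15 base features; x9/x10 and x13/x14 follow the naming used in the post
X = rng.normal(size=(n, 15))

# Linear dependencies: target depends directly on a few features
linear_part = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2]

# Distance-based relationship: effect depends on proximity to the origin
# in the (x9, x10) plane -- non-linear in the original features
dist_part = 10.0 * np.exp(-(X[:, 8] ** 2 + X[:, 9] ** 2))

# Sinusoidal relationship: the combined effect of x13 and x14 follows a sine wave
sine_part = 5.0 * np.sin(X[:, 12] + X[:, 13])

noise = rng.normal(scale=0.5, size=n)
y = linear_part + dist_part + sine_part + noise
```

A standard linear correlation would pick up the first term but largely miss the distance-based and sinusoidal ones, which is exactly the asymmetry the experiment is built to expose.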

The goal of this controlled simulation is to evaluate how effectively PCA (which excels at capturing linear variance) and autoencoders (which can model non-linear patterns) can extract meaningful signals from this multifaceted data. We then use the reduced-dimensional representations generated by these methods as input to a robust regression algorithm, XGBoost, to predict our target variable and compare the resulting prediction accuracy.

PCA: The Linear Approach

PCA is a classic statistical technique that finds a new set of axes, or principal components, each of which is a linear combination of the original features. It essentially reorients the data so that each successive axis captures the maximum remaining variance.
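In code, this is a one-liner with scikit-learn's `PCA`. The toy data below (low-rank with a little noise) is made up for the example, but it shows the two properties that matter: each column of the projection is a linear combination of the inputs, and the explained variance is ordered from largest to smallest.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy data: 200 samples, 10 features driven by 3 underlying factors
base = rng.normal(size=(200, 3))
X = base @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))

pca = PCA(n_components=5)
Z = pca.fit_transform(X)   # projected data: one column per principal component

print(Z.shape)                          # (200, 5)
print(pca.explained_variance_ratio_)    # non-increasing by construction
```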

Let's look at the results of a PCA analysis. We can visualize the first two principal components, with the color of each point representing our target variable.



In this plot, we see the data points spread out, but there isn't a clear pattern or clustering of colors. This suggests that the relationship between the features and the target variable isn't a simple, linear one. While PCA found a way to reduce the data's dimensions, it didn't do a great job of isolating the information that's most predictive of our target, especially given the non-linear components we introduced in our simulation.

In our test, the PCA-reduced data (using the top 5 principal components based on a scree plot analysis, a common method to determine the optimal number of components) was fed into an XGBoost regression model. After hyperparameter tuning using GridSearchCV, it produced a test RMSE (Root Mean Squared Error) of 78.5737. This serves as a solid baseline, particularly given the linear aspects of our simulated data, but we would expect a method capable of capturing non-linearities to yield better results.
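The scree-plot choice of component count can be approximated programmatically with the cumulative explained-variance ratio. The exact elbow criterion used in the post isn't specified, so the threshold below is an assumption, and the data is a synthetic stand-in:

```python
import numpy as np
from sklearn.decomposition import PCA

def n_components_for_variance(X, threshold=0.95):
    """Smallest number of principal components whose cumulative
    explained variance reaches `threshold` (a scree-plot proxy)."""
    pca = PCA().fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cumulative, threshold) + 1)

rng = np.random.default_rng(1)
# Rank-5 toy data: 5 latent factors spread across 15 features
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 15))
k = n_components_for_variance(X, threshold=0.999)
print(k)  # at most 5, since the data has only 5 underlying factors
```

The selected `k` components then become the feature matrix passed to the downstream regressor.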

Autoencoders: The Non-Linear Powerhouse

An autoencoder is a type of neural network that learns to compress and decompress data. It consists of two parts: an encoder and a decoder. The encoder takes the high-dimensional input and shrinks it down to a low-dimensional "latent space." The decoder then tries to reconstruct the original data from this compressed representation. The key advantage of autoencoders is their ability to capture non-linear relationships between variables through the use of non-linear activation functions within the network layers. They can learn much more complex patterns than PCA.
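A minimal encoder/decoder sketch, assuming PyTorch, tanh activations, and the latent dimension of 20 from the best configuration described below; the input width, hidden width, and training loop are illustrative choices, not the post's exact architecture:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Encoder compresses the input to a latent space; the decoder tries
    to reconstruct the input. The tanh activations are what let the
    network capture non-linear structure."""
    def __init__(self, n_features=30, latent_dim=20, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, latent_dim), nn.Tanh(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, n_features),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Train by minimising reconstruction error (MSE)
torch.manual_seed(0)
X = torch.randn(256, 30)
model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(50):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), X)
    loss.backward()
    opt.step()

latent = model.encoder(X)  # the compressed representation fed downstream
print(latent.shape)        # torch.Size([256, 20])
```

After training, only the encoder is needed: its output replaces the PCA projection as the input to the XGBoost model.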

To find the optimal autoencoder architecture for our task, we conducted an extensive cross-validation process, systematically testing various configurations of latent space dimensionality, network depth (number of layers), activation functions, and training parameters (epochs and batch size). For each autoencoder configuration, we trained it on the training data and then used the encoded (compressed) representation as input to an XGBoost model, again tuning the XGBoost hyperparameters. This nested cross-validation approach ensured we found an autoencoder setup that generalized well to unseen data.

Let's look at the same test data, but this time compressed using our best-performing autoencoder configuration (a 2-layer encoder with a latent dimension of 20 and tanh activation). The plot below shows the top two most important features from the autoencoder's latent space, as determined by the feature importances from the subsequent XGBoost model.


Notice the difference? Unlike the PCA plot, the autoencoder plot shows a clearer separation of colors. This suggests the autoencoder was more effective at creating a latent space where the target variable is better explained, particularly by capturing the non-linear dependencies we intentionally included in our simulated data. The XGBoost model, leveraging these non-linear features extracted by the autoencoder, was able to achieve a lower prediction error.
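Selecting the two latent features to plot works as described above: fit the regressor on the latent codes and rank its feature importances. The sketch below uses scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost (xgboost's `XGBRegressor` exposes `feature_importances_` the same way); the latent codes and target here are synthetic placeholders.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)
latent = rng.normal(size=(1000, 20))   # stand-in for the autoencoder's latent codes
# Hypothetical target driven mainly by latent dimensions 3 and 11
y = np.sin(latent[:, 3]) + 2.0 * latent[:, 11] + 0.1 * rng.normal(size=1000)

model = GradientBoostingRegressor(random_state=0).fit(latent, y)
top_two = np.argsort(model.feature_importances_)[-2:][::-1]
print(top_two)   # indices of the two most important latent features
```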

When we fed this more informative, compressed data into our optimized XGBoost model, the results were significantly better. The test RMSE was 64.1568, an improvement of roughly 18% over the PCA baseline of 78.5737.


The Bottom Line for Traders

The choice between PCA and autoencoders isn't about one being universally "better" than the other; it's about strategically matching the tool to the underlying structure of your market data.

  • If your data's predictive signals are primarily driven by linear relationships, PCA offers a computationally efficient and interpretable approach to dimension reduction.

  • However, in the often complex and non-linear dynamics of financial markets, autoencoders provide a powerful advantage through their ability to learn intricate, non-linear representations of the data that can better capture the true drivers of your target variable.

In our carefully constructed simulation, the autoencoder's superior ability to model and extract non-linear patterns led to a more accurate predictive model for our target trading signal. This underscores a vital consideration for quantitative traders: while simplicity has its merits, embracing more sophisticated, non-linear feature engineering techniques can be crucial for unlocking hidden signals and gaining a significant edge in the market. The key lies in understanding the potential nature of the relationships within your data and selecting the tool best equipped to unravel them.

