Dimensionality Reduction in Trading: Creating a Synthetic Volatility Feature with Kernel PCA¶
In this notebook, we create multiple volatility indicators using the quantreo library, standardize them properly, and apply Kernel PCA to extract a single, interpretable synthetic volatility feature. This approach reduces multicollinearity while retaining essential market information—ideal for building more robust machine learning models in algorithmic trading.
# Import the Features Engineering Package from Quantreo
import quantreo.features_engineering as fe
# Import scikit-learn packages
from sklearn.decomposition import KernelPCA
from sklearn.preprocessing import StandardScaler
# Import pandas
import pandas as pd
# To display the graphics
import matplotlib.pyplot as plt
plt.style.use("seaborn-v0_8")
# To remove some warnings
import warnings
warnings.filterwarnings("ignore")
# Import a dataset to test the functions and create new ones easily
from quantreo.datasets import load_generated_ohlcv
df = load_generated_ohlcv()
df = df.loc["2016"]
# Show the data
df
| open | high | low | close | volume | |
|---|---|---|---|---|---|
| time | |||||
| 2016-01-04 00:00:00 | 104.944241 | 105.312073 | 104.929735 | 105.232289 | 576.805768 |
| 2016-01-04 04:00:00 | 105.233361 | 105.252139 | 105.047564 | 105.149357 | 485.696723 |
| 2016-01-04 08:00:00 | 105.159851 | 105.384745 | 105.141110 | 105.330306 | 403.969745 |
| 2016-01-04 12:00:00 | 105.330306 | 105.505799 | 104.894155 | 104.923404 | 1436.917324 |
| 2016-01-04 16:00:00 | 104.914147 | 105.023293 | 104.913252 | 105.014347 | 1177.672605 |
| ... | ... | ... | ... | ... | ... |
| 2016-12-30 04:00:00 | 103.632257 | 103.711884 | 103.495896 | 103.564574 | 563.932484 |
| 2016-12-30 08:00:00 | 103.564574 | 103.629321 | 103.555581 | 103.616731 | 697.707475 |
| 2016-12-30 12:00:00 | 103.615791 | 103.628165 | 103.496810 | 103.515847 | 1768.926665 |
| 2016-12-30 16:00:00 | 103.515847 | 103.692243 | 103.400370 | 103.422193 | 921.437054 |
| 2016-12-30 20:00:00 | 103.420285 | 103.429955 | 103.195921 | 103.226868 | 764.291608 |
1548 rows × 5 columns
Create volatility features¶
In this section, we generate multiple volatility indicators using different statistical estimators (Close-to-Close, Parkinson, Rogers-Satchell, and Yang-Zhang) over various time windows. These features will later be combined through dimensionality reduction with Kernel PCA.
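Quantreo's internal implementations are not shown here, but the estimators themselves are standard. As an illustration, here is a minimal pandas sketch of the Parkinson estimator, assuming its textbook definition (rolling mean of squared log high/low ranges, scaled by 1/(4 ln 2)); the function name and signature only mirror the library's and are not its actual code:

```python
import numpy as np
import pandas as pd

def parkinson_volatility(df, window_size=30, high_col="high", low_col="low"):
    """Rolling Parkinson volatility: sqrt(mean(ln(high/low)^2) / (4 ln 2))."""
    log_hl_sq = np.log(df[high_col] / df[low_col]) ** 2
    return np.sqrt(log_hl_sq.rolling(window_size).mean() / (4 * np.log(2)))

# Tiny synthetic bar series with a constant 2% high/low range
rng = np.random.default_rng(0)
close = 100 * np.exp(np.cumsum(rng.normal(0, 0.01, 100)))
bars = pd.DataFrame({"high": close * 1.01, "low": close * 0.99})
vol = parkinson_volatility(bars, window_size=30)
```

Because the high/low ratio is constant in this toy series, the rolling estimate settles to a constant value after the first full window; the other estimators differ mainly in which OHLC prices enter the formula.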
# Standard volatility features
df["vol_close_to_close_30"] = fe.volatility.close_to_close_volatility(df, window_size=30)
df["vol_close_to_close_60"] = fe.volatility.close_to_close_volatility(df, window_size=60)
df["vol_close_to_close_90"] = fe.volatility.close_to_close_volatility(df, window_size=90)
# Parkinson volatility features
df["vol_parkinson_30"] = fe.volatility.parkinson_volatility(df, high_col="high", low_col="low", window_size=30)
df["vol_parkinson_60"] = fe.volatility.parkinson_volatility(df, high_col="high", low_col="low", window_size=60)
df["vol_parkinson_90"] = fe.volatility.parkinson_volatility(df, high_col="high", low_col="low", window_size=90)
# Rogers Satchell volatility features
df["vol_rogers_satchell_30"] = fe.volatility.rogers_satchell_volatility(df, high_col="high", low_col="low", open_col="open", close_col="close", window_size=30)
df["vol_rogers_satchell_60"] = fe.volatility.rogers_satchell_volatility(df, high_col="high", low_col="low", open_col="open", close_col="close", window_size=60)
df["vol_rogers_satchell_90"] = fe.volatility.rogers_satchell_volatility(df, high_col="high", low_col="low", open_col="open", close_col="close", window_size=90)
# Yang Zhang volatility features
df["vol_yang_zhang_30"] = fe.volatility.yang_zhang_volatility(df, high_col="high", low_col="low", open_col="open", close_col="close", window_size=30)
df["vol_yang_zhang_60"] = fe.volatility.yang_zhang_volatility(df, high_col="high", low_col="low", open_col="open", close_col="close", window_size=60)
df["vol_yang_zhang_90"] = fe.volatility.yang_zhang_volatility(df, high_col="high", low_col="low", open_col="open", close_col="close", window_size=90)
# Remove NaN values
df = df.dropna()
df
| open | high | low | close | volume | vol_close_to_close_30 | vol_close_to_close_60 | vol_close_to_close_90 | vol_parkinson_30 | vol_parkinson_60 | vol_parkinson_90 | vol_rogers_satchell_30 | vol_rogers_satchell_60 | vol_rogers_satchell_90 | vol_yang_zhang_30 | vol_yang_zhang_60 | vol_yang_zhang_90 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| time | |||||||||||||||||
| 2016-02-15 00:00:00 | 108.987590 | 109.070169 | 108.709641 | 108.907025 | 660.035859 | 0.001785 | 0.001590 | 0.001584 | 0.001770 | 0.001647 | 0.001633 | 0.001801 | 0.001666 | 0.001662 | 0.002505 | 0.002297 | 0.002289 |
| 2016-02-15 04:00:00 | 108.858099 | 109.157427 | 108.846086 | 109.138406 | 794.704627 | 0.001821 | 0.001606 | 0.001596 | 0.001801 | 0.001662 | 0.001645 | 0.001850 | 0.001690 | 0.001679 | 0.002524 | 0.002309 | 0.002299 |
| 2016-02-15 08:00:00 | 109.138406 | 109.261909 | 109.126550 | 109.175951 | 959.457020 | 0.001792 | 0.001606 | 0.001595 | 0.001821 | 0.001654 | 0.001648 | 0.001846 | 0.001662 | 0.001669 | 0.002580 | 0.002323 | 0.002315 |
| 2016-02-15 12:00:00 | 109.175951 | 109.281245 | 109.158402 | 109.278148 | 1752.989844 | 0.001785 | 0.001602 | 0.001597 | 0.001801 | 0.001646 | 0.001648 | 0.001836 | 0.001651 | 0.001670 | 0.002545 | 0.002318 | 0.002315 |
| 2016-02-15 16:00:00 | 109.278148 | 109.469441 | 109.236273 | 109.440890 | 1918.783515 | 0.001673 | 0.001609 | 0.001599 | 0.001801 | 0.001638 | 0.001648 | 0.001835 | 0.001647 | 0.001669 | 0.002544 | 0.002304 | 0.002317 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2016-12-30 04:00:00 | 103.632257 | 103.711884 | 103.495896 | 103.564574 | 563.932484 | 0.002443 | 0.001955 | 0.001770 | 0.002410 | 0.002141 | 0.001925 | 0.002434 | 0.002263 | 0.002003 | 0.003399 | 0.002902 | 0.002628 |
| 2016-12-30 08:00:00 | 103.564574 | 103.629321 | 103.555581 | 103.616731 | 697.707475 | 0.002438 | 0.001944 | 0.001742 | 0.002359 | 0.002139 | 0.001927 | 0.002328 | 0.002261 | 0.002007 | 0.003350 | 0.002901 | 0.002628 |
| 2016-12-30 12:00:00 | 103.615791 | 103.628165 | 103.496810 | 103.515847 | 1768.926665 | 0.002413 | 0.001949 | 0.001744 | 0.002357 | 0.002137 | 0.001904 | 0.002326 | 0.002261 | 0.001989 | 0.003347 | 0.002894 | 0.002591 |
| 2016-12-30 16:00:00 | 103.515847 | 103.692243 | 103.400370 | 103.422193 | 921.437054 | 0.002286 | 0.001955 | 0.001747 | 0.002335 | 0.002135 | 0.001904 | 0.002307 | 0.002256 | 0.001988 | 0.003318 | 0.002895 | 0.002590 |
| 2016-12-30 20:00:00 | 103.420285 | 103.429955 | 103.195921 | 103.226868 | 764.291608 | 0.002259 | 0.001972 | 0.001752 | 0.002222 | 0.002138 | 0.001911 | 0.002227 | 0.002259 | 0.002000 | 0.003149 | 0.002899 | 0.002598 |
1368 rows × 17 columns
vol_features = [col for col in df.columns if "vol_" in col and "volatility" not in col]
vol_features
['vol_close_to_close_30', 'vol_close_to_close_60', 'vol_close_to_close_90', 'vol_parkinson_30', 'vol_parkinson_60', 'vol_parkinson_90', 'vol_rogers_satchell_30', 'vol_rogers_satchell_60', 'vol_rogers_satchell_90', 'vol_yang_zhang_30', 'vol_yang_zhang_60', 'vol_yang_zhang_90']
df[vol_features].plot(figsize=(15,6))
plt.show()
Split the Data¶
In this section, we simplify the setup by splitting the dataset into a single train/test split, keeping the chronological order (80% train, 20% test).
In a real-world scenario, especially if you're using walk-forward validation or cross-validation for time series, you'll need to reapply the PCA and standardization process for each train/test pair.
👉 This is essential to avoid data leakage: you must always fit the scaler and PCA only on the training data, and then apply the transformations to the test set—never the other way around.
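The per-fold refit described above can be sketched as follows, using a synthetic stand-in for the volatility matrix and scikit-learn's `TimeSeriesSplit` (one reasonable choice of chronological splitter; the notebook itself uses a single 80/20 split):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a matrix of volatility features
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=[f"vol_{i}" for i in range(4)])

fold_features = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    # Fit the scaler and the PCA on the training window only
    scaler = StandardScaler().fit(X.iloc[train_idx])
    pca = KernelPCA(n_components=1).fit(scaler.transform(X.iloc[train_idx]))
    # Transform the test fold with objects fitted on the past only
    fold_features.append(pca.transform(scaler.transform(X.iloc[test_idx])))
```

Each fold produces its own synthetic feature for the out-of-sample window, computed without ever looking at future data.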
# Define the train size
train_size = int(len(df) * 0.8)
# Chronological split
train_df = df.iloc[:train_size].copy()
test_df = df.iloc[train_size:].copy()
# Check the result
print(f"Train set shape: {train_df.shape}")
print(f"Test set shape : {test_df.shape}")
Train set shape: (1094, 17)
Test set shape : (274, 17)
Compute the correlation¶
In this section, we compute the linear correlation matrix between all the volatility features. As expected, we observe strong correlations, since all features represent different ways of measuring volatility—using various estimators and time windows.
➡️ To reduce redundancy and simplify our feature set, we will combine these correlated variables into a single synthetic volatility feature in the next section using PCA.
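A quick way to quantify that redundancy is to list the feature pairs whose absolute correlation exceeds a threshold. A small helper sketch (the function name and threshold are illustrative choices, not part of the notebook's pipeline):

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(data, threshold=0.9):
    """Return (col_a, col_b, corr) for each pair above the threshold."""
    corr = data.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [(a, b, float(upper.loc[a, b]))
            for a in upper.index for b in upper.columns
            if upper.loc[a, b] > threshold]

# Demo: "x" and "y" are near-duplicates, "z" is independent noise
rng = np.random.default_rng(1)
base = rng.normal(size=200)
demo = pd.DataFrame({"x": base,
                     "y": base + 0.01 * rng.normal(size=200),
                     "z": rng.normal(size=200)})
pairs = highly_correlated_pairs(demo)
```

Applied to `train_df[vol_features]`, the same helper would surface pairs such as same-estimator features at neighboring windows, which are the main source of multicollinearity here.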
# Volatility Features Correlation
train_df[vol_features].corr()
| vol_close_to_close_30 | vol_close_to_close_60 | vol_close_to_close_90 | vol_parkinson_30 | vol_parkinson_60 | vol_parkinson_90 | vol_rogers_satchell_30 | vol_rogers_satchell_60 | vol_rogers_satchell_90 | vol_yang_zhang_30 | vol_yang_zhang_60 | vol_yang_zhang_90 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| vol_close_to_close_30 | 1.000000 | 0.658094 | 0.646324 | 0.776477 | 0.446810 | 0.497140 | 0.435528 | 0.182276 | 0.213798 | 0.907786 | 0.578932 | 0.603998 |
| vol_close_to_close_60 | 0.658094 | 1.000000 | 0.826935 | 0.518322 | 0.736422 | 0.617448 | 0.263288 | 0.320982 | 0.240687 | 0.613556 | 0.913467 | 0.769769 |
| vol_close_to_close_90 | 0.646324 | 0.826935 | 1.000000 | 0.466242 | 0.583201 | 0.755090 | 0.208227 | 0.219250 | 0.300686 | 0.578475 | 0.745742 | 0.930691 |
| vol_parkinson_30 | 0.776477 | 0.518322 | 0.466242 | 1.000000 | 0.641198 | 0.585609 | 0.886313 | 0.550041 | 0.486488 | 0.954019 | 0.621607 | 0.560181 |
| vol_parkinson_60 | 0.446810 | 0.736422 | 0.583201 | 0.641198 | 1.000000 | 0.788581 | 0.586320 | 0.863499 | 0.684608 | 0.581583 | 0.935582 | 0.730292 |
| vol_parkinson_90 | 0.497140 | 0.617448 | 0.755090 | 0.585609 | 0.788581 | 1.000000 | 0.482263 | 0.654675 | 0.843140 | 0.573707 | 0.749639 | 0.933161 |
| vol_rogers_satchell_30 | 0.435528 | 0.263288 | 0.208227 | 0.886313 | 0.586320 | 0.482263 | 1.000000 | 0.672169 | 0.558775 | 0.715710 | 0.455953 | 0.363678 |
| vol_rogers_satchell_60 | 0.182276 | 0.320982 | 0.219250 | 0.550041 | 0.863499 | 0.654675 | 0.672169 | 1.000000 | 0.799549 | 0.392271 | 0.638537 | 0.458238 |
| vol_rogers_satchell_90 | 0.213798 | 0.240687 | 0.300686 | 0.486488 | 0.684608 | 0.843140 | 0.558775 | 0.799549 | 1.000000 | 0.373191 | 0.492789 | 0.600923 |
| vol_yang_zhang_30 | 0.907786 | 0.613556 | 0.578475 | 0.954019 | 0.581583 | 0.573707 | 0.715710 | 0.392271 | 0.373191 | 1.000000 | 0.643161 | 0.618389 |
| vol_yang_zhang_60 | 0.578932 | 0.913467 | 0.745742 | 0.621607 | 0.935582 | 0.749639 | 0.455953 | 0.638537 | 0.492789 | 0.643161 | 1.000000 | 0.803874 |
| vol_yang_zhang_90 | 0.603998 | 0.769769 | 0.930691 | 0.560181 | 0.730292 | 0.933161 | 0.363678 | 0.458238 | 0.600923 | 0.618389 | 0.803874 | 1.000000 |
Standardize the data¶
In this section, we standardize the volatility features before applying PCA. This step is essential because PCA and Kernel PCA are geometry-based algorithms—they rely on distances and projections in multidimensional space.
📏 Therefore, they are very sensitive to differences in scale between features. Standardizing ensures that each feature contributes equally to the principal components.
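Concretely, `StandardScaler` is just the per-column z-score, so after scaling every feature has mean 0 and standard deviation 1 regardless of its original magnitude. A self-contained check on deliberately mismatched scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (std 10 vs std 0.1)
rng = np.random.default_rng(7)
X = rng.normal(loc=[5.0, -2.0], scale=[10.0, 0.1], size=(1000, 2))

X_scaled = StandardScaler().fit_transform(X)

# StandardScaler equals the per-column z-score with the population std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)
```

Without this step, the first principal component would be dominated by whichever feature happens to have the largest raw variance rather than by the shared volatility signal.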
# Standardize the features using only the training set
scaler = StandardScaler()
scaler.fit(train_df[vol_features]) # Fit on training data only
# Transform both train and full dataset using the same scaler
train_df_scaled = scaler.transform(train_df[vol_features])
df_vol_scaled = scaler.transform(df[vol_features])
# Convert the scaled full dataset to a DataFrame for easier handling
df_vol_scaled = pd.DataFrame(df_vol_scaled, index=df.index, columns=vol_features)
# Display the standardized volatility features
df_vol_scaled
| vol_close_to_close_30 | vol_close_to_close_60 | vol_close_to_close_90 | vol_parkinson_30 | vol_parkinson_60 | vol_parkinson_90 | vol_rogers_satchell_30 | vol_rogers_satchell_60 | vol_rogers_satchell_90 | vol_yang_zhang_30 | vol_yang_zhang_60 | vol_yang_zhang_90 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| time | ||||||||||||
| 2016-02-15 00:00:00 | -0.401640 | -1.381544 | -1.576680 | -0.666506 | -1.700655 | -2.118362 | -0.579882 | -1.549402 | -1.925998 | -0.541895 | -1.561662 | -1.836679 |
| 2016-02-15 04:00:00 | -0.310207 | -1.319138 | -1.526081 | -0.560944 | -1.618969 | -2.039274 | -0.411699 | -1.432562 | -1.818213 | -0.499044 | -1.519953 | -1.797043 |
| 2016-02-15 08:00:00 | -0.384208 | -1.320118 | -1.527724 | -0.492111 | -1.663043 | -2.020514 | -0.426479 | -1.569616 | -1.878487 | -0.376898 | -1.473607 | -1.731616 |
| 2016-02-15 12:00:00 | -0.400028 | -1.335861 | -1.521340 | -0.561136 | -1.705520 | -2.020907 | -0.461776 | -1.625128 | -1.872349 | -0.454575 | -1.489398 | -1.732881 |
| 2016-02-15 16:00:00 | -0.680327 | -1.309875 | -1.510666 | -0.562572 | -1.744195 | -2.020207 | -0.463613 | -1.645059 | -1.880228 | -0.455181 | -1.535916 | -1.724313 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2016-12-30 04:00:00 | 1.244838 | 0.011187 | -0.761746 | 1.512804 | 0.899999 | -0.258326 | 1.582711 | 1.446196 | 0.183129 | 1.414367 | 0.478538 | -0.492934 |
| 2016-12-30 08:00:00 | 1.231967 | -0.029461 | -0.884245 | 1.337949 | 0.892614 | -0.246083 | 1.221043 | 1.436164 | 0.207125 | 1.306979 | 0.475796 | -0.495787 |
| 2016-12-30 12:00:00 | 1.169592 | -0.009815 | -0.877716 | 1.332142 | 0.878060 | -0.391666 | 1.215668 | 1.432973 | 0.092471 | 1.301821 | 0.453419 | -0.642294 |
| 2016-12-30 16:00:00 | 0.851923 | 0.010732 | -0.863305 | 1.257809 | 0.868973 | -0.395238 | 1.148207 | 1.408861 | 0.088642 | 1.237403 | 0.456153 | -0.645470 |
| 2016-12-30 20:00:00 | 0.785280 | 0.077579 | -0.841172 | 0.872236 | 0.885606 | -0.349264 | 0.874607 | 1.422574 | 0.161035 | 0.868271 | 0.470454 | -0.614195 |
1368 rows × 12 columns
Apply the PCA¶
In this section, we apply Kernel PCA to combine all the standardized volatility features into a single synthetic variable.
By fitting the Kernel PCA model only on the training set, we ensure no data leakage. We then transform the full dataset using the same fitted projection.
🎯 The result is a new column, volatility, which captures the common dynamics across all volatility estimators—providing a clean, compact, and interpretable feature for modeling.
# Instantiate Kernel PCA from scikit-learn (the default kernel is "linear",
# which is equivalent to standard PCA; pass e.g. kernel="rbf" for non-linear structure)
pca = KernelPCA(n_components=1)
# Train the PCA on the train set
pca.fit(train_df_scaled)
# Apply the PCA on the whole dataset
df["volatility"] = pca.transform(df_vol_scaled)
df
| open | high | low | close | volume | vol_close_to_close_30 | vol_close_to_close_60 | vol_close_to_close_90 | vol_parkinson_30 | vol_parkinson_60 | vol_parkinson_90 | vol_rogers_satchell_30 | vol_rogers_satchell_60 | vol_rogers_satchell_90 | vol_yang_zhang_30 | vol_yang_zhang_60 | vol_yang_zhang_90 | volatility | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| time | ||||||||||||||||||
| 2016-02-15 00:00:00 | 108.987590 | 109.070169 | 108.709641 | 108.907025 | 660.035859 | 0.001785 | 0.001590 | 0.001584 | 0.001770 | 0.001647 | 0.001633 | 0.001801 | 0.001666 | 0.001662 | 0.002505 | 0.002297 | 0.002289 | -4.614981 |
| 2016-02-15 04:00:00 | 108.858099 | 109.157427 | 108.846086 | 109.138406 | 794.704627 | 0.001821 | 0.001606 | 0.001596 | 0.001801 | 0.001662 | 0.001645 | 0.001850 | 0.001690 | 0.001679 | 0.002524 | 0.002309 | 0.002299 | -4.340453 |
| 2016-02-15 08:00:00 | 109.138406 | 109.261909 | 109.126550 | 109.175951 | 959.457020 | 0.001792 | 0.001606 | 0.001595 | 0.001821 | 0.001654 | 0.001648 | 0.001846 | 0.001662 | 0.001669 | 0.002580 | 0.002323 | 0.002315 | -4.328625 |
| 2016-02-15 12:00:00 | 109.175951 | 109.281245 | 109.158402 | 109.278148 | 1752.989844 | 0.001785 | 0.001602 | 0.001597 | 0.001801 | 0.001646 | 0.001648 | 0.001836 | 0.001651 | 0.001670 | 0.002545 | 0.002318 | 0.002315 | -4.419657 |
| 2016-02-15 16:00:00 | 109.278148 | 109.469441 | 109.236273 | 109.440890 | 1918.783515 | 0.001673 | 0.001609 | 0.001599 | 0.001801 | 0.001638 | 0.001648 | 0.001835 | 0.001647 | 0.001669 | 0.002544 | 0.002304 | 0.002317 | -4.515242 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2016-12-30 04:00:00 | 103.632257 | 103.711884 | 103.495896 | 103.564574 | 563.932484 | 0.002443 | 0.001955 | 0.001770 | 0.002410 | 0.002141 | 0.001925 | 0.002434 | 0.002263 | 0.002003 | 0.003399 | 0.002902 | 0.002628 | 1.991279 |
| 2016-12-30 08:00:00 | 103.564574 | 103.629321 | 103.555581 | 103.616731 | 697.707475 | 0.002438 | 0.001944 | 0.001742 | 0.002359 | 0.002139 | 0.001927 | 0.002328 | 0.002261 | 0.002007 | 0.003350 | 0.002901 | 0.002628 | 1.772922 |
| 2016-12-30 12:00:00 | 103.615791 | 103.628165 | 103.496810 | 103.515847 | 1768.926665 | 0.002413 | 0.001949 | 0.001744 | 0.002357 | 0.002137 | 0.001904 | 0.002326 | 0.002261 | 0.001989 | 0.003347 | 0.002894 | 0.002591 | 1.625629 |
| 2016-12-30 16:00:00 | 103.515847 | 103.692243 | 103.400370 | 103.422193 | 921.437054 | 0.002286 | 0.001955 | 0.001747 | 0.002335 | 0.002135 | 0.001904 | 0.002307 | 0.002256 | 0.001988 | 0.003318 | 0.002895 | 0.002590 | 1.483359 |
| 2016-12-30 20:00:00 | 103.420285 | 103.429955 | 103.195921 | 103.226868 | 764.291608 | 0.002259 | 0.001972 | 0.001752 | 0.002222 | 0.002138 | 0.001911 | 0.002227 | 0.002259 | 0.002000 | 0.003149 | 0.002899 | 0.002598 | 1.254933 |
1368 rows × 18 columns
Plot the result¶
In this section, we plot the newly created PCA-based volatility feature alongside a few selected original volatility indicators.
This visual comparison allows us to see how the PCA feature captures the overall structure and dynamic behavior of multiple volatility estimators, while reducing the dimensionality of our dataset.
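Beyond the visual check, the agreement can be quantified by correlating the synthetic feature with each input. A self-contained sketch on synthetic data sharing one driver (the noise level is an illustrative assumption; on the notebook's data you would correlate `df["volatility"]` against `df_vol_scaled`):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import KernelPCA
from sklearn.preprocessing import StandardScaler

# Four correlated synthetic "volatility" columns sharing one driver
rng = np.random.default_rng(5)
driver = rng.normal(size=800)
feats = pd.DataFrame({f"vol_{i}": driver + 0.4 * rng.normal(size=800)
                      for i in range(4)})

synthetic = KernelPCA(n_components=1).fit_transform(
    StandardScaler().fit_transform(feats)).ravel()

# Use |corr| because the sign of a principal component is arbitrary
tracking = feats.apply(lambda col: abs(np.corrcoef(col, synthetic)[0, 1]))
```

High absolute correlations across all inputs confirm that the single component summarizes the family of estimators rather than mirroring just one of them.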
# Plot the result
plt.figure(figsize=(15,6))
plt.plot(df["volatility"], label="PCA vol feature")
plt.plot(df_vol_scaled["vol_yang_zhang_30"], label="YZ vol 30", alpha=0.5)
plt.plot(df_vol_scaled["vol_rogers_satchell_60"], label="RS vol 60", alpha=0.5)
plt.plot(df_vol_scaled["vol_close_to_close_90"], label="CTC vol 90", alpha=0.5)
plt.plot(df_vol_scaled["vol_close_to_close_60"], label="CTC vol 60", alpha=0.5)
plt.plot(df_vol_scaled["vol_parkinson_90"], label="Parkinson vol 90", alpha=0.5)
plt.legend()
plt.show()
We created several volatility indicators and observed strong correlations between them. To reduce this redundancy, we applied Kernel PCA (after proper standardization) to generate a single synthetic volatility feature that captures overall market volatility in a compact, interpretable way.
It can now be used as a robust input for any predictive or risk model.
In real trading pipelines, the scaling and PCA fitting must be repeated on each train/test split to avoid data leakage.