The Numerai dataset contains decades of historical data on the global stock market. Machine learning models trained on the dataset learn to predict stock returns and earn cryptocurrency (NMR) based on their performance in the Numerai Tournament. This blog post first explains “why” a variational autoencoder (VAE) is a suitable tool in a Numerai model developer's stack. Then we discuss “what” a variational autoencoder is and show “how” you can train one.
Why Variational Autoencoders?
We can use VAEs for anomaly detection, denoising, and generating synthetic data.
Anomaly detection
Anomaly detection is about identifying samples that deviate significantly from most of the data and do not conform to a well-defined notion of normal behavior. In the Numerai dataset, some eras may correspond to financially abnormal times, and detecting those can be informative.
Denoising
Noise reduction is the process of removing noise from a signal. We can apply a VAE to denoise features that deviate from the majority. Denoising transforms the noisy features, while anomaly detection flags the noisy samples.
Synthetic Data Generation
Training models on a combination of synthetic and real data has shown promising results. With a VAE, we can sample from a normal distribution and pass the sample to the decoder to obtain new samples.
What is a Variational Autoencoder?
An autoencoder consists of two main parts: 1) an encoder that maps the input into a code, and 2) a decoder that reconstructs the input from the code. The code is also referred to as the representation or the latent variables in the literature.
What makes it variational? Constraining the distribution of the latent representation to be close to a known distribution such as a Gaussian. A typical autoencoder (AE) has no control over the distribution of the latent space. A variational autoencoder (VAE) provides a probabilistic way of describing an observation in latent space. Thus, rather than building an encoder that outputs a single value to describe each latent attribute, we formulate our encoder to describe a probability distribution for each latent attribute.
In this tutorial, we use the original VAE introduced in the following paper, and we refer to it as the vanilla VAE:
We use https://github.com/AntixK/PyTorch-VAE as our code base. This code base includes various VAE architectures, but we’ll focus on its vanilla VAE.
Architecture
The encoder consists of one or more fully connected layers, where the last layer outputs the mean and variance of a normal distribution. These values are used to sample from the corresponding normal distribution, and the sample is the input to the decoder. The decoder consists of one or more fully connected layers and outputs a reconstruction of the encoder's input. The following picture demonstrates the architecture of the VAE:
Instead of immediately reporting values for the latent state, as in a standard autoencoder, the encoder model of a VAE will output parameters characterizing a distribution for each dimension in the latent space. We’ll output two vectors reflecting the mean and variance of the latent state distributions because we’re assuming that our prior has a normal distribution. Our decoder model will then build a latent vector by sampling from these defined distributions, after which it will reconstruct the original input.
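The sampling step described above is usually implemented with the reparameterization trick, so that gradients can flow through the random sample. Here is a minimal sketch, assuming the encoder outputs the log-variance rather than the variance (as the log_var name in the loss code later in this post suggests). The function name `reparameterize` and the tensor shapes are illustrative.

```python
import torch


def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Draw z ~ N(mu, sigma^2) as z = mu + sigma * eps with eps ~ N(0, I)."""
    std = torch.exp(0.5 * log_var)  # sigma = exp(log(sigma^2) / 2)
    eps = torch.randn_like(std)     # noise that does not depend on the parameters
    return mu + eps * std


# Illustrative shapes: a batch of 4 samples with a 32-dimensional latent space.
mu = torch.zeros(4, 32, requires_grad=True)
log_var = torch.zeros(4, 32, requires_grad=True)
z = reparameterize(mu, log_var)
z.sum().backward()  # gradients reach mu and log_var through the sample
```

Because the randomness lives in `eps` only, `z` is a differentiable function of `mu` and `log_var`, which is what lets the encoder be trained with backpropagation.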
Training
There are two terms in the loss function of a vanilla VAE: 1) reconstruction error and 2) KL divergence:
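Written out for a single input x with D features, the two terms can be sketched as follows. This assumes a diagonal Gaussian posterior with mean μ and variance σ² over d latent dimensions, a standard normal prior, and a weight β on the KL term (our notation for the kld_weight entry in the config; that the code base scales the KL term this way is an assumption).

```latex
\mathcal{L}(x) =
  \underbrace{\frac{1}{D} \sum_{i=1}^{D} \left(x_i - \hat{x}_i\right)^2}_{\text{reconstruction error (MSE)}}
  + \beta \, \underbrace{D_{\mathrm{KL}}\big(q(z \mid x) \,\|\, p(z)\big)}_{\text{KL divergence}},
\qquad
D_{\mathrm{KL}}\big(q(z \mid x) \,\|\, p(z)\big)
  = -\frac{1}{2} \sum_{j=1}^{d} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)
```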
The reconstruction error used in the vanilla VAE is the mean-squared error (MSE). The MSE loss tries to make the reconstructed signal similar to the input signal. The KL divergence loss tries to make the distribution of the code close to a normal distribution. Here, q(z|x) is the distribution of the code given the input signal, and p(z) is the normal distribution. The PyTorch code looks as follows:
recons_loss = F.mse_loss(recons, input)
kld_loss = torch.mean(-0.5 * torch.sum(1 + log_var - mu ** 2 - log_var.exp(), dim=1), dim=0)
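For reference, here is a self-contained sketch that wraps the two terms above into one function and runs it on dummy tensors. The combination `recons_loss + kld_weight * kld_loss` is an assumption based on the kld_weight entry in the config shown later. The function name `vae_loss` and the dummy shapes are ours.

```python
import torch
import torch.nn.functional as F


def vae_loss(recons, inputs, mu, log_var, kld_weight=0.00025):
    """Return (total, reconstruction, KL) losses for one batch."""
    recons_loss = F.mse_loss(recons, inputs)
    # Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over latent dims,
    # then averaged over the batch.
    kld_loss = torch.mean(
        -0.5 * torch.sum(1 + log_var - mu ** 2 - log_var.exp(), dim=1), dim=0
    )
    return recons_loss + kld_weight * kld_loss, recons_loss, kld_loss


# Sanity check: a perfect reconstruction with mu = 0 and log_var = 0
# (q(z|x) equal to the prior) gives zero for both terms.
x = torch.rand(8, 1191)
zeros = torch.zeros(8, 32)
total, rec, kld = vae_loss(x, x, zeros, zeros)

# Shifting every latent mean to 1 costs 0.5 per dimension, so 16 for 32 dims.
_, _, kld_shifted = vae_loss(x, x, torch.ones(8, 32), zeros)
```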
I created a fork of the main PyTorch-VAE branch for the Numerai dataset:
The Vanilla VAE config file looks as follows:
model_params:
  name: 'NumeraiVAE'
  in_channels: 1191
  latent_dim: 32

data_params:
  data_path: "/train.parquet"
  train_batch_size: 4096
  val_batch_size: 4096
  num_workers: 8

exp_params:
  LR: 0.005
  weight_decay: 0.0
  scheduler_gamma: 0.95
  kld_weight: 0.00025
  manual_seed: 1265

trainer_params:
  gpus: [1]
  max_epochs: 300

logging_params:
  save_dir: "logs/"
  name: "NumeraiVAE"
The key parameters in the config are in_channels, the number of input features, and latent_dim, the latent dimension of the VAE. The encoder and decoder consist of linear layers followed by batch normalization and a leaky ReLU activation.
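To make the exp_params entries more concrete, here is a sketch of how they could map onto a PyTorch optimizer and learning-rate scheduler. It assumes Adam with an exponential decay schedule, which is what LR, weight_decay, and scheduler_gamma suggest; check the code base for the optimizer it actually uses. The placeholder model and the dummy loss are hypothetical.

```python
import torch
from torch import nn

exp_params = {"LR": 0.005, "weight_decay": 0.0, "scheduler_gamma": 0.95}

model = nn.Linear(1191, 32)  # placeholder standing in for the VAE
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=exp_params["LR"],
    weight_decay=exp_params["weight_decay"],
)
# Multiplies the learning rate by gamma on every scheduler.step()
scheduler = torch.optim.lr_scheduler.ExponentialLR(
    optimizer, gamma=exp_params["scheduler_gamma"]
)

lrs = []
for epoch in range(3):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.zero_grad()
    model(torch.rand(4, 1191)).sum().backward()  # dummy loss for illustration
    optimizer.step()
    scheduler.step()  # stepped once per epoch in this sketch
```

Under these assumptions the learning rate shrinks by 5% each time the scheduler steps, starting from 0.005.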
The model definition of the encoder:
# Build Encoder
modules = []
modules.append(
    nn.Sequential(
        nn.Linear(in_channels, latent_dim),
        nn.BatchNorm1d(latent_dim),
        nn.LeakyReLU(),
    ))
self.encoder = nn.Sequential(*modules)
# Two heads on top of the encoder: mean and log-variance of q(z|x)
self.fc_mu = nn.Linear(latent_dim, latent_dim)
self.fc_var = nn.Linear(latent_dim, latent_dim)
The model definition of the decoder:
# Build Decoder
modules = []
# Maps the sampled latent vector before it enters the decoder stack
self.decoder_input = nn.Linear(latent_dim, latent_dim)
modules.append(
    nn.Sequential(
        nn.Linear(latent_dim, in_channels),
        nn.BatchNorm1d(in_channels),
        nn.LeakyReLU()
    ))
self.decoder = nn.Sequential(*modules)
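Putting the two fragments together, a minimal self-contained sketch of the whole model could look like this. The class name `TinyVAE`, the method names, and the forward pass are our assumptions about how the pieces connect, not a copy of the fork's code. With in_channels=1191 and latent_dim=32, the layers shown above add up to 83,061 parameters, in line with the 83.1 K reported in the training log further down.

```python
import torch
from torch import nn


class TinyVAE(nn.Module):
    def __init__(self, in_channels: int = 1191, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_channels, latent_dim),
            nn.BatchNorm1d(latent_dim),
            nn.LeakyReLU(),
        )
        self.fc_mu = nn.Linear(latent_dim, latent_dim)
        self.fc_var = nn.Linear(latent_dim, latent_dim)
        self.decoder_input = nn.Linear(latent_dim, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, in_channels),
            nn.BatchNorm1d(in_channels),
            nn.LeakyReLU(),
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_var(h)  # mean and log-variance

    def decode(self, z):
        return self.decoder(self.decoder_input(z))

    def forward(self, x):
        mu, log_var = self.encode(x)
        # Reparameterization: z = mu + sigma * eps
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)
        return self.decode(z), mu, log_var


model = TinyVAE()
n_params = sum(p.numel() for p in model.parameters())
recons, mu, log_var = model(torch.rand(16, 1191))
```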
How to run the training?
python3 run.py --config configs/numerai_vae.yaml
It should print the following logs:
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
======= Training NumeraiVAE =======
Global seed set to 1265
initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
  | Name  | Type       | Params
-------------------------------------
0 | model | NumeraiVAE | 83.1 K
-------------------------------------
83.1 K    Trainable params
0         Non-trainable params
83.1 K    Total params
0.332     Total estimated model params size (MB)
Global seed set to 1265
Epoch 19: 100%|██████████████████████████████████████████████████████████████████████████| 592/592 [00:20<00:00, 28.49it/s, loss=0.0818, v_num=3]
How to do anomaly detection with VAE?
Anomalies are the samples with high loss values. The loss can be the reconstruction loss, the KLD loss, or a combination of the two.
[Figure: histogram of KL divergence (left) and mean-squared reconstruction loss]
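As a sketch of this idea, the snippet below computes a per-sample reconstruction error and KL divergence and flags the samples whose combined score lies above the 99% quantile. The helper name `anomaly_scores`, the quantile threshold, and the random tensors standing in for model outputs are all illustrative assumptions. In practice `recons`, `mu`, and `log_var` would come from a trained VAE.

```python
import torch


def anomaly_scores(inputs, recons, mu, log_var, kld_weight=0.00025):
    """Per-sample loss terms (no averaging over the batch)."""
    recons_err = ((recons - inputs) ** 2).mean(dim=1)
    kld = -0.5 * torch.sum(1 + log_var - mu ** 2 - log_var.exp(), dim=1)
    return recons_err, kld, recons_err + kld_weight * kld


torch.manual_seed(0)
x = torch.rand(1000, 1191)
recons = x + 0.01 * torch.randn_like(x)  # stand-in for the decoder output
recons[:5] += 1.0                         # make the first 5 rows badly reconstructed
mu, log_var = torch.zeros(1000, 32), torch.zeros(1000, 32)

recons_err, kld, score = anomaly_scores(x, recons, mu, log_var)
threshold = torch.quantile(score, 0.99)
anomalies = torch.nonzero(score > threshold).flatten()
```

To look for abnormal eras rather than individual rows, the same scores could be averaged per era before thresholding.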
How to do denoising with VAE?
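A VAE is trained to reconstruct its input. To denoise, a noisy input is first passed through the encoder to obtain the code, and the code is then passed through the decoder to obtain the denoised input. Because the code has far fewer dimensions than the input, the reconstruction tends to keep the dominant structure and drop much of the sample-specific noise.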
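A sketch of that encode-then-decode pass is below. It uses the posterior mean as the code instead of a random sample so that the output is deterministic, which is a common choice but an assumption here. The small `encoder`, `fc_mu`, and `decoder` modules are untrained stand-ins for the corresponding parts of a trained VAE.

```python
import torch
from torch import nn

in_channels, latent_dim = 1191, 32

# Untrained stand-ins; a real run would load the trained model instead.
encoder = nn.Sequential(nn.Linear(in_channels, latent_dim), nn.LeakyReLU())
fc_mu = nn.Linear(latent_dim, latent_dim)
decoder = nn.Sequential(nn.Linear(latent_dim, in_channels), nn.LeakyReLU())


@torch.no_grad()
def denoise(x: torch.Tensor) -> torch.Tensor:
    """Encode x to the mean of q(z|x), then decode it back to feature space."""
    mu = fc_mu(encoder(x))
    return decoder(mu)


noisy = torch.rand(8, in_channels)
clean = denoise(noisy)
```

With batch normalization layers, as in the model shown earlier, call `model.eval()` before denoising so that the running statistics are used instead of the statistics of the current batch.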
How to generate synthetic data with VAE?
As the input to the decoder follows a known distribution (i.e. Gaussian), we can sample from a Gaussian distribution and pass the values to the decoder to obtain new synthetic data.
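A sketch of the sampling step, assuming a standard normal prior and using an untrained `decoder` as a stand-in for the trained one (the function name `generate` is ours):

```python
import torch
from torch import nn

in_channels, latent_dim = 1191, 32

# Untrained stand-in for the decoder of a trained VAE.
decoder = nn.Sequential(
    nn.Linear(latent_dim, latent_dim),
    nn.Linear(latent_dim, in_channels),
    nn.LeakyReLU(),
)


@torch.no_grad()
def generate(num_samples: int) -> torch.Tensor:
    """Sample z ~ N(0, I) and decode it into synthetic feature rows."""
    z = torch.randn(num_samples, latent_dim)
    return decoder(z)


synthetic = generate(100)
```

Each row of `synthetic` has the same width as a row of real features, so it can be concatenated with real training data.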
Further reading
Thanks for reading the blog post. Feel free to share your comments about using VAEs in the Numerai tournament!