Papers Explained Review 04: Tabular Deep Learning
Table of Contents
- Entity Embeddings
- Tabular ResNet
- Wide and Deep Learning
- Deep and Cross Network
- TabTransformer
- Feature Tokenizer Transformer
Entity Embeddings
Entity Embeddings of Categorical Variables
Neural networks are not as prominent as tree-based methods on machine learning problems with structured data. This is easily seen from the fact that the top teams in many online machine learning competitions, such as those hosted on Kaggle, use tree-based methods more often than neural networks.
In principle, a neural network can approximate any continuous or piecewise continuous function. However, it is not suited to approximating arbitrary non-continuous functions, because its general form assumes a certain level of continuity. During training, the continuity of the data guarantees the convergence of the optimization, and during prediction it ensures that slightly changing the input values keeps the output stable.
Decision trees, on the other hand, do not assume any continuity of the feature variables and can divide the states of a variable as finely as necessary.
Interestingly the problems we usually face in nature are often continuous if we use the right representation of data. Whenever we find a better way to reveal the continuity of the data we increase the power of neural networks to learn the data.
For example, convolutional neural networks group pixels in the same neighborhood together. This increases the continuity of the data compared to simply representing the image as a flattened vector of all the pixel values of the images.
The rise of neural networks in natural language processing is based on the word embedding which puts words with similar meaning closer to each other in a word space thus increasing the continuity of the words compared to using one-hot encoding of words.
Unlike unstructured data found in nature, structured data with categorical features may not have any continuity at all, and even if it does, the continuity may not be obvious.
To learn an approximation of the function, we map each state of a discrete variable to a vector. This mapping is equivalent to an extra layer of linear neurons on top of the one-hot encoded input.
The main goal of entity embedding is to map similar categories close to each other in the embedding space.
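The equivalence between an embedding lookup and a linear layer on a one-hot input can be checked numerically. The following is a minimal NumPy sketch; the category count, embedding size, and random weights are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n_categories = 7    # hypothetical, e.g. day of week
embedding_dim = 3   # hypothetical embedding size

# Embedding matrix: one row per category. In a real model it is learned.
W = rng.normal(size=(n_categories, embedding_dim))

def one_hot(index, size):
    """Return a one-hot vector with a 1 at position `index`."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

category = 4

# View 1: a linear layer (no bias, no activation) applied to the one-hot input.
via_linear_layer = one_hot(category, n_categories) @ W

# View 2: a direct row lookup, which is how embedding layers are usually implemented.
via_lookup = W[category]

lookup_matches_linear = np.allclose(via_linear_layer, via_lookup)
print(lookup_matches_linear)
```

The lookup form is preferred in practice because it avoids materialising the one-hot vector for high-cardinality features.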
In the experiments we use both one-hot encoding and entity embedding to represent the input features of neural networks. We use two fully connected layers (1000 and 500 neurons respectively) on top of either the embedding layer or the one-hot encoding layer. The fully connected layers use the ReLU activation function. The output layer contains one neuron with a sigmoid activation function. No dropout is used, as we found that it did not improve the result.
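The architecture described above can be sketched as a plain NumPy forward pass. The feature cardinalities, embedding sizes, and weight initialisation below are hypothetical and chosen only for illustration; only the layer widths (1000 and 500), the ReLU activations, and the single sigmoid output come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical categorical features: (number of categories, embedding size).
feature_specs = [(7, 6), (12, 6), (1115, 10)]
embeddings = [rng.normal(scale=0.05, size=(n, d)) for n, d in feature_specs]

def dense(n_in, n_out):
    """Randomly initialised weight matrix and zero bias for one dense layer."""
    W = rng.normal(scale=np.sqrt(2.0 / n_in), size=(n_in, n_out))
    return W, np.zeros(n_out)

input_dim = sum(d for _, d in feature_specs)
W1, b1 = dense(input_dim, 1000)   # first fully connected layer
W2, b2 = dense(1000, 500)         # second fully connected layer
W3, b3 = dense(500, 1)            # single output neuron

def forward(category_ids):
    """category_ids: integer array of shape (batch, n_features)."""
    # Look up one embedding per feature and concatenate them.
    x = np.concatenate(
        [E[category_ids[:, j]] for j, E in enumerate(embeddings)], axis=1
    )
    h = np.maximum(x @ W1 + b1, 0.0)      # ReLU
    h = np.maximum(h @ W2 + b2, 0.0)      # ReLU
    logit = h @ W3 + b3
    return 1.0 / (1.0 + np.exp(-logit))   # sigmoid output

batch = np.array([[0, 3, 42], [6, 11, 1114]])
out = forward(batch)
print(out.shape)  # (2, 1)
```

The one-hot variant would replace the embedding lookup with concatenated one-hot vectors, which makes the first weight matrix much larger for high-cardinality features.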
Tabular ResNet
Revisiting Deep Learning Models for Tabular Data
Given ResNet’s success story in computer vision, the idea is to construct a simple variation of ResNet for Tabular Data. The main building block is simplified compared to the original architecture, and there is an almost clear path from the input to output which we find to be beneficial for the optimization. Overall, we expect this architecture to outperform MLP on tasks where deeper representations can be helpful.
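To illustrate why a residual design leaves a nearly direct path from input to output, here is a minimal NumPy sketch of one residual block. It is a simplification under stated assumptions: the normalisation and dropout used in practice are omitted, and the dimensions and weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_block(x, W_a, b_a, W_b, b_b):
    """Simplified block: x + Linear(ReLU(Linear(x))).

    Normalisation and dropout are intentionally left out of this sketch.
    """
    h = np.maximum(x @ W_a + b_a, 0.0)
    return x + (h @ W_b + b_b)   # skip connection adds the input back

d, hidden = 8, 16                # hypothetical sizes
x = rng.normal(size=(4, d))      # a batch of 4 rows with d features each

W_a = rng.normal(size=(d, hidden))
b_a = np.zeros(hidden)
b_b = np.zeros(d)

# With a zero second layer the block reduces to the identity,
# which is the direct path the skip connection provides.
y_identity = residual_block(x, W_a, b_a, np.zeros((hidden, d)), b_b)

# With non-zero weights the block learns a correction on top of its input.
W_b = rng.normal(scale=0.1, size=(hidden, d))
y = residual_block(x, W_a, b_a, W_b, b_b)
print(y.shape)  # (4, 8)
```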
Wide and Deep Learning
The human brain is a sophisticated learning machine, forming rules by memorizing everyday events and generalizing those learnings to apply to things we haven’t seen before. Perhaps more powerfully, memorization also allows us to further refine our generalized rules with exceptions.
By jointly training a wide linear model (for memorization) alongside a deep neural network (for generalization), one can combine the strengths of both to bring us one step closer to this kind of learning. This is the premise of Wide and Deep Learning.
It’s useful for generic large-scale regression and classification problems with sparse inputs (categorical features with a large number of possible feature values), such as recommender systems, search, and ranking problems.
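One way to picture the joint model is as two logits that are summed before a single sigmoid, so that both parts receive gradients from the same loss. The NumPy sketch below shows only the forward pass; every size, weight, and feature index is a hypothetical placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

n_wide = 20    # hypothetical number of sparse (one-hot or crossed) wide features
n_items = 50   # hypothetical vocabulary size of one categorical feature
emb_dim = 8
hidden = 16

w_wide = rng.normal(scale=0.1, size=n_wide)        # wide (linear) weights
E = rng.normal(scale=0.1, size=(n_items, emb_dim)) # embedding table for the deep part
W_h = rng.normal(scale=0.1, size=(emb_dim, hidden))
b_h = np.zeros(hidden)
w_deep = rng.normal(scale=0.1, size=hidden)
bias = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def wide_and_deep(x_wide, item_id):
    # Wide part: a linear model over sparse binary features (memorization).
    wide_logit = x_wide @ w_wide
    # Deep part: embedding followed by a hidden ReLU layer (generalization).
    h = np.maximum(E[item_id] @ W_h + b_h, 0.0)
    deep_logit = h @ w_deep
    # The two logits are added and share one sigmoid output.
    return sigmoid(wide_logit + deep_logit + bias), wide_logit, deep_logit

x_wide = np.zeros(n_wide)
x_wide[[2, 7]] = 1.0   # two active sparse features
p, wide_logit, deep_logit = wide_and_deep(x_wide, item_id=13)
print(0.0 < p < 1.0)
```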
Deep and Cross Network
Deep & Cross Network for Ad Click Predictions
Feature engineering has been the key to the success of many prediction models. However, the process is nontrivial and often requires manual feature engineering or exhaustive searching. DNNs are able to automatically learn feature interactions; however, they generate all the interactions implicitly, and are not necessarily efficient in learning all types of cross features.
DCN explicitly applies feature crossing at each layer, requires no manual feature engineering, and adds negligible extra complexity to the DNN model.
To reduce the dimensionality, we employ an embedding procedure to transform the one hot features into dense vectors of real values (Entity Embeddings).
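The cross network is composed of cross layers, with each layer having the following formula:
$$x_{l+1} = x_0 x_l^{\top} w_l + b_l + x_l$$
where $x_l$ and $x_{l+1}$ are the outputs of the $l$-th and $(l+1)$-th cross layers, $x_0$ is the input to the cross network, and $w_l$ and $b_l$ are the weight and bias vectors of the $l$-th layer.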
This special structure of the cross network causes the degree of cross features to grow with layer depth. The highest polynomial degree (in terms of input x0) for an l-layer cross network is l + 1.
A combination layer concatenates the outputs from the two networks and feeds the concatenated vector into a standard logits layer.
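A cross layer can be sketched in a few lines of NumPy. This assumes the update from the DCN paper, in which each layer adds the input x0 scaled by the scalar product of the current state with a weight vector, plus a bias and a residual term. The dimension, depth, and random weights are hypothetical.

```python
import numpy as np

def cross_layer(x0, xl, w, b):
    """One cross layer: x0 * (xl . w) + b + xl.

    (x0 xl^T) w is computed as x0 * (xl . w), which avoids forming
    the d x d outer product explicitly.
    """
    return x0 * (xl @ w) + b + xl

rng = np.random.default_rng(0)
d = 5                                   # hypothetical input dimension
x0 = rng.normal(size=d)
weights = [rng.normal(size=d) for _ in range(3)]
biases = [np.zeros(d) for _ in range(3)]

# Stack three cross layers; every layer multiplies by x0 once more,
# so the polynomial degree in x0 grows by one per layer.
x = x0
for w, b in zip(weights, biases):
    x = cross_layer(x0, x, w, b)

# Sanity check of the first layer against the explicit outer-product form.
explicit = np.outer(x0, x0) @ weights[0] + biases[0] + x0
efficient = cross_layer(x0, x0, weights[0], biases[0])
print(np.allclose(explicit, efficient))
```

Each layer adds only two vectors of d parameters, which is why the cross network contributes little extra complexity on top of the deep network.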
TabTransformer
TabTransformer: Tabular Data Modeling Using Contextual Embeddings
The TabTransformer is built upon self-attention based Transformers. The Transformer layers transform the embeddings of categorical features into robust contextual embeddings to achieve higher prediction accuracy.
The contextual embeddings learned from TabTransformer are highly robust against both missing and noisy data features, and provide better interpretability.
Tree-based models have several limitations in comparison to deep learning models.
- They are not suitable for continual training from streaming data, and do not allow efficient end-to-end learning of image/text encoders in the presence of multi-modality along with tabular data.
- In their basic form they are not suitable for state-of-the-art semi-supervised learning methods.
MLPs usually learn parametric embeddings to encode categorical data features. But due to their shallow architecture and context-free embeddings, they have the following limitations:
- neither the model nor the learned embeddings are interpretable
- they are not robust against missing and noisy data
- for semi-supervised learning, they do not achieve competitive performance
Most importantly, MLPs do not match the performance of tree-based models such as GBDT on most of the datasets.
Motivated by the successful applications of Transformers in NLP, we adapt them to the tabular domain. In particular, TabTransformer applies a sequence of multi-head attention-based Transformer layers to parametric embeddings to transform them into contextual embeddings, bridging the performance gap between baseline MLP and GBDT models.
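The step from parametric to contextual embeddings can be illustrated with a single self-attention head in NumPy. This is a simplified sketch, not the full model: it uses one head instead of several and omits the feed-forward sublayer, residual connections, and layer normalisation. The number of columns and all weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (n_columns, d) parametric embeddings of one row's categorical features."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[1])    # scaled dot-product scores
    A = softmax(scores, axis=-1)              # each column attends to all columns
    return A @ V, A

n_columns, d = 5, 32                          # 5 hypothetical categorical columns
X = rng.normal(size=(n_columns, d))           # context-free embeddings
W_q, W_k, W_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

contextual, attn = self_attention(X, W_q, W_k, W_v)
print(contextual.shape)  # (5, 32)
```

Each output row is a weighted mix of all column embeddings in the same table row, which is what makes the embedding of one feature depend on the values of the others.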
Setup: For the TabTransformer, the hidden (embedding) dimension, the number of layers and the number of attention heads are fixed to 32, 6, and 8 respectively. The MLP layer sizes are set to {4 × l, 2 × l}, where l is the size of its input.
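As a small worked example of the MLP sizing rule, the sketch below assumes that the MLP input is the concatenation of the flattened contextual embeddings and the continuous features. The feature counts are hypothetical.

```python
def mlp_layer_sizes(n_categorical, n_continuous, embedding_dim=32):
    """Return the MLP input size l and the hidden layer sizes {4l, 2l}.

    Assumes l = n_categorical * embedding_dim + n_continuous.
    """
    l = n_categorical * embedding_dim + n_continuous
    return l, [4 * l, 2 * l]

# Hypothetical dataset with 10 categorical and 6 continuous features.
l, sizes = mlp_layer_sizes(n_categorical=10, n_continuous=6)
print(l, sizes)  # 326 [1304, 652]
```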
Feature Tokenizer Transformer
Revisiting Deep Learning Models for Tabular Data
In a nutshell, the FT-Transformer model transforms all features (categorical and numerical) into embeddings and applies a stack of Transformer layers to those embeddings. Thus, every Transformer layer operates on the feature level of one object.
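The tokenization step can be sketched in NumPy as follows. The sketch assumes that each numerical feature is multiplied by its own learned vector and shifted by a bias, that each categorical feature is a lookup in its own embedding table plus a bias, and that a learned [CLS] token is added to the sequence. The sizes, weights, and the position of the [CLS] token are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8                        # hypothetical token (embedding) size
n_num = 3                    # hypothetical number of numerical features
cat_cardinalities = [4, 6]   # hypothetical categorical cardinalities

W_num = rng.normal(size=(n_num, d))      # one vector per numerical feature
b_num = np.zeros((n_num, d))
cat_tables = [rng.normal(size=(c, d)) for c in cat_cardinalities]
b_cat = np.zeros((len(cat_cardinalities), d))
cls_token = rng.normal(size=(1, d))      # learned [CLS] token

def tokenize(x_num, x_cat):
    """Turn one row into a (1 + n_num + n_cat, d) sequence of tokens."""
    # Numerical feature j becomes x_j * W_num[j] + b_num[j].
    num_tokens = x_num[:, None] * W_num + b_num
    # Categorical feature j becomes a row lookup in its own table, plus a bias.
    cat_tokens = np.stack([t[i] for t, i in zip(cat_tables, x_cat)]) + b_cat
    return np.concatenate([cls_token, num_tokens, cat_tokens], axis=0)

x_num = np.array([0.5, -1.2, 3.0])
x_cat = [2, 5]
tokens = tokenize(x_num, x_cat)
print(tokens.shape)  # (6, 8)
```

The resulting sequence has one token per feature, so attention in the Transformer layers is computed between features of a single row rather than between rows.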