pyg-team
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 2 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 3 additions & 3 deletions
diff --git a/benchmark/data_frame_benchmark.py b/benchmark/data_frame_benchmark.py
Lines changed: 4 additions & 1 deletion
diff --git a/docs/source/_figures/architecture.png b/docs/source/_figures/architecture.png
273 KB
diff --git a/examples/excelformer.py b/examples/excelformer.py
Lines changed: 42 additions & 24 deletions
diff --git a/test/nn/models/test_excelformer.py b/test/nn/models/test_excelformer.py
Lines changed: 6 additions & 2 deletions
diff --git a/test/transforms/test_cat_to_num_transform.py b/test/transforms/test_cat_to_num_transform.py
Lines changed: 5 additions & 0 deletions
diff --git a/torch_frame/datasets/fake.py b/torch_frame/datasets/fake.py
Lines changed: 9 additions & 3 deletions
@@ -13,6 +13,8 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 
 ### Changed
 
+- Updated `ExcelFormer` implementation and related scripts ([#391](https://github.com/pyg-team/pytorch-frame/pull/391))
+
 ### Deprecated
 
 ### Removed
 
@@ -28,7 +28,7 @@
 
 </div>
 
-**[Documentation](https://pytorch-frame.readthedocs.io)**
+**[Documentation](https://pytorch-frame.readthedocs.io)** | **[Paper](https://arxiv.org/abs/2404.00776)**
 
 PyTorch Frame is a deep learning extension for [PyTorch](https://pytorch.org/), designed for heterogeneous tabular data with different column types, including numerical, categorical, time, text, and images. It offers a modular framework for implementing existing and future methods. The library features methods from state-of-the-art models, user-friendly mini-batch loaders, benchmark datasets, and interfaces for custom data integration.
 
@@ -80,7 +80,7 @@ PyTorch Frame democratizes deep learning research for tabular data, catering to
 PyTorch Frame builds directly upon PyTorch, ensuring a smooth transition for existing PyTorch users. Key features include:
 
 * **Diverse column types**:
-  PyTorch Frame supports learning across various column types: `numerical`, `categorical`, `multicategorical`, `text_embedded`, `text_tokenized`, `timestamp`, and `embedding`. See [here](https://pytorch-frame.readthedocs.io/en/latest/handling_advanced_stypes/handle_heterogeneous_stypes.html) for the detailed tutorial.
+  PyTorch Frame supports learning across various column types: `numerical`, `categorical`, `multicategorical`, `text_embedded`, `text_tokenized`, `timestamp`, `image_embedded`, and `embedding`. See [here](https://pytorch-frame.readthedocs.io/en/latest/handling_advanced_stypes/handle_heterogeneous_stypes.html) for the detailed tutorial.
 * **Modular model design**:
   Enables modular deep learning model implementations, promoting reusability, clear coding, and experimentation flexibility. Further details in the [architecture overview](#architecture-overview).
 * **Models**
@@ -96,7 +96,7 @@ PyTorch Frame builds directly upon PyTorch, ensuring a smooth transition for exi
 Models in PyTorch Frame follow a modular design of `FeatureEncoder`, `TableConv`, and `Decoder`, as shown in the figure below:
 
 <p align="center">
-  <img width="100%" src="https://raw.githubusercontent.com/pyg-team/pytorch-frame/master/docs/source/_figures/modular.png" />
+  <img width="50%" src="https://raw.githubusercontent.com/pyg-team/pytorch-frame/master/docs/source/_figures/architecture.png" />
 </p>
 
 In essence, this modular setup empowers users to effortlessly experiment with myriad architectures:
 
@@ -229,6 +229,8 @@
             'diam_dropout': [0, 0.2],
             'residual_dropout': [0, 0.2],
             'aium_dropout': [0, 0.2],
+            'mixup': [None, 'feature', 'hidden'],
+            'beta': [0.5],
             'num_cols': [train_tensor_frame.num_cols],
         }
         train_search_space = {
@@ -257,7 +259,8 @@ def train(
         tf = tf.to(device)
         y = tf.y
         if isinstance(model, ExcelFormer):
-            pred, y = model.forward_mixup(tf)
+            # Train with FEAT-MIX or HIDDEN-MIX
+            pred, y = model(tf, mixup_encoded=True)
         elif isinstance(model, Trompt):
             # Trompt uses the layer-wise loss
             pred = model.forward_stacked(tf)
 
@@ -1,20 +1,26 @@
-"""Reported (reproduced) accuracy(rmse for regression task) of ExcelFormer
+"""Reported (reproduced) accuracy (for multiclass classification tasks), AUC
+(for binary classification tasks), and RMSE (for regression tasks),
 based on Table 1 of the paper https://arxiv.org/abs/2301.02819.
 ExcelFormer uses the same train-validation-test split as the Yandex paper.
-
-california_housing: 0.4587 (0.4733) num_layers=5, num_heads=4, num_layers=5,
-channels=32, lr: 0.001,
-jannis : 72.51 (72.38) num_heads=32, lr: 0.0001
-covtype: 97.17 (95.37)
-helena: 38.20 (36.80)
-higgs_small: 80.75 (65.17) lr: 0.0001
+The reproduced results use z-score normalization, while the reported
+ones use :class:`QuantileTransformer` preprocessing from scikit-learn.
+Both are applied to numerical features.
+
+california_housing: 0.4587 (0.4550) mixup: feature, num_layers: 3,
+gamma: 1.00, epochs: 300
+jannis : 72.51 (72.80) mixup: feature
+covtype: 97.17 (97.02) mixup: hidden
+helena: 38.20 (37.68) mixup: feature
+higgs_small: 80.75 (79.27) mixup: hidden
 """
 import argparse
 import os.path as osp
 
 import torch
 import torch.nn.functional as F
 from torch.optim.lr_scheduler import ExponentialLR
+from torchmetrics import AUROC, Accuracy, MeanSquaredError
 from tqdm import tqdm
 
 from torch_frame.data.loader import DataLoader
@@ -23,14 +29,16 @@
 from torch_frame.transforms import CatToNumTransform, MutualInformationSort
 
 parser = argparse.ArgumentParser()
-parser.add_argument('--dataset', type=str, default='higgs_small')
+parser.add_argument('--dataset', type=str, default='california_housing')
+parser.add_argument('--mixup', type=str, default=None,
+                    choices=[None, 'feature', 'hidden'])
 parser.add_argument('--channels', type=int, default=256)
 parser.add_argument('--batch_size', type=int, default=512)
 parser.add_argument('--num_heads', type=int, default=4)
 parser.add_argument('--num_layers', type=int, default=5)
 parser.add_argument('--lr', type=float, default=0.001)
+parser.add_argument('--gamma', type=float, default=0.95)
 parser.add_argument('--epochs', type=int, default=100)
-parser.add_argument('--mixup', type=bool, default=True)
 parser.add_argument('--compile', action='store_true')
 args = parser.parse_args()
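A note on the `--mixup` flag change above: the removed `type=bool` form cannot be switched off from the command line, because `bool()` of any non-empty string (including `'False'`) is `True`. The sketch below uses only the standard library to contrast the two forms. The helper names `build_parser` and `build_bool_parser` are made up for illustration, and `choices` here omits `None` on the assumption that command-line values always arrive as strings, so `None` is only reachable through the default.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """String-valued flag: omit it to disable mixup (value stays None)."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--mixup', type=str, default=None,
                        choices=['feature', 'hidden'])
    parser.add_argument('--gamma', type=float, default=0.95)
    return parser


def build_bool_parser() -> argparse.ArgumentParser:
    """The older pattern: type=bool converts any non-empty string to True."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--mixup', type=bool, default=True)
    return parser


if __name__ == "__main__":
    # No flag given: mixup is None, gamma keeps its default.
    print(build_parser().parse_args([]))
    # Explicit mixup mode and a custom decay factor.
    print(build_parser().parse_args(['--mixup', 'hidden', '--gamma', '1.0']))
    # Pitfall: asking for 'False' still yields True with type=bool.
    print(build_bool_parser().parse_args(['--mixup', 'False']))
```

With this shape, leaving the flag out is the way to disable mixup; passing the literal string `None` is rejected by `choices`.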
 
@@ -76,6 +84,16 @@
 else:
     out_channels = 1
 
+is_binary_class = is_classification and out_channels == 2
+
+if is_binary_class:
+    metric_computer = AUROC(task='binary')
+elif is_classification:
+    metric_computer = Accuracy(task='multiclass', num_classes=out_channels)
+else:
+    metric_computer = MeanSquaredError()
+metric_computer = metric_computer.to(device)
+
 model = ExcelFormer(
     in_channels=args.channels,
     out_channels=out_channels,
@@ -85,12 +103,13 @@
     residual_dropout=0.,
     diam_dropout=0.3,
     aium_dropout=0.,
+    mixup=args.mixup,
     col_stats=mutual_info_sort.transformed_stats,
     col_names_dict=train_tensor_frame.col_names_dict,
 ).to(device)
 model = torch.compile(model, dynamic=True) if args.compile else model
 optimizer = torch.optim.Adam(model.parameters(), lr=args.lr)
-lr_scheduler = ExponentialLR(optimizer, gamma=0.95)
+lr_scheduler = ExponentialLR(optimizer, gamma=args.gamma)
 
 
 def train(epoch: int) -> float:
@@ -99,8 +118,10 @@ def train(epoch: int) -> float:
 
     for tf in tqdm(train_loader, desc=f'Epoch: {epoch}'):
         tf = tf.to(device)
-        pred_mixedup, y_mixedup = model.forward_mixup(tf)
+        # Train with FEAT-MIX or HIDDEN-MIX
+        pred_mixedup, y_mixedup = model(tf, mixup_encoded=True)
         if is_classification:
+            # Softly mixed one-hot labels
             loss = F.cross_entropy(pred_mixedup, y_mixedup)
         else:
             loss = F.mse_loss(pred_mixedup.view(-1), y_mixedup.view(-1))
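The "softly mixed one-hot labels" comment above can be illustrated with a small, dependency-free sketch. It assumes a generic mixup-style scheme: swap a random subset of hidden channels between two samples, then weight the two one-hot labels by the fraction of channels kept. It is not the `ExcelFormer` implementation in `torch_frame`, and every function name below is hypothetical.

```python
import math
import random


def one_hot(label, num_classes):
    """Return a one-hot list for an integer class label."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def hidden_mix(h_a, h_b, lam, rng):
    """Take int(lam * d) randomly chosen channels from h_a, the rest from h_b.

    Returns the mixed vector and the fraction of channels kept from h_a,
    which is then used as the label weight.
    """
    d = len(h_a)
    num_keep = int(lam * d)
    keep = set(rng.sample(range(d), num_keep))
    mixed = [h_a[i] if i in keep else h_b[i] for i in range(d)]
    return mixed, num_keep / d


def mix_labels(y_a, y_b, weight, num_classes):
    """Interpolate two one-hot labels into one soft label."""
    a = one_hot(y_a, num_classes)
    b = one_hot(y_b, num_classes)
    return [weight * p + (1.0 - weight) * q for p, q in zip(a, b)]


def soft_cross_entropy(logits, target):
    """Cross entropy between softmax(logits) and a soft target distribution."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return -sum(t * (v - log_z) for v, t in zip(logits, target))


if __name__ == "__main__":
    rng = random.Random(0)
    lam = rng.betavariate(0.5, 0.5)  # illustrative Beta draw, not a tuned value
    mixed, weight = hidden_mix([1.0] * 8, [0.0] * 8, lam, rng)
    target = mix_labels(0, 2, weight, num_classes=3)
    print(mixed, target, soft_cross_entropy([2.0, 0.1, -1.0], target))
```

In the training loop above, `F.cross_entropy` plays the role of `soft_cross_entropy`: it accepts class-probability targets as well as integer class indices.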
@@ -115,29 +136,26 @@ def train(epoch: int) -> float:
 @torch.no_grad()
 def test(loader: DataLoader) -> float:
     model.eval()
-    accum = total_count = 0
-
+    metric_computer.reset()
     for tf in loader:
         tf = tf.to(device)
         pred = model(tf)
-        if is_classification:
+        if is_binary_class:
+            metric_computer.update(pred[:, 1], tf.y)
+        elif is_classification:
             pred_class = pred.argmax(dim=-1)
-            accum += float((tf.y == pred_class).sum())
+            metric_computer.update(pred_class, tf.y)
         else:
-            accum += float(
-                F.mse_loss(pred.view(-1), tf.y.view(-1), reduction='sum'))
-        total_count += len(tf.y)
+            metric_computer.update(pred.view(-1), tf.y.view(-1))
 
     if is_classification:
-        accuracy = accum / total_count
-        return accuracy
+        return metric_computer.compute().item()
     else:
-        rmse = (accum / total_count)**0.5
-        return rmse
+        return metric_computer.compute().item()**0.5
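For readers unfamiliar with the metric objects used in `test()` above, here is a rough, pure-Python sketch of what the two non-accuracy metrics compute. `StreamingRMSE` and `auroc` are hypothetical stand-ins written for illustration, not `torchmetrics` APIs. The first accumulates squared error across batches and takes the square root once at the end, which is not the same as averaging per-batch RMSE values. The second computes AUC as the probability that a positive sample is scored above a negative one.

```python
import math


class StreamingRMSE:
    """Accumulate squared error over batches; take the root only at the end."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.sum_sq = 0.0
        self.count = 0

    def update(self, preds, targets):
        for p, t in zip(preds, targets):
            self.sum_sq += (p - t) ** 2
        self.count += len(targets)

    def compute(self):
        return math.sqrt(self.sum_sq / self.count)


def auroc(scores, labels):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Quadratic in the number of samples, so this is
    only meant for small illustrative inputs.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined unless both classes are present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    metric = StreamingRMSE()
    metric.update([1.0, 2.0], [1.0, 4.0])  # first batch
    metric.update([0.0], [2.0])            # second, smaller batch
    print(metric.compute())
    print(auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

The `ValueError` branch also shows why a binary evaluation set needs at least one sample of each class for AUC to be computable.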
 
 
 if is_classification:
-    metric = 'Acc'
+    metric = 'Acc' if not is_binary_class else 'AUC'
     best_val_metric = 0
     best_test_metric = 0
 else:
 
@@ -15,7 +15,8 @@
     TaskType.MULTICLASS_CLASSIFICATION,
 ])
 @pytest.mark.parametrize('batch_size', [0, 5])
-def test_excelformer(task_type, batch_size):
+@pytest.mark.parametrize('mixup', [None, 'feature', 'hidden'])
+def test_excelformer(task_type, batch_size, mixup):
     in_channels = 8
     num_heads = 2
     num_layers = 6
@@ -35,6 +36,7 @@ def test_excelformer(task_type, batch_size):
         num_cols=num_cols,
         num_layers=num_layers,
         num_heads=num_heads,
+        mixup=mixup,
         col_stats=dataset.col_stats,
         col_names_dict=tensor_frame.col_names_dict,
     )
@@ -46,7 +48,9 @@ def test_excelformer(task_type, batch_size):
 
     # Test the mixup forward pass
     feat_num = copy.copy(tensor_frame.feat_dict[stype.numerical])
-    out_mixedup, y_mixedup = model.forward_mixup(tensor_frame)
+    # Set lazy mutual information scores for `feature` mixup
+    tensor_frame.mi_scores = torch.rand(feat_num.shape[1])
+    out_mixedup, y_mixedup = model(tensor_frame, mixup_encoded=True)
     assert out_mixedup.shape == (batch_size, out_channels)
     # Make sure the numerical feature is not modified.
     assert torch.allclose(feat_num, tensor_frame.feat_dict[stype.numerical])
 
@@ -65,6 +65,11 @@ def test_cat_to_num_transform_on_categorical_only_dataset(with_nan):
         out.col_names_dict[stype.numerical]) == ((dataset.num_classes - 1) *
                                                  total_cols))
 
+    # Shift the category indices so that the input contains a new category.
+    tensor_frame.feat_dict[stype.categorical] += 1
+    with pytest.raises(RuntimeError, match="contains new category"):
+        transform(tensor_frame)
+
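The fail-fast behaviour this test checks can be sketched in isolation. The class below is a hypothetical mean-target encoder written only to show the pattern (fit on known categories, raise an informative `RuntimeError` on unseen ones). It does not reproduce `CatToNumTransform`, and its name, methods, and error wording are assumptions.

```python
class MeanTargetEncoder:
    """Hypothetical encoder mapping each category index to its mean target."""

    def fit(self, categories, targets):
        sums, counts = {}, {}
        for c, t in zip(categories, targets):
            sums[c] = sums.get(c, 0.0) + t
            counts[c] = counts.get(c, 0) + 1
        self.means = {c: sums[c] / counts[c] for c in sums}
        return self

    def transform(self, categories):
        # Fail early with a clear message rather than a bare KeyError.
        unseen = sorted(set(categories) - set(self.means))
        if unseen:
            raise RuntimeError(
                f"Input contains new category indices {unseen} "
                f"that were not seen during fit")
        return [self.means[c] for c in categories]


if __name__ == "__main__":
    encoder = MeanTargetEncoder().fit([0, 0, 1], [1.0, 3.0, 5.0])
    print(encoder.transform([0, 1]))
    try:
        encoder.transform([1, 2])  # index 2 was never seen in fit()
    except RuntimeError as err:
        print(err)
```

Checking for unseen values before the lookup is what lets the error message name the offending categories.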
 
 @pytest.mark.parametrize('task_type', [
     TaskType.MULTICLASS_CLASSIFICATION, TaskType.REGRESSION,
 
@@ -81,16 +81,22 @@ def __init__(
         elif task_type == TaskType.MULTICLASS_CLASSIFICATION:
             labels = np.random.randint(0, 3, size=(num_rows, ))
             if num_rows < 3:
-                raise ValueError("Number of rows needs to be at"
-                                 " least 3 for multiclass classification")
+                raise ValueError("Number of rows needs to be at "
+                                 "least 3 for multiclass classification")
             # make sure every label exists
             labels[0] = 0
             labels[1] = 1
             labels[2] = 2
             df_dict = {'target': labels}
             col_to_stype = {'target': stype.categorical}
         elif task_type == TaskType.BINARY_CLASSIFICATION:
-            df_dict = {'target': np.random.randint(0, 2, size=(num_rows, ))}
+            labels = np.random.randint(0, 2, size=(num_rows, ))
+            if num_rows < 2:
+                raise ValueError("Number of rows needs to be at "
+                                 "least 2 for binary classification")
+            # make sure every label exists
+            labels[0] = 0
+            labels[1] = 1
+            df_dict = {'target': labels}
             col_to_stype = {'target': stype.categorical}
         else:
             raise ValueError(