<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="Robust Multimodal Learning with Missing Modalities via Parameter-Efficient Adaptation">
<meta name="author" content="Md Kaykobad Reza, Ashley Prater-Bennette, M. Salman Asif">
<title>Robust Multimodal Learning with Missing Modalities via Parameter-Efficient Adaptation</title>
<!-- Bootstrap core CSS -->
<!--link href="bootstrap.min.css" rel="stylesheet"-->
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css"
integrity="sha384-Gn5384xqQ1aoWXA+058RXPxPg6fy4IWvTNh0E263XmFcJlSAwiGgFAW/dAiS6JXm" crossorigin="anonymous">
<!-- Custom styles for this template -->
<link href="offcanvas.css" rel="stylesheet">
<link rel="stylesheet" type="text/css" href="src/css/style.css">
</head>
<body>
<div class="jumbotron jumbotron-fluid">
<div class="container"></div>
<h2>Robust Multimodal Learning with Missing Modalities</h2>
<h2>via Parameter-Efficient Adaptation</h2>
<p class="abstract"><b>Adapting multimodal models for different missing modality scenarios</b></p>
<hr>
<p class="authors">
<a href="https://kaykobad.github.io/" target="_blank">Md Kaykobad Reza<sup> 1</sup></a>,
<a href="https://scholar.google.com/citations?user=f1WPBE8AAAAJ&hl=en" target="_blank">Ashley Prater-Bennette<sup> 2</sup></a>, and
<a href="https://intra.ece.ucr.edu/~sasif/" target="_blank"> M. Salman Asif<sup> 1</sup></a>
</p>
<p>
<a><sup>1</sup> University of California Riverside, CA, USA</a><br>
<a><sup>2</sup> Air Force Research Laboratory, NY, USA</a><br>
</p>
<div>
<a class="btn btn-primary" href="https://ieeexplore.ieee.org/document/10713849" target="_blank">Paper (TPAMI)</a>
<a class="btn btn-primary" href="https://arxiv.org/abs/2310.03986" target="_blank">Paper (arXiv)</a>
<!-- <a class="btn btn-primary" href="https://github.com/CSIPlab/Robust-multimodal-learning" target="_blank">Code (GitHub)</a> -->
<a class="btn btn-primary" href="https://CSIPlab.github.io/Missing-Modality-Adaptation">Webpage</a>
</div>
</div>
</div>
<div class="container">
<div class="section">
<div class="row">
<div class="col text-center">
<img src="./img/ssf-final.png" style="width:80%" alt="Banner">
<p class="text-left"><b>Figure 1:</b> a) Overview of our model adaptation approach for robust MML. A model pretrained on all the modalities is adapted using a small number of learnable parameters to handle different modality combinations. We insert adaptable layers after each layer of the encoders and the fusion block to learn the modulation as a function of the available input modalities to compensate for the missing modalities. The grayed-out branch (missing modality) is inactive and does not contribute to the output. b) Low-rank model adaption computes features using frozen weights and low-rank weight updates and combine them. c) Scale and shift feature adaptation transforms input by element-wise multiplication and addition.</p>
</div>
</div>
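<p class="text-left">As a rough illustration of the adaptation layers in Figure 1(b) and 1(c), the sketch below shows what PyTorch-style low-rank and scale-and-shift adapter modules could look like. The class names, rank, and initialization are illustrative choices for this page, not the released implementation.</p>
<pre class="text-left"><code>
# Minimal sketch (not the paper's released code) of the two adaptation layers
# described in Figure 1: a low-rank update around a frozen layer (b) and an
# element-wise scale-and-shift of intermediate features (c).
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Frozen linear layer plus a trainable low-rank weight update (Fig. 1b)."""
    def __init__(self, frozen_linear: nn.Linear, rank: int = 4):
        super().__init__()
        self.frozen = frozen_linear
        for p in self.frozen.parameters():
            p.requires_grad = False                 # pretrained weights stay fixed
        self.down = nn.Linear(frozen_linear.in_features, rank, bias=False)
        self.up = nn.Linear(rank, frozen_linear.out_features, bias=False)
        nn.init.zeros_(self.up.weight)              # start as an identity update

    def forward(self, x):
        # features from the frozen weights and the low-rank update, combined by addition
        return self.frozen(x) + self.up(self.down(x))

class ScaleShiftAdapter(nn.Module):
    """Element-wise scale and shift of intermediate features (Fig. 1c)."""
    def __init__(self, dim: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))
        self.shift = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        # element-wise multiplication and addition
        return x * self.scale + self.shift
</code></pre>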
<h2>Abstract</h2>
<hr>
<p>Multimodal learning seeks to utilize data from multiple sources to improve the overall performance of downstream tasks. It is desirable for redundancies in the data to make multimodal systems robust to missing or corrupted observations in some correlated modalities. However, we observe that the performance of several existing multimodal networks significantly deteriorates if one or multiple modalities are absent at test time. To enable robustness to missing modalities, we propose a simple and parameter-efficient adaptation procedure for pretrained multimodal networks. In particular, we exploit modulation of intermediate features to compensate for the missing modalities. We demonstrate that such adaptation can partially bridge the performance drop due to missing modalities and, in some cases, outperform independent, dedicated networks trained for the available modality combinations. The proposed adaptation requires an extremely small number of parameters (e.g., fewer than 1% of the total parameters) and is applicable to a wide range of modality combinations and tasks. We conduct a series of experiments to highlight the missing modality robustness of our proposed method on five different multimodal tasks across seven datasets. Our proposed method demonstrates versatility across various tasks and datasets, and outperforms existing methods for robust multimodal learning with missing modalities.</p>
<br>
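<p class="text-left">The "fewer than 1% of the total parameters" figure in the abstract refers to the trainable parameter budget: the pretrained network is frozen and only the inserted adaptation layers are updated. A toy sanity check of that kind of ratio (with stand-in modules, not the actual models from the paper) might look like:</p>
<pre class="text-left"><code>
import torch.nn as nn

def trainable_fraction(model: nn.Module) -> float:
    """Fraction of parameters that would be updated during adaptation."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable / total

# Toy example: freeze a stand-in "backbone" and attach a small trainable adapter.
backbone = nn.Sequential(nn.Linear(512, 512), nn.Linear(512, 512))
for p in backbone.parameters():
    p.requires_grad = False
adapter = nn.Linear(512, 4, bias=False)      # stand-in for an adaptation layer
model = nn.ModuleDict({"backbone": backbone, "adapter": adapter})
print(f"trainable: {100 * trainable_fraction(model):.3f}% of all parameters")
</code></pre>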
<!-- <video width="100%" height="400" controls="True" preload="none" loop muted autoplay = 'autoplay' playsInline>
<source type="video/mp4" src="img/Presentation1.mov" />
<source type="video/webm" src="img/video1.webm" />
</video> -->
</div>
<div class="section">
<h2>Experiments for Multimodal Segmentation</h2>
<hr>
<div class="row">
<div class="col text-center">
<img src="./img/overall-comparison-for-segmentation.png" style="width:100%" alt="Banner">
<p class="text-left"><b>Table 1:</b> Performance comparison with different baseline methods for multimodal semantic segmentation on MFNet and NYUDv2 datasets and multimodal material segmentation on MCubeS dataset. We use CMNeXt as the base model. <b>Bold</b> letters represent best results.</p>
</div>
</div>
<div class="row">
<div class="col text-center">
<img src="./img/visualization.png" style="width:85%" alt="Banner">
<p class="text-left"><b>Figure 2:</b> Examples of predicted segmentation maps for the Pretrained and Adapted models. Title above each subimage shows method name (available modalities). CMNeXt column shows the predictions with all the modalities. Segmentation quality improves significantly after model adaptation for all input modality combinations. Green boxes highlight areas with salient differences in results (e.g., cars and humans missing in the Pretrained model with missing modalities but visible in the Adapted model). For MCubeS dataset, we only show RGB input images for brevity. A, D and N denote angle of linear polarization, degree of linear polarization, and near-infrared, respectively.</p>
</div>
</div>
</div>
<div class="section">
<h2>Comparison with Robust Models and Other Adaptation Methods</h2>
<hr>
<div class="row">
<div class="col text-center">
<img src="./img/mfnet.png" style="width:100%" alt="Banner">
<p class="text-left"><b>Table 2:</b> Performance comparison with existing robust methods for MFNet dataset. RGB and Thermal columns report performance when only RGB and only Thermal are available. Average column reports average performance when one of the two modalities gets missing. ‘-’ indicates that results for those cells are not published. ∗ indicates that available code and pretrained models from the authors were used to generate the results.</p>
</div>
</div>
<div class="row">
<div class="col text-center">
<img src="./img/nyu.png" style="width:100%" alt="Banner">
<p class="text-left"><b>Table 3:</b> Performance comparison with existing robust methods for NYUDv2 dataset. RGB and Depth columns report performance when only RGB and only Depth are available. Average column indicates average performance when one of the two modalities gets missing. ∗ indicates that available code and pretrained models from the authors were used to generate the results. Other results are from the corresponding papers.</p>
</div>
</div>
<div class="row">
<div class="col text-center">
<img src="./img/comparison-with-other-adaptation.png" style="width:100%" alt="Banner">
<p class="text-left"><b>Table 4:</b> Performance comparison (% mIoU) of different parameter-efficient adaptation techniques for MFNet, NYUDv2, and MCubeS datasets. Each column reports mIoU of the Adapted model with the corresponding modalities, and Avg indicates average performance. A and D denote Angle and Degree of Linear Polarization.</p>
</div>
</div>
</div>
<div class="section">
<h2>Experiments for Multimodal Sentiment Analysis</h2>
<hr>
<div class="row">
<div class="col text-center">
<img src="./img/comparison-for-mmsa.png" style="width:100%" alt="Banner">
<p class="text-left"><b>Table 5:</b> Comparison of our adaptation technique with existing methods for multimodal sentiment analysis on CMU-MOSI and CMU-MOSEI datasets.</p>
</div>
</div>
</div>
<div class="section">
<h2>Experiments for Multimodal Action Recognition and Classification</h2>
<hr>
<div class="row">
<div class="col text-center">
<div class="imgcontainer">
<figure>
<img src="./img/multimodal-action-recognition.png" style="width:100%" alt="Banner">
<p class="text-left"><b>Table 6:</b> Performance (top-1 accuracy) comparison with existing methods for action recognition on NTU RGB+D dataset. RGB and Depth columns report performance when only RGB and only Depth are available. Avg column indicates average performance. ∗ indicates that available code and pretrained models were used to generate the results.</p>
</figure>
<figure>
<img src="./img/multimodal-classification.png" style="width:100%" alt="Banner">
<p class="text-left"><b>Table 7:</b> Performance (accuracy) comparison with prompting based approach for multimodal classification on UPMC Food-101 dataset. Image and text columns indicate the amount of image and text modality available during both training and testing. † indicates that those values are approximated from the plots published in [27].</p>
</figure>
</div>
</div>
</div>
</div>
<div class="section">
<h2>Cosine Similarity Analysis</h2>
<hr>
<div class="row">
<div class="col text-center">
<div class="imgcontainer">
<img src="./img/mcubes-cosine-similarity.png" style="width:50%" alt="Banner">
<img src="./img/ntu-cosine-similarity.png" style="width:50%" alt="Banner">
</div>
<p class="text-left"><b>Figure 3:</b> Cosine similarity between complete and missing modality features of the pretrained model (Pretrained) and complete and missing modality features of the adapted model (Adapted) on MCubeS and NTU RGB+D datasets. Adapted models show higher similarity to the complete modality features compared to the pretrained model, indicating less deviation and better handling of missing modalities.</p>
</div>
</div>
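<p class="text-left">The quantity plotted above is a cosine similarity between features extracted with all modalities and features extracted with a modality dropped. A minimal sketch of that measurement, using random tensors as stand-ins for the real encoder outputs, could be:</p>
<pre class="text-left"><code>
import torch
import torch.nn.functional as F

def mean_cosine_similarity(full_feats, missing_feats):
    """Average cosine similarity between complete- and missing-modality features."""
    return F.cosine_similarity(full_feats, missing_feats, dim=-1).mean()

# Random stand-ins for encoder outputs (batch of 8, 256-dimensional features).
full = torch.randn(8, 256)                            # all modalities available
pretrained_missing = torch.randn(8, 256)              # pretrained model, one modality dropped
adapted_missing = full + 0.1 * torch.randn(8, 256)    # adapted features stay closer to full
print(mean_cosine_similarity(full, pretrained_missing).item())  # near 0
print(mean_cosine_similarity(full, adapted_missing).item())     # close to 1
</code></pre>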
</div>
<div class="section">
<h2>Paper</h2>
<hr>
<div>
<div class="list-group">
<a href="https://arxiv.org/abs/2310.03986"
class="list-group-item">
<img src="img/paper.jpeg" style="width:100%; margin-right:-20px; margin-top:-10px;">
</a>
</div>
</div>
</div>
<div class="section">
<h2>Bibtex</h2>
<hr>
<div class="bibtexsection text-wrap">
@ARTICLE{10713849,
author={Reza, Md Kaykobad and Prater-Bennette, Ashley and Asif, M. Salman},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
title={Robust Multimodal Learning with Missing Modalities via Parameter-Efficient Adaptation},
year={2024},
volume={},
number={},
pages={1-13},
keywords={
Adaptation models;Training;Computational modeling;Robustness;Modulation;Transforms;
Sentiment analysis;Data models;Solid modeling;Knowledge engineering;
Robust multimodal learning;parameter-efficient adaptation;missing modality adaptation;
missing modality robustness
},
doi={10.1109/TPAMI.2024.3476487}
}
</div>
</div>
<hr>
</div>
</body>
</html>