Supervisor: Prof. Subhasis Chaudhuri, Prof. Biplab Banerjee
Department: DUAL DEGREE Electrical Engineering
Description of research work:
Cross-Attention is what you need!
Sub Heading: FusAtNet: Dual Attention Based SpectroSpatial Multimodal Fusion
With recent advances in sensing, multimodal data is becoming readily available for many applications, especially in remote sensing (RS), where data types such as multispectral (MSI), hyperspectral (HSI) and LiDAR are available.
Effective fusion of these multi-source datasets is becoming important, because multi-modality features have been shown to generate highly accurate land-cover maps. However, fusion in the context of RS is non-trivial, given the redundancy in the data and the large domain differences among modalities. In addition, the feature extraction modules for different modalities hardly interact with one another, which further limits their semantic relatedness.
Why is a single fused representation important?
Combining multimodal images has several advantages, including generating a rich, fused representation and helping to select task-relevant features.
Interestingly, most common methods today rely on early concatenation, concatenation of CNN-extracted features, or multi-stream decision-level fusion, and overlook cross-domain features entirely. Visual attention, a recent addition to the deep learning researcher's toolbox, is largely unexplored in the multimodal domain.
A question arises: how can these modalities best be fused into a joint, rich representation that can be used in downstream tasks?
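The three baseline strategies named above can be contrasted in a minimal NumPy sketch. Everything here is an assumption made for illustration: the pixel sizes (144 HSI bands, 1 LiDAR channel), the 15 classes, the random weights, and the use of a linear layer with tanh as a stand-in for a CNN feature extractor. It is not the implementation of any particular published method.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n_classes = 15                      # hypothetical number of land-cover classes
hsi_pixel = rng.normal(size=144)    # hypothetical HSI spectrum for one pixel
lidar_pixel = rng.normal(size=1)    # hypothetical LiDAR height for the same pixel

# 1) Early concatenation: stack the raw inputs, then classify once.
early_input = np.concatenate([hsi_pixel, lidar_pixel])
early_probs = softmax(early_input @ rng.normal(size=(145, n_classes)))

# 2) Feature-level concatenation: extract features per modality
#    (linear + tanh stands in for a CNN), then concatenate and classify.
hsi_feat = np.tanh(hsi_pixel @ rng.normal(size=(144, 32)))
lidar_feat = np.tanh(lidar_pixel @ rng.normal(size=(1, 8)))
feature_input = np.concatenate([hsi_feat, lidar_feat])
feature_probs = softmax(feature_input @ rng.normal(size=(40, n_classes)))

# 3) Decision-level fusion: classify each stream separately,
#    then average the class probabilities.
hsi_probs = softmax(hsi_feat @ rng.normal(size=(32, n_classes)))
lidar_probs = softmax(lidar_feat @ rng.normal(size=(8, n_classes)))
decision_probs = 0.5 * hsi_probs + 0.5 * lidar_probs
```

In all three variants the two streams only meet at a concatenation or an average, so neither modality influences how the other's features are extracted.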
An ideal fusion method would synergistically combine the two modalities and ensure that the resultant product reflects the salient features of input modalities.
In this work, we introduce a new concept, “cross-attention”, and propose attention-based HSI-LiDAR fusion in the context of land-cover classification.
A New Concept: Cross Attention
Cross attention is a novel and intuitive fusion method in which attention masks derived from one modality (here, LiDAR) are used to highlight the features extracted from another modality (here, HSI). This differs from self-attention, where an attention mask derived from HSI is used to highlight HSI's own spectral features.
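The difference between the two masking schemes can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the FusAtNet implementation: the text does not specify how the masks are computed, so the per-pixel linear projection followed by a sigmoid, the function names, the patch size (11 x 11) and the channel counts are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_mask(features, weights):
    """Project (H, W, C_in) features to an (H, W, C_out) mask with values in (0, 1).

    The per-pixel linear projection is an assumed stand-in for a learned
    attention module.
    """
    return sigmoid(features @ weights)

def cross_attention(hsi_features, lidar_features, weights):
    """Highlight HSI features with a mask computed from LiDAR features."""
    return hsi_features * attention_mask(lidar_features, weights)

def self_attention(hsi_features, weights):
    """Highlight HSI features with a mask computed from the HSI features themselves."""
    return hsi_features * attention_mask(hsi_features, weights)

rng = np.random.default_rng(0)
hsi = rng.normal(size=(11, 11, 64))    # hypothetical HSI feature patch
lidar = rng.normal(size=(11, 11, 8))   # hypothetical LiDAR feature patch
w_cross = rng.normal(size=(8, 64))     # maps LiDAR channels to HSI channels
w_self = rng.normal(size=(64, 64))     # maps HSI channels to HSI channels

cross_out = cross_attention(hsi, lidar, w_cross)
self_out = self_attention(hsi, w_self)
```

Both outputs keep the shape of the HSI features; only the source of the mask changes. Because every mask value lies between 0 and 1, masking can attenuate a feature but never amplify it.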
FusAtNet: Using Cross Attention in practice
We propose FusAtNet, a feature fusion and extraction framework for joint land-cover classification of HSI and LiDAR data. The framework uses the HSI modality to generate an attention map through a “self-attention” mechanism that highlights its own spectral features. In parallel, a “cross-attention” approach uses a LiDAR-derived attention map to accentuate the spatial features of the HSI. These attentive spectral and spatial representations are then processed further, together with the original data, to obtain modality-specific feature embeddings. The resulting modality-oriented joint spectro-spatial information is then used to carry out the land-cover classification task.
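The data flow described above can be traced in a simplified NumPy sketch. It follows only the order of steps given in the text (self-attention mask from HSI, cross-attention mask from LiDAR, combination with the original data, embedding, classification). The layer types, the function name `fusion_forward`, the parameter names, the patch size, the channel counts and the 15 classes are assumptions for illustration and do not reproduce the actual FusAtNet architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fusion_forward(hsi_patch, lidar_patch, params):
    """Hypothetical forward pass; per-pixel linear maps stand in for learned layers."""
    # Features extracted from the HSI patch.
    hsi_feat = hsi_patch @ params["w_feat"]
    # Self-attention: mask computed from HSI, highlights spectral features.
    spectral_mask = sigmoid(hsi_patch @ params["w_self"])
    # Cross-attention: mask computed from LiDAR, highlights spatial features of HSI.
    spatial_mask = sigmoid(lidar_patch @ params["w_cross"])
    spectral = hsi_feat * spectral_mask
    spatial = hsi_feat * spatial_mask
    # Combine the attended representations with the original data.
    joint = np.concatenate([hsi_patch, lidar_patch, spectral, spatial], axis=-1)
    # Global average pooling gives one embedding vector for the patch.
    embedding = joint.mean(axis=(0, 1))
    # Linear classifier over the land-cover classes.
    return softmax(embedding @ params["w_cls"])

rng = np.random.default_rng(0)
bands, depth, n_classes = 144, 32, 15            # hypothetical sizes
params = {
    "w_feat": rng.normal(size=(bands, depth)),
    "w_self": rng.normal(size=(bands, depth)),
    "w_cross": rng.normal(size=(1, depth)),
    "w_cls": rng.normal(size=(bands + 1 + 2 * depth, n_classes)),
}
hsi_patch = rng.normal(size=(11, 11, bands))     # hypothetical HSI patch
lidar_patch = rng.normal(size=(11, 11, 1))       # hypothetical LiDAR patch
probs = fusion_forward(hsi_patch, lidar_patch, params)
```

With random weights the output probabilities carry no meaning; the sketch is only meant to show where the two attention masks enter and how the original data is carried through to the classifier.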
Experimental evaluations on three HSI-LiDAR datasets, including Houston, the largest HSI-LiDAR benchmark dataset available, show that the proposed method achieves state-of-the-art classification performance. In remote sensing, where lack of data is a major problem, our method begins to outperform existing state-of-the-art methods with only 50% of the training data. It outperforms all existing deep fusion strategies and opens new avenues in multimodal classification. Some of these applications cannot be disclosed at the moment because they are the subject of our forthcoming publications, but in principle the approach is relevant to every multi-input deep learning algorithm.