SUM

Abstract

Visual attention modeling, important for interpreting and prioritizing visual stimuli, plays a significant role in applications such as marketing, multimedia, and robotics. Traditional saliency prediction models, especially those based on Convolutional Neural Networks (CNNs) or Transformers, achieve notable success by leveraging large-scale annotated datasets. However, the current state-of-the-art (SOTA) models that use Transformers are computationally expensive. Additionally, separate models are often required for each image type, lacking a unified approach. In this paper, we propose Saliency Unification through Mamba (SUM), a novel approach that integrates the efficient long-range dependency modeling of Mamba with U-Net to provide a unified model for diverse image types. Using a novel Conditional Visual State Space (C-VSS) block, SUM dynamically adapts to various image types, including natural scenes, web pages, and commercial imagery, ensuring universal applicability across different data types. Our comprehensive evaluations across five benchmarks demonstrate that SUM seamlessly adapts to different visual characteristics and consistently outperforms existing models. These results position SUM as a versatile and powerful tool for advancing visual attention modeling, offering a robust solution universally applicable across different types of visual content.

Key Words: Saliency Prediction - Mamba - Unification

Chat about the Paper

Evaluation & Visualization

We conducted comprehensive testing of our universal model, SUM, across six different datasets, each benchmarked against state-of-the-art models. These datasets cover various areas including natural scenes, user interfaces, and e-commerce. SUM consistently outperformed existing models, achieving top results in 27 out of 30 metrics and securing second place in the remaining three. This highlights the model's effectiveness and versatility across different types of data, setting a new standard in the field. Additionally, compared to other models, SUM is relatively efficient, leveraging Mamba's capabilities to create a model that is robust, efficient, and universally applicable.

Saliency prediction performance across various datasets. ^∗ indicates that we have trained those models ourselves for fair comparison. ^† signifies that the results have been taken from the paper by Hosseini et al, and the rest of the results are taken from their respective papers. For our model, we note the percentage (%) change in performance relative to the second-best result, or to the best result if ours is not the top performer.