Abstract: With its powerful visual-language alignment capability, CLIP performs well in zero-shot and few-shot learning tasks. However, in our experiments we found that CLIP’s logits suffer from serious ...
Abstract: Multimodal sarcasm detection, which aims to uncover the sarcastic sentiment behind multimodal data, has gained substantial attention in the multimodal research community. Recent advancements in multimodal ...