Abstract: Despite significant advances in multimodal learning for remote sensing, particularly in image-image feature fusion, audio-image feature fusion remains underexplored. Given ...