Abstract: Sora-like video generation models have achieved remarkable progress with a Multi-Modal Diffusion Transformer (MM-DiT) architecture. However, the current video generation models predominantly ...
Abstract: Remote sensing image (RSI) captioning is a vision-language multimodal task concentrating on both image comprehension and sentence generation. Several studies suggest that ...