
Jieyu Zhang | Le Xue | Zeyuan Johnson Chen | Ran Xu
Jan 8, 2025 | salesforce.com
The development of multimodal language models (MLMs) such as GPT-4V and BLIP [1,2] has enabled many multimodal applications, such as answering complex image-based queries; for example, "How many students are raising their hands in this image?". These models rely heavily on instruction data: datasets that pair visual content with corresponding questions and answers. However, generating such data at scale is challenging given the limitations of existing approaches.
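To make the notion of instruction data concrete, here is a minimal sketch of a single record pairing an image with a question-answer turn. The field names and structure are illustrative assumptions for this post, not the schema of any particular model or dataset.

```python
# A hypothetical visual instruction-data record: one image paired with
# one question-answer conversation turn. Schema is illustrative only.

def make_instruction_record(image_path, question, answer):
    """Pair visual content with a corresponding question and answer."""
    return {
        "image": image_path,  # reference to the visual content
        "conversations": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ],
    }

record = make_instruction_record(
    "classroom.jpg",
    "How many students are raising their hands in this image?",
    "Three students are raising their hands.",
)
print(record["conversations"][0]["content"])
```

A real instruction-tuning corpus consists of many such records, which is why automating their generation matters.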