LLaVA (Large Language and Vision Assistant) is a powerful multimodal model that combines the strengths of language and vision. Built on OpenAI’s CLIP vision encoder and a fine-tuned version of Meta’s Llama 2 7B model, LLaVA uses visual instruction tuning to follow natural-language instructions grounded in images and to perform visual reasoning. This allows LLaVA to perform a range of tasks (a brief usage sketch follows the list below), including:
Visual question answering: answering questions based on image content
Caption generation: generating text descriptions of images
Optical character recognition (OCR): identifying text in images
Multimodal dialogue: engaging in conversations that involve both text and images
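
The quickest way to try these capabilities is with an off-the-shelf checkpoint. The snippet below is a minimal sketch of visual question answering, assuming the community `llava-hf/llava-1.5-7b-hf` weights on the Hugging Face Hub and a recent `transformers` release that includes `LlavaForConditionalGeneration`; the image URL is only a placeholder.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumed checkpoint name; swap in whichever LLaVA weights you use.
model_id = "llava-hf/llava-1.5-7b-hf"

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Load an example image (any RGB image works).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# LLaVA-1.5 prompt format: the <image> token marks where visual features are inserted.
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

The same prompt-plus-image pattern covers the other tasks in the list: asking for a description gives caption generation, asking what text appears in the image approximates OCR, and appending further turns to the prompt supports multimodal dialogue.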