Pixtral 12B is the first-ever multimodal Mistral model.
Pixtral 12B in short:
Natively multimodal, trained with interleaved image and text data
Strong performance on multimodal tasks, excels in instruction following
Maintains state-of-the-art performance on text-only benchmarks
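Native multimodality means a single prompt can interleave text and images freely. A minimal sketch of what such an interleaved chat request could look like, assuming an OpenAI-style message schema with typed content parts; the model identifier, field names, and URLs below are illustrative assumptions, not a documented API contract:

```python
# Hypothetical request payload showing text and images interleaved in one
# user turn. Schema and model name are assumptions for illustration only.
payload = {
    "model": "pixtral-12b",  # assumed model identifier
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these two charts:"},
                {"type": "image_url", "image_url": "https://example.com/chart_a.png"},
                {"type": "image_url", "image_url": "https://example.com/chart_b.png"},
                {"type": "text", "text": "Which one shows faster growth?"},
            ],
        }
    ],
}

# Count the parts of each type in the single user turn.
parts = payload["messages"][0]["content"]
n_text = sum(1 for p in parts if p["type"] == "text")
n_images = sum(1 for p in parts if p["type"] == "image_url")
```

Because the model was trained on interleaved image and text data, images can appear anywhere in the turn rather than only as a prefix or suffix.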
Architecture:
New 400M parameter vision encoder trained from scratch
12B parameter multimodal decoder based on Mistral Nemo
Supports variable image sizes and aspect ratios
Supports multiple images in its long context window of 128k tokens
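To see how variable image sizes interact with the 128k-token context window, here is a rough back-of-the-envelope sketch. It assumes the encoder tokenizes images at native resolution into 16x16-pixel patches, one token per patch, with one assumed row-break token per patch row; the exact special-token accounting of the real model may differ:

```python
import math

PATCH = 16  # assumed patch edge length in pixels

def image_token_count(width: int, height: int) -> int:
    """Rough token count for one image at native resolution:
    one token per 16x16 patch, plus one assumed break token per row."""
    cols = math.ceil(width / PATCH)
    rows = math.ceil(height / PATCH)
    return rows * cols + rows  # patch tokens + per-row break tokens (assumed)

# Two images of different sizes and aspect ratios:
used = image_token_count(1024, 768) + image_token_count(512, 512)
remaining = 128_000 - used  # budget left for text and further images
```

Under these assumptions a 1024x768 image costs a few thousand tokens, so many images plus substantial text fit comfortably within the 128k-token window.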