French AI startup Mistral has released its first multimodal model, Pixtral 12B, which can process images as well as text. The 12-billion-parameter model is approximately 24GB in size and is built on one of Mistral’s text models, Nemo 12B.
Pixtral 12B can answer questions about an arbitrary number of images of any size given either URLs or base64-encoded images. This allows the model to perform tasks like captioning images and counting objects in a photo, similar to other multimodal models like Anthropic’s Claude family and OpenAI’s GPT-4o.
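Passing a base64-encoded image alongside a text question, as described above, can be sketched roughly as follows. Note that this is a hypothetical payload shape modeled on common multimodal chat APIs; the article does not document Pixtral's actual request schema, so the field names here are assumptions.

```python
import base64


def encode_image_b64(path: str) -> str:
    """Read an image file and return its base64-encoded string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")


def build_message(question: str, image_b64: str) -> dict:
    # Hypothetical chat-style message mixing text and image parts;
    # the keys mirror typical multimodal APIs, not a confirmed
    # Pixtral schema.
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_b64", "image": image_b64},
        ],
    }
```

The same structure would accept an image URL in place of the base64 string, since the model reportedly supports both input forms.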
The model is available for download on GitHub and Hugging Face under an Apache 2.0 license, which permits use without restrictions. However, it’s currently not possible to test Pixtral 12B online, as no working web demo exists yet. The company plans to make it available for testing soon through its chatbot and API-serving platforms, Le Chat and La Plateforme.
It remains unclear which image data Mistral used to develop Pixtral 12B. Most generative AI models are trained on vast quantities of public data scraped from around the web, much of it copyrighted. Some model vendors argue that “fair use” entitles them to scrape any public data, but many copyright holders disagree and have filed lawsuits against the larger vendors.
Pixtral 12B comes after Mistral closed a $645 million funding round led by General Catalyst, valuing the company at $6 billion. Positioned as one of Europe’s answers to OpenAI, Mistral’s strategy involves releasing free “open” models, charging for managed versions, and providing consulting services to corporate customers.
Source: https://techcrunch.com/2024/09/11/mistral-releases-pixtral-its-first-multimodal-model/