Meta, the American multinational technology conglomerate, has released a multisensory AI model called ImageBind. It is a multimodal artificial intelligence system that works with six distinct types of data: text, audio, visual (images and video), depth (3D), thermal, and motion (inertial measurement unit) data.
ImageBind can receive input in any of its supported modalities and relate it to the others. For instance, if you show it a picture of a beach, it can associate that picture with the sound of waves.
This newly launched tool reflects Meta's larger goal: the company has long planned to develop a multimodal artificial intelligence system capable of learning from any and all forms of data. As the number of supported modalities increases, ImageBind could pave the way for researchers to create artificial intelligence systems that are both more advanced and more comprehensive.
ImageBind learns a single shared embedding space for several modalities by making use of image-paired data: each non-visual modality is aligned to images, which lets the modalities "talk" to each other and uncover relationships without ever being observed together during training. As a result, other models can work with new modalities without requiring extensive retraining.
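The idea of a shared embedding space can be illustrated with a minimal sketch. In the real model, a separate encoder per modality maps inputs into one common vector space; here, the vectors are hypothetical stand-ins for encoder outputs, and cross-modal retrieval is just a nearest-neighbor search by cosine similarity:

```python
import math

# Hypothetical embeddings: in ImageBind, each modality has its own encoder,
# but all encoders map into ONE shared space. These toy vectors stand in
# for real encoder outputs and are purely illustrative.
EMBEDDINGS = {
    ("image", "beach photo"):   [0.9, 0.1, 0.2],
    ("audio", "ocean waves"):   [0.8, 0.2, 0.3],
    ("audio", "dog barking"):   [0.1, 0.9, 0.4],
    ("text",  "a sandy beach"): [0.85, 0.15, 0.25],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest(query_key, target_modality):
    """Retrieve the closest item of another modality in the shared space."""
    query = EMBEDDINGS[query_key]
    candidates = [k for k in EMBEDDINGS if k[0] == target_modality]
    return max(candidates, key=lambda k: cosine_similarity(query, EMBEDDINGS[k]))

# A beach image retrieves the sound of waves rather than a barking dog,
# because their embeddings lie close together in the shared space.
print(nearest(("image", "beach photo"), "audio"))  # → ("audio", "ocean waves")
```

Because every modality lands in the same space, the same similarity search works for any input-output pairing (image-to-audio, text-to-image, and so on) without pairwise training.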
The model shows strong scaling behavior: its performance and accuracy improve as the underlying vision model grows, which suggests that larger vision models can also benefit non-visual tasks such as audio classification. Compared with previous methods, ImageBind is also a significant step forward in zero-shot retrieval, audio classification, and depth classification.
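Zero-shot classification follows directly from the shared space: an input is labeled by finding the most similar text prompt, with no task-specific classifier trained. A minimal sketch, again using illustrative stand-in vectors rather than real encoder outputs:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical text-label embeddings in the shared space.
TEXT_LABELS = {
    "a dog barking":      [0.1, 0.9, 0.3],
    "ocean waves":        [0.9, 0.1, 0.2],
    "a train passing by": [0.3, 0.3, 0.9],
}

def zero_shot_classify(audio_embedding):
    """Label an audio clip by its nearest text prompt, no classifier training."""
    return max(TEXT_LABELS, key=lambda lbl: cosine(audio_embedding, TEXT_LABELS[lbl]))

clip = [0.85, 0.2, 0.25]  # hypothetical encoder output for a surf recording
print(zero_shot_classify(clip))  # → "ocean waves"
```

The label set is just a list of text strings, so it can be changed at inference time, which is what makes the approach "zero-shot."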
Mark Zuckerberg, CEO of Meta, took to his Instagram and posted about the release of ImageBind. He wrote: “This is a step towards AIs that understand the world around them more like we do, which will make them a lot more useful and will open up totally new ways to create things.”
According to Meta, ImageBind employs a methodology analogous to the way humans collect information through several senses and then analyze all of it concurrently and comprehensively. The company plans to expand the supported data modes in the future to include other senses, such as touch, speech, smell, and fMRI signals from the brain. This would make it possible to create richer, human-centric AI models.
Existing AI models such as OpenAI's DALL·E 2, Stable Diffusion, and Midjourney can create impressive images using text and images as prompts. These systems accept natural-language text prompts as input and then generate an image based on them.
ImageBind has a wide range of potential uses. One is enhancing search across images, videos, audio files, and text by querying with any combination of text, audio, and images. It could also enable robots to learn in a comprehensive, human-like manner, improve content recognition and moderation, and support creative design by making the creation of rich media more seamless.
More and more companies are now developing artificial intelligence models that are more human-centered. ImageBind is one such development, and it may eventually incorporate new modalities such as touch, speech, scent, and brain signals. It should help researchers investigate new methods of evaluating vision models, which could ultimately result in novel applications for multimodal learning.
ImageBind may be downloaded for free. Make-A-Scene, an AI application developed by Meta that currently generates images from text prompts, could use ImageBind to create images from audio instead.