New ‘multimodal’ AIs (meaning they handle multiple types of input and output, such as text, images, audio and video) have advanced beyond simply responding to text and can now analyze images and engage in spoken conversation. OpenAI released a multimodal version of its ChatGPT software, powered by the LLM GPT-4, and Google and Meta have incorporated image and audio features into their chatbot models as well. These multimodal AIs can perform a variety of tasks, such as accurately splitting a bar tab from a photo of a receipt or providing detailed descriptions of images.
Multimodal AIs combine language-based neural networks with AI algorithms designed specifically for image or audio analysis, either by stacking the systems or by integrating their code more tightly. Although the exact inner workings of these models are undisclosed, they rely on transformers, a type of neural network architecture that converts inputs into vector data, enabling more humanlike interaction.
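To make the "inputs become vectors" idea concrete, here is a minimal sketch using the open-source CLIP model via the Hugging Face transformers library. It is only an illustration of the general technique (mapping an image and text into a shared vector space and comparing them), not the undisclosed pipeline behind GPT-4, Google's, or Meta's models; the file name and captions are hypothetical.

```python
# Sketch: turning text and an image into comparable vectors with CLIP.
# Illustrates the general idea (inputs -> shared embedding space), not the
# proprietary internals of GPT-4 or other commercial multimodal models.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("receipt.jpg")  # hypothetical input photo
texts = ["a restaurant receipt", "a cat", "a road map"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# The image and each caption are now vectors in the same embedding space;
# a higher score means the caption better matches the image.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(texts, probs[0].tolist()):
    print(f"{caption}: {p:.2f}")
```

Chatbot-style multimodal systems go further by feeding image-derived vectors into a language model so it can reason and converse about them, but the shared vector representation sketched above is the common starting point.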
The whytry.ai article you just read is a brief synopsis; the original article can be found here: Read the Full Article…