Recent artificial intelligence (AI) deep learning models have demonstrated their power in detecting intricate features in medical images, such as X-rays, CT and MRI scans, pathology slides, and retinal photos, features that are most often imperceptible to the human eye. For instance, a retinal scan reveals crucial physiological information hidden from even a sharp-eyed doctor. Machines can decipher this information, shedding light on blood pressure levels, glucose status, risks of various diseases, and the likelihood of heart attacks and strokes. Moreover, this diagnostic power extends to electrocardiogram (ECG) interpretation, which can provide insights into an individual’s age, sex, anemia, diabetes risk, and heart function. So, it’s not just retinas that carry obscure information.
To take full advantage of this sort of unexpected diagnostic power, a significant technology shift lies ahead as AI evolves away from single-modality tasks such as text or speech alone. Multimodal AI is paving the way for advanced systems capable of processing all input types, laying the foundation for far more sophisticated applications. Multimodal AI is a recent paradigm in which various input and output data types (image, text, speech, video, numerical data, processes, and more) are combined with multiple processing algorithms to produce more complete and useful information.
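To make the idea concrete, here is a minimal Python sketch of one common way to combine modalities, often called late fusion: each data type gets its own small encoder, and the resulting embeddings are merged before a prediction is made. The module names, dimensions, and toy data below are assumptions for illustration, not details from the original article.

```python
# A minimal late-fusion sketch: modality-specific encoders project their inputs
# into a shared embedding space, and a small head combines them.
# All dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn


class SimpleMultimodalClassifier(nn.Module):
    def __init__(self, image_dim=512, text_dim=768, lab_dim=20, hidden=256, n_classes=2):
        super().__init__()
        # One encoder per modality: image features, text-report embedding, lab values.
        self.image_enc = nn.Sequential(nn.Linear(image_dim, hidden), nn.ReLU())
        self.text_enc = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        self.lab_enc = nn.Sequential(nn.Linear(lab_dim, hidden), nn.ReLU())
        # Fusion head: concatenate the per-modality embeddings and classify.
        self.head = nn.Linear(3 * hidden, n_classes)

    def forward(self, image_feats, text_feats, lab_values):
        fused = torch.cat(
            [self.image_enc(image_feats), self.text_enc(text_feats), self.lab_enc(lab_values)],
            dim=-1,
        )
        return self.head(fused)


# Toy usage with random tensors standing in for real patient data.
model = SimpleMultimodalClassifier()
logits = model(torch.randn(4, 512), torch.randn(4, 768), torch.randn(4, 20))
print(logits.shape)  # torch.Size([4, 2])
```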
Older, monomodal modeling has relied mainly on supervised learning, which requires painstaking annotation of inputs and outputs. In contrast, multimodal AI uses rapid self-supervised and unsupervised learning approaches, eliminating the need for laborious and often impractical data annotation.
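As an illustration of the self-supervised idea, the sketch below trains a small network to reconstruct deliberately hidden portions of unlabeled data, so the data itself supplies the training signal. Every detail here is a simplified assumption rather than a description of any particular multimodal system.

```python
# Self-supervised learning in miniature: mask a random slice of each unlabeled
# input vector and train a network to reconstruct the hidden values.
# The "labels" are the data itself, so no manual annotation is required.
import torch
import torch.nn as nn

torch.manual_seed(0)

data = torch.randn(256, 32)          # unlabeled records, 32 features each
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 32))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    mask = (torch.rand_like(data) < 0.25).float()  # hide ~25% of the features
    corrupted = data * (1 - mask)                  # zero out the masked entries
    recon = model(corrupted)
    # Loss is measured only on the masked positions the model has to "fill in".
    loss = ((recon - data) ** 2 * mask).sum() / mask.sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```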
A breakthrough came with the introduction of the transformer architecture in 2017. This led to the first multimodal systems, such as GPT-4, which can work with various input and output data formats, including text, audio, speech, images, video, and processes.
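For the curious, the heart of that 2017 transformer architecture is scaled dot-product attention, shown in the stripped-down sketch below; the tensor shapes are arbitrary examples chosen for illustration.

```python
# Scaled dot-product attention, the core operation of the transformer.
import math
import torch


def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # how strongly each token attends to every other
    weights = scores.softmax(dim=-1)                          # attention weights sum to 1 per token
    return weights @ v                                        # weighted mix of the value vectors


out = scaled_dot_product_attention(torch.randn(1, 5, 64), torch.randn(1, 5, 64), torch.randn(1, 5, 64))
print(out.shape)  # torch.Size([1, 5, 64])
```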
However, these multimodal models require substantial resources: over 1 trillion model parameters, some 24,000 graphics processing units, and massive data sets. Notably, the data used to train GPT-4 and similar models were drawn from sources like Wikipedia, the broader internet, and numerous books, covering a wide range of subjects rather than specific bodies of knowledge, such as medical or legal data. These infant multimodal systems will need fine-tuning with supervised learning before they are ready for prime time in critical applications. As multimodal AI evolves, its name will become a misnomer, because its abilities will extend beyond generating content to generating functions and procedures, such as text editing.
The vast capabilities of these models become even more relevant when applied to medicine. Transformer models have begun to enable multimodal AI in medicine, allowing real-time analysis of an individual’s extensive and varied data. Inputs and outputs such as images, physiological biomarkers, genetic information, environmental conditions, social determinants, and comprehensive medical knowledge all contribute to each person’s health profile, and should be available as input to, or output from, an approved multimodal AI.
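One way to picture such a profile is as a structured record that bundles these layers together. The field names below are hypothetical, chosen only to suggest the kinds of data involved, not to reflect any real system described in the article.

```python
# A hypothetical record type sketching a layered, multimodal patient profile.
from dataclasses import dataclass, field


@dataclass
class PatientProfile:
    patient_id: str
    images: dict[str, str] = field(default_factory=dict)         # e.g. {"retina": "scan_001.png"}
    biomarkers: dict[str, float] = field(default_factory=dict)   # e.g. {"hba1c": 5.6}
    genomics: list[str] = field(default_factory=list)            # variant identifiers
    environment: dict[str, float] = field(default_factory=dict)  # air quality, exposures, etc.
    social_determinants: dict[str, str] = field(default_factory=dict)
    clinical_notes: list[str] = field(default_factory=list)


profile = PatientProfile(
    patient_id="anon-0001",
    biomarkers={"systolic_bp": 128.0, "hba1c": 5.6},
    clinical_notes=["Routine retinal photograph reviewed; no retinopathy noted."],
)
```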
By leveraging these data layers, multimodal AIs can support a wide range of data-driven applications. For individuals at risk of chronic conditions, a multimodal AI virtual health assistant can offer continuous personalized feedback to manage their health effectively. Remote monitoring also becomes more viable with multimodal data, enabling a “hospital-at-home” setting akin to an intensive care unit.
Multimodal AI will make numerous other applications possible, including the creation of “digital twins” to aid in prototyping treatment decisions, or personalized pandemic surveillance systems that assess individual risks based on spatiotemporal factors and varying data inputs.
While early applications of large language models (LLMs) in healthcare have garnered attention, achieving the full potential of multimodal LLMs poses significant challenges. These barriers include addressing the overconfidence and biases exhibited by the models, ensuring data privacy and security, navigating regulatory approval processes, overcoming resistance to change, and establishing solid empirical evidence of the benefits versus the costs. However, judging by existing multimodal systems, such as GPT-4, many of those concerns are being proactively addressed.
The whytry.ai article you just read is a brief synopsis; the original article can be found here: Read the Full Article…