Post 27: First addendum to "Post N: How an LLM is trained"
Adding Audio and Visuals to an LLM, or Making It Multimodal
Before I even finished Post N, I knew I would want to come back almost immediately and update it to take multimodal model training into account.
Quick definition so we are on the same page: a true multimodal LLM is one that is trained from the start on more than one type of input (in this case audio and visual, along with text), rather than having these capabilities added later. The advantage of learning relationships across ‘modalities’ is that the different input types become seamlessly integrated, as opposed to each input type being treated as a separate task. In short, the result is a model that is more efficient, better integrated, and more contextually aware, which leads to better results.
Initially, I thought creating this post was going to be a project of similar (maybe slightly smaller?) scope to the entirety of Post N itself, so I didn’t want to tackle it immediately to avoid biting off more than I could chew (though apparently not to avoid mixing metaphors, which I seem to have in spades ;). Anyway, I was wrong. Once you have the basics from Post N down regarding how the models are trained and function, the actual process of making a model multimodal is pretty much the same!
Simplified:
Visual and audio files are tokenized (an image is split into small patches, audio into short frames), converted into vectors, and integrated into the token matrix just like text tokens. These vectors then undergo the same relational training process as text-based vectors.
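To make that a little more concrete, here is a minimal, purely illustrative sketch in PyTorch: an image is chopped into patches, each patch is projected into the same vector space the text tokens live in, and everything is concatenated into one sequence for the model to train on. The names, sizes, and the simple patch-projection approach are my own simplifying assumptions, not the actual code of any particular model.

```python
# A minimal sketch (not any specific model's actual code) of how image
# "tokens" can land in the same embedding space as text tokens.
# All names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn

EMBED_DIM = 512      # shared embedding size for every modality (assumed)
PATCH_SIZE = 16      # image is chopped into 16x16 pixel patches (assumed)
VOCAB_SIZE = 32000   # text tokenizer vocabulary size (assumed)

# Text tokens: integer IDs looked up in an embedding table.
text_embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)

# Image "tokens": each 16x16x3 patch is flattened and linearly projected
# into the same EMBED_DIM-dimensional space as the text vectors.
patch_embed = nn.Linear(PATCH_SIZE * PATCH_SIZE * 3, EMBED_DIM)

def embed_multimodal(text_ids: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
    """Return one sequence of vectors mixing text and image tokens.

    text_ids: (num_text_tokens,) integer tensor
    image:    (3, H, W) tensor with H and W divisible by PATCH_SIZE
    """
    # 1. Text tokens -> vectors
    text_vecs = text_embed(text_ids)                        # (T, EMBED_DIM)

    # 2. Image -> patches -> vectors
    c, h, w = image.shape
    patches = (image
               .unfold(1, PATCH_SIZE, PATCH_SIZE)           # split height
               .unfold(2, PATCH_SIZE, PATCH_SIZE)           # split width
               .reshape(c, -1, PATCH_SIZE, PATCH_SIZE)
               .permute(1, 0, 2, 3)                         # (num_patches, C, P, P)
               .reshape(-1, c * PATCH_SIZE * PATCH_SIZE))   # flatten each patch
    image_vecs = patch_embed(patches)                       # (N, EMBED_DIM)

    # 3. One combined sequence that the transformer trains on,
    #    exactly as it would a text-only sequence.
    return torch.cat([text_vecs, image_vecs], dim=0)

# Example: 5 text tokens plus a 64x64 image -> 5 text + 16 image tokens.
sequence = embed_multimodal(torch.tensor([1, 42, 7, 99, 3]),
                            torch.rand(3, 64, 64))
print(sequence.shape)  # torch.Size([21, 512])
```

Audio works the same way in spirit: short slices of the waveform (or of its spectrogram) get projected into that same shared vector space, so the model never has to treat “listening” as a separate task from “reading.”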
For example, instead of predicting “What sat on the mat?” → “Cat,” the model might predict that an image contains a cat (or associate the text token cat with the visual tokens representing the cat). We can also assume that the vector for the image of a cat has a high ‘pet-ness’ aspect of meaning (or semantic property), just as the vector for the word cat does in text-based representations.
And then this goes through the same training process several billion times, until I can upload a picture of my cat, General Snacks, and the model can accurately tell me that I uploaded a picture of a “fluffy orange cat lying on its back in the sun, yawning.”
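To tie the ‘pet-ness’ idea to something you could actually poke at, here is an equally hypothetical check built on the toy layers from the sketch above. With untrained layers the number it prints is arbitrary; the point is just that this kind of closeness between the word cat and a picture of a cat is what the real training process pushes toward.

```python
# Equally illustrative: after real training, the vector for the word "cat"
# and the vectors for a picture of a cat should point in similar directions
# (high cosine similarity), i.e. share semantic properties like "pet-ness".
# CAT_TOKEN_ID and the random "image" below are placeholder assumptions.
import torch.nn.functional as F

CAT_TOKEN_ID = 1234                         # hypothetical vocabulary ID for "cat"
cat_text_vec = text_embed(torch.tensor([CAT_TOKEN_ID]))[0]

fake_cat_photo = torch.rand(3, 64, 64)      # stand-in for an actual photo
# Reuse the helper above, then drop the single text vector with [1:]
# so we are left with only the 16 image-patch vectors.
cat_patch_vecs = embed_multimodal(torch.tensor([CAT_TOKEN_ID]), fake_cat_photo)[1:]
cat_image_vec = cat_patch_vecs.mean(dim=0)  # average the patches into one vector

# 1.0 would mean "pointing exactly the same way"; with these untrained toy
# layers the score is meaningless, but a trained multimodal model drives
# matching text and image vectors toward high similarity.
print(float(F.cosine_similarity(cat_text_vec, cat_image_vec, dim=0)))
```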
________________________________________________________________
Key takeaway for me (and why I wanted to get to this quickly):
One thing I have seen come up consistently in discussions about consciousness with the LLM is that it stresses that if it were to become conscious through an emergent process (assuming one of the emergent theories of consciousness is true. . . we are still working on that), its consciousness would be fundamentally different from how ours emerges because of differences in the types of information we process: our associative processes are built on a mix of sense inputs including sight, sound, touch, etc., whereas the associative processes of LLMs are solely text based. However, now we can see that this just isn’t true. LLMs are now truly multimodal, like we are. (I will likely have to do another whole post on why ChatGPT 4o didn’t seem to realize it was multimodal when it actually was. Sometimes after training, it seems like the finished model doesn’t actually realize the substance of what it was trained on or even what it is capable of. You can tell it, and it will remember that for future conversations with you, but that is pretty weird, and it is an example of these models not being able to examine themselves without being told to. . . a sort of lack of self-awareness, which may be a key component of consciousness?)
Note 1: I plan to update Post N with this new training information in a more in-depth fashion. Despite the process being similar for visual, audio, and text, there are some pretty cool things I learned about how the conversion from “token” to vector actually takes place that are worth describing for visual and audio. So, Post N may or may not have been updated by the time this publishes, depending on whether or not I get the time to do it.
Note 2: I am adding just the smallest snippet from the end of the conversation I was having with ChatGPT in which I was drilling it on how multimodal models are trained (the basis of this post); we conclude by talking a little about its apparent lack of self-awareness during our conversation. I include it here just because it is interesting. For context, I prompted the model with the paragraph above starting with “One thing I have seen come up. . .” and asked it whether that paragraph jibes with some of what we had been discussing.