Processing multiple heavy modalities (images and audio) together on the device avoids long upload times and keeps the interaction responsive.
This demo requires an on-device language model that supports both vision and audio processing at the same time.
Provide an image and an audio question to begin.