If you're mapping from a higher-dimensional input to a lower-dimensional one, you're almost always going to lose information. The question is how to decide which information you want to keep.
All of this is highly domain-specific.
I would start with a single modality: how should you compress it into a reasonably sized 1D vector? Look for past work that deals with data of a similar shape. For example, video classification also has to convert a series of images (n x c x w x h) into a 1D vector that gets fed into a linear classifier. This textbook chapter seems particularly useful. One architecture it proposes is a 3D CNN.
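To make the shapes concrete, here is a minimal PyTorch sketch of that idea: a small 3D CNN that turns a clip into a fixed-size 1D vector, followed by a linear classifier. It is not the architecture from the chapter. `VideoEncoder`, the layer widths, the 128-dimensional embedding, and the 10 classes are all made-up choices for illustration.

```python
import torch
import torch.nn as nn


class VideoEncoder(nn.Module):
    """Hypothetical 3D-CNN encoder: one clip -> one fixed-size 1D embedding."""

    def __init__(self, in_channels: int = 3, embed_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=2),   # halves frames and both spatial dims
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),       # averages over frames and space
        )
        self.proj = nn.Linear(32, embed_dim)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, n, c, w, h), i.e. a batch of n-frame clips.
        # nn.Conv3d expects channels before the frame axis, so swap them.
        x = clip.permute(0, 2, 1, 3, 4)    # (batch, c, n, w, h)
        x = self.features(x).flatten(1)    # (batch, 32)
        return self.proj(x)                # (batch, embed_dim)


encoder = VideoEncoder()
classifier = nn.Linear(128, 10)            # 10 classes is an arbitrary choice

clip = torch.randn(2, 8, 3, 32, 32)        # 2 clips, 8 RGB frames of 32x32
embedding = encoder(clip)
logits = classifier(embedding)
print(embedding.shape, logits.shape)
```

The global average pool is where most of the information gets thrown away: whatever the convolutions have not summarized into the 32 channels by that point is gone, so that is the part worth tuning for your data.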
To incorporate multiple modalities, you'll probably have to decide when to concatenate your data. E.g., you could encode each modality separately and concatenate the resulting vectors at the very end (often called late fusion), or concatenate the inputs at the very beginning and encode them jointly (early fusion). You should also look at past work for this (e.g., video + audio classification).
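The two options above could look roughly like this in PyTorch. This is a sketch under simplifying assumptions: each modality has already been flattened to a feature vector, the encoders are single linear layers, and the names `LateFusion` and `EarlyFusion` plus all sizes (512 video features, 40 audio features, 10 classes) are hypothetical.

```python
import torch
import torch.nn as nn


class LateFusion(nn.Module):
    """Encode each modality separately, concatenate the embeddings at the end."""

    def __init__(self, video_dim, audio_dim, embed_dim=64, num_classes=10):
        super().__init__()
        self.video_enc = nn.Sequential(nn.Linear(video_dim, embed_dim), nn.ReLU())
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, embed_dim), nn.ReLU())
        self.head = nn.Linear(2 * embed_dim, num_classes)

    def forward(self, video, audio):
        fused = torch.cat([self.video_enc(video), self.audio_enc(audio)], dim=1)
        return self.head(fused)


class EarlyFusion(nn.Module):
    """Concatenate the inputs first, then encode them with one shared encoder."""

    def __init__(self, video_dim, audio_dim, embed_dim=64, num_classes=10):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(video_dim + audio_dim, embed_dim), nn.ReLU())
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, video, audio):
        return self.head(self.enc(torch.cat([video, audio], dim=1)))


video = torch.randn(4, 512)   # batch of 4 flattened video feature vectors
audio = torch.randn(4, 40)    # batch of 4 audio feature vectors
late_logits = LateFusion(512, 40)(video, audio)
early_logits = EarlyFusion(512, 40)(video, audio)
print(late_logits.shape, early_logits.shape)
```

Late fusion lets you pick a suitable encoder per modality (and pretrain each one on its own), while early fusion lets the model learn cross-modal interactions from the first layer but needs the raw inputs to be concatenable in the first place.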