$\begingroup$

I am new to AI and am experimenting with some computer vision algorithms.

I have three tensors of different sizes: a noise augmentation level tensor of size (N, C, H, W), a diffusion timestep tensor of size (N, H), and pooled pose embeddings of size (N, C, H, W). I need to sum these tensors so that the resulting 1D embedding can be fed to a FiLM layer.

How can I perform the summation without losing data?

Thank you!

$\endgroup$

    1 Answer

    $\begingroup$

    If you're mapping from a higher-dimensional representation to a lower-dimensional one, you will almost always lose data. The question is which data you want to keep.

    All of this is highly domain-specific.

    I would start with one modality: how should you compress it into a reasonably sized 1D vector? Look for past work that deals with data of similar dimensions. For example, video classification also has to convert a series of images (n x c x w x h) into a 1D vector that is fed to a linear classifier. This textbook chapter seems particularly useful. One architecture they propose is a 3D CNN.
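As a much simpler baseline than a 3D CNN, one common way to compress a single (N, C, H, W) tensor into one vector per sample is global average pooling over the spatial axes. The sketch below uses NumPy; the tensor sizes and the helper name `global_average_pool` are assumptions made for illustration, not something taken from the question. Averaging discards the spatial layout, which is exactly the kind of choice about what to keep that is described above.

```python
import numpy as np


def global_average_pool(x):
    """Collapse the spatial axes of an (N, C, H, W) array, returning (N, C)."""
    return x.mean(axis=(2, 3))


rng = np.random.default_rng(0)

# Hypothetical sizes: N=2 samples, C=8 channels, H=W=4.
pose = rng.standard_normal((2, 8, 4, 4))

pooled = global_average_pool(pose)
print(pooled.shape)
```

In PyTorch the same reduction could be written as `x.mean(dim=(2, 3))` or with `nn.AdaptiveAvgPool2d(1)` followed by a flatten.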

    To incorporate multiple modalities, you'll probably have to decide when to combine your data. For example, you could encode each modality separately and concatenate at the very end, or concatenate at the very beginning. Look at past work for this as well (e.g., video + audio classification).
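A minimal sketch of the "encode separately, combine at the end" option, using the shapes from the question. Everything here is an assumption for illustration: the sizes, the shared width `D`, the helper names, and the random matrices that stand in for learned linear layers. Each modality is reduced to an (N, D) vector, after which the three vectors can be summed element-wise or concatenated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; D is the shared embedding width.
N, C, H, W, D = 2, 8, 4, 4, 16


def linear(in_dim, out_dim):
    """Random weights standing in for a learned linear layer."""
    weight = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)
    return lambda x: x @ weight


def encode_spatial(x, proj):
    """Average an (N, C, H, W) array over H and W, then project (N, C) to (N, D)."""
    return proj(x.mean(axis=(2, 3)))


noise_level = rng.standard_normal((N, C, H, W))
timestep = rng.standard_normal((N, H))
pose = rng.standard_normal((N, C, H, W))

proj_noise, proj_time, proj_pose = linear(C, D), linear(H, D), linear(C, D)

e_noise = encode_spatial(noise_level, proj_noise)  # (N, D)
e_time = proj_time(timestep)                       # (N, D)
e_pose = encode_spatial(pose, proj_pose)           # (N, D)

# Late fusion by summation: shapes match, so the sum is element-wise.
fused_sum = e_noise + e_time + e_pose

# Late fusion by concatenation: keeps the modalities in separate slots.
fused_cat = np.concatenate([e_noise, e_time, e_pose], axis=1)

print(fused_sum.shape, fused_cat.shape)
```

Summation keeps the vector at width `D`, which is convenient if a downstream layer such as FiLM expects a fixed-size conditioning vector, while concatenation gives a vector of width `3 * D` and leaves the mixing of modalities to the next layer.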

    $\endgroup$
