Framework

Our framework consisted of a dual autoencoder model that was trained to encode inputs from both modalities into a shared latent space and decode them back to construct instances of either modality.

Input Type	Flattened Force Profile Features	Binary Phrase Vector	SBERT Embedding Phrase Vector
Input Vector	769×1	62×1	150×1
Encoder Layer 1	768×512	62×512	769×512
Encoder Activation 1	ReLU + Dropout	ReLU + Dropout	ReLU + Dropout
Encoder Layer 2	512×128	512×128	512×128
Encoder Activation 2	ReLU + Dropout	ReLU + Dropout	ReLU + Dropout
Encoder Layer 3	128×16	128×16	128×16
Latent Space Vector	16×1	16×1	16×1
Decoder Layer 1	16×128	16×128	16×128
Decoder Activation 1	ReLU	ReLU	ReLU
Decoder Layer 2	128×512	128×512	128×512
Decoder Activation 2	ReLU	ReLU	ReLU
Decoder Layer 3	512×769	512×62	512×150
Output Vector	768×1	62×1	768×1

Detailed breakdown of the dimensions of the inputs, layers, and outputs comprising each autoencoder responsible for encoding and decoding its assigned input type. Since we aim to develop a shared latent space for force and language, the latent space vector has the same dimensions across all autoencoders.

Hyperparameters

Dropout chance: 10%

Number of training epochs: 1024

Adam learning rate: 0.001

Force profiles were augmented with per sample random noise residuals with mean 0 and variance 1