Learning deep multimodal affective features for spontaneous speech emotion recognition