Describe: The spoken language in a multimodal context: