Describir: Multimodal Video Characterization and Summarization