E-MMAD, the E-commercial Multimodal Advertising Dataset, is one of the largest video captioning datasets. It contains 100,000+ valid examples carefully selected from 1,000,000+ real product examples in both Chinese and English.


We present E-MMAD, an e-commercial multimodal advertising dataset containing 120 thousand valid examples carefully selected from 1.3 million real product examples in both Chinese and English. The data comes from Taobao (www.taobao.com), China's largest e-commerce shopping platform. Notably, it is one of the largest video captioning datasets in this field. Each example includes a product video (around 30 seconds), a title, a caption, and a structured information table, which we observe to play a vital role in practice.
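As a rough illustration of the four components listed above, a single example could be modeled as follows. This is only a sketch: the class name, field names, file path, and the assumption that the structured information table is a set of attribute-value pairs are all hypothetical, and the released data format may differ.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class EMMADExample:
    """Hypothetical record layout; not the official E-MMAD schema."""
    video_path: str   # product video, around 30 seconds (path is a placeholder)
    title: str        # product title
    caption: str      # advertising caption
    # Structured information table, assumed here to map attribute -> value
    structured_info: Dict[str, str] = field(default_factory=dict)


def flatten_structured_info(example: EMMADExample) -> str:
    """Serialize the structured information table into one string."""
    return "; ".join(f"{k}: {v}" for k, v in example.structured_info.items())


# Placeholder values for illustration only, not taken from the dataset
example = EMMADExample(
    video_path="videos/000001.mp4",
    title="Placeholder product title",
    caption="Placeholder advertising caption",
    structured_info={"material": "cotton", "color": "white"},
)
print(flatten_structured_info(example))
```

Flattening the table into text is one simple way to feed it to a captioning model alongside the video and title; other encodings are equally plausible.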

100,000+

high-level multimodal data

4,000+

e-commercial product categories across various fields

300,000+

structured information words


Real-world advertising descriptions are vivid and varied, so we collected a large-scale, high-quality, and reliable e-commercial multimodal advertising dataset. It is one of the largest video captioning datasets in this field. E-MMAD is collected entirely from real-life data and carefully selected so that it meets real-world needs.


The E-MMAD dataset is available to download for non-commercial purposes under a Creative Commons Attribution 4.0 International License.


We will release the full data before the conference starts.


This work was created in response to real needs. We thank our data labeling teams and others for their hard work and suggestions on this work.