Movie Dialogue Corpus

The annotation data are distributed here.
Additional information is also provided on this page.

This is the support page for our film dialogue corpus. Please read the details on corpus construction and cite the following paper when using the dataset. This page contains links to the annotation data, a script to download public domain movie files, and example dialogue videos extracted from the films. The differences in precision across genres are given as a table in the poster PDF file. Improvements to the corpus made after publication are described, and the newer version of the corpus is provided here.

Examples

These are example dialogue video segments extracted from one movie file in the corpus.
The source movie is "Deep Red (1975)".

1

IA_1001_3.0-4(4)

2

IA_1001_3.0-4(29)

3

IA_1001_3.0-4(35)

4

IA_1001_3.0-4(45)

5

IA_1001_3.0-4(67)

6

IA_1001_3.0-4(81)

Dataset

  • movie file list: 22genre_movlist.csv (for Linux and Mac). Windows users can download the list from the Google Form.
    Please make sure the file contains 1722 rows.
  • annotation files: movie_labels.zip
  • download script file: download_MovieDS.py (for Python3) or download_MovieDS_py2.py (for Python2)
    You need Python and wget installed, along with the Python module BeautifulSoup4.
    After downloading the above files, create a download directory (named Dir, for example) and run the script from the command line as follows.
     download_MovieDS.py movlist_part1.csv Dir 
  • Usage

    Note that the original video data are large (764.1GB) and the download takes time (three days in our case). Please do not run the download script over a mobile network or a shared network, such as at a conference venue or hotel. Movie files are in MP4 format. Please check that the size of each downloaded file is not zero. The program creates a "miss_downloads.log" file in the current directory. If the download completes without any problem, this file should be empty. The above script uses wget with the "--no-check-certificate" option. The option may be removed depending on your environment.
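The checks described above (row count of the movie list, non-zero file sizes, empty "miss_downloads.log") can be automated. The following is a minimal sketch and is not part of the distributed scripts. The function names are hypothetical, and it assumes that the 1722 rows are counted as non-empty CSV rows, that the list is readable as UTF-8, and that downloaded movies end in ".mp4" somewhere under the download directory.

```python
import csv
import os

# Row count stated on this page for the movie file list.
EXPECTED_ROWS = 1722


def count_csv_rows(csv_path):
    """Return the number of non-empty rows in the movie list CSV.

    Assumption: blank lines do not count as rows; whether a header
    line is included in the 1722 figure is not stated on this page.
    """
    with open(csv_path, newline="", encoding="utf-8", errors="replace") as f:
        return sum(1 for row in csv.reader(f) if row)


def find_empty_mp4(download_dir):
    """Return the sorted paths of zero-byte .mp4 files under download_dir."""
    empty = []
    for root, _dirs, files in os.walk(download_dir):
        for name in files:
            path = os.path.join(root, name)
            if name.lower().endswith(".mp4") and os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)


def log_is_empty(log_path="miss_downloads.log"):
    """Return True if the log file exists and has size zero.

    Raises FileNotFoundError if the log is missing, which suggests
    the download script was not run from this directory.
    """
    return os.path.getsize(log_path) == 0
```

For example, after a download into Dir, `count_csv_rows("22genre_movlist.csv") == EXPECTED_ROWS`, `find_empty_mp4("Dir") == []`, and `log_is_empty()` should all hold if everything went well. Any paths returned by `find_empty_mp4` are candidates for re-downloading.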

    Note

    The segmentation is done automatically, so the data contain some errors. Overall accuracy is about 90%, but it is lower for some movie genres (music and musical). Please see the poster files below, which were presented at ICMI. The first page contains a table of error rates based on sampling.
    Poster PDF

    Citation

    Please cite the following paper when you use the dataset.
    Yasuhara et al., "Large-Scale Multimodal Movie Dialogue Corpus" ICMI 2016

    Details, including BibTeX, are available at the ACM Digital Library.
    For the PDF file, please read the updated version instead of the published one above.

    Contact

    Inquiries and comments are welcome. Please email me (m.inoue-at-acm.org).