TED dataset
Description
The TED dataset contains all the audio-video recordings of the TED talks downloaded from the official TED website, www.ted.com, on April 27th 2012 (first version) and on September 10th 2012 (second version). No processing has been done on any of the metadata fields. The metadata was obtained by crawling the HTML source of the lists of talks and users, as well as the individual talk and user webpages, using scripts written by Nikolaos Pappas at the Idiap Research Institute, Martigny, Switzerland. The dataset is shared under the same Creative Commons license as the content of the TED talks; the license text is stored in the COPYRIGHT file.
The dataset is shared for research purposes, which are explained in detail in the papers listed below. It can be used to benchmark systems on two tasks: personalized recommendation and generic recommendation.
Please check the CBMI 2013 paper for a detailed description of each task.
- Nikolaos Pappas, Andrei Popescu-Belis, "Combining Content with User Preferences for TED Lecture Recommendation", 11th International Workshop on Content Based Multimedia Indexing, Veszprém, Hungary, IEEE, 2013
- Nikolaos Pappas, Andrei Popescu-Belis, "Sentiment Analysis of User Comments for One-Class Collaborative Filtering over TED Talks", 36th ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, ACM, 2013
If you use the TED dataset in your research, please cite one of the above papers: the first paper for the April 2012 version and the second paper for the September 2012 version of the dataset.
TED Website
The TED website is a popular online repository of audiovisual recordings of public lectures given by prominent speakers, distributed under a Creative Commons non-commercial license (see www.ted.com). The site provides extended metadata and user-contributed material. The speakers are scientists, writers, journalists, artists, and businesspeople from all over the world who are generally given a maximum of 18 minutes to present their ideas. The talks are given in English and are usually transcribed and then translated into several other languages by volunteer users. The quality of the talks has made TED one of the most popular online lecture repositories: each talk has been viewed almost 500,000 times on average.
Metadata
The dataset contains two main entry types: talks and users. The talks have the following data fields: identifier, title, description, speaker name, TED event at which they were given, transcript, publication date, filming date, number of views. Each talk has a variable number of user comments, organized in threads.
In addition, three fields were assigned by TED editorial staff: related tags, related themes, and related talks. Each talk generally has three related talks, and 95% of the talks have a high-quality transcript available. The dataset includes 1,149 talks from 960 speakers and 69,023 registered users who have made about 100,000 favorites and 200,000 comments.
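As a rough illustration of the two entry types described above, the records could be modelled as follows. This is only a sketch: the class and attribute names, the use of strings for dates, and the contents of the `User` record are assumptions made for the example, and the dataset's actual file format and field names may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Comment:
    """One user comment; nested replies model the thread structure (assumed layout)."""
    user_id: str
    text: str
    replies: List["Comment"] = field(default_factory=list)


@dataclass
class Talk:
    """Talk fields named in the description; attribute names are hypothetical."""
    identifier: str
    title: str
    description: str
    speaker_name: str
    event: str
    publication_date: str  # kept as raw text, since no processing is applied
    filming_date: str
    views: int
    transcript: Optional[str] = None  # not every talk has a transcript
    related_tags: List[str] = field(default_factory=list)
    related_themes: List[str] = field(default_factory=list)
    related_talks: List[str] = field(default_factory=list)
    comments: List[Comment] = field(default_factory=list)


@dataclass
class User:
    """Hypothetical user record holding the identifiers of favorited talks."""
    user_id: str
    favorites: List[str] = field(default_factory=list)


def count_comments(comments: List[Comment]) -> int:
    """Count every comment in a list of threads, including nested replies."""
    return sum(1 + count_comments(c.replies) for c in comments)
```

With records in this shape, the favorites of each `User` give the user-talk interactions typically needed for recommendation experiments, while `count_comments` walks the comment threads of a single talk.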