Audio archives of cultural heritage are an important means of preserving people's collective memories. Such initiatives have flourished in several countries and aim at preserving the local spoken audio heritage. These collections are constantly growing and demand considerable effort in annotation and categorization. Automatically structuring and indexing spoken audio archives is an active research field, focused on well-defined types of data such as broadcast recordings, voice-mail, lectures or meetings. However, spoken data from the cultural heritage is considerably more heterogeneous in content than conventional broadcast recordings or meetings. Besides conventional presentations, interviews, testimonies and debates, it also contains folkloric literary productions, i.e., recitals, poems and theatrical representations (see e.g. http://xml.memovs.ch/s024-55-146.xml), which represent an important part of those archives as well as a fundamental element of the local cultural heritage. Recitals, poems and theatre differ significantly in structure and stylistic features from presentations, interviews and story-telling, and none of the current spoken data processing methods can detect and classify them or distinguish them from other story types, which limits the automatic indexing of such data.
The proposed project (SESAME) aims at advancing the state of the art in speech processing for automatically detecting and classifying folkloric literary productions uttered by actors in spoken archives of the cultural heritage. SESAME is organized in two research tracks:
- The automatic identification of actors in a given recording. This involves two subtasks: the detection of speaker time boundaries and the classification of each speaker into a category (or role), i.e., actor, journalist or guest.
- The automatic identification of literary productions in a given recording. This involves two subtasks: the detection of the story boundaries (i.e., semantically uniform segments) and the classification of each story into a category, i.e., presentation, interview, debate, story-telling, recital, poetry or theatrical representation.
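As an illustration only, the output of the two tracks above could be represented as time-stamped, labeled segments. The following minimal sketch assumes such a representation; the `Segment` type, its field names and the helper `actor_time_in_literary_stories` are hypothetical and are not part of the proposal, which does not specify an annotation format. The label sets are the ones listed in the two tracks.

```python
from dataclasses import dataclass

# Label sets taken from the two research tracks described above.
SPEAKER_ROLES = {"actor", "journalist", "guest"}
STORY_CATEGORIES = {
    "presentation", "interview", "debate", "story-telling",
    "recital", "poetry", "theatrical representation",
}
# Story categories the text describes as folkloric literary productions.
LITERARY_CATEGORIES = {"recital", "poetry", "theatrical representation"}


@dataclass(frozen=True)
class Segment:
    """Hypothetical annotation unit: a time span (in seconds) with a label."""
    start: float
    end: float
    label: str


def overlap(a: Segment, b: Segment) -> float:
    """Duration in seconds shared by two segments (0.0 if disjoint)."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def actor_time_in_literary_stories(speaker_segments, story_segments) -> float:
    """Total time spoken by actors inside literary story segments."""
    total = 0.0
    for story in story_segments:
        if story.label not in LITERARY_CATEGORIES:
            continue
        for speaker in speaker_segments:
            if speaker.label == "actor":
                total += overlap(speaker, story)
    return total


# Made-up example recording of 120 seconds (not real archive data).
speakers = [
    Segment(0.0, 30.0, "journalist"),
    Segment(30.0, 90.0, "actor"),
    Segment(90.0, 120.0, "guest"),
]
stories = [
    Segment(0.0, 40.0, "presentation"),
    Segment(40.0, 100.0, "recital"),
    Segment(100.0, 120.0, "interview"),
]
print(actor_time_in_literary_stories(speakers, stories))
```

Keeping speaker segments and story segments as two independent layers reflects the fact that speaker boundaries and story boundaries need not coincide: in the made-up example, the actor starts speaking before the recital begins.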
The proposed methodology is based on the investigation of statistical classifiers trained on in-domain data and of structural and stylistic features, both speaker-related and story-related. We plan to perform the entire research on selected data from available collections of the Swiss audio cultural heritage, e.g., http://xml.memovs.ch/s024-.xml, which will be manually annotated for the previously mentioned tasks. We ask the Hasler Stiftung to fund a dedicated PhD student to perform the research (3 years, with the possibility of extending to a fourth year).
The importance of SESAME is threefold. First, it will define a data set for the scientific experiments, providing a common setup on which further research projects around the Swiss audio cultural heritage could be developed. Second, from a fundamental research point of view, SESAME will attempt to characterize, in terms of both structure and style, spoken content that is largely available yet neglected, such as folkloric literature, and to separate it from conventional presentations, interviews or stories. Results will be evaluated according to standard metrics in the area of speech processing. Third, from a training point of view, SESAME will introduce a PhD student to state-of-the-art speech and audio processing.