Action recognition is key for many tasks such as the automatic annotation of videos, improved human-computer interaction, and guidance in monitoring public spaces. As the amount of available video from different sources (from raw personal videos to more professional content) has dramatically increased in recent years, new approaches are needed to organize these data.
Most recent state-of-the-art techniques for action recognition in naturalistic and unconstrained video documents, such as movies or broadcast data, rely on Bag-of-Words (BoW) representations built from spatio-temporal interest point descriptors and collected over long video segments (a minimal sketch of this representation follows the list below). Such methods, however, often suffer from two severe and related drawbacks:
• temporal information is discarded, although actions are often characterized by strong temporal components;
• activities occurring in the same video segment are mixed together in the representation, which hampers the recognition algorithms built on top of it.
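As a rough illustration of the representation discussed above (a minimal sketch, not our actual pipeline: the data, vocabulary size, and variable names are placeholders, and a real system would compute HOG/HOF descriptors at detected STIPs), the snippet below learns a visual vocabulary by clustering descriptors and pools all words of a long segment into one histogram, which makes both drawbacks explicit:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-ins for real STIP descriptors (e.g. HOG/HOF vectors),
# one per detected interest point, plus the frame at which each was found.
rng = np.random.default_rng(0)
descriptors = rng.random((5000, 162))
frame_idx = rng.integers(0, 3000, size=5000)

# 1. Learn a visual vocabulary by clustering the descriptors.
vocab_size = 500
kmeans = KMeans(n_clusters=vocab_size, n_init=4, random_state=0).fit(descriptors)

# 2. Assign every descriptor to its nearest visual word.
words = kmeans.predict(descriptors)

# 3. Pool all words of the segment into a single Bag-of-Words histogram.
#    frame_idx is never used: the temporal order is discarded, and all
#    activities present in the segment end up mixed in one histogram.
bow = np.bincount(words, minlength=vocab_size).astype(float)
bow /= bow.sum()
```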
To address these issues, we will investigate novel techniques relying on principled probabilistic models (so-called topic models) and symbolic pattern mining to capture the information lying in the temporal relationships between recognized "action" units, in order to enhance the performance of action recognition algorithms. To this end we will rely on, and greatly extend, our previous work on the automatic extraction of temporal motifs from word × time documents as a basis for investigating video-based action recognition. This method, applied to large amounts of surveillance data, captures not only the co-occurrence between words but also the order in which they occur, and can handle interleaved activities.
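To make the notion of a word × time document concrete, the sketch below (again with hypothetical stand-in data; the temporal bin size is an arbitrary choice, and only the construction of the input is shown, not the motif extraction model itself) accumulates quantized visual words into a matrix indexed by word and time bin, which, unlike the time-pooled histogram above, preserves the order in which words occur:

```python
import numpy as np

# Hypothetical stand-ins: one visual-word id per interest point (after
# quantization as in the previous sketch) and the frame at which it occurred.
vocab_size, n_frames = 500, 3000
rng = np.random.default_rng(0)
words = rng.integers(0, vocab_size, size=5000)
frame_idx = rng.integers(0, n_frames, size=5000)

bin_size = 15                             # temporal bin length in frames
n_bins = -(-n_frames // bin_size)         # ceiling division
temporal_doc = np.zeros((vocab_size, n_bins))
for w, f in zip(words, frame_idx):
    temporal_doc[w, f // bin_size] += 1

# temporal_doc[w, t] counts how often visual word w occurs in time bin t.
# A temporal motif model operates on this word x time matrix, so word
# co-occurrence and ordering remain available, in contrast to the BoW histogram.
```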
Investigated techniques will be focused around three main axes.

1. Motif representation will address the development of models with a hierarchical structure, allowing the identification of recurring sequences of lower-level temporal motifs, and will improve the robustness of the motif representation so that it can cope with the usually small amount of data available in supervised action classification.

2. Action recognition in unconstrained video documents will investigate the recognition of actions in videos using motifs extracted from spatio-temporal interest point (STIP) descriptor BoW representations, leveraging our modeling to identify meaningful and interleaved temporal patterns with a longer temporal support than that of STIPs, and addressing the corresponding challenges (generative vs. discriminative modeling, vocabulary size, complexity).
3. Joint temporal and spatial action learning and recognition will address the learning of action motifs while jointly inferring where these motifs occur in the images, in addition to when they occur as currently performed by our model, allowing us to address weakly supervised action recognition tasks.

Evaluation on standard human action, movie, and sports databases from the literature will be conducted to assess the performance of our algorithms.