Baseline Representation
As a baseline method, we use the image representations proposed by Vailaya et al. [18]. We selected this approach because it reports some of the best results among scene classification methods on datasets containing landscape, city, and indoor images, and because it has already been shown to work on a sufficiently large dataset. It can therefore be regarded as a good representative of the state of the art. Two different representations are used for the two binary classification tasks: color features are used to classify images as indoor or outdoor, and edge features are used to classify outdoor
images as city or landscape. The color features are based on first- and second-order moments in the LUV color space, computed over a 10x10 spatial grid of the image.
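For concreteness, a minimal sketch of such color features is given below, assuming a Python environment with NumPy and scikit-image (neither is prescribed by [18]); the block layout and the use of the standard deviation as the second-order moment are illustrative assumptions, not details taken from the baseline implementation.

```python
import numpy as np
from skimage.color import rgb2luv

def luv_color_moments(rgb_image, grid=10):
    """First- and second-order LUV color moments on a grid x grid spatial layout."""
    luv = rgb2luv(rgb_image)                      # H x W x 3 array in the LUV color space
    h, w, _ = luv.shape
    features = []
    for i in range(grid):
        for j in range(grid):
            block = luv[i * h // grid:(i + 1) * h // grid,
                        j * w // grid:(j + 1) * w // grid].reshape(-1, 3)
            features.append(block.mean(axis=0))   # first-order moments (per channel)
            features.append(block.std(axis=0))    # second-order moments (per channel)
    return np.concatenate(features)               # grid * grid * 2 * 3 values
```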
Our Representation
Fig. 1 - Computation of the representation of an image.
There are two main elements in an image classification system. The first one refers to the computation of the
feature vector representing an image, and the second one is the classifier, the algorithm that classifies an input
image into one of the predefined categories using the feature vector. In this section, we focus on the image representation
and describe the two models that we use: the first one is the bag-of-visterms, built from automatically extracted and
quantized local descriptors, and the second one is obtained through the higher-level abstraction of the bag-of-visterms
into a set of aspects using latent space modeling.
Bag-of-visterms representation from local descriptors:
The construction of the bag-of-visterms (BOV) feature vector from an image involves the different steps illustrated in Fig. 1.
In brief, interest points are automatically detected in the image, then local descriptors are computed over those regions.
All the descriptors are quantized into visterms, and all occurrences of each specific visterm of the vocabulary in
the image are counted to build the BOV representation of the image. In the following, we describe each of these steps in more detail.
Interest point detection:
The goal of the interest point detector is to automatically extract characteristic points, and more generally regions, from the image, which are invariant to some geometric and photometric transformations. This invariance property is important, as it ensures that, given an image and a transformed version of it, the same image points will be extracted from both and hence the same image representation will be obtained. Several interest point detectors exist in the literature. They vary mostly in the amount of invariance they theoretically ensure, the image property they exploit to achieve invariance, and the type of image structures they are designed to detect [22,23].
In this work, we use the difference of Gaussians (DOG) point detector [22].
This detector essentially identifies blob-like regions where a maximum or minimum of intensity occurs in the image, and it is invariant
to translation, scale, rotation and constant illumination variations.
We chose this detector because it was shown to perform well in previously published comparisons [23], and because we found it to be a good choice in practice for the task at hand, performing competitively with other detectors. The DOG detector is also faster and more compact than detectors with similar performance. An additional reason to prefer it over fully affine-invariant detectors [23] is that increasing the degree of invariance may remove information about the local image content that is valuable for classification.
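As an illustration, the sketch below uses OpenCV's SIFT detector, whose keypoint localization is based on difference-of-Gaussian extrema; this is an assumed stand-in for the detector of [22], not the implementation used in our experiments, and the file name is hypothetical.

```python
import cv2

image = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input image
detector = cv2.SIFT_create()                              # DOG-based keypoint detection
keypoints = detector.detect(image, None)                  # points with location, scale and orientation
print(f"detected {len(keypoints)} interest points")
```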
Local descriptors:
Local descriptors are computed on the region around each interest point that is automatically identified by the local interest
point detector. We use SIFT (Scale Invariant Feature Transform) features as local descriptors [22]. Our choice was motivated by the findings of several publications [23], in which SIFT was found to work best. This descriptor
is based on the grayscale representation of images, and was shown to perform best in terms of specificity of region representation and robustness
to image transformations [23]. SIFT features are local histograms of edge directions computed over different parts of the interest region. These
features capture the structure of the local image regions, which correspond to specific geometric configurations of edges or to more texture-like
content. In [22], it was shown that the use of 8 orientation directions and a grid of 4x4 parts gives a good compromise between descriptor
size and accuracy of representation. The size of the feature vector is thus 128. Orientation invariance is achieved by estimating the dominant
orientation of the local image patch using the orientation histogram of the keypoint region. All direction computations in the elaboration of
the SIFT feature vector are then done with respect to this dominant orientation.
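Continuing the previous sketch, the 128-dimensional descriptors (a 4x4 spatial grid with 8 orientation bins each) can be computed over the detected keypoints, again using OpenCV's SIFT as an assumed implementation rather than the one used in our experiments.

```python
import cv2

image = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical input image
sift = cv2.SIFT_create()
keypoints = sift.detect(image, None)                       # DOG interest points (previous step)
keypoints, descriptors = sift.compute(image, keypoints)    # one 128-d SIFT vector per keypoint
# descriptors has shape (num_keypoints, 128): 4x4 spatial parts x 8 orientation bins,
# computed relative to each keypoint's dominant orientation (rotation invariance).
```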
Quantization and Vocabulary model construction:
When applying the two preceding steps to a given image, we obtain a set of real-valued local descriptors. In order to obtain a text-like
representation, we quantize each local descriptor into one of a discrete set of visterms according to the nearest neighbor rule.
The construction of the vocabulary is performed through clustering. More specifically, we apply the K-means algorithm to a set of
local descriptors extracted from training images, and the means are kept as visterms. We use the Euclidean distance in the clustering and choose the number of clusters according to the desired vocabulary size.
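A minimal sketch of the vocabulary construction and of the nearest-neighbor quantization is given below, using scikit-learn's K-means as an assumed implementation; the vocabulary size of 1000 and the descriptor file name are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

# SIFT descriptors pooled from the training images, shape (num_descriptors, 128)
train_descriptors = np.load("train_sift_descriptors.npy")     # hypothetical file

vocabulary_size = 1000                                         # chosen vocabulary size
kmeans = KMeans(n_clusters=vocabulary_size, random_state=0)
kmeans.fit(train_descriptors)                                  # Euclidean K-means clustering
vocabulary = kmeans.cluster_centers_                           # the means are kept as the visterms

def quantize(descriptors):
    """Map each descriptor to the index of its nearest visterm (nearest neighbor rule)."""
    return kmeans.predict(descriptors)
```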
Technically, the grouping of similar local descriptors into a specific visterm can be thought of as being similar to the stemming
preprocessing step of text documents, which consists of replacing all words by their stem. The rationale behind stemming is that the meaning
of words is carried by their stem rather than by their morphological variations [24]. The same motivation applies to the quantization
of similar descriptors into a single visterm. Furthermore, in our framework, local descriptors will be considered as distinct whenever
they are mapped to different visterms, regardless of whether they are close or not in the SIFT feature space. This also resembles the
text modeling approach which considers that all information is in the stems, and that any distance defined over their representation
(e.g. strings in the case of text) carries no semantic meaning.
Bag-of-visterms representation:
The first representation of the image that we will use for classification is the bag-of-visterms (BOV), which is constructed from the
local descriptors by counting the occurrence of each visterm in a histogram-like fashion. This is equivalent to the bag-of-words representation used
in text documents. This representation of an image contains no information about
the spatial relationships between visterms. The standard bag-of-words text representation results in a very similar simplification of the
data: even though word ordering contains a significant amount of information about the original data, it is completely removed from the
final document representation.
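Given the quantized visterm indices of an image, the BOV vector is simply a count of how many times each vocabulary entry occurs; the short NumPy sketch below illustrates this (the variable names and the toy indices are illustrative).

```python
import numpy as np

def bov_histogram(visterm_indices, vocabulary_size):
    """Count occurrences of each visterm; all spatial information is discarded."""
    return np.bincount(visterm_indices, minlength=vocabulary_size).astype(float)

# Example: an image whose local descriptors were quantized to these visterm indices
bov = bov_histogram(np.array([3, 7, 3, 250, 7, 3]), vocabulary_size=1000)
# bov[3] == 3, bov[7] == 2, bov[250] == 1, every other entry is 0
```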
Probabilistic Latent Semantic Analysis (PLSA)
The bag-of-words approach has the advantage of producing a simple data representation, but it potentially introduces the well-known synonymy and polysemy ambiguities (see the analogy with text). Recently, probabilistic latent space models [6] have been proposed to capture co-occurrence information between elements in a collection of discrete data, in order to disambiguate the bag-of-words representation. The analysis of visterm co-occurrences can thus be approached in the same way, and we use the Probabilistic Latent Semantic Analysis (PLSA) model [6] in this paper for that purpose. Although PLSA is not a fully generative model, its tractable likelihood maximization makes it an interesting alternative to fully generative models, with comparable performance.
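For completeness, a didactic NumPy sketch of PLSA fitted by EM on a document-by-visterm count matrix is shown below; the random initialization, fixed iteration count, and smoothing constant are assumptions of this illustration, not details taken from [6] or from our experimental setup.

```python
import numpy as np

def plsa(counts, n_aspects, n_iter=100, seed=0):
    """Fit P(z|d) and P(w|z) by EM on a (documents x visterms) count matrix."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_w_given_z = rng.random((n_aspects, n_words))
    p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)
    p_z_given_d = rng.random((n_docs, n_aspects))
    p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior P(z | d, w) is proportional to P(z | d) * P(w | z)
        joint = p_z_given_d[:, :, None] * p_w_given_z[None, :, :]   # docs x aspects x words
        posterior = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate the multinomials from the expected counts
        expected = counts[:, None, :] * posterior                   # docs x aspects x words
        p_w_given_z = expected.sum(axis=0)
        p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_given_d = expected.sum(axis=2)
        p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_given_d, p_w_given_z
```

Under this sketch, the rows of p_z_given_d (the per-image aspect mixture proportions) would play the role of the aspect parameters used as the higher-level image representation.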
Classification Method
To classify an input image represented by the BOV vector, the aspect parameters, or any of the feature vectors of the baseline
approach, we employed Support Vector Machines (SVMs) [25]. SVMs have proven to be successful in solving machine learning problems
in computer vision and text categorization applications, especially those involving high-dimensional input spaces. In the current work,
we used Gaussian kernel SVMs, whose bandwidth was chosen based on a 5-fold cross-validation procedure.
Standard SVMs are binary classifiers; for multi-class classification, we adopt a one-against-all approach. Given an n-class problem, we
train n SVMs, where each SVM learns to differentiate images of one class from images of all other classes. In the testing phase, each
test image is assigned to the class of the SVM that delivers the highest output of its decision function.
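The following sketch reproduces this setup with scikit-learn, which is an assumed implementation choice; the stand-in data, the regularization constant, and the bandwidth grid are illustrative only.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Stand-in data: replace with BOV vectors, aspect parameters, or baseline features and labels
rng = np.random.default_rng(0)
X_train, y_train = rng.random((60, 1000)), np.repeat([0, 1, 2], 20)
X_test = rng.random((10, 1000))

param_grid = {"estimator__gamma": [1e-3, 1e-2, 1e-1, 1.0]}   # candidate gamma values (inverse kernel bandwidth)
ova_svm = OneVsRestClassifier(SVC(kernel="rbf", C=1.0))       # one binary SVM per class (one-against-all)
search = GridSearchCV(ova_svm, param_grid, cv=5)              # 5-fold cross-validation for the bandwidth
search.fit(X_train, y_train)
predictions = search.predict(X_test)                          # class of the SVM with the highest decision value
```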
Protocol
The protocol for each of the classification experiments was as follows. The full dataset of a given experiment was divided into 10
parts, thus defining 10 different splits of the full dataset. One split corresponds to keeping one part of the data for testing, while
using the other nine parts for training (hence the amount of training data is 90% of the full dataset). In this way, we obtain 10 different
classification results. Reported values for all experiments correspond to the average error over all splits, and standard deviations of
the errors are provided in parentheses after the mean value.
Additional experiments were conducted with a smaller amount of training data, to test the robustness of the image representation. In
that case, for each of the splits, images were chosen randomly from the training part of the split to create a reduced training set.
Care was taken to keep the same class proportions in the reduced set as in the original set, and to use the same reduced training set
in those experiments involving two different representation models. The test data of each split was left unchanged.
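The protocol can be sketched as follows with scikit-learn split utilities; the shuffling and seeding choices are assumptions of this illustration, while the stratified subsampling of the reduced training set follows the description above. The train_and_evaluate callback stands for any classifier training and evaluation routine returning an error rate; it is hypothetical and not part of the original setup.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

def run_protocol(X, y, train_and_evaluate, reduced_fraction=None, seed=0):
    """10 splits: one part kept for testing, the other nine used for training."""
    errors = []
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=seed).split(X):
        X_tr, y_tr = X[train_idx], y[train_idx]
        if reduced_fraction is not None:
            # reduced training set with the same class proportions; the test part is unchanged
            X_tr, _, y_tr, _ = train_test_split(X_tr, y_tr, train_size=reduced_fraction,
                                                stratify=y_tr, random_state=seed)
        errors.append(train_and_evaluate(X_tr, y_tr, X[test_idx], y[test_idx]))
    return np.mean(errors), np.std(errors)   # reported as mean error (standard deviation in parentheses)
```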