Challenges in Cross-Lingual Information Retrieval
Nowadays, every Internet user is accustomed to performing several searches a day. According to recent statistics, Google, the best-known search engine, processes over 3.5 billion searches every day (InternetLiveStats). However, important challenges remain in the Information Retrieval (IR) field. For example, it is increasingly common for the language of the user not to match the language of the documents. This scenario is known as Cross-lingual Information Retrieval (CLIR), defined as the task of retrieving relevant information when the document collection is written in a different language from the user query. The CLIR task becomes even more difficult when the document collection contains both text and audio documents drawn from a wide variety of genres, e.g., news reports, conversational speech, etc.
One of the goals of the SARAL-IARPA project is to develop efficient technology to tackle the above problems. To do so, a CLIR system must be able to represent and match information in the same representation space even if the query and the document collection are in different languages. Thus, the fundamental problem in CLIR is to match terms in different languages that have the same or a similar meaning. Given the recent success of deep neural network (DNN) approaches in many Natural Language Processing (NLP) tasks, Prof. Esaú Villatoro is working on a proposal for incorporating and applying DNNs in the context of the SARAL project. These ideas have not been extensively explored in the IR field, and they represent a promising path for improving the performance of CLIR systems.
Fig. 1. General overview of the CLIR architecture.
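The idea of matching a query and documents in a shared representation space can be illustrated with a minimal sketch. It is not the SARAL system: the tiny hand-made word vectors, the names SHARED_EMBEDDINGS, embed, cosine and rank, and the English/Spanish example are all assumptions chosen for illustration. In practice such vectors would be learned, for example by a DNN.

```python
import math

# Hypothetical toy embeddings: words from two languages placed in one shared
# 3-dimensional space. Real systems learn these vectors; these are made up.
SHARED_EMBEDDINGS = {
    # English
    "dog": [0.9, 0.1, 0.0],
    "food": [0.1, 0.9, 0.0],
    "election": [0.0, 0.1, 0.9],
    # Spanish
    "perro": [0.88, 0.12, 0.02],
    "comida": [0.12, 0.85, 0.05],
    "elecciones": [0.02, 0.08, 0.92],
    "gobierno": [0.05, 0.05, 0.85],
}


def embed(text):
    """Average the vectors of the known words; unknown words are skipped."""
    vectors = [SHARED_EMBEDDINGS[w] for w in text.lower().split()
               if w in SHARED_EMBEDDINGS]
    if not vectors:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (score, document) pairs sorted from most to least similar."""
    query_vector = embed(query)
    scored = [(cosine(query_vector, embed(doc)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# English query against a Spanish document collection.
documents = ["comida perro", "elecciones gobierno"]
ranking = rank("election", documents)
for score, doc in ranking:
    print(f"{score:.3f}  {doc}")
```

Because the query and the documents are compared only through their vectors, no word of the query needs to appear in any document; the quality of the ranking depends entirely on how well the shared space aligns terms with similar meanings across languages.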