The Idiap Research Institute has been active for almost two decades in Swiss, European and worldwide projects and consortia in which the acquisition, storage, distribution, processing, and evaluation of very large amounts of digital data have played a central role in the execution and success of the research. To name but a few, Idiap has been, and still is, leading or involved in:
• the NCCR IM2 on Interactive Multimodal Information Management;
• the AMI and AMIDA EU FP6 projects;
• the MOBIO FP7 project.
The need for efficient access to very large datasets in each of these projects will be discussed as an example in the proposal; a number of other projects will also benefit from the planned system.
As leading house of the IM2 NCCR, Idiap has played a key role in coordinating the setup of acquisition infrastructure, such as instrumented smart meeting rooms, and the time synchronization and alignment of sources with different bandwidths and resolutions. Going from raw data to annotated corpora useful for research also requires infrastructure and services that go beyond adding a number of hard disks to a file server. Finally, distributing those datasets to other teams within the NCCR and worldwide, so that they could work on the same data, represented a breakthrough, and Idiap wants to maintain its leadership.
The current proposal aims to add a high-bandwidth, high-capacity, distributed storage system to the Idiap research infrastructure. In addition to our “traditional” file server (based on NetApp systems), we currently have a total of 50 TB of low-cost storage space available for static data. This storage is built on RAID systems, and access to it is limited by the network bandwidth of each system (generally 1 Gbit/s). Furthermore, the increasing number of people using the system and the growing size of the datasets call for a significant increase in capacity and bandwidth.
With the goal of maintaining Idiap’s position as the Swiss leader in the acquisition, storage and distribution of massive amounts of multimedia data (coming from Idiap, as well as other national and international sources), we seek to improve our storage capacity by adding high-performance storage. The key idea of the project is to improve performance through the distribution of storage nodes, the striping of data, and direct access from computing machines, which requires a file system able to handle concurrent access. We intend to use Sun Microsystems' Lustre technology, which benefits from a successful history as the file system used in many of the Top500 Supercomputing Sites (see http://top500.org).
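To illustrate the idea of striping data across storage nodes, the following is a minimal sketch of a simplified round-robin layout. It is not Lustre's API or configuration: the function name `stripe_location`, the 1 MiB stripe size, and the count of four storage nodes are all assumptions chosen for illustration.

```python
# Simplified round-robin striping: which storage node holds a given byte
# of a file, and at which offset within that node's part of the file.
# Stripe size and node count below are illustrative assumptions.

MIB = 1024 * 1024


def stripe_location(offset, stripe_size, stripe_count):
    """Return (node_index, offset_on_node) for a byte offset in a file."""
    if offset < 0 or stripe_size <= 0 or stripe_count <= 0:
        raise ValueError("offset must be >= 0; sizes and counts must be > 0")
    stripe_index = offset // stripe_size          # which stripe of the file
    node_index = stripe_index % stripe_count      # stripes rotate over nodes
    stripes_before_on_node = stripe_index // stripe_count
    offset_on_node = stripes_before_on_node * stripe_size + offset % stripe_size
    return node_index, offset_on_node


if __name__ == "__main__":
    # Consecutive stripes land on different nodes, so a large sequential
    # read can draw on several network links at once.
    for k in range(6):
        print(k, stripe_location(k * MIB, MIB, 4))
```

Because consecutive stripes of a file sit on different nodes, a client reading a large file is no longer bound by the network link of a single storage system.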
To implement this, we plan to build racks of storage using standard multi-disk RAID systems (RAID 6 systems with SATA disks). These systems will be connected to our network with high-bandwidth links (20 times faster than today), which will shorten the running time of experiments. We aim to increase our storage capacity by adding four racks containing 120 TB each.
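As a rough back-of-the-envelope check of these figures, the sketch below computes the added capacity and an idealized transfer time. It assumes decimal units (1 TB = 10^12 bytes), a fully utilized link with no protocol overhead, that "20 times faster" applies to the 1 Gbit/s figure quoted above, and a hypothetical 1 TB dataset that does not come from the proposal.

```python
# Back-of-the-envelope capacity and transfer-time estimate.
# Figures taken from the text: 4 racks, 120 TB per rack, 1 Gbit/s today,
# a 20x faster network. The 1 TB dataset size is a hypothetical example.

RACKS = 4
TB_PER_RACK = 120
CURRENT_LINK_GBITS = 1.0
SPEEDUP = 20


def added_capacity_tb(racks=RACKS, tb_per_rack=TB_PER_RACK):
    """Total capacity added by the new racks, in TB."""
    return racks * tb_per_rack


def transfer_time_seconds(dataset_tb, link_gbits):
    """Ideal time to move a dataset over a link (no overhead assumed)."""
    bits = dataset_tb * 1e12 * 8
    return bits / (link_gbits * 1e9)


if __name__ == "__main__":
    print("added capacity (TB):", added_capacity_tb())
    print("1 TB at current speed (s):",
          transfer_time_seconds(1, CURRENT_LINK_GBITS))
    print("1 TB at planned speed (s):",
          transfer_time_seconds(1, CURRENT_LINK_GBITS * SPEEDUP))
```

Under these assumptions the four racks add 480 TB, and reading a 1 TB dataset drops from roughly 8000 seconds to roughly 400 seconds; real throughput would be lower once protocol and disk overheads are taken into account.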