The «Russian Archives Online» Project: Development Prospects

Bukhshtab Yuri
Head of Laboratory
Keldysh Institute of Applied Mathematics RAS
Address: 4 Miusskaya square, Moscow 125047
E-mail: kikom@online.ru

Natalia Nikolaevna Yevteeva, Ph.D., senior researcher

The main objective of the Russian Archives Online (RAO) Project is to create and publish on the Internet large databases describing the audio and visual materials held in Russian archival collections. The aim is to give users in Russia and worldwide access to archival documents. Although Russian archival vaults are well organized by traditional, "pre-computer" methods, their materials can be used only after tiresome on-site research through paper catalog cards, either handwritten or typed. Even in Russia few people are aware of the rich audio and visual archival collections, which is why the potential benefits of the materials stored in the archives have only begun to be realized.

Creating online educational resources based on archival collections (in particular, resources on the history of Russia) is a top priority of the Project.

The Project also provides for related e-commerce, i.e. licensing archival audio and visual materials selected from the databases created. The proceeds are to be channeled into preserving the collections (this procedure has already been initiated).

The Project has been supported by a number of international organizations (the US Agency for International Development, Internews, the Open Society Institute, UNESCO) as well as by the Russian Foundation for Basic Research (RFBR). Apart from archive staff, contributors to the Project include the Keldysh Institute of Applied Mathematics of the Russian Academy of Sciences, Moscow and Texas Universities, and the Abamedia company (USA).

Although the Project was named Russian Archives Online only a little less than a year ago, the actual work began as far back as 1996, when a decision was made to create an electronic catalog of the Russian State Archive for Documentaries and Photos in Krasnogorsk (RGAKFD). This Archive possesses the largest collection of documentaries and stills reflecting the history of Russia and the Soviet republics. The Krasnogorsk Archive holds more than 215 thousand reels of documentary film comprising more than 38 thousand titles, over 1 thousand of which date from before the Revolution. The Archive also stores more than 1 million photos and negatives, as well as unique albums of the Tsar's family. The Krasnogorsk collection, an illustrated history of Russia from the middle of the 19th century, is of interest not only to non-fiction filmmakers and mass media all over the world; it is also a major source of documentary materials for scientific research, above all historical research. The collection plays an even more important role in educational activities both in Russia and abroad.

At present the database of the Krasnogorsk Archive electronic catalog contains descriptions of the larger part of its film collection (25,000 descriptions, to be precise). The catalog can be accessed online (http://rgakfd.internews.ru/catalog.htm) and is distributed on CDs. Cataloging continues, and in a year the database will describe the entire film collection. This year we began work on an English version of the electronic catalog: we developed and tested a methodology that combines Systran Professional machine translation with subsequent editing of the texts, which are then entered into a database of English descriptions of films from the RGAKFD collection. So far 3,000 descriptions have been translated and put online (http://rgakfd.internews.ru/ecatalog.htm). Analysis of the results shows that, for a large database, this methodology yields translations of quite acceptable quality at relatively low cost. The whole Krasnogorsk electronic catalog is to be translated into English on the basis of this approach. Completion of the Krasnogorsk Archive catalog and its translation into English are financed by the Open Society Institute.

In addition to the Krasnogorsk Archive catalog, two more catalogs representing the collection of the Russian State Archive for Scientific and Technical Documentation (RGANTD) have been created and put online. This Archive stores a large number of unique photos and films on the origins and development of astronautics, missiles and spacecraft. Within the framework of the Project, a catalog was compiled of the 3,000 most interesting photos from the Archive collection, together with their descriptions. The cataloging of space photos was sponsored by UNESCO. The catalog is available online (http://rgantd.ru/elcatalog/photocat.htm), and a CD version has already been prepared.

An electronic catalog containing descriptions of documentaries on space exploration from the collection of the Russian State Archive for Scientific and Technical Documentation has also been developed and put online (http://rgantd.ru/ecfilm/catalog.htm). This work was sponsored by the Open Society Institute.

Choosing appropriate software was the first and foremost task facing the developers when the Project was launched. The choice was determined primarily by the need for a search engine built not only on the formal classification of documents but on their contents as well. Free-text search based on automatic indexing is known to be an adequate methodology for this task.

The first pilot version of the Krasnogorsk Archive electronic catalog was built on the full-text DBMS made by Personal Library Software (a US company). The programmers participating in the Project invested a great deal of effort in adapting the DBMS to Russian, because its developers had not envisaged use of the Russian language. We nevertheless had to abandon this DBMS, as it proved very awkward for distributing the catalog on CDs (the DBMS had to be installed and configured on the user's computer beforehand). In addition, the version of the DBMS intended for online operation was very expensive at that time. The first version of the electronic catalog of Krasnogorsk documentaries to go into actual operation was built on Microsoft Access. To use this DBMS for the Project's needs, a special program was written to provide adequate free-text search. However, the responsiveness of the DBMS in this mode, which is unusual for it, did not fully meet the requirements of the Project's developers.

For these reasons, a decision was made to develop a new full-text DBMS tailored to the tasks at hand, one that would let users work from CDs without prior installation and could operate online on servers running different platforms. For the last two years all development within the Russian Archives Online Project has used this DBMS, and operational experience shows it to be quite effective for building various electronic catalogs. The DBMS is based on a B-tree structure and locates the required information in large volumes of data within a few seconds. It treats every word found in the database entries as a search term, except words in fields declared non-indexed and so-called stop-words, the list of which can be set for each application.
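The indexing rule described above can be illustrated with a minimal Python sketch. It is not the Project's DBMS: a sorted key list searched with bisect stands in for the B-tree, and the class name, the field name "shelf_code" and the stop-word list are assumptions made only for this example.

```python
import re
from bisect import bisect_left, insort

# Hypothetical per-application settings (not taken from the Project).
STOP_WORDS = {"the", "a", "of", "in", "and"}
NON_INDEXED_FIELDS = {"shelf_code"}

WORD_RE = re.compile(r"\w+")


class FullTextIndex:
    """Toy inverted index; a sorted key list plays the role of the B-tree."""

    def __init__(self, stop_words=STOP_WORDS, non_indexed=NON_INDEXED_FIELDS):
        self.stop_words = set(stop_words)
        self.non_indexed = set(non_indexed)
        self.keys = []       # vocabulary kept in sorted order
        self.postings = {}   # word -> set of entry ids

    def add_entry(self, entry_id, fields):
        """Index every word of every field, skipping non-indexed fields and stop-words."""
        for name, text in fields.items():
            if name in self.non_indexed:
                continue
            for word in WORD_RE.findall(text.lower()):
                if word in self.stop_words:
                    continue
                if word not in self.postings:
                    insort(self.keys, word)
                    self.postings[word] = set()
                self.postings[word].add(entry_id)

    def lookup(self, word):
        """Return the ids of entries containing the exact word."""
        return set(self.postings.get(word.lower(), set()))

    def lookup_prefix(self, prefix):
        """Ordered keys turn a prefix search into a short range scan."""
        prefix = prefix.lower()
        found = set()
        i = bisect_left(self.keys, prefix)
        while i < len(self.keys) and self.keys[i].startswith(prefix):
            found |= self.postings[self.keys[i]]
            i += 1
        return found
```

Keeping the vocabulary ordered is what a B-tree provides on disk; here the same property is only imitated in memory, which is enough to show why lookups stay fast as the number of entries grows.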

Queries use logical combinations of words, which can be masked. Queries can also include field names, making it possible to specify the part of the entry in which the search words are to be looked for. Search results are displayed in a form convenient for quickly choosing the most relevant information: as a short list of the entries found (e.g. their titles), while each entry can be viewed in full in a separate entry window.
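As a rough illustration of such queries, the Python sketch below combines masked words with logical operators and an optional field restriction, and returns a short list of titles. The query syntax of the actual DBMS is not given here, so queries are built from hypothetical helper functions (term, all_of, any_of, none_of); the wildcard characters '*' and '?' and the field names are likewise assumptions.

```python
import re
from fnmatch import fnmatchcase

WORD_RE = re.compile(r"\w+")


def words(text):
    """Split a text into lowercase words."""
    return WORD_RE.findall(text.lower())


def term(mask, field=None):
    """Match entries having a word that fits `mask`, optionally in one field only."""
    mask = mask.lower()

    def matches(entry):
        texts = [entry.get(field, "")] if field else entry.values()
        return any(fnmatchcase(w, mask) for t in texts for w in words(t))

    return matches


def all_of(*preds):
    """Logical AND of several conditions."""
    return lambda entry: all(p(entry) for p in preds)


def any_of(*preds):
    """Logical OR of several conditions."""
    return lambda entry: any(p(entry) for p in preds)


def none_of(pred):
    """Logical NOT of a condition."""
    return lambda entry: not pred(entry)


def search(entries, pred, list_field="title"):
    """Return the short result list as (entry id, title) pairs."""
    return [(eid, e.get(list_field, "")) for eid, e in entries.items() if pred(e)]


# Invented sample entries, used only to exercise the sketch.
entries = {
    1: {"title": "Cosmonauts in training", "annotation": "Gagarin at the training centre"},
    2: {"title": "Gagarin returns to Moscow", "annotation": "Crowds greet the cosmonaut"},
    3: {"title": "Harvest in Kuban", "annotation": "Combine harvesters at work"},
}

# Entries mentioning a word starting with "cosmonaut", but without "moscow" in the title.
query = all_of(term("cosmonaut*"), none_of(term("moscow", field="title")))
results = search(entries, query)
```

This sketch scans every entry for clarity; a real system would resolve each term through the index and combine the resulting sets of entry ids instead.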

The DBMS offers one more convenient means for quickly selecting relevant information: hypertext links connecting the entry text with related material (a photo, a video or audio clip, or a program external to the DBMS).

To populate the information collections of the search systems built on this DBMS, utilities have been written that import texts in various formats, including Microsoft Word documents and Microsoft Access records. In addition, a specialized editor has been developed for entering texts directly into the database and modifying them.

Research on content-based search of video information is an important part of the Project. The problem is that discrepancies between visual content and text descriptions reduce the precision and completeness of search. Moreover, acceptable descriptions cannot be created automatically, without human involvement, which considerably slows the process of adding to the database.

Today a large volume of visual material can easily be digitized. What is urgently needed is content-based search for retrieving relevant information from digital visual archives.

The research reported here aims to develop and implement techniques for analyzing, classifying and searching images and video data on the basis of their visual attributes; it relates to the automated understanding of visual information.

Our current research focuses on developing tools for visual feature extraction and on designing comparison algorithms. Results have been obtained in the following directions:
- A similarity measure for still images based on computing color histograms and comparing them quantitatively (possibly using a quadtree technique).
- Edge detection using non-maximum suppression of the first spatial derivative of image intensity. Spatial segmentation detects objects characterized by dominant color and by shape.
- Shape measures based on two functions computed for the contour: turn angles and centroid distance.
- Texture measures: we are exploring the Gabor function method and the use of a gray-level co-occurrence matrix.
- Temporal segmentation of video based on comparing the color histograms of consecutive frames.
- A video indexing technique: extraction of representative key frames and application of the visual feature extraction methods developed for still images, with additional use of optical flow parameters.
- Optical flow computation using a differential technique.
- Detection of moving entities in video on the basis of total optical flow analysis. Each extracted object is characterized by its location, size and motion type. We will focus on using motion to describe object activity and events.
- Multilevel classification of video by motion type. In particular, the classification makes it possible to detect specific camera operations (zoom, etc.).
- Object detection in images using a neural network.
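Two of the items above, histogram-based image similarity and temporal segmentation by comparing the histograms of consecutive frames, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Project's algorithm: histogram intersection is chosen here as the comparison measure, and the bin count, the threshold and the representation of a frame as a list of (r, g, b) tuples are all hypothetical.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Normalized RGB histogram; `pixels` is an iterable of (r, g, b) values in 0..255."""
    n = bins_per_channel
    hist = [0.0] * (n ** 3)
    count = 0
    for r, g, b in pixels:
        # Map each channel to a bin index in 0..n-1 and combine the three indices.
        idx = ((r * n // 256) * n + (g * n // 256)) * n + (b * n // 256)
        hist[idx] += 1.0
        count += 1
    return [h / count for h in hist] if count else hist


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1.0 means identical color distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def shot_boundaries(frames, threshold=0.5):
    """Indices i at which frame i differs sharply in color from frame i - 1."""
    hists = [color_histogram(f) for f in frames]
    return [
        i
        for i in range(1, len(hists))
        if histogram_intersection(hists[i - 1], hists[i]) < threshold
    ]


# Synthetic frames (assumption): two reddish frames followed by two bluish ones.
red_a = [(250, 10, 10)] * 16
red_b = [(240, 20, 5)] * 16
blue = [(10, 10, 250)] * 16
cuts = shot_boundaries([red_a, red_b, blue, blue])
```

A fixed threshold detects abrupt cuts in this toy example; gradual transitions such as fades would need a more careful criterion, and a quadtree variant would compute the same histograms per image region rather than for the whole frame.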

The first practical use of these research results is expected to be the automatic extraction of informative stills from a video stream. Research in this area was financed by RFBR grant #01-01-00267.

