Image data in Alexandria will be stored primarily as files on a distributed file system. In the absence of appropriate tools, a user wishing to access relevant maps in such a system must have complete knowledge of the file organization at each individual storage site.
An initial approach to integrating these storage sites would be to use tools such as network file systems (e.g., Sun NFS). Unfortunately, there are two problems with this approach. First, network file systems assume that all the storage sites lie within a single administrative domain. Second, this approach results in a statically organized naming hierarchy that is not extensible. We envision a digital library that encompasses multiple administrative domains. It would be short-sighted to assume that a single naming organization would be acceptable to all administrative domains and would suit the needs of the entire user community. Instead, we would like a naming and storage organization that is extensible and adapts to the needs of its users. Ideally, multiple naming and storage organizations should be allowed to coexist, serving different user groups.
In a related project on scientific databases, we have integrated a virtual file system, called Prospero [49], to provide transparent access to distributed data. The most attractive aspect of Prospero is that it permits multiple organizations of the underlying file systems. For example, users studying the hydrological properties of a particular region may want the information organized in terms of the water basins of that region, whereas geographers analyzing the vegetation and forestry of the same region may require the image data to be organized differently. We propose to integrate customized naming and storage organization of image data into the Alexandria testbed.
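To make the idea of coexisting organizations concrete, the following Python sketch shows how two user communities might impose different logical hierarchies over the same physical image files. The class, the site names, and the paths are illustrative assumptions, not Prospero's actual interface.

    # A minimal sketch (not Prospero's actual API) of how multiple virtual
    # organizations can coexist over the same physical image files.
    # The class name, site names, and paths are illustrative assumptions.

    class VirtualView:
        """Maps logical names chosen by a user community onto physical file locations."""

        def __init__(self, name):
            self.name = name
            self.links = {}          # logical path -> (site, physical path)

        def link(self, logical_path, site, physical_path):
            self.links[logical_path] = (site, physical_path)

        def resolve(self, logical_path):
            return self.links[logical_path]

    # The same physical file appears under two different organizations.
    hydrology = VirtualView("water-basins")
    hydrology.link("/basins/santa-ynez/landsat-1994.img",
                   "storage-site-a", "/archive/0042/scene17.img")

    forestry = VirtualView("vegetation")
    forestry.link("/vegetation/chaparral/landsat-1994.img",
                  "storage-site-a", "/archive/0042/scene17.img")

    assert hydrology.resolve("/basins/santa-ynez/landsat-1994.img") == \
           forestry.resolve("/vegetation/chaparral/landsat-1994.img")

Because the views hold only links, a new user group can add its own organization without moving or copying the underlying files.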
Although the virtual file system enables users to access distributed data transparently, it still requires them to retrieve images based on external properties (i.e., file names). This organization is appropriate for users who are experts in their domain and know their data well; a casual user may still find it difficult and cumbersome to retrieve images in such a system. Image-processing techniques for content-based indexing will be used to design index structures and access methods that let users access and retrieve information based on its internal properties or attributes. Various data structures have been proposed for such retrieval and manipulation [60][59], which we can use to construct multi-attribute indexing mechanisms that provide efficient access to distributed image databases.
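As a concrete illustration of attribute-based retrieval, the sketch below assumes that image processing has already extracted content attributes (a bounding box and a vegetation fraction) for each image. A linear scan stands in for a real multi-dimensional structure such as an R-tree or grid file, and the attribute names are assumptions.

    # A minimal sketch of multi-attribute, content-based retrieval.  A real
    # index would use a multi-dimensional structure rather than a linear
    # scan; the attributes below are illustrative.

    from dataclasses import dataclass

    @dataclass
    class ImageEntry:
        file_name: str           # external property
        min_lat: float           # internal (content-derived) properties
        max_lat: float
        min_lon: float
        max_lon: float
        vegetation_fraction: float

    class MultiAttributeIndex:
        def __init__(self):
            self.entries = []

        def insert(self, entry):
            self.entries.append(entry)

        def query(self, lat, lon, min_vegetation=0.0):
            """Return images whose bounding box covers (lat, lon) and whose
            vegetation fraction meets the threshold."""
            return [e for e in self.entries
                    if e.min_lat <= lat <= e.max_lat
                    and e.min_lon <= lon <= e.max_lon
                    and e.vegetation_fraction >= min_vegetation]

    index = MultiAttributeIndex()
    index.insert(ImageEntry("scene17.img", 34.3, 34.9, -120.2, -119.4, 0.62))
    index.insert(ImageEntry("scene18.img", 36.0, 36.7, -121.5, -120.8, 0.15))
    print([e.file_name for e in index.query(34.5, -119.8, min_vegetation=0.5)])

The point of the sketch is the query interface: users specify content attributes rather than file names, and the index maps those attributes back to the stored images.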
Since the primary data, or images, will be used in read-only mode, update conflicts do not arise and concurrency control over the data itself is unnecessary; image sharing therefore does not degrade throughput for the various users. However, because the data is distributed, the index structure must also be distributed among the locations where the data is stored as well as where the users are located. Each user profile specifies that user's domain of interest, so when the user starts a query session, the portion of the index structure that is of interest can be loaded onto his or her local machine. This temporary caching of the index structure can significantly speed up the search process. Equally important, this localized portion of the index can be customized or adapted to the user's profile.

Typically, each user would like to view an image according to his or her own interpretation, and that interpretation may change as the user's interests evolve. One-time, static database modeling support would therefore be completely inadequate for such an environment. Contrast this with more traditional database applications, such as banking, airlines, and accounting, in which the database schema remains relatively static for long periods. We therefore believe that the digital library must be highly extensible and must provide means for dynamic changes in the schema definition and the index structure. The distributed index structure for digital libraries must adapt to its various users and provide fast access tailored to their different needs.

Finally, even though most use of a digital library involves searches or simple appending of data, the librarian may need to revise the index structure either in its entirety or for some subset of the data. In these cases, a distributed restructuring of the index must occur. This could be done off-line, at a significant cost due to the disruption of service, or concurrently with user queries. During the course of this project, we plan to explore and implement efficient concurrency control protocols that are especially suitable for multi-dimensional index structures [2].
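The following sketch illustrates the profile-driven caching described above, under the assumption that the global index is partitioned by geographic region and that a user profile specifies a bounding box of interest; the partition names and fetch mechanism are hypothetical.

    # A minimal sketch of loading only the portion of a distributed index
    # that matches a user's profile when a query session starts.  We assume
    # the global index is partitioned by geographic region; partition names
    # are hypothetical.

    # Each remote partition advertises the bounding box of the entries it holds.
    GLOBAL_PARTITIONS = {
        "site-a/index-part-01": (34.0, 35.0, -121.0, -119.0),
        "site-b/index-part-02": (36.0, 37.0, -122.0, -120.0),
    }

    def boxes_overlap(a, b):
        a_min_lat, a_max_lat, a_min_lon, a_max_lon = a
        b_min_lat, b_max_lat, b_min_lon, b_max_lon = b
        return (a_min_lat <= b_max_lat and b_min_lat <= a_max_lat and
                a_min_lon <= b_max_lon and b_min_lon <= a_max_lon)

    def load_local_cache(profile_box, fetch_partition):
        """Fetch only the index partitions that intersect the user's region
        of interest; subsequent queries run against this local cache."""
        cache = {}
        for name, box in GLOBAL_PARTITIONS.items():
            if boxes_overlap(box, profile_box):
                cache[name] = fetch_partition(name)   # network fetch in practice
        return cache

    # Example: a hydrologist whose profile covers part of the Santa Ynez basin.
    cache = load_local_cache((34.2, 34.8, -120.5, -119.5),
                             fetch_partition=lambda name: f"<entries of {name}>")
    print(sorted(cache))     # only site-a's partition is cached locally

Because the cached portion is local to the user, it is also a natural place to attach profile-specific customizations of the index without disturbing other users.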
The shared resources in Alexandria will consist of data, metadata, index structures, network resources for accessing remote data, and local computational resources. These resources must be used efficiently to ensure fast response times for users. Existing resource allocation algorithms may not scale up to Alexandria because of the large number of heterogeneous users and the possibility of site and network failures. We have previously investigated efficient fault-tolerant algorithms for resource allocation in distributed systems [12][11][10][13][4][3]. These algorithms provide solutions to a number of resource allocation paradigms, such as mutual exclusion, job scheduling, and dining philosophers, while emphasizing failure locality, low contention, and reliability. We will examine the adaptation of these and other existing algorithms to Alexandria and also continue to design new algorithms in the specific context of a digital library.
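As a simplified, single-process illustration of one fault-tolerance concern raised above (a failed user should not hold a shared resource indefinitely), the sketch below uses an expiring lease. It is not one of the cited distributed algorithms; the lease duration and identifiers are illustrative.

    # A minimal, single-process sketch: exclusive access to a shared
    # resource is granted as a lease that expires if the holder fails,
    # so a crash does not block other users indefinitely.

    import time

    class LeasedResource:
        """Grants exclusive access as a lease that expires if the holder fails."""

        def __init__(self, lease_seconds=5.0):
            self.lease_seconds = lease_seconds
            self.holder = None
            self.expires_at = 0.0

        def acquire(self, client_id, now=None):
            now = time.monotonic() if now is None else now
            if self.holder is None or now >= self.expires_at:
                self.holder = client_id
                self.expires_at = now + self.lease_seconds
                return True
            return False

        def release(self, client_id):
            if self.holder == client_id:
                self.holder = None

    # A crashed holder's lease simply expires, so a later request succeeds.
    link = LeasedResource(lease_seconds=5.0)
    assert link.acquire("user-1", now=0.0)        # user-1 holds the resource
    assert not link.acquire("user-2", now=1.0)    # user-2 must wait
    assert link.acquire("user-2", now=6.0)        # user-1 crashed; lease expired

The fault-tolerant algorithms cited above address the harder distributed versions of this problem, where no single site can be trusted to arbitrate and where failure locality and contention must be bounded.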