Alexandria Project Proposal

TOWARDS A DISTRIBUTED DIGITAL LIBRARY

WITH COMPREHENSIVE SERVICES FOR

IMAGES AND SPATIALLY-REFERENCED INFORMATION

CO-PRINCIPAL INVESTIGATORS
Jeff Dozier
Michael Goodchild
Oscar Ibarra
Sanjit Mitra
Terence Smith
Divyakant Agrawal
Amr El Abbadi
James Frew

INVESTIGATORS

PROJECT SUMMARY

A consortium of major libraries, university research groups and industrial partners will design, develop and evaluate a distributed, high-performance digital library of spatially-indexed information that includes collections of maps and images in digital form. The main output of the Alexandria Project will be a distributed testbed system that provides geographically dispersed users with access to a geographically dispersed set of library collections. Users will be able to access, browse and retrieve specific items from the collections of the library by means of user-friendly interfaces that integrate visually-based and text-based query languages. Librarians will be provided with facilities that enable them to extend their collections of appropriately formatted materials and to add meta-information to their electronic catalogue. The testbed is being designed to scale to a distributed library at the national level.

The architecture of the testbed system consists of a user interface supporting user-friendly access to library services by means of a combination of textual and visual languages and strong browse capabilities based on hierarchical partitions of the data; a catalogue component in which index structures and metadata are employed in providing rapid response to user queries involving content-based search; a storage component providing high-speed access to very large collections of spatially-indexed items; and an ingest component that permits librarians and systems managers to incorporate new items into the library collection using procedures that include digitization, (re)formatting, and the extraction of catalogue information. Each of the major components has a network interface subcomponent providing protocols for communication over a wide area network. The testbed system will be composed of subsystems that are distributed over a number of sites and comprise various configurations of the four components.

Two experimental systems will be developed over the course of the project. The first is a rapid prototype that will be built in the early days of the project, while the second is the main testbed system that will be developed and evaluated over the four-year life of the project. The rapid prototype will be built in close cooperation with industrial partners and will be a single centralized system serving multiple users over a local area network. This prototype will provide a simulation model of the application system and user interfaces. Once the prototype is operational, we will develop the full testbed system, which will be a fully distributed system. This system will be compatible with digital spatial data standards, spatial metadata standards and library standards, such as Z39.50. The research involved in developing the system will focus on issues relating to user requirements; the interface, catalogue, ingest, storage, and network components of the system; support for high performance, including a focus on special data structures, high-speed communications and the targeted use of parallel computation; systematic evaluations of the system from the user point of view; and legal and other extra-computational issues.

The testbed will be populated with several collections of important, spatially-indexed items, including images and maps as well as textual information. Significant attention will be focussed on the investigation of user requirements and the on-going evaluation of the testbed by users of these collections. Upon the completion of a sound beta version of the system, proposals will be solicited from the library community in general for participation in the formal testing of Alexandria. The project will also explore a range of incentives for enticing producers to make their spatial data sets available, assess the practical and legal limits of the various incentives, and evaluate and gauge the impact of varying information policy arrangements on government and the private sector in increasing general access to spatial data.

The project will be based and administered at UC Santa Barbara, where several large and related projects have been underway for some time. A comprehensive management program, including continuing evaluation and external review, will be employed throughout the life of the project in order to ensure its success.

EXECUTIVE SUMMARY

The goal of the Alexandria Project is to develop a user-friendly digital library system that provides a comprehensive range of services to collections of maps, images, and spatially-referenced information. We therefore propose to design, develop and test a distributed, high-performance digital library, in which collections of spatially-indexed information in digital form as well as users are dispersed geographically. The program of research and development that we are proposing represents a major step towards the evolution of a distributed digital library supporting both textual and spatially-indexed sources of information and scalable to the national level. While various technical issues relating to the storage and content-based access and retrieval of spatial data require us to focus initially on spatially-indexed collections, our long-term goal is to remove the distinction between mainstream libraries focusing on text and special libraries focusing on less conventional materials.

The main output of the Alexandria Project will be a distributed testbed system that will appear to users to be a single library in which items that were previously difficult or impossible to access are now retrievable; and that will appear to librarians as a system in which once cumbersome and perishable items are easily manageable. The system will permit user-friendly access to library collections by means of integrated text and visual interfaces. Such interfaces will support accessing, browsing and retrieving specific items from the collections of the library. Librarians will be provided with facilities enabling them to extend their collections of appropriately formatted materials and to add meta-information to their electronic catalogue. The architecture of the testbed system will include a user interface supporting simple access to each of the library services by means of some combination of textual and visual languages and by means of strong browse capabilities; a catalogue component providing rapid and appropriate response to user queries, particularly those involving content-based search, with the use of various index structures and meta-information; a storage component providing storage capability for, and high-speed access to, large collections of spatially-indexed items; and an ingest component that permits librarians and systems managers to incorporate new items into the library collection, using procedures that include digitization, (re)formatting and the extraction of catalogue information. Each of the major components has a network interface subcomponent providing protocols for communication over a wide area network. The testbed system will be composed of subsystems that are distributed over a number of sites and comprise various configurations of the four components. The testbed is being designed to scale to a distributed library at the national level.

The completion of a testbed system possessing such comprehensive functionality and satisfying user requirements and high-performance criteria clearly requires the resolution of a large number of research and development issues. The research and development team that we have assembled at UCSB is uniquely qualified to successfully complete such a complex system and includes:

Library groups: the Map and Image Laboratory of the UCSB Library, containing one of the nation's largest map libraries and imagery collections; the University of California Division of Library Automation; the library of SUNY at Buffalo; the Library of Congress; the library of the US Geological Survey; and the St. Louis Public Library.
University research groups: the National Center for Geographic Information and Analysis (NCGIA), with sites at UCSB, SUNY Buffalo, and the University of Maine; the UCSB Department of Computer Science; the UCSB Department of Electrical and Computer Engineering; the UCSB Center for Remote Sensing and Environmental Optics (CRSEO), a partner in the Sequoia/2000 project; and the National Center for Supercomputer Applications (NCSA).
Private sector: including Digital Equipment Corporation; Environmental Systems Research Institute (ESRI); ConQuest; and the Xerox Corporation.

The project will be based and administered at UC Santa Barbara, where a comprehensive management program, including continual evaluation and external review, will be employed throughout the life of the project in order to evaluate the testbed and its applicability.

User Requirements for the Testbed System: A characterization of users and their requirements is of fundamental importance for the design, development, and evaluation of the testbed system. We will adopt a three-fold strategy to determine user requirements. First, we will analyze existing studies of user requirements with respect to spatial data in the context of library systems, such as the GRIN (GeoReferenced Information Network) study carried out at UCSB. Second, a rapid prototype system, built using currently available software in the first few months of the project, will serve as an important vehicle for investigating user requirements. Finally, the main testbed system that we will design, develop, and test over the greater part of the project will provide a major test of our characterization of user requirements. This characterization will involve the various classes of users and their data needs; the classes of items of interest to the users; the sets of operations that users wish to apply to items; the nature of the interface(s) that are appropriate; and the levels of system performance that must be met.

The Architecture of the Testbed System: The design of the user interface is crucial for the success of the Alexandria Project. Users of Alexandria will require methods that are both simple and naturally expressive in accessing and retrieving spatial and non-spatial information. The functionality of the interface component involves support for text-based and visually-based query languages that permit a user to express, in simple and frequently visual terms, queries concerning the existence, characteristics (including content), and availability of datasets that satisfy various constraints. It will provide support for procedures that permit users to visualize and browse datasets that are candidates for satisfying the constraints expressed in their queries; support for user requests to have selected datasets transmitted to locations of their choice; and support for query processing that enables the interface system to determine how the complete processing of a query should be partitioned between the client site and the server site. We will augment Khoros with specific processing procedures to support this functionality. The design of the user interface system will initially be based on the results of an ongoing digital library project at UCSB.

The core of the catalogue system is a subdatabase of catalogue/meta information concerning the main items stored in a database supported by full DBMS functionalities. The data stored in the catalogue database includes indices and metadata about the items stored in the main datastore of the library, where the metadata involves abstract textual descriptions of the data and reduced datasets. Since spatially-indexed data comes in many different types and a large variety of representations, it is important to have an appropriate data model for organizing both the data and the metadata. The component consists of a ``lower'' level involving the storage organization; the various indexing schemes; access methods; and support for distributed access to hierarchically structured spatial data, texts, and structured as well as raw spatial or spatio-temporal datasets. A ``higher'' level involves the provision of a sound and extensible conceptual representation for spatial data and metadata, basic utilities for common specialized processing needs, and a flexible platform for developing user- or application-specific software tools and interfaces. The problems can be summarized as tasks involving: the development of an appropriate conceptual data organization; the development of a model for metadata organization; the provision of a primitive user interface that includes a set of simple search and browsing operations; the provision of special tools for image processing, including operations provided by a locally augmented version of Khoros, and for text retrieval, provided by ConQuest; and the provision of a high-level database programming language as a platform for application development.
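
As a rough illustration of the kind of simple search operation a catalogue of spatial metadata might support, the following minimal sketch matches catalogue records against a query region by bounding-box overlap. Everything in it is an assumption made for illustration: the names Footprint, CatalogueRecord and search, the linear scan in place of a real index structure, and the approximate coordinates are not part of the proposed design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Footprint:
    """Axis-aligned bounding box in degrees (hypothetical metadata element)."""
    west: float
    south: float
    east: float
    north: float

    def intersects(self, other: "Footprint") -> bool:
        # Two boxes overlap when they overlap on both axes.
        # Boxes crossing the antimeridian are not handled in this sketch.
        return (self.west <= other.east and other.west <= self.east
                and self.south <= other.north and other.south <= self.north)


@dataclass(frozen=True)
class CatalogueRecord:
    """One metadata entry describing an item in the main datastore."""
    item_id: str
    title: str
    footprint: Footprint
    keywords: frozenset


def search(catalogue, region, keyword=None):
    """Return ids of records overlapping `region`, optionally filtered by keyword."""
    hits = []
    for rec in catalogue:  # linear scan; a real catalogue would use a spatial index
        if not rec.footprint.intersects(region):
            continue
        if keyword is not None and keyword not in rec.keywords:
            continue
        hits.append(rec.item_id)
    return hits


# Illustrative records only; the coordinates are rough approximations.
catalogue = [
    CatalogueRecord("dem-sb", "Santa Barbara County DEM",
                    Footprint(-120.7, 34.3, -119.4, 35.1), frozenset({"elevation"})),
    CatalogueRecord("dcw", "Digital Chart of the World",
                    Footprint(-180.0, -90.0, 180.0, 90.0), frozenset({"basemap"})),
    CatalogueRecord("amazon", "Amazon basin hydrology",
                    Footprint(-80.0, -20.0, -45.0, 5.0), frozenset({"hydrology"})),
]
query = Footprint(-120.0, 34.0, -119.5, 34.5)
print(search(catalogue, query))
print(search(catalogue, query, keyword="elevation"))
```

The linear scan is only meant to make the query semantics concrete; the indexing schemes and access methods of the ``lower'' level would replace it in any system of realistic size.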

The basic strategy for designing and developing the ingest component involves the acquisition of high-performance scanners for basic digitization, and off-the-shelf support for the scanners. We will work with our industrial partner Xerox in this area. We will focus significant attention on the preprocessing of newly digitized and previously digitized information in order to prepare data for permanent storage in a format that is optimal for search, browse and retrieval. In particular, we will focus much attention on wavelet data structures that provide a hierarchical partition of the data and permit natural support for browsing operations. Other important foci of activity involve the extraction of metadata from the information and the preprocessing of the information using a wavelet decomposition in order to facilitate access and retrieval. Issues related to image feature extraction for indexing and metadata creation will be investigated. The storage component will be built using currently available secondary and tertiary storage devices.
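
To make the idea of a hierarchical wavelet partition concrete, the sketch below uses the Haar wavelet, chosen here only because it is the simplest case; the proposal does not say which wavelet family would be used. It assumes a small grayscale image held as a list of rows whose side lengths are powers of two, and the function names are hypothetical. The coarsest level acts as a thumbnail for browsing, and each stored detail level refines it until the original image is recovered exactly.

```python
def haar_step(img):
    """One 2-D Haar level: split an even-sized image into a half-size
    approximation (2x2 block averages) and three detail values per block."""
    approx, details = [], []
    for r in range(0, len(img), 2):
        a_row, d_row = [], []
        for c in range(0, len(img[0]), 2):
            a, b = img[r][c], img[r][c + 1]
            e, f = img[r + 1][c], img[r + 1][c + 1]
            a_row.append((a + b + e + f) / 4)
            d_row.append(((a - b + e - f) / 4,   # left/right difference
                          (a + b - e - f) / 4,   # top/bottom difference
                          (a - b - e + f) / 4))  # diagonal difference
        approx.append(a_row)
        details.append(d_row)
    return approx, details


def haar_unstep(approx, details):
    """Invert haar_step: rebuild the finer image from approximation + details."""
    img = []
    for a_row, d_row in zip(approx, details):
        top, bottom = [], []
        for s, (h, v, d) in zip(a_row, d_row):
            top += [s + h + v + d, s - h + v - d]
            bottom += [s + h - v - d, s - h - v + d]
        img += [top, bottom]
    return img


def build_pyramid(img):
    """Decompose repeatedly; return the coarsest image and the detail
    levels ordered from coarse to fine."""
    levels = []
    while len(img) > 1 and len(img[0]) > 1:
        img, det = haar_step(img)
        levels.append(det)
    return img, levels[::-1]


def browse(coarse, levels, depth):
    """Reconstruct the image using only the first `depth` detail levels."""
    img = coarse
    for det in levels[:depth]:
        img = haar_unstep(img, det)
    return img


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
coarse, levels = build_pyramid(image)
for depth in range(len(levels) + 1):
    print(depth, browse(coarse, levels, depth))
```

A browse client under this scheme would fetch the coarse image first and request finer detail levels only for items the user chooses to inspect, which is one way a hierarchical partition can support browsing over a network.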

Communications will be of central importance in the testbed system. While designs of previous prototype systems providing electronic library services for image and text retrieval were dictated by slow communication technology, the Alexandria Project proposes to make use of the emerging broadband integrated services digital network (B-ISDN) technology. The bandwidth of B-ISDN systems will lessen the need for data compression for transmission.

The Construction of a Testbed System: The rapid prototype will incorporate all the major components of the proposed architecture but will be a single centralized system, serving multiple users over a local area network. This prototype will be built in close collaboration with ESRI, which will provide software, and DEC, which will provide hardware. The prototype incorporates the design criteria from GRIN and extends them with newer technologies, modified requirements, and broader functionality in relation to the classes of data that the system is capable of handling. It also incorporates lessons learned from operational systems such as GLIS, WAIS, and Mosaic. This system will provide a basis for populating a test database; testing metadata elements on various data types; testing user reaction to various interface elements; building, using existing digital map data, spatial query backgrounds for footprint overlay; constructing thesaurus query tools; and creating browsable images linked to metadata records. This prototype will provide a simulation model of the application system and user interfaces. Once the prototype is operational at UCSB, we plan to start development of the testbed based on the proposed architecture of Alexandria. The Alexandria testbed system itself will be a fully distributed system. It will contain each of the components of the architecture discussed above and will provide transparent access to information distributed over wide area networks to a user community that is itself distributed geographically.

The prototype and the testbed implementations of the Alexandria project will involve using widely accepted software engineering techniques for designing, developing, and managing large software projects. This methodology combines traditional structured analysis techniques with rapid prototyping at critical points in the design and development process. The Alexandria testbed will be an assemblage of custom and commercial "off-the-shelf" components that must evolve considerably over the life of the project. Two principles will guide the integration and evolution of this system. First, the Alexandria system architecture will be specified primarily in terms of its interfaces. These interfaces will be defined and specified as a result of our experience on the prototype. Second, the Alexandria testbed will include extensive facilities for binding or "gluing" components to the standard interfaces. These facilities will include scripting tools, on-the-fly data format translators, and graphical programming tools and interface builders. The goal is twofold: to maximize the use of existing software and hardware, and to facilitate evolutionary prototyping by minimizing the amount of coding required to build a working system.
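
The two principles above, an architecture specified by its interfaces and "glue" facilities such as on-the-fly format translators, can be illustrated with a minimal sketch. The interface StorageComponent, the classes InMemoryStore and TranslatingStore, and the toy "csv" and "tsv" formats are all hypothetical stand-ins introduced for this example; they do not describe the actual Alexandria interfaces.

```python
from typing import Callable, Dict, Protocol, Tuple


class StorageComponent(Protocol):
    """Hypothetical standard interface that any storage backend must satisfy."""

    def put(self, item_id: str, payload: bytes, fmt: str) -> None: ...

    def get(self, item_id: str) -> Tuple[bytes, str]: ...


class InMemoryStore:
    """Trivial backend standing in for a custom or off-the-shelf component."""

    def __init__(self) -> None:
        self._items: Dict[str, Tuple[bytes, str]] = {}

    def put(self, item_id: str, payload: bytes, fmt: str) -> None:
        self._items[item_id] = (payload, fmt)

    def get(self, item_id: str) -> Tuple[bytes, str]:
        return self._items[item_id]


class TranslatingStore:
    """Glue layer: wraps any StorageComponent and converts formats on request."""

    def __init__(self, backend: StorageComponent) -> None:
        self._backend = backend
        self._translators: Dict[Tuple[str, str], Callable[[bytes], bytes]] = {}

    def register(self, src: str, dst: str, fn: Callable[[bytes], bytes]) -> None:
        self._translators[(src, dst)] = fn

    def put(self, item_id: str, payload: bytes, fmt: str) -> None:
        self._backend.put(item_id, payload, fmt)

    def get(self, item_id: str, want_fmt: str) -> bytes:
        payload, fmt = self._backend.get(item_id)
        if fmt == want_fmt:
            return payload  # already in the requested format
        try:
            translate = self._translators[(fmt, want_fmt)]
        except KeyError:
            raise LookupError(f"no translator from {fmt} to {want_fmt}") from None
        return translate(payload)


glue = TranslatingStore(InMemoryStore())
glue.register("csv", "tsv", lambda data: data.replace(b",", b"\t"))
glue.put("grid-1", b"1,2\n3,4\n", "csv")
print(glue.get("grid-1", "tsv"))
```

Because the glue layer depends only on the interface, the in-memory backend could be swapped for a different component without changing the translators or the calling code, which is the property the interface-first approach is meant to secure.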

It is critically important that the system be compatible with digital spatial data standards, spatial metadata standards and library standards. In particular, we will adhere to the Z39.50 standard for library interchanges. Format standards for digital spatial data include FIPS 173 (the Federal Spatial Data Transfer Standard), VPF, DIGEST, HDF, and a host of formal and informal industry standards for raster and vector data. Finally, it is important that the prototype and the testbed be as compatible as possible with evolving standards in the library community, such as MARC and the FGDC metadata standard.

Collections to be Supported by the Testbed System: We will initially populate the testbed system with several collections of spatially-indexed data that are important and that represent diversity of content, coverage, and type. In particular, these datasets involve coverages that range from local, through regional and national, to international levels; they include both datasets and metadatasets; and they contain images, digitized pictures, and digitized maps of both raster and vector type. As an example of a local area data set, we will build a collection of items for Santa Barbara County, California, including DEMs (digital elevation models); DLGs (digital line graphs); DOQs (digital orthophoto quads); 1990 Census data (both boundary line files and statistics); AVHRR; SPOT data; and Landsat data. As examples of regional level collections, we will construct collections based on the South Florida Ecosystem Restoration Initiative Data Set; the GIS-based analysis of Biodiversity in California; and the collection for the Sierra Nevada Ecosystem Project (SNEP). As an example of a national data set, we will build a collection of DOQs; AVHRR; TIGER files; DLGs; and DEMs. As examples of a world data set, we will build a collection of Digital Chart of the World (DCW); Geographic Names Processing System (GNPS); and a collection relating to the geology, hydrology, climatology, and biology of the entire Amazon basin. We will also build collections of metadata sets.

Populating the Alexandria database will be a substantial effort. A considerable portion of the system design will be devoted to facilitating data ingestion. Much of the data to be loaded into Alexandria is not currently in digital form. Maps, aerial photographs, and printed text will have to be mechanically scanned in addition to having their metadata entered. Even data which are already digitized must be physically loaded into the system. This loading phase must be carefully structured and streamlined to avoid potential bottlenecks. We will attempt to parallelize the loading process for data and metadata. Data will be stored in file-based commercial tertiary storage systems, while metadata will be stored in a DBMS.
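
One possible shape for such a loading phase is sketched below, under stated assumptions: item payloads are written in parallel to a file-based store (a temporary directory stands in for tertiary storage), and the corresponding metadata rows are then inserted into a DBMS (an in-memory SQLite database stands in for the catalogue). The function names, the table layout, and the use of a thread pool are illustrative choices, not the project's design.

```python
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def store_data(root: Path, item_id: str, payload: bytes) -> tuple:
    """Write one item to the file store and return its metadata row."""
    path = root / f"{item_id}.bin"
    path.write_bytes(payload)
    return (item_id, str(path), len(payload))


def load_batch(items: dict, root: Path, db: sqlite3.Connection, workers: int = 4) -> int:
    """Store payloads in parallel, then record their metadata in one transaction."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS catalogue "
        "(item_id TEXT PRIMARY KEY, path TEXT, size_bytes INTEGER)"
    )
    # File writes are independent of one another, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rows = list(pool.map(lambda kv: store_data(root, kv[0], kv[1]), items.items()))
    # Metadata is inserted from a single thread inside one transaction.
    with db:
        db.executemany("INSERT INTO catalogue VALUES (?, ?, ?)", rows)
    return len(rows)


with tempfile.TemporaryDirectory() as tmp:
    db = sqlite3.connect(":memory:")
    batch = {"map-001": b"\x00" * 10, "photo-002": b"\x01" * 20}  # placeholder payloads
    loaded = load_batch(batch, Path(tmp), db)
    sizes = dict(db.execute("SELECT item_id, size_bytes FROM catalogue"))
    db.close()
print(loaded, sizes)
```

Recording the metadata only after the payload has been written means that a catalogue entry never points at a missing file, at the cost of serializing the metadata step; a production loader might instead pipeline the two stages.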

High Performance of the Testbed System: The requisite levels of performance that will make the system truly usable will be based on a three-fold strategy that includes: appropriate data models and data structures, including hierarchical structures that are appropriate for browsing and access operations on large entities; appropriate parallel computing support for several computationally-intensive aspects in the various components; and the linking of components within and between sites with the use of high-speed networks. The system will be scalable over several orders of magnitude of database volume, allowing the system to accommodate applications that range from school and community libraries through utility companies, local government departments, research libraries, and resource agencies. We will investigate how parallel and high performance computation can be incorporated into the system components for increased efficiency, particularly for the storage and retrieval of large data sets, data conversion, optimization of database queries, computations in heterogeneous computing environments, parallel I/O, and routing. In the user interface, we will investigate parallel computing support for performing registration operations, browsing, query operations, fusion and filtering, pattern recognition, various user-requested operations, and development of parallel algorithms for various application problems, particularly those that arise in GIS/EOS research. In the catalogue component, query processing involves the execution of various data operations and the searching of multiple data files. Parallel processing can significantly improve the response time for each individual query. We plan to experimentally validate our developments and algorithms on real-world problems and data sets by running simulation programs on UCSB-accessible parallel computers, including the new 64-node Meiko CS-2 supercomputer the CS Department will be acquiring in March 1994.

Hardware for the Rapid Prototype System: In the first phase of the Alexandria project, hardware for the rapid prototype will be assembled to provide a simulation model of the application system and user interfaces. Data for the prototype will be gathered from low-cost, low-resolution scanners, and existing high-resolution datasets. The scanners will provide for simulation of the ingest system, while the existing electronic datasets will provide realistic data for testing the data manipulation, storage, and display aspects of the system. The scanners will be driven by 486 PCs supplied by Digital. The processing of the raw images into the stored form will be performed on a DEC 3000/300 workstation provided by Digital. This workstation will also be used as the principal software development machine for the prototype. Browse-ready files will be stored on a DEC 3000/800. This system will contain over 256MB of RAM, 50GB of disk, and 100GB of optical or tape tertiary storage. The workstations will be connected via Ethernet to the campus broadband backbone network and the Internet. Remote X-server access will be supported in addition to the high-end access via the DEC 3000/800 front-end. The bulk of the prototype software will use ARCinfo and ARCview, contributed by ESRI. Additional software development on the prototype will be oriented towards supporting the simulation of the system interfaces.

Evaluation of the Testbed System: The on-going evaluation of the testbed by users will be of major importance to the project's success. This entails two segments: gathering user input, first at UCSB and later from selected installations in other libraries; and actively seeking additional funds, from organizations such as NSF and NASA, to perform this critical task. Early in the project, organizations (e.g., public, academic, and special library associations; AAUP) will be alerted to Alexandria's objectives. Selected representatives from these groups will then be given hands-on exposure to the system in order to evaluate its interface, performance and system tools, and their critical comments will be passed to the design team for later versions. From this process a robust and functional set of user tools and interfaces will be developed, system performance improved, and the seeds of Alexandria's acceptance sown in a diverse user community. Upon the completion of a sound beta version of the system, we will solicit proposals from the library community for participation in the formal testing of Alexandria. Sites will be selected based on criteria developed by organizations such as ARL, but should include sites that serve a variety of types of clients, hold important unique collections, and occupy a position of leadership in the information community. Each site will be expected to provide its own hardware and personnel, and to agree to seek input from the broadest user communities on interface design, manipulation/visualization tools and system performance. Each will perform the full scope of library activities using the system (e.g., data input, metadata construction, query, data and metadata retrieval and local visualization). These tests will be performed in concert with the design team through frequent, systematic reporting mechanisms. The test results will be used to tune the system as well as to collect new user requirements for later incorporation into the design.

Part of our strategy in building a digital library with comprehensive services for images and spatially-indexed information is to encourage many groups to make their spatial data sets accessible and available to a broader community. This project will explore a range of incentives for enticing producers to make their spatial data sets available, assess the practical and legal limits of the various incentives, and evaluate and gauge the impact of varying information policy arrangements on government and the private sector in increasing general access to spatial data.

Ron Dolin
Wed Dec 7 23:25:02 PST 1994
