Conceptual data model and schema design

The conceptual-level schema, i.e., the catalogue structures, should present to the user a simple, clear view of the format of the datasets and their descriptions, independent of the physical implementation. Unlike existing catalogue systems (such as Melvyl, the UC library system), which essentially represent the catalogue database in a flat relational view (e.g., Title, Author, and Call Number are among the fields), the digital library should allow nested structures to be represented. Moreover, there is no standardized spatial data format, and it is unlikely that one will be adopted in the near future, since many application programs have been developed assuming one format or another. The underlying data model used to define the conceptual schema should therefore support nested structures and multiple data formats, as well as the development of new application packages.

We will develop a new data model based on the one we have developed for supporting earth-science investigations [62][63]. Briefly, the model is object-based and supports encapsulation of datasets, metadata, and operations. It also allows data and behavior inheritance and nested structures. In particular, it permits a dataset to be stored and accessed in many formats. Thus each type of dataset is described by a single abstract specification that includes the format of its metadata and the names and types of its permitted operations.
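As a rough illustration of the encapsulation and inheritance described above, the following Python sketch bundles a dataset's metadata, stored payload, and operations into one object and exposes the payload in more than one format. All class, field, and format names here (Dataset, ElevationDataset, "hex", and so on) are assumptions made for the example; they are not taken from the model in [62][63].

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Dataset:
    """A dataset bundled with its metadata and operations (hypothetical names)."""
    metadata: Dict[str, object]
    payload: bytes                 # data held in its source format
    source_format: str
    # Format name -> function converting the source payload into that format.
    converters: Dict[str, Callable[[bytes], bytes]] = field(default_factory=dict)

    def as_format(self, fmt: str) -> bytes:
        """Return the payload in the requested format, converting if needed."""
        if fmt == self.source_format:
            return self.payload
        if fmt not in self.converters:
            raise ValueError(f"format {fmt!r} is not available for this dataset")
        return self.converters[fmt](self.payload)

    def describe(self) -> str:
        """One-line summary built from the metadata."""
        return f"{self.metadata.get('title', 'untitled')} [{self.source_format}]"


@dataclass
class ElevationDataset(Dataset):
    """Inherits data and behavior from Dataset and adds one operation."""

    def byte_count(self) -> int:
        # Placeholder operation: size of the stored payload in bytes.
        return len(self.payload)


# Illustrative instance; the "hex" output format is made up for this sketch.
sample = ElevationDataset(
    metadata={"title": "Sample"},
    payload=b"\x01\x02\x03",
    source_format="DEM",
    converters={"hex": lambda raw: raw.hex().encode()},
)
```

The caller asks for a format by name and never touches the stored representation directly, which is the point of describing each dataset type by a single abstract specification.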

There are two classes of types in the model: ASCII types and binary types. The basic ASCII types include integer, real, string, text, etc. The basic binary types, also called formats, include popular data formats used by data sources; examples are ARC (ARC-INFO), DEM (USGS), and Band SeQuential (ERDAS). Spatial index types such as features and similarity measures also belong to the binary types. Structured types are composed from the basic ones by type constructors (e.g., tuple).
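The type system above can be sketched in Python as two kinds of basic types plus a tuple constructor. This is only one possible reading: the class names, the tuple_of helper, and the field names in the example record are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class AsciiType:
    """A basic ASCII type such as integer, real, string, or text."""
    name: str


@dataclass(frozen=True)
class BinaryType:
    """A basic binary type (a 'format'), tagged with where it comes from."""
    name: str
    origin: str


@dataclass(frozen=True)
class TupleType:
    """A structured type: an ordered sequence of (field name, type) pairs."""
    fields: Tuple[Tuple[str, "Type"], ...]


Type = Union[AsciiType, BinaryType, TupleType]

REAL = AsciiType("real")
STRING = AsciiType("string")
DEM = BinaryType("DEM", origin="USGS")


def tuple_of(**fields: Type) -> TupleType:
    """Type constructor: build a tuple type from named component types."""
    return TupleType(tuple(fields.items()))


def is_basic(t: Type) -> bool:
    """True for the basic ASCII and binary types, False for structured types."""
    return isinstance(t, (AsciiType, BinaryType))


# A nested structured type; the field names are illustrative only.
extent = tuple_of(west=REAL, east=REAL)
record = tuple_of(title=STRING, extent=extent, data=DEM)
```

Because a tuple type may contain other tuple types, the constructor yields the nested structures that a flat relational view cannot express.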

Datasets are organized into categories such as analog maps, satellite images, etc. Each category has four parts: a structural description of the metadata, a specification of the operations applicable to datasets in the category, the format of the original (source) datasets, and a list of the concrete formats in which each dataset can be used. A metadata structure consists of many fields, each of which may have a value of a basic type. The operations cover essential processing and computation needs for the category. An example category of raw (i.e., unprocessed) DEM datasets is described in Fig. 2. In the figure, the description field contains the relevant metadata, and the tool box holds commonly used operations for manipulating DEMs. Since the datasets are primarily obtained from USGS, the source format is the one USGS uses. Other formats that the user can request as output are listed explicitly.

Figure 2: The Category of Raw DEMs
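The four-part structure of a category might be written down as follows. This Python sketch does not reproduce the contents of Fig. 2: the metadata fields, the single tool-box operation, and the output format listed for the raw DEM category are placeholders chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Category:
    """A category with the four parts named in the text."""
    name: str
    description: Dict[str, str]      # metadata field name -> basic type name
    tool_box: Dict[str, Callable]    # operation name -> implementation
    source_format: str               # format of the original datasets
    output_formats: List[str]        # other formats offered to the user

    def missing_fields(self, metadata: Dict[str, object]) -> List[str]:
        """Names of description fields absent from a metadata record."""
        return [name for name in self.description if name not in metadata]

    def supports(self, fmt: str) -> bool:
        """True if datasets in this category can be delivered in fmt."""
        return fmt == self.source_format or fmt in self.output_formats


def min_elevation(grid: List[List[float]]) -> float:
    """Placeholder tool-box operation: lowest value in an elevation grid."""
    return min(min(row) for row in grid)


# Hypothetical raw DEM category; every field value below is an assumption.
raw_dem = Category(
    name="raw DEM",
    description={"title": "string", "resolution_m": "real"},
    tool_box={"min_elevation": min_elevation},
    source_format="DEM",
    output_formats=["ARC"],
)
```

Keeping the source format separate from the output formats mirrors the distinction the text draws between how datasets arrive and how users may receive them.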

Categories are high-level specifications of conceptual domains, analogous to module interfaces in programming languages or to the specification part of Abstract Data Types (ADTs). The concrete formats provide versatile views of the same datasets for the convenience of end users and application programs. Categories form natural hierarchies. For example, the categories of collections, conference proceedings, and books are aggregated into a new category called publications. In this way, users can search general categories as well as specific ones as needed.
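A minimal sketch of such a hierarchy, using the publications example, is given below in Python. The CategoryNode class, its search method, and the catalogue entries are all invented for illustration; the only assumption about the model is that a search in a general category also covers the categories aggregated under it.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class CategoryNode:
    """A category holding its own entries and any aggregated sub-categories."""
    name: str
    children: List["CategoryNode"] = field(default_factory=list)
    entries: List[str] = field(default_factory=list)

    def walk(self) -> Iterator["CategoryNode"]:
        """Yield this category, then every category below it, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()

    def search(self, keyword: str) -> List[str]:
        """Case-insensitive keyword match over this category and its descendants."""
        needle = keyword.lower()
        return [
            entry
            for node in self.walk()
            for entry in node.entries
            if needle in entry.lower()
        ]


# Made-up entries, used only to exercise the search.
collections = CategoryNode("collections", entries=["Collected Survey Maps"])
proceedings = CategoryNode("conference proceedings",
                           entries=["Proceedings on Spatial Databases"])
books = CategoryNode("books", entries=["Atlas of California"])
publications = CategoryNode("publications",
                            children=[collections, proceedings, books])
```

Searching publications for "atlas" reaches the entry stored under books, while the same search started at conference proceedings stays within that specific category and finds nothing.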

We will develop the full data definition and manipulation languages. They will be embedded in a programming language such as C++. Advanced application packages like graphical user interfaces can be developed in the resulting language.

Ron Dolin
Wed Dec 7 23:25:02 PST 1994