next up previous contents
Next: IMAGE PROCESSING TEAM Up: RESEARCH ACTIVITIES AND Previous: INTERFACE DESIGN &

INFORMATION SYSTEMS TEAM

Membership: Su (leader), Agrawal, El Abbadi, Freeston, Frew, Singh, Smith, Cheng, Dolin, Kothuri, Prabhakar, Wu

Mission Statement of Team: The Team's research focuses on a number of issues pertaining to the architecture of digital libraries, including resource discovery, extensible data stores, and the implications of building a geographically referenced database. In relation to geographically referenced databases, issues under investigation include multi-dimensional indexing, data placement, tertiary storage, content-based retrieval, and performance tuning.

Research activities over the last year

Abstracts of Published and Submitted Articles

  1. ``Scalable Access within the Context of Digital Libraries'', X. Cheng, R. Dolin, M. Neary, S. Prabhakar, K. Ravi Kanth, D. Wu, D. Agrawal, A. El Abbadi, M. Freeston, A. Singh, T. Smith, J. Su, IEEE International Conference on Advances in Digital Libraries, IEEE ADL '97, May 7-9, 1997, Library of Congress, Washington, D.C., pp. 70-81.

    This paper presents a summary of some of the work-in-progress within the Alexandria Digital Library Project. In particular, we present scalable methods of locating information at different levels within a distributed digital library environment. Starting at the high level, we show how queries can be routed to appropriate information sources. At a given source, efficient query processing is supported by using materialized views and multidimensional index structures. Finally, we propose solutions to the problem of storage and retrieval of large objects on secondary and tertiary storage.

  2. ``Scalable Access within the Context of Digital Libraries'', X. Cheng, R. Dolin, M. Neary, S. Prabhakar, K. Ravi Kanth, D. Wu, D. Agrawal, A. El Abbadi, M. Freeston, A. Singh, T. Smith, J. Su, International Journal on Digital Libraries, IJODL, Vol. 1, No. 4, 1998.

    Abstract as above.

  3. ``Pharos: A Scalable Distributed Architecture for Locating Heterogeneous Information Sources'' Dolin, R. and Agrawal, D. and El Abbadi, A. and Dillon, L., Proceedings of the 6th International Conference on Information and Knowledge Management (CIKM '97), Las Vegas, Nevada, Nov. 1997.

    Pharos is a scalable distributed architecture for locating heterogeneous information sources. The system incorporates a hierarchical metadata structure into a multi-level retrieval system. Queries are resolved through an iterative decision-making process. The first step retrieves coarse-grain metadata, about all sources, stored on local, massively replicated, high-level servers. Further steps retrieve more detailed metadata, about a greatly reduced set of sources, stored on remote, sparsely replicated, topic-based mid-level servers. We present results of a simulation that indicate the feasibility of the architecture. We describe the structure, distribution, and retrieval of the metadata in Pharos to enable users to locate desirable information sources over the Internet.
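
    The iterative, coarse-to-fine source selection described above can be sketched in a few lines of Python. All server contents, source names, and topic counts below are invented for illustration; this is a toy model of the architecture, not the Pharos implementation.

```python
# Hypothetical sketch of Pharos-style multi-level source selection.
# Names, topics, and counts are illustrative, not from the actual system.

# Coarse-grain metadata: for each source, which top-level topics it covers.
# In the architecture this lives on massively replicated high-level servers.
COARSE = {
    "src-a": {"science"},
    "src-b": {"science", "history"},
    "src-c": {"arts"},
}

# Fine-grain metadata: per-source document counts for narrower subtopics,
# held on sparsely replicated, topic-based mid-level servers.
FINE = {
    "src-a": {"science/geology": 120, "science/physics": 40},
    "src-b": {"science/geology": 5, "history/maps": 300},
}

def locate(topic, subtopic, min_docs=10):
    # Step 1: coarse filtering against the replicated high-level metadata.
    candidates = [s for s, topics in COARSE.items() if topic in topics]
    # Step 2: refine the reduced candidate set against mid-level metadata.
    ranked = [(FINE.get(s, {}).get(subtopic, 0), s) for s in candidates]
    return [s for n, s in sorted(ranked, reverse=True) if n >= min_docs]

print(locate("science", "science/geology"))  # ['src-a']
```

    The point of the two-step structure is that the expensive, detailed metadata is consulted only for the small candidate set that survives the coarse filter.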

  4. ``Using Automated Classification for Summarizing and Selecting Heterogeneous Information Sources'' R. Dolin, D. Agrawal, A. El Abbadi, and J. Pearlman, DLIB, January 1998.

    Information retrieval over the Internet increasingly requires the filtering of thousands of heterogeneous information sources. Important sources of information include not only traditional databases with structured data and queries, but also increasing numbers of non-traditional, semi- or un-structured collections such as Web sites, FTP archives, etc. As the number and variability of sources increases, new ways of automatically summarizing, discovering, and selecting collections relevant to a user's query are needed. One such method involves the use of classification schemes, such as the Library of Congress Classification (LCC), within which a collection may be represented based on its content, irrespective of the structure of the actual data or documents. For such a system to be useful in a large-scale distributed environment, it must be easy to use for both collection managers and users. As a result, it must be possible to classify documents automatically within a classification scheme. Furthermore, there must be a straightforward and intuitive interface with which the user may use the scheme to assist in information retrieval (IR). We are currently experimenting with newsgroups as collections. We have built an initial prototype which automatically classifies and summarizes newsgroups within the LCC. The prototype uses electronic library catalog records as a `training set' and Latent Semantic Indexing (LSI) for IR. This work is extensible to other types of classification, including geographical, temporal, and image feature.
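
    The core of the prototype's approach, classifying a document by folding it into an LSI latent space built from labeled training records, can be sketched as follows. The term list, counts, and two-class labels are invented; a real training set would be library catalog records classified within the LCC.

```python
import numpy as np

# Toy "training set": rows = terms, columns = catalog records,
# each record labeled with an LCC-like class. All data is hypothetical.
terms = ["map", "geology", "poem", "sonnet"]
docs = np.array([
    [3, 2, 0, 0],   # 'map'
    [2, 3, 0, 0],   # 'geology'
    [0, 0, 3, 2],   # 'poem'
    [0, 0, 2, 3],   # 'sonnet'
], dtype=float)
labels = ["G", "G", "P", "P"]   # e.g. LCC G = geography, P = literature

# Latent Semantic Indexing: truncated SVD of the term-document matrix.
U, s, Vt = np.linalg.svd(docs, full_matrices=False)
k = 2
Uk = U[:, :k]
doc_vecs = Vt[:k].T * s[:k]       # training documents in k-dim latent space

def classify(term_counts):
    # Fold the new document into the latent space and return the label of
    # the nearest training document by cosine similarity.
    q = np.asarray(term_counts, dtype=float) @ Uk
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return labels[int(np.argmax(sims))]

print(classify([1, 2, 0, 0]))  # a geology-heavy document classifies as 'G'
```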

  5. ``Summarization and selection of information sources using automated classification'' R. Dolin, D. Agrawal, A. El Abbadi, submitted for publication.

    Information retrieval over the Internet increasingly requires the filtering of thousands of information sources. As the number of sources increases, new ways of automatically summarizing, discovering, and selecting sources relevant to a user's query are needed. Pharos is a highly scalable distributed architecture for locating heterogeneous information sources. Its design is hierarchical, thus allowing it to scale well as the number of information sources increases. We demonstrate the feasibility of the Pharos architecture using 2500 Usenet newsgroups as separate collections. Each newsgroup is summarized via automated Library of Congress classification. We show that using Pharos as an intermediate retrieval mechanism provides acceptable accuracy of source selection compared to selecting sources using complete classification information, while maintaining reasonable scalability.

  6. ``Making Queries Travel: A Framework Towards Data Integration'' X. Cheng and G. Dong and T. Lau and J. Su, submitted for consideration.

    In this paper we develop a data integration approach for the efficient evaluation of queries over autonomous source databases. The approach is based on some novel applications and extensions of constraint database techniques. We assume the existence of a global database schema. The contents of each data source are described using a set of constraint tuples over the global schema; each such tuple indicates possible contributions from the source. The ``source description catalog'' (SDC) of a global relation consists of its associated constraint tuples. This style of description is more advantageous than approaches such as Tsimmis, where the correspondence between sources and the global schema is compiled into the mediators, because it allows flexible addition and modification of sources.

    In our framework, to evaluate a conjunctive query over the global schema, a plan generator first identifies relevant data sources by ``evaluating'' the query against the SDCs using techniques of constraint query evaluation; it then formulates an evaluation plan, consisting of some specialized queries over different paths. The evaluation of a query associated with a path is done by a sequence of partial evaluations at data sources along the path, similar to sideways information passing in Datalog; the partially evaluated queries travel along their associated paths. Our SDC-based query planning is more efficient than approaches such as Information Manifold and SIMS because it avoids their NP-complete query rewriting process. We can achieve further optimization using techniques such as the emptiness test.
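
    The first planning step, identifying relevant sources by evaluating a query against the SDCs, amounts to intersecting the query's constraints with each source's constraint tuples. A minimal sketch, with an invented catalog over a single attribute of a hypothetical global schema:

```python
# Hypothetical sketch of pruning sources via a "source description
# catalog" (SDC). Each constraint tuple bounds the values a source can
# contribute for an attribute of the global schema. Data is invented.

SDC = {
    # source: {attribute: (low, high)}, one constraint tuple per source
    "west-db":  {"longitude": (-125.0, -115.0)},
    "east-db":  {"longitude": (-80.0, -65.0)},
    "world-db": {"longitude": (-180.0, 180.0)},
}

def relevant_sources(query):
    """Keep sources whose constraint tuples intersect the query's ranges."""
    hits = []
    for source, tuples in SDC.items():
        ok = True
        for attr, (qlo, qhi) in query.items():
            lo, hi = tuples.get(attr, (float("-inf"), float("inf")))
            if hi < qlo or lo > qhi:      # empty intersection: prune source
                ok = False
                break
        if ok:
            hits.append(source)
    return sorted(hits)

# A query over California longitudes touches only two of the sources.
print(relevant_sources({"longitude": (-122.0, -119.0)}))
```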

  7. ``A Java-based Distributed Framework for Distributed Processing'' Daniel Wu, Divy Agrawal, Amr El Abbadi and Ambuj Singh, Proc. of Intl. Conf. on Conceptual Modeling 1997 (ER '97), UCLA, pp. 333-346, November 3-6, 1997.

    The Alexandria Digital Library Project at UC Santa Barbara has been building an information retrieval system for geographically-referenced information and datasets. To meet its requirements, we have designed a distributed Data Store to store its holdings. The library's map, image and geographical data are viewed as a collection of objects with evolving roles. Developed in the Java programming language and the HORB distributed object system, the Data Store manages these objects for flexible and scalable processing. To implement the Data Store, we provide a messaging layer that allows applications to distribute processing between the Data Store and the local host. We define a data model for Data Store repositories that provides client access to Data Store objects. Finally, we provide support for specialized views of these Data Store items.

  8. ``StratOSphere: Mobile Processing of Distributed Objects in Java'', D. Wu, D. Agrawal and A. El Abbadi, submitted for consideration.

    We describe the design and implementation of StratOSphere, a framework which unifies distributed objects and mobile code applications. We begin by first examining different mobile code paradigms that distribute processing of code and data resource components across a network. After analyzing these paradigms, and presenting a lattice of functionality, we then develop a layered architecture for StratOSphere, incorporating higher levels of mobility and interoperability at each successive layer. In our design, we provide an object model that permits objects to migrate to different sites, select among different method implementations, and provide new methods and behavior. We describe how we build new semantics in each software layer, and finally, we present sample objects developed for StratOSphere.

  9. ``Optimal Dynamic Range Searching in Non-replicating Index Structures'' K. V. Ravi Kanth and Ambuj Singh, submitted for publication.

    Multi-dimensional range searching has important applications in several fields, including spatial, image and text databases. Several data structures using linear and non-linear storage space have been proposed in the database and computational geometry literature. These structures support poly-logarithmic range query time at the cost of super-linear storage space for n d-dimensional data points. Using linear space, improved query times can be achieved by replicating each data item several times. Such large storage-space structures may not be feasible for several databases. In this paper, we examine the complexity of range searching when data items are not replicated. Such non-replicating structures achieve low storage costs and fast update times due to the lack of multiple copies. Examples of non-replicating structures include R-trees, Bang files, quad trees and k-d trees. We first obtain a lower bound for range searching in such non-replicating index structures. Assuming a simple tree structure model of an index, we prove a worst-case lower bound on the time for a query retrieving t out of n data items, expressed in terms of the data dimensionality d and the capacity b of index nodes. We then propose a new index structure, called the O-tree, that achieves this query time in dynamic environments, with efficient amortized updates and worst-case exact match queries. This structure improves the query time of the best known non-replicating structure, the divided k-d tree, and is optimal for both queries and updates among non-replicating tree structures. In contrast to these lower and upper bounds, we prove that index structures such as R-trees and Bang files exhibit a strictly worse worst-case complexity.
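
    As a concrete example of a non-replicating tree structure, a k-d tree stores each point exactly once and answers range queries by pruning subtrees that cannot intersect the query box. The following sketch is illustrative only; the O-tree proposed in the paper is a different, more refined structure.

```python
# Minimal k-d tree range search: each point is stored exactly once, so
# storage is linear and updates need not touch multiple copies.

def build(points, depth=0):
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build(points[:mid], depth + 1),
            build(points[mid + 1:], depth + 1))

def range_search(node, lo, hi, depth=0):
    if node is None:
        return []
    point, left, right = node
    axis = depth % len(point)
    out = []
    if all(l <= x <= h for x, l, h in zip(point, lo, hi)):
        out.append(point)
    if lo[axis] <= point[axis]:        # query box may reach the left subtree
        out += range_search(left, lo, hi, depth + 1)
    if point[axis] <= hi[axis]:        # ... and the right subtree
        out += range_search(right, lo, hi, depth + 1)
    return out

tree = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(sorted(range_search(tree, (4, 1), (8, 5))))  # [(5, 4), (7, 2), (8, 1)]
```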

  10. ``Efficient Dynamic Range Searching with Data Replication'' K. V. Ravi Kanth and Ambuj Singh, submitted for publication.

    In this paper, we consider the problem of dynamic range searching in linear-space index structures. We propose a new dynamic structure that combines Bentley's multilevel range structure with Overmars' dynamization techniques. The new structure improves on the update and query time complexities obtained by applying standard static-to-dynamic transformations to Bentley's structure.

  11. ``Dimensionality Reduction for Similarity Searching in Dynamic Databases'' K. V. Ravi Kanth, Divyakant Agrawal and Ambuj Singh, ACM SIGMOD Conference, Seattle, WA, 1998.

    Databases are increasingly being used to store multi-media objects such as maps, images, audio and video. Storage and retrieval of these objects is accomplished using multi-dimensional index structures such as R-trees and SS-trees. As dimensionality increases, query performance in these index structures degrades. This phenomenon, generally referred to as the dimensionality curse, can be circumvented by reducing the dimensionality of the data. Such a reduction is however accompanied by a loss of precision of query results. Current techniques such as QBIC use SVD transform-based dimensionality reduction to ensure high query precision. The drawback of this approach is that SVD is expensive to compute, and therefore not readily applicable to dynamic databases. In this paper, we propose novel techniques for performing SVD-based dimensionality reduction in dynamic databases. When the data distribution changes considerably so as to degrade query precision, we recompute the SVD transform and incorporate it in the existing index structure. For recomputing the SVD-transform, we propose a novel technique that uses aggregate data from the existing index rather than the entire data. This technique reduces the SVD-computation time without compromising query precision. We then explore efficient ways to incorporate the recomputed SVD-transform in the existing index structure without degrading subsequent query response times. These techniques reduce the computation time by a factor of 20 in experiments on color and texture image vectors. The error due to approximate computation of SVD is less than 10%.
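
    The basic transform pipeline, centering the data, taking a truncated SVD, and projecting feature vectors onto the leading singular directions, can be sketched with NumPy on synthetic data. The dimensions and noise level below are invented, and the paper's aggregate-based recomputation technique is not shown.

```python
import numpy as np

# Sketch of SVD-based dimensionality reduction for feature vectors.
rng = np.random.default_rng(0)
n, d, k = 200, 16, 4

# Synthetic high-dimensional vectors whose variance lies mostly in a
# k-dimensional subspace, as is typical of image feature data.
basis = rng.normal(size=(k, d))
data = rng.normal(size=(n, k)) @ basis + 0.01 * rng.normal(size=(n, d))

# Centre the data, then keep the top-k right singular vectors.
mean = data.mean(axis=0)
U, s, Vt = np.linalg.svd(data - mean, full_matrices=False)
transform = Vt[:k].T                       # d x k projection matrix

reduced = (data - mean) @ transform        # n x k keys for the index
restored = reduced @ transform.T + mean    # approximate reconstruction

# Relative reconstruction error measures the precision lost by reducing d.
err = np.linalg.norm(data - restored) / np.linalg.norm(data)
print(reduced.shape, float(err))
```

    In a dynamic database the interesting question, addressed by the paper, is when and how to recompute `transform` as the data distribution drifts, without rebuilding the index from scratch.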

  12. ``Improved Concurrency Control Techniques for Multi-dimensional Index Structures'' K. V. Ravi Kanth, David F. Serena and Ambuj Singh, IEEE Parallel Processing Symposium (IPPS/SPDP '98), Orlando, FL, 1998.

    Multi-dimensional index structures such as R-trees enable fast searching in high-dimensional spaces. They differ from uni-dimensional structures in the following aspects: (1) index regions in the tree may be modified during ordinary insert and delete operations, and (2) node splits during inserts are quite expensive. Both these characteristics may lead to reduced concurrency of update and query operations. In this paper, we examine how to achieve high concurrency for multi-dimensional structures. First, we develop a new technique for efficiently handling index region modifications. Then, we extend it to reduce/eliminate query blocking overheads during node-splits. We examine two variants of this extended scheme -- one that reduces the blocking overhead for queries, and another that completely eliminates it. Experiments on image data on a shared-memory multi-processor show that these schemes achieve up to 2 times higher throughput than existing techniques, and scale well with the number of processors.

  13. ``Indexing Non-uniform Spatial Data'' K. V. Ravi Kanth, Amr El Abbadi, Divyakant Agrawal and Ambuj Singh, International Database Engineering and Applications Symposium, Montreal, Canada, pp. 289-298, August 1997.

    Non-uniformity in data extents is a general characteristic of spatial data. Indexing such non-uniform data using conventional spatial index structures such as R-trees is inefficient for two reasons: (1) the non-uniformity increases the likelihood of overlapping index entries, and (2) clustering of non-uniform data is likely to index more dead space than clustering of uniform data. To reduce the impact of these anomalies, we propose a new scheme that promotes data objects to higher levels in tree-based index structures. We examine two criteria for promotion of data objects and evaluate their relative merits using an R-tree. In experiments on cartographic data, we observe that our promotion criteria yield up to 45% improvement in query performance for an R-tree.

  14. ``Parallelizing Multidimensional Index Structures'' K. V. Ravi Kanth, Amr El Abbadi, Divyakant Agrawal, Ambuj Singh and T. R. Smith, IEEE Symposium on Parallel and Distributed Processing (SPDP 96), New Orleans, pp. 376-383, October 1996.

    Indexing multidimensional data is inherently complex, leading to slow query processing. This behavior becomes more pronounced with increases in database size and/or the number of dimensions. In this paper, we address this issue by processing an index structure in parallel. First, we study different ways of partitioning an index structure. We then propose efficient algorithms for processing each query in parallel on the index structure. Using these strategies, we parallelized two multidimensional index structures -- R* and LIB -- and evaluated the performance gains for the Gazetteer and Catalog data of the Alexandria Digital Library on the Meiko CS-2.

  15. ``Optimizing Relational Queries by Materializing Natural Joins'', X. Cheng and J. Su, Workshop on Information Technologies and Systems, 1997.

    Efficient evaluation of user queries is critical in database applications involving very large amounts of data. In these applications, especially the ones where query answers are expected in real time, performance of query evaluation in a DBMS may be poor; often in such cases, query performance is extremely sensitive to the structure of the database schema. For example, if a query joins several very large relations (in terms of the number of tuples), the joins may be very expensive even with efficient algorithms. On the other hand, if such joins are ``materialized'', i.e., precomputed, stored, and properly indexed, the joins can be avoided at query evaluation time. In our experiments, the performance improvement from join elimination is significant. Based on the analysis of several large operational database application systems and experimental results, we argue that normalized database schemas should be retained for the sake of semantic integrity (upon updates) and that some joins can be materialized to improve query performance. In this paper, we develop techniques to (1) select joins to be materialized based on query statistics and ER diagrams, (2) generate materialized joins with tags to avoid duplicate removal, and (3) translate queries on the base relations to equivalent ones using materialized joins. We also describe an architecture to embed this performance tuning method into existing application systems and a prototype implemented within the Alexandria Digital Library system. We report experimental results on this prototype.
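
    A minimal illustration of the idea, using Python's built-in sqlite3 with an invented two-table schema: the natural join is computed once into an indexed table, and subsequent queries read that table instead of joining. (The paper's tagging scheme for avoiding duplicate removal is not shown.)

```python
import sqlite3

# Toy illustration of materializing a natural join to avoid computing it
# at query time. Schema and data are invented for the example.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE item(item_id INTEGER PRIMARY KEY, title TEXT, coll_id INTEGER);
    CREATE TABLE collection(coll_id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO collection VALUES (1, 'maps'), (2, 'photos');
    INSERT INTO item VALUES (10, 'CA relief map', 1), (11, 'aerial shot', 2);

    -- Materialize the join once; later queries hit one indexed table.
    CREATE TABLE item_coll AS
        SELECT i.item_id, i.title, c.name
        FROM item i NATURAL JOIN collection c;
    CREATE INDEX idx_item_coll_name ON item_coll(name);
""")

# A query that would otherwise join two tables reads the materialized one.
rows = con.execute(
    "SELECT title FROM item_coll WHERE name = 'maps'").fetchall()
print(rows)  # [('CA relief map',)]
```

    The trade-off, as the abstract notes, is that the base relations stay normalized for update integrity while the materialized table must be refreshed when they change.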

  16. ``More BANG for your buck: a performance comparison of BANG and R* spatial indexing'', Michael Freeston, Steven Geffner, Mike Horhammer, submitted for publication.

    A current trend in database architecture is to provide 'data blades' or 'data cartridges' as 'plug-in' indexing methods to support new data types. The research project which gave rise to this paper aims to test the practicality of a diametrically opposite approach: the development of a new, generic indexing technology i.e. a single indexing technique capable of supporting a wide range of data types. We believe that BANG indexing [Fre87] is now a viable candidate for such a technology, as a result of a series of extensions and refinements, and fundamental improvements in worst-case characteristics made possible by recent theoretical advances. The task is therefore to test whether this single generalized technique can match the performance of several other specialized methods.

  17. ``Cyclic Allocation of Two-Dimensional Data'', S. Prabhakar, K. Abdel-Ghaffar, D. Agrawal and A. El Abbadi, International Conference on Data Engineering, ICDE'98, Orlando, Florida.

    Various proposals have been made for declustering two-dimensionally tiled data on multiple I/O devices. Recently, it was shown that strictly optimal solutions only exist under very restrictive conditions on the tiling of the two-dimensional space or for very few I/O devices. In this paper we explore allocation methods where no strictly optimal solution exists. We propose a general class of allocation methods, referred to as cyclic declustering methods, and show that many existing methods are instances of this class. As a result, various seemingly ad hoc and unrelated methods are presented in a single framework. Furthermore, the framework is used to develop new allocation methods that give better performance than any previous method and that approach the best feasible performance.
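
    A cyclic allocation assigns tile (i, j) to disk (h*i + j) mod M for some skip value h; different choices of h yield different members of the class. A small sketch (values invented) that also measures the parallel response time of a range query:

```python
# Sketch of cyclic declustering of a tiled 2-D space over m disks.
# The skip value h parameterizes the class; h and m here are invented.

def allocate(i, j, h, m):
    # Cyclic allocation: row i is shifted by h*i positions.
    return (h * i + j) % m

def response_time(rows, cols, h, m):
    """Parallel retrieval cost of a query covering rows x cols tiles:
    the maximum number of its tiles placed on any one disk."""
    load = [0] * m
    for i in range(rows):
        for j in range(cols):
            load[allocate(i, j, h, m)] += 1
    return max(load)

# With 5 disks and skip h = 2, a 2x2 query hits 4 distinct disks,
# so all 4 tiles are fetched in a single parallel step.
print(response_time(2, 2, h=2, m=5))  # 1
```

    With h = 1 the scheme degenerates to row-major striping; on the same 2x2 query it places two tiles on one disk, doubling the response time, which is why the choice of skip value matters.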

  18. ``Browsing and Placement of Multi-resolution Images on Parallel Disks'' Sunil Prabhakar, D. Agrawal, A. El Abbadi, A. Singh and T. Smith, 5th Annual Workshop on I/O in Parallel and Distributed Systems, IOPADS, 97, San Jose, CA.

    With rapid advances in computer and communication technologies, there is an increasing demand to build and maintain large image repositories. In order to reduce the demands on I/O and network resources, multiresolution representations are being proposed for the storage organization of images. Image decomposition techniques such as wavelets can be used to provide these multiresolution images. The original image is represented by several coefficients, one of them with visual similarity to the original image, but at a lower resolution. These visually similar coefficients can be thought of as thumbnails or icons of the original image. This paper addresses the problem of storing these multiresolution coefficients on disks so that thumbnail browsing as well as image reconstruction can be performed efficiently. Several strategies are evaluated to store the image coefficients on parallel disks. These strategies can be classified into two broad classes depending on whether the access pattern of the images is used in the placement. Disk simulation is used to evaluate the performance of these strategies. Simulation results are validated with results from experiments with real disks and are found to be in good agreement. The results indicate that significant performance improvements can be achieved with as few as four disks by placing image coefficients based upon browsing access patterns.
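
    The multiresolution idea can be illustrated with one level of a 2-D Haar-style decomposition, in which the low-low coefficient block is a quarter-size, visually similar thumbnail. The sketch below computes only that block, by averaging 2x2 pixel groups, on an invented 4x4 image; a full wavelet transform would also produce the detail coefficients needed for exact reconstruction.

```python
# One level of a 2-D Haar-style decomposition: the 'LL' block of
# coefficients is a lower-resolution thumbnail of the image.

def haar2d_level(img):
    h, w = len(img), len(img[0])
    ll = [[0.0] * (w // 2) for _ in range(h // 2)]
    for i in range(h // 2):
        for j in range(w // 2):
            a, b = img[2 * i][2 * j], img[2 * i][2 * j + 1]
            c, d = img[2 * i + 1][2 * j], img[2 * i + 1][2 * j + 1]
            ll[i][j] = (a + b + c + d) / 4.0   # 2x2 average = thumbnail pixel
    return ll

image = [
    [10, 10, 50, 50],
    [10, 10, 50, 50],
    [90, 90, 30, 30],
    [90, 90, 30, 30],
]
thumb = haar2d_level(image)
print(thumb)  # [[10.0, 50.0], [90.0, 30.0]]
```

    The placement question the paper studies is where to store the small thumbnail coefficients versus the bulky detail coefficients across parallel disks, given that browsing touches many thumbnails but reconstruction touches all coefficients of one image.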

  19. ``Scheduling Tertiary I/O in Database Applications'' Sunil Prabhakar, D. Agrawal, A. El Abbadi and A. Singh, 8th international workshop on Database and Expert Systems Applications, DEXA '97, Toulouse, France.

    With the recent improvements in network and processor speeds many data intensive applications such as digital libraries, data mining, etc., have become more feasible than ever before. The only practical solution for their enormous storage needs is tertiary storage. Automated access to tertiary storage is made possible through tape and optical disk robotic libraries. The slow speeds of operation of the drives and library robotics imply large access times. This high access latency and the bursty nature of I/O traffic result in the accumulation of I/O requests at the storage library. We study the problem of scheduling these requests to improve performance. We focus on scheduling policies that process all requests on a loaded medium before unloading it. For single drive settings an efficient algorithm that produces optimal schedules is developed. For multiple drives the problem is shown to be NP-Complete. Efficient and effective heuristics are presented for the multiple drive case. The scheduling policies developed achieve significant performance gains over more naive policies. The algorithms developed are simple to implement and are not restrictive. The study is general enough to be applicable to any storage library handling removable media, such as tapes and optical disks.
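
    The family of policies studied, processing all requests on a loaded medium before unloading it, can be sketched for a single drive as grouping pending requests by medium and ordering the reads within each medium. The request data below is invented, and the paper's optimal algorithm also chooses the media order and read order by actual seek costs, which this sketch does not model.

```python
# Sketch of a single-drive tertiary scheduling policy: each medium is
# loaded exactly once, and all of its pending requests are served before
# it is unloaded. Media names and offsets are invented.

def schedule(requests):
    """requests: list of (medium, offset) pairs. Returns a plan that
    loads each medium once and reads its offsets in ascending order."""
    by_medium = {}
    for medium, offset in requests:
        by_medium.setdefault(medium, []).append(offset)
    plan = []
    for medium in sorted(by_medium):      # one load + unload per medium
        plan.append((medium, sorted(by_medium[medium])))
    return plan

reqs = [("tape-2", 900), ("tape-1", 40), ("tape-2", 100), ("tape-1", 700)]
print(schedule(reqs))  # [('tape-1', [40, 700]), ('tape-2', [100, 900])]
```

    Serving interleaved requests in arrival order would instead load tape-2, unload it for tape-1, and reload it again; avoiding such media exchanges is the dominant cost saving on robotic libraries.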

  20. ``Efficient Disk Allocation for Fast Similarity Searching'' S. Prabhakar, D. Agrawal and A. El Abbadi, Symposium on Parallel Algorithms and Architectures, SPAA'98, Puerto Vallarta, Mexico, June 28-July 2 1998.

    As databases increasingly integrate non-textual information it is becoming necessary to support efficient similarity searching in addition to range searching. Recently, declustering techniques have been proposed for improving the performance of similarity searches through parallel I/O. In this paper, we propose a new scheme which provides good declustering for similarity searching. In particular, it does global declustering as opposed to local declustering, exploits the availability of extra disks and does not limit the partitioning of the data space. Our technique is based upon the cyclic declustering schemes which were developed for range and partial match queries. We establish, in general, that cyclic declustering techniques outperform previously proposed techniques.

  21. ``Disk Allocation for Multidimensional Range Queries'' S. Prabhakar, K. Abdel-Ghaffar, D. Agrawal and A. El Abbadi, submitted for publication.

    Many scientific and engineering applications process large multidimensional datasets. An important access pattern for these applications is the retrieval of data corresponding to ranges of values in multiple dimensions. Performance of these accesses is limited by the poor performance of magnetic disks, largely due to high disk latencies. Tiling the multidimensional data space and exploiting parallel I/O from multiple disks is an effective technique for improving performance. The gains from parallelism are chiefly determined by the distribution of the tiles across the parallel disks such that as many accesses as possible are maximally parallelised. Several schemes for declustering multidimensional data to improve the performance of range queries have been proposed in the literature. We extend the class of cyclic schemes which have been developed earlier for two-dimensional data to multiple dimensions. We begin by establishing some important properties of these allocation schemes. Based upon these properties, we identify techniques for reducing the search space for determining good declustering schemes within the class of cyclic schemes. Through experimental evaluation, we establish that the cyclic schemes are superior to other declustering schemes both in terms of the degree of parallelism achieved and robustness to variations in parameters. In particular, we show that the exhaustive cyclic scheme gives better performance than the state-of-the-art scheme for declustering multidimensional data for range queries.

  22. ``Tertiary Storage Systems: Current Status and Future Directions'' S. Prabhakar, D. Agrawal, A. El Abbadi, A. Singh, submitted for publication.

    This report summarizes the current state of the art in tertiary storage systems. We begin with a comprehensive discussion of magnetic tape and optical storage technologies. This is followed by a classification of commercial products based on their performance characteristics. Our analysis of product data indicates that, in contrast to disk technology, tertiary storage products have significant variability in data transfer rates as well as other performance figures. We then summarize efforts in the areas of operating systems, databases and advanced applications to integrate tertiary storage. We point out that different assumptions about the underlying technology result in entirely different algorithms and system designs. We conclude the report with speculation on future trends.






Terence R. Smith
Tue Jul 21 09:26:42 PDT 1998