We imagine that a massively parallel computer will act as a server running the Catalogue subsystem, which responds to daily requests from multiple users, and the Ingest subsystem, which pre-processes incoming digital data. Multiple clients (users) run the Application subsystem on their own machines and request digital data from the server over the network. The server processes each query and returns the search results (e.g. image files) through the network, and the users then view these results locally on their own systems.
There are two levels of parallel support for a high-performance digital library. The first, which we will discuss in more detail in the next subsection, consists of parallelizing time-consuming processing operations within an individual query. The second level consists of processing multiple user query requests concurrently in a multi-user network environment. The parallel processing of multiple requests should sustain fast response times even during ``peak'' use of the digital library.
For the second level, the issues we will address are the scheduling of parallel computation (e.g. image searching) and of I/O. Because of the huge volume of heterogeneous data that will be stored, disk I/O will become the bottleneck of parallel processing. Disk arrays can be used to improve I/O performance. Scheduling parallel data access involves both the CPU resources and the I/O channels linking the disk arrays to CPU memories. In the course of this project, we will study the following problems.
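To illustrate why a disk array relieves the I/O bottleneck, the following minimal sketch (our own illustration, not part of the proposed system) stripes a file across an array in fixed-size units so that consecutive units could be fetched from different disks concurrently. The disk count and unit size are illustrative assumptions:

```python
NDISKS = 3   # illustrative array size
UNIT = 4     # bytes per stripe unit (tiny, for illustration only)

def stripe(data, ndisks=NDISKS, unit=UNIT):
    """Distribute consecutive stripe units round-robin over the disks."""
    disks = [[] for _ in range(ndisks)]
    for i, off in enumerate(range(0, len(data), unit)):
        disks[i % ndisks].append(data[off:off + unit])
    return disks

def read_striped(disks):
    """Reassemble the file from its stripes.  A real server would issue
    the per-disk reads concurrently, one outstanding request per channel,
    rather than walking the units sequentially as done here."""
    out, i = bytearray(), 0
    while True:
        d, j = i % len(disks), i // len(disks)
        if j >= len(disks[d]):
            break
        out += disks[d][j]
        i += 1
    return bytes(out)
```

With three disks, three stripe units can be in flight at once, so a large sequential read can approach three times the bandwidth of a single disk.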
Partitioning and Scheduling Multiple Data Accesses Over a Limited Number of Processors: The goal of optimization is load balancing to improve the average response time for information retrieval. We call each information query request a task. Each task involves time-consuming image and text searching and can be parallelized across several processors. The scheduling problem thus reduces to processing a sequence of parallelizable tasks and assigning them to a set of processors. Since each task is parallelizable, we need to consider the partitioning of each task, the allocation of processors to it, and its required response time, such that the resource constraints are satisfied while the average response time per task is minimized.
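As a baseline for this scheduling problem, consider the simplifying assumption of ideal linear speedup: a task of work w on p processors finishes in w/p time. Under that (idealized) model, running tasks one at a time on all P processors in shortest-job-first order minimizes mean response time. The sketch below is ours, not the project's algorithm, and the work estimates are assumed inputs:

```python
def schedule_sjf(works, P):
    """Shortest-job-first schedule for malleable tasks.

    Assumes ideal linear speedup: a task with total work w runs in
    w / P time when given all P processors.  Tasks are run one at a
    time, smallest work first; each task's response time is its
    completion time.  Returns (order, mean_response_time).
    """
    order = sorted(range(len(works)), key=lambda i: works[i])
    t, total = 0.0, 0.0
    for i in order:
        t += works[i] / P      # task i occupies all P processors
        total += t             # response time = completion time
    return order, total / len(works)
```

Real tasks rarely speed up linearly, so the project's schedulers must also weigh partitioning overhead and per-task processor allocation, which this baseline ignores.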
Scheduling Computation and I/O Channels Together: Assuming that each disk channel is used exclusively by one disk, we can maintain a global queue of active query requests and examine the disks that each task needs to access. To avoid disk access conflicts, the scheduler must assign tasks to processors such that no channel is used by more than one task at the same time. One approach is to use graph searching algorithms (e.g. bipartite matching) to find such an assignment. Another issue of interest is balancing the computational load together with channel matching to achieve maximum efficiency.
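The bipartite-matching approach can be sketched with Kuhn's augmenting-path algorithm: tasks form one side of the graph, channels the other, with an edge wherever a task needs a channel; a maximum matching then yields a conflict-free set of assignments for one scheduling round. The input layout (`needs` as a list of channel-id sets) is our illustrative assumption:

```python
def match_tasks_to_channels(needs):
    """Maximum bipartite matching via augmenting paths (Kuhn's algorithm).

    needs[t] is the set of channel ids task t must use this round; each
    channel may serve at most one task at a time.  Returns a dict mapping
    channel -> task for the tasks that can run conflict-free.
    """
    owner = {}                          # channel -> task currently matched

    def try_assign(t, seen):
        for c in needs[t]:
            if c in seen:
                continue
            seen.add(c)
            # Take a free channel, or evict the current owner if it can
            # be re-routed to another channel along an augmenting path.
            if c not in owner or try_assign(owner[c], seen):
                owner[c] = t
                return True
        return False

    for t in range(len(needs)):
        try_assign(t, set())
    return owner
```

For example, with `needs = [{0}, {0, 1}, {1, 2}]` all three tasks can be scheduled in one round; tasks left unmatched would wait in the global queue for the next round.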
We have conducted research analyzing the characteristics of task partitioning for efficient parallel processing, developed several scheduling algorithms and run-time support methods for exploiting task parallelism on massively parallel machines, and built an automatic scheduling system for parallelizing task computation [71][70][25][26]. The database group of the Alexandria Project has developed methods and algorithms for parallel I/O and data placement [1]. Developing parallel algorithms and systems for the above research issues in a digital library will require close collaboration with the database research group to study data placement, retrieval strategies, the decomposition of queries that employ parallel I/O, and multi-task scheduling techniques.