Membership: Yang (leader), Andresen, Ibarra, Egecioglu, Zhu, Poulakidas, Watson, Xiong, Smith, Holmedahl
Mission Statement of Team: The goal of the Performance and Parallel Processing Team is to identify and investigate aspects of ADL that will benefit from high-performance computing on multi-computers. Current investigation is focused on various performance issues arising from the ADL environment in terms of both space and time complexities, including WWW applications. The team is also developing algorithms and software techniques for efficient image processing for high-performance digital libraries.
Significant Accomplishments
Dynamic partitioning and scheduling for wavelet image browsing on WWW
The team has been evaluating the application of adaptive client-server
partitioning and scheduling techniques to wavelet image browsing.
Combining multi-processor server support with client resources can lead
to significantly improved response times for image browsing. The main
research challenge is the effective management and utilization of
resources on multi-processor WWW servers and client-site machines.
Blindly transferring load onto clients may not be advisable, since Java
bytecode typically executes much more slowly than a client machine's
native potential. A careful design of the scheduling strategy is needed
to avoid imposing too heavy a burden on client machines. A paper on this
result was published in IEEE IPPS'97.
The optimization addresses whether each part of the computation involved in multi-resolution image construction should be executed at the server or at the client. The computation for processing a wavelet image request can be modeled as a chain of subtasks, and the scheduler selects a point in this chain at which execution shifts from the server to the client.
Our analysis shows that, under different resource availability in terms of client-server bandwidth and client/server system load, a client should execute a different portion of the wavelet computation in order to minimize request response times.
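To make the tradeoff concrete, the following minimal sketch (illustrative only; the array layout, speed parameters, and class name are our assumptions, not the exact cost model of the IPPS'97 paper) estimates the response time of every candidate split point in the subtask chain and returns the cheapest one:

    // Illustrative sketch: pick the point at which a chain of subtasks is cut
    // between server and client so that estimated response time is minimized.
    // cost[i] is the CPU demand of subtask i in normalized units;
    // transferBytes[k] is the data that must cross the network if subtasks
    // 0..k-1 run on the server and k..n-1 run on the client (length n+1).
    public final class SplitPointSelector {
        public static int bestSplit(double[] cost, double[] transferBytes,
                                    double serverSpeed, double clientSpeed,
                                    double bandwidthBytesPerSec) {
            int n = cost.length;
            int best = 0;
            double bestTime = Double.MAX_VALUE;
            for (int k = 0; k <= n; k++) {
                double serverTime = 0.0, clientTime = 0.0;
                for (int i = 0; i < k; i++)  serverTime += cost[i] / serverSpeed;
                for (int i = k; i < n; i++)  clientTime += cost[i] / clientSpeed;
                double networkTime = transferBytes[k] / bandwidthBytesPerSec;
                double total = serverTime + networkTime + clientTime;
                if (total < bestTime) { bestTime = total; best = k; }
            }
            return best;  // subtasks 0..best-1 at the server, the rest at the client
        }
    }

When the link is slow and the data grow along the chain, as in wavelet reconstruction where compressed coefficients expand into image pixels, the minimum moves toward an early cut and the client does more of the work; when the link is fast or the client is slow, it moves toward a late cut and most of the work stays at the server.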
A General Adaptive Partitioning/Scheduling Scheme for
Enhancing Web Application Performance
We have developed general adaptive scheduling techniques that optimize
the use of a multiprocessor server with client resources by predicting
demands of requests on I/O, CPU and network capabilities. We also
provide a performance analysis under simplified assumptions for
understanding the impact of system loads and network bandwidth when
using our scheduling strategy. The experimental results show that
substantial performance improvement is obtained by shifting computation
to client-site machines and that, for some applications, it also
reduces network bandwidth requirements.
Our scheme adapts to variations in client resource availability. We
examine the usefulness of our scheme for image and Postscript document
browsing. A paper on this result will be published in the Journal of
Parallel and Distributed Computing, 1998.
In addition to progressive image browsing, to which our techniques are applicable, the extraction of plain text from a Postscript-formatted document is another application that can benefit strongly from client resources. This application is useful for content replication and for displaying information in browsers that cannot render Postscript. The archives of an Internet/intranet site are a logical place to locate related work for inclusion in another publication; the two standard options, scanning a hard copy of the document and applying OCR, or retyping the relevant sections, are far less efficient than automatic text extraction. In addition, many small Web clients, such as personal digital assistants (PDAs) or NetPCs, cannot display Postscript files, so viewing extracted text may be the only available way to see the content of a Postscript document.

Dynamic scheduling is needed to balance bandwidth and processing requirements. Postscript files are typically large, and so in most situations take a long time to transmit. Text extraction dramatically reduces the size of the data to be transferred, but imposes a large computational burden. The scheduler must therefore determine a proper split point as a function of bandwidth and available processing capability.
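As a back-of-the-envelope illustration with hypothetical numbers: a 2 MB Postscript file is roughly 16 megabits, which takes on the order of nine minutes to transmit over a 28.8 Kbps modem link, whereas 50 KB of extracted text takes about 14 seconds, so server-side extraction wins easily on such a link; over a 10 Mbps LAN the same file moves in under two seconds, and shipping the raw Postscript to a capable client for local extraction becomes the better choice.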
In our request scheduling and partitioning scheme, after an HTTP request arrives at a node, the server parses the HTTP command and expands the incomplete pathname. It also determines whether the request is for an existing document or for a CGI program/task chain to execute. If the request is not recognized as a task chain, the system assigns it to the server node with the lowest load. Otherwise the system analyzes the request and selects a partitioning point and a server node that minimize the response time, using a cost function that estimates the demands of the request. If the chosen server node is not the local node, the request is redirected appropriately. Otherwise, part of the task chain is executed at this server node and the remaining part is executed at the client machine. No request is allowed to be redirected more than once, to avoid a ping-pong effect.
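A highly simplified sketch of this dispatch logic is given below; the interfaces and names are hypothetical stand-ins that merely illustrate the decision flow, not the actual SWEB/SWEB++ code.

    // Hypothetical sketch of per-request dispatch on one server node.
    public final class RequestDispatcher {

        public interface Request {
            boolean isTaskChain();        // recognized CGI program / task chain?
            boolean alreadyRedirected();  // has this request been redirected once?
        }

        /** Node plus split point that minimize the estimated response time. */
        public static final class Choice {
            public final String node;
            public final int splitPoint;
            public Choice(String node, int splitPoint) { this.node = node; this.splitPoint = splitPoint; }
        }

        public interface Scheduler {
            String leastLoadedNode();
            Choice choose(Request r);
        }

        public interface Executor {
            void serveLocally(Request r);                  // plain document or full chain here
            void redirect(Request r, String node);         // HTTP redirection to another node
            void runServerPart(Request r, int splitPoint); // client finishes the remaining subtasks
        }

        private final Scheduler scheduler;
        private final Executor executor;
        private final String localNode;

        public RequestDispatcher(Scheduler s, Executor e, String localNode) {
            this.scheduler = s; this.executor = e; this.localNode = localNode;
        }

        public void dispatch(Request req) {
            if (!req.isTaskChain()) {
                String target = scheduler.leastLoadedNode();
                if (target.equals(localNode) || req.alreadyRedirected()) executor.serveLocally(req);
                else executor.redirect(req, target);        // at most one redirection
                return;
            }
            Choice c = scheduler.choose(req);
            if (!c.node.equals(localNode) && !req.alreadyRedirected()) executor.redirect(req, c.node);
            else executor.runServerPart(req, c.splitPoint);
        }
    }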
We have provided an analysis under a number of assumptions to study the maximum sustained requests per second (MRPS), expected redirection ratios, and predicted task chain split points. In this way, we can analyze the impact of system resource availability from different aspects, for example the number of server nodes, the available Internet bandwidth, and the CPU speed ratio between client and server machines.
SWEB++: Software support for adaptive client-server computing
We have also developed a software prototype (SWEB++) that assists
programmers of WWW applications in using our partitioning and
scheduling techniques. In such an environment, a WWW programmer first
creates the task binary executables for the client and the server. The
programmer then describes each task and the task chain using the task
definition language. After that, the SWEB++ composer takes the task
specification as input and generates the C++ code that links the
necessary modules together for the client- and server-side libraries.
It extracts the task attribute information from the specification and
produces C++ stubs for the scheduler's resource requirement estimation.
The composer also creates a CGI shell script for the run-time chain
executor to invoke when a split point is provided. This system can
substantially reduce the programming effort needed to incorporate the
dynamic optimization features introduced in our earlier work.
To support the implementation of our partitioning and scheduling
scheme, we have developed a client-side resource monitor that
dynamically reports client load and client-server bandwidth. We have
used a piggyback strategy in our Java implementation to evaluate the
client-server bandwidth without creating extra network traffic.
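A minimal sketch of the piggyback idea follows (illustrative only; the class and method names are not those of the actual SWEB++ client monitor): rather than sending probe packets, the client times a transfer it has to perform anyway and maintains a smoothed bandwidth estimate as a by-product.

    // Illustrative sketch: estimate client-server bandwidth from a download
    // that is needed anyway, so no extra network traffic is generated.
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;

    public final class PiggybackBandwidthMonitor {
        private double bytesPerSecond;     // smoothed estimate, 0 until first sample

        /** Downloads data the client needs anyway and updates the estimate. */
        public byte[] fetch(URL url) throws IOException {
            long start = System.currentTimeMillis();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (InputStream in = url.openStream()) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) > 0) out.write(buf, 0, n);
            }
            double seconds = Math.max(0.001, (System.currentTimeMillis() - start) / 1000.0);
            double sample = out.size() / seconds;
            // Exponentially weighted average smooths out transient fluctuations.
            bytesPerSecond = (bytesPerSecond == 0) ? sample
                                                   : 0.7 * bytesPerSecond + 0.3 * sample;
            return out.toByteArray();
        }

        public double estimatedBytesPerSecond() { return bytesPerSecond; }
    }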
Some of the above results appeared in ACM SuperComputing'97.
The source code of the system is available from
http://www.cs.ucsb.edu/research/rapid_sweb/SWEB.html.
A compact storage scheme and Java support for fast progressive image retrieval
Subregion retrieval is an important feature of a digital library
system for browsing large-scale images. In 1996, we developed a
wavelet-based image storage scheme that provides fast image subregion
retrieval at progressively higher resolutions while achieving
good image compression ratios. The method is based on quadtree
and Huffman coding schemes.
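The following generic sketch illustrates why a quadtree organization supports subregion retrieval well: only the blocks that overlap the requested window need to be read and decoded. The node layout and names are illustrative; the actual coefficient organization and Huffman decoding of our scheme are described in the COCOON'97 paper.

    // Generic illustration of quadtree-based subregion lookup: collect the
    // (x, y, size) of the blocks at a given depth that overlap the query
    // window, so only those blocks need to be fetched and decoded.
    import java.util.ArrayList;
    import java.util.List;

    public final class QuadtreeLookup {
        /** imageSize is assumed to be a power of two; depth 0 is the whole image. */
        public static List<int[]> overlappingBlocks(int imageSize, int depth,
                                                    int qx, int qy, int qw, int qh) {
            List<int[]> result = new ArrayList<>();
            collect(0, 0, imageSize, depth, qx, qy, qw, qh, result);
            return result;
        }

        private static void collect(int x, int y, int size, int depth,
                                    int qx, int qy, int qw, int qh, List<int[]> out) {
            boolean disjoint = x >= qx + qw || qx >= x + size
                            || y >= qy + qh || qy >= y + size;
            if (disjoint) return;                          // prune the whole subtree
            if (depth == 0) { out.add(new int[]{x, y, size}); return; }
            int half = size / 2;
            collect(x,        y,        half, depth - 1, qx, qy, qw, qh, out);
            collect(x + half, y,        half, depth - 1, qx, qy, qw, qh, out);
            collect(x,        y + half, half, depth - 1, qx, qy, qw, qh, out);
            collect(x + half, y + half, half, depth - 1, qx, qy, qw, qh, out);
        }
    }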
In 1997, we continued to evaluate the performance of our scheme,
derived an analytical result to assess the compression performance of
our algorithm, conducted an extensive set of experiments to demonstrate
its effectiveness for the ADL images, and provided an extension to a
client/server environment. A paper on this result was published
in COCOON '97.
We have developed a Java implementation of the wavelet color image browsing scheme. It uses multi-threading to improve the client computation and communication performance, and the code is universally executable in any standard WWW browser. Our experiments show that the performance of this code running on a reasonable client machine is competitive with running on a powerful server when the server and the client are connected through the Internet or telephone lines. We are in the final stages of documenting the source code and will make it publicly available in spring 1998.
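A small illustration of the multi-threading idea (the class below is a generic sketch, not the applet's actual structure): one thread downloads the data for the next resolution level while another reconstructs and displays the level already received, overlapping communication with computation.

    // Illustrative producer/consumer sketch: overlap downloading of wavelet
    // data (I/O-bound) with reconstruction and display (CPU-bound).
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public final class ProgressiveViewer {
        private static final byte[] DONE = new byte[0];   // end-of-stream marker

        public static void browse(Iterable<byte[]> levelsFromServer) throws InterruptedException {
            BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(2);
            Thread downloader = new Thread(() -> {
                try {
                    for (byte[] level : levelsFromServer) queue.put(level);  // fetch next level
                    queue.put(DONE);
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
            downloader.start();

            byte[] data;
            while ((data = queue.take()) != DONE) {
                reconstructAndDisplay(data);   // inverse wavelet step plus repaint
            }
            downloader.join();
        }

        private static void reconstructAndDisplay(byte[] levelData) {
            /* placeholder for the inverse transform and screen update */
        }
    }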
Global Optimization for Mapping Parallel Image Processing Tasks
on Distributed Memory Machines
Digital library systems use many image processing algorithms to
support image-related information access. Parallelizing these
algorithms is important for improving the response times of user
image access requests. We assume that an image server runs on a
distributed memory machine. Many of these algorithms can be modeled as
task chains with loops. The typical image distribution on
multi-processors may use column-, row-, or block-based mappings.
Integrating a set of library routines into an image application
requires a global optimization that determines the data mapping of
individual tasks while accounting for inter-task communication. The
main difficulty in deriving the optimal image data distribution for
each task is that task computation may involve loops, and the number of
processors available and the size of the input image may vary at run
time. We have developed a mapping scheme for optimizing the
parallelization of image processing operations. A paper on this work
has appeared in the Journal of Parallel and Distributed Computing, 1997.
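The heart of such a global optimization can be sketched as a dynamic program over mapping choices. The version below is a generic illustration for a simple chain, with hypothetical cost inputs; it does not handle the nested loops or run-time parameter variation addressed in the JPDC 1997 paper.

    // Generic dynamic-programming sketch: for a chain of tasks, each of which
    // can use a column, row, or block distribution, choose the mapping of
    // every task so that total compute time plus redistribution cost is minimal.
    public final class ChainMappingOptimizer {
        /**
         * compute[t][m]  : execution time of task t under mapping m
         * redist[p][m]   : cost of redistributing data from mapping p to mapping m
         * returns the chosen mapping index for each task
         */
        public static int[] optimize(double[][] compute, double[][] redist) {
            int tasks = compute.length, maps = compute[0].length;
            double[][] best = new double[tasks][maps];
            int[][] prev = new int[tasks][maps];
            for (int m = 0; m < maps; m++) best[0][m] = compute[0][m];
            for (int t = 1; t < tasks; t++) {
                for (int m = 0; m < maps; m++) {
                    best[t][m] = Double.MAX_VALUE;
                    for (int p = 0; p < maps; p++) {
                        double c = best[t - 1][p] + redist[p][m] + compute[t][m];
                        if (c < best[t][m]) { best[t][m] = c; prev[t][m] = p; }
                    }
                }
            }
            // Pick the cheapest final mapping and backtrack the choices.
            int last = 0;
            for (int m = 1; m < maps; m++) if (best[tasks - 1][m] < best[tasks - 1][last]) last = m;
            int[] choice = new int[tasks];
            choice[tasks - 1] = last;
            for (int t = tasks - 1; t > 0; t--) choice[t - 1] = prev[t][choice[t]];
            return choice;
        }
    }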
Clustering Support for the Alexandria Digital Library Server
In this work, we investigate load balancing strategies for clustered
Alexandria digital library (ADL) servers. The ADL system
involves intensive database I/O and heterogeneous CPU activities.
Clustering servers can improve the scalability of the ADL system
in response to a large number of simultaneous access requests.
One difficulty addressed is that clustered workstation
nodes may be non-uniform in terms of CPU and I/O speeds.
We have developed an optimization scheme that dynamically monitors resource availability, uses a low-cost communication strategy for updating load information among nodes, and schedules requests based on both I/O and computation load indices. For each request arriving at a node, our request scheduler selects the best node to process it. Since accurate cost estimation for processing database-searching requests is difficult (especially since an ADL query involves spatial data searching), we have proposed a sampling and prediction scheme to identify the relative efficiency of nodes in satisfying the I/O and CPU demands of these requests.
We have also addressed communication issues in updating system load information between nodes. More accurate global information on each node improves scheduling performance but may create too much network traffic. Previous work has addressed this issue using two strategies, broadcast and random polling. We have proposed a new hybrid approach that better fits our setting. In this scheme, the load information at each node is multicast to a few randomly selected nodes every few seconds. In processing an incoming request, we use an aggregated load index, which combines the I/O and CPU load information of a node, to decide whether random polling should be used. If the index at a node is below a threshold, the node is considered lightly loaded and randomly polls a few nodes to select the best among them. Otherwise, it does not poll other nodes and simply uses the load information multicast by others to select a node.
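A simplified sketch of this hybrid policy follows; the threshold, index weights, and interfaces are illustrative assumptions rather than the values used in our implementation.

    // Illustrative sketch of the hybrid load-information scheme: every node
    // periodically multicasts its load to a few random peers; a lightly
    // loaded node additionally polls a few random peers before dispatching.
    import java.util.List;
    import java.util.Random;

    public final class HybridScheduler {
        public static final class Load {
            final double cpu, io;                 // most recently known load indices
            public Load(double cpu, double io) { this.cpu = cpu; this.io = io; }
            double aggregate() { return 0.5 * cpu + 0.5 * io; }  // illustrative weights
        }

        public interface LoadProber { Load poll(int node); }      // fresh value from a peer

        private static final double LIGHT_THRESHOLD = 0.5;        // illustrative
        private final Random rnd = new Random();

        /** nodes: last multicast load of every node; localIndex: this node. */
        public int selectNode(List<Load> nodes, int localIndex, int pollCount, LoadProber prober) {
            int best = localIndex;
            if (nodes.get(localIndex).aggregate() < LIGHT_THRESHOLD) {
                // Lightly loaded node: poll a few random peers for fresh load values.
                double bestLoad = nodes.get(localIndex).aggregate();
                for (int i = 0; i < pollCount; i++) {
                    int peer = rnd.nextInt(nodes.size());
                    Load fresh = prober.poll(peer);
                    if (fresh.aggregate() < bestLoad) { bestLoad = fresh.aggregate(); best = peer; }
                }
            } else {
                // Heavily loaded node: rely on the periodically multicast information.
                for (int i = 0; i < nodes.size(); i++)
                    if (nodes.get(i).aggregate() < nodes.get(best).aggregate()) best = i;
            }
            return best;
        }
    }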
We have implemented our system at the Parallel Systems Lab of UCSB on a cluster of six 147 MHz Sun UltraSPARC 1 workstations and two dual-processor (167 MHz) Ultra 2 workstations connected via fast Ethernet. Two of the Ultra 1 nodes have disk drives with 8 Mbits/second bandwidth and the other six nodes have disk drives with 10 Mbits/second. The Ultra 2 nodes process the ADL requests we have tested about twice as fast as the slowest node. We have conducted a set of experiments using the ADL traces to verify the effectiveness of the proposed strategies. The experiments show that it is important to consider both I/O and CPU load factors in making scheduling decisions, and that our method for estimating the impact of these two factors is effective in processing ADL requests on this non-uniform cluster. We have also provided analytic results that bound the performance of our scheme in a clustered environment; the theoretical analysis corroborates the design of our techniques. Our future work is to further evaluate and extend our scheme (e.g. dealing with non-uniform memory sizes) and to study the use of a faster network such as SCI for the cluster.
Cooperative Caching of Dynamic Content on a Distributed Web
Server
In this work we propose a new method for improving the average
response times of
web servers by cooperatively caching the results of requests for
dynamic content on a cluster of workstations. The work is motivated by
our recent study of the access logs of the Alexandria Digital Library
server at UCSB, which demonstrates that approximately a 30 percent
decrease in response time could be achieved by caching dynamically
generated content.
We have developed a distributed Web server called Swala, in which the cache manager on each node cooperatively caches the results of CGI requests. Swala is multi-threaded and runs on a cluster of workstations. The caches contain the results of requests for dynamic content produced by CGI programs. Each node is made aware of the others at startup so that they can exchange cache information and content. Each incoming request is handled by its own thread to maximize throughput, and we use memory-mapped I/O whenever possible to minimize the number of system calls and eliminate double-buffering.

The primary run-time modules of Swala on every machine are the control module and the cache manager. The control module listens for HTTP requests from clients, starts a thread for each incoming request, and parses incoming requests. The cache manager maintains the local cache dictionary and provides information to request threads; each node maintains a cache directory with information about all entries cached on all nodes. The cache manager contains three daemon threads. The first thread receives information about cache insertions and deletions from the other nodes and updates the local dictionary. The second thread listens for cache data requests from the other nodes and starts a separate thread for each request to return the cache contents. If the current cache replacement method is time-to-live (TTL), a third thread wakes up every few seconds and purges outdated entries from the cache. We use a two-level consistency protocol to maximize system performance and minimize the overhead of responding to dynamic Web requests.
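The essential cache-lookup path can be sketched as follows; this is a simplified illustration with hypothetical class names, not the actual Swala modules or its two-level consistency protocol.

    // Simplified sketch of a cooperative cache lookup for a dynamic (CGI) request.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public final class CooperativeCache {
        public static final class Entry {
            final byte[] body;
            final long expiresAtMillis;            // used when the replacement policy is TTL
            Entry(byte[] body, long expiresAtMillis) { this.body = body; this.expiresAtMillis = expiresAtMillis; }
            boolean expired() { return System.currentTimeMillis() > expiresAtMillis; }
        }

        public interface Peers {
            byte[] fetch(int node, String key);            // pull cached contents from a peer
            void announceInsertion(int node, String key);  // keep peer directories up to date
        }
        public interface CgiRunner { byte[] run(String key); }

        private final Map<String, Entry> localCache = new ConcurrentHashMap<>();
        private final Map<String, Integer> globalDirectory = new ConcurrentHashMap<>();
        private final Peers peers;
        private final int localNode;

        public CooperativeCache(Peers peers, int localNode) { this.peers = peers; this.localNode = localNode; }

        public byte[] serve(String cgiKey, CgiRunner runner, long ttlMillis) {
            Entry e = localCache.get(cgiKey);
            if (e != null && !e.expired()) return e.body;             // local hit
            Integer owner = globalDirectory.get(cgiKey);
            if (owner != null && owner != localNode) {
                byte[] remote = peers.fetch(owner, cgiKey);           // remote hit via directory
                if (remote != null) return remote;
            }
            byte[] fresh = runner.run(cgiKey);                        // miss: execute the CGI
            localCache.put(cgiKey, new Entry(fresh, System.currentTimeMillis() + ttlMillis));
            peers.announceInsertion(localNode, cgiKey);               // advertise the new entry
            return fresh;
        }
    }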
Our experiments show that, for dynamic requests, the single-computer performance of Swala without caching is comparable to the Netscape Enterprise server, that substantial speedups are obtained using a cache on a single machine, and that even greater speedups are obtained by using a cooperative cache on a cluster of workstations.

A number of other research groups have addressed Web proxy caching; however, they focus on file caching rather than caching of dynamic content. Furthermore, proxy caching of dynamic content may not be feasible; for example, it does not permit caching of authenticated content. Caching the results of dynamic requests has been studied recently by an IBM group, which has written a cache server and rewritten its server applications to insert and delete items from this cache. There are two main drawbacks to this approach. First, the server application must be rewritten to take advantage of the cache, which can be a nontrivial task for some applications and impossible for applications whose source code is not available. Second, for every dynamic request the Web server must still start the application, even if only to return a cache hit; for call mechanisms such as CGI, the operating system overhead of this call is significant, as we demonstrate in one of our experiments. Our work overcomes both limitations, since our cache is built into the Web server.
Abstracts of Published Papers
WWW-based information service has grown enormously during the last few
years, and major performance bottlenecks have been caused by WWW server
and internet bandwidth inadequacies. Augmenting the server with
multiprocessor support and shifting computation to client-site machines
can substantially improve the system response time and for some
applications, it may also reduce network bandwidth requirements. Taking
full advantage of these capabilities requires sophisticated scheduling.
We first investigate algorithms for scheduling HTTP requests within a
server cluster.
Typical distributed scheduling techniques for HTTP servers
either use a simple round-robin request distribution algorithm
or consider only a single criterion such as CPU load.
We propose novel multifaceted scheduling techniques that optimize the use of a
multiprocessor server
by predicting the demands of
requests on I/O, processor, and network resources.
We present SWEB, a software system implementing our techniques
on a cluster of workstations.
We provide a
performance analysis under simplified assumptions for understanding the
impact of system loads
when using our scheduling strategies.
Our experiments show
substantial improvements by our techniques compared to traditional
algorithms, and we observe a close correlation
between our theoretical analysis and the achieved results.
We then extend our techniques to adaptively incorporate client
resources. Due to a wide variation in client capabilities and
connection network characteristics, the standard technique of partitioning
the client-server workload at a fixed point is infeasible.
We present a task model and scheduling technique
for adaptive client-server computing in which the computation
required by a user request is dynamically partitioned
between the client and server by monitoring
network, client and server resources.
We demonstrate the use of this technique in digital library applications
such as image and postscript document browsing.
We also present SWEB++, a software system
to support programmers desiring to use our scheduling algorithms.
Experimentally, we achieve significantly faster response times
through utilizing client resources, and demonstrate the validity
of our theoretical analysis.
The realization of the need to deal with large-scale systems as a
whole, together with the availability of powerful computer resources, has
fueled interest in the efficient monitoring and modeling of such systems.
This is particularly true in the area of Earth System Science,
which has motivated this work.
Such projects need a multicomputer environment; reasons include the
requirement for very large computing power, financial constraints that
rule out supercomputers, and fault-tolerance concerns.
Furthermore, a distributed computing environment is more natural since
users and data generation are geographically distributed.
In an Earth System Science computing environment,
users should be able to browse the data available, process them,
insert new data and
correct or enhance data already available.
A bottleneck is the handling of image or image-like data:
the images are numerous and each of them is large.
We propose a scheme that stores images
and supports image browsing efficiently.
In particular, the scheme achieves good image compression while
supporting fast image subregion retrieval at various resolutions.
It is particularly suited for a distributed server/client environment.
Since new images are inserted, obsolete ones are deleted and others
are updated, support for maintaining data coherence without
sacrificing performance is needed.
For an image server/client environment to be used as part of a
distributed system, which may include other servers that
provide related
information, all these servers must appear as a single logical entity
to any user.
We propose an algorithm that achieves this objective by providing a
shared-memory abstraction, i.e., accesses to the memories of the servers
appear like accesses to a single memory module.
To achieve this efficiently, the algorithm dynamically migrates,
replicates, and deletes data among the servers in order to best
distribute the data for the current pattern of user accesses.
In an image browsing environment there is a need for progressively viewing image subregions at various resolutions. We describe a storage scheme that accomplishes good image compression, while supporting fast image subregion retrieval. We evaluate analytically and experimentally the compression performance of our algorithm. We also provide results on the speed of the algorithm to demonstrate its effectiveness, and present an extension to a client/server environment.
WWW-based Internet information service has grown enormously during the last few years, and major performance bottlenecks have been caused by WWW server and Internet bandwidth inadequacies. Utilizing client-site computing power and also multi-processor support at the server site can substantially improve the system response time. In this paper, we examine the use of scheduling techniques in monitoring and adapting to workload variation at client and server sites for supporting fast WWW image browsing. We provide both analytic and experimental results to identify the impact of system loads and network bandwidth on response times and demonstrate the effectiveness of our scheduling strategy.
This paper studies runtime partitioning, scheduling and load balancing techniques for improving performance of on-line WWW-based information systems such as digital libraries. The main performance bottlenecks of such a system are caused by the server computing capability and Internet bandwidth. Our observations and solutions are based on our experience with the Alexandria Digital Library (ADL) testbed at UCSB, which provides on-line browsing and processing of documents, digitized maps and other geo-spatially mapped data via WWW. A proper partitioning and scheduling of computation and communication in processing a user request on a multi-processor server and transferring some computation to client-site machines can reduce network traffic and substantially improve system response time. We propose a partitioning and scheduling mechanism that adapts to resource changes and optimizes resource utilization, and demonstrate the application of this mechanism for on-line information browsing. We also provide a performance analysis and experimental results to study the impact of resource availability and the effectiveness of our scheduling techniques.
WWW-based information service has grown enormously during the last few years, and major performance bottlenecks have been caused by WWW server and Internet bandwidth inadequacies. Augmenting a server with multiprocessor support and shifting computation to client-site machines can substantially improve system response times and for some applications, it may also reduce network bandwidth requirements. In this paper, we propose adaptive scheduling techniques that optimize the use of a multiprocessor server with client resources by predicting demands of requests on I/O, CPU and network capabilities. We also provide a performance analysis under simplified assumptions for understanding the impact of system loads and network bandwidth when using our scheduling strategy. Finally we report preliminary experimental results to examine the system performance and verify the usefulness of the analytic model.
In this paper, we investigate the issues involved in developing a scalable World Wide Web (WWW) server called SWEB on a cluster of workstations. The objective is to strengthen the processing capabilities of such a server in order to match huge demands in simultaneous access requests from the Internet, especially when these requests involve delivery of large digitized documents. The scheduling component of the system actively monitors the usages of CPU, disk I/O channels and the interconnection network to effectively distribute HTTP requests across processing units to exploit task and I/O parallelism. We analyze the maximum number of requests that can be handled by the system and present several experiments to examine the performance of this system.
Many parallel algorithms and library routines are available for performing computer vision and image processing (CVIP) tasks on distributed-memory multiprocessors. The typical image distribution may use column, row, and block based mapping. Integrating a set of library routines for a CVIP application requires a global optimization for determining the data mapping of individual tasks by considering inter-task communication. The main difficulty in deriving the optimal image data distribution for each task is that CVIP task computation may involve loops, and the number of processors available and the size of the input image may vary at the run time. In this paper, a CVIP application is modeled using a task chain with nested loops, specified by conventional visual languages such as Khoros and Explorer. A mapping algorithm is proposed that optimizes the average run-time performance for CVIP applications with nested loops by considering the data redistribution overheads and possible run-time parameter variations. A taxonomy of CVIP operations is provided and used for further reducing the complexity of the algorithm. Experimental results on both low-level image processing and high-level computer vision applications are presented to validate this approach.
In this paper we propose a new method for improving the average response time of Web servers by cooperatively caching the results of requests for dynamic content. The work is motivated by our recent study of access logs from the Alexandria Digital Library server at UCSB, which demonstrates that approximately a 30 percent decrease in average response time could be achieved by caching dynamically generated content. We have developed a distributed Web server called Swala, in which all nodes cooperatively cache CGI requests. We use a two-level consistency protocol to maximize the system performance and minimize overhead in responding to dynamic Web requests. Our experiments show that the single-node performance of Swala without caching is comparable to the Netscape Enterprise server, substantial speedups are obtained using a cache on a single machine, and even greater speedups are obtained by using a cooperative cache on a cluster of workstations.
In this paper, we investigate load balancing strategies for clustered Alexandria digital library (ADL) servers. The ADL system, which provides on-line information searching and browsing of spatially-referenced materials through the World Wide Web, involves intensive database I/O and heterogeneous CPU activities. Clustering servers can improve the scalability of the ADL system in response to a large number of simultaneous access requests. One difficulty addressed is that clustered workstation nodes may be non-uniform in terms of CPU and I/O speeds. We have developed an optimization scheme that dynamically monitors the resource availability, uses a low-cost communication strategy for updating load information among nodes, and schedules requests based on both I/O and computation load indices. Since the accurate cost estimation for processing database-searching requests is difficult, we have proposed a sampling and prediction scheme to identify the relative efficiency of nodes for satisfying I/O and CPU demands of these requests. We have provided analytic results to bound the performance of our scheme on this cluster environment and have conducted a set of experiments using the ADL traces to verify the effectiveness of the proposed strategies.
The SWEB++ project studies runtime partitioning, scheduling and load balancing techniques for improving the performance of on-line WWW-based information systems such as digital libraries. The main performance bottlenecks of such a system are caused by the server computing capability and Internet bandwidth. Our observations and solutions are based on our experience with the Alexandria Digital Library (ADL) testbed at UCSB, which provides on-line browsing and processing of documents, digitized maps and other geo-spatially mapped data via WWW. A proper partitioning and scheduling of computation and communication in processing a user request on a multi-processor server and transferring some computation to client-site machines can reduce network traffic and substantially improve system response time. We have proposed a partitioning and scheduling mechanism that adapts to resource changes and optimizes resource utilization. We have developed a software tool which implements and supports the use of our scheduling strategies when programming WWW applications, and have demonstrated the application of this system for on-line DL information browsing. We have conducted a set of experiments to examine the system performance on the Meiko parallel machine and a cluster of workstations.