Membership: Yang (leader), Andresen, Ibarra, Egecioglu, Zhu, Poulakidas, Watson, Xiong, Smith, Holmedahl
Mission Statement of Team: The goal of the Performance and Parallel Processing Team is to identify and investigate aspects of ADL that will benefit from high-performance computing on multi-computers. Current investigation is focused on various performance issues arising from the ADL environment in terms of both space and time complexities, including WWW applications. The team is also developing algorithms and software techniques for efficient image processing for high-performance digital libraries.
Significant Accomplishments
Dynamic partitioning and scheduling for wavelet image browsing on WWW
The team has been evaluating the application of adaptive client-server
partitioning and scheduling techniques to wavelet image browsing.
Combining multi-processor server support with client resources can lead
to significantly improved response times for image browsing. The main
research challenge is the effective management and utilization of
resources on multi-processor WWW servers and client-site machines.
Blindly transferring load onto clients may not be advisable, since Java
bytecode typically executes much more slowly than a client machine's
native potential. A careful design of the scheduling strategy is needed
to avoid imposing too heavy a burden on client machines. A paper on this
result was published in IEEE IPPS'97.
The optimization addresses whether each part of the computation involved in multi-resolution image construction should be executed at the server or at the client. The computation for processing a wavelet image request can be modeled as a chain of subtasks, and the scheduler selects a point in this chain at which execution shifts from the server to the client.
Our analysis shows that, under different resource availability in terms of client-server bandwidth and client/server system load, a client should execute a different portion of the wavelet computation in order to minimize request response times.
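To make the tradeoff concrete, the following minimal sketch (illustrative only; the array layout, speed parameters, and class name are our assumptions, not the exact cost model of the IPPS'97 paper) estimates the response time of every candidate split point in the subtask chain and returns the cheapest one:

    // Illustrative sketch: pick the point at which a chain of subtasks is cut
    // between server and client so that estimated response time is minimized.
    // cost[i] is the CPU demand of subtask i in normalized units;
    // transferBytes[k] is the data that must cross the network if subtasks
    // 0..k-1 run on the server and k..n-1 run on the client (length n+1).
    public final class SplitPointSelector {
        public static int bestSplit(double[] cost, double[] transferBytes,
                                    double serverSpeed, double clientSpeed,
                                    double bandwidthBytesPerSec) {
            int n = cost.length;
            int best = 0;
            double bestTime = Double.MAX_VALUE;
            for (int k = 0; k <= n; k++) {
                double serverTime = 0.0, clientTime = 0.0;
                for (int i = 0; i < k; i++)  serverTime += cost[i] / serverSpeed;
                for (int i = k; i < n; i++)  clientTime += cost[i] / clientSpeed;
                double networkTime = transferBytes[k] / bandwidthBytesPerSec;
                double total = serverTime + networkTime + clientTime;
                if (total < bestTime) { bestTime = total; best = k; }
            }
            return best;  // subtasks 0..best-1 at the server, the rest at the client
        }
    }

When the link is slow and the data grow along the chain, as in wavelet reconstruction where compressed coefficients expand into image pixels, the minimum moves toward an early cut and the client does more of the work; when the link is fast or the client is slow, it moves toward a late cut and most of the work stays at the server.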
A General Adaptive Partitioning/Scheduling Scheme for
Enhancing Web Application Performance
We have developed general adaptive scheduling techniques that optimize
the use of a multiprocessor server with client resources by predicting
demands of requests on I/O, CPU and network capabilities. We also
provide a performance analysis under simplified assumptions for
understanding the impact of system loads and network bandwidth when
using our scheduling strategy. The experimental results show that
substantial performance improvement is obtained by shifting computation
to client-site machines and that, for some applications, it also
reduces network bandwidth requirements.
Our scheme adapts to variations in client resource availability. We
examine the usefulness of our scheme for image and Postscript document
browsing. A paper on this result will be published in the Journal of
Parallel and Distributed Computing, 1998.
In addition to progressive image browsing, to which our techniques are applicable, the extraction of plain text from a Postscript-formatted document is another application that can benefit strongly from client resources. This application is useful for content replication and for displaying information in browsers that cannot render Postscript. The archives of an Internet/intranet site are a logical place to locate related work for inclusion in another publication; the two standard options, scanning a hard copy of the document and applying OCR, or retyping the relevant sections, are far less efficient than automatic text extraction. In addition, many small Web clients, such as personal digital assistants (PDAs) or NetPCs, cannot display Postscript files, so viewing extracted text may be the only available way to see the content of a Postscript document.

Dynamic scheduling is needed to balance bandwidth and processing requirements. Postscript files are typically large, and so in most situations take a long time to transmit. Text extraction dramatically reduces the size of the data to be transferred, but imposes a large computational burden. The scheduler must therefore determine a proper split point as a function of bandwidth and available processing capability.
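As a back-of-the-envelope illustration with hypothetical numbers: a 2 MB Postscript file is roughly 16 megabits, which takes on the order of nine minutes to transmit over a 28.8 Kbps modem link, whereas 50 KB of extracted text takes about 14 seconds, so server-side extraction wins easily on such a link; over a 10 Mbps LAN the same file moves in under two seconds, and shipping the raw Postscript to a capable client for local extraction becomes the better choice.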
In our request scheduling and partitioning scheme, after an HTTP request arrives at a node, the server parses the HTTP command and expands the incomplete pathname. It also determines whether the request is for an existing document or for a CGI program/task chain to execute. If the request is not recognized as a task chain, the system assigns it to the server node with the lowest load. Otherwise the system analyzes the request and selects a partitioning point and a server node that minimize the response time, using a cost function that estimates the demands of the request. If the chosen server node is not the local node, the request is redirected appropriately. Otherwise, part of the task chain is executed at this server node and the remaining part is executed at the client machine. No request is allowed to be redirected more than once, to avoid a ping-pong effect.
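A highly simplified sketch of this dispatch logic is given below; the interfaces and names are hypothetical stand-ins that merely illustrate the decision flow, not the actual SWEB/SWEB++ code.

    // Hypothetical sketch of per-request dispatch on one server node.
    public final class RequestDispatcher {

        public interface Request {
            boolean isTaskChain();        // recognized CGI program / task chain?
            boolean alreadyRedirected();  // has this request been redirected once?
        }

        /** Node plus split point that minimize the estimated response time. */
        public static final class Choice {
            public final String node;
            public final int splitPoint;
            public Choice(String node, int splitPoint) { this.node = node; this.splitPoint = splitPoint; }
        }

        public interface Scheduler {
            String leastLoadedNode();
            Choice choose(Request r);
        }

        public interface Executor {
            void serveLocally(Request r);                  // plain document or full chain here
            void redirect(Request r, String node);         // HTTP redirection to another node
            void runServerPart(Request r, int splitPoint); // client finishes the remaining subtasks
        }

        private final Scheduler scheduler;
        private final Executor executor;
        private final String localNode;

        public RequestDispatcher(Scheduler s, Executor e, String localNode) {
            this.scheduler = s; this.executor = e; this.localNode = localNode;
        }

        public void dispatch(Request req) {
            if (!req.isTaskChain()) {
                String target = scheduler.leastLoadedNode();
                if (target.equals(localNode) || req.alreadyRedirected()) executor.serveLocally(req);
                else executor.redirect(req, target);        // at most one redirection
                return;
            }
            Choice c = scheduler.choose(req);
            if (!c.node.equals(localNode) && !req.alreadyRedirected()) executor.redirect(req, c.node);
            else executor.runServerPart(req, c.splitPoint);
        }
    }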
We have provided an analysis under a number of assumptions to study the maximum sustained requests per second (MRPS), expected redirection ratios, and predicted task chain split points. In this way, we can analyze the impact of system resource availability from different aspects, for example the number of server nodes, the available Internet bandwidth, and the CPU speed ratio between client and server machines.
SWEB++: Software support for adaptive client-server computing
We have also developed a software prototype (SWEB++) that assists
programmers of WWW applications in using our partitioning and
scheduling techniques. In such an environment, a WWW programmer first
creates the task binary executables for the client and the server. The
programmer then describes each task and the task chain using the task
definition language. After that, the SWEB++ composer takes the task
specification as input and generates the C++ code that links the
necessary modules together for the client- and server-side libraries.
It extracts the task attribute information from the specification and
produces C++ stubs for the scheduler's resource requirement estimation.
The composer also creates a CGI shell script for the run-time chain
executor to invoke when a split point is provided. This system can
substantially reduce the programming effort needed to incorporate the
dynamic optimization features introduced in our earlier work.
To support the implementation of our partitioning and scheduling
scheme, we have developed a client-side resource monitor that
dynamically reports client load and client-server bandwidth. We have
used a piggyback strategy in our Java implementation to evaluate the
client-server bandwidth without creating extra network traffic.
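A minimal sketch of the piggyback idea follows (illustrative only; the class and method names are not those of the actual SWEB++ client monitor): rather than sending probe packets, the client times a transfer it has to perform anyway and maintains a smoothed bandwidth estimate as a by-product.

    // Illustrative sketch: estimate client-server bandwidth from a download
    // that is needed anyway, so no extra network traffic is generated.
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;

    public final class PiggybackBandwidthMonitor {
        private double bytesPerSecond;     // smoothed estimate, 0 until first sample

        /** Downloads data the client needs anyway and updates the estimate. */
        public byte[] fetch(URL url) throws IOException {
            long start = System.currentTimeMillis();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (InputStream in = url.openStream()) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) > 0) out.write(buf, 0, n);
            }
            double seconds = Math.max(0.001, (System.currentTimeMillis() - start) / 1000.0);
            double sample = out.size() / seconds;
            // Exponentially weighted average smooths out transient fluctuations.
            bytesPerSecond = (bytesPerSecond == 0) ? sample
                                                   : 0.7 * bytesPerSecond + 0.3 * sample;
            return out.toByteArray();
        }

        public double estimatedBytesPerSecond() { return bytesPerSecond; }
    }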
Some of the above results appeared in ACM SuperComputing'97.
The source code of the system is available from
http://www.cs.ucsb.edu/research/rapid_sweb/SWEB.html.
A compact storage scheme and Java support for fast progressive image retrieval
Subregion retrieval is an important feature of a digital library
system for browsing large-scale images. In 1996, we developed a
wavelet-based image storage scheme that provides fast image subregion
retrieval at progressively higher resolutions while achieving
good image compression ratios. The method is based on quadtree
and Huffman coding schemes.
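The following generic sketch illustrates why a quadtree organization supports subregion retrieval well: only the blocks that overlap the requested window need to be read and decoded. The node layout and names are illustrative; the actual coefficient organization and Huffman decoding of our scheme are described in the COCOON'97 paper.

    // Generic illustration of quadtree-based subregion lookup: collect the
    // (x, y, size) of the blocks at a given depth that overlap the query
    // window, so only those blocks need to be fetched and decoded.
    import java.util.ArrayList;
    import java.util.List;

    public final class QuadtreeLookup {
        /** imageSize is assumed to be a power of two; depth 0 is the whole image. */
        public static List<int[]> overlappingBlocks(int imageSize, int depth,
                                                    int qx, int qy, int qw, int qh) {
            List<int[]> result = new ArrayList<>();
            collect(0, 0, imageSize, depth, qx, qy, qw, qh, result);
            return result;
        }

        private static void collect(int x, int y, int size, int depth,
                                    int qx, int qy, int qw, int qh, List<int[]> out) {
            boolean disjoint = x >= qx + qw || qx >= x + size
                            || y >= qy + qh || qy >= y + size;
            if (disjoint) return;                          // prune the whole subtree
            if (depth == 0) { out.add(new int[]{x, y, size}); return; }
            int half = size / 2;
            collect(x,        y,        half, depth - 1, qx, qy, qw, qh, out);
            collect(x + half, y,        half, depth - 1, qx, qy, qw, qh, out);
            collect(x,        y + half, half, depth - 1, qx, qy, qw, qh, out);
            collect(x + half, y + half, half, depth - 1, qx, qy, qw, qh, out);
        }
    }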
In 1997, we continued to evaluate the performance of our scheme,
derived an analytical result to assess the compression performance of
our algorithm, conducted an extensive set of experiments to demonstrate
its effectiveness for the ADL images, and provided an extension to a
client/server environment. A paper on this result was published
in COCOON '97.
We have developed a Java implementation of the wavelet color image browsing scheme. It uses multi-threading to improve the client computation and communication performance, and the code is universally executable in any standard WWW browser. Our experiments show that the performance of this code running on a reasonable client machine is competitive with running on a powerful server when the server and the client are connected through the Internet or telephone lines. We are in the final stages of documenting the source code and will make it publicly available in spring 1998.
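A small illustration of the multi-threading idea (the class below is a generic sketch, not the applet's actual structure): one thread downloads the data for the next resolution level while another reconstructs and displays the level already received, overlapping communication with computation.

    // Illustrative producer/consumer sketch: overlap downloading of wavelet
    // data (I/O-bound) with reconstruction and display (CPU-bound).
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public final class ProgressiveViewer {
        private static final byte[] DONE = new byte[0];   // end-of-stream marker

        public static void browse(Iterable<byte[]> levelsFromServer) throws InterruptedException {
            BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(2);
            Thread downloader = new Thread(() -> {
                try {
                    for (byte[] level : levelsFromServer) queue.put(level);  // fetch next level
                    queue.put(DONE);
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
            downloader.start();

            byte[] data;
            while ((data = queue.take()) != DONE) {
                reconstructAndDisplay(data);   // inverse wavelet step plus repaint
            }
            downloader.join();
        }

        private static void reconstructAndDisplay(byte[] levelData) {
            /* placeholder for the inverse transform and screen update */
        }
    }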
Global Optimization for Mapping Parallel Image Processing Tasks
on Distributed Memory Machines
Digital library systems use many image processing algorithms to
support image-related information access. Parallelizing these
algorithms is important for improving the response times of user
image access requests. We assume that an image server runs on a
distributed memory machine. Many of these algorithms can be modeled as
task chains with loops. The typical image distribution on
multi-processors may use column-, row-, or block-based mappings.
Integrating a set of library routines into an image application
requires a global optimization that determines the data mapping of
individual tasks while accounting for inter-task communication. The
main difficulty in deriving the optimal image data distribution for
each task is that task computation may involve loops, and the number of
processors available and the size of the input image may vary at run
time. We have developed a mapping scheme for optimizing the
parallelization of image processing operations. A paper on this work
has appeared in the Journal of Parallel and Distributed Computing, 1997.
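The heart of such a global optimization can be sketched as a dynamic program over mapping choices. The version below is a generic illustration for a simple chain, with hypothetical cost inputs; it does not handle the nested loops or run-time parameter variation addressed in the JPDC 1997 paper.

    // Generic dynamic-programming sketch: for a chain of tasks, each of which
    // can use a column, row, or block distribution, choose the mapping of
    // every task so that total compute time plus redistribution cost is minimal.
    public final class ChainMappingOptimizer {
        /**
         * compute[t][m]  : execution time of task t under mapping m
         * redist[p][m]   : cost of redistributing data from mapping p to mapping m
         * returns the chosen mapping index for each task
         */
        public static int[] optimize(double[][] compute, double[][] redist) {
            int tasks = compute.length, maps = compute[0].length;
            double[][] best = new double[tasks][maps];
            int[][] prev = new int[tasks][maps];
            for (int m = 0; m < maps; m++) best[0][m] = compute[0][m];
            for (int t = 1; t < tasks; t++) {
                for (int m = 0; m < maps; m++) {
                    best[t][m] = Double.MAX_VALUE;
                    for (int p = 0; p < maps; p++) {
                        double c = best[t - 1][p] + redist[p][m] + compute[t][m];
                        if (c < best[t][m]) { best[t][m] = c; prev[t][m] = p; }
                    }
                }
            }
            // Pick the cheapest final mapping and backtrack the choices.
            int last = 0;
            for (int m = 1; m < maps; m++) if (best[tasks - 1][m] < best[tasks - 1][last]) last = m;
            int[] choice = new int[tasks];
            choice[tasks - 1] = last;
            for (int t = tasks - 1; t > 0; t--) choice[t - 1] = prev[t][choice[t]];
            return choice;
        }
    }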
Clustering Support for the Alexandria Digital Library Server
In this work, we investigate load balancing strategies for clustered
Alexandria digital library (ADL) servers. The ADL system
involves intensive database I/O and heterogeneous CPU activities.
Clustering servers can improve the scalability of the ADL system
in response to a large number of simultaneous access requests.
One difficulty addressed is that clustered workstation
nodes may be non-uniform in terms of CPU and I/O speeds.
We have developed an optimization scheme that dynamically monitors resource availability, uses a low-cost communication strategy for updating load information among nodes, and schedules requests based on both I/O and computation load indices. For each request arriving at a node, our request scheduler selects the best node to process it. Since accurate cost estimation for processing database-searching requests is difficult (especially since an ADL query involves spatial data searching), we have proposed a sampling and prediction scheme to identify the relative efficiency of nodes in satisfying the I/O and CPU demands of these requests.
We have also addressed communication issues in updating system load information between nodes. More accurate global information on each node improves scheduling performance but may create too much network traffic. Previous work has addressed this issue using two strategies, broadcast and random polling. We have proposed a new hybrid approach that better fits our setting. In this scheme, the load information at each node is multicast to a few randomly selected nodes every few seconds. In processing an incoming request, we use an aggregated load index, which combines the I/O and CPU load information of a node, to decide whether random polling should be used. If the index at a node is below a threshold, the node is considered lightly loaded and randomly polls a few nodes to select the best among them. Otherwise, it does not poll other nodes and simply uses the load information multicast by others to select a node.
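A simplified sketch of this hybrid policy follows; the threshold, index weights, and interfaces are illustrative assumptions rather than the values used in our implementation.

    // Illustrative sketch of the hybrid load-information scheme: every node
    // periodically multicasts its load to a few random peers; a lightly
    // loaded node additionally polls a few random peers before dispatching.
    import java.util.List;
    import java.util.Random;

    public final class HybridScheduler {
        public static final class Load {
            final double cpu, io;                 // most recently known load indices
            public Load(double cpu, double io) { this.cpu = cpu; this.io = io; }
            double aggregate() { return 0.5 * cpu + 0.5 * io; }  // illustrative weights
        }

        public interface LoadProber { Load poll(int node); }      // fresh value from a peer

        private static final double LIGHT_THRESHOLD = 0.5;        // illustrative
        private final Random rnd = new Random();

        /** nodes: last multicast load of every node; localIndex: this node. */
        public int selectNode(List<Load> nodes, int localIndex, int pollCount, LoadProber prober) {
            int best = localIndex;
            if (nodes.get(localIndex).aggregate() < LIGHT_THRESHOLD) {
                // Lightly loaded node: poll a few random peers for fresh load values.
                double bestLoad = nodes.get(localIndex).aggregate();
                for (int i = 0; i < pollCount; i++) {
                    int peer = rnd.nextInt(nodes.size());
                    Load fresh = prober.poll(peer);
                    if (fresh.aggregate() < bestLoad) { bestLoad = fresh.aggregate(); best = peer; }
                }
            } else {
                // Heavily loaded node: rely on the periodically multicast information.
                for (int i = 0; i < nodes.size(); i++)
                    if (nodes.get(i).aggregate() < nodes.get(best).aggregate()) best = i;
            }
            return best;
        }
    }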
We have implemented our system at the Parallel Systems Lab of UCSB on a cluster of six 147 MHz Sun UltraSPARC 1 workstations and two dual-processor (167 MHz) Ultra 2 workstations connected via fast Ethernet. Two of the Ultra 1 nodes have disk drives with 8 Mbits/second bandwidth and the other six nodes have disk drives with 10 Mbits/second. The Ultra 2 nodes process the ADL requests we have tested about twice as fast as the slowest node. We have conducted a set of experiments using the ADL traces to verify the effectiveness of the proposed strategies. The experiments show that it is important to consider both I/O and CPU load factors in making scheduling decisions, and that our method for estimating the impact of these two factors is effective in processing ADL requests on this non-uniform cluster. We have also provided analytic results that bound the performance of our scheme in a clustered environment; the theoretical analysis corroborates the design of our techniques. Our future work is to further evaluate and extend our scheme (e.g. dealing with non-uniform memory sizes) and to study the use of a faster network such as SCI for the cluster.
Cooperative Caching of Dynamic Content on a Distributed Web
Server
In this work we propose a new method for improving the average
response times of
web servers by cooperatively caching the results of requests for
dynamic content on a cluster of workstations. The work is motivated by
our recent study of the access logs of the Alexandria Digital Library
server at UCSB, which demonstrates that approximately a 30 percent
decrease in response time could be achieved by caching dynamically
generated content.
We have developed a distributed Web server called Swala, in which the cache manager on each node cooperatively caches the results of CGI requests. Swala is multi-threaded and runs on a cluster of workstations. The caches contain the results of requests for dynamic content produced by CGI programs. Each node is made aware of the others at startup so that they can exchange cache information and content. Each incoming request is handled by its own thread to maximize throughput, and we use memory-mapped I/O whenever possible to minimize the number of system calls and eliminate double-buffering.

The primary run-time modules of Swala on every machine are the control module and the cache manager. The control module listens for HTTP requests from clients, starts a thread for each incoming request, and parses incoming requests. The cache manager maintains the local cache dictionary and provides information to request threads; each node maintains a cache directory with information about all entries cached on all nodes. The cache manager contains three daemon threads. The first thread receives information about cache insertions and deletions from the other nodes and updates the local dictionary. The second thread listens for cache data requests from the other nodes and starts a separate thread for each request to return the cache contents. If the current cache replacement method is time-to-live (TTL), a third thread wakes up every few seconds and purges outdated entries from the cache. We use a two-level consistency protocol to maximize system performance and minimize the overhead of responding to dynamic Web requests.
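The essential cache-lookup path can be sketched as follows; this is a simplified illustration with hypothetical class names, not the actual Swala modules or its two-level consistency protocol.

    // Simplified sketch of a cooperative cache lookup for a dynamic (CGI) request.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public final class CooperativeCache {
        public static final class Entry {
            final byte[] body;
            final long expiresAtMillis;            // used when the replacement policy is TTL
            Entry(byte[] body, long expiresAtMillis) { this.body = body; this.expiresAtMillis = expiresAtMillis; }
            boolean expired() { return System.currentTimeMillis() > expiresAtMillis; }
        }

        public interface Peers {
            byte[] fetch(int node, String key);            // pull cached contents from a peer
            void announceInsertion(int node, String key);  // keep peer directories up to date
        }
        public interface CgiRunner { byte[] run(String key); }

        private final Map<String, Entry> localCache = new ConcurrentHashMap<>();
        private final Map<String, Integer> globalDirectory = new ConcurrentHashMap<>();
        private final Peers peers;
        private final int localNode;

        public CooperativeCache(Peers peers, int localNode) { this.peers = peers; this.localNode = localNode; }

        public byte[] serve(String cgiKey, CgiRunner runner, long ttlMillis) {
            Entry e = localCache.get(cgiKey);
            if (e != null && !e.expired()) return e.body;             // local hit
            Integer owner = globalDirectory.get(cgiKey);
            if (owner != null && owner != localNode) {
                byte[] remote = peers.fetch(owner, cgiKey);           // remote hit via directory
                if (remote != null) return remote;
            }
            byte[] fresh = runner.run(cgiKey);                        // miss: execute the CGI
            localCache.put(cgiKey, new Entry(fresh, System.currentTimeMillis() + ttlMillis));
            peers.announceInsertion(localNode, cgiKey);               // advertise the new entry
            return fresh;
        }
    }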
Our experiments show that, for dynamic requests, the single-computer performance of Swala without caching is comparable to the Netscape Enterprise server, that substantial speedups are obtained using a cache on a single machine, and that even greater speedups are obtained by using a cooperative cache on a cluster of workstations.

A number of other research groups have addressed Web proxy caching; however, they focus on file caching rather than caching of dynamic content. Furthermore, proxy caching of dynamic content may not be feasible; for example, it does not permit caching of authenticated content. Caching the results of dynamic requests has been studied recently by an IBM group, which has written a cache server and rewritten its server applications to insert and delete items from this cache. There are two main drawbacks to this approach. First, the server application must be rewritten to take advantage of the cache, which can be a nontrivial task for some applications and impossible for applications whose source code is not available. Second, for every dynamic request the Web server must still start the application, even if only to return a cache hit; for call mechanisms such as CGI, the operating system overhead of this call is significant, as we demonstrate in one of our experiments. Our work overcomes both limitations, since our cache is built into the Web server.
Abstracts of Published Papers
WWW-based information service has grown enormously during the last few
years, and major performance bottlenecks have been caused by WWW server
and internet bandwidth inadequacies. Augmenting the server with
multiprocessor support and shifting computation to client-site machines
can substantially improve the system response time and for some
applications, it may also reduce network bandwidth requirements. Taking
full advantage of these capabilities requires sophisticated scheduling.
We first investigate algorithms for scheduling HTTP requests within a
server cluster.
Typical distributed scheduling techniques for HTTP servers
either use a simple round-robin request distribution algorithm
or consider only a single criterion such as CPU load.
We propose novel multifaceted scheduling techniques that optimize the use of a
multiprocessor server
by predicting the demands of
requests on I/O, processor, and network resources.
We present SWEB, a software system implementing our techniques
on a cluster of workstations.
We provide a
performance analysis under simplified assumptions for understanding the
impact of system loads
when using our scheduling strategies.
Our experiments show
substantial improvements by our techniques compared to traditional
algorithms, and we observe a close correlation
between our theoretical analysis and the achieved results.
We then extend our techniques to adaptively incorporate client
resources. Due to a wide variation in client capabilities and
connection network characteristics, the standard technique of partitioning
the client-server workload at a fixed point is infeasible.
We present a task model and scheduling technique
for adaptive client-server computing in which the computation
required by a user request is dynamically partitioned
between the client and server by monitoring
network, client and server resources.
We demonstrate the use of this technique in digital library applications
such as image and postscript document browsing.
We also present SWEB++, a software system
to support programmers desiring to use our scheduling algorithms.
Experimentally, we achieve significantly faster response times
through utilizing client resources, and demonstrate the validity
of our theoretical analysis.
The realization of the need to deal with large-scale systems as a
whole, together with the availability of powerful computer resources, has
fueled interest in the efficient monitoring and modeling of such systems.
This is particularly true in the area of Earth System Science,
which has motivated this work.
Such projects need a multicomputer environment; reasons include the
requirement for very large computing power, financial constraints that
rule out supercomputers, and fault-tolerance concerns.
Furthermore, a distributed computing environment is more natural since
users and data generation are geographically distributed.
In an Earth System Science computing environment,
users should be able to browse the data available, process them,
insert new data and
correct or enhance data already available.
A bottleneck is the handling of image or image-like data:
the images are numerous and each of them is large.
We propose a scheme that stores images
and supports image browsing efficiently.
In particular, the scheme achieves good image compression while
supporting fast image subregion retrieval at various resolutions.
It is particularly suited for a distributed server/client environment.
Since new images are inserted, obsolete ones are deleted and others
are updated, support for maintaining data coherence without
sacrificing performance is needed.
For an image server/client environment to be used as part of a
distributed system, which may include other servers that
provide related
information, all these servers must appear as a single logical entity
to any user.
We propose an algorithm that achieves this objective by providing a
shared-memory abstraction, i.e., accesses to the memories of the servers
appear like accesses to a single memory module.
To achieve this efficiently, the algorithm dynamically migrates,
replicates, and deletes data among the servers in order to best
distribute the data for the current pattern of user accesses.
In an image browsing environment there is a need for progressively viewing image subregions at various resolutions. We describe a storage scheme that accomplishes good image compression, while supporting fast image subregion retrieval. We evaluate analytically and experimentally the compression performance of our algorithm. We also provide results on the speed of the algorithm to demonstrate its effectiveness, and present an extension to a client/server environment.
WWW-based Internet information service has grown enormously during the last few years, and major performance bottlenecks have been caused by WWW server and Internet bandwidth inadequacies. Utilizing client-site computing power and also multi-processor support at the server site can substantially improve the system response time. In this paper, we examine the use of scheduling techniques in monitoring and adapting to workload variation at client and server sites for supporting fast WWW image browsing. We provide both analytic and experimental results to identify the impact of system loads and network bandwidth on response times and demonstrate the effectiveness of our scheduling strategy.
This paper studies runtime partitioning, scheduling and load balancing techniques for improving performance of on-line WWW-based information systems such as digital libraries. The main performance bottlenecks of such a system are caused by the server computing capability and Internet bandwidth. Our observations and solutions are based on our experience with the Alexandria Digital Library (ADL) testbed at UCSB, which provides on-line browsing and processing of documents, digitized maps and other geo-spatially mapped data via WWW. A proper partitioning and scheduling of computation and communication in processing a user request on a multi-processor server and transferring some computation to client-site machines can reduce network traffic and substantially improve system response time. We propose a partitioning and scheduling mechanism that adapts to resource changes and optimizes resource utilization, and demonstrate the application of this mechanism for on-line information browsing. We also provide a performance analysis and experimental results to study the impact of resource availability and the effectiveness of our scheduling techniques.
WWW-based information service has grown enormously during the last few years, and major performance bottlenecks have been caused by WWW server and Internet bandwidth inadequacies. Augmenting a server with multiprocessor support and shifting computation to client-site machines can substantially improve system response times and for some applications, it may also reduce network bandwidth requirements. In this paper, we propose adaptive scheduling techniques that optimize the use of a multiprocessor server with client resources by predicting demands of requests on I/O, CPU and network capabilities. We also provide a performance analysis under simplified assumptions for understanding the impact of system loads and network bandwidth when using our scheduling strategy. Finally we report preliminary experimental results to examine the system performance and verify the usefulness of the analytic model.
In this paper, we investigate the issues involved in developing a scalable World Wide Web (WWW) server called SWEB on a cluster of workstations. The objective is to strengthen the processing capabilities of such a server in order to match huge demands in simultaneous access requests from the Internet, especially when these requests involve delivery of large digitized documents. The scheduling component of the system actively monitors the usages of CPU, disk I/O channels and the interconnection network to effectively distribute HTTP requests across processing units to exploit task and I/O parallelism. We analyze the maximum number of requests that can be handled by the system and present several experiments to examine the performance of this system.
Many parallel algorithms and library routines are available for performing computer vision and image processing (CVIP) tasks on distributed-memory multiprocessors. The typical image distribution may use column, row, and block based mapping. Integrating a set of library routines for a CVIP application requires a global optimization for determining the data mapping of individual tasks by considering inter-task communication. The main difficulty in deriving the optimal image data distribution for each task is that CVIP task computation may involve loops, and the number of processors available and the size of the input image may vary at the run time. In this paper, a CVIP application is modeled using a task chain with nested loops, specified by conventional visual languages such as Khoros and Explorer. A mapping algorithm is proposed that optimizes the average run-time performance for CVIP applications with nested loops by considering the data redistribution overheads and possible run-time parameter variations. A taxonomy of CVIP operations is provided and used for further reducing the complexity of the algorithm. Experimental results on both low-level image processing and high-level computer vision applications are presented to validate this approach.
In this paper we propose a new method for improving the average response time of Web servers by cooperatively caching the results of requests for dynamic content. The work is motivated by our recent study of access logs from the Alexandria Digital Library server at UCSB, which demonstrates that approximately a 30 percent decrease in average response time could be achieved by caching dynamically generated content. We have developed a distributed Web server called Swala, in which all nodes cooperatively cache CGI requests. We use a two-level consistency protocol to maximize the system performance and minimize overhead in responding to dynamic Web requests. Our experiments show that the single-node performance of Swala without caching is comparable to the Netscape Enterprise server, substantial speedups are obtained using a cache on a single machine, and even greater speedups are obtained by using a cooperative cache on a cluster of workstations.
In this paper, we investigate load balancing strategies for clustered Alexandria digital library (ADL) servers. The ADL system, which provides on-line information searching and browsing of spatially-referenced materials through the World Wide Web, involves intensive database I/O and heterogeneous CPU activities. Clustering servers can improve the scalability of the ADL system in response to a large number of simultaneous access requests. One difficulty addressed is that clustered workstation nodes may be non-uniform in terms of CPU and I/O speeds. We have developed an optimization scheme that dynamically monitors the resource availability, uses a low-cost communication strategy for updating load information among nodes, and schedules requests based on both I/O and computation load indices. Since the accurate cost estimation for processing database-searching requests is difficult, we have proposed a sampling and prediction scheme to identify the relative efficiency of nodes for satisfying I/O and CPU demands of these requests. We have provided analytic results to bound the performance of our scheme on this cluster environment and have conducted a set of experiments using the ADL traces to verify the effectiveness of the proposed strategies.
The SWEB++ project studies runtime partitioning, scheduling and load balancing techniques for improving the performance of on-line WWW-based information systems such as digital libraries. The main performance bottlenecks of such a system are caused by the server computing capability and Internet bandwidth. Our observations and solutions are based on our experience with the Alexandria Digital Library (ADL) testbed at UCSB, which provides on-line browsing and processing of documents, digitized maps and other geo-spatially mapped data via WWW. A proper partitioning and scheduling of computation and communication in processing a user request on a multi-processor server and transferring some computation to client-site machines can reduce network traffic and substantially improve system response time. We have proposed a partitioning and scheduling mechanism that adapts to resource changes and optimizes resource utilization. We have developed a software tool which implements and supports the use of our scheduling strategies when programming WWW applications, and have demonstrated the application of this system for on-line DL information browsing. We have conducted a set of experiments to examine the system performance on the Meiko parallel machine and a cluster of workstations.