/ N. N. Mirenkov / Professor
/ S. G. Sedukhin / Associate Professor
/ Shietung Peng / Assistant Professor
/ Sergey V. Ten / Assistant Professor
Our general program is high-performance computing and advanced information infrastructure. We are also thinking about a planetary machine and the active knowledge being developed by humanity. We are undertaking research efforts in parallel and distributed algorithms/architectures and in visual (multimedia) languages and tools. We lead the joint research projects ``Communication-efficient parallel computing'' and ``A new paradigm for clustered computing''. The first project focuses on the design of communication-efficient parallel algorithms for bulk-synchronous parallel computers; the key is to develop a router that can perform efficient point-to-point and collective communication among processors. The second project focuses on modeling, developing software packages, investigating fault tolerance, and performance analysis for clustered computing.
We are studying the problem of parallel program portability and skeleton-based techniques, and are developing the supporting tools. Our theoretical work is related to efficient algorithms for parallel numerical computation, including linear algebra and Fourier transforms, as well as to fault-tolerant routing on interconnection networks such as hypercubes, star graphs, and tori. Information about our activities can be found on the laboratory home page, http://www.u-aizu.ac.jp/labs/sw-dpp/, which contains links to three experimental laboratory WWW servers. These servers cover High-Performance Computing and Communication topics and an on-line CD-ROM Digital Library, and serve as mirror sites of, for instance, the ``Computational Science Educational Project''.
As one of our results we would like to mention the S4CAD system. S4CAD (System of Systolic Structures Synthesis) is a high-level interactive software tool for the design, analysis and modeling of systolic VLSI-oriented array processors. Starting with a localized data dependence graph of an algorithm, S4CAD automatically produces all permissible array processors, among which the designer may choose the optimal one. The S4CAD World Wide Web page is available at http://www-dpp.u-aizu.ac.jp/HPCC/S4CAD/.
An important question in our research is parallel program transparency and the accessibility of massively parallel computers to large user populations. We are developing a multimedia technology for the interactive specification of application algorithms. This technology is related to the visualization, animation and sonification of data processing methods; in fact, it is a `filmification' of application algorithms. Each film is based on a rather big piece of knowledge about a class of algorithms. Our approach is close to the idea of algorithmic skeletons (higher-order functions). However, skeletons leave a big gap between reality and a human (a multi-channel being), so we use algorithmic multimedia skeletons based on both mathematical and physical abstractions.
Each film is a series of frames (computational steps) displaying one or more parameterized sets of nodes and/or moving objects in multi-dimensional space-time. Each frame highlights a subset of these nodes and/or moving objects. The films have their own sets of figures, sounds and colors, and their own ``shape'' of movement. Users recognize them through a combination of all these features. Each film defines a partial order of scanning of the nodes or objects (and, possibly, colors and sounds). As a rule, computation specified on different nodes (objects) of a frame is performed in parallel, while computation specified in different frames is performed sequentially. So, it is possible to say: the shorter the film, the better.
The user defines the specification by creating his or her own new film. The corresponding program (sequential or parallel) is generated automatically. Each film itself is considered a special-purpose application for which a parallel program has been developed in advance as an item of the system library. The specification of a problem is considered input data for such a parallel program. However, this data can introduce a composition of several films, irregularity of computation, special data sizes, etc. This means that, to reach efficiency, it is impossible to have only one variant of the film implementation; so each film is supported by at least a few parallel programs from the system library. Different programs correspond to different conditions defined by the specification. In other words, a film has a special decision-making space. This space takes into account levels of problem granularity, irregularity, dynamics of the data decomposition, and other algorithm features.
In fact, we are developing film machines where data, knowledge and algorithms (as well as results) are specified by films. Distributed film databases can be considered as active common knowledge/software being developed by humanity.
Refereed Journal Papers
In this paper the design of systolic array processors for computing the 2-dimensional Discrete Fourier Transform (2-D DFT) is considered. We investigated three different computational schemes for designing systolic array processors using a systematic approach. The systematic approach guarantees finding optimal systolic array processors from a large solution space in terms of the number of processing elements and I/O channels, the processing time, topology, pipeline period, etc. The optimal systolic array processors are scalable, modular and suitable for VLSI implementation. An application of the designed systolic array processors to the prime-factor DFT is also presented.
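The systolic designs themselves are hardware structures, but the computation they implement can be illustrated in software. Below is a minimal plain-Python sketch of the 2-D DFT via the standard row-column decomposition (a naive reference implementation, not the array-processor design from the paper):

```python
import cmath

def dft1d(x):
    """Naive 1-D DFT: X[k] = sum_j x[j] * exp(-2*pi*i*j*k/N)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def dft2d(a):
    """2-D DFT by the row-column method: 1-D DFTs over rows, then columns."""
    rows = [dft1d(row) for row in a]            # transform each row
    cols = [dft1d(list(c)) for c in zip(*rows)] # transpose, transform columns
    return [list(r) for r in zip(*cols)]        # transpose back

F = dft2d([[1, 2], [3, 4]])
# F[0][0] is the sum of all input entries (the DC term)
```

The row-column method turns one 2-D transform into 2N 1-D transforms, which is exactly the kind of regular, decomposable structure that systolic mappings exploit.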
Division-free algorithms for solving linear algebra problems, and their applications in signal/image processing, have attracted interest for parallel processing. In this paper the design of systolic array processors for solving linear systems of equations using the division-free Gaussian elimination method is presented. The design is based on a systematic approach which constructs array processors by first investigating parallel versions of the division-free algorithm and their three-dimensional dependency graphs, and then generating planar systolic array processors by projecting the dependency graphs along properly selected directions. The resulting array processors are synthesized and analyzed, and it is shown that some of them are optimal. The key to designing an optimal array processor is to design a parallel algorithm whose dependency graph is well structured. The array processors are optimal in terms of the number of processing elements, the number of input/output ports, the processing time, and the pipelining period.
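The division-free elimination step itself is simple to state: row i is updated as a'_ij = a_kk * a_ij - a_ik * a_kj, so no divisions (and hence no fractions or rounding) occur during elimination. A scalar sketch for integer inputs, using exact rationals only for the final back substitution (this illustrates the recurrence, not the systolic array):

```python
from fractions import Fraction

def division_free_solve(A, b):
    """Solve A x = b for integer A, b: division-free forward elimination
    (a'_ij = a_kk * a_ij - a_ik * a_kj), then exact back substitution."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented integer matrix
    for k in range(n - 1):
        for i in range(k + 1, n):
            aik = M[i][k]                          # save before it is overwritten
            for j in range(k, n + 1):
                M[i][j] = M[k][k] * M[i][j] - aik * M[k][j]
    # back substitution: the only place divisions occur, done exactly
    x = [Fraction(0)] * n
    for i in range(n - 1, -1, -1):
        s = Fraction(M[i][n]) - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x
```

Note that after the update, M[i][k] becomes a_kk*a_ik - a_ik*a_kk = 0, i.e. the subdiagonal is zeroed without any division, at the cost of entry growth.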
Given a node s and a set T={t_1,...,t_k} of k nodes in a k-connected graph G, node-to-set routing is to find k node disjoint paths s --> t_i, i = 1,...,k. In this paper, we give an efficient algorithm for node-to-set routing in n-dimensional star graphs G_n. The algorithm finds n-1 node disjoint paths of length at most d(G_n)+2 in optimal time, where d(G_n) is the diameter of G_n.
In this paper, we give an algorithm which, given at most n-1 faulty nodes and two nonfaulty nodes s and t in the n-dimensional hypercube, finds a fault-free path s -> t of length at most d(s,t)+2 in O(n) time. Using this algorithm as a subroutine, we present another algorithm, which, given at most 2n-3 faulty nodes such that the faulty nodes can be covered by n-1 subgraphs of diameter 1, finds a fault-free path s -> t of length at most d(s,t)+4 in O(n) time.
In this paper, we study the following node-to-node and node-to-set routing problems in the $r$-dimensional torus $T^r_n$: given at most $2r-1$ faulty nodes and non-faulty nodes $s$ and $t$ in $T^r_n$, find a fault-free path $s\rightarrow t$; and given at most $2r-k$ faulty nodes and non-faulty nodes $s$ and $t_1,\ldots,t_k$ in $T^r_n$, find $k$ fault-free node-disjoint paths $s\rightarrow t_i$, $1\leq i\leq k$. We give an $O(r^2)$ time algorithm which finds a fault-free path of length at most $d(T^r_n)+1$ for the node-to-node routing, where $d(T^r_n)$ is the diameter of $T^r_n$. For node-to-set routing, we show an $O(r^3)$ time algorithm which finds $k$ fault-free node-disjoint paths $s\rightarrow t_i$, $1\leq i\leq k$, of length at most $d(T^r_n)+1$. The upper bounds on the length of the found paths are optimal. From this, the Rabin diameter of $T^r_n$ is $d(T^r_n)+1$.
The problem of finding a core of a graph with a specified length has applications in transportation science, optimal resource allocation, distributed database systems, etc. This paper presents efficient algorithms for finding a core of a tree with a specified length, where a core of a graph is a central path minimizing the sum of the distances of all vertices to the path. Our sequential algorithm runs in $O(n\log n)$ time, which improves the previous $O(n^3)$ result. The parallel algorithm runs in $O(\log^2 n)$ time using $O(n/\log n)$ processors on the EREW PRAM model.
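As a reference point for the improvement, the $O(n^3)$-style brute force is straightforward: enumerate every tree path of at most the specified length and score it by a multi-source BFS. A sketch (function names are ours, for illustration):

```python
from collections import deque

def bfs_dist(adj, sources):
    """Multi-source BFS distances from a set of nodes in an unweighted graph."""
    dist = {s: 0 for s in sources}
    q = deque(sources)
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def tree_path(adj, u, v):
    """The unique path u..v in a tree, via DFS with parent pointers."""
    parent, stack = {u: None}, [u]
    while stack:
        x = stack.pop()
        if x == v:
            break
        for y in adj[x]:
            if y not in parent:
                parent[y] = x
                stack.append(y)
    path, x = [], v
    while x is not None:
        path.append(x)
        x = parent[x]
    return path

def core(adj, max_len):
    """Brute-force core: path with at most max_len edges minimizing the
    sum of distances of all vertices to the path."""
    best, nodes = None, list(adj)
    for i, u in enumerate(nodes):
        for v in nodes[i:]:
            p = tree_path(adj, u, v)
            if len(p) - 1 > max_len:
                continue
            cost = sum(bfs_dist(adj, p).values())
            if best is None or cost < best[0]:
                best = (cost, p)
    return best
```

On the path graph 1-2-3-4-5 with length 2, the best core is the middle path {2,3,4}, with distance sum 2.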
In this paper, we give an algorithm which, given a set $F$ of at most $(n-1)-k$ faulty nodes, and two sets $S=\{s_1,\ldots,s_k\}$ and $T=\{t_1,\ldots,t_k\}$, $1\leq k\leq n-1$, of non-faulty nodes in $n$-dimensional star graphs $G_n$, finds $k$ fault-free node disjoint paths $s_i\rightarrow t_{j_i}$, where $(j_1,\ldots,j_k)$ is any permutation of $(1,\ldots,k)$, of length at most $d(G_n)+5$ in $O(kn)$ optimal time, where $d(G_n)$ is the diameter of $G_n$.
In this paper, we give an algorithm which, given a set $F$ of at most $n-k$ faulty nodes, and two sets $S=\{s_1,\ldots,s_k\}$ and $T=\{t_1,\ldots,t_k\}$, $1\leq k\leq n$, of non-faulty nodes in $n$-dimensional hypercubes $H_n$, finds $k$ fault-free node disjoint paths $s_i\rightarrow t_{j_i}$, where $(j_1,\ldots,j_k)$ is a permutation of $(1,\ldots,k)$, of length at most $n+k$ in $O(kn\log k)$ time. The result of this paper implies that $n$ disjoint paths of length at most $2n$ for set-to-set node disjoint path problem in $H_n$ can be found in $O(n^2\log n)$ time.
The paper presents an example of method filmification. Two special-purpose animation films related to computational (communicational) schemes on multi-stage networks are described. The first film shows a partial order of node scanning on butterfly networks; the second, on bitonic sorting networks. These films can be used to specify various algorithms of similar types. Each film is supported by a set of templates, which are programs and files. Examples of templates and of program synthesis from film specifications are also presented.
Refereed Proceeding Papers
An approach based on a fundamental feature of continuous functions is proposed for consideration. It allows a great part of the 3-D computation to be reduced to 1-D computation. 1-D arrays and tables can be evaluated in advance or before the main computation. Compositions of values from these arrays and tables are used for the evaluation of 3-D functions with hierarchic computational structures. The approach is supported by a set of optimization techniques to better exploit the internal parallelism of a processor as well as the parallelism of multiple processors. To illustrate the approach, some graphic computation problems are considered.
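For the special case of a separable function, the reduction is easy to see: each 1-D table is filled once, and a 3-D value then becomes a composition of table lookups. A toy sketch, assuming f(x,y,z) = g(x) + g(y) + g(z) on a regular grid (the paper's decomposition is more general than this):

```python
import math

# 1-D table built once, before the main computation
N = 64
table = [math.sin(2 * math.pi * i / N) for i in range(N)]

def f_from_table(i, j, k):
    """3-D value from three 1-D table lookups instead of three sin() calls."""
    return table[i] + table[j] + table[k]

def f_direct(i, j, k):
    """Direct evaluation, for comparison."""
    return (math.sin(2 * math.pi * i / N) + math.sin(2 * math.pi * j / N)
            + math.sin(2 * math.pi * k / N))
```

For an N^3 grid this replaces N^3 transcendental evaluations per axis with N table entries, and the independent lookups parallelize trivially.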
A concept of film databases as a core of film machines is proposed for consideration. Items of such databases are special-purpose animation or video films. These films are considered as units for data/knowledge acquisition and as communication units for computer-human dialog. Operations with data/knowledge units, as well as retrieval operations on the databases, are specified by films, too. This paper briefly presents the basis of our approach and describes a film database being developed for specifying (programming) application algorithms. Some details of a few films related to image processing and cellular automaton-like methods are also discussed.
The idea behind the VIM technology is making use of animation films for the interactive specification of application algorithms as well as for very-high-level parallel programming. Each film is a ``shape'' of computation (an algorithmic multimedia skeleton). It defines sets of points and/or moving objects in multi-dimensional space-time, and a partial order of scanning of these points and objects. This paper presents one such film related to linear algebra algorithms. Each frame of the film shows parameterized 2-D and 1-D structures and highlights subsets of nodes where computation should be defined. To specify a problem the user employs one-click operations to delete, insert or merge frames and provides some formulas for the computation on the nodes. The corresponding program (sequential or parallel) is generated automatically.
Key words and phrases: classification of methods, filmification of methods, multiple model paradigms, multimedia algorithmic skeletons, films as communication units for computer-human dialog, films as units for data/knowledge acquisition, film databases, film machines, VIM technology.
This paper presents an outline of our strategy for program synthesis within the VIM film technology, where special-purpose animation films are used as a new type of abstraction. To specify an algorithm/method the user develops his/her own film. New films are created through cutting, editing and a few other click operations, as well as by combining and merging component films. These operations predefine transformation rules to be performed on templates related to the system films. These templates are hand-made programs or files organized in a few hierarchical levels, taking into account various kinds of programming know-how and techniques for the efficient implementation of computation on a target computer system. The program synthesis is performed by sequential transformations of the above-mentioned programs and files.
An interactive graphic tool for systematic design and analysis of VLSI systolic-like array processors is described. The WWW page of this tool is http://www-dpp.u-aizu.ac.jp/HPCC/S4CAD/.
In this paper, we present a new parallel algorithm for Householder bidiagonalization on parallel computers with a dynamic ring architecture. A two-sided Householder reduction/expansion technique is applied for bidiagonalization. Innovative systolic-like communication techniques are proposed to eliminate the need to compute explicitly the transpose of the matrix. An experimental study on the CM-5 shows that the parallel algorithm proposed in this paper achieves high speedup for large matrices.
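The building block of the reduction is the Householder reflector, which zeroes all but the first component of a vector. A minimal scalar sketch of one reflector (not the parallel CM-5 implementation):

```python
import math

def householder(x):
    """Householder vector v such that (I - 2 v v^T / v^T v) x has
    all components after the first equal to zero."""
    norm = math.sqrt(sum(t * t for t in x))
    v = list(x)
    v[0] += math.copysign(norm, x[0])   # sign choice avoids cancellation
    return v

def apply_householder(v, x):
    """Apply the reflection (I - 2 v v^T / v^T v) to a vector x."""
    vtv = sum(t * t for t in v)
    vtx = sum(a * b for a, b in zip(v, x))
    return [a - 2 * vtx / vtv * b for a, b in zip(x, v)]
```

Bidiagonalization applies such reflectors alternately from the left and the right; the two-sided variant mentioned above is what lets the communication pattern avoid an explicit matrix transpose.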
In this paper, the design of systolic array processors for solving linear systems of equations using the fraction-free Gaussian elimination method is presented. The design is based on a formal approach which systematically constructs a family of planar array processors. These array processors are synthesized and analyzed. It is shown that some array processors are optimal within the framework of linear allocation of computations, in terms of the number of processing elements, the number of input/output ports, the processing time, and the data pipelining period.
In this paper the design of systolic array processors for solving linear systems of equations using the division-free Gaussian elimination method is presented. The design is based on a systematic approach which constructs array processors by first investigating parallel versions of the division-free algorithm and their three-dimensional dependency graphs, and then generating planar systolic array processors by projecting the dependency graphs along properly selected directions. The resulting array processors are synthesized and analyzed, and it is shown that some of them are optimal. The key to designing an optimal array processor is to design a parallel algorithm whose dependency graph is well structured. The array processors are optimal in terms of the number of processing elements, the number of input/output ports, the processing time, and the pipelining period.
In this paper we propose a parallel distributed solution of a compute-intensive problem: rendering of the functionally represented geometric objects. The Network Linda programming tool was used to distribute rendering on the network of workstations of the University of Aizu. Speedup results and a load-balancing technique, applied to complex objects, are studied in the paper. Finally, the original algorithm was optimized and tested in the parallel processing environment.
The design of optimal array processors for solving linear systems using the two-step division-free Gaussian elimination method is considered. The two-step method improves on the one-step method in terms of numerical stability. In spite of the rather complicated computations needed at each iteration of the two-step method, we develop an innovative parallel algorithm whose data dependency graph meets the requirements of regularity and locality. We then derive two-dimensional array processors by adopting a systematic approach to investigate the set of all admissible solutions, and obtain the optimal array processors under linear time-space scheduling. The array processors are optimal in terms of the number of processing elements used.
Given a graph G, a cluster C is a connected subgraph of G, and C is called a faulty cluster if all nodes in C are faulty. Given an n-dimensional star graph G_n with n-2 faulty clusters of diameter at most 2, it has been shown that any two non-faulty nodes s and t of G_n can be connected by a fault-free path of length at most d(G_n)+6 in O(n^2) time, where d(G_n) is the diameter of G_n. In this paper, we prove that a fault-free path s --> t of length at most d(G_n)+1 for n>10 or n odd, and d(G_n)+2 otherwise, can be found in O(n^2) time. We also show that this bound on the length of the path s --> t is optimal.
In this paper, we prove that an n-dimensional star graph G_n can tolerate as many as n-k-1 arbitrary faulty clusters of diameter at most 2 for the node-to-set (node s and set of nodes T) routing problem, where k is the size of the set T. We also give an algorithm which, in the presence of up to n-k-1 faulty clusters of diameter at most 2, finds k node-disjoint fault-free paths of length at most d(G_n)+9 in O(|F|+kn) time, where d(G_n) is the diameter of G_n and |F| is the total number of faulty nodes in the faulty clusters.
In this paper, we consider fault-tolerant broadcasting in MIMD hypercubes in which each node can send messages to all its neighbors simultaneously; however, each node can receive a message from only one of its neighbors at each time step. Let H_n denote the n-dimensional hypercube. H_n is called d-safe if each node of H_n has at least d fault-free neighbors. We give efficient broadcasting algorithms for d-safe H_n with up to 2^d(n-d)-1 faulty nodes. For the case d=1, our algorithm completes the broadcasting in an optimal number of time steps. For d>1, we give an algorithm which completes the broadcasting in n+O(d^2) time steps. All the algorithms have optimal traffic.
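As a fault-free baseline for this model, the classical dimension-order (binomial-tree) broadcast informs all 2^n nodes in n steps, with each node receiving from exactly one neighbor. A small simulation sketch (the paper's fault-tolerant algorithms address the much harder faulty case):

```python
def broadcast_steps(n, source=0):
    """Dimension-order broadcast in a fault-free n-cube: at step d,
    every already-informed node sends across dimension d.
    Returns the step at which each node receives the message."""
    recv = {source: 0}
    for d in range(n):
        for u in list(recv):            # snapshot: only nodes informed before step d+1 send
            v = u ^ (1 << d)
            if v not in recv:
                recv[v] = d + 1
    return recv

steps = broadcast_steps(4)
```

Each node receives exactly once, so the schedule respects the single-receive constraint and uses the minimum 2^n - 1 message transmissions.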
The analysis of genomes evolving by inversions and transpositions leads to a combinatorial problem of sorting a permutation using reversals and transpositions of arbitrary fragments. We establish a lower bound and give two algorithms for this problem. We show that the first is a 2-approximation algorithm, and that the second is a polynomial-time 2(1+1/k)-approximation algorithm.
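A breakpoint count gives one standard lower bound of this kind: with 0 and n+1 appended to the permutation, a reversal changes the number of breakpoints by at most 2 and a transposition by at most 3, so at least b(pi)/3 operations are needed under this breakpoint definition (the paper's exact bound may differ). A sketch:

```python
def breakpoints(perm):
    """Breakpoints of a permutation of 1..n: adjacent pairs that are not
    consecutive integers, with sentinels 0 and n+1 appended."""
    ext = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(ext, ext[1:]) if abs(a - b) != 1)

# The identity has 0 breakpoints; any sorting sequence of reversals and
# transpositions must therefore contain at least breakpoints(perm)/3 steps.
```

Breakpoint-removal arguments of this form also underlie the analysis of the greedy approximation algorithms for genome rearrangement.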
In this paper, we present an optimal algorithm for node-to-node fault-tolerant routing in hypercubes. The algorithm finds a fault-free path between any two non-faulty nodes s and t, assuming that there are at most $2n-3$ faulty nodes and that every node in the hypercube has at least one non-faulty neighbor. The algorithm finds the fault-free path, of length at most d(s,t)+4, in O(n) time. Both the time complexity and the upper bound on the length of the path are optimal.
In this paper, we study the following $k$-pairwise CFT (Cluster Fault Tolerant) routing problem in $n$-dimensional star graphs $G_n$: given a set $F$ of fault clusters and $k$ pairs of $2k$ distinct non-faulty nodes $(s_1,t_1),\ldots,(s_k,t_k)$ in $G_n$, find $k$ fault-free node-disjoint paths connecting the $k$ node pairs, one path per pair. We show in this paper that for $1\leq k\leq \lceil\frac{n-1}{2}\rceil$, $G_n$ can tolerate as many as $n-2k$ fault clusters of diameter at most 2. We also give $O(n^2)$ optimal-time algorithms which, given a tolerable set ${\bf F}$, find $k$ fault-free node-disjoint paths of length at most $d(G_n)+8$ for the $k$-pairwise CFT routing problem in $G_n$, where $d(G_n)=\lfloor \frac{3(n-1)}{2}\rfloor$ is the diameter of $G_n$.
Two paths $P$ and $Q$ in a graph $G$ are $d$-separated if for any node $u$ in $P$ and any node $v$ in $Q$ except possibly the common end-nodes, the distance between $u$ and $v$ is at least $d$. In this paper, we study the following node-to-node routing problem in hypercubes: (1) finding 2-separated paths in free space (without obstacles); (2) finding disjoint paths with path-shaped obstacles. We give algorithms to construct the 2-separated/disjoint paths for node-to-node routing problems (1) and (2), and show that the numbers of paths constructed by our algorithms are maximum in the worst case.
In this paper, we consider a generalized node-disjoint paths problem: the $d$-separated paths problem. In a graph $G$, given two distinct nodes $s$ and $t$, two paths $P$ and $Q$ connecting $s$ and $t$ are $d$-separated if $d_{G-\{s,t\}}(u,v)\geq d$ for any $u\in P-\{s,t\}$ and $v\in Q-\{s,t\}$, where $d_{G-\{s,t\}}(u,v)$ is the distance between $u$ and $v$ in the reduced graph $G-\{s,t\}$. The $d$-separated paths problem is to find as many $d$-separated paths between $s$ and $t$ as possible. In this paper, we solve the $d$-separated paths problem on the $n$-dimensional hypercube $H_n$ and the $n$-dimensional star graph $G_n$. Given $s$ and $t$ in $H_n$, there are at least $(n-2)$ 2-separated paths between $s$ and $t$, and $(n-2)$ is the maximum number of $2$-separated paths between $s$ and $t$ for $d(s,t)\geq 4$. Moreover, $(n-2)$ 2-separated paths of length at most $\min\{d(s,t)+2$, $n+1\}$ for $d(s,t)<n$ and of length $n$ for $d(s,t)=n$ between $s$ and $t$ can be constructed in $O(n^2)$ optimal time. For $d\geq 3$, $d$-separated paths in $H_n$ do not exist. Given $s$ and $t$ in $G_n$, $(n-1)$ 3-separated paths of length at most $\min\{d(s,t)+4, d(G_n)+2\}$ between $s$ and $t$ can be constructed in $O(n^2)$ optimal time, where $d(G_n)=\lfloor \frac{3(n-1)}{2}\rfloor$. For $d\geq 5$, $d$-separated paths in $G_n$ do not exist.
The set-to-set node-disjoint paths problem is: given two sets $S=\{s_1,\ldots,s_k\}$ and $T=\{t_1,\ldots,t_k\}$ of nodes in a graph, find $k$ node-disjoint paths $s_i\rightarrow t_{j_i}$, where $(j_1,\ldots,j_k)$ is some permutation of $(1,\ldots,k)$. In this paper, we give an algorithm which, given $S=\{s_1,\ldots,s_k\}$ and $T=\{t_1,\ldots,t_k\}$, $1\leq k\leq n$, in an $n$-dimensional hypercube $H_n$, finds $k$ disjoint paths $s_i\rightarrow t_{j_i}$ of length at most $n+\log k+2$ in $O(kn\log^* k)$ time. The previous results for the length of the $k$ paths and the time to find them in $H_n$ were $n+k$ and $O(kn\log k)$, respectively.
In this paper, we give an algorithm which, given at most $n-1$ arbitrary faulty nodes and non-faulty nodes $s$ and $t$ in the $n$-dimensional hypercube $H_n$, finds a fault-free path of length at most $d(s,t)+2$ for $d(s,t)<n$ and of length $d(s,t)$ for $d(s,t)=n$ in $O(n)$ time, where $d(s,t)$ is the distance between $s$ and $t$. We also give an algorithm which, given at most $n-1$ faulty clusters of diameter at most 1 with $2n-3$ faulty nodes in total and non-faulty nodes $s$ and $t$ in $H_n$, finds a fault-free path of length at most $d(s,t)+4$ in $O(n)$ time. The upper bounds on the length of the path and the time complexity obtained in this paper are optimal.
The generation of animation sequences of a deformed volumetric object on networked workstations is discussed. We use several deformation types: space mapping controlled by points linked to features of an object, deformation with an algebraic sum, and metamorphosis. These deformations are applied directly to interpolated volume data, followed by polygonization of an isosurface for visualization. We have achieved an average speedup of about 70% of the maximum.
Grants
Academic Activities