/ N. N. Mirenkov / Professor
/ S. G. Sedukhin / Professor
/ Shietung Peng / Associate Professor
/ Alexander Vazhenin / Visiting Associate Professor
Our general program is high-performance computing and advanced information infrastructure. We are also thinking about a planetary machine and active knowledge being developed by humanity. We are undertaking research efforts in parallel and distributed algorithms/architectures, visual (multimedia) languages and tools. We lead joint research projects ``Communication-efficient parallel computing" and ``A new paradigm for clustered computing". In the first research project, we focus on the design of communication-efficient parallel algorithms for Bulk-synchronous parallel computers. The key is to develop a router that can perform efficient point-to-point and collective communication among processors. The second research project will focus on modeling, developing software packages, investigating fault-tolerance and performance analysis for the clustered computing.
We are studying the problem of parallel program portability, skeleton-based techniques and developing the supporting tools. Our theoretical work is related to efficient algorithms for parallel numerical computation including linear algebra and Fourier transformation as well as for fault tolerant routing on certain interconnection networks, like hypercubes, star graphs, and torus. Information about our activities can be found on the laboratory Home Page: http://www.u-aizu.ac.jp/labs/sw-dpp/, which contains links to three experimental laboratory WWW servers. These servers hold the High-Performance Computing and Communication topics, on-line CD-ROM Digital Library and serve as mirror sites of, for instance, the ``Computational Science Educational Project".
As one of our results we would like to mention the S4CAD system. S4CAD (System of Systolic Structures Synthesis) is a high level interactive software tool for designing, analysis and modeling of systolic VLSI-oriented array processors. Starting with a localized data dependence graph of the algorithm S4CAD automatically produces all permissible array processors, among which the designer may chose the optimal one. World Wide Web page of the S4CAD is available at http://www-dpp.u-aizu.ac.jp/HPCC/S4CAD/.
The important question of our research is parallel program transparency and accessibility of massively parallel computers to large user populations. We are developing a multimedia technology for the interactive specification of application algorithms. This technology is related to visualization, animation and sonification of data processing methods. In fact, it is a `filmification' of application algorithms. Each film is based on a rather big peace of knowledge of a class of the algorithms. Our approach is close to the idea of algorithmic skeletons (higher order functions). However, skeletons leave a big gap between reality and a human (a multi-channel being). So, we use algorithmic multimedia skeletons based on both mathematical and physical abstractions.
Each film is a series of frames (computational steps) displaying one or more parameterized sets of nodes and/or moving objects in multi-dimensional space-time. Each frame highlights a subset of these nodes and/or moving objects. The films have their own sets of figures, sounds, colors and their own ``shape'' of the movement. Users recognize them due to a combination of all these features. Each film defines a partial order of scanning of the nodes or objects (and possibly, colors and sounds). As a rule, computation specified on different nodes (objects) of a frame is considered to be performed in parallel. Computation specified in different frames is performed sequentially. So, it is possible to say: the shorter film the better.
The user defines the specification by creating his new film. The corresponding program (sequential or parallel) is generated automatically. Each film itself is considered as a special-purpose application for which a parallel program is developed in advance as an item of the system library. The specification of a problem is considered as input data for such a parallel program. However, this data can introduce a composition of a few films, an irregularity of computation, special sizes of data, etc. It means that to reach efficiency it is impossible to have only one variant of the film implementation. So, each film is supported at least by a few parallel programs from the system library. Different programs are related to different conditions being defined by the specification. In other words, a film has a special space of decision-making. This space takes into account levels of problem granularity, irregularity, dynamics of the data decomposition, and some others related to algorithm features.
In fact, we are developing film machines where data, knowledge and algorithms (as well as results) are specified by films. Distributed film databases can be considered as active common knowledge/software being developed by humanity.
Refereed Journal Papers
In this paper the design of systolic array processors for computing 2-dimensional Discrete Fourier Transform (2-D DFT) is considered. We investigated three different computational schemes for designing systolic array processors using systematic approach. The systematic approach guarantees to find optimal systolic array processors from a large solution space in terms of the number of processing elements and I/O channels, the processing time, topology, pipeline period, etc. The optimal systolic array processors are scalable, modular and suitable for VLSI implementation. An application of the designed systolic array processors to the prime-factor DFT is also presented.
In this paper the design of systolic array processors for computing 2-dimensional Discrete Fourier Transform (2-D DFT) is considered. We investigated three different computational schemes for designing systolic array processors using systematic approach. The systematic approach guarantees to find optimal systolic array processors from a large solution space in terms of the number of processing elements and I/O channels, the processing time, topology, pipeline period, etc. The optimal systolic array processors are scalable, modular and suitable for VLSI implementation. An application of the designed systolic array processors to the prime-factor DFT is also presented.
Finding disjoint paths for certain collective communication in a parallel computer is an important issue for fault-tolerance and traffic control. In this paper, we consider finding disjoint paths for node-to-set routing problem in star graphs. Given a node $s$ and a set $T=\{t_1,...,t_k\}$ of $k$ nodes in a $k$-connected graph $G$, node-to-set routing is to find $k$ node disjoint paths $s \rightarrow t_i,~ i = 1\leq i \leq k$. In this paper, we improve the previous node-to-set routing algorithm for $n$-dimensional star graphs $G_n$. The new algorithm finds $n-1$ node disjoint paths of length at most $d(G_n)+1$ if $n \neq 4, 6$, where $d(G_n)$ is the diameter of $G_n$. Our result on the length of the disjoint paths is optimal.
In this paper, an efficient algorithm for node-to-set routing in star graphs is presented. Our algorithm improves the previous results on the maximal length of the paths found, from $5n$ to $1.5n+2$. More specifically, for node-to-set routing in $n$-dimensional star graph $G_n$, our algorithm finds $n-1$ node disjoint paths of length at most $d(G_n)+2$ in optimal time, where $d(G_n)$ is the diameter of $G_n$. The algorithm use cycle forms of permutations to derive important formula that allow us to find disjoint paths with shorter length.
In this paper, we introduce a general fault tolerant routing problem, called cluster fault tolerant (CFT) routing, which is a natural extension of the node fault tolerant routing problem. As a case study, we investigate the following $k$-pairwise CFT routing in $n$-dimensional hypercube $H_n$: Given a set of faulty clusters and $k$ distinct nonfaulty node pairs, find $k$ fault-free disjoint paths connecting the $k$ node pairs. We show that $H_n$ can tolerate $n-2k+1$ faulty clusters of diameter one for the $k$-pairwise CFT routing ($k>1$). We also present an $O(kn \log n)$ time algorithm which finds the $k$ fault-free disjoint paths.
Refereed Proceeding Papers
The design of algorithmic array processors for solving linear systems of equations using fraction-free Guassian method is presented. The design is based on a formal approach which constructs a family of planar array processors systematically. These array processors are synthesized and analyzed. It is shown that two of them are optimal under the framework of linear allocation of computations and in the terms of computing time and the number of processing elements.
The design of optimal array processors for solving linear systems is considered in this paper. We present a new parallel algorithm/architecture based on the two-step division-free Gaussian elimination method. The new design circumvents the one-step method in terms of numerical stability. In spite of the rather complicated computations needed at each iteration of the two-step method, we develop an innovative parallel algorithm whose data dependency graph meets the requirements for regularity and locality. Then we derive two-dimensional array processors by adopting a systematic approach to investigate the set of all admissible solutions. Finally, we obtain the optimal array processors in terms of the number of processors under linear time-space scheduling.
A systematic approach and software tool for synthesis and analysis of a set of systolic-like array processors are presented. The S4CAD tool allow us to obtain and examine the set of admissible array processors. The software tool uses advaaanced graphical media technologies that gets more comfort for the user to evaluate and choose an optimal solution, observing the requirements on the computing time, number of processing elements, topology of the connection, data flow formats, etc. As an example, the design of systolic array processors for the transitive closure algorithm is shown.
The design of optimal array processors for solving linear systems is considered in this paper. We present a new parallel algorithm/architecture based on the two-step division-free Gaussian elimination method. The new design circumvents the one-step method in terms of numerical stability. In spite of the rather complicated computations needed at each iteration of the two-step method, we develop an innovative parallel algorithm whose data dependency graph meets the requirements for regularity and locality. Then we derive two-dimensional array processors by adopting a systematic approach to investigate the set of all admissible solutions. Finally, we obtain the optimal array processors in terms of the number of processors under linear time-space scheduling.
In this paper, we investigate the node-to-node fault tolerant routing problem in $n$-dimensional hypercubes $H_n$ based on a generalized cluster fault tolerance model: an improvement over the previous cluster fault tolerance model of limited diameters. For a graph $G$, a faulty cluster is a connected subgraph of $G$ such that all its nodes are faulty. In cluster fault-tolerant routing problems, how many faulty clusters and how large of those clusters can be tolerated are studied. It was previously known that for node-to-node routing, $H_n$ can tolerate as many as $n-1$ faulty clusters of diameter at most 1 with at most $2n-3$ faulty nodes in total. In this paper, we extend the above results to include faulty clusters of arbitrary diameter. We also give efficient routing algorithms which, given a tolerable set of faulty clusters and non-faulty nodes $s$ and $t$ in $H_n$, finds a fault-free path $s \rightarrow t$ of length $d(s,t)+O(\log n)$ in $O(n+|F|)$ optimal time, where $|F|$ is the total number of faulty nodes.
The design of algorithmic array processors for solving linear systems of equations using fraction-free Guassian method is presented. The design is based on a formal approach which constructs a family of planar array processors systematically. These array processors are synthesized and analyzed. It is shown that two of them are optimal under the framework of linear allocation of computations and in the terms of computing time and the number of processing elements. (invited talk).
This volume is one of the Applied Informatics Series published by Polish Academy of Sciences. Chapter one of this volume (pages 5--46) was written by Peng and Sedukhin. This chapter describes a new approach for designing optimal systolic array processors. We show how to apply the proposed systematic approach for solving linear systems of equations. More specifically, in this chapter, the systematic design of systolic array processors for solving linear systems of equations using division-free and fraction-free Gaussian elimination method is presented. The design is a new systematic approach which constructs array processors systematically by three stages: first constructing the 3-D data dependency graphs of the algorithms, and then generating all planar systolic array processors by projecting the data dependency graph along certain feasible directions, and finally, selecting optimal solutions based on the given criteria. The key for designing optimal array processors is to develop regular algorithms and then map them into 2-D systolic arrays to meet advanced VLSI requirements.
Academic Activities