/ Tuneo Ikedo / Professor
/ Michael Kishinevsky / Professor
/ Robert H. Fujii / Associate Professor
/ A. Y. Kondratyev / Associate Professor
/ Yamin Li / Associate Professor
/ Omar Hammami / Assistant Professor
/ Jianhua Ma / Visiting Researcher
/ Wanming Chu / Research Associate
Computer Architectrue Laboratory is organized with 7 faculty members and one visiting researcher. The followings re the summary of each members.
Multimedia Processor Architecture: Multimedia processor has been developed since 1995, which implements font, graphics, audio, video processing in a single chip of 10 million gates. we focused on the architecture design of computer graphics renderer in fine grain processing in 1996 and 30Gflop reconfigurable multimedia accelerator (coarse grain processing). The development of reconfigurable accelerator is a succession of Aizu supercomputer project. 4 graduate students and 8 undergraduate students join in this research.
1. Research: (1) Parallel Multithreaded Architecture We have examined various design issues of the parallel multithreaded architecture. A prototype of parallel multithreaded processor was designed. We are working on the performance evaluation in order to enhance the the computation capabilities of the processor. (2) Computer Arithmetic Algorithms and Implementations We have examined latency, throughput and complexity for various computer arithmetic algorithms and developed two new algorithms. The hardware designs are also finished and prototyped with FPGAs.
2. Teaching: We taught Computer Architecture (Li and Chu), Computer Organization I and Organization II (Li), and the laboratory of Logic Circuit Design (Chu). We also developed and updated two processor design projects: Sim2 processor design for the laboratory of Computer Architecture and Aizup pipeline processor design and implementation on FPGA for the laboratory of Computer Organization I. All of the teaching notes and laboratories handouts are prepared and available at Li's homepage, http://www.u-aizu.ac.jp/~yamin/.
Design Automation of Concurrenct and Asynchronous Systems: We have reached further progress in research in automation of asynchronous design, embedded reactive systems and models of concurrency. In a few research directions we continued close cooperation with Prof. A. Taubin (Computer Education Lab.) and with research groups in Cadence Berkeley Labs. (USA), Politecnico di Torino (Italy), University Politecnica de Catalunya (Barcelona, Spain) and University of Newcastle upon Tyne (England).
Research results: 1. Embedded systems and models of concurrency: (1). We have developed a new model, called Place Chart Nets (PCN). It allows the modelling of both asynchronicity and exception handling (preemption). PCNs specify a system behavior using partial orders. PCNs have a notion of hierarchy, which is determined by preemption. PCNs is a non-trivial generalization of classical PNs, in the sense that (1) for the finite (bounded) case modeling a PCN may require an exponentially larger PN, (2) for the infinite (unbounded) case a class of PCN languages properly includes a class of PN languages and (3) k-boundedness of PCNs is decidable. (2). We developed a method for synthesis of PCNs starting from labeled transition systems. We considered applications of PCNs for design of embedded reactive systems in hardware/software codesign framework. 2. Technology mapping of speed-independent circuits: We have reached significant progress in solving a problem of technology mapping for speed-independent circuits based on two different techniques: (1). Algebraic factorization: the proposed method performs both combinational (inserting new gates) and sequential (inserting new memory elements) decomposition of complex gates in a given standard cell library, while preserving original behaviour and speed-independence. (2). Boolean decomposition: the proposed method iteratively performs Boolean decomposition of each complex gate using Boolean relations, as opposed to the less powerful algebraic factorization approach used in previous methods. After logic decomposition, the overall library matching and optimization is carried out. 3. Testing of path delay faults: We provided effective procedures to solve the initialization and the test pattern generation problems for the path delay fault testing of asynchronous circuits. Experimental results shows that a high level of path delay fault testability can be achieved with partial scan.
We have presented a few Invited tutorials at the Summer School, International Conference, and Fujitsu Labs.
Teaching: we taught logic design, computer architecture in the undegradute school and "Synthesis and Optimization of Digital Circuits" in the graduate school. We have lead graduation reserch projects.
This year's activities have included the design of a microcontroller controlled motorized wheelchair prototype, the design of a fuzzy logic controller using Verilog, and the design of a high speed pipeline control unit using Verilog. Implementation of the motorized wheelchair prototype into a real system which can be ridden by people with disabilities is being carried out. Analog circuit designs of various modules will be forthcoming.
My researches in 1996 are devoted to improving algorithms of the Truga001 graphics chip and modeling multimedia hyperworld. A new circuit combined Phong shading with bump mapping has been developed. Several bump-mapped shading pictures have been produced to test the developed algorithm and circuit. The interpolation method has been proposed to improve bump-mapped shading effects. Texture mapping and video mapping performance of the Truga001 has been simulated and evaluated. The hyperworld is an integration of various interaction worlds in different time, space and reality. Our research starts from basic features of the hyperworld, composition and models of an one-to-many interaction system. We are now focusing on developing the prototype of an educational hyperworld system, called Cheer.
Refereed Journal Papers
It is proposed new architecture of massive parallel computing for virtual reality system and visual programming language.
The conventional single-threaded multiple-pipelined processor is not capable of using multiple pipelines efficiently, and so the processor performance suffers. This paper investigates a multiple-threaded multiple-pipelined (MTMP) processor architecture that tries to issue multiple instructions from multiple instruction threads in every clock cycle. For the performance evaluation, the paper proposes a modified analytic model that provides a quick prediction of utilization of pipelines. Unlike previous analytic models of multiple-threaded architecture, the model presented here concerns the utilization of multiple pipelines. It deals not only with pipeline dependencies but also with structure conflicts. The model can be used for turning processor parameters when a MTMP is designed.
The design of a 32-bit RISC processor carried out using Verilog-Synergy and PARTHENON is described. Semantic, procedural, and optimization differences are analyzed and utilized to come up with designs which take advantage of the individual strengths of the two hardware description languages.
The Truga001 was developed as a graphics processor for generating virtual reality images in real time. It consists of multiple processors embedded in a single chip with each processor interconnected with an MIMD structure. In this sense, it is not a single-chip graphics processor of conventional architecture which mainly performs graphics functions based on a microprogram or user-defined software. The Truga001 performs the various types of rendering in parallel. In the chip, there are 12 processors and 7 special graphics hardware modules.
Without supports of effective modeling in visual worlds, even if the visual information can be displayed in real worlds, it is difficult to recognize and extract its features and to confirm or verify the identities and the characteristics. Direct mappings target at refining and abstracting multimedia information by cognitive technology and differential technology to efficiently improve our human performance and actively control the real worlds we live. An assemblability discriminating method and an assembling sequence generating method named SYDEM is explained by giving an example of CIM assembly process. The three other examples of an effective guide-map generation method, of a hierarchical description of surfaces, and of a conceptual visual human algorithm for skiing, are given to show the drastic efficiency and exactness increase.
Computer Graphics has been a very active research area in Japan. With the developments of CAD/CAM, visualization and entertainment applications, especially, recent advanced technologies multimedia and virtual reality, computer graphics research and its applications have become hotter topics. There are unique pure academic researches at universities. Most of the research and development at companies have been traditionally emphasizing engineering and applied research. This article reviews the ``state-of-the-art" of the Computer Graphics industry and research activities in Japan.
Petri nets and Change Diagrams provide adequate modelling and circuit synthesis tools for the various OR causality types, yet they do not always bring the specifier to a unique decision about which modelling construct must be used for which type. We present a unified descriptive tool, called Causal Logic Net, which is graphically based on Petri net but has an explicit logic causality annotation for transitions. It is aimed as the least possible generalisation of Petri nets and Change Diagrams. The signal-transition interpretation of this tool is analogous to, but more powerful than, the well-known Signal Transition Graph. A number of examples demonstrate the usefulness of this model in the synthesis of asynchronous control circuits.
Petrify is a tool for (1) manipulating concurrent specifications and (2) synthesis and optimization of asynchronous control circuits. Given a Petri Net (PN), a Signal Transition Graph (STG), or a Transition System (TS) it (1) generates another PN or SG which is simpler than the original description and (2) produces an optimized net-list of an asynchronous controller in the target gate library while preserving the specified input-output behavior. An ability of back-annotating to the specification level helps the designer to control the design process.
The paper proposed the new architecture for virtual reality system which can cope with diversed data types and processing of multimedia system.
MMU and router architecture with node addressing mechanism are described. Reconfigurable architecture and node mapping system are available with the highest transmission efficiency in pseudo complete graph IN.
A video mapping technique is described where a video camera image is mapped onto the animated polygon surface in real time. The hardware architecture for video capturing, image caching, antialiasing used sinc filtering are proposed.
This paper describes a pipelined processor (named Aizup) design and implementation for the exercise of Computer Architecture/Organization Education at the University of Aizu. The Aizup pipeline has four stages and deals with data dependency and control dependency. The Aizup was designed at Cadence environment and implemented on Xilinx XC4006PC84 FPGA chip. We ask students to design the processor, to perform functional simulations, to implement the design on the chip, and to measure the chip with Logic Analyzer. The exercise course is helpful to students to understand the operations of pipelined processors and to master the design methodologies and the use of measuring instruments.
In this paper, we present a new non-restoring square root algorithm that is very efficient to implement. The new algorithm presented here has the following features unlike other square root algorithms. First, the focus of the ``non-restoring'' is on the ``partial remainder'', not on ``each bit of the square root'', with each iteration. Second, it only requires one traditional adder/subtractor in each iteration, i.e., it does not require other hardware components, such as seed generators, multipliers, or even multiplexors. Third, it generates the correct resulting value even in the last bit position. Next, based on the resulting value of the last bit, a precise remainder can be obtained immediately without any correction or addition operation. And finally, it can be implemented at very fast clock rate because of the very simple operations at each iteration. We illustrate two VLSI implementations of the new algorithm. One is a fully pipelined high-performance implementation that can accept a new square-root instruction each clock cycle with each pipeline stage requiring a minimum number of gate counts. The other is a low-cost implementation that uses only a single adder/subtractor for iterative operation.
Microelectronic systems design education at Japanese Universities and the support provided by industry and government are described; innovative governmental initiatives and unique University microelectronics programs are also described.
A prototype motorized wheelchair controlled by a VLSI microcontroller and various sensors was designed and built. Its performance on various terrains and courses was analyzed.
A person interacts with various worlds in two ways: the one-to-one interaction with a single world and the one-to-many interaction with multi-worlds. Most of the current researches on improving human interaction with the world are limited to the one-to-one interaction, i.e. the interaction of a person with each individual world. For example, telepresence and teleoperation technologies can overcome some of the time, space and other physical constraints when a person interacts with the physical world. CSCW (Computer Support Cooperated Work) and CSCL (Computer Support Cooperated Learning) are changing the interaction among people. Hypermedia and WWW (World Wide Web) are making the interaction of human with the information world more natural. Since relations among worlds in the multi-worlds are nonlinear and can be expressed by a set of links, such multi-worlds as a whole is called a hyperworld. This paper focuses on giving an outline of a future hyperworld as a system and presents some problems of developing such a system and proposes potential solutions.
In this paper, we focus on developing a genetic algorithm based learning algorithm for a communication network design to evolve solutions that minimize total link cost, and subject to more constraints like the network routing, diameter and two-connected survivability rather than just the survivability considered in some research papers. The implementation results of the genetic algorithm for searching two disjoint paths corresponding to each requirement and performance of the genetic algorithm based learning algorithm for a simulated communication network design in terms of cost are reported.
In this paper, a distributed genetic algorithm is proposed and it is emphasized that how the distributed genetic algorithm is implemented over a transputer based parallel machine called ParsyTec Gcel-1/64 by using virtual torus topology for a communication network design that minimizes total communication link cost, and subjects to more constraints like the network routing, diameter and two-connectivity rather than just the survivability. The implementation results of the genetic algorithm for searching two disjoint paths corresponding to each requirement and performances of the distributed genetic algorithm for a simulated communication network design in terms of computing time and link cost are reported.
Implementation technology for 3D pixel cache and performance evaluation of a graphics processor Truga001 embedded 12 processors within a single chip are described. The chip can render 4 million vectors/s (10 pixels/vector) or 1.2 million triangle polygons/s (100 pixels/polygon) with Phong shading, texture mapping and hidden surface removal. A pixel-array configured with 8(x) x 4(y) x 24-bit(intensity) x 24-bit(z) can be accessed with frame buffer at 180ns due to the 3D bus-architecture between chip and frame buffer. The chip was designed with Toshiba TC180C CMOS of 400,000 gates.
There are two kinds of modeling topics in a hyperworld system: modeling direct mapping between multimedia information worlds and real worlds, and modeling human interface with a hyperworld. Most of the current researches on improving human interaction with the worlds are limited to the one-to-one interaction, i.e., the interaction of a person with each individual world. The study of the paper is devoted to one-to-many interaction features with the hyperworld, composition of a one-to-many interaction system, and its associated reference model so as to lay foundations for further study and system development.
The most critical step in applying a genetic algorithm to a survivable communication network design is to choose a way to represent a solution to the problem. In this paper, we present a genetic algorithm in which the routing, diameter, and 2 connectivity constraints can be easily and successfully encoded in a chromosome representation. A parallel implementation of the genetic algorithm based learning algorithm in the level of requirements is proposed and implemented in a transputer based parallel machine.
World integration, beyond media integration, means integrating various interaction worlds of different time, space and reality into one system. Such world integration, called a hyperworld, originates from matching with an one-to-many interaction way between a person and the worlds. There have been lots of models, mechanisms and standards for multimedia integration, however, there are no correspondents for the world integration. This paper focuses on basic features of the hyperworld, composition and models of an one-to-many interaction system, and a case study of a telemedicine hyperworld system. The composition of a hyperworld system has three levels: a level of each world generation and interaction system, a level of world management supported by a hyperworld reference model, and a level of hyperinterface.
The Truga001 is a single chip rendering processor with 12 embedded many graphics functions. In the design of the Phong and bump-mapped shading circuit, we used angular parameters for defining surface and light-source normals instead of vector. This enables the circuit-scale less than 10,000 gates/circuit. The chip is fabricated with a 940,000-gate standard cell, 0.3um CMOS in a TCP/BGA package. This paper describes the hardware architecture and ts implementation technologies of the Phong and bump-mapped shading in an ASIC.
This paper is devoted to developing a genetic algorithm for a communication network design that minimizes total link cost, and subjects to some constraints like diameter and two-connectivity. Two parallel genetic algorithms on the level of partitioning requirements and the level of dividing population are proposed and implemented over a transputer based parallel network with various virtual network topologies. The ring-ring topology gives the best performance for the parallel genetic algorithm on the level of partitioning requirements, and the torus topology is the most suitable topology for the parallel genetic algorithm on the level of dividing population.
To overcome the complexity of the circuit design for the calculation of the conventional Phong shading model and bump mapping, a vector (such as the normal or unit vectors of a surface, a light source, and other objects) is defined by the two angle parameters of the horizontal and vertical components with the eye-point axes in the Truga system. The new algorithms of Phong and bump shading based on the angle parameters are proposed. One chip can perform 1.2 million polygons (100-pixel 3-D triangle)/second while simultaneously applying Phong shading, bump and texture mapping, hidden surface removal. The system can also be unlimitedly scaled up with parallel interconnections of multiple The Truga001 submodules.
The new criterion significantly reduces the size of an unfolding obtained by a PN. The properties of PNs for analysis can be various: boundedness, safety, persistency etc. A practical example of the suggested approach is given in an application to asynchronous design.
A deadlock prevention procedure first detects deadlocks using the unfolding, then reduces the unfolding to a deadlock-free sub-unfolding, and finally folds the deadlock-free acyclic net into a cyclic net. For the class of reinitialized processes which are very common for the manufacture systems the obtained cyclic net is live equivalent to the initial Petri Net. The live equivalence implies that each trace of the transformed net has an equivalent feasible trace in the initial net and all live (infinite) traces of the initial net are also present in the final net. As a side effect our method constructs ordering relations between places and transitions and checks boundedness, safety, persistency and hazards in the initial Petri Net specification.
A model, called Place Chart Nets (PCN), is presented. It allows the modelling ofboth asynchronicity and exception handling (preemption). Contrary to State Charts and other reactive models, which are inherently synchronous, PCNs specify a system behavior using partial orders. Contrary to Petri nets, PCNs have a notion of hierarchy. Contrary to other hierarchical models based on Petri net extensions, the hierarchy in PCNs is determined by preemption. We show that PCNs are a non-trivial generalization of classical PNs, in the sense that (1) for the finite (bounded) case modeling a PCN may require an exponentially larger PN, (2) for the infinite (unbounded) case a class of PCN languages properly includes a class of PN languages and (3) k-boundedness of a place is decidable. Then we present a method for synthesis of safe PCNs from transition systems, which generalizes the theory of regions previously developed for PNs.
Asynchronous circuits operate correctly only under timing assumptions that must be tested. Asynchronous nets are a subclass in which feedback is allowed only inside sequential elements, and can be obtained from asynchronous circuits by partial scan. We show a reduction from delay fault testing of asynchronous nets to stuck-at testing on combinational circuits.
This paper proposes a state encoding method for asynchronous circuits based on the theory of regions. A region in a Transition System is a set of states that ``behave uniformly'' with respect to a given transition (value change of an observable signal), and is analogue to a place in a Petri net. Regions are tightly connected with a set of properties that must be preserved across the state encoding process, namely: (1) trace equivalence between the original and the encoded specification, and (2) implementability as a speed-independent circuit. We build on a theoretical body of work that has shown the significance of regions for such property-preserving transformations, and describe a set of algorithms aimed at efficiently solving the encoding problem. The algorithms have been implemented in a software tool called petrify.
This paper presents theory and practical implementation of a method for multi-level logic synthesis of speed-independent circuits. An initial circuit implementation is assumed to satisfy the monotonous cover conditions but is technology independent. The proposed method performs both combinational (inserting new gates) and sequential (inserting new memory elements) decomposition of complex gates in a given standard cell library, while preserving original behaviour and speed-independence. The algorithm applies known efficient algebraic factorization techniques from combinational multi-level logic synthesis, but achieves also boolean simplification and sequential decomposition. The method allows sharing of decomposed logic.
This paper formulates and describes a solution to the problem of sequential multi-level logic synthesis of asynchronous speed-independent circuits. The starting point is a technology-independent speed-independent circuit obtained using, e.g., the monotonous cover conditions. We describe an algorithm for the factorization of this circuit aimed at implementing it in a given standard cell library, while preserving speed-independence. The algorithm exploits known efficient factorization techniques from combinational multi-level logic synthesis, but achieves also boolean simplification. Experimental results show a significant improvement in terms of number and complexity of solvable circuits with respect to existing methods.