
Human Interface Laboratory


Masahide Sugiyama / Professor
Susantha Herath / Associate Professor
Michael Cohen / Assistant Professor
Minoru Ueda / Assistant Professor

Using human communication modalities (hearing, seeing, tasting, smelling, feeling, etc.), we communicate human to human, human to machine, and human to any information channel. The research of the Human Interface Laboratory explores the enhancement and generation of these various forms of communication.

In order to advance the above research on human interface, we adopted the following research principles:

In 1993, we focused on these two main research topics:

In order to conduct our research, we investigated and established an experimental environment. As a basic voice-input man-machine interface tool, a neural network-based speaker-independent continuous speech recognition system (FPM-LR) was implemented. This system was developed by one of our members who had done research at the ATR Interpreting Telephony Research Laboratories. A realtime FPM-LR speech recognition demonstration was built on Sun/S10 and HP9000/755 workstations using a server-client model (a minimal sketch of this split appears below). In order to investigate sign language, we deployed an HP9000/755 workstation connected to a video deck and video camera via special hardware. To encourage the sign language research community, we are planning a workshop at the University of Aizu sponsored by the Sign Language Technology Committee of the IEICE. Development of a visual programming language for 4GL (fourth generation language) is also being studied. For Audio Windows (virtual acoustics and spatial sound) research, we deployed NeXT workstations and conducted live demonstrations of a prototype binaural directional mixing console. We also started contacting the handicapped community in Aizu in order to cooperate on developing solutions to problems of mutual concern.
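
The following is a minimal sketch of the kind of client/server split used for such a realtime recognition demonstration. It shows only a hypothetical capture client that streams length-prefixed feature frames to a recognition server over TCP; the host, port, and framing convention are assumptions for illustration, not the actual FPM-LR demo code.

    # Hypothetical client side of a realtime recognition demo (illustration only).
    import socket

    HOST, PORT = "localhost", 5005   # assumed demo endpoint, not the real one

    def run_client(frames):
        """frames: iterable of bytes objects (e.g. encoded feature vectors)."""
        with socket.create_connection((HOST, PORT)) as sock:
            for frame in frames:
                # length-prefixed frame so the server can delimit the stream
                sock.sendall(len(frame).to_bytes(4, "big") + frame)
            sock.shutdown(socket.SHUT_WR)             # signal end of utterance
            reply = sock.makefile("rb").read()        # recognition hypothesis
        return reply.decode("utf-8", errors="replace")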

We exhibited our research activities (the FPM-LR speech recognition system and Audio Windows) at the Fukushima New Media Festival, held 20-24 October 1993. One of us organized and served as a general co-chairperson of the "Workshop on Synthetic Worlds", an international scientific conference held at the University of Aizu. On the topics of speech recognition and spatial audio, we presented 7 papers at refereed international conferences and published 7 full papers in refereed academic journals. We organized and promoted a series of HI Lab seminars: 18 lectures, including 7 invited speakers.

We participated in 5 SCCPs (student research projects):

as well as 3 joint faculty research projects: One of us received a commissioned research fund from ATR Interpreting Telecommunications Research Laboratories for a ``Study on Speech Recognition System Based on Information Theory'', and another of us is sponsored by the NTT Human Interface Labs.

Refereed Journal Papers


  1. Michael Cohen. Throwing, pitching, and catching sound: Audio windowing models and modes. International Journal of Person-Computer Interaction, 39(2):269-304, August 1993.

    After surveying the concepts of audio windowing, this paper elaborates taxonomies of three sets of its dimensions: spatial audio (throwing sound), timbre (pitching sound), and gain (catching sound), establishing matrices of variability for each, drawing similes, and citing applications. Two audio windowing systems are examined across these three operations: repositioning, distortion/blending, and gain control (state transitions in virtual space, timbre space, and volume space). Handy Sound is a purely auditory system with gestural control, while Maw exploits exocentric graphical control. These two systems motivated the development of special user interface features. (Sonic) piggyback-channels are introduced as filtear manifestations of changing cursors, used to track control state. A variable control/response ratio can be used to map a near-field work envelope into perceptual space. Clusters can be used to hierarchically collapse groups of spatial sound objects. Wimp idioms are reinterpreted for audio windowing functions. Reflexive operations are cast as an instance of general manipulation when all the modified entities, including an iconification of the user, are projected into an egalitarian control/display system. Other taxonomies include a spectrum of directness of manipulation, and sensitivity to current position crossed with dependency on some target position. Keywords: audio windows, CSCW, filtears, groupware, piggyback-channels, spatial sound.


  2. Michael Cohen. Integrating graphical and audio windows. Presence, 1(4):468-481, Fall 1993.

    It is important to exploit sound as a vital communication channel for computer-human interfaces. Developing this potential motivates both developing expressive models unique to audio and also exploring analogues to visual modes of representation. This paper elaborates an organization of presentation and control that implements a flexible sound management system called ``audio windows.'' After reviewing audio imaging, spatial sound, and relevant underlying technology, an audio windowing prototype is described, implementing an extended model of free-field planar spatial sound control. The system, ``Maw'' (acronymic for multidimensional audio windows), is a GUI (graphical user interface), integrating a graphical editor with a multidimensional spatial sound engine. Standard idioms for wimp (window, icon, menu, pointing device) systems are reinterpreted for audio window applications, including provisions for directionalized and non-atomic spatial sound objects. Unique features include draggably rotating icons; clusters, dynamically collapsible hierarchical groups of spatial sound objects; and an autofocus mode that is used to disambiguate multiple presence.


  3. Michael Cohen. Zebrackets: a Pseudo-dynamic Contextually Adaptive Font. TUGboat: Journal of the TeX Users Group, 14(2):118-122, October 1993.

    A system is introduced that increases the information density of textual presentation by reconsidering text as pictures, expanding the range of written expression. To indicate nested associativity with stripes, Zebrackets superimposes small-scale horizontal striations on parenthetical delimiters. This system is implemented as an active filter that re-presents textual information graphically, using adaptive pseudo-dynamic character generation to reflect a context that can be as wide as the document.
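
    As a toy illustration of the context that drives such a filter (not the Zebrackets implementation itself, which generates striated delimiter glyphs), the following sketch computes the nesting depth of each parenthesis in a string; this depth is the kind of information that could determine how many striations a delimiter receives.

      # Illustrative sketch only: nesting depth per parenthesis in a string.
      def delimiter_depths(text):
          """Return a list of (index, char, depth) for every '(' and ')' in text."""
          depths = []
          depth = 0
          for i, ch in enumerate(text):
              if ch == '(':
                  depths.append((i, ch, depth))   # opening delimiter at current depth
                  depth += 1
              elif ch == ')':
                  depth -= 1
                  depths.append((i, ch, depth))   # closer pairs with its opener
          return depths

      if __name__ == "__main__":
          for index, ch, depth in delimiter_depths("f(a, g(b, (c)))"):
              print(f"{ch} at {index}: depth {depth}")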


  4. K. Ohkura, M. Sugiyama, and S. Sagayama. Speaker adaptation based on transfer vector field smoothing method with continuous mixture density HMMs. Trans. of IEICE, J76-D-II(12):2469-2476, December 1993.

    This paper describes a method of speaker adaptation for continuous mixture density HMMs (CDHMMs). Speaker adaptation in CDHMMs is regarded as a kind of retraining problem in which only a small amount of training data is available. The Vector Field Smoothing (VFS) method is used to deal with the problem of retraining with insufficient training data. VFS is applied simultaneously to inter-speaker and speaking-style adaptation. In this paper, the standard speaker is a male and the unknown speakers for adaptation are one male and one female. When 11 sentences are uttered for adaptation phrase-by-phrase instead of word-by-word, the 23-phoneme recognition rate is 87.4% (without adaptation: 47.3%). The phrase recognition rate for HMM-LR is 85.1% (without adaptation: 21.5%).
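
    The following is a minimal sketch of the transfer-vector idea behind VFS-style adaptation, not the authors' formulation: means observed in the adaptation data receive transfer vectors directly, while unobserved means borrow a distance-weighted average of the transfer vectors of their nearest neighbours. The k-nearest-neighbour weighting here is an assumption chosen for illustration.

      # Sketch of transfer-vector smoothing for speaker adaptation (illustration only).
      import numpy as np

      def smooth_transfer_vectors(means, observed_vectors, k=3):
          """means: (N, D) Gaussian mean vectors of the initial model.
          observed_vectors: dict {mean_index: transfer vector} estimated from the
          small amount of adaptation data.  Returns an (N, D) array of transfer
          vectors, filled in by k-nearest-neighbour smoothing."""
          n, d = means.shape
          transfers = np.zeros((n, d))
          observed_idx = np.array(sorted(observed_vectors))
          observed = np.stack([observed_vectors[i] for i in observed_idx])
          for i in range(n):
              if i in observed_vectors:
                  transfers[i] = observed_vectors[i]          # directly adapted mean
                  continue
              dist = np.linalg.norm(means[observed_idx] - means[i], axis=1)
              nearest = np.argsort(dist)[:k]
              weights = 1.0 / (dist[nearest] + 1e-6)          # closer means weigh more
              transfers[i] = (weights[:, None] * observed[nearest]).sum(0) / weights.sum()
          return transfers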


  5. M. Sugiyama, J. Murakami, and H. Watanabe. Speech segmentation and clustering problem based on an unknown-multiple signal source model - an application to segmented speech clustering based on speaker features. Trans. of IEICE, J76-D-II(12):2477-2485, December 1993.

    This paper describes speech segmentation and clustering algorithms based on speaker features, where the speakers, the number of speakers, and the speech context are unknown. Several problems are formulated and their solutions proposed. For the simpler case, in which speech segmentations are known, the Output Probability Clustering algorithm is proposed. In the case of unknown segmentation, an ergodic HMM-based technique is applicable. In this paper, both cases are evaluated using simulated multi-speaker dialogue speech data.


  6. K. Fukuzawa, Y. Katoh, and M. Sugiyama. Speaker-independent continuous speech recognition using FPM (Fuzzy Partition Model) and LR parsers. Trans. of IEICE, J76-D-II(11):2253-2263, November 1993.

    This paper proposes a Fuzzy Partition Model (FPM) neural network architecture for speaker-independent continuous speech recognition. A conventional TDNN (Time-Delay Neural Network) architecture requires much computation time in its training stage. In contrast, an FPM trains more than twice as fast as a TDNN. The FPM architecture is combined with an LR parser and its recognition performance on 278 Japanese phrases is evaluated. The recognition rate of FPM-LR is higher than that of TDNN-LR. This paper also proposes a Multi-FPM-LR method; using this method, the recognition rate is 77.5% for open speakers.

Refereed Proceeding Papers


  1. M. Cohen, S. Aoki, and N. Koizumi. Augmented audio reality: Telepresence/VR hybrid acoustic environments. In Ro-Man: Proc. 2nd IEEE International Workshop on Robot and Human Communication, Tokyo, November 1993.

    Augmented reality is used to describe hybrid presentations that overlay computer-generated imagery on top of real scenes. For example, a wiring schematic might be projected onto see-through goggles, aligned (via head position sensor) with the physical cable ducts, allowing a technician to easily lay wires. Augmented audio reality extends this notion to include sonic effects, overlaying computer-generated sounds on top of more directly acquired audio signals. (One common example of augmented audio reality is sound reinforcement, as in a public address system.) We are exploring the alignability of binaural signals with artificially spatialized sources, synthesized by convolving monaural signals with left/right pairs of directional transfer functions. We are using Maw (acronymic for multidimensional audio windows), a NeXT-based audio windowing system, as a binaural directional mixing console. Since the rearrangement of a dynamic map is used to dynamically select transfer functions, a user may specify the virtual location of a sound source, throwing the source into perceptual space, using exocentric graphical control to drive egocentric auditory display.
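
    A minimal sketch of the underlying spatialization step, convolving a monaural signal with a left/right pair of directional impulse responses, is shown below. It illustrates the general technique only, not Maw's implementation, and the impulse responses are assumed to be supplied by the caller for the desired virtual direction.

      # Binaural spatialization by convolution with directional impulse responses.
      import numpy as np

      def spatialize(mono, hrir_left, hrir_right):
          """mono: 1-D array of samples; hrir_left/right: impulse responses for
          one direction.  Returns a (num_samples, 2) stereo array."""
          left = np.convolve(mono, hrir_left)
          right = np.convolve(mono, hrir_right)
          n = max(len(left), len(right))
          out = np.zeros((n, 2))
          out[:len(left), 0] = left       # left-ear channel
          out[:len(right), 1] = right     # right-ear channel
          return out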


  2. M. Cohen and N. Koizumi. Putting spatial sound into voicemail. In Proc. 1st International Workshop on Networked Reality in TeleCommunication, Tokyo, May 1994. IEEE COMSOC, IEICE.

    It is important to exploit sound as a vital communication channel for computer-human interfaces. Audio windowing is conceived of as a frontend, or user interface, to an audio system with a spatial sound backend. This paper surveys the ideas underlying audio windowing and describes a system investigating asynchronous applications of these ideas. Features of a GUI (graphical user interface) can be extended to support an audio windowing system, driving a spatial sound backend. Besides the reinterpretation of WIMP (window/icon/menu/pointing device) conventions to support audio window operations for synchronous sessions like teleconferences, extra features can be added to support asynchronous operations like voicemail. After tracing some underlying technology of audio imaging in computer-human interfaces, we describe an audio windowing prototype, ``Maw'' (acronymic for multidimensional audio windows), an exocentric graphical mouse-based interface based on an extended model of free-field 2D spatial sound, used to augment voicemail.


  3. M. Cohen and N. Koizumi. Virtual gain for audio windows. In HCI: Proc. Human-Computer Interaction, Orlando, FL, August 1993.

    Audio windowing is a frontend, or user interface, to an audio system with a spatial sound backend. Besides the directionalization of the DSP spatialization, gain adjustment is used to control the volume of the various sources. Virtual gain can be synthesized from components derived from iconic size, distance, orientation and directivity, and selectively enabled according to room-wise partitioning of sources across sinks. This paper describes the mathematical derivation of our calculation of virtual gain, and outlines the deployment of these calculations in an audio windowing system.
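
    The sketch below illustrates the general idea of assembling a virtual gain from independent factors (iconic size, distance roll-off, relative orientation, and room-wise gating). The specific roll-off and directivity formulas here are illustrative assumptions, not the derivation given in the paper.

      # Illustrative combination of virtual gain components (not the paper's formulation).
      import math

      def virtual_gain(size, distance, angle_deg, same_room=True,
                       ref_distance=1.0, directivity=0.5):
          if not same_room:                    # room-wise partitioning gates the source
              return 0.0
          g_size = size                        # larger icon -> louder (assumed linear)
          g_dist = ref_distance / max(distance, ref_distance)   # inverse-distance roll-off
          # simple cardioid-like directivity: full gain on-axis, reduced off-axis
          g_dir = (1 - directivity) + directivity * math.cos(math.radians(angle_deg))
          return g_size * g_dist * max(g_dir, 0.0)

      print(virtual_gain(size=1.0, distance=2.0, angle_deg=45.0))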


  4. M. Cohen and N. Koizumi. Virtual gain for audio windows. In VR93: Proc. IEEE Symp. on Research Frontiers in Virtual Reality (in conjunction with IEEE Visualization), pages 85-91, San Jose, CA, October 1993.

    Audio windowing is a frontend, or user interface, to an audio system with a spatial sound backend. Besides the directionalization of the DSP spatialization, gain adjustment is used to control the volume of the various sources. Virtual gain can be synthesized from components derived from iconic size, distance, orientation and directivity, and selectively enabled according to room-wise partitioning of sources across sinks. This paper describes the mathematical derivation of our calculation of virtual gain, and outlines the deployment of these calculations in an audio windowing system.


  5. M. Sugiyama, J. Murakami, and H. Watanabe. Speech segmentation and clustering based on speaker features. In Proc. of ICASSP93, page RAA.7, April 1993.

    This paper describes speech segmentation and clustering algorithms based on speaker features, where the speakers, the number of speakers, and the speech context are unknown. Several problems are formulated and their solutions proposed. For the simpler case, in which speech segmentations are known, the Output Probability Clustering algorithm is proposed. In the case of unknown segmentation, an ergodic HMM-based technique is applicable. In this paper, both cases are evaluated using simulated multi-speaker dialogue speech data.


  6. Y. Katoh and M. Sugiyama. Speaker-independent features extracted by a neural network. In Proc. of ICASSP93, page RPE.5, April 1993.

    This paper proposes an algorithm that uses a neural network to normalize features that differ between speakers in speaker-independent speech recognition. The algorithm has three procedures: (1) initially train a neural network, (2) calculate the alignment function between the target signal and the network's output by Dynamic Time Warping, and (3) incrementally train the network to extract speaker-independent features. The neural network is a Fuzzy Partition Model (FPM) with multiple input-output units to give a probabilistic formulation. The algorithm is evaluated in phrase recognition experiments using FPM-LR recognizers and is compared with a conventional training algorithm in terms of recognition performance. The experimental results show that a neural network can be used as a new speaker-independent feature extractor.
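
    A compact dynamic time warping sketch is given below to illustrate the kind of alignment computed in step (2); it is a generic DTW over two feature sequences, not the authors' training code.

      # Generic DTW alignment between two feature sequences (illustration only).
      import numpy as np

      def dtw_align(x, y):
          """x: (Tx, D), y: (Ty, D).  Returns the list of aligned index pairs."""
          tx, ty = len(x), len(y)
          cost = np.full((tx + 1, ty + 1), np.inf)
          cost[0, 0] = 0.0
          for i in range(1, tx + 1):
              for j in range(1, ty + 1):
                  d = np.linalg.norm(x[i - 1] - y[j - 1])
                  cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
          # backtrack the cheapest path
          path, i, j = [], tx, ty
          while i > 0 and j > 0:
              path.append((i - 1, j - 1))
              step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
              if step == 0:
                  i, j = i - 1, j - 1
              elif step == 1:
                  i -= 1
              else:
                  j -= 1
          return path[::-1]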


  7. K. Ohkura, D. Rainton, and M. Sugiyama. Noise robust HMMs based on minimum error classification training. In Proc. of ICASSP93, page TAB.2, April 1993.

    This paper compares and contrasts the noise robustness of HMMs trained using a discriminative minimum error classification (MEC) optimization criterion against that of HMMs trained using the conventional maximum likelihood (ML) approach. Isolated word recognition experiments, performed on the ATR 5240 Japanese word database, gave the following results: 1) MEC continuous Gaussian mixture density HMMs trained in a specific noisy environment were more robust to changes in the signal-to-noise ratio (SNR) than conventional ML HMMs, and 2) MEC HMMs trained in various noisy environments were more robust in all environments than conventional ML HMMs.
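
    For illustration, the sketch below computes a standard minimum-classification-error style loss for one token: a misclassification measure comparing the correct class's score with a soft maximum over the competing classes, passed through a sigmoid. The exact criterion and smoothing parameters used in the paper may differ.

      # Standard MCE-style loss for one token (illustration; parameters are assumptions).
      import numpy as np

      def mce_loss(scores, correct, eta=2.0, gamma=1.0):
          """scores: array of per-class log-likelihoods for one token;
          correct: index of the true class."""
          g_correct = scores[correct]
          competitors = np.delete(scores, correct)
          # soft maximum of the competing class scores
          g_anti = np.log(np.mean(np.exp(eta * competitors))) / eta
          d = -g_correct + g_anti            # > 0 indicates misclassification
          return 1.0 / (1.0 + np.exp(-gamma * d))

      print(mce_loss(np.array([2.0, 0.5, -1.0]), correct=0))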

Books


  1. Michael Cohen and Elizabeth M. Wenzel. Advanced Interface Design and Virtual Environments. Oxford University Press, 1994.


  2. Masahide Sugiyama. Interpreting Telephony, volume 1 of ATR Advanced Technology Series, pp.38-45, 67-68, 78-79. Ohm Publishing Co., 1994.

Academic Activities


  1. Susantha Herath, IEEE, April 1993.

    Membership coordinator.


  2. Susantha Herath, Journal of Artificial Intelligence, April 1993.

    Referee.


  3. Masahide Sugiyama, Institute of Electronics, Information and Communication Engineers (IEICE), March 1992.

    Planning secretary of Speech Processing Committee.


  4. Masahide Sugiyama, Institute of Electronics, Information and Communication Engineers (IEICE), November 1993.

    General secretary of the IWSP94 committee.


  5. Masahide Sugiyama, Institute of Electronics, Information and Communication Engineers (IEICE), February 1994.

    General secretary of the editorial board for a special issue.


  6. Masahide Sugiyama, ASSP (IEEE), ASJ, IEICE, May 1993.

    Referee.

Others


  1. Michael Cohen, December 1993.

    Besides Immersion: Points of View and Frames of Reference.


  2. Michael Cohen, February 1994.

    Conferences, Concerts, and Cocktail Parties.


  3. Masahide Sugiyama, April 1993.

    Research Project from ATR.


  4. Minoru Ueda, October 1993.

    October 1993 - March 1994. Okawa Scholarship Association Foundation; research on 4GL and databases.




