/ Masahide Sugiyama / Professor
/ Michael Cohen / Associate Professor
/ Susantha Herath / Associate Professor
/ William L. Martens / Visiting Researcher
/ Minoru Ueda / Assistant Professor
Using our communication channels (sense organs: ears, mouth, eyes, nose, skin, etc.) we can communicate with each other, including between human and human, human and machine, and human and other information sources. When these channels are impaired, in either a software or hardware sense, communication can become difficult. The research area of the Human Interface Laboratory covers the enhancement and generation of various human interface channels.
In order to advance this research on human interfaces, we adopt the following research principle:
We organized the second workshop, IWHIT98 (International Workshop on Human Interface Technology 1998), on Nov. 11th-13th, sponsored by the International Affairs Committee of the University of Aizu. The workshop had 5 sessions (1. Object Location and Tracking in Video Data, 2. Subjective Factors in Handling Images, 3. Visual Interfaces, 4. Visual and Body Perception, 5. Tools for Language Generation) with 15 lectures.
We promoted 5 SCCPs for students (``Speech Processing and Multimedia'', ``Sign Language Processing System'', ``GAIA -- Planet Management'', ``Computer Music'', ``Aizu Virtual City on InterNet'') and 2 Research Projects (``Object Location and Tracking in Video Data'', ``Spatial Media: Sound Spatialization''). We received 4 commissioned research funds: IPA on ``Development of Japanese Dictation Software'', HITOCC on ``Study on Computer Security using Speaker Recognition'', the Fukushima Prefectural Foundation for the Advancement of Science and Education on ``Environment Computer Activity Project'', and the Telecommunication Advancement Organization of Japan Fund on ``Sign Language Communication Between Different Languages''.
We exhibited our research activities at the open campus during the University Festival (Oct. 31st and Nov. 1st) and at the Fukushima Sangyo Fair (Nov. 29th and 30th). We held a Lab Open House for freshmen on April 3rd.
In our research activity, we published 6 papers in academic journals and 10 refereed papers at international conferences.
One of our members organized a working group on ``Blind and Computer'', which about 30 people attended and which received support from the NHK Wakaba Fund.
We maintain the Human Interface Lab homepage to open our research and education activities to the world:
http://www.u-aizu.ac.jp/labs/sw-hi/.
Refereed Journal Papers
A speaker can be recognized using the individual features contained in his or her voice waveform. This is called speaker recognition, and it can be applied as a means of individual verification. This paper develops a software system named ``xvlock'' which can manage computer access using the speaker recognition technique, and also describes the outline of xvlock and its performance evaluation. The implementation and the experiments were carried out on only one standard platform, but xvlock can be applied to other platforms because of its low platform dependency. For low-quality input voice (8-bit $\mu$-law, sampling rate: 8 kHz), the implemented xvlock achieved a 93.9\% verification rate.
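The enroll-then-verify decision underlying such a system can be illustrated with a minimal sketch. The abstract does not specify xvlock's actual features or classifier, so the crude per-frame features (log energy and zero-crossing rate), the template averaging, and the distance threshold below are all illustrative assumptions, not xvlock's method:

```python
import math

def frame_features(samples, frame_len=256):
    """Crude per-frame features: log energy and zero-crossing rate."""
    feats = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        log_energy = math.log(sum(s * s for s in frame) / frame_len + 1e-10)
        zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / frame_len
        feats.append((log_energy, zcr))
    return feats

def template(samples):
    """Average the frame features into a fixed-size speaker template."""
    feats = frame_features(samples)
    return tuple(sum(f[k] for f in feats) / len(feats) for k in range(2))

def verify(enrolled, claimed, threshold=1.0):
    """Accept the claimed identity if its template lies close enough
    to the enrolled template (requires Python 3.8+ for math.dist)."""
    return math.dist(enrolled, claimed) < threshold
```

A real system would use richer spectral features (e.g., cepstra) and a statistically chosen threshold; only the overall structure of enrollment, claim, and thresholded comparison is meant to carry over.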
Audio windowing is a frontend, or user interface, to an audio system with a realtime spatial sound backend. Complementing directionalization by a digital signal processor (DSP), gain adjustment is used to control the volume of the various mixels ([sound] mixing elements). Virtual gain can be synthesized from components derived from collective iconic size, mutual distance, orientation and directivity, and selectively enabled according to room-wise partitioning of sources across sinks. This paper describes a derivation of virtual gain, and outlines the deployment of these expressions in an audio windowing system.
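The composition of virtual gain from these components can be sketched as follows. The particular factor forms below (a product of an iconic-size term, inverse-distance attenuation, a cardioid-like directivity lobe, and room-wise enabling) are illustrative assumptions; the paper's actual derivation may combine the components differently:

```python
import math

def virtual_gain(source, sink):
    """Illustrative composite gain for one source->sink pair."""
    # Room-wise partitioning: sources outside the sink's room are disabled.
    if source["room"] != sink["room"]:
        return 0.0
    # Iconic size: larger icons act as louder sources / more sensitive sinks.
    size_factor = source["size"] * sink["size"]
    # Mutual distance: simple inverse-distance attenuation (clamped at 1 m).
    dx = sink["x"] - source["x"]
    dy = sink["y"] - source["y"]
    distance_factor = 1.0 / max(math.hypot(dx, dy), 1.0)
    # Orientation and directivity: cardioid-like lobe around the
    # source's facing direction (azimuth in radians).
    off_axis = math.atan2(dy, dx) - source["azimuth"]
    directivity_factor = 0.5 * (1.0 + math.cos(off_axis))
    return size_factor * distance_factor * directivity_factor
```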
The PSFC (for Pioneer Sound Field Controller) is a DSP-driven hemispherical loudspeaker array, installed at the University of Aizu Multimedia Center. It features realtime manipulation of the primary components of sound spatialization for each of two audio sources located in a virtual environment, including both the content (source location: apparent direction and distance) and context (room characteristics: room size and liveness). In an alternate mode, it can also direct the destination of the two separate input signals across 14 loudspeakers, manipulating the apparent direction of the virtual sound sources with no control over apparent distance other than that afforded by source loudness (i.e., no simulated environmental reflections or reverberation). The PSFC speaker dome is about 10m in diameter, accommodating about fifty simultaneous users, including about twenty users comfortably standing or sitting near its ``sweet spot,'' the area in which the illusions of sound spatialization are most vivid. Collocated with a large screen rear-projection stereographic display, the PSFC is intended for advanced multimedia and virtual reality applications.
Refereed Proceeding Papers
Speaker individuality can be recognized using the voice features contained in a speaker's voice waveform. This is called speaker recognition, and it can be applied to individual verification. This paper proposes a new computer security software system named ``{\bf xvlock}'' which can control computer access using the speaker recognition technique, and also describes the outline of {\bf xvlock} and its performance evaluation. The implementation and the experiments were carried out on only one standard platform, but {\bf xvlock} can be applied to other platforms because of its low platform dependency. For low-quality input voice (8-bit $\mu$-law, sampling rate: 8 kHz), the implemented {\bf xvlock} achieved 93.9\% verification performance.
Multimedia database management and retrieval are in worldwide demand. In particular, Object Location and Tracking (OLT) technology in time-space is a core component of a search engine for huge multimedia databases and has wide applications.
The final target of our research is to establish technologies which enable us to locate and track specified objects in video data from a combination of audio and visual cues. As human beings are among the most typical objects, as the first step of our research this paper focuses on the location and tracking of a specified person in the sound domain.
This paper describes the speaker-based segment detection and junction algorithms and evaluation experiments using simulated dialogue data.
Multimedia database management and retrieval are in worldwide demand. In particular, Object Location and Tracking (OLT) technology in time-space is a core component of a search engine for huge multimedia databases and has wide applications. The final target of our research is to establish technologies which enable us to locate and track specified objects in video data from a combination of audio and visual cues. As human beings are among the most typical objects, as the first step of our research this paper focuses on the location and tracking of a specified person in the audio domain. This paper describes the OLT project, the speaker-based segment detection and junction algorithms and evaluation experiments using simulated dialogue data, and the segment fuzzy search algorithm and its application to the detection of variable-length segments.
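Speaker-based segment detection can be illustrated with a minimal sketch. The abstracts do not give the actual detection algorithm, so the adjacent-window comparison below is a generic illustrative stand-in: a candidate speaker-change point is flagged wherever the feature statistics of the left and right windows diverge sharply.

```python
def detect_speaker_changes(features, window=20, threshold=2.0):
    """Flag frames where adjacent-window feature statistics diverge.

    `features` is a list of scalar per-frame features (e.g. log energy);
    a large jump between the means of the windows immediately before
    and after frame t is taken as a candidate speaker-change point.
    """
    changes = []
    for t in range(window, len(features) - window):
        mean_left = sum(features[t - window:t]) / window
        mean_right = sum(features[t:t + window]) / window
        if abs(mean_left - mean_right) > threshold:
            changes.append(t)
    return changes
```

A practical system would use multidimensional spectral features and a model-based distance, and would then join the detected segments per speaker; the windowed change-detection structure is the part sketched here.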
A pivot (swivel, rotating) chair is considered as an I/O device, an information appliance. As implemented, the main input modality is orientation tracking, which dynamically selects transfer functions used to spatialize audio in a rotation-invariant soundscape. In groupware situations, like teleconferencing or chat spaces, such orientation tracking can also be used to twist iconic representations of a seated user, avatars in a virtual world, enabling social situation awareness via coupled visual displays, fixed virtual source locations, and projection of non-omnidirectional sources.
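The rotation-invariant soundscape can be sketched minimally: subtracting the tracked chair yaw from each source's room-frame bearing yields the head-relative azimuth used to select a spatialization transfer function, so the sources stay fixed in the room as the listener swivels. The function below is an illustrative assumption, not the implemented system:

```python
def head_relative_azimuth(source_azimuth, chair_yaw):
    """Compensate tracked chair rotation so the soundscape stays fixed
    in the room: the head-relative bearing is the room-frame bearing
    minus the chair's yaw (both in degrees), wrapped to (-180, 180]."""
    rel = (source_azimuth - chair_yaw) % 360.0
    if rel > 180.0:
        rel -= 360.0
    return rel
```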
Traditional mixing idioms for enabling and disabling various sources employ mute and solo functions, which, along with cue, selectively disable or focus on respective channels. Exocentric interfaces which explicitly model not only sources, but also location, orientation, directivity, and multiplicity of sinks, motivate the generalization of mute/solo and cue to exclude and include, manifested for sinks as deafen/confide and harken, a narrowing of stimuli by explicitly blocking out and/or concentrating on selected entities. As sinks are analogs of sources, the semantics are identical. Such functions can be applied not only to other users' sinks for privacy, but also to one's own sinks for selective presence. Multiple sinks are useful in both groupware, where a common environment implies social inhibitions to rearranging shared sources like musical voices or conferees, and individual sessions in which spatial arrangement of sources, like the configuration of a concert orchestra, has mnemonic value. Exclude/include source and sink attributes can be visually represented by iconic attributes associated with a figurative avatar and can distinguish between operations reflexive, invoked by the user associated with a respective icon, and transitive, invoked by another user in the shared environment. Distributed users might typically share spatial aspects of a groupware environment, but attributes like muteness or deafness are determined and displayed on a per-user basis. For example, a source representing a human teleconferee might symbolize muteness with an iconic hand clapped over its mouth, positioned differently (thumb up or thumb down) depending on whether the source was muted by itself or another user's sink. (In the former case, all the users in the space could observe the muted source, but in the latter, only the user disabling the remote source would see and perceive the mute.)
An audio muffler can be wrapped around an iconic head to denote its deafness, but to distinguish between self-imposed deafness, invoked by an associated user whose attention is focused elsewhere, and distally imposed deafness, invoked by a user desiring privacy, iconic hands clasped over the ears can be positioned differently depending on the agent of deafness.
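The exclude/include semantics above suggest a simple audibility rule, sketched below. The function names and set-based state are illustrative assumptions; only the rule itself comes from the text: an explicitly excluded entity is inactive (mute/deafen), any explicit include deactivates the non-included peers (solo/confide), and the rule applies symmetrically to sources and sinks.

```python
def active(name, included, excluded):
    """Exclude/include semantics for one entity: inactive if explicitly
    excluded, or if any peer is explicitly included while it is not."""
    if name in excluded:
        return False
    if included and name not in included:
        return False
    return True

def audible(source, sink, src_inc, src_exc, sink_inc, sink_exc):
    """A source is heard by a sink only when both endpoints are active;
    mute/solo act on sources, and deafen/confide mirror them on sinks,
    since sinks are analogs of sources."""
    return active(source, src_inc, src_exc) and active(sink, sink_inc, sink_exc)
```

For example, soloing (including) one source silences all non-included sources for every sink, while muting (excluding) a source silences only that source.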
Extracting necessary information is hard and time-consuming in the information-oriented society. An abstract generation system for newspapers, which present a large volume of information, is vital. This paper presents an experimental system developed for abstracting newspaper articles on traffic accidents without applying complicated natural language processing techniques. The level of abstraction can be selected by the user. The user saves significant time by excluding unwanted information in the article.
Alternative non-immersive perspectives enable new paradigms of perception, especially in the context of frames-of-reference for musical audition and groupware. Maw, acronymic for multidimensional audio windows, is an application for manipulating sound sources and sinks in virtual rooms, featuring an exocentric graphical interface driving an egocentric audio backend. Listening to sound presented in such a spatial fashion is as different from conventional stereo mixes as sculpture is from painting. Schizophrenic virtual existence suggests sonic (analytic) cubism, presenting multiple acoustic perspectives simultaneously. Clusters can be used to hierarchically organize mixels, [sound] mixing elements. New interaction modalities are enabled by this sort of perceptual aggression and liquid perspective. In particular, virtual concerts may be ``broken down'' by individuals and groups. Keywords and Phrases: binaural directional mixing console, CSCW (computer-supported collaborative work), frames of reference, groupware, mixel ([sound] mixing element), points of view, sonic (analytical) cubism, sound localization, spatial sound.
Shared virtual environments, especially those supporting spatial sound, require generalized control of user-dependent media streams. Traditional mixing idioms for enabling and disabling various sources employ mute and solo functions, which, along with cue, selectively disable or focus on respective channels. Exocentric interfaces which explicitly model not only sources, but also location, orientation, directivity, and multiplicity of sinks, motivate the generalization of mute/solo and cue to exclude and include, manifested for sinks as deafen/confide and harken, a narrowing of stimuli by explicitly blocking out and/or concentrating on selected entities. This paper introduces figurative representations of these functions, virtual hands to be clasped over avatars' ears and mouths. Applications include groupware for collaboration and teaching, teleconferencing and chat spaces, and authoring and manipulation of distributed virtual environments. Keywords: CSCW (computer-supported collaborative work), groupware, narrowcasting functions, articulated mixing console.