Basic Information

University-Business Innovation Center
Adjunct Professor (Education and Research Special Advisor)
Web site


Courses - Undergraduate
Fourier Analysis
Courses - Graduate


Pattern Recognition, Artificial Intelligence, Robotics
Educational Background, Biography
1. The University of Tokyo (Graduate School, master's course) 2. Electrotechnical Laboratory of MITI (Researcher) 3. National Research Council of Canada (Visiting Scientist) 4. Real World Computing (National Project, Chief of division and laboratories)
Current Research Theme
Wide 3D scene image reconstruction from video, Video processing for automatic vehicle driving, Motion understanding from video, Recognition of speech captured in a cocktail-party setting, Automatic evaluation of sport performance from video, Detection of motions and flow lines of multiple moving objects from video, Moving robot, Drone network (Dronet)
Key Topic
Computer vision, Motion recognition, Speech recognition, Continuous DP, Cocktail party effect, Matching, Moving robot, Dronet, Goron
Affiliated Academic Society
IEEE, IEICE, Acoustical Society of Japan, Japanese Society for Artificial Intelligence


Listening to music, Reading books, Visiting art museums in foreign countries.
School days' Dream
Becoming a researcher of scientific engineering
Current Dream
Making our university more attractive. Finding new algorithms in engineering and verifying their reality in applications.
"Any creative work is done through a boy's pure and honest mind" by Ryotaro Shiba
Favorite Books
Books by Nanami Shiono, Goro Shimura, Shuichi Kato, Mitsuo Taketani, Taturu Uchida, Ryotaro Shiba, Emmanuel Todd, Yuval Noah Harari.
Messages for Students
When you face a choice between two directions, choose the more active one.
Publications other than one's areas of specialization
President's Message
Relay Essay #35
Relay Essay #68

Main research

Reconstructing a moving 3D image from video captured by a moving camera -- Application to automatic car driving and making a 3D map --

We propose a method for acquiring, in real time, a distance scene in an arbitrary direction of movement from video captured by a single camera mounted on a moving body such as a car or a drone. The aim is to provide an image sensor that can be applied to the automated driving of cars and drones without relying on a large number of expensive sensors of various kinds.

In recent years, sensors that capture the outside world have been mounted on moving objects such as cars and drones. Examples include the initiatives of Stanford University, Carnegie Mellon University, and Google Car around the DARPA Urban Challenge.

The purpose of these sensors is to acquire distance information for the automatic driving of cars and drones. Taking automatic car driving as an example, the main distance sensors are laser, ultrasonic, and infrared sensors and stereo cameras. In particular, about 10 laser sensors are installed in a single car, and some of them are extremely expensive. Some sensors, such as ultrasonic sensors, cover only a short range near the car. Even a laser sensor that reaches far from the car has difficulty specifying the object whose distance is being measured. A stereo camera likewise faces the problem of deciding which external object each point in the measured distance point cloud corresponds to.

This technique reconstructs a dynamic distance landscape by tracking the pixels of a video. Our method can also be used to make a 3D map for automatic car driving. In addition, when the car is stopped, the distance of each moving object in the scene can be extracted frame by frame from the video.
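
As an illustration of how pixel motion relates to distance, the sketch below uses the textbook motion-parallax relation for a pinhole camera that translates sideways: a static point at depth Z shifts by d = f * b / Z pixels when the camera moves b meters. This is only the simplest special case and is not the proposed algorithm, which handles arbitrary directions of movement; the function name, parameters, and numbers are assumptions chosen for the example.

```python
def depth_from_parallax(focal_px, baseline_m, disparity_px):
    """Depth Z = f * b / d for a pinhole camera translating sideways.

    focal_px:     focal length in pixels (assumed known from calibration)
    baseline_m:   distance the camera moved between the two frames, in meters
    disparity_px: horizontal shift of the tracked pixel between the frames
    """
    if disparity_px <= 0:
        # No measurable shift: the point is effectively at infinity.
        return float("inf")
    return focal_px * baseline_m / disparity_px


# Hypothetical numbers: 800 px focal length, 0.5 m of camera travel.
near = depth_from_parallax(800, 0.5, 20)  # large pixel shift, nearby point
far = depth_from_parallax(800, 0.5, 2)    # small pixel shift, distant point
print(near, far)
```

Pixels that move a lot between frames belong to nearby points and pixels that barely move belong to distant ones, which is why tracking every pixel can yield a dense distance image.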

A related technology is the tracking of objects in video. Conventional tracking technology simply follows the movement of a pedestrian or a similar target.

There are two categories of tracking: one uses a region template of the image, and the other uses feature templates (descriptions of image features). Region-template methods are typically based on a particle filter or mean-shift. Deep learning has also recently been proposed as a region-based approach.

In general, tracking by region template first requires extracting the image of the tracking target from a frame. The particle filter and mean-shift methods then determine where the target has moved in the video using a single region template: the particle filter uses a likelihood computed from the template's pixel-value histogram, and mean-shift uses the normalized histogram. Both methods process each tracked object separately.
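
The histogram comparison at the heart of region-template tracking can be sketched as follows. The sketch scores candidate regions against a template by the Bhattacharyya coefficient of their normalized gray-level histograms, a similarity commonly used with mean-shift and particle-filter trackers. The function names, bin count, and pixel values are illustrative assumptions; a real tracker would wrap particle sampling or the mean-shift iteration around this scoring step.

```python
import math


def histogram(pixels, bins=8, max_value=256):
    """Normalized histogram of integer gray levels in [0, max_value)."""
    counts = [0] * bins
    for p in pixels:
        counts[p * bins // max_value] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]


def bhattacharyya(h1, h2):
    """Similarity in [0, 1]; 1 means the two histograms are identical."""
    return sum(math.sqrt(a * b) for a, b in zip(h1, h2))


def best_candidate(template, candidates, bins=8):
    """Index of the candidate region whose histogram best matches the template."""
    ref = histogram(template, bins)
    scores = [bhattacharyya(ref, histogram(c, bins)) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


# Hypothetical gray levels: the second candidate resembles the template.
template = [10, 20, 200, 210]
candidates = [[100, 110, 120, 130], [12, 25, 205, 215]]
print(best_candidate(template, candidates))
```

Because the score depends only on the histogram of a pre-extracted template, this kind of tracking says where an object moved in the image but nothing about its distance from the camera.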

Next, we describe conventional feature-based tracking methods. The SIFT and SURF (Speeded-Up Robust Features) methods extract image features called SIFT and SURF, define points called keypoints, describe the SIFT or SURF features at those points, and estimate their positions in the image. Tracking is therefore impossible in places where these features do not exist.

As described above, these methods only track objects; they do not calculate the distance between the camera and the object.

Several patents on the proposed method have been registered domestically and internationally.

The following papers of ours explain the outline of the proposed method:

1) Ryuichi Oka, Keisuke Hata: "Reconstructing a moving 3D image from video captured by a forward-moving camera", MIRU 2018, PS3-1, 8th August (2018).

2) Ryuichi Oka, Yasuhiro Hashimoto, Yuichi Okuyama, Keisuke Hata: "Reconstruction of 3D Motion Image from Video Captured by a Single Camera on a Moving Body", Technical Journal of Advanced Mobility, Vol. 1, No. 1, pp. 3-13 (2020).


Dronet -- A proposal of a network of drones connected by cables for realizing new functions--

We show a demonstration video of our research on YouTube:

We proposed a new type of drone called "Dronet." A dronet is composed of many drones, each connected by cables to its neighboring drones. The drones in a dronet use distributed control to stabilize the dronet and to reach the target location. Flight toward a target location is realized by stabilizing the dronet against an additionally introduced virtual external force. A dronet can realize new functions that cannot be realized by a group of conventional drones.
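
To make the idea of reaching a target by stabilizing against a virtual force concrete, here is a toy one-dimensional sketch. It is not the controller used in the Dronet research: the cables are modeled as linear springs (so, unlike real cables, they can also push), each drone reacts only to its own velocity and its direct neighbors, and the virtual force is a spring that pulls the lead drone toward the target. All names, gains, and the integration scheme are assumptions made for the example.

```python
def simulate_dronet(n=3, target=10.0, rest_len=1.0, k_cable=4.0,
                    k_virtual=4.0, damping=2.0, dt=0.01, steps=5000):
    """Toy 1D chain of unit-mass drones; returns the final positions."""
    # Drone 0 leads; the others start one cable length apart behind it.
    pos = [-i * rest_len for i in range(n)]
    vel = [0.0] * n
    for _ in range(steps):
        forces = []
        for i in range(n):
            f = -damping * vel[i]  # local stabilization (velocity damping)
            if i == 0:
                # Virtual external force pulling the lead drone to the target.
                f += k_virtual * (target - pos[0])
            if i > 0:
                # Cable to the neighbor ahead pulls forward when stretched.
                f += k_cable * ((pos[i - 1] - pos[i]) - rest_len)
            if i < n - 1:
                # Cable to the neighbor behind pulls backward when stretched.
                f -= k_cable * ((pos[i] - pos[i + 1]) - rest_len)
            forces.append(f)
        for i in range(n):
            # Semi-implicit Euler step with unit mass.
            vel[i] += forces[i] * dt
            pos[i] += vel[i] * dt
    return pos


print(simulate_dronet())
```

In this toy model the chain settles with the lead drone at the target and the others one cable length apart behind it, using only local information at each drone.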

 Two types of dronet are proposed: one with a power supply cable from the ground, and one without a power supply from the ground. The latter type includes drones that are used only for carrying batteries. Both types are robust against external forces such as wind when capturing video of a scene with cameras, and both can carry a heavy object by combining the payloads of many drones.

 A drone with a cable supplying power from the ground can stay in the air for a long time. A line-shaped dronet can therefore enter the internal spaces of buildings or bridges, so that cameras or laser sensors can operate there for a long time to obtain the necessary sensor data.

 The sub-images show simulations of dronet motion: 1) a dronet carrying an object, 2) a dronet with a broken drone, 3) a line-shaped dronet. Hardware under construction is also shown.

Reference: Ryuichi Oka, Keisuke Hata: "Dronet --Drone Network Connected by Cables," Journal of the Society of Instrument and Control Engineers, Vol.56, January, pp.40-43, 2017. (in Japanese)

 The patent of this method is now pending.


Controlling run motions of a mobile robot by human gestures

We show a demonstration video of our research on YouTube:





 We are currently developing a mobile robot called "Goron". One of its functions is to send the video from its camera wirelessly to a computer, receive the result of the gesture recognition performed by the computer in real time, and operate as the person intends. Gesture recognition is performed with the "Time-space Continuous DP" algorithm.
If we implement the recognition algorithm on an FPGA or similar hardware, "Goron" can recognize gestures without wirelessly sending video data to an external computer.

 "Time-space Continuous DP" was developed by us earlier, and here it is used to recognize gestures from the video of a single camera mounted on a mobile robot. At present, gesture instruction is performed by recognizing four actions that instruct the mobile robot to turn clockwise, turn counterclockwise, and switch the light mounted on the robot on or off. The number of recognizable gestures is currently small, and we would like to increase it in the future.

 The movement of a mobile robot is normally determined autonomously, by human instruction, or by both. Self-driving cars are an example of the former. Mobile robots that work with humans, such as nursing robots, often need to move according to human instructions. Humans usually give such instructions in two ways: voice and gesture. Voice is convenient, but current technology has the restriction that, when the robot is far from the person, a microphone close to the person must be used. Gesture has no such restriction, but technology for reliably recognizing human gestures when both the robot and the human are moving has not yet been established.

 Conventionally, a mobile robot equipped with a laser, an ultrasonic sensor, a Kinect sensor, etc. detects and keeps track of a stationary person in its vicinity. There is very little research in which a mobile robot using video from a single camera recognizes the gestures of a moving person nearby and acts accordingly. The reason is that there has been no technique for recognizing a specific gesture, without specifying its start and end times, against a moving background and in an environment with an unspecified number of moving people.


Motion and trajectory detection of moving persons and cars from video captured by a camera in the sky

We show a demonstration video of our research on YouTube:

Grasping congestion and the flow of people and cars during a disaster is important for reducing harm to people. There has been much research and development on obtaining this information by analyzing images and videos captured by a camera on an airplane or a drone. However, conventional methods (optical flow, particle filter, Kalman filter, statistical processing of time-space voxel codes, etc.) have so far been insufficient for grasping the situation by detecting the motion of each person and each car. The problem of segmenting moving persons and moving cars in images or videos has not been well solved.
 A new algorithm is needed that can detect the motion of each person and each car over a wide area from video, and that works easily, quickly, automatically, in real time, and on long videos, in particular without requiring segmentation of the persons and cars in the images or videos. This task cannot be accomplished with laser or infrared sensors, because the scene is captured by a video camera in the sky.
 Our algorithm, called Time-Space Continuous Dynamic Programming (TSCDP) [1], can detect motion in the way described above, including solving the segmentation-free problem.

 We show two experimental results on detecting the motion of many persons and cars (see pictures). The first is the detection of the motion of each football player on the field during a game, from video captured by a camera. The second is the detection of the motion of each walking person and each moving car on a road from video. A person walking along a sidewalk is detected. Different colors in the scene images indicate different motions of moving objects.

 The experimental results clearly show the potential of TSCDP for grasping congestion and the flow of people and cars during a disaster. The patent of this method has been registered.

 Drones have recently become popular in many application domains. We now need to develop new algorithms that can be applied to data obtained by drones and that extract genuinely useful information. This work is mainly a matter of software rather than hardware.

[1] Yuki Niitsuma, Syunpei Torii, Yuichi Yaguchi & Ryuichi Oka:"Time-segmentation and position-free recognition of air-drawn gestures and characters in videos", Multimedia Tools and Applications, An International Journal, ISSN 1380-7501, Volume 75, Number 19, pp.11615--11639.



Reconstructing wide 3D images of city and indoor scenes from videos --- Data making for walk through of city and indoor scenes from videos

We show a demonstration video of our research on YouTube:


Using a standard video camera, it is easy to capture wide scenes of cities, towns, mountains, and the countryside, as well as indoor scenes. We proposed a new algorithm for making a dense 3D image, with a wide range of distances, of a scene covering a wide area in a video. This kind of research target is a frontier of vision research.

 The obtained 3D data is suitable for supporting the work of robots, as in so-called Visual SLAM. It also makes it easy to create content for VR systems, such as 3D world data for walk-throughs. Automatic car driving is another application, using video that captures a 360-degree scene from a camera on a moving car.

 There are many conventional methods for reconstructing a 3D image, using devices such as ultrasonic, laser, and infrared sensors, or vision-based techniques such as stereo vision, silhouette-based voxel filling, and object-based methods. However, for making a 3D image of a wide scene, these methods still have weaknesses such as limits on pixel size, a small range of distances, and inapplicability to objects with non-standard reflection characteristics. Moreover, conventional methods need to be combined with other techniques such as feature extraction (SIFT, etc.), factorization, RANSAC, and the Kalman filter. A new algorithm is therefore required to overcome these weaknesses.

 Our method solves most of the difficulties mentioned above.

 There are two kinds of 3D information for a wide scene. One is global 3D information for distinguishing larger objects such as buildings, roads, rivers, and woods. The other is for distinguishing the sub-objects belonging to each larger object. Our method can extract both kinds of 3D information. Here we show only the former.

 The following five images are: 1) one frame of a video capturing a city scene, 2) the RGB + distance image of 1) from one view angle, 3) an RGB + distance image constructed from video capturing our university garden.

Part of our method was published in the following paper: Ryuichi Oka and Ranaweera Rasika, "Region-wise 3D Image Reconstruction from Video Based on Accumulated Motion Parallax", MIRU2017, PS1-5, August 2017.

 The patent of this method is now pending.




Simultaneous recognition of multi-category human motions from video under the circumstances of occlusion and moving background

We show a demonstration video of our research on YouTube:






It is normal for a robot to move around the places where people live their daily lives. Around the robot there are therefore many moving objects, such as walking persons, moving cars, dogs, and cats. Moreover, a moving robot sees a moving scene through its own eyes.

In this situation, the robot looks at the motions of the people facing it. It must recognize these human motions and respond to the people with motions of its own and with suitable utterances in a synthesized voice.

If a robot cannot produce suitable actions based on its perception of the motions of the people facing it, it will not appear cooperative and will not be accepted by human society.

We have already developed a new algorithm called "Time-space Continuous Dynamic Programming (TSCDP)", which enables a robot to realize the functions described above that are required of a robot eye, so that the robot can cooperate well with our society.

Specifically, TSCDP is implemented on a robot. TSCDP works on the time-varying image captured by the eye of the moving robot, so that the robot can recognize human motions against a moving background.

Occlusion also often occurs in daily life. Occlusion means that there are blocking objects between the robot and the focused person who is making motions.

TSCDP also tolerates partial occlusion.

The attached picture shows the recognition of the motion "S" made by a focused person, captured by the moving camera of a robot against a moving background in which other persons are crossing the scene.

The functions realized by our proposed algorithm appear quite difficult to achieve with so-called deep learning, because deep learning is weak at recognizing motions captured by a moving camera in a moving scene.

The patent of TSCDP is registered.

This research uses the algorithm proposed by the following paper:

[1] Yuki Niitsuma, Syunpei Torii, Yuichi Yaguchi & Ryuichi Oka:"Time-segmentation and position-free recognition of air-drawn gestures and characters in videos", Multimedia Tools and Applications, An International Journal, ISSN 1380-7501, Volume 75, Number 19, pp.11615--11639.



Segmentation-free recognition of overlapped and multiple images using 2-dimensional continuous dynamic programming

Static image recognition is a central issue in deep learning and other approaches. However, when deep learning is applied to an image that contains multiple objects belonging to different categories, overlapping each other, and with nonlinearly deformed shapes and textures, it is difficult to recognize each of them.
 It is said that it is currently difficult to introduce "hierarchy" into deep learning. This is the main reason why deep learning is weak at the problem described above.
 Hierarchy should be introduced to solve image segmentation. However, segmentation and recognition are strongly coupled; that is, there is a "chicken and egg" problem. Recognition becomes easier if segmentation is possible, and vice versa. We need an algorithm that can solve this "chicken and egg" problem, namely the strong coupling of segmentation and recognition.

Two-dimensional continuous DP (2DCDP) is an extension of one-dimensional continuous DP to two dimensions. One-dimensional continuous DP (CDP), proposed by Oka in 1978, realizes segmentation-free recognition of one-dimensional patterns such as time series.
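
The idea of segmentation-free (spotting) recognition of a time series can be sketched with a simplified dynamic-programming recurrence in which a match may start at any input frame. This is a minimal illustration in the spirit of continuous DP, not the exact recurrence, local path weights, or normalization of the published CDP; the function name, the absolute-difference local distance, and the toy data are assumptions.

```python
def spotting_scores(reference, stream):
    """For each input frame, the normalized cost of the best match of the
    whole reference that ends at that frame (lower is better)."""
    n = len(reference)
    inf = float("inf")
    prev = [inf] * n  # accumulated costs at the previous input frame
    scores = []
    for x in stream:
        cur = [inf] * n
        for i, r in enumerate(reference):
            cost = abs(x - r)  # local distance (assumed for this sketch)
            if i == 0:
                # Free start: the reference may begin at any input frame.
                best = 0.0
            else:
                # Diagonal step, stay on the same reference element,
                # or advance the reference within the same input frame.
                best = min(prev[i - 1], prev[i], cur[i - 1])
            cur[i] = cost + best
        scores.append(cur[-1] / n)
        prev = cur
    return scores


# Toy stream in which the pattern 1, 2, 3 is embedded without boundaries.
scores = spotting_scores([1, 2, 3], [0, 0, 1, 2, 3, 0, 0])
print(scores.index(min(scores)))
```

The minimum of the score sequence marks the frame where the best match to the reference ends; thresholding the scores instead would spot every occurrence without segmenting the stream beforehand.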

Two-dimensional continuous DP is applied to each input image independently for each individual reference image of a single category, which makes segmentation-free recognition possible. This is because the individual reference patterns constitute a linear combination of hierarchies, whereas in deep learning the images of multiple categories constitute a single hierarchy. In addition, because two-dimensional continuous DP absorbs nonlinear deformation, including enlargement and reduction of the individual target image, a single reference pattern ("learning data") is sufficient.
 Two-dimensional continuous DP is useful for recognizing overlapped and multiple image patterns in a target image by taking such a linear combination of hierarchical structures; that is, 2DCDP is applied independently with each single reference image.

One-dimensional DP was naturally extended to two-dimensional image patterns through a series of joint works by many researchers, including Dr. T. Nishimura (now AIST), Mr. Iwasa (now Seiko-Epson Inc.), and Dr. Y. Yaguchi (U of Aizu), together with me. As a result, the algorithm has become a quite sophisticated and almost final version of 2DCDP.

In addition, 2DCDP requires only one reference (learning) pattern, whereas deep learning requires a large amount of data for learning.

One-dimensional Continuous Dynamic Programming was published in the following paper, although the first paper on CDP was published in Japanese in 1978.

[1] "Spotting Method for Classification of Real World Data": Ryuichi Oka, The Computer Journal, Vol.41, No.8, pp.559-565 (1998)


Speech Recognition from a Single Wave of Mixed Speakers -- Speech recognition without separation of a single speech wave simultaneously spoken by multiple speakers

We show a demonstration of our research on YouTube:

We propose an algorithm for recognizing speech from a single speech wave spoken simultaneously by multiple speakers. We use synthesized speech, labeled with a category, as the query, so that the recognition system works speaker-independently.
   Among its cognitive functions, the human brain has one known as the "cocktail party effect", which allows it to understand the meaning of a focused utterance within mixed speech spoken simultaneously by multiple speakers. The typical setting for this phenomenon is a cocktail party.
   The usual engineering attempt to reproduce the cocktail party effect applies Independent Component Analysis (ICA), which has strong potential for separating mixed speech into a set of independent speech signals. ICA only separates the speech; recognizing it is outside the scope of ICA. Moreover, applying ICA basically requires a set of microphones whose number is equal to or greater than the number of speakers.
   The human brain actually realizes the cocktail party effect using two ears. Having two ears is not equivalent to having two microphones; rather, the two ears are used to identify the location of a sound source in the 3D space around the person.
   We can therefore say that the human brain realizes this function using what amounts to a single microphone, which indicates that the same function can be engineered with a single microphone. ICA, which uses many microphones only to separate speech, may not be the intrinsic solution to the human cocktail party effect.
   Our method performs speech recognition on a single speech signal of mixed speakers, without separating the speech, using a query of synthesized speech that corresponds to a category.
The attached figure shows the experimental result of segmentation-free recognition of keywords and key phrases from a single speech signal spoken by English, Japanese, Chinese, and German speakers. The query keywords and key phrases are synthesized speech, which means that our method works speaker-independently.

The patent of the method is now pending.


Automatic recognition and its performance evaluation of figure skating from broadcasting video

We show a demonstration video of our research on YouTube.

We proposed a matching algorithm called Time-space Continuous Dynamic Programming (TSCDP) [1] for segmentation-free recognition of complex and multiple motions from a video stream. Segmentation-free characteristics work in both time and spatial position so that determination of both starting and ending times of each motion is not required, and any spatial position of each motion is allowed for recognition.

  Moreover, multiple and complex motions in a scene are also recognized, and moving backgrounds and occlusion are allowed. Real-time, segmentation-free recognition can be used for spotting recognition of sport motions such as figure skating and sumo performances, and it is useful for realizing automatic scoring and/or win-loss decisions.
Video captured by a moving camera is also allowed.

 These functions have not been realized by conventional methods including HMM etc. so far. 

 There are many other sensors for capturing human motions, such as infrared devices (Kinect etc.), laser devices, and accelerometers. However, these functions cannot be realized even with these devices.

  The figures show several applications of motion recognition, including complex and connected Chinese characters, as well as new functions such as the detection of moving cars realized by TSCDP. The realized functions provide the practical and ideal solutions that have been needed for real-world applications of motion recognition technology. The patent of this method has been registered.

[1] Yuki Niitsuma, Syunpei Torii, Yuichi Yaguchi & Ryuichi Oka:"Time-segmentation and position-free recognition of air-drawn gestures and characters in videos", Multimedia Tools and Applications, An International Journal, ISSN 1380-7501, Volume 75, Number 19, pp.11615--11639.


Dissertation and Published Works

1) A new cellular automaton structure for macroscopic linear-curved features extraction: Ryuichi Oka, p.654, Proc. 4-th International Joint Conference on Pattern Recognition (1978).
Comment: Proposal of Cellular Feature including orientation pattern which became standard features in the field of character recognition.

2) "Continuous Words Recognition by Use of Continuous Dynamic Programming for Pattern Matching": Ryuichi Oka, Technical Report of Speech Committee, Acoustical Society of Japan, Vol.S78-20, pp.145-152, June (1978) (in Japanese).
Comment: This is the first paper on Continuous Dynamic Programming, written in Japanese. Spotting recognition (segmentation-free recognition) realized by Continuous Dynamic Programming has since been extended to time sequences, 2D images, and time-varying images.

3) "Spotting Method for Classification of Real World Data": Ryuichi Oka, The Computer Journal, Vol.41, No.8, pp.559-565 (1998).
Comment: This paper is cited internationally in many papers concerning Continuous Dynamic Programming.

4) Hierarchical labeling for integrating images and words: Ryuichi Oka, Artificial Intelligence Review, Vol. 8, pp. 123-145 (1994).
Comment: This paper proposed an algorithm for middle vision, which seems to be the most difficult stage in vision understanding. There are three stages in computer vision, namely early, middle, and high level.

5) On Spotting Recognition of Gesture Motion from Time-varying Image: Ryuichi OKA, Takuichi Nishimura, Hiroaki Yabe, Transactions of Information Processing Society of Japan, Vol.43, No.SIG 4 (CVIM 4), pp.54-68 (2002).
Comment: This paper proposed an architecture called "frame-wise complete cycle" for real-time computer-human integration of multimedia.

6) Image-to-word transformation based on dividing and vector quantizing images with words: Y.Mori, H.Takahashi and R.Oka, First International Workshop on Multimedia Intelligent Storage and Retrieval Management (MISRM'99), December (1999)
Comment: This paper is cited in many papers related to language-vision integration. It is one of the pioneering papers on this topic.

7) Time-segmentation and position-free recognition of air-drawn gestures and characters in videos, Yuki Niitsuma, Syunpei Torii, Yuichi Yaguchi & Ryuichi Oka, Multimedia Tools and Applications, An International Journal, ISSN 1380-7501, Volume 75, Number 19, pp.11615--11639. 
Comment: This paper is an English paper describing the method called "Time-space Continuous Dynamic Programming" in detail. There is a set of algorithms based on concept of Continuous Dynamic Programming.