Day 0    10/29

08:00~ 09:00

Tutorial Registration

09:00~ 10:30

Tutorial #1
Power Line Communication: Introduction and Implementation to Solar Farm Monitoring

Prof. Ekachai Leelarasmee

Tutorial #2
Beyond the Visual and Imagery Based BCI – The New Developments in Spatial Auditory and Somatosensory Based Paradigms

Prof. Tomasz M. Rutkowski

Tutorial #3
Active Noise Control: Fundamentals and Recent Advances

Prof. Waleed H. Abdulla
Dr. Iman T. Ardekani

Tutorial #4
MIMO Signal Processing Techniques to Enhance Physical Layer Security

Prof. Y.-W. Peter Hong
Prof. Tsung-Hui Chang

10:30~ 10:50

Morning Break

10:50~ 12:20

Tutorial #1
Power Line Communication: Introduction and Implementation to Solar Farm Monitoring

Prof. Ekachai Leelarasmee

Tutorial #2
Beyond the Visual and Imagery Based BCI – The New Developments in Spatial Auditory and Somatosensory Based Paradigms

Prof. Tomasz M. Rutkowski

Tutorial #3
Active Noise Control: Fundamentals and Recent Advances

Prof. Waleed H. Abdulla
Dr. Iman T. Ardekani

Tutorial #4
MIMO Signal Processing Techniques to Enhance Physical Layer Security

Prof. Y.-W. Peter Hong
Prof. Tsung-Hui Chang

12:20~ 14:00


14:00~ 15:30

Tutorial #5
Machine Learning for Multimedia Sequential Pattern Recognition

Prof. Koichi Shinoda
Prof. Jen-Tzung Chien

Tutorial #6
Perceptual Quality Evaluation for Image and Video: from Modules to Systems

Prof. Weisi Lin

Tutorial #7
Research Roadmap Driven by Network Benchmarking Lab (NBL): Deep Packet Inspection, Traffic Forensics, WLAN/LTE, Embedded Benchmarking, and Beyond

Prof. Ying-Dar Lin

Tutorial #8
Next Generation Video Coding - H.265/HEVC and Its Extensions

Prof. Oscar C. Au

15:30~ 15:50

Afternoon Break

15:50~ 17:20

Tutorial #5
Machine Learning for Multimedia Sequential Pattern Recognition

Prof. Koichi Shinoda
Prof. Jen-Tzung Chien

Tutorial #6
Perceptual Quality Evaluation for Image and Video: from Modules to Systems

Prof. Weisi Lin

Tutorial #7
Research Roadmap Driven by Network Benchmarking Lab (NBL): Deep Packet Inspection, Traffic Forensics, WLAN/LTE, Embedded Benchmarking, and Beyond

Prof. Ying-Dar Lin

Tutorial #8
Next Generation Video Coding - H.265/HEVC and Its Extensions

Prof. Oscar C. Au

17:20~ 18:00

Free Time

APSIPA Technical Activities Board Meeting

18:00~ 21:00

Welcome Reception


Day 1    10/30

08:00~ 08:30


08:30~ 09:00

Opening Ceremony

09:00~ 10:00

Keynote #1
Artificial Neural Networks for Speech Recognition: A Historical Perspective

Prof. Nelson Morgan

10:00~ 10:20

Morning Break

10:20~ 12:00

Perceptual Considerations and Models in Visual Signal Processing

Intelligent Visual/Image Computation

Technologies for Security and Safety and Those Applications

Signal and Information Processing for Affective Computing

Machine Learning for Music Information Retrieval

Algorithm and Architecture for Orange Computing

Data-Driven Biomedical Signal Processing and Machine Learning Methods

Signal and Information Processing Theory and Methods

12:00~ 13:30


Technical Committee Meetings 1

Technical Committee Meetings 2

Technical Committee Meetings 3

13:30~ 14:30

APSIPA Assembly

14:30~ 16:30

Multimedia Signal Processing and Applications

Image and Video Coding

Recent Challenges and Applications on Active Noise Control

Spoken Term Detection

Speech Processing

Language Processing

Signal Processing Systems

Advanced Topics in Noise Reduction and Related Techniques for Signal Processing Applications

16:30~ 16:50

Afternoon Break

16:50~ 18:30

Object Modeling and Recognition

Advanced Image and Video Analysis

Visual Content Generation, Representation, Evaluation and Protection

Multimedia Security and Forensics

Recent Advances in Audio and Acoustic Signal Processing

Speech Recognition (I)

Recent Advances in Digital Filter Design and Implementation

Advances in Linear and Nonlinear Adaptive Signal Processing and Learning

18:40~ 21:40



Day 2    10/31

08:00~ 08:40


08:40~ 09:40

Keynote #2
Advances in Signal Processing for Networked Applications

Prof. John Apostolopoulos

09:40~ 10:00

Morning Break

10:00~ 12:00

Advanced Audio-Visual Analysis in Multimedia

Visual Data Understanding and Modeling

Green Communications and Networking

Speech Recognition (II)

Audio Signal Analysis, Processing and Classification (I)

3D Video Representation and Coding

Biomedical Signal Processing and Systems

Information Security and Multimedia Applications

APSIPA Transactions Editorial Board Meeting

12:00~ 13:30


Technical Committee Meetings 4

Technical Committee Meetings 5

Technical Committee Meetings 6

13:30~ 15:30

Forum Session

15:30~ 15:50

Afternoon Break

15:50~ 17:50

Plenary Overview Session I


Cruise Banquet

Buffet Dinner


Day 3    11/1

08:00~ 09:00


09:00~ 10:00

Keynote #3
Flexible Speech Synthesis Based on Hidden Markov Models

Prof. Keiichi Tokuda

10:00~ 10:20

Morning Break

10:20~ 12:00

Emerging Technologies in Multimedia Communications

3D Image Processing/Compression, Object Tracking, and Augmented Reality

Intelligent Multimedia Applications

Advanced Image and Video Processing

Image Enhancement and Restoration

Wireless Communications and Networking

Expressive Talking Avatar Synthesis and Applications

Toward High-Performance Real-World ASR Applications

12:00~ 13:30


13:30~ 14:50

Advanced Visual Media Data Generation, Editing and Transmission

Audio Signal Analysis, Processing and Classification (II)

Audio and Acoustic Signal Processing

Speech and Audio Coding and Synthesis

Emotion Analysis and Recognition

Design and Implementation for High Performance & Complexity-Efficient Signal Processing Systems

Frontiers of BioSiPS in Daily Life

Recent Advances in Multirate Processing and Transforms

14:50~ 15:10

Afternoon Break

15:10~ 17:10

Plenary Overview Session II

17:10~ 17:40

Closing Ceremony

18:00~ 21:00

Board of Governors Appreciation Dinner

Day 4    11/2

08:30~ 12:30

Board of Governors Meeting

12:30~ 13:30



Kaohsiung City Tour

OS.1-IVM.1: Perceptual Considerations and Models in Visual Signal Processing

Adaptive Picture-in-Picture Technology based on Visual Saliency

Shijian Lu*, Byung-Uck Kim*, Nicolas Lomenie, and Joo-Hwee Lim*

*A*STAR, Université Paris Descartes

Picture-in-picture (PiP) is a feature of some television receivers and video devices that allows one main program to be displayed on the full screen while one or more sub-programs are displayed in inset windows. Currently, most TV/video devices require users to specify where to place the sub-program over the main program display and how large to make it. This process is not user-friendly: it is manual, and once specified, the size and location of the sub-program stay fixed even when they block key visual information from the main program. We propose an automatic and adaptive PiP technology that makes use of computational modeling of visual saliency. For each frame of the main program, a saliency map is computed efficiently; it quantifies how likely a display region of the main program is to contain useful information and attract human attention. The sub-program can thus be adaptively resized and placed in the display region that contains the least useful information. Preliminary experiments show the effectiveness of the proposed technology.
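
The placement step described in this abstract can be illustrated with a minimal sketch. It assumes the saliency map is already available as a 2D grid of non-negative scores and that the inset size is fixed; the function name `least_salient_window` and the use of a summed-area table are illustrative choices, not details taken from the paper.

```python
def least_salient_window(saliency, win_h, win_w):
    """Return (row, col, total) of the win_h x win_w window with the lowest summed saliency."""
    rows, cols = len(saliency), len(saliency[0])
    # Summed-area table with a zero border, so any window sum costs four lookups.
    sat = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            sat[r + 1][c + 1] = (saliency[r][c] + sat[r][c + 1]
                                 + sat[r + 1][c] - sat[r][c])
    best = None
    for r in range(rows - win_h + 1):
        for c in range(cols - win_w + 1):
            total = (sat[r + win_h][c + win_w] - sat[r][c + win_w]
                     - sat[r + win_h][c] + sat[r][c])
            if best is None or total < best[2]:
                best = (r, c, total)
    return best


# Hypothetical 3x4 saliency map: the top-left area is the most salient.
saliency_map = [
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.9, 0.0, 0.1],
    [0.2, 0.3, 0.6, 0.5],
]
row, col, _ = least_salient_window(saliency_map, 2, 2)
print(row, col)  # prints "0 2": the top-right 2x2 block covers the least salient content
```

Adaptive resizing, which the abstract also mentions, could be approximated by repeating the search over several candidate window sizes; that extension is left out here.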


Effect of Content on Visual Comfort in Viewing Stereoscopic Videos

Bo Chang*, Fuzheng Yang#, Shuai Wan

*Xidian University, #Xidian University, Northwestern Polytechnical University

Visual discomfort is a serious obstacle to the widespread adoption of stereoscopic video, yet assessing it is a complicated issue. In this paper, we study four major factors that affect visual comfort by conducting extensive subjective assessments: foreground disparity, size of the foreground object, disparity distribution, and in-depth motion. The relationships between visual comfort and these four factors are analyzed, and four conclusions are drawn from the experimental results. First, when the influences of disparity magnitude and range of disparity act together and conflict, the latter dominates. Second, the degree of visual comfort increases with object size within a certain range and then levels off or decreases slightly beyond a certain threshold. Third, better visual comfort is always obtained when the bottom part is perceived as nearer to the viewers than the upper part. Last but not least, the variation of parallax over time might be one of the major factors affecting visual comfort; its influence is complicated and is reflected in the object's velocity, motion direction, and starting plane.


No-reference IPTV Video Quality Modeling Based on Contextual Visual Distortion Estimation

Ning Liao* and Zhibo Chen*

*University of Science and Technology China

No-reference IPTV H.264 video quality modeling at the bitstream level has just been standardized in the ITU-T SG12/Q14 P.NBAMS work group as P.1202.2. Compression artifacts, channel artifacts, and their mutual influence are considered in the database design to reflect realistic situations. For P.NBAMS, we contributed a no-reference slicing channel artifact measurement method based on contextual visual distortion estimation, called CVD for short in this paper, which was accepted into the final P.1202.2 Recommendation because it performed best in the standard competition. In the CVD scheme, we first predict the initial visibility of channel artifacts in an individual frame where packet loss occurs and detect scene cut artifacts at the bitstream level. Second, we apply a low-complexity zero-motion-based visible artifact propagation procedure, which emphasizes the most significant visual distortion rather than weighting the propagated distortion and the initial distortion equally. Finally, we model the visibility of temporal artifacts by extracting two new features from the contextual distortions. The proposed CVD scheme outperforms or matches the full-reference metric MSE on the five training databases of P.NBAMS, with an average correlation of 0.838 and an average RMSE of 0.42.


Ego Motion Induced Visual Discomfort of Stereoscopic Video

Jongyoo Kim*, Kwanghyun Lee, Taegeun Oh, and Sanghoon Lee

*Yonsei University, Yonsei University

When a video sequence is captured, inappropriate camera motion is one of the crucial factors leading to visual discomfort and distortion. A well-known symptom, visually induced motion sickness (VIMS), is caused by the illusion of self-motion when perceiving video with ego motion. For stereoscopic 3D video in particular, viewers can readily be observed to feel much more severe symptoms of visual discomfort. In this paper, we analyze the ego motion of stereoscopic video and predict its effects. We take a novel approach that exploits computer vision algorithms: we propose a method that estimates the perceptual 3D ego motion from the stereoscopic video, and then we analyze the ego motion components to predict the visual discomfort of the stereoscopic video.


Visual-Saliency-Enhanced Image Quality Assessment Indices

Joe Yuchieh Lin*, Tsung Jung Liu*, Weisi Lin, and C.-C. Jay Kuo*

*University of Southern California, Nanyang Technological University

Modern image quality assessment (IQA) indices, e.g., SSIM and FSIM, have proved effective for some image distortion types. However, they do not exploit the characteristics of the human visual system (HVS) explicitly. In this work, we investigate a method to incorporate the human visual saliency (VS) model into these full-reference indices, and call the resulting indices SSIMVS and FSIMVS, respectively. First, we decompose an image into non-overlapping patches, calculate visual saliency, and assign a parameter ranging from 0 to 1 to each patch. Then, the local SSIM or FSIM values of the patches are weighted by this parameter. Finally, the weighted similarities of all patches are integrated into a single index for the whole image. Experimental results are given to demonstrate the improved performance of the proposed VS-enhanced indices.
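
The three steps in this abstract (per-patch saliency parameter, weighting of local scores, pooling into one index) can be sketched as follows. The abstract does not say how the weighted values are integrated, so the normalized weighted mean used here, the function name, and the sample numbers are all assumptions for illustration.

```python
def saliency_weighted_index(local_scores, saliency_weights):
    """Pool per-patch similarity scores using per-patch saliency weights in [0, 1]."""
    if len(local_scores) != len(saliency_weights):
        raise ValueError("one weight is required per patch")
    total_weight = sum(saliency_weights)
    if total_weight == 0:
        # Assumed fallback: with no salient patch, use the plain mean.
        return sum(local_scores) / len(local_scores)
    weighted = sum(w * s for w, s in zip(saliency_weights, local_scores))
    return weighted / total_weight


# Hypothetical local SSIM values for three patches and their saliency parameters.
scores = [0.9, 0.5, 0.7]
weights = [1.0, 0.0, 0.5]
print(round(saliency_weighted_index(scores, weights), 4))  # prints 0.8333
```

In this example the badly distorted middle patch has zero saliency, so it does not lower the pooled index; an unweighted mean of the same scores would give 0.7.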



OS.2-IVM.2: Intelligent Visual/Image Computation

Inverse Halftoning Based on Edge Detection Classification

Qi-Xuan Ong and Wen-Liang Hsue

Chung Yuan Christian University

Inverse halftoning is a technique for reconstructing gray-level images from halftone images. Since the image halftoning process results in information loss, inverse halftoning cannot perfectly reconstruct the original gray-level images from the corresponding halftone images. Consequently, several inverse halftoning methods have been proposed, e.g., LIH and ELIH [1]-[2]. In this paper, we first review an existing inverse halftoning technique with variance classified filtering [3]. We replace the LMS (least-mean-square) algorithm with the LS (least-square) algorithm to improve the training stage for variance classified inverse halftoning in [3]. We then use edge detection, instead of the variance used in [3], to classify image data. Experimental results show that both the LS filtering and the edge detection classification proposed in this paper enhance the quality of the output gray-level images for inverse halftoning.


Virtual-view-based Stereo Video Composition

Chun-Kai Chang and Hsueh-Ming Hang

National Chiao Tung University

Given two sets of videos captured by two sets of multiple cameras, we would like to combine them to create a new stereo scene with the foreground objects from one set of videos and the background from the other set. We address the camera parameter mismatch and camera orientation mismatch problems in this paper. We propose a floor model to adjust the camera orientation. Once we pick the landing point (of the foreground) in the background scene, we need to adjust the background camera parameters (position etc.) to match the foreground object. The depth information is needed in the above calculation. Thus, new background scenes may have to be synthesized based on the calculated virtual camera parameters and the given background pictures. Plausible results are demonstrated using the proposed algorithms.


Video Retargeting Using Non-homogeneous Scaling and Cropping

Wen-Yu Yo, Jin-Jang Leou, and Han-Hui Hsiao

National Chung Cheng University

To display a video sequence on different display devices with different resolutions and aspect ratios, the video frames should be resized (retargeted) adaptively while the important content of the video sequence is retained. This is called the video retargeting problem. The proposed video retargeting approach consists of three main stages: preprocessing, optimization, and transformation. In the preprocessing stage, the initial scaling factor map, the saliency measure, and the temporal coherence measure of each video frame are computed. In the optimization stage, an iterative optimization procedure is used to find the scaling factor maps of individual video frames via an optimization function involving spatial benefit and temporal cost. In the transformation stage, the scaling factor maps of individual video frames are improved by cropping. Then, the retargeted video frames are generated by pixel fusion using the scaling factor maps. Based on the experimental results obtained in this study, the proposed approach performs better than the five comparison approaches.


Real-time Aerial Targets Detection Algorithm Based Background Subtraction

Mao Zheng, Zhen-Rong Wu, Saidsho Bakhdavlatov, Jing-Song Qu, Hong-Yan Li, Jian-Jian Yuan

Beijing University of Technology

In this paper, we propose a new technique that incorporates several innovative mechanisms for aerial target detection. The traditional algorithm has high time complexity and is more limited when the target size changes, so it struggles to meet the critical real-time and accuracy requirements of practical applications. We therefore propose an aerial target detection algorithm based on background subtraction. The background is modeled completely in the first frame of the video, the target is detected in the second frame, and the location and size of the target are obtained by connecting area detection. From the third frame onward, we use the size and position of the target in the previous frame to open a window and track the target within it. Tracking the target inside the window eliminates background interference and reduces time consumption. Experiments show that, compared with the traditional algorithm, the proposed algorithm runs in real time, detects and tracks targets efficiently, and is highly practical. Keywords: local entropy; aerial target detection and tracking; background subtraction; connecting area detection; window
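
As a rough illustration of the detection stage described above, the sketch below subtracts a background frame, thresholds the difference, and reports the bounding box of each connected foreground area. The threshold value, the 4-connectivity, and the function name `detect_targets` are assumptions; the windowed tracking stage and the local-entropy step named in the keywords are not modeled.

```python
from collections import deque


def detect_targets(background, frame, threshold=30):
    """Return bounding boxes (top, left, bottom, right) of connected foreground areas."""
    rows, cols = len(frame), len(frame[0])
    # Background subtraction: a pixel is foreground if it differs enough from the model.
    mask = [[abs(frame[r][c] - background[r][c]) > threshold for c in range(cols)]
            for r in range(rows)]
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill over 4-connected foreground pixels.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Hypothetical 4x5 gray-level frames: a bright target appears against a flat sky.
sky = [[10] * 5 for _ in range(4)]
current = [row[:] for row in sky]
current[1][1] = current[1][2] = current[2][2] = 200
print(detect_targets(sky, current))  # prints [(1, 1, 2, 2)]
```

The returned box gives the position and size that the abstract says are used to open the tracking window in the following frame.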


Representation of The Player Action in Sport Videos

Jingwen Zhang*, Jiankang Qiu*, Xiangdong Wang, Lifang Wu*

*Beijing University of Technology, China Institute of Sports Science

In a sports video, the complex and dynamic background and the complexity of player actions make it considerably difficult to analyze the player's action. In this paper, we propose a method to represent the player's action as a whole on the panoramic background, which is helpful for watching and analyzing the action. We first build the panoramic background from the inter-frame relationship. Object segmentation is then implemented by subtraction between the panoramic background and each warped frame. Finally, the player's action in each frame is represented on the panoramic background, and the trajectory of the action is presented on the background. The experimental results show that the proposed method is effective for sports videos such as diving and jumping, in which the player performs his or her action in a large arena and the camera motion is mainly horizontal and vertical.


A New Fast Adopted Image Matching Approach with Arbitrary Rotation

Guangmin Sun, Zibo Li, Yufeng Jin, Junling Sun, Kun Zheng, Dequn Zhao

Beijing University of Technology

Keeping an image matching method self-adjusted is very difficult. As an important aspect of the self-adjustment problem, the angle between the template and the image directly decides whether the matching can succeed. In particular, the result of image matching is seldom accurate when the traditional matching approach is used. In recent years, a series of new ideas has brought a great breakthrough in the field of image matching. This paper discusses a new self-adjusted image matching approach based on a pyramid model and polar coordinates. Polar coordinates are used to remove the influence of rotation, and the pyramid model is used to keep the algorithm real-time. The simulation results show that this algorithm is more stable and faster than the traditional pixel-to-pixel algorithm. Moreover, this approach is easier to carry out and can become a simple step in a complex analysis system.



OS.3-IVM.3: Technologies for Security and Safety and Those Applications

Local Consistency Preserved Coupled Mappings for Low-Resolution Face Recognition

Zuodong Yang1,2, Yong Wu1,2, Yinyan Jiang1,2, Yicong Zhou3, Longbiao Wang4, Weifeng Li1,2 and Qingmin Liao1,2

1IS & DRM, 2Tsinghua University, 3University of Macau, 4Nagaoka University of Technology

Existing face recognition systems can achieve high recognition rates in the well-controlled environment. However, when the resolution of the test images is lower than that of the gallery images, the performance degrades seriously. Traditional two-step solutions (first adopting super-resolution (SR) method, and then performing the recognition phase) mainly focus on visual enhancement, rather than classification. In this paper, we utilize Local Consistency Preserved Coupled Mappings (LCPCM-I) to project the face images with different resolutions onto a new common space for recognition based on coupled mappings (CM). To achieve better results, we incorporate discriminant information with LCPCM (LCPCM-II). The experimental results on FERET database verify the effectiveness of our proposed method.


Generation of All-focus Images and Depth-adjustable Images Based on Pixel Blurriness

Yi-Chong Zeng

Institute for Information Industry

Taking a picture in which all objects are in focus regardless of distance is difficult. A feasible strategy for generating an all-focus image relies on multiple images with different focus settings. Furthermore, image refocusing with side information can be applied to the all-focus image to generate a depth-adjustable image that resembles a real-world photo at the specified depth of field. In this paper, we propose image fusion and refocusing methods for generating all-focus and depth-adjustable images. In the fusion procedure, the first step is to measure the pixel blurriness of the multi-focus reference images. Then, an all-focus image composed of the sharpest pixels among the reference images, an index map, and a standard deviation-index table are produced. The purpose of the standard deviation-index table is to relate each index to a Gaussian lowpass filter. In the refocusing procedure, different Gaussian lowpass filters are applied to the all-focus image by referring to the index map and the standard deviation-index table, and the depth-adjustable image is then generated. The experimental results demonstrate that the proposed methods are better than the compared approaches in computational complexity and in the quality of the all-focus and depth-adjustable images.


Image Restoration Using Similar Region Search and Digital Watermark

Kazuki Yamato*, Madoka Hasegawa*, Kazuma Shinoda*, Shigeo Kato*, and Yuichi Tanaka

*Utsunomiya University, Tokyo University of Agriculture and Technology

Baroumand et al. proposed an image restoration method that uses the correlation of the image, called watermark driven decentralized best matching (WDBM). In this method, the most similar region for each block is searched for, and the position information of the most similar region is embedded as a digital watermark. However, in some cases, some of the pixels in the corrupted block cannot be restored by WDBM. In this paper, we propose an image restoration method that embeds the position information of the similar region into a block located away from the current block at regular intervals. The proposed method can restore all pixels in the corrupted block with high probability and improves the restoration rate.


Phase-Based Image Matching and Its Application to Biometric Recognition

Koichi Ito and Takafumi Aoki

Tohoku University

This paper presents an accurate image matching technique that uses phase information obtained by the discrete Fourier transform of images, and its application to biometric recognition. Most biometric recognition algorithms employ computer vision, pattern recognition, and image processing techniques, or a combination of them. In contrast, our approach using phase-based image matching is based on signal processing techniques. Through a set of experiments on fingerprint, iris, face, palmprint, and finger knuckle recognition, we demonstrate that our signal processing approach exhibits efficient performance for biometric recognition compared with conventional approaches.
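
One common way to match signals using only Fourier phase is phase-only correlation (POC); whether the paper uses exactly this formulation is an assumption. The sketch below works on 1D signals with a naive DFT to stay dependency-free, whereas image matching would use the 2D transform. The helper names and the test signal are hypothetical.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse) of a sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def phase_only_correlation(f, g, eps=1e-12):
    """Correlate f and g using only the phase of their cross spectrum."""
    F, G = dft(f), dft(g)
    normalized = []
    for a, b in zip(F, G):
        cross = a.conjugate() * b
        # Dividing by the magnitude discards amplitude and keeps the phase term.
        normalized.append(cross / (abs(cross) + eps))
    return [v.real for v in dft(normalized, inverse=True)]


# g is f circularly shifted right by 3 samples.
f = [4, 1, 0, 0, 0, 0, 0, 0]
g = [0, 0, 0, 4, 1, 0, 0, 0]
poc = phase_only_correlation(f, g)
peak = max(range(len(poc)), key=lambda i: poc[i])
print(peak)  # prints 3, the displacement between the two signals
```

For identical content the correlation is close to a unit impulse at the displacement, so the peak height can serve as a similarity score and the peak position as the shift estimate.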


Compressing JPEG Compressed Image Using Reversible Data Hiding Technique

Sang-ug Kang*, Xiaochao Qu, and Hyoung Joong Kim

*Sangmyung University, Korea University, Korea University

Since the concept of reversible data hiding technique was introduced, many researchers have applied it for authentication of uncompressed images. In this paper, an algorithm is introduced to compress JPEG files again without any loss in image quality. The proposed method can modify an entire segment of VLC codeword sequence to embed a bit of data. The modified codewords may destroy the correlation, or the smoothness, between neighboring pixels of the recovered image. The data extractor utilizes the smoothness change to know the hidden data. For this, a novel smoothness measurement function which uses both inter- and intra-MAD values is proposed. When the smoothness change is small, two consecutive segments are concatenated to extract correct data with higher smoothness sensitivity. As a result, compression ratio or embedding capacity is increased in most natural images.


A Blind Lossless Information Embedding Scheme Based on Generalized Histogram Shifting


Tokyo Metropolitan University

This paper proposes a scheme for lossless information embedding based on generalized histogram shifting that is free from memorizing embedding parameters. A generalized histogram shifting-based lossless information embedding (GHS-LIE) scheme embeds q-ary information symbols in an image according to the tonal distribution of the image. The scheme not only extracts the embedded information but also restores the original image from the distorted image carrying the concealed information. Whereas conventional GHS-LIE must memorize a set of image-dependent parameters for hidden information extraction and original image recovery, the proposed scheme avoids such memorization through three strategies: histogram peak shifting, double side modification, and guard zero histogram bins.
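
For background, the sketch below shows plain binary histogram shifting, the conventional kind of scheme in which the extractor must be told image-dependent parameters. It is not the blind scheme proposed in this paper. The function names, the requirement that the empty bin lie above the peak bin, and the sample pixel values are assumptions for illustration.

```python
def hs_embed(pixels, bits, peak, zero):
    """Embed bits into pixels equal to `peak`; requires peak < zero and no pixel == zero."""
    if not (peak < zero) or zero in pixels:
        raise ValueError("need peak < zero and an empty histogram bin at `zero`")
    if len(bits) > pixels.count(peak):
        raise ValueError("payload exceeds the number of peak-valued pixels")
    payload = iter(bits)
    marked = []
    for p in pixels:
        if peak < p < zero:
            marked.append(p + 1)                  # shift to free the bin at peak + 1
        elif p == peak:
            marked.append(p + next(payload, 0))   # unused capacity carries a 0
        else:
            marked.append(p)
    return marked


def hs_extract(marked, peak, zero, n_bits):
    """Recover the payload and the original pixels; needs peak, zero, and n_bits."""
    bits, restored = [], []
    for p in marked:
        if p == peak or p == peak + 1:
            bits.append(p - peak)
            restored.append(peak)
        elif peak + 1 < p <= zero:
            restored.append(p - 1)                # undo the shift
        else:
            restored.append(p)
    return bits[:n_bits], restored


# Hypothetical gray levels: 5 is the histogram peak and 7 is an empty bin.
original = [3, 5, 5, 4, 5, 6, 2, 5]
marked = hs_embed(original, [1, 0, 1], peak=5, zero=7)
bits, restored = hs_extract(marked, peak=5, zero=7, n_bits=3)
print(marked)    # prints [3, 6, 5, 4, 6, 7, 2, 5]
print(bits)      # prints [1, 0, 1]
print(restored == original)  # prints True
```

The `peak`, `zero`, and `n_bits` arguments to `hs_extract` are exactly the kind of side information that the abstract says the proposed scheme no longer has to memorize.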



OS.4-SLA.1: Signal and Information Processing for Affective Computing

Combining Emotional History Through Multimodal Fusion Methods

Linlin Chao, Jianhua Tao and Minghao Yang

Chinese Academy of Sciences

Continuity, one of the important characteristics of emotion, implies that emotional history may provide useful information for emotion recognition. Meanwhile, classification-based multimodal fusion methods show effective results in multimedia analysis tasks. In this study, a two-stage classification method is proposed, and emotional history is combined using the support vector machine fusion method. Evaluations on the Audio Sub-Challenge of the 2011 Audio/Visual Emotion Challenge dataset show that (i) combining emotional history with recognition improves the accuracy significantly, and (ii) classification-based multimodal fusion methods can effectively combine emotional history.


Affective-Cognitive Dialogue Act Detection in an Error-Aware Spoken Dialogue System

Wei-Bin Liang, Chung-Hsien Wu, and Meng-Hsiu Sheng

National Cheng Kung University

This paper presents an approach to affective-cognitive dialogue act detection in a spoken dialogue. To achieve this goal, the input utterance is decoded separately into an affective state by an emotion recognizer and into a word sequence by an imperfect speech recognizer. In addition, four types of evidence are employed to grade the score of each recognized word. The recognized word sequence is used to derive candidate sentences, alleviating the problem of unexpected language usage for the cognitive state predicted by the vector space-based dialogue act detection. A Boltzmann selection-based method is then employed to predict the next possible act in the spoken dialogue system according to the affective-cognitive states. A model of affective anticipatory reward, assumed to arise from the emotional seeking system, is adopted to enhance the efficacy of dialogue act detection. Finally, evaluation data are collected, and the experimental results confirm the improved task completion rate of the proposed approach compared with the baseline system.
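
Boltzmann selection, in its generic form, picks an action with probability proportional to the exponential of its score divided by a temperature. The sketch below shows only that generic rule; the act names, the scores, the temperature, and the function names are hypothetical, and how the paper derives scores from the affective-cognitive states is not modeled.

```python
import math
import random


def boltzmann_probabilities(scores, temperature=1.0):
    """Map act scores to selection probabilities proportional to exp(score / temperature)."""
    top = max(scores.values())
    # Subtracting the maximum keeps the exponentials numerically stable.
    weights = {act: math.exp((s - top) / temperature) for act, s in scores.items()}
    total = sum(weights.values())
    return {act: w / total for act, w in weights.items()}


def boltzmann_select(scores, temperature=1.0, rng=random):
    """Sample one act according to its Boltzmann probability."""
    probs = boltzmann_probabilities(scores, temperature)
    acts = list(probs)
    return rng.choices(acts, weights=[probs[a] for a in acts], k=1)[0]


# Hypothetical scores for three candidate dialogue acts.
act_scores = {"confirm": 2.0, "ask_again": 1.0, "answer": 1.0}
probs = boltzmann_probabilities(act_scores)
print(max(probs, key=probs.get))  # prints "confirm"
print(boltzmann_select(act_scores, rng=random.Random(0)) in act_scores)  # prints True
```

A lower temperature concentrates the probability on the best-scoring act, while a higher one makes the choice closer to uniform.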


BFI-Based Speaker Personality Perception Using Acoustic-Prosodic Features

Chia-Jui Liu, Chung-Hsien Wu, Yu-Hsien Chiu*

National Cheng Kung University, *Kaohsiung Medical University

This paper presents an approach to automatic prediction of the traits that listeners attribute to a speaker they have never heard before. In previous research, the Big Five Inventory (BFI), one of the most widely used questionnaires, has been adopted for personality assessment. Based on the BFI, this study adopts an artificial neural network (ANN) to project the input speech segment onto the BFI space using acoustic-prosodic features. The personality trait is then predicted by estimating the BFI scores obtained from the ANN. For performance evaluation, two versions of the BFI (a complete questionnaire and a simplified version) were adopted. The experiments were performed on a corpus of 535 speech samples assessed in terms of personality traits by experienced subjects. The results show that the proposed method for predicting the trait is efficient and effective, and that the prediction accuracy can reach 70%.


Feature Space Dimension Reduction in Speech Emotion Recognition Using Support Vector Machine

Bo-Chang Chiou and Chia-Ping Chen

National Sun Yat-sen University

We report implementations of automatic speech emotion recognition systems based on support vector machines in this paper. While common systems often extract a very large feature set per utterance for emotion classification, we conjecture that the dimension of the feature space can be greatly reduced without severe degradation of accuracy. Consequently, we systematically reduce the number of features via feature selection and principal component analysis. The evaluation is carried out on the Berlin Database of Emotional Speech, also known as EMODB, which consists of 10 speakers and 7 emotions. The results show that we can trim the feature set to 37 features and still maintain an accuracy of 80%. This means a reduction of more than 99% compared to the baseline system which uses more than 6,000 features.
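
The reduction figure quoted in this abstract can be checked with a few lines of arithmetic. The baseline size is only given as "more than 6,000", so 6,000 is used here as a lower bound; the function name is hypothetical.

```python
def feature_reduction(kept, baseline):
    """Fraction of the baseline feature set that was removed."""
    return 1.0 - kept / baseline


# 6,000 is a lower bound on the baseline size, so the true reduction is at least this large.
print(f"{feature_reduction(37, 6000):.2%}")  # prints 99.38%
```

Because a larger baseline only increases the fraction removed, keeping 37 features is consistent with the stated reduction of more than 99%.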


Evaluation on Text Categorization for Mathematics Application Questions

Liang-Chih Yu, Hsiao-Liang Hu and Wei-Hua Lin

Yuan Ze University

In learning environments, developing intelligent systems that can respond properly to learners' emotions is a critical issue for improving learning outcomes. For example, systems should consider replacing the current question with an easier one when they detect negative emotions expressed by learners. Conversely, systems can try to retrieve a more challenging question when learners show contempt or feel bored. This paper proposes the use of text categorization to automatically classify mathematics application questions into different difficulty levels. Applications can then use such classification results to develop retrieval systems that propose questions based on learners' emotional states. Experimental results show that the machine learning algorithm C4.5 achieved the highest accuracy, 78.53%, in a binary classification task.


Comparing Feature Dimension Reduction Algorithms for GMM-SVM based Speech Emotion Recognition

Jianbo Jiang*, Zhiyong Wu*, Mingxing Xu, Jia Jia and Lianhong Cai*

*Tsinghua University, Tsinghua University

Selecting effective emotional features is important for improving the performance of automatic speech emotion recognition. Although various feature dimension reduction algorithms have been put forward to help raise the accuracy of emotion distinction, most of them have defects, such as a strong negative impact on the recognition rate or high computational complexity. In this paper, two dimension reduction algorithms, based on PCA (principal component analysis) and KPCA (kernel PCA), are therefore compared. The original features extracted from the databases are transformed by PCA or KPCA. The weights of the new features in the transformation matrix are calculated and ranked, and features are chosen on that basis. Experimental results show that feature dimension reduction can make a principal contribution to the accuracy of speech emotion recognition, and that KPCA slightly outperforms PCA in hit rate and number of remaining dimensions.



OS.5-SLA.2: Machine Learning for Music Information Retrieval

Analyzing the Dictionary Properties and Sparsity Constraints for a Dictionary-based Music Genre Classification System

Ping-Keng Jao, Li Su and Yi-Hsuan Yang

Academia Sinica

Learning dictionaries from a large-scale music database is a burgeoning research topic in the music information retrieval (MIR) community. It has been shown that classification systems based on such learned features exhibit state-of-the-art accuracy in many music classification benchmarks. Although the general approach of dictionary-based MIR has been shown to be effective, little work has been done to investigate the relationship between system performance and dictionary properties, such as the sparsity, coherence, and condition number of the dictionary. This paper aims to address this issue by systematically evaluating the performance of three types of dictionary learning algorithms for the task of genre classification: the least-square based RLS (recursive least square) algorithm and two variants of the stochastic gradient descent-based algorithm ODL (online dictionary learning) with different regularization functions. Specifically, we learn the dictionary with the USPOP2002 dataset and perform genre classification with the GTZAN dataset. Our results show that setting strict sparsity constraints in the RLS-based dictionary learning (i.e., <1% of the signal dimension) leads to better accuracy in genre classification (around 80% when a linear kernel support vector classifier is adopted). Moreover, we find that different sparsity constraints are needed for the dictionary learning phase and the encoding phase. Important links between dictionary properties and classification accuracy are also identified, such as a strong correlation between reconstruction error and classification accuracy in all algorithms. These findings help the design of future dictionary-based MIR systems and the selection of important dictionary learning parameters.


Enhance Popular Music Emotion Regression by Importing Structure Information

Xing Wang, Yuqian Wu, Xiaoou Chen, Deshun Yang*

*Peking University

Emotion is a useful means of organizing a music library, and automatic music emotion recognition is drawing more and more attention. Music structure information is imported to improve the result for music emotion regression. A music dataset with emotion and structure annotations is built, and features concerning lyrics, audio and MIDI are extracted. For each emotion dimension, regressors are built using different features on different types of segments in order to find the best segment for music emotion regression. Results show that structure information can help improve emotion regression. Verse is good for pleasure recognition, while chorus is good for arousal and dominance. The difference between verse and chorus can also help improve regressors.


These Words are Music to My Ears: Recognizing Music Emotion from Lyrics Using AdaBoost

Dan Su and Pascale Fung


In this paper, we propose using AdaBoost with decision trees to implement music emotion classification (MEC) from song lyrics as a more appropriate alternative to the conventional SVMs. Traditional text categorization methods using bag-of-words features and machine learning methods such as SVM do not perform well on MEC from lyrics because lyrics tend to be much shorter than other documents. Boosting builds on many weak classifiers that model the presence or absence of salient phrases to make the final classification. Our accuracy reached an average of 74.12% on a dataset consisting of 3766 songs with 14 emotion categories, compared to an average of 69.72% accuracy using the well-known SVM classification, a statistically significant improvement.


Probabilistic Model of Two-Dimensional Rhythm Tree Structure Representation for Automatic Transcription of Polyphonic MIDI Signals

Masato Tsuchiya, Kazuki Ochiai, Hirokazu Kameoka, Shigeki Sagayama

The University of Tokyo, Chiyoda-ku Hitotsubashi

This paper proposes a Bayesian approach for automatic music transcription of polyphonic MIDI signals based on generative modeling of onset occurrences of musical notes. Automatic music transcription involves two interdependent subproblems: rhythm recognition and tempo estimation. When we listen to music, we are able to recognize its rhythm and tempo (or beat location) fairly easily even though there is ambiguity in determining the individual note values and tempo. This may be made possible through our empirical knowledge about rhythm patterns and tempo variations that possibly occur in music. To automate the process of recognizing the rhythm and tempo of music, we propose modeling the generative process of a MIDI signal of polyphonic music by combining the sub-process by which a musically natural tempo curve is generated and the sub-process by which a set of note onset positions is generated based on a 2-dimensional rhythm tree structure representation of music, and we develop a parameter inference algorithm for the proposed model. We show some of the transcription results obtained with the present method.


Identification of Live or Studio Versions of A Song via Supervised Learning

Nicolas Auguin, Shilei Huang and Pascale Fung


We aim to distinguish between the "live" and "studio" versions of songs by using supervised techniques. We show which segments of a song are the most relevant to this classification task, and we also discuss the relative importance of audio, music and acoustic features, given this challenge. This distinction is crucial in practice since the listening experience of the user of online streaming services is often affected, depending on whether the song played is the original studio version or a secondary live recording. However, manual labelling can be tedious and challenging. Therefore, we propose to classify automatically a music data set by using Machine Learning techniques under a supervised setting. To the best of our knowledge, this issue has never been addressed before. Our proposed system is proven to perform with high accuracy on a 1066-song data set with distinct genres and across different languages.



OS.6-SPS.1: Algorithm and Architecture for Orange Computing

Classification of Children with Voice Impairments using Deep Neural Networks

Chien-Lin Huang and Chiori Hori

National Institute of Information and Communications Technology

This paper presents the use of deep neural networks for the classification of children with voice impairments from speech signals. In the analysis of speech signals, 6,373 static acoustic features are extracted from many kinds of low-level descriptors and functionals. To reduce the variability of the extracted features, two-dimensional normalizations are applied to smooth the inter-speaker and inter-feature mismatch using the feature warping approach. Then, feature selection is used to find a discriminative and low-dimensional representation based on principal component analysis and linear discriminant analysis. In such a representation, robust features are obtained by eliminating noise features via subspace projection. Finally, deep neural networks are adopted to classify the children with voice impairments. We conclude that deep neural networks with the proposed feature normalization and selection can significantly contribute to the robustness of recognition in practical application scenarios. We have achieved a UAR of 60.9% for the four-way diagnosis classification on the development set. This is a relative improvement of 16.2% over the official baseline using our single system.


Numerical Calculation of the Head-Related Transfer Functions with Chinese dummy head

Ling Tang, Zhong-Hua Fu and Lei Xie

Northwestern Polytechnical University

The Head-Related Transfer Function (HRTF) plays an important role in virtual auditory technology. Since directly measuring the HRTF is rather complicated and time-consuming, especially when measuring individuals to obtain personalized HRTFs, much research has focused on predicting the HRTF by numerical methods such as the boundary element method (BEM). In this study, we present our work on numerical calculation of the HRTFs with a standard Chinese dummy head, BHead210. The BEM-based method is introduced and the calculated HRTFs are compared to the measured HRTFs, as well as the well-known KEMAR HRTFs. The distinct differences in the HRTFs between BHead210 and KEMAR are given.


Framework of Ubiquitous Healthcare System Based on Cloud Computing for Elderly Living

Yang-Yen Ou1, Po-Yi Shih1, Yu-Hao Chin2, Ta-Wen Kuan1, Jhing-Fa Wang1,3 and Shao-Hsien Shih1

1National Cheng Kung University, 2National Central University, 3Tajen University

This work discusses an integrated framework for health care, home safety, convenience and entertainment for elderly living. Since an increasing number of elderly people will live alone yet require smarter and more automated home services, the proposed framework attempts to develop a smart living helper to improve their later life. Two systems are proposed in the Ubiquitous Healthcare System (UHS) framework: the Web-based User Remote Management Service (WURMS) and the Multimodal Interactive Computation Services (MICS). The proposed systems coordinate several existing audio-visual and communication techniques, including speech/sound recognition, speaker identification, face identification, sound source estimation, text-to-speech (TTS) and event recognition. To provide elder-friendly interfaces, the proposed services include 1) home care services, 2) emergency assistance, 3) family interaction, 4) remote medical services, 5) security monitoring, and 6) information services, making elders' lives more convenient.


Graph Based Orange Computing Architecture for Cyber-Physical Being

Anand Paul and Yu-Hao Chin

National Central University, Kyungpook National University

In this paper, a graph-based orange computing architecture for the cyber-physical being (CPB) is proposed. The physical world is not entirely predictable; thus cyber-physical systems (CPS) revitalize traditional computing with a contemporary real-world approach that can touch our day-to-day life. A graph is modeled to tackle the architectural optimization in a CPB network. This framework of assisted methods improves the quality of life for functionally locked-in individuals. A learning pattern is developed to facilitate the wellbeing of the individual. Surveys were performed for multiple CPBs, considering different states, activities and quality-of-life schemes.



Instrumental Activities of Daily Living (IADL) Evaluation System Based on EEG Signal Feature Analysis

Yang-Yen Ou1, Po-Yi Shih1, Po-Chuan Lin2*, Jhing-Fa Wang1,3, Bo-Wei Chen1 and Sheng-Chung Chan1

1National Cheng Kung University, 2TungFang Design University, 3Tajen University

This work proposes an IADL evaluation system using the LDA algorithm on EEG signals, which explores the correlation between the subjective IADL assessment and the objective EEG signal measurement. Five features are extracted from a single-channel EEG device: average amplitude, power ratio, spectral centroid, and spectral edge frequencies at 25% and 50%. These features serve as an indicator of a participant's IADL and are classified into IADL scales using the LDA algorithm. For system evaluation, thirty elderly participants (70 to 96 years old) are classified into three groups by IADL score: high (disability-free, 16 to 24 points), medium (mild disability, 8 to 15 points) and low (severe disability, 0 to 7 points). Participants are distributed uniformly across these IADL groups and carry out the following scenarios: 1) ability to use the telephone, 2) ability to handle finances, and 3) chatting with people (which is not an IADL scenario). The experimental results show that the proposed EEG features and evaluation system can achieve a 90% average accuracy rate, verified by leave-one-out cross validation (LOOCV).
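The leave-one-out protocol mentioned in this abstract can be sketched as follows. Each sample is held out once, a classifier is fitted on the rest, and the held-out sample is predicted. To keep the sketch dependency-free, a nearest-centroid classifier is swapped in for the LDA classifier used by the authors; the function names and toy one-dimensional features are hypothetical and do not come from the paper.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Predict the label whose class centroid is closest to `sample`.

    Stand-in classifier (assumption): the paper uses LDA, not this rule.
    """
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1

    def squared_distance(label):
        return sum((s / counts[label] - v) ** 2
                   for s, v in zip(sums[label], sample))

    return min(sums, key=squared_distance)


def loocv_accuracy(xs, ys):
    """Hold out each sample once and report the fraction predicted correctly."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if nearest_centroid_predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)


# Toy data (illustrative only): two well-separated groups.
toy_x = [[0.0], [0.2], [0.4], [5.0], [5.2], [5.4]]
toy_y = ["low", "low", "low", "high", "high", "high"]
```

With only thirty participants, LOOCV uses every sample for testing exactly once while training on the remaining twenty-nine, which is why it suits small cohorts like the one described.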



OS.7-BioSiPS.1: Data-Driven Biomedical Signal Processing and Machine Learning Methods

Sound Quality Indicating System Using EEG and GMDH-type Neural Network

Kiminobu Nishimura and Yasue Mitsukura

Keio University

In this paper, we propose a sound quality evaluation system using electroencephalogram (EEG) and a group method of data handling (GMDH) type neural network. Recently, EEG has been used in various applications, and we focus on sound quality evaluation using EEG. We prepared EEG samples to train a GMDH-type neural network to recognise 3 typical types of sound, which were used to create the training data. The results showed that using the GMDH-type neural network improved the recognition rate compared to the other method. Additionally, we repeated the simulations with different parameters of the GMDH-type neural network, and the open test results showed how the recognition rate varies with the parameter values.


Feature Extraction of P300 Signal Using Bayesian Delay Time Estimation

Reo Togashi* and Yoshikazu Washizawa*

*The University of Electro-Communications

Brain-computer interfaces (BCIs) based on event-related potentials (ERP) are communication tools for severely disabled people. P300, which is observed about 300 milliseconds after a stimulus, is widely used as the operating principle of BCIs. However, the response time to the stimuli depends on the subject, the trial, and also the channel. Many existing approaches ignore this variation and extract only the low-frequency component. We propose a method to estimate the response time of P300 using Bayesian estimation. The proposed method exhibited higher performance in our auditory BCI.


QEEG Index for Early Detection of Imbalanced Conditions During Aerobatics

Kittichai Tharawadeepimuk and Yodchanan Wongsawat


In this paper, brain activity was studied and observed while a person was in a loss-of-balance state due to vestibular system disorder. Brain activity was recorded as QEEG data, which was analyzed by the Z-scored FFT method. The experimental results showed brain activities in the positions O1, O2, P3, P4, PZ, T5, and T6 in the form of a topographic map (Absolute Power) and brain connectivity (Amplitude Asymmetry). The topographic map showed that, during the loss-of-balance state, brain areas at the occipital and parietal lobes have a high intensity level of brain processing. For brain connectivity, the results clearly showed the brain connections while subjects were trying to maintain their balance. These strong connections were represented by red lines between the O1 and O2 positions (occipital lobe) and Pz (parietal lobe), Fz (frontal lobe) and Cz (central lobe). Similarly, there were strong connections between the T5 and Fz positions and between the T5 and T6 positions. On the other hand, the connection between T6 and Fz was found to be weak. The analysis of the results can be employed in an alarm system for pilots during aerobatics.


Band Selection by Criterion of Common Spatial Patterns for Motor Imagery Based Brain Machine Interfaces

Hiroshi Higashi and Toshihisa Tanaka

*Tokyo University of Agriculture and Technology, RIKEN Brain Science Institute

Design of spatial weights and filters that extract components related to brain activity of motor imagery is a crucial issue in brain machine interfaces. This paper proposes a novel method to design these filters. We use a similarity of the covariance matrices of narrow band observed signals over frequency bins. The similarity is defined based on the common spatial pattern method. This proposed method enables us to design the multiple bandpass filters. The experimental results of classification of EEG signals during motor imagery show that the proposed method achieves higher classification accuracy than well-known conventional methods.


Single Trial BCI Classification Accuracy Improvement for the Novel Virtual Sound Movement-based Spatial Auditory Paradigm

Yohann Lelièvre*§, Yoshikazu Washizawa§, and Tomasz M. Rutkowski§▽

*University of Bordeaux, University of Tsukuba, Tsukuba, The University of Electro-Communications, §RIKEN Brain Science Institute

This paper presents a successful attempt to improve single trial P300 response classification results in a novel moving sound spatial auditory BCI paradigm. We present a novel paradigm, together with a linear support vector machine classifier application, which allows a boost in single trial based spelling accuracy in comparison with classic stepwise linear discriminant analysis methods. The results of the offline classification of the P300 responses of seven subjects support the proposed concept, with a classification improvement of up to 80%, leading, in the best case presented, to an information transfer rate boost of 28.8 bit/min.
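For readers unfamiliar with the bit/min figure, the sketch below computes an information transfer rate using the widely used Wolpaw formula. This is an assumption: the abstract does not say which ITR definition the authors apply, and the function names and example values are hypothetical.

```python
import math


def wolpaw_bits_per_selection(n_classes, accuracy):
    """Bits conveyed by one selection among `n_classes` equiprobable targets.

    Assumption: Wolpaw's definition,
    B = log2(N) + P*log2(P) + (1 - P)*log2((1 - P) / (N - 1)).
    """
    if n_classes < 2 or not 0.0 < accuracy <= 1.0:
        raise ValueError("need n_classes >= 2 and 0 < accuracy <= 1")
    if accuracy == 1.0:
        return math.log2(n_classes)
    return (math.log2(n_classes)
            + accuracy * math.log2(accuracy)
            + (1.0 - accuracy) * math.log2((1.0 - accuracy) / (n_classes - 1)))


def itr_bits_per_minute(n_classes, accuracy, selections_per_minute):
    """Scale the per-selection information by the selection rate."""
    return wolpaw_bits_per_selection(n_classes, accuracy) * selections_per_minute
```

Under this definition a binary selection at chance level (50% accuracy) carries zero bits, so the rate grows with both the accuracy and the number of selections made per minute.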



OS.8-SIPTM.1: Signal and Information Processing Theory and Methods

On Integer-valued Zero Autocorrelation Sequences

Soo-Chang Pei* and Kuo-Wei Chang

*National Taiwan University, National Taiwan University, Chunghwa Telecom Laboratories

A systematic way to construct integer-valued zero autocorrelation sequences is proposed. This method uses only fundamental theorems of the discrete Fourier transform (DFT) and some number theory.
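The link between zero autocorrelation and the DFT can be checked numerically: a sequence has zero periodic autocorrelation at every nonzero shift exactly when all of its DFT coefficients have the same magnitude. The sketch below only verifies this property; it is not the construction proposed in the paper. Periodic (circular) autocorrelation is assumed, and the function names and example sequences are hypothetical.

```python
import cmath


def periodic_autocorrelation(seq):
    """Circular autocorrelation of a real sequence at every shift."""
    n = len(seq)
    return [sum(seq[i] * seq[(i + shift) % n] for i in range(n))
            for shift in range(n)]


def is_zero_autocorrelation(seq):
    """True if the autocorrelation vanishes at all nonzero shifts."""
    return all(v == 0 for v in periodic_autocorrelation(seq)[1:])


def dft_power(seq):
    """Squared magnitudes of the DFT coefficients of `seq`."""
    n = len(seq)
    return [abs(sum(seq[m] * cmath.exp(-2j * cmath.pi * k * m / n)
                    for m in range(n))) ** 2
            for k in range(n)]


perfect = [1, 1, 1, -1]   # a known integer-valued example
ramp = [1, 2, 3, 4]       # a counterexample
```

For `perfect`, every DFT coefficient has squared magnitude 4, equal to the sequence energy, whereas `ramp` has a non-flat spectrum and nonzero off-peak autocorrelation.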


On Ergodic Secrecy Capacity of Fast Fading MIMOME Wiretap Channel With Statistical CSIT

Shih-Chun Lin

National Taiwan University of Science and Technology

In this paper, we consider secure transmission in ergodic Rayleigh fast-faded multiple-input multiple-output multiple-antenna-eavesdropper (MIMOME) wiretap channels with only statistical channel state information at the transmitter (CSIT). When the legitimate receiver has at least as many antennas as the eavesdropper, we prove the first MIMOME secrecy capacity result with partial CSIT by establishing a new secrecy capacity upper bound. The key step is forming a MIMOME degraded channel by dividing the legitimate receiver's channel matrix into two submatrices, and setting one of them equal to the eavesdropper’s channel matrix. Next, subject to the total power constraint over all transmit antennas, we solve the channel-input covariance matrix optimization problem to fully characterize the MIMOME secrecy capacity. Typically, the MIMOME optimization problems are non-concave. However, with the aid of the proposed degraded channel, we show that the stochastic MIMOME optimization problem can be transformed into a Schur-concave problem to find its optimal solution. Finally, we find that the MIMOME secrecy capacities scale with the signal-to-noise ratio given a large enough number of antennas at the legitimate receiver. However, as shown in previous works, such scaling does not exist for wiretap channels with a single antenna at each of the legitimate receiver and the eavesdropper.


A Preprocessing Method to Increase High Frequency Response of A Parametric Loudspeaker

Chuang Shi and Woon-Seng Gan

Nanyang Technological University

Unlike a conventional loudspeaker, a parametric loudspeaker, whose principle is derived from nonlinear acoustic effects in air, can create a controllable sound beam with a compact emitter. Therefore, the parametric loudspeaker is advantageous in deployments where personal audio zones are in demand. However, the very sound generation principle of the parametric loudspeaker also gives it a narrow bandwidth, which hinders its adoption in applications requiring high fidelity. In this paper, we propose a tentative treatment of harmonic distortions incurred in parametric loudspeakers, whereby the asymmetrical amplitude modulation (AAM) method is used to extend the high frequency response by doubling the higher cut-off frequency.


The Split-Radix Fast Fourier Transforms with Radix-4 Butterfly Units

Sian-Jheng Lin and Wei-Ho Chung

Research Center for Information Technology, Academia Sinica

We present a split radix fast Fourier transform (FFT) algorithm consisting of radix-4 butterflies. The major advantages of the proposed algorithm include: i) The proposed algorithm consists of mixed radix butterflies, whose structure is more regular than that of the conventional split radix algorithm. ii) The proposed algorithm is asymptotically equal in computation to the split radix algorithm and requires fewer operations than the radix-4 algorithms. iii) The proposed algorithm is in the conjugate-pair version, which requires less memory access than the conventional FFT algorithms.
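For reference, the conventional split-radix decomposition that this abstract compares against can be sketched as below. It splits a length-N transform into one length-N/2 transform over the even samples and two length-N/4 transforms over the samples at indices 1 and 3 modulo 4. This is the textbook recursion only, not the radix-4-butterfly or conjugate-pair variant proposed by the authors; the function names are hypothetical and the input length is assumed to be a power of two.

```python
import cmath


def split_radix_fft(x):
    """Conventional recursive split-radix FFT (length must be a power of two)."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    u = split_radix_fft(x[0::2])    # even-indexed samples, length n/2
    z = split_radix_fft(x[1::4])    # indices 1 mod 4, length n/4
    zp = split_radix_fft(x[3::4])   # indices 3 mod 4, length n/4
    out = [0j] * n
    quarter = n // 4
    for k in range(quarter):
        t1 = cmath.exp(-2j * cmath.pi * k / n) * z[k]        # w^k   * Z[k]
        t3 = cmath.exp(-6j * cmath.pi * k / n) * zp[k]       # w^3k  * Z'[k]
        out[k] = u[k] + (t1 + t3)
        out[k + 2 * quarter] = u[k] - (t1 + t3)
        out[k + quarter] = u[k + quarter] - 1j * (t1 - t3)
        out[k + 3 * quarter] = u[k + quarter] + 1j * (t1 - t3)
    return out


def naive_dft(x):
    """Direct O(N^2) DFT, used only to check the fast version."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


signal = [1, 2, 3, 4, 0, -1, -2, 5]   # illustrative input
```

Comparing `split_radix_fft(signal)` with `naive_dft(signal)` is a quick sanity check of the butterfly signs and twiddle factors.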


L1-Norm-Based Coding for Lattice Vector Quantization

Wisarn Patchoo*, Thomas R. Fischer and Curtis Maddex

*Bangkok University, Washington State University

A lattice vector quantization encoding method is developed based on L1-norm-based enumeration and bit-plane coding. The algorithm is implemented for the cubic lattice, can handle arbitrary vector dimension, and is suitable for transform, subband, or wavelet coding applications. Moreover, the algorithm can possibly be extended to other binary lattices.


Generalized Polynomial Wigner Spectrogram for High-Resolution Time-Frequency Analysis

Jian-Jiun Ding, Soo Chang Pei, and Yi-Fan Chang

National Taiwan University

A good time-frequency (TF) analysis method should have the advantages of high clarity and no cross term. However, there is always a trade-off between the two goals. In this paper, we propose a new TF analysis method, which is called the generalized polynomial Wigner spectrogram (GPWS). It combines the generalized spectrogram (GS) and the polynomial Wigner-Ville distribution (PWVD). The PWVD has a good performance for analyzing the instantaneous frequency of a high order exponential function. However, it has the cross term problem in the multiple component case. By contrast, the GS can avoid the cross term problem, but its clarity is not enough. The proposed GPWS can combine the advantages of the PWVD and the GS. It can achieve the goals of high clarity, no cross term, and less computation time simultaneously. We also perform simulations to show that the proposed GPWS has better resolution than other TF analysis methods.



OS.9-IVM.4: Multimedia Signal Processing and Applications

PicWall: Photo Collage On-the-fly

Zhipeng Wu and Kiyoharu Aizawa

The University of Tokyo

Photo collage, which constructs a compact and visually appealing representation from a collection of input images, provides a convenient and impressive user experience. Previous approaches to automatic collage generation usually treat it as an optimization problem, in which researchers try to find the best balance between maximizing the visibility of the photos' salient areas and compactly arranging the collage canvas layout. However, automatic saliency detection can sometimes be harmful since we cannot guarantee that all of the users' areas of interest are well kept. On the other hand, the rapid development of mobile technology also calls for a robust solution for fast collage generation without any computationally expensive processes such as saliency detection and graph-cut. The PicWall approach proposed in this paper offers real-time collage generation. Given the expected canvas sizes, it tightly packs the input images while keeping their aspect ratios and orientations unchanged. Experiments show that it takes less than 0.5 ms to generate a 100-photo collage. In addition, various PicWall-based applications are demonstrated.


A Capture-to-Display Delay Measurement System for Visual Communication Applications

Chao Wei*, Haoming Chen**, Mingli Song*, Ming-Ting Sun** and Kevin Lau

*Zhejiang University, **University of Washington, T-Mobile USA

We propose an effective method to measure the capture-to-display delay (CDD) of a visual communication application. The method requires neither modifications to the existing system nor synchronization of the encoder and decoder clocks. Furthermore, we propose a solution to the multiple overlapped-timestamp problem caused by the response time of the display and the exposure time of the camera. We implemented the method in software to measure the capture-to-display delay of a cellphone video chat application over various types of networks. Experiments confirmed the effectiveness of our proposed methods.


A Supporting System of Chorus Singing for Visually Impaired Persons using Depth Image Sensor

Noriyuki Kawarazaki*, Yuhei Kaneishi*, Nobuyuki Saito** and Takashi Asakawa

*Kanagawa Institute of Technology, **Asap System Corporation, Kinki University Technical College

This paper provides a study on a supporting system of chorus singing for visually impaired persons. This system is composed of an electric music baton with an acceleration sensor, a radio module, haptic interface devices with vibration motors, depth image sensor and a PC. An electric music baton transmits a signal of a conductor’s motion to visually impaired players based on an acceleration of the sensor. Since a conductor has to give individual instruction to a target player, we use a depth image sensor in order to recognize the pointing direction of a conductor’s music baton. The pointing direction of a conductor’s music baton is estimated based on the conductor’s posture. The effectiveness of our system is clarified by several experimental results.


Tag Suggestion and Localization for Images by Bipartite Graph Matching

Wei-Ta Chu1, Cheng-Jung Li1, and Jen-Yu Yu2

1National Chung Cheng University, 2Industrial Technology Research Institute

Given an image that is loosely tagged by a few tags, we would like to accurately localize these tags into appropriate image regions, and at the same time suggest new tags for regions if necessary. In this paper, this task is formulated on a bipartite graph, and is solved by finding the best matching between two disjoint sets of nodes. One set of nodes represents regions segmented from an image, and the other set represents a combination of existing tags and new candidate tags retrieved from photo sharing platforms. In graph construction, visual characteristics represented by the bag-of-words model and users' tagging behaviors are jointly considered. By finding the best matching with the Hungarian algorithm, the region-tag correspondence is determined, and tag suggestion and tag localization are accomplished simultaneously. Experimental results show that the proposed unified framework achieves promising image tagging performance.


Automatic Facial Expression Recognition For Affective Computing Based on Bag of Distances

Fu-Song Hsu1, Wei-Yang Lin2, Tzu-Wei Tsai3

1,2National Chung Cheng University, 3National Taichung University of Science and Technology

In recent years, the video-based approach has been a popular choice for modeling and classifying facial expressions. However, methods of this kind require different facial expressions to be segmented prior to recognition, which can be a challenging task for real-world videos. Thus, in this paper, we propose a novel facial expression recognition method based on extracting discriminative features from a still image. Our method first combines holistic and local distance-based features so that facial expressions can be characterized in more detail. The combined distance-based features are subsequently quantized to form mid-level features using the bag-of-words approach. The synergistic effect of these steps leads to much improved class separability, and thus we can use a typical method, e.g., Support Vector Machine (SVM), to perform classification. We have performed the experiment on the Extended Cohn-Kanade (CK+) dataset. The experimental results show that the proposed scheme is efficient and accurate in facial expression recognition.


Impact Analysis of Temporal Resolution in Thermal Signal Reconstruction via Infrared Imaging

Wen-Chin Yang, You-Gang Yang, Yun-Chung Liu, Wei-Min Liu

National Chung-Cheng University

Thermal Signal Reconstruction (TSR) is a well-known nondestructive evaluation technique. It enhances the locations of defects in composite materials in the aerospace industry, and has recently been applied in infrared imaging of human hands to locate microvasculature under the skin. In this work we reinvestigated the new application and explored the relationship between the temporal resolution in TSR and the quality of reconstructed images. The goal is to optimize this variable to improve the detection rate of microvasculature under difficult situations.


Negative-Voting and Class Ranking Based on Local Discriminant Embedding for Image Retrieval

Mei-Huei Lin, Chen-Kuo Chiang and Shang-Hong Lai

National Tsing Hua University

In this paper, we propose a novel image retrieval system by using negative-voting and class ranking schemes to find similar images for a query image. In our approach, the image features are projected onto a new feature space that maximizes the precision of image retrieval. The system involves learning a projection matrix for local discriminant embedding, generating class ordering distribution from a negative-voting scheme, and providing image ranking based on class ranking comparison. The evaluation of mean average precision (mAP) on the Holidays dataset shows that the proposed system outperforms the existing retrieval systems. Our methodology significantly improves the image retrieval accuracy by combining the idea of negative-voting and class ranking under the local discriminant embedding framework.



OS.10-IVM.5: Image and Video Coding

Fast Coding Unit Decision Algorithm for HEVC

Wei-Jhe Hsu and Hsueh-Ming Hang

National Chiao-Tung University

HEVC adopts a flexible Coding Unit (CU) quadtree structure. With more flexible CU size selection, the coding efficiency of HEVC increases significantly but its complexity is much higher than that of H.264/MPEG-4 AVC. To reduce computational complexity, we propose a fast algorithm, which consists of a splitting decision and a termination decision, for constructing the CU quadtree. This scheme is designed to be complementary to the current three fast tools included in the HEVC test model HM5.0. In other words, when it is combined with the existing fast CU tools, it still provides additional time savings. The time reduction of our scheme is most noticeable on HD pictures. In comparison with the original HM5.0, our proposed method saves about 43% of the encoding time on average, and the BD rate increases by about 2.2% for the HD test sequences.


Error Equalization for High Quality LDR Images in Backward Compatible HDR Image Coding

Masahiro Iwahashi* and Hitoshi Kiya

*Nagaoka University of Technology, Tokyo Metropolitan University

This report introduces an error equalization method to increase the quality of low dynamic range (LDR) images in backward compatible high dynamic range (HDR) image coding. To utilize a current standard source coding algorithm such as the JPEG encoder, a compression-friendly mapping such as the power function is applied before compression. On the receiver side, the decoded image is mapped once again, based on a visually proper function such as the Hill function, to generate the LDR image. In this report, we investigate how the errors added in the data compression process are magnified through the inverse power function and the Hill function, and show that the errors in dark pixels of the LDR image are extremely high. Based on this, we equalize the probability density function of the error in the LDR image to increase its quality. In experiments, the PSNR of the LDR image increased by approximately 3 to 4 dB in the rate distortion curves.


Fast Mode Decision for H.264/AVC Based on Local Spatio-Temporal Coherency

Wei Zhang, Yan Huang and Jingliang Peng

Shandong University

In this paper, we propose a fast mode decision algorithm for H.264/AVC. It is based on the spatio-temporal coherency of the local neighborhood including and around the current block. We first build the histograms of the current block and the co-located block in the reference frame, respectively. If the difference between the two histograms is small, we use large-block-size modes. Otherwise, we subdivide the current block into four equal-sized sub-blocks and estimate the motion vector (MV) for each. In general, if there is a high degree of coherency between those MVs, we use large-block-size modes and otherwise small-block-size modes. In addition, we use the number of neighboring large blocks and the sub-blocks' rate-distortion (R-D) costs as further hints for the mode decision. As experimentally demonstrated, our algorithm leads to significant savings in computing time on the test video sequences.


Depth Video Coding Based on Motion Information Sharing Prediction for Mixed Resolution 3D Video

Kwan-Jung Oh* and Ho-Cheon Wey

*ETRI, Samsung Electronics

Three-dimensional (3D) video has evolved and spread rapidly in recent years. In 3D video, since a depth component and the corresponding texture component have similar object movement, there is redundancy in their motion information. To remove this motion redundancy, motion information sharing from texture to depth is useful in 3D video coding. In this paper, we propose a method for motion information sharing prediction (MISP) for use in mixed resolution 3D video coding. The MISP is applied to any type of macroblock, even if it is coded in intra mode. Experimental results show that the proposed MISP reduces the bit rate for depth by up to 7.3%.


Improved Sample Adaptive Offset for HEVC

Hong Zhang, Oscar C. Au, Yongfang Shi, Wenjing Zhu, Vinit Jakhetiya, Luheng Jia

Hong Kong University of Science and Technology

High-Efficiency Video Coding (HEVC) is the newest video coding standard, which can significantly reduce the bit rate by 50% compared with existing standards. One new efficient tool is sample adaptive offset (SAO), which classifies reconstructed samples into different categories and reduces the distortion by adding an offset to the samples of each category. Two SAO types are adopted in HEVC: edge offset (EO) and band offset (BO). Four 1-D directional edge patterns are used in the edge offset type, and only one is selected for each CTB. However, a single directional pattern cannot remove artifacts effectively for CTBs that contain edges in different directions. Therefore, we analyze the performance of each edge pattern applied to this kind of CTB, and propose to take advantage of existing edge classes and combine some of them into a new edge offset class, which can adapt to multiple edge directions. All the combinations are tested, and the results show that for the Low Delay P condition, they can achieve a 0.2% to 0.5% bit rate reduction.
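The baseline edge offset classification that this abstract builds on can be sketched as follows: each sample is compared with its two neighbours along one of four 1-D directions and placed into one of five categories. The category rules follow the HEVC SAO design as commonly described; the function names, the direction labels and the toy block are hypothetical, and the combined edge class proposed by the authors is not reproduced here.

```python
# Neighbour offsets (row, column) for the four 1-D edge offset directions.
NEIGHBOUR_OFFSETS = {
    "horizontal": ((0, -1), (0, 1)),
    "vertical": ((-1, 0), (1, 0)),
    "diagonal_135": ((-1, -1), (1, 1)),
    "diagonal_45": ((-1, 1), (1, -1)),
}


def edge_offset_category(a, c, b):
    """Category of sample `c` given its two neighbours `a` and `b`."""
    if c < a and c < b:
        return 1                      # local minimum
    if (c < a and c == b) or (c == a and c < b):
        return 2                      # concave corner
    if (c > a and c == b) or (c == a and c > b):
        return 3                      # convex corner
    if c > a and c > b:
        return 4                      # local maximum
    return 0                          # flat or monotone: no offset applied


def classify_sample(block, row, col, direction):
    """Classify an interior sample of a 2-D block along one direction."""
    (r1, c1), (r2, c2) = NEIGHBOUR_OFFSETS[direction]
    return edge_offset_category(block[row + r1][col + c1],
                                block[row][col],
                                block[row + r2][col + c2])


# Toy 3x3 block (illustrative only): the centre is a dip in every direction.
toy_block = [[10, 10, 10],
             [10, 5, 10],
             [10, 10, 10]]
```

Because only one direction is signalled per CTB, a block containing edges in several directions is classified well along at most one of them, which is the limitation the paper targets.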


Probability-Based Mode Decision Algorithm for Scalable Video Coding

Yu-Che Wey1, Mei-Juan Chen1 and Chia-Hung Yeh2

1National Dong Hwa University, 2National Sun Yat-sen University

To reduce the computational complexity of the encoding process in Scalable Video Coding, we utilize the information of the motion vector predictor (MVP) and the number of non-zero coefficients (NZC) to propose a fast mode decision algorithm. Probability models of the motion vector predictor and the number of non-zero coefficients are built to predict the partition mode in the enhancement layer. In addition, the search range of motion estimation is adaptively adjusted to further reduce computational complexity. Experimental results show that the proposed algorithm can reduce coding time by up to 76% on average and provides greater time savings and better performance than previous work.


OS.11-SLA.3: Recent Challenges and Applications on Active Noise Control

An Adaptive Signal Processing System for Active Control of Sound in Remote Locations

Iman Tabatabaei Ardekani* and Waleed H. Abdulla

*Unitec Institute of Technology, The University of Auckland

Adaptive active noise control (ANC) systems with a single-channel structure create a silent point at the location of an error microphone. This silent point creates a zone of quiet around itself; however, this zone has small dimensions. Moreover, the error microphone has to be located within the zone of quiet, limiting the effective space available in it. This paper develops a new adaptive signal processing system for ANC at a remote point, located far from the error microphone. Accordingly, the effective space of the zone of quiet can be extended. It is shown that an optimal controller for the creation of a remote silent point can be constructed by the series combination of a digital controller, called the remote controller, and an adaptive digital filter. The transfer function of the remote controller is derived based on the system model in the acoustic domain. A new update equation for the automatic adjustment of the adaptive digital filter is proposed. The validity of the results is discussed through different computer simulations.


Adaptive Active Noise Control in Free Space

Iman Tabatabaei Ardekani* and Waleed H. Abdulla

*Unitec Institute of Technology, The University of Auckland

This paper develops a reliable methodology for active control of acoustic noise in free space. The core of this paper consists of a root locus analysis on the adaptation process performed in active noise control. Based on this analysis, a novel active noise control algorithm is derived. This algorithm is fully implemented in an experimental setup. Different experiments with this setup show that the traditional active noise control algorithm is not stable when the setup is used in free space. However, the proposed algorithm is stable and converges at a high convergence rate until the noise level is reduced by 15 dB in steady-state conditions.


An Active Unpleasantness Control System for Indoor Noise Based on Auditory Masking

Daisuke Ikefuji, Masato Nakayama, Takanobu Nishiura and Yoichi Yamashita

Ritsumeikan University

Noise reduction methods have been proposed for various loud noises. However, we often perceive an unpleasant feeling even when a small noise arises in a quiet indoor environment. To overcome this problem, we focus on auditory masking, a phenomenon in which humans cannot hear a target noise because of a composite sound. In this paper, we propose an unpleasantness reduction method based on auditory masking for indoor noise. In the proposed method, we discuss how to design the artificial sound for noise control. Furthermore, we develop an active unpleasantness reduction system based on the proposed method. Evaluation experiments confirmed the effectiveness of the developed system.


Splitting Frequency Components of Error Signal in Narrowband Active Noise Control System Design

Jen-Yu Li1, Sen M. Kuo2 and Cheng-Yuan Chang3

1ITRI, 2Northern Illinois University, 3Chung Yuan Christian University

A narrowband active noise control (NANC) structure is often applied to reduce undesired noise when multiple tones have close frequencies. A parallel ANC structure is proposed for separating undesired harmonics into a series of adaptive filters. Since the input signal of each adaptive filter has only one harmonic component, a second-order finite impulse response (FIR) filter is sufficient for processing the undesired noise. Therefore, the presented parallel structure for NANC system is greatly simplified. Based on the input signal frequency, a bank of bandpass filters splits the frequency components of the error signal to update the corresponding adaptive filters in the parallel NANC structure, each of which is computationally very simple and unaffected by time delay. A new performance index for a parallel NANC is also proposed. The increased convergence speed obtained by the proposed method is first confirmed by theoretical analysis. Computer simulations are then performed to confirm the enhancement.


Feasibility Study of Voice Shutter

Masaharu Nishimura

Tottori University

Telephone voice is annoying in public spaces like trains and cafés. An attempt was made to attenuate the speaking voice using an active noise control technique called Voice Shutter. Two types of voice shutter, an open type and a closed type, are proposed in this paper. A closed type voice shutter controlled by a feedforward method was manufactured and tested. As a result, the voice shutter proved to be feasible, although at this stage it was too large for practical use.


Head-Mounted Active Noise Control System to Achieve Speech Communication

Nobuhiro Miyazaki, Kohei Yamakawa and Yoshinobu Kajikawa

Kansai University

In this paper, we propose a head-mounted active noise control (ANC) system with speech communication. Magnetic resonance imaging (MRI) devices are a type of medical equipment and have recently been utilized for microwave coagulation therapy. However, an MRI device generates serious noise (over 100 dB SPL). Hence, surgeons and other medical staff are exposed to loud MR noise for many hours and cannot verbally communicate with each other. We have therefore proposed a head-mounted ANC system for reducing MR noise and realizing verbal communication in such a loud noise environment. However, an MRI device is generally controlled by an operator outside the MRI room. Hence, speech communication between the inside and outside of the room is needed. We therefore integrate a speech communication function into the head-mounted ANC system. Concretely, the error microphones and secondary loudspeakers are also used as an interface to realize the speech communication. In this case, the outside voice may be returned through the error microphone, so an audio-integrated ANC system based on echo cancellation is utilized. A linear prediction filter is also utilized for separating the inside voice from residual noise. In this paper, we demonstrate the validity of the proposed ANC system through noise reduction experiments and subjective assessment tests on phoneme articulation.


Step Size Bound for Narrowband Feedback Active Noise Control

Liang Wang1, Woon-Seng Gan2, Yong-Kim Chong2, Sen M. Kuo3

1DTS Inc, 2Nanyang Technological University, 3Northern Illinois University

This paper presents the derivation of a convergence analysis of feedback active noise control under perfect and imperfect secondary path transfer functions. Existing analysis approaches do not include the analysis of the reference signal synthesis errors, owing to the complexity of the interrelated feedback nature. A detailed derivation of the maximum step size bound for the feedback active noise control system has been formulated for perfect and imperfect secondary-path estimation. This theoretical work has been verified by computer simulations.


OS.12-SLA.4: Spoken Term Detection

Entropy-based False Detection Filtering in Spoken Term Detection Tasks

Satoshi Natori, Yuto Furuya, Hiromitsu Nishizaki and Yoshihiro Sekiguchi

University of Yamanashi

This paper describes spoken term detection (STD) and inexistent STD (iSTD) methods using term detection entropy based on a phoneme transition network (PTN)-formed index. Our previously reported STD method uses a PTN derived from multiple automatic speech recognizers (ASRs) as an index. A PTN is almost the same as a sub-word-based confusion network, which is derived from the output of an ASR. In the previous study, our PTN was very effective in detecting query terms. However, the PTN generated many false detection errors. In this study, we focus on entropy of the PTN-formed index. Entropy is used to filter out false detection candidates in the second pass of the STD process. Our proposed method was evaluated using the Japanese standard test-set for the STD and iSTD tasks. The experimental results of the STD task showed that entropy-based filtering is effective for improving STD at a high-recall range. In addition, entropy-based filtering was also demonstrated to work well for the iSTD task.


Robust/Fast Out-of-Vocabulary Spoken Term Detection By N-gram Index with Exact Distance Through Text/Speech Input

Nagisa Sakamoto and Seiichi Nakagawa

Toyohashi University of Technology

For spoken term detection, it is very important to consider out-of-vocabulary (OOV) words. Therefore, sub-word unit based recognition and retrieval methods have been proposed. This paper describes a very fast Japanese spoken term detection system that is robust to OOV words. We used individual syllables as the sub-word unit in continuous speech recognition and an n-gram index of syllables in a recognized syllable-based lattice. We proposed an n-gram indexing/retrieval method in the syllable lattice to handle OOV words and achieve high-speed retrieval. Specifically, in this paper, we redefined the distance of the n-gram and used trigrams, bigrams and unigrams, instead of only trigrams, to calculate the exact distance. In our experiments using text and speech queries, we improved the retrieval performance.


High Priority in Highly Ranked Documents in Spoken Term Detection

Kazuma Konno, Yoshiaki Itoh, Kazunori Kojima, Masaaki Ishigame, Kazuyo Tanaka†† and Shi-wook Lee†††

Iwate Prefectural University, ††Tsukuba University, †††National Institute of Advanced Industrial Science and Technology

In spoken term detection, the retrieval of OOV (out-of-vocabulary) query terms is very important because query terms are likely to be OOV terms. To improve the retrieval performance for OOV query terms, this paper proposes a re-scoring method applied after the candidate segments are determined. Each candidate segment has a matching score and a segment number. Highly ranked candidates are usually reliable, and a user is assumed to select query terms that are specific to the target documents and appear frequently in them. We therefore give high priority to the candidate segments that are included in highly ranked documents by adjusting the matching score. We conducted performance evaluation experiments for the proposed method using the open test collections for SpokenDoc-2 in NTCIR-10. The results showed that the proposed method improved retrieval performance by more than 7.0 points for two test sets in the test collections, demonstrating its effectiveness.


Open Vocabulary Spoken Content Retrieval by front-ending with Spoken Term Detection

Tomoko Takigami and Tomoyosi Akiba

Toyohashi University of Technology

Dealing with speech recognition errors and out-of-vocabulary (OOV) words is a common and challenging problem in spoken document processing. In this work, we propose a spoken content retrieval (SCR) method that incorporates spoken term detection (STD) as the pre-processing stage. The proposed method first performs STD for each term appearing in the given query topic; the detection results are then used to calculate the relevance of the retrieved document to the topic. Front-ending with STD makes it possible to use even misrecognized and OOV words as clues in the back-end document retrieval process. We also propose a novel retrieval model especially designed for the proposed SCR method. It incorporates term co-occurrences into the conventional vector space model in order to put emphasis on reliable clues for the similarity calculation, which enables the retrieval process to work robustly for documents that include errors. The experimental results showed that the proposed SCR method improved retrieval performance when a query topic included OOV words, even though it relied on lower-accuracy syllable-based ASR results. They also showed that the proposed retrieval model significantly improved the retrieval accuracy not only for the proposed SCR method but also for conventional SCR methods.


Evaluation of the Usefulness of Spoken Term Detection in an Electronic Note-Taking Support System

Chifuyu Yonekura, Yuto Furuya, Satoshi Natori, Hiromitsu Nishizaki and Yoshihiro Sekiguchi

University of Yamanashi

The usefulness of a spoken term detection (STD) technique in an electronic note-taking support system is assessed through a subjective evaluation experiment. In this experiment, while listening to a lecture, subjects recorded electronic notes using the system. They answered questions related to the lecture while browsing the recorded notes. The response time required to correctly answer the questions was measured. When the subjects browsed the notes, half of them used the STD technique and half did not. The experimental results indicate that the subjects who used the STD technique answered all questions faster than those who did not use the STD technique. This indicates that the STD technique worked well in the electronic note-taking system.


Using Acoustic Dissimilarity Measures Based on State-level Distance Vector Representation for Improved Spoken Term Detection

Naoki Yamamoto and Atsuhiko Kai

Shizuoka University

This paper proposes a simple approach to subword-based spoken term detection (STD) which uses improved acoustic dissimilarity measures based on a distance-vector representation at the state level. Our approach assumes that both the query term and spoken documents are represented by subword units and then converted to a sequence of HMM states. The set of all distributions in the subword-based HMMs is used for generating a distance-vector representation of each state of all subword units. Each element of a distance vector corresponds to the distance between the distributions of two different states, and thus a vector represents a structural feature at the state level. The experimental results showed that the proposed method significantly outperforms the baseline method, which employs a conventional acoustic dissimilarity measure based on subword units, with very little increase in the required search time.


Speaker-Invariant and Rhythm-Sensitive Representation of Spoken Words

Nobuaki Minematsu*, Yousuke Ozaki, Keikichi Hirose*, and Donna Erickson

*The University of Tokyo, Showa University of Music

It is well known that human speech recognition (HSR) is much more robust than automatic speech recognition (ASR) [1], [2]. Given that HSR's robustness to large acoustic variability is extremely high, it is reasonable for researchers to assume that humans are able to extract invariant patterns underlying input utterances [3]. Recently in developmental psychology, it was found that infants are very sensitive to distributional properties in the sounds of a language [4], [5]. Following this finding, the first author proposed a speaker-independent or invariant speech representation of each utterance, formed by using distributional properties in the sounds of that utterance [6], [7], [8]. This representation is called speech structure and was tested in isolated word recognition experiments [7], [8]. This paper introduces another kind of sensitivity into speech structure, namely sensitivity to language rhythm. Sonority-based syllable nucleus detection is implemented and we extract local and syllable-based structures as well as conventional global and holistic structures. Isolated word recognition experiments show that the recognition performance is improved with rhythm-sensitive and local speech structures.


OS.13-SLA.5: Speech Processing

Speaker Identification Using Pseudo Pitch Synchronized Phase Information in Noisy Environments

Yuta Kawakami, Longbiao Wang, and Seiichi Nakagawa

Nagaoka University of Technology, Toyohashi University of Technology

In conventional speaker identification methods based on mel-frequency cepstral coefficients (MFCCs), phase information is ignored. Recent studies have shown that phase information contains speaker-dependent characteristics and that pitch synchronous phase information is more suitable for speaker identification. In this paper, we verify the effectiveness of pitch synchronous phase information for speaker identification in noisy environments. Experiments were conducted using the JNAS (Japanese Newspaper Article Sentence) database. The method based on pseudo pitch synchronized phase information achieved a relative speaker identification error reduction rate of 15.5% compared to conventional phase information (that is, pitch non-synchronized phase). By cutting frames with low power and combining phase information with MFCCs, a further improvement was obtained.


Single Channel Dereverberation Method in Log-Melspectral Domain Using Limited Stereo Data for Distant Speaker Identification

Aditya Arie Nugraha, Kazumasa Yamamoto, Seiichi Nakagawa

Toyohashi University of Technology, Toyota National College of Technology

In this paper, we present a feature enhancement method that uses neural networks (NNs) to map a reverberant feature in the log-melspectral domain to its corresponding anechoic feature. The mapping is done by cascade NNs trained using the Cascade2 algorithm with an implementation of segment-based normalization. We assumed that the dimensions of the feature were independent of each other and experimented with several assumptions about the room transfer function for each dimension. A speaker identification system was used to evaluate the method. Using limited stereo data, we could improve the identification rate for simulated and real datasets. On the simulated dataset, we could show that the proposed method is effective for both noiseless and noisy reverberant environments, with various noise and reverberation characteristics. On the real dataset, we could show that by using a configuration of 6 independent NNs for a 24-dimensional feature and only 1 pair of utterances, we could get a 35% average error reduction relative to the baseline, which employed cepstral mean normalization (CMN).


Spoken Document Retrieval Using Both Word-Based and Syllable-Based Document Spaces with Latent Semantic Indexing

Ken Ichikawa, Satoru Tsuge, Norihide Kitaoka, Kazuya Takeda and Kenji Kita

Nagoya University, Daido University, The University of Tokushima

In this paper, we propose a spoken document retrieval method using vector space models in multiple document spaces. First we construct multiple document vector spaces, one of which is based on continuous-word speech recognition results and the other on continuous-syllable speech recognition results. Query expansion is also applied to the word-based document space. We propose to apply latent semantic indexing (LSI) not only to the word-based space but also to the syllable-based space, to reduce the dimensionality of the spaces using implicitly defined semantics. Finally, we combine the distances and compare the distance between the query and the available documents in various spaces to rank the documents. In this procedure, we propose to model the document by a hyperplane. To evaluate our proposed method, we conducted spoken document retrieval experiments using the NTCIR-9 SpokenDoc data set. The results showed that using the combination of the distances, and using LSI on the syllable-based document space, improved retrieval performance.


A Prior Knowledge-based Noise Reduction Method with Dual Microphones

Hao Chen, Chang-chun Bao and Bing-yin Xia

Beijing University of Technology

In this paper, a noise reduction method with dual microphones, based on the prior knowledge, is proposed to reduce the residual noise especially in the period of target speech absence (TSA). First, two cases, i.e. target speech presence and target speech absence were modeled by Gaussian mixture model (GMM), respectively. Then, we calculated the frame-based target speech present probability (TSPP) using Bayesian classification. Finally, a mask filter was presented by modifying the gain function of the improved phase-error based filter (IPBF) method using TSPP. Simulation results show that the proposed method outperforms the reference methods and could reduce noise effectively, particularly in the period of TSA.


Vocal Tract Length Estimation for Voiced and Whispered Speech Using Gammachirp Filterbank

Toshio Irino, Erika Okamoto, Ryuichi Nisimura, and Hideki Kawahara

Wakayama University

In this paper, we demonstrate an auditory spectrogram based on a dynamic compressive gammachirp filterbank (GCFB) that enables accurate and robust estimation of vocal tract length (VTL) for both voiced and whispered speech. Normalized VTLs of 21 speakers were derived by least-squares analysis of their VTL ratios (for all permutations, 420 = 21P2), which were estimated by minimizing spectral distances in the auditory spectrograms. The frequency range used in the calculation was selected, and the range between 500 and 5000 Hz was the most reasonable for both speech modes. The method based on GCFB was better than that based on the mel-frequency filterbank (MFFB). The estimated VTLs were compared with VTL data measured by MRI to confirm the reliability.


Suitable Spatial Resolution at Frequency Bands Based on Variances of Phase Differences for Real-Time Talker Localization

Kohei Hayashida, Masato Nakayama, Takanobu Nishiura and Yoichi Yamashita

Ritsumeikan University

Conventional near-field talker localization methods with a microphone array calculate the spatial spectrum at each scanning position of the discretized space and at each frequency. Hence, the elapsed time increases and real-time processing is difficult. Real-time processing is important for achieving natural interaction with speech interfaces. To overcome this problem, we propose a new talker localization method based on Multi-resolution Scanning in Frequency Domain (MSFD). MSFD utilizes lower spatial resolution in the lower frequency band and higher spatial resolution in the higher frequency band to reduce the elapsed time. We also propose a calculation method for a suitable spatial resolution at each frequency on the basis of the variances of phase differences among microphones. The results of an evaluation experiment indicated that the proposed MSFD could reduce the elapsed time without degrading the localization accuracy.


OS.14-SLA.6: Language Processing

Visualization of Mandarin Articulation by using a Physiological Articulatory Model

Dian Huang*, Xiyu Wu, Jianguo Wei*, Hongcui Wang*, Chan Song*, Qingzhi Hou*, and Jianwu Dang*

*Tianjin University, Japan Advanced Institute of Science and Technology

It is difficult for language learners to produce unfamiliar speech sounds accurately because they may not be able to manipulate articulatory movements precisely by auditory feedback alone. Visual feedback can help identify errors and promote learning progress, especially in the language learning and speech rehabilitation fields. In this paper, we propose a visualization method for Mandarin phoneme pronunciation using a three-dimensional (3D) articulatory physiological model driven by Chinese Electromagnetic Articulographic (EMA) data. A mapping from EMA data to the physiological articulatory model was constructed using three points on the mid-sagittal plane of the tongue. To do so, we analyzed configurations of 30 Chinese phonemes based on an EMA database. At the same time, we designed nearly 150,000 muscle activation patterns and applied them to the physiological model to generate model-based articulatory movements. As a result, we developed a visualized articulation system with 2.5-dimensional and 3D views. The mapping was evaluated using MRI data. The mean deviation was found to be about 0.21 cm for seven vowels.


Categorical Rating of Narrowband Mandarin Speech Quality

Kuan-Lang Huang and Tai-Shih Chi

National Chiao Tung University

Speech quality is postulated to consist of several perceptual attributes. Psychoacoustic experiments for Mandarin monosyllables were designed and conducted to investigate the relations between five abstract attributes, including intelligibility, clarity, naturalness, continuity and noise intrusiveness, and perceived integral speech quality. Experimental results demonstrate a good speech quality estimate can be obtained using a simple multivariate linear regression method. The linear regression analysis shows clarity has the most impact on speech quality, while intelligibility contributes little in the subjective assessment. These findings could be used to develop categorical-rating based objective speech quality measures in the future.


Progressive Language Model Adaptation for Disaster Broadcasting with Closed-captions

Takahiro Oku, Yuya Fujita, Akio Kobayashi and Shoei Sato

NHK (Nippon Hoso Kyokai; Japan Broadcasting Corporation)

This paper describes a progressive language model (LM) adaptation method for transcribing broadcast news in a sudden event such as a massive earthquake. In a practical automatic speech recognition (ASR) system, a new event whose linguistic contexts are not covered by the LM often causes serious degradation of performance. Furthermore, there might not be enough training text for conventional LM adaptation such as linear interpolation. We therefore propose a new LM adaptation method that uses ASR transcriptions as unsupervised training texts in addition to the online manuscripts written by reporters. The proposed method employs a progressive update procedure, which adapts LMs in an unsupervised manner by using every set of transcriptions in a short period, so that the adapted model can be used immediately. The method also uses the online manuscripts in order to adapt the LM and add new words to the vocabulary. Experimental results showed that the proposed progressive LM adaptation method achieved a relative word error rate reduction of 8.2% compared with the conventional LM adaptation method using the online manuscripts only.


A New Method for the Objective Perceptual Measurement of Chinese Initials

Sai Chen1, Hongcui Wang1,*, Jia Jia2 and Jianwu Dang1,3

1Tianjin University, 2Tsinghua University, 3Japan Advanced Institute of Science and Technology

Much work has been done on methods for the perceptual measurement of speech sounds. However, most of these methods rely on subjective measurement alone. In this paper, we present a new method of objective perception measurement for Chinese initials and investigate the relationship between the acoustic features and the perception measurement. To do so, we discuss which acoustic features and combinations of them are the most consistent with the real perception of Chinese initials. We propose a method to obtain an objective perception measure based on the acoustic features, where the acoustic distance has a monotonic relation with the perceptual distance for Chinese initials. Spearman's rank correlation coefficient is enhanced from 0.6328 to 0.6498 by adding time-domain features to the feature vector of each initial. Finally, we propose a new formula to measure the perceptual distance between different types of initials objectively by using the chosen acoustic features.


Constructing A Three-Dimension Physiological Vowel Space of The Mandarin Language Using Electromagnetic Articulography

Jian Sun*, Nan Yan* and Lan Wang*

*Chinese Academy of Sciences, The Chinese University of Hong Kong

The spatial relations of vowels are traditionally depicted by using an acoustic quadrilateral. However, the accuracy of the vowel chart has been controversial. In the present study, the lingual movements of different Mandarin Chinese vowels were investigated using electromagnetic articulography (EMA, AG501), and the physiological equivalent of Chinese vowels was then developed and compared to the traditional acoustic vowel quadrilateral. Six native Mandarin speakers repeated each of the six vowels (/a/, /o/, / /, /i/, /u/ and /y/) four times. The key region of each vowel, which was characterized by the x, y, z tongue position at the intermediate temporal point of vowel pronunciation, was extracted. The tongue movement distance was calculated between the key region of each vowel and the static tongue position. A clustering method was used to find the centroid of the distances for each vowel. It was found that there are considerable differences between the actual lingual positions and the acoustic quadrilateral's relative position depictions in the pronunciation of vowels like /u/. The results indicated that the acoustic quadrilateral was insufficient to describe the lingual movement information of vowel pronunciation.


Morphological Normalization: A Study of Vowels for Mandarin and Japanese

Hong Liu1, Jianguo Wei1,*, Wenhuan Lu2, Qiang Fang3, Liang Ma4 and Jianwu Dang1,5

1Tianjin University, 2Tianjin University, 3Chinese Academy of Social Sciences, 4Fudan University, 5Japan Advanced Institute of Science and Technology

Reducing the morphological variances of the vocal tract across different subjects would benefit articulatory data analysis and modeling. To further test this hypothesis with a thin-plate spline (TPS) warping method, this study used articulatory data of /a, i, u/ from 3 Chinese subjects and 3 Japanese subjects, which were collected by an Electromagnetic Midsagittal Articulographic (EMMA) system. The templates for the normalization of Chinese and Japanese were obtained by averaging the 3 subjects' palates and tongue shapes in each group. The 44 landmarks in each template were then defined by a gridline system of the vocal tract. The results show that the variances among subjects were reduced in both the horizontal and vertical directions. The similar vowel structures between pre- and post-normalization data indicate that the TPS method outperforms the traditional palate-straighten method in that the TPS method reduces mid-sagittal morphological differences among speakers while keeping their individual vowel structures unchanged. The comparison results show that the articulatory differences among the three vowels are consistent with their corresponding acoustic properties.


Predicting Gradation of L2 English Mispronunciations Using ASR with Extended Recognition Network

Hao Wang, Helen Meng and Xiaojun Qian

The Chinese University of Hong Kong

A CAPT system can be pedagogically improved by giving effective feedback according to the severity of mispronunciations. We obtained perceptual gradations of L2 English mispronunciations through crowdsourcing, conducted quality control to filter for reliable ratings and proposed approaches to predict the gradation of word-level mispronunciations. This paper presents our work on improving the previous prediction approach, using ASR with an extended recognition network, to address two of its limitations: 1. it does not work for mispronounced words whose transcriptions are not immediately available; 2. words that are articulated in perceptually different ways but share the same transcription receive the same predicted gradation.



OS.15-SPS.2: Signal Processing Systems

Intelligibility Comparison of Speech Annotation under Wind Noise in AAR Systems for Pedestrians and Cyclists using Two Output Devices

Masanori Miura, Hideto Watanabe, Kou Kawai and Kazuhiro Kondo

Yamagata University

Since visual navigation systems using smart phones show information on small screens, user attention is likely to be distracted. Therefore, we are considering a portable navigation system using Augmented Audio Reality. However, if such equipment is used outdoors, very loud wind noise is recorded with the environmental sound. Thus, to reduce this wind noise, we used a wind screen (ear muffs) and applied signal processing for wind noise reduction. In this paper, we compared a noise-controlled binaural earphone and a bone conduction headphone. It was shown that iterative Wiener filtering improves the speech intelligibility score dramatically with the former device.


A Local Minimum Stagnation Avoidance in Design of CSD Coefficient FIR Filters by Adding Gaussian Function

Kazuki Saito and Kenji Suyama

Tokyo Denki University

In this paper, a method for designing FIR (Finite Impulse Response) filters with CSD (Canonic Signed Digit) coefficients using PSO (Particle Swarm Optimization) is proposed. In such a design problem, there are a large number of local minima. The updating procedure of normal PSO tends to stagnate around such a local minimum and thus exhibits premature convergence. Therefore, a new method for avoiding such a situation is proposed, in which a Gaussian function is added as a penalty to the evaluation function. Several design examples are shown to demonstrate the effectiveness of the proposed method.
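
The penalty idea in this abstract can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' formulation: the function names, the `amplitude` and `width` parameters, and the toy quadratic cost are all hypothetical. A Gaussian bump is centred on each position where the swarm previously stagnated, so the evaluation function is raised near those points and particles are pushed away from them.

```python
import math

def gaussian_penalty(x, centers, amplitude=1.0, width=0.5):
    """Sum of Gaussian bumps centred on previously stagnated positions.

    `amplitude` and `width` are illustrative tuning parameters (assumed).
    """
    total = 0.0
    for c in centers:
        dist2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        total += amplitude * math.exp(-dist2 / (2.0 * width ** 2))
    return total

def penalized_cost(cost, x, centers, amplitude=1.0, width=0.5):
    """Original evaluation function plus the Gaussian penalty term."""
    return cost(x) + gaussian_penalty(x, centers, amplitude, width)

def toy_cost(x):
    """Stand-in for a filter design error (assumed for illustration)."""
    return sum(xi ** 2 for xi in x)
```

At a stagnation point the penalty equals `amplitude`, and it decays to nearly zero a few widths away, so the rest of the search space is left unchanged.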


An Avoidance of Premature Convergence in IIR Filter Design using PSO


Tokyo Denki University

In this paper, a design method for Infinite Impulse Response (IIR) filters using Particle Swarm Optimization (PSO) is developed. Because PSO has a strong directivity toward a local minimum, the PSO updating tends to stagnate around such a local minimum and thus exhibits premature convergence. Recently, the Asynchronous Digenetic PSO with Nonlinear dissipative term (N-AD-PSO) has been proposed for a diverse search. It can be expected that the stagnation can be avoided by the N-AD-PSO. In this study, the N-AD-PSO is adapted for IIR filter design. Several examples are shown to demonstrate the effectiveness of the proposed method.


A Hardware-efficient Variable-length FFT Processor for Low-power Applications

Yifan Bo, Renfeng Dou, Jun Han and Xiaoyang Zeng

Fudan University

The fast Fourier transform (FFT) is a key operation in digital signal processing (DSP) systems and has been studied intensively to improve its performance. Nowadays, embedded DSP systems require low energy consumption to prolong the life cycle, which imposes a stringent power limitation on FFT processing. Meanwhile, a sufficient signal-to-quantization-noise ratio (SQNR) is a basic requirement in these systems. In this paper, a modified data scaling scheme as well as a trounding method is employed to improve the SQNR performance. Therefore, the word length can be reduced and energy is saved accordingly. A memory-based architecture is chosen to support variable-length FFT processing. Also, a constant multiplier array is introduced in the datapath to reduce the power dissipation with a slight increase in area. The proposed processor can perform 64- to 8192-point FFT processing. The core area is 2.29 mm² and the power consumption is 67.9 mW at 100 MHz. In addition, SQNRs of 55.4 dB and 33.3 dB are achieved for 64-point and 8192-point FFTs, respectively.


Keypoints of Interest Based on Spatio-temporal Feature and MRF for Cloud Recognition System

Takahiro Suzuki and Takeshi Ikenaga

Waseda University

Keypoint extraction has lately attracted attention in computer vision. In particular, the Scale-Invariant Feature Transform (SIFT) is invariant to scale, rotation and illumination change. In addition, recent advances in machine learning make it possible to recognize many objects by learning a large number of keypoints. Recently, cloud systems have started to be used to maintain large-scale databases of learned keypoints. Network devices have to access these systems and match keypoints. Keypoint extraction algorithms utilize only spatial information. Thus, many keypoints that are unnecessary for recognition are detected. If only Keypoints of Interest (KOI) are extracted from input images, descriptor data is reduced and high-precision recognition is achieved. This paper proposes an algorithm that selects keypoints from many keypoints, including unnecessary ones, based on spatio-temporal features and a Markov Random Field (MRF). The algorithm calculates a weight for each keypoint using 3 kinds of features (intensity gradient, optical flow and previous state) and reduces noise by comparing with the states of surrounding keypoints. The states of keypoints are connected by using the distance between keypoints. Evaluation results show a 90% reduction of keypoints compared with conventional keypoint extraction, at the cost of a small increase in computational complexity.


A Low Complexity Multi-view Video Encoder Exploiting B-frame Characteristics

Yuan-Hsiang Miao1, Guo-An Jian1, Li-Ching Wang2, Jui-Sheng Lee1, and Jiun-In Guo1

1National Chiao Tung University, 2National Chung Cheng University

This paper proposes a low complexity multi-view video encoder which includes mode decision and early termination based on B-frame characteristics. According to the statistics of coding mode distribution in different B-frame types, we classify all the coding modes into several classes and propose an early-terminated mode decision algorithm that can largely reduce the computing complexity. In addition, an MVD-based adaptive search range scheme is included in the proposed encoding strategy. In our experimental results, the encoding time is reduced by up to 91% to 93%, while the quality loss is kept within a 0.1 dB PSNR drop.



OS.16-SIPTM.2: Advanced Topics in Noise Reduction and Related Techniques for Signal Processing Applications

Noise Removing for Time-Variant Vocal Signal by Generalized Modulation

Jian-Jiun Ding and Ching-Hua Lee

National Taiwan University

Since the instantaneous frequencies of vocal signals always vary with time, it is inconvenient to use the conventional filter to remove the noise of vocal signals. In this paper, we propose a method that uses the generalized modulation to reshape and minimize the areas of the spectrograms of vocal signals. Instead of multiplying an exponential function with the first order phase, the generalized modulation is to multiply an exponential function whose phase is a higher order polynomial. With the proposed noise-removing algorithm based on generalized modulation, the signal part and the noise part of a vocal signal can be well separated and the effect of noise can be significantly reduced.


A New Lattice-Based Adaptive Notch Filtering Algorithm with Improved Mean Update Term

Shinichiro Nakamura, Shunsuke Koshita, Masahide Abe and Masayuki Kawamata

Tohoku University

In this paper, we propose a new lattice-based adaptive notch filtering algorithm which has faster convergence characteristics than Regalia's Simplified Lattice Algorithm (SLA). Our algorithm makes use of the weighted sum of SLA and the Lattice Gradient Algorithm. We prove that the mean update term of our algorithm is larger than that of SLA when the input signal consists of a single sinusoid and a background white noise. Furthermore, our algorithm does not change the local convergence characteristics near the sinusoidal frequency. Consequently, the proposed algorithm achieves faster convergence than SLA. A simulation result shows that the proposed algorithm finds the sinusoidal frequency faster than SLA.


Design of FIR Fractional Delay Filter Based on Maximum Signal-to-Noise Ratio Criterion

Chien-Cheng Tseng* and Su-Ling Lee

*National Kaohsiung First University of Science and Technology, Chang-Jung Christian University

In this paper, a new approach to the design of digital FIR fractional delay filters with consideration of noise attenuation is presented. The design is based on the maximization of the signal-to-noise ratio (SNR) at the output of the fractional delay filter under the constraint that the actual frequency response and the ideal response have several identical derivatives at the prescribed frequency point. The optimal filter coefficients are obtained from the generalized eigenvector associated with the maximum eigenvalue of a pair of matrices. Numerical examples show that the proposed method provides a higher SNR than conventional FIR fractional delay filter design methods that do not consider noise attenuation.


Fixed Order Implementation of Kernel RLS-DCD Adaptive Filters

Kiyoshi Nishikawa*, Yoshiki Ogawa*, Felix Albu

*Tokyo Metropolitan University, Valahia University of Targoviste

In this paper, we propose an efficient structure for kernel recursive least squares (KRLS) adaptive filters that can be implemented with a low and fixed amount of computational complexity. The concept of kernel adaptive filters is derived by applying the kernel method to linear adaptive filters to achieve autonomous learning of non-linear environments. It is expected to provide better noise reduction performance in non-linear environments than conventional linear adaptive filters. One of the problems of kernel adaptive filters is the required amount of calculation, which, unlike in the linear case, increases as the adaptation time advances. In this paper, we propose an efficient implementation method of the KRLS dichotomous coordinate descent (DCD) adaptive algorithm. The proposed method enables implementation with a constant amount of computation by fixing the order of the filter and the dictionary, while maintaining the fast rate of convergence of the KRLS-DCD algorithm. The effectiveness of the proposed method is confirmed by computer simulations.


Multi-Touch Points Detection for Capacitive Touch Panel

Chien-Hsien Wu, Ronald Y. Chang, and Wei-Ho Chung

Academia Sinica

We consider the detection of multiple touch points for capacitive touch panel systems under Gaussian noise. We propose an algorithm that reduces the noise-induced detection error and improves the detection accuracy with partial touch signal information. The proposed algorithm is based on the likelihood ratio test, and utilizes touch signal features, such as the local maximum, the range of touch magnitude, and the consecutive occurrence of touch locations, to first detect touch points and then calculate the real touch coordinates based on a weighted average technique. Simulation demonstrates the improved performance of the proposed algorithm.


Wireless Packet Collision Detection Using Self-Interference Canceller

Wataru Kawata, Kazunori Hayashi, Megumi Kaneko, Takumi Shimamoto, and Hideaki Sakai

Kyoto University

The paper considers wireless collision detection at the transmitter for wireless local area network systems. One of the major challenges in achieving such collision detection is the very high power of the self-interference, which decreases the effective quantization level for the collision packet to be detected. In order to cope with the interference, we employ a self-interference canceller for each receiving antenna and combine the outputs of the cancellers with maximal ratio combining. Moreover, we utilize the correlation between the combiner output and the preamble of the packet to detect the existence of collision packets. Performance of the proposed scheme is demonstrated via computer simulations.


Modulation Spectrum Power-law Expansion for Robust Speech Recognition

Hao-Teng Fan, Zi-Hao Ye and Jeih-weih Hung

National Chi Nan University

In this paper, we present a novel approach to enhancing the speech features in the modulation spectrum for better recognition performance in noise-corrupted environments. In the presented approach, termed modulation spectrum power-law expansion (MSPLE), the speech feature temporal stream is first pre-processed by some statistics compensation technique, such as mean and variance normalization (MVN), cepstral gain normalization (CGN) and MVN plus ARMA filtering (MVA), and then the magnitude part of the modulation spectrum (Fourier transform) for the feature stream is raised to a power (exponentiated). We find that MSPLE can highlight the speech components and reduce the noise distortion existing in the statistics-compensated speech features. With the Aurora-2 digit database task, experimental results reveal that the above process can consistently achieve very promising recognition accuracy under a wide range of noise-corrupted environments. MSPLE operated on MVN-preprocessed features brings about a 55% error rate reduction relative to the MFCC baseline and significantly outperforms MVN alone. Furthermore, performing MSPLE on the lower sub-band modulation spectra gives results very close to those from the full-band modulation spectra updated by MSPLE, indicating that a less complicated MSPLE suffices to produce noise-robust speech features.
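
The two processing steps described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes MVN as the compensation step, assumes the phase of the modulation spectrum is left unchanged, and uses a plain O(n²) DFT from the standard library; the function names and the `exponent` parameter are hypothetical.

```python
import cmath
import math

def mvn(stream):
    """Mean and variance normalization of one feature stream."""
    n = len(stream)
    mean = sum(stream) / n
    var = sum((v - mean) ** 2 for v in stream) / n
    std = math.sqrt(var) if var > 0 else 1.0
    return [(v - mean) / std for v in stream]

def dft(x):
    """Direct discrete Fourier transform (slow but dependency-free)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def msple(stream, exponent):
    """MVN, then raise the modulation spectrum magnitude to `exponent`.

    Keeping the original phase is an assumption made for this sketch.
    """
    spectrum = dft(mvn(stream))
    expanded = [cmath.rect(abs(c) ** exponent, cmath.phase(c)) for c in spectrum]
    # For a real input the result is real up to rounding error.
    return [v.real for v in idft(expanded)]
```

With `exponent = 1` the function reduces to plain MVN, which gives a simple sanity check; the abstract does not state which exponent values the authors use.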



OS.17-IVM.6: Object Modeling and Recognition

A Novel Direct Feature-Based Seizure Detector: Using the Entropy of Degree Distribution of Epileptic EEG Signals

Qingfang Meng, Fenglin Wang, Weidong Zhou and Shanshan Chen

University of Jinan, Shandong University

The electroencephalogram (EEG) signals with different brain states show different nonlinear dynamics. Recently the statistical properties of complex networks theory have been applied to explore the nonlinear dynamics of time series, which studies the dynamics of time series via its organization. This study combines the complex networks theory with epileptic EEG analysis and applies the statistical properties of complex networks to the automatic epileptic EEG detection. We construct the complex networks from the epileptic EEG series and then calculate the entropy of the degree distribution of the network (NDDE). The NDDE corresponding to the ictal EEG is lower than interictal EEG’s. The experiment result shows that the approach using the NDDE as a classification feature obtains robust performance of epileptic seizure detection and the accuracy is up to 95.75%.
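
The feature named in this abstract, the entropy of a network's degree distribution, can be sketched as below. The abstract does not say how the network is built from the EEG series, so this sketch starts from an edge list; the function names and the use of the natural logarithm are assumptions.

```python
import math
from collections import Counter

def degrees_from_edges(num_nodes, edges):
    """Degree of each node in an undirected graph given as (u, v) pairs."""
    deg = [0] * num_nodes
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def degree_distribution_entropy(degrees):
    """Shannon entropy (in nats) of the empirical degree distribution."""
    n = len(degrees)
    counts = Counter(degrees)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

A network in which every node has the same degree gives zero entropy, while a wider spread of degrees gives a larger value.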


Human Upper Body Posture Recognition and Upper Limbs Motion Parameters Estimation

Jun-Yang Huang1, Shih-Chung Hsu1 and Chung-Lin Huang1,2

1National Tsing-Hua University, 2Asia University

We propose a real-time human motion capturing system to estimate the upper body motion parameters, consisting of the positions of upper limb joints, based on depth images captured using Kinect. The system consists of an action type classifier and body part classifiers. For each action type, we have a body part classifier which segments the depth map into 16 different body parts, the centroids of which can be linked to represent the human body skeleton. Finally, we exploit the temporal relationship of each body part to correct the occlusion problem and determine the depth information of the occluded body parts. In the experiments, we show that by using Kinect our system can effectively estimate upper limb motion parameters of a human subject in real time.


Low Cost Illumination Invariant Face Recognition by Down-Up Sampling Self Quotient Image

Li Lin and Ching-Te Chiu

National Tsing Hua University

Illumination variation generally causes performance degradation of face recognition systems in real-life environments. The Self Quotient Image (SQI) method [1] was proposed to remove extrinsic lighting effects but requires high computational complexity. Therefore, we propose a low cost face recognition scheme that uses multi-scale down-up sampling to generate the self quotient image (DUSSQI) and remove the lighting effects. The DUSSQI has the following advantages: (1) it removes lighting artifacts effectively; (2) it extracts different face details including texture and edges; (3) only global operations on pixels are required, which reduces computational cost. Experimental results demonstrate that our proposed approach achieves a 98.58% recognition rate for the extended YaleB database and 93.8% for the FERET database under various lighting conditions, and reduces computational time by 97.1% compared to SQI.


Image recognition based on hidden Markov eigen-image models using variational Bayesian method

Kei Sawada, Kei Hashimoto, Yoshihiko Nankaku and Keiichi Tokuda

Nagoya Institute of Technology

An image recognition method based on hidden Markov eigen-image models (HMEMs) using the variational Bayesian method is proposed and experimentally evaluated. HMEMs have been proposed as a model with two advantageous properties: linear feature extraction based on statistical analysis and size-and-location-invariant image recognition. In many image recognition tasks, it is difficult to use sufficient training data, and complex models such as HMEMs suffer from the over-fitting problem. This study aims to accurately estimate HMEMs using the Bayesian criterion, which attains high generalization ability by using prior information and marginalization of model parameters. Face recognition experiments showed that the proposed method improves recognition performance.


Point-in-Polygon Tests by Determining Grid Center Points in Advance

Jing Li and Wencheng Wang

Chinese Academy of Sciences

This paper presents a new method for point-in-polygon tests via uniform grids. It consists of two stages: the first constructs a uniform grid and determines the inclusion property of every grid center point, and the second determines the inclusion property of a query point from the center point of the grid cell in which it is located. In this way, the simple ray crossing method can be used locally to perform point-in-polygon tests quickly. When O(Ne) grid cells are constructed, as used in many grid-based methods, our new method can considerably reduce the storage requirement and the time spent on grid construction and on determining the grid center points in the first stage, where Ne is the number of polygon edges. As for answering a query point at the second stage, the expected time complexity is O(1) in general. Experiments show that our new method can be several times faster than existing methods at the first stage and an order of magnitude faster at the second stage. Benefiting from its high speed and lower storage requirement, our new method is very suitable for treating large-scale polygons and dynamic cases, as shown in our experiments.
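
For reference, the simple ray crossing method that this abstract builds on can be sketched as follows. This is the standard global version, which casts a horizontal ray from the query point and counts edge crossings; it is not the authors' grid-accelerated method, and the function name is an assumption. Points lying exactly on the boundary are not treated specially here.

```python
def point_in_polygon(point, polygon):
    """Ray crossing test: True if `point` is strictly inside `polygon`.

    `polygon` is a list of (x, y) vertices in order (either orientation).
    A ray is cast in the +x direction; an odd crossing count means inside.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # The edge crosses the horizontal line through the query point.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside
```

This version visits every edge, so each query costs O(Ne); according to the abstract, the proposed method applies the same crossing idea locally between a query point and its grid cell center to reach O(1) expected query time.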



OS.18-IVM.7: Advanced Image and Video Analysis

Indoor-Outdoor Image Classification using Mid-Level Cues

Yang Liu and Xueqing Li

Shandong University, Shandong University

Classifying an image into the indoor or outdoor category is very difficult due to the vast range of variations in both of these scene categories. Most previous indoor-outdoor classification approaches utilize simple statistics of low-level features, such as colors, edges and textures. In this paper, we incorporate mid-level information to obtain a superior scene description. We hypothesize that pixel-based low-level descriptions are useful but can be improved with the introduction of mid-level region information. Experiments show that using mid-level features alone produces results comparable with those obtained using low-level features. When mid-level features are combined with low-level features, the classification result improves.


A Power-efficient Cloud-based Compressive Sensing Video Communication System

Mengsi Wang, Song Xiao, Lei Quan and Qunwei Li

Xidian University

Mobile devices performing video coding and streaming over wireless communication networks are limited in energy supply. In this paper, a power-efficient, fast, cloud-based Compressive Sensing (CS) video communication system framework is proposed, which shifts the heavy computational burden from mobile devices to the cloud so that the operational lifetime of the mobile devices can be prolonged. Firstly, a CS-based video encoder with partial-bidirectional motion inter-frame prediction is proposed to greatly reduce the complexity of video encoding, and then a BER-based adaptive CSEC scheme is proposed to combat data loss in wireless networks. Meanwhile, a cloud platform, which has strong computation and processing capability, is employed to solve the problem of high complexity resulting from CS reconstruction. Simulation results show that the energy consumption of the proposed system can be substantially reduced. For the same reconstructed video quality (SSIM=0.9), the proposed scheme consumes only about 15% of the power of state-of-the-art H.264 encoding with a conventional FEC-based transmission scheme.


Using Temporal-Domain Peak Interval Determination for Video-based Short-Term Heart Rate Measurement

Yu-Shan Wu, Gwo-Hwa Ju, Ting-Wei Lee, Heng-Sung Liu and Yen-Lin Chiu

Chunghwa Telecom Co., Ltd

Video-based heart rate measurement has drawn a lot of attention in recent years. A few papers have been published on this topic, and most of them applied a Fourier transform peak interval determination method for heart rate measurement. However, the analyzing duration for one outcome in these methods is usually longer than 15 seconds. This is because the measurement resolution of the Fourier transform is proportional to the number of samples. In this paper, we apply a novel method that combines a temporal-domain peak determination method and a super-resolution method for heart rate measurement. We further propose a spectrum selection scheme and a data shift scheme to raise the measurement accuracy. A video database recorded in our lab is used to evaluate the performance of the proposed method. The experimental results show two important things. First, when the analyzing duration of the proposed method is one third of that of the Fourier transform method, the precision of the proposed method is only a little lower than that of the Fourier transform method. Second, the precision of the proposed method is superior to that of the Fourier transform approach when the analyzing duration is the same for both methods.
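
The duration constraint mentioned in this abstract follows from the spacing of DFT frequency bins, which is the sample rate divided by the number of samples, or equivalently one over the analysis duration. A minimal sketch of that arithmetic is given below; the function name and the 30 frames per second video rate are assumptions for illustration and do not come from the abstract.

```python
def dft_bin_spacing_bpm(sample_rate_hz, num_samples):
    """Spacing between adjacent DFT bins, expressed in beats per minute."""
    return 60.0 * sample_rate_hz / num_samples

# Assumed 30 fps video: a 15 s window has 450 frames, a 5 s window has 150.
spacing_15s = dft_bin_spacing_bpm(30.0, 450)
spacing_5s = dft_bin_spacing_bpm(30.0, 150)
```

Under these assumptions a 15-second window separates heart rates only 4 bpm apart, and a 5-second window only 12 bpm apart, which illustrates why shorter windows call for a different estimation approach.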


A Stereo Camera Distortion Detecting Method for 3DTV Video Quality Assessment

Quanwu Dong* , Tong Zhou* , Zongming Guo* and Jianguo Xiao*

*Peking University, Zhongguancun Haidian Science Park Postdoctoral Workstation, Peking University Founder Group Co., Ltd.

For 3DTV systems, camera distortions, such as vertical misalignment, camera rotation, unsynchronized zooming, and color misalignment, are introduced during the process of video capturing using stereo cameras. They are important factors affecting the perceptual quality of 3D videos. This paper proposes a stereo camera distortion detecting method based on the statistical model. Experiment is set up on a database which consists of video clips from an on-air 3DTV channel. Subjective assessment is also performed to find out its relation with the experimental results of the proposed method.


The Detection of Blotches in Old Movies

Xu Li*, Ranran Zhang, Yi Zhang

*Nankai University, Shanghai Fudan-holding Hualong Microsystem Technology Co. Ltd, Tianjin University

Blotch detection is one of the most important steps in the old movies restoration. The existing blotch detection algorithms get higher correct detection rates by reducing the threshold value. However, the corresponding higher false alarms affect the following correction results directly. To maximize the ratio between correct detections and false alarms, an improved blotch detection algorithm based on simplified rank-ordered difference is proposed in this paper. The improved algorithm can achieve the most appropriate threshold for the different blotches in one frame by introducing dual-step adaptive multi-threshold; meanwhile the employment of texture matching avoids the possible deviation caused by motion vector estimation in the regions with blotches. Performance evaluation is taken to the image sequences with both real blotches and artificially corrupted ones separately. The experimental results indicate that our algorithm can achieve higher correct detection rates and lower false alarms.


Improved Reconstruction for Computer-Generated Hologram from Digital Images

Min Liu* and Guanglin Yang

*Hunan University, Beijing University

In this paper, the reconstruction process of digital image’s Computer Generated Hologram (CGH) is mathematically analyzed from the point of Discrete Fourier Transform (DFT). Based on such analysis, a convenient and effective method is proposed to improve the quality of digital image’s CGH reconstruction. By eliminating the Direct Component (DC) of the DFT output of the hologram, the original digital image can be precisely reconstructed from the Burch CGH. Experiments have been done and the results show that the reconstructed real-image is exactly the same as the original digital image used to create the CGH, except that the pixel values are being scaled by some certain ratios. The experiment results also show that the proposed method can effectively improve the reconstruction quality of Huang’s CGH and Wai-Hon-Lee’s CGH. Index Terms-Computer-Generated Hologram, DFT, Burch, DC Component.



OS.19-IVM.8: Visual Content Generation, Representation, Evaluation and Protection

Translation Insensitive Assessment of Image Quality Based on Measuring The Homogeneity of Correspondence

Chun-Hsien Chou1 and Yun-Hsiang Hsu2

1,2Department of Electrical Engineering, Tatung University

Image quality assessment plays an important role in the development of many image processing systems. Many full-reference image quality metrics have been proposed with the aim of giving predictions as close as possible to the subjective assessment made by human beings. However, these metrics have a common restriction: pixel-wise correspondence must be established before the evaluation of metric scores. Most of the existing metrics fail to give an accurate prediction even when a test image differs from its original reference merely by a one-pixel misalignment. Based on the fact that dissimilar image contents lead to random block correspondence, an image quality assessment method that primarily measures the randomness in the displacement between corresponding blocks from the images in comparison is proposed. The performance of the proposed metric is verified by evaluating the quality of the test images contained in the LIVE and TID2008 databases and the same images translated by various distances. The correlation between subjective evaluation results and the objective scores evaluated by the proposed metric as well as five other well-known image quality assessment methods is examined. Experimental results indicate that the proposed metric is an effective assessment method that can predict the image quality accurately without the preprocessing of image alignment.


A Method of Illumination Effect Transfer between Images Using Color Transfer and Gradient Fusion

Yi Zhang, Tianhao Zhao, Zhipeng Mo, Wenbo Li

Tianjin University

Illumination plays a crucial role to determine the quality of an image especially in photography. However, illumination alteration is quite difficult to achieve with existing image composition techniques. This paper proposes an unsupervised illumination-transfer approach for altering the illumination effects of an image by transferring illumination effects from another. Our approach consists of three phases. Phase-one layers the target image to three luminosity-variant layers by a series of pre-processing and alpha matting; meanwhile the source image is layered accordingly. Then the layers of the source image are recolored respectively by casting the colors from the corresponding layers of the target image. In phase-two, the recolored source image is edited to seamlessly transit at the boundaries between the layers using gradient fusion technique. Finally, phase-three recolors the fused source image again to produce a similar illuminating image with the target image. Our approach is tested on a number of different scenarios and the experimental results show that our method works well to transfer illumination effects between images.


Semi-Fragile Watermarking Scheme with Discriminating General Tampering from Collage Attack

Yaoran Huo, Hongjie He* and Fan Chen

Southwest Jiaotong University

A semi-fragile watermarking scheme is proposed to discriminate general tampering and collage attack in this paper. Five bits watermark data of each block generated are divided into the general tampering watermark (GTW) and collage attack watermark (CAW), which are embedded in the same block and the other blocks to discriminate general tampering and collage attack. The general tampered regions are obtained by the GTW data, and used to define a new collage identification parameter (CIP) combined with the consistency mark of collage attack. If the CIP is less than the given threshold, there are collaged regions in the test image and the CAW data is used to localize the collaged region. This work also discusses the selection of threshold of CIP. Experimental results show that the proposed algorithm improves the tamper detection performance compared to the existing semi-fragile watermarking algorithms and has the ability to discriminate general tampering from collage attack.


Point Cloud Compression Based on Hierarchical Point Clustering

Yuxue Fan and Yan Huang* and Jingliang Peng*

Shandong University

In this work we propose an algorithm for compressing the geometry of a 3D point cloud (3D point-based model). The proposed algorithm is based on the hierarchical clustering of the points. Starting from the input model, it performs clustering to the points to generate a coarser approximation, or a coarser level of detail (LOD). Iterating this clustering process, a sequence of LODs are generated, forming an LOD hierarchy. Then, the LOD hierarchy is traversed top down in a width-first order. For each node encountered during the traversal, the corresponding geometric updates associated with its children are encoded, leading to a progressive encoding of the original model. Special efforts are made in the clustering to maintain high quality of the intermediate LODs. As a result, the proposed algorithm achieves both generic topology applicability and good rate-distortion performance at low bitrates, facilitating its applications for low-end bandwidth and/or platform configurations.


Evaluating Line Drawings of Human Face

Qiao Lu, Lu Wang*, Li-ming Lou and Hai-dong Zheng

Shandong University

This paper presents the results of a study of which types of line drawings best convey 3D faces, using a normal estimation method based on face domain knowledge, including facial expressions, facial orientation and the regions of the five senses. Estimated lines include occluding contours, suggestive contours, ridges and valleys, apparent ridges, and lines from shaded images. Our findings suggest that each style of line drawing has its own advantage in depicting different parts of the human face and that errors in depiction are localized in particular regions of the human face. Our study also shows which kinds of lines do better in depicting particular parts of the human face, such as the eyes, nose and mouth, for different expressions. Finally, we generate optimized line drawing results for different face models to illustrate the effectiveness of our evaluation result.



OS.20-IVM.9: Multimedia Security and Forensics

Controllable Transparency Image Sharing Scheme for Grayscale and Color Images with Unexpanded Size

Yi-Chong Zeng*, and Chi-Hung Tsai

*Institute for Information Industry, Institute for Information Industry

Aiming to control the transparency of a secret image, we propose an image sharing scheme that encrypts the secret image among two or more sharing images. The overall effects of the proposed method are controllable transparency of the secret image and unexpanded size of the sharing images. The controllable transparency image sharing scheme is realized based on the principle of penetrability: when light passes through a medium, the medium attenuates the illumination of the light. We treat each pixel as a medium and adjust the pixel values of multiple sharing images, so that the transparency of the decrypted secret image is controllable. The experimental results demonstrate that our scheme can be applied to grayscale and color images for visual cryptography. Furthermore, the similarity between the sharing images and the secret image is low when the proposed scheme is used.


Approximate Message Passing Algorithm for Complex Separable Compressed Imaging

Akira Hirabayashi*, Jumpei Sugimoto, and Kazushi Mimura

*Ritsumeikan University, Yamaguchi University, Hiroshima City University

We propose an approximate message passing (AMP) algorithm for complex separable compressed imaging. The standard formulation of compressed sensing uses one-dimensional signals, so images are usually reshaped into vectors by raster scan, which requires a huge sensing matrix. In separable cases such as the discrete Fourier transform (DFT), however, the sensing process can be formulated using two moderate-size matrices that multiply the image from both sides. We exploit this formulation in our AMP algorithm. Since we assume the DFT for the sensing process, in which measurements are complex, our formulation applies to cases in which both the target signals and the measurements are complex. We show that the proposed algorithm perfectly reconstructs a 128 × 128 image, which could not be handled by the raster scan approach in the same computational environment. We also show that the compression rate of the proposed algorithm is nearly the same as the so-called weak threshold.
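
The equivalence between the two-sided (separable) measurement and the raster-scan formulation can be checked numerically. The following is only a minimal sketch of the measurement model, not of the AMP algorithm itself; the function names and the small matrix sizes are illustrative assumptions, and random complex matrices stand in for the DFT-based sensing matrices.

```python
import numpy as np


def sense_separable(A, X, B):
    """Two-sided measurement Y = A X B^T using two moderate-size matrices."""
    return A @ X @ B.T


def sense_raster(A, X, B):
    """Same measurement via one large matrix applied to the raster-scanned image.

    For row-major (raster) vectorization, vec(A X B^T) = (A kron B) vec(X).
    """
    big = np.kron(A, B)  # shape (m1*m2, n1*n2): grows quickly with image size
    y = big @ X.reshape(-1)
    return y.reshape(A.shape[0], B.shape[0])


def random_complex(rng, shape):
    """Hypothetical helper: random complex matrix used as stand-in data."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = random_complex(rng, (3, 4))
    B = random_complex(rng, (2, 5))
    X = random_complex(rng, (4, 5))
    print(np.allclose(sense_separable(A, X, B), sense_raster(A, X, B)))
```

For an n × n image with m × m measurements, the separable form stores two m × n matrices, whereas the raster form needs one m² × n² matrix, which is the memory problem the abstract refers to.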


Game Theoretic Analysis of Camera Source Identification

Hui Zeng1, Yunwen Jiang1, Xiangui Kang1,2, Li Liu3

1Sun Yat-sen University, 2Chinese Academy of Sciences, 3Marvell Semiconductor Inc.

Sensor pattern noise (SPN) is recognized as a reliable device fingerprint for camera source identification (CSI). However, the source identification method (source test) ignores whether the fingerprint is forged, and anti-forensic techniques seldom consider the traces they leave behind. Therefore, the performance of these techniques needs to be re-evaluated under the assumption that both a forensic investigator and an anti-forensic forger are present. In this paper, we propose a novel counter anti-forensic method based on noise level estimation to detect possible forgery (forgery test). Furthermore, we evaluate the Nash equilibrium performance when the investigator performs both the source test and the forgery test, and identify the optimal strategies of both parties using game theory. The experimental results show that our proposed method achieves good performance without collecting the candidate image set required by the existing triangle test method, especially when the false alarm rate is held low (e.g., Pfa < 5%).


A Secure Face Recognition Scheme Using Noisy Images Based on Kernel Sparse Representation

Masakazu FURUKAWA, Yuichi MURAKI, Masaaki FUJIYOSHI, and Hitoshi KIYA

Tokyo Metropolitan University

This paper proposes a secure face recognition scheme based on kernel sparse representation in which facial images are visually encrypted. In the proposed scheme, a noisy image is added to all facial images, including the query image, to protect them. The noise-added facial images are further clipped to prevent unauthorized noise removal. Thanks to these strategies, a leakage of facial images will not disclose users’ privacy, even if the noisy image is also leaked. That is, the proposed scheme protects users’ privacy and does not need to manage the noisy image securely. The proposed scheme applies kernel sparse representation based face recognition directly to the noisy facial images, i.e., it is decryption-free. Experimental results demonstrate that the face recognition performance of kernel sparse representation is not degraded even when facial images are visually encrypted.


The Game of Countering JPEG Anti-forensics Based on the Noise Level Estimation

Yunwen Jiang1, Hui Zeng1, Xiangui Kang1,2, Li Liu3

1Sun Yat-sen University, 2Chinese Academy of Sciences, 3Marvell Semiconductor Inc.

It is well known that JPEG image compression can result in quantization artifacts and blocking artifacts. Many forensic techniques make use of an image's compression fingerprints to verify digital images. However, when a forger exists, these methods are no longer reliable. One typical anti-forensic method adds anti-forensic dither to the DCT coefficients and erases blocking artifacts to remove the compression history. In this paper, we propose a new counter anti-forensic method based on estimating the noise added in the process of erasing blocking artifacts. The experimental results show that our method obtains an average detection accuracy of 98% on the UCID image database. Another advantage of the proposed method is that it uses only a one-dimensional feature and is therefore time-saving. Furthermore, we use game theory to evaluate the performance of both sides and identify their optimal strategies.


RFID Systems Integrated OTP Security Authentication Design

Chao-Hsi Huang* and Shih-Chih Huang

*National Ilan University, National Ilan University

As radio frequency identification (RFID) technology matures, applications of RFID systems have increased significantly, and RFID is now widely used in commodity storage and access management. We believe that it will become one of the major forms of electronic money for daily business consumption in the future. However, the stability and security of data transactions will become more important as business applications grow. Among existing solutions, we have not yet found an effective way to completely protect a tag against forgery and attack. In this paper, we analyze the security problems of RFID authentication and propose a security authentication scheme for RFID tags based on a one-time password (OTP) authentication method. OTP authentication improves the security of RFID tag authentication: an authorized RFID tag is identified by the additional OTP check, so if an attacker uses eavesdropping to clone an RFID tag, the clone can be detected. We use the RFC 6238 Time-Based One-Time Password (TOTP) algorithm, which is based on the HMAC-SHA1 algorithm, to enhance the authentication mechanism of RFID security. We also use the computing power of an NFC-enabled smartphone to generate the TOTP with the OTP generator designed in this paper. The TOTP can be repeatedly and securely written to the tag. Through the use of RADIUS authentication technology, manufacturers can easily apply this technique in existing RFID systems. It also makes it easy for users to roam between different service providers, as long as they use the same frequency and RFID standard.
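
The TOTP computation named in the abstract is specified in RFC 6238 on top of the HOTP truncation of RFC 4226. The following is a minimal sketch of that standard computation only; it does not reproduce the authors' OTP generator, tag format, or RADIUS integration, and the function names and default parameters (30-second step, 6 digits) are illustrative assumptions.

```python
import hashlib
import hmac
import struct


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HOTP value (RFC 4226): HMAC-SHA1 followed by dynamic truncation."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # low 4 bits of the last byte select the window
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """TOTP value (RFC 6238): HOTP with the counter set to the time-step index."""
    return hotp(secret, unix_time // step, digits)


if __name__ == "__main__":
    # Shared secret and timestamp taken from the RFC 6238 SHA-1 test vectors.
    print(totp(b"12345678901234567890", 59, digits=8))  # 94287082
```

A verifier holding the same secret recomputes the value for the current time step (and usually a small window of neighboring steps) and compares it with the value read from the tag.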


OS.21-SLA.7: Recent Advances in Audio and Acoustic Signal Processing

A General Approach to the Design and Implementation of Linear Differential Microphone Arrays

Jingdong Chen and Jacob Benesty

Northwestern Polytechnical University, University of Quebec

The design of differential microphone arrays (DMAs) and the associated beamforming algorithms have become very important problems. Traditionally, an Nth-order DMA is formed by subtractively combining the outputs of two DMAs of order N − 1. This method, though simple and easy to implement, suffers from a number of limitations. For example, it is difficult to design the equalization filter that is needed for compensating the array’s non-uniform frequency response, particularly for high-order DMAs. In this paper, we propose a new approach to the design and implementation of linear DMAs for speech enhancement. Unlike the traditional method that works in the time domain, this proposed approach works in the short-time Fourier transform (STFT) domain. The core issue with this framework is how to design the desired differential beamformer in each subband, which is accomplished by solving a linear system consisting of N + 1 fundamental constraints for an Nth-order DMA.


Acoustic Source Tracking in Reverberant Environment Using Regional Steered Response Power Measurement

Kai Wu and Andy W. H. Khong

Nanyang Technological University

Acoustic source localization and tracking using a microphone array is challenging due to the presence of background noise and room reverberation. Conventional algorithms employ the steered response power (SRP) as the measurement function in a particle filter based tracking framework. The particle weight is updated according to a pseudo-likelihood derived from the SRP value of each particle position. The performance of this approach degrades in a noisy and reverberant environment. In this paper, instead of evaluating the SRP value for each discrete particle position, we propose to apply a regional SRP beamformer that takes into account a circular region centered on each particle position, in order to provide a more robust particle likelihood evaluation. In addition, a proper mapping function is proposed to transform the regional SRP value into the likelihood. Simulation results show that the proposed method achieves robustness in tracking a speech source in a noisy and reverberant environment.


Toward Musical-Noise-Free Blind Speech Extraction: Concept and Its Applications

Ryoichi Miyazaki*, Hiroshi Saruwatari*, Satoshi Nakamura*, Kiyohiro Shikano*, Kazunobu Kondo, Jonathan Blanchette, and Martin Bouchard

*Nara Institute of Science and Technology, Yamaha Corporation, University of Ottawa

In this paper, we review a blind musical-noise-free speech extraction method using a microphone array that can be applied to nonstationary noise. In our previous study, it was found that optimized iterative spectral subtraction (SS) results in speech enhancement with almost no musical noise generation, but this method is valid only for stationary noise. The proposed method consists of iterative blind dynamic noise estimation by, e.g., ICA or multichannel Wiener filtering, and musical-noise-free speech extraction by modified iterative SS, where multiple iterative SS is applied to each channel while maintaining the multichannel property reused for the dynamic noise estimators. Also, in relation to the proposed method, we discuss the justification for applying ICA to signals nonlinearly distorted by SS. Through objective and subjective evaluations simulating a real-world hands-free speech communication system, we show that the proposed method outperforms the conventional methods.


Detection of user’s body movement for binaural hearing aids to control of directivity

Yoshifumi CHISAKI*, Shogo TANAKA and Tsuyoshi USAGAWA

*Kumamoto University, Kumamoto University, Kumamoto University

Estimation of sound source directions and separation of sound sources are implemented in many products, and one such application is binaural hearing aids. In conversation using binaural hearing aids, continuous tracking of sound sources with acoustic signals is sometimes complicated because the sound sources move dynamically. To simplify the tracking of sound sources, it is considered helpful to use non-verbal information in communication. Since the user’s body movement, including head movement, corresponds to the speakers’ positions, it is possible to estimate the communication zone in which the sound sources are located from the head direction. This paper focuses on head movement in conversation as non-verbal information in communication and discusses two zone detection methods. The rotational angle of head movement is estimated from both the acceleration measured by an accelerometer and the angular velocity measured by an angular velocity sensor attached at the left ear position. The classification of the spatial communication zone is performed by two methods: a threshold method and a support vector machine (SVM). As a result, the performance of the threshold-based method in estimating the target direction was slightly better than that of the SVM-based method.


Speech Enhancement with Ad-Hoc Microphone Array Using Single Source Activity

Ryutaro Sakanashi*, Nobutaka Ono, Shigeki Miyabe, Takeshi Yamada§ and Shoji Makino

*§¶University of Tsukuba, The Graduate University for Advanced Studies (SOKENDAI)

In this paper, we propose a method for synchronizing asynchronous channels in an ad-hoc microphone array based on single source activity for speech enhancement. An ad-hoc microphone array can include multiple recording devices that do not communicate with each other. Therefore, their synchronization is a significant issue when using conventional microphone array techniques. Here we assume that we know two or more segments (typically the beginning and the end of the recording) in which only a single sound source is active. Based on this assumption, we compensate for the difference between the start and end of the recordings and for the sampling frequency mismatch. We also describe experimental results for speech enhancement with a maximum SNR beamformer.


OS.22-SLA.8: Speech Recognition (I)

Estimation of Speech Recognition Performance in Noisy and Reverberant Environments Using PESQ Score and Acoustic Parameters

Takahiro Fukumori*, Masato Nakayama, Takanobu Nishiura and Yoichi Yamashita

*Ritsumeikan University, Ritsumeikan University

Automatic speech recognition (ASR) performance is degraded in noisy and reverberant environments. Although various techniques against degradation of the ASR performance have been proposed, it is difficult to apply them properly in evaluation environments with unknown noise and reverberation conditions. These techniques can be applied properly to improve the ASR performance if we can estimate the relationship between the ASR performance and the degradation factors, including both noise and reverberation. In this study, we propose new noisy and reverberant criteria, referred to as "Noisy and Reverberant Speech Recognition with the PESQ and the Dn (NRSR-PDn)". We first designed the "NRSR-PDn" using the relationships among the D value, the PESQ score, and the ASR performance. We then estimated the ASR performance with the designed criteria "NRSR-PDn" in evaluation experiments. Experimental evaluations demonstrated that our proposed criteria are well suited for robustly estimating the ASR performance in noisy and reverberant environments.


Fast NMF Based Approach and VQ Based Approach Using MFCC Distance Measure for Speech Recognition From Mixed Sound

Shoichi Nakano*, Kazumasa Yamamoto*, and Seiichi Nakagawa*

*Toyohashi University of Technology

We consider a speech recognition method for mixed sound, consisting of speech and music, that removes only the music based on vector quantization (VQ) and non-negative matrix factorization (NMF). Instead of the conventional amplitude spectrum distance measure, we introduce an MFCC distance measure, which is not affected by pitch. For isolated word recognition using the clean speech model, a word error reduction rate of 53% was obtained compared with the case of not removing music. Furthermore, a high recognition rate, close to that of clean speech recognition, was obtained at 10 dB. In the multi-condition case, our proposed method reduced the error rate by 67% compared with the multi-condition model.


Morphological Personalization of A Physiological Articulatory Model

Nana Nishimura, Shin’ichi Kawamoto, Jianwu Dang and Kiyoshi Honda

Japan Advanced Institute of Science and Technology, Tianjin University

A physiological articulatory model is capable of simulating movements of the articulatory organs together with their morphology, without the limitations imposed by observation approaches and ethics. The model can be a tool for investigating an aspect of individual characteristics of speech, if it is adjusted to the form of a particular speaker’s organs. This study reports our effort to personalize an articulatory model to multiple speakers. To do so, we propose a method for constructing personalized articulatory models based on the adaptation of a prototype model by transformation. Accordingly, new models were built for three speakers, which successfully reflected their morphological features. Also, the models were found to be able to simulate typical tongue movements in the same way as the prototype model.


Speech Recognition Using Blind Source Separation and Dereverberation Method for Mixed Sound of Speech and Music

Longbiao Wang, Kyohei Odani, Atsuhiko Kai and Weifeng Li

Nagaoka University of Technology, Shizuoka University, Tsinghua University

In this paper, we propose a method for non-stationary noise reduction and dereverberation. We use a blind dereverberation method based on spectral subtraction using a multi-channel least mean square algorithm, which was proposed in our previous study. To suppress the non-stationary noise, we use blind source separation based on an efficient fast independent component analysis algorithm. This method is evaluated using a mixed sound of speech and music, and achieves average relative word error reduction rates of 41.9% and 7.9% compared with a baseline method and the state-of-the-art multistep linear prediction-based dereverberation, respectively, in a real environment.


The Effect of Part-of-speech on Mandarin Speech Recognition

Caixia Gong, Xiangang Li and Xihong Wu

Peking University

This paper concentrates on the effect of part-of-speech on Mandarin speech recognition by incorporating it into the language model and the pronunciation dictionary. This work is motivated by two benefits of part-of-speech: it reduces the lexical ambiguity in the language model to some extent, and it provides some information about the pronunciation of heteronyms. The experiments conducted on two corpora, tagged manually or automatically, show that a 3% relative character error rate (CER) reduction is achieved. Moreover, we find that the performance improvement is mainly due to the relationship between part-of-speech and the pronunciation of heteronyms.


Deep Neural Networks for Syllable based Acoustic Modeling in Chinese Speech Recognition

Xiangang Li, Caifu Hong, Yuning Yang, and Xihong Wu

Peking University

Recently, deep neural network (DNN) based acoustic modeling methods have been successfully applied to many speech recognition tasks. This paper reports work on applying DNNs to syllable based acoustic modeling in Chinese automatic speech recognition (ASR). Compared with initials/finals (IFs), syllables can implicitly model the intra-syllable variations with better accuracy. However, the context dependent syllable based modeling set contains too many units, which causes serious problems for modeling and decoding implementation. In this paper, a WFST decoding framework is applied. Moreover, decision tree based state tying and DNN based models are discussed for acoustic model training. The experimental results show that, compared with the traditional IF based modeling method, the proposed syllable modeling method using DNNs is more robust to the data sparsity problem, which indicates that it has the potential to obtain better performance for Chinese ASR.


OS.23-SPS.3: Recent Advances in Digital Filter Design and Implementation

Design of Digital Fractional Order Differentiator Using Discrete Sine Transform

Chien-Cheng Tseng* and Su-Ling Lee

*National Kaohsiung First University of Science and Technology, Chang-Jung Christian University

In this paper, the design of digital fractional order differentiators (DFODs) using the discrete sine transform (DST) is presented. First, the definition of fractional differentiation is reviewed briefly. Then, the DST-based interpolation method is applied to compute the fractional differentiation of a given digital signal. Next, the transfer functions of the DFOD are obtained from the DST computational results by using an index mapping method. Finally, some numerical and application examples are demonstrated to show the effectiveness of the proposed DST-based design method.


On L2-Sensitivity of State-Space Digital Filters under Gramian-Preserving Frequency Transformation

Shunsuke Koshita, Masahide Abe and Masayuki Kawamata

Tohoku University

This paper aims to reveal the relationship between the minimum L2-sensitivity of state-space digital filters and the Gramian-preserving frequency transformation. To this end, we first give a prototype low-pass state-space filter in such a manner that its structure becomes the minimum L2-sensitivity structure. Then we apply the Gramian-preserving (LP-LP) frequency transformation with a tunable parameter to this prototype filter. In this way we obtain a low-pass state-space filter with tunable cutoff frequency from a prescribed prototype low-pass filter with minimum L2-sensitivity. For this tunable low-pass state-space filter, we evaluate the L2-sensitivity over the entire range of cutoff frequencies. The evaluation result shows that, although the minimality of the L2-sensitivity is not preserved under the frequency transformation, the L2-sensitivity of the tunable filter given in this way becomes very close to the minimum value for arbitrary cutoff frequencies.


Hilbert Pair of Almost Symmetric Orthogonal Wavelets with Arbitrary Center of Symmetry

Dai-Wei Wang* and Xi Zhang

The University of Electro-Communications

This paper proposes a new method for designing a class of Hilbert pairs of almost symmetric orthogonal wavelets with arbitrary center of symmetry. Two scaling lowpass filters are designed simultaneously to satisfy the specified degree of flatness of group delays, vanishing moments and orthogonality condition of wavelets, along with improved analyticity. Therefore, the resulting scaling lowpass filters have flat group delay responses and the specified number of vanishing moments. Moreover, the difference between the frequency responses of the two scaling lowpass filters can be effectively minimized to improve the analyticity of the complex wavelets. The condition of orthogonality is linearized, and then an iterative procedure is used to obtain the filter coefficients. Finally, several examples are presented to demonstrate the effectiveness of the proposed design procedure.


An Effective Allocation of Non-zero Digits for CSD Coefficient FIR Filters Using 0-1PSO

Takuya IMAIZUMI and Kenji SUYAMA

Tokyo Denki University

In this paper, a novel method for effective allocation of non-zero digits in the design of CSD (Canonic Signed Digit) coefficient FIR (Finite Impulse Response) filters is proposed. The design problem can be formulated as a mixed integer linear programming problem, which is well known to be NP-hard. Recently, a heuristic approach using PSO (Particle Swarm Optimization) for solving the problem has been proposed, in which the maximum number of non-zero digits is limited in each coefficient. In the proposed method, by contrast, the maximum number of non-zero digits is limited in total and the 0-1PSO is applied. This enables an effective allocation of non-zero digits and provides a good design. Several examples are shown to demonstrate the efficiency of the proposed method.
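
For readers unfamiliar with the CSD representation, the sketch below converts an integer to signed digits in {-1, 0, +1} with no two adjacent non-zero digits, and counts the non-zero digits across a set of coefficients. It illustrates only the number representation and the total digit budget mentioned in the abstract, not the 0-1PSO design method; the function names are hypothetical, and the coefficients are assumed to be already scaled to integers.

```python
from typing import List


def to_csd(n: int) -> List[int]:
    """Return the CSD digits of integer n, least significant digit first."""
    digits = []
    while n != 0:
        if n % 2:
            d = 2 - (n % 4)  # +1 when n = 1 (mod 4), -1 when n = 3 (mod 4)
            n -= d           # the remainder is now divisible by 4
        else:
            d = 0
        digits.append(d)
        n //= 2
    return digits


def from_csd(digits: List[int]) -> int:
    """Rebuild the integer value from CSD digits (least significant first)."""
    return sum(d << i for i, d in enumerate(digits))


def total_nonzero_digits(coeffs: List[int]) -> int:
    """Total number of non-zero CSD digits over all (integer) coefficients."""
    return sum(sum(1 for d in to_csd(c) if d != 0) for c in coeffs)


if __name__ == "__main__":
    print(to_csd(7))                       # [-1, 0, 0, 1], i.e. 8 - 1
    print(total_nonzero_digits([7, 23]))   # 2 + 3 = 5
```

Each non-zero digit corresponds to one shift-and-add (or subtract) term in a multiplierless implementation, which is why the number of non-zero digits is the quantity being budgeted.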


On the Limit Cycles in the Minimum L2-Sensitivity Realizations Subject to L2-Scaling Constraints of Second-Order Digital Filters

Shunsuke Yamaki, Masahide Abe, and Masayuki Kawamata

Tohoku University, Tohoku University

This paper gives a conjecture on the absence of limit cycles in the minimum L2-sensitivity realizations subject to L2-scaling constraints for second-order digital filters. We design second-order digital filters with various pole-zero configurations, synthesize the minimum L2-sensitivity realizations subject to L2-scaling constraints, and examine whether their coefficient matrices satisfy a sufficient condition for the absence of limit cycles. As a result, it is shown that, in the range of practical pole radii, the minimum L2-sensitivity realizations subject to L2-scaling constraints of second-order digital filters satisfy a sufficient condition for the absence of limit cycles. Furthermore, we demonstrate the absence of limit cycles in the minimum L2-sensitivity realizations of a second-order digital filter by observing its zero-input response.


Design of FIR Filters with Decimated Impulse Responses

Tomohiro Yamauchi, Ryo Matsuoka and Masahiro Okuda

Univ. of Kitakyushu

In this paper, we present a numerical algorithm for the design of FIR filters with sparse impulse responses. Our method minimizes the number of nonzero entries in the impulse response together with the least squares error of its frequency response. We show that FIR filters with sparse coefficients can outperform a conventional least squares approach and the Parks-McClellan method under the condition of the same number of multipliers.


OS.24-SIPTM.3: Advances in Linear and Nonlinear Adaptive Signal Processing and Learning

Context Dependent Acoustic Keyword Spotting Using Deep Neural Network

Guangsen WANG and Khe Chai SIM

National University of Singapore

The language model is an essential component of a speech recogniser. It provides additional linguistic information to constrain the search space and guide the decoding. In this paper, a language model is incorporated in the keyword spotting system to provide the contexts for the keyword models under the weighted finite state transducer framework. A context independent deep neural network is trained as the acoustic model. Three keyword contexts are investigated: the phone to keyword context, the fixed length word context and the arbitrary length word context. To provide these contexts, a hybrid language model with both word and phone tokens is trained using only the word n-gram counts. Three different spotting graphs are studied depending on the contexts involved: the keyword loop graph, the word fillers graph and the word loop fillers graph. These graphs are referred to as the context dependent (CD) keyword spotting graphs. The CD keyword spotting systems are evaluated on the Broadcasting News Hub4-97 F0 evaluation set. Experimental results reveal that, for all three CD graphs, incorporating the language model information provides a performance gain over the baseline context independent graph without any contexts. The best system, using the arbitrary length word context, achieves performance comparable to full decoding while tripling the spotting speed. In addition, error analysis demonstrates that the language model information is essential for reducing both the insertion and deletion errors.


Exploiting Sparsity in Feed-Forward Active Noise Control with Adaptive Douglas-Rachford Splitting

Masao Yamagishi and Isao Yamada

Tokyo Institute of Technology

Observing that a typical primary path in an Active Noise Control (ANC) system is sparse, i.e., has only a few significant coefficients, we propose an adaptive learning scheme that promotes the sparsity of the concatenation of the adaptive filter and the secondary path. More precisely, we propose to suppress a time-varying sum of the data-fidelity term and the weighted ℓ1 norm of the concatenation by the adaptive Douglas-Rachford splitting scheme. Numerical examples demonstrate that the proposed algorithm achieves excellent ANC performance by exploiting the sparsity and is robust against violation of the sparsity assumption.


Multikernel Adaptive Filters With Multiple Dictionaries and Regularization

Taichi Ishida and Toshihisa Tanaka

Tokyo University of Agriculture and Technology

We discuss a regularization method and a dictionary construction method that offer a high degree of freedom within the framework of multikernel adaptive filtering. The multikernel adaptive filter is an extension of the kernel adaptive filter using multiple kernels. Hence, it offers higher performance than the kernel adaptive filter. In this paper, we focus on the fact that the multikernel adaptive filter determines a subspace in the sum of multiple reproducing kernel Hilbert spaces (RKHSs) associated with different kernels. Based on this, we propose a novel method to individually select appropriate input signals in order to construct the dictionary that determines the subspace. Furthermore, based on the fact that any unknown filter is an element of the RKHS, we propose L2 regularization in order to avoid overadaptation. Also, we derive an algorithm that fixes the dictionary size in order to construct an efficient adaptive algorithm. Numerical examples show the efficiency of the proposed method.


Sparse Adaptive Filtering by Iterative Hard Thresholding

Rajib Lochan Das and Mrityunjoy Chakraborty

Indian Institute of Technology

In this paper, we present a new algorithm for sparse adaptive filtering, drawing from the ideas of a greedy compressed sensing recovery technique called the iterative hard thresholding (IHT) and the concepts of affine projection. While usage of affine projection makes it robust against colored input, the use of IHT provides a remarkable improvement in convergence speed over the existing sparse adaptive algorithms. Further, the gains in performance are achieved with very little increase in computational complexity.
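
As background for the abstract above, the sketch below shows the standard batch iterative hard thresholding (IHT) recursion from compressed sensing, x ← H_s(x + μ Aᵀ(y − A x)), where H_s keeps the s largest-magnitude entries. It is not the authors' adaptive algorithm and omits the affine projection step entirely; the function names, step size, and iteration count are illustrative assumptions.

```python
import numpy as np


def hard_threshold(x, s):
    """H_s: keep the s largest-magnitude entries of x and zero the rest."""
    out = np.zeros_like(x)
    if s > 0:
        keep = np.argsort(np.abs(x))[-s:]  # indices of the s largest |x_i|
        out[keep] = x[keep]
    return out


def iht(A, y, s, step=1.0, iters=50):
    """Batch IHT for y = A x with an s-sparse x.

    Assumes the step size is small enough for the recursion to be stable
    (for example, step <= 1 / ||A||_2^2).
    """
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        gradient_step = x + step * (A.T @ (y - A @ x))
        x = hard_threshold(gradient_step, s)
    return x


if __name__ == "__main__":
    x_true = np.array([0.0, 2.0, 0.0, -1.5, 0.0])
    A = np.eye(5)  # trivial sensing matrix, used only to keep the demo exact
    print(iht(A, A @ x_true, s=2))
```

In an adaptive filtering setting, the batch residual would be replaced by a per-sample or per-block error term, which is where the affine projection idea named in the abstract comes in.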


On Adaptivity of Online Model Selection Method Based on Multikernel Adaptive Filtering

Masahiro Yukawa and Ryu-ichiro Ishii

Keio University, Niigata University

We investigate the adaptivity of the online model selection method that has been proposed recently within the multikernel adaptive filtering framework. Specifically, we consider a situation in which the nonlinear system under study changes during adaptation and the appropriate kernel changes accordingly. Our time-varying cost functions involve three regularizers: the ℓ1 norm and two block ℓ1 norms, which promote sparsity both in the kernel and data groups. The block ℓ1 regularizers are approximated by their Moreau envelopes, and the adaptive proximal forward-backward splitting (APFBS) method is applied to the approximated cost function. Numerical examples show that the proposed algorithm can adaptively estimate a reasonable model.


OS.25-IVM.10: Advanced Audio-Visual Analysis in Multimedia

Characteristics Comparison of Two Audio Output Devices for Augmented Audio Reality

Kazuhiro Kondo, Naoya Anazawa and Yosuke Kobayashi

Yamagata University

In this paper, we compare two audio output devices for augmented audio reality applications. In these applications, we plan to overlay speech annotations on the actual ambient environment. It is therefore essential that the audio output devices deliver intelligible speech annotations along with transparent delivery of the environmental auditory scene. Two candidate devices were compared. The first is a pair of bone-conduction headphones, which deliver speech by vibrating the skull while leaving normal hearing of surrounding sound intact, since these headphones leave the ear canal open. The other is a binaural microphone/earphone combo, which has a form factor similar to a regular earphone but integrates a small microphone at the ear canal entrance. The input from these microphones can be fed back to the earphones along with the annotation speech. We compared speech intelligibility when competing babble noise is simultaneously presented from the surrounding environment. The bone-conduction headphones were found to deliver speech at higher intelligibility than the binaural combo. However, with the binaural combo, we found that the ear canal transfer characteristics were altered significantly by closing the ear canal with the earphones. When we employed a compensation filter to account for this transfer function deviation, the resultant speech intelligibility was higher than that of the bone-conduction headphones. In any case, both devices were found to be acceptable as audio output devices for augmented audio reality applications, since both can deliver speech at high intelligibility even when a significant amount of competing noise is present.


Voice Activity Detection Based on Density Ratio Estimation and System Combination

Yuuki Tachioka, Toshiyuki Hanazawa, Tomohiro Narita, and Jun Ishii

Mitsubishi Electric Corporation

We propose a robust voice activity detection (VAD) method based on density ratio estimation. In highly noisy environments, the likelihood ratio test (LRT) is effective. The conventional LRT estimates both speech and noise models, calculates the likelihood of each model, and uses the ratio of these likelihoods to detect speech. However, the LRT requires only the likelihood ratio of the speech and noise models; the likelihoods of the individual models are not necessarily required. The density ratio estimation framework models the likelihood ratio function with a kernel and directly generates a likelihood ratio. Applying density ratio estimation to VAD requires feature selection and noise adaptation to be considered, because density ratio estimation constrains the shape of the likelihood ratio function and speech is dynamic. This paper addresses these problems. To improve accuracy, the proposed method is combined with the conventional LRT. Experimental results using CENSREC-1-C show that the proposed method is more effective than conventional methods, especially in non-stationary noisy environments.


Low Power Motion Estimation based on Non-Uniform Pixel Truncation

Yaocheng Rong, Quanhe Yu, Da An, and Yun He, Senior Member, IEEE

Tsinghua University

Motion estimation consumes a large portion of computational resources in most video encoders, such as AVS, H.264/AVC and HEVC/H.265. Pixel truncation can effectively reduce the power consumption of motion estimation. It is usually performed uniformly within the search window, which unfortunately degrades compression efficiency. However, based on the observation that motion vectors are unequally distributed, we can apply different numbers of truncated bits in different search areas to achieve a better tradeoff between compression efficiency and power consumption. The proposed algorithm reduces hardware computational complexity by approximately 64% compared with conventional full-bit pixel search, and at the same time achieves almost the same compression efficiency, with only 0.04 dB loss.
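
The idea of pixel truncation can be sketched as a sum of absolute differences (SAD) computed after dropping low-order bits, with the number of dropped bits depending on the search position. The rule in `bits_for_offset` (full precision near the search center, 4 truncated bits elsewhere) and all names and values are hypothetical; the abstract does not state the actual bit allocation.

```python
def truncated_sad(block_a, block_b, truncated_bits):
    """SAD over two equal-length pixel lists after dropping the
    `truncated_bits` least significant bits of every pixel."""
    return sum(abs((a >> truncated_bits) - (b >> truncated_bits))
               for a, b in zip(block_a, block_b))

def bits_for_offset(dx, dy, inner_radius=2, inner_bits=0, outer_bits=4):
    """Hypothetical non-uniform rule: no truncation within `inner_radius`
    of the search center, `outer_bits` truncated bits farther away."""
    return inner_bits if max(abs(dx), abs(dy)) <= inner_radius else outer_bits

current = [200, 13, 77, 255]    # made-up 8-bit pixels of the current block
candidate = [198, 20, 64, 240]  # made-up pixels of a candidate block

full_sad = truncated_sad(current, candidate, bits_for_offset(1, -2))
coarse_sad = truncated_sad(current, candidate, bits_for_offset(5, 0))
print(full_sad, coarse_sad)
```

Fewer bits per pixel means narrower subtractors and adders in hardware, which is where the power saving comes from; the cost is a coarser matching score.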


Human Gait Analysis by Body Segmentation and Center of Gravity

Ying-Fang Tsao, Wen-Te Liu and Ching-Te Chiu

National Tsing Hua University, Taipei Medical University-Shuang Ho Hospital, National Tsing Hua University

The physiological condition of a person may affect his/her daily behaviour, such as gait or posture. For example, when fatigued, a person may walk at a slower pace than usual. This paper presents a novel gait analysis approach to detect movement variations such as changes in walking pace or speed, walking while bending, walking with heavy breathing, and changes in arm or leg swing. Based on the geometry of the silhouette, we segment the body into five main parts: head, upper body, lower body, arms and legs. For a more specific analysis, we segment the torso into upper and lower body. For the walking pace analysis, we use the leg movement in the lower body to find the maximum distance in a pace cycle and the corresponding pace speed. The angles between the head or upper body and the vertical line are used to detect walking while bending or walking with heavy breathing. The arm swing angle or pace variation during walking can also be detected. We compare the normal condition with abnormal conditions, such as a respiratory obstruction leading to heavy breathing or a stomach ache resulting in a humpbacked posture. These conditions make the upper-body angle differ from that of the normal condition, so we can observe these signals to issue a warning. Our experiments show that with these fine posture features, we are able to detect a person's gait change, for example when a person is humpbacked, or when the arm/leg swing and pace distance have an abnormal rhythm. Our experimental results also show that when people are tired, they tend to adopt a static and comfortable pace distance when walking.


Unsupervised Classification of Heart Sound Recordings

Wei-Ho Tsai*, Sung-How Su and Cin-Hao Ma*

*National Taipei University of Technology, Pojen General Hospital

An unsupervised framework for classifying heart sound data is proposed in this paper. Our goal is to cluster unknown heart sound recordings such that each cluster contains recordings belonging to the same heart disease or to the normal heartbeat category. This framework is more flexible than existing supervised classification of heart sounds in cases where heart sound data belong to undefined categories or where there is no prior template data for building a heart sound classifier. To this end, methods are proposed for heart sound feature extraction, similarity computation, cluster generation, and estimation of the optimal number of clusters. Our experiments show that the clusters produced by our system are roughly consistent with the heartbeat categories defined by human labeling.


Near-Duplicate Subsequence Matching for Video Streams

Chih-Yi Chiu, Yi-Cheng Jhuang, Guei-Wun Han, and Li-Wei Kang

National Chiayi University, National Yunlin University of Science and Technology

In this paper, we study the efficiency problem of near-duplicate subsequence matching for video streams. A simple but effective algorithm called incremental similarity update is proposed to address the problem. A similarity upper bound between two videos can be calculated incrementally with a lightweight computation, which filters out unnecessary, time-consuming computation of the actual similarity between two videos. We integrate the algorithm with inverted frame indexing to scan video sequences for matching near-duplicate subsequences. Four state-of-the-art methods are implemented for comparison in terms of accuracy, execution time, and memory consumption. Experimental results demonstrate that the proposed algorithm yields comparable accuracy, a compact memory size, and more efficient execution time.


Human Computer Interaction Using Face and Gesture Recognition

Yo-Jen Tu, Chung-Chieh Kao, Huei-Yung Lin

National Chung Cheng University

In this paper, we present a face and gesture recognition based human-computer interaction (HCI) system using a single video camera. Unlike conventional communication methods between users and machines, we combine head pose and hand gesture to control the equipment. We identify the positions of the eyes and mouth, and use the facial center to estimate the pose of the head. Two new methods are presented in this paper: automatic gesture area segmentation and orientation normalization of the hand gesture. The user is not required to keep gestures in an upright position; the system segments and normalizes the gestures automatically. The experiment shows that this method is very accurate, with a gesture recognition rate of 93.6%. The user can control multiple devices, including robots, simultaneously through a wireless network.



OS.26-IVM.11: Visual Data Understanding and Modeling

Human Segmentation from Video by Combining Random Walks with Human Shape Prior Adaption

Yu-Tzu Lee*, Te-Feng Su, Hong-Ren Su*, Shang-Hong Lai, Tsung-Chan Lee and Ming-Yu Shih

*National Tsing Hua University, National Tsing Hua University, Industrial Technology Research Institute

In this paper, we propose an automatic human segmentation algorithm for video conferencing applications. Since humans are the principal subject in these videos, the proposed framework is based on human shape clues to separate humans from complex background and replace or blur the background for immersive communication. We first detect face position and size, track human boundary across frames, and propagate the segmentation likelihood to the next frame for obtaining the trimap to be used as input to the Random Walk algorithm. In addition, we also include gradient magnitude in edge weight to enhance the Random Walk segmentation results. Finally, we demonstrate experimental results on several image sequences to show the effectiveness and robustness of the proposed method.


Cattle Face Recognition Using Local Binary Pattern Descriptor

Cheng Cai1 and Jianqiao Li2

1Northwest A&F University, 2Northwest A&F University

In response to the current need for positive identification of cattle for traceability, this paper presents a novel facial representation model of cattle based on local binary pattern (LBP) texture features; some extended LBP descriptors are also introduced. Algorithm training was performed independently on several normalized gray face images of 30 cattle (with sets of six, seven, eight, and nine images per animal, respectively). Robust alignment by sparse and low-rank decomposition was also used to align the images because of variations in illumination, image misalignment and occlusion in the test image. The performance of this technique was assessed on a separate set of images using the weighted Chi square distance [1]. The LBP descriptor proves efficient and accurate, with encouraging results on cattle face recognition. More training sets and modified algorithms will be considered to improve recognition rates. Future work should aim at improving the automation of the system and combining the LBP histogram with other effective histograms.
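
The two building blocks named in this abstract can be sketched as follows: a basic 8-neighbor LBP code for one pixel, and a weighted chi-square distance between regional histograms. This is the textbook form of both, not necessarily the extended descriptors or the weights used in the paper; the neighbor ordering, the sample values, and the function names are assumptions.

```python
def lbp_code(patch):
    """8-bit LBP code of the center pixel of a 3x3 patch (list of 3 rows).
    A neighbor contributes a 1 bit when it is >= the center value."""
    center = patch[1][1]
    # Neighbors visited clockwise, starting at the top-left pixel.
    neighbors = [patch[0][0], patch[0][1], patch[0][2], patch[1][2],
                 patch[2][2], patch[2][1], patch[2][0], patch[1][0]]
    return sum(1 << i for i, value in enumerate(neighbors) if value >= center)

def weighted_chi_square(hists_a, hists_b, weights):
    """Weighted chi-square distance between two lists of regional
    histograms: sum over regions j and bins i of
    w_j * (a_ij - b_ij)^2 / (a_ij + b_ij), skipping empty bins."""
    total = 0.0
    for w, region_a, region_b in zip(weights, hists_a, hists_b):
        for a, b in zip(region_a, region_b):
            if a + b > 0:
                total += w * (a - b) ** 2 / (a + b)
    return total

code = lbp_code([[5, 9, 1],
                 [4, 6, 7],
                 [2, 3, 8]])
distance = weighted_chi_square([[2, 0], [1, 1]], [[0, 0], [1, 3]], [1.0, 2.0])
print(code, distance)
```

In a full system, the codes of all pixels in a face region are collected into a histogram, and the per-region weights let more discriminative regions count for more in the distance.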


Weed Seeds Identification based on Structure Elements Descriptor

Minnan Tang and Cheng Cai*

*Northwest A&F University

The implementation of new methods for automatic, reliable identification and classification of seeds is of great technical and economic importance in the agricultural industry. As in ocular inspection, the automatic classification of seeds should be based on knowledge of seed size, shape, color and texture. In this work, we assess the discriminating power of these characteristics for the unique identification of seeds of 216 weed species. We identified a nearly optimal set of four seed characteristics (three morphological and one color and textural) as classification parameters, using the performance of Support Vector Machines as the classifier. Among these characteristics, color and textural features are extracted and described simultaneously by SED (structure elements descriptor), which proves to perform better than other image retrieval methods. The main finding of this paper is the strong discrimination power of SED. Moreover, experimental results suggest that the recognition rate peaks with the combination of the morphological characteristics and SED.


Non Separable 3D Lifting Structure Compatible with Separable Quadruple Lifting DWT

Masahiro Iwahashi*, Teerapong Orachon* and Hitoshi Kiya

*Nagaoka University of Technology, Tokyo Metropolitan University

This report reduces the total number of lifting steps in a 3D quadruple lifting DWT (discrete wavelet transform). In the JPEG 2000 international standard, the 9/7 quadruple lifting DWT has been widely utilized for image data compression. It has also been applied to volumetric medical image data analysis. However, it has a long delay from input to output due to cascading four (quadruple) lifting steps per dimension. We reduce the total number of lifting steps by introducing 3D direct memory accessing, under the constraint that backward compatibility with the conventional DWT in JPEG 2000 is maintained. As a result, the total number of lifting steps is reduced from 12 to 8 (67%) without significant degradation of data compression performance.
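
For orientation, the 9/7 DWT factors into four lifting steps per dimension, so a separable 3D transform cascades 4 x 3 = 12 steps. The sketch below shows the conventional separable building block in one dimension only; it does not implement the non-separable structure proposed in the report. It assumes periodic boundary extension (JPEG 2000 uses symmetric extension), omits the final scaling, uses commonly quoted rounded coefficients, and all names are hypothetical.

```python
# Commonly quoted rounded lifting coefficients of the 9/7 wavelet.
ALPHA, BETA, GAMMA, DELTA = -1.586134342, -0.052980118, 0.882911075, 0.443506852

def _lift(s, d, coeff, update_odd, sign):
    """Apply one lifting step (sign=+1) or undo it (sign=-1)."""
    n = len(d)
    if update_odd:   # odd samples predicted from neighboring even samples
        d = [d[i] + sign * coeff * (s[i] + s[(i + 1) % n]) for i in range(n)]
    else:            # even samples updated from neighboring odd samples
        s = [s[i] + sign * coeff * (d[i - 1] + d[i]) for i in range(n)]
    return s, d

STEPS = [(ALPHA, True), (BETA, False), (GAMMA, True), (DELTA, False)]

def forward_97(x):
    """Four cascaded lifting steps on an even-length 1D signal."""
    s, d = list(x[0::2]), list(x[1::2])
    for coeff, update_odd in STEPS:
        s, d = _lift(s, d, coeff, update_odd, +1)
    return s, d

def inverse_97(s, d):
    """Undo the four steps in reverse order and interleave the samples."""
    s, d = list(s), list(d)
    for coeff, update_odd in reversed(STEPS):
        s, d = _lift(s, d, coeff, update_odd, -1)
    x = [0.0] * (2 * len(d))
    x[0::2], x[1::2] = s, d
    return x

signal = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
low, high = forward_97(signal)
restored = inverse_97(low, high)
```

Each step must wait for the output of the previous one, which is why the number of cascaded steps determines the input-to-output delay discussed in the abstract.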


3D Shape Retrieval Focused on Holes and Surface Roughness

Masaki Aono, Hitoshi Koyanagi, and Atsushi Tatsuma

Toyohashi University of Technology

Although quite a few 3D shape descriptors have been proposed over more than a decade, 3D shape retrieval remains a challenging research problem. No single 3D shape descriptor is known to outperform the others on all types of 3D shape geometries. In this paper, we propose a new 3D shape descriptor that focuses on 3D mechanical parts having holes and surface roughness by using Fourier spectra computed from multiple projections of distinct images. Our proposed method makes it possible to explore potential real applications of 3D shape retrieval in manufacturing industries, where reducing the cost of creating a new 3D shape from scratch is greatly appreciated. Our method explicitly attempts to extract holes and surface roughness, as well as contours, lines, and circular edges, as features from multiple projections of a given 3D shape model, and uses them to retrieve 3D shape objects. We demonstrate the effectiveness of our method on several 3D shape benchmarks, one of which is composed of many mechanical parts, and compare our proposed method with several previously known methods. The results are very encouraging and promising.


Can Ambiguous Words Be Helpful in Image-Understanding Systems?

Huiling Zhou, Jiwei Hu and Kin Man Lam

The Hong Kong Polytechnic University

A semantic gap always decreases the performance of image-to-word mapping, which is an important task in image understanding. Even efficient learning algorithms cannot solve this problem because (1) there is a lack of coincidence between the low-level features extracted from the visual data and the high-level information interpreted by humans, and (2) an ambiguous word may lead to a wrong interpretation between low-level and high-level information. This paper introduces a discriminative model with a ranking function that optimizes the cost between the target word and the corresponding images, while simultaneously discovering the disambiguated senses of those words that are optimal for supervised tasks. Experiments were conducted using two datasets, and the results are quite promising when compared with existing methods.


Facial Expression Recognition Using Hough Forest

Chi-Ting Hsu1, Shih-Chung Hsu1, and Chung-Lin Huang1,2

1National Tsing-Hua University, 2Asia University

This paper introduces a new facial expression recognition system. Facial expression analysis encounters two major problems: non-rigid morphing (human facial expressions involve non-rigid shape deformation) and person-specific appearance (the facial action features are person-dependent). Our facial expression system analyzes the non-rigid morphing of facial expressions and eliminates the person-specific effects through patch features extracted from the facial motion caused by different facial expressions. Finally, classification and localization of the center of the facial expression in the video sequences are performed using a Hough forest.


Visual Tracking using the Joint Inference of Target State and Segment-based Appearance Models

Junha Roh, Dong Woo Park, Junseok Kwon and Kyoung Mu Lee

Seoul National University

In this paper, a robust visual tracking method is proposed by casting tracking as an estimation problem in the joint space of non-rigid appearance model and state. Conventional trackers that use templates as the appearance model do not handle ambiguous samples effectively. On the other hand, trackers that use non-rigid appearance models have low discriminative power and lack methods for recovering from inaccurately labeled data. To address this problem, multiple non-rigid appearance models are proposed. The probabilities from these models are effectively marginalized by using the particle Markov chain Monte Carlo framework, which provides an exact and efficient approximation of the joint density through marginalization, together with theoretical evidence of convergence. An appearance model combines multiple classification results with different features, and multiple models can infer an accurate solution despite the failure of several models. The proposed method exhibits high accuracy compared with nine other state-of-the-art trackers in various sequences, and the results were analyzed both qualitatively and quantitatively.



OS.27-WCN.1: Green Communications and Networking

Fast Handover Techniques for ESS-Subnet Topology Mismatch in IEEE 802.11

Chien-Chao Tseng*, Chia-Liang Lin*, Yu-Jen Chang* and Li-Hsing Yen

*National Chiao Tung University, National University of Kaohsiung

With the advances in wireless communication technology, mobile applications are gradually becoming part of our lives. IEEE 802.11 is one of the most popular wireless communication technologies. Prior studies on layer-2 and layer-3 handoffs assume that an Extended Service Set (ESS) exactly matches a dedicated subnet. However, an inter-ESS handoff within a subnet and an intra-ESS handoff across subnets may also be possible. Such mismatched ESS-subnet configurations result in performance degradation. This paper proposes FCS, a Further Check Scheme, which detects the change of subnet after an intra-ESS handoff and eliminates unnecessary handoff latency after an inter-ESS handoff. The experimental results show that FCS outperforms the conventional implementation of NetworkManager under Linux in terms of handoff latency.


An Active Helper Searching Mechanism for Directional Cooperative Media Access Control (MAC) Protocols

Yi-Yu Hsieh, Jiunn-Ru Lai* and Hsi-Lu Chao††

National Kaohsiung University of Applied Sciences, ††National Chiao Tung University

Cooperative communication [1] and directional antenna systems are considered two key technologies for future wireless networks. There are also open issues in combining these two technologies [12,13]. Most research in this area relies on a table that records nearby transmissions. However, because the table is built by passive overhearing, it may be only partially useful, especially for directional antenna systems. In this paper, we propose an active helper searching mechanism to improve the completeness of the table. A helper actively searches for its own helpers rather than waiting to overhear communication from others. We compare four schemes: the base 802.11g scheme, the D-NoopMAC scheme, the D-CoopMAC scheme with the CoopTable built by conventional methods, and the D-CoopMAC scheme with the D-CoTable built with our proposed active helper searching mechanism. The simulation results show that cooperative directional MAC schemes with the active helper scheme achieve improved performance. We also conclude that nodes with directional antennas should be regulated to help omni-directional nodes in order to increase the total network throughput. We leave node mobility issues as future work.


Base Station Cooperation for Energy Efficiency: A Gauss-Poisson Process Approach

Pengcheng Qiao, Yi Zhong and Wenyi Zhang

Senior Member, IEEE

Base station cooperation is an effective means of improving the spectral efficiency of cellular networks. From an energy-efficiency perspective, whether base station cooperation benefits the network performance remains an issue to be answered. In this paper, we adopt tools from stochastic geometry to treat this issue. Specifically, we model the cooperating base stations as clusters in a Gauss-Poisson process, a variant of the usually considered Poisson point processes. We compare the performance in terms of energy efficiency with and without base station cooperation. The results reveal that only when the cooperative base stations account for a large proportion of all the base stations will the cooperation among base stations bring gains to the energy efficiency of the network.


On Channel-Aware Frequency-Domain Scheduling With QoS Support for Uplink Transmission in LTE Systems

Lung-Han Hsu and Hsi-Lu Chao

National Chiao Tung University

Due to the power consumption issue of user equipment (UE), Single-Carrier FDMA (SC-FDMA) has been selected as the uplink multiple access scheme of 3GPP Long Term Evolution (LTE). Similar to the OFDMA downlink, SC-FDMA enables multiple UEs to be served simultaneously in the uplink as well. However, the single-carrier characteristic requires that all subcarriers allocated to a UE be contiguous in frequency within each time slot. Moreover, a UE should adopt the same modulation and coding scheme on all allocated subcarriers. These two constraints limit the scheduling flexibility. In this paper, we formulate the UL scheduling problem with proportional fairness support by taking these two constraints into consideration. Since this optimization problem has been proven to be NP-hard, we further develop a heuristic algorithm. We demonstrate that competitive performance can be achieved in terms of system throughput, which is evaluated using 3GPP LTE system model simulations.


Efficient Network Coding at Relay for Relay-Assisted Network-Coding ARQ Protocols

Jung-Chun Kao and Kuo-Hao Ho

National Tsing Hua University

Relay-assisted network-coding (RANC) automatic repeat request (ARQ) protocols are ARQ protocols that leverage both opportunistic retransmission and network coding for wireless relay networks. This paper studies the issue of efficient network coding at the relay for RANC ARQ. We develop an XOR-based scheme that makes use of the Fibonacci sequence, abbreviated as XOR Fibo. Simulation results show that in terms of relative inefficiency, XOR Fibo has a significant performance gain over a plain XOR scheme. Moreover, XOR Fibo can provide close to the same performance as a random network coding scheme that uses a large field size, with the added advantage of requiring fewer and simpler operations during the encoding and decoding processes.
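
The plain XOR baseline mentioned in this abstract can be sketched in a few lines: the relay sends the bytewise XOR of two packets, and a receiver that already holds one of them recovers the other with a second XOR. This illustrates the baseline only; how XOR Fibo uses the Fibonacci sequence is not described here, so it is not sketched. The function name and packet contents are made up.

```python
def xor_encode(packet_a, packet_b):
    """Bytewise XOR of two equal-length packets."""
    if len(packet_a) != len(packet_b):
        raise ValueError("packets must have equal length")
    return bytes(x ^ y for x, y in zip(packet_a, packet_b))

# The relay combines two lost packets into one coded retransmission.
packet_1 = b"ARQ1"
packet_2 = b"ARQ2"
coded = xor_encode(packet_1, packet_2)

# A receiver that already has packet_1 recovers packet_2, and vice versa.
recovered_2 = xor_encode(coded, packet_1)
recovered_1 = xor_encode(coded, packet_2)
print(recovered_1, recovered_2)
```

One coded transmission thus serves two receivers with different losses, which is the gain that network coding brings to retransmission.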


Enhanced Cooperative Access Class Barring and Traffic Adaptive Radio Resource Management for M2M Communications over LTE-A

Yi-Huai Hsu*, Kuochen Wang and Yu-Chee Tseng

*National Chiao Tung University, National Chiao Tung University

We propose enhanced cooperative access class barring (ECACB) and traffic adaptive radio resource management (TARRM) for M2M communications over LTE-A. We use the number of Machine-Type Communication (MTC) devices that attach to an eNB, which is the base station of LTE-A, as a criterion to determine the probability that an MTC device may access the eNB. In this way, we can obtain a better set of access class barring parameters than CACB, which is the best available related work, so as to reduce the random access delay experienced by an MTC device or user equipment (UE). After an MTC device successfully accesses an eNB, the eNB allocates radio resources for the MTC device based on the random access rate of the MTC device and the amount of data uploaded or downloaded by the MTC device. In addition, we use the concept from cognitive radio networks that when there are unused physical resource blocks (PRBs) of UEs, the eNB can schedule MTC devices to use these PRBs to enhance network throughput. Simulation results show that the proposed ECACB's average (worst) access delay of UEs is 33.19% (29.89%) lower than CACB's. Its average (worst) access delay of MTC devices is 12.15% (15.1%) lower than that of CACB. Its average (worst) throughput from UEs is 20.93% (26.44%) higher than that of CACB. Its average (worst) throughput from MTC devices is 19.95% (12.25%) higher than that of CACB. The proposed ECACB+TARRM's average (worst) throughput from UEs is 26.16% (31.42%) higher than CACB's. Its average (worst) throughput from MTC devices is 25.11% (20.76%) higher than that of CACB. To the best of our knowledge, no existing approach integrates access class barring with radio resource management for M2M communications over LTE-A.


Green Cooperative Relaying In Multi-Source Wireless Networks with High Throughput and Fairness Provisioning

Kuan-Yu Lin and Kuang-Hao Liu

National Cheng Kung University

Motivated by the urgent need for green communications, this paper investigates energy-efficient cooperative relaying methods for multi-source multi-relay wireless networks. Existing cooperative relaying schemes primarily focus on single-source cooperative networks and aim to maximize diversity gain exploitation, yet ignore the extra energy consumed by relay nodes and fairness between source nodes. Instead, our objective is to minimize relay power consumption and maintain network-wide fairness without throughput penalty. The considered problem includes two parts, namely source scheduling and relay assignment, which are addressed separately. We derive the feasibility condition for the green source-relay assignment problem and show that it is NP-hard. We propose a heuristic algorithm that delivers good performance with low complexity. Simulation results are presented to evaluate the efficacy of the proposed scheme in terms of average throughput, throughput fairness, average relay power consumption, and average outage probability, as compared to two related schemes, under both independent and identically distributed (i.i.d.) and independent and non-identically distributed (i.n.d.) channel configurations.


Sum-Rate Maximization and Energy-Cost Minimization for Renewable Energy Empowered Base-Stations using Zero-Forcing Beamforming

Yung-Shun Wang, Y.-W. Peter Hong and Wen-Tsuen Chen

Academia Sinica Taipei, National Tsing Hua University

Zero-forcing (ZF) beamforming is a practical linear transmission scheme that eliminates inter-user interference in the downlink of a multiuser multiple-input single-output (MISO) wireless system. By considering base-stations (BSs) that are supported by renewable energy, this work examines offline and online ZF beamforming designs based on two different objectives, namely, sum-rate maximization and energy-cost minimization. For offline policies, the channel states and the energy arrivals are assumed to be known a priori for all time instants whereas, in the online policies, only causal information is available. The designs are subject to energy causality and energy storage constraints, i.e., the constraint that energy cannot be used before it arrives and the constraint that the stored energy cannot exceed the maximum battery storage capacity. In the sum-rate maximization problem, the base-station is assumed to be supported only by renewable energy and the goal is to maximize the sum rate over all users by a predetermined deadline. The optimization of the ZF beamforming direction and power allocation can be decoupled, and the solutions can be found exactly. In the energy-cost minimization problem, the base-station is assumed to be supported by both renewable and power-grid energy, and the goal is to minimize the cost of purchasing grid energy subject to quality-of-service constraints at the users. The problem can be formulated as a convex optimization problem and can be solved efficiently using off-the-shelf solvers. Offline solutions are first obtained and the intuitions gained from their results are used to derive effective online policies. The effectiveness of the proposed policies is demonstrated through computer simulations.



OS.28-SLA.9: Speech Recognition (II)

UML-Based Robotic Speech Recognition Development: A Case Study

Abdelaziz A. Abdelhamid and Waleed H. Abdulla

University of Auckland

The development of automatic speech recognition (ASR) systems plays a crucial role in their performance as well as their integration with spoken dialogue systems for controlling service robots. However, to the best of our knowledge, there is no research in the literature addressing the development of ASR systems and their integration with service robots from the software engineering perspective. Therefore, we propose in this paper a set of software engineering diagrams supporting a rapid development of ASR systems for controlling service robots. The proposed diagrams are presented in terms of a case study based on our speech recognition system, called RoboASR. The internal structure of this system is composed of five threads running concurrently to optimally carry out the various speech recognition processes along with the interaction with the dialogue manager of service robots. The diagrams proposed in this paper are presented in terms of the COMET method which is designed for describing practical and concurrent systems.


Temporally variable multi-aspect N-way morphing based on interference-free speech representations

Hideki Kawahara, Masanori Morise, Hideki Banno, and Verena G. Skuk§

Wakayama University, University of Yamanashi, Meijo University, §Friedrich Schiller University of Jena

Voice morphing is a powerful tool for exploratory research and various applications. Temporally variable multi-aspect morphing is extended to enable morphing of arbitrarily many voices in a single-step procedure. The proposed method is implemented based on interference-free representations of periodic signals and is found to yield highly natural-sounding manipulated voices, which are useful for investigating human perception of voice. The formulation of the proposed method is general enough to be applicable to other representations and can be easily modified depending on application needs.


Speech Recognition under Noisy Environments using Multiple Microphones Based on Asynchronous and Intermittent Measurements

Kohei Machida and Akinori Ito

Tohoku University

We propose a robust speech recognition method under noisy environments using multiple microphones based on asynchronous and intermittent observation. In asynchronous and intermittent observation, the noise spectrum is estimated by the environmental noise observed in fragments from multiple microphones, and spectral subtraction is performed by this estimated noise spectrum. In this paper, we consider the case of estimating the noise spectrum from the noise observed by another microphone just before speech input. However, the noise spectrum needs to be compensated because of the difference in the location of the microphone in this case. Then, we examined compensating the noise spectrum by using the estimated LSFL on the log spectrum. By compensating the noise spectrum, the recognition rate improved compared with the case without compensation.


A Neural Understanding of Speech Motor Learning

Xi Chen1, Jianwu Dang1, 2, Han Yan1, Qiang Fang3 and Bernd J. Kröger 4, 1

1Tianjin University, 2Japan Advanced Institute of Science and Technology, 3Chinese Academy of Social Sciences, 4RWTH Aachen University

Speech motor learning is a process still under discussion in neural computational modeling. In this paper, we focus on the relationship between vowel articulation and its muscle activation patterns, propose a neural understanding of speech motor learning, and elucidate the neural strategy for speech learning in infants. An existing physiological model including the speech articulator organs, which has successfully replicated biomechanical articulatory movement, is used. A self-organizing map relating the contour positions of control points to muscle activation patterns was established during speech motor learning. The experimental results point to the one-to-many problem in the mapping from high-level to low-level motor states, which indicates that quite different muscle activation patterns can lead to similar articulatory positions.


Speech Recognition with Large-Scale Speaker-Class-Based Acoustic Modeling

Kazuki Konno, Masaharu Kato and Tetsuo Kosaka

Yamagata University

This paper investigates speaker-independent speech recognition with speaker-class models. In previous studies based on this method, the number of speaker classes was relatively small and it was difficult to improve the performance significantly over the baseline. In this work, as many as 500 speaker-class models are used to enable more precise modeling of speaker characteristics. In order to avoid a lack of training data for each speaker-class model, a soft clustering technique is used in which a training speaker is allowed to belong to several classes. In the recognition experiments, a slight improvement in performance was obtained using a conventional method with several tens of speaker-class models. In contrast, a significant improvement was obtained using an unsupervised soft clustering method with several hundred speaker-class models. In addition, the results indicated a possibility of reducing the error rate drastically if the speaker-class model selection was conducted more effectively.


Confidence Estimation and Keyword Extraction from Speech Recognition Result Based on Web Information

Hara Kensuke, Sekiya Hideki, Kawase Tetsuya, Tamura Satoshi, and Hayamizu Satoru

Gifu University

This paper proposes to use Web information for confidence measurement and keyword extraction for speech recognition results. Spoken document processing has been attracting attention, particularly for information retrieval and video (audiovisual) content systems. For example, measuring a confidence score that indicates how likely a document or a segmented document is to include recognition errors has been studied. It is well known that keyword extraction from recognition results is also an important issue. For these purposes, pointwise mutual information (PMI) between two words is employed in this paper. PMI has been used to calculate a confidence measure of speech recognition, as a coherence measure based on the co-occurrence of words. We propose to further improve the method by using a Web query expansion technique with term triplets, which consist of nouns in the same document. We also apply PMI to keyword estimation by summing a co-occurrence score (sumPMI) between a target keyword candidate and each term. The proposed methods were tested with 10 lectures in the Corpus of Spontaneous Japanese (CSJ) and 2 simulated movie dialogues. The experiments show that the estimated confidence score is highly correlated with recognition accuracy, indicating the effectiveness of our method. In addition, sumPMI scores have higher values for keywords in the subjective tests.
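
The PMI score used in this abstract has the standard definition PMI(x, y) = log [ p(x, y) / (p(x) p(y)) ]. The sketch below estimates it from document counts and sums it over the terms of a document, as one possible reading of sumPMI. The counts are made up, the base-2 logarithm and the skipping of word pairs that never co-occur are assumptions, and the function names are hypothetical.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """PMI in bits, estimated from counts out of `total` documents:
    log2( p(x, y) / (p(x) * p(y)) )."""
    return math.log2(count_xy * total / (count_x * count_y))

def sum_pmi(candidate, terms, pair_counts, word_counts, total):
    """Sum of PMI between a keyword candidate and every other term.
    Pairs that never co-occur are skipped (an assumption of this sketch)."""
    score = 0.0
    for term in terms:
        if term == candidate:
            continue
        count_xy = pair_counts.get(frozenset((candidate, term)), 0)
        if count_xy > 0:
            score += pmi(count_xy, word_counts[candidate], word_counts[term], total)
    return score

# Made-up document counts standing in for Web hit counts.
total_docs = 1000
word_counts = {"speech": 100, "recognition": 50, "weather": 200}
pair_counts = {
    frozenset(("speech", "recognition")): 40,  # co-occur far above chance
    frozenset(("speech", "weather")): 20,      # co-occur exactly at chance
}

score = sum_pmi("speech", ["speech", "recognition", "weather"],
                pair_counts, word_counts, total_docs)
print(score)
```

A word that was misrecognized tends to co-occur with its neighbors no more often than chance, so its PMI contributions stay near zero or below, which is the intuition behind using PMI as a coherence-based confidence measure.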


Estimating The Position of Mistracked Coil of EMA Data Using GMM-based Methods

Qiang Fang1, Jianguo Wei2, Fang Hu1, Aijun Li1, Haibo Wang3

1Institute of Linguistics, 2Tianjin University, 3CASS

Kinematic articulatory data are important for research on speech production, articulatory speech synthesis, robust speech recognition, and speech inversion. The Electromagnetic Articulograph (EMA) is a widely used instrument for collecting kinematic articulatory data. However, in an EMA experiment, one or more coils attached to the articulators may be mistracked for various reasons. To make full use of the EMA data, we attempt to reconstruct the location of mistracked coils with methods based on the Gaussian Mixture Model (GMM). These methods approximate the probability density function of the positions of the concerned coil given the positions of the other coils, and then derive regression functions using the Minimum Mean Square Error (MMSE) and Maximum Likelihood (ML) criteria. The results indicate that: i) the positions of mistracked coils can be reconstructed from the positions of correctly tracked coils with an RMSE between 1 mm and 1.5 mm; ii) the performance can be further improved in most cases by incorporating velocity information.
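The MMSE regression from a joint GMM can be sketched in one dimension as follows. This is a generic textbook form, assuming a scalar observed coordinate x and a scalar missing coordinate y; the component layout and the names are hypothetical, and the paper's actual feature dimensions and ML variant are not shown.

```python
import math

def gmm_mmse_estimate(x, components):
    """MMSE estimate E[y | x] under a joint GMM over scalar (x, y).

    Each component is a dict with weight w, means mx and my,
    variance sxx of x, and covariance sxy between x and y.
    """
    likelihoods = []
    for c in components:
        norm = 1.0 / math.sqrt(2.0 * math.pi * c["sxx"])
        density = norm * math.exp(-0.5 * (x - c["mx"]) ** 2 / c["sxx"])
        likelihoods.append(c["w"] * density)
    total = sum(likelihoods)  # not guarded against underflow for distant x
    estimate = 0.0
    for lik, c in zip(likelihoods, components):
        # Per-component conditional mean, weighted by its posterior probability
        conditional_mean = c["my"] + c["sxy"] / c["sxx"] * (x - c["mx"])
        estimate += (lik / total) * conditional_mean
    return estimate

single = [{"w": 1.0, "mx": 0.0, "my": 1.0, "sxx": 1.0, "sxy": 0.5}]
print(gmm_mmse_estimate(2.0, single))
```

With a single component the estimate reduces to ordinary linear regression; with several components the posterior weights blend the per-component regressions.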



OS.29-SLA.10: Audio Signal Analysis, Processing and Classification (I)

Adaptive Semi-supervised Tree SVM for Sound Event Recognition in Home Environments

Ng Wen Zheng Terence*, Tran Huy Dat*, Huynh Thai Hoa and Chng Eng Siong

*Institute for Infocomm Research, Nanyang Technological University

This paper addresses a problem in sound event recognition, more specifically for home environments in which training data is not readily available. Our proposed method is an extension of our previous method based on a robust semi-supervised Tree-SVM classifier. The key step in this paper is that the MFCC features are adapted using custom filters constructed at each classification node of the tree. This is shown to significantly improve the discriminative capability. Experimental results under realistic noisy environments demonstrate that our proposed framework outperforms conventional methods.


Sparse Coding for Sound Event Classification

Mingming Zhang1,2, Weifeng Li1,2, Longbiao Wang3, Jianguo Wei4, Zhiyong Wu1,2, Qingmin Liao1,2

1Shenzhen Key Lab. of Information Sci&Tech/Shenzhen Engineering Lab. of IS&DRM, 2Tsinghua University, 3Nagaoka University of Technology, 4 Tianjin University

Sound event classification algorithms are generally based on speech recognition methods: feature extraction and model training. To improve classification performance, researchers usually focus on finding more effective sound features or classifiers, which is difficult. In recent years, sparse coding has provided a class of effective algorithms for capturing high-level representations of the input data. In this paper, we present a sound event classification method based on sparse coding and a supervised learning model. Sparse coding coefficients are used as the sound event features to train the classification model. Experimental results demonstrate a clear improvement in sound event classification.


Towards a More Efficient Sparse Coding Based Audio-word Feature Extraction System

Chin-Chia Michael Yeh and Yi-Hsuan Yang

Research Center for Information Technology Innovation

This paper is concerned with the efficiency of sparse coding based audio-word feature extraction systems. In particular, we define the concepts of early and late temporal pooling, add them to the classic sparse coding based audio-word feature extraction pipeline, and test them on the genre tags subset of the CAL10k data set. We define temporal pooling as any function that transforms the input time series representation into a more temporally compact representation. Under this definition, we examine two temporal pooling functions for improving feature extraction efficiency: Early Texture Window Pooling and Multiple Frame Representation. Early texture window pooling tremendously boosts efficiency at the cost of retrieval accuracy, while multiple frame representation slightly improves both feature extraction efficiency and retrieval accuracy. Overall, our best feature extraction setup achieves 0.202 in mean average precision on the genre tags subset of the CAL10k data set.
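One simple function that fits the abstract's definition of temporal pooling is mean pooling over fixed-size windows, sketched below. The window size and the use of the mean are assumptions for illustration only; the paper's Early Texture Window Pooling and Multiple Frame Representation may be defined differently.

```python
def mean_pool(frames, window):
    """Average consecutive groups of `window` frames into one vector each."""
    pooled = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]  # last chunk may be shorter
        dim = len(chunk[0])
        pooled.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return pooled

# Four 2-dimensional frames become two pooled frames
print(mean_pool([[1, 2], [3, 4], [5, 6], [7, 8]], 2))
```

Pooling before the sparse coding stage reduces the number of vectors that must be encoded, which is where the efficiency gain described above would come from.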



A Robust Sound Event Recognition Framework Under TV Playing Conditions

Ng Wen Zheng Terence*, Tran Huy Dat*, Jonathan Dennis* and Chng Eng Siong

*Institute for Infocomm Research, Nanyang Technological University

In this paper, we address the problem of performing sound event recognition tasks in the presence of television playing in a home environment. Our proposed framework consists of two modules: (1) a novel regression-based noise cancellation (RNC) preprocessing step, which utilises an additional reference microphone placed near the television to reduce the noise. RNC learns an empirical mapping instead of using conventional adaptive methods to achieve better noise reduction. (2) An improved subband power distribution image feature (iSPD-IF), which builds on our existing classification framework by enhancing the feature extraction. A comprehensive experiment carried out on our recorded data demonstrates high classification accuracy under severe television noise.


Multimodal Person Authentication System Using Features of Utterance

*Qian Shi, Takeshi Nishino and Yoshinobu Kajikawa

*Kansai University

In this paper, we propose a biometric authentication method using multimodal features in utterance. The multimodal features in utterance consist of lip shape (physical trait), lip motion pattern and voice pattern (behavioral trait). Therefore, the proposed method can be built with only a camera that captures the lip area and the voice, without the special equipment required by other personal authentication methods. Moreover, because the utterance phrase can be set arbitrarily, the phrase itself acts as a key, and phrase recognition increases the robustness of the authentication by rejecting an imposter whose features are similar to those of a registrant. In the proposed method, lip shape and voice features are extracted as edge or texture in the lip image and pitch or spectrum envelope in the voice signal. Experimental results demonstrate that the proposed method improves the authentication accuracy compared with methods based on a single modality.


SmartDJ: An Interactive Music Player For Music Discovery By Similarity Comparison

Maureen S. Y. Aw, Chung Sion Lim, Andy W. H. Khong

Nanyang Technological University

We present a user-friendly method that employs acoustic features to automatically classify songs. This is achieved by extracting low-level features and reducing the feature space using principal component analysis (PCA). The songs are then plotted on a song-space graphical user interface (GUI) for manual or automatic browsing. The similarity between songs is given by the Euclidean distance in this lower-dimensional song space. Using this song space, a prototype application known as "SmartDJ" has been implemented on the MAX/MSP platform. This prototype application enables users to visualize their music library, select songs based on their similarity, or automate the song selection process using a given seed song. We also describe several features of the application, including the smooth mix transition feature, which lets users perform song transitions seamlessly.



OS.30-SPS.4: 3D Video Representation and Coding

Generation of Eye Contact Image Using Depth Camera for Realistic Telepresence

Sang-Beom Lee and Yo-Sung Ho

Gwangju Institute of Science and Technology (GIST)

In this paper, we present an eye contact system for realistic telepresence using a depth camera that utilizes infrared structured light. In order to generate the eye contact image, we capture a pair of color and depth videos and separate the single foreground user from the background. Since the raw depth data includes several types of noise, we apply joint bilateral filtering to reduce the noise. Then, we apply a discontinuity-adaptive depth filter to the depth map to reduce the disocclusion region. From the color image and the preprocessed depth map, we construct a three-dimensional model of the user at the virtual viewpoint. The entire system is implemented through GPU-based parallel programming for real-time processing. Finally, we obtain the gaze-corrected user. Experimental results show that the proposed system is efficient in realizing eye contact and provides a more realistic experience.


Efficient Up-sampling Method of Low-resolution Depth Map Captured by Time-of-Flight Depth Sensor

Yun-Suk Kang and Yo-Sung Ho

Gwangju Institute of Science and Technology (GIST)

In this paper, we propose an efficient up-sampling method for the low-resolution depth map obtained by a time-of-flight (TOF) depth sensor. After we capture the color images and TOF depth maps simultaneously using a fusion camera system, each pixel of the low-resolution depth map is relocated to the corresponding color image position by 3D warping, and the warping error is eliminated in each color segment. Then, we employ a Markov random field (MRF) model to produce a high-resolution disparity map using the warped values and color segment information. Experimental results show that the proposed method up-samples low-resolution depth maps efficiently compared to other up-sampling approaches.



When Disparity meets Distance: HEVC Compression of Double-Faced Depth Map

Yu-Hsun Lin*, Ja-Ling Wu*, Toshihiko Yamasaki† and Kiyoharu Aizawa†

*National Taiwan University, †University of Tokyo

A depth map is inherently double-faced: a single depth map can provide two closely related 3D-vision parameters (i.e., the disparity s and the distance z). Most existing depth map compression techniques consider only one of them at a time, rather than addressing the effect of compression on the two parameters simultaneously. To remedy this shortcoming, a new distortion function is proposed and integrated into the HEVC 3D compression framework to examine the effectiveness with respect to s and z at the same time. Experimental results show that, with the aid of the proposed distortion function, the rate-distortion performance of the distance quality is significantly enhanced (from 0.5 dB to 8 dB), while the corresponding quality of disparity remains almost the same. In addition to the coding performance improvements, we also discovered some interesting phenomena that never occurred in the traditional compression framework. We expect this work to inspire more research on depth map compression.


Super-Resolved Free-Viewpoint Image Synthesis Combined With Sparse-Representation-Based Super-Resolution

Ryo Nakashima, Keita Takahashi and Takeshi Naemura

University of Tokyo, Nagoya University

We consider super-resolved free-viewpoint image synthesis (SR-FVS), where a high-resolution (HR) image that would be observed from a virtual viewpoint is synthesized from a set of low-resolution multi-view images. In previous studies, methods for SR-FVS were proposed on the basis of reconstruction-based super-resolution (RB-SR). RB-SR uses multiple images to synthesize an HR image and thereby can naturally be applied to SR-FVS, where multi-view images are given as the input. However, the quality of the synthesized image depends on observation conditions such as the depth of the target scene, so sometimes the quality of SR-FVS can degrade severely. To mitigate such degradation, we propose integrating learning-based super-resolution (LB-SR), which uses knowledge learned from massive natural images, into the SR-FVS process. In this paper, we adopt sparse coding super-resolution (ScSR) as a LB-SR method and combine ScSR with an existing SR-FVS method.


Joint Texture-Depth Pixel Inpainting of Disocclusion Holes in Virtual View Synthesis

Smarti Reel*, Gene Cheung, Patrick Wong* and Laurence S. Dooley*

*The Open University, Milton Keynes, UK, National Institute of Informatics, Tokyo, Japan

Transmitting texture and depth maps from one or more reference views enables a user to freely choose virtual viewpoints from which to synthesize images for observation via depth-image-based rendering (DIBR). In each DIBR-synthesized image, however, there remain disocclusion holes with missing pixels corresponding to spatial regions occluded from view in the reference images. To complete these holes, unlike previous schemes that rely heavily (and unrealistically) on the availability of a high-quality depth map in the virtual view for inpainting of the corresponding texture map, this paper proposes a new Joint Texture-Depth Inpainting (JTDI) algorithm that simultaneously fills in missing texture and depth pixels. Specifically, we first use available partial depth information to compute priority terms to identify the next target pixel patch in a disocclusion hole for inpainting. Then, after identifying the best-matched texture patch in the known pixel region via template matching for texture inpainting, the variance of the corresponding depth patch is copied to the target depth patch for depth inpainting. Experimental results show that JTDI outperforms two previous inpainting schemes that either do not use available depth information during inpainting or depend on the availability of a good depth map at the virtual view for good inpainting performance.


Overcomplete Compressed Sensing of Ray Space for Generating Free Viewpoint Images

Qiang Yao, Keita Takahashi, Toshiaki Fujii

Nagoya University

The Free Viewpoint Image (FVI) technique has gained great popularity because it enables people to freely choose the viewpoints from which they watch targets and scenes. Generation of an FVI requires all the information of a 3-D virtual space. Ray space is a direct representation of complete 3-D information, and FVIs can be generated simply by cutting ray space. However, in the straightforward construction of a ray space, a very large number of images have to be captured in advance, which causes a data explosion from data acquisition to data storage and transmission. In this paper, we focus on compressed sensing to sparsely capture a ray space at the encoder and reconstruct it at the decoder. Specifically, we propose to adopt overcomplete dictionaries, which are produced by learning methods, to sparsely represent the ray space data. Experimental results show that the proposed dictionary is much better than the structured dictionary in previous work or an orthogonal basis in terms of the reconstruction quality of the ray space from compressively sensed data.


Novel 3D Video Conversion from Down-Sampled Stereo Video

Wun-Ting Lin, Shang-Hong Lai

National Tsing Hua University

Stereo video has become the mainstream 3D video format in recent years due to its simplicity in data representation and acquisition. Under stereo settings, the twin problems of video super-resolution and high-resolution disparity estimation are intertwined. In this paper, we present a novel 3D video conversion system that converts down-sampled stereo video to high-resolution stereo sequences with a Bayesian framework. In addition, we estimate the finer-resolution disparity maps with a two-step CRF model. Our super-resolution system can also be incorporated into the video coding process, which can significantly lower the amount of data while preserving high-quality details. Experimental results demonstrate that our system can enhance image resolution in both the stereo video and the disparity map. Objective evaluation of the proposed video coding scheme combined with super-resolution at different compression ratios also shows the competitive performance of the proposed system for video compression.



OS.31-BioSiPS.2: Biomedical Signal Processing and Systems

Spatial Auditory BCI with ERP Responses to Front-Back to the Head Stimuli Distinction Support

Zhenyu Cai, Shoji Makino, Tomasz M. Rutkowski

University of Tsukuba, RIKEN Brain Science Institute

This paper presents recent results obtained with a new auditory spatial localization based BCI paradigm in which ERP shape differences at early latencies are employed to enhance classification accuracy in an oddball experimental setting. The concept relies on recent results in auditory neuroscience showing the possibility to differentiate early anterior contralateral responses to the spatial sources attended to. We also find that early brain responses indicate which direction, front or rear loudspeaker source, the subject attended to. Contemporary stimuli-driven BCI paradigms benefit most from the P300 ERP latencies in a so-called "aha-response" setting. We show the further enhancement of the classification results in a spatial auditory paradigm, in which we incorporate N200 latencies. The results reveal that these early spatial auditory ERPs boost offline classification results of the BCI application. The offline BCI experiments with the multi-command BCI prototype support our research hypothesis with higher classification results and improved information transfer rates.


Detecting pathological speech using local and global characteristics of harmonic-to-noise ratio

Jung-Won Lee, Hong-Goo Kang, Samuel Kim and Yoonjae Lee

Yonsei University, Digital Media & Communication R&D Center

This paper proposes an efficient feature extraction method for automatic diagnosis systems to detect pathological subjects using continuous speech. Since continuous speech contains slow and rapid adjustments of vocal mechanisms that relate to initiations and terminations of voicing, the proposed algorithm utilizes both localized temporal characteristics and histogram-based global statistics of the harmonic-to-noise ratio (HNR) to efficiently differentiate the key features from phonetic variation. Experimental results show that the proposed method reduces the classification error rate by 11.2% (relative) compared to the conventional method using HNR.


Speech Enhancement for Pathological Voice Using Time-Frequency Trajectory Excitation Modeling

Eunwoo Song*, Jongyoub Ryu† and Hong-Goo Kang*

*Yonsei University, †Digital Media & Communication R&D Center

This paper proposes a speech enhancement algorithm for pathological voices using time-frequency trajectory excitation (TFTE) modeling. The TFTE model can finely control the periodic and non-periodic excitation components by means of a single pitch based decomposition process. By investigating the difference in frequency characteristics between pathological and normal voices, this paper proposes an enhancement algorithm that can efficiently reduce the breathiness of the pathological voice while maintaining the identity of the speaker. Subjective test results are presented to verify the effectiveness of the proposed algorithm.


ECG Baseline Extraction by Gradient Varying Weighting Functions

Ying-Jou Chen a, Jian-Jiun Ding b, Chen-Wei Huang c, Yi-Lwun Ho d, and Chi-Sheng Hung e

Graduate Institute of Communication Engineering, National Taiwan University a,b,c

School of Medicine, National Taiwan University d,e

The electrocardiogram (ECG) signal is important for diagnosing cardiovascular diseases. However, in realistic scenarios, the measured ECG signal is prone to interference from artifacts caused by the respiration and movement of patients. This artifact is called baseline wandering or baseline drifting and will lead to misdiagnosis if it is severe. Thus, pre-processing the measured ECG signal is necessary for a correct diagnosis. In this paper, we propose a robust pre-processing method for extracting the baseline of ECG signals using a gradient varying weighting function. Our approach is adaptive to the input signal and is able to preserve the features of the ECG signal precisely. Simulation results show that our method outperforms other frequently used baseline extraction methods and performs well even if the input ECG signal is severely affected by baseline drifting.


Patients’ consciousness analysis using Dynamic Approximate Entropy and MEMD method

Gaochao Cui*, Yunchao Yin*, Qibin Zhao, Andrzej Cichocki and Jianting Cao*

*Saitama Institute of Technology, Brain Science Institute

Electroencephalography (EEG) based preliminary examination has been proposed for clinical brain death determination. Multivariate empirical mode decomposition (MEMD) and approximate entropy (ApEn) are often used in the EEG signal analysis process. MEMD is an extension of empirical mode decomposition (EMD) that overcomes the problem of the number and frequency of decomposed components, and makes it possible to extract brain activity features from multi-channel EEG simultaneously. ApEn, as a complexity based method, appears to have potential for application to physiological and clinical time series data. In our previous studies, the MEMD method and the ApEn measure were always used separately; if MEMD and ApEn are used to analyze the same EEG signal simultaneously, the experimental results will be more accurate. In this paper, we present a blind test based on the MEMD method and the ApEn measure, conducted without prior knowledge of the clinical symptoms of the patients. Features obtained from two typical cases indicate that one patient is in a coma and the other is in a quasi-brain-death state.
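For readers unfamiliar with ApEn, the sketch below implements the standard approximate entropy ApEn(m, r) of a one-dimensional series, with self-matches included as in the usual definition. It is not the dynamic ApEn variant named in the title, and the parameter values in the usage lines are arbitrary choices for illustration.

```python
import math

def approximate_entropy(series, m, r):
    """Standard ApEn(m, r): regularity of a series; lower means more regular."""
    def phi(length):
        templates = [series[i:i + length] for i in range(len(series) - length + 1)]
        total = 0.0
        for a in templates:
            # Count templates within Chebyshev distance r (self-match included)
            matches = sum(
                1 for b in templates
                if max(abs(u - v) for u, v in zip(a, b)) <= r
            )
            total += math.log(matches / len(templates))
        return total / len(templates)
    return phi(m) - phi(m + 1)

constant = [1.0] * 20          # perfectly regular signal
alternating = [0.0, 1.0] * 10  # periodic signal
print(approximate_entropy(constant, 2, 0.2))
print(approximate_entropy(alternating, 2, 0.2))
```

Both toy signals are highly predictable, so their ApEn values are at or near zero; an irregular signal would give a larger value.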


Hemoglobin Prediction System from Pulse Signal

Khunawat Luangrat1, Prapat Suriyaphol2,3 and Yodchanan Wongsawat1

1 Department of Biomedical Engineering, Faculty of Engineering, Mahidol University, Thailand, 2Faculty of Medicine Siriraj Hospital, Mahidol University, Thailand, 3Center for Emerging and Neglected Infectious Diseases, Mahidol University, Thailand

This paper presents translational research on developing a homemade pulse detector that predicts hemoglobin (which usually requires a blood test) non-invasively. The system uses the peak of each pulse signal from a homemade single infrared LED to calculate the amount of hemoglobin, which is compared with the result from a commercial product.


Spatial Auditory BCI Paradigm based on Real and Virtual Sound Image Generation

Nozomu Nishikawa, Shoji Makino, Tomasz M. Rutkowski

University of Tsukuba, RIKEN Brain Science Institute

This paper presents a novel concept of spatial auditory brain-computer interface utilizing real and virtual sound images. We report results obtained from psychophysical and EEG experiments with nine subjects utilizing a novel method of spatial real or virtual sound images as spatial auditory brain-computer interface (BCI) cues. Real spatial sound sources result in better behavioral and BCI response classification accuracies, yet a direct comparison of partial results in a mixed experiment confirms the usability of the virtual sound images for the spatial auditory BCI. Additionally, we compare stepwise linear discriminant analysis (SWLDA) and support vector machine (SVM) classifiers in a single sequence BCI experiment. The interesting point of the mixed usage of real and virtual spatial sound images in a single experiment is that both stimuli types generate distinct event related potential (ERP) response patterns allowing for their separate classification. This discovery is the strongest point of the reported research and it brings the possibility to create new spatial auditory BCI paradigms.


Classifying P300 Responses to Vowel Stimuli for Auditory Brain-Computer Interface

Yoshihiro Matsumoto, Shoji Makino, Koichi Mori and Tomasz M. Rutkowski

University of Tsukuba, Tsukuba, Japan, Research Institute of National Rehabilitation Center for Persons with Disabilities, RIKEN Brain Science Institute

A brain-computer interface (BCI) is a technology for operating computerized devices based on brain activity and without muscle movement. BCI technology is expected to become a communication solution for amyotrophic lateral sclerosis (ALS) patients. Recently the BCI2000 package application has been commonly used by BCI researchers. The P300 speller included in BCI2000 is an application allowing the calculation of a classifier necessary for the user to spell letters or sentences in a BCI speller paradigm. The BCI speller is based on visual cues and requires muscle activities such as eye movements, which are impossible for patients in a totally locked-in state (TLS), a terminal stage of the ALS illness. The purpose of our project is to solve this problem, and we aim to develop an auditory BCI as a solution. However, contemporary auditory BCI spellers perform much worse than those based on the visual modality. Therefore, improvement is necessary before practical application. In this paper, we focus on an approach related to the differences in responses evoked by various acoustic BCI speller related stimulus types. In spite of the various event related potential waveform shapes, a classifier in the BCI speller typically discriminates only between targets and non-targets, and hence it ignores valuable and possibly discriminative features. Therefore, we expect that the classification accuracy could be improved by using an independent classifier for each of the stimulus cue categories. In this paper, we propose two classifier training methods. The first one uses the data of the five stimulus cues independently. The second method incorporates weighting for each stimulus cue feature in relation to all of them. The results of the reported experiments show the effectiveness of the second method for classification improvement.



OS.32-SIPTM.4: Information Security and Multimedia Applications

Non-linear Learning for Mixture of Gaussians

Chih-Yang Lin1*, Pin-Hsian Liu2, Tatenda Muindisi1, Chia-Hung Yeh2, and Po-Chyi Su3

1Asia University, Taichung, 2National Sun Yat-sen University, 3National Central University

Background modeling plays a key role in event detection in intelligent surveillance systems. The Gaussian Mixture Model (GMM) is a widely used background modeling method in recent surveillance systems. However, the model has some disadvantages when the object moves slowly. In this paper, we propose a mechanism that takes advantage of the Gaussian error function (ERF) to adjust the growth of each Gaussian's weight and variance, in order to solve the problem that the traditional GMM misjudges a slow moving object as background. The mechanism improves the GMM so that it detects slow moving objects accurately and enhances the robustness of surveillance systems.


Abandoned Object Detection in Complicated Environments

Kahlil Muchtar1, Chih-Yang Lin*2, Li-Wei Kang3, and Chia-Hung Yeh1

1National Sun Yat-sen University, 2 Asia University, Taichung, 3National Yunlin University

In video surveillance, tracking-based approaches are very popular, especially for detecting abandoned objects in public areas. Once the object has been tracked, the object status can be further classified as removed or abandoned. However, tracking-based approaches have some shortcomings, e.g. sensitivity to illumination changes and occlusion. Therefore, in this paper, an alternative approach to detecting abandoned objects is proposed by incorporating background modeling and a Markov model. In addition, shadow removal is employed to rectify detected objects and obtain more accurate results. The experimental results show that the proposed scheme is better than other methods in terms of accuracy and correctness.


A Dynamic Mobile Services Access and Payment Platform with Reusable Tickets for Mobile Communication Networks

Hao-Chuan Tsai1 and Hui-Fuang Ng2

1Tzu Chi College of Technology, 2University Tunku Abdul Rahman

Next generation mobile systems provide high speed data transfer rates to mobile devices. It is crucial to address two critical issues together: authentication and roaming in heterogeneous networks. To this end, Lei et al. recently proposed a mobile services access and payment mechanism for next generation mobile systems. Their scheme provides a lightweight service access mechanism in which the computational complexity on mobile devices is low. Unfortunately, we found that their scheme suffers from a personal privacy problem. In this paper, we propose an improved version which is lightweight, practical, and likely to avoid re-initialization. When a user roams in heterogeneous networks, he can utilize the reusable delegation ticket to achieve mutual authentication and nonrepudiation. Our proposed scheme provides better efficiency for users in mobile systems.


An Improved Method for Image Thresholding based on the Valley-Emphasis Method

Hui-Fuang Ng1,2, Davaajargal Jargalsaikhan2, Hao-Chuan Tsai3, Chih-Yang Lin2

1University Tunku Abdul Rahman, 2Asia University, 3Tzu Chi College of Technology

Thresholding is an important technique for image segmentation that extracts a target from its background on the basis of the distribution of gray levels. Many automatic threshold selection methods, such as the Otsu method, provide satisfactory results for thresholding images with an obvious bimodal gray level distribution. However, most threshold selection methods fail if the histogram is unimodal or close to unimodal. The valley-emphasis method partially resolves this problem by weighting the objective function of the Otsu method with the valley point in the histogram. In this study, we propose an approach for improving the valley-emphasis method for optimal threshold selection by introducing a Gaussian weighting scheme to enhance the weighting effect. Experimental results indicate that the proposed method provides better and more stable thresholding results.
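The baseline valley-emphasis idea can be sketched as follows: the Otsu between-class term is multiplied by (1 - p_t), where p_t is the probability of gray level t, so thresholds that sit in a histogram valley are favored. This is a minimal sketch of that baseline under the convention that levels up to and including t form the first class; the Gaussian weighting scheme proposed in the paper is not reproduced, and the toy histogram is hypothetical.

```python
def valley_emphasis_threshold(hist):
    """Return the gray level t maximizing (1 - p_t) * (w1*mu1^2 + w2*mu2^2)."""
    total = sum(hist)
    probs = [h / total for h in hist]
    best_t, best_score = 0, float("-inf")
    for t in range(len(hist) - 1):
        w1 = sum(probs[: t + 1])   # class 1: levels 0..t
        w2 = sum(probs[t + 1:])    # class 2: levels t+1..end
        if w1 == 0.0 or w2 == 0.0:
            continue  # one class is empty, threshold is meaningless
        m1 = sum(i * probs[i] for i in range(t + 1))
        m2 = sum(i * probs[i] for i in range(t + 1, len(hist)))
        between = m1 * m1 / w1 + m2 * m2 / w2  # equals w1*mu1^2 + w2*mu2^2
        score = (1.0 - probs[t]) * between     # valley-emphasis weight
        if score > best_score:
            best_t, best_score = t, score
    return best_t

# Two peaks (levels 1 and 5) separated by an empty valley at level 3
print(valley_emphasis_threshold([5, 10, 5, 0, 5, 10, 5]))
```

On this toy histogram the plain Otsu term ties between levels 2 and 3, and the (1 - p_t) weight breaks the tie in favor of the empty valley bin.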


A Vision-Based Navigation System for Tamsui Historical Buildings

Wen-Chuan Wu1, Zi-Wei Lin1, and Yi-Hui Chen2

1Aletheia University, 2Asia University, Taichung

This paper proposes a vision-based navigation system for Tamsui historical buildings. Most travelers in Tamsui take pictures of the historical buildings using their mobile phone's camera. Our system is designed to automatically extract color and edge features from images, and then search for similar images in an image database by comparing these features. For the retrieved buildings, their contextual navigation information and 3D model are displayed or projected on a screen. This makes the navigation process easier and more interesting. Experimental results show that the proposed scheme retrieves the target building images effectively and accurately.


Process Of Reading and Writing The Tag of The Motor Vehicle Electrical Identification System Based on The RFID Technology

Wu ChangCheng1, Liu DongBo2 and Hu JiaBin3

Traffic management research institute of Public Security Ministry

With the development of intelligent transport systems, the motor vehicle electrical identification system, based on the theory of radio frequency identification, is widely researched as a way to strengthen traffic management. The motor vehicle electrical identification system is mainly composed of the tag, the reader and other components. Usually, the vehicle information, including the license number, the traveler name, the identification number and other personal information, is stored in the tag. Given the radio communication between the tag and the reader, personal security may be compromised if the vehicle information is leaked. Therefore, this paper introduces a new process for reading and writing the tag of the motor vehicle electrical identification system. First, it designs the storage structure of the tag, including the information security area, the controlling area and the information area. Then, it proposes a method for reading and writing each area. Experimental results show that the proposed method can satisfy the requirements of traffic management.



OS.33-IVM.12: Emerging Technologies in Multimedia Communications

Diffie-Hellman Key Distribution in Wireless Multi-Way Relay Networks

Ronald Y. Chang, Sian-Jheng Lin, and Wei-Ho Chung

Academia Sinica

The Diffie-Hellman protocol is a classical secret key exchange protocol for secure communications. This paper considers the extension of the original two-party Diffie-Hellman protocol to multi-party key distribution in wireless multi-way relay networks, where multiple users can communicate with one another only through a single relay. Two efficient key exchange protocols are proposed. A performance comparison with existing methods adapted to relay networks shows the enhanced efficiency and the originality of the proposed protocols, which are designed specifically for multi-way relay networks.
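For orientation, the classical two-party exchange that the paper extends can be sketched as follows. This is only an illustration of the textbook protocol, not of the proposed multi-way relay protocols, which the abstract does not specify. The modulus `P`, the base `G`, and the function names are assumptions chosen for the example, and the parameters are not suitable for real use.

```python
import secrets

# Illustrative public parameters (assumptions, not from the paper).
P = 2**127 - 1  # a Mersenne prime, used here as a toy modulus
G = 3           # assumed base; not verified to generate a large subgroup


def dh_keypair():
    """Return (private, public) where public = G**private mod P."""
    private = secrets.randbelow(P - 3) + 2  # uniform in [2, P - 2]
    return private, pow(G, private, P)


def dh_shared(own_private, peer_public):
    """Derive the shared secret from our private key and the peer's public key."""
    return pow(peer_public, own_private, P)


# Two users exchange only their public values.
a_priv, a_pub = dh_keypair()
b_priv, b_pub = dh_keypair()
secret_a = dh_shared(a_priv, b_pub)
secret_b = dh_shared(b_priv, a_pub)
```

Both sides arrive at the same value because (G^a)^b and (G^b)^a are equal modulo `P`. In a relay setting the public values would pass through the relay rather than directly between users.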


A Facial Skin Changing System

Yu-Ren Lai, Chih-Yuan Yao, Hao-Siang Hu, Ming-Te Chi, and Yu-Chi Lai

National Taiwan University of Science and Technology

This paper presents a novel system to remove facial scars and pores from a portrait. Facial skin complexion and color are important factors in attractiveness, and most people consider that a good portrait should be scar-free and have smooth facial colors. Currently, most commercial digital cameras and smart phones provide facial-skin-beautification functions, but most of them use only simple image processing functions to smooth the captured image and remove small-scale unwanted facial scars and pores; these functions cannot remove large and obvious scars or pores from the face, as shown in Figure 1. Therefore, the main contribution of this system is to allow the user to replace the facial skin of a portrait with the beautiful, scar-free skin of another portrait chosen from a database of scar-free and beautiful skin complexions collected from the web.


A Skeleton-Based Pairwise Curve Matching Scheme for People Tracking in a Multi-Camera Environment

Chien-Hao Kuo (a), Shih-Wei Sun (b, c), and Pao-Chi Chang (a)

(a) National Central University, (b) Taipei National University of the Arts, (c) Taipei National University of the Arts

In this paper, we propose a pairwise curve matching scheme for a multi-camera environment to handle the mis-tracking issue caused by occlusion in a single camera. Using the skeleton/joints of a human subject analyzed by a depth camera (e.g., Kinect), we take the foot points (joints) used for people tracking in each field of view, apply a homography transformation to project the foot points from the different views onto a virtual bird's-eye view, and use a Kalman filter to track people with pairwise curve matching. The contribution of this paper is threefold: (a) the proposed pairwise curve matching scheme can handle occlusion occurring in one of the cameras, (b) the complexity of the proposed scheme is low enough for a real-time implementation, and (c) the implementation on a Kinect camera provides satisfactory tracking results in bright or extremely dark environments, because the skeletons/joints are analyzed by the coded structured light-based infrared (IR) sensor.
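The projection step mentioned in this abstract can be illustrated with a minimal sketch. It assumes a 3x3 homography per camera has already been estimated; the matrix `H_CAM_A` and the function name are hypothetical, and the Kalman filtering and curve matching stages are not shown.

```python
def project_point(H, x, y):
    """Map image point (x, y) through the 3x3 homography H (nested lists)."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the homogeneous coordinate to get bird's-eye-view coordinates.
    return u / w, v / w


# Hypothetical homography for one camera (made-up values for illustration).
H_CAM_A = [
    [1.0, 0.0, 10.0],
    [0.0, 1.0, 20.0],
    [0.0, 0.5, 1.0],
]

# A foot point detected at pixel (4, 2) in camera A.
foot_top_view = project_point(H_CAM_A, 4, 2)
```

Applying each camera's own homography to its foot points places all observations in one common ground-plane frame, which is what makes trajectories from different views comparable.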


An Image-based Postal Barcode Decoder with Missing Bar Correction

Peng-Hua Wang and Jia-Wei Ciou

National Taipei University, National Taipei University

In this paper, we propose an image-based postal barcode decoding algorithm. The proposed algorithm is capable of correcting missing bars in a symbol. Postal barcodes in general belong to bar length modulated symbology. Usually, postal barcodes use a check digit to detect errors. It is possible for some postal code symbologies not only to detect errors, but also to correct them by using the uniqueness of the pattern of a digit. In this paper, we present a general method of correcting errors caused by missing bars. In order to validate our method, a POSTNET (Postal Numeric Encoding Techniques) decoder is implemented. Experimental results show that the proposed algorithm can correct one missing bar in a digit, and at most ten missing bars in a POSTNET symbol.
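A minimal sketch of the correction idea, under stated assumptions: every POSTNET digit is five bars with exactly two tall bars, so one missing bar in a digit is fully determined by the other four. The sketch assumes the position of the missing bar is already known (marked `?`) and that the frame bars have been stripped; how the paper locates missing bars in an image is not described in the abstract. The function names are hypothetical.

```python
# Standard POSTNET digit patterns: '1' is a tall bar, '0' is a short bar.
POSTNET_DIGITS = {
    "11000": 0, "00011": 1, "00101": 2, "00110": 3, "01001": 4,
    "01010": 5, "01100": 6, "10001": 7, "10010": 8, "10100": 9,
}


def decode_digit(bars):
    """Decode five bars; a single '?' marks a bar missing from the image."""
    if len(bars) != 5 or bars.count("?") > 1:
        raise ValueError("need 5 bars with at most one missing")
    if "?" in bars:
        tall = bars.count("1")
        if tall not in (1, 2):
            raise ValueError("uncorrectable digit")
        # Exactly two tall bars per digit, so the missing bar is determined.
        bars = bars.replace("?", "1" if tall == 1 else "0")
    if bars not in POSTNET_DIGITS:
        raise ValueError("invalid bar pattern")
    return POSTNET_DIGITS[bars]


def decode_symbol(groups):
    """Decode a list of 5-bar groups and verify the trailing check digit."""
    digits = [decode_digit(g) for g in groups]
    if sum(digits) % 10 != 0:
        raise ValueError("check digit mismatch")
    return digits[:-1]


# ZIP 12345 with check digit 5; one bar is missing in each of the first two digits.
zip_digits = decode_symbol(["0?011", "00?01", "00110", "01001", "01010", "01010"])
```

The check digit is still verified after correction, so a wrongly located gap that happens to yield a valid digit pattern can often be caught at the symbol level.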


Privacy Image Protection Using Fine-Grained Mosaic Technique

Yi-Hui Chen1, Eric Jui-Lin Lu2, Chu-Fan Wang3

1Asia University, 2National Chung Hsing University, 3National Chung Hsing University

Access control has been applied in multimedia databases to preserve and protect sensitive information. Past research generates authorization rules that control access with fine-grained ability in social photos, meeting photos, and promotional photos. However, such rules are not appropriate in some privacy scenarios (e.g., the growing number of digital images stored and managed by services such as Google Street View). With a low maintenance cost, this paper integrates a data hiding technique into fine-grained access control to mosaic the sensitive information and to allow the mosaicked region to be recovered if necessary. Experimental results confirm the feasibility of the proposed scheme.


Secret Sharing Mechanism with Cheater Detection

Pei-Yu Lin*, Yi-Hui Chen, Ming-Chieh Hsu* and Fu-Ming Juang*

*Yuan Ze University, Asia University

Cheater detection is essential for a secret sharing approach, as it allows the involved participants to detect cheaters during the secret retrieval process. In this article, we propose a verifiable secret sharing mechanism that not only resists dishonest participants but also satisfies the requirements of a larger secret payload and camouflage. The new approach conceals the shadows in pixel pairs of the cover image based on adaptive pixel pair matching. Consequently, the embedding alteration can be reduced to preserve the fidelity of the shadow image. The experimental results show that the proposed scheme can share a large secret capacity and retain superior quality.



OS.34-IVM.13: 3D Image Processing/Compression, Object Tracking, and Augmented Reality

Resolution Adjustable 3D Scanner Based on Using Stereo Cameras

Tzung-Han Lin*

*National Taiwan University of Science and Technology

This paper addresses a stereo-based 3D scanner system that can acquire range data at various resolutions. The system consists of stereo cameras and one slit laser. In each stereo image pair, we cast one laser stripe on the surface of the object and analyze the disparities along the stripe to determine depth values. A super-sampling filter generates sub-pixel features to enhance the native resolution of the CCD sensor. In this system, one slit laser sweeps the surface of objects and generates correspondences under the epipolar constraint. Since the correspondences are generated by the positions of the cast stripes, their resolution is controllable.
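As a rough illustration of how a disparity becomes a depth value, the sketch below uses the standard relation for rectified pinhole stereo cameras, Z = f * B / d. The rectified-camera assumption, the function name, and the numeric values are all illustrative; the abstract does not give the system's calibration model.

```python
def depth_from_disparity(x_left, x_right, focal_px, baseline_m):
    """Depth (meters) of a stripe point seen at column x_left / x_right.

    Assumes rectified cameras, so the match lies on the same image row.
    """
    disparity = x_left - x_right  # in pixels; may be sub-pixel
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the rig")
    return focal_px * baseline_m / disparity


# Sub-pixel stripe positions in the left and right images (made-up values).
z = depth_from_disparity(412.25, 380.25, focal_px=800.0, baseline_m=0.125)
```

Because depth is inversely proportional to disparity, sub-pixel localization of the laser stripe directly refines the depth estimate.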


A Modified Mean Shift Algorithm for Visual Object Tracking

Shu-Wei Chou1, Chaur-Heh Hsieh2, Bor-Jiunn Hwang3, Hown-Wen Chen4

Ming-Chuan University

CamShift is an adaptive version of the Mean Shift algorithm. It has received wide attention as an efficient and robust method for object tracking. However, it is often distracted or disturbed by other, larger objects with similar colors. This paper presents a novel tracking algorithm based on the mean shift framework. Unlike CamShift, which uses the probability density image determined by the color feature, the proposed algorithm employs a probability density image derived from both color and shape features. Experimental results indicate that the proposed algorithm improves robustness without sacrificing computational efficiency, as compared to the conventional CamShift algorithm.


Mobile Application of Interactive Remote Toys with Augmented Reality

Chi-Fu Lin*, Pai-Shan Pa, and Chiou-Shann Fuh*

* National Taiwan University, National Taipei University of Education

Recently, because of the rapid development of mobile devices, augmented reality has extended from personal computers to mobile devices. The highly interactive nature of augmented reality has given rise to various augmented reality applications for mobile devices, ranging from mere interaction to marketing, games, navigation, and so on. As such, augmented reality has for years been one of the focuses of mobile application development. Many physical toys now integrate electronic sensor chips or wireless communication mechanisms, so that the toys are no longer boring and offer new ways to interact. In this study, a human-computer interactive interface between physical toys and virtual objects was designed with the aim of incorporating augmented reality.


An Error Propagation Free Data Hiding Algorithm in HEVC Intra-Coded Frames

Po-Chun Chang, Kuo-Liang Chung, Jiann-Jone Chen, Chien-Hsiung Lin and Tseng-Jung Lin

National Taiwan University of Science and Technology, National Taiwan University of Science and Technology

Efficient data hiding algorithms have been developed for video coders such as MPEG-4 and H.264/AVC to deliver embedded information. Lin et al. proposed an error propagation free discrete cosine transform (DCT) based data hiding algorithm for H.264/AVC intra-coded frames. However, the state-of-the-art video codec, high efficiency video coding (HEVC), adopts both the DCT and the discrete sine transform (DST), so previous DCT based data hiding algorithms cannot fully utilize the capacity available for data hiding under the HEVC framework. We investigate the characteristics of block DCT and DST coefficients to specify the transformed coefficients that can be perturbed without propagating errors to neighboring blocks. Experiments on four test videos of different complexity confirm the efficiency of the proposed algorithm in performing intra-frame error propagation free data hiding, providing higher embedding capacity in low bitrate coding, and yielding better reconstructed video quality.


Dual Edge-Confined Inpainting of 3D Depth Map Using Color Image’s Edges and Depth Image’s Edges

Ming-Fu Hung, Shaou-Gang Miaou*, and Chih-Yuan Chiang

Chung Yuan Christian University, * Chung Yuan Christian University

Currently, most 3D formats adopt the approach of 2D plus depth map. Thus, many papers discuss depth maps, including how to inpaint them. When inpainting, the main focus is on edge areas because poor edges in depth maps result in two problems: (1) there will be holes in 3D images after image synthesis; (2) when other views are synthesized, an edge mismatch may create visual defects at the occluded parts of the images. In this paper, we propose a method to enhance the consistency between a color image and its depth map. Using the edges of the 2D color image and the edges of the depth map, the proposed method finds the areas where the edges do not match and uses a dual edge-confined inpainting technique to inpaint the edge-mismatched areas. The inpainting technique exploits the relationship between edges and objects in the color image and depth map and conducts inpainting directionally: the depth map is inpainted where it is inconsistent with its corresponding color image, improving the quality of the depth map. The experimental results show that the proposed method enhances the edge consistency between the depth map and the corresponding color image. In addition, following the traditional deblocking process in image coding, the proposed method increases the PSNR value of the depth map for QP less than 30 and achieves performance comparable to that obtained with the trilateral filter plus SD algorithm, which also uses color image and depth map information.


Face Recognition Using Sparse Representation with Illumination Normalization and Component Features

Gee-Sern Hsu*, Ding-Yu Lin

National Taiwan University of Science and Technology

We merge illumination normalization and component features into the framework of Sparse Representation-based Classification (SRC) for face recognition across illumination. Unlike most SRC-based face recognition, which constructs a dictionary from a training set with sufficient illumination variation, the proposed method adopts a dictionary built from an illumination-normalized training set. This may be the first attempt to show that illumination normalization can improve the performance of SRC-based face recognition. To further improve the performance, we add schemes that exploit local features and demonstrate their effectiveness. Experiments on the FERET and Multi-PIE databases show that the performance of the proposed method is competitive with the state of the art.



OS.35-IVM.14: Intelligent Multimedia Applications

True Motion Estimation Based on Reliable Motion Decision

Tien-Ying Kuo, Cheng-Hong Hsieh, Yi-Chung Lo, Jian-Hua Wang

National Taipei University of Technology

This paper presents a novel technique to estimate true motion vectors (TMVs) during video encoding. The proposed method designs a system unit to collect the characteristics of motion vectors (MVs) in the video encoder. MVs are then classified as reliable or unreliable based on the collected early-stop distribution of each block. The reliable MVs are directly assigned as TMVs, whereas the unreliable MVs must go through refinement processing for TMV conversion. This algorithm can be easily integrated into existing video encoders. Experimental results show that the proposed method is superior to methods in the literature in terms of either performance or computational complexity.


Clustering User Queries into Conceptual Space

Li-Chin Lee* and Yi-Shin Chen†

*National Tsing Hua University, †National Tsing Hua University

The gap between user search intent and search results is an important issue. Grouping terms according to semantics seems to be a good way of bridging the semantic gap between the user and the search engine. We propose a framework to extract semantic concepts by grouping queries using a clustering technique. To represent and discover the semantics of a query, we utilize a web directory and social annotations (tags). In addition, we build hierarchies among concepts by splitting and merging clusters iteratively. Experimental results show that our framework, which exploits expert wisdom from web taxonomy and crowd wisdom from collaborative folksonomy, is effective.


Complexity Control of Motion Compensation for Video Decoding

Wei-Hsiang Chiou, Chih-Hung Kuo, and Yi-Shian Shie

National Cheng Kung University

This paper proposes a complexity control mechanism for the video encoder to generate a bitstream that fits the power constraint of the decoder. We combine the complexity term of motion compensation with the conventional rate-distortion optimization (RDO). The Lagrange multiplier is updated for each macroblock (MB) to meet the target computing complexity. Experiments show that the proposed method controls computing complexity accurately. The overall average error across the test sequences is 1.20% under constant bit rate constraints.
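One way to picture a complexity-aware RDO is a cost of the form J = D + lambda_R * R + lambda_C * C, minimized over candidate coding modes. The sketch below is an assumption-laden illustration of that idea: the cost form, the proportional multiplier update, the function names, and all numbers are hypothetical and are not taken from the paper.

```python
def choose_mode(candidates, lam_rate, lam_cplx):
    """Pick the candidate with the lowest joint cost.

    candidates: list of (name, distortion, rate_bits, mc_complexity) tuples.
    """
    return min(candidates, key=lambda c: c[1] + lam_rate * c[2] + lam_cplx * c[3])


def update_multiplier(lam_cplx, used, budget, step=0.5):
    """Hypothetical per-macroblock update: raise the multiplier when over budget."""
    return max(0.0, lam_cplx + step * (used - budget) / budget)


# Made-up candidates: finer partitions lower distortion but cost more to decode.
modes = [("16x16", 100.0, 20, 4), ("8x8", 70.0, 30, 16)]

unconstrained = choose_mode(modes, lam_rate=1.0, lam_cplx=0.0)
constrained = choose_mode(modes, lam_rate=1.0, lam_cplx=2.0)
next_lam = update_multiplier(0.0, used=16, budget=8)
```

With the complexity multiplier at zero the encoder picks the finer partition; as the multiplier grows, the cheaper-to-decode mode wins, which is the lever a per-macroblock update would use to track a complexity target.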


A Privacy Protection Scheme in H.264/AVC by Data Hiding

Po-Chyi Su*, Wei-Yu Chen*, Shao-Yu Shiau*, Ching-Yu Wu*, and Addison Y.S. Su†

*National Central University, †National Central University

In this research, a privacy protection mechanism for H.264/AVC videos is proposed. The sensitive or private visual information in frames, which should not be viewable by the general public or regular users, is scrambled by directly modifying or removing the related data in H.264/AVC compressed bitstreams. To allow authorized users to recover the partially scrambled video frames, information hiding is employed; that is, the correct information is embedded and transmitted along with the video bitstream. After retrieving the data, authorized users can descramble the protected areas in frames. Experimental results show that the partial scrambling can be achieved effectively and that the size of the resulting video is kept under good control.


The Modification of Beat to Beat Algorithm and its Application on the Assessment of Muscle Flexibility

Jian-Guo Bau* and Yung-Hui Li

* Hungkuang University, National Central University

Laser-Doppler Flowmetry (LDF), which is described by theory related to the Doppler effect, is one of the most convenient measurement techniques for routine tissue perfusion assessment. Although the LDF signal does not reveal a significant waveform at the heart-beat frequency, its "waveform" can be obtained by the beat to beat algorithm, in which the R peaks of the electrocardiograph (ECG) are used as a reference. Nevertheless, because the period of the heart beat varies slightly over time even for the same person, the segments of the LDF signal divided according to the ECG peaks do not have exactly the same length. In this study, we modified the beat to beat algorithm by resampling the LDF segments and normalizing them to the same length. Using the modified beat to beat algorithm, the characteristics of LDF were compared between individuals with different flexibility of the lower extremities. The LDF flux of individuals with higher flexibility showed higher stability during muscle stretching, while the blood flux of individuals with lower flexibility was disturbed and became unstable during muscle stretching. We conclude that the modified beat to beat algorithm will be a useful tool for the analysis of LDF signals.
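The length normalization described here can be sketched as follows. The sketch assumes linear interpolation as the resampling method and takes R-peak positions as sample indices into the LDF signal; the abstract does not state which resampling method the authors used, and the function names are hypothetical.

```python
def resample_linear(segment, n_out):
    """Linearly interpolate a list of samples onto n_out evenly spaced points."""
    if len(segment) < 2 or n_out < 2:
        raise ValueError("need at least two input and two output samples")
    scale = (len(segment) - 1) / (n_out - 1)
    out = []
    for i in range(n_out):
        pos = i * scale
        lo = min(int(pos), len(segment) - 2)  # keep lo + 1 inside the segment
        frac = pos - lo
        out.append(segment[lo] * (1 - frac) + segment[lo + 1] * frac)
    return out


def beat_average(ldf, r_peaks, n_out=100):
    """Cut ldf at consecutive R-peak indices, equalize lengths, and average."""
    beats = [resample_linear(ldf[a:b], n_out) for a, b in zip(r_peaks, r_peaks[1:])]
    return [sum(column) / len(beats) for column in zip(*beats)]


# Two beats of different duration (3 and 5 samples) carrying the same ramp shape.
signal = [0.0, 2.0, 4.0, 0.0, 1.0, 2.0, 3.0, 4.0]
waveform = beat_average(signal, r_peaks=[0, 3, 8], n_out=5)
```

Once every beat has the same number of samples, point-by-point averaging across beats becomes well defined, which is what allows a representative LDF waveform to be extracted.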


Happiness Detection in Music Using Hierarchical SVMs with Dual Types of Kernels

Yu-Hao Chin, Chang-Hong Lin, Ernestasia Siahaan, and Jia-Ching Wang

National Central University

In this paper, we propose a novel system for detecting the happiness emotion in music. Two emotion profiles are constructed using decision values from support vector machines (SVMs), based on short-term and long-term features respectively. When short-term features are used to train the models, the SVM kernel is the probability product kernel; when the input features are long-term, the SVM kernel is the RBF kernel. Each SVM model is trained from a raw feature set comprising rhythm, timbre, and tonality features. Each SVM is trained on a targeted emotion class, with calm emotion as the background class, to obtain the respective hyperplanes. With the eight hyperplanes trained from angry, happy, sad, relaxed, pleased, bored, nervous, and peaceful, each test clip can output four decision values, which are then regarded as the emotion profile. The two profiles are fused to train SVMs. The final decision value is then extracted to draw the DET curve. Experimental results show that the proposed system performs well on music emotion recognition.



OS.36-IVM.15: Advanced Image and Video Processing

Scale-Compensated Nonlocal Mean Super Resolution

Qiaochu Li, Qikun Guo, Saboya Yang and Jiaying Liu*

Peking University

In this paper, we propose a novel algorithm for multi-frame super resolution (SR) that takes scale changes between frames into account. First, we detect the scale of each frame with a scale detector. Based on the scale gap between adjacent frames, we extract patches and convert them from different scales to the same scale to obtain more redundant information. Finally, a reconstruction approach based on patch matching is applied to generate a high resolution (HR) frame. Compared to the original Nonlocal Means SR (NLM SR), the proposed Scale-Compensated NLM finds more potentially similar patches at different scales, which are easily neglected in NLM SR. Experimental results demonstrate better performance of the proposed algorithm in both objective measurement and subjective perception.


Stitching of Heterogeneous Images Using Depth Information

Jun-Tae Lee, Jae-Kyun Ahn, and Chang-Su Kim

Korea University

We propose a novel heterogeneous image stitching algorithm, which employs disparity information as well as color information. It is challenging to stitch heterogeneous images that have different background colors and diverse foreground objects. To overcome this difficulty, we set the criterion that objects should preserve their shapes in the stitched image. To satisfy this criterion, we derive an energy function using color and disparity gradients. As the gradients are highly correlated with object boundaries, we can find from the energy function the optimal seam along which the two images are pasted. Moreover, we develop a retargeting scheme to further reduce the size of the stitched image. Experimental results demonstrate that the proposed algorithm is a promising tool for stitching heterogeneous images.


Cross-layer Optimized Multipath Video Streaming over Heterogeneous Wireless Networks

Oh Chan Kwon* and Hwangjun Song


In this work, we propose a cross-layer optimized multipath video streaming system over heterogeneous wireless networks. The proposed system uses multiple paths over wireless mobile networks to satisfy the quality-of-service requirements of seamless high-quality video streaming services and adopts fountain codes to handle the effect of lost packets over wireless networks efficiently. Moreover, a cross-layer approach is used to determine the code rate of the fountain codes efficiently. Finally, the proposed system is implemented and tested in real environments.


Self-Similarity Based Image Super-Resolution on Frequency Domain

Sae-Jin Park, Oh-Young Lee, Jong-Ok Kim

Korea University

Self-similarity has been widely exploited for image super resolution in recent years. An image is decomposed into LF (low frequency) and HF (high frequency) components, and similar patches are searched for in the LF domain across the pyramid scales of the original image. Once a similar LF patch is found, the LF is combined with the corresponding HR patch, and we reconstruct the HR (high resolution) version. In this paper, we separately search for similar LR and HR patches in the LF and HF domains, respectively. In addition, self-similarity based SR is applied to the new structure-texture domain instead of the existing LF and HF. Experimental results show that the proposed method outperforms several conventional SR algorithms based on self-similarity.


Robust Feature Description and Matching Using Local Graph

Man Hee Lee and In Kyu Park

Inha University

Feature detection and matching are essential parts of most computer vision applications. Many researchers have developed algorithms that achieve good performance, such as SIFT (Scale-Invariant Feature Transform) and SURF (Speeded Up Robust Features). However, they usually fail when the scene has considerable out-of-plane rotation because they focus only on in-plane rotation and scale invariance. In this paper, we propose a novel feature description algorithm based on local graph representation and graph matching, which is more robust to out-of-plane rotation. The proposed local graph encodes the geometric correlation between neighboring features. In addition, we propose an efficient score function to compute the matching score between local graphs. Experimental results show that the proposed algorithm is more robust to out-of-plane rotation than conventional algorithms.


Human Segmentation Algorithm for Real-time Video-call Applications

Seon Heo*, Hyung Il Koo†, Hong Il Kim‡, Nam Ik Cho*

*Seoul National University, †Ajou University, ‡Samsung Electronics Co. Ltd

This paper presents a human region segmentation algorithm for real-time video-call applications. Unlike conventional methods, the segmentation process is automatically initialized and the motion of cameras is not restricted. To be precise, our method is initialized by face detection results, and human/background regions are modeled with spatial color Gaussian mixture models (SCGMMs). Based on the SCGMMs, we build a cost function considering spatial and color distributions of pixels, region smoothness, and temporal coherence. Here, the temporal coherence term yields stable segmentation results. The cost function is minimized by the well-known graph cut algorithm, and we update our SCGMMs with the segmentation results. Experimental results show that our method yields stable segmentation results with a small computational load.



OS.37-IVM.16: Image Enhancement and Restoration

Fast Image Alignment with Fourier Moment Matching on GPU

Hong-Ren Su1, Hao-Yuan Kuo1, Shang-Hong Lai1,2 and Chin-Chia Wu3

1 National Tsing Hua University, 2 National Tsing Hua University, 3 Industrial Technology Research Institute

In this paper, we develop a fast and accurate image alignment system that can be applied to image sequences in real time. The proposed image alignment system consists of two main components: a Fourier moment matching method and its implementation on a GPU. Fourier moment matching efficiently finds the location, orientation, and size of the template in an input image. The GPU implementation speeds up the computation of Fourier moment matching so that the image alignment system achieves real-time computation.


Efficient Depth Map Recovery Using Concurrent Object Boundaries in Texture and Depth Images

Se-Ho Lee*, Tae-Young Chung*, Jae-Young Sim†, and Chang-Su Kim*

*Korea University, †Ulsan National Institute of Science and Technology

An efficient depth map recovery algorithm, using concurrent object boundaries in texture and depth signals, is proposed in this work. We first analyze the effects of a distorted depth map on the quality of synthesized views. Based on the analysis, we propose an object boundary detection scheme to restore sharp boundaries from a distorted depth map. Specifically, we initially estimate object boundaries from a depth map using the gradient magnitude at each pixel. We then multiply the gradient magnitudes of texture and depth pixels. Next, we suppress boundary pixels with non-maximum magnitudes and refine the object boundaries. Finally, we filter depth pixels along the gradient orientations using a median filter. Experimental results show that the proposed algorithm significantly improves the quality of synthesized views, as compared with conventional algorithms.


A New Non-Uniform Quantization Method Based on Distribution of Compressive Sensing Measurements and Coefficients Discarding

Fengtai Zhai, Song Xiao and Lei Quan

Xidian University

Compressive sensing (CS) is a new method of sampling and compression that has great advantages over previous signal compression techniques. However, its compression ratio is relatively low compared with most current coding standards, which means a good quantization method is very important for CS. In this paper, a new method of non-uniform quantization is proposed based on the distribution of CS measurements and coefficient discarding. First, the magnitude of the CS measurements is estimated and the low probability measurements are discarded because of their high quantization error. It should be noted that the dropped measurements have almost no effect on the recovery quality because of the equal-weight property of CS samples. Then a nonlinear quantization function based on the distribution of the sensed samples is proposed, by which the remaining measurements are quantized. The experimental results show that the proposed method clearly improves the quality of the reconstructed image compared with previous methods at the same sampling rate and across different reconstruction algorithms.


A Fixed-Point Tone Mapping Operation for HDR Images in the RGBE Format

Toshiyuki DOBASHI*, Tatsuya MUROFUSHI*, Masahiro IWAHASHI† and Hitoshi KIYA*

*Tokyo Metropolitan University, †Nagaoka University of Technology

A fixed-point tone mapping operation (TMO) is proposed in this paper in place of a floating-point TMO. A TMO generates a low dynamic range (LDR) image from a high dynamic range (HDR) image. Since pixel values of an HDR image are generally expressed in a floating-point data format, e.g. RGBE and OpenEXR, a TMO is also implemented in floating-point arithmetic in conventional approaches. However, this incurs a huge computational cost, even though the resulting LDR image is expressed in simple integers. The proposed TMO is implemented with only fixed-point arithmetic. As a result, the proposed method reduces the computational cost. Experimental results show that the PSNR values of LDR images produced by the proposed method are comparable to those of the conventional methods.


Automatic Trimap Generation for Digital Image Matting

Chang-Lin Hsieh and Ming-Sui Lee

National Taiwan University

Digital image matting has been one of the most popular topics in image processing in recent years. For most matting methods, the trimap serves as one of the key inputs, and the accuracy of the trimap strongly affects the matting result. Most existing works did not pay much attention to acquiring a trimap; instead, they assumed that the trimap was given, meaning the matting process usually involved user input. In this paper, an automatic trimap generation technique is proposed. First, the contour of the segmentation result is dilated to get an initial guess of the trimap, followed by alpha estimation. Then, a smart brush with dynamic width is applied by analyzing the structure of the foreground object to generate another trimap. In other words, the brush size is enlarged if the object boundary contains fine details such as hair or fur. Conversely, the brush size gets smaller if the contour of the object is just a simple curve or straight line. Moreover, by combining the trimap obtained in step one and downsampling the image, the uncertainty is defined as the blurred region, and the third trimap is formed. The final step is to combine these three trimaps by voting. The experimental results show that the trimap generated by the proposed method effectively improves the matting result. Moreover, the improved accuracy of the trimap reduces the regions to be processed, so that the matting procedure is accelerated.



OS.38-WCN.2: Wireless Communications and Networking

Multi-Decision Handover Mechanism over Wireless Relay Networks

Chung-Nan Lee*, Shin-Hung Lai*, Szu-Cheng Shen* and Shih-I Chen&

* National Sun Yat-sen University, &The Institute for Information Industry

With the popularity of wireless networks, users' mobility across different base stations must be supported; hence, the handover mechanism becomes an important issue. When a user moves frequently between two cells, the ping-pong effect occurs, increasing the delay and reducing system efficiency. In this paper, we propose a new handover mechanism over relay networks to reduce unnecessary handovers. It uses the signal to interference and noise ratio (SINR) and a distance parameter to make the handover decision. Simulation results indicate that, in the best case, the proposed handover mechanism reduces the number of handovers by more than 8% on average compared with the competing method.


Sectorization with Beam Pattern Design Using 3D Beamforming Techniques

Chang-Shen Lee, Ming-Chun Lee, Chung-Jung Huang, and Ta-Sung Lee

National Chiao Tung University

This paper presents a framework for a three-dimensional (3D) beam pattern design with a load-balanced cell sectorization strategy. First, characteristics of the 3D beam pattern were observed, and convex optimization was used to provide a solution based on the criteria that describe the observations. In addition, a user equipment (UE) determination problem that arose because of cell sectorization was also addressed. By incorporating a beam pattern design with a proposed UE determination scheme, a framework of a backward compatible system was completed. The network performance regarding throughput for cell sectorization with a dedicated beam pattern design was evaluated and the simulation results show that the proposed system is superior compared to the conventional system and other proposed sectorization schemes. Index Terms: 3D beamforming, beam pattern design, cell sectorization, massive MIMO


Cell Selection Using Distributed Q-Learning in Heterogeneous Networks

Toshihito Kudo and Tomoaki Ohtsuki

Keio University

Cell selection with cell range expansion (CRE), a technique that virtually expands a pico cell range by adding a bias value to the pico received power instead of increasing the transmit power of the pico base station (PBS), can improve coverage, cell-edge throughput, and overall network throughput. Many studies on CRE have used a common bias value among all user equipments (UEs), while the optimal bias values that minimize the number of UE outages vary from one UE to another. The optimal bias value depends on several factors, such as the dividing ratio of radio resources between macro base stations (MBSs) and PBSs, and can be found only by trial and error. In this paper, we propose a cell selection scheme based on the Q-learning algorithm, in which each UE independently learns from its past experience which cell to select to minimize the number of UE outages. Simulation results show that, compared to the practical common bias value setting, the proposed scheme reduces the number of UE outages and improves network throughput in most cases. Moreover, at the cost of some performance degradation, it also solves the storage problem of our previous work
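
The biased cell selection rule that CRE relies on can be sketched as follows. This is only a minimal illustration of adding a bias to the pico received power before picking the strongest cell; it is not the Q-learning scheme of the paper. The function name `select_cell`, the power values, and the 8 dB bias are assumptions chosen for the example.

```python
from typing import Sequence


def select_cell(rx_power_dbm: Sequence[float],
                is_pico: Sequence[bool],
                bias_db: float) -> int:
    """Return the index of the cell with the highest biased received power.

    The bias is added only to pico cells, which virtually expands their
    range without changing their transmit power.
    """
    best_index, best_score = -1, float("-inf")
    for index, (power, pico) in enumerate(zip(rx_power_dbm, is_pico)):
        score = power + (bias_db if pico else 0.0)
        if score > best_score:
            best_index, best_score = index, score
    return best_index


# Hypothetical measurements: cell 0 is a macro cell, cell 1 is a pico cell.
rx_power_dbm = [-70.0, -76.0]
is_pico = [False, True]

no_bias_choice = select_cell(rx_power_dbm, is_pico, bias_db=0.0)
biased_choice = select_cell(rx_power_dbm, is_pico, bias_db=8.0)
```

Without a bias the UE attaches to the stronger macro cell; with the assumed 8 dB bias the pico cell wins, which is the offloading effect that a per-UE bias would tune.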


Robust Wi-Fi Location Fingerprinting Against Device Diversity Based on Spatial Mean Normalization

Chu-Hsuan Wang*, Tai-Wei Kao*, Shih-Hau Fang*, Yu Tsao†, Lun-Chia Kuo‡, Kao Shih-Wei‡, and Nien-Chen Lin‡

*Yuan-Ze University, †Academia Sinica, ‡Industrial Technology Research Institute

Received signal strength (RSS) in Wi-Fi networks is commonly employed in indoor positioning systems; however, device diversity is a fundamental problem in such RSS-based systems. Variation in hardware is inevitable in the real world due to the tremendous growth in recent years of new Wi-Fi devices, such as iPhones, iPads, and Android devices, which is expected to continue. Different Wi-Fi devices report different RSS values even at a fixed location, thus degrading localization performance significantly. This study proposes an enhanced approach, called spatial mean normalization (SMN), to design localization systems that are robust against heterogeneous devices. The main idea of SMN is to remove the spatial mean of RSS to compensate for the shift effect resulting from device diversity. The proposed algorithm was evaluated in an indoor Wi-Fi environment, where realistic RSS measurements were collected with heterogeneous laptops and smartphones. Experimental results demonstrate the effectiveness of SMN and show that it outperforms previous positioning features for heterogeneous devices
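
The idea of removing a mean to cancel a device-dependent shift can be illustrated with a small sketch. It assumes that the spatial mean is taken over the access points observed in one scan and that device diversity acts as a constant additive offset in dB; both are assumptions made for this example, and the abstract does not give the exact definition used in the paper.

```python
def spatial_mean_normalize(rss_dbm):
    """Subtract the mean RSS of one scan from every access-point reading."""
    mean = sum(rss_dbm) / len(rss_dbm)
    return [value - mean for value in rss_dbm]


# Hypothetical scans taken at the same location by two devices.
# Device B is assumed to report every reading 7 dB lower than device A.
device_a = [-50.0, -60.0, -70.0]
device_b = [value - 7.0 for value in device_a]

norm_a = spatial_mean_normalize(device_a)
norm_b = spatial_mean_normalize(device_b)
```

Under the additive-offset assumption both normalized vectors coincide, so a fingerprint database built with one device remains usable with the other.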


Hierarchical Multi-stage Interference Alignment for Downlink Heterogeneous Networks

Tomoki AKITAYA and Takahiko SABA

Chiba Institute of Technology

Intercell interference (ICI) is inevitable in downlink heterogeneous networks because user terminals (UTs) often receive signals transmitted from different base stations (BSs) at the same time. To mitigate ICI, hierarchical interference alignment (HIA) has been proposed. HIA is based on interference alignment (IA), which aligns the interference signals within a reduced dimensional subspace at each UT by multiplying the signals to be transmitted from each BS by a transmit beamforming matrix. However, HIA can be applied only to a network in which two picocells are placed within a macrocell. In this paper, we propose hierarchical multi-stage interference alignment (HMIA) to remove this restriction. In HMIA, by dividing the aligning process into multiple stages, every transmit beamforming matrix can be calculated in closed form. Furthermore, since the alignment is carried out in descending order of signal strength, strong interference signals are aligned preferentially. Simulation results show that HMIA successfully deals with networks in which more than three picocells are placed within a macrocell, although there is a slight loss in per-user capacity


Efficient Data-gathering using Graph-based Transform and Compressed Sensing for Irregularly Positioned Sensors

Sungwon Lee and Antonio Ortega

University of Southern California

In this work, we propose a decentralized approach for energy-efficient data-gathering in a realistic scenario. We address a major limitation of compressed sensing (CS) approaches proposed to date for wireless sensor networks (WSNs), namely, that they work only on a regular grid tightly coupled to the sparsity basis. Instead, we assume that sensors are irregularly positioned in the field and do not assume that the sparsifying basis is known a priori. Under the assumption that the sensor data is smooth in space, we propose to use a graph-based transform (GBT) to sparsify the sensor data measured at randomly positioned sensors. We first represent the random topology as a graph and then construct the GBT as a sparsifying basis. With the GBT, we propose a heuristic design of the data-gathering in which aggregations happen at the sensors with fewer neighbors in the graph. In our simulations, the proposed approach shows better performance in terms of total power consumption for a given reconstruction MSE, as compared to other CS approaches proposed for WSNs



OS.39-SLA.11: Expressive Talking Avatar Synthesis and Applications

Speech Driven Photo-Realistic Face Animation with Mouth and Jaw Dynamics

Ying He1,2, Yong Zhao1,2, Dongmei Jiang1,2, Hichem Sahli3,4

VUB-NPU Joint Research Group on AVSP, 1. Northwestern Polytechnical University, 2. Shaanxi Provincial Key Laboratory of Speech and Image, 3. VUB-NPU Joint Research Group on AVSP, 4. Interuniversity Microelectronics Centre IMEC, Vrije Universiteit Brussel (VUB)

This paper proposes a system that transforms speech waveforms into photo-realistic, speech-synchronized talking face animations. We expand the multi-modal diviseme unit selection based mouth animation system of [8] to a full photo-realistic facial animation system based on (i) modeling of the non-rigid deformations of the mouth and jaw via a general regression neural network, (ii) a multi-resolution image blending approach for fusing the synthesized mouth image into the full face image, and (iii) synthesis of natural head poses or deflections using a modified version of generalized Procrustes analysis for face image alignment. The paper describes the main principles of the proposed method and illustrates its results on a set of test speech sequences, together with qualitative and quantitative comparisons with results from the well-known Video Rewrite system. Experimental results show that the proposed method obtains realistic facial animations with very natural mouth and jaw movements synchronized with the input speech


TalkingAndroid: An Interactive, Multimodal and Real-time Talking Avatar Application on Mobile Phones

Huijie Lin1,2,3, Jia Jia1,2,3, Xiangjin Wu3 and Lianhong Cai1,2,3

1Ministry of Education, 2Tsinghua National Laboratory for Information Science and Technology (TNList), 3Tsinghua University

In this paper, we present a novel interactive, multimodal and real-time 3D talking avatar application on mobile platforms. The application is based on a novel network-independent, stand-alone framework using cross-platform JNI and the OpenGL ES library. In this framework, we implement the audio synthesis, facial animation rendering and audio-visual synchronization process on the mobile client using the native APIs to optimize rendering performance and power consumption. We also utilize the existing interactive APIs on mobile devices to extend the usability of the application. Experimental results show that the proposed framework runs smoothly on current mobile devices with real-time multimodal interaction. Compared to the traditional video streaming method and the client-server framework, the proposed framework has much lower network requirements, much shorter interaction delay and more efficient power consumption. The presented application can be used in entertainment, education and many other interactive areas


Personalized 3-D Facial Expression Synthesis based on Landmark Constraint

Haoran Liang*, Mingli Song†, Lei Xie‡ and Ronghua Liang*

*Zhejiang University of Technology, †Zhejiang University, ‡Northwestern Polytechnical University

With the development of computer technology, 3-D facial expression synthesis has become an important and challenging task in the field of computer animation. Since the faces generated by previous works lack personalization, we propose a novel approach for 3-D facial expression synthesis based on nonlinear learning. Firstly, a pre-processing alignment is performed for input 2-D or 3-D faces with landmarks based on cylindrical mapping, and the intrinsic representations of faces are generated using a radial basis function network. Secondly, according to the low-dimensional representations of the input faces, reconstruction operations are carried out to synthesize 3-D face expressions by sharing linear combination coefficients. Finally, the output 3-D face expressions are further optimized by their corresponding landmarks in both 2-D and 3-D spaces using locality-constrained linear coding. The experimental results indicate the robustness and effectiveness of our facial expression synthesis approach


A Real-time Speech Driven Talking Avatar based on Deep Neural Network

Kai Zhao*, Zhiyong Wu* and Lianhong Cai*

*Tsinghua-CUHK Joint Research Center for Media Sciences, Technologies and Systems, Shenzhen Key Laboratory of Information Science and Technology, Graduate School at Shenzhen, Tsinghua University, Tsinghua University

This paper describes our initial work in developing a real-time speech driven talking avatar system with a deep neural network. The input of the system is acoustic speech and the output is the articulatory movements (synchronized with the input speech) on a 3-dimensional avatar. The mapping from the input acoustic features to the output articulatory features is achieved by a deep neural network (DNN). Experiments on the well-known acoustic-articulatory English speech corpus MNGU0 demonstrate that the proposed audio-visual mapping method based on DNN can achieve good performance



OS.40-SLA.12: Toward High-Performance Real-World ASR Applications

Feature Normalization Using MVAW Processing for Spoken Language Recognition

Chien-Lin Huang, Shigeki Matsuda, and Chiori Hori

National Institute of Information and Communications Technology

This study presents a noise-robust front-end post-processing technology. After cepstral feature analysis, feature normalization is usually applied for noise reduction in spoken language recognition. We investigate a highly effective MVAW processing based on standard MFCC and SDC features on NIST-LRE 2007 tasks. The procedure includes mean subtraction, variance normalization, auto-regression moving-average filtering and feature warping. Experiments were conducted on a common GMM-UBM system. The results indicated significant improvements in recognition accuracy.
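
The four steps named above can be sketched for a single feature dimension as follows. This is a minimal illustration under several assumptions that are not stated in the abstract: statistics are computed over the whole utterance rather than a sliding window, the ARMA filter has order 2 and averages past outputs with current and future inputs, and warping maps ranks to standard-normal quantiles. All function names are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def mean_variance_normalize(track):
    """Mean subtraction and variance normalization (track must not be constant)."""
    mu, sigma = mean(track), pstdev(track)
    return [(x - mu) / sigma for x in track]


def arma_smooth(track, order=2):
    """Average `order` past outputs with the current and `order` future inputs.

    Frames too close to either edge are copied unchanged.
    """
    out = list(track)
    for t in range(order, len(track) - order):
        past_outputs = sum(out[t - order:t])
        current_and_future = sum(track[t:t + order + 1])
        out[t] = (past_outputs + current_and_future) / (2 * order + 1)
    return out


def feature_warp(track):
    """Replace each value by the standard-normal quantile of its rank."""
    n = len(track)
    ranked = sorted(range(n), key=lambda i: track[i])
    warped = [0.0] * n
    for rank, i in enumerate(ranked):
        warped[i] = NormalDist().inv_cdf((rank + 0.5) / n)
    return warped


def mvaw(track, order=2):
    """Apply the M, V, A and W steps in sequence to one feature dimension."""
    return feature_warp(arma_smooth(mean_variance_normalize(track), order))


# Hypothetical trajectory of one cepstral coefficient over eight frames.
track = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 7.0]
normalized = mean_variance_normalize(track)
processed = mvaw(track)
```

In practice each cepstral dimension would be processed independently, and warping is commonly applied over a sliding window of a few seconds rather than the whole utterance.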


An Experimental Study on Structural-MAP Approaches to Implementing Very Large Vocabulary Speech Recognition Systems for Real-World Tasks

I-Fan Chen1, Sabato Marco Siniscalchi1,2, Seokyong Moon3, Daejin Shin3, Myong-Wan Koo4, Minhwa Chung5, and Chin-Hui Lee1

Georgia Institute of Technology, Cittadella Universitaria

In this paper we present an experimental study exploiting structural Bayesian adaptation for handling potential mismatches between training and test conditions for real-world applications to be realized in our multilingual very large vocabulary speech recognition (VLVSR) system project sponsored by MOTIE (The Ministry of Trade, Industry and Energy), Republic of Korea. The goal of the project is to construct a nationwide VLVSR cloud service platform for mobile applications. Besides system architecture design issues, at such a large scale, performance robustness problems caused by mismatches in speakers, tasks, environments, and domains also need to be taken into account very carefully. We decided to adopt adaptation techniques, especially structural MAP, to reduce the system accuracy degradation caused by these mismatches. As part of an ongoing project, we describe how structural MAP approaches can be used for adaptation of both acoustic and language models for our VLVSR systems, and provide experimental results to demonstrate how adaptation can be utilized to bridge the performance gap between the current state-of-the-art and deployable VLVSR systems.


Frequency-domain Dereverberation on Speech Signal using Surround Retinex

Mingming Zhang1,2, Weifeng Li1,2, Longbiao Wang3, Jianguo Wei4, Zhiyong Wu1,2, Qingmin Liao1,2

1Shenzhen Key Lab. of Information Sci&Tech/Shenzhen Engineering Lab. of IS&DRM, 2Tsinghua University, 3Nagaoka University of Technology, 4Tianjin University

In this paper, we propose a novel and practical single-channel dereverberation scheme, which utilizes a surround retinex model in the frequency domain. A dereverberation filter is derived by the proposed method for suppressing environmental reflections. The proposed algorithm can achieve effective dereverberation with a more reasonable computational complexity than conventional methods. Experimental results also reveal an improvement in automatic speech recognition (ASR) performance even in severely reverberant environments.


A Particle Filter Compensation Approach to Robust LVCSR

Duc Hoang Ha Nguyen*, Aleem Mushtaq§, Xiong Xiao, Eng Siong Chng*, Haizhou Li*†‡ and Chin-Hui Lee§

*Nanyang Technological University, †Nanyang Technological University, ‡A*STAR, §Georgia Institute of Technology

We extend our previous work on particle filter compensation (PFC) to large vocabulary continuous speech recognition (LVCSR) and conduct experiments on the Aurora-4 database. Obtaining an accurately aligned state and mixture sequence of hidden Markov models (HMMs) that describe the underlying clean speech features being estimated in noise is a challenging task for sub-word based LVCSR because the total number of triphone models involved can be very large. In this paper, we show that by using separate sets of HMMs for recognition and compensation, we can simplify the models used for PFC to a great extent and thus facilitate the estimation of the side information offered in the state and mixture sequences. When the missing side information for PFC is available, a large word error reduction of 28.46% from multi-condition training is observed. In actual scenarios, an error reduction of only 5.3% is obtained. We anticipate improved results that will narrow the gap between today's system and what is achievable if the side information could be exactly specified.


Context-Dependent Deep Neural Networks for Commercial Mandarin Speech Recognition Applications

Jianwei Niu*, Lei Xie*, Lei Jia and Na Hu

*Northwestern Polytechnical University, Baidu Inc.

Recently, context-dependent deep neural network hidden Markov models (CD-DNN-HMMs) have been successfully used in some commercial large-vocabulary English speech recognition systems. It has been shown that CD-DNN-HMMs significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs (CD-GMM-HMMs). In this paper, we report our latest progress on CD-DNN-HMMs for commercial Mandarin speech recognition applications at Baidu. Experiments demonstrate that CD-DNN-HMMs achieve a relative word error reduction of 26% and a relative sentence error reduction of 16% in Baidu’s short message (SMS) voice input and voice search applications, respectively, compared with state-of-the-art CD-GMM-HMMs trained using fMPE. To the best of our knowledge, this is the first time the performance of CD-DNN-HMMs has been reported for commercial Mandarin speech recognition applications. We also propose a GPU on-chip speed-up training approach which can achieve a speed-up ratio of nearly two for DNN training.



OS.41-IVM.17: Advanced Visual Media Data Generation, Editing and Transmission

Buffer-based Smooth Rate Adaptation for Dynamic HTTP Streaming

Chao Zhou, Chia-Wen Lin, Xinggong Zhang, Zongming Guo

*Peking University, National Tsing Hua University, Peking University, Beijing

Recently, Dynamic Adaptive Streaming over HTTP (DASH) has been widely deployed on the Internet. However, it is still a challenge to play back video smoothly with high quality over the time-varying Internet. In this paper, we propose a buffer-based rate adaptation scheme, which is able to smooth bandwidth variations and provide continuous video playback. Through analysis, we show that simply preventing buffer underflow/overflow in the greedy rate adaptation method may incur serious rate oscillations, which result in poor quality of experience for users. To improve on this, we present a novel control-theoretic approach to control buffering size and rate adaptation. We modify the buffered video time model by adding two thresholds, an overflow threshold and an underflow threshold, to filter the effect of short-term network bandwidth variations while keeping playback smooth. However, the modified rate adaptation system is nonlinear. By choosing the operating point properly, we linearize the rate control system. With a Proportional-Derivative (PD) controller, we are able to adapt the video rate with high responsiveness and stability. We carefully design the parameters for the PD controller. Moreover, we show that reserving a small positive/negative bandwidth margin can greatly decrease the chance of buffer underflow/overflow incurred by bandwidth prediction error. Finally, we demonstrate through real network traces that our proposed control-theoretic approach is highly efficient.


Packet-Layer Model for Quality Assessment of Encrypted Video in IPTV Services

Qian Zhang, Ning Liao, Fan Zhang, Zhibo Chen§ and Lin Ma

Xi'an University of Architecture and Technology, Technicolor Research and Innovation, Lenovo Cooperation Research, §Technicolor Research and Innovation, Huawei Noah's Ark Lab

In this paper, a packet-layer quality assessment model is proposed for predicting the subjective quality of encrypted video for IPTV services. The detected information from encrypted video-stream is parsed into frame layer, and a novel estimation of frame type and Group-Of-Picture (GOP) structures is proposed to assist the parameter extraction utilized in the model. An efficient loss-related parameter is developed to reveal the visible degradation by loss. The quality assessment model focuses on predicting the quality measurement caused by both coding and channel artifacts. The cross-validation results on numerous databases show that the proposed model is not only better than other compared ones in performance, but also more generalized and robust to various testing conditions.


Image Enlargement by Patch-Based Seam Synthesis

Qi Wang*, Zhengzhe Liu*, Chen Li* and Bin Sheng†*

*Shanghai Jiao Tong University, †Chinese Academy of Sciences

Numerous approaches can be utilized for image enlargement; among them, seam carving, texture synthesis, linear scaling and warping are commonly used. However, all of these methods have their disadvantages. In this paper, we propose a new image enlargement method inspired by seam carving and texture synthesis, called Patch-Based Seam Synthesis. Our algorithm fully utilizes the texture information of an image and thus is a content-based method. The procedure of this new method is as follows. Firstly, we use a set of Difference of Gaussian (DOG) and Difference of Offset Gaussian (DOOG) filters to extract the texture features of the image. Secondly, we use Histogram Shape-Based Image Thresholding to divide the image into texture regions and non-texture regions. Thirdly, we compute the energy map of the image based on the energy function and determine the minimal-energy seams, which are 8-connected paths crossing the whole image either vertically or horizontally. Finally, we use patch-based synthesis combined with the image quilting algorithm to fill in the parts of the seams that lie in texture regions, and linear interpolation to smooth the parts that lie in non-texture regions.


A Data-Driven Model for Anisotropic Heterogeneous Subsurface Scattering

Ying Song and Wencheng Wang

Institute of Software

We present a new BSSRDF representation for editing measured anisotropic heterogeneous translucent materials, such as veined marble, jade, and artificial stones with lighting-blocking discontinuities. Our work is inspired by the SubEdit representation introduced in [1]. Our main contribution is to improve the accuracy of the approximation while keeping it compact and efficient for editing. We decompose the local scattering profile into an isotropic term and an anisotropic term. The isotropic term encodes the scattering range and albedo property, and the anisotropic term encodes the spatially varying subsurface scattering shape profile. We propose a compact model for the scattering profile based on non-negative matrix factorization, which allows user-guided editing. Experimental results show that our model can capture more spatially anisotropic features than the previous work at a similar compression rate.



OS.42-SLA.13: Audio Signal Analysis, Processing and Classification (II)

A Vehicular Noise Surveillance System Integrated with Vehicle Type Classification

Chuang Shi1, Woon-Seng Gan2, Yong-Kim Chong3, Agha Apoorv4, and Kin-San Song5

1-4Nanyang Technological University, 5National Environment Agency

This paper introduces an ongoing project on the surveillance of noisy vehicles on the road. Noise pollution created by vehicles on urban roads is becoming more severe. To enforce current measures, we developed a vehicular noise surveillance system including a vehicle type classification method. Samples of vehicular noise were recorded on-site using this system. Harmonic features were extracted from each sample based on an average harmonic structure. The k-nearest neighbor (KNN) algorithm was applied and achieved classification accuracies for the passenger car, the van, the lorry, the bus, and the motorbike of 60.66%, 65.38%, 52.99%, 62.02%, and 80%, respectively. This study was motivated by the demand for monitoring noise levels generated by different types of vehicles. The classification method using audio features is independent of lighting conditions, thus providing an alternative to machine vision based techniques in vehicle type classification.


Quantitative Evaluation of Violin Solo Performance

Yiju Lin, Wei-Chen Chang and Alvin W.Y. Su

National Cheng-Kung University

Evaluation of musical instrument performances is usually subjective. It may be easier for keyboard instruments. For bowed-string instruments such as the violin, delicate articulation is required, which gives the sound considerable complexity and makes the evaluation more difficult. In this paper, a note separation algorithm based on spectral domain factorization is used to extract the notes from recordings of violin solo performances. Each note can then be quantitatively evaluated based on a set of metrics designed to capture various aspects of violin performance, including pitch accuracy, bowing steadiness, vibrato depth/rate, bowing intensity, tempo, and timbre characteristics. The tools should be useful in musical instrument performance education.


Adaptive Processing and Learning for Audio Source Separation

Jen-Tzung Chien, Hiroshi Sawada and Shoji Makino

National Chiao Tung University, NTT Corporation, University of Tsukuba

This paper gives an overview of a series of recent advances in adaptive processing and learning for audio source separation. In the real world, speech and audio signal mixtures are observed in reverberant environments. The sources usually outnumber the mixtures. The mixing condition occasionally changes because sources move, are replaced, or abruptly appear or disappear. In this survey article, we investigate different issues in audio source separation including overdetermined/underdetermined problems, permutation alignment, convolutive mixtures, contrast functions, nonstationary conditions and system robustness. We provide a systematic and comprehensive view of these issues and address new approaches to overdetermined/underdetermined convolutive separation, sparse learning, nonnegative matrix factorization, information-theoretic learning, online learning and Bayesian approaches.


A Two-stage Query by Singing/Humming System on GPU

Wei-Tsa Kao, Chung-Che Wang, Kaichun K. Chang, Jyh-Shing Roger Jang, and Wenshan Liou

National Tsing Hua University, King's College London, National Taiwan University, III

This paper proposes the use of a GPU (graphics processing unit) to implement a two-stage comparison method for a QBSH (query by singing/humming) system. The system can take a user's singing or humming and retrieve the top-10 most likely candidates from a database of 8431 songs. In order to speed up the comparison, we apply linear scaling in the first stage to select candidate songs from the database. These candidate songs are then re-ranked by dynamic time warping in the second stage to achieve better recognition accuracy. With the optimum setting, we can achieve a speedup factor of 7 (compared to dynamic time warping on GPU) and an accuracy of 77.65%.
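
The two-stage comparison can be sketched on the CPU as follows: linear scaling shortlists candidates cheaply, and dynamic time warping re-ranks the shortlist. This is only an illustrative sketch; the pitch sequences, scaling factors, shortlist size, absolute-difference cost, and all function names are assumptions made for this example, and the GPU parallelization described in the paper is not shown.

```python
def resample(seq, length):
    """Linearly interpolate seq to the requested length (length >= 2)."""
    out = []
    for i in range(length):
        pos = i * (len(seq) - 1) / (length - 1)
        lo = int(pos)
        hi = min(lo + 1, len(seq) - 1)
        frac = pos - lo
        out.append(seq[lo] * (1 - frac) + seq[hi] * frac)
    return out


def linear_scaling_distance(query, song, factors=(0.75, 1.0, 1.25)):
    """Best mean absolute pitch difference over several time-scaling factors."""
    best = float("inf")
    for factor in factors:
        length = max(2, round(len(query) * factor))
        if length > len(song):
            continue  # scaled query would run past the end of the song
        scaled = resample(query, length)
        dist = sum(abs(a - b) for a, b in zip(scaled, song)) / length
        best = min(best, dist)
    return best


def dtw_distance(a, b):
    """Classic dynamic time warping with an absolute-difference local cost."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            curr[j] = abs(x - y) + min(prev[j], prev[j - 1], curr[j - 1])
        prev = curr
    return prev[len(b)]


def two_stage_search(query, database, shortlist_size=2):
    """Stage 1: shortlist by linear scaling. Stage 2: re-rank by DTW."""
    shortlist = sorted(
        database,
        key=lambda name: linear_scaling_distance(query, database[name]),
    )[:shortlist_size]
    return sorted(shortlist, key=lambda name: dtw_distance(query, database[name]))


# Hypothetical pitch contours in semitones (MIDI note numbers).
database = {
    "song_a": [60, 62, 64, 65, 67, 67, 65, 64],
    "song_b": [60, 60, 67, 67, 69, 69, 67, 67],
    "song_c": [72, 71, 69, 67, 65, 64, 62, 60],
}
query = [60, 62, 64, 64, 65, 67, 67, 65]  # song_a sung with uneven timing

ranking = two_stage_search(query, database)
```

The design choice is that linear scaling costs time linear in the query length per candidate, while DTW is quadratic, so running DTW only on the shortlist keeps most of its accuracy at a fraction of the cost.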


The NUS Sung and Spoken Lyrics Corpus: A Quantitative Comparison of Singing and Speech

Zhiyan Duan, Haotian Fang, Bo Li, Khe Chai Sim and Ye Wang

National University of Singapore

Despite a long-standing effort to characterize various aspects of the singing voice and their relations to speech, the lack of a suitable and publicly available dataset has precluded any systematic study on the quantitative difference between singing and speech at the phone level. We hereby present the NUS Sung and Spoken Lyrics Corpus (NUS-48E corpus) as the first step toward a large, phonetically annotated corpus for singing voice research. The corpus is a 169-min collection of audio recordings of the sung and spoken lyrics of 48 (20 unique) English songs by 12 subjects and a complete set of transcriptions and duration annotations at the phone level for all recordings of sung lyrics, comprising 25,474 phone instances. Using the NUS-48E corpus, we conducted a preliminary, quantitative study on the comparison between singing voice and speech. The study includes duration analyses of the sung and spoken lyrics, with a primary focus on the behavior of consonants, and experiments aiming to gauge how acoustic representations of spoken and sung phonemes differ, as well as how duration and pitch variations may affect the Mel Frequency Cepstral Coefficients (MFCC) features.



OS.43-SLA.14: Audio and Acoustic Signal Processing

Audio Bandwidth Extension Based on Grey Model

Haichuan Bai, Changchun Bao, Xin Liu, Hongrui Li

Beijing University of Technology

An audio bandwidth extension method based on the Grey Model is proposed in this paper. The Grey Model is utilized for estimating the envelope of the high-frequency spectrum, according to the evolutionary tendency of the audio spectral envelope. In addition, nearest-neighbor matching is utilized to predict the fine spectrum of the high-frequency components. Finally, through envelope adjustment of the high-frequency spectrum, the bandwidth extension of audio signals from wideband to super-wideband can be implemented. Objective performance evaluation indicates that the proposed method can effectively reconstruct the truncated high-frequency components and outperforms the conventional method of audio bandwidth extension based on the Gaussian mixture model.
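
To illustrate how a grey model extrapolates a short sequence from its trend, the sketch below implements the common GM(1,1) form. The abstract does not state which grey model variant or which input sequence the paper uses, so the choice of GM(1,1), the function names, and the example series are all assumptions.

```python
import math


def gm11_fit(series):
    """Estimate the GM(1,1) development coefficient a and grey input b.

    The model is x0(k) + a * z(k) = b, where z(k) is the mean of consecutive
    values of the cumulative sum of the series. Fitted by least squares.
    """
    cumulative, total = [], 0.0
    for value in series:
        total += value
        cumulative.append(total)
    z = [0.5 * (cumulative[k] + cumulative[k - 1]) for k in range(1, len(series))]
    y = series[1:]
    m = len(z)
    sz, sy = sum(z), sum(y)
    szz = sum(v * v for v in z)
    szy = sum(v * w for v, w in zip(z, y))
    slope = (m * szy - sz * sy) / (m * szz - sz * sz)  # slope equals -a
    a = -slope
    b = (sy - slope * sz) / m
    return a, b


def gm11_predict(series, steps):
    """Extrapolate `steps` values beyond the series (requires a != 0)."""
    a, b = gm11_fit(series)
    n = len(series)

    def cumulative_hat(k):  # k is a zero-based index into the cumulative sum
        return (series[0] - b / a) * math.exp(-a * k) + b / a

    return [cumulative_hat(k) - cumulative_hat(k - 1) for k in range(n, n + steps)]


# Hypothetical envelope values that grow by 10% per band.
series = [10.0 * 1.1 ** k for k in range(5)]
a, b = gm11_fit(series)
next_value = gm11_predict(series, steps=1)[0]
```

For this smoothly growing series the extrapolated value lies within about 0.1% of the true continuation 10 × 1.1^5 ≈ 16.105, which illustrates why such a model suits slowly evolving envelopes.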


Local Partial Least Square Regression for Spectral Mapping in Voice Conversion

Xiaohai Tian*, Zhizheng Wu, Eng Siong Chng*

*Nanyang Technological University, Nanyang Technological University

The joint density Gaussian mixture model (JD-GMM) based method has been widely used in voice conversion tasks due to its flexible implementation. However, the statistical averaging effect in estimating the model parameters results in over-smoothing of the target spectral trajectories. Motivated by the local linear transformation method, which uses neighboring data rather than all the training data to estimate the transformation function for each feature vector, we propose a local partial least square method to avoid the over-smoothing problem of JD-GMM and the over-fitting problem of local linear transformation when training data are limited. We conducted experiments using the VOICES database and measured both the spectral distortion and the correlation coefficient of the spectral parameter trajectory. The experimental results show that our proposed method obtains better performance than the baseline methods.


Semi-Blind Algorithm for Joint Noise Suppression and Dereverberation Based on Higher-Order Statistics and Acoustic Model Likelihood

Fine Dwinita Aprilyanti*, Hiroshi Saruwatari*, Kiyohiro Shikano*, Satoshi Nakamura* and Tomoya Takatani

*Nara Institute of Science and Technology, Toyota Motor Corporation

In this paper, we propose an automatic optimization scheme for FD-BSE-based joint suppression of noise and late reverberation to improve the speech recognition accuracy of a spoken-dialogue system. First, we optimize the parameter of the conventional FD-BSE-based method using an assessment of musical noise measured by higher-order statistics and the acoustic model likelihood. Next, to maintain the optimum performance of the system, we propose a switching scheme using the distance information provided by an image sensor. The experimental results show that the proposed approach improves the word recognition accuracy.


A Study on Amplitude Variation of Bone Conducted Speech Compared to Air Conducted Speech

M. Shahidur Rahman and Tetsuya Shimamura

*Shahjalal University of Science and Technology, Saitama University

This paper investigates the amplitude variation of bone conducted (BC) speech compared to air conducted (AC) speech. During vocalization, vibrations travel through the vocal tract wall and skull bone, and can be captured by placing a bone-conductive microphone on the talker's head. The amplitude of this recorded BC speech is influenced by the mechanical properties of the bone conduction pathways. This influence is related to the vocal tract shape, which determines the resonances of the vocal tract filter. Referring to the vocal tract output as AC speech for simplicity, the amplitude variation of BC speech can be described with respect to the location of the formants of AC speech. In this paper, the amplitude variation of BC speech for Japanese vowels and long utterances has been investigated by exploiting the locations of the first two formants of AC speech. Our observations suggest that when the first formant is very low and the second formant is higher, the relative amplitude of BC speech is amplified. In contrast, a relatively higher first formant and lower second formant of AC speech reduce the relative BC amplitude.


Incorporating Global Variance in the Training Phase of GMM-based Voice Conversion

Hsin-Te Hwang1,3, Yu Tsao2, Hsin-Min Wang3, Yih-Ru Wang1, Sin-Horng Chen1

1National Chiao Tung University, 2Academia Sinica, 3Academia Sinica

Maximum likelihood-based trajectory mapping considering global variance (MLGV-based trajectory mapping) has been proposed for improving the quality of the converted speech of Gaussian mixture model-based voice conversion (GMM-based VC). Although the quality of the converted speech is significantly improved, the computational cost of the online conversion process is also increased because there is no closed-form solution for parameter generation in MLGV-based trajectory mapping, and an iterative process is generally required. To reduce the online computational cost, we propose to incorporate GV in the training phase of GMM-based VC. Then, the conversion process can simply adopt ML-based trajectory mapping (without considering GV in the conversion phase), which has a closed-form solution. In this way, it is expected that the quality of the converted speech can be improved without increasing the online computational cost. Our experimental results demonstrate that the proposed method yields a significant improvement in the quality of the converted speech compared to the conventional GMM-based VC method. Meanwhile, compared to MLGV-based trajectory mapping, the proposed method provides comparable converted speech quality with reduced computational cost in the conversion process.



OS.44-SLA.15: Speech and Audio Coding and Synthesis

SPIDER: A Continuous Speech Light Decoder

Abdelaziz A. Abdelhamid, Waleed H. Abdulla, and Bruce A. MacDonald

Auckland University

In this paper, we propose a speech decoder, called SPeech lIght decoDER (SPIDER), for extracting the best decoding hypothesis from a search space constructed using weighted finite-state transducers. Although many speech decoders exist, they are quite complicated because they address many design goals, such as extraction of N-best decoding hypotheses and generation of lattices. This makes it difficult to learn these decoders and to test new ideas in speech recognition, which often require decoder modification. Therefore, we propose in this paper a simple decoder supporting the primitive functions required for achieving real-time speech recognition with state-of-the-art recognition performance. This decoder can be viewed as a seed for further improvements and the addition of new functionalities. Experimental results show that the performance of the proposed decoder is quite promising when compared with two other speech decoders, namely HDecode and Sphinx3.


Optimization on Decoding Graphs Using Soft Margin Estimation

Abdelaziz A. Abdelhamid and Waleed H. Abdulla

Auckland University

This paper proposes a discriminative learning algorithm for improving the accuracy of continuous speech recognition systems by optimizing the language model parameters on decoding graphs. The proposed algorithm employs soft margin estimation (SME) to build an objective function for maximizing the margin between the correct transcriptions and the corresponding competing hypotheses. To this end, we adapted a discriminative training procedure based on SME, which was originally devised for optimizing acoustic models, to the different case of optimizing the parameters of language models on a decoding graph constructed using weighted finite-state transducers. Experimental results show that the proposed algorithm outperforms a baseline system based on maximum likelihood estimation and achieves a 15.11% relative reduction in word error rate when tested on the Resource Management (RM1) database.


Joint Discriminative Learning of Acoustic and Language Models on Decoding Graphs

Abdelaziz A. Abdelhamid and Waleed H. Abdulla

Auckland University

In traditional models of speech recognition, acoustic and language models are treated independently and usually estimated separately, which may yield suboptimal recognition performance. In this paper, we propose a joint optimization framework for learning the parameters of acoustic and language models using the minimum classification error criterion. The joint optimization is performed in terms of a decoding graph constructed using weighted finite-state transducers based on context-dependent hidden Markov models and tri-gram language models. To demonstrate the effectiveness of the proposed framework, two speech corpora, TIMIT and Resource Management (RM1), are used in the experiments. The preliminary experiments show that the proposed approach can achieve a significant reduction in phone, word and sentence error rates on both TIMIT and RM1 when compared with conventional parameter estimation approaches.


Realizing Tibetan Speech Synthesis by Speaker Adaptive Training

Hong-wu YANG*, Keiichiro OURA, Zhen-ye GAN* and Keiichi TOKUDA

*Northwest Normal University, Nagoya Institute of Technology

This paper presents a method to realize HMM-based Tibetan speech synthesis using a Mandarin speech synthesis framework. A Mandarin context-dependent label format is adopted to label Tibetan sentences. A Mandarin question set is also extended for Tibetan by adding language-specific questions. A Mandarin speech synthesis framework is utilized to train an average mixed-lingual model from a large Mandarin multispeaker-based corpus and a small Tibetan one-speaker-based corpus using speaker adaptive training. Then the speaker adaptation transformation is applied to the average mixed-lingual model to obtain a speaker-adapted Tibetan model. Experimental results show that this method outperforms the method using a speaker-dependent Tibetan model when only a small number of Tibetan training utterances is available. When the number of Tibetan training utterances is increased, the performances of the two methods tend to be the same.



OS.45-SLA.16: Emotion Analysis and Recognition

Emotional Adaptive Training for Speaker Verification

Fanhu Bie*, Dong Wang*, Thomas Fang Zheng*, Javier Tejedor† and Ruxin Chen‡

*Tsinghua University, †Universidad Autónoma de Madrid, ‡Sony Computer Entertainment America

Speaker verification suffers from significant performance degradation with emotion variation. In a previous study, we demonstrated that an adaptation approach based on MLLR/CMLLR can provide a significant performance improvement for verification on emotional speech. This paper follows this direction and presents an emotional adaptive training (EAT) approach. This approach iteratively estimates the emotion-dependent CMLLR transformations and re-trains the speaker models with the transformed speech, and can therefore make use of emotional enrollment speech to train a stronger speaker model. This is similar to speaker adaptive training (SAT) in speech recognition. The experiments are conducted on an emotional speech database which involves speech recordings of 30 speakers in 5 emotions. The results demonstrate that the EAT approach provides significant performance improvements over the baseline system, where the neutral enrollment data are used to train the speaker models and the emotional test utterances are verified directly. The EAT also significantly outperforms two other emotion-adaptation approaches: (1) the CMLLR-based approach, where the speaker models are trained with the neutral enrollment speech and the emotional test utterances are transformed by CMLLR in verification; (2) the MAP-based approach, where the emotional enrollment data are used to train emotion-dependent speaker models and the emotional utterances are verified based on the emotion-matched models.


Cross-lingual Speech Emotion Recognition System Based on a Three-Layer Model for Human Perception

Reda Elbarougy 1,2 and Masato Akagi 1

1Japan Advanced Institute of Science and Technology (JAIST), 2Damietta University

The purpose of this study is to investigate whether the emotion dimensions valence, activation, and dominance can be estimated cross-lingually. Most previous studies on automatic speech emotion recognition detected the emotional state within a single language. However, in order to develop a generalized emotion recognition system, the performance of these systems must be analyzed in mono-language as well as cross-language settings. The ultimate goal of this study is to build a bilingual emotion recognition system that can estimate emotion dimensions for one language using a system trained on another language. In this study, we first propose a novel acoustic feature selection method based on a human perception model. The proposed model consists of three layers: emotion dimensions in the top layer, semantic primitives in the middle layer, and acoustic features in the bottom layer. The experimental results reveal that the proposed method is effective for selecting acoustic features representing emotion dimensions, working with two different databases, one in Japanese and the other in German. Finally, the common acoustic features between the two databases are used as the input to the cross-lingual emotion recognition system. Moreover, the proposed cross-lingual system based on the three-layer model performs just as well as the two separate mono-lingual systems for estimating emotion dimension values.


Emotion Recognition Method Based on Normalization of Prosodic Features

Motoyuki Suzuki*, Shohei Nakagawa† and Kenji Kita†

*Osaka Institute of Technology, †The University of Tokushima

Emotion recognition from speech signals is one of the most important technologies for natural conversation between humans and robots. Most emotion recognizers extract prosodic features from input speech for use in emotion recognition. However, prosodic features change drastically depending on the uttered text. In order to solve this problem, we have proposed a normalization method for prosodic features that uses synthesized speech, which has the same word sequence but is uttered with a "neutral" emotion. In this method, all prosodic features (pitch, power, etc.) are normalized. However, it is not known which kinds of prosodic features should be normalized. In this paper, all combinations of with/without normalization were examined, and the most appropriate normalization method was found. When both "RMS Energy" (root mean square frame energy) and "VoiceProb" (power of harmonics divided by the total power) were normalized, emotion recognition accuracy became 5.98% higher than the recognition accuracy without normalization.


Emotional Intonation Modeling: a Cross-Language Study on Chinese and Japanese

Ai-Jun Li*, Yuan Jia*, Qiang Fang* and Jian-Wu Dang

*Chinese Academy of Social Sciences, Tianjin University

This study attempts to apply the PENTA model to simulate the emotional intonations of two typologically distinct languages, the tone language of Mandarin Chinese and the pitch accent language of Japanese. First, the overall F0 features of the emotional intonations of 4 speakers were analyzed and contrasted across seven emotions and across the two languages. Then the performances of the qTA model for simulating each language were numerically evaluated and compared within and across the two languages. The results showed that F0 features have bigger distinctions across the two languages than within them. The qTA model can efficiently encode emotional or pragmatic information for both Chinese and Japanese.



OS.46-SPS.5: Design and Implementation for High Performance & Complexity-Efficient Signal Processing Systems

Joint-Denoise-and-Forward Protocol for Multi-Way Relay Networks

Ronald Y. Chang, Sian-Jheng Lin, and Wei-Ho Chung

Academia Sinica

This paper considers a multi-way relay network in which multiple users intend to achieve full information exchange with one another with the aid of a single relay. The denoise-and-forward (DNF) protocol with binary physical-layer network coding (PNC) is considered. A novel denoising scheme is proposed which does not require any modification of the communication mechanism of the original DNF protocol but only exploits the correlation among multiple received signals at the relay. The decision regions and denoise mapping for the proposed scheme and the original scheme are illustrated and compared for an exemplary three-way relay network. Simulation results verify the efficacy of the proposed scheme in achieving improved user decoding performance.


A VLSI Design of an Arrayed Pipelined Tomlinson-Harashima Precoder for MU-MIMO Systems

Kosuke Shimazaki*, Shingo Yoshizawa†, Yasuyuki Hatakawa‡, Tomoko Matsumoto‡, Satoshi Konishi‡ and Yoshikazu Miyanaga*

*Hokkaido University, †Kitami Institute of Technology, ‡KDDI R&D Laboratories Inc

This paper presents a VLSI design of a Tomlinson-Harashima (TH) precoder for multi-user MIMO (MU-MIMO) systems. The TH precoder consists of LQ decomposition (LQD), interference cancellation (IC), and weight coefficient multiplication (WCM) units. The LQ decomposition unit is based on an application-specific instruction-set processor (ASIP) architecture with floating-point arithmetic for high accuracy operations. As for the IC and WCM units with fixed-point arithmetic, the proposed architecture keeps calculation accuracy and gives shorter pipeline latency and smaller circuit size by employing an arrayed structure. The implementation result shows that the proposed architecture reduces circuit area and power consumption by 11% and 15%, respectively.


An Ultra Low-Power and Area-Efficient Baseband Processor for WBAN Transmitter

Mengyuan Chen, Jun Han*, Dabin Fang, Yao Zou and Xiaoyang Zeng

Fudan University

The IEEE 802.15.6-2012 standard, optimized for short-range, low-power WBAN applications, has been approved recently. Based on the standard, this paper proposes the hardware implementation of the baseband transmitter for the first time. In our design, the physical layer (PHY) employs the narrowband (NB) PHY in the standard, and the DPSK modulator is optimized by employing CSD-coded (canonical signed digit) filters for low power and area efficiency. Clock gating is also implemented to cut the dynamic power in the idle state. Implemented in 130 nm CMOS technology, the least total power dissipation of the transmitter is only 69.5 µW at 151.8 kbps and 1.0 V supply in the Medical Implant Communication Service (MICS) band. In addition, the power consumption of the PHY module under different frequency bands and different data rates is investigated. The minimum energy-per-bit is only 10.1 pJ/bit at 971.4 kbps, indicating that our PHY module is more energy-efficient than previous works.


Compressed Sensing Recovery Algorithms and VLSI Implementation

Kuan-Ting Lin, Kai-Jiun Yang, Pu-Hsuan Lin and Shang-Ho Tsai

National Chiao Tung University

This paper proposes two recovery algorithms modified from subspace pursuit (SP) for compressed sensing problems. These algorithms can reduce the complexity of SP and maintain a high recovery rate. Complexity analysis and simulation results are provided to demonstrate the improvements. Additionally, this work has implemented the VLSI circuit APR of the proposed algorithm using the TSMC 90 nm process. The target clock frequency is 100 MHz, and the corresponding APR dimension is 11.69 mm². Based on the post-layout simulation, the average power consumption is 431 mW.


Classification of Video Resolution for the Enhanced Display of Images on HDTV

Jewoong Ryu*, Gibak Kim†, Sang Hwa Lee*, Byungseok Min‡ and Nam Ik Cho*

*Seoul Nat’l University, †Soongsil University, ‡Samsung Electronics Co. Ltd.

Although video sources for HDTV broadcasting are mainly in HD resolution, some of them still come from low resolution video sources such as videos that were taken a long time ago, internet-streamed videos, or cellphone videos. When these sources are displayed on HDTV, they usually appear blurry and their color is not vivid. To alleviate these problems, a proper video enhancement for each source is necessary to display them on HDTV satisfactorily. However, since the decoded images on HDTV in many cases do not contain information on the origin of the sources, they need to be classified according to whether or not they originate from an HD source. For this, we propose an HD/non-HD classifier based on the Support Vector Machine (SVM) with the frequency information of the selected region in the decoded image. To evaluate the performance of the proposed HD/non-HD classifier, we use a test database of 6252 HD and 6934 SD still images captured from various TV genres. The experimental results show that the proposed classifier yields a high accuracy rate of 99.61%.



OS.47-BioSiPS.3: Frontiers of BioSiPS in Daily Life

Development of a Wearable HRV Telemetry System to be Operated by Non-Experts in Daily Life

Toshitaka Yamakawa1, 2, 3, Koichi Fujiwara4, Manabu Kano4, Miho Miyajima5, Yoko Suzuki5, Taketoshi Maehara5, Katsuya Ohta5, Tetsuo Sasano5, Masato Matsuura5, and Eisuke Matsushima5

1Shizuoka University, 2Shizuoka University, 3Fuzzy Logic Systems Institute, 4Kyoto University, 5Tokyo Medical and Dental University

A telemetry system for the measurement of heart rate variability (HRV) with automatic gain control has been developed with a low-cost manufacturing process and a low-power-consumption design. The proposed automatic gain control technique provided highly reliable RR interval (RRI) detection for subjects of different ages, and enabled the subjects to use the system without any expert knowledge of electrocardiogram (ECG) measurement. All the components and functions for the RRI measurement were implemented on a wearable telemeter which can operate for up to 440 h with a CR2032 coin battery, and the wirelessly transmitted RRI data are stored on a PC by a receiver via a USB connection. Errors in RRI detection occurred with a probability of less than 2% in subjects of five different ages. In a long-term measurement of a young subject that extended over 48 h, the results showed a 0.752% probability of recurring errors. The obtained results suggest that the proposed system enables the long-term monitoring of HRV for both clinical care and healthcare operated by a non-expert.


Evolutionary Programming based Recommendation System for Online Shopping

Jehan Jung*, Yuka Matsuba, Rammohan Mallipeddi*, Hiroyuki Funaya, Kazushi Ikeda and Minho Lee*

*Kyungpook National University, Nara Institute of Science and Technology

In this paper, we propose an interactive evolutionary programming based recommendation system for online shopping that estimates human preference based on eye movement analysis. Given a set of images of different clothes, the eye movement patterns of human subjects while looking at clothes they like differ from those for clothes they do not like. Therefore, in the proposed system, human preference is measured from the way the human subjects look at the images of different clothes. In other words, human preference can be measured from the fixation count and the fixation length obtained with an eye tracking system. Based on the level of human preference, the evolutionary programming suggests new clothes that are close to the human preference through operations such as selection and mutation. The proposed recommendation system is tested with several human subjects and the experimental results are presented.


Heart Rate Variability Features for Epilepsy Seizure Prediction

Hirotsugu Hashimoto*, Koichi Fujiwara*, Yoko Suzuki†, Miho Miyajima†, Toshitaka Yamakawa‡, Manabu Kano*, Taketoshi Maehara†, Katsuya Ohta†, Tetsuo Sasano†, Masato Matsuura† and Eisuke Matsushima†

*Kyoto University, †Tokyo Medical and Dental University, ‡Shizuoka University

Although refractory epileptic patients suffer from uncontrolled seizures, their quality of life (QoL) may be improved if an epileptic seizure can be predicted in advance. In the preictal period, the excessive neuronal activity of epilepsy affects the autonomic nervous system. Since the fluctuation of the R-R interval (RRI) of an electrocardiogram (ECG), called heart rate variability (HRV), reflects the autonomic nervous function, an epileptic seizure may be predicted by monitoring the HRV data of an epileptic patient. In the present work, preictal and interictal HRV data of epileptic patients were analyzed for developing an epilepsy seizure prediction system. The HRV data of five patients were collected, and their HRV features were calculated. The analysis results showed that frequency HRV features, such as LF and LF/HF, changed at least one minute before seizure onset in all seizure episodes. The possibility of realizing an HRV-based seizure prediction system was shown through these analyses.


Visual Fixation Patterns of Artists and Novices in Abstract Painting Observations

Naoko Koide, Takatomi Kubo, Tomohiro Shibata and Kazushi Ikeda

Nara Institute of Science and Technology

Artists can evaluate abstract paintings in a specific way, while art novices do so with difficulty. This difference is shown in eye fixation patterns, although the cause is not clear. To explain the difference in fixation patterns, we used a saliency map of paintings, which predicts fixations well without prior knowledge such as a meaningful target. If artists have a deep knowledge of art, they should attend less to the salient features than novices when observing abstract paintings. On the other hand, the fixation patterns possibly vary across visual tasks. Therefore we examined the effect of salient features in two tasks: free viewing and preference judgment. To evaluate this quantitatively, the correlation coefficient (CC) between the fixation distribution and a saliency map was used. The CCs were compared between artists and novices for each task. We found that the CCs of artists were lower than those of novices in the free viewing task, but not in the preference judgment task. This implies that the artists' knowledge of observing paintings appeared only in free viewing.
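
As an illustration of the measure named in this abstract, the sketch below computes a Pearson correlation coefficient between two equally sized 2-D maps. It assumes the CC is the standard Pearson coefficient taken over all pixels; the abstract does not state how the fixation distribution or the saliency map were built, so the function name and the tiny example maps are hypothetical.

```python
import math


def correlation_coefficient(fixation_map, saliency_map):
    """Pearson correlation between two 2-D maps of identical shape.

    Both arguments are lists of rows of numbers. Assumption: the CC in the
    abstract is the ordinary Pearson coefficient over all pixels.
    """
    xs = [value for row in fixation_map for value in row]
    ys = [value for row in saliency_map for value in row]
    if len(xs) != len(ys):
        raise ValueError("maps must contain the same number of pixels")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant map has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical 2x2 maps: fixations that follow saliency vs. ones that avoid it.
saliency = [[0.0, 2.0], [4.0, 6.0]]
follows_saliency = [[0.0, 1.0], [2.0, 3.0]]
avoids_saliency = [[3.0, 2.0], [1.0, 0.0]]

print(correlation_coefficient(follows_saliency, saliency))
print(correlation_coefficient(avoids_saliency, saliency))
```

A lower CC means the fixations are less tied to the salient regions, which is the direction of the effect reported for artists in free viewing.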


A Comparative Study of Time Series Modeling for Driving Behavior Towards Prediction

Ryunosuke Hamada, Takatomi Kubo, Kazushi Ikeda, Zujie Zhang, Takashi Bando and Masumi Egawa

Nara Institute of Science and Technology, DENSO CORPORATION

Prediction of driving behaviors is an important problem in developing a next-generation driving support system. In order to take diverse driving situations into account, it is necessary to model multiple driving operation time series data. In this study we modeled multiple driving operation time series with four modeling methods including the beta process autoregressive hidden Markov model (BP-AR-HMM), which we used in our previous study. We quantitatively compared the modeling methods with respect to prediction accuracy, and concluded that BP-AR-HMM outperformed the other modeling methods in modeling multiple driving operation time series and predicting unknown driving operations. The result suggests that BP-AR-HMM estimated the behaviors of a driver and the transition probabilities between the behaviors more successfully than the other methods, because BP-AR-HMM can deal with commonalities and differences among multiple time series, but the others cannot. Therefore BP-AR-HMM may help us to predict driver behaviors in real environments and to develop the next-generation driving support system.



OS.48-SIPTM.5: Recent Advances in Multirate Processing and Transforms

Lossless Transform with Functionality of Thumbnail Previewing

Masahiro Iwahashi*, Suvit Poomrittigul* and Hitoshi Kiya

*Nagaoka University of Technology, Tokyo Metropolitan University

This report proposes a new functionality of lossless coding of image signals. The proposed method provides 'thumbnail previewing' of the original image from a part of the bit-stream without expanding all the compressed data. It also avoids degradation of resolution in pixel density of the thumbnail image. It is composed of a new lossless color transform and an existing lossless wavelet transform. We add a free parameter to the color transform and utilize it to control a scaling parameter of the luminance component. As a result, it became possible to preview the 'thumbnail' luminance image from a part of the bit-stream. Due to the free parameter, the quality and data volume of the 'thumbnail' can be controlled according to a user's request. Unlike the existing two-layer lossless/lossy coding, the proposed method achieves good performance in lossless coding of the original image signal.


A Two-Dimensional Non-Separable Implementation of Dyadic-Valued Cosine-Sine Modulated Filter Banks for Low Computational Complexity

Seisuke Kyochi*, Taizo Suzuki and Yuichi Tanaka

*The University of Kitakyushu, University of Tsukuba, Tokyo University of Agriculture and Technology

This paper aims to design a new two-dimensional (2D) non-separable implementation of dyadic-valued cosine-sine modulated filter banks (2D D-CSMFBs) for low computational complexity. CSMFBs satisfy rich directional selectivity (DS) and shift-invariance (SI), and they can be easily designed by the modulation of a prototype filter. In addition, our previous work introduced one-dimensional (1D) D-CSMFBs. By restricting real-valued filter coefficients to rational-valued ones, they can save computational cost while keeping DS and SI. The 2D implementation proposed in this paper can further reduce the computational cost by directly unifying the conventional 2D separable structure into a 2D non-separable one. Furthermore, experimental results of image non-linear approximation show that the proposed 2D D-CSMFBs are comparable to or even better than the 1D D-CSMFBs.


Integer Multichannel Transform

Jian-Jiun Ding* and Po-Hong Wu

*National Taiwan University, National Taiwan University

In this paper, we develop an algorithm that can convert any reversible multichannel system into a reversible integer multichannel transform. Here, an integer transform means an operation whose inputs and outputs are all sums of powers of two. Recently, the triangular matrix scheme was shown to be able to convert any nonsingular discrete transform into a reversible discrete integer transform. Since a multichannel system, like other discrete transforms, can also be expressed in matrix form, we suggest that the triangular matrix scheme can also be applied to convert a multichannel system into a reversible integer transform. The proposed methods are useful for multirate signal processing, wavelet analysis, communication, and image processing.
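
The general idea behind triangular-matrix (lifting) factorizations can be sketched with a 2x2 rotation: the rotation matrix is written as a product of three unit triangular matrices, and each triangular step is rounded, so the map sends integers to integers and can be undone exactly. This is a textbook illustration under that assumption, not the authors' algorithm; the function names are hypothetical and the angle must satisfy sin(theta) != 0.

```python
import math


def _round(value):
    """Round to the nearest integer (halves up); any fixed rule keeps reversibility."""
    return math.floor(value + 0.5)


def _lifting_coefficients(theta):
    # Rotation = [[1, p], [0, 1]] * [[1, 0], [u, 1]] * [[1, p], [0, 1]]
    p = (math.cos(theta) - 1.0) / math.sin(theta)
    u = math.sin(theta)
    return p, u


def forward_integer_rotation(x0, x1, theta):
    """Integer-to-integer approximation of a rotation by theta."""
    p, u = _lifting_coefficients(theta)
    x0 = x0 + _round(p * x1)  # first upper-triangular step
    x1 = x1 + _round(u * x0)  # lower-triangular step
    x0 = x0 + _round(p * x1)  # second upper-triangular step
    return x0, x1


def inverse_integer_rotation(y0, y1, theta):
    """Exact inverse: undo the three steps in reverse order."""
    p, u = _lifting_coefficients(theta)
    y0 = y0 - _round(p * y1)
    y1 = y1 - _round(u * y0)
    y0 = y0 - _round(p * y1)
    return y0, y1


theta = math.pi / 4
y = forward_integer_rotation(10, 3, theta)
print(y, inverse_integer_rotation(y[0], y[1], theta))
```

Each step changes only one channel by a rounded function of the other, so subtracting the same rounded value recovers the input exactly regardless of the rounding error.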


Boundary Operation of 2-D Non-separable Oversampled Lapped Transforms

Kosuke FURUYA, Shintaro HARA and Shogo MURAMATSU

Niigata Univ., Niigata Univ.

This paper proposes a boundary operation technique for 2-D non-separable oversampled lapped transforms (NSOLT). The proposed technique is based on a lattice structure consisting of the 2-D separable block discrete cosine transform (DCT) and non-separable redundant support-extension processes. The atoms are allowed to be anisotropic with the oversampled, overlapping, symmetric, real-valued, and compact-support property. First, the blockwise implementation is developed so that the atoms can be locally controlled. The local control of atoms is shown to maintain perfect reconstruction. This property leads to an atom termination (AT) technique as a boundary operation. The technique overcomes the drawback of NSOLT that the popular symmetric extension method is invalid. Through some experimental results, the significance of AT is verified.


Critically Sampled Graph-based Wavelet Transforms for Image Coding

Sunil K. Narang, * Yung-Hsuan Chao, and Antonio Ortega

University of Southern California

In this paper, we propose a new approach for image compression using graph-based biorthogonal wavelet filterbanks (referred to as graphBior filterbanks). These filterbanks, proposed in our previous work, operate on the graph representations of images, which are formed by linking nearby pixels with each other. The connectivity and the link weights are chosen so as to reflect the geometrical structure of the image. The filtering operations on these edge-aware image graphs avoid filtering across the image discontinuities, thus resulting in a significant reduction in the amount of energy in the high frequency bands. This reduces the bit-rate requirements for the wavelet coefficients, but at the cost of sending extra edge-information bits to the decoder. We discuss efficient ways of representing and encoding this edge information. Our experimental results, based on the SPIHT codec, demonstrate that the proposed approach achieves better R-D performance than the standard CDF9/7 filter on piecewise smooth images such as depth maps.

